All Benchmark Results
These benchmarks measure single-user LLM inference speed across different combinations of hardware, models, and inference applications.
Each workload cell shows three numbers, in this order:
- Time: Total time in seconds to run the workload, from sending the API request to receiving the last generated token. Lower is better.
- Prompt processing (PP): Speed in tokens per second to process the prompt part of the workload. Higher is better.
- Token generation (TG): Speed in tokens per second to generate the output, which is always 500 tokens long. Higher is better.
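The three numbers in a cell are related. The total time is roughly the prompt length divided by the prompt processing speed, plus 500 divided by the token generation speed. Here is a minimal sketch of that relation. It assumes an "NK" workload means a prompt of exactly N × 1,000 tokens, and the function name `estimate_total_time` is made up for illustration. Neither is taken from the benchmark's own method.

```python
# Sketch: approximate a cell's total time from its two speeds.
# Assumptions (not stated by the benchmark): an "NK" workload means a prompt of
# exactly N * 1,000 tokens, and total time = prompt time + generation time.

GEN_TOKENS = 500  # the generation length is always 500 tokens

def estimate_total_time(prompt_tokens: int, pp_tps: float, tg_tps: float) -> float:
    """Approximate total time in seconds for one workload cell."""
    prompt_time = prompt_tokens / pp_tps     # seconds spent processing the prompt
    generation_time = GEN_TOKENS / tg_tps    # seconds spent generating 500 tokens
    return prompt_time + generation_time

# Nvidia RTX 4080, gpt-oss-20b, MXFP4, llama.cpp, GPU only (CUDA)
print(f"{estimate_total_time(1_000, 5197, 186):.1f}")   # 2.9  (table cell: 2.9 5197 186)
print(f"{estimate_total_time(64_000, 5648, 130):.1f}")  # 15.2 (table cell: 15.2 5648 130)
```

The displayed speeds are themselves rounded, so some cells will not match exactly. For example, the Mi50 devstral UD-Q4-K-XL 1K cell "20.8 577 26.3" gives 20.7 with this sketch. The sketch's assumptions could also be slightly off.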
You may also want to read the Method page.
You can click on a row to open the detail page that shows the launch command and all measurements that were used to calculate these numbers.
Similar total times, different speeds: On very fast setups running short workloads, you might see the same total time (e.g., 4.3 seconds) for clearly different PP/TG speeds. This may look like an error, but the measurements typically show very short PP times for both runs, with averages differing by only hundredths of a second. Because total times are shown with limited precision, such small differences are rounded away.
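The rounding effect can be reproduced with made-up numbers. In this sketch, two hypothetical runs on a 1,000-token prompt differ by a factor of two in prompt processing speed. Both still display as 4.3 seconds. None of these values come from the table, and the helper `total_time` is illustrative only.

```python
# Hypothetical example: two runs with clearly different PP speeds end up
# with the same total time once it is rounded to one decimal place.
# All numbers are invented for illustration; they are not table values.

GEN_TOKENS = 500
PROMPT_TOKENS = 1_000  # assumed prompt size of a "1K" workload

def total_time(pp_tps: float, tg_tps: float) -> float:
    """Prompt time plus generation time, in seconds."""
    return PROMPT_TOKENS / pp_tps + GEN_TOKENS / tg_tps

run_a = total_time(pp_tps=10_000, tg_tps=120)  # 0.100 s PP + 4.167 s TG
run_b = total_time(pp_tps=5_000, tg_tps=122)   # 0.200 s PP + 4.098 s TG

print(f"{run_a:.3f} vs {run_b:.3f}")  # 4.267 vs 4.298
print(f"{run_a:.1f} vs {run_b:.1f}")  # 4.3 vs 4.3
```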
PP speed increasing with prompt length: The measured prompt processing time includes a fixed overhead for sending the request to the endpoint, internal processing, and generating and returning the first token. The longer the actual prompt processing takes, the smaller this overhead is in proportion, so the reported PP speed can rise with prompt length.
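The sketch below models the effect. It assumes a constant "raw" prompt processing rate and a fixed per-request overhead. Both values are hypothetical, because the benchmark does not report them separately. Real backends are not this simple. In the table, many rows show PP speed peaking at a mid-length prompt and then falling at 32K and 64K, so raw throughput evidently is not constant.

```python
# Sketch of the overhead effect, assuming a constant "raw" prompt-processing
# rate plus a fixed per-request overhead. Both values are hypothetical; the
# benchmark does not report them separately.

RAW_PP_TPS = 8_000.0  # hypothetical true prompt-processing rate (tokens/s)
OVERHEAD_S = 0.1      # hypothetical fixed overhead per request (seconds)

def measured_pp_speed(prompt_tokens: int) -> float:
    """PP speed as it would be reported if the overhead is included in the timing."""
    measured_time = prompt_tokens / RAW_PP_TPS + OVERHEAD_S
    return prompt_tokens / measured_time

# Rises from about 4444 t/s at 1K toward the raw 8000 t/s as the prompt grows.
for tokens in (1_000, 4_000, 16_000, 64_000):
    print(f"{tokens:>6} tokens: {measured_pp_speed(tokens):7.0f} t/s")
```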
If you still think you found inconsistent numbers, please tell me.
I put a lot of effort into the benchmark automation to make sure the launch configs are reasonably optimized and the results are reported exactly as they were measured. Still, don't decide for or against hardware, models, or apps based solely on these benchmarks; research other sources as well.
Columns are grouped as Hardware (GPU, CPU), Model (LLM, Quant), Inference (App, Option), and Workloads (1K to 64K). The workload label describes the prompt length in tokens; the generation length is always 500 tokens.

| GPU | CPU | LLM | Quant | App | Option | 1K | 4K | 8K | 16K | 32K | 64K |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AMD Radeon Mi50 | — | devstral-small-2-24b-instruct-2512 | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 20.8 577 26.3 | 31.8 329 25.5 | 53.8 246 23.6 | 83.4 259 23.3 | 164 229 20.9 | 385 181 17.4 |
| AMD Radeon Mi50 (2x) | — | devstral-small-2-24b-instruct-2512 | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 22.0 572 24.6 | 29.0 499 23.9 | 43.1 393 22.1 | 63.8 392 21.9 | 108 387 19.9 | 230 321 16.5 |
| AMD Radeon Mi50 | — | devstral-small-2-24b-instruct-2512 | Q8_0 | llama.cpp | GPU only (ROCm) | 23.8 358 23.7 | 42.1 196 23.2 | 74.9 155 21.6 | 124 160 21.3 | 243 148 19.3 | OOM |
| AMD Radeon Mi50 (2x) | — | devstral-small-2-24b-instruct-2512 | Q8_0 | llama.cpp | GPU only (ROCm) | 25.4 342 22.3 | 36.8 288 21.9 | 55.6 259 20.4 | 82.3 279 20.2 | 149 265 18.3 | 308 234 15.4 |
| Nvidia RTX 4080 | Intel Core i7-13700K | devstral-small-2-24b-instruct-2512 | UD-Q4-K-XL | llama.cpp | GPU (CUDA) & CPU (25% Offload) | 33.3 2312 15.2 | 39.1 2276 13.4 | 51.6 1935 10.5 | 76.2 1992 7.3 | OOM | — |
| AMD Radeon Mi50 | AMD EPYC 7F52 | devstral-small-2-24b-instruct-2512 | UD-Q4-K-XL | llama.cpp | GPU (ROCm) & CPU (25% Offload) | 35.0 501 15.2 | 52.4 317 12.6 | 83.3 238 10.1 | 132 246 7.4 | 248 221 4.9 | 554 177 2.6 |
| AMD Radeon 8060S (AI Max+ 395) | — | devstral-small-2-24b-instruct-2512 | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 35.3 850 14.6 | 44.3 451 14.2 | 66.4 298 12.7 | 90.4 318 12.6 | 178 241 11.0 | 498 146 8.9 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | devstral-small-2-24b-instruct-2512 | Q8_0 | llama.cpp | GPU (ROCm) & CPU (25% Offload) | 47.6 311 11.3 | 72.6 187 9.8 | 115 149 8.1 | 186 150 6.3 | 340 142 4.4 | 729 122 2.5 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | devstral-small-2-24b-instruct-2512 | UD-Q4-K-XL | llama.cpp | GPU (ROCm) & CPU (75% Offload) | 57.9 413 9.0 | 86.3 309 6.8 | 133 233 5.1 | 215 238 3.4 | 393 215 2.0 | 858 171 1.0 |
| AMD Radeon 8060S (AI Max+ 395) | — | devstral-small-2-24b-instruct-2512 | Q8_0 | llama.cpp | GPU only (ROCm) | 58.6 920 8.7 | 67.5 454 8.5 | 89.1 298 8.1 | 118 291 7.9 | 188 270 7.3 | 409 195 6.3 |
| Nvidia RTX 4080 | Intel Core i7-13700K | devstral-small-2-24b-instruct-2512 | UD-Q4-K-XL | llama.cpp | GPU (CUDA) & CPU (75% Offload) | 69.1 1387 7.3 | 82.1 1577 6.3 | 112 1381 4.7 | 174 1401 3.1 | 280 1291 2.0 | 501 1039 1.1 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | devstral-small-2-24b-instruct-2512 | Q8_0 | llama.cpp | GPU (ROCm) & CPU (75% Offload) | 88.7 236 5.9 | 126 172 4.8 | 185 139 3.9 | 289 140 2.9 | 515 132 1.8 | 1073 115 1.0 |
| — | AMD EPYC 7F52 | devstral-small-2-24b-instruct-2512 | UD-Q4-K-XL | llama.cpp | CPU only (Generic) | 91.1 51 7.0 | 243 25.1 6.1 | 562 17.9 4.4 | ~1301 | — | — |
| Nvidia RTX 4080 | Intel Core i7-13700K | devstral-small-2-24b-instruct-2512 | Q8_0 | llama.cpp | GPU (CUDA) & CPU (75% Offload) | 109 945 4.7 | 122 1592 4.2 | 153 1413 3.4 | 215 1446 2.5 | 324 1348 1.7 | 537 1126 1.0 |
| — | Intel Core i7-13700K | devstral-small-2-24b-instruct-2512 | UD-Q4-K-XL | llama.cpp | CPU only (Generic) | 115 52 5.2 | 274 24.5 4.5 | 552 19.8 3.5 | 1182 16.8 2.3 | ~2688 | — |
| AMD Radeon Mi50 (3x) | — | glm-4.5-air | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 22.6 318 25.8 | 38.5 291 20.4 | 56.1 286 18.1 | 80.3 306 18.4 | 176 230 13.9 | 547 130 9.2 |
| AMD Radeon 8060S (AI Max+ 395) | — | glm-4.5-air | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 29.3 257 19.8 | 58.1 168 14.9 | 106 123 12.4 | 158 136 12.7 | 322 124 8.1 | 985 74 4.4 |
| AMD Radeon 8060S (AI Max+ 395) | — | glm-4.5-air | UD-Q6-K-XL | llama.cpp | GPU only (ROCm) | 37.6 216 15.2 | 64.0 178 12.1 | 102 151 10.5 | 176 130 10.7 | 491 76 7.3 | ~2225 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | glm-4.5-air | UD-Q4-K-XL | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 40.4 138 15.1 | 73.6 120 12.5 | 117 111 11.3 | 187 112 11.5 | 409 98 6.3 | ~1210 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | glm-4.5-air | UD-Q6-K-XL | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 51.4 93 12.3 | 94.8 84 10.6 | 154 80 9.4 | 252 80 9.7 | 508 72 7.8 | ~1254 |
| AMD Radeon Mi50 (3x) | AMD EPYC 7F52 | glm-4.5-air | UD-Q6-K-XL | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 52.5 91 12.1 | 96.1 83 10.5 | 155 79 9.4 | 254 79 9.9 | 512 72 7.8 | ~1274 |
| — | AMD EPYC 7F52 | glm-4.5-air | UD-Q4-K-XL | llama.cpp | CPU only (Generic) | 95.4 30.8 8.0 | 309 23.1 3.7 | 763 16.1 1.9 | ~1882 | — | — |
| AMD Radeon Mi50 (3x) | — | gpt-oss-120b | MXFP4 | llama.cpp | GPU only (ROCm) | 11.0 394 59 | 15.2 631 57 | 20.3 707 56 | 29.6 773 56 | 55.0 703 53 | 123 571 47.1 |
| AMD Radeon 8060S (AI Max+ 395) | — | gpt-oss-120b | MXFP4 | llama.cpp | GPU only (ROCm) | 12.6 519 46.8 | 16.8 749 43.9 | 22.4 763 42.1 | 33.1 750 42.5 | 70.3 560 38.0 | 192 364 31.6 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | gpt-oss-120b | MXFP4 | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 24.6 129 29.8 | 30.3 329 27.6 | 38.9 374 28.7 | 57.3 395 29.7 | 104 369 28.2 | 223 314 25.9 |
| — | AMD EPYC 7F52 | gpt-oss-120b | MXFP4 | llama.cpp | CPU only (Generic) | 42.7 63 18.6 | 116 55 11.6 | 237 46.5 7.7 | 573 34.9 4.3 | ~1639 | — |
| Nvidia RTX 4080 | — | gpt-oss-20b | MXFP4 | llama.cpp | GPU only (CUDA) | 2.9 5197 186 | 3.4 6567 177 | 4.1 7009 170 | 5.0 7497 172 | 7.9 6850 155 | 15.2 5648 130 |
| Nvidia RTX 4080 | — | gpt-oss-20b | MXFP4 | vLLM | GPU only (CUDA) | 3.5 7942 147 | 3.9 9777 143 | 4.4 9435 139 | OOM | — | — |
| Nvidia RTX 4080 | Intel Core i7-13700K | gpt-oss-20b | MXFP4 | llama.cpp | GPU (CUDA) & CPU (25% Offload) | 5.1 2510 107 | 5.7 4501 104 | 6.5 5131 100 | 7.8 5557 101 | 11.3 5277 96 | 20.4 4414 85 |
| AMD Radeon Mi50 | — | gpt-oss-20b | MXFP4 | llama.cpp | GPU only (ROCm) | 5.8 918 107 | 8.6 1054 103 | 12.3 1086 101 | 19.6 1094 102 | 39.5 936 94 | 95.6 713 85 |
| AMD Radeon Mi50 (2x) | — | gpt-oss-20b | MXFP4 | llama.cpp | GPU only (ROCm) | 6.6 902 91 | 8.8 1355 85 | 11.6 1471 81 | 15.9 1606 84 | 28.5 1446 78 | 62.4 1159 70 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | gpt-oss-20b | MXFP4 | llama.cpp | GPU (ROCm) & CPU (25% Offload) | 8.0 679 77 | 10.9 956 75 | 14.7 1012 74 | 22.4 1025 74 | 43.2 886 71 | 101 683 65 |
| AMD Radeon 8060S (AI Max+ 395) | — | gpt-oss-20b | MXFP4 | llama.cpp | GPU only (ROCm) | 8.3 1228 67 | 11.0 1322 63 | 14.3 1334 61 | 20.6 1296 61 | 42.7 952 55 | 118 596 45.8 |
| Nvidia RTX 4080 | Intel Core i7-13700K | gpt-oss-20b | MXFP4 | llama.cpp | GPU (CUDA) & CPU (100% Offload) | 11.3 1062 48.3 | 11.9 2744 47.7 | 13.0 3283 47.4 | 14.8 3723 47.7 | 19.4 3671 46.6 | 30.8 3312 43.6 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | gpt-oss-20b | MXFP4 | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 14.3 420 41.9 | 17.6 784 40.0 | 21.7 854 40.6 | 29.9 887 42.2 | 53.5 788 39.0 | 115 626 37.9 |
| — | Intel Core i7-13700K | gpt-oss-20b | MXFP4 | llama.cpp | CPU only (OneAPI MKL) | 25.2 111 30.9 | 59.5 98 26.4 | 118 84 21.4 | 275 68 12.7 | 733 48.3 7.0 | 2224 30.5 4.0 |
| — | Intel Core i7-13700K | gpt-oss-20b | MXFP4 | llama.cpp | CPU only (Generic) | 28.4 99 27.3 | 67.7 87 23.0 | 132 76 18.7 | 304 62 10.4 | 799 45.7 5.1 | 2387 29.1 2.7 |
| — | AMD EPYC 7F52 | gpt-oss-20b | MXFP4 | llama.cpp | CPU only (Generic) | 31.2 97 24.0 | 79.8 84 15.5 | 159 72 10.5 | 372 55 6.3 | 1019 36.5 3.5 | ~3263 |
| AMD Radeon Mi50 | — | granite-4.0-h-small | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 14.9 477 39.0 | 20.5 530 38.6 | 27.7 548 38.2 | 41.8 557 38.3 | 72.3 544 37.4 | 139 513 35.9 |
| AMD Radeon Mi50 (2x) | — | granite-4.0-h-small | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 16.9 417 34.5 | 21.1 621 34.1 | 26.1 702 34.0 | 36.4 735 34.1 | 59.0 727 33.5 | 107 706 32.0 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | granite-4.0-h-small | UD-Q4-K-XL | llama.cpp | GPU (ROCm) & CPU (25% Offload) | 17.4 396 33.5 | 24.1 448 33.1 | 32.5 463 32.8 | 49.1 473 32.7 | 84.5 464 32.3 | 167 426 31.2 |
| AMD Radeon 8060S (AI Max+ 395) | — | granite-4.0-h-small | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 19.4 552 28.4 | 24.6 595 27.9 | 31.3 604 27.6 | 44.4 609 27.7 | 75.5 564 26.8 | 158 463 25.3 |
| AMD Radeon Mi50 (2x) | — | granite-4.0-h-small | UD-Q8-K-XL | llama.cpp | GPU only (ROCm) | 20.2 273 30.3 | X | — | — | — | — |
| Nvidia RTX 4080 | Intel Core i7-13700K | granite-4.0-h-small | UD-Q4-K-XL | llama.cpp | GPU (CUDA) & CPU (100% Offload) | 21.4 591 25.3 | 24.8 848 25.0 | 28.4 921 25.4 | 36.4 968 25.2 | 52.4 981 25.4 | 86.5 970 24.6 |
| AMD Radeon Mi50 (3x) | — | granite-4.0-h-small | UD-Q8-K-XL | llama.cpp | GPU only (ROCm) | 22.2 238 27.9 | 34.4 268 25.7 | 50.6 267 24.3 | 76.3 286 24.6 | OOM | — |
| AMD Radeon Mi50 | AMD EPYC 7F52 | granite-4.0-h-small | UD-Q4-K-XL | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 28.8 230 20.5 | 39.9 279 19.6 | 55.9 276 18.6 | 80.0 303 18.4 | 141 288 16.6 | OOM |
| AMD Radeon 8060S (AI Max+ 395) | — | granite-4.0-h-small | UD-Q8-K-XL | llama.cpp | GPU only (ROCm) | 30.9 513 17.3 | 36.1 578 17.1 | 43.0 588 17.0 | 56.4 594 17.0 | 86.5 566 16.7 | 164 483 16.1 |
| AMD Radeon Mi50 (3x) | — | granite-4.0-h-small | BF16 | llama.cpp | GPU only (ROCm) | 40.0 69 19.6 | 55.0 137 19.4 | 78.5 153 19.3 | 125 162 19.3 | 220 165 19.0 | 420 164 18.8 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | granite-4.0-h-small | UD-Q8-K-XL | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 41.0 128 15.1 | 59.2 162 14.5 | 84.4 165 13.9 | 127 176 13.9 | OOM | — |
| AMD Radeon 8060S (AI Max+ 395) | — | granite-4.0-h-small | BF16 | llama.cpp | GPU only (ROCm) | 54.4 279 9.8 | 63.4 376 9.5 | 78.7 372 8.8 | 95.2 394 9.2 | 154 375 7.3 | OOM |
| — | AMD EPYC 7F52 | granite-4.0-h-small | UD-Q4-K-XL | llama.cpp | CPU only (Generic) | 56.0 49.1 14.1 | 122 47.5 13.2 | 218 45.0 12.5 | 396 45.1 12.8 | 766 44.7 11.2 | ~1583 |
| — | Intel Core i7-13700K | granite-4.0-h-small | UD-Q4-K-XL | llama.cpp | CPU only (Generic) | 68.4 43.3 11.0 | 141 42.4 10.7 | 243 41.0 10.5 | 441 40.9 10.4 | 833 41.3 9.5 | ~1640 |
| AMD Radeon 8060S (AI Max+ 395) | — | minimax-m2.1 | UD-Q3-K-XL | llama.cpp | GPU only (ROCm) | 23.3 293 25.3 | 44.6 316 16.0 | 69.8 261 13.0 | 107 249 12.4 | 328 124 7.0 | ~1999 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | minimax-m2.1 | UD-Q3-K-XL | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 51.3 52 15.6 | 118 50 12.9 | 211 48.4 10.7 | 369 48.7 11.4 | 705 48.4 9.4 | 1539 43.2 6.2 |
| AMD Radeon Mi50 (3x) | AMD EPYC 7F52 | minimax-m2.1 | UD-Q3-K-XL | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 53.7 51 14.8 | 121 49.4 12.2 | 214 47.5 10.7 | 375 48.0 11.2 | 721 47.6 8.6 | 1560 42.6 6.2 |
| — | AMD EPYC 7F52 | minimax-m2.1 | UD-Q3-K-XL | llama.cpp | CPU only (Generic) | 95.4 27.6 8.9 | 303 19.7 4.8 | 706 14.3 3.2 | 1520 11.4 3.5 | ~3276 | — |
| Nvidia RTX 4080 | Intel Core i7-13700K | qwen3-30b-a3b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU (CUDA) & CPU (25% Offload) | 5.0 2966 107 | OOM | — | — | — | — |
| AMD Radeon Mi50 | — | qwen3-30b-a3b-instruct-2507 | Q4-0 | llama.cpp | GPU only (ROCm) | 6.7 1310 84 | 10.7 1023 74 | 15.7 938 70 | 25.0 912 71 | 57.1 655 61 | 176 387 47.5 |
| AMD Radeon Mi50 | — | qwen3-30b-a3b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 7.7 1207 73 | 11.8 976 66 | 17.0 896 62 | 26.7 873 63 | 59.6 635 55 | 181 379 43.7 |
| AMD Radeon 8060S (AI Max+ 395) | — | qwen3-30b-a3b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 8.7 1238 63 | 14.9 849 49.4 | 22.0 763 44.1 | 33.7 730 44.5 | 80.3 489 33.8 | 257 273 22.9 |
| AMD Radeon Mi50 (2x) | — | qwen3-30b-a3b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 8.9 1128 62 | 12.4 1212 55 | 16.4 1172 53 | 22.0 1294 53 | 49.3 874 47.0 | 114 636 39.0 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | qwen3-30b-a3b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU (ROCm) & CPU (25% Offload) | 9.5 900 60 | 14.2 828 54 | 20.6 742 52 | 31.0 756 52 | 66.9 571 46.4 | 193 356 38.1 |
| AMD Radeon Mi50 (2x) | — | qwen3-30b-a3b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU only (ROCm) | 9.8 705 60 | 15.2 700 53 | 22.1 659 51 | 34.9 648 50 | 73.9 509 45.5 | 206 332 37.8 |
| AMD Radeon Mi50 (3x) | — | qwen3-30b-a3b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU only (ROCm) | 10.6 698 55 | 15.5 702 51 | 22.4 657 49.1 | 35.3 646 48.9 | 74.6 508 43.4 | 208 331 35.7 |
| Nvidia RTX 4080 | Intel Core i7-13700K | qwen3-30b-a3b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU (CUDA) & CPU (100% Offload) | 10.8 1257 50 | 13.8 1256 47.1 | 17.4 1239 45.6 | 23.6 1275 45.5 | 40.0 1152 41.5 | 75.3 1059 34.0 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | qwen3-30b-a3b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU (ROCm) & CPU (25% Offload) | 12.3 497 48.8 | 18.2 550 45.7 | 27.1 510 44.4 | 42.5 516 44.6 | 87.9 424 40.4 | 234 293 32.9 |
| AMD Radeon 8060S (AI Max+ 395) | — | qwen3-30b-a3b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU only (ROCm) | 12.9 1104 41.6 | 19.1 819 35.3 | 26.3 739 32.6 | 38.4 708 32.8 | 85.4 480 26.8 | 265 269 19.4 |
| Nvidia RTX 4080 | Intel Core i7-13700K | qwen3-30b-a3b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU (CUDA) & CPU (75% Offload) | 15.0 866 36.2 | 16.7 1449 36.0 | 19.9 1453 34.8 | 25.0 1497 35.0 | 38.5 1401 32.1 | OOM |
| AMD Radeon Mi50 | AMD EPYC 7F52 | qwen3-30b-a3b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 16.1 521 35.3 | 23.4 458 34.1 | 34.6 425 32.0 | 52.5 436 32.2 | 104 369 29.0 | 262 266 24.9 |
| AMD Radeon Mi50 (2x) | — | qwen3-30b-a3b-instruct-2507 | BF16 | llama.cpp | GPU only (ROCm) | 18.2 156 42.5 | 24.3 340 40.1 | 37.1 331 38.9 | 61.5 331 38.8 | OOM | — |
| AMD Radeon Mi50 (3x) | — | qwen3-30b-a3b-instruct-2507 | BF16 | llama.cpp | GPU only (ROCm) | 19.2 156 39.3 | 25.2 340 37.3 | 37.9 333 36.3 | 62.5 330 36.1 | 125 292 33.0 | 304 224 28.9 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | qwen3-30b-a3b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU (ROCm) & CPU (100% Offload) | 20.7 285 29.1 | 27.1 419 28.5 | 41.4 352 27.1 | 62.2 376 25.8 | 120 323 23.8 | 290 241 21.2 |
| AMD Radeon 8060S (AI Max+ 395) | — | qwen3-30b-a3b-instruct-2507 | BF16 | llama.cpp | GPU only (ROCm) | 21.4 434 26.2 | 28.6 542 23.6 | 38.3 508 22.3 | 55.3 492 22.4 | 113 370 19.3 | 319 225 15.2 |
| — | Intel Core i7-13700K | qwen3-30b-a3b-instruct-2507 | UD-Q4-K-XL | llama.cpp | CPU only (Generic) | 38.0 77 20.9 | 147 39.9 11.1 | 356 27.4 7.9 | 654 27.3 7.8 | ~1202 | — |
| — | AMD EPYC 7F52 | qwen3-30b-a3b-instruct-2507 | UD-Q4-K-XL | llama.cpp | CPU only (Generic) | 39.3 83 19.7 | 154 43.2 8.6 | 370 28.1 6.0 | 445 43.0 7.2 | 1337 26.6 4.1 | ~4022 |
| Nvidia RTX 4080 | — | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU only (CUDA) | 3.0 9548 171 | 4.5 7187 128 | 5.7 6927 111 | 6.6 7667 113 | 11.8 5712 81 | 29.7 3211 51 |
| Nvidia RTX 4080 | — | qwen3-4b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU only (CUDA) | 4.9 8779 105 | 6.3 7444 87 | 7.5 6989 79 | 8.3 7769 80 | 13.6 5743 62 | 31.5 3215 43.2 |
| Nvidia RTX 4080 | — | qwen3-4b-instruct-2507 | FP8 | vLLM | GPU only (CUDA) | 6.7 11897 75 | 7.4 12703 71 | 8.2 11333 67 | 10.2 8685 60 | 15.6 5874 49.2 | 31.9 3536 36.3 |
| AMD Radeon Mi50 | — | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 6.8 1298 83 | 11.4 980 69 | 17.2 884 62 | 25.8 907 63 | 57.6 675 49.3 | 171 410 34.1 |
| Nvidia RTX 4080 | — | qwen3-4b-instruct-2507 | BF16 | vLLM | GPU only (CUDA) | 6.9 8451 74 | 7.5 9469 70 | 8.4 8782 66 | 10.7 7105 60 | 16.5 5106 48.9 | OOM |
| Nvidia RTX 4080 | — | qwen3-4b-instruct-2507 | F16 | llama.cpp | GPU only (CUDA) | 7.0 8348 73 | 8.5 7176 63 | 9.7 6766 59 | 10.6 7440 59 | 15.9 5568 49.1 | OOM |
| AMD Radeon Mi50 | — | qwen3-4b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU only (ROCm) | 7.3 850 81 | 13.0 721 68 | 19.7 693 61 | 31.6 686 62 | 69.3 544 48.5 | 196 355 33.8 |
| AMD Radeon Mi50 (2x) | — | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 8.1 1240 69 | 11.8 1311 57 | 16.1 1265 52 | 21.0 1407 52 | 40.8 1102 42.7 | 106 715 31.1 |
| Nvidia RTX 4080 | Intel Core i7-13700K | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU (CUDA) & CPU (25% Offload) | 8.4 6860 61 | 13.4 5485 39.5 | 22.8 5331 23.5 | 41.6 5895 12.9 | 77.4 4581 7.1 | 156 2807 3.8 |
| AMD Radeon Mi50 (2x) | — | qwen3-4b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU only (ROCm) | 8.4 838 69 | 12.9 1028 56 | 17.4 1023 53 | 23.7 1130 53 | 51.7 831 42.5 | 116 643 30.8 |
| AMD Radeon 8060S (AI Max+ 395) | — | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU only (ROCm) | 9.0 1775 60 | 15.0 1087 44.7 | 21.8 925 38.8 | 30.4 932 39.4 | 77.8 532 28.4 | 319 221 18.3 |
| AMD Radeon Mi50 | — | qwen3-4b-instruct-2507 | F16 | llama.cpp | GPU only (ROCm) | 9.8 1134 56 | 13.9 1078 49.1 | 19.0 998 45.7 | 27.4 987 46.1 | 59.1 697 38.0 | 183 388 28.6 |
| AMD Radeon Mi50 (2x) | — | qwen3-4b-instruct-2507 | F16 | llama.cpp | GPU only (ROCm) | 11.0 956 50 | 14.7 1267 43.7 | 18.1 1346 41.5 | 22.5 1536 41.8 | 41.6 1176 34.8 | 107 722 26.8 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU (ROCm) & CPU (25% Offload) | 12.5 1177 43.1 | 23.3 928 26.4 | 37.6 836 17.9 | 61.9 864 11.6 | 126 642 6.6 | 318 392 3.2 |
| Nvidia RTX 4080 | Intel Core i7-13700K | qwen3-4b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU (CUDA) & CPU (25% Offload) | 13.0 5389 39.0 | 18.1 5557 28.8 | 27.9 5222 19.0 | 46.6 5890 11.4 | 81.0 4704 6.7 | 161 2978 3.6 |
| AMD Radeon 8060S (AI Max+ 395) | — | qwen3-4b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU only (ROCm) | 14.0 1775 37.3 | 19.9 1102 30.8 | 26.2 976 27.9 | 35.0 953 28.2 | 72.6 641 22.1 | 235 317 15.4 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | qwen3-4b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU (ROCm) & CPU (25% Offload) | 15.3 774 35.7 | 27.3 708 23.2 | 42.2 666 16.6 | 69.8 661 11.0 | 140 519 6.4 | 353 326 3.2 |
| Nvidia RTX 4080 | Intel Core i7-13700K | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU (CUDA) & CPU (75% Offload) | 16.9 4515 29.9 | 28.6 4194 18.1 | 53.4 3865 9.7 | 105 4292 5.0 | 194 3499 2.7 | 395 2194 1.4 |
| Nvidia RTX 4080 | Intel Core i7-13700K | qwen3-4b-instruct-2507 | F16 | llama.cpp | GPU (CUDA) & CPU (25% Offload) | 18.5 4444 27.4 | 23.7 5128 21.8 | 32.7 4868 16.1 | 52.4 5445 10.1 | 86.8 4426 6.3 | 163 2873 3.5 |
| AMD Radeon 8060S (AI Max+ 395) | — | qwen3-4b-instruct-2507 | F16 | llama.cpp | GPU only (ROCm) | 20.7 1879 24.9 | 26.5 1151 21.8 | 32.6 1011 20.4 | 41.0 994 20.5 | 78.2 655 17.1 | 239 321 12.8 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | GPU (ROCm) & CPU (75% Offload) | 24.3 553 23.4 | 64.5 274 10.4 | 110 174 7.8 | 123 250 9.0 | 278 172 5.5 | ~1260 |
| Nvidia RTX 4080 | Intel Core i7-13700K | qwen3-4b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU (CUDA) & CPU (75% Offload) | 25.8 3392 19.6 | 37.5 4429 13.7 | 63.0 4154 8.2 | 119 4507 4.3 | 200 3803 2.6 | 401 2497 1.3 |
| AMD Radeon Mi50 | AMD EPYC 7F52 | qwen3-4b-instruct-2507 | UD-Q8-K-XL | llama.cpp | GPU (ROCm) & CPU (75% Offload) | 30.0 436 18.5 | 70.9 248 9.4 | 120 162 7.1 | 136 221 8.3 | 301 159 5.1 | ~1308 |
| — | Intel Core i7-13700K | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | CPU only (Generic) | 30.9 122 22.1 | 83.2 86 13.7 | 190 65 7.4 | 493 44.5 3.8 | ~1447 | — |
| — | Intel Core i7-13700K | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | CPU only (OneAPI MKL) | 31.0 117 22.7 | 97.7 68 13.1 | 218 47.3 10.4 | 400 46.3 9.6 | 858 41.9 5.6 | ~2160 |
| Nvidia RTX 4080 | Intel Core i7-13700K | qwen3-4b-instruct-2507 | F16 | llama.cpp | GPU (CUDA) & CPU (75% Offload) | 38.7 2493 13.1 | 49.8 3903 10.3 | 75.6 3559 6.8 | 131 3965 4.0 | 217 3411 2.4 | 405 2332 1.3 |
| — | AMD EPYC 7F52 | qwen3-4b-instruct-2507 | UD-Q4-K-XL | llama.cpp | CPU only (Generic) | 41.2 91 17.1 | 137 49.8 9.2 | 311 33.6 6.9 | 542 33.7 7.8 | 1099 32.4 4.9 | ~2595 |
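To work with the table programmatically, each workload cell can be split back into its three numbers. Below is a minimal sketch with hypothetical helpers (`Cell`, `parse_cell`, `parse_row`, `WORKLOADS` are illustrative names, not part of the benchmark). Entries that do not hold three numbers are kept verbatim and not interpreted. These include OOM, X, the dash placeholder, and values starting with "~".

```python
from dataclasses import dataclass
from typing import Dict, Optional

WORKLOADS = ["1K", "4K", "8K", "16K", "32K", "64K"]

@dataclass
class Cell:
    time_s: Optional[float] = None   # total time in seconds
    pp_tps: Optional[float] = None   # prompt processing speed (tokens/s)
    tg_tps: Optional[float] = None   # token generation speed (tokens/s)
    marker: Optional[str] = None     # verbatim text if the cell has no three numbers

def parse_cell(text: str) -> Cell:
    """Split a workload cell such as '3.5 7942 147' into its three numbers."""
    parts = text.split()
    if len(parts) == 3:
        time_s, pp, tg = (float(p) for p in parts)
        return Cell(time_s, pp, tg)
    # Entries like "OOM", "X", "—" or "~1301" are kept as-is, uninterpreted.
    return Cell(marker=text.strip())

def parse_row(row: str) -> Dict[str, Cell]:
    """Map each workload label to its parsed cell for one markdown table row."""
    columns = [c.strip() for c in row.strip().strip("|").split("|")]
    # The first six columns are GPU, CPU, LLM, Quant, App, Option.
    return dict(zip(WORKLOADS, (parse_cell(c) for c in columns[6:])))

row = ("| Nvidia RTX 4080 | — | gpt-oss-20b | MXFP4 | vLLM | GPU only (CUDA) "
       "| 3.5 7942 147 | 3.9 9777 143 | 4.4 9435 139 | OOM | — | — |")
cells = parse_row(row)
print(cells["4K"].pp_tps)   # 9777.0
print(cells["16K"].marker)  # OOM
```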