
All Benchmark Results

These benchmarks measure single-user LLM inference speed across different combinations of hardware, models, and inference applications.

Each workload cell shows three numbers:

  12.5  Time (s)
  150   Prompt processing (t/s)
  45    Token generation (t/s)
  • Time: Total time in seconds to run the workload, from sending the API request to receiving the last generated token. Lower is better.
  • Prompt processing: Speed in tokens per second to process the prompt part of the workload. Higher is better.
  • Token generation: Speed in tokens per second to generate the output, which is always 500 tokens in length. Higher is better.
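
As a rough sketch of how such metrics can be derived from raw timings (function and inputs are illustrative, not this site's actual benchmark code):

```python
# Hypothetical sketch: deriving the three cell metrics from raw
# timestamps of one benchmark run. Names and inputs are illustrative,
# not the site's actual measurement code.

def cell_metrics(request_sent, first_token_at, last_token_at,
                 prompt_tokens, generated_tokens=500):
    """Return (total_time_s, pp_tps, tg_tps) for one workload run."""
    total_time = last_token_at - request_sent
    # Prompt processing: everything up to the first generated token.
    pp_time = first_token_at - request_sent
    # Token generation: from the first to the last generated token.
    tg_time = last_token_at - first_token_at
    return (round(total_time, 1),
            round(prompt_tokens / pp_time),
            round(generated_tokens / tg_time, 1))

# Example: a 4096-token prompt processed in ~8 s, then 500 tokens
# generated over the following ~23.8 s.
print(cell_metrics(0.0, 8.0, 31.8, 4096))  # (31.8, 512, 21.0)
```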

You may also want to read the Method page.

You can click on a row to open the detail page that shows the launch command and all measurements that were used to calculate these numbers.

Similar total times, different speeds: On very fast setups running short workloads, you might see the same total time (e.g., 4.3 seconds) for clearly different PP/TG speeds. This might look like an error, but the measurements typically show very fast PP times for both runs, with averages differing by only hundredths of a second. Because the total times are shown with limited precision, such small differences are lost.
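A minimal illustration of that rounding effect (all numbers invented):

```python
# Two runs whose prompt-processing times differ by a few hundredths of
# a second can show clearly different PP speeds yet display the same
# rounded total time. Values here are invented for illustration.
prompt_tokens = 1024
pp_a, pp_b = 0.06, 0.09   # measured PP times (s)
gen = 4.25                # generation time (s), same for both runs

print(round(prompt_tokens / pp_a), round(prompt_tokens / pp_b))  # 17067 11378
print(round(pp_a + gen, 1), round(pp_b + gen, 1))                # 4.3 4.3
```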

PP speed increasing with prompt length: A certain static overhead is included in the measured prompt processing time, caused by sending the request to the endpoint, internal processing, and generating and returning the first token. The longer the actual prompt processing takes, the smaller that overhead is proportionally, so the apparent PP speed rises with prompt length.
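That proportional effect can be shown with a toy model (the overhead and true speed below are invented numbers, not measurements from this site):

```python
# Toy model: measured PP time = static overhead + actual processing time.
# The apparent PP speed (tokens / measured time) rises with prompt length
# even though the true processing speed is constant. Numbers invented.
overhead = 0.5      # request transfer, internal processing, first token (s)
true_speed = 600.0  # actual prompt processing speed (t/s)

for tokens in (1024, 8192, 65536):
    measured = overhead + tokens / true_speed
    # Apparent speed climbs toward the true 600 t/s as prompts grow.
    print(tokens, round(tokens / measured))  # 464, 579, 597
```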

If you still think you found inconsistent numbers, please tell me.

  • GPUs tested: 3
  • CPUs tested: 2
  • Models tested: 8
  • Quant files tested: 20
  • Setups benchmarked: 102
  • Workloads measured: 2,223
  • Total runtime: 46.8 h

I put a lot of effort into the benchmark automation to make sure the launch configs are reasonably optimized and the results are reported just as they were measured. But don't decide for or against hardware/models/apps solely on these benchmarks. Research other sources.

Filter Results

Click into a field to see and select available values. Use * as wildcard, | for OR.
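
A guess at those matching semantics, sketched with Python's `fnmatch`; the site's actual filter implementation may differ:

```python
# Hypothetical sketch of the filter semantics described above:
# * is a wildcard, | separates OR alternatives, matching ignores case.
# This is a guess at the behavior, not the site's actual code.
from fnmatch import fnmatch

def matches(value: str, pattern: str) -> bool:
    """True if value matches any |-separated alternative of pattern."""
    return any(fnmatch(value.lower(), alt.strip().lower())
               for alt in pattern.split("|"))

print(matches("AMD Radeon Mi50 (2x)", "*mi50*"))           # True
print(matches("Nvidia RTX 4080", "*mi50*|*rtx*"))          # True
print(matches("AMD Radeon 8060S (AI Max+ 395)", "*rtx*"))  # False
```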
Filter fields: Hardware (GPU, CPU), Model (LLM, Quant), Inference (App, Option), and Workloads (1K, 4K, 8K, 16K, 32K, 64K). The workload label gives the prompt length in tokens; the generation length is always 500 tokens.

Workload cells read: total time in seconds (prompt processing t/s / token generation t/s).
AMD Radeon Mi50 · devstral-small-2-24b-instruct-2512 · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 20.8s (577/26.3) · 4K: 31.8s (329/25.5) · 8K: 53.8s (246/23.6) · 16K: 83.4s (259/23.3) · 32K: 164s (229/20.9) · 64K: 385s (181/17.4)
AMD Radeon Mi50 (2x) · devstral-small-2-24b-instruct-2512 · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 22.0s (572/24.6) · 4K: 29.0s (499/23.9) · 8K: 43.1s (393/22.1) · 16K: 63.8s (392/21.9) · 32K: 108s (387/19.9) · 64K: 230s (321/16.5)
AMD Radeon Mi50 · devstral-small-2-24b-instruct-2512 · Q8_0 · llama.cpp · GPU only (ROCm)
  1K: 23.8s (358/23.7) · 4K: 42.1s (196/23.2) · 8K: 74.9s (155/21.6) · 16K: 124s (160/21.3) · 32K: 243s (148/19.3) · 64K: OOM
AMD Radeon Mi50 (2x) · devstral-small-2-24b-instruct-2512 · Q8_0 · llama.cpp · GPU only (ROCm)
  1K: 25.4s (342/22.3) · 4K: 36.8s (288/21.9) · 8K: 55.6s (259/20.4) · 16K: 82.3s (279/20.2) · 32K: 149s (265/18.3) · 64K: 308s (234/15.4)
Nvidia RTX 4080 + Intel Core i7-13700K · devstral-small-2-24b-instruct-2512 · UD-Q4-K-XL · llama.cpp · GPU (CUDA) & CPU (25% Offload)
  1K: 33.3s (2312/15.2) · 4K: 39.1s (2276/13.4) · 8K: 51.6s (1935/10.5) · 16K: 76.2s (1992/7.3) · 32K: OOM
AMD Radeon Mi50 + AMD EPYC 7F52 · devstral-small-2-24b-instruct-2512 · UD-Q4-K-XL · llama.cpp · GPU (ROCm) & CPU (25% Offload)
  1K: 35.0s (501/15.2) · 4K: 52.4s (317/12.6) · 8K: 83.3s (238/10.1) · 16K: 132s (246/7.4) · 32K: 248s (221/4.9) · 64K: 554s (177/2.6)
AMD Radeon 8060S (AI Max+ 395) · devstral-small-2-24b-instruct-2512 · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 35.3s (850/14.6) · 4K: 44.3s (451/14.2) · 8K: 66.4s (298/12.7) · 16K: 90.4s (318/12.6) · 32K: 178s (241/11.0) · 64K: 498s (146/8.9)
AMD Radeon Mi50 + AMD EPYC 7F52 · devstral-small-2-24b-instruct-2512 · Q8_0 · llama.cpp · GPU (ROCm) & CPU (25% Offload)
  1K: 47.6s (311/11.3) · 4K: 72.6s (187/9.8) · 8K: 115s (149/8.1) · 16K: 186s (150/6.3) · 32K: 340s (142/4.4) · 64K: 729s (122/2.5)
AMD Radeon Mi50 + AMD EPYC 7F52 · devstral-small-2-24b-instruct-2512 · UD-Q4-K-XL · llama.cpp · GPU (ROCm) & CPU (75% Offload)
  1K: 57.9s (413/9.0) · 4K: 86.3s (309/6.8) · 8K: 133s (233/5.1) · 16K: 215s (238/3.4) · 32K: 393s (215/2.0) · 64K: 858s (171/1.0)
AMD Radeon 8060S (AI Max+ 395) · devstral-small-2-24b-instruct-2512 · Q8_0 · llama.cpp · GPU only (ROCm)
  1K: 58.6s (920/8.7) · 4K: 67.5s (454/8.5) · 8K: 89.1s (298/8.1) · 16K: 118s (291/7.9) · 32K: 188s (270/7.3) · 64K: 409s (195/6.3)
Nvidia RTX 4080 + Intel Core i7-13700K · devstral-small-2-24b-instruct-2512 · UD-Q4-K-XL · llama.cpp · GPU (CUDA) & CPU (75% Offload)
  1K: 69.1s (1387/7.3) · 4K: 82.1s (1577/6.3) · 8K: 112s (1381/4.7) · 16K: 174s (1401/3.1) · 32K: 280s (1291/2.0) · 64K: 501s (1039/1.1)
AMD Radeon Mi50 + AMD EPYC 7F52 · devstral-small-2-24b-instruct-2512 · Q8_0 · llama.cpp · GPU (ROCm) & CPU (75% Offload)
  1K: 88.7s (236/5.9) · 4K: 126s (172/4.8) · 8K: 185s (139/3.9) · 16K: 289s (140/2.9) · 32K: 515s (132/1.8) · 64K: 1073s (115/1.0)
AMD EPYC 7F52 · devstral-small-2-24b-instruct-2512 · UD-Q4-K-XL · llama.cpp · CPU only (Generic)
  1K: 91.1s (51/7.0) · 4K: 243s (25.1/6.1) · 8K: 562s (17.9/4.4) · 16K: ~1301s
Nvidia RTX 4080 + Intel Core i7-13700K · devstral-small-2-24b-instruct-2512 · Q8_0 · llama.cpp · GPU (CUDA) & CPU (75% Offload)
  1K: 109s (945/4.7) · 4K: 122s (1592/4.2) · 8K: 153s (1413/3.4) · 16K: 215s (1446/2.5) · 32K: 324s (1348/1.7) · 64K: 537s (1126/1.0)
Intel Core i7-13700K · devstral-small-2-24b-instruct-2512 · UD-Q4-K-XL · llama.cpp · CPU only (Generic)
  1K: 115s (52/5.2) · 4K: 274s (24.5/4.5) · 8K: 552s (19.8/3.5) · 16K: 1182s (16.8/2.3) · 32K: ~2688s
AMD Radeon Mi50 (3x) · glm-4.5-air · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 22.6s (318/25.8) · 4K: 38.5s (291/20.4) · 8K: 56.1s (286/18.1) · 16K: 80.3s (306/18.4) · 32K: 176s (230/13.9) · 64K: 547s (130/9.2)
AMD Radeon 8060S (AI Max+ 395) · glm-4.5-air · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 29.3s (257/19.8) · 4K: 58.1s (168/14.9) · 8K: 106s (123/12.4) · 16K: 158s (136/12.7) · 32K: 322s (124/8.1) · 64K: 985s (74/4.4)
AMD Radeon 8060S (AI Max+ 395) · glm-4.5-air · UD-Q6-K-XL · llama.cpp · GPU only (ROCm)
  1K: 37.6s (216/15.2) · 4K: 64.0s (178/12.1) · 8K: 102s (151/10.5) · 16K: 176s (130/10.7) · 32K: 491s (76/7.3) · 64K: ~2225s
AMD Radeon Mi50 + AMD EPYC 7F52 · glm-4.5-air · UD-Q4-K-XL · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 40.4s (138/15.1) · 4K: 73.6s (120/12.5) · 8K: 117s (111/11.3) · 16K: 187s (112/11.5) · 32K: 409s (98/6.3) · 64K: ~1210s
AMD Radeon Mi50 + AMD EPYC 7F52 · glm-4.5-air · UD-Q6-K-XL · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 51.4s (93/12.3) · 4K: 94.8s (84/10.6) · 8K: 154s (80/9.4) · 16K: 252s (80/9.7) · 32K: 508s (72/7.8) · 64K: ~1254s
AMD Radeon Mi50 (3x) + AMD EPYC 7F52 · glm-4.5-air · UD-Q6-K-XL · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 52.5s (91/12.1) · 4K: 96.1s (83/10.5) · 8K: 155s (79/9.4) · 16K: 254s (79/9.9) · 32K: 512s (72/7.8) · 64K: ~1274s
AMD EPYC 7F52 · glm-4.5-air · UD-Q4-K-XL · llama.cpp · CPU only (Generic)
  1K: 95.4s (30.8/8.0) · 4K: 309s (23.1/3.7) · 8K: 763s (16.1/1.9) · 16K: ~1882s
AMD Radeon Mi50 (3x) · gpt-oss-120b · MXFP4 · llama.cpp · GPU only (ROCm)
  1K: 11.0s (394/59) · 4K: 15.2s (631/57) · 8K: 20.3s (707/56) · 16K: 29.6s (773/56) · 32K: 55.0s (703/53) · 64K: 123s (571/47.1)
AMD Radeon 8060S (AI Max+ 395) · gpt-oss-120b · MXFP4 · llama.cpp · GPU only (ROCm)
  1K: 12.6s (519/46.8) · 4K: 16.8s (749/43.9) · 8K: 22.4s (763/42.1) · 16K: 33.1s (750/42.5) · 32K: 70.3s (560/38.0) · 64K: 192s (364/31.6)
AMD Radeon Mi50 + AMD EPYC 7F52 · gpt-oss-120b · MXFP4 · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 24.6s (129/29.8) · 4K: 30.3s (329/27.6) · 8K: 38.9s (374/28.7) · 16K: 57.3s (395/29.7) · 32K: 104s (369/28.2) · 64K: 223s (314/25.9)
AMD EPYC 7F52 · gpt-oss-120b · MXFP4 · llama.cpp · CPU only (Generic)
  1K: 42.7s (63/18.6) · 4K: 116s (55/11.6) · 8K: 237s (46.5/7.7) · 16K: 573s (34.9/4.3) · 32K: ~1639s
Nvidia RTX 4080 · gpt-oss-20b · MXFP4 · llama.cpp · GPU only (CUDA)
  1K: 2.9s (5197/186) · 4K: 3.4s (6567/177) · 8K: 4.1s (7009/170) · 16K: 5.0s (7497/172) · 32K: 7.9s (6850/155) · 64K: 15.2s (5648/130)
Nvidia RTX 4080 · gpt-oss-20b · MXFP4 · vLLM · GPU only (CUDA)
  1K: 3.5s (7942/147) · 4K: 3.9s (9777/143) · 8K: 4.4s (9435/139) · 16K: OOM
Nvidia RTX 4080 + Intel Core i7-13700K · gpt-oss-20b · MXFP4 · llama.cpp · GPU (CUDA) & CPU (25% Offload)
  1K: 5.1s (2510/107) · 4K: 5.7s (4501/104) · 8K: 6.5s (5131/100) · 16K: 7.8s (5557/101) · 32K: 11.3s (5277/96) · 64K: 20.4s (4414/85)
AMD Radeon Mi50 · gpt-oss-20b · MXFP4 · llama.cpp · GPU only (ROCm)
  1K: 5.8s (918/107) · 4K: 8.6s (1054/103) · 8K: 12.3s (1086/101) · 16K: 19.6s (1094/102) · 32K: 39.5s (936/94) · 64K: 95.6s (713/85)
AMD Radeon Mi50 (2x) · gpt-oss-20b · MXFP4 · llama.cpp · GPU only (ROCm)
  1K: 6.6s (902/91) · 4K: 8.8s (1355/85) · 8K: 11.6s (1471/81) · 16K: 15.9s (1606/84) · 32K: 28.5s (1446/78) · 64K: 62.4s (1159/70)
AMD Radeon Mi50 + AMD EPYC 7F52 · gpt-oss-20b · MXFP4 · llama.cpp · GPU (ROCm) & CPU (25% Offload)
  1K: 8.0s (679/77) · 4K: 10.9s (956/75) · 8K: 14.7s (1012/74) · 16K: 22.4s (1025/74) · 32K: 43.2s (886/71) · 64K: 101s (683/65)
AMD Radeon 8060S (AI Max+ 395) · gpt-oss-20b · MXFP4 · llama.cpp · GPU only (ROCm)
  1K: 8.3s (1228/67) · 4K: 11.0s (1322/63) · 8K: 14.3s (1334/61) · 16K: 20.6s (1296/61) · 32K: 42.7s (952/55) · 64K: 118s (596/45.8)
Nvidia RTX 4080 + Intel Core i7-13700K · gpt-oss-20b · MXFP4 · llama.cpp · GPU (CUDA) & CPU (100% Offload)
  1K: 11.3s (1062/48.3) · 4K: 11.9s (2744/47.7) · 8K: 13.0s (3283/47.4) · 16K: 14.8s (3723/47.7) · 32K: 19.4s (3671/46.6) · 64K: 30.8s (3312/43.6)
AMD Radeon Mi50 + AMD EPYC 7F52 · gpt-oss-20b · MXFP4 · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 14.3s (420/41.9) · 4K: 17.6s (784/40.0) · 8K: 21.7s (854/40.6) · 16K: 29.9s (887/42.2) · 32K: 53.5s (788/39.0) · 64K: 115s (626/37.9)
Intel Core i7-13700K · gpt-oss-20b · MXFP4 · llama.cpp · CPU only (OneAPI MKL)
  1K: 25.2s (111/30.9) · 4K: 59.5s (98/26.4) · 8K: 118s (84/21.4) · 16K: 275s (68/12.7) · 32K: 733s (48.3/7.0) · 64K: 2224s (30.5/4.0)
Intel Core i7-13700K · gpt-oss-20b · MXFP4 · llama.cpp · CPU only (Generic)
  1K: 28.4s (99/27.3) · 4K: 67.7s (87/23.0) · 8K: 132s (76/18.7) · 16K: 304s (62/10.4) · 32K: 799s (45.7/5.1) · 64K: 2387s (29.1/2.7)
AMD EPYC 7F52 · gpt-oss-20b · MXFP4 · llama.cpp · CPU only (Generic)
  1K: 31.2s (97/24.0) · 4K: 79.8s (84/15.5) · 8K: 159s (72/10.5) · 16K: 372s (55/6.3) · 32K: 1019s (36.5/3.5) · 64K: ~3263s
AMD Radeon Mi50 · granite-4.0-h-small · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 14.9s (477/39.0) · 4K: 20.5s (530/38.6) · 8K: 27.7s (548/38.2) · 16K: 41.8s (557/38.3) · 32K: 72.3s (544/37.4) · 64K: 139s (513/35.9)
AMD Radeon Mi50 (2x) · granite-4.0-h-small · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 16.9s (417/34.5) · 4K: 21.1s (621/34.1) · 8K: 26.1s (702/34.0) · 16K: 36.4s (735/34.1) · 32K: 59.0s (727/33.5) · 64K: 107s (706/32.0)
AMD Radeon Mi50 + AMD EPYC 7F52 · granite-4.0-h-small · UD-Q4-K-XL · llama.cpp · GPU (ROCm) & CPU (25% Offload)
  1K: 17.4s (396/33.5) · 4K: 24.1s (448/33.1) · 8K: 32.5s (463/32.8) · 16K: 49.1s (473/32.7) · 32K: 84.5s (464/32.3) · 64K: 167s (426/31.2)
AMD Radeon 8060S (AI Max+ 395) · granite-4.0-h-small · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 19.4s (552/28.4) · 4K: 24.6s (595/27.9) · 8K: 31.3s (604/27.6) · 16K: 44.4s (609/27.7) · 32K: 75.5s (564/26.8) · 64K: 158s (463/25.3)
AMD Radeon Mi50 (2x) · granite-4.0-h-small · UD-Q8-K-XL · llama.cpp · GPU only (ROCm)
  1K: 20.2s (273/30.3) · 4K: X
Nvidia RTX 4080 + Intel Core i7-13700K · granite-4.0-h-small · UD-Q4-K-XL · llama.cpp · GPU (CUDA) & CPU (100% Offload)
  1K: 21.4s (591/25.3) · 4K: 24.8s (848/25.0) · 8K: 28.4s (921/25.4) · 16K: 36.4s (968/25.2) · 32K: 52.4s (981/25.4) · 64K: 86.5s (970/24.6)
AMD Radeon Mi50 (3x) · granite-4.0-h-small · UD-Q8-K-XL · llama.cpp · GPU only (ROCm)
  1K: 22.2s (238/27.9) · 4K: 34.4s (268/25.7) · 8K: 50.6s (267/24.3) · 16K: 76.3s (286/24.6) · 32K: OOM
AMD Radeon Mi50 + AMD EPYC 7F52 · granite-4.0-h-small · UD-Q4-K-XL · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 28.8s (230/20.5) · 4K: 39.9s (279/19.6) · 8K: 55.9s (276/18.6) · 16K: 80.0s (303/18.4) · 32K: 141s (288/16.6) · 64K: OOM
AMD Radeon 8060S (AI Max+ 395) · granite-4.0-h-small · UD-Q8-K-XL · llama.cpp · GPU only (ROCm)
  1K: 30.9s (513/17.3) · 4K: 36.1s (578/17.1) · 8K: 43.0s (588/17.0) · 16K: 56.4s (594/17.0) · 32K: 86.5s (566/16.7) · 64K: 164s (483/16.1)
AMD Radeon Mi50 (3x) · granite-4.0-h-small · BF16 · llama.cpp · GPU only (ROCm)
  1K: 40.0s (69/19.6) · 4K: 55.0s (137/19.4) · 8K: 78.5s (153/19.3) · 16K: 125s (162/19.3) · 32K: 220s (165/19.0) · 64K: 420s (164/18.8)
AMD Radeon Mi50 + AMD EPYC 7F52 · granite-4.0-h-small · UD-Q8-K-XL · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 41.0s (128/15.1) · 4K: 59.2s (162/14.5) · 8K: 84.4s (165/13.9) · 16K: 127s (176/13.9) · 32K: OOM
AMD Radeon 8060S (AI Max+ 395) · granite-4.0-h-small · BF16 · llama.cpp · GPU only (ROCm)
  1K: 54.4s (279/9.8) · 4K: 63.4s (376/9.5) · 8K: 78.7s (372/8.8) · 16K: 95.2s (394/9.2) · 32K: 154s (375/7.3) · 64K: OOM
AMD EPYC 7F52 · granite-4.0-h-small · UD-Q4-K-XL · llama.cpp · CPU only (Generic)
  1K: 56.0s (49.1/14.1) · 4K: 122s (47.5/13.2) · 8K: 218s (45.0/12.5) · 16K: 396s (45.1/12.8) · 32K: 766s (44.7/11.2) · 64K: ~1583s
Intel Core i7-13700K · granite-4.0-h-small · UD-Q4-K-XL · llama.cpp · CPU only (Generic)
  1K: 68.4s (43.3/11.0) · 4K: 141s (42.4/10.7) · 8K: 243s (41.0/10.5) · 16K: 441s (40.9/10.4) · 32K: 833s (41.3/9.5) · 64K: ~1640s
AMD Radeon 8060S (AI Max+ 395) · minimax-m2.1 · UD-Q3-K-XL · llama.cpp · GPU only (ROCm)
  1K: 23.3s (293/25.3) · 4K: 44.6s (316/16.0) · 8K: 69.8s (261/13.0) · 16K: 107s (249/12.4) · 32K: 328s (124/7.0) · 64K: ~1999s
AMD Radeon Mi50 + AMD EPYC 7F52 · minimax-m2.1 · UD-Q3-K-XL · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 51.3s (52/15.6) · 4K: 118s (50/12.9) · 8K: 211s (48.4/10.7) · 16K: 369s (48.7/11.4) · 32K: 705s (48.4/9.4) · 64K: 1539s (43.2/6.2)
AMD Radeon Mi50 (3x) + AMD EPYC 7F52 · minimax-m2.1 · UD-Q3-K-XL · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 53.7s (51/14.8) · 4K: 121s (49.4/12.2) · 8K: 214s (47.5/10.7) · 16K: 375s (48.0/11.2) · 32K: 721s (47.6/8.6) · 64K: 1560s (42.6/6.2)
AMD EPYC 7F52 · minimax-m2.1 · UD-Q3-K-XL · llama.cpp · CPU only (Generic)
  1K: 95.4s (27.6/8.9) · 4K: 303s (19.7/4.8) · 8K: 706s (14.3/3.2) · 16K: 1520s (11.4/3.5) · 32K: ~3276s
Nvidia RTX 4080 + Intel Core i7-13700K · qwen3-30b-a3b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU (CUDA) & CPU (25% Offload)
  1K: 5.0s (2966/107) · 4K: OOM
AMD Radeon Mi50 · qwen3-30b-a3b-instruct-2507 · Q4-0 · llama.cpp · GPU only (ROCm)
  1K: 6.7s (1310/84) · 4K: 10.7s (1023/74) · 8K: 15.7s (938/70) · 16K: 25.0s (912/71) · 32K: 57.1s (655/61) · 64K: 176s (387/47.5)
AMD Radeon Mi50 · qwen3-30b-a3b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 7.7s (1207/73) · 4K: 11.8s (976/66) · 8K: 17.0s (896/62) · 16K: 26.7s (873/63) · 32K: 59.6s (635/55) · 64K: 181s (379/43.7)
AMD Radeon 8060S (AI Max+ 395) · qwen3-30b-a3b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 8.7s (1238/63) · 4K: 14.9s (849/49.4) · 8K: 22.0s (763/44.1) · 16K: 33.7s (730/44.5) · 32K: 80.3s (489/33.8) · 64K: 257s (273/22.9)
AMD Radeon Mi50 (2x) · qwen3-30b-a3b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 8.9s (1128/62) · 4K: 12.4s (1212/55) · 8K: 16.4s (1172/53) · 16K: 22.0s (1294/53) · 32K: 49.3s (874/47.0) · 64K: 114s (636/39.0)
AMD Radeon Mi50 + AMD EPYC 7F52 · qwen3-30b-a3b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU (ROCm) & CPU (25% Offload)
  1K: 9.5s (900/60) · 4K: 14.2s (828/54) · 8K: 20.6s (742/52) · 16K: 31.0s (756/52) · 32K: 66.9s (571/46.4) · 64K: 193s (356/38.1)
AMD Radeon Mi50 (2x) · qwen3-30b-a3b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU only (ROCm)
  1K: 9.8s (705/60) · 4K: 15.2s (700/53) · 8K: 22.1s (659/51) · 16K: 34.9s (648/50) · 32K: 73.9s (509/45.5) · 64K: 206s (332/37.8)
AMD Radeon Mi50 (3x) · qwen3-30b-a3b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU only (ROCm)
  1K: 10.6s (698/55) · 4K: 15.5s (702/51) · 8K: 22.4s (657/49.1) · 16K: 35.3s (646/48.9) · 32K: 74.6s (508/43.4) · 64K: 208s (331/35.7)
Nvidia RTX 4080 + Intel Core i7-13700K · qwen3-30b-a3b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU (CUDA) & CPU (100% Offload)
  1K: 10.8s (1257/50) · 4K: 13.8s (1256/47.1) · 8K: 17.4s (1239/45.6) · 16K: 23.6s (1275/45.5) · 32K: 40.0s (1152/41.5) · 64K: 75.3s (1059/34.0)
AMD Radeon Mi50 + AMD EPYC 7F52 · qwen3-30b-a3b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU (ROCm) & CPU (25% Offload)
  1K: 12.3s (497/48.8) · 4K: 18.2s (550/45.7) · 8K: 27.1s (510/44.4) · 16K: 42.5s (516/44.6) · 32K: 87.9s (424/40.4) · 64K: 234s (293/32.9)
AMD Radeon 8060S (AI Max+ 395) · qwen3-30b-a3b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU only (ROCm)
  1K: 12.9s (1104/41.6) · 4K: 19.1s (819/35.3) · 8K: 26.3s (739/32.6) · 16K: 38.4s (708/32.8) · 32K: 85.4s (480/26.8) · 64K: 265s (269/19.4)
Nvidia RTX 4080 + Intel Core i7-13700K · qwen3-30b-a3b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU (CUDA) & CPU (75% Offload)
  1K: 15.0s (866/36.2) · 4K: 16.7s (1449/36.0) · 8K: 19.9s (1453/34.8) · 16K: 25.0s (1497/35.0) · 32K: 38.5s (1401/32.1) · 64K: OOM
AMD Radeon Mi50 + AMD EPYC 7F52 · qwen3-30b-a3b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 16.1s (521/35.3) · 4K: 23.4s (458/34.1) · 8K: 34.6s (425/32.0) · 16K: 52.5s (436/32.2) · 32K: 104s (369/29.0) · 64K: 262s (266/24.9)
AMD Radeon Mi50 (2x) · qwen3-30b-a3b-instruct-2507 · BF16 · llama.cpp · GPU only (ROCm)
  1K: 18.2s (156/42.5) · 4K: 24.3s (340/40.1) · 8K: 37.1s (331/38.9) · 16K: 61.5s (331/38.8) · 32K: OOM
AMD Radeon Mi50 (3x) · qwen3-30b-a3b-instruct-2507 · BF16 · llama.cpp · GPU only (ROCm)
  1K: 19.2s (156/39.3) · 4K: 25.2s (340/37.3) · 8K: 37.9s (333/36.3) · 16K: 62.5s (330/36.1) · 32K: 125s (292/33.0) · 64K: 304s (224/28.9)
AMD Radeon Mi50 + AMD EPYC 7F52 · qwen3-30b-a3b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU (ROCm) & CPU (100% Offload)
  1K: 20.7s (285/29.1) · 4K: 27.1s (419/28.5) · 8K: 41.4s (352/27.1) · 16K: 62.2s (376/25.8) · 32K: 120s (323/23.8) · 64K: 290s (241/21.2)
AMD Radeon 8060S (AI Max+ 395) · qwen3-30b-a3b-instruct-2507 · BF16 · llama.cpp · GPU only (ROCm)
  1K: 21.4s (434/26.2) · 4K: 28.6s (542/23.6) · 8K: 38.3s (508/22.3) · 16K: 55.3s (492/22.4) · 32K: 113s (370/19.3) · 64K: 319s (225/15.2)
Intel Core i7-13700K · qwen3-30b-a3b-instruct-2507 · UD-Q4-K-XL · llama.cpp · CPU only (Generic)
  1K: 38.0s (77/20.9) · 4K: 147s (39.9/11.1) · 8K: 356s (27.4/7.9) · 16K: 654s (27.3/7.8) · 32K: ~1202s
AMD EPYC 7F52 · qwen3-30b-a3b-instruct-2507 · UD-Q4-K-XL · llama.cpp · CPU only (Generic)
  1K: 39.3s (83/19.7) · 4K: 154s (43.2/8.6) · 8K: 370s (28.1/6.0) · 16K: 445s (43.0/7.2) · 32K: 1337s (26.6/4.1) · 64K: ~4022s
Nvidia RTX 4080 · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU only (CUDA)
  1K: 3.0s (9548/171) · 4K: 4.5s (7187/128) · 8K: 5.7s (6927/111) · 16K: 6.6s (7667/113) · 32K: 11.8s (5712/81) · 64K: 29.7s (3211/51)
Nvidia RTX 4080 · qwen3-4b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU only (CUDA)
  1K: 4.9s (8779/105) · 4K: 6.3s (7444/87) · 8K: 7.5s (6989/79) · 16K: 8.3s (7769/80) · 32K: 13.6s (5743/62) · 64K: 31.5s (3215/43.2)
Nvidia RTX 4080 · qwen3-4b-instruct-2507 · FP8 · vLLM · GPU only (CUDA)
  1K: 6.7s (11897/75) · 4K: 7.4s (12703/71) · 8K: 8.2s (11333/67) · 16K: 10.2s (8685/60) · 32K: 15.6s (5874/49.2) · 64K: 31.9s (3536/36.3)
AMD Radeon Mi50 · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 6.8s (1298/83) · 4K: 11.4s (980/69) · 8K: 17.2s (884/62) · 16K: 25.8s (907/63) · 32K: 57.6s (675/49.3) · 64K: 171s (410/34.1)
Nvidia RTX 4080 · qwen3-4b-instruct-2507 · BF16 · vLLM · GPU only (CUDA)
  1K: 6.9s (8451/74) · 4K: 7.5s (9469/70) · 8K: 8.4s (8782/66) · 16K: 10.7s (7105/60) · 32K: 16.5s (5106/48.9) · 64K: OOM
Nvidia RTX 4080 · qwen3-4b-instruct-2507 · F16 · llama.cpp · GPU only (CUDA)
  1K: 7.0s (8348/73) · 4K: 8.5s (7176/63) · 8K: 9.7s (6766/59) · 16K: 10.6s (7440/59) · 32K: 15.9s (5568/49.1) · 64K: OOM
AMD Radeon Mi50 · qwen3-4b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU only (ROCm)
  1K: 7.3s (850/81) · 4K: 13.0s (721/68) · 8K: 19.7s (693/61) · 16K: 31.6s (686/62) · 32K: 69.3s (544/48.5) · 64K: 196s (355/33.8)
AMD Radeon Mi50 (2x) · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 8.1s (1240/69) · 4K: 11.8s (1311/57) · 8K: 16.1s (1265/52) · 16K: 21.0s (1407/52) · 32K: 40.8s (1102/42.7) · 64K: 106s (715/31.1)
Nvidia RTX 4080 + Intel Core i7-13700K · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU (CUDA) & CPU (25% Offload)
  1K: 8.4s (6860/61) · 4K: 13.4s (5485/39.5) · 8K: 22.8s (5331/23.5) · 16K: 41.6s (5895/12.9) · 32K: 77.4s (4581/7.1) · 64K: 156s (2807/3.8)
AMD Radeon Mi50 (2x) · qwen3-4b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU only (ROCm)
  1K: 8.4s (838/69) · 4K: 12.9s (1028/56) · 8K: 17.4s (1023/53) · 16K: 23.7s (1130/53) · 32K: 51.7s (831/42.5) · 64K: 116s (643/30.8)
AMD Radeon 8060S (AI Max+ 395) · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU only (ROCm)
  1K: 9.0s (1775/60) · 4K: 15.0s (1087/44.7) · 8K: 21.8s (925/38.8) · 16K: 30.4s (932/39.4) · 32K: 77.8s (532/28.4) · 64K: 319s (221/18.3)
AMD Radeon Mi50 · qwen3-4b-instruct-2507 · F16 · llama.cpp · GPU only (ROCm)
  1K: 9.8s (1134/56) · 4K: 13.9s (1078/49.1) · 8K: 19.0s (998/45.7) · 16K: 27.4s (987/46.1) · 32K: 59.1s (697/38.0) · 64K: 183s (388/28.6)
AMD Radeon Mi50 (2x) · qwen3-4b-instruct-2507 · F16 · llama.cpp · GPU only (ROCm)
  1K: 11.0s (956/50) · 4K: 14.7s (1267/43.7) · 8K: 18.1s (1346/41.5) · 16K: 22.5s (1536/41.8) · 32K: 41.6s (1176/34.8) · 64K: 107s (722/26.8)
AMD Radeon Mi50 + AMD EPYC 7F52 · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU (ROCm) & CPU (25% Offload)
  1K: 12.5s (1177/43.1) · 4K: 23.3s (928/26.4) · 8K: 37.6s (836/17.9) · 16K: 61.9s (864/11.6) · 32K: 126s (642/6.6) · 64K: 318s (392/3.2)
Nvidia RTX 4080 + Intel Core i7-13700K · qwen3-4b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU (CUDA) & CPU (25% Offload)
  1K: 13.0s (5389/39.0) · 4K: 18.1s (5557/28.8) · 8K: 27.9s (5222/19.0) · 16K: 46.6s (5890/11.4) · 32K: 81.0s (4704/6.7) · 64K: 161s (2978/3.6)
AMD Radeon 8060S (AI Max+ 395) · qwen3-4b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU only (ROCm)
  1K: 14.0s (1775/37.3) · 4K: 19.9s (1102/30.8) · 8K: 26.2s (976/27.9) · 16K: 35.0s (953/28.2) · 32K: 72.6s (641/22.1) · 64K: 235s (317/15.4)
AMD Radeon Mi50 + AMD EPYC 7F52 · qwen3-4b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU (ROCm) & CPU (25% Offload)
  1K: 15.3s (774/35.7) · 4K: 27.3s (708/23.2) · 8K: 42.2s (666/16.6) · 16K: 69.8s (661/11.0) · 32K: 140s (519/6.4) · 64K: 353s (326/3.2)
Nvidia RTX 4080 + Intel Core i7-13700K · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU (CUDA) & CPU (75% Offload)
  1K: 16.9s (4515/29.9) · 4K: 28.6s (4194/18.1) · 8K: 53.4s (3865/9.7) · 16K: 105s (4292/5.0) · 32K: 194s (3499/2.7) · 64K: 395s (2194/1.4)
Nvidia RTX 4080 + Intel Core i7-13700K · qwen3-4b-instruct-2507 · F16 · llama.cpp · GPU (CUDA) & CPU (25% Offload)
  1K: 18.5s (4444/27.4) · 4K: 23.7s (5128/21.8) · 8K: 32.7s (4868/16.1) · 16K: 52.4s (5445/10.1) · 32K: 86.8s (4426/6.3) · 64K: 163s (2873/3.5)
AMD Radeon 8060S (AI Max+ 395) · qwen3-4b-instruct-2507 · F16 · llama.cpp · GPU only (ROCm)
  1K: 20.7s (1879/24.9) · 4K: 26.5s (1151/21.8) · 8K: 32.6s (1011/20.4) · 16K: 41.0s (994/20.5) · 32K: 78.2s (655/17.1) · 64K: 239s (321/12.8)
AMD Radeon Mi50 + AMD EPYC 7F52 · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · GPU (ROCm) & CPU (75% Offload)
  1K: 24.3s (553/23.4) · 4K: 64.5s (274/10.4) · 8K: 110s (174/7.8) · 16K: 123s (250/9.0) · 32K: 278s (172/5.5) · 64K: ~1260s
Nvidia RTX 4080 + Intel Core i7-13700K · qwen3-4b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU (CUDA) & CPU (75% Offload)
  1K: 25.8s (3392/19.6) · 4K: 37.5s (4429/13.7) · 8K: 63.0s (4154/8.2) · 16K: 119s (4507/4.3) · 32K: 200s (3803/2.6) · 64K: 401s (2497/1.3)
AMD Radeon Mi50 + AMD EPYC 7F52 · qwen3-4b-instruct-2507 · UD-Q8-K-XL · llama.cpp · GPU (ROCm) & CPU (75% Offload)
  1K: 30.0s (436/18.5) · 4K: 70.9s (248/9.4) · 8K: 120s (162/7.1) · 16K: 136s (221/8.3) · 32K: 301s (159/5.1) · 64K: ~1308s
Intel Core i7-13700K · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · CPU only (Generic)
  1K: 30.9s (122/22.1) · 4K: 83.2s (86/13.7) · 8K: 190s (65/7.4) · 16K: 493s (44.5/3.8) · 32K: ~1447s
Intel Core i7-13700K · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · CPU only (OneAPI MKL)
  1K: 31.0s (117/22.7) · 4K: 97.7s (68/13.1) · 8K: 218s (47.3/10.4) · 16K: 400s (46.3/9.6) · 32K: 858s (41.9/5.6) · 64K: ~2160s
Nvidia RTX 4080 + Intel Core i7-13700K · qwen3-4b-instruct-2507 · F16 · llama.cpp · GPU (CUDA) & CPU (75% Offload)
  1K: 38.7s (2493/13.1) · 4K: 49.8s (3903/10.3) · 8K: 75.6s (3559/6.8) · 16K: 131s (3965/4.0) · 32K: 217s (3411/2.4) · 64K: 405s (2332/1.3)
AMD EPYC 7F52 · qwen3-4b-instruct-2507 · UD-Q4-K-XL · llama.cpp · CPU only (Generic)
  1K: 41.2s (91/17.1) · 4K: 137s (49.8/9.2) · 8K: 311s (33.6/6.9) · 16K: 542s (33.7/7.8) · 32K: 1099s (32.4/4.9) · 64K: ~2595s
OOM: Out of memory
~: Predicted time (stopped benchmarking there)