Performance of llama.cpp on Nvidia CUDA #15013

olegshulyakov · 2025-08-01T15:20:29Z

olegshulyakov
Aug 1, 2025

This is similar to the threads Performance of llama.cpp on Apple Silicon M-series, Performance of llama.cpp on AMD ROCm(HIP), and Performance of llama.cpp with Vulkan, but for CUDA! I think it's good to consolidate and discuss our results here.

We'll be testing the Llama 2 7B model like the other threads to keep things consistent, and using Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.

Instructions

Either run the commands below or download one of our CUDA releases. If you have multiple GPUs please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM.

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1

Share your llama-bench results along with the git hash and CUDA info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
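Judging by the outputs posted in this thread, the CUDA info string is the `Device N: ...` line printed at startup, and the git hash is on the `build:` line. Below is a minimal Python sketch for pulling both out of pasted output. The regular expressions are assumptions modelled on those lines, not a format llama.cpp guarantees, and `parse_report` is a hypothetical helper name.

```python
import re

# Patterns modelled on lines quoted in this thread, e.g.
#   "Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes"
#   "build: 9c35706 (6060)"
# They are assumptions about the log format, not a documented contract.
DEVICE_RE = re.compile(
    r"Device\s+(?P<index>\d+):\s+(?P<name>.+?),\s+"
    r"compute capability\s+(?P<cc>\d+\.\d+),\s+VMM:\s+(?P<vmm>yes|no)"
)
BUILD_RE = re.compile(r"build:\s+(?P<commit>[0-9a-f]{7,40})\s+\((?P<number>\d+)\)")


def parse_report(text):
    """Return (devices, build) found in pasted llama-bench output."""
    devices = [
        {
            "index": int(m["index"]),
            "name": m["name"],
            "compute_capability": m["cc"],
            "vmm": m["vmm"] == "yes",
        }
        for m in DEVICE_RE.finditer(text)
    ]
    b = BUILD_RE.search(text)
    # Shorten the hash to 7 characters, the length used on the scoreboard.
    build = {"commit": b["commit"][:7], "number": int(b["number"])} if b else None
    return devices, build


sample = """\
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 9c35706 (6060)
"""
devices, build = parse_report(sample)
print(devices[0]["name"], devices[0]["compute_capability"], build["commit"])
```

Longer hashes such as `89d10295` are cut to seven characters so they line up with the scoreboard's Commit column.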

If multiple entries are posted for the same device I'll prioritize newer commits with substantial CUDA updates, otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc. even if the chip is the same. For integrated graphics note that your memory speed and number of channels will greatly affect your inference speed!
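To make the memory-speed point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes token generation is limited by reading all model weights once per token, which ignores the KV cache, compute time, and kernel overhead, so the result is only a loose ceiling. The 19.5 Gbps per-pin data rate for the RTX 3090 is an assumption taken from the card's public specification; the bus width, model size, and measured tg128 come from this thread.

```python
GIB = 1024 ** 3


def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Theoretical memory bandwidth in GB/s: bytes per transfer times Gbps per pin."""
    return bus_width_bits / 8 * data_rate_gbps


def tg_upper_bound(bandwidth_gbs, model_size_gib):
    """Rough ceiling on tokens/s if every token reads all weights exactly once."""
    model_bytes = model_size_gib * GIB
    return bandwidth_gbs * 1e9 / model_bytes


# RTX 3090: 384-bit bus (from the scoreboard). The 19.5 Gbps per-pin rate
# is an assumption taken from the card's public specification.
bw = peak_bandwidth_gbs(384, 19.5)
bound = tg_upper_bound(bw, 3.56)  # 3.56 GiB is the model size llama-bench reports
measured = 158.16  # tg128 without FA, from the scoreboard
print(f"bandwidth {bw:.0f} GB/s, bound {bound:.1f} t/s, measured/bound {measured / bound:.0%}")
```

The same two functions can be applied to any scoreboard row once the memory data rate is known, which is why cards with similar bandwidth tend to land close together on tg128 even when their pp512 differs a lot.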

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX 5090	32 GB / GDDR7 / 512 bit	14073.41 ± 115.16	290.02 ± 1.10	`8cf6b42`	@totaldev
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	14854.63 ± 22.73	274.20 ± 0.14	`79c1160`	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	9918.34 ± 176.97	267.81 ± 1.54	`5143fa8`	@Hedede
A100 80 GB	80 GB / HBM2e / 5120 bit	4849.53 ± 8.94	190.88 ± 0.33	`5143fa8`	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	10293.86 ± 134.72	189.33 ± 0.19	`79c1160`	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	11992.70 ± 107.99	186.21 ± 0.13	`2241453`	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	8297.36 ± 9.50	181.99 ± 0.42	`8a4280c`	@Hedede
RTX 5070 Ti	16 GB / GDDR7 / 256 bit	6952.38 ± 13.73	176.85 ± 0.07	`933414c`	@TinyServal
RTX 6000 Ada	48 GB / GDDR6 / 384 bit	9229.23 ± 101.78	176.07 ± 0.26	`b8e09f0`	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6567.49 ± 20.30	171.19 ± 3.98	`9c35706`	@slaren
RTX 3090	24 GB / GDDR6X / 384 bit	5174.69 ± 21.83	158.16 ± 0.21	`c76b420`	@m18coppola
L40	48 GB / GDDR6 / 384 bit	8870.49 ± 378.76	152.01 ± 0.28	`ee09828`	@Hedede
RTX 4080 SUPER	16 GB / GDDR6X / 256 bit	8125.15 ± 41.05	148.33 ± 0.20	`81086cd`	@zacharyarnaise
RTX 4080	16 GB / GDDR6X / 256 bit	8031.64 ± 26.49	142.49 ± 0.16	`20638e4`	@Ristovski
RTX 3080	10 GB / GDDR6X / 320 bit	5013.86 ± 24.80	139.65 ± 0.99	`9c35706`	@slaren
RTX A6000	48 GB / GDDR6 / 384 bit	4913.93 ± 6.79	138.73 ± 2.75	`4795c91`	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	6924.53 ± 13.87	132.26 ± 0.16	`9c35706`	@Ristovski
RTX PRO 4000 Blackwell	24 GB / GDDR7 / 192 bit	4992.83 ± 113.52	131.66 ± 0.20	`7d77f07`	@Hedede
RTX A5000	24 GB / GDDR6 / 384 bit	4028.16 ± 19.14	130.07 ± 2.74	`e5155e6`	@Hedede
Tesla V100	32 GB / HBM2 / 4096 bit	3042.64 ± 40.71	129.08 ± 0.05	`51f5a45`	@Hedede
RTX 5070	12 GB / GDDR7 / 192 bit	5184.75 ± 18.70	127.54 ± 0.46		@Spyro000
A40	48 GB / GDDR6 / 384 bit	4609.01 ± 10.67	124.11 ± 0.17	`3470a5c`	@Hedede
A30	24 GB / HBM2e / 3072 bit	2767.10 ± 1.88	124.81 ± 0.16	`583cb83`	@Hedede
Titan V	12 GB / HBM2 / 3072 bit	2617.46 ± 2.10	108.79 ± 0.05	`e56abd2`	@Hedede
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	2890.66 ± 2.42	107.51 ± 0.21	`9c35706`	@ariya
Quadro RTX 6000	24 GB / GDDR6 / 384 bit	2751.18 ± 19.43	102.77 ± 0.04	`b8e09f0`	@Hedede
Quadro RTX 8000	48 GB / GDDR6 / 384 bit	2709.95 ± 3.35	102.68 ± 0.03	`b8e09f0`	@Hedede
RTX A4500	20 GB / GDDR6 / 320 bit	2827.20 ± 66.43	97.32 ± 2.80	`5cdb27e`	@aleksyx
RTX 5060 Ti	16 GB / GDDR7 / 128 bit	3737.25 ± 6.79	90.94 ± 0.02	`89d1029`	@mike-llamacpp
RTX 2070 SUPER	8 GB / GDDR6 / 256 bit	2088.34 ± 1.94	88.06 ± 0.28	`bc07349`	@phstudy
RTX A4000	16 GB / GDDR6 / 256 bit	2684.06 ± 15.28	83.77 ± 0.37	`65349f2`	@TinyServal
Titan Xp	12 GB / GDDR5X / 384 bit	1154.96 ± 1.46	76.08 ± 0.08	`c4510dc`	@Hedede
RTX 3060	12 GB / GDDR6 / 192 bit	2137.50 ± 10.12	75.57 ± 0.07	`baa9255`	@QuantiusBenignus
Quadro RTX 4000	8 GB / GDDR6 / 256 bit	1536.89 ± 0.90	65.62 ± 0.62	`7d77f07`	@Hedede
RTX 4060 Ti	8 GB / GDDR6 / 128 bit	3394.63 ± 7.44	63.86 ± 0.01	`89d1029`	@mike-llamacpp
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1084.41 ± 3.01	62.49 ± 0.06	`9c35706`	@ariya
RTX A4000 Ada	20 GB / GDDR6 / 160 bit	2779.77 ± 9.91	61.83 ± 0.04	`a74a0d6`	@sdwolfz
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1420.24 ± 1.95	60.04 ± 0.01	`5c0eb5e`	@ggerganov
Tesla P100	16 GB / HBM2 / 4096 bit	760.80 ± 2.92	58.35 ± 0.00	`b8372ee`	@Hedede
DGX Spark	128 GB / LPDDR5x	3062.31 ± 11.02	57.21 ± 0.06	`5acd455`	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1007.42 ± 1.23	54.74 ± 0.07	`c76b420`	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	1956.22 ± 7.74	50.62 ± 0.04	`756cfea`	@DigitalRudeness
Tesla T4	16 GB / GDDR6 / 256 bit	1219.06 ± 4.18	46.38 ± 0.73	`d32e03f`	@pt13762104
RTX 4050 Laptop	6 GB / GDDR6 / 96 bit	1725.85 ± 17.85	43.72 ± 0.41	`d79d8f3`	@TimCabbage
GTX 1660	6 GB / GDDR5 / 192 bit	148.91 ± 0.01	41.35 ± 0.02	`9515c61`	@ariya
Tesla M40	24 GB / GDDR5 / 384 bit	282.65 ± 0.15	38.04 ± 0.02	`97d5117`	@Hedede
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	714.44 ± 2.04	37.82 ± 0.02	`79c1160`	@pebaryan
Jetson AGX Orin	64 GB / LPDDR5 / 256 bit	991.31 ± 1.15	33.58 ± 0.14	`c1b1876`	@TinyServal
Tesla P4	8 GB / GDDR5 / 256 bit	514.53 ± 3.06	33.29 ± 0.00	`c76b420`	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	406.94 ± 0.25	30.40 ± 0.02	`5fd160b`	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	416.85 ± 1.75	27.79 ± 0.02	`5fd160b`	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	79.44 ± 0.01	27.82 ± 0.18	`f6da8cb`	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	309.30 ± 0.05	23.63 ± 0.00	`baa9255`	@TinyServal
Quadro P1000	4 GB / GDDR5 / 128 bit	183.40 ± 0.11	13.99 ± 0.13	`1e74897`	@aleksyx
Tesla K80	12 GB / GDDR5 / 384 bit	133.14 ± 0.55	13.80 ± 0.02	`32732f2`	@pebaryan

CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX 5090	32 GB / GDDR7 / 512 bit	14970.15 ± 381.06	300.40 ± 0.28	`8cf6b42`	@totaldev
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	16618.98 ± 20.66	281.11 ± 0.41	`5143fa8`	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	11263.29 ± 98.34	280.74 ± 1.17	`5143fa8`	@Hedede
A100 80 GB	80 GB / HBM2e / 5120 bit	5285.96 ± 6.58	200.90 ± 0.12	`5143fa8`	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	12506.97 ± 11.51	191.57 ± 0.03	`79c1160`	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	14770.63 ± 102.93	188.96 ± 0.05	`2241453`	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	9487.70 ± 21.89	184.68 ± 0.05	`8a4280c`	@Hedede
RTX 5070 Ti	16 GB / GDDR7 / 256 bit	8419.56 ± 35.50	182.43 ± 0.09	`933414c`	@TinyServal
RTX 6000 Ada	48 GB / GDDR6 / 384 bit	10576.85 ± 530.21	179.47 ± 0.32	`b8e09f0`	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6924.01 ± 10.76	172.26 ± 1.31	`9c35706`	@slaren
RTX PRO 4500 Blackwell	32 GB / GDDR7 / 256 bit	7251.66 ± 92.40	168.90 ± 0.20	`becc481`	@Hedede
RTX 3090	24 GB / GDDR6X / 384 bit	5560.06 ± 16.28	161.89 ± 0.18	`c76b420`	@m18coppola
L40	48 GB / GDDR6 / 384 bit	10097.64 ± 671.22	153.76 ± 0.12	`ee09828`	@Hedede
RTX 4080 SUPER	16 GB / GDDR6X / 256 bit	9439.01 ± 56.75	147.48 ± 1.41	`81086cd`	@zacharyarnaise
RTX 4080	16 GB / GDDR6X / 256 bit	9205.93 ± 22.31	143.47 ± 0.02	`20638e4`	@Ristovski
RTX A6000	48 GB / GDDR6 / 384 bit	5662.39 ± 13.87	144.87 ± 0.18	`4795c91`	@Hedede
RTX 3080	10 GB / GDDR6X / 320 bit	5569.56 ± 14.04	139.95 ± 0.95	`9c35706`	@slaren
RTX PRO 4000 Blackwell	24 GB / GDDR7 / 192 bit	5674.44 ± 139.53	136.38 ± 0.13	`7d77f07`	@Hedede
RTX A5000	24 GB / GDDR6 / 384 bit	4552.15 ± 9.68	135.83 ± 0.11	`e5155e6`	@Hedede
Tesla V100	32 GB / HBM2 / 4096 bit	2973.78 ± 3.62	134.76 ± 0.02	`51f5a45`	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	7612.32 ± 37.35	132.85 ± 0.31	`9c35706`	@Ristovski
RTX 5070	12 GB / GDDR7 / 192 bit	5783.44 ± 36.95	128.21 ± 2.52		@Spyro000
A40	48 GB / GDDR6 / 384 bit	5256.38 ± 19.39	126.24 ± 0.06	`3470a5c`	@Hedede
A30	24 GB / HBM2e / 3072 bit	3068.72 ± 0.63	131.93 ± 0.18	`583cb83`	@Hedede
Titan V	12 GB / HBM2 / 3072 bit	2481.25 ± 1.31	112.17 ± 0.01	`e56abd2`	@Hedede
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	3107.61 ± 4.34	109.17 ± 0.07	`9c35706`	@ariya
Quadro RTX 6000	24 GB / GDDR6 / 384 bit	3053.96 ± 1.37	104.38 ± 0.04	`b8e09f0`	@Hedede
Quadro RTX 8000	48 GB / GDDR6 / 384 bit	3052.35 ± 5.64	103.63 ± 0.02	`b8e09f0`	@Hedede
RTX A4500	20 GB / GDDR6 / 320 bit	3453.10 ± 49.19	103.00 ± 0.25	`5cdb27e`	@aleksyx
RTX 5060 Ti	16 GB / GDDR7 / 128 bit	4195.53 ± 1.98	93.46 ± 0.01	`89d1029`	@mike-llamacpp
RTX 2070 SUPER	8 GB / GDDR6 / 256 bit	2293.29 ± 5.91	87.71 ± 0.29	`bc07349`	@phstudy
RTX A4000	16 GB / GDDR6 / 256 bit	2807.83 ± 52.44	85.17 ± 0.66	`65349f2`	@TinyServal
RTX 3060	12 GB / GDDR6 / 192 bit	2407.67 ± 3.73	76.92 ± 0.03	`baa9255`	@QuantiusBenignus
Titan Xp	12 GB / GDDR5X / 384 bit	1218.12 ± 1.82	73.84 ± 0.04	`c4510dc`	@Hedede
Quadro RTX 4000	8 GB / GDDR6 / 256 bit	1662.80 ± 2.04	67.62 ± 0.67	`7d77f07`	@Hedede
RTX 4060 Ti	8 GB / GDDR6 / 128 bit	3803.45 ± 70.80	64.03 ± 0.53	`89d1029`	@mike-llamacpp
RTX A4000 Ada	20 GB / GDDR6 / 160 bit	3171.86 ± 4.34	61.37 ± 0.01	`a74a0d6`	@sdwolfz
Tesla P100	16 GB / HBM2 / 4096 bit	787.36 ± 3.27	61.99 ± 0.00	`b8372ee`	@Hedede
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1138.14 ± 2.02	61.38 ± 0.03	`9c35706`	@ariya
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1563.77 ± 0.51	61.13 ± 0.05	`5c0eb5e`	@ggerganov
DGX Spark	128 GB / LPDDR5x	3661.37 ± 38.66	56.74 ± 0.03	`5acd455`	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1079.66 ± 0.18	53.73 ± 0.05	`c76b420`	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	2250.14 ± 5.91	50.71 ± 0.01	`756cfea`	@DigitalRudeness
Tesla T4	16 GB / GDDR6 / 256 bit	1309.73 ± 1.02	44.03 ± 0.57	`d32e03f`	@pt13762104
GTX 1660	6 GB / GDDR5 / 192 bit	154.45 ± 0.52	41.43 ± 0.01	`9515c61`	@ariya
Tesla M40	24 GB / GDDR5 / 384 bit	290.17 ± 0.11	39.98 ± 0.01	`97d5117`	@Hedede
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	790.52 ± 2.39	37.87 ± 0.00	`79c1160`	@pebaryan
Jetson AGX Orin	64 GB / LPDDR5 / 256 bit	1171.96 ± 4.70	35.88 ± 0.18	`c1b1876`	@TinyServal
Tesla P4	8 GB / GDDR5 / 256 bit	529.53 ± 2.12	33.12 ± 0.03	`c76b420`	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	438.49 ± 0.38	30.64 ± 0.06	`5fd160b`	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	446.19 ± 0.81	28.18 ± 0.01	`5fd160b`	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	27.46 ± 0.23	27.46 ± 0.23	`f6da8cb`	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	311.55 ± 0.19	23.76 ± 0.01	`baa9255`	@TinyServal
Tesla K80	12 GB / GDDR5 / 384 bit	133.36 ± 0.60	14.27 ± 0.32	`32732f2`	@pebaryan
Quadro P1000	4 GB / GDDR5 / 128 bit	173.82 ± 0.02	13.65 ± 0.14	`1e74897`	@aleksyx
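
One way to read the two scoreboards side by side is to compute the relative change from enabling flash attention. The Python sketch below does that for a handful of rows copied from the tables above; the choice of rows is arbitrary and `pct_change` is a hypothetical helper.

```python
# (pp512, tg128) mean t/s copied from the two scoreboards above.
no_fa = {
    "RTX 5090": (14073.41, 290.02),
    "RTX 4090": (11992.70, 186.21),
    "RTX 3090": (5174.69, 158.16),
    "Tesla V100": (3042.64, 129.08),
}
with_fa = {
    "RTX 5090": (14970.15, 300.40),
    "RTX 4090": (14770.63, 188.96),
    "RTX 3090": (5560.06, 161.89),
    "Tesla V100": (2973.78, 134.76),
}


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# For each chip: (pp512 change %, tg128 change %) when going from -fa 0 to -fa 1.
gains = {
    chip: tuple(pct_change(b, a) for b, a in zip(no_fa[chip], with_fa[chip]))
    for chip in no_fa
}
for chip, (pp, tg) in gains.items():
    print(f"{chip:<12} pp512 {pp:+6.1f}%  tg128 {tg:+5.1f}%")
```

For these rows the gain shows up mostly in pp512, and it is not guaranteed to be positive: the Tesla V100 row comes out slightly slower on pp512 with FA.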

More detailed test

The main idea of this test is to show how performance decreases as the prompt and generation lengths increase.

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048
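
The `-p` and `-n` lists in that command are simple doubling sequences, so they are easy to generate if you want to script the sweep or change its range. This is a small Python sketch that rebuilds the same command line; `doubling` is a hypothetical helper, and the model path is assumed to be in the current directory.

```python
def doubling(start, stop):
    """Return start, 2*start, 4*start, ... up to and including stop."""
    values = []
    n = start
    while n <= stop:
        values.append(n)
        n *= 2
    return values


prompt_sizes = doubling(512, 32768)  # values for -p
gen_sizes = doubling(128, 2048)      # values for -n

argv = [
    "llama-bench",
    "-m", "llama-2-7b.Q4_0.gguf",
    "-ngl", "99",
    "-fa", "0,1",
    "-p", ",".join(map(str, prompt_sizes)),
    "-n", ",".join(map(str, gen_sizes)),
]
command = " ".join(argv)
print(command)
```

The `argv` list can be handed to `subprocess.run` as-is if you would rather launch the benchmark from Python.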

m18coppola · 2025-08-01T16:11:12Z

m18coppola
Aug 1, 2025

Here are the results for my devices. Not sure how to get a "CUDA info string" though.

CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)

Chip	pp512 t/s	tg128 t/s	Commit
Tesla P4	514.53 ± 3.06	33.29 ± 0.00	`c76b420`
Tesla P40	1007.42 ± 1.23	54.74 ± 0.07	`c76b420`
RTX 3090	5174.69 ± 21.83	158.16 ± 0.21	`c76b420`

CUDA Scoreboard for Llama 2 7B, Q4_0 (with FA)

Chip	pp512 t/s	tg128 t/s	Commit
Tesla P4	529.53 ± 2.12	33.12 ± 0.03	`c76b420`
Tesla P40	1079.66 ± 0.18	53.73 ± 0.05	`c76b420`
RTX 3090	5560.06 ± 16.28	161.89 ± 0.18	`c76b420`

0 replies

bennmann · 2025-08-01T19:48:15Z

bennmann
Aug 1, 2025

While technically not directly related, there may also be value in comparing AMD ROCm builds here too, as ROCm acts as a replacement (sometimes a directly compatible layer) for most CUDA calls.

I admit there is a risk of confusing Nvidia users in the thread if this path is taken.

1 reply

olegshulyakov Aug 1, 2025
Author

As far as I know, you cannot run ROCm on an Nvidia GPU. If you would like to compare results, check the Vulkan thread, where you can find results for Vulkan/CUDA and Vulkan/ROCm.

UPD: Created ROCm discussion.

slaren · 2025-08-01T20:21:40Z

slaren
Aug 1, 2025
Maintainer

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	6567.49 ± 20.30
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	171.19 ± 3.98
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	6924.01 ± 10.76
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	172.26 ± 1.31

build: 9c35706 (6060)

Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5013.86 ± 24.80
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	139.65 ± 0.99
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5569.56 ± 14.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	139.95 ± 0.95

build: 9c35706 (6060)

0 replies

Ristovski · 2025-08-01T21:10:34Z

Ristovski
Aug 1, 2025

Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	6924.53 ± 13.87
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	132.26 ± 0.16
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	7612.32 ± 37.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	132.85 ± 0.31

build: 9c35706 (647)

3 replies

Ristovski Aug 7, 2025

@olegshulyakov One more benchmark for RTX 4080:

Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	8031.64 ± 26.49
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	142.49 ± 0.16
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	9205.93 ± 22.31
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	143.47 ± 0.02

build: 20638e4 (2)

olegshulyakov Aug 7, 2025
Author

@Ristovski why so slow? Have you undervolted it? It's pretty much the same as the RTX 3080; I expected somewhere between the RTX 3090 and 3080 Ti =(

Ristovski Aug 7, 2025

> @Ristovski why so slow? Have you undervolted it? It's pretty much the same as the RTX 3080; I expected somewhere between the RTX 3090 and 3080 Ti =(

Hmm indeed, I didn't give much thought to the score at first. It should be stock but not completely sure as that is one of our work machines. I didn't have much time to investigate today, will check again tomorrow!

RodriMora · 2025-08-01T22:51:37Z

RodriMora
Aug 1, 2025

Device 0: RTX 3090, power limited to 250 W

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	4175.47 ± 27.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	137.72 ± 0.46
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	4377.03 ± 89.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	138.34 ± 0.96

build: 9c35706 (6060)

Device 2: RTX 5090, power limited to 400 W

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	12706.26 ± 13.30
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	236.73 ± 1.29
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	13823.36 ± 20.99
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	245.02 ± 1.08

build: 9c35706 (6060)

3 replies

olegshulyakov Aug 2, 2025
Author

Can you please launch them without a limit on full power?

RodriMora Aug 2, 2025

Sure, here are the results with default power limits:

3090 at 390W
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5405.83 ± 5.80
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	151.04 ± 0.24
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5932.44 ± 10.87
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	155.36 ± 0.09

build: 9c35706 (6060)

5090 at 600W
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	14751.98 ± 136.24
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	239.62 ± 0.37
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	16041.54 ± 85.27
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	248.57 ± 0.05

build: 9c35706 (6060)

cmp-nct Dec 20, 2025

Crazy: the additional 200 W on the 5090 was likely consumed, but the performance change was negligibly small.
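
To put numbers on the power-limit comparison, the Python sketch below computes the relative change in power limit and in throughput between RodriMora's limited and default runs (no FA). Note the assumption: these are configured power limits, not measured draw, so they only bound what the cards actually consumed.

```python
# (power limit in W, pp512 t/s, tg128 t/s), no FA, from RodriMora's two posts.
runs = {
    "RTX 3090": [(250, 4175.47, 137.72), (390, 5405.83, 151.04)],
    "RTX 5090": [(400, 12706.26, 236.73), (600, 14751.98, 239.62)],
}


def scaling(low, high):
    """Percent increase in power limit, pp512 and tg128 between two runs."""
    return tuple((hi - lo) / lo * 100.0 for lo, hi in zip(low, high))


results = {gpu: scaling(*pair) for gpu, pair in runs.items()}
for gpu, (power, pp, tg) in results.items():
    print(f"{gpu}: +{power:.0f}% power limit -> pp512 {pp:+.1f}%, tg128 {tg:+.1f}%")
```

On these figures the higher limit helps prompt processing noticeably more than token generation, which fits the idea that tg128 is mostly limited by memory bandwidth and not by compute.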

ariya · 2025-08-02T04:34:57Z

ariya
Aug 2, 2025

Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	pp512	1084.41 ± 3.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	tg128	62.49 ± 0.06

Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	1	pp512	1138.14 ± 2.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	1	tg128	61.38 ± 0.03

build: 9c35706 (6060)

0 replies

ariya · 2025-08-02T17:01:55Z

ariya
Aug 2, 2025

@olegshulyakov To help users quickly understand the approximate largest models that can run on each GPU, I suggest adding a VRAM column next to the GPU name on the main scoreboard.

Example:

Chip	VRAM	pp512 t/s	tg128 t/s	Commit
RTX 3090 Ti	24 GB	6567.49 ± 20.30	171.19 ± 3.98	`9c35706`
RTX 3090	24 GB	5174.69 ± 21.83	158.16 ± 0.21	`c76b420`
RTX 3080	10 GB	5013.86 ± 24.80	139.65 ± 0.99	`9c35706`

1 reply

olegshulyakov Aug 2, 2025
Author

Made it a little bit better 🙂

ggerganov · 2025-08-02T19:17:24Z

ggerganov
Aug 2, 2025
Maintainer

Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	1420.24 ± 1.95
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	60.04 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	1563.77 ± 0.51
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	61.13 ± 0.05

build: 5c0eb5e (6075)

1 reply

olegshulyakov Aug 2, 2025
Author

@ggerganov Can you please add "performance" label?

mike-llamacpp · 2025-08-02T20:45:23Z

mike-llamacpp
Aug 2, 2025

@olegshulyakov I see you grabbed some of my numbers from the Vulkan thread. However, I flooded that post with a bunch of data that probably came across as noise. While you quoted my correct numbers for Non-FA, the FA results you grabbed were actually when run on two GPUs instead of one. To make things easier, here are the numbers from a single card:

RTX 5060 Ti 16 GB

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  Device 2: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  Device 3: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes

model	size	params	backend	ngl	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	0	pp512	3737.25 ± 6.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	0	tg128	90.94 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	1	pp512	4195.53 ± 1.98
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	1	tg128	93.46 ± 0.01

build: 89d10295 (6002)

And here's another GPU for the collection:

RTX 4060 Ti 8 GB

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	3394.63 ± 7.44
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	63.86 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	3803.45 ± 70.80
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	64.03 ± 0.53

build: 89d10295 (6002)

2 replies

rohan-sircar Aug 5, 2025

Nice 64GB VRAM setup you got there!

> And here's another GPU for the collection:

We all be here showing off our GPU collections 😅

mike-llamacpp Aug 5, 2025

Thanks. It isn't the fastest setup around, especially when working with 70B+ models, but it is completely usable for inference. There are also some benefits I like about these particular cards (Gigabyte Windforce):

Two slots thick and only ~200 mm in length makes them easy to fit in a wide variety of cases
Physical x8 PCI-e connector lets them fit in either x8 or x16 slots without modification (5060 Tis only use 8 lanes anyhow)
Quiet (Silent when idle)
Low idle power consumption (~5 watts per card)
Relatively low power draw under full load (<180W each), so easy to power all four with an inexpensive PSU

ariya · 2025-08-04T06:20:35Z

ariya
Aug 4, 2025

Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	0	pp512	2890.66 ± 2.42
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	0	tg128	107.51 ± 0.21
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	1	pp512	3107.61 ± 4.34
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	100	1	tg128	109.17 ± 0.07

build: 9c35706 (6060)

0 replies

lhl · 2025-08-06T08:26:05Z

lhl
Aug 6, 2025

Yeah, I also saw numbers for my 4090 taken from the Vulkan thread. I re-ran the CUDA benchmark so you can get the latest FA and non-FA results from the same build:

FA:

❯ CUDA_VISIBLE_DEVICES=0 build/bin/llama-bench -m /models/llm/gguf/llama-2-7b.Q4_0.gguf -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	14770.63 ± 102.93
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	188.96 ± 0.05

Non-FA:

❯ CUDA_VISIBLE_DEVICES=0 build/bin/llama-bench -m /models/llm/gguf/llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	pp512	11992.70 ± 107.99
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	tg128	186.21 ± 0.13


build: 224145325 (6098)

nvidia-dkms 575.64.03-1

❯ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0

0 replies

pebaryan · 2025-08-07T10:10:35Z

pebaryan
Aug 7, 2025

NVIDIA P106-100
6GB VRAM
Win 11
Driver Version: 566.36 CUDA Version: 12.7

I ran the benchmark twice on each of two different builds and took the best result from each:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA P106-100, compute capability 6.1, VMM: no

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	pp512	406.94 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	tg128	30.40 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	pp512	438.49 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	tg128	30.64 ± 0.06

build: 5fd160b (6106)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA P106-100, compute capability 6.1, VMM: no

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	pp512	425.73 ± 0.82
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	tg128	29.42 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	pp512	436.90 ± 0.88
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	tg128	29.94 ± 0.03

build: 860a9e4 (5688)

Sadly, NVIDIA does not support this device in its Vulkan driver.

2 replies

pebaryan Aug 7, 2025

I just bricked my GTX 1070 Ti :( so I won't be able to reproduce the result with a newer build.

olegshulyakov Aug 7, 2025
Author

@pebaryan I've taken the result from the latest build.

DigitalRudeness · 2025-08-07T10:38:52Z

DigitalRudeness
Aug 7, 2025

Would like to participate with a slightly exotic one from my cute server cube.. :-) (RTX 2000 Ada, 16GB, 75W)

I did two runs:

pull/compilation of llama.cpp from yesterday:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	1956.22 ± 7.74
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	50.62 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	2250.14 ± 5.91
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	50.71 ± 0.01

build: 756cfea (6105)

fresh pull/compilation of llama.cpp ~5min ago:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	1952.82 ± 7.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	50.59 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	2237.16 ± 6.18
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	50.67 ± 0.01

build: 1d72c84 (6109)

Seems to make no big difference... ^^
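
A quick way to check whether two runs like these really differ is to compare the gap between the means with the reported spread. The Python sketch below assumes the ± figure is a standard deviation over repetitions and uses a two-sigma threshold, which is an arbitrary choice; `differs` is a hypothetical helper.

```python
import math


def differs(mean_a, std_a, mean_b, std_b, k=2.0):
    """True if the means differ by more than k combined standard deviations."""
    combined = math.sqrt(std_a ** 2 + std_b ** 2)
    return abs(mean_a - mean_b) > k * combined


# pp512 without FA on the RTX 2000 Ada, builds 756cfea and 1d72c84.
pp_no_fa = differs(1956.22, 7.74, 1952.82, 7.35)
# pp512 with FA, same two builds.
pp_fa = differs(2250.14, 5.91, 2237.16, 6.18)
print(pp_no_fa, pp_fa)
```

Under that threshold neither pair counts as different, which matches the impression that the two builds perform the same on this card.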

0 replies

pebaryan · 2025-08-11T09:42:50Z

pebaryan
Aug 11, 2025

I finally got my hands on a card similar to the previous one (NP106), but with display output

NVIDIA GTX 1060
6GB GDDR5 192-bit
Driver 566.36

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	pp512	416.85 ± 1.75
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	tg128	27.79 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	pp512	446.19 ± 0.81
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	tg128	28.18 ± 0.01

build: 5fd160b (6106)

1 reply

pebaryan Aug 11, 2025

Just realized I didn't use the latest build; not much of a difference, though.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	pp512	413.59 ± 2.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	0	tg128	27.74 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	pp512	443.66 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA,RPC	99	1	tg128	28.08 ± 0.04

build: 79c1160 (6123)

yc757 · 2026-04-05T08:10:43Z

yc757
Apr 5, 2026

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11441 MiB):
Device 0: Tesla K40c, compute capability 3.5, VMM: yes, VRAM: 11441 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	144.70 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	16.33 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	147.66 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	17.61 ± 0.00

build: d3416a4 (8651)

Sun Apr  5 08:05:25 2026
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 00000000:0B:00.0 Off |                    0 |
| 25%   49C    P0    64W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

0 replies

jeffyl · 2026-04-10T11:48:42Z

jeffyl
Apr 10, 2026

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8105 MiB):
  Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1, VMM: yes, VRAM: 8105 MiB
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  0 |           pp512 |        789.48 ± 0.99 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  0 |           tg128 |         45.75 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |           pp512 |        825.38 ± 0.16 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |  1 |           tg128 |         47.50 ± 0.00 |

version: 8745 (f989a6e39)

GTX 1080 (not the Ti)
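
Results pasted in the pipe-delimited form above are easy to turn into records for comparison against the scoreboard. Here is a minimal Python sketch; `parse_md_table` is a hypothetical helper that assumes the first row is the header, the second is the `---` separator, and every `t/s` cell has the `mean ± std` shape seen in this thread. The sample table is trimmed from the GTX 1080 output.

```python
def parse_md_table(text):
    """Parse a pipe-delimited llama-bench table into a list of dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    records = []
    for cells in body:
        rec = dict(zip(header, cells))
        # Split "789.48 ± 0.99" into its mean and spread.
        mean, _, std = rec["t/s"].partition("±")
        rec["t/s mean"] = float(mean)
        rec["t/s std"] = float(std)
        records.append(rec)
    return records


table = """\
| model         |     size | params | backend | ngl | fa |  test |           t/s |
| ------------- | -------: | -----: | ------- | --: | -: | ----: | ------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |  0 | pp512 | 789.48 ± 0.99 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |  1 | tg128 |  47.50 ± 0.00 |
"""
records = parse_md_table(table)
print(records[0]["test"], records[0]["t/s mean"], records[1]["t/s std"])
```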

0 replies

ygafarov · 2026-05-03T17:22:55Z

ygafarov
May 3, 2026

❯ ~/llama.cpp/build-cuda/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24126 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5349.43 ± 78.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	168.83 ± 0.20
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5992.10 ± 102.40
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	177.79 ± 0.13

build: db44417 (9011)

0 replies

RexBytes · 2026-05-08T15:47:27Z

RexBytes
May 8, 2026

NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition — SM 12.0 — comparison datapoint

Posting as a comparison datapoint to @Tom94's existing RTX PRO 6000 Blackwell entry.
This is the Max-Q / density-optimised variant; nvidia-smi reports a
power limit of 300.00 W on this card (vs the 600 W Workstation Edition).
Same SM120 silicon, lower power envelope — useful to see how the score scales
with TGP.

System

GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
VRAM reported: 97887 MiB
Power limit reported: 300.00 W
CPU: AMD Ryzen 9 9950X3D 16-Core Processor
RAM: 123Gi
OS: Linux Mint 22.3
Driver: 595.58.03
CUDA toolkit: 12.8
llama.cpp: 5d6f18a on branch master (describe: b9072-6-g5d6f18a63)
Model SHA256: 78b8f9777dd620ad29cd2cffb6653b17fa8a5b1fddc1b8821180d60eedd24d48
Build: cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120

CUDA init

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97249 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97249 MiB

Results — Llama 2 7B Q4_0 (-ngl 99 -fa 0,1 -r 5)

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	12242.46 ± 390.40
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	271.26 ± 0.43
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	13403.70 ± 209.97
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	287.29 ± 0.21

build: 5d6f18a

0 replies

UzixLS · 2026-05-10T09:18:45Z

UzixLS
May 10, 2026

NVIDIA GeForce RTX 5060 Ti (CUDA 13)

tellur ...ama/llama-b9093-bin-win-cuda-13.1-x64 $ ./llama-bench.exe -m d:/LLM/gguf/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 36790 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 16310 MiB
  Device 1: NVIDIA CMP 50HX, compute capability 7.5, VMM: yes, VRAM: 10239 MiB
  Device 2: NVIDIA P102-100, compute capability 6.1, VMM: yes, VRAM: 10239 MiB
load_backend: loaded CUDA backend from D:\LLM\llama\llama-b9093-bin-win-cuda-13.1-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\LLM\llama\llama-b9093-bin-win-cuda-13.1-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\LLM\llama\llama-b9093-bin-win-cuda-13.1-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |   none |  0 |           pp512 |      3733.17 ± 78.77 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |   none |  0 |           tg128 |         92.52 ± 0.18 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |   none |  1 |           pp512 |      4427.42 ± 58.40 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |   none |  1 |           tg128 |         95.69 ± 0.27 |

build: 1e5ad35d5 (9093)

NVIDIA GeForce RTX 5060 Ti (CUDA 12)

tellur ...ama/llama-b9093-bin-win-cuda-12.4-x64 $ ./llama-bench.exe -m d:/LLM/gguf/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 36790 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 16310 MiB
  Device 1: NVIDIA CMP 50HX, compute capability 7.5, VMM: yes, VRAM: 10239 MiB
  Device 2: NVIDIA P102-100, compute capability 6.1, VMM: yes, VRAM: 10239 MiB
load_backend: loaded CUDA backend from D:\LLM\llama\llama-b9093-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\LLM\llama\llama-b9093-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\LLM\llama\llama-b9093-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |   none |  0 |           pp512 |      3541.12 ± 73.58 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |   none |  0 |           tg128 |         92.75 ± 0.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |   none |  1 |           pp512 |      4430.01 ± 77.59 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |   none |  1 |           tg128 |         95.10 ± 1.32 |

build: 1e5ad35d5 (9093)

NVIDIA CMP 50HX (CUDA 12)

tellur ...ama/llama-b9093-bin-win-cuda-12.4-x64 $ ./llama-bench.exe -m d:/LLM/gguf/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 1
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 36790 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 16310 MiB
  Device 1: NVIDIA CMP 50HX, compute capability 7.5, VMM: yes, VRAM: 10239 MiB
  Device 2: NVIDIA P102-100, compute capability 6.1, VMM: yes, VRAM: 10239 MiB
load_backend: loaded CUDA backend from D:\LLM\llama\llama-b9093-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\LLM\llama\llama-b9093-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\LLM\llama\llama-b9093-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl |   main_gpu |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -----: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |   none |  0 |           pp512 |        416.06 ± 0.26 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |   none |  0 |           tg128 |         52.85 ± 0.05 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |   none |  1 |           pp512 |        428.35 ± 0.07 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |   none |  1 |           tg128 |         53.65 ± 0.10 |

build: 1e5ad35d5 (9093)

NVIDIA P102-100 (CUDA 12)

tellur ...ama/llama-b9093-bin-win-cuda-12.4-x64 $ ./llama-bench.exe -m d:/LLM/gguf/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 2
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 36790 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 16310 MiB
  Device 1: NVIDIA CMP 50HX, compute capability 7.5, VMM: yes, VRAM: 10239 MiB
  Device 2: NVIDIA P102-100, compute capability 6.1, VMM: yes, VRAM: 10239 MiB
load_backend: loaded CUDA backend from D:\LLM\llama\llama-b9093-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\LLM\llama\llama-b9093-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\LLM\llama\llama-b9093-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl |   main_gpu |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -----: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |   none |  0 |           pp512 |        913.16 ± 2.56 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |   none |  0 |           tg128 |         51.30 ± 0.06 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |   none |  1 |           pp512 |       1037.02 ± 1.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |   none |  1 |           tg128 |         53.96 ± 0.04 |

build: 1e5ad35d5 (9093)

NVIDIA GeForce RTX 5060 Ti (Vulkan)

tellur .../llama/llama-b9093-bin-win-vulkan-x64 $ ./llama-bench.exe -m d:/LLM/gguf/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
load_backend: loaded RPC backend from D:\LLM\llama\llama-b9093-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5060 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = NVIDIA P102-100 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = NVIDIA CMP 50HX (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 3 = Intel(R) UHD Graphics 770 (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from D:\LLM\llama\llama-b9093-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from D:\LLM\llama\llama-b9093-bin-win-vulkan-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |   none |  0 |           pp512 |      3359.51 ± 21.54 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |   none |  0 |           tg128 |         90.61 ± 1.23 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |   none |  1 |           pp512 |     3731.36 ± 102.20 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |   none |  1 |           tg128 |         95.84 ± 0.36 |

build: 1e5ad35d5 (9093)

NVIDIA CMP 50HX (Vulkan)

tellur .../llama/llama-b9093-bin-win-vulkan-x64 $ ./llama-bench.exe -m d:/LLM/gguf/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 2
load_backend: loaded RPC backend from D:\LLM\llama\llama-b9093-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5060 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = NVIDIA P102-100 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = NVIDIA CMP 50HX (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 3 = Intel(R) UHD Graphics 770 (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from D:\LLM\llama\llama-b9093-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from D:\LLM\llama\llama-b9093-bin-win-vulkan-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl |   main_gpu |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -----: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |          2 |   none |  0 |           pp512 |        188.88 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |          2 |   none |  0 |           tg128 |         34.19 ± 0.08 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |          2 |   none |  1 |           pp512 |        189.50 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |          2 |   none |  1 |           tg128 |         34.71 ± 0.10 |

build: 1e5ad35d5 (9093)

NVIDIA P102-100 (Vulkan)

tellur .../llama/llama-b9093-bin-win-vulkan-x64 $ ./llama-bench.exe -m d:/LLM/gguf/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 1
load_backend: loaded RPC backend from D:\LLM\llama\llama-b9093-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5060 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = NVIDIA P102-100 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = NVIDIA CMP 50HX (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 3 = Intel(R) UHD Graphics 770 (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from D:\LLM\llama\llama-b9093-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from D:\LLM\llama\llama-b9093-bin-win-vulkan-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl |   main_gpu |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -----: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |          1 |   none |  0 |           pp512 |        518.53 ± 0.22 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |          1 |   none |  0 |           tg128 |         63.79 ± 0.06 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |          1 |   none |  1 |           pp512 |        572.26 ± 0.19 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |          1 |   none |  1 |           tg128 |         66.59 ± 0.32 |

build: 1e5ad35d5 (9093)
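
To make the backend comparison easier to read, here is a small sketch that divides the Vulkan numbers by the CUDA 12.4 numbers for each card. It uses only the fa=0 rows posted above; the dictionary layout and variable names are just for illustration:

```python
# fa=0 rows from the CUDA 12.4 and Vulkan runs above: (pp512, tg128) in t/s.
results = {
    "RTX 5060 Ti": {"CUDA": (3541.12, 92.75), "Vulkan": (3359.51, 90.61)},
    "CMP 50HX":    {"CUDA": (416.06, 52.85),  "Vulkan": (188.88, 34.19)},
    "P102-100":    {"CUDA": (913.16, 51.30),  "Vulkan": (518.53, 63.79)},
}

vulkan_over_cuda = {}
for device, runs in results.items():
    pp = runs["Vulkan"][0] / runs["CUDA"][0]
    tg = runs["Vulkan"][1] / runs["CUDA"][1]
    vulkan_over_cuda[device] = (pp, tg)
    print(f"{device:12s} Vulkan/CUDA  pp512 {pp:.2f}x  tg128 {tg:.2f}x")
```

On these rows Vulkan is within about 5% of CUDA on the RTX 5060 Ti, falls to roughly 0.45x pp512 and 0.65x tg128 on the CMP 50HX, and on the P102-100 is slower for pp512 (about 0.57x) but faster for tg128 (about 1.24x).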

6 replies

UzixLS May 12, 2026

Yes, I was disappointed with this GPU. I'm not sure whether the problem is my setup or the GPU itself.

pt13762104 May 13, 2026

Try to run mmapeak on it and see what happens.

UzixLS May 13, 2026

...

$ ./mmapeak.exe
----------------------------------------
Device 0: NVIDIA GeForce RTX 5060 Ti
  Compute capability: 12.0
  Total global memory: 15.9 GiB
  Multiprocessor count: 36
Running benchmarks with target time: 3.0 seconds

======================================== INT ========================================
- mma_s4s4s32_8_8_32
run: 3065.2 ms 22.6 T(fl)ops
- mma_s8s8s32_16_16_16
run: 2909.5 ms 182.8 T(fl)ops
- mma_s8s8s32_32_8_16
run: 2956.6 ms 184.9 T(fl)ops

---------------------------------------- FP4 ----------------------------------------
- mma_mxf4mxf4f32_16_8_64
not supported
- mma_nvf4nvf4f32_16_8_64
not supported
- mma_f4f4f16_16_8_32
not supported
- mma_f4f4f32_16_8_32
not supported

---------------------------------------- FP6 ----------------------------------------
- mma_f6f6f16_16_8_32
not supported
- mma_f6f6f32_16_8_32
not supported
- mma_mxf6mxf6f32_16_8_32
not supported

---------------------------------------- FP8 ----------------------------------------
- mma_mxf8mxf8f32_16_8_32
not supported
- mma_f8f8f16_16_8_32
not supported
- mma_f8f8f32_16_8_32
not supported

---------------------------------------- FP16 ----------------------------------------
- mma_f16f16f16_16_16_16
run: 2938.9 ms 183.1 T(fl)ops
- mma_f16f16f16_32_8_16
run: 2953.2 ms 183.5 T(fl)ops
- mma_f16f16f32_16_16_16
run: 2950.3 ms 96.7 T(fl)ops
- mma_f16f16f32_32_8_16
run: 2951.6 ms 96.6 T(fl)ops
- mma_bf16bf16f32_16_16_16
run: 2977.3 ms 96.5 T(fl)ops
- mma_bf16bf16f32_32_8_16
run: 2978.3 ms 96.5 T(fl)ops

---------------------------------------- FP32 ----------------------------------------
- mma_tf32tf32f32_16_16_8
run: 2987.8 ms 24.2 T(fl)ops
- fma_fp32 (scalar)
fma_fp32: 3015.1 ms 20.5 T(fl)ops

---------------------------------------- FP64 ----------------------------------------
- fma_fp64 (scalar)
fma_fp64: 3014.2 ms 0.3 T(fl)ops
----------------------------------------
Device 1: NVIDIA CMP 50HX
  Compute capability: 7.5
  Total global memory: 10.0 GiB
  Multiprocessor count: 56
Running benchmarks with target time: 3.0 seconds

======================================== INT ========================================
- mma_s4s4s32_8_8_32
run: 2998.0 ms 12.5 T(fl)ops
- mma_s8s8s32_16_16_16
run: 2995.2 ms 6.2 T(fl)ops
- mma_s8s8s32_32_8_16
run: 2995.2 ms 6.2 T(fl)ops

---------------------------------------- FP4 ----------------------------------------
- mma_mxf4mxf4f32_16_8_64
not supported
- mma_nvf4nvf4f32_16_8_64
not supported
- mma_f4f4f16_16_8_32
not supported
- mma_f4f4f32_16_8_32
not supported

---------------------------------------- FP6 ----------------------------------------
- mma_f6f6f16_16_8_32
not supported
- mma_f6f6f32_16_8_32
not supported
- mma_mxf6mxf6f32_16_8_32
not supported

---------------------------------------- FP8 ----------------------------------------
- mma_mxf8mxf8f32_16_8_32
not supported
- mma_f8f8f16_16_8_32
not supported
- mma_f8f8f32_16_8_32
not supported

---------------------------------------- FP16 ----------------------------------------
- mma_f16f16f16_16_16_16
run: 2994.7 ms 3.1 T(fl)ops
- mma_f16f16f16_32_8_16
run: 2994.7 ms 3.1 T(fl)ops
- mma_f16f16f32_16_16_16
run: 2994.7 ms 3.1 T(fl)ops
- mma_f16f16f32_32_8_16
run: 2994.7 ms 3.1 T(fl)ops
- mma_bf16bf16f32_16_16_16
not supported
- mma_bf16bf16f32_32_8_16
not supported

---------------------------------------- FP32 ----------------------------------------
- mma_tf32tf32f32_16_16_8
not supported
- fma_fp32 (scalar)
fma_fp32: 3000.5 ms 0.4 T(fl)ops

---------------------------------------- FP64 ----------------------------------------
- fma_fp64 (scalar)
fma_fp64: 3003.6 ms 0.3 T(fl)ops

pt13762104 May 13, 2026

Weird how everything except FP64 is running at a rate you'd expect from a clock of around 50 MHz. Was the clock speed extremely low or something? Surprisingly, all the ratios look fine.
Edit: Looks like the CMP 170HX has the same problem: https://niconiconi.neocities.org/tech-notes/nvidia-cmp-170hx-review/
I suppose dp4a doesn't have this problem, but to work around it you'd need to build from source, which isn't worth it for such an outdated card.

arabel1a Jun 14, 2026

Hi! Can you please check my patch that partly restores tg on cmp90hx/cmp170hx? Hope it will work for cmp50hx too.

porly1985 · 2026-05-13T19:14:14Z

porly1985
May 13, 2026

max power consumption 70W

Device 0: NVIDIA RTX PRO 4000 Blackwell SFF Edition, compute capability 12.0, VMM: yes, VRAM: 23987 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	3628.68 ± 42.39
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	89.73 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	4078.52 ± 15.41
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	92.54 ± 0.03

build: 856c3ad (1)

0 replies

mattngaw · 2026-05-25T20:35:59Z

mattngaw
May 25, 2026

NVIDIA CMP 170HX (GA100, sm_80, 8 GB HBM2e)

Mining card (same die gen as A100)
Ubuntu 24.04, driver 570.211.01, CUDA 12.8, PCIe Gen1 x4 (firmware-locked).

$ ~/llama.cpp/build-cuda/bin/llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 7892 MiB):
  Device 0: NVIDIA Graphics Device, compute capability 8.0, VMM: yes, VRAM: 7892 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	626.32 ± 3.29
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	65.48 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	653.01 ± 1.92
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	65.91 ± 0.04

build: 97895129e (8863)

Despite supposedly coming from the same die, the card is compute-clamped in some way, and I'm interested in digging into it further. I measured FP16 GEMM ≈ 6.5 TFLOP/s, which is ~2.1% of the A100's peak (taking 312 TFLOP/s dense FP16 tensor throughput as the reference). HBM2e bandwidth is supposedly intact.
Peak ~144 W / 51°C under this benchmark.

5 replies

mattngaw May 25, 2026

it's sad that nsys and ncu can't be used on these cards

==ERROR== ERR_NVCMPGPU - Profiling is not supported on the NVIDIA Crypto Mining Processors (CMP) of the target device 0. For more information, please visit https://developer.nvidia.com/ERR_NVCMPGPU

but to add more color to the above results:

GPU utilization during inference — nvidia-smi dmon -s u while running the above benchmarks

Regime	SM active % (avg)	SM active % (max)	HBM2e controller % (avg)	HBM2e controller % (max)	Bottleneck
pp512 (prompt processing)	96	100	3	5	compute (SM)
tg128 (token generation)	95	100	14	18	compute (SM)

pt13762104 May 26, 2026

The CMP 170HX's compute rate is basically nuked: Nvidia intentionally restricted the card to 1/32 of its available compute... Most of the time that translates to 6x slower performance...

mattngaw May 26, 2026

if there's a will there's a way...
did some kernel hacking and was able to get these results:

$ llama-bench -m /home/matto/models/llama-2-7b.Q4_0.gguf -p 512 -n 128 -ngl 99
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 7892 MiB):
  Device 0: NVIDIA Graphics Device, compute capability 8.0, VMM: yes, VRAM: 7892 MiB

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	pp512	1175.43 ± 13.69
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	tg128	128.41 ± 0.10

just under 2x speedup
this card is not as nuked as we think ;)
of course, it's nowhere near an A100 but still
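
The "just under 2x" figure can be checked directly from the two runs. A minimal sketch, assuming the fa=0 rows of the stock run are the right baseline, since the second run did not pass -fa:

```python
# Stock run (fa=0 rows) vs. the patched-kernel run on the CMP 170HX, in t/s.
stock   = {"pp512": 626.32,  "tg128": 65.48}
patched = {"pp512": 1175.43, "tg128": 128.41}

speedup = {test: patched[test] / stock[test] for test in stock}
for test, factor in speedup.items():
    print(f"{test}: {factor:.2f}x")
```

That works out to roughly 1.88x for pp512 and 1.96x for tg128.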

mattngaw May 26, 2026

the roofline of this card is really interesting; the TG:PP TPS ratio is what makes it strange.
most other cards hover around single-digit percentages for TG/PP, but this card is at ~10%
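
To put numbers on that ratio, here is a quick sketch comparing the CMP 170HX runs in this thread with a few rows from the scoreboard at the top of the discussion. The choice of comparison cards is arbitrary:

```python
# (pp512, tg128) in t/s, fa=0 where the column is shown. The first two come
# from the posts above, the rest from the scoreboard at the top of the thread.
cards = {
    "CMP 170HX (stock)":   (626.32, 65.48),
    "CMP 170HX (patched)": (1175.43, 128.41),
    "RTX 5090":            (14073.41, 290.02),
    "H100 80 GB":          (9918.34, 267.81),
    "A100 80 GB":          (4849.53, 190.88),
}

ratios = {name: tg / pp for name, (pp, tg) in cards.items()}
for name, ratio in ratios.items():
    print(f"{name:20s} TG/PP = {ratio:.1%}")
```

The CMP 170HX lands at about 10.5% stock and 10.9% patched, against roughly 2.1% for the RTX 5090, 2.7% for the H100 and 3.9% for the A100.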

arabel1a Jun 14, 2026

Hi! Could you please check my patch that partly restores tg on cmp90hx/cmp170hx? Can your kernel hacking be combined with it?

AlphaMo99 · 2026-05-25T21:13:51Z

AlphaMo99
May 25, 2026

Hello, four more results on laptops:

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15984 MiB):
Device 0: NVIDIA RTX A5500 Laptop GPU, compute capability 8.6, VMM: yes, VRAM: 15984 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	2180.23 ± 55.75
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	74.68 ± 0.84
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	2338.14 ± 53.16
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	74.66 ± 1.09

build: 1acee6b (9293)

And,

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16110 MiB):
Device 0: Quadro RTX 5000, compute capability 7.5, VMM: yes, VRAM: 16110 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	1840.52 ± 22.87
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	74.76 ± 0.34
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	1970.10 ± 12.48
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	77.74 ± 0.32

build: b22ff4b (9299)

And

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 4031 MiB):
Device 0: Quadro P2000 with Max-Q Design, compute capability 6.1, VMM: yes, VRAM: 4031 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	216.34 ± 0.58
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	16.74 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	218.32 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	17.84 ± 0.00

build: dbe9c0c (9341)

And the last one:

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16266 MiB):
Device 0: Quadro P5000, compute capability 6.1, VMM: yes, VRAM: 16266 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	545.12 ± 1.57
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	34.71 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	549.43 ± 0.57
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	36.11 ± 0.03

build: 0d18aaa (9351)

This test does not work on the T2000: llama-bench reports 3715 MiB, but nvidia-smi shows 4096 MiB with no program loaded. (How were the T1000 results collected? Does it have more VRAM?)

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 3715 MiB):
Device 0: Quadro T2000, compute capability 7.5, VMM: yes, VRAM: 3715 MiB

model	size	params	backend	ngl	fa	test	t/s
llama_bench: error: failed to load model 'llama-2-7b.Q4_0.gguf'

Another error, this time on a GTX 980M with 4096 MiB of VRAM:
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 4022 MiB):
Device 0: NVIDIA GeForce GTX 980M, compute capability 5.2, VMM: yes, VRAM: 4022 MiB

model	size	params	backend	ngl	fa	test	t/s
llama_bench: error: failed to create context with model 'llama-2-7b.Q4_0.gguf'

1 reply

pt13762104 Jun 1, 2026

pp256 - 4GB VRAM isn't enough for this model...
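
A back-of-the-envelope budget shows how tight 4 GB is. This is only a sketch under stated assumptions: the Llama 2 7B shape (32 layers, 4096-wide K and V per token), an f16 KV cache, all weights resident on the GPU (an upper bound, since some tensors may stay on the host), and the reported VRAM totals treated as fully free. Compute buffers are left out because their size depends on the build and the architecture:

```python
MIB = 1024 ** 2

# Model size as reported by llama-bench above.
weights_mib = 3.56 * 1024

# Assumed Llama 2 7B shape: 32 layers, 4096-wide K and V per token (no GQA),
# and an f16 KV cache (2 bytes per element).
n_layers, n_embd, bytes_per_elem = 32, 4096, 2

def kv_cache_mib(n_ctx: int) -> float:
    """Size of the K and V caches for n_ctx tokens, in MiB."""
    return 2 * n_layers * n_embd * n_ctx * bytes_per_elem / MIB

for name, vram_mib in [("Quadro T2000", 3715), ("GTX 980M", 4022), ("Quadro P2000", 4031)]:
    for n_ctx in (512, 256):
        left = vram_mib - weights_mib - kv_cache_mib(n_ctx)
        print(f"{name:13s} n_ctx={n_ctx}: {left:7.1f} MiB left for compute buffers")
```

On these assumptions the T2000 is already short before any compute buffer is allocated, while the GTX 980M and the working Quadro P2000 differ by only 9 MiB, so the 980M failure probably comes down to how much of that memory was actually free and how large the compute buffers are on that card.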

Hastwell · 2026-05-26T07:52:42Z

Hastwell
May 26, 2026

CMP 100-210 (Mining GV100 - sm_70/Volta - 16GB HBM2)

I thought I'd chip in with a card I haven't seen good stats on: the CMP 100-210, a mining version of the V100. Like other CMP cards, the card is gimped to PCIe 1.1 x1 and has its Tensor Cores disabled, but seems to otherwise retain decent FP32 + FP16 performance and a fully working 16GB of HBM2 memory. (FP64, however, is mega gimped to ~0.377 TFLOPS, unlike the real V100's 7.8 TFLOPS.)

All tests were run with a 180W power limit; running at the full 250W gives no additional speed increase.
llama.cpp was compiled with -DGGML_CUDA=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES="70"
Host is Proxmox VE 9.1.4 (based on Debian Linux "Trixie")

llamacpp@llamacpp:~/llama.cpp/build/bin$ ./llama-bench -m /mnt/my/nas/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 32289 MiB):
  Device 0: NVIDIA CMP 100-210, compute capability 7.0, VMM: yes, VRAM: 16144 MiB
  Device 1: NVIDIA CMP 100-210, compute capability 7.0, VMM: yes, VRAM: 16145 MiB

model	size	params	backend	ngl	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	0	pp512	354.07 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	0	tg128	99.84 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	1	pp512	380.40 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	none	1	tg128	109.49 ± 0.12

build: bb28c1f (9281)

However, more is possible: setting a small uBatch size on this card absurdly increased PP speed for some models. In my testing, -ub 56 gave a maximum PP of ~1150 TPS, before dropping back to ~300 TPS at -ub 64. Similar performance gains can be seen on Llama 3.1, 3.3, 3.3-based finetunes, and Qwen Coder Next, but they are not universal. E.g., Qwen 3.6 (35B A3B and 27B) doesn't reproduce this and has to be run at the normal uBatch size of 1024 to get max performance. I don't have a real V100, so I have no idea whether this reproduces on that card as well.

llamacpp@llamacpp:~/llama.cpp/build/bin$ ./llama-bench -m /mnt/my/nas/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -ub 56 -sm none -mg 0

model	size	params	backend	ngl	n_ubatch	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	56	none	0	pp512	977.33 ± 0.83
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	56	none	0	tg128	99.95 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	56	none	1	pp512	1156.76 ± 6.05
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	56	none	1	tg128	109.46 ± 0.04

Full uBatch Stats

Full uBatch stats for lulz. PP only, as TG remained a constant ~109 TPS and would just clutter up the table.

llamacpp@llamacpp:~/llama.cpp/build/bin$ ./llama-bench -m /mnt/my/nas/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -ub 8,16,32,48,56,64,80,96,112,128,256,384,512,768,1024 -sm none -mg 0 -n 0

model	size	params	backend	ngl	n_ubatch	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8	none	0	pp512	404.51 ± 1.89
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8	none	1	pp512	430.36 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	16	none	0	pp512	609.71 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	16	none	1	pp512	692.44 ± 0.33
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	32	none	0	pp512	818.35 ± 0.54
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	32	none	1	pp512	979.11 ± 0.77
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	48	none	0	pp512	951.08 ± 0.70
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	48	none	1	pp512	1126.89 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	56	none	0	pp512	978.78 ± 1.42
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	56	none	1	pp512	1153.72 ± 2.67
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	none	0	pp512	280.64 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	none	1	pp512	294.68 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	80	none	0	pp512	252.74 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	80	none	1	pp512	264.66 ± 0.51
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	96	none	0	pp512	284.04 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	96	none	1	pp512	298.99 ± 0.19
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	112	none	0	pp512	284.93 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	112	none	1	pp512	299.82 ± 0.23
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	none	0	pp512	322.56 ± 0.17
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	none	1	pp512	342.00 ± 0.28
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	none	0	pp512	341.72 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	none	1	pp512	364.49 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	384	none	0	pp512	329.78 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	384	none	1	pp512	351.14 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	none	0	pp512	353.60 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	none	1	pp512	379.95 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	768	none	0	pp512	353.65 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	768	none	1	pp512	379.84 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	none	0	pp512	353.61 ± 0.17
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	none	1	pp512	361.64 ± 8.89
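
To pull the peak and the cliff out of that table, here is a small sketch over the rows above. The n_ubatch=512 row is used as the reference because it matches the first run, which did not pass -ub:

```python
# (n_ubatch, pp512 t/s with fa=0, pp512 t/s with fa=1) from the table above.
rows = [
    (8, 404.51, 430.36), (16, 609.71, 692.44), (32, 818.35, 979.11),
    (48, 951.08, 1126.89), (56, 978.78, 1153.72), (64, 280.64, 294.68),
    (80, 252.74, 264.66), (96, 284.04, 298.99), (112, 284.93, 299.82),
    (128, 322.56, 342.00), (256, 341.72, 364.49), (384, 329.78, 351.14),
    (512, 353.60, 379.95), (768, 353.65, 379.84), (1024, 353.61, 361.64),
]

# Best fa=1 result and how it compares with the n_ubatch=512 row.
best_ub, _, best_fa1 = max(rows, key=lambda r: r[2])
ref_fa1 = next(fa1 for ub, _, fa1 in rows if ub == 512)
print(f"best n_ubatch={best_ub}: {best_fa1:.2f} t/s, {best_fa1 / ref_fa1:.2f}x the n_ubatch=512 result")

# Largest relative drop between neighbouring ubatch sizes (fa=1).
cliff = min(zip(rows, rows[1:]), key=lambda pair: pair[1][2] / pair[0][2])
print(f"cliff between n_ubatch={cliff[0][0]} and {cliff[1][0]}")
```

With fa=1 the peak at n_ubatch=56 is about 3.04x the n_ubatch=512 result, and the sharpest drop sits between 56 and 64.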

3 replies

arabel1a Jun 14, 2026

Hi! Can you please check my patch that partly restores tg on cmp90hx/cmp170hx? Hope it will work for cmp210 too. I will be very grateful if you measure numbers on your cmp100-210 and post them here.

Hastwell Jun 14, 2026

Commented on suggestion since this concerns future changes rather than current state of LlamaCPP. TLDR, on the CMP100 I saw either no performance change or even a loss of performance (based on uBatch size). The benchmarks provided with #24616 show stronger performance improvements on other CMP models which are more heavily nerfed than mine; it's probably just not needed on the CMP100.

mattngaw Jun 14, 2026

The CMP100 (Volta) probably needs a different patch than the CMP90HX/CMP170HX (Ampere).
I actually own one as well and I'm messing around with it to see what perf can be recouped.

AlphaMo99 · 2026-05-26T15:40:50Z

AlphaMo99
May 26, 2026

And a small laptop Blackwell on Windows:

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 6112 MiB):
Device 0: NVIDIA RTX PRO 500 Blackwell Generation Laptop GPU, compute capability 12.0, VMM: yes, VRAM: 6112 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	1154.30 ± 85.27
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	47.72 ± 2.49
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	1440.48 ± 14.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	48.92 ± 0.04

build: 192d8ae (9334)

0 replies

rs38 · 2026-05-26T21:35:44Z

rs38
May 26, 2026

Device 0: NVIDIA GeForce RTX 5090 Laptop GPU, compute capability 12.0, VMM: yes, VRAM: 24462 MiB
CUDA 13.2
load_backend: loaded CUDA backend from D:\llama\ggml-cuda.dll
load_backend: loaded RPC backend from D:\llama\ggml-rpc.dll
load_backend: loaded CPU backend from D:\llama\ggml-cpu-alderlake.dll

Win11, 95W Power Mode

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	5056.95 ± 58.31
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	112.36 ± 8.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	5007.90 ± 72.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	117.16 ± 0.57

Win11, 175W Power Mode

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	6667.06 ± 123.40
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	156.49 ± 0.98
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	7641.89 ± 275.60
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	158.14 ± 4.70

build: 29f1482 (9253)
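
For a quick look at how the two power modes compare, here is a sketch using the fa=1 rows above. It treats the 95 W and 175 W labels as nominal settings, not measured GPU power draw:

```python
# fa=1 rows for the two Windows power modes above, in t/s.
modes = {
    95:  {"pp512": 5007.90, "tg128": 117.16},
    175: {"pp512": 7641.89, "tg128": 158.14},
}

power_gain = 175 / 95
gains = {test: modes[175][test] / modes[95][test] for test in ("pp512", "tg128")}
print(f"power mode: {power_gain:.2f}x")
for test, gain in gains.items():
    print(f"{test}: {gain:.2f}x")
```

Going from the 95 W to the 175 W mode (about 1.84x nominal) gives roughly 1.53x on pp512 and 1.35x on tg128.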

3 replies

AlphaMo99 May 30, 2026

Hello, what is the model of the laptop, please? I'm looking to buy one (Legion or Alienware...).

rs38 May 30, 2026

https://www.hp.com/de-de/shop/products/laptops/omen-max-gaming-laptop-16-ah0790ng-be6g0ea-abd (20% discount with WBW20 code)
Ultra9 275HX(2.7GHz)
• 16" 2.5K OLED (2.560 x 1.600)
• 64 GB (2 x 32.768 MB)
• SSD 2TB PCIe NVMe
• NVIDIA RTX5090 24GB

rs38 May 30, 2026

it must be plugged into AC to get full power, of course

rs38 · 2026-05-26T21:49:20Z

rs38
May 26, 2026

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	6472.95 ± 323.73
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp1024	5867.10 ± 144.73
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp2048	5315.75 ± 72.55
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp4096	4468.21 ± 4.59
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp8192	3350.46 ± 4.80
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp16384	2178.52 ± 14.08
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp32768	1247.68 ± 1.74
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	151.31 ± 0.72
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg256	150.57 ± 0.83
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg512	146.69 ± 0.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg1024	143.11 ± 0.45
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg2048	136.69 ± 0.27
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	7555.44 ± 167.06
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp1024	7096.57 ± 206.79
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp2048	6884.44 ± 50.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp4096	6222.59 ± 66.14
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp8192	5233.92 ± 26.65
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp16384	3894.57 ± 3.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp32768	2602.28 ± 7.31
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	153.87 ± 2.84
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg256	154.57 ± 0.61
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg512	152.08 ± 0.29
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg1024	147.90 ± 0.39
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg2048	140.23 ± 0.18
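
To make the trend in the table above easier to read, here is a minimal sketch that computes the FA-on / FA-off throughput ratio for each test. The numbers are copied from the table (the ± spreads are dropped), and the helper name `fa_speedup` is only illustrative, not part of llama-bench.

```python
# Mean throughput in t/s, copied from the llama-bench table above.
# The ± spreads are omitted, so the ratios are point estimates only.
no_fa = {
    "pp512": 6472.95, "pp1024": 5867.10, "pp2048": 5315.75,
    "pp4096": 4468.21, "pp8192": 3350.46, "pp16384": 2178.52,
    "pp32768": 1247.68,
    "tg128": 151.31, "tg256": 150.57, "tg512": 146.69,
    "tg1024": 143.11, "tg2048": 136.69,
}
fa = {
    "pp512": 7555.44, "pp1024": 7096.57, "pp2048": 6884.44,
    "pp4096": 6222.59, "pp8192": 5233.92, "pp16384": 3894.57,
    "pp32768": 2602.28,
    "tg128": 153.87, "tg256": 154.57, "tg512": 152.08,
    "tg1024": 147.90, "tg2048": 140.23,
}


def fa_speedup(baseline, with_fa):
    """Return the with_fa / baseline throughput ratio for every test in baseline."""
    return {test: with_fa[test] / baseline[test] for test in baseline}


speedup = fa_speedup(no_fa, fa)
for test, ratio in speedup.items():
    print(f"{test:>8}: {ratio:.2f}x")
```

On these numbers the prompt-processing ratio grows steadily with prompt length, from roughly 1.17x at pp512 to roughly 2.09x at pp32768, while token generation changes by only a few percent.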

0 replies

totaldev · 2026-06-11T06:27:10Z

totaldev
Jun 11, 2026

Server version of the Nvidia RTX PRO 6000 (300 W)
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97288 MiB):
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97288 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	12742.48 ± 285.22
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	260.42 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	14463.16 ± 488.65
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	289.30 ± 0.01

build: d2462f8 (1)

0 replies

odbguru · 2026-06-19T11:35:37Z

odbguru
Jun 19, 2026

GPU: Asus Dual GeForce RTX 5060 OC 8GB GDDR7 DLSS4 @ PCIE 5.0x8 (CUDA 13.3)
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 7704 MiB):
Device 0: NVIDIA GeForce RTX 5060, compute capability 12.0, VMM: yes, VRAM: 7704 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	3269.08 ± 45.26
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	96.70 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	3799.47 ± 30.82
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	100.61 ± 0.08

build: 3a3edc9 (9715)

0 replies

wise-king-sullyman · 2026-06-24T22:29:49Z

wise-king-sullyman
Jun 24, 2026

GPU: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes, VRAM: 4031 MiB
CUDA: 13.0

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	pp512	266.70 ± 0.50
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	0	tg128	19.06 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	268.09 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	20.27 ± 0.00

build: 894bb27 (9783)

Also, for fun, I ran my GTX 970, but it couldn't handle the Q4_0 quant (thanks, Nvidia, for your 3.5 + 0.5 GB VRAM), so I had to drop to Q3_K:
Device 0: NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes, VRAM: 4029 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q3_K - Small	2.75 GiB	6.74 B	CUDA	99	0	pp512	190.90 ± 0.22
llama 7B Q3_K - Small	2.75 GiB	6.74 B	CUDA	99	0	tg128	14.44 ± 0.04
llama 7B Q3_K - Small	2.75 GiB	6.74 B	CUDA	99	1	pp512	191.28 ± 0.17
llama 7B Q3_K - Small	2.75 GiB	6.74 B	CUDA	99	1	tg128	14.83 ± 0.01
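
As a rough back-of-the-envelope check of the "3.5 + 0.5" remark, the sketch below compares the two model sizes reported by llama-bench against an assumed 3.5 GiB fast memory segment. The 3.5 GiB figure and the `fits` helper are assumptions for illustration; the check looks at weights only and ignores the KV cache and compute buffers, so it is a lower bound on what is actually needed.

```python
MIB_PER_GIB = 1024

# Assumption: only the 3.5 GiB segment of the GTX 970 is treated as usable.
FAST_SEGMENT_MIB = 3.5 * MIB_PER_GIB
# Total VRAM as reported by ggml_cuda_init for the GTX 970 above.
TOTAL_VRAM_MIB = 4029


def fits(model_gib, budget_mib, overhead_mib=0):
    """True if the model weights plus an optional overhead fit in the budget."""
    return model_gib * MIB_PER_GIB + overhead_mib <= budget_mib


# Model sizes taken from the "size" column of the llama-bench tables.
for name, size_gib in (("Q4_0", 3.56), ("Q3_K - Small", 2.75)):
    print(name,
          "fast segment:", fits(size_gib, FAST_SEGMENT_MIB),
          "total VRAM:", fits(size_gib, TOTAL_VRAM_MIB))
```

Under these assumptions the Q4_0 weights alone already exceed the 3.5 GiB segment even though they are smaller than the reported total VRAM, while the Q3_K - Small weights fit with room to spare.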

0 replies
