Performance of llama.cpp on Intel GPU with SYCL backend #23313

arthw · 2026-05-19T03:45:41Z

Collaborator

Purpose

This thread is for sharing llama.cpp performance data on Intel GPUs with the SYCL backend.

The data is for reference only, since we do not double-check it.

It must not be used for any commercial purpose.

Rule

Please test with the default settings (environment variables).

If you want to add data from a special build or run configuration, please create a new table.
Create or update the tables directly, following the existing format.
Insert a new record instead of updating an existing one with the same keys; sort the records by col1, col2, col3.
Add your comments at the end of the thread for further discussion.
Don't add tables that compare with other hardware, frameworks, or backends.
Please run the test one or more times and report the stable data.
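
The insert-and-sort rule can be sketched in Python. This is only an illustration under stated assumptions: rows are kept as tab-separated lines, col1 to col3 are taken to be LLM, GPU and Host, the comparison is case-insensitive, and `insert_record` is a hypothetical helper, not part of llama.cpp.

```python
def sort_key(row: str) -> tuple:
    """Key on the first three tab-separated columns (assumed: LLM, GPU, Host)."""
    return tuple(col.strip().lower() for col in row.split("\t")[:3])


def insert_record(rows: list, new_row: str) -> list:
    """Append new_row (never overwrite an existing row), then re-sort."""
    return sorted(rows + [new_row], key=sort_key)


# Shortened, illustrative rows: LLM, GPU, Host, precision, FA
table = [
    "llama-2-7b.Q4_0\tB570x1\tRyzen5 5600X\tfp16\t0",
    "llama-2-7b.Q4_0\tArc770x1\ti7-13700K\tfp32\t0",
]
table = insert_record(table, "gemma4 E4B Q4_0\tB580 LEx1\tCore Ultra 5 250K\tfp16\t0")
for row in table:
    print(row.replace("\t", " | "))
```

Because `sorted` is stable, records that share the same three keys keep the order in which they were inserted.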

Performance data on Intel GPU

Default setting

Build:

# fp32
./examples/sycl/build.sh

# fp16: set -DGGML_SYCL_F16=ON in ./examples/sycl/build.sh, then run:
./examples/sycl/build.sh

Run:

# Choose the GPU(s) used in the test.
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
source /opt/intel/oneapi/setvars.sh
./build/bin/llama-bench -fa 0,1 -m ../models/llama-2-7b.Q4_0.gguf
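
llama-bench prints its results as a Markdown table, as the reports later in this thread show. For anyone transcribing results into the data table, here is a minimal Python sketch that extracts the t/s values from such output. `parse_bench_table` is a hypothetical helper name, and the sample rows are copied from a result posted in this thread.

```python
def parse_bench_table(text: str) -> list:
    """Turn llama-bench's Markdown table into a list of row dicts."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip log lines such as "register_backend: ..."
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table line holds the column names
        elif set(cells[0]) <= set("-: "):
            continue  # separator row made of dashes and colons
        else:
            row = dict(zip(header, cells))
            mean, _, std = row["t/s"].partition("±")
            row["t/s"] = float(mean)
            row["std"] = float(std) if std else None
            rows.append(row)
    return rows


sample = """
register_backend: registered backend SYCL (1 devices)
| model          |       size |     params | backend | ngl |  test |            t/s |
| -------------- | ---------: | ---------: | ------- | --: | ----: | -------------: |
| llama 7B Q4_0  |   3.56 GiB |     6.74 B | SYCL    |  99 | pp512 | 2763.84 ± 4.23 |
| llama 7B Q4_0  |   3.56 GiB |     6.74 B | SYCL    |  99 | tg128 |  105.47 ± 0.05 |
"""
results = {r["test"]: r["t/s"] for r in parse_bench_table(sample)}
print(results)
```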

Data:

LLM	GPU	Host	OS	Precision (fp32/fp16)	FA	pp512 t/s	tg128 t/s	Commit	Reporter	Date
gemma4 12B Q4_KM	B580 LEx1	Core Ultra 5 250K DDR5-6400 16GB	CachyOS 7.0.11	fp16	0	1320.02	41.34	-	@egeoz	2026/6/14
gemma4 12B Q4_KM	B580 LEx1	Core Ultra 5 250K DDR5-6400 16GB	CachyOS 7.0.11	fp16	1	1062.58	39.96	-	@egeoz	2026/6/14
gemma4 26B.A4B Q5_K - Medium	ARC 140T	Intel Ultra 255H 32GB	Linux	fp16	0	223.82	12.70	`d4c8e2c`	@jlionhan	2026/5/31
gemma4 26B.A4B Q5_K - Medium	ARC 140T	Intel Ultra 255H 32GB	Linux	fp16	1	166.46	13.70	`d4c8e2c`	@jlionhan	2026/5/31
gemma4 E4B Q4_0	B580 LEx1	Core Ultra 5 250K DDR5-6400 16GB	CachyOS 7.0.11	fp16	0	2763.42	82.30	-	@egeoz	2026/6/14
gemma4 E4B Q4_0	B580 LEx1	Core Ultra 5 250K DDR5-6400 16GB	CachyOS 7.0.11	fp16	1	1349.62	81.08	-	@egeoz	2026/6/14
llama-2-7b.Q4_0	Arc A380	Xeon E5-2695 V4	Ubuntu 24.04.4	fp16	0	495.16	23.84	`7c158fb`	@GerardoNevarez	2026/6/5
llama-2-7b.Q4_0	Arc A380	Xeon E5-2695 V4	Ubuntu 24.04.4	fp16	1	362.69	25.57	`7c158fb`	@GerardoNevarez	2026/6/5
llama-2-7b.Q4_0	Arc A380	Xeon E5-2695 V4	Ubuntu 24.04.4	fp32	0	226.11	23.81	`7c158fb`	@GerardoNevarez	2026/6/5
llama-2-7b.Q4_0	Arc A380	Xeon E5-2695 V4	Ubuntu 24.04.4	fp32	1	213.39	25.81	`7c158fb`	@GerardoNevarez	2026/6/5
llama-2-7b.Q4_0	Arc Graphics (iGPU)	Core Ultra 7 258V	Ubuntu 26.04	fp16	0	535.62	24.61	`6471e3c`	@twoplan	2026/6/12
llama-2-7b.Q4_0	Arc Graphics (iGPU)	Core Ultra 7 258V	Ubuntu 26.04	fp16	1	245.96	25.26	`6471e3c`	@twoplan	2026/6/12
llama-2-7b.Q4_0	Arc770x1	i7-13700K 64GB	Ubuntu 24.04.4	fp32	0	937.24	59.03	`053e01d`	@arthw	2026/5/19
llama-2-7b.Q4_0	Arc770x1	i7-13700K 64GB	Ubuntu 24.04.4	fp32	1	706.72	67.09	`19e92c3`	@arthw	2026/5/29
llama-2-7b.Q4_0	Arc770x1	i5-14600k	cachyOS	fp32	0	894.44	55.53	`5306f4b`	@digitalscream	2026/5/22
llama-2-7b.Q4_0	Arc770x1	i5-14600k	cachyOS	fp32	1	666.89	64.49	`5306f4b`	@digitalscream	2026/5/22
llama-2-7b.Q4_0	B570x1	Ryzen5 5600X DDR4-3600 128GB	Ubuntu 24.04.4 6.17.0-29-generic	fp16	0	1355.95	72.52	`5aba536`	@yqYo1	2026/6/1
llama-2-7b.Q4_0	B570x1	Ryzen5 5600X DDR4-3600 128GB	Ubuntu 24.04.4 6.17.0-29-generic	fp16	1	685.89	76.72	`5aba536`	@yqYo1	2026/6/1
llama-2-7b.Q4_0	B570x1	Ryzen5 5600X DDR4-3600 128GB	Ubuntu 24.04.4 6.17.0-29-generic	fp32	0	388.77	72.48	`5aba536`	@yqYo1	2026/6/1
llama-2-7b.Q4_0	B570x1	Ryzen5 5600X DDR4-3600 128GB	Ubuntu 24.04.4 6.17.0-29-generic	fp32	1	412.46	76.76	`5aba536`	@yqYo1	2026/6/1
llama-2-7b.Q4_0	B580x1	Ryzen7 5700X3D	Ubuntu 25.10	fp16	0	2063.52	73.76	`c6e4088`	@bedovyy	2026/5/27
llama-2-7b.Q4_0	B580 LEx1	Core Ultra 5 250K DDR5-6400 16GB	CachyOS 7.0.11	fp16	0	1954.15	83.67	-	@egeoz	2026/6/14
llama-2-7b.Q4_0	B580 LEx1	Core Ultra 5 250K DDR5-6400 16GB	CachyOS 7.0.11	fp16	1	893.17	88.13	-	@egeoz	2026/6/14
llama-2-7b.Q4_0	B580x2	Ryzen7 5700X3D	Ubuntu 25.10	fp16	0	1721.91	65.67	`c6e4088`	@bedovyy	2026/5/27
llama-2-7b.Q4_0	B70x1	EPYC8124P	Linux	fp16	0	2763.84	105.47	`dbe7901`	@FCLC	2026/5/21
llama-2-7b.Q4_0	B70x1	EPYC8124P	Linux	fp32	0	928.65	106.48	`6a257d4`	@FCLC	2026/5/21
llama-2-7b.Q4_0	B70x2	EPYC8124P	Linux	fp16	0	2683.19	103.70	`dbe7901`	@FCLC	2026/5/21
llama-2-7b.Q4_0	B70x2	EPYC8124P	Linux	fp32	0	926.58	104.29	`6a257d4`	@FCLC	2026/5/21
llama-2-7b.Q4_0	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp16	1	652.38	19.20	`d5ab083`	@bobguns	2026/6/2
llama-2-7b.Q4_0	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp32	1	337.13	18.96	`d5ab083`	@bobguns	2026/6/2
Llama-3.2-1B-Instruct-Q4_0	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp16	1	3433.03	88.69	`d5ab083`	@bobguns	2026/6/2
Llama-3.2-1B-Instruct-Q4_0	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp32	1	1981.17	89.39	`d5ab083`	@bobguns	2026/6/2
Qwen2.5-7B-Instruct-Q4_0	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp32	0		17.55	`55ac090`	@bobguns	2026/6/1
Qwen2.5-coder-7b-instruct-q4_0	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp16	1	739.80	17.44	`d5ab083`	@bobguns	2026/6/2
Qwen2.5-coder-7b-instruct-q4_0	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp32	1	331.78	17.17	`d5ab083`	@bobguns	2026/6/2
Qwen3-Coder-Next-Q4_K_M	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp16	0	212.68	18.45	`55ac090`	@bobguns	2026/6/2
Qwen3-Coder-Next-Q4_K_M	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp32	0	104.59	17.53	`55ac090`	@bobguns	2026/6/1
Qwen3.6-27B-Q4_0	B70x1	EPYC8124P	Linux	fp32	0	721.08	26.09	`6a257d4`	@FCLC	2026/5/21
Qwen3.6-27B-Q4_K_M	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp16	0	159.96	3.55	`d5ab083`	@bobguns	2026/6/2
Qwen3.6-27B-Q4_K_M	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp32	1	84.80	4.31	`d5ab083`	@bobguns	2026/6/2
Qwen3.6-27B-Q8_0	B70x1	EPYC8124P	Linux	fp32	0	833.93	15.48	`6a257d4`	@FCLC	2026/5/21

More PP/TG Types:

LLM	GPU	Host	OS	Precision (fp32/fp16)	FA	pp1024 t/s	pp4096 t/s	tg128 t/s	tg512 t/s	Commit	Reporter	Date
Qwen2.5-7B-Instruct-Q4_0	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp32	0	305.08	269.69	17.55	17.50	`55ac090`	@bobguns	2026/6/1
Qwen3-Coder-Next-Q4_K_M	iGPU Xe3 (12 EUs)	Intel Panther Lake	Ubuntu 26.04	fp32	0	166.85	172.98	17.53	17.32	`55ac090`	@bobguns	2026/6/1

FCLC · 2026-05-21T00:09:54Z

compiled with cmake -B build-sycl -DGGML_SYCL=ON -DGGML_SYCL_F16=ON -DGGML_SYCL_TARGET=INTEL -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_FLAGS="-march=znver4" -DCMAKE_CXX_FLAGS="-march=znver4" -DCMAKE_BUILD_TYPE=Release && cmake --build build-sycl --config Release -j 16

single b70

ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build-sycl/bin/llama-bench -m ../models/llama27/llama-2-7b.Q4_0.gguf
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (1 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-sycl.so
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       2763.84 ± 4.23 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |        105.47 ± 0.05 |

build: dbe7901ca (9147)

dual b70:

~/Developement/llama.cpp$ ./build-sycl/bin/llama-bench -m ../models/llama27/llama-2-7b.Q4_0.gguf
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (2 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_device: registered device SYCL1 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-sycl.so
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      2683.19 ± 10.32 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |        103.70 ± 0.27 |

build: dbe7901ca (9147)
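
To judge whether the single-GPU and dual-GPU pp512 numbers above differ by more than run-to-run noise, one rough check is to express the gap in units of the combined spread. The Python sketch below assumes the ± figure is a standard deviation over llama-bench's repeated runs and that the two runs are independent; `gap_in_sigmas` is a hypothetical helper.

```python
import math


def parse_mean_std(cell: str) -> tuple:
    """Split a 'mean ± std' cell into two floats."""
    mean, _, std = cell.partition("±")
    return float(mean), float(std)


def gap_in_sigmas(a: str, b: str) -> float:
    """Difference of the two means divided by their combined std."""
    (mean_a, std_a), (mean_b, std_b) = parse_mean_std(a), parse_mean_std(b)
    return (mean_a - mean_b) / math.hypot(std_a, std_b)


single, dual = "2763.84 ± 4.23", "2683.19 ± 10.32"  # pp512, fp16 build
ratio = parse_mean_std(dual)[0] / parse_mean_std(single)[0]
gap = gap_in_sigmas(single, dual)
print(f"dual/single pp512: {ratio:.3f}")
print(f"gap: {gap:.1f} combined standard deviations")
```

A gap of several combined standard deviations suggests the dual-GPU slowdown is not just noise, although a handful of repetitions is a small sample.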

FCLC · 2026-05-21T00:38:14Z

Compiling and running with F16=OFF instead:

cmake -B build-sycl -DGGML_SYCL=ON -DGGML_SYCL_F16=OFF -DGGML_SYCL_TARGET=INTEL -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_FLAGS="-march=znver4" -DCMAKE_CXX_FLAGS="-march=znver4" -DCMAKE_BUILD_TYPE=Release && cmake --build build-sycl --config Release -j 16

Single B70:

ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build-sycl/bin/llama-bench -m ../models/llama27/llama-2-7b.Q4_0.gguf
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (1 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-sycl.so
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |        928.65 ± 0.69 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |        106.48 ± 0.05 |

build: 6a257d446 (9263)

Dual B70:

ONEAPI_DEVICE_SELECTOR="level_zero:0,1" ./build-sycl/bin/llama-bench -m ../models/llama27/llama-2-7b.Q4_0.gguf
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (2 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_device: registered device SYCL1 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-sycl.so
load_backend: failed to find ggml_backend_init in /home/lea/Developement/llama.cpp/build-sycl/bin/libggml-cpu.so
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       926.58 ± 14.44 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |        104.29 ± 0.15 |

build: 6a257d446 (9263)

FCLC · 2026-05-21T00:52:30Z

And with a much more interesting model, namely Qwen 3.6 27B:

q4

 ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build-sycl/bin/llama-bench -m ../models/qwen36_27b/Qwen3.6-27B-Q4_0.gguf 
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (1 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_0                |  14.70 GiB |    26.90 B | SYCL       |  99 |           pp512 |        721.08 ± 0.73 |
| qwen35 27B Q4_0                |  14.70 GiB |    26.90 B | SYCL       |  99 |           tg128 |         26.09 ± 0.03 |

build: 6a257d446 (9263)

and q8:

ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build-sycl/bin/llama-bench -m ../models/qwen36_27b/Qwen3.6-27B-Q8_0.gguf 
warning: asserts enabled, performance may be affected
register_backend: registered backend SYCL (1 devices)
register_device: registered device SYCL0 (Intel(R) Graphics [0xe223])
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD EPYC 8124P 16-Core Processor)
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       |  99 |           pp512 |        833.93 ± 1.96 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       |  99 |           tg128 |         15.48 ± 0.00 |

1 reply

arthw May 21, 2026
Collaborator Author

@FCLC
I updated the first post with the data from your comments.

Thank you!

digitalscream · 2026-05-21T16:14:16Z

Ooft. A770 16GB, i5 14600k, current cachyOS.

fp16:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      1526.53 ± 31.96 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         27.42 ± 0.04 |

fp32:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      1515.53 ± 16.97 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         27.33 ± 0.21 |

Very weird, compared with the one in the table - much better prefill, half the decode performance.

9 replies

NeoZhangJianyu May 22, 2026

Judging by the change in performance, I guess Flash Attention is enabled in the second case.

34 tokens/s is very similar to my old test result, which was affected by the driver.

Could you run:

lspci -nnk | grep -i vga -A3
00:02.0 VGA compatible controller [0300]: Intel Corporation Arrow Lake-U [Intel Graphics] [8086:7d67] (rev 06)
	DeviceName: Onboard - Video
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:d000]
	Kernel driver in use: i915
--
03:00.0 VGA compatible controller [0300]: Intel Corporation Device [8086:e211]
	Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device [1ef7:2542]
	Kernel driver in use: xe
	Kernel modules: xe

digitalscream May 22, 2026

Interesting. Just tried it with the FA switch:

lj@seraph:~/bin/llama.cpp> ./bin/llama-bench -m ../../llm/llama-2-7b.Q4_0.gguf -fa 0,1
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  0 |           pp512 |        912.36 ± 3.15 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  0 |           tg128 |         34.47 ± 0.15 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  1 |           pp512 |        692.82 ± 2.47 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  1 |           tg128 |         39.09 ± 0.09 |

So...I do get a bit of a boost from FA on! I don't know what's happened with the performance from recompiling it, it's just fallen off a cliff.

The lspci result is:

lj@seraph:~> lspci -nnk | grep -i vga -A3
03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A770] [8086:56a0] (rev 08)
	Subsystem: Sparkle Computer Co., Ltd. Device [172f:4134]
	Kernel driver in use: i915
	Kernel modules: i915, xe

Could it be the i915 driver that's the problem?

NeoZhangJianyu May 22, 2026

Yes, this is the root cause.
You need to keep only one of i915 or xe to get better performance.

FA's impact is still smaller than I expected.
That may be caused by the driver too.

Please remove one of them.
i915 has better performance, but it is not recommended because it is the older driver.

Removing a GPU driver is risky.
Don't do it remotely.
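
A quick way to spot this situation is to scan the `lspci -nnk` output for a GPU whose "Kernel modules" line lists both i915 and xe. The Python sketch below is only an illustration: `devices_with_both_drivers` is a hypothetical helper, and the sample text is the A770 output posted above.

```python
def devices_with_both_drivers(lspci_text: str) -> list:
    """Return VGA device lines whose 'Kernel modules' list has i915 and xe."""
    flagged, current = [], None
    for line in lspci_text.splitlines():
        if "VGA compatible controller" in line:
            current = line.strip()  # remember which device we are in
        elif "Kernel modules:" in line and current:
            modules = {m.strip() for m in line.split(":", 1)[1].split(",")}
            if {"i915", "xe"} <= modules:
                flagged.append(current)
    return flagged


sample = """\
03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A770] [8086:56a0] (rev 08)
\tSubsystem: Sparkle Computer Co., Ltd. Device [172f:4134]
\tKernel driver in use: i915
\tKernel modules: i915, xe
"""
flagged = devices_with_both_drivers(sample)
for device in flagged:
    print("both i915 and xe available for:", device)
```

In practice the text would come from running `lspci -nnk` on the test machine; the sketch only reads it and changes nothing on the system.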

digitalscream May 22, 2026

Aha...winning!

MESA: warning: Support for this platform is experimental with Xe KMD, bug reports may be ignored.
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  0 |           pp512 |        894.44 ± 4.69 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  0 |           tg128 |         55.53 ± 0.16 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  1 |           pp512 |        666.89 ± 1.44 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |  1 |           tg128 |         64.49 ± 0.29 |

Mildly frustrating that I can't reproduce that nice PP result now, though.

NeoZhangJianyu May 22, 2026

It's a known issue. We are looking into it.

It's great to see better performance in your test.
I will update it in the table.

Thank you!

bedovyy · 2026-05-27T16:20:57Z

B580, AMD Ryzen 7 5700X3D, Ubuntu 25.10

built with fp16 in ./examples/sycl/build.sh (applied #23612)

./build/bin/llama-ls-sycl-device
Found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc B580 Graphics|   20.1|    160|    1024|   32| 12168M|            1.6.34666|
| 1| [level_zero:gpu:1]|                Intel Arc B580 Graphics|   20.1|    160|    1024|   32| 12168M|            1.6.34666|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
| 1| [level_zero:gpu:1]|      Y|

### 2xB580
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-bench -m ../models/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |      1721.91 ± 35.90 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         65.67 ± 0.46 |

build: c6e408837 (9368)

### 1xB580
ONEAPI_DEVICE_SELECTOR="level_zero:0" ./build/bin/llama-bench -m ../models/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           pp512 |       2063.52 ± 3.52 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  99 |           tg128 |         73.76 ± 0.13 |

build: c6e408837 (9368)

1 reply

arthw May 28, 2026
Collaborator Author

Updated per your feedback!
Thank you!

thordarsen · 2026-05-28T20:05:12Z

Intel Arc Pro B50, Intel i7-8700 32GB RAM
Recent changes (build 9397 is from today, 9298 from less than a week ago) seem to have badly lowered PP on SYCL, which was already lacking compared to Vulkan. TG can drastically outpace Vulkan, though, so at this point I have to start thinking about whether prefill or decode matters more for what I'm doing.

build: c0c7e14 (9298)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	pp512	424.50 ± 1.34
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	tg128	45.92 ± 0.09

build: 2f6c815 (9397)

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	pp512	397.09 ± 1.02
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	tg128	45.74 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	pp512	397.73 ± 0.23
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	tg128	47.80 ± 0.03
------------------------------	---------:	---------:	----------	--:	-:	--------------:	-------------------:
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	590.01 ± 0.93
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	40.13 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	581.78 ± 1.61
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	41.82 ± 0.08

Command line arguments:
./llama-bench -r 7 -hf TheBloke/Llama-2-7B-GGUF:Q4_0 -fa 0,1

Build options:
cmake .. -B build -DGGML_SYCL=ON -DGGML_RPC=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_DEVICE_ARCH=bmg-g21

cmake .. -B build -DGGML_VULKAN=1 -DGGML_RPC=ON

Nothing else changed between these runs: I tested my old version, ran "git pull", rebuilt, and retested.

12 replies

thordarsen Jun 1, 2026

Yes - the Intel driver software shows Resizable BAR is active.

I know that PCIe3 isn't ideal, but based on my Vulkan results, I don't think it's the primary culprit

thordarsen Jun 1, 2026

OK, setting -DGGML_SYCL_F16=ON made a big difference:

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	pp512	1174.51 ± 2.33
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	tg128	45.81 ± 0.11
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	pp512	602.57 ± 1.31
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	tg128	48.03 ± 0.07

arthw Jun 2, 2026
Collaborator Author

@thordarsen
Yes, fp16 gets better performance than fp32 in some cases.

Because your test results differ vastly from the common results we know of, I suggest not adding them to the table.
But they are a good reference for the case of an older PC with a B580.

What do you think?

Thank you!

thordarsen Jun 3, 2026

It's an Arc Pro B50, and looking at the relative specs on Intel Ark, I'm kinda in line with the B570 and B580.

From the data above for FP16:

Model	Int TOPs	Mem Bandwidth	PP	TG	PP/TOPs	TG/BW
Pro B50	170	224GB/s	1174	45.8	6.9	0.20
B570	203	380GB/s	1376	72.5	6.8	0.19
B580	233	456GB/s	2063	73.8	8.85	0.16
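
The two normalized columns can be reproduced with a few lines of Python. The numbers are copied from the table above as posted; `normalize` is a hypothetical helper, and the sketch only checks the arithmetic, not the spec figures themselves.

```python
def normalize(tops, bandwidth_gbs, pp, tg):
    """Return (PP per TOP, TG per GB/s of memory bandwidth)."""
    return pp / tops, tg / bandwidth_gbs


cards = {
    # name: (Int TOPs, memory bandwidth in GB/s, pp512 t/s, tg128 t/s)
    "Pro B50": (170, 224, 1174, 45.8),
    "B570": (203, 380, 1376, 72.5),
    "B580": (233, 456, 2063, 73.8),
}
ratios = {name: normalize(*spec) for name, spec in cards.items()}
for name, (pp_per_top, tg_per_bw) in ratios.items():
    print(f"{name:8s} PP/TOPs={pp_per_top:.2f}  TG/BW={tg_per_bw:.2f}")
```

Dividing by the spec-sheet figure gives a rough efficiency measure, which makes cards with different power and bandwidth budgets easier to compare.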

( I purchased the B50 for the 16GB VRAM and 70W draw )

NeoZhangJianyu Jun 3, 2026

I don't have a B50.
It is a low-power card (70 W) compared to the B580 (190 W).
More tests in other environments are needed.

jlionhan · 2026-05-31T11:30:48Z

I hope it helps.

255H, ARC 140T, 32GB RAM

model	size	params	backend	ngl	fa	test	t/s
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	-1	0	pp512	223.82 ± 5.05
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	-1	0	tg128	12.70 ± 0.18
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	-1	1	pp512	166.46 ± 3.61
gemma4 26B.A4B Q5_K - Medium	17.80 GiB	25.23 B	SYCL	-1	1	tg128	13.70 ± 0.15

build: d4c8e2c (9442)

cmake --fresh -B build -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL=1 -DBUILD_SHARED_LIBS=0 -DGGML_SYCL_F16=1

-- Using oneAPI Release SYCL compiler (icpx).
-- SYCL found
-- SYCL Compiler version: 20260000
-- SYCL_INCLUDE_DIR: /opt/intel/oneapi/compiler/2026.0/include
-- SYCL_LIBRARY=/opt/intel/oneapi/compiler/2026.0/lib/libsycl.so
-- Found IntelSYCL: /opt/intel/oneapi/compiler/2026.0/include (found version "202012")
-- GGML_SYCL_SUPPORT_LEVEL_ZERO ON
-- Level Zero loader found: /lib/libze_loader.so
-- Level Zero headers found: /usr/include
-- Found oneDNN: /opt/intel/oneapi/dnnl/2026.0/lib/libdnnl.so.3.11

0.00.794.752 I Build with Macros:
0.00.794.757 I   GGML_SYCL_FORCE_MMQ: no
0.00.794.757 I   GGML_SYCL_F16: yes
0.00.794.757 I   GGML_SYCL_GRAPH: yes
0.00.794.758 I   GGML_SYCL_DNNL: yes
0.00.794.758 I   GGML_SYCL_SUPPORT_LEVEL_ZERO: yes
0.00.794.759 I   GGML_SYCL_USE_VMM: yes
0.00.794.759 I Running with Environment Variables:
0.00.794.760 I   GGML_SYCL_DEBUG: 0
0.00.794.760 I   GGML_SYCL_DISABLE_OPT: 0
0.00.794.761 I   GGML_SYCL_DISABLE_GRAPH: 1
0.00.794.761 I   GGML_SYCL_ENABLE_LEVEL_ZERO: 1
0.00.794.761 I   GGML_SYCL_DISABLE_DNN: 0
0.00.794.762 I   GGML_SYCL_ENABLE_VMM: 1
0.00.794.762 I   GGML_SYCL_PRIORITIZE_DMMV: 0
0.00.794.764 I   GGML_SYCL_USE_ASYNC_MEM_OP: 1
0.00.794.764 I   GGML_SYCL_ENABLE_FLASH_ATTN: 1
0.00.794.767 I Found 1 SYCL devices:
0.00.794.768 I |  |                   |                                       |       |Max    |        |Max  |Global |                     |
0.00.794.768 I |  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
0.00.794.769 I |ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
0.00.794.769 I |--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
0.00.794.938 I | 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.74|    128|    1024|   32| 30588M|           1.15.38308|
0.00.794.938 I SYCL Optimization Feature:
0.00.794.939 I |ID|        Device Type|Reorder|
0.00.794.939 I |--|-------------------|-------|
0.00.794.941 I | 0| [level_zero:gpu:0]|      Y|
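
The startup log above already states whether the binary was built with fp16, which is what the precision column of the data table needs. As a small illustration, the Python sketch below pulls the `GGML_SYCL_*` values out of such a log. The line format is assumed from the log in this comment, and `parse_sycl_settings` is a hypothetical helper.

```python
import re

# Matches lines such as "0.00.794.757 I   GGML_SYCL_F16: yes"
SETTING = re.compile(r"^\S+\s+I\s+(GGML_SYCL_\w+):\s*(\S+)$")


def parse_sycl_settings(log: str) -> dict:
    """Collect GGML_SYCL_* names and values from a startup log."""
    settings = {}
    for line in log.splitlines():
        match = SETTING.match(line.strip())
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


log = """\
0.00.794.752 I Build with Macros:
0.00.794.757 I   GGML_SYCL_F16: yes
0.00.794.759 I Running with Environment Variables:
0.00.794.761 I   GGML_SYCL_DISABLE_GRAPH: 1
0.00.794.764 I   GGML_SYCL_ENABLE_FLASH_ATTN: 1
"""
settings = parse_sycl_settings(log)
precision = "fp16" if settings.get("GGML_SYCL_F16") == "yes" else "fp32"
print(precision, settings)
```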

lspci -nnk | grep -i vga -A3
00:02.0 VGA compatible controller [0300]: Intel Corporation Arrow Lake-P [Arc Pro 130T/140T] [8086:7d51] (rev 03)
        DeviceName: Onboard IGD
        Subsystem: Hewlett-Packard Company Device [103c:8dea]
        Kernel driver in use: xe

1 reply

arthw Jun 1, 2026
Collaborator Author

@jlionhan
Updated them in the table!

Thank you!

bobguns · 2026-06-01T14:06:05Z

~/llama.cpp$ cmake -B build/ReleaseOV -G Ninja -DCMAKE_BUILD_TYPE=Release -DGGML_OPENVINO=ON
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Including OPENVINO backend
-- ggml version: 0.13.1
-- ggml commit: 55ac090
-- OpenSSL found: 3.5.5
-- Generating embedded license file for target: llama-app
-- Configuring done (0.2s)
-- Generating done (0.1s)
-- Build files have been written to: /home/llama.cpp/build/ReleaseOV

:~/llama.cpp$ GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-bench -m ~/models/Qwen2.5-7B-Instruct-Q4_0.gguf -fa 1 -p 1024,4096 -n 128,512
OpenVINO: using device GPU

model	size	params	backend	ngl	fa	test	t/s
qwen2 7B Q4_0	4.13 GiB	7.62 B	OPENVINO	-1	1	pp1024	2988.73 ± 8.18
qwen2 7B Q4_0	4.13 GiB	7.62 B	OPENVINO	-1	1	pp4096	2551.74 ± 6.83
qwen2 7B Q4_0	4.13 GiB	7.62 B	OPENVINO	-1	1	tg128	17.51 ± 0.08
qwen2 7B Q4_0	4.13 GiB	7.62 B	OPENVINO	-1	1	tg512	16.80 ± 0.02

specs https://www.asrockind.com/en-gb/NUC%20BOX-358H
Crucial 5600 SO-DIMMS

bobguns · 2026-06-01T15:13:39Z

📊 Intel Panther Lake Xe3 iGPU (12 EU) Benchmark Matrix: OpenVINO vs. Vulkan vs. SYCL

Benchmarking sweep across all three major acceleration backends available in llama.cpp for Intel hardware: OpenVINO, Vulkan, and SYCL (oneAPI).

Environment

Hardware: Intel Panther Lake mobile processor with Xe3 Graphics (12 Execution Units)
OS: Ubuntu Linux 26.04
Memory Architecture: Unified Memory Architecture (UMA) sharing system RAM directly with the iGPU.
llama.cpp Build: 55ac0909e (9458)

Models Tested

Qwen2.5-7B-Instruct-Q4_0 (Dense, 4.13 GiB)
Qwen3-Coder-Next-Q4_K_M (80B MoE, 45.15 GiB). Note: relies on UMA to hold the entire 45 GB model in system-shared VRAM.

📈 Performance Summary Matrix

Model	Backend	Prompt Processing 1024 (t/s)	Prompt Processing 4096 (t/s)	Token Gen 128 (t/s)	Token Gen 512 (t/s)	Status / Observation
Qwen 2.5 7B (Dense)	OpenVINO	2988.73	2551.74	17.51	16.80	Champion prompt processing
	Vulkan	769.85	502.20	14.62	14.55	Strong ingestion, trailing generation
	SYCL	305.08	269.69	17.55	17.50	Maximum generation throughput
Qwen3 80B (MoE)	OpenVINO	—	—	—	—	CRASH (`CPY` memory layout bug)
	Vulkan	341.48	295.77	16.39	16.37	Fully stable, superior ingestion
	SYCL	166.85	172.98	17.53	17.32	Fully stable, maximum generation

🛠️ Deep-Dive Analysis

The OpenVINO Qwen3 MoE Crash
OpenVINO panics and drops a core dump immediately when attempting to initialize the new Qwen3-Coder-Next model.

Error: ggml-backend.cpp:898: pre-allocated tensor (cache_r_l0 (view) (copy of )) in a buffer (OPENVINO0) that cannot run the operation (CPY)
Root Cause: This is an upstream translation issue. The OpenVINO backend does not yet correctly compute memory copying operations (CPY) over the complex memory layouts required by the Gated DeltaNet layers unique to the new Qwen3 architecture.

SYCL vs. Vulkan on the 80B MoE Architecture
Both SYCL and Vulkan load and run the model without hitting the crash, so both handle the large UMA allocation:

Token Generation: SYCL wins by about 6-7% (17.53 vs 16.39 t/s at tg128, 17.32 vs 16.37 t/s at tg512). A plausible explanation is that SYCL, which targets Intel's low-level Level Zero driver layer directly, passes the active experts through the matrix engines with less scheduling overhead.
Prompt Processing: Vulkan beats SYCL by about 2x at pp1024 (341.48 t/s vs 166.85 t/s) and about 1.7x at pp4096 (295.77 t/s vs 172.98 t/s). The current SYCL implementation presumably lacks the cache-tiling optimizations needed to parallelize large prompt batches across hybrid MoE weights.
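
The relative differences quoted above can be re-derived from the summary matrix. The Python sketch below uses the Qwen3-Coder-Next numbers exactly as reported there; the dictionary layout and variable names are illustrative.

```python
# (SYCL t/s, Vulkan t/s) per test, from the summary matrix
results = {
    "pp1024": (166.85, 341.48),
    "pp4096": (172.98, 295.77),
    "tg128": (17.53, 16.39),
    "tg512": (17.32, 16.37),
}

# SYCL throughput relative to Vulkan, in percent (positive = SYCL faster)
sycl_vs_vulkan = {
    test: (sycl / vulkan - 1) * 100 for test, (sycl, vulkan) in results.items()
}
for test, diff in sycl_vs_vulkan.items():
    print(f"{test}: SYCL {diff:+.1f}% vs Vulkan")

# Vulkan's prompt-processing advantage as a plain ratio
vulkan_pp_ratio = {
    test: vulkan / sycl
    for test, (sycl, vulkan) in results.items()
    if test.startswith("pp")
}
print({test: round(r, 2) for test, r in vulkan_pp_ratio.items()})
```

The gap depends on the test size: Vulkan's prompt-processing lead shrinks at pp4096, and SYCL's token-generation lead is a little smaller at tg512 than at tg128.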

📋 Raw Build & Execution Logs

1. Qwen3 80B MoE — OpenVINO Crash Log

:~/llama.cpp$ GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-bench -m ~/models/Qwen3-Coder-Next-Q4_K_M.gguf -fa 1 -p 1024,4096 -n 128,512
OpenVINO: using device GPU
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
/home/llama.cpp/ggml/src/ggml-backend.cpp:898: pre-allocated tensor (cache_r_l0 (view) (copy of )) in a buffer (OPENVINO0) that cannot run the operation (CPY)

#0  0x000070d0479484f3 in ggml_print_backtrace () from /home/llama.cpp/build/ReleaseOV/bin/libggml-base.so.0
#1  0x000070d0479486a6 in ggml_abort () from /home/llama.cpp/build/ReleaseOV/bin/libggml-base.so.0
#2  0x000070d04796170c in ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*) () from /home/llama.cpp/build/ReleaseOV/bin/libggml-base.so.0
#3  0x000070d04796375f in ggml_backend_sched_split_graph () from /home/llama.cpp/build/ReleaseOV/bin/libggml-base.so.0
Aborted (core dumped)

Specs: https://www.asrockind.com/en-gb/NUC%20BOX-358H
Memory: Crucial 5600 SO-DIMMs

4 replies

arthw Jun 2, 2026
Collaborator Author

@bobguns
Per the rules, we don't compare with other backends or non-Intel GPUs here.
So I only added the SYCL backend data to the table.

Thank you for sharing!

bobguns Jun 2, 2026

Okay, I understand. I made some new benchmarks with 512 input tokens.

Hardware & Environment

OS: Ubuntu Linux
GPU: Intel Panther Lake Xe3 iGPU (12 EU, Unified Memory)
llama.cpp Build: d5ab0834a (9476)
Compiler: icpx (IntelLLVM 2026.0.0) linked against oneDNN and MKL
Target: level_zero:gpu

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: llama-2-7b.Q4_0.gguf

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	337.13 ± 1.13
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	18.96 ± 0.22

build: d5ab083 (9476)

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: Llama-3.2-1B-Instruct-Q4_0.gguf

model	size	params	backend	ngl	fa	test	t/s
llama 1B Q4_0	729.75 MiB	1.24 B	SYCL	-1	1	pp512	1981.17 ± 5.14
llama 1B Q4_0	729.75 MiB	1.24 B	SYCL	-1	1	tg128	89.39 ± 0.23

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: qwen2.5-coder-7b-instruct-q4_0.gguf

model	size	params	backend	ngl	fa	test	t/s
qwen2 7B Q4_0	4.12 GiB	7.62 B	SYCL	-1	1	pp512	331.78 ± 0.90
qwen2 7B Q4_0	4.12 GiB	7.62 B	SYCL	-1	1	tg128	17.17 ± 0.23

build: d5ab083 (9476)

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: Qwen3.6-27B-Q4_K_M.gguf

model	size	params	backend	ngl	fa	test	t/s
qwen35 27B Q4_K - Medium	15.65 GiB	26.90 B	SYCL	-1	1	pp512	84.80 ± 0.09
qwen35 27B Q4_K - Medium	15.65 GiB	26.90 B	SYCL	-1	1	tg128	4.31 ± 0.00

build: d5ab083 (9476)

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: Qwen3-Coder-Next-Q4_K_M.gguf

model	size	params	backend	ngl	fa	test	t/s
qwen3next 80B.A3B Q4_K - Medium	45.15 GiB	79.67 B	SYCL	-1	1	pp512	104.59 ± 0.99
qwen3next 80B.A3B Q4_K - Medium	45.15 GiB	79.67 B	SYCL	-1	1	tg128	7.40 ± 0.03

build: d5ab083 (9476)

bobguns Jun 2, 2026

16 bit

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: llama-2-7b.Q4_0.gguf

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	652.38 ± 4.71
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	19.20 ± 0.05

build: a468b89 (9477)

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: Llama-3.2-1B-Instruct-Q4_0.gguf

model	size	params	backend	ngl	fa	test	t/s
llama 1B Q4_0	729.75 MiB	1.24 B	SYCL	-1	1	pp512	3433.03 ± 6.47
llama 1B Q4_0	729.75 MiB	1.24 B	SYCL	-1	1	tg128	88.69 ± 0.70

build: a468b89 (9477)

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: qwen2.5-coder-7b-instruct-q4_0.gguf

model	size	params	backend	ngl	fa	test	t/s
qwen2 7B Q4_0	4.12 GiB	7.62 B	SYCL	-1	1	pp512	739.80 ± 7.25
qwen2 7B Q4_0	4.12 GiB	7.62 B	SYCL	-1	1	tg128	17.44 ± 0.04

build: a468b89 (9477)

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: Qwen3.6-27B-Q4_K_M.gguf

model	size	params	backend	ngl	fa	test	t/s
qwen35 27B Q4_K - Medium	15.65 GiB	26.90 B	SYCL	-1	1	pp512	159.96 ± 0.29
qwen35 27B Q4_K - Medium	15.65 GiB	26.90 B	SYCL	-1	1	tg128	3.55 ± 0.03

build: a468b89 (9477)

📊 Running SYCL Benchmark on Panther Lake Xe3 iGPU
📦 Model: Qwen3-Coder-Next-Q4_K_M.gguf

model	size	params	backend	ngl	fa	test	t/s
qwen3next 80B.A3B Q4_K - Medium	45.15 GiB	79.67 B	SYCL	-1	1	pp512	212.68 ± 1.34
qwen3next 80B.A3B Q4_K - Medium	45.15 GiB	79.67 B	SYCL	-1	1	tg128	18.45 ± 0.06

build: a468b89 (9477)

arthw Jun 3, 2026
Collaborator Author

@bobguns
They are updated in the table.

Thank you for your sharing!

yqYo1 · 2026-06-01T17:02:06Z

yqYo1
Jun 1, 2026

HW: Ryzen 5 5600X, DDR4-3600 128GB, Arc B570
OS: Ubuntu 24.04.4, 6.17.0-29-generic
Display output is driven by a separate GPU (GT 730), so the B570 should be handling only llama.cpp.

FP32

❯ ./build/bin/llama-bench -fa 0,1 -m /data/llm_models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   0 |           pp512 |        388.77 ± 0.54 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   0 |           tg128 |         72.48 ± 0.18 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   1 |           pp512 |        412.46 ± 1.13 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   1 |           tg128 |         76.76 ± 0.06 |

build: 5aba5364d (9456)

FP16

❯ ./build/bin/llama-bench -fa 0,1 -m /data/llm_models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   0 |           pp512 |       1355.95 ± 7.83 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   0 |           tg128 |         72.52 ± 0.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   1 |           pp512 |        685.89 ± 0.31 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  -1 |   1 |           tg128 |         76.72 ± 0.07 |

build: 5aba5364d (9456)
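
When transcribing llama-bench output such as the tables above into the summary table, the markdown rows can be parsed mechanically. The following is a minimal Python sketch and not part of llama.cpp: `parse_llama_bench` and `summarize` are hypothetical helper names, and the sample rows are copied (with padding removed) from the FP32 run above.

```python
def parse_llama_bench(markdown: str) -> list[dict]:
    """Parse llama-bench markdown table rows into dicts with numeric t/s and stddev."""
    records, header = [], None
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip prompts, build lines, blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table row holds the column names
            continue
        if all(set(cell) <= set("-:") for cell in cells):
            continue  # alignment/separator row
        row = dict(zip(header, cells))
        mean, _, stddev = row["t/s"].partition("±")
        row["t/s"] = float(mean)
        row["stddev"] = float(stddev)
        records.append(row)
    return records


def summarize(records: list[dict]) -> dict:
    """Collapse records into {fa: {test: t/s}}, the shape of one summary-table row."""
    summary = {}
    for record in records:
        summary.setdefault(record["fa"], {})[record["test"]] = record["t/s"]
    return summary


SAMPLE = """\
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | --: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | -1 | 0 | pp512 | 388.77 ± 0.54 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | -1 | 0 | tg128 | 72.48 ± 0.18 |
"""

print(summarize(parse_llama_bench(SAMPLE)))
# {'0': {'pp512': 388.77, 'tg128': 72.48}}
```

The sketch keeps the `fa` value as a string so it can be used directly as the FA column when filling in the table.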

1 reply

arthw Jun 2, 2026
Collaborator Author

@yqYo1
It has been added to the table.
Thank you for sharing!

Andr010-lang · 2026-06-04T06:38:18Z

Andr010-lang
Jun 4, 2026

HW: Ryzen 5 5700X, DDR4-3600 64GB, Arc B580 + Arc Pro B60 (24GB)
OS: Ubuntu 26.04, 7.0.0-22-generic, oneAPI 2026.0, xe driver 70.65.0

llama build

b60+b580
./llama-bench -m ~/llama-2-7b.Q4_0.gguf -fa 0

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	pp512	1737.00 ± 16.94
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	tg128	78.15 ± 0.22

build: 65ef50a (9501)

./llama-bench -m ~/llama-2-7b.Q4_0.gguf -fa 1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	1542.23 ± 24.72
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	83.39 ± 0.23

build: 65ef50a (9501)

on arc b60

./llama-bench -m ~/llama-2-7b.Q4_0.gguf -fa 0 -dev sycl0

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	SYCL0	pp512	1851.35 ± 1.16
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	SYCL0	tg128	77.92 ± 0.01

build: 65ef50a (9501)

./llama-bench -m ~/llama-2-7b.Q4_0.gguf -dev sycl0 -fa 1

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	SYCL0	pp512	1631.30 ± 1.21
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	SYCL0	tg128	82.85 ± 0.01

build: 65ef50a (9501)

on arc b580

./llama-bench -m ~/llama-2-7b.Q4_0.gguf -dev sycl1 -fa 0

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	SYCL1	pp512	2015.51 ± 15.87
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	SYCL1	tg128	82.49 ± 0.38

build: 65ef50a (9501)

./llama-bench -m ~/llama-2-7b.Q4_0.gguf -dev sycl1 -fa 1

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	SYCL1	pp512	1806.18 ± 8.68
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	SYCL1	tg128	86.54 ± 0.33

build: 65ef50a (9501)

VULKAN

After prompt processing, the GPU frequency drops back to its minimum and TG is low.

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	pp512	1581.29 ± 43.72
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	tg128	53.56 ± 14.29

build: 65ef50a (9501)

I set the minimum GPU frequency to 2300 MHz on the B60 and 2683 MHz on the B580.

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	pp512	1667.43 ± 11.14
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	tg128	65.94 ± 0.47

build: 65ef50a (9501)

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	pp512	1626.71 ± 2.51
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	tg128	69.46 ± 0.41

build: 65ef50a (9501)

on B60

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	Vulkan1	pp512	1621.00 ± 2.51
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	Vulkan1	tg128	68.07 ± 0.01

build: 65ef50a (9501)

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	Vulkan1	pp512	1572.29 ± 2.51
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	Vulkan1	tg128	71.42 ± 0.00

build: 65ef50a (9501)

on b580

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	Vulkan0	pp512	1929.93 ± 17.92
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	Vulkan0	tg128	73.01 ± 0.44

build: 65ef50a (9501)

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	Vulkan0	pp512	1841.31 ± 22.24
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	Vulkan0	tg128	76.83 ± 0.18

build: 65ef50a (9501)

1 reply

arthw Jun 5, 2026
Collaborator Author

@Andr010-lang
The SYCL data has been added to the first table.
We can't show the Vulkan data there, but your comment stays here for others' reference.

Thank you for sharing!

GerardoNevarez · 2026-06-05T07:47:41Z

GerardoNevarez
Jun 5, 2026

Arc A380

I know, I know... I got this card for its AV1 encoding/decoding, not LLMs, but here we are...

OS: Ubuntu 24.04.4 LTS
CPU: Xeon E5-2695 V4
Kernel driver in use: xe

F16

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	pp512	495.16 ± 1.43
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	tg128	23.84 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	362.69 ± 0.27
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	25.57 ± 0.20

F32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	pp512	226.11 ± 2.39
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	tg128	23.81 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	213.39 ± 8.38
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	25.81 ± 0.41

Got a couple of warnings during execution:

SYCL GPU device 0 does not use Level Zero backend, disabling Level Zero memory API
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

Other info:

Build config (F16/F32 variations): cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DBUILD_SHARED_LIBS=0 -DGGML_SYCL_F16=0 -DGGML_SYCL_F32=1 -DGGML_SYCL_GRAPH=1 -DGGML_SYCL_HOST_MEM_FALLBACK=1

sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz OpenCL 3.0 (Build 0) [2026.20.3.0.19_160000.xmain-hotfix]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A380 Graphics OpenCL 3.0 NEO  [26.18.38308.1]

llama-ls-sycl-device formatted output:

Found 1 SYCL devices:

ID	Device Type	Name	Version	Max compute units	Max work group	Max sub group	Global mem size	Driver version
0	[opencl:gpu:0]	Intel Arc A380 Graphics	3.0	128	1024	32	6064M	26.18.38308.1

SYCL Optimization Feature:

ID	Device Type	Reorder
0	[opencl:gpu:0]	Y

build: 7c158fb (b9518)

7 replies

arthw Jun 5, 2026
Collaborator Author

@GerardoNevarez
The SYCL data has been added to the first table.
We are glad to see the A380 run the LLM with good performance for its price and power. :)

If performance and running time are not a concern, you could run a 2-7B LLM with llama-server on an existing Intel GPU (like the A380) to help create a report or abstract from your raw text.
You click "start" and go for “Fika”.
When you come back, the report is ready.

Improve your work and life at no cost.
That's the life we want to provide.

Thank you for sharing!

GerardoNevarez Jun 5, 2026

@arthw

Updated Arc A380 bench runs with default build on e82beaa (9534) - you could use these for the table above.

Forgot to mention, this rig uses PCIe 3.0 only, but has ReBar enabled (SYCL refuses to use the xe driver otherwise)

F16

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	pp512	478.53 ± 0.65
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	tg128	18.19 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	351.23 ± 0.62
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	19.12 ± 0.05

F32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	pp512	220.98 ± 0.30
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	tg128	18.16 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	216.06 ± 0.11
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	19.70 ± 0.02

"Static" build e82beaa (9534) with -DBUILD_SHARED_LIBS=0 - would probably use these for whatever task I dedicate this GPU to (when not transcoding).

F16

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	pp512	491.34 ± 1.26
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	tg128	23.97 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	358.39 ± 0.28
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	25.63 ± 0.01

F32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	pp512	224.16 ± 0.64
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	tg128	24.04 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	220.10 ± 0.21
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	26.57 ± 0.01

arthw Jun 8, 2026
Collaborator Author

@GerardoNevarez
When I add -DBUILD_SHARED_LIBS=0 to the build, I see reduced performance on the Arc A770:
tg drops from 58 to 53 t/s without FA.

But in your case it increases.
Could you confirm that?

Thank you!

GerardoNevarez Jun 8, 2026

@arthw I ran it twice before posting. I was surprised that the "official" build is slower. I'm running natively on Ubuntu, not in Docker. I just pulled updates from GH and ran them again; results are below (without FA). They are a bit worse than last time, and I can't explain that either.

"Official" build
F16: tg128 -> 18.15 ± 0.03
F32: tg128 -> 18.08 ± 0.03

"Static" build
F16: tg128 -> 24.15 ± 0.01
F32: tg128 -> 24.10 ± 0.02

My CPU is an old Xeon, but it has 45MB of L3 cache on a single die. Maybe that improves communication between the CPU and GPU, with fewer long call jumps into the libraries? That does not explain your results on the A770, though.
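
To put the gap between the two builds in numbers, here is a small Python sketch that computes the relative change in tg128 from the figures posted above. The helper name `relative_change` is hypothetical; the inputs are the A380 means from this reply and the A770 figures (58 and 53) from the earlier comment.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Return the change from baseline to candidate as a percentage of baseline."""
    return (candidate - baseline) / baseline * 100.0


# tg128 means (t/s) without FA, as posted in this thread
official = {"F16": 18.15, "F32": 18.08}   # default (shared-library) build on the A380
static = {"F16": 24.15, "F32": 24.10}     # -DBUILD_SHARED_LIBS=0 build on the A380

for precision in official:
    change = relative_change(official[precision], static[precision])
    print(f"A380 {precision}: {change:+.1f}%")

# A770 figures reported earlier in the thread: 58 -> 53 with the static build
print(f"A770: {relative_change(58, 53):+.1f}%")
# A380 F16: +33.1%
# A380 F32: +33.3%
# A770: -8.6%
```

The two cards move in opposite directions by amounts well outside the posted ± ranges, so run-to-run noise alone is unlikely to explain the difference.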

arthw Jun 8, 2026
Collaborator Author

@GerardoNevarez
No problem!
I recorded the best result for the A380 in the first table.

We will look into it later.
Thank you!

twoplan · 2026-06-12T15:22:08Z

twoplan
Jun 12, 2026

Hi,

Intel Core Ultra 7 258V
Ubuntu 26.04
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON

./llama-bench -fa 0,1 -m ../../models/llama-2-7b.Q4_0.gguf

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	pp512	535.62 ± 1.11
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	0	tg128	24.61 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	pp512	245.96 ± 1.26
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	-1	1	tg128	25.26 ± 0.10

build: 6471e3c (9607)

sycl-ls

[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero V2, Intel(R) Arc(TM) Graphics 20.4.4 [1.15.38308+1]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 258V OpenCL 3.0 (Build 0) [2026.21.3.0.31_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO  [26.18.38308.1]

Thanks for your great work!

3 replies

jruhe-adesso Jun 13, 2026

Could you post your Vulkan results in this discussion #10879 , too, please?

twoplan Jun 14, 2026

Sure, Vulkan bench results

arthw Jun 15, 2026
Collaborator Author

@twoplan
Your test result is updated in the first table.

Thank you!

toomanybyt3s · 2026-06-12T23:35:34Z

toomanybyt3s
Jun 12, 2026

A380 - Docker

Apologies for testing in Docker. My local env is messed up in all sorts of ways and I'm unable to test on bare metal, but I'll share the Docker compose file and commands to reproduce if anyone is interested. I'll also attach other benchmarks to compare the current state as of commit e95dae1.

All results shown are from the third run.

SYCL F16

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	pp512	276.43 ± 0.23
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	tg128	16.17 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	pp512	347.28 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	tg128	14.95 ± 0.01

SYCL F32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	pp512	170.14 ± 0.26
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	tg128	15.97 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	pp512	201.62 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	tg128	14.96 ± 0.00

I ran a similar docker image a few days ago and I remember my results being far far better, not sure what has happened with the pp.

Vulkan

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) A380 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	259.51 ± 0.42
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	17.57 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	247.03 ± 0.24
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	17.25 ± 0.02

OpenVINO

OpenVINO: using device GPU

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	OPENVINO	99	1	pp512	1497.21 ± 3.36
llama 7B Q4_0	3.56 GiB	6.74 B	OPENVINO	99	1	tg128	18.18 ± 0.02

Docker compose

services:
  bench-openvino:
    build:
      context: .
      dockerfile: .devops/openvino.Dockerfile
      target: full
    image: llama.cpp:full-openvino-local
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - llama-cache:/models
    environment:
      - LD_LIBRARY_PATH=/app
      - GGML_OPENVINO_DEVICE=${GGML_OPENVINO_DEVICE:-GPU}
      - GGML_OPENVINO_STATEFUL_EXECUTION=1
      - LLAMA_CACHE=/models
    entrypoint: /app/llama-bench
    command:
      - -hf
      - ${HF_REPO:-TheBloke/Llama-2-7B-GGUF:Q4_0}
      - -fa
      - "1"
      - -ngl
      - "99"

  bench-sycl-f16:
    build:
      context: .
      dockerfile: .devops/intel.Dockerfile
      target: full
      args:
        GGML_SYCL_F16: "ON"
    image: llama.cpp:full-sycl-f16-local
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - llama-cache:/models
    environment:
      - LLAMA_CACHE=/models
      - ONEAPI_DEVICE_SELECTOR=level_zero:0
      - ZES_ENABLE_SYSMAN=1
    entrypoint: /app/llama-bench
    command:
      - -hf
      - ${HF_REPO:-TheBloke/Llama-2-7B-GGUF:Q4_0}
      - -fa
      - "1,0"
      - -ngl
      - "99"

  bench-sycl-f32:
    build:
      context: .
      dockerfile: .devops/intel.Dockerfile
      target: full
      args:
        GGML_SYCL_F16: "OFF"
    image: llama.cpp:full-sycl-f32-local
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - llama-cache:/models
    environment:
      - LLAMA_CACHE=/models
      - ONEAPI_DEVICE_SELECTOR=level_zero:0
      - ZES_ENABLE_SYSMAN=1
    entrypoint: /app/llama-bench
    command:
      - -hf
      - ${HF_REPO:-TheBloke/Llama-2-7B-GGUF:Q4_0}
      - -fa
      - "1,0"
      - -ngl
      - "99"

  bench-vulkan:
    build:
      context: .
      dockerfile: .devops/vulkan.Dockerfile
      target: full
    image: llama.cpp:full-vulkan-local
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - llama-cache:/models
    environment:
      - LLAMA_CACHE=/models
    entrypoint: /app/llama-bench
    command:
      - -hf
      - ${HF_REPO:-TheBloke/Llama-2-7B-GGUF:Q4_0}
      - -fa
      - "1,0"
      - -ngl
      - "99"

volumes:
  llama-cache:
    name: llama-cache

Commands

docker compose run --build --rm bench-openvino   # OpenVINO
docker compose run --build --rm bench-sycl-f16  # SYCL F16
docker compose run --build --rm bench-sycl-f32  # SYCL F32
docker compose run --build --rm bench-vulkan     # Vulkan

For repeated runs, remove --build.

1 reply

toomanybyt3s Jun 13, 2026

@arthw
I've updated the system from Ubuntu 24.04 (kernel 6.11) to Ubuntu 25.10 (kernel 6.17).
lspci -nnk | grep -i vga -A3

08:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A380] [8086:56a5] (rev 05)
        Subsystem: Device [172f:3943]
        Kernel driver in use: xe
        Kernel modules: i915, xe

dpkg -l | grep libze-intel-gpu1

ii  libze-intel-gpu1                      26.18.38308.1-1~25.10~ppa1                   amd64        Intel oneAPI L0 support implementation for Intel GPUs -- shared library

SYCL F16

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	pp512	441.14 ± 0.30
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	tg128	19.76 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	pp512	351.23 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	tg128	18.62 ± 0.03

SYCL F32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	pp512	227.06 ± 0.46
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	tg128	20.26 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	pp512	203.17 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	tg128	18.54 ± 0.02

OpenVINO

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	OPENVINO	99	1	pp512	1513.67 ± 2.71
llama 7B Q4_0	3.56 GiB	6.74 B	OPENVINO	99	1	tg128	19.21 ± 0.03

Vulkan

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	263.70 ± 0.17
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	17.87 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	249.81 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	17.55 ± 0.01

SYCL F16 performance improved after the driver update; Vulkan is still low for some reason. OpenVINO PP is insane, although its model support is still catching up.

arthw · 2026-06-13T02:00:04Z

arthw
Jun 13, 2026
Collaborator Author

@toomanybyt3s
Your test results are lower than those in #23313 (comment).

Could you check the driver with the following commands?

lspci -nnk | grep -i vga -A3
dpkg -l | grep libze-intel-gpu1

4 replies

toomanybyt3s Jun 13, 2026

bytes@kraken:~/ai-stuff/helping-llamacpp-sycl$ lspci -nnk | grep -i vga -A3
08:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A380] [8086:56a5] (rev 05)
        Subsystem: Device [172f:3943]
        Kernel driver in use: i915
        Kernel modules: i915, xe
bytes@kraken:~/ai-stuff/helping-llamacpp-sycl$ dpkg -l | grep libze-intel-gpu1

Ah, I'm missing some driver pieces. Let me update them and I'll rerun the tests.

toomanybyt3s Jun 13, 2026

uname shows this as my kernel. I might also update my system to 25.10.

Linux kraken 6.11.0-29-generic #29-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 13 20:29:41 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

arthw Jun 15, 2026
Collaborator Author

@toomanybyt3s
There are two drivers on your OS: i915 and Xe.
Your A380 is using i915.

Could you disable i915 so that the A380 uses the Xe driver?
I think it may increase the performance; in any case, Xe is the recommended and actively updated driver.

Thank you!

GerardoNevarez Jun 16, 2026

@toomanybyt3s try changing which driver is used by editing the /etc/default/grub file as below, then run sudo update-grub and restart:

GRUB_CMDLINE_LINUX_DEFAULT='quiet splash i915.force_probe=!56a5 xe.force_probe=56a5'

This enables the Xe driver and disables the i915 driver for the card. The value must match the card's PCI device ID (A380 = [8086:56a5]). Please make sure PCIe 'ReBar' (and maybe also 'Above 4G') is enabled in your BIOS; the SYCL/Xe driver combo needs it in order to work. If you can't enable it, I'm not sure you can use the Xe driver.
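
The driver check and the kernel parameters above can be sketched in Python as follows. This is an illustration only: `parse_lspci_vga` and `force_probe_params` are hypothetical helper names, the sample text is the `lspci -nnk | grep -i vga -A3` output posted earlier in this thread, and the parameter string simply mirrors the GRUB line above.

```python
import re


def parse_lspci_vga(output: str) -> list[dict]:
    """Extract vendor ID, device ID and bound kernel driver from `lspci -nnk` text."""
    devices = []
    current = None
    for line in output.splitlines():
        match = re.search(
            r"VGA compatible controller.*\[([0-9a-f]{4}):([0-9a-f]{4})\]", line
        )
        if match:
            current = {"vendor": match.group(1), "device": match.group(2), "driver": None}
            devices.append(current)
        elif current is not None and "Kernel driver in use:" in line:
            current["driver"] = line.split(":", 1)[1].strip()
    return devices


def force_probe_params(device_id: str) -> str:
    """Build kernel parameters that hand the device to xe instead of i915."""
    return f"i915.force_probe=!{device_id} xe.force_probe={device_id}"


SAMPLE = """\
08:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A380] [8086:56a5] (rev 05)
        Subsystem: Device [172f:3943]
        Kernel driver in use: i915
        Kernel modules: i915, xe
"""

for gpu in parse_lspci_vga(SAMPLE):
    if gpu["vendor"] == "8086" and gpu["driver"] != "xe":
        print(f"{gpu['device']} is bound to {gpu['driver']}; suggested parameters:")
        print(force_probe_params(gpu["device"]))
```

The script only prints a suggestion; editing the GRUB file and rebooting remain manual steps.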

egeoz · 2026-06-13T18:26:03Z

egeoz
Jun 13, 2026

Hardware: Intel Core Ultra 5 250K Plus, DDR5-6400 16GBx1, Intel Arc B580 LE with the minimum core clock set to 2850 MHz, since it seems to drop during inference despite having plenty of thermal headroom
OS: CachyOS, 7.0.11-1-cachyos, oneAPI 2026.0, mesa-git 26.2.0_devel
Compile flags: -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON

model	size	params	backend	ngl	fa	test	t/s	utilized memory bandwidth (GB/s)	utilized memory bandwidth (fraction of peak)
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	pp512	1954.15 ± 5.44
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	0	tg128	83.67 ± 0.14	301.212	0.66
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	pp512	893.17 ± 0.77
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	1	tg128	88.13 ± 0.08	317.268	0.69
gemma4 12B Q4_KM	6.86 GiB	11.91 B	SYCL	99	0	pp512	1320.02 ± 0.78
gemma4 12B Q4_KM	6.86 GiB	11.91 B	SYCL	99	0	tg128	41.34 ± 0.03	285.246	0.62
gemma4 12B Q4_KM	6.86 GiB	11.91 B	SYCL	99	1	pp512	1062.58 ± 3.35
gemma4 12B Q4_KM	6.86 GiB	11.91 B	SYCL	99	1	tg128	39.96 ± 0.09	276.0	0.6
gemma4 E4B Q4_0	4.49 GiB	7.52 B	SYCL	99	0	pp512	2763.42 ± 2.33
gemma4 E4B Q4_0	4.49 GiB	7.52 B	SYCL	99	0	tg128	82.30 ± 0.02	370.35	0.45
gemma4 E4B Q4_0	4.49 GiB	7.52 B	SYCL	99	1	pp512	1349.62 ± 4.12
gemma4 E4B Q4_0	4.49 GiB	7.52 B	SYCL	99	1	tg128	81.08 ± 0.24	364.95	0.45

Gemma4 E4B has 4B parameters active, so I have calculated the effective memory bandwidth utilization as 45%. There seems to be some overhead with MoE models in general.
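
The bandwidth columns in the table above appear to follow a simple estimate: each generated token reads the active weights once, so effective bandwidth is roughly t/s times the gigabytes read per token. Below is a minimal Python sketch of that arithmetic under stated assumptions: the 456 GB/s peak for the B580 and the 3.6 GB weight size are inferred from the posted figures, not stated in the post, and the function names are hypothetical.

```python
# Assumption: peak memory bandwidth of the Arc B580 in GB/s (not stated in the post).
PEAK_BANDWIDTH_GBS = 456.0


def effective_bandwidth_gbs(tokens_per_s: float, weights_gb: float) -> float:
    """Estimate bandwidth as tokens per second times GB of weights read per token."""
    return tokens_per_s * weights_gb


def utilization(bandwidth_gbs: float, peak_gbs: float = PEAK_BANDWIDTH_GBS) -> float:
    """Return the fraction of the assumed peak bandwidth that is in use."""
    return bandwidth_gbs / peak_gbs


# llama 7B Q4_0, fa=0, tg128 = 83.67 t/s; weights taken as 3.6 GB (inferred rounding)
bandwidth = effective_bandwidth_gbs(83.67, 3.6)
print(f"{bandwidth:.3f} GB/s, {utilization(bandwidth):.2%} of assumed peak")
# 301.212 GB/s, 66.06% of assumed peak
```

For a model that activates only part of its weights per token, `weights_gb` would be the active portion rather than the full file size, which is the kind of adjustment described for Gemma4 E4B.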

1 reply

arthw Jun 15, 2026
Collaborator Author

@egeoz
It's updated in the first table!

Thank you!
