Files
llama.cpp/ggml
Matt Corallo c6e4088376 vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (#22887)
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32

Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

Note that this breaks some tests until the last commit which fixes
OOB A reads.

* vulkan: Use aligned loads in mul_mat_vec when available

Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec

Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.

Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes

There was a TODO to fix the OOB reads from the A matrix which we do
here.

It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
2026-05-27 17:19:23 +02:00
..
2024-07-13 18:12:39 +02:00