lhez
d6588daa80
opencl: support bf16 by converting to f16 ( #23839 )
b9436
2026-05-30 10:17:47 -07:00
Pascal
d38d50e7ff
ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked ( #23910 )
2026-05-30 16:50:54 +02:00
Johannes Gäßler
8b0e0db606
TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs ( #23843 )
...
* TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs
* fix afmoe TP
b9434
2026-05-30 16:48:00 +03:00
Georgi Gerganov
2d9b7c8e98
metal : restore im2col implementation for large kernels ( #23901 )
b9433
2026-05-30 15:26:13 +03:00
Xuan-Son Nguyen
e674b1279b
test: (test-llama-archs) log the config name first ( #23885 )
b9432
2026-05-30 12:22:38 +02:00
Georgi Gerganov
4c4e91b799
ci : update ios-xcode release job to macos-26 ( #23906 )
...
* ci : disable libcommon build from xcframework
* ocd : fix name
* ci : ios-xcode change to macos-26
* cont : pin xcode
* cont : pin xcode to minor version
b9431
2026-05-30 13:21:46 +03:00
Jinyang He
d48a56effb
ggml : add some lsx support ( #23798 )
...
* loongarch : optimize LSX fp16 load/store with native intrinsics
Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in
__lsx_f16x4_load and __lsx_f16x4_store.
* loongarch : add LSX implementation for q8_0 dot product
* loongarch : add LSX implementation for q6_K dot product
* loongarch : add LSX implementation for iq4_xs dot product
* Improve reduce ops when summing int16 pairs to int32
b9430
2026-05-30 11:53:26 +03:00
Ruben Ortlam
6e093b80ea
vulkan: add Flash Attention support for BFloat16 KV cache ( #23420 )
...
* vulkan: add flash attention bf16 kv support
* vulkan: bf16 FA coopmat1 support
* vulkan: bf16 FA coopmat2 support
* fix FA bf16 f32 fallback
* fix FA bf16 coopmat1 shader
* fix FA bf16 coopmat2 shader
* code cleanup
* cleanup comment change
* address feedback
* add O_TYPE for cm2 FA
* use O_TYPE for gqaStore function
* reduce BFLOAT16 ifdefs
2026-05-30 10:39:31 +02:00
Georgi Gerganov
337528571d
ci : fix s390x release job ( #23898 )
...
* ci : fix s390x release job
* ci : multi-thread build for `ios-xcode`
* ocd : names
b9428
2026-05-30 09:21:38 +03:00
Georgi Gerganov
d4204b03a5
ci : clear cache instead of "no timestamp" keys + fix macos ( #23895 )
...
* ci : ios use macos-15 again
* ci : add and test ccache-clear
* cont : fix
* cont : set permission
* cont : another permission
* cont : token
* cont : print key
* cont : bring back perms
* cont : test windows
* cont : add token
* cont : cleanup
* ci : make release jobs clean-up their ccache
2026-05-30 08:52:30 +03:00
Radoslav Gerganov
1738129bee
llama : do not skip iGPU when only RPC devices are present ( #23868 )
...
After #23007 reclassified integrated CUDA/HIP devices as IGPU, the device
selection logic dropped the local iGPU whenever any RPC server was added,
because RPC devices made `model->devices` non-empty. On systems where the
"iGPU" is the main compute device (e.g. Strix Halo with 128 GiB of unified
memory), this caused all tensors to be allocated on the RPC peer alone and
model loading to fail.
Gate the iGPU inclusion on `gpus.empty()` instead, so RPC peers no longer
suppress the local iGPU.
closes : #23858
b9426
2026-05-30 07:48:22 +03:00
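The device-selection rule described in #23868 can be sketched as follows. This is a minimal illustration, not the actual llama.cpp code: `device_kind`, `device`, and `select_devices` are hypothetical names, and the ordering of the returned list is an assumption. The only point carried over from the commit message is that the iGPU is gated on `gpus.empty()` rather than on whether any device (including RPC peers) was already selected.

```cpp
#include <string>
#include <vector>

// Hypothetical device categories; not the real ggml device-type enum.
enum class device_kind { gpu, igpu, rpc };

struct device {
    std::string name;
    device_kind kind;
};

// Pick the devices to use. RPC peers and discrete GPUs are always kept;
// the local iGPU is added only when no discrete GPU is present.
std::vector<device> select_devices(const std::vector<device> & available) {
    std::vector<device> gpus, igpus, rpcs;
    for (const auto & d : available) {
        switch (d.kind) {
            case device_kind::gpu:  gpus.push_back(d);  break;
            case device_kind::igpu: igpus.push_back(d); break;
            case device_kind::rpc:  rpcs.push_back(d);  break;
        }
    }

    std::vector<device> selected = rpcs;
    selected.insert(selected.end(), gpus.begin(), gpus.end());

    // Gate on gpus.empty(), not selected.empty(): an RPC peer alone
    // must not suppress the local iGPU.
    if (gpus.empty()) {
        selected.insert(selected.end(), igpus.begin(), igpus.end());
    }
    return selected;
}
```

With one RPC peer and one iGPU, both devices are returned; adding a discrete GPU drops the iGPU again, which matches the behavior the commit message describes.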
Xuan-Son Nguyen
0821c5fcfd
server: in SSE mode, send HTTP headers when slot starts ( #23884 )
...
* server: in SSE mode, send HTTP headers when slot starts
* ref to pr
* stream should be false by default
2026-05-30 00:06:29 +02:00
Reese Levine
151f3a98e9
ggml-webgpu: Check earlier for WebGPU required features ( #23879 )
2026-05-29 14:16:05 -07:00
Reese Levine
b22da25889
ggml-webgpu: add q4_0/q8_0 SET_ROWS ( #23760 )
...
* Add q8_0 and q4_0 set_rows
* Add fast(er) quantization set_rows path
* formatting/naming
* a little more naming
* Remove unused constant
* Don't override other override
* Avoid bitcast
* Narrow relaxation
2026-05-29 14:14:11 -07:00
Ruixiang Wang
689a9a470e
server-bench : add speed-bench for speculative decoding benchmarking ( #23869 )
...
* spec: add speed-bench support for benchmarking
* speed-bench : add trailing newline to requirements.txt
* speed-bench : bump datasets to 4.8.0 to fix ty check
* server-bench : remove now-unused type: ignore after datasets bump
2026-05-29 23:09:47 +02:00
Pascal
5a46b46acd
app: add llama update self updater ( #23865 )
...
* wip: llama update POC
* cleaning: llama update
* llama-gen-docs
* app: delegate llama update to the install script
* app: spawn the installer detached so llama update can replace a running binary
* cleaning: inline llama update into llama.cpp, drop app-update.{cpp,h}
* app: make llama_update static
Address review from @angt
2026-05-29 23:02:40 +02:00
ValdikSS
22d66b567e
ui: handle audio/vnd.wave as audio WAV file ( #23754 )
...
Firefox on Linux uses this MIME type
2026-05-29 21:41:35 +02:00
Tarek Dakhran
2084434e66
vocab : support tokenizer for LFM2.5-8B-A1B ( #23826 )
...
* vocab: Support tokenizer for LFM2.5-8B-A1B
* Keep liquid6 tokenizer in models
2026-05-29 20:25:43 +02:00
Sigbjørn Skjæret
764f1e64a1
graph : ensure DS32 kq_mask_lid is F32 ( #23864 )
2026-05-29 19:55:14 +02:00
Xuan-Son Nguyen
b5f52280fb
server: remove obsolete scripts ( #23870 )
2026-05-29 19:47:30 +02:00
Georgi Gerganov
dc71236b6c
ci : update macos release to use macos-26 runner ( #23878 )
2026-05-29 20:41:57 +03:00
Xuan-Son Nguyen
06d26dfdff
download: add option to skip_download ( #23059 )
...
* download: add option to skip_download
* fix
* fix 2
* if file doesn't exist, respect skip_download flag
b9415
2026-05-29 16:30:55 +02:00
Saba Fallah
da3f990a47
mtmd: Add DeepSeekOCR 2 Support ( #20975 )
...
* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution
* introduced clip_image_f32::add_viewsep
* address PR review
- drop redundant ggml_cpy ops in both deepseekocr versions build
- drop no-op ggml_cont in build_sam
- assert num_image_tokens deepseekocr2
- view_seperator as (1, n_embd) at conversion (for both versions)
- drop redundant ggml_reshape_2d
* Update tools/mtmd/models/deepseekocr2.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b9414
2026-05-29 16:13:51 +02:00
Oliver Simons
6ed481eea4
CUDA: Check PTX version on host side to guard PDL dispatch ( #23530 )
...
* CUDA: Check PTX version on host side to guard PDL dispatch
Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).
Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.
This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code
* Implement MurmurHash3 mixer for better hash distribution
Magic constants were taken from boost:
https://github.com/boostorg/container_hash/blob/2698b43803c012601e6bb1a6116e83767b97986c/include/boost/container_hash/detail/hash_mix.hpp#L19-L65
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments, make seed non-zero
* Apply code-formatting
* Replace std::size_t -> size_t for consistency
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b9413
2026-05-29 12:28:18 +02:00
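For readers unfamiliar with the "MurmurHash3 mixer" mentioned in #23530, the sketch below shows the well-known 64-bit finalizer (`fmix64`) from the public-domain MurmurHash3 reference implementation. It is an illustration only: the commit says its magic constants were taken from Boost's `hash_mix.hpp`, so the constants and shifts actually used in `ggml-cuda` may differ from the ones here.

```cpp
#include <cstdint>

// MurmurHash3 64-bit finalizer (fmix64), using the constants from the
// reference implementation. Assumption: shown for illustration; the commit
// cites Boost's hash_mix.hpp, whose constants may not match these.
static inline uint64_t fmix64(uint64_t k) {
    k ^= k >> 33;                 // fold high bits into low bits
    k *= 0xff51afd7ed558ccdULL;   // odd multiplier spreads bits upward
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}
```

Every step is invertible, so distinct inputs give distinct outputs, and an input of 0 maps to 0. That fixed point is one plausible reason a mixer of this kind is paired with a non-zero seed, as one of the follow-up bullets in the commit mentions.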
Xuan-Son Nguyen
cb47092b00
server: bump timeout to 3600s ( #23842 )
...
* server: bump timeout to 3600s
* nits: change wording
b9412
2026-05-29 10:23:17 +02:00
fairydreaming
1f0aa2a696
model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation ( #23346 )
...
* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)
* convert : handle DeepseekV32ForCausalLM architecture
* ggml : support for f16 GGML_OP_FILL
* memory : separate hparams argument in llama_kv_cache constructor
* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)
* llama : support for LLM_ARCH_DEEPSEEK32
* model : llama_model_deepseek32 implementation
* model : merge two scale operations into one in DSA lightning indexer implementation
* chore : remove unused code
* model : support NVFP4 in DeepSeek V3.2
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* memory : refactoring TODO
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
b9411
2026-05-29 10:15:17 +02:00
Aman Gupta
031ddb2e08
llama: use f16 mask for FA to save VRAM ( #23764 )
...
* llama: use f16 mask for FA
* review: add llama_cast + formatting
* simplify
b9410
2026-05-29 15:44:43 +08:00
Georgi Gerganov
fe12e422ad
sync : ggml
b9409
2026-05-29 09:56:08 +03:00
Georgi Gerganov
ea02bc37f5
ggml : bump version to 0.13.1 (ggml/1523)
2026-05-29 09:56:08 +03:00
Omid Azizi
b000431a0b
ngram-mod : Add missing include ( #23857 )
...
[no release]
Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>
2026-05-29 09:21:37 +03:00
Aman Gupta
eef59a7642
llama: add llm_graph_input_mtp ( #23643 )
...
* llama: add llm_graph_input_mtp
* rename input_mtp -> input_token_embd
* add TODO about mtmd embedding
* cont : clean-up
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b9406
2026-05-29 09:17:32 +03:00
Adrien Gallouët
98e480a32e
app : move licences to llama-app ( #23824 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9405
2026-05-29 07:46:11 +02:00
Andreas Kieslinger
241cbd41d2
cuda : disables launch_fattn PDL enrollment due to compiler bug ( #23825 )
b9404
2026-05-29 07:46:10 +03:00
Matt Corallo
33c718db1f
meta : Add missing buffer set in allreduce fallback !COMPUTE clear ( #23480 )
...
Without this at least the vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
b9403
2026-05-29 06:30:24 +03:00
Max Krasnyansky
19e92c33ef
hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion ( #23835 )
...
Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
b9402
2026-05-28 14:05:54 -07:00
Xuan-Son Nguyen
751ebd17a5
mtmd-debug: add color and rainbow mode ( #23829 )
...
* mtmd-debug: add color and rainbow mode
* fix M_PI
* max_dist
b9401
2026-05-28 20:59:14 +02:00
Xuan-Son Nguyen
c8914ad4f4
mtmd: fix gemma 4 projector pre_norm ( #23822 )
b9400
2026-05-28 20:58:55 +02:00
lhez
408ae2b9e5
opencl: move backend info printing into its own function ( #23702 )
...
* opencl: move backend info print into its own function
* opencl: move new log line
* opencl: fix for non adreno path
b9399
2026-05-28 11:05:42 -07:00
Sigbjørn Skjæret
3ef2369551
ci : run ui publish on ubuntu-slim ( #23818 )
...
* run ui publish on self-hosted fast
* run on ubuntu-slim
2026-05-28 20:58:32 +03:00
ValdikSS
2f6c815dc4
ui: fix audio and video modality detection ( #23756 )
...
When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.
2026-05-28 17:36:10 +02:00
Georgi Gerganov
445b7cef62
ci : releases use Github-hosted builds for the UI ( #23823 )
...
* ci : releases use Github-hosted builds for the UI
* cont : fix name
2026-05-28 17:50:32 +03:00
Adrien Gallouët
479a9a1b03
app : improve help output ( #23805 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9395
2026-05-28 16:45:06 +02:00
Saba Fallah
0b56d283bf
mtmd: n_head_kv defaults to n_head ( #23782 )
...
removed AI-generated comment
b9394
2026-05-28 16:44:36 +02:00
Xuan-Son Nguyen
d6be3158e1
mtmd: fix gemma 4 audio rms norm eps ( #23815 )
...
* mtmd: fix gemma 4 audio rms norm eps
* Update tools/mtmd/clip.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9393
2026-05-28 16:31:37 +02:00
Georgi Gerganov
dd1557907a
ci : change Vulkan builds to Release to reduce ccache ( #23820 )
...
* ci : disable all CPU variant builds for Vulkan workflow
* cont : change cache key
* cont : change build type
2026-05-28 17:29:11 +03:00
Mikolaj Kucharski
7fb1e70b59
arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file ( #23167 )
b9391
2026-05-28 16:25:40 +02:00
Johannes Gäßler
d374e71e55
test-llama-archs: fix table format [no release] ( #23810 )
2026-05-28 15:53:54 +02:00
fl0rianr
30af6e2b98
ggml: auto apply iGPU flag CUDA/HIP if integrated device ( #23007 )
b9389
2026-05-28 15:01:14 +02:00
redfox
d7be46189f
mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … ( #23729 )
...
* mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING
* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b9388
2026-05-28 14:51:14 +02:00
Jaden_Mach
bc81d47aba
CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware ( #23227 )
...
* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware
The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.
This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):
    Q3_K, Q4_K, Q5_K : MMVQ <= 3  (MMQ wins from batch=4: +5% .. +76%)
    Q2_K, Q6_K       : MMVQ <= 5  (MMQ wins from batch=6: +8% .. +35%)
    others           : MMVQ <= 8  (legacy & IQ regress under MMQ; unchanged)
Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.
Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.
Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S: 559 -> 940 (+68%)
    Q5_K_S: 503 -> 884 (+76%)
    Q3_K_S: 629 -> 879 (+40%)
    Q2_K  : 615 -> 809 (+32%)
    Q6_K  : 582 -> 776 (+33%)
Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S: 444 -> 480 (+ 8%)
    Q4_0  : 682 -> 685 (+ 0%)  (no regression - retains MMVQ)
    IQ4_XS: 706 -> 698 (- 1%)  (no regression - retains MMVQ)
* CUDA: address review — inline MMVQ batch table, drop env hatch & doc block
* tune kernel selection logic for CDNA1
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b9387
2026-05-28 14:50:25 +02:00
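The per-quant MMVQ/MMQ threshold table from the first bullet of #23227 can be sketched as below. This is a simplified model under stated assumptions: the `quant` enum and `mmvq_max_batch` helper are hypothetical, `should_use_mmvq` here takes a boolean instead of a compute-capability value, and the later bullets in the same PR (inlining the table, tuning for CDNA1) mean the merged code likely looks different.

```cpp
// Hypothetical quant identifiers; the real code works on ggml_type.
enum class quant { Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q4_0, Q8_0, IQ4_XS };

// Largest batch size (ne11) still routed to MMVQ, following the table in
// the commit message. Without AMD MFMA the global threshold of 8 applies.
int mmvq_max_batch(quant type, bool amd_mfma) {
    if (!amd_mfma) {
        return 8; // MMVQ_MAX_BATCH_SIZE
    }
    switch (type) {
        case quant::Q3_K:
        case quant::Q4_K:
        case quant::Q5_K:
            return 3; // MMQ wins from batch=4
        case quant::Q2_K:
        case quant::Q6_K:
            return 5; // MMQ wins from batch=6
        default:
            return 8; // legacy and IQ quants keep the global threshold
    }
}

// True if the quantized matmul should use mul_mat_vec_q, false for mul_mat_q.
bool should_use_mmvq(quant type, bool amd_mfma, int ne11) {
    return ne11 <= mmvq_max_batch(type, amd_mfma);
}
```

Under this model, a Q4_K matmul with batch 4 goes to MMQ on MFMA hardware but stays on MMVQ elsewhere, which is the behavior the PR title summarizes.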