aafsmarak
0ef6f06d55
docs/android.md: Add dependency libandroid-spawn for building in termux ( #21812 )
...
Fixes https://github.com/ggml-org/llama.cpp/issues/18615
b9755
2026-06-22 05:48:31 +02:00
Aldehir Rojas
52b3df0023
common/peg : implement ac parser for stricter grammar generation ( #24869 )
...
* common/peg : implement ac parser
* cont : extract functions
* cont : tidy up
* cont : remove a test
* cont : move ac() def
b9754
2026-06-21 16:20:58 -05:00
Xuan-Son Nguyen
7c082bc417
server: fix report progress for loading spec models, add "stages" list ( #24870 )
...
* server: fix report progress for loading spec models, add "stages" list
* improve
* nits
* nits 2
b9753
2026-06-21 17:36:52 +02:00
Xuan-Son Nguyen
bddfd2b113
server: refactor batch construction ( #24843 )
...
* server: refactor batch construction
* wip
* wip 2
* wip 3
* wip 4
* add abort_all_slots
* handle batch full more carefully
* fix assert
* rm debug log
* small nits
* (debug) add timings
* debug: force llama_synchronize for accurate timings
* address comments
* disable DEBUG_TIMINGS
b9752
2026-06-21 14:16:11 +02:00
Xuan-Son Nguyen
0d135df48c
mtmd: fix mtmd_get_memory_usage ( #24867 )
b9751
2026-06-21 14:12:15 +02:00
Sigbjørn Skjæret
bf533823cd
jinja : implement call statement ( #24847 )
...
* implement call statement
* undo unintended change
* de-lambda
* simplify
* move caller context inside function handler
b9750
2026-06-21 14:04:52 +02:00
Xuan-Son Nguyen
2f89acc2bc
mtmd: add load progress callback ( #24865 )
2026-06-21 13:40:52 +02:00
Xuan-Son Nguyen
bfa3219177
server: add "verbose" field to schema ( #24864 )
b9748
2026-06-21 13:03:14 +02:00
Xuan-Son Nguyen
d6d899580d
server: real-time model load progress tracking via /models/sse ( #24828 )
...
* server: real-time model load progress tracking via /models/sse
* update docs
* add mutex for notify_to_router
* correct docs
b9747
2026-06-21 11:58:14 +02:00
Georgi Gerganov
8a118ee86c
minor : clean-up whitespaces ( #24862 )
...
[no ci]
2026-06-21 11:37:12 +03:00
YiChen Lv
d789527482
spec : Support Step3.5/3.7 flash mtp3 ( #24340 )
...
* add mtp_layer_offset + include nextn flags in graph reuse
* add llama_set_mtp_layer_offset + llama_model_n_nextn_layer API
* offset head select + require all MTP blocks
* speculative multi-head process()
* speculative multi-head draft()
* gather outputs via inp_out_ids
* cleanup
* fix core
* minor cleanup
* merged draft_multi_head into draft()
* mtp rename nextn
* Apply suggestions from code review
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* clean-up comments
* fix for multi seq
* apply suggestions && chain-heads comment
* add a reference for chain_heads discussion
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
b9745
2026-06-21 11:33:18 +03:00
Aldehir Rojas
063d9c156e
common/peg : refactor until gbnf grammar generation ( #24839 )
...
* common/peg : refactor until gbnf grammar into an ac automaton
* cont : add a test with multiple strings
* cont : pad state with 0s so rules line up
* cont : clean up comments
* cont : use set everywhere
* cont : inline state num string padding
* cont : add a ref to PR
* cont : fix regression in server-tools.cpp
b9744
2026-06-20 21:15:06 -05:00
Aldehir Rojas
c57607016a
common/json-schema-to-grammar : align spacing rules with parsers ( #24835 )
b9743
2026-06-20 17:43:04 -05:00
Guanhuai Zhang
4a80943174
fix(hexagon): use padded stride for ssm-conv weights ( #24470 )
b9742
2026-06-20 14:58:49 -07:00
Adrien Gallouët
84de01a1f1
llama : use LLM_KV for quantization_version & file_type ( #24802 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9741
2026-06-20 20:07:01 +02:00
Xuan-Son Nguyen
75f460ac28
arg: try fixing test-args-parser randomly fails ( #24826 )
...
* arg: try fixing test-args-parser randomly fails
* return ref
* try triggering the workflow
* exception wrapper
* wip
* test
* test 2
* arg: guard win32 utf8 argv override
make_utf8_argv rebuilds argv from GetCommandLineW to fix UTF-8 handling of
non-ASCII arguments on Windows. The override runs unconditionally inside
common_params_parse, so it also clobbers a programmatic argv passed by a
caller. test-arg-parser builds a synthetic argv but then sees the real
process command line instead, so the model argument is never parsed and the
assert that expects success aborts via fastfail (0xC0000409). This shows up
as a random failure in the OpenVINO Windows workflow.
Only override argv when its length matches the caller's argc, so the UTF-8
repair still applies to real binaries while a programmatic argv stays intact.
---------
Co-authored-by: Pascal <admin@serveurperso.com>
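The guard described in this commit can be illustrated with a small portable sketch. It is not the code from the PR: `select_argv` and its vector-based signature are hypothetical, and the Win32 step that rebuilds the argument list from `GetCommandLineW` is replaced by a plain `rebuilt_argv` parameter so the example compiles anywhere.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not the llama.cpp function): decide whether an argv
// rebuilt from the process command line may replace the argv the caller passed.
// The rebuilt argv is accepted only when its length matches the caller's argc,
// so a synthetic argv built by a test or an embedding program stays intact.
std::vector<std::string> select_argv(const std::vector<std::string> & caller_argv,
                                     const std::vector<std::string> & rebuilt_argv) {
    if (rebuilt_argv.size() == caller_argv.size()) {
        return rebuilt_argv; // same shape: assume it is the same command line, repaired
    }
    return caller_argv;      // different shape: the caller supplied its own argv
}
```

Under this sketch, a test that passes three synthetic arguments while the real process was started with one keeps its three arguments, because the lengths differ.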
b9740
2026-06-20 19:45:27 +02:00
Muhammad Salem
8452824611
release: add missing link for win opencl adreno arm64 ( #24809 )
b9739
2026-06-20 23:08:59 +08:00
Matti4
e27f308597
server: avoid forwarding auth headers in CORS proxy ( #24373 )
...
* server: avoid forwarding auth headers in CORS proxy
* format
* fix test
* fix e2e test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b9738
2026-06-20 15:34:47 +02:00
Aldehir Rojas
67e9fd3b74
docker : prebuild web UI for s390x build [no release] ( #24829 )
b9737
2026-06-20 05:54:42 -05:00
davidrhodus
796f41bedc
model : glm-dsa load DSA indexer tensors as optional ( #24770 )
...
GLM-5.2 ships the DSA "lightning indexer" on only a subset of layers (the
"full" layers; others omit it), but the GLM_DSA loader created the five
indexer tensors on every layer as required, so loading any GLM-5.2 GGUF
failed with e.g. `missing tensor 'blk.3.indexer.k_norm.weight'`.
GLM_DSA's graph is llama_model_deepseek2::graph (plain MLA) and does not use
the indexer tensors (indexer runtime not yet implemented), so they are
loaded-but-unused. Marking them TENSOR_NOT_REQUIRED lets layers without an
indexer load as nullptr and the model runs as full MLA attention.
DeepSeek-V3.2 (uniform indexer on all layers) is unaffected.
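The difference between a required and an optional tensor lookup can be sketched as follows. This is a simplified stand-in, not the llama.cpp loader: the `tensor` struct, the `find_tensor` function and the two-value flag enum are hypothetical, and only the `TENSOR_NOT_REQUIRED` name and the error text are taken from the commit message above.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical, minimal tensor record for illustration only.
struct tensor { int n_elements; };

// Simplified stand-in for the loader flags; only the NOT_REQUIRED name comes from the commit.
enum tensor_flags { TENSOR_REQUIRED = 0, TENSOR_NOT_REQUIRED = 1 };

// Look up a tensor by name. A missing required tensor is an error;
// a missing optional tensor is reported as nullptr and loading continues.
const tensor * find_tensor(const std::map<std::string, tensor> & file,
                           const std::string & name, int flags) {
    auto it = file.find(name);
    if (it != file.end()) {
        return &it->second;
    }
    if (flags & TENSOR_NOT_REQUIRED) {
        return nullptr;
    }
    throw std::runtime_error("missing tensor '" + name + "'");
}
```

With the optional flag, a layer that ships no indexer simply ends up with null pointers for those tensors, which matches the behavior the commit describes.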
b9736
2026-06-20 13:48:24 +03:00
Adrien Gallouët
37a77fb057
ggml : optimize AMX ( #24806 )
...
Flatten the partition over n_batch * M so every thread participates in
the quantization
| CPU | Model | Test | t/s OLD | t/s NEW | Speedup |
|:--------------------------------|:------------------------------|:-------|----------:|----------:|----------:|
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B IQ4_NL - 4.5 bpw | pp512 | 730.71 | 779.86 | 1.07 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B IQ4_NL - 4.5 bpw | tg128 | 87.88 | 86.79 | 0.99 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B IQ4_XS - 4.25 bpw | pp512 | 725.09 | 1023.31 | 1.41 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B IQ4_XS - 4.25 bpw | tg128 | 83.64 | 83.62 | 1.00 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_0 | pp512 | 820.51 | 924.05 | 1.13 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_0 | tg128 | 90.59 | 92.46 | 1.02 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_1 | pp512 | 776.88 | 872.79 | 1.12 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_1 | tg128 | 89.39 | 90.94 | 1.02 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_K_M | pp512 | 719.28 | 1009.27 | 1.40 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_K_M | tg128 | 80.62 | 80.86 | 1.00 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_K_S | pp512 | 732.29 | 1077.29 | 1.47 |
| Intel(R) Xeon(R) Platinum 8488C | qwen35 0.8B Q4_K_S | tg128 | 86.42 | 83.53 | 0.97 |
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
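As a generic illustration of flattening a two-dimensional work space so that it can be split evenly across threads, consider the sketch below. It is not the ggml AMX code: `work_range`, `thread_range` and the ceiling-division split are assumptions chosen for clarity, with `ith` and `nth` standing for the thread index and thread count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Half-open range [begin, end) of flat work items assigned to one thread.
struct work_range { int64_t begin; int64_t end; };

// Split the flattened index space n_batch * M across nth threads.
// A flat index i maps back to (batch, row) as (i / M, i % M).
work_range thread_range(int64_t n_batch, int64_t M, int ith, int nth) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    const int64_t total = n_batch * M;
    const int64_t per   = (total + nth - 1) / nth; // ceiling division
    const int64_t begin = std::min<int64_t>(total, per * ith);
    const int64_t end   = std::min<int64_t>(total, begin + per);
    return {begin, end};
}
```

For example, with `n_batch = 2`, `M = 4` and 4 threads, each thread receives two of the eight flat items, whereas splitting only one of the two dimensions could leave some threads idle when that dimension is small.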
b9735
2026-06-20 13:43:06 +03:00
Sigbjørn Skjæret
f4043fec01
convert : more consistent handling of rope_parameters ( #24833 )
2026-06-20 13:42:36 +03:00
Masashi Yoshimura
f449e05537
ggml-webgpu: add adapter toggles for F16 on Vulkan + NVIDIA
b9733
2026-06-20 08:12:32 +09:00
Xuan-Son Nguyen
2b686a9120
server: refactor child --> router communication ( #24821 )
...
* server: refactor child --> router communication
* fix wakeup case
* add docs
* improve update_status()
* nits
b9732
2026-06-20 01:02:26 +02:00
Adrien Gallouët
4b48a53b6c
server : optimize get_token_probabilities ( #24796 )
...
Use std::partial_sort to order only the requested top-n tokens instead
of the full vocabulary
logprobs sort: vocab=128000 n_top=0 iters=100
full sort: 8555.6 us/op
partial sort: 704.3 us/op
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
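A minimal sketch of the technique named in this commit, ordering only the requested top-n entries with `std::partial_sort`, is shown below. The `token_prob` struct and the `top_n_probs` function are hypothetical names for illustration and do not reproduce the server's `get_token_probabilities`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical (token id, probability) pair for illustration.
struct token_prob { int id; float p; };

// Return the n_top most probable tokens in descending order of probability.
// Only the first n_top positions are fully ordered; the rest of the
// vocabulary is left unsorted and then dropped.
std::vector<token_prob> top_n_probs(std::vector<token_prob> probs, std::size_t n_top) {
    n_top = std::min(n_top, probs.size());
    std::partial_sort(probs.begin(),
                      probs.begin() + static_cast<std::ptrdiff_t>(n_top),
                      probs.end(),
                      [](const token_prob & a, const token_prob & b) { return a.p > b.p; });
    probs.resize(n_top);
    return probs;
}
```

The cost of `std::partial_sort` grows with the size of the sorted prefix rather than with a full ordering of the vocabulary, which is why it pays off when `n_top` is much smaller than the vocabulary size.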
b9731
2026-06-19 23:26:54 +02:00
Xuan-Son Nguyen
e475fa2b5f
mtmd, arg: fix utf8 handling on windows ( #24779 )
...
* mtmd, arg: fix utf8 handling on windows
* also fix ggml_fopen
* fix build fail
* also fix CLI
b9730
2026-06-19 22:28:38 +02:00
Xuan-Son Nguyen
175147e8f6
server: remove all internal mentions about "webui" ( #24817 )
b9729
2026-06-19 22:12:46 +02:00
Mikolaj Kucharski
fabde3bf51
arg: Add comment line support to --api-key-file ( #23168 )
b9728
2026-06-19 17:33:54 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
0d2d9ccbf6
vendor : update cpp-httplib to 0.48.0 ( #24787 )
b9727
2026-06-19 22:16:35 +08:00
Xuan-Son Nguyen
8c2d6f6475
server: add --agent arg, remove redundant webui naming compat ( #24801 )
...
* server: add --agent arg, remove redundant webui naming compat
* correct env
* fix the test
* llama-gen-docs
* nits: wordings
b9726
2026-06-19 16:06:13 +02:00
Aldehir Rojas
38724ab593
docker : build the UI ( #24794 )
...
* docker : build the UI
* cont : use existing APP_VERSION
b9725
2026-06-19 15:32:31 +02:00
Xuan-Son Nguyen
e2e7a9b2d0
mtmd: several bug fixes ( #24784 )
...
* mtmd: several bug fixes
* fix build
* fix gemma4ua
* add sanity check in get_u32()
* fix build (2)
* area() avoid overflow
b9724
2026-06-19 12:18:36 +02:00
Ruixiang Wang
b14e3fb90c
spec: support eagle3 for qwen3.5 & 3.6 ( #24593 )
...
* spec: support qwen3.5 & 3.6 eagle3 draft
* eagle3: Add deferred boundary checkpoints restore support for hybrid models
* apply suggestions
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spec: adapt to API change
* spec: fix naming
* cont : add TODO
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b9723
2026-06-19 13:08:50 +03:00
Xuan-Son Nguyen
159d093a43
server: fix non-bound n_discard value (ctx shifting) ( #24786 )
...
* server: fix non-bound n_discard value
* Update tools/server/server-context.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b9722
2026-06-19 10:53:44 +02:00
Georgi Gerganov
5fd2dc2c41
sync : ggml
b9721
2026-06-19 10:19:14 +03:00
Georgi Gerganov
1868af13ac
ggml : bump version to 0.15.2 (ggml/1548)
2026-06-19 10:19:14 +03:00
Georgi Gerganov
5bd21b8555
pi : remove docs from system prompt ( #24791 )
2026-06-19 09:34:00 +03:00
Georgi Gerganov
80452d65b9
server : consolidate slot selection into get_available_slot ( #24755 )
...
Absorb get_slot_by_id logic into get_available_slot so slot selection
is handled by a single function call. When a specific slot id is
requested, the LCP similarity check still runs to enable proper
prompt cache updates.
Assisted-by: pi:llama.cpp/Qwen3.6-27B
b9718
2026-06-19 09:22:34 +03:00
shalinib-ibm
8141e730f1
ggml-cpu: support K tails in power10 Q8/Q4 MMA matmul ( #24753 )
...
* ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul
This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path. Process the final K panel using its actual depth and pass the reduced panel size through packing and kernel execution. This allows more workloads to use the MMA kernel and reduces fallback to mnpack.
* Apply suggestion from @taronaeo
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
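The idea of processing the final K panel with its actual depth, instead of requiring K to be divisible by the panel size, can be sketched with a scalar dot product. This is an assumption-laden illustration in plain C++: `dot_panels` is a hypothetical function, `kc` is the panel depth mentioned in the commit, and none of the Power10 MMA packing or kernel code is reproduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product computed panel by panel along K.
// The last panel uses its real depth (K - k0) when K is not a multiple of kc,
// so no fallback path is needed for the tail.
float dot_panels(const std::vector<float> & a, const std::vector<float> & b, std::size_t kc) {
    assert(a.size() == b.size() && kc > 0);
    const std::size_t K = a.size();
    float acc = 0.0f;
    for (std::size_t k0 = 0; k0 < K; k0 += kc) {
        const std::size_t depth = std::min(kc, K - k0); // reduced depth on the tail panel
        for (std::size_t i = 0; i < depth; ++i) {
            acc += a[k0 + i] * b[k0 + i];
        }
    }
    return acc;
}
```

With `K = 5` and `kc = 2`, the loop runs two full panels of depth 2 and one tail panel of depth 1.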
b9717
2026-06-19 08:55:38 +03:00
Xuan-Son Nguyen
db52540f73
mtmd: add batching support for internvl ( #24775 )
b9716
2026-06-19 01:16:16 +02:00
Pascal
3a3edc9ac6
Ggml/cuda col2im 1d ( #24417 )
...
* cuda: add GGML_OP_COL2IM_1D, follow-up to the CPU op
* cuda: col2im_1d use fast_div_modulo for the index decomposition
* cuda: col2im_1d tighten supports_op, type match and contiguous dst
b9715
2026-06-18 22:23:01 +02:00
Reguna
40f3aafc45
server: add "X-Accel-Buffering": "no" header to streaming endpoints ( #24774 )
...
* server: add "X-Accel-Buffering": "no" header to streaming endpoints
This header tells Nginx (as a reverse proxy) to NOT buffer responses. (only affects streaming endpoints)
Without it, Nginx will break streaming with certain applications (notably the Pi coding harness).
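A minimal sketch of the header set for a streaming response is shown below. The `streaming_headers` function and the use of a plain `std::map` are hypothetical and independent of the server's HTTP library; the `text/event-stream` content type is an assumption about a typical server-sent-events endpoint, while the `X-Accel-Buffering: no` pair is the header named in the commit.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: headers attached to a streaming (SSE-style) response.
std::map<std::string, std::string> streaming_headers() {
    return {
        {"Content-Type",      "text/event-stream"}, // assumed content type for streaming
        {"X-Accel-Buffering", "no"},                // ask an Nginx reverse proxy not to buffer
    };
}
```

Non-streaming endpoints would not include the extra header, since the commit states that it only affects streaming endpoints.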
b9714
2026-06-18 22:01:24 +02:00
Xuan-Son Nguyen
a6b3260a42
mtmd: add batching for mtmd-cli, add video tests ( #24778 )
b9713
2026-06-18 21:55:04 +02:00
o7si
32eddaf2ea
cmake : fix ui build with read-only source ( #24752 )
b9712
2026-06-18 18:59:18 +02:00
Xuan-Son Nguyen
060ce1bf72
mtmd: refactor llava-uhd overview image handling (always use ov_img_first) ( #24769 )
...
* add dedicated "overview" for mtmd_image_preproc_out
* corrections
* correct (again)
* nits
* nits (2)
b9711
2026-06-18 18:53:49 +02:00
Max Krasnyansky
d2c67959b3
hexagon: support for op-trace (fine-grain tracing of HVX/HMX/DMA events) ( #24592 )
...
* hex-optrace: add support for optrace and instrument matmul and flash-atten code
* hex-trace: improve trace event and perfetto generator
* hex-trace: add new script dedicated to handling traces, specifically perfetto traces
* hex-trace: add --head/--tail options to profile and trace tools
* hex-trace: fix whitespaces
* hex-trace: fix flake8 warnings
* hex-trace: fix flake8 warnings
* hmx-fa: restore q_tiles clearing
* hex-profile: remove circular dep in includes
* hex-trace: simplify trace sizing check
* hex-profile: sort events in the summary by name
2026-06-18 08:35:02 -07:00
Kangjia Gao
7b6c5a2aed
docs: fix export-lora --lora-scaled syntax [no release] ( #24703 )
...
Assisted-by: Codex
2026-06-18 16:46:17 +02:00
Xuan-Son Nguyen
fe7c8b2414
server: (router) fix stopping_thread potentially hang ( #24728 )
...
* server: (router) fix stopping_thread potentially hang
* fix windows build
2026-06-18 15:41:09 +02:00
Xuan-Son Nguyen
e1efd0991d
server: add "schema" and validation ( #24150 )
...
* wip
* working
* correct some limits
* add field name to error message
b9707
2026-06-18 15:40:58 +02:00
Aarni Koskela
08023072ef
server : add last-5-seconds generation speed display ( #24291 )
...
* server : add last-5-seconds generation speed display
* cont : clean-up
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-18 14:02:20 +02:00