Ruben Ortlam
5a7462237e
remove duplicated init calls
2026-06-19 11:07:38 +02:00
Ruben Ortlam
79210e3046
cleanup unused variable
2026-06-19 11:07:38 +02:00
Ruben Ortlam
84c4214b39
precompute name->buft map, map GPU host types to CPU buft
2026-06-19 11:07:38 +02:00
Ruben Ortlam
dbc5f7ec82
move model memory estimation to subprocess
2026-06-19 11:07:38 +02:00
Ruben Ortlam
384a495a00
extract duplicated check into helper function
2026-06-19 11:07:38 +02:00
Ruben Ortlam
997491a644
replace device memory map with buft memory map. Use llama_get_memory_breakdown
2026-06-19 11:07:38 +02:00
Georgi Gerganov
a35afd504f
cont : clean-up
2026-06-19 11:07:38 +02:00
Ruben Ortlam
3046b8853a
also strip models memory margin from child processes
2026-06-19 11:05:24 +02:00
Ruben Ortlam
216aaf1ad6
improve variable naming, fix style
2026-06-19 11:05:24 +02:00
Ruben Ortlam
ff41b3dbf7
improve memory_per_device map naming
2026-06-19 11:05:24 +02:00
Ruben Ortlam
0e2f08a535
fix model count exceeded check
2026-06-19 11:05:24 +02:00
Ruben Ortlam
669948ce12
move llama_context_device_memory function to llama-ext.h
2026-06-19 11:05:24 +02:00
Ruben Ortlam
09d8eb95a4
add server memory debug logging
2026-06-19 11:05:24 +02:00
Ruben Ortlam
c749b6882c
use memory margin instead of total size limit, apply to each device separately
2026-06-19 11:05:24 +02:00
Ruben Ortlam
4ed48154b0
only set model memory_mb if not previously calculated
2026-06-19 11:05:01 +02:00
Ruben Ortlam
6178b8755d
use no_alloc to get memory requirements for model load
2026-06-19 11:05:01 +02:00
Ruben Ortlam
340c867179
estimate with to-be-loaded model size included
2026-06-19 11:05:01 +02:00
Ruben Ortlam
f38c4f9419
server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold
2026-06-19 11:05:01 +02:00
Xuan-Son Nguyen
159d093a43
server: fix non-bound n_discard value (ctx shifting) (#24786)
...
* server: fix non-bound n_discard value
* Update tools/server/server-context.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b9722
2026-06-19 10:53:44 +02:00
Georgi Gerganov
5fd2dc2c41
sync : ggml
b9721
2026-06-19 10:19:14 +03:00
Georgi Gerganov
1868af13ac
ggml : bump version to 0.15.2 (ggml/1548)
2026-06-19 10:19:14 +03:00
Georgi Gerganov
5bd21b8555
pi : remove docs from system prompt (#24791)
2026-06-19 09:34:00 +03:00
Georgi Gerganov
80452d65b9
server : consolidate slot selection into get_available_slot (#24755)
...
Absorb get_slot_by_id logic into get_available_slot so slot selection
is handled by a single function call. When a specific slot id is
requested, the LCP similarity check still runs to enable proper
prompt cache updates.
Assisted-by: pi:llama.cpp/Qwen3.6-27B
b9718
2026-06-19 09:22:34 +03:00
shalinib-ibm
8141e730f1
ggml-cpu: support K tails in power10 Q8/Q4 MMA matmul (#24753)
...
* ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul
This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path. Process the final K panel using its actual depth and pass the reduced panel size through packing and kernel execution. This allows more workloads to use the MMA kernel and reduces fallback to mnpack.
* Apply suggestion from @taronaeo
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
b9717
2026-06-19 08:55:38 +03:00
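The K-tail handling described in commit 8141e730f1 above can be illustrated with a minimal sketch. This is not the tinyBlas_Q0_PPC code: it is a plain float matmul, and the function name `matmul_k_panels` and its parameters are assumptions made up for illustration. It only shows the idea of walking K in panels of depth `kc` and using the actual depth for the final, shorter panel instead of requiring K to be divisible by `kc`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only; not the Power10 MMA kernel.
// Computes C (m x n) = A (m x k) * B (k x n), all row-major,
// processing the K dimension in panels of at most kc elements.
std::vector<float> matmul_k_panels(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   std::size_t m, std::size_t n, std::size_t k,
                                   std::size_t kc) {
    assert(kc > 0);
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t k0 = 0; k0 < k; k0 += kc) {
        // The last panel may be shallower than kc; use its actual depth
        // rather than assuming k is a multiple of kc.
        const std::size_t depth = std::min(kc, k - k0);
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (std::size_t p = 0; p < depth; ++p) {
                    acc += a[i * k + k0 + p] * b[(k0 + p) * n + j];
                }
                c[i * n + j] += acc;
            }
        }
    }
    return c;
}
```

With k = 5 and kc = 2, the loop runs panels of depth 2, 2, and 1, so no fallback path is needed for the remainder.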
Xuan-Son Nguyen
db52540f73
mtmd: add batching support for internvl (#24775)
b9716
2026-06-19 01:16:16 +02:00
Pascal
3a3edc9ac6
Ggml/cuda col2im 1d (#24417)
...
* cuda: add GGML_OP_COL2IM_1D, follow-up to the CPU op
* cuda: col2im_1d use fast_div_modulo for the index decomposition
* cuda: col2im_1d tighten supports_op, type match and contiguous dst
b9715
2026-06-18 22:23:01 +02:00
Reguna
40f3aafc45
server: add "X-Accel-Buffering": "no" header to streaming endpoints (#24774)
...
* server: add "X-Accel-Buffering": "no" header to streaming endpoints
This header tells Nginx (acting as a reverse proxy) not to buffer responses. It only affects streaming endpoints.
Without it, Nginx will break streaming with certain applications (notably the Pi coding harness).
b9714
2026-06-18 22:01:24 +02:00
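As a rough sketch of the change in commit 40f3aafc45 above: the helper name `make_response_headers`, its signature, and the content types used here are assumptions for illustration, not the server's actual code. The only detail taken from the commit is that streaming responses carry `X-Accel-Buffering: no`.

```cpp
#include <map>
#include <string>

// Hypothetical helper; name, signature, and content types are assumptions.
// Builds the response headers, adding "X-Accel-Buffering: no" only for
// streaming responses so a reverse proxy such as Nginx does not buffer them.
std::map<std::string, std::string> make_response_headers(bool streaming) {
    std::map<std::string, std::string> headers;
    headers["Content-Type"] = streaming ? "text/event-stream" : "application/json";
    if (streaming) {
        headers["X-Accel-Buffering"] = "no";
    }
    return headers;
}
```

Non-streaming responses are left untouched, which matches the commit's note that only streaming endpoints are affected.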
Xuan-Son Nguyen
a6b3260a42
mtmd: add batching for mtmd-cli, add video tests (#24778)
b9713
2026-06-18 21:55:04 +02:00
o7si
32eddaf2ea
cmake : fix ui build with read-only source (#24752)
b9712
2026-06-18 18:59:18 +02:00
Xuan-Son Nguyen
060ce1bf72
mtmd: refactor llava-uhd overview image handling (always use ov_img_first) (#24769)
...
* add dedicated "overview" for mtmd_image_preproc_out
* corrections
* correct (again)
* nits
* nits (2)
b9711
2026-06-18 18:53:49 +02:00
Max Krasnyansky
d2c67959b3
hexagon: support for op-trace (fine-grain tracing of HVX/HMX/DMA events) (#24592)
...
* hex-optrace: add support for optrace and instrument matmul and flash-atten code
* hex-trace: improve trace event and perfetto generator
* hex-trace: add new script dedicated to handling traces, specifically perfetto traces
* hex-trace: add --head/--tail options to profile and trace tools
* hex-trace: fix whitespaces
* hex-trace: fix flake8 warnings
* hex-trace: fix flake8 warnings
* hmx-fa: restore q_tiles clearing
* hex-profile: remove circular dep in includes
* hex-trace: simplify trace sizing check
* hex-profile: sort events in the summary by name
2026-06-18 08:35:02 -07:00
Kangjia Gao
7b6c5a2aed
docs: fix export-lora --lora-scaled syntax [no release] (#24703)
...
Assisted-by: Codex
2026-06-18 16:46:17 +02:00
Xuan-Son Nguyen
fe7c8b2414
server: (router) fix stopping_thread potentially hanging (#24728)
...
* server: (router) fix stopping_thread potentially hanging
* fix windows build
2026-06-18 15:41:09 +02:00
Xuan-Son Nguyen
e1efd0991d
server: add "schema" and validation (#24150)
...
* wip
* working
* correct some limits
* add field name to error message
b9707
2026-06-18 15:40:58 +02:00
Aarni Koskela
08023072ef
server : add last-5-seconds generation speed display (#24291)
...
* server : add last-5-seconds generation speed display
* cont : clean-up
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-18 14:02:20 +02:00
Amos Wong
20832179e2
ui: provide touch accessible model selection UI (#24604)
...
* ui : add model selector storybook stories
Covers list, favorites, single-model, all status states
(loading/loaded/sleeping/failed/idle), and selection states.
* ui : improve model selector mobile UX with hover media queries
Use @media (hover:none) to show action buttons directly on touch
devices and color-code them by model status (amber=sleeping,
green=loaded, muted=idle). Status dots hidden on touch. Desktop
hover behavior unchanged.
2026-06-18 13:14:20 +02:00
Anuj Attri
10786217e9
server : return HTTP 400 on invalid grammar (#24144) (#24154)
...
Throw on grammar parse failure so the server returns HTTP 400
instead of silently dropping the constraint.
Add a regression test for the invalid-grammar response.
Fixes #24144
b9704
2026-06-18 12:49:14 +02:00
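The behavior described in commit 10786217e9 above (throw on parse failure so the request is rejected instead of the constraint being dropped) might look roughly like the following. Both `parse_grammar_or_throw` and `handle_request` are hypothetical names, and the "contains `::=`" check is a toy stand-in for a real grammar parser; none of this is the server's actual code.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a grammar parser: a grammar counts as valid
// here only if it contains the rule operator "::=".
void parse_grammar_or_throw(const std::string& grammar) {
    if (grammar.find("::=") == std::string::npos) {
        throw std::invalid_argument("failed to parse grammar");
    }
}

// Hypothetical request handler that returns an HTTP status code.
// A parse failure surfaces as 400 rather than being silently ignored.
int handle_request(const std::string& grammar) {
    try {
        if (!grammar.empty()) {
            parse_grammar_or_throw(grammar);
        }
    } catch (const std::invalid_argument&) {
        return 400;
    }
    return 200;
}
```

Requests without a grammar are unaffected; only a grammar that fails to parse is rejected.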
Xuan-Son Nguyen
552258c535
server: (router) rework -hf preset repo (#24739)
...
* server: temporarily remove HF remote preset
* rework remove preset.ini support
* rm unused get_remote_preset_whitelist()
* print warning
* add docs
* rm stray file
b9703
2026-06-18 12:45:23 +02:00
Xuan-Son Nguyen
968c43891a
server: fix router args not being forwarded to child instances (#24760)
b9702
2026-06-18 12:15:46 +02:00
Xuan-Son Nguyen
24bba7b98e
mtmd: refactor preprocessor, add mtmd_image_preproc_out (#24736)
...
* add mtmd_image_preproc_out
* add dev docs
* remove unused clip API
* rm unused clip_image_f32_batch::grid
* change preprocess() call signature
b9701
2026-06-18 12:04:39 +02:00
Neo Zhang
9724f664e8
[SYCL] rename GGML_SYCL_SUPPORT_LEVEL_ZERO (#24719)
...
* rename GGML_SYCL_SUPPORT_LEVEL_ZERO to GGML_SYCL_SUPPORT_LEVEL_ZERO_API, and GGML_SYCL_ENABLE_LEVEL_ZERO to GGML_SYCL_USE_LEVEL_ZERO_API
* fix code format
* fix error introduced when rebasing
b9700
2026-06-18 11:18:26 +03:00
Neo Zhang
dd69db2924
sycl : support MUL_MAT and OUT_PROD with Q1_0 (#24721)
b9699
2026-06-18 11:17:37 +03:00
Adrien Gallouët
6ec59ddaea
app : enable self-update only when built with llama-install.sh (#24754)
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9698
2026-06-18 09:57:59 +02:00
Sigbjørn Skjæret
32e806b9c1
ci : fix check-release message parsing (#24751)
b9697
2026-06-18 09:32:56 +02:00
Neo Zhang
6f1034b32a
[SYCL] support OPs: conv_2d, conv_2d_dw, conv2d_transpose (#24600)
...
* fix conflict
* fix format issue, rename
* rm debug code
* correct the file name
2026-06-18 09:40:03 +03:00
Aleksander Grygier
0b73fc79fe
ui: Update code formatting command in pre-commit hook (#24685)
2026-06-18 08:33:50 +02:00
Ravi Panchumarthy
4a79037b8b
ci : fix Windows x64 (OpenVINO) release link (#24731)
b9694
2026-06-18 08:30:08 +02:00
Georgi Gerganov
cae0a3b0b0
metal : check for BF16 support in concat kernel (#24747)
b9693
2026-06-18 09:16:06 +03:00
Xuan-Son Nguyen
f3e1828164
mtmd: llava_uhd should no longer use batch dim (#24732)
b9692
2026-06-17 22:40:50 +02:00
shalinib-ibm
2e88c49c90
ggml-cpu: Conditionally enable power11 backend based on compiler support (#24687)
...
* ggml: Conditionally enable power11 backend based on compiler support
Guard POWER11 backend creation behind a compiler flag check for -mcpu=power11. This avoids build failures on current GCC/Clang toolchains while preserving forward compatibility once POWER11 support becomes available.
* Update CMakeLists.txt
ggml-cpu: Use -mcpu=power10 for P10 and P11
b9691
2026-06-18 02:45:19 +08:00