mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-29 15:50:22 +00:00
b8024
649 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
490eb96b88 | metal : support GGML_OP_SET (#19548) | ||
|
|
3b3a948134 | metal : update sum_rows kernel to support float4 (#19524) | ||
|
|
914dde72ba |
ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511)
* ggml : unary ops support non-cont src0 * metal : support F16 unary ops + fix ELU |
||
|
|
89181c0b6d |
ggml : extend bin bcast for permuted src1 (#19484)
* tests : extend bin bcast for permuted src1 * cont : extend bin support * cont : s0 is always 1 * tests : simplify |
||
|
|
ceaa89b786 | metal : consolidate unary ops (#19490) | ||
|
|
9a96352729 | test: fix IMROPE perf test case (#19465) | ||
|
|
a0d585537c |
cuda : extend GGML_OP_PAD to work with non-cont src0 (#19429)
* cuda : extend GGML_OP_PAD to work with non-cont src0 * tests : add permuted pad |
||
|
|
1e8924fd65 |
cmake : add variable to skip installing tests (#19370)
When packaging downstream, there's usually little point in installing test. The default behaviour remains the same. |
||
|
|
db6adb3c88 |
tests: reduce number of FA test permutations (#19381)
Only test non-F16 for head size 64 and 72 (one a multiple of QK, one not). |
||
|
|
449ec2ab07 |
vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (#19281)
Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases. Apply this optimization when the mask is relatively large (i.e. prompt processing). |
||
|
|
eaba92c3dc |
tests : add non-cont, inplace rope tests (#19296)
* tests : add non-cont, inplace rope tests * cont : exercise dim 3 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * cont : more dim3 exercises --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com> |
||
|
|
0dfcd3b607 |
jinja : add missing 'in' test to template engine (#19004) (#19239)
* jinja : add missing 'in' test to template engine (#19004) The jinja template parser was missing the 'in' test from global_builtins(), causing templates using reject("in", ...), select("in", ...), or 'x is in(y)' to fail with "selectattr: unknown test 'in'". This broke tool-calling for Qwen3-Coder and any other model whose chat template uses the 'in' test. Added test_is_in supporting array, string, and object containment checks, mirroring the existing 'in' operator logic in runtime.cpp. Includes test cases for all three containment types plus reject/select filter usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * reuse test_is_in in binary op --------- Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> |
||
|
|
9f682fb640 |
ggml-cpu: FA split across kv for faster TG (#19209)
* ggml-cpu: split across kv for faster TG * simplify sinks application * add ref impl |
||
|
|
7a4ca3cbd9 |
docs : Minor cleanups (#19252)
* Update old URLs to github.com/ggml-org/ * Bump copyrights |
||
|
|
c3b87cebff | tests : add GQA=20 FA test (#19095) | ||
|
|
7b7ae857f6 |
chat : add parsing for solar-open-100b (#18540)
* chat : add parsing for solar-open-100b * add comments to rules * cont : make assistant start optional * cont : remove assistant start prefix altogether --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> |
||
|
|
b45ef2702c | jinja : do not pass empty tools and add some none filters (#19176) | ||
|
|
60368e1d73 |
jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)
* undefined is treated as iterable (string/array) by filters `tojson` is not a supported `undefined` filter * add tests * add sequence and iterable tests keep it DRY and fix some types |
||
|
|
2b4cbd2834 |
jinja : implement mixed type object keys (#18955)
* implement mixed type object keys * add tests * refactor * minor fixes * massive refactor * add more tests * forgotten tuples * fix array/object is_hashable * correct (albeit broken) jinja responses verified with transformers * improved hashing and equality * refactor hash function * more exhausive test case * clean up * cont * cont (2) * missing cstring --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> |
||
|
|
b0311c16d2 | CUDA: fix padding of GQA to power of 2 in FA (#19115) | ||
|
|
4e5b83b226 | GGUF: check that tensor size is representable (#19072) | ||
|
|
51fa458a92 |
server : support preserving reasoning_content in assistant message (#18994)
* support reasoning_content input * report template caps to webui * add docs * rm commented code |
||
|
|
a5eaa1d6a3 |
mla : make the V tensor a view of K (#18986)
* mla : pass V as a view of K to the FA op * cuda : adjust mla logic to new layout * kv-cache : fix rope shift * tests : remove comment * cuda : fix reusable_cutoff Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> |
||
|
|
c301172f66 |
jinja: support none|string (#18995)
* jinja: support none|string * Update common/jinja/value.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-jinja.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add as_string() --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> |
||
|
|
33f890e579 | vulkan: support flash attention GQA/split_k with small batches (#18938) | ||
|
|
2c1f199653 |
cli : fix reasoning responses in CLI (#18961)
* cli : fix reasoning responses in CLI * fix build * fix build (2) |
||
|
|
959ecf7f23 |
jinja : fix undefined keys and attributes and int/float as bool (#18924)
* fix undefined keys and attributes * add falsy tests * as_bool for integers and floats * more falsy/truthy tests * --typo |
||
|
|
4037093c66 | ci : run test-jinja -py on high perf [no ci] (#18916) | ||
|
|
fe44d35574 |
tests : add test-jinja -py option for cross-checking (#18906)
* tests : add test-jinja -py option or cross-checking * Update tests/test-jinja.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix + add source * SandboxedEnvironment * fix array.map case --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> |
||
|
|
d03c45c9c5 |
jinja : attribute support for join, map and sort (#18883)
* support negative array index and default value * attribute support (int and str) for join, map and sort * add tests * update CODEOWNERS * improve fixme sorting comment |
||
|
|
10c98cbdf6 |
jinja : add missing tojson filter for bool (#18900)
* add missing tojson for bool * add more literal tests |
||
|
|
420960ab92 |
jinja : fix lexing of float literals with sign (#18901)
* fix lexing of float literals with sign * add test * consume_numeric |
||
|
|
f55b033ae6 | jinja: correct member access rule (#18905) | ||
|
|
388ce82241 |
ggml : extend ggml_pool_1d + metal (#16429)
* chore: resolve conflicts * feat: ggml metal impl * fix: ggml_metal_kargs_pool_1d struct * fix: require contiguous input * chore: test pool_1d * chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts * chore: add p0 and s0 to testing * fix: allow padding for cpu and metal * Update ggml/src/ggml-metal/ggml-metal.metal * fix: correct single-threaded loop * ggml : cleanup * tests : add ne[1] != 1 tests * fix: ne[1] handling in np * cont : fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> |
||
|
|
c15395f73c |
common : implement new jinja template engine (#18462)
* jinja vm * lexer * add vm types * demo * clean up * parser ok * binary_expression::execute * shadow naming * bin ops works! * fix map object * add string builtins * add more builtins * wip * use mk_val * eval with is_user_input * render gemma tmpl ok * track input string even after transformations * support binded functions * keyword arguments and slicing array * use shared_ptr for values * add mk_stmt * allow print source on exception * fix negate test * testing more templates * mostly works * add filter_statement * allow func to access ctx * add jinja-value.cpp * impl global_from_json * a lot of fixes * more tests * more fix, more tests * more fixes * rm workarounds * demo: type inferrence * add placeholder for tojson * improve function args handling * rm type inference * no more std::regex * trailing spaces * make testing more flexible * make output a bit cleaner * (wip) redirect minja calls * test: add --output * fix crash on macro kwargs * add minimal caps system * add some workarounds * rm caps_apply_workarounds * get rid of preprocessing * more fixes * fix test-chat-template * move test-chat-jinja into test-chat-template * rm test-chat-jinja from cmake * test-chat-template: use common * fix build * fix build (2) * rename vm --> interpreter * improve error reporting * correct lstrip behavior * add tojson * more fixes * disable tests for COMMON_CHAT_FORMAT_GENERIC * make sure tojson output correct order * add object.length * fully functional selectattr / rejectattr * improve error reporting * more builtins added, more fixes * create jinja rendering tests * fix testing.h path * adjust whitespace rules * more fixes * temporary disable test for ibm-granite * r/lstrip behavior matched with hf.js * minimax, glm4.5 ok * add append and pop * kimi-k2 ok * test-chat passed * fix lstrip_block * add more jinja tests * cast to unsigned char * allow dict key to be numeric * nemotron: rm windows newline * tests ok * fix test * rename interpreter --> runtime * fix build * add more checks * bring back generic format support * fix Apertus * [json.exception.out_of_range.403] key 'content' not found * rm generic test * refactor input marking * add docs * fix windows build * clarify error message * improved tests * split/rsplit with maxsplit * non-inverse maxsplit forgot to change after simplifying * implement separators for tojson and fix indent * i like to move it move it * rename null -- > none * token::eof * some nits + comments * add exception classes for lexer and parser * null -> none * rename global -> env * rm minja * update docs * docs: add input marking caveats * imlement missing jinja-tests functions * oops * support trim filter with args, remove bogus to_json reference * numerous argument fixes * updated tests * implement optional strip chars parameter * use new chars parameter * float filter also has default * always leave at least one decimal in float string * jinja : static analysis + header cleanup + minor fixes * add fuzz test * add string.cpp * fix chat_template_kwargs * nits * fix build * revert * unrevert sorry :) * add fuzz func_args, refactor to be safer * fix array.map() * loosen ensure_vals max count condition, add not impl for map(int) * hopefully fix windows * check if empty first * normalize newlines --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> |
||
|
|
ec997b4f2b |
tests : download models only when running ctest (#18843)
Signed-off-by: Adrien Gallouët <angt@huggingface.co> |
||
|
|
36f0132464 |
CUDA: Factor out and re-use block_reduce function (#18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function * Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it Moving smem out of `__device__` function to `__global__` function allows for explicit smem reuse, as either compiler or cuda rt seem to not free it afterwards (`cudaFuncSetAttribute` fails when not accounting for it once for each call to two_stage_warp_reduce) * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Use two_stage_warp_reduce in group_norm_f32 * Use two_stage_warp_reduce in rms_norm_f32 * Fix smem calculation which expects bytes * Make `two_stage_warp_reduce` accept all values warp_reduce accepts Also integrate it into norm_f32 function * Use two_stage_warp_reduce in l2_norm_f32 * Use type traits for block reduction for better legibility Also adresss other requests by @am17an such as variable renaming * Make norm tests cover all cuda paths * Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK Unit-tests passed locally, let's see if they pass in the CI as well * Use `enum class` for `block_reduce_method` This is more type-safe than plain enum * Rename variables as suggested in code review by @am17an * Rename two_stage_warp_reduce -> block_reduce * Fix trailing whitespace in common.cuh * Make condition of static_assert type-dependent This delays evaluation until the template is actually instantiated. Otherwise, some compilers may evaluate the assert when parsing the template, resulting in build errors as observed here: https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785 * Inline definitions --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com> |
||
|
|
f709c7a33f |
ci, tests : use cmake to download models and remove libcurl dependency (#18791)
* ci, tests : use cmake to download models and remove libcurl dependency * llama_dl_model -> llama_download_model * use EXPECTED_HASH for robust model downloading * Move llama_download_model to cmake/common.cmake Signed-off-by: Adrien Gallouët <angt@huggingface.co> |
||
|
|
2bbe4c2cf8 |
vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128. This should work when the number of blocks in the A matrix is less than 2^32 (for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like 2^32*LOAD_VEC_A elements. - Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b. - Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle variants. So far this change just adds a single use case for this, compiling with the e64BitIndexingEXT flag. - Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange. 64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort to avoid enabling it unconditionally. |
||
|
|
84ae04f163 |
tests : refactor test-backend-sampler (#18753)
* tests : use "auto", use std::string * tests : refactor test-backend-sampler.cpp * cmake : remove redundant declarations * ci : use smaller model * tests : add struct test_params * tests : reduce logit bias 100.0f -> 10.0f |
||
|
|
b137718878 | test-backend-ops: fix mxfp4 tests on blackwell (#18736) | ||
|
|
55abc39355 |
vendor : update cpp-httplib to 0.30.0 (#18660)
* vendor : update cpp-httplib to 0.30.0 * common : allow custom headers when downloading |
||
|
|
07fbe19f1f |
arg: use CSV escape style for multiple-value args (#18643)
* arg: use CSV escape style for multiple-value args * add test |
||
|
|
f1768d8f03 | vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (#18582) | ||
|
|
b37124d2d2 |
vulkan: handle quantize_q8_1 overflowing the max workgroup count (#18515)
* vulkan: handle quantize_q8_1 overflowing the max workgroup count * vulkan: Fix small tile size matmul on lavapipe * fix mul_mat_id failures |
||
|
|
67e3f6f601 |
CANN: add operator fusion support for ADD + RMS_NORM (#17512)
This commit implements operator fusion for ADD + RMS_NORM operations in the CANN backend to reduce memory access overhead and improve performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION environment variable (default: false). Changes: - Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm - Add ggml_cann_can_fuse() to check fusion eligibility - Integrate fusion logic into computation graph evaluation - Add test cases for ADD + RMS_NORM fusion - Update documentation with new environment variable The fusion combines ADD and RMS_NORM into a single kernel call, which is more efficient than executing them separately. |
||
|
|
d3dce4e0a5 |
sampling : add support for backend sampling (#17004)
* sampling : add support for backend sampling This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilites need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works, it is a function that is called by build_graph and the sampler operations become part of the models computation graph. * llama-cli : add backend sampler configuration * server : add backend sampling options/configuration * webui : add backend sampling options * ggml : add initial cumsum implementation for CUDA * sampling : enable all backend sampler tests This commit enables all exisiting backend sampler tests in the test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations. * graph : do not include llama-model.h * sampling : always expose sampled_ids This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that this enables both common/sampling.cpp and src/llama-sampling.cpp can simplify their logic. Not all backends samplers that process logits need to set the sampled_tokens_id as they may not change the order of the logits, for example the temperature sampler only scales the logits but does not change their order. Simliar the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful. * sampling : ensure at most one output token per seq This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence. * CUDA: Optimize argsort for gpu-based token sampling Argsort is used for top-k currently. WE optimize argsort by 2 things: 1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs 2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths, it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics. https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview Some perf numbers for a RTX PRO 6000: On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf` Before: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run ``` After: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run ``` --- On the model level, tested with `llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling` Before: ``` llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second) llama_perf_context_print: load time = 18215.58 ms llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second) llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second) llama_perf_context_print: total time = 857.62 ms / 206 tokens ``` After ``` llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second) llama_perf_context_print: load time = 18366.92 ms llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second) llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second) llama_perf_context_print: total time = 683.65 ms / 206 tokens ``` * sampling : remove version from sampler chain This commit removes the version field from the sampler chain and instead used the sampler pointer itself for change detection. * sampling : always populate logits for sampled probs This commit updates common/sampler.cpp set_logits and src/llama-sampling.cpp llama_sampler_sample to always populate the logits field when backend sampled probabilities are available. The motivation for this is that this ensure that CPU sampler always have access to the logits values even when probabilites have been produced by backend samplers. * sampling : simplify backend sampling logic decode This commit tries to simplify the backend sampling logic in llama_context::decode. * squash! sampling : simplify backend sampling logic decode Fix condition to check if backend actually sampled tokens, not just that backend samplers are available. * common : fix regression caused by extra memory allocations during sampling * squash! sampling : simplify backend sampling logic decode The commit fixes a variable shadowing issue in the `llama_context::decode` function which was introduced in a previous refactoring. * squash! common : fix regression caused by extra memory allocations during sampling Apply the same changes to llama-sampling.cpp, llama_sampler_sample as were applied in commit |
||
|
|
cef1d23c5a |
common/grammar : replace problematic backtracking regex [\s\S]* (#18342)
* grammar : add support for std::regex_search() with trigger patterns * common : update hermes2 pro trigger to search instead of match * common : use regex_search with anchoring for partial matching * common : adjust regex partial tests to use new pattern * grammar : check pattern directly instead of adding a type * common : adjust existing patterns to match new semantics |
||
|
|
be47fb9285 |
vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron Also handle GGML_OP_SCALE at the end (nemotron, deepseek2). Fewer pipeline variants and spec constants, just use push constants. In test_topk_moe, change exp_probs_b to be 1D, matching real networks. Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified. * change test_topk_moe to allow results in arbitrary order * disable sigmoid fusion for moltenvk |
||
|
|
4cd162a123 |
chat: make tool description and parameters optional per OpenAI spec (#18478)
* chat: make tool description and parameters optional per OpenAI spec Per the OpenAI API specification, both 'description' and 'parameters' fields in tool function definitions are optional. Previously, the parser would throw an exception if these fields were missing. Attempts to fix #17667 * refactor: use value() for cleaner optional field access |