mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-30 08:10:20 +00:00
b7839
705 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
16639ba217 |
common : use two decimal places for float arg help messages (#19048)
* common : use two decimal places for float arg help messages This commit updates the help messages for various command-line arguments in arg.cpp to display floating-point default values with two decimal places instead of one. The motivation for this change is that with only one decimal place, values generated using --help or llama-gen-docs do not display the correct values. For example, the value of top-p in tools/server/README.md is currently `0.9`, but the default value is actually `0.95`. Running llama-gen-docs does not update this value because it uses the output of the help message, which shows only one decimal place, so the values appear unchanged. * docs : run llama-gen-docs to update docs |
||
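The rounding effect described in #19048 can be reproduced with a minimal sketch. The helper name `format_default` is hypothetical and is not part of arg.cpp; the sketch only assumes that the help text formats the default with a fixed number of decimal places.

```python
# Minimal sketch, not the arg.cpp implementation: show how a fixed number of
# decimal places changes the value a reader sees in a help message.

def format_default(value: float, decimals: int) -> str:
    """Format a default value with a fixed number of decimals (hypothetical helper)."""
    return f"{value:.{decimals}f}"

top_p_default = 0.95

# With one decimal place the printed value no longer equals the real default.
print("one decimal :", format_default(top_p_default, 1))
# With two decimal places the real default is visible.
print("two decimals:", format_default(top_p_default, 2))
```

Because 0.95 has no exact binary representation and is stored slightly below 0.95, the one-decimal form rounds down rather than up.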
|
|
e9fd8dcab4 | llama-fit-params: keep explicit --ctx-size 0 (#19070) | ||
|
|
bb02f74c61 |
chat: fix language input for translategemma (#19052)
* chat: fix language input for translategemma * Update common/chat.cpp Co-authored-by: Aldehir Rojas <hello@alde.dev> --------- Co-authored-by: Aldehir Rojas <hello@alde.dev> |
||
|
|
b5b8fa1c8b | chat : fix translategemma crash on common_chat_format_example (#19019) | ||
|
|
51fa458a92 |
server : support preserving reasoning_content in assistant message (#18994)
* support reasoning_content input * report template caps to webui * add docs * rm commented code |
||
|
|
c301172f66 |
jinja: support none|string (#18995)
* jinja: support none|string * Update common/jinja/value.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-jinja.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add as_string() --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> |
||
|
|
14be5a39b1 |
common : improve error message when HTTPS is missing but required (#18987)
Signed-off-by: Adrien Gallouët <angt@huggingface.co> |
||
|
|
1c7cf94b22 |
common, server : use the same User-Agent by default (#18957)
This commit also ensures that if a custom User-Agent is used, it will be the only one sent. Signed-off-by: Adrien Gallouët <angt@huggingface.co> |
||
|
|
2c1f199653 |
cli : fix reasoning responses in CLI (#18961)
* cli : fix reasoning responses in CLI * fix build * fix build (2) |
||
|
|
6df686bee6 |
server : refactor oai_parser_opt, move it to server_chat_params (#18937)
* server_chat_params * move chat format into CLI * use meta whenever possible * clean up, no more chatml fallback |
||
|
|
959ecf7f23 |
jinja : fix undefined keys and attributes and int/float as bool (#18924)
* fix undefined keys and attributes * add falsy tests * as_bool for integers and floats * more falsy/truthy tests * --typo |
||
|
|
bbcdac0189 |
jinja : fix object item order (and properly implement dictsort) (#18904)
* fix object item order * as_ordered_object * copy whole object |
||
|
|
d03c45c9c5 |
jinja : attribute support for join, map and sort (#18883)
* support negative array index and default value * attribute support (int and str) for join, map and sort * add tests * update CODEOWNERS * improve fixme sorting comment |
||
|
|
10c98cbdf6 |
jinja : add missing tojson filter for bool (#18900)
* add missing tojson for bool * add more literal tests |
||
|
|
420960ab92 |
jinja : fix lexing of float literals with sign (#18901)
* fix lexing of float literals with sign * add test * consume_numeric |
||
|
|
f55b033ae6 | jinja: correct member access rule (#18905) | ||
|
|
c15395f73c |
common : implement new jinja template engine (#18462)
* jinja vm * lexer * add vm types * demo * clean up * parser ok * binary_expression::execute * shadow naming * bin ops works! * fix map object * add string builtins * add more builtins * wip * use mk_val * eval with is_user_input * render gemma tmpl ok * track input string even after transformations * support binded functions * keyword arguments and slicing array * use shared_ptr for values * add mk_stmt * allow print source on exception * fix negate test * testing more templates * mostly works * add filter_statement * allow func to access ctx * add jinja-value.cpp * impl global_from_json * a lot of fixes * more tests * more fix, more tests * more fixes * rm workarounds * demo: type inferrence * add placeholder for tojson * improve function args handling * rm type inference * no more std::regex * trailing spaces * make testing more flexible * make output a bit cleaner * (wip) redirect minja calls * test: add --output * fix crash on macro kwargs * add minimal caps system * add some workarounds * rm caps_apply_workarounds * get rid of preprocessing * more fixes * fix test-chat-template * move test-chat-jinja into test-chat-template * rm test-chat-jinja from cmake * test-chat-template: use common * fix build * fix build (2) * rename vm --> interpreter * improve error reporting * correct lstrip behavior * add tojson * more fixes * disable tests for COMMON_CHAT_FORMAT_GENERIC * make sure tojson output correct order * add object.length * fully functional selectattr / rejectattr * improve error reporting * more builtins added, more fixes * create jinja rendering tests * fix testing.h path * adjust whitespace rules * more fixes * temporary disable test for ibm-granite * r/lstrip behavior matched with hf.js * minimax, glm4.5 ok * add append and pop * kimi-k2 ok * test-chat passed * fix lstrip_block * add more jinja tests * cast to unsigned char * allow dict key to be numeric * nemotron: rm windows newline * tests ok * fix test * rename interpreter --> runtime * fix build * 
add more checks * bring back generic format support * fix Apertus * [json.exception.out_of_range.403] key 'content' not found * rm generic test * refactor input marking * add docs * fix windows build * clarify error message * improved tests * split/rsplit with maxsplit * non-inverse maxsplit forgot to change after simplifying * implement separators for tojson and fix indent * i like to move it move it * rename null -- > none * token::eof * some nits + comments * add exception classes for lexer and parser * null -> none * rename global -> env * rm minja * update docs * docs: add input marking caveats * imlement missing jinja-tests functions * oops * support trim filter with args, remove bogus to_json reference * numerous argument fixes * updated tests * implement optional strip chars parameter * use new chars parameter * float filter also has default * always leave at least one decimal in float string * jinja : static analysis + header cleanup + minor fixes * add fuzz test * add string.cpp * fix chat_template_kwargs * nits * fix build * revert * unrevert sorry :) * add fuzz func_args, refactor to be safer * fix array.map() * loosen ensure_vals max count condition, add not impl for map(int) * hopefully fix windows * check if empty first * normalize newlines --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> |
||
|
|
13f1e4a9ca |
llama : add adaptive-p sampler (#17927)
* initial commit for branch * simplify constants * add params to `struct common_params_sampling`, add reference to PR * explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]` * add args, rename `queue_size` -> `window_size` * improved comments * minor * remove old unused code from algorithm * minor * add power law case to `common_sampler_init`, add sampler name mappings * clarify behaviour when `window_size = 0` * add missing enums * remove `target_range` param, make `target == 1` no-op, cleanup code * oops, straggler * add missing parameters in `server-task.cpp` * copy from author ref: https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069 * remove old debug log, style nit * fix compiler warning, add commented-out logging per token * re-write + change parameters + simplify * oops forgot args.cpp * fix leftover `window_size` * add missing values to `common_params_sampling::print()` * with logging * does this fix it? * no, but does this? * update default decay * optimize * fix bad merge my git skills are lacking * silence `missing initializer for member` * update default decay to 0.9 * fix logging * format (double) * add power law to the new `samplers` vector * log sampler init values * improve logging messages in llama_sampler_power_law * remove extraneous logging * simplify target computation last commit with debug logging! * remove debug logging, explicitly clamp params at init * add `use_power_law` flag + logic, minor cleanup * update `power-law` -> `adaptive-p` * fix cold start EMA - `ctx->weighted_sum` is now initialized and reset to `target / (1.0f - clamped_decay)` - `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f - clamped_decay)` this fixes a "cold start" problem with the moving average * update `SHARPNESS` constant to `10.0f` * minor style fixes no functional changes * minor style fixes cont. 
* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004) * separate into `apply` + `accept` functions * `pending_token_idx`: switch from `llama_token` to `int32` functionally identical (`llama.h` has `typedef int32_t llama_token;`), but its more correct now * don't transform logits <= -1e9f * fix masking in backend top-p, min-p * address review comments * typo in comments `RND` -> `RNG` * add docs * add recommended values in completion docs * address PR feedback * remove trailing whitespace (for CI `editorconfig`) * add to adaptive-p to `common_sampler_types_from_chars` |
||
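The "cold start" fix in #17927 initializes the moving average so that it starts at the target instead of at zero. The sketch below uses the two initial values quoted in the commit message; the update rule, the clamp range, and the class name `EmaTracker` are assumptions made for illustration and are not taken from the llama.cpp source.

```python
# Minimal sketch under stated assumptions; not the llama.cpp sampler code.

class EmaTracker:
    def __init__(self, target: float, decay: float) -> None:
        # Assumed clamp range, chosen only so that 1 - decay is never zero.
        self.decay = min(max(decay, 0.0), 0.99)
        self.target = target
        self.reset()

    def reset(self) -> None:
        # Initial values as quoted in the commit message.
        self.weighted_sum = self.target / (1.0 - self.decay)
        self.total_weight = 1.0 / (1.0 - self.decay)

    def update(self, p: float) -> None:
        # Assumed exponentially weighted update (not confirmed by the source).
        self.weighted_sum = self.decay * self.weighted_sum + p
        self.total_weight = self.decay * self.total_weight + 1.0

    def average(self) -> float:
        return self.weighted_sum / self.total_weight


tracker = EmaTracker(target=0.5, decay=0.9)
print(tracker.average())  # starts at the target instead of at an arbitrary value
tracker.update(1.0)
print(tracker.average())  # one observation only nudges the average
```

Under the assumed update rule, these initial values are the steady state reached when every observation equals the target, so the first real observation moves the average only slightly rather than replacing it.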
|
|
39173bcacb |
context : reserve new scheduler when graph topology changes (#18547)
* context : reserve new scheduler when graph topology changes * cont : fix * cont : fix reserve * cont : reserve only when changes occur + timing * context : add comments * llama : reserve on sampler changes * common : allow null common_sampler * server : task declares needs (embd, logits, sampling) * server : do not init sampler if not needed * llama : fix need_reserve when unsetting a sampler * server : consolidate slot reset/clear logic |
||
|
|
d98b548120 |
Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914)
* Extract common debugging functions; plug eval-callback and mtmd's MTMD_DEBUG_GRAPH with same functionality * Move to common * Remove unneeded header * Unlink from common * chore: update webui build output * Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code. * Revert change to webapp * Post-merge adjust * Apply suggestions from code review Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Apply code review changes * Remove changes to server-context * Remove mtmd.h include * Remove utility functions from header * Apply suggestions from code review Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Rename functions * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> |
||
|
|
516a4ca9b5 | refactor : remove libcurl, use OpenSSL when available (#18828) | ||
|
|
60591f01d4 |
model : add EXAONE MoE (#18543)
* Add EXAONE MoE implementations Co-authored-by: Junwon Hwang <nuclear1221@gmail.com> * Address PR feedback * Address PR feedback * [WIP] Add MTP for EXAONE-MoE * Address PR feedback * Address PR feedback * Address PR feedback * Address PR feedback * Address PR feedback * Address PR feedback * Address PR feedback --------- Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai> |
||
|
|
bcf7546160 |
server : add arg for disabling prompt caching (#18776)
* server : add arg for disabling prompt caching Disabling prompt caching is useful for clients who are restricted to sending only OpenAI-compat requests and want deterministic responses. * address review comments * address review comments |
||
|
|
4150da9a95 |
examples : add --kv-unified to batched example (#18774)
This commit adds the --kv-unified flag to the batched example. The README.md lists this flag as required, but it is currently not available as a command-line option for the batched example. As a result, specifying the flag as the README instructs leads to an error about the flag not being recognized, and without the flag the example fails with the following error: ```console split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag) decode: failed to find a memory slot for batch of size 4 main: llama_decode() failed ``` |
||
|
|
23f82f2420 |
preset: allow named remote preset (#18728)
* preset: allow named remote preset * nits: fix docs * cont docs |
||
|
|
ea23c15990 |
common : add --license to display embedded licenses (#18696)
This commit introduces a mechanism to embed all licenses directly into the compiled binaries. This eliminates the need to distribute separate LICENSE files alongside the executable, making the binaries self-contained and simplifying deployment. |
||
|
|
8ece3836b4 |
common: support remote preset (#18520)
* arg: support remote preset * proof reading * allow one HF repo to point to multiple HF repos * docs: mention about multiple GGUF use case * correct clean_file_name * download: also return HTTP status code * fix case with cache file used * fix --offline option |
||
|
|
55abc39355 |
vendor : update cpp-httplib to 0.30.0 (#18660)
* vendor : update cpp-httplib to 0.30.0 * common : allow custom headers when downloading |
||
|
|
64848deb18 | llama-fit-params: free memory target per device (#18679) | ||
|
|
2038101bd9 |
llama : add use_direct_io flag for model loading (#18166)
* Adding --direct-io flag for model loading * Fixing read_raw() calls * Fixing Windows read_raw_at * Changing type off_t to size_t for windows and Renaming functions * disable direct io when mmap is explicitly enabled * Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL * Fallback to std::fread in case O_DIRECT fails due to bad address * Windows: remove const keywords and unused functions * Update src/llama-mmap.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: jtischbein <jtischbein@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> |
||
|
|
56d2fed2b3 |
tools : remove llama-run (#18661)
* tools : remove llama-run * Remove licenses/LICENSE-linenoise Signed-off-by: Adrien Gallouët <angt@huggingface.co> |
||
|
|
ffba4f29e6 |
examples : add debug utility/example (#18464)
* examples : add debug utility/example
This commit introduces a new example named llama-debug which is a
utility that is intended to be used to assist with developing/debugging
a converted model.
The motivation for this utility is to assist in model conversion work
to verify that the model produces the expected outputs. It is intended
to replace logits.cpp in examples/model-conversion.
Example usage:
```console
./build/bin/llama-debug \
-m models/Qwen2.5-0.5B-Instruct.gguf \
--prompt "Hello, my name is" \
--save-logits
...
Model add_bos: false
Input prompt: "Hello, my name is"
Token ids (5):
Hello(9707) ,(11) my(847) name(829) is(374)
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.bin
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.txt
Prompt saved to data/llamacpp-Qwen2.5-0.5B-Instruct-prompt.txt
Tokens saved to data/llamacpp-Qwen2.5-0.5B-Instruct-tokens.bin
```
For more details about the options available for this example, please
refer to examples/debug/README.md.
* throw runtime error instead of logging error
* remove params.warmup and enable the warmup/nowarmup option
* model-conversion : remove logits.cpp
This commit removes logits.cpp in favor of using llama-debug for
generating logits and embeddings.
* examples : remove model-conversion directory
This was missed in the previous commit.
* model-conversion : add support for saving prompt and token ids
This commit adds support for storing the prompt and the token ids for the
prompt when running the original models.
The motivation for this is that this will allow us to compare the prompt
and the tokens generated for the prompt when verifying the converted
model. Currently it is possible that even if the same prompt is used
that the tokens generated are different if there is a difference in the
tokenization between the original and converted model which would
currently go unnoticed (the verification will most likely fail but it
might not be obvious why).
* squash! model-conversion : add support for saving prompt and token ids
fix pyright errors.
* model-conversion : add compare_tokens utility
This commit adds a script to compare token outputs between original and
converted models.
Example usage:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
And there is a verbose flag that will also print out the prompts:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16 -v
Original model prompt (pytorch-gemma-3-270m-it):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Converted model prompt (llamacpp-gemma-3-270m-it-bf16):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
* model-conversion : add token comparison to verification scripts
This commit adds the calling of the compare_tokens function in
compare-logits.py and semantic_check.py to ensure that the token ids
that the tokenizers produce are the same before proceeding with
verifying the logits/embeddings.
Placing them in the existing scripts instead of calling them separately
ensures that the token comparison is always done prior to the
logit/embedding verifications.
Follow up commit/pr could refactor the causal logits verification into
a single script instead of the two that exist now. This would reduce the
code and make it consistent with the embeddings verification which only
has a single script.
* debug : use llama_model_n_embd_out
This commit updates the debug example to use the new function
llama_model_n_embd_out instead of llama_model_n_embd.
The motivation for this change is to support late interaction retriever
models, like LFM2-ColBert-350M, where the output embeddings are down
projected to a lower dimension.
* debug : add print_usage function
This commit adds a print_usage function that is passed to the
common_params_parse.
The motivation for this is that this enables a specific usage message
which will be printed after all the options, for example:
```console
example usage:
Print tensors:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --verbose
The tensors to be printed can be filtered with --tensor-filter option.
Save logits/embeddings:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --save-logits
Add --embedding to save embeddings
```
|
||
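The token comparison described in #18464 can be sketched as follows. This is not the actual `compare_tokens.py` script; the function name `first_mismatch` is hypothetical, and the token ids are the ones shown in the example output of the commit message.

```python
# Minimal sketch of comparing two token-id sequences; not the real script.
from typing import Optional, Sequence


def first_mismatch(original: Sequence[int], converted: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if both match."""
    for i, (a, b) in enumerate(zip(original, converted)):
        if a != b:
            return i
    # Same common prefix but different lengths: the mismatch starts where
    # the shorter sequence ends.
    if len(original) != len(converted):
        return min(len(original), len(converted))
    return None


original = [2, 9259, 236764, 1041, 1463, 563]
converted = [2, 9259, 236764, 1041, 1463, 563]

idx = first_mismatch(original, converted)
if idx is None:
    print(f"All {len(original)} tokens match!")
else:
    print(f"Token mismatch at position {idx}")
```

Running such a check before comparing logits makes a tokenizer difference visible as its own failure, rather than as an unexplained logit mismatch.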
|
|
07fbe19f1f |
arg: use CSV escape style for multiple-value args (#18643)
* arg: use CSV escape style for multiple-value args * add test |
||
|
|
d3dce4e0a5 |
sampling : add support for backend sampling (#17004)
* sampling : add support for backend sampling This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works: it is a function that is called by build_graph and the sampler operations become part of the model's computation graph. * llama-cli : add backend sampler configuration * server : add backend sampling options/configuration * webui : add backend sampling options * ggml : add initial cumsum implementation for CUDA * sampling : enable all backend sampler tests This commit enables all existing backend sampler tests in the test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations. * graph : do not include llama-model.h * sampling : always expose sampled_ids This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that this allows both common/sampling.cpp and src/llama-sampling.cpp to simplify their logic. 
Not all backend samplers that process logits need to set the sampled_tokens_id as they may not change the order of the logits, for example the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful. * sampling : ensure at most one output token per seq This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence. * CUDA: Optimize argsort for gpu-based token sampling Argsort is used for top-k currently. We optimize argsort in two ways: 1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs 2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths, it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics). https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview Some perf numbers for a RTX PRO 6000: On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf` Before: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run ``` After: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run ``` --- On the model level, tested with `llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is the Capital of Sweden?" 
-no-cnv -fa 1 --backend-sampling` Before: ``` llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second) llama_perf_context_print: load time = 18215.58 ms llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second) llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second) llama_perf_context_print: total time = 857.62 ms / 206 tokens ``` After ``` llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second) llama_perf_context_print: load time = 18366.92 ms llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second) llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second) llama_perf_context_print: total time = 683.65 ms / 206 tokens ``` * sampling : remove version from sampler chain This commit removes the version field from the sampler chain and instead uses the sampler pointer itself for change detection. * sampling : always populate logits for sampled probs This commit updates common/sampler.cpp set_logits and src/llama-sampling.cpp llama_sampler_sample to always populate the logits field when backend sampled probabilities are available. The motivation for this is that this ensures that CPU samplers always have access to the logits values even when probabilities have been produced by backend samplers. * sampling : simplify backend sampling logic decode This commit tries to simplify the backend sampling logic in llama_context::decode. * squash! sampling : simplify backend sampling logic decode Fix condition to check if backend actually sampled tokens, not just that backend samplers are available. * common : fix regression caused by extra memory allocations during sampling * squash! 
sampling : simplify backend sampling logic decode The commit fixes a variable shadowing issue in the `llama_context::decode` function which was introduced in a previous refactoring. * squash! common : fix regression caused by extra memory allocations during sampling Apply the same changes to llama-sampling.cpp, llama_sampler_sample as were applied in commit |
||
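The "always expose sampled_ids" change in #17004 rests on a distinction between samplers that keep position i mapped to token id i and samplers that reorder or filter. The sketch below illustrates that idea on plain Python lists; all function names are hypothetical, and nothing here reflects the real ggml graph or the llama.cpp API.

```python
# Minimal sketch on plain lists; the real implementation runs inside the
# backend computation graph and is not reproduced here.
from typing import List, Optional, Tuple


def temperature(logits: List[float], temp: float) -> List[float]:
    """Scale logits; position i still corresponds to token id i."""
    return [x / temp for x in logits]


def top_k(logits: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Keep the k largest logits; explicit token ids are now required."""
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return ids, [logits[i] for i in ids]


def resolve_token_ids(sampled_ids: Optional[List[int]],
                      full_vocab_ids: List[int]) -> List[int]:
    """Fall back to the precomputed identity list when no ids were produced."""
    return sampled_ids if sampled_ids is not None else full_vocab_ids


logits = [0.1, 2.0, -1.0, 1.5]
full_vocab_ids = list(range(len(logits)))  # computed once, then reused

scaled = temperature(logits, 2.0)
print(resolve_token_ids(None, full_vocab_ids), scaled)  # order-preserving case

ids, kept = top_k(logits, 2)
print(resolve_token_ids(ids, full_vocab_ids), kept)     # reordering case
```

With the identity list cached up front, callers can always read a valid id list without checking which kind of sampler ran.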
|
|
cef1d23c5a |
common/grammar : replace problematic backtracking regex [\s\S]* (#18342)
* grammar : add support for std::regex_search() with trigger patterns * common : update hermes2 pro trigger to search instead of match * common : use regex_search with anchoring for partial matching * common : adjust regex partial tests to use new pattern * grammar : check pattern directly instead of adding a type * common : adjust existing patterns to match new semantics |
||
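The change in #18342 swaps a whole-input match that relies on `[\s\S]*` for a search over the input. The sketch below shows only the semantic difference, using Python's `re` module rather than `std::regex`; both patterns and the `<tool_call>` marker are hypothetical examples, not the trigger patterns from the repository.

```python
# Minimal sketch of match-everything versus search semantics; the patterns
# are illustrative assumptions, not the ones used in llama.cpp.
import re
from typing import Optional

# Old style: the pattern must cover the entire input, so it needs
# [\s\S]* on both sides of the part that matters.
FULL_PATTERN = re.compile(r"[\s\S]*?(<tool_call>)[\s\S]*")

# New style: search for the trigger anywhere in the input.
SEARCH_PATTERN = re.compile(r"<tool_call>")


def trigger_pos_fullmatch(text: str) -> Optional[int]:
    m = FULL_PATTERN.fullmatch(text)
    return m.start(1) if m else None


def trigger_pos_search(text: str) -> Optional[int]:
    m = SEARCH_PATTERN.search(text)
    return m.start() if m else None


text = "Sure.\n<tool_call>{\"name\": \"get_time\"}"
print(trigger_pos_fullmatch(text), trigger_pos_search(text))
```

Both approaches locate the same position, but the search form leaves the regex engine no wildcard prefix and suffix to backtrack over.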
|
|
f4f5019254 |
model: add Solar Open model (#18511)
* model: add Solar-Open model * vocab: add solar-open to end eog blacklist * model: add proper llm type * chat: basic template for solar open * typo: fix comment about vocab * convert: suggested changes * convert: suggested changes * chat: change reasoning end tag for solar-open * llama-chat: add solar-open template |
||
|
|
4cd162a123 |
chat: make tool description and parameters optional per OpenAI spec (#18478)
* chat: make tool description and parameters optional per OpenAI spec Per the OpenAI API specification, both 'description' and 'parameters' fields in tool function definitions are optional. Previously, the parser would throw an exception if these fields were missing. Attempts to fix #17667 * refactor: use value() for cleaner optional field access |
||
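The behaviour described in #18478, reading optional fields with a default instead of throwing, can be sketched with dictionary lookups. The function name `parse_tool` and the chosen defaults (empty string, empty object) are assumptions for illustration; the actual parser lives in C++ and its defaults are not stated in the commit message.

```python
# Minimal sketch of tolerant tool parsing; defaults are assumed, not sourced.
from typing import Any, Dict


def parse_tool(tool: Dict[str, Any]) -> Dict[str, Any]:
    """Extract a tool definition, treating description/parameters as optional."""
    fn = tool["function"]  # still required in this sketch
    return {
        "name": fn["name"],                       # still required
        "description": fn.get("description", ""),  # optional per the OpenAI spec
        "parameters": fn.get("parameters", {}),    # optional per the OpenAI spec
    }


minimal_tool = {"type": "function", "function": {"name": "get_time"}}
print(parse_tool(minimal_tool))
```

A definition that carries only a name is accepted, where a strict lookup of the missing keys would have raised an exception.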
|
|
0f89d2ecf1 |
common : default content to an empty string (#18485)
* common : default content to an empty string * common : fix tests that break when content != null |
||
|
|
cd78e57c3a |
lora: count lora nodes in graph_max_nodes (#18469)
* lora: count lora nodes in graph_max_nodes * 3 nodes per weight * 4 nodes * keep track n_lora_nodes from llama_model * fix assert * rm redundant header * common: load adapters before context creation * use 6 nodes |
||
|
|
daa242dfc8 |
common: fix return value check for setpriority (#18412)
* common: fix return value check for setpriority * tools: add logging for process priority setting |
||
|
|
60f17f56da |
rpc: fix segfault on invalid endpoint format (#18387)
* rpc: fix segfault on invalid endpoint format * rpc: add error log for failed endpoint connection |
||
|
|
026d2ad472 |
llama: fix magic number of 999 for GPU layers (#18266)
* llama: fix magic number of 999 for GPU layers * use strings for -ngl, -ngld * encapsulate n_gpu_layers, split_mode |
||
|
|
f5acfb2ffa |
server: (router) add stop-timeout option (#18350)
* server: (router) add stop-timeout option * also allow stop while loading * add docs * unload_lru: also wait for unload to complete |
||
|
|
10355dc7d0 |
common: add LLAMA_ARG_OVERRIDE_TENSOR env var for -ot arg (#18267)
|
||
|
|
147a521636 | tool/ex/tests: consistently free ctx, then model (#18168) | ||
|
|
9496bbb808 | common : reorganize includes to prioritize vendored deps (#18222) | ||
|
|
ddcb75dd8a |
server: add auto-sleep after N seconds of idle (#18228)
* implement sleeping at queue level * implement server-context suspend * add test * add docs * optimization: add fast path * make sure to free llama_init * nits * fix use-after-free * allow /models to be accessed during sleeping, fix use-after-free * don't allow accessing /models during sleep, it is not thread-safe * fix data race on accessing props and model_meta * small clean up * trailing whitespace * rm outdated comments |
||
|
|
9e39a1e6a9 |
server: support load model on startup, support preset-only options (#18206)
* server: support autoload model, support preset-only options * add docs * load-on-startup * fix * Update common/arg.cpp Co-authored-by: Pascal <admin@serveurperso.com> --------- Co-authored-by: Pascal <admin@serveurperso.com> |
||
|
|
14931a826e |
arg: fix order to use short form before long form (#18196)
* arg: fix order to use short form before long form * arg: update doc * arg: update test-arg-parser * arg: address review feedback from ngxson simplified to check first.length() <= last.length() only fixed: --sampler-seq, --rerank, --draft ordering note: middle positions in 3+ arg sets are not verified * arg: update doc |
||
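The ordering check mentioned in #18196 compares only the first and last names of an argument set. A minimal sketch follows, assuming each argument is represented as a list of its names; the function name `short_form_first` is hypothetical and the real test lives in test-arg-parser.

```python
# Minimal sketch of the first-versus-last length check; not the real test.
from typing import List


def short_form_first(names: List[str]) -> bool:
    """True if the first name is no longer than the last one."""
    if len(names) < 2:
        return True
    # Middle positions in sets of three or more names are not verified,
    # matching the note in the commit message.
    return len(names[0]) <= len(names[-1])


print(short_form_first(["-ngl", "--n-gpu-layers"]))  # short form listed first
print(short_form_first(["--n-gpu-layers", "-ngl"]))  # long form listed first
```

The check is deliberately loose: it catches a long form placed before a short form, but says nothing about the order of names in between.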
|
|
98c1c7a7bf |
presets: refactor, allow cascade presets from different sources, add global section (#18169)
* presets: refactor, allow cascade presets from different sources * update docs * fix neg arg handling * fix empty mmproj * also filter out server-controlled args before to_ini() * skip loading custom_models if not specified * fix unset_reserved_args * fix crash on windows |