llama.cpp INI Presets
Introduction
The INI preset feature, introduced in PR#17859, allows users to create reusable and shareable parameter configurations for llama.cpp.
Using Presets with the Server
When running multiple models on the server (router mode), INI preset files can be used to configure model-specific parameters. Please refer to the server documentation for more details.
Using a Hugging Face Preset
Important
Only use presets from sources you trust. Unknown presets may be unsafe.
You can push your preset to the Hugging Face Hub and share it with other users by:
- Creating an empty model repository on Hugging Face
- Creating a preset.ini file in the root directory of the repository
Example of a preset.ini:
[*]
ctx-size = 0
mmap = 1
kv-unified = 1
parallel = 4
spec-default = 1
[Qwen3.5-4B]
hf = unsloth/Qwen3.5-4B-GGUF:Q4_K_M
ctx-size = 262144
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
[gpt-oss-120b-hf]
hf = ggml-org/gpt-oss-120b-GGUF
ctx-size = 262144
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
chat-template-kwargs = {"reasoning_effort": "high"}
The preset is loaded in the same way as with the --models-preset option, so you can still override individual parameters with CLI arguments:
# Force temp = 0.1, overriding the preset value
llama-cli -hf username/my-preset --temp 0.1
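The override behaviour can be pictured with a short Python sketch. It is only an illustration, not llama.cpp's actual loader: the function name resolve_preset is hypothetical, and the layering it uses (the [*] section as shared defaults, then the named section, then CLI arguments) is an assumption about how the sections combine.

```python
import configparser

# Trimmed-down copy of the example preset.ini, inlined so the sketch is self-contained.
PRESET_TEXT = """
[*]
ctx-size = 0
mmap = 1

[Qwen3.5-4B]
hf = unsloth/Qwen3.5-4B-GGUF:Q4_K_M
ctx-size = 262144
temp = 1.0
"""


def resolve_preset(text, name, cli_overrides=None):
    """Hypothetical helper: merge [*] defaults, one named section, and CLI overrides."""
    # Only "=" is treated as a delimiter so that values may contain ":".
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)

    merged = {}
    if parser.has_section("*"):
        merged.update(parser["*"])      # assumed lowest priority: shared defaults
    merged.update(parser[name])         # values from the named section
    merged.update(cli_overrides or {})  # assumed highest priority: CLI arguments
    return merged


# Mirrors `--temp 0.1` on the command line.
params = resolve_preset(PRESET_TEXT, "Qwen3.5-4B", {"temp": "0.1"})
print(params)
```

In this sketch, temp ends up as 0.1 rather than the 1.0 from the preset, while ctx-size keeps the section value and mmap falls through from [*].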
Named presets
If you want to define multiple preset configurations for one or more GGUF models, you can create a blank HF repo containing a single preset.ini file that references the actual model(s):
[*]
mmap = 1
[gpt-oss-20b-hf]
hf = ggml-org/gpt-oss-20b-GGUF
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
chat-template-kwargs = {"reasoning_effort": "high"}
[gpt-oss-120b-hf]
hf = ggml-org/gpt-oss-120b-GGUF
batch-size = 2048
ubatch-size = 2048
top-p = 1.0
top-k = 0
min-p = 0.01
temp = 1.0
chat-template-kwargs = {"reasoning_effort": "high"}
You can then use it via llama-cli or llama-server, for example:
llama-server -hf user/repo:gpt-oss-120b-hf
Please make sure to provide the correct hf-repo for each child preset. Otherwise, you may get the error: The specified tag is not a valid quantization scheme.
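Before publishing a preset.ini, a quick local check can catch sections that lack a repository reference. The Python sketch below is a hypothetical helper, not part of llama.cpp. It assumes the repository is given under the key hf (as in the examples above) or hf-repo, and it only tests for presence, not whether the repository or tag actually exists.

```python
import configparser


def sections_missing_hf(text):
    """Hypothetical check: list named sections with no `hf` / `hf-repo` entry."""
    # Only "=" is treated as a delimiter so that values may contain ":".
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)

    missing = []
    for section in parser.sections():
        if section == "*":
            continue  # [*] holds shared settings, not a model reference
        options = parser[section]
        if not (options.get("hf") or options.get("hf-repo")):
            missing.append(section)
    return missing


sample = """
[*]
mmap = 1

[gpt-oss-20b-hf]
hf = ggml-org/gpt-oss-20b-GGUF

[gpt-oss-120b-hf]
temp = 1.0
"""

print(sections_missing_hf(sample))
```

Here the second named section is reported because it sets temp but never names a repository.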