llama.cpp/tools
Ruben Ortlam e4d2e198b9 server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold
estimate with to-be-loaded model size included

use no_alloc to get memory requirements for model load

only set model memory_mb if not previously calculated

use memory margin instead of total size limit, apply to each device separately

add server memory debug logging

move llama_context_device_memory function to llama-ext.h

fix model count exceeded check

improve memory_per_device map naming

improve variable naming, fix style

also strip models memory margin from child processes

cont : clean-up

replace device memory map with buft memory map. Use llama_get_memory_breakdown

extract duplicated check into helper function

move model memory estimation to subprocess

precompute name->buft map, map GPU host types to CPU buft

cleanup unused variable

remove duplicated init calls
2026-06-29 09:38:11 +02:00