llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-07-04 18:20:21 +00:00

Files

Ruben Ortlam e4d2e198b9 server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold

estimate with to-be-loaded model size included

use no_alloc to get memory requirements for model load

only set model memory_mb if not previously calculated

use memory margin instead of total size limit, apply to each device separately

add server memory debug logging

move llama_context_device_memory function to llama-ext.h

fix model count exceeded check

improve memory_per_device map naming

improve variable naming, fix style

also strip models memory margin from child processes

cont : clean-up

replace device memory map with buft memory map. Use llama_get_memory_breakdown

extract duplicated check into helper function

move model memory estimation to subprocess

precompute name->buft map, map GPU host types to CPU buft

cleanup unused variable

remove duplicated init calls

2026-06-29 09:38:11 +02:00

batched-bench

cmake : add install() for impl libraries + fix apple builds (#23511)

2026-05-22 11:46:26 +03:00

cli

mtmd, arg: fix utf8 handling on windows (#24779)

2026-06-19 22:28:38 +02:00

completion

completion : remove useless statics (#24226)

2026-06-06 12:16:16 +02:00

cvector-generator

libs : rename libcommon -> libllama-common (#21936)