TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

nvpohanh 13c8e5a8a8 feat: Prefetch safetensors files before loading them (#4140)

Prefetch safetensors files so that they are stored in the system file cache. This significantly speeds up model weight loading on the very first run after entering the Docker container.

This helps because model weights are loaded layer by layer, which means reading the safetensors files chunk by chunk. Chunked reads cannot use the network bandwidth well, assuming the files are stored on network drives. Loading whole files in bulk achieves higher network bandwidth utilization.

When running with world_size>1, all ranks collaboratively prefetch these files.

In theory, heuristics should decide whether to prefetch the files, but that is beyond the scope of this commit. For example, when CPU memory is small, prefetching may cause file cache thrashing and result in slower weight loading.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>

2025-05-13 13:35:30 +08:00
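The prefetching idea described in the commit message can be illustrated with a minimal sketch. This is not the code from #4140; the names `files_for_rank`, `read_whole_file`, `prefetch_files`, and `CHUNK_BYTES` are hypothetical and exist only for illustration. The sketch assumes that sequentially reading a file and discarding the bytes is enough to pull it into the operating system's file cache, and that a round-robin split is an acceptable way for ranks to share the work.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence

# Hypothetical read size; a large value favors bulk sequential reads.
CHUNK_BYTES = 16 * 1024 * 1024


def files_for_rank(paths: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Round-robin split of the sorted file list so each rank reads a disjoint subset."""
    return sorted(paths)[rank::world_size]


def read_whole_file(path: str) -> int:
    """Read a file front to back and discard the data; return the bytes read.

    The data is thrown away on purpose: the goal is only to have the OS
    keep the file contents in its file cache for the later, chunked load.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_BYTES)
            if not chunk:
                break
            total += len(chunk)
    return total


def prefetch_files(
    paths: Sequence[str], rank: int = 0, world_size: int = 1, max_workers: int = 4
) -> Dict[str, int]:
    """Prefetch this rank's share of the files; return {path: bytes_read}."""
    mine = files_for_rank(paths, rank, world_size)
    if not mine:
        return {}
    # Threads are sufficient here because the work is I/O bound.
    with ThreadPoolExecutor(max_workers=min(max_workers, len(mine))) as pool:
        sizes = list(pool.map(read_whole_file, mine))
    return dict(zip(mine, sizes))
```

Each rank would call `prefetch_files(paths, rank, world_size)` with the same file list before weight loading starts. As the commit message notes, whether this helps depends on the machine: with little CPU memory the cached pages may be evicted before they are used.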
_torch	feat: Prefetch safetensors files before loading them (#4140)	2025-05-13 13:35:30 +08:00
auto_parallel	fix: Fix NVLink version decoding. (#3996)	2025-05-06 13:56:50 +08:00
bench	[TRTLLM-4717][perf] Set CUDA graph max batch size and padding in throughput benchmark. (#3875)	2025-05-09 23:20:52 +08:00
commands	feat: add kv cache aware router (#3831)	2025-05-12 07:23:57 -04:00
evaluate	[TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval (#3946)	2025-05-08 19:35:23 +08:00
executor	[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156)	2025-05-12 21:35:15 -04:00
inputs	[TRTLLM-3925, https://nvbugs/5245262] [fix] Normalize LLM.generate API (#3985)	2025-05-07 11:06:23 +08:00
layers	Support RingAttention in the BertAttention plugin and the DiT model (#3661)	2025-05-09 08:06:54 +08:00
llmapi	feat: add kv cache aware router (#3831)	2025-05-12 07:23:57 -04:00
models	Support RingAttention in the BertAttention plugin and the DiT model (#3661)	2025-05-09 08:06:54 +08:00
plugin	fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. (#3988)	2025-05-05 11:33:25 +08:00
quantization	chore: bump version to 0.19.0 (#3598) (#3841)	2025-04-29 16:57:22 +08:00
runtime	Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979)	2025-05-12 22:32:29 +02:00
scaffolding	[TRTLLM-4911] feat(scaffolding): make sampling_params only setable by controller (#4151)	2025-05-12 15:29:09 +08:00
serve	feat: add kv cache aware router (#3831)	2025-05-12 07:23:57 -04:00
tools	test: Fix breaking Phi3 multimodal tests (#3544)	2025-04-15 08:02:34 +08:00
__init__.py	fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928)	2025-04-29 11:26:13 +08:00
_common.py	Update (#2978)	2025-03-23 16:39:35 +08:00
_dlpack_utils.py	feat: Add MNNVL MoE A2A support (#3504)	2025-04-25 17:29:08 +08:00
_ipc_utils.py	fix: Proper error bubbling for PyExecutor (#3321)	2025-04-15 14:49:46 +08:00
_mnnvl_utils.py	feat: Add MNNVL MoE A2A support (#3504)	2025-04-25 17:29:08 +08:00
_utils.py	[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804)	2025-05-09 11:04:01 +08:00
builder.py	chore: remove usernames from comments (#3291)	2025-04-05 13:44:28 +08:00
disaggregated_params.py	Update TensorRT-LLM (#2936)	2025-03-18 21:25:19 +08:00
functional.py	Support RingAttention in the BertAttention plugin and the DiT model (#3661)	2025-05-09 08:06:54 +08:00
graph_rewriting.py	Update TensorRT-LLM (#2755)	2025-02-11 03:01:00 +00:00
logger.py	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
lora_manager.py	feat: support multi lora adapters and TP (#3885)	2025-05-08 23:45:45 +08:00
mapping.py	Add smart router for moe (#3641)	2025-04-23 12:21:59 +08:00
module.py	Update (#2978)	2025-03-23 16:39:35 +08:00
network.py	chore: remove usernames from comments (#3291)	2025-04-05 13:44:28 +08:00
parameter.py	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
profiler.py	test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA (#3483)	2025-04-22 07:38:16 +08:00
prompt_adapter_manager.py	Update TensorRT-LLM (#2333)	2024-10-15 15:28:40 +08:00
python_plugin.py	Update TensorRT-LLM (#2755)	2025-02-11 03:01:00 +00:00
sampling_params.py	feat: Support the Structural Tag in guided decoding (#4066)	2025-05-12 17:24:50 +08:00
top_model_mixin.py	Update TensorRT-LLM (#2053)	2024-07-30 21:25:01 +08:00
version.py	chore: bump version to 0.20.0rc2 (#3949)	2025-04-30 11:44:43 +08:00