TensorRT-LLM/tensorrt_llm/_torch
nvpohanh 13c8e5a8a8
feat: Prefetch safetensors files before loading them (#4140)
Prefetch safetensors files so that they are stored in the system file
cache. This significantly speeds up model weight loading for the very
first run after entering the docker container.

This is beneficial because model weights are loaded layer by layer,
which means the safetensors files are read chunk by chunk. Assuming
these files are stored on a network drive, that access pattern cannot
utilize the network bandwidth well. Reading the whole files in bulk
instead achieves much higher bandwidth utilization.

When running with world_size>1, all ranks collaboratively prefetch
these files.
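A minimal sketch of the idea (hypothetical, not the actual code in `pyexecutor`): each rank reads a disjoint stripe of the checkpoint's safetensors files in large sequential chunks, so their contents land in the OS file cache before layer-by-layer weight loading begins. The function name, `chunk_size`, and the `rank`/`world_size` parameters are illustrative; in practice they would come from the surrounding distributed setup.

```python
import glob
import os


def prefetch_safetensors(checkpoint_dir: str, rank: int = 0,
                         world_size: int = 1,
                         chunk_size: int = 16 * 1024 * 1024) -> list[str]:
    """Warm the OS file cache by bulk-reading safetensors files.

    Files are striped across ranks so that, with world_size > 1, all
    ranks collaborate on the prefetch. Returns the files this rank read.
    """
    files = sorted(glob.glob(os.path.join(checkpoint_dir, "*.safetensors")))
    prefetched = []
    for path in files[rank::world_size]:  # this rank's stripe of files
        with open(path, "rb") as f:
            # Discard the data; populating the file cache is the point.
            while f.read(chunk_size):
                pass
        prefetched.append(path)
    return prefetched
```

After every rank's prefetch completes, the subsequent chunk-by-chunk reads during weight loading are served from the page cache instead of the network drive.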

In theory, a heuristic should decide whether to prefetch the files,
but that is beyond the scope of this commit.

For example, when CPU memory is small, prefetching may cause file
cache thrashing, which would slow down weight loading instead.
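One way such a heuristic could look (hypothetical, explicitly not part of this commit): compare the total checkpoint size against the currently available CPU memory and skip prefetching when the files would not fit with some headroom. This sketch uses `os.sysconf`, so it assumes a POSIX system; the `headroom` factor is an invented tuning knob.

```python
import glob
import os


def should_prefetch(checkpoint_dir: str, headroom: float = 0.5) -> bool:
    """Return True if the checkpoint fits comfortably in free CPU memory.

    Prefetching files larger than the available page cache would evict
    data as fast as it is read in (cache thrashing), so skip it then.
    """
    total_bytes = sum(
        os.path.getsize(p)
        for p in glob.glob(os.path.join(checkpoint_dir, "*.safetensors")))
    # Available physical memory, POSIX-only (Linux, most Unixes).
    available = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    # Only prefetch if the files occupy at most `headroom` of free memory.
    return total_bytes <= headroom * available
```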

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-13 13:35:30 +08:00
attention_backend fix: Reset planned states to avoid memory leak in TrtllmAttentionWrapper (#4227) 2025-05-12 23:25:54 +08:00
auto_deploy [TRTLLM-5188] fix: [AutoDeploy] update output shape of prepare_fused_mha_metadata_fake (#4199) 2025-05-12 11:11:40 -04:00
compilation [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
custom_ops feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
distributed chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse (#4169) 2025-05-09 12:51:02 +08:00
models refactor: Allow models to override apply_qk_norm. (#4078) 2025-05-12 19:38:24 +08:00
modules feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
peft feat: support multi lora adapters and TP (#3885) 2025-05-08 23:45:45 +08:00
pyexecutor feat: Prefetch safetensors files before loading them (#4140) 2025-05-13 13:35:30 +08:00
speculative [fix] Fix relaxed acceptance to support enabling it in context phase (#4126) 2025-05-09 14:11:14 +08:00
__init__.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
autotuner.py feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) 2025-04-08 14:28:36 +08:00
llm.py test: [TRTLLM-4334] Create 1.0 criteria scope from API stability references (#3069) 2025-03-26 18:14:35 +08:00
metadata.py feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
model_config.py feat: support multi lora adapters and TP (#3885) 2025-05-08 23:45:45 +08:00
pipeline_interface.py chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
utils.py [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00