TensorRT-LLMs/tensorrt_llm
nvpohanh 13c8e5a8a8
feat: Prefetch safetensors files before loading them (#4140)
Prefetch safetensors files so that they are stored in the system file
cache. This significantly speeds up model weight loading for the
very first run after entering the Docker container.

This is beneficial because model weight loading is done layer-by-layer,
which means reading from the safetensors files chunk-by-chunk. That
cannot utilize the network bandwidth very well, assuming that these
files are stored on network drives. Reading whole files in bulk instead
achieves higher network bandwidth utilization.

When running with world_size>1, all ranks collaboratively prefetch these
files.
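
The idea above can be sketched roughly as follows. This is only an
illustration, not the code added by this commit: the helper names
(files_for_rank, prefetch_file, prefetch_files), the round-robin split
across ranks, and the chunk size are all assumptions made for the example.
Each rank reads its share of the files sequentially and discards the bytes,
so that the operating system keeps them in the file cache.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def files_for_rank(paths: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Hypothetical split: sort the paths and hand them out round-robin."""
    return sorted(paths)[rank::world_size]


def prefetch_file(path: str, chunk_size: int = 16 * 1024 * 1024) -> int:
    """Read a file from start to end and discard the data.

    The read itself is what brings the file into the system file cache.
    Returns the number of bytes read.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def prefetch_files(paths: Sequence[str], rank: int = 0, world_size: int = 1,
                   max_workers: int = 4) -> int:
    """Prefetch this rank's share of the files using a few threads.

    Returns the total number of bytes this rank read.
    """
    mine = files_for_rank(paths, rank, world_size)
    if not mine:
        return 0
    with ThreadPoolExecutor(max_workers=min(max_workers, len(mine))) as pool:
        return sum(pool.map(prefetch_file, mine))
```

In a multi-rank run, the ranks would presumably need to synchronize (for
example with a barrier) after prefetching and before loading weights, so
that no rank starts reading files another rank has not finished caching.
That step is omitted here.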

In theory, we should add heuristics to decide whether to prefetch the
files or not, but that is beyond the scope of this commit.

For example, when CPU memory is small, prefetching may cause file cache
thrashing and therefore slow down weight loading.
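
One possible shape for such a heuristic, which this commit does not
implement, is to compare the total size of the files against the memory
available for caching. The function name should_prefetch and the 0.9
headroom factor below are hypothetical choices for illustration only.

```python
from typing import Iterable


def should_prefetch(file_sizes: Iterable[int], available_bytes: int,
                    max_fraction: float = 0.9) -> bool:
    """Return True if all files are expected to fit in the file cache.

    file_sizes: sizes in bytes of the files that would be prefetched.
    available_bytes: memory assumed to be usable for the file cache.
    max_fraction: hypothetical headroom factor, so that prefetching does
        not claim all of the available memory.
    """
    return sum(file_sizes) <= available_bytes * max_fraction
```

On Linux, available_bytes could be estimated with
os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE"), though whether
that is the right measure for a given deployment is an open question.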

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-13 13:35:30 +08:00
..
_torch feat: Prefetch safetensors files before loading them (#4140) 2025-05-13 13:35:30 +08:00
auto_parallel fix: Fix NVLink version decoding. (#3996) 2025-05-06 13:56:50 +08:00
bench [TRTLLM-4717][perf] Set CUDA graph max batch size and padding in throughput benchmark. (#3875) 2025-05-09 23:20:52 +08:00
commands feat: add kv cache aware router (#3831) 2025-05-12 07:23:57 -04:00
evaluate [TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval (#3946) 2025-05-08 19:35:23 +08:00
executor [TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156) 2025-05-12 21:35:15 -04:00
inputs [TRTLLM-3925, https://nvbugs/5245262] [fix] Normalize LLM.generate API (#3985) 2025-05-07 11:06:23 +08:00
layers Support RingAttention in the BertAttention plugin and the DiT model (#3661) 2025-05-09 08:06:54 +08:00
llmapi feat: add kv cache aware router (#3831) 2025-05-12 07:23:57 -04:00
models Support RingAttention in the BertAttention plugin and the DiT model (#3661) 2025-05-09 08:06:54 +08:00
plugin fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. (#3988) 2025-05-05 11:33:25 +08:00
quantization chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
runtime Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) 2025-05-12 22:32:29 +02:00
scaffolding [TRTLLM-4911] feat(scaffolding): make sampling_params only setable by controller (#4151) 2025-05-12 15:29:09 +08:00
serve feat: add kv cache aware router (#3831) 2025-05-12 07:23:57 -04:00
tools test: Fix breaking Phi3 multimodal tests (#3544) 2025-04-15 08:02:34 +08:00
__init__.py fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928) 2025-04-29 11:26:13 +08:00
_common.py Update (#2978) 2025-03-23 16:39:35 +08:00
_dlpack_utils.py feat: Add MNNVL MoE A2A support (#3504) 2025-04-25 17:29:08 +08:00
_ipc_utils.py fix: Proper error bubbling for PyExecutor (#3321) 2025-04-15 14:49:46 +08:00
_mnnvl_utils.py feat: Add MNNVL MoE A2A support (#3504) 2025-04-25 17:29:08 +08:00
_utils.py [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
builder.py chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
disaggregated_params.py Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
functional.py Support RingAttention in the BertAttention plugin and the DiT model (#3661) 2025-05-09 08:06:54 +08:00
graph_rewriting.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
logger.py Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
lora_manager.py feat: support multi lora adapters and TP (#3885) 2025-05-08 23:45:45 +08:00
mapping.py Add smart router for moe (#3641) 2025-04-23 12:21:59 +08:00
module.py Update (#2978) 2025-03-23 16:39:35 +08:00
network.py chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
parameter.py Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
profiler.py test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA (#3483) 2025-04-22 07:38:16 +08:00
prompt_adapter_manager.py Update TensorRT-LLM (#2333) 2024-10-15 15:28:40 +08:00
python_plugin.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
sampling_params.py feat: Support the Structural Tag in guided decoding (#4066) 2025-05-12 17:24:50 +08:00
top_model_mixin.py Update TensorRT-LLM (#2053) 2024-07-30 21:25:01 +08:00
version.py chore: bump version to 0.20.0rc2 (#3949) 2025-04-30 11:44:43 +08:00