TensorRT-LLM/tensorrt_llm/_torch
nvpohanh 13c8e5a8a8
feat: Prefetch safetensors files before loading them (#4140)
Prefetch safetensors files so that they are stored in the system file
cache. This significantly speeds up model weight loading for the very
first run after entering the docker container.

This is beneficial because model weights are loaded layer by layer,
which means the safetensors files are read chunk by chunk. Assuming
these files are stored on a network drive, that access pattern cannot
utilize the network bandwidth well. Reading the whole files in bulk
instead achieves much higher bandwidth utilization.

When running with world_size>1, all ranks collaboratively prefetch
these files.
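A minimal sketch of the idea (hypothetical, not the actual code in `pyexecutor`): each rank reads a disjoint stripe of the checkpoint's safetensors files in large sequential chunks, so their contents land in the OS file cache before layer-by-layer weight loading begins. The function name, `chunk_size`, and the `rank`/`world_size` parameters are illustrative; in practice they would come from the surrounding distributed setup.

```python
import glob
import os


def prefetch_safetensors(checkpoint_dir: str, rank: int = 0,
                         world_size: int = 1,
                         chunk_size: int = 16 * 1024 * 1024) -> list[str]:
    """Warm the OS file cache by bulk-reading safetensors files.

    Files are striped across ranks so that, with world_size > 1, all
    ranks collaborate on the prefetch. Returns the files this rank read.
    """
    files = sorted(glob.glob(os.path.join(checkpoint_dir, "*.safetensors")))
    prefetched = []
    for path in files[rank::world_size]:  # this rank's stripe of files
        with open(path, "rb") as f:
            # Discard the data; populating the file cache is the point.
            while f.read(chunk_size):
                pass
        prefetched.append(path)
    return prefetched
```

After every rank's prefetch completes, the subsequent chunk-by-chunk reads during weight loading are served from the page cache instead of the network drive.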

In theory, a heuristic should decide whether to prefetch the files,
but that is beyond the scope of this commit.

For example, when CPU memory is small, prefetching may cause file
cache thrashing, which would slow down weight loading instead.
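One way such a heuristic could look (hypothetical, explicitly not part of this commit): compare the total checkpoint size against the currently available CPU memory and skip prefetching when the files would not fit with some headroom. This sketch uses `os.sysconf`, so it assumes a POSIX system; the `headroom` factor is an invented tuning knob.

```python
import glob
import os


def should_prefetch(checkpoint_dir: str, headroom: float = 0.5) -> bool:
    """Return True if the checkpoint fits comfortably in free CPU memory.

    Prefetching files larger than the available page cache would evict
    data as fast as it is read in (cache thrashing), so skip it then.
    """
    total_bytes = sum(
        os.path.getsize(p)
        for p in glob.glob(os.path.join(checkpoint_dir, "*.safetensors")))
    # Available physical memory, POSIX-only (Linux, most Unixes).
    available = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    # Only prefetch if the files occupy at most `headroom` of free memory.
    return total_bytes <= headroom * available
```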

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-13 13:35:30 +08:00
attention_backend fix: Reset planned states to avoid memory leak in TrtllmAttentionWrapper (#4227) 2025-05-12 23:25:54 +08:00
auto_deploy [TRTLLM-5188] fix: [AutoDeploy] update output shape of prepare_fused_mha_metadata_fake (#4199) 2025-05-12 11:11:40 -04:00
compilation [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
custom_ops feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
distributed chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse (#4169) 2025-05-09 12:51:02 +08:00
models refactor: Allow models to override apply_qk_norm. (#4078) 2025-05-12 19:38:24 +08:00
modules feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) 2025-05-13 08:52:53 +08:00
peft feat: support multi lora adapters and TP (#3885) 2025-05-08 23:45:45 +08:00
pyexecutor feat: Prefetch safetensors files before loading them (#4140) 2025-05-13 13:35:30 +08:00
speculative [fix] Fix relaxed acceptance to support enabling it in context phase (#4126) 2025-05-09 14:11:14 +08:00
__init__.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
autotuner.py feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) 2025-04-08 14:28:36 +08:00
llm.py test: [TRTLLM-4334] Create 1.0 criteria scope from API stability references (#3069) 2025-03-26 18:14:35 +08:00
metadata.py feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
model_config.py feat: support multi lora adapters and TP (#3885) 2025-05-08 23:45:45 +08:00
pipeline_interface.py chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
utils.py [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00