TensorRT-LLM/tensorrt_llm/_torch/pyexecutor
nvpohanh 13c8e5a8a8
feat: Prefetch safetensors files before loading them (#4140)
Prefetch safetensors files so that they are stored in the system file
cache. This significantly speeds up model weight loading on the very
first run after entering the docker container.

This is beneficial because model weight loading is done layer by layer,
which means reading the safetensors files chunk by chunk. When these
files live on a network drive, such chunked reads cannot utilize the
network bandwidth well; reading the whole files in bulk achieves much
higher bandwidth utilization.

When running with world_size > 1, all ranks collaboratively prefetch
these files.
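One plausible way to divide the work across ranks (a sketch; the commit's
actual partitioning scheme may differ, and `partition_for_rank` is a
hypothetical helper):

```python
def partition_for_rank(files: list[str], rank: int, world_size: int) -> list[str]:
    """Round-robin split: rank i prefetches files i, i + world_size, ...
    Each rank warms a disjoint subset, and the shared file cache makes
    the warmed data visible to every rank on the same node."""
    return files[rank::world_size]

# e.g. with 8 files and world_size=4, rank 1 prefetches files 1 and 5.
```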

In theory, we should add a heuristic to decide whether to prefetch the
files at all, but that is beyond the scope of this commit.

For example, when CPU memory is small, prefetching may cause file cache
thrashing and end up slowing down weight loading instead.
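A Linux-only sketch of what such a heuristic could look like, assuming
the whole checkpoint should fit in currently available RAM
(`should_prefetch` is a hypothetical helper, not part of this commit):

```python
import os

def should_prefetch(files: list[str]) -> bool:
    """Skip prefetching when the checkpoint cannot fit in available
    physical memory, since it would only thrash the file cache."""
    total_bytes = sum(os.path.getsize(f) for f in files)
    free_bytes = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    return total_bytes < free_bytes
```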

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-13 13:35:30 +08:00
__init__.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
_util.py feat: Support the Structural Tag in guided decoding (#4066) 2025-05-12 17:24:50 +08:00
config.py [TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156) 2025-05-12 21:35:15 -04:00
cuda_graph_runner.py [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
decoder.py feat: adopt new logprob definition in PyTorch flow (#4057) 2025-05-08 20:16:40 +08:00
guided_decoder.py feat: Support the Structural Tag in guided decoding (#4066) 2025-05-12 17:24:50 +08:00
kv_cache_transceiver.py cacheTransceiver buffer manager (#3798) 2025-04-27 11:48:15 +08:00
layerwise_nvtx_marker.py Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
llm_request.py feat: adopt new logprob definition in PyTorch flow (#4057) 2025-05-08 20:16:40 +08:00
model_engine.py feat: Prefetch safetensors files before loading them (#4140) 2025-05-13 13:35:30 +08:00
py_executor_creator.py [fix] Fix llama4 + eagle3 (#3998) 2025-05-08 19:20:27 -04:00
py_executor.py [TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156) 2025-05-12 21:35:15 -04:00
resource_manager.py [fix] Fix add_dummy_requests for spec decoding cases (#4084) 2025-05-09 16:52:51 +08:00
scheduler.py refactor: collect executor and decoder states into dataclass (#3234) 2025-04-15 16:31:45 +08:00
seq_slot_manager.py fix: skip add new slot if request has slot 0 (#3991) 2025-05-06 07:46:39 +02:00