TensorRT-LLMs/cpp/tensorrt_llm
katec846 eeb605abd6
feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380)
* Feat: Offload ptable to cpu if enable_chunk_context

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Feat: offload ptable to cpu for chunk context mode

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix and add comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update Readme for multimodal and add a new param mm_embedding_offloading

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* fix: Correct prompt table offloading condition in PromptTuningBuffers

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up the code

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Add commits to explain copy from cpu <-> gpu using pinned memory

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix namings based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix format based on precommit

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify --mm_embedding_offloading flag

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-21 14:31:01 +08:00
..
batch_manager feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380) 2025-04-21 14:31:01 +08:00
common feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387) 2025-04-21 10:01:33 +08:00
cutlass_extensions/include/cutlass_extensions feat: Update cutlass (#2981) 2025-03-26 22:36:27 +08:00
executor feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380) 2025-04-21 14:31:01 +08:00
executor_worker Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
kernels feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387) 2025-04-21 10:01:33 +08:00
layers fix: Eagle decoding (#3456) 2025-04-11 22:06:38 +08:00
plugins refactor: Clean up CMakeLists.txt (#3479) 2025-04-18 14:39:29 +08:00
pybind feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380) 2025-04-21 14:31:01 +08:00
runtime feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582) 2025-04-18 15:59:21 +08:00
thop feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387) 2025-04-21 10:01:33 +08:00
CMakeLists.txt refactor: Clean up CMakeLists.txt (#3479) 2025-04-18 14:39:29 +08:00