Commit Graph

183 Commits

Author SHA1 Message Date
Daniel Cámpora
d68b8180d3
feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler (#4828)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-10 07:28:34 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
Yuxian Qiu
e79527d195
chore: Refine weight prefetching. (#4893)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-09 21:24:16 +08:00
Mike Iovine
f4d9c87c51
[nvbug/5314469][feat] Include the executor's max batch size in CUDA g… (#4843)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-09 08:31:35 -04:00
Yukun He
137fe35539
fix: Fix warmup phase batch size out of range. (#4986)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-09 19:19:16 +08:00
amitz-nv
77e8d739f1
[TRTLLM-4987][feat] Support generation logits in TRTLLMSampler (#4819)
2025-06-09 06:30:01 +03:00
Yechan Kim
8b4104d34a
feat: add HyperCLOVAX-SEED-Vision support in refactored way (#4799)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-09 11:04:04 +08:00
Omer Ullman Argov
8731f5f14f
chore: Mass integration of release/0.20 (#4898)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-08 23:26:26 +08:00
Mike Iovine
ec0d984656
[nvbug/5280806][fix] Fix 2 model spec decode flow (#4807)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-08 07:40:02 -04:00
dongxuy04
1e369658f1
feat: large-scale EP (part 6: Online EP load balancer integration for GB200 nvfp4) (#4818)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-06-08 10:25:18 +08:00
QI JUN
5ee0de7f2a
Resubmit #4894 (#4969)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-08 04:42:15 +08:00
nv-guomingz
0c7dd660d8
fix: https://nvbugs/5324248 (#4973)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-07 04:14:07 +08:00
Fanrong Li
75d020cf07
fix: fix cuda graph padding for spec decoding (#4853)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-06 22:21:42 +08:00
QI JUN
ec50684d80
Revert "fix a bug of global cuda graph dummy request" (#4970) 2025-06-06 08:54:45 +08:00
QI JUN
154f7cc40a
fix a bug of global cuda graph dummy request (#4894)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 19:47:40 +08:00
QI JUN
b8c5e3892b
Revert "fix: build_config in TorchLlmArgs and avoid invalid args" (#4949)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 17:43:30 +08:00
ixlmar
6437756da8
fix: handle OOMs during KV cache estimation (#4690)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-05 10:02:26 +02:00
ixlmar
2bbb6b5976
chore: introduce KvCacheCreator (#4581)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-04 11:03:17 +02:00
tomeras91
8d31e16877
[TRTLLM-4923][feat] Paged mamba cache (#4822)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-04 09:27:08 +03:00
Yan Chunwei
ac20159d32
fix: build_config in TorchLlmArgs and avoid invalid args (#4600)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-04 13:17:29 +08:00
QI JUN
e2eea80c1d
Chore: refine comments of prepare inputs method of model engine (#4837)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-04 12:14:13 +08:00
Yukun He
5fa6fbd989
feat: Enhance AutoTuner inference path and code readability (#4466)
Fix AutoTuner warmup request generation.
* The current warmup phase creates a single request, which is not enough to cover max_num_tokens. Revise the warmup phase to issue a batch of requests that covers max_num_tokens, eliminating potential fallback cases.

Refactor AutoTuner API and reduce host overhead.
* Refine the (min, opt, max) values of the optimization profile setup for get_valid_tactics so that canImplement is defined correctly.
* Refine the cache key assembly process to reduce host overhead and simplify the API.
* Fix lru_cache usage to reduce host overhead.
* Make tuning config initialization a one-time object in the tunable runner to reduce host overhead.

Improve tuning config readability.
* Use a dataclass to define the tuning config.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-04 10:53:11 +08:00
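A minimal sketch of the two techniques this commit describes — a dataclass-based tuning config and lru_cache-backed cache-key assembly. All names below are hypothetical illustrations, not the actual AutoTuner API:

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Tuple

@dataclass(frozen=True)
class TuningConfig:
    """Tuning knobs grouped in a dataclass for readability (hypothetical fields)."""
    dynamic_dims: Tuple[int, ...] = (0,)   # which input dims may vary
    max_profiles: int = 8                  # cap on profiles to try

@lru_cache(maxsize=4096)
def make_cache_key(op_name: str, shapes: Tuple[Tuple[int, ...], ...]) -> str:
    # Assembling the key once per unique (op, shapes) pair keeps string
    # building off the hot path; lru_cache memoizes repeated lookups.
    return f"{op_name}|" + ";".join("x".join(map(str, s)) for s in shapes)

config = TuningConfig()
key = make_cache_key("gemm", ((8, 512), (512, 1024)))
```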
Shunkangz
c835f06371
Refactor the first token response in PD (#4692)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-06-04 09:11:23 +08:00
hlu1
b4ed4b22f3
[Arch] Freeze model_config (#4814)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-06-04 02:51:35 +08:00
pcastonguay
01f29ce38b
[nvbug 5294316] fix: Fix queued request stats (#4714)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-06-03 08:33:08 -04:00
Robin Kobus
3de02582dd
refactor: Separate DecoderState from GptDecoderBatched (#4700)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 09:42:01 +02:00
Robin Kobus
b9263a8e10
fix: max_num_sequences calculation with overlap scheduling (#4532)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-03 09:31:22 +02:00
Yuxian Qiu
ec796e44e4
feat: add heuristics for checkpoint files prefetching. (#4765)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-03 12:10:37 +08:00
Fanrong Li
380a5d1690
[https://nvbugs/5271281][fix] fix a pd+mtp accuracy issue (#4536)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-03 10:03:34 +08:00
Enwei Zhu
5b4852b7b5
feat: large-scale EP (part 5: Static EP load balancer with offline statistics) (#4695)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-02 01:25:02 +08:00
Fanrong Li
7d356efc7d
fix: fix accuracy and illegal memory access issues when using mtp + attention dp (#4379)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-02 00:35:52 +08:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
Daniel Cámpora
69c7fe8905
[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-01 03:32:43 +08:00
QI JUN
99fdef20c4
[TRTLLM-5516] perf: replicate dummy request for cuda graph padding (#4729)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-30 17:14:23 +08:00
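The padding idea in this change can be pictured with a small sketch: replicate one dummy request until the batch matches the size the CUDA graph was captured with. This is an illustration only (the make_dummy factory is hypothetical, not the actual PyExecutor code):

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def pad_for_cuda_graph(requests: List[T], graph_batch_size: int,
                       make_dummy: Callable[[], T]) -> List[T]:
    """Replicate a single dummy request so the batch matches the captured
    CUDA-graph batch size; an undersized batch would miss the graph."""
    missing = graph_batch_size - len(requests)
    if missing <= 0:
        return requests
    dummy = make_dummy()
    return requests + [dummy] * missing
```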
ixlmar
c026dda400
fix: iteration logging and typing in PyExecutor (#4734)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-30 11:01:20 +02:00
ixlmar
7e6d06d5d7
feat: estimate GPU mem. usage w/ minimal KV cache (#4574)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-30 10:40:45 +02:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main (#4739)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
QI JUN
255779a91d
Chore: fuse _merge_requests method into _fetch_new_requests method (#4689)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-29 17:47:44 +08:00
Yan Chunwei
5506f60037
chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs (#4603)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-28 18:43:04 +08:00
ixlmar
fbe4db207d
feat: forward exceptions to Python and catch OOMs (#4497)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-28 11:58:10 +02:00
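Once C++ exceptions surface as Python exceptions, out-of-memory conditions can be handled instead of crashing the process. A sketch of the general pattern on the Python side, assuming PyTorch's torch.cuda.OutOfMemoryError (the actual binding-layer translation lives in the PR):

```python
import torch

def try_allocate(shape):
    """Probe an allocation and report OOM instead of aborting (sketch)."""
    try:
        return torch.empty(shape, device="cuda")
    except torch.cuda.OutOfMemoryError:
        # An OOM raised in C++ and translated at the binding layer would be
        # caught the same way once it surfaces as a Python exception.
        torch.cuda.empty_cache()
        return None
```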
Bo Li
9c4b8f66b4
feat: Integration of Fused QKNorm+RoPE. (#4611)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-05-28 11:20:45 +08:00
Shunkangz
6493401986
Fix cancel-request handling for attention DP (#4648)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-05-28 11:04:02 +08:00
QI JUN
1582361400
Chore: only pad one dummy request for attention dp scenario (#4664)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-27 14:56:22 +08:00
Enwei Zhu
88190faa34
feat: large-scale EP (part 4: Static EP load balancer integration) (#4615)
* MoeLoadBalancerConfig
* MoeLoadBalancer integration
* config file
* test
* test
* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-26 18:25:11 +08:00
QI JUN
44eb053b95
introduce RequestQueueItem class instead of using tuple (#4649)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-26 17:34:53 +08:00
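The change this title describes — replacing an anonymous tuple with a named class — might look roughly like the sketch below. The fields are hypothetical; the real RequestQueueItem may carry different members:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class RequestQueueItem:
    """Named queue entry replacing a bare (id, request) tuple (sketch)."""
    request_id: int
    request: Optional[Any] = None   # None could signal a shutdown sentinel

    @property
    def is_shutdown(self) -> bool:
        return self.request is None

item = RequestQueueItem(request_id=1, request={"prompt": "hi"})
assert not item.is_shutdown
```

Named fields and helper properties make call sites self-documenting, where tuple indexing (item[0], item[1]) hides intent.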
QI JUN
4a81991b65
Chore: refine shutdown signal of PyExecutor (#4614)
* refine shutdown signal of PyExecutor
* clean
* fix ci
* fix ci

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-26 11:14:54 +08:00
Yuxian Qiu
8f055f5d14
feat: Skip sampler for intermediate pp stages. (#4514)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-26 10:08:51 +08:00
shaharmor98
2b8f6d2871
Fix snake case format (#4559)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-25 17:57:17 +08:00
Robin Kobus
7b2818a47b
refactor: CreateNewDecoderRequests (#4452)
* refactor: CreateNewDecoderRequests

* refactor: Consolidate request generation in CreateNewDecoderRequests
- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.

* refactor: Simplify request handling in CreateNewDecoderRequests
- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.

* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests
- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.

* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests
- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.

* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters
- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.

* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests
- Updated sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.

* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 22:54:37 +08:00
zhhuang-nv
8452775db8
[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535)
* optimize kv cache reuse workflow for MLA
- Write the KV cache first and call the up-projection GEMM only once.
- Relax the contiguity requirements on k/v when setting the paged KV cache.
- Return two contiguous tensors when loading the MLA KV cache.

* support fp8 kv cache for MLA kv cache reuse

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-23 19:47:50 +08:00
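The "up-projection GEMM once" idea above can be pictured with a small sketch: MLA stores a compressed latent per token, so on cache reuse the loaded latents can be up-projected to K/V in one batched GEMM instead of once per decode step. Shapes and weight names here are hypothetical, not the actual kernel interface:

```python
import torch

# Hypothetical dimensions: cached MLA latent per token, up-projection weights.
num_tokens, d_latent, n_heads, d_head = 128, 512, 16, 64
w_uk = torch.randn(d_latent, n_heads * d_head)  # up-projection for K
w_uv = torch.randn(d_latent, n_heads * d_head)  # up-projection for V

# The reused cache holds the compressed latent; after loading it, one GEMM
# over all tokens reconstructs K and V, returned as contiguous tensors.
latent = torch.randn(num_tokens, d_latent)      # stand-in for loaded cache
k = (latent @ w_uk).view(num_tokens, n_heads, d_head).contiguous()
v = (latent @ w_uv).view(num_tokens, n_heads, d_head).contiguous()
```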