Yuxian Qiu
ec796e44e4
feat: add heuristics for checkpoint files prefetching. ( #4765 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-03 12:10:37 +08:00
Yan Chunwei
e013c8cbc2
fix [nvbug5256044]: bench hang due to llmapi ipc ( #4798 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-03 10:10:53 +08:00
Fanrong Li
380a5d1690
[https://nvbugs/5271281][fix] fix a pd+mtp accuracy issue ( #4536 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-03 10:03:34 +08:00
Tian Zheng
9832787050
[feat] Enable NVFP4 output for TRTLLM attention kernels ( #4737 )
...
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-06-03 10:00:17 +08:00
Yilin Fan
eb2d51a429
[fix] Fix llama4 min-latency mode ( #4810 )
2025-06-02 08:50:01 +08:00
Enwei Zhu
5b4852b7b5
feat: large-scale EP (part 5: Static EP load balancer with offline statistics) ( #4695 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-02 01:25:02 +08:00
Fanrong Li
7d356efc7d
fix: fix accuracy and illegal memory access issues when using mtp + attention dp ( #4379 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-02 00:35:52 +08:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H ( #4494 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
Lucas Liebenwein
491a09b0c6
[AutoDeploy] Increased Model Coverage Mass Migration Week 2 ( #4817 )
...
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-06-01 14:40:29 +08:00
Enwei Zhu
0087bd27ba
[fix] Fix SamplingParams check on n and best_of ( #4655 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 09:11:55 +08:00
Daniel Cámpora
69c7fe8905
[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler ( #4538 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-01 03:32:43 +08:00
Enwei Zhu
25dde49c28
fix: EP load balancer with MTP layer and route offset by EP rank ( #4767 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 00:07:44 +08:00
Yuxian Qiu
a02df6aa4b
fix: re-enable tp/pp for quickstart_advanced.py. ( #4766 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-31 19:13:46 +08:00
Yan Chunwei
93c0632ee4
opt: the performance for dist-agg streaming generation ( #4214 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-31 17:40:32 +08:00
Mike Iovine
8cb6163a57
[fix] Fix Llama 3.3 70b EAGLE ( #4772 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-30 10:08:08 -04:00
Yuxian Qiu
f82e44bbb9
fix: [nvbugs/5310520] disable embed_tokens's TP when DP enabled for llama model. ( #4758 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-30 18:04:08 +08:00
Pengyun Lin
bac22ff7b5
[feat] support sharegpt downloading in benchmark_serving ( #4578 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-30 17:27:53 +08:00
QI JUN
99fdef20c4
[TRTLLM-5516] perf: replicate dummy request for cuda graph padding ( #4729 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-30 17:14:23 +08:00
ixlmar
c026dda400
fix: iteration logging and typing in PyExecutor ( #4734 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-30 11:01:20 +02:00
ixlmar
7e6d06d5d7
feat: estimate GPU mem. usage w/ minimal KV cache ( #4574 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-30 10:40:45 +02:00
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE ( #4303 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00
hlu1
3093c747b7
[Architecture] Redesign Linear module ( #4721 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-29 16:05:46 -07:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main ( #4739 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
QI JUN
255779a91d
Chore: fuse _merge_requests method into _fetch_new_requests method ( #4689 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-29 17:47:44 +08:00
Yan Chunwei
33a9ba55f5
fix: test trtllm-bench mgmn ( #4613 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-29 14:43:47 +08:00
Yan Chunwei
ac17142495
chore: rename ExecutorBindingsWorker/Proxy ( #4716 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-29 10:32:35 +08:00
Arthur Rasmusson
812b1abf86
feature: KV Cache GPUDirect Storage ( #3209 )
...
Signed-off-by: Arthur Rasmusson <47877520+arthurrasmusson@users.noreply.github.com.>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-28 23:27:43 +00:00
Yuxian Qiu
bf691b3d28
feat: support packed weights in vanilla moe ( #4719 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-29 06:24:24 +08:00
Yan Chunwei
5506f60037
chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs ( #4603 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-28 18:43:04 +08:00
ixlmar
fbe4db207d
feat: forward exceptions to Python and catch OOMs ( #4497 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-28 11:58:10 +02:00
Iman Tabrizian
c875184f78
Add missing serialization classes ( #4642 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-28 16:40:23 +08:00
amirkl94
fbec0c3552
Release 0.20 to main ( #4577 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: stnie <82932102+stnie@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-28 16:25:33 +08:00
Pengyun Lin
971d16a2ee
[TRTLLM-1658][feat] Enable multiple response in trtllm-serve for TRT backend ( #4623 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-28 11:36:44 +08:00
Bo Li
9c4b8f66b4
feat: Integration of Fused QKNorm+RoPE. ( #4611 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-05-28 11:20:45 +08:00
Shunkangz
6493401986
Fix handle cancel request for attentionDP ( #4648 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-05-28 11:04:02 +08:00
Yuxian Qiu
5700a4ffcd
feat: Add vanilla MOE. ( #4682 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-28 10:44:14 +08:00
Yuxian Qiu
e538b0d95e
refactor: extract and reuse filter_weights. ( #4681 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-27 19:48:01 +08:00
Lucas Liebenwein
5cdd6bb10f
[AutoDeploy] Increased Model Coverage Mass Migration Week 1 ( #4468 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-27 16:43:15 +08:00
QI JUN
1582361400
Chore: only pad one dummy request for attention dp scenario ( #4664 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-27 14:56:22 +08:00
Tracin
268171bc66
[NVBUG 5301980] Fix fp4 gemm padding. ( #4662 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-27 11:30:53 +08:00
Chuang Zhu
4318037ca3
fix disagg config params ( #4646 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-26 23:28:52 +08:00
Enwei Zhu
88190faa34
feat: large-scale EP (part 4: Static EP load balancer integration) ( #4615 )
...
* MoeLoadBalancerConfig
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* MoeLoadBalancer integration
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* config file
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* test
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* test
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-26 18:25:11 +08:00
QI JUN
44eb053b95
introduce RequestQueueItem class instead of using tuple ( #4649 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-26 17:34:53 +08:00
Shunkangz
fd27f89df6
fix: Remove duplicate tokenization in generation server ( #4492 )
...
* Add nvtx
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Add draft change
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Refactor and add support of chat
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
---------
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-05-26 16:43:07 +08:00
QI JUN
4a81991b65
Chore: refine shutdown signal of PyExecutor ( #4614 )
...
* refine shutdown signal of PyExecutor
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* clean
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* fix ci
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* fix ci
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
---------
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-26 11:14:54 +08:00
Yuxian Qiu
8f055f5d14
feat: Skip sampler for intermediate pp stages. ( #4514 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-26 10:08:51 +08:00
Yibin Li
bb2f545729
fix pipeline tests due to rebase ( #4640 )
...
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-05-26 08:38:08 +08:00
shaharmor98
2b8f6d2871
Fix snake case format ( #4559 )
...
fix snake case format
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-25 17:57:17 +08:00
Anton
5dff0bff8f
[#4633][doc] Fixed typo in scaffolding README.md ( #4634 )
...
* Fixed typos in the scaffolding README.md
Signed-off-by: Anton <44649959+amemov@users.noreply.github.com>
* Fixed links for 'More examples' and 'Contribute Guide'
Signed-off-by: Anton <44649959+amemov@users.noreply.github.com>
---------
Signed-off-by: Anton <44649959+amemov@users.noreply.github.com>
2025-05-25 09:04:12 +08:00
Robin Kobus
7b2818a47b
refactor: CreateNewDecoderRequests ( #4452 )
...
* refactor: CreateNewDecoderRequests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Consolidate request generation in CreateNewDecoderRequests
- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Simplify request handling in CreateNewDecoderRequests
- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests
- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests
- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters
- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests
- Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 22:54:37 +08:00