Commit Graph

421 Commits

Author SHA1 Message Date
Daniel Cámpora
69c7fe8905
[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-01 03:32:43 +08:00
Enwei Zhu
25dde49c28
fix: EP load balancer with MTP layer and route offset by EP rank (#4767)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 00:07:44 +08:00
Dom Brown
338d6e9f95
[nvbug 5305210] fix: Resolve nvbug 5305210 (#4759)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-31 19:21:06 +08:00
Chuang Zhu
f117d6abe9
Fabric Memory for KV Cache Transfer (#4717)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-30 15:50:21 +08:00
Thor Johnsen
55d56f8155
[JIRA-5226219][fix] Fix Bug in KV cache manager (#4596)
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
2025-05-29 22:03:20 -07:00
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE (#4303)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main (#4739)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
Robin Kobus
79a94a28f9
refactor: unique_ptr instead of shared_ptr (#4697)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-29 22:49:35 +02:00
Jhao-Ting Chen
fcadce9f8d
[fix] Eagle-2 LLMAPI pybind argument fix. (#3967)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-29 12:23:25 -07:00
Arthur Rasmusson
812b1abf86
feature: KV Cache GPUDirect Storage (#3209)
Signed-off-by: Arthur Rasmusson <47877520+arthurrasmusson@users.noreply.github.com.>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-28 23:27:43 +00:00
Robin Kobus
12763779c4
chore: Clean up cpp runtime (#4449)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-28 16:32:59 +02:00
ixlmar
fbe4db207d
feat: forward exceptions to Python and catch OOMs (#4497)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-28 11:58:10 +02:00
Kaiyu Xie
b800adc65c
Fix: hang on disagg when MNNVL two-shot AllReduce is enabled (#4678)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-28 13:03:53 +08:00
yunruis
29ac4c20e0
fix: fix dsr1 min lat cga ar rate drop(0.2) (#4561)
Signed-off-by: yunruis <yunruis@nvidia.com>
2025-05-27 21:59:57 +08:00
Perkz Zheng
40a7161f4f
fix: fmha_v2 compilation (#4659)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-27 17:39:39 +08:00
qsang-nv
157fe62965
fix fmha v2 tests (#4661)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-05-27 09:47:01 +08:00
Robin Kobus
93a54457ac
[nvbugs/5274894] fix: Sort requests for functional correctness and performance (adapted from #4608) (#4621)
- Moved sorting related logic to a dedicated function for better clarity and maintainability.
- Enhanced sorting logic to separate finished context requests from ongoing ones before sorting by Lora task ID.
- Updated function documentation to reflect the sorting behavior and its purpose.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-26 17:10:55 +08:00
Robin Kobus
502758aaa9
fix: Handle additional model outputs based on pipeline parallel rank (#4498)
- Only allocate additional outputs on last pipeline parallel rank in trtGptModelInflightBatching and executorImpl.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-26 09:04:40 +02:00
Zheng Duan
ce7f5fae5a
sort llm request state (#4607)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-05-26 13:47:01 +08:00
Perkz Zheng
4d711be8f4
Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564)
* move cubins to LFS

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add sliding-window-attention generation-phase kernels on Blackwell

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* address comments

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-26 09:06:33 +08:00
shaharmor98
2b8f6d2871
Fix snake case format (#4559)
fix snake case format

Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-25 17:57:17 +08:00
Chuang Zhu
b60846b47d
fix datatype check (#4606)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-24 08:36:17 +08:00
Yao Yao
ef763b0ddc
fix: rename some terms (#4534)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-05-23 23:23:49 +08:00
Robin Kobus
7b2818a47b
refactor: CreateNewDecoderRequests (#4452)
* refactor: CreateNewDecoderRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Consolidate request generation in CreateNewDecoderRequests

- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Simplify request handling in CreateNewDecoderRequests

- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests

- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests

- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters

- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests

- Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 22:54:37 +08:00
zhhuang-nv
8452775db8
[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535)
* optimize kv cache reuse workflow for MLA

write kv cache first and only call up-projection GEMM once
relax contiguous requirements of k/v for setting paged kv cache
return two contiguous tensors when loading MLA KV Cache

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* support fp8 kv cache for MLA kv cache reuse

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-23 19:47:50 +08:00
Anthony Chang
bbea2647b1
Qwen3 supports TRTLLM FP4 MoE backend (#4530)
* MoE TRTLLM backend for Qwen3

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* add extra moe_backend to test

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* address comments

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* conditionally compile kernels on newer archs

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* missing positional arg

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* Update the routing kernels

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Revise usage of TLLM_LOG_ERROR

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Add unit test for Qwen3 moe (trtllm_gen backend)

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* improve weight processing speed of moe_backend=TRTLLM; roughly 2x

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* tidy and minor fix

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* temporarily disable accuracy test that has known issue

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

---------

Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
2025-05-23 18:31:08 +08:00
Bo Li
9ae705af1b
perf: Add fused q_norm/k_norm/RoPE for Qwen3. (#4482)
* Add Julien's original kernel.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Get rid of UpdateKVCache functionality.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add kernels.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add torch OP.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Update cmake.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Torch OP must use double as argument dtype.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add unittest.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add unittest.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Fix misaligned access when head_dim=64.
In this case, numElemsPerThread=2 and numVecPerThread=0, but the store code incorrectly performs a vectorized store, so some threads (e.g., lane 1) issue stores to addresses that are not 64-bit aligned.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Remove unroll (compiler can do that).
Cleanup code.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add switch for interleave.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Refactor vectorized load/store.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Implement is_neox. Result not correct yet.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Fix is_neox=True.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add q_weight and k_weight.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

---------

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-05-23 15:31:04 +08:00
djns99
87f734b563
[https://nvbugs/5297775] fix: Correct memory guard for large MOE tests to account for TP space (#4553)
fix: Correct memory guard for large MOE tests to account for TP space

Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-05-23 14:57:49 +12:00
CarstyYou
ef280e687e
[feat] support fp8 blockscale gemm on sm89 (#4481)
* [feat] integrate ada blockwise gemm

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [fix] align scale M

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [feat] swizzle mma output

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [test] add ut for sm89

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [delete] remove useless comments

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [chore] codestyle

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [fix] fix review comments

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [chore] fix license

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [chore] fix license

Signed-off-by: CarstyYou <xiy@nvidia.com>

---------

Signed-off-by: CarstyYou <xiy@nvidia.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-05-23 10:39:10 +08:00
nv-guomingz
e3a534d0ee
chore: guardword clean for header file. (#4540)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-23 10:08:14 +08:00
pcastonguay
d7d455e7ea
[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243)
* feat: Enabling dis serving with TRT backend with Python runtime

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing formatting

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing disagg mtp test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-22 22:01:06 -04:00
dongxuy04
338744fba6
fix[nvbug-5295425]: [TRTLLM-5385] fix race condition in MoeLoadBalancer (#4573)
fix moe possible race cond and add bypass worker thread for no updates

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-23 09:24:23 +08:00
nv-guomingz
3549b68c1c
chore: clean useless flag (#4567)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-23 07:05:15 +08:00
Mike Iovine
9c0de251db
[feat] Integrate Hopper chunked attention kernels (#4330)
* Integrate chunked attention kernels

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix cache key

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix lint

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

---------

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-22 17:10:57 -04:00
Chuang Zhu
558eaecf16
fix sequence data race (#4565)
stash for debug broken promise

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 23:13:48 +08:00
Chuang Zhu
44cfd757b2
Agent interface impl for NIXL (#4125)
* agentConnection

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

recv

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

agentState

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

NIXL interfaces

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

update cmakelists

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

nixl improve

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

remove cppzmq

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

transferAgent remove register

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

work for cache Test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

reduce sleep time

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

integrate

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

nixl env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix rebase error

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

cpp test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

stash for send metaData

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

loadRemoteMD after fetchRemoteMD

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

workaround for mixed gen and context

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

test_env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

avoid port conflict in test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use std::string

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* typo

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix transferAgentTest

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 09:09:41 +08:00
Nikita Korobov
e1b42be3d1
fix: TRT-LLM Gen dtype declaration (#4503)
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-05-21 23:56:37 +02:00
Zongfei Jing
dbaddb3a29
Adding two-shot allreduce kernel and mnnvl multicasting buffer (#4216)
* Adding two-shot allreduce kernel and mnnvl multicasting buffer

Signed-off-by: Shiyu Li <shili@nvidia.com>

Adding comments

Signed-off-by: Shiyu Li <shili@nvidia.com>

Add unittest of the twoshot kernel.

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update dispatch logic

Signed-off-by: Shiyu Li <shili@nvidia.com>

Use cpu barrier instead of GPU at init

Signed-off-by: Shiyu Li <shili@nvidia.com>

Merge dispatch logic fix

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update the kernel to use GPU-managed buffer

Signed-off-by: Shiyu Li <shili@nvidia.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix issue

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean up

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Simplify AllReduce interface

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix warning

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Tidy code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Skip ut for no_fusion

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Shiyu Li <shili@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Shiyu Li <shili@nvidia.com>
2025-05-22 03:42:36 +08:00
Robin Kobus
cd0c826417
refactor: DisaggExecutorTest (#4398)
* chore: Improve formatting of DisaggExecutorTest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Typed InstanceRole param in DisaggExecutorTest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Skip DisaggExecutorTest based on device count

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-21 18:01:45 +08:00
Perkz Zheng
6a35c599ef
Clean: fmha codes (#4496)
clean codes

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-21 11:45:47 +08:00
Ruoqian Guo
db7446fda7
Feat: add deep_gemm swapab Kernel (#4430)
* feat: add deepgemm_swapab

feat: add fp8_gemm_kernel_swapab

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

feat: set threshold for deepgemm and deepgemmswapab

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* docs: update README.md

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* fix: std::runtime_error needs #include <stdexcept>

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* chores: remove the redundant code

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* feat: support for dense deep_gemm swapab

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* chores: remove redundant code

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

---------

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-05-21 10:48:43 +08:00
Shi Xiaowei
3d62727303
test: NIXL single process test (#4486)
2025-05-21 10:41:46 +08:00
Thor Johnsen
5d438be59a
[TRTLLM-5000][feat] Pytorch implementation of ngram drafter (#3936)
* v1.5

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

v1.5.4 Add back draft_overhead to spec dec stats

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* v1.5.5: fix CI error

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.6: fix CI error 8196 > 8192

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* Address reviewer concerns

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* Address reviewer concerns

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* precommit run

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* v2.0: Address reviewer concerns

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v2.1: add fix from wili

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* Revert changes that require use of TypeAlias because that requires python version >= 3.10

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

---------

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-21 10:40:00 +08:00
Perkz Zheng
426f6fd2bc
Feat: add chunked-attention kernels on Blackwell (#4394)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add chunked-attention kernels on blackwell

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

fix

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-21 10:16:46 +08:00
djns99
a030a898d1
perf: Fuse gemm setup function for SM90/SM100 MOE plugin path (#4146)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-05-21 10:00:36 +08:00
Robin Kobus
8564c5a41f
refactor: Unify request order in TRT and PyTorch workflow (#4096)
* chore: Partition context requests in MicroBatchScheduler

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! chore: Partition context requests in MicroBatchScheduler

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-20 18:49:27 +02:00
dongxuy04
21aff2e313
feat: large-scale EP(part 2: MoE Load Balancer - core utilities) (#4384)
* first commit of cpp moe loadbalance code

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python bindings for moe load balance

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python wrapper, ut and bug fixes

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add binding for layerId and update binding test

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add host tensor sharing and ut

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-20 17:53:48 +08:00
kanghui0204
6f3922f318
feat: Low Precision Allreduce for PCIe based GPU (#4344)
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process.

Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-20 06:53:46 +08:00
Yuxian Qiu
c8e062bfd3
fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-19 14:25:36 -07:00
Perkz Zheng
1c5b0d6a13
[Feat] add chunked-attention kernels on Hopper (for llama4) (#4291)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add mtp for fmha_v2 MLA kernels and add chunked-attention support for hopper fmha kernels

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-05-19 09:57:10 -07:00
Faraz
7656af1b57
[TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) (#4335)
* add mixtral 8x7b fp8 test with fixed cutlass fp8 moe gemm

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

* update cutlass versions

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

* added internal cutlass with fix and docker update

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

* added mixtral to pro 6000

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

---------

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
2025-05-19 08:56:21 -07:00
liji-nv
58e405624a
[https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 (#3952)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-19 22:12:25 +08:00
Dom Brown
c45f414bbf
Test: Improve model re-use in C++ DGX tests for CI stability (#4263)
* Fix padded vocab size for Llama

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Refactor multi GPU llama executor tests, and reuse the built model engines

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Fix test list typo

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* WIP

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Further WIP

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* WIP

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Update test lists and readme

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Try parametrize for asymmetric

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Parametrize + skip unsupported combinations

Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>

* Update test list

Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>

* Reduce environment duplicated code

Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>

---------

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
2025-05-19 14:20:21 +01:00
Shi Xiaowei
df2798e0c3
feat: NIXL interface integration (#3934)
NIXL interfaces

Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-05-19 18:18:22 +08:00
Void
62bb7f9286
fix potential issues in allreduce fusion kernel and ut (#4226)
fix allreduce fusion kernels and ut

Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>

---------

Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>
2025-05-19 17:38:29 +08:00
Jinyang Yuan
b618e1f55b
perf: Eliminate the need for attention DP padding when possible (#3439)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-05-17 13:30:55 +08:00
Robin Kobus
4e370a509a
refactor: Copy sequence lengths once in decoder setup (#4102)
* refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update DecoderInputBuffers to remove duplicated buffers

- Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability.
- Adjusted references in generateRequestOptions.cpp to align with the new buffer structure.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move getEmbeddingBias to anonymous namespace

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Filter context requests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: GenerateRequestOptions using more fine-grained functions

- Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests.
- Updated the `operator()` method to utilize the new method, improving code clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMDecoder

- Updated the `generate_request_options` call.
- Updated the `make_decoding_batch_input_output` call.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove const where we modify input buffers

- Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications.
- Updated related function calls to ensure compatibility with the new parameter types.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-16 22:03:55 +08:00
Nikita Korobov
fa3879629e
feat: TRT-LLM Gen integration for BMM and MoE refactoring (#4280)
- Adds BatchedGemm cubins and the respective call interface from TensorRT-LLM Generator. 
- Refactors TRT-LLM Gen MoE runner to call to BMM interface
- The accuracy is verified for DeepSeek R1 FP4 

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-05-16 13:31:53 +02:00
ixlmar
f7ad49bb9b
chore: improve log-level setting UX (#4352)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-16 09:47:44 +01:00
Yuan Tong
f5ddb7ab4a
fix: support TensorRT 10.11+ in FindTensorRT.cmake (#4353)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-05-16 14:04:56 +08:00
NVJiangShao
6cc3f2093a
Fix bias shape in weightOnlyGroupwiseQuantMatmulPlugin for TRT workflow (#4348)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>
2025-05-16 10:02:30 +08:00
Erin
c44cf34373
fix: update checks that broke medusa tests when use_py_session=True (#4339)
fix check

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-15 15:47:28 -07:00
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
yuxianq
0e87fcc228
refactor: use x is None instead of x == None. (#4244)
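The motivation for this kind of refactor can be illustrated with a minimal sketch. `==` dispatches to a user-defined `__eq__`, while `is` compares object identity and cannot be overridden. The `AlwaysEqual` class below is hypothetical and exists only for illustration; it is not taken from the repository.

```python
class AlwaysEqual:
    """Hypothetical type whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


value = AlwaysEqual()

# Equality is delegated to AlwaysEqual.__eq__, so this is misleadingly True.
equality_says_none = (value == None)  # noqa: E711

# Identity only checks whether `value` is the None singleton itself.
identity_says_none = (value is None)
```

The same concern applies to array-like types that overload `==` element-wise, where `x == None` does not yield a plain boolean at all.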
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-15 20:00:04 +08:00
Yuan Tong
593f65ff6a
fix: better method to help torch find nvtx3 (#4110)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-05-15 16:42:30 +08:00
zhhuang-nv
97bc680cd8
feat: support kv cache reuse for MLA (#3571)
* support kv cache reuse for MLA

load compressed_kv and k_pe and do up-projection
use 192/128 head size MLA context kernel
support Blackwell and Hopper now

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* add CI test

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* use GPTJ style RoPE for MLA

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix rebase error and some docs

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix kv_lens

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* tiny fix

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: use normal device memory instead of pinned memory for unit test

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* fix L0 tests

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile after rebase

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments again

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-05-15 15:22:21 +08:00
Zhanrui Sun
5dc3b539ba
infra: Downgrade the gcc toolset version from 13 to 11 (#4114)
* Downgrade the gcc toolset version from 13 to 11

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Update rocky8 images

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-05-15 11:08:51 +08:00
qsang-nv
0fd59d64ab
infra: open source fmha v2 kernels (#4185)
* add fmha repo

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix format

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix code style

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix header

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix header kernel_traits.h

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* add .gitignore file

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* add SLIDING_WINDOW_ATTENTION

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix style

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix format

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* update setup.py

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* update build_wheel.py

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

---------

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
Signed-off-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>
2025-05-15 10:56:34 +08:00
QI JUN
498ce8a056
Revert "feat: Low Precision Allreduce for PCIe based GPU" (#4340)
Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)"

This reverts commit 5e634dd1bd.
2025-05-15 09:52:39 +08:00
hlu1
7fb0af9320
[fix] Remove stale cublas heuristics (#4326)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-14 17:35:51 -07:00
Robin Kobus
d31fefde2c
[TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow (#4092)
* chore: Remove GptSession/V1 from TRT workflow

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove stateful decoders

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession buffers

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession utils

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession kernels

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove V1 GPT models from tests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove gptSessionBenchmark from scripts and docs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove gptSession IO classes

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession from test lists

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession from docs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove useless encoder test

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove mActualBatchSize from DecoderState

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove static batching from ExecutorTest

- Updated `validateContextLogits` and `validateGenerationLogits` functions to remove the `batchingType` parameter.
- Adjusted related test functions to reflect the changes in parameter lists.
- Cleaned up the instantiation of test cases to eliminate unnecessary batchingType references.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-14 23:10:04 +02:00
Robin Kobus
c67da1fbaa
fix: Eagle decoding in TRT flow (#4229)
* fix: EagleBuffers lifetime issue

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Clean up Eagle kernel parameters

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: Eagle draft tokens init

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Add check for updated sequence length in TrtGptModelInflightBatching

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: Skip check for beam search

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-14 16:10:49 +02:00
DylanChen-NV
206f82115d
[bug/5247505] fix: CP accuracy on Blackwell (#4188)
* fix xqa params for cp

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* add test

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* add test

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* try adding B200 multi gpu test

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* add accuracy tests for cp

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

---------

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-05-14 17:40:50 +08:00
kanghui0204
5e634dd1bd
feat: Low Precision Allreduce for PCIe based GPU (#3851)
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process.

Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-14 16:45:43 +08:00
Barry Kang
20b42912ce
[TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper (#4123)
Support DeepSeek-R1 W4A8 on Hopper

Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-05-14 15:48:07 +08:00
Perkz Zheng
e8d7834c50
fix: [https://nvbugspro.nvidia.com/bug/5238626] illegal memory address when running llama 4 with cuda graph enabled (#4101)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-13 14:58:54 +08:00
pcastonguay
9643be5f20
[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156)
* feat: Add per-request stats support with PyT backend

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing stats unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing test with overlap

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-12 21:35:15 -04:00
Simeng Liu
286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

Selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained based on nsys kernel runtime data.
The default kernel selection is the base kernel.

The Python operator group_rms_norm uses the heuristic by default.
Users can also explicitly select the base or large-batch kernel.
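The selection logic described above could be sketched roughly as follows. This is an illustrative sketch only: the feature set, the `elems_per_warp` granularity, and the weights and bias are hypothetical placeholders, not the trained models shipped with this change.

```python
import math

# Hypothetical coefficients for (batch_size, warps_base, warps_large).
HYPOTHETICAL_WEIGHTS = (0.002, -0.05, 0.0)
HYPOTHETICAL_BIAS = -1.0


def kernel_features(batch_size, hidden_dims, elems_per_warp=256):
    """Build the feature vector; elems_per_warp is an assumed granularity."""
    warps_per_input = [math.ceil(d / elems_per_warp) for d in hidden_dims]
    warps_base = sum(warps_per_input)   # base kernel: proportional to sum of dims
    warps_large = max(warps_per_input)  # large-batch kernel: proportional to max dim
    return (batch_size, warps_base, warps_large)


def select_kernel(batch_size, hidden_dims):
    """Return 'large_batch' if the logistic model favours it, else 'base'."""
    features = kernel_features(batch_size, hidden_dims)
    z = HYPOTHETICAL_BIAS + sum(
        w * f for w, f in zip(HYPOTHETICAL_WEIGHTS, features)
    )
    probability = 1.0 / (1.0 + math.exp(-z))
    return "large_batch" if probability > 0.5 else "base"


small = select_kernel(16, [512, 1536])
large = select_kernel(4096, [512, 1536])
```

With these placeholder coefficients the base kernel remains the choice for small batches, matching the stated default, and the large-batch kernel is chosen only once the batch size dominates the score.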

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
wili
eba3623a54
Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979)
* feat/vbws-part4-v1.8: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* feat/vbws-part4-v1.9: fix incorrect output when using short output length

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.1: remove useless variables

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.2: fix incorrect output when using short output length

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.3: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.4: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.5: remove API change

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

---------

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-12 22:32:29 +02:00
Yixin Dong
c90ebadd84
feat: Support the Structural Tag in guided decoding (#4066)
* finish

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* exc overlap scheduler

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add test

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix api ref

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Ubospica <ubospica@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-12 17:24:50 +08:00
Perkz Zheng
3f29d2f006
Feat: support exporting softmax statistics and update the kernel-selection heuristic (#4155)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* support exporting softmax statistics and update the kernel-selection heuristic

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-12 15:31:46 +08:00
Dom Brown
2d0f93a054
Refactor: Restructure C++ tests for better modularisation of non-shared code (#4027)
* Refactor: Restructure C++ tests for better modularisation of non-shared code

Start cleanup of pytest code for C++ tests

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

Clean up names and remove references to test_cpp.py

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

WIP

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

Move multi-GPU code

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

Update doc and try un-waiving

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Update multi GPU file check

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Address minor multi-GPU setup bug

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

---------

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-09 19:16:51 +01:00
zhhuang-nv
0a36db0aa4
[fix] trtllm-gen mla kernel warnings (#4119)
fix trtllm-gen mla kernel warnings

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-09 20:21:28 +08:00
NVJiangShao
57b2fe2019
[#4085][fix] Fix apply_per_channel_scale for extremely large input sequence length. (#4089)
Fix apply_per_channel_scale for extremely large input seq length.

Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Co-authored-by: crazy-JiangDongHua <759421566@qq.com>
2025-05-09 11:57:01 +08:00
Yi Zhang
91bf5e6a8e
[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804)
Add Piecewise CUDA Graph Support

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-05-09 11:04:01 +08:00
Yukun He
5b61486d87
chore: Clean up the legacy DeepseekAllreduceFusionOp. (#4081)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-09 10:20:41 +08:00
forrestl
9477661f4c
Support RingAttention in the BertAttention plugin and the DiT model (#3661)
support ring attn for bert_attention plugin and dit model

Signed-off-by: ChunhuanLin <lch_xdu@163.com>
2025-05-09 08:06:54 +08:00
chenfeiz0326
7f5716ef83
Cherry-pick trtllm-gen from feat/llama4 to main (#4086)
* feat: TRT-LLM Gen FP8 MoE Llama4

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* feat: TRT-LLM Gen llama4 MoE Top1 routing

Signed-off-by: Jiqun Tu <jtu@nvidia.com>

* feat: add per tensor FP8 TRT-LLM Gen GEMMs

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add guard for routingIndicesClusterKernel

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Guard sm90+ for routingkernels

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Guard sm90+ for routingkernels

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

---------

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Nikita Korobov <nkorobov@nvidia.com>
Co-authored-by: Jiqun Tu <jtu@nvidia.com>
2025-05-08 14:13:01 -07:00
Yukun He
bb7bcc75c2
feat: Fallback to NCCL for various patterns when input size is large. (#4080)
* Fallback to NCCL for various patterns when input size is large.
Move the previous implementation to cpp side.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Revising.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-08 11:13:13 -07:00
nv-guomingz
4dfa3ccf43
chore: enhance the cmake experience by ignoring the additional semicolon (#3992)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-08 18:43:36 +08:00
Simeng Liu
bb766eca0a
feat: Reduce branch overhead in groupRMSNorm kernels (#4067)
* feat: Reduce branch overhead in groupRMSNorm kernels
* Fix race condition with sm < 90 and avoid all threads in one warp writing to the same shared memory.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-08 00:55:27 +08:00
Yan Chunwei
0c26059703
chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732)
* beam_width and max_new_token

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove beam_width

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove min_length

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove return_num_sequences

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-07 13:20:25 +08:00
Chuang Zhu
09a28becae
fix cache buffer (#3942)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-07 09:49:44 +08:00
Daniel Cámpora
c56a2aca46
fix: Properly get decoding mode according to same logic as cpp. (#4026)
* Properly get decoding mode according to same logic as cpp.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Cross reference getDecodingMode implementations in pytorch - cpp.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Better bindings for DecodingMode.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Revert to version in main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Revert configuration.py.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-06 21:53:17 +08:00
Robin Kobus
72057a0a64
[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625)
* disable overlap in encoder

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: invokeGatherBatch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: overlap same batch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: add enableTrtOverlap to ExecutorConfig

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* disable overlap for beam search and spec decode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* skip overlap tests with beam search or speculative decoding

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* enable overlap in GptChunkedLongContextTests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Enable overlap in gptManagerBenchmark

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Improve early exit

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use OptionalRef for newOutputTokens tensor

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Add overlap scheduling support to TRTLLMDecoder

- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: allNewTokens in PP

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-06 15:06:46 +02:00
dominicshanshan
3ac6637005
fix: trtllm-serve hang in stress test and ds v3 stress parameter update (#3836)
* Remove stdout pipe for genai-perf and make stress time a public parameter.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

* Update llmRequest based on comment.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

* launch process function refactor.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

---------

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-05-06 16:52:30 +08:00
Robin Kobus
e943ad5a2a
[https://nvbugs/5247414] fix: draft/target probs shape (#4055)
Shape was wrongly changed in DecoderState introduction.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-06 09:56:43 +02:00
Yuan Tong
4b6c19737b
feat: support add internal cutlass kernels as subproject (#3658)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-05-06 11:35:07 +08:00
brb-nv
5b1aeb6730
test: Test OOB access issue in penaltyKernel for endId=-1 (#4035)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-05 10:24:28 -07:00
Mike Iovine
8caf200322
[fix] Skip debugCheckSemaphores in stream capture mode (#4032)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-05 10:24:10 -07:00
Robin Kobus
ccff86068e
fix: request termination in pipeline parallelism (#3892)
* feat: Implement synchronous request termination in batch manager

- Added `terminateRequestSync` method to `TrtEncoderModel` and `TrtGptModelInflightBatching` for handling request termination in the next `forwardSync` call.
- Updated existing request termination logic to utilize the new synchronous method, ensuring generated tokens are cleared appropriately.
- Enhanced logging for clarity in token management during request processing.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! feat: Implement synchronous request termination in batch manager

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: MockedModelCancelRequest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! feat: Implement synchronous request termination in batch manager

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: terminate with timeout

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! feat: Implement synchronous request termination in batch manager

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* docs: Update doc string for allottedTimeMs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-05 21:51:41 +08:00
Robin Kobus
9f9edd783c
refactor: Introduce MpiTag enumeration and update MPI function signatures (#3893)
* refactor: Move executor recv functions into classes

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance MPI logging and error handling

- Updated MPI logging to include destination and tag information for better traceability during send and receive operations.
- Added error checking for MPI_Wait and MPI_Cancel calls to ensure proper handling of multi-device requests.
- Improved code structure for clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Introduce MpiTag enumeration and update MPI function signatures

- Added a new header file `mpiTags.h` to define an enumeration for MPI tags, improving code readability and maintainability.
- Updated function signatures in `mpiUtils.h` and `mpiUtils.cpp` to use the new `MpiTag` type instead of raw integers for tags.
- Refactored various MPI calls across the codebase to utilize the new `MpiTag` enumeration, enhancing type safety and clarity.
- Removed redundant MPI tag constants from several classes, streamlining the code.
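The idea of replacing raw integer tags with a dedicated tag type can be sketched as follows. The actual change is in C++ (`mpiTags.h`, `mpiUtils.h`); this Python analogue is only an illustration, and the member names, values, and the `mpi_send` helper are hypothetical rather than taken from the repository.

```python
from enum import Enum


class MpiTag(Enum):
    """Hypothetical tag names and values; the real enumeration may differ."""

    DEFAULT = 0
    REQUEST_ID = 1
    REQUEST_DATA = 2


# Stand-in for the transport layer: records what would have been sent.
sent_messages = []


def mpi_send(payload, dest, tag):
    """Accept only MpiTag values so a stray integer cannot be used as a tag."""
    if not isinstance(tag, MpiTag):
        raise TypeError(f"tag must be an MpiTag, got {type(tag).__name__}")
    sent_messages.append((dest, tag.value, payload))


mpi_send(b"tokens", dest=1, tag=MpiTag.REQUEST_DATA)
```

Centralizing the tags in one type keeps their values from colliding across classes and makes a misuse fail at the call site instead of surfacing as a mismatched receive.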

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Introduce MpiTag enumeration and update MPI function signatures

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Rename tags for consistency

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-04 13:24:29 +02:00
Robin Kobus
403370af62
refactor: Move ModelSpec to core library (#3980)
* refactor: Move ModelSpec from tests to core library

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move ModelSpec from runtime to separatedir

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use new bindings path and clean up

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Updated licenses

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove script_dir from path

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-04 01:39:09 +08:00
Daniel Cámpora
c7cf032b89
fix: Move all casters to customCasters. (#3945)
* Move all casters to customCasters.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use customCasters in all bindings.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added customCasters to userbuffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-02 19:08:28 +08:00
Simeng Liu
873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438)
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.

This MR provides two implementations:
GroupRMSNormKernel: Optimized for small-to-medium batch sizes
GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes

Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
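For reference, the semantics described above can be sketched in plain Python. This is an unoptimized illustration, not the CUDA kernel: the function names are hypothetical, learned scale weights are omitted, and the `eps` default is an assumption.

```python
import math


def rms_norm_row(row, eps=1e-6):
    """Scale one vector so that its root mean square is approximately 1."""
    mean_square = sum(x * x for x in row) / len(row)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [x * scale for x in row]


def group_rms_norm_reference(inputs, eps=1e-6):
    """Normalize several 2-D inputs (lists of rows) over their last dimension.

    All inputs must share the same batch size; hidden sizes may differ.
    """
    batch = len(inputs[0])
    if any(len(tensor) != batch for tensor in inputs):
        raise ValueError("all inputs must share the same batch dimension")
    return [[rms_norm_row(row, eps) for row in tensor] for tensor in inputs]


input_a = [[3.0, 4.0], [1.0, 1.0]]            # batch 2, hidden size 2
input_b = [[1.0, 2.0, 2.0], [0.0, 0.0, 5.0]]  # batch 2, hidden size 3
out_a, out_b = group_rms_norm_reference([input_a, input_b])
```

Each output row has a mean square close to 1, regardless of the hidden size of the input it came from.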

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
Erin
8fe7bdeacf
feat: LogitsProcessor in PyTorch backend (#3145)
* support lp in pytorch backend

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* fix tp

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-01 14:15:30 -07:00
Erin
83f37614ef
feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388)
* support return logprob in llmapi

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

update and add test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

stability test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* revert removal of old flag

Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
2025-05-01 12:47:14 -04:00
YueWeng
b1621e8d4e
feat: add relaxed acceptance for DS (#3865)
* add relaxed acceptance for DS R1

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* clean and update docs

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* Modified based on review

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix mtp manager issue

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

---------

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-01 21:50:36 +08:00
hlu1
1294ecb12f
Add attention workspace memory check (#3970)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-04-30 23:51:09 -07:00
Kate Cheng
7dbe618683
feat: Add multimodal embedding field in LlmRequest (#3855)
* Add a new param to LlmRequest and Request to natively support mm

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* update comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update tests to match the new LlmRequest constructor parameters

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify unitTest and modify mm_embedding's dict name in llama4

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix LlmRequest initialization in kvCacheManagerTest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up code for prompt_tuning_config

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up prompt_tuning_config in GenerationRequest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:23:30 +08:00
Yukun He
9cc5922a0b
Clean up allreduce op in Deepseek V3 model. (#3829)
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 07:56:36 +08:00
Dom Brown
b40f351b7a
[TRTLLM-4460] test: Use Llama 3.2 1B for Llama C++ tests (#3206)
* Squash of dev commits

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Add timer + waive test with suspected GptSession bug

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Respond to reviewer comments

Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>

---------

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
2025-05-01 05:31:08 +08:00
tburt-nv
7053d0ad5a
infra: add conan (#3744)
This MR integrates Conan into the build system, so that it can be used to fetch dependencies in future changes.

Also installs all requirements-dev.txt inside a virtualenv instead of the system, since some of Conan's dependencies may conflict with the system packages. Virtualenv is used instead of venv because the triton server backend container has only virtualenv installed. This also allows developers to cache the requirements-dev.txt packages between container launches.


Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-30 11:53:14 -07:00
nv-guomingz
dd959de0fd
chore: update internal_cutlass_kernels. (#3973)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-04-30 22:13:17 +08:00
Ming Wei
ed887940d4
infra: open source XQA kernels (#3762)
Replace libtensorrt_llm_nvrtc_wrapper.so with its source code, which
consists of two parts:

1. NVRTC glue code
2. XQA kernel code

During TensorRT-LLM build, XQA kernel code is embedded as C++ arrays via
gen_cpp_header.py and passed to NVRTC for JIT compilation.
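
The embedding step described above can be pictured with a minimal Python sketch. It is not the contents of gen_cpp_header.py; the helper name `make_cpp_header`, the array name `kXqaKernelSource`, and the header layout are all assumptions made for illustration. The idea is only that kernel source text becomes a byte array in a generated header, which host code could later hand to a runtime compiler such as NVRTC.

```python
def make_cpp_header(source_text: str, array_name: str) -> str:
    """Render source text as a C++ header holding a byte array.

    Hypothetical helper for illustration; not the real gen_cpp_header.py.
    """
    # NUL-terminate so the array can also be used as a C string.
    data = source_text.encode("utf-8") + b"\x00"
    rows = []
    for i in range(0, len(data), 12):
        chunk = data[i:i + 12]
        rows.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(rows)
    return (
        "#pragma once\n"
        f"constexpr unsigned char {array_name}[] = {{\n{body}\n}};\n"
        f"constexpr unsigned long {array_name}_len = {len(data)};\n"
    )


# A stand-in kernel; the real XQA kernel sources are much larger.
kernel_src = "__global__ void noop() {}\n"
header = make_cpp_header(kernel_src, "kXqaKernelSource")
print(header)
```

Storing the source as a byte array rather than a string literal sidesteps escaping and string-literal length limits, which is one plausible reason for this layout.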

Signed-off-by: Ming Wei <2345434+ming-wei@users.noreply.github.com>
2025-04-30 18:05:15 +08:00
Bo Li
a80d2373a3
fix: [https://nvbugspro.nvidia.com/bug/5243482] If FlashMLA is used, the existence of FMHA based MLA kernels should not be checked. (#3862)
* Add mIsGenerationMLA to differentiate ctx and gen MLA in AttentionOp.
For Generation MLA, if FlashMLA is used, do not check the existence of FMHA based MLA kernel.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Run pre-commit.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Fix compile error.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

---------

Signed-off-by: Bo Li <bobboli0202@gmail.com>
2025-04-30 14:27:38 +08:00
djns99
cc989ea49f
perf: Optimise MOE prologue to use fused setup function (#3790)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-04-30 11:44:48 +08:00
Pamela Peng
f98a80f9d9
sync internal cutlass kernel changes (#3968)
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
2025-04-30 08:57:28 +08:00
xiweny
68a19a33d4
TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 (#3770)
* upgrade cutlass to 3.9

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

update latest internal_cutlass_kernels; revert cutlass version update; fix fp4 gemm for sm100

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* update internal cutlass kernels

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* fix file

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* remove unnecessary change

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* update hash

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

---------

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
2025-04-29 11:19:11 -04:00
Dom Brown
8709fe8b53
chore: bump version to 0.19.0 (#3598) (#3841)
test: add test cases for 0.19 release (#3608)

* fix test name

* add quickstart test for nemotron-ultra

* add rcca multi-node test case for deepseek-v3

* add rcca info

---------

squash (#3642)

fix: nvbugs/5187237: fix deterministic mode crash (#3448)

* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error

* remove waive

* Revert "remove waive"

This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.

* revert ar fusion

---------

update fp8 doc (#3647)

tests: change qa perf test to trtllm-bench (#3619)

fix: FP8 quantized lm_head (NvBug 5214229) (#3567)

infra: Add PR approval protection for the release branch (#3634)

fix: nvbugs/5231298: pytorch allreduce issue (#3673)

Fix: nvbugs/5222698 variable not defined (#3630)

* Fix: nvbugs/5222698 variable not defined

* Tidy code

---------

test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)

test:restore fp8 kv cache testing for L0 (#3671)

doc: Update DeepSeek perf docs (#3693)

* Update DeepSeek perf docs

* update

* Apply suggestions from code review

---------

tests: waive test_llm_multi_node (#3664)

fix: update test_user_buffers_mm_add_prologue atol (#3711)

Fix: cherry-pick hmac encryption from main branch (#3635)

* security fix cherry-pick changes from main

* fix hmac in remote mpi session (#3649)

---------

Un-waive DS-V3-Lite tests. (#3621)

fix: FP8 kv accuracy (#3675)

* fix FP8 kv accuracy

* update doc

---------

Fix script options for engines. (#3622)

unwaive multi-node test (#3721)

chore : Split more tests out of gpt tests (#3524) (#3674)

doc:add torch examples link into torch backend documentation (#3749)

test: Get Eagle tests working (#3593) (#3722)

Waive L0 test (#3756)

waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)

Update ds v3 parameters in stress test. (#3676)

waive gemma on L20 (#3766)

https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)

Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.

fix: PP4 fixes and cleanup (#3688)

remove benchmark test list (#3643)

skip disagg deepseek test if sm!=90 (#3720)

test: skip failed cases on B200 (#3710)

* add skip condition to tests

* fix error

---------

test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)

* skip_pre_ada for fp8 cases

* update

* update after rebase

---------

add known issue to deepseek doc. (#3800)

Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)

Waive L0 tests (#3826)

fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)

* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.

* Fix fused_moe fallback issue. (#3652)

min_latency_mode is only set to False during the warmup phase. When it becomes True during inference, all tactics fall back to the default one, causing a perf regression.

---------

[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)

Fix pre-commit

Fix again

Address some review comments for the MI

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-29 16:57:22 +08:00
zhhuang-nv
94e6167879
optimize cudaMemGetInfo for TllmGenFmhaRunner (#3907)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-04-29 14:17:07 +08:00
Perkz Zheng
35c5e4f1c5
feat: add CGA reduction fmha kernels on Blackwell. (#3763)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add trtllm-gen kernels for eagle3 and also kernels with cga-reduction

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* address the comments

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-04-29 10:43:54 +08:00
Jinyang Yuan
dafc28fb85
fix: Fix FMHA-based MLA in the generation phase and add MLA unit test (#3863)
2025-04-29 09:09:43 +08:00
Yukun He
5502a522d2
Fixing minor typo in allreduce kernel selection (#3912)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-04-28 23:06:49 +08:00
Chuang Zhu
e2318756ed
cacheTransceiver buffer manager (#3798)
* cacheTransceiver buffer manager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix args

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* cpp kvCacheManager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-27 11:48:15 +08:00
Dom Brown
7ff9fd345c
Test: Split C++ unit tests for CI granularity (#3868)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-04-25 13:30:58 -07:00
qixiang-99
ecd621fb0a
feat: Add head size 72 support for QKV Preprocessing kernel (#3743)
* refactor: Fix headsize 72 attention error for TRTLLM attn backend in PyTorch workflow

- Remove the head size pre-check logic in AttentionOp because head size 72 can be supported with fmha kernels.
- Added support for head size 72 in unfused attention kernels (QKVPreprocessing).
- Enhanced unit tests by introducing a scenario generation function for better test coverage of attention configurations (including head size 72).

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

* update: Waive head_dim=72 test cases and enhance test representation

- Added a waiver for head_dim=72 cases on post sm100 in the test suite to address known issues.
- Introduced a custom __repr__ method in the Scenario class for pytest substring match.
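
As a rough illustration of the two test changes above (scenario generation plus a custom `__repr__` used for substring matching), here is a minimal Python sketch. The field names, value ranges, and label format are assumptions, not the actual `Scenario` class from the test suite. With pytest, such a label could be supplied through the `ids` argument of `pytest.mark.parametrize` (for example `ids=repr`), so that it appears in the test ID.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True, repr=False)
class Scenario:
    # Hypothetical fields; the real Scenario class may carry different ones.
    num_heads: int
    head_dim: int
    dtype: str

    def __repr__(self) -> str:
        # Compact, stable label so a substring such as "head_dim:72"
        # identifies every scenario that uses that head size.
        return (f"num_heads:{self.num_heads}-head_dim:{self.head_dim}"
                f"-dtype:{self.dtype}")


def generate_scenarios():
    """Build the cross product of the (assumed) attention configurations."""
    return [
        Scenario(h, d, t)
        for h, d, t in product((8, 16), (64, 72, 128), ("float16", "bfloat16"))
    ]


def select(scenarios, needle):
    """Mimic substring-based selection of test cases by their label."""
    return [s for s in scenarios if needle in repr(s)]


scenarios = generate_scenarios()
waived = select(scenarios, "head_dim:72")
print(len(scenarios), len(waived))
```

Generating the cases from a cross product keeps coverage systematic, and the readable label is what lets a waive list target one head size without enumerating every case.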

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>

---------

Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-04-25 11:07:40 -07:00
dongxuy04
16535991b2
feat: Add MNNVL MoE A2A support (#3504)
* add MNNVL memory mapping support

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add more MPI environment for trtllm-llmapi-launch

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MoE communication and prepare kernels

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MNNVL AlltoAll support for DeepSeekV3

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add output dump for throughput benchmark

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* support dynamic kernel launch grid

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments #2

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-04-25 17:29:08 +08:00
QI JUN
991939a0f4
chore: increase A30 for cpp test (#3811)
* increase A30 for cpp test

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* enable parallel run test for gpt_executor

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* clean

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* decrease freeGpuMemoryFraction of cpp tests

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-24 16:34:39 -07:00
Shi Xiaowei
1d5178814b
Fix: Revert commit 25f9669 (#3832)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-04-24 14:03:20 +08:00
QI JUN
d0d19e81ca
chore: fix some invalid paths of contrib models (#3818)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-24 05:36:16 +08:00
Kaiyu Xie
dfbcb543ce
doc: fix path after examples migration (#3814)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-24 02:36:45 +08:00
Julien Debache
0c6c8eaffd
fix: 5197419 and removed unused runtime kernels (#3631)
- Removed kernel under test call, as it was not needed
- Removed kernel itself
- Removed kernel tests
- Removed other unused kernels and their tests
- Some static analysis clean up
2025-04-23 18:04:50 +02:00
Daniel Cámpora
1299f27c74
fix: Fix C++ decoder synchronization in PyTorch (#3106)
* Use updateDecoderBuffers in python decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix synchronize in trtllm decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Enable by default.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use guided_decoder to setup seqslots and free them.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use always decode_async and update_requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update decoder buffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix speculative decoding tests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Send new_tensors_host instead of assuming dict.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make default False in enable_trtllm_decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Partially fix mtp, partially fix py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update request states before sending disagg ctx cache.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix disagg test for torch decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add disagg serving case to guided decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Get overlap scheduling to work.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update cutlass to main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update after rebasing.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update to use decode async and update requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Properly pass information to update_requests

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make disaggregated serving a step closer to working.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase and format.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Copy new device tokens more pythonic.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Restore MTP add dummy reqs.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add ordereddict import to py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added seq slot manager. Add test.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use transmission for single tensor except when list of tensors is received.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add TRTLLMDecoder allocation to estimate max kv cache tokens.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add stream synchronization

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Format

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add decoder creation to estimate max kv.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update submodule UCXX in line with main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-04-23 23:55:27 +08:00
Shi Xiaowei
25f96697ad
fix: Intercept the error of multi-rank binding to a single card (#3525)
2025-04-23 15:50:18 +08:00
Zongfei Jing
1e5af736ea
Add smart router for moe (#3641)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-23 12:21:59 +08:00
Perkz Zheng
0324a7389d
add QMMA-based MLA kernels (#3752)
2025-04-23 10:18:19 +08:00
William Tambellini
44bff85e08
Fix double link to fp8_blockscale_gemm_src (#3707)
Fix https://github.com/NVIDIA/TensorRT-LLM/issues/3690

Signed-off-by: William Tambellini <wtambellini@sdl.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-04-23 10:16:07 +08:00
Zongfei Jing
7eee9a9d28
doc: Update doc for Deepseek min latency (#3717)
* Tidy code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Update doc for min latency deepseek

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Throw exception for RouterKernel when not running on sm90+

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-22 23:07:59 +08:00
Yukun He
0ae7017342
Unify two versions of AllReduce custom op (#3032)
* Rewrite unit test for unified allreduce op. Removing the legacy unit test.
* Revise formats, fusion_op bindings. Put all tensors as optional inputs.
* Move the MoeAllreduceOp to a separate custom op.
* Move all the fusion patterns to the new version of the AllReduce fusion kernel. Remove the AllReduce strategy config. Revise the AllReduce strategies and fusion pattern definitions.
* Add more TODOs, fixing minor bugs, and remove legacy code.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-22 21:58:42 +08:00
Robin Kobus
8340657ae4
refactor: Introduce DecoderOutputBuffers per batch (#3506)
* refactor: Restructure DecoderBuffers and DecoderStepAsyncSend

- Move communication logic from `DecoderBuffers` to `DecoderStepAsyncSend`.
- Updated `DecoderStepAsyncSend` constructor to utilize the `DecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.

These changes streamline the handling of decoding buffers and improve the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Restructure SlotDecoderBuffers and DecoderSlotAsyncSend

- Move communication logic from `SlotDecoderBuffers` to `DecoderSlotAsyncSend`.
- Updated `DecoderSlotAsyncSend` constructor to utilize the `SlotDecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.

These changes enhance the structure and readability of the batch manager's decoding process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Log DecodingMode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Introduce DecoderOutputBuffers and update related classes

- Moved buffers from `DecoderBuffers` to `DecoderOutputBuffers` to better reflect its purpose.
- Updated the `DecoderStepAsyncSend` class to utilize `DecoderOutputBuffers`, enhancing clarity in the communication logic.
- Refactored the constructor and methods in `DecoderBuffers` to accommodate the new structure, improving maintainability.
- Added Python bindings for `DecoderOutputBuffers` to ensure compatibility with existing interfaces.

These changes streamline the handling of output buffers in the decoding process, improving the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update MPI communicator handling

- Changed the `commSession` parameter type from `std::shared_ptr<mpi::MpiComm>` to `mpi::MpiComm` in `DecoderStepAsyncSend` and `DecoderSlotAsyncSend` classes for improved clarity and reduced complexity.
- Updated related methods and constructors to reflect the new parameter type, enhancing maintainability.
- Refactored the `TrtGptModelInflightBatching` class to accommodate these changes, ensuring consistent usage of `MpiComm`.

These modifications streamline the communication logic in the decoding process, improving the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Replace shared_ptr with unique_ptr for buffer management

- Updated the `TrtGptModelInflightBatching` class to use `std::unique_ptr` instead of `std::shared_ptr` for various buffer types, including `AllReduceBuffers`, `RuntimeBuffers`, `DecoderBuffers`, and `SlotDecoderBuffers`.
- This change enhances memory management and ownership semantics, reducing overhead and improving performance.

These modifications contribute to a cleaner and more efficient architecture in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-22 12:25:53 +08:00
Zheng Duan
ae48abefc1
bind block key and hasher (#3712)
2025-04-21 18:50:57 +08:00
Iman Tabrizian
af04b6f6aa
bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095)
* Fix hang bug when KV cache is low

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Review comments

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Fix attentiondp typo

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Add CI test for this case

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* fix: Fix the insertion order for responder futures

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* fix: Fix disagg CPP

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

---------

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-04-21 15:16:55 +08:00
Jinyang Yuan
bc2b01d1dd
chore: update FMHA cubin files (#3680)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-21 15:04:04 +08:00
katec846
eeb605abd6
feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380)
* Feat: Offload ptable to cpu if enable_chunk_context

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Feat: offload ptable to cpu for chunk context mode

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix and add comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update Readme for multimodal and add a new param mm_embedding_offloading

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* fix: Correct prompt table offloading condition in PromptTuningBuffers

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up the code

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Add comments to explain copy from cpu <-> gpu using pinned memory

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix namings based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix format based on precommit

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify --mm_embedding_offloading flag

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-21 14:31:01 +08:00
hlu1
31624b079a
feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387)
* Add TRT-LLM Gen MOE to Deepseek

fix fused moe rebase bug.

Fix atol in test_fp4_gemm_quantize.py

fix fused moe rebase bug.

Fix FusedMoe.

Disable 2nd routing kernel preexit

Bump routing reduction to fp32

Disable PDL for fc1

[DEBUG] Lift token limit to 16k

[Bugfix] Token limit to 16k + fp32 routing + tanh

Make fp8 tileN 8

Fix FP8 MoE + Remove redundant temp output for FP4

[FP8-only] Avoid wasting CTAs for activation kernel

fix: unblock FP8 weightloading with trtllm-gen

Remove max_token limit for trtllm-gen path

perf: avoid type-conversion and fill_ from aten

Minor fix

Signed-off-by: Hao Lu <haolu@nvidia.com>

* Fix rebase issues

Signed-off-by: Hao Lu <haolu@nvidia.com>

* Fix compile issue

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* CI clean

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Hao Lu <haolu@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-21 10:01:33 +08:00
QI JUN
d51ae53940
move the rest of the models into examples/models/core directory (#3555)
* move the rest of the models to examples/models/core directory

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* update multimodal readme

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix example path

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix cpp test

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix tensorrt test

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-19 20:48:59 -07:00
Dom Brown
dbd9a83b0d
feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582)
* feat: Integrate GPUDirect Storage (GDS) into Executor API

Squash of several dev commits

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-04-18 15:59:21 +08:00
Yuan Tong
0b0e6d8a0a
refactor: Clean up CMakeLists.txt (#3479)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-04-18 14:39:29 +08:00
Jackch-NV
1b2b112d44
fix sage attention headsize check error in bertAttentionPlugin.cpp (#3660)
Signed-off-by: Jackch-NV <69230184+Jackch-NV@users.noreply.github.com>
2025-04-18 09:28:04 +08:00