Commit Graph

514 Commits

Author SHA1 Message Date
Mike Iovine
9c0de251db
[feat] Integrate Hopper chunked attention kernels (#4330)
* Integrate chunked attention kernels

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix cache key

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix lint

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

---------

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-22 17:10:57 -04:00
Robin Kobus
e5c90883a9
fix: Move cv2 import to load_video function (#4541)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-22 17:56:07 +02:00
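The fix above defers the OpenCV dependency until a video is actually decoded. A minimal sketch of the lazy-import pattern, with illustrative names rather than the actual TensorRT-LLM code:

```python
# Lazy-import sketch (illustrative, not the actual TensorRT-LLM code):
# importing cv2 inside load_video means text- and image-only requests never
# pay the OpenCV import cost, and cv2 is only required on the video path.
def load_video(path: str, num_frames: int = 8):
    import cv2  # deferred import: only needed when a video is decoded

    capture = cv2.VideoCapture(path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
    capture.release()
    return frames
```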
QI JUN
1e5d526db4
Chore: clean up _merge_dummy_request method of PyExecutor (#4438)
* clean up _merge_dummy_request method of PyExecutor

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* clean

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* update comment

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-22 18:19:07 +08:00
Chuang Zhu
3410508020
cache_transceiver_config (#4556)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 13:59:51 +08:00
Kaiyu Xie
2898d268f9
feat: add health_generate route to openai serving (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856) (#4349)
Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Dhruv Singal <dhruvsingalabc@gmail.com>
2025-05-22 11:46:06 +08:00
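A hedged sketch of what a generation-level health route can look like in a FastAPI-based OpenAI server; `engine` here is a placeholder handle, not the actual serving code:

```python
# Illustrative sketch only: unlike a plain liveness probe, /health_generate
# checks that the engine can actually produce a token end to end.
from fastapi import FastAPI, Response

app = FastAPI()

class _DummyEngine:
    async def generate(self, prompt: str) -> str:
        return "pong"  # stands in for a real one-token generation

engine = _DummyEngine()  # placeholder for the server's LLM handle

@app.get("/health_generate")
async def health_generate() -> Response:
    try:
        await engine.generate("ping")
    except Exception:
        return Response(status_code=500)
    return Response(status_code=200)
```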
Yan Chunwei
4798d088d9
chore: Partition LlmArgs into TorchLlmArgs and TrtLlmArgs (#3823)
* partition LlmArgs

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* update backend

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-22 09:40:56 +08:00
Chuang Zhu
44cfd757b2
Agent interface impl for NIXL (#4125)
* agentConnection

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

recv

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

agentState

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

NIXL interfaces

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

update cmakelists

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

nixl improve

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

remove cppzmq

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

transferAgent remove register

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

work for cache Test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

reduce sleep time

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

integrate

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

nixl env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix rebase error

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

cpp test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

stash for send metaData

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

loadRemoteMD after fetchRemoteMD

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

workaround for mixed gen and context

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

test_env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

avoid port conflict in test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use std::string

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* typo

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix transferAgentTest

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 09:09:41 +08:00
Zongfei Jing
dbaddb3a29
Adding two-shot allreduce kernel and mnnvl multicasting buffer (#4216)
* Adding two-shot allreduce kernel and mnnvl multicasting buffer

Signed-off-by: Shiyu Li <shili@nvidia.com>

Adding comments

Signed-off-by: Shiyu Li <shili@nvidia.com>

Add unittest of the twoshot kernel.

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update dispatch logic

Signed-off-by: Shiyu Li <shili@nvidia.com>

Use cpu barrier instead of GPU at init

Signed-off-by: Shiyu Li <shili@nvidia.com>

Merge dispatch logic fix

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update the kernel to use GPU-managed buffer

Signed-off-by: Shiyu Li <shili@nvidia.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix issue

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean up

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Simplify AllReduce interface

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix warning

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Tidy code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Skip ut for no_fusion

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Shiyu Li <shili@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Shiyu Li <shili@nvidia.com>
2025-05-22 03:42:36 +08:00
dongxuy04
4018806742
feat: large-scale EP(part 3 - refactor: FusedMoe for redundant expert) (#4495)
refactor fused_moe for redundant expert

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-21 17:17:49 +08:00
WeiHaocheng
a201ce9d53
docs: update the introduction for scaffolding (#4360)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-05-21 14:54:01 +08:00
Thor Johnsen
5d438be59a
[TRTLLM-5000][feat] Pytorch implementation of ngram drafter (#3936)
* v1.5

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

v1.5.4 Add back draft_overhead to spec dec stats

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* v1.5.5: fix CI error

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.6: fix CI error 8196 > 8192

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* Address reviewer concerns

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* Address reviewer concerns

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* precommit run

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* v2.0: Address reviewer concerns

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v2.1: add fix from wili

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* Revert changes that require use of TypeAlias because that requires python version >= 3.10

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

---------

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-21 10:40:00 +08:00
Yan Chunwei
9199793848
fix: llmapi-launch add trtllm-bench test with engine building (#4091)
* add trtllm-bench mgmn test

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-21 10:18:01 +08:00
Yuxian Qiu
62c16b6d37
fix: skip weights defined in create_weights for pp. (#4447)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-21 10:13:20 +08:00
djns99
a030a898d1
perf: Fuse gemm setup function for SM90/SM100 MOE plugin path (#4146)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-05-21 10:00:36 +08:00
Zheng Duan
77a0189554
feat: conditional disaggregation in disagg server (#3974)
2025-05-21 09:57:46 +08:00
Rohan Varma
3d940e77f0
[TRTLLM-5273]feat/Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug (#4290)
* add bidirectional support and fix EarlyStopDecoder unsqueeze to be compatible with LogitsStorage

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

* run pre-commit

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

* instead of bidirectional flag use ModelConfig.is_generation

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

* fix unit test to extract logits from correct dim

Signed-off-by: Rohan Varma <rohanv@nvidia.com>

---------

Signed-off-by: Rohan Varma <rohanv@nvidia.com>
2025-05-20 10:15:36 -07:00
Robin Kobus
8564c5a41f
refactor: Unify request order in TRT and PyTorch workflow (#4096)
* chore: Partition context requests in MicroBatchScheduler

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! chore: Partition context requests in MicroBatchScheduler

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-20 18:49:27 +02:00
Daniel Cámpora
f038218f83
fix: Fix TRTLLMSampler beam width bug. (#4473)
* Fix TRTLLMSampler.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added type hint.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-20 18:00:14 +02:00
Yan Chunwei
174c5188a2
fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu (#4428)
* add test

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-20 20:16:14 +08:00
dongxuy04
21aff2e313
feat: large-scale EP(part 2: MoE Load Balancer - core utilities) (#4384)
* first commit of cpp moe loadbalance code

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python bindings for moe load balance

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python wrapper, ut and bug fixes

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add binding for layerId and update binding test

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add host tensor sharing and ut

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-20 17:53:48 +08:00
Zhanrui Sun
f2c0565577
chore: bump version to 0.21.0rc0 (#4465)
* chore: bump version to 0.21.0rc0

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Update CODEOWNERS

Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-05-20 12:19:50 +08:00
Lucas Liebenwein
de409e8468
[AutoDeploy] HF factory improvements (#4371)
* [AutoDeploy] HF factory improvements

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* improve monkey-patches and add unit tests

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-19 20:13:43 -07:00
kanghui0204
6f3922f318
feat: Low Precision Allreduce for PCIe based GPU (#4344)
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process.

Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-20 06:53:46 +08:00
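The quantize-transfer-dequantize idea behind this PR can be sketched with plain torch.distributed ops. This is a conceptual illustration of why shrinking the wire format helps on PCIe, not the kernel added here:

```python
# Conceptual sketch of a low-precision allreduce (not the actual CUDA kernel):
# ranks agree on one scale, quantize to 8-bit-range integers, reduce, then
# dequantize. The real kernel moves an 8-bit payload over PCIe; int32 is used
# here only to keep this toy version overflow-safe. Assumes
# dist.init_process_group() has already been called.
import torch
import torch.distributed as dist

def low_precision_allreduce(x: torch.Tensor) -> torch.Tensor:
    amax = x.abs().max()
    dist.all_reduce(amax, op=dist.ReduceOp.MAX)   # shared scale across ranks
    scale = (amax / 127.0).clamp(min=1e-8)
    q = torch.round(x / scale).to(torch.int32)    # values in the 8-bit range
    dist.all_reduce(q)                            # the bandwidth-bound transfer
    return q.to(x.dtype) * scale                  # dequantize the reduced sum
```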
Yuxian Qiu
c8e062bfd3
fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-19 14:25:36 -07:00
Venky
bb02d86b54
test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) (#4128)
* changes to run llama-v3.3-nemotron-super-49b

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>

* yapf

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>

* address review comments pt 1

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>

* re-add cpp super tests 

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

---------

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-05-19 12:00:48 -07:00
liji-nv
58e405624a
[https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 (#3952)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-19 22:12:25 +08:00
Iman Tabrizian
c6074c47da
Add llama4 disagg accuracy tests (#4336)
* Add llama4 disagg accuracy tests

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* Make it async and add GSM8K benchmark

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

---------

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-19 21:55:08 +08:00
Yukun He
98018f3bb9
Downgrade the logger level for fallback tactic warning. (#4440)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-19 18:26:54 +08:00
Zhenhuan Chen
e70a205dab
[TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split (#4337)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-05-19 17:53:41 +08:00
Yuxian Qiu
cf6cd940e5
feat: Add pp support for hybrid attn/mamba model (#4358)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-19 14:47:45 +08:00
Yan Chunwei
5b1c88de8d
chore: cleanup perf_evaluator code (#3833)
* chore: cleanup perf_evaluator code

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* up

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-19 13:21:36 +08:00
Pengyun Lin
039f7e3118
[https://nvbugspro.nvidia.com/bug/5243740][fix] deduce default max_tokens for trtllm-serve (#4265)
* Deduce default max_tokens for trtllm-serve

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

* Improve executor_config.max_seq_len assignment in TRT workflow

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

* Enhance error message

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

* Add deduced max_tokens test

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

---------

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-19 00:34:40 +08:00
shaharmor98
27afcb9928
add changes for fp8, nemotron-nas, API (#4180)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-18 23:27:25 +08:00
Kaiyu Xie
3e08cd231c
fix: Remove real size allocation (#4396)
Remove real size allocation

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-18 19:13:22 +08:00
rakib-hasan
49f993d862
Removing the outdated argument (#4408)
removing the outdated argument

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-05-18 15:52:15 +08:00
Jinyang Yuan
b618e1f55b
perf: Eliminate the need for attention DP padding when possible (#3439)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-05-17 13:30:55 +08:00
Lucas Liebenwein
7c85890ec7
[AutoDeploy] eager pattern matcher new pattern (#4370)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 12:35:44 -04:00
Lucas Liebenwein
0e872ef0b0
[AutoDeploy] fix: proper process group clean up (#4373)
[AutoDeploy] proper process group clean up

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 12:35:25 -04:00
Netanel Haber
9cd8148f28
API Breaking Change + Readability: "decoder"->"sampler" (#4121)
* *decoder*->*sampler*; new_tensors_device: dict[str, torch.Tensor] -> device: SampleStateTensors

* **Breaking Change**, as it changes public interfaces, main changes:
* PyTorchConfig [consumed via LLM(pytorch_backend_config)]: Configuration parameters mixed_decoder and enable_trtllm_decoder -> sampler.
* Command-line argument --enable_trtllm_decoder becomes --enable_trtllm_sampler in examples/pytorch/quickstart_advanced.py.

---------

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-05-16 23:52:25 +08:00
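Per the commit message, the flag rename is the user-visible part; the invocation below is an illustrative migration sketch, not code from the PR:

```python
# Migration sketch for the decoder -> sampler rename. Only the flag rename
# (--enable_trtllm_decoder -> --enable_trtllm_sampler) is stated verbatim in
# the commit message; the rest is illustrative.
import subprocess

# Before this change:
#   python examples/pytorch/quickstart_advanced.py --enable_trtllm_decoder
# After this change:
subprocess.run(
    ["python", "examples/pytorch/quickstart_advanced.py",
     "--enable_trtllm_sampler"],
    check=True,
)
```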
ixlmar
13b61405e8
fix: improve PyExecutor resource allocations (#4299)
chore: restore symmetry of worker start/shutdown
chore: fix return type of cal_max_tokens
chore: type some more return values
fix: free resources before re-claiming

Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-16 16:28:10 +01:00
Tracin
7b19acfab1
fix: Fix chat template kwargs bug. (#4387)
* Fix chat template kwargs bug.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Fix chat template kwargs bug.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Fix chat template kwargs bug.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

---------

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 23:07:46 +08:00
Lucas Liebenwein
8e4320ede5
[AutoDeploy] configurable cache resize (#4372)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 10:07:09 -04:00
Robin Kobus
4e370a509a
refactor: Copy sequence lengths once in decoder setup (#4102)
* refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update DecoderInputBuffers to remove duplicated buffers

- Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability.
- Adjusted references in generateRequestOptions.cpp to align with the new buffer structure.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move getEmbeddingBias to anonymous namespace

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Filter context requests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: GenerateRequestOptions using more fine-grained functions

- Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests.
- Updated the `operator()` method to utilize the new method, improving code clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMDecoder

- Updated the `generate_request_options` call.
- Updated the `make_decoding_batch_input_output` call.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove const where we modify input buffers

- Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications.
- Updated related function calls to ensure compatibility with the new parameter types.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-16 22:03:55 +08:00
Fridah-nv
bce281d592
feat: [AutoDeploy] update rope matcher with minor variants (Deepseek) (#3638)
* add docstring to summarize current rope support

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor: replace call_method, adjust inserting order of cos_sin_cache calculation node

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* add unit test for triton rope and ds rope

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* update rope matcher to match DS RoPE, add custom op for reference, add unit test case

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* cache cos[pos_idx].unsqueeze and sin[pos_idxs].unsqueeze

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor doc update

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* separate pattern matching and optimization for explicit and complex rope + minor updates

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* clean rope impl in repo

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* replace fused_flattened_mla_with_cache's rope impl with torch_apply_rope_with_qk_interleaving, update unit test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* separate layout infer and transpose to a new transformation

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* update rope_with_explicit_freqs and rope_with_input_interleaved to expose unsqueeze_dim and support match_rope_layout, add unit tests

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* solve merge conflict in transform.py, need to fix optimize_rope with cuda graph capture

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor clean up after rebase

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

* fix pre-commit

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* support map to bnsd layout and infer unsqueeze_dim from op

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix cos/sin not the same across prompts in the same batch issue when mapping to flashinfer op

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix for unit test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix custom op input/output node ordering issue for DeepSeek V3 rope

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* clean code

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* move flattening of cos_sin_cache to the graph, update flashinfer op docstring and test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* debug transform unit test failure

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-05-16 09:55:32 -04:00
NVJiangShao
a6f2a1e918
Fix test_fused_moe_w4afp8 (#4393)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-05-16 17:21:33 +08:00
Daniel Cámpora
df19430629
chore: Mass Integration 0.19 (#4255)
* fix: Fix/fused moe 0.19 (#3799)

* fix bug of stream init

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix: Add pre-download of checkpoint before benchmark. (#3772)

* Add pre-download of checkpoint before benchmark.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add missing remote code flag.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Move from_pretrained to throughput benchmark.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Move download and use snapshot_download.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Removed trusted flag.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix benchmark command in iteration log test.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* [https://nvbugspro.nvidia.com/bug/5241495][fix] CUDA Graph padding with overlap scheduler (#3839)

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fuse

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* TRTLLM-4875 feat: Add version switcher to doc (#3871)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* waive a test (#3897)

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* docs:fix https://nvbugs/5244616 by removing new invalid links. (#3939)

Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>

* fix: remote mpi session abort (#3884)

* fix remote mpi session

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* skip fp8 gemm for pre-hopper (#3931)

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

* [https://nvbugspro.nvidia.com/bug/5247148][fix] Attention DP with overlap scheduler (#3975)

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update multigpu list

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix namings

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* Doc: Fix H200 DeepSeek R1 perf doc (#4006)

* fix doc

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

* update perf number

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

* Fix the perf regression caused by insufficient cache warmup. (#4042)

Force tuning up to 8192 sequence length for NVFP4 linear op. Also, make this runtime-selectable with UB enabled.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* doc: Update 0.19.0 release notes (#3976)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* Optimize the AutoTuner cache access code to reduce host code overhead. (#4060)

The NVFP4 Linear op is very sensitive to the host overhead.
This PR introduces customizable `find_nearest_profile` and `get_cache_key_specifc`, which allow users to override the default method for generating the cache key (a sketch of the idea follows this entry).

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Update switcher (#4098)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* doc: update release notes (#4108)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* docs:update 0.19 doc. (#4120)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

* docs:add torch flow supported model list. (#4129)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

* doc: Release V0.19 Perf Overview Update (#4166)

Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>

* Fix readme of autodeploy.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update tensorrt_llm/_torch/pyexecutor/llm_request.py

Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Revert mgmn worker node.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Change to disable_overlap_scheduler.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Zac Patel <22306219+zbpatel@users.noreply.github.com>
2025-05-16 10:53:25 +02:00
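Following up on the AutoTuner change folded into the mass-integration entry above: a hedged sketch of the cache-key customization idea. All names and signatures are assumptions for illustration, not the real tensorrt_llm AutoTuner API:

```python
# Hedged sketch: a coarse profile-bucketing hook keeps cache-key generation
# cheap on the host-overhead-sensitive hot path. Names/signatures assumed.
class FastCacheKeyOp:
    def __init__(self, bucket: int = 256):
        self.bucket = bucket

    def find_nearest_profile(self, seq_len: int) -> int:
        # Round up to a coarse bucket so nearby shapes share one cache entry
        # instead of triggering re-tuning.
        return -(-seq_len // self.bucket) * self.bucket

    def get_cache_key(self, op_name: str, seq_len: int) -> tuple:
        # A small hashable tuple: no string formatting on the hot path.
        return (op_name, self.find_nearest_profile(seq_len))
```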
Tracin
46c5a56444
Support dynamic per-tensor FP8 (#4250)
* Support dynamic per-tensor FP8

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Update test cases.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

---------

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 13:33:58 +08:00
WeiHaocheng
54d28718c7
feat: support benchmark on scaffolding (#3328) (#4286)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-05-16 12:28:49 +08:00
yuxianq
a1daa22970
doc: Add docstring for Attention and MLA module. (#4354)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-16 09:37:04 +08:00
Suyog Gupta
b0f7522c82
[AutoDeploy]feat: Add an AutoDeploy compile backend that only calls torch.compile (#4240)
* add a torch-compile backend

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* readme changes

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* plumb torch-compile through build_and_run_ad.py

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* plumb torch-compile through build_and_run_ad.py

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* plumb torch-compile through build_and_run_ad.py

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* add torch-cudagraph backend

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* update readme

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* update readme

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* further enhanced compiler backends

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* further enhance readme

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* better specified defaults in simple_config.py

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* fix typo in simple_config.py

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* updated deepseek-v3 support

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* revert accidental deletion in AD Readme

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 08:38:15 +08:00
rakib-hasan
25407249a5
[TRTLLM-5054][fix] Removing repeated loading of input processor (#4161)
removing repeated loading of input processor

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-05-16 08:04:58 +08:00
Lucas Liebenwein
4883121477
[AutoDeploy] fix: disable overlap scheduler until supported (#4365)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-15 16:19:30 -07:00
Yechan Kim
c6e2111f4e
feat: enhance trtllm serve multimodal (#3757)
* feat: enhance trtllm serve multimodal

1. made the load_image and load_video asynchronous
2. add image_encoded input support to be compatible with genai-perf
3. support text-only input on multimodal models (currently Qwen2-VL & Qwen2.5-VL); see the async-loading sketch after this entry

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* add test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix bandit

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* trimming utils

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* trimming for test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* genai perf command fix

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* command fix

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* refactor chat_utils

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* stress test genai-perf command

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

---------

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-05-15 16:16:31 -07:00
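A hedged sketch of the async media-loading idea from this entry: blocking decode work is moved off the event loop, so the server keeps accepting requests. Names are illustrative, not the shipped serving code:

```python
# Illustrative only: asyncio.to_thread pushes blocking download/decode work
# to a worker thread, making load_image awaitable as described above.
import asyncio
from io import BytesIO

import requests
from PIL import Image

def _load_image_sync(url: str) -> Image.Image:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")

async def load_image(url: str) -> Image.Image:
    return await asyncio.to_thread(_load_image_sync, url)
```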
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
Venky
5ebe32f06f
enh: Enable option in trtllm-bench build subcommand to avoid loading weights (#4142)
* expose load_format 

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

* yapf

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

---------

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
2025-05-16 03:50:53 +08:00
yuxianq
0e87fcc228
refactor: use x is None instead of x == None. (#4244)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-15 20:00:04 +08:00
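The distinction this refactor enforces, in two lines: `== None` dispatches to `__eq__`, which array-like objects override, while `is None` is an unambiguous identity check (and what PEP 8 prescribes):

```python
import numpy as np

x = np.array([1, 2, 3])
print(x == None)  # elementwise __eq__: array([False, False, False])
print(x is None)  # identity check: False
# "if x == None:" would even raise ValueError here, since the truth value
# of a multi-element array is ambiguous.
```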
ixlmar
4ee82fc0fd
chore: reduce code duplication (#4297)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-15 09:25:37 +01:00
Zongfei Jing
f0ca60a95d
Add allreduce and rmsnorm fusion for qwen3 (#4304)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-05-15 16:22:11 +08:00
zhhuang-nv
97bc680cd8
feat: support kv cache reuse for MLA (#3571)
* support kv cache reuse for MLA

load compressed_kv and k_pe and do up-projection
use 192/128 head size MLA context kernel
support Blackwell and Hopper now (see the sketch after this entry)

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* add CI test

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* use GPTJ style RoPE for MLA

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix rebase error and some docs

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix kv_lens

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* tiny fix

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: use normal device memory instead of pinned memory for unit test

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* fix L0 tests

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile after rebase

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments again

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-05-15 15:22:21 +08:00
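A conceptual sketch of the reuse path this entry describes, with illustrative shapes: the cache keeps the compressed latent plus the shared rope part, and K/V are recovered by up-projection instead of being recomputed from hidden states:

```python
# Conceptual MLA reuse sketch (names and shapes illustrative). Recovering K/V
# from the cached latent also shows where the 192/128 head sizes come from.
import torch

num_tokens, kv_lora_rank, rope_dim = 16, 512, 64
num_heads, qk_nope_dim, v_head_dim = 128, 128, 128

c_kv = torch.randn(num_tokens, kv_lora_rank)               # cached compressed_kv
k_pe = torch.randn(num_tokens, rope_dim)                   # cached rope part
w_uk = torch.randn(kv_lora_rank, num_heads * qk_nope_dim)  # up-projection weights
w_uv = torch.randn(kv_lora_rank, num_heads * v_head_dim)

k_nope = (c_kv @ w_uk).view(num_tokens, num_heads, qk_nope_dim)
v = (c_kv @ w_uv).view(num_tokens, num_heads, v_head_dim)
# Per-head K concatenates the up-projected part with the shared rope part,
# giving the 192 (= 128 + 64) / 128 head sizes the context kernel expects.
k = torch.cat([k_nope, k_pe.unsqueeze(1).expand(-1, num_heads, -1)], dim=-1)
```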
Kaiyu Xie
b4e5df0ee0
Breaking change: perf: Enable scheduling overlap by default (#4174)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-15 14:27:36 +08:00
Fridah-nv
d008d6412f
feat:[AutoDeploy] Update MoE pattern matcher to drop expert selection logic (#3283)
* update matcher to match expert compute first, then extract other args with LCA

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* support 3D and 2D input in torch.ops.moe.trtllm_fused_moe

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* update custom ops to support 3D and 2D inputs

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

* update deepseek patch

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-05-15 13:53:09 +08:00
nv-guomingz
e76cf9d9fe
fix:https://nvbugs/5234033 enable starcoder trt-flow with transformers 4.51.3 (#3909)
fix:https://nvbugs/5234033 enable starcoder trt-flow with transformers 4.51.3.

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-15 11:16:45 +08:00
Zeyu WANG
2681b26e48
[TRTLLM-2795] feat: Add yarn support for other models in trt-flow (#3840)
Add yarn support for general models (e.g. llama, qwen) other than deepseek in trt-flow.

Signed-off-by: Zeyu Wang <zeyuw@nvidia.com>
2025-05-15 11:03:57 +08:00
Mike Iovine
f9adac3dea
[feat] Enable chunked context for flashinfer (#4132)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-15 10:59:38 +08:00
QI JUN
498ce8a056
Revert "feat: Low Precision Allreduce for PCIe based GPU" (#4340)
Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)"

This reverts commit 5e634dd1bd.
2025-05-15 09:52:39 +08:00
sugunav14
7c828d767f
feat: [AutoDeploy] DSV3 mla attn ref op (#4272)
* raw ref op + new patch untested

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* Added mla attn ref op and unit tests for attn + module patches

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* update stray changes in deepseek.py

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* Updated stale documentation

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* removed stray update in sdpa return shapes

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

---------

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-05-15 01:58:20 +08:00
HuiGao-NV
f4059c6e2e
Add test case for kv memory estimation (#4158)
* Add test case for kv memory estimation
* Dump running log into file and parse kv cache memory size from file
* Set bigger peak memory size for mixed precision case and test_ptp_quickstart_advanced_eagle3 case
* Revert change to usage of fraction
* use context manager to guard temp files

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-14 18:39:25 +08:00
kanghui0204
5e634dd1bd
feat: Low Precision Allreduce for PCIe based GPU (#3851)
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process.

Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-14 16:45:43 +08:00
Barry Kang
20b42912ce
[TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper (#4123)
Support DeepSeek-R1 W4A8 on Hopper

Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-05-14 15:48:07 +08:00
brb-nv
8280c3d4f2
feat: Support Gemma3-1b-it in Pytorch workflow (#3999)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-14 14:02:44 +08:00
Fridah-nv
21dbd163a7
[TRTLLM-5188] fix: [AutoDeploy] unwaive AD build test (#4273)
* unwaive small build test

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

* unwaive mutigpu/integration tests

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

* fix for torch.compile+flashinfer attention

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
2025-05-14 10:40:12 +08:00
Zhanrui Sun
23b9705bf4
chore: bump version to 0.20.0rc3 (#4261)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-05-14 10:15:25 +08:00
Anurag Mukkara
b0a03a289c
fix: Merge PP overlap and non-overlap executor loop (#3878)
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-05-14 06:04:36 +08:00
brb-nv
cd5b3d21a0
feat: Support Mistral Small 3.1 24B VLM in TRT workflow (#4183)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-14 03:47:22 +08:00
Frank
c0c3c7f68c
[TRTLLM-5233][feat]: Add chunking to PyT heuristic for trtllm-bench. (#4133)
* Add chunking to PyT heuristic.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Cast tokens and batch size to ints.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-05-13 21:47:06 +08:00
Yukun He
cbca6505ff
[nvbugs/5268808][fix] Fix the list-out-of-range access issue of AllReduce workspace on multi-node. (#4159)
This issue was found for tp=ep=8 on multi-node machines due to inconsistent PP sizes.
* Reform the workspace allocation implementation to avoid the list-out-of-range issues.
* Disable min_latency_mode under the multi-node scenario to avoid the illegal memory access issue.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-13 17:17:25 +08:00
Perkz Zheng
e8d7834c50
fix: [https://nvbugspro.nvidia.com/bug/5238626] illegal memory address when running llama 4 with cuda graph enabled (#4101)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-13 14:58:54 +08:00
v-shobhit
1770dd96d8
Fix Pipeline Parallelism in Llama4 (#4106)
Signed-off-by: Shobhit Verma <shobhitv@nvidia.com>
2025-05-12 22:54:37 -07:00
nvpohanh
13c8e5a8a8
feat: Prefetch safetensors files before loading them (#4140)
Prefetching safetensors files so that they are stored in the system file
cache. This significantly speeds up the model weight loading for the
very first run after entering the docker container.

This is beneficial because model weight loading is done layer by layer,
which means reading the safetensors files chunk by chunk; that cannot
utilize the network bandwidth very well when these files are stored on
network drives. Loading the whole files in bulk instead achieves higher
bandwidth utilization.

When running with world_size>1, all ranks collaboratively prefetch these
files.

In theory, we should add heuristics to decide whether to prefetch the
files or not, but that is beyond the scope of this commit.

For example, when CPU memory is small, prefetching may cause file cache
thrashing and end up slowing down weight loading.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-13 13:35:30 +08:00
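A hedged sketch of the prefetch idea (illustrative, not the shipped implementation): each rank bulk-reads a disjoint slice of the checkpoint's safetensors files so they land in the OS page cache before layer-by-layer loading starts:

```python
# Illustrative prefetch: sequential bulk reads warm the system file cache,
# and ranks split the file list so the work is shared.
import glob
import os

def prefetch_safetensors(ckpt_dir: str, rank: int, world_size: int,
                         chunk_bytes: int = 64 << 20) -> None:
    files = sorted(glob.glob(os.path.join(ckpt_dir, "*.safetensors")))
    for path in files[rank::world_size]:  # each rank takes a disjoint slice
        with open(path, "rb") as f:
            while f.read(chunk_bytes):    # read in bulk; data stays in cache
                pass
```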
pcastonguay
9643be5f20
[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156)
* feat: Add per-request stats support with PyT backend

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing stats unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing test with overlap

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-12 21:35:15 -04:00
Simeng Liu
286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

The selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained on nsys kernel runtime data.
The default kernel selection is the base kernel.

The Python operator group_rms_norm uses the heuristic by default;
users can also explicitly select the base or large-batch kernels.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
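A hedged sketch of how such a logistic-regression selector can work; the features match the description above, but the coefficients are made up for illustration (the real models were fit offline on nsys data per compute capability):

```python
# Illustrative kernel selector: a logistic regression over batch size and the
# warp-allocation ratio yields P(large-batch kernel is faster). Weights here
# are placeholders, not the trained coefficients.
import math

def pick_group_rmsnorm_kernel(batch_size: int, warps_base: int,
                              warps_large_batch: int) -> str:
    w0, w_batch, w_ratio = -2.0, 0.01, 1.5  # placeholder coefficients
    ratio = warps_large_batch / max(warps_base, 1)
    z = w0 + w_batch * batch_size + w_ratio * ratio
    p = 1.0 / (1.0 + math.exp(-z))          # probability large-batch wins
    return "GroupRMSNormLargeBatch" if p > 0.5 else "GroupRMSNormBaseKernel"
```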
Erin
4becf32360
fix: reshape token_ids for lp in torch backend (#4239)
reshape token_ids

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-13 08:43:47 +08:00
wili
eba3623a54
Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979)
* feat/vbws-part4-v1.8: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* feat/vbws-part4-v1.9: fix incorrect output when using short output length

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.1: remove useless variables

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.2:fix incorrect output when using short output length

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.3: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.4: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.5: remove API change

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

---------

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-12 22:32:29 +02:00
yuxianq
a4c3359513
fix: Reset planned states to avoid memory leak in TrtllmAttentionWrapper (#4227)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-12 23:25:54 +08:00
Fridah-nv
3dbb087292
[TRTLLM-5188] fix: [AutoDeploy] update output shape of prepare_fused_mha_metadata_fake (#4199)
* update output shape of fake kernel prepare_fused_mha_metadata_fake

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-05-12 11:11:40 -04:00
yuxianq
b35f9a67f9
refactor: Allow models to override apply_qk_norm. (#4078)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-12 19:38:24 +08:00
Zheng Duan
c9e2a963e0
feat: add kv cache aware router (#3831)
* kv cache aware router

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* add tests

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* router config

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* eviction test

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

add test

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* eviction detect in worker test

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* move worker tests to single gpu

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* reduce memory fraction

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

* fix partial block

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>

---------

Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-05-12 07:23:57 -04:00
Yixin Dong
c90ebadd84
feat: Support the Structural Tag in guided decoding (#4066)
* finish

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* exc overlap scheduler

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add test

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix api ref

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Ubospica <ubospica@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-12 17:24:50 +08:00
Zhenhuan Chen
9212e9a740
[TRTLLM-4911] feat(scaffolding): make sampling_params only settable by controller (#4151)
feat(scaffolding): make sampling_params only settable by controller

Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-05-12 15:29:09 +08:00
Chuang Zhu
1333f4f5d5
remove cache_transceiver_prealloc_size (#4153)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-12 11:53:53 +08:00
Frank
0dcf47f1c2
[TRTLLM-4717][perf] Set CUDA graph max batch size and padding in throughput benchmark. (#3875)
* Set cuda graph max batch size.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Set padding.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-05-09 23:20:52 +08:00
Mike Iovine
4b8ba7ad61
[fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue (#4069)
[fix] Fix llama 4 test lists

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-09 22:45:14 +08:00
chenfeiz0326
ffc13bd325
Cherry-pick: Use multi-threading to load MoE expert weights (#4137)
* Use multi-threading to load MoE expert weights

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>

* Update code formatting

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Update code formatting

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

---------

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-09 17:29:24 +08:00
WeiHaocheng
0f01826dde
feat: support task collection to collect information (#3328) (#3824)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-05-09 17:09:01 +08:00
Fanrong Li
0cf0fce5d3
[fix] Fix add_dummy_requests for spec decoding cases (#4084)
* fix add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* add max_seq_len to eagle3 test and fix add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix prompt_len in add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* add prepare_resource condition in add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* add some description of token_nums to add_dummy_requests and fix token_nums in torch compile warmup.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix available_tokens.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-09 16:52:51 +08:00
Fanrong Li
77f8e43592
[fix] Fix relaxed acceptance to support enabling it in context phase (#4126)
* fix relaxed acceptance to support enable this feature in context phase.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix sample_and_accept_draft_tokens unit test.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-09 14:11:14 +08:00
Yukun He
c9cac432dc
chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse (#4169)
Fix import break caused by rebase.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-09 12:51:02 +08:00
Mike Iovine
d80dc40135
[nvbug/5262268][fix] Fix trtllm-bench for llama 4 (#4104)
[fix] Fix trtllm-bench for llama 4

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
2025-05-08 21:27:57 -07:00
Erin
cdf5ae1547
fix: change pp broadcast pattern for LPs (#4130)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-08 20:07:13 -07:00
Yi Zhang
91bf5e6a8e
[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804)
Add Piecewise CUDA Graph Support

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-05-09 11:04:01 +08:00