shaharmor98
ee4aab72ec
feat: Support PeftCacheManager in Torch ( #3186 )
...
* Add PeftCacheManager implementation
Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-04 12:38:08 +08:00
Tracin
bb6c338730
AWQ support Modelopt ckpts. ( #3258 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-04 08:10:35 +08:00
pcastonguay
b763051ba4
chore: Refactor disaggregated serving scripts ( #3073 )
...
* chore: Refactor to reduce duplicated code in disagg server, reuse trtllm-serve
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Updating README, removing launch script
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing integration tests
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Adding scripts to populate urls section of disagg config based on SLURM env vars
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-03 14:55:05 -04:00
Fanrong Li
1fe64b90be
fix: fix the acceptance rate of pytorch workflow in trtllm-bench ( #3240 )
...
* fix acceptance rate of pytorch workflow.
* revert the RequestOutput API change.
---------
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-03 15:12:24 +08:00
Frank
2d80db4c36
chore: Remove build config from Pytorch kwargs. ( #3210 )
...
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-03 15:00:29 +08:00
Zongfei Jing
dcc0ebd273
Fix warning ( #3254 )
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-03 13:30:23 +08:00
Jinyang Yuan
2fdfa39ea8
fix: Fix an error related to dummy request when MTP is used ( #3146 )
2025-04-03 11:08:12 +08:00
Anurag Mukkara
d998339855
Raise error for PP + MTP ( #3244 )
...
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-04-03 04:45:31 +08:00
QI JUN
abcb0486dc
fix deepseek failure with pipeline parallelism ( #3225 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-02 22:56:39 +08:00
Enwei Zhu
3cf7066350
test: Accuracy test improvement (Part 3.2): Move Qwen tests (NvBug 5135332) ( #3219 )
...
* remove test_llm_models_multi_gpu.py
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* qwen 2.5
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* upgrade
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-02 17:29:57 +08:00
QI JUN
bb10cdcfb8
chore: refine fetch new requests method ( #3213 )
...
* refine broadcast new requests method
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* refine fetch new requests method
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---------
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-02 10:46:00 +08:00
Zheng Duan
35b828ca2d
fix streaming in dist-serving ( #3087 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-04-02 10:08:07 +08:00
Zongfei Jing
c7548ad72c
perf: Add optimizations for deepseek in min latency mode ( #3093 )
...
* Add optimizations for deepseek min latency
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Fix compile error
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Update internal cutlass kernel libs
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Format code
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Resolve conflicts
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
---------
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-02 09:05:24 +08:00
brb-nv
1fe3e30356
Add support for Phi-4-mini ( #2990 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-02 08:34:39 +08:00
Zhanrui Sun
42963baacd
chore: bump version to 0.19.0.dev2025040800 ( #3171 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-02 08:21:55 +08:00
QI JUN
8fe2e5865e
refine broadcast new requests method ( #3198 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-02 08:05:20 +08:00
Enwei Zhu
b2f69db507
test: Accuracy test improvement (Part 3.1): Extend accuracy test suite with LLM API and initial implementation of trtllm-eval ( #3167 )
...
* add eval_llmapi
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
tmp commit
port to CLI tool
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
move
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
setup llmapi
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
fix spec_dec_algo
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
_update_from_hf_quant_config
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
migrate test_pytorch.py
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
fix fp8 block scales
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
fix fp8 rowwise
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
adj alpha
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
move test_pytorch.py cases
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
move
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
rename test_accuracy.py to test_cli.py
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
clean
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix cnn_dailymail
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* renaming to cli flow
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* rename MMLU
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* rename
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* add error
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-01 22:20:29 +08:00
amirkl94
bf02b9144f
feature: Add LoRA support for gemma ( #3068 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-04-01 19:15:55 +08:00
WeiHaocheng
ff35af77ea
feat: refactor scaffolding worker and support openai api worker ( #3166 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-01 18:31:52 +08:00
Jinyang Yuan
992d513bc6
feat: Optionally split MoE inputs into chunks to reduce GPU memory usage ( #3104 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-04-01 16:07:02 +08:00
brb-nv
727d78e785
Support prequantized fp8 ckpt for nemotron-mini-4b-instruct ( #3046 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-01 14:52:09 +08:00
dongjiyingdjy
22ff81b047
fix: fix illegal memory access when mtp >= 2 ( #3006 )
...
* fix - fix illegal memory access when mtp > 2
---------
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-01 13:36:45 +08:00
Shunkangz
dda7354d1a
Refactor return of first gen token in PD ( #2986 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-04-01 12:28:27 +08:00
jiahanc
c4ee14e43a
fix: Reverse cuda graph size order ( #3116 )
...
Signed-off-by: jiahanc <jiahanc@nvidia.com>
2025-04-01 11:28:36 +08:00
Aurelien Chartier
14e194433c
chore: cleanup py_executor code ( #3132 )
...
* chore: cleanup py_executor code
* Add common loop cleanup function
* Remove checks for attention DP if nothing to queue
* Remove extra return statements
* Remove extra variables
* Remove commented debug print
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
* rename cleanup function
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
---------
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-04-01 09:27:04 +08:00
Anurag Mukkara
435cd2983d
perf: Optimisations for PP + attention DP ( #3134 )
...
* Minor tp_rank fix
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Delete unused function
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* PP broadcast for ADP new requests
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Sync request finish point for intermediate and last pp ranks
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Use local PP layers only for KV cache estimation
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
---------
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-04-01 08:59:16 +08:00
Frank
8bb3eea285
perf: Re-add iteration logging for trtllm-bench. ( #3039 )
...
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-01 08:13:09 +08:00
WeiHaocheng
f665f83256
feat: improve scaffolding shutdown process ( #3084 )
2025-03-31 20:39:20 +08:00
Zhanrui Sun
36ac5e78ed
chore: bump version to 0.19.0.dev2025040100 ( #3152 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-03-31 16:36:06 +08:00
Quanfeng Li
839aad4d6e
fix: Add missing parameter for WeightOnlyQuantRowLinear module ( #2768 )
...
Signed-off-by: Quanfeng Li <liquanfeng7@foxmail.com>
2025-03-31 16:20:30 +08:00
QI JUN
9560fcd5ec
Chore: waive tests and fix multi-GPU tests ( #3157 )
...
* waive tests
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* update
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* clean up
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---------
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-03-31 16:05:45 +08:00
liji-nv
e0d0dde058
None - Add one-shot version for UB AR NORM FP16/BF16 ( #2995 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-03-31 11:16:03 +08:00
Yan Chunwei
794f61c997
fix: fix single-node cannot quit issue on slurm ( #3140 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-03-31 10:15:27 +08:00
Mike Iovine
5416966ddb
Add initial EAGLE-3 implementation ( #3035 )
...
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-03-29 22:31:24 +08:00
Erin
c75d7cd684
move BuildConfig functional args to llmargs ( #3036 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-03-29 02:20:18 +08:00
Aurelien Chartier
3de82c41cd
Pytorch PP + attention DP support ( #3044 )
...
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-03-28 00:11:19 +08:00
Fanrong Li
ec03159e60
fix: Waive twoshot to fix acc issue ( #3066 )
...
* waive twoshot to fix acc issue
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
---------
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-03-27 21:38:52 +08:00
Yan Chunwei
87ab794aa2
fix: fix hang in mgmn with trtllm-llmapi-launch command ( #3119 )
...
* init
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* restore
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-03-27 18:45:43 +08:00
Fanrong Li
0976360204
add support for MTP+cuda_graph_padding. ( #3096 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-03-27 16:06:14 +08:00
Yan Chunwei
82edd90350
fix gpus_per_node in trtllm-bench when world_size < device_count ( #3007 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-03-27 09:31:40 +08:00
Suyog Gupta
047f2b234d
perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench ( #3041 )
...
* Enable AutoDeploy as a backend in trtllm-bench
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* update how caches are resized
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* fix: files permission from 100755 to 100644
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* some comments
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* lint
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* lint
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* lint
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* lint
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* Fix function name
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* refactor
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* Remove spurious change
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* Add cursor generated doc strings
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* re-enable ad test
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* some perf cleanup
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* debug ci
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* ensure that overlap scheduler is enabled
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* Reorder the tests
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
---------
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-03-26 14:33:14 -07:00
wili
3e035f2219
v1.2 ( #3082 )
...
Signed-off-by: wili <wili@nvidia.com>
2025-03-26 23:31:29 +08:00
Jinyang Yuan
6b583f6f83
perf: Enable CUDA graphs when attention DP is used and active requests on different GPUs are uneven ( #3010 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-03-26 21:09:25 +08:00
Enwei Zhu
224469b096
test: [TRTLLM-4334] Create 1.0 criteria scope from API stability references ( #3069 )
...
* committed APIs validation
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* clean name
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* separate
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* add TODOs
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix naming
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-03-26 18:14:35 +08:00
Kaiyu Xie
ea3739ee62
Fix: fuse message not aligned on different processes ( #3067 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-03-26 17:15:27 +08:00
Yechan Kim
3c7cb6629c
Add EXAONE-Deep ( #3054 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-03-26 14:24:04 +08:00
DylanChen-NV
1ac0566a93
fix: fix for cp > kvHeadNum ( #3002 )
...
* fix for cp > kvHeadNum
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
* fix for None kv_head_num
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
---------
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-03-26 12:39:02 +08:00
HuiGao-NV
25f2434495
fix: Set correct draft_token_nums to dummy requests for torch compilation with MTP ( #3053 )
...
Set correct draft_token_nums to dummy requests for torch compilation with MTP
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-03-26 11:32:57 +08:00
yuxianq
268933b5cc
Refactor imports inside tensorrt_llm._torch. ( #3015 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-03-26 11:01:07 +08:00
WeiHaocheng
7ac04ada2a
doc: Add README.md for scaffolding ( #3048 )
...
* Add README.md for scaffolding
Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>
* Update tensorrt_llm/scaffolding/README.md
Co-authored-by: dongxuy04 <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: WeiHaocheng <20514172+WeiHaocheng@users.noreply.github.com>
---------
Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: WeiHaocheng <20514172+WeiHaocheng@users.noreply.github.com>
Co-authored-by: dongxuy04 <78518666+dongxuy04@users.noreply.github.com>
2025-03-25 13:58:01 +08:00