Zongfei Jing
|
dcc0ebd273
|
Fix warning (#3254)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
|
2025-04-03 13:30:23 +08:00 |
|
Jinyang Yuan
|
2fdfa39ea8
|
fix: Fix an error related to dummy request when MTP is used (#3146)
|
2025-04-03 11:08:12 +08:00 |
|
Anurag Mukkara
|
d998339855
|
Raise error for PP + MTP (#3244)
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
|
2025-04-03 04:45:31 +08:00 |
|
QI JUN
|
abcb0486dc
|
fix deepseek failure with pipeline parallelism (#3225)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
|
2025-04-02 22:56:39 +08:00 |
|
Enwei Zhu
|
3cf7066350
|
test: Accuracy test improvement (Part 3.2): Move Qwen tests (NvBug 5135332) (#3219)
* remove test_llm_models_multi_gpu.py
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* qwen 2.5
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* upgrade
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
|
2025-04-02 17:29:57 +08:00 |
|
QI JUN
|
bb10cdcfb8
|
chore: refine fetch new requests method (#3213)
* refine broadcast new requests method
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* refine fetch new requests method
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---------
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
|
2025-04-02 10:46:00 +08:00 |
|
Zheng Duan
|
35b828ca2d
|
fix streaming in dist-serving (#3087)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
|
2025-04-02 10:08:07 +08:00 |
|
Zongfei Jing
|
c7548ad72c
|
perf: Add optimizations for deepseek in min latency mode (#3093)
* Add optimizations for deepseek min latency
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Fix compile error
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Update internal cutlass kernel libs
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Format code
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Resolve conflicts
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
---------
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
|
2025-04-02 09:05:24 +08:00 |
|
brb-nv
|
1fe3e30356
|
Add support for Phi-4-mini (#2990)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
|
2025-04-02 08:34:39 +08:00 |
|
Zhanrui Sun
|
42963baacd
|
chore: bump version to 0.19.0.dev2025040800 (#3171)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
|
2025-04-02 08:21:55 +08:00 |
|
QI JUN
|
8fe2e5865e
|
refine broadcast new requests method (#3198)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
|
2025-04-02 08:05:20 +08:00 |
|
Enwei Zhu
|
b2f69db507
|
test: Accuracy test improvement (Part 3.1): Extend accuracy test suite with LLM API and initial implementation of trtllm-eval (#3167)
* add eval_llmapi
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
tmp commit
port to CLI tool
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
move
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
setup llmapi
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
fix spec_dec_algo
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
_update_from_hf_quant_config
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
migrate test_pytorch.py
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
fix fp8 block scales
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
fix fp8 rowwise
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
adj alpha
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
move test_pytorch.py cases
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
move
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
rename test_accuracy.py to test_cli.py
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
clean
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix cnn_dailymail
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* renaming to cli flow
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* rename MMLU
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* rename
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* add error
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
|
2025-04-01 22:20:29 +08:00 |
|
amirkl94
|
bf02b9144f
|
feature: Add LoRA support for gemma (#3068)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
|
2025-04-01 19:15:55 +08:00 |
|
WeiHaocheng
|
ff35af77ea
|
feat: refactor scaffolding worker and support openai api worker (#3166)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>
|
2025-04-01 18:31:52 +08:00 |
|
Jinyang Yuan
|
992d513bc6
|
feat: Optionally split MoE inputs into chunks to reduce GPU memory usage (#3104)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
|
2025-04-01 16:07:02 +08:00 |
|
brb-nv
|
727d78e785
|
Support prequantized fp8 ckpt for nemotron-mini-4b-instruct (#3046)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
|
2025-04-01 14:52:09 +08:00 |
|
dongjiyingdjy
|
22ff81b047
|
fix: fix illegal memory access when mtp >= 2 (#3006)
* fix - fix illegal memory access when mtp > 2
---------
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
|
2025-04-01 13:36:45 +08:00 |
|
Shunkangz
|
dda7354d1a
|
Refactor return of first gen token in PD (#2986)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
|
2025-04-01 12:28:27 +08:00 |
|
jiahanc
|
c4ee14e43a
|
fix: Reverse cuda graph size order (#3116)
Signed-off-by: jiahanc <jiahanc@nvidia.com>
|
2025-04-01 11:28:36 +08:00 |
|
Aurelien Chartier
|
14e194433c
|
chore: cleanup py_executor code (#3132)
* chore: cleanup py_executor code
* Add common loop cleanup function
* Remove checks for attention DP if nothing to queue
* Remove extra return statements
* Remove extra variables
* Remove commented debug print
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
* rename cleanup function
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
---------
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
|
2025-04-01 09:27:04 +08:00 |
|
Anurag Mukkara
|
435cd2983d
|
perf: Optimisations for PP + attention DP (#3134)
* Minor tp_rank fix
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Delete unused function
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* PP broadcast for ADP new requests
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Sync request finish point for intermediate and last pp ranks
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Use local PP layers only for KV cache estimation
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
---------
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
|
2025-04-01 08:59:16 +08:00 |
|
Frank
|
8bb3eea285
|
perf: Readd iteration logging for trtllm-bench. (#3039)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
|
2025-04-01 08:13:09 +08:00 |
|
WeiHaocheng
|
f665f83256
|
feat: improve scaffolding shutdown process (#3084)
|
2025-03-31 20:39:20 +08:00 |
|
Zhanrui Sun
|
36ac5e78ed
|
chore: bump version to 0.19.0.dev2025040100 (#3152)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
|
2025-03-31 16:36:06 +08:00 |
|
Quanfeng Li
|
839aad4d6e
|
fix: Add missing parameter for WeightOnlyQuantRowLinear module (#2768)
Signed-off-by: Quanfeng Li <liquanfeng7@foxmail.com>
|
2025-03-31 16:20:30 +08:00 |
|
QI JUN
|
9560fcd5ec
|
Chore: waive tests and fix multi-GPU tests (#3157)
* waive tests
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* update
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* clean up
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---------
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
|
2025-03-31 16:05:45 +08:00 |
|
liji-nv
|
e0d0dde058
|
None - Add one-shot version for UB AR NORM FP16/BF16 (#2995)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
|
2025-03-31 11:16:03 +08:00 |
|
Yan Chunwei
|
794f61c997
|
fix: fix single-node cannot quit issue on slurm (#3140)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
|
2025-03-31 10:15:27 +08:00 |
|
Mike Iovine
|
5416966ddb
|
Add initial EAGLE-3 implementation (#3035)
Signed-off-by: Mike Iovine <miovine@nvidia.com>
|
2025-03-29 22:31:24 +08:00 |
|
Erin
|
c75d7cd684
|
move BuildConfig functional args to llmargs (#3036)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
|
2025-03-29 02:20:18 +08:00 |
|
Aurelien Chartier
|
3de82c41cd
|
Pytorch PP + attention DP support (#3044)
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
|
2025-03-28 00:11:19 +08:00 |
|
Fanrong Li
|
ec03159e60
|
fix: Waive twoshot to fix acc issue (#3066)
* waive twoshot to fix acc issue
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
---------
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
|
2025-03-27 21:38:52 +08:00 |
|
Yan Chunwei
|
87ab794aa2
|
fix: fix hang in mgmn with trtllm-llmapi-launch command (#3119)
* init
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* restore
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
|
2025-03-27 18:45:43 +08:00 |
|
Fanrong Li
|
0976360204
|
add support for MTP+cuda_graph_padding. (#3096)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
|
2025-03-27 16:06:14 +08:00 |
|
Yan Chunwei
|
82edd90350
|
fix gpus_per_node in trtllm-bench when world_size < device_count (#3007)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
|
2025-03-27 09:31:40 +08:00 |
|
Suyog Gupta
|
047f2b234d
|
perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench (#3041)
* Enable AutoDeploy as a backend in trtllm-bench
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* update how caches are resized
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* fix: files permission from 100755 to 100644
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* some comments
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* lint
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* lint
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* lint
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* lint
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* Fix function name
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* refactor
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* Remove spurious change
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* Add cursor generated doc strings
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* re-enable ad test
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* some perf cleanup
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* debug ci
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* ensure that overlap scheduler is enabled
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
* Reorder the tests
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
---------
Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
|
2025-03-26 14:33:14 -07:00 |
|
wili
|
3e035f2219
|
v1.2 (#3082)
Signed-off-by: wili <wili@nvidia.com>
|
2025-03-26 23:31:29 +08:00 |
|
Jinyang Yuan
|
6b583f6f83
|
perf: Enable CUDA graphs when attention DP is used and active requests on different GPUs are uneven (#3010)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
|
2025-03-26 21:09:25 +08:00 |
|
Enwei Zhu
|
224469b096
|
test: [TRTLLM-4334] Create 1.0 criteria scope from API stability references (#3069)
* committed APIs validation
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* clean name
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* separate
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* add TODOs
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix naming
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
|
2025-03-26 18:14:35 +08:00 |
|
Kaiyu Xie
|
ea3739ee62
|
Fix: fuse message not aligned on different processes (#3067)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
|
2025-03-26 17:15:27 +08:00 |
|
Yechan Kim
|
3c7cb6629c
|
Add EXAONE-Deep (#3054)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
|
2025-03-26 14:24:04 +08:00 |
|
DylanChen-NV
|
1ac0566a93
|
fix: fix for cp > kvHeadNum (#3002)
* fix for cp > kvHeadNum
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
* fix for None kv_head_num
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
---------
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
|
2025-03-26 12:39:02 +08:00 |
|
HuiGao-NV
|
25f2434495
|
fix: Set correct draft_token_nums to dummy requests for torch compilation with MTP (#3053)
Set correct draft_token_nums to dummy requests for torch compilation with MTP
Signed-off-by: Hui Gao <huig@nvidia.com>
|
2025-03-26 11:32:57 +08:00 |
|
yuxianq
|
268933b5cc
|
Refactor imports inside tensorrt_llm._torch. (#3015)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
|
2025-03-26 11:01:07 +08:00 |
|
WeiHaocheng
|
7ac04ada2a
|
doc: Add README.md for scaffolding (#3048)
* Add README.md for scaffolding
Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>
* Update tensorrt_llm/scaffolding/README.md
Co-authored-by: dongxuy04 <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: WeiHaocheng <20514172+WeiHaocheng@users.noreply.github.com>
---------
Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: WeiHaocheng <20514172+WeiHaocheng@users.noreply.github.com>
Co-authored-by: dongxuy04 <78518666+dongxuy04@users.noreply.github.com>
|
2025-03-25 13:58:01 +08:00 |
|
Aurelien Chartier
|
ef78518310
|
Only gather responses on rank 0 (#3040)
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
|
2025-03-24 21:54:51 -07:00 |
|
Zhanrui Sun
|
c2ffce7dbd
|
chore: bump version to "0.19.0.dev2025032500" (#3019)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
|
2025-03-25 10:04:17 +08:00 |
|
bhsueh_NV
|
11f9ecb2fd
|
chore: remove useless param (#3023)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
|
2025-03-25 08:36:45 +08:00 |
|
Netanel Haber
|
da0b0e0ee3
|
fix: disable kv cache reuse when minimum window size is reached, instead of maximum window size (#2983)
* fix variable window size reuse - disable when *min attention window* starts sliding, not max
* isPreCyclic -> isCyclic, and invert logic, for clarity
* getDecoderState()
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
|
2025-03-24 22:49:52 +08:00 |
|
Yan Chunwei
|
531b98ed62
|
feat: Add several pure python configs to LlmArgs (#2997)
* add SchedulerConfig
* add PeftCacheConfig
|
2025-03-24 16:16:17 +08:00 |
|