Commit Graph

1675 Commits

Author SHA1 Message Date
mpikulski
fc7f78c400
[TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling (#8110)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-02 10:20:32 +02:00
Eran Geva
32c7f8c36f
[#7588][feat] lock gpu clocks in test_perf.py to reliably detect perf regressions (#8099)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-10-02 11:18:10 +03:00
brb-nv
bd3d0ad233
[TRTLLM-7733][feat] Executor changes to support helix parallelism (#7972)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-10-01 22:13:03 -04:00
Izzy Putterman
1ad7bc4c78
[None][feat] Draft: Save state first pass (#7012)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-10-01 18:40:55 -04:00
Frida Hou
de99e23696
[#5860][feat] Add ModelOPT INT4 awq fake quant support in AutoDeploy (#7770)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-10-01 13:13:45 -07:00
Yibin Li
d7581bb551
[TRTLLM-8031][feat] Add chunked return_generation_logits logic (#7831)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-10-01 12:47:07 -04:00
sychen52
ba8abeab10
[OMNIML-2336][feat] add W4A8 NVFP4 FP8 fused moe (#7968)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-10-01 02:39:33 -04:00
Patrice Castonguay
b77f19f4ff
[https://nvbugs/5434320][fix] fix: Unwaiving disagg pp tests (#8069)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-10-01 00:33:59 -04:00
Emma Qiao
b1e3fef8aa
[None][infra] Skip failed tests in post-merge for main (#8102)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-10-01 10:12:10 +08:00
brb-nv
84aa3c981e
[None][chore] Waive failing MNNVL alltoall multi-gpu test (#8106)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-09-30 20:05:42 -04:00
mpikulski
ee5ae49337
[TRTLLM-8269][fix] Revert "do not explicitly pass temperature=0 to select greedy sampling" (#8103)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-30 16:53:49 -04:00
Iman Tabrizian
c510b67fa0
[https://nvbugs/5547414][fix] avoid downloading Tiny llama from HF (#8071)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-30 13:47:59 -04:00
xinhe-nv
1dba9fa89e
[TRTLLM-6239][feat] add test cases into QA test list (#8081)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-30 00:23:45 -04:00
Kaiyu Xie
b0cb9ca50e
[None] [test] Add MNNVL AlltoAll tests to pre-merge (#7466)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-09-29 23:12:24 -04:00
Lucas Liebenwein
dcfd3ef81c
[#4593][feat] AutoDeploy: Linear Attention Support (SSM + causal_conv + Bamba + Nemotron-H) (#8068)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-09-29 22:41:06 -04:00
Cao Dong
62010c0ab7
[None][feat] Return topk logprobs in torch backend (#7976)
Signed-off-by: Cao Dong <87467313+dcaox@users.noreply.github.com>
2025-09-30 09:32:37 +08:00
Cheng Hang
cdce68c3e0
[TRTLLM-6741][fix] Add heuristics for lm head tp size when enable_lm_head_tp_in_adp=True (#7891)
Signed-off-by: Cheng Hang <chang@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-30 09:24:35 +08:00
Patrice Castonguay
6396cb9208
[https://nvbugs/5538098][fix] Checking connection to etcd server in unit test (#8006)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-09-29 20:53:32 -04:00
Chang Liu
334e2cab0d
[https://nvbugs/5542867][fix] Fix the non-determinism issue in the mm_encoder test (#8033)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-29 09:45:16 -07:00
amitz-nv
e5f9b6aaa0
[None][fix] Fix TRT-python multi LoRA TP=2 test arguments (#8059)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-09-29 12:20:04 -04:00
mpikulski
31a1a5ff80
[TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling (#7909)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-29 14:52:18 +01:00
xiweny
48e779ae8c
[https://nvbugs/5541494] [fix] add back missing sm100f bmm kernels (#8051)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-29 05:35:44 -04:00
yufeiwu-nv
3ba6727a68
[None][test] Update get_sysinfo.py to avoid UnboundLocalError (#7982)
Signed-off-by: yufeiwu <230315618+yufeiwu-nv@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
2025-09-29 05:14:38 -04:00
Gal Hubara-Agam
b2095aa074
[#4674][bugfix] AutoDeploy Fix memory leak in fuse_moe (#7844)
Delete the unstacked weights immediately to save GPU memory, cleanup occurs automatically after the transformation, but for large models we'll run out of memory during the transformation itself.

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2025-09-29 11:01:07 +03:00
xinhe-nv
20e6cd39f1
[None][chore] Add failed cases into waives.txt (#8043)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-29 03:37:39 -04:00
Emma Qiao
ce381d6813
[None][infra] Waive failed cases for main on 0929 (#8053)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-29 02:46:02 -04:00
HuiGao-NV
7ac932d45e
[https://nvbugs/5532087][CI] Enable test case (#8029)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-29 01:46:28 -04:00
Ivy Zhang
1e2e851db8
[None][chore] update test case constraint (#8020)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-09-29 13:25:09 +08:00
Eran Geva
9cea6bfb30
[#7288][feat] Added AutoDeploy backend support to test_perf.py (#7588)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-28 21:21:27 -07:00
Ivy Zhang
0ecafd84da
[None][chore] Update chunked prefill test case configs (#7868)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-09-29 10:37:34 +08:00
Yukun He
28b9a81c58
[TRTLLM-4500][feat] Add serialization/deserialization options for AutoTuner profiling cache (#7738)
To achieve determinism for the AutoTuner profiling cache, serialization and deserialization are introduced to store the cache on disk in JSON format. Use TLLM_AUTOTUNER_CACHE_PATH to indicate the path where the cache file should be stored:

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-09-29 07:40:51 +08:00
Emma Qiao
2be05cbd6e
[None][infra] Skip failed test for main branch on 9/28 (#8040)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-28 07:00:55 -04:00
ChristinaZ
95eac2cda7
[https://nvbugs/5537738][fix] Add fp8 post-quant allgather support (#8008)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-28 15:32:45 +08:00
Iman Tabrizian
33282351a2
[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-27 19:29:30 -04:00
Frida Hou
a36b48bcab
[#5860][autodeploy] GPT-OSS MXFP4 support (#7451)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-09-26 15:36:06 -07:00
Jhao-Ting Chen
c33f43e13a
[https://nvbugs/5518713][fix] Trtllm-gen moe backend for blockwise fp8 ckpt (Qwen3-235B-A22B-FP8) (#7856)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-09-26 14:29:32 -07:00
Emma Qiao
c8bef27ebb
[None][infra] Waive failed cases in post-merge 2305 (#8019)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-26 10:20:12 -07:00
YueWeng
a4243f0da5
[TRTLLM-6393][feat] add static tree sampling and verification (#7161)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-09-26 13:16:16 -04:00
xinhe-nv
ba6ab62bd1
[None][chore] Add failed cases into waives.txt (#8004)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-09-26 00:41:02 -07:00
xinhe-nv
f32f5730b2
[None][chore] Add failed cases into waives.txt (#7986)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-09-25 23:50:09 -07:00
Lucas Liebenwein
3a96d75a3c
[https://nvbugs/5527956][fix] AutoDeploy: fix IMA due to outdated metadata (#8002)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-09-25 22:05:55 -07:00
sunnyqgg
2e5850c28a
[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference (#7363)
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-26 11:28:05 +08:00
QI JUN
4c0f8482f1
[None][ci] Waive test_mm_encoder_standalone.py::test_multi_request_batch_chat[llava-v1.6-mistral-7b-hf] (#8010)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-26 11:07:54 +08:00
Enwei Zhu
d650320de4
[None][infra] Improve the failure message for accuracy test suite (#7994)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-26 10:04:47 +08:00
Yiqing Yan
108248ece1
[TRTLLM-7999][infra] Add B300/GB300 single gpu test (#7951)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-26 09:59:11 +08:00
QI JUN
1529a6f22d
[None][chore] extract weights loading related logic to model loader (#7579)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-25 10:19:22 -07:00
Emma Qiao
2dc93c6371
[None][infra] Waive failed tests on main (#8001)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-25 08:13:39 -07:00
xxi
57ff5f4c0d
[None][fix] fix a bug in wideEp use DeepEP with num_chunks > 1 (#7954)
Signed-off-by: xxi <xxi@nvidia.com>
2025-09-25 07:53:42 -07:00
Matthias Jouanneaux
eda1467061
[TRTLLM-5966][feat] Helix: add alltoall op (#6815)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-09-25 07:18:29 -07:00
Guoming Zhang
202bed4574 [None][chroe] Rename TensorRT-LLM to TensorRT LLM for source code. (#7851)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5999fab146 [https://nvbugs/5427043][fix] cherrypick: request length exceeds max_num_tokens (#7718)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
cb466a846d [None][fix] api stability bug in status label (#7861)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
9f0f52249e [None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5342c607cd [https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (#7717)
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
xinhe-nv
e30d9aced9
[https://nvbugs/4955671][fix] update test list (#7980)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-25 02:58:09 -07:00
Chuang Zhu
791e73edf6
[https://nvbugs/5536141][fix] fix_disagg_single_gpu_test (#7990)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-25 02:07:22 -07:00
Emma Qiao
cb53261aaf
[None][infra] Unwaive some tests since dev already have a PR to collect more info (#7984)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-25 01:03:13 -07:00
fredricz-20070104
0945403174
[TRTLLM-6541][test] Add NIM perf test cases (#7924)
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
2025-09-25 13:15:26 +08:00
Iman Tabrizian
be7e51727e
[https://nvbugs/5456485][bug] unwaive triton test (#7966)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 17:02:55 -07:00
Iman Tabrizian
da30d496b0
[None][fix] Revert "[None][feat] Return topk logprobs in torch backend (#7756)" (#7969)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 15:36:38 -07:00
sychen52
5a65af24cd
[OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels (#7821)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-24 12:14:35 -07:00
Mike Iovine
42c2ec3239
[https://nvbugs/5473781][fix] Fix llama 4 FP8 for PP>1 (#7220)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-24 12:16:27 -04:00
Pamela Peng
b1dc84b4a3
[TRTLLM-7399][test] Add DS-R1/Qwen3 test cases for RTX 6000 (#7662)
Signed-off-by: Pamela <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-09-24 11:40:26 -04:00
Yuxian Qiu
48fda86c56
[None][fix] Fix dummy load format for DeepSeek. (#7874)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-24 23:03:16 +08:00
Eran Geva
603517f72a
[#7675][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) (#7888)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-24 10:11:44 -04:00
HuiGao-NV
c8bda4b3a9
[None][ci] Waive some intermittent failures (#7955)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-24 19:00:38 +08:00
Enwei Zhu
a1a57e83b8
[TRTLLM-5235][feat] Enable regex and EBNF grammar in trtllm-serve (#7925)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-24 18:30:23 +08:00
xinhe-nv
b8bfa63197
[None][chore] add test_w4_1gpu[True-True-cutlass-fp8] & TestKimiK2::test_fp8_blocks… (#7944)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-24 03:25:17 -07:00
QI JUN
18ff1e31b8
[None][ci] remove duplicate test cases (#7956)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-24 17:47:22 +08:00
yufeiwu-nv
f323b74d42
[None][test] Update llm_models_root to improve path handling on BareMetal environment (#7876)
Signed-off-by: yufeiwu <230315618+yufeiwu-nv@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-09-24 17:35:57 +08:00
HuiGao-NV
29e63d3bc2
[https://nvbugs/5532248][fix] Fix fused_moe OOM (#7931)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-24 02:22:38 -07:00
QI JUN
946ffcd2eb
[None][ci] optimize test cases of dgx b200 (#7948)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-24 00:39:45 -07:00
Cao Dong
2f8dc6feb0
[None][feat] Return topk logprobs in torch backend (#7756)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-09-24 15:30:39 +08:00
xinhe-nv
62563760fb
[None][chore] update chunked prefill cases (#7921)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-24 15:14:49 +08:00
Pengbo Wang
b890d7fea4
[None][infra] Skip failed test for nvbugs 5537738 (#7946)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 23:48:50 -07:00
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse (#6768)
This merge request attempts to support more SWA KV cache functionality
inside the KV cache manager. Before this merge request, the KV cache for
sliding window attention (SWA) only holds "window size" number of blocks
and reuse them in a cyclic manner. We will not be able to utilize more
GPU memory with this design, leading to a limited max batch size
throughput. Additionally, we will not be able to support KV cache reuse
with this design.

In this MR, we change such behavior to let the manager write blocks in
a linear manner. With a linear block writing behavior, as the attention
window moves on, the out-of-window (OOW) blocks will be detached. Right
now for the sake of a correct feature first, we directly offload the
OOW block from the primary block pool (GPU memory) to the secondary
block pool (host memory). We will improve this in the future by
delegating the block movement to the eviction policy.

KV cache reuse for SWA is not developed in this merge request and will
be amended in a follow-up merge request.

Writing the blocks linearly, the maximum number of blocks allocated for
a sequence(`GenerationRequest`) is the "max sequence length" specified.
The `GenerationRequest` that stores the cache block bookkeeping
structure will now keep "max sequence length" tokens of blocks.

Given the above, main changes are (more context in the MR):
- Remove "cyclic" concept under the kv cache manager, such concept
  originally guards the block reuse under kv cache manager.
- Add detach mechanism and have it under `KVCacheManager::addToken`.
  Please note that detach is still guarded off for SWA when reuse
  is enabled. A follow-up merge request will proceed to improve this.
- Enforce "max sequence length" to be a non-optional parameter to
  the `KVCacheManager`/`BlockManager`
- Let all window size resource pool get identical proportion of memory
- Fix free memory calculation under `resource_manager.py`

Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
Yuan Tong
70c3b100eb
[#7692][fix] recognize RequestError as per-request error in background handler (#7726)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-24 11:11:17 +08:00
Lizhi Zhou
e4f1f90202
[https://nvbugs/5477404][chore] unwaive test_disaggregated_single_gpu.py::test_disaggregated_llama_context_capacity (#7857)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-24 10:31:35 +08:00
Venky
6ff0fad75e
[TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend (#7580)
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-09-23 18:48:10 -07:00
Lizhi Zhou
7550251988
[TRTLLM-7182][test] add multi-nodes test for disagg-serving (#7470)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-24 08:31:56 +08:00
mpikulski
9970345919
[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-23 16:05:05 -07:00
Zheng Duan
e3c1a9409f
[TRTLLM-6549][fix] add kv cache time output back (#7798)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-23 14:12:42 -04:00
Yanchao Lu
6a36349964
[None][test] Waive another intermittent OOM test (#7930)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-23 22:34:09 +08:00
Zheyu Fu
34963ec39c
[None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off (#7511)
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-23 06:54:18 -07:00
ruodil
05bec3bf0f
[None][test] rename llm_perf_full to llm_perf_core and add missing cases (#7899)
Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com>
2025-09-22 23:04:34 -07:00
Pengbo Wang
a4b4ed4535
[None][fix] Fix and add test for TRTLLM MoE backend (#7755)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 11:26:25 +08:00
Pengbo Wang
08cc7a041f
[https://nvbugs/5355128][fix] Add missing wgmma intrinsic for starcoder (#7643)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 10:38:58 +08:00
yunruis
126cd707e3
[None][opt] Add batch waiting when scheduling (#7416)
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-09-23 10:27:37 +08:00
Chang Liu
998857bcde
[TRTLLM-7328][feat] E-PD Disagg Support via llmapi (3/N) (#7577)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-22 19:07:18 -07:00
Enwei Zhu
8330d5363a
[TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 09:10:09 +08:00
xxi
d471655242
[TRTLLM-7831][feat] Cherry-pick from #7423 Support fp8 block wide ep cherry pick (#7712) 2025-09-23 08:41:38 +08:00
Enwei Zhu
59f57598a7
[https://nvbugs/5504086][fix] Fix MTP vanilla (#7904)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 08:38:28 +08:00
ChristinaZ
be576a3152
[None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend (#6794)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-09-23 08:24:21 +08:00
Jin Li
b5391b4ac6
[https://nvbugs/5516665][fix] Fix CUTLASS moe fake impl errors (#7714)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-22 11:08:39 -07:00
Linda
b1738c3f18
[https://nvbugs/5477359][fix] Removing test waivers (#7877)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-09-22 08:59:13 -07:00
Wanli Jiang
2a30f11d63
[None][chore] Upgrade transformers to 4.56.0 (#7523)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-22 22:20:16 +08:00
Emma Qiao
324301ccba
[None][infra] Skip failed test for nvbugs 5532023 (#7905)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-22 03:49:44 -07:00
Yechan Kim
f77aca9f2c
[TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance (#7250)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-09-22 03:40:02 -07:00
Bo Deng
8cf95681e6
[TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package (#7766)
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-09-22 16:43:35 +08:00
Emma Qiao
d330d0005c
[None][infra] Waive a failed case on main (#7901)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-22 00:37:01 -07:00