wili | 2e3cf42e03 | 2025-07-10 11:37:30 -04:00
[refactor] Simplification of Speculative decoding configs (#5639)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>

Kaiyu Xie | 7b09a415c1 | 2025-07-10 19:36:26 +08:00
fix: Make the bench serving script compatible with different usages (#5905)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

Enwei Zhu | 055c4a9fe6 | 2025-07-10 16:30:00 +08:00
[NvBug 5370718, 5371538] fix: Fix incremental detokenization (#5825)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

CarstyYou | dc32f9ae73 | 2025-07-10 15:16:18 +08:00
[fix] support tileN not divisible by 16 & sm89 DeepGEMM bmm (#5531)
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>

Anthony Chang | 7d21b55b5a | 2025-07-10 14:06:50 +08:00
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Yan Chunwei | 07f6da763d | 2025-07-10 11:31:35 +08:00
[TRTLLM-5530] chore: rename LLM.autotuner_enabled to enable_autotuner (#5876)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
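The #5876 entry renames a public LLM argument, so downstream call sites need the new spelling. A minimal sketch of the change, assuming the `tensorrt_llm.LLM` constructor accepts the flag as a keyword; the model path is a placeholder, not taken from this log:

```python
# Minimal sketch of the #5876 rename, assuming the tensorrt_llm LLM API.
# The model path is a placeholder, not taken from the log.
from tensorrt_llm import LLM

llm = LLM(
    model="/path/to/model",   # placeholder checkpoint
    enable_autotuner=True,    # formerly: autotuner_enabled=True
)
```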
Hanjun Cho | 6490a27ad7 | 2025-07-10 10:26:06 +08:00
[feat] Add TensorRT-Engine Qwen3 (dense) model support (#5650)
Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>
Signed-off-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal>

brb-nv | 3209b31665 | 2025-07-10 06:18:04 +09:00
feat: Custom masking utils for Gemma3 VLM (#5853)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

2ez4bz | 87fe44fd29 | 2025-07-09 13:17:40 -07:00
feat(models): Mistral3.1 VLM pytorch backend support (#5529)
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>

Chang Liu | b61a717275 | 2025-07-10 05:12:53 +09:00
[1/N][TRTLLM-5195][feat] Share PyTorch tensor between processes (#5396)
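For context on the #5396 entry, inter-process tensor sharing is a standard PyTorch pattern; the sketch below is generic `torch.multiprocessing` usage to illustrate the idea, not the TensorRT-LLM code path from the commit:

```python
# Generic PyTorch illustration (not the #5396 implementation): moving a
# tensor into shared memory so a child process sees the same storage.
import torch
import torch.multiprocessing as mp

def child(t: torch.Tensor) -> None:
    t.add_(1)  # mutate the shared storage from the child process

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    t = torch.zeros(4)
    t.share_memory_()  # back the tensor with shared memory
    p = mp.Process(target=child, args=(t,))
    p.start()
    p.join()
    print(t)  # tensor([1., 1., 1., 1.]): the parent sees the child's write
```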
Wanli Jiang | 3f7cedec7c | 2025-07-09 09:32:24 -07:00
Update transformers to 4.53.0 (#5747)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>

DylanChen-NV | 74dca0aa7b | 2025-07-09 23:16:42 +08:00
[NVBUG-5304516/5319741] Qwen2.5VL FP8 support (#5029)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

tomeras91 | 5aa958a11a | 2025-07-09 11:30:15 +03:00
[TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

Dom Brown | 3e3b1769ad | 2025-07-09 08:21:58 +01:00
[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with PyTorch workflow kernel autotuner (#5764)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

dongxuy04 | dd3c736c7e | 2025-07-09 14:26:57 +08:00
chore: some refactor on WideEP (#5727)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

chenfeiz0326 | 64fd64fcf2 | 2025-07-09 14:23:21 +08:00
[TRTLLM-6262] Fix Llama4 Scout FP4 crash issue (#5834)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

Chang Liu | 4df5f96c8d | 2025-07-09 13:03:40 +09:00
[Bugfix] Llama4: fix for llama4 multimodal support (#5809)

Xianjie Qiao | 5ab1cf5ae6 | 2025-07-09 11:19:06 +08:00
Remove unnecessary benchmarking results (#5852)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>

brb-nv | 2bd09ed2d4 | 2025-07-09 10:10:33 +08:00
fix: Skip rope scaling for local layers in Gemma3 VLM (#5857)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

Omer Ullman Argov | d6d2ab2c99 | 2025-07-09 03:53:03 +03:00
[fix] Catch inference failures in trtllm-bench (#5841)
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>

Iman Tabrizian | c508b994b6 | 2025-07-09 08:42:45 +09:00
Fix lost requests for disaggregated serving (#5815)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Kaiyu Xie | bb5b16fcb9 | 2025-07-09 00:19:57 +09:00
feat: Return context response immediately when stream_interval > 1 (#5836)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
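The #5836 entry concerns when streamed chunks are flushed: with `stream_interval` above 1, deltas are emitted every N steps, and the change makes the first (context) response flush immediately instead of waiting out the first interval. A toy, library-independent sketch of that timing (this is not TensorRT-LLM code; names and the interval value are illustrative):

```python
# Illustrative sketch (not TensorRT-LLM API): with stream_interval = 4,
# token deltas flush every 4th step, but per #5836 the first (context)
# response is emitted immediately rather than after the first interval.
def stream_responses(tokens, stream_interval=4):
    buffer = []
    for step, tok in enumerate(tokens, start=1):
        buffer.append(tok)
        if step == 1 or step % stream_interval == 0:  # step 1: context response
            yield list(buffer)
            buffer.clear()
    if buffer:  # flush any trailing tokens
        yield list(buffer)

print(list(stream_responses(list("abcdefghij"), 4)))
# [['a'], ['b', 'c', 'd'], ['e', 'f', 'g', 'h'], ['i', 'j']]
```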
Raayan Dhar | e3268a4221 | 2025-07-08 09:39:58 -04:00
[TRTLLM-5847][feat] Support n-gram speculative decoding with disagg (#5732)
Signed-off-by: raayandhar <rdhar@nvidia.com>

Yukun He | e104f8bbb5 | 2025-07-08 19:51:05 +08:00
[5305318] fix: Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Yegor | b01d1c28f7 | 2025-07-08 19:36:04 +08:00
[feat] Detokenize option in /v1/completions request (#5382)
Signed-off-by: Yegor <75512761+Wokzy@users.noreply.github.com>
Signed-off-by: Yegor Yershov <yegor6741@gmail.com>
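The #5382 entry adds a detokenize toggle to the OpenAI-compatible /v1/completions route. A hypothetical request sketch, assuming a local server on port 8000 and that the JSON field is literally named `detokenize`; the field name, default, and model id are assumptions, not confirmed by this log:

```python
# Hypothetical sketch of the #5382 detokenize option. Endpoint, port,
# model id, and the exact field name "detokenize" are assumptions.
import json
import urllib.request

payload = {
    "model": "placeholder-model",  # assumed model id
    "prompt": "Hello, world",
    "max_tokens": 8,
    "detokenize": False,           # assumed field: skip detokenization of the output
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",  # assumed local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```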
xiweny | eaf8bec88b | 2025-07-08 16:15:03 +08:00
fix: Disaggregated serving with attention DP (#4993)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

Yiqing Yan | 5203a0f6df | 2025-07-08 16:04:40 +09:00
chore: bump version to 1.0.0rc3 (#5819)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

Zhenhuan Chen | dee6644ed9 | 2025-07-08 15:08:40 +09:00
feat(scaffolding): add streaming scaffolding_llm.generate_async support (#5345)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>

nv-guomingz | 0be41b6524 | 2025-07-08 13:15:30 +09:00
Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" (#5818)

Yechan Kim | 5bc3a15f10 | 2025-07-07 18:03:12 -07:00
feat: add MultimodalParams, move all multimodal params into it, and refactor HyperCLOVAX & Qwen2/2.5-VL (#5522)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

nv-guomingz | 5a8173c121 | 2025-07-08 08:52:36 +08:00
chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#5795)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

Robin Kobus | 30a19fcf7c | 2025-07-07 16:30:43 +02:00
[TRTLLM-6291] feat: Add user-provided speculative decoding support (#5204)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Tailing Yuan | 85b4a6808d | 2025-07-07 22:57:03 +09:00
Refactor: move DeepEP from Docker images to wheel building (#5534)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Daniel Cámpora | 1260e2f33f | 2025-07-07 15:44:47 +02:00
feat: Optimize TRTLLM Sampler perf for single beam, single step (#5550)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

DylanChen-NV | 5ca2b9bb15 | 2025-07-07 18:04:57 +08:00
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow (#5615)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Yan Chunwei | dfce61f4b9 | 2025-07-07 17:05:14 +08:00
[TRTLLM-5530][BREAKING CHANGE] refactor: rename LLM arg mixed_sampler to enable_mixed_sampler (#5751)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
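Because #5751 is flagged as a breaking change, existing call sites must move to the new keyword. A minimal sketch, again assuming the `tensorrt_llm.LLM` constructor and a placeholder model path:

```python
# Minimal sketch of the #5751 rename, assuming the tensorrt_llm LLM API.
# The model path is a placeholder, not taken from the log.
from tensorrt_llm import LLM

llm = LLM(
    model="/path/to/model",     # placeholder checkpoint
    enable_mixed_sampler=True,  # formerly: mixed_sampler=True (removed by the breaking change)
)
```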
Zheng Duan | de10774c2e | 2025-07-07 14:54:36 +08:00
chore: log stack trace on error in OpenAI server (#5749)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>

Daniel Stokes | ec6c7dff1a | 2025-07-06 15:32:06 -07:00
feat: Add support for MXFP8xMXFP4 in PyTorch (#5535)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>

Robin Kobus | ae27261094 | 2025-07-06 08:21:02 +02:00
refactor: decoding inputs (#5679)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Xianjie Qiao | b1976c2add | 2025-07-05 19:29:39 +08:00
Add wide-EP benchmarking scripts (#5760)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Xianjie Qiao | 089fd55eda | 2025-07-05 13:08:58 +09:00
Add dummy all_reduce for kernel breakdown (#5745)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>

Frank | d61893dc77 | 2025-07-05 05:19:16 +09:00
[fix] Update to properly set CUDA graphs in trtllm-bench overrides (#5634)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

Stefan Niebler | d1112aac37 | 2025-07-05 01:35:13 +09:00
[TRTLLM-3442] feat: add beam search support to the PyTorch workflow (#5333)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>

HuiGao-NV | 3ed3bbcb5d | 2025-07-04 21:32:13 +09:00
fix: pass allreduce strategy to PyTorchConfig (#5746)
Signed-off-by: Hui Gao <huig@nvidia.com>
Shunkangz | 32339d1b20 | 2025-07-04 18:58:24 +09:00
Raise shutdown error for each request (#4936)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Tailing Yuan | e134a52e07 | 2025-07-04 14:46:28 +08:00
Perf: reduce DeepEPLowLatency memory and time (#5712)
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Shunkangz | a79d8c9f5e | 2025-07-04 14:25:10 +08:00
Fix none response in PD (#5422)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
brb-nv | cdaa6abce7 | 2025-07-04 13:14:13 +08:00
fix: Investigate Gemma3 1B decoder output discrepancy (#5564)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

Frank | 819ae903de | 2025-07-04 13:14:13 +08:00
[https://nvbugspro.nvidia.com/bug/5351333][fix] Update to chunking calculation (#5625)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

Clay | 7a319524da | 2025-07-04 09:35:34 +08:00
feat: support more parameters in OpenAI worker of scaffolding (#5115)
Signed-off-by: Clay <ccs96307@gmail.com>