Chuang Zhu
8733e830fc
[None][fix] Add lock for request_to_session in sendReadySingal ( #8310 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-10-14 04:32:37 -07:00
Yan Chunwei
86be06bda4
[None][ci] waive several rpc tests ( #8349 )
...
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-10-14 03:12:49 -07:00
Cao Dong
62cea877b1
[None][feat] Move StreamGeneration to scaffolding main directory ( #8347 )
...
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-10-14 17:16:04 +08:00
William Zhang
72d65d079a
[ https://nvbugs/5542878 ][fix] Unwaive test ( #8027 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-10-14 07:58:07 +02:00
xinhe-nv
371fcb0338
[TRTLLM-8366][feat] add kimi multi nodes case ( #8025 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-10-13 21:36:03 -07:00
yuanjingx87
d90b4c57cc
[None][infra] Pin numexpr in requirements.txt ( #8343 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-10-13 21:09:08 -07:00
Yuxian Qiu
3450fe9944
[None][fix] Fix dummy load format for key models. ( #7993 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-10-14 11:18:39 +08:00
Aurelien Chartier
9bc055faf1
[None][fix] Disable DeepGEMM for Qwen3 MoE Attention layers ( #8087 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-10-13 18:38:47 -07:00
Lucas Liebenwein
22aa4ac08c
[None][feat] AutoDeploy: VLMs with subgraphs + cudagraph/compile ( #8203 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-10-13 17:34:09 -07:00
Zheyu Fu
bac665e650
[TRTLLM-7412][feat] Turn off spec decode when the rolling average acceptance length drops below threshold. ( #7283 )
...
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-10-13 15:51:14 -07:00
Grzegorz Kwasniewski
ea4658197f
[TRTLLM-6342][feat] Factory TP sharding of quantized models ( #8123 )
...
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-10-13 14:04:46 -07:00
Yuxian Qiu
bd740c9ba6
[None][fix] Avoid unnecessary concat in attn_output_gate case. ( #8094 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-10-13 12:59:40 -07:00
mpikulski
6c4cc4c8b2
[None][fix] workaround for numexpr issue ( #8327 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-13 11:56:03 -07:00
Yueh-Ting (eop) Chen
4882815fa1
[TLLM-6777][feature] Support SWA KV cache reuse OOW block detach ( #7922 )
...
This MR is a continuation of #6768. In the previous merge request,
OOW (out-of-window) blocks were only detached when reuse was not
enabled; that is, with reuse enabled, the block movement behavior was
identical between SWA and full attention.
This merge request enables OOW block detach when reuse is enabled.
The required changes are:
- Let the KV cache manager keep track of which block is used by which
sequence
- Remove the restriction so that the eviction policy can release a
non-leaf block
Along the way, bugs inside freeChildren and the offload mechanism
under getFreeBlock are fixed, because they affect the functionality
this merge request is trying to achieve.
When a block goes OOW, it is released from the sequence: it becomes
available to be reclaimed, and it is held by the eviction policy for
another sequence to acquire on request. On the other hand, we still
want the option to store the sequence for reuse. To achieve this
safely, block ownership is recorded under
WindowBlockManager::getFreeBlock. If the acquired block was originally
owned by another sequence that is still live inside the manager, that
sequence is invalidated for store-for-reuse.
At the end of a sequence (when removeSequence is called on it), the
KV cache manager checks whether any of the sequence's blocks were
reclaimed by another sequence. If none were, the sequence is safe to
store for reuse, and the store-for-reuse action is performed.
Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-10-13 09:18:12 -07:00
Kaiyu Xie
9ff9fa6413
[None] [doc] Update README ( #8326 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-10-13 07:18:32 -07:00
Kaiyu Xie
040103ab56
[None] [blog] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary) ( #8323 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-10-13 06:37:17 -07:00
Robin Kobus
db8c63b9b1
[TRTLLM-4517] [feat] Additional model outputs ( #7206 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-10-13 15:33:18 +02:00
amitz-nv
bbae7a05f0
[ https://nvbugs/5521949 ][fix] Replace test_codellama_fp8_with_bf16_lora with test_llama_3_1_8b_fp8_with_bf16_lora ( #8199 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-13 06:01:55 -07:00
Fanrong Li
1e0fbb776d
[TRTLLM-8536][feat] Update trtllm gen fmha kernels to support block sparse attention ( #8301 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-10-13 05:54:48 -07:00
Xianjie Qiao
d145e87f6f
[None][chore] Update disagg benchmark configs ( #8289 )
...
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
2025-10-13 18:15:46 +08:00
Cao Dong
d882c92a84
[None][fix] Fix EventLoopShutdownError ( #8260 )
...
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-10-13 17:31:33 +08:00
Po-Han Huang (NVIDIA)
6fc6f70a68
[ https://nvbugs/5441729 ][test] Fix test_modeling_llama_min_latency.py failures ( #7478 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-10-13 15:35:02 +08:00
xinhe-nv
9fe63dd8db
[None][chore] Add failed cases into waives.txt ( #8290 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-10-13 00:07:00 -07:00
Emma Qiao
fe17e78f27
[None][infra] Add back gb200 multi-node test stage to pre-merge ( #8281 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-10-12 23:56:07 -07:00
Leslie Fang
8d1b068b1a
[TRTLLM-8477][chore] Replace KvCacheConfigCpp with KvCacheConfig inside PyExecutor ( #8259 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-10-13 14:55:36 +08:00
Yilin Fan
1a9044949f
[None][fix] Fix bench_serving import error ( #8296 )
...
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-12 22:46:31 -07:00
xiweny
5ce9719759
[ https://nvbugs/5503138 ] [fix] Remove compile warnings ( #8167 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-10-13 13:24:23 +08:00
xinhe-nv
72fcff1044
[None][fix] add timeout for llama4 ( #8254 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-10-12 21:04:20 -07:00
DylanChen-NV
d6e315e9ff
[None][feat] Add torch compile support for cuda core GEMM OP ( #8261 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-10-12 20:57:17 -07:00
Guoming Zhang
989c25fcba
[None][doc] Add qwen3-next doc into deployment guid and test case into L0. ( #8288 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Faradawn Yang <faradawny@gmail.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-10-13 10:25:45 +08:00
Guoming Zhang
656d73087e
[None][doc] Fix several invalid ref links in deployment guide sections. ( #8287 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-10-13 10:22:32 +08:00
amitz-nv
fac47e2826
[ https://nvbugs/5510879 ][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 ( #8063 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-12 12:29:52 -07:00
Eran Geva
a1ed03fe8a
[None][fix] AD test_trtllm_bench to use small model config and skip loading weights ( #8149 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-10-12 18:30:20 +03:00
Emma Qiao
fdbeea51d3
[None][infra] Skip failed cases for main branch ( #8293 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-10-12 08:04:09 -07:00
kris1025
a7ea544dbe
[TRTLLM-7384][feat] enable rejection sampling for CDL ( #7731 )
...
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-10-12 20:38:48 +08:00
Zhanrui Sun
5798a12199
[None][infra] Remove WAR code for GH200 node ( #8266 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-10-11 20:33:14 -07:00
brb-nv
56a539cd37
[None][chore] Waive failing pre-merge test on main ( #8282 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-10-10 23:52:05 -07:00
Ziyi Xiong
efd4ffa03b
[ https://nvbugs/5534705 ][fix] Skip unnecessary CUDA graph capture ( #8050 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-10-11 13:26:55 +08:00
Zhenhuan Chen
84d2f12818
[TRTLLM-6748][feat] add PDL support for more kernels ( #7977 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-10-11 08:32:05 +08:00
Yilin Fan
2695d70d42
[None][feat] Add request timing breakdown option in benchmark_serving ( #8128 )
...
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-10 09:24:54 -07:00
Chuang Zhu
85f157f389
[None][fix] Add Lock to protect mReqeustToSession ( #8085 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
2025-10-10 21:51:50 +08:00
QI JUN
48c15d805c
[ https://nvbugs/5558167 ][fix] update canceled_req_ids correctly for canceled requests ( #8207 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-10 18:58:26 +08:00
xinhe-nv
2655995a09
[None][fix] add gc for test fixture ( #8220 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-10-10 02:50:25 -07:00
bhsueh_NV
d3059dbd8a
[ https://nvbugs/5547416 ][fix] unwaive no_cache test ( #8213 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-10-10 01:50:13 -07:00
xinhe-nv
b555f1ff98
[None][chore] Add failed cases into waives.txt ( #8229 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-10-09 23:45:28 -07:00
HuiGao-NV
795a051765
[None][chore] Print log with time for starting to load safetensor weights ( #8218 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-10-10 13:54:54 +08:00
xinhe-nv
e8c9bae37e
[None][chore] Remove closed bugs ( #8151 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-10-10 16:39:40 +11:00
Jonas Li
76a47c7bef
[None][fix] Enable FP8 ContextMLA on GB300 ( #8080 )
...
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
2025-10-10 10:20:46 +08:00
Pengbo Wang
7da4b05289
[ https://nvbugs/5501820 ][fix] Add requirements for numba-cuda version to WAR mem corruption ( #7992 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-10-10 10:18:27 +08:00
mpikulski
7b6803b6e9
[TRTLLM-7769][chore] document the role of 'd2t' ( #8174 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-09 13:13:50 -04:00