Commit Graph

3148 Commits

Author SHA1 Message Date
Yuxian Qiu
bd740c9ba6
[None][fix] Avoid unnecessary concat in attn_output_gate case. (#8094)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-10-13 12:59:40 -07:00
mpikulski
6c4cc4c8b2
[None][fix] workaround for numexpr issue (#8327)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-13 11:56:03 -07:00
Yueh-Ting (eop) Chen
4882815fa1
[TLLM-6777][feature] Support SWA KV cache reuse OOW block detach (#7922)
This MR is a continuation of #6768. In the previous merge request,
OOW (out-of-window) blocks were only detached when reuse was not
enabled; that is, with reuse enabled, the block movement behavior was
identical between SWA and full attention.

This merge request attempts to enable OOW block detach when reuse is
enabled. The required changes are:

- Let KV cache manager keep track of which block is used by which
  sequence
- Remove restriction for the eviction policy to be able to release a
  non-leaf block

In the course of this development, bugs in freeChildren and in the
offload mechanism under getFreeBlock were also fixed, because they
affect the functionality this merge request introduces.

When a block goes OOW, it is released from its sequence and becomes
available to be reclaimed: the block is held by the eviction policy
until another sequence acquires it. On the other hand, we may still
want to store the sequence for reuse. To achieve this safely, block
ownership is recorded under WindowBlockManager::getFreeBlock. If an
acquired block was originally owned by another sequence that is still
live inside the manager, that sequence is invalidated for
store-for-reuse.

At the end of a sequence (when removeSequence is called for it), the
KV cache manager checks whether any of the sequence's blocks have been
reclaimed by another sequence. If none have, the sequence is safe to
be stored for reuse, and the store-for-reuse action is performed.
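The ownership bookkeeping described above can be sketched as follows. This is a minimal illustrative model, not the actual TensorRT-LLM C++ implementation; all class, method, and field names here (Block, KVCacheManagerSketch, reuse_ok, etc.) are hypothetical stand-ins for the behavior of WindowBlockManager::getFreeBlock and removeSequence:

```python
class Block:
    """A KV cache block; remembers which sequence last owned it."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.owner = None  # sequence id currently associated with this block


class KVCacheManagerSketch:
    """Toy model of OOW detach with reuse-eligibility tracking."""

    def __init__(self, num_blocks):
        self.free_blocks = [Block(i) for i in range(num_blocks)]
        self.reuse_ok = {}  # seq_id -> True while store-for-reuse is still safe

    def add_sequence(self, seq_id):
        self.reuse_ok[seq_id] = True

    def get_free_block(self, seq_id):
        # Acquire a free block; this is where ownership is recorded.
        block = self.free_blocks.pop()
        # If the block was last owned by a sequence that is still live,
        # that sequence can no longer be safely stored for reuse.
        if block.owner is not None and block.owner in self.reuse_ok:
            self.reuse_ok[block.owner] = False
        block.owner = seq_id
        return block

    def detach_oow_block(self, block):
        # Block went out-of-window: release it for reclamation, but keep
        # the ownership record so reuse eligibility can be checked later.
        self.free_blocks.append(block)

    def remove_sequence(self, seq_id):
        # Safe to store for reuse only if no block of this sequence was
        # reclaimed by another live sequence in the meantime.
        return self.reuse_ok.pop(seq_id)
```

In this model, a sequence whose OOW block is reclaimed by another live sequence loses its store-for-reuse eligibility, while the reclaiming sequence remains eligible.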

Signed-off-by: eopXD <yuehtingc@nvidia.com>
2025-10-13 09:18:12 -07:00
Kaiyu Xie
9ff9fa6413
[None] [doc] Update README (#8326)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-10-13 07:18:32 -07:00
Kaiyu Xie
040103ab56
[None] [blog] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary) (#8323)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-10-13 06:37:17 -07:00
Robin Kobus
db8c63b9b1
[TRTLLM-4517] [feat] Additional model outputs (#7206)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-10-13 15:33:18 +02:00
amitz-nv
bbae7a05f0
[https://nvbugs/5521949][fix] Replace test_codellama_fp8_with_bf16_lora with test_llama_3_1_8b_fp8_with_bf16_lora (#8199)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-13 06:01:55 -07:00
Fanrong Li
1e0fbb776d
[TRTLLM-8536][feat] Update trtllm gen fmha kernels to support block sparse attention (#8301)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-10-13 05:54:48 -07:00
Xianjie Qiao
d145e87f6f
[None][chore] Update disagg benchmark configs (#8289)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
2025-10-13 18:15:46 +08:00
Cao Dong
d882c92a84
[None][fix] Fix EventLoopShutdownError (#8260)
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-10-13 17:31:33 +08:00
Po-Han Huang (NVIDIA)
6fc6f70a68
[https://nvbugs/5441729][test] Fix test_modeling_llama_min_latency.py failures (#7478)
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-10-13 15:35:02 +08:00
xinhe-nv
9fe63dd8db
[None][chore] Add failed cases into waives.txt (#8290)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-10-13 00:07:00 -07:00
Emma Qiao
fe17e78f27
[None][infra] Add back gb200 multi-node test stage to pre-merge (#8281)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-10-12 23:56:07 -07:00
Leslie Fang
8d1b068b1a
[TRTLLM-8477][chore] Replace KvCacheConfigCpp with KvCacheConfig inside PyExecutor (#8259)
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-10-13 14:55:36 +08:00
Yilin Fan
1a9044949f
[None][fix] Fix bench_serving import error (#8296)
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-12 22:46:31 -07:00
xiweny
5ce9719759
[https://nvbugs/5503138] [fix] Remove compile warnings (#8167)
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-10-13 13:24:23 +08:00
xinhe-nv
72fcff1044
[None][fix] add timeout for llama4 (#8254)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-10-12 21:04:20 -07:00
DylanChen-NV
d6e315e9ff
[None][feat] Add torch compile support for cuda core GEMM OP (#8261)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-10-12 20:57:17 -07:00
Guoming Zhang
989c25fcba
[None][doc] Add qwen3-next doc into deployment guid and test case into L0. (#8288)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Faradawn Yang <faradawny@gmail.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-10-13 10:25:45 +08:00
Guoming Zhang
656d73087e
[None][doc] Fix several invalid ref links in deployment guide sections. (#8287)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-10-13 10:22:32 +08:00
amitz-nv
fac47e2826
[https://nvbugs/5510879][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 (#8063)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-12 12:29:52 -07:00
Eran Geva
a1ed03fe8a
[None][fix] AD test_trtllm_bench to use small model config and skip loading weights (#8149)
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-10-12 18:30:20 +03:00
Emma Qiao
fdbeea51d3
[None][infra] Skip failed cases for main branch (#8293)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-10-12 08:04:09 -07:00
kris1025
a7ea544dbe
[TRTLLM-7384][feat] enable rejection sampling for CDL (#7731)
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-10-12 20:38:48 +08:00
Zhanrui Sun
5798a12199
[None][infra] Remove WAR code for GH200 node (#8266)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-10-11 20:33:14 -07:00
brb-nv
56a539cd37
[None][chore] Waive failing pre-merge test on main (#8282)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-10-10 23:52:05 -07:00
Ziyi Xiong
efd4ffa03b
[https://nvbugs/5534705][fix] Skip unnecessary CUDA graph capture (#8050)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-10-11 13:26:55 +08:00
Zhenhuan Chen
84d2f12818
[TRTLLM-6748][feat] add PDL support for more kernels (#7977)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-10-11 08:32:05 +08:00
Yilin Fan
2695d70d42
[None][feat] Add request timing breakdown option in benchmark_serving (#8128)
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-10 09:24:54 -07:00
Chuang Zhu
85f157f389
[None][fix] Add Lock to protect mReqeustToSession (#8085)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
2025-10-10 21:51:50 +08:00
QI JUN
48c15d805c
[https://nvbugs/5558167][fix] update canceled_req_ids correctly for canceled requests (#8207)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-10 18:58:26 +08:00
xinhe-nv
2655995a09
[None][fix] add gc for test fixture (#8220)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-10-10 02:50:25 -07:00
bhsueh_NV
d3059dbd8a
[https://nvbugs/5547416][fix] unwaive no_cache test (#8213)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-10-10 01:50:13 -07:00
xinhe-nv
b555f1ff98
[None][chore] Add failed cases into waives.txt (#8229)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-10-09 23:45:28 -07:00
HuiGao-NV
795a051765
[None][chore] Print log with time for starting to load safetensor weights (#8218)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-10-10 13:54:54 +08:00
xinhe-nv
e8c9bae37e
[None][chore] Remove closed bugs (#8151)
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-10-10 16:39:40 +11:00
Jonas Li
76a47c7bef
[None][fix] Enable FP8 ContextMLA on GB300 (#8080)
Signed-off-by: Jonas Li <6110159+longlee0622@users.noreply.github.com>
2025-10-10 10:20:46 +08:00
Pengbo Wang
7da4b05289
[https://nvbugs/5501820][fix] Add requirements for numba-cuda version to WAR mem corruption (#7992)
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-10-10 10:18:27 +08:00
mpikulski
7b6803b6e9
[TRTLLM-7769][chore] document the role of 'd2t' (#8174)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-10-09 13:13:50 -04:00
Emma Qiao
ccd949ea5b
[None][infra] Waive failed tests on main 10/09 (#8230)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-10-09 22:46:07 +08:00
amitz-nv
d560054e1b
[None][chore] Restore asserts in pytorch flow LoRA tests (#8227)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-10-09 17:10:38 +03:00
QI JUN
e10121345e
[None][ci] pin flashinfer-python version (#8217)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-09 02:48:49 -07:00
Guoming Zhang
a193867f8f
[None][doc] Refine deployment guide by renaming TRT-LLM to TensorRT L… (#8214)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-10-09 17:11:24 +08:00
bhsueh_NV
27677a36f5
[https://nvbugs/5516666][fix] unwaive some Qwen3 CI tests (#8130)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-10-09 09:44:58 +08:00
Lizhi Zhou
fdf29ab8fa
[TRTLLM-7846][feat] Http disagg-cluster management implemention (#7869)
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-10-09 09:44:01 +08:00
QI JUN
6884d06aed
[None][ci] move some llama4 test cases to pre merge (#8189)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-10-08 18:34:08 -07:00
dongfengy
9f2a3ae88c
[None][fix] Restrict tinygemm use to certain SMs (#8182)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
2025-10-08 17:55:57 -07:00
Liao Lanyu
ed8e00ad4a
[https://nvbugs/5522746][fix] unwaive tests caused by node issues after rebooting (#8193)
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-10-09 08:45:56 +08:00
Mike Iovine
c88913dc03
[https://nvbugs/5541545][fix] Remove test_llama4 (#8031)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-10-08 15:20:15 -07:00
brb-nv
80517b7812
[None][chore] Waive some tests failing on main post merge (#8186)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-10-08 06:52:30 -07:00