TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-25 13:12:45 +08:00

Author	SHA1	Message	Date
Grzegorz Kwasniewski	cff54fcae3	[#8948 ][feat] Support custom sharding config (#9143 ) Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>	2025-11-29 05:28:05 +08:00
Matthias Jouanneaux	f8dd494536	[None][perf] Helix: improve all-to-all perf for large CP size (#9494 ) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com> Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com> Co-authored-by: Zheyu Fu <zheyuf@nvidia.com>	2025-11-28 07:24:55 -08:00
mpikulski	e5f39ec7cf	[TRTLLM-9488][feat] add 'disable_flashinfer_sampling' config option (#9454 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-11-28 13:00:39 +01:00
Robin Kobus	5eae3650c3	[None][fix] Pass checkpoint_format to create_input_processor (#9521 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-11-28 10:32:29 +01:00
Yukun He	60c43a200a	[None][fix] Fix on-disk cache and revise logger/statistics for AutoTuner. (#9211 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-11-28 13:32:21 +08:00
Lucas Liebenwein	2f8bd6fb36	[#9150 ][feat] AutoDeploy Nemotron-Flash support (#9504 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-11-27 18:03:57 +01:00
Bo Li	62b771877c	[TRTLLM-9389][chore] Refactor AlltoallMethodType. (#9388 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-11-27 21:09:29 +08:00
Fanrong Li	2d5eadf65f	[None][fix] fix TP support for DeepSeek-V3.2 on hopper (#9484 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-11-27 21:02:25 +08:00
Ziyi Xiong	1dd55d8507	[https://nvbugs/5698581 ][fix] Init draft tokens for CUDA graph dummy request (#9505 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-11-27 13:05:37 +08:00
Jiagan Cheng	14762e0287	[None][fix] Replace PYTORCH_CUDA_ALLOC_CONF with PYTORCH_ALLOC_CONF to fix deprecation warning (#9294 ) Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>	2025-11-27 12:22:01 +08:00
Chenghao Zhang	18fbda5cdb	[None][feat] AutoDeploy: Add A_log fusion for Mamba layers (#9422 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2025-11-26 14:39:20 -08:00
Chenghao Zhang	bc7b60e016	[None][feat] AutoDeploy: Remove redundant copies in mamba layers (#9461 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>	2025-11-26 14:38:33 -08:00
Aurelien Chartier	ef7ee6a940	[None][feat] Add environment variable to force spec-dec number of accepted tokens (#9371 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-11-26 07:22:16 -08:00
Chang Liu	b10137fdd5	[None][feat] Support MLA chunked prefill for DeepSeek V3.2 model (#9376 ) Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-11-26 16:38:25 +08:00
Enwei Zhu	1bf2d750a2	[None][chore] Upgrade CuteDSL to 4.3.0 (#9444 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-11-26 14:53:09 +08:00
JunyiXu-nv	b7308a4000	[https://nvbugs/5580099 ][fix] Cherry pick IMA issue fix from release/1.1 (#9032 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>	2025-11-26 13:09:06 +08:00
shuyixiong	d8acea1db3	[TRTLLM-9293][feat] Enable partial weight loading to support streaming update weights (#9224 ) Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>	2025-11-26 10:59:06 +08:00
Chuang Zhu	0e9c7f8c07	[https://nvbugs/5685143 ][fix] avoid cudaFree overlap with cuda graph (#9438 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-11-25 16:20:29 -08:00
Suyog Gupta	e484bec82f	[None][chore] AutoDeploy add multi stream moe pass to default.yaml (#9430 ) Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>	2025-11-25 14:16:13 -08:00
Robin Kobus	32f53910ef	[TRTLLM-909][feat] Overlap context chunks in pipeline parallel mode (#9308 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-11-25 22:11:51 +01:00
Eran Geva	afc52d7b93	[https://nvbugs/5647400 ] [fix] Enlarged the AllReduce workspace size to 64MB. Added AllReduce strategy to AD config. (#9145 ) Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>	2025-11-25 10:56:07 -08:00
mpikulski	899fda9e47	[TRTLLM-9490][feat] use FlashInfer's top_k_sampling_from_probs (#9457 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-11-25 18:53:53 +01:00
mpikulski	c5f52ab304	[TRTLLM-8376][feat] top-p optimization (removes redundant softmax) (#9411 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-11-25 18:46:48 +01:00
YueWeng	cc336c4abd	[TRTLLM-8160][feat] Add draft token tree runtime on CDL (#8586 ) Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>	2025-11-25 09:40:55 -05:00
Yueh-Ting (eop) Chen	a38d91aae2	[https://nvbugs/5537996 ][fix] Let KV cache manager block initialization be aware whether it is doing a dry run or not (#9093 ) Before this commit, the KV cache manager ran the same initialization whether or not it was a dry run, which miscalculated the free memory available to allocate for the KV cache and caused a crash. This commit makes KV cache manager initialization aware of whether it is doing a dry run. If it is a dry run, it uses the max_tokens value that is already pre-calculated and stored in kv_cache_config.max_tokens. Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-11-25 17:27:11 +08:00
Yukun He	e580da4155	[TRTLLM-7963][feat] Cold L2 cache when doing autotune benchmarking. (#8779 ) The measured performance of some kernels is easily affected by whether the L2 cache is warm or cold. To get more precise profiling results during autotuning, the L2 cache is cleared before every execution using the circular buffer method. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-11-25 15:06:22 +08:00
William Zhang	a4049fc557	[#9413 ][fix] Minor fixes to nemotron H and custom models in AD (#9416 ) * Why? There were a couple of issues with the recently merged custom model injection for AutoDeploy and the reference implementation of nemotron H: - `d_mlp` was left in despite being mathematically always null (which could lead to runtime issues during sharding). - The custom model mapping was inherited by child factories. * What? This commit fixes these issues and refactors the key of the custom implementation to also be based on the name of the configuration class. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-11-24 20:17:33 -08:00
Suyog Gupta	efd503751f	[#9271 ][perf] Enable multi-stream MOE optimization in AutoDeploy (#9322 ) Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>	2025-11-24 19:50:10 -08:00
Yuxian Qiu	8a0295015f	[None][chore] Reduce nested nvtx ranges. (#9347 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-11-25 09:58:41 +08:00
bhsueh_NV	1a93583438	[None][feat] Support Yarn on QwQ-32B model (#9059 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com> Co-authored-by: NVJiangShao <91270701+StudyingShao@users.noreply.github.com>	2025-11-25 07:27:28 +08:00
Yibin Li	1ce483c999	[TRTLLM-7967][feat] Adding Starcoder2 PyTorch Backend Support (#8923 ) Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>	2025-11-24 11:23:22 -08:00
Yukun He	960851f419	[None][chore] Remove unnecessary log in the short tuning profile (#9387 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-11-24 12:31:26 +08:00
Yukun He	39076410a8	[https://nvbugs/5676748 ][fix] Fix mismatched nvfp4 gemm sf shape. (#9336 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-11-24 12:16:32 +08:00
brb-nv	c045e359a7	[https://nvbugs/5637012 ][fix] Fix helix unit tests (#9369 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-11-23 19:34:22 -08:00
Yukun He	c3acf965a6	[TRTLLM-7963][fix] Several improvements of autotuning quality (#9348 ) * Skip generating the shape profile if the profile has already been found in the cache under tuning mode. This is a prerequisite for nested autotuning because host overhead might be included during the profiling of the high-level op. * Enable profiling with CUDA graph as the default profiling method. * Apply a heuristic that cuts the number of profiling repetitions based on a timing measurement over a few runs.	2025-11-24 10:38:45 +08:00
Bo Li	fcfec93cad	[TRTLLM-9389][chore] Rename AlltoAll backend names (#9329 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-11-23 13:52:57 -08:00
William Zhang	11a0b276fb	[#9230 ][feat] Slimmed down implementation of nemotron H (#9235 ) * Why? The reference nemotron H code on HuggingFace is out of date, and therefore buggy, and has several untested code paths. This makes an already hairy patching system even hairier. The proposal is to do away with those patches and replace the original implementation with one that is heavily slimmed down. * What? This PR sets the basis for an alternative path with such a slimmed down implementation that: - fixes bugs in the current HF implementation - adds no new dependencies to TensorRT-LLM - does away with features that are unnecessary for TensorRT-LLM/AutoDeploy: - no training related code (dropout, gradient checkpointing, etc.) - no caching logic (we want to replace it with our own anyway) - no attention masking where possible - reuses existing AD custom ops for mamba SSM update / causal conv1d / attention In order for the above to be usable in the AD apparatus, `AutoModelForCausalLMFactory` is extended to allow registration of custom model implementations. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-11-23 03:13:32 -08:00
Neta Zmora	3952a61681	[#9388 ][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation (#9339 ) Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>	2025-11-21 17:05:03 -08:00
Chenghao Zhang	564989865c	[TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT (#9106 ) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>	2025-11-21 16:05:48 -08:00
Izzy Putterman	eb7792e875	[None][feat] Eagle: PostNorm and multilayer options (#9233 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2025-11-21 17:39:00 -05:00
Enwei Zhu	13fbd4366a	[TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) (#9288 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-11-21 14:03:38 -08:00
Ziyi Xiong	5df907b388	[https://nvbugs/5590408 ][fix] Fallback to greedy sampling in two-model overlap scheduler (#9321 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-11-21 10:19:59 -05:00
HuiGao-NV	6dd2fcd7b3	[https://nvbugs/5629833 ][fix] Don't fill tensors with 0 (#9296 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-11-21 20:50:05 +08:00
mpikulski	095b6864a8	[TRTLLM-8650][fix] beam search request validation (#8433 ) (#9228 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-11-21 04:08:45 -08:00
xxi	cc0dc7c124	[TRTLLM-8957][feat] create communication related classes (#8968 )	2025-11-20 22:32:42 -08:00
Yukun He	9a79f32f7a	[https://nvbugs/5608489 ][fix] Fix output unpack issues for Llama3/4 NVFP4 models. (#8679 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
Lizhi Zhou	33b0b945c7	[https://nvbugs/5582277 ][fix] rework DisaggPPTerminationHandler to fix hang issue (#8519 ) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
Jin Li	3454eacd74	[https://nvbugs/5546510 ][fix] Move torch.cuda.Stream out of torch com… (#8494 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
JunyiXu-nv	ee6944bfa2	[https://nvbugs/5569713 ][fix] Disable fp8 deep gemm for EXAONE-4.0-32B-FP8 (#8429 ) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-11-20 12:43:13 -05:00
Liao Lanyu	04ad9f96fa	[https://nvbugs/5667687 ][fix] Set correct lm_head_tp_size_upper_bound (#9300 ) Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com> Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>	2025-11-20 00:41:00 -08:00

1210 Commits