TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-29 07:02:56 +08:00

Author	SHA1	Message	Date
tomeras91	35010e8073	Support NemotronH FP8 Quantization (1) match quant exclude modules names to TRTLLM names (2) No need for any special weight loading for quantization scales weights (#3891) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-04-29 18:51:43 +03:00
xiweny	68a19a33d4	TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 (#3770 ) * upgrade cutlass to 3.9 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> update latest internal_cutlass_kernels; revert cutlass version update; fix fp4 gemm for sm100 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * update internal cutlass kernels Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * fix file Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * remove unnecessary change Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * update hash Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> --------- Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>	2025-04-29 11:19:11 -04:00
yuxianq	0f8ec693b2	fix: get head_dim from model’s config. (#3916 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-29 23:04:29 +08:00
HuiGao-NV	8e6eead6a5	refactor: (part1) Add constraints doc for fusedMoe module. (#3882) * Add doc string for FusedMoe module * Address comments. Signed-off-by: Hui Gao <huig@nvidia.com>	2025-04-29 22:23:02 +08:00
Junhong Liu	06e76020d7	feat: parallel q_b_proj and concat (#3917 ) * add parallel_q_b_proj_and_concat Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com> * code cleanup Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com> * one gemm/concat and then split the latent_cache and pass them separately to context/gen Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com> --------- Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>	2025-04-29 22:07:05 +08:00
Dom Brown	8709fe8b53	chore: bump version to 0.19.0 (#3598 ) (#3841 ) test: add test cases for 0.19 release (#3608) * fix test name * add quickstart test for nemotron-ultra * add rcca multi-node test case for deepseek-v3 * add rcca info --------- squash (#3642) fix: nvbugs/5187237: fix deterministic mode crash (#3448) * nvbugs/5187237 nvbugs/5112075: fix deterministic mode error * remove waive * Revert "remove waive" This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac. * revert ar fusion --------- update fp8 doc (#3647) tests: change qa perf test to trtllm-bench (#3619) fix: FP8 quantized lm_head (NvBug 5214229) (#3567) infra: Add PR approval protection for the release branch (#3634) fix: nvbugs/5231298: pytorch allreduce issue (#3673) Fix: nvbugs/5222698 variable not defined (#3630) * Fix: nvbugs/5222698 variable not defined * Tidy code --------- test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685) test:restore fp8 kv cache testing for L0 (#3671) doc: Update DeepSeek perf docs (#3693) * Update DeepSeek perf docs * update * Apply suggestions from code review --------- tests: waive test_llm_multi_node (#3664) fix: update test_user_buffers_mm_add_prologue atol (#3711) Fix: cherry-pick hmac encryption from main branch (#3635) * security fix cherry-pick changes from main * fix hmac in remote mpi session (#3649) --------- Un-waive DS-V3-Lite tests. (#3621) fix: FP8 kv accuracy (#3675) * fix FP8 kv accuracy * update doc --------- Fix script options for engines. (#3622) unwaive multi-node test (#3721) chore : Split more tests out of gpt tests (#3524) (#3674) doc:add torch examples link into torch backend documentation (#3749) test: Get Eagle tests working (#3593) (#3722) Waive L0 test (#3756) waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656) Update ds v3 parameters in stress test. 
(#3676) waive gemma on L20 (#3766) https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758) Include Qwen2VLDecoderLayer in the smooth_qwen2_model function. fix: PP4 fixes and cleanup (#3688) remove benchmark test list (#3643) skip disagg deepseek test if sm!=90 (#3720) test: skip failed cases on B200 (#3710) * add skip condition to tests * fix error --------- test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718) * skip_pre_ada for fp8 cases * update * update after rebase --------- add know issue to deepseek doc. (#3800) Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761) Waive L0 tests (#3826) fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793) * Reduce memory usage in fused moe op associated with AutoTuning. * Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens. * Add free_memory logic of workspace in min_latency_mode fused moe path. * Fix fused_moe fallback issue. (#3652) min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression. --------- [doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797) Fix pre-commit Fix again Address some review comments for the MI Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-29 16:57:22 +08:00
zhhuang-nv	94e6167879	optimize cudaMemGetInfo for TllmGenFmhaRunner (#3907 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-04-29 14:17:07 +08:00
bhsueh_NV	2e230b73ec	change log level of some text from info to debug (#3930 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-29 13:38:34 +08:00
yuxianq	adfa04745e	fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-29 11:26:13 +08:00
bhsueh_NV	0610d0ff84	add num_scheduled_requests into print_log (#3914 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-29 11:22:22 +08:00
Frank	cf15efa15e	[TRTLLM-4883][fix]: Update output speed calculation. (#3923 ) * Update gen tps calculation. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Add back output speed for comparison. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Fix issue with f-string. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Fix some spacing. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Replace output speed with per-request genphase tput. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Add gen TPS breakdown. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Update some tagging. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> --------- Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-04-29 11:04:12 +08:00
QI JUN	c381380ecc	increase H100 CI nodes for PyTorch only pipelines (#3927 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-29 10:58:43 +08:00
Perkz Zheng	35c5e4f1c5	feat: add CGA reduction fmha kernels on Blackwell. (#3763 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add trtllm-gen kernels for eagle3 and also kernels with cga-reduction Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * address the comments Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-04-29 10:43:54 +08:00
hlu1	d2f312b8e4	Fix fp8 kvcache (#3877 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-04-29 10:31:10 +08:00
WeiHaocheng	8a994d879f	feat: fix errors on scaffolding README (#3899) Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>	2025-04-29 10:15:06 +08:00
qixiang-99	f370dd0e32	refactor(test): remove random context sequence lengths and set seed for reproducibility in attention tests (#3919 ) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-04-29 10:08:04 +08:00
yuxianq	b91da764de	chore: remove DummyKvCacheManager. (#3896 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-29 09:59:37 +08:00
Jinyang Yuan	dafc28fb85	fix: Fix FMHA-based MLA in the generation phase and add MLA unit test (#3863 )	2025-04-29 09:09:43 +08:00
Erin	0577ea0155	waive test_attention_no_cache (#3921 ) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-04-28 13:57:01 -07:00
Mike Iovine	e534bf09cc	[fix] Fix flashinfer + speculation issues (#3686 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-28 14:34:22 -04:00
xiweny	f84dd8f815	test: add deepseek v3 & r1 cases (#3528 ) * test: add deepseek v3 & r1 cases Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>	2025-04-28 23:37:26 +08:00
Yukun He	5502a522d2	Fixing minor typo in allreduce kernel selection (#3912 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-04-28 23:06:49 +08:00
Mike Iovine	e6f7ff3a46	[chore] Make llama4 MoE use maybe_execute_in_parallel (#3779 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-28 10:58:03 -04:00
Zhenhuan Chen	19da82d68f	fix(requirements): fix neither 'setup.py' nor 'pyproject.toml' found (#3906 ) Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>	2025-04-28 18:35:19 +08:00
Xianjie Qiao	3617e948fd	Add docs about DeepSeek-R1 long context support. (#3910 ) * Add docs about DeepSeek-R1 long context support Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com> * update docs Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com> * reformat Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com> --------- Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-04-28 18:33:05 +08:00
Zhenhuan Chen	ad15e45f07	[TRTLLM-4638 ][feat] add best of n support with reward model in scaffolding (#3807 ) Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>	2025-04-28 17:15:33 +08:00
Tao Li @ NVIDIA	2fe35924e3	Fix the link of doc (#3903 ) Signed-off-by: taoli <litaotju@users.noreply.github.com> Co-authored-by: taoli <litaotju@users.noreply.github.com>	2025-04-28 14:41:40 +08:00
xinhe-nv	82a8e43557	test: [CI] Add failed cases into waives.txt (#3867 ) * update waive list Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> * update waives Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> --------- Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>	2025-04-28 14:32:48 +08:00
xinhe-nv	e20b67e9fd	update waives & tests (#3887 ) Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>	2025-04-28 14:29:35 +08:00
Zhenhuan Chen	d5bca18807	infra: add scaffolding paths to pytorch only files (#3835 ) Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>	2025-04-28 13:49:27 +08:00
Yanchao Lu	068c72ebf8	Test: waive intermittent test hang (#3894 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-04-28 08:53:20 +08:00
bhsueh_NV	f77252e9ff	fix bug of create cuda stream as default parameter which will be init… (#3764 ) * fix bug of create cuda stream as default parameter which will be initialized during importing Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * add torch.cuda.Stream() for the leader node Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix pre-commit issue Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-28 08:16:03 +08:00
Iman Tabrizian	74cc9e26ff	infra: install Triton in the base image (#3759 ) * infra: install Triton in the base image Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * install Triton from the base image Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * update base image Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * Address review comments Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * update base image Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * waive test Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> --------- Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-04-28 07:36:30 +08:00
Yan Chunwei	ad4226d946	fix: trtllm-bench build trt engine on slurm (#3825 ) * add submit_sync to RemoteMpiSessionClient Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> add barrier Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> fix comment Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> disable test Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * fix Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-04-27 22:26:23 +08:00
bhsueh_NV	76f2c631fb	fix: add warmup flag into py_executor to prevent enable profiler during wa… (#3852 ) * add warmup flag into py_executor to prevent enable profiler during warmup Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of pre-commit Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * change setting warmup to all ranks Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-27 19:22:42 +08:00
Chuang Zhu	e2318756ed	cacheTransceiver buffer manager (#3798 ) * cacheTransceiver buffer manager Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * fix args Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * cpp kvCacheManager Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * format Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-27 11:48:15 +08:00
HuiGao-NV	136aab5c54	fix: Update num_of_ctx_tokens in iteration stats (#3785) * Update num_of_ctx_tokens in iteration stats * Revert not necessary change of importing module	2025-04-27 10:24:47 +08:00
Emma Qiao	a4b483b969	Infra: Remove empty junit xml (#3794 ) * Remote results.xml when no cases ran Signed-off-by: qqiao <qqiao@nvidia.com> * Change some test config to verify Signed-off-by: qqiao <qqiao@nvidia.com> * Update for quotes Signed-off-by: qqiao <qqiao@nvidia.com> * Move the remove results.xml in catch section Signed-off-by: qqiao <qqiao@nvidia.com> * Add missed path Signed-off-by: qqiao <qqiao@nvidia.com> * Change back the test stage setting Signed-off-by: qqiao <qqiao@nvidia.com> --------- Signed-off-by: qqiao <qqiao@nvidia.com>	2025-04-26 18:46:18 -07:00
bhsueh_NV	e9fab4f3d9	fix bug of deepseek group_size setting (#3860) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-27 09:10:37 +08:00
yuxianq	e6c14ca97a	fix: Detect pmix and raise error when mpirun is not used. (#3858 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-26 21:49:41 +08:00
milesial	362a8272f8	feat: llama4 input processor (#3383 ) Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com> Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-04-25 16:47:14 -07:00
Kaiyu Xie	d7472231f9	TRTLLM-4875 feat: Add version switcher to doc (#3846 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-26 05:42:46 +08:00
Dom Brown	7ff9fd345c	Test: Split C++ unit tests for CI granularity (#3868 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-04-25 13:30:58 -07:00
QI JUN	6ac1a54f57	chore: update pytorch only change file list (#3873 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-26 04:03:51 +08:00
qixiang-99	ecd621fb0a	feat: Add head size 72 support for QKV Preprocessing kernel (#3743 ) * refactor: Fix headsize 72 attention error for TRTLLM attn backend in PyTorch workflow - Remove the head size pre-check logic in AttentionOp because head size 72 can be supported with fmha kernels. - Added support for head size 72 in unfused attention kernels(QKVPreprocessing). - Enhanced unit tests by introducing a scenario generation function for better test coverage of attention configurations(include head size 72). Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * update: Waive head_dim=72 test cases and enhance test representation - Added a waiver for head_dim=72 cases on post sm100 in the test suite to address known issues. - Introduced a custom __repr__ method in the Scenario class for pytest substring match. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> --------- Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-04-25 11:07:40 -07:00
sugunav14	5b9897a8cd	fix: [AutoDeploy] update hf loading for e_score_correction_bias (#3847 ) Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>	2025-04-26 02:03:47 +08:00
Mike Iovine	68e774ff9e	[chore] Add Llama 4 Maverick to quickstart README (#3848 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-26 01:04:24 +08:00
Yiqing Yan	238fefc659	[infra] Waive L0 tests (#3853 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-04-25 17:32:21 +08:00
dongxuy04	16535991b2	feat: Add MNNVL MoE A2A support (#3504 ) * add MNNVL memory mapping support Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add more MPI environment for trtllm-llmapi-launch Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add MoE communication and prepare kernels Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add MNNVL AlltoAll support for DeepSeekV3 Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add output dump for throughput benchmark Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * support dynamic kernel launch grid Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * address review comments Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * address review comments #2 Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> --------- Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-04-25 17:29:08 +08:00
Yuan Tong	57944206ba	feat: return logits in PyTorch flow (#3221 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-24 16:56:03 -07:00

651 Commits