Commit Graph

345 Commits

Author SHA1 Message Date
QI JUN
d0671494cd
chore: fix wheel version <= 0.45.1 (#3391)
* fix wheel version to 0.45.1

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* relax version

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-09 12:31:55 +08:00
sugunav14
64abb01a36
Fix failing DSV3 unit tests (#3385)
* Skipping DSV3 module patch unit tests

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* update tested

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* Fixed failing unit test

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

---------

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-04-09 11:57:05 +08:00
tburt-nv
3a8443f1e1
extend allowlist (#3379)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-09 11:10:42 +08:00
Iman Tabrizian
8401722245
test: Add single gpu disaggregated tests (#3295)
* test: Add single gpu disaggregated tests

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Add deepseek with overlap tests

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Use updated prompt

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Move test to disaggregated folder

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

---------

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
2025-04-09 09:34:45 +08:00
Tracin
2a2b7bfc66
Fix missing bias add for FP4Linear. (#3361)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-09 09:17:54 +08:00
Mike Iovine
5bdf997963
Add Llama 4 (#3302)
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-04-09 03:35:21 +08:00
yuxianq
7225bd8b91
chore: Refine attention backend interface. (#3271)
Refine attention backend interface.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-09 02:34:53 +08:00
Zhanrui Sun
7199588796
infra: [TRTLLM-4450] Support more files for pytorch only mode (#3365)
* infra: [TRTLLM-4450] Support more files for pytorch only mode

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Test pytorch only mode

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Revert "Test pytorch only mode"

This reverts commit b32f54d7858bd2432251734bc7b31669147ed94b.

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Fix review

Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-09 01:39:04 +08:00
wili
54ad95eaa8
Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338)
* feat/Variable-Beam-Width-Search-Part3, v1.0

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat/Variable-Beam-Width-Search-Part3, v1.1

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat/Variable-Beam-Width-Search-Part3, v1.2

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

---------

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>
2025-04-08 23:51:27 +08:00
sugunav14
84fc07b011
feat: [TRTLLM-3510] DeepseekV3 support in AutoDeploy (#3281)
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-04-08 21:47:57 +08:00
pcastonguay
02f446a9ff
chore: Adding DS V3-lite tests with overlap + cuda graph (#3342)
* chore: Adding DS V3-lite tests with overlap + cuda graph

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing pre-commit

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-08 09:36:09 -04:00
Zhanrui Sun
63b0194c50
chore: bump version to 0.19.0.dev2025041500 (#3360)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-08 20:45:27 +08:00
Void
316e5c3be3
feat: fix and improve allreduce and fusion kernels (#3064)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-04-08 19:33:52 +08:00
yuxianq
7b03350527
Add thread leak check and fix thread/memory leak issues. (#3270)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-08 19:03:18 +08:00
liji-nv
dca6397d1e
feat: Introduce UB allocator for pytorch flow (#3257)
* Instead of allocating UserBuffers at the beginning of runtime, UB buffers
  are now managed by a global allocator. The allocator dynamically assigns
  a free UB buffer, or allocates a new one, for a torch tensor. This makes
  userbuffers easier to use.

* In the common use case, the UserBuffers are allocated during the warm-up
  stage, so there is no dynamic allocation during inference.

* The UB fusion pattern is rewritten using the new UB allocator. It contains
  the following passes (a sketch of the allocator idea follows this list):

1. Fuse quant with allreduce, replace it with the UB impl, and insert a
   copy_to_userbuffers. The normal allreduce currently does not support
   FP8 quant, so this must be done in the UB pass.
2. Convert all supported allreduces to UB and insert copy_to_userbuffers.
3. Fuse the op before the allreduce with the copy_to_userbuffers, so that
   the op writes directly to the userbuffer.
4. Remove the userbuffers finalize if the output is connected to another UB
   allreduce.
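A minimal Python sketch of the pool-style allocator described above; the class and method names are illustrative assumptions, not the actual TensorRT-LLM API:

```python
import torch


class UBAllocator:
    """Hands out CUDA buffers on demand, reusing freed ones (sketch only)."""

    def __init__(self):
        self._free = []   # free base buffers, available for reuse
        self._live = {}   # id(view) -> base buffer, for buffers in use

    def allocate(self, shape, dtype):
        nbytes = torch.Size(shape).numel() * torch.empty(0, dtype=dtype).element_size()
        # Reuse the first free buffer that is large enough; otherwise allocate
        # a new one. In the common case this branch only runs during warm-up.
        for i, base in enumerate(self._free):
            if base.numel() >= nbytes:
                base = self._free.pop(i)
                break
        else:
            base = torch.empty(nbytes, dtype=torch.uint8, device="cuda")
        view = base[:nbytes].view(dtype).view(shape)
        self._live[id(view)] = base
        return view

    def free(self, view):
        # Return the backing buffer to the pool for later reuse.
        self._free.append(self._live.pop(id(view)))
```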

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-04-08 18:39:49 +08:00
Zhanrui Sun
c692474b59
infra: Fix bot help error when " in bot command (#3314)
* Fix bot help error when " in bot command

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Delete a.txt

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-04-08 18:16:05 +08:00
Chuang Zhu
cdb0906be4
disagg test single h100 (#3353)
2025-04-08 17:45:35 +08:00
amirkl94
e04f6a1b9b
fix: Fix p-tuning test bug (#3326)
* fix: Fix p-tuning test bug

* A change in the vocab_size calculation for T5Tokenizer, introduced in
transformers version 4.34, caused incorrect virtual tokens to be added for
p-tuning: instead of adding tokens outside the vocabulary, tokens inside
the vocabulary were added.
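A hedged illustration of the failure mode described above; the helper and the concrete numbers are assumptions for the example, not the actual test code:

```python
def make_vtoken_ids(vocab_size: int, num_vtokens: int) -> list:
    # Virtual p-tuning tokens must get ids OUTSIDE the real vocabulary,
    # so they hit the prompt-embedding table instead of the word embeddings.
    return list(range(vocab_size, vocab_size + num_vtokens))


# If tokenizer.vocab_size used to include the sentinel/extra ids but stops
# doing so after the library change, the same formula yields ids that
# collide with real tokens:
ids_before = make_vtoken_ids(32100, 4)  # [32100, ...] -> outside vocab, correct
ids_after = make_vtoken_ids(32000, 4)   # [32000, ...] -> inside vocab, wrong
```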

Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-04-08 17:14:00 +08:00
Yan Chunwei
deb876ecdb
clean up trtllm-llmapi-launch logs (#3358)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-08 16:00:59 +08:00
Enwei Zhu
8ee019f8c4
test: Accuracy test improvement (Part 3.4): Move LLaMA tests (#3350)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-08 15:07:57 +08:00
Pengyun Lin
60e02a3684
Use llm.tokenizer in OpenAIServer (#3199)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-04-08 14:55:02 +08:00
Yukun He
c678774c99
feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151)
* Several optimizations and fixes for the AutoTuner.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python-side AutoTuner to the current linear op for the nvFP4 data type.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python-side AutoTuner to the MoE op
* Remove routers from the cache key to improve inference perf
* Prevent unnecessary code profiling. Use the do_preparation keyword to select which part should be executed before evaluating any tactic.
* Remove the try-catch inside the MoE profiling process.
* Move the default-tactic -1-to-0 transform into the cpp runner.
* Revise relevant tests.
* Predefine the bucketing strategy for fused_moe

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Fixing and revising according to reviewer's suggestions.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Use lru_cache for inference perf optimization.
* Revert gen_custom_cache_key feature

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Replace runner with runner id to achieve a serializable cache.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Code clean up and minor fixings.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Move all tunable runners and custom ops into torch_custom_ops.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Treat min_latency_mode as an independent dynamic tensor. Modify get_valid_tactics to support it. (A sketch of the serializable tuning cache follows.)
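A hedged Python sketch of the serializable tuning-cache idea from the bullets above (runner ids instead of runner objects as keys, lru_cache on the hot path, predefined buckets); all names are illustrative assumptions, not the actual AutoTuner code:

```python
import json
from functools import lru_cache


@lru_cache(maxsize=None)        # cheap repeated lookups on the inference path
def bucketize(m: int) -> int:
    """Round a dynamic dimension up to a predefined power-of-two bucket."""
    bucket = 1
    while bucket < m:
        bucket *= 2
    return bucket


class TunerCache:
    """Maps (runner_id, bucket) -> best tactic index.

    Keying by a string runner id rather than the runner object itself keeps
    the cache serializable."""

    def __init__(self):
        self._best = {}

    def record(self, runner_id: str, m: int, tactic: int) -> None:
        self._best[(runner_id, bucketize(m))] = tactic

    def lookup(self, runner_id: str, m: int, default: int = 0) -> int:
        # Fall back to the default tactic (0) on a cache miss.
        return self._best.get((runner_id, bucketize(m)), default)

    def dump(self) -> str:
        # Plain string/int keys serialize directly to JSON.
        return json.dumps({f"{r}:{b}": t for (r, b), t in self._best.items()})
```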

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-08 14:28:36 +08:00
Gabriel Wu
f1655afb0d
feat: enable DeepGEMM by default (#3341)
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
2025-04-08 13:58:57 +08:00
Fanrong Li
62e0876e39
Waive unittest/trt/model/test_mamba.py::TestMamba::test_loaders_mamba_130m_hf_from_checkpoint. Will fix it later. (#3356)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-07 22:36:35 -07:00
MinaHuai
31422e7e46
add tp=2 ci test for vision encoder (#3319)
Signed-off-by: mhuai <mhuai@nvidia.com>
2025-04-07 21:46:08 -07:00
Gabriel Wu
42c8574e93
fix: revert extra cmake var (#3351)
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-08 11:57:16 +08:00
Chuang Zhu
1c88af1378
feat: use cudaMalloc to allocate kvCache (#3303)
2025-04-08 10:59:14 +08:00
Kaiyu Xie
0a4e1d5a55
breaking change: perf: Make ipc_periodically the default responses_handler (#3102)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-08 10:36:39 +08:00
pcastonguay
add5e5cd93
feat: Add option to run disaggregated serving without ctx servers,… (#3243)
* feat: Add option to run disaggregated serving without ctx servers, to benchmark gen only

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing comment in sanity check

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-07 21:56:03 -04:00
Void
efe2ecfb37
fix: runtime error in est_deepseek_allreduce.py (#3226)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-04-08 09:19:47 +08:00
Ivan Sorokin
d40fce474a
fix: redrafter sampling (#3278)
* Fix redrafter sampling

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

* Rename redrafter beam search var

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

* Remove _beam_search_candidates_v0

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

* Remove unused import

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

---------

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>
2025-04-08 07:49:32 +08:00
Enwei Zhu
ba019a43d6
test: Accuracy test improvement (Part 3.3): Move DeepSeek tests (#3260)
add skip

fix

fix

update

update test list

fix qa list

move bf16 to postmerge

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-08 07:19:04 +08:00
Chuang Zhu
f3237e52ed
update readme for disaggregated (#3323)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-07 21:29:15 +08:00
Gabriel Wu
376731013d
feat: use NVRTC for DeepGEMM JIT compilation (#3239)
* feat: use NVRTC for DeepGEMM JIT compilation

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>

* fix: add license

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>

* feat: store NVRTC JIT results in memory by default (sketched below)

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>

* feat: refinement

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>

* feat: refinement

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>

* test: set timeout to 7200

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>

---------

Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
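A hedged sketch of the "store NVRTC JIT results in memory" step, using the cuda-python NVRTC bindings; the cache layout, option strings, and error handling are illustrative assumptions, not DeepGEMM's actual code:

```python
import hashlib

from cuda import nvrtc  # pip install cuda-python

_PTX_CACHE = {}  # source hash -> compiled PTX, kept in memory (no disk cache)


def jit_compile(source: str, arch: str = "compute_90a") -> bytes:
    key = hashlib.sha256(source.encode()).hexdigest()
    if key in _PTX_CACHE:  # hit: skip NVRTC entirely
        return _PTX_CACHE[key]

    err, prog = nvrtc.nvrtcCreateProgram(source.encode(), b"kernel.cu", 0, [], [])
    opts = [f"--gpu-architecture={arch}".encode(), b"-std=c++17"]
    err, = nvrtc.nvrtcCompileProgram(prog, len(opts), opts)
    if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
        raise RuntimeError(f"NVRTC compilation failed: {err}")

    err, size = nvrtc.nvrtcGetPTXSize(prog)
    ptx = b" " * size
    err, = nvrtc.nvrtcGetPTX(prog, ptx)
    _PTX_CACHE[key] = ptx  # remember the result for subsequent calls
    return ptx
```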
2025-04-07 20:29:23 +08:00
YueWeng
aab6214801
test: fix conflicting test names (#3316)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-04-07 20:10:01 +08:00
Yao Yao
3545d59635
Support speculative decoding with Hopper XQA (#3269)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-04-07 17:14:34 +08:00
amitz-nv
e5407ea89a
Fix torch nvsmall through pyexecutor and fix its TP support (#3238)
* Fix NemotronNAS support

Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-04-07 11:53:17 +03:00
pansicheng
ef1ba468a1
feat: support abort disconnected requests (#3214)
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
2025-04-07 16:14:58 +08:00
Yiqing Yan
e232d037a2
chore: Blossom debug hook (#3091)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-04-07 15:47:48 +08:00
Bo Li
515dd0d78f
feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190)
* fp8 kv + bf16 ctx MLA + fp8 gen MLA

Use BF16 for context MLA.
mFP8GenerationMLA and mFP8ContextFMHA shouldn't be enabled together.

Allow mSM==90 for mFP8GenerationMLA==true.
For FMHA, dataTypeKv should be FP8.

For FP8 MLA generation, the output is still in BF16.

Refine debug info for FMHA kernel metadata.

Use inputType, outputType, SM together to hash kernel list.

Add FP8 MLA generation FMHA kernel.

Special WAR of NUM_COMPUTE_GROUPS for MLA generation kernel.

Separate the implementation of fused_multihead_attention_v2.h into a CPP file, and print some debug info if checkIfKernelExist fails.

Refine debug info in fused_multihead_attention_v2.cpp

Correct FP8 MLA metadata.

New kernel provided by Yuxin, which outputs BF16.

The smem size was not set correctly, which would lead to illegal memory access.

Yuxin fixed the error in the FMHA MLA kernel: previously the BF16 output wasn't written correctly; some parts were written repeatedly, while others were left untouched.

There are two bmm1 scales that should be set correctly.

New kernel generated by Yuxin.

Modifications to common/attentionOp for FP8 MLA on Hopper using FMHA.

Not necessary: if mFP8GenerationMLA is set, is_fp8_out is false, so mFP8ContextFMHA is false.

Skip a check in fmhaDispatcher.

Modifications in fmhaRunner:
- Debug dump.
- if (!isFP8GenerationMLA) skips a lot of flag setting.
- TMA descriptor modification for qo (by Yuxin).

Cleanup debug output.

Clean up o tma descriptor modifications.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Resolve conflicts.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Apply the patch of FP8 FlashMLA and resolve conflicts.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Fix compilation error.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Fix compile error.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* pick blackwell support

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* Add copyright notice to fused_multihead_attention_v2.cpp.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Add license.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Add missing license.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Exclude building flashMLA kernels under sm90.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Revert "Exclude building flashMLA kernels under sm90."

This reverts commit f0c859d459.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Use macro to skip compiling FlashMLA for non sm90 targets.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

---------

Signed-off-by: Bo Li <bobboli0202@gmail.com>
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: Dylan Chen <ziqingc@nvidia.com>
Co-authored-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-07 15:14:13 +08:00
QI JUN
a2fad51011
chore: waive a timeout multi-GPU test case (#3310)
* debug CI timeout issue

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* waive timeout case

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-07 14:04:54 +08:00
nv-guomingz
a6a4920b1d
chore: update internal cutlass library base #2981 and #3165. (#3308)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-04-07 13:53:02 +08:00
Shunkangz
62bc13430e
fix: fix attentionDP padding request type (#3299)
* Fix attentionDP padding request type

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>

* Refactor import

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>

---------

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-04-07 13:28:21 +08:00
Fanrong Li
e8b97341de
fix the py_decoding_iter update in decoder. (#3297)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-07 11:18:33 +08:00
brb-nv
017361c26c
test: Waive non-Llama Eagle tests (#3309)
2025-04-07 09:25:41 +08:00
Chuang Zhu
5aeef6d4c7
ucx interface (#3306)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-07 08:44:34 +08:00
Nick Comly
4735b87f1f
L4 added to readme (#3301)
* Add L4 chart

Signed-off-by: Nick Comly <85702008+ncomly-nvidia@users.noreply.github.com>

* Add L4 to readme

Signed-off-by: Nick Comly <85702008+ncomly-nvidia@users.noreply.github.com>

* Add files via upload

Signed-off-by: Nick Comly <85702008+ncomly-nvidia@users.noreply.github.com>

* Update README.md

Signed-off-by: Nick Comly <85702008+ncomly-nvidia@users.noreply.github.com>

* Add files via upload

Signed-off-by: Nick Comly <85702008+ncomly-nvidia@users.noreply.github.com>

* Add files via upload

Signed-off-by: Nick Comly <85702008+ncomly-nvidia@users.noreply.github.com>

---------

Signed-off-by: Nick Comly <85702008+ncomly-nvidia@users.noreply.github.com>
2025-04-06 19:09:28 +08:00
tburt-nv
7a659885e3
chore: remove usernames from comments (#3291)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-05 13:44:28 +08:00
Yan Chunwei
b21cfcfed1
chore: refactor the LlmArgs with Pydantic and migrate remaining pybinding configs to python (#3025)
* make LlmArgs Pydantic

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* amending doc

fix api_stability

fix tests

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* restore yaml groups

refine StackTrace

singleton

clean tests

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix trtllm-bench

fix pytorch

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix serve disagg

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-05 13:31:48 +08:00
Frank
f8a4cc0629
perf: Add total token throughput metric. (#3212)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-05 13:17:59 +08:00