TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Kaiyu Xie	b286b51118	feat: Support torch profiler (#3470 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-14 22:06:06 +08:00
Zhanrui Sun	714ff3eedd	chore: bump version to 0.19.0rc0 (#3535 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-14 18:11:20 +08:00
Zhanrui Sun	ee4ce0379d	chore: bump version to 0.19.0rc0 (#3514 ) * chore: bump version to 0.19.0.rc0 Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> * Update README Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> --------- Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-14 17:32:30 +08:00
dongjiyingdjy	2fb1d65d43	fix: fix max_seq_len in executor_config (#3487 ) Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>	2025-04-14 15:13:29 +08:00
HuiGao-NV	9f41e826bf	fix: remove one duplicated line of code (#3523 ) Signed-off-by: Hui Gao <huig@nvidia.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-14 14:52:46 +08:00
brb-nv	44090a5388	Add support for Phi-4-MM (#3296 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-04-14 14:24:10 +08:00
yuxianq	9d64b6b890	Cache sin cos in model instead of global LRU cache. (#3378 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-14 11:19:09 +08:00
pcastonguay	fe6f14b2b1	fix: Fixing issue with first gen token being returned twice in streaming (#3427 ) * fix: Fixing issue with first gen token being returned twice with streaming Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing not_expectring_strings in test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-13 22:45:09 -04:00
yuxianq	baeec63dda	refactor: Remove _pp_forward. (#3496 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-14 09:49:44 +08:00
HuiGao-NV	d0f83d19f1	fix: add kv memory size per token of draft model to calculate max number of tokens of kv cache (#3497 ) * fix: add kv memory size per token of draft model to calculate max number of tokens of kv cache Signed-off-by: Hui Gao * Fix code to get model_config of draft model Signed-off-by: Hui Gao --------- Signed-off-by: Hui Gao	2025-04-13 23:02:14 +08:00
Yan Chunwei	b37c5c0a4d	make LLM-API slurm examples executable (#3402 ) Signed-off-by: chunweiy <328693+Superjomn@users.noreply.github.com>	2025-04-13 21:42:45 +08:00
Yan Chunwei	74850c61e9	fix: switch ZMQ from file socket to tcp socket in RemoteMpiCommSession (#3462 ) * switch ZMQ from file socket to tcp Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * fix comment Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-04-13 09:15:55 +08:00
WeiHaocheng	c6081abb0e	feat: Make scaffolding Controller more generic #3408 (#3416 ) Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>	2025-04-12 21:35:38 +08:00
QI JUN	012fb9a1c4	remove useless max_num_tokens member in PyTorchConfig (#3493 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-12 21:09:58 +08:00
Robin Kobus	2ab71f9a80	refactor: decoder buffers (#3307 ) * refactor: remove cumLogProbs and logProbs from DecoderBuffers - Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management. - Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities. These modifications enhance code clarity and maintainability by reducing redundancy in buffer management. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched - Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping. - Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width. These changes enhance clarity and maintainability in the decoding process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: integrate DecoderState for sequence length management in decoding process - Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose. - Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management. - Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow. refactor: update SlotDecoderBuffers to manage sequence lengths directly - Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths. - Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy. - Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process. 
These changes improve maintainability and streamline the decoding workflow. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Delegate to asyncSend method in SlotDecoderBuffers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-12 11:41:24 +02:00
yuxianq	29c5085400	fix: Fix PP for llama. (#3449 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-12 17:20:27 +08:00
nv-guomingz	adf60a8723	fix:update the default excluded_modules value for fp8rowwise recipe. (#3477 ) Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>	2025-04-12 16:00:21 +08:00
hlu1	4855431d3d	[Deepseek] Redesign multi-stream API (#3459 ) Signed-off-by: Hao Lu <haolu@nvidia.com>	2025-04-11 23:00:25 -07:00
Iman Tabrizian	3041bbdab3	fix: Fix disagg MTP with overlap (#3406 ) * fix: disagg overlap with MTP Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Review comment Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> --------- Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>	2025-04-12 12:27:24 +08:00
HuiGao-NV	c51e90d7d7	fix: don't perform memory estimation for start_attention (#3485 ) * fix: don't perform memory estimation for start_attention * Enable tests of unittest/_torch/multi_gpu Signed-off-by: Hui Gao <huig@nvidia.com>	2025-04-12 11:34:46 +08:00
Iman Tabrizian	c539750d42	fix: Allow context_and_generation request type in disagg overlap (#3489 ) Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>	2025-04-11 16:15:01 -07:00
QI JUN	d167cbd5bb	refactor: remove ParallelConfig in tensorrt_llm._torch.distributed module (#3370 ) * remove tensorrt_llm._torch.distributed.ParallelConfig Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * clean Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix embedding test Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix comments Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * polish Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * rebase Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-04-11 15:34:20 -07:00
Enwei Zhu	cf9ceea890	test: Add DeepSeek-V3-Lite PP=4 cases (#3454 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-04-12 00:09:12 +08:00
Fridah-nv	ec723fa993	feat:[AutoDeploy] Enhance RoPE support (#3115 ) * add test to map flashinfer rope op with triton custom rope ops and pytorch rope in fused_mha Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * add rope matcher and unit tests Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * capture cos and sin from graph Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * revert fuse_mha op change Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor update to address comment and remove redundant unit test Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * move view and transpose into graph nodes and update unit test to test custom op directly Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * move view into custom op, update bfs with bound, update custom op return type to be half precision Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * custom op update to support 3D input Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * handle bnsd and bsnd format, update tests, handle 3D cos/sin input to the custom op Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * add llama4 rope test, update custom op with is_neox flag Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * add llama4 style rope to matcher and update unit test Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * separate into two transformations Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * fix when num_head != num_kv_head; add support for cached position_ids and cos_sin_cache in graph; update unit tests Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor update, cache locally and propagate meta info of qk nodes Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor: fix cos_sin_cache not float Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor: move cache into matcher Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> --------- Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>	2025-04-11 23:51:24 +08:00
Yukun He	ff82aef99b	Fix the issues related to fused moe path. (#3435 ) * One of the tactic is not supported during dispatch. * final_hidden_states should be unpacked if it is not min_latency_mode. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-11 21:41:15 +08:00
liji-nv	b168adba70	feat: Add NVFP4 UB pattern optimization pass in torch compile (#3371 ) * feat: Add NVFP4 UB pattern optimization pass in torch compile * Add an additional flag for UB fp4 pattern to avoid inverse the scale * Add NVFP4 related UB patterns Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> * Update atol, some points fails for B200 umbriel. Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com> --------- Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>	2025-04-11 21:25:29 +08:00
Shunkangz	ea050084ad	feat: Add support of chat completion in PD (#2985 ) * Add support of chat completion in PD Add support of include_usage in PD Reformat * Remove redundant code Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> * Refactor code Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> * Add chat completion test Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> * Refactor code Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> --------- Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-04-11 17:53:28 +08:00
Yechan Kim	5bc6f093c8	fix: mllama e2e pytorch flow fix (#3397 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-04-11 17:33:15 +08:00
Ivy Zhang	d998832b33	test: add torch flow test case in qa test list (#3404 ) Signed-off-by: Ivy Zhang <yanzh@nvidia.com>	2025-04-11 16:57:41 +08:00
Zhihan Jiang	8300218d21	feat: support llama4 nope layers; support FP8 checkpoint loading; (#3382 ) * Enable NOPE, Fix a rotary embedding bug for gptj_stype_rope, Address PR comment, Properly skip the rotary_embdding for Llama4 ROPE layers * Add support for FP8 checkpoint, Fix ckpt weighting loading for FP8 * Temporarily disable min_latency_mode for llama4 --------- Co-authored-by: Yilin Fan <yilinf@nvidia.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-04-10 10:16:42 -07:00
amitz-nv	a6a2ae6cc1	chore: Rename nvsmall to nemotron nas (#3447 ) * Rename nvsmall to nemotron NAS * Revert nvsmall to nemotron_nas rename in paths in tests that access llm_models_root/nvsmall/tests * Add NemotronNAS to pytorch supported models table Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-04-10 23:16:52 +08:00
wm2012011492	af05749e90	feat: add qwen2 moe to torch flow; fix wrong imported KvCacheConfig in gpqa… (#3369 ) * add qwen2 moe to torch flow; fix wrong imported KvCacheConfig in gpqa_llmapi.py Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com> * fix coding style Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com> * add unittest Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com> --------- Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com> Co-authored-by: mengw <12670782+wm2012011492@users.noreply.github.com>	2025-04-10 22:45:57 +08:00
Yan Chunwei	c5e803ba48	chore: code cleanup for error logging and SharedMemory in proxy.py (#3432 ) * cleanup log Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * remove shared-memory Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * remove ExecutorResponse Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * add assert for postproc Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-04-10 21:57:06 +08:00
HuiGao-NV	3ade9375ba	feat: Run PyExecutor's inference flow to estimate max_num_tokens for kv_cache_manager (#3092 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-04-10 18:29:40 +08:00
yuxianq	16c8f39fc5	feat: Support TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD for debugging (#3417 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-10 13:18:30 +08:00
hlu1	fbcf954d9c	[MLA] Deallocate tensors after use (#3286 ) Signed-off-by: Hao Lu <haolu@nvidia.com>	2025-04-09 21:36:07 -07:00
brb-nv	c59abae436	feat: Add Gemma3 text-only model support (#3247 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-04-10 12:34:58 +08:00
Frank	9307ff95ae	fix: Add nested aliases for Llama 4 (#3381 ) * Add nested aliases for Llama 4 Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Fix missed alias. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> --------- Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>	2025-04-10 10:18:53 +08:00
Yechan Kim	943218b54a	feat: Add Qwen2.5-VL and refactor Qwen2-VL (#3156 ) * feat: Add Qwen2.5-VL and refactor Qwen2-VL Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix yapf and codespell Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * add test Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix test_e2e Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * generalize get_rope_index Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix qwen2.5-vl on REAME Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix test Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix image test Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> --------- Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-04-10 04:09:03 +08:00
Maximiliano Levi	996696203f	fix: #3137 speculative decoding and multimodal input support (#3276 ) * fix: broadcast embeddings input when using speculative decoding Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com> * fix: use shape tensor instead of tuple Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com> * fix: comment Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com> --------- Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-04-09 23:40:19 +08:00
danielafrimi	47f5cf6c0d	lora_tests (#3201 ) LoRA tests and layers Signed-off-by: Ubuntu <dafrimi@nvidia.com> Co-authored-by: Ubuntu <dafrimi@nvidia.com>	2025-04-09 18:06:52 +03:00
WeiHaocheng	6eee15900e	feat: Enhance the integrated robustness of scaffolding with __init__.py #3305 (#3312 ) Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>	2025-04-09 21:13:47 +08:00
Tracin	2a2b7bfc66	Fix miss bias add for FP4Linear. (#3361 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-09 09:17:54 +08:00
Mike Iovine	5bdf997963	Add Llama 4 (#3302 ) Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-04-09 03:35:21 +08:00
yuxianq	7225bd8b91	chore: Refine attention backend interface. (#3271 ) Refine attention backend interface. Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-09 02:34:53 +08:00
wili	54ad95eaa8	Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338 ) * feat/Variable-Beam-Width-Search-Part3, v1.0 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat/Variable-Beam-Width-Search-Part3, v1.1 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat/Variable-Beam-Width-Search-Part3, v1.2 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> --------- Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>	2025-04-08 23:51:27 +08:00
sugunav14	84fc07b011	feat: [TRTLLM-3510] DeepseekV3 support in AutoDeploy (#3281 ) Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>	2025-04-08 21:47:57 +08:00
Zhanrui Sun	63b0194c50	chore: bump version to 0.19.0.dev2025041500 (#3360 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-08 20:45:27 +08:00
yuxianq	7b03350527	Add thread leak check and fix thread/memory leak issues. (#3270 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-08 19:03:18 +08:00
liji-nv	dca6397d1e	feat: Introduce UB allocator for pytorch flow (#3257 ) * Instead of allocating UserBuffers at beginning of runtime, UB buffers are now managed with global allocator. The allocator will dynamically assign free UB buffer or allocate new buffer for torch tensor. It makes userbuffers easier to use. * In common usecase, the Userbuffers will be allocated correctly during warm up stage. There is no dynamic allocation during inference. * UB fusion pattern is rewroten using the new UB Allocator. It contains following passes: 1. Fuse Quant with allreduce, replace with UB impl, and insert a copy_to_userbuffers. Currently the normal allreduce still does not support FP8 quant. So this need to be done in UB pass 2. Convert all supported allreduce with UB and insert copy_to_userbuffers. 3. Fuse op before ar with the copy_to_userbuffers. So the op directly writes to the userbuffer 4. Remove userbuffers finalize if the output is connect to another UB allreduce. Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-04-08 18:39:49 +08:00

205 Commits