TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Yukun He	c9cac432dc	chore: Fix pipeline break caused by previous PR (#4081 ) rebase + pipeline reuse (#4169 ) Fix import break caused by rebase. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-09 12:51:02 +08:00
Mike Iovine	d80dc40135	[nvbug/5262268][fix] Fix trtllm-bench for llama 4 (#4104 ) [fix] Fix trtllm-bench for llama 4 Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>	2025-05-08 21:27:57 -07:00
Erin	cdf5ae1547	fix: change pp broadcast pattern for LPs (#4130 ) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-08 20:07:13 -07:00
Yi Zhang	91bf5e6a8e	[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804 ) Add Piecewise CUDA Graph Support Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-05-09 11:04:01 +08:00
Yukun He	5b61486d87	chore: Clean up the legacy DeepseekAllreudceFusionOp. (#4081 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-09 10:20:41 +08:00
bhsueh_NV	700d09ab65	[TRTLLM-5147][Qwen3] fix: fix bug of attention dp on qwen3_moe model (#4141 ) * fix bug of attention dp on qwen3 Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix pre-commit changes Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of attention dp 8 Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-09 09:29:39 +08:00
pcastonguay	836c142e1b	[feat] Allow overriding cli args with yaml file in trtllm-serve (#4164 ) feat: Allow overriding cli args with yaml file in trtllm-serve Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-05-08 21:19:05 -04:00
dongxuy04	7147efb2e8	fix: alltoall padding for chunked MoE (#4157 ) fix alltoall padding for chunked MoE Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-05-09 09:01:35 +08:00
forrestl	9477661f4c	Support RingAttention in the BertAttention plugin and the DiT model (#3661 ) support ring attn for bert_attention plugin and dit model Signed-off-by: ChunhuanLin <lch_xdu@163.com>	2025-05-09 08:06:54 +08:00
Mike Iovine	9afe510367	[fix] Fix llama4 + eagle3 (#3998 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-08 19:20:27 -04:00
Frank	57afbf6b79	Fix incorrect conversion. (#4112 ) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-05-09 06:34:52 +08:00
Lucas Liebenwein	48ed38a2ac	[fix] [AutoDeploy] flashinfer usage on H100 (#4162 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-05-09 06:00:57 +08:00
chenfeiz0326	7f5716ef83	Cherry-pick trtllm-gen from feat/llama4 to main (#4086 ) * feat: TRT-LLM Gen FP8 MoE Llama4 Signed-off-by: Nikita Korobov <nkorobov@nvidia.com> * feat: TRT-LLM Gen llama4 MoE Top1 routing Signed-off-by: Jiqun Tu <jtu@nvidia.com> * feat: add per tensor FP8 TRT-LLM Gen GEMMs Signed-off-by: Nikita Korobov <nkorobov@nvidia.com> * Update Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Update Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Add guard for routingIndicesClusterKernel Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Guard sm90+ for routingkernels Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Guard sm90+ for routingkernels Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> --------- Signed-off-by: Nikita Korobov <nkorobov@nvidia.com> Signed-off-by: Jiqun Tu <jtu@nvidia.com> Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Co-authored-by: Nikita Korobov <nkorobov@nvidia.com> Co-authored-by: Jiqun Tu <jtu@nvidia.com>	2025-05-08 14:13:01 -07:00
Yukun He	bb7bcc75c2	feat: Fallback to NCCL for various patterns when input size is large. (#4080 ) * Fallback to NCCL for various patterns when input size is large. Move the previous implementation to cpp side. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Revising. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> --------- Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-08 11:13:13 -07:00
shaharmor98	7d94c9561f	feat: support multi lora adapters and TP (#3885 ) * support multi lora, tp Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-08 23:45:45 +08:00
Yuan Tong	5b93273156	feat: adopt new logprob definition in PyTorch flow (#4057 ) feat: align logprob definition of PyTorch flow Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>	2025-05-08 20:16:40 +08:00
Enwei Zhu	74df12bbaa	[TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval (#3946 ) * fix formula Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * update doc Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * 1st version Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * polish Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-05-08 19:35:23 +08:00
Tracin	b0dd581e6b	Fix TP8 for NVFP4 kv dupilcation. (#4143 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-05-08 17:30:02 +08:00
zihaok	81cc60a0fd	[feat/] enable attention DP in Llama4 maverick model - part 1 (#4065 ) * add feature cosmetic changes Signed-off-by: Zihao Kong <zihaok@nvidia.com> address precommit fix cosmetic Signed-off-by: Zihao Kong <zihaok@nvidia.com> * add feature Signed-off-by: Zihao Kong <zihaok@nvidia.com> * fix bug Signed-off-by: Zihao Kong <zihaok@nvidia.com> * address comments Signed-off-by: Zihao Kong <zihaok@nvidia.com> * remove WAR Signed-off-by: Zihao Kong <zihaok@nvidia.com> * fix format precommit Signed-off-by: Zihao Kong <zihaok@nvidia.com> * Update tensorrt_llm/_torch/models/modeling_llama.py Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com> Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com> --------- Signed-off-by: Zihao Kong <zihaok@nvidia.com> Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-05-08 05:06:40 +08:00
hlu1	26a2679217	[Deepseek] Refactor Deepseek Decoder layer (#4016 ) Refactor Deepseek Decoder layer Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-08 01:43:10 +08:00
Pengyun Lin	721f84a0ac	fix: Align default setting & remove unnecessary check for chat and completion (#3888 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-05-07 14:42:53 +08:00
Yan Chunwei	0c26059703	chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732 ) * beam_width and max_new_token Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * remove beam_width Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * remove min_length Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * remove return_num_sequences Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-05-07 13:20:25 +08:00
rakib-hasan	bf9ac96de3	Adding option to specify a set of token ids for multimodal tokens (#4107 ) Signed-off-by: Rakib Hasan <rhasan@nvidia.com>	2025-05-07 12:15:41 +08:00
bhsueh_NV	f670a036df	[Qwen3] chore: fix bug of fused_moe on tp > 1 (#4093 ) * fix bug of fused_moe on tp > 1 Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * refine codes Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-07 11:06:37 +08:00
Enwei Zhu	c28b90984f	[TRTLLM-3925, https://nvbugs/5245262 ] [fix] Normalize LLM.generate API (#3985 ) * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-05-07 11:06:23 +08:00
Kaiyu Xie	52d4302dda	bench: TRTLLM-4936 Port benchmark_serving.py (#4011 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>	2025-05-07 09:45:14 +08:00
milesial	001e666fc5	fix: Pass local dir to processor creation (#4018 ) Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-06 12:25:04 -07:00
Erin	cba1793cda	cleanup logprob params (#4039 ) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-07 00:50:16 +08:00
Daniel Cámpora	c56a2aca46	fix: Properly get decoding mode according to same logic as cpp. (#4026 ) * Properly get decoding mode according to same logic as cpp. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Cross reference getDecodingMode implementations in pytorch - cpp. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Better bindings for DecodingMode. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Revert to version in main. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Revert configuration.py. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-06 21:53:17 +08:00
Robin Kobus	72057a0a64	[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625 ) * disable overlap in encoder Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: invokeGatherBatch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: overlap same batch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: add enableTrtOverlap to ExecutorConfig Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * disable overlap for beam search and spec decode Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * skip overlap tests with beam search or speculative decoding Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * enable overlap in GptChunkedLongContextTests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Enable overlap in gptManagerBenchmark Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Improve early exit Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use OptionalRef for newOutputTokens tensor Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Add overlap scheduling support to TRTLLMDecoder - Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter. - Modified the decoder's internal logic to utilize the overlap scheduling feature. - Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach. - Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder. 
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: allNewTokens in PP Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-06 15:06:46 +02:00
yuxianq	b6cfe08c52	fix: Fix NVLink version decoding. (#3996 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-06 13:56:50 +08:00
HuiGao-NV	5a4794b387	fix: skip add new slot if request has slot 0 (#3991 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-05-06 07:46:39 +02:00
Suyog Gupta	ac2ab9ba36	[AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy (#4024 ) * reuse batch_indices, positions across layers Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> * fix flashinfer unit tests Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> * simplify call to get_batch_indices_positions Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> * fix call to get_batch_indices_positions Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> --------- Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>	2025-05-06 10:46:36 +08:00
bhsueh_NV	e053cb651b	Fix: fix bug of qwen3 moe (#4058 ) * fix bug of qwen3 moe Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * update threshold Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-06 08:20:15 +08:00
pansicheng	e84dc6b3c7	feat: add deepseek-r1 reasoning parser to trtllm-serve (#3354 ) * add deepseek-r1 reasoning parser Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com> * fix test Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com> --------- Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com> Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com> Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-05-06 08:13:04 +08:00
Iman Tabrizian	85867d76dd	test: Add disaggregated serving accuracy tests (#4036 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-05-05 08:56:59 -07:00
Daniel Cámpora	aa980dc92f	fix: instantiate decoder early in pytorch (#4029 ) * Instantiate decoder early to have better mem estimation. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Improve mem estimation by instantiating decoder earlier. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-05 10:31:53 +02:00
yuxianq	017701343e	fix: apply rope twice in Qwen3. (#4040 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-05 15:12:45 +08:00
Yukun He	aa38e28cfa	fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. (#3988 ) * Fix AllReduce kernel hang issue when both tp and pp are enabled. Allocate one workspace for each pp rank to avoid potential race. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * update waive list Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> --------- Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-05 11:33:25 +08:00
yuxianq	266fef88f2	feat: support to trace executor loop. (#3983 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-05 10:26:33 +08:00
qixiang-99	bf4f7ad744	feat: add Pytorch support of Vision Encoder for multimodal models (#3791 ) * feat: Add rename_weights_with_regex function for dynamic weight key renaming Introduced a new utility function to rename weight keys in a dictionary based on regex pattern matching. This allows for flexible mapping of keys from Hugging Face naming conventions to TRT-LLM naming conventions, enhancing model compatibility and usability. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Implement SiglipVisionModel and related components Added the SiglipVisionModel along with its associated classes, including SiglipAttention, SiglipEncoderLayer, and SiglipEncoder. Additionally, a new test suite for the SiglipVisionModel has been created to ensure compatibility with Hugging Face outputs. Currently SiglipVisionModel support batch size larger than one. Also, inputs and outputs shape are same with the HF for compatibility. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Add CLIPVisionModel and associated components Introduced the CLIPVisionModel along with its related classes, including CLIPAttention, CLIPEncoderLayer, CLIPEncoder, and CLIPVisionTransformer. This implementation aligns with Hugging Face's CLIP architecture, ensuring compatibility in input and output shapes. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Enhance CLIPVisionModel with attention metadata preparation and unit tests Updated the CLIPVisionModel to include a method for preparing attention metadata, simplifying the model's usage. Additionally, added a comprehensive unit test suite for the CLIPVisionModel, ensuring compatibility with Hugging Face outputs and validating model performance across various scenarios. 
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Refactor SiglipVisionModel with attention metadata preparation and update unit tests Enhanced the SiglipVisionModel by adding a method to prepare attention metadata, streamlining its usage. Updated unit tests to validate model performance and compatibility with Hugging Face outputs, including adjustments to the configuration and test scenarios. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Remove unused rotary_emb parameter from CLIP and Siglip attention classes Eliminated the rotary_emb parameter from the CLIPAttention and SiglipAttention classes to streamline the code. Updated unit tests to reflect changes in the model configurations, including clarifications in the default configurations sourced from Hugging Face. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Integrate CLIPVisionModel into LlavaNextInputProcessor and enhance weight loading Added CLIPVisionModel to the LlavaNextInputProcessor for improved vision processing. Updated the model loading mechanism to ensure compatibility with the new vision model and added attention metadata preparation. Removed debug print statements from weight renaming function for cleaner code. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Remove unused max_position_embeddings from CLIPAttention and update Siglip classes to use CLIP components Removed the unused max_position_embeddings variable from the CLIPAttention class. Updated the Siglip classes to utilize CLIP components, specifically replacing SiglipEncoder and SiglipAttention with their CLIP counterparts, streamlining the codebase and enhancing consistency across models. 
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Consolidate weight loading logic into a shared implementation Refactored the weight loading process across CLIP and Siglip models by using a new utility function, _load_weights_impl, to streamline the loading mechanism. This change enhances code maintainability and reduces redundancy in weight handling, ensuring consistent behavior across different model architectures. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Simplify output handling in CLIP and Siglip models by removing output_hidden_states parameter Removed the output_hidden_states parameter from the CLIPEncoder and SiglipVisionTransformer classes, streamlining the output handling process. Updated the corresponding unit tests to reflect these changes and ensure compatibility with the new output structure. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Enhance LlavaNextInputProcessor with dynamic model loading and memory optimization Updated the LlavaNextInputProcessor to support dynamic model loading from local paths or Hugging Face, improving memory efficiency by partially loading the model components. Integrated the LlavaNextMultiModalProjector and adjusted weight loading to ensure compatibility with the new architecture. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> --------- Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-03 05:13:47 +08:00
Daniel Cámpora	cb2c1cc829	[https://nvbugs/5248923 ] fix: Correctly sizes seqslotmanager considering pp. (#3984 ) * Correctly sizes seqslotmanager considering pp. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Adapt order. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-02 20:06:32 +02:00
Daniel Cámpora	c7cf032b89	fix: Move all casters to customCasters. (#3945 ) * Move all casters to customCasters. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use customCasters in all bindings. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Added customCasters to userbuffers. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-02 19:08:28 +08:00
hlu1	52edabab30	Fix Deepseek MTP with moe_backend=TRTLLM (#4001 ) Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-02 14:47:22 +08:00
Simeng Liu	873c7532fd	feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438 ) * feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together: ```python input_a, input_b, ... = group_rms_norm([input_a, input_b, ...]) ``` All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead. This MR provides two implementations: GroupRMSNormKernel: Optimized for small-to-medium batch sizes GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface. Signed-off-by: Simeng Liu <simengl@nvidia.com> * Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE Signed-off-by: Simeng Liu <simengl@nvidia.com> --------- Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-05-02 13:25:30 +08:00
Lucas Liebenwein	be916b19e0	feat: [AutoDeploy] unfusing attention for native support (#3668 ) * [AutoDeploy] unfused streamlined attention + caching Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * improved unit testing Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * reviewer feedback Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * some updates to attn_mask handling Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * updated manual benchmarking and cudagraph capture Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-05-02 09:06:49 +08:00
Yukun He	a1645c922b	Fallback to NCCL for various patterns when input size is large. (#4009 ) When input size is larger than the max workspace size, we shall fallback to NCCL + corresponding pre/post function to ensure the functionality of AllReduce. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-01 15:17:16 -07:00
Erin	8fe7bdeacf	feat: LogitsProcessor in PyTorch backend (#3145 ) * support lp in pytorch backend Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> * fix tp Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> --------- Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-01 14:15:30 -07:00
Suyog Gupta	f94af0fb86	[AutoDeploy] Make all ranks agree on kv-cache size (#4007 ) * make all ranks agree on kv-cache size Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * minor cleanups Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * use all_gather_object wrapper Signed-off-by: Suyog Gupta <suyogg@nvidia.com> --------- Signed-off-by: Suyog Gupta <suyogg@nvidia.com>	2025-05-02 04:07:28 +08:00
Erin	83f37614ef	feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388 ) * support return logprob in llmapi Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> update and add test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> stability test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> * revert removal of old flag Signed-off-by: Erin Ho <erinh@nvidia.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> --------- Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Erin Ho <erinh@nvidia.com>	2025-05-01 12:47:14 -04:00

368 Commits
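
Note on commit 873c7532fd (group_rms_norm): the commit message describes normalizing several tensors that share a batch dimension in one call. The following is a minimal pure-Python sketch of what that call is assumed to compute, not the CUDA kernel or its actual op signature. It assumes each input is normalized independently along its own last dimension, omits the learned weight that RMSNorm layers normally apply, and uses hypothetical names (`rms_norm`, `group_rms_norm_reference`) and a hypothetical `eps` default.

```python
import math


def rms_norm(rows, eps=1e-6):
    """Divide each row by its root mean square over the last dimension."""
    out = []
    for row in rows:
        rms = math.sqrt(sum(x * x for x in row) / len(row) + eps)
        out.append([x / rms for x in row])
    return out


def group_rms_norm_reference(inputs, eps=1e-6):
    """Hypothetical reference: normalize every input in the group.

    Each input is a list of rows (batch x hidden). Hidden sizes may differ
    between inputs, but the batch size must be the same for all of them.
    """
    batch = len(inputs[0])
    if any(len(t) != batch for t in inputs):
        raise ValueError("all inputs must share the same batch dimension")
    return [rms_norm(t, eps) for t in inputs]


# Two inputs with the same batch size (2) and different last dimensions.
a = [[3.0, 4.0], [1.0, 1.0]]
b = [[1.0, 2.0, 2.0], [0.0, 0.0, 5.0]]
out_a, out_b = group_rms_norm_reference([a, b])
```

After the call, every row of `out_a` and `out_b` has a mean square close to 1, which is the property a fused kernel would be expected to preserve while avoiding one launch per tensor.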