TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-26 05:32:57 +08:00

Author	SHA1	Message	Date
Chuang Zhu	1333f4f5d5	remove cache_transceiver_prealloc_size (#4153 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-12 11:53:53 +08:00
Mike Iovine	4b8ba7ad61	[fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue (#4069 ) [fix] Fix llama 4 test lists Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-09 22:45:14 +08:00
chenfeiz0326	ffc13bd325	Cherry-pick: Use multi-threading to load MoE expert weights (#4137 ) * Use multi-threading to load MoE expert weights Signed-off-by: Po-Han Huang <pohanh@nvidia.com> * Update code formatting Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Update code formatting Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> --------- Signed-off-by: Po-Han Huang <pohanh@nvidia.com> Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Co-authored-by: Po-Han Huang <pohanh@nvidia.com>	2025-05-09 17:29:24 +08:00
Fanrong Li	0cf0fce5d3	[fix] Fix add_dummy_requests for spec decoding cases (#4084 ) * fix add_dummy_requests. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * add max_seq_len to eagle3 test and fix add_dummy_requests. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix prompt_len in add_dummy_requests. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * add prepare_resource condition in add_dummy_requests. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * add some description of token_nums to add_dummy_requests and fix token_nums in torch compile warmup. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix available_tokens. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> --------- Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-05-09 16:52:51 +08:00
Fanrong Li	77f8e43592	[fix] Fix relaxed acceptance to support enabling it in context phase (#4126 ) * fix relaxed acceptance to support enable this feature in context phase. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix sample_and_accept_draft_tokens unit test. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> --------- Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-05-09 14:11:14 +08:00
Yukun He	c9cac432dc	chore: Fix pipeline break caused by previous PR (#4081 ) rebase + pipeline reuse (#4169 ) Fix import break caused by rebase. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-09 12:51:02 +08:00
Erin	cdf5ae1547	fix: change pp broadcast pattern for LPs (#4130 ) Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-08 20:07:13 -07:00
Yi Zhang	91bf5e6a8e	[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804 ) Add Piecewise CUDA Graph Support Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-05-09 11:04:01 +08:00
Yukun He	5b61486d87	chore: Clean up the legacy DeepseekAllreudceFusionOp. (#4081 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-09 10:20:41 +08:00
bhsueh_NV	700d09ab65	[TRTLLM-5147][Qwen3] fix: fix bug of attention dp on qwen3_moe model (#4141 ) * fix bug of attention dp on qwen3 Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix pre-commit changes Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of attention dp 8 Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-09 09:29:39 +08:00
dongxuy04	7147efb2e8	fix: alltoall padding for chunked MoE (#4157 ) fix alltoall padding for chunked MoE Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-05-09 09:01:35 +08:00
Mike Iovine	9afe510367	[fix] Fix llama4 + eagle3 (#3998 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-08 19:20:27 -04:00
Lucas Liebenwein	48ed38a2ac	[fix] [AutoDeploy] flashinfer usage on H100 (#4162 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-05-09 06:00:57 +08:00
chenfeiz0326	7f5716ef83	Cherry-pick trtllm-gen from feat/llama4 to main (#4086 ) * feat: TRT-LLM Gen FP8 MoE Llama4 Signed-off-by: Nikita Korobov <nkorobov@nvidia.com> * feat: TRT-LLM Gen llama4 MoE Top1 routing Signed-off-by: Jiqun Tu <jtu@nvidia.com> * feat: add per tensor FP8 TRT-LLM Gen GEMMs Signed-off-by: Nikita Korobov <nkorobov@nvidia.com> * Update Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Update Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Add guard for routingIndicesClusterKernel Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Guard sm90+ for routingkernels Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Guard sm90+ for routingkernels Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> --------- Signed-off-by: Nikita Korobov <nkorobov@nvidia.com> Signed-off-by: Jiqun Tu <jtu@nvidia.com> Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Co-authored-by: Nikita Korobov <nkorobov@nvidia.com> Co-authored-by: Jiqun Tu <jtu@nvidia.com>	2025-05-08 14:13:01 -07:00
Yukun He	bb7bcc75c2	feat: Fallback to NCCL for various patterns when input size is large. (#4080 ) * Fallback to NCCL for various patterns when input size is large. Move the previous implementation to cpp side. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Revising. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> --------- Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-08 11:13:13 -07:00
shaharmor98	7d94c9561f	feat: support multi lora adapters and TP (#3885 ) * support multi lora, tp Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-08 23:45:45 +08:00
Yuan Tong	5b93273156	feat: adopt new logprob definition in PyTorch flow (#4057 ) feat: align logprob definition of PyTorch flow Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>	2025-05-08 20:16:40 +08:00
Tracin	b0dd581e6b	Fix TP8 for NVFP4 kv duplication. (#4143 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-05-08 17:30:02 +08:00
zihaok	81cc60a0fd	[feat/] enable attention DP in Llama4 maverick model - part 1 (#4065 ) * add feature cosmetic changes Signed-off-by: Zihao Kong <zihaok@nvidia.com> address precommit fix cosmetic Signed-off-by: Zihao Kong <zihaok@nvidia.com> * add feature Signed-off-by: Zihao Kong <zihaok@nvidia.com> * fix bug Signed-off-by: Zihao Kong <zihaok@nvidia.com> * address comments Signed-off-by: Zihao Kong <zihaok@nvidia.com> * remove WAR Signed-off-by: Zihao Kong <zihaok@nvidia.com> * fix format precommit Signed-off-by: Zihao Kong <zihaok@nvidia.com> * Update tensorrt_llm/_torch/models/modeling_llama.py Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com> Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com> --------- Signed-off-by: Zihao Kong <zihaok@nvidia.com> Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-05-08 05:06:40 +08:00
hlu1	26a2679217	[Deepseek] Refactor Deepseek Decoder layer (#4016 ) Refactor Deepseek Decoder layer Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-08 01:43:10 +08:00
rakib-hasan	bf9ac96de3	Adding option to specify a set of token ids for multimodal tokens (#4107 ) Signed-off-by: Rakib Hasan <rhasan@nvidia.com>	2025-05-07 12:15:41 +08:00
bhsueh_NV	f670a036df	[Qwen3] chore: fix bug of fused_moe on tp > 1 (#4093 ) * fix bug of fused_moe on tp > 1 Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * refine codes Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-07 11:06:37 +08:00
Daniel Cámpora	c56a2aca46	fix: Properly get decoding mode according to same logic as cpp. (#4026 ) * Properly get decoding mode according to same logic as cpp. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Cross reference getDecodingMode implementations in pytorch - cpp. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Better bindings for DecodingMode. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Revert to version in main. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Revert configuration.py. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-06 21:53:17 +08:00
Robin Kobus	72057a0a64	[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625 ) * disable overlap in encoder Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: invokeGatherBatch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: overlap same batch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: add enableTrtOverlap to ExecutorConfig Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * disable overlap for beam search and spec decode Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * skip overlap tests with beam search or speculative decoding Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * enable overlap in GptChunkedLongContextTests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Enable overlap in gptManagerBenchmark Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Improve early exit Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use OptionalRef for newOutputTokens tensor Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Add overlap scheduling support to TRTLLMDecoder - Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter. - Modified the decoder's internal logic to utilize the overlap scheduling feature. - Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach. - Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder. 
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: allNewTokens in PP Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-06 15:06:46 +02:00
HuiGao-NV	5a4794b387	fix: skip add new slot if request has slot 0 (#3991 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-05-06 07:46:39 +02:00
Suyog Gupta	ac2ab9ba36	[AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy (#4024 ) * reuse batch_indices, positions across layers Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> * fix flashinfer unit tests Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> * simplify call to get_batch_indices_positions Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> * fix call to get_batch_indices_positions Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> --------- Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>	2025-05-06 10:46:36 +08:00
bhsueh_NV	e053cb651b	Fix: fix bug of qwen3 moe (#4058 ) * fix bug of qwen3 moe Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * update threshold Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-06 08:20:15 +08:00
Daniel Cámpora	aa980dc92f	fix: instantiate decoder early in pytorch (#4029 ) * Instantiate decoder early to have better mem estimation. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Improve mem estimation by instantiating decoder earlier. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-05 10:31:53 +02:00
yuxianq	017701343e	fix: apply rope twice in Qwen3. (#4040 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-05 15:12:45 +08:00
Yukun He	aa38e28cfa	fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. (#3988 ) * Fix AllReduce kernel hang issue when both tp and pp are enabled. Allocate one workspace for each pp rank to avoid potential race. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * update waive list Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> --------- Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-05 11:33:25 +08:00
yuxianq	266fef88f2	feat: support to trace executor loop. (#3983 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-05 10:26:33 +08:00
qixiang-99	bf4f7ad744	feat: add Pytorch support of Vision Encoder for multimodal models (#3791 ) * feat: Add rename_weights_with_regex function for dynamic weight key renaming Introduced a new utility function to rename weight keys in a dictionary based on regex pattern matching. This allows for flexible mapping of keys from Hugging Face naming conventions to TRT-LLM naming conventions, enhancing model compatibility and usability. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Implement SiglipVisionModel and related components Added the SiglipVisionModel along with its associated classes, including SiglipAttention, SiglipEncoderLayer, and SiglipEncoder. Additionally, a new test suite for the SiglipVisionModel has been created to ensure compatibility with Hugging Face outputs. Currently SiglipVisionModel support batch size larger than one. Also, inputs and outputs shape are same with the HF for compatibility. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Add CLIPVisionModel and associated components Introduced the CLIPVisionModel along with its related classes, including CLIPAttention, CLIPEncoderLayer, CLIPEncoder, and CLIPVisionTransformer. This implementation aligns with Hugging Face's CLIP architecture, ensuring compatibility in input and output shapes. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Enhance CLIPVisionModel with attention metadata preparation and unit tests Updated the CLIPVisionModel to include a method for preparing attention metadata, simplifying the model's usage. Additionally, added a comprehensive unit test suite for the CLIPVisionModel, ensuring compatibility with Hugging Face outputs and validating model performance across various scenarios. 
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Refactor SiglipVisionModel with attention metadata preparation and update unit tests Enhanced the SiglipVisionModel by adding a method to prepare attention metadata, streamlining its usage. Updated unit tests to validate model performance and compatibility with Hugging Face outputs, including adjustments to the configuration and test scenarios. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Remove unused rotary_emb parameter from CLIP and Siglip attention classes Eliminated the rotary_emb parameter from the CLIPAttention and SiglipAttention classes to streamline the code. Updated unit tests to reflect changes in the model configurations, including clarifications in the default configurations sourced from Hugging Face. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Integrate CLIPVisionModel into LlavaNextInputProcessor and enhance weight loading Added CLIPVisionModel to the LlavaNextInputProcessor for improved vision processing. Updated the model loading mechanism to ensure compatibility with the new vision model and added attention metadata preparation. Removed debug print statements from weight renaming function for cleaner code. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Remove unused max_position_embeddings from CLIPAttention and update Siglip classes to use CLIP components Removed the unused max_position_embeddings variable from the CLIPAttention class. Updated the Siglip classes to utilize CLIP components, specifically replacing SiglipEncoder and SiglipAttention with their CLIP counterparts, streamlining the codebase and enhancing consistency across models. 
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Consolidate weight loading logic into a shared implementation Refactored the weight loading process across CLIP and Siglip models by using a new utility function, _load_weights_impl, to streamline the loading mechanism. This change enhances code maintainability and reduces redundancy in weight handling, ensuring consistent behavior across different model architectures. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Simplify output handling in CLIP and Siglip models by removing output_hidden_states parameter Removed the output_hidden_states parameter from the CLIPEncoder and SiglipVisionTransformer classes, streamlining the output handling process. Updated the corresponding unit tests to reflect these changes and ensure compatibility with the new output structure. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Enhance LlavaNextInputProcessor with dynamic model loading and memory optimization Updated the LlavaNextInputProcessor to support dynamic model loading from local paths or Hugging Face, improving memory efficiency by partially loading the model components. Integrated the LlavaNextMultiModalProjector and adjusted weight loading to ensure compatibility with the new architecture. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> --------- Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-03 05:13:47 +08:00
Daniel Cámpora	cb2c1cc829	[https://nvbugs/5248923 ] fix: Correctly sizes seqslotmanager considering pp. (#3984 ) * Correctly sizes seqslotmanager considering pp. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Adapt order. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-02 20:06:32 +02:00
Daniel Cámpora	c7cf032b89	fix: Move all casters to customCasters. (#3945 ) * Move all casters to customCasters. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use customCasters in all bindings. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Added customCasters to userbuffers. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-02 19:08:28 +08:00
hlu1	52edabab30	Fix Deepseek MTP with moe_backend=TRTLLM (#4001 ) Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-02 14:47:22 +08:00
Simeng Liu	873c7532fd	feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438 ) * feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together: ```python input_a, input_b, ... = group_rms_norm([input_a, input_b, ...]) ``` All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead. This MR provides two implementations: GroupRMSNormKernel: Optimized for small-to-medium batch sizes GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface. Signed-off-by: Simeng Liu <simengl@nvidia.com> * Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE Signed-off-by: Simeng Liu <simengl@nvidia.com> --------- Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-05-02 13:25:30 +08:00
Lucas Liebenwein	be916b19e0	feat: [AutoDeploy] unfusing attention for native support (#3668 ) * [AutoDeploy] unfused streamlined attention + caching Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * improved unit testing Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * reviewer feedback Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * some updates to attn_mask handling Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * updated manual benchmarking and cudagraph capture Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-05-02 09:06:49 +08:00
Yukun He	a1645c922b	Fallback to NCCL for various patterns when input size is large. (#4009 ) When input size is larger than the max workspace size, we shall fallback to NCCL + corresponding pre/post function to ensure the functionality of AllReduce. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-01 15:17:16 -07:00
Erin	8fe7bdeacf	feat: LogitsProcessor in PyTorch backend (#3145 ) * support lp in pytorch backend Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> * fix tp Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> --------- Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-01 14:15:30 -07:00
Suyog Gupta	f94af0fb86	[AutoDeploy] Make all ranks agree on kv-cache size (#4007 ) * make all ranks agree on kv-cache size Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * minor cleanups Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * use all_gather_object wrapper Signed-off-by: Suyog Gupta <suyogg@nvidia.com> --------- Signed-off-by: Suyog Gupta <suyogg@nvidia.com>	2025-05-02 04:07:28 +08:00
bhsueh_NV	129bf19980	model: support Qwen3 (#4010 ) * add qwen3 dense model pytorch backend support, initial commit solve the results error issue add qwen3 moe model pytorch backend support reformat the code * perf - use flash_infer rmsnorm for qwen3 * feat - support qwen3 moe rmsnorm * Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B. * Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B. -- Forgot to update all modifications. * fix bugs of running qwen3 public models and fp8 models Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bugs due to rebase Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bugs captured by pre-commit Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of attention Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Co-authored-by: Keddy Jin <jin.gq@aliyun.com> Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com> Co-authored-by: shao <shao@nvidia.com>	2025-05-01 23:12:41 +08:00
YueWeng	b1621e8d4e	feat: add relaxed acceptance for DS (#3865 ) * add relaxed acceptance for DS R1 Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * clean and update docs Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * Modified based on review Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix mtp manager issue Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> --------- Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-05-01 21:50:36 +08:00
milesial	6ded5f984b	Llama4 processor fixes (#3994 ) * fix: Propagate sampling params Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> * fix: type hints Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> --------- Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-01 12:45:53 +08:00
Kate Cheng	7dbe618683	feat: Add multimodal embedding field in LlmRequest (#3855 ) * Add a new param to LlmRequest and Request to natively support mm Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * update comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Update tests to match the new LlmRequest constructor parameters Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Modify unitTest and modify mm_embeding's dict name in llama4 Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix based on comments Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix LlmRequest initialization in kvCacheManagerTest Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up code for promt_tuning_config Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up prompt_tuning_config in GenerationRequest Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> --------- Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-01 12:23:30 +08:00
Yukun He	9cc5922a0b	Clean up allreduce op in Deepseek V3 model. (#3829 ) * Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op. * Minor revision of moe_allreduce op argument names. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-01 07:56:36 +08:00
Mike Iovine	8c2c969fcb	[fix] Pad requests to maximum draft length in spec decode (#3957 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-30 11:02:18 -04:00
Julien Debache	83670571dd	feat: Mistral-Large-2 support in the Pytorch workflow - Added modelling file for models configured by a `MistralConfiguration` object as it is slightly different from the Llama one	2025-04-30 20:12:39 +08:00
Fanrong Li	e6b482ef47	fix: change the seq_lens sync copy to an async one (#3786 ) --------- Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-04-29 23:56:49 +08:00
tomeras91	35010e8073	Support NemotronH FP8 Quantization (1) match quant exclude modules names to TRTLLM names (2) No need for any special weight loading for quantization scales weights (#3891) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-04-29 18:51:43 +03:00
yuxianq	0f8ec693b2	fix: get head_dim from model’s config. (#3916 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-29 23:04:29 +08:00

199 Commits