TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-21 18:25:20 +08:00

Author	SHA1	Message	Date
Robin Kobus	72057a0a64	[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625 ) * disable overlap in encoder Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: invokeGatherBatch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: overlap same batch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: add enableTrtOverlap to ExecutorConfig Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * disable overlap for beam search and spec decode Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * skip overlap tests with beam search or speculative decoding Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * enable overlap in GptChunkedLongContextTests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Enable overlap in gptManagerBenchmark Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Improve early exit Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use OptionalRef for newOutputTokens tensor Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Add overlap scheduling support to TRTLLMDecoder - Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter. - Modified the decoder's internal logic to utilize the overlap scheduling feature. - Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach. - Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder. 
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: allNewTokens in PP Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-06 15:06:46 +02:00
yuxianq	b6cfe08c52	fix: Fix NVLink version decoding. (#3996 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-06 13:56:50 +08:00
HuiGao-NV	5a4794b387	fix: skip add new slot if request has slot 0 (#3991 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-05-06 07:46:39 +02:00
Suyog Gupta	ac2ab9ba36	[AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy (#4024 ) * reuse batch_indices, positions across layers Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> * fix flashinfer unit tests Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> * simplify call to get_batch_indices_positions Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> * fix call to get_batch_indices_positions Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> --------- Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>	2025-05-06 10:46:36 +08:00
bhsueh_NV	e053cb651b	Fix: fix bug of qwen3 moe (#4058 ) * fix bug of qwen3 moe Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * update threshold Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-06 08:20:15 +08:00
pansicheng	e84dc6b3c7	feat: add deepseek-r1 reasoning parser to trtllm-serve (#3354 ) * add deepseek-r1 reasoning parser Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com> * fix test Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com> --------- Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com> Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com> Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-05-06 08:13:04 +08:00
Iman Tabrizian	85867d76dd	test: Add disaggregated serving accuracy tests (#4036 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-05-05 08:56:59 -07:00
Daniel Cámpora	aa980dc92f	fix: instantiate decoder early in pytorch (#4029 ) * Instantiate decoder early to have better mem estimation. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Improve mem estimation by instantiating decoder earlier. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-05 10:31:53 +02:00
yuxianq	017701343e	fix: apply rope twice in Qwen3. (#4040 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-05 15:12:45 +08:00
Yukun He	aa38e28cfa	fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. (#3988 ) * Fix AllReduce kernel hang issue when both tp and pp are enabled. Allocate one workspace for each pp rank to avoid potential race. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * update waive list Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> --------- Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-05 11:33:25 +08:00
yuxianq	266fef88f2	feat: support to trace executor loop. (#3983 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-05 10:26:33 +08:00
qixiang-99	bf4f7ad744	feat: add Pytorch support of Vision Encoder for multimodal models (#3791 ) * feat: Add rename_weights_with_regex function for dynamic weight key renaming Introduced a new utility function to rename weight keys in a dictionary based on regex pattern matching. This allows for flexible mapping of keys from Hugging Face naming conventions to TRT-LLM naming conventions, enhancing model compatibility and usability. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Implement SiglipVisionModel and related components Added the SiglipVisionModel along with its associated classes, including SiglipAttention, SiglipEncoderLayer, and SiglipEncoder. Additionally, a new test suite for the SiglipVisionModel has been created to ensure compatibility with Hugging Face outputs. Currently, SiglipVisionModel supports batch sizes larger than one. Input and output shapes also match the HF implementation for compatibility. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Add CLIPVisionModel and associated components Introduced the CLIPVisionModel along with its related classes, including CLIPAttention, CLIPEncoderLayer, CLIPEncoder, and CLIPVisionTransformer. This implementation aligns with Hugging Face's CLIP architecture, ensuring compatibility in input and output shapes. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Enhance CLIPVisionModel with attention metadata preparation and unit tests Updated the CLIPVisionModel to include a method for preparing attention metadata, simplifying the model's usage. Additionally, added a comprehensive unit test suite for the CLIPVisionModel, ensuring compatibility with Hugging Face outputs and validating model performance across various scenarios. 
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Refactor SiglipVisionModel with attention metadata preparation and update unit tests Enhanced the SiglipVisionModel by adding a method to prepare attention metadata, streamlining its usage. Updated unit tests to validate model performance and compatibility with Hugging Face outputs, including adjustments to the configuration and test scenarios. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Remove unused rotary_emb parameter from CLIP and Siglip attention classes Eliminated the rotary_emb parameter from the CLIPAttention and SiglipAttention classes to streamline the code. Updated unit tests to reflect changes in the model configurations, including clarifications in the default configurations sourced from Hugging Face. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Integrate CLIPVisionModel into LlavaNextInputProcessor and enhance weight loading Added CLIPVisionModel to the LlavaNextInputProcessor for improved vision processing. Updated the model loading mechanism to ensure compatibility with the new vision model and added attention metadata preparation. Removed debug print statements from weight renaming function for cleaner code. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Remove unused max_position_embeddings from CLIPAttention and update Siglip classes to use CLIP components Removed the unused max_position_embeddings variable from the CLIPAttention class. Updated the Siglip classes to utilize CLIP components, specifically replacing SiglipEncoder and SiglipAttention with their CLIP counterparts, streamlining the codebase and enhancing consistency across models. 
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Consolidate weight loading logic into a shared implementation Refactored the weight loading process across CLIP and Siglip models by using a new utility function, _load_weights_impl, to streamline the loading mechanism. This change enhances code maintainability and reduces redundancy in weight handling, ensuring consistent behavior across different model architectures. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * refactor: Simplify output handling in CLIP and Siglip models by removing output_hidden_states parameter Removed the output_hidden_states parameter from the CLIPEncoder and SiglipVisionTransformer classes, streamlining the output handling process. Updated the corresponding unit tests to reflect these changes and ensure compatibility with the new output structure. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * feat: Enhance LlavaNextInputProcessor with dynamic model loading and memory optimization Updated the LlavaNextInputProcessor to support dynamic model loading from local paths or Hugging Face, improving memory efficiency by partially loading the model components. Integrated the LlavaNextMultiModalProjector and adjusted weight loading to ensure compatibility with the new architecture. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> --------- Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-03 05:13:47 +08:00
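The regex-based weight key renaming described in the commit above can be illustrated with a minimal sketch. The function name `rename_keys_with_regex_sketch`, its signature, and the example pattern are assumptions made for illustration; they are not taken from the TensorRT-LLM source.

```python
import re
from typing import Any, Dict


def rename_keys_with_regex_sketch(
    pattern_mapping: Dict[str, str], weights: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a new dict whose keys are rewritten by the first fully matching pattern.

    Hypothetical helper: keys that match no pattern are copied unchanged.
    """
    compiled = [(re.compile(p), repl) for p, repl in pattern_mapping.items()]
    renamed: Dict[str, Any] = {}
    for key, value in weights.items():
        new_key = key
        for regex, replacement in compiled:
            if regex.fullmatch(key):
                # Backreferences such as \1 carry over the parts of the key
                # that should stay the same.
                new_key = regex.sub(replacement, key)
                break
        renamed[new_key] = value
    return renamed


# Hypothetical mapping and weight names, for illustration only.
example_mapping = {r"(.*)\.self_attn\.out_proj\.(.*)": r"\1.self_attn.o_proj.\2"}
example_weights = {
    "encoder.layers.0.self_attn.out_proj.weight": 1,
    "encoder.layers.0.mlp.fc1.weight": 2,
}
renamed_weights = rename_keys_with_regex_sketch(example_mapping, example_weights)
```

Using full-key matching keeps a pattern from rewriting an unrelated key that merely contains the same substring; whether the real utility makes the same choice is not stated in the commit message.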
Daniel Cámpora	cb2c1cc829	[https://nvbugs/5248923 ] fix: Correctly sizes seqslotmanager considering pp. (#3984 ) * Correctly sizes seqslotmanager considering pp. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Adapt order. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-02 20:06:32 +02:00
Daniel Cámpora	c7cf032b89	fix: Move all casters to customCasters. (#3945 ) * Move all casters to customCasters. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use customCasters in all bindings. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Added customCasters to userbuffers. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-02 19:08:28 +08:00
hlu1	52edabab30	Fix Deepseek MTP with moe_backend=TRTLLM (#4001 ) Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-02 14:47:22 +08:00
Simeng Liu	873c7532fd	feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438 ) * feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together: `input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])`. All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead. This MR provides two implementations: GroupRMSNormKernel (optimized for small-to-medium batch sizes) and GroupRMSNormKernelLargeBatch (contains additional optimizations for large batch sizes). Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface. Signed-off-by: Simeng Liu <simengl@nvidia.com> * Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE Signed-off-by: Simeng Liu <simengl@nvidia.com> --------- Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-05-02 13:25:30 +08:00
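As a reading aid for the group_rms_norm commit above, here is a plain-Python reference of what normalizing several inputs together could compute. It is a sketch under assumptions: the name `group_rms_norm_reference`, the `eps` default, and the omission of learned scale weights are illustrative choices, and nothing here reflects how the CUDA kernels partition work across warp groups.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # shape: [batch, hidden]


def rms_norm_rows(x: Matrix, eps: float = 1e-5) -> Matrix:
    """Scale each row by 1 / sqrt(mean(row**2) + eps)."""
    out: Matrix = []
    for row in x:
        mean_sq = sum(v * v for v in row) / len(row)
        scale = 1.0 / math.sqrt(mean_sq + eps)
        out.append([v * scale for v in row])
    return out


def group_rms_norm_reference(inputs: Sequence[Matrix], eps: float = 1e-5) -> List[Matrix]:
    """Normalize several inputs in one call (hypothetical reference, no weights).

    All inputs must share the batch dimension; the last dimension may differ.
    """
    batch = len(inputs[0])
    if any(len(x) != batch for x in inputs):
        raise ValueError("all inputs must share the same batch dimension")
    return [rms_norm_rows(x, eps) for x in inputs]


# Two inputs with the same batch size (1) but different hidden sizes.
input_a = [[3.0, 4.0]]
input_b = [[1.0, 1.0, 1.0, 1.0]]
out_a, out_b = group_rms_norm_reference([input_a, input_b], eps=0.0)
```

Each input is normalized independently over its own last dimension; the grouping only changes how many launches are needed, not the per-tensor result.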
Lucas Liebenwein	be916b19e0	feat: [AutoDeploy] unfusing attention for native support (#3668 ) * [AutoDeploy] unfused streamlined attention + caching Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * improved unit testing Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * reviewer feedback Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * some updates to attn_mask handling Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * updated manual benchmarking and cudagraph capture Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-05-02 09:06:49 +08:00
Yukun He	a1645c922b	Fall back to NCCL for various patterns when input size is large. (#4009 ) When the input size is larger than the max workspace size, fall back to NCCL plus the corresponding pre/post function to preserve AllReduce functionality. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-01 15:17:16 -07:00
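The size-based fallback in the commit above can be pictured as a small dispatch decision. The sketch below is hypothetical: `AllReducePlan`, `choose_allreduce_plan`, and the backend labels are invented for illustration and do not correspond to identifiers in the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllReducePlan:
    backend: str                     # "custom" (workspace-based kernel) or "nccl"
    run_pre_post_separately: bool    # True when fused pre/post work must run on its own


def choose_allreduce_plan(
    input_bytes: int, max_workspace_bytes: int, has_fused_pattern: bool
) -> AllReducePlan:
    """Pick a backend for one AllReduce call (illustrative only).

    If the input does not fit in the workspace, use NCCL and run whatever
    pre/post function the fused pattern would have covered as a separate step.
    """
    if input_bytes > max_workspace_bytes:
        return AllReducePlan(backend="nccl", run_pre_post_separately=has_fused_pattern)
    return AllReducePlan(backend="custom", run_pre_post_separately=False)


small = choose_allreduce_plan(1 << 20, 8 << 20, has_fused_pattern=True)
large = choose_allreduce_plan(64 << 20, 8 << 20, has_fused_pattern=True)
```

Under these assumptions, `small` stays on the workspace-based path while `large` is routed to NCCL with the pre/post step split out.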
Erin	8fe7bdeacf	feat: LogitsProcessor in PyTorch backend (#3145 ) * support lp in pytorch backend Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> * fix tp Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> --------- Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-01 14:15:30 -07:00
Suyog Gupta	f94af0fb86	[AutoDeploy] Make all ranks agree on kv-cache size (#4007 ) * make all ranks agree on kv-cache size Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * lint Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * minor cleanups Signed-off-by: Suyog Gupta <suyogg@nvidia.com> * use all_gather_object wrapper Signed-off-by: Suyog Gupta <suyogg@nvidia.com> --------- Signed-off-by: Suyog Gupta <suyogg@nvidia.com>	2025-05-02 04:07:28 +08:00
Erin	83f37614ef	feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388 ) * support return logprob in llmapi Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> update and add test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> stability test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> * revert removal of old flag Signed-off-by: Erin Ho <erinh@nvidia.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> --------- Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Erin Ho <erinh@nvidia.com>	2025-05-01 12:47:14 -04:00
bhsueh_NV	129bf19980	model: support Qwen3 (#4010 ) * add qwen3 dense model pytorch backend support, initial commit solve the results error issue add qwen3 moe model pytorch backend support reformat the code * perf - use flash_infer rmsnorm for qwen3 * feat - support qwen3 moe rmsnorm * Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B. * Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B. -- Forgot to update all modifications. * fix bugs of running qwen3 public models and fp8 models Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bugs due to rebase Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bugs captured by pre-commit Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of attention Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Co-authored-by: Keddy Jin <jin.gq@aliyun.com> Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com> Co-authored-by: shao <shao@nvidia.com>	2025-05-01 23:12:41 +08:00
YueWeng	b1621e8d4e	feat: add relaxed acceptance for DS (#3865 ) * add relaxed acceptance for DS R1 Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * clean and update docs Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * Modified based on review Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix mtp manager issue Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> --------- Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-05-01 21:50:36 +08:00
milesial	6ded5f984b	Llama4 processor fixes (#3994 ) * fix: Propagate sampling params Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> * fix: type hints Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> --------- Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-01 12:45:53 +08:00
Kate Cheng	7dbe618683	feat: Add multimodal embedding field in LlmRequest (#3855 ) * Add a new param to LlmRequest and Request to natively support mm Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * update comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Update tests to match the new LlmRequest constructor parameters Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Modify unitTest and modify mm_embedding's dict name in llama4 Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix based on comments Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix LlmRequest initialization in kvCacheManagerTest Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up code for prompt_tuning_config Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up prompt_tuning_config in GenerationRequest Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> --------- Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-01 12:23:30 +08:00
Frank	1e317c98c6	[feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. (#3776 ) * Move world options to a different group for clarity. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Add eos_id option. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> --------- Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-05-01 09:42:46 +08:00
Yukun He	9cc5922a0b	Clean up allreduce op in Deepseek V3 model. (#3829 ) * Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op. * Minor revision of moe_allreduce op argument names. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-01 07:56:36 +08:00
Mike Iovine	8c2c969fcb	[fix] Pad requests to maximum draft length in spec decode (#3957 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-30 11:02:18 -04:00
Julien Debache	83670571dd	feat: Mistral-Large-2 support in the Pytorch workflow - Added modelling file for models configured by a `MistralConfiguration` object as it is slightly different from the Llama one	2025-04-30 20:12:39 +08:00
Zhanrui Sun	86e7474a9b	chore: bump version to 0.20.0rc2 (#3949 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-30 11:44:43 +08:00
yuxianq	f568cbb671	chore: Remove duplicated get_sm_version. (#3935 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-30 11:43:53 +08:00
Fanrong Li	e6b482ef47	fix: change the seq_lens sync copy to an async one (#3786 ) --------- Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-04-29 23:56:49 +08:00
tomeras91	35010e8073	Support NemotronH FP8 Quantization (1) match quant exclude modules names to TRTLLM names (2) No need for any special weight loading for quantization scales weights (#3891) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-04-29 18:51:43 +03:00
yuxianq	0f8ec693b2	fix: get head_dim from model’s config. (#3916 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-29 23:04:29 +08:00
HuiGao-NV	8e6eead6a5	refactor: (part1) Add constraints doc for fusedMoe module. (#3882 ) * Add doc string for FusedMoe module * Address comments. Signed-off-by: Hui Gao <huig@nvidia.com>	2025-04-29 22:23:02 +08:00
Junhong Liu	06e76020d7	feat: parallel q_b_proj and concat (#3917 ) * add parallel_q_b_proj_and_concat Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com> * code cleanup Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com> * one gemm/concat and then split the latent_cache and pass them separately to context/gen Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com> --------- Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>	2025-04-29 22:07:05 +08:00
Dom Brown	8709fe8b53	chore: bump version to 0.19.0 (#3598 ) (#3841 ) test: add test cases for 0.19 release (#3608) * fix test name * add quickstart test for nemotron-ultra * add rcca multi-node test case for deepseek-v3 * add rcca info --------- squash (#3642) fix: nvbugs/5187237: fix deterministic mode crash (#3448) * nvbugs/5187237 nvbugs/5112075: fix deterministic mode error * remove waive * Revert "remove waive" This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac. * revert ar fusion --------- update fp8 doc (#3647) tests: change qa perf test to trtllm-bench (#3619) fix: FP8 quantized lm_head (NvBug 5214229) (#3567) infra: Add PR approval protection for the release branch (#3634) fix: nvbugs/5231298: pytorch allreduce issue (#3673) Fix: nvbugs/5222698 variable not defined (#3630) * Fix: nvbugs/5222698 variable not defined * Tidy code --------- test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685) test:restore fp8 kv cache testing for L0 (#3671) doc: Update DeepSeek perf docs (#3693) * Update DeepSeek perf docs * update * Apply suggestions from code review --------- tests: waive test_llm_multi_node (#3664) fix: update test_user_buffers_mm_add_prologue atol (#3711) Fix: cherry-pick hmac encryption from main branch (#3635) * security fix cherry-pick changes from main * fix hmac in remote mpi session (#3649) --------- Un-waive DS-V3-Lite tests. (#3621) fix: FP8 kv accuracy (#3675) * fix FP8 kv accuracy * update doc --------- Fix script options for engines. (#3622) unwaive multi-node test (#3721) chore : Split more tests out of gpt tests (#3524) (#3674) doc:add torch examples link into torch backend documentation (#3749) test: Get Eagle tests working (#3593) (#3722) Waive L0 test (#3756) waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656) Update ds v3 parameters in stress test. 
(#3676) waive gemma on L20 (#3766) https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758) Include Qwen2VLDecoderLayer in the smooth_qwen2_model function. fix: PP4 fixes and cleanup (#3688) remove benchmark test list (#3643) skip disagg deepseek test if sm!=90 (#3720) test: skip failed cases on B200 (#3710) * add skip condition to tests * fix error --------- test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718) * skip_pre_ada for fp8 cases * update * update after rebase --------- add know issue to deepseek doc. (#3800) Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761) Waive L0 tests (#3826) fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793) * Reduce memory usage in fused moe op associated with AutoTuning. * Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens. * Add free_memory logic of workspace in min_latency_mode fused moe path. * Fix fused_moe fallback issue. (#3652) min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression. --------- [doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797) Fix pre-commit Fix again Address some review comments for the MI Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-29 16:57:22 +08:00
bhsueh_NV	2e230b73ec	change log level of some text from info to debug (#3930 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-29 13:38:34 +08:00
yuxianq	adfa04745e	fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-29 11:26:13 +08:00
bhsueh_NV	0610d0ff84	add num_scheduled_requests into print_log (#3914 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-29 11:22:22 +08:00
Frank	cf15efa15e	[TRTLLM-4883][fix]: Update output speed calculation. (#3923 ) * Update gen tps calculation. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Add back output speed for comparison. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Fix issue with f-string. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Fix some spacing. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Replace output speed with per-request genphase tput. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Add gen TPS breakdown. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Update some tagging. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> --------- Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-04-29 11:04:12 +08:00
Perkz Zheng	35c5e4f1c5	feat: add CGA reduction fmha kernels on Blackwell. (#3763 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add trtllm-gen kernels for eagle3 and also kernels with cga-reduction Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * address the comments Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-04-29 10:43:54 +08:00
hlu1	d2f312b8e4	Fix fp8 kvcache (#3877 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-04-29 10:31:10 +08:00
WeiHaocheng	8a994d879f	feat: fix errors on scaffolding README (#3899 ) Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>	2025-04-29 10:15:06 +08:00
yuxianq	b91da764de	chore: remove DummyKvCacheManager. (#3896 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-29 09:59:37 +08:00
Mike Iovine	e534bf09cc	[fix] Fix flashinfer + speculation issues (#3686 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-28 14:34:22 -04:00
Mike Iovine	e6f7ff3a46	[chore] Make llama4 MoE use maybe_execute_in_parallel (#3779 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-28 10:58:03 -04:00
Zhenhuan Chen	ad15e45f07	[TRTLLM-4638 ][feat] add best of n support with reward model in scaffolding (#3807 ) Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>	2025-04-28 17:15:33 +08:00
bhsueh_NV	f77252e9ff	fix bug of create cuda stream as default parameter which will be init… (#3764 ) * fix bug of create cuda stream as default parameter which will be initialized during importing Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * add torch.cuda.Stream() for the leader node Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix pre-commit issue Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-28 08:16:03 +08:00
Yan Chunwei	ad4226d946	fix: trtllm-bench build trt engine on slurm (#3825 ) * add submit_sync to RemoteMpiSessionClient Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> add barrier Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> fix comment Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> disable test Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * fix Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-04-27 22:26:23 +08:00
bhsueh_NV	76f2c631fb	fix: add warmup flag into py_executor to prevent enable profiler during wa… (#3852 ) * add warmup flag into py_executor to prevent enable profiler during warmup Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of pre-commit Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * change setting warmup to all ranks Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-27 19:22:42 +08:00
Chuang Zhu	e2318756ed	cacheTransceiver buffer manager (#3798 ) * cacheTransceiver buffer manager Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * fix args Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * cpp kvCacheManager Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * format Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-27 11:48:15 +08:00
HuiGao-NV	136aab5c54	fix: Update num_of_ctx_tokens in iteration stats (#3785 ) * Update num_of_ctx_tokens in iteration stats * Revert unnecessary change of importing module	2025-04-27 10:24:47 +08:00
bhsueh_NV	e9fab4f3d9	fix bug of deepseek group_size setting (#3860 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-27 09:10:37 +08:00
yuxianq	e6c14ca97a	fix: Detect pmix and raise error when mpirun is not used. (#3858 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-26 21:49:41 +08:00
milesial	362a8272f8	feat: llama4 input processor (#3383 ) Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com> Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-04-25 16:47:14 -07:00
sugunav14	5b9897a8cd	fix: [AutoDeploy] update hf loading for e_score_correction_bias (#3847 ) Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>	2025-04-26 02:03:47 +08:00
dongxuy04	16535991b2	feat: Add MNNVL MoE A2A support (#3504 ) * add MNNVL memory mapping support Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add more MPI environment for trtllm-llmapi-launch Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add MoE communication and prepare kernels Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add MNNVL AlltoAll support for DeepSeekV3 Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add output dump for throughput benchmark Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * support dynamic kernel launch grid Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * address review comments Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * address review comments #2 Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> --------- Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-04-25 17:29:08 +08:00
Yuan Tong	57944206ba	feat: return logits in PyTorch flow (#3221 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-24 16:56:03 -07:00
hlu1	d72add1794	[Deepseek] Pass hidden_states_fp4 to shared_experts (#3819 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-24 13:12:12 -07:00
HuiGao-NV	7420ddc3d0	fix: fix lora case failure (#3838 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-04-24 07:29:08 -07:00
WeiHaocheng	3fc2a16920	feat(part 2): Enhance the integrated robustness of scaffolding with __init__.py #3305 (#3731 ) Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>	2025-04-24 18:47:03 +08:00
Zhanrui Sun	ae34d60108	chore: bump version to 0.20.0rc1 (#3834 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-24 17:43:37 +08:00
hlu1	cd2bcdc1a9	Fix create_weights in attention (#3692 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-24 07:30:00 +08:00
Kaiyu Xie	dfbcb543ce	doc: fix path after examples migration (#3814 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-24 02:36:45 +08:00
Daniel Cámpora	1299f27c74	fix: Fix C++ decoder synchronization in PyTorch (#3106 ) * Use updateDecoderBuffers in python decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix synchronize in trtllm decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Enable by default. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use guided_decoder to setup seqslots and free them. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use always decode_async and update_requests. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update decoder buffers. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix speculative decoding tests. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Send new_tensors_host instead of assuming dict. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make default False in enable_trtllm_decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Partially fix mtp, partially fix py_executor. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update request states before sending disagg ctx cache. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix disagg test for torch decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make isend_tensor_list and recv_tensor_list for sending the tensors_host. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix rebase. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add disagg serving case to guided decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Get overlap scheduling to work. 
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update cutlass to main. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update after rebasing. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update to use decode async and update requests. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Properly pass information to update_requests Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make disaggregated serving a step closer to working. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix rebase. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix rebase and format. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Copy new device tokens more pythonic. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Restore MTP add dummy reqs. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add ordereddict import to py_executor. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Added seq slot manager. Add test. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use transmission for single tensor except when list of tensors is received. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add TRTLLMDecoder allocation to estimate max kv cache tokens. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add stream synchronization Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. 
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Format Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add decoder creation to estimate max kv. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update submodule UCXX inline with main. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-04-23 23:55:27 +08:00
Mike Iovine	0bc520f15e	fix: Limit llama4 context length to 8k (#3778 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-23 08:55:10 -07:00
shaharmor98	49262a62a5	add passing E2E LoRA flow (#3788 ) Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-04-23 18:38:06 +03:00
Enwei Zhu	a51b3cf7a6	[TRTLLM-4763][test] Accuracy test improvement (Part 3.6): Deprecate mmlu_llmapi.py (#3802 ) * cleanup mmlu_llmapi.py Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * polish Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-04-23 23:05:13 +08:00
Fanrong Li	bc1c4ddcb5	fix: remove the unnecessary metadata changes in mtp. (#3787 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-04-23 16:01:28 +08:00
Zongfei Jing	1e5af736ea	Add smart router for moe (#3641 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-23 12:21:59 +08:00
shaharmor98	5fff8f0935	Add running E2E LoRA flow (#3648 ) * add passing E2E LoRA flow Signed-off-by: Shahar Mor <smor@nvidia.com> * add experimental feature Signed-off-by: Shahar Mor <smor@nvidia.com> * fix llma_args definition Signed-off-by: Shahar Mor <smor@nvidia.com> * decreased manually size of max loras to address OOM Signed-off-by: Shahar Mor <smor@nvidia.com> --------- Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-04-23 11:19:41 +08:00
Alessio Netti	4728256bb6	chore: Move cv2 import inside load_video() function (#3768 ) Signed-off-by: Netti, Alessio <netti.alessio@gmail.com>	2025-04-22 21:48:35 +02:00
Lucas Liebenwein	06b914e0f9	feat: [AutoDeploy] generalizing cudagraph to multiple dynamic inputs (#3589 ) * generalizing cudagraph to multiple dynamic inputs Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * fix for failing test Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-04-23 03:38:51 +08:00
Xianjie Qiao	ba4131f176	Add log_level for disaggregated_mpi_worker (#3765 ) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>	2025-04-22 09:14:46 -07:00
Zongfei Jing	7eee9a9d28	doc: Update doc for Deepseek min latency (#3717 ) * Tidy code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Update doc for min latency deepseek Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Throw exception for RouterKernel when not running on sm90+ Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-22 23:07:59 +08:00
Yukun He	0ae7017342	Unify two versions of AllReduce custom op (#3032 ) * Rewrite unit test for unified allreduce op. Removing the legacy unit test. * Revise formats, fusion_op bindings. Put all tensors as optional inputs. * Move the MoeAllreduceOp to a separate custom op. * Move all the fusion patterns to the new version of the AllReduce fusion kernel. Remove the AllReduce strategy config. Revise the AllReduce strategies and fusion pattern definitions. * Add more TODOs, fixing minor bugs, and remove legacy code. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-22 21:58:42 +08:00
bhsueh_NV	b87f26ee2a	chore: remove useless allgather (#3751 ) * remove useless allgather Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix pre-commit issue Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-22 21:26:22 +08:00
Enwei Zhu	353699a3b3	fix: fnmatch usage in modeling_utils.py (#3754 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-04-22 13:13:53 +08:00
Yi Zhang	98966cb45e	test: Unwaive Llama 3.1 with torch compile test (#3475 ) * Fix log info Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> * Revert "test: Waive torch compile tests (#3471)" This reverts commit `410f56357e`. Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> * Update test_llm_api_pytorch.py Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> --------- Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-04-22 10:41:56 +08:00
Kaiyu Xie	a32389b4cd	fix: Remove unnecessary max call (#3574 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-22 10:33:50 +08:00
Enwei Zhu	3fa19ffa4e	test [TRTLLM-4477,TRTLLM-4481]: Accuracy test improvement (Part 3.5): Support GSM8K and GPQA (#3483 ) * add gsm8k Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix gsm8k Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * add gpqa Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * conditional import lm_eval Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * gpqa in lm_eval Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * system prompt Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * shuffle Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * update AA prompt and regex Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * revert AA prompt and regex Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * integration to tests Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * add DS-R1 Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix and clean Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * update tests Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * update Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * clean up Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * free_gpu_memory_fraction=0.8 Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-04-22 07:38:16 +08:00
bhsueh_NV	0c07d4dc21	Fix/executor bugs (#3681 ) * fix bugs of py executor Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bugs of py executor Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * revert changes about mpi_barrier() Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-22 07:23:27 +08:00
Kaiyu Xie	943f3ff8f6	Revert "Report number of context tokens in one iteration (#3691 )" (#3740 ) This reverts commit `e0446a4dc0`. Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-22 01:21:43 +08:00
Iman Tabrizian	af04b6f6aa	bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095 ) * Fix hang bug when KV cache is low Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Review comments Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Fix attentiondp typo Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Add CI test for this case Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * fix: Fix the insertion order for responder futures Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * fix: Fix disagg CPP Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> --------- Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-04-21 15:16:55 +08:00
katec846	eeb605abd6	feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380 ) * Feat: Offload ptable to cpu if enable_chunk_context Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Feat: offload ptable to cpu for chunk context mode Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix and add comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Update Readme for multimodal and add a new param mm_embedding_offloading Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * fix: Correct prompt table offloading condition in PromptTuningBuffers Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up the code Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Add commits to explain copy from cpu <-> gpu using pinned memory Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix namings based on comments Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix format based on precommit Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Modify --mm_embedding_offloading flag Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> --------- Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-04-21 14:31:01 +08:00
yuxianq	faef37782a	fix: Remove ParallelConfig. (#3678 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-21 14:14:08 +08:00
HuiGao-NV	e0446a4dc0	Report number of context tokens in one iteration (#3691 ) Report number of context tokens in one iteration	2025-04-21 13:45:28 +08:00
yuxianq	591f3d2be8	fix: Support TLLM_OVERRIDE_LAYER_NUM for llama4. (#3679 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-21 12:28:56 +08:00
hlu1	31624b079a	feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387 ) * Add TRT-LLM Gen MOE to Deepseek fix fused moe rebase bug. Fix atol in test_fp4_gemm_quantize.py fix fused moe rebase bug. Fix FusedMoe. Disable 2nd routing kernel preexit Bump routing reduction to fp32 Disable PDL for fc1 [DEBUG] Lift token limit to 16k [Bugfix] Token limit to 16k + fp32 routing + tanh Make fp8 tileN 8 Fix FP8 MoE + Remove redundant temp output for FP4 [FP8-only] Avoid wasting CTAs for activation kernel fix: unblock FP8 weightloading with trtllm-gen Remove max_token limit for trtllm-gen path perf: avoid type-conversion and fill_ from aten Minor fix Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix rebase issues Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix compile issue Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * CI clean Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Hao Lu <haolu@nvidia.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-21 10:01:33 +08:00
hlu1	17eba98445	Refactor Deepseek tp_size calculation (#3695 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-19 23:55:19 -07:00
brb-nv	c35d2a7532	test: Get Eagle tests working (#3593 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-04-20 00:50:57 +08:00
yuxianq	5346f53250	feat: Introduce feature properties for attention backend. (#3659 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-19 12:37:27 +08:00
hlu1	c861b6cf17	Clean up modeling_deepseek.py (#3640 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-18 17:54:33 -07:00
Yechan Kim	5460d18b10	feat: trtllm-serve multimodal support (#3590 ) * feat: trtllm-serve multimodal support Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * remove disable argument Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * remove disable Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * add and separate tests and move the doc Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * remove block_reuse arg from serve.py Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> --------- Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-04-19 05:01:28 +08:00
pcastonguay	ae5671644a	feat: Disaggregated router class (#3584 ) * Add draft scheduler class Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> * Refactor the design Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> * feat: Introduce router class for disaggregated server Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Add unit tests for router class Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Adding tests for disagg_utils Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing missing import Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing disagg integration tests Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Addressing MR review comments Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-04-19 00:34:12 +08:00
Zheng Duan	bce7ea8c38	test: add kv cache event tests for disagg workers (#3602 )	2025-04-18 18:30:19 +08:00
Yan Chunwei	2a09826ec4	fix hmac in remote mpi session (#3649 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-04-18 17:47:51 +08:00
HuiGao-NV	d3608d6818	Remove dummy forward path (#3669 ) Remove dummy forward path	2025-04-18 16:17:50 +08:00
Dom Brown	dbd9a83b0d	feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582 ) * feat: Integrate GPUDirect Storage (GDS) into Executor API Squash of several dev commits Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-04-18 15:59:21 +08:00

389 Commits