TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-26 13:43:38 +08:00

Author	SHA1	Message	Date
Chang Liu	b8818b45be	fix: llama4: address couple of issues in llama4 attention module (#3491 ) * fix attn module for llama4 * Address comments * Rebase to accommodate latest attn refactor and refactor l4attn * Remove aux_stream from classic attn * Use RMSNorm for L2Norm * Update tensorrt_llm/_torch/models/modeling_llama.py Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com> Signed-off-by: Chang Liu <lc9114@gmail.com> * Add typing informations for _attn_qkv * Remove redundant comment * Simplify llama4 DecoderLayer logic --------- Signed-off-by: Chang Liu <lc9114@gmail.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-04-18 01:54:59 +00:00
rakib-hasan	ff3b741045	feat: adding multimodal (only image for now) support in trtllm-bench (#3490 ) * feat: adding multimodal (only image for now) support in trtllm-bench Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * fix: add in load_dataset() calls to maintain the v2.19.2 behavior Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * re-adding prompt_token_ids and using that for prompt_len Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * updating the datasets version in examples as well Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * api changes are not needed Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * moving datasets requirement and removing a missed api change Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * addressing review comments Signed-off-by: Rakib Hasan <rhasan@nvidia.com> * refactoring the quickstart example Signed-off-by: Rakib Hasan <rhasan@nvidia.com> --------- Signed-off-by: Rakib Hasan <rhasan@nvidia.com>	2025-04-18 07:06:16 +08:00
Yukun He	83b36ebecd	Fix fused_moe fallback issue. (#3652 ) min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-17 23:17:04 +08:00
yuxianq	b9b1c1368c	feat: Support unfused rope in MLA. (#3610 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-17 16:50:49 +08:00
Netanel Haber	3c52ac098f	feat: allocate minimal blocks per window size (#3028 ) * implement variable window attention by breaking the block manager into window block managers per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * revert isCyclic to be true if the min attention window is reached, not per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add explanatory comment to mCyclicThreshold Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * load correct gemma config Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * if TYPE_CHECKING Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * pass dtype as well Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * test_gemma variable sliding window attention Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove \|\| mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention window code Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * turn off request delaying for MaxUtil Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * make comments better Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * windowSizesTotalSum using std::accumulate Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix error handling of forwardAsync - forwardAsync catch-all catch cleanup code that runs terminateRequest can also fail and must be caught Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comments Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove assert that kills disagg tests, since it isn't necessary Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?' which caused boolean algebra. Main is correct Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add Gemma3 to SUPPORTED_HF_ARCHITECTURES Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * support Gemma3 Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix kvfactor field for deepseek Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix gemma-3 entries in testlist to include vswa Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * only quantize gemma2 VSWA Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> remove misleading comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix: disable KV cache reuse if using attention sink (#3021) * fix: disable KV cache reuse if using attention sink Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: disable KV cache reuse if sink bubble Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * add comment Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-17 16:04:57 +08:00
danielafrimi	0f084d9566	added loraOp into lora layer + test for mlp and comparison to lora plugin (#3455 ) Loraop integration into torch modules Signed-off-by: Ubuntu <dafrimi@nvidia.com>	2025-04-17 12:48:27 +08:00
yuxianq	239fe0ff26	chore: Use ellipsis as default value to detect whether residual argument is provided (#3626 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-17 12:31:58 +08:00
Luis Vega	a06bff5052	Fix rotary_emb param in NemotronH attention (#3646 ) Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>	2025-04-16 21:03:07 -07:00
Luis Vega	0bda1f9780	feat: Nemotron-H model support (#3430 ) * added files for nemotron-h Signed-off-by: Luis Vega <lvega@nvidia.com> * use try/except to import RMSNorm Signed-off-by: Luis Vega <lvega@nvidia.com> --------- Signed-off-by: Luis Vega <lvega@nvidia.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-16 14:05:56 -07:00
Mike Iovine	41a6c98544	Support CUDA graphs for EAGLE3 (#3176 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-17 04:53:50 +08:00
hlu1	b6bae33453	Clean up linear.py, mlp.py, gated_mlp.py (#3553 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-16 12:21:44 -07:00
yuxianq	fd8ded2b2b	feat: Support cos_sin_cache in all cases. (#3517 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-16 13:48:44 +08:00
Jinyang Yuan	efabf6b443	chore: Add comments to modifications that fix TP size of DeepSeek-V3/R1 when using more than 16 GPUs (#3572 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-04-15 21:51:42 -07:00
Daniel Cámpora	41ce5440fe	chore: Mass integration of release/0.18 (#3421 ) * [Infra][TRTLLM-4063] - Branch out for the TRT-LLM v0.18.0 release Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com> (cherry picked from commit de90312020e51c22ba5e75b3502c7ee90c059265) * [Infra][TRTLLM-3652] - Update dependencies to TRT 10.9 / CUDA 12.8.1 / DLFW 25.03(Internal) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> (cherry picked from commit 58db1340ef7db22f1910f878d220a92be5b830d1) * [None][Doc] - Update docs for v0.18.0 Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit d23e75bc95619ce3b116213d55319272888e0c88) * [Infra] - Fix or WAR issues in the package sanity check stages Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit e874e2b127515c52ba10c8df1cc2631627f74ffe) * [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path Signed-off-by: Yuki Huang <yukih@nvidia.com> (cherry picked from commit 731811d4e182d70a66193d646152cb71dfafe83a) * cherry-pick 'test: Updat cluster and multi node test lists and trtllm-bench' test to fix perf drop issue Signed-off-by: Ruodi Lu <ruodil@nvidia.com> (cherry picked from commit 5214616283fbc15ae98871a1d84c78d8e1f2e6e8) * Revert "Merge branch 'user/yukih/fix_5173454_5173432' into 'release/0.18'" Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 8d34831cb2b81ee2dfa8021b68e7158b33789a5f) * [Infra]Restrict setuptools version to avoid sasb pip install issue Signed-off-by: Emma Qiao <qqiao@nvidia.com> (cherry picked from commit 1e60ad29e0dafec0e295bedb5d89b716a02a707c) * [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path Signed-off-by: Yuki Huang <yukih@nvidia.com> (cherry picked from commit 3ed8164e5bfea1d5aa2039b5408439fd6cf59dac) * WAR for bug 5173448 Signed-off-by: Thor Johnsen <tjohnsen@nvidia.com> (cherry picked from commit b6528b2ba15322b6c6a4c81a8b74c04d4973de4f) * [Infra][TRTLLM-3652] - Update dependencies to CUDA 12.8.1 / DLFW 25.03 Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> (cherry picked from commit 6560983d132d9d257ee15849664eb055e94adaa9) * [Docs] - Doc changes for v0.18.0 Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 26769b61218a947c8f9d070f73b63d576fcc20c4) * [Doc] - Doc change for v0.18.0 Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 4b3b5ed6bfbc2300e3775fe75456083faad7b235) * [Infra] update version to 0.18.1 Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com> (cherry picked from commit 59e8326c75639275837d34de8e140358737a3365) * Add back nemotron file. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix recurrentgemma reqs. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Adding WAR for bug 5173448. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Remove duplicated file. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update examples/prompt_lookup/requirements.txt Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> * Remove glm-4-9b from model dir in chatglm test. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Remove indent change. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> * Revert changes on l0_test.groovy. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update dev images Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> * Remove duplicated import. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix custom op Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> * Fix flashinfer & vanilla backend Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> * Skip problematic case. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Skip problematic test_moe_w4a8_1_14336_4096_8_bfloat16_True_False case. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: Ruodi Lu <ruodil@nvidia.com> Co-authored-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Thor Johnsen <tjohnsen@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-04-16 10:03:29 +08:00
Yuan Tong	d4c0423cdb	refactor: collect executor and decoder states into dataclass (#3234 ) * fix: Proper error bubbling for PyExecutor Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-15 16:31:45 +08:00
shaharmor98	ede7058544	Feat/ Integrate peftCacheManager in PyExecutor creation (#3372 ) * integrate peftCacheManager in PyExecutor creation Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-04-15 15:14:43 +08:00
Yuan Tong	668a0335e4	fix: Proper error bubbling for PyExecutor (#3321 ) * fix: Proper error bubbling for PyExecutor * fix: Proper shutdown * fix: multi gpu proper shutdown Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-15 14:49:46 +08:00
Jinyang Yuan	0305942808	chore: Modifications that should have been included but were mistakenly overwritten in PR #3467 (#3557 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-04-15 14:08:07 +08:00
yuxianq	0e7e949feb	refactor: Split llama4 model from llama model. (#3530 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-15 13:41:05 +08:00
Jinyang Yuan	175adb94ab	chore: Log memory sizes of weights and activations separately (#3467 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-04-15 09:48:35 +08:00
QI JUN	112f716155	chore: move all distributed related codes into _torch.distributed directory (#3511 ) * move all distributed related codes into _torch.distributed directory Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-15 08:39:17 +08:00
Aurelien Chartier	8cf2785bc6	chore: unify pp_layers helpers (#3429 ) * chore: unify pp_layers helpers Fix assumptions about equal number of layers per PP rank in prepare_attention_inputs Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-04-15 04:49:17 +08:00
Chang Liu	01cb3ccb04	use global expert idx to load expert weights (#3386 ) Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-04-14 11:58:30 -07:00
Chang Liu	1902d73eb5	fix: llama4: add an option `apply_router_weight_on_input` for in FusedMoE (#3492 ) * apply a tenative fix to moe bypass kernel update * Pass none to disable final stage in moe Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com> Signed-off-by: Chang Liu <lc9114@gmail.com> --------- Signed-off-by: Chang Liu <lc9114@gmail.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-04-14 11:56:42 -07:00
Kaiyu Xie	b286b51118	feat: Support torch profiler (#3470 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-14 22:06:06 +08:00
HuiGao-NV	9f41e826bf	fix: remove one duplicated line of code (#3523 ) Signed-off-by: Hui Gao <huig@nvidia.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-14 14:52:46 +08:00
yuxianq	9d64b6b890	Cache sin cos in model instead of global LRU cache. (#3378 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-14 11:19:09 +08:00
pcastonguay	fe6f14b2b1	fix: Fixing issue with first gen token being returned twice in streaming (#3427 ) * fix: Fixing issue with first gen token being returned twice with streaming Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing not_expectring_strings in test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-13 22:45:09 -04:00
yuxianq	baeec63dda	refactor: Remove _pp_forward. (#3496 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-14 09:49:44 +08:00
HuiGao-NV	d0f83d19f1	fix: add kv memory size per token of draft model to calculate max number of tokens of kv cache (#3497 ) * fix: add kv memory size per token of draft model to calculate max number of tokens of kv cache Signed-off-by: Hui Gao * Fix code to get model_config of draft model Signed-off-by: Hui Gao --------- Signed-off-by: Hui Gao	2025-04-13 23:02:14 +08:00
QI JUN	012fb9a1c4	remove useless max_num_tokens member in PyTorchConfig (#3493 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-12 21:09:58 +08:00
Robin Kobus	2ab71f9a80	refactor: decoder buffers (#3307 ) * refactor: remove cumLogProbs and logProbs from DecoderBuffers - Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management. - Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities. These modifications enhance code clarity and maintainability by reducing redundancy in buffer management. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched - Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping. - Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width. These changes enhance clarity and maintainability in the decoding process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: integrate DecoderState for sequence length management in decoding process - Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose. - Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management. - Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow. refactor: update SlotDecoderBuffers to manage sequence lengths directly - Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths. - Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy. - Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process. These changes improve maintainability and streamline the decoding workflow. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Delegate to asyncSend method in SlotDecoderBuffers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-12 11:41:24 +02:00
yuxianq	29c5085400	fix: Fix PP for llama. (#3449 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-12 17:20:27 +08:00
hlu1	4855431d3d	[Deepseek] Redesign multi-stream API (#3459 ) Signed-off-by: Hao Lu <haolu@nvidia.com>	2025-04-11 23:00:25 -07:00
Iman Tabrizian	3041bbdab3	fix: Fix disagg MTP with overlap (#3406 ) * fix: disagg overlap with MTP Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Review comment Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> --------- Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>	2025-04-12 12:27:24 +08:00
HuiGao-NV	c51e90d7d7	fix: don't perform memory estimation for start_attention (#3485 ) * fix: don't perform memory estimation for start_attention * Enable tests of unittest/_torch/multi_gpu Signed-off-by: Hui Gao <huig@nvidia.com>	2025-04-12 11:34:46 +08:00
QI JUN	d167cbd5bb	refactor: remove ParallelConfig in tensorrt_llm._torch.distributed module (#3370 ) * remove tensorrt_llm._torch.distributed.ParallelConfig Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * clean Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix embedding test Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix comments Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * polish Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * rebase Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-04-11 15:34:20 -07:00
Fridah-nv	ec723fa993	feat:[AutoDeploy] Enhance RoPE support (#3115 ) * add test to map flashinfer rope op with triton custom rope ops and pytorch rope in fused_mha Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * add rope matcher and unit tests Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * capture cos and sin from graph Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * revert fuse_mha op change Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor update to address comment and remove redundant unit test Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * move view and transpose into graph nodes and update unit test to test custom op directly Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * move view into custom op, update bfs with bound, update custom op return type to be half precision Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * custom op update to support 3D input Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * handle bnsd and bsnd format, update tests, handle 3D cos/sin input to the custom op Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * add llama4 rope test, update custom op with is_neox flag Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * add llama4 style rope to matcher and update unit test Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * separate into two transformations Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * fix when num_head != num_kv_head; add support for cached position_ids and cos_sin_cache in graph; update unit tests Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor update, cache locally and propagate meta info of qk nodes Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor: fix cos_sin_cache not float Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor: move cache into matcher Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> --------- Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>	2025-04-11 23:51:24 +08:00
Yukun He	ff82aef99b	Fix the issues related to fused moe path. (#3435 ) * One of the tactic is not supported during dispatch. * final_hidden_states should be unpacked if it is not min_latency_mode. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-11 21:41:15 +08:00
liji-nv	b168adba70	feat: Add NVFP4 UB pattern optimization pass in torch compile (#3371 ) * feat: Add NVFP4 UB pattern optimization pass in torch compile * Add an additional flag for UB fp4 pattern to avoid inverse the scale * Add NVFP4 related UB patterns Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> * Update atol, some points fails for B200 umbriel. Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com> --------- Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>	2025-04-11 21:25:29 +08:00
Yechan Kim	5bc6f093c8	fix: mllama e2e pytorch flow fix (#3397 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-04-11 17:33:15 +08:00
Ivy Zhang	d998832b33	test: add torch flow test case in qa test list (#3404 ) Signed-off-by: Ivy Zhang <yanzh@nvidia.com>	2025-04-11 16:57:41 +08:00
Zhihan Jiang	8300218d21	feat: support llama4 nope layers; support FP8 checkpoint loading; (#3382 ) * Enable NOPE, Fix a rotary embedding bug for gptj_stype_rope, Address PR comment, Properly skip the rotary_embdding for Llama4 ROPE layers * Add support for FP8 checkpoint, Fix ckpt weighting loading for FP8 * Temporarily disable min_latency_mode for llama4 --------- Co-authored-by: Yilin Fan <yilinf@nvidia.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-04-10 10:16:42 -07:00
amitz-nv	a6a2ae6cc1	chore: Rename nvsmall to nemotron nas (#3447 ) * Rename nvsmall to nemotron NAS * Revert nvsmall to nemotron_nas rename in paths in tests that access llm_models_root/nvsmall/tests * Add NemotronNAS to pytorch supported models table Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-04-10 23:16:52 +08:00
wm2012011492	af05749e90	feat: add qwen2 moe to torch flow; fix wrong imported KvCacheConfig in gpqa… (#3369 ) * add qwen2 moe to torch flow; fix wrong imported KvCacheConfig in gpqa_llmapi.py Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com> * fix coding style Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com> * add unittest Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com> --------- Signed-off-by: mengw <12670782+wm2012011492@users.noreply.github.com> Co-authored-by: mengw <12670782+wm2012011492@users.noreply.github.com>	2025-04-10 22:45:57 +08:00
HuiGao-NV	3ade9375ba	feat: Run PyExecutor's inference flow to estimate max_num_tokens for kv_cache_manager (#3092 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-04-10 18:29:40 +08:00
yuxianq	16c8f39fc5	feat: Support TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD for debugging (#3417 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-10 13:18:30 +08:00
hlu1	fbcf954d9c	[MLA] Deallocate tensors after use (#3286 ) Signed-off-by: Hao Lu <haolu@nvidia.com>	2025-04-09 21:36:07 -07:00
Yechan Kim	943218b54a	feat: Add Qwen2.5-VL and refactor Qwen2-VL (#3156 ) * feat: Add Qwen2.5-VL and refactor Qwen2-VL Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix yapf and codespell Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * add test Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix test_e2e Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * generalize get_rope_index Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix qwen2.5-vl on REAME Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix test Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> * fix image test Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> --------- Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-04-10 04:09:03 +08:00
danielafrimi	47f5cf6c0d	lora_tests (#3201 ) LoRA tests and layers Signed-off-by: Ubuntu <dafrimi@nvidia.com> Co-authored-by: Ubuntu <dafrimi@nvidia.com>	2025-04-09 18:06:52 +03:00

1 2 3

104 Commits