Commit Graph

28 Commits

Author SHA1 Message Date
Chang Liu
01cb3ccb04
use global expert idx to load expert weights (#3386)
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-14 11:58:30 -07:00
Chang Liu
1902d73eb5
fix: llama4: add an option apply_router_weight_on_input for in FusedMoE (#3492)
* apply a tenative fix to moe bypass kernel update

* Pass none to disable final stage in moe

Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Chang Liu <lc9114@gmail.com>

---------

Signed-off-by: Chang Liu <lc9114@gmail.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-14 11:56:42 -07:00
yuxianq
baeec63dda
refactor: Remove _pp_forward. (#3496)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-14 09:49:44 +08:00
hlu1
4855431d3d
[Deepseek] Redesign multi-stream API (#3459)
Signed-off-by: Hao Lu <haolu@nvidia.com>
2025-04-11 23:00:25 -07:00
QI JUN
d167cbd5bb
refactor: remove ParallelConfig in tensorrt_llm._torch.distributed module (#3370)
* remove tensorrt_llm._torch.distributed.ParallelConfig

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* clean

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix embedding test

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix comments

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* polish

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* rebase

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-11 15:34:20 -07:00
Yukun He
ff82aef99b
Fix the issues related to fused moe path. (#3435)
* One of the tactic is not supported during dispatch.
* final_hidden_states should be unpacked if it is not min_latency_mode.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-11 21:41:15 +08:00
Zhihan Jiang
8300218d21
feat: support llama4 nope layers; support FP8 checkpoint loading; (#3382)
* Enable NOPE, Fix a rotary embedding bug for gptj_stype_rope, Address PR comment, Properly skip the rotary_embdding for Llama4 ROPE layers

* Add support for FP8 checkpoint, Fix ckpt weighting loading for FP8

* Temporarily disable min_latency_mode for llama4

---------

Co-authored-by: Yilin Fan <yilinf@nvidia.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-04-10 10:16:42 -07:00
hlu1
fbcf954d9c
[MLA] Deallocate tensors after use (#3286)
Signed-off-by: Hao Lu <haolu@nvidia.com>
2025-04-09 21:36:07 -07:00
danielafrimi
47f5cf6c0d
lora_tests (#3201)
LoRA tests and layers

Signed-off-by: Ubuntu <dafrimi@nvidia.com>
Co-authored-by: Ubuntu <dafrimi@nvidia.com>
2025-04-09 18:06:52 +03:00
Tracin
2a2b7bfc66
Fix miss bias add for FP4Linear. (#3361)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-09 09:17:54 +08:00
Mike Iovine
5bdf997963
Add Llama 4 (#3302)
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-04-09 03:35:21 +08:00
yuxianq
7225bd8b91
chore: Refine attention backend interface. (#3271)
Refine attention backend interface.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-09 02:34:53 +08:00
liji-nv
dca6397d1e
feat: Introduce UB allocator for pytorch flow (#3257)
* Instead of allocating UserBuffers at beginning of runtime, UB buffers
  are now managed with global allocator. The allocator will dynamically
assign free UB buffer or allocate new buffer for torch tensor. It makes
userbuffers easier to use.

* In common usecase, the Userbuffers will be allocated correctly during
  warm up stage. There is no dynamic allocation during inference.

* UB fusion pattern is rewroten using the new UB Allocator. It contains
  following passes:

1. Fuse Quant with allreduce, replace with UB impl, and insert a
   copy_to_userbuffers. Currently the normal allreduce still does not
   support FP8 quant. So this need to be done in UB pass
2. Convert all supported allreduce with UB and insert copy_to_userbuffers.
3. Fuse op before ar with the copy_to_userbuffers. So the op directly
   writes to the userbuffer
4. Remove userbuffers finalize if the output is connect to another UB
   allreduce.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-04-08 18:39:49 +08:00
Yukun He
c678774c99
feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151)
* Several optimizations and fixings on the Autotuner.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on current linear for nvFP4 data type.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use do_preparation keyword to select which part should be executed during before evaluating any tactic.
* Remove try-catch inside moe profiling process.
* Move default tactic -1 to 0 transforms in cpp runner.
* Revise relavant tests.
* Predefined the bucketizing strategy for fused_moe

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Fixing and revising according to reviewer's suggestions.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Use lru_cache for inference pref optimization.
* Revert gen_custom_cache_key feature

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Replace runner with runner id to achieve a serializable cache.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Code clean up and minor fixings.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Move all tunable runners and custom ops into torch_custom_ops.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Treat min_latency_mode as a independent dynamic tensor. Modify get_valid_tactics to suit for it.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-08 14:28:36 +08:00
Bo Li
515dd0d78f
feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190)
* fp8 kv + bf16 ctx MLA + fp8 gen MLA

Use BF16 for context MLA.
mFP8GenerationMLA and mFP8ContextFMHA shouldn't be enabled together.

Allow mSM==90 for mFP8GenerationMLA==true.
For FMHA, dataTypeKv should be FP8.

For FP8 MLA generation, the output is still in BF16.

Refine debug info for FMHA kernel metadata.

Use inputType, outputType, SM together to hash kernel list.

Add FP8 MLA generation FMHA kernel.

Special WAR of NUM_COMPUTE_GROUPS for MLA generation kernel.

Separate the implementation of fused_multihead_attention_v2.h to CPP and print some debug info if checkIfKernelExist fails.

Refine debug info in fused_multihead_attention_v2.cpp

Correct FP8 MLA metadata.

New kernel provided by Yuxin, which outputs BF16.

smem size is not set correctly, which will lead to illegal mem access.

Yuxin fixed the error in FMHA MLA kernel: previously the BF16 isn't correctly written: some parts are repeatedly written, while some others are untouched.

There are two bmm1 scales that should be set correctly.

New kernel generated by Yuxin.

Modificatiosn to common/attentionOp for FP8 MLA on Hopper using FMHA.

Not necessary. If mFP8GenerationMLA, is_fp8_out is false, so mFP8ContextFMHA is false.

Skip a check in fmhaDispatcher.

Modifications in fmhaRunner:
- Debug dump.
- if (!isFP8GenerationMLA) skips a lot of flag setting.
- TMA descriptor modification for qo (by Yuxin).

Cleanup debug output.

Clean up o tma descriptor modifications.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Resolve conflicts.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Apply the patch of FP8 FlashMLA and resolve conflicts.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Fix compilation error.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Fix compile error.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* pick blackwell support

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* Add copyright notice to fused_multihead_attention_v2.cpp.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Add license.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Add missing license.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Exclude building flashMLA kernels under sm90.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Revert "Exclude building flashMLA kernels under sm90."

    This reverts commit f0c859d459.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Use macro to skip compiling FlashMLA for non sm90 targets.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

---------

Signed-off-by: Bo Li <bobboli0202@gmail.com>
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: Dylan Chen <ziqingc@nvidia.com>
Co-authored-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-07 15:14:13 +08:00
Zongfei Jing
dcc0ebd273
Fix warning (#3254)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-03 13:30:23 +08:00
Jinyang Yuan
2fdfa39ea8
fix: Fix an error related to dummy request when MTP is used (#3146) 2025-04-03 11:08:12 +08:00
Zongfei Jing
c7548ad72c
perf: Add optimizations for deepseek in min latency mode (#3093)
* Add optimizations for deepseek min latency

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Update internal cutlass kernel libs

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Format code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Resolve conflicts

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-02 09:05:24 +08:00
Jinyang Yuan
992d513bc6
feat: Optionally split MoE inputs into chunks to reduce GPU memory usage (#3104)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-04-01 16:07:02 +08:00
yuxianq
268933b5cc
Refactor imports inside tensorrt_llm._torch. (#3015)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-03-26 11:01:07 +08:00
Kaiyu Xie
2631f21089
Update (#2978)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-03-23 16:39:35 +08:00
Kaiyu Xie
3aa6b11d13
Update TensorRT-LLM (#2936)
* Update TensorRT-LLM

---------

Co-authored-by: changcui <cuichang147@gmail.com>
2025-03-18 21:25:19 +08:00
Kaiyu Xie
9b931c0f63
Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
Kaiyu Xie
77d7fe1eb2
Update TensorRT-LLM (#2849)
* Update TensorRT-LLM

---------

Co-authored-by: aotman <chenhangatm@gmail.com>
2025-03-04 18:44:00 +08:00
Kaiyu Xie
ab5b19e027
Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
Kaiyu Xie
2ea17cdad2
Update TensorRT-LLM (#2792)
* Update TensorRT-LLM

---------

Co-authored-by: jlee <jungmoolee@clika.io>
2025-02-18 21:27:39 +08:00
Kaiyu Xie
e88da961c5
Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
Dan Blanaru
16d2467ea8 Update TensorRT-LLM (#2755)
* Update TensorRT-LLM

---------

Co-authored-by: Denis Kayshev <topenkoff@gmail.com>
Co-authored-by: akhoroshev <arthoroshev@gmail.com>
Co-authored-by: Patrick Reiter Horn <patrick.horn@gmail.com>

Update
2025-02-11 03:01:00 +00:00