88 Commits

Author SHA1 Message Date
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE (#4303)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00
hlu1
3093c747b7
[Architecture] Redesign Linear module (#4721)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-29 16:05:46 -07:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main (#4739)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
Yuxian Qiu
bf691b3d28
feat: support packed weights in vanilla moe (#4719)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-29 06:24:24 +08:00
amirkl94
fbec0c3552
Release 0.20 to main (#4577)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: stnie <82932102+stnie@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-28 16:25:33 +08:00
Bo Li
9c4b8f66b4
feat: Integration of Fused QKNorm+RoPE. (#4611)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-05-28 11:20:45 +08:00
Yuxian Qiu
5700a4ffcd
feat: Add vanilla MOE. (#4682)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-28 10:44:14 +08:00
Enwei Zhu
88190faa34
feat: large-scale EP(part 4: Static EP load balancer integration) (#4615)
* MoeLoadBalancerConfig

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* MoeLoadBalancer integration

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* config file

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* test

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* test

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-26 18:25:11 +08:00
zhhuang-nv
8452775db8
[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535)
* optimize kv cache reuse workflow for MLA

write kv cache first and only call up-projection GEMM once
relax contiguous requirements of k/v for setting paged kv cache
return two contiguous tensors when loading MLA KV Cache

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* support fp8 kv cache for MLA kv cache reuse

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-23 19:47:50 +08:00
Anthony Chang
bbea2647b1
Qwen3 supports TRTLLM FP4 MoE backend (#4530)
* MoE TRTLLM backend for Qwen3

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* add extra moe_backend to test

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* address comments

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* conditionally compile kernels on newer archs

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* missing positional arg

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* Update the routing kernels

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Revise usage of TLLM_LOG_ERROR

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Add unit test for Qwen3 moe (trtllm_gen backend)

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* improve weight processing speed of moe_backend=TRTLLM; roughly 2x

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* tidy and minor fix

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* temporarily disable accuracy test that has known issue

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

---------

Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
2025-05-23 18:31:08 +08:00
Mike Iovine
9c0de251db
[feat] Integrate Hopper chunked attention kernels (#4330)
* Integrate chunked attention kernels

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix cache key

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix lint

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

---------

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-22 17:10:57 -04:00
dongxuy04
4018806742
feat: large-scale EP(part 3 - refactor: FusedMoe for redundant expert) (#4495)
refactor fused_moe for redundant expert

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-21 17:17:49 +08:00
dongxuy04
21aff2e313
feat: large-scale EP(part 2: MoE Load Balancer - core utilities) (#4384)
* first commit of cpp moe loadbalance code

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python bindings for moe load balance

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python wrapper, ut and bug fixes

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add binding for layerId and update binding test

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add host tensor sharing and ut

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-20 17:53:48 +08:00
liji-nv
58e405624a
[https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 (#3952)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-19 22:12:25 +08:00
shaharmor98
27afcb9928
add changes for fp8, nemotron-nas, API (#4180)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-18 23:27:25 +08:00
Kaiyu Xie
3e08cd231c
fix: Remove real size allocation (#4396)
Remove real size allocation

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-18 19:13:22 +08:00
Jinyang Yuan
b618e1f55b
perf: Eliminate the need for attention DP padding when possible (#3439)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-05-17 13:30:55 +08:00
NVJiangShao
a6f2a1e918
Fix test_fused_moe_w4afp8 (#4393)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-05-16 17:21:33 +08:00
Tracin
46c5a56444
Support dynamic per-tensor FP8 (#4250)
* Support dynamic per-tensor FP8

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Update test cases.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

---------

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 13:33:58 +08:00
yuxianq
a1daa22970
doc: Add docstring for Attention and MLA module. (#4354)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-16 09:37:04 +08:00
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
yuxianq
0e87fcc228
refactor: use x is None instead of x == None. (#4244)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-15 20:00:04 +08:00
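The refactor in 0e87fcc228 matters most for types that overload `==`. A minimal sketch of the difference follows; `Elementwise` is a hypothetical stand-in for an array-like class and is not taken from the repository.

```python
class Elementwise:
    """Hypothetical array-like type whose == compares element by element."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Returns a list, not a bool, much like array libraries do.
        return [v == other for v in self.values]


x = Elementwise([1, None, 3])

# Equality dispatches to __eq__ and yields a per-element result.
by_equality = (x == None)  # noqa: E711

# Identity never calls __eq__; it only asks whether x is the None singleton.
by_identity = (x is None)
```

Here `by_equality` is a non-empty list (truthy in an `if`), while `by_identity` is a plain `False`, which is why the identity check is the safer test for a missing value.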
zhhuang-nv
97bc680cd8
feat: support kv cache reuse for MLA (#3571)
* support kv cache reuse for MLA

load compressed_kv and k_pe and do up-projection
use 192/128 head size MLA context kernel
support Blackwell and Hopper now

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* add CI test

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* use GPTJ style RoPE for MLA

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix rebase error and some docs

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix kv_lens

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* tiny fix

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: use normal device memory instead of pinned memory for unit test

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* fix L0 tests

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile after rebase

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments again

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-05-15 15:22:21 +08:00
Barry Kang
20b42912ce
[TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper (#4123)
Support DeepSeek-R1 W4A8 on Hopper

Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-05-14 15:48:07 +08:00
brb-nv
8280c3d4f2
feat: Support Gemma3-1b-it in Pytorch workflow (#3999)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-14 14:02:44 +08:00
Simeng Liu
286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

The selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained on nsys kernel runtime data.
The default kernel selection is the base kernel.

The Python operator group_rms_norm uses the heuristic by default.
Users can also select the base or large-batch kernel explicitly.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
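The kernel-selection heuristic described in 286a789549 can be illustrated with a small logistic-regression sketch. Everything numeric here is an assumption: the feature definitions, `WEIGHTS`, and `BIAS` are hypothetical placeholders, not the coefficients trained from nsys data, and `choose_kernel` is not the repository's API.

```python
import math

# Hypothetical coefficients for illustration only; the real models are
# trained per compute capability and are not reproduced here.
WEIGHTS = {"log_batch": 0.9, "warp_ratio": -1.2, "sched_eff": 0.5}
BIAS = -6.0


def choose_kernel(batch_size, dims, sched_efficiency):
    """Return "base" or "large_batch" from a logistic score.

    dims holds the last dimension of each input; sched_efficiency is an
    assumed scalar in [0, 1] describing block-scheduling efficiency.
    """
    features = {
        "log_batch": math.log2(batch_size),
        # Base kernel sizes warps by sum(dims), large-batch by max(dims).
        "warp_ratio": sum(dims) / max(dims),
        "sched_eff": sched_efficiency,
    }
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    p_large = 1.0 / (1.0 + math.exp(-z))
    # The base kernel stays the default unless the model favors large batch.
    return "large_batch" if p_large > 0.5 else "base"


small = choose_kernel(1, [128, 64], 1.0)
large = choose_kernel(4096, [128, 64], 1.0)
```

With these placeholder weights a batch of 1 falls back to the base kernel and a batch of 4096 selects the large-batch kernel, mirroring the stated default.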
yuxianq
b35f9a67f9
refactor: Allow models to override apply_qk_norm. (#4078)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-12 19:38:24 +08:00
Mike Iovine
4b8ba7ad61
[fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue (#4069)
[fix] Fix llama 4 test lists

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-09 22:45:14 +08:00
chenfeiz0326
ffc13bd325
Cherry-pick: Use multi-threading to load MoE expert weights (#4137)
* Use multi-threading to load MoE expert weights

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>

* Update code formatting

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Update code formatting

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

---------

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-09 17:29:24 +08:00
dongxuy04
7147efb2e8
fix: alltoall padding for chunked MoE (#4157)
fix alltoall padding for chunked MoE

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-09 09:01:35 +08:00
chenfeiz0326
7f5716ef83
Cherry-pick trtllm-gen from feat/llama4 to main (#4086)
* feat: TRT-LLM Gen FP8 MoE Llama4

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* feat: TRT-LLM Gen llama4 MoE Top1 routing

Signed-off-by: Jiqun Tu <jtu@nvidia.com>

* feat: add per tensor FP8 TRT-LLM Gen GEMMs

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add guard for routingIndicesClusterKernel

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Guard sm90+ for routingkernels

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Guard sm90+ for routingkernels

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

---------

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Nikita Korobov <nkorobov@nvidia.com>
Co-authored-by: Jiqun Tu <jtu@nvidia.com>
2025-05-08 14:13:01 -07:00
shaharmor98
7d94c9561f
feat: support multi lora adapters and TP (#3885)
* support multi lora, tp

Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-08 23:45:45 +08:00
zihaok
81cc60a0fd
[feat/] enable attention DP in Llama4 maverick model - part 1 (#4065)
* add feature

cosmetic changes

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

address precommit fix

cosmetic

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* add feature

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* fix bug

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* address comments

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* remove WAR

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* fix format precommit

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* Update tensorrt_llm/_torch/models/modeling_llama.py

Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com>

---------

Signed-off-by: Zihao Kong <zihaok@nvidia.com>
Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-08 05:06:40 +08:00
hlu1
26a2679217
[Deepseek] Refactor Deepseek Decoder layer (#4016)
Refactor Deepseek Decoder layer

Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-08 01:43:10 +08:00
bhsueh_NV
f670a036df
[Qwen3] chore: fix bug of fused_moe on tp > 1 (#4093)
* fix bug of fused_moe on tp > 1

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* refine codes

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-07 11:06:37 +08:00
yuxianq
017701343e
fix: apply rope twice in Qwen3. (#4040)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-05 15:12:45 +08:00
hlu1
52edabab30
Fix Deepseek MTP with moe_backend=TRTLLM (#4001)
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-02 14:47:22 +08:00
Simeng Liu
873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438)
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.

This MR provides two implementations:
GroupRMSNormKernel: Optimized for small-to-medium batch sizes
GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes

Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
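As a reading aid for 873c7532fd, the semantics of normalizing several inputs in one call can be sketched in plain Python. This is a reference for the math only, under the assumptions that each input is a list of rows and that the learned weight is omitted; `group_rms_norm_reference` is a hypothetical name and says nothing about the CUDA kernels or warp partitioning.

```python
import math


def rms_norm(row, eps=1e-6):
    """Scale one row so that its root mean square is approximately 1."""
    mean_sq = sum(v * v for v in row) / len(row)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in row]


def group_rms_norm_reference(inputs, eps=1e-6):
    """Normalize every input; all inputs must share the batch dimension."""
    batch = len(inputs[0])
    if any(len(t) != batch for t in inputs):
        raise ValueError("all inputs must share the same batch dimension")
    # Last dimensions may differ between inputs; only the batch size is shared.
    return [[rms_norm(row, eps) for row in t] for t in inputs]


input_a = [[3.0, 4.0], [1.0, 1.0]]                      # batch 2, width 2
input_b = [[2.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]  # batch 2, width 4
out_a, out_b = group_rms_norm_reference([input_a, input_b])
```

Each output row has a mean square close to 1 regardless of its width, which is the property the fused operator preserves for every input in the group.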
Erin
8fe7bdeacf
feat: LogitsProcessor in PyTorch backend (#3145)
* support lp in pytorch backend

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* fix tp

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-01 14:15:30 -07:00
bhsueh_NV
129bf19980
model: support Qwen3 (#4010)
* add qwen3 dense model pytorch backend support, initial commit

solve the results error issue

add qwen3 moe model pytorch backend support

reformat the code

* perf - use flash_infer rmsnorm for qwen3

* feat - support qwen3 moe rmsnorm

* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5%-8% throughput improvement on Qwen3 4B and Qwen3-moe 30B-A3B.

* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5%-8% throughput improvement on Qwen3 4B and Qwen3-moe 30B-A3B. -- Forgot to update all modifications.

* fix bugs of running qwen3 public models and fp8 models

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs due to rebase

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs captured by pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of attention

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Keddy Jin <jin.gq@aliyun.com>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Co-authored-by: shao <shao@nvidia.com>
2025-05-01 23:12:41 +08:00
yuxianq
0f8ec693b2
fix: get head_dim from model’s config. (#3916)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 23:04:29 +08:00
HuiGao-NV
8e6eead6a5
refactor: (part1) Add constraints doc for fusedMoe module. (#3882)
* Add doc string for FusedMoe module
* Address comments.

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-29 22:23:02 +08:00
Junhong Liu
06e76020d7
feat: parallel q_b_proj and concat (#3917)
* add parallel_q_b_proj_and_concat

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

* code cleanup

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

* one gemm/concat and then split the latent_cache and pass them separately to context/gen

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

---------

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
2025-04-29 22:07:05 +08:00
hlu1
d2f312b8e4
Fix fp8 kvcache (#3877)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-04-29 10:31:10 +08:00
Mike Iovine
e6f7ff3a46
[chore] Make llama4 MoE use maybe_execute_in_parallel (#3779)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 10:58:03 -04:00
bhsueh_NV
f77252e9ff
fix bug of create cuda stream as default parameter which will be init… (#3764)
* fix bug of create cuda stream as default parameter which will be initialized during importing

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* add torch.cuda.Stream() for the leader node

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix pre-commit issue

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-28 08:16:03 +08:00
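The bug class fixed in f77252e9ff comes from Python evaluating default arguments once, when the `def` statement runs, which for a module-level function means at import time. A minimal sketch follows; `FakeStream` is a hypothetical stand-in for a stream object so the example runs without a GPU, and the function names are illustrative.

```python
class FakeStream:
    """Hypothetical stand-in for a CUDA stream; counts constructions."""

    created = 0

    def __init__(self):
        FakeStream.created += 1


def run_eager(stream=FakeStream()):
    # The default above was built when this def executed, i.e. on import,
    # and the same object is shared by every call.
    return stream


def run_lazy(stream=None):
    # Deferring construction to call time avoids import-time side effects.
    if stream is None:
        stream = FakeStream()
    return stream


# One stream already exists although no function has been called yet.
created_at_import = FakeStream.created
```

Using `None` as the sentinel and constructing the object inside the function body keeps importing the module free of device work.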
milesial
362a8272f8
feat: llama4 input processor (#3383)
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-25 16:47:14 -07:00
dongxuy04
16535991b2
feat: Add MNNVL MoE A2A support (#3504)
* add MNNVL memory mapping support

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add more MPI environment for trtllm-llmapi-launch

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MoE communication and prepare kernels

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MNNVL AlltoAll support for DeepSeekV3

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add output dump for throughput benchmark

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* support dynamic kernel launch grid

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments #2

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-04-25 17:29:08 +08:00
hlu1
cd2bcdc1a9
Fix create_weights in attention (#3692)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-24 07:30:00 +08:00
shaharmor98
49262a62a5
add passing E2E LoRA flow (#3788)
Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-23 18:38:06 +03:00