ChristinaZ
f45aff2b7d
Add customized renormalized moe routing kernel for moe cutlass backend ( #4955 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-09 17:38:50 +08:00
Bo Li
c104388d37
chore: Refactor apply_rope. ( #4918 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-09 16:51:59 +08:00
Daniel Stokes
3a4851b7c3
feat: Add Mixture of Experts FP8xMXFP4 support ( #4750 )
...
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-09 13:25:04 +08:00
dongxuy04
1e369658f1
feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) ( #4818 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-06-08 10:25:18 +08:00
Bo Li
f414a079ad
chore: Change the type annotations of input_ids and position_ids to int32. ( #4632 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-07 16:10:47 +08:00
Anthony Chang
eeb555e37b
chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM ( #4826 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-06 16:13:54 +08:00
Yuxian Qiu
6b3242654e
fix: Fix broken vanilla moe since FusedMoE refactor. ( #4897 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-05 03:56:41 +08:00
tomeras91
8d31e16877
[TRTLLM-4923][feat] Paged mamba cache ( #4822 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-04 09:27:08 +03:00
Omer Ullman Argov
e71de2a13e
chore: Mass integration of release/0.20. ( #4871 )
...
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-06-04 14:12:27 +08:00
Nikita Korobov
8043d7a03c
feat: update DeepSeek FP8 TRT-LLM Gen cubins ( #4643 )
...
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-06-03 14:07:54 -07:00
hlu1
320195dc0d
[Architecture] Refactor FusedMoE ( #4790 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-06-03 14:02:19 +08:00
Tian Zheng
9832787050
[feat] Enable NVFP4 output for TRTLLM attention kernels ( #4737 )
...
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-06-03 10:00:17 +08:00
Enwei Zhu
5b4852b7b5
feat: large-scale EP(part 5: Static EP load balancer with offline statistics) ( #4695 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-02 01:25:02 +08:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H ( #4494 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
Enwei Zhu
25dde49c28
fix: EP load balancer with MTP layer and route offset by EP rank ( #4767 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 00:07:44 +08:00
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE ( #4303 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00
hlu1
3093c747b7
[Architecture] Redesign Linear module ( #4721 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-29 16:05:46 -07:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main ( #4739 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
Yuxian Qiu
bf691b3d28
feat: support packed weights in vanilla moe ( #4719 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-29 06:24:24 +08:00
amirkl94
fbec0c3552
Release 0.20 to main ( #4577 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: stnie <82932102+stnie@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-28 16:25:33 +08:00
Bo Li
9c4b8f66b4
feat: Integration of Fused QKNorm+RoPE. ( #4611 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-05-28 11:20:45 +08:00
Yuxian Qiu
5700a4ffcd
feat: Add vanilla MOE. ( #4682 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-28 10:44:14 +08:00
Enwei Zhu
88190faa34
feat: large-scale EP(part 4: Static EP load balancer integration) ( #4615 )
...
* MoeLoadBalancerConfig
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* MoeLoadBalancer integration
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* config file
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* test
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* test
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-26 18:25:11 +08:00
zhhuang-nv
8452775db8
[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA ( #4535 )
...
* optimize kv cache reuse workflow for MLA
write kv cache first and only call up-projection GEMM once
relax contiguous requirements of k/v for setting paged kv cache
return two contiguous tensors when loading MLA KV Cache
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* support fp8 kv cache for MLA kv cache reuse
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* resolve comments
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
---------
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-23 19:47:50 +08:00
Anthony Chang
bbea2647b1
Qwen3 supports TRTLLM FP4 MoE backend ( #4530 )
...
* MoE TRTLLM backend for Qwen3
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* add extra moe_backend to test
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* address comments
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* conditionally compile kernels on newer archs
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* missing positional arg
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* Update the routing kernels
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Revise usage of TLLM_LOG_ERROR
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Add unit test for Qwen3 moe (trtllm_gen backend)
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* improve weight processing speed of moe_backend=TRTLLM; roughly 2x
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* tidy and minor fix
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* temporarily disable accuracy test that has known issue
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
---------
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
2025-05-23 18:31:08 +08:00
Mike Iovine
9c0de251db
[feat] Integrate Hopper chunked attention kernels ( #4330 )
...
* Integrate chunked attention kernels
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
* Fix cache key
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
* Fix lint
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
---------
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-22 17:10:57 -04:00
dongxuy04
4018806742
feat: large-scale EP(part 3 - refactor: FusedMoe for redundant expert) ( #4495 )
...
refactor fused_moe for redundant expert
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-21 17:17:49 +08:00
dongxuy04
21aff2e313
feat: large-scale EP(part 2: MoE Load Balancer - core utilities) ( #4384 )
...
* first commit of cpp moe loadbalance code
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add python bindings for moe load balance
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add python wrapper, ut and bug fixes
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add binding for layerId and update binding test
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add host tensor sharing and ut
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
---------
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-20 17:53:48 +08:00
liji-nv
58e405624a
[ https://nvbugs/5123103 ][fix] Fix torch compile for DeepSeekV3 ( #3952 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-19 22:12:25 +08:00
shaharmor98
27afcb9928
add changes for fp8, nemotron-nas, API ( #4180 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-18 23:27:25 +08:00
Kaiyu Xie
3e08cd231c
fix: Remove real size allocation ( #4396 )
...
Remove real size allocation
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-18 19:13:22 +08:00
Jinyang Yuan
b618e1f55b
perf: Eliminate the need for attention DP padding when possible ( #3439 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-05-17 13:30:55 +08:00
NVJiangShao
a6f2a1e918
Fix test_fused_moe_w4afp8 ( #4393 )
...
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-05-16 17:21:33 +08:00
Tracin
46c5a56444
Support dynamic per-tensor FP8 ( #4250 )
...
* Support dynamic per-tensor FP8
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
* Update test cases.
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
---------
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 13:33:58 +08:00
yuxianq
a1daa22970
doc: Add docstring for Attention and MLA module. ( #4354 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-16 09:37:04 +08:00
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism ( #4034 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
yuxianq
0e87fcc228
refactor: use x is None instead of x == None. ( #4244 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-15 20:00:04 +08:00
zhhuang-nv
97bc680cd8
feat: support kv cache reuse for MLA ( #3571 )
...
* support kv cache reuse for MLA
load compressed_kv and k_pe and do up-projection
use 192/128 head size MLA context kernel
support Blackwell and Hopper now
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* add CI test
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
* resolve comments
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* use GPTJ style RoPE for MLA
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix rebase error and some docs
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix kv_lens
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* tiny fix
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix torch compile
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix: use normal device memory instead of pinned memory for unit test
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
* fix L0 tests
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix torch compile after rebase
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* resolve comments
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* resolve comments again
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
---------
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-05-15 15:22:21 +08:00
Barry Kang
20b42912ce
[TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper ( #4123 )
...
Support DeepSeek-R1 W4A8 on Hopper
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-05-14 15:48:07 +08:00
brb-nv
8280c3d4f2
feat: Support Gemma3-1b-it in Pytorch workflow ( #3999 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-14 14:02:44 +08:00
Simeng Liu
286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. ( #4047 )
...
* feat: Add heuristic for GroupRMSNorm kernel selection.
Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
(better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
(better block scheduling in large batch scenarios)
Selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained base on nsys kernel runtime data.
The default kernel selection is the base kernel.
The python operator group_rms_norm will use the heuristic by default.
User can pick to use the base or large batch kernels as well.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Address the comments.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
yuxianq
b35f9a67f9
refactor: Allow models to override apply_qk_norm. ( #4078 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-12 19:38:24 +08:00
Mike Iovine
4b8ba7ad61
[fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue ( #4069 )
...
[fix] Fix llama 4 test lists
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-09 22:45:14 +08:00
chenfeiz0326
ffc13bd325
Cherry-pick: Use multi-threading to load MoE expert weights ( #4137 )
...
* Use multi-threading to load MoE expert weights
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
* Update code formatting
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Update code formatting
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
---------
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-09 17:29:24 +08:00
dongxuy04
7147efb2e8
fix: alltoall padding for chunked MoE ( #4157 )
...
fix alltoall padding for chunked MoE
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-09 09:01:35 +08:00
chenfeiz0326
7f5716ef83
Cherry-pick trtllm-gen from feat/llama4 to main ( #4086 )
...
* feat: TRT-LLM Gen FP8 MoE Llama4
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
* feat: TRT-LLM Gen llama4 MoE Top1 routing
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
* feat: add per tensor FP8 TRT-LLM Gen GEMMs
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
* Update
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Update
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Add guard for routingIndicesClusterKernel
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Guard sm90+ for routingkernels
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Guard sm90+ for routingkernels
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
---------
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Nikita Korobov <nkorobov@nvidia.com>
Co-authored-by: Jiqun Tu <jtu@nvidia.com>
2025-05-08 14:13:01 -07:00
shaharmor98
7d94c9561f
feat: support multi lora adapters and TP ( #3885 )
...
* support multi lora, tp
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-08 23:45:45 +08:00
zihaok
81cc60a0fd
[feat/] enable attention DP in Llama4 maverick model - part 1 ( #4065 )
...
* add feature
cosmetic changes
Signed-off-by: Zihao Kong <zihaok@nvidia.com>
address precommit fix
cosmetic
Signed-off-by: Zihao Kong <zihaok@nvidia.com>
* add feature
Signed-off-by: Zihao Kong <zihaok@nvidia.com>
* fix bug
Signed-off-by: Zihao Kong <zihaok@nvidia.com>
* address comments
Signed-off-by: Zihao Kong <zihaok@nvidia.com>
* remove WAR
Signed-off-by: Zihao Kong <zihaok@nvidia.com>
* fix format precommit
Signed-off-by: Zihao Kong <zihaok@nvidia.com>
* Update tensorrt_llm/_torch/models/modeling_llama.py
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com>
---------
Signed-off-by: Zihao Kong <zihaok@nvidia.com>
Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-08 05:06:40 +08:00
hlu1
26a2679217
[Deepseek] Refactor Deepseek Decoder layer ( #4016 )
...
Refactor Deepseek Decoder layer
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-08 01:43:10 +08:00
bhsueh_NV
f670a036df
[Qwen3] chore: fix bug of fused_moe on tp > 1 ( #4093 )
...
* fix bug of fused_moe on tp > 1
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* refine codes
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-07 11:06:37 +08:00