TensorRT-LLMs/tensorrt_llm/_torch/modules
Anthony Chang bbea2647b1
Qwen3 supports TRTLLM FP4 MoE backend (#4530)
* MoE TRTLLM backend for Qwen3

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* add extra moe_backend to test

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* address comments

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* conditionally compile kernels on newer archs

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* missing positional arg

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* Update the routing kernels

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Revise usage of TLLM_LOG_ERROR

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Add unit test for Qwen3 moe (trtllm_gen backend)

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* improve weight processing speed of moe_backend=TRTLLM; roughly 2x

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* tidy and minor fix

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* temporarily disable accuracy test that has known issue

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

---------

Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
2025-05-23 18:31:08 +08:00
..
mamba Fix create_weights in attention (#3692) 2025-04-24 07:30:00 +08:00
__init__.py Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
attention.py [feat] Integrate Hopper chunked attention kernels (#4330) 2025-05-22 17:10:57 -04:00
decoder_layer.py feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034) 2025-05-16 04:16:53 +08:00
embedding.py feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034) 2025-05-16 04:16:53 +08:00
fused_moe.py Qwen3 supports TRTLLM FP4 MoE backend (#4530) 2025-05-23 18:31:08 +08:00
gated_mlp.py feat: support multi lora adapters and TP (#3885) 2025-05-08 23:45:45 +08:00
linear.py add changes for fp8, nemotron-nas, API (#4180) 2025-05-18 23:27:25 +08:00
logits_processor.py feat: LogitsProcessor in PyTorch backend (#3145) 2025-05-01 14:15:30 -07:00
mlp.py feat: support multi lora adapters and TP (#3885) 2025-05-08 23:45:45 +08:00
moe_load_balancer.py feat: large-scale EP(part 2: MoE Load Balancer - core utilities) (#4384) 2025-05-20 17:53:48 +08:00
multi_stream_utils.py [chore] Make llama4 MoE use maybe_execute_in_parallel (#3779) 2025-04-28 10:58:03 -04:00
rms_norm.py feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034) 2025-05-16 04:16:53 +08:00
rotary_embedding.py feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438) 2025-05-02 13:25:30 +08:00