TensorRT-LLM/tests/unittest/_torch
Neta Zmora 34dc6869f3
[#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
Update TRTLLM Cutlass MoE kernels with ReLU2 activation.

Nemotron-6 requires the ReLU2 (i.e. squared ReLU) MoE activation function.
This PR adds ReLU2 support and, more generally, an API for setting the MoE activation function.
The ReLU2 changes are based on this FlashInfer PR: https://github.com/flashinfer-ai/flashinfer/pull/1954.
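
For reference, ReLU2 is ReLU followed by squaring. A minimal PyTorch sketch of the activation itself (a reference definition, not the fused Cutlass kernel):

```python
import torch

def relu2(x: torch.Tensor) -> torch.Tensor:
    # Squared ReLU: relu(x) ** 2, the MoE activation Nemotron-6 requires.
    r = torch.relu(x)
    return r * r
```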

The PR also switches the Auto Deploy MoE backend for 16-bit and FP8 from
Triton (`torch.ops.auto_deploy.triton_moe_fused`, `torch.ops.auto_deploy.triton_quant_fp8_moe`) to TRTLLM/Cutlass (`torch.ops.auto_deploy.trtllm_moe_fused`, `torch.ops.auto_deploy.trtllm_quant_fp8_moe_fused`).
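
At a call site, the swap amounts to invoking the TRTLLM op in place of the Triton one. A rough sketch, where the argument names (`hidden_states`, `router_logits`, `w1`, `w2`) are illustrative assumptions rather than the ops' documented signatures:

```python
import torch

def fused_moe(hidden_states, router_logits, w1, w2, use_trtllm=True):
    # Hypothetical dispatch sketch: the real signatures live in the
    # auto_deploy custom-op registry; argument names here are assumptions.
    op = (torch.ops.auto_deploy.trtllm_moe_fused if use_trtllm
          else torch.ops.auto_deploy.triton_moe_fused)
    return op(hidden_states, router_logits, w1, w2)
```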

Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
2025-11-13 16:54:45 -08:00
attention [#6507][fix] Fix precision issue due to KV layout mismatch for split/concat kernels (#6917) 2025-11-13 12:14:58 +08:00
auto_deploy [#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011) 2025-11-13 16:54:45 -08:00
compilation [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
debugger Fix: fix nvbug 5356427 (#5464) 2025-06-25 22:24:26 +08:00
executor [None][chore] Remove is_disaggregated param in executor request queue (#9049) 2025-11-12 13:37:15 -05:00
misc [TRTLLM-8511][feat] Add update_weights and sleep_wakeup support for rl integration (#8302) 2025-11-04 10:19:24 -08:00
modeling [TRTLLM-8521][chore] remove circular dependency between model engine and cuda graph runner (#7572) 2025-11-11 10:13:45 -08:00
models/checkpoints/hf [None][feat] Skip prefetching consolidated safetensors when appropriate (#7013) 2025-08-25 23:56:21 -04:00
modules [https://nvbugs/5565565] [fix] Remove waiver (#8450) 2025-11-04 16:42:31 +08:00
multi_gpu [None][infra] Waive failed cases on main 11/05 (#8936) 2025-11-04 22:54:45 -08:00
multi_gpu_modeling [https://nvbugs/5536131][fix] Fix illegal access issue when scale is not provided in Llama3/4. (#7960) 2025-10-16 22:46:19 +08:00
multimodal [None][fix] InputProcessor config naming convention fix (#8705) 2025-11-03 22:29:21 -08:00
ray_orchestrator [None][chore] Relocate rlhf_utils.py (#8938) 2025-11-10 19:03:23 -08:00
sampler [TRTLLM-9175][test] ensure sampling is async (#9076) 2025-11-12 15:27:52 +01:00
speculative [TRTLLM-8084][feat] Enhance the overlap scheduler for two-model spec decoding (#8706) 2025-11-13 10:20:16 -05:00
thop [None][fix] support topk autotuner input for expert slots per group larger than 32 (#9087) 2025-11-14 08:37:20 +08:00
helpers.py [TRTLLM-8521][chore] remove circular dependency between model engine and cuda graph runner (#7572) 2025-11-11 10:13:45 -08:00
pattern_watcher.py [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804) 2025-05-09 11:04:01 +08:00
test_connector.py [None][feat] KV Cache Connector API (#7228) 2025-08-28 23:09:27 -04:00