TensorRT-LLMs/cpp/tensorrt_llm
Yukun He c678774c99
feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151)
* Several optimizations and fixings on the Autotuner.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on current linear for nvFP4 data type.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use do_preparation keyword to select which part should be executed during before evaluating any tactic.
* Remove try-catch inside moe profiling process.
* Move default tactic -1 to 0 transforms in cpp runner.
* Revise relavant tests.
* Predefined the bucketizing strategy for fused_moe

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Fixing and revising according to reviewer's suggestions.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Use lru_cache for inference pref optimization.
* Revert gen_custom_cache_key feature

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Replace runner with runner id to achieve a serializable cache.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Code clean up and minor fixings.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Move all tunable runners and custom ops into torch_custom_ops.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Treat min_latency_mode as a independent dynamic tensor. Modify get_valid_tactics to suit for it.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-08 14:28:36 +08:00
..
batch_manager feat: use cudaMalloc to allocate kvCache (#3303) 2025-04-08 10:59:14 +08:00
common feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
cutlass_extensions/include/cutlass_extensions feat: Update cutlass (#2981) 2025-03-26 22:36:27 +08:00
executor ucx interface (#3306) 2025-04-07 08:44:34 +08:00
executor_worker Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
kernels feat: enable DeepGEMM by default (#3341) 2025-04-08 13:58:57 +08:00
layers chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
plugins chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
pybind feat: Add option to run disaggregated serving without ctx servers,… (#3243) 2025-04-07 21:56:03 -04:00
runtime chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
thop feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) 2025-04-08 14:28:36 +08:00
CMakeLists.txt [feat] open source fp8_blockscale_gemm (#3071) 2025-04-02 12:12:52 +08:00