TensorRT-LLMs/tensorrt_llm/_torch/auto_deploy
Yukun He c678774c99
feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151)
* Several optimizations and fixings on the Autotuner.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on current linear for nvFP4 data type.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use do_preparation keyword to select which part should be executed during before evaluating any tactic.
* Remove try-catch inside moe profiling process.
* Move default tactic -1 to 0 transforms in cpp runner.
* Revise relavant tests.
* Predefined the bucketizing strategy for fused_moe

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Fixing and revising according to reviewer's suggestions.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Use lru_cache for inference pref optimization.
* Revert gen_custom_cache_key feature

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Replace runner with runner id to achieve a serializable cache.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Code clean up and minor fixings.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Move all tunable runners and custom ops into torch_custom_ops.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Treat min_latency_mode as a independent dynamic tensor. Modify get_valid_tactics to suit for it.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-08 14:28:36 +08:00
..
compile perf: [AutoDeploy] Enable AutoDeploy as a backend in trtllm-bench (#3041) 2025-03-26 14:33:14 -07:00
custom_ops feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) 2025-04-08 14:28:36 +08:00
distributed chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
models Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
shim chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
transformations chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
utils Refactor imports inside tensorrt_llm._torch. (#3015) 2025-03-26 11:01:07 +08:00
__init__.py Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00