* generalizing cudagraph to multiple dynamic inputs
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* fix for failing test
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
---------
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* add test to map flashinfer rope op with triton custom rope ops and pytorch rope in fused_mha
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add rope matcher and unit tests
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* capture cos and sin from graph
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* revert fuse_mha op change
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor update to address comment and remove redundant unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* move view and transpose into graph nodes and update unit test to test custom op directly
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* move view into custom op, update bfs with bound, update custom op return type to be half precision
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* custom op update to support 3D input
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* handle bnsd and bsnd format, update tests, handle 3D cos/sin input to the custom op
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add llama4 rope test, update custom op with is_neox flag
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add llama4 style rope to matcher and update unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* separate into two transformations
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* fix when num_head != num_kv_head; add support for cached position_ids and cos_sin_cache in graph; update unit tests
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor update, cache locally and propagate meta info of qk nodes
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor: fix cos_sin_cache not float
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor: move cache into matcher
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
---------
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* Several optimizations and fixings on the Autotuner.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Apply the new Python side Autotuner on current linear for nvFP4 data type.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Apply the new Python side Autotuner on MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use do_preparation keyword to select which part should be executed during before evaluating any tactic.
* Remove try-catch inside moe profiling process.
* Move default tactic -1 to 0 transforms in cpp runner.
* Revise relavant tests.
* Predefined the bucketizing strategy for fused_moe
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Fixing and revising according to reviewer's suggestions.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Use lru_cache for inference pref optimization.
* Revert gen_custom_cache_key feature
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Replace runner with runner id to achieve a serializable cache.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Code clean up and minor fixings.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Move all tunable runners and custom ops into torch_custom_ops.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Treat min_latency_mode as a independent dynamic tensor. Modify get_valid_tactics to suit for it.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
---------
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>