Mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-02-11 05:23:38 +08:00
Fix AutoTuner warmup request generation.

* The current warmup phase creates a single request, which is insufficient to cover max_num_tokens. Revise the warmup phase to create a batch of requests that covers max_num_tokens, eliminating potential fallback cases.

Refactor the AutoTuner API and reduce host overhead.

* Refine the (min, opt, max) values of the optimization profile setup for get_valid_tactics so that the canImplement definition is correct.
* Refine the cache key assembly process to reduce host overhead and simplify the API.
* Fix lru_cache usage to reduce host overhead.
* Move tuning config initialization into a one-time object in the tunable runner to reduce host overhead.

Improve tuning config readability.

* Use a dataclass to define the tuning config.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
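The warmup and tuning-config changes above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern the commit describes, not the actual TensorRT-LLM AutoTuner API: the names `TuningConfig` and `build_warmup_batch`, and the idea that a single request is capped at `max_seq_len` tokens, are assumptions made for the example.

```python
from dataclasses import dataclass


# Hypothetical tuning config as a frozen dataclass, created once per
# tunable runner (mirroring the commit's "one-time object" refactor).
@dataclass(frozen=True)
class TuningConfig:
    max_num_tokens: int  # total token budget warmup must cover
    max_seq_len: int     # assumed per-request token cap (illustrative)


def build_warmup_batch(cfg: TuningConfig) -> list[int]:
    """Split max_num_tokens across a batch of warmup requests.

    A single warmup request cannot exceed max_seq_len tokens, so one
    request may leave part of the max_num_tokens budget unexercised
    and trigger fallback paths at runtime. Emitting a batch of
    requests covers the full budget, as the commit message describes.
    """
    batch, remaining = [], cfg.max_num_tokens
    while remaining > 0:
        n = min(cfg.max_seq_len, remaining)
        batch.append(n)
        remaining -= n
    return batch


cfg = TuningConfig(max_num_tokens=8192, max_seq_len=3000)
print(build_warmup_batch(cfg))  # [3000, 3000, 2192] -- sums to 8192
```

The frozen dataclass also makes the config hashable, which is what allows it to participate cleanly in `functools.lru_cache`-based cache-key lookups of the kind the commit optimizes.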
Directory listing:

- _torch/
- api_stability/
- bindings/
- disaggregated/
- llmapi/
- others/
- scaffolding/
- tools/
- trt/
- utils/
- conftest.py
- dump_checkpoint_stats.py
- profile_utils.py
- pytest.ini
- test_model_runner_cpp.py
- test_pip_install.py