The current warmup phase creates a single request, which is not enough to cover max_num_tokens. Revise the warmup phase to issue a batch of requests that covers max_num_tokens, eliminating potential fallback cases.
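A minimal sketch of how a warmup batch could be sized to reach max_num_tokens. The helper name `plan_warmup_batch` and the `max_seq_len` and `max_batch_size` limits are assumptions made for illustration; they are not taken from this change.

```python
from __future__ import annotations


def plan_warmup_batch(max_num_tokens: int, max_seq_len: int,
                      max_batch_size: int) -> list[int]:
    """Return per-request token counts for a warmup batch.

    Hypothetical helper: splits max_num_tokens into requests of at most
    max_seq_len tokens each, using at most max_batch_size requests.
    """
    if min(max_num_tokens, max_seq_len, max_batch_size) <= 0:
        raise ValueError("all limits must be positive")

    lengths: list[int] = []
    remaining = max_num_tokens
    # Keep adding requests until the token budget is covered
    # or the batch is full.
    while remaining > 0 and len(lengths) < max_batch_size:
        take = min(remaining, max_seq_len)
        lengths.append(take)
        remaining -= take
    return lengths
```

With a single request the warmup can cover at most `max_seq_len` tokens; a batch planned this way reaches max_num_tokens whenever `max_seq_len * max_batch_size` is large enough.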
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Revert "[nvbugs/5274894] fix: Moving finished context requests to generation (#4576)"
This reverts commit d39bcb6b40.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: Sort requests for functional correctness and performance
- Moved sorting related logic to a dedicated function for better clarity and maintainability.
- Enhanced sorting logic to separate finished context requests from ongoing ones before sorting by LoRA task ID.
- Updated function documentation to reflect the sorting behavior and its purpose.
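The sorting behavior described above can be sketched in Python as follows. This is an illustration only: the `Request` type, its field names, the choice to place ongoing requests before finished ones, and the placement of requests without a LoRA task are all assumptions, not details confirmed by this log.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    # Hypothetical stand-in for a context request.
    request_id: int
    context_finished: bool
    lora_task_id: Optional[int] = None


def sort_context_requests(requests: List[Request]) -> List[Request]:
    """Group by finished state, then sort each group by LoRA task ID."""

    def lora_key(req: Request):
        # Assumption: requests without a LoRA task sort first.
        return (req.lora_task_id is not None, req.lora_task_id or 0)

    # sorted() is stable, so requests with equal keys keep their order.
    ongoing = sorted((r for r in requests if not r.context_finished),
                     key=lora_key)
    finished = sorted((r for r in requests if r.context_finished),
                      key=lora_key)
    # Assumption: ongoing requests come first, finished ones last.
    return ongoing + finished
```

Keeping the two groups separate means sorting by LoRA task ID never interleaves finished and ongoing context requests.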
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Restore per-channel pre-quant
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
* Update TRT test script
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
* Fix pre-commit
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
---------
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
* Unwaive test for Qwen model.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* update.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
---------
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* extend pyt nano tests perf coverage
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
* explicitly set maxnt for some cases
This is because the test harness defaults to no prefill chunking, which means the specified isl is the true context length.
When not explicitly specified in the test harness, the `maxnt` passed down to `trtllm-bench` is 2048.
As a result, `trtllm-bench` gets conflicting inputs when isl>2048 but maxnt=2048, so maxnt is overridden to be consistent with isl in such cases.
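The rule behind the override can be written as a small helper. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and the change itself sets maxnt explicitly in the affected test cases rather than computing it this way.

```python
from typing import Optional

# Default maxnt passed to trtllm-bench when the harness does not set one,
# as stated in the commit message.
DEFAULT_MAX_NUM_TOKENS = 2048


def resolve_max_num_tokens(isl: int,
                           max_num_tokens: Optional[int] = None,
                           chunked_prefill: bool = False) -> int:
    """Return a maxnt value that is consistent with the input length.

    Hypothetical helper: without prefill chunking the whole context is
    processed at once, so maxnt must be at least isl.
    """
    value = (DEFAULT_MAX_NUM_TOKENS
             if max_num_tokens is None else max_num_tokens)
    if not chunked_prefill and isl > value:
        return isl
    return value
```

For example, `resolve_max_num_tokens(5000)` returns 5000 instead of the 2048 default, while `resolve_max_num_tokens(1000)` leaves the default in place.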
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
---------
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
fix: Moving finished context requests to generation
- Unfinished chunked context requests appear at the end of the context requests vector.
- Replaced std::find_if with std::partition to find the correct position to move finished context requests to generation.
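A Python illustration of why a linear search and a partition can give different split positions. The actual change is in C++; the helper names, the tuple layout, and the predicate here are assumptions, and the original predicate is not shown in this log. Note that this analogue preserves relative order, which `std::partition` does not guarantee (`std::stable_partition` does).

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_not_matching(items: Sequence[T],
                       pred: Callable[[T], bool]) -> int:
    """Analogue of std::find_if with a negated predicate: no reordering."""
    for index, item in enumerate(items):
        if not pred(item):
            return index
    return len(items)


def partition(items: Sequence[T],
              pred: Callable[[T], bool]) -> Tuple[List[T], int]:
    """Reorder so matching items come first; return list and split point."""
    front = [item for item in items if pred(item)]
    back = [item for item in items if not pred(item)]
    return front + back, len(front)


# (name, context_finished) pairs; finished requests are not contiguous.
requests = [("a", True), ("b", False), ("c", True), ("d", False)]


def is_finished(request: Tuple[str, bool]) -> bool:
    return request[1]


# The search stops at "b" and never sees that "c" is also finished.
search_split = first_not_matching(requests, is_finished)
# The partition gathers every finished request before the split point.
reordered, partition_split = partition(requests, is_finished)
```

The search only gives the right position when the matching requests are already contiguous; the partition makes no such assumption about the input order.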
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Fixed the missing mrope argument in the summary tasks for Qwen models and re-enabled the now-fixed tests.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* docs: Add KV Cache Management documentation
* Introduced a new document detailing the hierarchy and event system for KV cache management, including definitions for Pool, Block, and Page.
* Updated the index.rst to include a reference to the new kv-cache-management.md file.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Update docs/source/advanced/kv-cache-management.md
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Update KV Cache Pool Management
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* docs: Add cross-file links
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* docs: Clarify tokens_per_block
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* docs: Clarify acronyms
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
* Fix TRTLLMSampler.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added type hint.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Chore: waive torch compile test cases of deepseek v3 lite (#4508)
waive torch compile test cases of deepseek v3 lite
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>