mirror of
https://github.com/NVIDIA/TensorRT-LLM.git
synced 2026-01-14 06:27:45 +08:00
* disable overlap in encoder Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: invokeGatherBatch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: overlap same batch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: add enableTrtOverlap to ExecutorConfig Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * disable overlap for beam search and spec decode Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * skip overlap tests with beam search or speculative decoding Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * enable overlap in GptChunkedLongContextTests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Enable overlap in gptManagerBenchmark Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Improve early exit Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use OptionalRef for newOutputTokens tensor Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Add overlap scheduling support to TRTLLMDecoder - Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter. - Modified the decoder's internal logic to utilize the overlap scheduling feature. - Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach. - Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: allNewTokens in PP Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> |
||
|---|---|---|
| .. | ||
| cpp | ||
| python | ||
| README.md | ||
TensorRT-LLM Benchmarks
Overview
There are currently three workflows to benchmark TensorRT-LLM:
- C++ benchmarks
- The recommended workflow that uses TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
- Python benchmarks
- The Python benchmarking scripts can only benchmark the Python runtime, which do not support the latest features, such as in-flight batching.
- The Python benchmarking suite
- This benchmarker is native to TensorRT-LLM and is a Python benchmarker for reproducing and testing the performance of TensorRT-LLM.
- NOTE: This benchmarking suite is a current work in progress and is prone to large changes.