diff --git a/docs/source/developer-guide/perf-benchmarking.md b/docs/source/developer-guide/perf-benchmarking.md
index 5b01c50601..6fcf8b64fe 100644
--- a/docs/source/developer-guide/perf-benchmarking.md
+++ b/docs/source/developer-guide/perf-benchmarking.md
@@ -8,13 +8,13 @@ Expect breaking API changes.
 ```

 TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
-easier for users to reproduce our officially published [performance overview](../performance/perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows:
+easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the following:

 - A streamlined way to build tuned engines for benchmarking for a variety of models and platforms.
 - An entirely Python workflow for benchmarking.
 - Ability to benchmark various flows and features within TensorRT LLM.

-`trtllm-bench` executes all benchmarks using [in-flight batching] -- for more information see
+`trtllm-bench` executes all benchmarks using `in-flight batching` -- for more information see
 the [in-flight batching section](../features/attention.md#inflight-batching) that describes the
 concept in further detail.
@@ -67,7 +67,7 @@ sudo nvidia-smi boost-slider --vboost

 While `trtllm-bench` should be able to run any network that TensorRT LLM supports, the following
 models have been validated extensively; this is the same list shown on the
-[Performance Overview](../performance/perf-overview.md) page.
+[Performance Overview](./perf-overview.md) page.

 - [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
 - [meta-llama/Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf)
diff --git a/docs/source/legacy/performance/perf-overview.md b/docs/source/developer-guide/perf-overview.md
similarity index 97%
rename from docs/source/legacy/performance/perf-overview.md
rename to docs/source/developer-guide/perf-overview.md
index d354d869aa..0a144a58d4 100644
--- a/docs/source/legacy/performance/perf-overview.md
+++ b/docs/source/developer-guide/perf-overview.md
@@ -238,15 +238,13 @@ RTX 6000 Pro Blackwell Server Edition
 | 20000/2000 | 1,804 | 1,351 |

-
-
-
 ## Reproducing Benchmarked Results

-> [!NOTE] The only models supported in this workflow are those listed in the table above.
+```{note}
+Only the models shown in the table above are supported by this workflow.
+```

-The following tables are references for commands that are used as part of the benchmarking process. For a more detailed
-description of this benchmarking workflow, see the [benchmarking suite documentation](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html).
+The following tables list reference commands used in the benchmarking process. For a more detailed description of this benchmarking workflow, see the [benchmarking suite documentation](./perf-benchmarking.md).

 ### Command Overview

@@ -274,7 +272,7 @@ Starting with v0.19, testing was performed using the PyTorch backend - this work

 ### Preparing a Dataset

-In order to prepare a dataset, you can use the provided [script](../../../benchmarks/cpp/prepare_dataset.py).
+To prepare a dataset, use the provided [script](source:benchmarks/cpp/prepare_dataset.py).
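+
+For illustration, a typical synthetic-dataset invocation looks roughly like the sketch below (a sketch only, not the canonical command: the flag names are assumptions based on the script's `token-norm-dist` mode, and `$tokenizer_dir`, `$num_requests`, `$isl`, and `$osl` are placeholders):
+
+```shell
+# Illustrative sketch -- adjust paths and values for your setup.
+python benchmarks/cpp/prepare_dataset.py \
+  --tokenizer=$tokenizer_dir \
+  --stdout token-norm-dist \
+  --num-requests=$num_requests \
+  --input-mean=$isl --output-mean=$osl \
+  --input-stdev=0 --output-stdev=0 > /tmp/synthetic_dataset.txt
+```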

 To generate a synthetic dataset, run the following command:

 ```shell
@@ -310,7 +308,7 @@ remain in the system longer and therefore require less requests to achieve stead

 To run the benchmark with the generated dataset, use the `trtllm-bench throughput` subcommand. The benchmarker will
 run an offline maximum-throughput scenario in which all requests are queued in rapid succession. You need to provide
-a model name (HuggingFace reference or path to a local model), a [generated dataset](#preparing-a-dataset), and a file containing any desired extra options to the LLMApi (details in [tensorrt_llm/llmapi/llm_args.py:LlmArgs](../../../tensorrt_llm/llmapi/llm_args.py)).
+a model name (HuggingFace reference or path to a local model), a [generated dataset](#preparing-a-dataset), and a file containing any desired extra options for the LLM API (details in [tensorrt_llm/llmapi/llm_args.py:LlmArgs](source:tensorrt_llm/llmapi/llm_args.py)).

 For dense / non-MoE models:
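+
+A minimal sketch of such a run (an illustration, not the canonical command: the flags are assumed from `trtllm-bench`'s CLI, and `$model_name`, `$dataset_file`, and `$llm_options_file` are placeholders):
+
+```shell
+# Illustrative sketch -- offline max-throughput run against the generated dataset.
+trtllm-bench --model $model_name \
+  throughput \
+  --dataset $dataset_file \
+  --backend pytorch \
+  --extra_llm_api_options $llm_options_file
+```
+
+Here `$llm_options_file` is assumed to be a YAML file whose keys mirror the `LlmArgs` fields referenced above.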