Mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-01-14 06:27:45 +08:00
doc: fix invalid link in llama 4 example documentation (#6340)
Signed-off-by: Liana Koleva <43767763+lianakoleva@users.noreply.github.com>
This commit is contained in:
parent 54f68287fc
commit 96d004d800
@@ -134,7 +134,7 @@ python -m tensorrt_llm.serve.scripts.benchmark_serving \
- `max_batch_size` and `max_num_tokens` have a significant impact on performance. The default values are carefully chosen and should deliver good performance in most cases, but you may still need to tune them for peak performance (see the configuration sketch after this list).
- `max_batch_size` should not be set so low that it bottlenecks throughput. Note that with Attention DP, the whole system's effective max batch size is `max_batch_size*dp_size`.
- The CUDA graph `max_batch_size` should be set to the same value as the TensorRT-LLM server's `max_batch_size`.
-- For more details on `max_batch_size` and `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).
+- For more details on `max_batch_size` and `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../../../../docs/source/performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).
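As a rough illustration of how the knobs above fit together, the sketch below starts `trtllm-serve` with explicit `max_batch_size`/`max_num_tokens` values and an extra-options YAML that pins the CUDA graph batch size to the same value and enables Attention DP. The concrete numbers, the model placeholder, and the YAML field spellings (`cuda_graph_config`, `enable_attention_dp`) are illustrative assumptions; verify them against the options of your TensorRT-LLM release before copying.

```bash
# Sketch only: option and field names follow recent TensorRT-LLM releases and may differ in yours.
# Keep cuda_graph_config.max_batch_size equal to the server's --max_batch_size.
cat > extra_llm_api_options.yaml <<'EOF'
enable_attention_dp: true   # with dp_size = 8, the whole system's max batch size is 256 * 8 = 2048
cuda_graph_config:
  enable_padding: true
  max_batch_size: 256       # same value as --max_batch_size below
EOF

trtllm-serve <model_checkpoint_path> \
  --tp_size 8 \
  --max_batch_size 256 \
  --max_num_tokens 8192 \
  --extra_llm_api_options extra_llm_api_options.yaml
```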
### Troubleshooting