From f191d5630e371a575bee2ab4eedc9332dcfd43f9 Mon Sep 17 00:00:00 2001
From: Chunyang Wen
Date: Fri, 29 May 2026 22:40:05 +0800
Subject: [PATCH] docs: clarify ITL acronym in optimization docs (#43922)

Signed-off-by: chunyang.wen
---
 docs/configuration/optimization.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/configuration/optimization.md b/docs/configuration/optimization.md
index eb6bdce37b9..5bf789a0919 100644
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -46,14 +46,14 @@ In V1, **chunked prefill is enabled by default whenever possible**. With chunked
 
 This policy has two benefits:
 
-- It improves ITL and generation decode because decode requests are prioritized.
+- It improves inter-token latency (ITL) and generation decode because decode requests are prioritized.
 - It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.
 
 ### Performance Tuning with Chunked Prefill
 
 You can tune the performance by adjusting `max_num_batched_tokens`:
 
-- Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
+- Smaller values (e.g., 2048) achieve better ITL because there are fewer prefills slowing down decodes.
 - Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
 - For optimal throughput, we recommend setting `max_num_batched_tokens > 8192` especially for smaller models on large GPUs.
 - If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
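
To illustrate the policy described in the touched section, here is a minimal Python sketch of decode-first batching under a token budget. It is a toy model and not vLLM's scheduler: `Request` and `schedule_step` are hypothetical names introduced for this example, and only `max_num_batched_tokens` comes from the documentation itself. It assumes each decode request consumes one token of budget per step.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record for illustration; not a vLLM class."""
    name: str
    prompt_tokens_left: int  # prompt tokens still waiting to be prefilled

    @property
    def is_decoding(self) -> bool:
        # A request decodes once its whole prompt has been prefilled.
        return self.prompt_tokens_left == 0


def schedule_step(requests, max_num_batched_tokens):
    """Build one batch: decodes first, then prefill chunks from the leftover budget.

    Returns a dict mapping request name -> tokens scheduled in this step.
    """
    budget = max_num_batched_tokens
    batch = {}

    # 1. Decode requests are prioritized; each is assumed to need one token.
    for req in requests:
        if req.is_decoding and budget > 0:
            batch[req.name] = 1
            budget -= 1

    # 2. Pending prefills use whatever budget remains, chunked if they do not fit.
    for req in requests:
        if not req.is_decoding and budget > 0:
            chunk = min(req.prompt_tokens_left, budget)
            batch[req.name] = chunk
            req.prompt_tokens_left -= chunk
            budget -= chunk

    return batch


if __name__ == "__main__":
    demo = [Request("a", 0), Request("b", 0), Request("c", 5000)]
    # Two decodes take 1 token each; the 5000-token prompt gets a 2046-token chunk.
    print(schedule_step(demo, max_num_batched_tokens=2048))
```

In this toy model, a smaller budget caps how much prefill work shares a batch with the decodes, which is the intuition behind the ITL bullet. A larger budget lets a prompt finish prefill in fewer steps, which is the intuition behind the TTFT bullet.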