docs: clarify ITL acronym in optimization docs (#43922)

Signed-off-by: chunyang.wen <[email protected]>
2026-08-01 11:28:00 +00:00 · 2026-05-29 07:40:05 -07:00
parent 11dfa3169d
commit f191d5630e
1 changed file with 2 additions and 2 deletions
@@ -46,14 +46,14 @@ In V1, **chunked prefill is enabled by default whenever possible**. With chunked

 This policy has two benefits:

- It improves ITL and generation decode because decode requests are prioritized.
+- It improves inter-token latency (ITL) and generation decode because decode requests are prioritized.
 - It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.

 ### Performance Tuning with Chunked Prefill

 You can tune the performance by adjusting `max_num_batched_tokens`:

- Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
+- Smaller values (e.g., 2048) achieve better ITL because there are fewer prefills slowing down decodes.
 - Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
 - For optimal throughput, we recommend setting `max_num_batched_tokens > 8192` especially for smaller models on large GPUs.
 - If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
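
The tuning guidance in this hunk can be illustrated with a toy model of one scheduling step. This is a minimal sketch, not vLLM's scheduler: `Request`, `schedule_step`, and `make_requests` are hypothetical names introduced here, and the only assumptions taken from the text are that decode requests are scheduled first and that pending prefills are chunked to fit whatever remains of the `max_num_batched_tokens` budget.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical stand-in for a scheduler request, for illustration only.
    name: str
    prefill_remaining: int  # prompt tokens not yet processed; 0 means decoding


def schedule_step(requests, max_num_batched_tokens):
    """Toy model of one step: decodes first, then chunked prefills."""
    budget = max_num_batched_tokens
    batch = []  # list of (request name, tokens scheduled in this step)

    # 1. Decode requests are prioritized; each one consumes a single token.
    for req in requests:
        if req.prefill_remaining == 0 and budget > 0:
            batch.append((req.name, 1))
            budget -= 1

    # 2. The leftover budget goes to pending prefills, chunked if they do not fit.
    for req in requests:
        if req.prefill_remaining > 0 and budget > 0:
            chunk = min(req.prefill_remaining, budget)
            batch.append((req.name, chunk))
            req.prefill_remaining -= chunk
            budget -= chunk

    return batch


def make_requests():
    # Two requests already decoding plus one new 5000-token prompt.
    return [
        Request("decode-a", 0),
        Request("decode-b", 0),
        Request("prefill-c", 5000),
    ]


# Small budget: the prompt is split, so each step carries less prefill work.
small = schedule_step(make_requests(), max_num_batched_tokens=2048)

# Large budget: the whole prompt fits in one step alongside the decodes.
large = schedule_step(make_requests(), max_num_batched_tokens=8192)
```

In this toy model the 2048-token budget schedules only 2046 prompt tokens next to the two decodes, which is the intuition behind the ITL benefit of smaller values. The 8192-token budget finishes the 5000-token prefill in a single step, which is the intuition behind the TTFT benefit of larger values.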