Note

Non-breaking: --config <file.yaml> is the preferred flag for passing a YAML configuration file. Existing workflows that use --extra_llm_api_options <file.yaml> continue to work, because that flag is an equivalent alias.

Note

Traffic Patterns: The ISL (Input Sequence Length) and OSL (Output Sequence Length) values in each configuration represent the maximum supported values for that config. Requests exceeding these limits may result in errors.

To handle requests with input sequences longer than the configured ISL, add the following to your config file:

enable_chunked_prefill: true

This enables chunked prefill, which processes a long input sequence in chunks instead of requiring it to fit within a single prefill operation. Enabling chunked prefill does not guarantee optimal performance, because these configs are tuned for the specified ISL/OSL.
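If the config file is generated or patched by a script, the setting can be added programmatically. The following is a minimal sketch, not part of any official tooling: the helper name `ensure_chunked_prefill` is hypothetical, and it assumes `enable_chunked_prefill` is a top-level key in the YAML file. It uses plain text handling from the standard library instead of a YAML parser, so it is only suitable for simple, flat config files.

```python
import re

# Hypothetical helper: makes sure a YAML config (given as text) contains
# a top-level `enable_chunked_prefill: true` entry.
# Assumption: the key lives at the top level, with no indentation.
_KEY_PATTERN = re.compile(r"^enable_chunked_prefill\s*:.*$", re.MULTILINE)
_KEY_LINE = "enable_chunked_prefill: true"


def ensure_chunked_prefill(config_text: str) -> str:
    """Return config_text with chunked prefill enabled."""
    if _KEY_PATTERN.search(config_text):
        # The key already exists (for example set to false): overwrite it.
        return _KEY_PATTERN.sub(_KEY_LINE, config_text, count=1)
    # The key is missing: append it, keeping exactly one trailing newline.
    body = config_text.rstrip("\n")
    prefix = body + "\n" if body else ""
    return prefix + _KEY_LINE + "\n"


if __name__ == "__main__":
    # `some_existing_option` is a placeholder, not a real option name.
    print(ensure_chunked_prefill("some_existing_option: 1\n"), end="")
```

The function is idempotent, so running it more than once on the same file leaves a single `enable_chunked_prefill: true` line. For configs with nested sections or comments that must be preserved, a real YAML library would be a safer choice than this text-based approach.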

Note

The configs here are optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, refer to the Preconfigured Recipes section below, which covers a larger set of traffic patterns and performance profiles.