[TRTLLM-7030][fix] BREAKING CHANGE: Mismatch between docs and actual commands (#7191)

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Shi Xiaowei 2025-08-25 20:21:43 +08:00 committed by GitHub
parent 5d165186d5
commit d010b2043a
50 changed files with 144 additions and 140 deletions

View File

@ -336,7 +336,7 @@ cd cpp/build
`disaggServerBenchmark` only supports `decoder-only` models.
Here is the basic usage:
```
export TRTLLM_USE_MPI_KVCACHE=1
export TRTLLM_USE_UCX_KVCACHE=1
mpirun -n ${proc} benchmarks/disaggServerBenchmark --context_engine_dirs ${context_engine_0},${context_engine_1}...,${context_engine_{m-1}} \
--generation_engine_dirs ${generation_engine_0},${generation_engine_1}...,${generation_engine_{n-1}} --dataset ${dataset_path}
```
@ -344,7 +344,7 @@ This command will launch m context engines and n generation engines. You need to
for example:
```
export TRTLLM_USE_MPI_KVCACHE=1
export TRTLLM_USE_UCX_KVCACHE=1
mpirun -n 7 benchmarks/disaggServerBenchmark --context_engine_dirs ${llama_7b_tp2_pp1_dir},${llama_7b_tp1_pp1_dir} --generation_engine_dirs ${llama_7b_tp1_pp1_dir},${llama_7b_tp2_pp1_dir} --dataset ${dataset_path}
# need 6 GPUs and 7 processes to launch the benchmark.
```
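For reference, the GPU and process counts in this example follow directly from the engine parallelism. A small sketch of the arithmetic, assuming one MPI rank per engine GPU plus one extra rank for the benchmark process itself (an assumption worth verifying against your TRT-LLM version):
```python
# Illustrative rank arithmetic for the example above. Assumption: one MPI rank
# per engine GPU, plus one rank for the benchmark orchestrator process.
context_engine_ranks = [2, 1]      # llama_7b_tp2_pp1, llama_7b_tp1_pp1
generation_engine_ranks = [1, 2]   # llama_7b_tp1_pp1, llama_7b_tp2_pp1

gpus_needed = sum(context_engine_ranks) + sum(generation_engine_ranks)  # 6 GPUs
mpi_ranks = gpus_needed + 1                                             # 7 -> mpirun -n 7
print(gpus_needed, mpi_ranks)
```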

View File

@ -66,17 +66,6 @@ A. Yes, it's recommended that different executors use different GPUs. We support
### Debugging FAQs
*Q. How to handle the error `Disaggregated serving is not enabled, please check the configuration`?*
A. Please set the `backendType` of `CacheTransceiverConfig`.
```cpp
ExecutorConfig executorConfig{...};
executorConfig.setCacheTransceiverConfig(texec::CacheTransceiverConfig(BackendType::DEFAULT));
```
When the environment variable `TRTLLM_USE_MPI_KVCACHE=1` is set, TRT-LLM will transfer the KV cache using `CUDA-aware MPI`. All executor processes involved must share the same MPI world communicator. Consequently, with `TRTLLM_USE_MPI_KVCACHE=1`, TRT-LLM only supports launching multiple executors via `MPI`. Additionally, the `CommunicationMode` for the executors must be set to `kLEADER` or `kORCHESTRATOR` with `SpawnProcesses=false` for the `disaggregated-service`. These restrictions do not apply when `TRTLLM_USE_UCX_KVCACHE=1` is set.
*Q. Does TRT-LLM support using GPU direct RDMA for inter-node KV Cache transfer?*
A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.

View File

@ -124,10 +124,10 @@ From the `examples/cpp/executor/build` folder, you can also run the `executorExa
```
./executorExampleDisaggregated -h
```
Note setting `TRTLLM_USE_MPI_KVCACHE=1` is required to run disaggregated executor.
Note that setting `TRTLLM_USE_UCX_KVCACHE=1` is required to run the disaggregated executor.
For example, you can run:
```
export TRTLLM_USE_MPI_KVCACHE=1
export TRTLLM_USE_UCX_KVCACHE=1
mpirun -n <num_ranks> --allow-run-as-root --oversubscribe ./executorExampleDisaggregated --context_engine_dir <path_to_context_engine_dir> --context_rank_size <num_ranks_for_context> --generation_engine_dir <path_to_generation_engine_dir> --generation_rank_size <num_ranks_for_generation> --input_tokens ../inputTokens.csv

View File

@ -12,24 +12,39 @@ cache_transceiver_config:
max_tokens_in_buffer: <int>
```
`backend` specifies the communication backend for transferring the kvCache, valid options include `DEFAULT`,`UCX`, `NIXL`, and `MPI`, the default backend is UCX.
`backend` specifies the communication backend for transferring the KV cache. Valid options include `DEFAULT`, `UCX`, `NIXL`, and `MPI`; the default backend is `UCX`.
`max_tokens_in_buffer` defines the buffer size for kvCache transfers, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests for optimal performance.
`max_tokens_in_buffer` defines the buffer size for KV cache transfers. For optimal performance, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) across all requests.
You can use multiple `trtllm-serve` commands to launch the context and generation servers that will be used
for disaggregated serving. For example, you could launch two context servers and one generation servers as follows:
You can use multiple `trtllm-serve` commands to launch the context and generation servers required for disaggregated serving. For instance, you might start two context servers and one generation server as shown below.
Begin by creating `ctx_extra-llm-api-config.yml` and `gen_extra-llm-api-config.yml` with the contents shown below.
```yaml
# ctx_extra-llm-api-config.yml
# The overlap scheduler for context servers is currently disabled, as it is
# not yet supported in disaggregated context server architectures.
disable_overlap_scheduler: True
cache_transceiver_config:
backend: UCX
max_tokens_in_buffer: 2048
```
```yaml
# gen_extra-llm-api-config.yml
cache_transceiver_config:
backend: UCX
max_tokens_in_buffer: 2048
```
Then, start the context and generation servers separately.
```bash
# Generate context_extra-llm-api-config.yml
# Overlap scheduler for context servers are disabled because it's not supported for disaggregated context servers yet
echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > context_extra-llm-api-config.yml
# Start context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_1 &
# Generate gen_extra-llm-api-config.yml
echo -e "cache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > gen_extra-llm-api-config.yml
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_1 &
# Start generation servers
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --extra_llm_api_options ./gen_extra-llm-api-config.yml &> log_gen_0 &
@ -95,8 +110,8 @@ After this, you can enable the dynamic scaling feature for the use case above as
export TRTLLM_USE_UCX_KVCACHE=1
# Context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &
# Generation servers
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --server_role GENERATION --extra_llm_api_options ./gen_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_gen_0 &
@ -180,4 +195,4 @@ trtllm-serve disaggregated -c disagg_config.yaml
## Known Issues
The MPI communication backend for kvCache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and kvCache transfer.
The MPI communication backend for KV cache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and KV cache transfer.

View File

@ -11,7 +11,7 @@ context_servers:
kv_cache_config:
free_gpu_memory_fraction: 0.2
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8001"
generation_servers:
@ -19,6 +19,6 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8002"

View File

@ -197,7 +197,7 @@ def gen_config_file(config_path: str,
},
'cache_transceiver_config': {
'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
'backend': 'default',
'backend': 'DEFAULT',
},
},
'generation_servers': {
@ -225,7 +225,7 @@ def gen_config_file(config_path: str,
},
'cache_transceiver_config': {
'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
'backend': 'default',
'backend': 'DEFAULT',
},
'stream_interval': 20,
}

View File

@ -1039,7 +1039,7 @@ class CacheTransceiverConfig(StrictBaseModel, PybindMirror):
Configuration for the cache transceiver.
"""
backend: Optional[Literal["default", "ucx", "nixl", "mpi"]] = Field(
backend: Optional[Literal["DEFAULT", "UCX", "NIXL", "MPI"]] = Field(
default=None,
description=
"The communication backend type to use for the cache transceiver.")

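Since the accepted spellings are now uppercase only, lowercase values such as `"ucx"` should be rejected by pydantic validation. A minimal sketch of the expected behavior, assuming `CacheTransceiverConfig` is importable from `tensorrt_llm.llmapi` (the exact import path may differ between versions):
```python
# Minimal sketch of the uppercase-only backend Literal; the import path is an
# assumption and may vary by TensorRT-LLM version.
from tensorrt_llm.llmapi import CacheTransceiverConfig

# Uppercase names match the updated Literal and are accepted.
config = CacheTransceiverConfig(backend="UCX", max_tokens_in_buffer=1024)
assert config.backend == "UCX"

# Lowercase names are expected to fail validation after this change.
try:
    CacheTransceiverConfig(backend="ucx")
except ValueError as err:  # pydantic's ValidationError derives from ValueError
    print(f"rejected: {err}")
```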
View File

@ -260,7 +260,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
"disable_overlap_scheduler": True,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
@ -269,7 +269,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
"disable_overlap_scheduler": True,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
@ -309,8 +309,8 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
gen_server_config = {
"disable_overlap_scheduler": disable_overlap_scheduler
}
ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
gen_server_config["cache_transceiver_config"] = {"backend": "default"}
ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
disaggregated_server_config = {
"hostname": "localhost",
"port": 8000,
@ -351,7 +351,7 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
"disable_overlap_scheduler": True,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
@ -359,7 +359,7 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
"speculative_config": speculative_decoding_config,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
disaggregated_server_config = {
@ -404,7 +404,7 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
"max_num_tokens": 13393 * 2,
"max_batch_size": 1,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
},
"cuda_graph_config": None,
}
@ -418,7 +418,7 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
"max_num_tokens": 13393 * 2,
"max_batch_size": 16,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
},
"cuda_graph_config": None,
}
@ -472,8 +472,8 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
def test_auto_dtype(self, overlap_scheduler):
ctx_server_config = {"disable_overlap_scheduler": True}
gen_server_config = {"disable_overlap_scheduler": overlap_scheduler}
ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
gen_server_config["cache_transceiver_config"] = {"backend": "default"}
ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
# Keep this low to avoid warmup OOM in CI
ctx_server_config["max_seq_len"] = 8192
gen_server_config["max_seq_len"] = 8192
@ -513,13 +513,13 @@ class TestDeepSeekV3Lite(LlmapiAccuracyTestHarness):
ctx_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
gen_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
disaggregated_server_config = {
@ -550,8 +550,8 @@ class TestDeepSeekV3Lite(LlmapiAccuracyTestHarness):
def test_auto_dtype(self, overlap_scheduler, mtp_nextn):
ctx_server_config = {"disable_overlap_scheduler": True}
gen_server_config = {"disable_overlap_scheduler": not overlap_scheduler}
ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
gen_server_config["cache_transceiver_config"] = {"backend": "default"}
ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
if mtp_nextn > 0:
ctx_server_config["speculative_config"] = {
"decoding_type": "MTP",
@ -597,14 +597,14 @@ class TestGemma3_1BInstruct(LlmapiAccuracyTestHarness):
"disable_overlap_scheduler": True,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
"disable_overlap_scheduler": overlap_scheduler,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
ctx_server_config["kv_cache_config"] = {
@ -648,13 +648,13 @@ class TestQwen3_8B(LlmapiAccuracyTestHarness):
ctx_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
gen_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
ctx_server_config["cache_transceiver_config"]
@ -686,14 +686,14 @@ class TestQwen3_8B(LlmapiAccuracyTestHarness):
"disable_overlap_scheduler": True,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
"disable_overlap_scheduler": overlap_scheduler,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
disaggregated_server_config = {

View File

@ -21,7 +21,7 @@ context_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
- "localhost:8002"
@ -35,7 +35,7 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
kv_cache_config:
enable_block_reuse: True
enable_partial_reuse: False

View File

@ -17,7 +17,7 @@ context_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8001"
- "localhost:8002"
@ -33,7 +33,7 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8003"
- "localhost:8004"

View File

@ -15,7 +15,7 @@ context_servers:
enable_partial_reuse: True
event_buffer_max_size: 1024
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -30,6 +30,6 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.05
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -15,7 +15,7 @@ context_servers:
enable_partial_reuse: True
event_buffer_max_size: 1024
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -30,6 +30,6 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.05
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -18,7 +18,7 @@ context_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.15
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -33,6 +33,6 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.15
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -18,7 +18,7 @@ context_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.15
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -33,6 +33,6 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.15
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -16,7 +16,7 @@ context_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -31,6 +31,6 @@ generation_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -16,7 +16,7 @@ context_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -31,6 +31,6 @@ generation_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -16,7 +16,7 @@ context_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -31,6 +31,6 @@ generation_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -10,7 +10,7 @@ context_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -18,6 +18,6 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -14,7 +14,7 @@ context_servers:
pipeline_parallel_size: 1
enable_attention_dp: true
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -23,6 +23,6 @@ generation_servers:
pipeline_parallel_size: 1
enable_attention_dp: false
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -14,7 +14,7 @@ context_servers:
enable_attention_dp: true
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -24,6 +24,6 @@ generation_servers:
enable_attention_dp: true
disable_overlap_scheduler: False
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -14,7 +14,7 @@ context_servers:
pipeline_parallel_size: 1
enable_attention_dp: true
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -25,4 +25,4 @@ generation_servers:
urls:
- "localhost:8002"
cache_transceiver_config:
backend: default
backend: DEFAULT

View File

@ -16,7 +16,7 @@ context_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -31,6 +31,6 @@ generation_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -10,7 +10,7 @@ context_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -18,7 +18,7 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"
- "localhost:8003"

View File

@ -8,7 +8,7 @@ context_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -16,7 +16,7 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"
- "localhost:8003"

View File

@ -10,7 +10,7 @@ context_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -18,6 +18,6 @@ generation_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -11,7 +11,7 @@ context_servers:
pipeline_parallel_size: 1
enable_attention_dp: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -20,6 +20,6 @@ generation_servers:
pipeline_parallel_size: 1
enable_attention_dp: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -11,7 +11,7 @@ context_servers:
pipeline_parallel_size: 1
enable_attention_dp: true
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -20,6 +20,6 @@ generation_servers:
pipeline_parallel_size: 1
enable_attention_dp: false
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -14,7 +14,7 @@ context_servers:
pipeline_parallel_size: 1
enable_attention_dp: true
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -23,7 +23,7 @@ generation_servers:
pipeline_parallel_size: 1
enable_attention_dp: false
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -11,7 +11,7 @@ context_servers:
enable_attention_dp: True
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -21,6 +21,6 @@ generation_servers:
enable_attention_dp: True
disable_overlap_scheduler: False
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -10,7 +10,7 @@ context_servers:
enable_attention_dp: true
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -22,6 +22,6 @@ generation_servers:
enable_padding: False
disable_overlap_scheduler: False
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -9,7 +9,7 @@ context_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "mpi"
backend: "MPI"
urls:
- "localhost:8001"
generation_servers:
@ -17,6 +17,6 @@ generation_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "mpi"
backend: "MPI"
urls:
- "localhost:8002"

View File

@ -9,7 +9,7 @@ context_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "nixl"
backend: "NIXL"
urls:
- "localhost:8001"
generation_servers:
@ -17,6 +17,6 @@ generation_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "nixl"
backend: "NIXL"
urls:
- "localhost:8002"

View File

@ -9,7 +9,7 @@ context_servers:
pipeline_parallel_size: 1
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -20,6 +20,6 @@ generation_servers:
enable_padding: False
disable_overlap_scheduler: False
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -9,7 +9,7 @@ context_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "ucx"
backend: "UCX"
urls:
- "localhost:8001"
generation_servers:
@ -17,6 +17,6 @@ generation_servers:
tensor_parallel_size: 2
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "ucx"
backend: "UCX"
urls:
- "localhost:8002"

View File

@ -16,7 +16,7 @@ context_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -31,6 +31,6 @@ generation_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -16,7 +16,7 @@ context_servers:
batch_sizes: [1,3000]
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -34,6 +34,6 @@ generation_servers:
batch_sizes: [1,4,8,16,24,32]
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -10,7 +10,7 @@ context_servers:
max_num_tokens: 512
max_batch_size: 64
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -18,6 +18,6 @@ generation_servers:
max_num_tokens: 256
max_batch_size: 32
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -14,7 +14,7 @@ generation_servers:
enable_block_reuse: False
enable_partial_reuse: False
cache_transceiver_config:
backend: default
backend: DEFAULT
print_iter_log: True
urls:
- "localhost:8002"

View File

@ -17,7 +17,7 @@ context_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -32,6 +32,6 @@ generation_servers:
free_gpu_memory_fraction: 0.2
enable_partial_reuse: False
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -13,7 +13,7 @@ generation_servers:
enable_block_reuse: False
enable_partial_reuse: False
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"
- "localhost:8003"

View File

@ -19,7 +19,7 @@ context_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
- "localhost:8002"
@ -38,7 +38,7 @@ generation_servers:
enable_partial_reuse: False
disable_overlap_scheduler: False
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8003"
- "localhost:8004"

View File

@ -10,7 +10,7 @@ context_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -18,7 +18,7 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
- "localhost:8002"

View File

@ -9,7 +9,7 @@ context_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8001"
generation_servers:
@ -17,7 +17,7 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8002"
speculative_config:

View File

@ -16,7 +16,7 @@ context_servers:
enable_partial_reuse: False
disable_overlap_scheduler: True
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -31,6 +31,6 @@ generation_servers:
enable_partial_reuse: False
disable_overlap_scheduler: False
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -10,7 +10,7 @@ context_servers:
kv_cache_config:
free_gpu_memory_fraction: 0.2
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@ -18,6 +18,6 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"

View File

@ -16,7 +16,7 @@ context_servers:
free_gpu_memory_fraction: 0.2
enable_partial_reuse: False
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
disable_overlap_scheduler: True
urls:
- "localhost:8001"
@ -32,7 +32,7 @@ generation_servers:
free_gpu_memory_fraction: 0.2
enable_partial_reuse: False
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
disable_overlap_scheduler: False
urls:
- "localhost:8002"

View File

@ -1276,8 +1276,8 @@ def test_disaggregated_benchmark_on_diff_backends(
if "DeepSeek-V3-Lite" in benchmark_model_root and "fp8" in benchmark_model_root and get_sm_version(
) != 90:
pytest.skip("The test should only run on Hopper")
nixl_config = get_config_for_benchmark(benchmark_model_root, "nixl")
ucx_config = get_config_for_benchmark(benchmark_model_root, "ucx")
nixl_config = get_config_for_benchmark(benchmark_model_root, "NIXL")
ucx_config = get_config_for_benchmark(benchmark_model_root, "UCX")
temp_dir = tempfile.TemporaryDirectory()
nixl_config_path = os.path.join(temp_dir.name, "nixl_config.yaml")
ucx_config_path = os.path.join(temp_dir.name, "ucx_config.yaml")

View File

@ -244,7 +244,7 @@ def create_config_files(config):
context_config_content = """pytorch_backend_config:
disable_overlap_scheduler: True
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
max_tokens_in_buffer: 2048"""
with open(CONTEXT_CONFIG_FILE, 'w') as file:
@ -252,7 +252,7 @@ cache_transceiver_config:
# Create generation config file
generation_config_content = """cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
max_tokens_in_buffer: 2048"""
with open(GENERATION_CONFIG_FILE, 'w') as file:

View File

@ -131,7 +131,7 @@ def verify_disaggregated(model, generation_overlap, enable_cuda_graph, prompt,
kv_cache_configs = [KvCacheConfig(max_tokens=2048 * 8) for _ in range(2)]
cache_transceiver_configs = [
CacheTransceiverConfig(backend="default") for _ in range(2)
CacheTransceiverConfig(backend="DEFAULT") for _ in range(2)
]
model_names = [model_path(model) for _ in range(2)]
ranks = [0, 1]
@ -274,7 +274,7 @@ def test_disaggregated_llama_context_capacity(model, enable_cuda_graph,
for _ in range(2)
]
cache_transceiver_configs = [
CacheTransceiverConfig(backend="default") for _ in range(2)
CacheTransceiverConfig(backend="DEFAULT") for _ in range(2)
]
model_names = [model_path(model) for _ in range(2)]
ranks = [0, 1]
@ -377,7 +377,7 @@ def test_disaggregated_spec_dec_batch_slot_limit(model, spec_dec_model_path,
for _ in range(2)
]
cache_transceiver_configs = [
CacheTransceiverConfig(backend="default") for _ in range(2)
CacheTransceiverConfig(backend="DEFAULT") for _ in range(2)
]
model_names = [model_path(model) for _ in range(2)]
ranks = [0, 1]

View File

@ -661,15 +661,15 @@ class TestStrictBaseModelArbitraryArgs:
def test_cache_transceiver_config_arbitrary_args(self):
"""Test that CacheTransceiverConfig rejects arbitrary arguments."""
# Valid arguments should work
config = CacheTransceiverConfig(backend="ucx",
config = CacheTransceiverConfig(backend="UCX",
max_tokens_in_buffer=1024)
assert config.backend == "ucx"
assert config.backend == "UCX"
assert config.max_tokens_in_buffer == 1024
# Arbitrary arguments should be rejected
with pytest.raises(
pydantic_core._pydantic_core.ValidationError) as exc_info:
CacheTransceiverConfig(backend="ucx", invalid_config="should_fail")
CacheTransceiverConfig(backend="UCX", invalid_config="should_fail")
assert "invalid_config" in str(exc_info.value)
def test_torch_compile_config_arbitrary_args(self):