Mirror of https://github.com/NVIDIA/TensorRT-LLM.git
[TRTLLM-7030][fix] BREAKING CHANGE: Mismatch between docs and actual commands (#7191)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
This commit is contained in:
parent 5d165186d5
commit d010b2043a
@@ -336,7 +336,7 @@ cd cpp/build
 `disaggServerBenchmark` only supports `decoder-only` models.
 Here is the basic usage:
 ```
-export TRTLLM_USE_MPI_KVCACHE=1
+export TRTLLM_USE_UCX_KVCACHE=1
 mpirun -n ${proc} benchmarks/disaggServerBenchmark --context_engine_dirs ${context_engine_0},${context_engine_1}...,${context_engine_{m-1}} \
 --generation_engine_dirs ${generation_engine_0},${generation_engine_1}...,${generation_engine_{n-1}} --dataset ${dataset_path}
 ```
@@ -344,7 +344,7 @@ This command will launch m context engines and n generation engines. You need to
 
 for example:
 ```
-export TRTLLM_USE_MPI_KVCACHE=1
+export TRTLLM_USE_UCX_KVCACHE=1
 mpirun -n 7 benchmarks/disaggServerBenchmark --context_engine_dirs ${llama_7b_tp2_pp1_dir},${llama_7b_tp1_pp1_dir} --generation_engine_dirs ${llama_7b_tp1_pp1_dir},${llama_7b_tp2_pp1_dir} --dataset ${dataset_path}
 
 # need 6 gpus and 7 processes to launch the benchmark.
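The arithmetic behind `-n 7` in this example: the engines occupy 2+1 = 3 GPUs on the context side and 1+2 = 3 GPUs on the generation side, and the benchmark launches one process beyond the engine ranks. A hedged helper reproducing that arithmetic (illustrative only, not a TRT-LLM API; the "+1 extra process" reading is an inference from the 6-GPU/7-process comment):

```python
# Hypothetical helper; names are illustrative. One MPI rank per engine GPU
# (tensor_parallel * pipeline_parallel per engine) plus one extra process,
# matching the "6 gpus and 7 processes" comment above.
def mpirun_rank_count(engine_parallelism):
    gpu_ranks = sum(tp * pp for tp, pp in engine_parallelism)
    return gpu_ranks + 1

# context engines: tp2/pp1 and tp1/pp1; generation engines: tp1/pp1 and tp2/pp1
print(mpirun_rank_count([(2, 1), (1, 1), (1, 1), (2, 1)]))  # -> 7
```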
@@ -66,17 +66,6 @@ A. Yes, it's recommended that different executor use different GPUs . We support
 
 ### Debugging FAQs
 
-*Q. How to handle error `Disaggregated serving is not enabled, please check the configuration?`*
-
-A. please set `backendType` of `CacheTransceiverConfig`.
-```cpp
-ExecutorConfig executorConfig{...};
-
-executorConfig.setCacheTransceiverConfig(texec::CacheTransceiverConfig(BackendType::DEFAULT));
-```
-
-When the environment variable `TRTLLM_USE_MPI_KVCACHE=1` is set, TRT-LLM will transfer the KV cache using `CUDA-aware MPI`. All executor processes involved must share the same MPI world communicator. Consequently, with `TRTLLM_USE_MPI_KVCACHE=1`, TRT-LLM only supports launching multiple executors via `MPI`. Additionally, the `CommunicationMode` for the executors must be set to `kLEADER` or `kORCHESTRATOR` with `SpawnProcesses=false` for the `disaggregated-service`. These restrictions do not apply when `TRTLLM_USE_UCX_KVCACHE=1` is set.
-
 *Q. Does TRT-LLM support using GPU direct RDMA for inter-node KV Cache transfer?*
 
 A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
@@ -124,10 +124,10 @@ From the `examples/cpp/executor/build` folder, you can also run the `executorExa
 ```
 ./executorExampleDisaggregated -h
 ```
-Note setting `TRTLLM_USE_MPI_KVCACHE=1` is required to run disaggregated executor.
+Note setting `TRTLLM_USE_UCX_KVCACHE=1` is required to run disaggregated executor.
 For example, you can run :
 ```
-export TRTLLM_USE_MPI_KVCACHE=1
+export TRTLLM_USE_UCX_KVCACHE=1
 
 mpirun -n <num_ranks> --allow-run-as-root --oversubscribe ./executorExampleDisaggregated --context_engine_dir <path_to_context_engine_dir> --context_rank_size <num_ranks_for_context> --generation_engine_dir <path_to_generation_engine_dir> --generation_rank_size <num_ranks_for_generation> --input_tokens ../inputTokens.csv
 
@@ -12,24 +12,39 @@ cache_transceiver_config:
   max_tokens_in_buffer: <int>
 ```
 
-`backend` specifies the communication backend for transferring the kvCache, valid options include `DEFAULT`,`UCX`, `NIXL`, and `MPI`, the default backend is UCX.
+`backend` specifies the communication backend for transferring the KV cache, valid options include `DEFAULT`, `UCX`, `NIXL`, and `MPI`, the default backend is `UCX`.
 
-`max_tokens_in_buffer` defines the buffer size for kvCache transfers, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests for optimal performance.
+`max_tokens_in_buffer` defines the buffer size for KV cache transfers, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests for optimal performance.
 
-You can use multiple `trtllm-serve` commands to launch the context and generation servers that will be used
-for disaggregated serving. For example, you could launch two context servers and one generation servers as follows:
+You can use multiple `trtllm-serve` commands to launch the context and generation servers required for disaggregated serving. For instance, you might start two context servers and one generation server as shown below.
+
+Begin by creating `ctx_extra-llm-api-config.yml` and `gen_extra-llm-api-config.yml` following the specified format.
+
+```yaml
+# ctx_extra-llm-api-config.yml
+
+# The overlap scheduler for context servers is currently disabled, as it is
+# not yet supported in disaggregated context server architectures.
+disable_overlap_scheduler: True
+cache_transceiver_config:
+  backend: UCX
+  max_tokens_in_buffer: 2048
+```
+
+```yaml
+# gen_extra-llm-api-config.yml
+
+cache_transceiver_config:
+  backend: UCX
+  max_tokens_in_buffer: 2048
+```
+
+Then, start the context and generation servers separately.
 
 ```bash
-# Generate context_extra-llm-api-config.yml
-# Overlap scheduler for context servers are disabled because it's not supported for disaggregated context servers yet
-echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > context_extra-llm-api-config.yml
-
 # Start context servers
-CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_0 &
-CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_1 &
-
-# Generate gen_extra-llm-api-config.yml
-echo -e "cache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > gen_extra-llm-api-config.yml
+CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_0 &
+CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_1 &
 
 # Start generation servers
 CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --extra_llm_api_options ./gen_extra-llm-api-config.yml &> log_gen_0 &
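For readers following along: the two YAML files above can also be produced from a short script instead of the removed `echo -e` one-liners. A minimal sketch, assuming PyYAML is installed; the file names and keys match the docs above:

```python
import yaml

# Sketch: write the ctx/gen extra-LLM-API config files shown in the docs above.
ctx_cfg = {
    # Overlap scheduler is not yet supported for disaggregated context servers.
    "disable_overlap_scheduler": True,
    "cache_transceiver_config": {"backend": "UCX", "max_tokens_in_buffer": 2048},
}
gen_cfg = {
    "cache_transceiver_config": {"backend": "UCX", "max_tokens_in_buffer": 2048},
}

with open("ctx_extra-llm-api-config.yml", "w") as f:
    yaml.safe_dump(ctx_cfg, f)
with open("gen_extra-llm-api-config.yml", "w") as f:
    yaml.safe_dump(gen_cfg, f)
```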
@@ -95,8 +110,8 @@ After this, you can enable the dynamic scaling feature for the use case above as
 export TRTLLM_USE_UCX_KVCACHE=1
 
 # Context servers
-CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
-CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &
+CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
+CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &
 
 # Generation servers
 CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --server_role GENERATION --extra_llm_api_options ./gen_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_gen_0 &
@@ -180,4 +195,4 @@ trtllm-serve disaggregated -c disagg_config.yaml
 
 ## Know Issues
 
-The MPI communication backend for kvCache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and kvCache transfer.
+The MPI communication backend for KV cache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and KV cache transfer.
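If you do still use the deprecated MPI backend, the note above implies the variable should be in the environment before anything MPI-related initializes. A minimal sketch; the set-before-import ordering is an assumption, not documented behavior:

```python
import os

# Assumption: set the variable before importing anything that initializes MPI
# (mpi4py, TensorRT-LLM), so KV cache transfer and mpi4py do not conflict.
os.environ.setdefault("TRTLLM_USE_MPI_KVCACHE", "1")

import tensorrt_llm  # noqa: E402  # imported after the env var on purpose
```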
@@ -11,7 +11,7 @@ context_servers:
   kv_cache_config:
     free_gpu_memory_fraction: 0.2
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8001"
 generation_servers:

@@ -19,6 +19,6 @@ generation_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8002"
@@ -197,7 +197,7 @@ def gen_config_file(config_path: str,
             },
             'cache_transceiver_config': {
                 'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
-                'backend': 'default',
+                'backend': 'DEFAULT',
             },
         },
         'generation_servers': {

@@ -225,7 +225,7 @@ def gen_config_file(config_path: str,
             },
             'cache_transceiver_config': {
                 'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
-                'backend': 'default',
+                'backend': 'DEFAULT',
             },
             'stream_interval': 20,
         }
@@ -1039,7 +1039,7 @@ class CacheTransceiverConfig(StrictBaseModel, PybindMirror):
     Configuration for the cache transceiver.
     """
 
-    backend: Optional[Literal["default", "ucx", "nixl", "mpi"]] = Field(
+    backend: Optional[Literal["DEFAULT", "UCX", "NIXL", "MPI"]] = Field(
         default=None,
         description=
        "The communication backend type to use for the cache transceiver.")
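Since this commit is flagged as a breaking change: lowercase backend strings that validated before are now rejected by the `Literal`. A small sketch of the new behavior, mirroring the test changes further down (the `tensorrt_llm.llmapi` import path is an assumption):

```python
from pydantic import ValidationError

from tensorrt_llm.llmapi import CacheTransceiverConfig  # import path assumed

# Uppercase values match the new Literal["DEFAULT", "UCX", "NIXL", "MPI"].
config = CacheTransceiverConfig(backend="UCX", max_tokens_in_buffer=1024)
assert config.backend == "UCX"

try:
    CacheTransceiverConfig(backend="ucx")  # pre-change spelling
except ValidationError as err:
    print(f"lowercase backend is now rejected: {err}")
```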
@@ -260,7 +260,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
         "disable_overlap_scheduler": True,
         "kv_cache_config": kv_cache_config,
         "cache_transceiver_config": {
-            "backend": "default"
+            "backend": "DEFAULT"
         }
     }
     gen_server_config = {

@@ -269,7 +269,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
         "disable_overlap_scheduler": True,
         "kv_cache_config": kv_cache_config,
         "cache_transceiver_config": {
-            "backend": "default"
+            "backend": "DEFAULT"
         }
     }
 

@@ -309,8 +309,8 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
         gen_server_config = {
             "disable_overlap_scheduler": disable_overlap_scheduler
         }
-        ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
-        gen_server_config["cache_transceiver_config"] = {"backend": "default"}
+        ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
+        gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
         disaggregated_server_config = {
             "hostname": "localhost",
             "port": 8000,
@@ -351,7 +351,7 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
             "disable_overlap_scheduler": True,
             "kv_cache_config": kv_cache_config,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         gen_server_config = {

@@ -359,7 +359,7 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
             "speculative_config": speculative_decoding_config,
             "kv_cache_config": kv_cache_config,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         disaggregated_server_config = {

@@ -404,7 +404,7 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
             "max_num_tokens": 13393 * 2,
             "max_batch_size": 1,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             },
             "cuda_graph_config": None,
         }

@@ -418,7 +418,7 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
             "max_num_tokens": 13393 * 2,
             "max_batch_size": 16,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             },
             "cuda_graph_config": None,
         }
@@ -472,8 +472,8 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
     def test_auto_dtype(self, overlap_scheduler):
         ctx_server_config = {"disable_overlap_scheduler": True}
         gen_server_config = {"disable_overlap_scheduler": overlap_scheduler}
-        ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
-        gen_server_config["cache_transceiver_config"] = {"backend": "default"}
+        ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
+        gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
         # Keep this low to avoid warmup OOM in CI
         ctx_server_config["max_seq_len"] = 8192
         gen_server_config["max_seq_len"] = 8192

@@ -513,13 +513,13 @@ class TestDeepSeekV3Lite(LlmapiAccuracyTestHarness):
         ctx_server_config = {
             "disable_overlap_scheduler": True,
             "cache_transceiver_config": {
-                "backend": "nixl"
+                "backend": "NIXL"
             }
         }
         gen_server_config = {
             "disable_overlap_scheduler": True,
             "cache_transceiver_config": {
-                "backend": "nixl"
+                "backend": "NIXL"
             }
         }
         disaggregated_server_config = {

@@ -550,8 +550,8 @@ class TestDeepSeekV3Lite(LlmapiAccuracyTestHarness):
     def test_auto_dtype(self, overlap_scheduler, mtp_nextn):
         ctx_server_config = {"disable_overlap_scheduler": True}
         gen_server_config = {"disable_overlap_scheduler": not overlap_scheduler}
-        ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
-        gen_server_config["cache_transceiver_config"] = {"backend": "default"}
+        ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
+        gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
         if mtp_nextn > 0:
             ctx_server_config["speculative_config"] = {
                 "decoding_type": "MTP",
@@ -597,14 +597,14 @@ class TestGemma3_1BInstruct(LlmapiAccuracyTestHarness):
             "disable_overlap_scheduler": True,
             "cuda_graph_config": None,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         gen_server_config = {
             "disable_overlap_scheduler": overlap_scheduler,
             "cuda_graph_config": None,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         ctx_server_config["kv_cache_config"] = {

@@ -648,13 +648,13 @@ class TestQwen3_8B(LlmapiAccuracyTestHarness):
         ctx_server_config = {
             "disable_overlap_scheduler": True,
             "cache_transceiver_config": {
-                "backend": "nixl"
+                "backend": "NIXL"
             }
         }
         gen_server_config = {
             "disable_overlap_scheduler": True,
             "cache_transceiver_config": {
-                "backend": "nixl"
+                "backend": "NIXL"
             }
         }
         ctx_server_config["cache_transceiver_config"]

@@ -686,14 +686,14 @@ class TestQwen3_8B(LlmapiAccuracyTestHarness):
             "disable_overlap_scheduler": True,
             "cuda_graph_config": None,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         gen_server_config = {
             "disable_overlap_scheduler": overlap_scheduler,
             "cuda_graph_config": None,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         disaggregated_server_config = {
@@ -21,7 +21,7 @@ context_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
     - "localhost:8002"

@@ -35,7 +35,7 @@ generation_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   kv_cache_config:
     enable_block_reuse: True
     enable_partial_reuse: False

@@ -17,7 +17,7 @@ context_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.1
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8001"
     - "localhost:8002"

@@ -33,7 +33,7 @@ generation_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.1
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8003"
     - "localhost:8004"
@@ -15,7 +15,7 @@ context_servers:
     enable_partial_reuse: True
     event_buffer_max_size: 1024
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -30,6 +30,6 @@ generation_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.05
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -15,7 +15,7 @@ context_servers:
     enable_partial_reuse: True
     event_buffer_max_size: 1024
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -30,6 +30,6 @@ generation_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.05
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -18,7 +18,7 @@ context_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.15
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -33,6 +33,6 @@ generation_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.15
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -18,7 +18,7 @@ context_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.15
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -33,6 +33,6 @@ generation_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.15
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -16,7 +16,7 @@ context_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -31,6 +31,6 @@ generation_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -16,7 +16,7 @@ context_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -31,6 +31,6 @@ generation_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -16,7 +16,7 @@ context_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -31,6 +31,6 @@ generation_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -10,7 +10,7 @@ context_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -18,6 +18,6 @@ generation_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -14,7 +14,7 @@ context_servers:
   pipeline_parallel_size: 1
   enable_attention_dp: true
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -23,6 +23,6 @@ generation_servers:
   pipeline_parallel_size: 1
   enable_attention_dp: false
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -14,7 +14,7 @@ context_servers:
   enable_attention_dp: true
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -24,6 +24,6 @@ generation_servers:
   enable_attention_dp: true
   disable_overlap_scheduler: False
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -14,7 +14,7 @@ context_servers:
   pipeline_parallel_size: 1
   enable_attention_dp: true
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -25,4 +25,4 @@ generation_servers:
   urls:
     - "localhost:8002"
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
@@ -16,7 +16,7 @@ context_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -31,6 +31,6 @@ generation_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -10,7 +10,7 @@ context_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -18,7 +18,7 @@ generation_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
     - "localhost:8003"
@@ -8,7 +8,7 @@ context_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -16,7 +16,7 @@ generation_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
     - "localhost:8003"

@@ -10,7 +10,7 @@ context_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -18,6 +18,6 @@ generation_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -11,7 +11,7 @@ context_servers:
   pipeline_parallel_size: 1
   enable_attention_dp: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -20,6 +20,6 @@ generation_servers:
   pipeline_parallel_size: 1
   enable_attention_dp: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -11,7 +11,7 @@ context_servers:
   pipeline_parallel_size: 1
   enable_attention_dp: true
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -20,6 +20,6 @@ generation_servers:
   pipeline_parallel_size: 1
   enable_attention_dp: false
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -14,7 +14,7 @@ context_servers:
   pipeline_parallel_size: 1
   enable_attention_dp: true
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -23,7 +23,7 @@ generation_servers:
   pipeline_parallel_size: 1
   enable_attention_dp: false
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
 
   urls:
     - "localhost:8002"

@@ -11,7 +11,7 @@ context_servers:
   enable_attention_dp: True
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -21,6 +21,6 @@ generation_servers:
   enable_attention_dp: True
   disable_overlap_scheduler: False
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -10,7 +10,7 @@ context_servers:
   enable_attention_dp: true
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -22,6 +22,6 @@ generation_servers:
     enable_padding: False
   disable_overlap_scheduler: False
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -9,7 +9,7 @@ context_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "mpi"
+    backend: "MPI"
   urls:
     - "localhost:8001"
 generation_servers:

@@ -17,6 +17,6 @@ generation_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "mpi"
+    backend: "MPI"
   urls:
     - "localhost:8002"
@@ -9,7 +9,7 @@ context_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "nixl"
+    backend: "NIXL"
   urls:
     - "localhost:8001"
 generation_servers:

@@ -17,6 +17,6 @@ generation_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "nixl"
+    backend: "NIXL"
   urls:
     - "localhost:8002"

@@ -9,7 +9,7 @@ context_servers:
   pipeline_parallel_size: 1
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -20,6 +20,6 @@ generation_servers:
     enable_padding: False
   disable_overlap_scheduler: False
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -9,7 +9,7 @@ context_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "ucx"
+    backend: "UCX"
   urls:
     - "localhost:8001"
 generation_servers:

@@ -17,6 +17,6 @@ generation_servers:
   tensor_parallel_size: 2
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "ucx"
+    backend: "UCX"
   urls:
     - "localhost:8002"

@@ -16,7 +16,7 @@ context_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -31,6 +31,6 @@ generation_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -16,7 +16,7 @@ context_servers:
     batch_sizes: [1,3000]
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -34,6 +34,6 @@ generation_servers:
     batch_sizes: [1,4,8,16,24,32]
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -10,7 +10,7 @@ context_servers:
   max_num_tokens: 512
   max_batch_size: 64
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -18,6 +18,6 @@ generation_servers:
   max_num_tokens: 256
   max_batch_size: 32
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -14,7 +14,7 @@ generation_servers:
     enable_block_reuse: False
     enable_partial_reuse: False
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   print_iter_log: True
   urls:
     - "localhost:8002"

@@ -17,7 +17,7 @@ context_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -32,6 +32,6 @@ generation_servers:
     free_gpu_memory_fraction: 0.2
     enable_partial_reuse: False
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -13,7 +13,7 @@ generation_servers:
     enable_block_reuse: False
     enable_partial_reuse: False
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
     - "localhost:8003"
@@ -19,7 +19,7 @@ context_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
     - "localhost:8002"

@@ -38,7 +38,7 @@ generation_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: False
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8003"
     - "localhost:8004"

@@ -10,7 +10,7 @@ context_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -18,7 +18,7 @@ generation_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
     - "localhost:8002"
@@ -9,7 +9,7 @@ context_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8001"
 generation_servers:

@@ -17,7 +17,7 @@ generation_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8002"
   speculative_config:

@@ -16,7 +16,7 @@ context_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: True
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -31,6 +31,6 @@ generation_servers:
     enable_partial_reuse: False
   disable_overlap_scheduler: False
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"
@@ -10,7 +10,7 @@ context_servers:
   kv_cache_config:
     free_gpu_memory_fraction: 0.2
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
 generation_servers:

@@ -18,6 +18,6 @@ generation_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8002"

@@ -16,7 +16,7 @@ context_servers:
     free_gpu_memory_fraction: 0.2
     enable_partial_reuse: False
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   disable_overlap_scheduler: True
   urls:
     - "localhost:8001"

@@ -32,7 +32,7 @@ generation_servers:
     free_gpu_memory_fraction: 0.2
     enable_partial_reuse: False
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   disable_overlap_scheduler: False
   urls:
     - "localhost:8002"
@@ -1276,8 +1276,8 @@ def test_disaggregated_benchmark_on_diff_backends(
     if "DeepSeek-V3-Lite" in benchmark_model_root and "fp8" in benchmark_model_root and get_sm_version(
     ) != 90:
         pytest.skip("The test should only run on Hopper")
-    nixl_config = get_config_for_benchmark(benchmark_model_root, "nixl")
-    ucx_config = get_config_for_benchmark(benchmark_model_root, "ucx")
+    nixl_config = get_config_for_benchmark(benchmark_model_root, "NIXL")
+    ucx_config = get_config_for_benchmark(benchmark_model_root, "UCX")
     temp_dir = tempfile.TemporaryDirectory()
     nixl_config_path = os.path.join(temp_dir.name, "nixl_config.yaml")
     ucx_config_path = os.path.join(temp_dir.name, "ucx_config.yaml")
@@ -244,7 +244,7 @@ def create_config_files(config):
     context_config_content = """pytorch_backend_config:
   disable_overlap_scheduler: True
 cache_transceiver_config:
-  backend: "default"
+  backend: "DEFAULT"
   max_tokens_in_buffer: 2048"""
 
     with open(CONTEXT_CONFIG_FILE, 'w') as file:

@@ -252,7 +252,7 @@ cache_transceiver_config:
 
     # Create generation config file
     generation_config_content = """cache_transceiver_config:
-  backend: "default"
+  backend: "DEFAULT"
   max_tokens_in_buffer: 2048"""
 
     with open(GENERATION_CONFIG_FILE, 'w') as file:
@@ -131,7 +131,7 @@ def verify_disaggregated(model, generation_overlap, enable_cuda_graph, prompt,
 
     kv_cache_configs = [KvCacheConfig(max_tokens=2048 * 8) for _ in range(2)]
     cache_transceiver_configs = [
-        CacheTransceiverConfig(backend="default") for _ in range(2)
+        CacheTransceiverConfig(backend="DEFAULT") for _ in range(2)
     ]
     model_names = [model_path(model) for _ in range(2)]
     ranks = [0, 1]

@@ -274,7 +274,7 @@ def test_disaggregated_llama_context_capacity(model, enable_cuda_graph,
         for _ in range(2)
     ]
     cache_transceiver_configs = [
-        CacheTransceiverConfig(backend="default") for _ in range(2)
+        CacheTransceiverConfig(backend="DEFAULT") for _ in range(2)
     ]
     model_names = [model_path(model) for _ in range(2)]
     ranks = [0, 1]

@@ -377,7 +377,7 @@ def test_disaggregated_spec_dec_batch_slot_limit(model, spec_dec_model_path,
         for _ in range(2)
     ]
     cache_transceiver_configs = [
-        CacheTransceiverConfig(backend="default") for _ in range(2)
+        CacheTransceiverConfig(backend="DEFAULT") for _ in range(2)
    ]
     model_names = [model_path(model) for _ in range(2)]
     ranks = [0, 1]
@@ -661,15 +661,15 @@ class TestStrictBaseModelArbitraryArgs:
     def test_cache_transceiver_config_arbitrary_args(self):
         """Test that CacheTransceiverConfig rejects arbitrary arguments."""
         # Valid arguments should work
-        config = CacheTransceiverConfig(backend="ucx",
+        config = CacheTransceiverConfig(backend="UCX",
                                         max_tokens_in_buffer=1024)
-        assert config.backend == "ucx"
+        assert config.backend == "UCX"
         assert config.max_tokens_in_buffer == 1024
 
         # Arbitrary arguments should be rejected
         with pytest.raises(
                 pydantic_core._pydantic_core.ValidationError) as exc_info:
-            CacheTransceiverConfig(backend="ucx", invalid_config="should_fail")
+            CacheTransceiverConfig(backend="UCX", invalid_config="should_fail")
         assert "invalid_config" in str(exc_info.value)
 
     def test_torch_compile_config_arbitrary_args(self):