[Docs][ROCm] MoRI-IO Connector Usage Guide (#43603)

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Simon Danielsson
2026-05-26 15:52:30 +02:00
committed by GitHub
parent a970fb5a1a
commit d565357a90
2 changed files with 272 additions and 8 deletions
+8 -8
@@ -19,25 +19,25 @@ Two main reasons:
Please refer to [examples/disaggregated/disaggregated_prefill.sh](../../examples/disaggregated/disaggregated_prefill.sh) for the example usage of disaggregated prefilling.
Now supports 6 types of connectors:
Now supports 9 types of connectors:
- **ExampleConnector**: refer to [examples/disaggregated/example_connector/run.sh](../../examples/disaggregated/example_connector/run.sh) for the example usage of ExampleConnector disaggregated prefilling.
- **LMCacheConnectorV1**: refer to [examples/disaggregated/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh](../../examples/disaggregated/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh) for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
- **NixlConnector**: refer to [tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh) for the example usage of NixlConnector disaggregated prefilling which support fully async send/recv. For detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md). For feature compatibility details, see [NixlConnector Compatibility Matrix](nixl_connector_compatibility.md).
- **NixlConnector**: refer to [tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh) for the example usage of NixlConnector disaggregated prefilling which support fully async send/recv. For detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md). For feature compatibility details, see [NixlConnector Compatibility Matrix](nixl_connector_compatibility.md). You may specify one or multiple NIXL transfer backends, such as:
```bash
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_buffer_device":"cuda", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'
```
- **P2pNcclConnector**: refer to [examples/disaggregated/p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh](../../examples/disaggregated/p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh) for the example usage of P2pNcclConnector disaggregated prefilling.
- **MooncakeConnector**: refer to [examples/disaggregated/mooncake_connector/run_mooncake_connector.sh](../../examples/disaggregated/mooncake_connector/run_mooncake_connector.sh) for the example usage of MooncakeConnector disaggregated prefilling. For detailed usage guide, see [MooncakeConnector Usage Guide](mooncake_connector_usage.md).
- **MoRIIOConnector** (ROCm only): see [MoRI-IO Usage Guide](moriio_connector_usage.md) for example usage and detailed documentation.
- **MultiConnector**: take advantage of the kv_connector_extra_config: dict[str, Any] already present in KVTransferConfig to stash all the connectors we want in an ordered list of kwargs, such as:
```bash
--kv-transfer-config '{"kv_connector":"MultiConnector","kv_role":"kv_both","kv_connector_extra_config":{"connectors":[{"kv_connector":"NixlConnector","kv_role":"kv_both"},{"kv_connector":"ExampleConnector","kv_role":"kv_both","kv_connector_extra_config":{"shared_storage_path":"local_storage"}}]}}'
```
For NixlConnector, you may also specify one or multiple NIXL_Backend. Such as:
```bash
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_buffer_device":"cuda", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'
```
- **OffloadingConnector**: enable offloading of KV data to CPU memory, customizing the CPU block size (in tokens) and total CPU memory bytes to allocate:
```bash
+264
@@ -0,0 +1,264 @@
# MoRIIOConnector Usage Guide
`MoRIIOConnector` is a high-performance KV connector for KV cache transfer in prefill-decode (PD) disaggregated deployments. It is built on ROCm's [MoRI-IO](https://github.com/rocm/mori) communication library, which provides low-overhead point-to-point communication.
## Prerequisites
### Installation
**Docker:** MoRI is shipped with the official ROCm vLLM image: `vllm/vllm-openai-rocm:nightly`.
**Manual installation:** The MoRI wheel can be installed with:
```bash
pip install amd_mori
```
Refer to [Dockerfile.rocm_base](../../docker/Dockerfile.rocm_base) for more information, or to the [official MoRI repository](https://github.com/rocm/mori) for instructions on building MoRI from source.
For instructions on installing appropriate NIC userspace libraries, see [Installing NIC userspace libraries](#appendix-installing-nic-userspace-libraries).
## Basic usage (single host)
Start the proxy first; the producer and consumer instances will retry registration until the proxy is reachable.
### Producer (prefiller) configuration
Start a prefiller instance that produces KV caches:
```bash
# Prefill instance (GPU 0-3)
export VLLM_ROCM_USE_AITER=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
export HIP_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
-tp 4 \
--port 20005 \
--gpu-memory-utilization 0.9 \
--kv-transfer-config '{
"kv_connector": "MoRIIOConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"proxy_ip": "127.0.0.1",
"proxy_ping_port": "36367",
"http_port": "20005",
"handshake_port": "6301",
"notify_port": "6105"
}
}'
```
### Consumer (decoder) configuration
Start a decoder instance that consumes KV caches:
```bash
# Decode instance (GPU 4-7)
export VLLM_ROCM_USE_AITER=1
export CUDA_VISIBLE_DEVICES=4,5,6,7
export HIP_VISIBLE_DEVICES=4,5,6,7
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
-tp 4 \
--port 40005 \
--gpu-memory-utilization 0.9 \
--kv-transfer-config '{
"kv_connector": "MoRIIOConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"proxy_ip": "127.0.0.1",
"http_port": "40005",
"proxy_ping_port": "36367",
"handshake_port": "7301",
"notify_port": "7501"
}
}'
```
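In the single-host example above, the two instances share `proxy_ip` and `proxy_ping_port` but use different values for `http_port`, `handshake_port`, and `notify_port`, since both listen on the same machine. The sketch below checks a pair of `kv_connector_extra_config` dictionaries for that pattern before launch. The helper name `check_single_host_pair` is hypothetical and not part of vLLM or MoRI.

```python
# Sketch: sanity-check the extra configs of a producer and a consumer
# that run on the same host. Hypothetical helper, not a vLLM API.
PORT_KEYS = ("http_port", "handshake_port", "notify_port")


def check_single_host_pair(producer, consumer):
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems = []
    # Both instances register with the same proxy.
    for key in ("proxy_ip", "proxy_ping_port"):
        if producer[key] != consumer[key]:
            problems.append(f"{key} differs: {producer[key]} vs {consumer[key]}")
    # On one host, each instance needs its own listening ports.
    for key in PORT_KEYS:
        if producer[key] == consumer[key]:
            problems.append(f"{key} collides: both use {producer[key]}")
    return problems


# Values copied from the producer and consumer commands above.
producer = {
    "proxy_ip": "127.0.0.1",
    "proxy_ping_port": "36367",
    "http_port": "20005",
    "handshake_port": "6301",
    "notify_port": "6105",
}
consumer = {
    "proxy_ip": "127.0.0.1",
    "proxy_ping_port": "36367",
    "http_port": "40005",
    "handshake_port": "7301",
    "notify_port": "7501",
}

print(check_single_host_pair(producer, consumer))  # prints []
```

The check only compares the literal port values; it does not account for the additional ports derived from `notify_port`.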
### Proxy server
The proxy fronts the producer and consumer instances and routes incoming requests to them. `vllm-router` is the recommended proxy; it can be installed manually or run as a Docker container. Note that the port `36367` below is the `proxy_ping_port` configured on each vLLM instance.
**Docker:**
```bash
docker run \
--network host \
vllm/vllm-router:nightly \
vllm-router \
--vllm-pd-disaggregation \
--kv-connector moriio \
--vllm-discovery-address "0.0.0.0:36367"
```
**Manual install:**
```bash
pip install vllm-router
vllm-router \
--vllm-pd-disaggregation \
--kv-connector moriio \
--vllm-discovery-address "0.0.0.0:36367"
```
Alternatively, you can use the reference implementation proxy shipped with vLLM:
```bash
cd <path_to>/vllm
pip install quart aiohttp msgpack
python examples/disaggregated/disaggregated_serving/moriio_toy_proxy_server.py
```
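Once the proxy, producer, and consumer are running, requests go to the proxy rather than to the individual instances. The sketch below builds and sends such a request using only the Python standard library. It assumes the proxy forwards the OpenAI-compatible `/v1/completions` route; the proxy's HTTP address is not stated in this guide, so `proxy_url` is a placeholder you must supply, and the function names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(proxy_url, model, prompt, max_tokens=32):
    """Build a POST request for the proxy's /v1/completions route (assumed)."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{proxy_url.rstrip('/')}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# Usage (requires a running deployment; substitute your proxy's address):
#   req = build_completion_request(
#       "http://<proxy-host>:<proxy-http-port>",
#       "Qwen/Qwen3-235B-A22B-FP8",
#       "Hello",
#   )
#   print(send(req))
```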
## Configuration
The connector is configured at two levels: the application level and the transport level.
### Application-level configuration
**Modes:** MoRI has two modes of operation: WRITE and READ mode.
- In WRITE mode, the producer actively pushes computed KV blocks after every layer into the consumer's memory.
- In READ mode, the consumer pulls the KV blocks from the producer all at once, as soon as it has been notified those blocks are ready.
WRITE mode is the default. To enable READ mode, set `--kv-transfer-config.kv_connector_extra_config.read_mode true`.
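If you pass the whole configuration as a single JSON string instead of dotted flags, the mode can be set in the same `kv_connector_extra_config` object. The sketch below generates a shell-safe `--kv-transfer-config` argument. It assumes `read_mode` accepts a JSON boolean in this form (the guide only shows the dotted form with `true`), and `kv_transfer_config` is a hypothetical helper.

```python
import json
import shlex


def kv_transfer_config(role, extra):
    """Serialize a MoRIIOConnector config to the JSON string vLLM expects."""
    config = {
        "kv_connector": "MoRIIOConnector",
        "kv_role": role,
        "kv_connector_extra_config": extra,
    }
    return json.dumps(config)


extra = {
    "proxy_ip": "127.0.0.1",
    "proxy_ping_port": "36367",
    "http_port": "40005",
    "handshake_port": "7301",
    "notify_port": "7501",
    "read_mode": True,  # assumption: JSON boolean; omit the key for WRITE mode
}

arg = kv_transfer_config("kv_consumer", extra)
# shlex.quote wraps the JSON so it survives shell word splitting.
print("--kv-transfer-config " + shlex.quote(arg))
```

Generating the string this way avoids hand-written quoting mistakes when values come from environment variables.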
**Control-plane configuration:** MoRI moves KV bytes over RDMA/xGMI, but producers and consumers also need out-of-band TCP channels for handshake, block id exchange, liveness, and completion signaling. These keys live under `kv_connector_extra_config`:
- `proxy_ip`: IP address of the disaggregation proxy/router that fronts the prefiller and decoder. Each vLLM instance uses it to register itself and to send heartbeats so the proxy knows where to route incoming requests.
- `proxy_ping_port`: TCP port on `proxy_ip` where the proxy listens for instance heartbeats and registration messages. Used to detect dead vLLM instances and keep routing tables fresh.
- `http_port`: HTTP port on which this vLLM instance exposes its OpenAI-compatible API. The proxy records this port and forwards user requests to it once it has picked an instance.
- `handshake_port`: TCP port used for the one-time MoRI engine handshake between a prefiller and a decoder. The two sides exchange RDMA engine descriptors here before any KV transfer can happen.
- `notify_port`: TCP port used for control and synchronization messages between prefiller and decoder. Used differently in the two modes:
    - WRITE mode: **Block allocation:** the decoder notifies the prefiller about its block ids, so the prefiller can push its computed KV blocks into the correct place on the decoder instance. **Completion:** once all blocks have been transferred, the prefiller notifies the decoder that it's safe to use its blocks.
    - READ mode: **Completion:** once the decoder has read all blocks from the prefiller, it notifies the prefiller so it can free its KV cache blocks.
!!! note
    `notify_port` is used as a *base* port: each (DP rank, TP rank) pair within an instance uses `notify_port + offset` where the offset is based on the rank. Make sure the range starting at `notify_port` is free on the host.
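To plan port assignments, it helps to estimate which ports an instance will claim. The sketch below assumes the offsets are contiguous, one per (DP rank, TP rank) pair, so an instance uses `dp_size * tp_size` ports starting at `notify_port`. The guide only says the offset is rank-based, so treat the exact range as an assumption; both helper names are hypothetical.

```python
def notify_port_range(notify_port, dp_size, tp_size):
    """Ports an instance is assumed to use: one per (DP rank, TP rank) pair."""
    count = dp_size * tp_size  # assumption: contiguous offsets 0..count-1
    return range(notify_port, notify_port + count)


def ranges_overlap(a, b):
    """True if two port ranges share at least one port."""
    return a.start < b.stop and b.start < a.stop


# Single-host example from this guide: TP=4 on both instances, no DP.
producer_ports = notify_port_range(6105, dp_size=1, tp_size=4)
consumer_ports = notify_port_range(7501, dp_size=1, tp_size=4)

print(list(producer_ports))  # prints [6105, 6106, 6107, 6108]
print(ranges_overlap(producer_ports, consumer_ports))  # prints False
```

On a shared host, also keep `handshake_port` and `http_port` outside both ranges.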
### Transport configuration
MoRI has two transport backends: RDMA and xGMI. You can select the backend with `--kv-transfer-config.kv_connector_extra_config.backend $BACKEND`, where `$BACKEND` is `rdma` or `xgmi`. RDMA is the default backend and should be used in multi-node deployments.
The configuration options for each backend are as follows.
#### RDMA backend
- `qp_per_transfer`: number of RDMA Queue Pairs (QPs) used per transfer. More QPs let a single transfer be striped over multiple QPs to increase NIC concurrency, at the cost of more RDMA resources.
- `post_batch_size`: number of RDMA Work Requests (WRs) batched into one `ibv_post_send` doorbell. Defaults to -1, meaning the backend default. Larger batches reduce the posting overhead per WR.
- `num_workers`: number of worker threads MoRI uses to post and poll transfer completions.
Advanced users can also configure MoRI itself using environment variables such as `MORI_IO_QP_MAX_SEND_WR`, `MORI_IO_QP_MAX_CQE`, etc. These are MoRI library variables and are separate from vLLM's own `VLLM_MORIIO_*` settings. Refer to the [MoRI repository](https://github.com/rocm/mori) for more information.
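The transport options can be passed with the same dotted flag syntax used for `backend`. The sketch below turns a dictionary of options into such flags. It assumes the RDMA options sit directly under `kv_connector_extra_config` next to `backend`; the numeric values are illustrative placeholders rather than tuning recommendations, and `dotted_flags` is a hypothetical helper. `post_batch_size` is left out so it keeps its default.

```python
import shlex

PREFIX = "--kv-transfer-config.kv_connector_extra_config."


def dotted_flags(extra):
    """Convert a flat options dict into dotted vLLM CLI flags."""
    flags = []
    for key, value in extra.items():
        # Check bool before other types: booleans are written as true/false.
        if isinstance(value, bool):
            text = "true" if value else "false"
        else:
            text = str(value)
        flags += [PREFIX + key, text]
    return flags


# Illustrative values only; assumed to live next to `backend`.
transport = {"backend": "rdma", "qp_per_transfer": 2, "num_workers": 2}

# Append the printed flags to the `vllm serve` command line.
print(shlex.join(dotted_flags(transport)))
```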
#### xGMI backend
Use xGMI when the prefiller and decoder run on the same physical host, so that transfers go over the AMD GPU fabric and skip the NIC entirely. The xGMI backend can currently only be configured through MoRI-specific environment variables; see the [MoRI repository](https://github.com/rocm/mori).
## Multi-node deployment
The example below shows how to run a 1P1D deployment (one prefiller, one decoder) on two nodes. The proxy runs on the same node as the prefill instance.
### On both nodes
```bash
# Set on both nodes before running any command
export PREFILL_IP=<node1-ip>
export DECODE_IP=<node2-ip>
```
### On node 1
Start the proxy first as described in [Proxy server](#proxy-server), then start the prefill instance:
```bash
docker run \
--name moriio-prefill \
--init --network host --ipc host --privileged \
--security-opt seccomp=unconfined \
--ulimit memlock=-1 --ulimit stack=67108864 --shm-size 256G \
--group-add video --group-add render \
--device /dev/kfd --device /dev/dri --device /dev/infiniband \
-e VLLM_ROCM_USE_AITER=1 \
vllm/vllm-openai-rocm:nightly \
deepseek-ai/DeepSeek-R1-0528 \
--port 8100 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--gpu-memory-utilization 0.8 \
--trust-remote-code \
--kv-transfer-config '{
"kv_connector": "MoRIIOConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"proxy_ip": "'"${PREFILL_IP}"'",
"proxy_ping_port": "36367",
"http_port": "8100",
"handshake_port": "6301",
"notify_port": "61005"
}
}'
```
### On node 2
Start the decode instance:
```bash
docker run \
--name moriio-decode \
--init --network host --ipc host --privileged \
--security-opt seccomp=unconfined \
--ulimit memlock=-1 --ulimit stack=67108864 --shm-size 256G \
--group-add video --group-add render \
--device /dev/kfd --device /dev/dri --device /dev/infiniband \
-e VLLM_ROCM_USE_AITER=1 \
vllm/vllm-openai-rocm:nightly \
deepseek-ai/DeepSeek-R1-0528 \
--port 8200 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.8 \
--trust-remote-code \
--enable-expert-parallel \
--kv-transfer-config '{
"kv_connector": "MoRIIOConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"proxy_ip": "'"${PREFILL_IP}"'",
"proxy_ping_port": "36367",
"http_port": "8200",
"handshake_port": "6301",
"notify_port": "61005"
}
}'
```
## Troubleshooting
### `availDevices.size() > 0` assertion failure
**Problem:** vLLM fails to launch with the following log:
```text
libibverbs: Warning: Driver bnxt_re does not support the kernel ABI of 6 (supports 1 to 1) for device /sys/class/infiniband/rdma4
...
ker: /app/mori/src/io/rdma/backend_impl.cpp: mori::io::RdmaManager::RdmaManager(const RdmaBackendConfig, application::RdmaContext *): Assertion `availDevices.size() > 0' failed.
```
**Fix:** The installed RDMA userspace libraries do not match the driver and firmware versions installed on the host. Install the NIC userspace libraries that correspond to your RDMA kernel module and firmware version. See [Installing NIC userspace libraries](#appendix-installing-nic-userspace-libraries) for more information.
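As a rough first check, you can list the RDMA devices the kernel exposes under `/sys/class/infiniband`, the sysfs path that appears in the warning above. The sketch below is a hypothetical helper, not part of vLLM or MoRI. If it lists devices while MoRI still reports none, a userspace library mismatch is a likely cause; if it returns an empty list, no RDMA devices are visible in this environment at all.

```python
import os


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the RDMA device names visible in sysfs, or [] if there are none."""
    try:
        return sorted(os.listdir(sysfs_root))
    except FileNotFoundError:
        # The directory is absent when no RDMA kernel module is loaded.
        return []


devices = list_rdma_devices()
if devices:
    print("RDMA devices visible to the kernel:", ", ".join(devices))
else:
    print("No RDMA devices found under /sys/class/infiniband")
```

Run it inside the same container or environment as vLLM, since device visibility can differ from the host.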
## Appendix: installing NIC userspace libraries
To run MoRI with RDMA, your environment must have the necessary RDMA userspace libraries installed that match the associated kernel module and firmware version.
The official image `vllm/vllm-openai-rocm:nightly` comes pre-installed with userspace libraries for the following NICs and kernel module versions:
- AINIC (AMD Pensando Pollara): version `1.117.3-hydra`, tested with `ioinic-dkms=25.11.1.001`
- Thor2 (Broadcom): version `235.2.86.0`, tested with `bnxt-en-dkms=1.10.3.235.2.86.0`, `bnxt-re-dkms=235.2.86.0`
Refer to [Dockerfile.rocm](../../docker/Dockerfile.rocm) for more details. If your NICs, kernel modules, or firmware differ from those listed above, follow the vendor's own installation instructions.
## Further reading
- [Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation](https://vllm.ai/blog/2026-04-07-moriio-kv-connector).