# Helix Parallelism
Helix is a context parallelism (CP) technique for the decode/generation phase of LLM inference. Unlike traditional attention-FFN disaggregation (AFD) techniques, which spatially separate attention and FFN blocks onto different GPUs, Helix temporally separates them by reconfiguring the same GPUs.
For full details, see the original paper, *Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding*.
## How Helix Works
In Helix parallelism:
- KV cache distribution: The KV cache is partitioned across CP ranks during generation, with each rank responsible for a portion of the cached context
- Attention computation: Each rank computes partial attention over its local KV cache shard
- Attention postprocessing: Partial results are combined and corrected across ranks to produce the final attention output (see the sketch after this list)
- FFN layers: CP ranks are repurposed as tensor parallelism (TP) ranks for FFN/MoE layers, maximizing GPU utilization
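To make the attention postprocessing step concrete, here is a minimal sketch in plain PyTorch (not TensorRT-LLM code) that simulates two CP ranks on a single device. Each "rank" attends only to its local KV shard and returns a partial output plus the log-sum-exp (LSE) of its scores; the LSE values are then used to rescale and combine the partial outputs into the exact full-context result.

```python
# Sketch only: merging per-rank partial attention results via log-sum-exp correction.
import torch

def partial_attention(q, k, v):
    """Attention of q over a single KV shard, returning the output and the LSE of the scores."""
    scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5   # (1, shard_len)
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)      # (1, 1)
    out = torch.softmax(scores, dim=-1) @ v                  # (1, head_dim)
    return out, lse

torch.manual_seed(0)
head_dim = 64
q = torch.randn(1, head_dim)
k = torch.randn(256, head_dim)   # full cached context
v = torch.randn(256, head_dim)

# Each CP rank holds one shard of the KV cache.
shards = [(k[:128], v[:128]), (k[128:], v[128:])]
outs, lses = zip(*(partial_attention(q, ks, vs) for ks, vs in shards))

# Postprocessing: weight each partial output by its shard's share of the total softmax mass.
lse_all = torch.cat(lses, dim=-1)            # (1, num_ranks)
weights = torch.softmax(lse_all, dim=-1)     # exp(lse_i) / sum_j exp(lse_j)
merged = sum(w.unsqueeze(-1) * o for w, o in zip(weights.unbind(-1), outs))

# Reference: attention over the full, unsharded KV cache matches the merged result.
reference, _ = partial_attention(q, k, v)
torch.testing.assert_close(merged, reference)
```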
## When to Use Helix
Helix parallelism provides performance benefits when all of the following conditions apply:
- Disaggregated serving: Helix is designed for generation servers in a disaggregated (prefill/decode split) deployment architecture
- Long input sequences: Performance gains typically appear at input sequence lengths of roughly 64K tokens or more (see the back-of-envelope sketch below)
- Low batch sizes: Optimal for latency-sensitive workloads with high tokens/second/user requirements
On a typical latency vs. throughput Pareto curve, Helix targets the low-latency, high tokens/second/user operating points rather than the maximum-batch-throughput regime.
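For intuition on why long contexts are where KV cache sharding pays off, here is a rough back-of-envelope sketch. All model numbers in it are illustrative placeholders, not DeepSeek-V3 values, and it only captures the per-rank KV cache footprint (and the bytes re-read each decode step), not end-to-end latency.

```python
# Back-of-envelope sketch: how sharding the KV cache across CP ranks scales the
# per-GPU KV footprint. The model numbers below are placeholders, not DeepSeek-V3 values.
def kv_cache_bytes_per_rank(context_tokens, num_layers, kv_bytes_per_token_per_layer, cp_size):
    """Approximate KV cache bytes held by one CP rank for a single long sequence."""
    total_bytes = context_tokens * num_layers * kv_bytes_per_token_per_layer
    return total_bytes / cp_size

# 1M-token context, 64 layers, 1 KiB of cached KV per token per layer (placeholder values).
single_gpu = kv_cache_bytes_per_rank(1_000_000, 64, 1024, cp_size=1)
helix_cp8 = kv_cache_bytes_per_rank(1_000_000, 64, 1024, cp_size=8)
print(f"CP=1: {single_gpu / 2**30:.1f} GiB per GPU; CP=8: {helix_cp8 / 2**30:.1f} GiB per rank")
```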
## Supported Models
Helix parallelism currently supports models that use Multi-head Latent Attention (MLA) on the Blackwell GPU architecture:
- DeepSeek-V3 / DeepSeek-V3-Lite
## Configuration
### Configuration Parameters
Set the following parameters for the generation servers in disaggregated mode. An example configuration can be seen in the end-to-end accuracy test described below.
| Parameter | Description | Required |
|---|---|---|
| `context_parallel_size` | Number of GPUs for context parallelism (≥2 for Helix) | Yes |
| `cp_config.cp_type` | Must be `"HELIX"` or `CpType.HELIX` | Yes |
| `cp_config.tokens_per_block` | Tokens per KV cache block | Yes |
| `kv_cache_config.tokens_per_block` | Must match `cp_config.tokens_per_block` | Yes |
### JSON Configuration (for YAML/JSON configs)
```json
{
  "context_parallel_size": 2,
  "cp_config": {
    "cp_type": "HELIX",
    "tokens_per_block": 32
  },
  "kv_cache_config": {
    "tokens_per_block": 32
  }
}
```
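The same options can also be assembled in Python and written out as a YAML options file for the generation servers. The sketch below is only an illustration: it assumes PyYAML is available and that your serving setup consumes such a file as extra LLM API options; the file name and the way it is passed to the server are not specified by this document.

```python
# Sketch: build the Helix generation-server options and dump them to YAML.
# Assumes the keys match the JSON example above; how the file is supplied to the
# server depends on your disaggregated serving setup.
import yaml  # PyYAML

helix_options = {
    "context_parallel_size": 2,       # >= 2 GPUs for context parallelism
    "cp_config": {
        "cp_type": "HELIX",           # enable Helix context parallelism
        "tokens_per_block": 32,       # KV cache block size used by CP
    },
    "kv_cache_config": {
        "tokens_per_block": 32,       # must match cp_config.tokens_per_block
    },
}

with open("gen_server_helix.yaml", "w") as f:  # hypothetical file name
    yaml.safe_dump(helix_options, f)
```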
## Testing Helix with TensorRT-LLM
### Unit Test: MLA Module Correctness
The simplest correctness test validates the MLA attention module with Helix enabled:
```bash
# Run the MLA Helix unit test
pytest tests/unittest/_torch/modules/test_mla_helix.py -v
```
This test verifies that attention outputs match between single-GPU and Helix-parallelized execution.
### End-to-End Accuracy Test
For end-to-end validation, an accuracy test evaluates DeepSeek-V3-Lite in disaggregated mode on the MMLU and GSM8K benchmarks:
- Test location: `tests/integration/defs/accuracy/test_disaggregated_serving.py`
- Test name: `TestDeepSeekV3Lite::test_auto_dtype_with_helix`
This test demonstrates proper disaggregated server configuration with Helix.
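Assuming a standard pytest setup for the integration suite, the test can be selected directly by its node ID, e.g. `pytest tests/integration/defs/accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype_with_helix`.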