# Overlap Scheduler

To maximize GPU utilization, the scheduler overlaps CPU tasks (e.g., checking sampling stop criteria, updating responses, scheduling the next batch) with GPU computation.

## How It Works

At step n, the system launches GPU computation for step n+1 without waiting for CPU tasks (e.g., stop criteria checks) from step n to complete. This allows:

- CPU work (step n) and GPU computation (step n+1) to run concurrently.
- Better GPU occupancy by reducing idle time.
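The pattern above can be illustrated with a minimal, self-contained sketch (not TensorRT-LLM code): a worker thread stands in for the CPU-side post-processing of step n, which runs while the next step's "GPU" work is launched.

```python
# Illustrative sketch of overlap scheduling (not TensorRT-LLM code).
# gpu_step and cpu_postprocess are hypothetical stand-ins for launching
# a decoding step and for CPU tasks such as stop-criteria checks.
import threading


def gpu_step(step):
    # Stand-in for launching GPU computation for one decoding step.
    return f"logits_{step}"


def cpu_postprocess(result, processed):
    # Stand-in for CPU work on step n: stop checks, response updates.
    processed.append(result)


def run_overlapped(num_steps):
    processed = []
    cpu_thread = None
    for step in range(num_steps):
        result = gpu_step(step)      # launch step n+1 without waiting...
        if cpu_thread is not None:
            cpu_thread.join()        # ...while step n's CPU work finishes
        cpu_thread = threading.Thread(
            target=cpu_postprocess, args=(result, processed)
        )
        cpu_thread.start()
    if cpu_thread is not None:
        cpu_thread.join()            # drain the final step's CPU work
    return processed
```

Note that the result of the final step is only post-processed after the loop, which mirrors the one-extra-step tradeoff described below.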

## Tradeoff

The optimization introduces one extra decoding step, but it significantly improves throughput by keeping the GPU busy while the CPU finalizes the previous step.

## Usage

The overlap scheduler is enabled by default. To disable it, set `disable_overlap_scheduler=True` in the configuration.
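For example, with the LLM API this might look like the following sketch. The option name `disable_overlap_scheduler` comes from this page; passing it directly to the `LLM` constructor, and the model name shown, are assumptions that may differ across versions.

```python
# Sketch: disabling the overlap scheduler via the LLM API.
# Passing disable_overlap_scheduler directly to LLM() is an assumption;
# check your version's configuration reference for the exact placement.
from tensorrt_llm import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    disable_overlap_scheduler=True,              # default is False (enabled)
)
```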
