# Overlap Scheduler
To maximize GPU utilization, the scheduler overlaps CPU tasks (e.g., checking sampling stop criteria, updating responses, scheduling the next batch) with GPU computation.
## How It Works
At step n, the system launches GPU computation for step n+1 without waiting for CPU tasks (e.g., stop criteria checks) from step n to complete. This allows:
- CPU work (step n) and GPU computation (step n+1) to run concurrently.
- Better GPU occupancy by reducing idle time.
This concurrent execution pipeline is illustrated in the PyExecutor's logic:
```python
# Schedule and launch GPU work for the current step (n).
scheduled_batch, _, _ = self._schedule()
batch_outputs = self._forward_step(scheduled_batch, previous_tensors_device)
sample_state = self._sample_async(scheduled_batch, batch_outputs)

# While the GPU is busy, process the CPU-bound results from the
# previous step (n-1).
if self.previous_batch is not None:
    self._process_previous_batch()
```
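The same pattern can be shown in a standalone toy sketch (not PyExecutor code; it assumes PyTorch and a CUDA-capable GPU, and `forward_step` is a hypothetical stand-in for a model forward pass). Because CUDA kernel launches are asynchronous, the host can process the previous step's results while the current step's kernels execute:

```python
import torch

def forward_step(x: torch.Tensor) -> torch.Tensor:
    # Kernel launches are asynchronous: this returns to the CPU almost
    # immediately while the GPU works through the queued matmuls.
    y = x
    for _ in range(8):
        y = y @ x
        y = y / y.norm()  # keep values bounded
    return y

x = torch.randn(2048, 2048, device="cuda")
previous_host_output = None

for step in range(4):
    out = forward_step(x)  # GPU starts on step n; the CPU is free again
    if previous_host_output is not None:
        # CPU-bound work on step n-1's results (a stand-in for
        # stop-criteria checks) overlaps with the GPU computing step n.
        _ = (previous_host_output > 0).float().mean()
    # Implicit synchronization point: wait for step n to finish and
    # copy its result to the host for the next iteration's CPU work.
    previous_host_output = out.cpu()
```

Without the overlap, the CPU checks would sit between consecutive forward steps and leave the GPU idle for their duration.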
## Tradeoff
The optimization introduces one extra decoding step: because the stop-criteria check for step n completes only while step n+1 is already running on the GPU, a request can execute one forward pass past its stopping point. In exchange, it significantly improves throughput.
## Usage
Enabled by default. To disable it, set `disable_overlap_scheduler=True` in the configuration.
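For example, through the LLM API (a minimal sketch; it assumes your TensorRT-LLM version accepts `disable_overlap_scheduler` as a top-level `LLM` keyword argument, and the model path is a placeholder):

```python
from tensorrt_llm import LLM

# Assumption: disable_overlap_scheduler is accepted directly by the LLM
# constructor in this release; older releases may expose it through a
# backend-specific config object instead.
llm = LLM(
    model="/path/to/model",          # placeholder model path
    disable_overlap_scheduler=True,  # turn the overlap scheduler off
)
```

Disabling it can simplify debugging, since each step's CPU-side results are fully processed before the next step launches.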