mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

add feature support matrix for PyTorch backend (#5037 )

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

2025-07-01 10:09:54 +08:00


Overlap Scheduler

To maximize GPU utilization, the scheduler overlaps CPU tasks (e.g., checking sampling stop criteria, updating responses, scheduling the next batch) with GPU computation.

How It Works

At step n, the system launches GPU computation for step n+1 without waiting for CPU tasks (e.g., stop criteria checks) from step n to complete. This allows:

CPU work (step n) and GPU computation (step n+1) to run concurrently.
Better GPU occupancy by reducing idle time.
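
The effect on the timeline can be illustrated with a toy model. The sketch below is not TensorRT-LLM code: the function `simulate_timeline` and its fixed per-step costs `gpu_ms` and `cpu_ms` are assumptions made for illustration. It compares a loop in which each GPU step waits for the previous step's CPU work against one in which the next GPU step is launched immediately.

```python
def simulate_timeline(num_steps, gpu_ms, cpu_ms, overlap):
    """Toy model (hypothetical, not the real scheduler).

    Returns (total_ms, gpu_idle_ms), where gpu_idle_ms is the time the GPU
    spends waiting between consecutive steps.
    """
    gpu_free = 0  # time at which the GPU finishes its latest step
    cpu_free = 0  # time at which the CPU finishes its latest post-processing
    gpu_idle = 0
    for _ in range(num_steps):
        if overlap:
            # Launch the next GPU step without waiting for CPU tasks.
            gpu_start = gpu_free
        else:
            # Wait until the CPU tasks of the previous step are done.
            gpu_start = max(gpu_free, cpu_free)
        gpu_idle += gpu_start - gpu_free
        gpu_end = gpu_start + gpu_ms
        # CPU tasks for this step need its GPU result and a free CPU.
        cpu_start = max(gpu_end, cpu_free)
        cpu_free = cpu_start + cpu_ms
        gpu_free = gpu_end
    return cpu_free, gpu_idle


sequential = simulate_timeline(num_steps=4, gpu_ms=10, cpu_ms=3, overlap=False)
overlapped = simulate_timeline(num_steps=4, gpu_ms=10, cpu_ms=3, overlap=True)
print(sequential)  # (52, 9)
print(overlapped)  # (43, 0)
```

In this model the sequential loop leaves the GPU idle for `cpu_ms` between every pair of steps, while the overlapped loop hides all CPU work except the final step's. The model ignores the extra decoding step discussed under Tradeoff.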

Tradeoff

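Because step n+1 is launched before the stop criteria for step n have been checked, a request that finishes at step n still runs one extra decoding step. In exchange, throughput improves significantly.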
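
A minimal sketch of where the extra step comes from, assuming a single request that stops after a fixed number of tokens. The function `count_decoding_steps` and its parameters are hypothetical and only model the ordering of "launch" and "check stop criteria"; they are not part of the TensorRT-LLM API.

```python
def count_decoding_steps(tokens_until_stop, overlap):
    """Count GPU steps launched before the loop observes the stop condition.

    Toy model (hypothetical): one token is produced per step, and the request
    should stop once `tokens_until_stop` tokens exist (must be >= 1).
    """
    launched = 0
    finished = False
    while not finished:
        launched += 1  # launch the GPU computation for the next step
        if overlap:
            # The CPU checks the *previous* step while this one is running.
            step_checked = launched - 1
        else:
            # The CPU checks the step that was just completed.
            step_checked = launched
        if step_checked >= tokens_until_stop:
            finished = True
    return launched


print(count_decoding_steps(5, overlap=False))  # 5
print(count_decoding_steps(5, overlap=True))   # 6
```

With overlap enabled, the sixth step has already been launched by the time the check for the fifth step reports that the request is done.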

Usage

The overlap scheduler is enabled by default. To disable it, set disable_overlap_scheduler=True in the configuration.
