mirror of
https://github.com/NVIDIA/TensorRT-LLM.git
synced 2026-01-14 06:27:45 +08:00
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Overlap Scheduler
To maximize GPU utilization, the scheduler overlaps CPU tasks (e.g., checking sampling stop criteria, updating responses, scheduling the next batch) with GPU computation.
How It Works
At step n, the system launches GPU computation for step n+1 without waiting for CPU tasks (e.g., stop criteria checks) from step n to complete. This allows:
- CPU work (step n) and GPU computation (step n+1) to run concurrently.
- Higher GPU utilization, since the GPU no longer sits idle while the CPU finishes its per-step work.
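The effect on total runtime can be illustrated with a toy timing model. This is only a sketch, not the scheduler's actual implementation: the function names, the fixed per-step costs `gpu_ms` and `cpu_ms`, and the assumption that the GPU runs at most one step ahead of the CPU are all hypothetical.

```python
def sequential_makespan(num_steps, gpu_ms, cpu_ms):
    """No overlap: each step runs the GPU forward pass, then the CPU work,
    and only then launches the next step."""
    return num_steps * (gpu_ms + cpu_ms)


def overlapped_makespan(num_steps, gpu_ms, cpu_ms):
    """Overlap: GPU step k+1 is launched without waiting for the CPU work of step k.

    Assumption (hypothetical): the GPU runs at most one step ahead, so GPU
    step k waits only for GPU step k-1 and for the CPU work of step k-2.
    """
    gpu_done = []  # finish time of the GPU work for each step
    cpu_done = []  # finish time of the CPU work for each step
    for k in range(num_steps):
        prev_gpu = gpu_done[k - 1] if k >= 1 else 0
        cpu_two_back = cpu_done[k - 2] if k >= 2 else 0
        gpu_done.append(max(prev_gpu, cpu_two_back) + gpu_ms)

        # CPU work for step k needs the GPU output of step k and a free CPU.
        prev_cpu = cpu_done[k - 1] if k >= 1 else 0
        cpu_done.append(max(gpu_done[k], prev_cpu) + cpu_ms)
    return cpu_done[-1] if cpu_done else 0


if __name__ == "__main__":
    print(sequential_makespan(5, gpu_ms=10, cpu_ms=4))  # 70
    print(overlapped_makespan(5, gpu_ms=10, cpu_ms=4))  # 54
```

In this model, when the CPU work is shorter than the GPU work, only the CPU work of the final step is left exposed. The CPU work of every earlier step is hidden behind GPU computation.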
Tradeoff
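Because step n+1 is launched before the stop criteria for step n have been checked, a request that finishes at step n still runs one extra decoding step. This small amount of wasted work is the cost of the optimization, which in return significantly improves throughput.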
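Where the extra step comes from can be seen in a toy decoding loop. This is a minimal sketch under stated assumptions: `run_decode`, the `tokens` list standing in for model outputs, and the `eos` stop criterion are hypothetical and are not part of the TensorRT-LLM API.

```python
def run_decode(tokens, eos, overlap):
    """Return (number of forward steps launched, tokens emitted).

    `tokens` stands in for the model's outputs, one per forward step.
    It must extend at least one token past `eos` so the overlapped loop
    has something to launch.
    """
    stream = iter(tokens)
    launched = 0
    emitted = []

    def launch():
        # Stand-in for launching one GPU forward step.
        nonlocal launched
        launched += 1
        return next(stream)

    pending = launch()  # first step
    while True:
        if overlap:
            # Launch the next step before the stop check of the current one.
            nxt = launch()
        emitted.append(pending)
        if pending == eos:  # CPU-side stop criterion
            break
        if not overlap:
            # Without overlap, the next step is launched only after the check.
            nxt = launch()
        pending = nxt
    return launched, emitted


if __name__ == "__main__":
    outputs = [5, 7, 9, 2, 11, 13]  # 2 plays the role of the end-of-sequence token
    print(run_decode(outputs, eos=2, overlap=False))  # (4, [5, 7, 9, 2])
    print(run_decode(outputs, eos=2, overlap=True))   # (5, [5, 7, 9, 2])
```

Both variants emit the same tokens. The overlapped loop launches exactly one more forward step, and the result of that step is never used.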
Usage
The overlap scheduler is enabled by default. To disable it, set disable_overlap_scheduler=True in the configuration.