# Scheduler

The TensorRT-LLM PyTorch backend employs inflight batching, a mechanism in which batching and scheduling happen dynamically at each LLM step. At every step, the scheduler is invoked to determine which requests are scheduled for that step.

## Scheduler Introduction

There are two kinds of schedulers:

- `CapacityScheduler`: Decides whether resources should be allocated for each active request, taking into account the KV cache capacity and, where applicable, other resources. Its input is the list of all active requests that need processing. The primary output is `fitting_requests`, the requests for which resources are reserved at the current step. The other output, `paused_requests`, supports request pausing in the C++ runtime.
- `MicroBatchScheduler`: Selects a subset of the `fitting_requests` chosen by `CapacityScheduler`. It takes an additional input, `inflight_request_ids`, which supports pipeline parallelism and overlapped execution in the C++ runtime; since the PyTorch flow does not support pipeline parallelism, `inflight_request_ids` is an empty set. The outputs are `context_requests` and `generation_requests`, the scheduled context and generation requests. Requests not in these lists are not selected for the model forward pass.

`SimpleScheduler` combines the two schedulers, running `CapacityScheduler` first and then `MicroBatchScheduler`, to produce the final schedule. Its inputs are `active_requests` and `inflight_request_ids`, and its outputs are `context_requests`, `generation_requests`, and `paused_requests`.

## Customize Your Own Scheduler

To customize the scheduling or batching behavior, implement your own `CapacityScheduler` and `MicroBatchScheduler` by inheriting from the respective classes. If two-step scheduling is unnecessary, inherit from `RequestScheduler` and implement `schedule_request` directly, as in the sketch below.
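The following is a minimal, hypothetical sketch of such a single-step scheduler. The class name `FifoRequestScheduler`, the import paths, and the exact `schedule_request` signature and return shape are assumptions derived from the inputs and outputs described above; check the `RequestScheduler` definition in `scheduler.py` for the authoritative interface.

```python
# Assumed import paths; adjust them to match your TensorRT-LLM version.
from tensorrt_llm._torch.pyexecutor.llm_request import LlmRequestState
from tensorrt_llm._torch.pyexecutor.scheduler import RequestScheduler


class FifoRequestScheduler(RequestScheduler):
    """Schedule up to `max_num_requests` active requests in FIFO order."""

    def __init__(self, max_num_requests: int):
        super().__init__()
        self.max_num_requests = max_num_requests

    def schedule_request(self, active_requests, inflight_request_ids):
        # `inflight_request_ids` is ignored here because the PyTorch flow
        # passes an empty set (no pipeline parallelism).
        context_requests, generation_requests, paused_requests = [], [], []
        for request in active_requests:
            if len(context_requests) + len(generation_requests) >= self.max_num_requests:
                break
            # Requests still in the context (prefill) phase go into the context
            # batch; everything else is treated as a generation request.
            if request.state == LlmRequestState.CONTEXT_INIT:
                context_requests.append(request)
            else:
                generation_requests.append(request)
        # Return shape mirrors the SimpleScheduler outputs described above:
        # (context_requests, generation_requests, paused_requests).
        return context_requests, generation_requests, paused_requests
```

Note that this sketch caps the batch only by request count and performs no KV cache accounting; a production scheduler should reserve resources the way the example below does.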
An example `CapacityScheduler` implementation is the `GuaranteedNoEvictScheduler` class in [scheduler.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/scheduler.py). It was used before the project switched to the C++ `CapacityScheduler` binding, when the scheduler was still implemented in Python. It inherits from `CapacityScheduler` and implements its own `schedule_request` method, which walks through all `active_requests` and schedules as many requests as can fit in the KV cache. Note that its resource estimation must stay consistent with how resources are actually allocated and freed in `kv_cache_manager`.

Here is the code snippet:

```python
class GuaranteedNoEvictScheduler(CapacityScheduler):
    # Only schedule requests with no_schedule_until_state <= state < no_schedule_after_state.
    no_schedule_until_state = LlmRequestState.CONTEXT_INIT
    no_schedule_after_state = LlmRequestState.GENERATION_COMPLETE

    def __init__(self, max_num_requests: int, kv_cache_manager):
        super(GuaranteedNoEvictScheduler, self).__init__()
        self.max_num_requests = max_num_requests
        self.kv_cache_manager = kv_cache_manager

    def schedule_request(
            self, active_requests: RequestList
    ) -> tuple[list[LlmRequest], list[LlmRequest]]:
        scheduled_requests = []
        pending_requests = []
        reserved_blocks = 0
        max_blocks = self.kv_cache_manager.get_max_resource_count()
        for request in active_requests:
            req_state = request.state
            # Skip requests that cannot be scheduled yet or should no longer be scheduled.
            if req_state.value < self.no_schedule_until_state.value or req_state.value >= self.no_schedule_after_state.value:
                continue

            if len(scheduled_requests
                   ) >= self.max_num_requests or reserved_blocks >= max_blocks:
                break
            elif req_state == LlmRequestState.GENERATION_IN_PROGRESS or req_state == LlmRequestState.GENERATION_TO_COMPLETE:
                scheduled_requests.append(request)
                reserved_blocks += self.kv_cache_manager.get_needed_resource_to_completion(
                    request)
            else:
                pending_requests.append(request)

        available_blocks = max_blocks - reserved_blocks
        for request in pending_requests:
            req_state = request.state
            if len(scheduled_requests) >= self.max_num_requests:
                break
            elif req_state == LlmRequestState.CONTEXT_INIT:
                needed_blocks = self.kv_cache_manager.get_needed_resource_to_completion(
                    request)
                if needed_blocks <= available_blocks:
                    scheduled_requests.append(request)
                    available_blocks -= needed_blocks
                elif needed_blocks > available_blocks:
                    # If one request fails to be scheduled, stop scheduling further requests.
                    break

        assert len(scheduled_requests) > 0, (
            "no pending request can get enough resource to complete, "
            "please increase KV cache pool size.")
        return scheduled_requests, []
```

After implementing your own scheduler, integrate it into the `PyExecutor`. For the PyTorch backend, the relevant code is in [py_executor_creator.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py). In the `create_pytorch_model_based_executor` function, the following lines create the `CapacityScheduler`:

```python
capacitor_scheduler = BindCapacityScheduler(max_num_requests,
                                            kv_cache_manager.impl)
```

Replace these lines with the construction of your own scheduler; similar adjustments can be made for `MicroBatchScheduler`. The `PyExecutor` then executes with your customized scheduling logic.
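For illustration, here is a hedged sketch of what that swap could look like. The names `max_num_requests`, `kv_cache_manager`, and `mb_scheduler` are assumed to already be in scope inside `create_pytorch_model_based_executor`, and the `SimpleScheduler` constructor arguments are an assumption based on the composition described in the introduction; verify both against `py_executor_creator.py` and `scheduler.py`.

```python
# Assumed imports; both classes are expected to live in scheduler.py alongside
# GuaranteedNoEvictScheduler (verify the exact module in your version).
from tensorrt_llm._torch.pyexecutor.scheduler import (GuaranteedNoEvictScheduler,
                                                      SimpleScheduler)

# Before: the default C++-bound capacity scheduler.
# capacitor_scheduler = BindCapacityScheduler(max_num_requests,
#                                             kv_cache_manager.impl)

# After: a Python capacity scheduler, e.g. GuaranteedNoEvictScheduler shown
# above or your own subclass. Note that it takes the Python-level
# kv_cache_manager, whose get_max_resource_count() and
# get_needed_resource_to_completion() methods it calls, rather than
# kv_cache_manager.impl.
capacitor_scheduler = GuaranteedNoEvictScheduler(max_num_requests,
                                                 kv_cache_manager)

# The capacity scheduler and the micro-batch scheduler are then combined into
# the SimpleScheduler that PyExecutor consumes (constructor order assumed).
scheduler = SimpleScheduler(capacitor_scheduler, mb_scheduler)
```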