## Overview
`StreamGenerationTask` is an extension of `GenerationTask` designed for token-level streaming generation in asynchronous LLM workflows using TensorRT-LLM. It enables the controller to receive partial results during generation, which is critical for real-time or latency-sensitive applications such as chatbots, speech generation, or UI-interactive systems.

---

## Features
- ✅ Supports **streamed token delivery** in configurable step sizes (e.g., `streaming_step=1`).
- ✅ Supports **cancellation** of in-flight generation via a flag (`cancel_flag=True`).
- ✅ Tracks **stream completion status** (`end_flag=True` when generation is done).
- ✅ Integrates with a streaming-capable LLM interface (`generate_async`).

---

## Fields in `StreamGenerationTask`
| Field | Description |
|-------|-------------|
| `cancel_flag` | If `True`, the generation will be cancelled on the worker side. |
| `streaming_step` | Number of new tokens required before returning control to the controller. If set to `0`, the task is returned immediately if any new tokens are available. |
| `request_handle` | Internal handle for the streaming generation (used by the worker). |
| `end_flag` | Indicates whether generation is finished. |
| `output_str` / `output_tokens` / `logprobs` | Outputs, updated after each generation step. |
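
For orientation, here is a minimal sketch of the task's shape, using the fields from the table above. The dataclass layout and defaults are illustrative; see `stream_generation_task.py` for the actual definition:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class StreamGenerationTask:
    """Illustrative sketch only; field names follow the table above."""

    # Controller -> worker: request cancellation of in-flight generation.
    cancel_flag: bool = False
    # Return control to the controller every N new tokens
    # (0 = return as soon as any new tokens are available).
    streaming_step: int = 1
    # Worker-internal handle for the underlying streaming request.
    request_handle: Optional[Any] = None
    # Worker -> controller: True once generation has finished.
    end_flag: bool = False
    # Outputs, updated after each generation step.
    output_str: str = ""
    output_tokens: List[int] = field(default_factory=list)
    logprobs: List[float] = field(default_factory=list)
```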

---

## Usage in Controller/Worker
The controller can use `StreamGenerationTask` to drive efficient streaming generation workflows (see the sketch after this list):
- It sends tasks to the worker, which returns them once the number of newly generated tokens reaches the specified `streaming_step`.
- It can cancel a long-running task by setting `task.cancel_flag = True`, for example when the number of generated tokens exceeds a predefined threshold.
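
A minimal controller sketch of that loop follows. The generator-style `process` method mirrors the scaffolding's controller convention, while the `Controller` import path, `MAX_TOKENS`, and the printing of partial output are illustrative assumptions; see `stream_generation_controller.py` for the real implementation:

```python
from tensorrt_llm.scaffolding import Controller  # import path is an assumption


class StreamGenerationController(Controller):
    """Illustrative only; see stream_generation_controller.py."""

    MAX_TOKENS = 2048  # hypothetical cancellation threshold

    def process(self, tasks, **kwargs):
        task = tasks[0]          # assumed to be a single StreamGenerationTask
        task.streaming_step = 1  # regain control after every new token
        while not task.end_flag:
            # Yield to the worker; it hands the task back once
            # `streaming_step` new tokens are available or generation ends.
            yield [task]
            print(f"partial output: {task.output_str!r}")
            if len(task.output_tokens) > self.MAX_TOKENS:
                # Over budget: ask the worker to cancel on the next step.
                task.cancel_flag = True
                yield [task]
                break
```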
To support this behavior on the worker side, implement a `stream_generation_handler` and register it with the worker (see the sketch below). The handler processes `StreamGenerationTask` instances step by step and updates the output fields such as `output_tokens` and `output_str`.
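
A minimal handler sketch, assuming the task carries its prompt and sampling parameters in `input_str` and `sampling_params` (names borrowed from `GenerationTask`, an assumption) and that the streaming handle is async-iterable with an `abort()` method:

```python
async def stream_generation_handler(worker, task):
    """Illustrative only; see stream_generation_task.py for the real handler."""
    if task.request_handle is None:
        # First visit: start a streaming request and stash the handle.
        task.request_handle = worker.llm.generate_async(
            task.input_str,                        # prompt field name assumed
            sampling_params=task.sampling_params,  # assumed to mirror GenerationTask
            streaming=True,
        )
    if task.cancel_flag:
        task.request_handle.abort()  # cancellation hook is an assumption
        task.end_flag = True
        return
    before = len(task.output_tokens)
    async for output in task.request_handle:
        # Copy the latest partial results onto the task.
        completion = output.outputs[0]             # output shape assumed
        task.output_str = completion.text
        task.output_tokens = completion.token_ids
        task.logprobs = completion.logprobs
        # Hand control back once enough new tokens have accumulated.
        if len(task.output_tokens) - before >= max(task.streaming_step, 1):
            return
    task.end_flag = True  # stream exhausted: generation finished
```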
This design lets the controller and worker coordinate generation in a token-efficient, responsive manner that suits real-time applications.

See `stream_generation_controller.py` and `stream_generation_task.py` for more details.
## Notes
Ensure that the `worker.llm.generate_async(...)` method supports `streaming=True`.
## TODO
- Add error handling for failed `request_handle`
- Support a retry or backoff mechanism when generation stalls