Overview

StreamGenerationTask is an extension of GenerationTask designed for token-level streaming generation in asynchronous LLM workflows using TensorRT-LLM. It enables the controller to receive partial results while generation is still in progress, which is critical for real-time or latency-sensitive applications such as chatbots, speech generation, or interactive UIs.


Features

  • Delivers tokens in configurable steps (e.g., streaming_step=1 returns control after every new token).
  • Supports cancelling an in-flight generation via a flag (cancel_flag=True).
  • Tracks completion status (end_flag=True when generation finishes).
  • Integrates with a streaming-capable LLM interface (generate_async).
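As a taste, a controller might configure these knobs on a task as in the minimal sketch below. The import path and the create_from_prompt constructor (assumed to be inherited from GenerationTask) are assumptions; check stream_generation_task.py for the actual API.

```python
# Minimal sketch; import path and constructor are assumptions -- the real
# class is defined in stream_generation_task.py in this directory.
from stream_generation_task import StreamGenerationTask

task = StreamGenerationTask.create_from_prompt(  # constructor assumed from GenerationTask
    "Tell me a story about TensorRT.")
task.streaming_step = 1    # hand the task back after every new token
task.cancel_flag = False   # flip to True later to abort the request
```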

Fields in StreamGenerationTask

  • cancel_flag: if True, the generation will be cancelled on the worker side.
  • streaming_step: number of new tokens required before returning control to the controller. If set to 0, the task is returned as soon as any new tokens are available.
  • request_handle: internal handle for the streaming generation (used by the worker).
  • end_flag: indicates whether generation is finished.
  • output_str / output_tokens / logprobs: outputs updated after each generation step.
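For orientation, the task can be pictured as a dataclass extending GenerationTask. This is an illustrative shape only: field names follow the list above, while the authoritative definition (base class, defaults, and import path) lives in stream_generation_task.py.

```python
# Illustrative shape only; see stream_generation_task.py for the real definition.
from dataclasses import dataclass
from typing import Any, Optional

from tensorrt_llm.scaffolding import GenerationTask  # import path may vary by version

@dataclass
class StreamGenerationTask(GenerationTask):
    # If True, the worker cancels the in-flight request.
    cancel_flag: bool = False
    # Return the task to the controller every N new tokens
    # (0 = return as soon as any new tokens are available).
    streaming_step: int = 1
    # Worker-internal handle to the streaming request.
    request_handle: Optional[Any] = None
    # Set by the worker once generation has finished.
    end_flag: bool = False
```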

Usage in Controller/Worker

The controller can use StreamGenerationTask to drive efficient streaming-based generation workflows:

  • It sends tasks to the worker, which returns each task once the number of newly generated tokens reaches the specified streaming_step.
  • It can cancel a long-running task by setting task.cancel_flag = True, for example when the number of generated tokens exceeds a predefined threshold (see the sketch below).
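A minimal controller loop might look like the following. The generator-style process method that yields lists of tasks follows the scaffolding Controller convention, but treat the exact interface as an assumption; the 100-token cutoff is an arbitrary threshold chosen for illustration. The real version is in stream_generation_controller.py.

```python
# Sketch of a streaming controller; mirrors stream_generation_controller.py
# in spirit, but the exact Controller interface there may differ.
from tensorrt_llm.scaffolding import Controller  # import path may vary by version

class StreamGenerationController(Controller):
    # Scaffolding controllers are generators: each `yield [task]` hands the
    # task to the worker, which resumes generation and fills in new output.
    def process(self, tasks, **kwargs):
        for task in tasks:
            task.streaming_step = 1  # regain control after every new token
            while not task.end_flag:
                yield [task]
                print(task.output_str)  # consume the partial result
                # Arbitrary example policy: cancel generations past 100 tokens.
                if task.output_tokens and len(task.output_tokens) > 100:
                    task.cancel_flag = True
                    yield [task]  # let the worker abort cleanly
                    break
```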

To support this behavior on the worker side, implement a stream_generation_handler and register it with the worker. The handler processes a StreamGenerationTask step by step, updating fields such as output_tokens and output_str; a sketch follows.
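The sketch below assumes an async handler, that generate_async(..., streaming=True) returns a handle whose outputs grow as tokens arrive, and that the handle exposes an abort() method; the input_str and sampling_params field names are likewise assumptions. The authoritative version ships alongside stream_generation_task.py.

```python
# Sketch of a worker-side handler; several names here are assumptions (see above).
async def stream_generation_handler(worker, task):
    # First visit: start the streaming request and keep its handle on the task.
    if task.request_handle is None:
        task.request_handle = worker.llm.generate_async(
            task.input_str,                        # assumed prompt field
            sampling_params=task.sampling_params,  # assumed params field
            streaming=True,
        )
    # The controller asked for cancellation: abort and mark the task finished.
    if task.cancel_flag:
        task.request_handle.abort()  # assumed cancellation method on the handle
        task.end_flag = True
        return
    tokens_before = len(task.output_tokens or [])
    # Drain the stream until streaming_step new tokens have accumulated
    # (streaming_step=0 effectively returns on the first new token).
    async for output in task.request_handle:
        task.output_tokens = output.outputs[0].token_ids
        task.output_str = output.outputs[0].text
        if len(task.output_tokens) - tokens_before >= task.streaming_step:
            return  # hand partial results back to the controller
    task.end_flag = True  # stream exhausted: generation is complete
```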

This design allows the controller and worker to coordinate generation in a token-efficient and responsive manner, ideal for real-time applications.

See stream_generation_controller.py, stream_generation_task.py, and stream_generation_run.py for the full implementation.

Notes

Ensure the worker.llm.generate_async(...) method supports streaming=True.
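A quick way to verify this is a short self-check against the public TRT-LLM LLM API; the output fields shown (output.outputs[0].text) follow that API, but confirm against your installed version.

```python
# Self-check that streaming works in your TRT-LLM build.
# Drive it with, e.g., asyncio.run(check_streaming(LLM(model="/path/to/model"))).
from tensorrt_llm import LLM, SamplingParams

async def check_streaming(llm: LLM) -> None:
    handle = llm.generate_async(
        "Hello", sampling_params=SamplingParams(max_tokens=16), streaming=True
    )
    async for output in handle:
        print(output.outputs[0].text)  # accumulates as new tokens arrive
```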

TODO

  • Add error handling for a failed request_handle.
  • Support a retry or backoff mechanism when generation stalls.