# Scaffolding
## Introduction
Scaffolding is a framework for running various inference-time compute methods such as CoT, majority voting, best of N, and MCTS.
Scaffolding is built around three key principles:
- Ease of use: Users can easily run inference-time compute methods, and engineers can easily extend the framework with new methods and execution backends.
- Modularity: Scaffolding is designed to be modular, so engineers can fully reuse existing modules when defining new methods.
- Performance: Scaffolding is designed to be performant. It supports concurrent scheduling and provides additional information to the backend to help optimize performance.
## Architecture
Scaffolding has the following key components:
- `Controller`: The class that defines the workflow of an inference-time compute method.
- `Worker`: The class that handles operations such as generation and reward scoring.
- `ScaffoldingLlm`: The interface class for running inference-time compute methods.
The workflow of Scaffolding:
1. Users instantiate a `ScaffoldingLlm` instance by assembling a `Controller` and some `Worker`s.
2. Users call `ScaffoldingLlm`'s API to run an inference-time compute method.
3. `ScaffoldingLlm` instantiates a `Controller` instance and gets `Task`s from the `Controller`.
4. `ScaffoldingLlm` dispatches each `Task` to a `Worker` and returns the completed `Task` back to the `Controller`.
5. The `Controller` creates new `Task`s until the inference-time compute method is finished.
6. `ScaffoldingLlm` returns the result to users.
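The steps above can be sketched as a small Python program. All names here (`Task`, `Worker`, `Controller`, `ScaffoldingLlm`, `run_task`, `process`) are illustrative stand-ins for the pattern, not the exact `tensorrt_llm.scaffolding` API:

```python
# Minimal sketch of the Controller/Worker/ScaffoldingLlm loop.
# Hypothetical names and signatures; see the real classes in the package.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    input_str: str
    output_str: Optional[str] = None


class Worker:
    """Handles one operation, e.g. generation or reward scoring."""

    def run_task(self, task: Task) -> None:
        # A real worker would call an inference backend here.
        task.output_str = f"completion of: {task.input_str}"


class Controller:
    """Defines the method's workflow as a generator: it yields batches of
    Tasks and resumes once the scheduler has completed them."""

    def process(self, prompt: str):
        task = Task(input_str=prompt)
        yield [task]                    # hand Tasks out (step 3)
        self.result = task.output_str   # Tasks come back completed (step 5)


class ScaffoldingLlm:
    """Drives the Controller/Worker loop (steps 3-6)."""

    def __init__(self, controller_cls, worker: Worker):
        self.controller_cls = controller_cls
        self.worker = worker

    def generate(self, prompt: str) -> str:
        controller = self.controller_cls()
        for tasks in controller.process(prompt):  # get Tasks (step 3)
            for task in tasks:
                self.worker.run_task(task)        # dispatch (step 4)
        return controller.result                  # return result (step 6)


llm = ScaffoldingLlm(Controller, Worker())
print(llm.generate("What is 1 + 1?"))
# prints "completion of: What is 1 + 1?"
```

A generator-based `Controller` keeps the method's control flow in one place while letting the scheduler decide how and where each `Task` batch actually runs.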
## Usage
See [example/scaffolding](example/scaffolding).
## Future Work
- Support OpenAI API worker (in progress)
- Support reward model (in progress)
- Performance benchmark (in progress)
- Support best of N
- Support MCTS
- Support sandbox