# High-level API
We are working on a high-level API for the LLM workflow, which is still in incubation and may change later.
Here we show you a preview of how it works and how to use it.
Note that the APIs are not stable, and currently only the LLaMA model is supported, on a single node with a single GPU and with limited optimizations.
We appreciate your patience and understanding as we improve this API.

## Basic usage
To use the API, import `LLM` and `ModelConfig` from the `tensorrt_llm` package and create an `LLM` object from a HuggingFace model directly.
For example:
``` python
from tensorrt_llm.hlapi.llm import LLM, ModelConfig

# Point the config at a local HuggingFace LLaMA checkpoint.
config = ModelConfig(model_dir=<llama_model_path>)
llm = LLM(config)
```
Creating the `LLM` object triggers TRT-LLM engine building in the background, and a HuggingFace tokenizer is created by default to support end-to-end generation.
To generate text, call the `LLM` object directly with a batch of prompts, for example:
``` python
prompts = ["To tell a story"]

for output in llm(prompts):
    print(output)
```
The output might be something like:
``` python
GenerationPiece(index=0, text="with a picture.\nI'm a writer, but I'm also a photographer.")
```
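
Each element in the batch produces a `GenerationPiece` like the one above. As a minimal sketch, assuming the `text` field shown in that repr is accessible as an attribute, you can print just the generated strings:

``` python
# Sketch: assumes GenerationPiece exposes the `text` field from the repr above.
for output in llm(prompts):
    print(output.text)  # e.g. "with a picture.\nI'm a writer, ..."
```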
You can also dump the runtime engine to disk and load it from the engine file directly on the next run, saving the time of rebuilding the engine from the HuggingFace model.
``` python
# Save the built engine to disk.
llm.save(<engine-path>)

# On the next run, load the engine directly.
config = ModelConfig(model_dir=<engine-path>)
llm = LLM(config)
```
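
Putting the two runs together, here is a minimal sketch that builds the engine on first use and reuses it afterwards; the `engine_dir` path is hypothetical:

``` python
import os

from tensorrt_llm.hlapi.llm import LLM, ModelConfig

engine_dir = "./llama_engine"  # hypothetical location for the saved engine

if os.path.isdir(engine_dir):
    # Reuse the engine saved by a previous run.
    config = ModelConfig(model_dir=engine_dir)
    llm = LLM(config)
else:
    # First run: build from the HuggingFace checkpoint, then save the engine.
    config = ModelConfig(model_dir=<llama_model_path>)
    llm = LLM(config)
    llm.save(engine_dir)
```
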
## Customization
By default, the high-level API uses the transformers `AutoTokenizer`. You can override it with your own tokenizer by passing one in when creating the `LLM` object. For example:
``` python
llm = LLM(config, tokenizer=<my_faster_one>)
```
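
For instance, assuming a transformers tokenizer instance can be passed in directly (an assumption, mirroring the default `AutoTokenizer` behavior), a fast tokenizer could be supplied like this:

``` python
from transformers import AutoTokenizer

# Assumption: the API accepts a transformers tokenizer instance directly.
my_tokenizer = AutoTokenizer.from_pretrained(<llama_model_path>, use_fast=True)
llm = LLM(config, tokenizer=my_tokenizer)
```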
Besides the tokenizer, you can also override the model by passing in an in-memory model, which saves significant effort when working with a highly customized model.
``` python
# A custom network defined with TRT-LLM's Module base class.
class MyModel(Module): ...

my_model = MyModel(...)
llm = LLM(model=my_model)
```