# High-level API
We are working on a high-level API for the LLM workflow, which is still in incubation and may change later. Here we show a preview of how it works and how to use it.

Note that the APIs are not stable and currently only support the LLaMA model on a single node with a single GPU, with limited optimization. We appreciate your patience and understanding as we improve this API.
## Basic usage
To use the API, import `LLM` and `ModelConfig` from the `tensorrt_llm` package and create an `LLM` object directly from a HuggingFace model. For example:
```python
from tensorrt_llm.hlapi.llm import LLM, ModelConfig

config = ModelConfig(model_dir=<llama_model_path>)
llm = LLM(config)
```
This triggers TRT-LLM engine building in the background and creates a HuggingFace tokenizer by default to support end-to-end generation.
To generate text, call the `LLM` object directly (its `__call__` method) with a batch of prompts, for example:
```python
prompts = ["To tell a story"]

for output in llm(prompts):
    print(output)
```
The output might be something like:
```
GenerationPiece(index=0, text="with a picture.\nI'm a writer, but I'm also a photographer.")
```
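Each item yielded by `llm(prompts)` carries the generated text for one prompt. A minimal sketch of collecting only the text, assuming the `index` and `text` fields shown in the example output above:

```python
# Keep just the generated strings; `output.text` is assumed to match the
# `text` field shown in the GenerationPiece output above.
generated = [output.text for output in llm(prompts)]
print(generated)
```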
You can also dump the runtime engine to disk and load it from the engine file directly in the next run, saving the engine-building time from the HuggingFace model.
```python
# Dump the built engine to disk.
llm.save(<engine-path>)

# Next time, load the saved engine directly instead of rebuilding.
config = ModelConfig(model_dir=<engine-path>)
llm = LLM(config)
```
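Putting the two runs together, here is a minimal build-once, load-afterwards sketch; the directory paths and the `os.path.isdir` check are illustrative assumptions, not part of the API:

```python
import os

from tensorrt_llm.hlapi.llm import LLM, ModelConfig

# Hypothetical paths, for illustration only.
hf_model_dir = "/models/llama-7b-hf"
engine_dir = "/engines/llama-7b"

if os.path.isdir(engine_dir):
    # Reuse the previously saved engine and skip the build step.
    config = ModelConfig(model_dir=engine_dir)
    llm = LLM(config)
else:
    # Build from the HuggingFace checkpoint, then save the engine for next time.
    config = ModelConfig(model_dir=hf_model_dir)
    llm = LLM(config)
    llm.save(engine_dir)
```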
## Customization
By default, the high-level API uses transformers’ `AutoTokenizer`. You can override it with your own tokenizer by passing it in when creating the `LLM` object. For example:
```python
llm = LLM(config, tokenizer=<my_faster_one>)
```
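For instance, a minimal sketch that passes in a pre-loaded HuggingFace tokenizer; the checkpoint name and the `use_fast` choice are illustrative assumptions:

```python
from transformers import AutoTokenizer

from tensorrt_llm.hlapi.llm import LLM, ModelConfig

# Load the tokenizer yourself, e.g. to pin a specific checkpoint or enable
# the fast Rust implementation; the model name here is hypothetical.
my_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True)

config = ModelConfig(model_dir=<llama_model_path>)
llm = LLM(config, tokenizer=my_tokenizer)
```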
Besides the tokenizer, you can also override the model by passing in an in-memory model, which saves a lot of effort for a highly customized model.
```python
# Define and instantiate your own model in memory (details elided),
# then hand it to the LLM directly.
class MyModel(Module): ...

my_model = MyModel(...)
llm = LLM(model=my_model)
```