mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-04 10:11:47 +08:00

Update TensorRT-LLM Release branch (#1192 )

* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

2024-02-29 17:20:55 +08:00


Usage:

Single GPU

python ./llama.py --hf_model_dir <hf llama dir> --engine_dir ./llama.engine

The bot reads one sentence at a time and generates at most 20 tokens in reply. Type "q" or "quit" to stop chatting.
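The read-one-line, reply, stop-on-"q" behaviour described above can be sketched as a small loop. This is only an illustration, not the code in llama.py: `chat_loop`, `echo_generate`, and `MAX_NEW_TOKENS` are hypothetical names, and `echo_generate` is a stand-in for the real engine call.

```python
from typing import Callable, Iterable, List

MAX_NEW_TOKENS = 20  # mirrors the 20-token limit mentioned above


def chat_loop(lines: Iterable[str], generate: Callable[[str, int], str]) -> List[str]:
    """Pass one sentence at a time to `generate` until "q" or "quit" is seen."""
    replies = []
    for line in lines:
        prompt = line.strip()
        if prompt in ("q", "quit"):
            break  # user asked to stop chatting
        if not prompt:
            continue  # skip empty input lines
        replies.append(generate(prompt, MAX_NEW_TOKENS))
    return replies


def echo_generate(prompt: str, max_new_tokens: int) -> str:
    # Hypothetical stand-in for the engine: echoes at most max_new_tokens words.
    return " ".join(prompt.split()[:max_new_tokens])


replies = chat_loop(["hello there", "quit", "never read"], echo_generate)
```

For an interactive session, `sys.stdin` could be passed as `lines`, with `generate` replaced by a function that calls the built engine.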

Multi-GPU

This example uses multi-GPU tensor parallelism to build and run LLaMA, then generates on a pre-defined dataset. Multi-GPU can also support the chat scenario, but that requires additional code to read input on the root process and broadcast the tokens to all worker processes. The example only aims to demonstrate TRT-LLM API usage, so it uses a pre-defined dataset for simplicity.

python ./llama_multi_gpu.py --hf_model_dir <llama-7b-hf path> --engine_dir ./llama.engine.tp2 -c --tp_size 2
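The extra code needed for multi-GPU chat (read on the root process, broadcast tokens to the workers) might look roughly like the sketch below. It is not part of llama_multi_gpu.py. It assumes the workers run as MPI ranks and that `comm` offers `Get_rank()` and `bcast(obj, root=0)` as an mpi4py communicator does; `SingleProcessComm`, `toy_tokenize`, and `broadcast_prompt_tokens` are hypothetical names used so the snippet runs without MPI.

```python
from typing import Callable, List, Optional


class SingleProcessComm:
    """Hypothetical stand-in mimicking the two communicator methods used below."""

    def Get_rank(self) -> int:
        return 0  # the only process acts as root

    def bcast(self, obj, root: int = 0):
        return obj  # nothing to send to with a single process


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical tokenizer: one id per character, for illustration only.
    return [ord(ch) for ch in text]


def broadcast_prompt_tokens(
    comm,
    read_line: Callable[[], str],
    tokenize: Callable[[str], List[int]],
) -> Optional[List[int]]:
    """Root reads and tokenizes one line; every rank returns the same ids.

    Returns None on all ranks when the user types "q" or "quit".
    """
    token_ids = None
    if comm.Get_rank() == 0:
        text = read_line().strip()
        if text not in ("q", "quit"):
            token_ids = tokenize(text)
    # Non-root ranks pass None and receive the root's value.
    return comm.bcast(token_ids, root=0)


ids = broadcast_prompt_tokens(SingleProcessComm(), lambda: "hi\n", toy_tokenize)
```

Under a real MPI launch, `MPI.COMM_WORLD` from mpi4py could be passed as `comm`; every rank would then call the function once per turn and stop when it returns `None`.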
