Usage:
# Single GPU
```bash
python ./llama.py --hf_model_dir <hf llama dir> --engine_dir ./llama.engine
```
The bot reads one sentence at a time and generates at most 20 tokens for you.

Type "q" or "quit" to stop chatting.
# Multi-GPU
Use multi-GPU tensor parallelism to build and run LLaMA, then generate on a pre-defined dataset.

Note that multi-GPU can also support the chat scenario; you would need additional code to read input on the root process and broadcast the tokens to all worker processes (see the sketch after the command below).

This example only aims to demonstrate TRT-LLM API usage, so it uses a pre-defined dataset for simplicity.
```bash
python ./llama_multi_gpu.py --hf_model_dir <llama-7b-hf path> --engine_dir ./llama.engine.tp2 -c --tp_size 2
```
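
As a sketch of the broadcast approach mentioned above, the snippet below uses `mpi4py` to read a prompt on rank 0 and broadcast it to all worker ranks. This is a hypothetical extension of `llama_multi_gpu.py`, not code shipped with the example; the generation call itself is elided.

```python
# Hypothetical extension of llama_multi_gpu.py to the chat scenario:
# rank 0 reads user input, then broadcasts it so every tensor-parallel
# worker runs generation on the same prompt.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

while True:
    prompt = input("> ") if rank == 0 else None
    prompt = comm.bcast(prompt, root=0)  # all ranks now hold the same prompt
    if prompt in ("q", "quit"):
        break
    # ... each rank feeds `prompt` into its share of the TP generation step
```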