# Apps examples with GenerationExecutor / High-level API
## Python chat
[chat.py](./chat.py) provides a small example to play around with your model. You can run it with
`python3 examples/apps/chat.py <path_to_tllm_engine_dir> <path_to_tokenizer_dir>`
or
`mpirun -n <world_size> python3 examples/apps/chat.py <path_to_tllm_engine_dir> <path_to_tokenizer_dir>`
You can modify prompt settings by entering options starting with '!!'. Type '!!help' to see the available commands.
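For example, a session might look like the following; the engine and tokenizer paths are placeholders for a model you have already built:

```bash
# Start the interactive chat loop (paths below are illustrative).
python3 examples/apps/chat.py ./llama-7b-engine ./llama-7b-tokenizer
# At the prompt, type '!!help' to list the available '!!' commands,
# or enter plain text to get a reply from the model.
```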
## FastAPI server
### Install the additional requirements
`pip install -r examples/apps/requirements.txt`
### Start the server
Suppose you have built an engine with `trtllm-build`; you can now serve it with:
`python3 -m examples.apps.fastapi_server <path_to_tllm_engine_dir> <tokenizer_type> &`
or
`mpirun -n <world_size> python3 -m examples.apps.fastapi_server <path_to_tllm_engine_dir> <tokenizer_type> &`
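For instance, an engine built for two-way parallelism could be served in the background as follows; the engine directory and tokenizer type are placeholders, so substitute the values that match your build:

```bash
# Illustrative launch of a 2-rank engine; adjust -n, the engine path,
# and the tokenizer type to match how the engine was built.
mpirun -n 2 python3 -m examples.apps.fastapi_server ./llama-7b-engine llama &
```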
### Send requests
You can pass request arguments such as `max_new_tokens`, `top_p`, and `top_k` in the request's JSON body:
`curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8}'`
You can also use the streaming interface with:
`curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8, "streaming": true}' --output -`