# Apps examples with GenerationExecutor / High-level API
## Python chat
[chat.py](./chat.py) provides a small example to play around with your model. You can run it with
`python3 examples/apps/chat.py <path_to_tllm_engine_dir> <path_to_tokenizer_dir>`
or
`mpirun -n <world_size> python3 examples/apps/chat.py <path_to_tllm_engine_dir> <path_to_tokenizer_dir>`
You can modify the prompt settings by entering options starting with `!!`. Type `!!help` to see the available commands.
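
For illustration, here is a minimal sketch of the `!!` command convention. Only `!!help` is taken from the description above; the other option names and defaults are made up and do not match the real `chat.py`:

```python
# Illustrative-only sketch of a '!!' option loop; not the actual chat.py.
def chat_loop():
    options = {"max_new_tokens": 32}  # hypothetical default settings

    while True:
        line = input(">>> ").strip()
        if line == "!!help":
            print("Available: !!help, !!set <key> <value>, !!quit")
        elif line.startswith("!!set"):
            parts = line.split(maxsplit=2)
            if len(parts) == 3:
                options[parts[1]] = parts[2]
                print(f"{parts[1]} = {parts[2]}")
            else:
                print("usage: !!set <key> <value>")
        elif line == "!!quit":
            break
        else:
            # A real implementation forwards the prompt to the engine with
            # the current options; here we only echo it back.
            print(f"[would generate a reply to '{line}' using {options}]")


if __name__ == "__main__":
    chat_loop()
```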
## FastAPI server
### Install the additional requirements
`pip install -r examples/apps/requirements.txt`
### Start the server
Once you have built an engine with `trtllm-build`, you can serve it with:
`python3 -m examples.apps.fastapi_server <path_to_tllm_engine_dir> <tokenizer_type> &`
or
`mpirun -n <world_size> python3 -m examples.apps.fastapi_server <path_to_tllm_engine_dir> <tokenizer_type> &`
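
To give a feel for what the server exposes, the following is a minimal sketch of a `/generate` endpoint in the spirit of `fastapi_server`; the stub `fake_generate` stands in for the TensorRT-LLM engine call and none of this is the actual implementation:

```python
# Minimal sketch of a /generate endpoint; the real example wires the
# TensorRT-LLM runtime to the engine, here a stub stands in.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 32
    streaming: bool = False


def fake_generate(prompt: str, max_new_tokens: int) -> str:
    # Placeholder for the engine call; the real server forwards the
    # request arguments to the loaded engine.
    return prompt + " [generated text]"


@app.post("/generate")
async def generate(req: GenerateRequest) -> dict:
    return {"text": fake_generate(req.prompt, req.max_new_tokens)}
```

Saved as, say, `sketch_server.py`, this sketch can be run with `uvicorn sketch_server:app --port 8000` to try out the request shape below.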
### Send requests
You can pass request arguments such as `max_new_tokens`, `top_p`, and `top_k` in the JSON body:
`curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8}'`
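
The same request can be sent from Python; this sketch assumes the `requests` package is installed and the server is listening on `localhost:8000` as above:

```python
# Send a generation request from Python instead of curl.
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "In this example,", "max_new_tokens": 8, "top_p": 0.9},
)
print(resp.text)  # print whatever the server returns
```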
You can also use the streaming interface with:
`curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8, "streaming": true}' --output -`
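
A Python client can consume the streaming variant as well; the exact chunk framing depends on the server, so this sketch only shows the request shape and prints chunks as they arrive:

```python
# Consume the streaming endpoint chunk by chunk.
import requests

with requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "In this example,", "max_new_tokens": 8, "streaming": True},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```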