# Apps examples with GenerationExecutor / High-level API
## Python chat
`chat.py` provides a small example to play around with your model. You can run it with:

```
python3 examples/apps/chat.py <path_to_tllm_engine_dir> <path_to_tokenizer_dir>
```

or

```
mpirun -n <world_size> python3 examples/apps/chat.py <path_to_tllm_engine_dir> <path_to_tokenizer_dir>
```
You can modify the prompt settings by entering options starting with `!!`. Type `!!help` to see the available commands.
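`chat.py` is a thin loop over the GenerationExecutor high-level API named in the title. The sketch below shows the general pattern; the import path, constructor arguments, and the `.text` result field are assumptions based on this example's usage, so verify them against your TensorRT-LLM version.

```python
# Minimal sketch of the high-level API pattern chat.py wraps.
# NOTE: the import path, constructor arguments, and the `.text` field are
# assumptions; check your TensorRT-LLM version for the exact signatures.
from tensorrt_llm.executor import GenerationExecutor

executor = GenerationExecutor("<path_to_tllm_engine_dir>",
                              "<path_to_tokenizer_dir>")

# Submit one prompt and print the completion.
result = executor.generate("In this example,", max_new_tokens=8)
print(result.text)
```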
## FastAPI server
### Install the additional requirements
```
pip install -r examples/apps/requirements.txt
```
### Start the server
Suppose you have built an engine with `trtllm-build`; you can now serve it with:

```
python3 -m examples.apps.fastapi_server <path_to_tllm_engine_dir> <tokenizer_type> &
```

or

```
mpirun -n <world_size> python3 -m examples.apps.fastapi_server <path_to_tllm_engine_dir> <tokenizer_type> &
```
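To see how the endpoint is shaped, the hypothetical sketch below mirrors the request/response contract the curl examples in the next section rely on. It is illustrative only: `generate_text()` is a placeholder standing in for the real TensorRT-LLM executor call, and the response fields are assumptions.

```python
# Illustrative sketch of a /generate endpoint with the same contract as the
# curl examples below; generate_text() is a placeholder, not the real
# TensorRT-LLM call.
from fastapi import FastAPI
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()

async def generate_text(prompt: str, max_new_tokens: int):
    # Placeholder: a real server would stream tokens from the engine.
    for word in (prompt + " [generated]").split()[:max_new_tokens]:
        yield word + " "

@app.post("/generate")
async def generate(request: dict):
    prompt = request.get("prompt", "")
    max_new_tokens = request.get("max_new_tokens", 16)
    if request.get("streaming", False):
        # Stream chunks back to the client as they are produced.
        return StreamingResponse(generate_text(prompt, max_new_tokens),
                                 media_type="text/plain")
    # Otherwise collect the full completion and return it in one response.
    text = "".join([chunk async for chunk in generate_text(prompt, max_new_tokens)])
    return JSONResponse({"text": text})
```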
### Send requests
You can pass request arguments such as `max_new_tokens`, `top_p`, and `top_k` in your JSON dict:
```
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8}'
```

You can also use the streaming interface with:

```
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8, "streaming": true}' --output -
```
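The same requests can be sent from Python with the `requests` package. Only the URL and JSON fields come from the examples above; how you parse the response body is an assumption, so adapt it to the server's actual output format.

```python
# Python equivalent of the curl calls above, using `requests`.
# The response format is an assumption; adapt parsing to the server's output.
import requests

url = "http://localhost:8000/generate"

# Plain request: the full completion comes back in one JSON response.
resp = requests.post(url, json={"prompt": "In this example,", "max_new_tokens": 8})
print(resp.json())

# Streaming request: consume the body incrementally as tokens arrive.
with requests.post(url,
                   json={"prompt": "In this example,",
                         "max_new_tokens": 8,
                         "streaming": True},
                   stream=True) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```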