# Apps examples with GenerationExecutor / High-level API

## Python chat

`chat.py` provides a small example for playing around with your model. You can run it with

```bash
python3 examples/apps/chat.py <path_to_tllm_engine_dir> <path_to_tokenizer_dir>
```

or

```bash
mpirun -n <world_size> python3 examples/apps/chat.py <path_to_tllm_engine_dir> <path_to_tokenizer_dir>
```

You can modify prompt settings by entering options that start with `!!`. Type `!!help` to see the available commands.

## FastAPI server

### Install the additional requirements

```bash
pip install -r examples/apps/requirements.txt
```

### Start the server

Suppose you have built an engine with `trtllm-build`; you can now serve it with:

```bash
python3 -m examples.apps.fastapi_server <path_to_tllm_engine_dir> <tokenizer_type> &
```

or

```bash
mpirun -n <world_size> python3 -m examples.apps.fastapi_server <path_to_tllm_engine_dir> <tokenizer_type> &
```

### Send requests

You can pass request arguments such as `max_new_tokens`, `top_p`, and `top_k` in your JSON dict:

```bash
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8}'
```
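If you prefer scripting requests instead of `curl`, the sketch below does the same thing from Python. It only relies on the `/generate` endpoint and the JSON fields shown above; the sampling values are illustrative, and the response body is printed verbatim since its exact schema is not documented here.

```python
# Minimal Python client sketch for the server above (assumes it is listening
# on http://localhost:8000, as in the curl examples).
import requests

payload = {
    "prompt": "In this example,",
    "max_new_tokens": 8,
    # Optional sampling parameters mentioned above; values are illustrative.
    "top_p": 0.9,
    "top_k": 40,
}

resp = requests.post("http://localhost:8000/generate", json=payload)
resp.raise_for_status()
print(resp.text)  # print the raw response body returned by the server
```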

You can also use the streaming interface:

```bash
curl http://localhost:8000/generate -d '{"prompt": "In this example,", "max_new_tokens": 8, "streaming": true}' --output -
```
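A streaming request can be consumed from Python in the same way; the sketch below sets `"streaming": true` and reads the body incrementally. The chunk format depends on the server implementation, so each chunk is simply echoed as it arrives.

```python
# Sketch of a streaming client: read the generated text incrementally
# instead of waiting for the whole completion.
import requests

payload = {"prompt": "In this example,", "max_new_tokens": 8, "streaming": True}

with requests.post("http://localhost:8000/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        if chunk:
            print(chunk, end="", flush=True)  # echo each chunk as it arrives
print()
```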