OpenAI-Compatible Server
vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more! This functionality lets you serve models and interact with them using an HTTP client.
Supported APIs
We currently support the following OpenAI APIs:
- Completions API (`/v1/completions`)
    - Only applicable to text generation models.
    - Note: The `suffix` parameter is not supported.
- Responses API (`/v1/responses`)
    - Only applicable to text generation models.
- Chat Completions API (`/v1/chat/completions`)
    - Only applicable to text generation models with a chat template.
    - Note: The `user` parameter is ignored.
    - Note: Setting the `parallel_tool_calls` parameter to `false` ensures vLLM only returns zero or one tool call per request. Setting it to `true` (the default) allows returning more than one tool call per request. There is no guarantee more than one tool call will be returned if this is set to `true`, as that behavior is model dependent and not all models are designed to support parallel tool calls.
- Embeddings API (`/v1/embeddings`)
    - Only applicable to embedding models.
- Transcriptions API (`/v1/audio/transcriptions`)
    - Only applicable to Automatic Speech Recognition (ASR) models.
- Translation API (`/v1/audio/translations`)
    - Only applicable to Automatic Speech Recognition (ASR) models.
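
As an illustration of the `parallel_tool_calls` note above, the sketch below builds a Chat Completions request body that asks for at most one tool call. The `get_weather` tool and the `build_tool_request` helper are hypothetical names used only for this example; the model name is the one used elsewhere on this page. It only constructs the body, so it runs without a server.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt: str, allow_parallel: bool = False) -> dict:
    """Return a /v1/chat/completions body with parallel_tool_calls set explicitly."""
    return {
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [GET_WEATHER_TOOL],
        # False asks for zero or one tool call; True merely permits more than one.
        "parallel_tool_calls": allow_parallel,
    }


body = build_tool_request("What is the weather in Paris and in Rome?")
print(json.dumps(body, indent=2))
```

With the official OpenAI Python client, the same fields can be passed as the `tools` and `parallel_tool_calls` keyword arguments of `client.chat.completions.create`.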
Completions API
In your terminal, you can install vLLM, then start the server with the vllm serve command. (You can also use our Docker image.)
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --dtype auto \
    --api-key token-abc123
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the official OpenAI Python client.
??? code
```python
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # must match the value passed to --api-key
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message)
```
!!! tip
vLLM supports some parameters that are not supported by OpenAI, such as `top_k`.
You can pass these parameters to vLLM through the `extra_body` parameter of the OpenAI client, for example `extra_body={"top_k": 50}` for `top_k`.
!!! important
By default, the server applies generation_config.json from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server.
Extra Parameters
vLLM supports a set of parameters that are not part of the OpenAI API. To use them with the OpenAI client, pass them as extra parameters. If you are calling the HTTP API directly, merge them into the JSON payload instead.
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)
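
For the direct-HTTP case, a minimal sketch using only the Python standard library might look like the following. It builds the request without sending it, so it runs without a server. The `build_chat_request` helper is a hypothetical name for this example; the URL, API key, and model are the ones used earlier on this page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same address as the client example
API_KEY = "token-abc123"  # same key that was passed to --api-key


def build_chat_request(messages, **extra_params):
    """Build (but do not send) a POST request with extra parameters at the top level."""
    payload = {
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": messages,
    }
    # Extra parameters sit next to the standard fields, not under "extra_body".
    payload.update(extra_params)
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


request = build_chat_request(
    [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    structured_outputs={"choice": ["positive", "negative"]},
    top_k=50,
)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would send it to a running server. The key point is that `extra_body` is a convenience of the OpenAI client; on the wire, its contents are merged into the top level of the JSON body.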
Extra HTTP Headers
Only the `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`.
??? code
```python
# Chat Completions request tagged with a custom request ID.
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    },
)
print(completion._request_id)

# Completions request tagged with a custom request ID.
completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)
```
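
When many requests are in flight, a unique ID per request makes log correlation easier. The sketch below is one possible way to generate such IDs and to check the value that comes back. Both helpers are hypothetical names, and the response headers are simulated here rather than captured from a server; it assumes the server returns the ID in an `X-Request-Id` response header, which is what `completion._request_id` reads in the example above.

```python
import uuid


def request_id_headers(prefix: str) -> dict:
    """Return extra headers carrying a unique, human-readable request ID."""
    return {"x-request-id": f"{prefix}-{uuid.uuid4().hex[:8]}"}


def echoed_request_id(response_headers: dict):
    """Look up X-Request-Id in response headers, ignoring header-name case."""
    for name, value in response_headers.items():
        if name.lower() == "x-request-id":
            return value
    return None


headers = request_id_headers("sentiment-classification")

# Simulated response headers, standing in for what a server would send back.
simulated_response = {"X-Request-Id": headers["x-request-id"]}
print(echoed_request_id(simulated_response) == headers["x-request-id"])
```

The dictionary returned by `request_id_headers` can be passed as `extra_headers` to the client calls shown above.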
API Reference
Completions API
Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.
Code example: examples/basic/online_serving/openai_completion_client.py
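
As a rough sketch of what a `/v1/completions` request body looks like, the hypothetical helper below assembles one and refuses the `suffix` parameter on the client side, since this server does not support it. The client-side check is an illustrative design choice, not something vLLM requires; `max_tokens` and `temperature` are standard OpenAI Completions parameters.

```python
def build_completion_body(prompt: str, **params) -> dict:
    """Return a /v1/completions body, rejecting the unsupported suffix parameter."""
    if "suffix" in params:
        raise ValueError("the suffix parameter is not supported by this server")
    body = {
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "prompt": prompt,
    }
    body.update(params)
    return body


body = build_completion_body(
    "A robot may not injure a human being",
    max_tokens=32,
    temperature=0.0,
)
print(body)
```

Unlike the Chat API, the body carries a plain `prompt` string rather than a `messages` list, so no chat template is involved.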
Extra parameters
The following sampling parameters are supported.
??? code
```python
--8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-sampling-params"
```
The following extra parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/completion/protocol.py:completion-extra-params"
```
Chat API
Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.
We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for more information.
- Note: The `image_url.detail` parameter is not supported.
Code example: examples/basic/online_serving/openai_chat_completion_client.py
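
Code written against OpenAI often sets `image_url.detail`. Since that parameter is not supported here, one defensive option is to drop it before sending. The sketch below shows a hypothetical `strip_image_detail` helper that does so on messages in the OpenAI multimodal content format; the image URL is a placeholder. Whether the server ignores or rejects the field is not stated on this page, so treat the helper as a precaution rather than a requirement.

```python
import copy


def strip_image_detail(messages: list) -> list:
    """Return a deep copy of messages with image_url.detail removed."""
    cleaned = copy.deepcopy(messages)
    for message in cleaned:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content has no image parts
        for part in content:
            if part.get("type") == "image_url":
                part["image_url"].pop("detail", None)
    return cleaned


messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                # Placeholder URL for illustration only.
                "image_url": {"url": "https://example.com/cat.png", "detail": "high"},
            },
        ],
    }
]

cleaned = strip_image_detail(messages)
print(cleaned[0]["content"][1])
```

The original `messages` list is left untouched, so the same data can still be sent to a backend that does accept `detail`.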
Extra parameters
The following sampling parameters are supported.
??? code
```python
--8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-sampling-params"
```
The following extra parameters are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/chat_completion/protocol.py:chat-completion-extra-params"
```
Responses API
Our Responses API is compatible with OpenAI's Responses API; you can use the official OpenAI Python client to interact with it.
Code example: examples/tool_calling/openai_responses_client_with_tools.py
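
To give a feel for the request and response shapes, the sketch below builds a minimal `/v1/responses` body and extracts the generated text from a result. Both helper names are hypothetical, and `sample_response` is hand-written to follow the shape documented for OpenAI's Responses API; it was not captured from a vLLM server, so field coverage may differ in practice.

```python
def build_responses_body(user_input: str) -> dict:
    """Return a minimal /v1/responses request body."""
    return {
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "input": user_input,
    }


def collect_output_text(response: dict) -> str:
    """Concatenate every output_text part found in a Responses API result."""
    pieces = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip tool calls and other non-message items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                pieces.append(part["text"])
    return "".join(pieces)


# Hand-written sample in the OpenAI Responses shape (an assumption, see above).
sample_response = {
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Hello there!"}],
        }
    ]
}

print(build_responses_body("Hello!"))
print(collect_output_text(sample_response))
```

With the official OpenAI Python client, the equivalent call is `client.responses.create(model=..., input=...)`, and the client exposes the concatenated text as `response.output_text`.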
Extra parameters
The following extra parameters in the request object are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-extra-params"
```
The following extra parameters in the response object are supported:
??? code
```python
--8<-- "vllm/entrypoints/openai/responses/protocol.py:responses-response-extra-params"
```