mirror of
https://github.com/vllm-project/vllm.git
synced 2026-06-06 00:16:14 +00:00
[Docs] Update server entrypoint examples (#42077)
Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
@@ -12,8 +12,7 @@ vLLM can be deployed on [RunPod](https://www.runpod.io/), a cloud GPU platform t
 SSH into your RunPod pod and launch the vLLM OpenAI-compatible server:
 
 ```bash
-python -m vllm.entrypoints.openai.api_server \
-    --model <model-name> \
+vllm serve <model-name> \
     --host 0.0.0.0 \
     --port 8000
 ```
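The rewrite this commit applies by hand follows one pattern: the `--model` value becomes the positional argument of `vllm serve`, and every other flag keeps its order. A minimal sketch of that pattern, assuming single-line commands (no backslash continuations) and only the `vllm.entrypoints.openai.api_server` module path; `to_vllm_serve` is a hypothetical helper for illustration, not part of vLLM:

```python
import shlex

# Module path used by the old-style launch commands in this commit's hunks.
LEGACY_MODULE = "vllm.entrypoints.openai.api_server"


def to_vllm_serve(command: str) -> str:
    """Rewrite `python -m <LEGACY_MODULE> --model X ...` as `vllm serve X ...`.

    Hypothetical helper for illustration; it is not part of vLLM.
    """
    args = shlex.split(command)
    # Expect: <python> -m <LEGACY_MODULE> [flags...]
    if len(args) < 3 or args[1] != "-m" or args[2] != LEGACY_MODULE:
        raise ValueError("not a legacy api_server launch command")

    model = None
    rest = []
    flags = iter(args[3:])
    for arg in flags:
        if arg == "--model":
            model = next(flags, None)  # value follows as a separate token
        elif arg.startswith("--model="):
            model = arg.split("=", 1)[1]
        else:
            rest.append(arg)

    if not model:
        raise ValueError("legacy command has no --model value")

    # The model becomes the positional argument; other flags keep their order.
    return shlex.join(["vllm", "serve", model, *rest])


if __name__ == "__main__":
    old = (
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 "
        "--model meta-llama/Llama-3.1-405B-Instruct "
        "--tensor-parallel-size 8 --pipeline-parallel-size 2"
    )
    print(to_vllm_serve(old))
```

Because `shlex.join` quotes any token containing shell metacharacters, a placeholder such as `<model-name>` would come back quoted; the sketch is meant for concrete model names.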
@@ -79,9 +79,8 @@ Key points from the example YAML:
   - -c
   - >
     bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh leader --ray_cluster_size=2;
-    python3 -m vllm.entrypoints.openai.api_server
+    vllm serve meta-llama/Llama-3.1-405B-Instruct
     --port 8080
-    --model meta-llama/Llama-3.1-405B-Instruct
     --tensor-parallel-size 8
     --pipeline-parallel-size 2
 ```
@@ -145,7 +144,7 @@ spec:
   - sh
   - -c
   - "bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh leader --ray_cluster_size=2;
-    python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
+    vllm serve meta-llama/Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline-parallel-size 2"
 resources:
   limits:
     nvidia.com/gpu: "8"
@@ -62,8 +62,7 @@ The filesystem resolver is installed with vLLM by default and enables loading Lo
 3. **Start vLLM server**:
 Your base model can be `meta-llama/Llama-2-7b-hf`. Please make sure you set up the Hugging Face token in your env var `export HF_TOKEN=xxx235`.
 ```bash
-python -m vllm.entrypoints.openai.api_server \
-    --model your-base-model \
+vllm serve your-base-model \
     --enable-lora
 ```
 
@@ -16,7 +16,7 @@ User-set flags take precedence over optimization level defaults.
 
 ```bash
 # CLI usage
-python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O1
+vllm serve RedHatAI/Llama-3.2-1B-FP8 -O1
 
 # Python API usage
 from vllm.entrypoints.llm import LLM