[Docs] Update server entrypoint examples (#42077)

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
Ethan Feng
2026-05-09 10:03:52 +08:00
committed by GitHub
parent 236bf9d152
commit a43bc34baf
4 changed files with 5 additions and 8 deletions
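The same rewrite is applied in every file below: the `--model` value becomes the positional argument of `vllm serve`, and the remaining flags are kept as they were. A minimal sketch of that mapping follows. `to_vllm_serve` is a hypothetical helper written for illustration, not part of vLLM, and it assumes `--model` is the only flag that needs to move, which holds for the examples in this commit.

```shell
# Hypothetical helper, not part of vLLM: print the `vllm serve` form of a
# legacy api_server argument list. --model becomes the positional model
# argument; every other flag is passed through unchanged.
# Output is for display only: arguments containing spaces are not re-quoted.
to_vllm_serve() {
  _model=""
  _rest=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --model)
        # The flag needs a value; fail instead of looping forever.
        [ "$#" -ge 2 ] || { echo "error: --model needs a value" >&2; return 1; }
        _model="$2"
        shift 2
        ;;
      --model=*)
        _model="${1#--model=}"
        shift
        ;;
      *)
        # Each kept flag is stored with a leading space.
        _rest="$_rest $1"
        shift
        ;;
    esac
  done
  [ -n "$_model" ] || { echo "error: no --model argument found" >&2; return 1; }
  echo "vllm serve ${_model}${_rest}"
}

to_vllm_serve --model meta-llama/Llama-3.1-405B-Instruct --port 8080 \
  --tensor-parallel-size 8 --pipeline-parallel-size 2
# prints: vllm serve meta-llama/Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline-parallel-size 2
```

The helper only prints the rewritten command, so the result can be reviewed before it is pasted into a doc or manifest.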
+1 -2
@@ -12,8 +12,7 @@ vLLM can be deployed on [RunPod](https://www.runpod.io/), a cloud GPU platform t
 SSH into your RunPod pod and launch the vLLM OpenAI-compatible server:
 ```bash
-python -m vllm.entrypoints.openai.api_server \
---model <model-name> \
+vllm serve <model-name> \
 --host 0.0.0.0 \
 --port 8000
 ```
+2 -3
@@ -79,9 +79,8 @@ Key points from the example YAML:
 - -c
 - >
 bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh leader --ray_cluster_size=2;
-python3 -m vllm.entrypoints.openai.api_server
+vllm serve meta-llama/Llama-3.1-405B-Instruct
 --port 8080
---model meta-llama/Llama-3.1-405B-Instruct
 --tensor-parallel-size 8
 --pipeline-parallel-size 2
 ```
@@ -145,7 +144,7 @@ spec:
 - sh
 - -c
 - "bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh leader --ray_cluster_size=2;
-python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
+vllm serve meta-llama/Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline-parallel-size 2"
 resources:
 limits:
 nvidia.com/gpu: "8"
+1 -2
@@ -62,8 +62,7 @@ The filesystem resolver is installed with vLLM by default and enables loading Lo
 3. **Start vLLM server**:
 Your base model can be `meta-llama/Llama-2-7b-hf`. Please make sure you set up the Hugging Face token in your env var `export HF_TOKEN=xxx235`.
 ```bash
-python -m vllm.entrypoints.openai.api_server \
---model your-base-model \
+vllm serve your-base-model \
 --enable-lora
 ```
+1 -1
@@ -16,7 +16,7 @@ User-set flags take precedence over optimization level defaults.
 ```bash
 # CLI usage
-python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O1
+vllm serve RedHatAI/Llama-3.2-1B-FP8 -O1
 # Python API usage
 from vllm.entrypoints.llm import LLM