[Docs] Update server entrypoint examples (#42077)

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
2026-06-06 00:16:14 +00:00 · 2026-05-09 10:03:52 +08:00
parent 236bf9d152
commit a43bc34baf
4 changed files with 5 additions and 8 deletions
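
The hunks below all apply the same substitution: the `python -m <module>` (or `python3 -m <module>`) launcher becomes `vllm serve`, and the value of `--model` moves to the first positional argument while the remaining flags keep their order. A minimal sketch of that rewrite follows, assuming commands are single shell-style strings. The helper name `to_vllm_serve` and the `LEGACY_MODULES` set are hypothetical and not part of vLLM; commands without a recognizable launcher or `--model` value are returned unchanged.

```python
import shlex

# Module paths that this commit's examples replace (taken from the removed lines).
LEGACY_MODULES = {
    "vllm.entrypoints.openai.api_server",
    "vllm.entrypoints.api_server",
}


def to_vllm_serve(command: str) -> str:
    """Rewrite a legacy `python -m <module> --model X ...` command to `vllm serve X ...`.

    Hypothetical helper for illustration only; it is not part of vLLM.
    """
    tokens = shlex.split(command)

    # Only touch commands shaped like: python|python3 -m <legacy module> [args...]
    if (
        len(tokens) < 3
        or tokens[0] not in ("python", "python3")
        or tokens[1] != "-m"
        or tokens[2] not in LEGACY_MODULES
    ):
        return command

    model = None
    rest = []
    args = tokens[3:]
    i = 0
    while i < len(args):
        if args[i] == "--model" and i + 1 < len(args):
            # `--model X` form: consume the flag and its value.
            model = args[i + 1]
            i += 2
        elif args[i].startswith("--model="):
            # `--model=X` form.
            model = args[i].split("=", 1)[1]
            i += 1
        else:
            rest.append(args[i])
            i += 1

    # Assumption: without an explicit model, leave the command as it was.
    if model is None:
        return command

    return shlex.join(["vllm", "serve", model, *rest])


if __name__ == "__main__":
    old = (
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 "
        "--model meta-llama/Llama-3.1-405B-Instruct "
        "--tensor-parallel-size 8 --pipeline-parallel-size 2"
    )
    print(to_vllm_serve(old))
```

Applied to the removed line of the Kubernetes hunk, this should reproduce the added line. Because `shlex.join` quotes shell metacharacters, placeholders such as `<model-name>` come back quoted.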
@@ -12,8 +12,7 @@ vLLM can be deployed on [RunPod](https://www.runpod.io/), a cloud GPU platform t
 SSH into your RunPod pod and launch the vLLM OpenAI-compatible server:

 ```bash
-python -m vllm.entrypoints.openai.api_server \
-    --model <model-name> \
+vllm serve <model-name> \
    --host 0.0.0.0 \
    --port 8000
 ```
@@ -79,9 +79,8 @@ Key points from the example YAML:
    - -c
    - >
      bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh leader --ray_cluster_size=2;
-      python3 -m vllm.entrypoints.openai.api_server
+      vllm serve meta-llama/Llama-3.1-405B-Instruct
        --port 8080
-        --model meta-llama/Llama-3.1-405B-Instruct
        --tensor-parallel-size 8
        --pipeline-parallel-size 2
  ```
@@ -145,7 +144,7 @@ spec:
                  - sh
                  - -c
                  - "bash /vllm-workspace/examples/ray_serving/multi-node-serving.sh leader --ray_cluster_size=2; 
-                    python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
+                    vllm serve meta-llama/Llama-3.1-405B-Instruct --port 8080 --tensor-parallel-size 8 --pipeline-parallel-size 2"
                resources:
                  limits:
                    nvidia.com/gpu: "8"
@@ -62,8 +62,7 @@ The filesystem resolver is installed with vLLM by default and enables loading Lo
 3. **Start vLLM server**:
   Your base model can be `meta-llama/Llama-2-7b-hf`. Please make sure you set up the Hugging Face token in your env var `export HF_TOKEN=xxx235`.
   ```bash
-   python -m vllm.entrypoints.openai.api_server \
-       --model your-base-model \
+   vllm serve your-base-model \
       --enable-lora
   ```

@@ -16,7 +16,7 @@ User-set flags take precedence over optimization level defaults.

 ```bash
 # CLI usage
-python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O1
+vllm serve RedHatAI/Llama-3.2-1B-FP8 -O1

 # Python API usage
 from vllm.entrypoints.llm import LLM