[None][doc] Update gpt-oss deployment guide to latest release image (#7101)
Signed-off-by: Farshad Ghodsian <47931571+farshadghodsian@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
parent ba0a86e0bb
commit 2d40e8750b
@@ -18,10 +18,9 @@ TensorRT-LLM
<div align="left">

## Tech Blogs

* [08/06] Running a High Performance GPT-OSS-120B Inference Server with TensorRT-LLM
* [08/05] Running a High-Performance GPT-OSS-120B Inference Server with TensorRT-LLM
✨ [➡️ link](./docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md)

* [08/01] Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization)
✨ [➡️ link](./docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)
@@ -44,6 +43,7 @@ TensorRT-LLM
✨ [➡️ link](./docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md)

## Latest News

* [08/05] 🌟 TensorRT-LLM delivers Day-0 support for OpenAI's latest open-weights models: GPT-OSS-120B [➡️ link](https://huggingface.co/openai/gpt-oss-120b) and GPT-OSS-20B [➡️ link](https://huggingface.co/openai/gpt-oss-20b)
* [07/15] 🌟 TensorRT-LLM delivers Day-0 support for LG AI Research's latest model, EXAONE 4.0 [➡️ link](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B)
* [06/17] Join NVIDIA and DeepInfra for a developer meetup on June 26 ✨ [➡️ link](https://events.nvidia.com/scaletheunscalablenextgenai)
* [05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick
@@ -19,11 +19,11 @@ We have a forthcoming guide for achieving great performance on H100; however, th

In this section, we introduce several ways to install TensorRT-LLM.

### NGC Docker Image of dev branch
### NGC Docker Image

Day-0 support for gpt-oss is provided via the NGC container image `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`. This image was built on top of the pre-day-0 **dev branch**. This container is multi-platform and will run on both x64 and arm64 architectures.
Visit the [NGC TensorRT-LLM Release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) to find the most up-to-date NGC container image to use. You can also check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status of the latest releases.
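For example, you can pull a specific release tag before starting the container; the tag below is the one used in the rest of this guide, so substitute whichever tag the NGC page lists as the latest:

```bash
# Pull the TensorRT-LLM release image from NGC (replace the tag with the
# latest one listed on the NGC release page).
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
```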
Run the following docker command to start the TensorRT-LLM container in interactive mode:
Run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match the latest release):
```bash
docker run --rm --ipc=host -it \
@@ -33,7 +33,7 @@ docker run --rm --ipc=host -it \
-p 8000:8000 \
-e TRTLLM_ENABLE_PDL=1 \
-v ~/.cache:/root/.cache:rw \
nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev \
nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
/bin/bash
```
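Once the container is up, a quick sanity check from the interactive shell confirms that the GPUs are visible and the serving CLI is available; a minimal sketch, assuming the GPUs were passed through to the container:

```bash
# Inside the container: verify GPU visibility and that the TensorRT-LLM
# serving CLI is on the PATH.
nvidia-smi
trtllm-serve --help
```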
@@ -53,9 +53,9 @@ Additionally, the container mounts your user `.cache` directory to save the down

Support for gpt-oss has been [merged](https://github.com/NVIDIA/TensorRT-LLM/pull/6645) into the **main branch** of TensorRT-LLM. As we continue to optimize gpt-oss performance, you can build TensorRT-LLM from source to get the latest features and support. Please refer to the [doc](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) if you want to build from source yourself.
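For reference, a source build generally starts from a clone of the repository and the documented Docker build target; the snippet below is a sketch under those assumptions, and the linked build-from-source doc covers the exact prerequisites and target names.

```bash
# Clone TensorRT-LLM with submodules and LFS objects.
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull

# Build a release container image from source (target name per the
# build-from-source guide; check the doc if it has changed).
make -C docker release_build
```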
### Regular Release of TensorRT-LLM
### TensorRT-LLM Python Wheel Install

Since gpt-oss is supported on the main branch, upcoming regular releases of TensorRT-LLM will support it out of the box. Please check the latest [release notes](https://github.com/NVIDIA/TensorRT-LLM/releases) to keep track of the support status. Releases are provided as an [NGC Container Image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) or as a [pip Python wheel](https://pypi.org/project/tensorrt-llm/#history). You can find pip install instructions [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
Regular releases of TensorRT-LLM are also provided as [Python wheels](https://pypi.org/project/tensorrt-llm/#history). You can find instructions on the pip install [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
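If you take the wheel route, the install itself is a standard pip command; the snippet below is only a sketch, and the linked installation guide covers the exact package index and CUDA prerequisites.

```bash
# Install the latest TensorRT-LLM wheel. Some releases are published to
# NVIDIA's extra index (https://pypi.nvidia.com); follow the linked Linux
# installation guide if the plain PyPI install does not resolve.
pip3 install --upgrade tensorrt-llm

# Sanity check: import the package and print its version.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```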
## Performance Benchmarking and Model Serving
@@ -210,7 +210,10 @@ We can use `trtllm-serve` to serve the model by translating the benchmark comman

```bash
trtllm-serve \
gpt-oss-120b \ # Or ${local_model_path}
Note: You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).

trtllm-serve \
openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
@@ -228,7 +231,8 @@ For max-throughput configuration, run:

```bash
trtllm-serve \
gpt-oss-120b \ # Or ${local_model_path}
trtllm-serve \
openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
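Once the server is up with either configuration, you can wait for it to report ready before sending requests; the sketch below assumes the OpenAI-compatible frontend exposes `/health` and `/v1/models` on port 8000, so adjust the paths if your build differs.

```bash
# Poll the server until it reports healthy, then list the served models.
# The /health and /v1/models routes are assumptions based on the
# OpenAI-compatible trtllm-serve frontend; adjust if your version differs.
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 5
done
curl -s http://localhost:8000/v1/models
```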
@@ -262,7 +266,7 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
    "messages": [
      {
        "role": "user",
        "content": "What is NVIDIA's advantage for inference?"
        "content": "What is NVIDIAs advantage for inference?"
      }
    ],
    "max_tokens": 1024,
@@ -348,12 +352,7 @@ others according to your needs.

## (H200/H100 Only) Using OpenAI Triton Kernels for MoE

OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels on Hopper-based GPUs such as NVIDIA's H200 for optimal performance. The `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still in progress. Please enable the `TRITON` backend with the steps below if you are running on Hopper GPUs.

### Installing OpenAI Triton

The `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev` image ships with Triton already prepared (`echo $TRITON_ROOT` reveals the path). In other situations, you will need to build and install a specific version of Triton; please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe).

OpenAI ships a set of Triton kernels optimized for its MoE models. TensorRT-LLM can leverage these kernels on Hopper-based GPUs such as NVIDIA's H200 for optimal performance. The `TRTLLM` MoE backend is not supported on Hopper, and `CUTLASS` backend support is still in progress. Please follow the instructions in this [link](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt_oss#using-openai-triton-kernels-for-moe) to install and enable the `TRITON` MoE kernels on Hopper GPUs.
### Selecting Triton as the MoE backend
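As a rough sketch of what this selection looks like, the MoE backend is chosen through the extra LLM API options YAML passed to `trtllm-serve`; the `moe_config.backend` key and the serve flags below are assumptions modeled on the TensorRT-LLM LLM API options, so verify them against the linked gpt-oss example for the version you are running.

```bash
# Write an extra LLM API options file that selects the Triton MoE backend.
# The moe_config/backend key is an assumption; confirm it against the
# gpt-oss example README for your TensorRT-LLM version.
cat > extra_llm_api_options.yaml <<'EOF'
moe_config:
  backend: TRITON
EOF

# Pass the file to trtllm-serve alongside the usual serving options.
trtllm-serve \
  openai/gpt-oss-120b \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --extra_llm_api_options extra_llm_api_options.yaml
```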