chores: merge examples for v1.0 doc (#5736)

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

Parent: 5ab1cf5ae6
Commit: e277766f0d
@@ -61,22 +61,23 @@ TensorRT-LLM
* [02/12] 🌟 How Scaling Laws Drive Smarter, More Powerful AI
[➡️ link](https://blogs.nvidia.com/blog/ai-scaling-laws/?ncid=so-link-889273&linkId=100000338837832)

-* [01/25] Nvidia moves AI focus to inference cost, efficiency [➡️ link](https://www.fierceelectronics.com/ai/nvidia-moves-ai-focus-inference-cost-efficiency?linkId=100000332985606)
-
-* [01/24] 🏎️ Optimize AI Inference Performance with NVIDIA Full-Stack Solutions [➡️ link](https://developer.nvidia.com/blog/optimize-ai-inference-performance-with-nvidia-full-stack-solutions/?ncid=so-twit-400810&linkId=100000332621049)
-
-* [01/23] 🚀 Fast, Low-Cost Inference Offers Key to Profitable AI [➡️ link](https://blogs.nvidia.com/blog/ai-inference-platform/?ncid=so-twit-693236-vt04&linkId=100000332307804)
-
-* [01/16] Introducing New KV Cache Reuse Optimizations in TensorRT-LLM [➡️ link](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/?ncid=so-twit-363876&linkId=100000330323229)
-
-* [01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT-LLM [➡️ link](https://blogs.bing.com/search-quality-insights/December-2024/Bing-s-Transition-to-LLM-SLM-Models-Optimizing-Search-with-TensorRT-LLM)
-
-* [01/04] ⚡Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding
-[➡️ link](https://developer.nvidia.com/blog/boost-llama-3-3-70b-inference-throughput-3x-with-nvidia-tensorrt-llm-speculative-decoding/)

<details close>
<summary>Previous News</summary>

+* [2025/01/25] Nvidia moves AI focus to inference cost, efficiency [➡️ link](https://www.fierceelectronics.com/ai/nvidia-moves-ai-focus-inference-cost-efficiency?linkId=100000332985606)
+
+* [2025/01/24] 🏎️ Optimize AI Inference Performance with NVIDIA Full-Stack Solutions [➡️ link](https://developer.nvidia.com/blog/optimize-ai-inference-performance-with-nvidia-full-stack-solutions/?ncid=so-twit-400810&linkId=100000332621049)
+
+* [2025/01/23] 🚀 Fast, Low-Cost Inference Offers Key to Profitable AI [➡️ link](https://blogs.nvidia.com/blog/ai-inference-platform/?ncid=so-twit-693236-vt04&linkId=100000332307804)
+
+* [2025/01/16] Introducing New KV Cache Reuse Optimizations in TensorRT-LLM [➡️ link](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/?ncid=so-twit-363876&linkId=100000330323229)
+
+* [2025/01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT-LLM [➡️ link](https://blogs.bing.com/search-quality-insights/December-2024/Bing-s-Transition-to-LLM-SLM-Models-Optimizing-Search-with-TensorRT-LLM)
+
+* [2025/01/04] ⚡Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding
+[➡️ link](https://developer.nvidia.com/blog/boost-llama-3-3-70b-inference-throughput-3x-with-nvidia-tensorrt-llm-speculative-decoding/)

* [2024/12/10] ⚡ Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview
[➡️ link](https://build.nvidia.com/meta/llama-3_3-70b-instruct)
@@ -204,11 +205,9 @@ Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co

TensorRT-LLM is an open-sourced library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.

-Recently [re-architected with a **PyTorch backend**](https://nvidia.github.io/TensorRT-LLM/torch.html), TensorRT-LLM now combines peak performance with a more flexible and developer-friendly workflow. The original [TensorRT](https://developer.nvidia.com/tensorrt)-based backend remains supported and continues to provide an ahead-of-time compilation path for building highly optimized "[Engines](https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html#ecosystem)" for deployment. The PyTorch backend complements this by enabling faster development iteration and rapid experimentation.
+[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT-LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).

-TensorRT-LLM provides a flexible [**LLM API**](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) to simplify model setup and inference across both PyTorch and TensorRT backends. It supports a wide range of inference use cases from a single GPU to multiple nodes with multiple GPUs using [Tensor Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#tensor-parallelism) and/or [Pipeline Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#pipeline-parallelism). It also includes a [backend](https://github.com/triton-inference-server/tensorrtllm_backend) for integration with the [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server).
-Several popular models are pre-defined and can be easily customized or extended using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py) (for the PyTorch backend) or a [PyTorch-style Python API](./tensorrt_llm/models/llama/model.py) (for the TensorRT backend).
+TensorRT-LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.

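The LLM API workflow summarized above reduces to a short Python pattern. The sketch below simply mirrors the quickstart example that appears later in this commit (TinyLlama is only the illustrative checkpoint); treat it as orientation for the reader, not as part of the README change itself.

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    # Limit generation length; SamplingParams exposes the other sampling knobs.
    sampling_params = SamplingParams(max_tokens=32)

    # Any Hugging Face model ID or local checkpoint path works here.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    for output in llm.generate(prompts, sampling_params):
        print(output.prompt, "->", output.outputs[0].text)

if __name__ == "__main__":
    main()
```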
## Getting Started

@@ -110,10 +110,10 @@ The MTP module follows the design in DeepSeek-V3. The embedding layer and output

Attention is also a very important component in supporting MTP inference. The changes are mainly in the attention kernels for the generation phase. For a normal request there is only one input token in the generation phase, but for MTP there are $K+1$ input tokens. Since MTP sequentially predicts additional tokens, the predicted draft tokens are chained. Although there is an MTP Eagle path, only chain-based support is currently available for MTP Eagle, so a causal mask is enough for the attention kernel to support MTP. In our implementation, TensorRT-LLM uses the FP8 flashMLA generation kernel on Hopper GPUs, and TRT-LLM customized attention kernels on Blackwell for better performance.

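As a rough, editorial illustration of why a plain causal mask suffices for the chained draft tokens (this is toy Python, not TensorRT-LLM kernel code; the function name and shapes are made up for the example):

```python
import numpy as np

def mtp_generation_mask(past_len: int, k_plus_1: int) -> np.ndarray:
    """Boolean attention mask for one MTP generation step.

    Each of the K+1 chained input tokens may attend to the entire KV cache
    (past_len tokens) plus the chained tokens that precede it; that is,
    a plain causal mask restricted to the new chunk.
    """
    mask = np.zeros((k_plus_1, past_len + k_plus_1), dtype=bool)
    for i in range(k_plus_1):
        mask[i, : past_len + i + 1] = True
    return mask

# K = 2 draft tokens -> 3 input tokens per generation step.
print(mtp_generation_mask(past_len=4, k_plus_1=3).astype(int))
```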
### How to run DeepSeek models with MTP

-Run DeepSeek-V3/R1 models with MTP, use [examples/pytorch/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/pytorch/quickstart_advanced.py) with additional options:
+To run DeepSeek-V3/R1 models with MTP, use [examples/llm-api/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/quickstart_advanced.py) with additional options:

```bash
-cd examples/pytorch
+cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N
```

@@ -165,10 +165,10 @@ Note that the Relaxed Acceptance will only be used during the thinking phase, wh

### How to run the DeepSeek-R1 model with Relaxed Acceptance

-Run DeepSeek-R1 models with MTP Relaxed Acceptance, use [examples/pytorch/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/pytorch/quickstart_advanced.py) with additional options:
+To run DeepSeek-R1 models with MTP Relaxed Acceptance, use [examples/llm-api/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/quickstart_advanced.py) with additional options:

```bash
-cd examples/pytorch
+cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N --use_relaxed_acceptance_for_thinking --relaxed_topk 10 --relaxed_delta 0.6
```

@@ -59,7 +59,10 @@ LLMAPI_SECTIONS = ["Basics", "Customization", "Slurm"]

def generate_examples():
    root_dir = Path(__file__).parent.parent.parent.resolve()
-    ignore_list = {'__init__.py', 'quickstart_example.py'}
+    ignore_list = {
+        '__init__.py', 'quickstart_example.py', 'quickstart_advanced.py',
+        'quickstart_multimodal.py', 'star_attention.py'
+    }
    doc_dir = root_dir / "docs/source/examples"

    def collect_script_paths(examples_subdir: str) -> list[Path]:

@@ -2,28 +2,11 @@

The LLM API is a high-level Python API designed to streamline LLM inference workflows.

-It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo).

+While the LLM API simplifies inference workflows with a high-level interface, it is also designed with flexibility in mind. Under the hood, it uses a PyTorch-native and modular backend, making it easy to customize, extend, or experiment with the runtime.

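As a hedged sketch of what the multi-GPU claim looks like in user code, assuming the LLM constructor's `tensor_parallel_size` argument and a 2-GPU node (the Llama checkpoint is only an example):

```python
from tensorrt_llm import LLM, SamplingParams

# Same single-process API; parallelism is requested at construction time.
# Assumes the tensor_parallel_size argument and two visible GPUs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

for out in llm.generate(["The capital of France is"], SamplingParams(max_tokens=16)):
    print(out.outputs[0].text)
```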
-## Supported Models

-* DeepSeek variants
-* Llama (including variants Mistral, Mixtral, InternLM)
-* GPT (including variants Starcoder-1/2, Santacoder)
-* Gemma-1/2/3
-* Phi-1/2/3/4
-* ChatGLM (including variants glm-10b, chatglm, chatglm2, chatglm3, glm4)
-* QWen-1/1.5/2/3
-* Falcon
-* Baichuan-1/2
-* GPT-J
-* Mamba-1/2

+> **Note:** For the most up-to-date list of supported models, you may refer to the [TensorRT-LLM model definitions](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/_torch/models).

## Quick Start Example

A simple inference example with TinyLlama using the LLM API:

@@ -31,7 +14,8 @@ A simple inference example with TinyLlama using the LLM API:
:language: python
:linenos:
```

-More examples can be found [here]().
+For more advanced usage including distributed inference, multimodal, and speculative decoding, please refer to this [README](../../../examples/llm-api/README.md).

## Model Input

@@ -65,7 +49,6 @@ llm = LLM(model=<local_path_to_model>)

> **Note:** Some models require accepting specific [license agreements](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). Make sure you have agreed to the terms and authenticated with Hugging Face before downloading.

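The authentication step mentioned in the note is typically a one-time call through the Hugging Face Hub client before the first download. A minimal sketch, assuming the `huggingface_hub` package and an access token created on huggingface.co (neither is part of this doc change):

```python
from huggingface_hub import login

# Paste a read-scoped token from https://huggingface.co/settings/tokens,
# or set the HF_TOKEN environment variable instead of calling login() explicitly.
login(token="hf_...")  # placeholder token, replace with your own
```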
## Tips and Troubleshooting

The following tips typically assist new LLM API users who are familiar with other APIs that are part of TensorRT-LLM:

@@ -196,8 +196,8 @@ if __name__ == '__main__':
    main()
```

-We provide an out-of-tree modeling example in `examples/pytorch/out_of_tree_example`. The model is implemented in `modeling_opt.py` and you can run the example by:
+We provide an out-of-tree modeling example in `examples/llm-api/out_of_tree_example`. The model is implemented in `modeling_opt.py` and you can run the example by:

```bash
-python examples/pytorch/out_of_tree_example/main.py
+python examples/llm-api/out_of_tree_example/main.py
```

@@ -1,3 +1,57 @@
# LLM API Examples

-Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), [examples](https://nvidia.github.io/TensorRT-LLM/latest/examples/llm_api_examples.html) and [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html) for detailed information and usage guidelines regarding the LLM API.
+Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/) including [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html) for detailed information and usage guidelines regarding the LLM API.

## Run the advanced usage example script:

```bash
# FP8 + TP=2
python3 quickstart_advanced.py --model_dir nvidia/Llama-3.1-8B-Instruct-FP8 --tp_size 2

# FP8 (e4m3) kvcache
python3 quickstart_advanced.py --model_dir nvidia/Llama-3.1-8B-Instruct-FP8 --kv_cache_dtype fp8

# BF16 + TP=8
python3 quickstart_advanced.py --model_dir nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 --tp_size 8

# Nemotron-H requires disabling cache reuse in kv cache
python3 quickstart_advanced.py --model_dir nvidia/Nemotron-H-8B-Base-8K --disable_kv_cache_reuse --max_batch_size 8
```

## Run the multimodal example script:

```bash
# default inputs
python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --modality image [--use_cuda_graph]

# user inputs
# supported modes:
# (1) N prompt, N media (N requests are in-flight batched)
# (2) 1 prompt, N media
# Note: media should be either image or video. Mixing image and video is not supported.
python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --modality video --prompt "Tell me what you see in the video briefly." "Describe the scene in the video briefly." --media "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/OAI-sora-tokyo-walk.mp4" "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/world.mp4" --max_tokens 128 [--use_cuda_graph]
```

## Run the speculative decoding script:

```bash
# NGram drafter
python3 quickstart_advanced.py \
    --model_dir meta-llama/Llama-3.1-8B-Instruct \
    --spec_decode_algo NGRAM \
    --max_matching_ngram_size=2 \
    --spec_decode_nextn=4 \
    --disable_overlap_scheduler
```

```bash
# Draft Target
python3 quickstart_advanced.py \
    --model_dir meta-llama/Llama-3.1-8B-Instruct \
    --spec_decode_algo draft_target \
    --spec_decode_nextn 5 \
    --draft_model_dir meta-llama/Llama-3.2-1B-Instruct \
    --disable_overlap_scheduler \
    --disable_kv_cache_reuse
```

@@ -9,7 +9,6 @@ def main():
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -9,7 +9,6 @@ def main():
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -30,7 +30,6 @@ def main():
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -48,7 +47,6 @@ def main():

    # Got output like
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'

@@ -11,7 +11,6 @@ def run_medusa_decoding(use_modelopt_ckpt=False, model_dir=None):
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -57,7 +57,6 @@ def main():
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -73,7 +72,6 @@ def main():

    # Got output like
    # Prompt: 'Hello, my name is', Generated text: 'Jane Smith. I am a resident of the city. Can you tell me more about the public services provided in the area?'
    # Prompt: 'The president of the United States is', Generated text: 'considered the head of state, and the vice president of the United States is considered the head of state. President and Vice President of the United States (US)'
    # Prompt: 'The capital of France is', Generated text: 'located in Paris, France. The population of Paris, France, is estimated to be 2 million. France is home to many famous artists, including Picasso'
    # Prompt: 'The future of AI is', Generated text: 'an open and collaborative project. The project is an ongoing effort, and we invite participation from members of the community.\n\nOur community is'

@@ -13,7 +13,6 @@ def main():
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -28,7 +27,6 @@ def main():

    # Got output like
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'

@@ -13,7 +13,6 @@ def main():
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -36,7 +35,6 @@ def main():

    # Got output like follows:
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'

@@ -14,7 +14,6 @@ def main():
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -21,7 +21,6 @@ def main():
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -36,7 +35,6 @@ def main():

    # Got output like
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'

@@ -36,7 +36,7 @@
# the LOCAL_MODEL directory.

# Adjust the paths to run
-export script=$SOURCE_ROOT/examples/pytorch/quickstart_advanced.py
+export script=$SOURCE_ROOT/examples/llm-api/quickstart_advanced.py

# Just launch the PyTorch example with trtllm-llmapi-launch command.
srun -l \

@@ -6,7 +6,6 @@ from tensorrt_llm import LLM

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

@@ -8,7 +8,6 @@ from tensorrt_llm.llmapi import (CudaGraphConfig, DraftTargetDecodingConfig,

example_prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

@@ -6,12 +6,12 @@ def main():

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Alternatively, use "nvidia/Llama-3.1-8B-Instruct-FP8" to enable FP8 inference.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    outputs = llm.generate(prompts, sampling_params)

@@ -39,7 +39,7 @@ git clone https://huggingface.co/naver-hyperclovax/$MODEL_NAME hf_models/$MODEL_
```

### HyperCLOVAX-SEED-Vision
-Download the HuggingFace checkpoints of the HyperCLOVAX-SEED-Vision model. We support the HyperCLOVAX-SEED-Vision model in [PyTorch flow](../../../pytorch).
+Download the HuggingFace checkpoints of the HyperCLOVAX-SEED-Vision model. We support the HyperCLOVAX-SEED-Vision model in [PyTorch flow](../../../llm-api).

```bash
export MODEL_NAME=HyperCLOVAX-SEED-Vision-Instruct-3B

@@ -49,12 +49,12 @@ git clone https://huggingface.co/naver-hyperclovax/$MODEL_NAME hf_models/$MODEL_
## PyTorch flow

### LLM
-To quickly run HyperCLOVAX-SEED-Text, you can use [examples/pytorch/quickstart_advanced.py](../../../pytorch/quickstart_advanced.py):
+To quickly run HyperCLOVAX-SEED-Text, you can use [examples/llm-api/quickstart_advanced.py](../../../llm-api/quickstart_advanced.py):

```bash
pip install -r requirements.txt

-python ../../../pytorch/quickstart_advanced.py --model_dir hf_models/$MODEL_NAME
+python ../../../llm-api/quickstart_advanced.py --model_dir hf_models/$MODEL_NAME
```

The output will be like:

@@ -66,12 +66,12 @@ The output will be like:
```

### Multimodal
-To quickly run HyperCLOVAX-SEED-Vision, you can use [examples/pytorch/quickstart_multimodal.py](../../../pytorch/quickstart_multimodal.py):
+To quickly run HyperCLOVAX-SEED-Vision, you can use [examples/llm-api/quickstart_multimodal.py](../../../llm-api/quickstart_multimodal.py):

```bash
pip install -r requirements.txt

-python ../../../pytorch/quickstart_multimodal.py --model_dir hf_models/$MODEL_NAME
+python ../../../llm-api/quickstart_multimodal.py --model_dir hf_models/$MODEL_NAME
```

The output will be like:

@@ -81,7 +81,7 @@ The output will be like:
[2] Prompt: 'Describe the traffic condition on the road in the image.', Generated text: '이미지 속 도로의 교통 상태는 비교적 원활해 보입니다. 여러 차선이 있고, 차선마다 차량들이 일정한 간격을 유지하며 주행하고 있습니다. 도로의 왼쪽 차선에는 여러 대의 차량이 있고, 오른쪽 차선에도 몇 대의 차량이 보입니다. 도로의 중앙에는 파란'
```

-For more information, you can refer to [examples/pytorch](../../../pytorch).
+For more information, you can refer to [examples/llm-api](../../../llm-api).

## TRT flow
The next section describes how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format. We will use llama's [convert_checkpoint.py](../../core/llama/convert_checkpoint.py) for the HyperCLOVAX model and then build the model with `trtllm-build`.

@@ -77,10 +77,10 @@ git clone https://huggingface.co/deepseek-ai/DeepSeek-V3 <YOUR_MODEL_DIR>
## Quick Start

### Run a single inference
-To quickly run DeepSeek-V3, [examples/pytorch/quickstart_advanced.py](../pytorch/quickstart_advanced.py):
+To quickly run DeepSeek-V3, use [examples/llm-api/quickstart_advanced.py](../pytorch/quickstart_advanced.py):

```bash
-cd examples/pytorch
+cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8
```

@@ -94,9 +94,9 @@ Prompt: 'The future of AI is', Generated text: ' a topic of great interest and s
```

### Multi-Token Prediction (MTP)
-To run with MTP, use [examples/pytorch/quickstart_advanced.py](../pytorch/quickstart_advanced.py) with additional options, see
+To run with MTP, use [examples/llm-api/quickstart_advanced.py](../pytorch/quickstart_advanced.py) with additional options:
```bash
-cd examples/pytorch
+cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N
```

@@ -123,7 +123,7 @@ When verifying and receiving draft tokens, there are two ways:
Here is an example. We allow the first 15 (`--relaxed_topk 15`) tokens to be used as the initial candidate set, and use delta (`--relaxed_delta 0.5`) to filter out tokens with a large probability gap, which may be semantically different from the top-1 token.

```bash
-cd examples/pytorch
+cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N --use_relaxed_acceptance_for_thinking --relaxed_topk 15 --relaxed_delta 0.5
```

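To make the interaction of `--relaxed_topk` and `--relaxed_delta` concrete, here is a rough Python sketch of the acceptance rule as described above. It is an editorial illustration with made-up probabilities, and it assumes the delta is applied to log-probabilities; it is not the actual TensorRT-LLM implementation.

```python
import math

def relaxed_candidate_set(token_logprobs: dict[str, float],
                          relaxed_topk: int, relaxed_delta: float) -> set[str]:
    """Top-k tokens whose log-prob is within relaxed_delta of the top-1 token."""
    ranked = sorted(token_logprobs.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[:relaxed_topk]
    best = top[0][1]
    return {tok for tok, lp in top if best - lp <= relaxed_delta}

# A draft token would be accepted during the thinking phase if it falls in this set.
probs = {"Hmm": 0.42, "Okay": 0.30, "So": 0.15, "The": 0.08, "We": 0.05}
logprobs = {t: math.log(p) for t, p in probs.items()}
print(relaxed_candidate_set(logprobs, relaxed_topk=3, relaxed_delta=0.5))
```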
@@ -794,7 +794,7 @@ Chunked Prefill is supported for MLA only on SM100 currently. You should add `--
More specifically, we can imitate what we did in the [Quick Start](#quick-start):

```bash
-cd examples/pytorch
+cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --enable_chunked_prefill
```

@@ -624,10 +624,10 @@ git clone https://huggingface.co/Qwen/Qwen3-30B-A3B <YOUR_MODEL_DIR>

#### Run a single inference

-To quickly run Qwen3, [examples/pytorch/quickstart_advanced.py](../../../pytorch/quickstart_advanced.py):
+To quickly run Qwen3, use [examples/llm-api/quickstart_advanced.py](../../../pytorch/quickstart_advanced.py):

```bash
-python3 examples/pytorch/quickstart_advanced.py --model_dir Qwen3-30B-A3B/ --kv_cache_fraction 0.6
+python3 examples/llm-api/quickstart_advanced.py --model_dir Qwen3-30B-A3B/ --kv_cache_fraction 0.6
```

### Evaluation

@@ -89,7 +89,7 @@ python examples/summarize.py \

### V2 workflow

```bash
-python3 examples/pytorch/quickstart_advanced.py \
+python3 examples/llm-api/quickstart_advanced.py \
    --max_matching_ngram_size=2 \
    --spec_decode_nextn=4
```

@@ -1,97 +0,0 @@
# TRT-LLM with PyTorch

## Run the quick start script:

```bash
python3 quickstart.py
```

## Run the advanced usage example script:

```bash
# BF16
python3 quickstart_advanced.py --model_dir meta-llama/Llama-3.1-8B-Instruct

# FP8
python3 quickstart_advanced.py --model_dir nvidia/Llama-3.1-8B-Instruct-FP8

# BF16 + TP=2
python3 quickstart_advanced.py --model_dir meta-llama/Llama-3.1-8B-Instruct --tp_size 2

# FP8 + TP=2
python3 quickstart_advanced.py --model_dir nvidia/Llama-3.1-8B-Instruct-FP8 --tp_size 2

# FP8(e4m3) kvcache
python3 quickstart_advanced.py --model_dir nvidia/Llama-3.1-8B-Instruct-FP8 --kv_cache_dtype fp8

# BF16 + TP=8
python3 quickstart_advanced.py --model_dir nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 --tp_size 8

# Nemotron-H requires disabling cache reuse in kv cache
python3 quickstart_advanced.py --model_dir nvidia/Nemotron-H-8B-Base-8K --disable_kv_cache_reuse --max_batch_size 8
```

## Run the multimodal example script:

```bash
# default inputs
python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --modality image [--use_cuda_graph]

# user inputs
# supported modes:
# (1) N prompt, N media (N requests are in-flight batched)
# (2) 1 prompt, N media
# Note: media should be either image or video. Mixing image and video is not supported.
python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --modality video --prompt "Tell me what you see in the video briefly." "Describe the scene in the video briefly." --media "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/OAI-sora-tokyo-walk.mp4" "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/world.mp4" --max_tokens 128 [--use_cuda_graph]
```

### Supported Models
| Architecture | Model | HuggingFace Example | Modality |
| :----------------------------------: | :----------------------------------------------------------- | :----------------------------------------------------------- | :------: |
| `BertForSequenceClassification` | BERT-based | `textattack/bert-base-uncased-yelp-polarity` | L |
| `DeepseekV3ForCausalLM` | DeepSeek-V3 | `deepseek-ai/DeepSeek-V3` | L |
| `Gemma3ForCausalLM` | Gemma3 | `google/gemma-3-1b-it` | L |
| `HCXVisionForCausalLM` | HyperCLOVAX-SEED-Vision | `naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B` | L + V |
| `LlavaLlamaModel` | VILA | `Efficient-Large-Model/NVILA-8B` | L + V |
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT | `llava-hf/llava-v1.6-mistral-7b-hf` | L + V |
| `LlamaForCausalLM` | Llama 3 <br> Llama 3.1 <br> Llama 2 <br> LLaMA | `meta-llama/Meta-Llama-3.1-70B` | L |
| `Llama4ForConditionalGeneration` | Llama 4 Scout <br> Llama 4 Maverick | `meta-llama/Llama-4-Scout-17B-16E-Instruct` <br> `meta-llama/Llama-4-Maverick-17B-128E-Instruct` | L + V |
| `MistralForCausalLM` | Mistral | `mistralai/Mistral-7B-v0.1` | L |
| `MixtralForCausalLM` | Mixtral | `mistralai/Mixtral-8x7B-v0.1` | L |
| `MllamaForConditionalGeneration` | Llama 3.2 | `meta-llama/Llama-3.2-11B-Vision` | L |
| `NemotronForCausalLM` | Nemotron-3 <br> Nemotron-4 <br> Minitron | `nvidia/Minitron-8B-Base` | L |
| `NemotronHForCausalLM` | Nemotron-H | `nvidia/Nemotron-H-8B-Base-8K` <br> `nvidia/Nemotron-H-47B-Base-8K` <br> `nvidia/Nemotron-H-56B-Base-8K` | L |
| `NemotronNASForCausalLM` | LLamaNemotron <br> LlamaNemotron Super <br> LlamaNemotron Ultra | `nvidia/Llama-3_1-Nemotron-51B-Instruct` <br> `nvidia/Llama-3_3-Nemotron-Super-49B-v1` <br> `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | L |
| `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/Qwen2-7B-Instruct` | L |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B` | L |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B` | L |
| `Qwen2VLForConditionalGeneration` | Qwen2-VL | `Qwen/Qwen2-VL-7B-Instruct` | L + V |
| `Qwen2_5_VLForConditionalGeneration` | Qwen2.5-VL | `Qwen/Qwen2.5-VL-7B-Instruct` | L + V |

Note:
- L: Language only
- L + V: Language and Vision multimodal support
- Llama 3.2 accepts vision input, but our support is currently limited to text only.

## Run the speculative decoding script:

```bash
# NGram drafter
python3 examples/pytorch/quickstart_advanced.py \
    --model_dir meta-llama/Llama-3.1-8B-Instruct \
    --spec_decode_algo NGRAM \
    --max_matching_ngram_size=2 \
    --spec_decode_nextn=4 \
    --disable_overlap_scheduler
```

```bash
# Draft Target
python3 examples/pytorch/quickstart_advanced.py \
    --model_dir meta-llama/Llama-3.1-8B-Instruct \
    --spec_decode_algo draft_target \
    --spec_decode_nextn 5 \
    --draft_model_dir meta-llama/Llama-3.2-1B-Instruct \
    --disable_overlap_scheduler \
    --disable_kv_cache_reuse
```

@@ -1,23 +0,0 @@
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(max_tokens=32)

    llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
    outputs = llm.generate(prompts, sampling_params)

    for i, output in enumerate(outputs):
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == '__main__':
    main()

@@ -1546,7 +1546,7 @@ def test_build_time_benchmark_sanity(llm_root, llm_venv):

### Pivot-To-Python examples
def test_ptp_quickstart(llm_root, llm_venv):
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))

    src = f"{llm_models_root()}/llama-3.1-model/Llama-3.1-8B-Instruct"
    dst = f"{llm_venv.get_working_directory()}/meta-llama/Llama-3.1-8B-Instruct"

@@ -1558,7 +1558,7 @@ def test_ptp_quickstart(llm_root, llm_venv):
            dir="./",
            delete=True,
            delete_on_close=True) as running_log:
-        venv_check_call(llm_venv, [str(example_root / "quickstart.py")],
+        venv_check_call(llm_venv, [str(example_root / "quickstart_example.py")],
                        stdout=running_log)
        _check_mem_usage(running_log, [4.60, 0, 0, 0])

@@ -1616,7 +1616,7 @@ def test_ptp_quickstart(llm_root, llm_venv):
])
def test_ptp_quickstart_advanced(llm_root, llm_venv, model_name, model_path):
    print(f"Testing {model_name}.")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    if model_name == "Nemotron-H-8B":
        llm_venv.run_cmd([
            str(example_root / "quickstart_advanced.py"),

@@ -1656,7 +1656,7 @@ def test_ptp_quickstart_advanced(llm_root, llm_venv, model_name, model_path):
def test_ptp_quickstart_advanced_mtp(llm_root, llm_venv, model_name,
                                     model_path):
    print(f"Testing {model_name}.")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    with tempfile.NamedTemporaryFile(mode='w+t',
                                     suffix=f".{model_name}.log",
                                     dir="./",

@@ -1682,7 +1682,7 @@ def test_ptp_quickstart_advanced_bs1(llm_root, llm_venv):
    model_name = "DeepSeek-V3-Lite-FP8"
    model_path = "DeepSeek-V3-Lite/fp8"
    print(f"Testing {model_name}.")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    llm_venv.run_cmd([
        str(example_root / "quickstart_advanced.py"),
        "--use_cuda_graph",

@@ -1713,7 +1713,7 @@ def test_ptp_quickstart_advanced_deepseek_multi_nodes(llm_root, llm_venv,
                                                       model_path):
    # "RCCA https://nvbugs/5163844"
    print(f"Testing {model_path}.")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    run_cmd = [
        "trtllm-llmapi-launch",
        "python3",

@@ -1737,7 +1737,7 @@ def test_ptp_quickstart_advanced_deepseek_multi_nodes(llm_root, llm_venv,
def test_ptp_quickstart_advanced_eagle3(llm_root, llm_venv, model_name,
                                        model_path, eagle_model_path):
    print(f"Testing {model_name}.")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    with tempfile.NamedTemporaryFile(mode='w+t',
                                     suffix=f".{model_name}.log",
                                     dir="./",

@@ -1766,7 +1766,7 @@ def test_ptp_quickstart_advanced_eagle3(llm_root, llm_venv, model_name,
def test_ptp_quickstart_advanced_ngram(llm_root, llm_venv, model_name,
                                       model_path):
    print(f"Testing {model_name}.")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    with tempfile.NamedTemporaryFile(mode='w+t',
                                     suffix=f".{model_name}.log",
                                     dir="./",

@@ -1800,7 +1800,7 @@ def test_ptp_quickstart_advanced_ngram(llm_root, llm_venv, model_name,
def test_ptp_quickstart_advanced_deepseek_r1_8gpus(llm_root, llm_venv,
                                                   model_name, model_path):
    print(f"Testing {model_name}.")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    with tempfile.NamedTemporaryFile(mode='w+t',
                                     suffix=f".{model_name}.log",
                                     dir="./",

@@ -1834,7 +1834,7 @@ def test_ptp_quickstart_advanced_deepseek_r1_8gpus(llm_root, llm_venv,
def test_relaxed_acceptance_quickstart_advanced_deepseek_r1_8gpus(
        llm_root, llm_venv, model_name, model_path):
    print(f"Testing {model_name}.")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    with tempfile.NamedTemporaryFile(mode='w+t',
                                     suffix=f".{model_name}.log",
                                     dir="./",

@@ -1886,7 +1886,7 @@ def test_relaxed_acceptance_quickstart_advanced_deepseek_r1_8gpus(
def test_ptp_quickstart_advanced_8gpus(llm_root, llm_venv, model_name,
                                       model_path):
    print(f"Testing {model_name}.")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    mapping = {
        "Llama3.1-70B-BF16": 21.0,
        "Mixtral-8x7B-BF16": 16.5,

@@ -1927,7 +1927,7 @@ def test_ptp_quickstart_advanced_8gpus(llm_root, llm_venv, model_name,
def test_ptp_quickstart_advanced_2gpus_sm120(llm_root, llm_venv, model_name,
                                             model_path):
    print(f"Testing {model_name} on 2 GPUs (SM120+).")
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    llm_venv.run_cmd([
        str(example_root / "quickstart_advanced.py"),
        "--enable_chunked_prefill",

@@ -1939,7 +1939,7 @@ def test_ptp_quickstart_advanced_2gpus_sm120(llm_root, llm_venv, model_name,

@skip_pre_blackwell
def test_ptp_quickstart_advanced_mixed_precision(llm_root, llm_venv):
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    model_path = "Llama-3_1-8B-Instruct_fp8_nvfp4_hf"
    with tempfile.NamedTemporaryFile(mode='w+t',
                                     suffix=f".{model_path}.log",

@@ -1968,7 +1968,7 @@ def test_ptp_quickstart_multimodal(llm_root, llm_venv, model_name, model_path,
    llm_venv.run_cmd(
        ['-m', 'pip', 'install', 'flash-attn==2.7.3', '--no-build-isolation'])

-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    test_data_root = Path(
        os.path.join(llm_models_root(), "multimodals", "test_data"))
    print(f"Accuracy test {model_name} {modality} mode with example inputs.")

@@ -2212,7 +2212,7 @@ def test_ptp_star_attention_example(llm_root, llm_venv, model_name, model_path,
                                    star_attention_input_root):
    print(f"Testing {model_name}.")
    workspace = llm_venv.get_working_directory()
-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    input_file = Path(
        os.path.join(star_attention_input_root,
                     "test_star_attention_input.jsonl"))

@@ -2260,7 +2260,7 @@ def test_ptp_quickstart_advanced_llama_multi_nodes(llm_root, llm_venv,
    if "Llama-4" in model_path:
        tp_size, pp_size = 8, 2

-    example_root = Path(os.path.join(llm_root, "examples", "pytorch"))
+    example_root = Path(os.path.join(llm_root, "examples", "llm-api"))
    run_cmd = [
        "trtllm-llmapi-launch",
        "python3",

@@ -23,7 +23,7 @@ class TestOutOfTree(unittest.TestCase):
        sys.path.append(
            os.path.join(
                os.path.dirname(__file__),
-                '../../../../examples/pytorch/out_of_tree_example'))
+                '../../../../examples/llm-api/out_of_tree_example'))
        import modeling_opt  # noqa

        model_dir = str(llm_models_root() / "opt-125m")