From d0d19e81ca992bcfd74b5d993424d29b024891fe Mon Sep 17 00:00:00 2001
From: QI JUN <22017000+QiJune@users.noreply.github.com>
Date: Wed, 23 Apr 2025 14:36:16 -0700
Subject: [PATCH] chore: fix some invalid paths of contrib models (#3818)

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---
 cpp/tests/README.md                        |  4 ---
 docs/source/blogs/Falcon180B-H200.md       |  2 +-
 docs/source/reference/support-matrix.md    | 33 ++++++++++++-----------
 examples/models/contrib/grok/README.md     |  4 +--
 examples/models/contrib/opt/README.md      |  2 +-
 examples/models/core/multimodal/README.md  |  2 +-
 6 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/cpp/tests/README.md b/cpp/tests/README.md
index 6c3cc682b0..716d41d348 100644
--- a/cpp/tests/README.md
+++ b/cpp/tests/README.md
@@ -60,9 +60,7 @@ To build the engines from the top-level directory:
 
 ```bash
 PYTHONPATH=examples/models/core/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gpt_engines.py
-PYTHONPATH=examples/models/contrib/gpt:$PYTHONPATH python3 cpp/tests/resources/scripts/build_gptj_engines.py
 PYTHONPATH=examples/models/core/llama:$PYTHONPATH python3 cpp/tests/resources/scripts/build_llama_engines.py
-PYTHONPATH=examples/chatglm:$PYTHONPATH python3 cpp/tests/resources/scripts/build_chatglm_engines.py
 PYTHONPATH=examples/medusa:$PYTHONPATH python3 cpp/tests/resources/scripts/build_medusa_engines.py
 PYTHONPATH=examples/eagle:$PYTHONPATH python3 cpp/tests/resources/scripts/build_eagle_engines.py
 PYTHONPATH=examples/redrafter:$PYTHONPATH python3 cpp/tests/resources/scripts/build_redrafter_engines.py
@@ -86,9 +84,7 @@ End-to-end tests read inputs and expected outputs from Numpy files located at [c
 
 ```bash
 PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gpt_output.py
-PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_gptj_output.py
 PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_llama_output.py
-PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_chatglm_output.py
 PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_medusa_output.py
 PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_eagle_output.py
 PYTHONPATH=examples:$PYTHONPATH python3 cpp/tests/resources/scripts/generate_expected_redrafter_output.py
diff --git a/docs/source/blogs/Falcon180B-H200.md b/docs/source/blogs/Falcon180B-H200.md
index 419f8d8e94..f2c2fe7592 100644
--- a/docs/source/blogs/Falcon180B-H200.md
+++ b/docs/source/blogs/Falcon180B-H200.md
@@ -57,7 +57,7 @@ step further by performing FP8 computation on Hopper GPUs instead of the
 standard FP16.
 
 Similar examples running Falcon-180B with quantization in TensorRT-LLM are
-available in [examples/falcon](/examples/falcon).
+available in [examples/models/contrib/falcon](/examples/models/contrib/falcon).
 
 ## Llama-70B on H200 up to 6.7x A100
 
diff --git a/docs/source/reference/support-matrix.md b/docs/source/reference/support-matrix.md
index 3bdb78f98c..f2765f1150 100644
--- a/docs/source/reference/support-matrix.md
+++ b/docs/source/reference/support-matrix.md
@@ -8,27 +8,30 @@ TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA
 
 ### LLM Models
 
-- [Arctic](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/arctic)
-- [Baichuan/Baichuan2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/baichuan)
+- [Arctic](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/arctic)
+- [Baichuan/Baichuan2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/baichuan)
 - [BART](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
 - [BERT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/bert)
-- [BLOOM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/bloom)
+- [BLOOM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/bloom)
 - [ByT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
-- [GLM/ChatGLM/ChatGLM2/ChatGLM3/GLM-4](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/chatglm)
+- [ChatGLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/chatglm-6b)
+- [ChatGLM2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/chatglm2-6b)
+- [ChatGLM3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/chatglm3-6b-32k)
 - [Code LLaMA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
-- [DBRX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/dbrx)
+- [DBRX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/dbrx)
 - [Exaone](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/exaone)
 - [FairSeq NMT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
-- [Falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/falcon)
+- [Falcon](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/falcon)
 - [Flan-T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec) [^encdec]
 - [Gemma/Gemma2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gemma)
+- [GLM-4](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/glm-4-9b)
 - [GPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
-- [GPT-J](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/gpt)
+- [GPT-J](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/gptj)
 - [GPT-Nemo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
-- [GPT-NeoX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gptneox)
+- [GPT-NeoX](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/gptneox)
 - [Granite-3.0](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/granite)
-- [Grok-1](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/grok)
-- [InternLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/internlm)
+- [Grok-1](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/grok)
+- [InternLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/internlm)
 - [InternLM2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/internlm2)
 - [LLaMA/LLaMA 2/LLaMA 3/LLaMA 3.1](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
 - [Mamba](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/mamba)
@@ -37,19 +40,19 @@ TensorRT-LLM optimizes the performance of a range of well-known models on NVIDIA
 - [Mistral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
 - [Mistral NeMo](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama)
 - [Mixtral](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/mixtral)
-- [MPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mpt)
+- [MPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/mpt)
 - [Nemotron](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/nemotron)
 - [mT5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
-- [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt)
+- [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/opt)
 - [Phi-1.5/Phi-2/Phi-3](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi)
 - [Qwen/Qwen1.5/Qwen2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwen)
 - [Qwen-VL](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/qwenvl)
 - [RecurrentGemma](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/recurrentgemma)
-- [Replit Code](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mpt) [^replitcode]
+- [Replit Code](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/mpt) [^replitcode]
 - [RoBERTa](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/bert)
 - [SantaCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
-- [Skywork](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/skywork)
-- [Smaug](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/smaug)
+- [Skywork](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/skywork)
+- [Smaug](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/smaug)
 - [StarCoder](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/gpt)
 - [T5](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec)
 - [Whisper](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/whisper)
diff --git a/examples/models/contrib/grok/README.md b/examples/models/contrib/grok/README.md
index 51f85d97d9..2e2bfede26 100644
--- a/examples/models/contrib/grok/README.md
+++ b/examples/models/contrib/grok/README.md
@@ -22,7 +22,7 @@ The grok-1 model requires a node with 8x80GB GPU memory(at least).
 
 ## Overview
 
-The TensorRT-LLM Grok-1 implementation can be found in [tensorrt_llm/models/grok/model.py](../../../../tensorrt_llm/models/grok/model.py). The TensorRT-LLM Grok-1 example code is located in [`examples/grok`](./). There is one main file:
+The TensorRT-LLM Grok-1 implementation can be found in [tensorrt_llm/models/grok/model.py](../../../../tensorrt_llm/models/grok/model.py). The TensorRT-LLM Grok-1 example code is located in [`examples/models/contrib/grok`](./). There is one main file:
 
 * [`convert_checkpoint.py`](./convert_checkpoint.py) to convert the Grok-1 model into tensorrt-llm checkpoint format.
 
@@ -38,7 +38,7 @@ In addition, there are two shared files in the parent folder [`examples`](../../
 
 ## Usage
 
-The TensorRT-LLM Grok-1 example code locates at [examples/grok](./). It takes xai weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
+The TensorRT-LLM Grok-1 example code locates at [examples/models/contrib/grok](./). It takes xai weights as input, and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
 
 ### Build TensorRT engine(s)
 
diff --git a/examples/models/contrib/opt/README.md b/examples/models/contrib/opt/README.md
index 9c2532b440..f2b2b9b52d 100644
--- a/examples/models/contrib/opt/README.md
+++ b/examples/models/contrib/opt/README.md
@@ -18,7 +18,7 @@ multiple GPUs or multiple nodes with multiple GPUs.
 
 ## Overview
 
-The TensorRT-LLM OPT implementation can be found in [`tensorrt_llm/models/opt/model.py`](../../tensorrt_llm/models/opt/model.py). The TensorRT-LLM OPT example code is located in [`examples/opt`](./). There is one file:
+The TensorRT-LLM OPT implementation can be found in [`tensorrt_llm/models/opt/model.py`](../../../../tensorrt_llm/models/opt/model.py). The TensorRT-LLM OPT example code is located in [`examples/models/contrib/opt`](./). There is one file:
 
 * [`convert_checkpoint.py`](./convert_checkpoint.py) to convert a checkpoint from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format
 
diff --git a/examples/models/core/multimodal/README.md b/examples/models/core/multimodal/README.md
index 348b634288..29822c1129 100644
--- a/examples/models/core/multimodal/README.md
+++ b/examples/models/core/multimodal/README.md
@@ -359,7 +359,7 @@ Firstly, please install transformers with 4.45.2
 pip install -r requirements-internlm-xcomposer2.txt
 ```
 
-1. Convert Huggingface weights to TRT-LLM checkpoint format using `examples/internlm/README.md`.
+1. Convert Huggingface weights to TRT-LLM checkpoint format using `examples/models/contrib/internlm/README.md`.
 2. Use `trtllm-build` command to build TRT-LLM engine for OPT.
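
The snippet below is a minimal verification sketch and is not part of the patch itself: assuming a TensorRT-LLM checkout with this patch applied and run from the repository root, it checks that every example directory referenced by the updated links actually exists. The directory list is taken directly from the links fixed above; the script name and structure are illustrative only.

```bash
#!/usr/bin/env bash
# Minimal sketch: confirm that the example directories referenced by the
# updated documentation links exist in the working tree. Assumes it is run
# from the TensorRT-LLM repository root after this patch has been applied.
set -euo pipefail

paths=(
  examples/models/contrib/arctic
  examples/models/contrib/baichuan
  examples/models/contrib/bloom
  examples/models/contrib/chatglm-6b
  examples/models/contrib/chatglm2-6b
  examples/models/contrib/chatglm3-6b-32k
  examples/models/contrib/dbrx
  examples/models/contrib/falcon
  examples/models/contrib/gptj
  examples/models/contrib/gptneox
  examples/models/contrib/grok
  examples/models/contrib/internlm
  examples/models/contrib/mpt
  examples/models/contrib/opt
  examples/models/contrib/skywork
  examples/models/contrib/smaug
  examples/models/core/glm-4-9b
)

missing=0
for p in "${paths[@]}"; do
  if [ ! -d "$p" ]; then
    echo "MISSING: $p"
    missing=1
  fi
done

exit "$missing"
```

If the check reports a missing directory, the corresponding documentation link is still stale and needs the same kind of fix applied in this patch.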