<div align="center">

# 🔥🚀⚡ AutoDeploy

<h4> Seamless Model Deployment from PyTorch to TRT-LLM</h4>

<div align="left">

AutoDeploy is designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.

______________________________________________________________________

## Latest News 🔥

- \[2025/02/14\] Initial experimental release of `auto_deploy` backend for TensorRT-LLM

______________________________________________________________________
## Motivation & Approach

Deploying large language models (LLMs) can be challenging, especially when balancing ease of use with high performance. Teams need simple, intuitive deployment solutions that reduce engineering effort, speed up the integration of new models, and support rapid experimentation without compromising performance.

AutoDeploy addresses these challenges with a streamlined, (semi-)automated pipeline that transforms in-framework PyTorch models, including Hugging Face models, into optimized inference-ready models for TRT-LLM. It simplifies deployment, optimizes models for efficient inference, and bridges the gap between simplicity and performance.

### **Key Features:**

- **Seamless Model Transition:** Automatically converts PyTorch/Hugging Face models to TRT-LLM without manual rewrites.
- **Unified Model Definition:** Maintain a single source of truth with your original PyTorch/Hugging Face model.
- **Optimized Inference:** Built-in transformations for sharding, quantization, KV-cache integration, MHA fusion, and CudaGraph optimization.
- **Immediate Deployment:** Day-0 support for models with continuous performance enhancements.
- **Quick Setup & Prototyping:** Lightweight pip package for easy installation with a demo environment for fast testing.

______________________________________________________________________
## Get Started

1. **Install AutoDeploy:**

   AutoDeploy is available as part of the TRT-LLM installation.

   ```bash
   sudo apt-get -y install libopenmpi-dev && pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
   ```

   Refer to the [TRT-LLM installation guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation/linux.md) for more information.

2. **Run the Llama Example:**

   You are now ready to run an in-framework LLaMA demo.

   The general entrypoint for the AutoDeploy demo is the `build_and_run_ad.py` script. Checkpoints are loaded directly from Hugging Face (HF) or from a local HF-like directory:

   ```bash
   cd examples/auto_deploy
   python build_and_run_ad.py --config '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}'
   ```

______________________________________________________________________
## Support Matrix

AutoDeploy streamlines the model deployment process through an automated workflow designed for efficiency and performance. The workflow begins with a PyTorch model, which is exported using `torch.export` to generate a standard Torch graph. This graph contains core PyTorch ATen operations alongside custom attention operations, determined by the attention backend specified in the configuration.

The exported graph then undergoes a series of automated transformations, including graph sharding, KV-cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled using one of the supported compile backends (like `torch-opt`), followed by deploying it via the TRT-LLM runtime.
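To make the export step concrete, here is a minimal, generic sketch of `torch.export` on a toy module. This is not AutoDeploy's internal code: it only illustrates how a module is captured into an ATen-level graph that downstream transformations can rewrite, and then compiled, mirroring the export-then-compile flow described above.

```python
# Minimal sketch of the export -> compile flow on a toy module.
# Illustrative only; it does not reproduce AutoDeploy's transformations.
import torch
import torch.nn as nn


class ToyMLP(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.up = nn.Linear(dim, 4 * dim)
        self.down = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)))


model = ToyMLP().eval()
example_inputs = (torch.randn(2, 64),)

# 1. Export: capture the module as a graph of core ATen ops.
exported = torch.export.export(model, example_inputs)
print(exported)  # the captured graph that graph transformations would rewrite

# 2. Compile: roughly what the "torch-compile"/"torch-opt" backends do at the end.
compiled = torch.compile(model)
print(compiled(*example_inputs).shape)
```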
### Supported Models

**Bring Your Own Model**: AutoDeploy leverages `torch.export` and dynamic graph pattern matching, enabling seamless integration for a wide variety of models without relying on hard-coded architectures.

Additionally, we have officially verified and fully optimized support for the following models:

<details>
<summary>Click to expand supported models list</summary>

| Model Series | HF Model Card | Precision | World Size | Runtime | Compile Backend ||| Attention Backend |||
|--------------|----------------------|-----------|------------|---------|-----------------|--------------------|--------------------|--------------------|----------|----------|
| | | | | | torch-simple | torch-compile | torch-opt | TritonWithFlattenedInputs | FlashInfer | MultiHeadLatentAttention |
| LLaMA | meta-llama/Llama-2-7b-chat-hf<br>meta-llama/Meta-Llama-3.1-8B-Instruct<br>meta-llama/Llama-3.1-70B-Instruct<br>codellama/CodeLlama-13b-Instruct-hf | BF16 | 1,2,4 | demollm, trtllm | ✅ | ✅ | ✅ | ✅ | ✅ | n/a |
| Nvidia Minitron | nvidia/Llama-3_1-Nemotron-51B-Instruct<br>nvidia/Llama-3.1-Minitron-4B-Width-Base<br>nvidia/Llama-3.1-Minitron-4B-Depth-Base | BF16 | 1,2,4 | demollm, trtllm | ✅ | ✅ | ✅ | ✅ | ✅ | n/a |
| Nvidia Model Optimizer | nvidia/Llama-3.1-8B-Instruct-FP8<br>nvidia/Llama-3.1-405B-Instruct-FP8 | FP8 | 1,2,4 | demollm, trtllm | ✅ | ✅ | ✅ | ✅ | ✅ | n/a |
| DeepSeek | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | BF16 | 1,2,4 | demollm, trtllm | ✅ | ✅ | ✅ | ✅ | ✅ | n/a |
| Mistral | mistralai/Mixtral-8x7B-Instruct-v0.1<br>mistralai/Mistral-7B-Instruct-v0.3 | BF16 | 1,2,4 | demollm, trtllm | ✅ | ✅ | ✅ | ✅ | ✅ | n/a |
| BigCode | bigcode/starcoder2-15b | FP32 | 1,2,4 | demollm, trtllm | ✅ | ✅ | ✅ | ✅ | ✅ | n/a |
| Deepseek-V3 | deepseek-ai/DeepSeek-V3 | BF16 | 1,2,4 | demollm | ✅ | ❌ | ❌ | n/a | n/a | ✅ |

</details>
### Runtime Integrations

AutoDeploy runs natively with the entire `TRT-LLM` stack via the `LLM` API. In addition, we provide a lightweight wrapper of the `LLM` API for onboarding and debugging new models:

| `"runtime"` | Description |
|-------------|-------------|
| `trtllm` | A robust, production-grade runtime optimized for high-performance inference. |
| `demollm` | A lightweight runtime wrapper designed for development and testing, featuring a naive scheduler and KV-cache manager for simplified debugging and testing. |

### Compile Backends

AutoDeploy supports multiple backends for compiling the exported Torch graph:

| `"compile_backend"` | Description |
|--------------------|-------------|
| `torch-simple` | Exports the graph without additional optimizations. |
| `torch-compile` | Applies `torch.compile` to the graph after all AutoDeploy transformations have been completed. |
| `torch-cudagraph` | Performs CUDA graph capture (without `torch.compile`). |
| `torch-opt` | Uses `torch.compile` along with CUDA graph capture to enhance inference performance. |
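As a rough conceptual analogue of what `torch-opt` combines (plain PyTorch, not the AutoDeploy backend itself), `torch.compile`'s `"reduce-overhead"` mode pairs compilation with CUDA graph capture when running on GPU:

```python
# Conceptual analogue only: torch.compile's "reduce-overhead" mode combines
# compilation with CUDA graph capture on GPU, similar in spirit to "torch-opt".
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256)).to(device)

compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 256, device=device)
for _ in range(3):  # warm-up iterations trigger compilation (and graph capture on GPU)
    out = compiled(x)
print(out.shape)
```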
### Attention Backends

Optimize attention operations using different attention kernel implementations:

| `"attn_backend"` | Description |
|----------------------|-------------|
| `TritonWithFlattenedInputs` | Custom fused multi-head attention (MHA) with KV-cache kernels for efficient attention processing. |
| `FlashInfer` | Uses off-the-shelf optimized attention kernels with KV cache from the [`flashinfer`](https://github.com/flashinfer-ai/flashinfer.git) library. |
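For reference, the computation these backends accelerate is standard scaled dot-product attention; the snippet below is the plain PyTorch equivalent (without the KV-cache handling and kernel fusion that the backends add):

```python
# Plain PyTorch reference for the attention computation; the Triton and
# FlashInfer backends provide fused, KV-cache-aware kernels for the same math.
import torch
import torch.nn.functional as F

batch, heads, seq, head_dim = 1, 8, 128, 64
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```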
### Precision Support

AutoDeploy supports a range of precision formats to enhance model performance, including:

- BF16, FP32
- Quantization formats like FP8

______________________________________________________________________
## Advanced Usage

### Example Build Script ([`build_and_run_ad.py`](./build_and_run_ad.py))

#### Base Command

To build and run an AutoDeploy example, use the [`build_and_run_ad.py`](./build_and_run_ad.py) script with the command shown after the table below.

The command accepts the following configuration keys:
| Configuration Key | Description |
|-------------------|-------------|
| `"model"` | The HF model card or path to a HF checkpoint folder |
| `"model_factory"` | Choose model factory implementation (`"hf"` or `"llama4"`) |
| `"skip_loading_weights"` | Only load the architecture, not the weights |
| `"customize_tokenizer"` | Use tokenizer from model factory (true) or from LLM API (false) |
| `"model_kwargs"` | Extra kwargs for the model config class to customize the model config |
| `"tokenizer_kwargs"` | Extra kwargs for the tokenizer class to customize the tokenizer |
| `"world_size"` | The number of GPUs for Tensor Parallel |
| `"runtime"` | Specifies which type of Engine to use during runtime |
| `"compile_backend"` | Specifies how to compile the graph at the end |
| `"attn_backend"` | Specifies kernel implementation for attention |
| `"mla_backend"` | Specifies implementation for multi-head latent attention |
| `"max_seq_len"` | Maximum sequence length for inference/cache |
| `"max_batch_size"` | Maximum dimension for statically allocated KV cache |
| `"page_size"` | Page size for attention |
| `"benchmark"` | Indicates whether to run the built-in benchmark for token generation |

For default values and additional configuration options, refer to the [simple_config.py](./simple_config.py) file.

```bash
cd examples/auto_deploy
python build_and_run_ad.py \
--config '{"model": {HF_modelcard_or_path_to_local_folder}, "world_size": {num_GPUs}, "runtime": {"demollm"|"trtllm"}, "compile_backend": {"torch-simple"|"torch-opt"}, "attn_backend": {"TritonWithFlattenedInputs"|"FlashInfer"}, "benchmark": {true|false} }'
```
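Because the `--config` string can get long, it may be convenient to assemble it from a Python dict. The snippet below is a small hypothetical helper; the keys mirror the table above and the values are placeholders, not defaults:

```python
# Hypothetical convenience helper: build the --config JSON from a dict so the
# launch command stays readable. Values below are placeholders, not defaults.
import json
import shlex
import subprocess

config = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "world_size": 1,
    "runtime": "demollm",
    "compile_backend": "torch-simple",
    "attn_backend": "TritonWithFlattenedInputs",
    "benchmark": False,
}

cmd = ["python", "build_and_run_ad.py", "--config", json.dumps(config)]
print(shlex.join(cmd))             # inspect the exact command line
# subprocess.run(cmd, check=True)  # uncomment to launch from Python
```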
#### Experiment Configuration

The experiment configuration `dataclass` is defined in [simple_config.py](./simple_config.py). Check it out for detailed documentation on each available configuration option.

Arguments can be overwritten at runtime by specifying the `--config` argument on the command line and providing a valid config dictionary in `json` format. For example, to run any experiment with benchmarking enabled, use:

```bash
cd examples/auto_deploy
python build_and_run_ad.py --config '{"benchmark": true}'
```

The `model_kwargs` and `tokenizer_kwargs` dictionaries can be supplied on the command line via `--model-kwargs '{}'` and `--tokenizer-kwargs '{}'`.
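As a sketch, such kwargs could be serialized for the command line as shown below; the specific keys are only illustrative, and which kwargs are honored depends on the underlying model and tokenizer classes:

```python
# Illustrative only: serialize extra kwargs as JSON for --model-kwargs and
# --tokenizer-kwargs. Whether a given key is honored depends on the HF classes.
import json

model_kwargs = {"num_hidden_layers": 2}      # e.g. shrink the model for a smoke test
tokenizer_kwargs = {"padding_side": "left"}  # e.g. a tokenizer-specific option

print(
    f"python build_and_run_ad.py "
    f"--model-kwargs '{json.dumps(model_kwargs)}' "
    f"--tokenizer-kwargs '{json.dumps(tokenizer_kwargs)}'"
)
```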
#### Logging Level

Use the following environment variable to specify the logging level of our built-in logger, ordered by decreasing verbosity:

```bash
AUTO_DEPLOY_LOG_LEVEL=DEBUG
AUTO_DEPLOY_LOG_LEVEL=INFO
AUTO_DEPLOY_LOG_LEVEL=WARNING
AUTO_DEPLOY_LOG_LEVEL=ERROR
AUTO_DEPLOY_LOG_LEVEL=INTERNAL_ERROR
```

The default level is `INFO`.
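If you prefer to launch the script from Python (for example, from a test harness), the variable can be set in the child process environment; a minimal sketch:

```python
# Sketch: run the demo script with a custom log level without touching the shell.
import os
import subprocess

env = dict(os.environ, AUTO_DEPLOY_LOG_LEVEL="DEBUG")
subprocess.run(
    ["python", "build_and_run_ad.py", "--config", '{"benchmark": true}'],
    env=env,
    check=True,
)
```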
### Model Evaluation with LM Evaluation Harness

lm-evaluation-harness is supported. To run the evaluation, use the following command:

```bash
# The model is defined the same as above. Other config args can also be specified in model_args (comma separated).
# You can specify any task supported by lm-evaluation-harness.
cd examples/auto_deploy
python lm_eval_ad.py \
--model autodeploy --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,world_size=2 --tasks mmlu
```

### Mixed-precision Quantization using TensorRT Model Optimizer

TensorRT Model Optimizer's [AutoQuantize](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) algorithm is a PTQ algorithm from ModelOpt that quantizes a model by searching for the best quantization format per layer while meeting the performance constraint specified by the user. In this way, `AutoQuantize` enables trading off model accuracy for performance.

Currently, `AutoQuantize` supports only `effective_bits` as the performance constraint (for both weight-only quantization and weight & activation quantization). See the
[AutoQuantize documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) for more details.

#### 1. Quantize a model with ModelOpt

Refer to [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_autodeploy/README.md) for generating a quantized model checkpoint.

#### 2. Deploy the quantized model with AutoDeploy

```bash
cd examples/auto_deploy
python build_and_run_ad.py --config '{"world_size": 1, "model": "{<MODELOPT_CKPT_PATH>}"}'
```
### Incorporating `auto_deploy` into your own workflow

AutoDeploy can be seamlessly integrated into your existing workflows using TRT-LLM's LLM high-level API. This section provides a blueprint for configuring and invoking AutoDeploy within your custom applications.

Here is an example of how you can build an LLM object with AutoDeploy integration:

<details>
<summary>Click to expand the example</summary>

```python
from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig
from tensorrt_llm._torch.auto_deploy.shim import AutoDeployConfig

# 1. Set up the build configuration
build_config = BuildConfig(
    max_seq_len=<MAX_SEQ_LEN>,
    max_batch_size=<MAX_BS>,
)
build_config.plugin_config.tokens_per_block = <PAGE_SIZE>
# if using "TritonWithFlattenedInputs" as backend, <PAGE_SIZE> should equal <MAX_SEQ_LEN>
# Refer to examples/auto_deploy/simple_config.py (line 109) for details.

# 2. Set up AutoDeploy configuration
# AutoDeploy will use its own cache implementation
model_kwargs = {"use_cache": False}

ad_config = AutoDeployConfig(
    use_cuda_graph=True,  # set True if using "torch-opt" as compile backend
    torch_compile_enabled=True,  # set True if using "torch-opt" as compile backend
    model_kwargs=model_kwargs,
    attn_backend="TritonWithFlattenedInputs",  # choose between "TritonWithFlattenedInputs" and "FlashInfer"
    skip_loading_weights=False,
)

# 3. Construct the LLM high-level interface object with autodeploy as backend
llm = LLM(
    model=<HF_MODEL_CARD_OR_DIR>,
    backend="autodeploy",
    build_config=build_config,
    pytorch_backend_config=ad_config,
    tensor_parallel_size=<NUM_WORLD_RANK>,
)
```

</details>

For more examples on the TRT-LLM LLM API, visit [this page](https://nvidia.github.io/TensorRT-LLM/examples/llm_api_examples.html).
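Once the `llm` object above is constructed, it can be used like any other LLM-API instance. The snippet below is a minimal usage sketch assuming the standard `generate()`/`SamplingParams` interface of the LLM API; consult the linked examples for authoritative usage.

```python
# Minimal usage sketch, assuming the standard LLM-API generate() interface.
from tensorrt_llm import SamplingParams

sampling_params = SamplingParams(max_tokens=64, temperature=0.8)

outputs = llm.generate(
    ["What is the capital of France?"],
    sampling_params,
)
for output in outputs:
    # each result carries the original prompt and the generated completion(s)
    print(output.prompt, "->", output.outputs[0].text)
```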
______________________________________________________________________

## Roadmap

1. **Model Coverage:**

   - Expand support for additional LLM variants and features:
     - LoRA
     - Speculative Decoding
     - Model specialization for disaggregated serving

1. **Performance Optimization:**

   - Enhance inference speed and efficiency with:
     - MoE fusion and all-reduce fusion techniques
     - Reuse of TRT-LLM PyTorch operators for greater efficiency

______________________________________________________________________

## Disclaimer

This project is in active development and is currently in an early (beta) stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability. Use at your own risk.