(release-notes)=
# Release Notes
All published functionality in the release notes has been fully tested and verified, with known limitations documented. To share feedback about this release, visit the NVIDIA Developer Forum.
## TensorRT-LLM Release 0.11.0
### Key Features and Enhancements
- Supported very long context for LLaMA (see the “Long context evaluation” section in `examples/llama/README.md`).
- Low latency optimization
  - Added a reduce-norm feature which aims to fuse the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel; it is recommended when the batch size is small and the generation phase time is dominant.
  - Added FP8 support to the GEMM plugin, which benefits cases when the batch size is smaller than 4.
  - Added a fused GEMM-SwiGLU plugin for FP8 on SM90.
- LoRA enhancements
  - Supported running FP8 LLaMA with FP16 LoRA checkpoints.
  - Added support for quantized base models and FP16/BF16 LoRA.
    - SQ OOTB (INT8 A/W) + FP16/BF16/FP32 LoRA
    - INT8/INT4 Weight-Only (INT8 W) + FP16/BF16/FP32 LoRA
    - Weight-Only Group-wise + FP16/BF16/FP32 LoRA
  - Added LoRA support to Qwen2, see the “Run models with LoRA” section in `examples/qwen/README.md`.
  - Added support for Phi-3-mini/small FP8 base + FP16/BF16 LoRA, see the “Run Phi-3 with LoRA” section in `examples/phi/README.md`.
  - Added support for starcoder-v2 FP8 base + FP16/BF16 LoRA, see the “Run StarCoder2 with LoRA” section in `examples/gpt/README.md`.
- Encoder-decoder models C++ runtime enhancements
  - Supported paged KV cache and inflight batching. (#800)
  - Supported tensor parallelism.
- Supported INT8 quantization with the embedding layer excluded.
- Updated the default model for Whisper to `distil-whisper/distil-large-v3`, thanks to the contribution from @IbrahimAmin1 in #1337.
- Supported automatic HuggingFace model download for the Python high-level API.
- Supported explicit draft tokens for in-flight batching.
- Supported local custom calibration datasets, thanks to the contribution from @DreamGenX in #1762.
- Added batched logits post processor.
- Added Hopper qgmma kernel to XQA JIT codepath.
- Supported tensor parallelism and expert parallelism enabled together for MoE.
- Supported pipeline parallelism cases where the number of layers is not divisible by the PP size.
- Added `numQueuedRequests` to the iteration stats log of the executor API.
- Added `iterLatencyMilliSec` to the iteration stats log of the executor API (see the sketch after this list).
- Added a HuggingFace model zoo from the community, thanks to the contribution from @matichon-vultureprime in #1674.
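As a rough illustration of the two iteration-stats fields above, here is a minimal sketch of polling the executor statistics from Python. The accessor name and the exact field spellings on the stats object are assumptions, not the documented interface.

```python
def log_iteration_stats(executor) -> None:
    """Print the latest per-iteration stats from a TensorRT-LLM executor.

    The accessor name below is an assumption; each stats entry carries the
    iteration stats log fields, including numQueuedRequests and iterLatencyMilliSec.
    """
    for stats in executor.get_latest_iteration_stats():  # assumed accessor name
        print(stats)
```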
### API Changes
- [BREAKING CHANGE] `trtllm-build` command
  - Migrated Whisper to the unified workflow (`trtllm-build` command), see `examples/whisper/README.md`.
  - `max_batch_size` in the `trtllm-build` command is switched to 256 by default.
  - `max_num_tokens` in the `trtllm-build` command is switched to 8192 by default.
  - Deprecated `max_output_len` and added `max_seq_len`.
  - Removed the unnecessary `--weight_only_precision` argument from the `trtllm-build` command.
  - Removed the `attention_qk_half_accumulation` argument from the `trtllm-build` command.
  - Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command.
  - Removed the `strongly_typed` argument from the `trtllm-build` command.
  - The default value of `max_seq_len` is now read from the HuggingFace model config.
- C++ runtime
  - [BREAKING CHANGE] Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
  - [BREAKING CHANGE] Refactored the `GptManager` API
    - Moved `maxBeamWidth` into `TrtGptModelOptionalParams`.
    - Moved `schedulerConfig` into `TrtGptModelOptionalParams`.
  - Added more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse`, and `enable_chunked_context`.
- [BREAKING CHANGE] Python high-level API
  - Removed the `ModelConfig` class; all of its options are moved to the `LLM` class.
  - Refactored the `LLM` class, refer to `examples/high-level-api/README.md`.
    - Moved the most commonly used options into the explicit argument list and hid the expert options in the kwargs.
    - Exposed `model` to accept either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
    - Supported downloading models from the HuggingFace model hub; currently only Llama variants are supported.
    - Supported a build cache to reuse built TensorRT-LLM engines by setting the environment variable `TLLM_HLAPI_BUILD_CACHE=1` or passing `enable_build_cache=True` to the `LLM` class.
    - Exposed low-level options including `BuildConfig`, `SchedulerConfig`, and so on in the kwargs, so details of the build and runtime phases can be configured.
  - Refactored the `LLM.generate()` and `LLM.generate_async()` APIs (see the sketch after this list).
    - Removed `SamplingConfig`.
    - Added `SamplingParams` with more extensive parameters, see `tensorrt_llm/hlapi/utils.py`.
      - The new `SamplingParams` contains and manages fields from the Python bindings of `SamplingConfig`, `OutputConfig`, and so on.
    - Refactored the `LLM.generate()` output as `RequestOutput`, see `tensorrt_llm/hlapi/llm.py`.
  - Updated the `apps` examples, notably rewriting both `chat.py` and `fastapi_server.py` using the `LLM` APIs; refer to `examples/apps/README.md` for details.
    - Updated `chat.py` to support multi-turn conversation, allowing users to chat with a model in the terminal.
    - Fixed `fastapi_server.py` and eliminated the need for `mpirun` in multi-GPU scenarios.
- [BREAKING CHANGE] Speculative decoding configurations unification
  - Introduced `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
  - Introduced `SpeculativeDecodingModule.h` as a base class for speculative decoding techniques.
  - Removed `decodingMode.h`.
- `gptManagerBenchmark`
  - [BREAKING CHANGE] `api` in the `gptManagerBenchmark` command is `executor` by default now.
  - Added a runtime `max_batch_size`.
  - Added a runtime `max_num_tokens`.
- [BREAKING CHANGE] Added a `bias` argument to the `LayerNorm` module and supported non-bias layer normalization.
- [BREAKING CHANGE] Removed the `GptSession` Python bindings.
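As a rough sketch of the refactored high-level API above: the class names (`LLM`, `SamplingParams`, `RequestOutput`) and the `TLLM_HLAPI_BUILD_CACHE` variable come from these notes, while the import path, the model identifier, and any field names not quoted above are assumptions and may differ in your installed version.

```python
import os

# Optional: reuse previously built engines via the build cache described above.
os.environ.setdefault("TLLM_HLAPI_BUILD_CACHE", "1")

from tensorrt_llm.hlapi import LLM, SamplingParams  # assumed import path

# `model` accepts a HuggingFace model name, or a local HuggingFace model,
# TensorRT-LLM checkpoint, or TensorRT-LLM engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf")  # illustrative model name

# SamplingParams replaces the removed SamplingConfig; field names are assumed.
params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() returns RequestOutput objects rather than raw strings.
for output in llm.generate(["Hello, my name is"], params):
    print(output)  # inspect the RequestOutput fields for the generated text
```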
### Model Updates
- Supported Jais, see `examples/jais/README.md`.
- Supported DiT, see `examples/dit/README.md`.
- Supported VILA 1.5.
- Supported Video NeVA, see the “Video NeVA” section in `examples/multimodal/README.md`.
- Supported Grok-1, see `examples/grok/README.md`.
- Supported Qwen1.5-110B with FP8 PTQ.
- Supported Phi-3 small model with block sparse attention.
- Supported InternLM2 7B/20B, thanks to the contribution from @RunningLeon in #1392.
- Supported Phi-3-medium models, see `examples/phi/README.md`.
- Supported Qwen1.5 MoE A2.7B.
- Supported Phi-3 vision multimodal.
### Fixed Issues
- Fixed broken outputs for cases when the batch size is larger than 1. (#1539)
- Fixed the `top_k` type in `executor.py`, thanks to the contribution from @vonjackustc in #1329.
- Fixed the stop and bad word list pointer offsets in the Python runtime, thanks to the contribution from @fjosw in #1486.
- Fixed some typos for the Whisper model, thanks to the contribution from @Pzzzzz5142 in #1328.
- Fixed export failure with CUDA driver < 526 and pynvml >= 11.5.0, thanks to the contribution from @CoderHam in #1537.
- Fixed an issue in NMT weight conversion, thanks to the contribution from @Pzzzzz5142 in #1660.
- Fixed LLaMA SmoothQuant conversion, thanks to the contribution from @lopuhin in #1650.
- Fixed a `qkv_bias` shape issue for Qwen1.5-32B (#1589), thanks to the contribution from @Tlntin in #1637.
- Fixed the error of Ada traits for `fpA_intB`, thanks to the contribution from @JamesTheZ in #1583.
- Updated `examples/qwenvl/requirements.txt`, thanks to the contribution from @ngoanpv in #1248.
- Fixed rsLoRA scaling in `lora_manager`, thanks to the contribution from @TheCodeWrangler in #1669.
- Fixed a Qwen1.5 checkpoint conversion failure. (#1675)
- Fixed Medusa safetensors and AWQ conversion, thanks to the contribution from @Tushar-ml in #1535.
- Fixed a `convert_hf_mpt_legacy` call failure when the function is called outside the global scope, thanks to the contribution from @bloodeagle40234 in #1534.
- Fixed `use_fp8_context_fmha` broken outputs. (#1539)
- Fixed pre-norm weight conversion for NMT models, thanks to the contribution from @Pzzzzz5142 in #1723.
- Fixed a random seed initialization issue, thanks to the contribution from @pathorn in #1742.
- Fixed stop words and bad words in the Python bindings. (#1642)
- Fixed an issue when converting checkpoints for Mistral 7B v0.3, thanks to the contribution from @Ace-RR in #1732.
- Fixed broken inflight batching for FP8 Llama and Mixtral, thanks to the contribution from @bprus in #1738.
- Fixed a failure when `quantize.py` exports data to `config.json`, thanks to the contribution from @janpetrov in #1676.
- Raise an error when auto parallelism (autopp) detects an unsupported quantization plugin. (#1626)
- Fixed an issue where `shared_embedding_table` is not set when loading Gemma (#1799), thanks to the contribution from @mfuntowicz.
- Fixed the stop and bad words lists to be contiguous for `ModelRunner` (#1815), thanks to the contribution from @Marks101.
- Fixed a missing comment for `FAST_BUILD`, thanks to the support from @lkm2835 in #1851.
- Fixed an issue where Top-P sampling occasionally produces invalid tokens. (#1590)
- Fixed #1424.
- Fixed #1529.
- Fixed `benchmarks/cpp/README.md` for #1562 and #1552.
- Fixed dead links, thanks to the help from @DefTruth, @buvnswrn, and @sunjiabin17 in https://github.com/triton-inference-server/tensorrtllm_backend/pull/478, https://github.com/triton-inference-server/tensorrtllm_backend/pull/482, and https://github.com/triton-inference-server/tensorrtllm_backend/pull/449.
### Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.05-py3`.
- Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.05-py3`.
- The dependent TensorRT version is updated to 10.2.0.
- The dependent CUDA version is updated to 12.4.1.
- The dependent PyTorch version is updated to 2.3.1.
- The dependent ModelOpt version is updated to v0.13.0.
### Known Issues
- In a conda environment on Windows, installation of TensorRT-LLM may succeed, but importing the library in Python may fail with `OSError: exception: access violation reading 0x0000000000000000`. This issue is under investigation.
## TensorRT-LLM Release 0.10.0
### Announcements
- TensorRT-LLM supports TensorRT 10.0.1 and NVIDIA NGC 24.03 containers.
### Key Features and Enhancements
- The Python high-level API
  - Added embedding parallelism, embedding sharing, and fused MLP support.
  - Enabled the usage of the `executor` API.
- Added a weight-stripping feature with a new `trtllm-refit` command. For more information, refer to `examples/sample_weight_stripping/README.md`.
- Added a weight-streaming feature. For more information, refer to `docs/source/advanced/weight-streaming.md`.
- Enhanced the multiple profiles feature; the `--multiple_profiles` argument in the `trtllm-build` command now builds more optimization profiles for better performance.
- Added FP8 quantization support for Mixtral.
- Added support for pipeline parallelism for GPT.
- Optimized the `applyBiasRopeUpdateKVCache` kernel by avoiding re-computation.
- Reduced the overheads between `enqueue` calls of TensorRT engines.
- Added support for paged KV cache for enc-dec models. The support is limited to beam width 1.
- Added W4A(fp)8 CUTLASS kernels for the NVIDIA Ada Lovelace architecture.
- Added debug options (`--visualize_network` and `--dry_run`) to the `trtllm-build` command to visualize the TensorRT network before engine build.
- Integrated the new NVIDIA Hopper XQA kernels for the LLaMA 2 70B model.
- Improved the performance of pipeline parallelism when enabling in-flight batching.
- Supported quantization for Nemotron models.
- Added LoRA support for Mixtral and Qwen.
- Added in-flight batching support for ChatGLM models.
- Added support to `ModelRunnerCpp` so that it runs with the `executor` API for IFB-compatible models.
- Enhanced the custom `AllReduce` by adding a heuristic; it falls back to the native NCCL kernel when hardware requirements are not satisfied, to get the best performance.
- Optimized the performance of the checkpoint conversion process for LLaMA.
- Benchmark
  - [BREAKING CHANGE] Moved the request rate generation arguments and logic from the prepare dataset script to `gptManagerBenchmark`.
  - Enabled streaming and supported Time To First Token (TTFT) latency and Inter-Token Latency (ITL) metrics for `gptManagerBenchmark`.
  - Added the `--max_attention_window` option to `gptManagerBenchmark`.
### API Changes
- [BREAKING CHANGE] Set the default `tokens_per_block` argument of the `trtllm-build` command to 64 for better performance.
- [BREAKING CHANGE] Migrated enc-dec models to the unified workflow.
- [BREAKING CHANGE] Renamed `GptModelConfig` to `ModelConfig`.
- [BREAKING CHANGE] Added speculative decoding mode to the builder API.
- [BREAKING CHANGE] Refactored the scheduling configurations (see the sketch after this list)
  - Unified the `SchedulerPolicy` with the same name in `batch_scheduler` and `executor`, and renamed it to `CapacitySchedulerPolicy`.
  - Expanded the existing configuration scheduling strategy from `SchedulerPolicy` to `SchedulerConfig` to enhance extensibility. The latter also introduces a chunk-based configuration called `ContextChunkingPolicy`.
- [BREAKING CHANGE] The input prompt was removed from the generation output in the `generate()` and `generate_async()` APIs. For example, given the prompt `A B`, the original generation result could be `<s>A B C D E`, where only `C D E` is the actual output; now the result is `C D E`.
- [BREAKING CHANGE] Switched the default `add_special_tokens` in the TensorRT-LLM backend to `True`.
- Deprecated `GptSession` and `TrtGptModelV1`.
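For the scheduling refactor above, here is a minimal sketch of how the renamed types might be wired together. Only `SchedulerConfig`, `CapacitySchedulerPolicy`, and `ContextChunkingPolicy` are quoted from these notes; the binding module path, constructor arguments, and enum member are assumptions.

```python
from tensorrt_llm.bindings import executor as trtllm_executor  # assumed module path

# SchedulerPolicy has been unified and renamed to CapacitySchedulerPolicy,
# and it is now carried by a SchedulerConfig (which also covers the new
# chunk-based ContextChunkingPolicy).
scheduler_config = trtllm_executor.SchedulerConfig(
    trtllm_executor.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT  # assumed enum member
)
executor_config = trtllm_executor.ExecutorConfig(scheduler_config=scheduler_config)
```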
### Model Updates
- Support DBRX
- Support Qwen2
- Support CogVLM
- Support ByT5
- Support LLaMA 3
- Support Arctic (w/ FP8)
- Support Fuyu
- Support Persimmon
- Support Deplot
- Support Phi-3-Mini with long Rope
- Support Neva
- Support Kosmos-2
- Support RecurrentGemma
### Fixed Issues
- Fixed some unexpected behaviors in beam search and early stopping, so that the outputs are more accurate.
- Fixed a segmentation fault with pipeline parallelism and `gather_all_token_logits`. (#1284)
- Removed an unnecessary check in XQA to fix Code Llama 70B Triton crashes. (#1256)
- Fixed an unsupported ScalarType issue for BF16 LoRA. (https://github.com/triton-inference-server/tensorrtllm_backend/issues/403)
- Eliminated the load and save of prompt table in multimodal. (https://github.com/NVIDIA/TensorRT-LLM/discussions/1436)
- Fixed an error when converting the model weights of Qwen 72B INT4-GPTQ. (#1344)
- Fixed early stopping and failures on in-flight batching cases of Medusa. (#1449)
- Added support for more NVLink versions for auto parallelism. (#1467)
- Fixed the assert failure caused by default values of sampling config. (#1447)
- Fixed a requirement specification on Windows for nvidia-cudnn-cu12. (#1446)
- Fixed an MMHA relative position calculation error in `gpt_attention_plugin` for enc-dec models. (#1343)
### Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.03-py3`.
- Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.03-py3`.
- The dependent TensorRT version is updated to 10.0.1.
- The dependent CUDA version is updated to 12.4.0.
- The dependent PyTorch version is updated to 2.2.2.
## TensorRT-LLM Release 0.9.0
### Announcements
- TensorRT-LLM requires TensorRT 9.3 and 24.02 containers.
### Key Features and Enhancements
- [BREAKING CHANGES] TopP sampling optimization with deterministic AIR TopP algorithm is enabled by default
- [BREAKING CHANGES] Added support for embedding sharing for Gemma
- Added support for context chunking to work with KV cache reuse
- Enabled different rewind tokens per sequence for Medusa
- Added BART LoRA support (limited to the Python runtime)
- Enabled multi-LoRA for BART LoRA
- Added support for `early_stopping=False` in beam search for the C++ Runtime
- Added support for a logits post processor to the batch manager
- Added support for importing and converting HuggingFace Gemma checkpoints
- Added support for loading Gemma from HuggingFace
- Added support for the auto parallelism planner for the high-level API and unified builder workflow
- Added support for running `GptSession` without OpenMPI
- Added support for Medusa IFB
- [Experimental] Added support for FP8 FMHA, note that the performance is not optimal, and we will keep optimizing it
- Added support for more head sizes for LLaMA-like models
  - NVIDIA Ampere (SM80, SM86), NVIDIA Ada Lovelace (SM89), and NVIDIA Hopper (SM90) all support head sizes [32, 40, 64, 80, 96, 104, 128, 160, 256]
- Added support for OOTB functionality
  - T5
  - Mixtral 8x7B
- Benchmark features
  - Added emulated static batching in `gptManagerBenchmark`
  - Added support for arbitrary datasets from HuggingFace for C++ benchmarks
  - Added a percentile latency report to `gptManagerBenchmark`
- Performance features
  - Optimized `gptDecoderBatch` to support batched sampling
  - Enabled FMHA for models in the BART, Whisper, and NMT families
  - Removed router tensor parallelism to improve performance for MoE models
  - Improved the custom all-reduce kernel
- Infrastructure features
  - Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.02-py3`
  - The dependent PyTorch version is updated to 2.2
  - Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.02-py3`
  - The dependent CUDA version is updated to 12.3.2 (12.3 Update 2)
### API Changes
- Added the C++ `executor` API
  - Added Python bindings
  - Added advanced and multi-GPU examples for the Python bindings of the `executor` C++ API
  - Added documentation for the C++ `executor` API
- Migrated Mixtral to the high-level API and unified builder workflow
- [BREAKING CHANGES] Moved the LLaMA convert checkpoint script from the examples directory into the core library
- Added support for the `LLM()` API to accept engines built by the `trtllm-build` command
- [BREAKING CHANGES] Removed the `model` parameter from `gptManagerBenchmark` and `gptSessionBenchmark`
- [BREAKING CHANGES] Refactored GPT with the unified building workflow
- [BREAKING CHANGES] Refactored the Qwen model to the unified build workflow
- [BREAKING CHANGES] Removed all the LoRA related flags from the `convert_checkpoint.py` script and the checkpoint content to the `trtllm-build` command, to generalize the feature better to more models
- [BREAKING CHANGES] Removed the `use_prompt_tuning` flag and options from the `convert_checkpoint.py` script and the checkpoint content, to generalize the feature better to more models. Use `trtllm-build --max_prompt_embedding_table_size` instead.
- [BREAKING CHANGES] Changed the `trtllm-build --world_size` flag to the `--auto_parallel` flag. The option is used for the auto parallel planner only.
- [BREAKING CHANGES] `AsyncLLMEngine` is removed. The `tensorrt_llm.GenerationExecutor` class is refactored to work both with explicit launching via `mpirun` at the application level and with an MPI communicator created by `mpi4py`.
- [BREAKING CHANGES] `examples/server` is removed.
- [BREAKING CHANGES] Removed LoRA related parameters from the convert checkpoint scripts.
- [BREAKING CHANGES] Simplified the Qwen convert checkpoint script.
- [BREAKING CHANGES] Reused the `QuantConfig` used in the `trtllm-build` tool to support broader quantization features.
- Added support for TensorRT-LLM checkpoint as model input.
- Refined `SamplingConfig` used in the `LLM.generate` or `LLM.generate_async` APIs, with support for beam search, a variety of penalties, and more features.
- Added support for the StreamingLLM feature. Enable it by setting `LLM(streaming_llm=...)` (see the sketch after this list).
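The `LLM(streaming_llm=...)` toggle above is quoted from these notes; everything else in the sketch below (the import path, the `ModelConfig`-based constructor that predates the 0.11.0 refactor, and the engine path) is an assumption for illustration only.

```python
from tensorrt_llm.hlapi import LLM, ModelConfig  # assumed import path

# ModelConfig points at an engine directory produced by `trtllm-build`;
# the path is hypothetical.
config = ModelConfig(model_dir="/path/to/llama_engine_dir")

# streaming_llm is the toggle quoted in these notes: LLM(streaming_llm=...).
llm = LLM(config, streaming_llm=True)

for output in llm.generate(["Summarize the idea behind attention sinks."]):
    print(output)
```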
### Model Updates
- Added support for distil-whisper
- Added support for HuggingFace StarCoder2
- Added support for VILA
- Added support for Smaug-72B-v0.1
- Migrated BLIP-2 examples to `examples/multimodal`
### Limitations
- `openai-triton` examples are not supported on Windows.
### Fixed Issues
- Fixed a weight-only quant bug for Whisper to make sure that the `encoder_input_len_range` is not `0`. (#992)
- Fixed an issue where log probabilities in the Python runtime are not returned. (#983)
- Multi-GPU fixes for multimodal examples. (#1003)
- Fixed a wrong `end_id` issue for Qwen. (#987)
- Fixed a non-stopping generation issue. (#1118, #1123)
- Fixed a wrong link in `examples/mixtral/README.md`. (#1181)
- Fixed LLaMA2-7B bad results when INT8 KV cache and per-channel INT8 weight only are enabled. (#967)
- Fixed a wrong `head_size` when importing a Gemma model from HuggingFace Hub. (#1148)
- Fixed a ChatGLM2-6B building failure on INT8. (#1239)
- Fixed a wrong relative path in the Baichuan documentation. (#1242)
- Fixed a wrong `SamplingConfig` tensor in `ModelRunnerCpp`. (#1183)
- Fixed an error when converting SmoothQuant LLaMA. (#1267)
- Fixed an issue where `examples/run.py` only loads one line from `--input_file`.
- Fixed an issue where `ModelRunnerCpp` does not transfer `SamplingConfig` tensor fields correctly. (#1183)
## TensorRT-LLM Release 0.8.0
### Key Features and Enhancements
- Chunked context support (see docs/source/gpt_attention.md#chunked-context)
- LoRA support for C++ runtime (see docs/source/lora.md)
- Medusa decoding support (see examples/medusa/README.md)
  - The support is limited to the Python runtime for Ampere or newer GPUs with fp16 and bf16 accuracy, and the `temperature` parameter of the sampling configuration should be 0
- StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
- Support for batch manager to return logits from context and/or generation phases
- Include support in the Triton backend
- Support AWQ and GPTQ for QWEN
- Support ReduceScatter plugin
- Support for combining `repetition_penalty` and `presence_penalty` #274
- Support for `frequency_penalty` #275
- OOTB functionality support:
  - Baichuan
  - InternLM
  - Qwen
  - BART
- LLaMA
  - Support enabling INT4-AWQ along with FP8 KV Cache
  - Support BF16 for weight-only plugin
- Baichuan
  - P-tuning support
  - INT4-AWQ and INT4-GPTQ support
- Decoder iteration-level profiling improvements
- Add `masked_select` and `cumsum` functions for modeling
- Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
- Add Weight-Only Support To Whisper #794, thanks to the contribution from @Eddie-Wang1120
- Support FP16 fMHA on NVIDIA V100 GPU
Some features are not enabled for all models listed in the [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) folder.
### Model Updates
- Phi-1.5/2.0
- Mamba support (see examples/mamba/README.md)
- The support is limited to beam width = 1 and single-node single-GPU
- Nougat support (see examples/multimodal/README.md#nougat)
- Qwen-VL support (see examples/qwenvl/README.md)
- RoBERTa support, thanks to the contribution from @erenup
- Skywork model support
- Add example for multimodal models (BLIP with OPT or T5, LlaVA)
Refer to the {ref}`support-matrix-software` section for a list of supported models.
- API
  - Add a set of high-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
  - [BREAKING CHANGES] Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/new_workflow.md)
  - [BREAKING CHANGES] Deprecate `LayerNorm` and `RMSNorm` plugins and remove the corresponding build parameters
  - [BREAKING CHANGES] Remove the optional parameter `maxNumSequences` for GPT manager
- Fixed Issues
  - Fix an issue where the first token is abnormal when `--gather_all_token_logits` is enabled #639
  - Fix LLaMA with LoRA enabled build failure #673
  - Fix InternLM SmoothQuant build failure #705
  - Fix Bloom int8_kv_cache functionality #741
  - Fix crash in `gptManagerBenchmark` #649
  - Fix Blip2 build error #695
  - Add pickle support for `InferenceRequest` #701
  - Fix Mixtral-8x7b build failure with custom_all_reduce #825
  - Fix INT8 GEMM shape #935
  - Minor bug fixes
- Performance
  - [BREAKING CHANGES] Increase the default `freeGpuMemoryFraction` parameter from 0.85 to 0.9 for higher throughput
  - [BREAKING CHANGES] Disable the `enable_trt_overlap` argument for GPT manager by default
  - Performance optimization of beam search kernel
  - Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels
  - Custom AllReduce plugins performance optimization
  - Top-P sampling performance optimization
  - LoRA performance optimization
  - Custom allreduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
  - Integrate XQA kernels for GPT-J (beamWidth=4)
- Documentation
  - Batch manager arguments documentation updates
  - Add documentation for best practices for tuning the performance of TensorRT-LLM (see docs/source/perf_best_practices.md)
  - Add documentation for Falcon AWQ support (see examples/falcon/README.md)
  - Update the `docs/source/new_workflow.md` documentation
  - Update AWQ INT4 weight only quantization documentation for GPT-J
  - Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
  - Refine TensorRT-LLM backend README structure #133
  - Typo fix #739
## TensorRT-LLM Release 0.7.1
### Key Features and Enhancements
- Speculative decoding (preview)
- Added a Python binding for `GptManager`
- Added a Python class `ModelRunnerCpp` that wraps the C++ `gptSession` (see the sketch after this list)
- System prompt caching
- Enabled split-k for weight-only cutlass kernels
- FP8 KV cache support for the XQA kernel
- New Python builder API and `trtllm-build` command (already applied to blip2 and OPT)
- Support `StoppingCriteria` and `LogitsProcessor` in the Python generate API
- FMHA support for chunked attention and paged KV cache
- Performance enhancements include:
  - MMHA optimization for MQA and GQA
  - LoRA optimization: cutlass grouped GEMM
  - Optimize Hopper warp specialized kernels
  - Optimize `AllReduce` for parallel attention on Falcon and GPT-J
  - Enable split-k for weight-only cutlass kernels when SM >= 75
- Added {ref}`workflow` documentation
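As a rough sketch of the new `ModelRunnerCpp` wrapper mentioned above: the class name and the fact that it wraps the C++ session come from these notes, while the import path, the `from_dir`/`generate` argument names, the engine path, and the token IDs are assumptions for illustration only.

```python
import torch
from tensorrt_llm.runtime import ModelRunnerCpp  # assumed import path

runner = ModelRunnerCpp.from_dir(engine_dir="/path/to/engine_dir")  # hypothetical path

# Pre-tokenized batch of input IDs; the values are placeholders.
batch_input_ids = [torch.tensor([1, 15043, 3186], dtype=torch.int32)]

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=32,  # argument names assumed
    end_id=2,
    pad_id=2,
)
print(outputs)  # generated token IDs; decode them with the model tokenizer
```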
### Model Updates
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
### Fixed Issues
- Fixed tokenizer usage in `quantize.py` #288
- Fixed LLaMa with LoRA error
- Fixed LLaMA GPTQ failure
- Fixed Python binding for InferenceRequest issue
- Fixed CodeLlama SQ accuracy issue
### Known Issues
- The hang reported in issue #149 has not been reproduced by the TensorRT-LLM team. If it is caused by a bug in TensorRT-LLM, that bug may be present in that release.