Update TensorRT-LLM (#1233)

* Update TensorRT-LLM

---------

Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Kaiyu Xie 2024-03-05 18:32:53 +08:00 committed by GitHub
parent b7c309d1c9
commit 728cc0044b
163 changed files with 4151 additions and 3978 deletions

3rdparty/cutlass

@ -1 +1 @@
Subproject commit 8236f30675bbe98f81d11c05764b77bfcb25b8cc
Subproject commit a8f2c80db0564c74f4efccac71993b971dfc448b


@ -1,5 +1,41 @@
# Change Log
## Versions 0.7.0 / 0.7.1
* Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
* Features
- [Preview] Speculative decoding
- Add Python binding for `GptManager`
- Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
- Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
* Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMA with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
* Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
* Documentation
- Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
## Versions 0.6.0 / 0.6.1
* Models

README.md

@ -8,7 +8,7 @@ TensorRT-LLM
[![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
[![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
[![trt](https://img.shields.io/badge/TRT-9.2-green)](https://developer.nvidia.com/tensorrt)
[![version](https://img.shields.io/badge/release-0.7.1-green)](./setup.py)
[![version](https://img.shields.io/badge/release-0.9.0.dev-green)](./setup.py)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)
[Architecture](./docs/source/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/performance.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)
@ -38,6 +38,9 @@ TensorRT-LLM
## Table of Contents
- [TensorRT-LLM](#tensorrt-llm)
- [Latest News](#latest-news)
- [Table of Contents](#table-of-contents)
- [TensorRT-LLM Overview](#tensorrt-llm-overview)
- [Installation](#installation)
- [Quick Start](#quick-start)
@ -56,6 +59,8 @@ TensorRT-LLM
- [Troubleshooting](#troubleshooting)
- [Release notes](#release-notes)
- [Change Log](#change-log)
- [Versions 0.8.0](#versions-080)
- [For history change log, please see CHANGELOG.md.](#for-history-change-log-please-see-changelogmd)
- [Known Issues](#known-issues)
- [Report Issues](#report-issues)
@ -288,7 +293,7 @@ The list of supported models is:
* [Replit Code](examples/mpt)
* [RoBERTa](examples/bert)
* [SantaCoder](examples/gpt)
* [StarCoder](examples/gpt)
* [StarCoder1/StarCoder2](examples/gpt)
* [T5](examples/enc_dec)
* [Whisper](examples/whisper)
@ -402,50 +407,91 @@ For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
## Release notes
* TensorRT-LLM requires TensorRT 9.2 and 23.10 containers.
* TensorRT-LLM requires TensorRT 9.2 and 23.12 containers.
### Change Log
#### Versions 0.7.0 / 0.7.1
#### Versions 0.8.0
* Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
* Model Support
- Phi-1.5/2.0
- Mamba support (see examples/mamba/README.md)
- The support is limited to beam width = 1 and single-node single-GPU
- Nougat support (see examples/multimodal/README.md#nougat)
- Qwen-VL support (see examples/qwenvl/README.md)
- RoBERTa support, thanks to the contribution from @erenup
- Skywork model support
- Add example for multimodal models (BLIP with OPT or T5, LLaVA)
* Features
- [Preview] Speculative decoding
- Add Python binding for `GptManager`
- Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
- Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
- Chunked context support (see docs/source/gpt_attention.md#chunked-context)
- LoRA support for C++ runtime (see docs/source/lora.md)
- Medusa decoding support (see examples/medusa/README.md)
- The support is limited to Python runtime for Ampere or newer GPUs with fp16 and bf16 accuracy, and the `temperature` parameter of sampling configuration should be 0
- StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
- Support for batch manager to return logits from context and/or generation phases
- Include support in the Triton backend
- Support AWQ and GPTQ for QWEN
- Support ReduceScatter plugin
- Support for combining `repetition_penalty` and `presence_penalty` #274
- Support for `frequency_penalty` #275
- OOTB functionality support:
- Baichuan
- InternLM
- Qwen
- BART
- LLaMA
- Support enabling INT4-AWQ along with FP8 KV Cache
- Support BF16 for weight-only plugin
- Baichuan
- P-tuning support
- INT4-AWQ and INT4-GPTQ support
- Decoder iteration-level profiling improvements
- Add `masked_select` and `cumsum` function for modeling
- Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
- Add Weight-Only Support To Whisper #794, thanks to the contribution from @Eddie-Wang1120
- Support FP16 fMHA on NVIDIA V100 GPU
* API
- Add a set of High-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
- **[BREAKING CHANGES]** Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/new_workflow.md)
- **[BREAKING CHANGES]** Deprecate `LayerNorm` and `RMSNorm` plugins and removed corresponding build parameters
- **[BREAKING CHANGES]** Remove optional parameter `maxNumSequences` for GPT manager
* Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMA with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
- Fix the first token being abnormal issue when `--gather_all_token_logits` is enabled #639
- Fix LLaMA with LoRA enabled build failure #673
- Fix InternLM SmoothQuant build failure #705
- Fix Bloom int8_kv_cache functionality #741
- Fix crash in `gptManagerBenchmark` #649
- Fix Blip2 build error #695
- Add pickle support for `InferenceRequest` #701
- Fix Mixtral-8x7b build failure with custom_all_reduce #825
- Fix INT8 GEMM shape #935
- Minor bug fixes
* Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
- **[BREAKING CHANGES]** Increase default `freeGpuMemoryFraction` parameter from 0.85 to 0.9 for higher throughput
- **[BREAKING CHANGES]** Disable `enable_trt_overlap` argument for GPT manager by default
- Performance optimization of beam search kernel
- Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels
- Custom AllReduce plugins performance optimization
- Top-P sampling performance optimization
- LoRA performance optimization
- Custom allreduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
- Integrate XQA kernels for GPT-J (beamWidth=4)
* Documentation
- Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
- Batch manager arguments documentation updates
- Add documentation for best practices for tuning the performance of TensorRT-LLM (See docs/source/perf_best_practices.md)
- Add documentation for Falcon AWQ support (See examples/falcon/README.md)
- Update to the `docs/source/new_workflow.md` documentation
- Update AWQ INT4 weight only quantization documentation for GPT-J
- Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
- Refine TensorRT-LLM backend README structure #133
- Typo fix #739
#### For history change log, please see [CHANGELOG.md](./CHANGELOG.md).
### Known Issues
* On Windows, running the context FMHA plugin with FP16 accumulation on LLaMA, Mistral, and Phi models suffers from poor accuracy, and the resulting inference output may be garbled. As a workaround, enable FP32 accumulation when building the models, i.e. pass the options `--context_fmha disable --context_fmha_fp32_acc enable` to the `trtllm-build` command. This should be fixed in the next version.
* The hang reported in issue
[#149](https://github.com/triton-inference-server/tensorrtllm_backend/issues/149)
has not been reproduced by the TensorRT-LLM team. If it is caused by a bug


@ -103,7 +103,8 @@ For example, setting mean=100 and std dev=10 would generate requests where 95.4%
--tokenizer <path/to/tokenizer> \
token-norm-dist \
--num-requests 100 \
--input-mean 100 --input-stdev 10 --output-mean 15 --output-stdev 0 --num-requests 100
--input-mean 100 --input-stdev 10 \
--output-mean 15 --output-stdev 0
```
For `tokenizer`, you can specify either the path to a local tokenizer that has already been downloaded, or simply the name of a tokenizer from HuggingFace, such as `meta-llama/Llama-2-7b`; in the latter case, the tokenizer will be downloaded automatically.
@ -141,8 +142,25 @@ mpirun -n 2 ./benchmarks/gptManagerBenchmark \
--max_num_samples 500
```
To emulate `gptSessionBenchmark` static batching, you can use the `--static_emulated_batch_size` and `--static_emulated-timeout` arguments.
Given a `static_emulated_batch_size` of `n` the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated-timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count.
`gptManagerBenchmark` can also be used with the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`). This can be done by passing the argument `--api executor`. Note that the Executor class is still under development and currently does not support models with tp or pp > 1.
#### Emulated static batching
To emulate `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.
Given a `static_emulated_batch_size` of `n`, the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count. New batches will only be submitted once the previous batch has been processed completely.
`gptSessionBenchmark` uses fixed input/output lengths for benchmarking. A similar dataset for `gptManagerBenchmark` can be generated with the preprocessing script, e.g.
```
python prepare_dataset.py \
--output tokens-fixed-lengths.json \
--request-rate -1 \
--time-delay-dist constant \
--tokenizer <path/to/tokenizer> \
token-norm-dist \
--num-requests 128 \
--input-mean 60 --input-stdev 0 \
--output-mean 20 --output-stdev 0
```
Take GPT-350M as an example for a single GPU with static batching
```
@ -152,7 +170,5 @@ Take GPT-350M as an example for single GPU with static batching
--type IFB \
--static_emulated_batch_size 32 \
--static_emulated_timeout 100 \
--dataset ../../benchmarks/cpp/preprocessed_dataset.json
--dataset ../../benchmarks/cpp/tokens-fixed-lengths.json
```
`gptManagerBenchmark` can also be used with the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`). This can be done by passing the argument `--api executor`. Note that the Executor class is still under development and currently does not support models with tp or pp > 1.


@ -57,12 +57,12 @@ std::string engineFilename(
std::filesystem::path const& dataPath, WorldConfig const& worldConfig, std::string const& model)
{
auto constexpr allowExceptions = true;
auto constexpr ingoreComments = true;
auto constexpr ignoreComments = true;
auto const jsonFilePath = dataPath / "config.json";
TLLM_CHECK_WITH_INFO(
std::filesystem::exists(jsonFilePath), std::string("File does not exist: ") + jsonFilePath.string());
std::ifstream jsonStream(jsonFilePath);
auto const json = nlohmann::json::parse(jsonStream, nullptr, allowExceptions, ingoreComments);
auto const json = nlohmann::json::parse(jsonStream, nullptr, allowExceptions, ignoreComments);
auto const& builderConfig = json.at("builder_config");
auto const precision = builderConfig.at("precision").template get<std::string>();
auto const worldSize = builderConfig.at("tensor_parallel").template get<SizeType>();
@ -97,9 +97,9 @@ void benchmarkBert(std::string const& modelName, std::filesystem::path const& da
allocator.setZero(*inputIdsBuffer);
tensorMap.insert(std::make_pair("input_ids", inputIdsBuffer));
// input_lengths
std::vector<SizeType> inputLenghtsHost(batchSize);
std::vector<SizeType> inputLengthsHost(batchSize);
auto inLensBuffer = std::shared_ptr<ITensor>{
allocator.copyFrom(inputLenghtsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU)};
allocator.copyFrom(inputLengthsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU)};
allocator.setZero(*inLensBuffer);
tensorMap.insert(std::make_pair("input_lengths", inLensBuffer));


@ -1049,12 +1049,8 @@ int main(int argc, char* argv[])
padId = result["pad_id"].as<int>();
}
std::optional<int32_t> eosId;
// Argument: End-of-sentence token id
if (result.count("eos_id"))
{
eosId = result["eos_id"].as<int>();
}
std::optional<int32_t> eosId = result["eos_id"].as<int>();
std::optional<int> staticEmulatedBatchSize;
// Argument: Static emulated batch size


@ -120,9 +120,9 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
auto peakMemFuture = std::async(&monitorMemory, std::ref(done));
TLLM_LOG_INFO(memoryCounter.toString());
std::vector<SizeType> inputLenghtsHost(batchSize, maxInputLength);
auto inputLenghts
= bufferManager.copyFrom(inputLenghtsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU);
std::vector<SizeType> inputLengthsHost(batchSize, maxInputLength);
auto inputLengths
= bufferManager.copyFrom(inputLengthsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU);
// copy inputs and wrap into shared_ptr
GenerationInput::TensorPtr inputIds;
@ -147,7 +147,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
TLLM_LOG_INFO(memoryCounter.toString());
GenerationInput generationInput{
endId, padId, std::move(inputIds), std::move(inputLenghts), inputPacked};
endId, padId, std::move(inputIds), std::move(inputLengths), inputPacked};
// runtime will allocate memory for output if this tensor is empty
GenerationOutput generationOutput{
@ -183,6 +183,8 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
int iterIdx = 0;
float curDuration = 0;
std::vector<float> latencies;
std::vector<float> generationTimes;
auto generationProfiler = std::make_shared<GptSession::GenerationProfiler>();
while (iterIdx < numRuns || curDuration / 1000 < duration)
{
auto const start = std::chrono::steady_clock::now();
@ -190,7 +192,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
generationOutput.onTokenGenerated
= [&numSteps, maxNewTokens](GenerationOutput::TensorPtr const& outputIds, SizeType step,
bool finished) { ++numSteps; };
session.generate(generationOutput, generationInput, samplingConfig);
session.generate(generationOutput, generationInput, samplingConfig, generationProfiler);
bufferManager.getStream().synchronize();
auto const end = std::chrono::steady_clock::now();
@ -198,6 +200,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
float latency = std::chrono::duration<float, std::milli>(end - start).count();
curDuration += latency;
latencies.emplace_back(latency);
generationTimes.emplace_back(generationProfiler->getElapsedTimeMs());
}
TLLM_LOG_INFO(memoryCounter.toString());
@ -231,12 +234,16 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
{
auto const averageLatency = curDuration / iterIdx;
float const tokensPerSec = batchSize * maxNewTokens / (averageLatency / 1000);
auto const avgGenerationTime
= std::reduce(generationTimes.begin(), generationTimes.end(), 0.0f) / generationTimes.size();
float const generationTokensPerSec = batchSize * maxNewTokens / (avgGenerationTime / 1000);
// convert to GB
float const peakMemGB = peakMem / 1e9;
printf(
"[BENCHMARK] batch_size %d input_length %d output_length %d latency(ms) %.2f tokensPerSec "
"%.2f gpu_peak_mem(gb) %.2f\n",
batchSize, maxInputLength, maxNewTokens, averageLatency, tokensPerSec, peakMemGB);
"%.2f generation_time(ms) %.2f generationTokensPerSec %.2f gpu_peak_mem(gb) %.2f\n",
batchSize, maxInputLength, maxNewTokens, averageLatency, tokensPerSec, avgGenerationTime,
generationTokensPerSec, peakMemGB);
}
// logits are stored in the last rank
@ -246,7 +253,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
{
std::cout << "generationOutput.contextLogits.shape: "
<< generationOutput.contextLogits->getShape()
<< std::endl; // (batchsize, prompt_len, vocabsize)
<< std::endl; // (batch_size, prompt_len, vocab_size)
std::cout << "generationOutput.contextLogits: " << *generationOutput.contextLogits << std::endl;
}
@ -254,7 +261,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
{
std::cout << "generationOutput.generationLogits.shape: "
<< generationOutput.generationLogits->getShape()
<< std::endl; // (batchsize, beamwidth, maxNewTokens, vocabsize)
<< std::endl; // (batch_size, beam_width, maxNewTokens, vocab_size)
generationOutput.generationLogits->reshape(ITensor::makeShape({batchSize * beamWidth,
maxNewTokens, modelConfig.getVocabSizePadded(worldConfig.getSize())}));
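
The benchmark hunks above add a second throughput figure: besides tokens/s derived from end-to-end wall-clock latency, the loop now also reports tokens/s over the GPU generation time collected by the new `GenerationProfiler`. Below is a minimal standalone sketch of that arithmetic; the struct and function names are illustrative and not part of the benchmark.

```cpp
#include <numeric>
#include <vector>

// Illustrative only: mirrors the two figures printed by the patched benchmark.
struct ThroughputReport
{
    float tokensPerSec;           // based on end-to-end latency (includes host-side work)
    float generationTokensPerSec; // based on GPU generation time from the profiler
};

ThroughputReport computeThroughput(int batchSize, int maxNewTokens, std::vector<float> const& latenciesMs,
    std::vector<float> const& generationTimesMs)
{
    auto const avg = [](std::vector<float> const& v)
    { return std::reduce(v.begin(), v.end(), 0.0f) / static_cast<float>(v.size()); };

    float const avgLatencyMs = avg(latenciesMs);
    float const avgGenerationMs = avg(generationTimesMs);
    auto const totalTokens = static_cast<float>(batchSize * maxNewTokens);
    return {totalTokens / (avgLatencyMs / 1000.f), totalTokens / (avgGenerationMs / 1000.f)};
}
```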


@ -75,6 +75,43 @@ class BaseBenchmark(object):
self.quant_mode = QuantMode(0)
self.enable_fp8 = False
if engine_dir is not None:
# Read config from engine directory
config_path = os.path.join(engine_dir, 'config.json')
with open(config_path, 'r') as f:
self.config = json.load(f)
# Sanity checks
if 'pretrained_config' in self.config: # new build api branch
config_dtype = self.config['pretrained_config']['dtype']
assert dtype == config_dtype, f"Engine dtype ({config_dtype}) != Runtime dtype ({dtype})"
world_size = self.config['pretrained_config']['mapping'][
'world_size']
assert world_size == self.world_size, \
(f'Engine world size ({world_size}) != Runtime world size ({self.world_size})')
# Load config into self
for key, value in self.config['pretrained_config'].items():
setattr(self, key, value)
self.quant_mode = QuantMode.from_quant_algo(
quant_algo=self.quantization['quant_algo'],
kv_cache_quant_algo=self.quantization['kv_cache_quant_algo']
)
self.enable_fp8 = self.quant_mode.has_fp8_qdq()
self.fp8_kv_cache = self.quant_mode.has_fp8_kv_cache()
for key, value in self.config['build_config'].items():
setattr(self, key, value)
for key, value in self.plugin_config.items():
if "plugin" in key:
key = "use_" + key
setattr(self, key, value)
self.engine_name = f"rank{self.runtime_rank}.engine"
self.num_kv_heads = self.num_key_value_heads
self.num_layers = self.num_hidden_layers
self.num_heads = self.num_attention_heads
else:
# Read config from engine directory
config_path = os.path.join(engine_dir, 'config.json')
with open(config_path, 'r') as f:
@ -100,9 +137,14 @@ class BaseBenchmark(object):
if "plugin" in key:
key = "use_" + key
setattr(self, key, value)
self.engine_name = get_engine_name(self.engine_model_name,
self.dtype, self.world_size,
self.runtime_rank)
else:
self.engine_name = get_engine_name(self.engine_model_name,
self.dtype, self.world_size,
self.runtime_rank)
self.engine_name = get_engine_name(self.engine_model_name, self.dtype,
self.world_size, self.runtime_rank)
self.runtime_mapping = tensorrt_llm.Mapping(world_size=self.world_size,
rank=self.runtime_rank,
tp_size=self.world_size)


@ -53,11 +53,12 @@ def parse_arguments():
'--mode',
type=str,
default="plugin",
choices=['ootb', 'plugin', 'ootb-except-mha'],
choices=['ootb', 'plugin', 'plugin-ifb', 'ootb-except-mha'],
help=
('Choose mode between ootb/plugin/ootb-except-mha. '
'\"ootb\" means the engines will be built without any plugins, '
'\"plugin\" means the engines will be built with tuned recipe of using plugins.'
'\"plugin-ifb\" will include additional options required for inflight batching.'
'\"ootb-except-mha\" means the engines will be built with only attention plugins.'
))
@ -749,7 +750,7 @@ def build_gpt(args):
network.plugin_config.to_legacy_setting()
# Plugins
if args.mode == 'plugin':
if args.mode in ['plugin', 'plugin-ifb']:
network.plugin_config.set_gpt_attention_plugin(dtype=args.dtype)
network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
network.plugin_config.enable_remove_input_padding()
@ -773,6 +774,10 @@ def build_gpt(args):
# RMS norm plugin for SmoothQuant
network.plugin_config.set_rmsnorm_quantization_plugin(
dtype=args.dtype)
# Inflight batching
if args.mode == 'plugin-ifb':
network.plugin_config.enable_paged_kv_cache()
elif args.mode == 'ootb-except-mha':
network.plugin_config.set_gpt_attention_plugin(dtype=args.dtype)
network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
@ -801,7 +806,7 @@ def build_gpt(args):
else:
tensorrt_llm_model(*inputs)
if args.mode == 'plugin':
if args.mode in ['plugin', 'plugin-ifb']:
tensorrt_llm.graph_rewriting.optimize(network)
# Network -> Engine


@ -109,6 +109,7 @@ class GPTBenchmark(BaseBenchmark):
if not hasattr(self, 'num_kv_heads') or self.num_kv_heads is None:
self.num_kv_heads = self.num_heads
model_config = tensorrt_llm.runtime.ModelConfig(
max_batch_size=self.max_batch_size,
max_beam_width=self.num_beams,
@ -118,6 +119,9 @@ class GPTBenchmark(BaseBenchmark):
num_kv_heads=ceil(self.num_kv_heads / self.world_size),
hidden_size=self.hidden_size // self.world_size,
gpt_attention_plugin=self.use_gpt_attention_plugin,
paged_kv_cache=self.paged_kv_cache if hasattr(
self, 'paged_kv_cache') else False,
dtype=self.dtype,
remove_input_padding=self.remove_input_padding,
quant_mode=self.quant_mode,
use_custom_all_reduce=self.use_custom_all_reduce,


@ -120,7 +120,7 @@ public:
{
if (req.getEmbeddingBias())
{
mEmbeddingBias = executor::detail::toITensor(*(req.getEmbeddingBias().value()));
mEmbeddingBias = executor::detail::toITensor(req.getEmbeddingBias().value());
// Add leading 1 dimension since that's what IFB code expects
mEmbeddingBias.value()->unsqueeze(0);
}
@ -136,7 +136,7 @@ public:
auto pTuningConfig = req.getPromptTuningConfig();
if (pTuningConfig)
{
mPromptEmbeddingTable = executor::detail::toITensor(*pTuningConfig.value().getEmbeddingTable());
mPromptEmbeddingTable = executor::detail::toITensor(pTuningConfig.value().getEmbeddingTable());
TLLM_CHECK(mPromptEmbeddingTable.value()->getShape().nbDims == 2);
mPromptVocabSize = mPromptEmbeddingTable.value()->getShape().d[0];
mPromptEmbeddingTable.value()->unsqueeze(0);
@ -145,10 +145,10 @@ public:
auto loraConfig = req.getLoraConfig();
if (loraConfig)
{
mLoraWeights = executor::detail::toITensor(*loraConfig.value().getWeights());
mLoraWeights = executor::detail::toITensor(loraConfig.value().getWeights());
mLoraWeights.value()->unsqueeze(0);
mLoraConfig = executor::detail::toITensor(*loraConfig.value().getConfig());
mLoraConfig = executor::detail::toITensor(loraConfig.value().getConfig());
mLoraConfig.value()->unsqueeze(0);
}
@ -159,7 +159,7 @@ public:
if (speculativeDecodingConfig.value().getLogits())
{
mDraftLogits = executor::detail::toITensor(*speculativeDecodingConfig.value().getLogits().value());
mDraftLogits = executor::detail::toITensor(speculativeDecodingConfig.value().getLogits().value());
}
// NOTE: Draft acceptance threshold is stored in mSamplingConfig
@ -551,7 +551,7 @@ public:
return mState == REQUEST_STATE_CONTEXT_INIT;
}
[[nodiscard]] bool isGenerationInProgessState() const noexcept
[[nodiscard]] bool isGenerationInProgressState() const noexcept
{
return mState == REQUEST_STATE_GENERATION_IN_PROGRESS;
}
@ -680,14 +680,12 @@ public:
if (getReturnContextLogits())
{
result.contextLogits
= std::make_shared<executor::Tensor>(executor::detail::ofITensor(getContextLogitsHost()));
result.contextLogits = executor::detail::ofITensor(getContextLogitsHost());
}
if (getReturnGenerationLogits())
{
result.generationLogits
= std::make_shared<executor::Tensor>(executor::detail::ofITensor(getGenerationLogitsHost()));
result.generationLogits = executor::detail::ofITensor(getGenerationLogitsHost());
}
// Update position of last sent response


@ -237,12 +237,11 @@ public:
void bcast(runtime::IBuffer& buf, int root) const
{
TLLM_CHECK(buf.getMemoryType() != runtime::MemoryType::kGPU);
bcast(buf.data(), buf.getSizeInBytes(), MpiType::kBYTE, root);
}
template <typename T>
void bcast(T& value, int root) const
void bcastValue(T& value, int root) const
{
if constexpr (std::is_fundamental_v<std::remove_cv_t<T>>)
{


@ -99,18 +99,18 @@ struct OutputConfig
class SpeculativeDecodingConfig
{
public:
explicit SpeculativeDecodingConfig(VecTokens tokens, std::optional<TensorPtr> logits = std::nullopt,
explicit SpeculativeDecodingConfig(VecTokens tokens, std::optional<Tensor> logits = std::nullopt,
std::optional<FloatType> acceptanceThreshold = std::nullopt);
~SpeculativeDecodingConfig();
[[nodiscard]] VecTokens getTokens() const;
[[nodiscard]] std::optional<TensorPtr> getLogits() const;
[[nodiscard]] std::optional<Tensor> getLogits() const;
[[nodiscard]] std::optional<FloatType> getAcceptanceThreshold() const;
private:
VecTokens mTokens;
std::optional<TensorPtr> mLogits;
std::optional<Tensor> mLogits;
std::optional<FloatType> mAcceptanceThreshold;
};
@ -122,28 +122,28 @@ public:
/// @param embeddingTable The prompt embedding table. Data type must match model weights. Shape [vocabSize,
/// hiddenSize]
/// @param vocabSize
PromptTuningConfig(TensorPtr embeddingTable);
PromptTuningConfig(Tensor embeddingTable);
~PromptTuningConfig();
[[nodiscard]] TensorPtr getEmbeddingTable() const;
[[nodiscard]] Tensor getEmbeddingTable() const;
private:
TensorPtr mEmbeddingTable;
Tensor mEmbeddingTable;
};
/// @brief Configuration for LoRA
class LoraConfig
{
public:
LoraConfig(TensorPtr weights, TensorPtr config);
LoraConfig(Tensor weights, Tensor config);
~LoraConfig();
[[nodiscard]] TensorPtr getWeights() const;
[[nodiscard]] TensorPtr getConfig() const;
[[nodiscard]] Tensor getWeights() const;
[[nodiscard]] Tensor getConfig() const;
private:
TensorPtr mWeights;
TensorPtr mConfig;
Tensor mWeights;
Tensor mConfig;
};
/// @brief A class that holds information about the request
@ -169,7 +169,7 @@ public:
std::optional<SizeType> endId = std::nullopt, std::optional<SizeType> padId = std::nullopt,
std::optional<std::list<VecTokens>> badWords = std::nullopt,
std::optional<std::list<VecTokens>> stopWords = std::nullopt,
std::optional<TensorPtr> embeddingBias = std::nullopt,
std::optional<Tensor> embeddingBias = std::nullopt,
std::optional<SpeculativeDecodingConfig> speculativeDecodingConfig = std::nullopt,
std::optional<PromptTuningConfig> pTuningConfig = std::nullopt,
std::optional<LoraConfig> loraConfig = std::nullopt);
@ -189,7 +189,7 @@ public:
[[nodiscard]] std::optional<SizeType> getPadId() const;
[[nodiscard]] std::optional<std::list<VecTokens>> getBadWords() const;
[[nodiscard]] std::optional<std::list<VecTokens>> getStopWords() const;
[[nodiscard]] std::optional<TensorPtr> getEmbeddingBias() const;
[[nodiscard]] std::optional<Tensor> getEmbeddingBias() const;
[[nodiscard]] std::optional<SpeculativeDecodingConfig> getSpeculativeDecodingConfig() const;
[[nodiscard]] std::optional<PromptTuningConfig> getPromptTuningConfig() const;
[[nodiscard]] std::optional<LoraConfig> getLoraConfig() const;
@ -201,7 +201,7 @@ public:
void setPadId(SizeType padId);
void setBadWords(std::list<VecTokens> badWords);
void setStopWords(std::list<VecTokens> stopWords);
void setEmbeddingBias(TensorPtr);
void setEmbeddingBias(Tensor);
void setSpeculativeDecodingConfig(SpeculativeDecodingConfig specDecodingConfig);
void setPromptTuningConfig(PromptTuningConfig pTuningConfig);
void setLoraConfig(LoraConfig loraConfig);
@ -222,8 +222,8 @@ struct Result
std::optional<VecLogProbs> cumLogProbs; // [beamSize]
std::optional<std::vector<VecLogProbs>> logProbs; // [beamSize, seqLen]
std::optional<TensorPtr> contextLogits; // [promptLen, vocab_size_padded]
std::optional<TensorPtr> generationLogits; // [beam_size, mMaxNewTokens, vocab_size_padded]
std::optional<Tensor> contextLogits; // [promptLen, vocab_size_padded]
std::optional<Tensor> generationLogits; // [beam_size, mMaxNewTokens, vocab_size_padded]
};
/// @brief Class that holds either an error or a result


@ -92,6 +92,46 @@ public:
std::optional<SizeType> ctxMicroBatchSize = std::nullopt;
std::optional<SizeType> genMicroBatchSize = std::nullopt;
std::optional<DecodingMode> decodingMode = std::nullopt;
bool normalizeLogProbs = true;
};
//! @brief Optional profiler class to profile the generation phase of an inference request
class GenerationProfiler
{
public:
// Use a constexpr variable to resolve the ambiguous match for overloaded CudaEvent constructor
static constexpr unsigned int flags{cudaEventDefault};
GenerationProfiler()
: start(flags)
, end(flags)
{
}
CudaEvent const& getStart() const
{
return start;
}
CudaEvent const& getEnd() const
{
return end;
}
float getElapsedTimeMs()
{
start.synchronize();
end.synchronize();
float result;
TLLM_CUDA_CHECK(::cudaEventElapsedTime(&result, start.get(), end.get()));
return result;
}
private:
CudaEvent start;
CudaEvent end;
};
GptSession(Config const& sessionConfig, GptModelConfig const& modelConfig, WorldConfig const& worldConfig,
@ -129,9 +169,15 @@ public:
return mDevice;
}
[[nodiscard]] bool getNormalizeLogProbs() const noexcept
{
return mNormalizeLogProbs;
}
[[nodiscard]] nvinfer1::DataType getLogitDataType() const;
void generate(GenerationOutput& outputs, GenerationInput const& inputs, SamplingConfig const& samplingConfig);
void generate(GenerationOutput& outputs, GenerationInput const& inputs, SamplingConfig const& samplingConfig,
std::shared_ptr<GenerationProfiler> const generationProfiler = nullptr);
private:
[[nodiscard]] bool useCudaGraphs()
@ -141,7 +187,7 @@ private:
void generateBatched(std::vector<GenerationOutput>& microBatchesOutputs,
std::vector<GenerationInput> const& microBatchesInputs, SamplingConfig const& samplingConfig,
TokenGeneratedCallback const& onTokenGenerated);
TokenGeneratedCallback const& onTokenGenerated, std::shared_ptr<GenerationProfiler> const generationProfiler);
void setup(Config const& sessionConfig);
@ -154,9 +200,8 @@ private:
SizeType sinkTokenLength, SizeType maxSequenceLength, KvCacheConfig const& config);
void createCustomAllReduceWorkspace(SizeType batchSize, SizeType beamWidth, SizeType maxSequenceLength);
void executeContextStep(std::vector<GenerationInput> const& microBatchesInputs,
std::vector<GenerationOutput>& microBatchesOutputs, std::vector<SizeType> const& microBatchOffsets,
KvCacheManager const* kvCacheManager);
void executeContextStep(std::vector<GenerationInput> const& generationBatchesInputs,
std::vector<SizeType> const& generationBatchesOffsets, KvCacheManager const* kvCacheManager);
SizeType executeGenerationStep(SizeType step, std::vector<GenerationInput> const& microBatchesInputs,
std::vector<GenerationOutput>& microBatchesOutputs, std::vector<SizeType> const& microBatchOffsets,
KvCacheManager* kvCacheManager, std::vector<bool>& microBatchesFinished);
@ -275,6 +320,8 @@ private:
bool mCudaGraphMode{false};
// ping-pong instances
std::vector<CudaGraphExecutor> mCudaGraphInstances;
bool mNormalizeLogProbs = true;
};
} // namespace tensorrt_llm::runtime
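
Combining this header with the `gptSessionBenchmark.cpp` change earlier in the diff, a minimal sketch of how a caller might use the new profiler overload is shown below. It assumes the session, inputs, outputs, and sampling config have already been constructed exactly as the benchmark does before its timing loop; the include path is inferred from the repository layout.

```cpp
#include "tensorrt_llm/runtime/gptSession.h"

#include <cstdio>
#include <memory>

using namespace tensorrt_llm::runtime;

// Sketch only: all arguments are assumed to be set up as in gptSessionBenchmark.cpp.
void timedGenerate(GptSession& session, GenerationOutput& outputs, GenerationInput const& inputs,
    SamplingConfig const& samplingConfig)
{
    auto profiler = std::make_shared<GptSession::GenerationProfiler>();

    // The new optional argument records CUDA events around the generation phase.
    session.generate(outputs, inputs, samplingConfig, profiler);

    // getElapsedTimeMs() synchronizes both events before calling cudaEventElapsedTime.
    std::printf("generation_time(ms) %.2f\n", profiler->getElapsedTimeMs());
}
```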


@ -24,7 +24,7 @@
namespace tensorrt_llm::runtime
{
void setPeerAccess(WorldConfig worldConfig, bool enable = true);
void setPeerAccess(WorldConfig const& worldConfig, bool enable = true);
class IpcMemory
{
@ -33,7 +33,7 @@ public:
size_t static constexpr FLAGS_SIZE = kernels::MAX_ALL_REDUCE_BLOCKS * sizeof(uint32_t);
IpcMemory(WorldConfig worldConfig, std::size_t bufferSize);
IpcMemory(WorldConfig const& worldConfig, std::size_t bufferSize);
~IpcMemory();
[[nodiscard]] const std::vector<void*>& getCommPtrsTensor() const
@ -48,7 +48,7 @@ private:
WorldConfig mWorldConfig;
std::vector<void*> mCommPtrs;
std::size_t mBufferSize;
void* mBufferPtr;
void* mBufferPtr{nullptr};
};
} // namespace tensorrt_llm::runtime


@ -195,8 +195,8 @@ set(TRTLLM_LINK_LIBS
${TRT_LIB}
common_src
kernels_src
cutlass_src_pre_hopper
cutlass_src_hopper
cutlass2_src
cutlass3_src
layers_src
runtime_src)


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9fd644e0a38b1d4d1a54d4b7b834cc6b0110a5771fcfc480e96795b3f9bc892
size 2081046
oid sha256:0ecc134ad10a54b2953c772e72db2f71e84130d5736087b033e9e5b78594db6d
size 2113376


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:90436c59eb243a0156e3f0aa95412a7caacbefdcde768c158edc4b821044dfd1
size 2102486
oid sha256:9aa3f3d7f8313c099df8e9bd4c9707922a4f1c4025c4c99986acf6df781738c7
size 2128450


@ -1,3 +1,3 @@
f53c02e3829b516a6e9221745bcbacbd libtensorrt_llm_batch_manager_static.a
9e92e5dbb104e3e676952ea40c81916f libtensorrt_llm_batch_manager_static.pre_cxx11.a
25adff90cc350eb9ca9804051a08de80d547c113 commit
add62ff328028bbcded1af694fe758c5 libtensorrt_llm_batch_manager_static.a
9e8846e200e2aaaeace862741a90c3ab libtensorrt_llm_batch_manager_static.pre_cxx11.a
230623fa285048a2de5c54c2cc0f364fb9f2c559 commit


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3433d7b52bb6dcac32111172cb6201a9fee56e739f3660895083baebd1b89ee
size 2033616
oid sha256:7b25de974b6ca5f0dcb279f16f38199167d1efc35c01770d3234bec2dfb5dc86
size 2097848


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fb3f4145881984de6268c34f7e5d452f78f54952f454f747a1cd52bc3171de62
size 2012002
oid sha256:5f06cee5ae2bcf393196265cd9a3ef832690cd4c5c53934bbfb169d50ab33c41
size 2055004


@ -1,2 +1,2 @@
d60b12741e940f56addaf2d92e78b50f libtensorrt_llm_batch_manager_static.a
c55e606a3430d3a56cee3968a77b46f1 libtensorrt_llm_batch_manager_static.pre_cxx11.a
bb62a31b8e17dae284d784ba43d5bc02 libtensorrt_llm_batch_manager_static.a
19327f59c7f5b6235e15b322d5f5a0f4 libtensorrt_llm_batch_manager_static.pre_cxx11.a


@ -146,7 +146,7 @@ void CublasMMWrapper::Gemm(cublasOperation_t transa, cublasOperation_t transb, c
{
check_cuda_error(cublasSetStream(getCublasHandle(), mStream));
check_cuda_error(cublasSetWorkspace(getCublasHandle(), mCublasWorkspace, workspaceSize));
// Go with default heruistic to choose tactic as cuBLAS does not allow to choose tactics in Ampere+
// Go with default heuristic to choose tactic as cuBLAS does not allow to choose tactics in Ampere+
cublasGemmAlgo_t cublasAlgo = CUBLAS_GEMM_DEFAULT;
check_cuda_error(cublasGemmEx(getCublasHandle(), transa, transb, m, n, k, alpha, A, mAType, lda, B, mBType, ldb,
beta, C, mCType, ldc, mComputeType, static_cast<cublasGemmAlgo_t>(cublasAlgo)));
@ -318,7 +318,7 @@ std::vector<cublasLtMatmulHeuristicResult_t> CublasMMWrapper::getTactics(cublasL
uint64_t workspace_size = CUBLAS_WORKSPACE_SIZE;
check_cuda_error(cublasLtMatmulPreferenceSetAttribute(
preference, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES, &workspace_size, sizeof(workspace_size)));
// Restrict reduction algorithms for numerical stability and better determenism
// Restrict reduction algorithms for numerical stability and better determinism
uint32_t reduction_mask = CUBLASLT_REDUCTION_SCHEME_MASK;
check_cuda_error(cublasLtMatmulPreferenceSetAttribute(
preference, CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK, &reduction_mask, sizeof(reduction_mask)));


@ -283,23 +283,23 @@ inline std::tuple<size_t, size_t> getDeviceMemoryInfo(const bool useUvm)
{
if (useUvm)
{
size_t freeSysmem, totalSysmem;
size_t freeSysMem, totalSysMem;
#ifndef _WIN32 // Linux
struct sysinfo info;
sysinfo(&info);
totalSysmem = info.totalram * info.mem_unit;
freeSysmem = info.freeram * info.mem_unit;
totalSysMem = info.totalram * info.mem_unit;
freeSysMem = info.freeram * info.mem_unit;
#else // Windows
MEMORYSTATUSEX memInfo;
memInfo.dwLength = sizeof(memInfo);
GlobalMemoryStatusEx(&memInfo);
totalSysmem = memInfo.ullTotalPhys;
freeSysmem = memInfo.ullAvailPhys;
totalSysMem = memInfo.ullTotalPhys;
freeSysMem = memInfo.ullAvailPhys;
#endif // WIN32
TLLM_LOG_INFO("Using UVM based system memory for KV cache, total memory %0.2f GB, available memory %0.2f GB",
((double) totalSysmem / 1e9), ((double) freeSysmem / 1e9));
return {freeSysmem, totalSysmem};
((double) totalSysMem / 1e9), ((double) freeSysMem / 1e9));
return {freeSysMem, totalSysMem};
}
else
{


@ -29,35 +29,29 @@ Logger::Logger()
int deviceId;
cudaGetDevice(&deviceId);
char* levelName = std::getenv("TLLM_LOG_LEVEL");
auto const* levelName = std::getenv("TLLM_LOG_LEVEL");
if (levelName != nullptr)
{
std::map<std::string, Level> nameToLevel = {
{"TRACE", TRACE},
{"DEBUG", DEBUG},
{"INFO", INFO},
{"WARNING", WARNING},
{"ERROR", ERROR},
};
auto level = nameToLevel.find(levelName);
auto level = [levelName = std::string(levelName)]()
{
if (levelName == "TRACE")
return TRACE;
if (levelName == "DEBUG")
return DEBUG;
if (levelName == "INFO")
return INFO;
if (levelName == "WARNING")
return WARNING;
if (levelName == "ERROR")
return ERROR;
TLLM_THROW("Invalid log level: %s", levelName.c_str());
}();
// If TLLM_LOG_FIRST_RANK_ONLY=ON, set LOG LEVEL of other device to ERROR
if (isFirstRankOnly && deviceId != 0)
{
level = nameToLevel.find("ERROR");
}
if (level != nameToLevel.end())
{
setLevel(level->second);
}
else
{
fprintf(stderr,
"[TensorRT-LLM][WARNING] Invalid logger level TLLM_LOG_LEVEL=%s. "
"Ignore the environment variable and use a default "
"logging level.\n",
levelName);
levelName = nullptr;
level = ERROR;
}
setLevel(level);
}
}


@ -18,10 +18,10 @@
#include <cstdlib>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include "tensorrt_llm/common/assert.h"
#include "tensorrt_llm/common/stringUtils.h"
namespace tensorrt_llm::common
@ -88,13 +88,11 @@ public:
void setLevel(const Level level)
{
level_ = level;
log(INFO, "Set logger level by %s", getLevelName(level).c_str());
log(INFO, "Set logger level by %s", getLevelName(level));
}
private:
const std::string PREFIX = "[TensorRT-LLM]";
std::map<Level, std::string> level_name_
= {{TRACE, "TRACE"}, {DEBUG, "DEBUG"}, {INFO, "INFO"}, {WARNING, "WARNING"}, {ERROR, "ERROR"}};
static auto constexpr kPREFIX = "[TensorRT-LLM]";
#ifndef NDEBUG
const Level DEFAULT_LOG_LEVEL = DEBUG;
@ -105,19 +103,28 @@ private:
Logger(); // NOLINT(modernize-use-equals-delete)
inline std::string getLevelName(const Level level)
static inline char const* getLevelName(const Level level)
{
return level_name_[level];
switch (level)
{
case TRACE: return "TRACE";
case DEBUG: return "DEBUG";
case INFO: return "INFO";
case WARNING: return "WARNING";
case ERROR: return "ERROR";
}
inline std::string getPrefix(const Level level)
{
return PREFIX + "[" + getLevelName(level) + "] ";
TLLM_THROW("Unknown log level: %d", level);
}
inline std::string getPrefix(const Level level, const int rank)
static inline std::string getPrefix(const Level level)
{
return PREFIX + "[" + getLevelName(level) + "][" + std::to_string(rank) + "] ";
return fmtstr("%s[%s] ", kPREFIX, getLevelName(level));
}
static inline std::string getPrefix(const Level level, const int rank)
{
return fmtstr("%s[%s][%d] ", kPREFIX, getLevelName(level), rank);
}
};
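
Both logger changes above trade `std::map` lookups for direct conversions: an immediately invoked lambda turns the `TLLM_LOG_LEVEL` string into an enum value in `logger.cpp`, and a `switch` over string literals replaces the member map in `logger.h`. The self-contained sketch below illustrates the same pattern without any TensorRT-LLM types; the enum and function names are made up for the example, and the real logger's handling of unset or invalid values differs slightly.

```cpp
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>

enum Level { TRACE, DEBUG, INFO, WARNING, ERROR };

// Enum -> name via a switch over string literals (no std::map, no allocation).
static char const* levelName(Level level)
{
    switch (level)
    {
    case TRACE: return "TRACE";
    case DEBUG: return "DEBUG";
    case INFO: return "INFO";
    case WARNING: return "WARNING";
    case ERROR: return "ERROR";
    }
    throw std::invalid_argument("unknown level");
}

int main()
{
    // Name -> enum via an immediately invoked lambda, mirroring the logger.cpp rewrite.
    auto const* raw = std::getenv("TLLM_LOG_LEVEL");
    Level const level = [name = std::string(raw != nullptr ? raw : "INFO")]()
    {
        if (name == "TRACE") return TRACE;
        if (name == "DEBUG") return DEBUG;
        if (name == "INFO") return INFO;
        if (name == "WARNING") return WARNING;
        if (name == "ERROR") return ERROR;
        throw std::invalid_argument("invalid TLLM_LOG_LEVEL: " + name);
    }();

    std::printf("effective log level: %s\n", levelName(level));
    return 0;
}
```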


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:13e17e2d9a94d2bc1b131d096a3722a83a67ab115fa8271b57b27f7e2877bdc1
size 587334
oid sha256:4201c7241d53298ca52d4f1447cc9cbc4024f63b42a24cbcff82192cc10bed67
size 576098


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:45438204eba812694bd30b68cfc9bb2bc54a8a59c6c86e037bbc4ac7e5f8230c
size 589438
oid sha256:2960feb2c7ad941a473408e2f6fd8c324f60f6af3c4d8f11217c676fd830e4cb
size 578660


@ -1,3 +1,3 @@
835767a37292ea9786c0d6149ae270f4 libtensorrt_llm_executor_static.a
1fe0c9ac7a1a35ce7d80676146867374 libtensorrt_llm_executor_static.pre_cxx11.a
25adff90cc350eb9ca9804051a08de80d547c113 commit
8a8d6505d9ef62cb2eeb8c75a5ee5bbb libtensorrt_llm_executor_static.a
e3b8edc619c99a7f125fe81bc8554ff0 libtensorrt_llm_executor_static.pre_cxx11.a
230623fa285048a2de5c54c2cc0f364fb9f2c559 commit


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7969768d3b9a65182ee519c60e11f27b0a088c2c0b732f3780d7c0c563dbb180
size 587776
oid sha256:cde295fa290b15b3d76b8e8b2cc435d7fceb2f456d8cb4d9b22ee2cf3ddbd344
size 588504


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:98d9b7c4a586f0be0499a0df487cacba69985ce43ca5fd543c90c6a368c91b67
size 571150
oid sha256:54ac66f3555bff4ed28ba0352bcb4a0f541346592cf109b491071b6374e5238c
size 562260


@ -1,2 +1,2 @@
6771f94e0bce39c6cab391cf1f92484c libtensorrt_llm_executor_static.a
84b7550448f8710de17644a5d404178f libtensorrt_llm_executor_static.pre_cxx11.a
ee96c6e2742539da0e8d732635f84449 libtensorrt_llm_executor_static.a
9154564ed926ffbcdb83e7eac3504fa0 libtensorrt_llm_executor_static.pre_cxx11.a


@ -18,7 +18,8 @@
file(GLOB_RECURSE SRC_CPP *.cpp)
file(GLOB_RECURSE SRC_CU *.cu)
# This can happen when not building for Torch
# The Python executable will only be defined if building with Torch support. If
# not, we need to find it here.
if(NOT Python3_EXECUTABLE)
find_package(
Python3
@ -57,17 +58,13 @@ endif()
file(GLOB_RECURSE CU_INSTANTIATIONS ${CMAKE_CURRENT_BINARY_DIR}/*.cu)
add_library(cutlass_src_pre_hopper STATIC ${SRC_CPP} ${SRC_CU})
set_property(TARGET cutlass_src_pre_hopper PROPERTY POSITION_INDEPENDENT_CODE
ON)
set_property(TARGET cutlass_src_pre_hopper PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS
ON)
add_library(cutlass2_src STATIC ${SRC_CPP} ${SRC_CU})
set_property(TARGET cutlass2_src PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET cutlass2_src PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON)
add_library(cutlass_src_hopper STATIC ${CU_INSTANTIATIONS})
set_property(TARGET cutlass_src_hopper PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET cutlass_src_hopper PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON)
add_dependencies(cutlass_src_hopper cutlass_src_pre_hopper)
add_library(cutlass3_src STATIC ${CU_INSTANTIATIONS})
set_property(TARGET cutlass3_src PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET cutlass3_src PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON)
# Note - we deliberately do not include 90a PTX (even when 9.0+PTX is
# specified). This is because sm_90a has arch conditional instructions that are
@ -75,24 +72,18 @@ add_dependencies(cutlass_src_hopper cutlass_src_pre_hopper)
# the binary anyway.
if("9.0" IN_LIST TORCH_CUDA_ARCH_LIST
OR "9.0+PTX" IN_LIST TORCH_CUDA_ARCH_LIST
OR TORCH_CUDA_ARCH_LIST STREQUAL "Auto")
OR "90-real" IN_LIST CMAKE_CUDA_ARCHITECTURES_NATIVE)
message(STATUS "MANUALLY APPENDING FLAG TO COMPILE FOR SM_90a.")
target_compile_options(
cutlass_src_pre_hopper
PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_90a,code=sm_90a>)
target_compile_options(
cutlass_src_hopper
cutlass3_src
PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_90a,code=sm_90a>)
# Hopper kernels require cuda lib for TMA APIs
target_link_libraries(cutlass_src_pre_hopper PRIVATE CUDA::cuda_driver)
target_link_libraries(cutlass_src_hopper PRIVATE CUDA::cuda_driver)
target_link_libraries(cutlass3_src PRIVATE CUDA::cuda_driver)
# No kernels should be parsed, unless hopper is specified. This is a build
# time improvement
target_compile_definitions(cutlass_src_pre_hopper
PRIVATE COMPILE_HOPPER_MIXED_INPUT_GEMMS)
target_compile_definitions(cutlass_src_hopper
target_compile_definitions(cutlass3_src
PRIVATE COMPILE_HOPPER_MIXED_INPUT_GEMMS)
endif()
@ -101,9 +92,5 @@ endif()
# compilation output.
if(NOT WIN32)
target_compile_options(
cutlass_src_pre_hopper
PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-Wno-psabi>)
target_compile_options(
cutlass_src_hopper
PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-Wno-psabi>)
cutlass3_src PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-Wno-psabi>)
endif()


@ -127,9 +127,9 @@ __launch_bounds__(THREADBLOCK_SIZE) __global__ void batch_topk_kernel(const int*
const float length_penalty{beam_hyps.length_penalties[global_batch_idx]};
const int early_stopping{beam_hyps.early_stoppings[global_batch_idx]};
const int* sequence_lengths{beam_hyps.sequence_lengths_src};
const T diversity_rate{beam_hyps.diversity_rates[global_batch_idx]};
float* output_log_probs{beam_hyps.log_probs_src};
const int* sequence_lengths{beam_hyps.sequence_lengths_src};
using cub_kvp = cub::KeyValuePair<int, T>;
using BlockReduce = cub::BlockReduce<cub_kvp, THREADBLOCK_SIZE>;
@ -177,21 +177,7 @@ __launch_bounds__(THREADBLOCK_SIZE) __global__ void batch_topk_kernel(const int*
for (int elem_id = thread_id; elem_id < candidate_size; elem_id += THREADBLOCK_SIZE)
{
int i = beam_hyps.num_beams == nullptr ? elem_id % K : elem_id / 2 / K;
T elem = topk_tmp_val_buf[elem_id];
if (length_penalty > 0.0f)
{
int length = sequence_lengths[vector_id * K + i];
if (early_stopping == 0)
{
// Use generated_length (rather than sequence_length) to compute length_penalty
// https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L957
// But this branch will cause CI error in
// "C++ Tests (GPT) on A30", "C++ Tests (GPT-J) on H100_PCIe", "H100_PCIe-accuracy-0"
length -= beam_hyps.input_lengths[global_batch_idx];
}
const int pad_if_not_finish = finished[vector_id * K + i].isFinished() ? 0 : 1;
elem = apply_length_penalty(elem, length + pad_if_not_finish, length_penalty);
}
T elem = topk_tmp_val_buf[elem_id]; // use token score to do TopK
elem += diversity_rate * (T) i;
cub_kvp new_elem{elem_id, elem};
partial_topk = arg_max(partial_topk, new_elem);
@ -232,21 +218,25 @@ __launch_bounds__(THREADBLOCK_SIZE) __global__ void batch_topk_kernel(const int*
{
const int current_key = cta_topk[i].key;
const T current_value = cta_topk[i].value;
// Consider adding a beam only if this token belongs to the top-K range and it is the end_token
// https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L272
if (i < K && beam_hyps.num_beams != nullptr
&& topk_tmp_id_buf[current_key] % vocab_size == beam_hyps.end_ids[vector_id])
{
// Add beam only if beam_token belongs to top K tokens
// https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L272
const float normed_score = (float) current_value;
const int num_beam = beam_hyps.num_beams[global_batch_idx];
int beam_idx = num_beam;
const int seq_len = sequence_lengths[vector_id * K + i] - beam_hyps.input_lengths[global_batch_idx];
const int pad_if_not_finish = finished[vector_id * K + i].isFinished() ? 0 : 1;
const float normed_score
= apply_length_penalty(current_value, seq_len + pad_if_not_finish, length_penalty);
int beam_idx = beam_hyps.num_beams[global_batch_idx];
// There are already K beams
if (num_beam == K)
if (beam_idx == K)
{
// The current score is worse than the worst one in beams
if (normed_score < beam_hyps.min_normed_scores[global_batch_idx])
{
// Stop considering new beams
selected_beams = K;
break;
}
@ -291,24 +281,34 @@ __launch_bounds__(THREADBLOCK_SIZE) __global__ void batch_topk_kernel(const int*
{
const int src_idx = j * beam_hyps.batch_size * K + beam_hyps.ite * beam_hyps.local_batch_size * K
+ vector_id * K + prev_id;
beam_hyps.output_ids_tgt[tgt_id_offset + j]
= beam_hyps.output_ids_src_ptr[vector_id][prev_id * beam_hyps.max_seq_len + j];
if (beam_hyps.log_probs != nullptr && beam_hyps.log_probs_src != nullptr)
{
beam_hyps.log_probs[tgt_id_offset + j] = beam_hyps.log_probs_src[src_idx];
}
prev_id = beam_hyps.parent_ids_src_ptr[vector_id][prev_id * beam_hyps.max_seq_len + j];
}
const int tgt_beam_idx = global_batch_idx * (K * 2) + beam_idx;
beam_hyps.sequence_lengths_tgt[tgt_beam_idx] = current_step;
beam_hyps.normed_scores[tgt_beam_idx] = normed_score;
beam_hyps.min_normed_scores[global_batch_idx]
= min(beam_hyps.min_normed_scores[global_batch_idx], beam_hyps.normed_scores[tgt_beam_idx]);
beam_hyps.num_beams[global_batch_idx]++;
cum_log_probs[tgt_beam_idx] = (float) topk_tmp_val_buf[current_key];
beam_hyps.cum_log_probs[tgt_beam_idx] = (float) topk_tmp_val_buf[current_key];
}
// This token is the end_token but belongs to the range K ~ 2K, so just ignore it
// TODO: eliminate this branch by rewriting condition of the else_if
else if (i >= K && beam_hyps.num_beams != nullptr
&& topk_tmp_id_buf[current_key] % vocab_size == beam_hyps.end_ids[vector_id])
{
;
}
// Beam search is disabled or this token is not the end_token, so we add it to the end of the unfinished sentence
else if (beam_hyps.num_beams != nullptr || beam_hyps.num_beams == nullptr && i < K)
{
const int current_step{sequence_lengths[vector_id * K + selected_beams]};


@ -1,6 +1,5 @@
/*
* Adapted from https://github.com/state-spaces/mamba/blob/main/csrc/selective_scan/selective_scan_fwd_kernel.cuh
* Copyright (c) 2023, Tri Dao.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@ -13,413 +12,318 @@
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* Not a contribution
* Changes made by NVIDIA CORPORATION & AFFILIATES or otherwise documented as
* NVIDIA-proprietary are not a contribution and subject to the following terms and conditions:
* SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: LicenseRef-NvidiaProprietary
*
* NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
* property and proprietary rights in and to this material, related
* documentation and any modifications thereto. Any use, reproduction,
* disclosure or distribution of this material and related documentation
* without an express license agreement from NVIDIA CORPORATION or
* its affiliates is strictly prohibited.
*/
#include <cuda_runtime_api.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/pipeline>
#include <cuda_bf16.h>
#include <cuda_fp16.h>
#ifdef ENABLE_FP8
#include <cuda_fp8.h>
#endif
#include <cub/block/block_load.cuh>
#include <cub/block/block_scan.cuh>
#include <cub/block/block_store.cuh>
#include "selectiveScan.h"
#include "selectiveScanCommon.h"
namespace tensorrt_llm
{
namespace kernels
{
template <int kNThreads_, int kNItems_, int kNRows_, bool kIsEvenLen_, bool kIsVariableB_, bool kIsVariableC_,
bool kHasZ_, typename input_t_, typename weight_t_>
struct Selective_Scan_fwd_kernel_traits
__device__ float toFloat(float f)
{
static_assert(kNItems_ % 4 == 0);
using input_t = input_t_;
using weight_t = weight_t_;
static constexpr int kNThreads = kNThreads_;
// Setting MinBlocksPerMP to be 3 (instead of 2) for 128 threads improves occupancy.
static constexpr int kMinBlocks = kNThreads < 128 ? 5 : 3;
static constexpr int kNItems = kNItems_;
static constexpr int kNRows = kNRows_;
static constexpr int kNBytes = sizeof(input_t);
static_assert(kNBytes == 2 || kNBytes == 4);
static constexpr int kNElts = kNBytes == 4 ? 4 : std::min(8, kNItems);
static_assert(kNItems % kNElts == 0);
static constexpr int kNLoads = kNItems / kNElts;
static constexpr bool kIsEvenLen = kIsEvenLen_;
static constexpr bool kIsVariableB = kIsVariableB_;
static constexpr bool kIsVariableC = kIsVariableC_;
static constexpr bool kHasZ = kHasZ_;
static constexpr bool kDirectIO = kIsEvenLen && kNLoads == 1;
using vec_t = typename BytesToType<kNBytes * kNElts>::Type;
using scan_t = float2;
using scan_t_s = float;
using BlockLoadT = cub::BlockLoad<input_t, kNThreads, kNItems, cub::BLOCK_LOAD_WARP_TRANSPOSE>;
using BlockLoadVecT = cub::BlockLoad<vec_t, kNThreads, kNLoads,
!kDirectIO ? cub::BLOCK_LOAD_WARP_TRANSPOSE : cub::BLOCK_LOAD_DIRECT>;
using BlockLoadWeightT = cub::BlockLoad<input_t, kNThreads, kNItems, cub::BLOCK_LOAD_WARP_TRANSPOSE>;
using BlockLoadWeightVecT = cub::BlockLoad<vec_t, kNThreads, kNLoads,
!kDirectIO ? cub::BLOCK_LOAD_WARP_TRANSPOSE : cub::BLOCK_LOAD_DIRECT>;
using BlockStoreT = cub::BlockStore<input_t, kNThreads, kNItems, cub::BLOCK_STORE_WARP_TRANSPOSE>;
using BlockStoreVecT = cub::BlockStore<vec_t, kNThreads, kNLoads,
!kDirectIO ? cub::BLOCK_STORE_WARP_TRANSPOSE : cub::BLOCK_STORE_DIRECT>;
// using BlockScanT = cub::BlockScan<scan_t, kNThreads, cub::BLOCK_SCAN_RAKING_MEMOIZE>;
// using BlockScanT = cub::BlockScan<scan_t, kNThreads, cub::BLOCK_SCAN_RAKING>;
using BlockScanT = cub::BlockScan<scan_t, kNThreads, cub::BLOCK_SCAN_WARP_SCANS>;
static constexpr int kSmemIOSize
= std::max({sizeof(typename BlockLoadT::TempStorage), sizeof(typename BlockLoadVecT::TempStorage),
(int(kIsVariableB) + int(kIsVariableC)) * sizeof(typename BlockLoadWeightT::TempStorage),
(int(kIsVariableB) + int(kIsVariableC)) * sizeof(typename BlockLoadWeightVecT::TempStorage),
sizeof(typename BlockStoreT::TempStorage), sizeof(typename BlockStoreVecT::TempStorage)});
static constexpr int kSmemSize = kSmemIOSize + sizeof(typename BlockScanT::TempStorage);
};
template <typename Ktraits>
__global__ __launch_bounds__(Ktraits::kNThreads, Ktraits::kMinBlocks) void selective_scan_fwd_kernel(
SSMParamsBase params)
{
constexpr bool kIsVariableB = Ktraits::kIsVariableB;
constexpr bool kIsVariableC = Ktraits::kIsVariableC;
constexpr bool kHasZ = Ktraits::kHasZ;
constexpr int kNThreads = Ktraits::kNThreads;
constexpr int kNItems = Ktraits::kNItems;
constexpr int kNRows = Ktraits::kNRows;
constexpr bool kDirectIO = Ktraits::kDirectIO;
using input_t = typename Ktraits::input_t;
using weight_t = typename Ktraits::weight_t;
using scan_t = typename Ktraits::scan_t;
using scan_t_s = typename Ktraits::scan_t_s;
// Shared memory.
extern __shared__ char smem_[];
// cast to lvalue reference of expected type
// char *smem_loadstorescan = smem_ + 2 * MAX_DSTATE * sizeof(weight_t);
// auto& smem_load = reinterpret_cast<typename BlockLoadT::TempStorage&>(smem_ + 2 * MAX_DSTATE * sizeof(weight_t));
// auto& smem_load = reinterpret_cast<typename BlockLoadT::TempStorage&>(smem_loadstorescan);
auto& smem_load = reinterpret_cast<typename Ktraits::BlockLoadT::TempStorage&>(smem_);
auto& smem_load_weight = reinterpret_cast<typename Ktraits::BlockLoadWeightT::TempStorage&>(smem_);
auto& smem_load_weight1 = *reinterpret_cast<typename Ktraits::BlockLoadWeightT::TempStorage*>(
smem_ + sizeof(typename Ktraits::BlockLoadWeightT::TempStorage));
auto& smem_store = reinterpret_cast<typename Ktraits::BlockStoreT::TempStorage&>(smem_);
auto& smem_scan = *reinterpret_cast<typename Ktraits::BlockScanT::TempStorage*>(smem_ + Ktraits::kSmemIOSize);
// weight_t *smem_a = reinterpret_cast<weight_t *>(smem_ + smem_loadstorescan_size);
// weight_t *smem_bc = reinterpret_cast<weight_t *>(smem_a + MAX_DSTATE);
scan_t* smem_running_prefix = reinterpret_cast<scan_t*>(smem_ + Ktraits::kSmemSize);
const int batch_id = blockIdx.x;
const int dim_id = blockIdx.y;
const int group_id = dim_id / (params.dim_ngroups_ratio);
input_t* u = reinterpret_cast<input_t*>(params.u_ptr) + batch_id * params.u_batch_stride
+ dim_id * kNRows * params.u_d_stride;
input_t* delta = reinterpret_cast<input_t*>(params.delta_ptr) + batch_id * params.delta_batch_stride
+ dim_id * kNRows * params.delta_d_stride;
weight_t* A = reinterpret_cast<weight_t*>(params.A_ptr) + dim_id * kNRows * params.A_d_stride;
weight_t* B = reinterpret_cast<weight_t*>(params.B_ptr) + dim_id * kNRows * params.B_d_stride;
input_t* Bvar = reinterpret_cast<input_t*>(params.B_ptr) + batch_id * params.B_batch_stride
+ group_id * params.B_group_stride;
weight_t* C = reinterpret_cast<weight_t*>(params.C_ptr) + dim_id * kNRows * params.C_d_stride;
input_t* Cvar = reinterpret_cast<input_t*>(params.C_ptr) + batch_id * params.C_batch_stride
+ group_id * params.C_group_stride;
scan_t_s* x = reinterpret_cast<scan_t_s*>(params.x_ptr) + (batch_id * params.dim + dim_id * kNRows) * params.dstate;
float D_val[kNRows] = {0};
if (params.D_ptr != nullptr)
{
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
D_val[r] = reinterpret_cast<float*>(params.D_ptr)[dim_id * kNRows + r];
}
}
float delta_bias[kNRows] = {0};
if (params.delta_bias_ptr != nullptr)
{
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
delta_bias[r] = reinterpret_cast<float*>(params.delta_bias_ptr)[dim_id * kNRows + r];
}
}
// for (int state_idx = threadIdx.x; state_idx < params.dstate; state_idx += blockDim.x) {
// smem_a[state_idx] = A[state_idx * params.A_dstate_stride];
// smem_bc[state_idx] = B[state_idx * params.B_dstate_stride] * C[state_idx * params.C_dstate_stride];
// }
__device__ float toFloat(__half h)
{
return __half2float(h);
}
#ifdef ENABLE_BF16
__device__ float toFloat(__nv_bfloat16 val)
{
return __bfloat162float(val);
}
#endif
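// toFloat and the convertAndStore overloads below are small conversion helpers so the recurrence
// math can run in fp32 regardless of whether input_t is float, half or bfloat16.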
constexpr int kChunkSize = kNThreads * kNItems;
for (int chunk = 0; chunk < params.n_chunks; ++chunk)
__device__ void convertAndStore(float* output, float input)
{
input_t u_vals[kNRows][kNItems], delta_vals_load[kNRows][kNItems];
__syncthreads();
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
if constexpr (!kDirectIO)
{
if (r > 0)
{
__syncthreads();
}
}
load_input<Ktraits>(u + r * params.u_d_stride, u_vals[r], smem_load, params.seqlen - chunk * kChunkSize);
if constexpr (!kDirectIO)
{
__syncthreads();
}
load_input<Ktraits>(
delta + r * params.delta_d_stride, delta_vals_load[r], smem_load, params.seqlen - chunk * kChunkSize);
}
u += kChunkSize;
delta += kChunkSize;
float delta_vals[kNRows][kNItems], delta_u_vals[kNRows][kNItems], out_vals[kNRows][kNItems];
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
#pragma unroll
for (int i = 0; i < kNItems; ++i)
{
float u_val = float(u_vals[r][i]);
delta_vals[r][i] = float(delta_vals_load[r][i]) + delta_bias[r];
if (params.delta_softplus)
{
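// Numerically stable softplus: log1p(exp(x)) for x <= 20; beyond that exp(x) would overflow and
// softplus(x) is approximately x anyway.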
delta_vals[r][i] = delta_vals[r][i] <= 20.f ? log1pf(expf(delta_vals[r][i])) : delta_vals[r][i];
}
delta_u_vals[r][i] = delta_vals[r][i] * u_val;
out_vals[r][i] = D_val[r] * u_val;
}
*output = input;
}
__syncthreads();
for (int state_idx = 0; state_idx < params.dstate; ++state_idx)
__device__ void convertAndStore(__half* output, float input)
{
weight_t A_val[kNRows];
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
A_val[r] = A[state_idx * params.A_dstate_stride + r * params.A_d_stride];
// Multiply the real part of A by LOG2E so we can use exp2f instead of expf.
constexpr float kLog2e = 1.4426950408889634074; // log_2(e) = M_LOG2E
A_val[r] *= kLog2e;
*output = __float2half(input);
}
// This variable holds B * C if both B and C are constant across seqlen. If only B varies
// across seqlen, this holds C. If only C varies across seqlen, this holds B.
// If both B and C vary, this is unused.
weight_t BC_val[kNRows];
weight_t B_vals[kNItems], C_vals[kNItems];
if constexpr (kIsVariableB)
#ifdef ENABLE_BF16
__device__ void convertAndStore(__nv_bfloat16* output, float input)
{
load_weight<Ktraits>(Bvar + state_idx * params.B_dstate_stride, B_vals, smem_load_weight,
params.seqlen - chunk * kChunkSize);
if constexpr (!kIsVariableC)
*output = __float2bfloat16(input);
}
#endif
template <typename input_t, typename weight_t, int DSTATE = 16, int CHANNELS_PER_BLOCK = 128, int STAGES = 12,
int SEQ_UNROLL = 6>
__launch_bounds__(256, 1) __global__ void selective_scan_loop_kernel(SSMParamsBase params)
{
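// Context-phase kernel organized as a producer/consumer software pipeline (cuda::pipeline):
//  - warps with threadIdx.y == 1 asynchronously stage B, C, dt, x and z for SEQ_UNROLL tokens at a
//    time into the STAGES-deep shared-memory buffers;
//  - warps with threadIdx.y == 0 keep the per-channel state in registers and apply the recurrence
//    h = exp(dt * A) * h + dt * B * x,  y = C . h (+ D * x, optionally gated by silu(z)).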
#pragma unroll
for (int r = 0; r < kNRows; ++r)
input_t* output = reinterpret_cast<input_t*>(params.out_ptr);
weight_t* state = reinterpret_cast<weight_t*>(params.x_ptr);
input_t* x = reinterpret_cast<input_t*>(params.u_ptr);
input_t* dt = reinterpret_cast<input_t*>(params.delta_ptr);
weight_t* A = reinterpret_cast<weight_t*>(params.A_ptr);
input_t* B = reinterpret_cast<input_t*>(params.B_ptr);
input_t* C = reinterpret_cast<input_t*>(params.C_ptr);
weight_t* D = reinterpret_cast<weight_t*>(params.D_ptr);
input_t* z = reinterpret_cast<input_t*>(params.z_ptr);
weight_t* dt_bias = reinterpret_cast<weight_t*>(params.delta_bias_ptr);
bool dt_softplus = params.delta_softplus;
int num_tokens = params.seqlen;
int num_channels = params.dim;
// static const int STAGES = 12;
// static const int SEQ_UNROLL = 6;
__shared__ cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_block, STAGES / SEQ_UNROLL> pipeline_state;
auto block = cooperative_groups::this_thread_block();
__shared__ __align__(16) input_t sh_B[STAGES][DSTATE];
__shared__ __align__(16) input_t sh_C[STAGES][DSTATE];
__shared__ __align__(128) input_t sh_dt[STAGES][CHANNELS_PER_BLOCK];
__shared__ input_t sh_x[STAGES][CHANNELS_PER_BLOCK];
__shared__ input_t sh_z[STAGES][CHANNELS_PER_BLOCK];
__shared__ weight_t sh_D[CHANNELS_PER_BLOCK];
__shared__ weight_t sh_dt_bias[CHANNELS_PER_BLOCK];
const int channel = blockIdx.x * blockDim.x + threadIdx.x;
const int sample = blockIdx.y; // batch id
const int seq_loops = (num_tokens + SEQ_UNROLL - 1) / SEQ_UNROLL;
const int input_matrix_row_id = sample * num_tokens;
if (threadIdx.y == 1)
{
BC_val[r] = C[state_idx * params.C_dstate_stride + r * params.C_d_stride];
}
}
}
if constexpr (kIsVariableC)
// Data loading warps
// Bias is independent of token
sh_dt_bias[threadIdx.x] = dt_bias[channel];
// D is independent of token
if (D)
sh_D[threadIdx.x] = D[channel];
cuda::pipeline pipeline = cuda::make_pipeline(block, &pipeline_state, cuda::pipeline_role::producer);
int stage = 0;
for (int si = 0; si < seq_loops; si++)
{
auto& smem_load_weight_C = !kIsVariableB ? smem_load_weight : smem_load_weight1;
load_weight<Ktraits>(Cvar + state_idx * params.C_dstate_stride, C_vals, smem_load_weight_C,
params.seqlen - chunk * kChunkSize);
if constexpr (!kIsVariableB)
{
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
BC_val[r] = B[state_idx * params.B_dstate_stride + r * params.B_d_stride];
}
}
}
if constexpr (!kIsVariableB && !kIsVariableC)
{
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
BC_val[r] = B[state_idx * params.B_dstate_stride + r * params.B_d_stride]
* C[state_idx * params.C_dstate_stride + r * params.C_d_stride];
}
}
pipeline.producer_acquire();
#pragma unroll
for (int r = 0; r < kNRows; ++r)
for (int token_id = si * SEQ_UNROLL; token_id < num_tokens && token_id < (si + 1) * SEQ_UNROLL; token_id++)
{
if (r > 0)
{
__syncthreads();
} // Scan could be using the same smem
scan_t thread_data[kNItems];
#pragma unroll
for (int i = 0; i < kNItems; ++i)
{
thread_data[i] = make_float2(exp2f(delta_vals[r][i] * A_val[r]),
!kIsVariableB ? delta_u_vals[r][i] : B_vals[i] * delta_u_vals[r][i]);
if constexpr (!Ktraits::kIsEvenLen)
{ // So that the last state is correct
if (threadIdx.x * kNItems + i >= params.seqlen - chunk * kChunkSize)
{
thread_data[i] = make_float2(1.f, 0.f);
}
}
}
// Initialize running total
scan_t running_prefix;
// If we use WARP_SCAN then lane 0 of every warp (not just thread 0) needs to read
running_prefix = chunk > 0 && threadIdx.x % 32 == 0 ? smem_running_prefix[state_idx + r * MAX_DSTATE]
: make_float2(1.f, 0.f);
// running_prefix = chunk > 0 && threadIdx.x == 0 ? smem_running_prefix[state_idx] :
// make_float2(1.f, 0.f);
SSMScanPrefixCallbackOp<weight_t> prefix_op(running_prefix);
Ktraits::BlockScanT(smem_scan).InclusiveScan(
thread_data, thread_data, SSMScanOp<weight_t>(), prefix_op);
// There's a syncthreads in the scan op, so we don't need to sync here.
// Unless there's only 1 warp, but then it's the same thread (0) reading and writing.
if (threadIdx.x == 0)
{
smem_running_prefix[state_idx] = prefix_op.running_prefix;
if (chunk == params.n_chunks - 1)
{
x[r * params.dstate + state_idx] = prefix_op.running_prefix.y;
}
}
#pragma unroll
for (int i = 0; i < kNItems; ++i)
{
const weight_t C_val
= !kIsVariableC ? BC_val[r] : (!kIsVariableB ? BC_val[r] * C_vals[i] : C_vals[i]);
out_vals[r][i] += thread_data[i].y * C_val;
}
}
}
input_t* out = reinterpret_cast<input_t*>(params.out_ptr) + batch_id * params.out_batch_stride
+ dim_id * kNRows * params.out_d_stride + chunk * kChunkSize;
if constexpr (kHasZ)
input_t* my_B = &B[input_matrix_row_id * DSTATE + token_id * DSTATE];
input_t* my_C = &C[input_matrix_row_id * DSTATE + token_id * DSTATE];
int block_channel_per_token = blockIdx.x * blockDim.x;
int block_channel
= input_matrix_row_id * num_channels + token_id * num_channels + block_channel_per_token;
if (threadIdx.x < DSTATE)
cuda::memcpy_async(&sh_B[stage][threadIdx.x], &my_B[threadIdx.x], sizeof(input_t), pipeline);
else if (threadIdx.x >= 32 && threadIdx.x < 32 + DSTATE)
cuda::memcpy_async(
&sh_C[stage][threadIdx.x - 32], &my_C[threadIdx.x - 32], sizeof(input_t), pipeline);
if (sizeof(input_t) == 4)
{
input_t* z = reinterpret_cast<input_t*>(params.z_ptr) + batch_id * params.z_batch_stride
+ dim_id * kNRows * params.z_d_stride + chunk * kChunkSize;
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
input_t z_vals[kNItems];
__syncthreads();
load_input<Ktraits>(z + r * params.z_d_stride, z_vals, smem_load, params.seqlen - chunk * kChunkSize);
#pragma unroll
for (int i = 0; i < kNItems; ++i)
{
float z_val = z_vals[i];
out_vals[r][i] *= z_val / (1 + expf(-z_val));
cuda::memcpy_async(&sh_dt[stage][threadIdx.x],
&dt[input_matrix_row_id * num_channels + token_id * num_channels + channel], sizeof(input_t),
pipeline);
cuda::memcpy_async(&sh_x[stage][threadIdx.x],
&x[input_matrix_row_id * num_channels + token_id * num_channels + channel], sizeof(input_t),
pipeline);
if (z)
cuda::memcpy_async(&sh_z[stage][threadIdx.x],
&z[input_matrix_row_id * num_channels + token_id * num_channels + channel], sizeof(input_t),
pipeline);
}
__syncthreads();
store_output<Ktraits>(
out + r * params.out_d_stride, out_vals[r], smem_store, params.seqlen - chunk * kChunkSize);
else
{
// sh_dt[stage][threadIdx.x] = dt[block_channel + threadIdx.x];
if (threadIdx.x < 32)
{
int tid = threadIdx.x;
float2* block_dt = (float2*) &dt[block_channel];
cuda::memcpy_async((float2*) &sh_dt[stage][tid * 4], &block_dt[tid], sizeof(float2), pipeline);
}
// sh_x[stage][threadIdx.x] = x[block_channel + threadIdx.x];
else if (threadIdx.x < 64)
{
int tid = threadIdx.x - 32;
float2* block_x = (float2*) &x[block_channel];
cuda::memcpy_async((float2*) &sh_x[stage][tid * 4], &block_x[tid], sizeof(float2), pipeline);
}
// sh_z[stage][threadIdx.x] = z[block_channel + threadIdx.x];
else if (threadIdx.x < 96)
{
int tid = threadIdx.x - 64;
if (z)
{
float2* block_z = (float2*) &z[block_channel];
cuda::memcpy_async(
(float2*) &sh_z[stage][tid * 4], &block_z[tid], sizeof(float2), pipeline);
}
}
else
{
__syncthreads();
}
}
stage++;
if (stage >= STAGES)
stage = 0;
}
pipeline.producer_commit();
}
}
else
{
// Compute warps
// Load state and A matrix into registers
float state_reg[DSTATE];
float A_reg[DSTATE];
for (int i = 0; i < DSTATE; i++)
{
// state_reg[i] = toFloat(state[sample*num_channels*DSTATE + i*num_channels + channel]);
state_reg[i] = 0.f;
A_reg[i] = toFloat(A[i * num_channels + channel]);
}
cuda::pipeline pipeline = cuda::make_pipeline(block, &pipeline_state, cuda::pipeline_role::consumer);
int stage = 0;
for (int si = 0; si < seq_loops; si++)
{
pipeline.consumer_wait();
#pragma unroll
for (int r = 0; r < kNRows; ++r)
for (int token_id = si * SEQ_UNROLL; token_id < num_tokens && token_id < (si + 1) * SEQ_UNROLL; token_id++)
{
if constexpr (!kDirectIO)
float dt_b = toFloat(sh_dt[stage][threadIdx.x]) + toFloat(sh_dt_bias[threadIdx.x]);
float dt_b_sp;
if (dt_softplus)
{
if (r > 0)
dt_b_sp = dt_b <= 20.f ? log1pf(__expf(dt_b)) : dt_b; // softplus
}
float my_x = toFloat(sh_x[stage][threadIdx.x]);
float Dx = my_x * (D ? toFloat(sh_D[threadIdx.x]) : 0.f);
float dtx = dt_b_sp * my_x;
float my_z = z ? toFloat(sh_z[stage][threadIdx.x]) : 0.f;
float out = Dx;
if (sizeof(input_t) == 4)
{
__syncthreads();
float4* B4 = (float4*) &sh_B[stage][0];
float4* C4 = (float4*) &sh_C[stage][0];
#pragma unroll
for (int i = 0; i < DSTATE / 4; i++)
{
float4 Bi4 = B4[i];
float4 Ci4 = C4[i];
float* Bi = (float*) &Bi4;
float* Ci = (float*) &Ci4;
#pragma unroll
for (int j = 0; j < 4; j++)
{
float dtA = A_reg[i * 4 + j] * dt_b_sp;
float dA = __expf(dtA);
float sdA = state_reg[i * 4 + j] * dA;
float dBx = Bi[j] * dtx;
float newState = sdA + dBx;
state_reg[i * 4 + j] = newState;
out += newState * Ci[j];
}
}
store_output<Ktraits>(
out + r * params.out_d_stride, out_vals[r], smem_store, params.seqlen - chunk * kChunkSize);
}
else
{
float4* B8 = (float4*) &sh_B[stage][0];
float4* C8 = (float4*) &sh_C[stage][0];
#pragma unroll
for (int i = 0; i < DSTATE / 8; i++)
{
input_t* Bi = (input_t*) (&B8[i]);
input_t* Ci = (input_t*) (&C8[i]);
#pragma unroll
for (int j = 0; j < 8; j++)
{
float dtA = A_reg[i * 8 + j] * dt_b_sp;
float dA = __expf(dtA);
float sdA = state_reg[i * 8 + j] * dA;
float dBx = toFloat(Bi[j]) * dtx;
float newState = sdA + dBx;
state_reg[i * 8 + j] = newState;
out += newState * toFloat(Ci[j]);
}
}
}
Bvar += kChunkSize;
Cvar += kChunkSize;
}
if (z)
{
float enz = __expf(0.f - my_z);
enz += 1.0;
float sig_z = 1.0 / enz;
float silu_z = my_z * sig_z;
out *= silu_z;
}
input_t* my_output = &output[input_matrix_row_id * num_channels + token_id * num_channels];
convertAndStore(&my_output[channel], out);
template <int kNThreads, int kNItems, typename input_t, typename weight_t>
void selective_scan_fwd_launch(SSMParamsBase& params, cudaStream_t stream)
{
// Only kNRows == 1 is tested for now, which of course doesn't differ from before, when each block
// processed 1 row.
static constexpr int kNRows = 1;
BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen,
[&]
{
BOOL_SWITCH(params.is_variable_B, kIsVariableB,
[&]
{
BOOL_SWITCH(params.is_variable_C, kIsVariableC,
[&]
{
BOOL_SWITCH(params.z_ptr != nullptr, kHasZ,
[&]
{
using Ktraits = Selective_Scan_fwd_kernel_traits<kNThreads, kNItems, kNRows,
kIsEvenLen, kIsVariableB, kIsVariableC, kHasZ, input_t, weight_t>;
// constexpr int kSmemSize = Ktraits::kSmemSize;
constexpr int kSmemSize
= Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t);
// printf("smem_size = %d\n", kSmemSize);
dim3 grid(params.batch, params.dim / kNRows);
auto kernel = &selective_scan_fwd_kernel<Ktraits>;
if (kSmemSize >= 48 * 1024)
{
TLLM_CUDA_CHECK(cudaFuncSetAttribute(
kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
stage++;
if (stage >= STAGES)
stage = 0;
}
pipeline.consumer_release();
}
// Write the new state back out to the cache
for (int i = 0; i < DSTATE; i++)
{
weight_t* my_state = &state[sample * num_channels * DSTATE];
int offset = i * num_channels + channel;
convertAndStore(&my_state[offset], state_reg[i]);
}
}
kernel<<<grid, Ktraits::kNThreads, kSmemSize, stream>>>(params);
});
});
});
});
}
template <typename input_t, typename weight_t>
void invokeSelectiveScan(SSMParamsBase& params, cudaStream_t stream)
{
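// Each block handles kNThreads * kNItems elements per chunk, so pick the smallest configuration
// whose chunk covers the sequence length (128 / 256 / 512 / 1024 elements, else 2048 per chunk).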
if (params.seqlen <= 128)
{
selective_scan_fwd_launch<32, 4, input_t, weight_t>(params, stream);
}
else if (params.seqlen <= 256)
{
selective_scan_fwd_launch<32, 8, input_t, weight_t>(params, stream);
}
else if (params.seqlen <= 512)
{
selective_scan_fwd_launch<32, 16, input_t, weight_t>(params, stream);
}
else if (params.seqlen <= 1024)
{
selective_scan_fwd_launch<64, 16, input_t, weight_t>(params, stream);
}
else
{
selective_scan_fwd_launch<128, 16, input_t, weight_t>(params, stream);
}
int samples = params.batch;
int channels = params.dim;
const int threads = 128;
const int blocks = (channels + threads - 1) / threads;
dim3 block(threads, 2);
dim3 grid(blocks, samples);
TLLM_CHECK((channels % block.x) == 0);
TLLM_CHECK(params.is_variable_B);
TLLM_CHECK(params.is_variable_C);
TLLM_CHECK(params.dstate == 16);
selective_scan_loop_kernel<input_t, weight_t><<<grid, block, 0, stream>>>(params);
}
#define INSTANTIATE_SELECTIVE_SCAN_DATA_TYPE(input_t, weight_t) \
@ -434,126 +338,101 @@ INSTANTIATE_SELECTIVE_SCAN_DATA_TYPE(__nv_bfloat16, float);
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename input_t, typename weight_t, bool dt_softplus, bool has_dt_bias, bool has_d, bool has_z>
__global__ void selectiveScanUpdate(SSMParamsBase params)
template <typename input_t, typename weight_t, int DSTATE = 16, int CHANNELS_PER_BLOCK = 128>
__launch_bounds__(128, 2) __global__ void selective_scan_update_kernel(SSMParamsBase params)
{
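// Generation-phase (single token) update: one thread per channel keeps its DSTATE-element state in
// registers, advances it one step as h = exp(dt * A) * h + dt * B * x, writes it back to the state
// cache, and produces y = C . h + D * x, optionally gated by silu(z).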
// Shared memory.
extern __shared__ char smem_[];
input_t* smem_b = reinterpret_cast<input_t*>(smem_);
input_t* smem_c = reinterpret_cast<input_t*>(smem_ + sizeof(input_t) * params.dstate);
input_t* output = reinterpret_cast<input_t*>(params.out_ptr);
weight_t* state = reinterpret_cast<weight_t*>(params.x_ptr);
input_t* x = reinterpret_cast<input_t*>(params.u_ptr);
input_t* dt = reinterpret_cast<input_t*>(params.delta_ptr);
weight_t* A = reinterpret_cast<weight_t*>(params.A_ptr);
input_t* B = reinterpret_cast<input_t*>(params.B_ptr);
input_t* C = reinterpret_cast<input_t*>(params.C_ptr);
weight_t* D = reinterpret_cast<weight_t*>(params.D_ptr);
input_t* z = reinterpret_cast<input_t*>(params.z_ptr);
weight_t* dt_bias = reinterpret_cast<weight_t*>(params.delta_bias_ptr);
bool dt_softplus = params.delta_softplus;
int num_channels = params.dim;
const int batch_id = blockIdx.x;
const int dim_id = blockIdx.y * blockDim.x + threadIdx.x;
const int channel = blockIdx.x * blockDim.x + threadIdx.x;
if (channel >= num_channels)
return;
const int sample = blockIdx.y;
const input_t x = reinterpret_cast<const input_t*>(params.u_ptr)[batch_id * params.u_batch_stride + dim_id];
const weight_t* A = reinterpret_cast<const weight_t*>(params.A_ptr) + dim_id * params.A_d_stride;
const input_t* B = reinterpret_cast<const input_t*>(params.B_ptr) + batch_id * params.B_batch_stride;
const input_t* C = reinterpret_cast<const input_t*>(params.C_ptr) + batch_id * params.C_batch_stride;
const float* D_ptr = reinterpret_cast<const float*>(params.D_ptr);
const input_t* z_ptr = reinterpret_cast<const input_t*>(params.z_ptr);
weight_t* state = reinterpret_cast<weight_t*>(params.x_ptr) + batch_id * params.state_batch_stride
+ dim_id * params.state_d_stride;
const input_t dt
= reinterpret_cast<const input_t*>(params.delta_ptr)[batch_id * params.delta_batch_stride + dim_id];
const float* dt_bias_ptr = reinterpret_cast<const float*>(params.delta_bias_ptr);
input_t* out = reinterpret_cast<input_t*>(params.out_ptr) + batch_id * params.out_batch_stride;
float out_tmp = 0.0f;
weight_t* my_state = &state[sample * num_channels * DSTATE];
input_t* my_output = &output[sample * num_channels];
// get delta bias
float dt_bias = 0.0f;
if (has_dt_bias)
float rA[DSTATE];
float rB[DSTATE];
float rC[DSTATE];
float rState[DSTATE];
#pragma unroll
for (int i = 0; i < DSTATE; i++)
{
dt_bias = dt_bias_ptr[dim_id];
rA[i] = toFloat(A[i * num_channels + channel]);
rB[i] = toFloat(B[sample * DSTATE + i]);
rC[i] = toFloat(C[sample * DSTATE + i]);
rState[i] = toFloat(my_state[i * num_channels + channel]);
}
// get D
float D = 0.0f;
if (has_d)
{
D = D_ptr[dim_id];
}
float my_x, my_dt, my_z, my_dt_bias, my_D;
my_x = toFloat(x[sample * num_channels + channel]);
my_dt = toFloat(dt[sample * num_channels + channel]);
my_z = z ? toFloat(z[sample * num_channels + channel]) : 0.f;
my_dt_bias = dt_bias ? toFloat(dt_bias[channel]) : 0.f;
my_D = D ? toFloat(D[channel]) : 0.f;
// dt = softplus(dt + dt_bias)
float dt_val = float(dt) + dt_bias;
float dt_b = my_dt + my_dt_bias;
float dt_b_sp;
if (dt_softplus)
{
dt_val = dt_val <= 20.f ? log1pf(expf(dt_val)) : dt_val;
dt_b_sp = dt_b <= 20.f ? logf(1.f + expf(dt_b)) : dt_b; // softplus
}
out_tmp = D * float(x);
float out = 0.f;
// read B, C
if (threadIdx.x == 0)
{
#pragma unroll
for (int i = 0; i < params.dstate; ++i)
for (int i = 0; i < DSTATE; i++)
{
smem_b[i] = B[i];
smem_c[i] = C[i];
float dA = expf(rA[i] * dt_b_sp);
float dB = rB[i] * dt_b_sp;
float sdA = rState[i] * dA;
float dBx = dB * my_x;
float newState = sdA + dBx;
convertAndStore(&my_state[i * num_channels + channel], newState); // Write the new state back out to the cache
out += newState * rC[i];
}
}
__syncthreads();
for (int state_idx = 0; state_idx < params.dstate; ++state_idx)
if (D)
out += my_D * my_x;
if (z)
{
// read A
weight_t A_val = A[state_idx];
// Multiply the real part of A by LOG2E so we can use exp2f instead of expf.
constexpr float kLog2e = 1.4426950408889634074; // log_2(e) = M_LOG2E
A_val *= kLog2e;
// dtA = exp(dt * A), dtB = dt * B
float dt_A = exp2f(dt_val * A_val);
float dt_B = dt_val * float(smem_b[state_idx]);
// update state
float state_new = float(state[state_idx]) * dt_A + float(x) * dt_B;
state[state_idx] = weight_t(state_new);
// y = C * state + D * x
out_tmp += state_new * float(smem_c[state_idx]);
float sig_z = 1.0f / (1.0f + expf(0.f - my_z));
float silu_z = my_z * sig_z;
out *= silu_z;
}
// y = y * silu(z)
if (has_z)
{
float z = z_ptr[batch_id * params.z_batch_stride + dim_id];
out_tmp *= z / (1 + expf(-z));
}
// save out
out[dim_id] = input_t(out_tmp);
convertAndStore(&my_output[channel], out);
}
template <typename input_t, typename weight_t>
void invokeSelectiveScanUpdate(SSMParamsBase& params, cudaStream_t stream)
{
const int kNThreads = 32;
dim3 block(kNThreads);
dim3 grid(params.batch, (params.dim + kNThreads - 1) / kNThreads);
// only save B and C to shared mem for reuse
size_t smem_size = params.dstate * sizeof(input_t) * 2;
int samples = params.batch;
int channels = params.dim;
BOOL_SWITCH(params.delta_softplus, kDtSoftplus,
[&]
{
BOOL_SWITCH(params.delta_bias_ptr != nullptr, kHasDtBias,
[&]
{
BOOL_SWITCH(params.D_ptr != nullptr, kHasD,
[&]
{
BOOL_SWITCH(params.z_ptr != nullptr, kHasZ,
[&]
{
selectiveScanUpdate<input_t, weight_t, kDtSoftplus, kHasDtBias, kHasD, kHasZ>
<<<grid, block, smem_size, stream>>>(params);
});
});
});
});
const int threads = 128;
const int blocks = (channels + threads - 1) / threads;
dim3 block(threads, 1);
dim3 grid(blocks, samples);
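// One thread per channel: blocks of 128 threads, grid = (ceil(dim / 128), batch).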
TLLM_CHECK(params.is_variable_B);
TLLM_CHECK(params.is_variable_C);
TLLM_CHECK(params.dstate == 16);
selective_scan_update_kernel<input_t, weight_t><<<grid, block, 0, stream>>>(params);
}
#define INSTANTIATE_SELECTIVE_SCAN_UPDATE_DATA_TYPE(input_t, weight_t) \

View File

@ -30,6 +30,7 @@
#pragma once
#include "tensorrt_llm/common/assert.h"
#include "tensorrt_llm/common/cudaUtils.h"
namespace tensorrt_llm
@ -41,34 +42,12 @@ struct SSMParamsBase
{
using index_t = uint32_t;
int batch, dim, seqlen, dstate, n_groups, n_chunks;
int dim_ngroups_ratio;
int batch, dim, seqlen, dstate;
bool is_variable_B;
bool is_variable_C;
bool delta_softplus;
index_t A_d_stride;
index_t A_dstate_stride;
index_t B_batch_stride;
index_t B_d_stride;
index_t B_dstate_stride;
index_t B_group_stride;
index_t C_batch_stride;
index_t C_d_stride;
index_t C_dstate_stride;
index_t C_group_stride;
index_t u_batch_stride;
index_t u_d_stride;
index_t delta_batch_stride;
index_t delta_d_stride;
index_t z_batch_stride;
index_t z_d_stride;
index_t out_batch_stride;
index_t out_d_stride;
index_t state_batch_stride;
index_t state_d_stride;
// Common data pointers.
void* __restrict__ A_ptr;
void* __restrict__ B_ptr;

View File

@ -1,284 +0,0 @@
/*
* Adapted from https://github.com/state-spaces/mamba/blob/main/csrc/selective_scan/selective_scan_common.h
* Copyright (c) 2023, Tri Dao.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* Not a contribution
* Changes made by NVIDIA CORPORATION & AFFILIATES or otherwise documented as
* NVIDIA-proprietary are not a contribution and subject to the following terms and conditions:
* SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: LicenseRef-NvidiaProprietary
*
* NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
* property and proprietary rights in and to this material, related
* documentation and any modifications thereto. Any use, reproduction,
* disclosure or distribution of this material and related documentation
* without an express license agreement from NVIDIA CORPORATION or
* its affiliates is strictly prohibited.
*/
#pragma once
#include <cuda_bf16.h>
#include <cuda_fp16.h>
namespace tensorrt_llm
{
namespace kernels
{
#define MAX_DSTATE 256
inline __device__ float2 operator+(const float2& a, const float2& b)
{
return {a.x + b.x, a.y + b.y};
}
inline __device__ float3 operator+(const float3& a, const float3& b)
{
return {a.x + b.x, a.y + b.y, a.z + b.z};
}
inline __device__ float4 operator+(const float4& a, const float4& b)
{
return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}
////////////////////////////////////////////////////////////////////////////////////////////////////
// Inspired by https://github.com/NVIDIA/DALI/blob/main/include/dali/core/static_switch.h
// and https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Dispatch.h
/// @param COND - a boolean expression to switch by
/// @param CONST_NAME - a name given for the constexpr bool variable.
/// @param ... - code to execute for true and false
///
/// Usage:
/// ```
/// BOOL_SWITCH(flag, BoolConst, [&] {
/// some_function<BoolConst>(...);
/// });
/// ```
#define BOOL_SWITCH(COND, CONST_NAME, ...) \
[&] \
{ \
if (COND) \
{ \
static constexpr bool CONST_NAME = true; \
return __VA_ARGS__(); \
} \
else \
{ \
static constexpr bool CONST_NAME = false; \
return __VA_ARGS__(); \
} \
}()
////////////////////////////////////////////////////////////////////////////////////////////////////
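// BytesToType maps a byte count to an unsigned integer type of exactly that size; it is used as the
// element type for vectorized (direct) global-memory loads and stores.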
template <int BYTES>
struct BytesToType
{
};
template <>
struct BytesToType<16>
{
using Type = uint4;
static_assert(sizeof(Type) == 16);
};
template <>
struct BytesToType<8>
{
using Type = uint64_t;
static_assert(sizeof(Type) == 8);
};
template <>
struct BytesToType<4>
{
using Type = uint32_t;
static_assert(sizeof(Type) == 4);
};
template <>
struct BytesToType<2>
{
using Type = uint16_t;
static_assert(sizeof(Type) == 2);
};
template <>
struct BytesToType<1>
{
using Type = uint8_t;
static_assert(sizeof(Type) == 1);
};
////////////////////////////////////////////////////////////////////////////////////////////////////
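// Converter widens an N-element register array of input_t to float, using packed half2 / bfloat162
// conversions where the hardware supports them.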
template <typename scalar_t, int N>
struct Converter
{
static inline __device__ void to_float(const scalar_t (&src)[N], float (&dst)[N])
{
#pragma unroll
for (int i = 0; i < N; ++i)
{
dst[i] = src[i];
}
}
};
template <int N>
struct Converter<half, N>
{
static inline __device__ void to_float(const half (&src)[N], float (&dst)[N])
{
static_assert(N % 2 == 0);
auto& src2 = reinterpret_cast<const half2(&)[N / 2]>(src);
auto& dst2 = reinterpret_cast<float2(&)[N / 2]>(dst);
#pragma unroll
for (int i = 0; i < N / 2; ++i)
{
dst2[i] = __half22float2(src2[i]);
}
}
};
#if __CUDA_ARCH__ >= 800
template <int N>
struct Converter<__nv_bfloat16, N>
{
static inline __device__ void to_float(const __nv_bfloat16 (&src)[N], float (&dst)[N])
{
static_assert(N % 2 == 0);
auto& src2 = reinterpret_cast<const nv_bfloat162(&)[N / 2]>(src);
auto& dst2 = reinterpret_cast<float2(&)[N / 2]>(dst);
#pragma unroll
for (int i = 0; i < N / 2; ++i)
{
dst2[i] = __bfloat1622float2(src2[i]);
}
}
};
#endif
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename scalar_t>
struct SSMScanOp;
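// Associative combine for the first-order linear recurrence h_t = a_t * h_{t-1} + b_t:
// composing (a0, b0) with (a1, b1) yields (a1 * a0, a1 * b0 + b1), which lets CUB's block scan
// evaluate the whole recurrence in parallel.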
template <>
struct SSMScanOp<float>
{
__device__ __forceinline__ float2 operator()(const float2& ab0, const float2& ab1) const
{
return make_float2(ab1.x * ab0.x, ab1.x * ab0.y + ab1.y);
}
};
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
template <typename scalar_t>
struct SSMScanPrefixCallbackOp
{
using scan_t = std::conditional_t<std::is_same_v<scalar_t, float>, float2, float4>;
scan_t running_prefix;
// Constructor
__device__ SSMScanPrefixCallbackOp(scan_t running_prefix_)
: running_prefix(running_prefix_)
{
}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ scan_t operator()(scan_t block_aggregate)
{
scan_t old_prefix = running_prefix;
running_prefix = SSMScanOp<scalar_t>()(running_prefix, block_aggregate);
return old_prefix;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
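// Cooperative block-wide load/store helpers: when the tile is fully in range (kIsEvenLen) they use
// vectorized direct loads/stores of vec_t; otherwise they fall back to a guarded element-wise path
// padded with zeros.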
template <typename Ktraits>
inline __device__ void load_input(typename Ktraits::input_t* u, typename Ktraits::input_t (&u_vals)[Ktraits::kNItems],
typename Ktraits::BlockLoadT::TempStorage& smem_load, int seqlen)
{
if constexpr (Ktraits::kIsEvenLen)
{
auto& smem_load_vec = reinterpret_cast<typename Ktraits::BlockLoadVecT::TempStorage&>(smem_load);
using vec_t = typename Ktraits::vec_t;
Ktraits::BlockLoadVecT(smem_load_vec)
.Load(reinterpret_cast<vec_t*>(u), reinterpret_cast<vec_t(&)[Ktraits::kNLoads]>(u_vals));
}
else
{
Ktraits::BlockLoadT(smem_load).Load(u, u_vals, seqlen, 0.f);
}
}
template <typename Ktraits>
inline __device__ void load_weight(typename Ktraits::input_t* Bvar,
typename Ktraits::weight_t (&B_vals)[Ktraits::kNItems],
typename Ktraits::BlockLoadWeightT::TempStorage& smem_load_weight, int seqlen)
{
constexpr int kNItems = Ktraits::kNItems;
typename Ktraits::input_t B_vals_load[kNItems];
if constexpr (Ktraits::kIsEvenLen)
{
auto& smem_load_weight_vec
= reinterpret_cast<typename Ktraits::BlockLoadWeightVecT::TempStorage&>(smem_load_weight);
using vec_t = typename Ktraits::vec_t;
Ktraits::BlockLoadWeightVecT(smem_load_weight_vec)
.Load(reinterpret_cast<vec_t*>(Bvar), reinterpret_cast<vec_t(&)[Ktraits::kNLoads]>(B_vals_load));
}
else
{
Ktraits::BlockLoadWeightT(smem_load_weight).Load(Bvar, B_vals_load, seqlen, 0.f);
}
// #pragma unroll
// for (int i = 0; i < kNItems; ++i) { B_vals[i] = B_vals_load[i]; }
Converter<typename Ktraits::input_t, kNItems>::to_float(B_vals_load, B_vals);
}
template <typename Ktraits>
inline __device__ void store_output(typename Ktraits::input_t* out, const float (&out_vals)[Ktraits::kNItems],
typename Ktraits::BlockStoreT::TempStorage& smem_store, int seqlen)
{
typename Ktraits::input_t write_vals[Ktraits::kNItems];
#pragma unroll
for (int i = 0; i < Ktraits::kNItems; ++i)
{
write_vals[i] = out_vals[i];
}
if constexpr (Ktraits::kIsEvenLen)
{
auto& smem_store_vec = reinterpret_cast<typename Ktraits::BlockStoreVecT::TempStorage&>(smem_store);
using vec_t = typename Ktraits::vec_t;
Ktraits::BlockStoreVecT(smem_store_vec)
.Store(reinterpret_cast<vec_t*>(out), reinterpret_cast<vec_t(&)[Ktraits::kNLoads]>(write_vals));
}
else
{
Ktraits::BlockStoreT(smem_store).Store(out, write_vals, seqlen);
}
}
} // namespace kernels
} // namespace tensorrt_llm

View File

@ -86,7 +86,7 @@ struct WeightOnlyDetails<ActType, WeightOnlyQuantType::Int4b>
// weight 0 1 8 9 16 17 24 25 2 3 10 11 18 19 26 27 4 5 12 13 20 21 28 29 6 7 14 15 22 23 30 31
static constexpr int kShuffleSize = 32;
static constexpr int kShuffleBasicTile = 2;
static constexpr int kShuffleContinous = 4;
static constexpr int kShuffleContinuous = 4;
static constexpr int kShuffleStrided = 4;
// Each warp completes the internal reduce and writes the [Batch * NPerBlock * Interleave] results to the
@ -136,7 +136,7 @@ struct WeightOnlyDetails<ActType, WeightOnlyQuantType::Int8b>
// weight 0 1 8 9 2 3 10 11 4 5 12 13 6 7 14 15
static constexpr int kShuffleSize = 16;
static constexpr int kShuffleBasicTile = 2;
static constexpr int kShuffleContinous = 2;
static constexpr int kShuffleContinuous = 2;
static constexpr int kShuffleStrided = 4;
// Each warp completes the internal reduce and writes the [Batch * NPerBlock * Interleave] results to the
@ -177,7 +177,7 @@ struct WeightOnlyKernelDetails
static constexpr int kShuffleSize = Layout::kShuffleSize;
static constexpr int kShuffleBasicTile = Layout::kShuffleBasicTile;
static constexpr int kShuffleContinous = Layout::kShuffleContinous;
static constexpr int kShuffleContinuous = Layout::kShuffleContinuous;
static constexpr int kShuffleStrided = Layout::kShuffleStrided;
// The rearrangement here counteracts the effect of cutlass::add_bias_and_interleave_int4/8s_inplace
@ -352,7 +352,7 @@ __device__ void weight_only_batched_gemv(const uint8_t* qweight, const ActType*
weights_quantized + i * Details::kConvertCount / Details::kElemsPerByte)));
}
#pragma unroll
for (int i = 0; i < Details::kShuffleContinous; ++i)
for (int i = 0; i < Details::kShuffleContinuous; ++i)
{
#pragma unroll
for (int j = 0; j < Details::kShuffleStrided; ++j)
@ -360,7 +360,7 @@ __device__ void weight_only_batched_gemv(const uint8_t* qweight, const ActType*
// Dequantize the weights and arrange the shuffled elements back to the correct order in the
// register array
ActType2 v = *reinterpret_cast<ActType2*>(weights_vec + i * Details::kShuffleBasicTile
+ j * Details::kShuffleContinous * Details::kShuffleBasicTile);
+ j * Details::kShuffleContinuous * Details::kShuffleBasicTile);
v = __hfma2(
v, ActTypeDetails<ActType>::to_vec2(scale[idx]), ActTypeDetails<ActType>::to_vec2(zero[idx]));
weights_f16[(i * Details::kShuffleStrided * Details::kShuffleBasicTile

View File

@ -211,6 +211,7 @@ std::optional<Config> GemmPluginProfiler<Config, RunnerPtr, GemmIdType, GemmIdHa
<< " m=" << m << ", n=" << n << ", k=" << k << ")"
<< ", reason: \"" << e.what() << "\". Skipped";
TLLM_LOG_TRACE(msg.str());
cudaGetLastError(); // Reset the last cudaError to cudaSuccess.
continue;
}

View File

@ -823,6 +823,9 @@ int GPTAttentionPluginCommon::enqueueContext(const EnqueueContextParams<T, KVCac
if (mEnableContextFMHA)
{
const bool enablePagedKVContextFMHA = mPagedKVCache && mPagedContextFMHA;
// Paged Context FMHA doesn't work with fp8/int8 kv cache currently.
TLLM_CHECK_WITH_INFO(cache_type == KvCacheDataType::BASE || !enablePagedKVContextFMHA,
"Paged Context FMHA doesn't work with fp8/int8 kv cache currently.");
invokeApplyBiasRopeUpdateKVCache(const_cast<T*>(params.attention_input), q_buf_2_, kv_cache_buffer,
const_cast<T*>(params.qkv_bias), params.q_seq_lengths, params.kv_seq_lengths,
mRemovePadding ? padding_offset : nullptr, params.batch_size, params.input_seq_length,

View File

@ -69,8 +69,8 @@ nvinfer1::IPluginV2DynamicExt* SelectiveScanPlugin::clone() const noexcept
}
// Outputs
// output_tensor: [batch_size, dim, seq_len]
// state: [batch_size, dim, dstate]
// output_tensor: [batch_size, seq_len, dim]
// state: [batch_size, dstate, dim]
nvinfer1::DimsExprs SelectiveScanPlugin::getOutputDimensions(
int outputIndex, const nvinfer1::DimsExprs* inputs, int nbInputs, nvinfer1::IExprBuilder& exprBuilder) noexcept
{
@ -110,11 +110,9 @@ size_t SelectiveScanPlugin::getWorkspaceSize(const nvinfer1::PluginTensorDesc* i
}
void SelectiveScanPlugin::setSSMParams(SSMParamsBase& params, const size_t batch, const size_t dim, const size_t seqLen,
const size_t dstate, const size_t nChunks, const bool isVariableB, const bool isVariableC, void* statePtr,
const void* x, const void* delta, const void* deltaBias, const void* A, const void* B, const void* C, const void* D,
const void* z, void* out, const size_t strideXBatch, const size_t strideDtBatch, const size_t strideADim,
const size_t strideBBatch, const size_t strideCBatch, const size_t strideZBatch, const size_t strideOutBatch,
const size_t strideStateBatch, const size_t strideStateDim, bool deltaSoftplus)
const size_t dstate, const bool isVariableB, const bool isVariableC, void* statePtr, const void* x,
const void* delta, const void* deltaBias, const void* A, const void* B, const void* C, const void* D, const void* z,
void* out, bool deltaSoftplus)
{
// Reset the parameters
memset(&params, 0, sizeof(params));
@ -123,9 +121,6 @@ void SelectiveScanPlugin::setSSMParams(SSMParamsBase& params, const size_t batch
params.dim = dim;
params.seqlen = seqLen;
params.dstate = dstate;
params.n_groups = 1;
params.n_chunks = nChunks;
params.dim_ngroups_ratio = dim;
params.delta_softplus = deltaSoftplus;
@ -143,39 +138,6 @@ void SelectiveScanPlugin::setSSMParams(SSMParamsBase& params, const size_t batch
params.out_ptr = out;
params.x_ptr = statePtr;
params.z_ptr = const_cast<void*>(z);
// All stride are in elements, not bytes.
params.A_d_stride = strideADim;
params.A_dstate_stride = 1;
if (!isVariableB)
{
params.B_d_stride = dim * dstate;
}
else
{
params.B_batch_stride = strideBBatch;
params.B_group_stride = strideBBatch;
}
params.B_dstate_stride = !isVariableB ? dstate : seqLen;
if (!isVariableC)
{
params.C_d_stride = dim * dstate;
}
else
{
params.C_batch_stride = strideCBatch;
params.C_group_stride = strideCBatch;
}
params.C_dstate_stride = !isVariableC ? dstate : seqLen;
params.u_batch_stride = strideXBatch;
params.u_d_stride = seqLen;
params.delta_batch_stride = strideDtBatch;
params.delta_d_stride = seqLen;
params.z_batch_stride = strideZBatch;
params.z_d_stride = seqLen;
params.out_batch_stride = strideOutBatch;
params.out_d_stride = seqLen;
params.state_batch_stride = strideStateBatch;
params.state_d_stride = strideStateDim;
}
template <typename T>
@ -184,41 +146,31 @@ int SelectiveScanPlugin::enqueueImpl(const nvinfer1::PluginTensorDesc* inputDesc
cudaStream_t stream)
{
// inputs
// 0. input_tensor [batch_size, dim, seq_len]
// 1. state [batch_size, dim, dstate]
// 2. delta [batch_size, dim, seq_len]
// 0. input_tensor [batch_size, seq_len, dim]
// 1. state [batch_size, dstate, dim]
// 2. delta [batch_size, seq_len, dim]
// 3. delta_bias [dim]
// 4. A [dim, dstate]
// 5. B [batch_size, dstate, seq_len]
// 6. C [batch_size, dstate, seq_len]
// 4. A [dstate, dim]
// 5. B [batch_size, seq_len, dstate]
// 6. C [batch_size, seq_len, dstate]
// 7. D [dim]
// 8. z [batch_size, dim, seq_len]
// 8. z [batch_size, seq_len, dim]
// 9. host_request_types [batch_size] int32. 0: context; 1: generation.
// outputs
// 0. output_tensor [batch_size, dim, seq_len]
// 1. state [batch_size, dim, dstate]
// 0. output_tensor [batch_size, seq_len, dim]
// 1. state [batch_size, dstate, dim]
auto const batch_size = inputDesc[getInputTensorIdx()].dims.d[0];
auto const seq_len = inputDesc[getInputTensorIdx()].dims.d[2];
auto const stride_state_batch = mDim * mDState;
auto const stride_state_dim = mDState;
auto const stride_x_batch = mDim * seq_len;
auto const stride_dt_batch = mDim * seq_len;
auto const stride_A_dim = mDState;
auto const stride_B_batch = mDState * seq_len;
auto const stride_C_batch = mDState * seq_len;
auto const stride_z_batch = mDim * seq_len;
auto const stride_out_batch = mDim * seq_len;
auto const seq_len = inputDesc[getInputTensorIdx()].dims.d[1];
// Only context or generation is supported per call, not a mix of both.
RequestType const* reqTypes = static_cast<RequestType const*>(inputs[getHostRequestTypesIdx()]);
auto const n_chunks = (seq_len + 2048 - 1) / 2048;
SSMParamsBase ssm_params;
setSSMParams(ssm_params, batch_size, mDim, seq_len, mDState, n_chunks, mIsVariableB, mIsVariableC, outputs[1],
setSSMParams(ssm_params, batch_size, mDim, seq_len, mDState, mIsVariableB, mIsVariableC, outputs[1],
inputs[getInputTensorIdx()], inputs[getDeltaIdx()], inputs[getDeltaBiasIdx()], inputs[getAIdx()],
inputs[getBIdx()], inputs[getCIdx()], inputs[getDIdx()], inputs[getZIdx()], outputs[0], stride_x_batch,
stride_dt_batch, stride_A_dim, stride_B_batch, stride_C_batch, stride_z_batch, stride_out_batch,
stride_state_batch, stride_state_dim, mDeltaSoftplus);
inputs[getBIdx()], inputs[getCIdx()], inputs[getDIdx()], inputs[getZIdx()], outputs[0], mDeltaSoftplus);
if (reqTypes[0] == RequestType::kCONTEXT)
{
@ -321,9 +273,9 @@ SelectiveScanPluginCreator::SelectiveScanPluginCreator()
mPluginAttributes.clear();
mPluginAttributes.emplace_back(PluginField("dim", nullptr, PluginFieldType::kINT32, 16));
mPluginAttributes.emplace_back(PluginField("dstate", nullptr, PluginFieldType::kINT32, 16));
mPluginAttributes.emplace_back(PluginField("is_variable_B", nullptr, PluginFieldType::kINT32, 1));
mPluginAttributes.emplace_back(PluginField("is_variable_C", nullptr, PluginFieldType::kINT32, 1));
mPluginAttributes.emplace_back(PluginField("delta_softplus", nullptr, PluginFieldType::kINT32, 1));
mPluginAttributes.emplace_back(PluginField("is_variable_B", nullptr, PluginFieldType::kINT8, 1));
mPluginAttributes.emplace_back(PluginField("is_variable_C", nullptr, PluginFieldType::kINT8, 1));
mPluginAttributes.emplace_back(PluginField("delta_softplus", nullptr, PluginFieldType::kINT8, 1));
mPluginAttributes.emplace_back(PluginField("type_id", nullptr, PluginFieldType::kINT32, 1));
mFC.nbFields = mPluginAttributes.size();
mFC.fields = mPluginAttributes.data();

View File

@ -29,19 +29,19 @@ namespace tensorrt_llm::plugins
// cannot support beam search
// inputs
// 0. input_tensor [batch_size, dim, seq_len]
// 1. state [batch_size, dim, dstate]
// 2. delta [batch_size, dim, seq_len]
// 0. input_tensor [batch_size, seq_len, dim]
// 1. state [batch_size, dstate, dim]
// 2. delta [batch_size, seq_len, dim]
// 3. delta_bias [dim]
// 4. A [dim, seq_len]
// 5. B [batch_size, dstate, seq_len]
// 6. C [batch_size, dstate, seq_len]
// 4. A [dstate, dim]
// 5. B [batch_size, seq_len, dstate]
// 6. C [batch_size, seq_len, dstate]
// 7. D [dim]
// 8. z [batch_size, dim, seq_len]
// 8. z [batch_size, seq_len, dim]
// 9. host_request_types [batch_size] int32. 0: context; 1: generation; 2: none.
// outputs
// 0. output_tensor [batch_size, dim, seq_len]
// 1. state [batch_size, dim, dstate]
// 0. output_tensor [batch_size, seq_len, dim]
// 1. state [batch_size, dstate, dim]
class SelectiveScanPlugin : public BasePlugin
{
@ -144,15 +144,11 @@ private:
void setSSMParams(tensorrt_llm::kernels::SSMParamsBase& params,
// sizes
const size_t batch, const size_t dim, const size_t seqLen, const size_t dstate, const size_t nChunks,
const bool isVariableB, const bool isVariableC,
const size_t batch, const size_t dim, const size_t seqLen, const size_t dstate, const bool isVariableB,
const bool isVariableC,
// device pointers
void* statePtr, const void* x, const void* delta, const void* deltaBias, const void* A, const void* B,
const void* C, const void* D, const void* z, void* out,
// strides
const size_t strideXBatch, const size_t strideDtBatch, const size_t strideADim, const size_t strideBBatch,
const size_t strideCBatch, const size_t strideZBatch, const size_t strideOutBatch,
const size_t strideStateBatch, const size_t strideStateDim, bool deltaSoftplus);
const void* C, const void* D, const void* z, void* out, bool deltaSoftplus);
private:
int mDim;

View File

@ -195,20 +195,7 @@ void WeightOnlyGroupwiseQuantMatmulPlugin::init(nvinfer1::DataType type, int qua
{
TLLM_THROW("FP8 is unsupported on pre-Hopper architectures!");
}
if (quant_algo & ZERO)
{
// has zeros
m_weightOnlyGroupwiseGemmRunner
= std::make_shared<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunner<__nv_fp8_e4m3,
cutlass::int4b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS, half, half, half>>();
}
else
{
// no zeros
m_weightOnlyGroupwiseGemmRunner
= std::make_shared<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunner<__nv_fp8_e4m3,
cutlass::int4b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_ONLY, half, half, half>>();
}
TLLM_THROW("FP8 is unsupported on with BF16 scales and zero-points!");
}
else
{
@ -301,8 +288,7 @@ bool WeightOnlyGroupwiseQuantMatmulPlugin::supportsFormatCombination(
if (pos == mWeightInputIdx)
{
// weights
return inOut[mWeightInputIdx].type == nvinfer1::DataType::kHALF
&& inOut[mWeightInputIdx].format == TensorFormat::kLINEAR;
return inOut[mWeightInputIdx].type == mType && inOut[mWeightInputIdx].format == TensorFormat::kLINEAR;
}
else if ((mQuantAlgo & FP8_ALPHA) && pos == mAlphaInputIdx)
{
@ -310,7 +296,7 @@ bool WeightOnlyGroupwiseQuantMatmulPlugin::supportsFormatCombination(
}
else
{
return inOut[pos].type == nvinfer1::DataType::kHALF && inOut[pos].format == TensorFormat::kLINEAR;
return inOut[pos].type == mType && inOut[pos].format == TensorFormat::kLINEAR;
}
}
else
@ -374,7 +360,14 @@ int WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(const nvinfer1::PluginTensorDe
}
const int n = inputDesc[mWeightInputIdx].dims.d[1];
const int k = inputDesc[0].dims.d[inputDesc[0].dims.nbDims - 1];
int smVersion = getSMVersion();
bool use_cuda_kernel = m < SMALL_M_FAST_PATH && mCudaKernelEnabled;
#if defined(ENABLE_BF16)
// CUDA kernels assume FP16 activations for Hopper
bool force_disable_cuda_kernel = smVersion == 90 && mType == nvinfer1::DataType::kBF16;
use_cuda_kernel = use_cuda_kernel && !force_disable_cuda_kernel;
#endif
bool use_pre_quant_scale = mQuantAlgo & PRE_QUANT_SCALE;
const half* zeros_ptr = (mQuantAlgo & ZERO) ? reinterpret_cast<const half*>(inputs[mZerosInputIdx]) : nullptr;
@ -443,7 +436,7 @@ int WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(const nvinfer1::PluginTensorDe
weight_only_act_type = tensorrt_llm::kernels::WeightOnlyActivationType::BF16;
}
if (getSMVersion() == 90)
if (smVersion == 90)
{
// Hopper style kernels
if (use_cuda_kernel)

View File

@ -37,6 +37,7 @@ set(SRCS
runtimeBuffers.cpp
runtimeKernels.cu
statefulGptDecoder.cpp
tllmBuffers.cpp
tllmRuntime.cpp
tllmLogger.cpp
worldConfig.cpp)

View File

@ -66,6 +66,7 @@ SamplingConfig extractSamplingConfig(SamplingConfig const& batchSamplingConfig,
samplingConfig.beamSearchDiversityRate = batchSamplingConfig.beamSearchDiversityRate;
samplingConfig.lengthPenalty = batchSamplingConfig.lengthPenalty;
samplingConfig.earlyStopping = batchSamplingConfig.earlyStopping;
samplingConfig.normalizeLogProbs = batchSamplingConfig.normalizeLogProbs;
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
return samplingConfig;
@ -278,7 +279,7 @@ void GptDecoderBatch::newRequest(
tc::fmtstr("Input length (%d) + max new tokens (%d) must be less than max sequence length (%d).", inputLength,
maxNewTokens, mMaxSequenceLength));
TLLM_CHECK(requestIds->getDataType() == TRTDataType<TokenIdType>::value);
auto const endId = request.endId.value_or(mVocabSize - 1);
auto const endId = request.endId.value_or(-1);
auto constexpr localBatchSize = 1;
@ -459,6 +460,7 @@ void GptDecoderBatch::newRequest(
{
mDecoders[decoderIdx]->setup(samplingConfig, localBatchSize, mMaxSequenceLength);
}
TLLM_CHECK_WITH_INFO(!mFusedDecoder || beamWidth == 1, "Fused decoder is not supported for beam search yet.");
mBeamWidths[batchIdx] = beamWidth;
mNbSteps[batchIdx] = 0;
mFinished[batchIdx] = false;
@ -622,8 +624,6 @@ GptDecoderBatch::TokenPtr GptDecoderBatch::forwardAsync(
}
else
{
TLLM_CHECK_WITH_INFO(mBeamWidths[0] == 1, "Fused decoder is not supported for beam search yet.");
auto& dInput = *mJointDecodingInput;
auto& dOutput = *mJointDecodingOutput;
auto& decoder = *mDecoders[0];

View File

@ -185,8 +185,8 @@ template <typename InputType>
GptJsonConfig parseJson(InputType&& input)
{
auto constexpr allowExceptions = true;
auto constexpr ingoreComments = true;
auto const json = nlohmann::json::parse(std::forward<InputType>(input), nullptr, allowExceptions, ingoreComments);
auto constexpr ignoreComments = true;
auto const json = nlohmann::json::parse(std::forward<InputType>(input), nullptr, allowExceptions, ignoreComments);
auto const engineVersion = parseJsonFieldOr(json, "version", std::string("none"));

View File

@ -19,6 +19,7 @@
#include "tensorrt_llm/runtime/gptSession.h"
#include "common.h"
#include "iBuffer.h"
#include "tensorrt_llm/batch_manager/kvCacheManager.h"
#include "tensorrt_llm/common/customAllReduceUtils.h"
@ -55,13 +56,13 @@ std::unordered_set<std::int32_t> populateMicrobatchIndexes()
std::unordered_set<std::int32_t> idxSet;
if (profileMbIdxChar != nullptr)
{
std::istringstream ss{profileMbIdxChar};
std::istringstream iss{profileMbIdxChar};
std::int32_t idx;
char c;
while (ss >> idx)
while (iss >> idx)
{
idxSet.insert(idx);
ss >> c;
iss >> c;
}
}
@ -79,9 +80,6 @@ GptSession::GptSession(Config const& sessionConfig, GptModelConfig const& modelC
, mDevice{utils::initDevice(worldConfig)}
, mLogger{logger ? std::move(logger) : std::make_shared<TllmLogger>()}
, mRuntime{std::make_shared<TllmRuntime>(engineBuffer, engineSize, *mLogger)}
, mDecoders{}
, mBuffers{}
, mCudaGraphInstances{}
{
if (mWorldConfig.isPipelineParallel())
{
@ -157,9 +155,13 @@ void GptSession::createDecoders(SizeType batchSize, SizeType beamWidth, SizeType
for (SizeType i = 0; i < numMicroBatches; ++i)
{
if (decoderPerRequest)
{
mDecoders.emplace_back(std::make_shared<GptDecoderBatch>(vocabSize, vocabSizePadded, stream));
}
else
{
mDecoders.emplace_back(std::make_shared<StatefulGptDecoder>(vocabSize, vocabSizePadded, stream));
}
constexpr SizeType maxTokensPerStep = 1;
mDecoders.back()->setup(decodingMode, batchSize, beamWidth, maxAttentionWindow, sinkTokenLength,
maxSequenceLength, maxTokensPerStep, /* fusedDecoder*/ false, logitsType);
@ -174,19 +176,21 @@ void GptSession::createKvCacheManager(SizeType batchSize, SizeType beamWidth, Si
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
auto const tokensPerBlock = mModelConfig.getTokensPerBlock();
nvinfer1::DataType kvDtype;
auto const kvDtype = [this]()
{
if (mModelConfig.getQuantMode().hasFp8KvCache())
{
kvDtype = nvinfer1::DataType::kFP8;
return nvinfer1::DataType::kFP8;
}
else if (mModelConfig.getQuantMode().hasInt8KvCache())
{
kvDtype = nvinfer1::DataType::kINT8;
return nvinfer1::DataType::kINT8;
}
else
{
kvDtype = mModelConfig.getDataType();
return mModelConfig.getDataType();
}
}();
auto const maxNumBlocks = bmkv::KVCacheManager::calculateMaxNumBlocks(
kvCacheConfig, kvDtype, mModelConfig, mWorldConfig, getBufferManager());
@ -208,6 +212,7 @@ void GptSession::createKvCacheManager(SizeType batchSize, SizeType beamWidth, Si
void GptSession::createCustomAllReduceWorkspace(
SizeType maxBatchSize, SizeType maxBeamWidth, SizeType maxSequenceLength)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
setPeerAccess(mWorldConfig, true);
mIpcMemoryHandles.clear();
@ -219,11 +224,10 @@ void GptSession::createCustomAllReduceWorkspace(
mIpcMemoryHandles.emplace_back(std::make_shared<IpcMemory>(mWorldConfig, IpcMemory::FLAGS_SIZE * sizeof(int32_t)));
mIpcMemoryHandles.emplace_back(std::make_shared<IpcMemory>(mWorldConfig, IpcMemory::FLAGS_SIZE * sizeof(int32_t)));
auto& manager = mRuntime->getBufferManager();
mCommPtrs = manager.cpu(
mCommPtrs = BufferManager::cpu(
ITensor::makeShape({static_cast<SizeType>(mIpcMemoryHandles.size()) * mWorldConfig.getTensorParallelism()}),
nvinfer1::DataType::kINT64);
const auto commPtrsData = bufferCast<void*>(*mCommPtrs);
auto* const commPtrsData = bufferCast<void*>(*mCommPtrs);
for (size_t memIdx = 0; memIdx < mIpcMemoryHandles.size(); memIdx++)
{
@ -233,6 +237,7 @@ void GptSession::createCustomAllReduceWorkspace(
commPtrsData[memIdx * mWorldConfig.getTensorParallelism() + tpIdx] = memCommPtrs[tpIdx];
}
}
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
}
GptSession::MicroBatchConfig::MicroBatchConfig(SizeType maxBatchSize, SizeType pipelineParallelism,
@ -289,6 +294,8 @@ void GptSession::setup(Config const& sessionConfig)
createContexts();
createBuffers(mMicroBatchConfig.numGenBatches);
mNormalizeLogProbs = sessionConfig.normalizeLogProbs;
// Store these params, which determine the decoder buffer sizes and the kv cache manager, so they
// can be checked against the input shapes given in generate().
// gptDecoderBatch does not resize buffers, but allows smaller batchSize and beamWidth.
@ -297,12 +304,6 @@ void GptSession::setup(Config const& sessionConfig)
mDecoderMaxAttentionWindow = maxAttentionWindow;
mDecoderSinkTokenLength = sinkTokenLength;
if (mModelConfig.usePagedKvCache())
{
createKvCacheManager(maxBatchSize, maxBeamWidth, maxAttentionWindow, sinkTokenLength, maxSequenceLength,
sessionConfig.kvCacheConfig);
}
if (mWorldConfig.isLastPipelineParallelRank())
{
auto const logitsType = mRuntime->getEngine().getTensorDataType("logits");
@ -317,14 +318,22 @@ void GptSession::setup(Config const& sessionConfig)
{
mReceivedEvents.clear();
for (SizeType i = 0; i < mMicroBatchConfig.numGenBatches; ++i)
{
mReceivedEvents.emplace_back();
}
}
if (mWorldConfig.isTensorParallel() && mModelConfig.useCustomAllReduce())
{
createCustomAllReduceWorkspace(mMicroBatchConfig.genBatchSize, maxBeamWidth, maxSequenceLength);
}
if (mModelConfig.usePagedKvCache())
{
createKvCacheManager(maxBatchSize, maxBeamWidth, maxAttentionWindow, sinkTokenLength, maxSequenceLength,
sessionConfig.kvCacheConfig);
}
auto* kvCacheManager = mModelConfig.usePagedKvCache() ? mKvCacheManager.get() : nullptr;
for (auto& buffers : mBuffers)
@ -334,6 +343,7 @@ void GptSession::setup(Config const& sessionConfig)
mMicroBatchConfig.genBatchSize, maxBeamWidth, 0, maxAttentionWindow, sinkTokenLength, maxSequenceLength};
buffers->reshape(kvCacheManager, mModelConfig, mWorldConfig);
}
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
}
@ -344,7 +354,7 @@ void GptSession::kvCacheAddSequences(SizeType beamWidth, SizeType microBatchId,
TLLM_CHECK(mKvCacheManager);
auto contextLengthsHost = mBuffers.at(microBatchId)->contextLengthsHost;
TLLM_CHECK(contextLengthsHost);
auto const contextLengthsPtr = bufferCast<SizeType const>(*contextLengthsHost);
const auto* const contextLengthsPtr = bufferCast<SizeType const>(*contextLengthsHost);
auto const contextLengthsSize = static_cast<SizeType>(contextLengthsHost->getSize());
for (SizeType batchIdx = 0; batchIdx < contextLengthsSize; ++batchIdx)
{
@ -358,9 +368,9 @@ ITensor::SharedPtr GptSession::initDecoder(ITensor& outputIds, GenerationInput c
{
if (mWorldConfig.isLastPipelineParallelRank())
{
auto& decoder = mDecoders.at(microBatchId);
decoder->newBatch(inputs, outputs, samplingConfig);
return decoder->getNewTokens();
auto& decoder = *mDecoders.at(microBatchId);
decoder.newBatch(inputs, outputs, samplingConfig);
return decoder.getNewTokens();
}
else if (mWorldConfig.isFirstPipelineParallelRank())
{
@ -467,7 +477,9 @@ std::vector<GenerationInput> splitInputs(GenerationInput const& inputs, SizeType
auto const batchSize = microBatchOffsets[batchId + 1] - offset;
if (inputs.embeddingBias)
{
batch.embeddingBias = inputs.embeddingBias;
}
if (inputs.badWordsList)
{
@ -487,21 +499,29 @@ std::vector<GenerationInput> splitInputs(GenerationInput const& inputs, SizeType
batch.stopWordsList = ITensor::slice(inputs.stopWordsList, offset, batchSize);
}
if (inputs.maxNewTokens)
{
batch.maxNewTokens = inputs.maxNewTokens;
}
if (inputs.promptTuningParams.embeddingTable)
{
batch.promptTuningParams.embeddingTable = inputs.promptTuningParams.embeddingTable;
}
if (inputs.promptTuningParams.tasks)
{
batch.promptTuningParams.tasks = ITensor::slice(inputs.promptTuningParams.tasks, offset, batchSize);
}
if (inputs.promptTuningParams.vocabSize)
{
batch.promptTuningParams.vocabSize = inputs.promptTuningParams.vocabSize;
}
}
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
return inputBatches;
}
std::vector<GenerationOutput> splitOutputs(GenerationOutput& outputs, SizeType microBatchSize, BufferManager& manager)
std::vector<GenerationOutput> splitOutputs(GenerationOutput& outputs, SizeType microBatchSize)
{
auto const numRequests = outputs.ids->getShape().d[0];
@ -547,8 +567,8 @@ void updateOutputIds(ITensor::SharedPtr const& outputIds, ITensor::SharedPtr con
}
} // namespace
void GptSession::generate(
GenerationOutput& outputs, GenerationInput const& inputs, SamplingConfig const& samplingConfig)
void GptSession::generate(GenerationOutput& outputs, GenerationInput const& inputs,
SamplingConfig const& samplingConfig, std::shared_ptr<GenerationProfiler> const generationProfiler)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
@ -604,9 +624,9 @@ void GptSession::generate(
}
else
{
for (auto iter = inputLengthsRange.begin(); iter != inputLengthsRange.end(); iter++)
for (auto iter : inputLengthsRange)
{
maxNewTokens = std::max(maxNewTokens, mDecoderMaxSequenceLength - *iter);
maxNewTokens = std::max(maxNewTokens, mDecoderMaxSequenceLength - iter);
}
}
@ -635,13 +655,13 @@ void GptSession::generate(
{
std::vector<GenerationInput> microBatchesInputs{inputs};
std::vector<GenerationOutput> microBatchesOutputs{outputs};
generateBatched(microBatchesOutputs, microBatchesInputs, samplingConfig, onTokenGenerated);
generateBatched(microBatchesOutputs, microBatchesInputs, samplingConfig, onTokenGenerated, generationProfiler);
}
else
{
auto const microBatchesInputs = splitInputs(inputs, mMicroBatchConfig.genBatchSize, manager);
auto microBatchesOutputs = splitOutputs(outputs, mMicroBatchConfig.genBatchSize, manager);
generateBatched(microBatchesOutputs, microBatchesInputs, samplingConfig, onTokenGenerated);
auto microBatchesOutputs = splitOutputs(outputs, mMicroBatchConfig.genBatchSize);
generateBatched(microBatchesOutputs, microBatchesInputs, samplingConfig, onTokenGenerated, generationProfiler);
}
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
@ -665,7 +685,7 @@ GptSession::TokenGeneratedCallback GptSession::createOnTokenGeneratedCallback(Ge
void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutputs,
std::vector<GenerationInput> const& microBatchesInputs, SamplingConfig const& samplingConfig,
TokenGeneratedCallback const& onTokenGenerated)
TokenGeneratedCallback const& onTokenGenerated, std::shared_ptr<GenerationProfiler> const generationProfiler)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
@ -743,7 +763,7 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
auto const profileContext = !kProfileMbIdxs.empty() && kProfileMbIdxs.count(0) > 0;
if (profileContext)
cudaProfilerStart();
executeContextStep(microBatchesInputs, microBatchesOutputs, microBatchOffsets, kvCacheManager);
executeContextStep(microBatchesInputs, microBatchOffsets, kvCacheManager);
if (profileContext)
cudaProfilerStop();
@ -751,6 +771,11 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
SizeType numBatchesFinished{0};
SizeType step{0};
if (generationProfiler)
{
manager.getStream().record(generationProfiler->getStart());
}
while (numBatchesFinished < numMicroBatches)
{
++step;
@ -768,6 +793,11 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
cudaProfilerStop();
}
if (generationProfiler)
{
manager.getStream().record(generationProfiler->getEnd());
}
// Collect the results for the last step
for (auto microBatchId = 0; microBatchId < numMicroBatches; ++microBatchId)
{
@ -796,12 +826,15 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
auto& cumLogProbs = buffers.cumLogProbs;
if (cumLogProbs)
{
manager.copy(*decoder.getCumLogProbs(), *buffers.cumLogProbs);
}
auto& logProbs = buffers.logProbs;
if (logProbs)
{
manager.copy(*decoder.getLogProbs(), *buffers.logProbs);
}
}
// copy generation logits fragments into a single generationLogits tensor
if (mModelConfig.computeGenerationLogits())
{
@ -823,19 +856,18 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
}
void GptSession::executeContextStep(std::vector<GenerationInput> const& microBatchesInputs,
std::vector<GenerationOutput>& microBatchesOutputs, std::vector<SizeType> const& generationBatchOffsets,
KvCacheManager const* kvCacheManager)
void GptSession::executeContextStep(std::vector<GenerationInput> const& generationBatchesInputs,
std::vector<SizeType> const& generationBatchesOffsets, KvCacheManager const* kvCacheManager)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
auto& manager = mRuntime->getBufferManager();
auto const numGenerationBatches = static_cast<SizeType>(microBatchesInputs.size());
auto const numGenerationBatches = static_cast<SizeType>(generationBatchesInputs.size());
auto constexpr step = 0;
auto constexpr contextId = 0;
for (auto generationBatchId = 0; generationBatchId < numGenerationBatches; ++generationBatchId)
{
auto const& generationBatchInputs = microBatchesInputs.at(generationBatchId);
auto const& generationBatchInputs = generationBatchesInputs.at(generationBatchId);
auto& generationBuffers = *mBuffers.at(generationBatchId);
auto const contextBatchSize = mMicroBatchConfig.ctxBatchSize;
@ -847,7 +879,7 @@ void GptSession::executeContextStep(std::vector<GenerationInput> const& microBat
for (auto contextBatchId = 0; contextBatchId < numContextBatches; ++contextBatchId)
{
auto batchOffset = generationBatchOffsets.at(generationBatchId) + contextBatchOffsets.at(contextBatchId);
auto batchOffset = generationBatchesOffsets.at(generationBatchId) + contextBatchOffsets.at(contextBatchId);
auto& buffers = contextBuffers.at(contextBatchId);
auto& inputBuffer = buffers.inputBuffers[0];
auto& outputBuffer = buffers.outputBuffers[0];
@ -976,7 +1008,7 @@ SizeType GptSession::executeGenerationStep(SizeType step, std::vector<Generation
void GptSession::decoderStepAsync(SizeType decoderStep, SizeType microBatchId)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
auto& stream = mRuntime->getStream();
auto const& stream = mRuntime->getStream();
auto& buffers = *mBuffers.at(microBatchId);
auto const& outputIds = buffers.outputIds;
auto const& newTokens = buffers.newTokens;

View File

@ -20,7 +20,7 @@
namespace tensorrt_llm::runtime
{
void setPeerAccess(WorldConfig worldConfig, bool enable)
void setPeerAccess(WorldConfig const& worldConfig, bool enable)
{
const auto srcNode = worldConfig.getTensorParallelRank();
@ -50,7 +50,7 @@ void setPeerAccess(WorldConfig worldConfig, bool enable)
}
}
IpcMemory::IpcMemory(WorldConfig worldConfig, std::size_t bufferSize)
IpcMemory::IpcMemory(WorldConfig const& worldConfig, std::size_t bufferSize)
: mWorldConfig(worldConfig)
, mCommPtrs(worldConfig.getTensorParallelism())
, mBufferSize(bufferSize)

View File

@ -78,7 +78,7 @@ ncclComm_t NcclCommunicator::createComm(int worldSize, int rank, mpi::MpiComm co
{
ncclGetUniqueId(&id);
}
mpiComm.bcast(id, 0);
mpiComm.bcastValue(id, 0);
ncclComm_t comm;
TLLM_NCCL_CHECK(ncclCommInitRank(&comm, worldSize, id, rank));
return comm;

View File

@ -0,0 +1,31 @@
/*
* Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "tensorrt_llm/runtime/tllmBuffers.h"
namespace tensorrt_llm::runtime
{
template <typename TAllocator>
typename PoolAllocator<TAllocator>::PoolType& PoolAllocator<TAllocator>::getPool()
{
static PoolType pool;
return pool;
}
// explicit instantiations
template class PoolAllocator<PinnedAllocator>;
} // namespace tensorrt_llm::runtime

View File

@ -480,11 +480,7 @@ public:
using SizeType = typename Base::SizeType;
using PoolType = MemoryPool<TAllocator>;
static PoolType& getPool()
{
static PoolType pool;
return pool;
}
static PoolType& getPool();
protected:
void allocateImpl(PointerType* ptr, SizeType n) // NOLINT(readability-convert-member-functions-to-static)

View File

@ -46,7 +46,7 @@ void testBroadcast()
auto constexpr expectedValue = static_cast<T>(42);
auto constexpr root = 0;
auto value = rank == root ? expectedValue : T{};
comm.bcast(value, root);
comm.bcastValue(value, root);
EXPECT_EQ(value, expectedValue);
}
@ -79,7 +79,7 @@ TEST(MPIUtils, BroadcastNcclId)
{
std::memset(&id, 0, sizeof(id));
}
comm.bcast(id, root);
comm.bcastValue(id, root);
EXPECT_TRUE(std::any_of(
id.internal, id.internal + sizeof(id.internal) / sizeof(id.internal[0]), [](auto x) { return x != 0; }));
}

View File

@ -15,7 +15,6 @@
# limitations under the License.
import argparse as _arg
import glob as _gl
import logging as _log
import os as _os
import pathlib as _pl
@ -72,7 +71,7 @@ def build_trt_llm(python_exe: str,
python_exe, "scripts/build_wheel.py", "--cuda_architectures",
cuda_architectures, "--build_dir",
str(build_dir), "--dist_dir",
str(dist_dir)
str(dist_dir), "-s", "-i"
]
if use_ccache:
@ -86,12 +85,6 @@ def build_trt_llm(python_exe: str,
run_command(build_wheel, cwd=root_dir, env=_os.environ, timeout=2400)
dist_dir = dist_dir if dist_dir.is_absolute() else root_dir / dist_dir
wheels = _gl.glob(str(dist_dir / "tensorrt_llm-*.whl"))
assert len(wheels) > 0, "No wheels found"
install_wheel = [python_exe, "-m", "pip", "install", "--upgrade", *wheels]
run_command(install_wheel, cwd=root_dir, timeout=300)
def run_tests(cuda_architectures: _tp.Optional[str] = None,
build_dir: _tp.Optional[str] = None,
@ -369,11 +362,18 @@ def run_multi_gpu_tests(build_dir: _pl.Path):
tests_dir = build_dir / "tests"
cpp_env = {**_os.environ}
# TP2+PP2 tests fail for beam search
session_test = [
"mpirun", "-n", "4", "--allow-run-as-root", "gptSessionTest",
"--gtest_filter=*TP*:*PP*"
"--gtest_filter=*TP4*:*PP4*"
]
run_command(session_test, cwd=tests_dir, env=cpp_env, timeout=900)
run_command(session_test, cwd=tests_dir, env=cpp_env, timeout=300)
trt_model_test = [
"mpirun", "-n", "4", "--allow-run-as-root",
"batch_manager/trtGptModelRealDecoderTest", "--gtest_filter=*TP*:*PP*"
]
run_command(trt_model_test, cwd=tests_dir, env=cpp_env, timeout=300)
def run_benchmarks(python_exe: str, root_dir: _pl.Path, build_dir: _pl.Path,

View File

@ -15,8 +15,6 @@ GROUP_NAME ?= $(shell id --group --name)
LOCAL_USER ?= 0
ifeq ($(LOCAL_USER),1)
IMAGE_TAG_SUFFIX ?= -$(USER_NAME)
else
IMAGE_TAG_SUFFIX ?=
endif
# Default stage of the docker multi-stage build
@ -70,7 +68,7 @@ endef
$(if $(GIT_COMMIT), --build-arg GIT_COMMIT="$(GIT_COMMIT)") \
$(if $(STAGE), --target $(STAGE)) \
--file Dockerfile.multi \
--tag $(IMAGE_WITH_TAG)$(IMAGE_TAG_SUFFIX) \
--tag $(IMAGE_WITH_TAG) \
..
%_user:
@ -122,15 +120,15 @@ release_%: STAGE = release
release_run: WORK_DIR = /app/tensorrt_llm
# For x86_64
jenkins_%: IMAGE_TAG = jenkins_latest
jenkins_%: IMAGE_WITH_TAG = $(shell grep 'LLM_DOCKER_IMAGE = ' ../jenkins/L0_MergeRequest.groovy | grep -o '".*"' | tr -d '"')
jenkins_%: STAGE = devel
# For aarch64
jenkins-aarch64_%: IMAGE_TAG = jenkins-aarch64_latest
jenkins-aarch64_%: IMAGE_WITH_TAG = $(shell grep 'LLM_SBSA_DOCKER_IMAGE = ' ../jenkins/GH200Builder.groovy | grep -o '".*"' | tr -d '"')
jenkins-aarch64_%: STAGE = devel
# For x86_64
centos7_%: IMAGE_TAG = centos7_latest
centos7_%: IMAGE_WITH_TAG = $(shell grep 'LLM_CENTOS7_DOCKER_IMAGE = ' ../jenkins/L0_MergeRequest.groovy | grep -o '".*"' | tr -d '"')
centos7_%: STAGE = devel
centos7_%: BASE_IMAGE = nvidia/cuda
centos7_%: BASE_TAG = 12.3.1-devel-centos7
@ -141,7 +139,7 @@ ubuntu22_%: BASE_IMAGE = nvidia/cuda
ubuntu22_%: BASE_TAG = 12.3.1-devel-ubuntu22.04
# For x86_64
old-cuda_%: IMAGE_TAG = old-cuda_latest
old-cuda_%: IMAGE_WITH_TAG = $(shell grep 'LLM_OLD_CUDA_DOCKER_IMAGE = ' ../jenkins/L0_MergeRequest.groovy | grep -o '".*"' | tr -d '"')
old-cuda_%: BASE_TAG = 23.07-py3
old-cuda_%: STAGE = devel
old-cuda_%: CUDA_VERSION = 12.1

View File

@ -252,7 +252,7 @@ populates an instance of the
* `embeddingBiasOpt`, is a tensor of floating-point values on the GPU that
contains the bias to add to the logits during sampling (after the projection
from hidden states to logits as the last step of the model). This tensor
must have `vocabSize` elements (as defined in the `ModelConfig` argument
must have `vocabSize` elements (as defined in the `modelConfig` argument
passed to the constructor),
* `badWordsList`, is a tensor of integers on the GPU that encodes the list of
words that have to be banned from generated sequences. Its shape is `[2,
@ -315,8 +315,7 @@ batchSize, beamWidth]`_.
After inference is complete, you can get the context logits in `GenerationOutput.contextLogits`; these are variables on the GPU. For specific acquisition methods, please refer to the example in [gptSessionBenchmark.cpp](https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/gptSessionBenchmark.cpp).
It is important to point out
that enabling that computation may have an impact on performance (the final
LM head has to perform a matrix multiplication on all the context tokens
that enabling the computation may have an impact on performance (the language modeling head (LM head) has to perform a matrix multiplication on all the context tokens
instead of just the last one).
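Below is a minimal, illustrative C++ sketch of reading that tensor after a call to `generate`. It is not the gptSessionBenchmark code; it assumes a `GptSession` built from an engine with context-logits gathering enabled, input/output structures prepared as described in this document, and a hypothetical helper name:

```cpp
#include <iostream>

#include "tensorrt_llm/runtime/gptSession.h"

namespace tlr = tensorrt_llm::runtime;

// Hypothetical helper (a sketch, not part of the library): run one generation and
// print the dimensions of the context-logits tensor returned in GenerationOutput.
void dumpContextLogitsShape(tlr::GptSession& session, tlr::GenerationInput const& input,
    tlr::GenerationOutput& output, tlr::SamplingConfig const& samplingConfig)
{
    session.generate(output, input, samplingConfig);

    if (output.contextLogits) // only populated when the engine gathers context logits
    {
        auto const shape = output.contextLogits->getShape();
        std::cout << "contextLogits dims:";
        for (int i = 0; i < shape.nbDims; ++i)
        {
            std::cout << " " << shape.d[i];
        }
        // The tensor lives on the GPU; copy it to host with a BufferManager before
        // reading individual values.
        std::cout << std::endl;
    }
}
```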
* `generationLogits`, is a tensor of values on the GPU (same datatype as the
computation type) to store the logits for the generation. Its shape is

View File

@ -23,6 +23,10 @@ Welcome to TensorRT-LLM's documentation!
graph-rewriting.md
memory.md
new_workflow.md
lora.md
perf_best_practices.md
performance_analysis.md
Python API
----------
@ -81,3 +85,4 @@ Blogs
blogs/H100vsA100.md
blogs/H200launch.md
blogs/Falcon180B-H200.md
blogs/quantization-in-TRT-LLM.md

View File

@ -22,7 +22,7 @@ Optional tensors that can be supplied to `InferenceRequest` are shown below. Def
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token Id |
| `end_id` | [1] | `int32_t` | End token Id. If not specified, defaults to -1 |
| `pad_id` | [1] | `int32_t` | Pad token Id |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |

View File

@ -60,8 +60,8 @@ The different files will be loaded by different ranks in a multi-GPU (multi-proc
| mapping.world_size | int | 1 |
| mapping.tp_size | int | 1 |
| mapping.pp_size | int | 1 |
| quantization.quant_aglo | str | null |
| quantization.kv_cache_quant_aglo | str | null |
| quantization.quant_algo | str | null |
| quantization.kv_cache_quant_algo | str | null |
| quantization.group_size | int | 64 |
| quantization.has_zero_point | bool | False |
| quantization.pre_quant_scale | bool | False |
@ -211,10 +211,6 @@ Here is the `config.json`:
"position_embedding_type": "learned_absolute",
"max_position_embeddings": 2048,
"hidden_act": "relu",
"quantization": {
"use_weight_only": false,
"weight_only_precision": "int8"
},
"mapping": {
"world_size": 2,
"tp_size": 2

View File

@ -17,122 +17,214 @@ described in the benchmarks [folder](source:benchmarks/).
The below tables provide reference data at large batch sizes, representing
high throughput offline tasks.
This data has been updated for v0.6.1, unless specified.
All data was generated using version 0.8.0.
### H200 GPUs (FP8)
### H100 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--------------------------- | :--------- | :-------- | :----------- | :------------ | -------------------------: |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 26,150 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 8,011 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,551 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,327 |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 29,168 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 9,472 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,961 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 4,149 |
| | | | | | |
| LLaMA 7B | 768 | 1 | 128 | 128 | 19,694 |
| LLaMA 7B | 112 | 1 | 128 | 2048 | 6,818 |
| LLaMA 7B | 80 | 1 | 2048 | 128 | 2,244 |
| LLaMA 7B | 48 | 1 | 2048 | 2048 | 2,740 |
| Mistral 7B | 896 | 1 | 128 | 128 | 20,569 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 8,968 |
| Mistral 7B | 84 | 1 | 2048 | 128 | 2,450 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 3,868 |
| | | | | | |
| LLaMA 70B | 1024 | 2 | 128 | 128 | 2,657 |
| LLaMA 70B | 480 | 4 | 128 | 2048 | 1,486 |
| LLaMA 70B | 96 | 2 | 2048 | 128 | 306 |
| LLaMA 70B | 64 | 2 | 2048 | 2048 | 547 |
| LLaMA 7B | 896 | 1 | 128 | 128 | 20,548 |
| LLaMA 7B | 120 | 1 | 128 | 2048 | 8,343 |
| LLaMA 7B | 84 | 1 | 2048 | 128 | 2,429 |
| LLaMA 7B | 56 | 1 | 2048 | 2048 | 3,530 |
| | | | | | |
| Falcon 180B | 1024 | 4 | 128 | 128 | 987 |
| Falcon 180B | 1024 | 8 | 128 | 2048 | 724 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 112 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 264 |
| LLaMA 70B | 512 | 1 | 128 | 128 | 3,844 |
| LLaMA 70B | 512 | 2 | 128 | 2048 | 4,008 |
| LLaMA 70B | 64 | 1 | 2048 | 128 | 421 |
| LLaMA 70B | 64 | 1 | 2048 | 2048 | 1,461 |
| | | | | | |
| Falcon 180B | 1024 | 4 | 128 | 128 | 1,116 |
| Falcon 180B | 1024 | 4 | 128 | 2048 | 990 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 118 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 269 |
### L40S GPUs (FP8)<sup>*</sup>
<sup> * The following data is from TensorRT-LLM v0.5. </sup>
### H100 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--------------------------- | :--------- | :-------- | :----------- | :------------ | -------------------------: |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 27,357 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 7,831 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,661 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,409 |
| | | | | | |
| Mistral 7B | 896 | 1 | 128 | 128 | 20,517 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 8,619 |
| Mistral 7B | 64 | 1 | 2048 | 128 | 2,438 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 3,733 |
| | | | | | |
| LLaMA 7B | 896 | 1 | 128 | 128 | 20,241 |
| LLaMA 7B | 120 | 1 | 128 | 2048 | 6,922 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,170 |
| LLaMA 7B | 56 | 1 | 2048 | 2048 | 2,816 |
| | | | | | |
| LLaMA 70B | 1024 | 2 | 128 | 128 | 3,269 |
| LLaMA 70B | 512 | 4 | 128 | 2048 | 2,718 |
| LLaMA 70B | 96 | 2 | 2048 | 128 | 347 |
| LLaMA 70B | 64 | 2 | 2048 | 2048 | 1,020 |
| | | | | | |
| Falcon 180B | 512 | 4 | 128 | 128 | 1,048 |
| Falcon 180B | 1024 | 8 | 128 | 2048 | 836 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 114 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 250 |
### L40S GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--------------------------- | :--------- | :-------- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,630 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,859 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 616 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 757 |
| GPT-J 6B | 512 | 1 | 128 | 128 | 7,992 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,874 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 693 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 768 |
| | | | | | |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,240 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,622 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 581 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 531 |
| Mistral 7B | 896 | 1 | 128 | 128 | 9,679 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 4,401 |
| Mistral 7B | 84 | 1 | 2048 | 128 | 979 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 1,721 |
| | | | | | |
| LLaMA 7B | 256 | 1 | 128 | 128 | 5,954 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,654 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 579 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 542 |
| | | | | | |
| LLaMA 70B | 256 | 2 | 128 | 128 | 561 |
| LLaMA 70B | 256 | 4 | 128 | 2048 | 471 |
| LLaMA 70B | 16 | 2 | 2048 | 128 | 49 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 177 |
| | | | | | |
| Falcon 180B | 512 | 8 | 128 | 128 | 152 |
| Falcon 180B | 256 | 8 | 128 | 2048 | 200 |
| Falcon 180B | 32 | 8 | 2048 | 128 | 15 |
| Falcon 180B | 16 | 8 | 2048 | 2048 | 39 |
### A100 GPUs (FP16)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--------------------------- | :--------- | :-------- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 512 | 1 | 128 | 128 | 6,374 |
| GPT-J 6B | 120 | 2 | 128 | 2048 | 2,192 |
| GPT-J 6B | 60 | 1 | 2048 | 128 | 670 |
| GPT-J 6B | 64 | 2 | 2048 | 2048 | 903 |
| GPT-J 6B | 512 | 1 | 128 | 128 | 6,810 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,658 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 631 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 | 692 |
| | | | | | |
| LLaMA 7B | 384 | 1 | 128 | 128 | 5,586 |
| LLaMA 7B | 60 | 1 | 128 | 2048 | 1,928 |
| LLaMA 7B | 52 | 1 | 2048 | 128 | 591 |
| LLaMA 7B | 64 | 2 | 2048 | 2048 | 782 |
| Mistral 7B | 896 | 1 | 128 | 128 | 6,472 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 3,812 |
| Mistral 7B | 84 | 1 | 2048 | 128 | 734 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 1,607 |
| | | | | | |
| LLaMA 70B | 1280 | 4 | 128 | 128 | 670 |
| LLaMA 70B | 240 | 4 | 128 | 2048 | 525 |
| LLaMA 70B | 120 | 4 | 2048 | 128 | 79 |
| LLaMA 7B | 256 | 1 | 128 | 128 | 5,353 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,518 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 547 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 613 |
| | | | | | |
| Falcon 180B | 1024 | 8 | 128 | 128 | 232 |
| Falcon 180B | 128 | 8 | 128 | 2048 | 180 |
| LLaMA 70B | 256 | 4 | 128 | 128 | 565 |
| LLaMA 70B | 128 | 4 | 128 | 2048 | 595 |
| LLaMA 70B | 32 | 4 | 2048 | 128 | 66 |
| LLaMA 70B | 32 | 4 | 2048 | 2048 | 185 |
| | | | | | |
| Falcon 180B | 256 | 8 | 128 | 128 | 193 |
| Falcon 180B | 256 | 8 | 128 | 2048 | 203 |
| Falcon 180B | 16 | 8 | 2048 | 128 | 20 |
(1) TP stands for Tensor Parallelism.
## Low Latency<sup>**</sup>
<sup> ** The following data is from TensorRT-LLM v0.5. Low latency numbers will soon be updated to reflect real time latency with infight-batching.</sup>
All data was generated using version 0.8.0.
<sup> ** Low latency numbers will soon be updated to reflect real-time latency with in-flight batching.</sup>
The below tables provide reference data at batch size 1 for first token
latency, representing end-user's perceived latency for online streaming
tasks.
### H200 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :-------- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 5.2 |
| GPT-J 6B | 1 | 1 | 2048 | 23.6 |
| | | | | |
| Mistral 7B | 1 | 1 | 128 | 6.0 |
| Mistral 7B | 1 | 1 | 2048 | 31.8 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 5.8 |
| LLaMA 7B | 1 | 1 | 2048 | 30.1 |
| | | | | |
| LLaMA 70B | 1 | 8 | 128 | 16.0 |
| LLaMA 70B | 1 | 8 | 2048 | 78.8 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 37.2 |
| Falcon 180B | 1 | 8 | 2048 | 120.8 |
### H100 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :-------- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 7 |
| GPT-J 6B | 1 | 1 | 2048 | 29 |
| GPT-J 6B | 1 | 1 | 128 | 5.7 |
| GPT-J 6B | 1 | 1 | 2048 | 23.8 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 7 |
| LLaMA 7B | 1 | 1 | 2048 | 36 |
| Mistral 7B | 1 | 1 | 128 | 6.6 |
| Mistral 7B | 1 | 1 | 2048 | 32.6 |
| | | | | |
| LLaMA 70B | 1 | 4 | 128 | 26 |
| LLaMA 70B | 1 | 4 | 2048 | 109 |
| LLaMA 7B | 1 | 1 | 128 | 6.4 |
| LLaMA 7B | 1 | 1 | 2048 | 31.0 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 27 |
| Falcon 180B | 1 | 8 | 2048 | 205 |
| LLaMA 70B | 1 | 8 | 128 | 17.0 |
| LLaMA 70B | 1 | 8 | 2048 | 84.4 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 39.7 |
| Falcon 180B | 1 | 8 | 2048 | 128.0 |
### L40S GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :-------- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 71 |
| GPT-J 6B | 1 | 1 | 128 | 12.6 |
| GPT-J 6B | 1 | 1 | 2048 | 61.2 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 14 |
| LLaMA 7B | 1 | 1 | 2048 | 73 |
| Mistral 7B | 1 | 1 | 128 | 15.5 |
| Mistral 7B | 1 | 1 | 2048 | 84.3 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 14.3 |
| LLaMA 7B | 1 | 1 | 2048 | 79.0 |
| | | | | |
| LLaMA 70B | 1 | 8 | 128 | 70.9 |
| LLaMA 70B | 1 | 8 | 2048 | 708.7 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 93.4 |
| Falcon 180B | 1 | 8 | 2048 | 769.8 |
### A100 GPUs (FP16)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :-------- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 129 |
| GPT-J 6B | 1 | 1 | 128 | 14.1 |
| GPT-J 6B | 1 | 1 | 2048 | 102.8 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 16 |
| LLaMA 7B | 1 | 1 | 2048 | 133 |
| Mistral 7B | 1 | 1 | 128 | 16.4 |
| Mistral 7B | 1 | 1 | 2048 | 128.7 |
| | | | | |
| LLaMA 70B | 1 | 4 | 128 | 47 |
| LLaMA 70B | 1 | 4 | 2048 | 377 |
| LLaMA 7B | 1 | 1 | 128 | 16.1 |
| LLaMA 7B | 1 | 1 | 2048 | 120.5 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 61 |
| Falcon 180B | 1 | 8 | 2048 | 509 |
| LLaMA 70B | 1 | 8 | 128 | 35.6 |
| LLaMA 70B | 1 | 8 | 2048 | 235.1 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 76.5 |
| Falcon 180B | 1 | 8 | 2048 | 463.0 |
(1) TP stands for Tensor Parallelism.
@ -476,7 +568,7 @@ Prepare a config json file `/tmp/engines/falcon/180b/ckpt_config.json`:
```json
{
"architecture": "FalconForCausalLM",
"dtype": "float16",
"dtype": "bfloat16",
"num_hidden_layers": 80,
"num_attention_heads": 232,
"num_key_value_heads": 8,
@ -523,8 +615,8 @@ do
--workers 8 \
--remove_input_padding enable \
--context_fmha enable \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--paged_kv_cache enable \
--max_batch_size $batch_size \
--max_input_len $isl \

View File

@ -4,7 +4,7 @@ NVIDIA Nsight Systems reports at the application level are highly informative. M
Given the potentially long runtimes of Large Language Models (LLMs) and the diversity of workloads a model may experience during a single inference pass or binary execution, we have added features to TensorRT-LLM to get the most out of Nsight Systems capabilities. This document outlines those features and provides examples of how to best utilize them to understand your application.
# Feature Descriptions
## Feature Descriptions
The main functionality here:
* Relies on toggling the CUDA profiler runtime API on and off.
@ -35,7 +35,7 @@ To profile just those iterations, in addition to setting `TLLM_GPTS_PROFILE_STAR
* We need to tell Nsight Systems to look for explicit API triggers to profile (`-c cudaProfilerApi`)
* We need to tell Nsight Systems to keep profiling after seeing a profile stop API call (`--capture-range-end="repeat[]"`)
# Examples
## Examples
Consult the Nsight Systems User Guide for a full overview of MPI-related options.
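As a point of reference, the pattern that the options above rely on is plain gating of `cudaProfilerStart`/`cudaProfilerStop` around selected iterations, similar to what the runtime does internally. The sketch below is illustrative only; the loop body, iteration bounds, and helper name are assumptions, not TensorRT-LLM code:

```cpp
#include <cuda_profiler_api.h>

// Illustrative sketch: profile only iterations [profileStart, profileStop] so that
// `nsys profile -c cudaProfilerApi --capture-range-end="repeat[]"` records one
// capture range per gated iteration.
void runIterations(int numIterations, int profileStart, int profileStop)
{
    for (int step = 0; step < numIterations; ++step)
    {
        bool const profileThisStep = step >= profileStart && step <= profileStop;
        if (profileThisStep)
        {
            cudaProfilerStart(); // Nsight Systems opens a capture range here
        }

        // ... run the forward step / launch the kernels for this iteration ...

        if (profileThisStep)
        {
            cudaProfilerStop(); // capture range closes; repeat[] keeps nsys recording later ranges
        }
    }
}
```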
## Profiling a single IFB iteration executing on a single rank of a multi-GPU model

View File

@ -139,7 +139,8 @@ This release of TensorRT-LLM contains the following examples:
| Replit Code| Y | Y | Y | . | . | . | . | . | . |
| SantaCoder | Y | Y | Y | . | . | . | . | . | . |
| Skywork | Y | Y | Y | . | . | . | . | . | . |
| StarCoder | Y | Y | Y | . | . | . | . | . | . |
| StarCoder1 | Y | Y | Y | . | . | Y | . | . | . |
| StarCoder2 | Y | Y | Y | . | . | Y | . | . | . |
| T5 | Y | Y | Y | . | . | . | . | . | . |
| Whisper | Y | Y | Y | . | . | Y | Y | . | . |

View File

@ -6,7 +6,7 @@ This document shows how to build and run a Baichuan models (including `v1_7b`/`v
The TensorRT-LLM Baichuan implementation can be found in [tensorrt_llm/models/baichuan/model.py](../../tensorrt_llm/models/baichuan/model.py). The TensorRT-LLM Baichuan example code is located in [`examples/baichuan`](./). There is one main file:
* [`copnvert_checkpoint.py`](./copnvert_checkpoint.py) to convert supported checkpoints into TensorRT-LLM format.
* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert supported checkpoints into TensorRT-LLM format.
The script accepts an argument named `model_version`, whose value should be one of `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`; the default value is `v1_13b`.
@ -20,9 +20,9 @@ In addition, there are two shared files in the parent folder [`examples`](../) f
* FP8
* BF16
* INT4 & INT8 Weight-Only
* INT8 KV CACHE (+ AWQ/per-channel weight-only)
* INT8 SmoothQuant
* Groupwise quantization (AWQ/GPTQ)
* INT8 KV CACHE (+ AWQ/per-channel weight-only/SmoothQuant)
## Usage
@ -56,27 +56,26 @@ trtllm-build --checkpoint_dir ./trt_ckpt/baichuan_v1_13b/ \
Here are some checkpoint conversion examples, using `v1_13b` as an example:
```bash
# Build a single-GPU float16 engine from HF weights.
# Build the Baichuan V1 13B model using a single GPU and FP16.
# Convert the Baichuan V1 13B model using a single GPU and FP16.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/fp16/1-gpu/
# Build the Baichuan V1 13B model using a single GPU and BF16.
# Convert the Baichuan V1 13B model using a single GPU and BF16.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype bfloat16 \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/bf16/1-gpu/
# Build the Baichuan V1 13B model using a single GPU and apply INT8 weight-only quantization.
# Convert the Baichuan V1 13B model using a single GPU and apply INT8 weight-only quantization.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--use_weight_only \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/int8_weight_only/1-gpu/
# Build the Baichuan V1 13B model using a single GPU and apply INT4 weight-only quantization.
# Convert the Baichuan V1 13B model using a single GPU and apply INT4 weight-only quantization.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
@ -84,7 +83,7 @@ python convert_checkpoint.py --model_version v1_13b \
--weight_only_precision int4 \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/int4_weight_only/1-gpu/
# Build Baichuan V1 13B using 2-way tensor parallelism.
# Convert Baichuan V1 13B using 2-way tensor parallelism.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
@ -93,47 +92,6 @@ python convert_checkpoint.py --model_version v1_13b \
--tp_size 2
```
#### INT8 KV cache
INT8 KV cache could be enabled to reduce memory footprint. It will bring more performance gains when batch size gets larger.
You can get the INT8 scale of KV cache through NVIDIA AMMO (AlgorithMic Model Optimization) toolkit, which features a
`--kv_cache_dtype` option.
Example:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + per-channel weight-only quantization**
INT8 KV cache could be combined with per-channel weight-only quantization, as follows:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--qformat int4_wo \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int4wo_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + AWQ**
In addition, you can enable INT8 KV cache together with AWQ (per-group INT4 weight-only quantization), as follows:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--qformat int4_awq \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int4awq_int8kv_tp1 \
--calib_size 512
```
#### SmoothQuant
SmoothQuant supports all Baichuan model variants. Unlike the FP16 build, where the HF weights are processed and loaded into TensorRT-LLM directly, SmoothQuant needs to load INT8 weights, which should be pre-processed before building an engine.
@ -210,6 +168,62 @@ To run the GPTQ Baichuan example, the following steps are required:
```
The quantized model checkpoint is saved for future TensorRT-LLM engine build directly with the `trtllm-build` command mentioned above.
#### INT8 KV cache
INT8 KV cache can be enabled to reduce the memory footprint. It brings larger performance gains as the batch size grows.
You can obtain the INT8 scales for the KV cache through the NVIDIA AMMO (AlgorithMic Model Optimization) toolkit, which provides a
`--kv_cache_dtype` option.
Example:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + per-channel weight-only quantization**
INT8 KV cache could be combined with per-channel weight-only quantization, as follows:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--qformat int4_wo \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int4wo_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + AWQ**
In addition, you can enable INT8 KV cache together with AWQ (per-group INT4 weight-only quantization), as follows:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--qformat int4_awq \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int4awq_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + INT8 SmoothQuant**
In addition, you can enable INT8 KV cache together with INT8 SmoothQuant, as follows:
```bash
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--smoothquant 0.8 \
--per_channel \
--per_token \
--int8_kv_cache \
--output_dir ./tmp/baichuan_v1_13b/sq0.8/1-gpu/
```
### Run
To run a TensorRT-LLM Baichuan model using the engines generated by `trtllm-build`

View File

@ -67,13 +67,6 @@ def parse_arguments():
type=int,
default=0,
help='Setting to a value > 0 enables support for prompt tuning.')
parser.add_argument(
"--calibrate_kv_cache",
"-kv",
action="store_true",
help=
"Generate scaling factors for KV cache. Used for storing KV cache in int8."
)
parser.add_argument(
'--per_channel',
default=False,
@ -1100,12 +1093,9 @@ def convert_baichuan_gptq(hf_config: AutoConfig,
# 4. Weights inside each layer
num_hidden_layers = hf_config.num_hidden_layers
layers_per_pipeline_stage = num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
layers_range = mapping.pp_layers(num_hidden_layers)
for l in layers_range:
layer_idx = l - mapping.pp_rank * layers_per_pipeline_stage
layer_idx = l - layers_range[0]
prefix = f"layers.{l}."
tllm_prefix = f"transformer.layers.{l}."
tensorrt_llm.logger.info(f'Process weights in layer: {layer_idx}')
@ -1189,7 +1179,7 @@ if __name__ == '__main__':
elif args.per_token and not args.per_channel:
quant_algo = 'W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN'
if args.calibrate_kv_cache:
if args.int8_kv_cache:
kv_cache_quant_algo = "INT8"
else:
kv_cache_quant_algo = None
@ -1252,7 +1242,7 @@ if __name__ == '__main__':
hf_model = AutoModelForCausalLM.from_pretrained(args.model_dir,
trust_remote_code=True,
torch_dtype="auto")
if args.smoothquant is not None or args.calibrate_kv_cache:
if args.smoothquant is not None or args.int8_kv_cache:
act_range = {}
baichuan_smoother = {}
act_range = capture_activation_range(
@ -1265,9 +1255,8 @@ if __name__ == '__main__':
baichuan_smoother)
weights = convert_hf_baichuan_sq(hf_model, mapping, rank,
args.dtype, args.per_channel,
args.per_token,
args.calibrate_kv_cache, act_range,
baichuan_smoother)
args.per_token, args.int8_kv_cache,
act_range, baichuan_smoother)
elif args.use_weight_only and args.weight_only_precision == 'int4_gptq':
weights = convert_baichuan_gptq(hf_config,
args.quant_ckpt_path,

View File

@ -275,7 +275,7 @@ if __name__ == '__main__':
('batch_size', [bs_range])
]))
# logits for QA BERT, or hidden_state for vanila BERT
# logits for QA BERT, or hidden_state for vanilla BERT
output = tensorrt_llm_bert(input_ids=input_ids,
input_lengths=input_lengths,
token_type_ids=token_type_ids)

View File

@ -519,7 +519,7 @@ def smooth_bloom_model(model, scales, alpha, bloom_qkv_param, bloom_smoother):
bloom_qkv_param[layer_name] = param
# dense
# enabled for better accuracy with perf overhead of quantiztion
# enabled for better accuracy with perf overhead of quantization
layer_name = name + ".self_attention.dense"
smoother = smooth_gemm(module.self_attention.dense.weight,
scales[layer_name]["x"], None, None, alpha)
@ -540,7 +540,7 @@ def smooth_bloom_model(model, scales, alpha, bloom_qkv_param, bloom_smoother):
dim=1)[0]
# fc2
# enabled for better accuracy with perf overhead of quantiztion
# enabled for better accuracy with perf overhead of quantization
layer_name = name + ".mlp.dense_4h_to_h"
smoother = smooth_gemm(module.mlp.dense_4h_to_h.weight,
scales[layer_name]["x"], None, None, alpha)

View File

@ -184,7 +184,7 @@ If the engines are run successfully, you will see output like (ChatGLM3-6B as th
* The engine(s) must be built accordingly if [in-flight batching in C++ runtime](../../docs/in_flight_batching.md) will be used.
* Use `--gpt_attention_plugin float16`, `--paged_kv_cache enable`, `--remove_input_padding enable` to build engine(s) supporting In-flight Batching.
* It is possible to use `--gpt_attention_plugin float32` with In-flight Batching.
* The size of the block in paged KV cache can be conteoled additionally by using `--tokens_per_block=N`.
* The size of the block in paged KV cache can be controlled additionally by using `--tokens_per_block=N`.
### 4. Run inference
@ -258,7 +258,7 @@ If the engines are run successfully, you will see output like (ChatGLM3-6B as th
### Weight Only quantization
Use `--use_weight_only` to enable INT8-Weight-Only quantization, this will siginficantly lower the latency and memory footprint. Furthermore, use `--weight_only_precision int8` or `--weight_only_precision int4` to configure the data type of the weights.
Use `--use_weight_only` to enable INT8-Weight-Only quantization; this will significantly lower the latency and memory footprint. Furthermore, use `--weight_only_precision int8` or `--weight_only_precision int4` to configure the data type of the weights.
```bash
# ChatGLM3-6B: single gpu, int8 weight only quantization

View File

@ -228,7 +228,7 @@ class ChatGLMTokenizer(PreTrainedTokenizer):
unk_token=unk_token,
num_image_tokens=num_image_tokens,
**kwargs)
""" Initialisation """
""" Initialization """
@property
def gmask_token_id(self) -> Optional[int]:

View File

@ -69,7 +69,7 @@ We should distinguish between `X` - TP size and `Y` - total number of GPU ranks:
# Example 1: build t5-small using a single GPU, FP32, running greedy search
# use_gpt_attention_plugin is necessary in Enc-Dec.
# Try use_gemm_plugin to prevent accuracy issue.
# It is recommend to use --remove_input_padding along with --use_gpt_attention_plugin for better performance
# It is recommended to use --remove_input_padding along with --use_gpt_attention_plugin for better performance
python build.py --model_type t5 \
--weight_dir tmp/trt_models/t5-small/tp1 \
-o tmp/trt_engines/t5-small/1-gpu \

View File

@ -533,8 +533,7 @@ def build(rank, args):
hf_modules_to_trtllm_modules=args.hf_modules_to_trtllm_modules
if args.use_lora_plugin else None,
trtllm_modules_to_hf_modules=args.trtllm_modules_to_hf_modules
if args.use_lora_plugin else None,
)
if args.use_lora_plugin else None)
engine_name = get_engine_name(args.engine_name, args.dtype,
args.tp_size, args.pp_size, cur_rank)
@ -588,7 +587,7 @@ def run_build(component):
if args.parallel_build and args.world_size > 1 and \
torch.cuda.device_count() >= args.world_size:
logger.warning(
f'Parallelly build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
f'Parallel build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
)
mp.spawn(build, nprocs=args.world_size, args=(args, ))
else:

View File

@ -84,8 +84,7 @@ Note that we need to download the dataset of MMLU first and the evaluation of MM
VOCAB_FILE_PATH=/tmp/models/gemma_nv/checkpoints/tmp_vocab.model
python3 ../run.py --engine_dir ${ENGINE_PATH} \
--max_output_len 30 \
--vocab_file ${VOCAB_FILE_PATH} \
--no_add_special_tokens
--vocab_file ${VOCAB_FILE_PATH}
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600Input [Text 0]: "<bos> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "chef in the renowned kitchens of Lyon. After honing his skills in various Michelin-starred establishments, he embarked on a solo venture, establishing his own restaurant"
@ -98,8 +97,7 @@ python3 ../summarize.py --test_trt_llm \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--vocab_file ${VOCAB_FILE_PATH} \
--no_add_special_tokens
--vocab_file ${VOCAB_FILE_PATH}
[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.2821836471557617 sec)
[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1989)
@ -167,8 +165,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-05:04:13] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.96612286567688 sec)
[02/08/2024-05:04:13] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2510)
@ -213,8 +210,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.116227149963379 sec)
[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2419)
@ -263,8 +259,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.460859775543213 sec)
[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1786)
@ -308,8 +303,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.5987987518310547 sec)
[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1797)
@ -349,8 +343,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-04:48:06] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.1938045024871826 sec)
[02/08/2024-04:48:06] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1462)
@ -393,8 +386,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.5348474979400635 sec)
[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1819)
@ -437,8 +429,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
python3 ../mmlu.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
@ -482,8 +473,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.884302377700806 sec)
[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2694)
@ -524,8 +514,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/19/2024-10:02:53] [TRT-LLM] [I] ---------------------------------------------------------
[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT-LLM (total latency: 13.65670919418335 sec)
@ -570,8 +559,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT-LLM (total latency: 8.49835753440857 sec)
[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2654)
@ -611,8 +599,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-07:43:32] [TRT-LLM] [I] TensorRT-LLM (total latency: 7.282559156417847 sec)
[02/08/2024-07:43:32] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2253)
@ -655,8 +642,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT-LLM (total latency: 8.73880124092102 sec)
[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2771)
@ -672,7 +658,7 @@ python3 ../summarize.py --test_trt_llm \
#### Requirements
AMMO toolkit provides quantization solutions with better accuracy. To enable it, have the latest ammo and transformers Python package installed to support Gemma. Then run the following commands.
AMMO toolkit also provides quantization solutions. To enable it, have the latest ammo and transformers Python packages installed to support Gemma. Then run the following commands.
#### Quantize Checkpoints
@ -713,7 +699,7 @@ trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
#### Accuracy Results on MMLU
| Model | fp8 | int4_awq | int8_sq |
|---------------|-------|----------|---------|
| 2B Pretrained | 0.407 | 0.378 | 0.328 |
| 7B Pretrained | 0.643 | 0.615 | 0.480 |
| Model | fp8 | int4_awq | int8_sq (AMMO) | int8_sq (Native per-channel) |
|---------------|-------|----------|----------------|------------------|
| 2B Pretrained | 0.407 | 0.378 | 0.338 | 0.338 |
| 7B Pretrained | 0.643 | 0.615 | 0.448 | 0.595 |

View File

@ -25,7 +25,7 @@ from tensorrt_llm._utils import torch_to_numpy
from tensorrt_llm.models.gemma.smoothquant import *
from tensorrt_llm.models.gemma.weight import (dummy_weights_awq,
load_from_fp8_llama,
quantize_fp8_weigths)
quantize_fp8_weights)
LOGGER = logging.getLogger("convert_checkpoint")
@ -735,7 +735,7 @@ def convert(worker_rank, args, convert_kwargs):
trt_llm_config=trt_llm_config,
group_size=128)
elif args.enable_fp8 or args.fp8_kv_cache:
weight_scales = quantize_fp8_weigths(
weight_scales = quantize_fp8_weights(
weights, trt_llm_config.num_hidden_layers,
trt_llm_config.mapping)
scales = load_from_fp8_llama(args.ammo_quant_ckpt_path,
@ -766,7 +766,6 @@ def main():
print(f"Source configuration determined from parameters: {ckpt_config}")
quant_mode = tensorrt_llm.quantization.QuantMode(0)
quant_kwargs = {}
quant_algo = None
kv_cache_quant_algo = None
@ -801,11 +800,6 @@ def main():
quant_kwargs.update(quant_algo=quant_algo,
kv_cache_quant_algo=kv_cache_quant_algo)
if quant_algo is not None or kv_cache_quant_algo is not None:
quant_mode = tensorrt_llm.quantization.QuantMode.from_quant_algo(
quant_algo,
kv_cache_quant_algo=kv_cache_quant_algo,
)
if args.use_weight_only_with_precision:
if args.use_weight_only_with_precision.endswith("awq"):
quant_kwargs.update(has_zero_point=False,
@ -830,8 +824,7 @@ def main():
world_size=args.world_size,
tp_size=args.world_size,
pp_size=1,
quant_mode=quant_mode,
quant_kwargs=quant_kwargs,
quantization=quant_kwargs,
)
trt_llm_config_dict = trt_llm_config.to_dict()

View File

@ -206,7 +206,7 @@ python3 build.py \
mpirun -np 4 python3 ../run.py --engine_dir santacoder_outputs_tp4 --tokenizer_dir ./santacoder --input_text "def print_hello_world():" --max_output_len 20 --no_add_special_tokens
```
## GPT Variant - StarCoder
## GPT Variant - StarCoder (v1 and v2)
For StarCoder, the steps are similar except that `santacoder` is swapped with `starcoder`.
@ -228,6 +228,11 @@ python3 build.py \
mpirun -np 4 python3 ../run.py --engine_dir starcoder_outputs_tp4 --tokenizer_dir ./starcoder --input_text "def print_hello_world():" --max_output_len 20 --no_add_special_tokens
```
For StarCoder2, you can use almost the same steps as shown above by just setting `--model starcoder2` when converting the huggingface models.
- Note that StarCoder2 hasn't been merged into an official release of the transformers package yet, so remember to use the [main branch of the transformers repo](https://github.com/huggingface/transformers).
- Add `--max_attention_window_size 4096` when running run.py or the summarization script, which enables sliding window attention.
- The sliding window size comes from the HF model [config.json](https://huggingface.co/bigcode/starcoder2-15b/blob/main/config.json#L23).
## Summarization using the GPT model
The following section describes how to run a TensorRT-LLM GPT model to summarize the articles from the

View File

@ -68,6 +68,7 @@ def override_args_from_model_dir(args: argparse.Namespace) -> None:
parsed_params = parse_ft_config(Path(args.model_dir) / "config.ini")
args.n_embd = parsed_params["n_embd"]
args.n_head = parsed_params["n_head"]
args.n_kv_head = parsed_params["n_kv_head"]
args.n_layer = parsed_params["n_layer"]
args.n_positions = parsed_params["n_positions"]
args.vocab_size = parsed_params["vocab_size"]
@ -82,6 +83,8 @@ def override_args_from_model_dir(args: argparse.Namespace) -> None:
args.dtype = parsed_params["dtype"]
args.inter_size = parsed_params["inter_size"]
args.multi_query_mode = parsed_params["multi_query_mode"]
else:
args.n_kv_head = 1 if args.multi_query_mode else args.n_head
def parse_arguments(args):
@ -167,7 +170,7 @@ def parse_arguments(args):
action='store_true',
help=
'Split long kv sequence into multiple blocks (applied to generation MHA kernels). \
It is beneifical when batchxnum_heads cannot fully utilize GPU.'
It is beneficial when batch x num_heads cannot fully utilize GPU.'
)
parser.add_argument('--gpus_per_node', type=int, default=8)
parser.add_argument('--builder_opt', type=int, default=None)
@ -549,6 +552,7 @@ def build_rank_engine(builder: Builder,
tensorrt_llm_gpt = tensorrt_llm.models.GPTLMHeadModel(
num_layers=args.n_layer,
num_heads=args.n_head,
num_kv_heads=args.n_kv_head,
hidden_size=args.n_embd,
inter_size=args.inter_size,
vocab_size=args.vocab_size,
@ -568,7 +572,6 @@ def build_rank_engine(builder: Builder,
apply_query_key_layer_scaling,
quant_mode=args.quant_mode,
bias=args.bias,
num_kv_heads=1 if args.multi_query_mode else args.n_head,
use_prompt_tuning=args.max_prompt_embedding_table_size > 0,
use_parallel_embedding=args.use_parallel_embedding,
embedding_sharding_dim=args.embedding_sharding_dim,
@ -712,7 +715,6 @@ def build(rank, args):
int8_trt_flag = args.quant_mode.has_act_or_weight_quant() or (
args.paged_kv_cache == False
and args.quant_mode.has_int8_kv_cache())
num_kv_heads = 1 if args.multi_query_mode else args.n_head
builder_config = builder.create_builder_config(
name=MODEL_NAME,
precision=args.dtype,
@ -722,7 +724,7 @@ def build(rank, args):
parallel_build=args.parallel_build,
num_layers=args.n_layer,
num_heads=args.n_head,
num_kv_heads=num_kv_heads,
num_kv_heads=args.n_kv_head,
hidden_size=args.n_embd,
vocab_size=args.vocab_size,
hidden_act=args.hidden_act,
@ -753,7 +755,7 @@ def build(rank, args):
cur_rank, args)
assert engine is not None, f'Failed to build engine for rank {cur_rank}'
local_num_kv_heads = (num_kv_heads + args.world_size -
local_num_kv_heads = (args.n_kv_head + args.world_size -
1) // args.world_size
kv_dtype = str_dtype_to_trt(args.dtype)
if args.quant_mode.has_int8_kv_cache():
@ -797,7 +799,7 @@ def run_build(args=None):
if args.parallel_build and args.world_size > 1 and \
torch.cuda.device_count() >= args.world_size:
logger.warning(
f'Parallelly build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
f'Parallel build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
)
mp.spawn(build, nprocs=args.world_size, args=(args, ))
else:

View File

@ -94,7 +94,7 @@ class ProgArgs:
default="gpt2",
type=str,
help="Specify GPT variants to convert checkpoints correctly",
choices=["gpt2", "santacoder", "starcoder"])
choices=["gpt2", "santacoder", "starcoder", "starcoder2"])
parser.add_argument("--storage-type",
"-t",
type=str,
@ -134,14 +134,30 @@ def smooth_gpt_model(model, scales, alpha):
# SantaCoder separates Q projection from KV projection
def concat_qkv_weight_bias(q, hf_key, hf_model):
kv = hf_model.state_dict()[hf_key.replace("q_attn", "kv_attn")]
def concat_qkv_weight_bias(q, hf_key, hf_model, model_type):
if model_type == "starcoder2":
k = hf_model.state_dict()[hf_key.replace("q_proj",
"k_proj")].to(q.device)
v = hf_model.state_dict()[hf_key.replace("q_proj",
"v_proj")].to(q.device)
if len(q.shape) == 2:
k = k.transpose(0, 1)
v = v.transpose(0, 1)
return torch.cat([q, k, v], dim=-1)
else:
kv = hf_model.state_dict()[hf_key.replace("q_attn",
"kv_attn")].to(q.device)
return torch.cat([q, kv], dim=-1)
# StarCoder uses nn.Linear for these following ops whose weight matrix is transposed compared to transformer.Conv1D
def transpose_weights(hf_name, param):
def transpose_weights(hf_name, param, model_type):
weight_to_transpose = []
if model_type == "starcoder":
weight_to_transpose = ["c_attn", "c_proj", "c_fc"]
elif model_type == "starcoder2":
weight_to_transpose = ["self_attn", "c_proj", "c_fc"]
if any([k in hf_name for k in weight_to_transpose]):
if len(param.shape) == 2:
param = param.transpose(0, 1)
@ -154,7 +170,11 @@ def gpt_to_ft_name(orig_name):
"transformer.wte.weight": "model.wte",
"transformer.ln_f.bias": "model.final_layernorm.bias",
"transformer.ln_f.weight": "model.final_layernorm.weight",
"lm_head.weight": "model.lm_head.weight"
"lm_head.weight": "model.lm_head.weight",
# StarCoder2
"model.embed_tokens.weight": "model.wte",
"model.norm.weight": "model.final_layernorm.weight",
"model.norm.bias": "model.final_layernorm.bias"
}
if orig_name in global_weights:
@ -181,6 +201,25 @@ def gpt_to_ft_name(orig_name):
"transformer.mlp.c_fc.weight": "mlp.dense_h_to_4h.weight",
"transformer.mlp.c_proj.bias": "mlp.dense_4h_to_h.bias",
"transformer.mlp.c_proj.weight": "mlp.dense_4h_to_h.weight",
# StarCoder2
"transformer.input_layernorm.bias": "input_layernorm.bias",
"transformer.input_layernorm.weight": "input_layernorm.weight",
"transformer.self_attn.q_proj.bias": "attention.query.bias",
"transformer.self_attn.q_proj.weight": "attention.query.weight",
"transformer.self_attn.k_proj.weight": "attention.key.weight",
"transformer.self_attn.k_proj.bias": "attention.key.bias",
"transformer.self_attn.v_proj.weight": "attention.value.weight",
"transformer.self_attn.v_proj.bias": "attention.value.bias",
"transformer.self_attn.o_proj.bias": "attention.dense.bias",
"transformer.self_attn.o_proj.weight": "attention.dense.weight",
"transformer.post_attention_layernorm.bias":
"post_attention_layernorm.bias",
"transformer.post_attention_layernorm.weight":
"post_attention_layernorm.weight",
"transformer.mlp.c_fc.bias": "mlp.dense_h_to_4h.bias",
"transformer.mlp.c_fc.weight": "mlp.dense_h_to_4h.weight",
"transformer.mlp.c_proj.bias": "mlp.dense_4h_to_h.bias",
"transformer.mlp.c_proj.weight": "mlp.dense_4h_to_h.weight"
}
return f"layers.{layer_idx}.{per_layer_weights[weight_name]}"
@ -222,6 +261,9 @@ def hf_gpt_converter(args: ProgArgs):
config["gpt"][k] = f"{v}"
config["gpt"]["storage_dtype"] = args.storage_type
config["gpt"]["multi_query_mode"] = str(multi_query_mode)
num_attention_heads = int(config['gpt'].get("num_attention_heads", 0))
num_key_value_heads = 1 if multi_query_mode else int(config['gpt'].get(
"num_key_value_heads", num_attention_heads))
with open(saved_dir / "config.ini", 'w') as configfile:
config.write(configfile)
@ -246,14 +288,13 @@ def hf_gpt_converter(args: ProgArgs):
if args.convert_model_on_cpu:
param = param.cpu()
if args.model == "starcoder":
param = transpose_weights(name, param)
param = transpose_weights(name, param, args.model)
if ft_name in global_ft_weights:
torch_to_numpy(param.to(storage_type).cpu()).tofile(
saved_dir / f"{ft_name}.bin")
else:
if 'q_attn' in name:
param = concat_qkv_weight_bias(param, name, model)
if 'q_attn' in name or 'q_proj' in name:
param = concat_qkv_weight_bias(param, name, model, args.model)
ft_name = ft_name.replace("query", "query_key_value")
# Needed by QKV projection weight split. With multi_query_mode one does not simply take
# out_dim and divide it by 3 to get local_dim because out_dim = local_dim + 2 * head_size
@ -265,7 +306,9 @@ def hf_gpt_converter(args: ProgArgs):
storage_type, act_range.get(name.replace(".weight", "")), {
"int8_outputs": int8_outputs,
"multi_query_mode": multi_query_mode,
"local_dim": local_dim
"local_dim": local_dim,
"num_attention_heads": num_attention_heads,
"num_key_value_heads": num_key_value_heads
})
else:
starmap_args.append(
@ -273,7 +316,9 @@ def hf_gpt_converter(args: ProgArgs):
storage_type, act_range.get(name.replace(".weight", "")), {
"int8_outputs": int8_outputs,
"multi_query_mode": multi_query_mode,
"local_dim": local_dim
"local_dim": local_dim,
"num_attention_heads": num_attention_heads,
"num_key_value_heads": num_key_value_heads
}))
starmap_args = tqdm(starmap_args, desc="saving weights")
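The comment above notes that with `multi_query_mode` the fused QKV output dimension is `local_dim + 2 * head_size` rather than `3 * local_dim`. A minimal sketch of that arithmetic, using hypothetical GPT-style sizes (not taken from any particular checkpoint):

```python
# Hypothetical sizes, for illustration only.
hidden_size = 6144                               # local_dim of the Q projection
num_attention_heads = 48
head_size = hidden_size // num_attention_heads   # 128

# Standard MHA: Q, K and V each produce hidden_size outputs.
mha_out_dim = 3 * hidden_size                    # 18432

# Multi-query attention: K and V each have a single head.
mqa_out_dim = hidden_size + 2 * head_size        # 6400

print(mha_out_dim, mqa_out_dim)
```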

View File

@ -162,10 +162,11 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals,
storage_type, act_range, config):
use_attention_nemo_shape = config.get("use_attention_nemo_shape", False)
split_gated_activation = config.get("split_gated_activation", False)
multi_query_mode = config.get("multi_query_mode", False)
num_attention_heads = config.get("num_attention_heads", 0)
num_key_value_heads = config.get("num_key_value_heads", num_attention_heads)
tp_size = config.get("tp_size", 1)
int8_outputs = config.get("int8_outputs", None)
multi_query_mode = config.get("multi_query_mode", False)
local_dim = config.get("local_dim", None)
save_int8 = int8_outputs == "all" or int8_outputs == "kv_cache_only"
@ -236,6 +237,37 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals,
b_q, b_kv = np.split(val, [local_dim], axis=-1)
b_q_split = np.split(b_q, split_factor, axis=-1)
split_vals = [np.concatenate((i, b_kv), axis=-1) for i in b_q_split]
elif num_attention_heads != num_key_value_heads:
# GQA mode
# split_vals = np.split(vals[0], split_factor, axis=-1)
assert num_key_value_heads % split_factor == 0
val = vals[0]
qkv_hidden_dim = val.shape[0]
size_per_head = qkv_hidden_dim // (num_attention_heads +
2 * num_key_value_heads)
num_attention_heads // num_key_value_heads
val = val.reshape(num_attention_heads + 2 * num_key_value_heads,
size_per_head)
# Split the QKV to separate variables.
qkv = np.split(val, [
num_attention_heads, num_attention_heads + num_key_value_heads
],
axis=0)
q_split = np.split(qkv[0], split_factor, axis=0)
k_split = np.split(qkv[1], split_factor, axis=0)
v_split = np.split(qkv[2], split_factor, axis=0)
# Concatenate Q, K, and V together
split_vals = [
np.concatenate([
q_split[i].reshape(-1), k_split[i].reshape(-1),
v_split[i].reshape(-1)
],
axis=0) for i in range(split_factor)
]
else:
if use_attention_nemo_shape:
head_num = num_attention_heads // tp_size
@ -261,6 +293,35 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals,
w_q, w_kv = np.split(val, [local_dim], axis=-1)
w_q_split = np.split(w_q, split_factor, axis=-1)
split_vals = [np.concatenate((i, w_kv), axis=-1) for i in w_q_split]
elif num_attention_heads != num_key_value_heads:
# GQA mode.
assert num_key_value_heads % split_factor == 0
val = vals[0]
size_per_head = hidden_dim // num_attention_heads
num_attention_heads // num_key_value_heads
val = val.reshape(hidden_dim,
num_attention_heads + 2 * num_key_value_heads,
size_per_head)
# Split the QKV to separate variables.
qkv = np.split(val, [
num_attention_heads, num_attention_heads + num_key_value_heads
],
axis=1)
q_split = np.split(qkv[0], split_factor, axis=1)
k_split = np.split(qkv[1], split_factor, axis=1)
v_split = np.split(qkv[2], split_factor, axis=1)
# Concatenate Q, K, and V together
split_vals = [
np.concatenate([
q_split[i].reshape(hidden_dim, -1), k_split[i].reshape(
hidden_dim, -1), v_split[i].reshape(hidden_dim, -1)
],
axis=1) for i in range(split_factor)
]
else:
if use_attention_nemo_shape:
head_num = num_attention_heads // tp_size
@ -291,7 +352,9 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals,
kv_cache_only=int8_outputs == "kv_cache_only")
elif ("attention.query.weight" in key or "attention.query.bias" in key
or "attention.key_value.weight" in key
or "attention.key_value.bias" in key):
or "attention.key_value.bias" in key or "attention.key.weight" in key
or "attention.key.bias" in key or "attention.value.weight" in key
or "attention.value.bias" in key):
pass
else:
print(f"[WARNING] {key} not handled by converter")

View File

@ -59,10 +59,85 @@ def split(v, tp_size, idx, dim=0):
return None
def parse_sc2_config(ini_file):
gpt_config = configparser.ConfigParser()
gpt_config.read(ini_file)
n_embd = gpt_config.getint('gpt', 'hidden_size')
n_head = gpt_config.getint('gpt', 'num_attention_heads')
n_kv_head = gpt_config.getint('gpt', 'num_key_value_heads')
n_layer = gpt_config.getint('gpt', 'num_hidden_layers')
n_positions = gpt_config.getint('gpt', 'max_position_embeddings')
vocab_size = gpt_config.getint('gpt', 'vocab_size')
do_layer_norm_before = gpt_config.getboolean('gpt',
'do_layer_norm_before',
fallback=True)
rotary_base = gpt_config.getfloat('gpt', 'rope_theta', fallback=None)
rotary_scaling_type = gpt_config.get('gpt',
'rotary_scaling_type',
fallback=None)
rotary_scaling_factor = gpt_config.get('gpt',
'rotary_scaling_factor',
fallback=None)
if rotary_scaling_type is None:
if rotary_scaling_factor is not None:
raise ValueError(
f"'rotary_scaling_factor={rotary_scaling_factor}' is found in ini "
f"config file {ini_file}, whereas 'rotary_scaling_type' is missing "
f"in the config. The 'rotary_scaling_factor' will be ignored and "
f"rotary scaling will not be used.")
rotary_scaling = None
else:
if rotary_scaling_factor is None:
raise ValueError(
f"'rotary_scaling_factor={rotary_scaling_factor}' was not found "
f"in ini config file {ini_file}, whereas 'rotary_scaling_type' is "
f"provided and equals {repr(rotary_scaling_type)}.")
rotary_scaling = [rotary_scaling_type, rotary_scaling_factor]
rotary_pct = 1.0
hidden_act = "gelu"
bias = gpt_config.getboolean('gpt', 'use_bias', fallback=True)
inter_size = gpt_config.getint('gpt', 'intermediate_size', fallback=None)
dtype = gpt_config.get('gpt', 'storage_dtype', fallback='float32')
if inter_size is None:
inter_size = 4 * n_embd
multi_query_mode = gpt_config.getboolean('gpt',
'multi_query_mode',
fallback=False)
prompt_num_tasks = gpt_config.getint('gpt', 'prompt_num_tasks', fallback=0)
prompt_max_vocab_size = gpt_config.getint('gpt',
'prompt_max_vocab_size',
fallback=0)
return {
"n_embd": n_embd,
"n_head": n_head,
"n_kv_head": n_kv_head,
"n_layer": n_layer,
"n_positions": n_positions,
"vocab_size": vocab_size,
"do_layer_norm_before": do_layer_norm_before,
"hidden_act": hidden_act,
"rotary_pct": rotary_pct,
"rotary_base": rotary_base,
"rotary_scaling": rotary_scaling,
"bias": bias,
"inter_size": inter_size,
"multi_query_mode": multi_query_mode,
"dtype": dtype,
"prompt_num_tasks": prompt_num_tasks,
"prompt_max_vocab_size": prompt_max_vocab_size
}
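For readers unfamiliar with the converter's `config.ini` layout, here is a minimal sketch of the kind of `[gpt]` section `parse_sc2_config` expects; the values are made up for illustration and are not those of any real checkpoint:

```python
import configparser

# Hypothetical StarCoder2-style config, values for illustration only.
ini_text = """
[gpt]
model = starcoder2
hidden_size = 6144
num_attention_heads = 48
num_key_value_heads = 4
num_hidden_layers = 40
max_position_embeddings = 16384
vocab_size = 49152
rope_theta = 100000.0
use_bias = True
intermediate_size = 24576
storage_dtype = float16
"""

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)
print(cfg.getint('gpt', 'hidden_size'))                      # 6144
print(cfg.getfloat('gpt', 'rope_theta', fallback=None))      # 100000.0
print(cfg.get('gpt', 'rotary_scaling_type', fallback=None))  # None -> no rotary scaling
```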
def parse_ft_config(ini_file):
gpt_config = configparser.ConfigParser()
gpt_config.read(ini_file)
if gpt_config.get("gpt", "model", fallback=None) == "starcoder2":
return parse_sc2_config(ini_file)
n_embd = gpt_config.getint('gpt', 'n_embd')
n_head = gpt_config.getint('gpt', 'n_head')
n_layer = gpt_config.getint('gpt', 'n_layer')
@ -112,6 +187,7 @@ def parse_ft_config(ini_file):
return {
"n_embd": n_embd,
"n_head": n_head,
"n_kv_head": 1 if multi_query_mode else n_head,
"n_layer": n_layer,
"n_positions": n_positions,
"vocab_size": vocab_size,
@ -157,6 +233,8 @@ def load_from_ft(tensorrt_llm_gpt: GPTLMHeadModel,
_parsed_params = parse_ft_config(Path(dir_path) / 'config.ini')
n_embd = _parsed_params["n_embd"]
n_head = _parsed_params["n_head"]
n_kv_head = _parsed_params["n_kv_head"]
head_size = n_embd // n_head
n_layer = _parsed_params["n_layer"]
n_positions = _parsed_params["n_positions"]
vocab_size = _parsed_params["vocab_size"]
@ -164,7 +242,6 @@ def load_from_ft(tensorrt_llm_gpt: GPTLMHeadModel,
hidden_act = _parsed_params["hidden_act"]
bias = _parsed_params["bias"]
inter_size = _parsed_params["inter_size"]
multi_query_mode = _parsed_params["multi_query_mode"]
np_dtype = str_dtype_to_np(dtype)
@ -284,10 +361,8 @@ def load_from_ft(tensorrt_llm_gpt: GPTLMHeadModel,
split(lm_head_weight, tensor_parallel, rank))
fake_fp8_sf_dt = np.float32
for i in range(n_layer):
c_attn_out_dim = (3 * n_embd //
tensor_parallel) if not multi_query_mode else (
n_embd // tensor_parallel +
(n_embd // n_head) * 2)
c_attn_out_dim = ((n_head // tensor_parallel) +
max(n_kv_head // tensor_parallel, 1) * 2) * head_size
gpt_layer = tensorrt_llm_gpt.layers[i]
gpt_layer.input_layernorm.weight.value = (fromfile(
dir_path, 'model.layers.' + str(i) + '.input_layernorm.weight.bin'))
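The new `c_attn_out_dim` formula above generalizes the old MHA/MQA special cases to GQA: each rank holds `n_head / tp` query heads plus at least one K and one V head. A quick numeric check with hypothetical sizes:

```python
# Hypothetical sizes, for illustration only.
n_embd, n_head, tensor_parallel = 6144, 48, 4
head_size = n_embd // n_head                      # 128

def c_attn_out_dim(n_kv_head):
    return ((n_head // tensor_parallel) +
            max(n_kv_head // tensor_parallel, 1) * 2) * head_size

print(c_attn_out_dim(n_head))   # MHA: (12 + 24) * 128 = 4608 == 3 * n_embd / tp
print(c_attn_out_dim(1))        # MQA: (12 + 2)  * 128 = 1792
print(c_attn_out_dim(4))        # GQA: (12 + 2)  * 128 = 1792 (one KV head per rank)
```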

View File

@ -149,7 +149,7 @@ sh gptq_convert.sh
### 3. Convert weights from HF Transformers to TensorRT-LLM format
To apply groupwise quantization GPTQ, addition commandline flags need to be passed to `convert_checkpoint.py`:
To apply GPTQ groupwise quantization, additional command-line flags need to be passed to `convert_checkpoint.py`:
Here `--ammo_quant_ckpt_path` flag specifies the output safetensors of `gptq_convert.sh` script.
```bash
@ -173,7 +173,7 @@ python3 convert_checkpoint.py --model_dir ./gptneox_model \
### 4. Build TensorRT engine(s)
The command to build TensorRT engines to apply GPTQ are almost no change:
The command to build TensorRT engines with GPTQ does not change:
```bash
# Single GPU
@ -197,7 +197,7 @@ trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/ \
### 5. Summarization using the GPT-NeoX model
The command to run summarization with GPTQ qunatized model are also no change:
The command to run summarization with the GPTQ-quantized model also does not change:
```bash
# Single GPU

View File

@ -322,13 +322,9 @@ def load_from_gptq_gptneox(quant_ckpt_path,
weights['transformer.ln_f.bias'] = b.to(torch_dtype)
# 4. Weights inside each layer
num_hidden_layers = hf_config.num_hidden_layers
layers_per_pipeline_stage = num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
layers_range = mapping.pp_layers(num_hidden_layers)
for l in layers_range:
layer_idx = l - mapping.pp_rank * layers_per_pipeline_stage
layer_idx = l - layers_range[0]
prefix = "layers" + split_sym + str(l) + split_sym
tensorrt_llm.logger.info(f'Process weights in layer: {layer_idx}')
# layer = tensorrt_llm_llama.layers[layer_idx]
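Several hunks in this commit replace the hand-rolled pipeline-parallel layer bookkeeping with `mapping.pp_layers(num_hidden_layers)`. For reference, the computation being replaced is equivalent to the sketch below (assuming, as the old code did, that the layer count divides evenly by the number of pipeline stages):

```python
def pp_layers_equivalent(num_hidden_layers, pp_size, pp_rank):
    # Contiguous slice of layer indices owned by this pipeline-parallel rank.
    layers_per_stage = num_hidden_layers // pp_size
    return list(range(pp_rank * layers_per_stage,
                      (pp_rank + 1) * layers_per_stage))

# The local index within the engine is the offset from the first owned layer.
layers_range = pp_layers_equivalent(num_hidden_layers=32, pp_size=4, pp_rank=1)
local_indices = [l - layers_range[0] for l in layers_range]   # 0..7 for layers 8..15
```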

View File

@ -86,7 +86,7 @@ def run_llm_generate_async_example(prompts: List[str],
config = ModelConfig(llama_model_dir)
config.parallel_config.tp_size = tp_size
llm = LLM(config, async_mode=True, kvcahe_free_gpu_memory_fraction=0.4)
llm = LLM(config, kvcache_free_gpu_memory_fraction=0.4)
async def task(prompt: str):
outputs = []
@ -146,7 +146,7 @@ def _parse_arguments():
help='The directory to dump the engine.',
default=None)
parser.add_argument('--quant_type', type=str, choices=['int4_awq', 'fp8'])
parser.add_argument('--prompt', type=str)
parser.add_argument('--prompt', type=str, default="What is LLM?")
parser.add_argument('--tp_size', type=int, default=1)
parser.add_argument('--streaming', action='store_true')
return parser.parse_args()

View File

@ -32,7 +32,7 @@ InternLM has released several checkpoints of different size or capabilities unde
The examples below use [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) and [internlm-chat-20b](https://huggingface.co/internlm/internlm-chat-20b) and assume these repositories are cloned or linked under this directory, for example `./internlm-chat-7b/`.
Normally `trtllm-build` only requires single GPU, but if you've already got all the GPUs needed while inferencing, you could enable parallel building to make the engine building process faster by adding `--workers` argument. Please note that currently `--workers` feature only supports single node.
Normally `trtllm-build` only requires a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building to make the engine building process faster by adding the `--workers` argument. Please note that currently the `--workers` feature only supports a single node.
Here are some examples:

View File

@ -841,17 +841,15 @@ def convert_hf_internlm(hf_model,
num_key_value_heads = hf_model.config.num_attention_heads
mha_mode = (num_key_value_heads == num_attention_heads)
layers_per_pipeline_stage = hf_model.config.num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
num_hidden_layers = hf_model.config.num_hidden_layers
layers_range = mapping.pp_layers(num_hidden_layers)
if moe_config and moe_config.has_moe():
rank_experts = list(range(moe_config.num_experts))
if moe_config.tp_mode == moe_config.ParallelismMode.EXPERT_PARALLEL:
rank_experts = mapping.ep_experts(moe_config.num_experts)
for l in range(hf_model.config.num_hidden_layers):
for l in range(num_hidden_layers):
for suffix in ["w1", "w2", "w3"]:
model_params[f'model.layers.{l}.block_sparse_moe.experts.{suffix}.weight'] = \
torch.stack(list(model_params[f'model.layers.{l}.block_sparse_moe.experts.{expert}.{suffix}.weight']
@ -872,12 +870,10 @@ def convert_hf_internlm(hf_model,
model_params[
f'model.layers.{l}.block_sparse_moe.experts.w2.weight'] = w2
for l in range(hf_model.config.num_hidden_layers):
if l not in layers_range:
continue
for l in layers_range:
layer_idx = l - layers_range[0]
prefix = f'model.layers.{l}.'
idx = int(l) - mapping.pp_rank * layers_per_pipeline_stage
tllm_prex = f'transformer.layers.{idx}.'
tllm_prex = f'transformer.layers.{layer_idx}.'
q_weight = get_weight(model_params, prefix + 'self_attn.q_proj', dtype)
k_weight = get_weight(model_params, prefix + 'self_attn.k_proj', dtype)
@ -1183,7 +1179,7 @@ def convert_hf_internlm(hf_model,
weights['lm_head.weight'] = split_matrix_tp(lm_head_weights,
tensor_parallel,
rank,
mapping.tp_rank,
dim=0)
ln_f_w = get_weight(model_params, 'model.norm', dtype)

View File

@ -34,7 +34,7 @@ Need to prepare the HF LLaMA checkpoint first by following the guides here https
TensorRT-LLM LLaMA builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) with dummy weights.
Normally `trtllm-build` only requires single GPU, but if you've already got all the GPUs needed while inferencing, you could enable parallelly building to make the engine building process faster by adding `--workers` argument. Please note that currently `workers` feature only supports single node.
Normally `trtllm-build` only requires a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building to make the engine building process faster by adding the `--workers` argument. Please note that currently the `workers` feature only supports a single node.
`--use_fused_mlp` enables GEMM horizontal fusion in the gated MLP layer, which reduces input traffic and potentially improves performance. For FP8 PTQ, the downside is a slight reduction in accuracy because one of the quantization scaling factors is discarded (accuracy 0.45734 vs 0.45755 for LLaMA-v2 7B using ammo/examples/hf/instruct_eval/mmlu.py).
@ -159,7 +159,7 @@ The implementation is identical to Huggingface's.
Please refer to https://huggingface.co/docs/transformers/model_doc/llama2#transformers.LlamaConfig.rope_scaling for more details.
### Long context length
To use the model with Long context lengths, it is necessary to add `--multi_block_mode` in the build command to enable faster decoding in multihead attention.
To use the model with long context lengths, it is necessary to add `--multi_block_mode` to the build command to enable faster decoding in multi-head attention.
A few LLaMA models are fine-tuned for the long context lengths that TRT-LLM can support today. For example, https://huggingface.co/Yukang/LongAlpaca-70B employs rotary scaling plus fine-tuning to support up to a 32K context length. The following shows the steps for running LongAlpaca-70B in TRT-LLM:
@ -171,8 +171,6 @@ python convert_checkpoint.py --meta_ckpt_dir ./tmp/LongAlpaca-70B/ \
--output_dir ./tllm_checkpoint_8gpu_tp8 \
--dtype float16 \
--tp_size 8 \
--vocab_size=32001 \
--rotary_scaling linear 8.0
trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8 \
--output_dir ./tmp/llama/70B/trt_engines/fp16/8-gpu/ \
@ -506,26 +504,23 @@ Use the following command to build `CodeLlama-7b-Instruct`:
```bash
python convert_checkpoint.py --model_dir /tmp/CodeLlama-7b-Instruct-hf \
--output_dir ./tllm_checkpoint_1gpu_codellama \
--dtype float16 \
--rotary_base 1000000 \
--vocab_size 32016
--dtype float16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_codellama \
--output_dir ./tmp/codellama/trt_engines/fp16/1-gpu/ \
--gemm_plugin float16 \
--gemm_plugin float16
```
Use the following command to build `CodeLlama-34b-Instruct` for 4 GPUs (TP=4):
```bash
python convert_checkpoint.py --model_dir /tmp/CodeLlama-34b-Instruct-hf \
--output_dir ./tllm_checkpoint_4gpu_codellama \
--dtype float16 \
--rotary_base 1000000 \
--vocab_size 32000 \
--tp_size 4
trtllm-build --checkpoint_dir ./tllm_checkpoint_4gpu_codellama \
--output_dir ./tmp/codellama/trt_engines/fp16/4-gpu/ \
--gemm_plugin float16 \
--gemm_plugin float16
```
NOTE: CodeLlama uses a `max_position_embeddings` of 16K.
@ -536,8 +531,6 @@ Use `--max_input_len` and `--max_output_len` (which defaults to `2048` and `512`
python convert_checkpoint.py --model_dir /tmp/CodeLlama-34b-Instruct-hf \
--output_dir ./tllm_checkpoint_4gpu_codellama \
--dtype float16 \
--rotary_base 1000000 \
--vocab_size 32000 \
--tp_size 8 \
--use_parallel_embedding
@ -625,7 +618,7 @@ Output: "我看见一个人坐在那边边看书书,我看起来还挺像你
### Run LLaMa with several lora checkpoints
In this section, we show how to run a model with multiple LoRA modules at the same time. Note that if one of the LoRA modules has a
fine-tuned embedding table or logit GEMM, users should guarantee that all the instances of the model can use the same finetuned
fine-tuned embedding table or logit GEMM, users should guarantee that all the instances of the model can use the same fine-tuned
embedding table or logit GEMM.
Here, we use two LoRA checkpoints as examples. These two LoRA checkpoints add LoRA modules to `q_proj` and `v_proj`. Because we only
support adding LoRA modules on `q`, `k` and `v` at the same time, we need to add `--lora_target_modules "attn_q" "attn_k" "attn_v"`.
@ -633,7 +626,7 @@ In this case, we assign null pointers for the `k` LoRA module in TensorRT-LLM an
As the rank of the LoRA modules of both checkpoints is 8, we can set `--max_lora_rank 8` to reduce the memory requirement for the LoRA plugin.
In this example, we use a LoRA checkpoint finetuned on the Chinese dataset `luotuo-lora-7b-0.1` and a LoRA checkpoint finetuned on
In this example, we use a LoRA checkpoint fine-tuned on the Chinese dataset `luotuo-lora-7b-0.1` and a LoRA checkpoint fine-tuned on
the Japanese dataset `Japanese-Alpaca-LoRA-7b-v0`. For the `lora_manager` to load several checkpoints, we pass several directories
of LoRA checkpoints at the same time: `--lora_dir "luotuo-lora-7b-0.1/" "Japanese-Alpaca-LoRA-7b-v0/"`.
Then, `lora_manager` will assign `lora_task_uids` to these checkpoints. The `lora_task_uids` value `-1` is a predefined value, which corresponds to

File diff suppressed because it is too large

View File

@ -43,7 +43,7 @@ def parse_args():
type=int,
default=4096,
help=
'The attention window size that controls the sliding window attention / cyclic kv cache behaviour'
'The attention window size that controls the sliding window attention / cyclic kv cache behavior'
)
parser.add_argument(
'--max_input_len',

View File

@ -30,7 +30,6 @@ def parse_arguments():
help='The path to save the baichuan TensorRT-LLM checkpoint')
parser.add_argument('--log_level', type=str, default='info')
args = parser.parse_args()
return args
@ -57,11 +56,7 @@ def get_tllm_linear_weight(weight, prefix, bias=None):
return results
def convert_hf_mamba(
hf_mamba,
rank=0,
dtype='float32',
):
def convert_hf_mamba(hf_mamba, rank=0, dtype='float32'):
weights = {}
tik = time.time()
@ -85,8 +80,9 @@ def convert_hf_mamba(
weights[tllm_weight_name] = weight
if bias is not None:
weights[tllm_bias_name] = bias
weights[tllm_prex + 'A'] = -torch.exp(
model_params[prefix + 'A_log'].float().detach())
Aparam = model_params[prefix + 'A_log'].float().detach()
Aparam = Aparam.permute(1, 0).contiguous()
weights[tllm_prex + 'A'] = -torch.exp(Aparam)
weights[tllm_prex + 'D'] = model_params[prefix + 'D'].float().detach()
# norm
prefix = f'backbone.layers.{l}.norm'
@ -130,11 +126,9 @@ def rename_hf_to_tllm(name: str):
return name
def convert_from_hf_checkpoint(
model_dir: Union[str, Path],
def convert_from_hf_checkpoint(model_dir: Union[str, Path],
rank=0,
dtype: Union[str, torch.dtype] = torch.float32,
):
dtype: Union[str, torch.dtype] = torch.float32):
logger.info('Loading weights from HF Mamba...')
tik = time.time()
@ -153,6 +147,7 @@ def convert_from_hf_checkpoint(
param_fp32 = model_params_fp32[name].detach().cpu()
if 'A_log' in name:
param = -torch.exp(param_fp32)
param = param.permute(1, 0).contiguous()
elif 'D' in name:
param = param_fp32
elif 'dt_proj.bias' in name:
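Both Mamba conversion paths above now transpose `A_log` in addition to negating its exponential (exponentiation and transposition commute elementwise, so the two hunks are equivalent). A small sketch of the resulting transformation with hypothetical state sizes, not the real checkpoint's dimensions:

```python
import torch

# Hypothetical Mamba sizes, for illustration only.
d_inner, d_state = 1536, 16
A_log = torch.randn(d_inner, d_state)

# Transpose to (d_state, d_inner), then take A = -exp(A_log).
A = -torch.exp(A_log.permute(1, 0).contiguous().float())
assert A.shape == (d_state, d_inner)
assert (A < 0).all()   # A is strictly negative by construction
```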

View File

@ -806,17 +806,12 @@ def convert_hf_llama(hf_model,
num_key_value_heads = hf_model.config.num_key_value_heads
mha_mode = (num_key_value_heads == num_attention_heads)
layers_per_pipeline_stage = hf_model.config.num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
for l in range(hf_model.config.num_hidden_layers):
if l not in layers_range:
continue
num_hidden_layers = hf_model.config.num_hidden_layers
layers_range = mapping.pp_layers(num_hidden_layers)
for l in layers_range:
layer_idx = l - layers_range[0]
prefix = f'model.layers.{l}.'
idx = int(l) - mapping.pp_rank * layers_per_pipeline_stage
tllm_prex = f'transformer.layers.{idx}.'
tllm_prex = f'transformer.layers.{layer_idx}.'
q_weight = get_weight(model_params, prefix + 'self_attn.q_proj', dtype)
k_weight = get_weight(model_params, prefix + 'self_attn.k_proj', dtype)

View File

@ -52,7 +52,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_2gpu \
--gemm_plugin float16
```
Then, you can test your engine with the [run.py](./examples/run.py) script:
Then, you can test your engine with the [run.py](../run.py) script:
```
mpirun -n 2 python3 ../run.py --engine_dir ./trt_engines/mixtral/tp2 --tokenizer_dir ./Mixtral-8x7B-v0.1 --max_output_len 8 --input_text "I love french quiche"

View File

@ -248,12 +248,7 @@ class Pipeline:
def __call__(self, prompt):
# Run the model in batch size 1 and beam size 1
if self.model_name == 'GemmaForCausalLM':
inputs = self.tokenizer.encode(prompt, add_special_tokens=False)
inputs = torch.tensor([self.tokenizer.bos_token_id] + inputs)
else:
inputs = self.tokenizer.encode(prompt,
return_tensors="pt").squeeze(0)
inputs = self.tokenizer.encode(prompt, return_tensors="pt").squeeze(0)
batch_input_ids = [inputs]
# For multi-choice tasks like MMLU, we don't need to adjust following parameters
@ -341,7 +336,7 @@ def parse_args():
type=int,
default=None,
help=
'The attention window size that controls the sliding window attention / cyclic kv cache behaviour'
'The attention window size that controls the sliding window attention / cyclic kv cache behavior'
)
parser.add_argument(
'--tokenizer_dir',
@ -394,10 +389,10 @@ def main():
debug_mode=args.debug_mode)
else:
assert args.test_hf, "Must test either TRT-LLM or HF"
if model_name.startswith("chatglm"):
auto_model_cls = AutoModel
elif model_name.startswith("glm"):
if model_name == 'ChatGLMForCausalLM' and model_version == 'glm':
auto_model_cls = AutoModelForSeq2SeqLM
elif model_name == 'ChatGLMForCausalLM' and model_version == 'chatglm':
auto_model_cls = AutoModel
else:
auto_model_cls = AutoModelForCausalLM
model = auto_model_cls.from_pretrained(

View File

@ -31,34 +31,34 @@ The [`convert_checkpoint.py`](./convert_checkpoint.py) script allows you to conv
```bash
# Generate FP16 checkpoints.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp16/ --dtype float16
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/fp16/ --dtype float16
# Generate FP32 checkpoints with TP=4.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp32_tp4/ --dtype float32 --tp_size 4
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/fp32_tp4/ --dtype float32 --tp_size 4
```
#### 1.2 Convert from HF Transformers with weight-only quantization
```bash
# Use int8 weight-only quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int8_wo/ --use_weight_only
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int8_wo/ --use_weight_only
# Use int4 weight-only quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int4_wo/ --use_weight_only --weight_only_precision int4
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int4_wo/ --use_weight_only --weight_only_precision int4
```
#### 1.3 Convert from HF Transformers with SmoothQuant quantization
```bash
# Use int8 smoothquant (weight and activation) quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int8_sq/ --smoothquant 0.5
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int8_sq/ --smoothquant 0.5
```
#### 1.4 Convert from HF Transformers with INT8 KV cache quantization
```bash
# Use int8 kv cache quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp16_int8kv/ --dtype float16 --calibrate_kv_cache
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/fp16_int8kv/ --dtype float16 --calibrate_kv_cache
```
***INT8 KV cache can be combined with SmoothQuant and weight-only quantization***
@ -70,31 +70,31 @@ First make sure AMMO toolkit is installed (see [examples/quantization/README.md]
```bash
# INT4 AWQ quantization using AMMO.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int4_awq/ --qformat int4_awq
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int4_awq/ --qformat int4_awq
```
#### 1.6 FP8 Post-Training Quantization with AMMO
```bash
# FP8 quantization using AMMO.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp8/ --qformat fp8 --kv_cache_dtype fp8
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/fp8/ --qformat fp8 --kv_cache_dtype fp8
```
#### 1.7 Weight-only quantization with AMMO
```bash
# INT8 Weight-only quantization using AMMO with TP=2.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int8_wo/ --qformat int8_wo --tp_size 2
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int8_wo/ --qformat int8_wo --tp_size 2
# INT4 Weight-only quantization using AMMO.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int4_wo/ --qformat int4_wo
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int4_wo/ --qformat int4_wo
```
#### 1.8 SmoothQuant and INT8 KV cache with AMMO
```bash
# Use int8 SmoothQuant quantization with int8 KV cache.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/sq_int8kv/ --qformat int8_sq --kv_cache_dtype int8
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/sq_int8kv/ --qformat int8_sq --kv_cache_dtype int8
```
***INT8 KV cache can also be combined with weight-only quantization***
@ -105,13 +105,13 @@ All of the checkpoint generated by `convert_checkpoint.py` or `quantize.py` (AMM
```bash
# Build a single-GPU float16 engine using TRTLLM checkpoints.
trtllm-build --checkpoint_dir=./ft_ckpts/mpt-7b/fp16/1-gpu \
trtllm-build --checkpoint_dir=./ckpts/mpt-7b/fp16 \
--max_batch_size 32 \
--max_input_len 1024 \
--max_output_len 512 \
--gemm_plugin
--gemm_plugin float16 \
--workers 1 \
--output_dir ./trt_engines/mpt-7b/fp16/1-gpu
--output_dir ./trt_engines/mpt-7b/fp16
```
### MPT 30B
@ -123,7 +123,7 @@ Same commands can be changed to convert MPT 30B to TRT LLM format. Below is an e
The [`convert_checkpoint.py`](./convert_checkpoint.py) script allows you to convert weights from HF Transformers format to TRTLLM format.
```bash
python convert_checkpoint.py --model_dir mosaicml/mpt-30b --output_dir ./ft_ckpts/mpt-30b/fp16_tp4/ --tp_szie 4 --dtype float16
python convert_checkpoint.py --model_dir mosaicml/mpt-30b --output_dir ./ckpts/mpt-30b/fp16_tp4/ --tp_size 4 --dtype float16
```
#### 2. Build TensorRT engine(s)
@ -132,11 +132,11 @@ Examples of build invocations:
```bash
# Build 4-GPU MPT-30B float16 engines
trtllm-build --checkpoint_dir ./ft_ckpts/mpt-30b/fp16_tp4 \
trtllm-build --checkpoint_dir ./ckpts/mpt-30b/fp16_tp4 \
--max_batch_size 32 \
--max_input_len 1024 \
--max_output_len 512 \
--gemm_plugin
--gemm_plugin float16 \
--workers 4 \
--output_dir ./trt_engines/mpt-30b/fp16_tp4
```
@ -159,7 +159,7 @@ Same commands can be changed to convert [Replit Code V-1.5 3B](https://huggingfa
The [`convert_checkpoint.py`](./convert_checkpoint.py) script allows you to convert weights from HF Transformers format to TRTLLM format.
```bash
python convert_checkpoint.py --model_dir ./replit-code-v1_5-3b --output_dir ./ft_ckpts/replit-code-v1_5-3b/bf16_tp2/ --tp_size 2 --dtype bfloat16
python convert_checkpoint.py --model_dir ./replit-code-v1_5-3b --output_dir ./ckpts/replit-code-v1_5-3b/bf16_tp2/ --tp_size 2 --dtype bfloat16
```
#### 2. Build TensorRT engine(s)
@ -168,11 +168,12 @@ Examples of build invocations:
```bash
# Build 2-GPU Replit Code V-1.5 3B bfloat16 engines
trtllm-build --checkpoint_dir ./ft_ckpts/replit-code-v1_5-3b/bf16_tp2 \
trtllm-build --checkpoint_dir ./ckpts/replit-code-v1_5-3b/bf16_tp2 \
--max_batch_size 32 \
--max_input_len 1024 \
--max_output_len 512 \
--gemm_plugin \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--workers 2 \
--output_dir ./trt_engines/replit-code-v1_5-3b/bf16_tp2
```

View File

@ -613,7 +613,7 @@ def get_tllm_param(
return results
def convert_hf_mpt_lagacy(hf_model,
def convert_hf_mpt_legacy(hf_model,
mapping,
rank=0,
dtype='float32',
@ -967,7 +967,8 @@ if __name__ == '__main__':
'pp_size': args.pp_size,
},
'bias': (not hf_config.no_bias),
'clip_qkv': hf_config.attn_config['clip_qkv']
'clip_qkv': hf_config.attn_config['clip_qkv'],
'alibi_bias_max': hf_config.attn_config['alibi_bias_max']
}
with open(os.path.join(args.output_dir, 'config.json'), 'w') as f:
@ -998,7 +999,7 @@ if __name__ == '__main__':
if args.smoothquant is not None:
smooth_mpt_model(hf_model, act_range, args.smoothquant,
mpt_qkv_para, mpt_smoother)
weights = convert_hf_mpt_lagacy(
weights = convert_hf_mpt_legacy(
hf_model, mapping, rank, args.dtype, args.use_weight_only,
plugin_weight_only_quant_type, args.smoothquant is not None,
args.per_channel, args.per_token, args.calibrate_kv_cache,

View File

@ -3,8 +3,11 @@ import os
import shutil
from time import time
import tensorrt as trt
# isort: off
import torch
import tensorrt as trt
# isort: on
from PIL import Image
from transformers import (AutoProcessor, Blip2ForConditionalGeneration,
Blip2Processor, LlavaForConditionalGeneration,

View File

@ -5,8 +5,12 @@ from pathlib import Path
import numpy as np
import requests
import tensorrt as trt
# isort: off
import torch
import tensorrt as trt
# isort: on
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import (AutoConfig, AutoProcessor, AutoTokenizer,
@ -127,7 +131,7 @@ class MultiModalModel:
self.runtime_mapping = self.model.session.mapping
else:
self.model = TRTLLMEncDecModel.from_engine(
self.args.hf_model_dir.split('/')[-1],
os.path.basename(self.args.hf_model_dir),
self.args.llm_engine_dir,
skip_encoder=self.args.nougat,
debug_mode=False,

View File

@ -59,7 +59,7 @@ mv Qwen-14B-Chat ./tmp/Qwen/14B
TensorRT-LLM Qwen builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) with dummy weights.
Normally `build.py` only requires single GPU, but if you've already got all the GPUs needed while inferencing, you could enable parallelly building to make the engine building process faster by adding `--parallel_build` argument. Please note that currently `parallel_build` feature only supports single node.
Normally `build.py` only requires a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building to make the engine building process faster by adding the `--parallel_build` argument. Please note that currently the `parallel_build` feature only supports a single node.
Here are some examples:

View File

@ -470,7 +470,7 @@ def parse_arguments():
args.hidden_act = "silu"
args.rms_norm_eps = hf_config.layer_norm_epsilon
args.kv_channels = hf_config.kv_channels
args.rotary_emb_base = hf_config.rotary_emb_base
args.rotary_base = hf_config.rotary_emb_base
if args.n_kv_head is None:
args.n_kv_head = args.n_head
if args.n_kv_head != args.n_head:
@ -803,7 +803,7 @@ if __name__ == '__main__':
if args.parallel_build and args.world_size > 1 and \
torch.cuda.device_count() >= args.world_size:
logger.warning(
f'Parallelly build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
f'Building TensorRT engines in parallel. Please make sure that all of the {args.world_size} GPUs are totally free.'
)
mp.spawn(build, nprocs=args.world_size, args=(args, ))
else:

View File

@ -207,17 +207,14 @@ def load_from_binary(tensorrt_llm_qwen: QWenForCausalLM,
tensorrt_llm_qwen.lm_head.weight.value = np.ascontiguousarray(
split(lm_head_weight, mapping.tp_size, mapping.tp_rank))
layers_per_pipeline_stage = tensorrt_llm_qwen.num_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
num_hidden_layers = tensorrt_llm_qwen.num_layers
layers_range = mapping.pp_layers(num_hidden_layers)
for i in layers_range:
c_attn_out_dim = (3 * hidden_size //
mapping.tp_size) if not multi_query_mode else (
hidden_size // mapping.tp_size +
(hidden_size // num_hidden_layers) * 2)
idx = i - mapping.pp_rank * layers_per_pipeline_stage
idx = i - layers_range[0]
tensorrt_llm_qwen.layers[idx].ln_1.weight.value = fromfile(
dir_path, 'model.layers.' + str(i) + '.ln_1.weight.bin')
@ -406,10 +403,9 @@ def load_from_hf_qwen(tensorrt_llm_qwen: tensorrt_llm.models.QWenForCausalLM,
model_params = dict(hf_qwen.named_parameters())
torch_dtype = str_dtype_to_torch(dtype)
layers_per_pipeline_stage = hf_qwen.config.num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
num_hidden_layers = hf_qwen.config.num_hidden_layers
layers_range = mapping.pp_layers(num_hidden_layers)
for k, v in tqdm(model_params.items(),
total=len(model_params),
@ -438,7 +434,7 @@ def load_from_hf_qwen(tensorrt_llm_qwen: tensorrt_llm.models.QWenForCausalLM,
layer_idx = extract_layer_idx(k)
if layer_idx is None or int(layer_idx) not in layers_range:
continue
idx = int(layer_idx) - mapping.pp_rank * layers_per_pipeline_stage
idx = int(layer_idx) - layers_range[0]
if idx >= tensorrt_llm_qwen.num_layers:
continue
if 'ln_1.weight' in k:
@ -631,13 +627,7 @@ def load_from_gptq_qwen(
num_hidden_layers = max(layer_ids) + 1
suffixs = ["qweight", "qzeros", "scales"]
layers_per_pipeline_stage = num_hidden_layers // mapping.pp_size
layers_range = list(
range(
mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage,
1,
))
layers_range = mapping.pp_layers(num_hidden_layers)
torch_dtype = str_dtype_to_torch(dtype)
for layer in tqdm(layers_range,
ncols=80,
@ -655,7 +645,7 @@ def load_from_gptq_qwen(
# dtype: int32, int32, float16
split_qkv_suf.append(split_qkv)
idx = layer - mapping.pp_rank * layers_per_pipeline_stage
idx = layer - layers_range[0]
th_bias = model_params[prefix + "c_attn.bias"].to(
torch_dtype).cpu().contiguous()
@ -709,7 +699,7 @@ def load_from_gptq_qwen(
idx = int(layer_idx)
if idx not in layers_range:
continue
idx = idx - mapping.pp_rank * layers_per_pipeline_stage
idx = idx - layers_range[0]
if "ln_1.weight" in k:
tensorrt_llm_qwen.layers[idx].ln_1.weight.value = v
@ -791,7 +781,7 @@ def load_from_gptq_qwen(
dst.value = np.ascontiguousarray(split_v)
tok = time.time()
t = time.strftime("%h:%m:%s", time.gmtime(tok - tik))
t = time.strftime("%H:%M:%S", time.gmtime(tok - tik))
tensorrt_llm.logger.info(f"weights loaded. total time: {t}")
@ -919,11 +909,7 @@ def load_from_awq_qwen(tensorrt_llm_qwen: QWenForCausalLM,
]
num_hidden_layers = max(layer_ids) + 1
layers_per_pipeline_stage = num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
layers_range = mapping.pp_layers(num_hidden_layers)
for layer_idx in tqdm(layers_range, "Loading weights..."):
prefix = "transformer.h." + str(layer_idx) + "."
for idx, awq_attr in enumerate(awq_block_names):

View File

@ -40,7 +40,7 @@ def parse_arguments(args=None):
type=int,
default=None,
help=
'The attention window size that controls the sliding window attention / cyclic kv cache behaviour'
'The attention window size that controls the sliding window attention / cyclic kv cache behavior'
)
parser.add_argument('--sink_token_length',
type=int,
@ -231,8 +231,6 @@ def parse_input(tokenizer,
else:
print('Input file format not supported.')
raise SystemExit
if model_name == 'GemmaForCausalLM':
batch_input_ids[0] = [tokenizer.bos_token_id] + batch_input_ids[0]
if num_prepend_vtokens:
assert len(num_prepend_vtokens) == len(batch_input_ids)

View File

@ -158,14 +158,6 @@ def main(args):
max_input_length=test_token_num,
)
input_ids = torch.tensor(input_id_list)
elif model_name == 'GemmaForCausalLM':
input_ids = tokenizer.encode(
curr_text,
add_special_tokens=add_special_tokens,
truncation=True,
max_length=test_token_num -
1) # minus 1 to add bos_token_id
input_ids = torch.tensor([tokenizer.bos_token_id] + input_ids)
else:
input_ids = tokenizer.encode(
curr_text,
@ -624,7 +616,7 @@ if __name__ == '__main__':
type=int,
default=None,
help=
'The attention window size that controls the sliding window attention / cyclic kv cache behaviour'
'The attention window size that controls the sliding window attention / cyclic kv cache behavior'
)
parser.add_argument('--sink_token_length',
type=int,

View File

@ -88,6 +88,14 @@ def load_tokenizer(tokenizer_dir: Optional[str] = None,
trust_remote_code=True,
tokenizer_type=tokenizer_type,
use_fast=use_fast)
elif model_name == 'GemmaForCausalLM':
from transformers import GemmaTokenizer
# Initialize tokenizer from vocab file.
tokenizer = GemmaTokenizer(vocab_file=vocab_file,
padding_side='left',
truncation_side='left',
legacy=False)
else:
# For gpt-next, directly load from tokenizer.model
tokenizer = T5Tokenizer(vocab_file=vocab_file,
@ -107,11 +115,6 @@ def load_tokenizer(tokenizer_dir: Optional[str] = None,
elif model_name == 'ChatGLMForCausalLM' and model_version == 'glm':
pad_id = tokenizer.pad_token_id
end_id = tokenizer.eop_token_id
elif model_name == 'GemmaForCausalLM':
tokenizer.eos_token_id = tokenizer.sp_model.eos_id()
tokenizer.bos_token_id = tokenizer.sp_model.bos_id()
pad_id = tokenizer.pad_token_id
end_id = tokenizer.eos_token_id
else:
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = tokenizer.eos_token_id

View File

@ -20,7 +20,7 @@ torch==2.1.0+cu121
torchdata==0.7.0
torchtext==0.16.0+cpu
torchvision==0.16.0+cu121
transformers==4.36.1
transformers==4.38.2
wheel
optimum
evaluate

View File

@ -1,4 +1,3 @@
--extra-index-url https://download.pytorch.org/whl/cu121
--extra-index-url https://pypi.nvidia.com
accelerate==0.25.0
build
@ -16,7 +15,7 @@ sentencepiece>=0.1.99
tensorrt==9.2.0.post12.dev5
torch<=2.2.0a
nvidia-ammo~=0.7.0; platform_machine=="x86_64"
transformers==4.36.1
transformers==4.38.2
wheel
optimum
evaluate

Some files were not shown because too many files have changed in this diff.