Update TensorRT-LLM (#1233)

* Update TensorRT-LLM

---------

Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Kaiyu Xie 2024-03-05 18:32:53 +08:00 committed by GitHub
parent b7c309d1c9
commit 728cc0044b
163 changed files with 4151 additions and 3978 deletions

3rdparty/cutlass

@ -1 +1 @@
Subproject commit 8236f30675bbe98f81d11c05764b77bfcb25b8cc
Subproject commit a8f2c80db0564c74f4efccac71993b971dfc448b


@ -1,5 +1,41 @@
# Change Log
## Versions 0.7.0 / 0.7.1
* Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
* Features
- [Preview] Speculative decoding
- Add Python binding for `GptManager`
- Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
- Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
* Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMA with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
* Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
* Documentation
- Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
## Versions 0.6.0 / 0.6.1
* Models

README.md

@ -8,7 +8,7 @@ TensorRT-LLM
[![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
[![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
[![trt](https://img.shields.io/badge/TRT-9.2-green)](https://developer.nvidia.com/tensorrt)
[![version](https://img.shields.io/badge/release-0.7.1-green)](./setup.py)
[![version](https://img.shields.io/badge/release-0.9.0.dev-green)](./setup.py)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)
[Architecture](./docs/source/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/performance.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)
@ -38,6 +38,9 @@ TensorRT-LLM
## Table of Contents
- [TensorRT-LLM](#tensorrt-llm)
- [Latest News](#latest-news)
- [Table of Contents](#table-of-contents)
- [TensorRT-LLM Overview](#tensorrt-llm-overview)
- [Installation](#installation)
- [Quick Start](#quick-start)
@ -56,6 +59,8 @@ TensorRT-LLM
- [Troubleshooting](#troubleshooting)
- [Release notes](#release-notes)
- [Change Log](#change-log)
- [Versions 0.8.0](#versions-080)
- [For history change log, please see CHANGELOG.md.](#for-history-change-log-please-see-changelogmd)
- [Known Issues](#known-issues)
- [Report Issues](#report-issues)
@ -288,7 +293,7 @@ The list of supported models is:
* [Replit Code](examples/mpt)
* [RoBERTa](examples/bert)
* [SantaCoder](examples/gpt)
* [StarCoder](examples/gpt)
* [StarCoder1/StarCoder2](examples/gpt)
* [T5](examples/enc_dec)
* [Whisper](examples/whisper)
@ -402,50 +407,91 @@ For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
## Release notes
* TensorRT-LLM requires TensorRT 9.2 and 23.10 containers.
* TensorRT-LLM requires TensorRT 9.2 and 23.12 containers.
### Change Log
#### Versions 0.7.0 / 0.7.1
#### Versions 0.8.0
* Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
* Model Support
- Phi-1.5/2.0
- Mamba support (see examples/mamba/README.md)
- The support is limited to beam width = 1 and single-node single-GPU
- Nougat support (see examples/multimodal/README.md#nougat)
- Qwen-VL support (see examples/qwenvl/README.md)
- RoBERTa support, thanks to the contribution from @erenup
- Skywork model support
- Add example for multimodal models (BLIP with OPT or T5, LLaVA)
* Features
- [Preview] Speculative decoding
- Add Python binding for `GptManager`
- Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
- Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
- Chunked context support (see docs/source/gpt_attention.md#chunked-context)
- LoRA support for C++ runtime (see docs/source/lora.md)
- Medusa decoding support (see examples/medusa/README.md)
- The support is limited to Python runtime for Ampere or newer GPUs with fp16 and bf16 accuracy, and the `temperature` parameter of sampling configuration should be 0
- StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
- Support for batch manager to return logits from context and/or generation phases
- Include support in the Triton backend
- Support AWQ and GPTQ for QWEN
- Support ReduceScatter plugin
- Support for combining `repetition_penalty` and `presence_penalty` #274
- Support for `frequency_penalty` #275
- OOTB functionality support:
- Baichuan
- InternLM
- Qwen
- BART
- LLaMA
- Support enabling INT4-AWQ along with FP8 KV Cache
- Support BF16 for weight-only plugin
- Baichuan
- P-tuning support
- INT4-AWQ and INT4-GPTQ support
- Decoder iteration-level profiling improvements
- Add `masked_select` and `cumsum` function for modeling
- Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
- Add Weight-Only Support To Whisper #794, thanks to the contribution from @Eddie-Wang1120
- Support FP16 fMHA on NVIDIA V100 GPU
* API
- Add a set of High-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
- **[BREAKING CHANGES]** Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/new_workflow.md)
- **[BREAKING CHANGES]** Deprecate `LayerNorm` and `RMSNorm` plugins and removed corresponding build parameters
- **[BREAKING CHANGES]** Remove optional parameter `maxNumSequences` for GPT manager
* Bug fixes
- Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
- Fix LLaMA with LoRA error #637
- Fix LLaMA GPTQ failure #580
- Fix Python binding for InferenceRequest issue #528
- Fix CodeLlama SQ accuracy issue #453
- Fix the first token being abnormal issue when `--gather_all_token_logits` is enabled #639
- Fix LLaMA with LoRA enabled build failure #673
- Fix InternLM SmoothQuant build failure #705
- Fix Bloom int8_kv_cache functionality #741
- Fix crash in `gptManagerBenchmark` #649
- Fix Blip2 build error #695
- Add pickle support for `InferenceRequest` #701
- Fix Mixtral-8x7b build failure with custom_all_reduce #825
- Fix INT8 GEMM shape #935
- Minor bug fixes
* Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
- **[BREAKING CHANGES]** Increase default `freeGpuMemoryFraction` parameter from 0.85 to 0.9 for higher throughput
- **[BREAKING CHANGES]** Disable `enable_trt_overlap` argument for GPT manager by default
- Performance optimization of beam search kernel
- Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels
- Custom AllReduce plugins performance optimization
- Top-P sampling performance optimization
- LoRA performance optimization
- Custom allreduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
- Integrate XQA kernels for GPT-J (beamWidth=4)
* Documentation
- Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
- Batch manager arguments documentation updates
- Add documentation for best practices for tuning the performance of TensorRT-LLM (See docs/source/perf_best_practices.md)
- Add documentation for Falcon AWQ support (See examples/falcon/README.md)
- Update to the `docs/source/new_workflow.md` documentation
- Update AWQ INT4 weight only quantization documentation for GPT-J
- Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
- Refine TensorRT-LLM backend README structure #133
- Typo fix #739
#### For history change log, please see [CHANGELOG.md](./CHANGELOG.md).
### Known Issues
* On Windows, running the context FMHA plugin with FP16 accumulation on LLaMA, Mistral, and Phi models suffers from poor accuracy, and the resulting inference output may be garbled. As a workaround, enable FP32 accumulation when building the models, i.e. pass the options `--context_fmha disable --context_fmha_fp32_acc enable` to the `trtllm-build` command. This should be fixed in the next version.
* The hang reported in issue
[#149](https://github.com/triton-inference-server/tensorrtllm_backend/issues/149)
has not been reproduced by the TensorRT-LLM team. If it is caused by a bug


@ -103,7 +103,8 @@ For example, setting mean=100 and std dev=10 would generate requests where 95.4%
--tokenizer <path/to/tokenizer> \
token-norm-dist \
--num-requests 100 \
--input-mean 100 --input-stdev 10 --output-mean 15 --output-stdev 0 --num-requests 100
--input-mean 100 --input-stdev 10 \
--output-mean 15 --output-stdev 0
```
For `tokenizer`, you can specify either the path to a local tokenizer that has already been downloaded, or simply the name of a tokenizer from HuggingFace, such as `meta-llama/Llama-2-7b`; in the latter case, the tokenizer will be downloaded automatically.
@ -141,8 +142,25 @@ mpirun -n 2 ./benchmarks/gptManagerBenchmark \
--max_num_samples 500
```
To emulate `gptSessionBenchmark` static batching, you can use the `--static_emulated_batch_size` and `--static_emulated-timeout` arguments.
Given a `static_emulated_batch_size` of `n` the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated-timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count.
`gptManagerBenchmark` can also be used with the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`). This can be done by passing the argument `--api executor`. Note that the Executor class is still under development and currently does not support models with tp or pp > 1.
#### Emulated static batching
To emulate `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.
Given a `static_emulated_batch_size` of `n`, the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count. New batches will only be submitted once the previous batch has been processed completely.
`gptSessionBenchmark` uses fixed input/output lengths for benchmarking. A similar dataset for `gptManagerBenchmark` can be generated with the preprocessing script, e.g.
```
python prepare_dataset.py \
--output tokens-fixed-lengths.json \
--request-rate -1 \
--time-delay-dist constant \
--tokenizer <path/to/tokenizer> \
token-norm-dist \
--num-requests 128 \
--input-mean 60 --input-stdev 0 \
--output-mean 20 --output-stdev 0
```
Take GPT-350M as an example for a single GPU with static batching
```
@ -152,7 +170,5 @@ Take GPT-350M as an example for single GPU with static batching
--type IFB \
--static_emulated_batch_size 32 \
--static_emulated_timeout 100 \
--dataset ../../benchmarks/cpp/preprocessed_dataset.json
--dataset ../../benchmarks/cpp/tokens-fixed-lengths.json
```
`gptManagerBenchmark` can also be used with the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`). This can be done by passing the argument `--api executor`. Note that the Executor class is still under development and currently does not support models with tp or pp > 1.


@ -57,12 +57,12 @@ std::string engineFilename(
std::filesystem::path const& dataPath, WorldConfig const& worldConfig, std::string const& model)
{
auto constexpr allowExceptions = true;
auto constexpr ingoreComments = true;
auto constexpr ignoreComments = true;
auto const jsonFilePath = dataPath / "config.json";
TLLM_CHECK_WITH_INFO(
std::filesystem::exists(jsonFilePath), std::string("File does not exist: ") + jsonFilePath.string());
std::ifstream jsonStream(jsonFilePath);
auto const json = nlohmann::json::parse(jsonStream, nullptr, allowExceptions, ingoreComments);
auto const json = nlohmann::json::parse(jsonStream, nullptr, allowExceptions, ignoreComments);
auto const& builderConfig = json.at("builder_config");
auto const precision = builderConfig.at("precision").template get<std::string>();
auto const worldSize = builderConfig.at("tensor_parallel").template get<SizeType>();
@ -97,9 +97,9 @@ void benchmarkBert(std::string const& modelName, std::filesystem::path const& da
allocator.setZero(*inputIdsBuffer);
tensorMap.insert(std::make_pair("input_ids", inputIdsBuffer));
// input_lengths
std::vector<SizeType> inputLenghtsHost(batchSize);
std::vector<SizeType> inputLengthsHost(batchSize);
auto inLensBuffer = std::shared_ptr<ITensor>{
allocator.copyFrom(inputLenghtsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU)};
allocator.copyFrom(inputLengthsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU)};
allocator.setZero(*inLensBuffer);
tensorMap.insert(std::make_pair("input_lengths", inLensBuffer));


@ -1049,12 +1049,8 @@ int main(int argc, char* argv[])
padId = result["pad_id"].as<int>();
}
std::optional<int32_t> eosId;
// Argument: End-of-sentence token id
if (result.count("eos_id"))
{
eosId = result["eos_id"].as<int>();
}
std::optional<int32_t> eosId = result["eos_id"].as<int>();
std::optional<int> staticEmulatedBatchSize;
// Argument: Static emulated batch size


@ -120,9 +120,9 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
auto peakMemFuture = std::async(&monitorMemory, std::ref(done));
TLLM_LOG_INFO(memoryCounter.toString());
std::vector<SizeType> inputLenghtsHost(batchSize, maxInputLength);
auto inputLenghts
= bufferManager.copyFrom(inputLenghtsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU);
std::vector<SizeType> inputLengthsHost(batchSize, maxInputLength);
auto inputLengths
= bufferManager.copyFrom(inputLengthsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU);
// copy inputs and wrap into shared_ptr
GenerationInput::TensorPtr inputIds;
@ -147,7 +147,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
TLLM_LOG_INFO(memoryCounter.toString());
GenerationInput generationInput{
endId, padId, std::move(inputIds), std::move(inputLenghts), inputPacked};
endId, padId, std::move(inputIds), std::move(inputLengths), inputPacked};
// runtime will allocate memory for output if this tensor is empty
GenerationOutput generationOutput{
@ -183,6 +183,8 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
int iterIdx = 0;
float curDuration = 0;
std::vector<float> latencies;
std::vector<float> generationTimes;
auto generationProfiler = std::make_shared<GptSession::GenerationProfiler>();
while (iterIdx < numRuns || curDuration / 1000 < duration)
{
auto const start = std::chrono::steady_clock::now();
@ -190,7 +192,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
generationOutput.onTokenGenerated
= [&numSteps, maxNewTokens](GenerationOutput::TensorPtr const& outputIds, SizeType step,
bool finished) { ++numSteps; };
session.generate(generationOutput, generationInput, samplingConfig);
session.generate(generationOutput, generationInput, samplingConfig, generationProfiler);
bufferManager.getStream().synchronize();
auto const end = std::chrono::steady_clock::now();
@ -198,6 +200,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
float latency = std::chrono::duration<float, std::milli>(end - start).count();
curDuration += latency;
latencies.emplace_back(latency);
generationTimes.emplace_back(generationProfiler->getElapsedTimeMs());
}
TLLM_LOG_INFO(memoryCounter.toString());
@ -231,12 +234,16 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
{
auto const averageLatency = curDuration / iterIdx;
float const tokensPerSec = batchSize * maxNewTokens / (averageLatency / 1000);
auto const avgGenerationTime
= std::reduce(generationTimes.begin(), generationTimes.end(), 0.0f) / generationTimes.size();
float const generationTokensPerSec = batchSize * maxNewTokens / (avgGenerationTime / 1000);
// convert to GB
float const peakMemGB = peakMem / 1e9;
printf(
"[BENCHMARK] batch_size %d input_length %d output_length %d latency(ms) %.2f tokensPerSec "
"%.2f gpu_peak_mem(gb) %.2f\n",
batchSize, maxInputLength, maxNewTokens, averageLatency, tokensPerSec, peakMemGB);
"%.2f generation_time(ms) %.2f generationTokensPerSec %.2f gpu_peak_mem(gb) %.2f\n",
batchSize, maxInputLength, maxNewTokens, averageLatency, tokensPerSec, avgGenerationTime,
generationTokensPerSec, peakMemGB);
}
// logits are stored in the last rank
@ -246,7 +253,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
{
std::cout << "generationOutput.contextLogits.shape: "
<< generationOutput.contextLogits->getShape()
<< std::endl; // (batchsize, prompt_len, vocabsize)
<< std::endl; // (batch_size, prompt_len, vocab_size)
std::cout << "generationOutput.contextLogits: " << *generationOutput.contextLogits << std::endl;
}
@ -254,7 +261,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
{
std::cout << "generationOutput.generationLogits.shape: "
<< generationOutput.generationLogits->getShape()
<< std::endl; // (batchsize, beamwidth, maxNewTokens, vocabsize)
<< std::endl; // (batch_size, beam_width, maxNewTokens, vocab_size)
generationOutput.generationLogits->reshape(ITensor::makeShape({batchSize * beamWidth,
maxNewTokens, modelConfig.getVocabSizePadded(worldConfig.getSize())}));
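
The benchmark hunks above add a second throughput figure: besides tokens/s derived from end-to-end wall-clock latency, the loop now also reports tokens/s over the GPU generation time collected by the new `GenerationProfiler`. Below is a minimal standalone sketch of that arithmetic; the struct and function names are illustrative and not part of the benchmark.

```cpp
#include <numeric>
#include <vector>

// Illustrative only: mirrors the two figures printed by the patched benchmark.
struct ThroughputReport
{
    float tokensPerSec;           // based on end-to-end latency (includes host-side work)
    float generationTokensPerSec; // based on GPU generation time from the profiler
};

ThroughputReport computeThroughput(int batchSize, int maxNewTokens, std::vector<float> const& latenciesMs,
    std::vector<float> const& generationTimesMs)
{
    auto const avg = [](std::vector<float> const& v)
    { return std::reduce(v.begin(), v.end(), 0.0f) / static_cast<float>(v.size()); };

    float const avgLatencyMs = avg(latenciesMs);
    float const avgGenerationMs = avg(generationTimesMs);
    auto const totalTokens = static_cast<float>(batchSize * maxNewTokens);
    return {totalTokens / (avgLatencyMs / 1000.f), totalTokens / (avgGenerationMs / 1000.f)};
}
```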


@ -75,6 +75,43 @@ class BaseBenchmark(object):
self.quant_mode = QuantMode(0)
self.enable_fp8 = False
if engine_dir is not None:
# Read config from engine directory
config_path = os.path.join(engine_dir, 'config.json')
with open(config_path, 'r') as f:
self.config = json.load(f)
# Sanity checks
if 'pretrained_config' in self.config: # new build api branch
config_dtype = self.config['pretrained_config']['dtype']
assert dtype == config_dtype, f"Engine dtype ({config_dtype}) != Runtime dtype ({dtype})"
world_size = self.config['pretrained_config']['mapping'][
'world_size']
assert world_size == self.world_size, \
(f'Engine world size ({world_size}) != Runtime world size ({self.world_size})')
# Load config into self
for key, value in self.config['pretrained_config'].items():
setattr(self, key, value)
self.quant_mode = QuantMode.from_quant_algo(
quant_algo=self.quantization['quant_algo'],
kv_cache_quant_algo=self.quantization['kv_cache_quant_algo']
)
self.enable_fp8 = self.quant_mode.has_fp8_qdq()
self.fp8_kv_cache = self.quant_mode.has_fp8_kv_cache()
for key, value in self.config['build_config'].items():
setattr(self, key, value)
for key, value in self.plugin_config.items():
if "plugin" in key:
key = "use_" + key
setattr(self, key, value)
self.engine_name = f"rank{self.runtime_rank}.engine"
self.num_kv_heads = self.num_key_value_heads
self.num_layers = self.num_hidden_layers
self.num_heads = self.num_attention_heads
else:
# Read config from engine directory
config_path = os.path.join(engine_dir, 'config.json')
with open(config_path, 'r') as f:
@ -100,9 +137,14 @@ class BaseBenchmark(object):
if "plugin" in key:
key = "use_" + key
setattr(self, key, value)
self.engine_name = get_engine_name(self.engine_model_name,
self.dtype, self.world_size,
self.runtime_rank)
else:
self.engine_name = get_engine_name(self.engine_model_name,
self.dtype, self.world_size,
self.runtime_rank)
self.engine_name = get_engine_name(self.engine_model_name, self.dtype,
self.world_size, self.runtime_rank)
self.runtime_mapping = tensorrt_llm.Mapping(world_size=self.world_size,
rank=self.runtime_rank,
tp_size=self.world_size)


@ -53,11 +53,12 @@ def parse_arguments():
'--mode',
type=str,
default="plugin",
choices=['ootb', 'plugin', 'ootb-except-mha'],
choices=['ootb', 'plugin', 'plugin-ifb', 'ootb-except-mha'],
help=
('Choose mode between ootb/plugin/ootb-except-mha. '
'\"ootb\" means the engines will be built without any plugins, '
'\"plugin\" means the engines will be built with tuned recipe of using plugins.'
'\"plugin-ifb\" will include additional options required for inflight batching.'
'\"ootb-except-mha\" means the engines will be built with only attention plugins.'
))
@ -749,7 +750,7 @@ def build_gpt(args):
network.plugin_config.to_legacy_setting()
# Plugins
if args.mode == 'plugin':
if args.mode in ['plugin', 'plugin-ifb']:
network.plugin_config.set_gpt_attention_plugin(dtype=args.dtype)
network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
network.plugin_config.enable_remove_input_padding()
@ -773,6 +774,10 @@ def build_gpt(args):
# RMS norm plugin for SmoothQuant
network.plugin_config.set_rmsnorm_quantization_plugin(
dtype=args.dtype)
# Inflight batching
if args.mode == 'plugin-ifb':
network.plugin_config.enable_paged_kv_cache()
elif args.mode == 'ootb-except-mha':
network.plugin_config.set_gpt_attention_plugin(dtype=args.dtype)
network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
@ -801,7 +806,7 @@ def build_gpt(args):
else:
tensorrt_llm_model(*inputs)
if args.mode == 'plugin':
if args.mode in ['plugin', 'plugin-ifb']:
tensorrt_llm.graph_rewriting.optimize(network)
# Network -> Engine


@ -109,6 +109,7 @@ class GPTBenchmark(BaseBenchmark):
if not hasattr(self, 'num_kv_heads') or self.num_kv_heads is None:
self.num_kv_heads = self.num_heads
model_config = tensorrt_llm.runtime.ModelConfig(
max_batch_size=self.max_batch_size,
max_beam_width=self.num_beams,
@ -118,6 +119,9 @@ class GPTBenchmark(BaseBenchmark):
num_kv_heads=ceil(self.num_kv_heads / self.world_size),
hidden_size=self.hidden_size // self.world_size,
gpt_attention_plugin=self.use_gpt_attention_plugin,
paged_kv_cache=self.paged_kv_cache if hasattr(
self, 'paged_kv_cache') else False,
dtype=self.dtype,
remove_input_padding=self.remove_input_padding,
quant_mode=self.quant_mode,
use_custom_all_reduce=self.use_custom_all_reduce,


@ -120,7 +120,7 @@ public:
{
if (req.getEmbeddingBias())
{
mEmbeddingBias = executor::detail::toITensor(*(req.getEmbeddingBias().value()));
mEmbeddingBias = executor::detail::toITensor(req.getEmbeddingBias().value());
// Add leading 1 dimension since that's what IFB code expects
mEmbeddingBias.value()->unsqueeze(0);
}
@ -136,7 +136,7 @@ public:
auto pTuningConfig = req.getPromptTuningConfig();
if (pTuningConfig)
{
mPromptEmbeddingTable = executor::detail::toITensor(*pTuningConfig.value().getEmbeddingTable());
mPromptEmbeddingTable = executor::detail::toITensor(pTuningConfig.value().getEmbeddingTable());
TLLM_CHECK(mPromptEmbeddingTable.value()->getShape().nbDims == 2);
mPromptVocabSize = mPromptEmbeddingTable.value()->getShape().d[0];
mPromptEmbeddingTable.value()->unsqueeze(0);
@ -145,10 +145,10 @@ public:
auto loraConfig = req.getLoraConfig();
if (loraConfig)
{
mLoraWeights = executor::detail::toITensor(*loraConfig.value().getWeights());
mLoraWeights = executor::detail::toITensor(loraConfig.value().getWeights());
mLoraWeights.value()->unsqueeze(0);
mLoraConfig = executor::detail::toITensor(*loraConfig.value().getConfig());
mLoraConfig = executor::detail::toITensor(loraConfig.value().getConfig());
mLoraConfig.value()->unsqueeze(0);
}
@ -159,7 +159,7 @@ public:
if (speculativeDecodingConfig.value().getLogits())
{
mDraftLogits = executor::detail::toITensor(*speculativeDecodingConfig.value().getLogits().value());
mDraftLogits = executor::detail::toITensor(speculativeDecodingConfig.value().getLogits().value());
}
// NOTE: Draft acceptance threshold is stored in mSamplingConfig
@ -551,7 +551,7 @@ public:
return mState == REQUEST_STATE_CONTEXT_INIT;
}
[[nodiscard]] bool isGenerationInProgessState() const noexcept
[[nodiscard]] bool isGenerationInProgressState() const noexcept
{
return mState == REQUEST_STATE_GENERATION_IN_PROGRESS;
}
@ -680,14 +680,12 @@ public:
if (getReturnContextLogits())
{
result.contextLogits
= std::make_shared<executor::Tensor>(executor::detail::ofITensor(getContextLogitsHost()));
result.contextLogits = executor::detail::ofITensor(getContextLogitsHost());
}
if (getReturnGenerationLogits())
{
result.generationLogits
= std::make_shared<executor::Tensor>(executor::detail::ofITensor(getGenerationLogitsHost()));
result.generationLogits = executor::detail::ofITensor(getGenerationLogitsHost());
}
// Update position of last sent response


@ -237,12 +237,11 @@ public:
void bcast(runtime::IBuffer& buf, int root) const
{
TLLM_CHECK(buf.getMemoryType() != runtime::MemoryType::kGPU);
bcast(buf.data(), buf.getSizeInBytes(), MpiType::kBYTE, root);
}
template <typename T>
void bcast(T& value, int root) const
void bcastValue(T& value, int root) const
{
if constexpr (std::is_fundamental_v<std::remove_cv_t<T>>)
{


@ -99,18 +99,18 @@ struct OutputConfig
class SpeculativeDecodingConfig
{
public:
explicit SpeculativeDecodingConfig(VecTokens tokens, std::optional<TensorPtr> logits = std::nullopt,
explicit SpeculativeDecodingConfig(VecTokens tokens, std::optional<Tensor> logits = std::nullopt,
std::optional<FloatType> acceptanceThreshold = std::nullopt);
~SpeculativeDecodingConfig();
[[nodiscard]] VecTokens getTokens() const;
[[nodiscard]] std::optional<TensorPtr> getLogits() const;
[[nodiscard]] std::optional<Tensor> getLogits() const;
[[nodiscard]] std::optional<FloatType> getAcceptanceThreshold() const;
private:
VecTokens mTokens;
std::optional<TensorPtr> mLogits;
std::optional<Tensor> mLogits;
std::optional<FloatType> mAcceptanceThreshold;
};
@ -122,28 +122,28 @@ public:
/// @param embeddingTable The prompt embedding table. Data type must match model weights. Shape [vocabSize,
/// hiddenSize]
/// @param vocabSize
PromptTuningConfig(TensorPtr embeddingTable);
PromptTuningConfig(Tensor embeddingTable);
~PromptTuningConfig();
[[nodiscard]] TensorPtr getEmbeddingTable() const;
[[nodiscard]] Tensor getEmbeddingTable() const;
private:
TensorPtr mEmbeddingTable;
Tensor mEmbeddingTable;
};
/// @brief Configuration for LoRA
class LoraConfig
{
public:
LoraConfig(TensorPtr weights, TensorPtr config);
LoraConfig(Tensor weights, Tensor config);
~LoraConfig();
[[nodiscard]] TensorPtr getWeights() const;
[[nodiscard]] TensorPtr getConfig() const;
[[nodiscard]] Tensor getWeights() const;
[[nodiscard]] Tensor getConfig() const;
private:
TensorPtr mWeights;
TensorPtr mConfig;
Tensor mWeights;
Tensor mConfig;
};
/// @brief A class that holds information about the request
@ -169,7 +169,7 @@ public:
std::optional<SizeType> endId = std::nullopt, std::optional<SizeType> padId = std::nullopt,
std::optional<std::list<VecTokens>> badWords = std::nullopt,
std::optional<std::list<VecTokens>> stopWords = std::nullopt,
std::optional<TensorPtr> embeddingBias = std::nullopt,
std::optional<Tensor> embeddingBias = std::nullopt,
std::optional<SpeculativeDecodingConfig> speculativeDecodingConfig = std::nullopt,
std::optional<PromptTuningConfig> pTuningConfig = std::nullopt,
std::optional<LoraConfig> loraConfig = std::nullopt);
@ -189,7 +189,7 @@ public:
[[nodiscard]] std::optional<SizeType> getPadId() const;
[[nodiscard]] std::optional<std::list<VecTokens>> getBadWords() const;
[[nodiscard]] std::optional<std::list<VecTokens>> getStopWords() const;
[[nodiscard]] std::optional<TensorPtr> getEmbeddingBias() const;
[[nodiscard]] std::optional<Tensor> getEmbeddingBias() const;
[[nodiscard]] std::optional<SpeculativeDecodingConfig> getSpeculativeDecodingConfig() const;
[[nodiscard]] std::optional<PromptTuningConfig> getPromptTuningConfig() const;
[[nodiscard]] std::optional<LoraConfig> getLoraConfig() const;
@ -201,7 +201,7 @@ public:
void setPadId(SizeType padId);
void setBadWords(std::list<VecTokens> badWords);
void setStopWords(std::list<VecTokens> stopWords);
void setEmbeddingBias(TensorPtr);
void setEmbeddingBias(Tensor);
void setSpeculativeDecodingConfig(SpeculativeDecodingConfig specDecodingConfig);
void setPromptTuningConfig(PromptTuningConfig pTuningConfig);
void setLoraConfig(LoraConfig loraConfig);
@ -222,8 +222,8 @@ struct Result
std::optional<VecLogProbs> cumLogProbs; // [beamSize]
std::optional<std::vector<VecLogProbs>> logProbs; // [beamSize, seqLen]
std::optional<TensorPtr> contextLogits; // [promptLen, vocab_size_padded]
std::optional<TensorPtr> generationLogits; // [beam_size, mMaxNewTokens, vocab_size_padded]
std::optional<Tensor> contextLogits; // [promptLen, vocab_size_padded]
std::optional<Tensor> generationLogits; // [beam_size, mMaxNewTokens, vocab_size_padded]
};
/// @brief Class that holds either an error or a result


@ -92,6 +92,46 @@ public:
std::optional<SizeType> ctxMicroBatchSize = std::nullopt;
std::optional<SizeType> genMicroBatchSize = std::nullopt;
std::optional<DecodingMode> decodingMode = std::nullopt;
bool normalizeLogProbs = true;
};
//! @brief Optional profiler class to profile the generation phase of an inference request
class GenerationProfiler
{
public:
// Use a constexpr variable to resolve the ambiguous match for overloaded CudaEvent constructor
static constexpr unsigned int flags{cudaEventDefault};
GenerationProfiler()
: start(flags)
, end(flags)
{
}
CudaEvent const& getStart() const
{
return start;
}
CudaEvent const& getEnd() const
{
return end;
}
float getElapsedTimeMs()
{
start.synchronize();
end.synchronize();
float result;
TLLM_CUDA_CHECK(::cudaEventElapsedTime(&result, start.get(), end.get()));
return result;
}
private:
CudaEvent start;
CudaEvent end;
};
GptSession(Config const& sessionConfig, GptModelConfig const& modelConfig, WorldConfig const& worldConfig,
@ -129,9 +169,15 @@ public:
return mDevice;
}
[[nodiscard]] bool getNormalizeLogProbs() const noexcept
{
return mNormalizeLogProbs;
}
[[nodiscard]] nvinfer1::DataType getLogitDataType() const;
void generate(GenerationOutput& outputs, GenerationInput const& inputs, SamplingConfig const& samplingConfig);
void generate(GenerationOutput& outputs, GenerationInput const& inputs, SamplingConfig const& samplingConfig,
std::shared_ptr<GenerationProfiler> const generationProfiler = nullptr);
private:
[[nodiscard]] bool useCudaGraphs()
@ -141,7 +187,7 @@ private:
void generateBatched(std::vector<GenerationOutput>& microBatchesOutputs,
std::vector<GenerationInput> const& microBatchesInputs, SamplingConfig const& samplingConfig,
TokenGeneratedCallback const& onTokenGenerated);
TokenGeneratedCallback const& onTokenGenerated, std::shared_ptr<GenerationProfiler> const generationProfiler);
void setup(Config const& sessionConfig);
@ -154,9 +200,8 @@ private:
SizeType sinkTokenLength, SizeType maxSequenceLength, KvCacheConfig const& config);
void createCustomAllReduceWorkspace(SizeType batchSize, SizeType beamWidth, SizeType maxSequenceLength);
void executeContextStep(std::vector<GenerationInput> const& microBatchesInputs,
std::vector<GenerationOutput>& microBatchesOutputs, std::vector<SizeType> const& microBatchOffsets,
KvCacheManager const* kvCacheManager);
void executeContextStep(std::vector<GenerationInput> const& generationBatchesInputs,
std::vector<SizeType> const& generationBatchesOffsets, KvCacheManager const* kvCacheManager);
SizeType executeGenerationStep(SizeType step, std::vector<GenerationInput> const& microBatchesInputs,
std::vector<GenerationOutput>& microBatchesOutputs, std::vector<SizeType> const& microBatchOffsets,
KvCacheManager* kvCacheManager, std::vector<bool>& microBatchesFinished);
@ -275,6 +320,8 @@ private:
bool mCudaGraphMode{false};
// ping-pong instances
std::vector<CudaGraphExecutor> mCudaGraphInstances;
bool mNormalizeLogProbs = true;
};
} // namespace tensorrt_llm::runtime
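
Combining this header with the `gptSessionBenchmark.cpp` change earlier in the diff, a minimal sketch of how a caller might use the new profiler overload is shown below. It assumes the session, inputs, outputs, and sampling config have already been constructed exactly as the benchmark does before its timing loop; the include path is inferred from the repository layout.

```cpp
#include "tensorrt_llm/runtime/gptSession.h"

#include <cstdio>
#include <memory>

using namespace tensorrt_llm::runtime;

// Sketch only: all arguments are assumed to be set up as in gptSessionBenchmark.cpp.
void timedGenerate(GptSession& session, GenerationOutput& outputs, GenerationInput const& inputs,
    SamplingConfig const& samplingConfig)
{
    auto profiler = std::make_shared<GptSession::GenerationProfiler>();

    // The new optional argument records CUDA events around the generation phase.
    session.generate(outputs, inputs, samplingConfig, profiler);

    // getElapsedTimeMs() synchronizes both events before calling cudaEventElapsedTime.
    std::printf("generation_time(ms) %.2f\n", profiler->getElapsedTimeMs());
}
```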


@ -24,7 +24,7 @@
namespace tensorrt_llm::runtime
{
void setPeerAccess(WorldConfig worldConfig, bool enable = true);
void setPeerAccess(WorldConfig const& worldConfig, bool enable = true);
class IpcMemory
{
@ -33,7 +33,7 @@ public:
size_t static constexpr FLAGS_SIZE = kernels::MAX_ALL_REDUCE_BLOCKS * sizeof(uint32_t);
IpcMemory(WorldConfig worldConfig, std::size_t bufferSize);
IpcMemory(WorldConfig const& worldConfig, std::size_t bufferSize);
~IpcMemory();
[[nodiscard]] const std::vector<void*>& getCommPtrsTensor() const
@ -48,7 +48,7 @@ private:
WorldConfig mWorldConfig;
std::vector<void*> mCommPtrs;
std::size_t mBufferSize;
void* mBufferPtr;
void* mBufferPtr{nullptr};
};
} // namespace tensorrt_llm::runtime


@ -195,8 +195,8 @@ set(TRTLLM_LINK_LIBS
${TRT_LIB}
common_src
kernels_src
cutlass_src_pre_hopper
cutlass_src_hopper
cutlass2_src
cutlass3_src
layers_src
runtime_src)


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9fd644e0a38b1d4d1a54d4b7b834cc6b0110a5771fcfc480e96795b3f9bc892
size 2081046
oid sha256:0ecc134ad10a54b2953c772e72db2f71e84130d5736087b033e9e5b78594db6d
size 2113376


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:90436c59eb243a0156e3f0aa95412a7caacbefdcde768c158edc4b821044dfd1
size 2102486
oid sha256:9aa3f3d7f8313c099df8e9bd4c9707922a4f1c4025c4c99986acf6df781738c7
size 2128450


@ -1,3 +1,3 @@
f53c02e3829b516a6e9221745bcbacbd libtensorrt_llm_batch_manager_static.a
9e92e5dbb104e3e676952ea40c81916f libtensorrt_llm_batch_manager_static.pre_cxx11.a
25adff90cc350eb9ca9804051a08de80d547c113 commit
add62ff328028bbcded1af694fe758c5 libtensorrt_llm_batch_manager_static.a
9e8846e200e2aaaeace862741a90c3ab libtensorrt_llm_batch_manager_static.pre_cxx11.a
230623fa285048a2de5c54c2cc0f364fb9f2c559 commit


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3433d7b52bb6dcac32111172cb6201a9fee56e739f3660895083baebd1b89ee
size 2033616
oid sha256:7b25de974b6ca5f0dcb279f16f38199167d1efc35c01770d3234bec2dfb5dc86
size 2097848


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fb3f4145881984de6268c34f7e5d452f78f54952f454f747a1cd52bc3171de62
size 2012002
oid sha256:5f06cee5ae2bcf393196265cd9a3ef832690cd4c5c53934bbfb169d50ab33c41
size 2055004


@ -1,2 +1,2 @@
d60b12741e940f56addaf2d92e78b50f libtensorrt_llm_batch_manager_static.a
c55e606a3430d3a56cee3968a77b46f1 libtensorrt_llm_batch_manager_static.pre_cxx11.a
bb62a31b8e17dae284d784ba43d5bc02 libtensorrt_llm_batch_manager_static.a
19327f59c7f5b6235e15b322d5f5a0f4 libtensorrt_llm_batch_manager_static.pre_cxx11.a


@ -146,7 +146,7 @@ void CublasMMWrapper::Gemm(cublasOperation_t transa, cublasOperation_t transb, c
{
check_cuda_error(cublasSetStream(getCublasHandle(), mStream));
check_cuda_error(cublasSetWorkspace(getCublasHandle(), mCublasWorkspace, workspaceSize));
// Go with default heruistic to choose tactic as cuBLAS does not allow to choose tactics in Ampere+
// Go with default heuristic to choose tactic as cuBLAS does not allow to choose tactics in Ampere+
cublasGemmAlgo_t cublasAlgo = CUBLAS_GEMM_DEFAULT;
check_cuda_error(cublasGemmEx(getCublasHandle(), transa, transb, m, n, k, alpha, A, mAType, lda, B, mBType, ldb,
beta, C, mCType, ldc, mComputeType, static_cast<cublasGemmAlgo_t>(cublasAlgo)));
@ -318,7 +318,7 @@ std::vector<cublasLtMatmulHeuristicResult_t> CublasMMWrapper::getTactics(cublasL
uint64_t workspace_size = CUBLAS_WORKSPACE_SIZE;
check_cuda_error(cublasLtMatmulPreferenceSetAttribute(
preference, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES, &workspace_size, sizeof(workspace_size)));
// Restrict reduction algorithms for numerical stability and better determenism
// Restrict reduction algorithms for numerical stability and better determinism
uint32_t reduction_mask = CUBLASLT_REDUCTION_SCHEME_MASK;
check_cuda_error(cublasLtMatmulPreferenceSetAttribute(
preference, CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK, &reduction_mask, sizeof(reduction_mask)));


@ -283,23 +283,23 @@ inline std::tuple<size_t, size_t> getDeviceMemoryInfo(const bool useUvm)
{
if (useUvm)
{
size_t freeSysmem, totalSysmem;
size_t freeSysMem, totalSysMem;
#ifndef _WIN32 // Linux
struct sysinfo info;
sysinfo(&info);
totalSysmem = info.totalram * info.mem_unit;
freeSysmem = info.freeram * info.mem_unit;
totalSysMem = info.totalram * info.mem_unit;
freeSysMem = info.freeram * info.mem_unit;
#else // Windows
MEMORYSTATUSEX memInfo;
memInfo.dwLength = sizeof(memInfo);
GlobalMemoryStatusEx(&memInfo);
totalSysmem = memInfo.ullTotalPhys;
freeSysmem = memInfo.ullAvailPhys;
totalSysMem = memInfo.ullTotalPhys;
freeSysMem = memInfo.ullAvailPhys;
#endif // WIN32
TLLM_LOG_INFO("Using UVM based system memory for KV cache, total memory %0.2f GB, available memory %0.2f GB",
((double) totalSysmem / 1e9), ((double) freeSysmem / 1e9));
return {freeSysmem, totalSysmem};
((double) totalSysMem / 1e9), ((double) freeSysMem / 1e9));
return {freeSysMem, totalSysMem};
}
else
{


@ -29,35 +29,29 @@ Logger::Logger()
int deviceId;
cudaGetDevice(&deviceId);
char* levelName = std::getenv("TLLM_LOG_LEVEL");
auto const* levelName = std::getenv("TLLM_LOG_LEVEL");
if (levelName != nullptr)
{
std::map<std::string, Level> nameToLevel = {
{"TRACE", TRACE},
{"DEBUG", DEBUG},
{"INFO", INFO},
{"WARNING", WARNING},
{"ERROR", ERROR},
};
auto level = nameToLevel.find(levelName);
auto level = [levelName = std::string(levelName)]()
{
if (levelName == "TRACE")
return TRACE;
if (levelName == "DEBUG")
return DEBUG;
if (levelName == "INFO")
return INFO;
if (levelName == "WARNING")
return WARNING;
if (levelName == "ERROR")
return ERROR;
TLLM_THROW("Invalid log level: %s", levelName.c_str());
}();
// If TLLM_LOG_FIRST_RANK_ONLY=ON, set LOG LEVEL of other device to ERROR
if (isFirstRankOnly && deviceId != 0)
{
level = nameToLevel.find("ERROR");
}
if (level != nameToLevel.end())
{
setLevel(level->second);
}
else
{
fprintf(stderr,
"[TensorRT-LLM][WARNING] Invalid logger level TLLM_LOG_LEVEL=%s. "
"Ignore the environment variable and use a default "
"logging level.\n",
levelName);
levelName = nullptr;
level = ERROR;
}
setLevel(level);
}
}


@ -18,10 +18,10 @@
#include <cstdlib>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include "tensorrt_llm/common/assert.h"
#include "tensorrt_llm/common/stringUtils.h"
namespace tensorrt_llm::common
@ -88,13 +88,11 @@ public:
void setLevel(const Level level)
{
level_ = level;
log(INFO, "Set logger level by %s", getLevelName(level).c_str());
log(INFO, "Set logger level by %s", getLevelName(level));
}
private:
const std::string PREFIX = "[TensorRT-LLM]";
std::map<Level, std::string> level_name_
= {{TRACE, "TRACE"}, {DEBUG, "DEBUG"}, {INFO, "INFO"}, {WARNING, "WARNING"}, {ERROR, "ERROR"}};
static auto constexpr kPREFIX = "[TensorRT-LLM]";
#ifndef NDEBUG
const Level DEFAULT_LOG_LEVEL = DEBUG;
@ -105,19 +103,28 @@ private:
Logger(); // NOLINT(modernize-use-equals-delete)
inline std::string getLevelName(const Level level)
static inline char const* getLevelName(const Level level)
{
return level_name_[level];
switch (level)
{
case TRACE: return "TRACE";
case DEBUG: return "DEBUG";
case INFO: return "INFO";
case WARNING: return "WARNING";
case ERROR: return "ERROR";
}
inline std::string getPrefix(const Level level)
{
return PREFIX + "[" + getLevelName(level) + "] ";
TLLM_THROW("Unknown log level: %d", level);
}
inline std::string getPrefix(const Level level, const int rank)
static inline std::string getPrefix(const Level level)
{
return PREFIX + "[" + getLevelName(level) + "][" + std::to_string(rank) + "] ";
return fmtstr("%s[%s] ", kPREFIX, getLevelName(level));
}
static inline std::string getPrefix(const Level level, const int rank)
{
return fmtstr("%s[%s][%d] ", kPREFIX, getLevelName(level), rank);
}
};
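
Both logger changes above trade `std::map` lookups for direct conversions: an immediately invoked lambda turns the `TLLM_LOG_LEVEL` string into an enum value in `logger.cpp`, and a `switch` over string literals replaces the member map in `logger.h`. The self-contained sketch below illustrates the same pattern without any TensorRT-LLM types; the enum and function names are made up for the example, and the real logger's handling of unset or invalid values differs slightly.

```cpp
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>

enum Level { TRACE, DEBUG, INFO, WARNING, ERROR };

// Enum -> name via a switch over string literals (no std::map, no allocation).
static char const* levelName(Level level)
{
    switch (level)
    {
    case TRACE: return "TRACE";
    case DEBUG: return "DEBUG";
    case INFO: return "INFO";
    case WARNING: return "WARNING";
    case ERROR: return "ERROR";
    }
    throw std::invalid_argument("unknown level");
}

int main()
{
    // Name -> enum via an immediately invoked lambda, mirroring the logger.cpp rewrite.
    auto const* raw = std::getenv("TLLM_LOG_LEVEL");
    Level const level = [name = std::string(raw != nullptr ? raw : "INFO")]()
    {
        if (name == "TRACE") return TRACE;
        if (name == "DEBUG") return DEBUG;
        if (name == "INFO") return INFO;
        if (name == "WARNING") return WARNING;
        if (name == "ERROR") return ERROR;
        throw std::invalid_argument("invalid TLLM_LOG_LEVEL: " + name);
    }();

    std::printf("effective log level: %s\n", levelName(level));
    return 0;
}
```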


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:13e17e2d9a94d2bc1b131d096a3722a83a67ab115fa8271b57b27f7e2877bdc1
size 587334
oid sha256:4201c7241d53298ca52d4f1447cc9cbc4024f63b42a24cbcff82192cc10bed67
size 576098


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:45438204eba812694bd30b68cfc9bb2bc54a8a59c6c86e037bbc4ac7e5f8230c
size 589438
oid sha256:2960feb2c7ad941a473408e2f6fd8c324f60f6af3c4d8f11217c676fd830e4cb
size 578660


@ -1,3 +1,3 @@
835767a37292ea9786c0d6149ae270f4 libtensorrt_llm_executor_static.a
1fe0c9ac7a1a35ce7d80676146867374 libtensorrt_llm_executor_static.pre_cxx11.a
25adff90cc350eb9ca9804051a08de80d547c113 commit
8a8d6505d9ef62cb2eeb8c75a5ee5bbb libtensorrt_llm_executor_static.a
e3b8edc619c99a7f125fe81bc8554ff0 libtensorrt_llm_executor_static.pre_cxx11.a
230623fa285048a2de5c54c2cc0f364fb9f2c559 commit


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7969768d3b9a65182ee519c60e11f27b0a088c2c0b732f3780d7c0c563dbb180
size 587776
oid sha256:cde295fa290b15b3d76b8e8b2cc435d7fceb2f456d8cb4d9b22ee2cf3ddbd344
size 588504


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:98d9b7c4a586f0be0499a0df487cacba69985ce43ca5fd543c90c6a368c91b67
size 571150
oid sha256:54ac66f3555bff4ed28ba0352bcb4a0f541346592cf109b491071b6374e5238c
size 562260


@ -1,2 +1,2 @@
6771f94e0bce39c6cab391cf1f92484c libtensorrt_llm_executor_static.a
84b7550448f8710de17644a5d404178f libtensorrt_llm_executor_static.pre_cxx11.a
ee96c6e2742539da0e8d732635f84449 libtensorrt_llm_executor_static.a
9154564ed926ffbcdb83e7eac3504fa0 libtensorrt_llm_executor_static.pre_cxx11.a


@ -18,7 +18,8 @@
file(GLOB_RECURSE SRC_CPP *.cpp)
file(GLOB_RECURSE SRC_CU *.cu)
# This can happen when not building for Torch
# The Python executable will only be defined if building with Torch support. If
# not, we need to find it here.
if(NOT Python3_EXECUTABLE)
find_package(
Python3
@ -57,17 +58,13 @@ endif()
file(GLOB_RECURSE CU_INSTANTIATIONS ${CMAKE_CURRENT_BINARY_DIR}/*.cu)
add_library(cutlass_src_pre_hopper STATIC ${SRC_CPP} ${SRC_CU})
set_property(TARGET cutlass_src_pre_hopper PROPERTY POSITION_INDEPENDENT_CODE
ON)
set_property(TARGET cutlass_src_pre_hopper PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS
ON)
add_library(cutlass2_src STATIC ${SRC_CPP} ${SRC_CU})
set_property(TARGET cutlass2_src PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET cutlass2_src PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON)
add_library(cutlass_src_hopper STATIC ${CU_INSTANTIATIONS})
set_property(TARGET cutlass_src_hopper PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET cutlass_src_hopper PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON)
add_dependencies(cutlass_src_hopper cutlass_src_pre_hopper)
add_library(cutlass3_src STATIC ${CU_INSTANTIATIONS})
set_property(TARGET cutlass3_src PROPERTY POSITION_INDEPENDENT_CODE ON)
set_property(TARGET cutlass3_src PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON)
# Note - we deliberately do not include 90a PTX (even when 9.0+PTX is
# specified). This is because sm_90a has arch conditional instructions that are
@ -75,24 +72,18 @@ add_dependencies(cutlass_src_hopper cutlass_src_pre_hopper)
# the binary anyway.
if("9.0" IN_LIST TORCH_CUDA_ARCH_LIST
OR "9.0+PTX" IN_LIST TORCH_CUDA_ARCH_LIST
OR TORCH_CUDA_ARCH_LIST STREQUAL "Auto")
OR "90-real" IN_LIST CMAKE_CUDA_ARCHITECTURES_NATIVE)
message(STATUS "MANUALLY APPENDING FLAG TO COMPILE FOR SM_90a.")
target_compile_options(
cutlass_src_pre_hopper
PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_90a,code=sm_90a>)
target_compile_options(
cutlass_src_hopper
cutlass3_src
PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_90a,code=sm_90a>)
# Hopper kernels require cuda lib for TMA APIs
target_link_libraries(cutlass_src_pre_hopper PRIVATE CUDA::cuda_driver)
target_link_libraries(cutlass_src_hopper PRIVATE CUDA::cuda_driver)
target_link_libraries(cutlass3_src PRIVATE CUDA::cuda_driver)
# No kernels should be parsed, unless hopper is specified. This is a build
# time improvement
target_compile_definitions(cutlass_src_pre_hopper
PRIVATE COMPILE_HOPPER_MIXED_INPUT_GEMMS)
target_compile_definitions(cutlass_src_hopper
target_compile_definitions(cutlass3_src
PRIVATE COMPILE_HOPPER_MIXED_INPUT_GEMMS)
endif()
@ -101,9 +92,5 @@ endif()
# compilation output.
if(NOT WIN32)
target_compile_options(
cutlass_src_pre_hopper
PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-Wno-psabi>)
target_compile_options(
cutlass_src_hopper
PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-Wno-psabi>)
cutlass3_src PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-Wno-psabi>)
endif()


@ -127,9 +127,9 @@ __launch_bounds__(THREADBLOCK_SIZE) __global__ void batch_topk_kernel(const int*
const float length_penalty{beam_hyps.length_penalties[global_batch_idx]};
const int early_stopping{beam_hyps.early_stoppings[global_batch_idx]};
const int* sequence_lengths{beam_hyps.sequence_lengths_src};
const T diversity_rate{beam_hyps.diversity_rates[global_batch_idx]};
float* output_log_probs{beam_hyps.log_probs_src};
const int* sequence_lengths{beam_hyps.sequence_lengths_src};
using cub_kvp = cub::KeyValuePair<int, T>;
using BlockReduce = cub::BlockReduce<cub_kvp, THREADBLOCK_SIZE>;
@ -177,21 +177,7 @@ __launch_bounds__(THREADBLOCK_SIZE) __global__ void batch_topk_kernel(const int*
for (int elem_id = thread_id; elem_id < candidate_size; elem_id += THREADBLOCK_SIZE)
{
int i = beam_hyps.num_beams == nullptr ? elem_id % K : elem_id / 2 / K;
T elem = topk_tmp_val_buf[elem_id];
if (length_penalty > 0.0f)
{
int length = sequence_lengths[vector_id * K + i];
if (early_stopping == 0)
{
// Use generated_length (rather than sequence_length) to compute length_penalty
// https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L957
// But this branch will cause CI error in
// "C++ Tests (GPT) on A30", "C++ Tests (GPT-J) on H100_PCIe", "H100_PCIe-accuracy-0"
length -= beam_hyps.input_lengths[global_batch_idx];
}
const int pad_if_not_finish = finished[vector_id * K + i].isFinished() ? 0 : 1;
elem = apply_length_penalty(elem, length + pad_if_not_finish, length_penalty);
}
T elem = topk_tmp_val_buf[elem_id]; // use token score to do TopK
elem += diversity_rate * (T) i;
cub_kvp new_elem{elem_id, elem};
partial_topk = arg_max(partial_topk, new_elem);
@ -232,21 +218,25 @@ __launch_bounds__(THREADBLOCK_SIZE) __global__ void batch_topk_kernel(const int*
{
const int current_key = cta_topk[i].key;
const T current_value = cta_topk[i].value;
// Consider adding a beam only if this token belongs to the top-K range and it is the end_token
// https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L272
if (i < K && beam_hyps.num_beams != nullptr
&& topk_tmp_id_buf[current_key] % vocab_size == beam_hyps.end_ids[vector_id])
{
// Add beam only if beam_token belongs to top K tokens
// https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L272
const float normed_score = (float) current_value;
const int num_beam = beam_hyps.num_beams[global_batch_idx];
int beam_idx = num_beam;
const int seq_len = sequence_lengths[vector_id * K + i] - beam_hyps.input_lengths[global_batch_idx];
const int pad_if_not_finish = finished[vector_id * K + i].isFinished() ? 0 : 1;
const float normed_score
= apply_length_penalty(current_value, seq_len + pad_if_not_finish, length_penalty);
int beam_idx = beam_hyps.num_beams[global_batch_idx];
// There are already K beams
if (num_beam == K)
if (beam_idx == K)
{
// The current score is worse than the worst one in beams
if (normed_score < beam_hyps.min_normed_scores[global_batch_idx])
{
// Stop considering new beams
selected_beams = K;
break;
}
@ -291,24 +281,34 @@ __launch_bounds__(THREADBLOCK_SIZE) __global__ void batch_topk_kernel(const int*
{
const int src_idx = j * beam_hyps.batch_size * K + beam_hyps.ite * beam_hyps.local_batch_size * K
+ vector_id * K + prev_id;
beam_hyps.output_ids_tgt[tgt_id_offset + j]
= beam_hyps.output_ids_src_ptr[vector_id][prev_id * beam_hyps.max_seq_len + j];
if (beam_hyps.log_probs != nullptr && beam_hyps.log_probs_src != nullptr)
{
beam_hyps.log_probs[tgt_id_offset + j] = beam_hyps.log_probs_src[src_idx];
}
prev_id = beam_hyps.parent_ids_src_ptr[vector_id][prev_id * beam_hyps.max_seq_len + j];
}
const int tgt_beam_idx = global_batch_idx * (K * 2) + beam_idx;
beam_hyps.sequence_lengths_tgt[tgt_beam_idx] = current_step;
beam_hyps.normed_scores[tgt_beam_idx] = normed_score;
beam_hyps.min_normed_scores[global_batch_idx]
= min(beam_hyps.min_normed_scores[global_batch_idx], beam_hyps.normed_scores[tgt_beam_idx]);
beam_hyps.num_beams[global_batch_idx]++;
cum_log_probs[tgt_beam_idx] = (float) topk_tmp_val_buf[current_key];
beam_hyps.cum_log_probs[tgt_beam_idx] = (float) topk_tmp_val_buf[current_key];
}
// This token is the end_token but belongs to the range K ~ 2K, so just ignore it
// TODO: eliminate this branch by rewriting condition of the else_if
else if (i >= K && beam_hyps.num_beams != nullptr
&& topk_tmp_id_buf[current_key] % vocab_size == beam_hyps.end_ids[vector_id])
{
;
}
// Beam search is disabled or this token is not the end_token, so we add it to the end of the unfinished sentence
else if (beam_hyps.num_beams != nullptr || beam_hyps.num_beams == nullptr && i < K)
{
const int current_step{sequence_lengths[vector_id * K + selected_beams]};


@ -1,6 +1,5 @@
/*
* Adapted from https://github.com/state-spaces/mamba/blob/main/csrc/selective_scan/selective_scan_fwd_kernel.cuh
* Copyright (c) 2023, Tri Dao.
* Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@ -13,413 +12,318 @@
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* Not a contribution
* Changes made by NVIDIA CORPORATION & AFFILIATES or otherwise documented as
* NVIDIA-proprietary are not a contribution and subject to the following terms and conditions:
* SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: LicenseRef-NvidiaProprietary
*
* NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
* property and proprietary rights in and to this material, related
* documentation and any modifications thereto. Any use, reproduction,
* disclosure or distribution of this material and related documentation
* without an express license agreement from NVIDIA CORPORATION or
* its affiliates is strictly prohibited.
*/
#include <cuda_runtime_api.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/pipeline>
#include <cuda_bf16.h>
#include <cuda_fp16.h>
#ifdef ENABLE_FP8
#include <cuda_fp8.h>
#endif
#include <cub/block/block_load.cuh>
#include <cub/block/block_scan.cuh>
#include <cub/block/block_store.cuh>
#include "selectiveScan.h"
#include "selectiveScanCommon.h"
namespace tensorrt_llm
{
namespace kernels
{
template <int kNThreads_, int kNItems_, int kNRows_, bool kIsEvenLen_, bool kIsVariableB_, bool kIsVariableC_,
bool kHasZ_, typename input_t_, typename weight_t_>
struct Selective_Scan_fwd_kernel_traits
__device__ float toFloat(float f)
{
static_assert(kNItems_ % 4 == 0);
using input_t = input_t_;
using weight_t = weight_t_;
static constexpr int kNThreads = kNThreads_;
// Setting MinBlocksPerMP to be 3 (instead of 2) for 128 threads improves occupancy.
static constexpr int kMinBlocks = kNThreads < 128 ? 5 : 3;
static constexpr int kNItems = kNItems_;
static constexpr int kNRows = kNRows_;
static constexpr int kNBytes = sizeof(input_t);
static_assert(kNBytes == 2 || kNBytes == 4);
static constexpr int kNElts = kNBytes == 4 ? 4 : std::min(8, kNItems);
static_assert(kNItems % kNElts == 0);
static constexpr int kNLoads = kNItems / kNElts;
static constexpr bool kIsEvenLen = kIsEvenLen_;
static constexpr bool kIsVariableB = kIsVariableB_;
static constexpr bool kIsVariableC = kIsVariableC_;
static constexpr bool kHasZ = kHasZ_;
static constexpr bool kDirectIO = kIsEvenLen && kNLoads == 1;
using vec_t = typename BytesToType<kNBytes * kNElts>::Type;
using scan_t = float2;
using scan_t_s = float;
using BlockLoadT = cub::BlockLoad<input_t, kNThreads, kNItems, cub::BLOCK_LOAD_WARP_TRANSPOSE>;
using BlockLoadVecT = cub::BlockLoad<vec_t, kNThreads, kNLoads,
!kDirectIO ? cub::BLOCK_LOAD_WARP_TRANSPOSE : cub::BLOCK_LOAD_DIRECT>;
using BlockLoadWeightT = cub::BlockLoad<input_t, kNThreads, kNItems, cub::BLOCK_LOAD_WARP_TRANSPOSE>;
using BlockLoadWeightVecT = cub::BlockLoad<vec_t, kNThreads, kNLoads,
!kDirectIO ? cub::BLOCK_LOAD_WARP_TRANSPOSE : cub::BLOCK_LOAD_DIRECT>;
using BlockStoreT = cub::BlockStore<input_t, kNThreads, kNItems, cub::BLOCK_STORE_WARP_TRANSPOSE>;
using BlockStoreVecT = cub::BlockStore<vec_t, kNThreads, kNLoads,
!kDirectIO ? cub::BLOCK_STORE_WARP_TRANSPOSE : cub::BLOCK_STORE_DIRECT>;
// using BlockScanT = cub::BlockScan<scan_t, kNThreads, cub::BLOCK_SCAN_RAKING_MEMOIZE>;
// using BlockScanT = cub::BlockScan<scan_t, kNThreads, cub::BLOCK_SCAN_RAKING>;
using BlockScanT = cub::BlockScan<scan_t, kNThreads, cub::BLOCK_SCAN_WARP_SCANS>;
static constexpr int kSmemIOSize
= std::max({sizeof(typename BlockLoadT::TempStorage), sizeof(typename BlockLoadVecT::TempStorage),
(int(kIsVariableB) + int(kIsVariableC)) * sizeof(typename BlockLoadWeightT::TempStorage),
(int(kIsVariableB) + int(kIsVariableC)) * sizeof(typename BlockLoadWeightVecT::TempStorage),
sizeof(typename BlockStoreT::TempStorage), sizeof(typename BlockStoreVecT::TempStorage)});
static constexpr int kSmemSize = kSmemIOSize + sizeof(typename BlockScanT::TempStorage);
};
template <typename Ktraits>
__global__ __launch_bounds__(Ktraits::kNThreads, Ktraits::kMinBlocks) void selective_scan_fwd_kernel(
SSMParamsBase params)
{
constexpr bool kIsVariableB = Ktraits::kIsVariableB;
constexpr bool kIsVariableC = Ktraits::kIsVariableC;
constexpr bool kHasZ = Ktraits::kHasZ;
constexpr int kNThreads = Ktraits::kNThreads;
constexpr int kNItems = Ktraits::kNItems;
constexpr int kNRows = Ktraits::kNRows;
constexpr bool kDirectIO = Ktraits::kDirectIO;
using input_t = typename Ktraits::input_t;
using weight_t = typename Ktraits::weight_t;
using scan_t = typename Ktraits::scan_t;
using scan_t_s = typename Ktraits::scan_t_s;
// Shared memory.
extern __shared__ char smem_[];
// cast to lvalue reference of expected type
// char *smem_loadstorescan = smem_ + 2 * MAX_DSTATE * sizeof(weight_t);
// auto& smem_load = reinterpret_cast<typename BlockLoadT::TempStorage&>(smem_ + 2 * MAX_DSTATE * sizeof(weight_t));
// auto& smem_load = reinterpret_cast<typename BlockLoadT::TempStorage&>(smem_loadstorescan);
auto& smem_load = reinterpret_cast<typename Ktraits::BlockLoadT::TempStorage&>(smem_);
auto& smem_load_weight = reinterpret_cast<typename Ktraits::BlockLoadWeightT::TempStorage&>(smem_);
auto& smem_load_weight1 = *reinterpret_cast<typename Ktraits::BlockLoadWeightT::TempStorage*>(
smem_ + sizeof(typename Ktraits::BlockLoadWeightT::TempStorage));
auto& smem_store = reinterpret_cast<typename Ktraits::BlockStoreT::TempStorage&>(smem_);
auto& smem_scan = *reinterpret_cast<typename Ktraits::BlockScanT::TempStorage*>(smem_ + Ktraits::kSmemIOSize);
// weight_t *smem_a = reinterpret_cast<weight_t *>(smem_ + smem_loadstorescan_size);
// weight_t *smem_bc = reinterpret_cast<weight_t *>(smem_a + MAX_DSTATE);
scan_t* smem_running_prefix = reinterpret_cast<scan_t*>(smem_ + Ktraits::kSmemSize);
const int batch_id = blockIdx.x;
const int dim_id = blockIdx.y;
const int group_id = dim_id / (params.dim_ngroups_ratio);
input_t* u = reinterpret_cast<input_t*>(params.u_ptr) + batch_id * params.u_batch_stride
+ dim_id * kNRows * params.u_d_stride;
input_t* delta = reinterpret_cast<input_t*>(params.delta_ptr) + batch_id * params.delta_batch_stride
+ dim_id * kNRows * params.delta_d_stride;
weight_t* A = reinterpret_cast<weight_t*>(params.A_ptr) + dim_id * kNRows * params.A_d_stride;
weight_t* B = reinterpret_cast<weight_t*>(params.B_ptr) + dim_id * kNRows * params.B_d_stride;
input_t* Bvar = reinterpret_cast<input_t*>(params.B_ptr) + batch_id * params.B_batch_stride
+ group_id * params.B_group_stride;
weight_t* C = reinterpret_cast<weight_t*>(params.C_ptr) + dim_id * kNRows * params.C_d_stride;
input_t* Cvar = reinterpret_cast<input_t*>(params.C_ptr) + batch_id * params.C_batch_stride
+ group_id * params.C_group_stride;
scan_t_s* x = reinterpret_cast<scan_t_s*>(params.x_ptr) + (batch_id * params.dim + dim_id * kNRows) * params.dstate;
float D_val[kNRows] = {0};
if (params.D_ptr != nullptr)
{
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
D_val[r] = reinterpret_cast<float*>(params.D_ptr)[dim_id * kNRows + r];
}
}
float delta_bias[kNRows] = {0};
if (params.delta_bias_ptr != nullptr)
{
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
delta_bias[r] = reinterpret_cast<float*>(params.delta_bias_ptr)[dim_id * kNRows + r];
}
}
// for (int state_idx = threadIdx.x; state_idx < params.dstate; state_idx += blockDim.x) {
// smem_a[state_idx] = A[state_idx * params.A_dstate_stride];
// smem_bc[state_idx] = B[state_idx * params.B_dstate_stride] * C[state_idx * params.C_dstate_stride];
// }
__device__ float toFloat(__half h)
{
return __half2float(h);
}
#ifdef ENABLE_BF16
__device__ float toFloat(__nv_bfloat16 val)
{
return __bfloat162float(val);
}
#endif
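// toFloat and the convertAndStore overloads below are small conversion helpers so the recurrence
// math can run in fp32 regardless of whether input_t is float, half or bfloat16.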
constexpr int kChunkSize = kNThreads * kNItems;
for (int chunk = 0; chunk < params.n_chunks; ++chunk)
__device__ void convertAndStore(float* output, float input)
{
input_t u_vals[kNRows][kNItems], delta_vals_load[kNRows][kNItems];
__syncthreads();
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
if constexpr (!kDirectIO)
{
if (r > 0)
{
__syncthreads();
}
}
load_input<Ktraits>(u + r * params.u_d_stride, u_vals[r], smem_load, params.seqlen - chunk * kChunkSize);
if constexpr (!kDirectIO)
{
__syncthreads();
}
load_input<Ktraits>(
delta + r * params.delta_d_stride, delta_vals_load[r], smem_load, params.seqlen - chunk * kChunkSize);
}
u += kChunkSize;
delta += kChunkSize;
float delta_vals[kNRows][kNItems], delta_u_vals[kNRows][kNItems], out_vals[kNRows][kNItems];
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
#pragma unroll
for (int i = 0; i < kNItems; ++i)
{
float u_val = float(u_vals[r][i]);
delta_vals[r][i] = float(delta_vals_load[r][i]) + delta_bias[r];
if (params.delta_softplus)
{
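// Numerically stable softplus: log1p(exp(x)) for x <= 20; beyond that exp(x) would overflow and
// softplus(x) is approximately x anyway.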
delta_vals[r][i] = delta_vals[r][i] <= 20.f ? log1pf(expf(delta_vals[r][i])) : delta_vals[r][i];
}
delta_u_vals[r][i] = delta_vals[r][i] * u_val;
out_vals[r][i] = D_val[r] * u_val;
}
*output = input;
}
__syncthreads();
for (int state_idx = 0; state_idx < params.dstate; ++state_idx)
__device__ void convertAndStore(__half* output, float input)
{
weight_t A_val[kNRows];
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
A_val[r] = A[state_idx * params.A_dstate_stride + r * params.A_d_stride];
// Multiply the real part of A by LOG2E so we can use exp2f instead of expf.
constexpr float kLog2e = 1.4426950408889634074; // log_2(e) = M_LOG2E
A_val[r] *= kLog2e;
*output = __float2half(input);
}
// This variable holds B * C if both B and C are constant across seqlen. If only B varies
// across seqlen, this holds C. If only C varies across seqlen, this holds B.
// If both B and C vary, this is unused.
weight_t BC_val[kNRows];
weight_t B_vals[kNItems], C_vals[kNItems];
if constexpr (kIsVariableB)
#ifdef ENABLE_BF16
__device__ void convertAndStore(__nv_bfloat16* output, float input)
{
load_weight<Ktraits>(Bvar + state_idx * params.B_dstate_stride, B_vals, smem_load_weight,
params.seqlen - chunk * kChunkSize);
if constexpr (!kIsVariableC)
*output = __float2bfloat16(input);
}
#endif
template <typename input_t, typename weight_t, int DSTATE = 16, int CHANNELS_PER_BLOCK = 128, int STAGES = 12,
int SEQ_UNROLL = 6>
__launch_bounds__(256, 1) __global__ void selective_scan_loop_kernel(SSMParamsBase params)
{
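// Context-phase kernel organized as a producer/consumer software pipeline (cuda::pipeline):
//  - warps with threadIdx.y == 1 asynchronously stage B, C, dt, x and z for SEQ_UNROLL tokens at a
//    time into the STAGES-deep shared-memory buffers;
//  - warps with threadIdx.y == 0 keep the per-channel state in registers and apply the recurrence
//    h = exp(dt * A) * h + dt * B * x,  y = C . h (+ D * x, optionally gated by silu(z)).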
#pragma unroll
for (int r = 0; r < kNRows; ++r)
input_t* output = reinterpret_cast<input_t*>(params.out_ptr);
weight_t* state = reinterpret_cast<weight_t*>(params.x_ptr);
input_t* x = reinterpret_cast<input_t*>(params.u_ptr);
input_t* dt = reinterpret_cast<input_t*>(params.delta_ptr);
weight_t* A = reinterpret_cast<weight_t*>(params.A_ptr);
input_t* B = reinterpret_cast<input_t*>(params.B_ptr);
input_t* C = reinterpret_cast<input_t*>(params.C_ptr);
weight_t* D = reinterpret_cast<weight_t*>(params.D_ptr);
input_t* z = reinterpret_cast<input_t*>(params.z_ptr);
weight_t* dt_bias = reinterpret_cast<weight_t*>(params.delta_bias_ptr);
bool dt_softplus = params.delta_softplus;
int num_tokens = params.seqlen;
int num_channels = params.dim;
// static const int STAGES = 12;
// static const int SEQ_UNROLL = 6;
__shared__ cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_block, STAGES / SEQ_UNROLL> pipeline_state;
auto block = cooperative_groups::this_thread_block();
__shared__ __align__(16) input_t sh_B[STAGES][DSTATE];
__shared__ __align__(16) input_t sh_C[STAGES][DSTATE];
__shared__ __align__(128) input_t sh_dt[STAGES][CHANNELS_PER_BLOCK];
__shared__ input_t sh_x[STAGES][CHANNELS_PER_BLOCK];
__shared__ input_t sh_z[STAGES][CHANNELS_PER_BLOCK];
__shared__ weight_t sh_D[CHANNELS_PER_BLOCK];
__shared__ weight_t sh_dt_bias[CHANNELS_PER_BLOCK];
const int channel = blockIdx.x * blockDim.x + threadIdx.x;
const int sample = blockIdx.y; // batch id
const int seq_loops = (num_tokens + SEQ_UNROLL - 1) / SEQ_UNROLL;
const int input_matrix_row_id = sample * num_tokens;
if (threadIdx.y == 1)
{
BC_val[r] = C[state_idx * params.C_dstate_stride + r * params.C_d_stride];
}
}
}
if constexpr (kIsVariableC)
// Data loading warps
// Bias is independent of token
sh_dt_bias[threadIdx.x] = dt_bias[channel];
// D is independent of token
if (D)
sh_D[threadIdx.x] = D[channel];
cuda::pipeline pipeline = cuda::make_pipeline(block, &pipeline_state, cuda::pipeline_role::producer);
int stage = 0;
for (int si = 0; si < seq_loops; si++)
{
auto& smem_load_weight_C = !kIsVariableB ? smem_load_weight : smem_load_weight1;
load_weight<Ktraits>(Cvar + state_idx * params.C_dstate_stride, C_vals, smem_load_weight_C,
params.seqlen - chunk * kChunkSize);
if constexpr (!kIsVariableB)
{
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
BC_val[r] = B[state_idx * params.B_dstate_stride + r * params.B_d_stride];
}
}
}
if constexpr (!kIsVariableB && !kIsVariableC)
{
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
BC_val[r] = B[state_idx * params.B_dstate_stride + r * params.B_d_stride]
* C[state_idx * params.C_dstate_stride + r * params.C_d_stride];
}
}
pipeline.producer_acquire();
#pragma unroll
for (int r = 0; r < kNRows; ++r)
for (int token_id = si * SEQ_UNROLL; token_id < num_tokens && token_id < (si + 1) * SEQ_UNROLL; token_id++)
{
if (r > 0)
{
__syncthreads();
} // Scan could be using the same smem
scan_t thread_data[kNItems];
#pragma unroll
for (int i = 0; i < kNItems; ++i)
{
thread_data[i] = make_float2(exp2f(delta_vals[r][i] * A_val[r]),
!kIsVariableB ? delta_u_vals[r][i] : B_vals[i] * delta_u_vals[r][i]);
if constexpr (!Ktraits::kIsEvenLen)
{ // So that the last state is correct
if (threadIdx.x * kNItems + i >= params.seqlen - chunk * kChunkSize)
{
thread_data[i] = make_float2(1.f, 0.f);
}
}
}
// Initialize running total
scan_t running_prefix;
// If we use WARP_SCAN then lane 0 of every warp (not just thread 0) needs to read
running_prefix = chunk > 0 && threadIdx.x % 32 == 0 ? smem_running_prefix[state_idx + r * MAX_DSTATE]
: make_float2(1.f, 0.f);
// running_prefix = chunk > 0 && threadIdx.x == 0 ? smem_running_prefix[state_idx] :
// make_float2(1.f, 0.f);
SSMScanPrefixCallbackOp<weight_t> prefix_op(running_prefix);
Ktraits::BlockScanT(smem_scan).InclusiveScan(
thread_data, thread_data, SSMScanOp<weight_t>(), prefix_op);
// There's a syncthreads in the scan op, so we don't need to sync here.
// Unless there's only 1 warp, but then it's the same thread (0) reading and writing.
if (threadIdx.x == 0)
{
smem_running_prefix[state_idx] = prefix_op.running_prefix;
if (chunk == params.n_chunks - 1)
{
x[r * params.dstate + state_idx] = prefix_op.running_prefix.y;
}
}
#pragma unroll
for (int i = 0; i < kNItems; ++i)
{
const weight_t C_val
= !kIsVariableC ? BC_val[r] : (!kIsVariableB ? BC_val[r] * C_vals[i] : C_vals[i]);
out_vals[r][i] += thread_data[i].y * C_val;
}
}
}
input_t* out = reinterpret_cast<input_t*>(params.out_ptr) + batch_id * params.out_batch_stride
+ dim_id * kNRows * params.out_d_stride + chunk * kChunkSize;
if constexpr (kHasZ)
input_t* my_B = &B[input_matrix_row_id * DSTATE + token_id * DSTATE];
input_t* my_C = &C[input_matrix_row_id * DSTATE + token_id * DSTATE];
int block_channel_per_token = blockIdx.x * blockDim.x;
int block_channel
= input_matrix_row_id * num_channels + token_id * num_channels + block_channel_per_token;
if (threadIdx.x < DSTATE)
cuda::memcpy_async(&sh_B[stage][threadIdx.x], &my_B[threadIdx.x], sizeof(input_t), pipeline);
else if (threadIdx.x >= 32 && threadIdx.x < 32 + DSTATE)
cuda::memcpy_async(
&sh_C[stage][threadIdx.x - 32], &my_C[threadIdx.x - 32], sizeof(input_t), pipeline);
if (sizeof(input_t) == 4)
{
input_t* z = reinterpret_cast<input_t*>(params.z_ptr) + batch_id * params.z_batch_stride
+ dim_id * kNRows * params.z_d_stride + chunk * kChunkSize;
#pragma unroll
for (int r = 0; r < kNRows; ++r)
{
input_t z_vals[kNItems];
__syncthreads();
load_input<Ktraits>(z + r * params.z_d_stride, z_vals, smem_load, params.seqlen - chunk * kChunkSize);
#pragma unroll
for (int i = 0; i < kNItems; ++i)
{
float z_val = z_vals[i];
out_vals[r][i] *= z_val / (1 + expf(-z_val));
cuda::memcpy_async(&sh_dt[stage][threadIdx.x],
&dt[input_matrix_row_id * num_channels + token_id * num_channels + channel], sizeof(input_t),
pipeline);
cuda::memcpy_async(&sh_x[stage][threadIdx.x],
&x[input_matrix_row_id * num_channels + token_id * num_channels + channel], sizeof(input_t),
pipeline);
if (z)
cuda::memcpy_async(&sh_z[stage][threadIdx.x],
&z[input_matrix_row_id * num_channels + token_id * num_channels + channel], sizeof(input_t),
pipeline);
}
__syncthreads();
store_output<Ktraits>(
out + r * params.out_d_stride, out_vals[r], smem_store, params.seqlen - chunk * kChunkSize);
else
{
// sh_dt[stage][threadIdx.x] = dt[block_channel + threadIdx.x];
if (threadIdx.x < 32)
{
int tid = threadIdx.x;
float2* block_dt = (float2*) &dt[block_channel];
cuda::memcpy_async((float2*) &sh_dt[stage][tid * 4], &block_dt[tid], sizeof(float2), pipeline);
}
// sh_x[stage][threadIdx.x] = x[block_channel + threadIdx.x];
else if (threadIdx.x < 64)
{
int tid = threadIdx.x - 32;
float2* block_x = (float2*) &x[block_channel];
cuda::memcpy_async((float2*) &sh_x[stage][tid * 4], &block_x[tid], sizeof(float2), pipeline);
}
// sh_z[stage][threadIdx.x] = z[block_channel + threadIdx.x];
else if (threadIdx.x < 96)
{
int tid = threadIdx.x - 64;
if (z)
{
float2* block_z = (float2*) &z[block_channel];
cuda::memcpy_async(
(float2*) &sh_z[stage][tid * 4], &block_z[tid], sizeof(float2), pipeline);
}
}
else
{
__syncthreads();
}
}
stage++;
if (stage >= STAGES)
stage = 0;
}
pipeline.producer_commit();
}
}
else
{
// Compute warps
// Load state and A matrix into registers
float state_reg[DSTATE];
float A_reg[DSTATE];
for (int i = 0; i < DSTATE; i++)
{
// state_reg[i] = toFloat(state[sample*num_channels*DSTATE + i*num_channels + channel]);
state_reg[i] = 0.f;
A_reg[i] = toFloat(A[i * num_channels + channel]);
}
cuda::pipeline pipeline = cuda::make_pipeline(block, &pipeline_state, cuda::pipeline_role::consumer);
int stage = 0;
for (int si = 0; si < seq_loops; si++)
{
pipeline.consumer_wait();
#pragma unroll
for (int r = 0; r < kNRows; ++r)
for (int token_id = si * SEQ_UNROLL; token_id < num_tokens && token_id < (si + 1) * SEQ_UNROLL; token_id++)
{
if constexpr (!kDirectIO)
float dt_b = toFloat(sh_dt[stage][threadIdx.x]) + toFloat(sh_dt_bias[threadIdx.x]);
float dt_b_sp;
if (dt_softplus)
{
if (r > 0)
dt_b_sp = dt_b <= 20.f ? log1pf(__expf(dt_b)) : dt_b; // softplus
}
float my_x = toFloat(sh_x[stage][threadIdx.x]);
float Dx = my_x * (D ? toFloat(sh_D[threadIdx.x]) : 0.f);
float dtx = dt_b_sp * my_x;
float my_z = z ? toFloat(sh_z[stage][threadIdx.x]) : 0.f;
float out = Dx;
if (sizeof(input_t) == 4)
{
__syncthreads();
float4* B4 = (float4*) &sh_B[stage][0];
float4* C4 = (float4*) &sh_C[stage][0];
#pragma unroll
for (int i = 0; i < DSTATE / 4; i++)
{
float4 Bi4 = B4[i];
float4 Ci4 = C4[i];
float* Bi = (float*) &Bi4;
float* Ci = (float*) &Ci4;
#pragma unroll
for (int j = 0; j < 4; j++)
{
float dtA = A_reg[i * 4 + j] * dt_b_sp;
float dA = __expf(dtA);
float sdA = state_reg[i * 4 + j] * dA;
float dBx = Bi[j] * dtx;
float newState = sdA + dBx;
state_reg[i * 4 + j] = newState;
out += newState * Ci[j];
}
}
store_output<Ktraits>(
out + r * params.out_d_stride, out_vals[r], smem_store, params.seqlen - chunk * kChunkSize);
}
else
{
float4* B8 = (float4*) &sh_B[stage][0];
float4* C8 = (float4*) &sh_C[stage][0];
#pragma unroll
for (int i = 0; i < DSTATE / 8; i++)
{
input_t* Bi = (input_t*) (&B8[i]);
input_t* Ci = (input_t*) (&C8[i]);
#pragma unroll
for (int j = 0; j < 8; j++)
{
float dtA = A_reg[i * 8 + j] * dt_b_sp;
float dA = __expf(dtA);
float sdA = state_reg[i * 8 + j] * dA;
float dBx = toFloat(Bi[j]) * dtx;
float newState = sdA + dBx;
state_reg[i * 8 + j] = newState;
out += newState * toFloat(Ci[j]);
}
}
}
Bvar += kChunkSize;
Cvar += kChunkSize;
}
if (z)
{
float enz = __expf(0.f - my_z);
enz += 1.0;
float sig_z = 1.0 / enz;
float silu_z = my_z * sig_z;
out *= silu_z;
}
input_t* my_output = &output[input_matrix_row_id * num_channels + token_id * num_channels];
convertAndStore(&my_output[channel], out);
template <int kNThreads, int kNItems, typename input_t, typename weight_t>
void selective_scan_fwd_launch(SSMParamsBase& params, cudaStream_t stream)
{
// Only kNRows == 1 is tested for now, which of course doesn't differ from before, when each block
// processed 1 row.
static constexpr int kNRows = 1;
BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen,
[&]
{
BOOL_SWITCH(params.is_variable_B, kIsVariableB,
[&]
{
BOOL_SWITCH(params.is_variable_C, kIsVariableC,
[&]
{
BOOL_SWITCH(params.z_ptr != nullptr, kHasZ,
[&]
{
using Ktraits = Selective_Scan_fwd_kernel_traits<kNThreads, kNItems, kNRows,
kIsEvenLen, kIsVariableB, kIsVariableC, kHasZ, input_t, weight_t>;
// constexpr int kSmemSize = Ktraits::kSmemSize;
constexpr int kSmemSize
= Ktraits::kSmemSize + kNRows * MAX_DSTATE * sizeof(typename Ktraits::scan_t);
// printf("smem_size = %d\n", kSmemSize);
dim3 grid(params.batch, params.dim / kNRows);
auto kernel = &selective_scan_fwd_kernel<Ktraits>;
if (kSmemSize >= 48 * 1024)
{
TLLM_CUDA_CHECK(cudaFuncSetAttribute(
kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
stage++;
if (stage >= STAGES)
stage = 0;
}
pipeline.consumer_release();
}
// Write the new state back out to the cache
for (int i = 0; i < DSTATE; i++)
{
weight_t* my_state = &state[sample * num_channels * DSTATE];
int offset = i * num_channels + channel;
convertAndStore(&my_state[offset], state_reg[i]);
}
}
kernel<<<grid, Ktraits::kNThreads, kSmemSize, stream>>>(params);
});
});
});
});
}
template <typename input_t, typename weight_t>
void invokeSelectiveScan(SSMParamsBase& params, cudaStream_t stream)
{
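// Each block handles kNThreads * kNItems elements per chunk, so pick the smallest configuration
// whose chunk covers the sequence length (128 / 256 / 512 / 1024 elements, else 2048 per chunk).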
if (params.seqlen <= 128)
{
selective_scan_fwd_launch<32, 4, input_t, weight_t>(params, stream);
}
else if (params.seqlen <= 256)
{
selective_scan_fwd_launch<32, 8, input_t, weight_t>(params, stream);
}
else if (params.seqlen <= 512)
{
selective_scan_fwd_launch<32, 16, input_t, weight_t>(params, stream);
}
else if (params.seqlen <= 1024)
{
selective_scan_fwd_launch<64, 16, input_t, weight_t>(params, stream);
}
else
{
selective_scan_fwd_launch<128, 16, input_t, weight_t>(params, stream);
}
int samples = params.batch;
int channels = params.dim;
const int threads = 128;
const int blocks = (channels + threads - 1) / threads;
dim3 block(threads, 2);
dim3 grid(blocks, samples);
TLLM_CHECK((channels % block.x) == 0);
TLLM_CHECK(params.is_variable_B);
TLLM_CHECK(params.is_variable_C);
TLLM_CHECK(params.dstate == 16);
selective_scan_loop_kernel<input_t, weight_t><<<grid, block, 0, stream>>>(params);
}
#define INSTANTIATE_SELECTIVE_SCAN_DATA_TYPE(input_t, weight_t) \
@ -434,126 +338,101 @@ INSTANTIATE_SELECTIVE_SCAN_DATA_TYPE(__nv_bfloat16, float);
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename input_t, typename weight_t, bool dt_softplus, bool has_dt_bias, bool has_d, bool has_z>
__global__ void selectiveScanUpdate(SSMParamsBase params)
template <typename input_t, typename weight_t, int DSTATE = 16, int CHANNELS_PER_BLOCK = 128>
__launch_bounds__(128, 2) __global__ void selective_scan_update_kernel(SSMParamsBase params)
{
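// Generation-phase (single token) update: one thread per channel keeps its DSTATE-element state in
// registers, advances it one step as h = exp(dt * A) * h + dt * B * x, writes it back to the state
// cache, and produces y = C . h + D * x, optionally gated by silu(z).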
// Shared memory.
extern __shared__ char smem_[];
input_t* smem_b = reinterpret_cast<input_t*>(smem_);
input_t* smem_c = reinterpret_cast<input_t*>(smem_ + sizeof(input_t) * params.dstate);
input_t* output = reinterpret_cast<input_t*>(params.out_ptr);
weight_t* state = reinterpret_cast<weight_t*>(params.x_ptr);
input_t* x = reinterpret_cast<input_t*>(params.u_ptr);
input_t* dt = reinterpret_cast<input_t*>(params.delta_ptr);
weight_t* A = reinterpret_cast<weight_t*>(params.A_ptr);
input_t* B = reinterpret_cast<input_t*>(params.B_ptr);
input_t* C = reinterpret_cast<input_t*>(params.C_ptr);
weight_t* D = reinterpret_cast<weight_t*>(params.D_ptr);
input_t* z = reinterpret_cast<input_t*>(params.z_ptr);
weight_t* dt_bias = reinterpret_cast<weight_t*>(params.delta_bias_ptr);
bool dt_softplus = params.delta_softplus;
int num_channels = params.dim;
const int batch_id = blockIdx.x;
const int dim_id = blockIdx.y * blockDim.x + threadIdx.x;
const int channel = blockIdx.x * blockDim.x + threadIdx.x;
if (channel >= num_channels)
return;
const int sample = blockIdx.y;
const input_t x = reinterpret_cast<const input_t*>(params.u_ptr)[batch_id * params.u_batch_stride + dim_id];
const weight_t* A = reinterpret_cast<const weight_t*>(params.A_ptr) + dim_id * params.A_d_stride;
const input_t* B = reinterpret_cast<const input_t*>(params.B_ptr) + batch_id * params.B_batch_stride;
const input_t* C = reinterpret_cast<const input_t*>(params.C_ptr) + batch_id * params.C_batch_stride;
const float* D_ptr = reinterpret_cast<const float*>(params.D_ptr);
const input_t* z_ptr = reinterpret_cast<const input_t*>(params.z_ptr);
weight_t* state = reinterpret_cast<weight_t*>(params.x_ptr) + batch_id * params.state_batch_stride
+ dim_id * params.state_d_stride;
const input_t dt
= reinterpret_cast<const input_t*>(params.delta_ptr)[batch_id * params.delta_batch_stride + dim_id];
const float* dt_bias_ptr = reinterpret_cast<const float*>(params.delta_bias_ptr);
input_t* out = reinterpret_cast<input_t*>(params.out_ptr) + batch_id * params.out_batch_stride;
float out_tmp = 0.0f;
weight_t* my_state = &state[sample * num_channels * DSTATE];
input_t* my_output = &output[sample * num_channels];
// get delta bias
float dt_bias = 0.0f;
if (has_dt_bias)
float rA[DSTATE];
float rB[DSTATE];
float rC[DSTATE];
float rState[DSTATE];
#pragma unroll
for (int i = 0; i < DSTATE; i++)
{
dt_bias = dt_bias_ptr[dim_id];
rA[i] = toFloat(A[i * num_channels + channel]);
rB[i] = toFloat(B[sample * DSTATE + i]);
rC[i] = toFloat(C[sample * DSTATE + i]);
rState[i] = toFloat(my_state[i * num_channels + channel]);
}
// get D
float D = 0.0f;
if (has_d)
{
D = D_ptr[dim_id];
}
float my_x, my_dt, my_z, my_dt_bias, my_D;
my_x = toFloat(x[sample * num_channels + channel]);
my_dt = toFloat(dt[sample * num_channels + channel]);
my_z = z ? toFloat(z[sample * num_channels + channel]) : 0.f;
my_dt_bias = dt_bias ? toFloat(dt_bias[channel]) : 0.f;
my_D = D ? toFloat(D[channel]) : 0.f;
// dt = softplus(dt + dt_bias)
float dt_val = float(dt) + dt_bias;
float dt_b = my_dt + my_dt_bias;
float dt_b_sp;
if (dt_softplus)
{
dt_val = dt_val <= 20.f ? log1pf(expf(dt_val)) : dt_val;
dt_b_sp = dt_b <= 20.f ? logf(1.f + expf(dt_b)) : dt_b; // softplus
}
out_tmp = D * float(x);
float out = 0.f;
// read B, C
if (threadIdx.x == 0)
{
#pragma unroll
for (int i = 0; i < params.dstate; ++i)
for (int i = 0; i < DSTATE; i++)
{
smem_b[i] = B[i];
smem_c[i] = C[i];
float dA = expf(rA[i] * dt_b_sp);
float dB = rB[i] * dt_b_sp;
float sdA = rState[i] * dA;
float dBx = dB * my_x;
float newState = sdA + dBx;
convertAndStore(&my_state[i * num_channels + channel], newState); // Write the new state back out to the cache
out += newState * rC[i];
}
}
__syncthreads();
for (int state_idx = 0; state_idx < params.dstate; ++state_idx)
if (D)
out += my_D * my_x;
if (z)
{
// read A
weight_t A_val = A[state_idx];
// Multiply the real part of A by LOG2E so we can use exp2f instead of expf.
constexpr float kLog2e = 1.4426950408889634074; // log_2(e) = M_LOG2E
A_val *= kLog2e;
// dtA = exp(dt * A), dtB = dt * B
float dt_A = exp2f(dt_val * A_val);
float dt_B = dt_val * float(smem_b[state_idx]);
// update state
float state_new = float(state[state_idx]) * dt_A + float(x) * dt_B;
state[state_idx] = weight_t(state_new);
// y = C * state + D * x
out_tmp += state_new * float(smem_c[state_idx]);
float sig_z = 1.0f / (1.0f + expf(0.f - my_z));
float silu_z = my_z * sig_z;
out *= silu_z;
}
// y = y * silu(z)
if (has_z)
{
float z = z_ptr[batch_id * params.z_batch_stride + dim_id];
out_tmp *= z / (1 + expf(-z));
}
// save out
out[dim_id] = input_t(out_tmp);
convertAndStore(&my_output[channel], out);
}
template <typename input_t, typename weight_t>
void invokeSelectiveScanUpdate(SSMParamsBase& params, cudaStream_t stream)
{
const int kNThreads = 32;
dim3 block(kNThreads);
dim3 grid(params.batch, (params.dim + kNThreads - 1) / kNThreads);
// only save B and C to shared mem for reuse
size_t smem_size = params.dstate * sizeof(input_t) * 2;
int samples = params.batch;
int channels = params.dim;
BOOL_SWITCH(params.delta_softplus, kDtSoftplus,
[&]
{
BOOL_SWITCH(params.delta_bias_ptr != nullptr, kHasDtBias,
[&]
{
BOOL_SWITCH(params.D_ptr != nullptr, kHasD,
[&]
{
BOOL_SWITCH(params.z_ptr != nullptr, kHasZ,
[&]
{
selectiveScanUpdate<input_t, weight_t, kDtSoftplus, kHasDtBias, kHasD, kHasZ>
<<<grid, block, smem_size, stream>>>(params);
});
});
});
});
const int threads = 128;
const int blocks = (channels + threads - 1) / threads;
dim3 block(threads, 1);
dim3 grid(blocks, samples);
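// One thread per channel: blocks of 128 threads, grid = (ceil(dim / 128), batch).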
TLLM_CHECK(params.is_variable_B);
TLLM_CHECK(params.is_variable_C);
TLLM_CHECK(params.dstate == 16);
selective_scan_update_kernel<input_t, weight_t><<<grid, block, 0, stream>>>(params);
}
#define INSTANTIATE_SELECTIVE_SCAN_UPDATE_DATA_TYPE(input_t, weight_t) \

View File

@ -30,6 +30,7 @@
#pragma once
#include "tensorrt_llm/common/assert.h"
#include "tensorrt_llm/common/cudaUtils.h"
namespace tensorrt_llm
@ -41,34 +42,12 @@ struct SSMParamsBase
{
using index_t = uint32_t;
int batch, dim, seqlen, dstate, n_groups, n_chunks;
int dim_ngroups_ratio;
int batch, dim, seqlen, dstate;
bool is_variable_B;
bool is_variable_C;
bool delta_softplus;
index_t A_d_stride;
index_t A_dstate_stride;
index_t B_batch_stride;
index_t B_d_stride;
index_t B_dstate_stride;
index_t B_group_stride;
index_t C_batch_stride;
index_t C_d_stride;
index_t C_dstate_stride;
index_t C_group_stride;
index_t u_batch_stride;
index_t u_d_stride;
index_t delta_batch_stride;
index_t delta_d_stride;
index_t z_batch_stride;
index_t z_d_stride;
index_t out_batch_stride;
index_t out_d_stride;
index_t state_batch_stride;
index_t state_d_stride;
// Common data pointers.
void* __restrict__ A_ptr;
void* __restrict__ B_ptr;

View File

@ -1,284 +0,0 @@
/*
* Adapted from https://github.com/state-spaces/mamba/blob/main/csrc/selective_scan/selective_scan_common.h
* Copyright (c) 2023, Tri Dao.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* Not a contribution
* Changes made by NVIDIA CORPORATION & AFFILIATES or otherwise documented as
* NVIDIA-proprietary are not a contribution and subject to the following terms and conditions:
* SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: LicenseRef-NvidiaProprietary
*
* NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
* property and proprietary rights in and to this material, related
* documentation and any modifications thereto. Any use, reproduction,
* disclosure or distribution of this material and related documentation
* without an express license agreement from NVIDIA CORPORATION or
* its affiliates is strictly prohibited.
*/
#pragma once
#include <cuda_bf16.h>
#include <cuda_fp16.h>
namespace tensorrt_llm
{
namespace kernels
{
#define MAX_DSTATE 256
inline __device__ float2 operator+(const float2& a, const float2& b)
{
return {a.x + b.x, a.y + b.y};
}
inline __device__ float3 operator+(const float3& a, const float3& b)
{
return {a.x + b.x, a.y + b.y, a.z + b.z};
}
inline __device__ float4 operator+(const float4& a, const float4& b)
{
return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}
////////////////////////////////////////////////////////////////////////////////////////////////////
// Inspired by https://github.com/NVIDIA/DALI/blob/main/include/dali/core/static_switch.h
// and https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Dispatch.h
/// @param COND - a boolean expression to switch by
/// @param CONST_NAME - a name given for the constexpr bool variable.
/// @param ... - code to execute for true and false
///
/// Usage:
/// ```
/// BOOL_SWITCH(flag, BoolConst, [&] {
/// some_function<BoolConst>(...);
/// });
/// ```
#define BOOL_SWITCH(COND, CONST_NAME, ...) \
[&] \
{ \
if (COND) \
{ \
static constexpr bool CONST_NAME = true; \
return __VA_ARGS__(); \
} \
else \
{ \
static constexpr bool CONST_NAME = false; \
return __VA_ARGS__(); \
} \
}()
////////////////////////////////////////////////////////////////////////////////////////////////////
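// BytesToType maps a byte count to an unsigned integer type of exactly that size; it is used as the
// element type for vectorized (direct) global-memory loads and stores.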
template <int BYTES>
struct BytesToType
{
};
template <>
struct BytesToType<16>
{
using Type = uint4;
static_assert(sizeof(Type) == 16);
};
template <>
struct BytesToType<8>
{
using Type = uint64_t;
static_assert(sizeof(Type) == 8);
};
template <>
struct BytesToType<4>
{
using Type = uint32_t;
static_assert(sizeof(Type) == 4);
};
template <>
struct BytesToType<2>
{
using Type = uint16_t;
static_assert(sizeof(Type) == 2);
};
template <>
struct BytesToType<1>
{
using Type = uint8_t;
static_assert(sizeof(Type) == 1);
};
////////////////////////////////////////////////////////////////////////////////////////////////////
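// Converter widens an N-element register array of input_t to float, using packed half2 / bfloat162
// conversions where the hardware supports them.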
template <typename scalar_t, int N>
struct Converter
{
static inline __device__ void to_float(const scalar_t (&src)[N], float (&dst)[N])
{
#pragma unroll
for (int i = 0; i < N; ++i)
{
dst[i] = src[i];
}
}
};
template <int N>
struct Converter<half, N>
{
static inline __device__ void to_float(const half (&src)[N], float (&dst)[N])
{
static_assert(N % 2 == 0);
auto& src2 = reinterpret_cast<const half2(&)[N / 2]>(src);
auto& dst2 = reinterpret_cast<float2(&)[N / 2]>(dst);
#pragma unroll
for (int i = 0; i < N / 2; ++i)
{
dst2[i] = __half22float2(src2[i]);
}
}
};
#if __CUDA_ARCH__ >= 800
template <int N>
struct Converter<__nv_bfloat16, N>
{
static inline __device__ void to_float(const __nv_bfloat16 (&src)[N], float (&dst)[N])
{
static_assert(N % 2 == 0);
auto& src2 = reinterpret_cast<const nv_bfloat162(&)[N / 2]>(src);
auto& dst2 = reinterpret_cast<float2(&)[N / 2]>(dst);
#pragma unroll
for (int i = 0; i < N / 2; ++i)
{
dst2[i] = __bfloat1622float2(src2[i]);
}
}
};
#endif
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename scalar_t>
struct SSMScanOp;
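// Associative combine for the first-order linear recurrence h_t = a_t * h_{t-1} + b_t:
// composing (a0, b0) with (a1, b1) yields (a1 * a0, a1 * b0 + b1), which lets CUB's block scan
// evaluate the whole recurrence in parallel.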
template <>
struct SSMScanOp<float>
{
__device__ __forceinline__ float2 operator()(const float2& ab0, const float2& ab1) const
{
return make_float2(ab1.x * ab0.x, ab1.x * ab0.y + ab1.y);
}
};
// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
template <typename scalar_t>
struct SSMScanPrefixCallbackOp
{
using scan_t = std::conditional_t<std::is_same_v<scalar_t, float>, float2, float4>;
scan_t running_prefix;
// Constructor
__device__ SSMScanPrefixCallbackOp(scan_t running_prefix_)
: running_prefix(running_prefix_)
{
}
// Callback operator to be entered by the first warp of threads in the block.
// Thread-0 is responsible for returning a value for seeding the block-wide scan.
__device__ scan_t operator()(scan_t block_aggregate)
{
scan_t old_prefix = running_prefix;
running_prefix = SSMScanOp<scalar_t>()(running_prefix, block_aggregate);
return old_prefix;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
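// Cooperative block-wide load/store helpers: when the tile is fully in range (kIsEvenLen) they use
// vectorized direct loads/stores of vec_t; otherwise they fall back to a guarded element-wise path
// padded with zeros.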
template <typename Ktraits>
inline __device__ void load_input(typename Ktraits::input_t* u, typename Ktraits::input_t (&u_vals)[Ktraits::kNItems],
typename Ktraits::BlockLoadT::TempStorage& smem_load, int seqlen)
{
if constexpr (Ktraits::kIsEvenLen)
{
auto& smem_load_vec = reinterpret_cast<typename Ktraits::BlockLoadVecT::TempStorage&>(smem_load);
using vec_t = typename Ktraits::vec_t;
Ktraits::BlockLoadVecT(smem_load_vec)
.Load(reinterpret_cast<vec_t*>(u), reinterpret_cast<vec_t(&)[Ktraits::kNLoads]>(u_vals));
}
else
{
Ktraits::BlockLoadT(smem_load).Load(u, u_vals, seqlen, 0.f);
}
}
template <typename Ktraits>
inline __device__ void load_weight(typename Ktraits::input_t* Bvar,
typename Ktraits::weight_t (&B_vals)[Ktraits::kNItems],
typename Ktraits::BlockLoadWeightT::TempStorage& smem_load_weight, int seqlen)
{
constexpr int kNItems = Ktraits::kNItems;
typename Ktraits::input_t B_vals_load[kNItems];
if constexpr (Ktraits::kIsEvenLen)
{
auto& smem_load_weight_vec
= reinterpret_cast<typename Ktraits::BlockLoadWeightVecT::TempStorage&>(smem_load_weight);
using vec_t = typename Ktraits::vec_t;
Ktraits::BlockLoadWeightVecT(smem_load_weight_vec)
.Load(reinterpret_cast<vec_t*>(Bvar), reinterpret_cast<vec_t(&)[Ktraits::kNLoads]>(B_vals_load));
}
else
{
Ktraits::BlockLoadWeightT(smem_load_weight).Load(Bvar, B_vals_load, seqlen, 0.f);
}
// #pragma unroll
// for (int i = 0; i < kNItems; ++i) { B_vals[i] = B_vals_load[i]; }
Converter<typename Ktraits::input_t, kNItems>::to_float(B_vals_load, B_vals);
}
template <typename Ktraits>
inline __device__ void store_output(typename Ktraits::input_t* out, const float (&out_vals)[Ktraits::kNItems],
typename Ktraits::BlockStoreT::TempStorage& smem_store, int seqlen)
{
typename Ktraits::input_t write_vals[Ktraits::kNItems];
#pragma unroll
for (int i = 0; i < Ktraits::kNItems; ++i)
{
write_vals[i] = out_vals[i];
}
if constexpr (Ktraits::kIsEvenLen)
{
auto& smem_store_vec = reinterpret_cast<typename Ktraits::BlockStoreVecT::TempStorage&>(smem_store);
using vec_t = typename Ktraits::vec_t;
Ktraits::BlockStoreVecT(smem_store_vec)
.Store(reinterpret_cast<vec_t*>(out), reinterpret_cast<vec_t(&)[Ktraits::kNLoads]>(write_vals));
}
else
{
Ktraits::BlockStoreT(smem_store).Store(out, write_vals, seqlen);
}
}
} // namespace kernels
} // namespace tensorrt_llm

View File

@ -86,7 +86,7 @@ struct WeightOnlyDetails<ActType, WeightOnlyQuantType::Int4b>
// weight 0 1 8 9 16 17 24 25 2 3 10 11 18 19 26 27 4 5 12 13 20 21 28 29 6 7 14 15 22 23 30 31
static constexpr int kShuffleSize = 32;
static constexpr int kShuffleBasicTile = 2;
static constexpr int kShuffleContinous = 4;
static constexpr int kShuffleContinuous = 4;
static constexpr int kShuffleStrided = 4;
// Each warp completes the internal reduce and writes the [Batch * NPerBlock * Interleave] results to the
@ -136,7 +136,7 @@ struct WeightOnlyDetails<ActType, WeightOnlyQuantType::Int8b>
// weight 0 1 8 9 2 3 10 11 4 5 12 13 6 7 14 15
static constexpr int kShuffleSize = 16;
static constexpr int kShuffleBasicTile = 2;
static constexpr int kShuffleContinous = 2;
static constexpr int kShuffleContinuous = 2;
static constexpr int kShuffleStrided = 4;
// Each warp completes the internal reduce and writes the [Batch * NPerBlock * Interleave] results to the
@ -177,7 +177,7 @@ struct WeightOnlyKernelDetails
static constexpr int kShuffleSize = Layout::kShuffleSize;
static constexpr int kShuffleBasicTile = Layout::kShuffleBasicTile;
static constexpr int kShuffleContinous = Layout::kShuffleContinous;
static constexpr int kShuffleContinuous = Layout::kShuffleContinuous;
static constexpr int kShuffleStrided = Layout::kShuffleStrided;
// The rearrangement here counteracts the effect of cutlass::add_bias_and_interleave_int4/8s_inplace
@ -352,7 +352,7 @@ __device__ void weight_only_batched_gemv(const uint8_t* qweight, const ActType*
weights_quantized + i * Details::kConvertCount / Details::kElemsPerByte)));
}
#pragma unroll
for (int i = 0; i < Details::kShuffleContinous; ++i)
for (int i = 0; i < Details::kShuffleContinuous; ++i)
{
#pragma unroll
for (int j = 0; j < Details::kShuffleStrided; ++j)
@ -360,7 +360,7 @@ __device__ void weight_only_batched_gemv(const uint8_t* qweight, const ActType*
// Dequantize the weights and arrange the shuffled elements back to the correct order in the
// register array
ActType2 v = *reinterpret_cast<ActType2*>(weights_vec + i * Details::kShuffleBasicTile
+ j * Details::kShuffleContinous * Details::kShuffleBasicTile);
+ j * Details::kShuffleContinuous * Details::kShuffleBasicTile);
v = __hfma2(
v, ActTypeDetails<ActType>::to_vec2(scale[idx]), ActTypeDetails<ActType>::to_vec2(zero[idx]));
weights_f16[(i * Details::kShuffleStrided * Details::kShuffleBasicTile

View File

@ -211,6 +211,7 @@ std::optional<Config> GemmPluginProfiler<Config, RunnerPtr, GemmIdType, GemmIdHa
<< " m=" << m << ", n=" << n << ", k=" << k << ")"
<< ", reason: \"" << e.what() << "\". Skipped";
TLLM_LOG_TRACE(msg.str());
cudaGetLastError(); // Reset the last cudaError to cudaSuccess.
continue;
}

View File

@ -823,6 +823,9 @@ int GPTAttentionPluginCommon::enqueueContext(const EnqueueContextParams<T, KVCac
if (mEnableContextFMHA)
{
const bool enablePagedKVContextFMHA = mPagedKVCache && mPagedContextFMHA;
// Paged Context FMHA doesn't work with fp8/int8 kv cache currently.
TLLM_CHECK_WITH_INFO(cache_type == KvCacheDataType::BASE || !enablePagedKVContextFMHA,
"Paged Context FMHA doesn't work with fp8/int8 kv cache currently.");
invokeApplyBiasRopeUpdateKVCache(const_cast<T*>(params.attention_input), q_buf_2_, kv_cache_buffer,
const_cast<T*>(params.qkv_bias), params.q_seq_lengths, params.kv_seq_lengths,
mRemovePadding ? padding_offset : nullptr, params.batch_size, params.input_seq_length,

View File

@ -69,8 +69,8 @@ nvinfer1::IPluginV2DynamicExt* SelectiveScanPlugin::clone() const noexcept
}
// Outputs
// output_tensor: [batch_size, dim, seq_len]
// state: [batch_size, dim, dstate]
// output_tensor: [batch_size, seq_len, dim]
// state: [batch_size, dstate, dim]
nvinfer1::DimsExprs SelectiveScanPlugin::getOutputDimensions(
int outputIndex, const nvinfer1::DimsExprs* inputs, int nbInputs, nvinfer1::IExprBuilder& exprBuilder) noexcept
{
@ -110,11 +110,9 @@ size_t SelectiveScanPlugin::getWorkspaceSize(const nvinfer1::PluginTensorDesc* i
}
void SelectiveScanPlugin::setSSMParams(SSMParamsBase& params, const size_t batch, const size_t dim, const size_t seqLen,
const size_t dstate, const size_t nChunks, const bool isVariableB, const bool isVariableC, void* statePtr,
const void* x, const void* delta, const void* deltaBias, const void* A, const void* B, const void* C, const void* D,
const void* z, void* out, const size_t strideXBatch, const size_t strideDtBatch, const size_t strideADim,
const size_t strideBBatch, const size_t strideCBatch, const size_t strideZBatch, const size_t strideOutBatch,
const size_t strideStateBatch, const size_t strideStateDim, bool deltaSoftplus)
const size_t dstate, const bool isVariableB, const bool isVariableC, void* statePtr, const void* x,
const void* delta, const void* deltaBias, const void* A, const void* B, const void* C, const void* D, const void* z,
void* out, bool deltaSoftplus)
{
// Reset the parameters
memset(&params, 0, sizeof(params));
@ -123,9 +121,6 @@ void SelectiveScanPlugin::setSSMParams(SSMParamsBase& params, const size_t batch
params.dim = dim;
params.seqlen = seqLen;
params.dstate = dstate;
params.n_groups = 1;
params.n_chunks = nChunks;
params.dim_ngroups_ratio = dim;
params.delta_softplus = deltaSoftplus;
@ -143,39 +138,6 @@ void SelectiveScanPlugin::setSSMParams(SSMParamsBase& params, const size_t batch
params.out_ptr = out;
params.x_ptr = statePtr;
params.z_ptr = const_cast<void*>(z);
// All stride are in elements, not bytes.
params.A_d_stride = strideADim;
params.A_dstate_stride = 1;
if (!isVariableB)
{
params.B_d_stride = dim * dstate;
}
else
{
params.B_batch_stride = strideBBatch;
params.B_group_stride = strideBBatch;
}
params.B_dstate_stride = !isVariableB ? dstate : seqLen;
if (!isVariableC)
{
params.C_d_stride = dim * dstate;
}
else
{
params.C_batch_stride = strideCBatch;
params.C_group_stride = strideCBatch;
}
params.C_dstate_stride = !isVariableC ? dstate : seqLen;
params.u_batch_stride = strideXBatch;
params.u_d_stride = seqLen;
params.delta_batch_stride = strideDtBatch;
params.delta_d_stride = seqLen;
params.z_batch_stride = strideZBatch;
params.z_d_stride = seqLen;
params.out_batch_stride = strideOutBatch;
params.out_d_stride = seqLen;
params.state_batch_stride = strideStateBatch;
params.state_d_stride = strideStateDim;
}
template <typename T>
@ -184,41 +146,31 @@ int SelectiveScanPlugin::enqueueImpl(const nvinfer1::PluginTensorDesc* inputDesc
cudaStream_t stream)
{
// inputs
// 0. input_tensor [batch_size, dim, seq_len]
// 1. state [batch_size, dim, dstate]
// 2. delta [batch_size, dim, seq_len]
// 0. input_tensor [batch_size, seq_len, dim]
// 1. state [batch_size, dstate, dim]
// 2. delta [batch_size, seq_len, dim]
// 3. delta_bias [dim]
// 4. A [dim, dstate]
// 5. B [batch_size, dstate, seq_len]
// 6. C [batch_size, dstate, seq_len]
// 4. A [dstate, dim]
// 5. B [batch_size, seq_len, dstate]
// 6. C [batch_size, seq_len, dstate]
// 7. D [dim]
// 8. z [batch_size, dim, seq_len]
// 8. z [batch_size, seq_len, dim]
// 9. host_request_types [batch_size] int32. 0: context; 1: generation.
// outputs
// 0. output_tensor [batch_size, dim, seq_len]
// 1. state [batch_size, dim, dstate]
// 0. output_tensor [batch_size, seq_len, dim]
// 1. state [batch_size, dstate, dim]
auto const batch_size = inputDesc[getInputTensorIdx()].dims.d[0];
auto const seq_len = inputDesc[getInputTensorIdx()].dims.d[2];
auto const stride_state_batch = mDim * mDState;
auto const stride_state_dim = mDState;
auto const stride_x_batch = mDim * seq_len;
auto const stride_dt_batch = mDim * seq_len;
auto const stride_A_dim = mDState;
auto const stride_B_batch = mDState * seq_len;
auto const stride_C_batch = mDState * seq_len;
auto const stride_z_batch = mDim * seq_len;
auto const stride_out_batch = mDim * seq_len;
auto const seq_len = inputDesc[getInputTensorIdx()].dims.d[1];
// Only context or generation is supported per call, not a mix of both.
RequestType const* reqTypes = static_cast<RequestType const*>(inputs[getHostRequestTypesIdx()]);
auto const n_chunks = (seq_len + 2048 - 1) / 2048;
SSMParamsBase ssm_params;
setSSMParams(ssm_params, batch_size, mDim, seq_len, mDState, n_chunks, mIsVariableB, mIsVariableC, outputs[1],
setSSMParams(ssm_params, batch_size, mDim, seq_len, mDState, mIsVariableB, mIsVariableC, outputs[1],
inputs[getInputTensorIdx()], inputs[getDeltaIdx()], inputs[getDeltaBiasIdx()], inputs[getAIdx()],
inputs[getBIdx()], inputs[getCIdx()], inputs[getDIdx()], inputs[getZIdx()], outputs[0], stride_x_batch,
stride_dt_batch, stride_A_dim, stride_B_batch, stride_C_batch, stride_z_batch, stride_out_batch,
stride_state_batch, stride_state_dim, mDeltaSoftplus);
inputs[getBIdx()], inputs[getCIdx()], inputs[getDIdx()], inputs[getZIdx()], outputs[0], mDeltaSoftplus);
if (reqTypes[0] == RequestType::kCONTEXT)
{
@ -321,9 +273,9 @@ SelectiveScanPluginCreator::SelectiveScanPluginCreator()
mPluginAttributes.clear();
mPluginAttributes.emplace_back(PluginField("dim", nullptr, PluginFieldType::kINT32, 16));
mPluginAttributes.emplace_back(PluginField("dstate", nullptr, PluginFieldType::kINT32, 16));
mPluginAttributes.emplace_back(PluginField("is_variable_B", nullptr, PluginFieldType::kINT32, 1));
mPluginAttributes.emplace_back(PluginField("is_variable_C", nullptr, PluginFieldType::kINT32, 1));
mPluginAttributes.emplace_back(PluginField("delta_softplus", nullptr, PluginFieldType::kINT32, 1));
mPluginAttributes.emplace_back(PluginField("is_variable_B", nullptr, PluginFieldType::kINT8, 1));
mPluginAttributes.emplace_back(PluginField("is_variable_C", nullptr, PluginFieldType::kINT8, 1));
mPluginAttributes.emplace_back(PluginField("delta_softplus", nullptr, PluginFieldType::kINT8, 1));
mPluginAttributes.emplace_back(PluginField("type_id", nullptr, PluginFieldType::kINT32, 1));
mFC.nbFields = mPluginAttributes.size();
mFC.fields = mPluginAttributes.data();

View File

@ -29,19 +29,19 @@ namespace tensorrt_llm::plugins
// cannot support beam search
// inputs
// 0. input_tensor [batch_size, dim, seq_len]
// 1. state [batch_size, dim, dstate]
// 2. delta [batch_size, dim, seq_len]
// 0. input_tensor [batch_size, seq_len, dim]
// 1. state [batch_size, dstate, dim]
// 2. delta [batch_size, seq_len, dim]
// 3. delta_bias [dim]
// 4. A [dim, seq_len]
// 5. B [batch_size, dstate, seq_len]
// 6. C [batch_size, dstate, seq_len]
// 4. A [dstate, dim]
// 5. B [batch_size, seq_len, dstate]
// 6. C [batch_size, seq_len, dstate]
// 7. D [dim]
// 8. z [batch_size, dim, seq_len]
// 8. z [batch_size, seq_len, dim]
// 9. host_request_types [batch_size] int32. 0: context; 1: generation; 2: none.
// outputs
// 0. output_tensor [batch_size, dim, seq_len]
// 1. state [batch_size, dim, dstate]
// 0. output_tensor [batch_size, seq_len, dim]
// 1. state [batch_size, dstate, dim]
class SelectiveScanPlugin : public BasePlugin
{
@ -144,15 +144,11 @@ private:
void setSSMParams(tensorrt_llm::kernels::SSMParamsBase& params,
// sizes
const size_t batch, const size_t dim, const size_t seqLen, const size_t dstate, const size_t nChunks,
const bool isVariableB, const bool isVariableC,
const size_t batch, const size_t dim, const size_t seqLen, const size_t dstate, const bool isVariableB,
const bool isVariableC,
// device pointers
void* statePtr, const void* x, const void* delta, const void* deltaBias, const void* A, const void* B,
const void* C, const void* D, const void* z, void* out,
// strides
const size_t strideXBatch, const size_t strideDtBatch, const size_t strideADim, const size_t strideBBatch,
const size_t strideCBatch, const size_t strideZBatch, const size_t strideOutBatch,
const size_t strideStateBatch, const size_t strideStateDim, bool deltaSoftplus);
const void* C, const void* D, const void* z, void* out, bool deltaSoftplus);
private:
int mDim;

View File

@ -195,20 +195,7 @@ void WeightOnlyGroupwiseQuantMatmulPlugin::init(nvinfer1::DataType type, int qua
{
TLLM_THROW("FP8 is unsupported on pre-Hopper architectures!");
}
if (quant_algo & ZERO)
{
// has zeros
m_weightOnlyGroupwiseGemmRunner
= std::make_shared<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunner<__nv_fp8_e4m3,
cutlass::int4b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS, half, half, half>>();
}
else
{
// no zeros
m_weightOnlyGroupwiseGemmRunner
= std::make_shared<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunner<__nv_fp8_e4m3,
cutlass::int4b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_ONLY, half, half, half>>();
}
TLLM_THROW("FP8 is unsupported on with BF16 scales and zero-points!");
}
else
{
@ -301,8 +288,7 @@ bool WeightOnlyGroupwiseQuantMatmulPlugin::supportsFormatCombination(
if (pos == mWeightInputIdx)
{
// weights
return inOut[mWeightInputIdx].type == nvinfer1::DataType::kHALF
&& inOut[mWeightInputIdx].format == TensorFormat::kLINEAR;
return inOut[mWeightInputIdx].type == mType && inOut[mWeightInputIdx].format == TensorFormat::kLINEAR;
}
else if ((mQuantAlgo & FP8_ALPHA) && pos == mAlphaInputIdx)
{
@ -310,7 +296,7 @@ bool WeightOnlyGroupwiseQuantMatmulPlugin::supportsFormatCombination(
}
else
{
return inOut[pos].type == nvinfer1::DataType::kHALF && inOut[pos].format == TensorFormat::kLINEAR;
return inOut[pos].type == mType && inOut[pos].format == TensorFormat::kLINEAR;
}
}
else
@ -374,7 +360,14 @@ int WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(const nvinfer1::PluginTensorDe
}
const int n = inputDesc[mWeightInputIdx].dims.d[1];
const int k = inputDesc[0].dims.d[inputDesc[0].dims.nbDims - 1];
int smVersion = getSMVersion();
bool use_cuda_kernel = m < SMALL_M_FAST_PATH && mCudaKernelEnabled;
#if defined(ENABLE_BF16)
// CUDA kernels assume FP16 activations for Hopper
bool force_disable_cuda_kernel = smVersion == 90 && mType == nvinfer1::DataType::kBF16;
use_cuda_kernel = use_cuda_kernel && !force_disable_cuda_kernel;
#endif
bool use_pre_quant_scale = mQuantAlgo & PRE_QUANT_SCALE;
const half* zeros_ptr = (mQuantAlgo & ZERO) ? reinterpret_cast<const half*>(inputs[mZerosInputIdx]) : nullptr;
@ -443,7 +436,7 @@ int WeightOnlyGroupwiseQuantMatmulPlugin::enqueue(const nvinfer1::PluginTensorDe
weight_only_act_type = tensorrt_llm::kernels::WeightOnlyActivationType::BF16;
}
if (getSMVersion() == 90)
if (smVersion == 90)
{
// Hopper style kernels
if (use_cuda_kernel)

View File

@ -37,6 +37,7 @@ set(SRCS
runtimeBuffers.cpp
runtimeKernels.cu
statefulGptDecoder.cpp
tllmBuffers.cpp
tllmRuntime.cpp
tllmLogger.cpp
worldConfig.cpp)

View File

@ -66,6 +66,7 @@ SamplingConfig extractSamplingConfig(SamplingConfig const& batchSamplingConfig,
samplingConfig.beamSearchDiversityRate = batchSamplingConfig.beamSearchDiversityRate;
samplingConfig.lengthPenalty = batchSamplingConfig.lengthPenalty;
samplingConfig.earlyStopping = batchSamplingConfig.earlyStopping;
samplingConfig.normalizeLogProbs = batchSamplingConfig.normalizeLogProbs;
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
return samplingConfig;
@ -278,7 +279,7 @@ void GptDecoderBatch::newRequest(
tc::fmtstr("Input length (%d) + max new tokens (%d) must be less than max sequence length (%d).", inputLength,
maxNewTokens, mMaxSequenceLength));
TLLM_CHECK(requestIds->getDataType() == TRTDataType<TokenIdType>::value);
auto const endId = request.endId.value_or(mVocabSize - 1);
auto const endId = request.endId.value_or(-1);
auto constexpr localBatchSize = 1;
@ -459,6 +460,7 @@ void GptDecoderBatch::newRequest(
{
mDecoders[decoderIdx]->setup(samplingConfig, localBatchSize, mMaxSequenceLength);
}
TLLM_CHECK_WITH_INFO(!mFusedDecoder || beamWidth == 1, "Fused decoder is not supported for beam search yet.");
mBeamWidths[batchIdx] = beamWidth;
mNbSteps[batchIdx] = 0;
mFinished[batchIdx] = false;
@ -622,8 +624,6 @@ GptDecoderBatch::TokenPtr GptDecoderBatch::forwardAsync(
}
else
{
TLLM_CHECK_WITH_INFO(mBeamWidths[0] == 1, "Fused decoder is not supported for beam search yet.");
auto& dInput = *mJointDecodingInput;
auto& dOutput = *mJointDecodingOutput;
auto& decoder = *mDecoders[0];

View File

@ -185,8 +185,8 @@ template <typename InputType>
GptJsonConfig parseJson(InputType&& input)
{
auto constexpr allowExceptions = true;
auto constexpr ingoreComments = true;
auto const json = nlohmann::json::parse(std::forward<InputType>(input), nullptr, allowExceptions, ingoreComments);
auto constexpr ignoreComments = true;
auto const json = nlohmann::json::parse(std::forward<InputType>(input), nullptr, allowExceptions, ignoreComments);
auto const engineVersion = parseJsonFieldOr(json, "version", std::string("none"));

View File

@ -19,6 +19,7 @@
#include "tensorrt_llm/runtime/gptSession.h"
#include "common.h"
#include "iBuffer.h"
#include "tensorrt_llm/batch_manager/kvCacheManager.h"
#include "tensorrt_llm/common/customAllReduceUtils.h"
@ -55,13 +56,13 @@ std::unordered_set<std::int32_t> populateMicrobatchIndexes()
std::unordered_set<std::int32_t> idxSet;
if (profileMbIdxChar != nullptr)
{
std::istringstream ss{profileMbIdxChar};
std::istringstream iss{profileMbIdxChar};
std::int32_t idx;
char c;
while (ss >> idx)
while (iss >> idx)
{
idxSet.insert(idx);
ss >> c;
iss >> c;
}
}
@ -79,9 +80,6 @@ GptSession::GptSession(Config const& sessionConfig, GptModelConfig const& modelC
, mDevice{utils::initDevice(worldConfig)}
, mLogger{logger ? std::move(logger) : std::make_shared<TllmLogger>()}
, mRuntime{std::make_shared<TllmRuntime>(engineBuffer, engineSize, *mLogger)}
, mDecoders{}
, mBuffers{}
, mCudaGraphInstances{}
{
if (mWorldConfig.isPipelineParallel())
{
@ -157,9 +155,13 @@ void GptSession::createDecoders(SizeType batchSize, SizeType beamWidth, SizeType
for (SizeType i = 0; i < numMicroBatches; ++i)
{
if (decoderPerRequest)
{
mDecoders.emplace_back(std::make_shared<GptDecoderBatch>(vocabSize, vocabSizePadded, stream));
}
else
{
mDecoders.emplace_back(std::make_shared<StatefulGptDecoder>(vocabSize, vocabSizePadded, stream));
}
constexpr SizeType maxTokensPerStep = 1;
mDecoders.back()->setup(decodingMode, batchSize, beamWidth, maxAttentionWindow, sinkTokenLength,
maxSequenceLength, maxTokensPerStep, /* fusedDecoder*/ false, logitsType);
@ -174,19 +176,21 @@ void GptSession::createKvCacheManager(SizeType batchSize, SizeType beamWidth, Si
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
auto const tokensPerBlock = mModelConfig.getTokensPerBlock();
nvinfer1::DataType kvDtype;
auto const kvDtype = [this]()
{
if (mModelConfig.getQuantMode().hasFp8KvCache())
{
kvDtype = nvinfer1::DataType::kFP8;
return nvinfer1::DataType::kFP8;
}
else if (mModelConfig.getQuantMode().hasInt8KvCache())
{
kvDtype = nvinfer1::DataType::kINT8;
return nvinfer1::DataType::kINT8;
}
else
{
kvDtype = mModelConfig.getDataType();
return mModelConfig.getDataType();
}
}();
auto const maxNumBlocks = bmkv::KVCacheManager::calculateMaxNumBlocks(
kvCacheConfig, kvDtype, mModelConfig, mWorldConfig, getBufferManager());
@ -208,6 +212,7 @@ void GptSession::createKvCacheManager(SizeType batchSize, SizeType beamWidth, Si
void GptSession::createCustomAllReduceWorkspace(
SizeType maxBatchSize, SizeType maxBeamWidth, SizeType maxSequenceLength)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
setPeerAccess(mWorldConfig, true);
mIpcMemoryHandles.clear();
@ -219,11 +224,10 @@ void GptSession::createCustomAllReduceWorkspace(
mIpcMemoryHandles.emplace_back(std::make_shared<IpcMemory>(mWorldConfig, IpcMemory::FLAGS_SIZE * sizeof(int32_t)));
mIpcMemoryHandles.emplace_back(std::make_shared<IpcMemory>(mWorldConfig, IpcMemory::FLAGS_SIZE * sizeof(int32_t)));
auto& manager = mRuntime->getBufferManager();
mCommPtrs = manager.cpu(
mCommPtrs = BufferManager::cpu(
ITensor::makeShape({static_cast<SizeType>(mIpcMemoryHandles.size()) * mWorldConfig.getTensorParallelism()}),
nvinfer1::DataType::kINT64);
const auto commPtrsData = bufferCast<void*>(*mCommPtrs);
auto* const commPtrsData = bufferCast<void*>(*mCommPtrs);
for (size_t memIdx = 0; memIdx < mIpcMemoryHandles.size(); memIdx++)
{
@ -233,6 +237,7 @@ void GptSession::createCustomAllReduceWorkspace(
commPtrsData[memIdx * mWorldConfig.getTensorParallelism() + tpIdx] = memCommPtrs[tpIdx];
}
}
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
}
GptSession::MicroBatchConfig::MicroBatchConfig(SizeType maxBatchSize, SizeType pipelineParallelism,
@ -289,6 +294,8 @@ void GptSession::setup(Config const& sessionConfig)
createContexts();
createBuffers(mMicroBatchConfig.numGenBatches);
mNormalizeLogProbs = sessionConfig.normalizeLogProbs;
// Store these params, which determine the decoder buffer sizes and the kv cache manager, so they
// can be checked against the input shapes given in generate().
// gptDecoderBatch does not resize buffers, but allows smaller batchSize and beamWidth.
@ -297,12 +304,6 @@ void GptSession::setup(Config const& sessionConfig)
mDecoderMaxAttentionWindow = maxAttentionWindow;
mDecoderSinkTokenLength = sinkTokenLength;
if (mModelConfig.usePagedKvCache())
{
createKvCacheManager(maxBatchSize, maxBeamWidth, maxAttentionWindow, sinkTokenLength, maxSequenceLength,
sessionConfig.kvCacheConfig);
}
if (mWorldConfig.isLastPipelineParallelRank())
{
auto const logitsType = mRuntime->getEngine().getTensorDataType("logits");
@ -317,14 +318,22 @@ void GptSession::setup(Config const& sessionConfig)
{
mReceivedEvents.clear();
for (SizeType i = 0; i < mMicroBatchConfig.numGenBatches; ++i)
{
mReceivedEvents.emplace_back();
}
}
if (mWorldConfig.isTensorParallel() && mModelConfig.useCustomAllReduce())
{
createCustomAllReduceWorkspace(mMicroBatchConfig.genBatchSize, maxBeamWidth, maxSequenceLength);
}
if (mModelConfig.usePagedKvCache())
{
createKvCacheManager(maxBatchSize, maxBeamWidth, maxAttentionWindow, sinkTokenLength, maxSequenceLength,
sessionConfig.kvCacheConfig);
}
auto* kvCacheManager = mModelConfig.usePagedKvCache() ? mKvCacheManager.get() : nullptr;
for (auto& buffers : mBuffers)
@ -334,6 +343,7 @@ void GptSession::setup(Config const& sessionConfig)
mMicroBatchConfig.genBatchSize, maxBeamWidth, 0, maxAttentionWindow, sinkTokenLength, maxSequenceLength};
buffers->reshape(kvCacheManager, mModelConfig, mWorldConfig);
}
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
}
@ -344,7 +354,7 @@ void GptSession::kvCacheAddSequences(SizeType beamWidth, SizeType microBatchId,
TLLM_CHECK(mKvCacheManager);
auto contextLengthsHost = mBuffers.at(microBatchId)->contextLengthsHost;
TLLM_CHECK(contextLengthsHost);
auto const contextLengthsPtr = bufferCast<SizeType const>(*contextLengthsHost);
const auto* const contextLengthsPtr = bufferCast<SizeType const>(*contextLengthsHost);
auto const contextLengthsSize = static_cast<SizeType>(contextLengthsHost->getSize());
for (SizeType batchIdx = 0; batchIdx < contextLengthsSize; ++batchIdx)
{
@ -358,9 +368,9 @@ ITensor::SharedPtr GptSession::initDecoder(ITensor& outputIds, GenerationInput c
{
if (mWorldConfig.isLastPipelineParallelRank())
{
auto& decoder = mDecoders.at(microBatchId);
decoder->newBatch(inputs, outputs, samplingConfig);
return decoder->getNewTokens();
auto& decoder = *mDecoders.at(microBatchId);
decoder.newBatch(inputs, outputs, samplingConfig);
return decoder.getNewTokens();
}
else if (mWorldConfig.isFirstPipelineParallelRank())
{
@ -467,7 +477,9 @@ std::vector<GenerationInput> splitInputs(GenerationInput const& inputs, SizeType
auto const batchSize = microBatchOffsets[batchId + 1] - offset;
if (inputs.embeddingBias)
{
batch.embeddingBias = inputs.embeddingBias;
}
if (inputs.badWordsList)
{
@ -487,21 +499,29 @@ std::vector<GenerationInput> splitInputs(GenerationInput const& inputs, SizeType
batch.stopWordsList = ITensor::slice(inputs.stopWordsList, offset, batchSize);
}
if (inputs.maxNewTokens)
{
batch.maxNewTokens = inputs.maxNewTokens;
}
if (inputs.promptTuningParams.embeddingTable)
{
batch.promptTuningParams.embeddingTable = inputs.promptTuningParams.embeddingTable;
}
if (inputs.promptTuningParams.tasks)
{
batch.promptTuningParams.tasks = ITensor::slice(inputs.promptTuningParams.tasks, offset, batchSize);
}
if (inputs.promptTuningParams.vocabSize)
{
batch.promptTuningParams.vocabSize = inputs.promptTuningParams.vocabSize;
}
}
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
return inputBatches;
}
std::vector<GenerationOutput> splitOutputs(GenerationOutput& outputs, SizeType microBatchSize, BufferManager& manager)
std::vector<GenerationOutput> splitOutputs(GenerationOutput& outputs, SizeType microBatchSize)
{
auto const numRequests = outputs.ids->getShape().d[0];
@ -547,8 +567,8 @@ void updateOutputIds(ITensor::SharedPtr const& outputIds, ITensor::SharedPtr con
}
} // namespace
void GptSession::generate(
GenerationOutput& outputs, GenerationInput const& inputs, SamplingConfig const& samplingConfig)
void GptSession::generate(GenerationOutput& outputs, GenerationInput const& inputs,
SamplingConfig const& samplingConfig, std::shared_ptr<GenerationProfiler> const generationProfiler)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
@ -604,9 +624,9 @@ void GptSession::generate(
}
else
{
for (auto iter = inputLengthsRange.begin(); iter != inputLengthsRange.end(); iter++)
for (auto iter : inputLengthsRange)
{
maxNewTokens = std::max(maxNewTokens, mDecoderMaxSequenceLength - *iter);
maxNewTokens = std::max(maxNewTokens, mDecoderMaxSequenceLength - iter);
}
}
@ -635,13 +655,13 @@ void GptSession::generate(
{
std::vector<GenerationInput> microBatchesInputs{inputs};
std::vector<GenerationOutput> microBatchesOutputs{outputs};
generateBatched(microBatchesOutputs, microBatchesInputs, samplingConfig, onTokenGenerated);
generateBatched(microBatchesOutputs, microBatchesInputs, samplingConfig, onTokenGenerated, generationProfiler);
}
else
{
auto const microBatchesInputs = splitInputs(inputs, mMicroBatchConfig.genBatchSize, manager);
auto microBatchesOutputs = splitOutputs(outputs, mMicroBatchConfig.genBatchSize, manager);
generateBatched(microBatchesOutputs, microBatchesInputs, samplingConfig, onTokenGenerated);
auto microBatchesOutputs = splitOutputs(outputs, mMicroBatchConfig.genBatchSize);
generateBatched(microBatchesOutputs, microBatchesInputs, samplingConfig, onTokenGenerated, generationProfiler);
}
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
@ -665,7 +685,7 @@ GptSession::TokenGeneratedCallback GptSession::createOnTokenGeneratedCallback(Ge
void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutputs,
std::vector<GenerationInput> const& microBatchesInputs, SamplingConfig const& samplingConfig,
TokenGeneratedCallback const& onTokenGenerated)
TokenGeneratedCallback const& onTokenGenerated, std::shared_ptr<GenerationProfiler> const generationProfiler)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
@ -743,7 +763,7 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
auto const profileContext = !kProfileMbIdxs.empty() && kProfileMbIdxs.count(0) > 0;
if (profileContext)
cudaProfilerStart();
executeContextStep(microBatchesInputs, microBatchesOutputs, microBatchOffsets, kvCacheManager);
executeContextStep(microBatchesInputs, microBatchOffsets, kvCacheManager);
if (profileContext)
cudaProfilerStop();
@ -751,6 +771,11 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
SizeType numBatchesFinished{0};
SizeType step{0};
if (generationProfiler)
{
manager.getStream().record(generationProfiler->getStart());
}
while (numBatchesFinished < numMicroBatches)
{
++step;
@ -768,6 +793,11 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
cudaProfilerStop();
}
if (generationProfiler)
{
manager.getStream().record(generationProfiler->getEnd());
}
// Collect the results for the last step
for (auto microBatchId = 0; microBatchId < numMicroBatches; ++microBatchId)
{
@ -796,12 +826,15 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
auto& cumLogProbs = buffers.cumLogProbs;
if (cumLogProbs)
{
manager.copy(*decoder.getCumLogProbs(), *buffers.cumLogProbs);
}
auto& logProbs = buffers.logProbs;
if (logProbs)
{
manager.copy(*decoder.getLogProbs(), *buffers.logProbs);
}
}
// copy generation logits fragments into a single generationLogits tensor
if (mModelConfig.computeGenerationLogits())
{
@ -823,19 +856,18 @@ void GptSession::generateBatched(std::vector<GenerationOutput>& microBatchesOutp
TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
}
void GptSession::executeContextStep(std::vector<GenerationInput> const& microBatchesInputs,
std::vector<GenerationOutput>& microBatchesOutputs, std::vector<SizeType> const& generationBatchOffsets,
KvCacheManager const* kvCacheManager)
void GptSession::executeContextStep(std::vector<GenerationInput> const& generationBatchesInputs,
std::vector<SizeType> const& generationBatchesOffsets, KvCacheManager const* kvCacheManager)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
auto& manager = mRuntime->getBufferManager();
auto const numGenerationBatches = static_cast<SizeType>(microBatchesInputs.size());
auto const numGenerationBatches = static_cast<SizeType>(generationBatchesInputs.size());
auto constexpr step = 0;
auto constexpr contextId = 0;
for (auto generationBatchId = 0; generationBatchId < numGenerationBatches; ++generationBatchId)
{
auto const& generationBatchInputs = microBatchesInputs.at(generationBatchId);
auto const& generationBatchInputs = generationBatchesInputs.at(generationBatchId);
auto& generationBuffers = *mBuffers.at(generationBatchId);
auto const contextBatchSize = mMicroBatchConfig.ctxBatchSize;
@ -847,7 +879,7 @@ void GptSession::executeContextStep(std::vector<GenerationInput> const& microBat
for (auto contextBatchId = 0; contextBatchId < numContextBatches; ++contextBatchId)
{
auto batchOffset = generationBatchOffsets.at(generationBatchId) + contextBatchOffsets.at(contextBatchId);
auto batchOffset = generationBatchesOffsets.at(generationBatchId) + contextBatchOffsets.at(contextBatchId);
auto& buffers = contextBuffers.at(contextBatchId);
auto& inputBuffer = buffers.inputBuffers[0];
auto& outputBuffer = buffers.outputBuffers[0];
@ -976,7 +1008,7 @@ SizeType GptSession::executeGenerationStep(SizeType step, std::vector<Generation
void GptSession::decoderStepAsync(SizeType decoderStep, SizeType microBatchId)
{
TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
auto& stream = mRuntime->getStream();
auto const& stream = mRuntime->getStream();
auto& buffers = *mBuffers.at(microBatchId);
auto const& outputIds = buffers.outputIds;
auto const& newTokens = buffers.newTokens;

View File

@ -20,7 +20,7 @@
namespace tensorrt_llm::runtime
{
void setPeerAccess(WorldConfig worldConfig, bool enable)
void setPeerAccess(WorldConfig const& worldConfig, bool enable)
{
const auto srcNode = worldConfig.getTensorParallelRank();
@ -50,7 +50,7 @@ void setPeerAccess(WorldConfig worldConfig, bool enable)
}
}
IpcMemory::IpcMemory(WorldConfig worldConfig, std::size_t bufferSize)
IpcMemory::IpcMemory(WorldConfig const& worldConfig, std::size_t bufferSize)
: mWorldConfig(worldConfig)
, mCommPtrs(worldConfig.getTensorParallelism())
, mBufferSize(bufferSize)

View File

@ -78,7 +78,7 @@ ncclComm_t NcclCommunicator::createComm(int worldSize, int rank, mpi::MpiComm co
{
ncclGetUniqueId(&id);
}
mpiComm.bcast(id, 0);
mpiComm.bcastValue(id, 0);
ncclComm_t comm;
TLLM_NCCL_CHECK(ncclCommInitRank(&comm, worldSize, id, rank));
return comm;

View File

@ -0,0 +1,31 @@
/*
* Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "tensorrt_llm/runtime/tllmBuffers.h"
namespace tensorrt_llm::runtime
{
template <typename TAllocator>
typename PoolAllocator<TAllocator>::PoolType& PoolAllocator<TAllocator>::getPool()
{
static PoolType pool;
return pool;
}
// explicit instantiations
template class PoolAllocator<PinnedAllocator>;
} // namespace tensorrt_llm::runtime

View File

@ -480,11 +480,7 @@ public:
using SizeType = typename Base::SizeType;
using PoolType = MemoryPool<TAllocator>;
static PoolType& getPool()
{
static PoolType pool;
return pool;
}
static PoolType& getPool();
protected:
void allocateImpl(PointerType* ptr, SizeType n) // NOLINT(readability-convert-member-functions-to-static)

View File

@ -46,7 +46,7 @@ void testBroadcast()
auto constexpr expectedValue = static_cast<T>(42);
auto constexpr root = 0;
auto value = rank == root ? expectedValue : T{};
comm.bcast(value, root);
comm.bcastValue(value, root);
EXPECT_EQ(value, expectedValue);
}
@ -79,7 +79,7 @@ TEST(MPIUtils, BroadcastNcclId)
{
std::memset(&id, 0, sizeof(id));
}
comm.bcast(id, root);
comm.bcastValue(id, root);
EXPECT_TRUE(std::any_of(
id.internal, id.internal + sizeof(id.internal) / sizeof(id.internal[0]), [](auto x) { return x != 0; }));
}

View File

@ -15,7 +15,6 @@
# limitations under the License.
import argparse as _arg
import glob as _gl
import logging as _log
import os as _os
import pathlib as _pl
@ -72,7 +71,7 @@ def build_trt_llm(python_exe: str,
python_exe, "scripts/build_wheel.py", "--cuda_architectures",
cuda_architectures, "--build_dir",
str(build_dir), "--dist_dir",
str(dist_dir)
str(dist_dir), "-s", "-i"
]
if use_ccache:
@ -86,12 +85,6 @@ def build_trt_llm(python_exe: str,
run_command(build_wheel, cwd=root_dir, env=_os.environ, timeout=2400)
dist_dir = dist_dir if dist_dir.is_absolute() else root_dir / dist_dir
wheels = _gl.glob(str(dist_dir / "tensorrt_llm-*.whl"))
assert len(wheels) > 0, "No wheels found"
install_wheel = [python_exe, "-m", "pip", "install", "--upgrade", *wheels]
run_command(install_wheel, cwd=root_dir, timeout=300)
def run_tests(cuda_architectures: _tp.Optional[str] = None,
build_dir: _tp.Optional[str] = None,
@ -369,11 +362,18 @@ def run_multi_gpu_tests(build_dir: _pl.Path):
tests_dir = build_dir / "tests"
cpp_env = {**_os.environ}
# TP2+PP2 tests fail for beam search
session_test = [
"mpirun", "-n", "4", "--allow-run-as-root", "gptSessionTest",
"--gtest_filter=*TP*:*PP*"
"--gtest_filter=*TP4*:*PP4*"
]
run_command(session_test, cwd=tests_dir, env=cpp_env, timeout=900)
run_command(session_test, cwd=tests_dir, env=cpp_env, timeout=300)
trt_model_test = [
"mpirun", "-n", "4", "--allow-run-as-root",
"batch_manager/trtGptModelRealDecoderTest", "--gtest_filter=*TP*:*PP*"
]
run_command(trt_model_test, cwd=tests_dir, env=cpp_env, timeout=300)
def run_benchmarks(python_exe: str, root_dir: _pl.Path, build_dir: _pl.Path,

View File

@ -15,8 +15,6 @@ GROUP_NAME ?= $(shell id --group --name)
LOCAL_USER ?= 0
ifeq ($(LOCAL_USER),1)
IMAGE_TAG_SUFFIX ?= -$(USER_NAME)
else
IMAGE_TAG_SUFFIX ?=
endif
# Default stage of the docker multi-stage build
@ -70,7 +68,7 @@ endef
$(if $(GIT_COMMIT), --build-arg GIT_COMMIT="$(GIT_COMMIT)") \
$(if $(STAGE), --target $(STAGE)) \
--file Dockerfile.multi \
--tag $(IMAGE_WITH_TAG)$(IMAGE_TAG_SUFFIX) \
--tag $(IMAGE_WITH_TAG) \
..
%_user:
@ -122,15 +120,15 @@ release_%: STAGE = release
release_run: WORK_DIR = /app/tensorrt_llm
# For x86_64
jenkins_%: IMAGE_TAG = jenkins_latest
jenkins_%: IMAGE_WITH_TAG = $(shell grep 'LLM_DOCKER_IMAGE = ' ../jenkins/L0_MergeRequest.groovy | grep -o '".*"' | tr -d '"')
jenkins_%: STAGE = devel
# For aarch64
jenkins-aarch64_%: IMAGE_TAG = jenkins-aarch64_latest
jenkins-aarch64_%: IMAGE_WITH_TAG = $(shell grep 'LLM_SBSA_DOCKER_IMAGE = ' ../jenkins/GH200Builder.groovy | grep -o '".*"' | tr -d '"')
jenkins-aarch64_%: STAGE = devel
# For x86_64
centos7_%: IMAGE_TAG = centos7_latest
centos7_%: IMAGE_WITH_TAG = $(shell grep 'LLM_CENTOS7_DOCKER_IMAGE = ' ../jenkins/L0_MergeRequest.groovy | grep -o '".*"' | tr -d '"')
centos7_%: STAGE = devel
centos7_%: BASE_IMAGE = nvidia/cuda
centos7_%: BASE_TAG = 12.3.1-devel-centos7
@ -141,7 +139,7 @@ ubuntu22_%: BASE_IMAGE = nvidia/cuda
ubuntu22_%: BASE_TAG = 12.3.1-devel-ubuntu22.04
# For x86_64
old-cuda_%: IMAGE_TAG = old-cuda_latest
old-cuda_%: IMAGE_WITH_TAG = $(shell grep 'LLM_OLD_CUDA_DOCKER_IMAGE = ' ../jenkins/L0_MergeRequest.groovy | grep -o '".*"' | tr -d '"')
old-cuda_%: BASE_TAG = 23.07-py3
old-cuda_%: STAGE = devel
old-cuda_%: CUDA_VERSION = 12.1

View File

@ -252,7 +252,7 @@ populates an instance of the
* `embeddingBiasOpt`, is a tensor of floating-point values on the GPU that
contains the bias to add to the logits during sampling (after the projection
from hidden states to logits as the last step of the model). This tensor
must have `vocabSize` elements (as defined in the `ModelConfig` argument
must have `vocabSize` elements (as defined in the `modelConfig` argument
passed to the constructor),
* `badWordsList`, is a tensor of integers on the GPU that encodes the list of
words that have to be banned from generated sequences. Its shape is `[2,
@ -315,8 +315,7 @@ batchSize, beamWidth]`_.
After inference is complete, you can get the context logits in `GenerationOutput.contextLogits`; these are variables on the GPU. For specific acquisition methods, please refer to the example in [gptSessionBenchmark.cpp](https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/gptSessionBenchmark.cpp).
It is important to point out
that enabling that computation may have an impact on performance (the final
LM head has to perform a matrix multiplication on all the context tokens
that enabling the computation may have an impact on performance (the language modeling head (LM head) has to perform a matrix multiplication on all the context tokens
instead of just the last one).
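Below is a minimal, illustrative C++ sketch of reading that tensor after a call to `generate`. It is not the gptSessionBenchmark code; it assumes a `GptSession` built from an engine with context-logits gathering enabled, input/output structures prepared as described in this document, and a hypothetical helper name:

```cpp
#include <iostream>

#include "tensorrt_llm/runtime/gptSession.h"

namespace tlr = tensorrt_llm::runtime;

// Hypothetical helper (a sketch, not part of the library): run one generation and
// print the dimensions of the context-logits tensor returned in GenerationOutput.
void dumpContextLogitsShape(tlr::GptSession& session, tlr::GenerationInput const& input,
    tlr::GenerationOutput& output, tlr::SamplingConfig const& samplingConfig)
{
    session.generate(output, input, samplingConfig);

    if (output.contextLogits) // only populated when the engine gathers context logits
    {
        auto const shape = output.contextLogits->getShape();
        std::cout << "contextLogits dims:";
        for (int i = 0; i < shape.nbDims; ++i)
        {
            std::cout << " " << shape.d[i];
        }
        // The tensor lives on the GPU; copy it to host with a BufferManager before
        // reading individual values.
        std::cout << std::endl;
    }
}
```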
* `generationLogits`, is a tensor of values on the GPU (same datatype as the
computation type) to store the logits for the generation. Its shape is

View File

@ -23,6 +23,10 @@ Welcome to TensorRT-LLM's documentation!
graph-rewriting.md
memory.md
new_workflow.md
lora.md
perf_best_practices.md
performance_analysis.md
Python API
----------
@ -81,3 +85,4 @@ Blogs
blogs/H100vsA100.md
blogs/H200launch.md
blogs/Falcon180B-H200.md
blogs/quantization-in-TRT-LLM.md

View File

@ -22,7 +22,7 @@ Optional tensors that can be supplied to `InferenceRequest` are shown below. Def
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token Id |
| `end_id` | [1] | `int32_t` | End token Id. If not specified, defaults to -1 |
| `pad_id` | [1] | `int32_t` | Pad token Id |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |

View File

@ -60,8 +60,8 @@ The different files will be loaded by different ranks in a multi-GPU (multi-proc
| mapping.world_size | int | 1 |
| mapping.tp_size | int | 1 |
| mapping.pp_size | int | 1 |
| quantization.quant_aglo | str | null |
| quantization.kv_cache_quant_aglo | str | null |
| quantization.quant_algo | str | null |
| quantization.kv_cache_quant_algo | str | null |
| quantization.group_size | int | 64 |
| quantization.has_zero_point | bool | False |
| quantization.pre_quant_scale | bool | False |
@ -211,10 +211,6 @@ Here is the `config.json`:
"position_embedding_type": "learned_absolute",
"max_position_embeddings": 2048,
"hidden_act": "relu",
"quantization": {
"use_weight_only": false,
"weight_only_precision": "int8"
},
"mapping": {
"world_size": 2,
"tp_size": 2

View File

@ -17,122 +17,214 @@ described in the benchmarks [folder](source:benchmarks/).
The below tables provide reference data at large batch sizes, representing
high throughput offline tasks.
This data has been updated for v0.6.1, unless specified.
All data was generated using version 0.8.0.
### H200 GPUs (FP8)
### H100 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--------------------------- | :--------- | :-------- | :----------- | :------------ | -------------------------: |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 26,150 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 8,011 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,551 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,327 |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 29,168 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 9,472 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,961 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 4,149 |
| | | | | | |
| LLaMA 7B | 768 | 1 | 128 | 128 | 19,694 |
| LLaMA 7B | 112 | 1 | 128 | 2048 | 6,818 |
| LLaMA 7B | 80 | 1 | 2048 | 128 | 2,244 |
| LLaMA 7B | 48 | 1 | 2048 | 2048 | 2,740 |
| Mistral 7B | 896 | 1 | 128 | 128 | 20,569 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 8,968 |
| Mistral 7B | 84 | 1 | 2048 | 128 | 2,450 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 3,868 |
| | | | | | |
| LLaMA 70B | 1024 | 2 | 128 | 128 | 2,657 |
| LLaMA 70B | 480 | 4 | 128 | 2048 | 1,486 |
| LLaMA 70B | 96 | 2 | 2048 | 128 | 306 |
| LLaMA 70B | 64 | 2 | 2048 | 2048 | 547 |
| LLaMA 7B | 896 | 1 | 128 | 128 | 20,548 |
| LLaMA 7B | 120 | 1 | 128 | 2048 | 8,343 |
| LLaMA 7B | 84 | 1 | 2048 | 128 | 2,429 |
| LLaMA 7B | 56 | 1 | 2048 | 2048 | 3,530 |
| | | | | | |
| Falcon 180B | 1024 | 4 | 128 | 128 | 987 |
| Falcon 180B | 1024 | 8 | 128 | 2048 | 724 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 112 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 264 |
| LLaMA 70B | 512 | 1 | 128 | 128 | 3,844 |
| LLaMA 70B | 512 | 2 | 128 | 2048 | 4,008 |
| LLaMA 70B | 64 | 1 | 2048 | 128 | 421 |
| LLaMA 70B | 64 | 1 | 2048 | 2048 | 1,461 |
| | | | | | |
| Falcon 180B | 1024 | 4 | 128 | 128 | 1,116 |
| Falcon 180B | 1024 | 4 | 128 | 2048 | 990 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 118 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 269 |
### L40S GPUs (FP8)<sup>*</sup>
<sup> * The following data is from TensorRT-LLM v0.5. </sup>
### H100 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--------------------------- | :--------- | :-------- | :----------- | :------------ | -------------------------: |
| GPT-J 6B | 1024 | 1 | 128 | 128 | 27,357 |
| GPT-J 6B | 120 | 1 | 128 | 2048 | 7,831 |
| GPT-J 6B | 64 | 1 | 2048 | 128 | 2,661 |
| GPT-J 6B | 64 | 1 | 2048 | 2048 | 3,409 |
| | | | | | |
| Mistral 7B | 896 | 1 | 128 | 128 | 20,517 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 8,619 |
| Mistral 7B | 64 | 1 | 2048 | 128 | 2,438 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 3,733 |
| | | | | | |
| LLaMA 7B | 896 | 1 | 128 | 128 | 20,241 |
| LLaMA 7B | 120 | 1 | 128 | 2048 | 6,922 |
| LLaMA 7B | 64 | 1 | 2048 | 128 | 2,170 |
| LLaMA 7B | 56 | 1 | 2048 | 2048 | 2,816 |
| | | | | | |
| LLaMA 70B | 1024 | 2 | 128 | 128 | 3,269 |
| LLaMA 70B | 512 | 4 | 128 | 2048 | 2,718 |
| LLaMA 70B | 96 | 2 | 2048 | 128 | 347 |
| LLaMA 70B | 64 | 2 | 2048 | 2048 | 1,020 |
| | | | | | |
| Falcon 180B | 512 | 4 | 128 | 128 | 1,048 |
| Falcon 180B | 1024 | 8 | 128 | 2048 | 836 |
| Falcon 180B | 64 | 4 | 2048 | 128 | 114 |
| Falcon 180B | 64 | 4 | 2048 | 2048 | 250 |
### L40S GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--------------------------- | :--------- | :-------- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 64 | 1 | 128 | 128 | 3,630 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,859 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 616 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 757 |
| GPT-J 6B | 512 | 1 | 128 | 128 | 7,992 |
| GPT-J 6B | 64 | 1 | 128 | 2048 | 1,874 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 693 |
| GPT-J 6B | 32 | 1 | 2048 | 2048 | 768 |
| | | | | | |
| LLaMA 7B | 64 | 1 | 128 | 128 | 3,240 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,622 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 581 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 531 |
| Mistral 7B | 896 | 1 | 128 | 128 | 9,679 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 4,401 |
| Mistral 7B | 84 | 1 | 2048 | 128 | 979 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 1,721 |
| | | | | | |
| LLaMA 7B | 256 | 1 | 128 | 128 | 5,954 |
| LLaMA 7B | 64 | 1 | 128 | 2048 | 1,654 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 579 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 542 |
| | | | | | |
| LLaMA 70B | 256 | 2 | 128 | 128 | 561 |
| LLaMA 70B | 256 | 4 | 128 | 2048 | 471 |
| LLaMA 70B | 16 | 2 | 2048 | 128 | 49 |
| LLaMA 70B | 64 | 4 | 2048 | 2048 | 177 |
| | | | | | |
| Falcon 180B | 512 | 8 | 128 | 128 | 152 |
| Falcon 180B | 256 | 8 | 128 | 2048 | 200 |
| Falcon 180B | 32 | 8 | 2048 | 128 | 15 |
| Falcon 180B | 16 | 8 | 2048 | 2048 | 39 |
### A100 GPUs (FP16)
| Model | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s/GPU) |
| :--------------------------- | :--------- | :-------- | :----------- | :------------ | ---------------------: |
| GPT-J 6B | 512 | 1 | 128 | 128 | 6,374 |
| GPT-J 6B | 120 | 2 | 128 | 2048 | 2,192 |
| GPT-J 6B | 60 | 1 | 2048 | 128 | 670 |
| GPT-J 6B | 64 | 2 | 2048 | 2048 | 903 |
| GPT-J 6B | 512 | 1 | 128 | 128 | 6,810 |
| GPT-J 6B | 32 | 1 | 128 | 2048 | 1,658 |
| GPT-J 6B | 32 | 1 | 2048 | 128 | 631 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 | 692 |
| | | | | | |
| LLaMA 7B | 384 | 1 | 128 | 128 | 5,586 |
| LLaMA 7B | 60 | 1 | 128 | 2048 | 1,928 |
| LLaMA 7B | 52 | 1 | 2048 | 128 | 591 |
| LLaMA 7B | 64 | 2 | 2048 | 2048 | 782 |
| Mistral 7B | 896 | 1 | 128 | 128 | 6,472 |
| Mistral 7B | 120 | 1 | 128 | 2048 | 3,812 |
| Mistral 7B | 84 | 1 | 2048 | 128 | 734 |
| Mistral 7B | 56 | 1 | 2048 | 2048 | 1,607 |
| | | | | | |
| LLaMA 70B | 1280 | 4 | 128 | 128 | 670 |
| LLaMA 70B | 240 | 4 | 128 | 2048 | 525 |
| LLaMA 70B | 120 | 4 | 2048 | 128 | 79 |
| LLaMA 7B | 256 | 1 | 128 | 128 | 5,353 |
| LLaMA 7B | 32 | 1 | 128 | 2048 | 1,518 |
| LLaMA 7B | 32 | 1 | 2048 | 128 | 547 |
| LLaMA 7B | 16 | 1 | 2048 | 2048 | 613 |
| | | | | | |
| Falcon 180B | 1024 | 8 | 128 | 128 | 232 |
| Falcon 180B | 128 | 8 | 128 | 2048 | 180 |
| LLaMA 70B | 256 | 4 | 128 | 128 | 565 |
| LLaMA 70B | 128 | 4 | 128 | 2048 | 595 |
| LLaMA 70B | 32 | 4 | 2048 | 128 | 66 |
| LLaMA 70B | 32 | 4 | 2048 | 2048 | 185 |
| | | | | | |
| Falcon 180B | 256 | 8 | 128 | 128 | 193 |
| Falcon 180B | 256 | 8 | 128 | 2048 | 203 |
| Falcon 180B | 16 | 8 | 2048 | 128 | 20 |
(1) TP stands for Tensor Parallelism.
## Low Latency<sup>**</sup>
<sup> ** The following data is from TensorRT-LLM v0.5. Low latency numbers will soon be updated to reflect real time latency with infight-batching.</sup>
All data was generated using version 0.8.0.
<sup> ** Low latency numbers will soon be updated to reflect real-time latency with in-flight batching.</sup>
The below tables provide reference data at batch size 1 for first token
latency, representing end-user's perceived latency for online streaming
tasks.
### H200 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :-------- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 5.2 |
| GPT-J 6B | 1 | 1 | 2048 | 23.6 |
| | | | | |
| Mistral 7B | 1 | 1 | 128 | 6.0 |
| Mistral 7B | 1 | 1 | 2048 | 31.8 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 5.8 |
| LLaMA 7B | 1 | 1 | 2048 | 30.1 |
| | | | | |
| LLaMA 70B | 1 | 8 | 128 | 16.0 |
| LLaMA 70B | 1 | 8 | 2048 | 78.8 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 37.2 |
| Falcon 180B | 1 | 8 | 2048 | 120.8 |
### H100 GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :-------- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 7 |
| GPT-J 6B | 1 | 1 | 2048 | 29 |
| GPT-J 6B | 1 | 1 | 128 | 5.7 |
| GPT-J 6B | 1 | 1 | 2048 | 23.8 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 7 |
| LLaMA 7B | 1 | 1 | 2048 | 36 |
| Mistral 7B | 1 | 1 | 128 | 6.6 |
| Mistral 7B | 1 | 1 | 2048 | 32.6 |
| | | | | |
| LLaMA 70B | 1 | 4 | 128 | 26 |
| LLaMA 70B | 1 | 4 | 2048 | 109 |
| LLaMA 7B | 1 | 1 | 128 | 6.4 |
| LLaMA 7B | 1 | 1 | 2048 | 31.0 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 27 |
| Falcon 180B | 1 | 8 | 2048 | 205 |
| LLaMA 70B | 1 | 8 | 128 | 17.0 |
| LLaMA 70B | 1 | 8 | 2048 | 84.4 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 39.7 |
| Falcon 180B | 1 | 8 | 2048 | 128.0 |
### L40S GPUs (FP8)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :-------- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 71 |
| GPT-J 6B | 1 | 1 | 128 | 12.6 |
| GPT-J 6B | 1 | 1 | 2048 | 61.2 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 14 |
| LLaMA 7B | 1 | 1 | 2048 | 73 |
| Mistral 7B | 1 | 1 | 128 | 15.5 |
| Mistral 7B | 1 | 1 | 2048 | 84.3 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 14.3 |
| LLaMA 7B | 1 | 1 | 2048 | 79.0 |
| | | | | |
| LLaMA 70B | 1 | 8 | 128 | 70.9 |
| LLaMA 70B | 1 | 8 | 2048 | 708.7 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 93.4 |
| Falcon 180B | 1 | 8 | 2048 | 769.8 |
### A100 GPUs (FP16)
| Model | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :--------------------------- | :--------- | :-------- | :----------- | ---------------------: |
| GPT-J 6B | 1 | 1 | 128 | 12 |
| GPT-J 6B | 1 | 1 | 2048 | 129 |
| GPT-J 6B | 1 | 1 | 128 | 14.1 |
| GPT-J 6B | 1 | 1 | 2048 | 102.8 |
| | | | | |
| LLaMA 7B | 1 | 1 | 128 | 16 |
| LLaMA 7B | 1 | 1 | 2048 | 133 |
| Mistral 7B | 1 | 1 | 128 | 16.4 |
| Mistral 7B | 1 | 1 | 2048 | 128.7 |
| | | | | |
| LLaMA 70B | 1 | 4 | 128 | 47 |
| LLaMA 70B | 1 | 4 | 2048 | 377 |
| LLaMA 7B | 1 | 1 | 128 | 16.1 |
| LLaMA 7B | 1 | 1 | 2048 | 120.5 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 61 |
| Falcon 180B | 1 | 8 | 2048 | 509 |
| LLaMA 70B | 1 | 8 | 128 | 35.6 |
| LLaMA 70B | 1 | 8 | 2048 | 235.1 |
| | | | | |
| Falcon 180B | 1 | 8 | 128 | 76.5 |
| Falcon 180B | 1 | 8 | 2048 | 463.0 |
(1) TP stands for Tensor Parallelism.
@ -476,7 +568,7 @@ Prepare a config json file `/tmp/engines/falcon/180b/ckpt_config.json`:
```json
{
"architecture": "FalconForCausalLM",
"dtype": "float16",
"dtype": "bfloat16",
"num_hidden_layers": 80,
"num_attention_heads": 232,
"num_key_value_heads": 8,
@ -523,8 +615,8 @@ do
--workers 8 \
--remove_input_padding enable \
--context_fmha enable \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--paged_kv_cache enable \
--max_batch_size $batch_size \
--max_input_len $isl \

View File

@ -4,7 +4,7 @@ NVIDIA Nsight Systems reports at the application level are highly informative. M
Given the potentially long runtimes of Large Language Models (LLMs) and the diversity of workloads a model may experience during a single inference pass or binary execution, we have added features to TensorRT-LLM to get the most out of Nsight Systems capabilities. This document outlines those features and provides examples of how to best utilize them to understand your application.
# Feature Descriptions
## Feature Descriptions
The main functionality here:
* Relies on toggling the CUDA profiler runtime API on and off.
@ -35,7 +35,7 @@ To profile just those iterations, in addition to setting `TLLM_GPTS_PROFILE_STAR
* We need to tell Nsight Systems to look for explicit API triggers to profile (`-c cudaProfilerApi`)
* We need to tell Nsight Systems to keep profiling after seeing a profile stop API call (`--capture-range-end="repeat[]"`)
# Examples
## Examples
Consult the Nsight Systems User Guide for a full overview of MPI-related options.
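As a point of reference, the pattern that the options above rely on is plain gating of `cudaProfilerStart`/`cudaProfilerStop` around selected iterations, similar to what the runtime does internally. The sketch below is illustrative only; the loop body, iteration bounds, and helper name are assumptions, not TensorRT-LLM code:

```cpp
#include <cuda_profiler_api.h>

// Illustrative sketch: profile only iterations [profileStart, profileStop] so that
// `nsys profile -c cudaProfilerApi --capture-range-end="repeat[]"` records one
// capture range per gated iteration.
void runIterations(int numIterations, int profileStart, int profileStop)
{
    for (int step = 0; step < numIterations; ++step)
    {
        bool const profileThisStep = step >= profileStart && step <= profileStop;
        if (profileThisStep)
        {
            cudaProfilerStart(); // Nsight Systems opens a capture range here
        }

        // ... run the forward step / launch the kernels for this iteration ...

        if (profileThisStep)
        {
            cudaProfilerStop(); // capture range closes; repeat[] keeps nsys recording later ranges
        }
    }
}
```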
## Profiling a single IFB iteration executing on a single rank of a multi-GPU model

View File

@ -139,7 +139,8 @@ This release of TensorRT-LLM contains the following examples:
| Replit Code| Y | Y | Y | . | . | . | . | . | . |
| SantaCoder | Y | Y | Y | . | . | . | . | . | . |
| Skywork | Y | Y | Y | . | . | . | . | . | . |
| StarCoder | Y | Y | Y | . | . | . | . | . | . |
| StarCoder1 | Y | Y | Y | . | . | Y | . | . | . |
| StarCoder2 | Y | Y | Y | . | . | Y | . | . | . |
| T5 | Y | Y | Y | . | . | . | . | . | . |
| Whisper | Y | Y | Y | . | . | Y | Y | . | . |

View File

@ -6,7 +6,7 @@ This document shows how to build and run a Baichuan models (including `v1_7b`/`v
The TensorRT-LLM Baichuan implementation can be found in [tensorrt_llm/models/baichuan/model.py](../../tensorrt_llm/models/baichuan/model.py). The TensorRT-LLM Baichuan example code is located in [`examples/baichuan`](./). There is one main file:
* [`copnvert_checkpoint.py`](./copnvert_checkpoint.py) to convert supported checkpoints into TensorRT-LLM format.
* [`convert_checkpoint.py`](./convert_checkpoint.py) to convert supported checkpoints into TensorRT-LLM format.
The script accepts an argument named `model_version`, whose value should be one of `v1_7b`/`v1_13b`/`v2_7b`/`v2_13b`; the default value is `v1_13b`.
@ -20,9 +20,9 @@ In addition, there are two shared files in the parent folder [`examples`](../) f
* FP8
* BF16
* INT4 & INT8 Weight-Only
* INT8 KV CACHE (+ AWQ/per-channel weight-only)
* INT8 SmoothQuant
* Groupwise quantization (AWQ/GPTQ)
* INT8 KV CACHE (+ AWQ/per-channel weight-only/SmoothQuant)
## Usage
@ -56,27 +56,26 @@ trtllm-build --checkpoint_dir ./trt_ckpt/baichuan_v1_13b/ \
Here are some checkpoint conversion examples, using `v1_13b` as an example:
```bash
# Build a single-GPU float16 engine from HF weights.
# Build the Baichuan V1 13B model using a single GPU and FP16.
# Convert the Baichuan V1 13B model using a single GPU and FP16.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/fp16/1-gpu/
# Build the Baichuan V1 13B model using a single GPU and BF16.
# Convert the Baichuan V1 13B model using a single GPU and BF16.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype bfloat16 \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/bf16/1-gpu/
# Build the Baichuan V1 13B model using a single GPU and apply INT8 weight-only quantization.
# Convert the Baichuan V1 13B model using a single GPU and apply INT8 weight-only quantization.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--use_weight_only \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/int8_weight_only/1-gpu/
# Build the Baichuan V1 13B model using a single GPU and apply INT4 weight-only quantization.
# Convert the Baichuan V1 13B model using a single GPU and apply INT4 weight-only quantization.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
@ -84,7 +83,7 @@ python convert_checkpoint.py --model_version v1_13b \
--weight_only_precision int4 \
--output_dir ./tmp/baichuan_v1_13b/trt_engines/int4_weight_only/1-gpu/
# Build Baichuan V1 13B using 2-way tensor parallelism.
# Convert Baichuan V1 13B using 2-way tensor parallelism.
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
@ -93,47 +92,6 @@ python convert_checkpoint.py --model_version v1_13b \
--tp_size 2
```
#### INT8 KV cache
INT8 KV cache could be enabled to reduce memory footprint. It will bring more performance gains when batch size gets larger.
You can get the INT8 scale of KV cache through NVIDIA AMMO (AlgorithMic Model Optimization) toolkit, which features a
`--kv_cache_dtype` option.
Example:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + per-channel weight-only quantization**
INT8 KV cache could be combined with per-channel weight-only quantization, as follows:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--qformat int4_wo \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int4wo_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + AWQ**
In addition, you can enable INT8 KV cache together with AWQ (per-group INT4 weight-only quantization), as follows:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--qformat int4_awq \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int4awq_int8kv_tp1 \
--calib_size 512
```
#### SmoothQuant
SmoothQuant supports all Baichuan model variants. Unlike the FP16 build, where the HF weights are processed and loaded into TensorRT-LLM directly, SmoothQuant needs to load INT8 weights, which should be pre-processed before building an engine.
@ -210,6 +168,62 @@ To run the GPTQ Baichuan example, the following steps are required:
```
The quantized model checkpoint is saved for future TensorRT-LLM engine build directly with the `trtllm-build` command mentioned above.
#### INT8 KV cache
INT8 KV cache can be enabled to reduce the memory footprint. It brings larger performance gains as the batch size grows.
You can obtain the INT8 scales for the KV cache through the NVIDIA AMMO (AlgorithMic Model Optimization) toolkit, which provides a
`--kv_cache_dtype` option.
Example:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + per-channel weight-only quantization**
INT8 KV cache could be combined with per-channel weight-only quantization, as follows:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--qformat int4_wo \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int4wo_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + AWQ**
In addition, you can enable INT8 KV cache together with AWQ (per-group INT4 weight-only quantization), as follows:
```bash
python ../quantization/quantize.py --model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--qformat int4_awq \
--kv_cache_dtype int8 \
--output_dir ./trt_ckpt/baichuan_int4awq_int8kv_tp1 \
--calib_size 512
```
**INT8 KV cache + INT8 SmoothQuant**
In addition, you can enable INT8 KV cache together with INT8 SmoothQuant, as follows:
```bash
python convert_checkpoint.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--smoothquant 0.8 \
--per_channel \
--per_token \
--int8_kv_cache \
--output_dir ./tmp/baichuan_v1_13b/sq0.8/1-gpu/
```
### Run
To run a TensorRT-LLM Baichuan model using the engines generated by `trtllm-build`

View File

@ -67,13 +67,6 @@ def parse_arguments():
type=int,
default=0,
help='Setting to a value > 0 enables support for prompt tuning.')
parser.add_argument(
"--calibrate_kv_cache",
"-kv",
action="store_true",
help=
"Generate scaling factors for KV cache. Used for storing KV cache in int8."
)
parser.add_argument(
'--per_channel',
default=False,
@ -1100,12 +1093,9 @@ def convert_baichuan_gptq(hf_config: AutoConfig,
# 4. Weights inside each layer
num_hidden_layers = hf_config.num_hidden_layers
layers_per_pipeline_stage = num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
layers_range = mapping.pp_layers(num_hidden_layers)
for l in layers_range:
layer_idx = l - mapping.pp_rank * layers_per_pipeline_stage
layer_idx = l - layers_range[0]
prefix = f"layers.{l}."
tllm_prefix = f"transformer.layers.{l}."
tensorrt_llm.logger.info(f'Process weights in layer: {layer_idx}')
@ -1189,7 +1179,7 @@ if __name__ == '__main__':
elif args.per_token and not args.per_channel:
quant_algo = 'W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN'
if args.calibrate_kv_cache:
if args.int8_kv_cache:
kv_cache_quant_algo = "INT8"
else:
kv_cache_quant_algo = None
@ -1252,7 +1242,7 @@ if __name__ == '__main__':
hf_model = AutoModelForCausalLM.from_pretrained(args.model_dir,
trust_remote_code=True,
torch_dtype="auto")
if args.smoothquant is not None or args.calibrate_kv_cache:
if args.smoothquant is not None or args.int8_kv_cache:
act_range = {}
baichuan_smoother = {}
act_range = capture_activation_range(
@ -1265,9 +1255,8 @@ if __name__ == '__main__':
baichuan_smoother)
weights = convert_hf_baichuan_sq(hf_model, mapping, rank,
args.dtype, args.per_channel,
args.per_token,
args.calibrate_kv_cache, act_range,
baichuan_smoother)
args.per_token, args.int8_kv_cache,
act_range, baichuan_smoother)
elif args.use_weight_only and args.weight_only_precision == 'int4_gptq':
weights = convert_baichuan_gptq(hf_config,
args.quant_ckpt_path,

View File

@ -275,7 +275,7 @@ if __name__ == '__main__':
('batch_size', [bs_range])
]))
# logits for QA BERT, or hidden_state for vanila BERT
# logits for QA BERT, or hidden_state for vanilla BERT
output = tensorrt_llm_bert(input_ids=input_ids,
input_lengths=input_lengths,
token_type_ids=token_type_ids)

View File

@ -519,7 +519,7 @@ def smooth_bloom_model(model, scales, alpha, bloom_qkv_param, bloom_smoother):
bloom_qkv_param[layer_name] = param
# dense
# enabled for better accuracy with perf overhead of quantiztion
# enabled for better accuracy with perf overhead of quantization
layer_name = name + ".self_attention.dense"
smoother = smooth_gemm(module.self_attention.dense.weight,
scales[layer_name]["x"], None, None, alpha)
@ -540,7 +540,7 @@ def smooth_bloom_model(model, scales, alpha, bloom_qkv_param, bloom_smoother):
dim=1)[0]
# fc2
# enabled for better accuracy with perf overhead of quantiztion
# enabled for better accuracy with perf overhead of quantization
layer_name = name + ".mlp.dense_4h_to_h"
smoother = smooth_gemm(module.mlp.dense_4h_to_h.weight,
scales[layer_name]["x"], None, None, alpha)

View File

@ -184,7 +184,7 @@ If the engines are run successfully, you will see output like (ChatGLM3-6B as th
* The engine(s) must be built accordingly if [in-flight batching in C++ runtime](../../docs/in_flight_batching.md) will be used.
* Use `--gpt_attention_plugin float16`, `--paged_kv_cache enable`, `--remove_input_padding enable` to build engine(s) supporting In-flight Batching.
* It is possible to use `--gpt_attention_plugin float32` with In-flight Batching.
* The size of the block in paged KV cache can be conteoled additionally by using `--tokens_per_block=N`.
* The size of the block in paged KV cache can be controlled additionally by using `--tokens_per_block=N`.
### 4. Run inference
@ -258,7 +258,7 @@ If the engines are run successfully, you will see output like (ChatGLM3-6B as th
### Weight Only quantization
Use `--use_weight_only` to enable INT8-Weight-Only quantization, this will siginficantly lower the latency and memory footprint. Furthermore, use `--weight_only_precision int8` or `--weight_only_precision int4` to configure the data type of the weights.
Use `--use_weight_only` to enable INT8-Weight-Only quantization; this will significantly lower the latency and memory footprint. Furthermore, use `--weight_only_precision int8` or `--weight_only_precision int4` to configure the data type of the weights.
```bash
# ChatGLM3-6B: single gpu, int8 weight only quantization

View File

@ -228,7 +228,7 @@ class ChatGLMTokenizer(PreTrainedTokenizer):
unk_token=unk_token,
num_image_tokens=num_image_tokens,
**kwargs)
""" Initialisation """
""" Initialization """
@property
def gmask_token_id(self) -> Optional[int]:

View File

@ -69,7 +69,7 @@ We should distinguish between `X` - TP size and `Y` - total number of GPU ranks:
# Example 1: build t5-small using a single GPU, FP32, running greedy search
# use_gpt_attention_plugin is necessary in Enc-Dec.
# Try use_gemm_plugin to prevent accuracy issue.
# It is recommend to use --remove_input_padding along with --use_gpt_attention_plugin for better performance
# It is recommended to use --remove_input_padding along with --use_gpt_attention_plugin for better performance
python build.py --model_type t5 \
--weight_dir tmp/trt_models/t5-small/tp1 \
-o tmp/trt_engines/t5-small/1-gpu \

View File

@ -533,8 +533,7 @@ def build(rank, args):
hf_modules_to_trtllm_modules=args.hf_modules_to_trtllm_modules
if args.use_lora_plugin else None,
trtllm_modules_to_hf_modules=args.trtllm_modules_to_hf_modules
if args.use_lora_plugin else None,
)
if args.use_lora_plugin else None)
engine_name = get_engine_name(args.engine_name, args.dtype,
args.tp_size, args.pp_size, cur_rank)
@ -588,7 +587,7 @@ def run_build(component):
if args.parallel_build and args.world_size > 1 and \
torch.cuda.device_count() >= args.world_size:
logger.warning(
f'Parallelly build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
f'Parallel build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
)
mp.spawn(build, nprocs=args.world_size, args=(args, ))
else:

View File

@ -84,8 +84,7 @@ Note that we need to download the dataset of MMLU first and the evaluation of MM
VOCAB_FILE_PATH=/tmp/models/gemma_nv/checkpoints/tmp_vocab.model
python3 ../run.py --engine_dir ${ENGINE_PATH} \
--max_output_len 30 \
--vocab_file ${VOCAB_FILE_PATH} \
--no_add_special_tokens
--vocab_file ${VOCAB_FILE_PATH}
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600Input [Text 0]: "<bos> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "chef in the renowned kitchens of Lyon. After honing his skills in various Michelin-starred establishments, he embarked on a solo venture, establishing his own restaurant"
@ -98,8 +97,7 @@ python3 ../summarize.py --test_trt_llm \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--vocab_file ${VOCAB_FILE_PATH} \
--no_add_special_tokens
--vocab_file ${VOCAB_FILE_PATH}
[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.2821836471557617 sec)
[02/06/2024-10:08:54] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1989)
@ -167,8 +165,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-05:04:13] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.96612286567688 sec)
[02/08/2024-05:04:13] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2510)
@ -213,8 +210,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.116227149963379 sec)
[02/08/2024-10:37:15] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2419)
@ -263,8 +259,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.460859775543213 sec)
[02/08/2024-04:42:06] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1786)
@ -308,8 +303,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.5987987518310547 sec)
[02/08/2024-04:44:54] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1797)
@ -349,8 +343,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-04:48:06] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.1938045024871826 sec)
[02/08/2024-04:48:06] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1462)
@ -393,8 +386,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT-LLM (total latency: 3.5348474979400635 sec)
[02/08/2024-04:52:22] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1819)
@ -437,8 +429,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
python3 ../mmlu.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
@ -482,8 +473,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.884302377700806 sec)
[02/08/2024-06:42:13] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2694)
@ -524,8 +514,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/19/2024-10:02:53] [TRT-LLM] [I] ---------------------------------------------------------
[02/19/2024-10:03:09] [TRT-LLM] [I] TensorRT-LLM (total latency: 13.65670919418335 sec)
@ -570,8 +559,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT-LLM (total latency: 8.49835753440857 sec)
[02/08/2024-07:38:15] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2654)
@ -611,8 +599,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-07:43:32] [TRT-LLM] [I] TensorRT-LLM (total latency: 7.282559156417847 sec)
[02/08/2024-07:43:32] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2253)
@ -655,8 +642,7 @@ python3 ../summarize.py --test_trt_llm \
--vocab_file ${VOCAB_FILE_PATH} \
--engine_dir ${ENGINE_PATH} \
--batch_size 8 \
--max_ite 5 \
--no_add_special_tokens
--max_ite 5
[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT-LLM (total latency: 8.73880124092102 sec)
[02/08/2024-07:51:11] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 2771)
@ -672,7 +658,7 @@ python3 ../summarize.py --test_trt_llm \
#### Requirements
AMMO toolkit provides quantization solutions with better accuracy. To enable it, have the latest ammo and transformers Python package installed to support Gemma. Then run the following commands.
AMMO toolkit also provides quantization solutions. To enable it, have the latest ammo and transformers Python packages installed to support Gemma. Then run the following commands.
#### Quantize Checkpoints
@ -713,7 +699,7 @@ trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
#### Accuracy Results on MMLU
| Model | fp8 | int4_awq | int8_sq |
|---------------|-------|----------|---------|
| 2B Pretrained | 0.407 | 0.378 | 0.328 |
| 7B Pretrained | 0.643 | 0.615 | 0.480 |
| Model | fp8 | int4_awq | int8_sq (AMMO) | int8_sq (Native per-channel) |
|---------------|-------|----------|----------------|------------------|
| 2B Pretrained | 0.407 | 0.378 | 0.338 | 0.338 |
| 7B Pretrained | 0.643 | 0.615 | 0.448 | 0.595 |

View File

@ -25,7 +25,7 @@ from tensorrt_llm._utils import torch_to_numpy
from tensorrt_llm.models.gemma.smoothquant import *
from tensorrt_llm.models.gemma.weight import (dummy_weights_awq,
load_from_fp8_llama,
quantize_fp8_weigths)
quantize_fp8_weights)
LOGGER = logging.getLogger("convert_checkpoint")
@ -735,7 +735,7 @@ def convert(worker_rank, args, convert_kwargs):
trt_llm_config=trt_llm_config,
group_size=128)
elif args.enable_fp8 or args.fp8_kv_cache:
weight_scales = quantize_fp8_weigths(
weight_scales = quantize_fp8_weights(
weights, trt_llm_config.num_hidden_layers,
trt_llm_config.mapping)
scales = load_from_fp8_llama(args.ammo_quant_ckpt_path,
@ -766,7 +766,6 @@ def main():
print(f"Source configuration determined from parameters: {ckpt_config}")
quant_mode = tensorrt_llm.quantization.QuantMode(0)
quant_kwargs = {}
quant_algo = None
kv_cache_quant_algo = None
@ -801,11 +800,6 @@ def main():
quant_kwargs.update(quant_algo=quant_algo,
kv_cache_quant_algo=kv_cache_quant_algo)
if quant_algo is not None or kv_cache_quant_algo is not None:
quant_mode = tensorrt_llm.quantization.QuantMode.from_quant_algo(
quant_algo,
kv_cache_quant_algo=kv_cache_quant_algo,
)
if args.use_weight_only_with_precision:
if args.use_weight_only_with_precision.endswith("awq"):
quant_kwargs.update(has_zero_point=False,
@ -830,8 +824,7 @@ def main():
world_size=args.world_size,
tp_size=args.world_size,
pp_size=1,
quant_mode=quant_mode,
quant_kwargs=quant_kwargs,
quantization=quant_kwargs,
)
trt_llm_config_dict = trt_llm_config.to_dict()

View File

@ -206,7 +206,7 @@ python3 build.py \
mpirun -np 4 python3 ../run.py --engine_dir santacoder_outputs_tp4 --tokenizer_dir ./santacoder --input_text "def print_hello_world():" --max_output_len 20 --no_add_special_tokens
```
## GPT Variant - StarCoder
## GPT Variant - StarCoder (v1 and v2)
For StarCoder, the steps are similar except that `santacoder` is swapped with `starcoder`.
@ -228,6 +228,11 @@ python3 build.py \
mpirun -np 4 python3 ../run.py --engine_dir starcoder_outputs_tp4 --tokenizer_dir ./starcoder --input_text "def print_hello_world():" --max_output_len 20 --no_add_special_tokens
```
For StarCoder2, you can use almost the same steps as shown above by just setting `--model starcoder2` when converting the huggingface models.
- Note that StarCoder2 hasn't been merged into an official release of the transformers package yet, so remember to use the [main branch of the transformers repo](https://github.com/huggingface/transformers).
- Add `--max_attention_window_size 4096` when running run.py or the summarization script, which enables sliding window attention.
- The sliding window size comes from the HF model [config.json](https://huggingface.co/bigcode/starcoder2-15b/blob/main/config.json#L23).
## Summarization using the GPT model
The following section describes how to run a TensorRT-LLM GPT model to summarize the articles from the

View File

@ -68,6 +68,7 @@ def override_args_from_model_dir(args: argparse.Namespace) -> None:
parsed_params = parse_ft_config(Path(args.model_dir) / "config.ini")
args.n_embd = parsed_params["n_embd"]
args.n_head = parsed_params["n_head"]
args.n_kv_head = parsed_params["n_kv_head"]
args.n_layer = parsed_params["n_layer"]
args.n_positions = parsed_params["n_positions"]
args.vocab_size = parsed_params["vocab_size"]
@ -82,6 +83,8 @@ def override_args_from_model_dir(args: argparse.Namespace) -> None:
args.dtype = parsed_params["dtype"]
args.inter_size = parsed_params["inter_size"]
args.multi_query_mode = parsed_params["multi_query_mode"]
else:
args.n_kv_head = 1 if args.multi_query_mode else args.n_head
def parse_arguments(args):
@ -167,7 +170,7 @@ def parse_arguments(args):
action='store_true',
help=
'Split long kv sequence into multiple blocks (applied to generation MHA kernels). \
It is beneifical when batchxnum_heads cannot fully utilize GPU.'
It is beneficial when batch x num_heads cannot fully utilize GPU.'
)
parser.add_argument('--gpus_per_node', type=int, default=8)
parser.add_argument('--builder_opt', type=int, default=None)
@ -549,6 +552,7 @@ def build_rank_engine(builder: Builder,
tensorrt_llm_gpt = tensorrt_llm.models.GPTLMHeadModel(
num_layers=args.n_layer,
num_heads=args.n_head,
num_kv_heads=args.n_kv_head,
hidden_size=args.n_embd,
inter_size=args.inter_size,
vocab_size=args.vocab_size,
@ -568,7 +572,6 @@ def build_rank_engine(builder: Builder,
apply_query_key_layer_scaling,
quant_mode=args.quant_mode,
bias=args.bias,
num_kv_heads=1 if args.multi_query_mode else args.n_head,
use_prompt_tuning=args.max_prompt_embedding_table_size > 0,
use_parallel_embedding=args.use_parallel_embedding,
embedding_sharding_dim=args.embedding_sharding_dim,
@ -712,7 +715,6 @@ def build(rank, args):
int8_trt_flag = args.quant_mode.has_act_or_weight_quant() or (
args.paged_kv_cache == False
and args.quant_mode.has_int8_kv_cache())
num_kv_heads = 1 if args.multi_query_mode else args.n_head
builder_config = builder.create_builder_config(
name=MODEL_NAME,
precision=args.dtype,
@ -722,7 +724,7 @@ def build(rank, args):
parallel_build=args.parallel_build,
num_layers=args.n_layer,
num_heads=args.n_head,
num_kv_heads=num_kv_heads,
num_kv_heads=args.n_kv_head,
hidden_size=args.n_embd,
vocab_size=args.vocab_size,
hidden_act=args.hidden_act,
@ -753,7 +755,7 @@ def build(rank, args):
cur_rank, args)
assert engine is not None, f'Failed to build engine for rank {cur_rank}'
local_num_kv_heads = (num_kv_heads + args.world_size -
local_num_kv_heads = (args.n_kv_head + args.world_size -
1) // args.world_size
kv_dtype = str_dtype_to_trt(args.dtype)
if args.quant_mode.has_int8_kv_cache():
@ -797,7 +799,7 @@ def run_build(args=None):
if args.parallel_build and args.world_size > 1 and \
torch.cuda.device_count() >= args.world_size:
logger.warning(
f'Parallelly build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
f'Parallel build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
)
mp.spawn(build, nprocs=args.world_size, args=(args, ))
else:

View File

@ -94,7 +94,7 @@ class ProgArgs:
default="gpt2",
type=str,
help="Specify GPT variants to convert checkpoints correctly",
choices=["gpt2", "santacoder", "starcoder"])
choices=["gpt2", "santacoder", "starcoder", "starcoder2"])
parser.add_argument("--storage-type",
"-t",
type=str,
@ -134,14 +134,30 @@ def smooth_gpt_model(model, scales, alpha):
# SantaCoder separates Q projection from KV projection
def concat_qkv_weight_bias(q, hf_key, hf_model):
kv = hf_model.state_dict()[hf_key.replace("q_attn", "kv_attn")]
def concat_qkv_weight_bias(q, hf_key, hf_model, model_type):
if model_type == "starcoder2":
k = hf_model.state_dict()[hf_key.replace("q_proj",
"k_proj")].to(q.device)
v = hf_model.state_dict()[hf_key.replace("q_proj",
"v_proj")].to(q.device)
if len(q.shape) == 2:
k = k.transpose(0, 1)
v = v.transpose(0, 1)
return torch.cat([q, k, v], dim=-1)
else:
kv = hf_model.state_dict()[hf_key.replace("q_attn",
"kv_attn")].to(q.device)
return torch.cat([q, kv], dim=-1)
# StarCoder uses nn.Linear for these following ops whose weight matrix is transposed compared to transformer.Conv1D
def transpose_weights(hf_name, param):
def transpose_weights(hf_name, param, model_type):
weight_to_transpose = []
if model_type == "starcoder":
weight_to_transpose = ["c_attn", "c_proj", "c_fc"]
elif model_type == "starcoder2":
weight_to_transpose = ["self_attn", "c_proj", "c_fc"]
if any([k in hf_name for k in weight_to_transpose]):
if len(param.shape) == 2:
param = param.transpose(0, 1)
@ -154,7 +170,11 @@ def gpt_to_ft_name(orig_name):
"transformer.wte.weight": "model.wte",
"transformer.ln_f.bias": "model.final_layernorm.bias",
"transformer.ln_f.weight": "model.final_layernorm.weight",
"lm_head.weight": "model.lm_head.weight"
"lm_head.weight": "model.lm_head.weight",
# StarCoder2
"model.embed_tokens.weight": "model.wte",
"model.norm.weight": "model.final_layernorm.weight",
"model.norm.bias": "model.final_layernorm.bias"
}
if orig_name in global_weights:
@ -181,6 +201,25 @@ def gpt_to_ft_name(orig_name):
"transformer.mlp.c_fc.weight": "mlp.dense_h_to_4h.weight",
"transformer.mlp.c_proj.bias": "mlp.dense_4h_to_h.bias",
"transformer.mlp.c_proj.weight": "mlp.dense_4h_to_h.weight",
# StarCoder2
"transformer.input_layernorm.bias": "input_layernorm.bias",
"transformer.input_layernorm.weight": "input_layernorm.weight",
"transformer.self_attn.q_proj.bias": "attention.query.bias",
"transformer.self_attn.q_proj.weight": "attention.query.weight",
"transformer.self_attn.k_proj.weight": "attention.key.weight",
"transformer.self_attn.k_proj.bias": "attention.key.bias",
"transformer.self_attn.v_proj.weight": "attention.value.weight",
"transformer.self_attn.v_proj.bias": "attention.value.bias",
"transformer.self_attn.o_proj.bias": "attention.dense.bias",
"transformer.self_attn.o_proj.weight": "attention.dense.weight",
"transformer.post_attention_layernorm.bias":
"post_attention_layernorm.bias",
"transformer.post_attention_layernorm.weight":
"post_attention_layernorm.weight",
"transformer.mlp.c_fc.bias": "mlp.dense_h_to_4h.bias",
"transformer.mlp.c_fc.weight": "mlp.dense_h_to_4h.weight",
"transformer.mlp.c_proj.bias": "mlp.dense_4h_to_h.bias",
"transformer.mlp.c_proj.weight": "mlp.dense_4h_to_h.weight"
}
return f"layers.{layer_idx}.{per_layer_weights[weight_name]}"
@ -222,6 +261,9 @@ def hf_gpt_converter(args: ProgArgs):
config["gpt"][k] = f"{v}"
config["gpt"]["storage_dtype"] = args.storage_type
config["gpt"]["multi_query_mode"] = str(multi_query_mode)
num_attention_heads = int(config['gpt'].get("num_attention_heads", 0))
num_key_value_heads = 1 if multi_query_mode else int(config['gpt'].get(
"num_key_value_heads", num_attention_heads))
with open(saved_dir / "config.ini", 'w') as configfile:
config.write(configfile)
@ -246,14 +288,13 @@ def hf_gpt_converter(args: ProgArgs):
if args.convert_model_on_cpu:
param = param.cpu()
if args.model == "starcoder":
param = transpose_weights(name, param)
param = transpose_weights(name, param, args.model)
if ft_name in global_ft_weights:
torch_to_numpy(param.to(storage_type).cpu()).tofile(
saved_dir / f"{ft_name}.bin")
else:
if 'q_attn' in name:
param = concat_qkv_weight_bias(param, name, model)
if 'q_attn' in name or 'q_proj' in name:
param = concat_qkv_weight_bias(param, name, model, args.model)
ft_name = ft_name.replace("query", "query_key_value")
# Needed by QKV projection weight split. With multi_query_mode one does not simply take
# out_dim and divide it by 3 to get local_dim because out_dim = local_dim + 2 * head_size
@ -265,7 +306,9 @@ def hf_gpt_converter(args: ProgArgs):
storage_type, act_range.get(name.replace(".weight", "")), {
"int8_outputs": int8_outputs,
"multi_query_mode": multi_query_mode,
"local_dim": local_dim
"local_dim": local_dim,
"num_attention_heads": num_attention_heads,
"num_key_value_heads": num_key_value_heads
})
else:
starmap_args.append(
@ -273,7 +316,9 @@ def hf_gpt_converter(args: ProgArgs):
storage_type, act_range.get(name.replace(".weight", "")), {
"int8_outputs": int8_outputs,
"multi_query_mode": multi_query_mode,
"local_dim": local_dim
"local_dim": local_dim,
"num_attention_heads": num_attention_heads,
"num_key_value_heads": num_key_value_heads
}))
starmap_args = tqdm(starmap_args, desc="saving weights")
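The comment above notes that with `multi_query_mode` the fused QKV output dimension is `local_dim + 2 * head_size` rather than `3 * local_dim`. A minimal sketch of that arithmetic, using hypothetical GPT-style sizes (not taken from any particular checkpoint):

```python
# Hypothetical sizes, for illustration only.
hidden_size = 6144                               # local_dim of the Q projection
num_attention_heads = 48
head_size = hidden_size // num_attention_heads   # 128

# Standard MHA: Q, K and V each produce hidden_size outputs.
mha_out_dim = 3 * hidden_size                    # 18432

# Multi-query attention: K and V each have a single head.
mqa_out_dim = hidden_size + 2 * head_size        # 6400

print(mha_out_dim, mqa_out_dim)
```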

View File

@ -162,10 +162,11 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals,
storage_type, act_range, config):
use_attention_nemo_shape = config.get("use_attention_nemo_shape", False)
split_gated_activation = config.get("split_gated_activation", False)
multi_query_mode = config.get("multi_query_mode", False)
num_attention_heads = config.get("num_attention_heads", 0)
num_key_value_heads = config.get("num_key_value_heads", num_attention_heads)
tp_size = config.get("tp_size", 1)
int8_outputs = config.get("int8_outputs", None)
multi_query_mode = config.get("multi_query_mode", False)
local_dim = config.get("local_dim", None)
save_int8 = int8_outputs == "all" or int8_outputs == "kv_cache_only"
@ -236,6 +237,37 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals,
b_q, b_kv = np.split(val, [local_dim], axis=-1)
b_q_split = np.split(b_q, split_factor, axis=-1)
split_vals = [np.concatenate((i, b_kv), axis=-1) for i in b_q_split]
elif num_attention_heads != num_key_value_heads:
# GQA mode
# split_vals = np.split(vals[0], split_factor, axis=-1)
assert num_key_value_heads % split_factor == 0
val = vals[0]
qkv_hidden_dim = val.shape[0]
size_per_head = qkv_hidden_dim // (num_attention_heads +
2 * num_key_value_heads)
num_attention_heads // num_key_value_heads
val = val.reshape(num_attention_heads + 2 * num_key_value_heads,
size_per_head)
# Split the QKV to separate variables.
qkv = np.split(val, [
num_attention_heads, num_attention_heads + num_key_value_heads
],
axis=0)
q_split = np.split(qkv[0], split_factor, axis=0)
k_split = np.split(qkv[1], split_factor, axis=0)
v_split = np.split(qkv[2], split_factor, axis=0)
# Concatenate Q, K, and V together
split_vals = [
np.concatenate([
q_split[i].reshape(-1), k_split[i].reshape(-1),
v_split[i].reshape(-1)
],
axis=0) for i in range(split_factor)
]
else:
if use_attention_nemo_shape:
head_num = num_attention_heads // tp_size
@ -261,6 +293,35 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals,
w_q, w_kv = np.split(val, [local_dim], axis=-1)
w_q_split = np.split(w_q, split_factor, axis=-1)
split_vals = [np.concatenate((i, w_kv), axis=-1) for i in w_q_split]
elif num_attention_heads != num_key_value_heads:
# GQA mode.
assert num_key_value_heads % split_factor == 0
val = vals[0]
size_per_head = hidden_dim // num_attention_heads
num_attention_heads // num_key_value_heads
val = val.reshape(hidden_dim,
num_attention_heads + 2 * num_key_value_heads,
size_per_head)
# Split the QKV to separate variables.
qkv = np.split(val, [
num_attention_heads, num_attention_heads + num_key_value_heads
],
axis=1)
q_split = np.split(qkv[0], split_factor, axis=1)
k_split = np.split(qkv[1], split_factor, axis=1)
v_split = np.split(qkv[2], split_factor, axis=1)
# Concatenate Q, K, and V together
split_vals = [
np.concatenate([
q_split[i].reshape(hidden_dim, -1), k_split[i].reshape(
hidden_dim, -1), v_split[i].reshape(hidden_dim, -1)
],
axis=1) for i in range(split_factor)
]
else:
if use_attention_nemo_shape:
head_num = num_attention_heads // tp_size
@ -291,7 +352,9 @@ def split_and_save_weight(tp_rank, saved_dir, split_factor, key, vals,
kv_cache_only=int8_outputs == "kv_cache_only")
elif ("attention.query.weight" in key or "attention.query.bias" in key
or "attention.key_value.weight" in key
or "attention.key_value.bias" in key):
or "attention.key_value.bias" in key or "attention.key.weight" in key
or "attention.key.bias" in key or "attention.value.weight" in key
or "attention.value.bias" in key):
pass
else:
print(f"[WARNING] {key} not handled by converter")

View File

@ -59,10 +59,85 @@ def split(v, tp_size, idx, dim=0):
return None
def parse_sc2_config(ini_file):
gpt_config = configparser.ConfigParser()
gpt_config.read(ini_file)
n_embd = gpt_config.getint('gpt', 'hidden_size')
n_head = gpt_config.getint('gpt', 'num_attention_heads')
n_kv_head = gpt_config.getint('gpt', 'num_key_value_heads')
n_layer = gpt_config.getint('gpt', 'num_hidden_layers')
n_positions = gpt_config.getint('gpt', 'max_position_embeddings')
vocab_size = gpt_config.getint('gpt', 'vocab_size')
do_layer_norm_before = gpt_config.getboolean('gpt',
'do_layer_norm_before',
fallback=True)
rotary_base = gpt_config.getfloat('gpt', 'rope_theta', fallback=None)
rotary_scaling_type = gpt_config.get('gpt',
'rotary_scaling_type',
fallback=None)
rotary_scaling_factor = gpt_config.get('gpt',
'rotary_scaling_factor',
fallback=None)
if rotary_scaling_type is None:
if rotary_scaling_factor is not None:
raise ValueError(
f"'rotary_scaling_factor={rotary_scaling_factor}' is found in ini "
f"config file {ini_file}, whereas 'rotary_scaling_type' is missing "
f"in the config. The 'rotary_scaling_factor' will be ignored and "
f"rotary scaling will not be used.")
rotary_scaling = None
else:
if rotary_scaling_factor is None:
raise ValueError(
f"'rotary_scaling_factor={rotary_scaling_factor}' was not found "
f"in ini config file {ini_file}, whereas 'rotary_scaling_type' is "
f"provided and equals {repr(rotary_scaling_type)}.")
rotary_scaling = [rotary_scaling_type, rotary_scaling_factor]
rotary_pct = 1.0
hidden_act = "gelu"
bias = gpt_config.getboolean('gpt', 'use_bias', fallback=True)
inter_size = gpt_config.getint('gpt', 'intermediate_size', fallback=None)
dtype = gpt_config.get('gpt', 'storage_dtype', fallback='float32')
if inter_size is None:
inter_size = 4 * n_embd
multi_query_mode = gpt_config.getboolean('gpt',
'multi_query_mode',
fallback=False)
prompt_num_tasks = gpt_config.getint('gpt', 'prompt_num_tasks', fallback=0)
prompt_max_vocab_size = gpt_config.getint('gpt',
'prompt_max_vocab_size',
fallback=0)
return {
"n_embd": n_embd,
"n_head": n_head,
"n_kv_head": n_kv_head,
"n_layer": n_layer,
"n_positions": n_positions,
"vocab_size": vocab_size,
"do_layer_norm_before": do_layer_norm_before,
"hidden_act": hidden_act,
"rotary_pct": rotary_pct,
"rotary_base": rotary_base,
"rotary_scaling": rotary_scaling,
"bias": bias,
"inter_size": inter_size,
"multi_query_mode": multi_query_mode,
"dtype": dtype,
"prompt_num_tasks": prompt_num_tasks,
"prompt_max_vocab_size": prompt_max_vocab_size
}
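For readers unfamiliar with the converter's `config.ini` layout, here is a minimal sketch of the kind of `[gpt]` section `parse_sc2_config` expects; the values are made up for illustration and are not those of any real checkpoint:

```python
import configparser

# Hypothetical StarCoder2-style config, values for illustration only.
ini_text = """
[gpt]
model = starcoder2
hidden_size = 6144
num_attention_heads = 48
num_key_value_heads = 4
num_hidden_layers = 40
max_position_embeddings = 16384
vocab_size = 49152
rope_theta = 100000.0
use_bias = True
intermediate_size = 24576
storage_dtype = float16
"""

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)
print(cfg.getint('gpt', 'hidden_size'))                      # 6144
print(cfg.getfloat('gpt', 'rope_theta', fallback=None))      # 100000.0
print(cfg.get('gpt', 'rotary_scaling_type', fallback=None))  # None -> no rotary scaling
```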
def parse_ft_config(ini_file):
gpt_config = configparser.ConfigParser()
gpt_config.read(ini_file)
if gpt_config.get("gpt", "model", fallback=None) == "starcoder2":
return parse_sc2_config(ini_file)
n_embd = gpt_config.getint('gpt', 'n_embd')
n_head = gpt_config.getint('gpt', 'n_head')
n_layer = gpt_config.getint('gpt', 'n_layer')
@ -112,6 +187,7 @@ def parse_ft_config(ini_file):
return {
"n_embd": n_embd,
"n_head": n_head,
"n_kv_head": 1 if multi_query_mode else n_head,
"n_layer": n_layer,
"n_positions": n_positions,
"vocab_size": vocab_size,
@ -157,6 +233,8 @@ def load_from_ft(tensorrt_llm_gpt: GPTLMHeadModel,
_parsed_params = parse_ft_config(Path(dir_path) / 'config.ini')
n_embd = _parsed_params["n_embd"]
n_head = _parsed_params["n_head"]
n_kv_head = _parsed_params["n_kv_head"]
head_size = n_embd // n_head
n_layer = _parsed_params["n_layer"]
n_positions = _parsed_params["n_positions"]
vocab_size = _parsed_params["vocab_size"]
@ -164,7 +242,6 @@ def load_from_ft(tensorrt_llm_gpt: GPTLMHeadModel,
hidden_act = _parsed_params["hidden_act"]
bias = _parsed_params["bias"]
inter_size = _parsed_params["inter_size"]
multi_query_mode = _parsed_params["multi_query_mode"]
np_dtype = str_dtype_to_np(dtype)
@ -284,10 +361,8 @@ def load_from_ft(tensorrt_llm_gpt: GPTLMHeadModel,
split(lm_head_weight, tensor_parallel, rank))
fake_fp8_sf_dt = np.float32
for i in range(n_layer):
c_attn_out_dim = (3 * n_embd //
tensor_parallel) if not multi_query_mode else (
n_embd // tensor_parallel +
(n_embd // n_head) * 2)
c_attn_out_dim = ((n_head // tensor_parallel) +
max(n_kv_head // tensor_parallel, 1) * 2) * head_size
gpt_layer = tensorrt_llm_gpt.layers[i]
gpt_layer.input_layernorm.weight.value = (fromfile(
dir_path, 'model.layers.' + str(i) + '.input_layernorm.weight.bin'))
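The new `c_attn_out_dim` formula above generalizes the old MHA/MQA special cases to GQA: each rank holds `n_head / tp` query heads plus at least one K and one V head. A quick numeric check with hypothetical sizes:

```python
# Hypothetical sizes, for illustration only.
n_embd, n_head, tensor_parallel = 6144, 48, 4
head_size = n_embd // n_head                      # 128

def c_attn_out_dim(n_kv_head):
    return ((n_head // tensor_parallel) +
            max(n_kv_head // tensor_parallel, 1) * 2) * head_size

print(c_attn_out_dim(n_head))   # MHA: (12 + 24) * 128 = 4608 == 3 * n_embd / tp
print(c_attn_out_dim(1))        # MQA: (12 + 2)  * 128 = 1792
print(c_attn_out_dim(4))        # GQA: (12 + 2)  * 128 = 1792 (one KV head per rank)
```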

View File

@ -149,7 +149,7 @@ sh gptq_convert.sh
### 3. Convert weights from HF Transformers to TensorRT-LLM format
To apply groupwise quantization GPTQ, addition commandline flags need to be passed to `convert_checkpoint.py`:
To apply GPTQ groupwise quantization, additional command-line flags need to be passed to `convert_checkpoint.py`:
Here `--ammo_quant_ckpt_path` flag specifies the output safetensors of `gptq_convert.sh` script.
```bash
@ -173,7 +173,7 @@ python3 convert_checkpoint.py --model_dir ./gptneox_model \
### 4. Build TensorRT engine(s)
The command to build TensorRT engines to apply GPTQ are almost no change:
The command to build TensorRT engines with GPTQ does not change:
```bash
# Single GPU
@ -197,7 +197,7 @@ trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/ \
### 5. Summarization using the GPT-NeoX model
The command to run summarization with GPTQ qunatized model are also no change:
The command to run summarization with the GPTQ-quantized model also does not change:
```bash
# Single GPU

View File

@ -322,13 +322,9 @@ def load_from_gptq_gptneox(quant_ckpt_path,
weights['transformer.ln_f.bias'] = b.to(torch_dtype)
# 4. Weights inside each layer
num_hidden_layers = hf_config.num_hidden_layers
layers_per_pipeline_stage = num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
layers_range = mapping.pp_layers(num_hidden_layers)
for l in layers_range:
layer_idx = l - mapping.pp_rank * layers_per_pipeline_stage
layer_idx = l - layers_range[0]
prefix = "layers" + split_sym + str(l) + split_sym
tensorrt_llm.logger.info(f'Process weights in layer: {layer_idx}')
# layer = tensorrt_llm_llama.layers[layer_idx]
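Several hunks in this commit replace the hand-rolled pipeline-parallel layer bookkeeping with `mapping.pp_layers(num_hidden_layers)`. For reference, the computation being replaced is equivalent to the sketch below (assuming, as the old code did, that the layer count divides evenly by the number of pipeline stages):

```python
def pp_layers_equivalent(num_hidden_layers, pp_size, pp_rank):
    # Contiguous slice of layer indices owned by this pipeline-parallel rank.
    layers_per_stage = num_hidden_layers // pp_size
    return list(range(pp_rank * layers_per_stage,
                      (pp_rank + 1) * layers_per_stage))

# The local index within the engine is the offset from the first owned layer.
layers_range = pp_layers_equivalent(num_hidden_layers=32, pp_size=4, pp_rank=1)
local_indices = [l - layers_range[0] for l in layers_range]   # 0..7 for layers 8..15
```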

View File

@ -86,7 +86,7 @@ def run_llm_generate_async_example(prompts: List[str],
config = ModelConfig(llama_model_dir)
config.parallel_config.tp_size = tp_size
llm = LLM(config, async_mode=True, kvcahe_free_gpu_memory_fraction=0.4)
llm = LLM(config, kvcache_free_gpu_memory_fraction=0.4)
async def task(prompt: str):
outputs = []
@ -146,7 +146,7 @@ def _parse_arguments():
help='The directory to dump the engine.',
default=None)
parser.add_argument('--quant_type', type=str, choices=['int4_awq', 'fp8'])
parser.add_argument('--prompt', type=str)
parser.add_argument('--prompt', type=str, default="What is LLM?")
parser.add_argument('--tp_size', type=int, default=1)
parser.add_argument('--streaming', action='store_true')
return parser.parse_args()

View File

@ -32,7 +32,7 @@ InternLM has released several checkpoints of different size or capabilities unde
The examples below use [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) and [internlm-chat-20b](https://huggingface.co/internlm/internlm-chat-20b) and assume these repositories are cloned or linked under this directory, for example `./internlm-chat-7b/`.
Normally `trtllm-build` only requires single GPU, but if you've already got all the GPUs needed while inferencing, you could enable parallel building to make the engine building process faster by adding `--workers` argument. Please note that currently `--workers` feature only supports single node.
Normally `trtllm-build` only requires a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building to make the engine building process faster by adding the `--workers` argument. Please note that currently the `--workers` feature only supports a single node.
Here are some examples:

View File

@ -841,17 +841,15 @@ def convert_hf_internlm(hf_model,
num_key_value_heads = hf_model.config.num_attention_heads
mha_mode = (num_key_value_heads == num_attention_heads)
layers_per_pipeline_stage = hf_model.config.num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
num_hidden_layers = hf_model.config.num_hidden_layers
layers_range = mapping.pp_layers(num_hidden_layers)
if moe_config and moe_config.has_moe():
rank_experts = list(range(moe_config.num_experts))
if moe_config.tp_mode == moe_config.ParallelismMode.EXPERT_PARALLEL:
rank_experts = mapping.ep_experts(moe_config.num_experts)
for l in range(hf_model.config.num_hidden_layers):
for l in range(num_hidden_layers):
for suffix in ["w1", "w2", "w3"]:
model_params[f'model.layers.{l}.block_sparse_moe.experts.{suffix}.weight'] = \
torch.stack(list(model_params[f'model.layers.{l}.block_sparse_moe.experts.{expert}.{suffix}.weight']
@ -872,12 +870,10 @@ def convert_hf_internlm(hf_model,
model_params[
f'model.layers.{l}.block_sparse_moe.experts.w2.weight'] = w2
for l in range(hf_model.config.num_hidden_layers):
if l not in layers_range:
continue
for l in layers_range:
layer_idx = l - layers_range[0]
prefix = f'model.layers.{l}.'
idx = int(l) - mapping.pp_rank * layers_per_pipeline_stage
tllm_prex = f'transformer.layers.{idx}.'
tllm_prex = f'transformer.layers.{layer_idx}.'
q_weight = get_weight(model_params, prefix + 'self_attn.q_proj', dtype)
k_weight = get_weight(model_params, prefix + 'self_attn.k_proj', dtype)
@ -1183,7 +1179,7 @@ def convert_hf_internlm(hf_model,
weights['lm_head.weight'] = split_matrix_tp(lm_head_weights,
tensor_parallel,
rank,
mapping.tp_rank,
dim=0)
ln_f_w = get_weight(model_params, 'model.norm', dtype)

View File

@ -34,7 +34,7 @@ Need to prepare the HF LLaMA checkpoint first by following the guides here https
TensorRT-LLM LLaMA builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) with dummy weights.
Normally `trtllm-build` only requires single GPU, but if you've already got all the GPUs needed while inferencing, you could enable parallelly building to make the engine building process faster by adding `--workers` argument. Please note that currently `workers` feature only supports single node.
Normally `trtllm-build` only requires a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building to make the engine building process faster by adding the `--workers` argument. Please note that currently the `workers` feature only supports a single node.
`--use_fused_mlp` enables GEMM horizontal fusion in the gated MLP layer, which reduces input traffic and potentially improves performance. For FP8 PTQ, the downside is a slight reduction in accuracy because one of the quantization scaling factors is discarded (accuracy 0.45734 vs 0.45755 for LLaMA-v2 7B using ammo/examples/hf/instruct_eval/mmlu.py).
@ -159,7 +159,7 @@ The implementation is identical to Huggingface's.
Please refer to https://huggingface.co/docs/transformers/model_doc/llama2#transformers.LlamaConfig.rope_scaling for more details.
### Long context length
To use the model with Long context lengths, it is necessary to add `--multi_block_mode` in the build command to enable faster decoding in multihead attention.
To use the model with long context lengths, it is necessary to add `--multi_block_mode` to the build command to enable faster decoding in multi-head attention.
A few LLaMA models are fine-tuned for the long context lengths that TRT-LLM can support today. For example, https://huggingface.co/Yukang/LongAlpaca-70B employs rotary scaling plus fine-tuning to support up to a 32K context length. The following shows the steps for running LongAlpaca-70B in TRT-LLM:
@ -171,8 +171,6 @@ python convert_checkpoint.py --meta_ckpt_dir ./tmp/LongAlpaca-70B/ \
--output_dir ./tllm_checkpoint_8gpu_tp8 \
--dtype float16 \
--tp_size 8 \
--vocab_size=32001 \
--rotary_scaling linear 8.0
trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8 \
--output_dir ./tmp/llama/70B/trt_engines/fp16/8-gpu/ \
@ -506,26 +504,23 @@ Use the following command to build `CodeLlama-7b-Instruct`:
```bash
python convert_checkpoint.py --model_dir /tmp/CodeLlama-7b-Instruct-hf \
--output_dir ./tllm_checkpoint_1gpu_codellama \
--dtype float16 \
--rotary_base 1000000 \
--vocab_size 32016
--dtype float16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_codellama \
--output_dir ./tmp/codellama/trt_engines/fp16/1-gpu/ \
--gemm_plugin float16 \
--gemm_plugin float16
```
Use the following command to build `CodeLlama-34b-Instruct` for 4 GPUs (TP=4):
```bash
python convert_checkpoint.py --model_dir /tmp/CodeLlama-34b-Instruct-hf \
--output_dir ./tllm_checkpoint_4gpu_codellama \
--dtype float16 \
--rotary_base 1000000 \
--vocab_size 32000 \
--tp_size 4
trtllm-build --checkpoint_dir ./tllm_checkpoint_4gpu_codellama \
--output_dir ./tmp/codellama/trt_engines/fp16/4-gpu/ \
--gemm_plugin float16 \
--gemm_plugin float16
```
NOTE: CodeLlama uses a `max_position_embeddings` of 16K.
@ -536,8 +531,6 @@ Use `--max_input_len` and `--max_output_len` (which defaults to `2048` and `512`
python convert_checkpoint.py --model_dir /tmp/CodeLlama-34b-Instruct-hf \
--output_dir ./tllm_checkpoint_4gpu_codellama \
--dtype float16 \
--rotary_base 1000000 \
--vocab_size 32000 \
--tp_size 8 \
--use_parallel_embedding
@ -625,7 +618,7 @@ Output: "我看见一个人坐在那边边看书书,我看起来还挺像你
### Run LLaMa with several lora checkpoints
In this section, we show how to run a model with multiple LoRA modules at the same time. Note that if one of the LoRA modules has a
fine-tuned embedding table or logit GEMM, users should guarantee that all the instances of the model can use the same finetuned
fine-tuned embedding table or logit GEMM, users should guarantee that all the instances of the model can use the same fine-tuned
embedding table or logit GEMM.
Here, we use two LoRA checkpoints as examples. These two LoRA checkpoints add LoRA modules to `q_proj` and `v_proj`. Because we only
support adding LoRA modules on `q`, `k` and `v` at the same time, we need to add `--lora_target_modules "attn_q" "attn_k" "attn_v"`.
@ -633,7 +626,7 @@ In this case, we assign null pointers for the `k` LoRA module in TensorRT-LLM an
As the rank of the LoRA modules of both checkpoints is 8, we can set `--max_lora_rank 8` to reduce the memory requirement for the LoRA plugin.
In this example, we use a LoRA checkpoint finetuned on the Chinese dataset `luotuo-lora-7b-0.1` and a LoRA checkpoint finetuned on
In this example, we use a LoRA checkpoint fine-tuned on the Chinese dataset `luotuo-lora-7b-0.1` and a LoRA checkpoint fine-tuned on
the Japanese dataset `Japanese-Alpaca-LoRA-7b-v0`. For the `lora_manager` to load several checkpoints, we pass several directories
of LoRA checkpoints at the same time: `--lora_dir "luotuo-lora-7b-0.1/" "Japanese-Alpaca-LoRA-7b-v0/"`.
Then, `lora_manager` will assign `lora_task_uids` to these checkpoints. The `lora_task_uids` value `-1` is a predefined value, which corresponds to

File diff suppressed because it is too large

View File

@ -43,7 +43,7 @@ def parse_args():
type=int,
default=4096,
help=
'The attention window size that controls the sliding window attention / cyclic kv cache behaviour'
'The attention window size that controls the sliding window attention / cyclic kv cache behavior'
)
parser.add_argument(
'--max_input_len',

View File

@ -30,7 +30,6 @@ def parse_arguments():
help='The path to save the baichuan TensorRT-LLM checkpoint')
parser.add_argument('--log_level', type=str, default='info')
args = parser.parse_args()
return args
@ -57,11 +56,7 @@ def get_tllm_linear_weight(weight, prefix, bias=None):
return results
def convert_hf_mamba(
hf_mamba,
rank=0,
dtype='float32',
):
def convert_hf_mamba(hf_mamba, rank=0, dtype='float32'):
weights = {}
tik = time.time()
@ -85,8 +80,9 @@ def convert_hf_mamba(
weights[tllm_weight_name] = weight
if bias is not None:
weights[tllm_bias_name] = bias
weights[tllm_prex + 'A'] = -torch.exp(
model_params[prefix + 'A_log'].float().detach())
Aparam = model_params[prefix + 'A_log'].float().detach()
Aparam = Aparam.permute(1, 0).contiguous()
weights[tllm_prex + 'A'] = -torch.exp(Aparam)
weights[tllm_prex + 'D'] = model_params[prefix + 'D'].float().detach()
# norm
prefix = f'backbone.layers.{l}.norm'
@ -130,11 +126,9 @@ def rename_hf_to_tllm(name: str):
return name
def convert_from_hf_checkpoint(
model_dir: Union[str, Path],
def convert_from_hf_checkpoint(model_dir: Union[str, Path],
rank=0,
dtype: Union[str, torch.dtype] = torch.float32,
):
dtype: Union[str, torch.dtype] = torch.float32):
logger.info('Loading weights from HF Mamba...')
tik = time.time()
@ -153,6 +147,7 @@ def convert_from_hf_checkpoint(
param_fp32 = model_params_fp32[name].detach().cpu()
if 'A_log' in name:
param = -torch.exp(param_fp32)
param = param.permute(1, 0).contiguous()
elif 'D' in name:
param = param_fp32
elif 'dt_proj.bias' in name:
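Both Mamba conversion paths above now transpose `A_log` in addition to negating its exponential (exponentiation and transposition commute elementwise, so the two hunks are equivalent). A small sketch of the resulting transformation with hypothetical state sizes, not the real checkpoint's dimensions:

```python
import torch

# Hypothetical Mamba sizes, for illustration only.
d_inner, d_state = 1536, 16
A_log = torch.randn(d_inner, d_state)

# Transpose to (d_state, d_inner), then take A = -exp(A_log).
A = -torch.exp(A_log.permute(1, 0).contiguous().float())
assert A.shape == (d_state, d_inner)
assert (A < 0).all()   # A is strictly negative by construction
```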

View File

@ -806,17 +806,12 @@ def convert_hf_llama(hf_model,
num_key_value_heads = hf_model.config.num_key_value_heads
mha_mode = (num_key_value_heads == num_attention_heads)
layers_per_pipeline_stage = hf_model.config.num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
for l in range(hf_model.config.num_hidden_layers):
if l not in layers_range:
continue
num_hidden_layers = hf_model.config.num_hidden_layers
layers_range = mapping.pp_layers(num_hidden_layers)
for l in layers_range:
layer_idx = l - layers_range[0]
prefix = f'model.layers.{l}.'
idx = int(l) - mapping.pp_rank * layers_per_pipeline_stage
tllm_prex = f'transformer.layers.{idx}.'
tllm_prex = f'transformer.layers.{layer_idx}.'
q_weight = get_weight(model_params, prefix + 'self_attn.q_proj', dtype)
k_weight = get_weight(model_params, prefix + 'self_attn.k_proj', dtype)

View File

@ -52,7 +52,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_2gpu \
--gemm_plugin float16
```
Then, you can test your engine with the [run.py](./examples/run.py) script:
Then, you can test your engine with the [run.py](../run.py) script:
```
mpirun -n 2 python3 ../run.py --engine_dir ./trt_engines/mixtral/tp2 --tokenizer_dir ./Mixtral-8x7B-v0.1 --max_output_len 8 --input_text "I love french quiche"

View File

@ -248,12 +248,7 @@ class Pipeline:
def __call__(self, prompt):
# Run the model in batch size 1 and beam size 1
if self.model_name == 'GemmaForCausalLM':
inputs = self.tokenizer.encode(prompt, add_special_tokens=False)
inputs = torch.tensor([self.tokenizer.bos_token_id] + inputs)
else:
inputs = self.tokenizer.encode(prompt,
return_tensors="pt").squeeze(0)
inputs = self.tokenizer.encode(prompt, return_tensors="pt").squeeze(0)
batch_input_ids = [inputs]
# For multi-choice tasks like MMLU, we don't need to adjust following parameters
@ -341,7 +336,7 @@ def parse_args():
type=int,
default=None,
help=
'The attention window size that controls the sliding window attention / cyclic kv cache behaviour'
'The attention window size that controls the sliding window attention / cyclic kv cache behavior'
)
parser.add_argument(
'--tokenizer_dir',
@ -394,10 +389,10 @@ def main():
debug_mode=args.debug_mode)
else:
assert args.test_hf, "Must test either TRT-LLM or HF"
if model_name.startswith("chatglm"):
auto_model_cls = AutoModel
elif model_name.startswith("glm"):
if model_name == 'ChatGLMForCausalLM' and model_version == 'glm':
auto_model_cls = AutoModelForSeq2SeqLM
elif model_name == 'ChatGLMForCausalLM' and model_version == 'chatglm':
auto_model_cls = AutoModel
else:
auto_model_cls = AutoModelForCausalLM
model = auto_model_cls.from_pretrained(

View File

@ -31,34 +31,34 @@ The [`convert_checkpoint.py`](./convert_checkpoint.py) script allows you to conv
```bash
# Generate FP16 checkpoints.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp16/ --dtype float16
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/fp16/ --dtype float16
# Generate FP32 checkpoints with TP=4.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp32_tp4/ --dtype float32 --tp_size 4
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/fp32_tp4/ --dtype float32 --tp_size 4
```
#### 1.2 Convert from HF Transformers with weight-only quantization
```bash
# Use int8 weight-only quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int8_wo/ --use_weight_only
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int8_wo/ --use_weight_only
# Use int4 weight-only quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int4_wo/ --use_weight_only --weight_only_precision int4
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int4_wo/ --use_weight_only --weight_only_precision int4
```
#### 1.3 Convert from HF Transformers with SmoothQuant quantization
```bash
# Use int8 smoothquant (weight and activation) quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int8_sq/ --smoothquant 0.5
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int8_sq/ --smoothquant 0.5
```
#### 1.4 Convert from HF Transformers with INT8 KV cache quantization
```bash
# Use int8 kv cache quantization.
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp16_int8kv/ --dtype float16 --calibrate_kv_cache
python convert_checkpoint.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/fp16_int8kv/ --dtype float16 --calibrate_kv_cache
```
***INT8 KV cache can be combined with SmoothQuant and weight-only quantization***
@ -70,31 +70,31 @@ First make sure AMMO toolkit is installed (see [examples/quantization/README.md]
```bash
# INT4 AWQ quantization using AMMO.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int4_awq/ --qformat int4_awq
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int4_awq/ --qformat int4_awq
```
#### 1.6 FP8 Post-Training Quantization with AMMO
```bash
# FP8 quantization using AMMO.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/fp8/ --qformat fp8 --kv_cache_dtype fp8
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/fp8/ --qformat fp8 --kv_cache_dtype fp8
```
#### 1.7 Weight-only quantization with AMMO
```bash
# INT8 Weight-only quantization using AMMO with TP=2.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int8_wo/ --qformat int8_wo --tp_size 2
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int8_wo/ --qformat int8_wo --tp_size 2
# INT4 Weight-only quantization using AMMO.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/int4_wo/ --qformat int4_wo
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/int4_wo/ --qformat int4_wo
```
#### 1.8 SmoothQuant and INT8 KV cache with AMMO
```bash
# Use int8 SmoothQuant quantization with int8 KV cache.
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ft_ckpts/mpt-7b/sq_int8kv/ --qformat int8_sq --kv_cache_dtype int8
python ../quantization/quantize.py --model_dir mosaicml/mpt-7b --output_dir ./ckpts/mpt-7b/sq_int8kv/ --qformat int8_sq --kv_cache_dtype int8
```
***INT8 KV cache can also be combined with weight-only quantization***
@ -105,13 +105,13 @@ All of the checkpoint generated by `convert_checkpoint.py` or `quantize.py` (AMM
```bash
# Build a single-GPU float16 engine using TRTLLM checkpoints.
trtllm-build --checkpoint_dir=./ft_ckpts/mpt-7b/fp16/1-gpu \
trtllm-build --checkpoint_dir=./ckpts/mpt-7b/fp16 \
--max_batch_size 32 \
--max_input_len 1024 \
--max_output_len 512 \
--gemm_plugin
--gemm_plugin float16 \
--workers 1 \
--output_dir ./trt_engines/mpt-7b/fp16/1-gpu
--output_dir ./trt_engines/mpt-7b/fp16
```
### MPT 30B
@ -123,7 +123,7 @@ Same commands can be changed to convert MPT 30B to TRT LLM format. Below is an e
The [`convert_checkpoint.py`](./convert_checkpoint.py) script allows you to convert weights from HF Transformers format to TRTLLM format.
```bash
python convert_checkpoint.py --model_dir mosaicml/mpt-30b --output_dir ./ft_ckpts/mpt-30b/fp16_tp4/ --tp_szie 4 --dtype float16
python convert_checkpoint.py --model_dir mosaicml/mpt-30b --output_dir ./ckpts/mpt-30b/fp16_tp4/ --tp_size 4 --dtype float16
```
#### 2. Build TensorRT engine(s)
@ -132,11 +132,11 @@ Examples of build invocations:
```bash
# Build 4-GPU MPT-30B float16 engines
trtllm-build --checkpoint_dir ./ft_ckpts/mpt-30b/fp16_tp4 \
trtllm-build --checkpoint_dir ./ckpts/mpt-30b/fp16_tp4 \
--max_batch_size 32 \
--max_input_len 1024 \
--max_output_len 512 \
--gemm_plugin
--gemm_plugin float16 \
--workers 4 \
--output_dir ./trt_engines/mpt-30b/fp16_tp4
```
@ -159,7 +159,7 @@ Same commands can be changed to convert [Replit Code V-1.5 3B](https://huggingfa
The [`convert_checkpoint.py`](./convert_checkpoint.py) script allows you to convert weights from HF Transformers format to TRTLLM format.
```bash
python convert_checkpoint.py --model_dir ./replit-code-v1_5-3b --output_dir ./ft_ckpts/replit-code-v1_5-3b/bf16_tp2/ --tp_size 2 --dtype bfloat16
python convert_checkpoint.py --model_dir ./replit-code-v1_5-3b --output_dir ./ckpts/replit-code-v1_5-3b/bf16_tp2/ --tp_size 2 --dtype bfloat16
```
#### 2. Build TensorRT engine(s)
@ -168,11 +168,12 @@ Examples of build invocations:
```bash
# Build 2-GPU Replit Code V-1.5 3B bfloat16 engines
trtllm-build --checkpoint_dir ./ft_ckpts/replit-code-v1_5-3b/bf16_tp2 \
trtllm-build --checkpoint_dir ./ckpts/replit-code-v1_5-3b/bf16_tp2 \
--max_batch_size 32 \
--max_input_len 1024 \
--max_output_len 512 \
--gemm_plugin \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--workers 2 \
--output_dir ./trt_engines/replit-code-v1_5-3b/bf16_tp2
```

View File

@ -613,7 +613,7 @@ def get_tllm_param(
return results
def convert_hf_mpt_lagacy(hf_model,
def convert_hf_mpt_legacy(hf_model,
mapping,
rank=0,
dtype='float32',
@ -967,7 +967,8 @@ if __name__ == '__main__':
'pp_size': args.pp_size,
},
'bias': (not hf_config.no_bias),
'clip_qkv': hf_config.attn_config['clip_qkv']
'clip_qkv': hf_config.attn_config['clip_qkv'],
'alibi_bias_max': hf_config.attn_config['alibi_bias_max']
}
with open(os.path.join(args.output_dir, 'config.json'), 'w') as f:
@ -998,7 +999,7 @@ if __name__ == '__main__':
if args.smoothquant is not None:
smooth_mpt_model(hf_model, act_range, args.smoothquant,
mpt_qkv_para, mpt_smoother)
weights = convert_hf_mpt_lagacy(
weights = convert_hf_mpt_legacy(
hf_model, mapping, rank, args.dtype, args.use_weight_only,
plugin_weight_only_quant_type, args.smoothquant is not None,
args.per_channel, args.per_token, args.calibrate_kv_cache,

View File

@ -3,8 +3,11 @@ import os
import shutil
from time import time
import tensorrt as trt
# isort: off
import torch
import tensorrt as trt
# isort: on
from PIL import Image
from transformers import (AutoProcessor, Blip2ForConditionalGeneration,
Blip2Processor, LlavaForConditionalGeneration,

View File

@ -5,8 +5,12 @@ from pathlib import Path
import numpy as np
import requests
import tensorrt as trt
# isort: off
import torch
import tensorrt as trt
# isort: on
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import (AutoConfig, AutoProcessor, AutoTokenizer,
@ -127,7 +131,7 @@ class MultiModalModel:
self.runtime_mapping = self.model.session.mapping
else:
self.model = TRTLLMEncDecModel.from_engine(
self.args.hf_model_dir.split('/')[-1],
os.path.basename(self.args.hf_model_dir),
self.args.llm_engine_dir,
skip_encoder=self.args.nougat,
debug_mode=False,

View File

@ -59,7 +59,7 @@ mv Qwen-14B-Chat ./tmp/Qwen/14B
TensorRT-LLM Qwen builds TensorRT engine(s) from HF checkpoint. If no checkpoint directory is specified, TensorRT-LLM will build engine(s) with dummy weights.
Normally `build.py` only requires single GPU, but if you've already got all the GPUs needed while inferencing, you could enable parallelly building to make the engine building process faster by adding `--parallel_build` argument. Please note that currently `parallel_build` feature only supports single node.
Normally `build.py` only requires a single GPU, but if you already have all the GPUs needed for inference, you can enable parallel building to make the engine building process faster by adding the `--parallel_build` argument. Please note that currently the `parallel_build` feature only supports a single node.
Here are some examples:

View File

@ -470,7 +470,7 @@ def parse_arguments():
args.hidden_act = "silu"
args.rms_norm_eps = hf_config.layer_norm_epsilon
args.kv_channels = hf_config.kv_channels
args.rotary_emb_base = hf_config.rotary_emb_base
args.rotary_base = hf_config.rotary_emb_base
if args.n_kv_head is None:
args.n_kv_head = args.n_head
if args.n_kv_head != args.n_head:
@ -803,7 +803,7 @@ if __name__ == '__main__':
if args.parallel_build and args.world_size > 1 and \
torch.cuda.device_count() >= args.world_size:
logger.warning(
f'Parallelly build TensorRT engines. Please make sure that all of the {args.world_size} GPUs are totally free.'
f'Building TensorRT engines in parallel. Please make sure that all of the {args.world_size} GPUs are totally free.'
)
mp.spawn(build, nprocs=args.world_size, args=(args, ))
else:

View File

@ -207,17 +207,14 @@ def load_from_binary(tensorrt_llm_qwen: QWenForCausalLM,
tensorrt_llm_qwen.lm_head.weight.value = np.ascontiguousarray(
split(lm_head_weight, mapping.tp_size, mapping.tp_rank))
layers_per_pipeline_stage = tensorrt_llm_qwen.num_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
num_hidden_layers = tensorrt_llm_qwen.num_layers
layers_range = mapping.pp_layers(num_hidden_layers)
for i in layers_range:
c_attn_out_dim = (3 * hidden_size //
mapping.tp_size) if not multi_query_mode else (
hidden_size // mapping.tp_size +
(hidden_size // num_hidden_layers) * 2)
idx = i - mapping.pp_rank * layers_per_pipeline_stage
idx = i - layers_range[0]
tensorrt_llm_qwen.layers[idx].ln_1.weight.value = fromfile(
dir_path, 'model.layers.' + str(i) + '.ln_1.weight.bin')
@ -406,10 +403,9 @@ def load_from_hf_qwen(tensorrt_llm_qwen: tensorrt_llm.models.QWenForCausalLM,
model_params = dict(hf_qwen.named_parameters())
torch_dtype = str_dtype_to_torch(dtype)
layers_per_pipeline_stage = hf_qwen.config.num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
num_hidden_layers = hf_qwen.config.num_hidden_layers
layers_range = mapping.pp_layers(num_hidden_layers)
for k, v in tqdm(model_params.items(),
total=len(model_params),
@ -438,7 +434,7 @@ def load_from_hf_qwen(tensorrt_llm_qwen: tensorrt_llm.models.QWenForCausalLM,
layer_idx = extract_layer_idx(k)
if layer_idx is None or int(layer_idx) not in layers_range:
continue
idx = int(layer_idx) - mapping.pp_rank * layers_per_pipeline_stage
idx = int(layer_idx) - layers_range[0]
if idx >= tensorrt_llm_qwen.num_layers:
continue
if 'ln_1.weight' in k:
@ -631,13 +627,7 @@ def load_from_gptq_qwen(
num_hidden_layers = max(layer_ids) + 1
suffixs = ["qweight", "qzeros", "scales"]
layers_per_pipeline_stage = num_hidden_layers // mapping.pp_size
layers_range = list(
range(
mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage,
1,
))
layers_range = mapping.pp_layers(num_hidden_layers)
torch_dtype = str_dtype_to_torch(dtype)
for layer in tqdm(layers_range,
ncols=80,
@ -655,7 +645,7 @@ def load_from_gptq_qwen(
# dtype: int32, int32, float16
split_qkv_suf.append(split_qkv)
idx = layer - mapping.pp_rank * layers_per_pipeline_stage
idx = layer - layers_range[0]
th_bias = model_params[prefix + "c_attn.bias"].to(
torch_dtype).cpu().contiguous()
@ -709,7 +699,7 @@ def load_from_gptq_qwen(
idx = int(layer_idx)
if idx not in layers_range:
continue
idx = idx - mapping.pp_rank * layers_per_pipeline_stage
idx = idx - layers_range[0]
if "ln_1.weight" in k:
tensorrt_llm_qwen.layers[idx].ln_1.weight.value = v
@ -791,7 +781,7 @@ def load_from_gptq_qwen(
dst.value = np.ascontiguousarray(split_v)
tok = time.time()
t = time.strftime("%h:%m:%s", time.gmtime(tok - tik))
t = time.strftime("%H:%M:%S", time.gmtime(tok - tik))
tensorrt_llm.logger.info(f"weights loaded. total time: {t}")
@ -919,11 +909,7 @@ def load_from_awq_qwen(tensorrt_llm_qwen: QWenForCausalLM,
]
num_hidden_layers = max(layer_ids) + 1
layers_per_pipeline_stage = num_hidden_layers // mapping.pp_size
layers_range = list(
range(mapping.pp_rank * layers_per_pipeline_stage,
(mapping.pp_rank + 1) * layers_per_pipeline_stage, 1))
layers_range = mapping.pp_layers(num_hidden_layers)
for layer_idx in tqdm(layers_range, "Loading weights..."):
prefix = "transformer.h." + str(layer_idx) + "."
for idx, awq_attr in enumerate(awq_block_names):

View File

@ -40,7 +40,7 @@ def parse_arguments(args=None):
type=int,
default=None,
help=
'The attention window size that controls the sliding window attention / cyclic kv cache behaviour'
'The attention window size that controls the sliding window attention / cyclic kv cache behavior'
)
parser.add_argument('--sink_token_length',
type=int,
@ -231,8 +231,6 @@ def parse_input(tokenizer,
else:
print('Input file format not supported.')
raise SystemExit
if model_name == 'GemmaForCausalLM':
batch_input_ids[0] = [tokenizer.bos_token_id] + batch_input_ids[0]
if num_prepend_vtokens:
assert len(num_prepend_vtokens) == len(batch_input_ids)

View File

@ -158,14 +158,6 @@ def main(args):
max_input_length=test_token_num,
)
input_ids = torch.tensor(input_id_list)
elif model_name == 'GemmaForCausalLM':
input_ids = tokenizer.encode(
curr_text,
add_special_tokens=add_special_tokens,
truncation=True,
max_length=test_token_num -
1) # minus 1 to add bos_token_id
input_ids = torch.tensor([tokenizer.bos_token_id] + input_ids)
else:
input_ids = tokenizer.encode(
curr_text,
@ -624,7 +616,7 @@ if __name__ == '__main__':
type=int,
default=None,
help=
'The attention window size that controls the sliding window attention / cyclic kv cache behaviour'
'The attention window size that controls the sliding window attention / cyclic kv cache behavior'
)
parser.add_argument('--sink_token_length',
type=int,

View File

@ -88,6 +88,14 @@ def load_tokenizer(tokenizer_dir: Optional[str] = None,
trust_remote_code=True,
tokenizer_type=tokenizer_type,
use_fast=use_fast)
elif model_name == 'GemmaForCausalLM':
from transformers import GemmaTokenizer
# Initialize tokenizer from vocab file.
tokenizer = GemmaTokenizer(vocab_file=vocab_file,
padding_side='left',
truncation_side='left',
legacy=False)
else:
# For gpt-next, directly load from tokenizer.model
tokenizer = T5Tokenizer(vocab_file=vocab_file,
@ -107,11 +115,6 @@ def load_tokenizer(tokenizer_dir: Optional[str] = None,
elif model_name == 'ChatGLMForCausalLM' and model_version == 'glm':
pad_id = tokenizer.pad_token_id
end_id = tokenizer.eop_token_id
elif model_name == 'GemmaForCausalLM':
tokenizer.eos_token_id = tokenizer.sp_model.eos_id()
tokenizer.bos_token_id = tokenizer.sp_model.bos_id()
pad_id = tokenizer.pad_token_id
end_id = tokenizer.eos_token_id
else:
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = tokenizer.eos_token_id

View File

@ -20,7 +20,7 @@ torch==2.1.0+cu121
torchdata==0.7.0
torchtext==0.16.0+cpu
torchvision==0.16.0+cu121
transformers==4.36.1
transformers==4.38.2
wheel
optimum
evaluate

View File

@ -1,4 +1,3 @@
--extra-index-url https://download.pytorch.org/whl/cu121
--extra-index-url https://pypi.nvidia.com
accelerate==0.25.0
build
@ -16,7 +15,7 @@ sentencepiece>=0.1.99
tensorrt==9.2.0.post12.dev5
torch<=2.2.0a
nvidia-ammo~=0.7.0; platform_machine=="x86_64"
transformers==4.36.1
transformers==4.38.2
wheel
optimum
evaluate

Some files were not shown because too many files have changed in this diff.