mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

hlu1 8207d5fd39 [None] [feat] Add model gpt-oss (#6645 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>		2025-08-07 03:04:18 -04:00
nvrtc	infra: open source XQA kernels (#3762 )	2025-04-30 18:05:15 +08:00
test	[None] [feat] Add model gpt-oss (#6645 )	2025-08-07 03:04:18 -04:00
barriers.cuh	[perf] improve XQA-MLA perf (#5468 )	2025-06-26 18:09:13 +08:00
CMakeLists.txt	[None][feat] Add vLLM KV Pool support for XQA kernel (#6013 )	2025-08-06 09:29:37 +08:00
cuda_hint.cuh	[feat] Support XQA-based MLA on SM120 (#4858 )	2025-06-06 22:32:49 +08:00
defines.h	[None][feat] Add vLLM KV Pool support for XQA kernel (#6013 )	2025-08-06 09:29:37 +08:00
gen_cpp_header.py	[feat] Support XQA-based MLA on SM120 (#4858 )	2025-06-06 22:32:49 +08:00
gen_cubins.py	update spec_dec (#6079 )	2025-07-16 17:50:43 +08:00
gmma_impl.cuh	infra: open source XQA kernels (#3762 )	2025-04-30 18:05:15 +08:00
gmma.cuh	infra: open source XQA kernels (#3762 )	2025-04-30 18:05:15 +08:00
hostUtils.h	fix: rename some terms (#4534 )	2025-05-23 23:23:49 +08:00
ldgsts.cuh	infra: open source XQA kernels (#3762 )	2025-04-30 18:05:15 +08:00
mha_components.cuh	[perf] improve XQA-MLA perf (#5468 )	2025-06-26 18:09:13 +08:00
mha_sm90_transpose.xlsx	infra: open source XQA kernels (#3762 )	2025-04-30 18:05:15 +08:00
mha_sm90.cu	[None] [feat] Add model gpt-oss (#6645 )	2025-08-07 03:04:18 -04:00
mha_stdheaders.cuh	[feat] Support XQA-based MLA on SM120 (#4858 )	2025-06-06 22:32:49 +08:00
mha.cu	[None] [feat] Add model gpt-oss (#6645 )	2025-08-07 03:04:18 -04:00
mha.h	[None] [feat] Add model gpt-oss (#6645 )	2025-08-07 03:04:18 -04:00
mhaUtils.cuh	[None][feat] Add vLLM KV Pool support for XQA kernel (#6013 )	2025-08-06 09:29:37 +08:00
mla_sm120.cu	[None] [feat] Add model gpt-oss (#6645 )	2025-08-07 03:04:18 -04:00
mla_sm120.cuh	[perf] improve XQA-MLA perf (#5468 )	2025-06-26 18:09:13 +08:00
mma.cuh	[feat] Support XQA-based MLA on SM120 (#4858 )	2025-06-06 22:32:49 +08:00
platform.h	infra: open source XQA kernels (#3762 )	2025-04-30 18:05:15 +08:00
README.md	[None][feat] Add vLLM KV Pool support for XQA kernel (#6013 )	2025-08-06 09:29:37 +08:00
ref.py	infra: open source XQA kernels (#3762 )	2025-04-30 18:05:15 +08:00
RefChecker.cuh	infra: open source XQA kernels (#3762 )	2025-04-30 18:05:15 +08:00
specDec.h	infra: open source XQA kernels (#3762 )	2025-04-30 18:05:15 +08:00
tensorMap.cpp	[None][feat] Add vLLM KV Pool support for XQA kernel (#6013 )	2025-08-06 09:29:37 +08:00
tensorMap.h	[feat] Support XQA-based MLA on SM120 (#4858 )	2025-06-06 22:32:49 +08:00
tma.h	[perf] improve XQA-MLA perf (#5468 )	2025-06-26 18:09:13 +08:00
utils.cuh	[TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379 )	2025-08-05 07:47:41 +00:00
utils.h	[feat] Support XQA-based MLA on SM120 (#4858 )	2025-06-06 22:32:49 +08:00

README.md

XQA - A set of optimized kernels for generation-phase MQA/GQA

Dependency

Building and running the unit tests requires libgtest-dev and libeigen3-dev.

Options

Kernel compile-time options can be found in defines.h. See code comments for details. Runtime options of unit tests can be modified in test.cpp.

Build & run unit tests

Install libgtest-dev and libeigen3-dev before building. Then follow the usual CMake build steps:

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_XQA_TESTS=ON
cmake --build . -j

To run the unit tests, run ./unitTests. A few runtime options can be controlled with environment variables:

XQA_ZERO_FILL: Set this to 1 to initialize input data with zeros instead of random numbers. This is useful when you want to run perf tests quickly and skip the slow random data generation step. Note that zero-filled inputs affect the measured performance.
XQA_USE_QGMMA: On Hopper, the TMA+QGMMA kernel (mha_sm90.cu) is used by default whenever possible. To force the use of mha.cu instead, set this to 0.
XQA_NB_SUB_SEQ: The number of CUDA thread blocks used to handle one K/V head. A reasonable default is chosen automatically; set this variable only if you want to override it manually.
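
The variables above can also be set from a small driver script, for example when sweeping perf configurations. The following is a minimal sketch, assuming the test binary is at ./unitTests as built above. The helper names xqa_test_env and run_unit_tests are hypothetical and not part of this repository. Only the values documented above are ever written: XQA_ZERO_FILL=1 and XQA_USE_QGMMA=0. The opposite choice is expressed by leaving the variable unset.

```python
import os
import subprocess


def xqa_test_env(zero_fill=None, use_qgmma=None, nb_sub_seq=None, base=None):
    """Return a copy of the environment with the XQA test variables adjusted.

    Arguments left as None leave the corresponding variable untouched.
    """
    env = dict(os.environ if base is None else base)
    if zero_fill is not None:
        if zero_fill:
            env["XQA_ZERO_FILL"] = "1"        # documented value: zero-filled inputs
        else:
            env.pop("XQA_ZERO_FILL", None)    # unset: fall back to random inputs
    if use_qgmma is not None:
        if use_qgmma:
            env.pop("XQA_USE_QGMMA", None)    # unset: keep the default kernel choice
        else:
            env["XQA_USE_QGMMA"] = "0"        # documented value: force mha.cu
    if nb_sub_seq is not None:
        env["XQA_NB_SUB_SEQ"] = str(int(nb_sub_seq))
    return env


def run_unit_tests(binary="./unitTests", **options):
    """Run the test binary with the requested options and return its exit code."""
    return subprocess.run([binary], env=xqa_test_env(**options)).returncode
```

For instance, run_unit_tests(zero_fill=True, use_qgmma=False) would run a quick pass that skips random data generation and forces the mha.cu kernels.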

Support for vLLM Paged KV-Cache

When PAGED_KV_CACHE_LAYOUT=1 is set at build time, XQA supports vLLM-style KV pool input with a split-wise KV pool and a sequence-first memory layout. To build and test with this feature enabled, run the following commands:

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_XQA_TESTS=ON -DPAGED_KV_CACHE_LAYOUT=1
cmake --build . -j
./unitTests
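
To make "sequence-first" more concrete, the sketch below computes element offsets into one page pool. It rests on an assumption that is an interpretation, not something stated above: that K and V each live in their own dense pool indexed as [page, token_in_page, kv_head, head_elem], with a per-sequence page table mapping logical pages to pool pages. All names here (PoolShape, element_offset, page_table) are hypothetical; consult the headers in this directory, such as defines.h, for the actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolShape:
    tokens_per_page: int
    num_kv_heads: int
    head_elems: int


def element_offset(shape, page_table, token_idx, kv_head, elem):
    """Flat element offset of one value in a K (or V) pool.

    ASSUMED layout: [page, token_in_page, kv_head, head_elem], so the token
    index varies more slowly than the head index ("sequence-first").
    """
    # Split the token position into a logical page and a slot inside it.
    logical_page, token_in_page = divmod(token_idx, shape.tokens_per_page)
    page = page_table[logical_page]  # physical page in the pool
    token_slot = page * shape.tokens_per_page + token_in_page
    return (token_slot * shape.num_kv_heads + kv_head) * shape.head_elems + elem
```

Under this assumption, consecutive tokens of the same head within a page are num_kv_heads * head_elems elements apart, whereas in a head-first layout they would be head_elems elements apart.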

Generating cubins used in TensorRT-LLM

Run gen_cubins.py in the repo workspace.