TensorRT-LLMs/cpp/kernels/xqa
Ming Wei ed887940d4
infra: open source XQA kernels (#3762)
Replace libtensorrt_llm_nvrtc_wrapper.so with its source code, which
consists of two parts:

1. NVRTC glue code
2. XQA kernel code

During TensorRT-LLM build, XQA kernel code is embedded as C++ arrays via
gen_cpp_header.py and passed to NVRTC for JIT compilation.

Signed-off-by: Ming Wei <2345434+ming-wei@users.noreply.github.com>
2025-04-30 18:05:15 +08:00
nvrtc
test
barriers.cuh
CMakeLists.txt
cuda_hint.cuh
defines.h
gen_cpp_header.py
gen_cubins.py
gmma_impl.cuh
gmma.cuh
hostUtils.h
ldgsts.cuh
mha_sm90_transpose.xlsx
mha_sm90.cu
mha_stdheaders.cuh
mha.cu
mha.h
mhaUtils.cuh
mma.cuh
pairedF32Op.cuh
platform.h
README.md
ref.py
RefChecker.cuh
specDec.h
tma.h
utils.cuh
utils.h

XQA - A set of optimized kernels for generation-phase MQA/GQA

Dependency

If you want to build & run unit tests, you need libgtest-dev and libeigen3-dev.

Options

Kernel compile-time options can be found in defines.h. See code comments for details. Runtime options of unit tests can be modified in test.cpp.

Build & run unit tests

You need to install libgtest-dev and libeigen3-dev before building. To build, use the normal cmake build steps:

  • mkdir build
  • cd build
  • cmake .. -DCMAKE_BUILD_TYPE=Release
  • cmake --build . -j

To run unit tests, run ./unitTests. There are a few runtime options that can be controlled with environment variables:

  • XQA_ZERO_FILL: Set this to 1 to initialize input data with zeros (instead of random numbers). This is useful if you want to run perf tests quickly and skip the slow random data generation step. Note that zero-filled inputs affect the measured performance.
  • XQA_USE_QGMMA: On Hopper, the TMA+QGMMA kernel (mha_sm90.cu) is used by default when possible. To force the use of mha.cu, set this to 0.
  • XQA_NB_SUB_SEQ: The number of CUDA thread blocks used to handle one K/V head. A reasonable default is chosen automatically; set this variable to override it.
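
The environment variables above can also be set from a small driver script, which is convenient when sweeping over several configurations. The following is a minimal sketch, not part of the repository: the helper names xqa_test_env and run_unit_tests, their parameters, and the assumption that the test binary sits at build/unitTests are all hypothetical. It only sets values that the list above documents (XQA_ZERO_FILL=1, XQA_USE_QGMMA=0, and an integer XQA_NB_SUB_SEQ) and leaves everything else at its default.

```python
import os
import subprocess


def xqa_test_env(zero_fill=False, force_mha=False, nb_sub_seq=None, base_env=None):
    """Return a copy of the environment with the requested XQA_* overrides.

    Hypothetical helper: only the documented values are ever written.
    """
    env = dict(os.environ if base_env is None else base_env)
    if zero_fill:
        # Zero-initialize inputs to skip the slow random data generation.
        env["XQA_ZERO_FILL"] = "1"
    if force_mha:
        # Force mha.cu instead of the TMA+QGMMA kernel (mha_sm90.cu).
        env["XQA_USE_QGMMA"] = "0"
    if nb_sub_seq is not None:
        # Override the number of CUDA thread blocks per K/V head.
        env["XQA_NB_SUB_SEQ"] = str(int(nb_sub_seq))
    return env


def run_unit_tests(build_dir="build", **overrides):
    """Run the unit test binary (assumed at <build_dir>/unitTests); return its exit code."""
    binary = os.path.abspath(os.path.join(build_dir, "unitTests"))
    result = subprocess.run([binary], cwd=build_dir, env=xqa_test_env(**overrides))
    return result.returncode
```

For example, run_unit_tests(zero_fill=True, nb_sub_seq=4) would launch the tests with zero-filled inputs and four thread blocks per K/V head, assuming the build directory from the steps above.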

Generating cubins used in TensorRT-LLM

Run gen_cubins.py in the repo workspace.