TensorRT-LLMs/open-sourced-cutlass-kernels.md at 2e437536b7400e9ef68a5871214d4bc499bceb53

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

add doc for open-sourced cutlass kernels (#5194)

Signed-off-by: yunruis

2025-06-13 18:51:27 +08:00

1.8 KiB


We have recently open-sourced a set of Cutlass kernels that were previously known as "internal_cutlass_kernels". Because of internal dependencies, these kernels used to be available to users only as static libraries. We have now decoupled those dependencies, so the kernels are available as source code.

The open-sourced Cutlass kernels are located under cpp/tensorrt_llm/kernels/cutlass_kernels and include:

low_latency_gemm
moe_gemm
fp4_gemm
allreduce_gemm

To ensure stability and optimized performance, the previous method of calling these kernels through static libraries remains available as an alternative. You can switch between the open-sourced Cutlass kernels and the static library Cutlass kernels with the USING_OSS_CUTLASS_* macro (where * is the specific kernel name), which gives kernel-level control. By default, the open-sourced Cutlass kernels are used. Note that support for the static libraries will be gradually deprioritized and may eventually be deprecated.
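The naming pattern above can be sketched as a small mapping from kernel name to macro. This is only an illustration: it assumes the macro suffix is the upper-cased kernel name, which matches the two macros shown in this document (USING_OSS_CUTLASS_MOE_GEMM and USING_OSS_CUTLASS_LOW_LATENCY_GEMM); the names for fp4_gemm and allreduce_gemm are inferred from that pattern and should be checked against the CMake configuration. The value ON for the default is likewise an assumption, taken as the counterpart of the OFF value used to select the static libraries.

```python
# Kernel names as listed in this document.
KERNELS = ["low_latency_gemm", "moe_gemm", "fp4_gemm", "allreduce_gemm"]


def oss_macro(kernel: str) -> str:
    """Return the USING_OSS_CUTLASS_* macro for a kernel.

    Assumption: the suffix is the upper-cased kernel name.
    """
    return "USING_OSS_CUTLASS_" + kernel.upper()


# Default state per this document: every kernel uses the open-sourced
# implementation. "ON" is assumed to be the counterpart of "OFF".
default_config = {oss_macro(k): "ON" for k in KERNELS}

for macro, value in default_config.items():
    print(f"{macro}={value}")
```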

Default Configuration (Using Open-Sourced Cutlass Kernels)

To build using the open-sourced Cutlass kernels (the default), run:

python3 ./scripts/build_wheel.py --cuda_architectures "90-real;100-real"

Using Static Library Cutlass Kernels

If you prefer to use the Cutlass kernels from the static library, you can control this at compile time by setting the corresponding USING_OSS_CUTLASS_* macro to OFF. For example, to use the static library implementation for low_latency_gemm and moe_gemm while keeping the other kernels on the open-sourced implementation, use the following command:

python3 ./scripts/build_wheel.py --cuda_architectures "90-real;100-real" -D "USING_OSS_CUTLASS_MOE_GEMM=OFF;USING_OSS_CUTLASS_LOW_LATENCY_GEMM=OFF"
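When the set of kernels to switch varies between builds, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: build_command is a hypothetical helper, and it assumes the macro suffix is the upper-cased kernel name (confirmed here only for moe_gemm and low_latency_gemm). It reuses only the script path and the --cuda_architectures and -D options shown in this document.

```python
import shlex

# Kernel names as listed in this document.
KERNELS = ("low_latency_gemm", "moe_gemm", "fp4_gemm", "allreduce_gemm")


def build_command(static_kernels=(), cuda_architectures="90-real;100-real"):
    """Hypothetical helper: return the build_wheel.py argument list.

    Kernels named in static_kernels get USING_OSS_CUTLASS_<NAME>=OFF;
    all others keep the default open-sourced implementation.
    """
    for kernel in static_kernels:
        if kernel not in KERNELS:
            raise ValueError(f"unknown kernel: {kernel}")
    args = ["python3", "./scripts/build_wheel.py",
            "--cuda_architectures", cuda_architectures]
    if static_kernels:
        # Assumption: macro suffix is the upper-cased kernel name.
        defs = ";".join(f"USING_OSS_CUTLASS_{k.upper()}=OFF" for k in static_kernels)
        args += ["-D", defs]
    return args


cmd = build_command(["moe_gemm", "low_latency_gemm"])
# shlex.join quotes the arguments that contain ';' so the shell
# does not treat it as a command separator.
print(shlex.join(cmd))
```

Passing an empty list yields the default command with no -D argument, which matches the default configuration described earlier.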
