TensorRT-LLM/cpp/tensorrt_llm/kernels
liji-nv dca6397d1e
feat: Introduce UB allocator for pytorch flow (#3257)
* Instead of allocating UserBuffers once at the start of the runtime, UB
  buffers are now managed by a global allocator. The allocator dynamically
  assigns a free UB buffer, or allocates a new one, for each torch tensor,
  which makes UserBuffers easier to use (a minimal sketch of the
  acquire/release flow follows the pass list below).

* In the common use case, the UserBuffers are allocated during the warm-up
  stage, so there is no dynamic allocation during inference.

* The UB fusion pattern is rewritten using the new UB allocator. It consists
  of the following passes (a toy rewrite sketch for pass 2 also follows the
  list):

1. Fuse quantization with allreduce, replace the pair with the UB
   implementation, and insert a copy_to_userbuffers. The normal allreduce
   currently does not support FP8 quantization, so this fusion has to happen
   in the UB pass.
2. Convert every supported allreduce to the UB implementation and insert a
   copy_to_userbuffers.
3. Fuse the op preceding the allreduce with the copy_to_userbuffers, so that
   the op writes directly into the UserBuffer.
4. Remove the UserBuffers finalize if the output is connected to another UB
   allreduce.
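
Below is a minimal sketch of the acquire/release flow the allocator
implements. It is illustrative only: the names UBPool, acquire, and release
are hypothetical, and std::malloc stands in for the real UserBuffers
registration, which goes through the communicator's registered memory.

```cpp
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical pool keyed by buffer size; not the actual TRT-LLM API.
class UBPool
{
public:
    // Return a previously released buffer of this size, or allocate a new one.
    void* acquire(std::size_t size)
    {
        auto& freeList = mFree[size];
        if (!freeList.empty())
        {
            void* buf = freeList.back();
            freeList.pop_back();
            return buf; // reuse: no new allocation
        }
        void* buf = std::malloc(size); // stand-in for UB registration
        mSize[buf] = size;
        return buf;
    }

    // Hand the buffer back so a later tensor of the same size can reuse it.
    void release(void* buf)
    {
        mFree[mSize.at(buf)].push_back(buf);
    }

private:
    std::unordered_map<std::size_t, std::vector<void*>> mFree; // size -> free buffers
    std::unordered_map<void*, std::size_t> mSize;              // buffer -> its size
};
```

With this shape, warm-up pays one allocation per new buffer size; afterwards
every acquire hits the free list, which is why steady-state inference sees no
dynamic allocation.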
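
And a toy sketch of the graph rewrite in pass 2, on a hypothetical IR. The
real pass operates on the PyTorch graph; the Node struct and the op-name
strings here are placeholders for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy IR: a node is an op name plus the indices of its producer nodes.
struct Node
{
    std::string op;
    std::vector<int> inputs;
};

// Pass 2: retarget each supported allreduce to its UB variant and route its
// input through a copy_to_userbuffers node.
void convertAllreduceToUB(std::vector<Node>& graph)
{
    std::size_t const n = graph.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        if (graph[i].op != "allreduce")
            continue;
        // Stage the allreduce input into a UserBuffer first...
        graph.push_back({"copy_to_userbuffers", graph[i].inputs});
        // ...then make the allreduce the UB variant reading that copy.
        graph[i].op = "ub_allreduce";
        graph[i].inputs = {static_cast<int>(graph.size() - 1)};
    }
}
```

Pass 3 then fuses the producer of copy_to_userbuffers so it writes straight
into the UserBuffer, and pass 4 drops the finalize when the consumer is
another UB allreduce.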

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-04-08 18:39:49 +08:00
beamSearchKernels feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133) 2025-04-02 12:31:28 +08:00
contextFusedMultiHeadAttention feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
cutlass_kernels feat: enable DeepGEMM by default (#3341) 2025-04-08 13:58:57 +08:00
decoderMaskedMultiheadAttention Support speculative decoding with Hopper XQA (#3269) 2025-04-07 17:14:34 +08:00
flashMLA feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
fusedLayernormKernels update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
internal_cutlass_kernels chore: update internal cutlass library base #2981 and #3165. (#3308) 2025-04-07 13:53:02 +08:00
lora Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
selectiveScan Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
speculativeDecoding Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
trtllmGenKernels feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
unfusedAttentionKernels Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
userbuffers feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
weightOnlyBatchedGemv feat: Update cutlass (#2981) 2025-03-26 22:36:27 +08:00
allReduceFusionKernels.cu update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
allReduceFusionKernels.h update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
attentionMask.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
attentionMask.h Update TensorRT-LLM (#2363) 2024-10-22 20:27:35 +08:00
banBadWords.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
banBadWords.h Update TensorRT-LLM (#2008) 2024-07-23 23:05:09 +08:00
banRepeatNgram.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
banRepeatNgram.h Update TensorRT-LLM (#1598) 2024-05-14 16:43:41 +08:00
beamSearchKernels.cu feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133) 2025-04-02 12:31:28 +08:00
beamSearchKernels.h feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133) 2025-04-02 12:31:28 +08:00
buildRelativeAttentionBiasKernel.cu Update TensorRT-LLM (#1763) 2024-06-11 16:59:02 +08:00
buildRelativeAttentionBiasKernel.h Update TensorRT-LLM (#1763) 2024-06-11 16:59:02 +08:00
CMakeLists.txt Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
cumsumLastDim.cu open source 7f370deb0090d885d7518c2b146399ba3933c004 (#2273) 2024-09-30 13:51:19 +02:00
cumsumLastDim.h Update TensorRT-LLM (#1725) 2024-06-04 20:26:32 +08:00
customAllReduceKernels.cu Update (#2978) 2025-03-23 16:39:35 +08:00
customAllReduceKernels.h perf: Add optimizations for deepseek in min latency mode (#3093) 2025-04-02 09:05:24 +08:00
decoderMaskedMultiheadAttention.cu Update TensorRT-LLM (#2502) 2024-11-26 16:51:34 +08:00
decoderMaskedMultiheadAttention.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
decoderMaskedMultiheadAttentionUtils.h Update TensorRT-LLM (#2363) 2024-10-22 20:27:35 +08:00
decodingCommon.cu Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
decodingKernels.cu refactor: Improve decoder finalize function (#3077) 2025-03-28 14:33:59 +08:00
decodingKernels.h refactor: Improve decoder finalize function (#3077) 2025-03-28 14:33:59 +08:00
delayStream.cu Update (#2978) 2025-03-23 16:39:35 +08:00
delayStream.h Update (#2978) 2025-03-23 16:39:35 +08:00
doraScaling.cu Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
doraScaling.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
fmhaDispatcher.cpp feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
fmhaDispatcher.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
gptKernels.cu Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
gptKernels.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
groupGemm.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
groupGemm.h Update TensorRT-LLM (#2562) 2024-12-11 00:31:05 -08:00
kvCachePartialCopy.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
kvCacheUtils.h Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
layernormKernels.cu Update TensorRT-LLM (#1274) 2024-03-12 18:15:52 +08:00
layernormKernels.h Update TensorRT-LLM (#1274) 2024-03-12 18:15:52 +08:00
logitsBitmask.cu bitmask v3 (#3009) 2025-03-26 15:21:29 +08:00
logitsBitmask.h Update TensorRT-LLM (#2532) 2024-12-04 21:16:56 +08:00
lookupKernels.cu Update TensorRT-LLM (#1639) 2024-05-21 17:51:02 +08:00
lookupKernels.h Update TensorRT-LLM (#1639) 2024-05-21 17:51:02 +08:00
lruKernel.cu Update TensorRT-LLM (#1688) 2024-05-28 20:07:49 +08:00
lruKernel.h Update TensorRT-LLM (#1688) 2024-05-28 20:07:49 +08:00
mambaConv1dKernels.cu Update TensorRT-LLM (#2562) 2024-12-11 00:31:05 -08:00
mambaConv1dKernels.h Update TensorRT-LLM (#1954) 2024-07-16 15:30:25 +08:00
mlaKernels.cu feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
mlaKernels.h feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
multiHeadAttentionCommon.h feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190) 2025-04-07 15:14:13 +08:00
noAuxTcKernels.cu Update (#2978) 2025-03-23 16:39:35 +08:00
noAuxTcKernels.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
penaltyKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
penaltyKernels.h Update TensorRT-LLM (#2502) 2024-11-26 16:51:34 +08:00
penaltyTypes.h Update TensorRT-LLM (#1554) 2024-05-07 23:34:28 +08:00
preQuantScaleKernel.cu open source 3706e7395b9b58994412617992727c8ff2d14c9f (#2010) 2024-07-24 05:48:06 +08:00
preQuantScaleKernel.h Update TensorRT-LLM (#1274) 2024-03-12 18:15:52 +08:00
qserveGemm.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
qserveGemmPerChannel.cu Update TensorRT-LLM (#2532) 2024-12-04 21:16:56 +08:00
qserveGemmPerGroup.cu Update TensorRT-LLM (#2502) 2024-11-26 16:51:34 +08:00
quantization.cu update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
quantization.cuh update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
quantization.h update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
rmsnormKernels.cu Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
rmsnormKernels.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
sageAttentionKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
sageAttentionKernels.h Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
samplingAirTopPKernels.cu Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
samplingTopKKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
samplingTopKKernels.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
samplingTopPKernels.cu chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
samplingTopPKernels.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
splitkGroupGemm.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
splitkGroupGemm.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
stopCriteriaKernels.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
stopCriteriaKernels.h open source 4dbf696ae9b74a26829d120b67ab8443d70c8e58 (#2297) 2024-10-08 12:19:19 +02:00
topkLastDim.cu Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
topkLastDim.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
unfusedAttentionKernels.cu fix: fix for cp > kvHeadNum (#3002) 2025-03-26 12:39:02 +08:00
unfusedAttentionKernels.h fix: fix for cp > kvHeadNum (#3002) 2025-03-26 12:39:02 +08:00
xqaDispatcher.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
xqaDispatcher.h Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00