TensorRT-LLMs/cpp/tensorrt_llm/runtime
liji-nv dca6397d1e
feat: Introduce UB allocator for pytorch flow (#3257)
* Instead of allocating UserBuffers at the beginning of runtime, UB buffers
  are now managed by a global allocator. The allocator dynamically assigns a
  free UB buffer, or allocates a new one, for each torch tensor, which makes
  UserBuffers easier to use (see the sketch after the pass list below).

* In the common case, all UserBuffers are allocated during the warm-up stage,
  so there is no dynamic allocation during inference.

* The UB fusion pattern is rewritten using the new UB allocator. It consists
  of the following passes (a skeleton of the pass ordering is sketched after
  this list):

1. Fuse quant with allreduce, replace the pair with the UB implementation,
   and insert a copy_to_userbuffers. The normal allreduce still does not
   support FP8 quant, so this fusion has to be done in the UB pass.
2. Convert all remaining supported allreduces to UB and insert
   copy_to_userbuffers.
3. Fuse the op preceding the allreduce with the copy_to_userbuffers, so the
   op writes directly into the UserBuffer.
4. Remove the UserBuffers finalize if the output is connected to another UB
   allreduce.
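
The get-or-allocate behaviour described above can be pictured with a small,
hypothetical sketch. None of the names below (`UBAllocator`, `UBBuffer`,
`get`, `release`) are the real TensorRT-LLM API, and a real implementation
would register device memory with UserBuffers rather than call `torch.empty`;
only the pooling logic mirrors the description in this commit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import torch


@dataclass
class UBBuffer:
    """A UserBuffers-registered allocation backing a torch tensor (hypothetical)."""
    tensor: torch.Tensor
    in_use: bool = False


@dataclass
class UBAllocator:
    """Global pool: hand out a free UB buffer or allocate a new one."""
    free_pool: Dict[Tuple[Tuple[int, ...], torch.dtype], List[UBBuffer]] = field(
        default_factory=dict)

    def get(self, shape: Tuple[int, ...], dtype: torch.dtype) -> UBBuffer:
        key = (tuple(shape), dtype)
        pool = self.free_pool.setdefault(key, [])
        for buf in pool:
            if not buf.in_use:  # reuse a matching free buffer
                buf.in_use = True
                return buf
        # No free buffer of this shape/dtype: allocate a new one (the real
        # runtime would UB-register device memory here, not CPU memory).
        buf = UBBuffer(tensor=torch.empty(shape, dtype=dtype), in_use=True)
        pool.append(buf)
        return buf

    def release(self, buf: UBBuffer) -> None:
        """Return a buffer to the pool once its tensor is no longer needed."""
        buf.in_use = False
```

During warm-up, each distinct (shape, dtype) request falls through to the
allocation branch once; afterwards the same requests are served from the free
pool, which is what keeps steady-state inference free of dynamic allocation.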
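
The four passes can also be read as an ordered pipeline over the traced
graph. The skeleton below is illustrative only: the function names are
hypothetical stand-ins, the bodies are stubs, and only the ordering (quant
fusion before the generic UB conversion, then the copy and finalize
cleanups) follows the description in this commit.

```python
import torch.fx as fx


def fuse_quant_into_ub_allreduce(gm: fx.GraphModule) -> fx.GraphModule:
    """Pass 1 (stub): fold FP8 quant into a UB allreduce and insert
    copy_to_userbuffers for its input."""
    return gm


def convert_allreduce_to_ub(gm: fx.GraphModule) -> fx.GraphModule:
    """Pass 2 (stub): swap remaining supported allreduces for the UB variant
    and insert copy_to_userbuffers."""
    return gm


def fuse_producer_into_copy(gm: fx.GraphModule) -> fx.GraphModule:
    """Pass 3 (stub): let the op feeding copy_to_userbuffers write straight
    into the UB buffer."""
    return gm


def drop_redundant_finalize(gm: fx.GraphModule) -> fx.GraphModule:
    """Pass 4 (stub): remove the UB finalize when the output feeds another
    UB allreduce."""
    return gm


def run_ub_fusion(gm: fx.GraphModule) -> fx.GraphModule:
    # Order matters: the quant fusion must claim the FP8 case before the
    # generic conversion runs; the copy/finalize cleanups act on the result.
    for ub_pass in (fuse_quant_into_ub_allreduce,
                    convert_allreduce_to_ub,
                    fuse_producer_into_copy,
                    drop_redundant_finalize):
        gm = ub_pass(gm)
        gm.recompile()
    return gm
```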

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-04-08 18:39:49 +08:00
..
utils Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
bufferManager.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
bufferView.h Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
CMakeLists.txt Update (#2978) 2025-03-23 16:39:35 +08:00
cudaMemPool.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
cudaMemPool.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
decoderState.cpp refactor: Simplify disableLookahead and improve numDecodingEngineTokens handling (#3103) 2025-04-01 18:47:31 +08:00
decodingLayerWorkspace.cpp Update TensorRT-LLM (#2184) 2024-09-03 12:14:23 +02:00
decodingLayerWorkspace.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
decodingOutput.cpp Update (#2978) 2025-03-23 16:39:35 +08:00
eagleBuffers.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
explicitDraftTokensBuffers.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
explicitDraftTokensModule.h Update TensorRT-LLM (#1763) 2024-06-11 16:59:02 +08:00
generationConfig.cpp Update TensorRT-LLM (#2110) 2024-08-13 22:34:33 +08:00
generationConfig.h Update TensorRT-LLM (#2110) 2024-08-13 22:34:33 +08:00
gptDecoder.cpp Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
gptDecoderBatched.cpp chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
gptJsonConfig.cpp chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
gptSession.cpp refactor: Remove speculative decoding parameters from stateful decoders (#3024) 2025-03-26 20:16:26 +08:00
iBuffer.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
ipcNvlsMemory.cpp Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
ipcSocket.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
ipcSocket.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
ipcUtils.cpp chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
iTensor.cpp Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
jsonSerialization.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
layerProfiler.cpp Update TensorRT-LLM (#1554) 2024-05-07 23:34:28 +08:00
layerProfiler.h Update TensorRT-LLM (#1554) 2024-05-07 23:34:28 +08:00
lookaheadBuffers.cpp Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
loraCache.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
loraManager.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
loraManager.h Update TensorRT-LLM (#2413) 2024-11-05 16:27:06 +08:00
loraModule.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
loraUtils.cpp Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
loraUtils.h Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
memoryCounters.cpp Update TensorRT-LLM (#2110) 2024-08-13 22:34:33 +08:00
ncclCommunicator.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
ncclCommunicator.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
promptTuningParams.cpp Update TensorRT-LLM (#1598) 2024-05-14 16:43:41 +08:00
rnnStateBuffers.cpp open source 7f370deb0090d885d7518c2b146399ba3933c004 (#2273) 2024-09-30 13:51:19 +02:00
rnnStateBuffers.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
runtimeBuffers.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
runtimeBuffers.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
runtimeKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
runtimeKernels.h Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
statefulGptDecoder.cpp chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
statefulGptDecoder.h refactor: Remove speculative decoding parameters from stateful decoders (#3024) 2025-03-26 20:16:26 +08:00
statefulGptDecoderBatched.cpp Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183) (#3195) 2025-04-04 15:56:28 +02:00
tensorView.h Update TensorRT-LLM (#1793) 2024-06-18 18:18:23 +08:00
tllmBuffers.cpp Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
tllmBuffers.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
tllmLogger.cpp Update TensorRT-LLM (#787) 2024-01-02 17:54:32 +08:00
tllmRuntime.cpp feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
tllmRuntime.h Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
torch.h Update TensorRT-LLM (#2110) 2024-08-13 22:34:33 +08:00
torchUtils.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
torchView.h Update TensorRT-LLM (#1168) 2024-02-27 17:37:34 +08:00
transformerBuffers.cpp chore: remove usernames from comments (#3291) 2025-04-05 13:44:28 +08:00
transformerBuffers.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
workerPool.cpp Update TensorRT-LLM (#2156) 2024-08-27 18:20:59 +08:00
workerPool.h Update TensorRT-LLM (#2156) 2024-08-27 18:20:59 +08:00
worldConfig.cpp Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00