TensorRT-LLM/tests/unittest/_torch
liji-nv dca6397d1e
feat: Introduce UB allocator for pytorch flow (#3257)
* Instead of allocating UserBuffers at the beginning of runtime, UB buffers
  are now managed by a global allocator, which dynamically assigns a free
  UB buffer or allocates a new one for a torch tensor. This makes
  UserBuffers easier to use.

* In the common use case, UserBuffers are allocated correctly during the
  warm-up stage, so there is no dynamic allocation during inference.
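
The pooled-allocation idea described above can be sketched in a few lines. This is an illustrative toy, not TensorRT-LLM's actual UB allocator: the class and method names are hypothetical, and plain `bytearray`s stand in for the CUDA UserBuffers tensors managed in the real flow.

```python
class UBAllocator:
    """Sketch of a pooled buffer allocator: reuse a free buffer of the
    requested size if one exists, otherwise create a new one."""

    def __init__(self):
        self._free = []      # released buffers available for reuse
        self._allocated = 0  # count of fresh buffers ever created

    def allocate(self, nbytes):
        for i, buf in enumerate(self._free):
            if len(buf) == nbytes:   # found a matching free buffer
                return self._free.pop(i)
        self._allocated += 1         # no match: allocate a new buffer
        return bytearray(nbytes)

    def release(self, buf):
        self._free.append(buf)       # return buffer to the pool


# Warm-up creates the buffer once; steady state reuses the same storage.
alloc = UBAllocator()
a = alloc.allocate(4096)
alloc.release(a)
b = alloc.allocate(4096)   # reuses a's storage, no new allocation
print(alloc._allocated)    # → 1
```

This matches the behavior the commit describes: after warm-up has populated the pool, inference-time requests are served from the free list without new allocations.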

* The UB fusion pattern is rewritten using the new UB allocator. It
  consists of the following passes:

1. Fuse quant with allreduce, replace it with the UB implementation, and
   insert a copy_to_userbuffers. The normal allreduce does not yet
   support FP8 quant, so this needs to be done in the UB pass.
2. Convert all supported allreduces to UB and insert copy_to_userbuffers.
3. Fuse the op before the allreduce with the copy_to_userbuffers, so the
   op writes directly to the UserBuffer.
4. Remove the UserBuffers finalize if the output is connected to another
   UB allreduce.
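
Pass 2 above can be illustrated on a toy flat op list: each supported allreduce is swapped for a UB-backed variant with a copy_to_userbuffers staged in front of it. The op names and the list-of-strings IR are stand-ins for illustration only, not the real graph representation or pass API.

```python
def convert_allreduce_to_ub(ops, supported=("allreduce",)):
    """Toy graph-rewrite pass: replace each supported collective with a
    UB variant, inserting a copy into the UserBuffer right before it."""
    out = []
    for op in ops:
        if op in supported:
            out.append("copy_to_userbuffers")  # stage input into a UB buffer
            out.append("ub_" + op)             # UB-backed collective
        else:
            out.append(op)                     # leave other ops untouched
    return out


graph = ["gemm", "allreduce", "rmsnorm", "allreduce"]
print(convert_allreduce_to_ub(graph))
# → ['gemm', 'copy_to_userbuffers', 'ub_allreduce',
#    'rmsnorm', 'copy_to_userbuffers', 'ub_allreduce']
```

Passes 3 and 4 then clean up this output: the copy is fused into the producing op where possible, and a finalize is elided when one UB allreduce feeds another.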

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-04-08 18:39:49 +08:00
auto_deploy test: reorganize tests folder hierarchy (#2996) 2025-03-27 12:07:53 +08:00
compilation Update (#2978) 2025-03-23 16:39:35 +08:00
modeling feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
multi_gpu feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
multi_gpu_modeling test: Accuracy test improvement (Part 3.4): Move LLaMA tests (#3350) 2025-04-08 15:07:57 +08:00
speculative chore: refactor the LlmArgs with Pydantic and migrate remaining pybinding configs to python (#3025) 2025-04-05 13:31:48 +08:00
thop feat: Introduce UB allocator for pytorch flow (#3257) 2025-04-08 18:39:49 +08:00
deep_gemm_tests.py feat: use NVRTC for DeepGEMM JIT compilation (#3239) 2025-04-07 20:29:23 +08:00
helpers.py Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
pattern_watcher.py Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
test_attention_no_cache.py feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
test_attention.py test: reorganize tests folder hierarchy (#2996) 2025-03-27 12:07:53 +08:00
test_autotuner.py feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) 2025-04-08 14:28:36 +08:00
test_flashinfer_attention.py Update (#2978) 2025-03-23 16:39:35 +08:00
test_flashinfer_star_attn.py Update (#2978) 2025-03-23 16:39:35 +08:00
test_fp4_bmm_quantize.py feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) 2025-04-08 14:28:36 +08:00
test_fp4_gemm_quantize.py update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
test_fp4_linear.py feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) 2025-04-08 14:28:36 +08:00
test_fp8_block_scale_gemm.py feat: enable DeepGEMM by default (#3341) 2025-04-08 13:58:57 +08:00
test_fp8_linear.py test: reorganize tests folder hierarchy (#2996) 2025-03-27 12:07:53 +08:00
test_fp8_quantize.py test: reorganize tests folder hierarchy (#2996) 2025-03-27 12:07:53 +08:00
test_fused_moe.py feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151) 2025-04-08 14:28:36 +08:00
test_moe_routing.py Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
test_moe.py test: reorganize tests folder hierarchy (#2996) 2025-03-27 12:07:53 +08:00
test_overlap_scheduler_input.json Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
test_overlap_scheduler.py Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
test_pytorch_model_engine.py Update (#2978) 2025-03-23 16:39:35 +08:00
test_resource_manager.py feat: Support PeftCacheManager in Torch (#3186) 2025-04-04 12:38:08 +08:00
test_vanilla_attention.py Update (#2978) 2025-03-23 16:39:35 +08:00