TensorRT-LLMs

kanshan/TensorRT-LLMs

Fork 0

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-02-05 02:31:33 +08:00

Commit Graph

Author	SHA1	Message	Date
Yukun He	bf7303c7f1	[https://nvbugs/5636916 ][fix] Cherry-pick #10654 : Fix accuracy issue of TWO-SHOT AllReduce kernel (#10841 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2026-01-21 17:25:40 +08:00
Ludwig Schneider	41ce14ab04	[None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce (#9314 ) Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>	2025-12-07 09:43:26 -08:00
Yukun He	2225745782	[TRTLLM-8129][feat] Allreduce tuning and benchmark script revising (#7870 ) Because we have encountered some perf regression due to using a one-shot kernel instead of NCCL on A100/H100, it will be beneficial if we can have a solid benchmarking of allreduce Op and analyze the data collected from it. Implemented new AllreduceOp heuristic: - Added Linear programming-based heuristic implementation. - Added LUT-based heuristic implementation and corresponding code generation script. AllreduceOp minor fixing: - Fixed a minor issue in AllreduceOp, that the strategy can not be overridden when ONESHOT or TWOSHOT is set. - Fixed a minor TWOSHOT kernel perf issue. - Cleaned up Dispatching code in AllReduceOp. This PR will fix the perf gaps reported in: https://nvbugspro.nvidia.com/bug/5517023 For Deepseek-R1, it shows a performance gain of about 3-4% in concurrency levels of 256 and 512. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-11-04 16:42:31 +08:00

Author

SHA1

Message

Date

Yukun He

bf7303c7f1

[https://nvbugs/5636916 ][fix] Cherry-pick #10654 : Fix accuracy issue of TWO-SHOT AllReduce kernel (#10841 )

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

2026-01-21 17:25:40 +08:00

Ludwig Schneider

41ce14ab04

[None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce (#9314 )

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

2025-12-07 09:43:26 -08:00

Yukun He

2225745782

[TRTLLM-8129][feat] Allreduce tuning and benchmark script revising (#7870 )

Because we have encountered some perf regression due to using a one-shot kernel instead of NCCL on A100/H100, it will be beneficial if we can have a solid benchmarking of allreduce Op and analyze the data collected from it.

Implemented new AllreduceOp heuristic:
- Added Linear programming-based heuristic implementation.
- Added LUT-based heuristic implementation and corresponding code generation script.

AllreduceOp minor fixing:
- Fixed a minor issue in AllreduceOp, that the strategy can not be overridden when ONESHOT or TWOSHOT is set.
- Fixed a minor TWOSHOT kernel perf issue.
- Cleaned up Dispatching code in AllReduceOp.

This PR will fix the perf gaps reported in:
https://nvbugspro.nvidia.com/bug/5517023

For Deepseek-R1, it shows a performance gain of about 3-4% in concurrency levels of 256 and 512.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

2025-11-04 16:42:31 +08:00

3 Commits