TensorRT-LLM/tensorrt_llm/_torch/distributed
Yukun He cbca6505ff
[nvbugs/5268808][fix] Fix the list-out-of-range access issue of AllReduce workspace on multi-node. (#4159)
This issue was found for tp=ep=8 on multi-node machines due to inconsistent PP sizes.
* Rework the workspace allocation implementation to avoid list-out-of-range issues.
* Disable min_latency_mode in multi-node scenarios to avoid illegal memory access (both changes are sketched below).
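
As a rough illustration of the intent of the two changes above, here is a minimal sketch, not the actual TensorRT-LLM code: the Mapping fields mirror real tensorrt_llm concepts, but `allocate_allreduce_workspaces` and `use_min_latency_mode` are hypothetical simplified helpers.

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    world_size: int        # total ranks across all nodes
    local_world_size: int  # ranks on this node
    tp_size: int           # tensor-parallel group size

def allocate_allreduce_workspaces(mapping: Mapping) -> list:
    # Size the workspace list by the number of ranks that actually
    # participate, rather than a fixed single-node assumption; indexing
    # a list sized for one node with a global rank is the kind of
    # list-out-of-range access the fix removes.
    return [None] * mapping.world_size

def use_min_latency_mode(mapping: Mapping, requested: bool) -> bool:
    # min_latency_mode assumes single-node peer-accessible workspaces,
    # so force it off when ranks span more than one node to avoid the
    # illegal memory access described above.
    is_multi_node = mapping.world_size > mapping.local_world_size
    return requested and not is_multi_node
```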

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-13 17:17:25 +08:00
__init__.py chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse (#4169) 2025-05-09 12:51:02 +08:00
communicator.py chore: bump version to 0.19.0 (#3598) (#3841) 2025-04-29 16:57:22 +08:00
ops.py [nvbugs/5268808][fix] Fix the list-out-of-range access issue of AllReduce workspace on multi-node. (#4159) 2025-05-13 17:17:25 +08:00