Added a clarification for AllGather and ReduceScatter sizes, since NCCL uses the size per rank.

2026-04-25 08:58:18 +08:00 · 2018-08-17 14:58:44 -07:00 · 2018-08-17 14:58:44 -07:00 · dcf818955f
commit dcf818955f
parent eb4c43ff3d
1 changed file with 4 additions and 0 deletions
--- a/doc/PERFORMANCE.md
+++ b/doc/PERFORMANCE.md
@@ -78,6 +78,8 @@ And the Bus Bandwidth is therefore computed as :

 `B = S/t * (n-1)/n = algbw * (n-1)/n`

+Note that here, S is the size in bytes of the total array, which for NCCL is equal to `recvcount*sizeof(datatype)*n` as the `recvcount` argument is the count per rank.
+
 ### AllGather

 The AllGather operation requires only to perform the assignation part of the allReduce operation :
@@ -94,6 +96,8 @@ And the Bus Bandwidth is therefore computed as :

 `B = S/t * (n-1)/n = algbw * (n-1)/n`

+Note that here, S is the size in bytes of the total array, which for NCCL is equal to `sendcount*sizeof(datatype)*n` as the `sendcount` argument is the count per rank.
+
 ### Broadcast

 The broadcast operation representation is similar to allGather :
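
The size convention added in this commit can be sketched as a small calculation. This is a minimal illustration, not code from the repository: the function name `bus_bandwidth` and its parameters are hypothetical, and it assumes the count passed in is the per-rank count (`recvcount` for ReduceScatter, `sendcount` for AllGather), as the added notes describe.

```python
def bus_bandwidth(count_per_rank, dtype_size, n_ranks, seconds):
    """Return (algbw, busbw) in bytes/s for AllGather or ReduceScatter.

    Hypothetical helper: count_per_rank plays the role of NCCL's
    per-rank recvcount/sendcount argument.
    """
    # S is the size of the total array, so the per-rank count is scaled by n.
    total_bytes = count_per_rank * dtype_size * n_ranks
    # Algorithm bandwidth: S / t
    algbw = total_bytes / seconds
    # Bus bandwidth: B = S/t * (n-1)/n
    busbw = algbw * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Example: 1024 four-byte elements per rank, 8 ranks, 0.5 s elapsed.
algbw, busbw = bus_bandwidth(count_per_rank=1024, dtype_size=4, n_ranks=8, seconds=0.5)
```

Leaving out the factor `n` when computing `S` would understate both bandwidths by a factor of `n`, which is the mistake the added notes guard against.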