Major rework to merge most of the changes from the NCCL internal
tests into the public ones
Added "-m <agg_iters>" operation aggregation option.
Data integrity checking is now much more performant at scale.
Startup times at scale are improved.
Test latency units are now displayed in usec.