diff --git a/README.md b/README.md
index 44e406a..e5ac603 100644
--- a/README.md
+++ b/README.md
@@ -4,33 +4,43 @@ These tests check both the performance and the correctness of [NCCL](http://gith
 ## Build
 
-To build the tests, just type `make`.
+To build the tests, just type `make` or `make -j`.
 
-If CUDA is not installed in /usr/local/cuda, you may specify CUDA\_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL\_HOME.
+If CUDA is not installed in `/usr/local/cuda`, you may specify `CUDA_HOME`. Similarly, if NCCL is not installed in `/usr`, you may specify `NCCL_HOME`.
 
 ```shell
 $ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
 ```
 
-NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI\_HOME to the path where MPI is installed.
+NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set `MPI=1` and set `MPI_HOME` to the path where MPI is installed.
 
 ```shell
 $ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
 ```
 
+You can also add a suffix to the names of the generated binaries with `NAME_SUFFIX`. For example, when compiling the MPI versions you could use:
+
+```shell
+$ make MPI=1 NAME_SUFFIX=_mpi MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
+```
+
+This will generate test binaries with names such as `all_reduce_perf_mpi`.
+
 ## Usage
 
-NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of process is managed by MPI and is therefore not passed to the tests as argument. The total number of ranks (=CUDA devices) will be equal to (number of processes)\*(number of threads)\*(number of GPUs per thread).
+NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument. The total number of ranks (=CUDA devices) will be equal to `(number of processes)*(number of threads)*(number of GPUs per thread)`.
 
 ### Quick examples
 
 Run on single node with 8 GPUs (`-g 8`), scanning from 8 Bytes to 128MBytes :
+
 ```shell
 $ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
 ```
 
 Run 64 MPI processes on nodes with 8 GPUs each, for a total of 64 GPUs spread across 8 nodes :
 (NB: The nccl-tests binaries must be compiled with `MPI=1` for this case)
+
 ```shell
 $ mpirun -np 64 -N 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
 ```
@@ -58,7 +68,7 @@ All tests support the same set of arguments :
   * `-r,--root <root>` Specify which root to use. Only for operations with a root like broadcast or reduce. Default : 0.
 * Performance
   * `-n,--iters <iteration count>` number of iterations. Default : 20.
-  * `-w,--warmup_iters <warmup iteration count>` number of warmup iterations (not timed). Default : 5.
+  * `-w,--warmup_iters <warmup iteration count>` number of warmup iterations (not timed). Default : 1.
   * `-m,--agg_iters <aggregated iteration count>` number of operations to aggregate together in each iteration. Default : 1.
   * `-N,--run_cycles <cycle count>` run & print each cycle. Default : 1; 0=infinite.
   * `-a,--average <0/1/2/3>` Report performance as an average across all ranks (MPI=1 only). <0=Rank0,1=Avg,2=Min,3=Max>. Default : 1.
@@ -67,11 +77,32 @@ All tests support the same set of arguments :
   * `-c,--check <check iteration count>` perform count iterations, checking correctness of results on each iteration. This can be quite slow on large numbers of GPUs. Default : 1.
   * `-z,--blocking <0/1>` Make NCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0.
   * `-G,--cudagraph <num graph launches>` Capture iterations as a CUDA graph and then replay specified number of times. Default : 0.
-  * `-C,--report_cputime <0/1>]` Report CPU time instead of latency. Default : 0.
-  * `-R,--local_register <1/0>` enable local buffer registration on send/recv buffers. Default : 0.
+  * `-C,--report_cputime <0/1>` Report CPU time instead of latency. Default : 0.
+  * `-R,--local_register <0/1/2>` enable local (1) or symmetric (2) buffer registration on send/recv buffers. Default : 0.
+  * `-S,--report_timestamps <0/1>` Add a timestamp (`"%Y-%m-%d %H:%M:%S"`) to each performance report line. Default : 0.
+  * `-J,--output_file <filepath>` Write JSON output to filepath. The type is inferred from the suffix (only `json` is supported presently).
+  * `-T,--timeout
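The rank formula in the Usage section above, ranks = (processes) * (threads per process) * (GPUs per thread), can be sketched as a small shell calculation. The 2x2x2 layout below is an illustrative assumption, not a configuration taken from the diff:

```shell
# Sketch of the total-rank formula from the Usage section:
#   ranks = (number of processes) * (threads per process) * (GPUs per thread)
# The 2-process / 2-thread / 2-GPU layout here is an assumption for illustration.
procs=2
threads=2
gpus_per_thread=2
ranks=$((procs * threads * gpus_per_thread))
echo "total ranks: ${ranks}"   # 2 * 2 * 2 = 8
```

Assuming the binaries were built with `MPI=1`, a launch matching this layout would look something like `mpirun -np 2 ./build/all_reduce_perf -t 2 -g 2`, with each of the 8 ranks mapped to one CUDA device.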