125 Commits

Author SHA1 Message Date
David Addison
81463c58d0 NCCL_TESTS_VERSION 2.17.8 2026-01-06 15:00:17 -08:00
David Addison
7278698c1b Clarified use of Mebibytes and Gibibytes for sizes 2026-01-06 14:59:17 -08:00
Katie Gioioso
2656c58421 NCCL_TESTS_VERSION 2.17.7 2025-12-30 20:18:25 +00:00
Katie Gioioso
070d17528c refactor comm init 2025-12-30 20:18:25 +00:00
Katie Gioioso
332e61896f device api 2.28 is not compatible with 2.29. Check versions and print error if there is a mismatch 2025-12-30 20:18:25 +00:00
Katie Gioioso
24874bdaa8 Compatibility with 2.29 device API: use NCCL_DEV_COMM_REQUIREMENTS_INITIALIZER, query properties to check for device api support 2025-12-30 20:18:24 +00:00
David Addison
7106245178 Add include of <limits> due to compilation error 2025-12-30 20:13:13 +00:00
David Addison
760c467f12 Add memory usage report option
Use -M 1 to dump library memory usage information
2025-12-30 20:12:58 +00:00
David Addison
4bc314aa27 Add README.md text for -J option 2025-11-21 11:31:48 -08:00
David Addison
51f2e7ed7c Remove trailing WS when timestamp option not used 2025-11-03 11:23:52 -08:00
David Addison
da0b547b1b NCCL_TESTS_VERSION 2.17.6 2025-10-28 10:22:08 -07:00
David Addison
e2af90af76 Add new report_timestamps option to README.md 2025-10-28 10:21:58 -07:00
David Addison
a62c975681 Add option to suffix a timestamp to each perf line
Based on code from yakovdyadkin & Scott Moe in MR 349

Adds -S 1 option to suffix each performance report line with
a timestamp. Format is "%Y-%m-%d %H:%M:%S"

This is especially useful when using the -N 0 option and looking
for hangs or failure events.
2025-10-28 10:11:23 -07:00
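To illustrate the timestamp suffix described in a62c975681, here is a minimal Python sketch of appending and recovering a timestamp in the quoted format. Only the format string comes from the commit message. The helper names are hypothetical and this is not the nccl-tests implementation.

```python
from datetime import datetime

# Format string quoted in the commit message.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
# The formatted timestamp is always 19 characters long.
TIMESTAMP_WIDTH = 19


def suffix_timestamp(line, now):
    """Append a timestamp to one report line (hypothetical helper)."""
    return f"{line.rstrip()} {now.strftime(TIMESTAMP_FORMAT)}"


def parse_suffix(line):
    """Recover the timestamp from the end of a suffixed line."""
    return datetime.strptime(line[-TIMESTAMP_WIDTH:], TIMESTAMP_FORMAT)
```

A log-scanning script could call `parse_suffix` on consecutive lines and flag large gaps between them, which is one way to spot the hangs the commit message mentions.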
David Addison
0bb567cc02 NCCL_TESTS_VERSION 2.17.5 2025-10-28 09:34:56 -07:00
Shane Snyder
013c49e930 add necessary ifdef guards for device API tests 2025-10-28 09:34:26 -07:00
Shane Snyder
f66d20e360 add runtime guards for ncclAlltoAll() 2025-10-28 09:32:17 -07:00
David Addison
3744121a2d NCCL_TESTS_VERSION 2.17.4 2025-10-24 17:11:08 -07:00
David Addison
9641693e9b Add PRINT of nccl-tests, NCCL header and library versions 2025-10-24 17:10:57 -07:00
Shane Snyder
9829ea42b5 add GIN-based device API kernels to alltoall
- add GIN-only A2A kernel implementation
- add hybrid LSA+GIN A2A kernel implementation
- update perf test cases to expose a function for setting
  devCommRequirements for each device implementation and
  simplify devCommCreate code path to use this directly instead
  of complex fallback logic
- add missing call to devCommDestroy
2025-10-24 17:10:34 -07:00
David Addison
00f52811b8 Add support for JSON output to perf test framework
This adds support for writing structured information about the run to a JSON file.

Enable with -J <filename>.json

If the target JSON filename already exists then an incrementing numeric suffix will be
added to create <filename>.json.<n>
2025-10-17 12:01:25 -07:00
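The incrementing-suffix rule described in 00f52811b8 can be sketched in Python as follows. The function name is hypothetical, and starting the suffix at 1 is an assumption; the commit message only says the suffix is numeric and incrementing.

```python
import os


def pick_output_path(path):
    """Return `path` if it is unused, else the first free `path.<n>`.

    Assumption: n starts at 1. The commit message does not state the
    starting value used by nccl-tests.
    """
    if not os.path.exists(path):
        return path
    n = 1
    while os.path.exists(f"{path}.{n}"):
        n += 1
    return f"{path}.{n}"
```

This check is not atomic: two processes started at the same moment could pick the same name. A sketch that needs to be race-free would open the file with exclusive-create mode instead.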
Stephen Sachs
abc46770a9 Check if sufficient GPUs are available
The CUDA error message "Test CUDA failure util.cu:706 'invalid device ordinal'"
is not very helpful. Test this explicitly and guide the user.
2025-10-02 15:48:13 -07:00
Sylvain Jeaugey
9a5c15461a Fix compilation for old NCCL versions
Fix compilation failure on ctaPolicy with NCCL <= 2.26.
Fix compilation failure on local_register with NCCL <= 2.18.
Fix ctaPolicy behavior if the tests are compiled with NCCL <= 2.26
but run with NCCL >= 2.27.
2025-09-05 09:15:06 -07:00
David Addison
e12dbb0a14 Update to align with the NCCL 2.28 release
Added Device API infrastructure and example kernels
Two new command line arguments:

  -D <num> device kernel implementation to use <0/1/2/3/4>
  -V <num> number of CTAs to launch device kernels with

Added new CTA Policy command line option:

  -x <policy> set the CTA Policy <0/1/2>
2025-09-04 17:23:22 -07:00
David Addison
c2cb96faac Update NVCUFLAGS and CXXFLAGS to use -std=c++14 2025-08-29 14:55:31 -07:00
David Addison
f2015cbe82 Modified warmup to run for more message sizes
Loops between minBytes and maxBytes doubling size each time

Reduced default warmup iteration count to 1 (was 5)
2025-08-25 13:57:51 -07:00
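The warmup schedule described in f2015cbe82 (sizes doubling from minBytes up to maxBytes) can be sketched in Python as below. The function name is hypothetical. Stopping at the last doubled size that does not exceed maxBytes is an assumption, since the commit message does not say how a non-power-of-two range ends.

```python
def warmup_sizes(min_bytes, max_bytes):
    """List the message sizes visited by a doubling warmup loop.

    Assumption: the loop stops at the last size <= max_bytes rather
    than clamping a final step to max_bytes.
    """
    if min_bytes <= 0 or min_bytes > max_bytes:
        raise ValueError("need 0 < min_bytes <= max_bytes")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes
```

For a range of 8 to 64 bytes this visits four sizes, so one warmup iteration per size still touches each message size that will be measured at that doubling step.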
David Addison
fae7cb4727 Merge pull request #316 from martin-belanger/print-program-name
Print the name of the program being executed before and after test output
2025-07-24 14:58:54 -07:00
David Addison
6edafa0a9c Add extra reserved space during maxBytes calculation
Also, don't allow minBytes > maxBytes
2025-07-23 16:19:37 -07:00
David Addison
def2d3689c Minor fix to Makefile
Move comments to separate lines
2025-07-23 16:04:30 -07:00
David Addison
97ee098516 Add Turing (SM75) support to CUDA 13.0 builds 2025-06-04 17:54:58 -07:00
David Addison
e7c8825b0b Wrap ncclCommWindowRegister() calls within ncclGroup 2025-06-03 10:36:53 -07:00
Martin Belanger
dafb70408d Print the name of the program being executed
One thing missing from the stdout of each performance test is
the name of the test that is actually being run.

This patch adds 2 new messages to the stdout. At the beginning
of the execution of a test (e.g. sendrecv_perf) we will now
see this message:

  Collective test starting: sendrecv_perf

And at the end, we will now see this:

  Collective test concluded: sendrecv_perf

This is needed when running several tests consecutively and we're
trying to parse the stdout to collect the results.

For example, using a Python script to parse the stdout, one could
retrieve the results for each test and plot them on a graph. This
patch makes it easier to implement such a script.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
2025-06-03 11:43:02 -04:00
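The commit message for dafb70408d mentions parsing stdout with a Python script. A minimal sketch of such a parser is shown below. The two marker strings are taken from the commit message; the function name is hypothetical, and the regular expressions use `search` because the exact prefix printed before each marker is not stated.

```python
import re

# Marker strings quoted in the commit message.
START = re.compile(r"Collective test starting: (\S+)")
END = re.compile(r"Collective test concluded: (\S+)")


def split_by_test(stdout):
    """Group output lines by the test that produced them.

    Returns a dict mapping test name -> list of lines found between
    that test's start and end markers. Lines outside any marker pair
    are dropped.
    """
    sections = {}
    current = None
    for line in stdout.splitlines():
        started = START.search(line)
        if started:
            current = started.group(1)
            sections[current] = []
            continue
        ended = END.search(line)
        if ended and ended.group(1) == current:
            current = None
            continue
        if current is not None:
            sections[current].append(line)
    return sections
```

If the same test name appears twice in one log, the later run replaces the earlier one in this sketch; a script that needs both would key on a run counter as well.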
David Addison
5290298ab6 Reinstate Pascal support for CUDA 12.8+ builds 2025-06-02 09:29:52 -07:00
David Addison
8bc16f4e01 Need to drop Volta (sm_70) support from CUDA 13.0 2025-05-30 18:04:25 -07:00
David Addison
0c60e6a8e4 Fix formatting errors in README.md 2025-05-30 17:43:30 -07:00
David Addison
a5c539e68b Add support for Symmetric Memory Registration
From NCCL 2.27.x we can now use the Symmetric Memory APIs (-R 2)
2025-05-30 17:31:34 -07:00
David Addison
e041d901e6 Re-add sm_70 support for CUDA 12.8+ and 13.0 builds 2025-05-07 10:30:59 -07:00
David Addison
1021260ca9 Make verifiable a DSO and add NAME_SUFFIX support
Build option DSO=1 generates libverifiable.so which can be
used to reduce the combined binary size.

Build option NAME_SUFFIX can be used to add a suffix to all
generated binaries. e.g. NAME_SUFFIX=_mpi

Added new make target: clean_intermediates
2025-04-23 17:07:24 -07:00
David Addison
501a149d57 Add support for FP8 datatypes
Added new datatypes: f8e4m3, f8e5m2

Only supported on H100+ architectures and NCCL versions >= 2.24.0
2025-04-18 19:20:59 -07:00
David Addison
b4300cc79d Add PCI domain and device ID for GPU device BDF display 2025-02-28 13:25:51 -08:00
Sylvain Jeaugey
903918fc54 Add NCCL_TESTS_SPLIT documentation in the README 2025-02-06 14:10:07 +01:00
Junyu Ma
a89cf07fe8 Perftests: Introduce NCCL_TESTS_SPLIT env
`NCCL_TESTS_SPLIT` serves as a new way of computing the color for splitting communicators.

It is overridden by `NCCL_TESTS_SPLIT_MASK` when both are set.

Examples:

NCCL_TESTS_SPLIT_MASK="0x7" # color = rank & 0x7. What we do today to run on a DGX with one GPU per node.
NCCL_TESTS_SPLIT="AND 0x7"  # color = rank & 0x7. New way to run on one GPU per node on a DGX, equivalent to NCCL_TESTS_SPLIT_MASK=0x7
NCCL_TESTS_SPLIT="MOD 72"   # color = rank % 72.  One GPU per NVLink domain on an NVL72 system.
NCCL_TESTS_SPLIT="DIV 72"   # color = rank / 72.  Intra NVLink domain on NVL72.

You can also use: "%" "&" "|" "/" for short.
Extra spaces in the middle will be automatically ignored.
Not case sensitive.

The following are all equivalent:

NCCL_TESTS_SPLIT="%0x7"
NCCL_TESTS_SPLIT="%0b111"
NCCL_TESTS_SPLIT="MOD 7"
NCCL_TESTS_SPLIT="mod 0x7"
2025-02-04 15:18:09 -08:00
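The color computation described in a89cf07fe8 can be sketched in Python as follows. The operator names, shorthands, case-insensitivity, and tolerance for spaces come from the commit message. The function names are hypothetical, "OR" as the spelled-out form of "|" is an assumption, and so is the fallback color of 0 when neither variable is set.

```python
import re

# Operator table. "OR" is an assumed spelling for the "|" shorthand.
_OPS = {
    "AND": lambda rank, v: rank & v,
    "&": lambda rank, v: rank & v,
    "OR": lambda rank, v: rank | v,
    "|": lambda rank, v: rank | v,
    "MOD": lambda rank, v: rank % v,
    "%": lambda rank, v: rank % v,
    "DIV": lambda rank, v: rank // v,
    "/": lambda rank, v: rank // v,
}

_SPEC = re.compile(r"^\s*(AND|OR|MOD|DIV|[&|%/])\s*(\S+)\s*$", re.IGNORECASE)


def split_color(spec, rank):
    """Compute the communicator color for `rank` from a split spec."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"unrecognized split spec: {spec!r}")
    op = match.group(1).upper()
    # Base 0 accepts decimal, 0x hex and 0b binary literals.
    value = int(match.group(2), 0)
    return _OPS[op](rank, value)


def color_from_env(env, rank):
    """Apply the precedence rule: NCCL_TESTS_SPLIT_MASK wins if set."""
    mask = env.get("NCCL_TESTS_SPLIT_MASK")
    if mask is not None:
        return rank & int(mask, 0)
    spec = env.get("NCCL_TESTS_SPLIT")
    # Assumption: with neither variable set, every rank gets color 0.
    return split_color(spec, rank) if spec else 0
```

For rank 100 on an NVL72-style layout, `split_color("MOD 72", 100)` yields the position within the NVLink domain and `split_color("DIV 72", 100)` yields the domain index.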
David Addison
cb6a46fdd6 Update CUDA gencodes
Add support for Blackwell sm100 and sm120 from CUDA 12.8

Add support for Hopper sm90 from CUDA 12.0
2025-01-25 17:32:16 -08:00
John Bachan
29f4114f02 Fixes to all tests that divide buffers by nranks so that they trim buffer sizes to be multiples of 16 bytes.
This ensures non-pow2 ranks have buffer addresses aligned suitably for performance.
2024-12-18 11:20:28 -08:00
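The trimming described in 29f4114f02 can be sketched in Python as below: the per-rank share of a buffer is rounded down to a multiple of 16 bytes, so every rank's offset stays 16-byte aligned even when the rank count is not a power of two. The function name is hypothetical, and working in bytes rather than element counts is an assumption.

```python
ALIGN_BYTES = 16  # alignment stated in the commit message


def per_rank_bytes(total_bytes, nranks):
    """Per-rank chunk size, rounded down to a multiple of 16 bytes."""
    if nranks <= 0:
        raise ValueError("nranks must be positive")
    return (total_bytes // nranks) // ALIGN_BYTES * ALIGN_BYTES


def rank_offsets(total_bytes, nranks):
    """Byte offset of each rank's chunk inside the shared buffer."""
    chunk = per_rank_bytes(total_bytes, nranks)
    return [rank * chunk for rank in range(nranks)]
```

With a 1 MiB buffer split across 6 ranks, the untrimmed share would be 174762 bytes, which leaves most rank offsets misaligned; the trimmed share keeps every offset on a 16-byte boundary at the cost of a few unused bytes at the end of the buffer.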
Sylvain Jeaugey
8dfeab9eb9 Merge pull request #259 from NVIDIA/fix-ncclstringtotype
Future-proof ncclstringtotype
2024-10-24 10:28:02 -07:00
Kamil Iskra
34d6d53910 Future-proof ncclstringtotype
Ensure that ncclstringtotype iterates only over data types known to
nccl-tests (as indicated by test_typenum), not over a potentially larger
set of all NCCL types.
2024-10-24 09:21:37 -07:00
David Addison
9d26b8422b Merge pull request #226 from netgroup/master
improve parsing of stepbytes (increment size) argument
2024-07-30 14:58:54 -07:00
David Addison
0d86b5a6e7 Added some missing command line options to README.md
Also updated single and multi-node examples.
2024-07-30 14:50:45 -07:00
David Addison
d2d40cc824 Added -N,--run_cycles option 2024-07-25 22:00:23 -07:00
David Addison
3a3f790efd Merge pull request #240 from OrenLeung/patch-1
doc: add all2all factor
2024-07-25 22:00:06 -07:00
Oren
c6eb15875f doc: add all2all factor 2024-07-24 22:55:00 -04:00