Commit Graph

88 Commits

Author SHA1 Message Date
Kaiming Ouyang
59072b7e3d Add symmetric registration support
-R 2 will enable symmetric registration
2025-03-14 17:04:06 -07:00
David Addison
b4300cc79d Add PCI domain and device ID for GPU device BDF display 2025-02-28 13:25:51 -08:00
Sylvain Jeaugey
903918fc54 Add NCCL_TESTS_SPLIT documentation in the README 2025-02-06 14:10:07 +01:00
Junyu Ma
a89cf07fe8 Perftests: Introduce NCCL_TESTS_SPLIT env
`NCCL_TESTS_SPLIT` serves as a new way of computing the color for splitting communicators.

It is overridden by `NCCL_TESTS_SPLIT_MASK` when both are set.

Examples:

NCCL_TESTS_SPLIT_MASK="0x7" # color = rank & 0x7. What we do today to run on a DGX with one GPU per node.
NCCL_TESTS_SPLIT="AND 0x7"  # color = rank & 0x7. New way to run on one GPU per node on a DGX, equivalent to NCCL_TESTS_SPLIT_MASK=0x7
NCCL_TESTS_SPLIT="MOD 72"   # color = rank % 72.  One GPU per NVLink domain on an NVL72 system.
NCCL_TESTS_SPLIT="DIV 72"   # color = rank / 72.  Intra NVLink domain on NVL72.

You can also use: "%" "&" "|" "/" for short.
Extra spaces in the middle will be automatically ignored.
Not case sensitive.

The following are all equivalent:

NCCL_TESTS_SPLIT="&0x7"
NCCL_TESTS_SPLIT="&0b111"
NCCL_TESTS_SPLIT="AND 7"
NCCL_TESTS_SPLIT="and 0x7"
2025-02-04 15:18:09 -08:00
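
The color rule described in this commit message can be sketched in a few lines of Python. This is an illustration only, not the parser used by nccl-tests: the names `parse_split` and `split_color` are hypothetical, the spelled-out keyword `OR` is assumed from the `|` shorthand, and numeric literals are read with Python's `int(text, 0)`, which may accept or reject forms that the real parser treats differently.

```python
import re

# Map each operator spelling to the arithmetic applied to the rank.
# "OR" as a keyword is an assumption; the commit message only lists "|".
_OPS = {
    "AND": lambda rank, value: rank & value,
    "&": lambda rank, value: rank & value,
    "OR": lambda rank, value: rank | value,
    "|": lambda rank, value: rank | value,
    "MOD": lambda rank, value: rank % value,
    "%": lambda rank, value: rank % value,
    "DIV": lambda rank, value: rank // value,
    "/": lambda rank, value: rank // value,
}

_SPEC = re.compile(r"\s*(AND|OR|MOD|DIV|[&|%/])\s*(\S+)\s*", re.IGNORECASE)


def parse_split(spec):
    """Split a spec such as "AND 0x7" into (operator, integer operand)."""
    match = _SPEC.fullmatch(spec)
    if match is None:
        raise ValueError(f"unrecognized split spec: {spec!r}")
    op = match.group(1).upper()       # operators are not case sensitive
    value = int(match.group(2), 0)    # base 0 accepts decimal, 0x.. and 0b..
    return op, value


def split_color(rank, spec):
    """Return the communicator color for `rank` under the given spec."""
    op, value = parse_split(spec)
    return _OPS[op](rank, value)
```

Under these assumptions, `split_color(rank, "MOD 72")` groups ranks that share the same position across NVLink domains, while `split_color(rank, "DIV 72")` groups ranks that sit in the same domain.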
David Addison
cb6a46fdd6 Update CUDA gencodes
Add support for Blackwell sm100 and sm120 from CUDA 12.8

Add support for Hopper sm90 from CUDA 12.0
2025-01-25 17:32:16 -08:00
John Bachan
29f4114f02 Fixes to all tests that divide buffers by nranks so that they trim buffer sizes to be multiples of 16 bytes.
This ensures non-pow2 ranks have buffer addresses aligned suitably for performance.
2024-12-18 11:20:28 -08:00
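
A minimal sketch of the trimming idea, assuming the per-rank size is simply rounded down to a multiple of 16 bytes. The helper name `per_rank_bytes` is hypothetical and the actual test code may compute sizes differently.

```python
ALIGN_BYTES = 16  # alignment named in the commit message


def per_rank_bytes(total_bytes, nranks):
    """Divide a buffer among ranks, rounding each share down to 16 bytes."""
    share = total_bytes // nranks
    return (share // ALIGN_BYTES) * ALIGN_BYTES


# With 3 ranks and 1000 bytes, an untrimmed share of 333 bytes would put
# rank 1 at offset 333. The trimmed share of 320 keeps every offset aligned.
offsets = [rank * per_rank_bytes(1000, 3) for rank in range(3)]
```

Because every share is a multiple of 16, each rank's starting offset inside a 16-byte-aligned buffer is also 16-byte aligned, even when the rank count is not a power of two.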
Sylvain Jeaugey
8dfeab9eb9 Merge pull request #259 from NVIDIA/fix-ncclstringtotype
Future-proof ncclstringtotype
2024-10-24 10:28:02 -07:00
Kamil Iskra
34d6d53910 Future-proof ncclstringtotype
Ensure that ncclstringtotype iterates only over data types known to
nccl-tests (as indicated by test_typenum), not over a potentially larger
set of all NCCL types.
2024-10-24 09:21:37 -07:00
David Addison
9d26b8422b Merge pull request #226 from netgroup/master
improve parsing of stepbytes (increment size) argument
2024-07-30 14:58:54 -07:00
David Addison
0d86b5a6e7 Added some missing command line options to README.md
Also updated single and multi-node examples.
2024-07-30 14:50:45 -07:00
David Addison
d2d40cc824 Added -N,--run_cycles option 2024-07-25 22:00:23 -07:00
David Addison
3a3f790efd Merge pull request #240 from OrenLeung/patch-1
doc: add all2all factor
2024-07-25 22:00:06 -07:00
Oren
c6eb15875f doc: add all2all factor 2024-07-24 22:55:00 -04:00
Stefano Salsano
746549b28d improve parsing of stepbytes (increment size) argument 2024-06-14 11:28:55 +02:00
Kaiming Ouyang
d028efcf35 Change ncclCommRegister size to maxBytes in serial comm init 2024-06-06 06:54:48 -07:00
Giuseppe Congiu
a1efb427e7 Add -R option to register user buffers 2024-06-03 01:04:58 -07:00
David Addison
c6afef0b6f Added missing MPI_Comm_free() call before MPI_Finalize() 2024-02-05 08:53:54 -08:00
David Addison
1292b25553 Added an MPI_Barrier() call after MPI_Bcast() for HCOLL issue 2023-10-12 16:53:32 -07:00
David Addison
6c46206a47 Make the -c option be a datacheck iteration count parameter
Default is 1
2023-09-13 14:03:38 -07:00
Sylvain Jeaugey
1a5f551ffd Merge pull request #146 from yangxingwu/master
makefile: remove extra space
2023-06-06 11:58:24 +02:00
yangxingwu
52ea1b2148 makefile: remove extra space 2023-06-06 09:47:50 +00:00
Sylvain Jeaugey
e98ef24bc0 Merge pull request #135 from aavbsouza/fix_nvcc_variable_handling
fix handling of variable NVCC.
2023-03-27 11:14:10 +02:00
alan.souza
7ccda3c97b fix handling of variable NVCC. Permit overriding the variable using environment variables 2023-03-25 16:56:16 -03:00
David Addison
e76e36e9a9 Merge pull request #134 from flx42/patch-1
Update README.md to fix -i default increment value.
2023-03-23 09:53:15 -07:00
Felix Abecassis
17d0a42d5a Update README.md 2023-03-23 09:05:41 -07:00
Sylvain Jeaugey
2cbb968101 Update README.md
Improve MPI example to avoid confusion of number of processes / total number of GPUs.

https://github.com/NVIDIA/nccl-tests/issues/54#issuecomment-1212023369
2023-01-03 08:47:43 +01:00
David Addison
0b4c4cb99f Add boot_id to the hostname hash due to collisions on Azure
Fixes #60
2022-12-12 01:16:46 -08:00
Jithin Jose
0aeba157db Use DJB2a hash algorithm in getHostHash() 2022-12-12 01:16:38 -08:00
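
For reference, DJB2a starts from 5381 and, for each input byte, multiplies the running hash by 33 and XORs in the byte. The sketch below shows that recurrence in Python with an assumed 64-bit result width. The `host_hash` helper, which concatenates a hostname and a boot ID before hashing, is a hypothetical illustration of why mixing in the boot ID separates machines that share a hostname; it is not the actual `getHostHash()` code.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # assumed 64-bit hash width


def djb2a(data):
    """DJB2a: hash = (hash * 33) XOR byte, starting from 5381."""
    result = 5381
    for byte in data:
        result = ((result * 33) ^ byte) & MASK64
    return result


def host_hash(hostname, boot_id):
    """Hypothetical combination: hash the hostname followed by the boot ID."""
    return djb2a((hostname + boot_id).encode("utf-8"))
```

Two machines that report the same hostname then hash differently whenever the hashes of their combined strings differ, which is the usual case when the boot IDs differ.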
David Addison
24fcf64ed1 Call cudaFreeHost() on wrongPerGpu not cudaFree() 2022-11-22 11:18:37 -08:00
David Addison
3bd2bd292b Add fflush(stdout) before perf output 2022-11-22 11:16:47 -08:00
Sylvain Jeaugey
365b92a1ea Fix build on RHEL7 with GCC 4.8
Add -std=c++11 to CXXFLAGS.
Fixes #116.
2022-10-12 01:24:14 -07:00
Sylvain Jeaugey
d313d20a26 Update NCCL tests 2022-09-23 01:13:29 -07:00
David Addison
749573f2d6 Fix preprocessor version check for ncclGetLastError()
ncclGetLastError() was added in NCCL 2.13.0
2022-09-07 16:10:41 -07:00
David Addison
afa4c56b6a Fix an issue with the last commit when data checking is disabled 2022-09-07 11:23:49 -07:00
David Addison
a0a14911ee Display N/A for error count in AlltoAll in-place test
AlltoAll does not support in-place buffers
2022-09-06 13:17:15 -07:00
John Bachan
bc5f7cfb0a Changed top-level Makefile behavior so that BUILDDIR is interpreted
as relative to the top-level directory. This is done by abspath'ing it before
passing it to the subdirectory Makefiles.

The old behavior had two cases: with and without BUILDDIR being set by
the user. With BUILDDIR not set, the build dir would be named "build"
in the top-level directory. If BUILDDIR was set, then the build dir
would be placed at "src/${BUILDDIR}".

The new behavior is simpler: if BUILDDIR is not set then it defaults
to "build", and the directory holding the final build is always at just
"${BUILDDIR}" in the top level.
2022-08-23 10:08:49 -07:00
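
The difference between the two behaviors can be sketched as a pair of path-resolution helpers. This is only a model of the description above, written in Python rather than Make: the function names are hypothetical, and it assumes `make` is invoked from the top-level directory so that a relative BUILDDIR resolves against it.

```python
import posixpath

DEFAULT_BUILDDIR = "build"


def old_build_dir(top_dir, builddir=None):
    """Old behavior: an explicit BUILDDIR ended up under src/."""
    if builddir is None:
        return posixpath.normpath(posixpath.join(top_dir, DEFAULT_BUILDDIR))
    return posixpath.normpath(posixpath.join(top_dir, "src", builddir))


def new_build_dir(top_dir, builddir=None):
    """New behavior: BUILDDIR (default "build") resolves from the top level."""
    chosen = builddir if builddir is not None else DEFAULT_BUILDDIR
    return posixpath.normpath(posixpath.join(top_dir, chosen))
```

With `top_dir="/repo"` and `builddir="out"`, the old rule gives `/repo/src/out` and the new rule gives `/repo/out`. An absolute BUILDDIR is returned unchanged by both helpers.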
John Bachan
51af5572bf Resync with NCCL 2.13
* Added "verifiable", a suite of kernels for generating and verifying reduction
  input and output arrays in a bit-precise way.
* Data corruption errors are now reported as the number of wrong elements instead of
  the max deviation.
* Use ncclGetLastError.
* Don't run hypercube on non-power-of-2 rank counts.
* Fix to hypercube data verification.
* Use "thread local" as the default CUDA capture mode.
* Replaced pthread_yield() with sched_yield()
* Bugfix to the cpu-side barrier/allreduce implementations.
2022-08-22 17:51:06 -07:00
David Addison
8274cb47b6 Merge pull request #96 from NVIDIA/nersc-linkage-fix
Add option to statically link cudart
2022-05-26 16:54:44 -07:00
David Addison
de3ddbe261 Add option to statically link cudart
Build with CUDARTLIB=cudart_static to remove dynamic linkage

Also removed unused curand and nvToolsExt dependencies

BUG 95
2021-11-10 10:02:41 -08:00
David Addison
7130fa6096 Add MPI_IBM build option 2021-10-25 16:30:57 -07:00
David Addison
f773748b46 Resync with NCCL 2.11
New operator: mulsum
New test: gather
2021-09-17 09:02:45 -07:00
David Addison
1f8f541686 Add CUDA graph support only for CUDA 11.3 and later builds
Fixes #90
2021-07-13 10:47:47 -07:00
David Addison
b9f90d12a9 Removed MPI_SUPPORT conditional compilation of average flag 2021-07-12 11:43:57 -07:00
David Addison
547e119d35 Fix issues with MPI_Allreduce and multi-threaded tests 2021-07-08 16:42:40 -07:00
David Addison
11cff17a04 Updated with new command line arguments 2021-07-06 16:27:45 -07:00
David Addison
f476f4a17a Merge branch 'bfloat16' 2021-07-06 10:20:32 -07:00
David Addison
1dfc76eccc Added new option to report average iteration time 2021-06-30 19:36:07 -07:00
David Addison
1ae8cdc315 Resync with changes in gitlab-master code 2021-06-30 13:16:04 -07:00
David Addison
44df0bf010 Merge pull request #88 from nzmsv/master
Cleanup argument error handling and messages
2021-06-30 12:35:47 -07:00
David Addison
9dae3d3a37 Added new tests: scatter, sendrecv, hypercube 2021-06-28 16:49:10 -07:00