Merge remote-tracking branch 'upstream/master' into sync/upstream-20251216

This commit is contained in:
刘小丽 2025-12-16 23:05:46 +08:00
commit 8ab1ad2779
24 changed files with 2891 additions and 542 deletions


@@ -4,33 +4,43 @@ These tests check both the performance and the correctness of [NCCL](http://gith
## Build
To build the tests, just type `make`.
To build the tests, just type `make` or `make -j`.
If CUDA is not installed in /usr/local/cuda, you may specify CUDA\_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL\_HOME.
If CUDA is not installed in `/usr/local/cuda`, you may specify `CUDA_HOME`. Similarly, if NCCL is not installed in `/usr`, you may specify `NCCL_HOME`.
```shell
$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
```
NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI\_HOME to the path where MPI is installed.
NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set `MPI=1` and set `MPI_HOME` to the path where MPI is installed.
```shell
$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
```
You can also add a suffix to the name of the generated binaries with `NAME_SUFFIX`. For example when compiling with the MPI versions you could use:
```shell
$ make MPI=1 NAME_SUFFIX=_mpi MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
```
This will generate test binaries with names such as `all_reduce_perf_mpi`.
## Usage
NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of process is managed by MPI and is therefore not passed to the tests as argument. The total number of ranks (=CUDA devices) will be equal to (number of processes)\*(number of threads)\*(number of GPUs per thread).
NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument. The total number of ranks (=CUDA devices) will be equal to `(number of processes)*(number of threads)*(number of GPUs per thread)`.
### Quick examples
Run on single node with 8 GPUs (`-g 8`), scanning from 8 Bytes to 128MBytes :
```shell
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```
Run 64 MPI processes on nodes with 8 GPUs each, for a total of 64 GPUs spread across 8 nodes :
(NB: The nccl-tests binaries must be compiled with `MPI=1` for this case)
```shell
$ mpirun -np 64 -N 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```
@@ -58,7 +68,7 @@ All tests support the same set of arguments :
* `-r,--root <root/all>` Specify which root to use. Only for operations with a root like broadcast or reduce. Default : 0.
* Performance
* `-n,--iters <iteration count>` number of iterations. Default : 20.
* `-w,--warmup_iters <warmup iteration count>` number of warmup iterations (not timed). Default : 5.
* `-w,--warmup_iters <warmup iteration count>` number of warmup iterations (not timed). Default : 1.
* `-m,--agg_iters <aggregation count>` number of operations to aggregate together in each iteration. Default : 1.
* `-N,--run_cycles <cycle count>` run & print each cycle. Default : 1; 0=infinite.
* `-a,--average <0/1/2/3>` Report performance as an average across all ranks (MPI=1 only). <0=Rank0,1=Avg,2=Min,3=Max>. Default : 1.
@@ -67,11 +77,32 @@ All tests support the same set of arguments :
* `-c,--check <check iteration count>` perform count iterations, checking correctness of results on each iteration. This can be quite slow on large numbers of GPUs. Default : 1.
* `-z,--blocking <0/1>` Make NCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0.
* `-G,--cudagraph <num graph launches>` Capture iterations as a CUDA graph and then replay specified number of times. Default : 0.
* `-C,--report_cputime <0/1>]` Report CPU time instead of latency. Default : 0.
* `-R,--local_register <1/0>` enable local buffer registration on send/recv buffers. Default : 0.
* `-C,--report_cputime <0/1>` Report CPU time instead of latency. Default : 0.
* `-R,--local_register <0/1/2>` enable local (1) or symmetric (2) buffer registration on send/recv buffers. Default : 0.
* `-S,--report_timestamps <0/1>` Add timestamp (`"%Y-%m-%d %H:%M:%S"`) to each performance report line. Default : 0.
* `-J,--output_file <file>` Write [JSON] output to filepath. Infer type from suffix (only `json` supported presently).
* `-T,--timeout <time in seconds>` timeout each test after specified number of seconds. Default : disabled.
### Running multiple operations in parallel
NCCL tests allow partitioning the set of GPUs into smaller sets, each executing the same operation in parallel.
To split the GPUs, NCCL will compute a "color" for each rank, based on the `NCCL_TESTS_SPLIT` environment variable, then all ranks
with the same color will end up in the same group. The resulting group is printed next to each GPU at the beginning of the test.
`NCCL_TESTS_SPLIT` takes the following syntax: `<operation><value>`. Operation can be `AND`, `OR`, `MOD` or `DIV`. The `&`, `|`, `%`, and `/` symbols are also supported. The value can be either decimal, hexadecimal (prefixed by `0x`) or binary (prefixed by `0b`).
`NCCL_TESTS_SPLIT_MASK="<value>"` is equivalent to `NCCL_TESTS_SPLIT="&<value>"`.
Here are a few examples:
- `NCCL_TESTS_SPLIT="AND 0x7"` or `NCCL_TESTS_SPLIT="MOD 8"`: On systems with 8 GPUs, run 8 parallel operations, each with 1 GPU per node (purely communicating over the inter-node network)
- `NCCL_TESTS_SPLIT="OR 0x7"` or `NCCL_TESTS_SPLIT="DIV 8"`: On systems with 8 GPUs, run one operation per node, purely intra-node.
- `NCCL_TESTS_SPLIT="AND 0x1"` or `NCCL_TESTS_SPLIT="MOD 2"`: Run two operations, each operation using every other rank.
Note that the reported bandwidth is per group, hence to get the total bandwidth used by all groups, one must multiply by the number of groups.
## Copyright
NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2024, NVIDIA CORPORATION. All rights reserved.
NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2025, NVIDIA CORPORATION. All rights reserved.


@@ -1,66 +1,16 @@
#
# Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
#
# See LICENSE.txt for license information
#
include common.mk
CUDA_HOME ?= /usr/local/cuda
PREFIX ?= /usr/local
VERBOSE ?= 0
DEBUG ?= 0
CUDA_LIB ?= $(CUDA_HOME)/lib64
CUDA_INC ?= $(CUDA_HOME)/include
NVCC ?= $(CUDA_HOME)/bin/nvcc
CUDARTLIB ?= cudart
CUDA_VERSION = $(strip $(shell which $(NVCC) >/dev/null && $(NVCC) --version | grep release | sed 's/.*release //' | sed 's/\,.*//'))
CUDA_MAJOR = $(shell echo $(CUDA_VERSION) | cut -d "." -f 1)
# Better define NVCC_GENCODE in your environment to the minimal set
# of archs to reduce compile time.
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 12; echo $$?),0)
NVCC_GENCODE ?= -gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_89,code=sm_89 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90,code=compute_90
else ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 11; echo $$?),0)
NVCC_GENCODE ?= -gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80
else
NVCC_GENCODE ?= -gencode=arch=compute_35,code=sm_35 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_70,code=compute_70
endif
$(info NVCC_GENCODE is ${NVCC_GENCODE})
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11
CXXFLAGS := -std=c++11
LDFLAGS := -L${CUDA_LIB} -lcudart -lrt
NVLDFLAGS := -L${CUDA_LIB} -l${CUDARTLIB} -lrt
ifeq ($(DEBUG), 0)
NVCUFLAGS += -O3 -g
CXXFLAGS += -O3 -g
else
NVCUFLAGS += -O0 -G -g
CXXFLAGS += -O0 -g -ggdb3
endif
ifneq ($(VERBOSE), 0)
NVCUFLAGS += -Xcompiler -Wall,-Wextra,-Wno-unused-parameter
else
.SILENT:
endif
# Set to 1 to enable MPI support (multi-process/multi-node)
MPI ?= 0
# e.g. Set to _mpi when using MPI=1
NAME_SUFFIX ?=
# Set to 1 to create and use libverifiable.so to reduce binary size
DSO ?= 0
.PHONY: build clean
@@ -85,7 +35,7 @@ DST_DIR := $(BUILDDIR)
SRC_FILES := $(wildcard *.cu)
OBJ_FILES := $(SRC_FILES:%.cu=${DST_DIR}/%.o)
BIN_FILES_LIST := all_reduce all_gather broadcast reduce_scatter reduce alltoall scatter gather sendrecv hypercube
BIN_FILES := $(BIN_FILES_LIST:%=${DST_DIR}/%_perf)
BIN_FILES := $(BIN_FILES_LIST:%=${DST_DIR}/%_perf${NAME_SUFFIX})
build: ${BIN_FILES}
@@ -96,7 +46,14 @@ TEST_VERIFIABLE_SRCDIR := ../verifiable
TEST_VERIFIABLE_BUILDDIR := $(BUILDDIR)/verifiable
include ../verifiable/verifiable.mk
${DST_DIR}/%.o: %.cu common.h $(TEST_VERIFIABLE_HDRS)
.PRECIOUS: ${DST_DIR}/%.o
${DST_DIR}/%.o: %.cu common.h util.h $(TEST_VERIFIABLE_HDRS)
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${DST_DIR}
$(NVCC) -o $@ $(NVCUFLAGS) -c $<
${DST_DIR}/%$(NAME_SUFFIX).o: %.cu common.h util.h $(TEST_VERIFIABLE_HDRS)
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${DST_DIR}
$(NVCC) -o $@ $(NVCUFLAGS) -c $<
@@ -104,15 +61,25 @@ ${DST_DIR}/%.o: %.cu common.h $(TEST_VERIFIABLE_HDRS)
${DST_DIR}/timer.o: timer.cc timer.h
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${DST_DIR}
$(CXX) $(CXXFLAGS) -o $@ -c timer.cc
$(CXX) $(CXXFLAGS) -o $@ -c $<
${DST_DIR}/ucommd.o: ucommd.cc ucommd.h
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${DST_DIR}
$(CXX) $(CXXFLAGS) -o $@ -c ucommd.cc
${DST_DIR}/%_perf:${DST_DIR}/%.o ${DST_DIR}/common.o ${DST_DIR}/timer.o ${DST_DIR}/ucommd.o $(TEST_VERIFIABLE_OBJS)
ifeq ($(DSO), 1)
${DST_DIR}/%_perf$(NAME_SUFFIX): ${DST_DIR}/%.o ${DST_DIR}/common$(NAME_SUFFIX).o ${DST_DIR}/util$(NAME_SUFFIX).o ${DST_DIR}/timer.o ${DST_DIR}/ucommd.o $(TEST_VERIFIABLE_LIBS)
@printf "Linking %-35s > %s\n" $< $@
@mkdir -p ${DST_DIR}
$(NVCC) -o $@ $(NVCUFLAGS) $^ ${NVLDFLAGS} -Xcompiler \"-Wl,-rpath,/usr/local/sihpc/lib\"
$(NVCC) -o $@ $(NVCUFLAGS) $^ -L$(TEST_VERIFIABLE_BUILDDIR) -lverifiable ${NVLDFLAGS} -Xlinker "--enable-new-dtags" -Xlinker "-rpath,\$$ORIGIN:\$$ORIGIN/verifiable"
else
${DST_DIR}/%_perf$(NAME_SUFFIX):${DST_DIR}/%.o ${DST_DIR}/common$(NAME_SUFFIX).o ${DST_DIR}/util$(NAME_SUFFIX).o ${DST_DIR}/timer.o $(TEST_VERIFIABLE_OBJS)
@printf "Linking %-35s > %s\n" $< $@
@mkdir -p ${DST_DIR}
$(NVCC) -o $@ $(NVCUFLAGS) $^ ${NVLDFLAGS}
endif
clean_intermediates:
rm -f ${DST_DIR}/*.o $(TEST_VERIFIABLE_OBJS)


@@ -7,10 +7,8 @@
#include "cuda_runtime.h"
#include "common.h"
#define ALIGN 4
void AllGatherGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, int nranks) {
size_t base = (count/(ALIGN*nranks))*ALIGN;
void AllGatherGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
size_t base = (count/nranks) & -(16/eltSize);
*sendcount = base;
*recvcount = base*nranks;
*sendInplaceOffset = base;
@@ -45,8 +43,14 @@ void AllGatherGetBw(size_t count, int typesize, double sec, double* algBw, doubl
*busBw = baseBw * factor;
}
testResult_t AllGatherRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
NCCLCHECK(ncclAllGather(sendbuff, recvbuff, count, type, comm, stream));
testResult_t AllGatherRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
if (deviceImpl == 0) {
char* sptr = (char*)sendbuff + sendoffset;
char* rptr = (char*)recvbuff + recvoffset;
NCCLCHECK(ncclAllGather(sptr, rptr, count, type, comm, stream));
} else {
return testNotImplemented;
}
return testSuccess;
}
@@ -60,7 +64,7 @@ struct testColl allGatherTest = {
void AllGatherGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
AllGatherGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, nranks);
AllGatherGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t AllGatherRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@@ -86,8 +90,8 @@ testResult_t AllGatherRunTest(struct threadArgs* args, int root, ncclDataType_t
}
struct testEngine allGatherEngine = {
AllGatherGetBuffSize,
AllGatherRunTest
.getBuffSize = AllGatherGetBuffSize,
.runTest = AllGatherRunTest
};
#pragma weak ncclTestEngine=allGatherEngine


@@ -4,10 +4,34 @@
* See LICENSE.txt for license information
************************************************************************/
/*
* AllReduce Performance Test Implementation
*
* This file implements multiple AllReduce kernel variants optimized for different
* use cases within CUDA P2P connectivity.
* These kernels are designed to highlight the device API functionality, as well as how to optimize for best performance.
*
* IMPORTANT: All custom kernels require CUDA P2P connectivity since they require Load-Store Accessible (LSA) memory.
*
* Kernel Selection Strategy:
* - deviceImpl = 0: NCCL's built-in AllReduce implementation (fallback)
* - deviceImpl = 1: allReduceLsaKernel - Basic LSA implementation for demonstration and small message sizes.
* - deviceImpl = 2: allReduceLsaVectorizedKernel - Vectorized LSA implementation targeting large message sizes.
* - deviceImpl = 3: allReduceMultimemKernel - Multi-memory implementation using hardware acceleration. Requires multimem-capable hardware.
* - deviceImpl = 4: allReduceMultimemVectorizedKernel - Vectorized multi-memory implementation for best performance. Requires multimem-capable hardware.
*/
#include "cuda_runtime.h"
#include "common.h"
#include <algorithm>
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
#include "nccl_device.h"
#include "vector_types.h"
#include "multimem_ops.h"
constexpr int WARP_SIZE = 32;
#endif
void AllReduceGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, int nranks) {
void AllReduceGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
*sendcount = count;
*recvcount = count;
*sendInplaceOffset = 0;
@@ -40,9 +64,433 @@ void AllReduceGetBw(size_t count, int typesize, double sec, double* algBw, doubl
*busBw = baseBw * factor;
}
testResult_t AllReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
NCCLCHECK(ncclAllReduce(sendbuff, recvbuff, count, type, op, comm, stream));
return testSuccess;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
// set devComm reqs for allreduce device kernels
bool AllReduceGetDevCommRequirements(int deviceImpl, ncclDevCommRequirements* reqs) {
if (!reqs) return false;
memset(reqs, 0, sizeof(*reqs));
switch(deviceImpl) {
case 1: // allReduceLsaKernel
case 2: // allReduceLsaVectorizedKernel
reqs->lsaBarrierCount = deviceCtaCount;
return true;
case 3: // allReduceMultimemKernel
case 4: // allReduceMultimemVectorizedKernel
reqs->lsaMultimem = true;
reqs->lsaBarrierCount = deviceCtaCount;
return true;
default:
return false;
}
}
/*
* Kernel 1: allReduceLsaKernel - Basic LSA-based AllReduce
*
* Purpose: Provides a simple, deterministic AllReduce implementation for small to
* medium message sizes within CUDA P2P connectivity.
*
* Solution: Implements AllReduce using direct peer-to-peer memory access through
* LSA windows. Each rank reads from all other ranks, performs reduction, and
* writes the result back to all ranks using cooperative thread arrays.
*
* Key Optimizations:
* - LSA barriers for faster synchronization than global barriers
* - Global grid stride loop to distribute work across all ranks
* - Direct peer access within CUDA P2P connectivity for optimal bandwidth
*
* CUDA P2P Connectivity Requirement: CRITICAL - This kernel requires all participating
* ranks to be within the same CUDA P2P connectivity.
*
* Use Case: Small to medium messages (< 1MB) where simplicity and determinism
* are more important than maximum bandwidth.
*/
template <typename T>
__global__ void allReduceLsaKernel(ncclWindow_t sendwin, size_t sendoffset, ncclWindow_t recvwin, size_t recvoffset, size_t count, int root, struct ncclDevComm devComm) {
ncclLsaBarrierSession<ncclCoopCta> bar { ncclCoopCta(), devComm, ncclTeamLsa(devComm), devComm.lsaBarrier, blockIdx.x };
bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
const int rank = devComm.rank, nRanks = devComm.nRanks;
const int globalTid = threadIdx.x + blockDim.x * (rank + blockIdx.x * nRanks);
const int globalNthreads = blockDim.x * gridDim.x * nRanks;
for (size_t offset = globalTid; offset < count; offset += globalNthreads) {
T v = T{0};
for (int peer=0; peer<nRanks; peer++) {
T* sendPtr = (T*)ncclGetLsaPointer(sendwin, sendoffset, peer);
v += sendPtr[offset];
}
for (int peer=0; peer<nRanks; peer++) {
T* recvPtr = (T*)ncclGetLsaPointer(recvwin, recvoffset, peer);
recvPtr[offset] = v;
}
}
bar.sync(ncclCoopCta(), cuda::memory_order_release);
}
/*
* Kernel 2: allReduceLsaVectorizedKernel - Vectorized LSA-based AllReduce
*
* Purpose: Enhanced AllReduce implementation using vectorized memory operations
* and loop unrolling to maximize memory bandwidth utilization for large messages
* within CUDA P2P connectivity.
*
* Solution: Builds upon the basic LSA approach but adds vectorized loads/stores
* and aggressive loop unrolling to achieve higher memory bandwidth. Handles
* misaligned data gracefully while maximizing vectorized throughput. Not necessarily optimal for small message sizes.
*
* Key Optimizations:
* - Vectorized loads/stores for improved memory bandwidth (128-bit operations)
* - Loop unrolling to reduce loop overhead and improve instruction-level parallelism
* - Warp-coalesced memory access patterns for optimal memory controller utilization
* - Graceful handling of misaligned data with scalar fallback, comes at the cost of higher latency if not required.
*
* CUDA P2P Connectivity Requirement: CRITICAL - Same as basic LSA kernel. Requires
* CUDA P2P connectivity due to LSA memory access patterns.
*
* Use Case: Large messages where maximum memory bandwidth is
* critical and data alignment can be optimized.
*/
template <typename T>
__global__ void allReduceLsaVectorizedKernel(ncclWindow_t sendwin, size_t sendoffset, ncclWindow_t recvwin, size_t recvoffset, size_t count, int root, struct ncclDevComm devComm) {
ncclLsaBarrierSession<ncclCoopCta> bar { ncclCoopCta(), devComm, ncclTeamLsa(devComm), devComm.lsaBarrier, blockIdx.x };
bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
// Compile time vector type and vector size mapping
using TN = typename VectorTypeMapping<T>::Type;
constexpr int VECTOR_FACTOR = sizeof(TN)/sizeof(T);
constexpr int UNROLL_FACTOR = 128/sizeof(TN); // Same as before 128 Bytes per thread
const int rank = devComm.rank, nRanks = devComm.nRanks;
const int globalTid = threadIdx.x + blockDim.x * (rank + blockIdx.x * nRanks);
const int globalNthreads = blockDim.x * gridDim.x * nRanks;
// Since we use vector types, the non-vector allocated memory is not necessarily aligned.
const size_t alignment_offset = (sendoffset % sizeof(TN)) / sizeof(T);
const size_t aligned_count = count - alignment_offset;
const size_t vector_count = aligned_count / VECTOR_FACTOR;
const size_t remainder = aligned_count % VECTOR_FACTOR;
// As before
const int elements_per_block = globalNthreads * UNROLL_FACTOR;
const int num_blocks = vector_count / elements_per_block;
const int warp_id = globalTid / WARP_SIZE;
const int lane_id = globalTid % WARP_SIZE;
const int warp_offset = warp_id * WARP_SIZE * UNROLL_FACTOR;
const int lane_offset = lane_id;
const int warp_lane_offset = warp_offset + lane_offset;
// Handle misaligned elements first using scalar operations. Grid stride loop with scalar handling
if (alignment_offset > 0) {
for (size_t offset = globalTid; offset < alignment_offset; offset += globalNthreads) {
T v_scalar = T{0};
for (int peer=0; peer<nRanks; peer++) {
T* remotePtr = (T*)ncclGetLsaPointer(sendwin, sendoffset, peer);
v_scalar += remotePtr[offset];
}
for (int peer=0; peer<nRanks; peer++) {
T* remotePtr = (T*)ncclGetLsaPointer(recvwin, recvoffset, peer);
remotePtr[offset] = v_scalar;
}
}
}
// Handle vectorized memory that can be handled in whole chunks (no if)
for (int block = 0; block < num_blocks; block += 1) {
TN v[UNROLL_FACTOR] = {TN{0}};
const size_t block_offset = block * globalNthreads * UNROLL_FACTOR;
for (int peer=0; peer<nRanks; peer++) {
#pragma unroll
for (int i=0; i < UNROLL_FACTOR; i++) {
const int stride_offset = i * WARP_SIZE;
const size_t offset = warp_lane_offset + block_offset + stride_offset;
// Uses TN* as pointer type for vectorized pointer arithmetic
// The pointer is also adjusted for misalignment
TN* remotePtr = (TN*)ncclGetLsaPointer(sendwin, sendoffset + alignment_offset * sizeof(T), peer);
v[i] = vectorAdd(v[i], remotePtr[offset]);
}
}
for (int peer=0; peer<nRanks; ++peer) {
#pragma unroll
for (int i=0; i < UNROLL_FACTOR; i++) {
const int stride_offset = i * WARP_SIZE;
const size_t offset = warp_lane_offset + block_offset + stride_offset;
TN* remotePtr = (TN*)ncclGetLsaPointer(recvwin, recvoffset + alignment_offset * sizeof(T), peer);
remotePtr[offset] = v[i];
}
}
}
// Handle the last partial vectorized block, but with if conditions
const int block = num_blocks;
TN v[UNROLL_FACTOR] = {TN{0}};
const size_t block_offset = block * globalNthreads * UNROLL_FACTOR;
for (int peer=0; peer<nRanks; peer++) {
#pragma unroll
for (int i=0; i < UNROLL_FACTOR; i++) {
const int stride_offset = i * WARP_SIZE;
const size_t offset = warp_lane_offset + block_offset + stride_offset;
if (offset < vector_count) {
TN* remotePtr = (TN*)ncclGetLsaPointer(sendwin, sendoffset + alignment_offset * sizeof(T), peer);
v[i] = vectorAdd(v[i], remotePtr[offset]);
}
}
}
for (int peer=0; peer<nRanks; ++peer) {
#pragma unroll
for(int i=0; i < UNROLL_FACTOR; i++){
const int stride_offset = i * WARP_SIZE;
const size_t offset = warp_lane_offset + block_offset + stride_offset;
if (offset < vector_count) {
TN* remotePtr = (TN*)ncclGetLsaPointer(recvwin, recvoffset + alignment_offset * sizeof(T), peer);
remotePtr[offset] = v[i];
}
}
}
// Since the data doesn't have to be perfectly aligned with the vector size, we need to handle remaining elements.
if (remainder > 0) {
const size_t remainder_start = alignment_offset + vector_count * VECTOR_FACTOR;
const int globalTid_remainder = globalTid;
const int globalNthreads_remainder = globalNthreads;
for (size_t offset = globalTid_remainder; offset < remainder; offset += globalNthreads_remainder) {
T v_scalar = 0;
const size_t actual_offset = remainder_start + offset;
for (int peer=0; peer<nRanks; peer++) {
T* remotePtr = (T*)ncclGetLsaPointer(sendwin, sendoffset, peer);
v_scalar += remotePtr[actual_offset];
}
for (int peer=0; peer<nRanks; peer++) {
T* remotePtr = (T*)ncclGetLsaPointer(recvwin, recvoffset, peer);
remotePtr[actual_offset] = v_scalar;
}
}
}
// Sync
bar.sync(ncclCoopCta(), cuda::memory_order_release);
}
/*
* Kernel 3: allReduceMultimemKernel - Multi-memory Hardware-Accelerated AllReduce
*
* Purpose: High-performance AllReduce implementation using multi-memory primitives
* that leverage hardware acceleration for memory operations, significantly reducing
* SM utilization while maintaining high bandwidth within CUDA P2P connectivity.
*
* Solution: Replaces the O(Nrank) peer loop approach with hardware-accelerated
* multi-memory operations. The kernel initiates CUDA P2P reductions directly through
* hardware, eliminating the need for explicit peer-to-peer communication loops.
*
* Key Optimizations:
* - Multi-memory primitives for hardware-accelerated operations
* - Eliminates O(Nrank) scaling by using hardware reduction capabilities
* - Hardware-assisted memory synchronization and reduction
*
* CUDA P2P Connectivity Requirement: CRITICAL - Requires CUDA P2P connectivity and
* multi-memory support. Hardware acceleration is only available within the
* same CUDA P2P connectivity where multi-memory operations can be performed.
*
* Use Case: Large CUDA P2P connectivity where scaling to more ranks is desired.
*
* Hardware Requirements: Hopper+ architecture with multi-memory support enabled.
*/
template <typename T>
__global__ void allReduceMultimemKernel(ncclWindow_t sendwin, size_t sendoffset, ncclWindow_t recvwin, size_t recvoffset, size_t count, int root, struct ncclDevComm devComm) {
ncclLsaBarrierSession<ncclCoopCta> bar { ncclCoopCta(), devComm, ncclTeamTagLsa(), blockIdx.x, true };
bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
const int rank = devComm.rank, nRanks = devComm.nRanks;
const int globalTid = threadIdx.x + blockDim.x * (rank + blockIdx.x * nRanks);
const int globalNthreads = blockDim.x * gridDim.x * nRanks;
T* send_ptr = reinterpret_cast<T*>(ncclGetLsaMultimemPointer(sendwin, sendoffset, devComm));
T* recv_ptr = reinterpret_cast<T*>(ncclGetLsaMultimemPointer(recvwin, recvoffset, devComm));
for (size_t offset=globalTid; offset < count; offset += globalNthreads) {
T v = multimemLoadSum<T,T>(send_ptr + offset);
multimemStore<T,T>(recv_ptr + offset, v);
}
bar.sync(ncclCoopCta(), cuda::memory_order_release);
}
/*
* Kernel 4: allReduceMultimemVectorizedKernel - Vectorized Multi-memory AllReduce
*
* Purpose: Ultimate performance AllReduce implementation combining multi-memory
* hardware acceleration with vectorized operations and loop unrolling for maximum
* bandwidth utilization within CUDA P2P connectivity.
*
* Solution: Combines the hardware acceleration benefits of multi-memory operations
* with the bandwidth optimization techniques from vectorized kernels. This kernel
* represents the highest performance option for large, aligned data sets.
*
* Key Optimizations:
* - Multi-memory primitives for hardware-accelerated operations
* - Vectorized loads/stores for maximum memory bandwidth (128-bit operations)
* - Aggressive loop unrolling for improved instruction-level parallelism
* - Warp-coalesced memory access patterns for optimal memory controller utilization
* - Hardware-assisted memory synchronization and reduction
* - Graceful handling of misaligned data with scalar fallback
*
* CUDA P2P Connectivity Requirement: CRITICAL - Requires CUDA P2P connectivity and
* multi-memory support. This kernel leverages both P2P locality and hardware
* acceleration for optimal performance.
*
* Hardware Requirements: Hopper+ architecture with multi-memory support enabled.
*
* Performance Note: This kernel provides the best performance for large, aligned
* data sets but requires careful data alignment for optimal vectorization benefits.
*/
template <typename T>
__global__ void allReduceMultimemVectorizedKernel(ncclWindow_t sendwin, size_t sendoffset, ncclWindow_t recvwin, size_t recvoffset, size_t count, int root, struct ncclDevComm devComm) {
ncclLsaBarrierSession<ncclCoopCta> bar { ncclCoopCta(), devComm, ncclTeamTagLsa(), blockIdx.x, true };
bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
using TN = typename VectorTypeMapping<T>::Type;
constexpr int VECTOR_FACTOR = sizeof(TN)/sizeof(T);
constexpr int UNROLL_FACTOR = 128/sizeof(TN);
const int rank = devComm.rank, nRanks = devComm.nRanks;
const int globalTid = threadIdx.x + blockDim.x * (rank + blockIdx.x * nRanks);
const int globalNthreads = blockDim.x * gridDim.x * nRanks;
// Calculate alignment offset to handle misaligned elements first
const size_t alignment_offset = (sendoffset % sizeof(TN)) / sizeof(T);
const size_t aligned_count = count - alignment_offset;
const size_t vector_count = aligned_count / VECTOR_FACTOR;
const size_t remainder = aligned_count % VECTOR_FACTOR;
const int elements_per_block = globalNthreads * UNROLL_FACTOR;
const int num_blocks = vector_count / elements_per_block;
const int warp_id = globalTid / WARP_SIZE;
const int lane_id = globalTid % WARP_SIZE;
const int warp_offset = warp_id * WARP_SIZE * UNROLL_FACTOR;
const int lane_offset = lane_id;
const int warp_lane_offset = warp_offset + lane_offset;
// Multimem pointers that handle scalar access for misaligned and remainder elements
T* send_ptr = reinterpret_cast<T*>(ncclGetLsaMultimemPointer(sendwin, sendoffset, devComm));
T* recv_ptr = reinterpret_cast<T*>(ncclGetLsaMultimemPointer(recvwin, recvoffset, devComm));
// Handle misaligned elements first using scalar operations
if (alignment_offset > 0) {
for (size_t offset = globalTid; offset < alignment_offset; offset += globalNthreads) {
T v_scalar = multimemLoadSum<T,T>(send_ptr + offset);
multimemStore<T,T>(recv_ptr+offset, v_scalar);
}
}
// separate TN* for 2 reasons. a) alignment offset, b) pointer arithmetic with the vectorized type
TN* send_ptrN = reinterpret_cast<TN*>(ncclGetLsaMultimemPointer(sendwin, sendoffset+alignment_offset*sizeof(T), devComm));
TN* recv_ptrN = reinterpret_cast<TN*>(ncclGetLsaMultimemPointer(recvwin, recvoffset+alignment_offset*sizeof(T), devComm));
// Handle vectorized memory that can be handled in whole chunks (no if)
for (int block = 0; block < num_blocks; block += 1) {
TN v[UNROLL_FACTOR] = {TN{0}};
const size_t block_offset = block * globalNthreads * UNROLL_FACTOR;
#pragma unroll
for (int i=0; i < UNROLL_FACTOR; i++) {
const int stride_offset = i * WARP_SIZE;
const size_t offset = warp_lane_offset + block_offset + stride_offset;
v[i] = multimemLoadSum<T,TN>(reinterpret_cast<T*>(send_ptrN + offset));
}
#pragma unroll
for (int i=0; i < UNROLL_FACTOR; i++) {
const int stride_offset = i * WARP_SIZE;
const size_t offset = warp_lane_offset + block_offset + stride_offset;
multimemStore<T,TN>(reinterpret_cast<T*>(recv_ptrN+offset), v[i]);
}
}
// Handle the last partial vectorized block, but with if conditions
const int block = num_blocks;
TN v[UNROLL_FACTOR] = {TN{0}};
const size_t block_offset = block * globalNthreads * UNROLL_FACTOR;
#pragma unroll
for (int i=0; i < UNROLL_FACTOR; i++) {
const int stride_offset = i * WARP_SIZE;
const size_t offset = warp_lane_offset + block_offset + stride_offset;
if (offset < vector_count) {
v[i] = multimemLoadSum<T,TN>(reinterpret_cast<T*>(send_ptrN+offset));
}
}
#pragma unroll
for (int i=0; i < UNROLL_FACTOR; i++) {
const int stride_offset = i * WARP_SIZE;
const size_t offset = warp_lane_offset + block_offset + stride_offset;
if (offset < vector_count) {
multimemStore<T,TN>(reinterpret_cast<T*>(recv_ptrN+offset), v[i]);
}
}
// Handle remainder elements using scalar operations
if (remainder > 0) {
const size_t remainder_start = alignment_offset + vector_count * VECTOR_FACTOR;
const int globalTid_remainder = globalTid;
const int globalNthreads_remainder = globalNthreads;
for (size_t offset = globalTid_remainder; offset < remainder; offset += globalNthreads_remainder) {
const size_t actual_offset = remainder_start + offset;
T v_scalar = multimemLoadSum<T,T>(send_ptr+actual_offset);
multimemStore<T,T>(recv_ptr+actual_offset, v_scalar);
}
}
// Sync
bar.sync(ncclCoopCta(), cuda::memory_order_release);
}
#endif
testResult_t AllReduceRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
char* sptr = (char*)sendbuff + sendoffset;
char* rptr = (char*)recvbuff + recvoffset;
switch (deviceImpl) {
case 0:
NCCLCHECK(ncclAllReduce(sptr, rptr, count, type, op, comm, stream));
return testSuccess;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
case 1:
TESTCHECK(testLaunchDeviceKernel(SPECIALIZE_KERNEL(allReduceLsaKernel, type, op),
sendbuff, sendoffset, recvbuff, recvoffset, count, type, op, root, comm, stream));
return testSuccess;
case 2:
TESTCHECK(testLaunchDeviceKernel(SPECIALIZE_KERNEL(allReduceLsaVectorizedKernel, type, op),
sendbuff, sendoffset, recvbuff, recvoffset, count, type, op, root, comm, stream));
return testSuccess;
case 3:
TESTCHECK(testLaunchDeviceKernel(SPECIALIZE_KERNEL(allReduceMultimemKernel, type, op),
sendbuff, sendoffset, recvbuff, recvoffset, count, type, op, root, comm, stream));
return testSuccess;
case 4:
TESTCHECK(testLaunchDeviceKernel(SPECIALIZE_KERNEL(allReduceMultimemVectorizedKernel, type, op),
sendbuff, sendoffset, recvbuff, recvoffset, count, type, op, root, comm, stream));
return testSuccess;
#endif
}
return testNotImplemented;
}
struct testColl allReduceTest = {
@@ -55,7 +503,7 @@ struct testColl allReduceTest = {
void AllReduceGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
AllReduceGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, nranks);
AllReduceGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t AllReduceRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@@ -94,8 +542,11 @@ testResult_t AllReduceRunTest(struct threadArgs* args, int root, ncclDataType_t
}
struct testEngine allReduceEngine = {
AllReduceGetBuffSize,
AllReduceRunTest
.getBuffSize = AllReduceGetBuffSize,
.runTest = AllReduceRunTest,
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
.getDevCommRequirements = AllReduceGetDevCommRequirements
#endif
};
#pragma weak ncclTestEngine=allReduceEngine


@@ -6,13 +6,19 @@
#include "cuda_runtime.h"
#include "common.h"
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
#include "nccl_device.h"
#include "vector_types.h"
#endif
void AlltoAllGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, int nranks) {
*sendcount = (count/nranks)*nranks;
*recvcount = (count/nranks)*nranks;
#pragma weak ncclAlltoAll
void AlltoAllGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
*paramcount = (count/nranks) & -(16/eltSize);
*sendcount = nranks*(*paramcount);
*recvcount = *sendcount;
*sendInplaceOffset = 0;
*recvInplaceOffset = 0;
*paramcount = count/nranks;
}
testResult_t AlltoAllInitData(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t op, int root, int rep, int in_place) {
@@ -45,23 +51,268 @@ void AlltoAllGetBw(size_t count, int typesize, double sec, double* algBw, double
*busBw = baseBw * factor;
}
testResult_t AlltoAllRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
int nRanks;
NCCLCHECK(ncclCommCount(comm, &nRanks));
size_t rankOffset = count * wordSize(type);
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
// set devComm reqs for alltoall device kernels
bool AlltoAllGetDevCommRequirements(int deviceImpl, ncclDevCommRequirements* reqs) {
if (!reqs) return false;
memset(reqs, 0, sizeof(*reqs));
#if NCCL_MAJOR < 2 || NCCL_MINOR < 7
printf("NCCL 2.7 or later is needed for alltoall. This test was compiled with %d.%d.\n", NCCL_MAJOR, NCCL_MINOR);
return testNcclError;
#else
NCCLCHECK(ncclGroupStart());
for (int r=0; r<nRanks; r++) {
NCCLCHECK(ncclSend(((char*)sendbuff)+r*rankOffset, count, type, r, comm, stream));
NCCLCHECK(ncclRecv(((char*)recvbuff)+r*rankOffset, count, type, r, comm, stream));
}
NCCLCHECK(ncclGroupEnd());
return testSuccess;
switch(deviceImpl) {
case 1: // NvlAlltoAllKernel
case 2: // NvlAlltoAllKernelOptimized
reqs->lsaBarrierCount = deviceCtaCount;
return true;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,7)
case 3: // GinAlltoAllKernel
case 4: // HybridAlltoAllKernel (LSA+GIN)
reqs->barrierCount = deviceCtaCount;
reqs->ginSignalCount = deviceCtaCount;
return true;
#endif
default:
return false;
}
}
// shared scalar AlltoAll implementation used by both kernels
template <typename T>
__device__ void AlltoAllScalarImpl(ncclWindow_t sendwin, size_t sendoffset, ncclWindow_t recvwin, size_t recvoffset, size_t count, int rank, int nRanks, int tid, int nthreads) {
T* sendPtr = (T*)ncclGetLsaPointer(sendwin, sendoffset, rank);
for (size_t offset = tid; offset < count; offset += nthreads) {
for (int peer = 0; peer < nRanks; peer++) {
T value = sendPtr[peer * count + offset];
T* recvPtr = (T*)ncclGetLsaPointer(recvwin, recvoffset, peer);
recvPtr[rank * count + offset] = value;
}
}
}
// Device implementation #1 - simple NVL kernel
template <typename T>
__global__ void NvlAlltoAllKernel(ncclWindow_t sendwin, size_t sendoffset, ncclWindow_t recvwin, size_t recvoffset, size_t count, int root, struct ncclDevComm devComm) {
ncclLsaBarrierSession<ncclCoopCta> bar { ncclCoopCta(), devComm, ncclTeamLsa(devComm), devComm.lsaBarrier, blockIdx.x };
bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
int rank = devComm.rank, nRanks = devComm.nRanks;
int tid = threadIdx.x + blockDim.x * blockIdx.x;
int nthreads = blockDim.x * gridDim.x;
AlltoAllScalarImpl<T>(sendwin, sendoffset, recvwin, recvoffset, count, rank, nRanks, tid, nthreads);
bar.sync(ncclCoopCta(), cuda::memory_order_release);
}
// Device implementation #2 - optimized NVL kernel using vectorization and unrolling
template <typename T>
__global__ void NvlAlltoAllKernelOptimized(ncclWindow_t sendwin, size_t sendoffset, ncclWindow_t recvwin, size_t recvoffset, size_t count, int root, struct ncclDevComm devComm) {
ncclLsaBarrierSession<ncclCoopCta> bar { ncclCoopCta(), devComm, ncclTeamLsa(devComm), devComm.lsaBarrier, blockIdx.x };
bar.sync(ncclCoopCta(), cuda::memory_order_relaxed);
using TN = typename VectorTypeMapping<T>::Type;
constexpr int VECTOR_FACTOR = sizeof(TN) / sizeof(T);
constexpr int UNROLL_FACTOR = 128/sizeof(TN);
constexpr int PEER_UNROLL = 2;
int rank = devComm.rank, nRanks = devComm.nRanks;
int tid = threadIdx.x + blockDim.x * blockIdx.x;
int nthreads = blockDim.x * gridDim.x;
T* sendPtr = (T*)ncclGetLsaPointer(sendwin, sendoffset, rank);
// alignment check: can we use vectorized operations?
bool canVectorize = (sizeof(TN) > sizeof(T)) && // Only if vectorization helps
(reinterpret_cast<uintptr_t>(sendPtr) % sizeof(TN) == 0) && // Base aligned
((count * sizeof(T)) % sizeof(TN) == 0); // Stride compatible
if (canVectorize) {
size_t vector_count = count / VECTOR_FACTOR;
int elements_per_iteration = nthreads * UNROLL_FACTOR;
// process aligned vectorized elements without bounds checks
size_t aligned_vector_count = (vector_count / elements_per_iteration) * elements_per_iteration;
for (size_t base_offset = tid; base_offset < aligned_vector_count; base_offset += elements_per_iteration) {
// unroll a limited number of peers at a time
for (int peerBase = 0; peerBase < nRanks; peerBase += PEER_UNROLL) {
int peersInGroup = min(PEER_UNROLL, nRanks - peerBase);
#pragma unroll
for (int p = 0; p < peersInGroup; p++) {
int peer = peerBase + p;
TN* sendVecPtr = (TN*)(sendPtr + peer * count);
TN* recvVecPtr = (TN*)((T*)ncclGetLsaPointer(recvwin, recvoffset, peer) + rank * count);
TN values[UNROLL_FACTOR];
// split load/store into separate loops for better overlap and ILP
#pragma unroll
for (int i = 0; i < UNROLL_FACTOR; i++) {
size_t offset = base_offset + i * nthreads;
values[i] = sendVecPtr[offset];
}
#pragma unroll
for (int i = 0; i < UNROLL_FACTOR; i++) {
size_t offset = base_offset + i * nthreads;
recvVecPtr[offset] = values[i];
}
}
}
}
// handle remaining vectorized elements that didn't fit in aligned chunks
for (size_t base_offset = aligned_vector_count + tid; base_offset < vector_count; base_offset += nthreads) {
for (int peer = 0; peer < nRanks; peer++) {
TN* sendVecPtr = (TN*)(sendPtr + peer * count);
TN* recvVecPtr = (TN*)((T*)ncclGetLsaPointer(recvwin, recvoffset, peer) + rank * count);
recvVecPtr[base_offset] = sendVecPtr[base_offset];
}
}
// handle any remaining elements not divisible by vectorization factor
size_t scalar_start = vector_count * VECTOR_FACTOR;
for (size_t offset = scalar_start + tid; offset < count; offset += nthreads) {
for (int peer = 0; peer < nRanks; peer++) {
T value = sendPtr[peer * count + offset];
T* recvPtr = (T*)ncclGetLsaPointer(recvwin, recvoffset, peer);
recvPtr[rank * count + offset] = value;
}
}
} else {
// simple scalar fallback for unaligned data (identical to simple kernel)
AlltoAllScalarImpl<T>(sendwin, sendoffset, recvwin, recvoffset, count, rank, nRanks, tid, nthreads);
}
bar.sync(ncclCoopCta(), cuda::memory_order_release);
}
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,7)
template <typename T>
__global__ void GinAlltoAllKernel(ncclWindow_t sendwin, size_t sendoffset, ncclWindow_t recvwin, size_t recvoffset, size_t count, int root, struct ncclDevComm devComm) {
int ginContext = 0;
unsigned int signalIndex = 0;
ncclGin gin { devComm, ginContext };
uint64_t signalValue = gin.readSignal(signalIndex);
ncclBarrierSession<ncclCoopCta> bar { ncclCoopCta(), ncclTeamTagWorld(), gin, blockIdx.x };
bar.sync(ncclCoopCta(), cuda::memory_order_relaxed, ncclGinFenceLevel::Relaxed);
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int nthreads = blockDim.x * gridDim.x;
/* send to all peers via GIN */
const size_t size = count * sizeof(T);
for (int r=tid; r<devComm.nRanks; r+=nthreads) {
gin.put(ncclTeamWorld(devComm), r,
recvwin, recvoffset + devComm.rank * size,
sendwin, sendoffset + r * size,
size, ncclGin_SignalInc{signalIndex});
}
gin.waitSignal(ncclCoopCta(), signalIndex, signalValue + devComm.nRanks);
gin.flush(ncclCoopCta());
bar.sync(ncclCoopCta(), cuda::memory_order_release, ncclGinFenceLevel::Relaxed);
}
template <typename T>
__global__ void HybridAlltoAllKernel(ncclWindow_t sendwin, size_t sendoffset, ncclWindow_t recvwin, size_t recvoffset, size_t count, int root, struct ncclDevComm devComm) {
int ginContext = 0;
unsigned int signalIndex = 0;
ncclGin gin { devComm, ginContext };
uint64_t signalValue = gin.readSignal(signalIndex);
ncclBarrierSession<ncclCoopCta> bar { ncclCoopCta(), ncclTeamTagWorld(), gin, blockIdx.x };
bar.sync(ncclCoopCta(), cuda::memory_order_relaxed, ncclGinFenceLevel::Relaxed);
int tid = threadIdx.x + blockIdx.x*blockDim.x;
int nthreads = blockDim.x * gridDim.x;
ncclTeam world = ncclTeamWorld(devComm);
ncclTeam lsa = ncclTeamLsa(devComm);
const int startLsa = world.rank - lsa.rank;
const int lsaSize = lsa.nRanks;
/* handle remote peers (i.e., non-LSA) using GIN */
const size_t size = count * sizeof(T);
for (int r = tid; r < startLsa; r += nthreads) {
gin.put(world, r,
recvwin, recvoffset + world.rank * size,
sendwin, sendoffset + r * size,
size, ncclGin_SignalInc{signalIndex});
}
for (int r = startLsa + lsaSize + tid; r < world.nRanks; r += nthreads) {
gin.put(world, r,
recvwin, recvoffset + world.rank * size,
sendwin, sendoffset + r * size,
size, ncclGin_SignalInc{signalIndex});
}
/* handle local peers with LSA */
T* sendLocal = (T*)ncclGetLocalPointer(sendwin, sendoffset);
for (size_t offset = tid; offset < count; offset += nthreads) {
for (int lp = 0; lp < lsa.nRanks; lp++) {
int wr = startLsa + lp;
T* recvPtr = (T*)ncclGetLsaPointer(recvwin, recvoffset, lp);
recvPtr[world.rank * count + offset] = sendLocal[wr * count + offset];
}
}
int numRemotePeers = world.nRanks - lsa.nRanks;
gin.waitSignal(ncclCoopCta(), signalIndex, signalValue + numRemotePeers);
gin.flush(ncclCoopCta());
bar.sync(ncclCoopCta(), cuda::memory_order_release, ncclGinFenceLevel::Relaxed);
}
#endif
#endif
testResult_t AlltoAllRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
if (deviceImpl == 0) {
char* sptr = (char*)sendbuff + sendoffset;
char* rptr = (char*)recvbuff + recvoffset;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
if (test_ncclVersion >= NCCL_VERSION(2,28,0)) {
NCCLCHECK(ncclAlltoAll(sptr, rptr, count, type, comm, stream));
return testSuccess;
}
// fall-through to send/recv implementation if ncclAlltoAll is not available
#endif
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,7,0)
int nRanks;
NCCLCHECK(ncclCommCount(comm, &nRanks));
size_t rankOffset = count * wordSize(type);
NCCLCHECK(ncclGroupStart());
for (int r=0; r<nRanks; r++) {
NCCLCHECK(ncclSend(sptr+r*rankOffset, count, type, r, comm, stream));
NCCLCHECK(ncclRecv(rptr+r*rankOffset, count, type, r, comm, stream));
}
NCCLCHECK(ncclGroupEnd());
#else
printf("NCCL 2.7 or later is needed for alltoall. This test was compiled with %d.%d.\n", NCCL_MAJOR, NCCL_MINOR);
return testNcclError;
#endif
} else {
switch(deviceImpl) {
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
case 1:
TESTCHECK(testLaunchDeviceKernel(SPECIALIZE_KERNEL(NvlAlltoAllKernel, type, op), sendbuff, sendoffset, recvbuff, recvoffset, count, type, op, root, comm, stream));
return testSuccess;
case 2:
TESTCHECK(testLaunchDeviceKernel(SPECIALIZE_KERNEL(NvlAlltoAllKernelOptimized, type, op), sendbuff, sendoffset, recvbuff, recvoffset, count, type, op, root, comm, stream));
return testSuccess;
#endif
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,7)
case 3:
TESTCHECK(testLaunchDeviceKernel(SPECIALIZE_KERNEL(GinAlltoAllKernel, type, op), sendbuff, sendoffset, recvbuff, recvoffset, count, type, op, root, comm, stream));
return testSuccess;
case 4:
TESTCHECK(testLaunchDeviceKernel(SPECIALIZE_KERNEL(HybridAlltoAllKernel, type, op), sendbuff, sendoffset, recvbuff, recvoffset, count, type, op, root, comm, stream));
return testSuccess;
#endif
default:
return testNotImplemented;
}
}
return testSuccess;
}
struct testColl alltoAllTest = {
@@ -74,7 +325,7 @@ struct testColl alltoAllTest = {
void AlltoAllGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
AlltoAllGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, nranks);
AlltoAllGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t AlltoAllRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@@ -100,8 +351,11 @@ testResult_t AlltoAllRunTest(struct threadArgs* args, int root, ncclDataType_t t
}
struct testEngine alltoAllEngine = {
AlltoAllGetBuffSize,
AlltoAllRunTest
.getBuffSize = AlltoAllGetBuffSize,
.runTest = AlltoAllRunTest,
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
.getDevCommRequirements = AlltoAllGetDevCommRequirements
#endif
};
#pragma weak ncclTestEngine=alltoAllEngine


@@ -7,7 +7,7 @@
#include "cuda_runtime.h"
#include "common.h"
void BroadcastGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, int nranks) {
void BroadcastGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
*sendcount = count;
*recvcount = count;
*sendInplaceOffset = 0;
@@ -39,18 +39,25 @@ void BroadcastGetBw(size_t count, int typesize, double sec, double* algBw, doubl
*busBw = baseBw * factor;
}
testResult_t BroadcastRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
int rank;
NCCLCHECK(ncclCommUserRank(comm, &rank));
testResult_t BroadcastRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
if (deviceImpl == 0) {
int rank;
NCCLCHECK(ncclCommUserRank(comm, &rank));
char* sptr = (char*)sendbuff + sendoffset;
char* rptr = (char*)recvbuff + recvoffset;
#if NCCL_MAJOR >= 2 && NCCL_MINOR >= 2
NCCLCHECK(ncclBroadcast(sendbuff, recvbuff, count, type, root, comm, stream));
NCCLCHECK(ncclBroadcast(sptr, rptr, count, type, root, comm, stream));
#else
if (rank == root) {
NCCLCHECK(ncclBcast(sendbuff, count, type, root, comm, stream));
} else {
NCCLCHECK(ncclBcast(recvbuff, count, type, root, comm, stream));
}
if (rank == root) {
NCCLCHECK(ncclBcast(sptr, count, type, root, comm, stream));
} else {
NCCLCHECK(ncclBcast(rptr, count, type, root, comm, stream));
}
#endif
} else {
return testNotImplemented;
}
return testSuccess;
}
@@ -64,7 +71,7 @@ struct testColl broadcastTest = {
void BroadcastGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
BroadcastGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, nranks);
BroadcastGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t BroadcastRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@@ -100,8 +107,8 @@ testResult_t BroadcastRunTest(struct threadArgs* args, int root, ncclDataType_t
}
struct testEngine broadcastEngine = {
BroadcastGetBuffSize,
BroadcastRunTest
.getBuffSize = BroadcastGetBuffSize,
.runTest = BroadcastRunTest
};
#pragma weak ncclTestEngine=broadcastEngine


@@ -11,28 +11,46 @@
#include <type_traits>
#include <getopt.h>
#include <libgen.h>
#include <string.h>
#include <ctype.h>
#include "cuda.h"
#include <errno.h> /* program_invocation_short_name */
#include "util.h"
#include "../verifiable/verifiable.h"
#include "ucommd.h"
static Ucommd ucommd_;
#pragma weak ncclCommWindowRegister
#pragma weak ncclCommWindowDeregister
#pragma weak ncclDevCommCreate
#pragma weak ncclDevCommDestroy
#define DIVUP(x, y) \
(((x)+(y)-1)/(y))
int test_ncclVersion = 0; // init'd with ncclGetVersion()
#if NCCL_MAJOR >= 2
ncclDataType_t test_types[ncclNumTypes] = {
ncclInt8, ncclUint8, ncclInt32, ncclUint32, ncclInt64, ncclUint64, ncclHalf, ncclFloat, ncclDouble
#if defined(__CUDA_BF16_TYPES_EXIST__) && NCCL_VERSION_CODE >= NCCL_VERSION(2,10,0)
#if HAVE_BF16
, ncclBfloat16
#endif
#if HAVE_FP8
, ncclFloat8e4m3, ncclFloat8e5m2
#endif
};
const char *test_typenames[ncclNumTypes] = {
"int8", "uint8", "int32", "uint32", "int64", "uint64", "half", "float", "double"
#if defined(__CUDA_BF16_TYPES_EXIST__) && NCCL_VERSION_CODE >= NCCL_VERSION(2,10,0)
#if HAVE_BF16
, "bfloat16"
#endif
#if HAVE_FP8
, "f8e4m3", "f8e5m2"
#endif
};
int test_typenum = -1;
@@ -64,16 +82,16 @@ int is_main_proc = 0;
thread_local int is_main_thread = 0;
// Command line parameter defaults
static int nThreads = 1;
static int nGpus = 1;
static size_t minBytes = 32*1024*1024;
static size_t maxBytes = 32*1024*1024;
static size_t stepBytes = 1*1024*1024;
static size_t stepFactor = 2;
static int datacheck = 1;
static int warmup_iters = 5;
static int iters = 20;
static int agg_iters = 1;
int nThreads = 1;
int nGpus = 1;
size_t minBytes = 32*1024*1024;
size_t maxBytes = 32*1024*1024;
size_t stepBytes = 1*1024*1024;
size_t stepFactor = 2;
int datacheck = 1;
int warmup_iters = 1;
int iters = 20;
int agg_iters = 1;
static int run_cycles = 1;
static int ncclop = ncclSum;
#if defined(__CUDA_BF16_TYPES_EXIST__) && NCCL_VERSION_CODE >= NCCL_VERSION(2,10,0)
@@ -82,19 +100,95 @@ static int nccltype = ncclBfloat16;
static int nccltype = ncclHalf;
#endif
static int ncclroot = 0;
static int parallel_init = 0;
static int blocking_coll = 0;
int parallel_init = 0;
int blocking_coll = 0;
static int streamnull = 0;
static int timeout = 0;
static int cudaGraphLaunches = 0;
int cudaGraphLaunches = 0;
static int report_cputime = 0;
static int report_timestamps = 0;
static int deviceImpl = 0;
int deviceCtaCount = 16; // Default number of CTAs for device implementation
// Report average iteration time: (0=RANK0,1=AVG,2=MIN,3=MAX)
static int average = 1;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,19,0)
#define LOCAL_REGISTER 1
#define SYMMETRIC_REGISTER 2
static int local_register = 0;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,27,0)
static int ctaPolicy = -1;
#endif
static int minCudaArch = 1<<30;
#define NUM_BLOCKS 32
enum output_file_type_t {
JSON_FILE_OUTPUT,
UNSPECIFIED_FILE_OUTPUT
};
// Return a pointer to the extension in `path` if one is found. An extension
// is the text after the last `.`, provided no `/` follows that `.` and at
// least one character comes after it.
//
// Returns nullptr if no meaningful extension was found, otherwise a pointer
// into `path` where the extension begins (just past the `.`).
static const char *getExtension(const char *path) {
if (path == nullptr) return nullptr;
int last_dot = -1;
int last_slash = -1;
int pos;
for (pos = 0; path[pos] != '\0'; ++pos) {
switch (path[pos]) {
case '.':
last_dot = pos;
break;
case '/':
last_slash = pos;
break;
default:
break;
}
}
if (last_dot > last_slash && last_dot + 1 != pos) {
return path + last_dot + 1;
}
return nullptr;
}
static output_file_type_t classifyOutputFile(const char *filename) {
const char *extension = getExtension(filename);
if (extension != nullptr && strcasecmp(extension, "json") == 0) {
return JSON_FILE_OUTPUT;
}
return UNSPECIFIED_FILE_OUTPUT;
}
static void outputFileInit(output_file_type_t output_file_type,
const char *output_file, int argc, char **argv, char **envp) {
switch (output_file_type) {
case JSON_FILE_OUTPUT:
jsonOutputInit(output_file, argc, argv, envp);
break;
case UNSPECIFIED_FILE_OUTPUT:
default:
break;
}
}
static void outputFileFinalize(output_file_type_t output_file_type) {
switch (output_file_type) {
case JSON_FILE_OUTPUT:
jsonOutputFinalize();
break;
case UNSPECIFIED_FILE_OUTPUT:
default:
break;
}
}
static double parsesize(const char *value) {
long long int units;
@@ -133,18 +227,18 @@ static double parsesize(const char *value) {
}
testResult_t CheckDelta(void* results, void* expected, size_t count, size_t offset, ncclDataType_t type, ncclRedOp_t op, uint64_t seed, int nranks, int64_t *wrongEltN) {
ncclVerifiableVerify(results, expected, count, (int)type, (int)op, nranks, seed, offset, wrongEltN, cudaStreamDefault);
CUDACHECK(ncclVerifiableVerify(results, expected, count, (int)type, (int)op, nranks, seed, offset, wrongEltN, cudaStreamDefault));
CUDACHECK(cudaDeviceSynchronize());
return testSuccess;
}
testResult_t InitDataReduce(void* data, const size_t count, const size_t offset, ncclDataType_t type, ncclRedOp_t op, uint64_t seed, int nranks) {
ncclVerifiablePrepareExpected(data, count, (int)type, (int)op, nranks, seed, offset, cudaStreamDefault);
CUDACHECK(ncclVerifiablePrepareExpected(data, count, (int)type, (int)op, nranks, seed, offset, cudaStreamDefault));
return testSuccess;
}
testResult_t InitData(void* data, const size_t count, size_t offset, ncclDataType_t type, ncclRedOp_t op, uint64_t seed, int nranks, int rank) {
ncclVerifiablePrepareInput(data, count, (int)type, (int)op, nranks, rank, seed, offset, cudaStreamDefault);
CUDACHECK(ncclVerifiablePrepareInput(data, count, (int)type, (int)op, nranks, rank, seed, offset, cudaStreamDefault));
return testSuccess;
}
@@ -365,9 +459,12 @@ testResult_t startColl(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
union {
int8_t i8; uint8_t u8; int32_t i32; uint32_t u32; int64_t i64; uint64_t u64;
half f16; float f32; double f64;
#if defined(__CUDA_BF16_TYPES_EXIST__)
#if HAVE_BF16
__nv_bfloat16 bf16;
#endif
#if HAVE_FP8
__nv_fp8_e4m3 f8e4m3; __nv_fp8_e5m2 f8e5m2;
#endif
};
switch(type) {
case ncclInt8: i8 = ncclVerifiablePremulScalar<int8_t>(rank); break;
@@ -379,18 +476,35 @@ testResult_t startColl(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
case ncclFloat16: f16 = ncclVerifiablePremulScalar<half>(rank); break;
case ncclFloat32: f32 = ncclVerifiablePremulScalar<float>(rank); break;
case ncclFloat64: f64 = ncclVerifiablePremulScalar<double>(rank); break;
#if defined(__CUDA_BF16_TYPES_EXIST__)
#if HAVE_BF16
case ncclBfloat16: bf16 = ncclVerifiablePremulScalar<__nv_bfloat16>(rank); break;
#endif
#if HAVE_FP8
case ncclFloat8e4m3: f8e4m3 = ncclVerifiablePremulScalar<__nv_fp8_e4m3>(rank); break;
case ncclFloat8e5m2: f8e5m2 = ncclVerifiablePremulScalar<__nv_fp8_e5m2>(rank); break;
#endif
default: break; // Just to silence clang
}
NCCLCHECK(ncclRedOpCreatePreMulSum(&op, &u64, type, ncclScalarHostImmediate, args->comms[i]));
}
#endif
TESTCHECK(args->collTest->runColl(
(void*)(in_place ? recvBuff + args->sendInplaceOffset*rank : sendBuff),
(void*)(in_place ? recvBuff + args->recvInplaceOffset*rank : recvBuff),
count, type, op, root, args->comms[i], args->streams[i]));
if (deviceImpl == 0) {
TESTCHECK(args->collTest->runColl(
(void*)(in_place ? recvBuff : sendBuff), in_place ? args->sendInplaceOffset*rank : 0,
(void*)recvBuff, in_place ? args->recvInplaceOffset*rank : 0,
count, type, op, root, args->comms[i], args->streams[i], 0));
} else {
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
void* sendwin = args->sendRegHandles[i];
void* recvwin = args->recvRegHandles[i];
CUDACHECK(cudaSetDevice(args->gpus[i]));
TESTCHECK(args->collTest->runColl(
(void*)(in_place ? recvwin : sendwin), shift + (in_place ? args->sendInplaceOffset*rank : 0),
(void*)recvwin, shift + (in_place ? args->recvInplaceOffset*rank : 0),
count, type, op, root, (ncclComm_t)(args->devComms+i), args->streams[i], deviceImpl));
#endif
}
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,11,0)
if(opIndex >= ncclNumOps) {
@@ -556,19 +670,7 @@ testResult_t BenchTime(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
}
double timeUsec = (report_cputime ? cputimeSec : deltaSec)*1.0E6;
char timeStr[100];
if (timeUsec >= 10000.0) {
sprintf(timeStr, "%7.0f", timeUsec);
} else if (timeUsec >= 100.0) {
sprintf(timeStr, "%7.1f", timeUsec);
} else {
sprintf(timeStr, "%7.2f", timeUsec);
}
if (args->reportErrors) {
PRINT(" %7s %6.2f %6.2f %5g", timeStr, algBw, busBw, (double)wrongElts);
} else {
PRINT(" %7s %6.2f %6.2f %5s", timeStr, algBw, busBw, "N/A");
}
writeBenchmarkLineBody(timeUsec, algBw, busBw, args->reportErrors, wrongElts, report_cputime, report_timestamps, in_place==0);
args->bw[0] += busBw;
args->bw_count[0]++;
@@ -580,7 +682,7 @@ void setupArgs(size_t size, ncclDataType_t type, struct threadArgs* args) {
size_t count, sendCount, recvCount, paramCount, sendInplaceOffset, recvInplaceOffset;
count = size / wordSize(type);
args->collTest->getCollByteCount(&sendCount, &recvCount, &paramCount, &sendInplaceOffset, &recvInplaceOffset, (size_t)count, (size_t)nranks);
args->collTest->getCollByteCount(&sendCount, &recvCount, &paramCount, &sendInplaceOffset, &recvInplaceOffset, (size_t)count, wordSize(type), (size_t)nranks);
args->nbytes = paramCount * wordSize(type);
args->sendBytes = sendCount * wordSize(type);
@@ -593,31 +695,24 @@ testResult_t TimeTest(struct threadArgs* args, ncclDataType_t type, const char*
// Sync to avoid first-call timeout
Barrier(args);
// Warm-up for large size
setupArgs(args->maxbytes, type, args);
for (int iter = 0; iter < warmup_iters; iter++) {
TESTCHECK(startColl(args, type, op, root, 0, iter));
// Warm-up for all sizes (using a stepfactor of 2)
for (size_t size = args->minbytes; size <= args->maxbytes; size = size * 2) {
setupArgs(size, type, args);
for (int iter = 0; iter < warmup_iters; iter++) {
TESTCHECK(startColl(args, type, op, root, 0, iter));
}
TESTCHECK(completeColl(args));
}
TESTCHECK(completeColl(args));
// Warm-up for small size
setupArgs(args->minbytes, type, args);
for (int iter = 0; iter < warmup_iters; iter++) {
TESTCHECK(startColl(args, type, op, root, 0, iter));
}
TESTCHECK(completeColl(args));
// Benchmark
long repeat = run_cycles;
do {
for (size_t size = args->minbytes; size<=args->maxbytes; size = ((args->stepfactor > 1) ? size*args->stepfactor : size+args->stepbytes)) {
setupArgs(size, type, args);
char rootName[100];
sprintf(rootName, "%6i", root);
PRINT("%12li %12li %8s %6s %6s", max(args->sendBytes, args->expectedBytes), args->nbytes / wordSize(type), typeName, opName, rootName);
writeBenchmarkLinePreamble(max(args->sendBytes, args->expectedBytes), args->nbytes / wordSize(type), typeName, opName, root);
TESTCHECK(BenchTime(args, type, op, root, 0));
TESTCHECK(BenchTime(args, type, op, root, 1));
PRINT("\n");
writeBenchmarkLineTerminator(iters, "");
}
} while (--repeat);
@@ -641,31 +736,66 @@ testResult_t threadInit(struct threadArgs* args) {
// Set the main-thread flag again for this thread
is_main_thread = (is_main_proc && args->thread == 0) ? 1 : 0;
jsonIdentifyWriter(is_main_thread);
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,14,0)
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,27,0)
if (ctaPolicy >= 0)
config.CTAPolicy = ctaPolicy;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
config.nvlinkCentricSched = 1;
#endif
#endif
#endif
NCCLCHECK(ncclGroupStart());
for (int i=0; i<args->nGpus; i++) {
int rank = args->proc*args->nThreads*args->nGpus + args->thread*args->nGpus + i;
CUDACHECK(cudaSetDevice(args->gpus[i]));
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,14,0)
NCCLCHECK(ncclCommInitRankConfig(args->comms+i, nranks, args->ncclId, rank, &config));
#else
NCCLCHECK(ncclCommInitRank(args->comms+i, nranks, args->ncclId, rank));
#endif
}
NCCLCHECK(ncclGroupEnd());
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,19,0)
void **sendRegHandles = (local_register) ? (void **)malloc(sizeof(*sendRegHandles)*args->nGpus) : NULL;
void **recvRegHandles = (local_register) ? (void **)malloc(sizeof(*recvRegHandles)*args->nGpus) : NULL;
NCCLCHECK(ncclGroupStart());
for (int i=0; i<args->nGpus; i++) {
if (local_register) NCCLCHECK(ncclCommRegister(args->comms[i], args->sendbuffs[i], args->maxbytes, &sendRegHandles[i]));
if (local_register) NCCLCHECK(ncclCommRegister(args->comms[i], args->recvbuffs[i], args->maxbytes, &recvRegHandles[i]));
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,27,0)
if (test_ncclVersion >= NCCL_VERSION(2,27,0) && (local_register == SYMMETRIC_REGISTER)) {
NCCLCHECK(ncclCommWindowRegister(args->comms[i], args->sendbuffs[i], args->maxbytes, (ncclWindow_t*)&args->sendRegHandles[i], NCCL_WIN_COLL_SYMMETRIC));
NCCLCHECK(ncclCommWindowRegister(args->comms[i], args->recvbuffs[i], args->maxbytes, (ncclWindow_t*)&args->recvRegHandles[i], NCCL_WIN_COLL_SYMMETRIC));
} else
#endif
{
if (local_register) NCCLCHECK(ncclCommRegister(args->comms[i], args->sendbuffs[i], args->maxbytes, &args->sendRegHandles[i]));
if (local_register) NCCLCHECK(ncclCommRegister(args->comms[i], args->recvbuffs[i], args->maxbytes, &args->recvRegHandles[i]));
}
}
NCCLCHECK(ncclGroupEnd());
#endif
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
/* Create device communicators based on test-specific requirements */
if (deviceImpl) {
ncclDevCommRequirements reqs;
if (!ncclTestEngine.getDevCommRequirements ||
!ncclTestEngine.getDevCommRequirements(deviceImpl, &reqs)) {
fprintf(stderr, "Device implementation %d is not supported by this test\n", deviceImpl);
return testNotImplemented;
}
NCCLCHECK(ncclGroupStart());
for (int i = 0; i < args->nGpus; i++) {
NCCLCHECK(ncclDevCommCreate(args->comms[i], &reqs, args->devComms+i));
}
NCCLCHECK(ncclGroupEnd());
}
#endif
TESTCHECK(threadRunTests(args));
for (int i=0; i<args->nGpus; i++) {
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,19,0)
if (local_register) NCCLCHECK(ncclCommDeregister(args->comms[i], sendRegHandles[i]));
if (local_register) NCCLCHECK(ncclCommDeregister(args->comms[i], recvRegHandles[i]));
#endif
NCCLCHECK(ncclCommDestroy(args->comms[i]));
}
return testSuccess;
}
@@ -694,7 +824,7 @@ testResult_t AllocateBuffs(void **sendbuff, size_t sendBytes, void **recvbuff, s
testResult_t run(); // Main function
int main(int argc, char* argv[], char **envp) {
// Make sure every line is flushed so that we see the progress of the test
setlinebuf(stdout);
@ -703,19 +833,26 @@ int main(int argc, char* argv[]) {
#else
test_ncclVersion = NCCL_VERSION_CODE;
#endif
//printf("# nccl-tests version %s NCCL_VERSION_CODE=%d ncclGetVersion=%d\n", NCCL_TESTS_VERSION, NCCL_VERSION_CODE, test_ncclVersion);
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,0,0)
test_opnum = 4;
test_typenum = 9;
if (NCCL_VERSION_CODE >= NCCL_VERSION(2,10,0) && test_ncclVersion >= NCCL_VERSION(2,10,0)) {
test_opnum++; // ncclAvg
}
if (NCCL_VERSION_CODE >= NCCL_VERSION(2,11,0) && test_ncclVersion >= NCCL_VERSION(2,11,0)) {
test_opnum++; // PreMulSum
}
#if defined(__CUDA_BF16_TYPES_EXIST__)
if (NCCL_VERSION_CODE >= NCCL_VERSION(2,10,0) && test_ncclVersion >= NCCL_VERSION(2,10,0)) {
test_typenum++; // bfloat16
}
#endif
#if defined(__CUDA_FP8_TYPES_EXIST__)
if (NCCL_VERSION_CODE >= NCCL_VERSION(2,24,0) && test_ncclVersion >= NCCL_VERSION(2,24,0)) {
test_typenum += 2; // fp8 e4m3,e5m2
}
#endif
#endif
nGpus = ucommd_.getNGpusPerProc();
@ -725,6 +862,8 @@ int main(int argc, char* argv[]) {
// Parse args
double parsed;
int longindex;
char *output_file = nullptr;
static struct option longopts[] = {
//{"nthreads", required_argument, 0, 't'},
{"ngpus", required_argument, 0, 'g'},
@ -746,16 +885,22 @@ int main(int argc, char* argv[]) {
{"timeout", required_argument, 0, 'T'},
{"cudagraph", required_argument, 0, 'G'},
{"report_cputime", required_argument, 0, 'C'},
{"report_timestamps", required_argument, 0, 'S'},
{"output_file", required_argument, 0, 'J'},
{"average", required_argument, 0, 'a'},
{"local_register", required_argument, 0, 'R'},
{"cta_policy", required_argument, 0, 'x'},
{"device_implementation", required_argument, 0, 'D'},
{"device_cta_count", required_argument, 0, 'V'},
{"help", no_argument, 0, 'h'},
{}
};
while(1) {
int c;
// c = getopt_long(argc, argv, "t:g:b:e:i:f:n:m:w:N:p:c:o:d:r:z:y:T:hG:C:a:R:x:D:V:J:S:", longopts, &longindex);
c = getopt_long(argc, argv, "g:b:e:i:f:n:m:w:N:p:c:o:d:r:z:y:T:hG:C:a:R:x:D:V:J:S:", longopts, &longindex);
if (c == -1)
break;
@ -844,18 +989,60 @@ int main(int argc, char* argv[]) {
case 'C':
report_cputime = strtol(optarg, NULL, 0);
break;
case 'J':
output_file = strdup(optarg);
break;
case 'S':
report_timestamps = strtol(optarg, NULL, 0);
break;
case 'a':
average = (int)strtol(optarg, NULL, 0);
break;
case 'R':
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,19,0)
local_register = (int)strtol(optarg, NULL, 0);
if (local_register == SYMMETRIC_REGISTER && test_ncclVersion < NCCL_VERSION(2,27,0)) {
printf("Option -R 2 (symmetric) is not supported before NCCL 2.27. Defaulting to local registration\n");
local_register = LOCAL_REGISTER;
}
#else
printf("Option -R (register) is not supported before NCCL 2.19. Ignoring\n");
#endif
break;
case 'x':
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,27,0)
ctaPolicy = (int)strtol(optarg, NULL, 0);
if (ctaPolicy > 1 && test_ncclVersion < NCCL_VERSION(2,28,0)) {
printf("Option -x (cta_policy) %d is not supported before NCCL 2.28. Ignoring\n", ctaPolicy);
ctaPolicy = -1;
}
#else
printf("Option -x (cta_policy) is not supported before NCCL 2.27. Ignoring\n");
#endif
break;
case 'D':
if (test_ncclVersion >= NCCL_VERSION(2,28,0)) {
deviceImpl = (int)strtol(optarg, NULL, 0);
}
else {
fprintf(stderr, "Option -D (device implementation) requires NCCL >= 2.28.0\n");
return -1;
}
break;
case 'V':
if (test_ncclVersion >= NCCL_VERSION(2,28,0)) {
deviceCtaCount = (int)strtol(optarg, NULL, 0);
if (deviceCtaCount <= 0 || deviceCtaCount > 128) {
fprintf(stderr, "device_cta_count (-V) must be between 1 and 128, got %d. "
"Using default value 16.\n", deviceCtaCount);
deviceCtaCount = 16;
}
}
else {
fprintf(stderr, "Option -V (device CTA count) requires NCCL >= 2.28.0\n");
return -1;
}
break;
case 'h':
default:
if (c != 'h') printf("invalid option '%c'\n", c);
@ -886,8 +1073,13 @@ int main(int argc, char* argv[]) {
"[-T,--timeout <time in seconds>] \n\t"
"[-G,--cudagraph <num graph launches>] \n\t"
"[-C,--report_cputime <0/1>] \n\t"
"[-S,--report_timestamps <0/1> report timestamps (default 0)] \n\t"
"[-J,--output_file <file> write output to filepath, if accessible. Infer type from suffix (only json supported presently.)] \n\t"
"[-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>] \n\t"
"[-R,--local_register <0/1/2> enable local (1) or symmetric (2) buffer registration on send/recv buffers (default: disable (0))] \n\t"
"[-x,--cta_policy <0/1/2> set CTA policy (NCCL_CTA_POLICY_DEFAULT (0), NCCL_CTA_POLICY_EFFICIENCY (1), NCCL_CTA_POLICY_ZERO (2)) (default: do not set)] \n\t"
"[-D,--device_implementation <implementation number> enable device implementation (default: 0, use NCCL implementation; requires -R 2 if > 0)] \n\t"
"[-V,--device_cta_count <number> set number of CTAs for device implementation (default: 16)] \n\t"
"[-h,--help]\n",
basename(argv[0]));
return 0;
@ -899,13 +1091,52 @@ int main(int argc, char* argv[]) {
(unsigned long long)maxBytes);
return -1;
}
if (deviceImpl > 0 && (local_register != SYMMETRIC_REGISTER)) {
fprintf(stderr, "device implementation (-D > 0) requires enabling symmetric memory registration (-R 2)\n");
return -1;
}
#ifdef MPI_SUPPORT
MPI_Init(&argc, &argv);
#endif
TESTCHECK(run());
const output_file_type_t output_file_type = classifyOutputFile(output_file);
outputFileInit(output_file_type, output_file, argc, argv, envp);
if(output_file) {
free(output_file);
output_file = nullptr;
}
testResult_t result = run();
outputFileFinalize(output_file_type);
TESTCHECK(result);
return 0;
}
#ifdef MPI_SUPPORT
// Parse an int in base 2 ("0b"/"0B" prefix) or base 8/10/16 (via strtoul base 0); skips leading whitespace
static bool parseInt(char *s, int *num) {
char *p = NULL;
if (!s || !num)
return false;
while (*s && isspace((unsigned char)*s)) ++s;
if (!*s) return false;
if (strncasecmp(s, "0b", 2) == 0)
*num = (int)strtoul(s + 2, &p, 2);
else
*num = (int)strtoul(s, &p, 0);
if (p == s)
return false;
return true;
}
#endif
testResult_t run() {
int totalProcs = 1, proc = 0, ncclProcs = 1, ncclProc = 0, color = 0;
int localRank = 0;
@ -923,63 +1154,55 @@ testResult_t run() {
if (hostHashs[p] == hostHashs[proc]) localRank++;
}
char *splitMaskEnv = NULL;
if ((splitMaskEnv = getenv("NCCL_TESTS_SPLIT_MASK"))) {
color = proc & strtoul(splitMaskEnv, NULL, 16);
} else if ((splitMaskEnv = getenv("NCCL_TESTS_SPLIT"))) {
if (
(strncasecmp(splitMaskEnv, "AND", strlen("AND")) == 0 && parseInt(splitMaskEnv + strlen("AND"), &color)) ||
(strncasecmp(splitMaskEnv, "&", strlen("&")) == 0 && parseInt(splitMaskEnv + strlen("&"), &color))
)
color = proc & color;
if (
(strncasecmp(splitMaskEnv, "OR", strlen("OR")) == 0 && parseInt(splitMaskEnv + strlen("OR"), &color)) ||
(strncasecmp(splitMaskEnv, "|", strlen("|")) == 0 && parseInt(splitMaskEnv + strlen("|"), &color))
)
color = proc | color;
if (
(strncasecmp(splitMaskEnv, "MOD", strlen("MOD")) == 0 && parseInt(splitMaskEnv + strlen("MOD"), &color)) ||
(strncasecmp(splitMaskEnv, "%", strlen("%")) == 0 && parseInt(splitMaskEnv + strlen("%"), &color))
)
color = proc % color;
if (
(strncasecmp(splitMaskEnv, "DIV", strlen("DIV")) == 0 && parseInt(splitMaskEnv + strlen("DIV"), &color)) ||
(strncasecmp(splitMaskEnv, "/", strlen("/")) == 0 && parseInt(splitMaskEnv + strlen("/"), &color))
)
color = proc / color;
}
MPI_Comm mpi_comm;
MPI_Comm_split(MPI_COMM_WORLD, color, proc, &mpi_comm);
MPI_Comm_size(mpi_comm, &ncclProcs);
MPI_Comm_rank(mpi_comm, &ncclProc);
#endif
is_main_thread = is_main_proc = (proc == 0) ? 1 : 0;
//PRINT("# nThread %d nGpus %d minBytes %ld maxBytes %ld step: %ld(%s) warmup iters: %d iters: %d agg iters: %d validation: %d graph: %d\n",
// nThreads, nGpus, minBytes, maxBytes,
PRINT("# nGpus(perProc) %d minBytes %ld maxBytes %ld step: %ld(%s) warmup iters: %d iters: %d agg iters: %d validation: %d graph: %d\n",
nGpus, minBytes, maxBytes,
(stepFactor > 1)?stepFactor:stepBytes, (stepFactor > 1)?"factor":"bytes",
warmup_iters, iters, agg_iters, datacheck, cudaGraphLaunches);
if (blocking_coll) PRINT("# Blocking Enabled: wait for completion and barrier after each collective \n");
if (parallel_init) PRINT("# Parallel Init Enabled: threads call into NcclInitRank concurrently \n");
PRINT("#\n");
jsonIdentifyWriter(is_main_thread);
PRINT("# Using devices\n");
size_t maxMem = ~0;
testResult_t report_result = writeDeviceReport(&maxMem, localRank, proc, totalProcs, color, hostname, program_invocation_short_name);
if(report_result != testSuccess) {
return report_result;
}
// Reserve 1GiB of memory for each 16GiB installed, but limit to a max of 4GiB
const size_t GB = (1ULL << 30);
size_t reserveMem = std::min(DIVUP(maxMem, 16*GB) * 1*GB, 4*GB);
// We need sendbuff, recvbuff, expected (when datacheck enabled), plus 1G for the rest.
size_t memMaxBytes = (maxMem - reserveMem - 1*GB) / (datacheck ? 3 : 2);
if (maxBytes > memMaxBytes) {
maxBytes = memMaxBytes;
if (minBytes > maxBytes) minBytes = maxBytes;
if (proc == 0) printf("#\n# Reducing maxBytes to %ld due to memory limitation\n", maxBytes);
}
@ -1000,49 +1223,112 @@ testResult_t run() {
ncclTestEngine.getBuffSize(&sendBytes, &recvBytes, (size_t)maxBytes, (size_t)ncclProcs*nGpus*nThreads);
char* envstr = getenv("NCCL_TESTS_DEVICE");
int gpu0 = envstr ? atoi(envstr) : -1;
for (int i=0; i<nGpus*nThreads; i++) {
gpus[i] = (gpu0 != -1 ? gpu0 : localRank*nThreads*nGpus) + i;
CUDACHECK(cudaSetDevice(gpus[i]));
TESTCHECK(AllocateBuffs(sendbuffs+i, sendBytes, recvbuffs+i, recvBytes, expected+i, (size_t)maxBytes));
if (streamnull) {
streams[i] = NULL;
}
else {
CUDACHECK(cudaStreamCreateWithFlags(streams+i, cudaStreamNonBlocking));
}
int archMajor, archMinor;
CUDACHECK(cudaDeviceGetAttribute(&archMajor, cudaDevAttrComputeCapabilityMajor, gpus[i]));
CUDACHECK(cudaDeviceGetAttribute(&archMinor, cudaDevAttrComputeCapabilityMinor, gpus[i]));
minCudaArch = std::min(minCudaArch, 100*archMajor + 10*archMinor);
}
#ifdef MPI_SUPPORT
MPI_Allreduce(MPI_IN_PLACE, &minCudaArch, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
#endif
#if defined(__CUDA_FP8_TYPES_EXIST__)
if (NCCL_VERSION_CODE >= NCCL_VERSION(2,24,0) && test_ncclVersion >= NCCL_VERSION(2,24,0)) {
if (minCudaArch < 900) { // Filter out fp8 on pre-Hopper hardware
int n = 0;
for (int i=0; i < test_typenum; i++) {
if (!(test_types[i] == ncclFloat8e4m3 || test_types[i] == ncclFloat8e5m2)) {
test_types[n] = test_types[i];
test_typenames[n] = test_typenames[i];
n += 1;
}
}
test_typenum = n;
}
}
#endif
//if parallel init is not selected, use main thread to initialize NCCL
ncclComm_t* comms = (ncclComm_t*)malloc(sizeof(ncclComm_t)*nThreads*nGpus);
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,19,0)
void* sendRegHandles[nThreads*nGpus];
void* recvRegHandles[nThreads*nGpus];
memset(sendRegHandles, 0, sizeof(sendRegHandles));
memset(recvRegHandles, 0, sizeof(recvRegHandles));
#endif
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
ncclDevComm devComms[nThreads*nGpus];
#endif
if (!parallel_init) {
if (ncclProcs == 1) {
NCCLCHECK(ncclCommInitAll(comms, nGpus*nThreads, gpus));
} else {
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,14,0)
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,27,0)
if (ctaPolicy >= 0)
config.CTAPolicy = ctaPolicy;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
config.nvlinkCentricSched = 1;
#endif
#endif
#endif
NCCLCHECK(ncclGroupStart());
for (int i=0; i<nGpus*nThreads; i++) {
CUDACHECK(cudaSetDevice(gpus[i]));
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,14,0)
NCCLCHECK(ncclCommInitRankConfig(comms+i, ncclProcs*nThreads*nGpus, ncclId, ncclProc*nThreads*nGpus+i, &config));
#else
NCCLCHECK(ncclCommInitRank(comms+i, ncclProcs*nThreads*nGpus, ncclId, ncclProc*nThreads*nGpus+i));
#endif
}
NCCLCHECK(ncclGroupEnd());
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,19,0)
NCCLCHECK(ncclGroupStart());
for (int i=0; i<nGpus*nThreads; i++) {
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,27,0)
if (test_ncclVersion >= NCCL_VERSION(2,27,0) && (local_register == SYMMETRIC_REGISTER)) {
NCCLCHECK(ncclCommWindowRegister(comms[i], sendbuffs[i], maxBytes, (ncclWindow_t*)&sendRegHandles[i], NCCL_WIN_COLL_SYMMETRIC));
NCCLCHECK(ncclCommWindowRegister(comms[i], recvbuffs[i], maxBytes, (ncclWindow_t*)&recvRegHandles[i], NCCL_WIN_COLL_SYMMETRIC));
} else
#endif
{
if (local_register) NCCLCHECK(ncclCommRegister(comms[i], sendbuffs[i], maxBytes, &sendRegHandles[i]));
if (local_register) NCCLCHECK(ncclCommRegister(comms[i], recvbuffs[i], maxBytes, &recvRegHandles[i]));
}
}
NCCLCHECK(ncclGroupEnd());
#endif
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
/* Create device communicators based on test-specific requirements */
if (deviceImpl) {
ncclDevCommRequirements reqs;
if (!ncclTestEngine.getDevCommRequirements ||
!ncclTestEngine.getDevCommRequirements(deviceImpl, &reqs)) {
fprintf(stderr, "Device implementation %d is not supported by this test\n", deviceImpl);
return testNotImplemented;
}
NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nGpus * nThreads; i++) {
NCCLCHECK(ncclDevCommCreate(comms[i], &reqs, devComms+i));
}
NCCLCHECK(ncclGroupEnd());
}
}
int errors[nThreads];
double bw[nThreads];
double* delta;
CUDACHECK(cudaHostAlloc(&delta, sizeof(double)*nThreads*NUM_BLOCKS, cudaHostAllocPortable | cudaHostAllocMapped));
int bw_count[nThreads];
for (int t=0; t<nThreads; t++) {
bw[t] = 0.0;
@ -1051,13 +1337,7 @@ testResult_t run() {
fflush(stdout);
const char* timeStr = report_cputime ? "cputime" : "time";
PRINT("#\n");
writeResultHeader(report_cputime, report_timestamps);
struct testThread threads[nThreads];
memset(threads, 0, sizeof(struct testThread)*nThreads);
@ -1079,6 +1359,13 @@ testResult_t run() {
threads[t].args.sendbuffs = sendbuffs+t*nGpus;
threads[t].args.recvbuffs = recvbuffs+t*nGpus;
threads[t].args.expected = expected+t*nGpus;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
threads[t].args.devComms = devComms+t*nGpus;
#endif
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,19,0)
threads[t].args.sendRegHandles = sendRegHandles+t*nGpus;
threads[t].args.recvRegHandles = recvRegHandles+t*nGpus;
#endif
threads[t].args.ncclId = ncclId;
threads[t].args.comms=comms+t*nGpus;
threads[t].args.streams=streams+t*nGpus;
@ -1114,8 +1401,16 @@ testResult_t run() {
if (!parallel_init) {
for(int i=0; i<nGpus*nThreads; ++i) {
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,19,0)
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,27,0)
if (test_ncclVersion >= NCCL_VERSION(2,27,0) && (local_register == SYMMETRIC_REGISTER)) {
NCCLCHECK(ncclCommWindowDeregister(comms[i], (ncclWindow_t)sendRegHandles[i]));
NCCLCHECK(ncclCommWindowDeregister(comms[i], (ncclWindow_t)recvRegHandles[i]));
} else
#endif
{
if (local_register) NCCLCHECK(ncclCommDeregister(comms[i], sendRegHandles[i]));
if (local_register) NCCLCHECK(ncclCommDeregister(comms[i], recvRegHandles[i]));
}
#endif
NCCLCHECK(ncclCommDestroy(comms[i]));
}
@ -1134,11 +1429,6 @@ testResult_t run() {
if (datacheck) CUDACHECK(cudaFree(expected[i]));
#endif
}
CUDACHECK(cudaFreeHost(delta));
envstr = getenv("NCCL_TESTS_MIN_BW");
//double check_avg_bw = envstr ? atof(envstr) : -1;
@ -1146,22 +1436,20 @@ testResult_t run() {
(!strcmp(threads[0].args.collTest->name, "AllReduce") && minBytes == maxBytes && minBytes >= ucommd_.getBytes()) ? ucommd_.getBw(nGpus) : -1;
bw[0] /= bw_count[0];
writeResultFooter(errors, bw, check_avg_bw, program_invocation_short_name);
#ifdef MPI_SUPPORT
MPI_Comm_free(&mpi_comm);
MPI_Finalize();
#endif
writeErrors();
// 'cuda-memcheck --leak-check full' requires this
cudaDeviceReset();
if (errors[0] || bw[0] < check_avg_bw*(0.9))
return testNumResults;
else
return testSuccess;
}


@ -6,7 +6,12 @@
#ifndef __COMMON_H__
#define __COMMON_H__
#define NCCL_TESTS_VERSION "2.17.6"
#include "nccl.h"
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
#include "nccl_device.h"
#endif
#include <stdio.h>
#include <cstdint>
#include <algorithm>
@ -66,7 +71,8 @@ typedef enum {
testCudaError = 2,
testNcclError = 3,
testTimeout = 4,
testNotImplemented = 5,
testNumResults = 6
} testResult_t;
// Relay errors up and trace
@ -87,18 +93,21 @@ struct testColl {
void (*getCollByteCount)(
size_t *sendcount, size_t *recvcount, size_t *paramcount,
size_t *sendInplaceOffset, size_t *recvInplaceOffset,
size_t count, size_t eltSize, int nranks);
testResult_t (*initData)(struct threadArgs* args, ncclDataType_t type,
ncclRedOp_t op, int root, int rep, int in_place);
void (*getBw)(size_t count, int typesize, double sec, double* algBw, double* busBw, int nranks);
testResult_t (*runColl)(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset,
size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int implIndex);
};
struct testEngine {
void (*getBuffSize)(size_t *sendcount, size_t *recvcount, size_t count, int nranks);
testResult_t (*runTest)(struct threadArgs* args, int root, ncclDataType_t type,
const char* typeName, ncclRedOp_t op, const char* opName);
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
bool (*getDevCommRequirements)(int deviceImpl, ncclDevCommRequirements* reqs);
#endif
};
extern struct testEngine ncclTestEngine;
@ -125,6 +134,9 @@ struct threadArgs {
size_t recvInplaceOffset;
ncclUniqueId ncclId;
ncclComm_t* comms;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
ncclDevComm* devComms;
#endif
cudaStream_t* streams;
void** expected;
@ -136,6 +148,11 @@ struct threadArgs {
int reportErrors;
struct testColl* collTest;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,19,0)
void** sendRegHandles;
void** recvRegHandles;
#endif
};
typedef testResult_t (*threadFunc_t)(struct threadArgs* args);
@ -165,6 +182,9 @@ static void getHostName(char* hostname, int maxlen) {
}
}
for (int i=0; i< maxlen; i++) {
if (hostname[i] == '\0') {
return;
}
if (hostname[i] == '.') {
hostname[i] = '\0';
return;
@ -214,16 +234,34 @@ static uint64_t getHostHash(const char* hostname) {
return getHash(hostHash, strlen(hostHash));
}
#define HAVE_BF16 0
#define HAVE_FP8 0
#if NCCL_MAJOR >= 2
#if defined(__CUDA_BF16_TYPES_EXIST__) && NCCL_VERSION_CODE >= NCCL_VERSION(2,10,0)
#undef HAVE_BF16
#define HAVE_BF16 1
#endif
#if defined(__CUDA_FP8_TYPES_EXIST__) && NCCL_VERSION_CODE >= NCCL_VERSION(2,24,0)
#undef HAVE_FP8
#define HAVE_FP8 1
#endif
#endif
static size_t wordSize(ncclDataType_t type) {
switch(type) {
case ncclChar:
#if NCCL_MAJOR >= 2
//case ncclInt8:
case ncclUint8:
#endif
#if HAVE_FP8
case ncclFloat8e4m3:
case ncclFloat8e5m2:
#endif
return 1;
case ncclHalf:
#if HAVE_BF16
case ncclBfloat16:
#endif
//case ncclFloat16:
@ -246,6 +284,7 @@ static size_t wordSize(ncclDataType_t type) {
}
extern int test_ncclVersion; // init'd with ncclGetVersion()
extern int deviceCtaCount; // number of CTAs for device implementation
constexpr int test_opNumMax = (int)ncclNumOps + (NCCL_VERSION_CODE >= NCCL_VERSION(2,11,0) ? 1 : 0);
extern int test_opnum;
extern int test_typenum;
@ -282,6 +321,38 @@ static int ncclstringtoop (char *str) {
extern int is_main_proc;
extern thread_local int is_main_thread;
#define PRINT if (is_main_thread) printf
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
template <typename F>
testResult_t testLaunchDeviceKernel(F kernel, void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
if (kernel == nullptr) return testNotImplemented;
ncclDevComm* devComm = (ncclDevComm*)comm;
ncclWindow_t sendwin = (ncclWindow_t)sendbuff;
ncclWindow_t recvwin = (ncclWindow_t)recvbuff;
kernel<<<deviceCtaCount, 512, 0, stream>>>(sendwin, sendoffset, recvwin, recvoffset, count, root, *devComm);
return testSuccess;
}
#define SPECIALIZE_KERNEL(kernel, type, op) \
( op != ncclSum ? nullptr : \
type == ncclInt8 ? kernel<int8_t> : \
type == ncclUint8 ? kernel<uint8_t> : \
type == ncclInt32 ? kernel<int32_t> : \
type == ncclUint32 ? kernel<uint32_t> : \
type == ncclInt64 ? kernel<int64_t> : \
type == ncclUint64 ? kernel<uint64_t> : \
type == ncclFloat16 ? kernel<half> : \
type == ncclFloat32 ? kernel<float> : \
type == ncclFloat64 ? kernel<double> : \
nullptr \
)
#else
template <typename F>
testResult_t testLaunchDeviceKernel(F kernel, void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
return testNotImplemented;
}
#define SPECIALIZE_KERNEL(kernel, type, op) nullptr
#endif
#endif

src/common.mk Normal file

@ -0,0 +1,87 @@
#
# Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
#
# See LICENSE.txt for license information
#
CUDA_HOME ?= /usr/local/cuda
PREFIX ?= /usr/local
VERBOSE ?= 0
DEBUG ?= 0
CUDA_LIB ?= $(CUDA_HOME)/lib64
CUDA_INC ?= $(CUDA_HOME)/include
NVCC ?= $(CUDA_HOME)/bin/nvcc
CUDARTLIB ?= cudart
CUDA_VERSION = $(strip $(shell which $(NVCC) >/dev/null && $(NVCC) --version | grep release | sed 's/.*release //' | sed 's/\,.*//'))
CUDA_MAJOR = $(shell echo $(CUDA_VERSION) | cut -d "." -f 1)
CUDA_MINOR = $(shell echo $(CUDA_VERSION) | cut -d "." -f 2)
# CUDA 13.0 requires c++17
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
CXXSTD ?= -std=c++17
else
CXXSTD ?= -std=c++14
endif
# Better define NVCC_GENCODE in your environment to the minimal set
# of archs to reduce compile time.
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
# Add Blackwell but drop Pascal & Volta support if we're using CUDA13.0 or above
NVCC_GENCODE ?= -gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120 \
-gencode=arch=compute_120,code=compute_120
else ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8; echo $$?),0)
# Include Blackwell support if we're using CUDA12.8 or above
NVCC_GENCODE ?= -gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120 \
-gencode=arch=compute_120,code=compute_120
else ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 12; echo $$?),0)
NVCC_GENCODE ?= -gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90,code=compute_90
else ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 11; echo $$?),0)
NVCC_GENCODE ?= -gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80
else
NVCC_GENCODE ?= -gencode=arch=compute_35,code=sm_35 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_70,code=compute_70
endif
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) $(CXXSTD)
CXXFLAGS := $(CXXSTD)
LDFLAGS := -L${CUDA_LIB} -lcudart -lrt
NVLDFLAGS := -L${CUDA_LIB} -l${CUDARTLIB} -lrt
ifeq ($(DEBUG), 0)
NVCUFLAGS += -O3 -g
CXXFLAGS += -O3 -g
else
NVCUFLAGS += -O0 -G -g
CXXFLAGS += -O0 -g -ggdb3
endif
ifneq ($(VERBOSE), 0)
NVCUFLAGS += -Xcompiler -Wall,-Wextra,-Wno-unused-parameter
else
.SILENT:
endif


@ -7,12 +7,12 @@
#include "cuda_runtime.h"
#include "common.h"
void GatherGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
*sendcount = (count/nranks) & -(16/eltSize);
*recvcount = (*sendcount)*nranks;
*sendInplaceOffset = *sendcount;
*recvInplaceOffset = 0;
*paramcount = *sendcount;
}
testResult_t GatherInitData(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t op, int root, int rep, int in_place) {
@ -43,23 +43,35 @@ void GatherGetBw(size_t count, int typesize, double sec, double* algBw, double*
*busBw = baseBw * factor;
}
testResult_t GatherRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
if (deviceImpl == 0) {
int nRanks;
NCCLCHECK(ncclCommCount(comm, &nRanks));
int rank;
NCCLCHECK(ncclCommUserRank(comm, &rank));
size_t rankOffset = count * wordSize(type);
if (count == 0) return testSuccess;
char* sptr = (char*)sendbuff + sendoffset;
char* rptr = (char*)recvbuff + recvoffset;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
NCCLCHECK(ncclGather(sptr, rptr, count, type, root, comm, stream));
#elif NCCL_VERSION_CODE >= NCCL_VERSION(2,7,0)
NCCLCHECK(ncclGroupStart());
NCCLCHECK(ncclSend(sptr, count, type, root, comm, stream));
if (rank == root) {
for (int r=0; r<nRanks; r++) {
NCCLCHECK(ncclRecv(rptr + r * rankOffset, count, type, r, comm, stream));
}
}
NCCLCHECK(ncclGroupEnd());
#else
printf("NCCL 2.7 or later is needed for gather. This test was compiled with %d.%d.\n", NCCL_MAJOR, NCCL_MINOR);
return testNcclError;
#endif
} else {
return testNotImplemented;
}
return testSuccess;
}
@ -73,7 +85,7 @@ struct testColl gatherTest = {
void GatherGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
GatherGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t GatherRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@ -109,8 +121,8 @@ testResult_t GatherRunTest(struct threadArgs* args, int root, ncclDataType_t typ
}
struct testEngine gatherEngine = {
.getBuffSize = GatherGetBuffSize,
.runTest = GatherRunTest
};
#pragma weak ncclTestEngine=gatherEngine

View File

@ -9,8 +9,8 @@
#define ALIGN 4
void HyperCubeGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
size_t base = (count/nranks) & -(16/eltSize);
*sendcount = base;
*recvcount = base*nranks;
*sendInplaceOffset = base;
@ -45,25 +45,29 @@ void HyperCubeGetBw(size_t count, int typesize, double sec, double* algBw, doubl
*busBw = baseBw * factor;
}
testResult_t HyperCubeRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
if (deviceImpl == 0) {
char* sbuff = ((char*)sendbuff) + sendoffset;
char* rbuff = ((char*)recvbuff) + recvoffset;
int nRanks;
NCCLCHECK(ncclCommCount(comm, &nRanks));
int rank;
NCCLCHECK(ncclCommUserRank(comm, &rank));
size_t rankSize = count * wordSize(type);
if (rbuff+rank*rankSize != sbuff) CUDACHECK(cudaMemcpyAsync(rbuff+rank*rankSize, sbuff, rankSize, cudaMemcpyDeviceToDevice, stream));
if (rbuff+rank*rankSize != sbuff) CUDACHECK(cudaMemcpyAsync(rbuff+rank*rankSize, sbuff, rankSize, cudaMemcpyDeviceToDevice, stream));
// Hypercube AllGather
for (int mask=1; mask<nRanks; mask<<=1) {
NCCLCHECK(ncclGroupStart());
int s = rank & ~(mask-1);
int r = s ^ mask;
NCCLCHECK(ncclSend(rbuff+s*rankSize, count*mask, type, rank^mask, comm, stream));
NCCLCHECK(ncclRecv(rbuff+r*rankSize, count*mask, type, rank^mask, comm, stream));
NCCLCHECK(ncclGroupEnd());
// Hypercube AllGather
for (int mask=1; mask<nRanks; mask<<=1) {
NCCLCHECK(ncclGroupStart());
int s = rank & ~(mask-1);
int r = s ^ mask;
NCCLCHECK(ncclSend(rbuff+s*rankSize, count*mask, type, rank^mask, comm, stream));
NCCLCHECK(ncclRecv(rbuff+r*rankSize, count*mask, type, rank^mask, comm, stream));
NCCLCHECK(ncclGroupEnd());
}
} else {
return testNotImplemented;
}
return testSuccess;
}
@@ -78,7 +82,7 @@ struct testColl hyperCubeTest = {
void HyperCubeGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
HyperCubeGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, nranks);
HyperCubeGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t HyperCubeRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@@ -111,8 +115,8 @@ testResult_t HyperCubeRunTest(struct threadArgs* args, int root, ncclDataType_t
}
struct testEngine hyperCubeEngine = {
HyperCubeGetBuffSize,
HyperCubeRunTest
.getBuffSize = HyperCubeGetBuffSize,
.runTest = HyperCubeRunTest
};
#pragma weak ncclTestEngine=hyperCubeEngine

src/multimem_ops.h (new file, 105 lines)

@@ -0,0 +1,105 @@
/*************************************************************************
* Copyright (c) 2016-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef _MULTIMEM_OPS_H_
#define _MULTIMEM_OPS_H_
#include <cuda_runtime.h>
#include <cassert>
// Multimem operations. Since multimem is currently only available in PTX, here are C++ wrappers around it.
// The first template argument is the data type; the second is the vectorized data type.
// In the future, the second template type will also dictate reduction accuracy.
template<typename ptrT, typename valT>
__device__ __forceinline__ valT multimemLoadSum(const ptrT* addr) {
assert(false);
// static_assert(std::is_same<ptrT, void>::value, "multimemLoadSum can only be instantiated with implemented types");
// static_assert(std::is_same<valT, void>::value, "multimemLoadSum can only be instantiated with implemented types");
return valT{0};
}
#if __CUDA_ARCH__ >= 900 // Hopper and later
template<>
__device__ __forceinline__ double multimemLoadSum<double, double>(const double* addr) {
const uintptr_t multimem_addr = reinterpret_cast<uintptr_t>(addr);
double result;
asm volatile("multimem.ld_reduce.global.add.f64 %0, [%1];" : "=d"(result) : "l"(multimem_addr) : "memory");
return result;
}
#endif
#if __CUDA_ARCH__ >= 900 // Hopper and later
template<>
__device__ __forceinline__ float multimemLoadSum<float, float>(const float* addr) {
const uintptr_t multimem_addr = reinterpret_cast<uintptr_t>(addr);
float result;
asm volatile("multimem.ld_reduce.global.add.f32 %0, [%1];" : "=f"(result) : "l"(multimem_addr) : "memory");
return result;
}
#endif
#if __CUDA_ARCH__ >= 900 // Hopper and later
template<>
__device__ __forceinline__ float2 multimemLoadSum<float, float2>(const float* addr) {
const uintptr_t multimem_addr = reinterpret_cast<uintptr_t>(addr);
float2 result;
asm volatile("multimem.ld_reduce.global.add.v2.f32 {%0, %1}, [%2];" : "=f"(result.x), "=f"(result.y) : "l"(multimem_addr) : "memory");
return result;
}
#endif
#if __CUDA_ARCH__ >= 900 // Hopper and later
template<>
__device__ __forceinline__ float4 multimemLoadSum<float, float4>(const float* addr) {
const uintptr_t multimem_addr = reinterpret_cast<uintptr_t>(addr);
float4 result;
asm volatile("multimem.ld_reduce.global.add.v4.f32 {%0, %1, %2, %3}, [%4];" : "=f"(result.x), "=f"(result.y), "=f"(result.z), "=f"(result.w) : "l"(multimem_addr) : "memory");
return result;
}
#endif
template<typename ptrT, typename valT>
__device__ __forceinline__ void multimemStore(ptrT* addr, const valT val) {
assert(false);
// static_assert(std::is_same<ptrT, void>::value, "multimemStore can only be instantiated with implemented types");
// static_assert(std::is_same<valT, void>::value, "multimemStore can only be instantiated with implemented types");
}
#if __CUDA_ARCH__ >= 900 // Hopper and later
template<>
__device__ __forceinline__ void multimemStore<double, double>(double* addr, const double val) {
const uintptr_t multimem_addr = reinterpret_cast<uintptr_t>(addr);
asm volatile("multimem.st.global.f64 [%0], %1;" : : "l"(multimem_addr), "d"(val) : "memory");
}
#endif
#if __CUDA_ARCH__ >= 900 // Hopper and later
template<>
__device__ __forceinline__ void multimemStore<float, float>(float* addr, const float val) {
const uintptr_t multimem_addr = reinterpret_cast<uintptr_t>(addr);
asm volatile("multimem.st.global.f32 [%0], %1;" : : "l"(multimem_addr), "f"(val) : "memory");
}
#endif
#if __CUDA_ARCH__ >= 900 // Hopper and later
template<>
__device__ __forceinline__ void multimemStore<float, float2>(float* addr, const float2 val) {
const uintptr_t multimem_addr = reinterpret_cast<uintptr_t>(addr);
asm volatile("multimem.st.global.v2.f32 [%0], {%1, %2};" : : "l"(multimem_addr), "f"(val.x), "f"(val.y) : "memory");
}
#endif
#if __CUDA_ARCH__ >= 900 // Hopper and later
template<>
__device__ __forceinline__ void multimemStore<float, float4>(float* addr, const float4 val) {
const uintptr_t multimem_addr = reinterpret_cast<uintptr_t>(addr);
asm volatile("multimem.st.global.v4.f32 [%0], {%1, %2, %3, %4};" : : "l"(multimem_addr), "f"(val.x), "f"(val.y), "f"(val.z), "f"(val.w) : "memory");
}
#endif
#endif // _MULTIMEM_OPS_H_


@@ -7,7 +7,7 @@
#include "cuda_runtime.h"
#include "common.h"
void ReduceGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, int nranks) {
void ReduceGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
*sendcount = count;
*recvcount = count;
*sendInplaceOffset = 0;
@@ -39,8 +39,14 @@ void ReduceGetBw(size_t count, int typesize, double sec, double* algBw, double*
*busBw = baseBw;
}
testResult_t ReduceRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
NCCLCHECK(ncclReduce(sendbuff, recvbuff, count, type, op, root, comm, stream));
testResult_t ReduceRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
if (deviceImpl == 0) {
char* sptr = (char*)sendbuff + sendoffset;
char* rptr = (char*)recvbuff + recvoffset;
NCCLCHECK(ncclReduce(sptr, rptr, count, type, op, root, comm, stream));
} else {
return testNotImplemented;
}
return testSuccess;
}
@@ -54,7 +60,7 @@ struct testColl reduceTest = {
void ReduceGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
ReduceGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, nranks);
ReduceGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t ReduceRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@@ -103,8 +109,8 @@ testResult_t ReduceRunTest(struct threadArgs* args, int root, ncclDataType_t typ
}
struct testEngine reduceEngine = {
ReduceGetBuffSize,
ReduceRunTest
.getBuffSize = ReduceGetBuffSize,
.runTest = ReduceRunTest
};
#pragma weak ncclTestEngine=reduceEngine


@@ -7,10 +7,8 @@
#include "cuda_runtime.h"
#include "common.h"
#define ALIGN 4
void ReduceScatterGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, int nranks) {
size_t base = (count/(ALIGN*nranks))*ALIGN;
void ReduceScatterGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
size_t base = (count/nranks) & -(16/eltSize);
*sendcount = base*nranks;
*recvcount = base;
*sendInplaceOffset = 0;
@@ -44,8 +42,14 @@ void ReduceScatterGetBw(size_t count, int typesize, double sec, double* algBw, d
*busBw = baseBw * factor;
}
testResult_t ReduceScatterRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
NCCLCHECK(ncclReduceScatter(sendbuff, recvbuff, count, type, op, comm, stream));
testResult_t ReduceScatterRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
if (deviceImpl == 0) {
char* sptr = (char*)sendbuff + sendoffset;
char* rptr = (char*)recvbuff + recvoffset;
NCCLCHECK(ncclReduceScatter(sptr, rptr, count, type, op, comm, stream));
} else {
return testNotImplemented;
}
return testSuccess;
}
@@ -59,7 +63,7 @@ struct testColl reduceScatterTest = {
void ReduceScatterGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
ReduceScatterGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, nranks);
ReduceScatterGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t ReduceScatterRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@@ -98,8 +102,8 @@ testResult_t ReduceScatterRunTest(struct threadArgs* args, int root, ncclDataTyp
}
struct testEngine reduceScatterEngine = {
ReduceScatterGetBuffSize,
ReduceScatterRunTest
.getBuffSize = ReduceScatterGetBuffSize,
.runTest = ReduceScatterRunTest
};
#pragma weak ncclTestEngine=reduceScatterEngine


@@ -7,12 +7,12 @@
#include "cuda_runtime.h"
#include "common.h"
void ScatterGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, int nranks) {
*sendcount = (count/nranks)*nranks;
*recvcount = count/nranks;
void ScatterGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
*recvcount = (count/nranks) & -(16/eltSize);
*sendcount = (*recvcount)*nranks;
*sendInplaceOffset = 0;
*recvInplaceOffset = count/nranks;
*paramcount = count/nranks;
*recvInplaceOffset = *recvcount;
*paramcount = *recvcount;
}
testResult_t ScatterInitData(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t op, int root, int rep, int in_place) {
@@ -39,23 +39,35 @@ void ScatterGetBw(size_t count, int typesize, double sec, double* algBw, double*
*busBw = baseBw * factor;
}
testResult_t ScatterRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
int nRanks;
NCCLCHECK(ncclCommCount(comm, &nRanks));
int rank;
NCCLCHECK(ncclCommUserRank(comm, &rank));
size_t rankOffset = count * wordSize(type);
if (count == 0) return testSuccess;
testResult_t ScatterRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
if (deviceImpl == 0) {
int nRanks;
NCCLCHECK(ncclCommCount(comm, &nRanks));
int rank;
NCCLCHECK(ncclCommUserRank(comm, &rank));
size_t rankOffset = count * wordSize(type);
if (count == 0) return testSuccess;
NCCLCHECK(ncclGroupStart());
if (rank == root) {
for (int r=0; r<nRanks; r++) {
NCCLCHECK(ncclSend(((char*)sendbuff)+r*rankOffset, count, type, r, comm, stream));
char* sptr = (char*)sendbuff + sendoffset;
char* rptr = (char*)recvbuff + recvoffset;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,28,0)
NCCLCHECK(ncclScatter(sptr, rptr, count, type, root, comm, stream));
#elif NCCL_VERSION_CODE >= NCCL_VERSION(2,7,0)
NCCLCHECK(ncclGroupStart());
if (rank == root) {
for (int r=0; r<nRanks; r++) {
NCCLCHECK(ncclSend(sptr + r * rankOffset, count, type, r, comm, stream));
}
}
NCCLCHECK(ncclRecv(rptr, count, type, root, comm, stream));
NCCLCHECK(ncclGroupEnd());
#else
printf("NCCL 2.7 or later is needed for scatter. This test was compiled with %d.%d.\n", NCCL_MAJOR, NCCL_MINOR);
return testNcclError;
#endif
} else {
return testNotImplemented;
}
NCCLCHECK(ncclRecv(recvbuff, count, type, root, comm, stream));
NCCLCHECK(ncclGroupEnd());
return testSuccess;
}
@@ -69,7 +81,7 @@ struct testColl scatterTest = {
void ScatterGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
ScatterGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, nranks);
ScatterGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t ScatterRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@@ -105,8 +117,8 @@ testResult_t ScatterRunTest(struct threadArgs* args, int root, ncclDataType_t ty
}
struct testEngine scatterEngine = {
ScatterGetBuffSize,
ScatterRunTest
.getBuffSize = ScatterGetBuffSize,
.runTest = ScatterRunTest
};
#pragma weak ncclTestEngine=scatterEngine


@@ -7,7 +7,7 @@
#include "cuda_runtime.h"
#include "common.h"
void SendRecvGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, int nranks) {
void SendRecvGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount, size_t *sendInplaceOffset, size_t *recvInplaceOffset, size_t count, size_t eltSize, int nranks) {
*sendcount = count;
*recvcount = count;
*sendInplaceOffset = 0;
@@ -43,18 +43,24 @@ void SendRecvGetBw(size_t count, int typesize, double sec, double* algBw, double
*busBw = baseBw * factor;
}
testResult_t SendRecvRunColl(void* sendbuff, void* recvbuff, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream) {
int nRanks;
NCCLCHECK(ncclCommCount(comm, &nRanks));
int rank;
NCCLCHECK(ncclCommUserRank(comm, &rank));
int recvPeer = (rank-1+nRanks) % nRanks;
int sendPeer = (rank+1) % nRanks;
testResult_t SendRecvRunColl(void* sendbuff, size_t sendoffset, void* recvbuff, size_t recvoffset, size_t count, ncclDataType_t type, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream, int deviceImpl) {
if (deviceImpl == 0) {
int nRanks;
NCCLCHECK(ncclCommCount(comm, &nRanks));
int rank;
NCCLCHECK(ncclCommUserRank(comm, &rank));
int recvPeer = (rank-1+nRanks) % nRanks;
int sendPeer = (rank+1) % nRanks;
NCCLCHECK(ncclGroupStart());
NCCLCHECK(ncclSend(sendbuff, count, type, sendPeer, comm, stream));
NCCLCHECK(ncclRecv(recvbuff, count, type, recvPeer, comm, stream));
NCCLCHECK(ncclGroupEnd());
char* sptr = (char*)sendbuff + sendoffset;
char* rptr = (char*)recvbuff + recvoffset;
NCCLCHECK(ncclGroupStart());
NCCLCHECK(ncclSend(sptr, count, type, sendPeer, comm, stream));
NCCLCHECK(ncclRecv(rptr, count, type, recvPeer, comm, stream));
NCCLCHECK(ncclGroupEnd());
} else {
return testNotImplemented;
}
return testSuccess;
}
@@ -68,7 +74,7 @@ struct testColl sendRecvTest = {
void SendRecvGetBuffSize(size_t *sendcount, size_t *recvcount, size_t count, int nranks) {
size_t paramcount, sendInplaceOffset, recvInplaceOffset;
SendRecvGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, nranks);
SendRecvGetCollByteCount(sendcount, recvcount, &paramcount, &sendInplaceOffset, &recvInplaceOffset, count, /*eltSize=*/1, nranks);
}
testResult_t SendRecvRunTest(struct threadArgs* args, int root, ncclDataType_t type, const char* typeName, ncclRedOp_t op, const char* opName) {
@@ -107,8 +113,8 @@ testResult_t SendRecvRunTest(struct threadArgs* args, int root, ncclDataType_t t
}
struct testEngine sendRecvEngine = {
SendRecvGetBuffSize,
SendRecvRunTest
.getBuffSize = SendRecvGetBuffSize,
.runTest = SendRecvRunTest
};
#pragma weak ncclTestEngine=sendRecvEngine

src/util.cu (new file, 690 lines)

@@ -0,0 +1,690 @@
/*************************************************************************
* Copyright (c) 2016-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
// This contains utilities to handle output both to stdout and to
// JSON files.
//
// An ad-hoc, libc-based approach to writing JSON has been adopted to
// keep things simple and to avoid injecting a dependency on an
// external JSON library.
//
// However, this means that the code is brittle to changes and care
// should be taken when adding/removing things. We also essentially
// give up when passed non-ASCII strings and non-printable characters,
// except some of the usual ones.
#include "nccl.h"
#include "util.h"
#include <assert.h>
#include <errno.h>
#include "ucommd.h"
static Ucommd ucommd_;
#define PRINT if (is_main_thread) printf
extern int nThreads;
extern int nGpus;
extern size_t minBytes;
extern size_t maxBytes;
extern size_t stepBytes;
extern size_t stepFactor;
extern int datacheck;
extern int warmup_iters;
extern int iters;
extern int agg_iters;
extern int parallel_init;
extern int blocking_coll;
extern int cudaGraphLaunches;
static FILE *json_report_fp;
static thread_local bool write_json;
#define JSON_FILE_VERSION 1
#define TIME_STRING_FORMAT "%Y-%m-%d %H:%M:%S"
typedef enum {
JSON_NONE, // A pseudo-state meaning that the document is empty
JSON_KEY,
JSON_OBJECT_EMPTY,
JSON_OBJECT_SOME,
JSON_LIST_EMPTY,
JSON_LIST_SOME,
} json_state_t;
// We use these statics to maintain a stack of states where we are writing.
// jsonOutputInit gets this set up, and it's jsonOutputFinalize's job to clean this up.
json_state_t *states = nullptr;
size_t state_cap = 0; // Allocated stack capacity
size_t state_n = 0; // # of items in the stack.
// This tries to sanitize/quote a string from 'in' into 'out',
// assuming 'out' has length 'lim'. We mainly escape ", \, /, \t and \n, and
// bail if we encounter non-printable or non-ASCII stuff.
// 'in' should be null-terminated, of course.
//
// We return false if we were not able to copy all of 'in', either for
// length reasons or for unhandled characters.
static bool sanitizeJson(char out[], int lim, const char *in) {
int c = 0;
while(*in) {
if(c+1 >= lim) {
out[c] = 0;
return false;
}
switch(*in) {
case '"':
case '\\':
case '/':
case '\t':
case '\n':
if(c + 2 > lim) {
out[c] = 0;
return false;
}
out[c++] = '\\';
if(*in == '\n') {
out[c++] = 'n';
}
else if( *in == '\t') {
out[c++] = 't';
}
else {
out[c++] = *in;
}
break;
default:
if (*in >= 0x7F || *in <= 0x1F) {
out[c] = 0;
return false;
}
out[c++] = *in;
break;
}
++in;
}
out[c] = 0;
return true;
}
// Push state onto the state stack. Reallocate for extra storage if needed.
// Because JSON_NONE is a pseudo-state, don't allow it to be pushed.
static void jsonPushState(json_state_t state) {
assert(state != JSON_NONE);
if(state_cap <= (state_n+1)) {
state_cap = max((size_t)16, state_cap*2);
states = (json_state_t *)realloc(states, sizeof(json_state_t)*state_cap);
assert(states);
}
states[state_n++] = state;
}
// Return the current state at the top of the stack
static json_state_t jsonCurrState() {
if(state_n == 0) {
return JSON_NONE;
}
return states[state_n-1];
}
// Replace the stack with state (equivalent to a pop & push if stack is not empty)
static void jsonReplaceState(json_state_t state) {
assert(state != JSON_NONE);
assert(state_n != 0);
states[state_n-1] = state;
}
// Pop the top state off the stack, or return that the state is empty
static json_state_t jsonPopState() {
if(state_n == 0) {
return JSON_NONE;
}
return states[--state_n];
}
// Emit a key and separator. Sanitize the key.
// This is only acceptable if the top state is an object.
// Emit a ',' separator if we aren't the first item.
static void jsonKey(const char *name) {
switch(jsonCurrState()) {
case JSON_OBJECT_EMPTY:
jsonReplaceState(JSON_OBJECT_SOME);
break;
case JSON_OBJECT_SOME:
fprintf(json_report_fp, ",");
break;
default:
assert(0);
break;
}
char tmp[2048];
sanitizeJson(tmp, sizeof(tmp), name);
fprintf(json_report_fp, "\"%s\":", tmp);
jsonPushState(JSON_KEY);
}
// Helper function for inserting values.
// Only acceptable after keys, top-level, or in lists.
// Emit a preceding ',' if in a list and not the first item.
static void jsonValHelper() {
switch(jsonCurrState()) {
case JSON_LIST_EMPTY:
jsonReplaceState(JSON_LIST_SOME);
break;
case JSON_LIST_SOME:
fprintf(json_report_fp, ",");
break;
case JSON_KEY:
jsonPopState();
break;
case JSON_NONE:
break;
default:
assert(0);
}
}
// Start an object
static void jsonStartObject() {
jsonValHelper();
fprintf(json_report_fp, "{");
jsonPushState(JSON_OBJECT_EMPTY);
}
// Close an object
static void jsonFinishObject() {
switch(jsonPopState()) {
case JSON_OBJECT_EMPTY:
case JSON_OBJECT_SOME:
break;
default:
assert(0);
}
fprintf(json_report_fp, "}");
}
// Start a list
static void jsonStartList() {
jsonValHelper();
fprintf(json_report_fp, "[");
jsonPushState(JSON_LIST_EMPTY);
}
// Close a list
static void jsonFinishList() {
switch(jsonPopState()) {
case JSON_LIST_EMPTY:
case JSON_LIST_SOME:
break;
default:
assert(0);
}
fprintf(json_report_fp, "]");
}
// Write a null value
static void jsonNull() {
jsonValHelper();
fprintf(json_report_fp, "null");
}
// Write a (sanitized) string
static void jsonStr(const char *str) {
if(str == nullptr) {
jsonNull();
return;
}
jsonValHelper();
char tmp[2048];
sanitizeJson(tmp, sizeof(tmp), str);
fprintf(json_report_fp, "\"%s\"", tmp);
}
// Write a bool as "true" or "false" strings.
static void jsonBool(bool val) {
jsonStr(val ? "true" : "false");
}
// Write an integer value
static void jsonInt(const int val) {
jsonValHelper();
fprintf(json_report_fp, "%d", val);
}
// Write a size_t value
static void jsonSize_t(const size_t val) {
jsonValHelper();
fprintf(json_report_fp, "%zu", val);
}
// Write a double value
static void jsonDouble(const double val) {
jsonValHelper();
if(val != val) {
fprintf(json_report_fp, "\"nan\"");
}
else {
fprintf(json_report_fp, "%lf", val);
}
}
// Fill 'buff' with a formatted time string corresponding to now.
// Write 'len' or fewer bytes.
void formatNow(char *buff, int len) {
time_t now;
time(&now);
struct tm *timeinfo = localtime(&now);
strftime(buff, len, TIME_STRING_FORMAT, timeinfo);
}
// We print a status line to stdout.
// The JSON stream is left with a trailing comma and the top-level
// object open for the next set of top-level items (config and
// results).
// This uses unguarded 'printf' rather than the PRINT() macro because
// is_main_thread is not set up at this point.
void jsonOutputInit(const char *in_path,
int argc, char **argv,
char **envp) {
if(in_path == nullptr) {
return;
}
#ifdef MPI_SUPPORT
int proc;
MPI_Comm_rank(MPI_COMM_WORLD, &proc);
if(proc != 0) {
return;
}
#endif
char *try_path = strdup(in_path);
int try_count = 0;
json_report_fp = fopen(try_path, "wx");
while(json_report_fp == NULL) {
if(errno != EEXIST) {
printf("# skipping json output; %s not accessible\n", try_path);
free(try_path);
return;
}
free(try_path);
if(asprintf(&try_path, "%s.%d", in_path, try_count++) == -1) {
printf("# skipping json output; failed to probe destination\n");
return;
}
json_report_fp = fopen(try_path, "wx");
}
printf("# Writing JSON output to %s\n", try_path);
free(try_path);
write_json = true;
jsonStartObject(); // will be closed by jsonOutputFinalize
jsonKey("version"); jsonInt(JSON_FILE_VERSION);
jsonKey("start_time");
{
char timebuffer[128];
formatNow(timebuffer, sizeof(timebuffer));
jsonStr(timebuffer);
}
jsonKey("args");
jsonStartList();
for(int i = 0; i < argc; i++) {
jsonStr(argv[i]);
}
jsonFinishList();
jsonKey("env");
jsonStartList();
for(char **e = envp; *e; e++) {
jsonStr(*e);
}
jsonFinishList();
jsonKey("nccl_version"); jsonInt(test_ncclVersion);
}
void jsonIdentifyWriter(bool is_writer) {
write_json &= is_writer;
}
// This cleans up the json output, finishing the object and closing the file.
// If we were not writing json output, we don't do anything.
void jsonOutputFinalize() {
if(write_json) {
jsonKey("end_time");
char timebuffer[128];
formatNow(timebuffer, sizeof(timebuffer));
jsonStr(timebuffer);
jsonFinishObject();
assert(jsonCurrState() == JSON_NONE);
free(states);
states = nullptr;
state_n = 0;
state_cap = 0;
fclose(json_report_fp);
json_report_fp = nullptr;
}
}
struct rankInfo_t {
int rank;
int group;
int pid;
char hostname[1024];
int device;
char device_hex[128];
char devinfo[1024];
};
// Helper function to parse the device info lines passed via MPI to the root rank.
// This fills 'rank' with the parsed contents of 'instring'.
static int parseRankInfo(rankInfo_t *rank, const char *instring) {
int end = 0; // stays 0 if sscanf fails before reaching %n
// Field widths are one less than the buffer sizes to leave room for the NUL terminator.
sscanf(instring,
"# Rank %d Group %d Pid %d on %1023s device %d [%127[^]]] %1023[^\n]\n%n",
&rank->rank,
&rank->group,
&rank->pid,
rank->hostname,
&rank->device,
rank->device_hex,
rank->devinfo,
&end);
return end;
}
static void jsonRankInfo(const rankInfo_t *ri) {
jsonStartObject();
jsonKey("rank"); jsonInt(ri->rank);
jsonKey("group"); jsonInt(ri->group);
jsonKey("pid"); jsonInt(ri->pid);
jsonKey("hostname"); jsonStr(ri->hostname);
jsonKey("device"); jsonInt(ri->device);
jsonKey("device_hex"); jsonStr(ri->device_hex);
jsonKey("device_info"); jsonStr(ri->devinfo);
jsonFinishObject();
}
// Write the start of a benchmark output line containing the bytes &
// op type, both to stdout and to json if we are writing there.
void writeBenchmarkLinePreamble(size_t nBytes, size_t nElem, const char typeName[], const char opName[], int root) {
char rootName[100];
sprintf(rootName, "%6i", root);
PRINT("%12li %12li %8s %6s %6s", nBytes, nElem, typeName, opName, rootName);
if(write_json) {
jsonStartObject();
jsonKey("size"); jsonSize_t(nBytes);
jsonKey("count"); jsonSize_t(nElem);
jsonKey("type"); jsonStr(typeName);
jsonKey("redop"); jsonStr(opName);
jsonKey("root"); jsonStr(rootName);
}
}
// Finish a result record we were writing to stdout/json
void writeBenchmarkLineTerminator(int actualIters, const char *name) {
PRINT("\n");
if(write_json) {
jsonKey("actual_iterations"); jsonInt(actualIters);
jsonKey("experiment_name"); jsonStr(name);
jsonFinishObject();
}
}
// Handle cases where we don't write out-of-place results
void writeBenchMarkLineNullBody() {
PRINT(" "); // only do in-place for trace replay
if(write_json) {
jsonKey("out_of_place"); jsonNull();
}
}
void getFloatStr(double value, int width, char* str) {
int power = 0;
for (uint64_t val = 1; value >= val; val *= 10) power++;
if (power < width-2) sprintf(str, "%*.2f", width, value);
else if (power < width-1) sprintf(str, "%*.1f", width, value);
else if (power < width+1) sprintf(str, "%*.0f", width, value);
else if (width >= 8) sprintf(str, "%*.2e", width, value);
else if (width >= 7) sprintf(str, "%*.1e", width, value);
else sprintf(str, "%*.0e", width, value);
}
// Write the performance-related payload to stdout/json.
// We call this function twice at the top level per test: once for out-of-place, and once for in-place.
// The JSON output assumes out-of-place happens first.
void writeBenchmarkLineBody(double timeUsec, double algBw, double busBw, bool reportErrors, int64_t wrongElts, bool report_cputime, bool report_timestamps, bool out_of_place) {
char timeStr[8];
getFloatStr(timeUsec, 7, timeStr);
char algBwStr[7];
getFloatStr(algBw, 6, algBwStr);
char busBwStr[7];
getFloatStr(busBw, 6, busBwStr);
if (reportErrors) {
PRINT(" %7s %6s %6s %6g", timeStr, algBwStr, busBwStr, (double)wrongElts);
} else {
PRINT(" %7s %6s %6s N/A", timeStr, algBwStr, busBwStr);
}
if (!out_of_place && report_timestamps) {
char timebuffer[128];
formatNow(timebuffer, sizeof(timebuffer));
PRINT("%21s", timebuffer);
}
if(write_json) {
jsonKey(out_of_place ? "out_of_place" : "in_place");
jsonStartObject();
jsonKey(report_cputime ? "cpu_time" : "time"); jsonDouble(timeUsec);
jsonKey("alg_bw"); jsonDouble(algBw);
jsonKey("bus_bw"); jsonDouble(busBw);
jsonKey("nwrong"); (reportErrors ? jsonDouble((double)wrongElts) : jsonNull());
jsonFinishObject();
}
}
// This writes out a report about the run parameters and devices
// involved to stdout and json. For MPI, this will use a collective
// to gather from each rank to the root.
// Root then consumes this output, printing raw lines for stdout and
// parsing them for JSON for proper formatting.
// Perhaps actually sending records around instead of formatted
// strings would be smarter/easier, but I chose to adapt what was
// already in place.
testResult_t writeDeviceReport(size_t *maxMem, int localRank, int proc, int totalProcs, int color, const char hostname[], const char *program_name) {
PRINT("# nccl-tests version %s nccl-headers=%d nccl-library=%d\n", NCCL_TESTS_VERSION, NCCL_VERSION_CODE, test_ncclVersion);
PRINT("# Collective test starting: %s\n", program_name);
// PRINT("# nThread %d nGpus %d minBytes %ld maxBytes %ld step: %ld(%s) warmup iters: %d iters: %d agg iters: %d validation: %d graph: %d\n",
PRINT("# nGpus %d minBytes %ld maxBytes %ld step: %ld(%s) warmup iters: %d iters: %d agg iters: %d validation: %d graph: %d\n",
nGpus, minBytes, maxBytes,
(stepFactor > 1)?stepFactor:stepBytes, (stepFactor > 1)?"factor":"bytes",
warmup_iters, iters, agg_iters, datacheck, cudaGraphLaunches);
if (blocking_coll) PRINT("# Blocking Enabled: wait for completion and barrier after each collective \n");
if (parallel_init) PRINT("# Parallel Init Enabled: threads call into NcclInitRank concurrently \n");
PRINT("#\n");
if(write_json) {
jsonKey("config");
jsonStartObject();
jsonKey("nthreads"); jsonInt(nThreads);
jsonKey("ngpus"); jsonInt(nGpus);
jsonKey("minimum_bytes"); jsonSize_t(minBytes);
jsonKey("maximum_bytes"); jsonSize_t(maxBytes);
if(stepFactor > 1) {
jsonKey("step_factor"); jsonInt(stepFactor);
}
else {
jsonKey("step_bytes"); jsonSize_t(stepBytes);
}
jsonKey("warmup_iters"); jsonInt(warmup_iters);
jsonKey("iterations"); jsonInt(iters);
jsonKey("aggregated_iterations"); jsonInt(agg_iters);
jsonKey("validation"); jsonInt(datacheck);
jsonKey("graph"); jsonInt(cudaGraphLaunches);
jsonKey("blocking_collectives"); jsonBool(blocking_coll);
jsonKey("parallel_init"); jsonBool(parallel_init);
}
PRINT("# Using devices\n");
#define MAX_LINE 2048
char line[MAX_LINE];
int len = 0;
const char* envstr = getenv("NCCL_TESTS_DEVICE");
const int gpu0 = envstr ? atoi(envstr) : -1;
int available_devices;
CUDACHECK(cudaGetDeviceCount(&available_devices));
for (int i=0; i<nThreads*nGpus; i++) {
const int cudaDev = (gpu0 != -1 ? gpu0 : localRank*nThreads*nGpus) + i;
const int rank = proc*nThreads*nGpus+i;
cudaDeviceProp prop;
if (cudaDev >= available_devices) {
fprintf(stderr, "Invalid number of GPUs: %d requested but only %d were found.\n",
(gpu0 != -1 ? gpu0 : localRank*nThreads*nGpus) + nThreads*nGpus, available_devices);
fprintf(stderr, "Please check the number of processes and GPUs per process.\n");
return testNotImplemented;
}
CUDACHECK(cudaGetDeviceProperties(&prop, cudaDev));
if (len < MAX_LINE) {
len += snprintf(line+len, MAX_LINE-len, "# Rank %2d Group %2d Pid %6d on %10s device %2d [%04x:%02x:%02x] %s\n",
rank, color, getpid(), hostname, cudaDev, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID, prop.name);
}
*maxMem = std::min(*maxMem, prop.totalGlobalMem);
}
if (len >= MAX_LINE) {
strcpy(line+MAX_LINE-5, "...\n");
}
#if MPI_SUPPORT
char *lines = (proc == 0) ? (char *)malloc(totalProcs*MAX_LINE) : NULL;
// Gather all output in rank order to root (0)
MPI_Gather(line, MAX_LINE, MPI_BYTE, lines, MAX_LINE, MPI_BYTE, 0, MPI_COMM_WORLD);
if (proc == 0) {
if(write_json) {
jsonKey("devices");
jsonStartList();
}
// Print one line per node (stride = local ranks per node when known),
// but emit full per-rank info to the json output.
int stride = ucommd_.getLocalSize() > 0 ? ucommd_.getLocalSize() : 1;
for (int p = stride-1; p < totalProcs; p+=stride)
PRINT("%s", lines+MAX_LINE*p);
if(write_json) {
for (int p = 0; p < totalProcs; p++) {
rankInfo_t rankinfo;
parseRankInfo(&rankinfo, lines + MAX_LINE*p);
jsonRankInfo(&rankinfo);
}
}
if(write_json) {
jsonFinishList();
}
free(lines);
}
MPI_Allreduce(MPI_IN_PLACE, maxMem, 1, MPI_LONG, MPI_MIN, MPI_COMM_WORLD);
#else
PRINT("%s", line);
if(write_json) {
rankInfo_t rankinfo;
parseRankInfo(&rankinfo, line);
jsonKey("devices");
jsonStartList();
jsonRankInfo(&rankinfo);
jsonFinishList();
}
#endif
if(write_json) {
jsonFinishObject();
}
return testSuccess;
}
// Write a result header to stdout/json.
// Json results object and contained table list are left open
void writeResultHeader(bool report_cputime, bool report_timestamps) {
const char* tsLbl = report_timestamps ? "timestamp" : "";
const int tsPad = report_timestamps ? 19 : 0;
const char* tsFmt = report_timestamps ? TIME_STRING_FORMAT : "";
const char* timeStr = report_cputime ? "cputime" : "time";
PRINT("#\n");
PRINT("# %10s %12s %8s %6s %6s out-of-place in-place \n", "", "", "", "", "");
PRINT("# %10s %12s %8s %6s %6s %7s %6s %6s %6s %7s %6s %6s %6s %*s\n", "size", "count", "type", "redop", "root",
timeStr, "algbw", "busbw", "#wrong", timeStr, "algbw", "busbw", "#wrong", tsPad, tsLbl);
PRINT("# %10s %12s %8s %6s %6s %7s %6s %6s %6s %7s %6s %6s %6s %*s\n", "(B)", "(elements)", "", "", "",
"(us)", "(GB/s)", "(GB/s)", "", "(us)", "(GB/s)", "(GB/s)", "", tsPad, tsFmt);
if(write_json) {
jsonKey("results"); jsonStartList();
}
}
// Write the footer for results to stdout/json.
// We close the table list and write out the summary items.
// Results object is left open for errors.
void writeResultFooter(const int errors[], const double bw[], double check_avg_bw, const char *program_name) {
if(write_json) {
jsonFinishList();
}
PRINT("# Out of bounds values : %d %s\n", errors[0], errors[0] ? "FAILED" : "OK");
PRINT("# Avg bus bandwidth          : %g %s\n", bw[0], check_avg_bw == -1 ? "" : (bw[0] < check_avg_bw ? "FAILED" : "OK"));
if (bw[0] < check_avg_bw) PRINT("# Expected min bandwidth : %g\n", check_avg_bw);
PRINT("#\n");
PRINT("# Collective test concluded: %s\n", program_name);
if(write_json) {
jsonKey("out_of_bounds");
jsonStartObject();
jsonKey("count"); jsonInt(errors[0]);
jsonKey("okay"); jsonBool(errors[0] == 0);
jsonFinishObject();
jsonKey("average_bus_bandwidth");
jsonStartObject();
jsonKey("bandwidth"); jsonDouble(bw[0]);
jsonKey("okay"); check_avg_bw == -1 ? jsonStr("unchecked") : jsonBool(bw[0] >= check_avg_bw);
jsonFinishObject();
}
}
// Write out remaining errors to stdout/json.
void writeErrors() {
const char *error = ncclGetLastError(NULL);
if(error && strlen(error) > 0) {
PRINT("# error: %s\n", error);
} else {
PRINT("\n");
}
if(write_json) {
jsonKey("errors");
jsonStartList();
if(error) {
jsonStr(error);
}
jsonFinishList();
}
}

src/util.h (new file)

@ -0,0 +1,37 @@
/*************************************************************************
* Copyright (c) 2016-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef __UTIL_H__
#define __UTIL_H__
#include "common.h"
// Try to set up JSON file output. If MPI is used, only rank 0 will proceed.
// This should be called by only a single thread.
// If 'path' is NULL, we stop.
// Otherwise, we borrow 'path' and try to open it as a new file.
// If it already exists, we probe for new files by appending integers
// until we succeed.
// Then we write argv and envp to the json output, sanitizing them. We also
// write the nccl version.
// The top-level object remains open for the rest of the output.
void jsonOutputInit(const char *path, int argc, char **argv, char **envp);
// Should be called to identify main thread after threads are started to ensure we don't duplicate output
void jsonIdentifyWriter(bool is_writer);
// Write end time and close top-level object. Reset json state and close output file.
void jsonOutputFinalize();
void writeBenchmarkLinePreamble(size_t nBytes, size_t nElem, const char typeName[], const char opName[], int root);
void writeBenchmarkLineTerminator(int actualIters, const char *name);
void writeBenchMarkLineNullBody();
void writeBenchmarkLineBody(double timeUsec, double algBw, double busBw, bool reportErrors, int64_t wrongElts, bool report_cputime, bool report_timestamps, bool out_of_place);
testResult_t writeDeviceReport(size_t *maxMem, int localRank, int proc, int totalProcs, int color, const char hostname[], const char *program_name);
void writeResultHeader(bool report_cputime, bool report_timestamps);
void writeResultFooter(const int errors[], const double bw[], double check_avg_bw, const char *program_name);
void writeErrors();
#endif

src/vector_types.h (new file)

@ -0,0 +1,89 @@
/*************************************************************************
* Copyright (c) 2016-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef _VECTOR_TYPES_H_
#define _VECTOR_TYPES_H_
#include <cuda_runtime.h>
// Helper functions to use vectorized types
// This maps at compile time each data type to its best available vectorized type.
// As close to 128 bits as possible
template <typename T>
struct VectorTypeMapping{
using Type=T; // Default no vectorization
};
template <>
struct VectorTypeMapping<float>{
using Type=float4;
};
template <>
struct VectorTypeMapping<double>{
using Type=double2;
};
template <>
struct VectorTypeMapping<int8_t>{
using Type=char4; // Largest built-in CUDA type for char (32-bit)
};
template <>
struct VectorTypeMapping<uint8_t>{
using Type=uchar4; // Largest built-in CUDA type for uchar (32-bit)
};
template <>
struct VectorTypeMapping<int32_t>{
using Type=int4;
};
template <>
struct VectorTypeMapping<uint32_t>{
using Type=uint4;
};
// Vector addition helper functions
// They enable clean math with vector types.
template <typename T>
__device__ __forceinline__ T vectorAdd(T a, T b) {
return a + b;
}
template <>
__device__ __forceinline__ float4 vectorAdd(float4 a, float4 b) {
return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
template <>
__device__ __forceinline__ double2 vectorAdd(double2 a, double2 b) {
return make_double2(a.x + b.x, a.y + b.y);
}
template <>
__device__ __forceinline__ char4 vectorAdd(char4 a, char4 b) {
return make_char4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
template <>
__device__ __forceinline__ uchar4 vectorAdd(uchar4 a, uchar4 b) {
return make_uchar4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
template <>
__device__ __forceinline__ int4 vectorAdd(int4 a, int4 b) {
return make_int4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
template <>
__device__ __forceinline__ uint4 vectorAdd(uint4 a, uint4 b) {
return make_uint4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
#endif // _VECTOR_TYPES_H_

@ -1,13 +1,18 @@
include ../../makefiles/common.mk
#
# Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
#
# See LICENSE.txt for license information
#
include ../src/common.mk
.PHONY: all clean
BUILDDIR := $(abspath ../../build)
BUILDDIR := $(abspath ../build)
NCCLDIR := $(BUILDDIR)
NVCUFLAGS += -I$(NCCLDIR)/include/ -I../include
DST_DIR := $(BUILDDIR)/test/verifiable
DST_DIR := $(BUILDDIR)/verifiable
all: $(DST_DIR)/self_test $(DST_DIR)/verifiable.o
all: $(DST_DIR)/self_test
clean:
rm -rf $(DST_DIR)
@ -18,7 +23,7 @@ include verifiable.mk
self_test: $(DST_DIR)/self_test
$(DST_DIR)/self_test: verifiable.cu verifiable.h
$(DST_DIR)/self_test: main.cu $(TEST_VERIFIABLE_LIBS)
@printf "Linking %s\n" $@
@mkdir -p $(DST_DIR)
$(NVCC) -o $@ $(NVCUFLAGS) -DSELF_TEST=1 verifiable.cu $(NVLDFLAGS)
$(NVCC) -o $@ $(NVCUFLAGS) $< -L$(TEST_VERIFIABLE_BUILDDIR) -lverifiable $(NVLDFLAGS) -Xlinker "-rpath=\$$ORIGIN"

verifiable/main.cu (new file)

@ -0,0 +1,14 @@
#include <cuda_runtime.h>
#include <iostream>
#define NCCL_VERIFIABLE_SELF_TEST 1
#include "verifiable.h"
int main(int arg_n, char **args) {
std::cerr<<"You are hoping to see no output beyond this line."<<std::endl;
cudaSetDevice(0);
ncclVerifiableLaunchSelfTest();
cudaDeviceSynchronize();
return 0;
}

@ -8,6 +8,15 @@
#if CUDART_VERSION >= 11000
#include <cuda_bf16.h>
#endif
#if CUDART_VERSION >= 11080
#include <cuda_fp8.h>
#endif
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,24,0) && defined(__CUDA_FP8_TYPES_EXIST__)
#define HAVE_ncclFloat8 1
#else
#define HAVE_ncclFloat8 0
#endif
#if NCCL_VERSION_CODE >= NCCL_VERSION(2,10,0) && defined(__CUDA_BF16_TYPES_EXIST__)
#define HAVE_ncclBfloat16 1
@ -84,10 +93,16 @@ template<typename T>
struct IsIntegral: std::is_integral<T> {};
template<>
struct IsIntegral<half>: std::false_type {};
#ifdef __CUDA_BF16_TYPES_EXIST__
#if HAVE_ncclBfloat16
template<>
struct IsIntegral<__nv_bfloat16>: std::false_type {};
#endif
#if HAVE_ncclFloat8
template<>
struct IsIntegral<__nv_fp8_e4m3>: std::false_type {};
template<>
struct IsIntegral<__nv_fp8_e5m2>: std::false_type {};
#endif
}
////////////////////////////////////////////////////////////////////////////////
@ -107,23 +122,72 @@ __host__ __device__ T inhibit(T x) {
////////////////////////////////////////////////////////////////////////////////
namespace {
template<typename Y, typename X>
__host__ __device__ Y castTo(X x) {
template<typename Y>
__host__ __device__ Y castTo(uint64_t x) {
return Y(x);
}
template<typename Y>
__host__ __device__ Y castTo(float x) {
return Y(x);
}
template<typename Y>
__host__ __device__ Y castTo(double x) {
return Y(x);
}
template<>
__host__ __device__ half castTo<half>(float x) {
return __float2half(x);
}
#ifdef __CUDA_BF16_TYPES_EXIST__
template<>
__host__ __device__ half castTo<half>(double x) {
return __double2half(x);
}
template<>
__host__ __device__ half castTo<half>(uint64_t x) {
return __ull2half_rn(x);
}
#if HAVE_ncclBfloat16
template<>
__host__ __device__ __nv_bfloat16 castTo<__nv_bfloat16>(float x) {
return __float2bfloat16(x);
}
template<>
__host__ __device__ __nv_bfloat16 castTo<__nv_bfloat16>(double x) {
return __double2bfloat16(x);
}
template<>
__host__ __device__ __nv_bfloat16 castTo<__nv_bfloat16>(uint64_t x) {
return __double2bfloat16((double)x);
}
#endif
#if HAVE_ncclFloat8
template<>
__host__ __device__ __nv_fp8_e4m3 castTo<__nv_fp8_e4m3>(float x) {
return __nv_fp8_e4m3(x);
}
template<>
__host__ __device__ __nv_fp8_e4m3 castTo<__nv_fp8_e4m3>(double x) {
return __nv_fp8_e4m3(x);
}
template<>
__host__ __device__ __nv_fp8_e4m3 castTo<__nv_fp8_e4m3>(uint64_t x) {
return __nv_fp8_e4m3((double)x);
}
template<>
__host__ __device__ __nv_fp8_e5m2 castTo<__nv_fp8_e5m2>(float x) {
return __nv_fp8_e5m2(x);
}
template<>
__host__ __device__ __nv_fp8_e5m2 castTo<__nv_fp8_e5m2>(double x) {
return __nv_fp8_e5m2(x);
}
template<>
__host__ __device__ __nv_fp8_e5m2 castTo<__nv_fp8_e5m2>(uint64_t x) {
return __nv_fp8_e5m2((double)x);
}
#endif
}
@ -151,7 +215,7 @@ struct ReduceSum {
return __float2half(__half2float(a) + __half2float(b));
#endif
}
#ifdef __CUDA_BF16_TYPES_EXIST__
#if HAVE_ncclBfloat16
__host__ __device__ __nv_bfloat16 operator()(__nv_bfloat16 a, __nv_bfloat16 b) const {
#if __CUDA_ARCH__ >= 800
return __hadd(a, b);
@ -160,6 +224,22 @@ struct ReduceSum {
#endif
}
#endif
#if HAVE_ncclFloat8
__host__ __device__ __nv_fp8_e4m3 operator()(__nv_fp8_e4m3 a, __nv_fp8_e4m3 b) const {
#if __CUDA_ARCH__ >= 800
return __nv_fp8_e4m3(__hadd(__half(a), __half(b)));
#else
return __nv_fp8_e4m3(float(a) + float(b));
#endif
}
__host__ __device__ __nv_fp8_e5m2 operator()(__nv_fp8_e5m2 a, __nv_fp8_e5m2 b) const {
#if __CUDA_ARCH__ >= 800
return __nv_fp8_e5m2(__hadd(__half(a), __half(b)));
#else
return __nv_fp8_e5m2(float(a) + float(b));
#endif
}
#endif
template<typename T>
__host__ __device__ T postOp(T x) const { return x; }
};
@ -175,7 +255,7 @@ struct ReduceProd {
return __float2half(__half2float(a) * __half2float(b));
#endif
}
#ifdef __CUDA_BF16_TYPES_EXIST__
#if HAVE_ncclBfloat16
__host__ __device__ __nv_bfloat16 operator()(__nv_bfloat16 a, __nv_bfloat16 b) const {
#if __CUDA_ARCH__ >= 800
return __hmul(a, b);
@ -184,6 +264,22 @@ struct ReduceProd {
#endif
}
#endif
#if HAVE_ncclFloat8
__host__ __device__ __nv_fp8_e4m3 operator()(__nv_fp8_e4m3 a, __nv_fp8_e4m3 b) const {
#if __CUDA_ARCH__ >= 800
return __nv_fp8_e4m3(__hmul(__half(a), __half(b)));
#else
return __nv_fp8_e4m3(float(a) * float(b));
#endif
}
__host__ __device__ __nv_fp8_e5m2 operator()(__nv_fp8_e5m2 a, __nv_fp8_e5m2 b) const {
#if __CUDA_ARCH__ >= 800
return __nv_fp8_e5m2(__hmul(__half(a), __half(b)));
#else
return __nv_fp8_e5m2(float(a) * float(b));
#endif
}
#endif
template<typename T>
__host__ __device__ T postOp(T x) const { return x; }
};
@ -201,7 +297,7 @@ struct ReduceMin {
return __half2float(a) < __half2float(b) ? a : b;
#endif
}
#ifdef __CUDA_BF16_TYPES_EXIST__
#if HAVE_ncclBfloat16
__host__ __device__ __nv_bfloat16 operator()(__nv_bfloat16 a, __nv_bfloat16 b) const {
#if __CUDA_ARCH__ >= 800
return __hmin(a, b);
@ -212,6 +308,22 @@ struct ReduceMin {
#endif
}
#endif
#if HAVE_ncclFloat8
__host__ __device__ __nv_fp8_e4m3 operator()(__nv_fp8_e4m3 a, __nv_fp8_e4m3 b) const {
#if __CUDA_ARCH__ >= 800
return __nv_fp8_e4m3(__hmin(__half(a), __half(b)));
#else
return __nv_fp8_e4m3(float(a) < float(b) ? a : b);
#endif
}
__host__ __device__ __nv_fp8_e5m2 operator()(__nv_fp8_e5m2 a, __nv_fp8_e5m2 b) const {
#if __CUDA_ARCH__ >= 800
return __nv_fp8_e5m2(__hmin(__half(a), __half(b)));
#else
return __nv_fp8_e5m2(float(a) < float(b) ? a : b);
#endif
}
#endif
template<typename T>
__host__ __device__ T postOp(T x) const { return x; }
};
@ -229,7 +341,7 @@ struct ReduceMax {
return __half2float(a) > __half2float(b) ? a : b;
#endif
}
#ifdef __CUDA_BF16_TYPES_EXIST__
#if HAVE_ncclBfloat16
__host__ __device__ __nv_bfloat16 operator()(__nv_bfloat16 a, __nv_bfloat16 b) const {
#if __CUDA_ARCH__ >= 800
return __hmax(a, b);
@ -240,6 +352,22 @@ struct ReduceMax {
#endif
}
#endif
#if HAVE_ncclFloat8
__host__ __device__ __nv_fp8_e4m3 operator()(__nv_fp8_e4m3 a, __nv_fp8_e4m3 b) const {
#if __CUDA_ARCH__ >= 800
return __nv_fp8_e4m3(__hmax(__half(a), __half(b)));
#else
return __nv_fp8_e4m3(float(a) > float(b) ? a : b);
#endif
}
__host__ __device__ __nv_fp8_e5m2 operator()(__nv_fp8_e5m2 a, __nv_fp8_e5m2 b) const {
#if __CUDA_ARCH__ >= 800
return __nv_fp8_e5m2(__hmax(__half(a), __half(b)));
#else
return __nv_fp8_e5m2(float(a) > float(b) ? a : b);
#endif
}
#endif
template<typename T>
__host__ __device__ T postOp(T x) const { return x; }
};
@ -297,29 +425,47 @@ struct ReduceAvg {
namespace {
template<typename T>
struct FloatLayout;
struct FloatLayout { static constexpr bool is_floating_point = false; };
template<>
struct FloatLayout<float> {
static constexpr bool is_floating_point = true;
static constexpr int exponent_bits = 8, mantissa_bits = 23;
static constexpr int exponent_bias = (1<<(exponent_bits-1))-1;
};
template<>
struct FloatLayout<double> {
static constexpr bool is_floating_point = true;
static constexpr int exponent_bits = 11, mantissa_bits = 52;
static constexpr int exponent_bias = (1<<(exponent_bits-1))-1;
};
template<>
struct FloatLayout<half> {
static constexpr bool is_floating_point = true;
static constexpr int exponent_bits = 5, mantissa_bits = 10;
static constexpr int exponent_bias = (1<<(exponent_bits-1))-1;
};
#ifdef __CUDA_BF16_TYPES_EXIST__
#if HAVE_ncclBfloat16
template<>
struct FloatLayout<__nv_bfloat16> {
static constexpr bool is_floating_point = true;
static constexpr int exponent_bits = 8, mantissa_bits = 7;
static constexpr int exponent_bias = (1<<(exponent_bits-1))-1;
};
#endif
#if HAVE_ncclFloat8
template<>
struct FloatLayout<__nv_fp8_e4m3> {
static constexpr bool is_floating_point = true;
static constexpr int exponent_bits = 4, mantissa_bits = 3;
static constexpr int exponent_bias = (1<<(exponent_bits-1))-1;
};
template<>
struct FloatLayout<__nv_fp8_e5m2> {
static constexpr bool is_floating_point = true;
static constexpr int exponent_bits = 5, mantissa_bits = 2;
static constexpr int exponent_bias = (1<<(exponent_bits-1))-1;
};
#endif
template<typename T>
__host__ __device__ T makeFloat(int sign, int exp, uint64_t mant) {
@ -632,11 +778,12 @@ __host__ __device__ void genOutput(
////////////////////////////////////////////////////////////////////////////////
// Nil reduction (byte copy functions). Optimized to assume rank_n=1
// genInput specialization for integer ReduceNil.
namespace {
template<typename T, bool IsIntegral>
template<typename T>
__host__ __device__ void genInput(
T &ans, ReduceNil, int rank_n, int rank_me, uint64_t seed, intptr_t index,
std::integral_constant<bool, IsIntegral>
std::true_type /*integral*/
) {
(void)rank_n, (void)rank_me; // silence unused warnings
union { uint64_t bits; T tmp; };
@ -646,6 +793,24 @@ __host__ __device__ void genInput(
ans = tmp;
}
// genInput specialization for floating point ReduceNil.
template<typename T>
__host__ __device__ void genInput(
T &ans, ReduceNil, int rank_n, int rank_me, uint64_t seed, intptr_t index,
std::false_type /*integral*/
) {
(void)rank_n; // silence unused warnings
constexpr uint64_t mant_mask = (uint64_t(1) << FloatLayout<T>::mantissa_bits)-1;
uint64_t rng = hashOf(index ^ index<<16 ^ rank_me, seed);
int sign = rng & 1;
rng ^= rng>>1;
int exp = rng & ((1<<(FloatLayout<T>::exponent_bits-1))-1);
exp += 1<<(FloatLayout<T>::exponent_bits-2);
rng ^= rng >> FloatLayout<T>::exponent_bits;
uint64_t mant = rng & mant_mask;
ans = makeFloat<T>(sign, exp, mant);
}
template<typename T, typename ReduceFn, bool IsIntegral>
__host__ __device__ void genOutput(
T &ans, ReduceNil op, int rank_n, uint64_t seed, intptr_t index,
@ -734,20 +899,34 @@ __host__ __device__ void genOutput(
namespace {
template<typename T>
__host__ __device__ void genInput(
T &ans, ReduceAvg, int rank_n, int rank_me, uint64_t seed, intptr_t index,
T &ans, ReduceAvg, int rank_n, int rank_me, uint64_t rng, intptr_t index,
std::false_type /*integral*/
) {
ans = genInOutFloatSum<T>(/*input_not_output=*/true, rank_n, rank_me, seed, index, /*same_sign=*/true);
// We can't control the nranks divisor in averages, so to control error we
// limit to two ranks contributing non-zero values. This way there is no
// ambiguity in the order of summation.
int r = shuffleRank(rank_n, rank_me, rng);
uint64_t m = (rng*(r ? 0xbeef : 1)) & ((1ul<<FloatLayout<T>::mantissa_bits)-1);
ans = r < 2 ? castTo<T>(1+m) : castTo<T>((uint64_t)0);
}
template<typename T>
__host__ __device__ void genOutput(
T &ans, ReduceAvg, int rank_n, uint64_t seed, intptr_t index,
T &ans, ReduceAvg, int rank_n, uint64_t rng, intptr_t index,
std::false_type /*integral*/
) {
ans = genInOutFloatSum<T>(/*input_not_output=*/false, rank_n, 0, seed, index, /*same_sign=*/true);
using T1 = typename std::conditional<(sizeof(T)<sizeof(double)), float, double>::type;
ans = ReduceProd()(ans, T1(1)/T1(rank_n));
shuffleRank(rank_n, -1, rng);
uint64_t m0 = (rng*(0 ? 0xbeef : 1)) & ((1ul<<FloatLayout<T>::mantissa_bits)-1);
uint64_t m1 = (rng*(1 ? 0xbeef : 1)) & ((1ul<<FloatLayout<T>::mantissa_bits)-1);
if (rank_n == 1) {
ans = castTo<T>(1+m0);
} else {
// NCCL varies which datatype it does the muls with depending on __CUDA_ARCH__.
// We account for this by using a tolerance of 2 ulps during the verification.
using TMul = typename std::conditional<(sizeof(T) < sizeof(double)), float, double>::type;
ans = ReduceSum()((T)(TMul(1+m0)*TMul(1.0/rank_n)),
(T)(TMul(1+m1)*TMul(1.0/rank_n)));
}
}
}
@ -809,10 +988,9 @@ __host__ __device__ T genOutput(
////////////////////////////////////////////////////////////////////////////////
#if !SELF_TEST
namespace {
template<typename T, typename ReduceFn>
__global__ void prepareInput2(
__global__ void __launch_bounds__(512, 1) prepareInput2(
T *elts, intptr_t elt_n, ReduceFn op, int rank_n, int rank_me,
uint64_t seed, intptr_t elt_ix0
) {
@ -833,40 +1011,49 @@ __global__ void prepareInput2(
}
template<typename ReduceOp>
void prepareInput1(
cudaError_t prepareInput1(
void *elts, intptr_t elt_n, int elt_ty, ReduceOp op, int rank_n, int rank_me,
uint64_t seed, intptr_t elt_ix0, cudaStream_t stream
) {
int block_n = std::min<intptr_t>(32, (elt_n + 4*512-1)/(4*512));
#define CASE_TY(T) prepareInput2<<<block_n, 512, 0, stream>>>((T*)elts, elt_n, op, rank_n, rank_me, seed, elt_ix0); break;
void const *fn = nullptr;
switch(elt_ty) {
case ncclInt8: CASE_TY(int8_t)
case ncclUint8: CASE_TY(uint8_t)
case ncclInt32: CASE_TY(int32_t)
case ncclUint32: CASE_TY(uint32_t)
case ncclInt64: CASE_TY(int64_t)
case ncclUint64: CASE_TY(uint64_t)
case ncclFloat16: CASE_TY(half)
case ncclInt8: fn = (void const*)&prepareInput2<int8_t, ReduceOp>; break;
case ncclUint8: fn = (void const*)&prepareInput2<uint8_t, ReduceOp>; break;
case ncclInt32: fn = (void const*)&prepareInput2<int32_t, ReduceOp>; break;
case ncclUint32: fn = (void const*)&prepareInput2<uint32_t, ReduceOp>; break;
case ncclInt64: fn = (void const*)&prepareInput2<int64_t, ReduceOp>; break;
case ncclUint64: fn = (void const*)&prepareInput2<uint64_t, ReduceOp>; break;
case ncclFloat16: fn = (void const*)&prepareInput2<half, ReduceOp>; break;
#if HAVE_ncclBfloat16
case ncclBfloat16: CASE_TY(__nv_bfloat16)
case ncclBfloat16: fn = (void const*)&prepareInput2<__nv_bfloat16, ReduceOp>; break;
#endif
case ncclFloat32: CASE_TY(float)
case ncclFloat64: CASE_TY(double)
default: assert(0);
#if HAVE_ncclFloat8
case ncclFloat8e4m3: fn = (void const*)&prepareInput2<__nv_fp8_e4m3, ReduceOp>; break;
case ncclFloat8e5m2: fn = (void const*)&prepareInput2<__nv_fp8_e5m2, ReduceOp>; break;
#endif
case ncclFloat32: fn = (void const*)&prepareInput2<float, ReduceOp>; break;
case ncclFloat64: fn = (void const*)&prepareInput2<double, ReduceOp>; break;
default: assert(0); return cudaErrorInvalidValue;
}
#undef CASE_TY
dim3 grid = {1, 1, 1};
grid.x = (unsigned int)std::min<intptr_t>(32, (elt_n + 4*512-1)/(4*512));
dim3 block = {512, 1, 1};
void *args[7] = {&elts, &elt_n, &op, &rank_n, &rank_me, &seed, &elt_ix0};
if (grid.x == 0) return cudaSuccess;
return cudaLaunchKernel(fn, grid, block, args, 0, stream);
}
}
void ncclVerifiablePrepareInput(
cudaError_t ncclVerifiablePrepareInput(
void *elts, intptr_t elt_n, int elt_ty, int red_op, int rank_n, int rank_me,
uint64_t seed, intptr_t elt_ix0, cudaStream_t stream
) {
#define CASE_OP(op) \
if(rank_n == 1) \
prepareInput1(elts, elt_n, elt_ty, ReduceNil(), rank_n, rank_me, seed, elt_ix0, stream); \
return prepareInput1(elts, elt_n, elt_ty, ReduceNil(), rank_n, rank_me, seed, elt_ix0, stream); \
else \
prepareInput1(elts, elt_n, elt_ty, op, rank_n, rank_me, seed, elt_ix0, stream); \
return prepareInput1(elts, elt_n, elt_ty, op, rank_n, rank_me, seed, elt_ix0, stream); \
break;
switch(red_op) {
case ncclSum: CASE_OP(ReduceSum())
@ -882,14 +1069,12 @@ void ncclVerifiablePrepareInput(
}
#undef CASE_OP
}
#endif
////////////////////////////////////////////////////////////////////////////////
#if !SELF_TEST
namespace {
template<typename T, typename ReduceFn>
__global__ void prepareExpected2(
__global__ void __launch_bounds__(512, 1) prepareExpected2(
T *elts, intptr_t elt_n, ReduceFn op, int rank_n,
uint64_t seed, intptr_t elt_ix0
) {
@ -909,40 +1094,49 @@ __global__ void prepareExpected2(
}
template<typename ReduceOp>
void prepareExpected1(
cudaError_t prepareExpected1(
void *elts, intptr_t elt_n, int elt_ty, ReduceOp op, int rank_n,
uint64_t seed, intptr_t elt_ix0, cudaStream_t stream
) {
int block_n = std::min<intptr_t>(32, (elt_n + 4*512-1)/(4*512));
#define CASE_TY(T) prepareExpected2<<<block_n, 512, 0, stream>>>((T*)elts, elt_n, op, rank_n, seed, elt_ix0); break;
void const *fn = nullptr;
switch(elt_ty) {
case ncclInt8: CASE_TY(int8_t)
case ncclUint8: CASE_TY(uint8_t)
case ncclInt32: CASE_TY(int32_t)
case ncclUint32: CASE_TY(uint32_t)
case ncclInt64: CASE_TY(int64_t)
case ncclUint64: CASE_TY(uint64_t)
case ncclFloat16: CASE_TY(half)
case ncclInt8: fn = (void const*)&prepareExpected2<int8_t, ReduceOp>; break;
case ncclUint8: fn = (void const*)&prepareExpected2<uint8_t, ReduceOp>; break;
case ncclInt32: fn = (void const*)&prepareExpected2<int32_t, ReduceOp>; break;
case ncclUint32: fn = (void const*)&prepareExpected2<uint32_t, ReduceOp>; break;
case ncclInt64: fn = (void const*)&prepareExpected2<int64_t, ReduceOp>; break;
case ncclUint64: fn = (void const*)&prepareExpected2<uint64_t, ReduceOp>; break;
case ncclFloat16: fn = (void const*)&prepareExpected2<half, ReduceOp>; break;
#if HAVE_ncclBfloat16
case ncclBfloat16: CASE_TY(__nv_bfloat16)
case ncclBfloat16: fn = (void const*)&prepareExpected2<__nv_bfloat16, ReduceOp>; break;
#endif
case ncclFloat32: CASE_TY(float)
case ncclFloat64: CASE_TY(double)
default: assert(0);
#if HAVE_ncclFloat8
case ncclFloat8e4m3: fn = (void const*)&prepareExpected2<__nv_fp8_e4m3, ReduceOp>; break;
case ncclFloat8e5m2: fn = (void const*)&prepareExpected2<__nv_fp8_e5m2, ReduceOp>; break;
#endif
case ncclFloat32: fn = (void const*)&prepareExpected2<float, ReduceOp>; break;
case ncclFloat64: fn = (void const*)&prepareExpected2<double, ReduceOp>; break;
default: assert(0); return cudaErrorInvalidValue;
}
#undef CASE_TY
dim3 grid = {1, 1, 1};
grid.x = (unsigned int)std::min<intptr_t>(32, (elt_n + 4*512-1)/(4*512));
dim3 block = {512, 1, 1};
void *args[6] = {&elts, &elt_n, &op, &rank_n, &seed, &elt_ix0};
if (grid.x == 0) return cudaSuccess;
return cudaLaunchKernel(fn, grid, block, args, 0, stream);
}
}
void ncclVerifiablePrepareExpected(
cudaError_t ncclVerifiablePrepareExpected(
void *elts, intptr_t elt_n, int elt_ty, int red_op, int rank_n,
uint64_t seed, intptr_t elt_ix0, cudaStream_t stream
) {
#define CASE_OP(op) \
if(rank_n == 1) \
prepareExpected1(elts, elt_n, elt_ty, ReduceNil(), rank_n, seed, elt_ix0, stream); \
return prepareExpected1(elts, elt_n, elt_ty, ReduceNil(), rank_n, seed, elt_ix0, stream); \
else \
prepareExpected1(elts, elt_n, elt_ty, op, rank_n, seed, elt_ix0, stream); \
return prepareExpected1(elts, elt_n, elt_ty, op, rank_n, seed, elt_ix0, stream); \
break;
switch(red_op) {
case ncclSum: CASE_OP(ReduceSum())
@ -958,52 +1152,10 @@ void ncclVerifiablePrepareExpected(
}
#undef CASE_OP
}
#endif
////////////////////////////////////////////////////////////////////////////////
namespace {
/* How we compare floating point values when exactness is impossible is interesting.
* First, we take note that simply reinterpreting integer bits as floating point
* gives us a monotonic mapping which exponentially spaces out floats. Thus
* consecutive integers encode consecutive floats. In general, using integer
* subtraction on the bitpatterns of two floats gives us an integer which is the
* logarithm of their relative difference. But, if the floats always have similar
* exponents, then the integer difference is actually proportional to the
* relative error (this is because we are counting hops in the mantissa bits only,
* not the exponent bits). So a cheap way to check if two floats are relatively
* close is: abs(intBits(a) - intBits(b)) < tolerance. The following formula
* calculates such a tolerance for a summation of n floats. This formula
* was derived by inspecting the maximum observed integer difference over many
* random runs of summation. The parameter values were computed by the
* companion program "inexact_regress.cu".
*/
__host__ __device__ unsigned calcSumFloatTolerance(int rank_n, int elt_ty) {
float power, coef;
switch(elt_ty) {
case ncclFloat32:
case ncclFloat64:
power = .51f;
coef = 1.25f;
break;
case ncclFloat16:
power = .91f;
coef = .75f;
break;
#if HAVE_ncclBfloat16
case ncclBfloat16:
power = .91f;
coef = .66f;
break;
#endif
}
#if __CUDA_ARCH__
return 1 + unsigned(coef*powf(float(rank_n), power));
#else
return 1 + unsigned(coef*std::pow(float(rank_n), power));
#endif
}
template<typename T>
__host__ __device__ uint64_t calcDelta(T a, T b) {
union { T t; uint8_t i1; uint16_t i2; uint32_t i4; uint64_t i8; } x, y;
@ -1020,10 +1172,9 @@ __host__ __device__ uint64_t calcDelta(T a, T b) {
////////////////////////////////////////////////////////////////////////////////
#if !SELF_TEST
namespace {
template<typename T>
__global__ void verifyPrepared(
__global__ void __launch_bounds__(512, 1) verifyPrepared(
T const *results, T const *expected, intptr_t elt_n, unsigned tolerance, int64_t *bad_elt_n
) {
intptr_t i0 = blockIdx.x*(elt_n/gridDim.x);
@ -1039,16 +1190,34 @@ __global__ void verifyPrepared(
bad += tolerance < delta ? 1 : 0;
#if 0
if(tolerance < delta) {
printf("verifyPrepared ix=%lld got=%g exp=%g\n", (long long)i, (float)results[i], (float)expected[i]);
printf("verifyPrepared ix=%lld got=%g exp=%g tol=%d\n", (long long)i, (float)results[i], (float)expected[i], tolerance);
}
#endif
i += blockDim.x;
}
asm volatile("red.global.add.u64 [%0],%1;" :: "l"(bad_elt_n), "l"(bad));
asm volatile("red.global.add.u64 [%0],%1;" :: "l"(bad_elt_n), "l"(bad) : "memory");
}
cudaError_t verifyPrepared1(int bytePerElt,
void const *results, void const *expected, intptr_t elt_n, unsigned tolerance, int64_t *bad_elt_n, cudaStream_t stream, int block_n
) {
void const *fn = nullptr;
switch(bytePerElt) {
case 1: fn = (void const*)&verifyPrepared<uint8_t>; break;
case 2: fn = (void const*)&verifyPrepared<uint16_t>; break;
case 4: fn = (void const*)&verifyPrepared<uint32_t>; break;
case 8: fn = (void const*)&verifyPrepared<uint64_t>; break;
default: assert(0); return cudaErrorInvalidValue;
}
dim3 grid = {(unsigned int)block_n, 1, 1};
dim3 block = {512, 1, 1};
void *args[5] = {&results, &expected, &elt_n, &tolerance, &bad_elt_n};
if (grid.x == 0) return cudaSuccess;
return cudaLaunchKernel(fn, grid, block, args, 0, stream);
}
template<typename T, typename Uint, typename ReduceFn>
__global__ void verifyInline2(
__global__ void __launch_bounds__(512, 1) verifyInline2(
T const *results, intptr_t elt_n, ReduceFn op, int rank_n, uint64_t seed,
intptr_t elt_ix0, unsigned tolerance, int64_t *bad_elt_n
) {
@ -1077,39 +1246,52 @@ __global__ void verifyInline2(
#endif
i += blockDim.x;
}
asm volatile("red.global.add.u64 [%0],%1;" :: "l"(bad_elt_n), "l"(bad));
asm volatile("red.global.add.u64 [%0],%1;" :: "l"(bad_elt_n), "l"(bad) : "memory");
}
template<typename T, typename Uint>
void verifyInline1(
cudaError_t verifyInline1(
T const *results, intptr_t elt_n, int red_op, int rank_n, uint64_t seed, intptr_t elt_ix0,
unsigned tolerance, int64_t *bad_elt_n, cudaStream_t stream, int block_n
) {
void const *fn = nullptr;
ReduceNil opnil;
ReduceSum opsum;
ReduceMin opmin;
ReduceMax opmax;
ReduceProd opprod;
ReduceAvg opavg{rank_n};
ReducePreMulSum oppremulsum;
void *args[8] = {&results, &elt_n, nullptr, &rank_n, &seed, &elt_ix0, &tolerance, &bad_elt_n};
#define CASE_OP(op) \
if(rank_n == 1) \
verifyInline2<T, Uint><<<block_n, 512, 0, stream>>> \
((T const*)results, elt_n, ReduceNil(), rank_n, seed, elt_ix0, tolerance, bad_elt_n); \
else \
verifyInline2<T, Uint><<<block_n, 512, 0, stream>>> \
((T const*)results, elt_n, op, rank_n, seed, elt_ix0, tolerance, bad_elt_n); \
break;
if(rank_n == 1) { \
fn = (void const*)&verifyInline2<T, Uint, ReduceNil>; \
args[2] = &opnil; \
} else { \
fn = (void const*)&verifyInline2<T, Uint, decltype(op)>; \
args[2] = &op; \
} break;
switch(red_op) {
case ncclSum: CASE_OP(ReduceSum())
case ncclMin: CASE_OP(ReduceMin())
case ncclMax: CASE_OP(ReduceMax())
case ncclProd: CASE_OP(ReduceProd())
case ncclSum: CASE_OP(opsum)
case ncclMin: CASE_OP(opmin)
case ncclMax: CASE_OP(opmax)
case ncclProd: CASE_OP(opprod)
#if HAVE_ncclAvg
case ncclAvg: CASE_OP(ReduceAvg{rank_n})
case ncclAvg: CASE_OP(opavg)
#endif
#if HAVE_ncclPreMulSum
default: CASE_OP(ReducePreMulSum())
default: CASE_OP(oppremulsum)
#endif
}
#undef CASE_OP
dim3 grid = {(unsigned int)block_n, 1, 1};
dim3 block = {512, 1, 1};
if (grid.x == 0) return cudaSuccess;
return cudaLaunchKernel(fn, grid, block, args, 0, stream);
}
}
void ncclVerifiableVerify(
cudaError_t ncclVerifiableVerify(
void const *results, void const *expected, intptr_t elt_n, int elt_ty,
int red_op, int rank_n, uint64_t seed, intptr_t elt_ix0,
int64_t *bad_elt_n, cudaStream_t stream
@@ -1118,11 +1300,21 @@ void ncclVerifiableVerify(
#if HAVE_ncclBfloat16
floating |= elt_ty == ncclBfloat16;
#endif
#if HAVE_ncclFloat8
floating |= elt_ty == ncclFloat8e4m3;
floating |= elt_ty == ncclFloat8e5m2;
#endif
unsigned tolerance = 0;
#if HAVE_ncclAvg
if (floating && red_op == ncclAvg)
tolerance = calcSumFloatTolerance(rank_n, elt_ty);
if (floating && red_op == ncclAvg) {
// Average does its pre-multiplies in an unspecified floating point format
// (could be the actual type T, or float, or half). That means the
// pre-multiply the verifier does could differ in the least significant
// mantissa digit. After adding those two (since avg only has two non-zero
// contributions) we could be off by a distance of 2 units.
tolerance = 2;
}
#endif
int block_n = std::min<intptr_t>(32, (elt_n + 4*512-1)/(4*512));
@@ -1130,9 +1322,9 @@ void ncclVerifiableVerify(
*bad_elt_n = 0;
#define CASE_TY(T, Uint) { \
if(expected != nullptr) { \
verifyPrepared<<<block_n, 512, 0, stream>>>((Uint const*)results, (Uint const*)expected, elt_n, tolerance, bad_elt_n); \
return verifyPrepared1(sizeof(T), results, expected, elt_n, tolerance, bad_elt_n, stream, block_n); \
} else { \
verifyInline1<T, Uint>((T const*)results, elt_n, red_op, rank_n, seed, elt_ix0, tolerance, bad_elt_n, stream, block_n); \
return verifyInline1<T, Uint>((T const*)results, elt_n, red_op, rank_n, seed, elt_ix0, tolerance, bad_elt_n, stream, block_n); \
} \
} break;
switch(elt_ty) {
@@ -1143,29 +1335,30 @@ void ncclVerifiableVerify(
case ncclInt64: CASE_TY(int64_t, uint64_t)
case ncclUint64: CASE_TY(uint64_t, uint64_t)
case ncclFloat16: CASE_TY(half, uint16_t)
#if HAVE_ncclFloat8
case ncclFloat8e4m3: CASE_TY(__nv_fp8_e4m3, uint8_t)
case ncclFloat8e5m2: CASE_TY(__nv_fp8_e5m2, uint8_t)
#endif
#if HAVE_ncclBfloat16
case ncclBfloat16: CASE_TY(__nv_bfloat16, uint16_t)
#endif
case ncclFloat32: CASE_TY(float, uint32_t)
case ncclFloat64: CASE_TY(double, uint64_t)
default: assert(0);
default: assert(0); return cudaErrorInvalidValue;
}
#undef CASE_TY
}
#endif
////////////////////////////////////////////////////////////////////////////////
#if SELF_TEST
#include <iostream>
namespace {
template<typename T, typename Op>
__device__ void sweep2(int ty, char const *tyname, Op op, char const *opname, int rank_n) {
//if(!std::is_same<T,half>::value) return;
//if(!std::is_same<Op,ReduceProd>::value) return;
//if(rank_n!=3) return;
unsigned tolerance = !IsIntegral<T>::value && std::is_same<Op,ReduceAvg>::value ? calcSumFloatTolerance(rank_n, ty) : 0;
unsigned tolerance = !IsIntegral<T>::value && std::is_same<Op,ReduceAvg>::value ? 2 : 0;
uint64_t seed = 0xc8e2bed69766d533;
for(int ix=threadIdx.x; ix < 10000; ix+=blockDim.x) {
@@ -1202,7 +1395,7 @@ __device__ void sweep1(int ty, char const *tyname) {
}
}
__global__ void sweep() {
__global__ void __launch_bounds__(512, 1) sweep() {
sweep1<int8_t>(ncclInt8, "int8");
sweep1<uint8_t>(ncclUint8, "uint8");
sweep1<int32_t>(ncclInt32, "int32");
@@ -1210,18 +1403,18 @@ __global__ void sweep() {
sweep1<int64_t>(ncclInt64, "int64");
sweep1<uint64_t>(ncclUint64, "uint64");
sweep1<half>(ncclFloat16, "half");
#if HAVE_ncclFloat8
sweep1<__nv_fp8_e4m3>(ncclFloat8e4m3, "float8e4m3");
sweep1<__nv_fp8_e5m2>(ncclFloat8e5m2, "float8e5m2");
#endif
#if HAVE_ncclBfloat16
sweep1<__nv_bfloat16>(ncclBfloat16, "bfloat16");
#endif
sweep1<float>(ncclFloat32, "float");
sweep1<double>(ncclFloat64, "double");
}
int main(int arg_n, char **args) {
std::cerr<<"You are hoping to see no output beyond this line."<<std::endl;
cudaSetDevice(0);
sweep<<<1,512>>>();
cudaDeviceSynchronize();
return 0;
}
#endif
void ncclVerifiableLaunchSelfTest() {
sweep<<<1,512>>>();
}

@@ -34,13 +34,13 @@ __host__ __device__ T ncclVerifiablePremulScalar(int rank_me) {
}
// Enqueue kernel to generate data which is to be reduced.
void ncclVerifiablePrepareInput(
cudaError_t ncclVerifiablePrepareInput(
void *elts, intptr_t elt_n, int elt_ty, int red_op, int rank_n, int rank_me,
uint64_t seed, intptr_t elt_ix0, cudaStream_t stream
);
// Enqueue kernel to generate expected results of reduction.
void ncclVerifiablePrepareExpected(
cudaError_t ncclVerifiablePrepareExpected(
void *elts, intptr_t elt_n, int elt_ty, int red_op, int rank_n,
uint64_t seed, intptr_t elt_ix0, cudaStream_t stream
);
@@ -51,9 +51,14 @@ void ncclVerifiablePrepareExpected(
// which can be costly. Thus if you plan to run the same reduction multiple
// times it is advantageous to precompute the expected values with
// ncclVerifiablePrepareExpected and pass them as `expected` here.
void ncclVerifiableVerify(
cudaError_t ncclVerifiableVerify(
void const *results, void const *expected, intptr_t elt_n, int elt_ty,
int red_op, int rank_n, uint64_t seed, intptr_t elt_ix0,
int64_t *bad_elt_n, cudaStream_t stream
);
#ifdef NCCL_VERIFIABLE_SELF_TEST
void ncclVerifiableLaunchSelfTest();
#endif
#endif

@@ -1,11 +1,18 @@
# We requires both of the following paths to be set upon including this makefile
# We require both of the following paths to be set upon including this makefile
# TEST_VERIFIABLE_SRCDIR = <points to this directory>
# TEST_VERIFIABLE_BUILDDIR = <points to destination of .o file>
# TEST_VERIFIABLE_BUILDDIR = <points to destination of .so file>
TEST_VERIFIABLE_HDRS = $(TEST_VERIFIABLE_SRCDIR)/verifiable.h
TEST_VERIFIABLE_OBJS = $(TEST_VERIFIABLE_BUILDDIR)/verifiable.o
TEST_VERIFIABLE_LIBS = $(TEST_VERIFIABLE_BUILDDIR)/libverifiable.so
$(TEST_VERIFIABLE_BUILDDIR)/verifiable.o: $(TEST_VERIFIABLE_SRCDIR)/verifiable.cu $(TEST_VERIFY_REDUCE_HDRS)
$(TEST_VERIFIABLE_BUILDDIR)/verifiable.o: $(TEST_VERIFIABLE_SRCDIR)/verifiable.cu $(TEST_VERIFIABLE_HDRS)
@printf "Compiling %s\n" $@
@mkdir -p $(TEST_VERIFIABLE_BUILDDIR)
$(NVCC) -o $@ $(NVCUFLAGS) -c $(TEST_VERIFIABLE_SRCDIR)/verifiable.cu
$(NVCC) -Xcompiler "-fPIC" -o $@ $(NVCUFLAGS) -c $(TEST_VERIFIABLE_SRCDIR)/verifiable.cu
$(TEST_VERIFIABLE_BUILDDIR)/libverifiable.so: $(TEST_VERIFIABLE_OBJS)
@printf "Creating DSO %s\n" $@
@mkdir -p $(TEST_VERIFIABLE_BUILDDIR)
$(CC) -shared -o $@.0 $^ -Wl,-soname,$(notdir $@).0
ln -sf $(notdir $@).0 $@