ambrosemcduff158d02e1
I’m seeing low NCCL bandwidth between two GB10 systems over a direct 200GbE RoCE link:
- NVIDIA DGX Spark
- MSI EdgeXpert / MS-C931
ibdev2netdev maps the active interface correctly, ibstat rocep1s0f1 shows State: Active, Physical state: LinkUp, Rate: 200, and ethtool shows Speed: 200000Mb/s with Link detected: yes on both hosts. I’m binding NCCL/UCX to enp1s0f1np1.
However, nccl-tests all_gather_perf consistently reports only about 16.7–17.8 GB/s busbw, regardless of which host launches the job, across a message-size sweep from 8 MiB to 1 GiB. Both systems are running kernel 6.17.0-1014-nvidia.
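To put those numbers in context, here is a rough sketch of how far the measured busbw is from the nominal link rate. It assumes decimal units (1 GB = 10^9 bytes) and ignores Ethernet/RoCE protocol overhead, so the real achievable ceiling is somewhat below the figure it computes. The function names are just illustrative.

```python
# Rough comparison of measured busbw against the nominal link rate.
# Assumptions: decimal units (1 GB = 1e9 bytes), no protocol overhead.

LINK_GBIT_PER_S = 200.0                  # rate reported by ibstat / ethtool
MEASURED_BUSBW_GB_PER_S = (16.7, 17.8)   # range seen in all_gather_perf


def line_rate_gbytes(link_gbit_per_s: float) -> float:
    """Convert a nominal link rate in Gbit/s to GB/s (8 bits per byte)."""
    return link_gbit_per_s / 8.0


def fraction_of_line_rate(busbw_gb_per_s: float, link_gbit_per_s: float) -> float:
    """Return measured bus bandwidth as a fraction of the nominal line rate."""
    return busbw_gb_per_s / line_rate_gbytes(link_gbit_per_s)


if __name__ == "__main__":
    peak = line_rate_gbytes(LINK_GBIT_PER_S)
    print(f"nominal line rate: {peak:.1f} GB/s")
    for busbw in MEASURED_BUSBW_GB_PER_S:
        frac = fraction_of_line_rate(busbw, LINK_GBIT_PER_S)
        print(f"{busbw:.1f} GB/s busbw is {frac:.1%} of nominal")
```

By this estimate the runs sit at roughly 67% to 71% of the nominal 25 GB/s, and that ratio stays flat across the whole size sweep.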
I posted this on the NVIDIA DGX forums and was told that the FE has the fix, but the EdgeXpert does not.
Looking for any assistance here.