DGX Spark ↔ EdgeXpert NCCL only ~17 GB/s over 200GbE

I’m seeing low NCCL bandwidth between two GB10 systems over a direct 200GbE RoCE link:
  • NVIDIA DGX Spark
  • MSI EdgeXpert / MS-C931
Both sides look healthy at the link level.

ibdev2netdev maps the active interface correctly, ibstat rocep1s0f1 shows State: Active, Physical state: LinkUp, Rate: 200, and ethtool shows Speed: 200000Mb/s with Link detected: yes on both hosts. I’m binding NCCL/UCX to enp1s0f1np1.

However, nccl-tests all_gather_perf is very consistent at only about 16.7–17.8 GB/s busbw in both launch directions, including across an 8 MiB to 1 GiB sweep. Both systems are running 6.17.0-1014-nvidia.
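
For scale, here is a rough sketch of how that figure compares with the raw line rate. It assumes decimal units (1 GB = 10^9 bytes, which is how nccl-tests reports bandwidth) and ignores RoCE/Ethernet protocol overhead, so the practical ceiling sits somewhat below the 25 GB/s computed here. The helper names are illustrative only.

```python
LINK_GBIT_PER_S = 200  # link rate reported by ibstat/ethtool in this post


def line_rate_gbytes(gbit_per_s: float) -> float:
    """Raw line rate in GB/s (decimal), with no protocol overhead removed."""
    return gbit_per_s / 8.0


def fraction_of_line_rate(busbw_gbytes: float,
                          gbit_per_s: float = LINK_GBIT_PER_S) -> float:
    """Measured bus bandwidth as a fraction of the raw line rate."""
    return busbw_gbytes / line_rate_gbytes(gbit_per_s)


# The two ends of the busbw range quoted above.
for busbw in (16.7, 17.8):
    share = fraction_of_line_rate(busbw)
    print(f"{busbw} GB/s is {share:.0%} of {line_rate_gbytes(LINK_GBIT_PER_S):.0f} GB/s")
```

By this estimate the observed range is roughly two thirds of the raw line rate.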

I posted this on the NVIDIA DGX forums and was told that the FE has the fix, but the EdgeXpert does not.

Looking for any assistance here.
 
Thank you for sharing the detailed information.

Recently, NVIDIA has identified an NCCL bandwidth degradation issue on DGX OS GA2.0.
A corresponding UEFI update has been released to address this behavior, and validation is currently in progress.

Once the official version has completed validation, it will be released in the near future.
We will keep you updated accordingly.
 
Any updates on the timeline for this firmware? Trying to decide whether I should keep the MSI boxes I have or return and get another brand. Thanks!
 
The firmware is currently under validation, and we expect to complete testing and officially release it via LVFS within this week.
Once available, we will post an update on the forum immediately. We appreciate your patience in the meantime.
 
I’ve already tested this, and it works. You should be gtg as far as I know.

Follow the NCCL tutorial and try it with the 16G settings.

Link to similar thread 🧵 via nvidia:

https://forums.developer.nvidia.com/t/dgx-spark-edgexpert-nccl-only-17-gb-s-over-200gbe/366055
I tried updating both of my MSI EdgeXpert machines to the testing firmware as suggested in that link, but I'm still capping out at 17 GB/s. Which firmware versions do you have right now that are working?

Here's my testing firmware update. (Despite the prompts saying "dgx-spark-1", these are both MSI EdgeXpert machines.)

Code:
nvidia@dgx-spark-1:~:0$ sudo fwupdmgr enable-remote lvfs-testing
╔══════════════════════════════════════════════════════════════════════════════╗
║ Enable new remote?                                                           ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ The LVFS is a free service that operates as an independent legal entity and  ║
║ has no connection with Ubuntu. Your distributor may not have verified any    ║
║ of the firmware updates for compatibility with your system or connected      ║
║ devices. All firmware is provided only by the original equipment             ║
║ manufacturer.                                                                ║
║                                                                              ║
║ This remote contains firmware which is not embargoed, but is still being     ║
║ tested by the hardware vendor. You should ensure you have a way to manually  ║
║ downgrade the firmware if the firmware update fails.                         ║
║                                                                              ║
║ Enabling this functionality is done at your own risk, which means you have   ║
║ to contact your original equipment manufacturer regarding any problems       ║
║ caused by these updates. Only problems with the update process itself        ║
║ should be filed at https://bugs.launchpad.net/ubuntu/.                       ║
╚══════════════════════════════════════════════════════════════════════════════╝
Agree and enable the remote? [Y|n]:
Authenticating…          [               /                       ]
Do you want to refresh this remote now? (Requires internet connection) [Y|n]:
Downloading…             [**********************************     ]
Successfully enabled and refreshed remote
nvidia@dgx-spark-1:~:0$ sudo fwupdmgr refresh --force
Updating lvfs-testing
Downloading…             [**********************************     ]
Updating lvfs
Downloading…             [******************************         ]
Successfully downloaded new metadata: 4 local devices supported
nvidia@dgx-spark-1:~:0$ sudo fwupdmgr update
Devices with no available firmware updates:
 • MZALC4T0HBL1-00B07
 • UEFI dbx
╔══════════════════════════════════════════════════════════════════════════════╗
║ Upgrade Embedded Controller from 10500 to 10600?                             ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ This stable release fixes the following issues:                              ║
║                                                                              ║
║ • This update improves the performance and stability of the Device.          ║
║                                                                              ║
║ MS-C931 must remain plugged into a power source for the duration of the      ║
║ update to avoid damage.                                                      ║
╚══════════════════════════════════════════════════════════════════════════════╝
Perform operation? [Y|n]:
Waiting…                 [***************************************]
Successfully installed firmware
Do not turn off your computer or remove the AC adapter while the update is in progress.
Devices with the latest available firmware version:
 • TPM
╔══════════════════════════════════════════════════════════════════════════════╗
║ Upgrade UEFI Device Firmware from 10600 to 10700?                            ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ This stable release fixes the following issues:                              ║
║                                                                              ║
║ • This update improves the performance and stability of the Device.          ║
║                                                                              ║
║ MS-C931 must remain plugged into a power source for the duration of the      ║
║ update to avoid damage.                                                      ║
╚══════════════════════════════════════════════════════════════════════════════╝
Perform operation? [Y|n]:
Updating UEFI Device Firmware…                                   ] Less than one minute remaining…
Waiting…                 [***************************************]
Successfully installed firmware
Do not turn off your computer or remove the AC adapter while the update is in progress.
Do not turn off your computer or remove the AC adapter while the update is in progress.
╔══════════════════════════════════════════════════════════════════════════════╗
║ Upgrade UEFI Device Firmware from 502 to 507?                                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ This stable release fixes the following issues:                              ║
║                                                                              ║
║ • This update improves the performance and stability of the Device.          ║
║                                                                              ║
║ MS-C931 must remain plugged into a power source for the duration of the      ║
║ update to avoid damage.                                                      ║
╚══════════════════════════════════════════════════════════════════════════════╝
Perform operation? [Y|n]:
Updating UEFI Device Firmware…                                   ]
Waiting…                 [***************************************]
Successfully installed firmware
Do not turn off your computer or remove the AC adapter while the update is in progress.
Do not turn off your computer or remove the AC adapter while the update is in progress.
Do not turn off your computer or remove the AC adapter while the update is in progress.
An update requires a reboot to complete. Restart now? [y|N]: y

Broadcast message from root@dgx-spark-1 on pts/3 (Sun 2026-04-19 14:07:44 PDT):

The system will reboot now!

Here's the firmware update history:

Code:
$ sudo fwupdmgr get-history
MSI MS-C931
│
├─TPM:
│ │   Device ID:          6f0c4a0cc2cd035c45fece35226e4aaafc449cd5
│ │   Previous version:   117572608
│ │   Update State:       Success
│ │   Last modified:      2026-04-17 20:49
│ │   GUID:               0eb9bda9-3010-493a-a6a8-b5e80eddf870
│ │   Device Flags:       • Internal device
│ │                       • Updatable
│ │                       • System requires external power source
│ │                       • Supported on remote server
│ │                       • Needs a reboot after installation
│ │                       • Reported to remote server
│ │                       • Device is usable for the duration of the update
│ │                       • Signed Payload
│ │
│ └─(null) Update:
│       New version:      117572609
│       Remote ID:        lvfs
│       Description:
│       The vendor did not supply any release notes.
│
├─Embedded Controller:
│ │   Device ID:          5d6993a98ded44b273f7a27e71ecba7eb4a40d36
│ │   Previous version:   10500
│ │   Update State:       Success
│ │   Last modified:      2026-04-19 21:07
│ │   GUID:               e6c43bc0-723f-4d54-8ba6-448a96fc27db
│ │   Device Flags:       • Internal device
│ │                       • Updatable
│ │                       • System requires external power source
│ │                       • Supported on remote server
│ │                       • Needs a reboot after installation
│ │                       • Reported to remote server
│ │                       • Device is usable for the duration of the update
│ │                       • Signed Payload
│ │
│ └─MS_C931 Embedded Controller Update:
│       New version:      10600
│       Remote ID:        lvfs-testing
│       Release ID:       139355
│       Summary:          MS_C931 Embedded Controller Firmware Update
│       License:          Proprietary
│       Size:             519.0 kB
│       Created:          2026-04-08
│       Urgency:          High
│       Vendor:           MSI
│       Duration:         30 seconds
│       Release Flags:    • Trusted metadata
│       Description:
│       This stable release fixes the following issues:
│
│       • This update improves the performance and stability of the Device.
│       Checksum:         2fff83af9133e76a0a377f64d1590c5de6e1977052a3ce776ed4813da2339029
│
├─UEFI Device Firmware:
│ │   Device ID:          83147e0ea0fb6bd696a500d8ffb108132676271d
│ │   Previous version:   10600
│ │   Update State:       Success
│ │   Last modified:      2026-04-19 21:07
│ │   GUID:               632db1a2-f5c9-40ec-b700-ebcff320e2a2
│ │   Device Flags:       • Internal device
│ │                       • Updatable
│ │                       • System requires external power source
│ │                       • Supported on remote server
│ │                       • Needs a reboot after installation
│ │                       • Reported to remote server
│ │                       • Device is usable for the duration of the update
│ │                       • Signed Payload
│ │
│ └─MS_C931 SoC FW System Update:
│       New version:      10700
│       Remote ID:        lvfs-testing
│       Release ID:       139354
│       Summary:          MS_C931 SoC Firmware Update
│       License:          Proprietary
│       Size:             30.4 MB
│       Created:          2026-04-08
│       Urgency:          High
│       Vendor:           MSI
│       Duration:         30 seconds
│       Release Flags:    • Trusted metadata
│       Description:
│       This stable release fixes the following issues:
│
│       • This update improves the performance and stability of the Device.
│       Checksum:         a91f08180321200fb29147157779b98e9031eb94dfb3dab9f2e8c7c6c9bf8447
│
└─UEFI Device Firmware:
  │   Device ID:          c1e32194292eae35c64314ee9b1d9690d6142c76
  │   Previous version:   502
  │   Update State:       Success
  │   Last modified:      2026-04-19 21:07
  │   GUID:               fff25056-e175-45e2-bb1b-23de59689cff
  │   Device Flags:       • Internal device
  │                       • Updatable
  │                       • System requires external power source
  │                       • Supported on remote server
  │                       • Needs a reboot after installation
  │                       • Reported to remote server
  │                       • Device is usable for the duration of the update
  │                       • Signed Payload
  │
  └─MS_C931 USB-C PD FW Controller Update:
        New version:      507
        Remote ID:        lvfs-testing
        Release ID:       136521
        Summary:          MS_C931 USB-C PD FW Firmware Update
        License:          Proprietary
        Size:             1.1 MB
        Created:          2026-02-11
        Urgency:          High
        Tested by MSI:
          Tested:         2026-02-11
          Distribution:   ubuntu 24.04
          Old version:    502
          Version[fwupd]: 1.9.33
        Vendor:           MSI
        Duration:         30 seconds
        Release Flags:    • Trusted metadata
                          • Tested by trusted vendor
        Description:
        This stable release fixes the following issues:

        • This update improves the performance and stability of the Device.
        Checksum:         2656e4e95dd9e258b395f2d8bfb20fb1d8a08b1bdf141103e3258d7544d9b601
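
To see at a glance which of these updates came from lvfs-testing rather than the stable lvfs remote, a small parser over the get-history text can help. This is only a sketch: it assumes the "Previous version", "New version" and "Remote ID" field labels shown above, and the function name and the shortened sample are made up for illustration.

```python
TREE_CHARS = " │├└─"  # indentation and box-drawing characters in the tree output


def updates_by_remote(history_text):
    """Collect previous version, new version and remote for each update."""
    records, current = [], {}
    for raw in history_text.splitlines():
        key, sep, value = raw.strip(TREE_CHARS).partition(":")
        if not sep:
            continue  # line has no "key: value" shape
        value = value.strip()
        if key == "Previous version":
            current = {"previous": value}
        elif key == "New version":
            current["new"] = value
        elif key == "Remote ID":
            current["remote"] = value
            records.append(current)
            current = {}
    return records


# Shortened excerpt of the history above (TPM and Embedded Controller only).
SAMPLE = """\
├─TPM:
│ │   Previous version:   117572608
│ │   Update State:       Success
│ └─(null) Update:
│       New version:      117572609
│       Remote ID:        lvfs
├─Embedded Controller:
│ │   Previous version:   10500
│ └─MS_C931 Embedded Controller Update:
│       New version:      10600
│       Remote ID:        lvfs-testing
"""

for rec in updates_by_remote(SAMPLE):
    print(f'{rec.get("previous", "?")} -> {rec.get("new", "?")} via {rec["remote"]}')
```

Applied to the full history, this would list the Embedded Controller, SoC and USB-C PD updates as coming from lvfs-testing and the TPM update as coming from lvfs.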

And after that was done, here's my all_gather_perf:

Code:
$ mpirun -np 2 -H 192.168.200.12:1,192.168.200.13:1   --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"   -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH   $HOME/nccl-tests/build/all_gather_perf
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Warning: Permanently added '192.168.200.13' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

# nccl-tests version 2.18.3 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
#  Rank  0 Group  0 Pid   7487 on dgx-spark-1 device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid   7842 on dgx-spark-2 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    33554432       4194304     float    none      -1   991.32   33.85   16.92       0   921.66   36.41   18.20       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 17.5637
#
# Collective test concluded: all_gather_perf
#
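
As a sanity check on the result row, the algbw and busbw columns can be recomputed from the size and time columns. The sketch below uses the all_gather convention from the nccl-tests performance notes, busbw = algbw × (n − 1) / n, where the reported size is the total across ranks; the function name is illustrative.

```python
def all_gather_bw(size_bytes: int, time_us: float, n_ranks: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for an all_gather of total size_bytes."""
    algbw = size_bytes / 1e9 / (time_us * 1e-6)
    busbw = algbw * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Size and times taken from the result row above; two ranks, one per machine.
for label, time_us in (("out-of-place", 991.32), ("in-place", 921.66)):
    algbw, busbw = all_gather_bw(33554432, time_us, 2)
    print(f"{label}: algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s")
```

With two ranks the factor is 1/2, so the roughly 34 to 36 GB/s algbw corresponds to about 17 to 18 GB/s actually crossing the link in each direction.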

I'm also still seeing the power errors in dmesg after rebooting:

Code:
[    2.704522] mlx5_core 0000:01:00.0: mlx5_pcie_event:326:(pid 12): Detected insufficient power on the PCIe slot (27W).
[    3.270838] mlx5_core 0000:01:00.1: mlx5_pcie_event:326:(pid 12): Detected insufficient power on the PCIe slot (27W).
[    3.864445] mlx5_core 0002:01:00.0: mlx5_pcie_event:326:(pid 165): Detected insufficient power on the PCIe slot (27W).
[    4.432064] mlx5_core 0002:01:00.1: mlx5_pcie_event:326:(pid 12): Detected insufficient power on the PCIe slot (27W).
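
For anyone comparing before and after a firmware change, a small filter can pull these warnings out of saved dmesg text. This is a sketch that assumes the exact message format shown above; the function name is made up, and the sample holds just two of the four lines.

```python
import re

# Matches the mlx5_core warning and captures the PCI address and reported watts.
PATTERN = re.compile(
    r"mlx5_core (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): .*"
    r"Detected insufficient power on the PCIe slot \((?P<watts>\d+)W\)"
)


def insufficient_power_events(dmesg_text: str) -> list[tuple[str, int]]:
    """Return (pci_address, watts) for each insufficient-power warning."""
    events = []
    for line in dmesg_text.splitlines():
        match = PATTERN.search(line)
        if match:
            events.append((match.group("bdf"), int(match.group("watts"))))
    return events


# Two of the lines from the dmesg output above.
SAMPLE = """\
[    2.704522] mlx5_core 0000:01:00.0: mlx5_pcie_event:326:(pid 12): Detected insufficient power on the PCIe slot (27W).
[    3.864445] mlx5_core 0002:01:00.0: mlx5_pcie_event:326:(pid 165): Detected insufficient power on the PCIe slot (27W).
"""

for bdf, watts in insufficient_power_events(SAMPLE):
    print(f"{bdf}: {watts} W")
```

An empty result after a reboot would mean the warning is no longer being logged for any mlx5 function.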
 
The firmware has been validated and uploaded to LVFS.
It is expected to be available for download within 24 hours.
Please kindly note.
 
Hello.
In our case, we were able to restore 200Gbps performance after *downgrading* the firmware version. We ran "fwupdmgr downgrade" and selected the relevant device ID.
@Athrun, which commands should we run in order to get the latest firmware update? Through the DGX dashboard (GUI) or through fwupdmgr?
Additionally, is it dangerous to download updates from lvfs-testing? This is why we resorted to downgrading in the first place...
Anyway, like @ambrosemcduff158d02e1, here is a link to my discussion on the NVIDIA forum: https://forums.developer.nvidia.com...i-firmware-update-solved-via-downgrade/368025
 
Thanks for sharing your findings — it’s helpful to know that downgrading restored the 200Gbps performance.

Regarding your questions:
1. Firmware update method
Per product design, the official and recommended way to perform firmware updates is through the DGX Dashboard.
This ensures that updates are applied with proper platform validation and compatibility checks. While fwupdmgr can be used for manual operations, it is not the primary method for production updates.

2. Latest firmware and whether to upgrade
The NCCL-related performance issue has been verified, and a corresponding fix has already been uploaded to LVFS.
We recommend updating to the latest version via the DGX Dashboard, which will provide the validated and production-ready firmware.

3. lvfs-testing usage
We recommend using official firmware releases only.
The lvfs-testing remote is intended for validation and pre-release purposes and may include builds that have not completed full verification.
Therefore, it is not recommended for production environments.

Please let us know if you encounter any issues during the update process.
 