MRC Protocol: How OpenAI's Ethernet Rewrite Scales to 100,000 GPUs
OpenAI, Microsoft, and NVIDIA quietly rebuilt Ethernet for frontier AI training. Here's how the Multipath Reliable Connection protocol actually works.
AnIntent Editorial
Most coverage of MRC frames it as a brand-new networking standard. It isn't. The Multipath Reliable Connection (MRC) protocol is closer to careful surgery on RDMA over Converged Ethernet, the transport that already moves data between GPUs in most AI clusters today. What changed is that OpenAI, Microsoft, NVIDIA, AMD, Broadcom, and Intel decided the patient could no longer survive without it.
The pressure point is scale. A single training run for a frontier model now spans tens of thousands of accelerators, and a network blip measured in milliseconds can stall the entire job. OpenAI says MRC "spreads a single transfer across hundreds of paths and routes around failures in microseconds," and the company developed it with its hardware partners over roughly two years before publishing the specification through the Open Compute Project.
The misconception: MRC is not a replacement for Ethernet
The phrase "new AI networking protocol" makes it sound like a clean break. It isn't one. MRC does not replace Ethernet. It extends a transport standard called Remote Direct Memory Access over Converged Ethernet (RoCEv2) that has been used in data centers for years.
Think about a delivery network in a city you already know. The streets, traffic lights, and addresses (Ethernet, IP, RoCE) are the same. What MRC changes is the dispatcher's logic. Instead of sending a single truck down one chosen route and praying nothing goes wrong, the dispatcher splits the shipment across hundreds of couriers on hundreds of streets, lets each one carry the destination address on the package itself, and reroutes around any blocked intersection in microseconds.
That last detail is what makes MRC different from earlier multipath schemes. It borrows techniques from the Ultra Ethernet Consortium's earlier work on high-performance networking, incorporating a path-control technology called Segment Routing over IPv6 (SRv6). SRv6 lets the sending NIC stamp the entire path into each packet, which removes any dependency on switches reconverging their routing tables when a link dies.
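To make the source-routing idea concrete, here is a minimal Python sketch of the concept, not the real SRv6 header format or any vendor's implementation. The point is simply that the path lives in the packet, so a switch forwards by reading the next segment rather than consulting a routing table that has to reconverge.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Packet:
    payload: bytes
    segments: list[str]   # the full path, stamped by the sending NIC
    seg_index: int = 0    # which hop to visit next

def forward(pkt: Packet) -> str | None:
    """Return the next hop named inside the packet itself.
    No routing-table lookup, so a failure elsewhere never forces
    this switch to wait for routes to reconverge."""
    if pkt.seg_index >= len(pkt.segments):
        return None       # final segment reached: deliver locally
    hop = pkt.segments[pkt.seg_index]
    pkt.seg_index += 1
    return hop

# The sender reroutes simply by stamping a different segment list;
# the switches along the way never need to learn anything new.
pkt = Packet(b"tensor shard", ["leaf-3", "spine-17", "leaf-42"])
while (hop := forward(pkt)) is not None:
    print("forward via", hop)
```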
Why a 100,000-GPU cluster broke the old playbook
The old playbook for AI fabrics had two assumptions. First, that Ethernet could be made "lossless" using a feature called Priority Flow Control (PFC), where a congested switch tells its upstream neighbor to pause traffic. Second, that you could keep stacking switch tiers (three, sometimes four) to wire enough GPUs together. Both assumptions buckle at gigascale.
PFC is the more dangerous of the two. When a single hot spot forms in a fabric carrying 100,000 GPUs, pause frames can cascade backward through the network and freeze it entirely. According to Microsoft's Azure HPC engineering team, MRC disables PFC entirely and runs Ethernet in best-effort mode, deliberately accepting the occasional dropped packet to avoid the global pauses that can deadlock an entire AI fabric.
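A toy simulation makes the failure mode easier to see. This is not how a real switch implements PFC; it is a five-switch chain with tiny buffers and a deliberately slow last hop. But it shows the shape of the problem: with PFC on, the pauses creep upstream from the hot spot until the edge itself is told to stop, while in best-effort mode the damage stays local and costs only a few packets.

```python
BUFFER = 4   # packets each toy switch can queue

def simulate(pfc_enabled: bool, hops: int = 5, steps: int = 30) -> dict:
    queues = [0] * hops          # queue depth at each switch in the chain
    pause_events = [0] * hops    # times each switch was told to stop sending
    delivered = dropped = 0
    for step in range(steps):
        queues[0] += 2                           # traffic keeps arriving at the edge
        for i in range(hops):
            if queues[i] == 0:
                continue
            if i == hops - 1:                    # last hop: a slow, congested egress link
                if step % 3 == 0:
                    queues[i] -= 1
                    delivered += 1
            elif queues[i + 1] < BUFFER:
                queues[i] -= 1                   # normal store-and-forward
                queues[i + 1] += 1
            elif pfc_enabled:
                pause_events[i] += 1             # pause frame: this switch must hold its traffic
            else:
                queues[i] -= 1
                dropped += 1                     # best-effort: the packet dies at the hot spot
    return {"delivered": delivered, "dropped": dropped, "pause_events": pause_events}

print("PFC on :", simulate(True))    # pauses spread to every switch upstream of the hot spot
print("PFC off:", simulate(False))   # a handful of drops instead, and no upstream switch freezes
```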
The second problem is topology. A traditional Clos network needs three or four tiers of switches to fan out across that many endpoints, and each tier adds latency, power draw, optics, and failure domains. The MRC design sidesteps this by treating one fast link as many slow ones.
How one 800 Gb/s NIC becomes eight 100 Gb/s planes
This is the part that surprises most people. As OpenAI's engineering write-up explains, one 800 Gb/s interface can connect to eight different switches, building eight separate, parallel 100 Gb/s planes rather than a single 800 Gb/s network. Each plane is its own independent fabric. If one plane melts down, the other seven keep moving traffic, and MRC's transport layer redistributes flows across the survivors in microseconds.
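A toy sketch of what "redistributes flows across the survivors" means in practice (illustrative only; the real flow-placement logic is not public): flows pinned to a dead plane get remapped onto healthy ones, and everything else stays put.

```python
PLANES = 8   # one 800 Gb/s NIC exposed as eight independent 100 Gb/s planes

def reassign(assignments: dict[int, int], healthy: set[int]) -> dict[int, int]:
    """Move only the flows whose plane died onto surviving planes; leave the rest alone."""
    survivors = sorted(healthy)
    return {flow: plane if plane in healthy else survivors[flow % len(survivors)]
            for flow, plane in assignments.items()}

flows = {flow: flow % PLANES for flow in range(16)}   # initial round-robin placement
healthy = set(range(PLANES)) - {3}                    # plane 3 melts down
moved = {f: p for f, p in reassign(flows, healthy).items() if p != flows[f]}
print(moved)   # only the flows that sat on plane 3 shift; the other seven planes keep running
```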
Microsoft's deployment uses the same trick: its Fairwater implementation splits each NIC into multiple lower-speed ports (8 × 100 Gbps) and builds multiple parallel network planes, and the topology flattens dramatically as a result. The numbers are worth being specific about: OpenAI says the multi-plane design can support roughly 131,000 GPUs using only two switch tiers, replacing the conventional three- or four-tier topologies that had become standard for AI fabrics.
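The 131,000 figure is roughly what falls out of standard leaf-spine arithmetic if you assume radix-512 switches (the Tomahawk 6 port count mentioned later). This is a back-of-the-envelope reading, not a published topology: a two-tier fabric built from radix-R switches can attach about R²/2 endpoints, and each GPU needs only one port per plane because the planes are independent.

```python
# Back-of-the-envelope check on the ~131,000-GPU claim (assumed numbers, not OpenAI's).
radix = 512                        # ports per switch, e.g. 512 x 200 Gb/s
leaves = radix                     # each spine port reaches one leaf switch
hosts_per_leaf = radix // 2        # half the leaf's ports face GPUs, half face spines
print(leaves * hosts_per_leaf)     # 131072 -- two tiers, in line with "roughly 131,000"
```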
Fewer tiers is not just an aesthetic preference. According to Converge Digest's analysis of the design, the flatter topology lowers power consumption, reduces latency, minimizes component count, and improves fault tolerance. For a hyperscaler buying fabric switches by the thousand, dropping a whole tier is a budget line measured in tens of millions of dollars and megawatts.
What actually happens to a packet under MRC
The packet-level behavior is where the protocol earns its name. A single message from one GPU to another no longer travels as one ordered stream down one chosen path. Instead, the sending NIC chops it up, stamps each fragment with an SRv6 path header and a final memory destination, and sprays the fragments across as many parallel routes as the fabric offers. Broadcom's Thor Ultra NIC, one of the first MRC-capable adapters, can distribute traffic across up to 128 paths simultaneously and supports 2-, 4-, and 8-plane architectures at 800 Gbps.
Packets arriving out of order would normally be a disaster for RDMA. MRC sidesteps that by carrying the final memory destination in each packet, as Converge Digest describes, allowing GPUs to place data directly into memory even when packets arrive out of order. The receiving NIC reassembles the message in the GPU's memory rather than in a network buffer.
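Here is a compact sketch of both halves of that behavior, under simplified assumptions (a hypothetical fragment format, not the MRC wire format): the sender chops a message into fragments, stamps each with a path and the memory offset it belongs at, and sprays them round-robin across the available paths; the receiver writes each fragment straight to its offset, so arrival order never matters.

```python
from __future__ import annotations
from dataclasses import dataclass
import random

@dataclass
class Fragment:
    seq: int                  # sequence number, used later for selective acknowledgment
    mem_offset: int           # where this fragment lands in the destination GPU's memory
    path: tuple[str, ...]     # SRv6-style segment list chosen by the sending NIC
    data: bytes

def spray(message: bytes, paths: list[tuple[str, ...]], mtu: int = 4096) -> list[Fragment]:
    """Sender side: one message becomes many fragments, each on its own path."""
    return [Fragment(seq, off, paths[seq % len(paths)], message[off:off + mtu])
            for seq, off in enumerate(range(0, len(message), mtu))]

def place(frags: list[Fragment], total_len: int) -> bytes:
    """Receiver side: write each fragment to its own offset, whatever order it arrives in."""
    gpu_memory = bytearray(total_len)
    for frag in random.sample(frags, len(frags)):    # simulate out-of-order arrival
        gpu_memory[frag.mem_offset:frag.mem_offset + len(frag.data)] = frag.data
    return bytes(gpu_memory)

message = bytes(range(256)) * 64                     # a 16 KB stand-in for a tensor shard
paths = [("leaf-1", f"spine-{i}", "leaf-9") for i in range(8)]
assert place(spray(message, paths), len(message)) == message
```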
When something does go wrong, the recovery is granular. Microsoft describes two mechanisms working together: selective acknowledgments enable rapid retransmission of only the packets that were lost, while packet trimming signals congestion without forcing full packet drops. A trimmed packet keeps its header and discards its payload, so the receiver still learns it existed and can ask for it again, but the switch buffer never overflows.
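A minimal sketch of how those two mechanisms fit together, again with made-up packet structures rather than the real header formats: a trimmed packet keeps its sequence number but loses its payload, and the receiver's acknowledgment names exactly which sequence numbers still need to be resent.

```python
def trim(packet: dict) -> dict:
    """What a congested switch does instead of a silent drop: keep the header, shed the payload."""
    return {**packet, "data": b"", "trimmed": True}

def resend_list(received: dict[int, dict], highest_seq: int) -> set[int]:
    """Selective acknowledgment: only never-seen and trimmed sequence numbers go back for retransmit."""
    missing = {s for s in range(highest_seq + 1) if s not in received}
    trimmed = {s for s, p in received.items() if p.get("trimmed")}
    return missing | trimmed

# Packets 0..4 were sent; 2 was lost outright, 3 was trimmed by a congested switch.
received = {
    0: {"seq": 0, "data": b"a"},
    1: {"seq": 1, "data": b"b"},
    3: trim({"seq": 3, "data": b"d"}),
    4: {"seq": 4, "data": b"e"},
}
print(resend_list(received, highest_seq=4))   # {2, 3}: everything else stays delivered
```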
RoCE vs MRC: what the diff really looks like
The simplest way to see the gap between classic RoCE and MRC is to put the design choices side by side:
- Path selection: RoCE typically pins a flow to a single path chosen by ECMP hashing. MRC sprays each transfer across hundreds of paths using SRv6 source routing (see the sketch after this list).
- Loss handling: RoCE leans on Priority Flow Control to keep the fabric lossless. MRC turns PFC off and recovers losses with selective acknowledgments and packet trimming.
- Failure recovery: RoCE waits for routing protocols to reconverge after a link failure, which can take seconds. MRC's SRv6 source routing eliminates dependency on dynamic routing convergence after failures.
- Topology: RoCE deployments often need three or four switch tiers to wire 100,000+ GPUs. MRC's plane-based design does it in two.
- Ordering: RoCE assumes in-order delivery. MRC tolerates out-of-order arrival because each packet carries its destination address.
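The first item is the one that changes day-to-day behavior the most, so here is a small illustration of the difference (hypothetical addresses, with an ordinary hash standing in for a switch's ECMP function): hashing a flow's 5-tuple pins every packet of that flow to one path, while per-packet spraying spreads the same traffic across all of them.

```python
import hashlib

PATHS = [f"spine-{i}" for i in range(8)]

def ecmp_path(src: str, dst: str, sport: int, dport: int, proto: str = "udp") -> str:
    """Classic ECMP: hash the flow's 5-tuple once; every packet of the flow takes this path."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return PATHS[int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(PATHS)]

# 1,000 packets of one RoCE flow under ECMP: they all land on a single path.
print({ecmp_path("10.0.0.1", "10.0.9.9", 4791, 4791) for _ in range(1000)})

# The same 1,000 packets sprayed per packet: the load covers all eight paths.
print({PATHS[seq % len(PATHS)] for seq in range(1000)})
```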
This is also why MRC reads as a direct counter to InfiniBand's reliability pitch. HyperFrame Research VP Ron Westfall told Data Center Knowledge that MRC offers "the losslessness of InfiniBand with the flexibility of a stateless, global IPv6 standard." The same publication notes that in 2025, Ethernet sales and shipments to AI back-end networks surpassed those of InfiniBand for the first time, per analyst Sameh Boujelbene.
Where MRC is actually running today
MRC is not a paper specification. It has already been deployed in production at OpenAI's largest supercomputers, including its Oracle Cloud Infrastructure site in Abilene, Texas, and Microsoft's Fairwater facility. OpenAI's own post confirms it is running across all of its largest NVIDIA GB200 supercomputers used to train frontier models.
The NVIDIA side of the deployment is more interesting than the press release suggests. According to SiliconANGLE, OpenAI has already used MRC on NVIDIA's Spectrum-X Ethernet platform to train recent frontier large language models powering ChatGPT and Codex, and Microsoft is deploying MRC in some of its largest AI factories built on GB200 systems. NVIDIA SVP Gilad Shainer described the protocol as extending the routing "brain" all the way to the host, which is a good shorthand for what SRv6 source routing actually does.
NVIDIA has been careful not to position MRC as the only path forward. The same SiliconANGLE report notes that Spectrum-X supports both Adaptive RDMA and MRC as two distinct transport options, with Shainer saying that different hyperscalers will tune transport protocols to their own workloads rather than converging on a single winner. That hedge matters: it suggests NVIDIA expects Meta, Google, and AWS to each ship variants of this idea rather than adopt MRC wholesale.
The Open Compute Project angle nobody is discussing
Here is the detail that gets buried in most coverage. Microsoft published the MRC specification and open-sourced the associated libraries through the Open Compute Project. That overlaps awkwardly with the Ultra Ethernet Consortium's 1.0 specification, released in June 2025, which also targets RDMA-style transport for AI and HPC at scale.
The two efforts are not identical. UEC defines a full new transport called Ultra Ethernet Transport (UET) with its own packet format, drawing roughly 75% of its design from HPE's Slingshot interconnect. MRC takes the opposite approach: keep RoCEv2 on the wire, borrow UEC ideas where useful, and ship something that can run on hardware already in production. For hyperscalers facing 18-month build cycles, that pragmatism is the entire point.
The political subtext is that the companies actually running frontier training clusters wanted something deployable on this year's silicon, not a clean-sheet redesign that would take another two years to validate. MRC's publication through OCP rather than UEC says quietly what the marketing won't: the operators decided to fork the standards process and ship.
What this means for everyone not building 100,000-GPU clusters
If you are not training frontier models, MRC will reach you indirectly through the silicon. Broadcom's Tomahawk 5 provides 51.2 Tbps of switching capacity, and Tomahawk 6 scales to 102.4 Tbps with 512 ports at 200 Gbps. Those are the chips that will populate enterprise AI fabrics in 2027 and beyond, and MRC support is now part of the feature checklist alongside RoCEv2.
The more interesting downstream effect is on the InfiniBand vs Ethernet decision that every enterprise AI buyer faces. With production validation at OpenAI and Microsoft, the "InfiniBand is the only safe choice for serious training" argument loses a lot of its weight. That doesn't make InfiniBand obsolete, but it ends the era when Ethernet at gigascale was treated as experimental.
For more on how the underlying hardware economics are shifting, see our coverage of Anthropic's compute deal with SpaceX's Colossus 1 cluster and the broader AI Infrastructure category on AnIntent. The networking layer is finally catching up to the accelerators, and MRC is the clearest signal yet of what the next five years of AI fabrics will look like.
Frequently Asked Questions
Is MRC the same thing as the Ultra Ethernet standard?
No. MRC borrows techniques from the Ultra Ethernet Consortium but is published separately through the Open Compute Project and extends RoCEv2 rather than replacing the transport. Ultra Ethernet's 1.0 specification defines a new transport called UET with its own packet format, while MRC keeps RoCEv2 on the wire and adds SRv6 source routing on top.
What hardware does MRC require?
MRC runs on 800 Gb/s network interfaces that support SRv6 source routing and multipath spraying. Confirmed silicon includes NVIDIA ConnectX SuperNICs paired with Spectrum-X Ethernet switches and Broadcom's Thor Ultra NIC, which can distribute traffic across up to 128 paths and supports 2-, 4-, and 8-plane architectures.
Does MRC work on existing Ethernet networks?
It runs on standard Ethernet wire and IP routing, but it requires NICs and switches that support its multipath spraying, SRv6 path stamping, packet trimming, and selective acknowledgments. It also requires turning Priority Flow Control off, which is a configuration change most existing RoCE deployments would need to make.
How does MRC compare with InfiniBand?
MRC targets the same losslessness goal that made InfiniBand the default for AI training, but achieves it through statistical recovery rather than fabric-wide pause signals. HyperFrame Research analyst Ron Westfall described it as combining InfiniBand-style performance with the flexibility of a stateless, global IPv6 standard, and Ethernet shipments to AI back-end networks surpassed InfiniBand in 2025.
Has MRC been used to train real models yet?
OpenAI has used MRC on NVIDIA Spectrum-X to train recent frontier large language models that power ChatGPT and Codex. The protocol is deployed across OpenAI's largest GB200 supercomputers, including the Oracle Cloud Infrastructure site in Abilene, Texas, and Microsoft's Fairwater supercomputers.