OpenAI Unveils WebRTC Architecture for Low-Latency Voice to 900M Users
Serge Bulaev
OpenAI has introduced a WebRTC architecture intended to deliver low-latency voice to about 900 million weekly users. The design splits the work between a stateless relay at the network edge and a stateful transceiver in regional clusters, which reportedly keeps voice delay under 400 ms for most users. Routing information is encoded in the ICE ufrag, which allows fast connection setup and may help other large deployments facing similar issues. Early results suggest the setup reduces costs and speeds up upgrades, with precautions in place to limit security risks. The architecture points to a trend toward separating simple routing at the edge from more complex processing deeper in the network.

To deliver low-latency voice to its growing user base, OpenAI has developed a WebRTC architecture that targets sub-400 ms latency. The design separates packet routing from protocol state, using a split relay-transceiver model to improve scalability, reduce costs, and speed up deployments for real-time AI conversations.
Two-layer design
The architecture is founded on two distinct layers that divide responsibilities for maximum efficiency:
- Relay Layer: A lightweight service running at the geographic edge. Its primary function is to forward UDP packets, acting as an efficient traffic director.
- Stateful Transceiver: A sophisticated component residing in regional clusters. It terminates the WebRTC connection, handling complex protocol state for ICE, DTLS, and SRTP, and converts media for AI model inference.
This design lets the relay concentrate on forwarding traffic while the transceiver layer handles session state, so each layer can scale independently and be tuned for performance and cost. Engineers found that alternative SFU or TURN models introduced unacceptable latency for speech-to-speech interactions.
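A relay that only forwards packets still has to tell STUN apart from DTLS and SRTP traffic arriving on the same UDP port. The sketch below uses the first-byte ranges from RFC 7983, which is the standard way WebRTC endpoints demultiplex these protocols. The function name `classify_packet` and the idea that the relay performs this check are assumptions made for illustration; the article does not describe the relay's internals at this level.

```python
def classify_packet(datagram: bytes) -> str:
    """Classify a UDP payload by its first byte, using the RFC 7983 ranges."""
    if not datagram:
        return "unknown"
    first = datagram[0]
    if first <= 3:
        return "stun"   # ICE connectivity checks
    if 20 <= first <= 63:
        return "dtls"   # handshake and key exchange, terminated further in
    if 128 <= first <= 191:
        return "rtp"    # RTP or RTCP; with SRTP the payload stays encrypted
    return "unknown"
```

Only the first byte is inspected, so a forwarder built this way never needs keys or session state to decide what kind of packet it is holding.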
Routing through ufrag
A key innovation is the elimination of database lookups for initial packet routing. OpenAI achieves this by encoding routing information directly into the ICE username fragment (ufrag). This data includes a region ID, transceiver pool, and a random salt. Relays instantly parse this information from the first STUN request to select the correct backend, a method compatible with standard browsers. The technique was detailed in an OpenAI post and further explained in an interview with webrtcHacks.
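One way to picture the encoding is the minimal sketch below. The field widths (1-byte region, 2-byte pool, 6-byte salt), the base64 packing, and the names `encode_ufrag`, `decode_ufrag` and `Route` are assumptions made for illustration, not OpenAI's published format. ICE restricts the ufrag to letters, digits, `+` and `/`, and base64's standard alphabet fits that set as long as the payload length is a multiple of three bytes, so no `=` padding appears.

```python
import base64
import os
import struct
from typing import NamedTuple, Optional


class Route(NamedTuple):
    region_id: int
    pool_id: int
    salt: bytes


# Hypothetical layout: 1-byte region, 2-byte pool, 6-byte salt = 9 bytes,
# which base64-encodes to exactly 12 characters with no padding.
_LAYOUT = struct.Struct("!BH6s")


def encode_ufrag(region_id: int, pool_id: int, salt: Optional[bytes] = None) -> str:
    """Pack routing fields into a string usable as an ICE ufrag."""
    salt = os.urandom(6) if salt is None else salt
    if len(salt) != 6:
        raise ValueError("salt must be exactly 6 bytes")
    raw = _LAYOUT.pack(region_id, pool_id, salt)
    return base64.b64encode(raw).decode("ascii")


def decode_ufrag(ufrag: str) -> Route:
    """Recover the routing fields; raises on malformed input."""
    raw = base64.b64decode(ufrag, validate=True)
    return Route(*_LAYOUT.unpack(raw))
```

A signaling server would call `encode_ufrag` when it builds its SDP, and a relay would call `decode_ufrag` on the fragment it sees in the first connectivity check, so no shared database is needed between the two.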
Performance at scale
The system is designed to carry OpenAI's voice traffic for a user base of about 900 million weekly users. A globally distributed fleet of relays gives each user a nearby ingress point, which reduces network latency. Performance is reported to hold up during traffic spikes from product launches and updates.
Security and privacy notes
The design enhances security by minimizing the public attack surface to a fixed UDP port range on the relays. Since these relays do not decrypt media, a compromise would only expose encrypted SRTP packets. To address potential information leaks from embedding metadata in the ufrag, OpenAI obfuscates the data by rotating salts and encrypting region codes.
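The article does not say how the region codes are encrypted, so the following is only a sketch of one possible scheme: the region code is XORed with a mask derived from HMAC-SHA256 over the per-session salt. The key handling, the function names `seal_region` and `unseal_region`, and the choice of HMAC are all assumptions. The sketch hides the value from observers but does not authenticate it.

```python
import hashlib
import hmac


def _mask(key: bytes, salt: bytes, length: int) -> bytes:
    """Derive a per-salt mask; SHA-256 limits this sketch to 32 bytes."""
    if length > hashlib.sha256().digest_size:
        raise ValueError("region code too long for this sketch")
    return hmac.new(key, salt, hashlib.sha256).digest()[:length]


def seal_region(region: bytes, salt: bytes, key: bytes) -> bytes:
    """Hide a region code by XORing it with the salt-derived mask."""
    mask = _mask(key, salt, len(region))
    return bytes(a ^ b for a, b in zip(region, mask))


def unseal_region(sealed: bytes, salt: bytes, key: bytes) -> bytes:
    """XOR is its own inverse, so unsealing reuses the same operation."""
    return seal_region(sealed, salt, key)
```

Because the salt changes per session, the same region produces a different sealed value each time, which is the property that rotating salts is meant to provide.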
Cost and operational impact
This architecture yields significant operational and financial benefits. By running lightweight code at the edge, OpenAI can leverage simple Kubernetes Horizontal Pod Autoscaling (HPA) rules. Moving stateful transceivers to regional compute nodes has helped optimize infrastructure costs. Furthermore, upgrades are streamlined due to the separation of concerns between routing and state management.
Key takeaways for practitioners
- Use the ICE ufrag for routing: Embed routing metadata in the ufrag to eliminate initial database lookups and reduce latency.
- Decouple state: Keep protocol state out of the edge relays so that they only forward UDP traffic, which avoids Kubernetes port exhaustion issues.
- Co-locate stateful processing: Position stateful transceivers close to backend services like GPU-based inference to minimize internal network latency.
OpenAI's model may signal a shift in how real-time media services are built. By placing minimal UDP routing at the edge and concentrating complex protocol logic deeper in the network, the architecture offers a scalable and efficient template for large-scale voice AI.
What problem did OpenAI solve by splitting WebRTC into stateless and stateful layers?
OpenAI addressed the fundamental mismatch between WebRTC's traditional design and modern cloud infrastructure. WebRTC typically expects stable server endpoints, whereas Kubernetes uses ephemeral pods. By splitting the stack into a relay layer (packet routing) and a stateful transceiver (protocol termination), they achieved horizontal scalability without sacrificing session integrity, allowing each layer to scale independently.
How does OpenAI route the first STUN packet without database lookups?
OpenAI encodes routing metadata directly into the ICE username fragment (ufrag) during signaling. This optimization allows the relay to extract destination information from the very first STUN binding request without an external database query. The ufrag acts as a self-describing routing key, enabling immediate packet forwarding and eliminating a critical latency bottleneck.
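To make the first-packet step concrete, here is a minimal sketch of pulling the server-side ufrag out of a STUN binding request. The header layout, magic cookie, and USERNAME attribute type come from the STUN specification, and ICE forms the USERNAME as the receiver's ufrag, a colon, then the sender's ufrag. The function name `server_ufrag_from_stun` is hypothetical, and the sketch skips integrity checks that a real deployment would need.

```python
import struct
from typing import Optional

MAGIC_COOKIE = 0x2112A442   # fixed value in every STUN header
BINDING_REQUEST = 0x0001
ATTR_USERNAME = 0x0006


def server_ufrag_from_stun(datagram: bytes) -> Optional[str]:
    """Return the receiver's ufrag from a binding request, or None."""
    if len(datagram) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack_from("!HHI", datagram, 0)
    if msg_type != BINDING_REQUEST or cookie != MAGIC_COOKIE:
        return None
    end = 20 + msg_len          # header is 20 bytes; msg_len covers attributes
    if end > len(datagram):
        return None             # truncated packet
    offset = 20
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", datagram, offset)
        value_start = offset + 4
        value_end = value_start + attr_len
        if value_end > end:
            return None
        if attr_type == ATTR_USERNAME:
            try:
                username = datagram[value_start:value_end].decode("utf-8")
            except UnicodeDecodeError:
                return None
            # "receiver-ufrag:sender-ufrag"; the relay wants the first part.
            return username.split(":", 1)[0]
        # Attribute values are padded to a 4-byte boundary.
        offset = value_start + (attr_len + 3) // 4 * 4
    return None
```

The returned fragment is the self-describing routing key: once it is decoded, the relay knows which backend should receive the packet without consulting any external store.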
Why did OpenAI reject SFU and TURN architectures for their voice workload?
Both SFU (Selective Forwarding Unit) and TURN (Traversal Using Relays around NAT) were rejected because their overhead is mismatched with 1:1 voice requirements. SFUs add unnecessary complexity for one-to-one flows, while TURN introduces extra round-trips that increase latency, which is unacceptable for conversational AI that demands sub-400 ms response times. OpenAI's model provides TURN-like reliability without the latency penalty.
What technical implementation details make the Global Relay performant?
The relay is built in Go and uses several performance-critical techniques: SO_REUSEPORT for kernel-level load balancing, thread pinning to reduce context switching, and preallocated buffers to minimize garbage collection pressure. This lightweight design handles media traffic without kernel bypass, which suggests that high performance is achievable without exotic networking stacks.
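Two of these techniques can be illustrated briefly. The relay itself is described as a Go program; the sketch below uses Python only to show the socket option and the buffer reuse pattern, and it assumes a Linux host where `SO_REUSEPORT` is available. The names `open_reuseport_socket` and `receive_loop` are hypothetical.

```python
import socket
from typing import Callable, Tuple


def open_reuseport_socket(host: str, port: int) -> socket.socket:
    """Open a UDP socket that can share its port with other workers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # With SO_REUSEPORT set on every socket, the kernel spreads incoming
    # datagrams across all sockets bound to the same address and port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    return sock


def receive_loop(
    sock: socket.socket,
    handle: Callable[[memoryview, Tuple[str, int]], None],
    max_packets: int,
) -> None:
    """Receive datagrams into one preallocated buffer instead of new objects."""
    buf = bytearray(2048)        # allocated once, reused for every packet
    view = memoryview(buf)
    for _ in range(max_packets):
        nbytes, addr = sock.recvfrom_into(buf)
        # The view is only valid until the next receive overwrites the buffer.
        handle(view[:nbytes], addr)
```

Each worker process or thread would open its own socket on the same port and run its own loop, which lets the kernel do the load balancing rather than a user-space dispatcher.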
How does OpenAI ensure low latency for voice AI users?
Low latency is achieved through a combination of mechanisms: geographically distributed ingress points via the relay fleet, routing that directs users to the best available edge location, and zero-lookup first-packet routing via ufrag metadata. Together these maintain the sub-400 ms latency threshold required for natural, real-time AI conversations at scale.