OpenAI unveils new WebRTC architecture for 900M voice users
Serge Bulaev
OpenAI has introduced a new WebRTC system for low-latency voice at very large scale, although no official source confirms the widely cited figure of 900 million weekly users. The design uses a stateless relay at the network edge and a stateful transceiver deeper in the cluster to route packets quickly. Engineers say this setup avoids old scaling problems and keeps voice delay low: total voice-to-voice latency is reported at approximately 800 ms, of which roughly 300 ms comes from audio processing and endpointing. The architecture may point to a trend toward stateless, metadata-based routing in the industry.

OpenAI unveiled a new WebRTC architecture for low-latency voice AI to support its large user base, but no official source confirms a specific target of 900 million weekly users. The system addresses earlier scalability problems by splitting WebRTC functions into a stateless edge relay and a stateful core transceiver. Reported total voice-to-voice latency is approximately 800 ms, of which the audio processing/endpointing component contributes about 300 ms.
The system's design resolves the conflict between WebRTC's demand for many UDP ports and the networking limits of Kubernetes. At the edge sits a stateless UDP relay, implemented in Go with performance optimizations such as SO_REUSEPORT and thread pinning, which handles initial packet routing. Deeper in the cluster, a stateful transceiver service manages the full session lifecycle, including DTLS and SRTP. According to one analysis, this approach keeps packet processing fast (daily.dev analysis).
Global ingress and routing logic
OpenAI's architecture separates stateless packet forwarding at the network edge from stateful session management deeper in the cluster. This allows a lightweight relay to instantly route traffic to the correct session handler without database lookups, solving major scaling bottlenecks and significantly reducing connection latency for users.
Cloudflare directs user signaling traffic to the nearest geographic region, while media packets are handled by the Global Relay fleet. The system achieves zero-lookup routing by encoding the destination cluster and transceiver ID directly into the ICE username fragment (ufrag). The OpenAI engineering team states this technique eliminates a 40-60 ms round trip by allowing the relay to forward the first packet instantly (OpenAI blog). To ensure reliability, relays use a small in-memory flow table and Redis for soft state recovery.
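To make the zero-lookup idea concrete, here is a minimal Go sketch of how routing metadata could be packed into a ufrag and recovered from a STUN username. The field layout (a 16-bit cluster ID, a 32-bit transceiver ID, and a 32-bit nonce), the Route type, and both function names are assumptions for illustration; the source does not describe OpenAI's actual encoding. The sketch relies on two facts from the ICE specification: a ufrag may contain only letters, digits, "+" and "/", which unpadded standard base64 satisfies, and the username in a connectivity check has the form "receiverUfrag:senderUfrag".

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
)

// Route is hypothetical routing metadata; the field widths are assumptions.
type Route struct {
	Cluster     uint16
	Transceiver uint32
}

// rawLen is 2 bytes of cluster ID, 4 bytes of transceiver ID, 4 bytes of nonce.
const rawLen = 10

// EncodeUfrag packs the route and a per-session nonce into an ICE-safe string.
// Unpadded standard base64 emits only letters, digits, '+' and '/'.
func EncodeUfrag(r Route, nonce uint32) string {
	var raw [rawLen]byte
	binary.BigEndian.PutUint16(raw[0:2], r.Cluster)
	binary.BigEndian.PutUint32(raw[2:6], r.Transceiver)
	binary.BigEndian.PutUint32(raw[6:10], nonce)
	return base64.RawStdEncoding.EncodeToString(raw[:])
}

// RouteFromUsername recovers the route from a STUN USERNAME value of the
// form "serverUfrag:clientUfrag" without consulting any external store.
func RouteFromUsername(username string) (Route, error) {
	serverUfrag, _, ok := strings.Cut(username, ":")
	if !ok {
		return Route{}, errors.New("username is not in ufrag:ufrag form")
	}
	raw, err := base64.RawStdEncoding.DecodeString(serverUfrag)
	if err != nil {
		return Route{}, fmt.Errorf("decode ufrag: %w", err)
	}
	if len(raw) != rawLen {
		return Route{}, errors.New("unexpected ufrag length")
	}
	return Route{
		Cluster:     binary.BigEndian.Uint16(raw[0:2]),
		Transceiver: binary.BigEndian.Uint32(raw[2:6]),
	}, nil
}

func main() {
	ufrag := EncodeUfrag(Route{Cluster: 7, Transceiver: 4242}, 0xCAFEBABE)
	route, err := RouteFromUsername(ufrag + ":clientFrag")
	fmt.Println(ufrag, route, err)
}
```

A production encoding would presumably also need versioning and some protection against forged ufrags, which this sketch leaves out.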
Reported performance at scale
OpenAI reports that the architecture delivers low-latency voice AI at scale, but no official report confirms that it supported more than 900 million weekly active voice users by mid-2026. Key reported performance metrics include:
- Total voice-to-voice latency of approximately 800 ms, of which the audio processing/endpointing component contributes about 300 ms.
- Relay capacity that scales linearly as new workers are added.
Significantly, the design omits traditional TURN or SFU components. Since all voice sessions are client-to-server, these elements would only introduce unnecessary hops and increase latency.
This design signals a potential industry shift toward stateless edge forwarding and metadata-based routing for real-time communication. OpenAI's WebRTC architecture is an advanced design for low-latency voice AI, but any claim of unmatched scale rests on the figure of 900 million weekly users, which no official source confirms.
Why did OpenAI split its WebRTC stack into a stateless relay and a stateful transceiver?
The old model needed one UDP port per session, clashing with Kubernetes pods that come and go.
The new design separates packet routing from protocol termination: a stateless relay at the edge forwards packets, starting with the first STUN packet, while the stateful transceiver behind it keeps ICE/DTLS/SRTP state.
This keeps the public footprint tiny, avoids port exhaustion, and lets the relay restart without losing flows.
How can the relay route the very first packet without looking anything up?
OpenAI encodes cluster and transceiver metadata inside the ICE username fragment (ufrag) during signaling.
When the initial STUN binding request arrives, the relay parses the ufrag, learns the destination, and forwards the packet - no database hit, no extra RTT.
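As a rough illustration of the parsing step, the Go sketch below pulls the USERNAME attribute out of a STUN binding request and takes the fragment before the colon, which is the server-side ufrag that would carry the routing metadata. The function names StunUsername and buildBindingRequest are hypothetical, and the sketch assumes a well-formed request; it does not verify MESSAGE-INTEGRITY or FINGERPRINT, which a real relay or transceiver would have to handle somewhere. The constants (binding request type 0x0001, magic cookie 0x2112A442, USERNAME attribute type 0x0006, 4-byte attribute padding) come from the STUN specification.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
)

const (
	stunHeaderLen  = 20
	bindingRequest = 0x0001
	magicCookie    = 0x2112A442
	attrUsername   = 0x0006
)

// StunUsername returns the USERNAME attribute of a STUN binding request.
func StunUsername(pkt []byte) (string, error) {
	if len(pkt) < stunHeaderLen {
		return "", errors.New("packet shorter than STUN header")
	}
	if binary.BigEndian.Uint16(pkt[0:2]) != bindingRequest {
		return "", errors.New("not a binding request")
	}
	if binary.BigEndian.Uint32(pkt[4:8]) != magicCookie {
		return "", errors.New("bad magic cookie")
	}
	msgLen := int(binary.BigEndian.Uint16(pkt[2:4]))
	if stunHeaderLen+msgLen > len(pkt) {
		return "", errors.New("truncated message")
	}
	attrs := pkt[stunHeaderLen : stunHeaderLen+msgLen]
	for len(attrs) >= 4 {
		typ := binary.BigEndian.Uint16(attrs[0:2])
		n := int(binary.BigEndian.Uint16(attrs[2:4]))
		if 4+n > len(attrs) {
			return "", errors.New("malformed attribute")
		}
		if typ == attrUsername {
			return string(attrs[4 : 4+n]), nil
		}
		padded := (n + 3) &^ 3 // attribute values are padded to 4 bytes
		if 4+padded > len(attrs) {
			break
		}
		attrs = attrs[4+padded:]
	}
	return "", errors.New("no USERNAME attribute")
}

// buildBindingRequest is a test helper that builds a minimal request with a
// zero transaction ID; real clients use a random one.
func buildBindingRequest(username string) []byte {
	padded := (len(username) + 3) &^ 3
	pkt := make([]byte, stunHeaderLen+4+padded)
	binary.BigEndian.PutUint16(pkt[0:2], bindingRequest)
	binary.BigEndian.PutUint16(pkt[2:4], uint16(4+padded))
	binary.BigEndian.PutUint32(pkt[4:8], magicCookie)
	binary.BigEndian.PutUint16(pkt[20:22], attrUsername)
	binary.BigEndian.PutUint16(pkt[22:24], uint16(len(username)))
	copy(pkt[24:], username)
	return pkt
}

func main() {
	name, err := StunUsername(buildBindingRequest("srvFrag:cliFrag"))
	serverUfrag, _, _ := strings.Cut(name, ":")
	fmt.Println(serverUfrag, err)
}
```

Everything the relay needs is in the packet itself, which is why the forwarding decision requires no database hit.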
Why did OpenAI reject SFU or TURN for this traffic?
An SFU adds mux/demux overhead and TURN forces an extra hop; both raise latency for a strictly 1:1 client-server workload.
The relay approach keeps the path short while still giving global ingress.
What performance tricks let a Go userspace relay handle significant scale?
- SO_REUSEPORT so that multiple sockets share one UDP port, plus runtime.LockOSThread to keep each receive loop on a fixed OS thread that can be pinned to a core
- Pre-allocated packet buffers to dodge GC churn
- An in-memory flow table of only a few KB per active socket, with Redis-backed recovery on restart
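Two of the tricks above, pre-allocated buffers and a small in-memory flow table, can be sketched in portable Go. This is an illustrative sketch only: the BufferPool and FlowTable types, their methods, and the addr helper are hypothetical names, and the source does not describe OpenAI's actual data structures. The SO_REUSEPORT socket option, CPU affinity, and Redis-backed recovery are platform- or deployment-specific and are left out here.

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
)

// BufferPool hands out fixed-size packet buffers allocated up front, so the
// hot path does not create garbage for every packet.
type BufferPool struct {
	free chan []byte
	size int
}

func NewBufferPool(count, size int) *BufferPool {
	p := &BufferPool{free: make(chan []byte, count), size: size}
	for i := 0; i < count; i++ {
		p.free <- make([]byte, size)
	}
	return p
}

// Get returns a pooled buffer, falling back to allocation if the pool is empty.
func (p *BufferPool) Get() []byte {
	select {
	case b := <-p.free:
		return b
	default:
		return make([]byte, p.size)
	}
}

// Put returns a buffer to the pool; surplus buffers are left to the GC.
func (p *BufferPool) Put(b []byte) {
	if cap(b) < p.size {
		return
	}
	select {
	case p.free <- b[:p.size]:
	default:
	}
}

// FlowTable maps a client address to the transceiver address learned from
// the first packet, so later packets skip ufrag parsing.
type FlowTable struct {
	mu    sync.RWMutex
	flows map[netip.AddrPort]netip.AddrPort
}

func NewFlowTable() *FlowTable {
	return &FlowTable{flows: make(map[netip.AddrPort]netip.AddrPort)}
}

func (t *FlowTable) Learn(client, transceiver netip.AddrPort) {
	t.mu.Lock()
	t.flows[client] = transceiver
	t.mu.Unlock()
}

func (t *FlowTable) Lookup(client netip.AddrPort) (netip.AddrPort, bool) {
	t.mu.RLock()
	dst, ok := t.flows[client]
	t.mu.RUnlock()
	return dst, ok
}

// addr is a small helper that parses "ip:port" and panics on bad input.
func addr(s string) netip.AddrPort { return netip.MustParseAddrPort(s) }

func main() {
	pool := NewBufferPool(1024, 1500)
	table := NewFlowTable()
	table.Learn(addr("203.0.113.10:50000"), addr("10.0.0.5:4000"))

	buf := pool.Get()
	dst, ok := table.Lookup(addr("203.0.113.10:50000"))
	fmt.Println(len(buf), dst, ok)
	pool.Put(buf)
}
```

Because the table holds only soft state, losing it is recoverable: an entry could be rebuilt from the ufrag in a later STUN packet, or from an external store such as the Redis recovery the article mentions.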
Where are the relays placed to keep latency low?
A Global Relay fleet is spread across major continents; Cloudflare geo-steering sends each user's signaling traffic to the nearest region, and media packets enter through a nearby relay.
Once inside OpenAI's backbone, the traffic rides dedicated fiber to the AI node, shaving milliseconds off the round trip.