Architecture note

Scaling WebRTC architecture for 1,000+ concurrent VTubers

2026-05-18 · 5 min read · Realtime Systems

In social VR and live motion-capture environments, latency directly maps to retention. The architecture has to reduce client load, protect weak connections, and keep the room feeling live.

The latency constraint

A VTuber room feels broken before the metrics look broken. Motion delay, audio drift, and unstable joins all show up as social discomfort first.

Standard mesh networks buckle as rooms grow because every client has to upload a separate copy of its stream to every other participant, so upstream work scales with room size. The product needed a media path that protected the weakest device in the room.

Moving to SFU topologies

A Selective Forwarding Unit moves stream routing into infrastructure instead of asking every browser to talk to every other browser.

That shift cut each client's uplink from one copy of its stream per peer, a cost that grows with room size, to a single, predictable path to the SFU. It also gave the backend a place to observe quality, route around weak links, and debug live failures.
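To put rough numbers on the uplink difference, here is a minimal back-of-envelope sketch. The function names and the 1,500 kbps per-stream figure are illustrative assumptions, not measurements from the production system, and the model ignores audio, retransmissions, and simulcast overhead.

```typescript
// Hypothetical model: names and bitrates are assumptions for illustration only.
interface UplinkEstimate {
  outgoingStreams: number; // copies of the local stream this client uploads
  uplinkKbps: number;      // total upstream bandwidth for those copies
}

// Full mesh: each client sends its own stream to every other participant.
function meshUplink(roomSize: number, streamKbps: number): UplinkEstimate {
  const outgoingStreams = Math.max(roomSize - 1, 0);
  return { outgoingStreams, uplinkKbps: outgoingStreams * streamKbps };
}

// SFU: each client sends one stream to the server, which fans it out.
function sfuUplink(roomSize: number, streamKbps: number): UplinkEstimate {
  const outgoingStreams = roomSize > 0 ? 1 : 0;
  return { outgoingStreams, uplinkKbps: outgoingStreams * streamKbps };
}

// Compare both topologies for a few room sizes at an assumed 1,500 kbps stream.
for (const roomSize of [4, 12, 50]) {
  const mesh = meshUplink(roomSize, 1500);
  const sfu = sfuUplink(roomSize, 1500);
  console.log(
    `${roomSize} peers: mesh ${mesh.uplinkKbps} kbps up, SFU ${sfu.uplinkKbps} kbps up`
  );
}
```

Under these assumptions the mesh uplink grows linearly with room size while the SFU uplink stays flat, which is the property that protects the weakest device in the room.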

The role of simulcast

Simulcast let clients encode and send the same video at several resolutions at the same time. The SFU could then forward the layer that fit each receiver's bandwidth and device constraints.
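The per-receiver choice can be sketched as a small selection function. This is a simplified illustration, not the routing logic of any particular SFU: the layer table, the `rid` labels, the bitrates, and the `ReceiverState` shape are all assumptions, and a real server would also split the bandwidth budget across every stream a receiver subscribes to.

```typescript
// Hypothetical layer table: rid labels, heights, and bitrates are illustrative.
interface SimulcastLayer {
  rid: string;         // identifier of the simulcast encoding
  height: number;      // vertical resolution in pixels
  bitrateKbps: number; // approximate bitrate of this layer
}

// Hypothetical view of what the SFU knows about one receiver.
interface ReceiverState {
  estimatedDownlinkKbps: number; // e.g. from bandwidth estimation
  maxHeight: number;             // device or viewport cap
}

const LAYERS: SimulcastLayer[] = [
  { rid: "q", height: 180, bitrateKbps: 150 },
  { rid: "h", height: 360, bitrateKbps: 500 },
  { rid: "f", height: 720, bitrateKbps: 1500 },
];

// Pick the highest-bitrate layer that fits both the bandwidth budget and the
// device cap. If nothing fits, fall back to the lowest layer so the receiver
// still gets video.
function selectLayer(layers: SimulcastLayer[], receiver: ReceiverState): SimulcastLayer {
  if (layers.length === 0) {
    throw new Error("at least one simulcast layer is required");
  }
  const sorted = [...layers].sort((a, b) => a.bitrateKbps - b.bitrateKbps);
  let chosen = sorted[0];
  for (const layer of sorted) {
    const fitsBandwidth = layer.bitrateKbps <= receiver.estimatedDownlinkKbps;
    const fitsDevice = layer.height <= receiver.maxHeight;
    if (fitsBandwidth && fitsDevice) {
      chosen = layer;
    }
  }
  return chosen;
}

// A constrained mobile receiver and a desktop receiver get different layers.
const mobile = selectLayer(LAYERS, { estimatedDownlinkKbps: 600, maxHeight: 1080 });
const desktop = selectLayer(LAYERS, { estimatedDownlinkKbps: 4000, maxHeight: 1080 });
console.log(`mobile: ${mobile.rid}, desktop: ${desktop.rid}`);
```

On the sending side, browsers expose simulcast through the `sendEncodings` option of `RTCPeerConnection.addTransceiver`, where each encoding can carry a `rid`, a `scaleResolutionDownBy` factor, and a `maxBitrate`. The table above stands in for whatever layers the clients actually publish.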

That mattered because the goal was not a demo room. It was an architecture that could hold sub-100 ms latency targets while supporting a user base of more than 18,000 and rooms that needed to feel alive.