WebRTC technologies are currently showing their potential for providing peer-to-peer real-time communications in a seamless and scalable way. However, most relevant use cases demanded by users require further features such as group communications, media recording and media interoperability. Providing them requires the presence of WebRTC media infrastructures that are sometimes complex to manage and to scale.
In this talk, we present the experiences of the Kurento.org team creating auto-scalable WebRTC infrastructures in the large. Following results generated by the NUBOMEDIA and FIWARE research projects, we introduce stateless and stateful scalability models, which provide different scalability definitions and properties. Stateless models are suitable for services requiring a large number of WebRTC sessions with few participants each. Such models are commonly deployed today and are compatible with the current state of the art in RTP topologies (e.g. following SFU or MCU architectures). Stateful models, on the other hand, can scale to very large sessions (with thousands or hundreds of thousands of participants) but require new types of RTP topologies beyond plain SFU and MCU models.
During the talk, we also show how to deploy such stateful and stateless infrastructures on top of IaaS clouds such as Amazon or OpenStack so that their scalability can be managed automatically. We also present the different KPIs that auto-scaling algorithms may use, as well as our experience with their accuracy and appropriateness. To conclude, we introduce some real-world problems in such deployments related to infrastructure monitoring and instrumentation, fault-tolerance and fault-resilience mechanisms, and security issues.
9. WebRTC vs traditional WWW platforms: the three tiers
http://www.kurento.org
[Figure: the three tiers. Application server container (service layer hosting Application 1 … Application N, signaling), WebRTC media server, database server.]
10. Vertical scalability on monolithic WebRTC platforms
[Figure: application server instance (Application 1 … Application N) and media server instance, annotated "The bottleneck is here". Plot of quality of service vs. number of WebRTC legs: typical scalability curve for SFU media servers, ~500 to 1000 legs on commodity hardware.]
11. Horizontal scalability of WebRTC Media Servers
[Figure: load balancer, application servers, Media Resource Broker (RFC 6917), media servers.]
12. Media Resource Broker
• Functions
– MS registration
• MS instances register on the MRB
– MS brokering
• Query model
– AS instances query the MRB for locating a MS instance
– MRB is explicit for the AS
• In-line model
– MRB routes signaling (control requests)
– MRB is transparent for the AS
• MRB does not hold state about MS instances
– MS instances are independent
– MS instances are equivalent
– We say it’s stateless
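The query model above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `MediaResourceBroker`, its methods, and the round-robin choice are hypothetical and are not taken from Kurento or from RFC 6917. The point it shows is that, because MS instances are independent and equivalent, the broker only needs the list of registered instances and can answer any query with any of them.

```python
class MediaResourceBroker:
    """Hypothetical query-model broker: it knows which MS instances exist,
    but nothing about the sessions running on them."""

    def __init__(self):
        self._instances = []   # addresses of registered MS instances
        self._next = 0         # round-robin position

    def register(self, ms_address):
        """MS registration: called by an MS instance when it starts."""
        if ms_address not in self._instances:
            self._instances.append(ms_address)

    def locate(self):
        """MS brokering: called explicitly by an AS instance.
        Any instance is a valid answer because all are equivalent."""
        if not self._instances:
            raise LookupError("no media server registered")
        address = self._instances[self._next % len(self._instances)]
        self._next += 1
        return address


mrb = MediaResourceBroker()
mrb.register("ms-1")
mrb.register("ms-2")
print([mrb.locate() for _ in range(4)])
```

In the in-line model the same selection would happen inside the broker while it routes the control requests, so the AS would never see the MS address.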
13. Stateless MRB use cases
• Independent MS
– B2B calls
– WebRTC GW
– Room servers
– Media recording
– Etc.
[Figure: stateless MRB with an application server instance, several media server instances, and two calls.]
14. Deploying in public and private clouds
• Amazon Web Services EC2
– Most popular public cloud
• OpenStack
– Used by popular public clouds (e.g. Rackspace)
– Popular for private clouds
• Deployment
– Cloud deployment templates
• CloudFormation (Amazon)
• Heat (OpenStack)
15. Templates
– Declarative language for
• Declaration of resources and relationships
– Images, Computing Nodes, Networks, Volumes, Load Balancers, Autoscaling groups, etc.
• Deployment
– Instantiation of resources
• Runtime
– Provisioning
– Autoscaling
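To make the idea of a declarative template concrete, here is a minimal sketch that builds a CloudFormation-style template as a Python dictionary and prints it as JSON. The resource types `AWS::AutoScaling::LaunchConfiguration` and `AWS::AutoScaling::AutoScalingGroup` are real CloudFormation types, but everything else is an assumption made for illustration: the logical names, the image id, the instance type, and the group sizes are placeholders, not values from the talk.

```python
import json

# Declaration of resources and of the relationship between them.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "MediaServerLaunchConfig": {
            "Type": "AWS::AutoScaling::LaunchConfiguration",
            "Properties": {
                "ImageId": "ami-00000000",    # placeholder media server image
                "InstanceType": "c4.large",   # placeholder instance type
            },
        },
        "MediaServerGroup": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                # Relationship: the group launches instances from the config above.
                "LaunchConfigurationName": {"Ref": "MediaServerLaunchConfig"},
                "AvailabilityZones": {"Fn::GetAZs": ""},
                "MinSize": "1",               # placeholder bounds
                "MaxSize": "10",
            },
        },
    },
}

print(json.dumps(template, indent=2))
```

A Heat template expresses the same kind of declarations in its own syntax; the deployment step then instantiates the declared resources.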
16. Deploying in public clouds
[Figure: deployment pipeline. Source code is built with Chef + Packer into media server, application server and broker images (AWS AMI / OpenStack Glance). A stack definition template (CloudFormation / Heat) with autoscaling rules and launch configurations is instantiated on AWS EC2 / OpenStack Nova as an Elastic Load Balancer, autoscaling groups of application server and media server instances, and a broker instance.]
18. Experiences deploying large WebRTC infrastructures in public clouds
• Lessons learnt: fault-resilience is hard
– AS & MRB layers
• Are stateless => use distributed cache systems
– MS layer
• Is stateful => lots of problems
[Figure: application servers, Media Resource Broker, media servers.]
19. Lessons learnt: avoid single points of failure
[Figure: two deployments. The wrong way (single point of failure): a single MRB in front of the computing nodes running MS instances. The right way (fault-tolerant MRB): an Elastic Load Balancer in front of several MRB instances that share a distributed cache, with the computing nodes running MS instances behind them.]
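The fault-tolerant arrangement can be sketched as follows. This is an illustration under assumptions, not the team's implementation: `SharedCache` is an in-memory stand-in for a real distributed cache, and `BrokerReplica` and `locate_via_any` are hypothetical names. It shows why keeping the registry in the shared cache lets any surviving MRB replica answer a query after another replica dies.

```python
import zlib


class SharedCache:
    """In-memory stand-in for a distributed cache shared by all MRB replicas."""

    def __init__(self):
        self._servers = []

    def add_server(self, ms_address):
        if ms_address not in self._servers:
            self._servers.append(ms_address)

    def servers(self):
        return list(self._servers)


class BrokerReplica:
    """Hypothetical MRB replica: all of its data lives in the shared cache."""

    def __init__(self, cache):
        self._cache = cache
        self.alive = True

    def register(self, ms_address):
        self._cache.add_server(ms_address)

    def locate(self, session_id):
        servers = self._cache.servers()
        if not servers:
            raise LookupError("no media server registered")
        # crc32 is stable across processes, so every replica gives the same answer.
        return servers[zlib.crc32(session_id.encode()) % len(servers)]


def locate_via_any(replicas, session_id):
    """Plays the load balancer role: use the first replica that is still alive."""
    for replica in replicas:
        if replica.alive:
            return replica.locate(session_id)
    raise RuntimeError("all MRB replicas are down")


cache = SharedCache()
replicas = [BrokerReplica(cache), BrokerReplica(cache)]
replicas[0].register("ms-1")
replicas[0].register("ms-2")
before = locate_via_any(replicas, "call-42")
replicas[0].alive = False          # one MRB replica fails
after = locate_via_any(replicas, "call-42")
print(before == after)
```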
20. Lessons learnt: fault-recovery at the MS layer
• Fault-tolerance on the MS layer
– Stateful problem
• MS instances hold specific resources that cannot be “serialized” to a distributed cache:
– Specific sockets
• Machine failure => session failure
– Our proposed solution
• Reconstruct the session
– Detect failure
– Notify failure
– Reconnect
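The three steps (detect, notify, reconnect) can be sketched as below. This is a minimal sketch under assumptions: heartbeat-based detection, the class names `FailureDetector` and `ApplicationServer`, and the timeout value are all hypothetical choices made for illustration. In a real deployment the reconnection step also involves the clients, which have to negotiate new media sessions with the replacement MS.

```python
import time


class FailureDetector:
    """Detect failure: an MS is considered failed when its heartbeat is too old."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen = {}

    def heartbeat(self, ms_id):
        self._last_seen[ms_id] = self._clock()

    def failed(self):
        now = self._clock()
        return sorted(ms for ms, seen in self._last_seen.items()
                      if now - seen > self._timeout_s)


class ApplicationServer:
    """Holds the session-to-MS mapping and rebuilds sessions on notification."""

    def __init__(self):
        self.sessions = {}   # session id -> MS id

    def on_ms_failure(self, failed_ms, healthy_servers):
        """Notify failure + reconnect: move affected sessions to healthy MS."""
        moved = []
        for session_id, ms_id in self.sessions.items():
            if ms_id == failed_ms:
                target = healthy_servers[len(moved) % len(healthy_servers)]
                self.sessions[session_id] = target   # session rebuilt from scratch
                moved.append(session_id)
        return moved


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
detector = FailureDetector(timeout_s=5.0, clock=lambda: now[0])
detector.heartbeat("ms-1")
detector.heartbeat("ms-2")
now[0] = 4.0
detector.heartbeat("ms-2")
now[0] = 7.0                      # ms-1 has been silent for 7 s

app = ApplicationServer()
app.sessions = {"call-a": "ms-1", "call-b": "ms-2"}
for ms in detector.failed():
    print(ms, app.on_ms_failure(ms, healthy_servers=["ms-2"]))
```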
[Figure: MRB, application server instance and media server instances carrying two calls; steps shown: failure detection, failure notification, session reconnection.]
22. Lessons learnt: lack of optimal scale-out events and metrics
• Lessons learnt: firing scale-out events
– Which metric?
– The bottleneck depends on the application: network, CPU, memory, etc.
– Our recommendation: define a synthetic metric (i.e. scaling points) and be conservative
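One way to read the "scaling points" recommendation is sketched below. All numbers and names here are assumptions for illustration only: the per-leg weights, the capacity of an instance, and the threshold are hypothetical and would have to be calibrated per application. The sketch assigns each kind of WebRTC leg a cost in points, expresses the load of an instance as a fraction of its point capacity, and fires a scale-out event at a deliberately conservative threshold.

```python
# Hypothetical cost, in points, of one WebRTC leg of each kind.
POINTS_PER_LEG = {"sfu_forward": 1, "recording": 2, "transcoding": 8}

CAPACITY_POINTS = 1000      # hypothetical capacity of one MS instance
SCALE_OUT_THRESHOLD = 0.5   # conservative: scale out well before saturation


def load_fraction(legs):
    """Synthetic metric for one MS: used points divided by capacity."""
    used = sum(POINTS_PER_LEG[kind] * count for kind, count in legs.items())
    return used / CAPACITY_POINTS


def should_scale_out(fleet):
    """Fire a scale-out event when the average load reaches the threshold."""
    average = sum(load_fraction(legs) for legs in fleet) / len(fleet)
    return average >= SCALE_OUT_THRESHOLD


fleet = [
    {"sfu_forward": 200, "transcoding": 50},
    {"sfu_forward": 500},
]
print(should_scale_out(fleet))
```

The advantage of a synthetic metric over raw CPU or bandwidth is that it stays meaningful whichever resource turns out to be the bottleneck for a given application.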
[Figure: quality of service vs. number of WebRTC legs, typical scalability curve for SFU media servers, with marks at 50% and 40%.]
23. Lessons learnt: scaling-in is harder than scaling-out
• The options (none good)
– Expose # sessions as a metric
• Depends on cloud capabilities
• AS needs to be made cloud aware
– Session migration
• AS needs to be made cloud aware
• Renegotiations
– Retain period
• Sub-optimal utilization
• The simplest
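The retain-period option can be sketched as below, under one possible interpretation that is an assumption of this example: an instance selected for removal stops receiving new sessions and is terminated once it is empty or once the retain period has expired, whichever comes first. The names `MediaServer`, `start_drain`, `pick_for_new_session` and `removable` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaServer:
    name: str
    sessions: int = 0
    draining_since: Optional[float] = None   # set when scale-in selects this MS


def start_drain(server, now):
    """Mark the instance for removal; it keeps serving its current sessions."""
    server.draining_since = now


def pick_for_new_session(servers):
    """New sessions only go to instances that are not being drained."""
    candidates = [s for s in servers if s.draining_since is None]
    if not candidates:
        raise LookupError("no media server accepts new sessions")
    return min(candidates, key=lambda s: s.sessions)


def removable(server, now, retain_period_s):
    """Terminate when empty, or when the retain period has expired."""
    if server.draining_since is None:
        return False
    expired = now - server.draining_since >= retain_period_s
    return server.sessions == 0 or expired


ms1 = MediaServer("ms-1", sessions=3)
ms2 = MediaServer("ms-2", sessions=1)
start_drain(ms2, now=0.0)
print(pick_for_new_session([ms1, ms2]).name)
print(removable(ms2, now=10.0, retain_period_s=60.0))
```

While it drains, the instance is paid for but only partly used, which is the sub-optimal utilization mentioned above.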
[Figure: MRB, application server instance and media servers MS1 to MS4. Which one would you remove?]
24. Limits of the (stateless) MRB
[Figure: one-to-many media stream.]
27. Stateful because …
• MRB
– Must be aware of media topology
• Stateful information about MS relationships
– Request routing depends on topology
• Where to place a new viewer?
– Request routing depends on internal state
• CPU load
• QoS
• Memory
• Etc.
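The question "where to place a new viewer?" can be illustrated with a small sketch. It is only one possible policy under stated assumptions: the `Node` structure, the separate budgets `viewer_slots` and `child_slots`, and the breadth-first rule are hypothetical and not the placement algorithm used by the authors. What it shows is why the broker becomes stateful: it must keep the tree of MS relationships and the occupancy of each node to route a request.

```python
import itertools
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """One MS in the distribution tree, as tracked by the stateful MRB."""
    name: str
    viewer_slots: int          # hypothetical budget of viewer legs
    child_slots: int           # hypothetical budget of legs to child MS
    viewers: int = 0
    children: list = field(default_factory=list)


def place_viewer(root, make_node):
    """Breadth-first: use the MS closest to the source that has a free viewer
    slot; if the whole tree is full, grow it with a new child MS."""
    queue = deque([root])
    expandable = None
    while queue:
        node = queue.popleft()
        if node.viewers < node.viewer_slots:
            node.viewers += 1
            return node
        if expandable is None and len(node.children) < node.child_slots:
            expandable = node
        queue.extend(node.children)
    if expandable is None:
        raise RuntimeError("tree cannot grow any further")
    child = make_node()
    expandable.children.append(child)
    child.viewers += 1     # assumes a new MS has at least one viewer slot
    return child


counter = itertools.count(1)
root = Node("ms-0", viewer_slots=2, child_slots=2)
new_ms = lambda: Node(f"ms-{next(counter)}", viewer_slots=2, child_slots=2)
print([place_viewer(root, new_ms).name for _ in range(5)])
```

A fuller policy would also weigh the internal state listed above (CPU load, QoS, memory) when choosing among candidate nodes.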
28. Experiences with stateful MRB in AWS EC2 & OpenStack
• Lessons learned: beware of WebRTC internals
– Differentiated quality
• SVC is the solution
– but it's not ready
• Plain SFU forwarding models are not an option.
– RTCP feedback of viewers with bad connectivity destroys QoE
• Simulcast may be an option
– Suppress feedback of viewers with really bad connectivity
• Layered transcoding works nicely
– But it's expensive
– Churn and the generation of key-frames
• Periodic key-frame generation is an option
– In VP8 expect significant increase in BW consumption
• Layered transcoding works nicely
– But it's again expensive
29. Experiences with stateful MRB in AWS EC2 & OpenStack
• Lessons learned: the cloud is evil
– Placement of incoming WebRTC legs
• New science required here
– Ideas?
• Our solutions
– Count number of WebRTC legs (points mechanism)
– Ad-hoc, hard and error prone
– Fault-resilience
• New science required here
– Ideas?
• Our solution
– Re-construct internal parts of the tree, but never leaves.
– Requires client renegotiation
– Ad-hoc, hard and error prone
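The fault-resilience idea (rebuild internal parts of the tree, never leaves) can be sketched as follows. This is a hedged illustration, not the authors' procedure: the tree is modelled as a plain child-to-parent dictionary, and the function name `rebuild_after_failure` and the policy of putting a fresh MS in the failed node's position are assumptions. Every node re-attached this way, like every client that was connected to the failed MS, has to renegotiate its media connection.

```python
def rebuild_after_failure(parent_of, failed, replacement):
    """Return (new_tree, reattached).

    parent_of maps each MS name to its parent's name (None for the root).
    An internal node is replaced by a fresh MS in the same position;
    a failed leaf is simply dropped and nothing is rebuilt for it.
    """
    children = [n for n, parent in parent_of.items() if parent == failed]
    new_tree = {}
    for node, parent in parent_of.items():
        if node == failed:
            continue
        # Children of the failed node now hang from the replacement.
        new_tree[node] = replacement if parent == failed else parent
    if children:
        new_tree[replacement] = parent_of[failed]
    return new_tree, children


tree = {"ms-0": None, "ms-1": "ms-0", "ms-2": "ms-1", "ms-3": "ms-1"}
new_tree, reattached = rebuild_after_failure(tree, "ms-1", "ms-1b")
print(new_tree)
print(reattached)
```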