27. WebSocket Load Balancing
Permission and security model (Admin, Mods, ...)
[Diagram: UI → Frontend Servers (auto scaling, with Long Polling Fallback Server) → Backend Servers (auto scaling) → Redis Cluster (data storage) & Smashcast REST-API]
28. Servers
• Small, cheap machines
• Frontend handles the connections, no logic
• Backend stateless, can be restarted/upgraded any time
• When a frontend breaks it affects only a few users
• Socket.io for handling WebSockets
• Up- & downscale as needed
50. Examples
• Some services don't need a login
• There is always the need to schedule things
• Services need to send information to and get information from the API
• What happens when a frontend server dies?
That's me in 1980/81 with my first computer. Anyone know the computer? I studied arts, lived in New York & Berlin, have made startups and have crashed startups.
Smashcast was, until April, Hitbox. But Hitbox got bought by Azubu, a competitor, and now we are Smashcast.
What is Smashcast? This is the front page.
And that's a stream page: you see the live stream, chat, etc.
Real time is important. When something happens on the stream, viewers want to react as fast as possible.
That's why we have built a real-time infrastructure based on WebSockets.
Now a few explanations of the elements on the site that use this infrastructure.
The chat can do everything a chat needs to do, including posting images, GIFs, selfies, etc.
Sounds easy, but it isn't when you want to scale it!
And there is a big difference between a small WhatsApp group and a huge real-time chat!!!
Take for example images in chat, sounds easy!
Look at the person in the top left corner
Just link to the source and you are done!
We had this for two years…. Until we realized: there is a problem!
For example: someone posts a 100+ MB GIF, then all viewers will start to download it. And when their internet is slow (don't forget, there is a 3 Mbit stream running next to the chat), the stream will lag, giving a bad user experience!
But there are bigger problems with these images in the chat!
Imagine you hate one of the smaller streamers (like 5-20 viewers) on Smashcast. You set up a small server with a nice GIF on it and you post this GIF, when the streamer is live, in his chat.
And now you have the IP addresses of all his viewers and the IP of the streamer in your log files!
So the next step is:
Googling DDoS and bringing down this streamer you hate! So easy!
But there is even a third problem!
Imagine a stream has 50k viewers and someone posts a GIF. The server where the GIF is hosted must be strong, because it will now get 50k hits at the same time!
It's like a DDoS…
So we need to get the image, check it, and save it in the CDN.
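Roughly, such a proxy step could look like this in Node.js (a sketch, not our actual code; the bucket, CDN host, and size limit are made up):

// Sketch: fetch a user-posted image, sanity-check it and store it in S3 behind the CDN.
// Bucket name, CDN host and MAX_IMAGE_BYTES are hypothetical.
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' });
const MAX_IMAGE_BYTES = 5 * 1024 * 1024; // reject the 100+ MB GIFs

async function mirrorImage(sourceUrl, key) {
  const res = await fetch(sourceUrl); // Node 18+ global fetch
  const type = res.headers.get('content-type') || '';
  if (!type.startsWith('image/')) throw new Error('not an image');
  if (Number(res.headers.get('content-length') || 0) > MAX_IMAGE_BYTES) throw new Error('too large');

  const body = Buffer.from(await res.arrayBuffer());
  if (body.length > MAX_IMAGE_BYTES) throw new Error('too large');

  // (this is also where a porn/abuse check could run before publishing)
  await s3.send(new PutObjectCommand({
    Bucket: 'chat-images-cdn', // hypothetical bucket behind the CDN
    Key: key,
    Body: body,
    ContentType: type,
  }));
  return `https://cdn.example.com/${key}`; // the chat links only to this CDN copy
}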
And we are testing AWS machine learning for porn detection
Back to the real-time features. Cheering is another one on the site. It allows people to cheer for a team or a stream.
Lots of things to do! And that's just the beginning! Again, here you run quite fast into problems with sending too many messages to the users, etc.
Another feature on the site is the feed below the stream. All viewers get updates via the WebSocket.
Last but not least, the viewcounter.
That's this number here. Maybe the most important thing on the site.
So this number is the main KPI for all streams; the bigger the better, so a lot of people try to influence this number. Updates are sent out in real time or every 10 seconds (depending on how big the stream is) to all viewers.
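As a rough illustration of that realtime-vs-every-10-seconds rule (a sketch; the threshold and broadcastToChannel() are made up, not our real code):

// Sketch: push the viewcount immediately for small streams,
// batch it every 10 seconds for big ones.
const REALTIME_THRESHOLD = 1000; // hypothetical cut-off
const pending = new Map();

function onViewcountChanged(channel, count) {
  if (count < REALTIME_THRESHOLD) {
    broadcastToChannel(channel, { type: 'viewcount', count }); // small stream: realtime
  } else {
    pending.set(channel, count); // big stream: remember only the latest value
  }
}

setInterval(() => {
  for (const [channel, count] of pending) {
    broadcastToChannel(channel, { type: 'viewcount', count });
  }
  pending.clear();
}, 10000); // flush the big streams every 10 seconds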
So, let's explain how we started the real-time system a few years ago: the famous first version.
We went with Node.js & Redis. Redis because it is a great product for storing data that you need fast; when we have a lot of users, the Redis servers handle thousands of requests per second without any problems, and AWS offers a very good managed version of Redis.
Node.js because of its fast I/O; nowadays I would maybe move to Go.
So we went with a typical frontend/backend setup: the frontend handles the WebSocket connection and is quite dumb, the backend handles all the chat logic.
The fallback server is for the less than 1% that don't support WebSockets; some providers block them, and so do some older Android versions.
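To make the split concrete, here is a minimal sketch of such a dumb frontend (socket.io only; forwardToBackend() is just a placeholder for the transport to the backend, not our real code):

// Sketch: v1 frontend server. It only terminates the WebSocket
// (socket.io also gives us the long-polling fallback) and forwards
// everything to the backend; all chat logic lives there.
const { Server } = require('socket.io');

const io = new Server(3000, {
  transports: ['websocket', 'polling'], // polling covers the <1% without WebSocket support
});

io.on('connection', (socket) => {
  socket.on('message', (msg) => {
    forwardToBackend(socket.id, msg); // hypothetical helper: hand the raw command to a backend
  });
  socket.on('disconnect', () => {
    forwardToBackend(socket.id, { cmd: 'disconnect' });
  });
});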
We use AWS
Single core machines
This 1st version worked fine, but the problem is:
So we had a similar infrastructure for the viewcounter, and when we worked on cheering & the feed we would have had to build a similar infrastructure for those too.
So we needed a different approach.
Sounds easy, right? It has existed for 30 years.
That's when we decided to use RabbitMQ. We could have used Kafka too, but I have quite some experience with RabbitMQ and it fits perfectly. Anyone using it too?
As I said, it is easy to use, easy to maintain, and the best thing is the web interface; I will show it to you later.
So, what does the new server structure look like? This is Mike Pence, Vice President of the USA, while visiting NASA…
We still have the frontend servers (with fallback, of course) and in between the RabbitMQ cluster, which distributes the messages to the services and then back to the frontend.
So how does this work now? How is a command from the frontend sent to the backend, processed, and sent back to the frontend?
First, you need to log in to every service. OK, this flow is not that complicated, but wait for it!
This is a login message the user interface sends to the frontend server it is connected to. In this case it wants to log in (joinchannel) to the chat service for the channel „karlus“.
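The exact message is on the slide; roughly it is a small JSON command along these lines (the field names here are only illustrative, not our real protocol):

// Illustrative shape of the joinchannel command the UI sends over the WebSocket
{
  "service": "chat",
  "command": "joinchannel",
  "channel": "karlus",
  "params": { "user": "someviewer", "token": "…" }
}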
The frontend server then sends this message to the RabbitMQ cluster.
In RabbitMQ there are two exchanges defined: fromFrontend & toFrontend. The frontend servers are connected to both; on one they are listening and on the other they are sending messages.
So this message is sent to the fromFrontend exchange because it comes from the frontend server.
Here you can see this. The frontend server sets the routing, and the routing key is chat.joinchannel.karlus because it is a joinchannel command for the chat service for the channel karlus.
The fromFrontend exchange now routes all messages where the routing key starts with chat. to the chat queue.
The chat backend servers (in this case two servers) are connected to this chat queue, and RabbitMQ distributes the messages via round robin to the chat backend servers.
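In amqplib terms, that setup could look roughly like this (a sketch assuming topic exchanges and a chat.# binding; the real names, options and handleChatCommand() are not our actual code):

// Sketch: declare the two exchanges and bind the chat queue so that
// every routing key starting with "chat." ends up in it. RabbitMQ then
// round-robins that queue across all connected chat backend servers.
const amqp = require('amqplib');

async function setupChatBackend() {
  const ch = await (await amqp.connect('amqp://localhost')).createChannel();

  await ch.assertExchange('fromFrontend', 'topic', { durable: false });
  await ch.assertExchange('toFrontend', 'topic', { durable: false });

  await ch.assertQueue('chat', { durable: false });
  await ch.bindQueue('chat', 'fromFrontend', 'chat.#'); // chat.joinchannel.karlus, chat.msg.*, ...

  ch.prefetch(1); // fair round robin between the backend servers
  await ch.consume('chat', (msg) => {
    handleChatCommand(JSON.parse(msg.content.toString())); // hypothetical handler
    ch.ack(msg);
  });
}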
This is how the backend servers then get their messages. They have no clue that the message is coming from a frontend server; it could come from other services, etc.
The chat backend servers then process the message (in this case do the login, etc.) and then send a message back to RabbitMQ, to the „toFrontend“ exchange.
Each frontend server has its own queue that is connected directly to the „toFrontend“ exchange. With this setup it is possible for the backend to send a message directly to one frontend server (for example, the login msg, because this is only for one user) or to all frontend servers (for example, a chat msg).
After processing the message, the backend server sends the message back to RabbitMQ. For a normal chat message this would look like this; again, the routing key is the service, command & channel.
The login msg is a bit different because it is sent directly to the frontend server that sent the original message; the other frontend servers don't need to see it, so there is no routing key, only the target queue is specified.
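The two reply paths could be sketched like this (again amqplib, with made-up payloads; frontendQueue stands for whatever the per-frontend-server queue is actually called):

// Sketch: backend replies. A chat message is published to the toFrontend
// exchange with a routing key (service.command.channel) so every frontend
// server sees it; a login reply goes straight into one frontend's own queue.
function broadcastChatMessage(ch, channel, payload) {
  ch.publish('toFrontend', `chat.msg.${channel}`, Buffer.from(JSON.stringify(payload)));
}

function replyLogin(ch, frontendQueue, payload) {
  // no routing key: only the queue of the frontend server
  // that originally forwarded the joinchannel command gets it
  ch.sendToQueue(frontendQueue, Buffer.from(JSON.stringify(payload)));
}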
And this is how it looks in the user interface. Green are the messages from the user interface, white the messages from the frontend server.
Here you see the login msg we just saw
And here the message coming back from the chat backend server.
When we built this system we realized that we needed some generic services that can be used by other services.
For example, some services don't need a login.
A cron job is needed, for example, for the cheering service to send status updates back to the viewers, or for the viewcount server.
The connection to & from the API is handled by a service, so other services can send a message to this service and it will then interact with the REST API.
The cleanup service is there to log out viewers from other services when a frontend server goes down.
So we added these needed services to the same RabbitMQ cluster.
Some services are really simple; this is the main function for the login service, for example.
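The slide shows the real code; the idea boils down to roughly this (a sketch with invented names like verifyToken and replyTo, not the actual service):

// Sketch of what a tiny login service can boil down to: consume login
// commands from its queue, check the token, and reply to the frontend's queue.
async function main(ch) {
  await ch.consume('login', async (msg) => {
    const cmd = JSON.parse(msg.content.toString());
    const ok = await verifyToken(cmd.params.token); // hypothetical check, e.g. via the API service
    ch.sendToQueue(cmd.replyTo, Buffer.from(JSON.stringify({
      command: 'loginresult',
      success: ok,
    })));
    ch.ack(msg);
  });
}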
This can lead to quite complicated flows.
To handle this we have our own library that we use to connect to & work with RabbitMQ.
Well, and in the end, is the chat system working? Does it scale?
Well, I don't have a screenshot of our latest record, which was close to 200k, but this one shows you a channel with 100k people.
All 154k connections were handled by 16 frontend servers and 8 backend servers, costing us around $20 for the evening.
Just one more thing:
It was during the, at that time, biggest event ever: 60k people on one stream, and suddenly all of them saw this.
And we did this!
I know, this sounds stupid, but I will give you two examples:
Imagine you have a stream with 100k viewers. Every time a new viewer comes to this stream he/she gets the info about how to get the stream from our server.
Now imagine the streamer has a problem; let's say his computer crashes and the stream drops, meaning it goes black or gets stuck.
What do 100k people do?
This.
And let's hope that your API can handle this!
And they won't stop until they have a stream again!