Presentation given by Błażej Kasperczyk at Pykonik meetup in Kraków.
How many applications, and where do we put them? Why is our system so bad at keeping up with what the users want? What to do in case of a noisy neighbour?
When you aim to provide a platform where developers can easily launch an application without worrying about configuring the system, you will have to code it sooner or later. As with most very simple concepts, it presents a plethora of challenges to deal with.
3. Team PaaS...
• DevOps team
• Develop and maintain the Platform
• Backend-oriented
• Python 3.x, Tornado
"With the friends you have in your team, you don't really need enemies!"
4. ...And our little cloud – and what runs on it
• Approx. 2300 VMs of varying sizes
• 1400 active applications, 600 of them in Python 3.x
• 9000+ running instances
• ...a third of which are Python, Tornado-based applications
5. The PaaS layer
• Push-button deployment
• Scale by available resources and the number of applications
• Quick application installation with our build system
• Communications bus between applications
6. The slow start
• Work started in 2011
• Python 2.7 + gevent
• Works over SSH
• Push model
• Hard limit: 10 applications on each VM
• ...and it works!
7. The inevitable
• Approx. 150 VMs max
• 300 VMs becomes a hard limit that cannot be bypassed
• A single point of failure
While the panel was primitive, "papyrus" was a top trending colour!
8. A moment of reinvention: What if we use our cloud to scale our cloud?
• In place of the old orchestrator – a table of states and a coordinator
• An API that exposes what needs to be done to reach the desired state
• A daemon running on the VM handles the rest
• It's 2013 - let's be modern, let's do it in Python 3!
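The "table of states" idea can be illustrated with a small reconciliation sketch: compare the desired instance counts with what is running and derive the actions a daemon would carry out. This is only an assumption about how such a comparison might look; the names `Action` and `plan_actions` and the dictionary layout are hypothetical and do not come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str  # "start" or "stop" (hypothetical action names)
    app: str


def plan_actions(desired: dict[str, int], running: dict[str, int]) -> list[Action]:
    """Return the actions that move the running instance counts to the desired ones."""
    actions: list[Action] = []
    # Walk over every application mentioned on either side, in a stable order.
    for app in sorted(set(desired) | set(running)):
        diff = desired.get(app, 0) - running.get(app, 0)
        kind = "start" if diff > 0 else "stop"
        # One action per missing or surplus instance; nothing when diff == 0.
        actions.extend(Action(kind, app) for _ in range(abs(diff)))
    return actions
```

In a setup like this the coordinator only publishes the desired state; each daemon polls, computes the difference for its own VM, and applies it, so no central component has to push commands over SSH.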
9. Scoreboard
• Coordinates cloud management
• PostgreSQL backend
• Responsible for provisioning
• Supports over 2000 machines...
• ...each querying multiple times every minute...
• ...currently.
• It can rebuild itself in case of a database failure
10. Agent daemon
• Runs on the VM it manages
• Automatically launched with each new VM
• Launches and maintains applications
• Reports statistics for monitoring purposes
• Allows the developer to remotely shut the application down
12. Weight balancing
• Each VM has a capacity limit
• Each application declares its size
• Light (White/Green)
• Medium (Yellow)
• Heavy (Red)
• ...that should do it, right?
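A minimal sketch of weight-based placement, assuming a first-fit strategy: each declared size maps to a numeric weight, and an application goes to the first VM whose remaining capacity can hold it. The weights, the capacity of 10, and the function name `place` are illustrative assumptions, not values from the talk.

```python
# Hypothetical weights for the declared sizes; the real values are not given.
WEIGHTS = {"light": 1, "medium": 2, "heavy": 4}


def place(app_size, vms, capacity=10):
    """Put an application on the first VM with enough free capacity.

    `vms` maps a VM name to the list of sizes already placed on it.
    Returns the chosen VM name, or None if no VM has room.
    """
    need = WEIGHTS[app_size]
    for name, apps in vms.items():
        used = sum(WEIGHTS[size] for size in apps)
        if used + need <= capacity:
            apps.append(app_size)
            return name
    return None
```

Such a scheme trusts the declared size completely, which is exactly the weakness the next slide points at.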
13. Oversized cats
• A worker can spike to 100% CPU usage while averaging only 10%.
• An application can declare high usage but be harmless.
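The gap between average and peak usage is easy to show with made-up numbers. In the sketch below the samples are invented for illustration: nine idle readings and one full-CPU spike.

```python
from statistics import mean


def usage_profile(samples):
    """Summarise CPU usage samples (in percent) as average and peak."""
    return {"average": mean(samples), "peak": max(samples)}


# Invented samples: idle most of the time, one spike to 100%.
profile = usage_profile([0] * 9 + [100])
```

Here `profile["average"]` is 10 while `profile["peak"]` is 100, so a placement based on either number alone misjudges the worker.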
15. Docker
• Requires a major overhaul of our application building and deployment...
• ...and will actually do what we already have.
16. LXC
• The current architecture requires that there is no network translation between the Agent and the Application...
• ...and that caused issues when launching applications
17. CGroups!
• The same mechanism that is used by most containers
• Automatic cleanup
• Simplicity of the solution
18. Everything is now in a box...
• Applications in the cloud no longer exceed their assigned resources
• CPU is limited for each instance
• The OOM killer kicks in for any memory-heavy application that tries to exceed its limit
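As a rough illustration of what "a box" means at the cgroup level, the sketch below computes which control files would be written for one instance. It assumes cgroup v1 with the `cpu` and `memory` controllers mounted under `/sys/fs/cgroup`; the talk does not say which version or controllers are used. The file names `cpu.cfs_period_us`, `cpu.cfs_quota_us` and `memory.limit_in_bytes` are standard cgroup v1 interface files, while the function name and group naming are hypothetical.

```python
def cgroup_v1_settings(name, cpu_cores, memory_bytes,
                       period_us=100_000, root="/sys/fs/cgroup"):
    """Return {control file path: value} for a CPU and memory limit.

    Assumes cgroup v1 mounted at `root`; nothing is written to disk here.
    """
    # CFS bandwidth control: the group may run `quota_us` microseconds
    # of CPU time in every `period_us` microseconds.
    quota_us = int(cpu_cores * period_us)
    return {
        f"{root}/cpu/{name}/cpu.cfs_period_us": str(period_us),
        f"{root}/cpu/{name}/cpu.cfs_quota_us": str(quota_us),
        # Exceeding this limit triggers the kernel OOM killer inside the group.
        f"{root}/memory/{name}/memory.limit_in_bytes": str(memory_bytes),
    }
```

Keeping the calculation separate from the actual writes makes the limits easy to inspect before they are applied; a real agent would also need to create the group directories and add the process ID to each group's task list.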
19. ...time to relax, right?
• Time does not stop, or that time we went Xenial and got eaten by systemd
• The sword of Damocles called "Impending Knapsack Problem"
• Autoscaling
• ...and a few other things
As a side effect, we actually made a sane frontend.