How to Troubleshoot Apps for the Modern Connected Worker
Kenneth Tan - Product fishos openstack kubernetes resource management
1. Sardina Systems Proprietary. Copyright (C) 2014 -- 2019. Sardina Systems. All rights reserved.
S A R D I N A
OpenStack + Kubernetes
resource management, efficiency, optimization
Kenneth Tan
kenneth.tan@sardinasystems.com
+44 798 941 7838
+421 948 251 435
www.sardinasystems.com
1.
Sardina Systems
Integrated cloud platform:
Full-lifecycle automation
Smart Cloud Platform
Full OpenStack + Kubernetes
platform
Zero Downtime Operation
Assurance — industry-first
Option: OpenShift Origin
European Innovation
European cloud automation
ISV with presence in UK,
Romania, Russia, Slovakia
OpenStack Foundation
Corporate Sponsor
Young, Innovative
Founded in 2014
Day-1 Team: supercomputing,
finance, defence, telco
Today: 20+ people, PhDs
Award Winning
DCD Global Award
IDC HPC Innovation Award
EU H2020 winner
UK Data 50 Award
and more
Partners
2. resource management: vm, containers
3. back to basics: nuts and bolts
4. kvm
5. cpu
6. memory
7. storage
8. network
9. data center view: the payoff
10 & 11. solution and success stories
agenda
2.
resource
management
operator-consumer relationship
when the relationship between operator and service
consumer is close and trusted, segregation and
contention are seldom a problem
operator-consumer relationship
when the relationship between operator and service
consumer is not close and trusted, management
of segregation and contention is key
containers
Unix V7: chroot process
isolation, filesystem access
segregation, tar packaging
1979
FreeBSD: jails, segregate
IP, configuration
2000
Linux VServer: jails,
segregate filesystem,
network, memory
2001
Solaris: Containers,
segregate system
resources, leverage ZFS
2004
Virtuozzo/OpenVZ: kernel
based virtualization,
isolation
2005
Linux: Process Containers,
cgroups, segregate CPU,
memory, disk, network
2006
Linux Containers (LXC):
container manager, cgroups,
namespaces
2008
Docker: LXC then
libcontainer, programmable,
packaging with image
2013
containers: key functions
packaging system
kernel slicer
containers: key question
which kernel?
host kernel vs VM kernel
containers: slice host kernel
escape risk
operator: highly risky
containers: slice vm kernel
operator: no risk
any escape would be contained
within consumer’s environment
vm vs container, host vs vm
live migration
escape
nature of organizations
loosely-coupled relationship
commonly run workloads in vms
k8s, containers in vms
cleanly manage segregation
organizations and businesses
tightly coupled consumer + operator:
common in smaller organizations
loosely coupled = containers in vms
can be a mixture
enterprise
focus: loosely-coupled operator-consumer
relationship
typical in major enterprises
cloud resource management
1. given new vm requests, where should the vm be
placed (optimally)?
2. how can workloads in an environment be optimally
rebalanced across the servers?
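The placement question can be pictured with a small sketch. It assumes each server reports live utilization per resource as a fraction of its capacity. The names (`Host`, `place_vm`), the resource keys and the scoring rule (pick the host whose worst bottleneck after placement is lowest) are illustrative assumptions, not a description of any particular scheduler.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical resource keys; a real system would track finer-grained ones.
RESOURCES = ("cpu", "memory", "disk_io", "net_io")

@dataclass
class Host:
    name: str
    used: Dict[str, float]  # live utilization per resource, 0.0 to 1.0

def bottleneck_after(host: Host, demand: Dict[str, float]) -> float:
    """Highest utilization on any resource if the VM were placed on this host."""
    return max(host.used[r] + demand.get(r, 0.0) for r in RESOURCES)

def place_vm(hosts: List[Host], demand: Dict[str, float],
             limit: float = 0.9) -> Optional[Host]:
    """Return the host with the lowest post-placement bottleneck, or None."""
    candidates = [h for h in hosts if bottleneck_after(h, demand) <= limit]
    if not candidates:
        return None
    return min(candidates, key=lambda h: bottleneck_after(h, demand))

hosts = [
    Host("a", {"cpu": 0.7, "memory": 0.2, "disk_io": 0.1, "net_io": 0.1}),
    Host("b", {"cpu": 0.3, "memory": 0.5, "disk_io": 0.2, "net_io": 0.2}),
]
chosen = place_vm(hosts, {"cpu": 0.2, "memory": 0.1})
print(chosen.name if chosen else "no host fits")  # prints "b"
```

The point of scoring on the worst resource is that a host with spare cpu but saturated i/o is still a poor target; a single "load" number would hide that.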
what’s the payoff?
optimized capex + opex
cater to more workload with same capex + opex
optimized workload performance
translation …
costs go down
user + boss happiness go up :-)
3.
back to basics
constituents of cloud infrastructure
1. servers
2. storage devices
3. networking devices
infrastructure resource management
workload runs on servers
storage and networking are supporting/
enabling devices
focus on: servers, servers, servers!
what makes up a server?
1. cpu
2. memory
3. disk
4. network
these hardware parts are
managed by the
operating system,
typically linux.
what is a vm?
from host’s perspective: a vm is nothing more than a
process
vm only emulates a machine, relying on underlying host
to provide actual cpu, memory, disk(s) and network
resources
vm resource management: reduces to process management
cpu
linux is responsible for scheduling a process’ access to
the cpu
cpu
when process is ready to execute, linux queues the
process in an execution queue
cpu
length of execution queue: “system load”
cpu
processes in queue given time slices to execute on cpu
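On Linux the load figures are exposed in `/proc/loadavg`. The sketch below parses that format from a string, so it runs anywhere; the sample line is made up for illustration. Note that Linux load averages also count tasks in uninterruptible sleep (typically waiting on i/o), so the figure is not purely a cpu run-queue length.

```python
from typing import NamedTuple

class LoadAvg(NamedTuple):
    one_min: float
    five_min: float
    fifteen_min: float
    runnable: int   # currently runnable scheduling entities
    total: int      # total scheduling entities on the system

def parse_loadavg(text: str) -> LoadAvg:
    """Parse the content of /proc/loadavg, e.g. '3.50 2.10 1.05 4/512 12345'."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return LoadAvg(float(fields[0]), float(fields[1]), float(fields[2]),
                   int(runnable), int(total))

def load_per_cpu(load: float, cpu_count: int) -> float:
    """Normalize a load figure by the number of logical cpus."""
    return load / cpu_count

sample = "3.50 2.10 1.05 4/512 12345"  # made-up sample, not a measurement
la = parse_loadavg(sample)
print(load_per_cpu(la.one_min, 8))  # prints 0.4375
```

On a live Linux host the same parser could be fed `open("/proc/loadavg").read()`.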
a common approach
manage cpu resource based on system load, try to
fit/match system load to notional cpu capacity
or worse …
matching static vcpu reservation to static resource
capacity!
what’s the problem with these approaches?
problem: cpu has many sub-parts
cpu has multiple execution units (integer, fp, for a
start!)
cpu’s execution units handle different operations,
which can be mutually exclusive!
problem: cpu has many sub-parts
but (the simple) “system load” view does not
differentiate between them!
problem: cpu needs data in registers
cpu can only operate on data in registers
if data required is not in registers, the data will
have to be fetched from higher levels of memory
problem: cpu needs data in registers
cpu is idle when waiting for data!
… but “system load” will still show 1!
problem: cpu heterogeneity
with cpu heterogeneity, “system load” on one
machine may not necessarily translate to “system
load” on another machine (or even be linearly
comparable in any way!)
problem: cpu vs idleness
a loaded program that is idle does not require cpu
resource
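Actual cpu time consumed can be measured rather than inferred from reservations. A minimal sketch, assuming two samples of the aggregate `cpu` line from Linux `/proc/stat`, taken some interval apart; the sample lines are made up. Only the first eight counters are summed, because guest time is already included in the user and nice counters.

```python
from typing import List

def parse_cpu_line(line: str) -> List[int]:
    """Return user, nice, system, idle, iowait, irq, softirq, steal ticks."""
    parts = line.split()
    return [int(x) for x in parts[1:9]]

def busy_fraction(before: str, after: str) -> float:
    """Fraction of cpu time spent not idle between two /proc/stat samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total == 0:
        return 0.0
    idle = deltas[3] + deltas[4]  # idle + iowait
    return 1.0 - idle / total

# Made-up samples: the machine was idle for most of the interval.
t0 = "cpu 100 0 50 800 50 0 0 0 0 0"
t1 = "cpu 150 0 70 1700 80 0 0 0 0 0"
print(round(busy_fraction(t0, t1), 3))  # about 0.07
```

A host full of loaded but idle vms would show a low busy fraction here, whatever the sum of their vcpu reservations says.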
demand spikes
operating system queues processes and allocates
the processes time to execute on cpu
demand spikes can therefore be handled
gracefully
(longer) demand spikes?
demand “spikes” lasting more than “short” periods
of time are no longer spikes!
forecasting behavior?
operating system does not forecast behavior of
processes
why: expensive computationally, low likelihood of
accuracy
forecasting behavior
better: handle resource requirements dynamically
and effectively
cpu resources in cloud infrastructure
bad idea to allocate based on static “units” of cpu
and overcommitment
leads to gross wastage, while real resource
contention problems remain!
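A toy calculation shows both failure modes of static allocation. All numbers and function names here are hypothetical: a 32-core host, an overcommit ratio of 2, and vms that each reserve 8 vcpus.

```python
from typing import List

def static_vm_slots(cores: int, overcommit: float, vcpus_per_vm: int) -> int:
    """Number of vms admitted when counting reserved vcpus against capacity."""
    return int(cores * overcommit) // vcpus_per_vm

def utilization(live_core_demand: List[float], cores: int) -> float:
    """Live demand (in cores) divided by physical cores; above 1.0 means contention."""
    return sum(live_core_demand) / cores

cores = 32
slots = static_vm_slots(cores, overcommit=2.0, vcpus_per_vm=8)
print(slots)                                # prints 8

# Quiet period: each vm actually uses half a core, so most of the host is idle.
print(utilization([0.5] * slots, cores))    # prints 0.125

# Busy period: every vm uses all 8 vcpus at once, and the host is oversubscribed.
print(utilization([8.0] * slots, cores))    # prints 2.0
```

The static rule admits the same 8 vms in both cases, so it neither reclaims the idle capacity nor prevents the contention.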
cpu resources in cloud infrastructure
this is not news! …since 1968!
every time-sharing os since 1968 addresses this
problem!
the smart(er) way …
dynamic cpu resource allocation for vms
based on live utilization of each detailed subpart
resource of cpu
do it data-center-wide!
memory
a machine has multiple levels of memory hierarchy
(eg: caches, memory, swap)
not all memory is equal!
memory
operating system is responsible for moving data
throughout the memory hierarchy
data movement is not elemental, contrary to
popular assumption! data movement happens in
aggregate (for example, whole cache lines or pages)!
memory
memory organized as pages
physical memory is not allocated by linux at the moment
a process requests it, only when the pages are first touched
memory
linux actively moves a process’ memory pages out
to swap, bring back when needed
access to multiple levels of memory hierarchy has
limits
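The gap between memory a process has asked for and memory it actually occupies is visible in Linux `/proc/<pid>/status`. The sketch below parses the `VmSize` (virtual size), `VmRSS` (resident in RAM) and `VmSwap` (swapped out) fields from a text sample; the sample values and the process name are made up.

```python
from typing import Dict

def parse_vm_fields(status_text: str) -> Dict[str, int]:
    """Extract VmSize, VmRSS and VmSwap (in kB) from /proc/<pid>/status text."""
    wanted = {"VmSize", "VmRSS", "VmSwap"}
    out: Dict[str, int] = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            out[key] = int(rest.split()[0])  # value precedes the 'kB' unit
    return out

# Made-up sample for a vm process with 8 GiB requested.
sample = (
    "Name:\tqemu-kvm\n"
    "VmSize:\t 8388608 kB\n"
    "VmRSS:\t 1048576 kB\n"
    "VmSwap:\t  262144 kB\n"
)
vm = parse_vm_fields(sample)
print(vm["VmRSS"] / vm["VmSize"])  # prints 0.125
```

In this sample only an eighth of the requested memory is resident, which is the kind of live figure a static "reserved memory" view never sees.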
a common approach
manage memory based on simplified view of
reserved memory of a vm, and overcommit
what’s the problem with this approach?
problem: data locality
not taking into account data locality (and why the
data is there)
not all memory is equal
problem: data locality
limits on access to multiple levels of memory
hierarchy
operating system actively pages in/out a process’
memory
problem: cpu needs data in registers
cpu can only operate on data in registers
when data is needed by a process in its execution
on a cpu, it will need to be moved in to the
registers
and there are limits on access to the multiple
levels of memory hierarchy
memory resources in cloud infrastructure
bad idea to just manage based on grossly
simplified view of static “reserved memory” and
overcommitment
leads to gross wastage while not addressing
resource contention problems, ignoring how an
operating system works!
memory resources in cloud infrastructure
this is not news! … since 1970!
these are the problems handled by every virtual
memory operating system since 1970!
the smart(er) way …
dynamic memory resource management for vms
based on live utilization of each detailed resource
in the multiple memory hierarchy levels
do it data-center-wide!
storage
operating system is responsible for managing a
process’ access to storage devices
process does not read/write. process asks the
operating system to read/write!
storage
resource management: data volume and transfer
amount
physical limits on read/write access to data
a common approach
simply try to manage based on volume of
allocated storage, or try to be “smarter”, by
statically allocating i/o access
problem: getting to data
vms seldom access all data in allocated storage
volume at once: volume is not the key resource
management problem
bigger problem: physical limits on read/write
access to storage not taken into account
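A toy check makes the distinction concrete. The names (`VolumeSample`, `capacity_ok`, `io_saturation`) and all figures are hypothetical: three volumes that fit comfortably by size can still exceed what the device can serve in i/o operations per second.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VolumeSample:
    allocated_gb: int
    live_iops: float  # measured read+write operations per second

def capacity_ok(samples: List[VolumeSample], device_capacity_gb: int) -> bool:
    """Volume-based view: do the allocations fit on the device?"""
    return sum(s.allocated_gb for s in samples) <= device_capacity_gb

def io_saturation(samples: List[VolumeSample], device_iops_limit: float) -> float:
    """Live i/o demand as a fraction of the device limit; above 1.0 is contention."""
    return sum(s.live_iops for s in samples) / device_iops_limit

volumes = [
    VolumeSample(100, 4000.0),
    VolumeSample(100, 3000.0),
    VolumeSample(100, 2500.0),
]
print(capacity_ok(volumes, device_capacity_gb=1000))      # prints True
print(io_saturation(volumes, device_iops_limit=8000.0))   # prints 1.1875
```

The device is 30% full by volume yet oversubscribed by i/o, which is the situation a volume-only view cannot detect.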
storage resources in cloud infrastructure
bad idea to just manage storage volume
allocation
bad idea to allocate based on static “units” of i/o
access, with or without overcommitment
storage resources in cloud infrastructure
leads to gross wastage, while real resource
contention problems remain!
yes, this is not news either!
the smart(er) approach …
dynamic storage resource management for vms:
important to be able to read/write the data!
based on live i/o subsystem resource utilization
do it data-center-wide!
network
the operating system manages a process’ access
to network
networking stack (ip stack) handles HUGE variety
of network characteristics
operating system: queues network packets, flow
control, retransmission, etc.
network
bandwidth is not unlimited
bandwidth is not all … what about latency?
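Why latency matters can be seen from the usual bound for a single windowed flow: it cannot send more than one window of data per round trip. The sketch below applies that bound; it deliberately ignores packet loss, window scaling and congestion control, and the figures are illustrative.

```python
def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput of one flow: one window per round-trip time."""
    return window_bytes * 8 / rtt_s

def achievable_bps(link_bps: float, window_bytes: int, rtt_s: float) -> float:
    """A flow gets at most the smaller of the link rate and the window bound."""
    return min(link_bps, window_limited_bps(window_bytes, rtt_s))

link = 10e9          # 10 Gbit/s link
window = 64 * 1024   # 64 KiB window

# Same link, same window, different latency.
low_latency = achievable_bps(link, window, rtt_s=0.0005)   # 0.5 ms round trip
high_latency = achievable_bps(link, window, rtt_s=0.05)    # 50 ms round trip
print(f"{low_latency / 1e6:.1f} Mbit/s vs {high_latency / 1e6:.1f} Mbit/s")
```

With a 50 ms round trip the flow tops out at roughly 10.5 Mbit/s on a 10 Gbit/s link, so assigning more bandwidth to that vm would change nothing.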
a common approach
assign amount of network bandwidth per vm,
whether statically or based on a formulaic profile
problem: not all ports are equal
uplinks and downlinks are not symmetric
ports on different switches may not be equal
what about cross-section bandwidth?
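The asymmetry between downlinks and uplinks is commonly summarized as an oversubscription ratio. A minimal sketch with a hypothetical top-of-rack switch (the port counts and speeds are made up):

```python
from typing import List

def oversubscription_ratio(downlinks_gbps: List[float],
                           uplinks_gbps: List[float]) -> float:
    """Total server-facing bandwidth divided by total uplink bandwidth."""
    return sum(downlinks_gbps) / sum(uplinks_gbps)

# Hypothetical switch: 48 x 10 Gbit/s server ports, 4 x 40 Gbit/s uplinks.
ratio = oversubscription_ratio([10.0] * 48, [40.0] * 4)
print(ratio)  # prints 3.0
```

At a ratio of 3, per-vm bandwidth assignments that look safe at the server port can still collide on the uplinks whenever traffic leaves the rack.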
problem: networks are dynamic
contrast: data transfer vs flow control,
retransmission, network queue, buffers (for
example)
these network factors (and their underlying causes)
are not taken into account
network resources in cloud infrastructure
bad idea to just manage bandwidth allocation
(regardless of whether static or based on formulaic
profile)
not a solution for handling causes of
fragmentation, retransmission, flow control
problems, buffer bloat
network resources in cloud infrastructure
leads to gross wastage, while real resource
contention problems remain!
yes, this is not news either!
network resources in cloud infrastructure
dynamic network resource management for vms:
important to be able to send/receive the data!
based on live network i/o states
do it data-center-wide!
9.
“know” the data
center
full view of the data center
occurrence: what
reason: why
time points: when
locality: where
source: which
full view approach enables …
where to place new workload taking into account
live resource utilization across the facility
how to rebalance workload optimally on servers
taking into account live resource utilization across
the facility
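Rebalancing on live utilization can be sketched as a greedy loop. This is a deliberately simplified illustration, not the method of any product: it assumes a single scalar "live load" per vm, ignores migration cost and per-resource detail, and the function name `rebalance` is hypothetical.

```python
from typing import Dict, List, Tuple

def rebalance(hosts: Dict[str, Dict[str, float]],
              max_moves: int = 10) -> List[Tuple[str, str, str]]:
    """Greedily move vms from the busiest to the idlest host.

    `hosts` maps host name -> {vm name: live load}; it is modified in place.
    Returns the planned moves as (vm, source host, destination host).
    """
    moves: List[Tuple[str, str, str]] = []
    for _ in range(max_moves):
        load = {h: sum(vms.values()) for h, vms in hosts.items()}
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        # Moving a vm narrows the gap only if its load is smaller than the gap.
        movable = {vm: l for vm, l in hosts[src].items() if 0 < l < gap}
        if not movable:
            break
        # The best single move is the vm whose load is closest to half the gap.
        vm = min(movable, key=lambda v: abs(movable[v] - gap / 2))
        hosts[dst][vm] = hosts[src].pop(vm)
        moves.append((vm, src, dst))
    return moves

cluster = {"h1": {"a": 0.4, "b": 0.3, "c": 0.2}, "h2": {"d": 0.1}}
print(rebalance(cluster))  # prints [('a', 'h1', 'h2')]
```

In practice each planned move would be carried out by live migration, and the load figure would come from the per-resource measurements discussed earlier rather than a single number.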
more importantly …
to do so without negatively impacting overall
performance
knowing the data center solves …
catering to the workload with optimal number of
servers
maximize utilization on the servers without
impacting performance of the vms
knowing the data center …
costs go down
user + boss happiness go up :-)
10.
Sardina FishOS
Sardina FishOS
Smart Cloud Platform
OpenStack + Kubernetes
Full cloud platform: including
Ceph, management tools
Optional: Integrate OpenShift
Origin
Optimized OpEx + CapEx
World’s first energy-optimizing
and utilization-improving
Drop TCO by 60+%
Full VMware functionalities &
more at 10% of VMware price
Market-Leading Automation
Technology
Fully automated zero-
downtime operations
Infrastructure-as-code
AI-Driven
Provides AI-driven smart,
efficient, super-scalable cloud
automation technology
Automated smart fault
handling ensuring service
continuity
Built for Reliability
HA-by-default solution
architecture
Zero-downtime Upgrade
Scalable multi-site HA
Award Winning
DCD Global Award
IDC HPC Innovation Award
EU H2020 winner
UK Data 50 Award
and more
Cloud Platform Automation
— Built by Operators for
Operators
Fully automated Zero-
Downtime operations
Flexible, scalable, reliable,
efficient
11.
Success stories
Market-proven in large scale
environments
Successful FishOS implementations
Sun Trading (Finance)
❖ Top 10 trader on NY market
❖ 4 sites globally, managed as 1
Alpha System (Service Provider)
❖ FishOS solution blueprint
❖ Time scales: 1 day
❖ Hardware up at 1400
❖ FishOS environment up by 1700
Site: Government
❖ 150k VMs
❖ 1 site
Site: Research
❖ Major R&D site
❖ Full system design, retro-fit to fix
existing problematic, non-
upgradable system
Site: Financial Services
❖ Major European banking group
❖ 3 data centers in HQ location
❖ FishOS solution blueprint
❖ Time scales: 2 weeks
CFMS (Manufacturing)
❖ Aerospace design facility
❖ Time scales: 2 weeks
❖ 1 site
Successful FishOS implementations
UniCORE (IT Services)
❖ IT services: finance, insurance
sector
❖ FishOS solution blueprint
❖ Time scales: 1 day
❖ Accelerated development
U of Edinburgh (Academic,
Research)
❖ Largest UK public sector site
❖ Time scales: 8 weeks
❖ Integration: deployment,
authentication, storage
Site: IT Services
❖ Technical computing
❖ Time scales: 1 week
❖ High performance storage and
networking integration
Site: IT Services
❖ IT services: public sector, utilities
❖ FishOS solution blueprint
❖ Time scales: 1 day
Site: Machine Learning
❖ Major German public sector
❖ FishOS solution blueprint
❖ Time scales: 1 day
❖ GPU-powered workload
… your site?
want a job?
want a free FishOS system?
kenneth.tan@sardinasystems.com
+44 798 941 7838
+421 948 251 435