10. configuration/package
management
“what/how
do
things
get
installed?”
(10’s
of
machines)
hosts.txt
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com
$
ssh
host
./configure
&&
make
install
11. configuration/package
management
“what/how
do
things
get
installed?”
(10’s
of
machines)
hosts.txt
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com
$
ssh
host
rpm
-‐ivh
pkg-‐x.y.z.rpm
12. deployment
“what
should
run
where?”
“how
should
it
be
started/stopped?”
$
ssh
host
nohup
myapp
(10’s
of
machines)
hosts.txt
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com
13. deployment
“what
should
run
where?”
“how
should
it
be
started/stopped?”
$
ssh
host
monit
start
myapp
(10’s
of
machines)
hosts.txt
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com
14. deployment
“what
should
run
where?”
“how
should
it
be
started/stopped?”
$
scp
myapp
host
$
ssh
host
monit
myapp
(10’s
of
machines)
hosts.txt
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com
15. deployment
“what
should
run
where?”
“how
should
it
be
started/stopped?”
(10’s
of
machines)
$
ssh
host
git
pull
&&
monit
myapp
hosts.txt
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com
16. “how
should
apps
find
each
other?”
naming
webhosts.txt
web1.twttr.com
web2.twttr.com
web3.twttr.com
web4.twttr.com
dbhosts.txt
db1.twttr.com
db2.twttr.com
db3.twttr.com
db4.twttr.com
(10’s
of
machines)
17. (10’s
of
machines)
(100’s
-‐>
1000’s
of
machines)
to
scale,
need
more
automation
65. Mesos
service
batch
storage
…
streaming
(1) when
resources
become
idle,
can
be
scheduled
and
reused
by
other
schedulers
66. Mesos
service
batch
storage
…
streaming
(1) when
resources
become
idle,
can
be
scheduled
and
reused
by
other
schedulers
67. Mesos
service
batch
storage
…
streaming
(1) when
resources
become
idle,
can
be
scheduled
and
reused
by
other
schedulers
68. Mesos
service
batch
storage
…
streaming
(1) when
resources
become
idle,
can
be
scheduled
and
reused
by
other
schedulers
69. Mesos
service
batch
storage
…
streaming
(1) when
resources
become
idle,
can
be
scheduled
and
reused
by
other
schedulers
70. Mesos
service
batch
storage
…
streaming
(1) when
resources
become
idle,
can
be
scheduled
and
reused
by
other
schedulers
(2) multi-‐tenancy
on
individual
machines
71. Mesos
service
batch
storage
…
streaming
(1) when
resources
become
idle,
can
be
scheduled
and
reused
by
other
schedulers
(2) multi-‐tenancy
on
individual
machines
77. deployment
Apache
Aurora
(incubating),
a
scheduler
for
running
stateless
services
written
in
any
language
(but
primarily
used
at
Twitter
for
JVM
services)
78. deployment
(via
Aurora)
developers
(1) describe
service
using
Python
based
DSL
(2)
submit
service
to
Aurora
using
CLI
79. deployment
(via
Marathon)
developers
(1) describe
services
using
JSON
(2)
submit
service
to
Marathon
via
REST
81. naming
(1) task
gets
launched
on
machine
Apache
ZooKeeper
using
Apache
ZooKeeper
and
server
sets
(github.com/twitter/commons)
82. naming
(2)
service
gets
registered
in
a
server
set
in
ZooKeeper
(1) task
gets
launched
on
machine
Apache
ZooKeeper
using
Apache
ZooKeeper
and
server
sets
(github.com/twitter/commons)
83. naming
(2)
service
gets
registered
in
a
server
set
in
ZooKeeper
(1) task
gets
launched
on
machine
(3)
other
services
use
ZooKeeper
to
find
services
they
need
Apache
ZooKeeper
using
Apache
ZooKeeper
and
server
sets
(github.com/twitter/commons)
84. naming
(2)
service
gets
registered
in
a
server
set
in
ZooKeeper
(1) task
gets
launched
on
machine
(3)
other
services
use
ZooKeeper
to
find
services
they
need
(4)
services
connect
directly
with
one
another
Apache
ZooKeeper
using
Apache
ZooKeeper
and
server
sets
(github.com/twitter/commons)
85. naming
alternative
(2)
update
HAProxy
with
new
service
location
(1) task
gets
launched
on
machine
(3)
other
services
send
traffic
through
HAProxy
ZooKeeper/server sets requires injecting code into your clients!
87. where
are
we
today?
ops
developers
deploys
decoupled
from
ops
(many
deploys
per
day,
per
service)
maintenance
consists
of
“draining”
hosts,
getting
tasks
rescheduled,
then
pulling
the
cord
95. challenges revisited
① failures
② maintenance
③ utilization
public
or
private
IaaS,
failures
still
occur
(on
EC2,
instead
of
racks,
have
availability
zones,
instead
of
datacenters,
have
regions)
96. challenges revisited
① failures
② maintenance
③ utilization
provider
wins
with
public
IaaS,
better
resource
sharing
with
private
IaaS,
but
a
static
partition
of
VMs
is
still
a
static
partition!
118. operating
system
“a
collection
of
software
that
manages
the
computer
hardware
resources
and
provides
common
services
for
computer
programs”
- Wikipedia
119. datacenter
operating
system
“a
collection
of
software
that
manages
the
datacenter
computer
hardware
resources
and
provides
common
services
for
computer
programs”
- Wikipedia
120. datacenter
operating
system
“a
collection
of
software
that
manages
the
datacenter
computer
hardware
resources
and
provides
common
services
for
computer
programs”
- Wikipedia
121. today
Your App
API
tomorrow
datacenter
OS
provides
common
functionality
every
new
distributed
system
re-‐
implements:
•
failure
detection
•
package
distribution
•
task
starting
•
resource
isolation
•
resource
monitoring
•
task
killing,
cleanup
•
…
122. today
Your App
API
tomorrow
provides
common
functionality
every
new
distributed
system
re-‐
implements:
•
failure
detection
•
package
distribution
•
task
starting
•
resource
isolation
•
resource
monitoring
•
task
killing,
cleanup
•
…
datacenter
OS
Don’t
reinvent
the
wheel!