A journey in the public clouds

A Journey In The Public Clouds
With Datadog

Alexis Lê-Quôc (Product Guy) at Datadog
IASA New York Chapter
June 28th, 2011

What I’m going to talk about
‣What we do and for whom
‣The kind of data we deal with
‣Our architecture
‣Our architecture in a public cloud (AWS)
‣What we learned
‣Q+A

SaaS Platform for
Aggregation, Correlation, Collaboration
For Dev & Ops

What do we do?

The Mess
‣Too many data streams, too many silos
‣Too many choices to make, too often
‣Only getting worse as SaaS silos multiply
‣Separate Dev and Ops teams, looking at separate data streams

[Diagram: the Dev team, Ops team, and Applications surrounded by Usage Analytics, IAAS / PAAS, Issue Resolution, Servers and Devices, Cap. Planning, SDLC support, Monitoring, Hosting, Asset Mgmt, and CDNs. Arrows between them are labeled metrics, events, insights, billing, changes, choices, and advice + feedback.]

Data-driven decision making in IT rarely happens.
Too slow, too expensive, requires too much discipline.

We Simplify
Datadog to the rescue

Data in: system metrics, quality metrics, SaaS data, capacity metrics, usage analytics, cloud billing, code metrics, config changes, IaaS pricing, perf. data, vendors info, curated metadata

Data out:
‣key metrics and visibility to Alice Dev
‣recommendations and visibility to Bob Ops
‣business metrics to Charlie CEO

Aggregation Correlation Collaboration

AGGREGATION
Aggregation

https://app.datad0g.com/dash/dash/1000#/date_range/1308057152698-1308143552698
Correlation

What Architecture For
What Kind Of Data?

Events: User comments, Alert, Build, Batch job
Metrics: Unique visitors, Load, Transaction duration, etc.

ACID
Atomicity
Consistency
Isolation
Durability
e.g. SQL DBs

BASE
Basically Available
Soft-state
Eventual consistency
e.g. DNS

CLASSICS
http://en.wikipedia.org/wiki/Eventual_consistency

Data
Intensive
Real
Time

e.g. real-time web

NEWCOMER
Bryan Cantrill: http://dtrace.org/resources/bmc/DIRT.pdf

Aggregation: BASE + DIRT
Constant data influx
Large data sets

Correlation: BASE
On-demand visualization
Background data analysis

Collaboration: DIRT
Real-time updates
On-the-fly data analysis

Datadog = DIRT + BASE + a tiny bit of ACID

How It All Fits Together
http://www.flickr.com/photos/tom-margie/1253798184/

Architecture
Simplified
[Diagram: architecture components labeled BASE, DIRT, and ACID]

4 Dimensions
Compute
Storage
Network
Management

ON-PREMISE TRAITS
http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/

Compute: Fast, Inelastic
Network: Fast, Localized
Storage: Fast, Centralized, Redundant
Management: People-based, Full access

ON-PREMISE TRAITS

Compute: Slow, Elastic
Network: “Fast”, Geo-distributed
Storage: Slow, Jittery, Maybe durable, Low memory
Management: No bare-metal, “Magic” API

CLOUD TRAITS

Network
Layer 2: Virtual Domain
Layer 3: Crude Edge Filtering
Layer 7: Crude Load Balancing
DNS
CDN
It Works!

[Chart: Latency vs. Capacity]
BASE: Amazon S3
BASE: Apache Cassandra
ACID: PostgreSQL
DIRT: Redis

Constraints stamped on the chart:
Latency (Amazon S3)
Jittery, Limited throughput (Apache Cassandra, PostgreSQL)
Low memory (Redis)

Storage

Low Memory
http://aws.amazon.com/ec2/#instance

Jittery, Limited Throughput
Network Block Storage (EBS)

https://app.datad0g.com/dash/dash/1032#/date_range/1308608717016-1309213517016

Time         DEV       tps     rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await   svctm  %util
03:35:02 PM  dev8-80   375.95  23614.08  5.70      62.83     47.21     125.58  1.26   47.34
03:35:02 PM  dev8-96   373.63  23749.65  5.64      63.58     45.55     121.91  1.22   45.72
03:35:02 PM  dev8-112  375.28  23693.47  5.52      63.15     45.52     121.22  1.23   46.31
03:35:02 PM  dev8-128  375.31  23721.57  7.19      63.22     56.00     148.96  1.34   50.35

rd_sec/s: read throughput in sectors/s (total: 368Mb/s)
await: average wait in ms
svctm: average service time in ms

Limited Throughput In Numbers
RAID 0 EBS Volumes, m1.large instances
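
As a rough cross-check of the total quoted above, the per-volume read rates can be summed and converted to bytes. This is only a sketch: it assumes the 512-byte sector size that sysstat documents for `rd_sec/s`, and the function name is illustrative rather than part of any tool.

```python
# rd_sec/s values copied from the sar output above, one per EBS volume.
READ_SECTORS_PER_SEC = {
    "dev8-80": 23614.08,
    "dev8-96": 23749.65,
    "dev8-112": 23693.47,
    "dev8-128": 23721.57,
}

SECTOR_BYTES = 512  # assumption: sysstat reports 512-byte sectors


def total_read_throughput(sectors_per_sec, sector_bytes=SECTOR_BYTES):
    """Sum per-volume read rates and express the total in several units."""
    total_sectors = sum(sectors_per_sec.values())
    bytes_per_sec = total_sectors * sector_bytes
    return {
        "sectors_per_sec": total_sectors,
        "megabytes_per_sec": bytes_per_sec / 1e6,
        "megabits_per_sec": bytes_per_sec * 8 / 1e6,
        "mebibits_per_sec": bytes_per_sec * 8 / 2**20,
    }


if __name__ == "__main__":
    for unit, value in total_read_throughput(READ_SECTORS_PER_SEC).items():
        print(f"{unit}: {value:.1f}")
```

Under that assumption the four volumes together read about 48.5 megabytes per second, which is close to the slide's 368Mb/s if that figure is read as megabits per second.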

Software RAID
RAID 0 volume
Offsite backups
Caveat: limited by slowest volume

Ephemeral volumes
Streaming replication
S3 backups and offsite backups
Caveat: complexity, Recovery Time Objective, Recovery Point Objective

Database Service
MySQL/Oracle RDS
Caveat: trust (RDS outage 2 months ago)

Some Tricks
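
The "limited by slowest" caveat on RAID 0 can be illustrated with a deliberately simplified model: a striped read touches every volume, so the array moves at the pace of its slowest member. The function name and the sample figures are hypothetical, and real arrays also depend on stripe size, queue depth, and the instance's network link.

```python
from typing import Sequence


def raid0_throughput(volume_throughputs: Sequence[float]) -> float:
    """Estimate striped-array throughput under a slowest-member model.

    Each stripe spans all volumes, so every volume effectively runs at the
    speed of the slowest one. Units are whatever the inputs use (e.g. MB/s).
    """
    if not volume_throughputs:
        raise ValueError("need at least one volume")
    return len(volume_throughputs) * min(volume_throughputs)


if __name__ == "__main__":
    # Hypothetical figures: three healthy volumes and one jittery one.
    print(raid0_throughput([12.0, 12.0, 12.0, 6.0]))
```

In this model a single jittery volume drags the whole array down, which is why adding volumes helps less than the raw sum of their throughputs would suggest.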

Network Block Storage
Is The Dark Side

Bait For Enterprise
Customers

Hard Problem For
Cloud Providers

Don’t rely on networked block storage
Small data sets only if you have to

Don’t trust data-at-rest
Copy, replicate, back up

Do use S3 if you can
Object semantics a limitation
Slow but durable

Some Do’s And Don’ts

[Chart: “Performance” vs. Number]
ACID Nodes: Scale up, Shard
BASE Nodes, DIRT Nodes: Add more

Compute

Don’t rely on scale-ups
Low memory a hard limit for DBs
Noisy neighbors
Individual performance poor and jittery

Scale out
First scale up
Then Shard
Parallelize across machines
Vector-processing via GPUs
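
Sharding and parallelizing across machines needs some rule for deciding which node owns which key. A minimal sketch, assuming simple hash-modulo placement (the talk does not say which scheme Datadog uses, and `shard_for` is a hypothetical name):

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stable hash."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # hashlib gives the same digest on every machine and every run,
    # unlike Python's built-in hash(), which is salted per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for host in ("web-01", "web-02", "db-01"):
        print(host, "-> shard", shard_for(host, 4))
```

Modulo placement is easy to reason about but moves most keys when the shard count changes; consistent hashing is a common alternative when nodes are added or removed often.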


An API for everything
Compute
Storage
Network
Management

Do use the AWS APIs
Almost like magic
Rich libraries
Ever expanding

Do use tools
e.g. Chef, Puppet, cfengine, etc.
Datadog

Do Kill and Respawn
Low-level debugging impossible
Instance creation is cheap
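
The kill-and-respawn idea can be sketched as a small loop that replaces any unhealthy instance instead of debugging it. Everything here is hypothetical: `is_healthy`, `terminate`, and `launch` stand in for whatever health check and cloud API calls an environment provides, and are passed in so the sketch stays independent of any particular library.

```python
from typing import Callable, List


def kill_and_respawn(
    instances: List[str],
    is_healthy: Callable[[str], bool],
    terminate: Callable[[str], None],
    launch: Callable[[], str],
) -> List[str]:
    """Return the fleet after replacing each unhealthy instance with a new one."""
    fleet = []
    for instance_id in instances:
        if is_healthy(instance_id):
            fleet.append(instance_id)
        else:
            # No low-level debugging: drop the instance and start a fresh one.
            terminate(instance_id)
            fleet.append(launch())
    return fleet


if __name__ == "__main__":
    new_ids = iter(["i-new-1", "i-new-2"])
    fleet = kill_and_respawn(
        ["i-a", "i-b", "i-c"],
        is_healthy=lambda instance_id: instance_id != "i-b",
        terminate=lambda instance_id: print("terminating", instance_id),
        launch=lambda: next(new_ids),
    )
    print(fleet)
```

This only works when instances hold no irreplaceable state, which ties back to the earlier advice to copy, replicate, and back up data rather than trust it at rest.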


New Rules
New Tools
New Playbook

Same Fundamentals

Questions!

http://datadoghq.com
twitter: @alq
