In this session you'll learn about the decisions that went into designing and building DynamoDB, and how it lets you stay focused on your application while enjoying single-digit millisecond latencies at any scale. We'll dive deep into how to model data, maintain maximum throughput, and drive analytics against your data, while profiling real-world use cases and sharing tips and tricks from customers running on DynamoDB today.
33. id = 100   date = 2012-05-16-09-00-10   total = 25.00
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
34. Table
    id = 100   date = 2012-05-16-09-00-10   total = 25.00
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
35. Item
    id = 100   date = 2012-05-16-09-00-10   total = 25.00   ← one item
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
36. Attribute
    id = 100   date = 2012-05-16-09-00-10   total = 25.00   ← "total = 25.00" is one attribute
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
37. Where is the schema?
Tables do not require a formal schema.
Items are an arbitrarily sized hash.
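Since items are schema-free hashes, two items in the same table can carry different attributes. A minimal sketch, using plain dicts in DynamoDB's attribute-value wire shape (the `coupon` attribute is a hypothetical extra, not from the slides):

```python
# Two items in the same table need not share attributes.
order = {
    "id": {"N": "100"},
    "date": {"S": "2012-05-16-09-00-10"},
    "total": {"N": "25.00"},
}

# A second item can carry attributes the first lacks.
order_with_coupon = {
    "id": {"N": "101"},
    "date": {"S": "2012-05-16-12-00-10"},
    "total": {"N": "100.00"},
    "coupon": {"S": "SPRING2012"},  # hypothetical attribute
}

# Only the key attributes must appear on every item.
shared = set(order) & set(order_with_coupon)
```

No table-level schema change is needed to add `coupon`; it simply exists on the items that have it.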
38. Indexing.
Items are indexed by primary and secondary keys.
Primary keys can be composite.
Secondary keys are local to the table.
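A composite primary key pairs a hash (partition) element with a range element. A sketch of the key definition in the shape boto3's `create_table` accepts, reusing the `id`/`date` attributes from the earlier slides (the table name is an assumption):

```python
# Composite primary key: "id" is the hash element, "date" the range element.
table_definition = {
    "TableName": "orders",  # hypothetical table name
    "KeySchema": [
        {"AttributeName": "id", "KeyType": "HASH"},     # partition key
        {"AttributeName": "date", "KeyType": "RANGE"},  # sort key
    ],
    "AttributeDefinitions": [
        {"AttributeName": "id", "AttributeType": "N"},
        {"AttributeName": "date", "AttributeType": "S"},
    ],
}
```

Items sharing an `id` are then stored together, sorted by `date`.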
48. One API call, multiple items
BatchGet returns multiple items by key.
BatchWrite performs up to 25 put or delete operations.
Throughput is measured by IO, not API calls.
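Because BatchWrite caps out at 25 operations per call, a larger workload has to be split into chunks. A minimal sketch of that chunking:

```python
def batches(requests, size=25):
    """Split a list of write requests into BatchWrite-sized chunks
    (at most 25 per API call)."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]

# 60 writes become three calls: 25 + 25 + 10 items.
chunks = batches(list(range(60)))
```

Note that the 25-call limit does not change billing: each item written still consumes its own write IO, whether sent individually or in a batch.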
50. Query vs Scan
Query returns items by key.
Scan reads the whole table sequentially.
51. Query patterns
Retrieve all items by hash key.
Range key conditions:
==, <, >, >=, <=, begins with, between.
Counts. Top and bottom n values.
Paged responses.
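The query patterns above can be modeled in memory: fix a hash key, apply a condition to the range key, and get results back in sorted order. An illustrative sketch (not the DynamoDB API itself), using the items from the earlier slides:

```python
# (hash key, range key) -> attribute, mirroring the id/date/total table.
items = {
    ("100", "2012-05-16-09-00-10"): 25.00,
    ("101", "2012-05-15-15-00-11"): 35.00,
    ("101", "2012-05-16-12-00-10"): 100.00,
}

def query(hash_key, cond=lambda r: True):
    """All range keys under one hash key that satisfy cond, in sorted order."""
    keys = sorted(r for h, r in items if h == hash_key)
    return [r for r in keys if cond(r)]

all_101 = query("101")                                        # all items by hash key
may_16 = query("101", lambda r: r.startswith("2012-05-16"))   # "begins with"
```

Counts and top/bottom n fall out of the same shape: `len(...)` for counts, slicing the sorted result for top or bottom values.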
67. Uniform workload.
Data stored across multiple partitions.
Data is primarily distributed by primary key.
Provisioned throughput is divided evenly across partitions.
68. To achieve and maintain full provisioned throughput, spread workload evenly across hash keys.
70. BEST PRACTICE 1:
Distinct values for hash keys.
Hash key elements should have a
high number of distinct values.
71. Lots of users with unique user_id.
    Workload well distributed across hash key.
    user_id    first_name   last_name
    mza        Matt         Wood
    jeffbarr   Jeff         Barr
    werner     Werner       Vogels
    simone     Simone       Brunozzi
    ...        ...          ...
72. BEST PRACTICE 2:
Avoid limited hash key values.
Hash key elements should have a
high number of distinct values.
73. Small number of status codes.
    Uneven, non-uniform workload.
    status   date
    200      2012-04-01-00-00-01
    404      2012-04-01-00-00-01
    404      2012-04-01-00-00-01
    404      2012-04-01-00-00-01
74. BEST PRACTICE 3:
Model for even distribution.
Access by hash key value should be evenly
distributed across the dataset.
75. Large number of devices.
    A small number are much more popular than the others.
    Workload unevenly distributed.
    mobile_id   access_date
    100         2012-04-01-00-00-01
    100         2012-04-01-00-00-02
    100         2012-04-01-00-00-03
    100         2012-04-01-00-00-04
    ...         ...
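One common remedy for a hot hash key like this is write sharding: append a deterministic suffix so one popular device's writes spread across several hash keys. A sketch under assumptions (the shard count and suffix scheme are illustrative, not from the talk):

```python
import hashlib

SHARDS = 10  # assumption: spread each hot mobile_id across 10 sub-keys

def sharded_key(mobile_id, access_date):
    """Derive a stable shard suffix from the range key so writes for one
    device land on several hash keys instead of one hot partition."""
    digest = hashlib.md5(access_date.encode()).hexdigest()
    return f"{mobile_id}_{int(digest, 16) % SHARDS}"
```

The trade-off: reads for a device must now fan out across all `SHARDS` suffixed keys and merge the results.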
84. About Localytics
• Mobile App Analytics Service
• 750+ Million Devices and over 20,000 Apps
• Customers Include:
…and many more.
85. About the Development Team
• Small team of four managing the entire AWS infrastructure (100 EC2 instances)
• Experts in Big Data
• Leveraging Amazon's services has been key to our success
• Large scale users of:
• SQS
• S3
• ELB
• RDS
• Route53
• ElastiCache
• EMR
…and of course DynamoDB
87. Our use-case: Dedup Data
• Each datapoint includes a globally unique ID
• Mobile traffic over 2G/3G will upload periodic duplicate data
• We accept data up to a 28 day window
88. First Design for Dedup table
Unique ID: aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333
Table Name = dedup_table
ID
aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333
"Test and Set" in a single operation
89. Optimization One - Data Aging
• Partition by Month
• Create the new table the day before the month begins
• Need to keep two months of data
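Deriving the month-partitioned table name is a one-liner; a sketch matching the `March2013_dedup` naming shown on the next slides (the helper name is mine):

```python
from datetime import date

def dedup_table_name(day):
    """Month-partitioned table name, e.g. date(2013, 4, 1) -> 'April2013_dedup'."""
    return day.strftime("%B%Y") + "_dedup"
```

Aging out old data then becomes a DeleteTable on the expired month's table, which is far cheaper than deleting billions of individual items.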
90. Optimization One - Data Aging
Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
Check Previous month:
Table Name = March2013_dedup
ID
aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111 Not Here!
aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
91. Optimization One - Data Aging
Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
Test and Set in current month:
Table Name = April2013_dedup
ID
bbbbbbbbbbbbbbbbbbbbbbbbb111111111111111
bbbbbbbbbbbbbbbbbbbbbbbbb222222222222222
bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333 Inserted
92. Optimization Two
• Reduce the index size, which reduces costs
• Each item has a 100-byte overhead, which is substantial at this scale
• Combine multiple IDs into one record
• Split each ID into two halves
  o First half is the key; second half is added to the set
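The steps above can be sketched in memory: split each ID at the midpoint, use the first half as the hash key, and keep the second halves in a set on that item, so many IDs amortize one item's 100-byte overhead. Illustrative only; the exact split point in production is not specified beyond "two halves":

```python
table = {}  # hash key (first half of the ID) -> set of second halves

def add_id(unique_id):
    """Test-and-set an ID stored as prefix -> {suffixes}. Returns True if
    the ID was new, False if it was a duplicate."""
    mid = len(unique_id) // 2
    prefix, suffix = unique_id[:mid], unique_id[mid:]
    seen = table.setdefault(prefix, set())
    if suffix in seen:
        return False
    seen.add(suffix)
    return True
```

IDs sharing a prefix collapse into a single item, which is where the index-size (and cost) saving comes from.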
94. Optimization Three - Combine Months
• Go back to a single table
Prefix           March2013                            April2013
aaaaaaaaaa...    [111111111111111, 22222222222...     [1212121212121212, 3434343434...
bbbbbbbbbb...    [444444444444444, 555555555...       [4545454545454545, 6767676767...
ccccccccccc...   [777777777777777, 888888888...       [8989898989898989, 1313131313...
One operation:
1. Delete February2013 field
2. Check ID in March2013
3. Test and set into April2013
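The single combined operation can be modeled against a dict of `{prefix: {month: set-of-suffixes}}`. In production this would be one conditional UpdateItem; the sketch below spells out the three steps explicitly (function and parameter names are mine):

```python
table = {}  # prefix -> {month_field: set of ID suffixes}

def dedup(prefix, suffix, prev_month, cur_month, expired_month):
    """One logical operation: drop the expired month's field, check the
    previous month, then test-and-set into the current month."""
    row = table.setdefault(prefix, {})
    row.pop(expired_month, None)               # 1. delete expired field
    if suffix in row.get(prev_month, set()):   # 2. check previous month
        return False
    current = row.setdefault(cur_month, set()) # 3. test and set current
    if suffix in current:
        return False
    current.add(suffix)
    return True
```

Folding the months back into one table removes the cross-table read, which is why the read cost drops to zero in the recap below.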
95. Recap
Compare Plans for 20 Billion IDs per month
Plan                   Storage Costs   Read Costs   Write Costs   Total    Savings
Naive (after a year)   $8400           0            $4000         $12400   -
Data Age               $900            $350         $4000         $5250    57%
Using Sets             $150            $350         $4000         $4500    64%
Multiple Months        $150            0            $4000         $4150    67%