In this session you'll learn about the decisions that went into designing and building DynamoDB, and how it lets you stay focused on your application while enjoying single-digit millisecond latencies at any scale. We'll dive deep into how to model data, maintain maximum throughput, and drive analytics against your data, while profiling real-world use cases and sharing tips and tricks from customers running on DynamoDB today.
33. id = 100   date = 2012-05-16-09-00-10   total = 25.00
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
34. Table
    id = 100   date = 2012-05-16-09-00-10   total = 25.00
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
35. Item
    id = 100   date = 2012-05-16-09-00-10   total = 25.00   ← one item
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
36. Attribute
    id = 100   date = 2012-05-16-09-00-10   total = 25.00   ← "total = 25.00" is one attribute
    id = 101   date = 2012-05-15-15-00-11   total = 35.00
    id = 101   date = 2012-05-16-12-00-10   total = 100.00
37. Where is the schema?
Tables do not require a formal schema.
Items are an arbitrarily sized hash.
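Since items are schema-free hashes, two items in the same table can carry different attributes. A minimal sketch, using plain dicts in DynamoDB's attribute-value wire shape (the `coupon` attribute is a hypothetical extra, not from the slides):

```python
# Two items in the same table need not share attributes.
order = {
    "id": {"N": "100"},
    "date": {"S": "2012-05-16-09-00-10"},
    "total": {"N": "25.00"},
}

# A second item can carry attributes the first lacks.
order_with_coupon = {
    "id": {"N": "101"},
    "date": {"S": "2012-05-16-12-00-10"},
    "total": {"N": "100.00"},
    "coupon": {"S": "SPRING2012"},  # hypothetical attribute
}

# Only the key attributes must appear on every item.
shared = set(order) & set(order_with_coupon)
```

No table-level schema change is needed to add `coupon`; it simply exists on the items that have it.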
38. Indexing.
Items are indexed by primary and secondary keys.
Primary keys can be composite.
Secondary keys are local to the table.
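A composite primary key pairs a hash (partition) element with a range element. A sketch of the key definition in the shape boto3's `create_table` accepts, reusing the `id`/`date` attributes from the earlier slides (the table name is an assumption):

```python
# Composite primary key: "id" is the hash element, "date" the range element.
table_definition = {
    "TableName": "orders",  # hypothetical table name
    "KeySchema": [
        {"AttributeName": "id", "KeyType": "HASH"},     # partition key
        {"AttributeName": "date", "KeyType": "RANGE"},  # sort key
    ],
    "AttributeDefinitions": [
        {"AttributeName": "id", "AttributeType": "N"},
        {"AttributeName": "date", "AttributeType": "S"},
    ],
}
```

Items sharing an `id` are then stored together, sorted by `date`.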
48. One API call, multiple items
BatchGet returns multiple items by key.
BatchWrite performs up to 25 put or delete operations.
Throughput is measured by IO, not API calls.
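Because BatchWrite caps out at 25 operations per call, a larger workload has to be split into chunks. A minimal sketch of that chunking:

```python
def batches(requests, size=25):
    """Split a list of write requests into BatchWrite-sized chunks
    (at most 25 per API call)."""
    return [requests[i:i + size] for i in range(0, len(requests), size)]

# 60 writes become three calls: 25 + 25 + 10 items.
chunks = batches(list(range(60)))
```

Note that the 25-call limit does not change billing: each item written still consumes its own write IO, whether sent individually or in a batch.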
50. Query vs Scan
Query returns items by key.
Scan reads the whole table sequentially.
51. Query patterns
Retrieve all items by hash key.
Range key conditions:
==, <, >, >=, <=, begins with, between.
Counts. Top and bottom n values.
Paged responses.
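The query patterns above can be modeled in memory: fix a hash key, apply a condition to the range key, and get results back in sorted order. An illustrative sketch (not the DynamoDB API itself), using the items from the earlier slides:

```python
# (hash key, range key) -> attribute, mirroring the id/date/total table.
items = {
    ("100", "2012-05-16-09-00-10"): 25.00,
    ("101", "2012-05-15-15-00-11"): 35.00,
    ("101", "2012-05-16-12-00-10"): 100.00,
}

def query(hash_key, cond=lambda r: True):
    """All range keys under one hash key that satisfy cond, in sorted order."""
    keys = sorted(r for h, r in items if h == hash_key)
    return [r for r in keys if cond(r)]

all_101 = query("101")                                        # all items by hash key
may_16 = query("101", lambda r: r.startswith("2012-05-16"))   # "begins with"
```

Counts and top/bottom n fall out of the same shape: `len(...)` for counts, slicing the sorted result for top or bottom values.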
67. Uniform workload.
Data stored across multiple partitions.
Data is primarily distributed by primary key.
Provisioned throughput is divided evenly across partitions.
68. To achieve and maintain full provisioned throughput, spread workload evenly across hash keys.
70. BEST PRACTICE 1:
Distinct values for hash keys.
Hash key elements should have a
high number of distinct values.
71. Lots of users with unique user_id.
    Workload well distributed across hash key.
    user_id    first_name   last_name
    mza        Matt         Wood
    jeffbarr   Jeff         Barr
    werner     Werner       Vogels
    simone     Simone       Brunozzi
    ...        ...          ...
72. BEST PRACTICE 2:
Avoid limited hash key values.
Hash key elements should have a
high number of distinct values.
73. Small number of status codes.
    Uneven, non-uniform workload.
    status   date
    200      2012-04-01-00-00-01
    404      2012-04-01-00-00-01
    404      2012-04-01-00-00-01
    404      2012-04-01-00-00-01
74. BEST PRACTICE 3:
Model for even distribution.
Access by hash key value should be evenly
distributed across the dataset.
75. Large number of devices.
    A small number are much more popular than the others.
    Workload unevenly distributed.
    mobile_id   access_date
    100         2012-04-01-00-00-01
    100         2012-04-01-00-00-02
    100         2012-04-01-00-00-03
    100         2012-04-01-00-00-04
    ...         ...
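One common remedy for a hot hash key like this is write sharding: append a deterministic suffix so one popular device's writes spread across several hash keys. A sketch under assumptions (the shard count and suffix scheme are illustrative, not from the talk):

```python
import hashlib

SHARDS = 10  # assumption: spread each hot mobile_id across 10 sub-keys

def sharded_key(mobile_id, access_date):
    """Derive a stable shard suffix from the range key so writes for one
    device land on several hash keys instead of one hot partition."""
    digest = hashlib.md5(access_date.encode()).hexdigest()
    return f"{mobile_id}_{int(digest, 16) % SHARDS}"
```

The trade-off: reads for a device must now fan out across all `SHARDS` suffixed keys and merge the results.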
84. About Localytics
• Mobile App Analytics Service
• 750+ Million Devices and over 20,000 Apps
• Customers Include:
…and many more.
85. About the Development Team
• Small team of four managing the entire AWS infrastructure (100 EC2 instances)
• Experts in Big Data
• Leveraging Amazon's services has been key to our success
• Large scale users of:
• SQS
• S3
• ELB
• RDS
• Route53
• ElastiCache
• EMR
…and of course DynamoDB
87. Our use-case: Dedup Data
• Each datapoint includes a globally unique ID
• Mobile traffic over 2G/3G will upload periodic duplicate data
• We accept data up to a 28 day window
88. First Design for Dedup table
Unique ID: aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333
Table Name = dedup_table
ID
aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111
aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
aaaaaaaaaaaaaaaaaaaaaaaaa333333333333333
"Test and Set" in a single operation
89. Optimization One - Data Aging
• Partition by Month
• Create the new table the day before the month begins
• Need to keep two months of data
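Deriving the month-partitioned table name is a one-liner; a sketch matching the `March2013_dedup` naming shown on the next slides (the helper name is mine):

```python
from datetime import date

def dedup_table_name(day):
    """Month-partitioned table name, e.g. date(2013, 4, 1) -> 'April2013_dedup'."""
    return day.strftime("%B%Y") + "_dedup"
```

Aging out old data then becomes a DeleteTable on the expired month's table, which is far cheaper than deleting billions of individual items.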
90. Optimization One - Data Aging
Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
Check Previous month:
Table Name = March2013_dedup
ID
aaaaaaaaaaaaaaaaaaaaaaaaa111111111111111 Not Here!
aaaaaaaaaaaaaaaaaaaaaaaaa222222222222222
91. Optimization One - Data Aging
Unique ID: bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333
Test and Set in current month:
Table Name = April2013_dedup
ID
bbbbbbbbbbbbbbbbbbbbbbbbb111111111111111
bbbbbbbbbbbbbbbbbbbbbbbbb222222222222222
bbbbbbbbbbbbbbbbbbbbbbbbb333333333333333 Inserted
92. Optimization Two
• Reduce the index size, which reduces costs
• Each item has a 100-byte overhead, which is substantial at this scale
• Combine multiple IDs into one record
• Split each ID into two halves
  o First half is the key; second half is added to the set
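The steps above can be sketched in memory: split each ID at the midpoint, use the first half as the hash key, and keep the second halves in a set on that item, so many IDs amortize one item's 100-byte overhead. Illustrative only; the exact split point in production is not specified beyond "two halves":

```python
table = {}  # hash key (first half of the ID) -> set of second halves

def add_id(unique_id):
    """Test-and-set an ID stored as prefix -> {suffixes}. Returns True if
    the ID was new, False if it was a duplicate."""
    mid = len(unique_id) // 2
    prefix, suffix = unique_id[:mid], unique_id[mid:]
    seen = table.setdefault(prefix, set())
    if suffix in seen:
        return False
    seen.add(suffix)
    return True
```

IDs sharing a prefix collapse into a single item, which is where the index-size (and cost) saving comes from.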
94. Optimization Three - Combine Months
• Go back to a single table
Prefix           March2013                            April2013
aaaaaaaaaa...    [111111111111111, 22222222222...     [1212121212121212, 3434343434...
bbbbbbbbbb...    [444444444444444, 555555555...       [4545454545454545, 6767676767...
ccccccccccc...   [777777777777777, 888888888...       [8989898989898989, 1313131313...
One operation:
1. Delete February2013 field
2. Check ID in March2013
3. Test and set into April2013
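The single combined operation can be modeled against a dict of `{prefix: {month: set-of-suffixes}}`. In production this would be one conditional UpdateItem; the sketch below spells out the three steps explicitly (function and parameter names are mine):

```python
table = {}  # prefix -> {month_field: set of ID suffixes}

def dedup(prefix, suffix, prev_month, cur_month, expired_month):
    """One logical operation: drop the expired month's field, check the
    previous month, then test-and-set into the current month."""
    row = table.setdefault(prefix, {})
    row.pop(expired_month, None)               # 1. delete expired field
    if suffix in row.get(prev_month, set()):   # 2. check previous month
        return False
    current = row.setdefault(cur_month, set()) # 3. test and set current
    if suffix in current:
        return False
    current.add(suffix)
    return True
```

Folding the months back into one table removes the cross-table read, which is why the read cost drops to zero in the recap below.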
95. Recap
Compare Plans for 20 Billion IDs per month
Plan                   Storage Costs   Read Costs   Write Costs   Total    Savings
Naive (after a year)   $8400           0            $4000         $12400   -
Data Age               $900            $350         $4000         $5250    57%
Using Sets             $150            $350         $4000         $4500    64%
Multiple Months        $150            0            $4000         $4150    67%