4. About Me
• At MongoDB since January 2016
• Senior Technical Services Engineer - I answer your support questions.
• Performance Driven - Software performance and benchmarking for
the last decade
• New to MongoDB, but not performance
• Loves data
• Programming Polyglot
5. Disclaimer
• This is my personal journey
• I made lots of mistakes
• You are probably smarter than me
• (I’m hopefully smarter than I was two years ago)
6. My Project
• I’ve been collecting Water/Electric meter data since February 2015.
• Now that I work at a database company, maybe I should put this in a
database?
• See what I can learn about my consumption.
• Get access to my meter data on the internet.
7. IOT
• Internet of things
• I want my things (meters) to be connected to the Internet
• This would let me remotely monitor my utilization
12. ETL
• Extract, Transform, Load
• Not in the traditional sense (not already in another database)
• Many of the same characteristics
• Convert between formats
• Reading all the data quickly
• Inserting into another database
22. Redundant Data!
• The meters send readings every few minutes.
• Most readings carry no new information: the value is unchanged from the previous one.
• We only care about the first change.
2015-02-13T18:01:09.079 Consumption: 5048615
2015-02-13T18:02:11.272 Consumption: 5048621
2015-02-13T18:03:14.093 Consumption: 5048621
2015-02-13T18:04:13.155 Consumption: 5048621
2015-02-13T18:05:10.849 Consumption: 5048621
2015-02-13T18:06:11.668 Consumption: 5048623
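A minimal sketch of that deduplication in JavaScript, using the six sample readings above. The `dedupe` function and the `{time, consumption}` record shape are illustrative assumptions, not the actual import code.

```javascript
// Keep only the first reading at each new consumption value.
function dedupe(readings) {
  const kept = [];
  let last; // undefined until the first reading is seen
  for (const r of readings) {
    if (r.consumption !== last) {
      kept.push(r);
      last = r.consumption;
    }
  }
  return kept;
}

const sample = [
  { time: "2015-02-13T18:01:09.079", consumption: 5048615 },
  { time: "2015-02-13T18:02:11.272", consumption: 5048621 },
  { time: "2015-02-13T18:03:14.093", consumption: 5048621 },
  { time: "2015-02-13T18:04:13.155", consumption: 5048621 },
  { time: "2015-02-13T18:05:10.849", consumption: 5048621 },
  { time: "2015-02-13T18:06:11.668", consumption: 5048623 },
];

const unique = dedupe(sample);
```

Of the six sample readings, only the 18:01, 18:02 and 18:06 entries survive, because the three in between repeat the 18:02 value.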
25. It Works! (Sort of)
• The entire import runs overnight (around four hours)
• Reads 10.6 GB
• Inserts 90,840,510 documents
26. Problem: Queries
• Queries for monthly, daily, and day-of-week usage are similar.
• Generate ranges, grab a pair of readings, calculate the difference.
• Aggregation isn’t a great match.
// Last reading at or before the start of the range.
before = db.getSiblingDB("meters").mine
  .find({"scm.id": myid, time: {$lte: begin}})
  .sort({time: -1}).limit(1).toArray()[0];
// First reading at or after the end of the range (ascending sort).
after = db.getSiblingDB("meters").mine
  .find({"scm.id": myid, time: {$gte: end}})
  .sort({time: 1}).limit(1).toArray()[0];
consumption = after.consumption - before.consumption;
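The "generate ranges" step might look like the sketch below, which builds consecutive one-day windows; each window's `begin` and `end` would then feed the pair of lookups. `dailyRanges` is a hypothetical helper, and placing day boundaries at midnight UTC is an assumption.

```javascript
// Build `days` consecutive one-day ranges starting at the UTC day of `firstDay`.
function dailyRanges(firstDay, days) {
  const y = firstDay.getUTCFullYear();
  const m = firstDay.getUTCMonth();
  const d = firstDay.getUTCDate();
  const ranges = [];
  for (let i = 0; i < days; i++) {
    // Date.UTC normalizes day overflow, so month ends roll over correctly.
    ranges.push({
      begin: new Date(Date.UTC(y, m, d + i)),
      end: new Date(Date.UTC(y, m, d + i + 1)),
    });
  }
  return ranges;
}

const ranges = dailyRanges(new Date("2015-02-27T00:00:00Z"), 3);
```

Monthly or day-of-week ranges would follow the same pattern with a different step, which is why the three query types end up so similar.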
29. Problem: Displaying Data
• Requires multiple calls to the database
• Could be off, depending on when we see readings
// Last reading at or before the start of the window.
before = db.getSiblingDB("meters").mine
  .find({"scm.id": myid, time: {$lte: begin}})
  .sort({time: -1}).limit(1).toArray()[0];
// Every reading from that point on, oldest first.
readings = db.getSiblingDB("meters").mine
  .find({"scm.id": myid, time: {$gte: before.time}})
  .sort({time: 1}).toArray();
var previous = readings.shift();
var hourly = [];
readings.forEach(reading => {
  if (hourly.length >= 24) return;
  // A new hour has started: record usage since the last boundary.
  if (reading.time.getHours() != previous.time.getHours()) {
    hourly.push(reading.consumption - previous.consumption);
    previous = reading;
  }
});
31. Performance: Rewrite in Go
• More control over cleaning our data
• Driver allows easy batch insertion
• Split into multiple workers (goroutines) to distribute insertion load
• Take advantage of all our cores
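The batching and worker split can be sketched independently of the driver. The snippet below is JavaScript, like the queries in this deck, and shows only the partitioning: in the Go version each queue would be drained by its own goroutine issuing batch inserts. `toBatches`, `assignToWorkers`, and the batch and worker sizes are illustrative assumptions.

```javascript
// Split documents into fixed-size batches so each insert is one round trip.
function toBatches(docs, batchSize) {
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// Hand batches to workers round-robin so the load is spread evenly.
function assignToWorkers(batches, workerCount) {
  const queues = Array.from({ length: workerCount }, () => []);
  batches.forEach((batch, i) => queues[i % workerCount].push(batch));
  return queues;
}

const docs = Array.from({ length: 10 }, (_, i) => ({ n: i }));
const batches = toBatches(docs, 4);
const queues = assignToWorkers(batches, 2);
```

Real batch sizes would be far larger than four; the small numbers here only make the split easy to follow.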
36. CHANGE THE SCHEMA!
• The schema I started with didn’t meet my requirements.
• Resisted this change as it required additional application work (writing
my own ETL tool).
• Think about how you will use your data!
38. One document per hour
• This makes hourly, daily, and weekly calculations easier.
• Easy cutoff for insertion: wait until an hour has passed before inserting its document.
{
"_id" : ObjectId("54de8229791e4b133c000052"),
"meter" : 29026302,
"date" : ISODate("2015-02-13T23:00:57Z"),
"consumption" : 5.526729939432698,
"begin" : 50480.575901639124,
"end" : 50486.10263157856,
...
}
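Grouping readings into one bucket per hour could be done along these lines. `hourKey` and `bucketByHour` are hypothetical helpers, and keying on the UTC hour is an assumption; the actual ETL tool may bucket differently.

```javascript
// "2015-02-13T23:00:57.000Z" -> "2015-02-13T23" (one key per UTC hour).
function hourKey(date) {
  return date.toISOString().slice(0, 13);
}

// Collect readings into a Map keyed by hour, preserving input order.
function bucketByHour(readings) {
  const buckets = new Map();
  for (const r of readings) {
    const key = hourKey(r.date);
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(r);
  }
  return buckets;
}

const buckets = bucketByHour([
  { date: new Date("2015-02-13T23:00:57Z"), consumption: 50480.66 },
  { date: new Date("2015-02-13T23:02:00Z"), consumption: 50480.75 },
  { date: new Date("2015-02-14T00:00:11Z"), consumption: 50486.12 },
]);
```

Each bucket would become one document once its hour has fully passed.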
39. Store a before and after reading
• Used in our ETL tool
• Linear interpolation from these
values to project what the start
and end reading would have
been.
• Included for completeness, but
otherwise unnecessary. These
fields are never queried and
could be omitted.
"before" : {
"date" : ISODate("2015-02-13T22:59:56Z"),
"consumption" : 50480.57,
"delta" : 5.785714347892832
},
"after" : {
"date" : ISODate("2015-02-14T00:00:11Z"),
"consumption" : 50486.12,
"delta" : 5.68421056066001
}
…
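The interpolation can be sketched as follows. `interpolateAt` is a hypothetical helper, not the tool's actual code; the inputs are the "before" reading above and the first in-hour reading from the sample document (23:00:57Z, 50480.66).

```javascript
// Linearly interpolate consumption at `target` between readings a and b.
function interpolateAt(target, a, b) {
  const span = b.date - a.date; // milliseconds between the two readings
  if (span <= 0) throw new RangeError("b must be later than a");
  const fraction = (target - a.date) / span;
  return a.consumption + fraction * (b.consumption - a.consumption);
}

const before = { date: new Date("2015-02-13T22:59:56Z"), consumption: 50480.57 };
const first = { date: new Date("2015-02-13T23:00:57Z"), consumption: 50480.66 };

// Projected reading exactly on the hour boundary.
const begin = interpolateAt(new Date("2015-02-13T23:00:00Z"), before, first);
```

With these inputs the projected value comes out at roughly 50480.5759, in line with the "begin" field of the hourly document.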
40. Embed readings
• We may want to graph usage
within the hour, so store the raw
values.
• Store deltas to make our life
easier later.
"readings" : [
{
"date" : ISODate("2015-02-13T23:00:57Z"),
"consumption" : 50480.66,
"delta" : 5.311475388465158
},
{
"date" : ISODate("2015-02-13T23:02:00Z"),
"consumption" : 50480.75,
"delta" : 5.142857168616757
} ...
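The slides do not define "delta", but the sample values are consistent with it being the change since the previous reading, scaled to a per-hour rate. The sketch below works from that inferred interpretation; `hourlyRate` is a hypothetical name.

```javascript
const MS_PER_HOUR = 3600 * 1000;

// Change in consumption between two readings, expressed per hour.
function hourlyRate(prev, curr) {
  const elapsedMs = curr.date - prev.date;
  return ((curr.consumption - prev.consumption) / elapsedMs) * MS_PER_HOUR;
}

const r0 = { date: new Date("2015-02-13T22:59:56Z"), consumption: 50480.57 };
const r1 = { date: new Date("2015-02-13T23:00:57Z"), consumption: 50480.66 };
const r2 = { date: new Date("2015-02-13T23:02:00Z"), consumption: 50480.75 };

const delta1 = hourlyRate(r0, r1); // about 5.31
const delta2 = hourlyRate(r1, r2); // about 5.14
```

Storing the rate alongside each reading means a graph within the hour needs no further arithmetic at query time.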
41. Split out Time
• Splitting out the hour, day,
month, year, day of week makes
for easy queries.
• Aggregation is easy and fast as a
$dayOfMonth projection isn’t
required.
• We can now use a simple
aggregation to explore by year,
month, week, day and hour.
{
...
"date" : new Date("2015-02-13T17:00:57"),
"hour" : 17,
"weekday" : 5,
"day" : 13,
"month" : 2,
"year" : 2015,
…
}
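A sketch of how the split fields might be derived and then used. `splitTime` is a hypothetical helper; it uses UTC getters so the result is reproducible, whereas the values on the slide appear to be in local time. The `byWeekday` pipeline is one example of the kind of simple aggregation these fields allow, not a query taken from the deck.

```javascript
// Derive the pre-split time fields stored alongside each hourly document.
function splitTime(date) {
  return {
    date,
    hour: date.getUTCHours(),
    weekday: date.getUTCDay(), // 0 = Sunday ... 6 = Saturday
    day: date.getUTCDate(),
    month: date.getUTCMonth() + 1, // JavaScript months are zero-based
    year: date.getUTCFullYear(),
  };
}

const fields = splitTime(new Date("2015-02-13T17:00:57Z"));

// Total consumption per weekday for one meter, with no date projection needed.
const byWeekday = [
  { $match: { meter: 29026302 } },
  { $group: { _id: "$weekday", total: { $sum: "$consumption" } } },
  { $sort: { _id: 1 } },
];
// In the mongo shell: db.getSiblingDB("meters").mine.aggregate(byWeekday)
```

Swapping `"$weekday"` for `"$hour"`, `"$month"` or `"$year"` gives the other breakdowns.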
44. Queries: 24 Hour Graph
• Filter by the meter’s id
• Sort based on date
• 24 documents returned for
graphing
• Already binned on hour
boundaries
last24 = db.getSiblingDB("meters").mine.find(
{meter: 29026302},
{consumption: 1, date: 1})
.sort({date: -1})
.limit(24)
.toArray()
48. Performance by the numbers
• 4 hours to 3 minutes
• Deduplication process eliminates 202 minutes
• Data cleaning process eliminates 24 minutes
• Parallel insertion eliminates 11 minutes
• 90,840,510 Readings to 436,477
• 90,840,510 Docs to 31,396
• 10.6 GB file to 13 MB of compressed WiredTiger data (31 MB uncompressed)
49. Getting from 180 to 60 seconds
• Buffer input heavily; we should never be waiting on I/O
• Perform simple checks to avoid stripping whitespace
• Use fixed-string parsing instead of regex
• Tune batch sizes and worker counts to keep the system busy
• Optimistically encode documents to reduce encoding overhead
• Batch sends on Go channels to reduce overhead
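To illustrate the fixed-string parsing point, here is a small sketch in JavaScript. It assumes each line looks like the samples on the Redundant Data slide (`<timestamp> Consumption: <value>`); the real input format may differ, and `parseReading` is a hypothetical name.

```javascript
const MARKER = " Consumption: ";

// Parse one line by locating a fixed marker instead of running a regex.
function parseReading(line) {
  const i = line.indexOf(MARKER);
  if (i === -1) return null; // marker missing: not a reading line
  const raw = line.slice(i + MARKER.length);
  const consumption = Number(raw);
  if (raw === "" || !Number.isFinite(consumption)) return null;
  return { time: line.slice(0, i), consumption };
}

const parsed = parseReading("2015-02-13T18:01:09.079 Consumption: 5048615");
const rejected = parseReading("not a reading");
```

A single `indexOf` plus two slices avoids regex compilation and backtracking, which matters when the same parse runs over tens of millions of lines.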
57. Key Takeaways
• Follow best practices
• Batch writes improve throughput by reducing round trips
• Multiple insertion workers remove the round-trip bottleneck
• Design your schema so you can easily access your data
• Understand the big picture
• You can treat database performance just like any software issue.
• Tabular data isn’t a great way to represent many problems.
58. What have I learned?
• My household consumes a lot of water
• Changed shower heads (30% savings)
• Changed water heater ($50 a month savings)
• When certain people are home, energy consumption rises
• Replaced light bulbs (few $ a month)