12. A Weather Datum
• A station ID
• A timestamp
• Lat, Long, Elevation
• A LOT OF WEATHER DATA (135-page manual for
parsing)
• Lots of optional sections
13. How much of it do we have?
• 2.5 billion distinct data points
• 4 Terabytes
• Number of documents is huge, overall data size is
reasonable
• We'll call this: "moderately big" data
18. Things We Care About
• Performance
‣ Ingestion
‣ App Specific
‣ Ad-hoc
• Cost
• Flexibility
19. Performance Breakdown
• Bulk Loading
• Latency and throughput for queries
• point in space-time
• one station, one year
• the whole world at one time
• Aggregation and Exploration
• warmest and coldest day ever, average temperature, etc.
22. Stations
• USAF and WBAN IDs exist for most of North America.
We prefix the ID with "u" (USAF) or "w" (WBAN).
• For ships we use the prefix "x" plus their latitude and
longitude to create a station ID.
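A minimal sketch of the ID scheme above. The prefixes come from the slide; the function name `stationId`, the preference for USAF over WBAN, and the ship layout (three decimals, joined by "_") are assumptions for illustration only.

```javascript
// Hypothetical helper: builds a station ID from whatever identifiers exist.
function stationId({ usaf, wban, lat, lng }) {
  // Assumption: prefer the USAF ID when both USAF and WBAN are present.
  if (usaf != null) return "u" + usaf;
  if (wban != null) return "w" + wban;
  if (lat != null && lng != null) {
    // Assumed ship encoding: "x" + latitude + "_" + longitude, 3 decimals each.
    return "x" + lat.toFixed(3) + "_" + lng.toFixed(3);
  }
  throw new Error("cannot derive a station ID");
}
```

With this sketch, `stationId({ usaf: "747940" })` gives the "u747940" form used in the queries later in the deck.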
27. Choice: Embedding?
• Embedding keeps your logic in the schema instead of
the application.
• Depends on cardinality: don't embed "squillions" of objects.
• Don't embed objects that have to change frequently.
29. Choice: Unique Identifier
{_id: {
    'st': 'w12345',
    'ts': ISODate("2014-06-19T19:53:58.680Z")
  }
}
• Not great if there are duplicates
• Slightly more complex queries
• ~12 bytes saved per document
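To illustrate the "slightly more complex queries" point, here is a sketch of query documents against a compound `_id`. The helper names are hypothetical. The range form relies on MongoDB comparing embedded documents field by field in order, so the key order (`st`, then `ts`) must match the stored `_id`.

```javascript
// Hypothetical helpers that only build query documents; nothing is sent to a server.
function pointQuery(st, ts) {
  // Exact match on an embedded document is order-sensitive: st first, then ts.
  return { _id: { st: st, ts: ts } };
}

function stationYearQuery(st, year) {
  // Range over the whole _id value, bounded by two embedded documents.
  const from = new Date(Date.UTC(year, 0, 1));
  const to = new Date(Date.UTC(year + 1, 0, 1));
  return { _id: { $gte: { st: st, ts: from }, $lt: { st: st, ts: to } } };
}
```

In the shell these would be passed to `db.data.find(...)`. With a separate `st`/`ts` index the same lookups are plain field queries, which is the trade-off the slide describes.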
30. Choice: Field Shortening
• Indexes are still the same size
• Decreases readability
• In our example you can save ~40% space with
minimum field lengths
• Probably better to go for semi-readable with ~20%
space savings
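A rough way to see where the savings come from: BSON stores every field name, plus a terminating zero byte, in every document. The sketch below counts only those name bytes. The long and short field names are made up for illustration and are not the talk's actual schema, so the result does not reproduce the 40% or 20% figures.

```javascript
const encoder = new TextEncoder();

// True only for plain {...} objects, so Date values are not descended into.
function isPlainObject(v) {
  return v !== null && typeof v === "object" &&
    Object.getPrototypeOf(v) === Object.prototype;
}

// Bytes spent on field names (including the trailing zero byte) in one document.
function fieldNameBytes(doc) {
  let total = 0;
  for (const [key, value] of Object.entries(doc)) {
    total += encoder.encode(key).length + 1;
    if (isPlainObject(value)) total += fieldNameBytes(value);
  }
  return total;
}

// Hypothetical documents: same data, different name lengths.
const longNames = { station: "w12345", timestamp: new Date(0),
                    airTemperature: { value: 21.5, quality: "1" } };
const shortNames = { st: "w12345", ts: new Date(0),
                     at: { v: 21.5, q: "1" } };

const savedPerDocument = fieldNameBytes(longNames) - fieldNameBytes(shortNames);
```

Multiplied by 2.5 billion documents, even a few dozen bytes per document adds up, while index sizes stay the same because indexes store values, not field names.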
33. Choice: Indexes
• Prefer sparse indexes! All Geo indexes are sparse.
• Relying on index intersection can reduce storage
needs but compound indexes are more performant.
• Build indexes AFTER ingesting the data!
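The "index after ingest" advice can be sketched as follows. The function `loadThenIndex` is hypothetical. It assumes a shell-style collection object with synchronous `insertMany` and `createIndex` methods, and the index key `{ts: 1, st: 1}` is inferred from the index name `ts_1_st_1` quoted in the loading results.

```javascript
// Hypothetical loader: insert everything in batches first, build the index last.
function loadThenIndex(coll, docs, batchSize) {
  let inserted = 0;
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    // Unordered inserts let the server continue past individual failures.
    coll.insertMany(batch, { ordered: false });
    inserted += batch.length;
  }
  // Only now pay the index cost, once, over the complete data set.
  coll.createIndex({ ts: 1, st: 1 });
  return inserted;
}
```

In the shell this might be called as `loadThenIndex(db.data, docs, 100)`. A real loader would stream documents from the parser rather than hold them in one array.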
47. Bulk Loading: Single Server
• Settings: 8 threads, batch size 100
• Total loading time: 10 h 20 min
• Documents per second: ~70,000
• Index build time: 7 h 40 min (ts_1_st_1)
49. Bulk Loading: Sharded Cluster
• Shard key: station ID, hashed
• Settings: 10 mongos @ 144 threads, batch size 200
• Total loading time: 3 h 10 min
• Documents per second: ~228,000
• Index build time: 5 min (ts_1_st_1)
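As a sanity check on the loading figures, the sketch below derives documents per second from the rounded "2.5 billion" total and the two loading times. The function name is hypothetical. Because the total is rounded, the results only approximate the quoted rates.

```javascript
// Average ingest rate for a given document count and wall-clock time.
function docsPerSecond(totalDocs, hours, minutes) {
  return totalDocs / (hours * 3600 + minutes * 60);
}

const TOTAL_DOCS = 2.5e9; // rounded figure from the deck

const singleServerRate = docsPerSecond(TOTAL_DOCS, 10, 20); // 10 h 20 min
const clusterRate = docsPerSecond(TOTAL_DOCS, 3, 10);       // 3 h 10 min

// How much faster the cluster finished the same load.
const loadSpeedUp = singleServerRate > 0 ? clusterRate / singleServerRate : NaN;
```

This gives roughly 67,000 and 219,000 documents per second, close to the quoted ~70,000 and ~228,000, and a load speed-up of a little over 3x.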
50. Queries: Point in Space-Time
db.data.find({"st" : "u747940",
"ts" : ISODate("1969-07-16T12:00:00Z")})
[Chart: query latency in ms (avg, 95th, 99th, max.), single server vs. cluster; axis from 0 to 2 ms]
Throughput: 40,000/s (single server), 610,000/s (cluster, 10 mongos)
52. Queries: One Station, One Year
db.data.find({"st" : "u103840",
"ts" : {"$gte": ISODate("1989-01-01"),
"$lt" : ISODate("1990-01-01")}})
[Chart: query latency in ms (avg, 95th, 99th, max.), single server vs. cluster; axis from 0 to 5000 ms]
Throughput: 20/s (single server), 430/s (cluster, 10 mongos)
On the cluster this is a targeted query.
54. Queries: The Whole World
db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
[Chart: query latency in ms (avg, 95th, 99th, max.), single server vs. cluster; axis from 0 to 10000 ms]
Throughput: 8/s (single server), 310/s (cluster, 10 mongos)
On the cluster this is a scatter/gather query.
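Why one query is "targeted" and the other "scatter/gather" follows from the hashed station-ID shard key. The toy router below is only a sketch: `toyHash` and the modulo placement are simplifications, not MongoDB's real hash function or chunk mapping, and the shard count of 10 in the usage note is an arbitrary example.

```javascript
// Toy string hash (NOT the hash MongoDB uses for hashed shard keys).
function toyHash(s) {
  let h = 0;
  for (const ch of s) h = (h * 31 + ch.codePointAt(0)) >>> 0;
  return h;
}

// Which shards must a query visit when the shard key is a hash of "st"?
function shardsFor(query, numShards) {
  if (typeof query.st === "string") {
    // Equality on the shard key: the router can pick exactly one shard.
    return [toyHash(query.st) % numShards];
  }
  // No shard key in the query: every shard has to be asked.
  return Array.from({ length: numShards }, (_, i) => i);
}
```

Under these assumptions, `shardsFor({st: "u103840", ts: {...}}, 10)` returns one shard, while `shardsFor({ts: ...}, 10)` returns all ten, which then work on the query in parallel.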
56. Analytics: Maximum Temperature
db.data.aggregate([
  {"$match": {"airTemperature.quality": {"$in": ["1", "5"]}}},
  {"$group": {"_id": null,
              "maxTemp": {"$max": "$airTemperature.value"}}}
])
Result: 61.8 °C = 143 °F
Single server: 2 h 30 min
Cluster: 2 min
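For readers less familiar with the aggregation framework, here is a plain JavaScript sketch of what the pipeline computes: keep readings whose quality code is "1" or "5", then take the maximum value. The function names and the sample readings in the usage note are made up for illustration; only the 61.8 °C result and the quality codes come from the slide.

```javascript
// In-memory equivalent of the $match + $group($max) pipeline.
function maxTemperature(docs) {
  const accepted = new Set(["1", "5"]); // quality codes kept by $match
  let maxTemp = null;
  for (const doc of docs) {
    const reading = doc.airTemperature;
    if (!reading || !accepted.has(reading.quality)) continue;
    if (maxTemp === null || reading.value > maxTemp) maxTemp = reading.value;
  }
  return { _id: null, maxTemp: maxTemp };
}

// Unit conversion used for the result line.
function celsiusToFahrenheit(celsius) {
  return celsius * 9 / 5 + 32;
}
```

For example, given readings of 21.5 (quality "1"), 61.8 (quality "5") and 99.9 (quality "9"), the function returns 61.8, because the 99.9 reading fails the quality filter. `celsiusToFahrenheit(61.8)` rounds to 143.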
58. Summary: Cluster
Con
• High cost
Pro
• High throughput
• Very good latency for single queries
• Scatter-gather yields significant speed-up
• Analytics are possible