Weather of the Century
J. Randall Hunt
@jrhunt

Developer Advocate, MongoDB
@midwestio
What was the weather the day you were born?
Agenda
• Data and Schema
• Application
• Operational Concerns
MONGODB INTERLUDE!
What Is It And Why Use It?
• Document Data Store
• Geo Indexing
• "Simple" Sharded deployments
Terminology
RDBMS | MongoDB (Document Store)
Database | Database
Table | Collection
Row(s) | (BSON) Document
Index | Index
Join | Nope.
The Data
Where To Get Data?
A Weather Datum
• A station ID
• A timestamp
• Lat, Long, Elevation
• A LOT OF WEATHER DATA (135-page manual for parsing)
• Lots of optional sections
How much of it do we have?
• 2.5 billion distinct data points
• 4 Terabytes
• Number of documents is huge, overall data size is reasonable
• We'll call this: "moderately big" data
How does it grow?
Who Else Is This Relevant For?
• Particle Physics
• Stocks, high frequency trading
• Insurance
• People with lots of small pieces of data
Schema Design 101
Things We Care About
• Performance
‣ Ingestion
‣ App Specific
‣ Ad-hoc
• Cost
• Flexibility
Performance Breakdown
• Bulk Loading
• Latency and throughput for queries
‣ point in space-time
‣ one station, one year
‣ the whole world at one time
• Aggregation and Exploration
‣ warmest and coldest day ever, average temperature, etc.
0303725053947282013060322517+40779-073969FM-15+0048KNYC
V0309999C00005030485MN0080475N5+02115+02005100975
ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999
GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859...
{
  "st" : "u725053",
  "ts" : ISODate("2013-06-03T22:51:00Z"),
  "airTemperature" : {
    "value" : 21.1,
    "quality" : "5"
  },
  "atmosphericPressure" : {
    "value" : 1009.7,
    "quality" : "5"
  }
}
Station ID:
NYC Central Park
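The mapping from the fixed-width record to the document above can be sketched in Python. The column offsets are assumptions read off the sample record on this slide (they reproduce the values in the document shown), the function name `parse_isd_header` is hypothetical, and missing-value sentinels and the optional sections are not handled.

```python
from datetime import datetime, timezone

# Sample record from the slide. The 5-character call-letter field "KNYC " is
# assumed to end in a space that was lost at the line wrap.
RAW = ("0303725053947282013060322517+40779-073969FM-15+0048KNYC "
       "V0309999C00005030485MN0080475N5+02115+02005100975")


def parse_isd_header(record):
    """Parse a few mandatory fields of one fixed-width record (offsets assumed)."""
    ts = datetime.strptime(record[15:27], "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
    return {
        "st": "u" + record[4:10],                      # USAF id with the "u" prefix
        "ts": ts,
        "position": {
            "type": "Point",
            # GeoJSON order is [longitude, latitude]; both are stored in 1/1000 degree
            "coordinates": [int(record[34:41]) / 1000.0,
                            int(record[28:34]) / 1000.0],
        },
        "elevation": int(record[46:51]),               # metres
        "airTemperature": {"value": int(record[87:92]) / 10.0,
                           "quality": record[92]},
        "atmosphericPressure": {"value": int(record[99:104]) / 10.0,
                                "quality": record[104]},
    }


doc = parse_isd_header(RAW)
print(doc["st"], doc["ts"].isoformat(), doc["airTemperature"])
```

A real loader would also have to check for the format's missing-value codes before scaling, and walk the optional sections that follow the mandatory part.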
Schema
{
  st: "u724463",
  ts: ISODate("1991-01-01T00:00:00Z"),
  position: {
    type: "Point",
    coordinates: [
      -94.6,
      39.117
    ]
  },
  elevation: 231,
  … other fields …
}
station ID and source
Stations
• USAF and WBAN IDs exist for most of North America. We prefix them with "u" and "w" respectively, followed by the ID.
• For ships we use the prefix "x" and their lat and lng to create a station ID.
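A minimal sketch of these prefix rules in Python. The function name `station_id`, the preference for USAF over WBAN when both exist, and the exact way a ship's coordinates are rendered into the ID are all assumptions; the slides only say that "x" plus lat and lng is used.

```python
def station_id(usaf=None, wban=None, lat=None, lng=None):
    """Build a station id: "u"+USAF, else "w"+WBAN, else "x"+coordinates."""
    if usaf is not None:
        return "u" + usaf
    if wban is not None:
        return "w" + wban
    if lat is not None and lng is not None:
        # Hypothetical ship format: signed coordinates with three decimals.
        return "x{:+.3f}{:+.3f}".format(lat, lng)
    raise ValueError("need a USAF id, a WBAN id, or a position")


print(station_id(usaf="725053"))
print(station_id(wban="13731"))
print(station_id(lat=40.779, lng=-73.969))
```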
GeoJSON
• A rich geographical data format
• Lines, MultiLines, Polygons, Geometries
• Able to perform queries on complex structures
Schema
airTemperature: {
  value: -4.9,
  quality: "1"
}
Choice: Embedding?
Problem: ~100 "weather codes" and optional sections
• Store them inline
• Store them in another collection
• Embedding keeps your logic in the schema instead of the application.
• Whether to embed depends on cardinality: don't embed "squillions" of items.
• Don't embed objects that have to change frequently.
Choice: Unique Identifier
{_id: ObjectId("53a33f823ed4ac438f8c63b7")}
• Simple, guaranteed unique identifier
• 12 bytes
Choice: Unique Identifier
{_id: {
  'st': 'w12345',
  'ts': ISODate("2014-06-19T19:53:58.680Z")
  }
}
• Not great if there are duplicates
• Slightly more complex queries
• ~12 bytes saved per document
Choice: Field Shortening
• Indexes are still the same size
• Decreases readability
• In our example you can save ~40% space with minimum field lengths
• Probably better to go for semi-readable with ~20% space savings
{
  "_id": ObjectId("5298c40f3004e2fe02922e29"),
  "st": "w13731",
  "ts": ISODate("1949-01-01T05:00:00Z"),
  "airTemperature": {
    "quality": "5",
    "value": 1.1
  },
  "skyCondition": {
    "cavok": "N",
    "ceilingHeight": {
      "determination": "9",
      "quality": "4",
      "value": 1433
    }
  },
  ... ... ...
}
1236 Bytes
{
  "_id": ObjectId("5398c40f3004e2fe02922e29"),
  "st": "w13731",
  "ts": ISODate("1949-01-01T05:00:00Z"),
  "aT": {
    "q": "5",
    "v": 1.1
  },
  "sC": {
    "c": "N",
    "cH": {
      "d": "9",
      "q": "4",
      "v": 1433
    }
  },
  ... ... ...
}
786 Bytes
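Where the savings come from can be sketched in Python: BSON stores every field name in every document as a null-terminated string, so each character removed from a name is saved once per occurrence. The mapping `FIELD_MAP` and the helper names are hypothetical, and the sketch counts only the fields visible on the two slides above, not the elided ones.

```python
# Hypothetical long-name -> short-name mapping, taken from the two slides above.
FIELD_MAP = {
    "airTemperature": "aT", "skyCondition": "sC", "ceilingHeight": "cH",
    "cavok": "c", "determination": "d", "quality": "q", "value": "v",
}


def shorten(doc, mapping=FIELD_MAP):
    """Return a copy of doc with field names replaced, recursing into sub-documents."""
    if isinstance(doc, dict):
        return {mapping.get(k, k): shorten(v, mapping) for k, v in doc.items()}
    return doc


def field_name_bytes(doc):
    """Bytes spent on field names: UTF-8 length plus one terminator byte per name."""
    if isinstance(doc, dict):
        return sum(len(k.encode("utf-8")) + 1 + field_name_bytes(v)
                   for k, v in doc.items())
    return 0


long_doc = {
    "st": "w13731",
    "airTemperature": {"quality": "5", "value": 1.1},
    "skyCondition": {
        "cavok": "N",
        "ceilingHeight": {"determination": "9", "quality": "4", "value": 1433},
    },
}
short_doc = shorten(long_doc)
saved = field_name_bytes(long_doc) - field_name_bytes(short_doc)
print(saved)  # 69 bytes saved on this fragment alone
```

Keeping the mapping in one place also lets the application translate short names back to readable ones, which softens the readability cost listed above.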
Choice: Indexes
• Prefer sparse indexes! All Geo indexes are sparse.
• Relying on index intersection can reduce storage needs, but compound indexes are more performant.
• Build indexes AFTER ingesting the data!
The Application
Overview
[Diagram: Client (JavaScript, Chrome, Google Earth browser plugin) and Server (Python, PyMongo), connected by arrows labeled KML and Data]
Aggregation
from datetime import timedelta

# dt (start of the hour, a datetime) and db (a PyMongo database) are defined elsewhere
pipeline = [{
    '$match': {
        'ts': {
            '$gte': dt,
            '$lt': dt + timedelta(hours=1)},
        'airTemperature.quality': {
            '$in': ['0', '1', '5', '9']}
    }
}, {
    '$group': {
        '_id': '$st',
        'position': {'$first': '$position'},
        'airTemperature': {'$first': '$airTemperature'}}
}]

cursor = db.data.aggregate(pipeline, cursor={})
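The pipeline above can be wrapped in a small function so the application can request any hour. This is a sketch: the names `hourly_snapshot_pipeline` and `GOOD_QUALITY` are hypothetical, and no database connection is made here.

```python
from datetime import datetime, timedelta

# Quality codes accepted by the $match stage on the slide.
GOOD_QUALITY = ['0', '1', '5', '9']


def hourly_snapshot_pipeline(dt):
    """One reading per station for the hour starting at dt."""
    return [
        {'$match': {
            'ts': {'$gte': dt, '$lt': dt + timedelta(hours=1)},
            'airTemperature.quality': {'$in': GOOD_QUALITY}}},
        {'$group': {
            '_id': '$st',
            'position': {'$first': '$position'},
            'airTemperature': {'$first': '$airTemperature'}}},
    ]


pipeline = hourly_snapshot_pipeline(datetime(2000, 1, 1))
print(pipeline[0]['$match']['ts'])
```

The result would be passed to `db.data.aggregate(...)` as on the slide. Note that `$first` takes whichever matching document reaches the `$group` stage first for each station, so a `$sort` stage before it would be needed if a specific reading within the hour matters.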
{
  name : "New York",
  geometry : {
    type: "MultiPolygon",
    coordinates: [
      [
        [
          [-71.94, 41.28],
          [-71.92, 41.29],
          /* 2000 more points... */
          [-71.94, 41.28]
        ]
      ]
    ]
  }
}
db.states.createIndex({
  geometry: '2dsphere'
});
GeoFencing
db.states.find_one({
    'geometry': {
        '$geoIntersects': {
            '$geometry': {
                'type': 'Point',
                'coordinates': [lng, lat]}}}})
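A common mistake with this query is passing the coordinates in the wrong order: GeoJSON expects longitude first, then latitude. A small helper can build the filter and reject out-of-range values. The function name `state_lookup_filter` is hypothetical, and the sketch only builds the filter document; it does not talk to the database.

```python
def state_lookup_filter(lat, lng):
    """Build the $geoIntersects filter for the state containing (lat, lng)."""
    if not -90 <= lat <= 90:
        raise ValueError("latitude out of range: %r" % lat)
    if not -180 <= lng <= 180:
        raise ValueError("longitude out of range: %r" % lng)
    return {
        'geometry': {
            '$geoIntersects': {
                '$geometry': {
                    'type': 'Point',
                    'coordinates': [lng, lat],  # GeoJSON order: longitude, latitude
                }
            }
        }
    }


query = state_lookup_filter(lat=40.779, lng=-73.969)
print(query['geometry']['$geoIntersects']['$geometry']['coordinates'])
```

The returned dictionary is what would be handed to `db.states.find_one(...)` as on the slide.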
Operational Concerns
Single Server
• Application: c3.8xlarge
• mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD
Sharded Cluster
• Application / mongos: c3.8xlarge
• mongod: 100 x r3.2xlarge, each with 61 GB RAM and 100 GB disk
Cost?
• Single server: $60,000 / yr
• Sharded cluster: $700,000 / yr
Bulk Loading: Single Server
Settings: 8 threads, batch size of 100
Total loading time: 10 h 20 min
Documents per second: ~70,000
Index build time: 7 h 40 min (ts_1_st_1)
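The batch size setting can be illustrated with a small generator that groups parsed documents into lists of 100. This is a sketch, not the loader used for the talk: the name `batches` is hypothetical, and each yielded list would be handed to a bulk insert call (for example PyMongo's `Collection.insert_many`) by one of the worker threads.

```python
from itertools import islice


def batches(docs, batch_size=100):
    """Yield lists of up to batch_size items from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


sizes = [len(b) for b in batches(range(250), 100)]
print(sizes)  # [100, 100, 50]
```

Because the generator is lazy, the full data set never has to fit in memory at once.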
Bulk Loading: Sharded Cluster
Shard key: station ID, hashed
Settings: 10 mongos @ 144 threads, batch size of 200
Total loading time: 3 h 10 min
Documents per second: ~228,000
Index build time: 5 min (ts_1_st_1)
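As a sanity check, the loading rates and times reported for both setups can be multiplied out; both land near the 2.5 billion data points quoted earlier. The helper name `documents_loaded` is hypothetical, and the inputs are the rounded figures from the slides.

```python
def documents_loaded(docs_per_second, hours, minutes):
    """Documents inserted at a steady rate over the given wall-clock time."""
    return docs_per_second * (hours * 3600 + minutes * 60)


single = documents_loaded(70_000, 10, 20)    # single server
cluster = documents_loaded(228_000, 3, 10)   # sharded cluster
print(single, cluster)  # 2604000000 2599200000

# Speed-ups of the cluster over the single server
load_speedup = (10 * 60 + 20) / (3 * 60 + 10)    # loading time, in minutes
index_speedup = (7 * 60 + 40) / 5                # index build time, in minutes
print(round(load_speedup, 1), round(index_speedup))  # 3.3 92
```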
Queries: Point in Space-Time
db.data.find({"st" : "u747940",
              "ts" : ISODate("1969-07-16T12:00:00Z")})
[Chart: latency in ms (avg, 95th, 99th, max.), single server vs. cluster]
Throughput: 40,000/s (single server), 610,000/s (cluster, 10 mongos)
Queries: One Station, One Year
db.data.find({"st" : "u103840",
              "ts" : {"$gte": ISODate("1989-01-01"),
                      "$lt" : ISODate("1990-01-01")}})
[Chart: latency in ms (avg, 95th, 99th, max.), single server vs. cluster]
Throughput: 20/s (single server), 430/s (cluster, 10 mongos)
Targeted query
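From the Python side, the one-station, one-year query uses a half-open time range, so a reading stamped exactly at midnight on January 1 of the following year is excluded. A sketch, with the hypothetical helper name `station_year_filter`:

```python
from datetime import datetime


def station_year_filter(station, year):
    """Filter for all readings of one station in [Jan 1 of year, Jan 1 of year + 1)."""
    return {
        "st": station,
        "ts": {"$gte": datetime(year, 1, 1),
               "$lt": datetime(year + 1, 1, 1)},
    }


flt = station_year_filter("u103840", 1989)
print(flt)
```

Because the filter includes the shard key (`st`), a cluster sharded on station ID can send it to a single shard.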
Queries: The Whole World
db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
[Chart: latency in ms (avg, 95th, 99th, max.), single server vs. cluster]
Throughput: 8/s (single server), 310/s (cluster, 10 mongos)
Scatter/gather query
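The difference between a targeted and a scatter/gather query follows from the shard key. The sketch below is a simplified model, not MongoDB's routing code: it hashes the station ID with MD5 and takes the result modulo the shard count, whereas a real cluster assigns ranges of hashed values (chunks) to shards. The function names are hypothetical.

```python
import hashlib

NUM_SHARDS = 100  # the cluster on the slides runs 100 mongod shards


def shard_for(station, num_shards=NUM_SHARDS):
    """Simplified stand-in for hashed sharding: hash the key, pick a shard."""
    digest = hashlib.md5(station.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_for_query(query, num_shards=NUM_SHARDS):
    """Shards a query must visit: one if it pins the shard key, else all."""
    station = query.get("st")
    if isinstance(station, str):
        return {shard_for(station, num_shards)}   # targeted
    return set(range(num_shards))                 # scatter/gather


targeted = shards_for_query({"st": "u103840", "ts": "1989"})
scattered = shards_for_query({"ts": "2000-01-01T00:00:00Z"})
print(len(targeted), len(scattered))  # 1 100
```

In this model the whole-world query touches every shard, which is why it benefits from all 100 machines working in parallel.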
Analytics: Maximum Temperature
db.data.aggregate([
  { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } },
  { "$group" : { "_id" : null,
                 "maxTemp" : { "$max" : "$airTemperature.value" } } }
])
Result: 61.8 °C = 143 °F
Single server: 2 h 30 min
Cluster: 2 min
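What the two pipeline stages compute can be mimicked in plain Python on a handful of documents. This is an in-memory stand-in for illustration, not how the server executes the aggregation. The sample documents are made up (station names "a" to "d" are hypothetical; the readings reuse values shown on earlier slides), and the helper names are assumptions.

```python
def max_air_temperature(docs, good_quality=("1", "5")):
    """Emulate $match on quality followed by $group with $max over all documents."""
    values = [d["airTemperature"]["value"] for d in docs
              if d.get("airTemperature", {}).get("quality") in good_quality]
    return max(values) if values else None


def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


# Made-up sample documents for illustration only.
sample = [
    {"st": "a", "airTemperature": {"value": 21.1, "quality": "5"}},
    {"st": "b", "airTemperature": {"value": 99.9, "quality": "9"}},  # fails the quality filter
    {"st": "c", "airTemperature": {"value": -4.9, "quality": "1"}},
    {"st": "d"},                                                     # no temperature section
]

print(max_air_temperature(sample))         # 21.1
print(round(celsius_to_fahrenheit(61.8)))  # 143
```

Since the filter is not backed by an index, every document has to be examined, which matches the scan times reported above.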
Summary: Single Server
Pro
• Cost Effective
• Low latency for single queries
Con
• Table scans are still slow
Summary: Cluster
Con
• High cost
Pro
• High throughput
• Very good latency for single queries
• Scatter-gather yields significant speed-up
• Analytics are possible
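The trade-off can be put into numbers by comparing the cluster's speed-up on each workload with its cost ratio. The figures are the ones reported on the slides; attributing $60,000 / yr to the single server and $700,000 / yr to the cluster is an assumption based on the summaries, and the dictionary keys are hypothetical names.

```python
single = {"cost_per_year": 60_000, "point_qps": 40_000,
          "station_year_qps": 20, "whole_world_qps": 8, "max_temp_minutes": 150}
cluster = {"cost_per_year": 700_000, "point_qps": 610_000,
           "station_year_qps": 430, "whole_world_qps": 310, "max_temp_minutes": 2}

cost_ratio = cluster["cost_per_year"] / single["cost_per_year"]

speedups = {
    key: cluster[key] / single[key]
    for key in ("point_qps", "station_year_qps", "whole_world_qps")
}
# For the analytics job lower is better, so the ratio is inverted.
speedups["max_temp"] = single["max_temp_minutes"] / cluster["max_temp_minutes"]

print("cost ratio: %.1fx" % cost_ratio)
for name, factor in speedups.items():
    print("%-18s %6.2fx speed-up, %.1fx per cost unit" % (name, factor, factor / cost_ratio))
```

On these numbers every workload speeds up by more than the roughly 11.7x increase in cost, with the full-scan analytics query gaining the most.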
Thank You!
J. Randall Hunt
@jrhunt

Developer Advocate, MongoDB
@midwest.io

A Century Of Weather Data - Midwest.io