The document discusses visualizing weather data stored in MongoDB. It describes extracting location and temperature data from MongoDB documents into NumPy arrays, then using that data to perform grid interpolation and contour mapping with SciPy and Matplotlib. It then compares the performance of this process using PyMongo versus a newer library called Monary, finding Monary more than 7 times faster for querying large datasets. It closes by crediting the Python libraries and community contributors that made this visualization and analysis of weather data from MongoDB possible.
33. import numpy
import pymongo

# Not terrifically fast.
db = pymongo.MongoClient().my_database
data = []
for doc in db.collection.find(query):  # query: a filter dict, e.g. one hour's documents
    data.append((
        doc['position']['coordinates'][0],
        doc['position']['coordinates'][1],
        doc['airTemperature']['value']))

arrays = numpy.array(data)
34. Analyzing large datasets
• Querying: 109k documents per second
• (On localhost)
• Can we go faster?
• Enter “Monary”
35. Monary
by David Beach
MongoDB → PyMongo → Python dicts → NumPy → Matplotlib
MongoDB → Monary → NumPy → Matplotlib
44. Thank you
A. Jesse Jiryu Davis
Senior Python Engineer, MongoDB
#MongoDBWorld
Editor's notes
This will not be a serious MongoDB talk.
Serious MongoDB talks show slides with lots of hairy data.
There's usually a cylinder; that means we've gotten very serious, because we're talking about databases.
And when things get really serious, there are multiple cylinders in boxes.
You’re not going to see this stuff because this is not a serious MongoDB talk.
This will be a talk about making pretty pictures.
Also, math.
We'll see open source Python packages that can analyze and visualize data from MongoDB,
and a specialized MongoDB driver that can parse almost a million documents per second.
But this isn’t a serious talk because there won’t be any cylinders.
If you came for cylinders, I don’t want you to be disappointed.
A little review, if you weren't at Randall's or André's talks in this series.
We downloaded 2.5 billion weather measurements from the US Government.
That teal logo is the NOAA logo, National Oceanic and Atmospheric Administration
The stations do have cylinders; does that mean they're databases?
Stations have various frequencies: once per day, twice, hourly, every 5 minutes, ….
Exponentially growing data set.
André showed how you can choose the price-performance tradeoff that’s right for you:
Single-server.
Massively sharded cluster.
I went with the single-server option.
Oops, a picture of a cylinder. Must’ve snuck in from another slide deck.
I used Python to generate this visualization.
Global air temperature each hour in December last year.
The remainder of this talk is going to discuss:
open source Python packages
algorithms
performance issues.
There are such powerful open source data analysis tools in Python that the code to do all this is quite simple.
The work’s all been done for me.
Explain this code: we use PyMongo to get data from MongoDB.
PyMongo represents BSON documents as Python dicts.
We take the values from each dict and append them to a Python list of tuples,
then we convert that list to a NumPy array,
and get views of the three columns in the NumPy array (NumPy slices reference the data rather than copying it).
now these latitudes, longitudes, and temperatures represent the stations that reported at the given hour
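As a sketch, that column extraction might look like this; the variable names are mine, and the column order follows the GeoJSON lon-lat convention used in the query above:

lons = arrays[:, 0]   # first column: longitude
lats = arrays[:, 1]   # second column: latitude
temps = arrays[:, 2]  # third column: air temperature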
How do we make the contour plot? We have to interpolate among these points to come up with a temperature map of the whole globe.
I’ll explain momentarily how SciPy and Matplotlib are able to do this.
But first notice all the white areas.
Step one: interpolation. We’re going to transform a messy distribution of points into a perfectly even grid.
We begin with a point somewhere on earth for each station that reported a temperature at the hour we’re plotting.
The arrangement is very uneven.
In order to interpolate them, we first perform a Delaunay triangulation.
This comes up with a set of non-overlapping triangles covering all the points.
Next, overlay a grid. We want to know the temperature at each grid intersection.
Temperatures at the corners are 48, 54, and 53. (Sorry one is cut off.)
Here’s the grid point we need to make a temperature for.
Measure the area of each of the three sub-triangles that the grid point forms with the triangle's corners.
Use those areas as weights in a weighted average of the three corner temperatures: each corner is weighted by the area of the sub-triangle opposite it, divided by the total area. In this case the result is 51.1 degrees.
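As a quick sketch of that weighted average in code (a hypothetical helper, not from the talk):

def weighted_temp(areas, corner_temps):
    # Each corner's weight is the area of the sub-triangle
    # opposite it, normalized by the total area of the triangle.
    total = sum(areas)
    return sum(a * t for a, t in zip(areas, corner_temps)) / total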
This is called Barycentric Interpolation, use that at the cocktail party later on.
So Barycentric Interpolation can be applied to any grid point.
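In SciPy, the whole triangulate-and-interpolate step is a single call: scipy.interpolate.griddata with method='linear' performs exactly this Delaunay-plus-barycentric scheme. A minimal sketch, reusing the station arrays from the earlier column sketch; the grid resolution here is arbitrary:

import numpy
from scipy.interpolate import griddata

# lons, lats, temps: the 1-D station arrays from the column sketch above.
# Build an evenly spaced lon/lat grid covering the globe.
grid_lons, grid_lats = numpy.meshgrid(
    numpy.linspace(-180, 180, 361),
    numpy.linspace(-90, 90, 181))

# method='linear' triangulates the stations (Delaunay) and
# interpolates barycentrically within each triangle; grid points
# outside the stations' convex hull come back as NaN.
grid_temps = griddata((lons, lats), temps, (grid_lons, grid_lats),
                      method='linear')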
Brings us from this…
To this!
So we can discard our original samples now and just use the grid.
Now that we've finished interpolating, the next stage is contouring.
Contouring is much too complicated for me to understand; Matplotlib just takes care of it somehow.
Finally, we fill in the colors.
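The contouring and color fill are likewise a couple of Matplotlib calls. A sketch, continuing from the interpolated grid above:

import matplotlib.pyplot as plt

# grid_lons, grid_lats, grid_temps: from the griddata sketch above.
# contourf() fills the regions between isotherms with color, and
# contour() draws the lines on top.  NaN cells (outside the
# stations' convex hull) are simply left blank.
plt.contourf(grid_lons, grid_lats, grid_temps)
plt.contour(grid_lons, grid_lats, grid_temps, colors='black')
plt.show()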
But notice we can only contour the spaces between stations. There’s no way to know about the edges.
So that gets us from this map of just the stations.
To this map with contours and colors.
But now you see why we have blank edges. We can only fill in the spaces between stations, and my program doesn’t understand that the North Pole is between Canada and Russia.
So I came up with a hack to fill in the rest of the space.
Here’s our flat projection of the Earth. Matplotlib doesn’t know that the left edge connects to the right edge.
It doesn’t know that if you keep heading North from the United States you end up in Russia.
So I just flipped and tiled the earth. Now there are 7 earths laid out on a super-earth-sized grid.
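Here's one guess at what the flip-and-tile hack looks like in NumPy. The talk doesn't show the exact seven-copy layout, so this sketch (five copies) only illustrates the idea of duplicating stations past the map's edges before triangulating:

import numpy

def flip_and_tile(lons, lats, temps):
    # East-west: longitude wraps at the date line without flipping,
    # so shift plain copies of every station 360 degrees left and right.
    all_lons = [lons - 360, lons, lons + 360]
    all_lats = [lats, lats, lats]
    all_temps = [temps, temps, temps]

    # North-south: crossing a pole lands you on the opposite meridian,
    # upside down.  On the flat map that mirrors a station at
    # (lon, lat) to (lon + 180, 180 - lat) over the North Pole and
    # to (lon + 180, -180 - lat) over the South Pole.
    all_lons += [lons + 180, lons + 180]
    all_lats += [180 - lats, -180 - lats]
    all_temps += [temps, temps]

    return (numpy.concatenate(all_lons),
            numpy.concatenate(all_lats),
            numpy.concatenate(all_temps))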
That allows us to go from this…
… to this!
So my program just re-executes the process once for each hour’s worth of data, for the whole month of December.
But it's a little slow: almost a second to generate each frame, so rendering a minute-long movie takes on the order of ten minutes.
This is one of the bottlenecks: creating and discarding Python dictionaries, plus all the time spent on hashtable lookups.
These are idealized circumstances of course: no network latency, data is already in memory.
It’s fast, but can we go faster?
Monary!
Directly from MongoDB to NumPy, all written in C.
No intermediate Python dictionaries.
Written by David Beach, a financial analyst. Just an open source project by a MongoDB community member.
Monary is statically typed.
You get NumPy arrays back directly, with no further processing.
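A sketch of the equivalent Monary query, assuming Monary's (database, collection, query, fields, types) signature and that dotted field paths can reach the nested coordinate values:

from monary import Monary

m = Monary('localhost')
# One NumPy array comes back per requested field; no Python dicts
# are created along the way.
lons, lats, temps = m.query(
    'my_database', 'collection',
    {},  # query spec, e.g. matching one hour's documents
    ['position.coordinates.0',
     'position.coordinates.1',
     'airTemperature.value'],
    ['float64', 'float64', 'float64'])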
6458 rows for 1991-06-02 12:00:00
PyMongo takes 0.0593 sec
Monary takes 0.0079 sec
Monary is about 7.5x faster
Now we can generate this in near-real time.
David Beach, original author
Me and Jason Carey: MongoDB driver engineers, overseers
Kyle Suarez and Matt Cotter: Interns, contributing this summer
Rutgers, Carleton College
So we can query data from MongoDB using Python and achieve very high throughput, using Monary and NumPy.
And we can do sophisticated processing and visualization of that data using SciPy and Matplotlib.