Presentation to the staff of the Pacific Climate Impacts Consortium on 2014/02/18 about its Computational Support Group's work on version 2.0 of the PCIC Data Portal.
3. Outline
1 Demos
2 Architecture
Metadata Database
Python Backend
Pydap
ncWMS
Basemaps
Front-end
3 Bonus
Automated Testing
2014-02-18
PCIC Data Portal 2.0
Outline
1. Last week we deployed our 4th and hopefully final release candidate for version 2.0 of the PCIC Data Portal. It's been a four-month beta period over which we have received and responded to feedback, both from inside PCIC and from some external beta testers. Many of you have seen these at the various theme meetings that we had throughout the fall, but I'd like to take this opportunity both to introduce the rest of you to the data portal and to elaborate more on what is running behind the scenes and all of the work that has gone into producing it.
2. Typically in these presentations, I hold you captive with all of the technical details first and save the demo for the end. But in this case, I'll start with the demo, and then if you don't care about how we did it, you can just check out after that.
5. Raster Portal(s)
Coming soon!
1. The software that we have written is a variety of components that generally handle the organization and presentation of raster data; that is, gridded fields of spatiotemporal data. There are several sets of high-value data for which we have written a "raster portal" that can serve that data up.
7. BCSD Downscale Canada
1. You'll see that the feature set is intentionally fairly sparse. The application's purpose is to allow users to get the data they want, and only the data they want, and then to send them on their way. The main section of screen real estate is the map. The map is for displaying the areas for which data exists and for allowing the user to select an area to download.
2. In the top right, there is a tree selection which controls the dataset that is displayed and that which will be downloaded. And finally there are a couple of options for selecting a time range and data format.
3. We only support formats which support multidimensional data, which isn't very many right now. We'll be adding Arc ASCII Grid by the end of the fiscal year, which isn't technically multidimensional, but we'll probably send a zip file of individual grids, one per timestep.
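The "zip file of individual grids, one per timestep" idea can be sketched in a few lines of Python. This is purely illustrative: the variable name, header values, and file naming are assumptions, not the portal's actual output format.

```python
import io
import zipfile

def grids_to_zip(grids, var_name="tasmax"):
    """Pack one Arc ASCII grid per timestep into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for timestep, rows in grids.items():
            # Minimal Arc ASCII header followed by the data rows.
            header = (
                f"ncols {len(rows[0])}\n"
                f"nrows {len(rows)}\n"
                "xllcorner 0.0\nyllcorner 0.0\ncellsize 1.0\nNODATA_value -9999\n"
            )
            body = "\n".join(" ".join(str(v) for v in row) for row in rows)
            zf.writestr(f"{var_name}_{timestep}.asc", header + body)
    return buf.getvalue()

# Two tiny 2x2 grids, one per timestep.
archive = grids_to_zip({"1950-01": [[1, 2], [3, 4]], "1950-02": [[5, 6], [7, 8]]})
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()
```

Each `.asc` member is an independent 2-D grid, which is how a format without a time dimension can still carry a time series.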
9. BC PRISM
1. The BC PRISM portal is very similar to the BCSD Downscaling portal, with a few minor differences. First of all, the map projection is specific to BC. We've used the BC Albers projection, which is a little more visually appealing (though it does present some challenges). Secondly, because the PRISM data consists only of monthly climatologies, the data volume in the temporal dimension is very small. For that reason, we eliminated the time subset controls and chose just to give the user the entire time range.
12. Software Components
1. One thing that you'll notice from this diagram is that the data itself is at the foundation of this software stack. Without the data in place beforehand, essentially nothing else can exist. Even the metadata in the database comes from the NetCDF files. This is why we have been somewhat militant about wanting your data to be finalized before we begin to work on the portal to publish it.
2. The NetCDF box here is the only thing that is just data sitting on disk. These four boxes (PostgreSQL, ncWMS, pydap, pdp) are all different pieces of software running on the server which respond to incoming web requests. PostgreSQL organizes all of the metadata about the available data, ncWMS provides the climate visualization layers, pydap responds to requests for the actual data, and pdp responds to all of the requests that build up the user interface. [Do a page load showing the network tools]
14. Metadata Database
1. This might be a bit too much detail, but try to bear with me. This database stores the full relationship structure between all of the data files that we store and want to publish. It tracks all of the files on disk that we have, all of the different variables that they contain, and the full ranges for each variable so that we can quickly set color scales and such for the visualization layers. It stores all of the metadata about the files, such as the timesteps that they contain, what their grid parameters are, what models they are from, and how they relate to other driving models (for example, in the case of an RCM forced by a GCM). All of these can be grouped into "ensembles": groups of rasters that we are publishing together on a single portal page.
2. The data contained in the schema allows the web application to function quickly, because everything is quickly searchable without opening up a bunch of files and having to read terabytes of data just to determine a few key attributes.
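The payoff of the schema is that questions like "what is the color-scale range for this layer?" become index lookups. Here is a minimal sketch of that idea with an invented schema; the table and column names are assumptions for illustration, not PCIC's actual database.

```python
import sqlite3

# Hypothetical, simplified version of the metadata schema: files contain
# variables; variables carry precomputed ranges; ensembles group files
# for publication on a single portal page.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_files (id INTEGER PRIMARY KEY, filename TEXT);
    CREATE TABLE variables (
        id INTEGER PRIMARY KEY, file_id INTEGER, name TEXT,
        range_min REAL, range_max REAL
    );
    CREATE TABLE ensembles (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ensemble_files (ensemble_id INTEGER, file_id INTEGER);
""")
conn.execute("INSERT INTO data_files VALUES (1, '/data/tasmax_bcsd.nc')")
conn.execute("INSERT INTO variables VALUES (1, 1, 'tasmax', -40.0, 45.0)")
conn.execute("INSERT INTO ensembles VALUES (1, 'bcsd_downscale_canada')")
conn.execute("INSERT INTO ensemble_files VALUES (1, 1)")

# The color-scale question, answered from metadata alone;
# no NetCDF file is ever opened.
row = conn.execute("""
    SELECT v.name, v.range_min, v.range_max
    FROM ensembles e
    JOIN ensemble_files ef ON ef.ensemble_id = e.id
    JOIN variables v ON v.file_id = ef.file_id
    WHERE e.name = 'bcsd_downscale_canada'
""").fetchone()
```

The same joins answer "which layers belong on this portal page?" without touching the terabytes on disk.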
16. Python Backend
1. We have written a full web application backend in Python which does all of the file format translation and all of the database communication, and passes all of the metadata on to the web UI to be interpreted by the user. The application consists of about 2800 lines of Python code plus 1500 lines of testing code that we have written outright. There's about another 3000 lines of code which makes up PyDAP, which we have heavily modified.
James Hiebert PCIC Data Portal 2.0
19. OPeNDAP and PyDAP
Designed to be a:
"discipline-neutral means of requesting and providing data across the [web]"
Data Access Protocol (DAP)
Open source
Machine-to-machine transfer of scientific data
Mostly supported by US scientific agencies (NOAA, NASA, NSF)
1. PyDAP is the component of the data portal that actually provides the data download services. It's an implementation of the OPeNDAP protocol, which is designed to be a discipline-neutral means of transferring data across the web. This protocol is open source and is designed to be OS- and application-independent, such that you can get data into whatever software you want to use to do your data analysis. It's supported mostly by US scientific agencies such as NOAA, NASA and the National Science Foundation.
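Concretely, a DAP client asks for data by appending a constraint expression to the dataset URL: the variable name followed by one index range per dimension. A small sketch of building such a request (the server URL and dataset path are placeholders; the bracketed constraint syntax is DAP 2's):

```python
# Hypothetical dataset URL, for illustration only; the constraint-expression
# syntax (variable[start:stop] per dimension) is standard DAP 2.
def dap_subset_url(base, variable, *dim_slices):
    """Build an OPeNDAP URL requesting a hyperslab of one variable."""
    ranges = "".join(f"[{start}:{stop}]" for start, stop in dim_slices)
    return f"{base}.dods?{variable}{ranges}"

url = dap_subset_url(
    "http://example.org/data/tasmax_bcsd.nc",  # assumed URL
    "tasmax",
    (0, 364),    # time indices: one year of daily steps
    (100, 200),  # latitude indices
    (250, 350),  # longitude indices
)
```

This machine-readable subsetting is what lets any DAP-aware analysis package pull exactly the slab it needs.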
21. OPeNDAP and PyDAP
1. There are a number of different OPeNDAP servers out there, but PyDAP is the one that we use to serve all of the data itself. Its architecture is quite a bit more flexible than some of the other OPeNDAP servers out there. This is a rough layout of the architecture. It has a number of "handlers" which are written to interpret different data formats and translate them to the DAP structure. Then on the top end, there are numerous "responders" that translate the DAP structure into output formats that the user wants.
2. [describe more specifically which parts are ours and which we use]
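The handlers-and-responders split can be illustrated with a toy pipeline. These classes are invented for this sketch and are not pydap's actual API; the point is only the shape: any of N input formats maps to a neutral structure, which maps to any of M output formats.

```python
# Toy model of the handler -> neutral structure -> responder pipeline.
class CsvHandler:
    """Handler: parse one storage format into a neutral column-store dict."""
    def parse(self, text):
        rows = [line.split(",") for line in text.strip().splitlines()]
        names, values = rows[0], rows[1:]
        return {n: [r[i] for r in values] for i, n in enumerate(names)}

class AsciiResponder:
    """Responder: render the neutral structure into one output format."""
    def render(self, dataset):
        return "\n".join(
            f"{name}: {', '.join(col)}" for name, col in dataset.items()
        )

dataset = CsvHandler().parse("station,temp\nYYJ,7.5\nYVR,8.1")
output = AsciiResponder().render(dataset)
```

Because every handler emits the same structure, adding a new output format means writing one responder, not one converter per input format.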
23. How much of Pydap is our code? (Source: hg churn)
pydap.handlers.pcic: 100%
pydap.handlers.hdf5: 68.0%
pydap.responses.netcdf: 61.5%
pydap.handlers.sql: 12.3%
pydap.handlers.csv: 3.7%
pydap: 2.3%
pydap.responses.xls: 1.3%
pydap.responses.html: ?
1. To give you a bit of an idea of to what degree Pydap was "off-the-shelf", I ran the command "hg churn" on all of the pydap repositories, which measures the changes in the repository by lines of code. The fractions shown are the churn of PCIC staff divided by the total churn of all committers. You can see that we wrote one handler entirely by ourselves, the HDF and NetCDF work is mostly ours, and for the rest of the modules we only had to make minimal changes.
25. Big data, big RAM, BadRequest, Oh My!
1. One of the technical problems that we ran up against was that all of the available OPeNDAP data servers load their responses entirely into RAM before sending them out. So if you want to serve up large data sets, the size of your response is limited by your available RAM divided by the number of concurrent responses that you are prepared to serve. If you make a request to, say, the THREDDS OPeNDAP server that's larger than the JVM's allocated memory, the user will just get back a BadRequest error.
2. For some applications this may be fine, or even desirable, but for the purposes of serving large data sets, the network pipe is usually the bottleneck. Rather than annoy and frustrate the user by forcing them to carve up their data requests to be arbitrarily small, we wanted to allow as large a request as the users were prepared to accept.
27. Generators: '70s tech that works today!
a function which yields execution rather than returning
yields values one at a time, on-demand
low memory footprint
faster; no calling overhead
elegant!
1. Enter generators and coroutines. Generators are a programming construct where a function, rather than returning, can yield execution and, in effect, return values one at a time on demand. This has the performance advantage of maintaining a low memory footprint: if you want to return something large, you don't have to do so all at once. Generators also tend to be slightly faster, because you avoid a lot of the calling overhead of stack manipulation.
2. Generators have been around for a good thirty-five years, but have been experiencing a bit of a renaissance lately. If one programs in Python, they are extremely easy to use, and with the advent of big data applications, they have a lot of utility.
29. Generator Example
from itertools import islice

def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# print the first 10 values of the fibonacci sequence
for x in islice(fibonacci(), 10):
    print(x)
1. For those who aren't familiar, here's a quick example to help understand generators. Generating a Fibonacci sequence is kind of the quintessential toy example. The generator function, fibonacci(), is defined at the top. You'll notice that it's an infinite loop, because the sequence is, by definition, infinite. But rather than building up the values in memory, it just has a simple and elegant "yield" statement right inside the loop. The calling loop down below actually pulls items from the function, one at a time, and then does whatever it needs to do with them. It's fast, efficient, and actually fairly elegant, readable code, too.
2. So you can see, for something like a web application serving big datasets, this is perfect, because we can provide a very low latency response, and then stream the data to the user as our high-latency operations like disk reads take place.
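In WSGI terms, this streaming idea boils down to returning a generator as the response iterable instead of a buffered string. A minimal sketch, with a fabricated dataset and function names:

```python
# Toy WSGI app that streams its response from a generator rather than
# buffering it in RAM. The "data" is invented for illustration.
def fake_disk_reads():
    """Stand-in for high-latency reads: one chunk per timestep."""
    for timestep in range(3):
        yield f"timestep {timestep}: 1.0 2.0 3.0\n".encode()

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    # Returning the generator means the server pulls chunks on demand,
    # so memory use stays constant regardless of response size.
    return fake_disk_reads()

# Exercise the app without a real server.
chunks = list(app({}, lambda status, headers: None))
body = b"".join(chunks)
```

The headers go out as soon as the first chunk is ready, which is what gives the low-latency response described above.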
3. None of the OPeNDAP servers out there supported streaming, so many of the modifications that we made to PyDAP were for it to stream its responses.
31. ncWMS
Off-the-shelf
Visualization of NetCDF rasters
Full featured WMS server
Limitations
File-based layer configurations (tedious and error-prone!)
Loads layers serially on startup (slow!)
Scans layers for ranges (really slow!)
1. We're using a modified version of ncWMS to provide visualization of the climate rasters. It gives us a lot of stuff for free. It's a full-featured Web Mapping Service server that converts NetCDF files into tiled images usable on the web. [demo]
2. Unfortunately it has a few limitations that make it non-ideal for use with big data. To configure a layer, you have to go through the files one by one, add them to the list, and configure 5-10 different attributes. Additionally, whenever you start or restart the server, it goes through every single file, in order, and scans them to determine their ranges so that it can assign a colorbar. This can take many minutes, possibly hours, and it only gets slower the more layers you add.
3. David Bronaugh has done some great work making modifications to ncWMS to run it off of our metadata database, so that it gets its list of layers, all of the variable ranges, and everything else from the database. This has made it possible to scale our deployment up.
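For context, a WMS client fetches each map tile with a GetMap request whose parameters are fixed by the WMS specification. A sketch of building one in Python (the endpoint and layer name are placeholders; the parameter names are the spec's):

```python
from urllib.parse import urlencode

# Build a standard WMS 1.1.1 GetMap request URL. The server URL and layer
# name are assumed for illustration; the parameter names are from the spec.
def getmap_url(base, layer, bbox, size=(256, 256)):
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": size[0],
        "HEIGHT": size[1],
        "FORMAT": "image/png",
        "TRANSPARENT": "true",
    }
    return f"{base}?{urlencode(params)}"

url = getmap_url(
    "http://example.org/ncWMS/wms",    # assumed endpoint
    "tasmax",                          # assumed layer name
    (-140.0, 48.0, -114.0, 60.0),      # roughly BC, lon/lat degrees
)
```

The web map issues one such request per visible tile, which is why layer startup scans and per-layer configuration cost ncWMS so dearly at scale.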
33. Mapnik and Basemaps
Create our own basemaps from OpenStreetMap
Maximum flexibility in domain and projection
1. A flat image of the climate rasters isn't that useful, especially if you want to look at details in a particular locality. So, thanks to some great work by Basil, we have our own web basemaps based on data from the OpenStreetMap project. We have the ability to generate our own basemaps in any projection that we want and for any domain. And we have control over the tile service, so we can tweak it for maximum performance.
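Web basemap tiles in the usual Web Mercator case follow the standard OpenStreetMap "slippy map" numbering; the lon/lat-to-tile math is short enough to sketch (the test coordinates are chosen for illustration):

```python
import math

# Standard OpenStreetMap-style tile numbering: which tile (x, y) covers
# a given lon/lat at zoom level z, where each zoom level has 2^z x 2^z tiles.
def lonlat_to_tile(lon, lat, zoom):
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Victoria, BC at zoom level 10.
tile = lonlat_to_tile(-123.37, 48.43, 10)
```

A custom projection such as BC Albers changes this indexing math, which is part of why running your own tile service (rather than reusing public Mercator tiles) is what makes arbitrary projections possible.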
35. JavaScript Front-end
2600 lines of JavaScript
Responsible for tying everything together for the web user
Does little to no processing itself / just makes requests to
various servers
1. Finally, the last piece of the software stack is the JavaScript front-end that ties everything else together for the user. This is probably the most finicky, and possibly the most complex, piece of the code base, even though it doesn't actually provide any functionality in and of itself. It has to be aware of all of the various services that are provided, it has to asynchronously make the requests, process them, and display things to the user, and often the results of one request affect other things on the page.
2. [Show dataset selection, and how it is a request. Show how dataset selection triggers a layer change and the loading of layer attributes]. If any of these things fails, badness ensues.
37. Automated Testing
1. In our two main repositories, we have about 1500 lines of code specifically for automated testing of the functionality of both the PCDS data portal and the raster portals. This test suite covers a large swath of the code base, but is also compact, so we can run the full test suite in less than 5 seconds. This is fast enough that it can be integrated directly into your development workflow, and you can ensure that any changes you make to the code have not negatively and unintentionally affected any previously programmed functionality.
39. Automated Testing
Why?
There's a lot of code and many code paths. Manual testing is insane, takes days, and isn't complete.
Provides an "executable specification" for what the software should do
Provides a way to ensure that code changes don't affect existing functionality (a.k.a. regression testing)
1. So with a system that provides this much functionality, there are a lot of different code paths through it, any of which could be taken for different user requests. It's important to test as many of these as possible, every time you make changes in the system. To manually go through all of these (and we did, with the release of the PCDS portal a year ago) is meticulous, time consuming and error prone. Automating this process pays off very quickly, both in time and in code quality.
2. Additionally, the tests provide a sort of "executable specification", declaring what the various pieces of the code are supposed to do. If a test fails, your code doesn't meet the spec.
3. Finally, the test suite provides a baseline against which further development cannot regress. It ensures that future changes will not negatively impact the functionality that we have previously developed.
4. [demo of pytest]
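The pytest workflow demoed here looks roughly like the following. The function under test is hypothetical, invented for this sketch, but the test shape (plain test_* functions containing bare asserts, discovered and run automatically) is pytest's actual convention.

```python
# A pytest-style regression test for an invented helper.
def parse_time_range(s):
    """Split a 'start/end' time-range string into its two halves."""
    start, _, end = s.partition("/")
    if not end:
        raise ValueError(f"expected 'start/end', got {s!r}")
    return start, end

def test_parse_time_range():
    assert parse_time_range("1950-01-01/2000-12-31") == ("1950-01-01", "2000-12-31")

def test_parse_time_range_rejects_missing_end():
    try:
        parse_time_range("1950-01-01")
    except ValueError:
        pass  # the expected failure mode
    else:
        raise AssertionError("expected a ValueError")

# pytest would discover these by name; they also run directly:
test_parse_time_range()
test_parse_time_range_rejects_missing_end()
```

Each such function pins down one code path, which is how a compact suite can still act as the executable specification described above.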