Presentation to the staff of the Pacific Climate Impacts Consortium on 2014/02/18 about its Computational Support Group's work on version 2.0 of the PCIC Data Portal.
3. Outline
1 Demos
2 Architecture
Metadata Database
Python Backend
Pydap
ncWMS
Basemaps
Front-end
3 Bonus
Automated Testing
2014-02-18
PCIC Data Portal 2.0
Outline
1. Last week we deployed our 4th and hopefully final release candidate for version 2.0 of the PCIC Data Portal. It's been a four-month beta period over which we have received and responded to feedback, both from inside PCIC and from some external beta testers. Many of you have seen these at the various theme meetings that we had throughout the fall, but I'd like to take this opportunity both to introduce the rest of you to the data portal and to elaborate more on what is running behind the scenes and all of the work that has gone into producing it.
2. Typically in these presentations, I hold you captive with all of the technical details first and save the demo for the end. But in this case, I'll start with the demo, and then if you don't care about how we did it, you can just check out after that.
5. Raster Portal(s)
Coming soon!
1. The software that we have written is a variety of components that generally handle the organization and presentation of raster data; that is, gridded fields of spatiotemporal data. There are several sets of high-value data for which we have written a "raster portal" that can serve that data up.
7. BCSD Downscale Canada
1. You'll see that the feature set is intentionally fairly sparse. The application's purpose is to allow users to get the data they want, and only the data they want, and then to send them on their way. The main section of screen real estate is the map. The map is for displaying the areas for which data exists and for allowing the user to select an area to download.
2. In the top right, there is a tree selection which controls the dataset that is displayed and that which will be downloaded. And finally there are a couple of options for selecting a time range and data format.
3. We only support formats which support multidimensional data, which isn't very many right now. We'll be adding Arc ASCII Grid by the end of the fiscal year, which isn't technically multidimensional, but we'll probably send a zip file of individual grids, one per timestep.
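The "zip file of individual grids, one per timestep" idea can be sketched in a few lines of Python. This is purely illustrative: the variable name, header values, and file naming are assumptions, not the portal's actual output format.

```python
import io
import zipfile

def grids_to_zip(grids, var_name="tasmax"):
    """Pack one Arc ASCII grid per timestep into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for timestep, rows in grids.items():
            # Minimal Arc ASCII header followed by the data rows.
            header = (
                f"ncols {len(rows[0])}\n"
                f"nrows {len(rows)}\n"
                "xllcorner 0.0\nyllcorner 0.0\ncellsize 1.0\nNODATA_value -9999\n"
            )
            body = "\n".join(" ".join(str(v) for v in row) for row in rows)
            zf.writestr(f"{var_name}_{timestep}.asc", header + body)
    return buf.getvalue()

# Two tiny 2x2 grids, one per timestep.
archive = grids_to_zip({"1950-01": [[1, 2], [3, 4]], "1950-02": [[5, 6], [7, 8]]})
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()
```

Each `.asc` member is an independent 2-D grid, which is how a format without a time dimension can still carry a time series.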
9. BC PRISM
1. The BC PRISM portal is very similar to the BCSD Downscaling portal, with a few minor differences. First of all, the map projection is specific to BC. We've used the BC Albers projection, which is a little more visually appealing (though it does present some challenges). Secondly, because the PRISM data consists only of monthly climatologies, the data volume in the temporal dimension is very small. For that reason, we eliminated the time subset controls and chose just to give the user the entire time range.
12. Software Components
1. One thing that you'll notice from this diagram is that the data itself is at the foundation of this software stack. Without the data in place beforehand, essentially nothing else can exist. Even the metadata in the database comes from the NetCDF files. This is why we have been somewhat militant about wanting your data to be finalized before we begin to work on the portal to publish it.
2. The NetCDF box here is the only thing that is just data sitting on disk. These four boxes (PostgreSQL, ncWMS, pydap, pdp) are all different pieces of software running on the server which respond to incoming web requests. PostgreSQL organizes all of the metadata about the available data, ncWMS provides the climate visualization layers, pydap responds to requests for the actual data, and pdp responds to all of the requests that build up the user interface. [Do a page load showing the network tools]
14. Metadata Database
1. This might be a bit too much detail, but try to bear with me. This database stores the full relationship structure between all of the data files that we store and want to publish. It tracks all of the files on disk that we have, all of the different variables that they contain, and the full ranges for each variable so that we can quickly set color scales and such for the visualization layers. It stores all of the metadata about the files, such as the timesteps that they contain, what their grid parameters are, what models they are from, and how they relate to other driving models (for example, in the case of an RCM forced by a GCM). All of these can be grouped into "ensembles": groups of rasters that we are publishing together on a single portal page.
2. The data contained in the schema allows the web application to function quickly, because everything is quickly searchable without opening up a bunch of files and having to read terabytes of data just to determine a few key attributes.
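The payoff of the schema is that questions like "what is the color-scale range for this layer?" become index lookups. Here is a minimal sketch of that idea with an invented schema; the table and column names are assumptions for illustration, not PCIC's actual database.

```python
import sqlite3

# Hypothetical, simplified version of the metadata schema: files contain
# variables; variables carry precomputed ranges; ensembles group files
# for publication on a single portal page.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_files (id INTEGER PRIMARY KEY, filename TEXT);
    CREATE TABLE variables (
        id INTEGER PRIMARY KEY, file_id INTEGER, name TEXT,
        range_min REAL, range_max REAL
    );
    CREATE TABLE ensembles (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ensemble_files (ensemble_id INTEGER, file_id INTEGER);
""")
conn.execute("INSERT INTO data_files VALUES (1, '/data/tasmax_bcsd.nc')")
conn.execute("INSERT INTO variables VALUES (1, 1, 'tasmax', -40.0, 45.0)")
conn.execute("INSERT INTO ensembles VALUES (1, 'bcsd_downscale_canada')")
conn.execute("INSERT INTO ensemble_files VALUES (1, 1)")

# The color-scale question, answered from metadata alone;
# no NetCDF file is ever opened.
row = conn.execute("""
    SELECT v.name, v.range_min, v.range_max
    FROM ensembles e
    JOIN ensemble_files ef ON ef.ensemble_id = e.id
    JOIN variables v ON v.file_id = ef.file_id
    WHERE e.name = 'bcsd_downscale_canada'
""").fetchone()
```

The same joins answer "which layers belong on this portal page?" without touching the terabytes on disk.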
16. Python Backend
1. We have written a full web application backend in Python which does all of the file format translation and all of the database communication, and passes all of the metadata on to the web UI to be interpreted by the user. The application consists of about 2800 lines of Python code plus 1500 lines of testing code that we have written outright. There's about another 3000 lines of code which makes up PyDAP, which we have heavily modified.
James Hiebert PCIC Data Portal 2.0
19. OPeNDAP and PyDAP
Designed to be a:
"discipline-neutral means of requesting and providing data across the [web]"
Data Access Protocol (DAP)
Open source
Machine-to-machine transfer of scientific data
Mostly supported by US scientific agencies (NOAA, NASA, NSF)
1. PyDAP is the component of the data portal that actually provides the data download services. It's an implementation of the OPeNDAP protocol, which is designed to be a discipline-neutral means of transferring data across the web. This protocol is open source and is designed to be OS- and application-independent, such that you can get data into whatever software you want to use to do your data analysis. It's supported mostly by US scientific agencies such as NOAA, NASA and the National Science Foundation.
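Concretely, a DAP client asks for data by appending a constraint expression to the dataset URL: the variable name followed by one index range per dimension. A small sketch of building such a request (the server URL and dataset path are placeholders; the bracketed constraint syntax is DAP 2's):

```python
# Hypothetical dataset URL, for illustration only; the constraint-expression
# syntax (variable[start:stop] per dimension) is standard DAP 2.
def dap_subset_url(base, variable, *dim_slices):
    """Build an OPeNDAP URL requesting a hyperslab of one variable."""
    ranges = "".join(f"[{start}:{stop}]" for start, stop in dim_slices)
    return f"{base}.dods?{variable}{ranges}"

url = dap_subset_url(
    "http://example.org/data/tasmax_bcsd.nc",  # assumed URL
    "tasmax",
    (0, 364),    # time indices: one year of daily steps
    (100, 200),  # latitude indices
    (250, 350),  # longitude indices
)
```

This machine-readable subsetting is what lets any DAP-aware analysis package pull exactly the slab it needs.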
21. OPeNDAP and PyDAP
1. There are a number of different OPeNDAP servers out there, but PyDAP is the one that we use to serve all of the data itself. Its architecture is quite a bit more flexible than some of the other OPeNDAP servers out there. This is a rough layout of the architecture. It has a number of "handlers" which are written to interpret different data formats and translate them to the DAP structure. Then on the top end, there are numerous "responders" that translate the DAP structure into output formats that the user wants.
2. [describe more specifically which parts are ours and which we use]
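The handlers-and-responders split can be illustrated with a toy pipeline. These classes are invented for this sketch and are not pydap's actual API; the point is only the shape: any of N input formats maps to a neutral structure, which maps to any of M output formats.

```python
# Toy model of the handler -> neutral structure -> responder pipeline.
class CsvHandler:
    """Handler: parse one storage format into a neutral column-store dict."""
    def parse(self, text):
        rows = [line.split(",") for line in text.strip().splitlines()]
        names, values = rows[0], rows[1:]
        return {n: [r[i] for r in values] for i, n in enumerate(names)}

class AsciiResponder:
    """Responder: render the neutral structure into one output format."""
    def render(self, dataset):
        return "\n".join(
            f"{name}: {', '.join(col)}" for name, col in dataset.items()
        )

dataset = CsvHandler().parse("station,temp\nYYJ,7.5\nYVR,8.1")
output = AsciiResponder().render(dataset)
```

Because every handler emits the same structure, adding a new output format means writing one responder, not one converter per input format.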
23. How much of Pydap is our code? (Source: hg churn)
pydap.handlers.pcic: 100%
pydap.handlers.hdf5: 68.0%
pydap.responses.netcdf: 61.5%
pydap.handlers.sql: 12.3%
pydap.handlers.csv: 3.7%
pydap: 2.3%
pydap.responses.xls: 1.3%
pydap.responses.html: ?
1. To give you a bit of an idea of to what degree Pydap was "off-the-shelf", I ran the command "hg churn" on all of the pydap repositories, which measures the changes in the repository by lines of code. The fractions shown are the churn of PCIC staff divided by the total churn of all committers. You can see that we wrote one handler entirely by ourselves, the HDF and NetCDF work is mostly ours, and for the rest of the modules we only had to make minimal changes.
25. Big data, big RAM, BadRequest, Oh My!
1. One of the technical problems that we ran up against was that all of the available OPeNDAP data servers load their responses entirely into RAM before sending them out. So if you want to serve up large data sets, the size of your response is limited by your available RAM divided by the number of concurrent responses that you are prepared to serve. If you make a request to, say, the THREDDS OPeNDAP server that's larger than the JVM's allocated memory, the user will just get back a BadRequest error.
2. For some applications this may be fine, or even desirable, but for the purposes of serving large data sets, the network pipe is usually the bottleneck. Rather than annoy and frustrate the user by forcing them to carve up their data requests to be arbitrarily small, we wanted to allow as large a request as the users were prepared to accept.
27. Generators: '70s tech that works today!
a function which yields execution rather than returning
yields values one at a time, on-demand
low memory footprint
faster; no calling overhead
elegant!
1. Enter generators and coroutines. Generators are a programming construct where a function, rather than returning, can yield execution and, in effect, return values one at a time on demand. This has the performance advantage of maintaining a low memory footprint: if you want to return something large, you don't have to do so all at once. Generators also tend to be slightly faster, because you avoid a lot of the calling overhead of stack manipulation.
2. Generators have been around for a good thirty-five years, but have been experiencing a bit of a renaissance lately. If one programs in Python, they are extremely easy to use, and with the advent of big data applications, they have a lot of utility.
29. Generator Example
from itertools import islice

def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# print the first 10 values of the fibonacci sequence
for x in islice(fibonacci(), 10):
    print(x)
1. For those who aren't familiar, here's a quick example to help understand generators. Generating a Fibonacci sequence is kind of the quintessential toy example. The generator function, fibonacci(), is defined at the top. You'll notice that it's an infinite loop, because the sequence is, by definition, infinite. But rather than building up the values in memory, it just has a simple and elegant "yield" statement right inside the loop. The calling loop down below actually pulls items from the function, one at a time, and then does whatever it needs to do with them. It's fast, efficient, and actually fairly elegant, readable code, too.
2. So you can see, for something like a web application serving big datasets, this is perfect, because we can provide a very low latency response, and then stream the data to the user as our high-latency operations like disk reads take place.
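In WSGI terms, this streaming idea boils down to returning a generator as the response iterable instead of a buffered string. A minimal sketch, with a fabricated dataset and function names:

```python
# Toy WSGI app that streams its response from a generator rather than
# buffering it in RAM. The "data" is invented for illustration.
def fake_disk_reads():
    """Stand-in for high-latency reads: one chunk per timestep."""
    for timestep in range(3):
        yield f"timestep {timestep}: 1.0 2.0 3.0\n".encode()

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    # Returning the generator means the server pulls chunks on demand,
    # so memory use stays constant regardless of response size.
    return fake_disk_reads()

# Exercise the app without a real server.
chunks = list(app({}, lambda status, headers: None))
body = b"".join(chunks)
```

The headers go out as soon as the first chunk is ready, which is what gives the low-latency response described above.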
3. None of the OPeNDAP servers out there supported streaming, so many of the modifications that we made to PyDAP were for it to stream its responses.
31. ncWMS
Off-the-shelf
Visualization of NetCDF rasters
Full featured WMS server
Limitations
File-based layer configurations (tedious and error-prone!)
Loads layers serially on startup (slow!)
Scans layers for ranges (really slow!)
1. We're using a modified version of ncWMS to provide visualization of the climate rasters. It gives us a lot of stuff for free. It's a full-featured Web Mapping Service server that converts NetCDF files into tiled images usable on the web. [demo]
2. Unfortunately it has a few limitations that make it non-ideal for use with big data. To configure a layer, you have to go through the files one by one, add them to the list, and configure 5-10 different attributes. Additionally, whenever you start or restart the server, it goes through every single file, in order, and scans them to determine their ranges so that it can assign a colorbar. This can take many minutes, possibly hours, and it only gets slower the more layers you add.
3. David Bronaugh has done some great work making modifications to ncWMS to run it off of our metadata database, so that it gets its list of layers, all of the variable ranges, and everything else from the database. This has made it possible to scale our deployment up.
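For context, a WMS client fetches each map tile with a GetMap request whose parameters are fixed by the WMS specification. A sketch of building one in Python (the endpoint and layer name are placeholders; the parameter names are the spec's):

```python
from urllib.parse import urlencode

# Build a standard WMS 1.1.1 GetMap request URL. The server URL and layer
# name are assumed for illustration; the parameter names are from the spec.
def getmap_url(base, layer, bbox, size=(256, 256)):
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": size[0],
        "HEIGHT": size[1],
        "FORMAT": "image/png",
        "TRANSPARENT": "true",
    }
    return f"{base}?{urlencode(params)}"

url = getmap_url(
    "http://example.org/ncWMS/wms",    # assumed endpoint
    "tasmax",                          # assumed layer name
    (-140.0, 48.0, -114.0, 60.0),      # roughly BC, lon/lat degrees
)
```

The web map issues one such request per visible tile, which is why layer startup scans and per-layer configuration cost ncWMS so dearly at scale.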
33. Mapnik and Basemaps
Create our own basemaps from OpenStreetMap
Maximum flexibility in domain and projection
1. A flat image of the climate rasters isn't that useful, especially if you want to look at details in a particular locality. So, thanks to some great work by Basil, we have our own web basemaps based on data from the OpenStreetMap project. We have the ability to generate our own basemaps in any projection that we want and for any domain. And we have control over the tile service, so we can tweak it for maximum performance.
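Web basemap tiles in the usual Web Mercator case follow the standard OpenStreetMap "slippy map" numbering; the lon/lat-to-tile math is short enough to sketch (the test coordinates are chosen for illustration):

```python
import math

# Standard OpenStreetMap-style tile numbering: which tile (x, y) covers
# a given lon/lat at zoom level z, where each zoom level has 2^z x 2^z tiles.
def lonlat_to_tile(lon, lat, zoom):
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Victoria, BC at zoom level 10.
tile = lonlat_to_tile(-123.37, 48.43, 10)
```

A custom projection such as BC Albers changes this indexing math, which is part of why running your own tile service (rather than reusing public Mercator tiles) is what makes arbitrary projections possible.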
35. JavaScript Front-end
2600 lines of JavaScript
Responsible for tying everything together for the web user
Does little to no processing itself / just makes requests to
various servers
1. Finally, the last piece of the software stack is the JavaScript front-end that ties everything else together for the user. This is probably the most finicky, and possibly the most complex, piece of the code base, even though it doesn't actually provide any functionality in and of itself. It has to be aware of all of the various services that are provided, it has to asynchronously make the requests, process them, and display things to the user, and often the results of one request affect other things on the page.
2. [Show dataset selection, and how it is a request. Show how dataset selection triggers a layer change and the loading of layer attributes]. If any of these things fails, badness ensues.
37. Automated Testing
1. In our two main repositories, we have about 1500 lines of code specifically for automated testing of the functionality of both the PCDS data portal and the raster portals. This test suite covers a large swath of the code base, but is also compact, so we can run the full test suite in less than 5 seconds. This is fast enough that it can be integrated directly into your development workflow, and you can ensure that any changes you make to the code have not negatively and unintentionally affected any previously programmed functionality.
39. Automated Testing
Why?
There's a lot of code and many code paths. Manual testing is insane, takes days, and isn't complete.
Provides an "executable specification" for what the software should do
Provides a way to ensure that code changes don't affect existing functionality (a.k.a. regression testing)
1. So with a system that provides this much functionality, there are a lot of different code paths through it, any of which could be taken for different user requests. It's important to test as many of these as possible, every time you make changes in the system. To manually go through all of these (and we did, with the release of the PCDS portal a year ago) is meticulous, time consuming and error prone. Automating this process pays off very quickly, both in time and in code quality.
2. Additionally, the tests provide a sort of "executable specification", declaring what the various pieces of the code are supposed to do. If a test fails, your code doesn't meet the spec.
3. Finally, the test suite provides a baseline against which further development cannot regress. It ensures that future changes will not negatively impact the functionality that we have previously developed.
4. [demo of pytest]
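The pytest workflow demoed here looks roughly like the following. The function under test is hypothetical, invented for this sketch, but the test shape (plain test_* functions containing bare asserts, discovered and run automatically) is pytest's actual convention.

```python
# A pytest-style regression test for an invented helper.
def parse_time_range(s):
    """Split a 'start/end' time-range string into its two halves."""
    start, _, end = s.partition("/")
    if not end:
        raise ValueError(f"expected 'start/end', got {s!r}")
    return start, end

def test_parse_time_range():
    assert parse_time_range("1950-01-01/2000-12-31") == ("1950-01-01", "2000-12-31")

def test_parse_time_range_rejects_missing_end():
    try:
        parse_time_range("1950-01-01")
    except ValueError:
        pass  # the expected failure mode
    else:
        raise AssertionError("expected a ValueError")

# pytest would discover these by name; they also run directly:
test_parse_time_range()
test_parse_time_range_rejects_missing_end()
```

Each such function pins down one code path, which is how a compact suite can still act as the executable specification described above.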