This presentation covers a topic we have mastered over several years: geospatial.
We will show how we deal with very specific spatial challenges in our day-to-day use cases:
• Answering questions by combining the best of Big Data and geospatial analysis.
• Ingesting and using raster and vector data with our Massively Parallel Processing platform (Thor).
• Storing and querying spatial information with sub-second queries, using our rapid data delivery engine (Roxie).
And much more under the umbrella of LexisNexis HPCC Systems (High Performance Computing Cluster), an open source platform for Big Data processing and analytics.
8. Projections are used to represent the world in ways we can process
What is a projection?
• The Earth is round and maps are flat
• Physical maps
• Computer maps
Have I seen projections before?
• Gall-Peters vs Mercator vs Winkel tripel
• GPS (latitude/longitude)
• Google Maps
10. Projections to know about
WGS84
• Latitude and longitude
• Our best approximation of the world
• Not always the best for a specific region
• Not technically a projection
Mercator
• Many different ones; choose one based on your location
• Reduces the area it covers to a simple Cartesian plane
• Good near the central axis, bad far away from it:
  • Web Mercator covers the whole world: good near the equator, gets worse as you travel north or south
  • Irish National Grid: very good for Ireland, awful anywhere else
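The "good near the equator, worse further north or south" behaviour can be made concrete with a small sketch. It assumes the spherical Web Mercator formulas on a sphere whose radius is the WGS84 semi-major axis; the function names are illustrative and not part of HPCC Systems.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used as the sphere radius


def to_web_mercator(lat_deg, lon_deg):
    """Project latitude/longitude in degrees to spherical Mercator metres."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = EARTH_RADIUS_M * lon
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


def mercator_scale_factor(lat_deg):
    """How much lengths are stretched at a given latitude (1.0 = no stretch)."""
    return 1.0 / math.cos(math.radians(lat_deg))


if __name__ == "__main__":
    for lat in (0, 30, 53, 60, 80):
        print(lat, round(mercator_scale_factor(lat), 3))
```

The scale factor is exactly 1 at the equator and doubles by 60 degrees of latitude, which is why distances and areas measured directly in Web Mercator coordinates become unreliable away from the equator.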
11. Lies, damned lies, statistics… and maps!
*https://twitter.com/flashboy/status/641221733509373952
12. Lies, damned lies, statistics… and maps!
Projection woes:
• A straight line in Mercator is not a straight line in WGS84
• Figure labels: "Four points converted to WGS84" vs "Where the lines should be"
• Don’t re-project polygons! This “solution” is only good enough for visuals, not for maths.
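A minimal sketch of why converting only the corner points is not enough, again assuming spherical Mercator (the helper names are hypothetical). It takes the midpoint of a straight Mercator segment, converts it back to latitude/longitude, and compares it with the midpoint of the straight segment drawn between the same two corners in latitude/longitude.

```python
import math

R = 6378137.0  # sphere radius in metres (assumption: spherical Mercator)


def forward(lat_deg, lon_deg):
    """Latitude/longitude in degrees -> Mercator x, y in metres."""
    lat = math.radians(lat_deg)
    return R * math.radians(lon_deg), R * math.log(math.tan(math.pi / 4 + lat / 2))


def inverse(x, y):
    """Mercator x, y in metres -> latitude/longitude in degrees."""
    lat = 2 * math.atan(math.exp(y / R)) - math.pi / 2
    return math.degrees(lat), math.degrees(x / R)


def mercator_midpoint_in_wgs84(p, q):
    """Midpoint of the straight Mercator segment p-q, expressed as lat/lon."""
    (x1, y1), (x2, y2) = forward(*p), forward(*q)
    return inverse((x1 + x2) / 2, (y1 + y2) / 2)


a = (0.0, 0.0)    # (lat, lon)
b = (60.0, 40.0)  # (lat, lon)

# Midpoint of the straight segment drawn directly in lat/lon space.
naive_mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
# Midpoint of the straight segment drawn in Mercator space.
mercator_mid = mercator_midpoint_in_wgs84(a, b)

if __name__ == "__main__":
    print("lat/lon midpoint :", naive_mid)
    print("Mercator midpoint:", mercator_mid)
```

The two midpoints share a longitude but differ by more than five degrees of latitude, so an edge that is straight in one system bends in the other. Adding intermediate points along each edge before converting reduces the visual error but does not make the geometry exact.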
19. STEPS
Extend HPCC and ECL to support the following main capabilities:
• Spatial filtering of vector geometries
• Spatial operations using vector geometries
• Spatial reference projection and transformation
• Reading of compressed geo-raster files
Big Data
21. Ingesting Vector Data
It’s a CSV file.
Id | Name | Geometry | Projection | Value
1 | Alice’s place | POINT (53.78925462 -6.08354321) | 4326* | €5,973,000
2 | Bob’s place | POINT (-34.78925462 7.08354321) | 4326 | €872,000
3 | Celine’s place | POINT (102.78925462 -6.08354321) | 4326 | €9,324,000
* WGS84 (Lat/Lon)
1. Policy data
2. Geocode address
3. Peril tag
Data ready to ingest
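As an illustration of what reading such a file involves, here is a small Python sketch. It assumes a comma-delimited layout with quoted currency values, which the slide does not specify, and the field names in the output are hypothetical. The coordinate pair is kept in the order it appears in the file, since the slide footnote describes it as Lat/Lon.

```python
import csv
import io
import re

# Two rows from the slide, in an assumed comma-delimited layout.
SAMPLE = """Id,Name,Geometry,Projection,Value
1,Alice's place,POINT (53.78925462 -6.08354321),4326,"€5,973,000"
2,Bob's place,POINT (-34.78925462 7.08354321),4326,"€872,000"
"""

POINT_RE = re.compile(r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)")


def parse_row(row):
    """Turn one CSV row into typed fields; raise if the geometry is not a POINT."""
    match = POINT_RE.fullmatch(row["Geometry"].strip())
    if match is None:
        raise ValueError(f"unsupported geometry: {row['Geometry']!r}")
    first, second = (float(group) for group in match.groups())
    return {
        "id": int(row["Id"]),
        "name": row["Name"],
        "coords": (first, second),  # order as written in the file
        "epsg": int(row["Projection"]),
        "value_eur": int(re.sub(r"[^\d]", "", row["Value"])),
    }


records = [parse_row(row) for row in csv.DictReader(io.StringIO(SAMPLE))]

if __name__ == "__main__":
    for record in records:
        print(record)
```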
22. Ingesting Vector Data
It’s a GML / XML file.
1. Shape data
2. Parse XPath
3. Process and index
Data ready to query
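The "Parse XPath" step can be sketched with Python's standard library. The GML fragment, the FloodZone element name and the output fields below are invented for illustration; real shape files will use their own schema, and this is not the ECL code used on the platform.

```python
import xml.etree.ElementTree as ET

# Hypothetical GML fragment; element names other than the gml: ones are made up.
SAMPLE_GML = """
<FeatureCollection xmlns:gml="http://www.opengis.net/gml">
  <featureMember>
    <FloodZone id="zone-1">
      <gml:Polygon srsName="EPSG:4326">
        <gml:exterior>
          <gml:LinearRing>
            <gml:posList>0 0 0 10 10 10 10 0 0 0</gml:posList>
          </gml:LinearRing>
        </gml:exterior>
      </gml:Polygon>
    </FloodZone>
  </featureMember>
</FeatureCollection>
"""

NS = {"gml": "http://www.opengis.net/gml"}
RING_PATH = "gml:exterior/gml:LinearRing/gml:posList"


def extract_polygons(gml_text):
    """Return id, spatial reference and exterior ring for each FloodZone."""
    root = ET.fromstring(gml_text.strip())
    polygons = []
    for zone in root.findall(".//FloodZone"):
        polygon = zone.find(".//gml:Polygon", NS)
        numbers = [float(token) for token in polygon.find(RING_PATH, NS).text.split()]
        # posList is a flat list of coordinates; pair them up two at a time.
        ring = list(zip(numbers[0::2], numbers[1::2]))
        polygons.append({"id": zone.get("id"), "srs": polygon.get("srsName"), "ring": ring})
    return polygons


polygons = extract_polygons(SAMPLE_GML)

if __name__ == "__main__":
    print(polygons)
```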
25. Indexing vector data
• Outline box: the biggest rectangle
• Boxes contain boxes
• The bottom box in the tree contains the actual geometries
• Here, 3 levels are pictured
• Boxes can overlap (each entry is stored in only one box)
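The "boxes contain boxes" idea can be sketched as follows. The Box class and helper are hypothetical names for illustration: a leaf box is the bounding rectangle of its geometries, and a parent box is the union of its children's boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def union(self, other):
        """Smallest box that encloses both boxes (used to build parent boxes)."""
        return Box(
            min(self.min_x, other.min_x),
            min(self.min_y, other.min_y),
            max(self.max_x, other.max_x),
            max(self.max_y, other.max_y),
        )

    def contains(self, other):
        """True if the other box lies entirely inside this one."""
        return (
            self.min_x <= other.min_x
            and self.min_y <= other.min_y
            and self.max_x >= other.max_x
            and self.max_y >= other.max_y
        )


def box_of_points(points):
    """Bounding box of a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return Box(min(xs), min(ys), max(xs), max(ys))


leaf_a = box_of_points([(1, 1), (2, 3)])
leaf_b = box_of_points([(5, 4), (7, 8)])
parent = leaf_a.union(leaf_b)  # the outline box of this two-leaf tree

if __name__ == "__main__":
    print(parent, parent.contains(leaf_a), parent.contains(leaf_b))
```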
26. Querying vector data
Searching an R-Tree, e.g. finding all buildings (points) inside a flood zone (polygon):
• Does the query polygon overlap our box?
  • N: Return empty list
  • Y: Is it a leaf node?
    • Y: Return all nodes for verification
    • N: Search our boxes’ children
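The search flow above can be sketched in a few lines of Python. This is a simplified illustration, not the HPCC implementation: the Node class is hypothetical, and the query polygon is represented by its bounding box, so the returned entries are only candidates that still need an exact point-in-polygon check.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    box: tuple                                     # (min_x, min_y, max_x, max_y)
    children: list = field(default_factory=list)   # inner nodes only
    entries: list = field(default_factory=list)    # leaf nodes only

    @property
    def is_leaf(self):
        return not self.children


def overlaps(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes intersect or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node, query_box):
    """Collect candidate entries whose leaf box overlaps the query box."""
    if not overlaps(node.box, query_box):
        return []                    # no overlap: return empty list
    if node.is_leaf:
        return list(node.entries)    # leaf: return all entries for verification
    found = []
    for child in node.children:      # inner node: search the children
        found.extend(search(child, query_box))
    return found


leaf_1 = Node(box=(0, 0, 4, 4), entries=["A", "B"])
leaf_2 = Node(box=(6, 6, 9, 9), entries=["C"])
root = Node(box=(0, 0, 9, 9), children=[leaf_1, leaf_2])

if __name__ == "__main__":
    print(search(root, (3, 3, 5, 5)))
    print(search(root, (10, 10, 12, 12)))
```

Returning whole leaves keeps the index cheap to traverse; the exact geometry test is then run only on the short candidate list.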
27. Ingesting Raster Data
It’s a raster / TIFF file (bitmap image).
1. Raster data
2. Tile and spray
3. Process and index
Data ready to query
28. Ingesting Raster Data
Tiling divides raster images into small, manageable areas of known dimensions.
These tiles have their own metadata:
• Bounding box
• Grid position
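A minimal sketch of how the tile metadata could be derived, assuming a north-up raster with square pixels and an origin at its top-left corner. The function and field names are hypothetical, and the actual tile size used on the platform is not stated in the slides.

```python
import math


def tile_grid(width_px, height_px, tile_size, origin_x, origin_y, pixel_size):
    """List grid position and bounding box for each tile of a north-up raster.

    origin_x / origin_y are the map coordinates of the top-left corner;
    pixel_size is the width and height of one pixel in map units.
    """
    cols = math.ceil(width_px / tile_size)
    rows = math.ceil(height_px / tile_size)
    tiles = []
    for row in range(rows):
        for col in range(cols):
            px0, py0 = col * tile_size, row * tile_size
            # Edge tiles may be smaller than tile_size.
            px1 = min(px0 + tile_size, width_px)
            py1 = min(py0 + tile_size, height_px)
            tiles.append({
                "grid": (col, row),
                # (min_x, min_y, max_x, max_y); y decreases as rows go down.
                "bbox": (
                    origin_x + px0 * pixel_size,
                    origin_y - py1 * pixel_size,
                    origin_x + px1 * pixel_size,
                    origin_y - py0 * pixel_size,
                ),
                "size_px": (px1 - px0, py1 - py0),
            })
    return tiles


tiles = tile_grid(1000, 600, 256, origin_x=0.0, origin_y=600.0, pixel_size=1.0)

if __name__ == "__main__":
    print(len(tiles), tiles[0], tiles[-1])
```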
29. Ingesting Raster Data
1. Figure out which grid position the geometry needs
2. Extract the required pixel
3. Interrogate the pixel for its value
4. Interpret its value
5. Return to user
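The five query steps can be sketched for a single point as follows. Everything here is illustrative: the tiny 4 x 4 pixel raster, the in-memory tile dictionary and the FLOOD_CLASSES lookup table are invented, and a north-up raster with square pixels is assumed.

```python
# Hypothetical meaning of the raw pixel values.
FLOOD_CLASSES = {0: "no data", 1: "low", 2: "medium", 3: "high"}

# A 4 x 4 pixel raster cut into 2 x 2 pixel tiles, keyed by grid position
# (col, row). Each tile is a list of pixel rows.
TILES = {
    (0, 0): [[0, 1], [1, 2]],
    (1, 0): [[1, 1], [2, 3]],
    (0, 1): [[0, 0], [1, 1]],
    (1, 1): [[2, 2], [3, 3]],
}


def locate_pixel(x, y, origin_x, origin_y, pixel_size, tile_size):
    """Step 1: map coordinates -> (tile grid position, pixel offset in tile)."""
    col_px = int((x - origin_x) // pixel_size)
    row_px = int((origin_y - y) // pixel_size)  # rows count down from the top
    tile = (col_px // tile_size, row_px // tile_size)
    offset = (col_px % tile_size, row_px % tile_size)
    return tile, offset


def query(x, y, tiles, origin_x=0.0, origin_y=40.0, pixel_size=10.0, tile_size=2):
    tile, (px, py) = locate_pixel(x, y, origin_x, origin_y, pixel_size, tile_size)
    if tile not in tiles:
        return "outside raster"
    raw = tiles[tile][py][px]                 # steps 2 and 3: extract and read
    return FLOOD_CLASSES.get(raw, "unknown")  # steps 4 and 5: interpret, return


if __name__ == "__main__":
    print(query(35.0, 25.0, TILES))
    print(query(5.0, 35.0, TILES))
```

Because the grid position is computed arithmetically, only one tile has to be fetched per point, which is what keeps this kind of lookup fast.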
32. Bringing it all together
*Andrew Farrell, "In pursuit of perils: Geo-spatial risk analysis through HPCC Systems"
https://hpccsystems.com/resources/blog/afarrell/pursuit-perils-geo-spatial-risk-analysis-through-hpcc-systems
35. Why Geospatial with HPCC?
• Efficient parallel processing
• Ability to import libraries from different languages
• Good coverage of functions and spatial predicates
• Fast ingestion
• Support for different formats
• Sub-second queries