(Presented by Esri)
When people analyze a problem, they often include location at the core of the analysis. Location and spatial context, combined with geographical knowledge, can make the biggest difference in understanding a problem and analyzing it in a more meaningful way.
In this session, we show how Amazon EMR can be used for location and geospatial analytics, and how the Amazon EMR API and the Python SDK were used to build tools that integrate Big Data and geospatial analysis. We also show powerful visualization options for displaying your results, using maps that can be shared in reports or distributed online and to mobile apps.
5. Big Data – A New Data Type for Geospatial
• Maps
• Spreadsheets
• Social Media
• Big Data
• Services
• Sensor Networks
• DBMS
• Imagery
6. Geospatial in Big Data
1. Geo Enable & Enrich Big Data (Geo E&E)
2. Run spatial queries and operations on data where it resides
3. Results in Geospatial tools: Visualize results as a map; Include in a report; Publish in a web or mobile app
7. Questions in Utilities
Smart Meters
• Billions of readings
• Where are the failures?
• What was the weather like here? Did it impact operations in any of the areas?
• Patterns of usage in specific areas?
8. Questions in Agriculture
Tractor Control Box readings
• Billions of readings
• What was the yield in a field?
– Broken down by 2 inch x 2 inch cells
• What was the impact of weather (or other factors) on yield?
• What are the other places with conditions like this place?
9. Questions in Telco
Smart phones
• Billions of readings
• Where and when do people start using what kind of apps?
• Patterns of usage in certain areas at certain times?
10. Questions in Healthcare
Service Location
• Doctor, patient, location, and time of service
– Fraud detection
– Quality of service
• Health indicator readings related to where the patient has been
– Impact of conditions, like weather
11. Questions in Social Media
Service Quality
• Where are the most complaints or praises about a brand?
• Where is it best to start a limited rollout of a new product?
• What is the impact of other factors on what people say?
• Are there patterns within a certain area in how people react?
12. Geospatial Analysis
• Beyond a point on the map
• Simple operations
– Geometry relations
• High-level analysis
– Hot spot analysis
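As an illustration of the "geometry relations" bullet, here is a minimal sketch of a "contains" test (is a point inside a polygon?) using the classic ray-casting technique. It is plain Python written for this outline, not the Esri Geometry API; the function name and the sample square are assumptions made for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if the point lies inside the polygon ring.

    The polygon is a list of (x, y) vertices; the ring is closed implicitly.
    Points exactly on an edge are not handled specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Each crossing to the right of the point toggles the state.
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical service area: a 10 x 10 square.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
```

A call such as point_in_polygon((5.0, 5.0), square) answers the simplest form of "is this reading inside this area?"; higher-level analyses such as hot spot analysis build on many relations of this kind.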
19. GIS tools for Hadoop libraries
• ArcGIS Geoprocessing Tools: Connect from ArcGIS to Hadoop using GP
• Esri UDF: Run Hive queries with spatial operators
• Esri Geometry API: Build MapReduce spatial apps in Java
23. Amazon EMR for Geospatial Analysis
• Flexible platform to get started and grow large
• Hosted and managed by Amazon Web Services
– No need for large in-house Big Data infrastructure
– No need to hire many people to maintain Hadoop
• Leverages the data ecosystem in the cloud
– Geospatial data is usually large in size
– Access to third-party datasets in the same ecosystem
24. GIS tools for Hadoop libraries
• Geoprocessing Tools: Connect from ArcGIS to Hadoop using GP
• Esri Spatial UDF
• Esri Geometry API
• Amazon Elastic MapReduce (Amazon EMR)
25. ArcGIS Geoprocessing Tools
• Framework
– Perform analysis
– Manage geographic data
• Rich library of analysis tools
• Chaining tools to create models
– Drag and drop model builder
• Developing new custom tools
– Python
26. GP Tools for AWS
• https://github.com/Esri/gptools-for-aws
• GP tools to use
– Amazon EMR
– Amazon S3
• Open Source
– Apache 2.0 license
31. Putting it all together!
Geospatial analysis of log files
• Using: GP tools for AWS
• Goal: Analyze log files of a tile base-map web service
– Real-life, high-demand web service
– Where is the most demand?
• Map visualization
32. The Architecture
• ArcGIS Desktop + GP Tools for AWS
• AWS cloud
– Availability Zone #1
– Amazon EMR Master Node
– Amazon EMR Slave Node
– Data, scripts, logs, output
33. Data Files
• Structured CSV files
– ~8 GB
• Data rows
– Represent 1 month
– More than 700 million records
• Represents all 18 map scales
– To know in which areas users are looking for details
34. HQL Script
• External tables for data rows
• Calculations run through temp tables
– Consolidate tile scales from most detailed to level 13
– Calculate points (x,y) representing each tile
– Aggregate results
– Format output as CSV, not tab-delimited
• Ported from RDBMS operations
– Adapted to Hive
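The calculation steps listed on this slide can be sketched in plain Python to make the logic concrete. This is not the HQL script from the session; it is a minimal illustration that assumes the tiles follow a standard Web Mercator quadtree scheme (each level doubles the number of rows and columns), and that "consolidate" means rolling tiles deeper than level 13 up to their level-13 ancestor. The (level, row, col) record layout, the function names, and the CSV column names are all hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Tuple

Tile = Tuple[int, int, int]  # (level, row, col), an assumed record layout

ORIGIN = 20037508.342787  # half the Web Mercator extent in metres (assumed scheme)
TARGET_LEVEL = 13


def consolidate(level: int, row: int, col: int, target: int = TARGET_LEVEL) -> Tile:
    """Roll a tile deeper than `target` up to its ancestor; leave others as-is."""
    if level <= target:
        return level, row, col
    shift = level - target
    # Each level halves the tile size, so the ancestor index drops `shift` bits.
    return target, row >> shift, col >> shift


def tile_center(level: int, row: int, col: int) -> Tuple[float, float]:
    """Return the (x, y) centre of a tile in Web Mercator metres."""
    size = 2 * ORIGIN / (2 ** level)
    x = -ORIGIN + (col + 0.5) * size
    y = ORIGIN - (row + 0.5) * size  # rows count downward from the top edge
    return x, y


def aggregate(requests: Iterable[Tile]) -> Counter:
    """Count requests per consolidated tile."""
    return Counter(consolidate(*tile) for tile in requests)


def to_csv(counts: Counter) -> str:
    """Write one comma-separated row per tile: level, x, y, request count."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["level", "x", "y", "requests"])
    for (level, row, col), n in sorted(counts.items()):
        x, y = tile_center(level, row, col)
        writer.writerow([level, f"{x:.2f}", f"{y:.2f}", n])
    return buf.getvalue()
```

For example, to_csv(aggregate([(15, 8, 13), (15, 9, 12), (13, 2, 3)])) yields a header plus a single data row, because all three requests fall inside the same level-13 tile.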
35. Visualization
• Download output to local disk
• Add as a layer, set x/y for display
– Set coordinate system
– Use visualization settings to cluster points and categorize
• Use base maps
37. Lessons Learned
• External tables and Amazon S3
• Cluster shutdown protection
• Data
– Partitioning
• Cluster sizes vs. execution time
– Standard Large
– High Memory: XLarge vs. Quadruple XLarge
• Costs
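Two of the lessons above (keeping scripts, logs, and output in Amazon S3, and protecting the cluster from shutdown) can be expressed as cluster settings. The sketch below only builds a request dictionary in the shape accepted by boto3's EMR `run_job_flow` call; the session itself used Boto, so this is a swapped-in, newer SDK shape, not the GP Tools for AWS code. The function name, bucket layout, release label, and instance type are assumptions chosen for illustration, and reading "cluster shutdown protection" as termination protection is one interpretation of the slide.

```python
from typing import Any, Dict


def build_job_flow_request(name: str, bucket: str, script_key: str,
                           instance_type: str = "m5.xlarge",
                           instance_count: int = 3) -> Dict[str, Any]:
    """Build keyword arguments for an EMR cluster that runs one Hive script.

    Nothing is sent to AWS here; the result is a plain dictionary.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",        # assumed release, adjust as needed
        "LogUri": f"s3://{bucket}/logs/",     # hypothetical bucket layout
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster up after the step so results can be inspected.
            "KeepJobFlowAliveWhenNoSteps": True,
            # Guard against accidental shutdown while a long job is running.
            "TerminationProtected": True,
        },
        "Steps": [{
            "Name": "Run HQL script",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", f"s3://{bucket}/{script_key}"],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With AWS credentials configured, the dictionary could be passed as boto3.client("emr").run_job_flow(**request). Varying instance_type and instance_count between runs is one way to compare cluster size against execution time and cost.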
38. Summary
• The value of asking Big Data spatial questions
• Hadoop is now spatially enabled
– GIS Tools for Hadoop
• Boto for using Amazon EMR
• Geospatial analysts empowered
– GP Tools for AWS
• Real-world scenario using Amazon EMR and GP Tools