SlideShare ist ein Scribd-Unternehmen logo
1 von 36
Downloaden Sie, um offline zu lesen
Magellan – Spark as a Geospatial
Analytics Engine
Ram Sriharsha
Hortonworks
Who Am I?
• Apache Spark PMC Member
• Hortonworks Architect, Spark + Data Science
• Prior to HWX, Principal Research Scientist @
Yahoo Labs (Large Scale Machine Learning)
– Login Risk Detection, Sponsored Search Click
Prediction, Advertising Effectiveness Models,
Online Learning, …
What is Geospatial Analytics?
How do pickup/ dropoff neighborhood
hotspots evolve with time?
Correct GPS errors with more
Accurate landmark measurements
Incorporate location in IR and search
advertising
Do we need one more library?
• Spatial Analytics at scale is challenging
– Simplicity + Scalability = Hard
• Ancient Data Formats
– metadata, indexing not handled well, inefficient storage
• Geospatial Analytics is not simply BI anymore
– Statistical + Machine Learning being leveraged in geospatial
• Now is the time to do it!
– Explosion of mobile data
– Finer granularity of data collection for geometries
– Analytics stretching the limits of traditional approaches
– Spark SQL + Catalyst + Tungsten makes extensible SQL engines easier
than ever before!
Introduction to Magellan
Introduction to Magellan
Introduction to Magellan
Shapefiles
*.shp
*.dbf
sqlContext.read.format(“magellan”)
.load(${neighborhoods.path})
GeoJSON
*.json
sqlContext.read.format(“magellan”)
.option(“type”, “geojson”)
.load(${neighborhoods.path})
polygon metadata
([0], [(-122.4413024, 7.8066277),
…])
neighborhood -> Marina
([0], [(-122.4111659, 37.8003388),
…])
neighborhood -> North Beach
Introduction to Magellan
polygon metadata
([0], [(-
122.4413024,
7.8066277), …])
neighborhood ->
Marina
([0], [(-
122.4111659,
37.8003388), …])
neighborhood ->
North Beach
neighborhoods.filter(
point(-122.4111659, 37.8003388)
within
‘polygon
).show()
polygon metadata
([0], [(-
122.4111659,
37.8003388), …])
neighborhood ->
North Beach
Shape literal
Boolean Expression
Introduction to Magellan
polygon metadata
([0], [(-122.4111659,
37.8003388),…])
neighborhood->
North Beach
([0], [(-122.4413024,
7.8066277),…])
neighborhood->
Marina
point
(-122.4111659,
37.8003388)
(-122.4343576,
37.8068007)
points.join(neighborhoods).
where(‘point within ‘polygon).
show()
point polygon metadata
(-122.4343576,
37.8068007)
([0], [(-
122.4111659,
37.8003388),
…])
neighborhood-
> North Beach
Introduction to Magellan
polygon metadata
([0], [(-122.4111659,
37.8003388),…])
neighborhood->
North Beach
([0], [(-122.4413024,
7.8066277),…])
neighborhood->
Marina
neighborhoods.filter(
point(-122.4111659, 37.8003388).buffer(0.1)
intersects
‘polygon
).show()
point polygon metadata
(-122.4343576,
37.8068007)
([0], [(-
122.4111659,
37.8003388),
…])
neighborhood-
> North Beach
‘point within ‘polygon
the join
• Inherits all join optimizations from Spark
SQL
– if neighborhoods table is small, Broadcast
Cartesian Join
– else Cartesian Join
Status
• Magellan 1.0.3 available as Spark Package.
• Scala
• Spark 1.4
• Spark Package: Magellan
• Github: https://github.com/harsha2010/magellan
• Blog: http://hortonworks.com/blog/magellan-geospatial-analytics-in-
spark/
• Notebook example: http://bit.ly/1GwLyrV
• Input Formats: ESRI Shapefile, GeoJSON, OSM-XML
• Please try it out and give feedback!
What is next?
• Magellan 1.0.4
– Spark 1.6
– Python
– Spatial Join
– Persistent Indices
– Better leverage Tungsten via codegen + memory
layout optimizations
– More operators: buffer, distance, area etc.
the join revisited
• What is the time complexity?
– m points, n polygons (assume average k edges)
– l partitions
– O(mn/l) computations of ‘point within ‘polygon
– O(ml) communication cost
– Each ‘point within ‘polygon costs O(k)
– Total cost = O(ml) + O(mnk/l)
– O(m√n√k) cost, with O(√n√k) partitions
Optimization?
• Do we need to send every point to every
partition?
• Do we need to compute ‘point in
‘neighborhood for each neighborhood
within a given partition?
2d indices
• Quad Tree
• R Tree
• Dimensional Reduction
– Hashing
– PCA
– Space Filling Curves
dimensional reduction
• What does a good dimensional reduction
look like?
– (Approximately) preserve nearness
– enable range queries
– No (little) collision
row order curve
snake order curve
z order curve
Binary Representation
Binary Representation
properties
• Locality: Two points differing in a small # of
bits are near each other
– converse not necessarily true!
• Containment
• Efficient construction
• Nice bounds on precision
geohash
• Z order curve with base 32 encoding
• Start with world boundary (-90,90) X (-180,
180) and recursively subdivide based on
precision
encode (-74.009, 40.7401)
• 40.7401 is in [0, 90) => first bit = 1
• 40.7401 is in [0, 45) => second bit = 0
• 40.7401 is in [22.5, 45) => third bit = 1
• …
• do same for longitude
answer = dr5rgb
decode dr5rgb
• Decode from Base 32 -> Binary
– 01100 10111 00101 01111 01010
• lat = 101110001111, long = 0100101111000
• Now decode binary -> decimal.
– latitude starts with 1 => between 0 - 90
– second bit = 0 => between 0 - 45
– third bit = 1 => between 22.5 - 45
– ...
An algorithm to scale join?
• Preprocess points
– For each point compute geohash of precision p covering point
• Preprocess neighborhoods
– For each neighborhood compute geohashes of precision p that
intersect neighborhood.
• Inner join on geohash
• Filter out edge cases
Implementation in Spark SQL
• Override Strategy to define SpatialJoinStrategy
– Logic to decide when to trigger this join
• Only trigger if geospatial queries
• Only trigger if join is complex: if n ~ O(1) then broadcast join
is good enough
– Override BinaryNode to handle the physical execution
plan ourselves
• Override execute(): RDD to execute join and return results
– Stitch it up using Experimental Strategies in
SQLContext
Persistent Indices
• Often, the geometry dataset does not
change… eg. neighborhoods
• Index the dataset once and for all?
Persistent Indices
• Use Magellan to generate spatial indices
– Think of geometry as document, list of geohashes as
words!
• Persist indices to Elastic Search
• Use Magellan Data Source to query indexed ES
data
• Pushdown geometric predicates where possible
– Predicate rewritten to IR query
Overall architecture
Elastic search
Shard Server
Spark Cluster
nbd.filter(
point(…)
within
‘polygon
)
curl –XGET ‘http://…’ –d ‘{
“query” : {
“filtered” : {
“filter” : {
“geohash” : [“dr5rgb”]
}
}
}
}’
THANK YOU.
Twitter: @halfabrane, Github: @harsha2010

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
Andy Petrella
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 

Was ist angesagt? (20)

Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
SexTant: Visualizing Time-Evolving Linked Geospatial Data
SexTant: Visualizing Time-Evolving Linked Geospatial DataSexTant: Visualizing Time-Evolving Linked Geospatial Data
SexTant: Visualizing Time-Evolving Linked Geospatial Data
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
 
Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016
 
Introduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonIntroduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and Python
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics Stack
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
 

Andere mochten auch

Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk
 

Andere mochten auch (12)

Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
 
Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Spark
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
 
How To Analyze Geolocation Data with Hive and Hadoop
How To Analyze Geolocation Data with Hive and HadoopHow To Analyze Geolocation Data with Hive and Hadoop
How To Analyze Geolocation Data with Hive and Hadoop
 
Apache Phoenix + Apache HBase
Apache Phoenix + Apache HBaseApache Phoenix + Apache HBase
Apache Phoenix + Apache HBase
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 

Ähnlich wie Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha

Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
sscdotopen
 
Modularization compass - Navigating white waters of feature-oriented modularity
Modularization compass - Navigating white waters of feature-oriented modularityModularization compass - Navigating white waters of feature-oriented modularity
Modularization compass - Navigating white waters of feature-oriented modularity
Andrzej Olszak
 

Ähnlich wie Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha (20)

Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By Spark
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
 
Nearest Neighbor Customer Insight
Nearest Neighbor Customer InsightNearest Neighbor Customer Insight
Nearest Neighbor Customer Insight
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackbox
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Lucene solr 4 spatial extended deep dive
Lucene solr 4 spatial   extended deep diveLucene solr 4 spatial   extended deep dive
Lucene solr 4 spatial extended deep dive
 
Modularization compass - Navigating white waters of feature-oriented modularity
Modularization compass - Navigating white waters of feature-oriented modularityModularization compass - Navigating white waters of feature-oriented modularity
Modularization compass - Navigating white waters of feature-oriented modularity
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Geo Location Initial PoC
Geo Location Initial PoCGeo Location Initial PoC
Geo Location Initial PoC
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
 
LSH
LSHLSH
LSH
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 

Mehr von Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 

Kürzlich hochgeladen (20)

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 

Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha

  • 1. Magellan – Spark as a Geospatial Analytics Engine Ram Sriharsha Hortonworks
  • 2. Who Am I? • Apache Spark PMC Member • Hortonworks Architect, Spark + Data Science • Prior to HWX, Principal Research Scientist @ Yahoo Labs (Large Scale Machine Learning) – Login Risk Detection, Sponsored Search Click Prediction, Advertising Effectiveness Models, Online Learning, …
  • 3. What is Geospatial Analytics? How do pickup/ dropoff neighborhood hotspots evolve with time? Correct GPS errors with more Accurate landmark measurements Incorporate location in IR and search advertising
  • 4. Do we need one more library? • Spatial Analytics at scale is challenging – Simplicity + Scalability = Hard • Ancient Data Formats – metadata, indexing not handled well, inefficient storage • Geospatial Analytics is not simply BI anymore – Statistical + Machine Learning being leveraged in geospatial • Now is the time to do it! – Explosion of mobile data – Finer granularity of data collection for geometries – Analytics stretching the limits of traditional approaches – Spark SQL + Catalyst + Tungsten makes extensible SQL engines easier than ever before!
  • 7. Introduction to Magellan Shapefiles *.shp *.dbf sqlContext.read.format(“magellan”) .load(${neighborhoods.path}) GeoJSON *.json sqlContext.read.format(“magellan”) .option(“type”, “geojson”) .load(${neighborhoods.path}) polygon metadata ([0], [(-122.4413024, 7.8066277), …]) neighborhood -> Marina ([0], [(-122.4111659, 37.8003388), …]) neighborhood -> North Beach
  • 8. Introduction to Magellan polygon metadata ([0], [(- 122.4413024, 7.8066277), …]) neighborhood -> Marina ([0], [(- 122.4111659, 37.8003388), …]) neighborhood -> North Beach neighborhoods.filter( point(-122.4111659, 37.8003388) within ‘polygon ).show() polygon metadata ([0], [(- 122.4111659, 37.8003388), …]) neighborhood -> North Beach Shape literal Boolean Expression
  • 9. Introduction to Magellan polygon metadata ([0], [(-122.4111659, 37.8003388),…]) neighborhood-> North Beach ([0], [(-122.4413024, 7.8066277),…]) neighborhood-> Marina point (-122.4111659, 37.8003388) (-122.4343576, 37.8068007) points.join(neighborhoods). where(‘point within ‘polygon). show() point polygon metadata (-122.4343576, 37.8068007) ([0], [(- 122.4111659, 37.8003388), …]) neighborhood- > North Beach
  • 10. Introduction to Magellan polygon metadata ([0], [(-122.4111659, 37.8003388),…]) neighborhood-> North Beach ([0], [(-122.4413024, 7.8066277),…]) neighborhood-> Marina neighborhoods.filter( point(-122.4111659, 37.8003388).buffer(0.1) intersects ‘polygon ).show() point polygon metadata (-122.4343576, 37.8068007) ([0], [(- 122.4111659, 37.8003388), …]) neighborhood- > North Beach
  • 12. the join • Inherits all join optimizations from Spark SQL – if neighborhoods table is small, Broadcast Cartesian Join – else Cartesian Join
  • 13. Status • Magellan 1.0.3 available as Spark Package. • Scala • Spark 1.4 • Spark Package: Magellan • Github: https://github.com/harsha2010/magellan • Blog: http://hortonworks.com/blog/magellan-geospatial-analytics-in- spark/ • Notebook example: http://bit.ly/1GwLyrV • Input Formats: ESRI Shapefile, GeoJSON, OSM-XML • Please try it out and give feedback!
  • 14. What is next? • Magellan 1.0.4 – Spark 1.6 – Python – Spatial Join – Persistent Indices – Better leverage Tungsten via codegen + memory layout optimizations – More operators: buffer, distance, area etc.
  • 15. the join revisited • What is the time complexity? – m points, n polygons (assume average k edges) – l partitions – O(mn/l) computations of ‘point within ‘polygon – O(ml) communication cost – Each ‘point within ‘polygon costs O(k) – Total cost = O(ml) + O(mnk/l) – O(m√n√k) cost, with O(√n√k) partitions
  • 16.
  • 17. Optimization? • Do we need to send every point to every partition? • Do we need to compute ‘point in ‘neighborhood for each neighborhood within a given partition?
  • 18. 2d indices • Quad Tree • R Tree • Dimensional Reduction – Hashing – PCA – Space Filling Curves
  • 19. dimensional reduction • What does a good dimensional reduction look like? – (Approximately) preserve nearness – enable range queries – No (little) collision
  • 25. properties • Locality: Two points differing in a small # of bits are near each other – converse not necessarily true! • Containment • Efficient construction • Nice bounds on precision
  • 26. geohash • Z order curve with base 32 encoding • Start with world boundary (-90,90) X (-180, 180) and recursively subdivide based on precision
  • 27. encode (-74.009, 40.7401) • 40.7401 is in [0, 90) => first bit = 1 • 40.7401 is in [0, 45) => second bit = 0 • 40.7401 is in [22.5, 45) => third bit = 1 • … • do same for longitude answer = dr5rgb
  • 28. decode dr5rgb • Decode from Base 32 -> Binary – 01100 10111 00101 01111 01010 • lat = 101110001111, long = 0100101111000 • Now decode binary -> decimal. – latitude starts with 1 => between 0 - 90 – second bit = 0 => between 0 - 45 – third bit = 1 => between 22.5 - 45 – ...
  • 29. An algorithm to scale join? • Preprocess points – For each point compute geohash of precision p covering point • Preprocess neighborhoods – For each neighborhood compute geohashes of precision p that intersect neighborhood. • Inner join on geohash • Filter out edge cases
  • 30. Implementation in Spark SQL • Override Strategy to define SpatialJoinStrategy – Logic to decide when to trigger this join • Only trigger if geospatial queries • Only trigger if join is complex: if n ~ O(1) then broadcast join is good enough – Override BinaryNode to handle the physical execution plan ourselves • Override execute(): RDD to execute join and return results – Stitch it up using Experimental Strategies in SQLContext
  • 31.
  • 32. Persistent Indices • Often, the geometry dataset does not change… eg. neighborhoods • Index the dataset once and for all?
  • 33. Persistent Indices • Use Magellan to generate spatial indices – Think of geometry as document, list of geohashes as words! • Persist indices to Elastic Search • Use Magellan Data Source to query indexed ES data • Pushdown geometric predicates where possible – Predicate rewritten to IR query
  • 34. Overall architecture Elastic search Shard Server Spark Cluster nbd.filter( point(…) within ‘polygon ) curl –XGET ‘http://…’ –d ‘{ “query” : { “filtered” : { “filter” : { “geohash” : [“dr5rgb”] } } } }’
  • 35.
  • 36. THANK YOU. Twitter: @halfabrane, Github: @harsha2010