SlideShare ist ein Scribd-Unternehmen logo
1 von 23
Downloaden Sie, um offline zu lesen
Harnessing Spark Catalyst for
Custom Data Payloads
GIS Raster Support in Spark DataFrames
Simeon	H.K.	Fitch
Co-Founder	&	VP	of	R&D,	Astraea
Astraea
• Developing a machine learning platform to
make solving planetary problems easier
• With exploding population growth and finite
resources, we need to have tools to better plan
for sustainable growth
• We aim to bring earth science data to business
applications through machine learning
2
See	the	earth.	As	it	was,	as	it	is, as	it	could	be.​
Preface
• Assumptions:
– Basic knowledge of Spark, Resilient Distributed Datasets (RDDs), and the DataFrame
compute model
– Basic understanding of a typical ETL/ML pipeline
• Prior Art:
– Approach outlined derived from other work
– Fundamental raster support via Azavea’s GeoTrellis
– Spark integration cues taken from:
• CCRi’s GeoMesa
• Databrick’s Spark-Avro
• Caveat Emptor:
– As of Spark 2.1.0, approach is not officially sanctioned;
uses undocumented, private APIs
– Not for everyone, but for us, benefits outweigh the risks
3
PROBLEM STATEMENT
To efficiently and effectively build machine learning models with Earth observation data
4
Data Native Form
5
Bandc
Bandb
Banda
Temporal
Projected
Extent (TPE)
Granule Metadata (GM)
Remote Sensing Data Product
Granule/Scene/Tile
(GeoTIFF, HDF-EOS, GML-JPEG2000)
… …
add_offset
Band 32 emissivity
scale_factor
TileID
Value
0.002
1, 255
0.49
long_name
Key
valid_range
51004010
Multiband
Tile
Granule-wide
properties
Canonical ML Functional Form
6
c
1
a
1
b
1TPEA
1GMA [ 0 ] [ 0 ] [ 0 ] . . .[r1, c1]
Spark Dataframe Row
(i.e. ML Observation)
Band Values at
Single Cell
. . .. . .. . .. . .. . .. . .
Projected Extent of
Tile + Cell Row/
Column
Bandc
Bandb
Banda
Temporal
Projected
Extent (TPE)
Granule Metadata (GM)
Analytics Base Table
(ABT)
…
t1
t2
t2
t1
t2
t1
T3
T2
T2
T3
T2
T1
…
Delivering Imagery to ML
SLAAW
Scenes/
Granules
(Scene 1)
t0,b1
(Scene 1)
t0, bn
(Scene 1)
t0,b3
(Scene 1)
t0,b2
(Scene 1)
t0, b7
(Scene 1)
t0, b6
(Scene 1)
t0, b4
(Scene 1)
t0, b5
(Scene 2)
t1,b1
(Scene 2)
t1, bn
(Scene 2)
t1,b3
(Scene 2)
t1,b2
(Scene 2)
t1, b7
(Scene 2)
t1, b6
(Scene 2)
t1, b4
(Scene 2)
t1, b5
(Scene N)
tf,b1
(Scene N)
tf, bn
(Scene N)
tf,b3
(Scene N)
tf,b2
(Scene N)
tf, b7
(Scene N)
tf, b6
(Scene N)
tf, b4
(Scene N)
tf, b5
…
…
…
Feature
Engineering
Exploratory Data
Analysis
(EDA)
Data Quality
Check
(DQC)
Base Analytics Functional Form
(BAFF)
t1
t2
t2
t2
t1
t1
i6
i5
i4
i3
i2
i1
…
7
World-wide	data	coverage
Distributed	DataFrame
Distributed	DataFrame
Scalable	Machine	Learning
time
wavelength
Why This is Hard: Dimensionality
8
Spatial
(500m	→	5m	→	30cm)	
Temporal
(Refresh	rates:	Weeks	→	Daily	→	Hourly)	
Spectral
(4	bands	→	200	bands)
Planet
DigiGlobe
Landsat8
Planetary	
Resources
Metadata
• Coordinate	Reference	System
• Temporal/Spatial	Extent
• QA	Flags
• Calibration	parameters
+
Why This is Hard: Data Footprint
9
As resolution scales, image size explodes
Data	footprint	for	one	football	field	size	multiband	raster	
(single	point	in	time!)
• 30	meters
• 8 band
• 0.5	GB/image
Landsat8
(NASA)
• 3	meters
• 4	band
• 16	GB/image
Planet
PlanetScope
Ortho
• 30	centimeters
• 4	band
• 1.0	TB/image
DigiGlobe
• 10	m	Resolution
• 200	band	(hyper-spectral)
• 50	TB/	image?
Planetary
Resources
CAPABILITY DEMONSTRATION
Prototyping Spark Catalyst raster integration
10
Domain-Specific Data Discretization
Swath ~ Granule ~ Scene ~ Raster
⇓
Tile ~ Chip
⇓
Cell ~ Pixel
11
𝑛	×	𝑚	where	𝑛, 𝑚 ≳ 1200
(e.g.	Landsat	8:	76002)	
𝑛.
, where	𝑛 ≲ 512
(Typical:	642 to	2562)
1×1
Each	of	these	has	one	or	more	“bands”
(e.g.	Landsat	8:	11,	MODIS:	36,	Hyperion:	220)
TileUDT and Friends
• Using the approach covered in the next section we register TileUDT
with Spark
• With UDTs come User Defined Functions (UDFs)
• Some examples:
12
§ vectorizeTiles
§ explodeTiles
§ localMax
§ localMin
§ localStats
§ localAdd
§ localSubtract
§ tileHistogram
§ tileStatistics
§ tileMean
§ aggHistogram
§ aggStats
See	work-in-progress	code	and	examples/tests	in:
https://github.com/s22s/geotrellis-spark-sql/
TileUDT Notebook Demo
13
ZeppelinHub Version
14
IMPLEMENTATION
From GeoTiff to RDD[Tile] to Dataset[Tile] to DataFrame
Software Stack
• Scala
• Apache Spark
• GeoTrellis
• Accumulo
• Docker
• Apache Zeppelin
15
GeoTrellis
• GeoTrellis is an open source
Scala framework for efficiently
manipulating raster GIS data
• Provides facilities to ingest and
process tiles at scale
• Has powerful abstractions for
working with RDD[Tile]s.
– Mosaicing, stitching, pyramiding,
resampling, reprojecting, etc.
– Implements C. Dana Tomlin’s
“Map Algebra”
16
Getting From RDDs to DataFrames
• Goal: work with tiles via DataFrame APIs
– Better ergonomics
– More computationally efficient
– Required for SparkML
• Bonus: if a capability is available in
DataFrames, it’s also available in SQL!
17
Encoding Data with Spark Catalyst
• Catalyst is the engine behind Spark DataFrames & SQL
• Moving data from RDDs to DataFrames requires using one of two
Catalyst APIs:
– ExpressionEncoder[Tile] or
– UserDefinedType[Tile]
• Both are (currently) package private
• Both have steep learning curves
• Both are extremely powerful once harnessed
– ExpressionEncoder is ideal for simple structures
– UserDefinedType is more efficient for larger data payloads
• For our needs, UserDefinedType (UDT) is the best fit
18
Anatomy of a UDT
To	access	private	API,	need	to	be	a	subpackage of	sql.
Supertype parameterized	on	user	type
Name	shown	in	schema	and	query	plan
Runtime	class	descriptor	of	user	type
Schema	describing	how	the	type	will	be	
encoded	within	Catalyst.	You	have	lots	of	
flexibility	here,	even	using	other	UDTs.	In	this	
example	we	pack	the	tile	into	an	opaque	blob.
Conversion	from	user	data	type	to	Catalyst	encoding
Conversion	from	Catalyst	encoding	to	user	data	type
19
UDT Registration
• User defined type is registered with
Catalyst by providing mapping between
native type and UDT
20
Spark Catalyst Toolbox
• User Defined Type (UDT)
• User Defined Function (UDF, 2 forms)
• User Defined Aggregation Function (UDAF)
• User Defined Table Function (UDTF, a.k.a.
“Generator”)
• Data Source
• Query Plan
• Optimization Rule
21
Future Work
• GeoTrellis Layer Store as an integrated
Spark DataSource (in progress)
• Expanding standard GeoTrellis RDD
features into efficient UDFs
• GIS Vector primitives (a la GeoMesa)
• Becoming an official module of GeoTrellis
22
23
THANK YOU!
The End

Weitere ähnliche Inhalte

Was ist angesagt?

Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...
Databricks
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on Kubernetes
Databricks
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Databricks
 
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Databricks
 

Was ist angesagt? (20)

Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
 
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on Kubernetes
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
 
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...
 

Ähnlich wie Harnessing Spark Catalyst for Custom Data Payloads

Ähnlich wie Harnessing Spark Catalyst for Custom Data Payloads (20)

Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
 
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleNetflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 
Bogdan Kecman INIT Presentation
Bogdan Kecman INIT PresentationBogdan Kecman INIT Presentation
Bogdan Kecman INIT Presentation
 
Bogdan Kecman Advanced Databasing
Bogdan Kecman Advanced DatabasingBogdan Kecman Advanced Databasing
Bogdan Kecman Advanced Databasing
 
Introduction to MapReduce Data Transformations
Introduction to MapReduce Data TransformationsIntroduction to MapReduce Data Transformations
Introduction to MapReduce Data Transformations
 

Kürzlich hochgeladen

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 

Kürzlich hochgeladen (20)

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Harnessing Spark Catalyst for Custom Data Payloads

  • 1. Harnessing Spark Catalyst for Custom Data Payloads GIS Raster Support in Spark DataFrames Simeon H.K. Fitch Co-Founder & VP of R&D, Astraea
  • 2. Astraea • Developing a machine learning platform to make solving planetary problems easier • With exploding population growth and finite resources, we need to have tools to better plan for sustainable growth • We aim to bring earth science data to business applications through machine learning 2 See the earth. As it was, as it is, as it could be.​
  • 3. Preface • Assumptions: – Basic knowledge of Spark, Resilient Distributed Datasets (RDDs), and the DataFrame compute model – Basic understanding of a typical ETL/ML pipeline • Prior Art: – Approach outlined derived from other work – Fundamental raster support via Azavea’s GeoTrellis – Spark integration cues taken from: • CCRi’s GeoMesa • Databrick’s Spark-Avro • Caveat Emptor: – As of Spark 2.1.0, approach is not officially sanctioned; uses undocumented, private APIs – Not for everyone, but for us, benefits outweigh the risks 3
  • 4. PROBLEM STATEMENT To efficiently and effectively build machine learning models with Earth observation data 4
  • 5. Data Native Form 5 Bandc Bandb Banda Temporal Projected Extent (TPE) Granule Metadata (GM) Remote Sensing Data Product Granule/Scene/Tile (GeoTIFF, HDF-EOS, GML-JPEG2000) … … add_offset Band 32 emissivity scale_factor TileID Value 0.002 1, 255 0.49 long_name Key valid_range 51004010 Multiband Tile Granule-wide properties
  • 6. Canonical ML Functional Form 6 c 1 a 1 b 1TPEA 1GMA [ 0 ] [ 0 ] [ 0 ] . . .[r1, c1] Spark Dataframe Row (i.e. ML Observation) Band Values at Single Cell . . .. . .. . .. . .. . .. . . Projected Extent of Tile + Cell Row/ Column Bandc Bandb Banda Temporal Projected Extent (TPE) Granule Metadata (GM)
  • 7. Analytics Base Table (ABT) … t1 t2 t2 t1 t2 t1 T3 T2 T2 T3 T2 T1 … Delivering Imagery to ML SLAAW Scenes/ Granules (Scene 1) t0,b1 (Scene 1) t0, bn (Scene 1) t0,b3 (Scene 1) t0,b2 (Scene 1) t0, b7 (Scene 1) t0, b6 (Scene 1) t0, b4 (Scene 1) t0, b5 (Scene 2) t1,b1 (Scene 2) t1, bn (Scene 2) t1,b3 (Scene 2) t1,b2 (Scene 2) t1, b7 (Scene 2) t1, b6 (Scene 2) t1, b4 (Scene 2) t1, b5 (Scene N) tf,b1 (Scene N) tf, bn (Scene N) tf,b3 (Scene N) tf,b2 (Scene N) tf, b7 (Scene N) tf, b6 (Scene N) tf, b4 (Scene N) tf, b5 … … … Feature Engineering Exploratory Data Analysis (EDA) Data Quality Check (DQC) Base Analytics Functional Form (BAFF) t1 t2 t2 t2 t1 t1 i6 i5 i4 i3 i2 i1 … 7 World-wide data coverage Distributed DataFrame Distributed DataFrame Scalable Machine Learning time wavelength
  • 8. Why This is Hard: Dimensionality 8 Spatial (500m → 5m → 30cm) Temporal (Refresh rates: Weeks → Daily → Hourly) Spectral (4 bands → 200 bands) Planet DigiGlobe Landsat8 Planetary Resources Metadata • Coordinate Reference System • Temporal/Spatial Extent • QA Flags • Calibration parameters +
  • 9. Why This is Hard: Data Footprint 9 As resolution scales, image size explodes Data footprint for one football field size multiband raster (single point in time!) • 30 meters • 8 band • 0.5 GB/image Landsat8 (NASA) • 3 meters • 4 band • 16 GB/image Planet PlanetScope Ortho • 30 centimeters • 4 band • 1.0 TB/image DigiGlobe • 10 m Resolution • 200 band (hyper-spectral) • 50 TB/ image? Planetary Resources
  • 10. CAPABILITY DEMONSTRATION Prototyping Spark Catalyst raster integration 10
  • 11. Domain-Specific Data Discretization Swath ~ Granule ~ Scene ~ Raster ⇓ Tile ~ Chip ⇓ Cell ~ Pixel 11 𝑛 × 𝑚 where 𝑛, 𝑚 ≳ 1200 (e.g. Landsat 8: 76002) 𝑛. , where 𝑛 ≲ 512 (Typical: 642 to 2562) 1×1 Each of these has one or more “bands” (e.g. Landsat 8: 11, MODIS: 36, Hyperion: 220)
  • 12. TileUDT and Friends • Using the approach covered in the next section we register TileUDT with Spark • With UDTs come User Defined Functions (UDFs) • Some examples: 12 § vectorizeTiles § explodeTiles § localMax § localMin § localStats § localAdd § localSubtract § tileHistogram § tileStatistics § tileMean § aggHistogram § aggStats See work-in-progress code and examples/tests in: https://github.com/s22s/geotrellis-spark-sql/
  • 14. 14 IMPLEMENTATION From GeoTiff to RDD[Tile] to Dataset[Tile] to DataFrame
  • 15. Software Stack • Scala • Apache Spark • GeoTrellis • Accumulo • Docker • Apache Zeppelin 15
  • 16. GeoTrellis • GeoTrellis is an open source Scala framework for efficiently manipulating raster GIS data • Provides facilities to ingest and process tiles at scale • Has powerful abstractions for working with RDD[Tile]s. – Mosaicing, stitching, pyramiding, resampling, reprojecting, etc. – Implements C. Dana Tomlin’s “Map Algebra” 16
  • 17. Getting From RDDs to DataFrames • Goal: work with tiles via DataFrame APIs – Better ergonomics – More computationally efficient – Required for SparkML • Bonus: if a capability is available in DataFrames, it’s also available in SQL! 17
  • 18. Encoding Data with Spark Catalyst • Catalyst is the engine behind Spark DataFrames & SQL • Moving data from RDDs to DataFrames requires using one of two Catalyst APIs: – ExpressionEncoder[Tile] or – UserDefinedType[Tile] • Both are (currently) package private • Both have steep learning curves • Both are extremely powerful once harnessed – ExpressionEncoder is ideal for simple structures – UserDefinedType is more efficient for larger data payloads • For our needs, UserDefinedType (UDT) is the best fit 18
  • 19. Anatomy of a UDT To access private API, need to be a subpackage of sql. Supertype parameterized on user type Name shown in schema and query plan Runtime class descriptor of user type Schema describing how the type will be encoded within Catalyst. You have lots of flexibility here, even using other UDTs. In this example we pack the tile into an opaque blob. Conversion from user data type to Catalyst encoding Conversion from Catalyst encoding to user data type 19
  • 20. UDT Registration • User defined type is registered with Catalyst by providing mapping between native type and UDT 20
  • 21. Spark Catalyst Toolbox • User Defined Type (UDT) • User Defined Function (UDF, 2 forms) • User Defined Aggregation Function (UDAF) • User Defined Table Function (UDTF, a.k.a. “Generator”) • Data Source • Query Plan • Optimization Rule 21
  • 22. Future Work • GeoTrellis Layer Store as an integrated Spark DataSource (in progress) • Expanding standard GeoTrellis RDD features into efficient UDFs • GIS Vector primitives (a la GeoMesa) • Becoming an official module of GeoTrellis 22

Hinweis der Redaktion

  1. Approach is general, not limited to GIS/EO
  2. A little about who we are and what we’re up to
  3. Explain why it matters.... can't be a data scientist if you can get the data to the form you need for modelling
  4. What we're working on at Astraea: platform to allow data scientists to efficiently build and deploy models based on EO data.
  5. SLAAW has to happen before you can even start your experimental design Save the Data Scientists time by providing higher-level abstractions for doing the “science” Make a really challenging data source more accessible to the data scientist. Two goals: address SLAAW; make data science steps more efficient. World wide collections of data. Need to be able to scale. Distinction between Python/R dataframes and Spark distributed ones
  6. 1) These functions can be applied globally to the distributed dataframe Allows for SLAAW, DQC, EDA, FE
  7. Get rasters into Spark Manipulate rasters Move rasters into Dataframe
  8. GeoTrellis gets the imagery into Spark Map Algebra provides fundamental sets of primitives for performing analytics on GIS raster data
  9. GeoTrellis alone only gets us part of the way there