Harnessing Spark Catalyst for Custom Data Payloads
1. Harnessing Spark Catalyst for Custom Data Payloads
GIS Raster Support in Spark DataFrames
Simeon H.K. Fitch
Co-Founder & VP of R&D, Astraea
2. Astraea
• Developing a machine learning platform to make solving planetary problems easier
• With exploding population growth and finite resources, we need tools to better plan for sustainable growth
• We aim to bring earth science data to business applications through machine learning
See the earth. As it was, as it is, as it could be.
3. Preface
• Assumptions:
– Basic knowledge of Spark, Resilient Distributed Datasets (RDDs), and the DataFrame compute model
– Basic understanding of a typical ETL/ML pipeline
• Prior Art:
– The approach outlined here is derived from other work
– Fundamental raster support via Azavea’s GeoTrellis
– Spark integration cues taken from:
• CCRi’s GeoMesa
• Databricks’ Spark-Avro
• Caveat Emptor:
– As of Spark 2.1.0, this approach is not officially sanctioned; it uses undocumented, private APIs
– Not for everyone, but for us the benefits outweigh the risks
9. Why This is Hard: Data Footprint
As resolution scales, image size explodes.
Data footprint for one football-field-sized multiband raster (a single point in time!):
• Landsat 8 (NASA): 30 m resolution, 8 bands, 0.5 GB/image
• Planet PlanetScope Ortho: 3 m resolution, 4 bands, 16 GB/image
• DigitalGlobe: 30 cm resolution, 4 bands, 1.0 TB/image
• Planetary Resources: 10 m resolution, 200 bands (hyper-spectral), 50 TB/image?
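The sizes above are simple cell-count arithmetic: cells per band, times bands, times bytes per cell. A back-of-the-envelope sketch in Scala (the 7600 × 7600 scene size and 2-byte cells are illustrative assumptions, not figures from the slide):

```scala
// Raster footprint: width x height cells, times bands, times bytes per cell.
def footprintBytes(widthPx: Long, heightPx: Long, bands: Int, bytesPerCell: Int): Long =
  widthPx * heightPx * bands * bytesPerCell

// An 8-band scene of roughly 7600 x 7600 16-bit cells comes out just under
// 1 GB, the same order of magnitude as the Landsat 8 figure above.
val landsatBytes = footprintBytes(7600L, 7600L, bands = 8, bytesPerCell = 2)
val landsatGB = landsatBytes.toDouble / math.pow(1024, 3)
```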
11. Domain-Specific Data Discretization
Swath ~ Granule ~ Scene ~ Raster: n × m cells, where n, m ≳ 1200 (e.g. Landsat 8: 7600²)
⇓
Tile ~ Chip: n × n cells, where n ≲ 512 (typical: 64² to 256²)
⇓
Cell ~ Pixel: 1 × 1
Each of these has one or more “bands”
(e.g. Landsat 8: 11, MODIS: 36, Hyperion: 220)
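The scene-to-tile step above is just ceiling division on each axis. A small sketch (the 7600² scene and 256² tile figures come from the slide):

```scala
// Number of n x n tiles needed per axis to cover N scene cells: ceil(N / n).
def tilesPerAxis(sceneCells: Int, tileCells: Int): Int =
  (sceneCells + tileCells - 1) / tileCells

// Total tiles for a square N x N scene cut into n x n tiles.
def tileCount(sceneCells: Int, tileCells: Int): Int = {
  val perAxis = tilesPerAxis(sceneCells, tileCells)
  perAxis * perAxis
}

val landsatTiles = tileCount(7600, 256) // 30 tiles per axis, 900 total
```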
12. TileUDT and Friends
• Using the approach covered in the next section we register TileUDT with Spark
• With UDTs come User Defined Functions (UDFs)
• Some examples:
– vectorizeTiles
– explodeTiles
– localMax
– localMin
– localStats
– localAdd
– localSubtract
– tileHistogram
– tileStatistics
– tileMean
– aggHistogram
– aggStats
See work-in-progress code and examples/tests in:
https://github.com/s22s/geotrellis-spark-sql/
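To give a flavor of what a "local" map-algebra operation computes, here is a minimal pure-Scala sketch of cell-wise addition over flat-array tiles. `SimpleTile` and this `localAdd` are invented for illustration; they are not GeoTrellis's `Tile` type or the UDF listed above:

```scala
// A toy single-band tile: row-major cells in a flat array.
final case class SimpleTile(cols: Int, rows: Int, cells: Array[Int]) {
  require(cells.length == cols * rows, "cell count must match dimensions")
}

// "Local" op: combine corresponding cells of two same-sized tiles.
def localAdd(a: SimpleTile, b: SimpleTile): SimpleTile = {
  require(a.cols == b.cols && a.rows == b.rows, "tile dimensions must match")
  SimpleTile(a.cols, a.rows, a.cells.zip(b.cells).map { case (x, y) => x + y })
}

val t1 = SimpleTile(2, 2, Array(1, 2, 3, 4))
val t2 = SimpleTile(2, 2, Array(10, 20, 30, 40))
val summed = localAdd(t1, t2) // cells: 11, 22, 33, 44
```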
16. GeoTrellis
• GeoTrellis is an open source Scala framework for efficiently manipulating raster GIS data
• Provides facilities to ingest and process tiles at scale
• Has powerful abstractions for working with RDD[Tile]s:
– Mosaicing, stitching, pyramiding, resampling, reprojecting, etc.
– Implements C. Dana Tomlin’s “Map Algebra”
17. Getting From RDDs to DataFrames
• Goal: work with tiles via DataFrame APIs
– Better ergonomics
– More computationally efficient
– Required for SparkML
• Bonus: if a capability is available in DataFrames, it’s also available in SQL!
18. Encoding Data with Spark Catalyst
• Catalyst is the engine behind Spark DataFrames & SQL
• Moving data from RDDs to DataFrames requires using one of two Catalyst APIs:
– ExpressionEncoder[Tile] or
– UserDefinedType[Tile]
• Both are (currently) package private
• Both have steep learning curves
• Both are extremely powerful once harnessed
– ExpressionEncoder is ideal for simple structures
– UserDefinedType is more efficient for larger data payloads
• For our needs, UserDefinedType (UDT) is the best fit
19. Anatomy of a UDT
Annotations from the code example:
• To access the private API, the class must live in a subpackage of sql
• Supertype parameterized on the user type
• Name shown in schema and query plan
• Runtime class descriptor of the user type
• Schema describing how the type will be encoded within Catalyst; you have lots of flexibility here, even using other UDTs. In this example we pack the tile into an opaque blob
• Conversion from the user data type to the Catalyst encoding
• Conversion from the Catalyst encoding back to the user data type
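Putting those annotations together, the class skeleton looks roughly like this. This is a structural sketch against Spark 2.1's package-private `UserDefinedType` API: the `TileUDT` name and blob layout follow the talk, the member bodies are elided, and it will not compile without Spark and GeoTrellis on the classpath.

```scala
// Must live under org.apache.spark.sql to see the package-private API.
package org.apache.spark.sql.gt.types

import org.apache.spark.sql.types._
import geotrellis.raster.Tile

// Supertype parameterized on the user type.
class TileUDT extends UserDefinedType[Tile] {
  // Name shown in schema and query plan.
  override def typeName: String = "tile"
  // Runtime class descriptor of the user type.
  override def userClass: Class[Tile] = classOf[Tile]
  // How the type is encoded within Catalyst; here the tile is packed
  // into an opaque blob alongside its dimensions and cell type.
  override def sqlType: DataType = StructType(Seq(
    StructField("cellType", StringType, nullable = false),
    StructField("cols", IntegerType, nullable = false),
    StructField("rows", IntegerType, nullable = false),
    StructField("data", BinaryType, nullable = false)
  ))
  // Conversion from user data type to Catalyst encoding.
  override def serialize(tile: Tile): Any = ??? // pack fields into an InternalRow
  // Conversion from Catalyst encoding to user data type.
  override def deserialize(datum: Any): Tile = ??? // unpack the InternalRow
}
```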
20. UDT Registration
• A user-defined type is registered with Catalyst by providing a mapping between the native type and the UDT
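Concretely, that mapping can be registered by fully-qualified class names. A sketch: `UDTRegistration` is itself package-private in Spark 2.1, so the call must also be made from under the `org.apache.spark` namespace, and the `TileUDT` path here is an assumption:

```scala
import org.apache.spark.sql.types.UDTRegistration

// Map the native user type to the UDT that encodes it.
UDTRegistration.register(
  "geotrellis.raster.Tile",               // native type
  "org.apache.spark.sql.gt.types.TileUDT" // its UserDefinedType
)
```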
21. Spark Catalyst Toolbox
• User Defined Type (UDT)
• User Defined Function (UDF, 2 forms)
• User Defined Aggregation Function (UDAF)
• User Defined Table Function (UDTF, a.k.a. “Generator”)
• Data Source
• Query Plan
• Optimization Rule
22. Future Work
• GeoTrellis Layer Store as an integrated Spark DataSource (in progress)
• Expanding standard GeoTrellis RDD features into efficient UDFs
• GIS Vector primitives (a la GeoMesa)
• Becoming an official module of GeoTrellis
Explain why it matters.... you can't be a data scientist if you can't get the data into the form you need for modelling
What we're working on at Astraea: platform to allow data scientists to efficiently build and deploy models based on EO data.
SLAAW has to happen before you can even start your experimental design
Save the Data Scientists time by providing higher-level abstractions for doing the “science”
Make a really challenging data source more accessible to the data scientist.
Two goals: address SLAAW; make data science steps more efficient.
Worldwide collections of data. Need to be able to scale.
Distinction between Python/R dataframes and Spark distributed ones
1) These functions can be applied globally to the distributed dataframe
Allows for SLAAW, DQC, EDA, FE
Get rasters into Spark
Manipulate rasters
Move rasters into Dataframe
GeoTrellis gets the imagery into Spark
Map Algebra provides fundamental sets of primitives for performing analytics on GIS raster data
GeoTrellis alone only gets us part of the way there