Distributed georeferenced raster processing on Spark with GeoTrellis

•

0 gefällt mir•438 views

GeoTrellis is a geographic data processing engine for high performance applications. This presentation is focused on how Spark RDD partitioning scheme can influent the whole Spark application behaviour.

Technologie

DISTRIBUTED GEOREFERENCED RASTER PROCESSING ON SPARK
Grigory Pomadchin @daunnc / @pomadchin

GEO +
VectorTiles + PointClouds +
———————————————————————————————
Raster +Vector +
/w

• Raster Foundry
(Spark SQL & ML)
• Raster Frames
(Spark SQL & ML, Datasets
query API)
• GeoPy Spark
(Python bindings)
• Vector Pipes
(Vector tiles on Spark)
• PDAL
intergration
(PointClouds on Spark)
GEOTRELLIS ECOSYSTEM

• RDD 
(a basic core spark type from the
past (no))
• Manual partitioning control
• DATASET
• Query planning
optimizations, more related
to already well partitioned
and structured data.
PARTITIONING SCHEME
SPECIAL BROWN COLORED FUNCTIONS
• Join
• groupByKey
• reduceByKey
• combineByKey
• Repartition
• Each function that has no
preservePartitioning flag or
can accept partitioner as an
argument, probably map?

join
reduceByKey
join map
reduceByKey
join
mapVlues reduceByKey
MAP IS A FUNCTION OF A DIFFERENT KIND?

DATA PREPARATION
• {hadoop | s3}GeoTiffRDD loads data from {HDFS / LocalFS | S3} into Spark
• (I, V) - {ProjectedExtent(extent, crs) | TemporalProjectedExtent(extent, crs, time)},
{Multiband | Singleband}Tile
• K - {Spatial(col, row) | SpaceTime(col, row, time)}

WAT?!
• Load data into Spark memory according to some
partitioning scheme
• Ahead of shuffle: smaller chunks are better for
Spark (as the max shuffle block size is only 2GBs)
• Are we dependent on the input data type? (yes)
• Window reading (what’s the desired / perfect
window size?)

SPARK SHUFFLE BLOCK FEATURE
• ~ 128mb per partition (rule of a thumb)
• if(partitionsNumber ~ 2000) repartition(> 2000)

WINDOWED READS
• Here we have a sort of some crop
function by grid bounds on each
element: tiff.crop(gridBounds) (it is
the meaning of rr.readWindows func)

WINDOWED READS
• 13 GB loads not efficient into
memory of three AWS m3.xlarge
instances .

WINDOWED READS
• Instead of 13Gb it fetches even 40Gb
per partition…

WINDOWED READS
• The solution is to pack segments into
desired windows based on the input
format requirements
• After all the main idea is to leverage
the gains by having a good partion
scheme

READ / WRITE
• SFC index and parallelism level control
• Cassandra and range queries example 
(range queries and compare it to spark Cassandra connector, queries parallelism
inside Spark partitions)

API & SPARK PROBLEMS
• Spark has its limitations
• It’s not required for a small data amount 
(In the real time case even milliseconds are important, otherwise we have to live
somehow with the Spark slow responses)
• The second API in addition to the RDD API is
the answer? 
(Collections API; does it make any sense to abstract over RDDs and Collections?)

• https://github.com/locationtech/geotrellis
• https://geotrellis.io
• https://www.azavea.com
• https://www.azavea.com/blog
• https://yuns-stacy.github.io/geotrellis-angular-demo/dashboard
• http://rasterframes.io/ml/statistics.html
• https://github.com/pomadchin

Empfohlen

Scylla Summit 2016: Graph Processing with Titan and ScyllaScyllaDB

Scylla Summit 2018: Getting the Most Out of Scylla on KubernetesScyllaDB

Powering a Graph Data System with Scylla + JanusGraphScyllaDB

AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty

Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleScyllaDB

How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks

AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty

Running Scylla on Kubernetes with Scylla OperatorScyllaDB

Empfohlen

Scylla Summit 2016: Graph Processing with Titan and ScyllaScyllaDB

Scylla Summit 2018: Getting the Most Out of Scylla on KubernetesScyllaDB

Powering a Graph Data System with Scylla + JanusGraphScyllaDB

AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty

Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleScyllaDB

How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks

AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty

Running Scylla on Kubernetes with Scylla OperatorScyllaDB

AdGear Use Case with Scylla - 1M Queries Per Second with Single-Digit Millise...ScyllaDB

Zeppelin and spark sql demystifiedOmid Vahdaty

ScyllaDB's Avi Kivity on UDF, UDA, and the FutureScyllaDB

SparkSQL: A Compiler from Queries to RDDsDatabricks

MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...ScyllaDB

How to be Successful with ScyllaScyllaDB

Emr spark tuning demystifiedOmid Vahdaty

Spark Summit EU talk by Qifan PuSpark Summit

PostgreSQL on AWS: Tips & Tricks (and horror stories)Alexander Kukushkin

Operating and Supporting Delta Lake in ProductionDatabricks

The True Cost of NoSQL DBaaS OptionsScyllaDB

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks

Apache Spark II (SparkSQL)Datio Big Data

How to Monitor and Size Workloads on AWS i3 instancesScyllaDB

Seastar Summit 2019 KeynoteScyllaDB

What Kiwi.com Has Learned Running ScyllaDB and GoScyllaDB

Spark Summit 2016: Connecting Python to the Spark EcosystemDaniel Rodriguez

Performance Troubleshooting Using Apache Spark MetricsDatabricks

Using ScyllaDB with JanusGraph for Cyber SecurityScyllaDB

Empowering the AWS DynamoDB™ application developer with AlternatorScyllaDB

Apache Spark - San Diego Big Data Meetup Jan 14th 2015cdmaxime

Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime

Weitere ähnliche Inhalte

Was ist angesagt?

AdGear Use Case with Scylla - 1M Queries Per Second with Single-Digit Millise...ScyllaDB

Zeppelin and spark sql demystifiedOmid Vahdaty

ScyllaDB's Avi Kivity on UDF, UDA, and the FutureScyllaDB

SparkSQL: A Compiler from Queries to RDDsDatabricks

MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...ScyllaDB

How to be Successful with ScyllaScyllaDB

Emr spark tuning demystifiedOmid Vahdaty

Spark Summit EU talk by Qifan PuSpark Summit

PostgreSQL on AWS: Tips & Tricks (and horror stories)Alexander Kukushkin

Operating and Supporting Delta Lake in ProductionDatabricks

The True Cost of NoSQL DBaaS OptionsScyllaDB

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks

Apache Spark II (SparkSQL)Datio Big Data

How to Monitor and Size Workloads on AWS i3 instancesScyllaDB

Seastar Summit 2019 KeynoteScyllaDB

What Kiwi.com Has Learned Running ScyllaDB and GoScyllaDB

Spark Summit 2016: Connecting Python to the Spark EcosystemDaniel Rodriguez

Performance Troubleshooting Using Apache Spark MetricsDatabricks

Using ScyllaDB with JanusGraph for Cyber SecurityScyllaDB

Empowering the AWS DynamoDB™ application developer with AlternatorScyllaDB

Was ist angesagt? (20)

AdGear Use Case with Scylla - 1M Queries Per Second with Single-Digit Millise...

Zeppelin and spark sql demystified

ScyllaDB's Avi Kivity on UDF, UDA, and the Future

SparkSQL: A Compiler from Queries to RDDs

MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...

How to be Successful with Scylla

Emr spark tuning demystified

Spark Summit EU talk by Qifan Pu

PostgreSQL on AWS: Tips & Tricks (and horror stories)

Operating and Supporting Delta Lake in Production

The True Cost of NoSQL DBaaS Options

Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia

Apache Spark II (SparkSQL)

How to Monitor and Size Workloads on AWS i3 instances

Seastar Summit 2019 Keynote

What Kiwi.com Has Learned Running ScyllaDB and Go

Spark Summit 2016: Connecting Python to the Spark Ecosystem

Performance Troubleshooting Using Apache Spark Metrics

Using ScyllaDB with JanusGraph for Cyber Security

Empowering the AWS DynamoDB™ application developer with Alternator

Ähnlich wie Distributed georeferenced raster processing on Spark with GeoTrellis

Apache Spark - San Diego Big Data Meetup Jan 14th 2015cdmaxime

Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014cdmaxime

Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi

Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank

Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj

New Developments in SparkDatabricks

Spark architechure.pptxSaiSriMadhuriYatam

Apache Spark: The Next Gen toolset for Big Data Processingprajods

Glint with Apache SparkVenkata Naga Ravi

SparkHeena Madan

Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAlluxio, Inc.

Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore

Paris Data Geek - Spark Streaming Djamel Zouaoui

Migrating ETL Workflow to Apache Spark at Scale in PinterestDatabricks

Introduction to Spark - Phoenix Meetup 08-19-2014cdmaxime

Spark on Yarn @ NetflixNezih Yigitbasi

Producing Spark on YARN for ETLDataWorks Summit/Hadoop Summit

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly

Apache Spark Overview @ ferretAndrii Gakhov

Dive into spark2Gal Marder

Ähnlich wie Distributed georeferenced raster processing on Spark with GeoTrellis (20)

Apache Spark - San Diego Big Data Meetup Jan 14th 2015

Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014

Processing Large Data with Apache Spark -- HasGeek

Extreme Apache Spark: how in 3 months we created a pipeline that can process ...

Spark: The State of the Art Engine for Big Data Processing

New Developments in Spark

Spark architechure.pptx

Apache Spark: The Next Gen toolset for Big Data Processing

Glint with Apache Spark

Spark

Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio

Scaling Spark Workloads on YARN - Boulder/Denver July 2015

Paris Data Geek - Spark Streaming

Migrating ETL Workflow to Apache Spark at Scale in Pinterest

Introduction to Spark - Phoenix Meetup 08-19-2014

Spark on Yarn @ Netflix

Producing Spark on YARN for ETL

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...

Apache Spark Overview @ ferret

Dive into spark2

Kürzlich hochgeladen

Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation

"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays

Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson

Training state-of-the-art general text embeddingZilliz

Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software

Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi

Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm

Install Stable Diffusion in windows machinePadma Pradeep

What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett

My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays

DMCC Future of Trade Web3 - Special EditionDubai Multi Commodity Centre

Gen AI in Business - Global Trends Report 2024.pdfAddepto

Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University

SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang

Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed

Kürzlich hochgeladen (20)

Connect Wave/ connectwave Pitch Deck Presentation

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn

Are Multi-Cloud and Serverless Good or Bad?

Training state-of-the-art general text embedding

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation

Vertex AI Gemini Prompt Engineering Tips

Streamlining Python Development: A Guide to a Modern Project Setup

Install Stable Diffusion in windows machine

What's New in Teams Calling, Meetings and Devices March 2024

My INSURER PTE LTD - Insurtech Innovation Award 2024

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack

DMCC Future of Trade Web3 - Special Edition

Gen AI in Business - Global Trends Report 2024.pdf

Nell’iperspazio con Rocket: il Framework Web di Rust!

SIP trunking in Janus @ Kamailio World 2024

"Debugging python applications inside k8s environment", Andrii Soldatenko

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)

Scanning the Internet for External Cloud Exposures via SSL Certs

Distributed georeferenced raster processing on Spark with GeoTrellis

1. DISTRIBUTED GEOREFERENCED RASTER PROCESSING ON SPARK Grigory Pomadchin @daunnc / @pomadchin

2. GEO + VectorTiles + PointClouds + ——————————————————————————————— Raster +Vector + /w

3. • Raster Foundry (Spark SQL & ML) • Raster Frames (Spark SQL & ML, Datasets query API) • GeoPy Spark (Python bindings) • Vector Pipes (Vector tiles on Spark) • PDAL intergration (PointClouds on Spark) GEOTRELLIS ECOSYSTEM

4. WHATS UNDER THE COVERS?

5. SPACE FILLING CURVES

7. • RDD  (a basic core spark type from the past (no)) • Manual partitioning control • DATASET • Query planning optimizations, more related to already well partitioned and structured data. PARTITIONING SCHEME SPECIAL BROWN COLORED FUNCTIONS • Join • groupByKey • reduceByKey • combineByKey • Repartition • Each function that has no preservePartitioning flag or can accept partitioner as an argument, probably map?

8. MAP IS A FUNCTION OF A DIFFERENT KIND?

9. join reduceByKey join map reduceByKey join mapVlues reduceByKey MAP IS A FUNCTION OF A DIFFERENT KIND?

10. • inspired by Eugene Cheipesh slides

11. DATA PREPARATION • {hadoop | s3}GeoTiffRDD loads data from {HDFS / LocalFS | S3} into Spark • (I, V) - {ProjectedExtent(extent, crs) | TemporalProjectedExtent(extent, crs, time)}, {Multiband | Singleband}Tile • K - {Spatial(col, row) | SpaceTime(col, row, time)}

12. • inspired by Eugene Cheipesh slides

13. WAT?! • Load data into Spark memory according to some partitioning scheme • Ahead of shuffle: smaller chunks are better for Spark (as the max shuffle block size is only 2GBs) • Are we dependent on the input data type? (yes) • Window reading (what’s the desired / perfect window size?)

14. SPARK SHUFFLE BLOCK FEATURE • ~ 128mb per partition (rule of a thumb) • if(partitionsNumber ~ 2000) repartition(> 2000)

15. WINDOWED READS

16. WINDOWED READS • Here we have a sort of some crop function by grid bounds on each element: tiff.crop(gridBounds) (it is the meaning of rr.readWindows func)

17. WINDOWED READS • 13 GB loads not efficient into memory of three AWS m3.xlarge instances .

18. WINDOWED READS • Instead of 13Gb it fetches even 40Gb per partition…

19. WINDOWED READS • The solution is to pack segments into desired windows based on the input format requirements • After all the main idea is to leverage the gains by having a good partion scheme

20. READ / WRITE • SFC index and parallelism level control • Cassandra and range queries example  (range queries and compare it to spark Cassandra connector, queries parallelism inside Spark partitions)

21. READ / WRITE

22. API & SPARK PROBLEMS • Spark has its limitations • It’s not required for a small data amount  (In the real time case even milliseconds are important, otherwise we have to live somehow with the Spark slow responses) • The second API in addition to the RDD API is the answer?  (Collections API; does it make any sense to abstract over RDDs and Collections?)

23. • https://github.com/locationtech/geotrellis • https://geotrellis.io • https://www.azavea.com • https://www.azavea.com/blog • https://yuns-stacy.github.io/geotrellis-angular-demo/dashboard • http://rasterframes.io/ml/statistics.html • https://github.com/pomadchin