DISTRIBUTED GEOREFERENCED RASTER PROCESSING ON SPARK
Grigory Pomadchin @daunnc / @pomadchin
GEO + Raster + Vector + VectorTiles + PointClouds
GEOTRELLIS ECOSYSTEM
• Raster Foundry (Spark SQL & ML)
• RasterFrames (Spark SQL & ML, Datasets query API)
• GeoPySpark (Python bindings)
• VectorPipe (vector tiles on Spark)
• PDAL integration (point clouds on Spark)
WHAT'S UNDER THE COVERS?
SPACE-FILLING CURVES
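A space-filling curve turns a two-dimensional tile key (col, row) into a single integer, so that tiles close together on the grid tend to get nearby index values. The slide does not say which curve is meant; the sketch below assumes a Z-order (Morton) curve, and the names `z_index` and `z_decode` are hypothetical helpers, not GeoTrellis API.

```python
from typing import Tuple


def z_index(col: int, row: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of col and row into one Z-order index."""
    limit = 1 << bits
    if not (0 <= col < limit and 0 <= row < limit):
        raise ValueError("col and row must fit in `bits` bits")
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)      # col bits go to even positions
        z |= ((row >> i) & 1) << (2 * i + 1)  # row bits go to odd positions
    return z


def z_decode(z: int, bits: int = 16) -> Tuple[int, int]:
    """Inverse of z_index: split the interleaved bits back into (col, row)."""
    col = row = 0
    for i in range(bits):
        col |= ((z >> (2 * i)) & 1) << i
        row |= ((z >> (2 * i + 1)) & 1) << i
    return col, row
```

For a 2 x 2 grid, `[z_index(c, r) for r in range(2) for c in range(2)]` gives `[0, 1, 2, 3]`, the familiar "Z" traversal.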
PARTITIONING SCHEME
• RDD: a basic core Spark type (a thing of the past? No), with manual partitioning control
• Dataset: query planning optimizations, geared more toward data that is already well partitioned and structured
SPECIAL BROWN COLORED FUNCTIONS
• Join
• groupByKey
• reduceByKey
• combineByKey
• Repartition
• Any function that has no preservesPartitioning flag, or that can accept a partitioner as an argument; probably map as well?
MAP IS A FUNCTION OF A DIFFERENT KIND?
• join → reduceByKey
• join → map → reduceByKey
• join → mapValues → reduceByKey
• inspired by Eugene Cheipesh slides
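Why map differs from mapValues can be shown with a toy model. This is not Spark: `ToyPairRDD` is a hypothetical stand-in that only tracks whether a partitioner is known and how many shuffles were needed. It assumes the join output is already partitioned, that map may change keys and so forgets the partitioner, and that mapValues cannot change keys and so keeps it.

```python
class ToyPairRDD:
    """Toy keyed dataset: records plus a partitioner label and a shuffle counter."""

    def __init__(self, records, partitioner=None, shuffles=0):
        self.records = list(records)
        self.partitioner = partitioner
        self.shuffles = shuffles

    def partition_by(self, partitioner):
        # Already laid out this way: nothing has to move.
        if self.partitioner == partitioner:
            return self
        return ToyPairRDD(self.records, partitioner, self.shuffles + 1)

    def map(self, f):
        # f sees the whole (key, value) pair and may change the key,
        # so the old partitioner can no longer be trusted.
        return ToyPairRDD([f(kv) for kv in self.records], None, self.shuffles)

    def map_values(self, f):
        # Keys are untouched, so the partitioner is preserved.
        return ToyPairRDD([(k, f(v)) for k, v in self.records],
                          self.partitioner, self.shuffles)

    def reduce_by_key(self, f, partitioner):
        placed = self.partition_by(partitioner)  # shuffles only if needed
        acc = {}
        for k, v in placed.records:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyPairRDD(acc.items(), partitioner, placed.shuffles)


# Stand-in for the output of a join that used the partitioner "hash-4".
joined = ToyPairRDD([("a", 1), ("a", 2), ("b", 3)], partitioner="hash-4")
add = lambda x, y: x + y

direct = joined.reduce_by_key(add, "hash-4")
via_map = joined.map(lambda kv: (kv[0], kv[1] * 2)).reduce_by_key(add, "hash-4")
via_map_values = joined.map_values(lambda v: v * 2).reduce_by_key(add, "hash-4")
```

In this model `direct` and `via_map_values` need no extra shuffle, while `via_map` needs one even though its function never touches the key.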
DATA PREPARATION
• {hadoop | s3}GeoTiffRDD loads data from {HDFS / local FS | S3} into Spark
• (I, V): the input key I is {ProjectedExtent(extent, crs) | TemporalProjectedExtent(extent, crs, time)}, the value V is a {Multiband | Singleband}Tile
• K: the grid key after tiling, {Spatial(col, row) | SpaceTime(col, row, time)}
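The step from an extent-based input key I to grid keys K can be sketched as follows. This is a simplified model under stated assumptions: a single CRS, a regular layout with row 0 at the top, and hypothetical names (`Extent`, `Layout`, `keys_for_extent`) that are not the GeoTrellis API.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Extent:
    xmin: float
    ymin: float
    xmax: float
    ymax: float


@dataclass(frozen=True)
class Layout:
    extent: Extent  # area covered by the whole tile grid
    cols: int       # number of tiles across
    rows: int       # number of tiles down


def keys_for_extent(layout: Layout, e: Extent) -> List[Tuple[int, int]]:
    """Return the (col, row) keys of all layout tiles that `e` overlaps."""
    tile_w = (layout.extent.xmax - layout.extent.xmin) / layout.cols
    tile_h = (layout.extent.ymax - layout.extent.ymin) / layout.rows
    col_min = math.floor((e.xmin - layout.extent.xmin) / tile_w)
    col_max = math.ceil((e.xmax - layout.extent.xmin) / tile_w) - 1
    # Rows count downward from the top edge of the layout.
    row_min = math.floor((layout.extent.ymax - e.ymax) / tile_h)
    row_max = math.ceil((layout.extent.ymax - e.ymin) / tile_h) - 1
    # Clamp to the grid so extents hanging over the edge stay valid.
    col_min, col_max = max(0, col_min), min(layout.cols - 1, col_max)
    row_min, row_max = max(0, row_min), min(layout.rows - 1, row_max)
    return [(c, r)
            for r in range(row_min, row_max + 1)
            for c in range(col_min, col_max + 1)]
```

One input raster usually overlaps several grid tiles, so a single (I, V) record fans out into several (K, V) records that later have to be merged per key.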
WAT?!
• Load data into Spark memory according to some partitioning scheme
• Ahead of a shuffle, smaller chunks are better for Spark (the maximum shuffle block size is only 2 GB)
• Are we dependent on the input data type? (Yes)
• Windowed reading (what is the desired / ideal window size?)
SPARK SHUFFLE BLOCK FEATURE
• ~128 MB per partition (rule of thumb)
• if (partitionsNumber ~ 2000) repartition(> 2000)
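The two rules of thumb above can be combined into a small sizing helper. This is a sketch, not a Spark API: `suggest_partitions` is a hypothetical name, and reading "~ 2000" as "within 10% below 2000" is an assumption made here for illustration.

```python
import math

MIB = 1024 * 1024


def suggest_partitions(total_bytes: int,
                       target_bytes: int = 128 * MIB,
                       threshold: int = 2000,
                       margin: float = 0.1) -> int:
    """Pick a partition count from the data size.

    Aim for roughly `target_bytes` per partition; if the result lands just
    below `threshold`, bump it above the threshold as the slide suggests.
    """
    if total_bytes < 0 or target_bytes <= 0:
        raise ValueError("sizes must be non-negative and target positive")
    n = max(1, math.ceil(total_bytes / target_bytes))
    if threshold * (1 - margin) <= n <= threshold:
        n = threshold + 1
    return n
```

With these defaults, 13 GiB of input suggests 104 partitions, while 240 GiB (1920 partitions by size alone) is bumped to 2001.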
WINDOWED READS
• Each element is cropped by grid bounds, roughly tiff.crop(gridBounds) (this is what the rr.readWindows function does)
• Loading 13 GB into the memory of three AWS m3.xlarge instances is not efficient
• Instead of 13 GB, it fetches as much as 40 GB per partition…
• The solution is to pack segments into windows of the desired size, based on the requirements of the input format
• After all, the main idea is to gain performance from a good partitioning scheme
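A minimal sketch of the packing idea, assuming a GeoTIFF stored as fixed-size rectangular segments and a read that always fetches whole segments. All names here are hypothetical and the numbers are illustrative; this is not the GeoTrellis implementation.

```python
from typing import List, Tuple

# (col_min, row_min, col_max, row_max), all inclusive pixel bounds
Window = Tuple[int, int, int, int]


def grid_windows(cols: int, rows: int, win_cols: int, win_rows: int) -> List[Window]:
    """Cut a cols x rows raster into windows of at most win_cols x win_rows."""
    return [(c0, r0, min(c0 + win_cols, cols) - 1, min(r0 + win_rows, rows) - 1)
            for r0 in range(0, rows, win_rows)
            for c0 in range(0, cols, win_cols)]


def segment_aligned_windows(cols: int, rows: int,
                            seg_cols: int, seg_rows: int,
                            max_size: int) -> List[Window]:
    """Windows whose sides are whole multiples of the segment size.

    Each side is the largest multiple of the segment side that does not
    exceed max_size (but at least one segment).
    """
    win_cols = max(1, max_size // seg_cols) * seg_cols
    win_rows = max(1, max_size // seg_rows) * seg_rows
    return grid_windows(cols, rows, win_cols, win_rows)


def segments_touched(w: Window, seg_cols: int, seg_rows: int) -> int:
    """Number of segments a window overlaps, i.e. segments that must be read."""
    c0, r0, c1, r1 = w
    return (c1 // seg_cols - c0 // seg_cols + 1) * (r1 // seg_rows - r0 // seg_rows + 1)


# A 1000 x 600 raster stored as 256 x 256 segments (a 4 x 3 segment grid).
naive = grid_windows(1000, 600, 300, 300)
aligned = segment_aligned_windows(1000, 600, 256, 256, max_size=600)
naive_reads = sum(segments_touched(w, 256, 256) for w in naive)
aligned_reads = sum(segments_touched(w, 256, 256) for w in aligned)
```

In this example the misaligned 300 x 300 windows read 28 segments in total, while the aligned windows read each of the 12 segments exactly once. The same kind of read amplification is what turns a 13 GB input into far more fetched data.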
READ / WRITE
• SFC index and parallelism level control
• Cassandra and range queries example (range queries compared with the Spark Cassandra connector; query parallelism inside Spark partitions)
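How an SFC index leads to range queries can be sketched like this: a bounding box of tile keys becomes a handful of contiguous index ranges, and each range can be issued as a separate query, which is where the level of parallelism can be controlled. The sketch assumes a Z-order curve and enumerates every key by brute force; `z_index` and `z_ranges` are hypothetical helpers, and a real implementation would decompose the box without visiting each key.

```python
from typing import List, Tuple


def z_index(col: int, row: int, bits: int = 16) -> int:
    """Z-order index: col bits at even positions, row bits at odd positions."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)
        z |= ((row >> i) & 1) << (2 * i + 1)
    return z


def z_ranges(col_min: int, row_min: int,
             col_max: int, row_max: int) -> List[Tuple[int, int]]:
    """Merge the indices of all keys in the box into inclusive (start, end) ranges."""
    indices = sorted(z_index(c, r)
                     for r in range(row_min, row_max + 1)
                     for c in range(col_min, col_max + 1))
    ranges: List[Tuple[int, int]] = []
    for z in indices:
        if ranges and z == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], z)  # extend the current run
        else:
            ranges.append((z, z))            # start a new run
    return ranges
```

A box aligned with the curve collapses into one range (`z_ranges(0, 0, 3, 3)` is `[(0, 15)]`), while a box that straddles curve boundaries splits into several (`z_ranges(1, 0, 2, 1)` is `[(1, 1), (3, 4), (6, 6)]`).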
API & SPARK PROBLEMS
• Spark has its limitations
• It is not required for small amounts of data (in the real-time case even milliseconds matter; otherwise we have to live with Spark's slow responses)
• Is a second API, in addition to the RDD API, the answer? (Collections API; does it make sense to abstract over RDDs and Collections?)
• https://github.com/locationtech/geotrellis
• https://geotrellis.io
• https://www.azavea.com
• https://www.azavea.com/blog
• https://yuns-stacy.github.io/geotrellis-angular-demo/dashboard
• http://rasterframes.io/ml/statistics.html
• https://github.com/pomadchin
