Working with OpenStreetMap using Apache Spark and GeoTrellis
What we’ll cover
● OpenStreetMap (OSM) and its data model
● A Missing Maps use case that needed big data tooling to
process OSM History
● OSMesa, what it is, and what it can do
● The future of distributed OSM processing, and what it will
enable
What is OpenStreetMap?
OSM Data Model
The OSM data model consists mainly of 3 elements:
● Nodes - Points
● Ways - LineStrings, Polygons
● Relations - GeometryCollections, Polygon with holes,
MultiPolygons
As well as the tag-based metadata that applies to each
element, and changesets that group edits
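A minimal sketch of the elements listed above as plain Python dataclasses. The class and field names here are illustrative assumptions for this sketch, not the schema used by OSMesa or any particular library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    id: int
    lon: float
    lat: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    id: int
    node_refs: List[int]  # ordered ids of the nodes that make up the way
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Member:
    type: str  # "node", "way" or "relation"
    ref: int
    role: str = ""


@dataclass
class Relation:
    id: int
    members: List[Member]
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Changeset:
    id: int
    user: str
    comment: str = ""


def way_coordinates(way: Way, nodes: Dict[int, Node]) -> List[Tuple[float, float]]:
    """Resolve a way's node references into (lon, lat) pairs."""
    return [(nodes[ref].lon, nodes[ref].lat) for ref in way.node_refs]


# Hypothetical sample data: three nodes and one way that references them.
nodes = {n.id: n for n in [Node(1, 0.0, 0.0), Node(2, 1.0, 0.0), Node(3, 1.0, 1.0)]}
road = Way(10, [1, 2, 3], {"highway": "residential"})
coords = way_coordinates(road, nodes)
```

Only nodes carry coordinates in this sketch, so a way's geometry has to be looked up through its node references.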
OSM Data Model: Relations
OSM Data Model: Changesets
● Edits are grouped into changesets, which have their own
metadata such as user comments (for developers, think
commit messages)
● Adding hashtags to user comments allows downstream
processing to group changes - for example, #HOTLunch
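One way the hashtag grouping could be done downstream is sketched here. The regular expression and the lower-casing of hashtags are assumptions made for this example, not a documented OSM or Missing Maps rule.

```python
import re

# Assumed pattern: '#' followed by word characters, optionally with hyphens.
HASHTAG_PATTERN = re.compile(r"#\w[\w-]*")


def extract_hashtags(comment: str) -> list:
    """Return the distinct hashtags in a changeset comment, lower-cased and sorted."""
    return sorted({tag.lower() for tag in HASHTAG_PATTERN.findall(comment)})


# Hypothetical changeset comment.
tags = extract_hashtags("Added buildings for #HOTLunch and #missingmaps, thanks #hotlunch.")
```

Lower-casing means `#HOTLunch` and `#hotlunch` end up in the same group, which is usually what a campaign leaderboard wants.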
Backfilling missing maps
● The Missing Maps leaderboard processes OSM change files to
increment user and campaign statistics
● The statistics were correct from the point the streaming
calculation started, but edits made before then did not count
towards users’ totals.
● So, there was a need to “backfill” the statistics based on
OSM history.
● Through the Red Cross and a grant
from Microsoft Philanthropies, Seth
Fitzsimmons of Pacific Atlas was
hired to solve the backfilling problem.
● Seth was previously involved with
releasing OSM data as a public
dataset on AWS and early work on
distributed processing of OSM data
Reducing the “time to first question”
Source: Seth’s blog post about processing OSM with Athena
Backfill: Athena approach
● Seth first tried to use Athena to calculate the backfill
statistics. This approach didn’t work
● The complexity of the queries made the jobs blow up or
never finish
● Also, Athena's geospatial support hadn't been announced
yet, and once it was, it still didn’t work with the complicated
set of queries
● Seth started showing interest in a set of tools that Azavea
was building at the time that used Apache Spark and
GeoTrellis for calculating similar statistics
● He ported his complicated SQL queries for Athena to
SparkSQL and started contributing to that effort
Backfill: New approach
Leaderboard 2.0 blog post
What is OSMesa?
● It's a loose term for a workflow for OSM data processing
● Still being defined - useful, but amorphous
● More a group of tools and techniques than, say, a library
● Uses Spark, GeoTrellis and AWS to process OSM data into
geometries, vector tiles, and statistics
Apache Spark
● A distributed computation engine.
● An API that lets you work with distributed data as a
collection, including a DataFrames API
● Written in Scala, with language bindings for use with Java,
Python, and R.
● Spark DataFrames provide an API that is similar to R or
Pandas DataFrames; allows working with data in a SQL-like
manner
● Very powerful, and can express complicated queries
● (partially) Abstracts away the complexities of distributed
computing
GeoTrellis
● Core geospatial library in Scala
● Enables Spark with geospatial types and operations
● Generally focused on Raster data, wraps JTS for vector
support
● Vector Tile module for reading and writing vector tiles
OSMesa workflow
[Diagram labels: AWS EMR Cluster, AWS S3, ORC, Statistics, Vector Tiles, ORC files]
● With OSMesa, we can create full historical geometries.
● To do this, we needed to create a concept of “minor
versions” of geometries
Creating features from History
[Diagram: way v1 (highway=unclassified) references nodes that are all at v1. When two of those nodes change to v2 while the way itself is not edited, the way gets a minor version, v1.1 (still highway=unclassified). An edit to the way itself (highway=primary) produces way v2.]
Full historical geometries
● With minor versions, we can bake new ORC files that
contain geometries of every element in OSM history, with
way and relation versions reflecting every edit to the element
itself as well as to the elements it contains
● Then, we compute statistics per changeset based on
geometries, and roll up the statistics per user and hashtag
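The roll-up step can be sketched in plain Python as below. The record layout and the metric names (`road_km`, `buildings`) are hypothetical; in the talk's setting this would be a distributed Spark aggregation rather than an in-memory loop.

```python
from collections import defaultdict


def roll_up(changeset_stats):
    """Sum per-changeset metrics into per-user and per-hashtag totals."""
    per_user = defaultdict(lambda: defaultdict(float))
    per_hashtag = defaultdict(lambda: defaultdict(float))
    for cs in changeset_stats:
        for metric, value in cs["stats"].items():
            per_user[cs["user"]][metric] += value
            # A changeset counts towards every hashtag in its comment.
            for tag in cs["hashtags"]:
                per_hashtag[tag][metric] += value
    to_plain = lambda d: {key: dict(metrics) for key, metrics in d.items()}
    return to_plain(per_user), to_plain(per_hashtag)


# Hypothetical per-changeset statistics.
changeset_stats = [
    {"changeset": 1, "user": "alice", "hashtags": ["#hotlunch"],
     "stats": {"road_km": 1.5, "buildings": 10}},
    {"changeset": 2, "user": "alice", "hashtags": ["#hotlunch", "#missingmaps"],
     "stats": {"road_km": 2.5, "buildings": 0}},
    {"changeset": 3, "user": "bob", "hashtags": [],
     "stats": {"road_km": 0.5, "buildings": 4}},
]
per_user, per_hashtag = roll_up(changeset_stats)
```

Because the totals are simple sums, the same aggregation works both for a one-off backfill over history and for incremental updates from new changesets.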
Processing OSM data at scale
● Processing of full history into features in under 40 minutes
(cluster of 255 m3.2xlarge nodes)
● This is not a small cluster (≈$65/hour). YMMV with smaller
clusters.
● We are building update mechanisms to avoid refreshing the
entire dataset
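As a rough worked example of what those figures imply per run: the calculation below assumes simple pro-rata billing at the quoted ≈$65/hour and uses 40 minutes as an upper bound. Real cloud billing granularity and current prices may differ.

```python
def run_cost_usd(minutes: float, hourly_rate_usd: float) -> float:
    """Pro-rata cost of a cluster run, ignoring billing granularity."""
    return minutes / 60.0 * hourly_rate_usd


# Figures quoted on the slide: under 40 minutes at roughly $65/hour.
full_history_cost = run_cost_usd(40, 65.0)
print(f"Upper bound per full-history run: ${full_history_cost:.2f}")
```

Under these assumptions a full-history run comes to a little over $43, which helps explain the interest in update mechanisms that avoid reprocessing everything.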
Some data created by OSMesa...
Viewing time slices of Rhode Island OSM
Historical edits for several hashtag campaigns
Global friction surface for cost distance calculations using elevation (SRTM) and OSM roads + water bodies
OSMesa: Other current uses
● Building matching between OSM and other vector datasets
● Generating vector tiles for URCHN containing a subset of
historical data to front-end analytics
This is just the beginning
The Future: Validation workflows, Reputation scores
● Better validation workflows are a big topic in the OSM
community right now (according to SOTM US 2017)
● HOT Tasking Manager does some; we can do better
● One way to improve validation workflows is to suggest that
validation be done by veteran mappers, and that the edits of
more junior mappers be suggested for validation (“reputation score”)
● Development Seed, who contribute to and use OSMesa work,
have great ideas in this space.
The Future: Data Science notebooks, production workflows
● We are aiming to create a Python notebook environment for
doing data science on OSM, in combination with raster data
● By building on Spark and projects like GeoMesa’s
“JTSFrames”, RasterFrames, and GeoTrellis, we’re creating
a platform that works both for data scientists poking around
in a Jupyter notebook and for production systems.
The Future: Machine Learning pre- and post-processing
● Pre-processing geospatial imagery and OSM into training
chips - a distributed label-maker
● Managing data into and out of Raster Vision
● Post-processing by cleaning the model output, matching to
OSM or other vector data to remove duplicates, conflation
workflows
● Matching OSM to imagery dates: e.g. pre- and post-disaster.
Join in the fun
● There are a lot of interesting development challenges that
need to be met in the OSM world
● OSM has many different voices in the room, but they all
have one goal: building a better map
● Join the effort to build a better map
If you could ask OpenStreetMap any
question, at any scale, what would you ask it?
THANKS!
Rob Emanuele, Azavea
@lossyrob (Twitter, GitHub)
www.azavea.com
Seth Fitzsimmons, Pacific Atlas
@mojodna (Twitter, GitHub)
www.pacatlas.com
github.com/azavea/osmesa
OSM Data Model: Nodes
● Single location; only OSM element with geospatial data
● Can represent points of interest, or be solely for inclusion in
ways
● Represents a Point geometry
OSM Data Model: Ways
● References a sequence of ordered nodes
● Represents a LineString geometry
● Closed ways can represent Polygon geometries
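A small sketch of how a way's node references could be turned into a GeoJSON-style geometry. Treating every closed way as a Polygon is a simplifying assumption here; in practice, whether a closed way is an area also depends on its tags. The function names are illustrative.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (lon, lat)


def is_closed(node_refs: List[int]) -> bool:
    """A way is closed when its first and last node references are the same."""
    return len(node_refs) >= 4 and node_refs[0] == node_refs[-1]


def way_geometry(node_refs: List[int], coords_by_node: Dict[int, Coord],
                 treat_closed_as_area: bool = True) -> dict:
    """Build a GeoJSON-like LineString or Polygon from ordered node references."""
    coords = [list(coords_by_node[ref]) for ref in node_refs]
    if treat_closed_as_area and is_closed(node_refs):
        return {"type": "Polygon", "coordinates": [coords]}
    return {"type": "LineString", "coordinates": coords}


# Hypothetical node coordinates.
coords_by_node = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (1.0, 1.0)}
open_way = way_geometry([1, 2, 3], coords_by_node)
closed_way = way_geometry([1, 2, 3, 1], coords_by_node)
```

The closed example repeats node 1 at the end, which is how a ring is expressed in the way's node list.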
OSM Data Model: Relations
● Group of nodes, ways, and other relations
● Used for representing a Polygon with holes,
MultiPolygons, and more generally GeometryCollections
OSM Data Model: Tags
● Each Node, Way and Relation can have a sequence of
tags, which are string-based keys and values. These
describe the role of each element on the map, e.g.
○ highway=residential
○ landuse=grass
○ amenity=fast_food
Source: Dongpo Deng, https://www.slideshare.net/dongpo/the-one-and-many-maps-participatory-and-temporal-diversities-in-openstreetmap
https://planet.openstreetmap.org/
Ways to work with OSM snapshots
● Import OSM data into PostGIS
○ osm2pgsql
○ imposm3
● Render into raster tiles or vector tiles
○ Mapnik
○ Tegola
● Utilize for routing software
○ pgRouting
Ways to work with OSM history
● Clip it using osmium, and import a subset into PostGIS
● After that … not a lot of mature tooling available
Why is OSM history useful?
● Calculating user history statistics
● Calculating campaign history statistics
● Calculating complete answers to the question, “what has
changed?”
● Taking a snapshot of OSM at any point in history
● Analytics for research
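Taking a snapshot from history can be sketched as: for each element, keep the latest version at or before the requested time, then drop elements whose latest version is a deletion. The record fields (`id`, `version`, `timestamp`, `visible`) are assumed for this example.

```python
from datetime import datetime


def snapshot(history, at):
    """Return {element id: record} for elements that exist at time `at`."""
    latest = {}
    for rec in history:
        if rec["timestamp"] <= at:
            current = latest.get(rec["id"])
            if current is None or rec["version"] > current["version"]:
                latest[rec["id"]] = rec
    # An element whose latest version is marked not visible has been deleted.
    return {eid: rec for eid, rec in latest.items() if rec["visible"]}


# Hypothetical history: node 1 is created, edited, then deleted; node 2 is created later.
history = [
    {"id": 1, "version": 1, "timestamp": datetime(2015, 1, 1), "visible": True},
    {"id": 1, "version": 2, "timestamp": datetime(2017, 1, 1), "visible": True},
    {"id": 1, "version": 3, "timestamp": datetime(2019, 1, 1), "visible": False},
    {"id": 2, "version": 1, "timestamp": datetime(2018, 1, 1), "visible": True},
]
snap_2016 = snapshot(history, datetime(2016, 6, 1))
snap_2018 = snapshot(history, datetime(2018, 6, 1))
snap_2020 = snapshot(history, datetime(2020, 1, 1))
```

The same selection expressed over a distributed table is essentially a group-by on element id with a filter on timestamp.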
Why ORC?
● On-demand querying + predicate push-down is possible if
OSM data is in a format that is well understood by the
Hadoop ecosystem
● Bespoke formats have their place, especially when size or
other considerations are all-consuming, but it's really
frustrating to see people continually implementing OSM PBF
parsers to be slightly faster when those parsers are typically
single-use (for a specific application). I wanted to sidestep
the whole process and use a well-known, well-supported
The Approach: Features from OSM data
● Join element data to the other elements that contain them;
for example, join each node to the way(s) it belongs to.
● Assign a minor version to ways and relations modified
because the underlying elements change; e.g. a minor
version increments for a way if someone moves the nodes
belonging to it.
● Create Points, Lines, Polygons, and MultiPolygons for each
major and minor version of the element.
ProcessOSM.scala on GitHub
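The minor-version idea can be illustrated with a small pure-Python sketch. This is not the logic in ProcessOSM.scala: the input layout is hypothetical, timestamps are plain integers, and node changes that share a timestamp are assumed to produce a single minor version.

```python
def minor_versions(way_versions, node_versions):
    """Return (major, minor, timestamp) tuples for one way.

    way_versions: dicts with "version", "timestamp", "nodes", sorted by version.
    node_versions: dicts with "id", "version", "timestamp".
    """
    result = []
    for i, way in enumerate(way_versions):
        start = way["timestamp"]
        # A major version is current until the next major version is created.
        end = way_versions[i + 1]["timestamp"] if i + 1 < len(way_versions) else None
        result.append((way["version"], 0, start))
        refs = set(way["nodes"])
        change_times = sorted({
            n["timestamp"] for n in node_versions
            if n["id"] in refs
            and n["timestamp"] > start
            and (end is None or n["timestamp"] < end)
        })
        # Each distinct time at which member nodes changed bumps the minor version.
        for minor, ts in enumerate(change_times, start=1):
            result.append((way["version"], minor, ts))
    return result


# Hypothetical history: the way is created at t=10 and retagged at t=30;
# two of its nodes are moved at t=20 without the way itself being edited.
way_versions = [
    {"version": 1, "timestamp": 10, "nodes": [1, 2, 3, 4]},
    {"version": 2, "timestamp": 30, "nodes": [1, 2, 3, 4]},
]
node_versions = (
    [{"id": i, "version": 1, "timestamp": 5} for i in (1, 2, 3, 4)]
    + [{"id": i, "version": 2, "timestamp": 20} for i in (3, 4)]
)
versions = minor_versions(way_versions, node_versions)
```

In this example the way ends up with versions 1.0, 1.1 and 2.0, and a geometry would then be built for each of them from the node positions current at that time.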
The Approach: Vector Tile Generation
Analytic Vector Tiles
● The name we’ve been using for Vector Tiles that contain
information for analysis, not (necessarily) for display
● OSMesa/VectorPipe can create sets of Analytic Vector Tiles
from arbitrary subsets of OSM History and publish them to
AWS S3
● Think custom Mapbox QA Tiles, containing relations and
historical elements
● We are creating streaming update workflows to keep
Analytic Vector Tile sets up-to-the-minute (almost).
Other work in this space
● Mapbox’s Jennings Anderson gave a talk at SOTM and
wrote a blog post around quarterly QA tiles
● Uses a work-in-progress project called osm-wayback to
create the historical QA tiles. The goal of the project is “...to create
historic geometries for each intermediate version of an OSM
feature.”
● RocksDB on the backend, which creates a ≈600GB index
● We have been collaborating and are looking to collaborate
further; the work is awesome
Animation of Rhode Island OSM edits over time
Global friction surface for cost distance calculations using elevation (SRTM) and OSM roads + water bodies
How to get started with OSMesa
● GitHub
● Gitter
● Docs are a TODO
An Aside - “Push vs Pull” models for AI tooling for OSM (and in general)

 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Working with OpenStreetMap using Apache Spark and GeoTrellis

  • 2. What we’ll cover ● OpenStreetMap (OSM) and its data model ● A Missing Maps use case that needed big data tooling to process OSM History ● OSMesa, what it is, and what it can do ● The future of distributed OSM processing, and what it will enable
  • 5. OSM Data Model The OSM data model consists mainly of 3 elements: ● Nodes - Points ● Ways - LineStrings, Polygons ● Relations - GeometryCollections, Polygons with holes, MultiPolygons As well as the tag-based metadata that applies to each element, and changesets that group edits
  • 6. OSM Data Model: Relations
  • 7. OSM Data Model: Changesets ● Edits are grouped into changesets, which have their own metadata such as user comments (for developers, think commit messages) ● Adding hashtags to user comments allows downstream processing to group changes - for example, #HOTLunch
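
As a rough illustration of the hashtag grouping described on slide 7, the sketch below pulls hashtags out of changeset comments and groups changeset ids by tag. The sample records, the `group_by_hashtag` helper, and the choice to lower-case tags are assumptions made for this example; they are not part of any OSM or OSMesa API.

```python
import re
from collections import defaultdict

# Hypothetical changeset records: (changeset id, user comment).
changesets = [
    (101, "Added buildings for #HOTLunch"),
    (102, "Fixed road names"),
    (103, "#hotlunch #MissingMaps tracing huts"),
]

# A hashtag is taken to be '#' followed by word characters (an assumption).
HASHTAG = re.compile(r"#(\w+)")


def group_by_hashtag(records):
    """Map each lower-cased hashtag to the changeset ids that mention it."""
    groups = defaultdict(list)
    for changeset_id, comment in records:
        # Use a set so a tag repeated in one comment is counted once.
        for tag in {t.lower() for t in HASHTAG.findall(comment)}:
            groups[tag].append(changeset_id)
    return dict(groups)


groups = group_by_hashtag(changesets)
```

Here `groups["hotlunch"]` collects changesets 101 and 103, while changeset 102, which has no hashtag, is not attached to any campaign.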
  • 12. Backfilling Missing Maps ● The Missing Maps leaderboard processes OSM change files to increment user and campaign statistics ● The statistics were correct from the point the streaming calculation started, but edits made before then did not count towards users’ totals. ● So, there was a need to “backfill” the statistics based on OSM history.
  • 13. ● Through the Red Cross and a grant from Microsoft Philanthropies, Seth Fitzsimmons of Pacific Atlas was hired to solve the backfilling problem. ● Seth was previously involved with releasing OSM data as a public dataset on AWS and early work on distributed processing of OSM data
  • 15. Reducing the “time to first question”
  • 17. Source: Seth’s blog post about processing OSM with Athena
  • 18. Backfill: Athena approach ● Seth first tried to use Athena to calculate the backfill statistics. This approach didn’t work ● The complexity of the queries made the jobs blow up or never finish ● Also, Athena's geospatial support hadn't been announced yet, and once it was, it still didn’t work with the complicated set of queries
  • 19. Backfill: New approach ● Seth started showing interest in a set of tools that Azavea was building at the time, which used Apache Spark and GeoTrellis to calculate similar statistics ● He ported his complicated SQL queries for Athena to SparkSQL and started contributing to that effort
  • 21. What is OSMesa? ● It's a loose term for a workflow for OSM data processing ● Still being defined - useful, but amorphous ● More a group of tools and techniques than, say, a library ● Uses Spark, GeoTrellis and AWS to process OSM data into geometries, vector tiles, and statistics
  • 22. Apache Spark ● A distributed computation engine. ● An API that lets you work with distributed data as a collection, including a DataFrames API ● Written in Scala, with language bindings for use with Java, Python, and R.
  • 23. ● Spark DataFrames provide an API that is similar to R or Pandas DataFrames; allows working with data in a SQL-like manner ● Very powerful, and can express complicated queries ● (partially) Abstracts away the complexities of distributed computing
  • 24. GeoTrellis ● Core geospatial library in Scala ● Enables Spark with geospatial types and operations ● Generally focused on raster data; wraps JTS for vector support ● Vector Tile module for reading and writing vector tiles
  • 25. OSMesa workflow (diagram; labels: AWS EMR Cluster, AWS S3, ORC files, ORC, Statistics, Vector Tiles)
  • 26. Creating features from History ● With OSMesa, we can create full historical geometries. ● To do this, we needed to create a concept of “minor versions” of geometries
  • 27. (diagram) way v1, highway=unclassified (nodes v1, v1, v1, v1) → way v1, highway=unclassified (nodes v1, v1, v2, v2) → way v2, highway=primary (nodes v1, v1, v2, v2)
  • 28. (diagram) way v1, highway=unclassified (nodes v1, v1, v1, v1) → way v1.1, highway=unclassified (nodes v1, v1, v2, v2; minor version change) → way v2, highway=primary (nodes v1, v1, v2, v2)
  • 29. Full historical geometries ● With minor versions, we can bake new ORC files that contain geometries of every element in OSM history, with ways and relations reflecting every edit to the element itself as well as to the elements it contains ● Then, we compute statistics per changeset based on geometries, and roll up the statistics per user and hashtag
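
The roll-up step on slide 29 can be sketched in plain Python. This is only an illustration of the aggregation, not OSMesa's Spark implementation: the per-changeset rows, the metric names (`road_km`, `buildings`), and the `roll_up` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-changeset statistics, as if already derived from geometries.
changeset_stats = [
    {"changeset": 101, "user": "alice", "hashtags": ["hotlunch"],
     "road_km": 1.5, "buildings": 10},
    {"changeset": 102, "user": "bob", "hashtags": [],
     "road_km": 0.25, "buildings": 0},
    {"changeset": 103, "user": "alice", "hashtags": ["hotlunch", "missingmaps"],
     "road_km": 0.5, "buildings": 4},
]

METRICS = ("road_km", "buildings")


def roll_up(stats, keys_of):
    """Sum every metric over the grouping keys that keys_of(row) returns."""
    totals = defaultdict(lambda: dict.fromkeys(METRICS, 0))
    for row in stats:
        for key in keys_of(row):
            for metric in METRICS:
                totals[key][metric] += row[metric]
    return dict(totals)


# One key per row for users; zero or more keys per row for hashtags.
per_user = roll_up(changeset_stats, lambda row: [row["user"]])
per_hashtag = roll_up(changeset_stats, lambda row: row["hashtags"])
```

Note the design choice that a changeset carrying two hashtags counts in full towards both campaigns, so per-hashtag totals are not expected to sum to the per-user totals.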
  • 30. Processing OSM data at scale ● Processing of full history into features in under 40 minutes (cluster of 255 m3.2xlarge nodes) ● This is not a small cluster (≈ $65/hour). YMMV with smaller clusters. ● We are building update mechanisms to avoid refreshing the entire dataset
  • 31. Some data created by OSMesa...
  • 32. Viewing time slices of Rhode Island OSM
  • 33. Historical edits for several hashtag campaigns
  • 34. Global friction surface for cost distance calculations using elevation (SRTM) and OSM roads + water bodies
  • 35. OSMesa: Other current uses ● Building matching between OSM and other vector datasets ● Generating vector tiles for URCHN containing a subset of historical data to front-end analytics
  • 36. This is just the beginning
  • 37. The Future: Validation workflows, Reputation scores ● Better validation workflows are a big question in the OSM community right now (according to SOTM US 2017) ● HOT Tasking Manager does some; we can do better ● One way to improve validation workflows is to suggest that validation be done by veteran mappers, and that the edits of more junior mappers be suggested for validation (“reputation score”) ● Development Seed, who contribute to and use OSMesa, have great ideas in this space.
  • 38. The Future: Data Science notebooks, production workflows ● We are aiming to create a Python notebook environment for doing data science on OSM, in combination with raster data ● By building on Spark and projects like GeoMesa’s “JTSFrames”, RasterFrames, and GeoTrellis, we’re creating a platform that works both for data scientists poking around in a Jupyter notebook and for production systems.
  • 39. The Future: Machine Learning pre- and post- processing ● Pre-processing geospatial imagery and OSM into training chips - a distributed label-maker ● Managing data into and out of Raster Vision ● Post-processing by cleaning the model output, matching to OSM or other vector data to remove duplicates, conflation workflows ● Matching OSM to imagery dates: e.g. pre- and post- disaster.
  • 40. Join in the fun ● There are a lot of interesting development challenges that need to be met in the OSM world ● OSM has many different voices in the room, but they all have one goal: building a better map ● Join the effort to build a better map
  • 41. If you could ask OpenStreetMap any question, at any scale, what would you ask it?
  • 42. THANKS! Rob Emanuele, Azavea @lossyrob (Twitter, GitHub) www.azavea.com Seth Fitzsimmons, Pacific Atlas @mojodna (Twitter, GitHub) www.pacatlas.com github.com/azavea/osmesa
  • 43. OSM Data Model: Nodes ● Single location; only OSM element with geospatial data ● Can represent points of interest, or be solely for inclusion in ways ● Represents a Point geometry
  • 44. OSM Data Model: Ways ● References a sequence of ordered nodes ● Represents a LineString geometry ● Closed ways can represent Polygon geometries
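
To make the node and way slides concrete, here is a minimal sketch of turning a way's ordered node references into a GeoJSON-style geometry. The `nodes` lookup and the `way_to_geometry` helper are hypothetical, and the rule used (a closed ring of at least four references becomes a Polygon) is a simplification: in practice, tags also influence whether a closed way is treated as an area.

```python
# Hypothetical node lookup: node id -> (longitude, latitude).
nodes = {
    1: (0.0, 0.0),
    2: (1.0, 0.0),
    3: (1.0, 1.0),
}


def way_to_geometry(node_ids, nodes):
    """Build a GeoJSON-style geometry dict from a way's ordered node ids."""
    coords = [list(nodes[node_id]) for node_id in node_ids]
    # A closed way starts and ends on the same node; a valid ring needs
    # at least four positions (three distinct corners plus the closing one).
    closed = len(node_ids) >= 4 and node_ids[0] == node_ids[-1]
    if closed:
        return {"type": "Polygon", "coordinates": [coords]}
    return {"type": "LineString", "coordinates": coords}


open_way = way_to_geometry([1, 2, 3], nodes)
closed_way = way_to_geometry([1, 2, 3, 1], nodes)
```

Only the nodes carry coordinates, so the way is resolved purely by looking up its references, which is why node edits can change a way's shape without the way itself being edited.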
  • 45. OSM Data Model: Relations ● Group of nodes, ways, and other relations ● Used for representing a Polygon with holes, MultiPolygons, and more generally GeometryCollections
  • 46. OSM Data Model: Tags ● Each Node, Way and Relation can have a sequence of tags, which are string-based keys and values. These describe the role of each element on the map, e.g. ○ highway=residential ○ landuse=grass ○ amenity=fast_food
  • 47. Source: Dongpo Deng, https://www.slideshare.net/dongpo/the-one-and-many-maps-participatory-and-temporal-diversities-in-openstreetmap
  • 49. Ways to work with OSM snapshots ● Import OSM data into PostGIS ○ osm2pgsql ○ imposm3 ● Render into raster tiles or vector tiles ○ Mapnik ○ Tegola ● Utilize for routing software ○ pgRouting
  • 50. Ways to work with OSM history ● Clip it using osmium, and import a subset into PostGIS ● After that … not a lot of mature tooling available
  • 51. Why is OSM history useful ● Calculating user history statistics ● Calculating campaign history statistics ● Calculating complete answers to the question, “what has changed?” ● Taking a snapshot of OSM at any point in history ● Analytics for research
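
One of the uses listed on slide 51, taking a snapshot of OSM at a point in history, reduces to keeping the latest version of each element at or before the chosen time and dropping elements whose latest version is a deletion. The sketch below assumes a simplified row layout (`type`, `id`, `version`, `timestamp`, `visible`) and a hypothetical `snapshot` helper; timestamps are plain integers for brevity.

```python
# Hypothetical history rows; `visible` is False for a deleted version.
history = [
    {"type": "node", "id": 1, "version": 1, "timestamp": 10, "visible": True},
    {"type": "node", "id": 1, "version": 2, "timestamp": 20, "visible": True},
    {"type": "node", "id": 1, "version": 3, "timestamp": 30, "visible": False},
    {"type": "way", "id": 7, "version": 1, "timestamp": 15, "visible": True},
]


def snapshot(history, at):
    """Return the latest visible version of each element as of time `at`."""
    latest = {}
    # Walk versions in chronological order so later rows overwrite earlier ones.
    for row in sorted(history, key=lambda r: (r["timestamp"], r["version"])):
        if row["timestamp"] <= at:
            latest[(row["type"], row["id"])] = row
    # An element whose most recent version is a deletion is not in the snapshot.
    return {key: row for key, row in latest.items() if row["visible"]}


at_25 = snapshot(history, 25)
at_35 = snapshot(history, 35)
```

At time 25 the snapshot holds node 1 at version 2 and way 7; by time 35 node 1 has been deleted and only the way remains.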
  • 52. Why ORC? ● On-demand querying + predicate push-down is possible if OSM data is in a format that is well understood by the Hadoop ecosystem ● bespoke formats have their place, especially when size or other considerations are all-consuming, but it's really frustrating to see people continually implementing OSM PBF parsers to be slightly faster when those parsers are typically single-use (for a specific application). i wanted to sidestep the whole process and use a well-known, well-supported
  • 53. The Approach: Features from OSM data ● Join element data to the other elements that contain them; for example, join each node to the way(s) it belongs to. ● Assign a minor version to ways and relations that are modified because their underlying elements change; e.g. a minor version increments for a way if someone moves the nodes belonging to it. ● Create Points, Lines, Polygons, and MultiPolygons for each major and minor version of the element. ProcessOSM.scala on GitHub
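
The minor-version idea on slide 53 can be sketched as follows. This is a simplified stand-in, not the logic in ProcessOSM.scala: the `WayVersion` type, the `minor_versions` helper, and the integer timestamps are assumptions, and node edits that share a timestamp are collapsed into a single minor version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WayVersion:
    version: int      # major version, as stored in OSM
    timestamp: int    # simplified integer time of the edit
    node_ids: tuple   # ordered node references


def minor_versions(way_versions, node_edits):
    """Return (major, minor, timestamp) for every geometric state of a way.

    node_edits maps a node id to the timestamps at which that node changed.
    """
    ordered = sorted(way_versions, key=lambda w: w.version)
    states = []
    for i, way in enumerate(ordered):
        # This major version lives until the next major version appears.
        end = ordered[i + 1].timestamp if i + 1 < len(ordered) else float("inf")
        # Node edits strictly inside that lifetime change the way's geometry.
        changes = sorted({
            t
            for node_id in way.node_ids
            for t in node_edits.get(node_id, [])
            if way.timestamp < t < end
        })
        states.append((way.version, 0, way.timestamp))
        for minor, t in enumerate(changes, start=1):
            states.append((way.version, minor, t))
    return states


# Mirrors the slide example: two nodes move between way v1 and way v2.
way_history = [
    WayVersion(1, 100, (1, 2, 3, 4)),
    WayVersion(2, 300, (1, 2, 3, 4)),
]
node_edits = {1: [50], 2: [50], 3: [50, 200], 4: [50, 200]}
states = minor_versions(way_history, node_edits)
```

With this input, the node edits at time 200 produce a state between the two major versions, corresponding to the "v1.1" shown in the diagram slides.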
  • 54. The Approach: Vector Tile Generation
  • 55. Analytic Vector Tiles ● The name we’ve been using for Vector Tiles that contain information for analysis, not (necessarily) for display ● OSMesa/VectorPipe can create sets of Analytic Vector Tiles from arbitrary subsets of OSM History and publish them to AWS S3 ● Think custom Mapbox QA Tiles, containing relations and historical elements ● We are creating streaming update workflows to keep Analytic Vector Tile sets up-to-the-minute (almost).
  • 56. Other work in this space ● Mapbox’s Jennings Anderson gave a talk at SOTM and wrote a blog post around quarterly QA tiles ● Uses a work-in-progress project called osm-wayback to create the historical QA tiles. The goal of the project is “...to create historic geometries for each intermediate version of an OSM feature.” ● RocksDB on the backend, which creates a ≈ 600GB index ● We have been collaborating and are looking to collaborate further; the work is awesome
  • 57. Animation of Rhode Island OSM edits over time
  • 58. Global friction surface for cost distance calculations using elevation (SRTM) and OSM roads + water bodies
  • 60. How to get started with OSMesa ● GitHub ● Gitter ● Docs are a TODO
  • 61. An Aside - “Push vs Pull” models for AI tooling for OSM (and in general)