Spatial Analytics Workshop
Pete Skomoroch, LinkedIn (@peteskomoroch)
Kevin Weil, Twitter (@kevinweil)
Sean Gorman, FortiusOne (@seangorman)

#spatialanalytics
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Spatial Analysis

      Analytical techniques to determine the spatial
      distribution of a variable, the relationship between
      the spatial distribution of variables, and the
      association of the variables in an area.
Pattern Analysis
Spatial Analysis Types

     1. Spatial autocorrelation
     2. Spatial interpolation
     3. Spatial interaction
     4. Simulation and modeling
     5. Density mapping
Spatial Autocorrelation

      Spatial autocorrelation statistics measure and analyze
      the degree of dependency among observations in a
      geographic space.


      First law of geography: “everything is related to everything
      else, but near things are more related than distant things.”
        -- Waldo Tobler
Moran’s I - Random Variable: Moran’s I = .012
Moran’s I - Per Capita Income in Monroe County: Moran’s I = .66
Spatial Interpolation

      Spatial interpolation methods estimate the variables
      at unobserved locations in geographic space based
      on the values at observed locations.
Natural Gas Demand in Response to Alberta Clipper cold front, February 21-25, 2003

Date                Chicago   NYC      Henry
February 21, 2003   $14.00    $14.00   $7.55
February 24, 2003   $18.50    $30.00   $16.00
February 25, 2003   $20.00    $37.00   $22.00
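One common way to estimate values at unobserved locations is inverse distance weighting (IDW). The sketch below is an assumption for illustration only: the slides do not say which interpolation method produced the natural gas maps, and the function name `idw`, the `power` parameter, and the sample points are hypothetical.

```python
import math


def idw(known, x, y, power=2.0):
    """Inverse distance weighting estimate at (x, y) from (x, y, value) observations."""
    num = 0.0
    den = 0.0
    for kx, ky, value in known:
        dist = math.hypot(x - kx, y - ky)
        if dist == 0.0:
            return value  # the query point is exactly an observed location
        weight = 1.0 / dist ** power
        num += weight * value
        den += weight
    return num / den


# Hypothetical observations: (x, y, value).
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint = idw(samples, 1.0, 0.0)    # equidistant from both observations
near_first = idw(samples, 0.5, 0.0)  # closer to the first observation
```

Nearby observations dominate the estimate, which is Tobler's first law expressed as a weighting scheme.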
Spatial Interaction

      Spatial interaction or “gravity models” estimate
      the flow of people, material, or information
      between locations in geographic space.
Global Oil Supply and Demand Gravity Model
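The basic gravity model can be sketched in a few lines: flow grows with the "mass" of the origin and destination and falls off with distance. The constant `k`, the exponent `beta`, and the function name are hypothetical, and this is not the calibrated model behind the oil supply and demand map.

```python
def gravity_flow(mass_origin, mass_dest, distance, k=1.0, beta=2.0):
    """Estimated flow between two places under a basic gravity model."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    # Flow is proportional to both masses and decays with distance ** beta.
    return k * mass_origin * mass_dest / distance ** beta


# Hypothetical masses (e.g. supply and demand) and distances.
near = gravity_flow(100.0, 50.0, 10.0)
far = gravity_flow(100.0, 50.0, 20.0)  # same places, twice as far apart
```

With `beta=2.0`, doubling the distance cuts the estimated flow to a quarter; in practice `k` and `beta` would be fitted to observed flows.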
Simulation and Modeling

      Simple interactions among proximal entities can
      lead to intricate, persistent, and functional spatial
      entities at aggregate levels (complex adaptive
      systems).
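A classic toy illustration of simple local rules producing a persistent aggregate structure is Conway's Game of Life. The sketch below is only an analogy chosen for brevity, not the infrastructure simulation described in this deck; the names `life_step` and `glider` are hypothetical.

```python
def life_step(alive):
    """Advance Conway's Game of Life one generation; `alive` is a set of (x, y) cells."""
    counts = {}
    for x, y in alive:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx or dy:
                    cell = (x + dx, y + dy)
                    counts[cell] = counts.get(cell, 0) + 1
    # A cell is alive next turn with exactly 3 live neighbours,
    # or with 2 live neighbours if it is already alive.
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in alive)}


# A "glider": five cells that only ever interact with their immediate neighbours.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):
    state = life_step(state)
```

After four generations the same five-cell shape reappears one cell further along the diagonal, a persistent moving entity that no individual rule mentions.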
Spatial Interdependency Analysis of the San Francisco Failure Simulation

Infrastructure                Total Number of Links   No. Links Congested   % Links Congested   % Volume Delay
Refined Products (National)   3,197                   1                     0.03%               0.05%
Refined Products (MSA)        8                       1                     12.50%              93%
Power Grid (Regional)         1,942                   4                     0%                  N/A
Power Grid (MSA)              16                      2                     13%                 N/A
Density Mapping

     Calculating the proximity and frequency of a
     spatial phenomenon by creating a probabilistic
     surface.
New York City Fiber Density Map
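A density surface of this kind can be sketched by evaluating a kernel at the centre of every grid cell. The example below assumes a Gaussian kernel, unit-sized cells, and made-up points; the function name `density_grid` and the `bandwidth` parameter are hypothetical, and this is not the code behind the fiber density map.

```python
import math


def density_grid(points, width, height, bandwidth=1.0):
    """Gaussian kernel density surface on a width x height grid of unit cells."""
    grid = [[0.0] * width for _ in range(height)]
    # Normalisation for a 2D Gaussian kernel averaged over all points.
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    for row in range(height):
        for col in range(width):
            cx, cy = col + 0.5, row + 0.5  # evaluate at the cell centre
            total = 0.0
            for px, py in points:
                d2 = (cx - px) ** 2 + (cy - py) ** 2
                total += math.exp(-d2 / (2.0 * bandwidth ** 2))
            grid[row][col] = norm * total
    return grid


# Hypothetical point locations: a cluster near (1.5, 1.5) and one outlier.
points = [(1.5, 1.5), (1.4, 1.6), (1.6, 1.4), (4.5, 4.5)]
surface = density_grid(points, width=6, height=6)
peak = max(
    (value, (col, row))
    for row, line in enumerate(surface)
    for col, value in enumerate(line)
)
```

The cell containing the cluster receives the highest density, the outlier produces a smaller bump, and empty corners stay close to zero. A smaller `bandwidth` gives a spikier surface.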
Standard GIS Architectures
Distributed Analytics

      Queueing analysis tasks from disparate data sources
      for agents to run across distributed servers to collate
      back to the user as answers.
[Diagram labels: User, Request Queue, Agents, Distributed Servers, Disparate Data, Analysis]
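The request-queue-and-agents pattern can be sketched on a single machine, with threads standing in for agents on distributed servers. Everything here is a hypothetical illustration (the function name `run_agents`, the task format, and using `sum` as the analysis); it is not the architecture's actual code.

```python
import queue
import threading


def run_agents(tasks, analysis, num_agents=3):
    """Run `analysis` on every task using worker agents fed by one request queue."""
    requests = queue.Queue()
    results = {}
    lock = threading.Lock()

    def agent():
        while True:
            item = requests.get()
            if item is None:  # sentinel: no more work for this agent
                requests.task_done()
                return
            task_id, payload = item
            answer = analysis(payload)
            with lock:  # collate answers back into one place
                results[task_id] = answer
            requests.task_done()

    workers = [threading.Thread(target=agent) for _ in range(num_agents)]
    for worker in workers:
        worker.start()
    for item in tasks:
        requests.put(item)
    for _ in workers:
        requests.put(None)  # one sentinel per agent
    requests.join()
    for worker in workers:
        worker.join()
    return results


# Hypothetical tasks: (task id, list of numbers to summarise).
answers = run_agents([("a", [1, 2, 3]), ("b", [10, 20]), ("c", [])], sum)
```

Because agents pull work from the queue, adding agents increases throughput without changing how requests are submitted.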
(http://finder.geocommons.com/overlays/20148)

       1. Rasterize
       2. Kernel density calc
       3. Color map

[Diagram labels: User, Request Queue, Agent, Amazon EC2, Amazon S3]
Vector Density Mapping Demo
Data is Getting Big
‣   NYSE: 1 TB/day
‣   Facebook: 20+ TB compressed/day
‣   CERN/LHC: 40 TB/day (15 PB/year!)
‣   And growth is accelerating
‣   Need multiple machines, horizontal scalability
Hadoop
‣   Distributed file system (hard to store a PB)
‣   Fault-tolerant, handles replication, node failure, etc.
‣   MapReduce-based parallel computation (even harder to process a PB)
‣   Generic key-value based computation interface allows for wide applicability
‣   Open source, top-level Apache project
‣   Scalable: Yahoo! has a 4000-node cluster
‣   Powerful: sorted a TB of random integers in 62 seconds
MapReduce?
cat file | grep geo | sort | uniq -c > output
‣   Challenge: how many tweets per county, given tweets table?
‣   Input: key=row, value=tweet info
‣   Map: output key=county, value=1
‣   Shuffle: sort by county
‣   Reduce: for each county, sum
‣   Output: county, tweet count
‣   With 2x machines, runs close to 2x faster.
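The map, shuffle, and reduce steps for the tweets-per-county challenge can be sketched in plain Python on one machine. The function names and the three-row tweets table are hypothetical; a real Hadoop job would run the same phases in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit (county, 1) for every tweet row."""
    for _row_id, tweet in rows:
        yield tweet["county"], 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Reduce: sum the values for each county."""
    return [(county, sum(values)) for county, values in groups]


# Hypothetical tweets table: key=row id, value=tweet info.
tweets = [
    (1, {"county": "Monroe", "text": "hello"}),
    (2, {"county": "Kings", "text": "geo"}),
    (3, {"county": "Monroe", "text": "where 2.0"}),
]
counts = reduce_phase(shuffle_phase(map_phase(tweets)))
```

Each map call looks at one row and each reduce call looks at one county, so both phases can be split across machines.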
But...
‣   Analysis typically done in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins: lengthy, error-prone
‣   n-stage jobs: Hard to manage
‣   Prototyping/exploration requires compilation
‣   analytics in Eclipse? ur doin it wrong...
Enter Pig

            ‣   High level language
            ‣   Transformations on sets of records
            ‣   Process data one step at a time
            ‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Simplifies Analysis

‣   The Pig version is:
    ‣   5% of the code, written in 5% of the time
    ‣   Within 50% of the execution time
‣   Pig      Geo:

    ‣   Programmable: fuzzy matching, custom filtering
    ‣   Easily link multiple datasets, regardless of size/structure
    ‣   Iterative, quick
A Real Example

‣   Fire up your EMR.
    ‣   ... or follow along at http://bit.ly/whereanalytics
‣   Pete used Twitter’s streaming API to store some tweets
‣   Simplest thing: group by location and count with Pig
    ‣   http://bit.ly/where20pig


‣   Here comes some code!
tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_friends_count:int,
  user_statuses_count:int,
  user_location:chararray,
  user_lang:chararray,
  user_time_zone:chararray,
  place_id:chararray,
  ...);
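-- Drop tweets whose user_location is the literal string 'NULL'.
tweets_with_location = FILTER tweets BY user_location != 'NULL';
-- Lowercase locations so that case variants group together.
normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) AS user_location;
-- Group by location, using 10 reducers.
grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
-- Count the rows in each location group.
location_counts = FOREACH grouped_tweets GENERATE $0 AS location, SIZE($1) AS user_count;
-- Sort by descending count and write out the result.
sorted_counts = ORDER location_counts BY user_count DESC;
STORE sorted_counts INTO 'global_location_tweets';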
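For readers who do not know Pig, the same filter, normalise, group, count, and sort steps can be sketched on a small in-memory list in Python. The function name, the sample rows, and the alphabetical tie-break are assumptions added for illustration; the Pig script itself runs these steps across a cluster.

```python
from collections import Counter


def location_counts(tweets):
    """Filter, normalise, group, count, and sort, mirroring the Pig steps."""
    # FILTER: drop rows whose location is the literal string "NULL".
    with_location = (t for t in tweets if t["user_location"] != "NULL")
    # FOREACH ... LOWER: lowercase every location.
    normalized = (t["user_location"].lower() for t in with_location)
    # GROUP + SIZE: count rows per location.
    counts = Counter(normalized)
    # ORDER ... DESC: sort by count, breaking ties alphabetically (an added assumption).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample rows.
sample = [
    {"user_location": "Brasil"},
    {"user_location": "brasil"},
    {"user_location": "NULL"},
    {"user_location": "London"},
]
result = location_counts(sample)
```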
hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30

brasil           37985
indonesia        33777
brazil           22432
london           17294
usa              14564
são paulo        14238
new york         13420
tokyo            10967
singapore        10225
rio de janeiro   10135
los angeles      9934
california       9386
chicago          9155
uk               9095
jakarta          9086
germany          8741
canada           8201
                 7696
                 7121
jakarta, indonesia  6480
nyc              6456
new york, ny     6331
Neat, but...

 ‣   Wow, that data is messy!
     ‣   brasil, brazil at #1 and #3
     ‣   new york, nyc, and new york ny all in the top 30
 ‣   Pete to the rescue.
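One very simple way to reduce this messiness is to map known spelling variants onto a canonical name before summing. The sketch below uses counts from the output above, but the alias table is hypothetical and hand-written, and this is not necessarily the approach taken in the rest of the workshop.

```python
from collections import Counter

# Hypothetical alias table; a real one would need far more entries or a geocoder.
ALIASES = {
    "brasil": "brazil",
    "nyc": "new york",
    "new york, ny": "new york",
}


def merge_aliases(counts):
    """Sum counts for locations that map to the same canonical name."""
    merged = Counter()
    for location, count in counts:
        merged[ALIASES.get(location, location)] += count
    return merged.most_common()  # sorted by count, largest first


# Counts taken from the job output above.
raw = [
    ("brasil", 37985),
    ("brazil", 22432),
    ("new york", 13420),
    ("nyc", 6456),
    ("new york, ny", 6331),
]
merged = merge_aliases(raw)
```

A hand-maintained table does not scale to free-text locations in general, which is why fuzzy matching or geocoding is usually needed.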
Users by County
Lady Gaga
Tea Party
Dallas
Colbert
Questions?   Follow us at
             twitter.com/peteskomoroch
             twitter.com/kevinweil
             twitter.com/seangorman

Spatial Analytics, Where 2.0 2010

  • 1. Spatial Analytics Workshop Pete Skomoroch, LinkedIn (@peteskomoroch) Kevin Weil, Twitter (@kevinweil) Sean Gorman, FortiusOne (@seangorman) #spatialanalytics
  • 2. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 5. Spatial Analysis Analytical techniques to determine the spatial distribution of a variable, the relationship between the spatial distribution of variables, and the association of the variables in an area.
  • 7. Spatial Analysis Types 1. Spatial autocorrelation 2. Spatial interpolation 3. Spatial interaction 4. Simulation and modeling 5. Density mapping
  • 8. Spatial Autocorrelation Spatial autocorrelation statistics measure and analyze the degree of dependency among observations in a geographic space. First law of geography: “everything is related to everything else, but near things are more related than distant things.” -- Waldo Tobler
  • 9. Moran’s I - Per Capita Moran’s I - Random Variable Income in Monroe County Moran’s I = .012 Moran’s I = .66
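The Moran's I values on the slide above can be made concrete with a small sketch. It uses the standard global Moran's I formula; the four-area adjacency matrix and the sample values are hypothetical and are not the Monroe County data from the slide.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations over every pair of areas.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


# Hypothetical layout: four areas in a row, each adjacent to the next.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [1.0, 2.0, 3.0, 4.0]    # similar values sit next to each other
alternating = [1.0, 4.0, 1.0, 4.0]  # neighbours are as different as possible

print(morans_i(clustered, adjacency))
print(morans_i(alternating, adjacency))
```

Values near zero suggest no spatial pattern (as with the random variable on the slide), positive values suggest clustering, and negative values suggest that neighbours tend to differ.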
  • 10. Spatial Interpolation Spatial interpolation methods estimate the variables at unobserved locations in geographic space based on the values at observed locations.
  • 11. $14.00 Chicago $14.00 NYC $7.55 Henry Natural Gas Demand in Response to February 21, 2003 Alberta Clipper cold front
  • 12. $18.50 Chicago $30.00 NYC $16.00 Henry Natural Gas Demand in Response to February 24, 2003 Alberta Clipper cold front
  • 13. $20.00 Chicago $37.00 NYC $22.00 Henry Natural Gas Demand in Response to February 25, 2003 Alberta Clipper cold front
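The slides do not say which interpolation method produced the price surfaces. As one common option, here is a minimal sketch of inverse distance weighting (IDW). The function name, the `power` parameter, and the sample coordinates are assumptions made for illustration.

```python
import math


def idw_estimate(target, observations, power=2.0):
    """Inverse-distance-weighted estimate at `target`.

    `observations` is a list of (x, y, value) tuples at observed locations.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, value in observations:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            return value  # target coincides with an observed location
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical planar coordinates and values, not the gas prices on the slides.
observed = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0)]
print(idw_estimate((5.0, 0.0), observed))  # midway between the two points
print(idw_estimate((0.0, 0.0), observed))  # exactly on the first point
```

Nearer observations dominate the estimate, which is Tobler's first law applied directly as a weighting scheme.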
  • 14. Spatial Interaction Spatial interaction or “gravity models” estimate the flow of people, material, or information between locations in geographic space.
  • 15. Introduction ‣ Motivation ‣ Execution ‣ Prototype ‣ Service ‣ API ‣ Operations ‣ UX Global Oil Supply and Demand Gravity Model
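A gravity model in its simplest textbook form estimates the flow between two locations as proportional to the product of their "masses" (population, supply, demand) and inversely proportional to a power of the distance between them. The sketch below assumes that form; the constant `k`, the exponent `beta`, and the sample numbers are hypothetical and are not taken from the oil model on the slide.

```python
def gravity_flow(mass_origin, mass_destination, distance, k=1.0, beta=2.0):
    """Estimated flow between two locations under a basic gravity model.

    flow = k * mass_origin * mass_destination / distance ** beta
    """
    return k * mass_origin * mass_destination / distance ** beta


# Hypothetical masses and distances.
print(gravity_flow(100.0, 50.0, 10.0))
print(gravity_flow(100.0, 50.0, 20.0))  # doubling distance quarters the flow when beta=2
```

In practice `k` and `beta` would be fitted to observed flows rather than fixed in advance.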
  • 16. Simulation and Modeling Simple interactions among proximal entities can lead to intricate, persistent, and functional spatial entities at aggregate levels (complex adaptive systems).
  • 17. Spatial Interdependency Analysis of the San Francisco Failure Simulation
        Infrastructure | Total Number of Links | No. Links Congested | % Links Congested | % Volume Delay
        Refined Products (National) | 3,197 | 1 | 0.03% | 0.05%
        Refined Products (MSA) | 8 | 1 | 12.50% | 93%
        Power Grid (Regional) | 1,942 | 4 | 0% | N/A
        Power Grid (MSA) | 16 | 2 | 13% | N/A
  • 18. Density Mapping Calculating the proximity and frequency of a spatial phenomenon by creating a probabilistic surface.
  • 19. New York City Fiber Density Map
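A density surface of this kind can be sketched by evaluating a kernel at every cell of a raster grid. The example below assumes a Gaussian kernel with a fixed bandwidth; the function name, grid size, and point coordinates are hypothetical, and the slides do not state which kernel was actually used.

```python
import math


def kernel_density_grid(points, width, height, bandwidth=1.0):
    """Return a height x width grid of Gaussian kernel density values."""
    grid = [[0.0] * width for _ in range(height)]
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    for row in range(height):
        for col in range(width):
            # Evaluate the density at the centre of each grid cell.
            cx, cy = col + 0.5, row + 0.5
            total = 0.0
            for px, py in points:
                d2 = (cx - px) ** 2 + (cy - py) ** 2
                total += math.exp(-d2 / (2.0 * bandwidth ** 2))
            grid[row][col] = norm * total
    return grid


# Hypothetical point events: two at one location, one at another.
events = [(1.5, 1.5), (1.5, 1.5), (4.5, 4.5)]
surface = kernel_density_grid(events, width=6, height=6)
for line in surface:
    print(" ".join(f"{cell:.3f}" for cell in line))
```

The resulting grid can then be passed through a colour map to produce an image, with the bandwidth controlling how smooth the surface looks.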
  • 21. Distributed Analytics Queueing analysis tasks from disparate data sources for agents to run across distributed servers to collate back to the user as answers.
  • 22. Disparate Data Distributed Servers Agents User Request Queue Analysis
  • 23. (http://finder.geocommons.com/overlays/20148) 1. Rasterize 2. Kernel density calc 3. Color map Agent Amazon EC2 User Request Queue Amazon S3
  • 28. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 29. Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year!) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability
  • 30. Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant, handles replication, node failure, etc ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability ‣ Open source, top-level Apache project ‣ Scalable: Y! has a 4000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds
  • 31. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
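The map, shuffle, and reduce steps for the tweets-per-county challenge can be imitated on a single machine. This is only an illustration of the data flow, not Hadoop's API; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit (county, 1) for every tweet row."""
    for row in rows:
        yield row["county"], 1


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key and sort by county."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped):
    """Reduce: sum the ones for each county."""
    return [(county, sum(ones)) for county, ones in grouped]


# Hypothetical tweet rows; a real job would read these from HDFS.
tweets = [
    {"county": "Monroe", "text": "first"},
    {"county": "Cook", "text": "second"},
    {"county": "Monroe", "text": "third"},
]
tweet_counts = reduce_phase(shuffle_phase(map_phase(tweets)))
print(tweet_counts)
```

Because each map call looks at one row and each reduce call looks at one county, both phases can be spread across machines, which is where the near-linear speedup comes from.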
  • 38. But... ‣ Analysis typically done in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins: lengthy, error-prone ‣ n-stage jobs: hard to manage ‣ Prototyping/exploration requires compilation ‣ Analytics in Eclipse? ur doin it wrong...
  • 39. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • 40. Why Pig? ‣ Because I bet you can read the following script.
  • 41. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • 43. Pig Simplifies Analysis ‣ The Pig version is: ‣ 5% of the code, 5% of the time ‣ Within 50% of the execution time. ‣ Pig Geo: ‣ Programmable: fuzzy matching, custom filtering ‣ Easily link multiple datasets, regardless of size/structure ‣ Iterative, quick
  • 44. A Real Example ‣ Fire up your EMR. ‣ ... or follow along at http://bit.ly/whereanalytics ‣ Pete used Twitter’s streaming API to store some tweets ‣ Simplest thing: group by location and count with Pig ‣ http://bit.ly/where20pig ‣ Here comes some code!
  • 46. tweets = LOAD 's3://where20demo/sample-tweets' as ( user_screen_name:chararray, tweet_id:chararray, ... user_friends_count:int, user_statuses_count:int, user_location:chararray, user_lang:chararray, user_time_zone:chararray, place_id:chararray, ...);
  • 48. tweets_with_location = FILTER tweets BY user_location != 'NULL';
  • 49. normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;
  • 50. grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
  • 51. location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;
  • 52. sorted_counts = ORDER location_counts BY user_count DESC;
  • 53. STORE sorted_counts INTO 'global_location_tweets';
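For readers without a Pig installation, the same filter, normalize, group, count, and sort steps can be followed in plain Python. This is a single-machine sketch of the logic only; the function name and the sample records are hypothetical, and only the `user_location` field from the Pig schema is used.

```python
from collections import Counter


def count_locations(tweets):
    """Mirror the Pig script: filter, lowercase, group, count, sort."""
    # FILTER tweets BY user_location != 'NULL'
    with_location = (t for t in tweets if t["user_location"] != "NULL")
    # FOREACH ... GENERATE LOWER(user_location)
    normalized = (t["user_location"].lower() for t in with_location)
    # GROUP ... BY user_location, then take the size of each group
    counts = Counter(normalized)
    # ORDER ... BY user_count DESC
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical records standing in for the sample tweets.
sample = [
    {"user_location": "Brasil"},
    {"user_location": "brasil"},
    {"user_location": "NULL"},
    {"user_location": "London"},
]
print(count_locations(sample))
```

The Pig version expresses the same steps but runs them as MapReduce jobs across the cluster, so it keeps working when the input no longer fits on one machine.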
  • 54. hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30
        brasil 37985
        indonesia 33777
        brazil 22432
        london 17294
        usa 14564
        são paulo 14238
        new york 13420
        tokyo 10967
        singapore 10225
        rio de janeiro 10135
        los angeles 9934
        california 9386
        chicago 9155
        uk 9095
        jakarta 9086
        germany 8741
        canada 8201
        7696
        7121
        jakarta, indonesia 6480
        nyc 6456
        new york, ny 6331
  • 55. Neat, but... ‣ Wow, that data is messy! ‣ brasil, brazil at #1 and #3 ‣ new york, nyc, and new york ny all in the top 30 ‣ Pete to the rescue.
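One simple way to tidy counts like these is to fold known spelling variants into a canonical name before summing. The slides that follow are image-only, so this is not necessarily the approach Pete presented; the alias table and function name below are hypothetical, while the counts are the ones shown on slide 54.

```python
from collections import Counter

# Hypothetical alias table mapping variants to one canonical name.
ALIASES = {
    "brasil": "brazil",
    "nyc": "new york",
    "new york, ny": "new york",
}


def merge_aliases(location_counts, aliases=ALIASES):
    """Sum counts for locations that share a canonical name."""
    merged = Counter()
    for location, count in location_counts:
        merged[aliases.get(location, location)] += count
    return merged.most_common()


# Counts taken from the Pig output on slide 54.
raw_counts = [
    ("brasil", 37985),
    ("indonesia", 33777),
    ("brazil", 22432),
    ("new york", 13420),
    ("nyc", 6456),
    ("new york, ny", 6331),
]
print(merge_aliases(raw_counts))
```

A hand-written alias table does not scale to free-text profile locations, which is why fuzzy matching or a geocoder is usually needed for the long tail.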
  • 56. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 72. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 74. Questions? Follow us at twitter.com/peteskomoroch twitter.com/kevinweil twitter.com/seangorman

Editor's notes