SlideShare ist ein Scribd-Unternehmen logo
1 von 74
Spatial Analytics Workshop
Pete Skomoroch, LinkedIn (@peteskomoroch)
Kevin Weil, Twitter (@kevinweil)
Sean Gorman, FortiusOne (@seangorman)

#spatialanalytics
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Spatial Analysis

      Analytical techniques to determine the spatial
      distribution of a variable, the relationship between
      the spatial distribution of variables, and the
      association of the variables in an area.
Pattern Analysis
Spatial Analysis Types

     1. Spatial autocorrelation
     2. Spatial interpolation
     3. Spatial interaction
     4. Simulation and modeling
     5. Density mapping
Spatial Autocorrelation

      Spatial autocorrelation statistics measure and analyze
      the degree of dependency among observations in a
      geographic space.


      First law of geography: “everything is related to everything
      else, but near things are more related than distant things.”
        -- Waldo Tobler
Moran’s I - Per Capita
Moran’s I - Random Variable   Income in Monroe County




       Moran’s I = .012              Moran’s I = .66
Spatial Interpolation

      Spatial interpolation methods estimate the variables
      at unobserved locations in geographic space based
      on the values at observed locations.
$14.00
                                                   Chicago




                                                             $14.00
                                                              NYC



                                         $7.55
                                          Henry
Natural Gas Demand in Response to
February 21, 2003 Alberta Clipper cold
front
$18.50
                                                   Chicago




                                                             $30.00
                                                              NYC



                                         $16.00
                                          Henry
Natural Gas Demand in Response to
February 24, 2003 Alberta Clipper cold
front
$20.00
                                                   Chicago




                                                             $37.00
                                                              NYC



                                         $22.00
                                          Henry
Natural Gas Demand in Response to
February 25, 2003 Alberta Clipper cold
front
Spatial Interaction

      Spatial interaction or “gravity models” estimate
      the flow of people, material, or information
      between locations in geographic space.
Introduction
‣   Motiviation
‣   Execution
‣   Prototype
‣   Service
‣   API
‣   Operations
‣   UX

                  Global Oil Supply and Demand Gravity
                                  Model
Simulation and Modeling

      Simple interactions among proximal entities can
      lead to intricate, persistent, and functional spatial
      entities at aggregate levels (complex adaptive
      systems).
Spatial Interdependency Analysis of
                                                                            the San Francisco Failure Simulation




                        Total Number of   No. Links   % Links     %Volume
Infrastructure          Links             Congested   Congested   Delay
Refined Products
(National)
                             3,197              1       0.03%       0.05%
Refined Products
(MSA)                                                   12.50%
                              8                 1                    93%


Power Grid (Regional)        1,942              4        0%          N/A


Power Grid (MSA)              16                2        13%         N/A
Density Mapping

     Calculating the proximity and frequency of a
     spatial phenomenon by creating a probabilistic
     surface.
New York City Fiber Density Map
Standard GIS Architectures
Distributed Analytics

      Queueing analysis tasks from disparate data sources
      for agents to run across distributed servers to collate
      back to the user as answers.
Disparate Data




                                               Distributed Servers
                                      Agents
 User
                 Request Queue

                           Analysis
(http://finder.geocommons.com/overlays/20148)




       1. Rasterize
       2. Kernel
          density calc
       3. Color map              Agent
                                               Amazon EC2
User
       Request Queue



                    Amazon S3
Vector Density Mapping Demo
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Data is Getting Big
‣   NYSE: 1 TB/day
‣   Facebook: 20+ TB
    compressed/day
‣   CERN/LHC: 40 TB/day (15
    PB/year!)
‣   And growth is accelerating
‣   Need multiple machines,
    horizontal scalability
Hadoop
‣   Distributed file system (hard to store a PB)
‣   Fault-tolerant, handles replication, node failure, etc
‣   MapReduce-based parallel computation
    (even harder to process a PB)
‣   Generic key-value based computation interface
    allows for wide applicability
‣   Open source, top-level Apache project
‣   Scalable: Y! has a 4000-node cluster
‣   Powerful: sorted a TB of random integers in 62 seconds
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                                                    county, given tweets table?
                                                ‣   Input: key=row, value=tweet info
                                                ‣   Map: output key=county, value=1
                                                ‣   Shuffle: sort by county
                                                ‣   Reduce: for each county, sum
                                                ‣   Output: county, tweet count
                                                ‣   With 2x machines, runs close to
                                                    2x faster.
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                                                    county, given tweets table?
                                                ‣   Input: key=row, value=tweet info
                                                ‣   Map: output key=county, value=1
                                                ‣   Shuffle: sort by county
                                                ‣   Reduce: for each county, sum
                                                ‣   Output: county, tweet count
                                                ‣   With 2x machines, runs close to
                                                    2x faster.
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                                                    county, given tweets table?
                                                ‣   Input: key=row, value=tweet info
                                                ‣   Map: output key=county, value=1
                                                ‣   Shuffle: sort by county
                                                ‣   Reduce: for each county, sum
                                                ‣   Output: county, tweet count
                                                ‣   With 2x machines, runs close to
                                                    2x faster.
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                                                    county, given tweets table?
                                                ‣   Input: key=row, value=tweet info
                                                ‣   Map: output key=county, value=1
                                                ‣   Shuffle: sort by county
                                                ‣   Reduce: for each county, sum
                                                ‣   Output: county, tweet count
                                                ‣   With 2x machines, runs close to
                                                    2x faster.
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                                                    county, given tweets table?
                                                ‣   Input: key=row, value=tweet info
                                                ‣   Map: output key=county, value=1
                                                ‣   Shuffle: sort by county
                                                ‣   Reduce: for each county, sum
                                                ‣   Output: county, tweet count
                                                ‣   With 2x machines, runs close to
                                                    2x faster.
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                                                    county, given tweets table?
                                                ‣   Input: key=row, value=tweet info
                                                ‣   Map: output key=county, value=1
                                                ‣   Shuffle: sort by county
                                                ‣   Reduce: for each county, sum
                                                ‣   Output: county, tweet count
                                                ‣   With 2x machines, runs close to
                                                    2x faster.
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                                                    county, given tweets table?
                                                ‣   Input: key=row, value=tweet info
                                                ‣   Map: output key=county, value=1
                                                ‣   Shuffle: sort by county
                                                ‣   Reduce: for each county, sum
                                                ‣   Output: county, tweet count
                                                ‣   With 2x machines, runs close
                                                    to 2x faster.
But...
‣   Analysis typically done in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins: lengthy, error-prone
‣   n-stage jobs: Hard to manage
‣   Prototyping/exploration requires             ‣   analytics in Eclipse?
    compilation                                      ur doin it wrong...
Enter Pig

            ‣   High level language
            ‣   Transformations on sets of records
            ‣   Process data one step at a time
            ‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Simplifies Analysis

‣   The Pig version is:
‣        5% of the code, 5% of the time
‣        Within 50% of the execution time.
‣   Pig      Geo:

    ‣   Programmable: fuzzy matching, custom filtering
    ‣   Easily link multiple datasets, regardless of size/structure
    ‣   Iterative, quick
A Real Example

‣   Fire up your EMR.
    ‣   ... or follow along at http://bit.ly/whereanalytics
‣   Pete used Twitter’s streaming API to store some tweets
‣   Simplest thing: group by location and count with Pig
    ‣   http://bit.ly/where20pig


‣   Here comes some code!
tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_friends_count:int,
  user_statuses_count:int,
  user_location:chararray,
  user_lang:chararray,
  user_time_zone:chararray,
  place_id:chararray,
  ...);
tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_friends_count:int,
  user_statuses_count:int,
  user_location:chararray,
  user_lang:chararray,
  user_time_zone:chararray,
  place_id:chararray,
  ...);
tweets_with_location = FILTER tweets BY user_location !=
'NULL';
normalized_locations = FOREACH tweets_with_location
GENERATE LOWER(user_location) as user_location;
grouped_tweets = GROUP normalized_locations BY
user_location PARALLEL 10;
location_counts = FOREACH grouped_tweets GENERATE $0 as
location, SIZE($1) as user_count;
sorted_counts = ORDER location_counts BY user_count DESC;
STORE sorted_counts INTO 'global_location_tweets';
hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30

brasil           37985
indonesia        33777
brazil           22432
london           17294
usa              14564
são paulo        14238
new york         13420
tokyo            10967
singapore        10225
rio de janeiro   10135
los angeles      9934
california       9386
chicago          9155
uk               9095
jakarta          9086
germany          8741
canada           8201
                 7696
                 7121
jakarta, indonesia  6480
nyc              6456
new york, ny     6331
Neat, but...

 ‣   Wow, that data is messy!
     ‣   brasil, brazil at #1 and #3
     ‣   new york, nyc, and new york ny all in the top 30
 ‣   Pete to the rescue.
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Users by County
Lady Gaga
Tea Party
Dallas
Colbert
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Questions?   Follow us at
             twitter.com/peteskomoroch
             twitter.com/kevinweil
             twitter.com/seangorman

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Workshop on Storm Water Modeling Approaches
Workshop on Storm Water Modeling ApproachesWorkshop on Storm Water Modeling Approaches
Workshop on Storm Water Modeling Approaches
 
Quantitative techniques in geography
Quantitative techniques in geographyQuantitative techniques in geography
Quantitative techniques in geography
 
Spatial vs non spatial
Spatial vs non spatialSpatial vs non spatial
Spatial vs non spatial
 
GIS data analysis
GIS data analysisGIS data analysis
GIS data analysis
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
 
Manual of Remote Sensing
Manual of Remote SensingManual of Remote Sensing
Manual of Remote Sensing
 
GIS data structure
GIS data structureGIS data structure
GIS data structure
 
Spatial Autocorrelation
Spatial AutocorrelationSpatial Autocorrelation
Spatial Autocorrelation
 
Data models in geographical information system(GIS)
Data models in geographical information system(GIS)Data models in geographical information system(GIS)
Data models in geographical information system(GIS)
 
Scope and content of population geography
Scope and content of population geographyScope and content of population geography
Scope and content of population geography
 
Digital elevation model in GIS
Digital elevation model in GISDigital elevation model in GIS
Digital elevation model in GIS
 
Vector data model
Vector data modelVector data model
Vector data model
 
Pre processing of raw rs data
Pre processing of raw rs dataPre processing of raw rs data
Pre processing of raw rs data
 
History of cartography
History of cartographyHistory of cartography
History of cartography
 
Image Classification Techniques in GIS
Image Classification Techniques in GISImage Classification Techniques in GIS
Image Classification Techniques in GIS
 
The Paradigms of Geography
 The Paradigms of Geography The Paradigms of Geography
The Paradigms of Geography
 
GIS - Topology
GIS - Topology GIS - Topology
GIS - Topology
 
Land cover and Land Use
Land cover and Land UseLand cover and Land Use
Land cover and Land Use
 
Spatial data for GIS
Spatial data for GISSpatial data for GIS
Spatial data for GIS
 
Remote sensing & gis
Remote sensing & gisRemote sensing & gis
Remote sensing & gis
 

Andere mochten auch

Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Kevin Weil
 
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Kevin Weil
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
The Asset Consultancy_PPT _final
The Asset Consultancy_PPT _finalThe Asset Consultancy_PPT _final
The Asset Consultancy_PPT _finalRushin Naik
 
Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011ZoeMM
 
DataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - DevnestDataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - DevnestOllie Parsley
 
Tweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion miningTweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion miningSngular Meaning
 
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Hortonworks
 
Demo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product designDemo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product designChristine Outram
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongFastly
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutesdwmclary
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdKevin Weil
 
Twitter as a data mining source
Twitter  as  a data mining sourceTwitter  as  a data mining source
Twitter as a data mining sourceAtaxo Group
 
Social media data for Social science research
Social media data for Social science researchSocial media data for Social science research
Social media data for Social science researchDavide Bennato
 
Spatial data analysis 1
Spatial data analysis 1Spatial data analysis 1
Spatial data analysis 1Johan Blomme
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Kevin Weil
 
Data Mining on Twitter
Data Mining on TwitterData Mining on Twitter
Data Mining on TwitterPulkit Goyal
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingRapheephan Thongkham-Uan
 
Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveIMC Institute
 

Andere mochten auch (20)

Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010
 
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
The Asset Consultancy_PPT _final
The Asset Consultancy_PPT _finalThe Asset Consultancy_PPT _final
The Asset Consultancy_PPT _final
 
Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011
 
DataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - DevnestDataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - Devnest
 
Tweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion miningTweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion mining
 
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
 
Demo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product designDemo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product design
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrong
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
 
Twitter as a data mining source
Twitter  as  a data mining sourceTwitter  as  a data mining source
Twitter as a data mining source
 
Social media data for Social science research
Social media data for Social science researchSocial media data for Social science research
Social media data for Social science research
 
PPT FOR BIG
PPT FOR BIGPPT FOR BIG
PPT FOR BIG
 
Spatial data analysis 1
Spatial data analysis 1Spatial data analysis 1
Spatial data analysis 1
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
 
Data Mining on Twitter
Data Mining on TwitterData Mining on Twitter
Data Mining on Twitter
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in Manufacturing
 
Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and Hive
 

Ähnlich wie Spatial Analytics, Where 2.0 2010

Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Peter Skomoroch
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)Kevin Weil
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsRajarshi Guha
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Kevin Weil
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
2009 Ohio River Basin Landcover Comparison
2009 Ohio River Basin Landcover Comparison2009 Ohio River Basin Landcover Comparison
2009 Ohio River Basin Landcover ComparisonBZjoe
 
Large-scale computation without sacrificing expressiveness
Large-scale computation without sacrificing expressivenessLarge-scale computation without sacrificing expressiveness
Large-scale computation without sacrificing expressivenessSangjin Han
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012Steven Francia
 
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)Alex Kozlov
 
Optical networks, light paths and GRID computing
Optical networks, light paths and GRID computingOptical networks, light paths and GRID computing
Optical networks, light paths and GRID computingKeith Russell
 
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...Michail Argyriou
 
Playful Explorations of Public and Personal Data - OSCON Data 2011
Playful Explorations of Public and Personal Data - OSCON Data 2011Playful Explorations of Public and Personal Data - OSCON Data 2011
Playful Explorations of Public and Personal Data - OSCON Data 2011Andrew Turner
 
Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...Integrated Carbon Observation System (ICOS)
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013MLconf
 
Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...
Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...
Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...J. Kevin Byrne
 

Ähnlich wie Spatial Analytics, Where 2.0 2010 (20)

Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of Cheminformatics
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
2009 Ohio River Basin Landcover Comparison
2009 Ohio River Basin Landcover Comparison2009 Ohio River Basin Landcover Comparison
2009 Ohio River Basin Landcover Comparison
 
Large-scale computation without sacrificing expressiveness
Large-scale computation without sacrificing expressivenessLarge-scale computation without sacrificing expressiveness
Large-scale computation without sacrificing expressiveness
 
Shuronr
ShuronrShuronr
Shuronr
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
 
MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012MongoDB, Hadoop and humongous data - MongoSV 2012
MongoDB, Hadoop and humongous data - MongoSV 2012
 
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
 
Optical networks, light paths and GRID computing
Optical networks, light paths and GRID computingOptical networks, light paths and GRID computing
Optical networks, light paths and GRID computing
 
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
Branch and-bound nearest neighbor searching over unbalanced trie-structured o...
 
Playful Explorations of Public and Personal Data - OSCON Data 2011
Playful Explorations of Public and Personal Data - OSCON Data 2011Playful Explorations of Public and Personal Data - OSCON Data 2011
Playful Explorations of Public and Personal Data - OSCON Data 2011
 
Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
Using MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image AnalysisUsing MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image Analysis
 
Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...
Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...
Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...
 

Kürzlich hochgeladen

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Kürzlich hochgeladen (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Spatial Analytics, Where 2.0 2010

  • 1. Spatial Analytics Workshop Pete Skomoroch, LinkedIn (@peteskomoroch) Kevin Weil, Twitter (@kevinweil) Sean Gorman, FortiusOne (@seangorman) #spatialanalytics
  • 2. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 3. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 4. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 5. Spatial Analysis Analytical techniques to determine the spatial distribution of a variable, the relationship between the spatial distribution of variables, and the association of the variables in an area.
  • 7. Spatial Analysis Types 1. Spatial autocorrelation 2. Spatial interpolation 3. Spatial interaction 4. Simulation and modeling 5. Density mapping
  • 8. Spatial Autocorrelation Spatial autocorrelation statistics measure and analyze the degree of dependency among observations in a geographic space. First law of geography: “everything is related to everything else, but near things are more related than distant things.” -- Waldo Tobler
  • 9. Moran’s I - Per Capita Moran’s I - Random Variable Income in Monroe County Moran’s I = .012 Moran’s I = .66
  • 10. Spatial Interpolation Spatial interpolation methods estimate the variables at unobserved locations in geographic space based on the values at observed locations.
  • 11. $14.00 Chicago $14.00 NYC $7.55 Henry Natural Gas Demand in Response to February 21, 2003 Alberta Clipper cold front
  • 12. $18.50 Chicago $30.00 NYC $16.00 Henry Natural Gas Demand in Response to February 24, 2003 Alberta Clipper cold front
  • 13. $20.00 Chicago $37.00 NYC $22.00 Henry Natural Gas Demand in Response to February 25, 2003 Alberta Clipper cold front
  • 14. Spatial Interaction Spatial interaction or “gravity models” estimate the flow of people, material, or information between locations in geographic space.
  • 15. Introduction ‣ Motiviation ‣ Execution ‣ Prototype ‣ Service ‣ API ‣ Operations ‣ UX Global Oil Supply and Demand Gravity Model
  • 16. Simulation and Modeling Simple interactions among proximal entities can lead to intricate, persistent, and functional spatial entities at aggregate levels (complex adaptive systems).
  • 17. Spatial Interdependency Analysis of the San Francisco Failure Simulation Total Number of No. Links % Links %Volume Infrastructure Links Congested Congested Delay Refined Products (National) 3,197 1 0.03% 0.05% Refined Products (MSA) 12.50% 8 1 93% Power Grid (Regional) 1,942 4 0% N/A Power Grid (MSA) 16 2 13% N/A
  • 18. Density Mapping Calculating the proximity and frequency of a spatial phenomenon by creating a probabilistic surface.
  • 19. New York City Fiber Density Map
  • 21. Distributed Analytics Queueing analysis tasks from disparate data sources for agents to run across distributed servers to collate back to the user as answers.
  • 22. Disparate Data Distributed Servers Agents User Request Queue Analysis
  • 23. (http://finder.geocommons.com/overlays/20148) 1. Rasterize 2. Kernel density calc 3. Color map Agent Amazon EC2 User Request Queue Amazon S3
  • 25.
  • 26.
  • 27.
  • 28. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 29. Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year!) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability
  • 30. Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant, handles replication, node failure, etc ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability ‣ Open source, top-level Apache project ‣ Scalable: Y! has a 4000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds
  • 31. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 32. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 33. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 34. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 35. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 36. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 37. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  • 38. But... ‣ Analysis typically done in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins: lengthy, error-prone ‣ n-stage jobs: Hard to manage ‣ Prototyping/exploration requires ‣ analytics in Eclipse? compilation ur doin it wrong...
  • 39. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • 40. Why Pig? ‣ Because I bet you can read the following script.
  • 41. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • 43. Pig Simplifies Analysis ‣ The Pig version is: ‣ 5% of the code, 5% of the time ‣ Within 50% of the execution time. ‣ Pig Geo: ‣ Programmable: fuzzy matching, custom filtering ‣ Easily link multiple datasets, regardless of size/structure ‣ Iterative, quick
  • 44. A Real Example ‣ Fire up your EMR. ‣ ... or follow along at http://bit.ly/whereanalytics ‣ Pete used Twitter’s streaming API to store some tweets ‣ Simplest thing: group by location and count with Pig ‣ http://bit.ly/where20pig ‣ Here comes some code!
  • 45.
  • 46. tweets = LOAD 's3://where20demo/sample-tweets' as ( user_screen_name:chararray, tweet_id:chararray, ... user_friends_count:int, user_statuses_count:int, user_location:chararray, user_lang:chararray, user_time_zone:chararray, place_id:chararray, ...);
  • 47. tweets = LOAD 's3://where20demo/sample-tweets' as ( user_screen_name:chararray, tweet_id:chararray, ... user_friends_count:int, user_statuses_count:int, user_location:chararray, user_lang:chararray, user_time_zone:chararray, place_id:chararray, ...);
  • 48. tweets_with_location = FILTER tweets BY user_location != 'NULL';
  • 49. normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;
  • 50. grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
  • 51. location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;
  • 52. sorted_counts = ORDER location_counts BY user_count DESC;
  • 53. STORE sorted_counts INTO 'global_location_tweets';
  • 54. hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30 brasil 37985 indonesia 33777 brazil 22432 london 17294 usa 14564 são paulo 14238 new york 13420 tokyo 10967 singapore 10225 rio de janeiro 10135 los angeles 9934 california 9386 chicago 9155 uk 9095 jakarta 9086 germany 8741 canada 8201 7696 7121 jakarta, indonesia 6480 nyc 6456 new york, ny 6331
  • 55. Neat, but... ‣ Wow, that data is messy! ‣ brasil, brazil at #1 and #3 ‣ new york, nyc, and new york ny all in the top 30 ‣ Pete to the rescue.
  • 56. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.
  • 64.
  • 65.
  • 66.
  • 72. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 73. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 74. Questions? Follow us at twitter.com/peteskomoroch twitter.com/kevinweil twitter.com/seangorman

Hinweis der Redaktion