Spatial Analytics Workshop
Pete Skomoroch, LinkedIn (@peteskomoroch)
Kevin Weil, Twitter (@kevinweil)
Sean Gorman, FortiusOne (@seangorman)

#spatialanalytics
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing the Two Together
‣   Conclusion
‣   Q&A
Spatial Analysis

      Analytical techniques to determine the spatial
      distribution of a variable, the relationship between
      the spatial distribution of variables, and the
      association of the variables in an area.
Pattern Analysis
Spatial Analysis Types

     1. Spatial autocorrelation
     2. Spatial interpolation
     3. Spatial interaction
     4. Simulation and modeling
     5. Density mapping
Spatial Autocorrelation

      Spatial autocorrelation statistics measure and analyze
      the degree of dependency among observations in a
      geographic space.


      First law of geography: “everything is related to everything
      else, but near things are more related than distant things.”
        -- Waldo Tobler
[Figure: two maps comparing Moran’s I]
‣   Random Variable: Moran’s I = .012
‣   Per Capita Income in Monroe County: Moran’s I = .66
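Moran’s I is commonly defined as I = (n / W) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2), where z_i is the deviation of observation i from the mean and W is the sum of all weights. The following is a minimal pure-Python sketch of that definition. The binary adjacency weights and the four-cell toy data are assumptions for illustration only; they are not the Monroe County data.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean
    total_weight = sum(sum(row) for row in weights)
    # Cross-products of deviations for every pair of neighboring observations.
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in z)
    return (n / total_weight) * (cross / variance_sum)


# Hypothetical example: four cells in a row, each adjacent to its neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 2, 3, 4], chain)    # similar values sit side by side
alternating = morans_i([1, 0, 1, 0], chain)  # neighbors always differ
```

Values near zero suggest no spatial pattern, positive values suggest clustering of similar values, and negative values suggest that neighbors tend to differ.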
Spatial Interpolation

      Spatial interpolation methods estimate the variables
      at unobserved locations in geographic space based
      on the values at observed locations.
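One common interpolation method is inverse distance weighting (IDW); the slides do not say which method was used, so treat this as an illustrative stand-in. The sketch estimates a value at an unobserved point as a weighted average of observed values, with weights falling off as 1 / distance^power. The sample coordinates and the default `power` are assumptions.

```python
import math


def idw(known, target, power=2.0):
    """Estimate a value at `target` from (x, y, value) samples by inverse distance weighting."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in known:
        dist = math.hypot(x - target[0], y - target[1])
        if dist == 0.0:
            return value  # target coincides with an observation
        w = 1.0 / dist ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Hypothetical observations: (x, y, value).
samples = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0)]
midpoint = idw(samples, (5.0, 0.0))   # equidistant, so the plain average
at_sample = idw(samples, (0.0, 0.0))  # exact hit returns the observed value
```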
Natural Gas Demand in Response to February 21, 2003 Alberta Clipper cold front
‣   Chicago: $14.00
‣   NYC: $14.00
‣   Henry: $7.55
Natural Gas Demand in Response to February 24, 2003 Alberta Clipper cold front
‣   Chicago: $18.50
‣   NYC: $30.00
‣   Henry: $16.00
Natural Gas Demand in Response to February 25, 2003 Alberta Clipper cold front
‣   Chicago: $20.00
‣   NYC: $37.00
‣   Henry: $22.00
Spatial Interaction

      Spatial interaction or “gravity models” estimate
      the flow of people, material, or information
      between locations in geographic space.
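In its simplest textbook form, a gravity model predicts the flow between two places as k * mass_i * mass_j / distance^beta. The sketch below applies that form to a few made-up places. The constant `k`, the exponent `beta`, the populations, and the distances are all assumptions; the slides do not give the parameters behind the oil supply and demand model.

```python
def gravity_flow(mass_origin, mass_dest, distance, k=1.0, beta=2.0):
    """Predicted flow between two places under a basic gravity model."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return k * mass_origin * mass_dest / distance ** beta


# Hypothetical places with populations as "mass" and pairwise distances.
populations = {"A": 100.0, "B": 200.0, "C": 50.0}
distances = {("A", "B"): 10.0, ("A", "C"): 5.0, ("B", "C"): 20.0}

flows = {
    pair: gravity_flow(populations[pair[0]], populations[pair[1]], d)
    for pair, d in distances.items()
}
```

Here the small but nearby place C attracts as much flow from A as the larger, more distant place B.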
Global Oil Supply and Demand Gravity Model
Simulation and Modeling

      Simple interactions among proximal entities can
      lead to intricate, persistent, and functional spatial
      entities at aggregate levels (complex adaptive
      systems).
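A classic toy illustration of this idea is Conway's Game of Life, where each cell looks only at its eight neighbors yet stable and oscillating structures emerge. It is not taken from the slides and is far simpler than an infrastructure failure simulation; the sketch below is only meant to make "simple local rules, persistent aggregate structure" concrete.

```python
from collections import Counter


def step(live):
    """Advance a set of live (x, y) cells by one Game of Life generation."""
    # Count, for every cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next turn with 3 neighbors, or with 2 if already alive.
    return {
        cell
        for cell, count in neighbor_counts.items()
        if count == 3 or (count == 2 and cell in live)
    }


# A "blinker": three cells in a row that oscillate with period 2.
blinker = {(0, 1), (1, 1), (2, 1)}
after_one = step(blinker)
after_two = step(after_one)
```

After one step the row becomes a column, and after two steps the original pattern returns, so the structure persists indefinitely.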
Spatial Interdependency Analysis of the San Francisco Failure Simulation

Infrastructure                 Total Number of Links   No. Links Congested   % Links Congested   % Volume Delay
Refined Products (National)    3,197                   1                     0.03%               0.05%
Refined Products (MSA)         8                       1                     12.50%              93%
Power Grid (Regional)          1,942                   4                     0%                  N/A
Power Grid (MSA)               16                      2                     13%                 N/A
Density Mapping

     Calculating the proximity and frequency of a
     spatial phenomenon by creating a probabilistic
     surface.
New York City Fiber Density Map
Standard GIS Architectures
Distributed Analytics

      Queueing analysis tasks from disparate data sources
      for agents to run across distributed servers to collate
      back to the user as answers.
[Diagram components: User, Request Queue, Agents, Distributed Servers, Disparate Data, Analysis]
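The request-queue pattern described above can be sketched on a single machine with Python's standard `queue` and `threading` modules. In this sketch, threads stand in for agents on distributed servers, and the task tuple format `(task_id, func, arg)` is an assumption made for illustration; the slides do not describe the actual queueing system.

```python
import queue
import threading


def run_agents(tasks, num_agents=3):
    """Run (task_id, func, arg) tasks on worker threads and collect the results."""
    requests = queue.Queue()
    results = {}
    lock = threading.Lock()

    def agent():
        while True:
            item = requests.get()
            if item is None:  # sentinel: no more work for this agent
                break
            task_id, func, arg = item
            answer = func(arg)
            with lock:
                results[task_id] = answer

    workers = [threading.Thread(target=agent) for _ in range(num_agents)]
    for w in workers:
        w.start()
    for task in tasks:
        requests.put(task)
    for _ in workers:
        requests.put(None)  # one sentinel per agent
    for w in workers:
        w.join()
    return results


# Hypothetical analysis requests: a sum and a max over small datasets.
answers = run_agents([
    ("total", sum, [1, 2, 3]),
    ("peak", max, [4, 9, 2]),
])
```

Results are keyed by task id, so they can be collated for the user regardless of which agent finished first.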
(http://finder.geocommons.com/overlays/20148)

[Diagram components: User, Request Queue, Agent, Amazon EC2, Amazon S3]

Agent steps:
     1. Rasterize
     2. Kernel density calc
     3. Color map
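The three agent steps (rasterize, kernel density calculation, color map) can be sketched in plain Python as follows. The grid size, the truncated Gaussian kernel, its `bandwidth` and `radius`, and the text palette standing in for colors are all assumptions; the slides do not specify the kernel or color scheme used in the demo.

```python
import math


def rasterize(points, width, height):
    """Step 1: count points per grid cell (points are (col, row) in grid units)."""
    grid = [[0.0] * width for _ in range(height)]
    for x, y in points:
        col, row = int(x), int(y)
        if 0 <= col < width and 0 <= row < height:
            grid[row][col] += 1.0
    return grid


def kernel_density(grid, bandwidth=1.0, radius=2):
    """Step 2: smooth the counts with a truncated Gaussian kernel."""
    height, width = len(grid), len(grid[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if grid[r][c] == 0.0:
                continue
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        weight = math.exp(-(dr * dr + dc * dc) / (2 * bandwidth ** 2))
                        out[rr][cc] += grid[r][c] * weight
    return out


def color_map(density, palette=" .:*#"):
    """Step 3: map each cell to a palette entry, scaled by the maximum density."""
    peak = max(max(row) for row in density) or 1.0
    last = len(palette) - 1
    return ["".join(palette[round(v / peak * last)] for v in row) for row in density]


# Hypothetical points clustered near the center of a 7 x 5 grid.
points = [(3.2, 2.1), (3.7, 2.9), (2.5, 2.5), (6.1, 0.3)]
counts = rasterize(points, 7, 5)
density = kernel_density(counts)
rendered = color_map(density)
```

Printing the strings in `rendered` one per line gives a rough text rendering of the density surface, with the densest cell drawn as `#`.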
Vector Density Mapping Demo
Data is Getting Big
‣   NYSE: 1 TB/day
‣   Facebook: 20+ TB compressed/day
‣   CERN/LHC: 40 TB/day (15 PB/year!)
‣   And growth is accelerating
‣   Need multiple machines, horizontal scalability
Hadoop
‣   Distributed file system (hard to store a PB)
‣   Fault-tolerant: handles replication, node failure, etc.
‣   MapReduce-based parallel computation (even harder to process a PB)
‣   Generic key-value based computation interface allows for wide applicability
‣   Open source, top-level Apache project
‣   Scalable: Yahoo! has a 4000-node cluster
‣   Powerful: sorted a TB of random integers in 62 seconds
MapReduce?
cat file | grep geo | sort | uniq -c > output
‣   Challenge: how many tweets per county, given tweets table?
‣   Input: key=row, value=tweet info
‣   Map: output key=county, value=1
‣   Shuffle: sort by county
‣   Reduce: for each county, sum
‣   Output: county, tweet count
‣   With 2x machines, runs close to 2x faster.
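The map, shuffle, and reduce steps for the tweets-per-county challenge can be simulated in a single Python process. This is only a sketch of the data flow, not Hadoop code: there is no distribution or fault tolerance here, and the `tweets` rows and the `county` field name are assumptions.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit (county, 1) for every tweet row."""
    for row in rows:
        yield row["county"], 1


def shuffle_phase(pairs):
    """Shuffle: bring together all values that share a key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: sum the values for each county."""
    return [(county, sum(values)) for county, values in grouped]


# Hypothetical tweets table; the field names are assumptions.
tweets = [
    {"id": 1, "county": "Monroe"},
    {"id": 2, "county": "Kings"},
    {"id": 3, "county": "Monroe"},
]
counts = reduce_phase(shuffle_phase(map_phase(tweets)))
```

Because each map call looks at one row and each reduce call looks at one key, both phases can be split across machines, which is where the near-linear speedup comes from.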
But...
‣   Analysis typically done in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins: lengthy, error-prone
‣   n-stage jobs: Hard to manage
‣   Prototyping/exploration requires compilation
‣   analytics in Eclipse? ur doin it wrong...
Enter Pig

            ‣   High level language
            ‣   Transformations on sets of records
            ‣   Process data one step at a time
            ‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Simplifies Analysis

‣   The Pig version is:
    ‣   5% of the code, 5% of the development time
    ‣   Within 50% of the execution time
‣   Pig      Geo:

    ‣   Programmable: fuzzy matching, custom filtering
    ‣   Easily link multiple datasets, regardless of size/structure
    ‣   Iterative, quick
A Real Example

‣   Fire up your EMR.
    ‣   ... or follow along at http://bit.ly/whereanalytics
‣   Pete used Twitter’s streaming API to store some tweets
‣   Simplest thing: group by location and count with Pig
    ‣   http://bit.ly/where20pig


‣   Here comes some code!
tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_friends_count:int,
  user_statuses_count:int,
  user_location:chararray,
  user_lang:chararray,
  user_time_zone:chararray,
  place_id:chararray,
  ...);
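-- Drop tweets whose location field is the literal string 'NULL'.
tweets_with_location = FILTER tweets BY user_location != 'NULL';
-- Lowercase locations so that case variants group together.
normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;
-- Group by location and count the rows in each group.
grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;
-- Sort by count, largest first, and write to the path read by the next command.
sorted_counts = ORDER location_counts BY user_count DESC;
STORE sorted_counts INTO '/global_location_counts';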
hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30

brasil           37985
indonesia        33777
brazil           22432
london           17294
usa              14564
são paulo        14238
new york         13420
tokyo            10967
singapore        10225
rio de janeiro   10135
los angeles      9934
california       9386
chicago          9155
uk               9095
jakarta          9086
germany          8741
canada           8201
                 7696
                 7121
jakarta, indonesia  6480
nyc              6456
new york, ny     6331
Neat, but...

 ‣   Wow, that data is messy!
     ‣   brasil, brazil at #1 and #3
     ‣   new york, nyc, and new york ny all in the top 30
 ‣   Pete to the rescue.
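One simple way to reduce this messiness is an alias table that maps spelling variants to a canonical name before summing. The sketch below illustrates that idea using counts from the output above. It is not necessarily the approach used in the talk, and the `ALIASES` table is a hypothetical, hand-written example.

```python
from collections import Counter

# Hypothetical alias table; a real one would need far more entries.
ALIASES = {
    "brasil": "brazil",
    "nyc": "new york",
    "new york, ny": "new york",
}


def merge_counts(rows, aliases=ALIASES):
    """Sum counts after mapping each location string to a canonical name."""
    merged = Counter()
    for location, count in rows:
        key = location.strip().lower()
        merged[aliases.get(key, key)] += count
    return merged


# Counts copied from the Pig output shown earlier.
rows = [
    ("brasil", 37985),
    ("brazil", 22432),
    ("new york", 13420),
    ("nyc", 6456),
    ("new york, ny", 6331),
]
merged = merge_counts(rows)
```

With the variants merged, Brazil totals 60,417 and New York totals 26,207, which changes the ranking considerably. A hand-written table does not scale to free-text locations, so fuzzy matching or geocoding would be needed in practice.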
Users by County
Lady Gaga
Tea Party
Dallas
Colbert
Questions?   Follow us at
             twitter.com/peteskomoroch
             twitter.com/kevinweil
             twitter.com/seangorman

 
Playful Explorations of Public and Personal Data - OSCON Data 2011
Playful Explorations of Public and Personal Data - OSCON Data 2011Playful Explorations of Public and Personal Data - OSCON Data 2011
Playful Explorations of Public and Personal Data - OSCON Data 2011
 
Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
Using MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image AnalysisUsing MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image Analysis
 
Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...
Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...
Kevin Byrne’s Presentation: Sustainability Storyboarded and Geovisualized Acr...
 

Kürzlich hochgeladen

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Kürzlich hochgeladen (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

Spatial Analytics, Where 2.0 2010

  • 1. Spatial Analytics Workshop Pete Skomoroch, LinkedIn (@peteskomoroch) Kevin Weil, Twitter (@kevinweil) Sean Gorman, FortiusOne (@seangorman) #spatialanalytics
  • 2. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  • 5. Spatial Analysis Analytical techniques to determine the spatial distribution of a variable, the relationship between the spatial distribution of variables, and the association of the variables in an area.
  • 7. Spatial Analysis Types 1. Spatial autocorrelation 2. Spatial interpolation 3. Spatial interaction 4. Simulation and modeling 5. Density mapping
  • 8. Spatial Autocorrelation Spatial autocorrelation statistics measure and analyze the degree of dependency among observations in a geographic space. First law of geography: “everything is related to everything else, but near things are more related than distant things.” -- Waldo Tobler
  • 9. Moran’s I - Per Capita Moran’s I - Random Variable Income in Monroe County Moran’s I = .012 Moran’s I = .66
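
As a rough illustration of the statistic named on slides 8 and 9, the sketch below computes global Moran's I from a list of values and a spatial weights matrix. The four-region "chain" layout and its values are hypothetical and are not the Monroe County data from the slide.

```python
def morans_i(values, weights):
    """Global Moran's I for n observations and an n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between every pair of regions
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


# Hypothetical layout: four regions in a row, each adjacent to its neighbors
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 5, 5], chain)    # similar values sit together
alternating = morans_i([1, 5, 1, 5], chain)  # neighbors always differ
```

With this layout the clustered values give a positive I and the alternating values give a negative one, which matches the intuition behind Tobler's first law: near things being more alike pushes the statistic upward.
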
  • 10. Spatial Interpolation Spatial interpolation methods estimate the variables at unobserved locations in geographic space based on the values at observed locations.
  • 11. $14.00 Chicago $14.00 NYC $7.55 Henry Natural Gas Demand in Response to February 21, 2003 Alberta Clipper cold front
  • 12. $18.50 Chicago $30.00 NYC $16.00 Henry Natural Gas Demand in Response to February 24, 2003 Alberta Clipper cold front
  • 13. $20.00 Chicago $37.00 NYC $22.00 Henry Natural Gas Demand in Response to February 25, 2003 Alberta Clipper cold front
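
The slides do not say which interpolation method produced the price surfaces above. One common choice is inverse distance weighting (IDW), sketched here as an assumption rather than as the authors' method. The sample coordinates and values are hypothetical.

```python
import math


def idw(samples, target, power=2.0):
    """Estimate the value at `target` from ((x, y), value) samples using IDW."""
    numerator = 0.0
    denominator = 0.0
    for (x, y), value in samples:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            # The target coincides with an observed location
            return value
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical observations on a plane
observed = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
midpoint_estimate = idw(observed, (1.0, 0.0))  # equally far from both samples
exact_estimate = idw(observed, (0.0, 0.0))     # sits on the first sample
```

A larger `power` makes nearby observations dominate the estimate; a smaller one produces a smoother surface.
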
  • 14. Spatial Interaction Spatial interaction or “gravity models” estimate the flow of people, material, or information between locations in geographic space.
  • 15. Global Oil Supply and Demand Gravity Model
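
The gravity model from slide 14 is often written as flow = k * (mass of origin * mass of destination) / distance^beta. The sketch below assumes that textbook form; the constant `k`, the exponent `beta`, and the input numbers are illustrative and do not come from the oil supply and demand model on the slide.

```python
def gravity_flow(mass_origin, mass_dest, distance, k=1.0, beta=2.0):
    """Estimated flow between two locations under a simple gravity model."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return k * mass_origin * mass_dest / distance ** beta


# Hypothetical masses (e.g. supply at the origin, demand at the destination)
near_flow = gravity_flow(100.0, 50.0, 10.0)
far_flow = gravity_flow(100.0, 50.0, 20.0)  # same masses, twice the distance
```

With `beta=2.0`, doubling the distance cuts the estimated flow to a quarter.
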
  • 16. Simulation and Modeling Simple interactions among proximal entities can lead to intricate, persistent, and functional spatial entities at aggregate levels (complex adaptive systems).
  • 17. Spatial Interdependency Analysis of the San Francisco Failure Simulation
        Infrastructure               | Total Number of Links | No. Links Congested | % Links Congested | % Volume Delay
        Refined Products (National)  | 3,197                 | 1                   | 0.03%             | 0.05%
        Refined Products (MSA)       | 8                     | 1                   | 12.50%            | 93%
        Power Grid (Regional)        | 1,942                 | 4                   | 0%                | N/A
        Power Grid (MSA)             | 16                    | 2                   | 13%               | N/A
  • 18. Density Mapping Calculating the proximity and frequency of a spatial phenomenon by creating a probabilistic surface.
  • 19. New York City Fiber Density Map
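
A density surface like the one described on slide 18 can be approximated by summing a kernel around every point on a raster grid. The sketch below assumes an unnormalized Gaussian kernel and unit-sized grid cells; the function name, the bandwidth, and the sample points are hypothetical.

```python
import math


def kernel_density_grid(points, width, height, bandwidth=1.0):
    """Sum an unnormalized Gaussian kernel from each point at every cell center."""
    grid = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            center_x, center_y = col + 0.5, row + 0.5
            for point_x, point_y in points:
                dist_sq = (point_x - center_x) ** 2 + (point_y - center_y) ** 2
                grid[row][col] += math.exp(-dist_sq / (2.0 * bandwidth ** 2))
    return grid


# Hypothetical points: two stacked in one corner, one in the opposite corner
points = [(0.5, 0.5), (0.5, 0.5), (4.5, 4.5)]
grid = kernel_density_grid(points, width=5, height=5)
```

Cells near the two stacked points score highest, the opposite corner scores lower, and the empty middle of the grid scores lowest. A color map applied to `grid` would turn those scores into a density image.
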
  • 21. Distributed Analytics Queueing analysis tasks from disparate data sources for agents to run across distributed servers to collate back to the user as answers.
  • 22. Disparate Data Distributed Servers Agents User Request Queue Analysis
  • 23. (http://finder.geocommons.com/overlays/20148) 1. Rasterize 2. Kernel density calc 3. Color map Agent Amazon EC2 User Request Queue Amazon S3
  • 29. Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year!) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability
  • 30. Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant, handles replication, node failure, etc ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability ‣ Open source, top-level Apache project ‣ Scalable: Y! has a 4000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds
  • 31. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
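
The map, shuffle, and reduce steps listed on the slide can be imitated in a single process. This is only a sketch of the data flow, not Hadoop code: the `county` field name and the sample rows are assumptions, and a real job would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit (county, 1) for every tweet row."""
    for row in rows:
        yield row["county"], 1


def shuffle_phase(pairs):
    """Shuffle: group values by key and sort by county."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Reduce: sum the ones for each county."""
    return [(county, sum(ones)) for county, ones in groups]


# Hypothetical tweets table
tweets = [
    {"county": "Monroe", "text": "first"},
    {"county": "Kings", "text": "second"},
    {"county": "Monroe", "text": "third"},
]
counts = reduce_phase(shuffle_phase(map_phase(tweets)))
```

Because each county's values are reduced independently, the reduce work can be split across machines, which is where the near-linear speedup on the slide comes from.
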
  • 38. But... ‣ Analysis typically done in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins: lengthy, error-prone ‣ n-stage jobs: hard to manage ‣ Prototyping/exploration requires compilation ‣ Analytics in Eclipse? ur doin it wrong...
  • 39. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • 40. Why Pig? ‣ Because I bet you can read the following script.
  • 41. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • 43. Pig Simplifies Analysis ‣ The Pig version is: ‣ 5% of the code, 5% of the time ‣ Within 50% of the execution time. ‣ Pig Geo: ‣ Programmable: fuzzy matching, custom filtering ‣ Easily link multiple datasets, regardless of size/structure ‣ Iterative, quick
  • 44. A Real Example ‣ Fire up your EMR. ‣ ... or follow along at http://bit.ly/whereanalytics ‣ Pete used Twitter’s streaming API to store some tweets ‣ Simplest thing: group by location and count with Pig ‣ http://bit.ly/where20pig ‣ Here comes some code!
  • 46. tweets = LOAD 's3://where20demo/sample-tweets' as ( user_screen_name:chararray, tweet_id:chararray, ... user_friends_count:int, user_statuses_count:int, user_location:chararray, user_lang:chararray, user_time_zone:chararray, place_id:chararray, ...);
  • 48. tweets_with_location = FILTER tweets BY user_location != 'NULL';
  • 49. normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;
  • 50. grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
  • 51. location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;
  • 52. sorted_counts = ORDER location_counts BY user_count DESC;
  • 53. STORE sorted_counts INTO 'global_location_tweets';
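
For readers who want to trace the logic of the Pig script on slides 46 to 53 without a cluster, here is a small in-memory analogue. It is a sketch only: it assumes each tweet is a dict with a `user_location` key and it skips the LOAD and STORE steps entirely.

```python
from collections import Counter


def count_locations(tweets):
    """Filter, lowercase, group, count, and sort, mirroring the Pig steps."""
    # FILTER: drop tweets with no usable location
    with_location = [t for t in tweets
                     if t.get("user_location") not in (None, "NULL")]
    # FOREACH ... GENERATE LOWER(user_location)
    normalized = [t["user_location"].lower() for t in with_location]
    # GROUP BY user_location, then count each group
    counts = Counter(normalized)
    # ORDER BY user_count DESC
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample rows
sample = [
    {"user_location": "Brasil"},
    {"user_location": "brasil"},
    {"user_location": "NULL"},
    {"user_location": "London"},
]
location_counts = count_locations(sample)
```

The point of Pig is that the same five steps run unchanged on data far too large for one machine's memory, which this version cannot do.
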
  • 54. hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30
        brasil 37985
        indonesia 33777
        brazil 22432
        london 17294
        usa 14564
        são paulo 14238
        new york 13420
        tokyo 10967
        singapore 10225
        rio de janeiro 10135
        los angeles 9934
        california 9386
        chicago 9155
        uk 9095
        jakarta 9086
        germany 8741
        canada 8201
        7696
        7121
        jakarta, indonesia 6480
        nyc 6456
        new york, ny 6331
  • 55. Neat, but... ‣ Wow, that data is messy! ‣ brasil, brazil at #1 and #3 ‣ new york, nyc, and new york ny all in the top 30 ‣ Pete to the rescue.
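
The transcript does not preserve how the cleanup was actually done, so the following is only one simple possibility: folding known spelling variants into a canonical name before re-counting. The alias table is hypothetical; the counts are the ones shown on slide 54.

```python
from collections import Counter

# Hypothetical alias table mapping variants to a canonical name
ALIASES = {
    "brasil": "brazil",
    "nyc": "new york",
    "new york, ny": "new york",
}


def merge_aliases(location_counts, aliases=ALIASES):
    """Sum counts for locations that map to the same canonical name."""
    merged = Counter()
    for location, count in location_counts:
        merged[aliases.get(location, location)] += count
    return merged.most_common()


raw_counts = [
    ("brasil", 37985),
    ("brazil", 22432),
    ("new york", 13420),
    ("nyc", 6456),
    ("new york, ny", 6331),
]
merged_counts = merge_aliases(raw_counts)
```

A hand-written table like this does not scale to free-text locations in general, which is why fuzzy matching or a geocoder is the more realistic tool for this step.
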
  • 74. Questions? Follow us at twitter.com/peteskomoroch twitter.com/kevinweil twitter.com/seangorman

Editor's notes