R, Hadoop and Amazon Web Services
Portland R Users Group
December 20th, 2011
A general disclaimer
• Good programmers learn fast and develop expertise in
  technologies and methodologies in a rather intrepid,
  exploratory manner.
• I am by no means an expert in the paradigm which we
  are discussing this evening, but I’d like to share what I
  have learned in the last year while developing
  MapReduce applications in R within AWS.
  Translation: ask anything and everything, but I reserve
  the right to say “I don’t know, yet.”
• Also, this is a meetup.com meeting – seems only
  appropriate to keep this short, sweet, high-level and
  full of solicitous discussion points.
The whole point of this presentation
• I am selfish (and you should be too!)
    – I like collaborators
    – I like collaborators interested in things I am interested in
    – I believe that dissemination of information related to sophisticated,
      numerical decision making processes generally makes the world a
      better place
    – I believe that the more people use Open Source technology, the more
      people contribute to Open Source technology and the better Open
      Source technology gets in general. Hence, my life gets easier and
      cheaper which is presumably analogous to “better” in some respect.
    – There is beer at this meetup. Cue short intermission.
• Otherweiser® (brought by the aforementioned speaking point,) I’d
  really be very happy if people said to themselves at the end of this
  presentation “Hadoop seems easy! I’m going to give it a try.”
Why are we talking about this anyhow?
“Every two days now we create as much information as we did from the dawn of
   civilization up until 2003.” -Eric Schmidt, August 2010

•   We aggregate a lot of data (and have been)
     – Particularly businesses like Google, Amazon, Apple etc…
     – Presumably the government is doing awful things with data too
•   But aggregation isn’t understanding
     – Lawnmower Man aside
     – We need to UNDERSTAND the data: that is, take raw data and make it interpretable.
     – Hence the need for a marriage of Statistics and Programming directed at understanding
       phenomena expressed in these large data sets
     – Can’t recommend this book enough:
          •   The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert
              Tibshirani and Jerome Friedman
          •   http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576/ref=pd_sim_b_1
•   So everybody is going crazy about this in general.
Also, who is this “self” I speak of?
• ’Tis I, Timothy Dalbey
     • I work for the Emerging Technologies Group of News
       Corporation
     • I live in North East Portland and keep an office on 53rd
       and 5th in New York City
     • Studied Mathematics and Economics as an
       undergraduate student and Statistics as a graduate
       student at University of Virginia
     • 2 awesome kids and an awesome partner at home: Liam,
       Juniper and Lindsay
     • Enthusiastic about technology, science and futuristic
       endeavors in general
Elastic MapReduce
• Elastic MapReduce is
  – A service of Amazon Web Services
  – Is composed of Amazon Machine Images
     • ssh capability
     • Debian Linux
     • Preloaded with ancient versions of R
  – A complimentary set of Ruby Client Tools
  – A web interface
  – Preconfigured to run Hadoop
Hadoop
• Popular framework for controlling distributed cluster computations
     – Popularity is important – cue story about MPI at Levy Laboratory
       and Beowulf clusters…
• Hadoop is an Apache project
     – http://hadoop.apache.org/
•   Open Source
•   Java
•   Configurable (mostly uses XML config files)
•   Fault Tolerant
•   Lots of ways to interact with Hadoop
     –   Pig
     –   Hive
     –   Streaming
     –   Custom .jar
Hadoop is MapReduce
• What is a MapReduce?
   – Originally coined by Google Labs in 2004
   – A super simplified single-node version of the paradigm is as follows:
       cat input.txt | ./mapper.R | sort | ./reducer.R > output.txt
• That is, MapReduce follows a general process:
   –   Read input (cat input)
   –   Map (mapper.R)
   –   Partition
   –   Comparison (sort)
   –   Reduce (reducer.R)
   –   Output (output.txt)
• You can use most popular scripting languages
   – Perl, PHP, Python etc…
   – R
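A toy, single-process walk-through of the stages listed above may help make them concrete. This is only a sketch: the input text and the helper name map_line are made up for illustration, and a real Hadoop job would spread the map and reduce steps across nodes rather than run them in one R session.

```r
# Toy walk-through of the MapReduce stages in one R session.
# The input and the helper name map_line are illustrative assumptions.
input <- c("it was the best of times", "it was the worst of times")

# Map: emit one (word, 1) pair per word.
map_line <- function(line) {
  words <- unlist(strsplit(line, "[[:space:]]+"))
  words <- words[nzchar(words)]
  data.frame(key = words, value = rep(1L, length(words)),
             stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(input, map_line))

# Partition + comparison: sorting on the key brings identical keys together.
pairs <- pairs[order(pairs$key), ]

# Reduce: sum the values for each key.
counts <- vapply(split(pairs$value, pairs$key), sum, integer(1))

# Output: tab-separated key/value lines, as a streaming job would write them.
output <- paste(names(counts), counts, sep = "\t")
cat(output, sep = "\n")
```

The sort step is what lets a reducer see all values for a key together; on a cluster, the framework does that grouping between the map and reduce phases.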
But – that sort of misses the point
• MapReduce is a computational paradigm intended for
   – Large Datasets
   – Multi-Node Computation
   – Truly Parallel Processing
• Master/Slave architecture
   – Nodes are agnostic of one another, only the master
     node(s) have any idea about the greater scheme of things.
      • The importance of truly parallel processing
• A good first question before engaging in creating a
  Hadoop job is:
   – Is this process a good candidate for Hadoop processing in
     the first place?
Benefits to using AWS for Hadoop Jobs
• Preconfigured to run Hadoop
   – This in itself is something of a miracle
• Virtual Servers
   – Use the servers for only as long as you need
   – Configurability
• Handy command line tools
• S3 is sitting in the same cloud
   – Your data is sitting in the same space
• Servers come at $0.06 per hour of compute time
  – dirt cheap
Specifics
•       Bootstrapping
           –       Bootstrapping is a process by which you may customize the nodes via a bash shell script
                       •     Acquiring data
                       •     Updating R
                       •     Installing Packages
                       •     For example:

#!/bin/bash
#debian R upgrade
gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
gpg -a --export 06F90DE5381BA480 | sudo apt-key add -
echo "deb http://streaming.stat.iastate.edu/CRAN/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get -t lenny-cran install --yes --force-yes r-base r-base-dev



•       Input file
           –       Mapper specific
                       •     Classic example in WordCounter.py
                                   –     Example: “It was the best of times, it was the worst of times…”
                                   –     Note: Big data set!
                       •     An example from a recent application of mine:
                                   –     "25621"\r"23803"\r"31712"\r…
                                   –     Note: Not such a big data set


•       Mapper & Reducer
           –       Both typically draw from STDIN and write to STDOUT
           –       Please see the following examples
The typical “Hello World” MapReduce
                Mapper
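#! /usr/bin/env Rscript

# Strip leading and trailing spaces from a line.
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# Split a line into words on runs of whitespace.
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    words <- words[nzchar(words)]
    # Emit "word<TAB>1" for every word; blank lines emit nothing.
    if (length(words) > 0) cat(paste(words, "\t1\n", sep = ""), sep = "")
}

close(con)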
The typical “Hello World” MapReduce
                 Reducer
#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# Parse a "word<TAB>count" line into its two fields.
splitLine <- function(line) {
        val <- unlist(strsplit(line, "\t"))
        list(word = val[1], count = as.integer(val[2]))
}

# An environment serves as a hash table of running counts per word.
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
       line <- trimWhiteSpace(line)
       split <- splitLine(line)
       word <- split$word
       count <- split$count
       if (exists(word, envir = env, inherits = FALSE)) {
           oldcount <- get(word, envir = env)
           assign(word, oldcount + count, envir = env)
       }else{
           assign(word, count, envir = env)
       }
}

close(con)
# Write one "word<TAB>total" line per word.
for (w in ls(env, all.names = TRUE)){
       cat(w, "\t", get(w, envir = env), "\n", sep = "")
}
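The reducer above keeps every word in memory until the end. Because reducer input arrives sorted by key (the sort step in the single-node pipeline), an alternative is to hold only the current key and flush its total when the key changes. The sketch below is one possible way to do that, not the version from the talk: reduce_sorted is a hypothetical helper, and it works on a character vector so it can be tried interactively. In a streaming job the loop would read from stdin and cat each flushed line instead of collecting them.

```r
# Sketch of a reducer that relies on its input being sorted by key.
# reduce_sorted() is a hypothetical helper, not part of any package.
reduce_sorted <- function(lines) {
  out <- character(0)
  current <- NULL   # key currently being accumulated
  total <- 0L
  for (line in lines) {
    fields <- unlist(strsplit(line, "\t", fixed = TRUE))
    if (length(fields) < 2) next   # skip malformed lines
    word <- fields[1]
    count <- as.integer(fields[2])
    if (!is.null(current) && word != current) {
      # Key changed: the previous key is complete, so flush it.
      out <- c(out, paste(current, total, sep = "\t"))
      total <- 0L
    }
    current <- word
    total <- total + count
  }
  # Flush the final key, if any input was seen.
  if (!is.null(current)) out <- c(out, paste(current, total, sep = "\t"))
  out
}

sorted_pairs <- c("best\t1", "it\t1", "it\t1", "times\t1", "times\t1", "worst\t1")
result <- reduce_sorted(sorted_pairs)
cat(result, sep = "\n")
```

The trade-off: memory use no longer grows with the number of distinct words, but the function gives wrong totals if the input is not grouped by key.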
MapReduce and R: Forecasting data for News Corporation
• 50k+ products with historical unit sales data of roughly
  2.5MM rows
• Some of the titles require heavy computational processing
   – Titles with insufficient data require augmented or surrogate
     data in order to make “good” predictions – thus identifying good
     candidate data was also necessary in addition to prediction
     methods
   – Took lots of time (particularly in R)
      • But R had the analysis tools I needed!
• Key observation: The predictions were independent of one
  another which made the process truly parallel.
• Thus, Hadoop and Elastic MapReduce were merited
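The key observation above, that each prediction is independent, can be sketched in a few lines of base R. Everything here is an assumption for illustration: the sales data frame is invented, forecast_one is a hypothetical helper, and the linear trend is only a stand-in for whatever forecasting method was actually used.

```r
# Hypothetical sales data: product id, period index, units sold.
sales <- data.frame(
  product = rep(c("A", "B"), each = 4),
  period  = rep(1:4, times = 2),
  units   = c(10, 12, 14, 16, 5, 5, 6, 6)
)

# Forecast one product in isolation: fit a linear trend and predict the
# next period. The model is a placeholder, not the method from the talk.
forecast_one <- function(df) {
  fit <- lm(units ~ period, data = df)
  next_period <- data.frame(period = max(df$period) + 1)
  unname(predict(fit, newdata = next_period))
}

# No forecast depends on another product's data, so each element of
# by_product could be handed to a separate map task.
by_product <- split(sales, sales$product)
forecasts <- vapply(by_product, forecast_one, numeric(1))
print(forecasts)
```

Because forecast_one only ever sees one product's rows, replacing the local vapply with a distributed map changes where the work runs, not what it computes.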
My Experience Learning and Using Hadoop with AWS
•   Debugging is something of a nightmare.
     –   SSH onto nodes to figure out what’s really going on
     –   STDERR is your enemy – it will cause your job to fail rather completely
     –   STDERR is your best friend. No errors and failed jobs are rather frustrating
•   Most of the work is in the transactions with AWS Elastic MapReduce
•   I followed conventional advice which is “move data to the nodes.”
     –   This meant moving data into CSV files in S3 and importing the data into R via standard read methods
     –   This also meant that my processes were database agnostic
     –   JSON is a great way of structuring input and output between phases of the MapReduce Process
           •   To that effect, check out RJSON – great package.
•   In general, the following rule seems to apply:
     –   Data frame bad.
     –   Data table good.
           •   http://cran.r-project.org/web/packages/data.table/index.html
•   Packages to simplify R make my skin crawl
     –   Ever see Jurassic Park?
     –   Just a stubborn programmer – of course the logical extension leads me to contradiction. Never mind that I
         said that.
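On the debugging point, one pattern that may help is to catch errors per record, report them on STDERR, and keep STDOUT reserved for key/value output, so that a single bad record does not kill the script without a trace. This is a sketch under assumptions: process_record and safe_process are hypothetical helpers, and how STDERR output is surfaced depends on the cluster's logging setup.

```r
# Hypothetical per-record worker: parse a positive number and take its log.
process_record <- function(line) {
  value <- suppressWarnings(as.numeric(line))
  if (is.na(value) || value <= 0) stop("bad record: ", line)
  log(value)
}

# Wrap each record so a failure is reported on STDERR and skipped,
# instead of stopping the whole script.
safe_process <- function(line) {
  tryCatch(
    process_record(line),
    error = function(e) {
      cat("WARN:", conditionMessage(e), "\n", file = stderr())
      NA_real_
    }
  )
}

results <- vapply(c("1", "oops", "100"), safe_process, numeric(1),
                  USE.NAMES = FALSE)
print(results)
```

Returning NA for a failed record keeps the output aligned with the input, which makes it easier to find the offending rows afterwards.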
R Package to Utilize Map Reduce
• Segue – Written by J.D. Long
  – http://www.cerebralmastication.com
     • P.s. We all realize that www is a subdomain, right?
       World Wide Web… is that really necessary?
  – Handles much of the transactional details and
    allows the use of Elastic MapReduce through
    apply() and lapply() wrappers
• Seems like this is a good tutorial too:
  – http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
Other stuff
• Distributed Cache
  – Load your data the smart way!
• Ruby Command Tools
  – Interact with AWS the smart way!
• Web interface
  – Simple.
  – Helpful when monitoring jobs when you wake up
    at 3:30AM and wonder “is my script still running?”

A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

R, Hadoop and Amazon Web Services

  • 1. R, Hadoop and Amazon Web Services Portland R Users Group December 20th, 2011
  • 2. A general disclaimer • Good programmers learn fast and develop expertise in technologies and methodologies in a rather intrepid, exploratory manner. • I am by no means an expert in the paradigm we are discussing this evening, but I’d like to share what I have learned in the last year while developing MapReduce applications in R within AWS. Translation: ask anything and everything, but I reserve the right to say “I don’t know, yet.” • Also, this is a meetup.com meeting – seems only appropriate to keep this short, sweet, high-level and full of solicitous discussion points.
  • 3. The whole point of this presentation • I am selfish (and you should be too!) – I like collaborators – I like collaborators interested in things I am interested in – I believe that dissemination of information related to sophisticated, numerical decision making processes generally makes the world a better place – I believe that the more people use Open Source technology, the more people contribute to Open Source technology and the better Open Source technology gets in general. Hence, my life gets easier and cheaper, which is presumably analogous to “better” in some respect. – There is beer at this meetup. Cue short intermission. • Otherweiser® (brought by the aforementioned speaking point), I’d really be very happy if people said to themselves at the end of this presentation “Hadoop seems easy! I’m going to give it a try.”
  • 4. Why are we talking about this anyhow? “Every two days now we create as much information as we did from the dawn of civilization up until 2003.” -Eric Schmidt, August 2010 • We aggregate a lot of data (and have been) – Particularly businesses like Google, Amazon, Apple etc… – Presumably the government is doing awful things with data too • But aggregation isn’t understanding – Lawnmower Man aside – We need to UNDERSTAND the data, that is, take raw data and make it interpretable. – Hence the need for a marriage of Statistics and Programming directed at understanding phenomena expressed in these large data sets – Can’t recommend this book enough: • The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman • http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576/ref=pd_sim_b_1 • So everybody is going crazy about this in general.
  • 5. Also, who is this “self” I speak of? • ’Tis I, Timothy Dalbey • I work for the Emerging Technologies Group of News Corporation • I live in North East Portland and keep an office on 53rd and 5th in New York City • Studied Mathematics and Economics as an undergraduate student and Statistics as a graduate student at University of Virginia • 2 awesome kids and an awesome partner at home: Liam, Juniper and Lindsay • Enthusiastic about technology, science and futuristic endeavors in general
  • 6. Elastic MapReduce • Elastic MapReduce is – A service of Amazon Web Services – Composed of Amazon Machine Images • ssh capability • Debian Linux • Preloaded with ancient versions of R – A complimentary set of Ruby Client Tools – A web interface – Preconfigured to run Hadoop
  • 7. Hadoop • Popular framework for controlling distributed cluster computations – Popularity is important – cue story about MPI at Levy Laboratory and Beowulf clusters… • Hadoop is an Apache project – http://hadoop.apache.org/ • Open Source • Java • Configurable (mostly uses XML config files) • Fault Tolerant • Lots of ways to interact with Hadoop – Pig – Hive – Streaming – Custom .jar
  • 8. Hadoop is MapReduce • What is MapReduce? – The term was introduced by Google in 2004 – A super simplified single-node version of the paradigm is as follows: cat input.txt | ./mapper.R | sort | ./reducer.R > output.txt • That is, MapReduce follows a general process: – Read input (cat input.txt) – Map (mapper.R) – Partition – Comparison (sort) – Reduce (reducer.R) – Output (output.txt) • You can use most popular scripting languages – Perl, PHP, Python etc… – R
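The single-node pipeline above can also be mimicked inside one R session, which is a convenient way to see what each stage contributes before involving Hadoop. This is only a sketch: the function names `map_words` and `reduce_counts` and the two sample lines are assumptions made for illustration, not code from the talk.

```r
# Map: turn each input line into "word<TAB>1" records.
map_words <- function(lines) {
  words <- unlist(strsplit(trimws(lines), "[[:space:]]+"))
  words <- words[nzchar(words)]            # drop empty tokens from blank lines
  if (length(words) == 0) return(character(0))
  paste(words, 1L, sep = "\t")
}

# Reduce: sum the counts for each key; expects "key<TAB>count" records.
reduce_counts <- function(records) {
  parts  <- strsplit(records, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  counts <- as.integer(vapply(parts, `[`, character(1), 2))
  totals <- tapply(counts, keys, sum)
  paste(names(totals), totals, sep = "\t")
}

input   <- c("it was the best of times", "it was the worst of times")
mapped  <- map_words(input)        # Map
sorted  <- sort(mapped)            # Partition / Comparison (the `sort` step)
reduced <- reduce_counts(sorted)   # Reduce
cat(reduced, sep = "\n")           # Output
```

On a real cluster the sort happens between nodes rather than in memory, but the contract between the stages (tab-separated key/value lines) is the same.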
  • 9. But – that sort of misses the point • MapReduce is a computational paradigm intended for – Large Datasets – Multi-Node Computation – Truly Parallel Processing • Master/Slave architecture – Nodes are agnostic of one another; only the master node(s) have any idea about the greater scheme of things. • The importance of truly parallel processing • A good first question before engaging in creating a Hadoop job is: – Is this process a good candidate for Hadoop processing in the first place?
  • 10. Benefits to using AWS for Hadoop Jobs • Preconfigured to run Hadoop – This in itself is something of a miracle • Virtual Servers – Use the servers for only as long as you need – Configurability • Handy command line tools • S3 is sitting in the same cloud – Your data is sitting in the same space • Servers come at $0.06 per hour of compute time – dirt cheap
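To put the $0.06 per hour figure in perspective, a back-of-the-envelope estimate can be written as a one-line function. The function name `cluster_cost`, the example cluster size, and the rounding of partial hours up to a full billed hour are assumptions for this sketch; check current pricing before relying on it.

```r
# Estimated cluster cost in dollars.
# Assumption: each node is billed per started hour, hence ceiling().
cluster_cost <- function(nodes, hours, rate_per_hour = 0.06) {
  nodes * ceiling(hours) * rate_per_hour
}

# Hypothetical job: 10 nodes running for 2.5 hours (3 billed hours each).
cost <- cluster_cost(nodes = 10, hours = 2.5)
cat(sprintf("Estimated cost: $%.2f\n", cost))
```

Under these assumptions a ten-node job of a few hours costs on the order of a couple of dollars.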
  • 11. Specifics
      • Bootstrapping
          – Bootstrapping is a process by which you may customize the nodes via a bash shell script
              • Acquiring data
              • Updating R
              • Installing packages
          – Please see this example:
              #!/bin/bash
              # Debian R upgrade
              gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
              gpg -a --export 06F90DE5381BA480 | sudo apt-key add -
              echo "deb http://streaming.stat.iastate.edu/CRAN/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
              sudo apt-get update
              sudo apt-get -t lenny-cran install --yes --force-yes r-base r-base-dev
      • Input file
          – Mapper specific
              • Classic example in WordCounter.py
                  – Example: “It was the best of times, it was the worst of times…”
                  – Note: Big data set!
              • An example from a recent application of mine:
                  – "25621"\r"23803"\r"31712"\r…
                  – Note: Not such a big data set
      • Mapper & Reducer
          – Both typically draw from STDIN and write to STDOUT
          – Please see the following examples
  • 12. The typical “Hello World” MapReduce Mapper
        #! /usr/bin/env Rscript

        # Strip leading and trailing spaces from a line
        trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
        # Split a line into words on runs of whitespace
        splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

        con <- file("stdin", open = "r")
        while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
            line <- trimWhiteSpace(line)
            words <- splitIntoWords(line)
            # Emit one "word<TAB>1" record per word
            cat(paste(words, "\t1\n", sep = ""), sep = "")
        }
        close(con)
  • 13. The typical “Hello World” MapReduce Reducer
        #! /usr/bin/env Rscript

        # Strip leading and trailing spaces from a line
        trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
        # Split a "word<TAB>count" record into its two fields
        splitLine <- function(line) {
            val <- unlist(strsplit(line, "\t"))
            list(word = val[1], count = as.integer(val[2]))
        }

        # The environment serves as a hash table of running totals per word
        env <- new.env(hash = TRUE)
        con <- file("stdin", open = "r")
        while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
            line <- trimWhiteSpace(line)
            split <- splitLine(line)
            word <- split$word
            count <- split$count
            if (exists(word, envir = env, inherits = FALSE)) {
                oldcount <- get(word, envir = env)
                assign(word, oldcount + count, envir = env)
            } else {
                assign(word, count, envir = env)
            }
        }
        close(con)

        # Emit one "word<TAB>total" record per word
        for (w in ls(env, all = TRUE)) {
            cat(w, "\t", get(w, envir = env), "\n", sep = "")
        }
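The reducer above keeps every word in memory until the input ends. Because the reducer's input arrives sorted by key (the `sort` step in the single-node pipeline), an alternative is to total each run of identical keys and flush the total as soon as the key changes. The sketch below is not the talk's code; `reduce_sorted` is a hypothetical helper that takes a character vector instead of reading STDIN so it can be tried interactively.

```r
# Sum counts over runs of identical keys in already-sorted
# "key<TAB>count" lines. Only one running total is held at a time.
reduce_sorted <- function(lines) {
  out     <- character(0)
  current <- NULL
  total   <- 0L
  for (line in lines) {
    fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
    key    <- fields[1]
    count  <- as.integer(fields[2])
    if (!is.null(current) && key != current) {
      out   <- c(out, paste(current, total, sep = "\t"))  # key changed: flush
      total <- 0L
    }
    current <- key
    total   <- total + count
  }
  # Flush the final key, if any input was seen.
  if (!is.null(current)) out <- c(out, paste(current, total, sep = "\t"))
  out
}

sorted_input <- c("best\t1", "it\t1", "it\t1", "of\t1", "of\t1")
run_totals   <- reduce_sorted(sorted_input)
cat(run_totals, sep = "\n")
```

In a streaming script the same loop would read lines with `readLines(con, n = 1)` and `cat()` each total as it is flushed, rather than collecting them in `out`.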
  • 14. MapReduce and R: Forecasting data for News Corporation • 50k+ products with historical unit sales data of roughly 2.5MM rows • Some of the titles require heavy computational processing – Titles with insufficient data require augmented or surrogate data in order to make “good” predictions – thus identifying good candidate data was also necessary in addition to prediction methods – Took lots of time (particularly in R) • But R had the analysis tools I needed! • Key observation: The predictions were independent of one another which made the process truly parallel. • Thus, Hadoop and Elastic MapReduce were merited
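The key observation on this slide, that each prediction is independent of the others, is what makes the job map cleanly onto `lapply()` locally and onto mappers on a cluster. The sketch below uses invented sales figures and a deliberately naive stand-in forecast (the talk does not describe its actual model); `sales` and `naive_forecast` are hypothetical names.

```r
# Hypothetical sales history: one row per product per period.
sales <- data.frame(
  product = rep(c("A", "B", "C"), each = 4),
  units   = c(10, 12, 11, 13,   5, 5, 6, 4,   100, 90, 95, 105)
)

# Stand-in forecast: the mean of the most recent n observations.
# This is NOT the model from the talk, only a placeholder.
naive_forecast <- function(units, n = 2) mean(tail(units, n))

# Each product is forecast without reference to any other product,
# so the list elements could be processed in any order, on any node.
by_product <- split(sales$units, sales$product)
forecasts  <- lapply(by_product, naive_forecast)

for (p in names(forecasts)) {
  cat(p, "\t", forecasts[[p]], "\n", sep = "")
}
```

In a Hadoop Streaming job, each element of `by_product` would instead correspond to one input line (for example a product ID) handed to whichever mapper is free.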
  • 15. My Experience Learning and Using Hadoop with AWS • Debugging is something of a nightmare. – SSH onto nodes to figure out what’s really going on – STDERR is your enemy – it will cause your job to fail rather completely – STDERR is your best friend. Failed jobs with no error output are rather frustrating • Most of the work is in the transactions with AWS Elastic MapReduce • I followed conventional advice which is “move data to the nodes.” – This meant moving data into CSVs in S3 and importing the data into R via standard read methods – This also meant that my processes were database agnostic – JSON is a great way of structuring input and output between phases of the MapReduce process • To that effect, check out RJSON – great package. • In general, the following rule seems to apply: – Data frame bad. – Data table good. • http://cran.r-project.org/web/packages/data.table/index.html • Packages to simplify R make my skin crawl – Ever see Jurassic Park? – Just a stubborn programmer – of course the logical extension leads me to contradiction. Never mind that I said that.
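One way to make STDERR a friend rather than an enemy is to catch per-record errors and log them there, leaving STDOUT for key/value output only. This is a sketch under assumptions: `process_record` and its square-root "work" are hypothetical, and it assumes that task STDERR ends up in the task logs, as it does under Hadoop Streaming.

```r
# Process one input record; return the output line, or NULL if it failed.
# `process_record` is a hypothetical helper for this sketch.
process_record <- function(line) {
  tryCatch({
    value <- suppressWarnings(as.numeric(line))
    if (is.na(value)) stop("not a number: ", line)
    paste(line, sqrt(value), sep = "\t")
  }, error = function(e) {
    # Diagnostics go to STDERR so they never mix with key/value output.
    cat("skipping record:", conditionMessage(e), "\n", file = stderr())
    NULL
  })
}

results <- lapply(c("4", "oops", "9"), process_record)
good    <- unlist(results)      # NULL entries (failed records) drop out
cat(good, sep = "\n")           # only good records reach STDOUT
```

Whether a bad record should be skipped or should stop the job is a design choice; skipping keeps a long job alive, while the log line preserves the evidence.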
  • 16. R Package to Utilize MapReduce • Segue – Written by J.D. Long – http://www.cerebralmastication.com • P.s. We all realize that www is a subdomain, right? World Wide Web… is that really necessary? – Handles many of the transactional details and allows the use of Elastic MapReduce through apply() and lapply() wrappers • Seems like this is a good tutorial too: – http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
  • 17. Other stuff • Distributed Cache – Load your data the smart way! • Ruby Command Tools – Interact with AWS the smart way! • Web interface – Simple. – Helpful when monitoring jobs when you wake up at 3:30AM and wonder “is my script still running?”