Revolution Confidential




Using R with Hadoop




        by   Jeffrey Breen
        Principal, Think Big Academy
        Think Big Analytics
        jeffrey.breen@thinkbiganalytics.com
        Twitter: @JeffreyBreen





Think Big

Start Smart

Scale Fast
Agenda




Why R?
What is Hadoop?
Counting words with MapReduce
RHadoop in Action
   Client scenario: Scaling legacy analytics with RHadoop
   Client scenario: Network security threat detection
Big Data Warehousing with Hive
   Client scenario: Using Hadoop as a “drop-in” database replacement
   Client scenario: Preprocessing data for R
Want to learn more?
Q&A





http://thebalancedguy.blogspot.com/2010/09/with-3-boys-and-having-been-cub-scout.html




http://www.wengerna.com/giant-knife-16999
Number of R Packages Available
How many R packages are there now?
At the command line enter:
> dim(available.packages())




Slide courtesy of John Versotek, organizer of the Boston Predictive Analytics Meetup




Poll: What do you use for analytics?




Revolution R Enterprise has Open-Source R Engine at the core
3,700 community packages and growing exponentially
[Figure: Revolution R Enterprise components]
  Core: R Engine, Language Libraries
  Multi-Threaded Math Libraries
  Cluster support
  Web Services API
  Big Data Analysis
  Parallel Tools
  Technical Support
  IDE / Developer GUI
  Community Packages
  Build Assurance
What is Hadoop?




An open source project designed to support large scale data
   processing
Inspired by Google’s MapReduce-based computational
   infrastructure
Composed of several components
   Hadoop Distributed File System (HDFS)
   MapReduce processing framework, job scheduler, etc.
   Ingest/outgest services (Sqoop, Flume, etc.)
   Higher level languages and libraries (Hive, Pig, Cascading, Mahout)
Written in Java, first opened up to alternatives through its
  Streaming API
   → If your language of choice can handle stdin and stdout, you can use it
     to write MapReduce jobs
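
To illustrate the stdin/stdout contract, here is a minimal sketch of what a Streaming mapper for word counting might look like in R. The function name stream_mapper and the tab-separated "word<TAB>1" output format are assumptions made for this example, not code from the talk. In a real Streaming job the script would read from file("stdin") rather than the in-memory connection used here.

```r
# Hypothetical streaming mapper: read lines from a connection and
# write one "word<TAB>1" record per word to the output connection.
stream_mapper <- function(input, output = stdout()) {
  lines <- readLines(input, warn = FALSE)
  words <- unlist(strsplit(tolower(lines), "[^a-z0-9]+"))
  words <- words[nzchar(words)]          # drop empty tokens
  if (length(words) > 0) {
    writeLines(paste(words, 1, sep = "\t"), output)
  }
  invisible(words)
}

# Local check with an in-memory connection standing in for stdin
demo_in <- textConnection(c("Hadoop uses MapReduce", "There is a Map phase"))
emitted <- stream_mapper(demo_in)
close(demo_in)
```

This sketch reads all input at once for brevity; a mapper for large inputs would more likely read and emit in chunks.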





Poll: Have you used Hadoop?




Enter RHadoop




RHadoop is an open source project sponsored by
 Revolution Analytics
Package Overview
  rmr2 - all MapReduce-related functions
  rhdfs - interaction with Hadoop’s HDFS file system
  rhbase - access to the NoSQL HBase database
rmr2 uses Hadoop’s Streaming API to allow R
  users to write MapReduce jobs in R
  handles all of the I/O and job submission for you (no
    while(<stdin>)-like loops!)
RHadoop Advantages




Modular
  Packages group similar functions
  Only load (and learn!) what you need
  Minimizes prerequisites and dependencies
Open Source
  Cost: Low (no) barrier to start using
  Transparency: Development, issue tracker, Wiki, etc. hosted on
    github: https://github.com/RevolutionAnalytics/RHadoop/
Supported
  Sponsored by Revolution Analytics
  Training & professional services available
  Support included with Revolution R Enterprise

Hadoop cluster components
[Figure: Hadoop cluster architecture]
  Outside the cluster: SQL Store and Logs → Ingest Service; Outgest Service → SQL Store
  Primary Master Server: ✲ Job Tracker, Name Node
  Secondary Master Server: Secondary Name Node
  Slaves: Slave Server, Slave Server, Slave Server, ... each with ✲ Task Tracker, Data Node, and local disks
  Client Servers: ✲ Hive, Pig, ...; ✲ cron+bash, Azkaban, …; Sqoop, Scribe, …; Monitoring, Management
  Key: italics = process; ✲ = MR jobs

from Think Big Academy’s Hadoop Developer Course
Copyright © 2011-2013, Think Big Analytics, All Rights Reserved
Hadoop’s distributed file system
[Figure: the Hadoop cluster diagram again, highlighting the HDFS services]
  Services: Name Node (on the Primary Master Server), Data Nodes (on the Slave Servers)
  64MB blocks
  3x replication




Wordcount in Hadoop
 (The “hello world” of MapReduce)




[Figure: wordcount data flow with columns Input, Mappers, Sort/Shuffle, Reducers, Output]
Input: "Hadoop uses MapReduce" / "There is a Map phase" / "There is a Reduce phase"
Output: a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1
We need to convert the Input into the Output.
[Figure build: Input → Mappers]
Each input line reaches a mapper as an (N, "…") key-value pair; the blank line arrives as (N, "").
[Figure build: Mapper output]
"Hadoop uses MapReduce" → (hadoop, 1), (uses, 1), (mapreduce, 1)
"There is a Map phase" → (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
"" → no pairs
"There is a Reduce phase" → (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)
http://blog.stackoverflow.com/wp-content/uploads/then-a-miracle-occurs-cartoon.png
[Figure build: Sort, Shuffle]
The mappers' (word, 1) pairs are routed to three reducers by key range: 0-9, a-l / m-q / r-z.
[Figure build: Reducer input after Sort, Shuffle]
Reducer 0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
Reducer m-q: (map, [1]), (mapreduce, [1]), (phase, [1,1])
Reducer r-z: (reduce, [1]), (there, [1,1]), (uses, [1])
[Figure build: complete wordcount flow]
Reducer (key range)   Reducer input                                   Output
0-9, a-l              (a, [1,1]), (hadoop, [1]), (is, [1,1])          a 2, hadoop 1, is 2
m-q                   (map, [1]), (mapreduce, [1]), (phase, [1,1])    map 1, mapreduce 1, phase 2
r-z                   (reduce, [1]), (there, [1,1]), (uses, [1])      reduce 1, there 2, uses 1
[Figure build: complete wordcount flow, annotated]
Map: Transform one input to 0-N outputs.
Reduce: Collect multiple inputs into one output.
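
The map, sort/shuffle, and reduce steps in the figure can be imitated on a single machine in base R. The sketch below is only an illustration of the data flow, using the three example lines from the slides; it does not use Hadoop or rmr2, and the helper name map_line is made up for this example.

```r
# Input records: the example lines from the figure, including the blank one
input <- c("Hadoop uses MapReduce",
           "There is a Map phase",
           "",
           "There is a Reduce phase")

# Map: one input line -> zero or more (word, 1) pairs
map_line <- function(line) {
  words <- unlist(strsplit(tolower(line), "\\s+"))
  words <- words[nzchar(words)]
  data.frame(key = words, value = rep(1L, length(words)),
             stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(input, map_line))

# Sort/shuffle: group the values by key, e.g. (is, [1, 1])
grouped <- split(pairs$value, pairs$key)

# Reduce: collapse each key's values into one output
counts <- vapply(grouped, sum, integer(1))
print(counts)
```

On a cluster the grouping step is what Hadoop does between the mappers and reducers; here split() plays that role.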
wordcount: code



library(rmr2)

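map = function(k,lines) {

    # split each line on whitespace (the regex backslash must be escaped)
    words.list = strsplit(lines, '\\s')
    words = unlist(words.list)

    # emit one (word, 1) pair per word
    return( keyval(words, 1) )
}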

reduce = function(word, counts) {
   keyval(word, sum(counts))
}

wordcount = function (input, output = NULL) {
   mapreduce(input = input ,
              output = output,
              input.format = "text",
              map = map,
              reduce = reduce)}


                      from Revolution Analytics’ Getting Started with RHadoop course
wordcount: submit job and fetch results




Submit job
    > hdfs.root = 'wordcount'
    > hdfs.data = file.path(hdfs.root, 'data')
    > hdfs.out = file.path(hdfs.root, 'out')
    > out = wordcount(hdfs.data, hdfs.out)


Fetch results from HDFS
    > results = from.dfs( out )
    > results.df = as.data.frame(results, stringsAsFactors=F )
    > colnames(results.df) = c('word', 'count')
    > head(results.df)
           word count
    1 greatness     2
    2    damned     3
    3       tis     5
    4      jade     1
    5   magician     1
    6      hands     2


Code notes




Scalable
   Hadoop and MapReduce abstract away system details
   Code runs on 1 node or 1,000 nodes without modification
Portable
   You write normal R code, interacting with normal R objects
   RHadoop’s rmr2 library abstracts away Hadoop details
   All the functionality you expect is there—including Enterprise R’s
Flexible
   Only the mapper deals with the data directly
   All components communicate via key-value pairs
   Key-value “schema” chosen for each analysis rather than as a
     prerequisite to loading data into the system
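
As a rough illustration of choosing the key-value "schema" per analysis, the base R sketch below keys the same raw records two different ways. The flight records, the run_job helper, and all values are hypothetical and exist only for this example; nothing here comes from the talk's data or from rmr2.

```r
# Hypothetical flight records (illustrative values only)
flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "DL"),
  origin  = c("BOS", "SFO", "BOS", "SFO", "BOS"),
  delay   = c(10, 20, 5, 15, 0),
  stringsAsFactors = FALSE
)

# Generic "map then reduce": derive (key, value) pairs, then reduce per key
run_job <- function(records, key_fun, value_fun, reduce_fun) {
  keys   <- key_fun(records)
  values <- value_fun(records)
  vapply(split(values, keys), reduce_fun, numeric(1))
}

# Analysis 1: mean delay keyed by carrier
by_carrier <- run_job(flights, function(d) d$carrier,
                      function(d) d$delay, mean)

# Analysis 2: the same raw data, keyed by origin airport instead
by_origin <- run_job(flights, function(d) d$origin,
                     function(d) d$delay, mean)
```

The raw data is loaded once and never restructured; only the key function changes between the two analyses.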
Data silos have developed by source, type, system, department...




Customer Service, Call Center
Website search, Clickstreams
Operations & Finance
Social Media




Caution: Silos Kill




AP Photo/The Kansas City Star, Keith Myers




Poll: How do you intend to use R &
            Hadoop?




Scaling up a legacy modeling environment




Scenario
  A financial services client has built a model to predict
    credit risk using a legacy system, but now wants to
    run the scoring algorithm more frequently than is
    possible in the existing hardware- (and license-) bound
    environment
Solution
  Model coefficients are exported from legacy system
   and written to HDFS as a text file so they are
   available for our R code to use while scoring the
   entire data set nightly
“Side loading” data made easy




[...]
# first load in our coefficients
# Note that this data.frame will be
# _automatically_ distributed to each node

COEFFS.DF = as.data.frame( from.dfs(hdfs.coeffs,
  format = make.input.format('csv', sep='\t')) )

out = mapreduce(input = hdfs.data,
  output = hdfs.out,
  input.format = make.input.format('csv', sep='\t'),
  map = scoring.map,
  verbose=TRUE)
Logistic regression scoring “by hand”



score.logit = function(coeffs, data, reverse.coeffs.signs=F)
{
   if( length(coeffs) != length(data) + 1 )
   {
         warning("Bad row: ", data)
         return(NULL)
   }

    # SAS may reverse the signs of the coefficients
    # (this assumes the intercept should be changed as well)
    if (reverse.coeffs.signs)
         coeffs = -1 * coeffs

    intercept = coeffs[1]
    coeffs = coeffs[-1]

    return( 1/(1 + exp(-intercept - sum(coeffs * data))) )
}
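
A small worked example may help show how score.logit behaves. The sketch below repeats the function in condensed form so it runs on its own, then scores two rows and cross-checks the result against base R's plogis(). The coefficient and predictor values are hypothetical, chosen only for illustration.

```r
# score.logit as shown on the slide, condensed so this block is self-contained
score.logit <- function(coeffs, data, reverse.coeffs.signs = FALSE) {
  if (length(coeffs) != length(data) + 1) {
    warning("Bad row: ", paste(data, collapse = ", "))
    return(NULL)
  }
  if (reverse.coeffs.signs) coeffs <- -1 * coeffs
  intercept <- coeffs[1]
  coeffs <- coeffs[-1]
  1 / (1 + exp(-intercept - sum(coeffs * data)))
}

# Hypothetical coefficients: intercept first, then one weight per predictor
coeffs <- c(-1.5, 0.8, 0.02)

# Hypothetical rows to score (two predictors each)
rows <- list(c(1, 30), c(0, 55))
scores <- vapply(rows, function(r) score.logit(coeffs, r), numeric(1))

# Cross-check against base R's logistic distribution function
expected <- vapply(rows,
                   function(r) plogis(coeffs[1] + sum(coeffs[-1] * r)),
                   numeric(1))
stopifnot(isTRUE(all.equal(scores, expected)))
```

For the first row the linear predictor is -1.5 + 0.8 + 0.6 = -0.1, so the score falls just below 0.5.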



Network security threat detection




Scenario
  A risk-conscious company with a high-profile web
    presence was seeking a way to detect potential
    attacks using captured network traffic (PCAP files).
Solution
  Typical traffic was modeled using a combination of
    algorithms on the full data set. Baseline results from
    this “training” phase are stored on HDFS and can be
    refreshed periodically, as needed.
  A detection job loads the baseline results and analyzes
    incoming network traffic looking for anomalies.
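
The talk does not say which algorithms the client used. Purely to illustrate the two-phase structure (train a baseline, then compare incoming traffic against it), the sketch below swaps in a simple z-score threshold. The function names, the threshold of 3, and the request counts are all assumptions for this example.

```r
# Training phase: summarise typical traffic (hypothetical requests per minute)
train_baseline <- function(counts) {
  list(mean = mean(counts), sd = sd(counts))
}

# Detection phase: flag observations far from the stored baseline
detect_anomalies <- function(counts, baseline, threshold = 3) {
  z <- (counts - baseline$mean) / baseline$sd
  which(abs(z) > threshold)
}

typical  <- c(100, 98, 103, 101, 99, 102, 97, 100)  # illustrative values
baseline <- train_baseline(typical)   # in the scenario, this lives on HDFS

incoming <- c(101, 99, 250, 100)      # illustrative values
flagged  <- detect_anomalies(incoming, baseline)
```

Here flagged holds the positions of the suspicious observations; the baseline can be recomputed whenever the training data is refreshed.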
Big Data Warehousing with Hive




Hive supplies a SQL-like query language
  very familiar for those with relational database
    experience
But Hive compiles, optimizes, and executes
 these queries as MapReduce jobs on the
 Hadoop cluster
Can be used in conjunction with other Hadoop
 jobs, such as those written with rmr2



Hive architecture & access




[Figure: Hive architecture]
  Access: Terminal → Hive CLI; browser → HWI; RODBC, RJDBC, etc. → JDBC / ODBC via Thrift Server
  Hive: Driver (compiles, optimizes, executes), Metastore
  Hadoop: Master (✲ Job Tracker, Name Node), DFS



Hadoop as a “drop-in” database replacement




Scenario
   A market research company had developed a large collection of
     mature R code to conduct mission-critical analysis. Reaching the
     limits of their legacy database infrastructure, the company has
     decided to move their data to Hadoop, but wants to avoid a “Big
     Bang” project to preserve existing functionality.


Solution
   As a stop-gap solution, ODBC connections to Hive have replaced
     connections to the legacy relational database with minimal
     recoding.
   Existing functionality was preserved while the team got up to speed
     with their new Hadoop environment and its capabilities.
   Solution allows for parallel testing of two systems.
Accessing Hive via ODBC/JDBC



library(RJDBC)

# set the classpath to include the JDBC driver location, plus commons-logging
[...]
class.path = c(hive.class.path, commons.class.path)
drv = JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", classPath=class.path,
   "`")

# make a connection to the running Hive Server:
conn = dbConnect(drv, "jdbc:hive://localhost:10000/default")

# setting the database name in the URL doesn't help,
# so issue 'use databasename' command:
res = dbSendQuery(conn, 'use mydatabase')

# submit the query and fetch the results as a data.frame:
df = dbGetQuery(conn, 'SELECT name, sub FROM employees LATERAL VIEW
   explode(subordinates) subView AS sub')



Preprocessing data with Hive




Scenario
   A marketing analytics team had built their own segmentation model
     based on their own database of web visitors. A partnership with a new
     data provider brought an order of magnitude more visitors to segment.
Solution
   A series of MapReduce jobs (Hive queries and RHadoop rmr2) was used
      to match, filter, and transform the data into a manageable size for
      interactive modeling with Revolution R Enterprise.


[Figure: Segmented Sample and Raw data flow through Match, Filter, and Extract steps to an Output, which feeds interactive modeling & exploratory analytics in Revolution R Enterprise]
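
In the client scenario the match, filter, and extract steps ran as Hive queries and rmr2 jobs. The base R sketch below only mimics the same three logical steps on toy data, to show the shape of the pipeline. The data frames, column names, and the "visits > 0" filter rule are hypothetical.

```r
# Hypothetical in-house visitors and partner data (illustrative values only)
visitors <- data.frame(id = c(1, 2, 3, 4),
                       segment = c("A", "B", "A", "C"),
                       stringsAsFactors = FALSE)
partner  <- data.frame(id = c(2, 3, 4, 5, 6),
                       visits = c(12, 0, 7, 3, 9))

# Match: join partner records to known visitors by id
matched <- merge(partner, visitors, by = "id")

# Filter: keep only visitors with some activity
active <- matched[matched$visits > 0, ]

# Extract: keep just the columns needed for interactive modeling
extract <- active[, c("id", "segment", "visits")]
```

The point of the real pipeline is the same as here: shrink the data to a size that fits comfortably in an interactive R session.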



Other ways to use R and Hadoop




HDFS
   Revolution R Enterprise can read and write files directly on the distributed
     file system
   Files can include ScaleR’s XDF-formatted data sets


MapReduce
   Many other R packages have been written to use R and Hadoop together,
     including RHIPE, segue, Oracle’s R Connector for Hadoop, etc.


Hive
   Hadoop Streaming is also available for Hive to leverage functionality
     external to Hadoop and Java
   RHive leverages Rserve to connect the two:
     http://cran.r-project.org/web/packages/RHive/

Want to learn more?




Upcoming public Getting Started with RHadoop 1-day classes
    Hands-on examples and exercises covering rhdfs, rhbase, and rmr2
    Algorithms and data include wordcount, analysis of airline flight data, and collaborative
       filtering using structured and unstructured data from text, CSV files and Twitter
    February 25, 2013 - Palo Alto, CA
    March 13, 2013 - Boston, MA
    more to come: http://www.eventbrite.com/org/3179781746?s=12196176
    (also available as private, on-site training)


Revolution Analytics Quick Start Program for Hadoop
    Private Getting Started with RHadoop training
    Onsite consulting assistance for initial use case
    Revolution R for Hadoop licenses and support
    http://www.revolutionanalytics.com/services/consulting/hadoop-quick-start.php



Thank you! Any Questions?




