Hadoop: An Introduction
                             Kai Voigt, Cloudera Inc
                        Berlin Buzzwords, June 6th 2011




Monday, June 6, 2011
Big Data

                       •   Capture

                       •   Storage

                       •   Search

                       •   Analytics




hadoop.apache.org


                       Google Filesystem (GFS)   Hadoop Distributed Filesystem (HDFS)

                       MapReduce                 MapReduce



Hadoop Distributed Filesystem
                       • Easy Access
                       • Distributed
                       • Redundant
                       • Scalable

File Splits
                                             Large File


                                            6440M



                 Block   Block   Block   Block     Block   Block         Block   Block
                   1       2       3       4         5       6     ...    100     101


               64M 64M 64M 64M 64M 64M                                   64M 40M
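
The arithmetic behind this slide can be checked with a short sketch. The function name `split_into_blocks` is made up for illustration and is not part of any Hadoop API; the 64M block size and the 6440M file size are taken from the slide.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size in MB of each block of a file, in order."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds whatever is left over.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(6440)
print(len(blocks), blocks[0], blocks[-1])  # 101 64 40
```

The 6440M file yields 100 full blocks of 64M plus one final block of 40M, matching the 101 blocks on the slide.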

Block Placement
                       Block 1   Block 2   Block 3   Block 2   Block 2
                       Block 3   Block 1             Block 3
                                                     Block 1

                       Node 1    Node 2    Node 3    Node 4    Node 5
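
The slide shows three copies of each block spread over five nodes. A minimal sketch of that idea follows. The round-robin policy and the names `place_blocks` and `surviving_replicas` are assumptions for illustration only; HDFS uses its own placement policy, so the exact layout will differ from the slide.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    cursor = 0
    for block in range(1, num_blocks + 1):
        # Consecutive positions modulo len(nodes) are always distinct nodes.
        placement[block] = [nodes[(cursor + i) % len(nodes)]
                            for i in range(replication)]
        cursor += replication
    return placement


def surviving_replicas(placement, failed_node):
    """Return, per block, the replicas that remain if one node is lost."""
    return {block: [n for n in holders if n != failed_node]
            for block, holders in placement.items()}


nodes = ["Node 1", "Node 2", "Node 3", "Node 4", "Node 5"]
placement = place_blocks(3, nodes)
for block, holders in placement.items():
    print("Block", block, "->", holders)

# Losing any single node still leaves at least two copies of every block.
print(surviving_replicas(placement, "Node 4"))
```

Because every block lives on three different nodes, the loss of one node never makes a block unreadable, which is the redundancy the slide illustrates.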


MapReduce
     Input                 Block 1   Block 2   Block 3   Block 4

                             map()

     Intermediate Data

                             reduce()

     Output


WordCount Example
                  map (offset, line) {
                    foreach word in line {
                      emit (word, 1)
                    }
                  }




WordCount Example
                  reduce (word, counts[]) {
                    total = 0
                    foreach number in counts[] {
                      total += number
                    }
                    emit (word, total)
                  }
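
The two pseudocode functions can be tried out locally with a small Python sketch. This is not the Hadoop Java API: `map_fn`, `reduce_fn`, and the in-memory driver `run_wordcount` are hypothetical names, and the driver only stands in for the work the framework does between the two phases.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Emit (word, 1) for every word in the line; the offset key is unused."""
    for word in line.split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Sum all the ones emitted for a single word."""
    yield (word, sum(counts))


def run_wordcount(lines):
    """Tiny in-memory driver standing in for the framework (illustration only)."""
    grouped = defaultdict(list)
    offset = 0
    for line in lines:
        for word, one in map_fn(offset, line):
            grouped[word].append(one)
        offset += len(line) + 1  # pretend each line ends with a newline
    result = {}
    for word, counts in grouped.items():
        for key, total in reduce_fn(word, counts):
            result[key] = total
    return result


print(run_wordcount(["to be or not", "to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On a real cluster, many map and reduce calls would run in parallel on different nodes; the single-process loop here only mirrors the data flow.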


map()



                       (   , 1)    (   , 1)   (   , 1)
                       (   , 1)    (   , 1)   (   , 1)
                       (   , 1)    (   , 1)   (   , 1)
                       (   , 1)    (   , 1)   (   , 1)



Sort & Shuffle
                       (   , 1)              (   , 1)             (   , 1)
                       (   , 1)              (   , 1)             (   , 1)
                       (   , 1)              (   , 1)             (   , 1)
                       (   , 1)              (   , 1)             (   , 1)


                           (      , (1,1))
                                                        (   , (1,1,1,1))
                           (      , (1,1))
                                                        (   , (1,1))
                           (      , (1))
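
The grouping step on this slide can be sketched as sorting the map output by key and collecting the values that share a key. The keys on the slide were pictures that did not survive extraction, so the letters "a", "b", and "c" below are placeholders, and `sort_and_shuffle` is a made-up name rather than a Hadoop function.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_shuffle(pairs):
    """Group the values of (key, value) pairs by key, with keys in sorted order."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


map_output = [("b", 1), ("a", 1), ("b", 1), ("c", 1),
              ("a", 1), ("b", 1), ("b", 1)]
print(sort_and_shuffle(map_output))
# [('a', [1, 1]), ('b', [1, 1, 1, 1]), ('c', [1])]
```

Each resulting (key, list) pair is what a single reduce() call receives as its input.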


reduce()
                       (       , (1,1))
                                          (   , (1,1,1,1))
                       (       , (1,1))
                                          (   , (1,1))
                       (       , (1))




                           (     , 2)         (   , 4)
                           (     , 2)         (   , 2)
                           (     , 1)

Use Case: Recommendations
                       • "People looking at this article also looked
                         at these articles"
                       • "You might also know these people"
                       • iTunes Genius Playlist
                       • Banner Placement

Use Case: Text Processing

                       • Document Indexing
                       • Semantic Analytics


Use Case: Machine Learning

                       • Spam vs No Spam
                       • Credit Card Fraud
                       • "People of Interest"

                       [Figure: Data, Data, Data -> Information]




Use Case: Graphs

                       • Shortest Paths
                       • Bottleneck Nodes
                       • Flow Optimization
                       • Spanning Trees
                       • Spanning Routes

Hadoop Ecosystem
                       Hive & Pig   High Level Language Access
                        HBase            Real Time Access
                        Sqoop          SQL to/from Hadoop
                         Flume      Distributed Data Collection
                         Oozie            Job Workflow
                        Mahout       Machine Learning Library

                                    many more


Homework

                       • Cloudera's Distribution including Hadoop
                         (CDH3)
                       • Online Tutorials
                       • WordCount Example
                       • Conference Demo Cluster

Quick Demo!



Thank you!

                       • Kai Voigt
                       • kai@cloudera.com
                       • http://www.cloudera.com/
                       • http://hadoop.apache.org/


Hadoop introduction berlin buzzwords 2011
