PASS Camp 2012


Big Data with Microsoft (Part 1)

Software Developer / Solution Architect
Twitter: @SaschaDittmann
Blog: http://www.sascha-dittmann.de
What could this be?


       180,000,000,000,000,000,000



     1,800,000,000,000,000,000,000
Worldwide Data Volume


       180,000,000,000,000,000,000
     = 0.18 ZB (zettabytes), as of 2006


     1,800,000,000,000,000,000,000
      = 1.8 ZB (zettabytes), as of 2011
Scaling




    Vertical Scaling (scale up)   Horizontal Scaling (scale out)
Apache Hadoop Ecosystem

[Diagram: the Hadoop ecosystem as a layered stack]

   Top layer:     Oozie (Workflow), Traditional BI Tools,
                  HBase / Cassandra (Columnar NoSQL Databases)
   Tool layer:    Pig (Data Flow), Hive (Warehouse and Data Access),
                  Cascading (programming model), Apache Mahout, Flume, Sqoop
   Cross-cutting: Zookeeper (Coordination), Avro (Serialization)
   Storage:       HBase (Column DB)
   Core:          MapReduce (Job Scheduling/Execution System) running on
                  HDFS (Hadoop Distributed File System); Hadoop = MapReduce + HDFS
Apache Hadoop Ecosystem

[Diagram: the same stack, framed by Microsoft integration points]

   Microsoft:     Visual Studio, Traditional BI Tools,
                  Active Directory, System Center; the whole stack runs on Windows
   Top layer:     Oozie (Workflow), HBase / Cassandra (Columnar NoSQL Databases)
   Tool layer:    Pig (Data Flow), Hive (Warehouse and Data Access),
                  Cascading (programming model), Apache Mahout, Flume, Sqoop
   Cross-cutting: Zookeeper (Coordination), Avro (Serialization)
   Storage:       HBase (Column DB)
   Core:          MapReduce (Job Scheduling/Execution System) running on
                  HDFS (Hadoop Distributed File System); Hadoop = MapReduce + HDFS
Hadoop Distributed File System

Boot process (the NameNode loads its filesystem image and edit log; DataNodes report their blocks)
Fault tolerance (blocks are replicated across DataNodes and re-replicated when a node fails)
Client request (the client asks the NameNode for block locations, then reads directly from the DataNodes)
Hadoop Distributed File System

   Portable Operating System Interface (POSIX)
   Replication across multiple DataNodes
  js> #ls input/ncdc
  Found 9 items
  drwxr-xr-x - Sascha   supergroup   0 2012-04-24 13:01 /user/Sascha/input/ncdc/_distcp_logs_g0dedn
  drwxr-xr-x - Sascha   supergroup   0 2012-04-24 12:04 /user/Sascha/input/ncdc/_distcp_logs_ofj0u6
  drwxr-xr-x - Sascha   supergroup   0 2012-04-24 13:09 /user/Sascha/input/ncdc/all
  drwxr-xr-x - Sascha   supergroup   0 2012-04-24 13:01 /user/Sascha/input/ncdc/all2
  drwxr-xr-x - Sascha   supergroup   0 2012-04-23 13:06 /user/Sascha/input/ncdc/metadata
  drwxr-xr-x - Sascha   supergroup   0 2012-04-23 13:06 /user/Sascha/input/ncdc/micro
  drwxr-xr-x - Sascha   supergroup   0 2012-04-23 13:06 /user/Sascha/input/ncdc/micro-tab
  -rw-r--r-- 3 Sascha   supergroup   529 2012-04-23 13:06 /user/Sascha/input/ncdc/sample.txt
  -rw-r--r-- 3 Sascha   supergroup   168 2012-04-23 13:06 /user/Sascha/input/ncdc/sample.txt.gz
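
In the listing, the second column is the replication factor: both sample files are stored as three replicas (directories show "-"), which is the "replication across multiple DataNodes" point above in action. Assuming the console's # commands mirror the hadoop fs shell the way #ls does, a file can be inspected in place:

   js> #cat input/ncdc/sample.txt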
Map / Reduce

[Diagram: a MapReduce job computing the maximum temperature per year
 from NCDC weather records spread across three DataNodes]

   Input records (on the DataNodes):
      0067011990999991950051507004+68750
      0043011990999991950051512004+68750
      0043011990999991950051518004+68750
      0043012650999991949032412004+62300
      0043012650999991949032418004+62300

   Map (one per DataNode) emits (year, temperature):
      1949,0   1952,-11   1950,22   1950,55   1950,33

   Sort / Shuffle groups the values by year:
      1949,0   1950,[22,33,55]   1952,-11

   Reduce picks the maximum per year:
      1949,0   1950,55   1952,-11
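
In the demos, the same job is written for the interactive JavaScript console (Hadoop on Azure). A minimal sketch, assuming the console's map/reduce signature (key, value, context with context.write, plus a values iterator in reduce) and the field offsets of the full NCDC record format (year at 15..18, signed air temperature at 87..91, 9999 meaning "missing"):

   var map = function (key, value, context) {
       // Offsets follow the full NCDC record layout, not the
       // truncated sample lines shown above
       var year = value.substr(15, 4);
       var temp = (value.charAt(87) == '+')
           ? parseInt(value.substr(88, 4))
           : parseInt(value.substr(87, 5));
       if (temp != 9999) {              // 9999 marks a missing reading
           context.write(year, temp);
       }
   };

   var reduce = function (key, values, context) {
       // values iterates over every temperature collected for one year
       var max = -Infinity;
       while (values.hasNext()) {
           max = Math.max(max, parseInt(values.next()));
       }
       context.write(key, max);
   };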
Combine Method

[Diagram: the same job with an additional combine step on each DataNode]

   Input records (on the DataNodes):
      0067011990999991950051507004+68750
      0043011990999991950051512004+68750
      0043011990999991950051518004+68750
      0043012650999991949032412004+62300
      0043012650999991949032418004+62300

   Map emits (year, temperature):
      1949,0   1952,-11   1950,22   1950,55   1950,33

   Combine pre-aggregates locally, before the data leaves the node:
      1949,0   1950,55        1952,-11   1950,33

   Sort / Shuffle now has less data to move:
      1949,0   1950,[33,55]   1952,-11

   Reduce picks the maximum per year:
      1949,0   1950,55   1952,-11
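
The combine step is only legal because max() is commutative and associative: partial maxima taken per node merge into the same final result, so the reduce function above can double as the combiner. A quick illustration, including an operation for which this breaks:

   // Partial maxima merge cleanly; both sides evaluate to 55:
   Math.max(Math.max(22, 55), 33) == Math.max(22, Math.max(33, 55));

   // Averages do not: avg(avg(22, 55), 33) = 35.75, while
   // avg(22, 33, 55) = 36.67, so an averaging job must not
   // reuse its reducer as a combiner.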
RDBMS vs. Hadoop


                            RDBMS                    Hadoop
  Data volume               Gigabytes                Petabytes
  Access                    Interactive and batch    Batch
  Reads / writes            Many reads and writes    Write once, read many
  Data structure            Static schema            Dynamic schema
  Data integrity            High                     Low
  Scaling behavior          Non-linear               Linear
Demos


    Hadoop environment
    HDFS
    Map/Reduce via JavaScript
    Data streaming with C#
    Power Pivot
Pig Latin
   // Fluent JavaScript wrapper around Pig, as offered by the
   // interactive console:
   pig
   .from("/user/Sascha/input/texte")         // read the input directory
   .mapReduce("/user/…/WordCount.js"         // run the word-count script
               , "Woerter, Anzahl:long")     // output schema: word, count
   .orderBy("Anzahl DESC")                   // sort by count, descending
   .take(15)                                 // keep the top 15
   .to("/user/Sascha/output/Top15Woerter")   // write the result
