Introduction to Apache Hadoop

Apache Hadoop Presentation by Steve Watt at Data Day Austin 2011

  1. 1. Introduction to Apache Hadoop Steve Watt - IBM Big Data Lead @wattsteve #datadayaustin http://stevewatt.blogspot.com
  2. 2. The Origins of Hadoop
  3. 3. The Origins of Hadoop
       • A petabyte-scale explosion of data on the Internet and in the enterprise raises the following questions:
       • How do we handle unstructured data?
       • How do we scale?
       • An example: a need to process 100 TB datasets
       • On 1 node:
           • Scanning @ 50 MB/s = 23 days
       • On a 1000-node cluster:
           • Scanning @ 50 MB/s = 33 mins
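The two scan times on this slide can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and an idealized cluster in which the data is spread evenly and every node scans at the full rate with no coordination overhead; the function and constant names are illustrative only.

```python
# Back-of-the-envelope scan times for the slide's 100 TB example.
# Assumptions: decimal units and perfectly parallel scanning.

DATASET_BYTES = 100 * 10**12        # 100 TB
SCAN_RATE_BYTES_PER_S = 50 * 10**6  # 50 MB/s per node


def scan_seconds(nodes: int) -> float:
    """Seconds to scan the dataset when it is spread evenly over `nodes`."""
    return DATASET_BYTES / (SCAN_RATE_BYTES_PER_S * nodes)


single_node_days = scan_seconds(1) / 86_400   # 86,400 seconds per day
cluster_minutes = scan_seconds(1000) / 60

print(f"1 node:     {single_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

Under these assumptions the figures come out at roughly 23 days and 33 minutes, matching the slide.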
  4. 4. The Origins of Hadoop
       • In 2004 Google publishes seminal whitepapers on a new programming paradigm to handle data at Internet scale (Google processes upwards of 20 PB per day using Map/Reduce)
       • http://research.google.com/people/sanjay/index.html
       • The Apache Software Foundation launches Hadoop, an open-source implementation of Google Map/Reduce and the distributed Google File System
  5. 5. So what exactly is Apache Hadoop?
       • It is a cluster technology with a single master and multiple slaves, designed for commodity hardware.
       • It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce.
       • As data is copied onto HDFS, it is split into blocks and replicated to other machines (nodes) to provide redundancy.
       • Self-contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on each of the machines in the cluster, processing the data on the local machine (data locality).
       • Hadoop may execute or re-execute a job's tasks on any node in the cluster.
       • Node failures are automatically handled by the framework.
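The blocking and replication idea can be illustrated with a toy model. This is a minimal sketch, not HDFS code: it assumes a 64 MB block size and a replication factor of 3 (the stock defaults in Hadoop releases of that period), and it uses a simple round-robin placement that is purely hypothetical, since real HDFS uses a rack-aware policy. The node names and function names are invented for illustration.

```python
# Toy model of HDFS-style blocking and replication (not real HDFS code).
# Assumptions: 64 MB blocks, replication factor 3, round-robin placement.

BLOCK_SIZE = 64 * 1024 * 1024
REPLICATION = 3


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the length of each block a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: list, replication: int = REPLICATION) -> dict:
    """Map each block id to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def drop_node(placement: dict, failed: str) -> dict:
    """Return the replicas that survive after one node fails."""
    return {b: [n for n in replicas if n != failed] for b, replicas in placement.items()}


nodes = ["node1", "node2", "node3", "node4"]    # hypothetical cluster
blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
placement = place_replicas(len(blocks), nodes)
after_failure = drop_node(placement, "node2")

for block_id, replicas in after_failure.items():
    print(block_id, replicas)
```

In this model every block still has at least two live replicas after a single node failure, which is the redundancy property the slide describes.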
  6. 6. Hadoop – The Hadoop Cluster - Distributed File System - Map/Reduce
  7. 8. Hadoop - Map/Reduce
       • Setup
           • Jobs are submitted to each machine in the cluster to run against the blocks that are local to each particular machine
           • The job specifies an InputFormat, which knows how to read the data in the block
           • The InputFormat provides a record reader, which identifies all the records in the block for processing
       • Map step
           • One map task for each block (aka input split)
           • The map function is called for each record in the input dataset
           • Produces a list of (key, value) pairs
       • Reduce step
           • The Reducer receives a sorted list of keys with their corresponding values
           • The Reducer is called once for each key
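The setup, map, and reduce steps on this slide can be sketched as a single-process word count. This is plain Python that mimics the flow, not the Hadoop API: `record_reader`, `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, each string in `blocks` stands in for one HDFS block, and the grouping dictionary stands in for the framework's shuffle and sort.

```python
# Single-process imitation of the Map/Reduce flow (not the Hadoop API).
from collections import defaultdict


def record_reader(block: str):
    """Yield (byte offset, line) records from one block of text."""
    offset = 0
    for line in block.splitlines():
        yield offset, line
        offset += len(line) + 1  # +1 for the newline


def map_fn(key, value):
    """Called once per record; emits a (word, 1) pair for every word."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Called once per key with all of that key's values."""
    return key, sum(values)


def run_job(blocks):
    grouped = defaultdict(list)
    for block in blocks:                          # one map task per block
        for key, value in record_reader(block):   # one map call per record
            for out_key, out_value in map_fn(key, value):
                grouped[out_key].append(out_value)
    # The reducer sees keys in sorted order, once per key.
    return [reduce_fn(key, grouped[key]) for key in sorted(grouped)]


blocks = ["the quick brown fox\nthe lazy dog", "the dog barks"]
result = run_job(blocks)
print(result)
```

On a real cluster the map calls run in parallel on the nodes that hold each block, and the framework performs the grouping and sorting between the two steps.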
  8. 9. Hadoop - Map/Reduce on the Cluster
  9. 10. Hadoop - Map/Reduce Logical Flow
  10. 11. Hadoop – Map/Reduce – JobTracker Details
  11. 12. Hadoop – Map/Reduce – Job Details
  12. 13. Examples of Industry using Hadoop
       • Trend analysis of existing unstructured data (such as mining log files for key metrics)
       • Targeted crawling (obtains the data) coupled with information extraction and classification (structures the data)
       • Text analytics: the ability to run extractors over unstructured data to cleanse, structure and normalize it so that it can be queried via Pig, Hive or BigSheets
       • A programming model for cloud computing: Hadoop jobs running natively in the cloud, over data stored in the cloud, and storing the output in the cloud (Amazon EC2)
  13. 14. The Hadoop Ecosystem
       [Diagram: https://github.com/tomwhite/hadoop-ecosystem/raw/master/hadoop-ecosystem.dot.png]
       • Provisioning: ClusterChef / Apache Whirr
       • Load Tooling: Nutch / SQOOP / Flume
       • Offline Systems (Analytics): Hadoop
       • Online Systems (OLTP @ Scale): Cassandra / HBase
       • Scripting: Pig / WuKong
       • DBA: Hive
       • Non-Programmer: BigSheets / DataMeer
  14. 15. Installing and Running Hadoop - Demo
       • Modes: Standalone, Pseudo-Distributed, Fully Distributed
       • Pseudo-Distributed steps (http://stevewatt.blogspot.com):
           • Untar Hadoop in the desired directory
           • Set up passwordless SSH
           • Set JAVA_HOME in conf/hadoop-env.sh
           • Modify conf/hdfs-site.xml, conf/mapred-site.xml and conf/core-site.xml
           • Set conf/masters and conf/slaves to "localhost"
           • Format the NameNode: bin/hadoop namenode -format
           • Start Hadoop: bin/start-all.sh
           • Check runtime status at http://localhost:50030 (JobTracker) and http://localhost:50070 (NameNode)
           • Run the TeraGen/TeraSort/TeraValidate system test
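The slide does not say what goes into the three site files. The sketch below renders one plausible set of contents, assuming the values commonly used for a single-node, pseudo-distributed setup of Hadoop releases from that period (`fs.default.name`, `dfs.replication` and `mapred.job.tracker`, with ports 9000 and 9001). These values are an assumption rather than something taken from the presentation, and the helper names are illustrative.

```python
# Render minimal pseudo-distributed site files as strings.
# Assumption: property names and values follow the commonly documented
# single-node setup for Hadoop of that era; they are not from the slides.
from xml.sax.saxutils import escape

SITE_CONFIGS = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},   # one node, so one replica
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def render_site_xml(properties: dict) -> str:
    """Build a Hadoop-style <configuration> document from name/value pairs."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


for filename, properties in SITE_CONFIGS.items():
    print(f"--- conf/{filename}")
    print(render_site_xml(properties))
```

The printed text would be saved into the matching files under conf/ before formatting the NameNode; check the property names against the documentation for the Hadoop version actually installed.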