Hadoop
Scott Leberknight
Yahoo! "Search Assist"
Notable Hadoop users..
           Yahoo!            LinkedIn

         Facebook          New York Times

           Twitter          Rackspace

            Baidu           eHarmony

            eBay             Powerset
                             http://wiki.apache.org/hadoop/PoweredBy
Hadoop in the Real
     World..
Recommendation systems              Financial analysis

Natural Language Processing (NLP)   Correlation engines

Data warehousing                    Image/video processing

Market research/forecasting         Log analysis
Finance                  Social networking

Health & Life Sciences   Academic research

Government               Telecommunications
History..
Inspired by Google's File System (GFS) and
    MapReduce papers, circa 2003-2004



      Created by Doug Cutting



Originally built to support distribution
       for Nutch search engine



    Named after a stuffed elephant
OK, So what exactly
     is Hadoop?
An open source...



       batch/offline oriented...



             data & I/O intensive...



                       general purpose framework for
                     creating distributed applications that
                        process huge amounts of data.
One definition of "huge"
              25,000 machines

           More than 10 clusters

3 petabytes of data (compressed, unreplicated)

                 700+ users

             10,000+ jobs/week
Hadoop Major Components:

         Distributed File System
                 (HDFS)

                Map/Reduce System
But first, what
 isn't Hadoop?
Hadoop is NOT:
   ...a relational database!



    ...an online transaction processing (OLTP) system!



    ...a structured data store of any kind!
Hadoop vs. Relational
Hadoop                          Relational

Scale-out                       Scale-up(*)
Key/value pairs                 Tables
Say how to process the data     Say what you want (SQL)
Offline/batch                   Online/real-time

 (*) Sharding attempts to horizontally scale RDBMS, but is difficult at best
HDFS
(Hadoop Distributed File System)
Data is distributed and replicated
    over multiple machines



    Designed for large files
(where "large" means GB to TB)



        Block oriented



Linux-style commands, e.g. ls, cp,
           mv, rm, etc.
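
Because HDFS is block oriented, a large file is stored as a sequence of fixed-size blocks. The sketch below shows the arithmetic only. The 64 MB block size and the class name BlockMath are assumptions for illustration; the actual block size is a cluster setting.

```java
// Sketch: how many blocks a file occupies for a given block size.
final class BlockMath {
    // Assumed block size for illustration; real clusters configure this.
    static final long ASSUMED_BLOCK_SIZE = 64L * 1024 * 1024;

    /** Number of blocks needed for fileSize bytes (ceiling division). */
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    /** Size in bytes of the final block, which may be smaller than the rest. */
    static long lastBlockSize(long fileSize, long blockSize) {
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb, ASSUMED_BLOCK_SIZE)); // 16
    }
}
```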
NameNode
                      File Block Mappings:

                      /user/aaron/data1.txt -> 1, 2, 3
                      /user/aaron/data2.txt -> 4, 5
                      /user/andrew/data3.txt -> 6, 7



DataNode(s)

5   1         4   2                     2                3   7
    4             6              1      4                    6
2             3                  6                           1
    3         7   5              7                       5
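
The bookkeeping pictured above can be modeled as two maps: file path to block ids, and block id to the DataNodes holding a replica. This is a toy model, not the real HDFS data structures. The file-to-block mappings come from the slide; the DataNode names (dn0..dn3) and the round-robin placement with three replicas are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of NameNode bookkeeping; not the real HDFS implementation.
final class NameNodeSketch {
    // File path -> ordered block ids (the mappings shown on the slide).
    private final Map<String, List<Integer>> fileBlocks = new LinkedHashMap<>();
    // Block id -> names of DataNodes holding a replica.
    private final Map<Integer, Set<String>> blockLocations = new TreeMap<>();

    void addFile(String path, Integer... blockIds) {
        fileBlocks.put(path, Arrays.asList(blockIds));
    }

    void addReplica(int blockId, String dataNode) {
        blockLocations.computeIfAbsent(blockId, id -> new TreeSet<>()).add(dataNode);
    }

    /** For each block of the file, in order, the DataNodes that hold it. */
    List<Set<String>> locate(String path) {
        List<Integer> blocks = fileBlocks.get(path);
        if (blocks == null) {
            throw new NoSuchElementException("no such file: " + path);
        }
        List<Set<String>> result = new ArrayList<>();
        for (Integer blockId : blocks) {
            result.add(blockLocations.getOrDefault(blockId, Collections.<String>emptySet()));
        }
        return result;
    }

    /** Block ids left with fewer than minReplicas copies if dataNode failed. */
    List<Integer> underReplicatedAfterLoss(String dataNode, int minReplicas) {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Set<String>> e : blockLocations.entrySet()) {
            int remaining = e.getValue().size() - (e.getValue().contains(dataNode) ? 1 : 0);
            if (remaining < minReplicas) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** The slide's three files with a hypothetical round-robin placement. */
    static NameNodeSketch example() {
        NameNodeSketch nn = new NameNodeSketch();
        nn.addFile("/user/aaron/data1.txt", 1, 2, 3);
        nn.addFile("/user/aaron/data2.txt", 4, 5);
        nn.addFile("/user/andrew/data3.txt", 6, 7);
        for (int block = 1; block <= 7; block++) {
            for (int copy = 0; copy < 3; copy++) {
                nn.addReplica(block, "dn" + ((block + copy) % 4));
            }
        }
        return nn;
    }
}
```

The second method hints at why the cluster can self-heal: once the NameNode knows which blocks lost a replica, it can ask the remaining DataNodes to copy them.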
fault tolerant when nodes fail

Self-healing      rebalances files across cluster




scalable   just by adding new nodes!
Map/Reduce
Split input files (e.g. by HDFS blocks)



    Operate on key/value pairs



Mappers filter & transform input data



 Reducers aggregate mapper output
move code to data
map:
       (K1, V1) -> list(K2, V2)

reduce:
       (K2, list(V2)) -> list(K3, V3)
Word Count
(the canonical Map/Reduce example)
the quick brown fox
    jumped over
 the lazy brown dog
map phase - inputs
                  (K1, V1)

           (0, "the quick brown fox")

           (20, "jumped over")

           (32, "the lazy brown dog")
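
The keys 0, 20 and 32 are the byte offsets at which each line starts in the input, which is how plain text input is typically keyed. A minimal sketch of that calculation follows, assuming UTF-8 text and single-byte '\n' line endings; the class name LineOffsets is made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: derive the (offset, line) pairs fed to the map phase.
final class LineOffsets {
    /** Maps each line's starting byte offset to its text, in file order. */
    static Map<Long, String> offsets(String text) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            out.put(offset, line);
            // Advance past the line's bytes plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "the quick brown fox\njumped over\nthe lazy brown dog\n";
        System.out.println(offsets(input).keySet()); // [0, 20, 32]
    }
}
```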
map phase - outputs
             list(K2, V2)

("the", 1)            ("quick", 1)

("brown", 1)          ("fox", 1)

("jumped", 1)         ("over", 1)

("the", 1)            ("lazy", 1)

("brown", 1)          ("dog", 1)
reduce phase - inputs
     (K2, list(V2))

    ("brown", (1, 1))       ("dog", (1))

    ("fox", (1))            ("jumped", (1))

    ("lazy", (1))           ("over", (1))

    ("quick", (1))          ("the", (1, 1))
reduce phase - outputs
               list(K3, V3)

("brown", 2)              ("dog", 1)

("fox", 1)                ("jumped", 1)

("lazy", 1)               ("over", 1)

("quick", 1)              ("the", 2)
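
The four steps traced above (map inputs, map outputs, grouped reduce inputs, reduce outputs) can be imitated in a single process. This is only a sketch with no Hadoop classes involved; the class and method names are invented here, and a sorted in-memory map stands in for the shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-process imitation of map, shuffle and reduce for word count.
final class InMemoryWordCount {

    /** Map phase: emit (word, 1) for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(st.nextToken(), 1));
        }
        return out;
    }

    /** Shuffle: group values by key, with keys in sorted order. */
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the list of values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs all three phases over the given lines. */
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {brown=2, dog=1, fox=1, jumped=1, lazy=1, over=1, quick=1, the=2}
        System.out.println(run(Arrays.asList(
            "the quick brown fox", "jumped over", "the lazy brown dog")));
    }
}
```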
WordCount in code..
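import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SimpleWordCount
  extends Configured implements Tool {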

    public static class MapClass
      extends Mapper<Object, Text, Text, IntWritable> {
      ...
    }

    public static class Reduce
      extends Reducer<Text, IntWritable, Text, IntWritable> {
      ...
    }

    public int run(String[] args) throws Exception { ... }

    public static void main(String[] args) { ... }
}
public static class MapClass
  extends Mapper<Object, Text, Text, IntWritable> {

    // Reused for every emitted pair; IntWritable takes an int, not a long.
    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {

        // Split the line on whitespace and emit (word, 1) for each token.
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
          word.set(st.nextToken());
          context.write(word, ONE);
        }
    }
}
public static class Reduce
  extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context)
      throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        count.set(sum);
        context.write(key, count);
    }
}
public int run(String[] args) throws Exception {
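    Configuration conf = getConf();

    // Newer Hadoop releases deprecate this constructor in favor of
    // Job.getInstance(conf, "Counting Words").
    Job job = new Job(conf, "Counting Words");
    job.setJarByClass(SimpleWordCount.class);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    // The reducer emits IntWritable counts, so the output value class must match.
    job.setOutputValueClass(IntWritable.class);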

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
  int result = ToolRunner.run(new Configuration(),
                              new SimpleWordCount(),
                              args);
  System.exit(result);
}
Map/Reduce Data Flow




(Image from Hadoop in Action...great book!)
Partitioning
 Deciding which keys go to which reducer


  Desire even distribution across reducers


Skewed data can overload a single reducer!
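
A common way to decide which reducer gets a key is to hash the key and take it modulo the number of reducers, which is the idea behind Hadoop's default hash partitioner. The sketch below uses plain Java strings rather than Hadoop key types, so the actual partition numbers would differ on a real cluster; the class name is invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of hash partitioning and of how skewed keys pile up on one reducer.
final class HashPartitionSketch {

    /** Reducer index for a key: non-negative hash modulo the reducer count. */
    static int partition(String key, int numReducers) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // result is never negative even when hashCode() is.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** How many records each reducer would receive for the given keys. */
    static int[] load(List<String> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String key : keys) {
            counts[partition(key, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Skewed input: one very frequent key plus a few rare ones.
        List<String> keys = new ArrayList<>(Collections.nCopies(1000, "the"));
        keys.addAll(Arrays.asList("quick", "brown", "fox", "lazy", "dog"));
        // Whichever reducer owns "the" receives at least 1000 records.
        System.out.println(Arrays.toString(load(keys, 4)));
    }
}
```

Every occurrence of a key lands on the same reducer, which is what makes the reduce step correct and also what makes a single hot key a bottleneck.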
Map/Reduce Partitioning & Shuffling




(Image from Hadoop in Action...great book!)
Combiner
Effectively a reduce in the mappers


       a.k.a. "Local Reduce"
Shuffling WordCount
                      data             # k/v pairs shuffled

without combiner      ("the", 1)       1000

with combiner         ("the", 1000)    1



         (looking at one mapper that sees the word "the" 1000 times)
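
The numbers in the table can be reproduced with a small simulation of one mapper's output before and after a local reduce. This is a sketch in plain Java, not Hadoop's combiner machinery, and the names are made up for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: effect of a combiner on the number of pairs leaving one mapper.
final class CombinerSketch {

    /** What one mapper emits after seeing `word` the given number of times. */
    static List<Map.Entry<String, Integer>> mapOutput(String word, int times) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int i = 0; i < times; i++) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
        }
        return out;
    }

    /** Local reduce: sum values per key before anything is shuffled. */
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = mapOutput("the", 1000);
        System.out.println(raw.size());          // 1000 pairs without a combiner
        System.out.println(combine(raw).size()); // 1 pair with a combiner
    }
}
```

In the WordCount job, the Reduce class could probably double as the combiner via job.setCombinerClass(Reduce.class), since summing gives the same result no matter how the values are grouped.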
Advanced Map/Reduce
     Hadoop Streaming


  Chaining Map/Reduce jobs


        Joining data


        Bloom filters
Architecture
HDFS                    Map/Reduce

NameNode                JobTracker

Secondary NameNode      TaskTracker

DataNode
               Secondary NameNode




               NameNode                   JobTracker




 DataNode1                  DataNode2                   DataNodeN



TaskTracker1               TaskTracker2                TaskTrackerN
map                        map                          map


    reduce                     reduce                      reduce
NameNode
     Bookkeeper for HDFS


      Manages DataNodes


Should not store data or run jobs


     Single point of failure!
DataNode
   Store actual file blocks on disk


    Does not store entire files!


  Report block info to NameNode


Receive instructions from NameNode
Secondary NameNode

    Snapshot of NameNode


Not a failover server for NameNode!


Help minimize downtime/data loss
       if NameNode fails
JobTracker

 Partition tasks across HDFS cluster


       Track map/reduce tasks


Re-start failed tasks on different nodes


         Speculative execution
TaskTracker

Track individual map & reduce tasks


  Report progress to JobTracker
Monitoring/
 Debugging
distributed processing



distributed debugging
Logs
      View task logs on machine where
         specific task was processed
               (or via web UI)


$HADOOP_HOME/logs/userlogs on task tracker
Counters
       Define one or more counters


Increment counters during map/reduce tasks


 Counter values displayed in job tracker UI
IsolationRunner

Re-run failed tasks with original input data



  Must set keep.failed.task.files to 'true'
Skipping Bad Records
        Data may not always be clean


  New data may have new interesting twists


Can you pre-process to filter & validate input?
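
One way to pre-process is a small validation step that rejects malformed lines instead of letting them crash a task. The sketch below assumes a hypothetical record format of two comma-separated integers; the class name and the format are illustrative, not something the deck prescribes.

```java
import java.util.List;
import java.util.Optional;

// Sketch: validate input records and count the ones that get skipped.
final class RecordFilter {

    /** Parses "a,b" into two longs, or returns empty if the line is malformed. */
    static Optional<long[]> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        String[] parts = line.trim().split(",", -1);
        if (parts.length != 2) {
            return Optional.empty();
        }
        try {
            long first = Long.parseLong(parts[0].trim());
            long second = Long.parseLong(parts[1].trim());
            return Optional.of(new long[] {first, second});
        } catch (NumberFormatException e) {
            // Non-numeric fields (for example a quoted header row) are skipped.
            return Optional.empty();
        }
    }

    /** Number of lines that fail validation. */
    static int countBad(List<String> lines) {
        int bad = 0;
        for (String line : lines) {
            if (!parse(line).isPresent()) {
                bad++;
            }
        }
        return bad;
    }
}
```

Inside a real mapper, the same check could skip the record and increment a counter so the number of bad lines shows up alongside the job's other statistics.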
Performance Tuning
Speculative execution       Use a Combiner
(on by default)

Reduce amount of            JVM Re-use
input data                  (be careful)

Data compression            Refactor code/algorithms
Managing
Hadoop
Lots of knobs               Trash can

Needs active management     Add/remove data nodes

"Fair" scheduling           Network topology/rack awareness

NameNode/SNN management     Permissions/quotas
Hive
Simulate structure for data stored in Hadoop



Query language analogous to SQL (Hive QL)



Translates queries into Map/Reduce job(s)...



     ...so not for real-time processing!
Queries:
     Projection           Joins (inner, outer, semi)

     Grouping             Aggregation

     Sub-queries          Multi-table insert


Customizable:
     User-defined functions

      Input/output formats with SerDe
Patent citation dataset
http://www.nber.org/patents

/user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt

"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927
3858243,3221341
3858243,3574238
...
create external table patent_citations (citing string, cited string)
row format delimited fields terminated by ','
stored as textfile
location '/user/sleberkn/nber-patent/tables/patent_citation';



create table citation_histogram (num_citations int, count int)
stored as sequencefile;
insert overwrite table citation_histogram
select num_citations, count(num_citations) from
    (select cited, count(cited) as num_citations
    from patent_citations group by cited) citation_counts
group by num_citations
order by num_citations;
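
To make the two-level grouping in the query concrete, here is roughly what it computes, written as plain Java over an in-memory list: first count citations per cited patent, then count how many patents share each citation count. The rows used in the usage example are made-up values, not taken from the dataset, and the class name is invented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: the citation histogram computed in memory.
final class CitationHistogram {

    /**
     * rows holds {citing, cited} pairs. Returns num_citations -> number of
     * patents cited that many times, ordered by num_citations.
     */
    static SortedMap<Integer, Integer> histogram(List<long[]> rows) {
        // Inner query: group by cited, count rows per cited patent.
        Map<Long, Integer> citationCounts = new HashMap<>();
        for (long[] row : rows) {
            citationCounts.merge(row[1], 1, Integer::sum);
        }
        // Outer query: group by num_citations, count patents per value.
        SortedMap<Integer, Integer> hist = new TreeMap<>();
        for (int numCitations : citationCounts.values()) {
            hist.merge(numCitations, 1, Integer::sum);
        }
        return hist;
    }

    public static void main(String[] args) {
        // Hypothetical rows: patent 100 is cited 3 times, 200 and 300 once each.
        List<long[]> rows = Arrays.asList(
            new long[] {1, 100}, new long[] {2, 100}, new long[] {3, 100},
            new long[] {1, 200}, new long[] {2, 300});
        System.out.println(histogram(rows)); // {1=2, 3=1}
    }
}
```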
Hadoop in the clouds
Amazon EC2 + S3
EC2 instances are compute nodes (Map/Reduce)


Storage options:

    HDFS on EC2 nodes

    HDFS on EC2 nodes loading data from S3

    Native S3 (bypasses HDFS)
Amazon Elastic MapReduce
         Interact via web-based console


            Submit Map/Reduce job
               (streaming, Hive, Pig, or JAR)




EMR configures & launches Hadoop cluster for job


         Uses S3 for data input/output
Recap..
Hadoop = HDFS + Map/Reduce


Distributed, parallel processing


 Designed for fault tolerance


     Horizontal scale-out


 Structure & queries via Hive
References
http://hadoop.apache.org/

http://hadoop.apache.org/hive/

Hadoop in Action
 http://www.manning.com/lam/

Hadoop: The Definitive Guide, 2nd ed.
 http://oreilly.com/catalog/0636920010388

Yahoo! Hadoop blog
 http://developer.yahoo.net/blogs/hadoop/

Cloudera
 http://www.cloudera.com/
http://lmgtfy.com/?q=hadoop

http://www.letmebingthatforyou.com/?q=hadoop
(my info)


scott.leberknight@nearinfinity.com
www.nearinfinity.com/blogs/
twitter: sleberknight

Weitere ähnliche Inhalte

Was ist angesagt?

Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slidesryancox
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answerstechieguy85
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basicHafizur Rahman
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1Vemula Ravi
 

Was ist angesagt? (20)

Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Unit 1
Unit 1Unit 1
Unit 1
 
Bd class 2 complete
Bd class 2 completeBd class 2 complete
Bd class 2 complete
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answers
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
 

Ähnlich wie Hadoop

Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseCloudera, Inc.
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoopdatasalt
 
Whirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveWhirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveEdward Capriolo
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopSvetlin Nakov
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pigSudar Muthu
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueShay Sofer
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansattilacsordas
 

Ähnlich wie Hadoop (20)

Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBase
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
 
Whirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveWhirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIve
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
 

Mehr von Scott Leberknight (20)

JShell & ki
JShell & kiJShell & ki
JShell & ki
 
JUnit Pioneer
JUnit PioneerJUnit Pioneer
JUnit Pioneer
 
JDKs 10 to 14 (and beyond)
JDKs 10 to 14 (and beyond)JDKs 10 to 14 (and beyond)
JDKs 10 to 14 (and beyond)
 
Unit Testing
Unit TestingUnit Testing
Unit Testing
 
SDKMAN!
SDKMAN!SDKMAN!
SDKMAN!
 
JUnit 5
JUnit 5JUnit 5
JUnit 5
 
AWS Lambda
AWS LambdaAWS Lambda
AWS Lambda
 
Dropwizard
DropwizardDropwizard
Dropwizard
 
RESTful Web Services with Jersey
RESTful Web Services with JerseyRESTful Web Services with Jersey
RESTful Web Services with Jersey
 
httpie
httpiehttpie
httpie
 
jps & jvmtop
jps & jvmtopjps & jvmtop
jps & jvmtop
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
Java 8 Lambda Expressions
Java 8 Lambda ExpressionsJava 8 Lambda Expressions
Java 8 Lambda Expressions
 
Google Guava
Google GuavaGoogle Guava
Google Guava
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
iOS
iOSiOS
iOS
 
Apache ZooKeeper
Apache ZooKeeperApache ZooKeeper
Apache ZooKeeper
 
HBase Lightning Talk
HBase Lightning TalkHBase Lightning Talk
HBase Lightning Talk
 
wtf is in Java/JDK/wtf7?
wtf is in Java/JDK/wtf7?wtf is in Java/JDK/wtf7?
wtf is in Java/JDK/wtf7?
 
CoffeeScript
CoffeeScriptCoffeeScript
CoffeeScript
 

Kürzlich hochgeladen

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Kürzlich hochgeladen (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Hadoop

  • 5. Notable Hadoop users: Yahoo!, LinkedIn, Facebook, New York Times, Twitter, Rackspace, Baidu, eHarmony, eBay, Powerset (http://wiki.apache.org/hadoop/PoweredBy)
  • 6. Hadoop in the Real World...
  • 7. Recommendation systems; Financial analysis; Natural Language Processing (NLP); Correlation engines; Data warehousing; Image/video processing; Market research/forecasting; Log analysis
  • 8. Finance; Social networking; Health & Life Sciences; Academic research; Government; Telecommunications
  • 10. Inspired by Google's Google File System (GFS) and MapReduce papers, circa 2004. Created by Doug Cutting. Originally built to support distribution for the Nutch search engine. Named after a stuffed elephant.
  • 11. OK, so what exactly is Hadoop?
  • 12. An open source, batch/offline oriented, data & I/O intensive, general purpose framework for creating distributed applications that process huge amounts of data.
  • 13. One definition of "huge": 25,000 machines; more than 10 clusters; 3 petabytes of data (compressed, unreplicated); 700+ users; 10,000+ jobs/week
  • 14. Hadoop Major Components: Distributed File System (HDFS); Map/Reduce System
  • 15. But first, what isn't Hadoop?
  • 16. Hadoop is NOT: ...a relational database! ...an online transaction processing (OLTP) system! ...a structured data store of any kind!
  • 18. Hadoop vs. Relational: Scale-out vs. Scale-up(*); Key/value pairs vs. Tables; Say how to process the data vs. Say what you want (SQL); Offline/batch vs. Online/real-time. (*) Sharding attempts to horizontally scale RDBMS, but is difficult at best
  • 20. Data is distributed and replicated over multiple machines. Designed for large files (where "large" means GB to TB). Block oriented. Linux-style commands, e.g. ls, cp, mv, rm, etc.
  • 21. NameNode file-to-block mappings: /user/aaron/data1.txt -> 1, 2, 3; /user/aaron/data2.txt -> 4, 5; /user/andrew/data3.txt -> 6, 7. DataNode(s) (diagram): each of blocks 1-7 appears three times across the DataNodes.
  • 22. Fault tolerant when nodes fail. Self-healing: rebalances files across cluster. Scalable just by adding new nodes!
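The bookkeeping described on the HDFS slides can be sketched in plain Java. This is a toy model only: the class `BlockMapSketch`, its round-robin placement, the replication factor of 3 (read off the diagram, where each block appears three times), the seven DataNodes, and the 64 MB block size are all assumptions made for illustration. None of it is the HDFS API or its real placement policy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of NameNode/DataNode bookkeeping (hypothetical, not the HDFS API). */
class BlockMapSketch {
    static final int REPLICATION = 3; // assumed: each block is stored three times

    /** Number of fixed-size blocks needed to hold a file (ceiling division). */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /**
     * Places every block on REPLICATION distinct DataNodes, round-robin.
     * Returns a map of DataNode index to the block ids stored there.
     */
    static Map<Integer, List<Integer>> placeBlocks(int blocks, int dataNodes) {
        if (dataNodes < REPLICATION) {
            throw new IllegalArgumentException("need at least " + REPLICATION + " DataNodes");
        }
        Map<Integer, List<Integer>> nodes = new LinkedHashMap<>();
        for (int n = 0; n < dataNodes; n++) {
            nodes.put(n, new ArrayList<>());
        }
        for (int block = 1; block <= blocks; block++) {
            for (int replica = 0; replica < REPLICATION; replica++) {
                nodes.get((block + replica) % dataNodes).add(block);
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // NameNode side: file path -> block ids, as in the slide's mapping.
        Map<String, List<Integer>> fileToBlocks = new LinkedHashMap<>();
        fileToBlocks.put("/user/aaron/data1.txt", List.of(1, 2, 3));
        fileToBlocks.put("/user/aaron/data2.txt", List.of(4, 5));
        fileToBlocks.put("/user/andrew/data3.txt", List.of(6, 7));
        System.out.println(fileToBlocks);

        // DataNode side: 7 blocks spread over 7 hypothetical DataNodes.
        System.out.println(placeBlocks(7, 7));

        // A 1 GB file with an assumed 64 MB block size needs 16 blocks.
        System.out.println(blockCount(1L << 30, 64L << 20));
    }
}
```

Because every block lives on three different nodes in this model, losing any single node still leaves two copies of each block it held, which is the property the "fault tolerant" slide relies on.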
  • 24. Split input files (e.g. by HDFS blocks). Operate on key/value pairs. Mappers filter & transform input data. Reducers aggregate mapper output.
  • 25. Move code to data
  • 26. map: (K1, V1) -> list(K2, V2); reduce: (K2, list(V2)) -> list(K3, V3)
  • 27. Word Count (the canonical Map/Reduce example)
  • 28. the quick brown fox jumped over the lazy brown dog
  • 29. map phase - inputs (K1, V1): (0, "the quick brown fox") (20, "jumped over") (32, "the lazy brown dog")
  • 30. map phase - outputs list(K2, V2): ("the", 1) ("quick", 1) ("brown", 1) ("fox", 1) ("jumped", 1) ("over", 1) ("the", 1) ("lazy", 1) ("brown", 1) ("dog", 1)
  • 31. reduce phase - inputs (K2, list(V2)): ("brown", (1, 1)) ("dog", (1)) ("fox", (1)) ("jumped", (1)) ("lazy", (1)) ("over", (1)) ("quick", (1)) ("the", (1, 1))
  • 32. reduce phase - outputs list(K3, V3): ("brown", 2) ("dog", 1) ("fox", 1) ("jumped", 1) ("lazy", 1) ("over", 1) ("quick", 1) ("the", 2)
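The four phases above can be reproduced in memory without a cluster. The following is a minimal sketch, assuming a hypothetical class `WordCountSimulation` that stands in for the framework: it uses ordinary Java collections rather than Hadoop types, and its `shuffle` step simply groups and sorts by key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** In-memory walk through map -> shuffle -> reduce (no Hadoop involved). */
class WordCountSimulation {

    /** map: (K1 = byte offset, V1 = line) -> list of (K2 = word, V2 = 1). */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            out.add(Map.entry(st.nextToken(), 1));
        }
        return out;
    }

    /** shuffle: group all V2 values by K2, with keys in sorted order. */
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** reduce: (K2 = word, list(V2)) -> V3 = total count. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Runs all three phases over (offset, line) inputs. */
    static Map<String, Integer> run(Map<Long, String> input) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (Map.Entry<Long, String> record : input.entrySet()) {
            mapped.addAll(map(record.getKey(), record.getValue()));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, String> input = new LinkedHashMap<>();
        input.put(0L, "the quick brown fox");
        input.put(20L, "jumped over");
        input.put(32L, "the lazy brown dog");
        System.out.println(run(input));
        // prints {brown=2, dog=1, fox=1, jumped=1, lazy=1, over=1, quick=1, the=2}
    }
}
```

The map step ignores the offset key, just as the word-count mapper has no use for it; only the line text matters.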
  • 34. public class SimpleWordCount extends Configured implements Tool { public static class MapClass extends Mapper<Object, Text, Text, IntWritable> { ... } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { ... } public int run(String[] args) throws Exception { ... } public static void main(String[] args) { ... } }
  • 35. public static class MapClass extends Mapper<Object, Text, Text, IntWritable> { private static final IntWritable ONE = new IntWritable(1); private Text word = new Text(); @Override protected void map(Object key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer st = new StringTokenizer(value.toString()); while (st.hasMoreTokens()) { word.set(st.nextToken()); context.write(word, ONE); } } }
  • 36. public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable count = new IntWritable(); @Override protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable value : values) { sum += value.get(); } count.set(sum); context.write(key, count); } }
  • 37. public int run(String[] args) throws Exception { Configuration conf = getConf(); Job job = new Job(conf, "Counting Words"); job.setJarByClass(SimpleWordCount.class); job.setMapperClass(MapClass.class); job.setReducerClass(Reduce.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); return job.waitForCompletion(true) ? 0 : 1; }
  • 38. public static void main(String[] args) throws Exception { int result = ToolRunner.run(new Configuration(), new SimpleWordCount(), args); System.exit(result); }
  • 39. Map/Reduce Data Flow (Image from Hadoop in Action...great book!)
  • 40. Partitioning: deciding which keys go to which reducer. Desire even distribution across reducers. Skewed data can overload a single reducer!
  • 41. Map/Reduce Partitioning & Shuffling (Image from Hadoop in Action...great book!)
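To make the skew problem concrete, here is a small sketch of hash-and-modulo partitioning, the general approach taken by Hadoop's default hash partitioner. The class `PartitionSketch`, the choice of four reducers, and the skewed key list are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hash partitioning in plain Java (a sketch, not Hadoop's Partitioner class). */
class PartitionSketch {

    /** Hash-and-modulo partitioning; the mask keeps the result non-negative. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Counts how many records each reducer would receive. */
    static int[] reducerLoad(List<String> keys, int numReducers) {
        int[] load = new int[numReducers];
        for (String key : keys) {
            load[partition(key, numReducers)]++;
        }
        return load;
    }

    public static void main(String[] args) {
        // Hypothetical skewed input: one hot key plus a few rare ones.
        List<String> keys = new ArrayList<>(Collections.nCopies(1000, "the"));
        keys.addAll(List.of("quick", "brown", "fox", "lazy", "dog"));
        System.out.println(Arrays.toString(reducerLoad(keys, 4)));
    }
}
```

Every occurrence of a key lands on the same reducer, which is what makes reduce-side aggregation correct, and also why a single hot key such as "the" sends at least 1000 of the 1005 records here to one reducer.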
  • 42. Combiner: effectively a reduce in the mappers, a.k.a. "Local Reduce"
  • 43. Shuffling WordCount data (looking at one mapper that sees the word "the" 1000 times): without combiner, ("the", 1) is shuffled as 1000 k/v pairs; with combiner, ("the", 1000) is shuffled as 1 k/v pair
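The 1000-versus-1 comparison can be checked with a few lines of plain Java. This is a sketch under stated assumptions: `CombinerSketch` and its two methods are hypothetical stand-ins for one mapper's output and a local reduce, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Counts shuffled pairs for one mapper, with and without a local reduce. */
class CombinerSketch {

    /** Mapper output without a combiner: one (word, 1) pair per occurrence. */
    static List<Map.Entry<String, Integer>> mapOutput(List<String> words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : words) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    /** Combiner: sums counts per word locally, before anything is shuffled. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = mapOutput(Collections.nCopies(1000, "the"));
        System.out.println("pairs shuffled without combiner: " + pairs.size());
        Map<String, Integer> combined = combine(pairs);
        System.out.println("pairs shuffled with combiner: " + combined.size() + " " + combined);
    }
}
```

Summing is safe to apply locally because addition is associative and commutative, so the reducer reaches the same total whether it receives a thousand 1s or a single 1000. In a word-count job the reducer class can usually double as the combiner for this reason.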
  • 44. Advanced Map/Reduce: Hadoop Streaming; Chaining Map/Reduce jobs; Joining data; Bloom filters
  • 46. HDFS: NameNode, Secondary NameNode, DataNode. Map/Reduce: JobTracker, TaskTracker
  • 47. Cluster diagram: Secondary NameNode, NameNode, JobTracker; DataNode1, DataNode2, ... DataNodeN, each paired with TaskTracker1, TaskTracker2, ... TaskTrackerN running map and reduce tasks
  • 48. NameNode: bookkeeper for HDFS. Manages DataNodes. Should not store data or run jobs. Single point of failure!
  • 51. DataNode: store actual file blocks on disk. Does not store entire files! Report block info to NameNode. Receive instructions from NameNode.
  • 52. Secondary NameNode: snapshot of NameNode. Not a failover server for NameNode! Helps minimize downtime/data loss if NameNode fails.
  • 53. JobTracker: partition tasks across HDFS cluster. Track map/reduce tasks. Re-start failed tasks on different nodes. Speculative execution.
  • 56. TaskTracker: track individual map & reduce tasks. Report progress to JobTracker.
  • 60. Logs: view task logs on the machine where a specific task was processed (or via web UI), under $HADOOP_HOME/logs/userlogs on the task tracker
  • 62. Counters: define one or more counters. Increment counters during map/reduce tasks. Counter values displayed in job tracker UI.
  • 64. IsolationRunner: re-run failed tasks with original input data. Must set keep.failed.task.files to 'true'
  • 65. Skipping Bad Records: data may not always be clean. New data may have new interesting twists. Can you pre-process to filter & validate input?
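One way to combine the counter and bad-record ideas is to validate each record and count what gets rejected. The sketch below is plain Java under stated assumptions: `RecordValidationSketch`, the `RecordCounter` names, the in-memory counter map, and the "digits,digits" record format are all hypothetical. In a real task the counts would presumably be incremented on Hadoop's own counters through the task context rather than kept in a local map.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Filters malformed input lines and counts both outcomes (no Hadoop involved). */
class RecordValidationSketch {

    /** Hypothetical counter names; a real job would define its own. */
    enum RecordCounter { VALID, MALFORMED }

    final Map<RecordCounter, Long> counters = new EnumMap<>(RecordCounter.class);

    /** Returns true if the line looks like "<digits>,<digits>". */
    static boolean isValid(String line) {
        return line != null && line.matches("\\d+,\\d+");
    }

    /** Keeps valid lines and increments a counter for every line seen. */
    List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (isValid(line)) {
                kept.add(line);
                counters.merge(RecordCounter.VALID, 1L, Long::sum);
            } else {
                counters.merge(RecordCounter.MALFORMED, 1L, Long::sum);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        RecordValidationSketch sketch = new RecordValidationSketch();
        List<String> kept = sketch.filter(List.of(
                "\"CITING\",\"CITED\"", // header row: not numeric, so rejected
                "3858241,956203",
                "3858241,",             // truncated record, made up for this sketch
                "3858242,1515701"));
        System.out.println(kept);
        System.out.println(sketch.counters);
    }
}
```

A malformed-record count that suddenly jumps between runs is a cheap signal that the input format has changed, which is the "new interesting twists" case from the slide.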
  • 67. Speculative execution (on by default); Use a Combiner; Reduce amount of input data; JVM re-use (be careful); Data compression; Refactor code/algorithms
  • 69. Lots of knobs; Needs active management; Network topology/rack awareness; Permissions/quotas; Trash can; Add/remove data nodes; "Fair" scheduling; NameNode/SNN management
  • 70. Hive
  • 71. Simulate structure for data stored in Hadoop. Query language analogous to SQL (Hive QL). Translates queries into Map/Reduce job(s)... ...so not for real-time processing!
  • 72. Queries: Projection; Joins (inner, outer, semi); Grouping; Aggregation; Sub-queries; Multi-table insert. Customizable: User-defined functions; Input/output formats with SerDe
  • 73. Patent citation dataset (http://www.nber.org/patents): /user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt "CITING","CITED" 3858241,956203 3858241,1324234 3858241,3398406 3858241,3557384 3858241,3634889 3858242,1515701 3858242,3319261 3858242,3668705 3858242,3707004 3858243,2949611 3858243,3146465 3858243,3156927 3858243,3221341 3858243,3574238 ...
  • 74. create external table patent_citations (citing string, cited string) row format delimited fields terminated by ',' stored as textfile location '/user/sleberkn/nber-patent/tables/patent_citation'; create table citation_histogram (num_citations int, count int) stored as sequencefile;
  • 75. insert overwrite table citation_histogram select num_citations, count(num_citations) from (select cited, count(cited) as num_citations from patent_citations group by cited) citation_counts group by num_citations order by num_citations;
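For readers more comfortable with Java than Hive QL, the nested group-by above can be mirrored in memory. This is only an illustration of what the query computes, not how Hive executes it: `CitationHistogramSketch` and the patent ids "A" through "D" are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory equivalent of the two-level group-by (no Hive involved). */
class CitationHistogramSketch {

    /** Inner query: number of citations received per cited patent. */
    static Map<String, Long> citationsPerPatent(List<String> citedIds) {
        Map<String, Long> counts = new HashMap<>();
        for (String cited : citedIds) {
            counts.merge(cited, 1L, Long::sum);
        }
        return counts;
    }

    /** Outer query: how many patents share each citation count, ordered by count. */
    static TreeMap<Long, Long> histogram(List<String> citedIds) {
        TreeMap<Long, Long> hist = new TreeMap<>();
        for (long numCitations : citationsPerPatent(citedIds).values()) {
            hist.merge(numCitations, 1L, Long::sum);
        }
        return hist;
    }

    public static void main(String[] args) {
        // Hypothetical "cited" column: A is cited 3 times, B twice, C and D once.
        List<String> cited = List.of("A", "A", "A", "B", "B", "C", "D");
        System.out.println(histogram(cited));
        // prints {1=2, 2=1, 3=1}
    }
}
```

Each of the two loops corresponds to one `group by`, which hints at why Hive may need more than one Map/Reduce job for a query like this.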
  • 77. Hadoop in the clouds
  • 78. Amazon EC2 + S3: EC2 instances are compute nodes (Map/Reduce). Storage options: HDFS on EC2 nodes; HDFS on EC2 nodes loading data from S3; Native S3 (bypasses HDFS)
  • 79. Amazon Elastic MapReduce: interact via web-based console. Submit Map/Reduce job (streaming, Hive, Pig, or JAR). EMR configures & launches Hadoop cluster for job. Uses S3 for data input/output.
  • 81. Hadoop = HDFS + Map/Reduce. Distributed, parallel processing. Designed for fault tolerance. Horizontal scale-out. Structure & queries via Hive.
  • 83. Hadoop (http://hadoop.apache.org/), Hive (http://hadoop.apache.org/hive/), Hadoop in Action (http://www.manning.com/lam/), Hadoop: The Definitive Guide, 2nd ed. (http://oreilly.com/catalog/0636920010388), Yahoo! Hadoop blog (http://developer.yahoo.net/blogs/hadoop/), Cloudera (http://www.cloudera.com/)