Performance Issues on
              Hadoop Clusters
                           Jiong Xie
                     Advisor: Dr. Xiao Qin
                     Committee Members:
                       Dr. Cheryl Seals
                       Dr. Dean Hendrix
                          University Reader:
                          Dr. Fa Foster Dai
05/08/12              1
Overview of My Research
    • Data locality: Data Placement on Heterogeneous Cluster [HCW 10]
    • Data movement: Prefetching Data from Disk to Memory [Submitted to IPDPS]
    • Data shuffling: Reducing Network Congestion [To Be Submitted]




05/08/12                   2
Data-Intensive Applications




05/08/12            3
Data-Intensive Applications (cont.)




05/08/12        4
Background
• The MapReduce programming model is growing in popularity

• Hadoop is used by Yahoo!, Facebook, and Amazon




05/08/12           5
Hadoop Overview
               -- MapReduce Runtime System




           (J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters.
           OSDI ’04, pages 137–150)
05/08/12                                            6
Hadoop Distributed File System




           (http://lucene.apache.org/hadoop)


05/08/12                                       7
Motivations
• MapReduce provides
     – Automatic parallelization & distribution
     – Fault tolerance
     – I/O scheduling
     – Monitoring & status updates




05/08/12                    8
Existing Hadoop Clusters
• Observation 1: Cluster nodes are
  dedicated
     – Data locality issues
     – Data transfer time
• Observation 2: The number of nodes is
  increased
     – Scalability issues
     – Shuffling overhead goes up

05/08/12                      9
Proposed Solutions
           (Figure: Input → Map tasks → Reduce tasks → Output)

                                P1: Data placement
                                P2: Prefetching
                                P3: Preshuffling
05/08/12                   10
Solutions
   • P1: Data placement (offline, distributed data, heterogeneous node)
   • P2: Prefetching (online, data preloading)
   • P3: Preshuffling (intermediate data movement, reducing traffic)
05/08/12                        11
Improving MapReduce
     Performance through Data
    Placement in Heterogeneous
          Hadoop Clusters


05/08/12        12
Motivational Example

Node A                             1 task/min
(fast)
Node B                             2x slower
(slow)
Node C                             3x slower
(slowest)


                  Time (min)


 05/08/12             13
The Native Strategy
Node A                                            6 tasks

Node B                                            3 tasks

Node C                                            2 tasks



                      Time (min)
            Loading   Transferring   Processing
 05/08/12                  14
Our Solution
     --Reducing data transfer time
Node A
                                                  6 tasks
Node A’

Node B’                                           3 tasks

Node C’                                           2 tasks

                      Time (min)

            Loading   Transferring   Processing
 05/08/12                  15
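(Worked example from the figures above: with 11 tasks split 6:3:2, Node A needs 6 × 1 = 6 minutes, Node B needs 3 × 2 = 6 minutes, and Node C needs 2 × 3 = 6 minutes, so all three nodes finish together; an even split of roughly 3.7 tasks per node would keep the slowest node busy for about 11 minutes while the fastest sits idle after about 4, which is the transfer and waiting time the placement strategy tries to remove.)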
Challenges
• Does the distribution strategy depend on
  applications?
• Initialization of data distribution
• The data skew problem
     – New data arrival
     – Data deletion
     – Data updating
     – Newly joining nodes
05/08/12                   16
Measure Computing Ratios
• Computing ratio

• Fast machines process large data sets
                                1 task/min
  Node A

  Node B                        2x slower


  Node C                        3x slower

     Time
05/08/12            17
Measuring Computing Ratios
1. Run an application, collect response time
2. Set ratio of a node offering the shortest response time as 1
3. Normalize ratios of other nodes
4. Calculate the least common multiple of these ratios
5. Determine the amount of data processed by each node



      Node       Response time (s)   Ratio   # of File Fragments   Speed
      Node A     10                  1       6                     Fastest
      Node B     20                  2       3                     Average
      Node C     30                  3       2                     Slowest

05/08/12                             18
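A minimal sketch (not the dissertation's implementation) of how steps 1–5 above could be coded, assuming the measured response times from the table; the function name compute_ratios is illustrative:

```python
from math import lcm
from functools import reduce

def compute_ratios(response_times):
    """Steps 2-5: derive computing ratios and per-node file-fragment counts
    from measured response times (hypothetical helper, for illustration)."""
    fastest = min(response_times.values())
    # Steps 2-3: normalize so the fastest node has ratio 1
    ratios = {node: t / fastest for node, t in response_times.items()}
    # Step 4: least common multiple of the (rounded) ratios
    int_ratios = {node: round(r) for node, r in ratios.items()}
    total = reduce(lcm, int_ratios.values())
    # Step 5: a node with ratio r gets total / r fragments, so all nodes
    # finish their share in roughly the same time
    fragments = {node: total // r for node, r in int_ratios.items()}
    return ratios, fragments

# Example from the table: Node A/B/C took 10, 20, 30 seconds
ratios, fragments = compute_ratios({"A": 10, "B": 20, "C": 30})
print(fragments)  # {'A': 6, 'B': 3, 'C': 2}
```

With the table's response times, the sketch reproduces the 6:3:2 fragment counts shown above.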
Initialize Data Distribution
• Input files split into 64MB blocks
• Round-Robin data distribution algorithm

(Figure: the Namenode assigns File1's blocks 1–9 and a–c to Datanodes A, B,
 and C according to the 3:2:1 portions)
05/08/12                19
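As a rough sketch under the 3:2:1 portions shown above (the helper assign_blocks is hypothetical, not the HDFS-HC code), a weighted round-robin placement could look like this:

```python
from itertools import cycle

def assign_blocks(blocks, portions):
    """Weighted round-robin: a node with portion p receives p blocks per round."""
    placement = {node: [] for node in portions}
    # Expand each node by its portion, e.g. {'A': 3, 'B': 2, 'C': 1}
    # yields the round order A, A, A, B, B, C, A, A, A, ...
    order = cycle([n for node, p in portions.items() for n in [node] * p])
    for block, node in zip(blocks, order):
        placement[node].append(block)
    return placement

blocks = list("123456789abc")            # File1 split into twelve 64MB blocks
print(assign_blocks(blocks, {"A": 3, "B": 2, "C": 1}))
# {'A': ['1','2','3','7','8','9'], 'B': ['4','5','a','b'], 'C': ['6','c']}
```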
Data Redistribution
1. Get network topology, ratio, and utilization
2. Build and sort two lists:
    under-utilized node list  L1
    over-utilized node list  L2
3. Select the source and destination nodes from the lists
4. Transfer data
5. Repeat steps 3 and 4 until the lists are empty

(Figure: the Namenode moves blocks from nodes on the over-utilized list L2 to
 nodes on the under-utilized list L1 until the 3:2:1 portions are reached)
05/08/12                           20
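A hedged sketch of the redistribution loop in steps 1–5 above, assuming each node reports how many blocks it currently holds and how many it should hold; redistribute and the transfer plan it returns are illustrative only:

```python
def redistribute(current, target):
    """Move blocks from over-utilized to under-utilized nodes until the
    actual placement matches the target portions. Returns a transfer plan."""
    # Step 2: build and sort the two lists by distance from the target
    over  = sorted([n for n in current if current[n] > target[n]],
                   key=lambda n: current[n] - target[n], reverse=True)
    under = sorted([n for n in current if current[n] < target[n]],
                   key=lambda n: target[n] - current[n], reverse=True)
    plan = []
    # Steps 3-5: repeatedly pair the most over-utilized with the most
    # under-utilized node and record one block transfer
    while over and under:
        src, dst = over[0], under[0]
        plan.append((src, dst))
        current[src] -= 1
        current[dst] += 1
        if current[src] == target[src]:
            over.pop(0)
        if current[dst] == target[dst]:
            under.pop(0)
    return plan

# Example: B holds too many blocks for a 3:2:1 portion over 12 blocks
print(redistribute({"A": 4, "B": 6, "C": 2}, {"A": 6, "B": 4, "C": 2}))
# [('B', 'A'), ('B', 'A')]
```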
Experimental Environment
            Node       CPU Model          CPU (GHz)        L1 Cache (KB)
            Node A     Intel Core 2 Duo   2 × 1.0 = 2.0         204
            Node B     Intel Celeron           2.8              256
            Node C     Intel Pentium 3         1.2              256
            Node D     Intel Pentium 3         1.2              256
            Node E     Intel Pentium 3         1.2              256

                 Five nodes in a Hadoop heterogeneous cluster




05/08/12                             21
Benchmarks
• Grep: a tool searching for a regular
  expression in a text file

• WordCount: a program used to count
  words in a text file

• Sort: a program used to list the inputs in
  sorted order.
05/08/12              22
Response Time of Grep and
      Wordcount in Each Node




             The computing ratio is application dependent
             but data size independent
05/08/12               23
Computing Ratio for Two
               Applications

     Computing Node         Ratios for Grep       Ratios for Wordcount
            Node A                  1                        1
            Node B                  2                        2
            Node C                 3.3                       5
            Node D                 3.3                       5
            Node E                 3.3                       5

           Computing ratios of the five nodes with respect to the Grep and
                              Wordcount applications




05/08/12                                 24
Six Data Placement Decisions




05/08/12        25
Impact of data placement on
        performance of Grep




05/08/12         26
Impact of data placement on
      performance of WordCount




05/08/12         27
Summary of Data Placement
P1: Data Placement Strategy
• Motivation: Fast machines process large data sets
• Problem: Data locality issue in heterogeneous
  clusters
• Contributions: Distribute data according to
  computing capability
     – Measure computing ratio
     – Initialize data placement
     – Redistribution


05/08/12                     28
Predictive Scheduling and
 Prefetching for Hadoop clusters



05/08/12       29
Prefetching
• Goal: Improving performance
• Approach
     – Best effort to guarantee data locality.
     – Keeping data close to computing nodes
     – Reducing the CPU stall time




05/08/12                  30
Challenges
• What to prefetch?
• How to prefetch?
• What is the size of blocks to be
  prefetched?




05/08/12             31
Dataflow in Hadoop

           (Figure: Hadoop dataflow)
            1. Submit job
            2. Schedule
            3. Read input (HDFS: Block 1, Block 2)
            4. Run map
            5. Heartbeat
            6. Next task
            7. Read new file
            Map output goes to the local FS, from which reduce tasks read it.
05/08/12                                      32
Dataflow in Hadoop

           (Figure: Hadoop dataflow with the predictive scheduler)
            1. Submit job
            2. Schedule (+ more tasks + meta data)
            3. Read input (HDFS: Block 1, Block 2)
            4. Run map
            5. Heartbeat
            5.1 Read new file
            6. Next task
            Map output goes to the local FS, from which reduce tasks read it.
05/08/12                                        33
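To make the extra scheduler reply concrete, here is a sketch, not the actual Hadoop modification, of a worker that receives the next task's block path together with its heartbeat reply and preloads that block while the current map task runs; every name below (prefetch, run_map, worker) is illustrative:

```python
import threading
from queue import Queue

def prefetch(path, cache):
    """Prefetching thread: preload the next task's input block into memory."""
    with open(path, "rb") as f:
        cache[path] = f.read()

def run_map(path, cache):
    """Stand-in for a map task: consume the (possibly prefetched) block."""
    data = cache.pop(path, None)
    if data is None:                     # cache miss: fall back to a normal read
        with open(path, "rb") as f:
            data = f.read()
    return len(data)

def worker(task_queue):
    """Process tasks; each heartbeat reply (next queue item) carries the next
    task's block path, so its disk read overlaps with the current map task."""
    cache = {}
    task = task_queue.get()              # initial assignment from the scheduler
    while task is not None:
        nxt = task_queue.get()           # heartbeat reply: next task (or None)
        loader = None
        if nxt is not None:
            loader = threading.Thread(target=prefetch, args=(nxt, cache))
            loader.start()               # overlap disk I/O with the running map
        run_map(task, cache)
        if loader is not None:
            loader.join()                # next block is in memory before it runs
        task = nxt

if __name__ == "__main__":
    import os, tempfile
    paths = []
    for _ in range(3):                   # three stand-in input blocks
        f = tempfile.NamedTemporaryFile(delete=False)
        f.write(b"x" * 1024)
        f.close()
        paths.append(f.name)
    q = Queue()
    for p in paths + [None]:             # a final None ends the worker loop
        q.put(p)
    worker(q)
    for p in paths:
        os.remove(p)
```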
Prefetching Processing


                     (Figure: prefetching processing steps)




05/08/12            34
Software Architecture




05/08/12            35
Grep Performance



                              9.5% 1G
                              8.5% 2G




05/08/12          36
WordCount Performance



                                   8.9% 1G
                                   8.1% 2G




05/08/12            37
Large/Small file in a node




            9.1% Grep             18% Grep
            8.3% WordCount        24% WordCount

05/08/12                     38
Experiment Setting




05/08/12           39
Large/Small file in cluster




05/08/12               40
Summary
P2: Predictive Scheduler and Prefetching
• Goal: Moving data before tasks are assigned
• Problem: Synchronizing tasks and data
• Contributions: Preloading the required data earlier
  than the task is assigned
     – Predictive scheduler
     – Prefetching mechanism
     – Worker thread



05/08/12                   41
Adaptive Preshuffling
            in Hadoop clusters



05/08/12            42
Preshuffling
• Observation 1: Too much data moves from
  Map workers to Reduce workers
     – Solution 1: Map nodes apply pre-shuffling
       functions to their local output


• Observation 2: No reduce can start until a
  map is complete.
     – Solution 2: Intermediate data is pipelined
       between mappers and reducers.
05/08/12                  43
Preshuffling
• Goal : Minimize data shuffle during
  Reduce
• Approach
     – Pipeline
     – Overlap between map and data movement
     – Group map and reduce
• Challenges
     – Synchronize map and reduce
     – Data locality
05/08/12                44
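As a rough illustration of the push-plus-pipeline idea above (slides 45–48 show the corresponding dataflow), a mapper could partition its output into an in-memory buffer and push each partition to its reducer as the buffer fills, rather than waiting for reducers to pull after the map finishes; PushMapper and PipelinedReducer are illustrative names, not the dissertation's classes:

```python
from collections import defaultdict

class PushMapper:
    """Map side of a push-style pre-shuffle: output is partitioned into an
    in-memory buffer and pushed to reducers when the buffer fills,
    overlapping data movement with the remaining map work."""

    def __init__(self, reducers, buffer_limit=1024):
        self.reducers = reducers              # partition id -> reducer object
        self.buffer = defaultdict(list)       # partition id -> [(key, value), ...]
        self.buffered = 0
        self.buffer_limit = buffer_limit

    def emit(self, key, value):
        part = hash(key) % len(self.reducers)
        self.buffer[part].append((key, value))
        self.buffered += 1
        if self.buffered >= self.buffer_limit:
            self.flush()

    def flush(self):
        for part, records in self.buffer.items():
            self.reducers[part].push(records)  # push instead of reducer-side pull
        self.buffer.clear()
        self.buffered = 0

class PipelinedReducer:
    """Reduce side: accumulate pushed records incrementally so the merge
    overlaps with the map phase instead of starting after it."""

    def __init__(self):
        self.groups = defaultdict(list)

    def push(self, records):
        for key, value in records:
            self.groups[key].append(value)

    def finish(self, reduce_fn):
        return {k: reduce_fn(vs) for k, vs in self.groups.items()}

# WordCount-style usage
reducers = [PipelinedReducer(), PipelinedReducer()]
mapper = PushMapper(reducers, buffer_limit=4)
for word in "a b a c b a".split():
    mapper.emit(word, 1)
mapper.flush()
print([r.finish(sum) for r in reducers])
```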
Dataflow in Hadoop

           (Figure: Hadoop dataflow with preshuffling)
            1. Submit job
            2. Schedule (new tasks)
            3. Read input (HDFS: Block 1, Block 2) / 3. Request data
            4. Run map / 4. Send data
            5. Heartbeat / 5. Write data
            6. Next task
            Map output is staged in the local FS; reducers fetch it via HTTP GET
            and write their results to HDFS.
  05/08/12                                        45
PreShuffle




           map                    reduce



           map                    reduce

                   Data request


05/08/12                46
In-memory buffer




05/08/12           47
Pipelining – A new design




                     map        reduce
           Block 1

HDFS                                     HDFS
           Block 2
                     map        reduce




05/08/12                   48
WordCount Performance




            230 seconds vs 180 seconds

05/08/12                       49
WordCount Performance




05/08/12            50
Sort Performance




05/08/12         51
Summary
P3: Preshuffling
• Goal: Minimize data shuffling during the Reduce phase
• Problem: Task distribution and synchronization
• Contributions: a preshuffling algorithm
     – Push data instead of the traditional pull
     – In-memory buffer
     – Pipeline




05/08/12                      52
Conclusion


           (Figure: Input → Map tasks → Reduce tasks → Output)

           • P1: Data placement (offline, distributed data, heterogeneous node)
           • P2: Prefetching (online, data preloading, single node)
           • P3: Preshuffling (intermediate data movement, reducing traffic)




05/08/12                                               53
Future Work
• Extend Pipelining
     – Implement the pipelining design
• Small files issue
     – HAR files
     – Sequence files
     – CombineFileInputFormat
• Extend Data placement

05/08/12                 54
Thanks!
     And Questions?


55
Run Time affected by Network Condition




             Experiments conducted by Yixian Yang
05/08/12                         56
Traffic Volume affected by Network
                        Condition




                  Experiments conducted by Yixian Yang
05/08/12                             57

Weitere ähnliche Inhalte

Was ist angesagt?

Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableStefan Kupstaitis-Dunkler
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsFadi Yousuf
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview EMC
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFSBrendan Tierney
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 

Was ist angesagt? (20)

Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The Essentials
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFS
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop
Hadoop Hadoop
Hadoop
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 

Andere mochten auch

Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoopRon Sher
 
IBM Big Data for Social Good Challenge - Submission Showcase
IBM Big Data for Social Good Challenge - Submission ShowcaseIBM Big Data for Social Good Challenge - Submission Showcase
IBM Big Data for Social Good Challenge - Submission ShowcaseIBM Analytics
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesVasia Kalavri
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationYahoo Developer Network
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performanceDataWorks Summit
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Andere mochten auch (10)

HDFS Issues
HDFS IssuesHDFS Issues
HDFS Issues
 
Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoop
 
Heirarchy
HeirarchyHeirarchy
Heirarchy
 
Resume2015 copy
Resume2015 copyResume2015 copy
Resume2015 copy
 
IBM Big Data for Social Good Challenge - Submission Showcase
IBM Big Data for Social Good Challenge - Submission ShowcaseIBM Big Data for Social Good Challenge - Submission Showcase
IBM Big Data for Social Good Challenge - Submission Showcase
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open Issues
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Ähnlich wie Improving Hadoop Performance with Data Placement and Prefetching

Networking lecture 4 Data Link Layer by Mamun sir
Networking lecture 4 Data Link Layer by Mamun sirNetworking lecture 4 Data Link Layer by Mamun sir
Networking lecture 4 Data Link Layer by Mamun sirsharifbdp
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduceFARUK BERKSÖZ
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache AccumuloJared Winick
 
Information processing architectures
Information processing architecturesInformation processing architectures
Information processing architecturesRaji Gogulapati
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Big data | Hadoop | components of hadoop |Rahul Gulab Sing
Big data | Hadoop | components of hadoop |Rahul Gulab SingBig data | Hadoop | components of hadoop |Rahul Gulab Sing
Big data | Hadoop | components of hadoop |Rahul Gulab SingRahul Singh
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoopmcsrivas
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...IJSRD
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...IJSRD
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...IJSRD
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introductionFrans van Noort
 
Scheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukScheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukAndrii Vozniuk
 
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptSQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptChris Richardson
 

Ähnlich wie Improving Hadoop Performance with Data Placement and Prefetching (20)

Networking lecture 4 Data Link Layer by Mamun sir
Networking lecture 4 Data Link Layer by Mamun sirNetworking lecture 4 Data Link Layer by Mamun sir
Networking lecture 4 Data Link Layer by Mamun sir
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Understanding hdfs
Understanding hdfsUnderstanding hdfs
Understanding hdfs
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
Hadoop Inside
Hadoop InsideHadoop Inside
Hadoop Inside
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Information processing architectures
Information processing architecturesInformation processing architectures
Information processing architectures
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Big data | Hadoop | components of hadoop |Rahul Gulab Sing
Big data | Hadoop | components of hadoop |Rahul Gulab SingBig data | Hadoop | components of hadoop |Rahul Gulab Sing
Big data | Hadoop | components of hadoop |Rahul Gulab Sing
 
Big data quiz
Big data quizBig data quiz
Big data quiz
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoop
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Scheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukScheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii Vozniuk
 
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, EgyptSQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
SQL? NoSQL? NewSQL?!? What’s a Java developer to do? - JDC2012 Cairo, Egypt
 

Mehr von Xiao Qin

How to apply for internship positions?
How to apply for internship positions?How to apply for internship positions?
How to apply for internship positions?Xiao Qin
 
How to write research papers? Version 5.0
How to write research papers? Version 5.0How to write research papers? Version 5.0
How to write research papers? Version 5.0Xiao Qin
 
Making a competitive nsf career proposal: Part 2 Worksheet
Making a competitive nsf career proposal: Part 2 WorksheetMaking a competitive nsf career proposal: Part 2 Worksheet
Making a competitive nsf career proposal: Part 2 WorksheetXiao Qin
 
Making a competitive nsf career proposal: Part 1 Tips
Making a competitive nsf career proposal: Part 1 TipsMaking a competitive nsf career proposal: Part 1 Tips
Making a competitive nsf career proposal: Part 1 TipsXiao Qin
 
Auburn csse faculty orientation
Auburn csse faculty orientationAuburn csse faculty orientation
Auburn csse faculty orientationXiao Qin
 
Auburn CSSE graduate student orientation
Auburn CSSE graduate student orientationAuburn CSSE graduate student orientation
Auburn CSSE graduate student orientationXiao Qin
 
CSSE Graduate Programs Committee: Progress Report
CSSE Graduate Programs Committee: Progress ReportCSSE Graduate Programs Committee: Progress Report
CSSE Graduate Programs Committee: Progress ReportXiao Qin
 
Project 2 How to modify os161: A Manual
Project 2 How to modify os161: A ManualProject 2 How to modify os161: A Manual
Project 2 How to modify os161: A ManualXiao Qin
 
Project 2 how to modify OS/161
Project 2 how to modify OS/161Project 2 how to modify OS/161
Project 2 how to modify OS/161Xiao Qin
 
Project 2 how to install and compile os161
Project 2 how to install and compile os161Project 2 how to install and compile os161
Project 2 how to install and compile os161Xiao Qin
 
Project 2 - how to compile os161?
Project 2 - how to compile os161?Project 2 - how to compile os161?
Project 2 - how to compile os161?Xiao Qin
 
Understanding what our customer wants-slideshare
Understanding what our customer wants-slideshareUnderstanding what our customer wants-slideshare
Understanding what our customer wants-slideshareXiao Qin
 
OS/161 Overview
OS/161 OverviewOS/161 Overview
OS/161 OverviewXiao Qin
 
Surviving a group project
Surviving a group projectSurviving a group project
Surviving a group projectXiao Qin
 
P#1 stream of praise
P#1 stream of praiseP#1 stream of praise
P#1 stream of praiseXiao Qin
 
Data center specific thermal and energy saving techniques
Data center specific thermal and energy saving techniquesData center specific thermal and energy saving techniques
Data center specific thermal and energy saving techniquesXiao Qin
 
How to do research?
How to do research?How to do research?
How to do research?Xiao Qin
 
COMP2710 Software Construction: header files
COMP2710 Software Construction: header filesCOMP2710 Software Construction: header files
COMP2710 Software Construction: header filesXiao Qin
 
COMP2710: Software Construction - Linked list exercises
COMP2710: Software Construction - Linked list exercisesCOMP2710: Software Construction - Linked list exercises
COMP2710: Software Construction - Linked list exercisesXiao Qin
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...Xiao Qin
 

Mehr von Xiao Qin (20)

How to apply for internship positions?
How to apply for internship positions?How to apply for internship positions?
How to apply for internship positions?
 
How to write research papers? Version 5.0
How to write research papers? Version 5.0How to write research papers? Version 5.0
How to write research papers? Version 5.0
 
Making a competitive nsf career proposal: Part 2 Worksheet
Making a competitive nsf career proposal: Part 2 WorksheetMaking a competitive nsf career proposal: Part 2 Worksheet
Making a competitive nsf career proposal: Part 2 Worksheet
 
Making a competitive nsf career proposal: Part 1 Tips
Making a competitive nsf career proposal: Part 1 TipsMaking a competitive nsf career proposal: Part 1 Tips
Making a competitive nsf career proposal: Part 1 Tips
 
Auburn csse faculty orientation
Auburn csse faculty orientationAuburn csse faculty orientation
Auburn csse faculty orientation
 
Auburn CSSE graduate student orientation
Auburn CSSE graduate student orientationAuburn CSSE graduate student orientation
Auburn CSSE graduate student orientation
 
CSSE Graduate Programs Committee: Progress Report
CSSE Graduate Programs Committee: Progress ReportCSSE Graduate Programs Committee: Progress Report
CSSE Graduate Programs Committee: Progress Report
 
Project 2 How to modify os161: A Manual
Project 2 How to modify os161: A ManualProject 2 How to modify os161: A Manual
Project 2 How to modify os161: A Manual
 
Project 2 how to modify OS/161
Project 2 how to modify OS/161Project 2 how to modify OS/161
Project 2 how to modify OS/161
 
Project 2 how to install and compile os161
Project 2 how to install and compile os161Project 2 how to install and compile os161
Project 2 how to install and compile os161
 
Project 2 - how to compile os161?
Project 2 - how to compile os161?Project 2 - how to compile os161?
Project 2 - how to compile os161?
 
Understanding what our customer wants-slideshare
Understanding what our customer wants-slideshareUnderstanding what our customer wants-slideshare
Understanding what our customer wants-slideshare
 
OS/161 Overview
OS/161 OverviewOS/161 Overview
OS/161 Overview
 
Surviving a group project
Surviving a group projectSurviving a group project
Surviving a group project
 
P#1 stream of praise
P#1 stream of praiseP#1 stream of praise
P#1 stream of praise
 
Data center specific thermal and energy saving techniques
Data center specific thermal and energy saving techniquesData center specific thermal and energy saving techniques
Data center specific thermal and energy saving techniques
 
How to do research?
How to do research?How to do research?
How to do research?
 
COMP2710 Software Construction: header files
COMP2710 Software Construction: header filesCOMP2710 Software Construction: header files
COMP2710 Software Construction: header files
 
COMP2710: Software Construction - Linked list exercises
COMP2710: Software Construction - Linked list exercisesCOMP2710: Software Construction - Linked list exercises
COMP2710: Software Construction - Linked list exercises
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
 

Kürzlich hochgeladen

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Kürzlich hochgeladen (20)

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Improving Hadoop Performance with Data Placement and Prefetching

  • 1. Performance Issues on Hadoop Clusters Jiong Xie Advisor: Dr. Xiao Qin Committee Members: Dr. Cheryl Seals Dr. Dean Hendrix University Reader: Dr. Fa Foster Dai 05/08/12 1
  • 2. Overview of My Research Data locality Data movement Data shuffling Data Placement Prefetching Data Reduce network on Heterogeneous from Disk congest Cluster to Memory [To Be Submitted] [HCW 10] [Submit to IPDPS] 05/08/12 2
  • 5. Background • MapReduce programming model is growing in popularity • Hadoop is used by Yahoo, Facebook, Amazon. 05/08/12 5
  • 6. Hadoop Overview --Mapreduce Running System (J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150) 05/08/12 6
  • 7. Hadoop Distributed File System (http://lucene.apache.org/hadoop) 05/08/12 7
  • 8. Motivations • MapReduce provides – Automatic parallelization & distribution – Fault tolerance – I/O scheduling – Monitoring & status updates 05/08/12 8
  • 9. Existing Hadoop Clusters • Observation 1: Cluster nodes are dedicated – Data locality issues – Data transfer time • Observation 2: The number of nodes is increased d Scalability issues e Shuffling overhead goes up 05/08/12 9
  • 10. Proposed Solutions Input Map Reduce Map Output Map Reduce Map Reduce Map P1: Data placement P2: Prefetching P3: Preshuffling 05/08/12 10
  • 11. Solutions P1: Data placement Offline, distributed data, heterogeneous node P2: Prefetching P3: Preshuffling Online, data preloading Intermediate data movement, reducing traffic 05/08/12 11
  • 12. Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters 05/08/12 12
  • 13. Motivational Example Node A 1 task/min (fast) Node B 2x slower (slow) Node C 3x slower (slowest) Time (min) 05/08/12 13
  • 14. The Native Strategy Node A 6 tasks Node B 3 tasks Node C 2 tasks Time (min) Loading Transferring Processing 05/08/12 14
  • 15. Our Solution --Reducing data transfer time Node A 6 tasks Node A’ Node B’ 3 tasks Node C’ 2 tasks Time (min) Loading Transferring Processing 05/08/12 15
  • 16. Challenges • Does distribution strategy depend on applications? • Initialization of data distribution • The data skew problems – New data arrival – Data deletion – Data updating – New joining nodes 05/08/12 16
  • 17. Measure Computing Ratios • Computing ratio • Fast machines process large data sets 1 task/min Node A Node B 2x slower Node C 3x slower Time 05/08/12 17
  • 18. Measuring Computing Ratios 1. Run an application, collect response time 2. Set ratio of a node offering the shortest response time as 1 3. Normalize ratios of other nodes 4. Calculate the least common multiple of these ratios 5. Determine the amount of data processed by each node Node Response Ratio # of File Speed time(s) Fragments Node A 10 1 6 Fastest Node B 20 2 3 Average Node C 30 3 2 Slowest 05/08/12 18
  • 19. Initialize Data Distribution Portions 3:2:1 • Input files split into 64MB Namenode File1 blocks 1 2 4 5 3 6 • Round-Robin data distribution algorithm A B C 7 a 8 c b 9 Datanodes 05/08/12 19
  • 20. Data Redistribution 4 3 2 1 1.Get network topology, Namenode ratio, and utilization L1 A C 2.Build and sort two lists: under-utilized node list  L1 L2 B over-utilized node list  L2 3. Select the source and A B C Portion destination node from 3:2:1 the lists. 1 4 7 a 6 4.Transfer data 2 5 8 b 9 c 5.Repeat step 3, 4 until the 3 list is empty. 05/08/12 20
  • 21. Experimental Environment Node CPU Model CPU(Hz) L1 Cache(KB) Node A Intel core 2 Duo 2*1G=2G 204 Node B Intel Celeron 2.8G 256 Node C Intel Pentium 3 1.2G 256 Node D Intel Pentium 3 1.2G 256 Node E Intel Pentium 3 1.2G 256 Five nodes in a Hadoop heterogeneous cluster 05/08/12 21
  • 22. Benckmarks • Grep: a tool searching for a regular expression in a text file • WordCount: a program used to count words in a text file • Sort: a program used to list the inputs in sorted order. 05/08/12 22
  • 23. Response Time of Grep and Wordcount in Each Node Application dependence Computing ratio is Data size independence 05/08/12 23
  • 24. Computing Ratio for Two Applications Computing Node Ratios for Grep Ratios for Wordcount Node A 1 1 Node B 2 2 Node C 3.3 5 Node D 3.3 5 Node E 3.3 5 Computing ratio of the five nodes with respective of Grep and Wordcount applications 05/08/12 24
  • 25. Six Data Placement Decisions 05/08/12 25
  • 26. Impact of data placement on performance of Grep 05/08/12 26
  • 27. Impact of data placement on performance of WordCount 05/08/12 27
  • 28. Summary of Data Placement P1: Data Placement Strategy • Motivation: Fast machines process large data sets • Problem: Data locality issue in heterogeneous clusters • Contributions: Distribute data according to computing capability – Measure computing ratio – Initialize data placement – Redistribution 05/08/12 28
  • 29. Predictive Scheduling and Prefetching for Hadoop clusters 05/08/12 29
  • 30. Prefetching • Goal: Improving performance • Approach – Best effort to guarantee data locality. – Keeping data close to computing nodes – Reducing the CPU stall time 05/08/12 30
  • 31. Challenges • What to prefetch? • How to prefetch? • What is the size of blocks to be prefetched? 05/08/12 31
  • 32. Dataflow in Hadoop 1.Submit job t ea k rtb tas 2.Schedule ea xt 5.h Ne 6. 3.Read Input map Local reduce FS Block 1 HDFS Block 2 Local map FS reduce 7.Read new file 4. Run map 05/08/12 32
  • 33. Dataflow in Hadoop 1.Submit job t 2.Schedule ea k rtb + more task tas ea + meta xt 5.h Ne data 6. 3.Read Input map Local reduce FS Block 1 HDFS Block 2 Local map FS reduce 5.1.Read new file 4. Run map 05/08/12 33
  • 34. Prefetching Processing 6 7 8 05/08/12 34
  • 36. Grep Performance 9.5% 1G 8.5% 2G 05/08/12 36
  • 37. WordCount Performance 8.9% 1G 8.1% 2G 05/08/12 37
  • 38. Large/Small file in a node 9.1% Grep 18% Grep 8.3% WordCount 24% WordCount 05/08/12 38
  • 40. Large/Small file in cluster 05/08/12 40
  • 41. Summary P2: Predictive Scheduler and Prefetching • Goal: Moving data before task assigns • Problem: Synchronization task and data • Contributions: Preloading the required data early than the task assigned – Predictive scheduler – Prefetching mechanism – Worker thread 05/08/12 41
  • 42. Adaptive Preshuffling in Hadoop clusters 05/08/12 42
  • 43. Preshuffling • Observation 1: Too much data move from Map worker to Reduce worker – Solution1: Map nodes apply pre-shuffling functions to their local output • Observation 2: No reduce can start until a map is complete. – Solution2: Intermediate data is pipelined between mappers and reducers. 05/08/12 43
  • 44. Preshuffling • Goal : Minimize data shuffle during Reduce • Approach – Pipeline – Overlap between map and data movement – Group map and reduce • Challenges – Synchronize map and reduce – Data locality 05/08/12 44
  • 45. Dataflow in Hadoop 1.Submit job 2.Schedule 2. at Ne e rtb k w tas tas ea xt 5.h k Ne 6. 3.Read Input 5.Write data map Local FS reduce Block 1 HTTP GET HDFS HDFS Block 2 Local map FS reduce 3. Request data 4. Run map 4. Send data 05/08/12 45
  • 46. PreShuffle map reduce map reduce Data request 05/08/12 46
  • 48. Pipelining – A new design map reduce Block 1 HDFS HDFS Block 2 map reduce 05/08/12 48
  • 49. WordCount Performance 230 seconds vs 180 seconds 05/08/12 49
  • 52. Summary P3: Preshuffling • Goal: Minimize data shuffling during the Reduce • Problem: task distribution and synchronization • Contributions: preshuffling agorithm – Push data instead of tradition pull – In-memory buffer – Pipeline 05/08/12 52
  • 53. Conclusion Input P1: Data placement Offline, distributed data, heterogeneous node Map Map Map Map Map P2: Prefetching Online, data preloading, single node P3: Preshuffling Intermediate data movement, reducing Reduce Reduce Reduce traffic Output 05/08/12 53
  • 54. Future Work • Extend Pipelining – Implement the pipelining design • Small files issue – Har file – Sequence file – CombineFileInputFormat • Extend Data placement 05/08/12 54
  • 55. Thanks! And Questions? 55
• 56. Run Time Affected by Network Condition (experiments conducted by Yixian Yang) 05/08/12 56
• 57. Traffic Volume Affected by Network Condition (experiments conducted by Yixian Yang) 05/08/12 57

Editor's Notes

  1. Comment: three colors.
2. 1. Copy the MapReduce program to be executed to the Master and to every Worker machine. 2. The Master decides which Worker machines will run the Map program and which will run the Reduce program. 3. Distribute all data blocks to the Worker machines running the Map program for mapping. 4. Store the Map results on the Worker machines' local disks. 5. The Worker machines running the Reduce program remotely read each Map result, merge and sort the results, and run the Reduce program. 6. Output the computation results the user needs. Hadoop MapReduce: single master node, many worker nodes. The client submits a job to the master node; the master splits each job into tasks (map/reduce) and assigns tasks to worker nodes.
3. Hadoop MapReduce: single master node, many worker nodes. The client submits a job to the master node; the master splits each job into tasks (map/reduce) and assigns tasks to worker nodes. Hadoop Distributed File System (HDFS): single name node, many data nodes. Files are stored as large, fixed-size (e.g., 64 MB) blocks. HDFS typically holds map input and reduce output.
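To make the client-to-master submission path concrete, here is a minimal, hypothetical WordCount driver using the org.apache.hadoop.mapreduce API; the class names, job name, and paths are illustrative, and the mapper/reducer classes it references are sketched later in these notes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal driver: the client builds a job description and submits it to the
    // master, which splits it into map and reduce tasks for the worker nodes.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");   // Job.getInstance(conf, "word count") on newer releases
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // hypothetical classes, sketched below
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }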
  4. Comment: no “want to”
  5. 90
6. Emphasize.
  7. Comments: differences between shuffling overhead and data transfer time.
  8. Comment: what is computing ratio.
  9. Comment 1: make step descriptions short. Comment 2: animations map to steps.
  10. blocks
11. More I/O features
  12. Higher resolution
13. Comment: Titles of all the slides must use the same format. Up to 33.1% and 10.2%, with averages of 17.3% and 7.1%.
14. Up to 33.1% and 10.2%, with averages of 17.3% and 7.1%.
15. Up to 33.1% and 10.2%, with averages of 17.3% and 7.1%.
16. For the second part, it would be best to have a sliding window over the task flow and data flow; the size of this scheduling window is the key to performance, and it would give your predictive scheduling a theoretical foundation. Its goal is to reduce the synchronization gap between each task's data, or between tasks, so as to achieve pipelined job processing. Your first part is fairly clear, but the second part is rather disorganized: the goal and approach on slide 31 are clear, and the conclusion on slide 41 is also clear, but the thread between the two does not highlight your contribution, and the animation in the second part is not representative enough.
17. Which data block should be loaded, and where is it? How do we synchronize the computation process with data prefetching? If data is loaded into the cache long before it is required, it wastes resources; on the contrary, if the data arrives later than the computation, it is useless. How do we optimize the prefetching rate, i.e., what is the best percentage of data to preload on each machine? MapReduce benefits most from the best prefetching rate while minimizing the prefetching overhead. When to prefetch controls how early to trigger the prefetching action on each node. In our previous research, we observed that a node processes blocks of the same size in a fixed time, so before the last block finishes, the next block starts loading. In this design, we estimate the execution time of each block on each node; since the execution time of the same application may differ across nodes, we collect and adapt the average running time of a block on each node. What to prefetch determines which block to load into memory. Initially, the predictive scheduler assigns two tasks to each node; when the preloading work is triggered, it can access the data according to the required information from the second task. How much to prefetch decides the amount of prepared data to preload into memory. According to the scheduler, while one task is running, one or more tasks are waiting on the node and their metadata is loaded into the node's memory. When the prefetching action is triggered, the preloading worker automatically loads data from disk to memory. In view of the large size of HDFS blocks, the number should not be too aggressive: here we set the number of tasks to two and the number of prefetched blocks to one.
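A minimal sketch of the preloading worker described above, assuming a prefetch depth of one block; the PrefetchWorker class, its queue-based interface, and the local-disk read are illustrative assumptions rather than the actual Hadoop-integrated implementation.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical prefetch worker: while the current map task is still running,
    // it pulls the next block descriptor off a queue and reads that block from
    // local disk into memory so the following task does not stall on I/O.
    public class PrefetchWorker implements Runnable {
        static class BlockRequest {
            final String path; final long offset; final int length;
            BlockRequest(String path, long offset, int length) {
                this.path = path; this.offset = offset; this.length = length;
            }
        }

        private final BlockingQueue<BlockRequest> queue = new LinkedBlockingQueue<>();
        private final BlockingQueue<byte[]> cache = new LinkedBlockingQueue<>(1); // prefetch depth = 1

        public void schedule(BlockRequest req) { queue.add(req); }
        public byte[] takePrefetched() throws InterruptedException { return cache.take(); }

        @Override public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    BlockRequest req = queue.take();          // next block chosen by the scheduler
                    byte[] buf = new byte[req.length];
                    try (RandomAccessFile f = new RandomAccessFile(req.path, "r")) {
                        f.seek(req.offset);
                        f.readFully(buf);                     // load the block from disk into memory
                    }
                    cache.put(buf);                           // blocks until the previous block is consumed
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }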
  18. Comment: add animations.
  19. Comment: add animations.
20. Figure 4.3: Comparing the execution time of Grep in native and prefetching MapReduce. Grep: 9.5% (1 GB), 8.5% (2 GB); WordCount: 8.9% (1 GB), 8.1% (2 GB).
21. Above: a single large file gives 9%, small files 24%. By increasing the number of computing nodes we can get the same improvement through PSP.
22. The animation in the third part is fairly representative, but the relationship among slides 44-53 still needs to be distilled and ideally strengthened.
23. One map task for each block of the input file; it applies the user-defined map function to each record in the block (record = <key, value>). There is a user-defined number of reduce tasks; each reduce task is assigned a set of record groups (record group = all records with the same key). For each group, the user-defined reduce function is applied to the record values in that group. Reduce tasks read from every map task, and each read returns the record groups for that reduce task.
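This record / record-group model corresponds directly to Hadoop's Mapper and Reducer classes; below is a minimal, illustrative WordCount pair (the class names are hypothetical and match the driver sketched earlier in these notes).

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: called once per record <offset, line>; emits <word, 1> pairs.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: called once per record group <word, [1, 1, ...]>; emits <word, count>.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }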
24. To reduce the amount of transferred data, map workers apply combiner functions to their local outputs before storing or transferring the intermediate data, so as to lower traffic. This combining strategy can minimize the amount of data that needs to be transferred to the reducers and speed up the execution time of the overall job. To reduce the communication between nodes, intermediate results are pipelined between fixed mappers and reducers: a reducer does not need to ask every map node in the cluster, and reducers begin processing data as soon as it is produced by mappers. To exploit the overlap between data movement and map computation: our experiments show that the shuffle is always much longer than the map computation time; in particular, when the network is saturated, the overlap disappears. Second, reducing latency in the network improves throughput but also increases the possibility of network conflicts; the pipelining idea can be used to hide this overlap. To limit the communication between map and reduce: in MapReduce's parallel execution, a node's execution includes map computation, waiting for the shuffle, and reduce computation, and the critical path is the slowest node. To improve performance, the critical path has to be shortened. However, the natural overlap of one node's shuffle wait with another node's execution does not shorten the critical path; instead, the operations within a single node need to be overlapped. The reducer checks all the available data from all the map nodes in the cluster; if some reduce nodes and map nodes can be grouped by particular key-value pairs, the network communication cost can be reduced.
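In stock Hadoop, applying such a pre-shuffling/combiner function to local map output is done by registering a combiner on the job; a minimal, hypothetical helper reusing the WordCount classes sketched above might look like this.

    import org.apache.hadoop.mapreduce.Job;

    public class CombinerConfig {
        // Register the reducer as a combiner so each map node pre-aggregates its
        // local <word, 1> pairs before the shuffle, shrinking the intermediate data.
        public static void enableLocalCombining(Job job) {
            job.setCombinerClass(WordCountReducer.class);
        }
    }

Note that a reducer can only double as a combiner when its function is commutative and associative, as summation is.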
25. The execution time of native Hadoop for the 1 GB WordCount: Figure 5.4.2 illustrates the average response time of native Hadoop for the 1 GB WordCount; the Reduce phase does not launch until all the map tasks finish. Figure 5.1(b) illustrates the average response time of preshuffling Hadoop for the 1 GB WordCount; the reduce tasks receive map outputs a little earlier and can begin sorting earlier, which reduces the time required for the final merge and achieves better cluster utilization, by 14%. Native Hadoop finishes all the map tasks earlier than preshuffling Hadoop, because preshuffling allows reduce tasks to begin executing more quickly; the reduce tasks hold their required resources for longer, causing the map phase to take longer.
26. Figure 5.2 reports the improvement of preshuffling compared with native Hadoop for WordCount. As the block size increases, preshuffling achieves a larger improvement; when the block size is less than 16 MB, the preshuffling algorithm does not gain a significant performance improvement. The same results can be found in Figure 5.3. When the block size increases to 128 MB or more, the improvement converges to a fixed value. Preshuffling Hadoop gains an average of 14% for WordCount and 12.5% for Sort. Figure 5.2: the improvement of preshuffling with different block sizes for WordCount. Figure 5.3: the improvement of preshuffling with different block sizes for Sort.
  27. title
  28. Average 14% improvement
29. Hadoop Archive: a Hadoop Archive, or HAR, is a file archiving tool that efficiently packs small files into HDFS blocks. It can bundle many small files into a single HAR file, reducing the namenode's memory usage while still allowing transparent access to the files. To archive all the small files under a directory /foo/bar into /outputdir/zoo.har: hadoop archive -archiveName zoo.har -p /foo/bar /outputdir. The HAR block size can also be specified (using -Dhar.block.size). HAR is a file system layered on top of the Hadoop file system, so all fs shell commands work on HAR files; only the file path format differs, and a HAR can be accessed with either of two path formats. Sequence file: a sequence file consists of a series of binary key/value pairs; if the key is the small file's name and the value is the file's content, a large batch of small files can be merged into one large file. Hadoop 0.21.0 provides SequenceFile, including the Writer, Reader, and Sorter classes for writing, reading, and sorting; for Hadoop versions below 0.21.0, see [3] for an implementation approach. CombineFileInputFormat: CombineFileInputFormat is a new InputFormat used to combine multiple files into a single split; in addition, it takes the data's storage location into account.
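As a concrete illustration of the sequence-file route to the small-files problem, here is a small, hypothetical sketch that packs the files of a local directory into a single SequenceFile keyed by file name; the input and output paths are placeholders, and it uses the older SequenceFile.createWriter(fs, conf, ...) form that matches the Hadoop releases discussed above.

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Pack many small local files into one HDFS SequenceFile:
    // key = file name, value = raw file bytes.
    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/user/hadoop/packed.seq");        // hypothetical output path
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, BytesWritable.class);
            try {
                File[] inputs = new File("/tmp/smallfiles").listFiles();   // hypothetical input dir
                if (inputs != null) {
                    for (File f : inputs) {
                        byte[] bytes = Files.readAllBytes(f.toPath());
                        writer.append(new Text(f.getName()), new BytesWritable(bytes));
                    }
                }
            } finally {
                writer.close();
            }
        }
    }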