Hanborq Optimizations on
  Hadoop MapReduce
          Feb.21, 2012
   Big Data Engineering Team
Motivations
•   MapReduce is a proven and successful data processing framework. It can be implemented efficiently
    and flexibly.
     – MapReduce: Simplified Data Processing on Large Clusters
     – MapReduce: A Flexible Data Processing Tool

•   Hadoop is the most popular Open Source implementation of MapReduce, but it is not well
    implemented.
     –   Long latency.
     –   Inefficient, with low performance.
     –   Not so flexible. Not simple enough to build, develop, etc.
     –   It has taken a long time to become mature (since 2006).

•   Our customers always challenge the latency and performance of Hadoop MapReduce.
•   The majority of real deployments cannot afford to install thousands of servers. The inefficiency of
    Hadoop wastes a lot of hardware and energy.

•   It’s possible to improve the Hadoop Core and applications to achieve a better experience.
     – Tenzing: A SQL Implementation On The MapReduce Framework
     – Dremel: Interactive Analysis of Web-Scale Datasets


                                                                                                            2
To build a fast architecture.

MAPREDUCE RUNTIME ENVIRONMENT


                                3
Runtime
                 Job/Task Schedule & Latency (1)
• Problem: even for an empty job, Hadoop takes tens of seconds to complete it.

• HDH Improvements:
    – Worker Pool
         • Like Google Tenzing, the HDH MapReduce runtime does not spawn new JVM processes for each job/task, but
           instead starts slot/worker processes at the initialization phase and keeps them running constantly.
         • Fast/Real-time communication between Workers and TaskTracker.
    – Transfer Job description information (JobConf, splits info, etc.) within RPC (Client->JobTracker).
         • Reduce the overhead to transfer, persist, load and parse XML files.
         • Reduce the useless and default attributes to be transferred for JobConf.
    – Heartbeat (TaskTracker)
         • Speed up the task assignment.
         • Use a triggered (out-of-band) real-time heartbeat when a special event happens.
    –   Avoid duplicated loading of configuration files for Configuration Objects.
    –   Use DistributedCache to deploy job’s jar files.
    –   Avoid some unnecessary “sleep”.
    –   Avoid some big buffer/memory allocations which take a long time and are inefficient.
    –   …

                                                                                                                 4
Runtime
              Job/Task Schedule & Latency (2)
• Worker Pool, RPC, Heartbeat

      [Diagram: the MapReduce Client submits the job (JobConf) to the JobTracker over RPC; the JobTracker
       assigns tasks to TaskTrackers via (out-of-band) heartbeats; each TaskTracker manages a Worker Pool
       of constantly running Child Worker processes.]
                                                                                                            5
Hints
• Enable these configuration properties
  – mapreduce.tasktracker.outofband.heartbeat = true
  – mapred.job.reuse.jvm.num.tasks = -1 (even though it is not
    used in HDH; see the sketch below)
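
A minimal sketch of how these hints could be applied, assuming the Hadoop 0.20.x / CDH3-era API. Note that mapreduce.tasktracker.outofband.heartbeat is normally a cluster-side setting in mapred-site.xml on each TaskTracker; it is shown here through the Configuration API only for illustration.

    import org.apache.hadoop.mapred.JobConf;

    public class LatencyHints {
      /** Applies the latency-related hints from this slide to a job configuration. */
      public static JobConf applyHints(JobConf conf) {
        // Let the TaskTracker send an immediate (out-of-band) heartbeat when a task finishes.
        conf.setBoolean("mapreduce.tasktracker.outofband.heartbeat", true);
        // Reuse one JVM for an unlimited number of tasks (ignored by HDH's Worker Pool).
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        return conf;
      }
    }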


• Issues
  – Java, GC issue for a constantly running worker JVM.



                                                                  6
To build a fast engine.

MAPREDUCE PROCESSING ENGINE


                              7
Processing Engine Improvements
• Shuffle: Use sendfile to reduce data copy and context
  switch.
• Shuffle: Netty Shuffle Server (map side) and Batch Fetch
  (reduce side).
• Sort Avoidance.
  – Spilling and Partitioning, Counting Sort, Bytes Merge, Early
    Reduce, etc.
  – Hash Aggregation in job implementation.


                                                                   8
Shuffle: Use sendfile
     to reduce data copy and context switch (1)
• Rewrite the shuffle server with Netty, using the Zero Copy API to
  transfer map output data.
  – Less data copy, more efficiency
  – Less data copy, less CPU usage
  – Less context switches, less CPU usage


• Saving more CPU to do user tasks.
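
A sketch of the zero-copy idea only (not the actual HDH/Netty shuffle server): in Java, FileChannel.transferTo() maps to sendfile(2) on Linux and lets the kernel move map-output bytes straight to the socket, skipping the user-space copies of a read()/write() loop.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.WritableByteChannel;

    public class ZeroCopySend {
      /** Streams a whole map-output file to a socket channel using sendfile-style zero copy. */
      public static void send(FileChannel mapOutput, WritableByteChannel socket) throws IOException {
        long position = 0;
        long remaining = mapOutput.size();
        while (remaining > 0) {
          // transferTo() hands the copy to the kernel; no intermediate user-space buffer.
          long sent = mapOutput.transferTo(position, remaining, socket);
          position += sent;
          remaining -= sent;
        }
      }
    }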


                                                             9
Shuffle: Use sendfile
       to reduce data copy and context switch (2)
[Figure: traditional data copy vs. data copy with sendfile; the latter assumes the NIC supports gather operations.]



                                                                                        10
Shuffle: Use sendfile
       to reduce data copy and context switch (3)
[Figure: traditional context switches vs. context switches with sendfile.]




                                                                    11
Shuffle:
              Netty Server & Batch Fetch (1)
• Less TCP connection overhead.
• Reduce the effect of TCP slow start.
• More importantly, a better shuffle schedule in the Reduce Phase
  results in better overall performance.

• Configuration
  – mapred-site.xml
     <property>
        <name>mapreduce.shuffle.max.maps</name>
        <value>4</value>
        <description>Reduce side batch fetch for efficient shuffle copy.</description>
     </property>


                                                                                         12
Shuffle:
             Netty Server & Batch Fetch (2)
One connection per map:
• Each fetch thread in the reduce copies one map output per connection, even if there are many
  outputs on that TaskTracker.

Batch fetch:
• A fetch thread copies multiple map outputs per connection.
• The fetch thread takes over the TaskTracker; other fetch threads cannot fetch outputs from it
  during the copying period.
                                                                                            13
Shuffle:
                             Netty Server & Batch Fetch evaluations
             •      Test Data
                      – 8 text files, each ~600MB in size
                      – ~50,000,000 records in total
                                  •   key: 10 bytes, record: 98 bytes
             •      Test Job
                      –     The test job includes the following phases: Map → Sort → Shuffle → Sort → Merge → Reduce
                            (the reduce only reads its input and produces no output).
   [Bar charts: job run time (min:sec) with 80 maps and 1, 2, or 4 reduces, comparing CDH3U2 against
    Batch Fetch = 1, 2, 4 and 20; run times are roughly 03:36–03:48 with 1 reduce, 02:11–02:49 with
    2 reduces, and 01:32–01:51 with 4 reduces.]
                 We find the gains of this improvement are not very distinct when the total number of maps and reduces (M*R) is low. To be verified!
                                                                                                                                                                                                            14
Sort Avoidance
• Many real-world jobs require shuffling, but not sorting, and the sorting brings
  much overhead.
    – Hash Aggregations
    – Hash Joins
    – Filtering and simple processing (process each record independently from other
      records)
    – …, etc.

• When sorting is turned off, the mapper feeds data to the reducer which directly
  passes the data to the Reduce() function bypassing the intermediate sorting step.
    – Spilling, Partitioning, Merging and Reducing will be more efficient.

• How to turn off sorting? (a fuller job sketch follows below)
    – JobConf job = (JobConf) getConf();
    – job.setBoolean("mapred.sort.avoidance", true);
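
For context, a minimal sketch of a whole pass-through job that could enable the switch above, assuming the stock Hadoop 0.20 JobConf API (the class name and paths are hypothetical; the identity mapper/reducer defaults are used, so no key ordering is needed):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class FilterJob {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(FilterJob.class);
        job.setJobName("filter-without-sort");
        // HDH-specific switch: shuffle the data but skip the intermediate sort.
        job.setBoolean("mapred.sort.avoidance", true);
        // TextInputFormat + identity mapper/reducer: records pass through unchanged.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
      }
    }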

                                                                                      15
Sort Avoidance:
                  Spilling and Partitioning
• When spilling, records are compared by partition only.
• Partition comparison uses counting sort [O(n)], not quick sort
  [O(n log n)]. (A sketch of the idea follows below.)
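
A minimal sketch of the counting-sort idea (illustrative only, not the HDH spill code): when spill records only need to be grouped by partition id, one histogram pass plus prefix sums yields the order in O(n).

    public class PartitionCountingSort {
      /** Returns the record indices ordered (stably) by partition id. */
      public static int[] orderByPartition(int[] partitionOfRecord, int numPartitions) {
        int n = partitionOfRecord.length;
        int[] counts = new int[numPartitions + 1];
        for (int p : partitionOfRecord) {
          counts[p + 1]++;                           // histogram of partition sizes
        }
        for (int p = 0; p < numPartitions; p++) {
          counts[p + 1] += counts[p];                // prefix sums: start offset of each partition
        }
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
          order[counts[partitionOfRecord[i]]++] = i; // drop record i into its partition's next slot
        }
        return order;
      }
    }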




                                                                    16
Sort Avoidance:
     Early Reduce (Remove shuffle barrier)

• Currently the reduce function can’t start until all map
  outputs have been fetched.
• When sorting is unnecessary, the reduce function can start as
  soon as any map output is available.
• This greatly improves overall performance!



                                                           17
Sort Avoidance:
                     Bytes Merge
• No overhead of key/value serialization/deserialization or comparison.
• Don’t care about records, just bytes.
• Just concatenate byte streams together – read in bytes, write out bytes.
  (A small sketch follows below.)
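
A small sketch of the raw bytes merge, under the assumption that with sorting off the spill segments can simply be concatenated instead of key-merged:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class BytesMerge {
      /** Concatenates segment streams: no deserialization, no comparison, just bytes. */
      public static void concat(Iterable<InputStream> segments, OutputStream out) throws IOException {
        byte[] buf = new byte[64 * 1024];
        for (InputStream segment : segments) {
          int n;
          while ((n = segment.read(buf)) != -1) {
            out.write(buf, 0, n);   // bytes in, bytes out, untouched
          }
          segment.close();
        }
      }
    }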

                                      18
Sort Avoidance:
             Sequential Reduce Inputs
• Input files are read sequentially to feed the reduce function, so
  there are no disk seeks and performance is better.




                                                            19
Let’s try HDH Hadoop.

BENCHMARKS


                        20
Benchmarks:
                      Runtime Job/Task Schedule & Latency (1)
• Testbed:
    – 5 node cluster (4 slaves), 8 map slots and 1 reduce slot per node.
• Test Jobs:
    – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar sleep -m maps -r reduces -mt 1 -rt 1

• HDH launches jobs and tasks very fast.

         Job Latency (in seconds, lower is better):

           Total Tasks (32 maps, 4 reduces):
             CDH3u2 (Cloudera, reuse.jvm disabled): 24
             CDH3u2 (Cloudera, reuse.jvm enabled):  21
             HDH3u2 (Hanborq):                       1

           Total Tasks (96 maps, 4 reduces):
             CDH3u2 (Cloudera, reuse.jvm disabled): 43
             CDH3u2 (Cloudera, reuse.jvm enabled):  24
             HDH3u2 (Hanborq):                       1
                                                                                                                                              21
Benchmarks:
                             Runtime Job/Task Schedule & Latency (2)
• Another Testbed:
   – 4 node cluster (3 slaves), 8 map slots and 3 reduce slots per node.
• Test Jobs:
   – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar sleep -m 24~384 -r 9 -mt 1 -rt 1

   [Line chart: job latency (seconds) vs. number of map tasks (24, 96, 192, 288, 384), comparing CDH3u2
    (reuse.jvm disabled), CDH3u2 (reuse.jvm enabled) and HDH; lower and flatter is better.]
                                                                                                                 22
Benchmarks:
            Sort Avoidance and Aggregation (1)
• Testbed:
    – 5 node cluster (4 slaves), 8 map slots and 2 reduce slots per node.
    – There is only 6GB RAM and 1 SATA 7.2K disk.
• Test Data:
    – Data size : ~ 20G
• Test Cases:
    – Case1
        • Hash aggregation in the map and reduce phases.
        • The map only outputs a limited set of integer key-value pairs, so the shuffled data is very tiny (MBs).
    – Case2
        • Always uses the old method (sort and combiner) to implement the aggregation.
        • The map outputs many integer key-value pairs, but the shuffled data is still not large (tens of MB).
    – Case3
        • Hash aggregation in the reduce phase only; the map does not use hashing and just outputs many longer key-value pairs.
        • The map outputs many long key-value pairs, so the shuffled data is distinctly larger (~12 GB) than in Case1 and Case2.
        • This case is intentionally designed to test and highlight the effect of Sort-Avoidance.




                                                                                                                    23
Benchmarks:
                              Sort Avoidance and Aggregation (2)
•   Case1 and Case2 are like:
      –   SELECT intA, COUNT(1) FROM T1 GROUP BY intA;
•   Case3 is like:
      –   SELECT A, B, C, D, SUM(M), SUM(N), SUM(R), SUM(P), SUM(Q) ... FROM T2 GROUP BY A, B, C, D;

    Sort Avoidance and Aggregation, time in seconds (lower is better):
                         Case1   Case2   Case3
    CDH3u2 (Cloudera)      197     216    2186
    HDH (Hanborq)          175     198     615

•   Case1:
      –   The shuffled data is very small and sorting it in memory is very fast, so the improvement is only
          ~11%, which may mainly come from the Worker Pool implementation.
•   Case2:
      –   Still uses sorting to do the aggregation; the tiny gains may mainly come from the Worker Pool
          implementation.
      –   This case also demonstrates that the processing engine improvements do not bring in negative effects.
•   Case3:
      –   The shuffled (and, in CDH3u2, sorted) data is large enough that the gains from Sort-Avoidance
          become very distinct.
                                                                                                                                              24
Benchmarks:
                 Sort Avoidance and Aggregation (3)
•   Testbed:
     –   4 node cluster (3 slaves), 8 map slots and 3 reduce slots per node.
     –   48GB RAM and 5 SATA 7.2k disks.
     –   Large RAM and more disks to avoid a disk I/O bottleneck.

•   Test Data:
     –   400,000,000 rows of comma-separated text, ~100 bytes per row.
     –   ~40GB of data in total (again to avoid a disk I/O bottleneck).

•   Query1:
     –   Equivalent to: select type, sum(reqnum) from cdr group by type;
•   Query2:
     –   Equivalent to: select userid, sum(reqnum), max(reqnum), min(reqnum), avg(reqnum), sum(dur),
         max(dur), min(dur), avg(dur) from cdr group by userid;

•   Case1-1: Use sort to implement the aggregation of Query1
•   Case2-1: Use sort to implement the aggregation of Query2
•   Case1-2: Use hash (map and reduce) to implement the aggregation of Query1
•   Case2-2: Use hash (map and reduce) to implement the aggregation of Query2

    Real Aggregation Jobs, time in seconds (lower is better):
                         Case1-1   Case2-1   Case1-2   Case2-2
    CDH3u2 (Cloudera)        238       603       136       206
    HDH (Hanborq)            233       578        96       151

    Analysis:
    - Case1-1 and Case2-1 still use sort; the gains may mainly come from the Worker Pool.
    - Case1-2 and Case2-2 use hash aggregation, so they benefit distinctly from Sort-Avoidance.
                                                                                                       25
Benchmarks:
                                       TeraSort
• Testbed:
   – 5 node cluster (4 slaves), 8 map slots and 2 reduce slots per node.
   – There is only 6GB memory and 1 SATA 7.2K disk.

• Generate 100GB data:
   – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar teragen 1000000000 /teradata
• Terasort Job:
   – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar terasort /teradata /terasort

  TeraSort: Sort 100GB (in minutes, lower is better):
    CDH3u2 (Cloudera): 49
    HDH (Hanborq):     43

   •   Since there is only 1 disk on each machine, the bottleneck is the disk (iostat).
   •   We got a 12% improvement even under this disk bottleneck.
   •   The gains may come from: Shuffle, and Task Scheduling in the Worker Pool.
                                                                                                              26
Benchmarks:
                                      Integration with Hive
•   Testbed
     –   4 node cluster (3 slaves), 8 map slots and 3 reduce slots per node.
     –   48GB RAM and 5 SATA 7.2k disks.
     –   Large RAM and more disks to avoid a disk I/O bottleneck.

•   Dataset
     –   400,000,000 rows of comma-separated text, ~100 bytes per row.
     –   ~40GB of data in total (again to avoid a disk I/O bottleneck).

•   Query1
     –   select type, sum(reqnum) from cdr group by type;
•   Query2
     –   INSERT OVERWRITE DIRECTORY '/tmp/out' select userid, sum(reqnum), max(reqnum), min(reqnum),
         avg(reqnum), sum(dur), max(dur), min(dur), avg(dur) from cdr group by userid;

    [Bar chart: Hive query times in seconds for Query1 and Query2, CDH3u2 (Cloudera) vs. HDH (Hanborq);
     lower is better.]

•   Result Analysis
     –   Since we did not modify Hive to use the Sort-Avoidance feature, the modest time saved may mainly
         come from the Worker Pool.
     –   We plan to modify Hive to support Sort-Avoidance for such aggregation and aggregation-join queries.
                                                                                                                                               27
Benchmark More …
• The above evaluations ran only on a small developers'
  cluster, to give a quick view of our
  improvements.

• We are working on further improvements and doing
  more comprehensive evaluations on a larger and more
  powerful cluster. The results will be published as soon
  as possible.

                                                            28
To be open.

OPEN SOURCE


              29
HDH
                Hanborq Distribution with Hadoop
• HDH aims to make Hadoop Fast, Simple and Robust.

• HDH delivers a series of improvements on Hadoop Core, and Hadoop-
  based tools and applications for putting Hadoop to work solving Big Data
  problems in production.

• HDH may be ideal for enterprises seeking an
  integrated, fast, simple, and robust Hadoop Distribution. In particular, if
  you think your MapReduce jobs are slow and perform poorly, HDH
  may be your choice.

• Like, and based on, Apache Hadoop and Cloudera’s CDH, Hanborq
  delivers HDH. Besides the Hadoop Core, it will include various
  additional components.
                                                                                30
Hanborq Open Source
• Github
   – Welcome to visit Hanborq’s Open Source Repositories
   – https://github.com/hanborq/

• Hadoop
   – A Hanborq optimized Hadoop Core, especially with high performance of
     MapReduce. It's the core part of HDH.

• RockStor (coming soon)
   – An Object Storage System implementation over Hadoop and HBase, which can
     provide a service similar to Amazon S3 (Simple Storage Service).

• We will continue to open source more useful projects in the future …

                                                                            31
Thank You Very Much!
 Anty Rao, Guangxian Liao, Schubert Zhang
{ant.rao, liaoguangxian, schubert.zhang}@gmail.com

    https://github.com/hanborq/hadoop
    http://www.slideshare.net/hanborq
 http://www.slideshare.net/schubertzhang

               to be continued …



                                                     32

Weitere ähnliche Inhalte

Was ist angesagt?

Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationYahoo Developer Network
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Scheduling MapReduce Jobs in HPC Clusters
Scheduling MapReduce Jobs in HPC ClustersScheduling MapReduce Jobs in HPC Clusters
Scheduling MapReduce Jobs in HPC ClustersMarcelo Veiga Neves
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHortonworks
 
Bft mr-clouds-of-clouds-discco2012 - navtalk
Bft mr-clouds-of-clouds-discco2012 - navtalkBft mr-clouds-of-clouds-discco2012 - navtalk
Bft mr-clouds-of-clouds-discco2012 - navtalkPedro (A. R. S.) Costa
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshootingmapr-academy
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudRose Toomey
 
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentOct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentYahoo Developer Network
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingNandan Kumar
 
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009James McGalliard
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
linux.conf.au-HAminiconf-pgsql91-20120116
linux.conf.au-HAminiconf-pgsql91-20120116linux.conf.au-HAminiconf-pgsql91-20120116
linux.conf.au-HAminiconf-pgsql91-20120116ksk_ha
 
MapReduce Using Perl and Gearman
MapReduce Using Perl and GearmanMapReduce Using Perl and Gearman
MapReduce Using Perl and GearmanJamie Pitts
 
Oracle Exadata Version 2
Oracle Exadata Version 2Oracle Exadata Version 2
Oracle Exadata Version 2Jarod Wang
 

Was ist angesagt? (20)

Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
cosbench-openstack.pdf
cosbench-openstack.pdfcosbench-openstack.pdf
cosbench-openstack.pdf
 
Cosbench apac
Cosbench apacCosbench apac
Cosbench apac
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Scheduling MapReduce Jobs in HPC Clusters
Scheduling MapReduce Jobs in HPC ClustersScheduling MapReduce Jobs in HPC Clusters
Scheduling MapReduce Jobs in HPC Clusters
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
 
Bft mr-clouds-of-clouds-discco2012 - navtalk
Bft mr-clouds-of-clouds-discco2012 - navtalkBft mr-clouds-of-clouds-discco2012 - navtalk
Bft mr-clouds-of-clouds-discco2012 - navtalk
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshooting
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentOct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch training
 
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
linux.conf.au-HAminiconf-pgsql91-20120116
linux.conf.au-HAminiconf-pgsql91-20120116linux.conf.au-HAminiconf-pgsql91-20120116
linux.conf.au-HAminiconf-pgsql91-20120116
 
MapReduce Using Perl and Gearman
MapReduce Using Perl and GearmanMapReduce Using Perl and Gearman
MapReduce Using Perl and Gearman
 
Prdc2012
Prdc2012Prdc2012
Prdc2012
 
Oracle Exadata Version 2
Oracle Exadata Version 2Oracle Exadata Version 2
Oracle Exadata Version 2
 

Ähnlich wie Hanborq optimizations on hadoop map reduce 20120221a

Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingTeddy Choi
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of Viewaragozin
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종NAVER D2
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftLee Stott
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationZero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationScott Miao
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and DeploymentCisco Canada
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users GroupNitay Joffe
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHungWei Chiu
 

Ähnlich wie Hanborq optimizations on hadoop map reduce 20120221a (20)

Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of View
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop Microsoft
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationZero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter Migration
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 

Mehr von Schubert Zhang

Engineering Culture and Infrastructure
Engineering Culture and InfrastructureEngineering Culture and Infrastructure
Engineering Culture and InfrastructureSchubert Zhang
 
Simple practices in performance monitoring and evaluation
Simple practices in performance monitoring and evaluationSimple practices in performance monitoring and evaluation
Simple practices in performance monitoring and evaluationSchubert Zhang
 
Scrum Agile Development
Scrum Agile DevelopmentScrum Agile Development
Scrum Agile DevelopmentSchubert Zhang
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processingSchubert Zhang
 
Bigtable数据模型解决CDR清单存储问题的资源估算
Bigtable数据模型解决CDR清单存储问题的资源估算Bigtable数据模型解决CDR清单存储问题的资源估算
Bigtable数据模型解决CDR清单存储问题的资源估算Schubert Zhang
 
Big Data Engineering Team Meeting 20120223a
Big Data Engineering Team Meeting 20120223aBig Data Engineering Team Meeting 20120223a
Big Data Engineering Team Meeting 20120223aSchubert Zhang
 
HBase Coprocessor Introduction
HBase Coprocessor IntroductionHBase Coprocessor Introduction
HBase Coprocessor IntroductionSchubert Zhang
 
Hadoop大数据实践经验
Hadoop大数据实践经验Hadoop大数据实践经验
Hadoop大数据实践经验Schubert Zhang
 
Wild Thinking of BigdataBase
Wild Thinking of BigdataBaseWild Thinking of BigdataBase
Wild Thinking of BigdataBaseSchubert Zhang
 
RockStor - A Cloud Object System based on Hadoop
RockStor -  A Cloud Object System based on HadoopRockStor -  A Cloud Object System based on Hadoop
RockStor - A Cloud Object System based on HadoopSchubert Zhang
 
Hadoop compress-stream
Hadoop compress-streamHadoop compress-stream
Hadoop compress-streamSchubert Zhang
 
Ganglia轻度使用指南
Ganglia轻度使用指南Ganglia轻度使用指南
Ganglia轻度使用指南Schubert Zhang
 
DaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionDaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionSchubert Zhang
 

Mehr von Schubert Zhang (20)

Blockchain in Action
Blockchain in ActionBlockchain in Action
Blockchain in Action
 
科普区块链
科普区块链科普区块链
科普区块链
 
Engineering Culture and Infrastructure
Engineering Culture and InfrastructureEngineering Culture and Infrastructure
Engineering Culture and Infrastructure
 
Simple practices in performance monitoring and evaluation
Simple practices in performance monitoring and evaluationSimple practices in performance monitoring and evaluation
Simple practices in performance monitoring and evaluation
 
Scrum Agile Development
Scrum Agile DevelopmentScrum Agile Development
Scrum Agile Development
 
Career Advice
Career AdviceCareer Advice
Career Advice
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processing
 
HiveServer2
HiveServer2HiveServer2
HiveServer2
 
Horizon for Big Data
Horizon for Big DataHorizon for Big Data
Horizon for Big Data
 
Bigtable数据模型解决CDR清单存储问题的资源估算
Bigtable数据模型解决CDR清单存储问题的资源估算Bigtable数据模型解决CDR清单存储问题的资源估算
Bigtable数据模型解决CDR清单存储问题的资源估算
 
Big Data Engineering Team Meeting 20120223a
Big Data Engineering Team Meeting 20120223aBig Data Engineering Team Meeting 20120223a
Big Data Engineering Team Meeting 20120223a
 
HBase Coprocessor Introduction
HBase Coprocessor IntroductionHBase Coprocessor Introduction
HBase Coprocessor Introduction
 
Hadoop大数据实践经验
Hadoop大数据实践经验Hadoop大数据实践经验
Hadoop大数据实践经验
 
Wild Thinking of BigdataBase
Wild Thinking of BigdataBaseWild Thinking of BigdataBase
Wild Thinking of BigdataBase
 
RockStor - A Cloud Object System based on Hadoop
RockStor -  A Cloud Object System based on HadoopRockStor -  A Cloud Object System based on Hadoop
RockStor - A Cloud Object System based on Hadoop
 
Fans of running gump
Fans of running gumpFans of running gump
Fans of running gump
 
Hadoop compress-stream
Hadoop compress-streamHadoop compress-stream
Hadoop compress-stream
 
Ganglia轻度使用指南
Ganglia轻度使用指南Ganglia轻度使用指南
Ganglia轻度使用指南
 
DaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionDaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solution
 
Big data and cloud
Big data and cloudBig data and cloud
Big data and cloud
 

Kürzlich hochgeladen

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Kürzlich hochgeladen (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity

Hanborq optimizations on hadoop map reduce 20120221a

  • 7. To build a fast engine.
       MAPREDUCE PROCESSING ENGINE
  • 8. Processing Engine Improvements
       • Shuffle: use sendfile to reduce data copies and context switches.
       • Shuffle: Netty shuffle server (map side) and batch fetch (reduce side).
       • Sort Avoidance
         – Spilling and partitioning, counting sort, bytes merge, early reduce, etc.
         – Hash aggregation in the job implementation.
  • 9. Shuffle: Use sendfile to reduce data copies and context switches (1)
       • The shuffle server is rewritten with Netty, using its zero-copy API to transfer map output data (a minimal Java sketch of the underlying primitive follows the diagrams below).
         – Fewer data copies, more efficiency.
         – Fewer data copies, less CPU usage.
         – Fewer context switches, less CPU usage.
       • The CPU saved is left for user tasks.
  • 10. Shuffle: Use sendfile to reduce data copies and context switches (2)
        [Diagram: traditional data copy vs. data copy with sendfile, and the variant where the NIC supports gather operations.]
  • 11. Shuffle: Use sendfile to reduce data copies and context switches (3)
        [Diagram: traditional context switches vs. context switches with sendfile.]
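  The zero-copy call behind these diagrams is sendfile(2), which Java exposes as FileChannel.transferTo(); the real HDH shuffle server wraps the same idea in Netty (for example via a FileRegion). The fragment below is only a minimal sketch of that primitive, not the HDH code itself, and the file name and reduce-side address are hypothetical.

      import java.io.IOException;
      import java.net.InetSocketAddress;
      import java.nio.channels.FileChannel;
      import java.nio.channels.SocketChannel;
      import java.nio.file.Paths;
      import java.nio.file.StandardOpenOption;

      public class ZeroCopySend {
          public static void main(String[] args) throws IOException {
              try (FileChannel mapOutput = FileChannel.open(
                       Paths.get("file.out"), StandardOpenOption.READ);   // hypothetical map output file
                   SocketChannel peer = SocketChannel.open(
                       new InetSocketAddress("reduce-host", 8080))) {     // hypothetical reduce-side endpoint
                  long pos = 0;
                  long size = mapOutput.size();
                  while (pos < size) {
                      // transferTo() maps to sendfile(2) on Linux: the kernel feeds the socket
                      // directly from the page cache, so the bytes never enter user space.
                      // It may transfer fewer bytes than requested, hence the loop.
                      pos += mapOutput.transferTo(pos, size - pos, peer);
                  }
              }
          }
      }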
  • 12. Shuffle: Netty Server & Batch Fetch (1)
        • Less TCP connection overhead.
        • Reduces the effect of TCP slow start.
        • More importantly, better shuffle scheduling in the reduce phase results in better overall performance.
        • Configuration (mapred-site.xml):
            <property>
              <name>mapreduce.shuffle.max.maps</name>
              <value>4</value>
              <description>Reduce side batch fetch for efficient shuffle copy.</description>
            </property>
  • 13. Shuffle: Netty Server & Batch Fetch (2)
        • One connection per map (original behavior):
          – Each fetch thread in the reduce copies one map output per connection, even when that TaskTracker holds many outputs.
          – The fetch thread occupies the TaskTracker, so other fetch threads cannot fetch outputs from it during the copy.
        • Batch fetch:
          – A fetch thread copies multiple map outputs per connection.
  • 14. Shuffle: Netty Server & Batch Fetch evaluations
        • Test data
          – 8 text files, each ~600 MB in size.
          – ~50,000,000 records in total (key: 10 bytes, record: 98 bytes).
        • Test job
          – Phases: Map → Sort → Shuffle → Sort/Merge → Reduce (the reduce only reads its input and writes no output).
        • [Charts: job time in minutes for 80 maps with 1, 2 and 4 reduces, comparing CDH3u2 against batch fetch sizes of 1, 2, 4 and 20.]
        • We find the gains of this improvement are not very distinct when the total number of maps and reduces is low. To be verified.
  • 15. Sort Avoidance
        • Many real-world jobs require shuffling, but not sorting, and the sorting brings much overhead:
          – Hash aggregations
          – Hash joins
          – Filtering and simple processing (each record is processed independently of the others)
          – …, etc.
        • When sorting is turned off, the mapper feeds data to the reducer, which passes it directly to the reduce() function, bypassing the intermediate sorting step.
          – Spilling, partitioning, merging and reducing all become more efficient.
        • How to turn off sorting (see the driver sketch below):
          – JobConf job = (JobConf) getConf();
          – job.setBoolean("mapred.sort.avoidance", true);
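  Assuming the mapred.sort.avoidance switch shown above, a driver for a shuffle-only job might look roughly like the sketch below (old mapred API). The class name is hypothetical, the mapper and reducer are left as the identity defaults, and the input/output paths come from the command line.

      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      public class SortAvoidanceDriver extends Configured implements Tool {

          @Override
          public int run(String[] args) throws Exception {
              JobConf job = new JobConf(getConf(), SortAvoidanceDriver.class);
              job.setJobName("shuffle-without-sort");

              // HDH-specific switch from the slide above: skip the intermediate sort.
              job.setBoolean("mapred.sort.avoidance", true);

              // A real job would set record-at-a-time map/reduce logic here
              // (filtering, hash aggregation, ...); the identity defaults are used.
              FileInputFormat.setInputPaths(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              JobClient.runJob(job);
              return 0;
          }

          public static void main(String[] args) throws Exception {
              System.exit(ToolRunner.run(new SortAvoidanceDriver(), args));
          }
      }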
  • 16. Sort Avoidance: Spilling and Partitioning
        • When the map output spills, records are compared by partition only.
        • The partition comparison uses counting sort, O(n), instead of quicksort, O(n log n); a rough sketch follows.
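  For illustration only, a counting sort keyed on the partition id looks like the following; SpillRecord is a hypothetical stand-in for the real spill-buffer metadata, not a class from the HDH code.

      public class PartitionCountingSort {

          /** Hypothetical stand-in for the per-record metadata kept at spill time. */
          static final class SpillRecord {
              final int partition;      // destination reduce partition
              final long bufferOffset;  // where the serialized bytes live in the spill buffer
              SpillRecord(int partition, long bufferOffset) {
                  this.partition = partition;
                  this.bufferOffset = bufferOffset;
              }
          }

          /** Groups records by partition in O(n + P) time, keeping their original order. */
          static SpillRecord[] sortByPartition(SpillRecord[] records, int numPartitions) {
              int[] offsets = new int[numPartitions + 1];
              for (SpillRecord r : records) {
                  offsets[r.partition + 1]++;           // histogram of partition sizes, shifted by one
              }
              for (int p = 0; p < numPartitions; p++) {
                  offsets[p + 1] += offsets[p];         // prefix sums = start index of each partition
              }
              SpillRecord[] sorted = new SpillRecord[records.length];
              for (SpillRecord r : records) {
                  sorted[offsets[r.partition]++] = r;   // drop each record into its bucket
              }
              return sorted;
          }

          public static void main(String[] args) {
              SpillRecord[] records = {
                  new SpillRecord(2, 0), new SpillRecord(0, 120), new SpillRecord(1, 260),
                  new SpillRecord(0, 400), new SpillRecord(2, 530)
              };
              for (SpillRecord r : sortByPartition(records, 3)) {
                  System.out.println("partition " + r.partition + " @ offset " + r.bufferOffset);
              }
          }
      }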
  • 17. Sort Avoidance: Early Reduce (remove the shuffle barrier)
        • Currently the reduce function cannot start until all map outputs have been fetched.
        • When sorting is unnecessary, the reduce function can start as soon as any map output is available.
        • This greatly improves overall performance.
  • 18. Sort Avoidance: Bytes Merge
        • No key/value serialization, deserialization or comparison overhead.
        • The merger does not care about records, only bytes.
        • Segments are simply concatenated: read in bytes, write out bytes (see the sketch below).
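  When no sort order is required, the merge degenerates to byte concatenation. The sketch below shows the idea on ordinary local files with hypothetical names; the real merger works on fetched map-output segments rather than files like these.

      import java.io.IOException;
      import java.io.InputStream;
      import java.io.OutputStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;
      import java.util.Arrays;
      import java.util.List;

      public class BytesMerge {

          /** Concatenates the segment files into one output file, byte for byte. */
          static void concat(List<Path> segments, Path merged) throws IOException {
              byte[] buffer = new byte[64 * 1024];
              try (OutputStream out = Files.newOutputStream(merged)) {
                  for (Path segment : segments) {
                      try (InputStream in = Files.newInputStream(segment)) {
                          int n;
                          while ((n = in.read(buffer)) != -1) {
                              // No deserialization and no key comparison: record boundaries
                              // survive because the bytes are copied exactly as they are.
                              out.write(buffer, 0, n);
                          }
                      }
                  }
              }
          }

          public static void main(String[] args) throws IOException {
              // Hypothetical segment names standing in for fetched map outputs.
              concat(Arrays.asList(Paths.get("segment-0.out"), Paths.get("segment-1.out")),
                     Paths.get("merged.out"));
          }
      }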
  • 19. Sort Avoidance: Sequential Reduce Inputs
        • Input files are read sequentially to feed the reduce function, so there are no disk seeks and performance is better.
  • 20. Let’s try HDH Hadoop.
        BENCHMARKS
  • 21. Benchmarks: Runtime Job/Task Schedule & Latency (1)
        • Testbed:
          – 5-node cluster (4 slaves), 8 map slots and 1 reduce slot per node.
        • Test jobs:
          – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar sleep -m <maps> -r <reduces> -mt 1 -rt 1
        • HDH launches jobs and tasks very fast.
        • [Charts: job latency in seconds, lower is better. 32 maps / 4 reduces: CDH3u2 ~24 s (reuse.jvm disabled), CDH3u2 ~21 s (reuse.jvm enabled), HDH3u2 ~1 s. 96 maps / 4 reduces: ~43 s, ~24 s and ~1 s respectively.]
  • 22. Benchmarks: Runtime Job/Task Schedule & Latency (2)
        • Another testbed:
          – 4-node cluster (3 slaves), 8 map slots and 3 reduce slots per node.
        • Test jobs:
          – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar sleep -m 24~384 -r 9 -mt 1 -rt 1
        • [Chart: job latency versus number of map tasks (24 to 384 maps), lower and flatter is better, comparing CDH3u2 (reuse.jvm disabled), CDH3u2 (reuse.jvm enabled) and HDH.]
  • 23. Benchmarks: Sort Avoidance and Aggregation (1)
        • Testbed:
          – 5-node cluster (4 slaves), 8 map slots and 2 reduce slots per node.
          – Only 6 GB RAM and 1 SATA 7.2K disk per node.
        • Test data:
          – Data size: ~20 GB.
        • Test cases (a reducer sketch for the hash-aggregation cases follows):
          – Case 1
            • Hash aggregation in both the map and reduce phases.
            • The map outputs only a limited set of integer key-value pairs, so the shuffled data is very small (a few MB).
          – Case 2
            • Uses the old method (sort and combiner) to implement the aggregation.
            • The map outputs many integer key-value pairs, but the shuffled data is still not large (tens of MB).
          – Case 3
            • Hash aggregation in the reduce phase only; the map does not hash but outputs many longer key-value pairs.
            • The map outputs many long key-value pairs, so the shuffled data (~12 GB) is much larger than in Case 1 and Case 2.
            • This case is intentionally designed to test and highlight the effect of sort avoidance.
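  As a rough idea of what the hash-aggregation cases do on the reduce side, the hypothetical reducer below (old mapred API, not the actual benchmark code) keeps per-key sums in a HashMap and flushes them in close(), so it does not rely on the input being sorted or grouped by key.

      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Iterator;
      import java.util.Map;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      public class HashCountReducer extends MapReduceBase
              implements Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {

          // In-memory aggregation table; workable here because the number of
          // distinct group keys (e.g. "type") is small.
          private final Map<Integer, Long> sums = new HashMap<Integer, Long>();
          private OutputCollector<IntWritable, LongWritable> out;

          @Override
          public void reduce(IntWritable key, Iterator<LongWritable> values,
                             OutputCollector<IntWritable, LongWritable> output,
                             Reporter reporter) throws IOException {
              out = output;
              long partial = 0;
              while (values.hasNext()) {
                  partial += values.next().get();
              }
              // Accumulate across calls: with sorting off, the same key may appear
              // in more than one reduce() invocation.
              Long previous = sums.get(key.get());
              sums.put(key.get(), previous == null ? partial : previous + partial);
          }

          @Override
          public void close() throws IOException {
              // Emit the aggregated table once all input has been consumed.
              if (out != null) {
                  for (Map.Entry<Integer, Long> e : sums.entrySet()) {
                      out.collect(new IntWritable(e.getKey()), new LongWritable(e.getValue()));
                  }
              }
          }
      }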
  • 24. Benchmarks: Sort Avoidance and Aggregation (2)
        • Case 1 and Case 2 are like: SELECT intA, COUNT(1) FROM T1 GROUP BY intA;
        • Case 3 is like: SELECT A, B, C, D, SUM(M), SUM(N), SUM(R), SUM(P), SUM(Q) ... FROM T2 GROUP BY A, B, C, D;
        • Case 1:
          – The shuffled data is very small and sorting it in memory is very fast, so the gain is only ~11%, which probably comes mainly from the worker-pool implementation.
        • Case 2:
          – Sorting is still used for the aggregation; the small gain probably also comes mainly from the worker-pool implementation.
          – This case also demonstrates that the processing-engine improvements do not introduce any negative effect.
        • Case 3:
          – The shuffled (and, in CDH3u2, sorted) data is large enough that the gains from sort avoidance become very distinct.
        • Results (time in seconds, lower is better):
          – Case 1: CDH3u2 197, HDH 175
          – Case 2: CDH3u2 216, HDH 198
          – Case 3: CDH3u2 2186, HDH 615
  • 25. Benchmarks: Sort Avoidance and Aggregation (3)
        • Testbed:
          – 4-node cluster (3 slaves), 8 map slots and 3 reduce slots per node.
          – 48 GB RAM and 5 SATA 7.2K disks per node; the large RAM and extra disks avoid a disk-I/O bottleneck.
        • Test data:
          – 400,000,000 rows of comma-separated text, ~100 bytes per row, ~40 GB in total (again to avoid a disk-I/O bottleneck).
        • Query 1: equivalent to select type, sum(reqnum) from cdr group by type;
        • Query 2: equivalent to select userid, sum(reqnum), max(reqnum), min(reqnum), avg(reqnum), sum(dur), max(dur), min(dur), avg(dur) from cdr group by userid;
        • Cases:
          – Case 1-1: use sort to implement the aggregation of Query 1.
          – Case 2-1: use sort to implement the aggregation of Query 2.
          – Case 1-2: use hash aggregation (map and reduce) for Query 1.
          – Case 2-2: use hash aggregation (map and reduce) for Query 2.
        • Results (real aggregation jobs, time in seconds, lower is better):
          – Case 1-1: CDH3u2 238, HDH 233
          – Case 2-1: CDH3u2 603, HDH 578
          – Case 1-2: CDH3u2 136, HDH 96
          – Case 2-2: CDH3u2 206, HDH 151
        • Analysis:
          – Case 1-1 and Case 2-1 still use sort, so the gains probably come mainly from the worker pool.
          – Case 1-2 and Case 2-2 use hash aggregation, so they benefit distinctly from sort avoidance.
  • 26. Benchmarks: TeraSort
        • Testbed:
          – 5-node cluster (4 slaves), 8 map slots and 2 reduce slots per node.
          – Only 6 GB memory and 1 SATA 7.2K disk per node.
        • Generate 100 GB of data:
          – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar teragen 1000000000 /teradata
        • TeraSort job:
          – bin/hadoop jar hadoop-examples-0.20.2-?dh3u2.jar terasort /teradata /terasort
        • Results (sort 100 GB, in minutes, lower is better): CDH3u2 49, HDH 43.
        • Since there is only one disk in each machine, the disk is the bottleneck (confirmed with iostat).
        • A 12% improvement was obtained even under this disk bottleneck.
        • The gains probably come from the shuffle improvements and the worker-pool task scheduling.
  • 27. Benchmarks: Integration with Hive
        • Testbed:
          – 4-node cluster (3 slaves), 8 map slots and 3 reduce slots per node.
          – 48 GB RAM and 5 SATA 7.2K disks per node; the large RAM and extra disks avoid a disk-I/O bottleneck.
        • Dataset:
          – 400,000,000 rows of comma-separated text, ~100 bytes per row, ~40 GB in total (again to avoid a disk-I/O bottleneck).
        • Query 1: select type, sum(reqnum) from cdr group by type;
        • Query 2: INSERT OVERWRITE DIRECTORY '/tmp/out' select userid, sum(reqnum), max(reqnum), min(reqnum), avg(reqnum), sum(dur), max(dur), min(dur), avg(dur) from cdr group by userid;
        • [Chart: Hive query time in seconds for Query 1 and Query 2, lower is better, CDH3u2 vs. HDH.]
        • Result analysis:
          – Since Hive was not modified to use the sort-avoidance feature, the modest time saved probably comes mainly from the worker pool.
          – We plan to modify Hive to support sort avoidance for such aggregation and aggregation-join queries.
  • 28. Benchmark More …
        • The evaluations above ran only on a small developer cluster, to give a quick view of our improvements.
        • We are working on further improvements and on more comprehensive evaluations on a larger, more powerful cluster; the results will be published as soon as possible.
  • 29. To be open.
        OPEN SOURCE
  • 30. HDH: Hanborq Distribution with Hadoop
        • HDH aims to make Hadoop fast, simple and robust.
        • HDH delivers a series of improvements to Hadoop Core, along with Hadoop-based tools and applications, for putting Hadoop to work on Big Data problems in production.
        • HDH may be ideal for enterprises seeking an integrated, fast, simple and robust Hadoop distribution. In particular, if your MapReduce jobs are slow and under-performing, HDH may be your choice.
        • Like, and based on, Apache Hadoop and Cloudera’s CDH, Hanborq delivers HDH. Besides Hadoop Core, it will include various additional components.
  • 31. Hanborq Open Source
        • GitHub
          – Welcome to visit Hanborq’s open-source repositories: https://github.com/hanborq/
        • Hadoop
          – A Hanborq-optimized Hadoop Core, with especially high MapReduce performance. It is the core part of HDH.
        • RockStor (coming soon)
          – An object storage system built on Hadoop and HBase, providing a service similar to Amazon S3 (Simple Storage Service).
        • We will continue to open-source more useful projects in the future …
  • 32. Thank You Very Much!
        Anty Rao, Guangxian Liao, Schubert Zhang
        {ant.rao, liaoguangxian, schubert.zhang}@gmail.com
        https://github.com/hanborq/hadoop
        http://www.slideshare.net/hanborq
        http://www.slideshare.net/schubertzhang
        to be continued …