Hadoop Distributions: Evaluating Cloudera, Hortonworks, and MapR
in Micro-benchmarks and Real-world Applications

Vladimir Starostenkov, Senior R&D Developer
Kirill Grigorchuk, Head of R&D Department
© 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be
reproduced or transmitted in any form or by any means without written permission from the author.
Table of Contents
1. Introduction
2. Tools, Libraries, and Methods
   2.1 Micro-benchmarks
       2.1.1 WordCount
       2.1.2 Sort
       2.1.3 TeraSort
       2.1.4 Distributed File System I/O
   2.2 Real-world applications
       2.2.1 PageRank
       2.2.2 Bayes
3. What Makes This Research Unique?
   3.1 Testing environment
4. Results
   4.1 Overall cluster performance
   4.2 Hortonworks Data Platform (HDP)
   4.3 Cloudera's Distribution Including Apache Hadoop (CDH)
   4.4 MapR
5. Conclusion
Appendix A: Main Features and Their Comparison Across Distributions
Appendix B: Overview of the Distributions
   1. MapR
   2. Cloudera
   3. Hortonworks
Appendix C: Performance Results for Each Benchmarking Test
   1. Real-world applications
       1.1 Bayes
       1.2 PageRank
   2. Micro-benchmarks
       2.1 Distributed File System I/O (DFSIO)
       2.2 Hive aggregation
       2.3 Sort
       2.4 TeraSort
       2.5 WordCount
Appendix D: Performance Results for Each Test Sectioned by Distribution
   1. MapR
   2. Hortonworks
   3. Cloudera
Appendix E: Disk Benchmarking
   1. DFSIO (read) benchmark
   2. DFSIO (write) benchmark
Appendix F: Parameters Used to Optimize Hadoop Jobs
1. Introduction
There is hardly an expert in the big data field who has not heard about Hadoop; in fact,
Hadoop is often used as a synonym for the term big data itself. The wide usage and
popularity of this framework gave a jump-start to the development of various distributions
that derived from the initial open-source edition.

The MapReduce paradigm was first introduced by Google, and Yahoo then continued with
the development of Hadoop, which is based on this data processing method. Since then,
Hadoop has grown into several major distributions and dozens of sub-projects used by
thousands of companies. However, the rapid development of the Hadoop ecosystem and
the extension of its application area have led to a misconception that Hadoop can easily
solve any high-load computing task, which is not exactly true.
Actually, when a company is considering Hadoop to address its needs, it has to answer two
questions:
• Is Hadoop the right tool for me?
• Which distribution is more suitable for my tasks?
To collect information on these two points, companies spend an enormous amount of time
researching distributed computing paradigms and projects, data formats and their
optimization methods, etc. This benchmark demonstrates the performance results of the most
popular open-source Hadoop distributions: Cloudera, MapR, and Hortonworks. It
also provides all the information you may need to evaluate these options.
In this research, such solutions as Amazon Elastic MapReduce (Amazon EMR), Windows
Azure HDInsight, etc., are not analyzed, since they require uploading business data to public
clouds. This benchmark evaluates only stand-alone distributions that can be installed in
private data centers.
2. Tools, Libraries, and Methods
Every Hadoop distribution selected for this research can be tried as a demo on a virtual
machine. This can help you to learn the specific features of each solution and test how it works.
At the proof-of-concept stage, it can be quite a good idea to take a set of real data and run
Hadoop on several virtual machines in a cloud. Although this will not help you to choose a
configuration for your bare-metal cluster, you will be able to evaluate whether Hadoop is a
good tool for your system. The paper "Nobody ever got fired for using Hadoop on a cluster"
by Microsoft Research provides instructions for choosing an optimal hardware configuration.
On the whole, evaluating performance of a Hadoop cluster is challenging, since the results
will vary depending on the cluster size and configuration. There are two main methods of
Hadoop benchmarking: micro-benchmarks and emulated loads. Micro-benchmarks are
shipped with most distributions and allow for testing particular parts of the infrastructure:
for instance, TestDFSIO analyzes the disk subsystem, Sort evaluates MapReduce tasks, and
WordCount measures cluster performance. The second approach provides for testing
the system under workloads similar to those in real-life use cases. SWIM and GridMix3 consist
of workloads that have been emulated based on historical data collected from a real cluster
in operation. Executing the workloads via replaying the synthesized traces may help to
evaluate the side effects of concurrent job execution in Hadoop.
HiBench, a Hadoop benchmark suite by Intel, consists of several Hadoop workloads,
including both synthetic micro-benchmarks and real-world applications. All the workloads
included in this suite are grouped into four categories: micro-benchmarks, Web search,
machine learning, and analytical queries. To find out more about the workloads used in this
benchmark, read the Intel paper "The HiBench Benchmark Suite: Characterization of the
MapReduce-Based Data Analysis."
We monitored CPU, disk, RAM, network, and JVM parameters with the Ganglia Monitoring
System and tuned the parameters of each job (see Appendix F) to achieve maximum
utilization of all resources.
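As an illustration of the kind of per-job tuning referred to above, below is a minimal, hedged sketch of how such parameters can be set from a Hadoop (MRv1) driver. The property names are standard Hadoop 1.x configuration keys, but the values here are purely illustrative and are not the ones used in this benchmark; the parameters actually tuned are the subject of Appendix F.

    import org.apache.hadoop.conf.Configuration;

    public class JobTuningExample {
        public static void main(String[] args) {
            // Illustrative MRv1 settings; real values should be derived from
            // monitoring data (e.g., Ganglia) for a specific cluster and workload.
            Configuration conf = new Configuration();
            conf.setInt("mapred.reduce.tasks", 32);              // number of reduce tasks
            conf.setBoolean("mapred.compress.map.output", true); // compress intermediate map output
            conf.setInt("io.sort.mb", 200);                      // map-side sort buffer, in MB
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);   // reuse JVMs across tasks
            System.out.println("Configured reducers: " + conf.get("mapred.reduce.tasks"));
        }
    }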
Figure 1 demonstrates how data is processed inside a Hadoop cluster. Combiner is an
optional component that reduces the size of data output by a node that executes a Map
task.
To find out more about MapReduce internals, read Hadoop: The Definitive Guide by Tom
White.
To show how the amount of data changes at each stage of a MapReduce job, the total
amount of input data was taken as 1.00. All the other indicators were calculated as ratios to
this input amount. For instance, the input data set was 100 GB (1.00) in size. After the Map
tasks had been completed, it increased to 142 GB (1.42), see Table 1. Using ratios instead of the
real data amounts allows for analyzing trends. In addition, these results can help to predict
the behavior of a cluster that deals with input data of a different size.
Figure 1. A simplified scheme of the MapReduce paradigm
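To make the ratio convention concrete, here is a small, self-contained sketch (not part of the benchmark toolkit) that normalizes the per-stage byte counts of a job by its input size, reproducing the 100 GB / 142 GB example above. The absolute numbers are hypothetical.

    public class DataRatioExample {
        public static void main(String[] args) {
            // Hypothetical stage sizes in GB, matching the example in the text.
            double input = 100.0, combinerInput = 142.0, combinerOutput = 7.0, reduceOutput = 3.0;
            // Each stage is expressed as a ratio to the input size (input == 1.00).
            System.out.printf("Map input:       %.2f%n", input / input);
            System.out.printf("Combiner input:  %.2f%n", combinerInput / input);
            System.out.printf("Combiner output: %.2f%n", combinerOutput / input);
            System.out.printf("Reduce output:   %.2f%n", reduceOutput / input);
        }
    }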
2.1 Micro-benchmarks
2.1.1 WordCount
WordCount can be used to evaluate the CPU scalability of a cluster. During the Map stage, this
workload extracts small amounts of data from a large data set, and this process utilizes the
total CPU capacity. Due to the very low disk and network load under this kind of
workload, a cluster of any size is expected to scale linearly.
Map input Combiner input Combiner output Reduce output
1.00 1.42 0.07 0.03
Table 1. Changes in the amount of data at each of the MapReduce stages
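For reference, the sketch below is the canonical Apache Hadoop WordCount job (new MapReduce API). It is not necessarily the exact HiBench implementation, but it illustrates why the workload is CPU-bound: the mapper tokenizes every line of input, while the combiner, set here to the same class as the reducer, collapses the map output before the shuffle, which explains the sharp drop from 1.42 to 0.07 in Table 1.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce (also used as the combiner): sum the counts per word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // the optional Combiner from Figure 1
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }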
2.1.2 Sort
This workload sorts unsorted text data; the amount of data is the same at all stages of the
Hadoop MapReduce process. Being mostly I/O-bound, this workload has
moderate CPU utilization, as well as heavy disk and network I/O utilization (during the
shuffle stage). RandomTextWriter generates the input data.
Map input Map output Reduce output
1.0 1.0 (uncompressed) 1.0
Table 2. Changes in the amount of data at each of the MapReduce stages
2.1.3 TeraSort
TeraSort input data consists of 100-byte rows generated by the TeraGen application. Even
though this workload has high/moderate CPU utilization during Map/Reduce stages
respectively, it is mostly an I/O-bound workload.
Map input Map output Reduce output
1.0 0.2 (compressed) 1.0
Table 3. Changes in the amount of data at each of the MapReduce stages
2.1.4 Distributed File System I/O
The DFSIO test used in this benchmark is an enhanced version of TestDFSIO, which is an
integral part of a standard Apache Hadoop distribution. This benchmark measures HDFS
throughput.
2.2 Real-world applications
2.2.1 PageRank
PageRank is a widely-known algorithm that evaluates and ranks Web sites in search results.
To calculate a rating, a PageRank job is repeated several times, which is an iterative CPU-
bound workload. The benchmark consists of several chained Hadoop jobs, each represented
by a separate row in the table below. In this benchmark, PageRank had two HDFS blocks per
CPU core, which is the smallest input per node in this test.
Map input Combiner input Combiner output Reduce output
1.0 1.0E-005 1.0E-007 1.0E-008
1.0 5.0 1.0
1.0 0.1 0.1
Table 4. Changes in the amount of data at each of the MapReduce stages
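The Hadoop jobs themselves come from the HiBench suite, but the per-iteration update they implement can be sketched in a few lines. The standalone, in-memory version below (an illustration only; the tiny link graph and the damping factor of 0.85 are assumptions, not benchmark inputs) shows why the computation must be repeated as a chain of jobs until the ranks stabilize.

    import java.util.Arrays;

    public class PageRankSketch {
        public static void main(String[] args) {
            // Tiny hypothetical link graph: page i links to the pages listed in links[i].
            int[][] links = { {1, 2}, {2}, {0}, {2} };
            int n = links.length;
            double d = 0.85;                        // damping factor (a common default)
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);
            for (int iter = 0; iter < 10; iter++) { // each iteration corresponds to one MapReduce pass
                double[] next = new double[n];
                Arrays.fill(next, (1.0 - d) / n);
                for (int page = 0; page < n; page++) {
                    for (int target : links[page]) {
                        // "Map" phase: a page emits a share of its rank to each page it links to.
                        next[target] += d * rank[page] / links[page].length;
                    }
                }
                rank = next;                        // "Reduce" phase: contributions are summed per page.
            }
            for (int page = 0; page < n; page++) {
                System.out.printf("page %d: %.4f%n", page, rank[page]);
            }
        }
    }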
2.2.2 Bayes
The next application is a part of the Apache Mahout project. The Bayes classification
workload has rather complex patterns of access to CPU, memory, disk, and network. This
test creates a heavy load on the CPU when completing Map tasks; in this environment,
however, the workload hit an I/O bottleneck.
Bayes
Map input Combiner input Combiner output Reduce output
1.0 28.9 22.1 19.4
19.4 14.4 12.7 7.4
7.4 9.3 4.8 4.6
7.4 3.1 1.0E-004 1.0E-005
Table 5. Changes in the amount of data at each of the MapReduce stages
3. What Makes This Research Unique?
Although there is a variety of Hadoop distributions, tests evaluating their functional and
performance characteristics are rare and provide rather general information, lacking a deep
analytical investigation of the subject. The vast majority of distributions are direct branches
of the initial Apache Hadoop project and do not provide any additional features. Therefore,
they were disregarded and are not covered in this research.
This research aims to provide a deep insight into Hadoop distributions and evaluate their
features to support you in selecting a tool that fits your project best. For this benchmark, our
R&D team selected three commonly used open-source Hadoop distributions:
• Cloudera's Distribution Including Apache Hadoop (CDH) v4.3
• Hortonworks Data Platform (HDP) v1.3
• MapR M3 v3.0
In general, virtualized cloud environments provide flexibility in tuning, which was required
to carry out tests on clusters of different sizes. In addition, cloud infrastructure allows for
obtaining more unbiased results, since all tests can be easily repeated to verify their outcomes.
For this benchmark, all tests were run on the ProfitBricks virtualized infrastructure. The
deployment settings selected for each distribution provided similar test conditions, as far as
possible.
In this research, we tested distributions that include updates, patches, and additional
features that ensure stability of the framework. Hortonworks and Cloudera are active
contributors to Apache Hadoop and they provide fast bug fixing for their solutions.
Therefore, their distributions are considered more stable and up-to-date.
3.1 Testing environment
ProfitBricks was our partner that provided computing capacities to carry out the research.
This company is a leading IaaS provider and it provides great flexibility in choosing node
configuration. The engineers did not have to address the Support Department to change
some crucial parameters of the node disk configuration, therefore they were able to try
different options and find the optimal settings for the benchmarking environment.
The service provides a wide range of configuration parameters for each node: for instance,
the CPU capacity of a node can vary from 1 to 48 cores, and RAM can range from 1 to 196 GB
per node. This variety of options allows for achieving the optimal CPU/RAM/network/storage
balance for each Hadoop task.

Unlike Amazon, which offers preconfigured nodes, ProfitBricks allows for manual tuning of
each node based on your previous experience and performance needs. ProfitBricks also relies
on InfiniBand, which allowed for achieving the maximum inter-node communication
performance inside the data center.
Cluster configuration:
Each node had four CPU cores, 16 GB of RAM, and 100 GB of virtualized disk space. Cluster
size ranged from 4 to 16 nodes. Nodes required for running Ganglia and cluster
management were not included in this configuration. The top cluster configuration
featured 64 computing cores and 256 GB of RAM for processing 1.6 TB of test data.
4. Results
Each Cloudera and Hortonworks DataNode contained one disk. The MapR distribution was
evaluated in a slightly different way: three data disks were attached to each DataNode,
following MapR recommendations. Taking this into account, MapR was expected to perform
I/O-sensitive tasks three times faster. However, the actual results were affected by some
peculiarities of virtualization (see Figure 11).

The comparison of Cloudera and Hortonworks features showed that these two distributions
are very similar (see Appendix A). This was also confirmed by the results of the tests (see
Appendices C and D). The overall performance of the Hortonworks and Cloudera clusters is
demonstrated in Figures 4 and 6, respectively.
4.1 Overall cluster performance
Throughput in bytes per second was measured for clusters that consisted of 4, 8, 12, and 16
DataNodes (Figures 2-7). In each benchmark test, the throughput of the 8-, 12-, and 16-node
clusters was compared against the throughput of the four-node cluster: the speed of data
processing of the larger clusters was divided by the throughput of the 4-node cluster. These
values demonstrate cluster scalability in each of the tests; the higher the value, the better.
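A hedged sketch of how the scalability values shown in Figures 2-7 can be derived from raw throughput measurements; the throughput numbers below are made up and serve only to illustrate the normalization against the four-node baseline.

    public class ScalabilityExample {
        public static void main(String[] args) {
            int[] nodes = {4, 8, 12, 16};
            // Hypothetical measured throughput in MB/s for each cluster size.
            double[] throughput = {100.0, 180.0, 240.0, 280.0};
            double baseline = throughput[0]; // the 4-node cluster is the baseline (1.0)
            for (int i = 0; i < nodes.length; i++) {
                double overall = throughput[i] / baseline;                 // overall cluster scalability
                double perNode = overall / (nodes[i] / (double) nodes[0]); // per-node efficiency
                System.out.printf("%2d nodes: overall %.2f, per node %.2f%n",
                        nodes[i], overall, perNode);
            }
        }
    }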
Although data consistency may be guaranteed by a hosting/cloud provider, to employ the
advantages of data locality, Hadoop requires using its internal data replication.
Figure 2. The overall performance results of the MapR distribution in all benchmark tests
Figure 3. The average performance of a single node of the MapR cluster in all benchmark tests
Cluster performance scales linearly under the WordCount workload. It behaves the same way
when running PageRank, until the cluster reaches an I/O bottleneck. The results of the other
benchmarks strongly correlated with DFSIO. Disk I/O throughput did not scale in this test
environment; however, analyzing the reasons for that was not the focus of this research. To
learn more about the drawbacks of Hadoop virtualization, read Hadoop Operations by Eric
Sammer.

As mentioned before, MapR had three disks per node. In case all the nodes are hosted on
the same physical server, the virtual cluster exhausts the available disk bandwidth, which is
obviously limited, much faster. ProfitBricks allows for hosting up to 48-62 cores on the same
server, which is equivalent to a cluster of 12-15 nodes with the configuration described in this
benchmark (four cores per node).
4.2 Hortonworks Data Platform (HDP)
In most cases, the performance of a cluster based on the Hortonworks distribution scaled
more linearly. However, its starting performance value in disk-bound tasks was lower than
that of MapR. The maximum scalability was reached in the WordCount and PageRank tasks.
Unfortunately, when the cluster grew to eight nodes, its performance stopped increasing
due to I/O limitations.
Figure 4. The overall performance results of the Hortonworks cluster in all benchmark tests
Figure 5. The average performance of a single node of the Hortonworks cluster in all benchmark
tests
4.3 Cloudera's Distribution Including Apache Hadoop (CDH)
The Cloudera Hadoop distribution showed almost the same performance as Hortonworks,
except for Hive queries, where it was slower.
Figure 6. The overall performance results of the Cloudera cluster in all benchmark tests
Figure 7. The average performance of a single node of the Cloudera cluster in all benchmark tests
Figure 8. Throughput scalability measured in the CPU-bound benchmark
Cluster performance under CPU-bound workloads increased as expected: a 16-node cluster
was four times as fast as a 4-node one. However, the difference in performance between the
distributions was within the limits of experimental error.
4.4 MapR
The performance results of the MapR cluster under the Sort load were quite unexpected.
Figure 9. Performance results for MapR in the Sort benchmark
In the Sort task, the cluster scaled linearly from four to eight nodes. After that, the
performance of each particular node started to degrade sharply. The same situation was
observed in the DFSIO write test.
Figure 10. The MapR performance results in the DFSIO (write) benchmark
The virtualized disks of a 4-node cluster showed read/write speeds of 250/700 MB/s. The
overall cluster performance did not grow linearly (see Figure 11), which means that the total
speed of data processing can be improved by finding the optimal combination of CPU, RAM,
and disk parameters.
Figure 11. Performance results for the MapR, Hortonworks, and Cloudera distributions in the
DFSIO (read/write) benchmark
5. Conclusion
Despite the fact that the configuration of the cluster deployed in the cloud was similar to
that of one deployed on bare metal, the performance and scalability of the virtualized
solution were different. In general, Hadoop deployed on bare metal is expected to scale
linearly until inter-node communication starts to slow it down or the deployment reaches the
limits of HDFS, which is around several thousand nodes.
The actual measurements showed that, even though the overall performance was very high,
it was affected by the limited total disk throughput. Therefore, disk I/O became a serious
bottleneck, whereas the computing capacities of the cluster were not fully utilized. Apache
Spark, support for which was announced by Cloudera while this research was being conducted,
or the GridGain In-Memory Accelerator for Hadoop can be suggested for use in the ProfitBricks
environment.
It can be assumed that the choice of Hadoop distribution has a much less considerable impact
on the overall system throughput than the configuration of the MapReduce job parameters.
For instance, the TeraSort workload was processed 2-3 times faster when the parameters
described in Appendix F were tuned specifically for this load. By configuring these settings,
you can achieve 100% utilization of your CPU, RAM, disk, and network. So, the performance
of each distribution can be greatly improved by selecting proper parameters for each
specific load.
Running Hadoop in clouds allows for fast horizontal and vertical scaling; however, there are
fewer possibilities for tuning each part of the infrastructure. In case you opt for a virtualized
deployment, you should select a hosting/IaaS provider that gives you freedom in configuring
your infrastructure. To achieve optimal utilization of resources, you will need information on
the parameters set for the network and disk storage, as well as the possibility to change them.
Appendix A: Main Features and Their Comparison
Across Distributions
Component type | Implementation | Hortonworks HDP 1.3 | Cloudera CDH 4.3 | MapR M3 v3.0
File system | | HDFS 1.2.0 | HDFS 2.0.0 | MapR-FS
- non-Hadoop access | | NFSv3 | Fuse-DFS v2.0.0 | Direct Access NFS
- Web access | REST HTTP API | WebHDFS | HttpFS | *
MapReduce | | 1.2.0 | 0.20.2 | **
- software abstraction layer | Cascading | x | x | 2.1
Non-relational database | Apache HBase | 0.94.6.1 | 0.94.6 | 0.92.2
Metadata services | Apache HCatalog | Hive *** | 0.5.0 | 0.4.0
Scripting platform | Apache Pig | 0.11 | 0.11.0 | 0.10.0
- data analysis framework | DataFu | x | 0.0.4 | x
Data access and querying | Apache Hive | 0.11.0 | 0.10.0 | 0.9.0
Workflow scheduler | Apache Oozie | 3.3.2 | 3.3.2 | 3.2.0
Cluster coordination | Apache ZooKeeper | 3.4.5 | 3.4.5 | 3.4(?)
Bulk data transfer between relational databases and Hadoop | Apache Sqoop | 1.4.3 | 1.4.3 | 1.4.2
Distributed log management services | Apache Flume | 1.3.1 | 1.3.0 | 1.2.0
Machine learning and data analysis | Mahout | 0.7.0 | 0.7 | 0.7
Hadoop UI | Hue | 2.2.0 | 2.3.0 | -
- data integration service | Talend Open Studio for Big Data | 5.3 | x | x
Cloud services | Whirr | x | 0.8.2 | 0.7.0
Parallel query execution engine | | Tez (Stinger) | Impala | ****
Full-text search | | | Search 0.1.5 |
Administration | | Apache Ambari | Cloudera Manager | MapR Control System
- installation | | Apache Ambari | Cloudera Manager | -
- monitoring | Ganglia | x | x |
 | Nagios | x | x |
- fine-grained authorization | Sentry | | 1.1 |
Splitting resource management and scheduling | YARN | 2.0.4 | 2.0.0 | -
Table 6. The comparison of functionality in different Hadoop distributions
* - via NFS
** - MapR has a custom Hadoop-compatible MapReduce implementation
*** - HCatalog has been merged with Hive. The latest stand-alone release was v0.5.0
**** - Apache Drill is at an early development stage
x - available, but not mentioned in the distribution documentation / requires manual
installation or additional configuration
Appendix B: Overview of the Distributions
1. MapR
Summary
MapR provides MapR-FS, a substitute for the standard HDFS. Unlike HDFS, this file system
aims to sustain deployments that consist of up to 10,000 nodes with no single
point of failure, which is guaranteed by the distributed NameNode. MapR allows for storing
1-10 exabytes of data and provides support for NFS and random read-write semantics. The
MapR developers state that eliminating the Hadoop abstraction layers can help to increase
performance by 2x.

There are three editions of the MapR distribution: M3, which is completely free, and M5 and
M7, which are paid enterprise versions. Although M3 provides unlimited scalability and NFS,
it does not ensure the high availability and snapshots available in M5, or the instant recovery
of M7. M7 is an enterprise-level platform for NoSQL and Hadoop deployments. The MapR
distribution is available as a part of Amazon Elastic MapReduce and Google Cloud Platform.
Notable customers and partners
 MapR M3 and M5 editions are available as premium options for Amazon Elastic MapReduce;
 Google partnered with MapR in launching Compute Engine;
 Cisco Systems announced support for MapR software on the UCS platform;
 comScore
Support and documentation
• Support contact details
• Documentation
The company
Based in San Jose, California, MapR focuses on development of Hadoop-based projects. The
company contributes to such projects as HBase, Pig, Apache Hive, and Apache ZooKeeper.
After signing an agreement with EMC in 2011, the company supplies a specific Hadoop
distribution tuned for EMC hardware. MapR also partners with Google and Amazon in
improving their hosted MapReduce services, such as Amazon Elastic MapReduce (EMR).
2. Cloudera
Summary
Of all the distributions analyzed in this research, Cloudera's solution has the most powerful
Hadoop deployment and administration tools, designed for managing a cluster of an
unlimited size. It is also open-source, and the company is a major and active contributor to
Apache Hadoop. In addition, the Cloudera distribution has its own native components, such
as Impala, a query engine for massively parallel processing, and Cloudera Search, powered
by Apache Solr.
Notable customers and partners
• eBay
• CBS Interactive
• Qualcomm
• Expedia
Support and documentation
• Support contact details
• Documentation
The company
Based in Palo Alto, Cloudera is one of the leading companies providing Hadoop-related
services and training. According to the company's statistics, more than 50% of
its efforts are dedicated to improving such open-source projects as Apache Hive, Apache
Avro, Apache HBase, etc., which are part of the larger Hadoop ecosystem. In addition, Cloudera
invests in the Apache Software Foundation, a community of developers who contribute to the
family of Apache software projects.
3. Hortonworks
Summary
Being 100% open-source, Hortonworks is strongly committed to Apache Hadoop and is
one of the main contributors to the project. The Stinger initiative
brings high performance, scalability, and SQL compliance to Hadoop deployments. YARN,
the Hadoop OS, and Apache Tez, a framework for near real-time big data processing, help
Stinger to speed up Hive and Pig by up to 100x.

As a result of Hortonworks' partnership with Microsoft, HDP is the only Hadoop distribution
available as a native component of Windows Server. A Windows-based Hadoop cluster can
be easily deployed on Windows Azure through the HDInsight Service.
Notable customers and partners
• Western Digital
• eBay
• Samsung Electronics
Support and documentation
• Support contact details
• Documentation
The company
Hortonworks is a company headquartered in Palo Alto, California. Being a sponsor of the
Apache Software Foundation and one of the main contributors to Apache Hadoop, the
company specializes in providing support for Apache Hadoop. The Hortonworks distribution
includes such components as HDFS, MapReduce, Pig, Hive, HBase, and Zookeeper. Together
with Yahoo!, Hortonworks hosts the annual Hadoop Summit event, the leading conference
for the Apache Hadoop community.
Appendix C: Performance Results for Each
Benchmarking Test
1. Real-world applications
1.1 Bayes
Figure 12. Bayes: the overall cluster performance
Figure 13. Bayes: the performance of a single node for each cluster size
1.2 PageRank
Figure 14. PageRank: the overall cluster performance
Figure 15. PageRank: the performance of a single node for each cluster size
2. Micro benchmarks
2.1 Distributed File System I/O (DFSIO)
Figure 16. DFSIO: the overall cluster performance
Figure 17. DFSIO: the performance of a single node for each cluster size
2.2 Hive aggregation
Figure 18. Hive aggregation: the overall cluster performance
Figure 19. Hive aggregation: the performance of a single node for each cluster size
2.3 Sort
Figure 20. Sort: the overall cluster performance
Figure 21. Sort: the performance of a single node for each cluster size
2.4 TeraSort
Figure 22. TeraSort: the overall cluster performance
Figure 23. TeraSort: the performance of a single node for each cluster size
2.5 WordCount
Figure 24. WordCount: the overall cluster performance
Figure 25. WordCount: the performance of a single node for each cluster size
Appendix D: Performance Results for Each Test
Sectioned by Distribution
1. MapR
Figure 26. The overall performance of the MapR cluster in the Bayes benchmark
Figure 27. The performance of a single node of the MapR cluster in the Bayes benchmark
Figure 28. The overall performance of the MapR cluster in the DFSIO (read) benchmark
Figure 29. The performance of a single node of the MapR cluster in the DFSIO (read) benchmark
Figure 30. The overall performance of the MapR cluster in the DFSIO (write) benchmark
Figure 31. The performance of a single node of the MapR cluster in the DFSIO (write) benchmark
Figure 32. The overall performance of the MapR cluster in the DFSIO benchmark
Figure 33. The performance of a single node of the MapR cluster in the DFSIO benchmark
Figure 34. The overall performance of the MapR cluster in the Hive aggregation benchmark
Figure 35. The performance of a single node of the MapR cluster in the Hive aggregation
benchmark
Figure 36. The overall performance of the MapR cluster in the PageRank benchmark
Figure 37. The performance of a single node of the MapR cluster in the PageRank benchmark
Figure 38. The overall performance of the MapR cluster in the Sort benchmark
Figure 39. The performance of a single node of the MapR cluster in the Sort benchmark
Figure 40. The overall performance of the MapR cluster in the TeraSort benchmark
Figure 41. The performance of a single node of the MapR cluster in the TeraSort benchmark
Figure 42. The overall performance of the MapR cluster in the WordCount benchmark
Figure 43. The performance of a single node of the MapR cluster in the WordCount benchmark
2. Hortonworks
Figure 44. The overall performance of the Hortonworks cluster in the Bayes benchmark
Figure 45. The performance of a single node of the Hortonworks cluster in the Bayes benchmark
Figure 46. The overall performance of the Hortonworks cluster in the DFSIO (read) benchmark
Figure 47. The performance of a single node of the Hortonworks cluster in the DFSIO (read)
benchmark
Figure 48. The overall performance of the Hortonworks cluster in the DFSIO (write) benchmark
Figure 49. The performance of a single node of the Hortonworks cluster in the DFSIO (write)
benchmark
Figure 50. The overall performance of the Hortonworks cluster in the DFSIO benchmark
Figure 51. The performance of a single node of the Hortonworks cluster in the DFSIO benchmark
Figure 52. The overall performance of the Hortonworks cluster in the Hive aggregation
benchmark
Figure 53. The performance of a single node of the Hortonworks cluster in the Hive aggregation
benchmark
Figure 54. The overall performance of the Hortonworks cluster in the PageRank benchmark
Figure 55. The performance of a single node of the Hortonworks cluster in the PageRank
benchmark
Figure 56. The overall performance of the Hortonworks cluster in the Sort benchmark
Figure 57. The performance of a single node of the Hortonworks cluster in the Sort benchmark
Figure 58. The overall performance of the Hortonworks cluster in the TeraSort benchmark
Figure 59. The performance of a single node of the Hortonworks cluster in the TeraSort
benchmark
Figure 60. The overall performance of the Hortonworks cluster in the WordCount benchmark
Figure 61. The performance of a single node of the Hortonworks cluster in the WordCount
benchmark
3. Cloudera
Figure 62. The overall performance of the Cloudera cluster in the Bayes benchmark
Figure 63. The performance of a single node of the Cloudera cluster in the Bayes benchmark
Figure 64. The overall performance of the Cloudera cluster in the DFSIO (read) benchmark
Figure 65. The performance of a single node of the Cloudera cluster in the DFSIO (read)
benchmark
Figure 66. The overall performance of the Cloudera cluster in the DFSIO (write) benchmark
Figure 67. The performance of a single node of the Cloudera cluster in the DFSIO (write)
benchmark
Figure 68. The overall performance of the Cloudera cluster in the DFSIO benchmark
Figure 69. The performance of a single node of the Cloudera cluster in the DFSIO benchmark
Figure 70. The overall performance of the Cloudera cluster in the Hive aggregation benchmark
Figure 71. The performance of a single node of the Cloudera cluster in the Hive aggregation
benchmark
Figure 72. The overall performance of the Cloudera cluster in the PageRank benchmark
Figure 73. The performance of a single node of the Cloudera cluster in the PageRank benchmark
Figure 74. The overall performance of the Cloudera cluster in the Sort benchmark
Figure 75. The performance of a single node of the Cloudera cluster in the Sort benchmark
Figure 76. The overall performance of the Cloudera cluster in the TeraSort benchmark
Figure 77. The performance of a single node of the Cloudera cluster in the TeraSort benchmark
Figure 78. The overall performance of the Cloudera cluster in the WordCount benchmark
Figure 79. The performance of a single node of the Cloudera cluster in the WordCount benchmark
Appendix E: Disk Benchmarking
The line charts below compare the disk I/O performance of the three distributions in the DFSIO (read) and DFSIO (write) benchmarks.
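For readers who want to run a similar disk test on their own cluster, the stock TestDFSIO tool bundled with Apache Hadoop can be invoked as shown below. This is a minimal sketch under stated assumptions: $HADOOP_TEST_JAR stands for the hadoop-test JAR shipped with the distribution, and the file count and file size are illustrative values, not the exact settings of this benchmark (which relied on an enhanced DFSIO version).

# Write test: each of 64 map tasks writes a 1,000 MB file to HDFS
$HADOOP_EXECUTABLE jar $HADOOP_TEST_JAR TestDFSIO -write -nrFiles 64 -fileSize 1000
# Read test: re-reads the same files and reports aggregate throughput
$HADOOP_EXECUTABLE jar $HADOOP_TEST_JAR TestDFSIO -read -nrFiles 64 -fileSize 1000
# Remove the generated test files from HDFS
$HADOOP_EXECUTABLE jar $HADOOP_TEST_JAR TestDFSIO -clean

The throughput and average I/O rate reported after each run can then be compared across cluster sizes, as in the charts below.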
1. DFSIO (read) benchmark
Figure 80. The overall performance of each distribution in the DFSIO-read benchmark, sectioned
by the cluster size
Figure 81. The performance of a single node of each distribution in the DFSIO-read benchmark,
sectioned by the cluster size
2. DFSIO (write) benchmark
Figure 82. The overall performance of each distribution in the DFSIO-write benchmark, sectioned
by the cluster size
Figure 83. The performance of a single node of each distribution in the DFSIO-write benchmark,
sectioned by the cluster size
Appendix F: Parameters Used to Optimize Hadoop Jobs
mapred.map.tasks: The total number of Map tasks for the job to run.
mapred.reduce.tasks: The total number of Reduce tasks for the job to run.
mapred.output.compress: Set to true to compress the output of the MapReduce job; use mapred.output.compression.codec to specify the compression codec.
mapred.map.child.java.opts: The Java options for the JVM that runs Map tasks, e.g., the -Xmx memory size. Use mapred.reduce.child.java.opts for Reduce tasks.
io.sort.mb: The size of the Map task output buffer. Use this value to control the spill process: when the buffer is filled up to io.sort.spill.percent, a background spill thread is started.
mapred.job.reduce.input.buffer.percent: The percentage of memory, relative to the maximum heap size, used to retain Map outputs during the Reduce phase.
mapred.inmem.merge.threshold: The threshold number of Map outputs at which merging and spilling to disk starts. 0 means there is no threshold; the spill behavior is then controlled by mapred.job.shuffle.merge.percent.
mapred.job.shuffle.merge.percent: The usage threshold of the Map outputs buffer at which merging and spilling to disk starts.
mapred.reduce.slowstart.completed.maps: The fraction of Map tasks that must be completed before Reducers are started.
dfs.replication: The HDFS replication factor.
dfs.block.size: The HDFS block size.
mapred.task.timeout: The timeout (in milliseconds) after which a task is considered failed. See mapreduce.reduce.shuffle.connect.timeout, mapreduce.reduce.shuffle.read.timeout, and mapred.healthChecker.script.timeout to adjust other timeouts.
mapred.map.tasks.speculative.execution: Enables speculative execution of Map tasks. See also mapred.reduce.tasks.speculative.execution.
mapred.job.reuse.jvm.num.tasks: The maximum number of tasks a single JVM can run for a given job (-1 means no limit).
io.sort.record.percent: The proportion of io.sort.mb reserved for storing record boundaries of the Map outputs. The remaining space is used for the Map output records themselves.
Example:

$HADOOP_EXECUTABLE jar $HADOOP_EXAMPLES_JAR terasort \
  -Dmapred.map.tasks=60 \
  -Dmapred.reduce.tasks=30 \
  -Dmapred.output.compress=true \
  -Dmapred.map.child.java.opts="-Xmx3500m" \
  -Dmapred.reduce.child.java.opts="-Xmx7000m" \
  -Dio.sort.mb=2047 \
  -Dmapred.job.reduce.input.buffer.percent=0.9 \
  -Dmapred.inmem.merge.threshold=0 \
  -Dmapred.job.shuffle.merge.percent=1 \
  -Dmapred.reduce.slowstart.completed.maps=0.8 \
  -Ddfs.replication=1 \
  -Ddfs.block.size=536870912 \
  -Dmapred.task.timeout=120000 \
  -Dmapreduce.reduce.shuffle.connect.timeout=60000 \
  -Dmapreduce.reduce.shuffle.read.timeout=30000 \
  -Dmapred.healthChecker.script.timeout=60000 \
  -Dmapred.map.tasks.speculative.execution=false \
  -Dmapred.reduce.tasks.speculative.execution=false \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  -Dio.sort.record.percent=0.138 \
  -Dio.sort.spill.percent=1.0 \
  $INPUT_HDFS $OUTPUT_HDFS
Table 7. Parameters that were tuned to achieve optimal performance of Hadoop jobs
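A note on how one of these values can be derived (our interpretation, not taken from the distributions' documentation): in Hadoop 1.x, each record in the Map output buffer requires roughly 16 bytes of accounting metadata, so a common rule of thumb is io.sort.record.percent = 16 / (16 + average record size). For TeraSort, whose records are 100 bytes long, this gives 16 / 116 ≈ 0.138, which matches the value used in the example above.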
About the author:
Vladimir Starostenkov is a Senior R&D Engineer at Altoros, a company that focuses on accelerating big data projects and platform-as-a-service enablement. He has more than five years of experience in implementing complex software architectures, including data-intensive systems and Hadoop-driven applications. With a strong background in physics and computer science, Vladimir is interested in artificial intelligence and machine learning algorithms.
About Altoros:
Altoros is a big data and Platform-as-a-Service specialist that provides system integration for IaaS/cloud providers, software companies, and information-driven enterprises. The company builds solutions at the intersection of Hadoop, NoSQL, Cloud Foundry PaaS, and multi-cloud deployment automation. For more, please visit www.altoros.com or follow @altoros.
Liked this white paper?
Share it on the Web!
Weitere ähnliche Inhalte

Was ist angesagt?

Terraform modules restructured
Terraform modules restructuredTerraform modules restructured
Terraform modules restructuredAmi Mahloof
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQAraf Karsh Hamid
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composerBruce Kuo
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasDataWorks Summit/Hadoop Summit
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerDatabricks
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPDatabricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 

Was ist angesagt? (20)

Terraform modules restructured
Terraform modules restructuredTerraform modules restructured
Terraform modules restructured
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQ
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composer
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 

Ähnlich wie Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR

EMC Hadoop Starter Kit
EMC Hadoop Starter KitEMC Hadoop Starter Kit
EMC Hadoop Starter KitEMC
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guideCynthia Saracco
 
Hadoop as an extension of DW
Hadoop as an extension of DWHadoop as an extension of DW
Hadoop as an extension of DWSidi yazid
 
Hadoop framework thesis (3)
Hadoop framework thesis (3)Hadoop framework thesis (3)
Hadoop framework thesis (3)JonySaini2
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyShital Kat
 
Ppdg Robust File Replication
Ppdg Robust File ReplicationPpdg Robust File Replication
Ppdg Robust File Replicationguru100
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
Hadoop Based Data Discovery
Hadoop Based Data DiscoveryHadoop Based Data Discovery
Hadoop Based Data DiscoveryBenjamin Ashkar
 
Building a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and LuceneBuilding a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and LuceneMirko Calvaresi
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docxlorainedeserre
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docxBHANU281672
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big AnalyticsAjay Ohri
 
Deploying Oracle 11g R2 on Linux RHEL6
Deploying Oracle 11g R2 on Linux RHEL6Deploying Oracle 11g R2 on Linux RHEL6
Deploying Oracle 11g R2 on Linux RHEL6Mc Cloud
 
Big data technologies : A survey
Big data technologies : A survey Big data technologies : A survey
Big data technologies : A survey fatimabenjelloun1
 
Cloud Forms Iaa S V2wp 6299847 0411 Dm Web 4
Cloud Forms Iaa S V2wp 6299847 0411 Dm Web 4Cloud Forms Iaa S V2wp 6299847 0411 Dm Web 4
Cloud Forms Iaa S V2wp 6299847 0411 Dm Web 4Yusuf Hadiwinata Sutandar
 
LCI report-Demo
LCI report-DemoLCI report-Demo
LCI report-DemoMo Mamouei
 

Ähnlich wie Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR (20)

EMC Hadoop Starter Kit
EMC Hadoop Starter KitEMC Hadoop Starter Kit
EMC Hadoop Starter Kit
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guide
 
Hadoop as an extension of DW
Hadoop as an extension of DWHadoop as an extension of DW
Hadoop as an extension of DW
 
Hadoop framework thesis (3)
Hadoop framework thesis (3)Hadoop framework thesis (3)
Hadoop framework thesis (3)
 
Big data processing using - Hadoop Technology
Big data processing using - Hadoop TechnologyBig data processing using - Hadoop Technology
Big data processing using - Hadoop Technology
 
Ppdg Robust File Replication
Ppdg Robust File ReplicationPpdg Robust File Replication
Ppdg Robust File Replication
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Hadoop Based Data Discovery
Hadoop Based Data DiscoveryHadoop Based Data Discovery
Hadoop Based Data Discovery
 
Actian DataFlow Whitepaper
Actian DataFlow WhitepaperActian DataFlow Whitepaper
Actian DataFlow Whitepaper
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
 
Building a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and LuceneBuilding a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and Lucene
 
Final White Paper_
Final White Paper_Final White Paper_
Final White Paper_
 
Cube_it!_software_report_for_IMIS
Cube_it!_software_report_for_IMISCube_it!_software_report_for_IMIS
Cube_it!_software_report_for_IMIS
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
Deploying Oracle 11g R2 on Linux RHEL6
Deploying Oracle 11g R2 on Linux RHEL6Deploying Oracle 11g R2 on Linux RHEL6
Deploying Oracle 11g R2 on Linux RHEL6
 
Big data technologies : A survey
Big data technologies : A survey Big data technologies : A survey
Big data technologies : A survey
 
Cloud Forms Iaa S V2wp 6299847 0411 Dm Web 4
Cloud Forms Iaa S V2wp 6299847 0411 Dm Web 4Cloud Forms Iaa S V2wp 6299847 0411 Dm Web 4
Cloud Forms Iaa S V2wp 6299847 0411 Dm Web 4
 
LCI report-Demo
LCI report-DemoLCI report-Demo
LCI report-Demo
 

Mehr von Douglas Bernardini (20)

Top reasons to choose SAP hana
Top reasons to choose SAP hanaTop reasons to choose SAP hana
Top reasons to choose SAP hana
 
The REAL face of Big Data
The REAL face of Big DataThe REAL face of Big Data
The REAL face of Big Data
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
 
SAP HORTONWORKS
SAP HORTONWORKSSAP HORTONWORKS
SAP HORTONWORKS
 
R-language
R-languageR-language
R-language
 
REDSHIFT - Amazon
REDSHIFT - AmazonREDSHIFT - Amazon
REDSHIFT - Amazon
 
Splunk
SplunkSplunk
Splunk
 
Finance month closing with HANA
Finance month closing with HANAFinance month closing with HANA
Finance month closing with HANA
 
RDBMS x NoSQL
RDBMS x NoSQLRDBMS x NoSQL
RDBMS x NoSQL
 
SAP - SOLUTION MANAGER
SAP - SOLUTION MANAGER SAP - SOLUTION MANAGER
SAP - SOLUTION MANAGER
 
MS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTUREMS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTURE
 
DBA oracle
DBA oracleDBA oracle
DBA oracle
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
SAP Business Objects - Lopes Supermarket
SAP   Business Objects - Lopes SupermarketSAP   Business Objects - Lopes Supermarket
SAP Business Objects - Lopes Supermarket
 
SAP - Business Objects - Ri happy
SAP - Business Objects - Ri happySAP - Business Objects - Ri happy
SAP - Business Objects - Ri happy
 
Hadoop on retail
Hadoop on retailHadoop on retail
Hadoop on retail
 
Retail: Big data e Omni-Channel
Retail: Big data e Omni-ChannelRetail: Big data e Omni-Channel
Retail: Big data e Omni-Channel
 
Granular Access Control Using Cell Level Security In Accumulo
Granular Access Control  Using Cell Level Security  In Accumulo             Granular Access Control  Using Cell Level Security  In Accumulo
Granular Access Control Using Cell Level Security In Accumulo
 
Proposta aderencia drogaria onofre
Proposta aderencia   drogaria onofreProposta aderencia   drogaria onofre
Proposta aderencia drogaria onofre
 
SAP-Solution-Manager
SAP-Solution-ManagerSAP-Solution-Manager
SAP-Solution-Manager
 

Kürzlich hochgeladen

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 

Kürzlich hochgeladen (20)

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 

Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR

  • 1. HadoopDistributions: EvaluatingCloudera, Hortonworks,andMapR inMicro-benchmarksand Real-worldApplications VladimirStarostenkov,SeniorR&DDeveloper, KirillGrigorchuk,HeadofR&DDepartment © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author.
  • 2. 2 +1 650 395-7002 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. Table of Contents 1. Introduction........................................................................................................................................... 4 2. Tools, Libraries, and Methods ............................................................................................................... 5 2.1 Micro benchmarks..................................................................................................................................................................6 2.1.1 WordCount.................................................................................................................................................................6 2.1.2 Sort................................................................................................................................................................................7 2.1.3 TeraSort .......................................................................................................................................................................7 2.1.4 Distributed File System I/O...................................................................................................................................7 2.2 Real-world applications ........................................................................................................................................................7 2.2.1 PageRank ....................................................................................................................................................................8 2.2.2 Bayes ............................................................................................................................................................................8 3. What Makes This Research Unique?...................................................................................................... 9 3.1 Testing environment .............................................................................................................................................................9 4. Results.................................................................................................................................................. 11 4.1 Overall cluster performance.............................................................................................................................................11 4.2 Hortonworks Data Platform (HDP).................................................................................................................................12 ....................................................................................14 4.4 MapR ........................................................................................................................................................................................15 5. Conclusion ........................................................................................................................................... 18 Appendix A: Main Features and Their Comparison Across Distributions .............................................. 
19 Appendix B: Overview of the Distributions ............................................................................................ 21 1. MapR...........................................................................................................................................................................................21 2. Cloudera....................................................................................................................................................................................22 3. Hortonworks ............................................................................................................................................................................23 Appendix C: Performance Results for Each Benchmarking Test ............................................................ 24 1. Real-world applications........................................................................................................................................................24 1.1 Bayes.............................................................................................................................................................................24 1.2 PageRank.....................................................................................................................................................................25
  • 3. 3 +1 650 395-7002 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. 2. Micro benchmarks ............................................................................................................................... 26 2.1 Distributed File System I/O (DFSIO)...............................................................................................................................26 2.2 Hive aggregation .................................................................................................................................................................27 2.3 Sort............................................................................................................................................................................................28 2.4 TeraSort...................................................................................................................................................................................29 2.5 WordCount.............................................................................................................................................................................30 Appendix D: Performance Results for Each Test Sectioned by Distribution.......................................... 32 1. MapR...........................................................................................................................................................................................32 2. Hortonworks ............................................................................................................................................................................42 3. Cloudera....................................................................................................................................................................................52 Appendix E: Disk Benchmarking............................................................................................................. 62 1. DFSIO (read) benchmark......................................................................................................................................................62 2. DFSIO (write) benchmark ....................................................................................................................................................63 Appendix F: Parameters used to optimize Hadoop Jobs........................................................................ 64
  • 4. 4 +1 650 395-7002 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. 1. Introduction Perhaps, there is hardly any expert in the big data field who has heard nothing about Hadoop. Furthermore, very often Hadoop is used as a synonym to the term big data. It is most likely, that the wide usage and popularity of this framework may have given a jump- start to development of various distributions that derived from the initial open-source edition. The MapReduce paradigm was firstly introduced by Google and Yahoo continued with development of Hadoop, which is based on this data processing method. From that moment on, Hadoop has grown into several major distributions and dozens of sub-projects used by thousands of companies. However, the rapid development of the Hadoop ecosystem and the extension of its application area, lead to a misconception that Hadoop can be used to solve any high-load computing task easily, which is not exactly true. Actually, when a company is considering Hadoop to address its needs, it has to answer two questions:  Is Hadoop the right tool for me?  Which distribution can be more suitable for my tasks? To collect information on these two points, companies spend an enormous amount of time researching into distributed computing paradigms and projects, data formats and their optimization methods, etc. This benchmark demonstrates performance results of the most popular open-source Hadoop distributions, such as Cloudera, MapR, and Hortonworks. It also provides all the information you may need to evaluate these options. In this research, such solutions as Amazon Elastic MapReduce (Amazon EMR), Windows Azure HDInsight, etc., are not analyzed, since they require uploading business data to public clouds. This benchmark evaluates only stand-alone distributions that can be installed in private data centers.
  • 5. 5 +1 650 395-7002 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. 2. Tools, Libraries, and Methods Every Hadoop distribution selected for this research can be tried as a demo on a virtual machine. This can help you to learn specific features of each solution and test how it works. It can be quite a good idea for a proof of concept stage to take a set of real data and run Hadoop on several virtual machines in a cloud. Although, it will not help you to choose a configuration for your bare metal cluster (for details, see a , you will be able to evaluate whether Hadoop is a good tool for your system. The paper Nobody ever got fired for using Hadoop on a cluster by Microsoft Research provides instructions for choosing an optimal hardware configuration. On the whole, evaluating performance of a Hadoop cluster is challenging, since the results will vary depending on the cluster size and configuration. There are two main methods of Hadoop benchmarking: micro-benchmarks and emulated loads. Micro-benchmarks are shipped with most distributions and allow for testing particular parts of the infrastructure, for instance TestDFSIO analyzes the disk system, Sort evaluates MapReduce tasks, WordCount measures cluster performance, etc. The second approach provides for testing the system under workloads similar to those in real-life use cases. SWIM and GridMix3 consist of workloads that have been emulated based on historical data collected from a real cluster in operation. Executing the workloads via replaying the synthesized traces may help to evaluate the side effects of concurrent job execution in Hadoop. HiBench, a Hadoop Benchmark Suite by Intel, consists of several Hadoop workloads, including both synthetic micro-benchmarks and real-world applications. All workloads included into this suite are grouped into four categories: micro-benchmarks, Web search, machine learning, and analytical queries. To find out about the workload used in this benchmark, read the paper MapReduce-Based Data Analysis We monitored CPU, disk, RAM, network, and JVM parameters with Ganglia Monitoring System and tuned the parameters of each job (see Appendix A) to achieve maximum utilization of all resources. Figure 1 demonstrates how data is processed inside a Hadoop cluster. Combiner is an optional component that reduces the size of data output by a node that executes a Map task.
  • 6. 6 +1 650 395-7002 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. To find out more about MapReduce internals, read by Tom White. To show how the amount of data changes on each stage of a MapReduce job, the whole amount of input data was taken as 1.00. All the other indicators were calculated as ratios to the input amount. For instance, the input data set was 100 GB (1.00) in size. After a Map task had been completed, it increased to 142 GB (1.42), see Table 1. Using ratios instead of the real data amounts allows for analyzing trends. In addition, these results can help to predict the behavior of a cluster that deals with input data of a different size. Figure 1. A simplified scheme of a MapReduce paradigm 2.1 Micro benchmarks 2.1.1 WordCount WordCount can be used to evaluate CPU scalability of a cluster. On the Map stage, this workload extracts small amounts of data from a large data set and this process utilizes the total CPU capacity. Due to the very low load on disk load and network, under this kind of workload, a cluster of any size is expected to scale linearly.
  • 7. 7 +1 650 395-7002 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. Map input Combiner input Combiner output Reduce output 1.00 1.42 0.07 0.03 Table 1. Changes in the amount of data at each of the MapReduce stages 2.1.2 Sort This workload sorts out unsorted text data; the amount of data on all the stages of the Hadoop MapReduce process is the same. Being mostly I/O-bound, this workload has moderate CPU utilization, as well as heavy disk and network I/O utilization (during the shuffle stage). RandomTextWriter generates the input data. Map input Map output Reduce output 1.0 1.0 (uncompressed) 1.0 Table 2. Changes in the amount of data at each of the MapReduce stages 2.1.3 TeraSort TeraSort input data consists of 100-byte rows generated by the TeraGen application. Even though this workload has high/moderate CPU utilization during Map/Reduce stages respectively, it is mostly an I/O-bound workload. Map input Map output Reduce output 1.0 0.2 (compressed) 1.0 Table 3. Changes in the amount of data at each of the MapReduce stages 2.1.4 Distributed File System I/O The DFSIO test used in this benchmark is an enhanced version of TestDFSIO, which is an integral part of a standard Apache Hadoop distribution. This benchmark measures HDFS throughput. 2.2 Real-world applications
  • 8. 8 +1 650 395-7002 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. 2.2.1 PageRank PageRank is a widely-known algorithm that evaluates and ranks Web sites in search results. To calculate a rating, a PageRank job is repeated several times, which is an iterative CPU- bound workload. The benchmark consists of several chained Hadoop jobs, each represented by a separate row in the table below. In this benchmark, PageRank had two HDFS blocks per CPU core, which is the smallest input per node in this test. Map input Combiner input Combiner output Reduce output 1.0 1.0E-005 1.0E-007 1.0E-008 1.0 5.0 1.0 1.0 0.1 0.1 Table 4. Changes in the amount of data at each of the MapReduce stages 2.2.2 Bayes The next application is a part of the Apache Mahout project. The Bayes Classification workload has rather complex patterns of accessing CPU, memory, disk, and network. This test creates a heavy load on a CPU when completing Map tasks. However in this case, this workload hit an I/O bottleneck. Bayes Map input Combiner input Combiner output Reduce output 1.0 28.9 22.1 19.4 19.4 14.4 12.7 7.4 7.4 9.3 4.8 4.6 7.4 3.1 1.0E-004 1.0E-005 Table 5. Changes in the amount of data at each of the MapReduce stages
3. What Makes This Research Unique?

Although there is a variety of Hadoop distributions, tests evaluating their functional and performance characteristics are rare and provide rather general information, lacking a deep analytical investigation of the subject. The vast majority of distributions are direct branches of the initial Apache Hadoop project and do not provide any additional features. Therefore, they were disregarded and are not covered in this research. This research aims to provide a deep insight into Hadoop distributions and evaluate their features to support you in selecting a tool that fits your project best.

For this benchmark, our R&D team selected three commonly used open-source Hadoop distributions:

• Cloudera CDH v4.3
• Hortonworks Data Platform (HDP) v1.3
• MapR M3 v3.0

In general, virtualized cloud environments provide flexibility in tuning, which was required to carry out tests on clusters of different sizes. In addition, cloud infrastructure allows for obtaining more unbiased results, since all tests can be easily repeated to verify their results. For this benchmark, all tests were run on the ProfitBricks virtualized infrastructure. The deployment settings selected for each distribution provided similar test conditions, as far as possible.

In this research, we tested distributions that include updates, patches, and additional features that ensure the stability of the framework. Hortonworks and Cloudera are active contributors to Apache Hadoop and provide fast bug fixing for their solutions. Therefore, their distributions are considered more stable and up-to-date.

3.1 Testing environment

ProfitBricks was our partner that provided the computing capacities to carry out the research. The company is a leading IaaS provider and offers great flexibility in choosing node configuration. The engineers did not have to contact the support department to change crucial parameters of the node disk configuration; therefore, they were able to try different options and find the optimal settings for the benchmarking environment.
The service provides a wide range of configuration parameters for each node: for instance, the CPU capacity of a node can vary from 1 to 48 cores, and RAM can range from 1 to 196 GB per node. This variety of options allows for achieving the optimal CPU/RAM/network/storage balance for each Hadoop task. Unlike Amazon, which offers preconfigured nodes, ProfitBricks allows for manual tuning of each node based on your previous experience and performance needs.

InfiniBand is a modern technology used by ProfitBricks. It allowed for achieving the maximum inter-node communication performance inside the data center.

Cluster configuration: Each node had four CPU cores, 16 GB of RAM, and 100 GB of virtualized disk space. Cluster size ranged from 4 to 16 nodes. Nodes required for running Ganglia and cluster management were not included in this configuration. The top cluster configuration featured 64 computing cores and 256 GB of RAM for processing 1.6 TB of test data.
4. Results

Each Cloudera and Hortonworks DataNode contained one disk. The MapR distribution was evaluated in a slightly different way: three data disks were attached to each DataNode, following MapR recommendations. Taking this into account, MapR was expected to perform I/O-sensitive tasks three times faster. However, the actual results were affected by some peculiarities of virtualization (see Figure 11).

The comparison of Cloudera and Hortonworks features showed that these two distributions are very similar (see Appendix A). This was also confirmed by the results of the tests (see Appendix B). The overall performance of the Hortonworks and Cloudera clusters is demonstrated by Figures 4 and 6 respectively.

4.1 Overall cluster performance

Throughput in bytes per second was measured for clusters that consisted of 4, 8, 12, and 16 DataNodes (Figures 2-7). In each benchmark test, the throughput of the 8-, 12-, and 16-node clusters was divided by the throughput of the 4-node cluster. The resulting values demonstrate cluster scalability in each of the tests; the higher the value, the better.

Although data consistency may be guaranteed by a hosting/cloud provider, to employ the advantages of data locality, Hadoop requires using its internal data replication.

Figure 2. The overall performance results of the MapR distribution in all benchmark tests
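As a minimal illustration of how the scalability values in Figures 2-7 are derived, the snippet below divides each cluster size's measured throughput by the four-node baseline. The file name throughput.csv and the numbers in it are placeholders invented for the example; they are not the measured results of this study.

# throughput.csv: one "nodes,bytes_per_sec" line per cluster size, e.g.
#   4,52428800
#   8,99614720
#   12,141557760
#   16,167772160
# First pass finds the 4-node baseline, second pass prints each ratio.
awk -F, 'NR==FNR { if ($1==4) base=$2; next }
         { printf "%2d nodes: scalability = %.2f\n", $1, $2/base }' \
    throughput.csv throughput.csv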
Figure 3. The average performance of a single node of the MapR cluster in all benchmark tests

Cluster performance scales linearly under the WordCount workload. It behaves the same way under PageRank until the cluster reaches an I/O bottleneck. The results of the other benchmarks strongly correlated with DFSIO. Disk I/O throughput did not scale in this test environment; however, analyzing the reasons for that was not the focus of this research. To learn more about the drawbacks of Hadoop virtualization, see Hadoop Operations by Eric Sammer.

As mentioned before, MapR had three disks per node. In case all nodes are hosted on the same physical server, the virtual cluster exhausts the obviously limited disk bandwidth much faster. ProfitBricks allows for hosting up to 48-62 cores on the same server, which is equivalent to a cluster of 12-15 nodes with the configuration described in this benchmark (four cores per node).

4.2 Hortonworks Data Platform (HDP)

In most cases, the performance of a cluster based on the Hortonworks distribution scaled more linearly. However, its starting performance value in disk-bound tasks was lower than that of MapR. The maximum scalability was reached in the WordCount and PageRank tasks. Unfortunately, when the cluster grew to eight nodes, its performance stopped increasing due to I/O limitations.
Figure 4. The overall performance results of the Hortonworks cluster in all benchmark tests

Figure 5. The average performance of a single node of the Hortonworks cluster in all benchmark tests
4.3 Cloudera Distribution Including Apache Hadoop (CDH)

The Cloudera Hadoop distribution showed almost the same performance as Hortonworks, except for Hive queries, where it was slower.

Figure 6. The overall performance results of the Cloudera cluster in all benchmark tests

Figure 7. The average performance of a single node of the Cloudera cluster in all benchmark tests
Figure 8. Throughput scalability measured in the CPU-bound benchmark

Cluster performance under CPU-bound workloads increased as expected: a 16-node cluster was four times as fast as a 4-node one. However, the difference in performance between the distributions was within the limits of experimental error.

4.4 MapR

The performance results of the MapR cluster under the Sort load were quite unexpected.

Figure 9. Performance results for MapR in the Sort benchmark

In the Sort task, the cluster scaled linearly from four to eight nodes. After that, the performance of each particular node started to degrade sharply. The same situation was observed in the DFSIO write test.
Figure 10. The MapR performance results in the DFSIO (write) benchmark
The virtualized disks of a 4-node cluster showed a reading/writing speed of 250/700 MB/s. The overall cluster performance did not grow linearly (see Figure 11), meaning that the total speed of data processing can be improved by an optimal combination of the CPU, RAM, and disk space parameters.

Figure 11. Performance results for the MapR, Hortonworks, and Cloudera distributions in the DFSIO (read/write) benchmark
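The paper does not state how the raw virtual disk figures were obtained. One simple way to sanity-check the sequential throughput of a virtualized disk outside Hadoop is a direct-I/O dd run, as sketched below; the device mount point, file name, and transfer size are placeholders, and a dedicated tool such as fio would give more detailed numbers.

# Sequential write: 4 GiB with direct I/O to bypass the page cache.
dd if=/dev/zero of=/data1/dd-testfile bs=1M count=4096 oflag=direct

# Sequential read of the same file, again bypassing the cache.
dd if=/data1/dd-testfile of=/dev/null bs=1M iflag=direct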
5. Conclusion

Despite the fact that the configuration of the cluster deployed in the cloud was similar to that of one deployed on bare metal, the performance and scalability of the virtualized solution were different. In general, Hadoop deployed on bare metal is expected to scale linearly until inter-node communication starts to slow it down or it reaches the limits of HDFS, which is around several thousand nodes. The actual measurements showed that even though the overall performance was very high, it was affected by the limited total disk throughput. Therefore, disk I/O became a serious bottleneck, whereas the computing capacities of the cluster were not fully utilized. Apache Spark, which was announced by Cloudera while this research was being conducted, or the GridGain In-Memory Accelerator for Hadoop can be suggested for use in the ProfitBricks environment.

It can be assumed that the type of Hadoop distribution has a much less considerable impact on the overall system throughput than the configuration of the MapReduce task parameters. For instance, the TeraSort workload was processed 2-3 times faster when the parameters described in Appendix F were tuned specifically for this load. By configuring these settings, you can achieve 100% utilization of your CPU, RAM, disk, and network. So, the performance of each distribution can be greatly improved by selecting proper parameters for each specific load.

Running Hadoop in clouds allows for fast horizontal and vertical scaling; however, there are fewer possibilities for tuning each part of the infrastructure. In case you opt for a virtualized deployment, you should select a hosting/IaaS provider that gives you freedom in configuring your infrastructure. To achieve optimal utilization of resources, you will need information on the parameters set for network and disk storage and the ability to change them.

Liked this white paper? Share it on the Web!
Appendix A: Main Features and Their Comparison Across Distributions

Component type (implementation) | Hortonworks HDP 1.3 | Cloudera CDH 4.3 | MapR M3 v3.0
File system | HDFS 1.2.0 | HDFS 2.0.0 | MapR-FS
- non-Hadoop access | NFSv3 | Fuse-DFS v2.0.0 | Direct Access NFS
- Web access (REST HTTP API) | WebHDFS | HttpFS | *
MapReduce | 1.2.0 | 0.20.2 | **
- software abstraction layer (Cascading) | x | x | 2.1
Non-relational database (Apache HBase) | 0.94.6.1 | 0.94.6 | 0.92.2
Metadata services (Apache HCatalog) | ***Hive | 0.5.0 | 0.4.0
Scripting platform (Apache Pig) | 0.11 | 0.11.0 | 0.10.0
- data analysis framework (DataFu) | x | 0.0.4 | x
Data access and querying (Apache Hive) | 0.11.0 | 0.10.0 | 0.9.0
Workflow scheduler (Apache Oozie) | 3.3.2 | 3.3.2 | 3.2.0
Cluster coordination (Apache Zookeeper) | 3.4.5 | 3.4.5 | 3.4(?)
Bulk data transfer between relational databases and Hadoop (Apache Sqoop) | 1.4.3 | 1.4.3 | 1.4.2
Distributed log management services (Apache Flume) | 1.3.1 | 1.3.0 | 1.2.0
Machine learning and data analysis (Mahout) | 0.7.0 | 0.7 | 0.7
Hadoop UI (Hue) | 2.2.0 | 2.3.0 | -
- data integration service (Talend Open Studio for Big Data) | 5.3 | x | x
Cloud services (Whirr) | x | 0.8.2 | 0.7.0
Parallel query execution engine | Tez (Stinger) | Impala | ****
Full-text search | - | Search 0.1.5 | -
Administration | Apache Ambari | Cloudera Manager | MapR Control System
- installation | Apache Ambari | Cloudera Manager | -
- monitoring (Ganglia) | x | x
- monitoring (Nagios) | x | x
- fine-grained authorization | Sentry 1.1
Splitting resource management and scheduling (YARN) | 2.0.4 | 2.0.0 | -

Table 6. The comparison of functionality in different Hadoop distributions

* - via NFS
** - MapR has a custom Hadoop-compatible MapReduce implementation
*** - HCatalog has been merged with Hive. The latest stand-alone release was v0.5.0
**** - Apache Drill is at an early development stage
x - available, but not mentioned in the distribution documentation / requires manual installation or additional configuration
Appendix B: Overview of the Distributions

1. MapR

Summary

MapR provides MapR-FS, a substitute for the standard HDFS. Unlike HDFS, this file system aims to sustain deployments of up to 10,000 nodes with no single point of failure, which is guaranteed by the distributed NameNode. MapR allows for storing 1-10 exabytes of data and provides support for NFS and random read-write semantics. The MapR developers state that eliminating the Hadoop abstraction layers can help to increase performance 2x.

There are three editions of the MapR distribution: M3, which is completely free, and two paid enterprise versions, M5 and M7. Although M3 provides unlimited scalability and NFS, it does not include the high availability and snapshots available in M5 or the instant recovery of M7. M7 is an enterprise-level platform for NoSQL and Hadoop deployments. The MapR distribution is available as a part of Amazon Elastic MapReduce and the Google Cloud Platform.

Notable customers and partners

• MapR M3 and M5 editions are available as premium options for Amazon Elastic MapReduce;
• Google partnered with MapR in launching Compute Engine;
• Cisco Systems announced support for MapR software on the UCS platform;
• comScore

Support and documentation

• Support contact details
• Documentation

The company

Based in San Jose, California, MapR focuses on the development of Hadoop-based projects. The company contributes to such projects as HBase, Pig, Apache Hive, and Apache ZooKeeper. After signing an agreement with EMC in 2011, the company supplies a specific Hadoop
distribution tuned for EMC hardware. MapR also partners with Google and Amazon in improving their Elastic MapReduce (EMR) service.

2. Cloudera

Summary

Of all the distributions analyzed in this research, Cloudera's solution has the most powerful Hadoop deployment and administration tools, designed for managing a cluster of an unlimited size. It is also open-source, and Cloudera is a major contributor to Apache Hadoop. In addition, the Cloudera distribution has its own native components, such as Impala, a query engine for massively parallel processing, and Cloudera Search powered by Apache Solr.

Notable customers and partners

• eBay
• CBS Interactive
• Qualcomm
• Expedia

Support and documentation

• Support contact details
• Documentation

The company

Based in Palo Alto, Cloudera is one of the leading companies that provide Hadoop-related services and staff training. According to the company's statistics, more than 50% of its efforts are dedicated to improving such open-source projects as Apache Hive, Apache Avro, and Apache HBase, which are a part of the large Hadoop ecosystem. In addition, Cloudera invests in the Apache Software Foundation, a community of developers who contribute to the family of Apache software projects.
3. Hortonworks

Summary

Being 100% open-source, Hortonworks is strongly committed to Apache Hadoop and is one of the main contributors to the solution. The Stinger initiative brings high performance, scalability, and SQL compliance to Hadoop deployments. YARN, the Hadoop OS, and Apache Tez, a framework for near real-time big data processing, help Stinger speed up Hive and Pig by up to 100x.

As a result of Hortonworks' partnership with Microsoft, HDP is the only Hadoop distribution available as a native component of Windows Server. A Windows-based Hadoop cluster can be easily deployed on Windows Azure through the HDInsight Service.

Notable customers and partners

• Western Digital
• eBay
• Samsung Electronics

Support and documentation

• Support contact details
• Documentation

The company

Hortonworks is a company headquartered in Palo Alto, California. Being a sponsor of the Apache Software Foundation and one of the main contributors to Apache Hadoop, the company specializes in providing support for Apache Hadoop. The Hortonworks distribution includes such components as HDFS, MapReduce, Pig, Hive, HBase, and ZooKeeper. Together with Yahoo!, Hortonworks hosts the annual Hadoop Summit, the leading conference for the Apache Hadoop community.
Appendix C: Performance Results for Each Benchmarking Test

1. Real-world applications

1.1 Bayes

Figure 12. Bayes: the overall cluster performance
Figure 13. Bayes: the performance of a single node for each cluster size

1.2 PageRank

Figure 14. PageRank: the overall cluster performance
Figure 15. PageRank: the performance of a single node for each cluster size

2. Micro benchmarks

2.1 Distributed File System I/O (DFSIO)

Figure 16. DFSIO: the overall cluster performance
Figure 17. DFSIO: the performance of a single node for each cluster size

2.2 Hive aggregation

Figure 18. Hive aggregation: the overall cluster performance
Figure 19. Hive aggregation: the performance of a single node for each cluster size

2.3 Sort

Figure 20. Sort: the overall cluster performance
Figure 21. Sort: the performance of a single node for each cluster size

2.4 TeraSort
Figure 22. TeraSort: the overall cluster performance

Figure 23. TeraSort: the performance of a single node for each cluster size

2.5 WordCount

Figure 24. WordCount: the overall cluster performance
Figure 25. WordCount: the performance of a single node for each cluster size
Appendix D: Performance Results for Each Test Sectioned by Distribution

1. MapR

Figure 26. The overall performance of the MapR cluster in the Bayes benchmark
Figure 27. The performance of a single node of the MapR cluster in the Bayes benchmark
Figure 28. The overall performance of the MapR cluster in the DFSIO (read) benchmark

Figure 29. The performance of a single node of the MapR cluster in the DFSIO (read) benchmark
Figure 30. The overall performance of the MapR cluster in the DFSIO (write) benchmark
Figure 31. The performance of a single node of the MapR cluster in the DFSIO (write) benchmark

Figure 32. The overall performance of the MapR cluster in the DFSIO benchmark
Figure 33. The performance of a single node of the MapR cluster in the DFSIO benchmark

Figure 34. The overall performance of the MapR cluster in the Hive aggregation benchmark
Figure 35. The performance of a single node of the MapR cluster in the Hive aggregation benchmark

Figure 36. The overall performance of the MapR cluster in the PageRank benchmark
Figure 37. The performance of a single node of the MapR cluster in the PageRank benchmark

Figure 38. The overall performance of the MapR cluster in the Sort benchmark
Figure 39. The performance of a single node of the MapR cluster in the Sort benchmark

Figure 40. The overall performance of the MapR cluster in the TeraSort benchmark
Figure 41. The performance of a single node of the MapR cluster in the TeraSort benchmark

Figure 42. The overall performance of the MapR cluster in the WordCount benchmark
Figure 43. The performance of a single node of the MapR cluster in the WordCount benchmark

2. Hortonworks
Figure 44. The overall performance of the Hortonworks cluster in the Bayes benchmark

Figure 45. The performance of a single node of the Hortonworks cluster in the Bayes benchmark
Figure 46. The overall performance of the Hortonworks cluster in the DFSIO (read) benchmark

Figure 47. The performance of a single node of the Hortonworks cluster in the DFSIO (read) benchmark
Figure 48. The overall performance of the Hortonworks cluster in the DFSIO (write) benchmark
Figure 49. The performance of a single node of the Hortonworks cluster in the DFSIO (write) benchmark

Figure 50. The overall performance of the Hortonworks cluster in the DFSIO benchmark
Figure 51. The performance of a single node of the Hortonworks cluster in the DFSIO benchmark

Figure 52. The overall performance of the Hortonworks cluster in the Hive aggregation benchmark
Figure 53. The performance of a single node of the Hortonworks cluster in the Hive aggregation benchmark

Figure 54. The overall performance of the Hortonworks cluster in the PageRank benchmark
Figure 55. The performance of a single node of the Hortonworks cluster in the PageRank benchmark

Figure 56. The overall performance of the Hortonworks cluster in the Sort benchmark
Figure 57. The performance of a single node of the Hortonworks cluster in the Sort benchmark

Figure 58. The overall performance of the Hortonworks cluster in the TeraSort benchmark
Figure 59. The performance of a single node of the Hortonworks cluster in the TeraSort benchmark

Figure 60. The overall performance of the Hortonworks cluster in the WordCount benchmark
Figure 61. The performance of a single node of the Hortonworks cluster in the WordCount benchmark

3. Cloudera
Figure 62. The overall performance of the Cloudera cluster in the Bayes benchmark

Figure 63. The performance of a single node of the Cloudera cluster in the Bayes benchmark

Figure 64. The overall performance of the Cloudera cluster in the DFSIO (read) benchmark
Figure 65. The performance of a single node of the Cloudera cluster in the DFSIO (read) benchmark

Figure 66. The overall performance of the Cloudera cluster in the DFSIO (write) benchmark
Figure 67. The performance of a single node of the Cloudera cluster in the DFSIO (write) benchmark
Figure 68. The overall performance of the Cloudera cluster in the DFSIO benchmark

Figure 69. The performance of a single node of the Cloudera cluster in the DFSIO benchmark
Figure 70. The overall performance of the Cloudera cluster in the Hive aggregation benchmark

Figure 71. The performance of a single node of the Cloudera cluster in the Hive aggregation benchmark
Figure 72. The overall performance of the Cloudera cluster in the PageRank benchmark

Figure 73. The performance of a single node of the Cloudera cluster in the PageRank benchmark
Figure 74. The overall performance of the Cloudera cluster in the Sort benchmark

Figure 75. The performance of a single node of the Cloudera cluster in the Sort benchmark
Figure 76. The overall performance of the Cloudera cluster in the TeraSort benchmark

Figure 77. The performance of a single node of the Cloudera cluster in the TeraSort benchmark
Figure 78. The overall performance of the Cloudera cluster in the WordCount benchmark

Figure 79. The performance of a single node of the Cloudera cluster in the WordCount benchmark
Appendix E: Disk Benchmarking

The line charts below demonstrate the performance of the disks.

1. DFSIO (read) benchmark

Figure 80. The overall performance of each distribution in the DFSIO-read benchmark, sectioned by the cluster size

Figure 81. The performance of a single node of each distribution in the DFSIO-read benchmark, sectioned by the cluster size
2. DFSIO (write) benchmark

Figure 82. The overall performance of each distribution in the DFSIO-write benchmark, sectioned by the cluster size

Figure 83. The performance of a single node of each distribution in the DFSIO-write benchmark, sectioned by the cluster size
Appendix F: Parameters Used to Optimize Hadoop Jobs

Parameter | Description
mapred.map.tasks | The total number of Map tasks for the job
mapred.reduce.tasks | The total number of Reduce tasks for the job
mapred.output.compress | Set to true to compress the output of the MapReduce job; use mapred.output.compression.codec to specify the compression codec
mapred.map.child.java.opts | The Java options for the JVM that runs a Map task, e.g., the -Xmx memory size. Use mapred.reduce.child.java.opts for Reduce tasks
io.sort.mb | The Map task output buffer size. Use this value to control the spill process: when the buffer is filled up to io.sort.spill.percent, a background spill thread is started
mapred.job.reduce.input.buffer.percent | The percentage of memory, relative to the maximum heap size, used to retain Map outputs during the Reduce process
mapred.inmem.merge.threshold | The threshold number of Map outputs for starting the process of merging the outputs and spilling to disk. 0 means there is no threshold, and the spill behavior is controlled by mapred.job.shuffle.merge.percent
mapred.job.shuffle.merge.percent | The usage threshold of the Map outputs buffer for starting the process of merging the outputs and spilling to disk
mapred.reduce.slowstart.completed.maps | The fraction of Map tasks that must be completed before Reducers are started
dfs.replication | The HDFS replication factor
dfs.block.size | The HDFS block size
mapred.task.timeout | The timeout (in milliseconds) after which a task is considered failed. See mapreduce.reduce.shuffle.connect.timeout, mapreduce.reduce.shuffle.read.timeout, and mapred.healthChecker.script.timeout to adjust other timeouts
mapred.map.tasks.speculative.execution | Speculative execution of Map tasks. See also mapred.reduce.tasks.speculative.execution
mapred.job.reuse.jvm.num.tasks | The maximum number of tasks of a given job to run in each JVM
io.sort.record.percent | The proportion of io.sort.mb reserved for storing record boundaries
of the Map outputs. The remaining space is used for the Map output records themselves.

Example

$HADOOP_EXECUTABLE jar $HADOOP_EXAMPLES_JAR terasort \
  -Dmapred.map.tasks=60 \
  -Dmapred.reduce.tasks=30 \
  -Dmapred.output.compress=true \
  -Dmapred.map.child.java.opts="-Xmx3500m" \
  -Dmapred.reduce.child.java.opts="-Xmx7000m" \
  -Dio.sort.mb=2047 \
  -Dmapred.job.reduce.input.buffer.percent=0.9 \
  -Dmapred.inmem.merge.threshold=0 \
  -Dmapred.job.shuffle.merge.percent=1 \
  -Dmapred.reduce.slowstart.completed.maps=0.8 \
  -Ddfs.replication=1 \
  -Ddfs.block.size=536870912 \
  -Dmapred.task.timeout=120000 \
  -Dmapreduce.reduce.shuffle.connect.timeout=60000 \
  -Dmapreduce.reduce.shuffle.read.timeout=30000 \
  -Dmapred.healthChecker.script.timeout=60000 \
  -Dmapred.map.tasks.speculative.execution=false \
  -Dmapred.reduce.tasks.speculative.execution=false \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  -Dio.sort.record.percent=0.138 \
  -Dio.sort.spill.percent=1.0 \
  $INPUT_HDFS $OUTPUT_HDFS

Table 7. Parameters that were tuned to achieve the optimal performance of Hadoop jobs

About the author:

Vladimir Starostenkov is a Senior R&D Engineer at Altoros, a company that focuses on accelerating big data projects and platform-as-a-service enablement. He has more than five years of experience in implementing complex software architectures, including data-intensive systems and Hadoop-driven applications. Having a strong background in physics and computer science, Vladimir is interested in artificial intelligence and machine learning algorithms.

About Altoros:

Altoros is a big data and Platform-as-a-Service specialist that provides system integration for IaaS/cloud providers, software companies, and information-driven enterprises. The company builds solutions at the intersection of Hadoop, NoSQL, Cloud Foundry PaaS, and multi-cloud deployment automation. For more, please visit www.altoros.com or follow @altoros.

Liked this white paper? Share it on the Web!