Lab Validation Report
NetApp Open Solution for Hadoop
Open Source Data Analytics with Enterprise-class Storage Services
By Brian Garrett, VP, ESG Lab, and Julie Lockner, Senior Analyst & VP, Data Management
May 2012
© 2012, Enterprise Strategy Group, Inc. All Rights Reserved.
Contents
Introduction
    Background
    NetApp Open Solution for Hadoop
ESG Lab Validation
    Getting Started
    Performance and Scalability
    Efficiency
    Recoverability
ESG Lab Validation Highlights
Issues to Consider
The Bigger Truth
Appendix
ESG Lab Reports
The goal of ESG Lab reports is to educate IT professionals about data center technology products for
companies of all types and sizes. ESG Lab reports are not meant to replace the evaluation process that should
be conducted before making purchasing decisions, but rather to provide insight into these emerging
technologies. Our objective is to go over some of the more valuable features and functions of products, show how
they can be used to solve real customer problems and identify any areas needing improvement. ESG Lab's
expert third-party perspective is based on our own hands-on testing as well as on interviews with customers
who use these products in production environments. This ESG Lab report was sponsored by NetApp.
All trademark names are property of their respective companies. Information contained in this publication has been obtained from sources The Enterprise
Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from
time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in
part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of the Enterprise
Strategy Group, Inc., is in violation of U.S. Copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should
you have any questions, please contact ESG Client Relations at 508.482.0188.
Introduction
This ESG Lab report presents the results of hands-on testing of the NetApp Open Solution for Hadoop, a highly
reliable, ready-to-deploy, scalable storage solution for enterprise Hadoop deployments.
Background
Driven by unrelenting data volume growth, the need for real-time data processing and data analytics, and the
increasing complexity and variety of data sources, ESG expects broad adoption of MapReduce data processing and
analytics frameworks over the next two to five years. These frameworks require new approaches for storing,
integrating and processing “big data.” ESG defines big data as any data set that exceeds the boundaries and sizes of
traditional IT processing; big data sets can range from tens to hundreds of terabytes in size.
Data analytics is a top IT priority for forward-looking IT organizations. In fact, a recent ESG survey indicates that
more than half (54%) of enterprise organizations (i.e., 1,000 or more employees) consider data analytics to be a
top-five IT priority and 38% plan on deploying a new data analytics solution in the next 12-18 months. A growing
number of IT organizations are using the open source Apache Hadoop MapReduce framework as a foundation for
their big data analytics initiatives. As shown in Figure 1, more than 50% of organizations polled by ESG are using
Hadoop, planning to deploy Hadoop in the next 12 months, or considering Hadoop.1
Figure 1. Plans to Implement a MapReduce Framework such as Apache Hadoop
What are your organization’s plans to implement a MapReduce framework (e.g., Apache Hadoop) to address data analytics challenges? (Percent of respondents, N=270)
- Already using: 8%
- Plan to implement within 12 months: 13%
- No plans to implement at this time but interested: 35%
- No plans to implement: 33%
- Don’t know: 11%
Source: Enterprise Strategy Group, 2011.
As with any exciting and emerging technology, big data analytics has its challenges. Management is an issue
because the platforms are expensive: they require new server and storage purchases, integration with existing data
sets and processes, training in new technologies, an analytics toolset, and people with the expertise to use it all.
When IT managers were asked about their data analytics challenges, 47% named data integration complexity, 34%
cited a lack of the skills necessary to properly manage large data sets and derive value from them, 29% said data
set sizes are limiting their ability to perform analytics, and 28% cited difficulty in completing analytics within a
reasonable period of time.
1. Source: ESG Research Report, The Impact of Big Data on Data Analytics, September 2011.
Looking beyond the high-level organizational challenges associated with a big data analytics initiative, the Hadoop
framework adds technology and implementation issues that need to be considered. The common reference
architecture for a Hadoop cluster leverages commodity server nodes with internal hard drives; for conventional
data centers with mature ITIL processes, this introduces two challenges. First, data protection is, by default,
handled in the Hadoop software layer; every time a file is written to the Hadoop Distributed File System (HDFS),
two additional copies are written in case of a disk drive or a data node failure. This not only impacts data ingest and
throughput performance, but also reduces disk capacity utilization. Second, high availability is limited based on an
existing single point of failure in the Hadoop metadata repository. This single point of failure will eventually be
addressed by the Hadoop community, but, in the meantime, analytics downtime due to a name node failure is a
key concern. As shown in Figure 2, a majority of ESG survey respondents (55%) indicate that three hours or less of
data analytics platform downtime would result in a significant revenue loss or other adverse business impact.2
Figure 2. Data Analytics Downtime Tolerance
Please indicate the amount of downtime your organization’s data analytics platforms can tolerate before your organization experiences significant revenue loss or other adverse business impact. (Percent of respondents, N=399)
- None: 6%
- Less than 1 hour: 21%
- 1 hour to 3 hours: 26%
- 4 hours to 10 hours: 18%
- 11 hours to 24 hours: 10%
- 1 day to 3 days: 10%
- More than 3 days: 4%
- Don’t know: 6%
Source: Enterprise Strategy Group, 2011.
NetApp, in collaboration with leading Hadoop distribution vendors, is working to develop reference architectures,
best practices, and solutions that address these challenges while maximizing the speed, efficiency, and availability
of open source Hadoop deployments.
2. Source: ESG Survey, The Convergence of Big Data Processing, Hadoop, and Integrated Infrastructure, December 2011.
NetApp Open Solution for Hadoop
Hadoop is a significant emerging open source technology for solving business problems around large volumes of
mostly unstructured data that cannot be analyzed with traditional database tools. The NetApp Open Solution for
Hadoop combines the power of the Hadoop framework with flexible storage, professional support, and services of
NetApp and its partners to deliver higher Hadoop cluster availability and efficiency. Based on a reference
architecture, it focuses on scaling Hadoop from its departmental origins to an enterprise infrastructure with
independent compute and storage scaling, faster cluster ingest, and faster job completion under failure conditions.
NetApp Open Solution for Hadoop extends the value of the open source Hadoop framework with enterprise-class
storage and services. As shown in Figure 3, NetApp FAS2040 and E2660 storage replace traditional DAS internal
hard drives within a Hadoop cluster. Compute and storage resources are decoupled with SAS-attached NetApp
E2660 arrays, and the recoverability of a failed Hadoop name node is improved with an NFS-attached FAS2040. The
storage components are completely transparent to the Hadoop distribution and require no modification to the
native, underlying Hadoop platform. Note that while the FAS2040 is used for this testing configuration, any other
product in the FAS storage family can also be used.
Figure 3. NetApp Open Solution for Hadoop
The NetApp Open Solution for Hadoop includes:
- NetApp E2660s with hardware RAID and hot-swappable disks, which increase efficiency, performance, scalability, availability, serviceability, and manageability compared to a traditional Hadoop deployment with internal hard drives and replication at the application layer. With data protected by hardware RAID, higher storage utilization rates can be achieved by reducing the default Hadoop replication count.
- A NetApp FAS2040 with shared NFS-attached capacity, which accelerates recoverability after a primary name node failure compared to a traditional Hadoop deployment with internal hard drives.
- A high speed 10 Gbps Ethernet network and 6 Gbps SAS-attached E2660s with network-free hardware RAID, which increase the performance, scalability, and efficiency of the Hadoop infrastructure.
- High capacity E2660 disk arrays and a building block design that decouples the compute and storage layers, providing near-linear scalability that’s ideally suited for big data analytics applications with extreme compute and storage capacity requirements.
- A field-tested solution comprised of an open source Apache Hadoop distribution and enterprise-class NetApp storage, with professional design services and support, which reduces risk and accelerates deployment.
ESG Lab Validation
ESG Lab performed hands-on evaluation and testing of the solution at a NetApp facility in Research Triangle Park, North
Carolina. Testing was designed to demonstrate that the NetApp Open Solution for Hadoop can perform and scale
linearly as data volumes and load increase, can recover from both a single and double node failure with no
disruption to a running Hadoop job, and can quickly recover from a name node failure. The performance and
scalability benefits of using network-free hardware RAID and a lower Hadoop replication count were evaluated as
well. Testing was performed using open source software, workload generators, and monitoring tools.
Getting Started
A Hadoop cluster with one name node, one secondary name node, one job tracker node, and up to 24 data nodes
was used during ESG Lab testing. Rack-mounted servers with quad core Intel Xeon processors and 48GB of RAM
were connected to six NetApp E2660s, with the name node and secondary name node connected to a single
NetApp FAS2040. Each NetApp E2660 was filled with 60 2TB 7200 RPM NL-SAS drives for a total raw capacity of
720TB. A building block approach was used, with groups of four data nodes sharing an E2660 through 6 Gbps SAS
connections. A 1 Gbps Ethernet network was used for the cluster interconnect and NFS connections to name and
job tracker nodes. Cloudera Distribution for Hadoop software was installed over the Red Hat Linux operating
system on each of the nodes in the cluster.3
Figure 4. The ESG Lab Test Bed
3. Configuration details are listed in the Appendix.
Performance and Scalability
Hadoop uses a shared-nothing programming paradigm and a massively parallel clustered architecture to meet the
extreme compute and capacity requirements of big data analytics applications. Aiming to exceed the performance
and scalability of traditional database architectures, Hadoop brings the compute power to the data. The name
node and job tracker handle distribution and orchestration while the data nodes do all of the analytical processing
work.
HDFS is a distributed network file system used by nodes in a Hadoop cluster. Software mirroring is the default data
protection scheme within the HDFS file system. For every block of data written into the HDFS file system, an
additional two copies are written to other nodes for a total of three copies. This is referred to as a replication count
of three, and is the default for most Hadoop implementations that rely on internal hard drive capacity. This
software data mirroring increases the processing load on data nodes and the utilization of the shared network
between nodes. To put this into perspective, consider what happens when a 2TB data set is loaded into a Hadoop
cluster with a default replication count of three: in this example, 2TB of application data results in 6TB of raw data
being processed and moved over the network.
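To make the overhead concrete, the write-footprint arithmetic can be sketched in a few lines (the helper name and units below are ours, for illustration only):

```python
# Sketch: raw data written and shipped over the network for a given
# HDFS replication count. Numbers follow the 2TB example in the text.

def hdfs_write_footprint(data_tb: float, replication: int) -> dict:
    """Return total raw TB written and TB sent over the network.

    Assumes every replica beyond the first traverses the cluster
    network (the first copy is written locally by the data node).
    """
    return {
        "raw_written_tb": data_tb * replication,
        "network_tb": data_tb * (replication - 1),
    }

# 2TB ingested at the default replication count of three:
print(hdfs_write_footprint(2.0, 3))
# {'raw_written_tb': 6.0, 'network_tb': 4.0}
```

Dropping the replication count to two, as the NetApp solution allows, halves the network traffic for the same ingest.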
A NetApp E2660 with hardware RAID reduces the processing and network overhead associated with software
mirroring, which increases the performance and scalability of a Hadoop cluster. With up to 15 high capacity, high
performance disk drives (2TB, 7.2K NL-SAS) available for each data node, the performance of a Hadoop cluster is
magnified compared to a traditional Hadoop cluster with internal SATA drives. A right-sized building block approach
provides near-linear scalability as compute and storage capacity are added to a cluster.
ESG Lab Testing
ESG Lab performed a series of tests to measure the performance and scalability of a 24-data-node NetApp Open
Solution for Hadoop. (There were actually 27 nodes: 24 data nodes, one name node, one secondary name
node, and one job tracker node.) The TeraGen utility, included in the open source Hadoop distribution, was used to
simulate the loading of a large analytic data set. Testing was performed with cluster sizes of 8, 16, and 24 data
nodes and a Hadoop replication count of two. Testing began with the creation of a 1TB data set on an 8-data-node
cluster. The test was repeated with a 2TB data set on a 16-data-node cluster and a 3TB data set on a 24-data-node
cluster. The results are presented in Figure 5 and Table 1.
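For reference, a TeraGen load of this kind is typically launched from the stock Hadoop examples jar. The jar path and HDFS directory below are illustrative (CDH3-era conventions, matching the test bed); TeraGen writes 100-byte rows, so 10 billion rows yields roughly 1TB:

```shell
# Generate ~1TB into HDFS with TeraGen (10,000,000,000 rows x 100 bytes).
# Jar path and HDFS output directory are illustrative and vary by
# distribution; this requires a running Hadoop cluster.
hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen \
    10000000000 /benchmarks/tera-1tb
```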
Figure 5. Data Loading Performance Analysis
Table 1. Performance Scalability Test Results: Data Loading with the TeraGen Utility

  Data nodes                          8          16          24
  NetApp E2660 arrays                 2           4           6
  NetApp E2660 drives               120         240         360
  Usable capacity (TB)              180         360         720
  Hadoop data set size (TB)           1           2           3
  Job completion time (hh:mm:ss)    00:10:06    00:09:52    00:10:18
  Aggregate throughput (MB/sec)     1,574       3,222       4,630
What the Numbers Mean
- The NetApp Solution for Hadoop was designed to scale performance in near-linear fashion as data nodes and E2660 disk arrays are added to the cluster. This modular building block approach can also be used to provide consistent levels of performance as a data set grows.
- The job completion time for each of the TeraGen runs was recorded as the amount of data generated, the number of data nodes, and the number of E2660 arrays increased linearly.
- In this example, the solution scaled up to 24 data nodes and six E2660 arrays with a total of 360 drives and 720TB of usable disk capacity.
- As the number of data nodes increased and the volume of data generated increased linearly, the completion time remained flat at approximately ten minutes (+/- 3%). This demonstrates the linear performance scalability of the NetApp Solution for Hadoop.
- A job completion time of roughly ten minutes for the creation of a 3TB data set indicates that the 24-node NetApp solution sustained a high aggregate throughput rate of 4.630 GB/sec.
- An aggregate data creation rate of 4.630 GB/sec can be used to create 16.7TB of data per hour.
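The last bullet follows directly from unit conversion; a quick check, assuming decimal (base-10) storage units as used throughout the report:

```python
# Back-of-the-envelope check of the hourly data-creation rate quoted above.
gb_per_sec = 4.63                       # aggregate TeraGen throughput
tb_per_hour = gb_per_sec * 3600 / 1000  # GB/s -> TB/h (decimal units)
print(round(tb_per_hour, 1))            # 16.7
```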
Performance testing continued with a similar series of tests designed to measure the scalability of the solution
when processing long running data analytics jobs. The open source TeraSort utility included in the Hadoop
distribution was used during this phase of testing. Using the data created with TeraGen, TeraSort was tested with
cluster sizes of 8, 16, and 24 data nodes, a map count of seven, and a reducer count of five per data node. Testing
began with a sort of the 1TB data set on an eight-data-node cluster. The test was repeated with a 2TB data set on a
16-data-node cluster and a 3TB data set on a 24-data-node cluster. The elapsed job run time was recorded after
each test. Each test began with a freshly created TeraGen data source. The results are presented in Table 2 and
Figure 6.
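A TeraSort run of this shape is typically invoked as follows; `mapred.reduce.tasks` is the pre-YARN (Hadoop 0.20.x-era) reducer setting, and the jar path and directories are illustrative:

```shell
# Sort the TeraGen output; mapred.reduce.tasks sizes the reduce phase
# (e.g., 5 reducers per data node x 24 nodes = 120). Paths are
# illustrative and this requires a running Hadoop cluster.
hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort \
    -Dmapred.reduce.tasks=120 /benchmarks/tera-1tb /benchmarks/tera-1tb-sorted
```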
Table 2. Performance Scalability Test Results: Data Analytics with the TeraSort Utility

  Data nodes                          8          16          24
  Hadoop data set size (TB)           1           2           3
  Job completion time (hh:mm:ss)    00:29:19    00:30:19    00:30:21
  Aggregate throughput (MB/sec)     542         1,049       1,571
Figure 6. Data Analytics Performance Analysis
What the Numbers Mean
- The job completion time for each of the TeraSort runs was recorded as the amount of data generated, the number of data nodes, and the number of E2660 arrays increased linearly.
- As the number of data nodes grew and the volume of data generated increased linearly, job completion time remained flat at approximately 30 minutes (+/- 2%).
- As shown in Figure 6, aggregate analytics throughput scaled linearly as data nodes and E2660 arrays were added to the cluster.
Why This Matters
A growing number of organizations are deploying big data analytics platforms to improve the efficiency and
profitability of their businesses. ESG research indicates that data analytics and managing data growth are among
the top five IT priorities in more than 50% of organizations. When asked about their data analytics challenges, 29%
said data set sizes are limiting their ability to perform analytics, and 28% reported difficulty in completing analytics
within a reasonable period of time.
The NetApp Open Solution for Hadoop combines the compute scalability of a shared Hadoop cluster with the
storage efficiency and scalability of network-free hardware RAID. Because the solution was designed to have the
Hadoop data replication setting lower than the default and because it standardizes on a 10GbE network, there is
less chance of having a network bottleneck compared to a traditional Hadoop deployment as data volumes grow.
ESG Lab confirmed that NetApp has created a big data analytics solution with near-linear performance scalability
that dwarfs the capabilities of traditional databases and disk arrays—testing with a 24-node cluster and a 3TB data
set scaled up to 4.63 GB/sec of aggregate load throughput and 1.57 GB/sec of aggregate analytics throughput.
Efficiency
The NetApp Open Solution for Hadoop improves capacity and performance efficiency compared to a traditional
Hadoop deployment. With protection from disk failures provided by NetApp E2660s with hardware RAID, the
Hadoop default replication setting of three can be reduced to two. NetApp E2660s with network-free hardware
RAID-5 (6+1) and a Hadoop replication count of two increase storage capacity utilization by 22%, compared to a
Hadoop cluster with internal drives and a default replication count of three. Network-free hardware RAID also
increases the performance and scalability of the cluster due to a reduction in the amount of mirrored data flowing
over the network.
ESG Lab Testing
The TeraGen tests were repeated with the Hadoop default replication count of three as the size of the cluster was
increased from eight to 24 data nodes. The elapsed job times were compared with those collected earlier with a
replication count of two. The results are summarized in Figure 7 and Table 3.
Figure 7. Increasing Hadoop Cluster Efficiency with the “NetApp Effect”
Table 3. Performance Efficiency Test Results: Data Loading with TeraGen

  Data nodes                                       8          16          24
  Hadoop data set size (TB)                        1           2           3
  Replication count 2: completion (hh:mm:ss)     00:10:06    00:09:52    00:10:18
  Replication count 2: throughput (MB/sec)       1,573       3,221       4,629
  Replication count 3: completion (hh:mm:ss)     00:15:32    00:16:11    00:16:44
  Replication count 3: throughput (MB/sec)       1,023       1,964       2,849
What the Numbers Mean
- As the number of data nodes grew and the volume of data generated increased linearly, the job completion time remained flat at approximately ten minutes (+/- 2%) with a NetApp-enabled replication count of two.
- Job completion time increased by 50% or more with the Hadoop default replication count of three due to the extra processing and network overhead associated with triple mirroring.
- The increase in cluster efficiency (the NetApp effect) not only reduced job completion times, but also increased aggregate throughput.
- As shown in Figure 7, the NetApp effect was magnified as the size of the cluster and the amount of network traffic increased. Note how the “Replication 2 with NetApp” line (green, circles) increases linearly compared to the “Replication 3” line (red, triangles), and how the gap between the two widens as the cluster grows due to the increase in network traffic.
- The NetApp effect resulted in a peak aggregate throughput improvement of 62.5% during the 24-node test (4.629 vs. 2.849 GB/sec).
Why This Matters
Data growth shows no signs of abating. As data accumulates, there is a corresponding burden on IT to maintain
acceptable levels of performance, whether that is measured by the speed with which an application responds, the
ability to aggregate and deliver data, or the ultimate business value of information. Management teams are
recognizing that their growing data stores bring massive, and largely untapped, potential to improve business
intelligence. At the same time, they also recognize the challenges that big data poses to existing analytics tools and
processes, as well as the impact data growth is having on the bottom line in the form of increased requirements for
storage capacity and compute power. It is for these reasons that IT managers are struggling to meet the conflicting
goals of keeping up with explosive data growth and lowering the cost of delivering data analytics services.
The default replication count for Hadoop is three. This is strongly recommended for data protection with Hadoop
configurations with internal disk drives. Replication is also needed for cluster self-healing. “Self-healing” is used to
describe Hadoop’s ability to ensure job completion in the event of task failure. It does this by reassigning failed
tasks to other nodes in the cluster. This is made possible by the replication of blocks throughout the cluster.
With the NetApp Open Solution for Hadoop, replication is not required for data protection since data is protected
with hardware RAID. As a result, a replication count of two is sufficient for self-healing. Hadoop MapReduce jobs
that write data to the HDFS, such as data ingest, benefit from the lower replication count: they generally run faster
and require less storage space than a Hadoop cluster with internal disk storage and a replication count of three.
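In deployment terms, the lower replication count is a one-line Hadoop setting. A hypothetical hdfs-site.xml fragment is shown below; the property name is standard in the 0.20.x-era Hadoop releases contemporary with this report:

```xml
<!-- hdfs-site.xml fragment (illustrative): with data protected by
     E2660 hardware RAID, replication can drop from the default 3 to 2. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```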
During ESG Lab testing with a 24-node cluster, the NetApp effect reduced disk capacity requirements by 22% as it
increased aggregate data load performance by 62%. In other words, organizations can manage more data at a
lower cost with NetApp.
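The two headline numbers in this section can be reproduced with simple arithmetic; the RAID 5 (6+1) overhead factor of 7/6 is our assumption based on the configuration described earlier:

```python
# Sketch of the efficiency arithmetic behind the "NetApp effect" claims.
# Assumes RAID 5 (6+1) consumes 7/6 raw disk per usable byte.

raw_rep3_internal = 3.0            # replication 3, internal drives, no RAID
raw_rep2_raid5 = 2 * 7 / 6         # replication 2 on RAID 5 (6+1)
capacity_saving = 1 - raw_rep2_raid5 / raw_rep3_internal
print(f"{capacity_saving:.1%}")    # 22.2%

throughput_gain = 4629 / 2849 - 1  # 24-node TeraGen runs, Table 3
print(f"{throughput_gain:.1%}")    # 62.5%
```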
Recoverability
When a name node fails, a Hadoop administrator needs to recover the metadata and restart the Hadoop cluster
using a standby, secondary name node. This single point of failure is being addressed by the open source Hadoop
community, but a fix was not yet generally available when this report was published.

In a Hadoop server cluster with internal storage, when a disk drive fails, the entire data node is “blacklisted” and no
longer available to execute tasks. This can result in degraded performance and the need for a Hadoop administrator
to take the data node offline, service and replace the failed component, and then redeploy it. This process can take
several hours to complete.
NetApp Open Solution for Hadoop increases the availability and recoverability of a Hadoop cluster in three
significant ways:
1. Recovery from a name node failure is accelerated dramatically using an NFS attached FAS2040 instead of
internal storage on the primary and secondary name nodes. If and when a name node failure occurs, a
quick recovery from an NFS attached FAS2040 can restore analytics services in minutes instead of hours.
2. NetApp E2660s with hardware RAID provide transparent recovery from hard drive failures. The data node is
not blacklisted and any job tasks that were running continue uninterrupted.
3. The NetApp E2660 management console (SANtricity) provides a centralized management GUI for
monitoring and managing drive failures. This reduces the complexity associated with manually recovering
from drive failures in a Hadoop cluster with internal drives.
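The NFS-based name node protection in the first point comes down to listing an NFS mount alongside the local directory in the name node's metadata setting; HDFS writes the metadata synchronously to every listed directory. The property name is the 0.20.x-era one, and the paths below are illustrative:

```xml
<!-- hdfs-site.xml fragment (illustrative paths): name node metadata is
     written to both directories; the second is an NFS mount backed by
     the FAS2040, enabling the fast recovery described above. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/dfs/name,/mnt/fas2040/namedir</value>
</property>
```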
ESG Lab Testing
A variety of errors were tested with a 24-data-node Hadoop cluster running a 3TB TeraSort job. As shown in Figure
8, errors were injected to validate that jobs continue to run after data node and E2660 hard drive failures, and that
the cluster can be quickly recovered after a name node failure. A dual drive failure was also tested to simulate and
measure job recovery time after an internal hard drive failure in a traditional Hadoop cluster.
Figure 8. ESG Lab Error Injection Testing
Disk Drive Failure
To simulate a disk drive failure, a drive was taken offline while a Hadoop TeraSort job was running.4 The Hadoop job
tracker web interface was used to confirm that the job completed successfully. The NetApp E2660 SANtricity
management console was used to identify which drive had failed and monitor automatic recovery from a hot spare.
A SANtricity management console screenshot taken shortly after the drive had failed is shown in Figure 9.
Figure 9. Transparent Recovery from a Hard Drive Failure with E2660 Hardware RAID
Another TeraSort job was started. While it was running, a lab manager physically replaced the failed hard drive.
The TeraSort job completed without error, as expected. Another TeraSort job was started, and a dual drive error
was introduced to simulate and measure the job completion time after a traditional Hadoop hard drive failure in a
data node.5 As shown in Table 4, the TeraSort job took slightly longer (5.7% longer) to complete during the single
drive failure with the hardware RAID recovery of the NetApp E2660. The simulated internal drive failure took more
than twice as long (237.9% of the healthy-cluster run time) as the data node was blacklisted and job tasks were
restarted on surviving nodes.
Table 4. Drive Failure Recovery Results

  Test Scenario                      Job Completion Time   Throughput   Delta
                                     (hh:mm:ss)            (MB/sec)     (vs. Healthy Cluster)
  Healthy cluster                    00:30:21              1,821        N/A
  NetApp E2660 drive failure         00:32:06              1,486        -5.7%
  Internal data node drive failure   01:12:13              660          -237.9%
4. Drive failures were introduced when the Hadoop job tracker indicated that the TeraSort job was 80% complete.
5. In a Hadoop cluster using internal disk drives, a local file system is created on each disk. If a disk fails, that file system fails. A local disk failure was simulated during ESG Lab testing by failing two disk drives in the same RAID 5 volume group. All data on that file system was lost and all tasks running on that file system failed. The job tracker detected this and reassigned failed tasks to other nodes where copies of the lost blocks exist. With the NetApp solution, a single disk drive failure has very little impact on running tasks, and all data in the local file system using that LUN remains available as the RAID reconstruct begins. With direct attached disks, if a single disk fails, a file system fails as described above.
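Translating the completion times in Table 4 into slowdown factors relative to the healthy baseline is straightforward; a small sketch:

```python
# Sketch: turning the job-completion times in Table 4 into slowdown
# factors relative to the healthy-cluster baseline.

def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

baseline = to_seconds("00:30:21")       # healthy cluster
e2660_fail = to_seconds("00:32:06")     # single E2660 drive failure
internal_fail = to_seconds("01:12:13")  # simulated internal drive failure

print(round(e2660_fail / baseline, 3))     # 1.058: a few percent slower
print(round(internal_fail / baseline, 3))  # 2.379: well over twice as long
```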
The screen shot shown in Figure 10 shows Hadoop job tracker status after the successful completion of the
TeraSort job following the simulated internal hard drive failure. Note how the non-zero failed/killed counts indicate
the number of map and reduce tasks that were restarted on surviving nodes (439 and 5, respectively).
Figure 10. Jobs Completion after a Simulated Internal Hard Drive Failure
The screen shot shown in Figure 11 summarizes the status of the Hadoop Distributed File System (HDFS) after the
data node with a simulated internal hard drive failure was blacklisted. These errors didn’t occur with the E2660
drive failure, as the Hadoop job ran uninterrupted.
Figure 11. Hadoop Self-healing in Action: Cluster Summary after a Simulated Internal Drive Failure
Name Node Failure
The Hadoop name node server was halted while a TeraSort job was running with a goal of demonstrating how an
NFS attached NetApp FAS2040 can be used to quickly recover from the single point of failure when a name node
goes offline in a Hadoop cluster. As shown in Figure 12, the job failed as expected after 13 minutes and 23 seconds.
After the job failed, the procedure outlined in the NetApp Open Solution for Hadoop Solutions Guide6 was used to
copy the name node metadata to the secondary name node and start the name node daemon there.
Figure 12. Job Failure after a Name Node Failure: NetApp FAS2040 Recovery Begins
Five minutes after the recovery process began, the Hadoop cluster was up and running. An fsck of the HDFS file system indicated that the cluster was healthy, and a restarted TeraSort job completed without error.
Why This Matters
A majority of respondents to a recent ESG survey indicated that three hours or less of data analytics platform downtime would result in significant revenue loss or other adverse business impact. The name node, a single point of failure in the open source Hadoop distribution that was generally available as of this writing, can lead to three or more hours of data analytics platform unavailability.
ESG Lab has confirmed that the NetApp Open Solution for Hadoop reduces name node recovery time from hours to minutes (five minutes during ESG Lab testing). NetApp E2660s with hardware RAID dramatically improved recoverability after simulated hard drive failures. The complexity and performance impact of a blacklisted data node was avoided: a 3TB TeraSort analytics job with NetApp completed more than twice as quickly as with a simulated internal hard drive failure.
6 http://media.netapp.com/documents/tr-3969.pdf
ESG Lab Validation Highlights
- The capacity and performance of the NetApp solution scaled linearly as data nodes and NetApp E2660 storage arrays were added to a Hadoop cluster. ESG Lab tested up to 24 data nodes and six NetApp E2660 arrays with 720TB of usable disk capacity.
- Load performance testing with the TeraGen utility delivered linear performance scalability. A 24-node cluster sustained a high aggregate load throughput rate of 4.630 GB/sec.
- Big data analytics performance testing with the TeraSort utility yielded linear performance scalability as data nodes and E2660 arrays were added.
- Network-free hardware RAID and a lower Hadoop replication count reduced network overhead, which increased the aggregate performance of the cluster. A peak aggregate throughput improvement of 62.5% was recorded during the 24-node test (4.629 vs. 2.849 GB/sec).
- A MapReduce job running during a simulated internal drive failure took more than twice as long (225%) to complete as during the failure of a hardware RAID protected E2660 drive.
- An NFS attached NetApp FAS2040 for name node metadata storage was used to recover from a primary name node failure in five minutes, compared to multiple hours in a traditional configuration.
Issues to Consider
- While the results demonstrate how the NetApp Open Solution for Hadoop is ideally suited to meet the extreme compute and storage performance needs of big data analytic loads and long-running queries, applications with lots of small files, multiple writers, or many users with low response time requirements may be better served by traditional relational databases and storage solutions.
- The single point of failure in the Hadoop distribution used during this ESG Lab Validation is being fixed in the open source community, but the fix was not yet available and therefore not tested as part of ESG Lab's assessment of the NetApp Open Solution for Hadoop. Even so, future releases of Hadoop that resolve the name node failure problem are still expected to rely on NFS shared storage as a functional requirement. NetApp, with its FAS family, is an industry leader in NFS shared storage.
- The test results presented in this report are based on benchmarks deployed in a controlled environment. Due to the many variables in each production data center environment, capacity planning and testing in your own environment are recommended.
- A growing number of best practices, tuning guidelines, and proof points are available for reference when planning, deploying, and tuning the NetApp Open Solution for Hadoop. To learn more, visit: http://www.netapp.com/hadoop.
The Bigger Truth
Whether measured by increased revenues, market share gains, reduced costs, or scientific breakthroughs, data
analytics has always played a key role in the ability to harness value from electronically-stored information. What
has changed recently is that, as more business processes have become automated, information that was once
stored in separate online and offline repositories and formats is now readily available for amalgamation and
analysis to increase business insight and enhance decision support. Business executives are asking more of their
data and are expecting faster and more impactful answers. The result is an ever-increasing priority on data analytics
activities and, subsequently, more pressure on existing business analyst and IT teams to deliver.
Hadoop is a powerful open source framework for data analytics. It's an emerging, fast-growing solution that's considered one of the most impactful technology innovations since HTML. While ESG research indicates that only a small number of organizations are using Hadoop at this time, interest in and plans for adoption over the next 12-18 months are high (48%).
For those new to Hadoop, there is a steep learning curve. Very few enterprise applications are built to run on massively parallel clusters, so there is much to learn. The NetApp Open Solution for Hadoop is a tested and proven storage reference architecture that reduces the risk and time associated with Hadoop adoption.
NetApp has embraced the open source Hadoop model and is working with major distributors to support open
source Hadoop software running on industry standard servers. Instead of promoting the use of a proprietary
clustered file system, NetApp has embraced the use of the open source Hadoop file system (HDFS). Instead of
promoting the use of SAN or NAS attached storage, NetApp has embraced the use of direct attached storage. Using
SAS direct connected NetApp E2660 arrays with hardware protected RAID, the NetApp solution improves
performance, scalability, and availability compared to typical internal hard drive Hadoop deployments. Thanks to an
NFS attached NetApp FAS2040 for shared access to metadata, recovery from a Hadoop name node failure is
reduced from hours to minutes.
With up to 5 GB/sec of aggregate TeraGen load performance on a 24-node cluster, ESG Lab has confirmed that the NetApp Open Solution for Hadoop provides excellent near-linear performance scalability that dwarfs the capabilities of traditional disk arrays and databases. NetApp E2660s with network-free hardware RAID improved the efficiency
traditional disk arrays and databases. NetApp E2660s with network-free hardware RAID improved the efficiency
and performance of the cluster by 66% compared to a traditional Hadoop deployment with triple mirroring. The
value of transparent RAID recovery was obvious after drive failures were simulated: the performance impact on a
long running sort job was less than 6% compared to more than 200% for a simulated internal drive failure that
blacklisted a Hadoop data node.
If you’re looking to accelerate the delivery of insight to your business with an enterprise-class big data analytics
infrastructure, ESG Lab recommends a close look at the NetApp Open Solution for Hadoop—it reduces risk with a
storage solution that delivers reliability, fast deployment, and scalability of open source Hadoop for the enterprise.
Appendix
The configuration of the test bed that was used during the ESG Lab Validation is summarized in Table 5.
Table 5. Configuration Summary

Servers
  HDFS data nodes: 24 servers, each with quad core Intel Xeon CPU, 48GB RAM
  HDFS name node: 1 server with quad core Intel Xeon CPU, 48GB RAM
  HDFS secondary name node: 1 server with quad core Intel Xeon CPU, 48GB RAM
  HDFS job tracker: 1 server with quad core Intel Xeon CPU, 48GB RAM

Network
  10 GbE host connect: One 10GbE CAN connection for all data nodes, name node, secondary name node, and job tracker
  10 GbE switched fabric: Cisco Nexus 5010, 10 GigE, Jumbo Frames (MTU=9000)

Storage
  HDFS data node storage: 6 NetApp E2660 arrays, 6Gb SAS host connect, 6+1 RAID-5, 2TB near-line SAS 7.2K RPM drives, 360 drives total, version 47.77.19.99
  HDFS name node storage: 1 NetApp FAS2040, 1GbE NAS host connect, 6 disks, 1TB each, 7.2K RPM, Data ONTAP 8.0.2 7-mode
  Operating system boot drives: Local 1TB 7.2K RPM SATA drive in each node

Software
  Operating system: Red Hat Enterprise Linux version 5, update 6 (RHEL5.6)
  Analytics platform: Cloudera Hadoop (CDH3u2)

HDFS Configuration Changes vs. Cloudera V3U2 Distribution
  Local file system: XFS
  Map/reduce tasks per data node: 7/5
Table 6 lists the differences between Hadoop core-site.xml defaults and the settings used during ESG Lab testing.
Table 6. Hadoop core-site Settings

fs.default.name: Name of the default file system, specified as a URI (IP address or hostname of the name node along with the port to be used). Actual: hdfs://10.61.189.64:8020/; Default: file:///
webinterface.private.actions: Enables or disables certain management functions within the Hadoop Web user interface, including the ability to kill jobs and modify job priorities. Actual: true; Default: false
fs.inmemory.size.mb: Memory in MB to be used for merging map outputs during the reduce phase. Actual: 200; Default: 100
io.file.buffer.size: Size in bytes of the read/write buffer. Actual: 262144; Default: 4096
topology.script.file.name: Script used to resolve a slave node's name or IP address to a rack ID; used to invoke Hadoop rack awareness. Actual: /etc/hadoop/conf/topology_script; Default: null, which results in all slaves being given a rack ID of "/default-rack"
topology.script.number.args: Sets the maximum acceptable number of arguments to be sent to the topology script at one time. Actual: 1; Default: 100
hadoop.tmp.dir: Hadoop temporary directory storage. Actual: /home/hdfs/tmp; Default: /tmp/hadoop-${user.name}
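For readers reproducing the configuration, the non-default values in Table 6 would be expressed in core-site.xml roughly as follows. This is a minimal sketch using standard Hadoop XML configuration conventions; only a few representative properties from the table are shown.

```xml
<?xml version="1.0"?>
<!-- Sketch of a core-site.xml excerpt reflecting non-default values from Table 6 -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.61.189.64:8020/</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>262144</value>
  </property>
  <property>
    <name>topology.script.file.name</name>
    <value>/etc/hadoop/conf/topology_script</value>
  </property>
</configuration>
```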
Table 7 lists the differences between Linux sysctl.conf defaults and the settings used during ESG Lab testing.
Table 7. Linux sysctl.conf Settings

net.ipv4.ip_forward: Controls IP packet forwarding. Actual: 0; Default: 0
net.ipv4.conf.default.rp_filter: Controls source route verification. Actual: 1; Default: 0
net.ipv4.conf.default.accept_source_route: Do not accept source routing. Actual: 0; Default: 1
kernel.sysrq: Controls the system request debugging functionality of the kernel. Actual: 0; Default: 1
kernel.core_uses_pid: Controls whether core dumps will append the PID to the core filename; useful for debugging multithreaded applications. Actual: 1; Default: 0
kernel.msgmnb: Controls the maximum size of a message queue, in bytes. Actual: 65536; Default: 16384
kernel.msgmax: Controls the maximum size of a message, in bytes. Actual: 65536; Default: 8192
kernel.shmmax: Controls the maximum shared memory segment size, in bytes. Actual: 68719476736; Default: 33554432
kernel.shmall: Controls the total amount of shared memory available, in pages. Actual: 4294967296; Default: 2097512
net.core.rmem_default: Sets the default OS receive buffer size. Actual: 262144; Default: 129024
net.core.rmem_max: Sets the maximum OS receive buffer size. Actual: 16777216; Default: 131071
net.core.wmem_default: Sets the default OS send buffer size. Actual: 262144; Default: 129024
net.core.wmem_max: Sets the maximum OS send buffer size. Actual: 16777216; Default: 131071
net.core.somaxconn: Maximum number of sockets the kernel will serve at one time; set on the name node, secondary name node, and job tracker. Actual: 1000; Default: 128
fs.file-max: Sets the total number of file descriptors. Actual: 6815744; Default: 4847448
net.ipv4.tcp_timestamps: Disables TCP time stamps if set to "0". Actual: 0; Default: 1
net.ipv4.tcp_sack: Enables selective ACK for TCP. Actual: 1; Default: 1
net.ipv4.tcp_window_scaling: Enables TCP window scaling. Actual: 1; Default: 1
kernel.shmmni: Sets the maximum number of shared memory segments. Actual: 4096; Default: 4096
kernel.sem: Sets the maximum number and size of semaphore sets that can be allocated. Actual: 250 32000 100 128; Default: 250 32000 32 128
fs.aio-max-nr: Sets the maximum number of concurrent I/O requests. Actual: 1048576; Default: 65536
net.ipv4.tcp_rmem: Sets min, default, and max receive window size. Actual: 4096 262144 16777216; Default: 4096 87380 4194304
net.ipv4.tcp_wmem: Sets min, default, and max transmit window size. Actual: 4096 262144 16777216; Default: 4096 87380 4194304
net.ipv4.tcp_syncookies: Disables TCP syncookies if set to "0". Actual: 0; Default: 0
sunrpc.tcp_slot_table_entries: Sets the maximum number of in-flight RPC requests between a client and a server; set on the name node and secondary name node to improve NFS performance. Actual: 128; Default: 16
vm.dirty_background_ratio: Maximum percentage of active system memory that can be used for dirty pages before dirty pages are flushed to storage. Actual: 1; Default: 10
Table 8 lists the differences between Hadoop hdfs-site.xml defaults and the settings used during ESG Lab testing.
Table 8. HDFS Site Settings

dfs.name.dir: Path on the local file system where the name node persistently stores the namespace and transaction logs. If this is a comma-delimited list of directories (as used in this configuration), the name table is replicated in all of the directories for redundancy. Note: Directory /mnt/fsimage_bkp is a location on NFS-mounted NetApp FAS storage where name node metadata is mirrored and protected, a key feature of NetApp's Hadoop solution. Actual: /local/hdfs/namedir,/mnt/fsimage_bkp; Default: ${hadoop.tmp.dir}/dfs/name
dfs.hosts: Specifies a list of machines authorized to join the Hadoop cluster as data nodes. Actual: /etc/hadoop-0.20/conf/dfs_hosts; Default: null
dfs.data.dir: Directory paths on the data node local file systems where HDFS data blocks are stored. Actual: /disk1/data,/disk2/data; Default: ${hadoop.tmp.dir}/dfs/data
fs.checkpoint.dir: Directory path where checkpoint images are stored (used by the secondary name node). Actual: /home/hdfs/namesecondary1; Default: ${hadoop.tmp.dir}/dfs/namesecondary
dfs.replication: HDFS block replication count. The NetApp Hadoop solution uses a replication setting of 2. Actual: 2; Default: 3
dfs.block.size: HDFS data storage block size in bytes. Actual: 134217728 (128MB); Default: 67108864
dfs.namenode.handler.count: Number of server threads for the name node. Actual: 128; Default: 10
dfs.datanode.handler.count: Number of server threads for the data node. Actual: 64; Default: 3
dfs.max-repl-streams: Maximum number of replications a data node is allowed to handle at one time. Actual: 8; Default: 2
dfs.datanode.max.xcievers: Maximum number of files a data node will serve at one time. Actual: 4096; Default: 256
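The dfs.name.dir setting above is the heart of the name node protection scheme: metadata is written both to a local directory and to the NFS-mounted NetApp FAS location. A minimal hdfs-site.xml sketch of that arrangement follows; the comma-delimited path pair is an assumption based on the table's description of the setting, and only a few of the listed properties are shown.

```xml
<?xml version="1.0"?>
<!-- Sketch of an hdfs-site.xml excerpt: name node metadata is mirrored to a
     local directory and to the NFS-mounted NetApp FAS backup location -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/local/hdfs/namedir,/mnt/fsimage_bkp</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
</configuration>
```

With this layout, recovering from a name node failure amounts to pointing a standby server at the NFS copy of the metadata, which is why recovery took minutes rather than hours in the testing described earlier.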
Table 9 lists the differences between mapred-site.xml defaults and the settings used during ESG Lab testing.
Table 9. mapred-site Settings

mapred.job.tracker: Job tracker address as a URL (job tracker IP address or hostname with port number). Actual: 10.61.189.66:9001; Default: local
mapred.local.dir: Comma-separated list of local file system paths where temporary MapReduce data is written. Actual: /disk1/mapred/local,/disk2/mapred/local; Default: ${hadoop.tmp.dir}/mapred/local
mapred.hosts: Specifies the file containing the list of nodes allowed to join the Hadoop cluster as task trackers. Actual: /etc/hadoop-0.20/conf/mapred.hosts; Default: null
mapred.system.dir: Path in HDFS where the MapReduce framework stores control files. Actual: /mapred/system; Default: ${hadoop.tmp.dir}/mapred/system
mapred.reduce.tasks.speculative.execution: Enables the job tracker to detect slow-running reduce tasks, assign them to run in parallel on other nodes, use the first available results, and then kill the slower running reduce tasks. Actual: false; Default: true
mapred.map.tasks.speculative.execution: Enables the job tracker to detect slow-running map tasks, assign them to run in parallel on other nodes, use the first available results, and then kill the slower running map tasks. Actual: false; Default: true
mapred.tasktracker.reduce.tasks.maximum: Maximum number of reduce tasks that can be run simultaneously on a single task tracker node. Actual: 5; Default: 2
mapred.tasktracker.map.tasks.maximum: Maximum number of map tasks that can be run simultaneously on a single task tracker node. Actual: 7; Default: 2
mapred.child.java.opts: Java options passed to the task tracker child processes (in this case, 1GB defined for heap memory used by each individual JVM). Actual: -Xmx1024m; Default: -Xmx200m
io.sort.mb: Total amount of buffer memory allocated to each merge stream while sorting files on the mapper, in MB. Actual: 340; Default: 100
mapred.jobtracker.taskScheduler: Job tracker task scheduler to use (in this case, the FairScheduler). Actual: org.apache.hadoop.mapred.FairScheduler; Default: org.apache.hadoop.mapred.JobQueueTaskScheduler
io.sort.factor: Number of streams to merge at once while sorting files. Actual: 100; Default: 10
mapred.output.compress: Enables/disables MapReduce output file compression. Actual: false; Default: false
mapred.compress.map.output: Enables/disables map output compression. Actual: false; Default: false
mapred.output.compression.type: Sets output compression type. Actual: block; Default: record
mapred.reduce.slowstart.completed.maps: Fraction of the number of map tasks that should be complete before reducers are scheduled for the MapReduce job. Actual: 0.05; Default: 0.05
mapred.reduce.tasks: Total number of reduce tasks available for the entire cluster. Actual: 40 for 8 data nodes, 80 for 16 data nodes, 120 for 24 data nodes; Default: 1
mapred.map.tasks: Total number of map tasks available for the entire cluster. Actual: 56 for 8 data nodes, 112 for 16 data nodes, 168 for 24 data nodes; Default: 2
mapred.reduce.parallel.copies: Number of parallel threads used by reduce tasks to fetch outputs from map tasks. Actual: 64; Default: 5
mapred.inmem.merge.threshold: Number of map outputs in the reduce task tracker's memory at which map data is merged and spilled to disk. Actual: 0; Default: 1000
mapred.job.reduce.input.buffer.percent: Percent usage of the map outputs buffer at which the map output data is merged and spilled to disk. Actual: 1; Default: 0
mapred.job.tracker.handler.count: Number of job tracker server threads for handling RPCs from the task trackers. Actual: 128; Default: 10
tasktracker.http.threads: Number of task tracker worker threads for fetching intermediate map outputs for reducers. Actual: 60; Default: 40
mapred.job.reuse.jvm.num.tasks: Maximum number of tasks that can be run in a single JVM for a job. A value of "-1" sets the number to "unlimited." Actual: -1; Default: 1
mapred.jobtracker.restart.recover: Enables job recovery after restart. Actual: true; Default: false
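The task-slot and speculative execution settings in Table 9 would take roughly the following form in mapred-site.xml. This is a hedged sketch, not a complete file: only the properties governing per-node task slots (7 map / 5 reduce, as in Table 5) and speculative execution are shown, with values taken from the table.

```xml
<?xml version="1.0"?>
<!-- Sketch of a mapred-site.xml excerpt: speculative execution disabled,
     task slots sized for 7 map / 5 reduce tasks per data node -->
<configuration>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>5</value>
  </property>
</configuration>
```

Disabling speculative execution is a sensible pairing with the lower replication count: with only two copies of each block, redundant duplicate tasks would add network and disk load without improving job completion time on healthy clusters.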
- 23. 20 Asylum Street | Milford, MA 01757 | Tel: 508.482.0188 Fax: 508.482.0218 | www.enterprisestrategygroup.com