Lab Validation
Report
NetApp Open Solution for Hadoop

Open Source Data Analytics with Enterprise-class Storage Services


By Brian Garrett, VP, ESG Lab, & Julie Lockner, Sr Analyst & VP, Data Management




May 2012




© 2012, Enterprise Strategy Group, Inc. All Rights Reserved.

Contents
 Introduction
   Background
   NetApp Open Solution for Hadoop
 ESG Lab Validation
   Getting Started
   Performance and Scalability
   Efficiency
   Recoverability
 ESG Lab Validation Highlights
 Issues to Consider
 The Bigger Truth
 Appendix




   ESG Lab Reports
   The goal of ESG Lab reports is to educate IT professionals about data center technology products for
   companies of all types and sizes. ESG Lab reports are not meant to replace the evaluation process that should
   be conducted before making purchasing decisions, but rather to provide insight into these emerging
   technologies. Our objective is to go over some of the more valuable feature/functions of products, show how
   they can be used to solve real customer problems and identify any areas needing improvement. ESG Lab's
   expert third-party perspective is based on our own hands-on testing as well as on interviews with customers
   who use these products in production environments. This ESG Lab report was sponsored by NetApp.




 All trademark names are property of their respective companies. Information contained in this publication has been obtained from sources The Enterprise
 Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from
 time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in
 part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of the Enterprise
 Strategy Group, Inc., is in violation of U.S. Copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should
 you have any questions, please contact ESG Client Relations at 508.482.0188.





Introduction
This ESG Lab report presents the results of hands-on testing of the NetApp Open Solution for Hadoop, a highly
reliable, ready-to-deploy, scalable storage solution for enterprise Hadoop deployments.

Background
Driven by unrelenting data volume growth, the need for real-time data processing and data analytics, and the
increasing complexity and variety of data sources, ESG expects broad adoption of MapReduce data processing and
analytics frameworks over the next two to five years. These frameworks require new approaches for storing,
integrating and processing “big data.” ESG defines big data as any data set that exceeds the boundaries and sizes of
traditional IT processing; big data sets can range from ten to hundreds of terabytes in size.
Data analytics is a top IT priority for forward-looking IT organizations. In fact, a recent ESG survey indicates that
more than half (54%) of enterprise organizations (i.e., 1,000 or more employees) consider data analytics to be a
top-five IT priority and 38% plan on deploying a new data analytics solution in the next 12-18 months. A growing
number of IT organizations are using the open source Apache Hadoop MapReduce framework as a foundation for
their big data analytics initiatives. As shown in Figure 1, more than 50% of organizations polled by ESG are using
Hadoop, planning to deploy Hadoop in the next 12 months, or considering Hadoop.1
     Figure 1. Plans to Implement a MapReduce Framework such as Apache Hadoop

                      What are your organization’s plans to implement a MapReduce framework (e.g.,
                      Apache Hadoop) to address data analytics challenges? (Percent of respondents,
                      N=270)

                        Already using: 8%
                        Plan to implement within 12 months: 13%
                        No plans to implement at this time but interested: 35%
                        No plans to implement: 33%
                        Don’t know: 11%

                                                                                             Source: Enterprise Strategy Group, 2011.
As with any exciting and emerging technology, big data analytics has its challenges. Management is an issue
because the platforms are expensive: they require new server and storage purchases, integration with existing data
sets and processes, training in new technologies, an analytics toolset, and people with the expertise to put it all to
work. When IT managers were asked about their data analytics challenges, 47% named data integration complexity,
34% cited a lack of the skills needed to properly manage large data sets and derive value from them, 29% said data
set sizes limit their ability to perform analytics, and 28% cited difficulty completing analytics within a reasonable
period of time.

1
    Source: ESG Research Report, The Impact of Big Data on Data Analytics, September 2011.





Looking beyond the high-level organizational challenges associated with a big data analytics initiative, the Hadoop
framework adds technology and implementation issues that need to be considered. The common reference
architecture for a Hadoop cluster leverages commodity server nodes with internal hard drives; for conventional
data centers with mature ITIL processes, this introduces two challenges. First, data protection is, by default,
handled in the Hadoop software layer: every time a file is written to the Hadoop Distributed File System (HDFS),
two additional copies are written in case of a disk drive or data node failure. This not only impacts data ingest and
throughput performance, but also reduces disk capacity utilization. Second, high availability is limited by a single
point of failure in the Hadoop metadata repository, the name node. This single point of failure will eventually be
addressed by the Hadoop community but, in the meantime, analytics downtime due to a name node failure is a
key concern. As shown in Figure 2, a majority of ESG survey respondents (55%) indicate that three hours or less of
data analytics platform downtime would result in a significant revenue loss or other adverse business impact.2
     Figure 2. Data Analytics Downtime Tolerance

                        Please indicate the amount of downtime your organization’s data analytics
                      platforms can tolerate before your organization experiences significant revenue
                          loss or other adverse business impact. (Percent of respondents, N=399)

                        None: 6%
                        Less than 1 hour: 21%
                        1 hour to 3 hours: 26%
                        4 hours to 10 hours: 18%
                        11 hours to 24 hours: 10%
                        1 day to 3 days: 10%
                        More than 3 days: 4%
                        Don’t know: 6%

                                                                                              Source: Enterprise Strategy Group, 2011.
NetApp, in collaboration with leading Hadoop distribution vendors, is working to develop reference architectures,
best practices, and solutions that address these challenges while maximizing the speed, efficiency, and availability
of open source Hadoop deployments.




2
    Source: ESG Survey, The Convergence of Big Data Processing, Hadoop, and Integrated Infrastructure, December 2011.






NetApp Open Solution for Hadoop
Hadoop is a significant emerging open source technology for solving business problems around large volumes
of mostly unstructured data that cannot be analyzed with traditional database tools. The NetApp Open Solution for
Hadoop combines the power of the Hadoop framework with flexible storage, professional support, and services of
NetApp and its partners to deliver higher Hadoop cluster availability and efficiency. Based on a reference
architecture, it focuses on scaling Hadoop from its departmental origins to an enterprise infrastructure with
independent compute and storage scaling, faster cluster ingest, and faster job completion under failure conditions.
NetApp Open Solution for Hadoop extends the value of the open source Hadoop framework with enterprise-class
storage and services. As shown in Figure 3, NetApp FAS2040 and E2660 storage replace traditional internal DAS
hard drives within a Hadoop cluster. Compute and storage resources are decoupled with SAS-attached NetApp
E2660 arrays, and the recoverability of a failed Hadoop name node is improved with an NFS-attached FAS2040. The
storage components are completely transparent to the Hadoop distribution and require no modification to the
native, underlying Hadoop platform. Note that while the FAS2040 was used for this testing configuration, any other
product in the FAS storage family can be used.
  Figure 3. NetApp Open Solution for Hadoop




The NetApp Open Solution for Hadoop includes:
    •   NetApp E2660s with hardware RAID and hot-swappable disks, which increase efficiency, performance,
        scalability, availability, serviceability, and manageability compared to a traditional Hadoop deployment
        with internal hard drives and replication at the application layer. With data protected by hardware RAID,
        higher storage utilization rates can be achieved by reducing the default Hadoop replication count.
    •   A NetApp FAS2040 with shared NFS-attached capacity, which accelerates recoverability after a primary
        name node failure compared to a traditional Hadoop deployment with internal hard drives.
    •   A high speed 10 Gbps Ethernet network and direct-attached 6 Gbps SAS E2660s with network-free
        hardware RAID, which increase the performance, scalability, and efficiency of the Hadoop infrastructure.
    •   High capacity E2660 disk arrays and a building block design that decouples the compute and storage layers,
        providing near-linear scalability that’s ideally suited for big data analytics applications with extreme
        compute and storage capacity requirements.
    •   A field-tested solution comprised of an open source Apache Hadoop distribution and enterprise-class
        NetApp storage, with professional design services and support, which reduces risk and accelerates
        deployment.



ESG Lab Validation
ESG Lab performed hands-on evaluation and testing of the solution at a NetApp facility in Research Triangle Park, North
Carolina. Testing was designed to demonstrate that the NetApp Open Solution for Hadoop can perform and scale
linearly as data volumes and load increase, can recover from both a single and double node failure with no
disruption to a running Hadoop job, and can quickly recover from a name node failure. The performance and
scalability benefits of using network-free hardware RAID and a lower Hadoop replication count were evaluated as
well. Testing was performed using open source software, workload generators, and monitoring tools.

Getting Started
A Hadoop cluster with one name node, one secondary name node, one job tracker node, and up to 24 data nodes
was used during ESG Lab testing. Rack-mounted servers with quad core Intel Xeon processors and 48GB of RAM
were connected to six NetApp E2660s, with the name node and secondary name node connected to a single
NetApp FAS2040. Each NetApp E2660 was filled with 60 2TB 7200 RPM NL-SAS drives for a total raw capacity of
720TB. A building block approach was used, with groups of four data nodes sharing an E2660 through 6 Gbps SAS
connections. A 10 Gbps Ethernet network was used for the cluster interconnect, with 1 Gbps NFS connections to the
name and job tracker nodes. Cloudera’s Distribution for Hadoop was installed over the Red Hat Enterprise Linux
operating system on each of the nodes in the cluster.3
     Figure 4. The ESG Lab Test Bed




3
    Configuration details are listed in the Appendix.





Performance and Scalability
Hadoop uses a shared-nothing programming paradigm and a massively parallel clustered architecture to meet the
extreme compute and capacity requirements of big data analytics applications. Aiming to push past the
performance and scalability limits of traditional database architectures, Hadoop brings the compute power to
the data: the name node and job tracker handle distribution and orchestration while the data nodes do all of the
analytical processing work.
HDFS is a distributed network file system used by nodes in a Hadoop cluster. Software mirroring is the default data
protection scheme within the HDFS file system. For every block of data written into the HDFS file system, an
additional two copies are written to other nodes for a total of three copies. This is referred to as a replication count
of three, and is the default for most Hadoop implementations that rely on internal hard drive capacity. This
software data mirroring increases the processing load on data nodes and the utilization of the shared network
between nodes. To put this into perspective, consider what happens when a 2TB data set is loaded into a Hadoop
cluster with a default replication count of three: in this example, 2TB of application data results in 6TB of raw data
being processed and moved over the network.
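To make the replication arithmetic concrete, the sketch below shows how a file’s replication can be inspected and
adjusted with standard Hadoop shell commands (the path is illustrative, not taken from the tested configuration):

    # With a replication count of three, a 2TB ingest generates ~6TB of raw
    # writes across the cluster. fsck reports the replication of each file:
    hadoop fsck /data/ingest -files -blocks
    # Where the storage layer already protects against drive failure, the
    # count can be lowered in place; -w waits for re-replication to settle:
    hadoop fs -setrep -w 2 /data/ingest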
A NetApp E2660 with hardware RAID reduces the processing and network overhead associated with software
mirroring, which increases the performance and scalability of a Hadoop cluster. With up to 15 high capacity, high
performance disk drives (2TB, 7.2K NL-SAS) available for each data node, the performance of a Hadoop cluster is
magnified compared to a traditional Hadoop cluster with internal SATA drives. A right-sized building block approach
provides near-linear scalability as compute and storage capacity are added to a cluster.

ESG Lab Testing
ESG Lab performed a series of tests to measure the performance and scalability of the 24-data-node NetApp Open
Solution for Hadoop cluster. (The cluster actually contained 27 nodes: 24 data nodes, one name node, one
secondary name node, and one job tracker node.) The TeraGen utility, included in the open source Hadoop
distribution, was used to
simulate the loading of a large analytic data set. Testing was performed with cluster sizes of 8, 16, and 24 data
nodes and a Hadoop replication count of two. Testing began with the creation of a 1TB data set on an 8-data-node
cluster. The test was repeated with a 2TB data set on a 16-data-node cluster and a 3TB data set on a 24-data-node
cluster. The results are presented in Figure 5 and Table 1.
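For reference, a TeraGen run of this shape would be launched roughly as follows (a hedged sketch: the examples jar
path and HDFS output directory are assumed, and a 1TB data set corresponds to ten billion 100-byte rows):

    # Generate a 1TB data set (10,000,000,000 rows x 100 bytes each); the
    # jar path shown is typical for a CDH3 install and may differ:
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen \
        10000000000 /benchmarks/tera/in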
  Figure 5. Data Loading Performance Analysis







  Table 1. Performance Scalability Test Results: Data Loading with the TeraGen Utility
 Data nodes                                       8                         16                       24
 NetApp E2660 arrays                              2                          4                        6
 NetApp E2660 drives                            120                        240                      360
 Usable capacity (TB)                           180                        360                      540
 Hadoop data set size (TB)                        1                          2                        3
 Job completion time (hh:mm:ss)               00:10:06                   00:09:52                 00:10:18
 Aggregate throughput (MB/sec)                  1,574                      3,222                    4,630

What the Numbers Mean
    •   The NetApp Open Solution for Hadoop was designed to scale performance in near-linear fashion as data
        nodes and E2660 disk arrays are added to the cluster. This modular building block approach can also be
        used to provide consistent levels of performance as a data set grows.
    •   The job completion time for each of the TeraGen runs was recorded as the amount of data generated, the
        number of data nodes, and the number of E2660 arrays were increased linearly.
    •   In this example, the solution scaled up to 24 data nodes and six E2660 arrays with a total of 360 drives and
        720TB of raw disk capacity.
    •   As the number of data nodes and the volume of data generated increased linearly, the completion time
        remained flat at approximately ten minutes (+/- 3%). This demonstrates the linear performance scalability
        of the NetApp Open Solution for Hadoop.
    •   A job completion time of roughly ten minutes for the creation of a 3TB data set indicates that the 24-node
        NetApp solution sustained a high aggregate throughput rate of 4.630 GB/sec.
    •   An aggregate data creation rate of 4.630 GB/sec can be used to create 16.7TB of data per hour.
Performance testing continued with a similar series of tests designed to measure the scalability of the solution
when processing long running data analytics jobs. The open source TeraSort utility included in the Hadoop
distribution was used during this phase of testing. Using the data created with TeraGen, TeraSort was tested with
cluster sizes of 8, 16, and 24 data nodes, a map count of seven, and a reducer count of five per data node. Testing
began with a sort of the 1TB data set on an eight-data-node cluster. The test was repeated with a 2TB data set on a
16-data-node cluster and a 3TB data set on a 24-data-node cluster. The elapsed job run time was recorded after
each test. Each test began with a freshly created TeraGen data source. The results are presented in Table 2 and
Figure 6.
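A hedged sketch of a matching TeraSort invocation follows (paths assumed as before; seven maps and five reducers
per data node on an eight-data-node cluster work out to 56 and 40 tasks, with TeraSort treating the map count as a
hint derived from input splits):

    # Sort the TeraGen output; per-job task counts are passed as -D options
    # using the CDH3-era property names:
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort \
        -Dmapred.map.tasks=56 -Dmapred.reduce.tasks=40 \
        /benchmarks/tera/in /benchmarks/tera/out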
   Table 2. Performance Scalability Test Results: Data Analytics with the TeraSort Utility
 Data nodes                                       8                        16                        24
 Hadoop data set size (TB)                        1                         2                        3
 Job completion time (hh:mm:ss)               00:29:19                  00:30:19                  00:30:21
 Aggregate throughput (MB/sec)                   542                       1,049                    1,571







  Figure 6. Data Analytics Performance Analysis




What the Numbers Mean
    •   The job completion time for each of the TeraSort runs was recorded as the amount of data generated, the
        number of data nodes, and the number of E2660 arrays increased linearly.
    •   As the number of data nodes grew and the volume of data generated increased linearly, job completion
        time remained flat at approximately 30 minutes (+/- 2%).
    •   As shown in Figure 6, aggregate analytics throughput scaled linearly as data nodes and E2660 arrays were
        added to the cluster.



Why This Matters
A growing number of organizations are deploying big data analytics platforms to improve the efficiency and
profitability of their businesses. ESG research indicates that data analytics and managing data growth are among
the top five IT priorities in more than 50% of organizations. When asked about their data analytics challenges, 29%
said data set sizes are limiting their ability to perform analytics, and 28% reported difficulty in completing analytics
within a reasonable period of time.
The NetApp Open Solution for Hadoop combines the compute scalability of a shared-nothing Hadoop cluster with
the storage efficiency and scalability of network-free hardware RAID. Because the solution is designed to run with a
Hadoop data replication setting lower than the default, and because it standardizes on a 10GbE network, there is
less chance of a network bottleneck as data volumes grow compared to a traditional Hadoop deployment.
ESG Lab confirmed that NetApp has created a big data analytics solution with near-linear performance scalability
that dwarfs the capabilities of traditional databases and disk arrays: testing with a 24-node cluster and a 3TB data
set scaled up to 4.63 GB/sec of aggregate load throughput and 1.57 GB/sec of aggregate analytics throughput.





Efficiency
The NetApp Open Solution for Hadoop improves capacity and performance efficiency compared to a traditional
Hadoop deployment. With protection from disk failures provided by NetApp E2660s with hardware RAID, the
Hadoop default replication setting of three can be reduced to two. NetApp E2660s with network-free hardware
RAID-5 (6+1) and a Hadoop replication count of two increase storage capacity utilization by 22%, compared to a
Hadoop cluster with internal drives and a default replication count of three. Network-free hardware RAID also
increases the performance and scalability of the cluster due to a reduction in the amount of mirrored data flowing
over the network.

ESG Lab Testing
The TeraGen tests were repeated with a replication count of three as the size of the cluster was increased from eight
to 24 data nodes. The elapsed job times were compared with those collected earlier with the NetApp-enabled
replication count of two. The results are summarized in Figure 7 and Table 3.
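Because dfs.replication is a client-side, per-file attribute, a comparison run can set it per job rather than by editing
hdfs-site.xml; as a hedged sketch (paths assumed as before, and assuming TeraGen honors the job-level setting as
stock 0.20-era output formats do):

    # Repeated comparison runs write at the Hadoop default count of three;
    # the earlier NetApp-enabled runs used -Ddfs.replication=2 instead:
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen \
        -Ddfs.replication=3 10000000000 /benchmarks/tera/in-r3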
  Figure 7. Increasing Hadoop Cluster Efficiency with the “NetApp Effect”




  Table 3. Performance Efficiency Test Results: Data Loading with TeraGen
 Data nodes                                             8              16             24
 Hadoop data set size (TB)                              1               2              3
 Replication 2: Job completion time (hh:mm:ss)       00:10:06        00:09:52       00:10:18
 Replication 2: Aggregate throughput (MB/sec)          1,573           3,221          4,629
 Replication 3: Job completion time (hh:mm:ss)       00:15:32        00:16:11       00:16:44
 Replication 3: Aggregate throughput (MB/sec)          1,023           1,964          2,849





What the Numbers Mean
    •   As the number of data nodes grew and the volume of data generated increased linearly, the job completion
        time remained flat at approximately ten minutes (+/- 2%) with a NetApp-enabled replication count of two.
    •   Job completion time increased by 50% or more with the Hadoop default replication count of three due to
        the extra processing and network overhead associated with triple mirroring.
    •   The increase in cluster efficiency (the NetApp effect) not only reduced job completion times, but also
        increased aggregate throughput.
    •   As shown in Figure 7, the NetApp effect was magnified as the size of the cluster and the amount of network
        traffic increased. Note how the “Replication 2 with NetApp” line (green, circles) increases linearly compared
        to the “Replication 3” line (red, triangles), and how the gap between the two widens as the cluster grows
        due to the increase in network traffic.
    •   The NetApp effect resulted in a peak aggregate throughput improvement of 62.5% during the 24-node test
        (4.629 vs. 2.849 GB/sec).




Why This Matters
Data growth shows no signs of abating. As data accumulates, there is a corresponding burden on IT to maintain
acceptable levels of performance, whether that is measured by the speed with which an application responds, the
ability to aggregate and deliver data, or the ultimate business value of information. Management teams are
recognizing that their growing data stores bring massive, and largely untapped, potential to improve business
intelligence. At the same time, they also recognize the challenges that big data poses to existing analytics tools and
processes, as well as the impact data growth is having on the bottom line in the form of increased requirements for
storage capacity and compute power. It is for these reasons that IT managers are struggling to meet the conflicting
goals of keeping up with explosive data growth and lowering the cost of delivering data analytics services.
The default replication count for Hadoop is three. This is strongly recommended for data protection with Hadoop
configurations with internal disk drives. Replication is also needed for cluster self-healing. “Self-healing” is used to
describe Hadoop’s ability to ensure job completion in the event of task failure. It does this by reassigning failed
tasks to other nodes in the cluster. This is made possible by the replication of blocks throughout the cluster.
With the NetApp Open Solution for Hadoop, replication is not required for data protection since data is protected
with hardware RAID. As a result, a replication count of two is sufficient for self-healing. Hadoop MapReduce jobs
that write data to the HDFS, such as data ingest, benefit from the lower replication count: they generally run faster
and require less storage space than a Hadoop cluster with internal disk storage and a replication count of three.
During ESG Lab testing with a 24-node cluster, the NetApp effect reduced disk capacity requirements by 22% as it
increased aggregate data load performance by 62%. In other words, organizations can manage more data at a
lower cost with NetApp.






Recoverability
When a name node fails, a Hadoop administrator needs to recover the metadata and restart the Hadoop cluster
using a standby, secondary name node. This single point of failure is being addressed by the open source Hadoop
community, but a fix was not yet generally available when this report was published.
In a Hadoop server cluster with internal storage, when a disk drive fails, the entire data node is “blacklisted” and is
no longer available to execute tasks. This can result in degraded performance and the need for a Hadoop
administrator to take the data node offline, service and replace the failed component, and then redeploy it, a
process that can take several hours to complete.
NetApp Open Solution for Hadoop increases the availability and recoverability of a Hadoop cluster in three
significant ways:
    1. Recovery from a name node failure is accelerated dramatically using an NFS attached FAS2040 instead of
       internal storage on the primary and secondary name nodes. If and when a name node failure occurs, a
       quick recovery from an NFS attached FAS2040 can restore analytics services in minutes instead of hours.
    2. NetApp E2660s with hardware RAID provide transparent recovery from hard drive failures. The data node is
       not blacklisted, and any job tasks that were running continue uninterrupted.
    3. The NetApp E2660 management console (SANtricity) provides a centralized management GUI for
       monitoring and managing drive failures. This reduces the complexity associated with manually recovering
       from drive failures in a Hadoop cluster with internal drives.

ESG Lab Testing
A variety of errors were tested with a 24-data-node Hadoop cluster running a 3TB TeraSort job. As shown in Figure
8, errors were injected to validate that jobs continue to run after data node and E2660 hard drive failures, and that
the cluster can be quickly recovered after a name node failure. A dual drive failure was also tested to simulate and
measure job recovery time after an internal hard drive failure in a traditional Hadoop cluster.
  Figure 8. ESG Lab Error Injection Testing





Disk Drive Failure
To simulate a disk drive failure, a drive was taken offline while a Hadoop TeraSort job was running.4 The Hadoop job
tracker web interface was used to confirm that the job completed successfully. The NetApp E2660 SANtricity
management console was used to identify which drive had failed and monitor automatic recovery from a hot spare.
A SANtricity management console screenshot taken shortly after the drive had failed is shown in Figure 9.
     Figure 9. Transparent Recovery from a Hard Drive Failure with E2660 Hardware RAID




Another TeraSort job was started. While it was running, a lab manager physically replaced the failed hard drive.
The TeraSort job completed without error, as expected. Another TeraSort job was started and a dual drive error
was introduced to simulate and measure the job completion time after a traditional Hadoop hard drive failure in a
data node.5 As shown in Table 4, the TeraSort job took slightly longer (5.7% longer) to complete during the single
drive failure with the hardware RAID recovery of the NetApp E2660. The simulated internal drive failure took well
over twice as long (01:12:13 vs. 00:30:21) as the data node was blacklisted and job tasks were restarted on
surviving nodes.
      Table 4. Drive Failure Recovery Results
     Test Scenario                            Job Completion Time (hh:mm:ss)    Throughput (MB/sec)    Delta (vs. Healthy Cluster)
     Healthy cluster                                   00:30:21                        1,821                     N/A
     NetApp E2660 drive failure                        00:32:06                        1,486                    -5.7%
     Internal data node drive failure                  01:12:13                         660                    -237.9%

4
  Drive failures were introduced when the Hadoop job tracker indicated that the TeraSort job was 80% complete.
5
  In a Hadoop cluster using internal disk drives, a local file system is created on each disk. If a disk fails, that file system fails. A local disk
failure was simulated during ESG Lab testing by failing two disk drives in the same RAID 5 volume group. All data on that file system was lost
and all tasks running on that file system failed. The job tracker detected this and reassigned failed tasks to other nodes where copies of the
lost blocks exist. With the NetApp solution, a single disk drive has very little impact on running tasks, and all data in the local file system using
that LUN remains available as RAID reconstruct begins. With direct attached disks, if a single disk fails, a file system fails as described above.





The screenshot shown in Figure 10 captures the Hadoop job tracker status after the successful completion of the
TeraSort job that ran through the simulated internal hard drive failure. Note how the non-zero failed/killed counts
indicate the number of map and reduce tasks that were restarted on surviving nodes (439 and 5, respectively).
  Figure 10. Job Completion after a Simulated Internal Hard Drive Failure




The screenshot shown in Figure 11 summarizes the status of the Hadoop Distributed File System (HDFS) after the
data node with a simulated internal hard drive failure was blacklisted. These errors didn’t occur with the E2660
drive failure, as the Hadoop job ran uninterrupted.
  Figure 11. Hadoop Self-healing in Action: Cluster Summary after a Simulated Internal Drive Failure





Name Node Failure
The Hadoop name node server was halted while a TeraSort job was running with a goal of demonstrating how an
NFS attached NetApp FAS2040 can be used to quickly recover from the single point of failure when a name node
goes offline in a Hadoop cluster. As shown in Figure 12, the job failed as expected after 13 minutes and 23 seconds.
After the job failed, the procedure outlined in the NetApp Open Solution for Hadoop Solutions Guide6 was used to
copy the name node metadata to the secondary name node and start the name node daemon on that server.
     Figure 12. Job Failure after a Name Node Failure: NetApp FAS2040 Recovery Begins




Five minutes after getting started with the recovery process, the Hadoop cluster was up and running. An fsck of the
HDFS file system indicated that the cluster was healthy and a restarted TeraSort job completed without error.
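In outline, the recovery reduces to a handful of operations; the sketch below is hedged (the init script name and
metadata path are assumed from the CDH3 layout and the appendix, and the Solutions Guide remains the
authoritative procedure):

    # On the secondary name node: point dfs.name.dir at the NFS-protected
    # metadata copy (e.g., /mnt/fsimage_bkp on the FAS2040 mount), then
    # start the name node daemon:
    /etc/init.d/hadoop-0.20-namenode start
    # Verify HDFS health before resubmitting jobs:
    hadoop fsck /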


Why This Matters
A majority of respondents to a recent ESG survey indicated that three hours or less of data analytics platform
downtime would result in significant revenue loss or other adverse business impact. The single point of HDFS
failure in the open source Hadoop distribution that was generally available as of this writing can lead to three or
more hours of data analytics platform unavailability.
ESG Lab has confirmed that the NetApp Open Solution for Hadoop reduces name node recovery time from hours to
minutes (five minutes during ESG Lab testing). NetApp E2660s with hardware RAID dramatically improved
recoverability after simulated hard drive failures. The complexity and performance impact of a blacklisted data
node were avoided: a 3TB TeraSort analytics job run with NetApp completed more than twice as quickly as one run
through a simulated internal hard drive failure.




6
    http://media.netapp.com/documents/tr-3969.pdf





ESG Lab Validation Highlights
   •   The capacity and performance of the NetApp solution scaled linearly when data nodes and NetApp E2660
       storage arrays were added to a Hadoop cluster.
   •   ESG Lab tested up to 24 data nodes and six NetApp E2660 arrays with 720TB of raw disk capacity.
   •   Load performance testing with the TeraGen utility delivered linear performance scalability.
   •   A 24-node cluster sustained a high aggregate load throughput rate of 4.630 GB/sec.
   •   Big data analytics performance testing with the TeraSort utility yielded linear performance scalability as
       data nodes and E2660 arrays were added.
   •   Network-free hardware RAID and a lower Hadoop replication count reduced network overhead, which
       increased the aggregate performance of the cluster. A peak aggregate throughput improvement of 62.5%
       was recorded during the 24-node test (4.629 vs. 2.849 GB/sec).
   •   A MapReduce job running during a simulated internal drive failure took more than twice as long (225%) to
       complete as the same job run during the failure of a hardware RAID protected E2660 drive.
   •   An NFS-attached NetApp FAS2040 for name node metadata storage was used to recover from a primary
       name node failure in five minutes, compared to multiple hours in a traditional configuration.

Issues to Consider
   •   While the results demonstrate how the NetApp Open Solution for Hadoop is ideally suited to meet the
       extreme compute and storage performance needs of big data analytics loads and long-running queries,
       applications with lots of small files, multiple writers, or many users with low response time requirements
       may be better served by traditional relational databases and storage solutions.
   •   The single point of failure issue in the Hadoop distribution used during this ESG Lab Validation is being fixed
       in the open source community, but a fix was not yet available and therefore was not tested as part of ESG
       Lab’s assessment of the NetApp Open Solution for Hadoop. Even so, future releases of Hadoop that resolve
       the name node failure problem are still expected to rely on NFS shared storage as a functional requirement.
       NetApp, with its FAS family, is an industry leader in NFS shared storage.
   •   The test results presented in this report are based on benchmarks deployed in a controlled environment.
       Due to the many variables in each production data center environment, capacity planning and testing in
       your own environment are recommended.
   •   A growing number of best practices, tuning guidelines, and proof points are available for reference when
       planning, deploying, and tuning a NetApp Open Solution for Hadoop. To learn more, visit:
       http://www.netapp.com/hadoop.





The Bigger Truth
Whether measured by increased revenues, market share gains, reduced costs, or scientific breakthroughs, data
analytics has always played a key role in the ability to harness value from electronically-stored information. What
has changed recently is that, as more business processes have become automated, information that was once
stored in separate online and offline repositories and formats is now readily available for amalgamation and
analysis to increase business insight and enhance decision support. Business executives are asking more of their
data and are expecting faster and more impactful answers. The result is an ever-increasing priority on data analytics
activities and, subsequently, more pressure on existing business analyst and IT teams to deliver.
Hadoop is a powerful open source framework for data analytics. It’s an emerging and fast growing solution that’s
considered one of the most impactful technology innovations since HTML. While ESG research indicates that a small
number of organizations are using Hadoop at this time, interest in and plans for adoption over the next 12-18
months are high (48%).
For those new to Hadoop, there is a steep learning curve. Very few enterprise applications are built to run on
massively parallel clusters, so there is much to learn. The NetApp Open Solution for Hadoop is a tested and proven
reference architecture, built on enterprise-class storage, that reduces the risk and time associated with Hadoop
adoption.
NetApp has embraced the open source Hadoop model and is working with major distributors to support open
source Hadoop software running on industry standard servers. Instead of promoting the use of a proprietary
clustered file system, NetApp has embraced the use of the open source Hadoop file system (HDFS). Instead of
promoting the use of SAN or NAS attached storage, NetApp has embraced the use of direct attached storage. Using
SAS direct connected NetApp E2660 arrays with hardware protected RAID, the NetApp solution improves
performance, scalability, and availability compared to typical internal hard drive Hadoop deployments. Thanks to an
NFS attached NetApp FAS2040 for shared access to metadata, recovery from a Hadoop name node failure is
reduced from hours to minutes.
With up to 4.63 GB/sec of aggregate TeraGen load performance on a 24-node cluster, ESG Lab has confirmed that
the NetApp Open Solution for Hadoop provides excellent near-linear performance scalability that dwarfs the
capabilities of traditional disk arrays and databases. NetApp E2660s with network-free hardware RAID improved the
efficiency and performance of the cluster by 62.5% compared to a traditional Hadoop deployment with triple
mirroring. The value of transparent RAID recovery was obvious after drive failures were simulated: the performance
impact on a long running sort job was less than 6%, while a simulated internal drive failure that blacklisted a
Hadoop data node more than doubled the job’s run time.
If you’re looking to accelerate the delivery of insight to your business with an enterprise-class big data analytics
infrastructure, ESG Lab recommends a close look at the NetApp Open Solution for Hadoop—it reduces risk with a
storage solution that delivers reliability, fast deployment, and scalability of open source Hadoop for the enterprise.





Appendix
The configuration of the test bed that was used during the ESG Lab Validation is summarized in Table 5.
  Table 5. Configuration Summary
                                                      Servers
 HDFS data nodes                                             24 servers, each with quad core Intel Xeon CPU, 48GB RAM
 HDFS name node                                              1 server with quad core Intel Xeon CPU, 48GB RAM
 HDFS secondary name node                                    1 server, with quad core Intel Xeon CPU, 48GB RAM
 HDFS job tracker                                            1 server, with quad core Intel Xeon CPU, 48GB RAM
                                                      Network
 10 GbE host connect                                          One 10GbE connection for all data nodes, the name node, secondary name
                                                              node, and job tracker node
 10 GbE switched fabric                                       Cisco Nexus 5010, 10 GigE, Jumbo Frames (MTU=9000)
                                                      Storage
 HDFS data node storage                                      6 NetApp E2660 6Gb SAS host connect, 6+1 RAID-5, 2TB near line SAS 7.2K
                                                             RPM drives, 360 drives total, version 47.77.19.99
 HDFS name node storage                                      1 NetApp FAS2040, 1GbE NAS host connect, 6 disks, 1TB each, 7.2K RPM,
                                                             Data ONTAP 8.0.2 7 mode
 Operating system boot drives                                Local 1TB 7.2K RPM SATA drive in each node

                                                      Software
 Operating system                                            Red Hat Enterprise Linux version 5, update 6 (RHEL5.6)
 Analytics platform                                          Cloudera Hadoop (CDH3u2)

                                  HDFS Configuration Changes vs. Cloudera V3U2 Distribution
 Local file system                                           XFS
 Map/reduce tasks per data node                              7/5

Table 6 lists the differences between Hadoop core-site.xml defaults and the settings used during ESG Lab testing.
  Table 6. Hadoop core-site Settings
            Option Name                                        Purpose                                        Actual / Default
  fs.default.name                 Name of the default file system, specified as a URI (IP address     hdfs://10.61.189.64:8020/
                                  or hostname of the name node along with the port to be used).       [Default value: file:///]
  webinterface.private.actions    Enables or disables certain management functions within the         true / false
                                  Hadoop Web user interface, including the ability to kill jobs
                                  and modify job priorities.
  fs.inmemory.size.mb             Memory in MB to be used for merging map outputs during the          200 / 100
                                  reduce phase.
  io.file.buffer.size             Size in bytes of the read/write buffer.                             262144 / 4096
  topology.script.file.name       Script used to resolve a slave node’s name or IP address to a       /etc/hadoop/conf/topology_script
                                  rack ID; used to invoke Hadoop rack awareness. The default          [Default value is null]
                                  value is null and results in all slaves being given a rack ID
                                  of “/default-rack.”
  topology.script.number.args     Sets the maximum acceptable number of arguments to be sent          1 / 100
                                  to the topology script at one time.
  hadoop.tmp.dir                  Hadoop temporary directory storage.                                 /home/hdfs/tmp
                                                                                                      [Default value: /tmp/hadoop-${user.name}]





Table 7 lists the differences between Linux sysctl.conf defaults and the settings used during ESG Lab testing.
  Table 7. Linux sysctl.conf Settings
              Parameter                                Description                                         Actual / Default
 net.ipv4.ip_forward                     Controls IP packet forwarding.                                    0 / 0
 net.ipv4.conf.default.rp_filter         Controls source route verification.                               1 / 0
 net.ipv4.conf.default.accept_source_route   Do not accept source routing.                                 0 / 1
 kernel.sysrq                            Controls the system request debugging functionality of            0 / 1
                                         the kernel.
 kernel.core_uses_pid                    Controls whether core dumps will append the PID to the            1 / 0
                                         core filename. Useful for debugging multithreaded
                                         applications.
 kernel.msgmnb                           Controls the maximum size of a message, in bytes.                 65536 / 16384
 kernel.msgmax                           Controls the default maximum size of a message queue.             65536 / 8192
 kernel.shmmax                           Controls the maximum shared segment size, in bytes.               68719476736 / 33554432
 kernel.shmall                           Controls the maximum number of shared memory                      4294967296 / 2097512
                                         segments, in pages.
 net.core.rmem_default                   Sets the default OS receive buffer size.                          262144 / 129024
 net.core.rmem_max                       Sets the max OS receive buffer size.                              16777216 / 131071
 net.core.wmem_default                   Sets the default OS send buffer size.                             262144 / 129024
 net.core.wmem_max                       Sets the max OS send buffer size.                                 16777216 / 131071
 net.core.somaxconn                      Maximum number of sockets the kernel will serve at one            1000 / 128
                                         time; set on the name node, secondary name node, and
                                         job tracker.
 fs.file-max                             Sets the total number of file descriptors.                        6815744 / 4847448
 net.ipv4.tcp_timestamps                 Disables TCP time stamps if set to “0”.                           0 / 1
 net.ipv4.tcp_sack                       Enables select ACK for TCP.                                       1 / 1
 net.ipv4.tcp_window_scaling             Enables TCP window scaling.                                       1 / 1
 kernel.shmmni                           Sets the maximum number of shared memory segments.                4096 / 4096
 kernel.sem                              Sets the maximum number and size of semaphore sets                250 32000 100 128 /
                                         that can be allocated.                                            250 32000 32 128
 fs.aio-max-nr                           Sets the maximum number of concurrent I/O requests.               1048576 / 65536
 net.ipv4.tcp_rmem                       Sets min, default, and max receive window size.                   4096 262144 16777216 /
                                                                                                           4096 87380 4194304
 net.ipv4.tcp_wmem                       Sets min, default, and max transmit window size.                  4096 262144 16777216 /
                                                                                                           4096 87380 4194304
 net.ipv4.tcp_syncookies                 Disables TCP syncookies if set to “0”.                            0 / 0
 sunrpc.tcp_slot_table_entries           Sets the maximum number of in-flight RPC requests                 128 / 16
                                         between a client and a server; set on the name node and
                                         secondary name node to improve NFS performance.
 vm.dirty_background_ratio               Maximum percentage of active system memory that can be            1 / 10
                                         used for dirty pages before dirty pages are flushed to
                                         storage.
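These kernel and network overrides are applied in the usual way; as a brief sketch (two representative entries
shown):

    # Append the overrides to /etc/sysctl.conf, then load them without a
    # reboot:
    echo "net.core.somaxconn = 1000" >> /etc/sysctl.conf
    echo "sunrpc.tcp_slot_table_entries = 128" >> /etc/sysctl.conf
    sysctl -p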





Table 8 lists the differences between Hadoop hdfs-site.xml defaults and the settings used during ESG Lab testing.
  Table 8. HDFS Site Settings
| Option Name | Purpose | Actual / Default |
|---|---|---|
| dfs.name.dir | Path on the local file system where the name node stores the namespace and transaction logs persistently. When this is a comma-delimited list of directories (as in this configuration), the name table is replicated in all of the directories for redundancy. Note: /mnt/fsimage_bkp is a location on NFS-mounted NetApp FAS storage where name node metadata is mirrored and protected, a key feature of NetApp's Hadoop solution. | /local/hdfs/namedir,/mnt/fsimage_bkp [Default: ${hadoop.tmp.dir}/dfs/name] |
| dfs.hosts | Specifies a list of machines authorized to join the Hadoop cluster as data nodes. | /etc/hadoop-0.20/conf/dfs_hosts [Default: null] |
| dfs.data.dir | Directory paths on the data node local file systems where HDFS data blocks are stored. | /disk1/data,/disk2/data [Default: ${hadoop.tmp.dir}/dfs/data] |
| fs.checkpoint.dir | Directory path where checkpoint images are stored (used by the secondary name node). | /home/hdfs/namesecondary1 [Default: ${hadoop.tmp.dir}/dfs/namesecondary] |
| dfs.replication | HDFS block replication count. The Hadoop default is 3; the NetApp Hadoop solution uses a replication count of 2. | 2 / 3 |
| dfs.block.size | HDFS data storage block size in bytes. | 134217728 (128MB) / 67108864 (64MB) |
| dfs.namenode.handler.count | Number of server threads for the name node. | 128 / 10 |
| dfs.datanode.handler.count | Number of server threads for the data node. | 64 / 3 |
| dfs.max-repl-streams | Maximum number of replications a data node is allowed to handle at one time. | 8 / 2 |
| dfs.datanode.max.xcievers | Maximum number of files a data node will serve at one time. | 4096 / 256 |
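To show how these settings land in the actual configuration file, the fragment below sketches an hdfs-site.xml containing a subset of the non-default values from Table 8. The property names and values come directly from the table; the surrounding XML structure is the standard Hadoop 0.20-era configuration syntax.

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml fragment (sketch): non-default values from Table 8 -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- Second path is on NFS-mounted FAS storage, mirroring name node metadata -->
    <value>/local/hdfs/namedir,/mnt/fsimage_bkp</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <!-- Hardware RAID on the E2660 allows 2 copies instead of Hadoop's default 3 -->
    <value>2</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128MB -->
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>
```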





Table 9 lists the differences between mapred-site.xml defaults and the settings used during ESG Lab testing.
  Table 9. mapred-site Settings
| Option Name | Purpose | Actual / Default |
|---|---|---|
| mapred.job.tracker | Job tracker address as a URL (job tracker IP address or hostname with port number). | 10.61.189.66:9001 [Default: local] |
| mapred.local.dir | Comma-separated list of local file system paths where temporary MapReduce data is written. | /disk1/mapred/local,/disk2/mapred/local [Default: ${hadoop.tmp.dir}/mapred/local] |
| mapred.hosts | Specifies the file containing the list of nodes allowed to join the Hadoop cluster as task trackers. | /etc/hadoop-0.20/conf/mapred.hosts [Default: null] |
| mapred.system.dir | Path in HDFS where the MapReduce framework stores control files. | /mapred/system [Default: ${hadoop.tmp.dir}/mapred/system] |
| mapred.reduce.tasks.speculative.execution | Enables the job tracker to detect slow-running reduce tasks, run them in parallel on other nodes, use the first available results, and kill the slower-running reduce tasks. | false / true |
| mapred.map.tasks.speculative.execution | Enables the job tracker to detect slow-running map tasks, run them in parallel on other nodes, use the first available results, and kill the slower-running map tasks. | false / true |
| mapred.tasktracker.reduce.tasks.maximum | Maximum number of reduce tasks that can run simultaneously on a single task tracker node. | 5 / 2 |
| mapred.tasktracker.map.tasks.maximum | Maximum number of map tasks that can run simultaneously on a single task tracker node. | 7 / 2 |
| mapred.child.java.opts | Java options passed to the task tracker child processes (in this case, 1GB of heap memory for each individual JVM). | -Xmx1024m / -Xmx200m |
| io.sort.mb | Total amount of buffer memory allocated to each merge stream while sorting files on the mapper, in MB. | 340 / 100 |
| mapred.jobtracker.taskScheduler | Job tracker task scheduler to use (in this case, the FairScheduler). | org.apache.hadoop.mapred.FairScheduler [Default: org.apache.hadoop.mapred.JobQueueTaskScheduler] |
| io.sort.factor | Number of streams to merge at once while sorting files. | 100 / 10 |
| mapred.output.compress | Enables/disables MapReduce output file compression. | false / false |
| mapred.compress.map.output | Enables/disables map output compression. | false / false |
| mapred.output.compression.type | Sets the output compression type. | block / record |
| mapred.reduce.slowstart.completed.maps | Fraction of map tasks that should be complete before reducers are scheduled for the MapReduce job. | 0.05 / 0.05 |
| mapred.reduce.tasks | Total number of reduce tasks available for the entire cluster. | 40 for 8 data nodes, 80 for 16 data nodes, 120 for 24 data nodes [Default: 1] |
| mapred.map.tasks | Total number of map tasks available for the entire cluster. | 56 for 8 data nodes, 112 for 16 data nodes, 168 for 24 data nodes [Default: 2] |
| mapred.reduce.parallel.copies | Number of parallel threads used by reduce tasks to fetch outputs from map tasks. | 64 / 5 |
| mapred.inmem.merge.threshold | Number of map outputs in the reduce task tracker's memory at which map data is merged and spilled to disk. | 0 / 1000 |
| mapred.job.reduce.input.buffer.percent | Percent usage of the map outputs buffer at which the map output data is merged and spilled to disk. | 1 / 0 |
| mapred.job.tracker.handler.count | Number of job tracker server threads for handling RPCs from the task trackers. | 128 / 10 |
| tasktracker.http.threads | Number of task tracker worker threads for fetching intermediate map outputs for reducers. | 60 / 40 |
| mapred.job.reuse.jvm.num.tasks | Maximum number of tasks that can run in a single JVM for a job. A value of "-1" sets the number to "unlimited." | -1 / 1 |
| mapred.jobtracker.restart.recover | Enables job recovery after restart. | true / false |
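The same pattern applies to mapred-site.xml. The sketch below shows a subset of the non-default values from Table 9 as they would appear for the 24-data-node cluster; mapred.map.tasks and mapred.reduce.tasks were scaled with cluster size (168 and 120, respectively, at 24 data nodes). Note that enabling the FairScheduler in practice also requires its jar and allocation file, which are omitted here.

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml fragment (sketch): selected Table 9 values, 24-data-node cluster -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.61.189.66:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value> <!-- per task tracker node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>5</value> <!-- per task tracker node -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>168</value> <!-- cluster-wide, scaled with data node count -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>120</value> <!-- cluster-wide, scaled with data node count -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value> <!-- 1GB heap per child JVM -->
  </property>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
</configuration>
```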





Weitere ähnliche Inhalte

Was ist angesagt?

The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)theijes
 
Raleigh ISSA: "Optimize Your Data Protection Investment for Bottom Line Resul...
Raleigh ISSA: "Optimize Your Data Protection Investment for Bottom Line Resul...Raleigh ISSA: "Optimize Your Data Protection Investment for Bottom Line Resul...
Raleigh ISSA: "Optimize Your Data Protection Investment for Bottom Line Resul...Raleigh ISSA
 
Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Debraj GuhaThakurta
 
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Hortonworks
 
201505 Statistical Thinking course extract
201505 Statistical Thinking course extract201505 Statistical Thinking course extract
201505 Statistical Thinking course extractJefferson Lynch
 
Machine learning101 v1.2
Machine learning101 v1.2Machine learning101 v1.2
Machine learning101 v1.2CCG
 
Data Insight-Driven Project Delivery ACADIA 2017
Data Insight-Driven Project Delivery ACADIA 2017Data Insight-Driven Project Delivery ACADIA 2017
Data Insight-Driven Project Delivery ACADIA 2017gapariciojr
 
Yelp Data Set Challenge (What drives restaurant ratings?)
Yelp Data Set Challenge (What drives restaurant ratings?)Yelp Data Set Challenge (What drives restaurant ratings?)
Yelp Data Set Challenge (What drives restaurant ratings?)Prashanth Raj
 
Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...
Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...
Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...IOSRjournaljce
 
Ml in a Day Workshop 5/1
Ml in a Day Workshop 5/1Ml in a Day Workshop 5/1
Ml in a Day Workshop 5/1CCG
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Developing a framework for
Developing a framework forDeveloping a framework for
Developing a framework forcsandit
 
The Four Pillars of Analytics Technology Whitepaper
The Four Pillars of Analytics Technology WhitepaperThe Four Pillars of Analytics Technology Whitepaper
The Four Pillars of Analytics Technology WhitepaperEdgar Alejandro Villegas
 
Doing qualitative data analysis
Doing qualitative data analysisDoing qualitative data analysis
Doing qualitative data analysisIrene Torres
 
How to make your data scientists happy
How to make your data scientists happy How to make your data scientists happy
How to make your data scientists happy Hussain Sultan
 
Big data for cybersecurity - skilledfield slides - 25032021
Big data for cybersecurity - skilledfield slides - 25032021Big data for cybersecurity - skilledfield slides - 25032021
Big data for cybersecurity - skilledfield slides - 25032021Mouaz Alnouri
 
Creating the Foundations for the Internet of Things
Creating the Foundations for the Internet of ThingsCreating the Foundations for the Internet of Things
Creating the Foundations for the Internet of ThingsCapgemini
 

Was ist angesagt? (20)

The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)
 
Raleigh ISSA: "Optimize Your Data Protection Investment for Bottom Line Resul...
Raleigh ISSA: "Optimize Your Data Protection Investment for Bottom Line Resul...Raleigh ISSA: "Optimize Your Data Protection Investment for Bottom Line Resul...
Raleigh ISSA: "Optimize Your Data Protection Investment for Bottom Line Resul...
 
Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017
 
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
 
201505 Statistical Thinking course extract
201505 Statistical Thinking course extract201505 Statistical Thinking course extract
201505 Statistical Thinking course extract
 
Machine learning101 v1.2
Machine learning101 v1.2Machine learning101 v1.2
Machine learning101 v1.2
 
Data Insight-Driven Project Delivery ACADIA 2017
Data Insight-Driven Project Delivery ACADIA 2017Data Insight-Driven Project Delivery ACADIA 2017
Data Insight-Driven Project Delivery ACADIA 2017
 
Big Data - Harnessing a game changing asset
Big Data - Harnessing a game changing assetBig Data - Harnessing a game changing asset
Big Data - Harnessing a game changing asset
 
BI_StrategyDM2
BI_StrategyDM2BI_StrategyDM2
BI_StrategyDM2
 
10 ways SPSS
10 ways SPSS10 ways SPSS
10 ways SPSS
 
Yelp Data Set Challenge (What drives restaurant ratings?)
Yelp Data Set Challenge (What drives restaurant ratings?)Yelp Data Set Challenge (What drives restaurant ratings?)
Yelp Data Set Challenge (What drives restaurant ratings?)
 
Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...
Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...
Data Warehouse Development Standardization Framework (DWDSF): A Way to Handle...
 
Ml in a Day Workshop 5/1
Ml in a Day Workshop 5/1Ml in a Day Workshop 5/1
Ml in a Day Workshop 5/1
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Developing a framework for
Developing a framework forDeveloping a framework for
Developing a framework for
 
The Four Pillars of Analytics Technology Whitepaper
The Four Pillars of Analytics Technology WhitepaperThe Four Pillars of Analytics Technology Whitepaper
The Four Pillars of Analytics Technology Whitepaper
 
Doing qualitative data analysis
Doing qualitative data analysisDoing qualitative data analysis
Doing qualitative data analysis
 
How to make your data scientists happy
How to make your data scientists happy How to make your data scientists happy
How to make your data scientists happy
 
Big data for cybersecurity - skilledfield slides - 25032021
Big data for cybersecurity - skilledfield slides - 25032021Big data for cybersecurity - skilledfield slides - 25032021
Big data for cybersecurity - skilledfield slides - 25032021
 
Creating the Foundations for the Internet of Things
Creating the Foundations for the Internet of ThingsCreating the Foundations for the Internet of Things
Creating the Foundations for the Internet of Things
 

Ähnlich wie NetApp Open Solution for Hadoop

Building Smarter, Faster, and Scalable Data-Rich Application
Building Smarter, Faster, and Scalable Data-Rich ApplicationBuilding Smarter, Faster, and Scalable Data-Rich Application
Building Smarter, Faster, and Scalable Data-Rich ApplicationRobert Bira
 
Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirem...
Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirem...Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirem...
Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirem...EMC
 
Webinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data IntegrationWebinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data IntegrationSnapLogic
 
ESG Research Report Snapshot Big Data and Integrated Infrastructure Aug 2012
ESG Research Report Snapshot Big Data and Integrated Infrastructure Aug 2012ESG Research Report Snapshot Big Data and Integrated Infrastructure Aug 2012
ESG Research Report Snapshot Big Data and Integrated Infrastructure Aug 2012equinn1952
 
Accenture big-data
Accenture big-dataAccenture big-data
Accenture big-dataPlanimedia
 
Intel Big Data Analysis Peer Research Slideshare 2013
Intel Big Data Analysis Peer Research Slideshare 2013Intel Big Data Analysis Peer Research Slideshare 2013
Intel Big Data Analysis Peer Research Slideshare 2013Intel IT Center
 
Data Driven Engineering 2014
Data Driven Engineering 2014Data Driven Engineering 2014
Data Driven Engineering 2014Roger Barga
 
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdfChallenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdfvenkatakeerthi3
 
Big data-analytics-2013-peer-research-report
Big data-analytics-2013-peer-research-reportBig data-analytics-2013-peer-research-report
Big data-analytics-2013-peer-research-reportAravindharamanan S
 
Big data-analytics-2013-peer-research-report
Big data-analytics-2013-peer-research-reportBig data-analytics-2013-peer-research-report
Big data-analytics-2013-peer-research-reportAravindharamanan S
 
DETERMINING THE RISKY SOFTWARE PROJECTS USING ARTIFICIAL NEURAL NETWORKS
DETERMINING THE RISKY SOFTWARE PROJECTS USING ARTIFICIAL NEURAL NETWORKSDETERMINING THE RISKY SOFTWARE PROJECTS USING ARTIFICIAL NEURAL NETWORKS
DETERMINING THE RISKY SOFTWARE PROJECTS USING ARTIFICIAL NEURAL NETWORKSijseajournal
 
2016 Strata Conference New York - Vendor Briefings
2016 Strata Conference New York - Vendor Briefings2016 Strata Conference New York - Vendor Briefings
2016 Strata Conference New York - Vendor BriefingsDigital Enterprise Journal
 
Ventana Research Big Data Integration Benchmark Research Executive Report
Ventana Research Big Data Integration Benchmark Research Executive ReportVentana Research Big Data Integration Benchmark Research Executive Report
Ventana Research Big Data Integration Benchmark Research Executive ReportVentana Research
 
Analytics for actuaries cia
Analytics for actuaries ciaAnalytics for actuaries cia
Analytics for actuaries ciaKevin Pledge
 
Paradigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4
 
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...Kevin Pledge
 
Fine grained root cause and impact analysis with CDAP Lineage
Fine grained root cause and impact analysis with CDAP LineageFine grained root cause and impact analysis with CDAP Lineage
Fine grained root cause and impact analysis with CDAP LineageBig Data Aplications Meetup
 

Ähnlich wie NetApp Open Solution for Hadoop (20)

Building Smarter, Faster, and Scalable Data-Rich Application
Building Smarter, Faster, and Scalable Data-Rich ApplicationBuilding Smarter, Faster, and Scalable Data-Rich Application
Building Smarter, Faster, and Scalable Data-Rich Application
 
Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirem...
Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirem...Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirem...
Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirem...
 
Complete-SRS.doc
Complete-SRS.docComplete-SRS.doc
Complete-SRS.doc
 
Webinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data IntegrationWebinar: Attaining Excellence in Big Data Integration
Webinar: Attaining Excellence in Big Data Integration
 
ESG Research Report Snapshot Big Data and Integrated Infrastructure Aug 2012
ESG Research Report Snapshot Big Data and Integrated Infrastructure Aug 2012ESG Research Report Snapshot Big Data and Integrated Infrastructure Aug 2012
ESG Research Report Snapshot Big Data and Integrated Infrastructure Aug 2012
 
Hadoop Does Not Equal Big Data
Hadoop Does Not Equal Big Data Hadoop Does Not Equal Big Data
Hadoop Does Not Equal Big Data
 
Accenture big-data
Accenture big-dataAccenture big-data
Accenture big-data
 
Intel Big Data Analysis Peer Research Slideshare 2013
Intel Big Data Analysis Peer Research Slideshare 2013Intel Big Data Analysis Peer Research Slideshare 2013
Intel Big Data Analysis Peer Research Slideshare 2013
 
Data Driven Engineering 2014
Data Driven Engineering 2014Data Driven Engineering 2014
Data Driven Engineering 2014
 
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdfChallenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
 
Big data-analytics-2013-peer-research-report
Big data-analytics-2013-peer-research-reportBig data-analytics-2013-peer-research-report
Big data-analytics-2013-peer-research-report
 
Big data-analytics-2013-peer-research-report
Big data-analytics-2013-peer-research-reportBig data-analytics-2013-peer-research-report
Big data-analytics-2013-peer-research-report
 
DETERMINING THE RISKY SOFTWARE PROJECTS USING ARTIFICIAL NEURAL NETWORKS
DETERMINING THE RISKY SOFTWARE PROJECTS USING ARTIFICIAL NEURAL NETWORKSDETERMINING THE RISKY SOFTWARE PROJECTS USING ARTIFICIAL NEURAL NETWORKS
DETERMINING THE RISKY SOFTWARE PROJECTS USING ARTIFICIAL NEURAL NETWORKS
 
2016 Strata Conference New York - Vendor Briefings
2016 Strata Conference New York - Vendor Briefings2016 Strata Conference New York - Vendor Briefings
2016 Strata Conference New York - Vendor Briefings
 
Ventana Research Big Data Integration Benchmark Research Executive Report
Ventana Research Big Data Integration Benchmark Research Executive ReportVentana Research Big Data Integration Benchmark Research Executive Report
Ventana Research Big Data Integration Benchmark Research Executive Report
 
R180305120123
R180305120123R180305120123
R180305120123
 
Analytics for actuaries cia
Analytics for actuaries ciaAnalytics for actuaries cia
Analytics for actuaries cia
 
Paradigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the table
 
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
Advanced Business Analytics for Actuaries - Canadian Institute of Actuaries J...
 
Fine grained root cause and impact analysis with CDAP Lineage
Fine grained root cause and impact analysis with CDAP LineageFine grained root cause and impact analysis with CDAP Lineage
Fine grained root cause and impact analysis with CDAP Lineage
 

Mehr von NetApp

DevOps the NetApp Way: 10 Rules for Forming a DevOps Team
DevOps the NetApp Way: 10 Rules for Forming a DevOps TeamDevOps the NetApp Way: 10 Rules for Forming a DevOps Team
DevOps the NetApp Way: 10 Rules for Forming a DevOps TeamNetApp
 
10 Reasons to Choose NetApp for EUC/VDI
10 Reasons to Choose NetApp for EUC/VDI10 Reasons to Choose NetApp for EUC/VDI
10 Reasons to Choose NetApp for EUC/VDINetApp
 
Spot Lets NetApp Get the Most Out of the Cloud
Spot Lets NetApp Get the Most Out of the CloudSpot Lets NetApp Get the Most Out of the Cloud
Spot Lets NetApp Get the Most Out of the CloudNetApp
 
NetApp #WFH: COVID-19 Impact Report
NetApp #WFH: COVID-19 Impact ReportNetApp #WFH: COVID-19 Impact Report
NetApp #WFH: COVID-19 Impact ReportNetApp
 
4 Ways FlexPod Forms the Foundation for Cisco and NetApp Success
4 Ways FlexPod Forms the Foundation for Cisco and NetApp Success4 Ways FlexPod Forms the Foundation for Cisco and NetApp Success
4 Ways FlexPod Forms the Foundation for Cisco and NetApp SuccessNetApp
 
NetApp 2020 Predictions
NetApp 2020 Predictions NetApp 2020 Predictions
NetApp 2020 Predictions NetApp
 
NetApp 2020 Predictions
NetApp 2020 Predictions NetApp 2020 Predictions
NetApp 2020 Predictions NetApp
 
NetApp 2020 Predictions in Tech
NetApp 2020 Predictions in TechNetApp 2020 Predictions in Tech
NetApp 2020 Predictions in TechNetApp
 
Corporate IT at NetApp
Corporate IT at NetAppCorporate IT at NetApp
Corporate IT at NetAppNetApp
 
Modernize small and mid-sized enterprise data management with the AFF C190
Modernize small and mid-sized enterprise data management with the AFF C190Modernize small and mid-sized enterprise data management with the AFF C190
Modernize small and mid-sized enterprise data management with the AFF C190NetApp
 
Achieving Target State Architecture in NetApp IT
Achieving Target State Architecture in NetApp ITAchieving Target State Architecture in NetApp IT
Achieving Target State Architecture in NetApp ITNetApp
 
10 Reasons Why Your SAP Applications Belong on NetApp
10 Reasons Why Your SAP Applications Belong on NetApp10 Reasons Why Your SAP Applications Belong on NetApp
10 Reasons Why Your SAP Applications Belong on NetAppNetApp
 
Turbocharge Your Data with Intel Optane Technology and MAX Data
Turbocharge Your Data with Intel Optane Technology and MAX DataTurbocharge Your Data with Intel Optane Technology and MAX Data
Turbocharge Your Data with Intel Optane Technology and MAX DataNetApp
 
Redefining HCI: How to Go from Hyper Converged to Hybrid Cloud Infrastructure
Redefining HCI: How to Go from Hyper Converged to Hybrid Cloud InfrastructureRedefining HCI: How to Go from Hyper Converged to Hybrid Cloud Infrastructure
Redefining HCI: How to Go from Hyper Converged to Hybrid Cloud InfrastructureNetApp
 
Webinar: NetApp SaaS Backup
Webinar: NetApp SaaS BackupWebinar: NetApp SaaS Backup
Webinar: NetApp SaaS BackupNetApp
 
NetApp 2019 Perspectives
NetApp 2019 PerspectivesNetApp 2019 Perspectives
NetApp 2019 PerspectivesNetApp
 
Künstliche Intelligenz ist in deutschen Unter- nehmen Chefsache
Künstliche Intelligenz ist in deutschen Unter- nehmen ChefsacheKünstliche Intelligenz ist in deutschen Unter- nehmen Chefsache
Künstliche Intelligenz ist in deutschen Unter- nehmen ChefsacheNetApp
 
Iperconvergenza come migliora gli economics del tuo IT
Iperconvergenza come migliora gli economics del tuo ITIperconvergenza come migliora gli economics del tuo IT
Iperconvergenza come migliora gli economics del tuo ITNetApp
 
10 Good Reasons: NetApp for Artificial Intelligence / Deep Learning
10 Good Reasons: NetApp for Artificial Intelligence / Deep Learning10 Good Reasons: NetApp for Artificial Intelligence / Deep Learning
10 Good Reasons: NetApp for Artificial Intelligence / Deep LearningNetApp
 
NetApp IT’s Tiered Archive Approach for Active IQ
NetApp IT’s Tiered Archive Approach for Active IQNetApp IT’s Tiered Archive Approach for Active IQ
NetApp IT’s Tiered Archive Approach for Active IQNetApp
 

Mehr von NetApp (20)

DevOps the NetApp Way: 10 Rules for Forming a DevOps Team
DevOps the NetApp Way: 10 Rules for Forming a DevOps TeamDevOps the NetApp Way: 10 Rules for Forming a DevOps Team
DevOps the NetApp Way: 10 Rules for Forming a DevOps Team
 
10 Reasons to Choose NetApp for EUC/VDI
10 Reasons to Choose NetApp for EUC/VDI10 Reasons to Choose NetApp for EUC/VDI
10 Reasons to Choose NetApp for EUC/VDI
 
Spot Lets NetApp Get the Most Out of the Cloud
Spot Lets NetApp Get the Most Out of the CloudSpot Lets NetApp Get the Most Out of the Cloud
Spot Lets NetApp Get the Most Out of the Cloud
 
NetApp #WFH: COVID-19 Impact Report
NetApp #WFH: COVID-19 Impact ReportNetApp #WFH: COVID-19 Impact Report
NetApp #WFH: COVID-19 Impact Report
 
4 Ways FlexPod Forms the Foundation for Cisco and NetApp Success
4 Ways FlexPod Forms the Foundation for Cisco and NetApp Success4 Ways FlexPod Forms the Foundation for Cisco and NetApp Success
4 Ways FlexPod Forms the Foundation for Cisco and NetApp Success
 
NetApp 2020 Predictions
NetApp 2020 Predictions NetApp 2020 Predictions
NetApp 2020 Predictions
 
NetApp 2020 Predictions
NetApp 2020 Predictions NetApp 2020 Predictions
NetApp 2020 Predictions
 
NetApp 2020 Predictions in Tech
NetApp 2020 Predictions in TechNetApp 2020 Predictions in Tech
NetApp 2020 Predictions in Tech
 
Corporate IT at NetApp
Corporate IT at NetAppCorporate IT at NetApp
Corporate IT at NetApp
 
Modernize small and mid-sized enterprise data management with the AFF C190
Modernize small and mid-sized enterprise data management with the AFF C190Modernize small and mid-sized enterprise data management with the AFF C190
Modernize small and mid-sized enterprise data management with the AFF C190
 
Achieving Target State Architecture in NetApp IT
Achieving Target State Architecture in NetApp ITAchieving Target State Architecture in NetApp IT
Achieving Target State Architecture in NetApp IT
 
10 Reasons Why Your SAP Applications Belong on NetApp
10 Reasons Why Your SAP Applications Belong on NetApp10 Reasons Why Your SAP Applications Belong on NetApp
10 Reasons Why Your SAP Applications Belong on NetApp
 
Turbocharge Your Data with Intel Optane Technology and MAX Data
Turbocharge Your Data with Intel Optane Technology and MAX DataTurbocharge Your Data with Intel Optane Technology and MAX Data
Turbocharge Your Data with Intel Optane Technology and MAX Data
 
Redefining HCI: How to Go from Hyper Converged to Hybrid Cloud Infrastructure
Redefining HCI: How to Go from Hyper Converged to Hybrid Cloud InfrastructureRedefining HCI: How to Go from Hyper Converged to Hybrid Cloud Infrastructure
Redefining HCI: How to Go from Hyper Converged to Hybrid Cloud Infrastructure
 
Webinar: NetApp SaaS Backup
Webinar: NetApp SaaS BackupWebinar: NetApp SaaS Backup
Webinar: NetApp SaaS Backup
 
NetApp 2019 Perspectives
NetApp 2019 PerspectivesNetApp 2019 Perspectives
NetApp 2019 Perspectives
 
Künstliche Intelligenz ist in deutschen Unter- nehmen Chefsache
Künstliche Intelligenz ist in deutschen Unter- nehmen ChefsacheKünstliche Intelligenz ist in deutschen Unter- nehmen Chefsache
Künstliche Intelligenz ist in deutschen Unter- nehmen Chefsache
 
Iperconvergenza come migliora gli economics del tuo IT
Iperconvergenza come migliora gli economics del tuo ITIperconvergenza come migliora gli economics del tuo IT
Iperconvergenza come migliora gli economics del tuo IT
 
10 Good Reasons: NetApp for Artificial Intelligence / Deep Learning
10 Good Reasons: NetApp for Artificial Intelligence / Deep Learning10 Good Reasons: NetApp for Artificial Intelligence / Deep Learning
10 Good Reasons: NetApp for Artificial Intelligence / Deep Learning
 
NetApp IT’s Tiered Archive Approach for Active IQ
NetApp IT’s Tiered Archive Approach for Active IQNetApp IT’s Tiered Archive Approach for Active IQ
NetApp IT’s Tiered Archive Approach for Active IQ
 

Kürzlich hochgeladen

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Kürzlich hochgeladen (20)

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

NetApp Open Solution for Hadoop

  • 1. Lab Validation Report NetApp Open Solution for Hadoop Open Source Data Analytics with Enterprise-class Storage Services By Brian Garrett, VP, ESG Lab, & Julie Lockner, Sr Analyst & VP, Data Management May 2012 © 2012, Enterprise Strategy Group, Inc. All Rights Reserved.
  • 2. Lab Validation: NetApp Open Solution for Hadoop 2 Contents Introduction .................................................................................................................................................. 3 Background ............................................................................................................................................................... 3 NetApp Open Solution for Hadoop .......................................................................................................................... 5 ESG Lab Validation ........................................................................................................................................ 6 Getting Started ......................................................................................................................................................... 6 Performance and Scalability ..................................................................................................................................... 7 Efficiency................................................................................................................................................................. 10 Recoverability ......................................................................................................................................................... 12 ESG Lab Validation Highlights ..................................................................................................................... 16 Issues to Consider ....................................................................................................................................... 16 The Bigger Truth ......................................................................................................................................... 17 Appendix ..................................................................................................................................................... 18 ESG Lab Reports The goal of ESG Lab reports is to educate IT professionals about data center technology products for companies of all types and sizes. ESG Lab reports are not meant to replace the evaluation process that should be conducted before making purchasing decisions, but rather to provide insight into these emerging technologies. Our objective is to go over some of the more valuable feature/functions of products, show how they can be used to solve real customer problems and identify any areas needing improvement. ESG Lab's expert third-party perspective is based on our own hands-on testing as well as on interviews with customers who use these products in production environments. This ESG Lab report was sponsored by NetApp. All trademark names are property of their respective companies. Information contained in this publication has been obtained by sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of the Enterprise Strategy Group, Inc., is in violation of U.S. Copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. 
Should you have any questions, please contact ESG Client Relations at 508.482.0188. © 2012, Enterprise Strategy Group, Inc. All Rights Reserved.
  • 3. Lab Validation: NetApp Open Solution for Hadoop 3 Introduction This ESG Lab report presents the results of hands-on testing of the NetApp Open Solution for Hadoop, a highly reliable, ready to deploy, scalable storage solution for Enterprise Hadoop. Background Driven by unrelenting data volume growth, the need for real-time data processing and data analytics, and the increasing complexity and variety of data sources, ESG expects broad adoption of MapReduce data processing and analytics frameworks over the next two to five years. These frameworks require new approaches for storing, integrating and processing “big data.” ESG defines big data as any data set that exceeds the boundaries and sizes of traditional IT processing; big data sets can range from ten to hundreds of terabytes in size. Data analytics is a top IT priority for forward-looking IT organizations. In fact, a recent ESG survey indicates that more than half (54%) of enterprise organizations (i.e., 1,000 or more employees) consider data analytics to be a top-five IT priority and 38% plan on deploying a new data analytics solution in the next 12-18 months. A growing number of IT organizations are using the open source Apache Hadoop MapReduce framework as a foundation for their big data analytics initiatives. As shown in Figure 1, more than 50% of organizations polled by ESG are using Hadoop, planning to deploy Hadoop in the next 12 months, or considering Hadoop.1 Figure 1. Plans to Implement a MapReduce Framework such as Apache Hadoop What are your organization’s plans to implement a MapReduce framework (e.g., Apache Hadoop) to address data analytics challenges? (Percent of respondents, N=270) Already using, 8% Don’t know, 11% Plan to implement within 12 months, 13% No plans to implement, 33% No plans to implement at this time but interested, 35% Source: Enterprise Strategy Group, 2011. As with any exciting and emerging technology, big data analytics also has its challenges. Management is an issue because the platforms are expensive and require new server and storage purchases, integration with existing data sets and processes, training in new technologies, an analytics toolset, and people with expertise in dealing with it. When IT managers were asked about their data analytics challenges, 47% named data integration complexity, 34% cited a lack of skills necessary to properly manage large data sets and derive value from them, 29% said data set sizes limiting their ability to perform analytics, and 28% said difficulty in completing analytics within a reasonable period of time. 1 Source: ESG Research Report, The Impact of Big Data on Data Analytics, September 2011. © 2012, Enterprise Strategy Group, Inc. All Rights Reserved.
  • 4. Lab Validation: NetApp Open Solution for Hadoop 4 Looking beyond the high level organizational challenges associated with a big data analytics initiative, the Hadoop framework adds technology and implementation issues that need to be considered. The common reference architecture for a Hadoop cluster leverages commodity server nodes with internal hard drives; for conventional data centers with mature ITIL processes, this introduces two challenges. First, data protection is, by default, handled in the Hadoop software layer; every time a file is written to the Hadoop Distributed File System (HDFS), two additional copies are written in case of a disk drive or a data node failure. This not only impacts data ingest and throughput performance, but also reduces disk capacity utilization. Second, high availability is limited based on an existing single point of failure in the Hadoop metadata repository. This single point of failure will eventually be addressed by the Hadoop community, but, in the meantime, analytics downtime due to a name node failure is a key concern. As shown in Figure 2, a majority of ESG survey respondents (55%) indicate that three hours or less of data analytics platform downtime would result in a significant revenue loss or other adverse business impact.2 Figure 2. Data Analytics Downtime Tolerance Please indicate the amount of downtime your organization’s data analytics platforms can tolerate before your organization experiences significant revenue loss or other adverse business impact. (Percent of respondents, N=399) Don’t know, 6% None, 6% More than 3 days, 4% Less than 1 hour, 1 day to 3 days, 10% 21% 11 hours to 24 hours, 10% 4 hours to 10 hours, 1 hour to 3 hours, 18% 26% Source: Enterprise Strategy Group, 2011. NetApp, in collaboration with leading Hadoop distribution vendors, is working to develop reference architectures, best practices, and solutions that address these challenges while maximizing the speed, efficiency, and availability of open source Hadoop deployments. 2 Source: ESG Survey, The Convergence of Big Data Processing, Hadoop, and Integrated Infrastructure, December 2011. © 2012, Enterprise Strategy Group, Inc. All Rights Reserved.
  • 5. Lab Validation: NetApp Open Solution for Hadoop 5 NetApp Open Solution for Hadoop Hadoop is an open source and significant emerging technology for solving business problems around large volumes of mostly unstructured data that cannot be analyzed with traditional database tools. The NetApp Open Solution for Hadoop combines the power of the Hadoop framework with flexible storage, professional support, and services of NetApp and its partners to deliver higher Hadoop cluster availability and efficiency. Based on a reference architecture, it focuses on scaling Hadoop from its departmental origins to an enterprise infrastructure with independent compute and storage scaling, faster cluster ingest, and faster job completion under failure conditions. NetApp Open Solution for Hadoop extends the value of the open source Hadoop framework with enterprise-class storage and services. As shown in Figure 3, NetApp FAS2040 and E2660 storage replace traditional DAS internal hard drives within a Hadoop cluster. Compute and storage resources are decoupled with SAS attached NetApp E2660 arrays and the recoverability of a failed Hadoop name node is improved with a NFS attached FAS2040. The storage components are completely transparent to the Hadoop distribution and require no modification to the native, underlying Hadoop platform. Note that while the FAS2040 is used for this testing configuration, any other product in the FAS storage family can also be used. Figure 3. NetApp Open Solution for Hadoop The NetApp Open Solution for Hadoop includes: NetApp E2660s with hardware RAID and hot-swappable disks increases efficiency, performance, scalability, availability, serviceability, and manageability compared to a traditional Hadoop deployment with internal hard drives and replication at the application layer. With data being protected by hardware RAID, higher storage utilization rates can be achieved by reducing the default Hadoop replication count. A NetApp FAS2040 with shared NFS attached capacity accelerates recoverability after a primary name node failure, compared to a traditional Hadoop deployment with internal hard drives. A high speed 10 Gbps Ethernet network and direct attached 6 Gbps SAS-attached E2600s with network free hardware RAID increases the performance, scalability and efficiency of the Hadoop infrastructure. High capacity E2660 disk arrays and a building block design that decouples the compute and storage layers provide near-linear scalability that’s ideally suited for big data analytics applications with extreme compute and storage capacity requirements. A field-tested solution comprised of open source Apache Hadoop distribution and enterprise-class NetApp storage, with professional design services and support reduces risk and accelerates deployment. © 2012, Enterprise Strategy Group, Inc. All Rights Reserved.
ESG Lab Validation

ESG Lab performed hands-on evaluation and testing of the solution at a NetApp facility in Research Triangle Park, North Carolina. Testing was designed to demonstrate that the NetApp Open Solution for Hadoop can perform and scale in near-linear fashion as data volumes and load increase, can recover from both single and dual disk drive failures with no disruption to a running Hadoop job, and can quickly recover from a name node failure. The performance and scalability benefits of using network-free hardware RAID and a lower Hadoop replication count were evaluated as well. Testing was performed using open source software, workload generators, and monitoring tools.

Getting Started

A Hadoop cluster with one name node, one secondary name node, one job tracker node, and up to 24 data nodes was used during ESG Lab testing. Rack-mounted servers with quad-core Intel Xeon processors and 48GB of RAM were connected to six NetApp E2660s, with the name node and secondary name node also connected to a single NetApp FAS2040. Each NetApp E2660 was filled with 60 2TB 7,200 RPM NL-SAS drives for a total raw capacity of 720TB. A building block approach was used, with groups of four data nodes sharing an E2660 through 6 Gbps SAS connections. A 10 Gbps Ethernet network was used for the cluster interconnect; the NFS connections from the name and job tracker nodes to the FAS2040 used 1 Gbps Ethernet (see the Appendix). Cloudera Distribution for Hadoop software was installed over the Red Hat Enterprise Linux operating system on each of the nodes in the cluster.3

Figure 4. The ESG Lab Test Bed

3 Configuration details are listed in the Appendix.
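Before timed runs on a build-out like this, overall cluster health can be confirmed from the name node. The following is a minimal sketch using standard Hadoop 0.20-era administrative commands:

    # Confirm that all 24 data nodes have registered and report capacity.
    hadoop dfsadmin -report

    # Verify that HDFS is healthy before generating test data.
    hadoop fsck /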
Performance and Scalability

Hadoop uses a shared-nothing programming paradigm and a massively parallel clustered architecture to meet the extreme compute and capacity requirements of big data analytics applications. Rather than augmenting the performance and scalability of traditional database architectures, Hadoop brings the compute power to the data: the name node and job tracker handle distribution and orchestration while the data nodes do all of the analytical processing work.

HDFS is a distributed network file system used by the nodes in a Hadoop cluster. Software mirroring is the default data protection scheme within HDFS: for every block of data written into the file system, an additional two copies are written to other nodes, for a total of three copies. This is referred to as a replication count of three, and it is the default for most Hadoop implementations that rely on internal hard drive capacity. Software mirroring increases the processing load on data nodes and the utilization of the shared network between nodes. To put this into perspective, consider what happens when a 2TB data set is loaded into a Hadoop cluster with a default replication count of three: 2TB of application data results in 6TB of raw data being processed and moved over the network.

A NetApp E2660 with hardware RAID reduces the processing and network overhead associated with software mirroring, which increases the performance and scalability of a Hadoop cluster. With up to 15 high capacity, high performance disk drives (2TB, 7.2K NL-SAS) available to each data node, the performance of a Hadoop cluster is magnified compared to a traditional Hadoop cluster with internal SATA drives. A right-sized building block approach provides near-linear scalability as compute and storage capacity are added to a cluster.

ESG Lab Testing

ESG Lab performed a series of tests to measure the performance and scalability of a 24-data-node NetApp Open Solution for Hadoop cluster. (Note that the cluster actually contains 27 nodes: 24 data nodes, one name node, one secondary name node, and one job tracker node.) The TeraGen utility, included in the open source Hadoop distribution, was used to simulate the loading of a large analytic data set. Testing was performed with cluster sizes of 8, 16, and 24 data nodes and a Hadoop replication count of two. Testing began with the creation of a 1TB data set on an 8-data-node cluster; the test was repeated with a 2TB data set on a 16-data-node cluster and a 3TB data set on a 24-data-node cluster. The results are presented in Figure 5 and Table 1.

Figure 5. Data Loading Performance Analysis
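For reference, TeraGen jobs of the sort plotted in Figure 5 are launched from the Hadoop examples jar. The sketch below assumes a CDH3-style jar path and an illustrative HDFS output directory; TeraGen takes a count of 100-byte rows, so ten billion rows approximates a 1TB data set:

    # Generate ~1TB of synthetic data (10,000,000,000 x 100-byte rows).
    # The jar path and output directory are illustrative.
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen \
        10000000000 /benchmarks/teragen-1tb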
Table 1. Performance Scalability Test Results: Data Loading with the TeraGen Utility

                                        8 data nodes   16 data nodes   24 data nodes
  NetApp E2660 arrays                   2              4               6
  NetApp E2660 drives                   120            240             360
  Usable capacity (TB)                  180            360             720
  Hadoop data set size (TB)             1              2               3
  Job completion time (hh:mm:ss)        00:10:06       00:09:52        00:10:18
  Aggregate throughput (MB/sec)         1,574          3,222           4,630

What the Numbers Mean

The NetApp Solution for Hadoop was designed to scale performance in near-linear fashion as data nodes and E2660 disk arrays are added to the cluster. This modular building block approach can also be used to provide consistent levels of performance as a data set grows. The job completion time for each of the TeraGen runs was recorded as the amount of data generated, the number of data nodes, and the number of E2660 arrays were increased linearly. In this example, the solution scaled up to 24 data nodes and six E2660 arrays with a total of 360 drives and 720TB of usable disk capacity. As the number of data nodes and the volume of data generated increased linearly, the completion time remained flat at approximately 10 minutes (+/- 3%). This demonstrates the near-linear performance scalability of the NetApp Solution for Hadoop. A job completion time of ten minutes for the creation of a 3TB data set indicates that the 24-node NetApp solution sustained a high aggregate throughput rate of 4.630 GB/sec. An aggregate data creation rate of 4.630 GB/sec can be used to create 16.7TB of data per hour.

Performance testing continued with a similar series of tests designed to measure the scalability of the solution when processing long-running data analytics jobs. The open source TeraSort utility included in the Hadoop distribution was used during this phase of testing. Using the data created with TeraGen, TeraSort was tested with cluster sizes of 8, 16, and 24 data nodes, a map count of seven, and a reducer count of five per data node. Testing began with a sort of the 1TB data set on an 8-data-node cluster; the test was repeated with a 2TB data set on a 16-data-node cluster and a 3TB data set on a 24-data-node cluster. The elapsed job run time was recorded after each test, and each test began with a freshly created TeraGen data source. The results are presented in Table 2 and Figure 6.

Table 2. Performance Scalability Test Results: Data Analytics with the TeraSort Utility

                                        8 data nodes   16 data nodes   24 data nodes
  Hadoop data set size (TB)             1              2               3
  Job completion time (hh:mm:ss)        00:29:19       00:30:19        00:30:21
  Aggregate throughput (MB/sec)         542            1,049           1,571
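For reference, the TeraSort runs summarized in Table 2 follow the same pattern, sorting the TeraGen output into a new HDFS directory. A minimal sketch with illustrative paths; the reduce task count shown matches the 120 reducers listed for the 24-data-node cluster in the Appendix:

    # Sort the generated data set; the reducer count follows the Appendix
    # setting for a 24-data-node cluster (mapred.reduce.tasks = 120).
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort \
        -D mapred.reduce.tasks=120 \
        /benchmarks/teragen-1tb /benchmarks/terasort-out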
Figure 6. Data Analytics Performance Analysis

What the Numbers Mean

The job completion time for each of the TeraSort runs was recorded as the amount of data sorted, the number of data nodes, and the number of E2660 arrays were increased linearly. As the number of data nodes grew and the volume of data increased linearly, job completion time remained flat at approximately 30 minutes (+/- 2%). As shown in Figure 6, aggregate analytics throughput scaled linearly as data nodes and E2660 arrays were added to the cluster.

Why This Matters

A growing number of organizations are deploying big data analytics platforms to improve the efficiency and profitability of their businesses. ESG research indicates that data analytics and managing data growth are among the top five IT priorities in more than 50% of organizations. When asked about their data analytics challenges, 29% of respondents said data set sizes are limiting their ability to perform analytics, and 28% reported difficulty in completing analytics within a reasonable period of time.

The NetApp Open Solution for Hadoop combines the compute scalability of a shared-nothing Hadoop cluster with the storage efficiency and scalability of network-free hardware RAID. Because the solution is designed to run with a Hadoop replication setting lower than the default, and because it standardizes on a 10GbE network, there is less chance of a network bottleneck as data volumes grow than in a traditional Hadoop deployment. ESG Lab confirmed that NetApp has created a big data analytics solution with near-linear performance scalability that dwarfs the capabilities of traditional databases and disk arrays: testing with a 24-node cluster and a 3TB data set scaled up to 4.63 GB/sec of aggregate load throughput and 1.57 GB/sec of aggregate analytics throughput.
Efficiency

The NetApp Open Solution for Hadoop improves capacity and performance efficiency compared to a traditional Hadoop deployment. With protection from disk failures provided by NetApp E2660s with hardware RAID, the Hadoop default replication setting of three can be reduced to two. NetApp E2660s with network-free hardware RAID 5 (6+1) and a Hadoop replication count of two increase storage capacity utilization by 22% compared to a Hadoop cluster with internal drives and a default replication count of three. Network-free hardware RAID also increases the performance and scalability of the cluster by reducing the amount of mirrored data flowing over the network.

ESG Lab Testing

The TeraGen tests were repeated with the Hadoop default replication count of three as the size of the cluster was increased from eight to 24 data nodes, and the elapsed job times were compared with those collected earlier with a replication count of two. The results are summarized in Figure 7 and Table 3.

Figure 7. Increasing Hadoop Cluster Efficiency with the "NetApp Effect"

Table 3. Performance Efficiency Test Results: Data Loading with TeraGen

                                        8 data nodes   16 data nodes   24 data nodes
  Hadoop data set size (TB)             1              2               3
  Replication count 2:
    Job completion time (hh:mm:ss)      00:10:06       00:09:52        00:10:18
    Aggregate throughput (MB/sec)       1,573          3,221           4,629
  Replication count 3:
    Job completion time (hh:mm:ss)      00:15:32       00:16:11        00:16:44
    Aggregate throughput (MB/sec)       1,023          1,964           2,849
What the Numbers Mean

As the number of data nodes grew and the volume of data generated increased linearly, job completion time remained flat at approximately ten minutes (+/- 2%) with the NetApp-enabled replication count of two. Job completion time increased by 50% or more with the Hadoop default replication count of three due to the extra processing and network overhead associated with triple mirroring. The increase in cluster efficiency (the "NetApp effect") not only reduced job completion times, but also increased aggregate throughput. As shown in Figure 7, the NetApp effect was magnified as the size of the cluster and the amount of network traffic increased. Note how the "Replication 2 with NetApp" line (green, circles) scales linearly compared to the "Replication 3" line (red, triangles), and how the gap between the two widens as the cluster grows and network traffic increases. The NetApp effect resulted in a peak aggregate throughput improvement of 62.5% during the 24-node test (4.629 vs. 2.849 GB/sec).

Why This Matters

Data growth shows no signs of abating. As data accumulates, there is a corresponding burden on IT to maintain acceptable levels of performance, whether that is measured by the speed with which an application responds, the ability to aggregate and deliver data, or the ultimate business value of information. Management teams are recognizing that their growing data stores bring massive, and largely untapped, potential to improve business intelligence. At the same time, they also recognize the challenges that big data poses to existing analytics tools and processes, as well as the impact data growth is having on the bottom line in the form of increased requirements for storage capacity and compute power. It is for these reasons that IT managers are struggling to meet the conflicting goals of keeping up with explosive data growth and lowering the cost of delivering data analytics services.

The default replication count for Hadoop is three, and it is strongly recommended for data protection in Hadoop configurations with internal disk drives. Replication is also needed for cluster self-healing: "self-healing" describes Hadoop's ability to ensure job completion in the event of task failure by reassigning failed tasks to other nodes in the cluster, which is made possible by the replication of blocks throughout the cluster. With the NetApp Open Solution for Hadoop, replication is not required for data protection since data is protected with hardware RAID; as a result, a replication count of two is sufficient for self-healing. Hadoop MapReduce jobs that write data to HDFS, such as data ingest, benefit from the lower replication count: they generally run faster and require less storage space than on a Hadoop cluster with internal disk storage and a replication count of three. During ESG Lab testing with a 24-node cluster, the NetApp effect reduced disk capacity requirements by 22% while increasing aggregate data load performance by 62%. In other words, organizations can manage more data at a lower cost with NetApp. (A brief sketch of adjusting an existing data set's replication count appears at the end of this section.)
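As noted above, the cluster-wide replication count is set through dfs.replication in hdfs-site.xml (see the Appendix). For illustration, an existing data set can also be re-replicated in place with standard HDFS shell commands; the path below is illustrative:

    # Re-replicate an existing directory tree down to a replication count
    # of 2 and wait for the change to complete.
    hadoop fs -setrep -R -w 2 /benchmarks

    # The second column of a file listing shows each file's replication count.
    hadoop fs -ls /benchmarks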
Recoverability

When a name node fails, a Hadoop administrator needs to recover the metadata and restart the Hadoop cluster using a standby, secondary name node. In a Hadoop server cluster with internal storage, when a disk drive fails, the entire data node is "blacklisted" and no longer available to execute tasks. This can result in degraded performance and the need for a Hadoop administrator to take the data node offline, service and replace the failed component, and then redeploy it; the process can take several hours to complete. The name node single point of failure is being addressed by the open source Hadoop community, but a fix was not yet generally available when this report was published.

The NetApp Open Solution for Hadoop increases the availability and recoverability of a Hadoop cluster in three significant ways:

1. Recovery from a name node failure is accelerated dramatically by using an NFS-attached FAS2040 instead of internal storage on the primary and secondary name nodes. If and when a name node failure occurs, a quick recovery from the NFS-attached FAS2040 can restore analytics services in minutes instead of hours.

2. NetApp E2660s with hardware RAID provide transparent recovery from hard drive failures. The data node is not blacklisted, and any job tasks that were running continue uninterrupted.

3. The NetApp E2660 management console (SANtricity) provides a centralized management GUI for monitoring and managing drive failures. This reduces the complexity associated with manually recovering from drive failures in a Hadoop cluster with internal drives.

ESG Lab Testing

A variety of errors were tested with a 24-data-node Hadoop cluster running a 3TB TeraSort job. As shown in Figure 8, errors were injected to validate that jobs continue to run after data node and E2660 hard drive failures, and that the cluster can be quickly recovered after a name node failure. A dual drive failure was also tested to simulate and measure job recovery time after an internal hard drive failure in a traditional Hadoop cluster.

Figure 8. ESG Lab Error Injection Testing
Disk Drive Failure

To simulate a disk drive failure, a drive was taken offline while a Hadoop TeraSort job was running.4 The Hadoop job tracker web interface was used to confirm that the job completed successfully, and the NetApp E2660 SANtricity management console was used to identify which drive had failed and to monitor automatic recovery from a hot spare. A SANtricity management console screenshot taken shortly after the drive failed is shown in Figure 9.

Figure 9. Transparent Recovery from a Hard Drive Failure with E2660 Hardware RAID

Another TeraSort job was started and, while it was running, a lab manager physically replaced the failed hard drive. The TeraSort job completed without error, as expected. A third TeraSort job was then started and a dual drive error was introduced to simulate and measure job completion time after a traditional Hadoop internal hard drive failure in a data node.5 As shown in Table 4, the TeraSort job took slightly longer (5.7% longer) to complete during the single drive failure with hardware RAID recovery on the NetApp E2660. The simulated internal drive failure took more than twice as long (237.9% of the healthy-cluster run time) as the data node was blacklisted and job tasks were restarted on surviving nodes.

Table 4. Drive Failure Recovery Results

  Test scenario                      Job completion time (hh:mm:ss)   Throughput (MB/sec)   Delta (vs. healthy cluster)
  Healthy cluster                    00:30:21                         1,821                 N/A
  NetApp E2660 drive failure         00:32:06                         1,486                 -5.7%
  Internal data node drive failure   01:12:13                         660                   -237.9%

4 Drive failures were introduced when the Hadoop job tracker indicated that the TeraSort job was 80% complete.
5 In a Hadoop cluster using internal disk drives, a local file system is created on each disk; if a disk fails, that file system fails. A local disk failure was simulated during ESG Lab testing by failing two disk drives in the same RAID 5 volume group. All data on that file system was lost and all tasks running against it failed. The job tracker detected this and reassigned the failed tasks to other nodes holding copies of the lost blocks. With the NetApp solution, a single disk drive failure has very little impact on running tasks, and all data in the local file system using that LUN remains available as the RAID reconstruct begins. With direct attached disks, if a single disk fails, a file system fails as described above.
The screenshot in Figure 10 shows the Hadoop job tracker status after the successful completion of the TeraSort job that ran through the simulated internal hard drive failure. Note how the non-zero failed/killed counts indicate the number of map and reduce tasks that were restarted on surviving nodes (439 and 5, respectively).

Figure 10. Job Completion after a Simulated Internal Hard Drive Failure

The screenshot in Figure 11 summarizes the status of the Hadoop Distributed File System (HDFS) after the data node with the simulated internal hard drive failure was blacklisted. These errors did not occur with the E2660 drive failure, as that Hadoop job ran uninterrupted.

Figure 11. Hadoop Self-healing in Action: Cluster Summary after a Simulated Internal Drive Failure
Name Node Failure

The Hadoop name node server was halted while a TeraSort job was running, with the goal of demonstrating how an NFS-attached NetApp FAS2040 can be used to recover quickly when a name node goes offline in a Hadoop cluster. As shown in Figure 12, the job failed as expected after 13 minutes and 23 seconds. After the job failed, the procedure outlined in the NetApp Open Solution for Hadoop Solutions Guide6 was used to copy the name node metadata to the secondary name node and start the name node daemon on the secondary name node server.

Figure 12. Job Failure after a Name Node Failure: NetApp FAS2040 Recovery Begins

Five minutes after the recovery process began, the Hadoop cluster was up and running. An fsck of the HDFS file system indicated that the cluster was healthy, and a restarted TeraSort job completed without error.

Why This Matters

A majority of respondents to a recent ESG survey indicated that three hours or less of data analytics platform downtime would result in significant revenue loss or other adverse business impact. The single point of HDFS failure in the open source Hadoop distribution that was generally available as of this writing can lead to three or more hours of data analytics platform unavailability. ESG Lab has confirmed that the NetApp Open Solution for Hadoop reduces name node recovery time from hours to minutes (five minutes during ESG Lab testing). NetApp E2660s with hardware RAID dramatically improved recoverability after simulated hard drive failures: the complexity and performance impact of a blacklisted data node were avoided, and a 3TB TeraSort analytics job with NetApp completed more than twice as quickly as the same job run through a simulated internal hard drive failure.

6 http://media.netapp.com/documents/tr-3969.pdf
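For illustration only, the recovery flow described above has roughly the following shape. This is a sketch, not the documented procedure: the NFS mirror path and local name directory are taken from the Appendix settings, while the init script name assumes CDH3-style packaging.

    # On the secondary name node: restore name node metadata from the
    # NFS-mounted FAS2040 mirror (/mnt/fsimage_bkp, per the Appendix)
    # into the local name directory.
    cp -a /mnt/fsimage_bkp/* /local/hdfs/namedir/

    # Start the name node daemon on this host (assumed CDH3 service name).
    /etc/init.d/hadoop-0.20-namenode start

    # Verify HDFS health before resubmitting jobs.
    hadoop fsck /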
ESG Lab Validation Highlights

• The capacity and performance of the NetApp solution scaled linearly when data nodes and NetApp E2660 storage arrays were added to a Hadoop cluster. ESG Lab tested up to 24 data nodes and six NetApp E2660 arrays with 720TB of usable disk capacity.
• Load performance testing with the TeraGen utility delivered linear performance scalability; a 24-node cluster sustained a high aggregate load throughput rate of 4.630 GB/sec.
• Big data analytics performance testing with the TeraSort utility yielded linear performance scalability as data nodes and E2660 arrays were added.
• Network-free hardware RAID and a lower Hadoop replication count reduced network overhead, which increased the aggregate performance of the cluster. A peak aggregate throughput improvement of 62.5% was recorded during the 24-node test (4.629 vs. 2.849 GB/sec).
• A MapReduce job running during a simulated internal drive failure took more than twice as long (225%) to complete as the same job during the failure of a hardware RAID protected E2660 drive.
• An NFS-attached NetApp FAS2040 for name node metadata storage was used to recover from a primary name node failure in five minutes, compared to multiple hours in a traditional configuration.

Issues to Consider

• While the results demonstrate that the NetApp Open Solution for Hadoop is ideally suited to the extreme compute and storage performance needs of big data analytics loads and long-running queries, applications with many small files, multiple writers, or many users with low response time requirements may be better served by traditional relational databases and storage solutions.
• The single point of failure in the Hadoop distribution used during this ESG Lab Validation is being fixed in the open source community, but a fix was not yet available and therefore was not tested as part of ESG Lab's assessment of the NetApp Open Solution for Hadoop. Even so, future releases of Hadoop that resolve the name node failure problem are still expected to rely on NFS shared storage as a functional requirement, and NetApp, with its FAS family, is an industry leader in NFS shared storage.
• The test results presented in this report are based on benchmarks run in a controlled environment. Due to the many variables in each production data center environment, capacity planning and testing in your own environment are recommended.
• A growing number of best practices, tuning guidelines, and proof points are available for reference when planning, deploying, and tuning the NetApp Open Solution for Hadoop. To learn more, visit: http://www.netapp.com/hadoop.
The Bigger Truth

Whether measured by increased revenues, market share gains, reduced costs, or scientific breakthroughs, data analytics has always played a key role in the ability to harness value from electronically stored information. What has changed recently is that, as more business processes have become automated, information that was once stored in separate online and offline repositories and formats is now readily available for amalgamation and analysis to increase business insight and enhance decision support. Business executives are asking more of their data and are expecting faster and more impactful answers. The result is an ever-increasing priority on data analytics activities and, subsequently, more pressure on existing business analyst and IT teams to deliver.

Hadoop is a powerful open source framework for data analytics. It is an emerging and fast growing solution that is considered one of the most impactful technology innovations since HTML. While ESG research indicates that only a small number of organizations are using Hadoop at this time, interest in and plans for adoption over the next 12 to 18 months are high (48%). For those new to Hadoop, there is a steep learning curve; very few enterprise applications are built to run on massively parallel clusters, so there is much to learn.

The NetApp Open Solution for Hadoop is a tested and proven reference architecture that reduces the risk and time associated with Hadoop adoption. NetApp has embraced the open source Hadoop model and is working with major distributors to support open source Hadoop software running on industry standard servers. Instead of promoting the use of a proprietary clustered file system, NetApp has embraced the open source Hadoop file system (HDFS). Instead of promoting the use of SAN or NAS attached storage for data nodes, NetApp has embraced the use of direct attached storage. Using SAS direct-connected NetApp E2660 arrays with hardware protected RAID, the NetApp solution improves performance, scalability, and availability compared to typical internal hard drive Hadoop deployments. Thanks to an NFS-attached NetApp FAS2040 for shared access to metadata, recovery from a Hadoop name node failure is reduced from hours to minutes.

With nearly 5 GB/sec of aggregate TeraGen load performance on a 24-node cluster, ESG Lab has confirmed that the NetApp Solution for Hadoop provides excellent near-linear performance scalability that dwarfs the capabilities of traditional disk arrays and databases. NetApp E2660s with network-free hardware RAID improved the efficiency and performance of the cluster by 62.5% compared to a traditional Hadoop deployment with triple mirroring. The value of transparent RAID recovery was obvious after drive failures were simulated: the performance impact on a long-running sort job was less than 6%, compared to more than 200% for a simulated internal drive failure that blacklisted a Hadoop data node.

If you are looking to accelerate the delivery of insight to your business with an enterprise-class big data analytics infrastructure, ESG Lab recommends a close look at the NetApp Open Solution for Hadoop. It reduces risk with a storage solution that delivers reliability, fast deployment, and scalability of open source Hadoop for the enterprise.
Appendix

The configuration of the test bed that was used during the ESG Lab Validation is summarized in Table 5.

Table 5. Configuration Summary

Servers:
  HDFS data nodes                 24 servers, each with quad-core Intel Xeon CPU, 48GB RAM
  HDFS name node                  1 server with quad-core Intel Xeon CPU, 48GB RAM
  HDFS secondary name node        1 server with quad-core Intel Xeon CPU, 48GB RAM
  HDFS job tracker                1 server with quad-core Intel Xeon CPU, 48GB RAM

Network:
  10 GbE host connect             One 10GbE LAN connection for all data nodes, the name node, the secondary name node, and the job tracker
  10 GbE switched fabric          Cisco Nexus 5010, 10 GigE, jumbo frames (MTU=9000)

Storage:
  HDFS data node storage          6 NetApp E2660s: 6Gb SAS host connect, 6+1 RAID 5, 2TB near-line SAS 7.2K RPM drives, 360 drives total, version 47.77.19.99
  HDFS name node storage          1 NetApp FAS2040: 1GbE NAS host connect, 6 disks, 1TB each, 7.2K RPM, Data ONTAP 8.0.2 (7-mode)
  Operating system boot drives    Local 1TB 7.2K RPM SATA drive in each node

Software:
  Operating system                Red Hat Enterprise Linux version 5, update 6 (RHEL 5.6)
  Analytics platform              Cloudera Hadoop (CDH3u2)

HDFS configuration changes vs. the Cloudera CDH3u2 distribution:
  Local file system               XFS
  Map/reduce tasks per data node  7/5

Table 6 lists the differences between Hadoop core-site.xml defaults and the settings used during ESG Lab testing.

Table 6. Hadoop core-site Settings

  fs.default.name: Name of the default file system, specified as a URI (IP address or hostname of the name node along with the port to be used).
      Actual: hdfs://10.61.189.64:8020/   Default: file:///
  webinterface.private.actions: Enables or disables certain management functions within the Hadoop web user interface, including the ability to kill jobs and modify job priorities.
      Actual: true   Default: false
  fs.inmemory.size.mb: Memory in MB to be used for merging map outputs during the reduce phase.
      Actual: 200   Default: 100
  io.file.buffer.size: Size in bytes of the read/write buffer.
      Actual: 262144   Default: 4096
  topology.script.file.name: Script used to resolve a slave node's name or IP address to a rack ID; used to invoke Hadoop rack awareness.
      Actual: /etc/hadoop/conf/topology_script   Default: null (all slaves are assigned the rack ID "/default-rack")
  topology.script.number.args: Maximum acceptable number of arguments to be sent to the topology script at one time.
      Actual: 1   Default: 100
  hadoop.tmp.dir: Hadoop temporary directory storage.
      Actual: /home/hdfs/tmp   Default: /tmp/hadoop-${user.name}
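Rendered as a configuration file, the overrides in Table 6 would look roughly like the following fragment, placed inside the <configuration> element of core-site.xml (a sketch showing a subset of the table):

    <!-- core-site.xml fragment; values from Table 6 -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://10.61.189.64:8020/</value>
    </property>
    <property>
      <name>io.file.buffer.size</name>
      <value>262144</value>
    </property>
    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology_script</value>
    </property>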
Table 7 lists the differences between Linux sysctl.conf defaults and the settings used during ESG Lab testing.

Table 7. Linux sysctl.conf Settings

  net.ipv4.ip_forward: Controls IP packet forwarding. Actual: 0   Default: 0
  net.ipv4.conf.default.rp_filter: Controls source route verification. Actual: 1   Default: 0
  net.ipv4.conf.default.accept_source_route: Do not accept source routing. Actual: 0   Default: 1
  kernel.sysrq: Controls the system request debugging functionality of the kernel. Actual: 0   Default: 1
  kernel.core_uses_pid: Controls whether core dumps append the PID to the core filename; useful for debugging multithreaded applications. Actual: 1   Default: 0
  kernel.msgmnb: Controls the default maximum size of a message queue, in bytes. Actual: 65536   Default: 16384
  kernel.msgmax: Controls the maximum size of a message, in bytes. Actual: 65536   Default: 8192
  kernel.shmmax: Controls the maximum shared memory segment size, in bytes. Actual: 68719476736   Default: 33554432
  kernel.shmall: Controls the maximum total shared memory, in pages. Actual: 4294967296   Default: 2097512
  net.core.rmem_default: Sets the default OS receive buffer size. Actual: 262144   Default: 129024
  net.core.rmem_max: Sets the maximum OS receive buffer size. Actual: 16777216   Default: 131071
  net.core.wmem_default: Sets the default OS send buffer size. Actual: 262144   Default: 129024
  net.core.wmem_max: Sets the maximum OS send buffer size. Actual: 16777216   Default: 131071
  net.core.somaxconn: Maximum number of sockets the kernel will serve at one time; set on the name node, secondary name node, and job tracker. Actual: 1000   Default: 128
  fs.file-max: Sets the total number of file descriptors. Actual: 6815744   Default: 4847448
  net.ipv4.tcp_timestamps: Disables TCP timestamps when set to 0. Actual: 0   Default: 1
  net.ipv4.tcp_sack: Enables selective ACK for TCP. Actual: 1   Default: 1
  net.ipv4.tcp_window_scaling: Enables TCP window scaling. Actual: 1   Default: 1
  kernel.shmmni: Sets the maximum number of shared memory segments. Actual: 4096   Default: 4096
  kernel.sem: Sets the maximum number and size of semaphore sets that can be allocated. Actual: 250 32000 100 128   Default: 250 32000 32 128
  fs.aio-max-nr: Sets the maximum number of concurrent I/O requests. Actual: 1048576   Default: 65536
  net.ipv4.tcp_rmem: Sets the minimum, default, and maximum receive window size. Actual: 4096 262144 16777216   Default: 4096 87380 4194304
  net.ipv4.tcp_wmem: Sets the minimum, default, and maximum transmit window size. Actual: 4096 262144 16777216   Default: 4096 87380 4194304
  net.ipv4.tcp_syncookies: Disables TCP syncookies when set to 0. Actual: 0   Default: 0
  sunrpc.tcp_slot_table_entries: Sets the maximum number of in-flight RPC requests between a client and a server; set on the name node and secondary name node to improve NFS performance. Actual: 128   Default: 16
  vm.dirty_background_ratio: Maximum percentage of active system memory that can be used for dirty pages before dirty pages are flushed to storage. Actual: 1   Default: 10
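On each node, settings like these are persisted in /etc/sysctl.conf and applied with sysctl -p. A sketch covering a subset of Table 7 (the NFS-related entries belong only on the nodes noted above):

    # Additions to /etc/sysctl.conf (subset of Table 7); apply with "sysctl -p".
    net.core.rmem_default = 262144
    net.core.rmem_max = 16777216
    net.core.wmem_default = 262144
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 262144 16777216
    net.ipv4.tcp_wmem = 4096 262144 16777216
    # Name node and secondary name node only, to improve NFS performance:
    sunrpc.tcp_slot_table_entries = 128
    vm.dirty_background_ratio = 1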
Table 8 lists the differences between Hadoop hdfs-site.xml defaults and the settings used during ESG Lab testing.

Table 8. HDFS Site Settings

  dfs.name.dir: Path on the local file system where the name node persistently stores the namespace and transaction logs. When set to a comma-delimited list of directories (as in this configuration), the name table is replicated in all of the directories for redundancy. Note: /mnt/fsimage_bkp is a location on NFS-mounted NetApp FAS storage where name node metadata is mirrored and protected, a key feature of NetApp's Hadoop solution.
      Actual: /local/hdfs/namedir,/mnt/fsimage_bkp   Default: ${hadoop.tmp.dir}/dfs/name
  dfs.hosts: Specifies a list of machines authorized to join the Hadoop cluster as data nodes.
      Actual: /etc/hadoop-0.20/conf/dfs_hosts   Default: null
  dfs.data.dir: Directory paths on the data node local file systems where HDFS data blocks are stored.
      Actual: /disk1/data,/disk2/data   Default: ${hadoop.tmp.dir}/dfs/data
  fs.checkpoint.dir: Directory path where checkpoint images are stored (used by the secondary name node).
      Actual: /home/hdfs/namesecondary1   Default: ${hadoop.tmp.dir}/dfs/namesecondary
  dfs.replication: HDFS block replication count. The Hadoop default is 3; the NetApp Hadoop solution uses a replication setting of 2.
      Actual: 2   Default: 3
  dfs.block.size: HDFS data storage block size, in bytes.
      Actual: 134217728 (128MB)   Default: 67108864
  dfs.namenode.handler.count: Number of server threads for the name node. Actual: 128   Default: 10
  dfs.datanode.handler.count: Number of server threads for the data node. Actual: 64   Default: 3
  dfs.max-repl-streams: Maximum number of replications a data node is allowed to handle at one time. Actual: 8   Default: 2
  dfs.datanode.max.xcievers: Maximum number of files a data node will serve at one time. Actual: 4096   Default: 256
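As a fragment inside the <configuration> element of hdfs-site.xml, the most distinctive Table 8 settings would look roughly as follows (a sketch; comments note the NetApp-specific choices):

    <!-- hdfs-site.xml fragment; values from Table 8 -->
    <!-- Mirror name node metadata to the NFS-mounted FAS2040 -->
    <property>
      <name>dfs.name.dir</name>
      <value>/local/hdfs/namedir,/mnt/fsimage_bkp</value>
    </property>
    <!-- Hardware RAID permits a replication count of 2 instead of 3 -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>
    </property>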
Table 9 lists the differences between mapred-site.xml defaults and the settings used during ESG Lab testing.

Table 9. mapred-site Settings

  mapred.job.tracker: Job tracker address as a URL (job tracker IP address or hostname with port number).
      Actual: 10.61.189.66:9001   Default: local
  mapred.local.dir: Comma-separated list of local file system paths where temporary MapReduce data is written.
      Actual: /disk1/mapred/local,/disk2/mapred/local   Default: ${hadoop.tmp.dir}/mapred/local
  mapred.hosts: Specifies the file containing the list of nodes allowed to join the Hadoop cluster as task trackers.
      Actual: /etc/hadoop-0.20/conf/mapred.hosts   Default: null
  mapred.system.dir: Path in HDFS where the MapReduce framework stores control files.
      Actual: /mapred/system   Default: ${hadoop.tmp.dir}/mapred/system
  mapred.reduce.tasks.speculative.execution: Enables the job tracker to detect slow-running reduce tasks, assign them to run in parallel on other nodes, use the first available results, and then kill the slower-running reduce tasks.
      Actual: false   Default: true
  mapred.map.tasks.speculative.execution: Same behavior as above, for map tasks.
      Actual: false   Default: true
  mapred.tasktracker.reduce.tasks.maximum: Maximum number of reduce tasks that can be run simultaneously on a single task tracker node. Actual: 5   Default: 2
  mapred.tasktracker.map.tasks.maximum: Maximum number of map tasks that can be run simultaneously on a single task tracker node. Actual: 7   Default: 2
  mapred.child.java.opts: Java options passed to the task tracker child processes (in this case, 1GB of heap memory for each individual JVM).
      Actual: -Xmx1024m   Default: -Xmx200m
  io.sort.mb: Total amount of buffer memory allocated to each merge stream while sorting files on the mapper, in MB. Actual: 340   Default: 100
  mapred.jobtracker.taskScheduler: Job tracker task scheduler to use (in this case, the FairScheduler).
      Actual: org.apache.hadoop.mapred.FairScheduler   Default: org.apache.hadoop.mapred.JobQueueTaskScheduler
  io.sort.factor: Number of streams to merge at once while sorting files. Actual: 100   Default: 10
  mapred.output.compress: Enables/disables MapReduce output file compression. Actual: false   Default: false
  mapred.compress.map.output: Enables/disables map output compression. Actual: false   Default: false
  mapred.output.compression.type: Sets the output compression type. Actual: block   Default: record
  mapred.reduce.slowstart.completed.maps: Fraction of map tasks that should complete before reducers are scheduled for the MapReduce job. Actual: 0.05   Default: 0.05
Table 9. mapred-site Settings (continued)

  mapred.reduce.tasks: Total number of reduce tasks available for the entire cluster.
      Actual: 40 for 8 data nodes, 80 for 16 data nodes, 120 for 24 data nodes   Default: 1
  mapred.map.tasks: Total number of map tasks available for the entire cluster.
      Actual: 56 for 8 data nodes, 112 for 16 data nodes, 168 for 24 data nodes   Default: 2
  mapred.reduce.parallel.copies: Number of parallel threads used by reduce tasks to fetch outputs from map tasks. Actual: 64   Default: 5
  mapred.inmem.merge.threshold: Number of map outputs in the reduce task tracker's memory at which map data is merged and spilled to disk. Actual: 0   Default: 1000
  mapred.job.reduce.input.buffer.percent: Percent usage of the map outputs buffer at which the map output data is merged and spilled to disk. Actual: 1   Default: 0
  mapred.job.tracker.handler.count: Number of job tracker server threads for handling RPCs from the task trackers. Actual: 128   Default: 10
  tasktracker.http.threads: Number of task tracker worker threads for fetching intermediate map outputs for reducers. Actual: 60   Default: 40
  mapred.job.reuse.jvm.num.tasks: Maximum number of tasks that can be run in a single JVM for a job; a value of -1 sets the number to unlimited. Actual: -1   Default: 1
  mapred.jobtracker.restart.recover: Enables job recovery after restart. Actual: true   Default: false
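Similarly, the per-node task slot and JVM heap settings from Table 9 would appear inside the <configuration> element of mapred-site.xml roughly as follows (a sketch showing a subset of the table):

    <!-- mapred-site.xml fragment; values from Table 9 -->
    <!-- 7 map and 5 reduce slots per data node, as tested -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>7</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>5</value>
    </property>
    <!-- 1GB of heap per task JVM -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>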
20 Asylum Street | Milford, MA 01757 | Tel: 508.482.0188 Fax: 508.482.0218 | www.enterprisestrategygroup.com