WHITE PAPER
Effective Hadoop Cluster Management
Abstract
In this white paper, Impetus Technologies discusses
Apache Hadoop™, an open-source, Java-based software
framework that enables the processing of huge
amounts of data through distributed data processing.
It explains how correct and effective provisioning and
management are key to a successful Hadoop™ cluster
environment, and how they make working with
Hadoop™ a pleasant experience. The paper also
discusses the challenges associated with cluster setup,
sharing, and management.
The paper also focuses on the benefits of automated
setup, centralized management of multiple Hadoop™
clusters, and quick provisioning of cloud-based
Hadoop™ clusters.
Impetus Technologies, Inc.
www.impetus.com
April 2012
Table of Contents
Introduction
Understanding Hadoop™ cluster related challenges
    Manual operation
    Cluster set up
    Cluster management
    Cluster sharing
    Hadoop™ compatibility and others
What is missing?
Solutions space
    Addressing operational challenges
    Addressing cluster set up challenges
    Addressing cluster management challenges
    Addressing cluster sharing challenges
    Addressing Hadoop™ compatibility related challenges
Can Hadoop™ Cluster Management tools help?
The Impetus solution
Summary
Introduction
The Hadoop™ framework offers the support required for data-intensive
distributed applications. It manages multiple nodes for distributed processing
of large amounts of data stored locally on the individual nodes. The results
produced by the individual nodes are then consolidated to generate the final
output.
Hadoop™ provides Map/Reduce APIs and works on Hadoop™-compatible
distributed file systems. Hadoop™ sub-components and related tools, such as
HBase, Hive, Pig, ZooKeeper, Mahout, etc., have specific uses and benefits
associated with them. These sub-components are normally used along with
Hadoop™ and therefore also need setup and configuration.
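The Map/Reduce model the framework exposes can be illustrated with the canonical word-count job. The sketch below is a plain-Python simulation of the map, shuffle-and-sort, and reduce phases; it does not use Hadoop's actual Java API, although with Hadoop Streaming similar mapper and reducer scripts can be run on a real cluster:

```python
import itertools

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key; sum the counts per word.
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

def word_count(lines):
    # The framework sorts map output by key before the reduce phase;
    # sorted() stands in for Hadoop's shuffle-and-sort here.
    mapped = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(mapped))
```

On a real cluster, the map and reduce phases run in parallel across nodes, and the framework performs the sort between them.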
Setting up a standalone, pseudo-distributed, or even a relatively small
localized cluster is an easy task. On the other hand, manually setting up and
managing a production-level cluster in a truly distributed environment requires
significant effort, particularly in the areas of cluster setup, configuration,
and management; it is also tedious, time-consuming, and repetitive in nature.
Factors such as the Hadoop™ vendor, version, bundle type, and target
environment add to the existing setup and management complexities. Also,
different cluster modes call for different kinds of configuration: commands and
settings change with the cluster mode, increasing the challenges of Hadoop™
setup and management.
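As an illustration of how configuration changes with cluster mode: in the classic Apache Hadoop™ single-node setup guide (property names are those used by the 0.20/1.x-era releases current when this paper was written), moving from standalone to pseudo-distributed mode involves edits such as the following:

```xml
<!-- core-site.xml: point the default file system at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node cannot hold more than one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

In standalone mode none of these properties are set, and Hadoop™ runs as a single Java process against the local file system; a fully distributed cluster replaces `localhost` with the master's hostname and raises the replication factor.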
Understanding HadoopTM cluster related challenges
The challenges associated with Hadoop™ can be broadly classified into the
following:
1. Operational
2. Cluster set up
3. Cluster management
4. Cluster sharing
5. Hadoop™ compatibility and others
Let us take them up one by one to understand what they mean:
Operational challenges
Operational challenges mainly arise from factors such as manual operation, a
console-based, unfriendly interface, and interactive, serial execution.
Manual operation
The manual mode of execution requires a full-time, fully interactive user
session and consumes a lot of time. It is also error-prone: a single mistake or
omission can force the entire activity to be restarted from scratch.
Interface
Another factor is the console-based interface, which is the only interface
available by default for interacting with a Hadoop™ cluster. It is therefore,
to some extent, also responsible for the serial execution of activities.
Cluster set up
In their simplest form, Hadoop™ bundles are plain tar files that need to be
extracted, set up, and initialized. Apache Hadoop™ bundles (especially the
tarball) ship with no setup support around them. The default way to set up a
cluster is entirely manual, following a sequence of activities that depends on
the cluster mode and the Hadoop™ version/vendor. Cluster setup involves many
complexities and variations arising from factors such as the setup environment
(on-premise versus the Cloud), cluster mode, component bundle type, vendor,
and version. On top of these, the manual, interactive, and attended mode of
operation adds to the challenge.
Cluster management
The current cluster management in Hadoop™ offers limited functionality, and
at the same time the operations need to be carried out manually from a
console-based interface. There is no feature for managing multiple clusters
from a single location; one needs to change the interface or log on to
different machines in order to manage different clusters.
Cluster sharing
With the current way of operation, the task of sharing Hadoop™ clusters across
various users and user groups with altogether different requirements is not just
challenging, tedious and time-consuming, but to some extent also insecure.
Hadoop™ compatibility and others
The key factors in this category relate to areas such as Hadoop™ API
compatibility, working with Hadoop™ bundles and bundle formats (tar/RPM) from
multiple vendors, operational and command differences across Hadoop™
versions, etc.
What is missing?
After examining the challenges, it is important to understand what is missing.
Once we know the missing dimensions, it is possible to overcome or address
most of the challenges.
Missing Dimensions:
• Operational support
o Automation
o Alternate, user-friendly interface
o Monitoring and notifications support
• Setup support
• Cluster management support
• Cluster sharing mechanism
When we compare Hadoop™ with other Big Data solutions (such as Greenplum or
commercial solutions such as Aster Data), we find that those solutions offer
support along the above-mentioned dimensions, which appears to be missing in
Hadoop™.
Today, there are tools in the market that address some or most of the
challenges mentioned above. These solutions primarily build the missing
dimensions around Hadoop™ and thereby address the various pain points.
Let us now look at how these dimensions can help deal with the various
challenges.
Solutions space
Addressing operational challenges
The operational challenges can be addressed by combining automation with an
alternate user interface that supports updates and notifications.
Automation
Automation enables unattended, quick, and error-free execution of any activity.
Smart automation can take care of the various associated factors and situations
in a context-aware manner.
Automation ensures that the right commands are submitted; the parameters
themselves, however, may or may not be correct, as they are keyed in by users.
With an input-validating interface it is possible to check user inputs and
ensure that only the right parameters are used.
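To make the idea concrete, here is a minimal sketch of how an input-validating front end might check user-supplied values before composing a cluster command. The parameter names, rules, and the `start-cluster.sh` script are purely illustrative, not any particular tool's API:

```python
import re

# Hypothetical validation rules for a "start cluster" form.
VALIDATORS = {
    "cluster_name": lambda v: re.fullmatch(r"[A-Za-z][A-Za-z0-9_-]*", v) is not None,
    "node_count":   lambda v: v.isdigit() and int(v) > 0,
    "mode":         lambda v: v in {"standalone", "pseudo-distributed", "distributed"},
}

def validate(params):
    # Return the names of all parameters that failed validation.
    return [name for name, check in VALIDATORS.items()
            if not check(params.get(name, ""))]

def build_command(params):
    # Only compose the command once every input has been validated.
    errors = validate(params)
    if errors:
        raise ValueError("invalid parameters: " + ", ".join(errors))
    return ["start-cluster.sh", "--name", params["cluster_name"],
            "--nodes", params["node_count"], "--mode", params["mode"]]
```

The point is the ordering: validation happens before the command ever reaches the cluster, so a typo surfaces as a form error rather than a half-configured cluster.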
Using an alternate interface
As discussed earlier, the default console-based interface brings several
limitations, such as serial execution and interactive working. It is possible
to overcome this by adopting a user-friendly GUI as an alternate interface that
additionally supports configuration, input validation, and automation, and can
run several activities in parallel. An alternate, friendly user interface helps
in accessing Hadoop™ functionality and operations in a streamlined manner.
Impetus strongly believes that the operational challenges associated with
Hadoop™ clusters can be addressed to a great extent by using an alternate
interface that supports automation and parallel working. Thus automation and
an alternate interface together offer an easier and better Hadoop™ working
environment.
Addressing cluster set up challenges
Cluster set up activity requires careful execution of pre-defined actions in a
situation aware manner where even a minor error or omission due to manual
intervention can result in a major setback.
While simple automation can handle this problem, some actions may still
require user intervention (e.g. accepting license agreements). Bringing smart
automation into the picture enables quick cluster setup in a hassle-free,
non-interactive manner. The entire setup functionality can be offered through
a friendly, highly configurable UI that provides simple, click-based cluster
provisioning. This in turn uses context-aware automation based on the provided
inputs and can perform multiple activities in parallel.
Understanding the difference between setting up a cluster on-premise and
over the Cloud
For Cloud-based clusters, organizations are required to launch and terminate
the cluster instances; however, the hardware, operating systems, and installed
Java versions are mostly uniform, which may not be the case for on-premise
deployments. For on-premise deployment, it is important to set up password-less
SSH between the nodes, which is not required in the Cloud setup. The setup of
the Hadoop™ ecosystem components remains the same, regardless of the cluster
setup environment.
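The password-less SSH step mentioned above lends itself to automation. As a rough sketch, a setup tool could generate the standard OpenSSH command sequence (one `ssh-keygen` on the management host, then `ssh-copy-id` to each node); the user name and node addresses below are made up for illustration:

```python
def passwordless_ssh_commands(user, nodes, key_path="~/.ssh/id_rsa"):
    # Generate one key pair on the management host, then append its public
    # half to authorized_keys on every cluster node.
    commands = ["ssh-keygen -t rsa -N '' -f " + key_path]
    for node in nodes:
        commands.append(f"ssh-copy-id -i {key_path}.pub {user}@{node}")
    return commands
```

For a three-node cluster this yields one key generation plus three `ssh-copy-id` invocations; in practice a tool would execute these through a remote-execution mechanism rather than have the administrator type them on each node.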
Provisioning the Cloud-based HadoopTM cluster
The complexities of provisioning a Cloud-based Hadoop™ cluster arise primarily
from manual operation: accessing the Cloud provider's interface to launch the
required number of nodes with the required hardware configurations, and
providing inputs for key pairs, security settings, machine images, etc. The
required ports need to be opened or unblocked, individual node IPs collected
manually, and all these IP addresses/hostnames added to the Hadoop™ slaves
file. After using the Cloud cluster, one again needs to manually terminate all
the machines, selecting them individually on the Cloud interface.
If the cluster size is small, all these activities can be carried out easily.
However, performing them manually on a large cluster is cumbersome and
error-prone, and requires continuously switching between the Cloud provider's
interface and the Hadoop™ management interface.
Bringing automation into the picture eases all these activities and helps save
time and effort. Automation can be added with simple scripts, with Cloud
provider-specific APIs, or with generic Cloud APIs such as jclouds, Simple
Cloud, Libcloud, Deltacloud, etc. Cloudera CDH-2 scripts can help launch
instances on the Cloud and then set up Hadoop™ on the launched nodes; Whirr
works similarly, using the jclouds API in the background.
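Two of the manual steps above, collecting node IPs into the slaves file and terminating every machine, are easy to automate once the provider's API returns the node list. The sketch below uses made-up IP addresses and a generic `terminate_fn` callback standing in for whatever the provider's API (e.g. an Apache Libcloud driver) actually exposes:

```python
def build_slaves_file(node_ips, master_ip):
    # Hadoop's conf/slaves file lists one worker hostname/IP per line;
    # the master node is excluded here.
    workers = [ip for ip in node_ips if ip != master_ip]
    return "\n".join(workers) + "\n"

def terminate_all(nodes, terminate_fn):
    # Instead of clicking through the Cloud console node by node,
    # apply the provider's terminate call to every node in one pass.
    for node in nodes:
        terminate_fn(node)
```

This removes both the manual IP bookkeeping and the sequential point-and-click termination that the text describes.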
Addressing cluster management challenges
As we have discussed, the key challenge in this area is the lack of appropriate
functionality for managing the cluster. Hadoop™ clusters can be managed
effectively by adopting tools with dedicated, sophisticated support for the
various cluster management capabilities, ranging from node management to
service, user, configuration, parameter, and job management. Such tools may
also provide templates for the common workflows and inputs needed to manage
all of these entities.
The solutions may also support a friendly, configurable way of providing
performance-monitoring updates, progress or status notifications, and alerts.
This is a user-friendly approach that actively pushes progress updates, event
notifications, and status changes to users, instead of users seeking or
polling for them periodically in a passive manner. Furthermore, the mode of
receiving these updates can be customized and configured by the
user/administrator based on individual preferences and on how critical the
information is. Thus users, depending on their needs, can configure the
communication channel to be any of online updates, e-mails, or SMS
notifications.
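A per-user, per-severity channel preference of the kind described could be modeled as simply as the following sketch; the user names, severity levels, and channel names are hypothetical, not any particular tool's configuration format:

```python
# Hypothetical per-user notification preferences.
PREFERENCES = {
    "admin":   {"critical": "sms",   "info": "email"},
    "analyst": {"critical": "email", "info": "online"},
}

def route_notification(user, severity, default="online"):
    # Pick the channel the user configured for this severity level,
    # falling back to passive online updates otherwise.
    return PREFERENCES.get(user, {}).get(severity, default)
```

The fallback matters: users who never configure anything still see updates in the UI, while critical events reach administrators over the most intrusive channel they chose.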
All these functionalities are supported through an alternate, user-friendly
interface that also automates cluster management activities and allows multiple
activities to run in parallel. If the user interface is web-based, it
additionally offers the ability to access cluster-related functionality from
anywhere, at any time.
Addressing cluster sharing challenges
It needs to be mentioned here that cluster sharing essentially means sharing
clusters among the different development and testing team members.
The main problem here is the manner in which these clusters are typically
shared. In the traditional approach, there are two ways to share a cluster.
The first is sharing the credentials of a common user account with the entire
set of users. The second is creating a separate user account or user group for
each user or group with whom you plan to share the cluster.
If you share the cluster using the first approach, i.e. sharing common user
account credentials with all users regardless of their actual usage or access
requirements, you compromise the security of the system. The system (as well
as the cluster) and other linked systems are exposed, because this user
account may have exclusive privileges that are now available to all cluster
users, regardless of their actual requirements.
In the second approach, one needs to create separate OS-level user accounts on
the system (in some cases, on each node of the cluster) with restricted
privileges. This is a complex and time-consuming task: the accounts not only
have to be created and set up, but also maintained and updated as requirements
change over time.
Impetus strongly suggests using role-based cluster sharing through the
alternate UI. This offers a cleaner way to share clusters without compromising
security. Some solutions not only allow role-based access control over the
various cluster management functionalities, they even offer a way to
authenticate users and their roles against a valid, existing external
authentication system. One benefit of this method is that users no longer need
to be created at the per-machine or OS level; they can be created at the
solution level, or even reused from existing domain-level users. It thus
becomes relatively easy to manage and control users through the solution's
admin interface. Furthermore, specific roles can be created on the fly and
assigned to specific user accounts in order to grant or restrict access to
particular functionalities.
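The core of such role-based sharing is a single solution-level permission check in place of per-node OS accounts. The role and action names below are illustrative only, not the roles of any specific product:

```python
# Hypothetical role definitions mapping each role to its allowed actions.
ROLE_PERMISSIONS = {
    "admin":     {"setup", "resize", "configure", "submit_job", "view"},
    "developer": {"submit_job", "view"},
    "viewer":    {"view"},
}

def is_allowed(role, action):
    # One solution-level check replaces OS-level accounts on every node;
    # unknown roles get no access at all.
    return action in ROLE_PERMISSIONS.get(role, set())
```

Because the check lives in the management layer, adding a user or tightening a role is a table edit in the admin interface rather than an account change on every cluster node.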
Cluster sharing has definite associated benefits. Furthermore, if multiple
shared clusters can be managed from a single, centralized location, without
switching interfaces or logging on to multiple machines, the entire task
becomes even easier. A single shared cluster is easy to manage and fine-tune;
users and back-ups, too, are easy to manage. All users of a shared cluster get
performance benefits. Compared with non-shared clusters running on individual
machines, one saves a lot of the time required to set up, manage, and
troubleshoot local clusters on individual machines. Performance figures taken
from local clusters running on individual user machines are also not a true
measure of cluster performance, as the hardware of individual machines rarely
has the best configuration.
Addressing Hadoop™ compatibility related challenges
Let us look at the various challenges related to Hadoop™. The very first
challenge is Hadoop™ compatibility. Hadoop™ as a technology is still evolving
and has not yet reached full maturity. This in turn gives rise to numerous
challenges, such as API differences across versions, wire compatibility (i.e.
running multiple Hadoop™ versions on different nodes of the same cluster), and
interoperability of multiple versions and their respective components.
Sometimes, given configurations may not be supported in certain versions.
Problems may also arise from multiple vendors and vendor-specific features.
There can also be complexities related to bundle formats (tarball and RPM), as
their setup and folder locations (bin/conf) differ. Cluster modes are another
factor that demands suitable changes in configuration and command execution.
Security is available as a separate patch and needs customized configuration.
Issues can also crop up with vendor-specific solutions such as SPOF/HA and
compatible file systems.
It is possible to find workarounds that partially address the compatibility
challenges; a complete solution may not be possible, as these problems stem
from the underlying technology. One can address compatibility problems by
adopting Hadoop™ cluster management tools that offer the option of replacing
incompatible bundles with suitable ones. This primarily ensures that all nodes
in the cluster have the same version of Hadoop™ and the respective components
installed.
Other factors, such as version, vendor, and bundle format, can also be handled
to a great extent by using a Hadoop™ cluster management tool that provides
context-aware component setup and management. Such a tool can take care of
differences in file and folder names/locations, command changes, and
configuration differences according to the bundle format, cluster mode, and
vendor.
Can Hadoop™ Cluster Management tools help?
Impetus strongly believes that by supplying these missing dimensions, a tool
can offer a better Hadoop™ working environment and therefore improve your
productivity when working with Hadoop™ clusters. Such tools can help
immensely, as they offer a quick turnaround and enable companies to create new
clusters quickly, in accordance with their specifications. The tools offer
automated setup, leaving little room for error. They also help minimize the
total cost of operation by reducing the time and effort required for cluster
setup and management.
Hadoop™ cluster management tools can provide integrated support for all
organizational requirements from one place and one interface. They can also
help set up clusters for different needs, e.g. setting up clusters for testing
an application across different vendors, distributions, and versions, and then
benchmarking them on different configurations, loads, and environments.
They help in analyzing the impact of cluster size under different load
patterns, and enable launching and resizing the cluster on the fly.
Among the tools currently available for effective setup and management of
Hadoop™ clusters are Amazon's Elastic MapReduce, Whirr, Cloudera SCM, and
Impetus' Ankush.
The Impetus solution
Impetus' Ankush is a web-based solution that supports the intricacies involved
in cluster setup, management, Cloud provisioning, and sharing.
Ankush offers customers the following benefits:
• Centralized management of multiple clusters. This is a very helpful
feature, as it eliminates the need to change interfaces or log in to
different machines to manage different clusters.
• In the node management area, for instance, the solution's web interface
supports listing the existing nodes as well as adding and removing nodes
by IP or hostname.
• Support for cluster setup in all possible modes, with context-aware
auto-initialization of configuration parameters based on the cluster
mode. The initialization support covers services, configuration
initialization, initial node addition, and initial job submission. It
also supports multiple vendors, versions, and bundles for Hadoop™
ecosystem components.
• For the Cloud, the solution supports launching and terminating entire
Hadoop™ clusters, in both heterogeneous and homogeneous configurations.
Ankush supports all the required Cloud operations from its own UI, so
organizations need not access the Cloud-specific interface.
• Supports centralized management and monitoring of multiple clusters.
Individual cluster-based operations can also be managed using the same
interface from the same location.
• Facilitates reconfiguration and upgrades of Hadoop™ and other
components, along with the management of keys, users, configuration
parameters, services, and monitoring. User management supports multiple
user roles and allows role-based access to the various cluster
functionalities; only an admin-role user can perform operations that
affect the state of the cluster. The setup of the cluster, its
components, and pre-dependencies such as Java and password-less SSH is
carried out in an automated fashion.
Figure: Ankush–Impetus’ HCM Tool
You can use Ankush to set up and manage local as well as Cloud-based clusters.
A web application bundled as a war file, the solution is deployable even at
the user level. Ankush furthermore offers anytime-anywhere access to cluster
functionality through its web-based interface.
Ankush is Cloud-independent, giving you the option to launch clusters on other
compatible Clouds. It also offers a way to quickly apply configuration changes
across all clusters to leverage the performance benefits. According to
Impetus, Ankush helped the company reduce cluster setup time by 60 percent.
Finally, the solution optimizes bundle setup across cluster nodes using
parallelism and bundle re-use.
Summary
In conclusion, for effective Hadoop™ cluster management, automation
facilitates quick and error-free execution of activities. It can make
execution non-interactive and free from human intervention, and can save
extensive time, effort, and cost in cluster setup. For quickly setting up a
cluster on the Cloud, all that is needed is to add automation, either through
simple scripts or through Cloud APIs (provider-specific or generic).
Another important takeaway is that adopting a user-friendly GUI as an
alternate interface can help address your cluster-sharing problems. It will
also support automation and help execute activities in parallel.
It must be reiterated that Hadoop™ is still evolving and has yet to reach
maturity; Hadoop™ compatibility issues can therefore only be partially
addressed.
Lastly, using a suitable Hadoop™ Cluster Management tool can enable
organizations to deal with the pain points associated with cluster setup,
management, and sharing.
About Impetus
Impetus Technologies offers Product Engineering and Technology R&D services for software product development.
With ongoing investments in research and application of emerging technology areas, innovative business models, and
an agile approach, we partner with our client base comprising large scale ISVs and technology innovators to deliver
cutting-edge software products. Our expertise spans the domains of Big Data, SaaS, Cloud Computing, Mobility
Solutions, Test Engineering, Performance Engineering, and Social Media among others.
Impetus Technologies, Inc.
5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
Tel: 408.213.3310 | Email: inquiry@impetus.com
Regional Development Centers - INDIA: • New Delhi • Bangalore • Indore • Hyderabad
Visit: www.impetus.com
Disclaimers
The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of
this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus
Technologies Inc.