WHITE PAPER




Effective Hadoop Cluster Management




 Abstract
    In this white paper, Impetus Technologies discusses
    Apache HadoopTM, an open-source, Java-based software
    framework that enables the processing of huge
    amounts of data through distributed data processing.
    It explains how correct and effective provisioning
    and management are key to a successful HadoopTM
    cluster environment, and thus to a pleasant HadoopTM
    working experience. The white paper also discusses
    the challenges associated with cluster setup,
    sharing, and management.

    The paper also focuses on the benefits of automated
    setup, centralized management of multiple HadoopTM
    clusters, and quick provisioning of cloud-based
    HadoopTM clusters.




  Impetus Technologies, Inc.
  www.impetus.com
  April - 2012



                                                                Table of Contents
Introduction
Understanding HadoopTM cluster related challenges
            Manual operation
            Cluster set up
            Cluster management
            Cluster sharing
            HadoopTM compatibility and others
What is missing?
Solutions space
            Addressing operational challenges
            Addressing cluster set up challenges
            Addressing cluster management challenges
            Addressing cluster sharing challenges
            Addressing HadoopTM compatibility related challenges
Can HadoopTM Cluster Management tools help?
The Impetus solution
Summary




                                                                  Introduction
   The HadoopTM framework offers the required support for data-intensive
   distributed applications. It manages and engages multiple nodes for distributed
   processing of the large amount of data which is stored locally on individual
   nodes. The results produced by the individual nodes are then consolidated
   further to generate the final output.

   HadoopTM provides Map/Reduce APIs and works on HadoopTM -compatible
   distributed file systems. HadoopTM sub-components and related tools, such as
   HBase, Hive, Pig, Zookeeper, Mahout etc. have specific uses and benefits
   associated with them. Normally, these subcomponents are also used along with
   HadoopTM and therefore, need set up and configuration.

   Setting up a standalone or a pseudo-distributed cluster or even a relatively small
   sized localized cluster is an easy task. On the other hand, manually setting up
   and managing a production-level cluster in a truly distributed environment
   requires significant effort, particularly in the area of cluster set up, configuration
   and management. It is also tedious, time consuming and repetitive in nature.

   Factors such as HadoopTM vendor, version, bundle type and the target
   environment add to existing cluster set up and management related
   complexities. Also, different cluster modes call for different kinds of
   configurations. Commands and settings change due to alterations in cluster
   modes, increasing the challenges related to HadoopTM set up and management.



Understanding HadoopTM cluster related challenges
   The challenges associated with HadoopTM can be broadly classified into the
   following:
   1.      Operational
   2.      Cluster set up
   3.      Cluster management
   4.      Cluster sharing
   5.      HadoopTM compatibility and others

   Let us take them up one by one to understand what they mean:
   Operational challenges
   Operational challenges arise mainly from factors such as manual operation, a
   console-based and unfriendly interface, and interactive, serial execution.




Manual operation
The manual mode of execution requires a full-time, fully interactive user
session and consumes a lot of time. It is also error-prone: a single mistake or
omission can force the entire activity to begin again from scratch.

Interface
Another factor is the console-based interface, which is the only default way to
interact with the HadoopTM cluster. It is therefore, to some extent, also
responsible for the serial execution of activities.

Cluster set up
In their simplest form, HadoopTM bundles are plain tar files: they need to be
extracted, set up, and initialized. Apache HadoopTM bundles, especially the
tarball, come with no set up support around them, so the default way to set up
the cluster is entirely manual. A sequence of activities has to be followed
depending on the cluster mode and the HadoopTM version/vendor. The cluster set
up activity involves many complexities and variations arising from factors such
as the set up environment (on-premise versus the Cloud), cluster mode,
component bundle type, vendor, and version. On top of these complexities, the
manual, interactive, and attended mode of operation increases the challenges.
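
To make this "sequence of activities" concrete, the sketch below illustrates one
small part of it for a Hadoop 1.x-style tarball: generating the minimal
pseudo-distributed configuration files after extraction. The extraction path and
property values are illustrative assumptions, not a prescription for any
particular vendor bundle.

```python
# Minimal sketch: generate pseudo-distributed configuration for a freshly
# extracted Hadoop 1.x-style tarball. The paths and property values below are
# illustrative assumptions, not a complete or vendor-specific setup.
import os
from xml.sax.saxutils import escape

HADOOP_CONF_DIR = "/opt/hadoop-1.0.4/conf"  # assumed extraction location

def write_site_file(filename, properties):
    """Render a Hadoop *-site.xml file from a dict of property name -> value."""
    body = "\n".join(
        "  <property><name>{}</name><value>{}</value></property>".format(
            escape(name), escape(str(value)))
        for name, value in properties.items())
    xml = '<?xml version="1.0"?>\n<configuration>\n{}\n</configuration>\n'.format(body)
    with open(os.path.join(HADOOP_CONF_DIR, filename), "w") as f:
        f.write(xml)

# core-site.xml: where the NameNode listens
write_site_file("core-site.xml", {"fs.default.name": "hdfs://localhost:9000"})
# hdfs-site.xml: a single-node cluster keeps only one replica
write_site_file("hdfs-site.xml", {"dfs.replication": 1})
# mapred-site.xml: JobTracker endpoint (pre-YARN layout)
write_site_file("mapred-site.xml", {"mapred.job.tracker": "localhost:9001"})
```

Even this small fragment changes with cluster mode, version, and vendor, which
is precisely what makes manual execution tedious and error-prone.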

Cluster management
Cluster management in HadoopTM currently offers limited functionality, and the
operations need to be carried out manually from a console-based interface.
There is no feature that enables the management of multiple clusters from a
single location; one needs to change the interface or log on to different
machines in order to manage different clusters.

Cluster sharing
With the current way of operation, the task of sharing HadoopTM clusters across
various users and user groups with altogether different requirements is not just
challenging, tedious and time-consuming, but to some extent also insecure.

HadoopTM compatibility and others
The key factors in this category relate to areas like HadoopTM API
compatibility, working with HadoopTM bundles and bundle formats (tar/RPM) from
multiple vendors, version-related operational and command differences, etc.




                                                    What is missing?
After examining the challenges, it is important to understand what is missing.
Once we know the missing dimensions, it is possible to overcome or address
most of the challenges.

Missing Dimensions:
   • Operational support
           o Automation
           o Alternate User friendly interface
           o Monitoring and notifications support
   • Setup support
   • Cluster management support
   • Cluster sharing mechanism

When we compare HadoopTM with other Big Data solutions (such as Greenplum, or
commercial solutions such as Aster Data), we find that these solutions offer
support around the above-mentioned dimensions, which appears to be missing in
HadoopTM.

Today, there are tools in the market that address some, or even most, of the
challenges mentioned above. These solutions primarily build the missing
dimensions around HadoopTM and address the various pain points.

Let us now look at how these dimensions can help deal with the various
challenges.



                                                       Solutions space
Addressing operational challenges
The operational challenges can be addressed through a combination of methods:
automation, and an alternate user interface with support for updates and
notifications.

Automation
Automation enables unattended, quick, and error-free execution of any activity.
Smart automation can take care of various associated factors and situations in a
context aware manner.

Automation ensures that the right commands are submitted; the parameters
themselves, however, may or may not be correct, since they are keyed in by
users. With an input-validating interface it is possible to check user inputs
and ensure that only the right parameters are used.
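
As a rough illustration of such input validation, the following sketch checks
user-supplied values before any command is assembled. The parameter names and
bounds are hypothetical, chosen only to show the idea.

```python
# Sketch of front-end input validation before a Hadoop command is built.
# Parameter names and bounds are illustrative assumptions.
import re

def validate_cluster_inputs(params):
    """Return a list of problems; an empty list means the inputs look sane."""
    errors = []
    if not re.match(r"^[a-zA-Z][\w-]*$", params.get("cluster_name", "")):
        errors.append("cluster_name must start with a letter and use only "
                      "word characters or '-'")
    if not (1 <= params.get("replication", 0) <= params.get("node_count", 0)):
        errors.append("replication factor must be between 1 and the number of nodes")
    for host in params.get("slaves", []):
        if not re.match(r"^[\w.-]+$", host):
            errors.append("invalid hostname/IP: %r" % host)
    return errors

user_input = {"cluster_name": "demo-cluster", "node_count": 4,
              "replication": 3, "slaves": ["node1", "node2", "node3", "node4"]}
problems = validate_cluster_inputs(user_input)
if problems:
    for p in problems:
        print("rejected:", p)
else:
    print("inputs accepted; safe to generate the setup commands")
```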

Using an alternate interface
As discussed earlier, the default console-based interface brings several
limitations, such as serial execution and interactive working. These can be
overcome by adopting a user-friendly GUI as an alternate interface that
additionally supports configuration, input validation, and automation, and at
the same time runs several activities in parallel. Such an alternate, friendly
user interface helps in accessing HadoopTM functionality and operations in a
streamlined manner.

Impetus strongly believes that operational challenges associated with HadoopTM
clusters can be addressed to a great extent by using an alternate interface that
supports automation and provides parallel working support.

Thus, automation and an alternate interface together offer an easier and better
environment for working with HadoopTM.


Addressing cluster set up challenges
Cluster set up activity requires careful execution of pre-defined actions in a
situation aware manner where even a minor error or omission due to manual
intervention can result in a major setback.

While simple automation can handle this problem, some actions may still require
user intervention (e.g. accepting license agreements). Bringing smart
automation into the picture enables quick cluster set up in a hassle-free,
non-interactive manner. The entire cluster set up functionality can be offered
through a friendly, highly configurable alternate user interface that provides
simple, click-based cluster provisioning. This in turn utilizes context-aware
automation based on the provided inputs and can perform multiple activities in
parallel.
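
One simple way to run "multiple activities in parallel" is to fan the per-node
setup work out over a thread pool. In the sketch below, setup_node() is a
hypothetical placeholder for the real per-host work (copying the bundle,
writing configuration, starting services).

```python
# Sketch: run the per-node setup step for many hosts in parallel instead of
# serially. setup_node() is a hypothetical placeholder for the real work.
from concurrent.futures import ThreadPoolExecutor, as_completed

def setup_node(host):
    # placeholder for: push bundle, write configs, start daemons on `host`
    return "%s: configured" % host

hosts = ["hadoop-master", "hadoop-slave1", "hadoop-slave2", "hadoop-slave3"]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(setup_node, h): h for h in hosts}
    for future in as_completed(futures):
        host = futures[future]
        try:
            print(future.result())
        except Exception as exc:  # one failing node should not abort the rest
            print("%s: setup failed (%s)" % (host, exc))
```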

Understanding the difference between setting up a cluster on-premise and
over the Cloud

For Cloud-based clusters, organizations need to launch and terminate the
cluster instances; however, the hardware, operating systems, and installed Java
versions are mostly uniform, whereas they may differ in an on-premise
deployment. For on-premise deployment, it is important to set up password-less
SSH between the nodes, which is not required in the Cloud set up. The setup of
the HadoopTM ecosystem components remains the same, regardless of the cluster
set up environment.
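
The password-less SSH step itself is straightforward to script. The sketch
below relies on the standard ssh-keygen and ssh-copy-id utilities; the user
name and host list are assumptions, and ssh-copy-id will still prompt once per
node for a password when the key is first pushed.

```python
# Sketch: automate password-less SSH for an on-premise cluster using the
# standard ssh-keygen and ssh-copy-id utilities. User and host list are
# assumptions.
import os
import subprocess

KEY_PATH = os.path.expanduser("~/.ssh/id_rsa")
NODES = ["hadoop-slave1", "hadoop-slave2", "hadoop-slave3"]
USER = "hadoop"

# Generate a key pair once, with an empty passphrase, if none exists yet.
if not os.path.exists(KEY_PATH):
    subprocess.check_call(["ssh-keygen", "-t", "rsa", "-N", "", "-f", KEY_PATH])

# Push the public key to every node so subsequent SSH sessions need no password.
for node in NODES:
    subprocess.check_call(["ssh-copy-id", "-i", KEY_PATH + ".pub",
                           "%s@%s" % (USER, node)])
```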



Provisioning the Cloud-based HadoopTM cluster
The complexities of provisioning a Cloud-based HadoopTM cluster arise primarily
from manual operation, which involves steps such as accessing the Cloud
provider's interface to launch the required number of nodes with the required
hardware configurations, and providing inputs for key pairs, security settings,
machine images, etc. One needs to open or unblock the required ports, manually
collect individual node IPs, and add all these IP addresses/hostnames to the
HadoopTM slave files. After using the Cloud cluster, one again needs to
manually terminate all the machines, selecting them individually on the Cloud
interface.

If the cluster size is small, all these activities can be carried out easily.
However, performing them manually on a large cluster is cumbersome and
error-prone, and requires continuous switching between the Cloud provider's
interface and the HadoopTM management interface.

Bringing automation into the picture can ease all these activities and save
time and effort. One can incorporate automation simply by adding scripts, by
using the Cloud provider's own exposed APIs, or by using generic Cloud APIs
such as JCloud, Simple Cloud, LibCloud, DeltaCloud, etc.
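
As one example of the generic-API route, the sketch below uses Apache Libcloud
to launch a few identical nodes and collect the IP addresses that would go into
the slaves file. The credentials, AMI and size identifiers are placeholders,
and the exact size and image names vary by provider and Libcloud version.

```python
# Sketch: provision cluster nodes through Apache Libcloud so the same code can
# target different providers. Credentials, image and size IDs are placeholders.
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver
from libcloud.compute.base import NodeImage

cls = get_driver(Provider.EC2)
driver = cls("ACCESS_KEY", "SECRET_KEY")  # placeholder credentials

# Pick an instance size and a machine image (both assumed identifiers).
size = [s for s in driver.list_sizes() if s.id == "m1.large"][0]
image = NodeImage(id="ami-xxxxxxxx", name=None, driver=driver)

nodes = [driver.create_node(name="hadoop-node-%d" % i, size=size, image=image)
         for i in range(4)]

# wait_until_running returns (node, public_ips) pairs once instances are up;
# these IPs would then be written into Hadoop's masters/slaves files.
for node, ips in driver.wait_until_running(nodes):
    print(node.name, ips[0] if ips else "no public IP yet")
```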

Cloudera CDH-2 scripts can help launch instances on the Cloud and then set up
HadoopTM on the launched nodes. Whirr works in a similar way, using the JCloud
API in the background.




Addressing cluster management challenges
As we have discussed, the key challenge in this area is the lack of appropriate
functionality for managing the cluster. It is possible to manage HadoopTM
clusters effectively by adopting tools with dedicated, sophisticated support
for various cluster management capabilities. This may include functionality
ranging from node management to service, user, configuration, parameter, and
job management. Additionally, such tools may provide templates for the common
workflows and inputs required to manage all the entities mentioned.

Such solutions may also support a friendly, configurable way of providing
performance monitoring updates, progress or status notifications, and alerts.
This user-friendly approach actively pushes progress updates, event
notifications, and status changes to users, instead of requiring them to seek
or poll for the information periodically in a passive manner. Furthermore, the
mode of receiving these updates can be customized and configured by the user or
administrator based on individual preferences and the criticality of the
events. Users can thus configure the communication channel, which can be online
updates, e-mails, or SMS notifications.
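
A minimal version of such configurable notification routing might look like the
sketch below; the user preference store, SMTP relay and addresses are
assumptions, and a real tool would read them from its own configuration.

```python
# Sketch: route cluster events to the channel each user has configured.
# The SMTP host, sender address and the preference store are assumptions.
import smtplib
from email.mime.text import MIMEText

USER_PREFERENCES = {
    "admin": {"channel": "email", "address": "admin@example.com"},
    "dev1":  {"channel": "console"},
}

def send_email(address, subject, body):
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = "hadoop-monitor@example.com"
    msg["To"] = address
    with smtplib.SMTP("localhost") as smtp:  # assumed local mail relay
        smtp.send_message(msg)

def notify(user, subject, body):
    pref = USER_PREFERENCES.get(user, {"channel": "console"})
    if pref["channel"] == "email":
        send_email(pref["address"], subject, body)
    else:
        print("[%s] %s - %s" % (user, subject, body))

notify("admin", "DataNode down", "hadoop-slave2 stopped reporting heartbeats")
```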

All these functionalities are supported through an alternate, user-friendly
interface which also automates cluster management activities and offers a way
to work on multiple activities in parallel. If the user interface is web-based,
it additionally offers the ability to access cluster-related functionality from
anywhere, at any time.


Addressing cluster sharing challenges
Here, cluster sharing essentially means sharing clusters among different
development and testing team members.

The main problem is the manner in which these clusters are typically shared.
Traditionally, there are two ways to share a cluster. The first is to share the
credentials of a common user account with all users. The second is to create
separate user accounts or user groups for each user or group with whom you plan
to share the cluster.

If you share the cluster using the first approach, i.e. sharing common user
account credentials with all users regardless of their actual usage or access
requirements, you are compromising the security of the system. The system (as
well as the cluster) and other linked systems are exposed, because this user
account may have exclusive privileges that are now available to every cluster
user, whether or not they actually need them.

In the second approach, one needs to create separate OS-level user accounts on
the system (in some cases, even on each node of the cluster) with restricted
privileges. This is a complex and time-consuming task: you not only have to
create the accounts, but also maintain and update them as requirements change
over time.

Impetus strongly suggests using role-based cluster sharing through the
alternate UI. This offers a cleaner way to share clusters without compromising
security. Some solutions not only allow you to control role-based access to
various cluster management functionalities, but also offer a way to
authenticate users and their roles against an existing external user
authentication system. One benefit of this method is that users need not be
created at the per-machine or OS level; they can be created at the solution
level, or existing domain-level users can simply be reused. It then becomes
relatively easy to manage and control users through the solution's admin
interface. Furthermore, specific roles can be created on-the-fly and assigned
to user accounts in order to restrict or grant access to particular
functionalities for individual users.
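
In code, role-based access of this kind reduces to a permission check in front
of each management operation. The sketch below uses hypothetical role names and
operations; in practice, the role assignments could come from an external
authentication system, as described above.

```python
# Sketch: role-based access control for cluster management operations.
# Role names, permissions and the user records are illustrative assumptions.
from functools import wraps

ROLE_PERMISSIONS = {
    "admin":     {"start_cluster", "stop_cluster", "add_node", "submit_job"},
    "developer": {"submit_job", "view_status"},
    "viewer":    {"view_status"},
}

def requires(permission):
    def decorator(func):
        @wraps(func)
        def wrapper(user, *args, **kwargs):
            allowed = ROLE_PERMISSIONS.get(user["role"], set())
            if permission not in allowed:
                raise PermissionError("%s (role %s) may not %s"
                                      % (user["name"], user["role"], permission))
            return func(user, *args, **kwargs)
        return wrapper
    return decorator

@requires("add_node")
def add_node(user, host):
    print("%s added node %s" % (user["name"], host))

add_node({"name": "alice", "role": "admin"}, "hadoop-slave4")       # allowed
try:
    add_node({"name": "bob", "role": "developer"}, "hadoop-slave5")  # rejected
except PermissionError as err:
    print("rejected:", err)
```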

Cluster sharing has definite benefits. Furthermore, if multiple shared clusters
can be managed from a single centralized location, without switching interfaces
or logging on to multiple machines, the entire task becomes even easier. A
single shared cluster is easy to manage and fine-tune, and users and back-ups
are easy to manage as well. All users of a shared cluster get performance
benefits. Compared with non-shared clusters running on individual machines, one
saves the considerable time otherwise required to set up, manage, and
troubleshoot local clusters on individual machines. Performance figures
obtained from local clusters running on individual user machines are also not a
true measure of cluster performance, as the hardware of individual machines is
rarely the best configuration.


Addressing HadoopTM compatibility related challenges
Let us look at the various challenges related to HadoopTM. The very first is
HadoopTM compatibility. HadoopTM as a technology is still evolving and has not
yet reached complete maturity. This gives rise to numerous challenges, such as
API differences across versions, on-wire compatibility (i.e. running multiple
HadoopTM versions on different nodes of the same cluster), and interoperability
of multiple versions and their respective components.



Sometimes, a given configuration may not be supported in certain versions.
Problems may also arise due to multiple vendors and vendor-specific features.

There can also be complexities related to bundle formats (tarball and RPM), as
their setup and folder locations (bin/conf) differ. Cluster modes are another
factor that demands suitable changes in configuration and command execution.
Security is available as a separate patch and needs customized configuration.
Issues can also crop up due to vendor-specific solutions, such as SPOF/HA
approaches and compatible file systems.

It is possible to find workarounds that partially address the compatibility
challenges; a complete solution may not be possible, as these problems are the
result of the underlying technology. One can address compatibility problems by
adopting HadoopTM Cluster Management tools that offer the option of replacing
incompatible bundles with suitable ones. This primarily ensures that all nodes
within the cluster have the same version of HadoopTM and the respective
components installed.

Other factors such as version, vendor, and bundle format can also be handled to
a great extent by using a HadoopTM cluster management tool that provides
context-aware component set up and management support. Such a tool can take
care of differences in file and folder names/locations, command changes, and
configuration differences in accordance with the bundle format, cluster mode,
and vendor.
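
The sketch below shows one way such context awareness can be encoded: a small
lookup table resolves the bin and conf locations for the bundle at hand. The
specific paths are assumptions about typical tarball and RPM layouts rather
than verified locations for any particular distribution.

```python
# Sketch: resolve bin/conf locations from the bundle context (vendor, format).
# The paths are assumptions about typical tarball and RPM layouts; a real tool
# would ship verified profiles per supported bundle.
BUNDLE_PROFILES = {
    ("apache", "tar"):   {"bin": "{home}/bin", "conf": "{home}/conf"},
    ("cloudera", "rpm"): {"bin": "/usr/bin",   "conf": "/etc/hadoop/conf"},
}

def resolve_layout(vendor, bundle_format, hadoop_home="/opt/hadoop"):
    try:
        profile = BUNDLE_PROFILES[(vendor, bundle_format)]
    except KeyError:
        raise ValueError("no profile for %s/%s bundle" % (vendor, bundle_format))
    return {key: path.format(home=hadoop_home) for key, path in profile.items()}

print(resolve_layout("apache", "tar"))    # {'bin': '/opt/hadoop/bin', 'conf': '/opt/hadoop/conf'}
print(resolve_layout("cloudera", "rpm"))  # {'bin': '/usr/bin', 'conf': '/etc/hadoop/conf'}
```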



 Can HadoopTM Cluster Management tools help?
Impetus strongly believes that by building in these missing dimensions, a tool
can offer a better environment for working with HadoopTM, and can therefore
improve productivity when working with HadoopTM clusters. Such tools can help
immensely, as they offer a quick turnaround and enable companies to create new
clusters quickly and in accordance with their specifications. They offer
automated set up, leaving little room for error, and help to minimize the total
cost of operation by reducing the time and effort required for cluster set up
and management.

HadoopTM cluster management tools can provide integrated support for all
organizational requirements from one place and one interface. The tools can
also help to set up clusters for different needs, e.g. setting up a cluster to
test an application across different vendors, distributions, and versions, and
then benchmarking it on different configurations, loads, and environments.




They help in analyzing the impact of cluster size against different load
patterns, and enable the cluster to be launched and resized on-the-fly.

Among the tools currently available for the effective set up, and management of
HadoopTM clusters are Amazon’s Elastic Map-Reduce, Whirr, Cloudera SCM, and
Impetus’ Ankush.



                                              The Impetus solution
Impetus’ Ankush is a web-based solution that handles the intricacies involved
in cluster set up, management, cloud provisioning, and sharing.

Ankush offers customers the following benefits:


    •   Centralized management of multiple clusters. This is a very helpful
        feature, as it eliminates the need to change interfaces or log in to
        different machines in order to manage different clusters.
    •   In the node management area, for instance, the solution supports,
        through its web interface, the listing of existing nodes as well as
        the addition and removal of nodes by IP or hostname.
    •   Support for cluster set up in all possible modes. It also performs
        context-aware auto initialization of configuration parameters based on
        the cluster mode. The initialization support includes services,
        configuration initialization, initial node addition and initial job
        submission. Additionally, it also supports multiple vendors, versions,
        and bundles for HadoopTM ecosystem components.
    •   For the cloud, the solution supports the launch and termination of
        entire HadoopTM clusters, for both heterogeneous and homogeneous
        configurations. Ankush can support all the required Cloud operations
        from its own UI, so organizations need not access the cloud-specific
        interface.
    •   Supports centralized management and monitoring of multiple clusters.
        Individual cluster-based operations can also be managed using the same
        interface from the same location.
    •   Facilitates reconfiguration and upgrade of HadoopTM and other
        components, along with the management of keys, users, configuration
        parameters, services, and monitoring. User management supports
        multiple user roles and allows role-based access to various cluster
        functionalities; only a user with the admin role can perform
        operations that affect the state of the cluster. The setup of the
        cluster, its components, and pre-dependencies such as Java and
        password-less SSH is undertaken in an automated fashion.


          Figure: Ankush–Impetus’ HCM Tool




You can use Ankush to set up and manage local as well as Cloud-based clusters.
A web application bundled as a war file, the solution is deployable even at the
user level. Furthermore, Ankush offers anytime-anywhere access to cluster
functionalities through its web-based interface.

Ankush is Cloud-independent, giving you the option to launch the cluster on
other compatible Clouds. It also offers a way to quickly apply configuration
changes across all clusters to leverage the performance benefits. According to
Impetus, the solution helped the company reduce cluster set up time by 60
percent. Finally, it enables bundle set up optimization across cluster nodes
using parallelism and bundle re-use.




                                                                                                                              Summary
In conclusion, for effective HadoopTM cluster management, automation
facilitates quick and error-free execution of activities. It can make execution
non-interactive and free from human intervention, and it saves extensive time,
effort, and cost in cluster set up. For quickly setting up a cluster on the
Cloud, all you need to do is add automation, either through simple scripts or
through Cloud APIs (provider-specific or generic).

Another important takeaway is that adopting a user-friendly GUI as an alternate
interface can help address cluster-sharing problems. It will also support
automation and help execute activities in parallel.

It must be reiterated that HadoopTM is still evolving and is yet to reach
maturity; HadoopTM compatibility issues can therefore only be addressed
partially.

Lastly, using a suitable HadoopTM Cluster Management tool can enable
organizations to deal with the pain areas associated with cluster set up,
management, and sharing.




    About Impetus
    Impetus Technologies offers Product Engineering and Technology R&D services for software product development.
    With ongoing investments in research and application of emerging technology areas, innovative business models, and
    an agile approach, we partner with our client base comprising large scale ISVs and technology innovators to deliver
    cutting-edge software products. Our expertise spans the domains of Big Data, SaaS, Cloud Computing, Mobility
    Solutions, Test Engineering, Performance Engineering, and Social Media among others.

    Impetus Technologies, Inc.
    5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
    Tel: 408.213.3310 | Email: inquiry@impetus.com
    Regional Development Centers - INDIA: • New Delhi • Bangalore • Indore • Hyderabad
    Visit: www.impetus.com



Disclaimers
The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of
this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus
Technologies Inc.

Weitere ähnliche Inhalte

Ähnlich wie Effective Hadoop Cluster Management- Impetus White Paper

The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...Impetus Technologies
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Hybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsHybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsDavid Portnoy
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn HadoopSilicon Halton
 
Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, TIB Academy
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangaloreTIB Academy
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul Divyanshu
 
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...LeMeniz Infotech
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
How to Migrate, Manage and Centralize your Web Infrastructure with Drupal
How to Migrate, Manage and Centralize your Web Infrastructure with DrupalHow to Migrate, Manage and Centralize your Web Infrastructure with Drupal
How to Migrate, Manage and Centralize your Web Infrastructure with DrupalAcquia
 
Effective Hadoop Cluster Management - Impetus Webinar
Effective Hadoop Cluster Management - Impetus WebinarEffective Hadoop Cluster Management - Impetus Webinar
Effective Hadoop Cluster Management - Impetus WebinarImpetus Technologies
 
Hadoop administarrtion
Hadoop administarrtionHadoop administarrtion
Hadoop administarrtionJanu Jahnavi
 
Cloud batch a batch job queuing system on clouds with hadoop and h-base
Cloud batch  a batch job queuing system on clouds with hadoop and h-baseCloud batch  a batch job queuing system on clouds with hadoop and h-base
Cloud batch a batch job queuing system on clouds with hadoop and h-baseJoão Gabriel Lima
 

Ähnlich wie Effective Hadoop Cluster Management- Impetus White Paper (20)

The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Hybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsHybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop Implementations
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Learning How to Learn Hadoop
Learning How to Learn HadoopLearning How to Learn Hadoop
Learning How to Learn Hadoop
 
Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers,
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Virtualized Hadoop
Virtualized HadoopVirtualized Hadoop
Virtualized Hadoop
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentation
 
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
How to Migrate, Manage and Centralize your Web Infrastructure with Drupal
How to Migrate, Manage and Centralize your Web Infrastructure with DrupalHow to Migrate, Manage and Centralize your Web Infrastructure with Drupal
How to Migrate, Manage and Centralize your Web Infrastructure with Drupal
 
Effective Hadoop Cluster Management - Impetus Webinar
Effective Hadoop Cluster Management - Impetus WebinarEffective Hadoop Cluster Management - Impetus Webinar
Effective Hadoop Cluster Management - Impetus Webinar
 
Final White Paper_
Final White Paper_Final White Paper_
Final White Paper_
 
MySQL Devops Webinar
MySQL Devops WebinarMySQL Devops Webinar
MySQL Devops Webinar
 
Hadoop administarrtion
Hadoop administarrtionHadoop administarrtion
Hadoop administarrtion
 
Cloud batch a batch job queuing system on clouds with hadoop and h-base
Cloud batch  a batch job queuing system on clouds with hadoop and h-baseCloud batch  a batch job queuing system on clouds with hadoop and h-base
Cloud batch a batch job queuing system on clouds with hadoop and h-base
 

Mehr von Impetus Technologies

Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...
Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...
Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...Impetus Technologies
 
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix WebinarFuture-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix WebinarImpetus Technologies
 
Building Real-time Streaming Apps in Minutes- Impetus Webinar
Building Real-time Streaming Apps in Minutes- Impetus WebinarBuilding Real-time Streaming Apps in Minutes- Impetus Webinar
Building Real-time Streaming Apps in Minutes- Impetus WebinarImpetus Technologies
 
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...Impetus Technologies
 
Impetus White Paper- Handling Data Corruption in Elasticsearch
Impetus White Paper- Handling  Data Corruption  in ElasticsearchImpetus White Paper- Handling  Data Corruption  in Elasticsearch
Impetus White Paper- Handling Data Corruption in ElasticsearchImpetus Technologies
 
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix WebinarReal-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix WebinarImpetus Technologies
 
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix WebinarReal-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix WebinarImpetus Technologies
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Impetus Technologies
 
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...Impetus Technologies
 
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Impetus Technologies
 
SPARK USE CASE- Distributed Reinforcement Learning for Electricity Market Bi...
SPARK USE CASE-  Distributed Reinforcement Learning for Electricity Market Bi...SPARK USE CASE-  Distributed Reinforcement Learning for Electricity Market Bi...
SPARK USE CASE- Distributed Reinforcement Learning for Electricity Market Bi...Impetus Technologies
 
Enterprise Ready Android and Manageability- Impetus Webcast
Enterprise Ready Android and Manageability- Impetus WebcastEnterprise Ready Android and Manageability- Impetus Webcast
Enterprise Ready Android and Manageability- Impetus WebcastImpetus Technologies
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Impetus Technologies
 
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...Impetus Technologies
 
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...Impetus Technologies
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
Webinar maturity of mobile test automation- approaches and future trends
Webinar  maturity of mobile test automation- approaches and future trendsWebinar  maturity of mobile test automation- approaches and future trends
Webinar maturity of mobile test automation- approaches and future trendsImpetus Technologies
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 
Performance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastPerformance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastImpetus Technologies
 
Real-time Predictive Analytics in Manufacturing - Impetus Webinar
Real-time Predictive Analytics in Manufacturing - Impetus WebinarReal-time Predictive Analytics in Manufacturing - Impetus Webinar
Real-time Predictive Analytics in Manufacturing - Impetus WebinarImpetus Technologies
 

Mehr von Impetus Technologies (20)

Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...
Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...
Data Warehouse Modernization Webinar Series- Critical Trends, Implementation ...
 
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix WebinarFuture-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
 
Building Real-time Streaming Apps in Minutes- Impetus Webinar
Building Real-time Streaming Apps in Minutes- Impetus WebinarBuilding Real-time Streaming Apps in Minutes- Impetus Webinar
Building Real-time Streaming Apps in Minutes- Impetus Webinar
 
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
 
Impetus White Paper- Handling Data Corruption in Elasticsearch
Impetus White Paper- Handling  Data Corruption  in ElasticsearchImpetus White Paper- Handling  Data Corruption  in Elasticsearch
Impetus White Paper- Handling Data Corruption in Elasticsearch
 
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix WebinarReal-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
 
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix WebinarReal-world Applications of Streaming Analytics- StreamAnalytix Webinar
Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
 
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
 
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
 
SPARK USE CASE- Distributed Reinforcement Learning for Electricity Market Bi...
SPARK USE CASE-  Distributed Reinforcement Learning for Electricity Market Bi...SPARK USE CASE-  Distributed Reinforcement Learning for Electricity Market Bi...
SPARK USE CASE- Distributed Reinforcement Learning for Electricity Market Bi...
 
Enterprise Ready Android and Manageability- Impetus Webcast
Enterprise Ready Android and Manageability- Impetus WebcastEnterprise Ready Android and Manageability- Impetus Webcast
Enterprise Ready Android and Manageability- Impetus Webcast
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
 
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
 
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Webinar maturity of mobile test automation- approaches and future trends
Webinar  maturity of mobile test automation- approaches and future trendsWebinar  maturity of mobile test automation- approaches and future trends
Webinar maturity of mobile test automation- approaches and future trends
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Performance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastPerformance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus Webcast
 
Real-time Predictive Analytics in Manufacturing - Impetus Webinar
Real-time Predictive Analytics in Manufacturing - Impetus WebinarReal-time Predictive Analytics in Manufacturing - Impetus Webinar
Real-time Predictive Analytics in Manufacturing - Impetus Webinar
 

Kürzlich hochgeladen

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
associated with them. Normally, these sub-components are also used along with HadoopTM and therefore need set up and configuration.

Setting up a standalone or pseudo-distributed cluster, or even a relatively small localized cluster, is an easy task. On the other hand, manually setting up and managing a production-level cluster in a truly distributed environment requires significant effort, particularly in the areas of cluster set up, configuration, and management. It is also tedious, time-consuming, and repetitive in nature.

Factors such as the HadoopTM vendor, version, bundle type, and the target environment add to the existing cluster set up and management complexities. Different cluster modes also call for different kinds of configurations; commands and settings change with the cluster mode, further increasing the challenges related to HadoopTM set up and management.

Understanding HadoopTM cluster related challenges

The challenges associated with HadoopTM can be broadly classified into the following:

1. Operational
2. Cluster set up
3. Cluster management
4. Cluster sharing
5. HadoopTM compatibility and others

Let us take them up one by one to understand what they mean.

Operational challenges

Operational challenges mainly arise due to factors such as manual operation, a console-based, non-friendly interface, and interactive, serial execution.
Manual operation

The manual mode of execution requires a full-time, fully interactive user session and consumes a lot of time. It is also error-prone: mistakes and omissions can force the entire activity to begin again from scratch.

Interface

Another factor is the console-based interface, which is the primary and only default interface available for interacting with the HadoopTM cluster. It is therefore, to some extent, also responsible for the serial execution of activities.

Cluster set up

In their simplest form, HadoopTM bundles are plain tar files. They need to be extracted and then require set up and initialization. Apache HadoopTM bundles (especially the tarball) do not come with any set up support around them. The default way to set up the cluster is entirely manual: a sequence of activities has to be followed, depending on the cluster mode and the HadoopTM version/vendor.

The cluster set up activity involves many complexities and variations arising from factors such as the set up environment (on-premise versus the Cloud), cluster mode, component bundle type, vendor, and version. On top of these complexities, the manual, interactive, and attended mode of operation adds to the challenge.

Cluster management

The current cluster management in HadoopTM offers limited functionality, and the operations need to be carried out manually from a console-based interface. There is no feature that enables the management of multiple clusters from a single location; one needs to change the interface or log on to different machines in order to manage different clusters.

Cluster sharing

With the current way of operation, the task of sharing HadoopTM clusters across various users and user groups with altogether different requirements is not just challenging, tedious, and time-consuming, but to some extent also insecure.

HadoopTM compatibility and others

The key factors in this category relate to areas such as HadoopTM API compatibility, working with HadoopTM bundles and bundle formats (tar/RPM) from multiple vendors, and operational and command differences across HadoopTM versions.
What is missing?

After examining the challenges, it is important to understand what is missing. Once we know the missing dimensions, it becomes possible to overcome or address most of the challenges.

Missing dimensions:

• Operational support
  o Automation
  o An alternate, user-friendly interface
  o Monitoring and notifications support
• Set up support
• Cluster management support
• Cluster sharing mechanism

When we compare HadoopTM with other Big Data solutions (such as Greenplum, or commercial solutions such as Aster Data), we find that those solutions offer support around the above-mentioned dimensions; this support is missing in HadoopTM. Today, there are tools in the market that address some or most of the challenges mentioned above. These solutions primarily build the missing dimensions around HadoopTM and thereby address the various pain points.

Let us now look at how these dimensions can help deal with the various challenges.

Solutions space

Addressing operational challenges

The operational challenges can be addressed using a combination of methods: applying automation and providing an alternate user interface with support for updates and notifications.

Automation

Automation enables unattended, quick, and error-free execution of any activity. Smart automation can take care of the various associated factors and situations in a context-aware manner. Automation ensures that the right commands are submitted; however, the parameters keyed in by users may still be incorrect. With an input-validating interface, user inputs can be checked so that only valid parameters are used, as illustrated by the sketch below.
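To make the idea concrete, the following minimal Python sketch shows the kind of checks an input-validating interface might perform before any command is generated. The parameter names (node_count, namenode_host) and the rules are illustrative assumptions, not the configuration model of any particular tool.

# Minimal illustration of input validation before command submission.
# The parameter names and rules below are illustrative assumptions,
# not the configuration model of any particular tool.

def validate_inputs(params):
    """Return a list of validation errors; an empty list means the inputs are usable."""
    errors = []

    replication = params.get("dfs.replication")
    node_count = params.get("node_count")

    if not isinstance(node_count, int) or node_count < 1:
        errors.append("node_count must be a positive integer")
    if not isinstance(replication, int) or replication < 1:
        errors.append("dfs.replication must be a positive integer")
    elif isinstance(node_count, int) and replication > node_count:
        errors.append("dfs.replication cannot exceed the number of data nodes")

    if not params.get("namenode_host", ""):
        errors.append("namenode_host must not be empty")

    return errors


if __name__ == "__main__":
    user_input = {"node_count": 4, "dfs.replication": 3, "namenode_host": "nn01"}
    problems = validate_inputs(user_input)
    if problems:
        for p in problems:
            print("Rejected:", p)
    else:
        print("Inputs accepted; the automation layer can now generate the commands.")

Only once the inputs pass such checks would the automation layer compose and submit the actual cluster commands, which is what keeps unattended execution from propagating a typo across the cluster.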
Using an alternate interface

As discussed earlier, the default console-based interface brings several limitations, such as serial execution and interactive working. It is possible to overcome this by adopting a user-friendly GUI as an alternate interface that additionally supports configuration, input validation, and automation, and at the same time runs several activities in parallel. An alternate, friendly user interface helps in accessing HadoopTM functionality and operations in a streamlined manner.

Impetus strongly believes that the operational challenges associated with HadoopTM clusters can be addressed to a great extent by using an alternate interface that supports automation and provides parallel working support. Together, automation and an alternate interface offer an easier and better HadoopTM working environment.

Addressing cluster set up challenges

Cluster set up requires the careful execution of pre-defined actions in a situation-aware manner, where even a minor error or omission due to manual intervention can result in a major setback. While simple automation can handle this problem, some actions may still require user intervention (for example, accepting license agreements). Bringing smart automation into the picture enables a quick set up of the cluster in a hassle-free and non-interactive manner.

The entire cluster set up functionality can be offered through a configurable alternate user interface that provides simple, click-based cluster provisioning through a friendly and highly configurable UI. This, in turn, utilizes context-aware automation based on the provided inputs and can perform multiple activities in parallel.

Understanding the difference between setting up a cluster on-premise and over the Cloud

For Cloud-based clusters, organizations are required to launch and terminate the cluster instances. However, the hardware, operating systems, and installed Java versions are mostly uniform, which may not be the case in an on-premise deployment. For an on-premise deployment, it is important to set up password-less SSH between the nodes, which is not required in the Cloud set up. The set up of the HadoopTM ecosystem components remains the same, regardless of the cluster set up environment.
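As a concrete illustration of the on-premise prerequisite mentioned above, the following Python sketch distributes an SSH public key from the master to each worker node. It is only a sketch under stated assumptions: the OpenSSH client tools (ssh-keygen, ssh-copy-id) are installed, the host names and user name are placeholders, and ssh-copy-id will still prompt once per host for a password unless another bootstrap mechanism is in place.

# Sketch: automate password-less SSH set up from the master node to each worker.
# Assumes OpenSSH client tools are available and the workers initially accept
# password authentication; hostnames and the user name are placeholders.

import os
import subprocess

WORKERS = ["worker01", "worker02", "worker03"]
SSH_USER = "hadoop"
KEY_PATH = os.path.expanduser("~/.ssh/id_rsa")

def ensure_key():
    """Generate an RSA key pair without a passphrase if one does not exist yet."""
    if not os.path.exists(KEY_PATH):
        subprocess.run(
            ["ssh-keygen", "-t", "rsa", "-N", "", "-f", KEY_PATH],
            check=True,
        )

def push_key(host):
    """Append the public key to the worker's authorized_keys via ssh-copy-id."""
    subprocess.run(
        ["ssh-copy-id", "-i", KEY_PATH + ".pub", f"{SSH_USER}@{host}"],
        check=True,
    )

if __name__ == "__main__":
    ensure_key()
    for worker in WORKERS:
        push_key(worker)
        print("Password-less SSH configured for", worker)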
Provisioning the Cloud-based HadoopTM cluster

The complexities of provisioning a Cloud-based HadoopTM cluster arise primarily from manual operation, which involves steps such as accessing the Cloud provider's interface to launch the required number of nodes with the required hardware configurations, and providing inputs for key pairs, security settings, machine images, and so on. There is a need to open or unblock the required ports, manually collect the individual node IPs, and add all these IP addresses/hostnames to the HadoopTM slave files. After using the Cloud cluster, one again needs to manually terminate all the machines sequentially, by individually selecting them on the Cloud interface.

If the cluster size is small, all these activities can be carried out easily. However, performing them manually on a large cluster is cumbersome and may lead to errors. One also needs to continuously switch between the Cloud provider's interface and the HadoopTM management interface.

Bringing automation into the picture can ease all these activities and help save time and effort. Automation can be added with simple scripts, by using the Cloud provider's exposed APIs, or alternatively by using generic Cloud APIs such as JCloud, Simple Cloud, LibCloud, or DeltaCloud. Cloudera CDH-2 scripts can help launch instances on the Cloud and then set up HadoopTM over the launched nodes. The case of Whirr, which uses the JCloud API in the background, is somewhat similar.
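To make the scripted approach more concrete, here is a rough Python sketch using Apache Libcloud, one of the generic Cloud APIs mentioned above, to launch nodes and collect their addresses into a HadoopTM-style slaves file. The credentials, region, image and size identifiers, and node names are placeholders, and the exact driver arguments can vary between Libcloud versions, so treat this as illustrative rather than a drop-in script.

# Rough sketch of scripted provisioning with Apache Libcloud.
# Credentials, region, image/size IDs, and node names are placeholders.

from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

NODE_COUNT = 4

driver_cls = get_driver(Provider.EC2)
driver = driver_cls("ACCESS_KEY", "SECRET_KEY", region="us-east-1")

image = [img for img in driver.list_images() if img.id == "ami-00000000"][0]
size = [s for s in driver.list_sizes() if s.id == "m1.large"][0]

nodes = [
    driver.create_node(name="hadoop-node-%d" % i, image=image, size=size)
    for i in range(NODE_COUNT)
]

# Wait until the instances are running and their IP addresses are assigned.
running = driver.wait_until_running(nodes)

# Collect the node IPs and write them into a Hadoop-style slaves file.
with open("slaves", "w") as slaves_file:
    for node, ip_addresses in running:
        slaves_file.write(ip_addresses[0] + "\n")

# Later, tear the cluster down without touching the provider console:
# for node, _ in running:
#     driver.destroy_node(node)

The same script-level approach can be extended to open the required ports and push the generated slaves file to the master node, removing the need to switch between the Cloud console and the HadoopTM interface.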
Addressing cluster management challenges

As we have discussed, the key challenge in this area is the lack of appropriate functionality for managing the cluster. HadoopTM clusters can be managed effectively by adopting tools with dedicated, sophisticated support for the various cluster management capabilities. This may include functionality ranging from node management to service, user, configuration, parameter, and job management, along with templates for commonly required workflows and inputs for managing all of these entities.

Such solutions may also support a friendly, configurable way of providing performance monitoring updates, progress or status notifications, and alerts. This is a user-friendly approach in which updates on progress, event notifications, and status changes are actively pushed to users, instead of users having to poll for them periodically. Furthermore, the mode of receiving these updates can be customized and configured by the user or administrator, based on individual preferences and how critical the information is. Users can thus configure the communication channel, which can be online updates, e-mails, or SMS notifications.

All of this functionality is supported through an alternate, user-friendly interface that also automates cluster management activities and offers a way to work on multiple activities in parallel. If the user interface is web-based, it additionally offers the ability to access cluster-related functionality from anywhere, at any time.

Addressing cluster sharing challenges

It needs to be mentioned here that cluster sharing essentially means sharing clusters among the different development and testing team members. The main problem is the manner in which these clusters are typically shared. In the traditional approach, there are two possible ways to share a cluster: the first is to share the credentials of a common user account with the entire set of users; the second is to create separate user accounts or user groups for each user or group with whom you plan to share the cluster.

If you share the cluster using the first approach, i.e. sharing common user account credentials with all users regardless of their actual usage or access requirements, you are compromising the security of the system. The system (as well as the cluster) and other linked systems are then exposed, because this user account may have exclusive privileges that become available to every cluster user, regardless of their actual requirements.
In the second approach, one needs to create separate OS-level user accounts on the system (in some cases, even on each node of the cluster) with restricted privileges. This is a complex and time-consuming task: you not only have to create and set up these accounts, but also maintain and update them as requirements change over time.

Impetus strongly suggests using role-based cluster sharing through the alternate UI. This offers a cleaner way to share clusters without compromising security. Some solutions not only allow you to control role-based access to the various cluster management functionalities, they even offer a way to authenticate users and their roles against a valid, existing external user authentication system. A minimal sketch of the underlying idea follows at the end of this sub-section.

One benefit of this method is that users need not be created at the per-machine or OS level; they can be created at the solution level, or even reused from existing domain-level users. It thus becomes relatively easy to manage and control users through the admin interface of the solution. Furthermore, based on requirements, specific roles can be created on-the-fly and assigned to specific user accounts in order to restrict or grant access to particular functionalities for individual users.

Cluster sharing has definite associated benefits, and if multiple shared clusters can be managed from a single centralized location, without switching interfaces or logging on to multiple machines, the entire task becomes even easier. It is easy to manage and fine-tune a single shared cluster, and users and back-ups are also easier to manage. While working with a shared cluster, all cluster users get performance benefits. Compared with non-shared clusters running on individual machines, a lot of time otherwise spent setting up, managing, and troubleshooting local clusters is saved. The performance figures obtained from local clusters running on individual user machines are also not a true measure of cluster performance, as the hardware of individual machines is rarely the best configuration.
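The following Python sketch illustrates the role-based idea referred to above: operations that change cluster state are reserved for an admin role, while read-only operations are open more widely. The roles, operation names, and mapping are illustrative assumptions and do not describe the access model of Ankush or any other specific product.

# Conceptual sketch of role-based access to cluster operations.
# Roles, operations, and the mapping are illustrative assumptions only.

ROLE_PERMISSIONS = {
    "admin": {"start_service", "stop_service", "add_node", "remove_node",
              "submit_job", "view_status"},
    "developer": {"submit_job", "view_status"},
    "viewer": {"view_status"},
}

class AccessDenied(Exception):
    pass

def authorize(role, operation):
    """Raise AccessDenied unless the role is allowed to perform the operation."""
    if operation not in ROLE_PERMISSIONS.get(role, set()):
        raise AccessDenied("role '%s' may not perform '%s'" % (role, operation))

def remove_node(role, hostname):
    authorize(role, "remove_node")
    print("Removing node", hostname)  # the real cluster action would follow here

if __name__ == "__main__":
    remove_node("admin", "worker03")          # allowed
    try:
        remove_node("developer", "worker03")  # rejected
    except AccessDenied as err:
        print("Blocked:", err)

Because the check happens at the management-tool level rather than at the OS level, roles can be added or adjusted without creating new system accounts on every node.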
Addressing HadoopTM compatibility related challenges

Let us now look at the compatibility-related challenges. HadoopTM as a technology is still evolving and has not yet reached complete maturity. This gives rise to numerous challenges, such as API differences across versions, on-wire compatibility (i.e. multiple HadoopTM versions on different nodes of the same cluster), and the interoperability of multiple versions and their respective components.

Sometimes, a given configuration may not be supported in certain versions. Problems may also arise due to multiple vendors and vendor-specific features. There can also be complexities related to bundle formats (tarball and RPM), as their set up and folder locations (bin/conf) differ. Cluster modes are another factor that demands suitable changes in configuration and command execution. Security is available as a separate patch and needs customized configuration. Issues can also crop up due to vendor-specific solutions, such as SPOF/HA approaches and compatible file systems.

It is possible to find work-arounds that partially address the compatibility challenges; a complete solution may not be possible, as these problems stem from the underlying technology. One way to address them is to adopt a HadoopTM Cluster Management tool that offers the option of replacing incompatible bundles with suitable ones. This primarily ensures that all the nodes within the cluster have the same version of HadoopTM and of the respective components installed. Other factors, such as version, vendor, and bundle format, can also be handled to a great extent by using a HadoopTM cluster management tool that provides context-aware component set up and management support. The tool can take care of differences in file and folder names/locations, command changes, and configuration differences according to the bundle format, cluster mode, and vendor, along the lines of the sketch below.
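The sketch below shows, in Python, one simple way such context-aware handling can be organized: a profile per (vendor, bundle format) pair that resolves the paths and commands the automation layer should use. The vendors, paths, and commands listed are assumptions made for the sake of the example, not an exhaustive or authoritative mapping.

# Illustrative sketch of context-aware handling of bundle differences.
# Vendors, paths, and commands below are assumptions for the example only.

BUNDLE_PROFILES = {
    # (vendor, bundle_format): locations and commands this profile uses
    ("apache", "tar"): {
        "conf_dir": "/opt/hadoop/conf",
        "start_dfs": "/opt/hadoop/bin/start-dfs.sh",
    },
    ("cloudera", "rpm"): {
        "conf_dir": "/etc/hadoop/conf",
        "start_dfs": "service hadoop-hdfs-namenode start",
    },
}

def resolve(vendor, bundle_format, key):
    """Pick the right path or command for the given bundle context."""
    try:
        return BUNDLE_PROFILES[(vendor, bundle_format)][key]
    except KeyError:
        raise ValueError(
            "No profile for vendor=%r, format=%r, key=%r"
            % (vendor, bundle_format, key)
        )

if __name__ == "__main__":
    print(resolve("apache", "tar", "conf_dir"))     # tarball layout
    print(resolve("cloudera", "rpm", "start_dfs"))  # service command instead of a script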
Can HadoopTM Cluster Management tools help?

Impetus strongly believes that by providing these missing dimensions, a tool can offer a better HadoopTM working environment and can therefore improve your productivity when working with HadoopTM clusters. Such tools can help immensely, as they offer a quick turnaround and enable companies to create new clusters quickly, in accordance with their specifications. They offer automated set up, greatly reducing the room for error, and help to minimize the total cost of operation by reducing the time and effort required for cluster set up and management.

HadoopTM cluster management tools can provide integrated support for all organizational requirements from one place and one interface. They can also help set up clusters for different needs, for example setting up a cluster for testing an application across different vendors, distributions, and versions, and then benchmarking it on different configurations, loads, and environments. They help in analyzing the impact of cluster size against different load patterns, and then enable the launch and resizing of the cluster on-the-fly.

Among the tools currently available for the effective set up and management of HadoopTM clusters are Amazon's Elastic Map-Reduce, Whirr, Cloudera SCM, and Impetus' Ankush.

The Impetus solution

Impetus' Ankush, a web-based solution, supports the intricacies involved in cluster set up, management, cloud provisioning, and sharing. Ankush offers customers the following benefits:

• Centralized management of multiple clusters. This is a very helpful feature, as it eliminates the need to change interfaces or log in to different machines in order to manage different clusters.
• In the node management area, for instance, the solution supports, through its web interface, the listing of existing nodes as well as the addition and removal of nodes by IP or hostname.
• Support for cluster set up in all possible modes. It also performs context-aware auto-initialization of configuration parameters based on the cluster mode. The initialization support covers services, configuration initialization, initial node addition, and initial job submission. Additionally, it supports multiple vendors, versions, and bundles for HadoopTM ecosystem components.
• For the Cloud, the solution supports the launch and termination of entire HadoopTM clusters, for both heterogeneous and homogeneous configurations. Ankush supports all the required Cloud operations from its own UI, so organizations need not access the cloud-specific interface.
• Centralized management and monitoring of multiple clusters. Individual cluster-based operations can also be managed using the same interface, from the same location.
• Support for the reconfiguration and upgrade of HadoopTM and other components, and for the management of keys, users, configuration parameters, services, and monitoring.

User management supports multiple user roles and allows role-based access to the various cluster functionalities; only a user with the admin role can perform operations that affect the state of the cluster. The set up of the cluster, its components, and pre-dependencies such as Java and password-less SSH is undertaken in an automated fashion.
Figure: Ankush–Impetus' HCM Tool

You can use Ankush to set up and manage local as well as Cloud-based clusters. A web application bundled as a war file, the solution is deployable even at the user level. Ankush furthermore offers anytime-anywhere access to cluster functionality through its web-based interface. Ankush is Cloud-independent, giving you the option of launching the cluster on other compatible Clouds.

Ankush offers a way to quickly apply configuration changes across all the clusters to leverage the performance benefits. According to Impetus, it helped the company reduce cluster set up time by 60 percent. Finally, the solution optimizes bundle set up across cluster nodes using parallelism and bundle re-use.
Summary

In conclusion, for effective HadoopTM cluster management, automation facilitates quick and error-free execution of activities. It can be applied to make execution non-interactive and free from human intervention, and to save the extensive time, effort, and cost involved in cluster set up. For quickly setting up a cluster on the Cloud, all that is needed is to add automation, either through simple scripts or by using Cloud APIs (provider-specific exposed APIs or generic APIs).

Another important takeaway is that adopting a user-friendly GUI as an alternate interface can help address cluster-sharing problems; it will also support automation and help execute activities in parallel. It must be reiterated that HadoopTM is still evolving and is yet to reach maturity, so HadoopTM compatibility issues can only be addressed partially. Lastly, using a suitable HadoopTM Cluster Management tool can enable organizations to deal with the pain areas associated with cluster set up, management, and sharing.

About Impetus

Impetus Technologies offers Product Engineering and Technology R&D services for software product development. With ongoing investments in research and application of emerging technology areas, innovative business models, and an agile approach, we partner with our client base comprising large-scale ISVs and technology innovators to deliver cutting-edge software products. Our expertise spans the domains of Big Data, SaaS, Cloud Computing, Mobility Solutions, Test Engineering, Performance Engineering, and Social Media, among others.

Impetus Technologies, Inc.
5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
Tel: 408.213.3310 | Email: inquiry@impetus.com
Regional Development Centers - INDIA: • New Delhi • Bangalore • Indore • Hyderabad
Visit: www.impetus.com

Disclaimers

The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus Technologies Inc.