1. Deploying and Managing
Hadoop Clusters with
AMBARI
Matt Foley and Hitesh Shah
Hortonworks, Inc.
mfoley@hortonworks.com
hitesh@hortonworks.com
© Hortonworks Inc. 2012 Page 1
2. Matt Foley - Background
• MTS at Hortonworks Inc.
– Hadoop Core contributor, part of original ~25 in Yahoo! spin-out of
Hortonworks
– Currently managing engineering infrastructure for Hortonworks, including
build and deployment automation
– My team also volunteers Build Engineering infrastructure services to ASF,
for Hadoop core and several related projects within Apache
– Participated in the Hortonworks team working on Ambari implementation
during transitional phase
– Formerly, led software development for back end of Yahoo Mail for three
years – 20,000 servers in hundreds of clusters, with 30 PB of data under
management, 400M active users
• Apache Hadoop, ASF
– Committer and PMC member, Hadoop core
– Release Manager – Hadoop-1.0
3. Hitesh Shah - Background
• MTS at Hortonworks Inc.
• Committer for Apache MapReduce and Ambari
• Earlier, spent 8+ years at Yahoo! building various
frameworks, all the way from data storage platforms to
high-throughput online ad-serving systems.
4. Overview
• Brief history – evolution of the Ambari project
• Installation
• Monitoring
• Management
• Invitation
5. All features are available today
• Apologies that the screenshots are from the HMC
(Hortonworks Management Console) version of
Ambari
• Same code as current Ambari, but with Hortonworks
graphic elements
• You too can “skin” Ambari with your own logotype
and graphic elements!
7. Brief History of the Ambari Project
• Deployment, Monitoring, and Management of Hadoop
and HBase clusters is:
– HARD, due to massive scale and distributed services; and
– DIFFERENT from other kinds of compute clusters,
due to Hadoop’s intrinsic fault-tolerance
• We needed an Apache open-source solution
• Started Ambari as an Apache incubator project
– Originally based in part on what was learned from “Hadoop
Management System” project out of Yahoo!
8. History (continued)
• Early work specified a full architecture, including
many elements that remain today:
– State-based configuration management, rather than event-based
– Cluster configuration as a data object, able to be saved and manipulated
– Reliable deployment, parallelized for scalability
– Insightful monitoring and alerting, sharing our deep experience with the
community
– Take advantage of Puppet to achieve idempotence on installs, and
reliable start/stop of processes
– Go beyond Puppet to offer orchestrated start/stop of distributed services
• The team started with a “whole cloth” design and
build project
• 6 months into it, we figured out we had a 2-year
project on our hands!
9. Evolution
• How to get a useful tool out to the community sooner?
• Make more use of existing tech
– Ganglia and Nagios for monitoring and alerting
– Puppet for reliable deployment and process control
• Commit to incremental delivery
– First generation won’t have all the breadth and features desirable
– But will be useful and worth using
• And the team has completed the first usable version of Ambari
over the last few weeks!
– Offers a good, GUI-driven Deploy experience, currently limited to
RHEL5/CentOS5 and non-secure mode (but just wait a few more weeks!)
– Quite nice Monitoring, based on our experience managing
multi-thousand-node Hadoop clusters at Yahoo!
– A beginning on Management, with several basic post-install operations
11. Deployment and Installation Phases
• Preparation
• Cluster Pre-config
• Hadoop Stack Configuration
• Hadoop Stack Deploy / Install
• Service start-up and smoke test
12. Deployment and Installation (Preparation)
• Prepare Ambari and the Ambari Agent (includes Puppet agent)
– Can follow instructions at
http://svn.apache.org/viewvc/incubator/ambari/trunk/README.txt
– Or download the HMC from Hortonworks after Summit, and access its
documentation
• Prepare access to ‘yum’ Repositories containing Hadoop Stack
and Ambari dependencies
– If your nodes have direct internet access, can use provided RPMs to “install” the
repos on each node
– Or, to avoid direct access from each node and minimize WAN traffic, can mirror the
yum repositories to an internal server accessible from the nodes
• Prepare nodes for installation commands
– Set up password-less ‘ssh’ for root user (secured via public keys and agent
forwarding) from Install Master node to all other cluster nodes, so can run ‘yum
install’ and ‘puppet’ commands
– Take care of any other issues that may prevent root ssh during the Deployment
phase, such as iptables or SELinux.
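The password-less ssh requirement above can be sanity-checked before starting the deploy. A minimal sketch, not part of Ambari; the `runner` hook is an assumption added only to make the function testable:

```python
# Verify non-interactive root ssh to every cluster node before deploying.
import subprocess

def check_root_ssh(hosts, runner=subprocess.run, timeout=10):
    """Return (reachable, unreachable) host lists using non-interactive ssh."""
    reachable, unreachable = [], []
    for host in hosts:
        # BatchMode=yes makes ssh fail fast instead of prompting for a
        # password, so any node still requiring interactive auth is flagged.
        result = runner(
            ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout}",
             f"root@{host}", "true"],
            capture_output=True,
        )
        (reachable if result.returncode == 0 else unreachable).append(host)
    return reachable, unreachable
```

Running this from the Install Master before the Deploy phase surfaces iptables, SELinux, or key-distribution problems early, instead of mid-install.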
13. Deployment and Installation (Pre-config)
• Start running Ambari
• Provide list of hosts
– Works with Amazon EC2 IP addresses too
• Ambari does node Validation and Discovery
– Confirms availability and access capability
– Scans for node attributes and mount points
• Select desired services and data directory paths
• Automatic role assignments to nodes, with your
approval
– Based on node attributes and selected services
– Currently based primarily on memory size, to be refined in future
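The memory-based role assignment can be sketched as below; the role names and the "largest node gets the NameNode" heuristic are illustrative assumptions, not Ambari's actual logic:

```python
# Sketch: assign Hadoop roles to nodes based on memory size.
def assign_roles(nodes):
    """nodes: dict of hostname -> RAM in GB. Returns hostname -> role list."""
    # Put master daemons on the largest-memory nodes; every node also
    # carries the worker roles.
    by_mem = sorted(nodes, key=nodes.get, reverse=True)
    roles = {host: ["DataNode", "TaskTracker"] for host in nodes}
    roles[by_mem[0]].insert(0, "NameNode")
    roles[by_mem[1 % len(by_mem)]].insert(0, "JobTracker")
    return roles
```

The user then approves or adjusts the proposed mapping, as the slide describes.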
17. Deployment and Installation (Configuration)
• Currently supported Hadoop Stack components for installation:
– Hadoop Core (required)
– HBase
– Pig
– Hive
– HCatalog
– ZooKeeper (required for HBase, Hive, HCatalog)
– Sqoop
– Oozie
– Ganglia
– Nagios
• Modify a subset of about 50 key parameters that most commonly
need to be adjusted, depending on components selected
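For illustration, a handful of those parameters can be rendered into Hadoop's `*-site.xml` property format like this. The keys are standard Hadoop 1.x names; the values are made-up defaults, not Ambari's actual choices:

```python
# Render a parameter subset into Hadoop's *-site.xml property format.
from xml.sax.saxutils import escape

def to_site_xml(params):
    lines = ["<configuration>"]
    for name, value in sorted(params.items()):
        lines += ["  <property>",
                  f"    <name>{escape(name)}</name>",
                  f"    <value>{escape(str(value))}</value>",
                  "  </property>"]
    lines.append("</configuration>")
    return "\n".join(lines)

# Illustrative HDFS overrides of the kind the install wizard exposes.
hdfs_overrides = {
    "dfs.name.dir": "/hadoop/hdfs/namenode",
    "dfs.data.dir": "/hadoop/hdfs/data",
    "dfs.replication": 3,
}
```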
21. Deployment and Installation (Deploy)
• Final review of Cluster and Stack parameters
• Puppet agent on each node is invoked (in parallel) to reliably
deploy needed packages
• Actual fetch and install is managed with ‘yum’
(for RHEL/CentOS) or comparable services
• Success / failure is reported back to Install Master and the
Ambari application
• Log messages for failures are provided to assist debugging
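The parallel deploy-and-report step can be sketched as follows; `install_on` is a hypothetical stand-in for the per-node Puppet/yum invocation, not an Ambari API:

```python
# Sketch: run the per-node install concurrently and collect success/failure
# plus a log tail for each node, for debugging failed installs.
from concurrent.futures import ThreadPoolExecutor

def deploy(hosts, install_on, max_workers=20):
    """Run install_on(host) -> (ok: bool, log: str) across hosts in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(install_on, h): h for h in hosts}
        for fut, host in futures.items():
            ok, log = fut.result()
            # Keep only a log tail for failures, to assist debugging.
            results[host] = {"ok": ok, "log": None if ok else log[-2000:]}
    return results
```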
25. Deployment and Installation (Smoke Test)
After successful install:
• Ambari provides “orchestration” to start-up distributed services
in dependency order
• Puppet “kicks” are used to (mostly) reliably start and stop
service processes on individual nodes
• After each distributed service is started, a smoke test is run and
the results are reported
• Each component is smoke-tested before its dependent components
After a successful smoke test, you can be confident that your
selected components have been successfully installed and
started, and are running correctly.
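The dependency-ordered start-up amounts to a topological sort over a service dependency map. A sketch, with a map that mirrors the stack components listed earlier (cycle detection is omitted for brevity, since the stack's dependencies are acyclic):

```python
# Sketch: compute a start-up order in which every service's dependencies
# come before it.
def start_order(deps):
    """Topologically sort services; deps: service -> services it depends on."""
    order, seen = [], set()

    def visit(svc):
        if svc in seen:
            return
        seen.add(svc)
        for d in deps.get(svc, []):
            visit(d)          # start dependencies first
        order.append(svc)

    for svc in deps:
        visit(svc)
    return order

# Illustrative dependencies among the installable stack components.
STACK_DEPS = {
    "HDFS": [],
    "MapReduce": ["HDFS"],
    "ZooKeeper": [],
    "HBase": ["HDFS", "ZooKeeper"],
    "Hive": ["MapReduce", "ZooKeeper"],
}
```

Each service in the resulting order is started and smoke-tested before anything that depends on it.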
26. Going forward
• Multiple OS support
– RHEL6/CentOS6
– Ubuntu and Debian
– SUSE/SLES
– Windows
• Hadoop Security support, including secure install for all
components
• HA support
• Hadoop 2.0 support
• Improved GUI user interface
• Integration: Provide CLI commands for invoking Puppet scripts,
and Web APIs where appropriate
• Etc.
29. Ambari Monitoring
• Basic Monitoring capabilities for Hadoop Cluster Services
– Up/Down status for installed Hadoop services
– Key Alerts configured for health, performance and usage monitoring of
Hadoop services
– Consolidated summary information for Hadoop Services (HDFS, M/R & HBase)
– Key service metrics graphs for temporal analysis of service performance, utilization,
and health (plus system metrics: CPU, memory, network, etc.)
• Efficient collection and visualization of monitoring metrics
– Lightweight alert condition checks (mostly over the network) for better scalability
• Leverage Open Source monitoring systems such as Nagios & Ganglia
– Nagios - for Alert Monitoring
– Ganglia/RRDTool for Hadoop metrics graphs
• Simple and Intuitive UI to monitor the Hadoop cluster status
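A “lightweight check over the network” in the Nagios style can be as simple as a TCP connect to a daemon's port, returning standard Nagios plugin exit codes. This is an illustration, not one of Ambari's actual checks:

```python
# Sketch: Nagios-style up/down check via a plain TCP connect, without
# touching the remote daemon's internals.
import socket

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

def check_tcp(host, port, timeout=5.0):
    """Return (status, message) for a single service endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"{host}:{port} is accepting connections"
    except OSError as e:
        return CRITICAL, f"{host}:{port} unreachable: {e}"
```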
33. Going forward
• Rapid iterations with the Ambari open-source community to add more
monitoring capabilities, e.g.:
– More service alerts, summary stats & reporting for the Hadoop services
– Queue/job-level monitoring & diagnostic reporting for M/R
– Improved visualization of service metrics graphs & reports
– Ability to customize the dashboard with relevant graphs, alerts and service information
• RESTful APIs for Hadoop Monitoring
– For integration with Enterprise and Cloud Management Systems, and
with “powered by Ambari” products
– CLIs
• Ability to integrate with third party monitoring tools in place of Nagios &
Ganglia
• Best practices, tips and guidelines for using Monitoring dashboard for
identifying and debugging common cluster problems
35. Management
• “Management” can include many different
post-install activities with Hadoop clusters
• Ambari currently supports only a small set:
– Start / Stop individual services
(dependent services will automatically be stopped as well)
– Change configuration parameters for a service
(data directory paths cannot currently be changed)
– Add nodes to the Cluster
(decommissioning nodes is currently a manual process)
– Uninstall the Cluster
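The rule that dependent services are stopped automatically amounts to a transitive closure over an inverted dependency map. A sketch, not the actual Ambari implementation:

```python
# Sketch: given "service -> services it depends on", compute everything
# that must stop when one service is stopped.
def stop_set(service, deps):
    """Return every service that must stop when `service` stops."""
    # Invert the map: who depends on whom.
    dependents = {}
    for svc, needed in deps.items():
        for d in needed:
            dependents.setdefault(d, []).append(svc)
    # Walk the dependents transitively.
    to_stop, stack = {service}, [service]
    while stack:
        for child in dependents.get(stack.pop(), []):
            if child not in to_stop:
                to_stop.add(child)
                stack.append(child)
    return to_stop
```

Stopping HDFS, for example, pulls in MapReduce, HBase, and anything layered on those; stopping a leaf service like Hive affects only itself.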
39. Going forward
• Lots more management actions supported
– Security and user management
– HA alerting and recovery
– Extensions of current functionalities
– Etc.
• Integration: RESTful APIs / web services for integration with
established management tools in the data center
• Improved GUI user interface
40. Invitation
• Deployment, Monitoring, and Management – this is
just the first generation!
• If you are interested in these functionalities and want
to participate in an Apache open-source project,
please consider becoming a contributor to the
AMBARI (incubating) project!
• http://incubator.apache.org/ambari/mail-lists.html
41. Thank you.