Solution Guide
EMC HYBRID CLOUD SOLUTION
WITH VMWARE
Hadoop Applications Solution Guide 2.5
EMC Solutions
Abstract
This document serves as a reference for planning and designing a Pivotal Hadoop
solution that enables IT organizations to quickly deploy Hadoop as a service (HaaS)
on an existing cloud.
August 2014
Contents
Contents
Chapter 1 Executive Summary 7
Document purpose.....................................................................................................8
Audience....................................................................................................................8
Solution purpose........................................................................................................8
Business challenge ....................................................................................................9
Technology solution ...................................................................................................9
Chapter 2 EMC Hybrid Cloud Solution Overview 11
Introduction .............................................................................................................12
EMC Hybrid Cloud features and functionality............................................................13
Automation and self-service provisioning ............................................................13
Multitenancy and secure separation....................................................................14
Workload-optimized storage................................................................................14
Elasticity and service assurance ..........................................................................14
Operational monitoring and management............................................................15
Metering and chargeback ....................................................................................15
Modular add-on components...............................................................................16
Chapter 3 EMC Hybrid Cloud Hadoop as a Service 19
Overview ..................................................................................................................20
EMC Hybrid Cloud HaaS and IaaS .............................................................................20
Pivotal Hadoop.........................................................................................................21
Serengeti..................................................................................................................22
VMware vSphere Big Data Extensions.......................................................................22
Chapter 4 HaaS Component Integration 25
Overview ..................................................................................................................26
Integrating Hadoop components with EMC Hybrid Cloud ..........................................26
BDE Topology.......................................................................................................26
Virtualized Hadoop..............................................................................................27
Configuring the platform...........................................................................................28
Installing and configuring BDE.............................................................................28
Installing and configuring PHD.............................................................................30
Installing and configuring EMC Hybrid Cloud IaaS................................................33
Chapter 5 Creating vCO Workflows and vCAC Catalog Services for HaaS 35
Overview ..................................................................................................................36
Importing and modifying custom vCO workflows ......................................................36
Modifying custom workflows ...............................................................................36
Creating BDE Clusters...............................................................................................42
Creating new BDE clusters ...................................................................................42
Configuring a Hadoop cluster...............................................................................42
Creating vCAC Catalog Services ................................................................................45
Accessing vCAC ...................................................................................................45
Creating a new service blueprint..........................................................................45
Chapter 6 Use Cases: EMC Hybrid Cloud IaaS 49
Overview ..................................................................................................................50
IaaS – storage services.............................................................................................50
Overview..............................................................................................................50
Use case 1: Storage provisioning.........................................................................50
Use case 2: Select virtual machine storage ..........................................................54
Use case 3: Metering storage services .................................................................55
Summary .............................................................................................................56
Monitoring and capacity planning ............................................................................57
Monitoring...........................................................................................................57
Capacity planning................................................................................................57
Capacity planning example..................................................................................60
Metering and chargeback .........................................................................................61
Chapter 7 Conclusion 65
Summary..................................................................................................................66
Appendix A References 67
VMware references...................................................................................................68
Figures
Figure 1. EMC Hybrid Cloud key components .....................................................12
Figure 2. EMC Hybrid Cloud self-service portal ...................................................14
Figure 3. EMC ViPR Analytics with VMware vCenter Operations Manager............15
Figure 4. IT Business Management Suite overview dashboard for hybrid cloud ..16
Figure 5. EMC Hybrid Cloud HaaS component overview......................................21
Figure 6. Pivotal Hadoop (PHD) components......................................................22
Figure 7. BDE and Serengeti stack......................................................................23
Figure 8. BDE and vSphere deployment topology...............................................26
Figure 9. The evolution of virtual Hadoop...........................................................27
Figure 10. Configuring the SSO lookup service and management server IP
addresses ...........................................................................................29
Figure 11. Importing Hadoop binaries into BDE management server ....................31
Figure 12. Removing the default Apache template from BDE ................................32
Figure 13. Importing custom workflows into vCO..................................................36
Figure 14. Using the validate workflows action ....................................................37
Figure 15. How to edit the attributes....................................................................37
Figure 16. Editing and creating custom parameter passing ..................................38
Figure 17. Launching scripts from the VCO...........................................................39
Figure 18. Launching of Micro Hadoop Cluster workflow ......................................40
Figure 19. Status of creation of Micro Hadoop cluster from BDE (vSphere web
client)..................................................................................................41
Figure 20. Status of Micro Hadoop cluster creation from BDE vSphere Client .......41
Figure 21. Create and name a new Big Data Cluster .............................................42
Figure 22. Advance Service Designer ...................................................................46
Figure 23. Edit Entitlement window......................................................................46
Figure 24. vCAC Service Catalog showing Hadoop as a Service ............................47
Figure 25. Storage Services - Provision cloud storage ..........................................51
Figure 26. Provision Cloud Storage – select vCenter cluster .................................52
Figure 27. Storage Provisioning – Select datastore type.......................................52
Figure 28. Storage provisioning – Choose ViPR storage pool................................53
Figure 29. Storage provisioning – Enter storage size............................................53
Figure 30. Provision Storage – Storage Reservation for vCAC Business Group ......53
Figure 31. Set storage reservation policy for virtual machine disks ......................54
Figure 32. Create new virtual machine storage profile for Tier 2 storage ...............55
Figure 33. Automatic discovery of storage capabilities using EMC ViPR Storage
Provider...............................................................................................55
Figure 34. VMware ITBM chargeback based on storage profile of datastore .........56
Figure 35. Choosing virtual machine consumption models and profiles...............58
Figure 36. Specifying configuration and projected capacity usage of new virtual
machines ............................................................................................58
Figure 37. Capacity summary showing insufficient CPU and RAM resources.........59
Figure 38. Specifying number of hosts and amount of CPU and memory ..............59
Figure 39. Specifying datastore size.....................................................................60
Figure 40. Compared scenarios............................................................................60
Figure 41. Combined scenarios............................................................................61
Figure 42. Categorized hybrid cloud environment cost overview ..........................62
Figure 43. vSphere Cluster cost overview.............................................................63
Figure 44. Storage cost overview..........................................................................63
Chapter 1 Executive Summary
This chapter presents the following topics:
Document purpose.....................................................................................................8
Audience....................................................................................................................8
Solution purpose........................................................................................................8
Business challenge....................................................................................................9
Technology solution...................................................................................................9
Document purpose
This document serves as a reference for planning and designing a Pivotal Hadoop
solution that enables IT organizations to quickly deploy Hadoop as a service (HaaS)
on an existing cloud. The solution delivers infrastructure as a service (IaaS)
capabilities to support big data application development. This document introduces
the main features and functionality of the solution, the solution architecture and key
components, and the validated hardware and software environment. It demonstrates
the integration of Pivotal Hadoop Enterprise in the EMC® Hybrid Cloud solution.
The Pivotal Hadoop solution is a modular add-on to the EMC Hybrid Cloud solution.
EMC Hybrid Cloud Solution with VMware: Foundation Infrastructure Reference
Architecture 2.5 and EMC Hybrid Cloud Solution with VMware: Foundation
Infrastructure Solution Guide 2.5 describe the reference architecture and the
foundation solution upon which all the EMC Hybrid Cloud add-on solutions build.
The following documents provide further information about how to implement
specific capabilities or enable specific use cases within the EMC Hybrid Cloud
solution with VMware:
EMC Hybrid Cloud Solution with VMware: Data Protection Continuous
Availability Solution Guide 2.5
EMC Hybrid Cloud Solution with VMware: Data Protection Disaster Recovery
Solution Guide 2.5
EMC Hybrid Cloud Solution with VMware: Data Protection Backup Solution
Guide 2.5
EMC Hybrid Cloud Solution with VMware: Security Solution Guide 2.5
EMC Hybrid Cloud Solution with VMware: Pivotal CF Platform as a Service
Solution Guide 2.5
Audience
This document is intended for executives, managers, architects, cloud
administrators, and technical administrators of IT environments who want to build a
self-service Pivotal Hadoop-based Enterprise big data platform. Readers should be
familiar with VMware vCloud Suite, Pivotal Hadoop, VMware Big Data Extensions
(BDE), EMC ViPR®, general IaaS and software-defined data center concepts, and how
a hybrid cloud infrastructure accommodates these technologies and requirements.
Solution purpose
The EMC Hybrid Cloud solution enables EMC customers to build an enterprise-class,
scalable, multitenant infrastructure that enables:
Complete management of the infrastructure and application service lifecycle
On-demand access to and control of network bandwidth, servers, storage, and
security
Quick deployment of IaaS components to support HaaS-based services without
IT administrator involvement
Scalable, elastic, flexible HaaS-based services for maximum asset utilization
Access to application services from a single platform for both business-critical
and next-generation cloud applications
This solution provides the reference architecture and the best practice guidance
necessary to integrate the key components and functionality of enterprise HaaS into
an underlying EMC Hybrid Cloud infrastructure.
Business challenge
Today’s enterprise demands an agile development platform that can enable the
continuous delivery, updating, and horizontal scalability of applications. The Pivotal
Hadoop (PHD) platform enables developers to easily deploy, bind, and scale
applications and data services. When integrated with VMware vCloud Automation
Center, it delivers a self-service Pivotal Hadoop platform that facilitates rapid
deployment and instant scaling or updating of Hadoop clusters.
HaaS must interoperate with the underlying infrastructure to accommodate
consumable next-generation applications while maintaining existing end-to-end
service delivery, providing:
Efficiency and flexibility
Fast, proactive responses to service requests
An easy as-a-service deployment model
Adequate visibility into the cost of the infrastructure
Technology solution
This EMC Hybrid Cloud solution integrates the best of EMC, VMware, and Pivotal
products and services, and empowers IT organizations to adopt an as-a-service
implementation model of compute and storage infrastructure within the data center.
Agile, elastic, on-demand, end-to-end IaaS provisioning is crucial to support a
comprehensive, dynamic, and fast-growing big data environment.
The key solution components include:
EMC ViPR software-defined storage platform
VMware vCloud Suite cloud management and infrastructure
EMC and VMware integrated workflows
VMware NSX virtual networking technologies
VMware vSphere virtualization platform
VMware Big Data Extensions (BDE) with Project Serengeti
Pivotal Hadoop (PHD)
Chapter 2 EMC Hybrid Cloud Solution Overview
This chapter presents the following topics:
Introduction .............................................................................................................12
EMC Hybrid Cloud features and functionality ...........................................................13
Introduction
The EMC Hybrid Cloud solution enables a well-run hybrid cloud by bringing new
functionality not only to IT organizations, but also to developers, end users, and
line-of-business owners. Beyond delivering baseline infrastructure as a service (IaaS),
built on a software-defined data center (SDDC) architecture, the solution delivers
feature-rich capabilities to expand from IaaS to business-enabling IT as a service
(ITaaS). Backup as a service (BaaS) and disaster recovery as a service (DRaaS) are
now policies that users can enable with just a few mouse clicks. End users and
developers can quickly access a marketplace of resources for Microsoft, Oracle, SAP,
EMC Syncplicity®, and Pivotal applications, and can add third-party packages as
required. All of these resources can be deployed on private cloud or public cloud
services, including VMware vCloud Air, from EMC-powered cloud service providers.
The EMC Hybrid Cloud solution uses the best of EMC and VMware products and
services, and takes advantage of the strong integration between EMC and VMware
technologies to provide the foundation for enabling IaaS on new and existing
infrastructure for the hybrid cloud.
Figure 1 shows the key components of the EMC Hybrid Cloud solution. For detailed
information, refer to EMC Hybrid Cloud Solution with VMware: Foundation
Infrastructure Solution Guide 2.5. For information on EMC Hybrid Cloud modular add-
on solutions, which provide functionality such as data protection, continuous
availability, and application services, refer to Modular add-on components and to the
individual Solution Guides for those add-ons.
Figure 1. EMC Hybrid Cloud key components
EMC Hybrid Cloud features and functionality
The EMC Hybrid Cloud solution incorporates the following features and functionality:
Automation and self-service provisioning
Multitenancy and secure separation
Workload-optimized storage
Elasticity and service assurance
Operational monitoring and management
Metering and chargeback
Modular add-on components
Automation and self-service provisioning
The solution provides self-service provisioning of automated cloud services to both
users and infrastructure administrators. It uses VMware vCloud Automation Center
(vCAC), integrated with EMC ViPR software-defined storage and VMware NSX, to
provide the compute, storage, network, and security virtualization platforms for the
SDDC.
Cloud users can request and manage their own applications and compute resources
within established operational policies. This can reduce IT service delivery times from
days or weeks to minutes. Automation and self-service provisioning features include:
Self-service portal—Provides a cross-cloud storefront that delivers a catalog of
custom-defined services for provisioning workloads based on business and IT
policies, as shown in Figure 2
Role-based entitlements—Ensure that the self-service portal presents only the
virtual machine, application, or service blueprints appropriate to a user’s role
within the business
Resource reservations—Allocate resources for use by a specific group and
ensure that those resources are inaccessible to other groups
Service levels—Define the amount and types of resources that a particular
service can receive during initial provisioning or as part of configuration
changes
Blueprints—Contain the build specifications and automation policies that
define the process for building or reconfiguring compute resources
Figure 2. EMC Hybrid Cloud self-service portal
Multitenancy and secure separation
The solution provides the ability to enforce physical and virtual separation for
multitenancy, as strongly as the administrator requires. This separation can
encompass network, compute, and storage resources to ensure appropriate security
and performance for each tenant.
The solution supports secure multitenancy through vCAC role-based access control
(RBAC), which enables vCAC roles to be mapped to Microsoft Active Directory groups.
The self-service portal shows only the appropriate views, functions, and operations to
cloud users, based on their role within the business.
Workload-optimized storage
The solution enables customers to take advantage of the proven benefits of EMC
storage in a hybrid cloud environment. Using ViPR storage services, which leverage
the capabilities of EMC VNX® and EMC VMAX® storage systems, the solution provides
software-defined, policy-based management of block- and file-based virtual storage.
ViPR abstracts the storage configuration and presents it as a single storage control
point, enabling cloud administrators to access all heterogeneous storage resources
within a data center as if the resources were a single large array.
Elasticity and service assurance
The solution uses the capabilities of vCAC and various EMC tools to provide the
intelligence and visibility required to proactively ensure service levels in virtual and
cloud environments. Infrastructure administrators can add storage, compute, and
network resources to their resource pools as needed. Cloud users can select from a
range of service levels for compute, storage, and data protection for their applications
and can expand the resources of their virtual machines on demand to achieve the
service levels they expect for their application workloads.
Operational monitoring and management
The solution features automated monitoring and management capabilities that
provide IT administrators with a comprehensive view of the cloud environment to
enable smart decision-making for resource provisioning and allocation. These
automated capabilities are based on a combination of EMC ViPR Storage Resource
Management (SRM), VMware vCenter Log Insight, and VMware vCenter Operations
Manager (vC Ops), and use EMC plug-ins for ViPR, VNX, VMAX, and EMC Avamar®
systems to provide extensive additional storage detail.
Cloud administrators can use ViPR SRM to understand and manage the impact that
storage has on their applications and to view their storage topologies from
application to disk, as shown in Figure 3.
Figure 3. EMC ViPR Analytics with VMware vCenter Operations Manager
Capacity analytics and what-if scenarios in vC Ops identify over-provisioned
resources so they can be right-sized for the most efficient use of virtualized
resources. In addition, for centralized logging, infrastructure components can be
configured to forward their logs to vCenter Log Insight, which then aggregates the
logs from all the disparate sources for analytics and reporting.
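As an example of such forwarding, an ESXi host can be pointed at a Log Insight collector with esxcli. The collector host name below is a placeholder, and the protocol and port must match how the Log Insight instance is configured to ingest syslog.

```shell
# Point the ESXi syslog service at the Log Insight collector (host name illustrative)
esxcli system syslog config set --loghost='udp://loginsight.example.local:514'

# Reload the syslog daemon so the new target takes effect
esxcli system syslog reload

# Allow outbound syslog traffic through the ESXi firewall
esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true
esxcli network firewall refresh
```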
Metering and chargeback
The solution uses VMware IT Business Management Suite (ITBM) to provide cloud
administrators with comprehensive metering and cost information across all
business groups in the enterprise. ITBM is integrated into the cloud administrator’s
self-service portal and presents a dashboard overview of the hybrid cloud
infrastructure, as shown in Figure 4.
Figure 4. IT Business Management Suite overview dashboard for hybrid cloud
Modular add-on components
The EMC Hybrid Cloud solution provides modular add-on components for the
following services:
Application services
This add-on solution leverages VMware vCloud Application Director to optimize
application deployment and release management through logical application
blueprints in vCAC. Users can quickly and easily deploy blueprints for
applications and databases such as Microsoft Exchange, Microsoft SQL Server,
Microsoft SharePoint, Oracle, and SAP.
Data protection services
EMC Avamar and EMC Data Domain®
systems provide a backup infrastructure
that offers features such as deduplication, compression, and VMware
integration. By using VMware vCenter Orchestrator (vCO) workflows customized
by EMC, administrators can quickly and easily set up multitier data protection
policies and enable users to select an appropriate policy when they provision
their virtual machines.
Continuous availability
A combination of EMC VPLEX®
virtual storage and VMware vSphere High
Availability (HA) provides the ability to federate information across multiple
data centers over synchronous distances. With virtual storage and virtual
servers working together over distance, the infrastructure can transparently
provide load balancing, real time remote data access, and improved
application protection.
Disaster recovery
This add-on solution enables cloud administrators to select disaster recovery
(DR) protection for their applications and virtual machines when they provision
their hybrid cloud environment. ViPR automatically places these systems on
storage that is protected remotely by EMC RecoverPoint®
technology. VMware
vCenter Site Recovery Manager automates the recovery of all virtual storage and
virtual machines.
Platform as a service
The EMC Hybrid Cloud solution provides an elastic and scalable IaaS
foundation for platform-as-a-service (PaaS) and software-as-a-service (SaaS)
services. Pivotal CF provides a highly available platform that enables
application owners to easily deliver and manage applications over the
application lifecycle. The EMC Hybrid Cloud service offerings enable PaaS
administrators to easily provision compute and storage resources on demand
to support scalability and growth in their Pivotal CF enterprise PaaS
environments.
Chapter 3 EMC Hybrid Cloud Hadoop as a
Service
This chapter presents the following topics:
Overview..................................................................................................................20
EMC Hybrid Cloud HaaS and IaaS .............................................................................20
Pivotal Hadoop.........................................................................................................21
Serengeti .................................................................................................................22
VMware Big Data Extensions....................................................................................22
Overview
This chapter identifies and briefly describes the major features and functionality
required to support Pivotal Hadoop as a service and promote scalability in the EMC
Hybrid Cloud environment.
The main components are:
EMC Hybrid Cloud HaaS and IaaS
Project Serengeti
VMware Big Data Extensions (BDE)
Pivotal Hadoop (PHD)
HaaS self-service portal
EMC Hybrid Cloud HaaS and IaaS
EMC Hybrid Cloud HaaS is a solution stack made up of EMC Hybrid Cloud (EHC)
IaaS, integrated with BDE and PHD. The self-service aspect of the portal is
controlled by vCAC, as shown in
Figure 5.
Hadoop is an open-source software framework that supports the processing of large
data sets in a distributed computing environment. It is a top-level project
sponsored by the Apache Software Foundation. PHD is an Apache Hadoop
distribution.
Deploying a Hadoop cluster using traditional methods is complex and
time-consuming. It typically involves setting up the infrastructure, installing and
configuring the operating system, acquiring the respective Hadoop media, installing
Hadoop components, and finally creating the Hadoop cluster.
This process typically takes weeks and requires a significant skillset. The EMC HaaS
offering simplifies the process by using extensive workflow automation in the EHC
IaaS backend. Through self-service automation, it is now possible to deploy or
expand a Hadoop cluster in minutes using the vCloud Automation Center self-service
portal.
Figure 5. EMC Hybrid Cloud HaaS component overview
Pivotal Hadoop
Pivotal Hadoop (PHD) is Pivotal's distribution of Apache Hadoop, the open-source
framework that supports the processing of large data sets in a distributed
computing environment. The complete PHD platform contains a number of
components, not all of which are used within this solution:
YARN (Yet Another Resource Negotiator)—a distributed processing framework
that can schedule and execute resource requests from multiple applications
HBase—a column-oriented database that runs on top of the Hadoop Distributed
File System (HDFS)
HAWQ—a parallel SQL query engine that combines the merits of the
Greenplum Database Massively Parallel Processing (MPP) relational database
engine and the Hadoop parallel processing framework
ZooKeeper—a centralized service for maintaining configuration information and
naming, and for providing distributed synchronization and group services
Hive—a data warehouse infrastructure built on top of Hadoop
Hadoop MapReduce—a programming model for processing and generating
large data sets with a parallel, distributed algorithm on a cluster
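The MapReduce data flow can be illustrated with an ordinary shell pipeline. The sketch below runs a single-node word count in which `tr`/`awk` play the mapper, `sort` the shuffle, and a second `awk` the reducer; the file paths and sample input are illustrative.

```shell
# Input: two lines of text to count words in
printf 'big data\nbig cloud\n' > /tmp/wc-input.txt

# Map phase: emit one (word, 1) key-value pair per word
tr ' ' '\n' < /tmp/wc-input.txt | awk '{print $0 "\t1"}' > /tmp/wc-mapped.txt

# Shuffle phase: bring identical keys together by sorting
sort /tmp/wc-mapped.txt > /tmp/wc-shuffled.txt

# Reduce phase: sum the counts for each key
awk -F'\t' '{c[$1] += $2} END {for (w in c) print w, c[w]}' /tmp/wc-shuffled.txt | sort
# prints: big 2 / cloud 1 / data 1 (one pair per line)
```

On a real cluster, Hadoop runs the map and reduce steps as distributed tasks and performs the shuffle over the network, but the data flow is the same.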
Figure 6 shows the PHD components.
Figure 6. Pivotal Hadoop (PHD) components
Note: YARN, HBase, HAWQ, and Hive are not referenced in this solution. HAWQ is not
installed by default and must be installed separately; this can be automated through
the use of vCO workflows if required.
Serengeti
Serengeti is an open source project initiated by VMware to enable the deployment
and management of Hadoop and big data clusters in a vCenter Server managed
environment. The key components are the Serengeti Management Server, which
provides a framework for running big data clusters on vSphere, and a command line
interface that provides tools and utilities that form an administrative interface for
managing and monitoring the cluster environments.
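As an illustration of that administrative interface, the following Serengeti CLI session sketches creating and scaling a cluster. The host name, cluster name, and distribution name are placeholders, and option syntax may differ between Serengeti/BDE releases, so treat this as a sketch rather than a verified procedure.

```shell
# From the Serengeti CLI console, connect to the management server
# (host name and port are illustrative)
connect --host bde-mgmt.example.local:8443

# Create a cluster from a Hadoop distribution previously imported into BDE
cluster create --name phd_cluster01 --distro phd

# Show deployed clusters and their node status
cluster list

# Scale out the worker node group on demand
cluster resize --name phd_cluster01 --nodeGroup worker --instanceNum 8
```

These are the same BDE operations that the custom vCO workflows described later drive behind the vCAC self-service catalog.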
VMware vSphere Big Data Extensions
VMware vSphere Big Data Extensions, or BDE, is a feature within vSphere to support
big data and open source Hadoop distribution workloads. BDE provides an integrated
set of management tools to help enterprises deploy, run, and manage Hadoop on a
common virtual infrastructure. Figure 7 shows how BDE is an installable virtual
appliance plug-in that controls and monitors Hadoop Services. The BDE virtual
appliance runs on top of vSphere and uses the Serengeti Management Server to
control cluster creation by cloning templates through the template server.
BDE is a commercial version of Serengeti, which is an open source project from
VMware. BDE provides the features of Serengeti in an enterprise format, including:
An open source supported version of the Apache Hadoop distribution
The Big Data Extensions GUI, integrated into the vSphere Web Client, to
perform Hadoop infrastructure and cluster management tasks
Elastic-enabled clusters that optimize and provide scaling of physical compute
resources in a vSphere environment
Figure 7. BDE and Serengeti stack
Chapter 4 HaaS Component Integration
This chapter presents the following topics:
Overview..................................................................................................................26
Integrating Hadoop components with EMC Hybrid Cloud .........................................26
Configuring the platform..........................................................................................28
Overview
This section provides guidance on configuring the services required for Hadoop as a
Service, specifically BDE and PHD, and integrating them with EMC Hybrid Cloud IaaS
services.
Integrating Hadoop components with EMC Hybrid Cloud
To install and configure Hadoop-as-a-Service components, refer to the appropriate
vendor documentation referenced in the installing and configuring sections for each
component in this chapter.
The steps discussed assume that the EMC Hybrid Cloud has been installed and
configured as described in the EMC Hybrid Cloud Solution with VMware – Foundation
Infrastructure Solution Guide 2.5, and that the IaaS, portal, catalog services, and
tenant structure are all in place.
BDE runs on top of Serengeti. Figure 8 shows the virtual appliance that runs the
Serengeti Management Server and Template Server. BDE provides the GUI for
managing Hadoop clusters, communicating through the Serengeti Management
Server.
Figure 8. BDE and vSphere deployment topology
With VMware vSphere Big Data Extensions, you can deploy Hadoop inside your
VMware vSphere environment. BDE is distributed as a downloadable OVA-based
virtual appliance that is imported into an existing environment. The minimum
requirements to support BDE are vSphere 5.0 or later and an Enterprise or Enterprise
Plus vSphere license. By default, the basic Apache Foundation distribution of Hadoop
is included, and you can easily add other commercial Hadoop distributions such as
Pivotal Hadoop, Cloudera Hadoop, Hortonworks Hadoop, or MapR Hadoop. This
solution uses the Pivotal Hadoop distribution integrated with the EMC Hybrid Cloud
IaaS stack to create Hadoop as a Service.
BDE Topology
After BDE is installed, you can begin creating a virtual Hadoop cluster. You can
specify a number of configuration options including distribution, topology (basic,
compute/storage separation, HBase-only, or custom), and the number and size of the
virtual machines for each of the Hadoop roles (for example, name node, client node,
and data nodes). Note that the options presented in the web interface are only a
fraction of what can be invoked through the advanced command-line tools and API.
When you deploy a Hadoop cluster, BDE clones the appropriate virtual machines and
automatically builds out the cluster. After the cluster is deployed, you can scale it up
(increase the size of the virtual machines' memory and CPU resources) or scale it out
(increase the number of virtual machines). For additional flexibility and efficiency,
you can configure the cluster to scale automatically as the load changes.
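The same scaling operations are exposed through the Serengeti CLI. As a hedged sketch, assuming the cluster and node-group names used here:

```shell
# Scale out: grow the worker node group to ten virtual machines
cluster resize --name demo_cluster --nodeGroup worker --instanceNum 10

# Enable automatic elasticity so compute nodes are added and removed with load
cluster setParam --name demo_cluster --elasticityMode auto
```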
Some of the benefits of virtualizing Hadoop—for example, elasticity and multi-
tenancy—arise from the increased number of deployment options that become
available when Hadoop is virtualized. Figure 9 shows the evolution of virtual Hadoop,
from self-contained to a tenant-based model.
Figure 9. The evolution of virtual Hadoop
The traditional Hadoop model combines compute and data. While this
implementation is straightforward, representing how the physical Hadoop model can
be directly translated into a virtual machine, the ability to scale up and down is
limited because the lifecycle of this type of virtual machine is tightly coupled to the
data it manages. Powering off a virtual machine with combined storage and
computing means access to its data is lost. Scaling out by adding more nodes would
necessitate rebalancing data across the expanded cluster, so this model is not
particularly elastic.
Separating computing from storage in a virtual Hadoop cluster can achieve compute
elasticity, enabling mixed workloads to run on the same virtualization platform and
improving resource utilization. It is simple to configure using an HDFS data layer that is
always available, along with a compute layer comprising a variable number of
TaskTracker nodes, which can be expanded and contracted on demand.
Extending the concept of data-compute separation, multiple tenants can be
accommodated on the virtualized Hadoop cluster by running multiple Hadoop
compute clusters against the same data service. Using this model, each virtual
compute cluster enjoys performance, security, and configuration isolation.
While Hadoop performance using the combined data-compute model on vSphere is
similar to its performance on physical hardware, giving virtualized Hadoop
increased topology awareness can enable the data locality needed to maintain
performance when the data and compute layers are separated. Topology awareness
allows Hadoop operators to realize elasticity and multi-tenancy benefits when data
storage and computing are separated. Furthermore, topology awareness can improve
reliability when multiple nodes of the same Hadoop cluster are colocated on the
same physical host.
To optimize the data locality and failure group characteristics of virtualized Hadoop:
Group virtual Hadoop nodes on the same physical host into the same failure
domain, so that multiple replicas of the same data are not placed within it.
Maximize usage of the virtual network between virtual nodes on the same
physical host. The virtual network has higher throughput and lower latency than
the physical network and does not consume any physical switch bandwidth.
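In BDE and Serengeti, this placement information is supplied as a rack-to-host topology mapping that clusters created with a rack-aware topology can honor. A minimal sketch, with hypothetical rack and host names:

```shell
# topology.data: one "rack: host, host" mapping per line (names are examples)
cat > topology.data <<'EOF'
rack1: esxi-host-01, esxi-host-02
rack2: esxi-host-03, esxi-host-04
EOF

# From the Serengeti CLI, upload the mapping and verify it
topology upload --fileName topology.data
topology list
```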
Configuring the platform
Refer to VMware vSphere Big Data Extensions Administrator's and User's Guide to
install and configure the BDE components required for Hadoop as a Service.
Configuration task order
The following steps outline the high-level tasks you need to perform to install and
configure BDE:
1. Ensure that the environment meets the minimum vSphere requirements, that
the correct licensing is in place, and that the compute, storage, and networking
prerequisites are met.
2. Configure cluster settings, including vSphere HA, vSphere Distributed
Resource Scheduler (DRS), host monitoring, and admission control.
3. Configure network settings using a standard vSwitch, vSphere Distributed
Switch (vDS), or NSX. Ensure that the required ports are configured as part of
any firewall policy.
4. Deploy the BDE OVF file and assign the management network. During
deployment, the setup asks for a destination port group; this is the port group
through which the management server communicates with vCenter, so it must
map to the correct VLAN ID. If vCenter and BDE are unable to communicate
with each other, the integration will fail.
Configuring SSO service
An important step in the configuration process is to configure the SSO service
and management server IP addresses.
1. As shown in Figure 10, from the left pane in the Deploy OVF Template page
select Customize template.
2. In the VC SSO Lookup Service URL box, type the vCenter Server fully qualified
domain name (FQDN) in the same format as shown (if the default server name
has not been changed). If you do not specify the FQDN here, the certificate
will not be accepted and a connection issue will occur between BDE and the
Serengeti server later.
3. Under Management Server Network Settings, enter the appropriate IP address
settings.
Figure 10. Configuring the SSO lookup service and management server IP addresses
Starting BDE in vSphere
After successfully installing and configuring BDE within vSphere, power on the BDE
management server and then register BDE within vSphere as the final part of
configuration by performing the following steps:
1. Log in to the vSphere client with administrative privileges.
2. Within the vSphere client, locate the BDE management server. The
management server is located under the datacenter resource pool in which it
was deployed.
3. Select and record the management IP address.
4. Register the management server using the register plugin URL:
https://management-server-ip-address:8443/register-plugin where
management-server-ip-address is the IP address you recorded in step 3.
5. Complete the required registration information and then click Submit.
The BDE icon should now be available in the list of objects within the inventory.
Before installing and configuring PHD, download the following required components
and make them available for the installation:
CentOS 6.2 64-bit ISO
Pivotal Hadoop tar files
Oracle JDK 7 64-bit RPM for CentOS
Big Data Extensions OVF
VMware BDE comes supplied with a default Hadoop distribution from Apache. The
HaaS integration requires that Pivotal Hadoop be installed. Get the Pivotal Hadoop
media and documentation from http://www.gopivotal.com/big-data/pivotal-hd, and
register and obtain the necessary licenses. The following high-level tasks outline the
process to load the media and create a PHD template within the BDE configuration.
Installing PHD
To create the required installation configuration for BDE, use Yum repositories (as
opposed to a tarball). When you create a Yum-deployed Hadoop cluster, the
Hadoop nodes within the cluster download the Red Hat Package Manager (RPM)
packages for the Pivotal Hadoop distribution from the official Yum repositories.
The Pivotal Hadoop distribution must be installed on a 64-bit version of the CentOS
6.x operating system. You must use either CentOS 6.2 or CentOS 6.4 to create the
Hadoop template virtual machine. The template is used in the cloning process for
creating a Hadoop cluster. After you have deployed the BDE OVF, integrate Yum into
PHD by creating a Yum repository as outlined below, and then create the template.
Creating a Yum repository for PHD
The steps for configuring PHD with BDE are described in the VMware vSphere Big Data
Extensions Administrator’s and User’s Guide.
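At a high level, the VMware procedure amounts to extracting the PHD packages on a web-reachable host, generating Yum repository metadata, and pointing a .repo definition at it. The paths, file name, URL, and repository ID below are illustrative assumptions, not the exact values from the guide:

```shell
# On the repository host: extract the PHD packages (file name will vary)
mkdir -p /var/www/html/phd
tar -xzf PHD-1.1.x.tar.gz -C /var/www/html/phd

# Generate Yum repository metadata over the extracted RPMs
createrepo /var/www/html/phd

# A .repo definition that cluster nodes use to resolve PHD packages
cat > /etc/yum.repos.d/phd.repo <<'EOF'
[phd]
name=Pivotal HD
baseurl=http://yum-server.mycompany.local/phd/
enabled=1
gpgcheck=0
EOF
```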
Creating a Hadoop template virtual machine
You must use either CentOS 6.2 or CentOS 6.4 to create the Hadoop template virtual
machine. To upgrade from a previous version, refer to the chapter titled “Create a
Hadoop Template Virtual Machine using RHEL Server 6.x” in the VMware vSphere Big
Data Extensions Administrator’s and User’s Guide.
The following steps outline the procedure for creating a Hadoop template virtual
machine:
1. Import the PHD binaries and create PHD media by logging into the BDE
management server and importing the PHD tar files into an appropriate
directory structure on the server. Figure 11 shows the binary import process.
Figure 11. Importing Hadoop binaries into BDE management server
2. Test that the import was successful by accessing the URL path from a browser
and ensuring that the expected folders are present.
3. After installing the media into the BDE management server, create a new
Pivotal Hadoop template.
4. Make the new Pivotal Hadoop template the default template by removing the
default Hadoop Apache template from the BDE management server, as shown
in Figure 12.
Figure 12. Removing the default Apache template from BDE
Configuring custom resources for BDE
VMware BDE requires two resource types when automating Hadoop clusters:
networking resources and storage resources.
Networking resources
Networking resources are used to assign IP addresses to virtual machines. BDE
deploys all nodes of a Hadoop cluster from a single common CentOS template that
comes preconfigured with the BDE vApp management server. As BDE deploys virtual
machines into a cluster, it uses either an existing DHCP server or a statically created
IP address pool. As part of the deployment process, BDE assigns hostnames that are
the same as the IP addresses. For example, if DHCP assigns 10.10.10.10, then the
hostname of that virtual machine is 10.10.10.10. Hadoop then uses this hostname
for the cluster.
Storage resources
BDE defines two types of storage resources—local and shared. Shared storage is
useful for management or client servers deployed by BDE as shared storage can be
protected with technologies such as VMware HA.
Within Hadoop there are two types of nodes: master and worker nodes. Master
nodes provide tracking functions whereas worker nodes provide job processing
capabilities. Because worker nodes are disposable, they do not require top tier
storage since Hadoop is designed to deal with node failure. There is also no reason to
deploy worker nodes on shared storage. The chosen storage must, however, deliver
the required level of performance for the nodes. Allowing BDE
to use local VMFS storage for worker nodes is analogous to deploying physical worker
nodes on commodity storage using direct attached storage.
The final stage of configuration is to assign storage resources to BDE. This defines
how the Hadoop clusters are deployed, either using local or shared datastores. By
default, BDE defines datastores as local. If you need shared datastores, you must
configure the datastores accordingly. Refer to Chapter 6 of the VMware vSphere Big
Data Extensions Administrator’s and User’s Guide for details on how to add
datastores and networks to a cluster from the vSphere client.
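These resources can also be declared through the Serengeti CLI. The following is a sketch with assumed resource names, datastore naming patterns, and port group:

```shell
# Register local VMFS datastores (wildcard pattern) for worker nodes
datastore add --name localDS --spec "local-datastore-*" --type LOCAL

# Register shared datastores for master and client nodes
datastore add --name sharedDS --spec "san-datastore-*" --type SHARED

# Register a network resource backed by an existing port group, using DHCP
network add --name defaultNetwork --portGroup pg-hadoop --dhcp
```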
Installing and configuring EMC Hybrid Cloud IaaS
For details, refer to the EMC Hybrid Cloud Solution with VMware – Foundation
Infrastructure Reference Architecture 2.5. Detailed installation and configuration
information is available only to select EMC personnel and authorized partners.
Chapter 5 Creating vCO Workflows and vCAC
Catalog Services for HaaS
This chapter presents the following topics:
Overview..................................................................................................................36
Importing and modifying custom vCO workflows .....................................................36
Creating vCAC Catalog Services ...............................................................................45
Overview
The automation of Hadoop clusters is achieved by using custom workflows created
with VMware vCenter Orchestrator (vCO). This chapter describes how these workflows
are configured from within VMware vCloud Automation Center (vCAC) to present
enterprise organizations with a self-service portal that includes a catalog of
preconfigured Hadoop deployment scenarios.
Importing and modifying custom vCO workflows
To use HaaS within EMC Hybrid Cloud, the administrator must use custom vCO
workflows for deploying HaaS. These workflows offer a choice of cluster sizes that can
then be presented as catalog items from the vCloud Automation Center portal. The
workflows are imported into VMware vCO using the vCO import function so that they
can be edited, tested, and packaged according to the needs of the organization.
This section describes the process for importing the custom workflows into vCO, so
that the Hadoop Administrator can alter them and link them with the big data cluster
configurations created in the earlier stages of the process.
Importing custom workflows
From within the vCO client, as shown in Figure 13, select Run, click Workflows, and
select Import workflow. Browse to the location where you have placed the workflow
package and click Open. The imported workflow appears in the folder selected.
Figure 13. Importing custom workflows into vCO
Validating workflows
After importing the workflows into vCO, validate them by clicking the name of the
folder containing the workflows and then selecting the Validate option from the
context menu, as shown in Figure 14. The validation process ensures there are no
open ends, unreachable workflow elements, or unused attributes in the workflows, so
that they will execute correctly.
Figure 14. Using the validate workflows action
Customizing HaaS workflows
The HaaS workflows provide a framework for deploying each Hadoop cluster
configuration of a given size through an automated workflow. The Hadoop
administrator should modify the attributes of these workflows to meet the specific
needs of the organization. Figure 15 shows how to use the vCO client to edit the
attributes within a workflow.
Figure 15. How to edit the attributes
Configuring custom parameters
To make the workflows dynamic, vCO uses a combination of attributes and
parameters to transfer data when it is processing a workflow. Workflow parameters
must receive an input to generate an output or action. For example, an input
received from the user or system can be passed to a command or script that creates
a username or password, which in turn can be passed to the Hadoop cluster for
authentication.
Figure 16 shows how to create a custom username and password for the Hadoop
Client node.
Figure 16. Editing and creating custom parameter passing
Launching a custom script
Scripts help to edit the schema, which is the main component of a workflow.
Launching individual scripts lets you test the components of the workflow one
element at a time, or execute a script at runtime to prepare the data set, for example.
Figure 17 shows how to launch scripting from within the workflow by using the
Schema panel within the workflow itself.
Figure 17. Launching scripts from vCO
Testing vCO HaaS custom workflows
The previous sections demonstrated how to import the HaaS sample workflows into
EMC Hybrid Cloud, specifically the vCenter Orchestrator which is the main
orchestration and automation engine for the solution. As shown, once imported, the
default workflows can be altered to meet any modifications made to the Hadoop
clusters. The workflows can also be modified to pass any additional parameters that
may be required, for example, a username and password, or to execute additional
script components.
The final stage in importing and configuring the workflows is to test the workflows
that have been imported and modified for each of the HaaS cluster sizes (micro
cluster, small cluster, and large cluster). Figure 18 shows how to:
Select the specific workflow for a given cluster size
Execute the workflow from vCO
View the execution process
Verify the execution progress by checking the log files for any error messages
Figure 18. Launching of Micro Hadoop Cluster workflow
Viewing cluster creation
After the vCO workflow is launched, the cluster creation process starts within vSphere
and BDE. The management server uses the template server to clone the numbers and
types of nodes required to create the cluster. To view and verify the cluster creation
process, follow these steps:
1. Log in to the vSphere web client.
2. Go to the BDE and view the actual cluster being created.
Figure 19 shows the status of the creation of a micro Hadoop cluster in the BDE panel
of the vSphere web client.
Figure 19. Status of creation of Micro Hadoop cluster from BDE (vSphere web client)
You can also log in to the vSphere Client Application and view the Hadoop
cluster being created. Figure 20 shows the status of the creation of the Micro
Hadoop cluster in the vSphere Client Application.
Figure 20. Status of Micro Hadoop cluster creation from BDE vSphere Client
Creating BDE Clusters
After the vCO workflows are imported, they need to be customized for the different
cluster sizes according to the requirements of the enterprise. The examples
provided describe micro, small, and large Hadoop clusters.
The custom workflows define the type of the cluster, including cluster configuration,
in terms of the number of master nodes, client nodes, and data nodes for each size.
Creating a Hadoop cluster
These steps document the procedure for creating a Hadoop cluster within BDE, which
can then be translated when building a vCO workflow:
1. In vCenter, under Objects > Big Data Extensions, click New Big Data Cluster.
2. Follow the steps in the wizard, specifying the appropriate parameters as
required. More detail can be found in the VMware vSphere Big Data
Extensions Administrator’s and User’s Guide.
The following sections outline the options and details required during the cluster
configuration process.
Naming a Hadoop cluster
When prompted by the wizard, type a name to identify the cluster. Valid characters for
cluster names are alphanumeric and underscores. When choosing a cluster name you
should also consider the associated vApp name. Together the vApp and cluster name
must be less than 80 characters.
Configuring the Hadoop distribution
When configuring a Hadoop cluster, you must select the correct Hadoop distribution
from the Hadoop distribution list box. Change the default from Apache to Pivotal HD,
as shown in Figure 21. The distribution name matches the value of the name
parameter that was passed to the config-distro.rb script when the Hadoop
distribution was configured. For a Pivotal PHD 1.1 cluster, you must configure a valid
DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and
FQDN settings, the cluster creation process might fail, or the cluster might be created
but not function.
Figure 21. Create and name a new Big Data Cluster
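Forward and reverse DNS resolution can be sanity-checked from the management server before creating the cluster. A brief sketch; the hostname and address below are assumptions:

```shell
# Forward lookup: the cluster FQDN must resolve to an address
nslookup phd-cluster.mycompany.local

# Reverse lookup: the address must resolve back to the same FQDN
nslookup 10.10.10.10
```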
Specifying deployment type
When prompted by the wizard, select the deployment type for the cluster, either Basic
Hadoop Cluster or Data/Compute Separation Cluster. The type of cluster you create
determines the available node group selections.
Identifying the DataMaster node group
The DataMaster node is a virtual machine that runs the Hadoop NameNode service.
This node manages the HDFS data stored on the worker node group. To identify the
group:
1. Select a resource template from the list box or select Customize to create a
custom resource template.
2. For the master node, specify shared storage so that the virtual machine is
protected with vSphere HA.
Identifying the ComputeMaster node group
The ComputeMaster node is a virtual machine that runs the Hadoop JobTracker
service. This node assigns tasks to Hadoop TaskTracker services deployed in the
worker node group. To identify the group:
1. Select a resource template from the list box or select Customize to create a
custom resource template.
2. For the master node, specify shared storage so that the virtual machine is
protected with vSphere HA.
Identifying the HBaseMaster node group (HBase cluster only)
The HBaseMaster node is a virtual machine that runs the HBase master service. This
node orchestrates a cluster of one or more RegionServer slave nodes. To identify the
group:
1. Select a resource template from the list box or select Customize to create a
custom resource template.
2. For the master node, specify shared storage so that the virtual machine is
protected with vSphere HA.
Identifying the Worker node group
Worker nodes are virtual machines that run the Hadoop DataNode, TaskTracker, and
HBase HRegionServer services. These nodes store HDFS data and execute tasks. To
identify the group:
1. Select a resource template from the list box or select Customize to create a
custom resource template.
2. For the worker nodes, use local storage.
Note: You can add nodes to the worker node group by using Scale Out Cluster, but you
cannot reduce the number of nodes.
Identifying the Client node group
A client node is a virtual machine that contains Hadoop client components. From this
virtual machine you can access HDFS, submit MapReduce jobs, run Pig scripts, run
Hive queries, and run HBase commands. When configuring the cluster for use with
HaaS, you do not configure the Client node group unless any of these configuration
items are required outside of the HaaS solution.
To identify the group:
1. Select a resource template from the list box or select Customize to create a
custom resource template.
2. For the client nodes, use local storage.
Note: You can add nodes to the client node group by using Scale Out Cluster, but you
cannot reduce the number of nodes.
Selecting the Hadoop topology configuration
When you create a cluster with BDE, BDE disables automatic migration for the
cluster’s virtual machines. This prevents vSphere from automatically migrating the
virtual machines, but it does not prevent an administrator from unintentionally
migrating nodes to other vCenter hosts. Do not migrate nodes from within vCenter,
because this could break the cluster placement policy.
As part of the final cluster configuration, select the topology configuration that you
want the cluster to use: RACK_AS_RACK, HOST_AS_RACK, HVE, or NONE. More
information is available in the section “About Cluster Topology” in Chapter 7 of the
VMware vSphere Big Data Extensions Administrator’s and User’s Guide.
Creating vCAC Catalog Services
The focus of customization for this EMC Hybrid Cloud solution is the VMware vCAC
user self-service portal, where additional functionality is included to enable
additional services for cloud users. The final stage of integrating Hadoop as a Service
is to present to vCAC the HaaS workflows that have been imported and modified, so
that they can be selected as catalog items.
VMware vCAC 6.0 provides the extensibility to enable IaaS functionality through
Advanced Service blueprints. The IaaS functionality is achieved by exposing custom
vCO workflows that the vCAC 6.0 portal can present as a catalog of services for cloud
users.
You can create custom workflow definitions using vCAC Designer. The vCAC Designer
console provides a visual workflow editor for customizing vCAC lifecycle workflows.
The extensibility toolkits include a library of activities that serve as building blocks for
custom workflows.
Using the Advanced Service Designer, you can define new service offerings and
publish them to the common catalog as catalog items.
To create the service blueprints you must access vCAC from a browser and log in to
vCAC.
Each tenant has a unique URL to the vCAC console:
The default tenant URL is in the following format:
https://hostname/shell-ui-app
where hostname is the Fully Qualified Domain Name (FQDN) of a vCAC host.
The URL for additional tenants is in the following format:
https://hostname/shell-ui-app/org/tenantURL
where tenantURL is the URL name specified when the tenant is being created.
This is the workspace in which the customer creates catalog services.
The following steps demonstrate, at a high level, how to integrate the HaaS workflows
into the vCAC self-service catalog by showing the creation of:
Catalog services
Blueprints
Custom resources and resource actions
For more information, refer to the vCloud Automation Center Extensibility Guide.
To integrate the HaaS workflows into the vCAC self-service catalog, follow these
steps:
1. From the main vCAC portal page, click Advanced Services to list all of the
current service blueprints defined.
2. Click the green “plus” symbol, shown in Figure 22, to create a new service
blueprint.
Figure 22. Advanced Service Designer
Follow these steps to create a new service blueprint:
1. Select one of the imported Hadoop Cluster Creation workflows from the list.
2. Name the new service and create a form to support user input for the required
parameters. If required, delete the default form and create a new form.
3. Drag and drop any appropriate input fields onto the form.
4. Publish the new service to create the appropriate service definition in the
catalog management.
5. Assign a catalog management service to the new advanced service, and
create the appropriate entitlement definition in the catalog management, as
shown in Figure 23.
Figure 23. Edit Entitlement window
When these tasks are completed, the new service is then available in the service
catalog for the cloud administrator. It is possible to replace the default VMware logo
icons in the service catalog with more suitable HaaS icons. The replacement of icons
is the final stage of customization and ensures that the service catalog items are
tailored to a specific function or application. Icons can be replaced from the
Catalog Management menu by selecting the Catalog Items list box, selecting the
option to configure an icon, and then browsing to and selecting a new icon.
After the configuration stages have been performed within vCAC, the service catalog
is available to provision HaaS items, as shown in Figure 24.
Figure 24. vCAC Service Catalog showing Hadoop as a Service
Chapter 6 Use Cases: EMC Hybrid Cloud IaaS
This chapter presents the following topics:
Overview..................................................................................................................50
IaaS – storage services ............................................................................................50
Monitoring and capacity planning............................................................................57
Metering and chargeback ........................................................................................61
Overview
This chapter covers EMC Hybrid Cloud IaaS use cases that extend the
functionality of the solution beyond virtual machine provisioning to the
consumption of additional infrastructure resources.
From time to time, additional physical resources are required to support the
extension of a Hadoop environment. The following sections show how EMC Hybrid
Cloud (EHC) storage provisioning workflows can be used to create additional
resources on demand by provisioning storage as required, and how the VMware
vC Ops tool set can be used to analyze consumed resources, support capacity
planning, and model scenarios that increase physical resources and VM and node
capacity.
IaaS – storage services
Storage is provisioned, allocated, and consumed by different cloud users in this
solution.
For vCAC IaaS users, the storage services provided in the vCAC service catalog
provision storage resources that will be allocated to and consumed by other cloud
users.
Once the storage resources are available, fabric group administrators can assign the
resources to business groups. Creators of virtual machine blueprints (business group
managers) can then configure their blueprints to use those particular storage
resources for the list of virtual machine disks.
When they provision virtual machines, cloud users consume the storage and,
depending on their entitlements, may choose the storage service for their virtual
machines.
Use case 1: Storage provisioning
This use case demonstrates how ViPR software-defined storage is provisioned for the
hybrid cloud from the VMware vCAC self-service catalog.
1. To provision block or file storage from the vCAC self-service portal, select the
Provision Cloud Storage item from the vCAC service catalog, as shown in
Figure 25.
Figure 25. Storage Services - Provision cloud storage
The storage service blueprint can be created using vCAC anything-as-a-service
(XaaS) functionality in the vCAC Advanced Service Designer. EMC ViPR
provisioning workflows, which are presented by vCO to the vCAC service
catalog, support storage services.
The storage provisioned by the IaaS user enables the fabric group
administrator to make storage resources available to their business group.
The storage provisioning request requires very little input from the vCAC IaaS
user.
The main inputs required are:
Datastore Type: VMFS or NFS
Datastore Size
vCenter Cluster
Storage Tier
Most of these inputs, except the datastore size, are selected from pre-populated
list boxes whose items are determined by the cluster resources available through
vCenter and the virtual pools available in ViPR.
After entering a description and reason for the storage-provisioning request,
enter your password. Because the vCenter Server may manage multiple ESXi
clusters, you must choose the relevant vCenter cluster to tell the provisioning
operation where to assign the storage device. Select a vCenter cluster from the
next screen, as shown in Figure 26.
Figure 26. Provision Cloud Storage – select vCenter cluster
2. Select the type of datastore you require from the list of available storage
types, as shown in Figure 27. A datastore type of VMFS requires block
storage, while NFS requires file storage. Other data services such as disaster
recovery and continuous availability are displayed as appropriate only if
detected in the underlying infrastructure.
Figure 27. Storage Provisioning – Select datastore type
3. Select the storage offering from which the new storage device should be
provisioned. The list of available storage offerings is based on the datastore
type selected, such as VMFS or NFS, and the matching virtual pools
available from the ViPR virtual array.
In this example, a single NFS-based ViPR virtual pool is available to provision
storage from, with the available capacity of the virtual pool also displayed to
the user, as shown in Figure 28.
The storage pools listed have been configured in the EMC ViPR virtual array
and their storage capabilities are associated with storage profiles created in
vCenter.
Figure 28. Storage provisioning – Choose ViPR storage pool
4. Enter the size required for the new storage, in GB, as shown in Figure 29.
Figure 29. Storage provisioning – Enter storage size
5. The fabric group administrator must reserve the new Storage Pool for use by
the business group, as shown in Figure 30.
Figure 30. Provision Storage – Storage Reservation for vCAC Business Group
When the automated process sends an email notification to the fabric group
administrator that the storage is ready and available in vCAC, the fabric group
administrator can then assign capacity reservations on the device for use by
the business group.
In this example, a number of required input values, such as LUN or datastore name,
have been masked from the user during the storage provisioning request process.
Some of these values are locked in and managed by the orchestration process and
its logic to ensure consistency.
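The four user-facing inputs listed in step 1 (datastore type, size, vCenter cluster, and storage tier) can be pictured as a small validated request structure. This is an illustrative sketch only; the field names and tier values are assumptions, not the actual vCAC or ViPR API schema:

```python
# Illustrative validation of a storage-provisioning request.
# Field names and tier values are assumptions, not the vCAC API schema.

VALID_TYPES = {"VMFS", "NFS"}  # VMFS requires block storage, NFS requires file storage

def validate_request(datastore_type, size_gb, vcenter_cluster, storage_tier,
                     available_tiers):
    """Return a request dict, raising ValueError on invalid input."""
    if datastore_type not in VALID_TYPES:
        raise ValueError("Datastore type must be VMFS or NFS")
    if size_gb <= 0:
        raise ValueError("Datastore size must be a positive number of GB")
    if storage_tier not in available_tiers:
        raise ValueError("Storage tier %r not offered by ViPR" % storage_tier)
    return {"type": datastore_type, "sizeGB": size_gb,
            "cluster": vcenter_cluster, "tier": storage_tier}

req = validate_request("NFS", 500, "Cluster-01", "Tier 2", {"Tier 1", "Tier 2"})
print(req)
```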
In addition to the initial provisioning of storage to the ESXi cluster at the vSphere
layer, this solution provides further automation and integration of the new storage up
into the vCAC layer. The ViPR storage provider automatically tags the storage device
with the appropriate storage profile based on its storage capabilities.
The remaining automated steps in this solution are:
vCAC rediscovery of resources under vCenter endpoint
vCAC storage reservation policy assigned to new datastore
vCAC fabric group administrator notification of availability of new datastore
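The remaining automated steps above can be pictured as a simple ordered pipeline. This is an illustrative sketch of the sequence only; the step functions are hypothetical stand-ins for the actual vCO workflow actions:

```python
# Illustrative sequence of the automated post-provisioning steps.
# The step functions are hypothetical stand-ins for vCO workflow actions.

def rediscover_vcenter_endpoint(log):
    log.append("vCAC rediscovered resources under the vCenter endpoint")

def assign_reservation_policy(log, datastore, policy):
    log.append("Reservation policy %s assigned to %s" % (policy, datastore))

def notify_fabric_admin(log, datastore):
    log.append("Fabric group administrator notified: %s is available" % datastore)

def post_provision(datastore, policy):
    """Run the three automated steps in order and return the audit log."""
    log = []
    rediscover_vcenter_endpoint(log)
    assign_reservation_policy(log, datastore, policy)
    notify_fabric_admin(log, datastore)
    return log

for line in post_provision("Cluster-01-NFS-01", "Tier2-NFS"):
    print(line)
```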
Use case 2: Select virtual machine storage
This use case demonstrates how cloud users can consume the available storage
service offerings. This use case is part of the broader virtual machine deployment use
case, but here it relates directly to how the business group manager and users can
manage the storage service offerings available to them.
VMware vCAC business group managers and users can select the appropriate storage
for their virtual machine through the VMware vCAC user portal.
For business group managers, the storage type for the virtual machine disks can be
set during the creation of a virtual machine blueprint. As shown in Figure 31, the
relevant storage reservation policy can be applied to each of the virtual disks.
Figure 31. Set storage reservation policy for virtual machine disks
After the storage reservation policy is set, the blueprint will always deploy this virtual
machine and its virtual disks to that storage type. If more user control is required,
the business group manager can allow business group users to reconfigure the
storage reservation policies at deployment time by selecting the checkbox Allow
user to see and change storage reservation policies.
Use case 3: Metering storage services
This solution uses VMware IT Business Management Suite (ITBM) to provide
chargeback information on the storage service offerings for the hybrid cloud. Through
its integration with VMware vCenter and vCAC, ITBM enables the cloud administrator
to automatically track utilization of the storage resources provided by EMC ViPR.
The EMC ViPR VASA provider in vCenter automatically captures the underlying storage
capabilities of LUNs provisioned from virtual pools on the EMC ViPR virtual array.
Storage profiles are created based on these storage capabilities, which are aligned
with the storage service offerings. This integration enables ITBM to automatically
discover and group datastores based on predefined service levels of storage.
In this solution we created a separate virtual machine storage profile for each of the
storage service offerings, as shown in Figure 32.
Figure 32. Create new virtual machine storage profile for Tier 2 storage
The storage capabilities are shown automatically in vSphere, as shown in Figure 33,
where Tier 2 EMC ViPR storage is supporting a datastore.
Figure 33. Automatic discovery of storage capabilities using EMC ViPR Storage Provider
Note: Storage capabilities are only visible in the traditional vSphere client and not in the
web client. Also, the web client uses virtual machine storage policies in place of virtual
machine storage profiles.
After the EMC ViPR Storage Provider has automatically configured the datastores with
the appropriate storage profiles, the datastores can be grouped and managed in
ITBM in line with their storage profiles. Figure 34 shows that the cost profiles created
in vCenter are discovered by ITBM. This allows the business management
administrator to group tiered datastores provisioned with ViPR and set the monthly
cost per GB as needed.
Figure 34. VMware ITBM chargeback based on storage profile of datastore
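Once datastores are grouped by storage profile, the chargeback arithmetic itself is straightforward: the monthly datastore cost is the capacity multiplied by the per-GB rate of its tier. A sketch of that calculation, with invented rates rather than ITBM values:

```python
# Illustrative monthly storage chargeback by tier; the rates are invented,
# not values from ITBM or this solution.
RATE_PER_GB = {"Tier 1": 0.30, "Tier 2": 0.18, "Tier 3": 0.10}  # USD per GB per month

def monthly_cost(capacity_gb, tier):
    """Monthly chargeback for a datastore of the given capacity and tier."""
    return capacity_gb * RATE_PER_GB[tier]

datastores = [("Cluster-01-NFS-01", 500, "Tier 2"),
              ("Cluster-01-VMFS-01", 1024, "Tier 1")]
for name, gb, tier in datastores:
    print("%s: $%.2f/month" % (name, monthly_cost(gb, tier)))
```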
Summary
VMware vCAC can provide a storefront for storage services to be used by cloud users.
These service catalog items deploy EMC ViPR software-defined storage services
based on multiple service offerings of block and file storage across EMC VNX and
VMAX storage arrays. Each service offers varying levels of availability, capacity, and
performance to satisfy the operational requirements of different lines of business.
This solution combines EMC ViPR and EMC array-based FAST-enabled storage service
offerings across the EMC storage arrays with VMware vSphere to simplify storage
operations for hybrid cloud consumers.
Monitoring and capacity planning
The vCenter Operations Management Suite has functions that can help HaaS
administrators to achieve the following goals:
Eliminate or significantly reduce the manual problem-solving effort in the
environment.
Proactively manage core service and cloud infrastructure performance, and
utilize infrastructure resources optimally.
Provide proactive warnings regarding performance issues before problems
affect the end user. Real-time performance dashboards enable service
providers to meet their SLAs by highlighting potential performance issues
before end users notice these issues.
Infrastructure maintenance and operations teams need the end-to-end visibility and
intelligence to make fast, informed operational decisions to proactively ensure
service levels in cloud environments. They need to get to the root cause of
performance problems quickly, optimize capacity in real time, and maintain
compliance in a dynamic environment of constant change.
The vCenter Operations Management Suite offers many features and functions to
deliver quality of service, operational efficiency, and continuous compliance for your
dynamic cloud infrastructure and business critical applications.
This section describes in detail the capacity planning functions that can help you to
predict the impact on underlying infrastructure of new HaaS deployments or of
upgrading current HaaS instances with new services.
Forecasting capacity risks in vCenter Operations Manager involves creating what-if
scenarios to examine the demand and supply of resources in the cloud infrastructure.
A what-if scenario is a supposition about how capacity and load might change under
certain conditions, such as an increased or decreased number of ESX hosts,
storage resources, or virtual machines in the environment, without making actual
changes to your virtual infrastructure. If you implement the scenario, you know in
advance what your capacity requirements are.
To create a what-if scenario, you can use models and profiles based on current
resource consumption in the existing environment. Alternatively, you can manually
define amounts of virtual machine RAM, storage, CPU, and utilization in a new
consumption profile, as shown in Figure 35, to predict the potential impact of growth.
Figure 35. Choosing virtual machine consumption models and profiles
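The arithmetic behind such a consumption profile can be sketched as follows: multiply the per-VM profile by the number of planned virtual machines and compare the demand with the remaining cluster capacity. All figures below are illustrative, not values from this solution:

```python
# Illustrative what-if capacity check; all figures are made up.

def whatif(n_vms, vm_profile, free_capacity):
    """Return (demand, shortfalls) for adding n_vms with the given profile."""
    demand = {k: v * n_vms for k, v in vm_profile.items()}
    shortfalls = {k: demand[k] - free_capacity[k]
                  for k in demand if demand[k] > free_capacity[k]}
    return demand, shortfalls

profile = {"vcpu": 2, "ram_gb": 8, "disk_gb": 100}    # per-VM consumption profile
free = {"vcpu": 80, "ram_gb": 256, "disk_gb": 20000}  # remaining cluster capacity

demand, short = whatif(50, profile, free)
print(demand)   # resources required by 50 new virtual machines
print(short)    # resources the current infrastructure cannot supply
```

With these invented numbers, 50 new virtual machines exceed the free CPU and RAM but not the free disk, mirroring the kind of shortfall reported in Figure 37.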
To define a new virtual machine profile, you can make detailed specifications that
give you the option to include and predict specific resource utilizations, reservations,
and limits in order to get as accurate a projection as possible, as shown in Figure 36.
Figure 36. Specifying configuration and projected capacity usage of new virtual machines
Figure 37 shows that there are insufficient resources for a planned deployment
scenario consisting of either 50 or 85 new virtual machines. In this case, we can
easily provision new vSphere hosts using vCAC services as described in previous
sections.
Figure 37. Capacity summary showing insufficient CPU and RAM resources
Before you provision new hardware resources, you can create hardware change
scenarios to determine the effect of adding, removing, or updating the hardware
capacity in a vSphere cluster. You can create a scenario that models changes to hosts
and datastores, as shown in Figure 38 and Figure 39.
Figure 38. Specifying number of hosts and amount of CPU and memory
Figure 39. Specifying datastore size
The what-if scenario capacity planning function allows you to compare how adding
different numbers of virtual machines and amounts of hardware will affect your actual
environment, as shown in Figure 40.
Figure 40. Compared scenarios
Capacity planning example
In a planning exercise, assume that you:
Have a request to deploy an additional 45 Hadoop node instances in the
existing HaaS.
Plan to purchase blade servers compliant with a certain specification.
Want to deploy an additional 25 Hadoop clusters.
In Figure 41, each column shows how an individual change affects resources in your
environment. The Combined Scenarios column shows you the cumulative effect of
hardware purchasing and an overall expansion of 70 virtual machines.
Figure 41. Combined scenarios
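A combined scenario is a simple roll-up: sum the resource deltas of the individual scenarios, with purchased hardware entering as negative demand. A sketch with made-up per-scenario figures:

```python
# Illustrative combined-scenario roll-up; all figures are made up.

def combine(scenarios):
    """Sum the resource deltas of individual what-if scenarios.
    Positive values are added demand; negative values are added supply."""
    total = {}
    for sc in scenarios:
        for k, v in sc.items():
            total[k] = total.get(k, 0) + v
    return total

add_45_nodes = {"vms": 45, "vcpu": 90, "ram_gb": 360}
add_25_more  = {"vms": 25, "vcpu": 50, "ram_gb": 200}
new_blades   = {"vcpu": -128, "ram_gb": -1024}  # purchased capacity offsets demand

print(combine([add_45_nodes, add_25_more, new_blades]))
```

The combined result reflects the overall expansion of 70 virtual machines described above, net of the newly purchased blade capacity.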
Metering and chargeback
VMware ITBM provides cloud administrators with comprehensive metering and cost
information across physical and virtual resources in the EMC Hybrid Cloud
environment. Besides calculating the cost of physical components such as storage,
compute, and networking resources, you can also include and configure other factors
that affect the overall cost of your cloud environment, such as operating system
licensing, maintenance, labor, and environmental facilities costs, as shown in Figure
42.
Figure 42. Categorized hybrid cloud environment cost overview
ITBM is integrated into the vCAC portal for the Hadoop administrator and presents a
dashboard overview of the hybrid cloud infrastructure.
VMware ITBM Standard Edition uses its own reference database, which has been
preloaded with industry-standard data and vendor-specific data to generate the base
price for virtual CPU (vCPU), RAM, and storage values. These prices, which default to
the cost of CPU, RAM, and storage, are automatically consumed by vCAC, where they
can be changed as appropriate by the cloud administrator. This eliminates the need
to manually configure cost profiles in vCAC and assign them to compute resources.
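The base-price model amounts to a per-unit rate for each resource dimension. A hedged sketch of how a virtual machine's monthly cost could be derived from such rates; the rates themselves are invented, not ITBM reference data:

```python
# Illustrative per-VM monthly cost from base unit prices; the rates are
# invented, not values from the ITBM reference database.
BASE_PRICE = {"vcpu": 25.0, "ram_gb": 10.0, "storage_gb": 0.20}  # USD per month

def vm_monthly_cost(vcpu, ram_gb, storage_gb):
    """Monthly cost of a VM as the sum of its per-resource base prices."""
    return (vcpu * BASE_PRICE["vcpu"]
            + ram_gb * BASE_PRICE["ram_gb"]
            + storage_gb * BASE_PRICE["storage_gb"])

# A hypothetical 2 vCPU / 8 GB RAM / 100 GB Hadoop worker node:
print("$%.2f" % vm_monthly_cost(2, 8, 100))
```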
ITBM is also integrated with vCenter and can import existing resource hierarchies,
folder structures, and vCenter tags to associate EMC Hybrid Cloud resource usage
with business units, departments, and projects.
Infrastructure resources consumed by HaaS instances and hosted applications are
provided by dedicated vSphere clusters with associated vSphere hosts and
datastores. ITBM provides you with detailed information about:
Number of vSphere hosts in the vSphere cluster and the number of virtual
machines on each host
CPU and RAM capacity and utilization of the vSphere cluster
Overall cost of the compute resources provided by the dedicated vSphere
cluster
Cluster cost by virtual machine
The Clusters tab provides you with insight into the cost of the vSphere cluster
resources consumed by Hadoop cluster instances. You can monitor costs while
provisioning new hosts, as shown in Figure 43.
Figure 43. vSphere Cluster cost overview
The Datastores tab provides insight into the cost of the storage resources consumed
by an HaaS instance. The name of a datastore provisioned by vCAC storage services
inherits a cluster name prefix as part of its published name. Performing a sort by
datastore name gives you a list of the names and costs of the datastores provisioned
and assigned to hosts in the vSphere cluster, as shown in Figure 44.
Figure 44. Storage cost overview
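Because provisioned datastore names inherit the cluster name as a prefix, a sort or filter on that prefix yields the per-cluster storage bill. A sketch with hypothetical names and costs:

```python
# Illustrative per-cluster storage cost roll-up using the cluster-name prefix
# that vCAC-provisioned datastores carry; names and costs are made up.
datastores = [
    ("HadoopCl01-NFS-01", 90.00),
    ("HadoopCl01-VMFS-01", 307.20),
    ("HadoopCl02-NFS-01", 45.00),
]

def cluster_storage_cost(datastores, cluster_prefix):
    """Return the sorted (name, cost) pairs for one cluster and their total."""
    matches = sorted((n, c) for n, c in datastores if n.startswith(cluster_prefix))
    return matches, sum(c for _, c in matches)

names, total = cluster_storage_cost(datastores, "HadoopCl01")
print(names, "total: $%.2f" % total)
```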
Chapter 7 Conclusion
This chapter presents the following topics:
Summary..................................................................................................................66
Summary
Pivotal Hadoop is designed to create an easy-to-scale big data framework. To achieve
this kind of flexibility, HaaS is designed around the modular system components of
Pivotal Hadoop. Using vCenter Orchestrator workflows, the administrator can provide
fixed cluster configuration catalog items or create dynamic workflows that can be
called from a catalog. The size of the nodes used is determined by the individual
making the request.
Elastic provisioning refers to the ability to provision flexible computing resources
when and where they are required and to easily scale resources up and down to
match demand. Resource elasticity can relate to processing power, memory, storage,
bandwidth, and so on. This document indicates the importance of having an elastic
and scalable IaaS platform on which to support the hosting of dynamically changing
and fast-growing big data platforms.
VMware vCenter Operations Manager enables you to deliver quality of service, attain
operational efficiency, and gather current capacity capabilities while forecasting the
effect of future HaaS deployments or upgrades in your cloud infrastructure.
HaaS clusters can grow to a large number of node instances; the limit can be
adjusted through the BDE configuration parameters. It is therefore crucial to have
proactive performance monitoring and capacity planning solutions in place.
To support comprehensive, dynamic, and fast-growing development environments
such as Hadoop as a service, you must ensure the stability of the underlying cloud
compute infrastructure, which must provide availability, scalability, flexibility, and
performance to the big data platform and its services. As a solution to these
challenges, this document has addressed simple provisioning from a self-service
catalog and considerations for building scalable Hadoop-as-a-service environments,
with an elastic and easy-to-deploy underlying IaaS infrastructure provided by the EMC
Hybrid Cloud solution.
Appendix A References
This appendix presents the following topic:
References ...............................................................................................................68
VMware references
The following VMware documents provide additional and relevant information:
Advanced Service Design vCloud Automation Center 6.0
Installing and Configuring VMware vCenter Orchestrator
VMware Compatibility Guide
VMware vSphere Big Data Extensions Administrator’s and User’s Guide:
vSphere Big Data Extensions 1.0
Installing and Configuring VMware vSphere Big Data Extensions (Video)