Cindy Gross aka @SQLCindy of @NealAnalytics presents a series of "Small Bites of Big Data" lessons on how to create an HDInsight cluster on Microsoft Azure. Recordings are available via the YouTube playlist Getting Started with HDInsight: https://www.youtube.com/playlist?list=PLAD2dOpGM3s1R2L5HgPMX4MkTGvSza7gv. Blog available at http://blogs.msdn.com/b/cindygross/archive/2015/02/26/create-hdinsight-cluster-in-azure-portal.aspx.
2. This presentation is available via recordings.
Blog: Create HDInsight Cluster in Azure Portal
http://blogs.msdn.com/b/cindygross/archive/2015/02/26/create-hdinsight-cluster-in-azure-portal.aspx
YouTube Playlist: SQLCindy - Getting Started with HDInsight
https://www.youtube.com/playlist?list=PLAD2dOpGM3s1R2L5HgPMX4MkTGvSza7gv
4. Why Hadoop?
• Scale-out
• Load data now, add schema later (write once, read many)
• Fail fast – iterate through many questions to find the right question
• Faster time from question to insight
• Hadoop is “just another data source” for BI, Analytics, Machine Learning
5. Why HDInsight?
• HDInsight is Hadoop on Azure as a service
• Easy, cost effective, changeable scale out data processing
• Lower TCO – easily add/remove/scale
• Separation of storage and compute allows data to exist across clusters
6. HDInsight Technology
• Hortonworks HDP is one of the 3 major Hadoop distributions, and the most purely open source
• HDInsight *IS* Hortonworks HDP as a service in Azure (cloud)
• Metastore (HCatalog) exists independently across clusters via SQL DB
• #, size, type of clusters are flexible and can all access the same data
• Hive is a Hadoop component that makes data look like rows/columns for data warehouse type activities
7. Why Big Data in the Azure Cloud?
• Instantly access data born in the cloud
• Easily, cheaply load, share, and merge public or private data
• Data exists independently across clusters (separation of storage and
compute) via WASB on Azure storage accounts
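Data in WASB is addressed by a URI that names both the container and the storage account, which is what lets any cluster (or no cluster at all) reference the same data. A minimal sketch of building such a URI; the account and container names here are hypothetical:

```python
def wasb_uri(container, account, path=""):
    """Build a WASB URI of the form
    wasb://<container>@<account>.blob.core.windows.net/<path>."""
    base = f"wasb://{container}@{account}.blob.core.windows.net"
    return f"{base}/{path.lstrip('/')}" if path else base

# Hypothetical storage account "mystorageacct" and container "demo":
print(wasb_uri("demo", "mystorageacct", "bacon/bacon1.txt"))
# wasb://demo@mystorageacct.blob.core.windows.net/bacon/bacon1.txt
```

Because the account and container are in the URI, the same path works from every cluster that knows the storage account's key.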
9. Get an Azure Subscription
Trial: http://azure.microsoft.com/en-us/pricing/free-trial/
MSDN Subscription: http://azure.microsoft.com/en-us/pricing/member-offers/msdn-benefits/
Startup BizSpark: http://azure.microsoft.com/en-us/pricing/member-offers/bizspark-startups/
Classroom: http://www.microsoftazurepass.com/azureu
Pay-As-You-Go or Enterprise Agreement: http://azure.microsoft.com/en-us/pricing/
10. Log in to Your Azure Subscription
1. Log in on the Azure Portal: https://manage.windowsazure.com
2. Use a Microsoft Account: http://www.microsoft.com/en-us/account/default.aspx
Note: Some companies have federated their accounts and can use company accounts.
11. Choose Subscription
Most accounts will only have one Azure subscription associated with them. But if you seem to have unexpected resources, check to make sure you are in the expected subscription. The Subscriptions button is on the upper right of the Azure portal.
12. Add Accounts
Option: Add more Microsoft Accounts as admins of the Azure Subscription.
1. Choose SETTINGS at the very bottom on the left.
2. Then choose ADMINISTRATORS at the top. Click on the ADD button at the very bottom.
3. Enter a Microsoft Account or federated enterprise account that will be an admin.
14. Create a Storage Account
1. Click on STORAGE in the left menu then NEW.
2. URL: Choose a storage account name that is unique within *.core.windows.net.
3. LOCATION: Choose the same location for the SQL Azure metastore database, the storage account(s), and HDInsight.
4. REPLICATION: Locally redundant stores fewer copies and costs less.
Repeat if you need additional storage.
15. Create a Container
1. Click on your storage account in the left menu then CONTAINERS on the top.
2. Choose CREATE A CONTAINER or choose the NEW button at the bottom.
3. Enter a lower-case NAME for the container, unique within that storage account.
4. Choose either Private or Public ACCESS. If there is any chance of sensitive or PII data being loaded to this container choose Private. Private access requires a key. HDInsight can be configured with that key during creation or keys can be passed in for individual jobs.
This will be the default container for the cluster. If you want to manage your data separately you may want to create additional containers.
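Container names are constrained beyond just being lower-case. A rough validator, assuming the commonly documented Azure rules (3-63 characters; lower-case letters, digits, and single hyphens; must start and end with a letter or digit) — check the Azure storage documentation for the authoritative rules:

```python
import re

def valid_container_name(name):
    """Rough check of Azure blob container naming rules (assumed:
    3-63 chars, lower-case letters/digits/hyphens, no consecutive
    hyphens, starts and ends with a letter or digit)."""
    if not 3 <= len(name) <= 63:
        return False
    if "--" in name:          # consecutive hyphens are rejected
        return False
    return re.fullmatch(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", name) is not None

print(valid_container_name("demo"))    # True
print(valid_container_name("MyData"))  # False (upper-case)
```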
18. Create a Metastore aka Azure SQL DB
Persist your Hive and Oozie metadata across cluster instances, even if no cluster exists, with an HCatalog metastore in an Azure SQL Database. This database should not be used for anything else. While sharing a single metastore across multiple instances works, it is not officially tested or supported.
1. Click on SQL DATABASES then NEW and choose CUSTOM CREATE.
2. Choose a NAME unique to your server.
3. Click on the “?” to help you decide what TIER of database to create.
4. Use the default database COLLATION.
5. If you choose an existing SERVER you will share sysadmin access with other databases.
19. Firewall Rules
To reference the metastore from automated cluster creation scripts such as PowerShell, your workstation must be added to the firewall rules.
1. Click on MANAGE then choose YES.
2. You can also use the MANAGE button to connect to the SQL Azure database and manage logins and permissions.
21. How to Create an HDInsight Cluster
• Quick Create through the Azure portal is the fastest way to get started with
all the default settings.
• The Azure portal Custom Create allows you to customize size, storage, and
other configuration options.
• You can customize and automate through code including .NET and
PowerShell. This increases standardization and lets you automate the
creation and deletion of clusters over time.
• For all the examples here we will create a basic Hadoop cluster with Hive,
Pig, and MapReduce.
• A cluster will take several minutes to create; the type and size of the cluster have little impact on the creation time.
23. Option 1: Quick Create
For your first cluster choose a Quick Create.
1. Click on HDINSIGHT in the left menu, then NEW.
2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm).
3. Choose a NAME unique in the azurehdinsight.net domain.
4. Start with a small CLUSTER SIZE, often 2 or 4 nodes.
5. Choose the admin PASSWORD.
6. The location of the STORAGE ACCOUNT determines the location of the cluster.
25. Option 2: Custom Create
You can also customize your size, admin account, storage, metastore, and more through the portal. We’ll walk through a basic Hadoop cluster.
1. Click on HDINSIGHT in the left menu, then NEW in the lower left.
2. Choose CUSTOM CREATE.
<continued>
26. Custom Create
Basic Info
1. Choose a NAME unique in the azurehdinsight.net domain.
2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm).
3. Choose Windows or Linux as the OPERATING SYSTEM. Linux is only available if you have signed up for the preview.
4. In most cases you will want the default VERSION.
<continued>
27. Custom Create
Size and Location
1. Choose the number of DATA NODES for this cluster. Head nodes and gateway nodes will also be created and they all use HDInsight cores. For information on how many cores are used by each node see the “Pricing details” link.
2. Each subscription has a billing limit set for the maximum number of HDInsight cores available to that subscription. To change the number available to your subscription choose “Create a support ticket.” If the total of all HDInsight cores in use plus the number needed for the cluster you are creating exceeds the billing limit you will receive a message: “This cluster requires X cores, but only Y cores are available for this subscription”. Note that the messages are in cores and your configuration is specified in nodes.
3. The storage account(s), metastore, and cluster will all be in the same REGION.
<continued>
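Because the billing-limit message counts cores while you configure nodes, it helps to estimate the core total before submitting. A sketch of that arithmetic; the cores-per-node and head/gateway overhead figures below are illustrative assumptions, not published values — see the “Pricing details” link for the real ones:

```python
def cores_needed(data_nodes, cores_per_node=4, overhead_cores=8):
    """Estimate total HDInsight cores: data nodes plus head/gateway
    node overhead. Per-node and overhead counts are assumptions."""
    return data_nodes * cores_per_node + overhead_cores

def fits_billing_limit(data_nodes, cores_in_use, billing_limit, **kw):
    """Mirror the portal's check: cores already in use plus cores
    for the new cluster must not exceed the subscription limit."""
    return cores_in_use + cores_needed(data_nodes, **kw) <= billing_limit

print(cores_needed(4))                                          # 24
print(fits_billing_limit(4, cores_in_use=0, billing_limit=20))  # False
```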
28. Custom Create
Cluster Admin
1. Choose an administrator USER NAME. It is more secure to avoid “admin” and to choose a relatively obscure name. This account will be added to the cluster and doesn’t have to match any existing external accounts.
2. Choose a strong PASSWORD of at least 10 characters with upper/lower case letters, a number, and a special character. Some special characters may not be accepted.
<continued>
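The password guidance above can be pre-checked before you fight the portal's validation. A minimal sketch of those rules (it does not model which specific special characters the portal rejects):

```python
import string

def strong_password(pw):
    """Check the guidance above: at least 10 characters with upper-
    and lower-case letters, a digit, and a special character."""
    return (len(pw) >= 10
            and any(c.isupper() for c in pw)
            and any(c.islower() for c in pw)
            and any(c.isdigit() for c in pw)
            and any(c in string.punctuation for c in pw))

print(strong_password("Abcdefg1!x"))  # True
print(strong_password("short1!A"))    # False (under 10 characters)
```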
29. Custom Create
Metastore (HCatalog)
On the same page as the Hadoop cluster admin account you can optionally choose to use a common metastore (HCatalog).
1. Click on the blue box to the right of “Enter the Hive/Oozie Metastore”. This makes more fields available.
2. Choose the SQL Azure database you created earlier as the METASTORE.
3. Enter a login (DATABASE USER) and PASSWORD that allow you to access the METASTORE database. If you encounter errors, try logging in to the database manually from the portal. You may need to open firewall ports or change permissions.
<continued>
30. Custom Create
Default Storage Account
Every cluster has a default storage account. You can optionally specify additional storage accounts at cluster create time or at run time.
1. To access existing data on an existing STORAGE ACCOUNT, choose “Use Existing Storage”.
2. Specify the NAME of the existing storage account.
3. Choose a DEFAULT CONTAINER on the default storage account. Other containers (units of data management) can be used as long as the storage account is known to the cluster.
4. To add ADDITIONAL STORAGE ACCOUNTS that will be accessible without the user providing the storage account key, specify that here.
<continued>
31. Custom Create
Additional Storage Accounts
If you specified there will be additional accounts you will see this screen.
1. If you choose “Use Existing Storage” you simply enter the NAME of the storage account.
2. If you choose “Use Storage From Another Subscription” you specify the NAME and the GUID KEY for that storage account.
<continued>
32. Custom Create
Script Actions
You can add additional components or configure existing components as the cluster is deployed. This is beyond the scope of this demo.
1. Click “add script action” to show the remaining parameters.
2. Enter a unique NAME for your action.
3. The SCRIPT URI points to code for your custom action.
4. Choose the NODE TYPE for deployment.
<continued>
33. Create is Done!
Once you click on the final checkmark Azure goes to work and creates the cluster. This takes several minutes. When the cluster is ready you can view it in the portal.
35. Hive Console
The simplest, most relatable way for most people to use Hadoop is via the SQL-like, database-like Hive and HiveQL (HQL).
1. Put focus on your HDInsight cluster and choose QUERY CONSOLE to open a new tab in your browser. In my case it opens: https://dragondemo1.azurehdinsight.net//
2. Click on Hive Editor.
36. Query Hive
The query console defaults to selecting the first 10 rows from the pre-loaded sample table. This table is created when the cluster is created.
1. Optionally edit or replace the default query: Select * from hivesampletable LIMIT 10;
2. Optionally name your query to make it easier to find in the job history.
3. Click Submit.
Hive is a batch system optimized for processing huge amounts of data. It spends several seconds up front splitting the job across the nodes, and this overhead exists even for small result sets. If you are doing the equivalent of a table scan in SQL Server and have enough nodes in Hadoop, Hadoop will probably be faster than SQL Server. If your query uses indexes in SQL Server, then SQL Server will likely be faster than Hive.
37. View Hive Results
1. Click on the Query you just submitted in the Job Session. This opens a new tab.
2. You can see the text of the Job Query that was submitted. You can Download it.
3. The first few lines of the Job Output (query result) are available. To see the full output choose Download File.
4. The Job Log has details including errors if there are any.
5. Additional information about the job is available in the upper right.
38. View Hive Data in Excel Workbook
At this point HDInsight is “just another data source” for any application that supports ODBC.
1. Install the Microsoft Hive ODBC driver.
2. Define an ODBC data source pointing to your HDInsight instance.
3. From DATA choose From Other Sources and From Data Connection Wizard.
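An ODBC data source is ultimately just a connection string naming the driver and the cluster endpoint. A sketch of assembling one; the driver name and key names here are illustrative assumptions, so check the Hive ODBC driver's own documentation for the exact keywords it accepts:

```python
def hive_odbc_connstr(cluster, user, password):
    """Assemble an ODBC-style connection string for a Hive driver
    pointed at an HDInsight cluster. Key names are assumptions."""
    parts = {
        "Driver": "{Microsoft Hive ODBC Driver}",
        "Host": f"{cluster}.azurehdinsight.net",
        "Port": "443",   # HDInsight fronts Hive over HTTPS
        "UID": user,
        "PWD": password,
    }
    return ";".join(f"{k}={v}" for k, v in parts.items())

# Hypothetical cluster and admin account names:
print(hive_odbc_connstr("dragondemo1", "clusteradmin", "..."))
```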
39. View Hive Data in PowerPivot
At this point HDInsight is “just another data source” for any application that supports ODBC.
1. Install the Microsoft Hive ODBC driver.
2. Define an ODBC data source pointing to your HDInsight instance.
3. Click on POWERPIVOT then choose Manage. This opens a new PowerPivot for Excel window.
4. Choose Get External Data then Others (OLEDB/ODBC).
Now you can combine the Hive data with other data inside the tabular PowerPivot data model.
41. Load Data
In the cloud you don’t have to load data to Hadoop; you can load data to an Azure Storage Account, then point your HDInsight or other WASB-compliant Hadoop cluster to the existing data source. There are many ways to load data; for the demo we’ll use CloudXplorer.
You use the Accounts button to add Azure, S3, or other data/storage accounts you want to manage.
In this example nealhadoop is the Azure storage account, demo is the container, and bacon is a “directory”. The files are bacon1.txt and bacon2.txt. Any Hive tables would point to the bacon directory, not to individual files. Drag and drop files from Windows Explorer to CloudXplorer.
Windows Azure Storage Explorers (2014)
43. Pricing
You are charged for the time the cluster exists, regardless of how busy it is. Check the website for the most recent information.
Due to the separation of storage and compute you can drop your cluster when it’s not in use and easily add it back, pointing to existing data stores that are still there, when it’s needed again.
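The drop-and-recreate pattern pays off because cost scales with node count times wall-clock hours, idle or not. A back-of-the-envelope sketch; the per-node hourly rate is a placeholder, not a real Azure price:

```python
def cluster_cost(nodes, hours, rate_per_node_hour):
    """Clusters bill for every hour they exist, busy or idle, so cost
    is node count times wall-clock hours times the hourly node rate
    (the rate here is a placeholder; see the Azure pricing page)."""
    return nodes * hours * rate_per_node_hour

# Dropping a 4-node cluster for 128 idle hours a week would save:
print(cluster_cost(4, 128, 0.50))  # 256.0
```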
45. Automate with PowerShell
With PowerShell, .NET, or the cross-platform cmd line tools you can specify even more configuration settings that aren’t available in the portal. This includes node size, a library store, and changing default configuration settings such as Tez and compression.
Automation allows you to standardize, and with version control lets you track your configurations over time.
47. HDInsight WrapUp
• HDInsight is Hadoop on Azure as a service, specifically Hortonworks HDP on either
Windows or Linux
• Easy, cost effective, changeable scale out data processing for a lower TCO – easily
add/remove/scale
• Separation of storage and compute allows data to exist across clusters via WASB
• Metastore (HCatalog) exists independently across clusters via SQL DB
• #, size, type of clusters are flexible and can all access the same data
• Instantly access data born in the cloud; Easily, cheaply load, share, and merge public or
private data
• Load data now, add schema later (write once, read many)
• Fail fast – iterate through many questions to find the right question
• Faster time from question to insight
• Hadoop is “just another data source” for BI, Analytics, Machine Learning
You can make the system more secure by creating a custom login on the Azure SQL server, adding that login as a user in the database you just created, and granting it minimal read/write permissions in the database. This is not well documented or tested, so the exact permissions needed are unclear. You may see odd errors if you don’t grant the appropriate permissions.
Use Additional Storage Accounts with HDInsight Hive
http://blogs.msdn.com/b/cindygross/archive/2014/05/05/use-additional-storage-accounts-with-hdinsight-hive.aspx
Using multiple storage accounts lets you manage billing, security, backups, and high availability separately for each account. It also enables cross-subscription access.
Generally you want to manage the storage accounts and load data outside of the cluster(s) existence, so choose “use existing storage”. If you let the cluster creation create the storage you lose control. This enables separation of storage and compute so that multiple clusters can access the same data.
Customize HDInsight clusters using Script Action http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
LIMIT is similar to TOP in TSQL.
HQL is most similar to MySQL’s implementation of the ANSI-SQL standard.
Technically Azure doesn’t have directories, but Hadoop interprets a file named with a / as being in a directory structure. CloudXplorer is the only free GUI storage explorer that makes that easy to visualize and configure.
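Since blob storage is flat, the "directory" view that Hadoop and CloudXplorer show is just an interpretation of `/` in blob names. A sketch of recovering the top-level "directories" from a flat blob listing, using the bacon example above:

```python
def list_directories(blob_names):
    """Azure blob storage is flat; tools infer a directory tree from
    '/' in blob names. Recover the top-level 'directories' from a
    flat listing of blob names."""
    return sorted({name.split("/", 1)[0] for name in blob_names if "/" in name})

blobs = ["bacon/bacon1.txt", "bacon/bacon2.txt", "readme.txt"]
print(list_directories(blobs))  # ['bacon']
```

This is also why a Hive table location points at the `bacon` prefix rather than at individual files: every blob sharing the prefix is treated as part of the table.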
.NET and the Azure Cross-platform (xplat) command line tools are also an option.
Sample PowerShell Script: HDInsight Custom Create
http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx
If your HDInsight and/or Azure cmdlets don’t match the current documentation or return unexpected errors, run Web Platform Installer and check for a new version of “Microsoft Azure PowerShell with Microsoft Azure SDK” or “Microsoft Azure PowerShell (standalone)”.