Create HDInsight Cluster in
Azure Portal
Cindy Gross
@SQLCindy
http://smallbitesofbigdata.com
This presentation is available via recordings.
Blog: Create HDInsight Cluster in Azure Portal
http://blogs.msdn.com/b/cindygross/archive/2015/02/26/create-hdinsight-cluster-in-azure-portal.aspx
YouTube playlist SQLCindy - Getting Started with HDInsight
https://www.youtube.com/playlist?list=PLAD2dOpGM3s1R2L5HgPMX4MkTGvSza7gv
HDInsight – Hadoop on Azure
Why Hadoop?
• Scale-out
• Load data now, add schema later (write once, read many)
• Fail fast – iterate through many questions to find the right question
• Faster time from question to insight
• Hadoop is “just another data source” for BI, Analytics, Machine
Learning
Why HDInsight?
• HDInsight is Hadoop on Azure as a service
• Easy, cost effective, changeable scale out data processing
• Lower TCO – easily add/remove/scale
• Separation of storage and compute allows data to exist across clusters
HDInsight Technology
• Hortonworks HDP is one of the 3 major Hadoop distributions, and the most purely open source
• HDInsight *IS* Hortonworks HDP as a service in Azure (cloud)
• Metastore (HCatalog) exists independently across clusters via SQL DB
• #, size, type of clusters are flexible and can all access the same data
• Hive is a Hadoop component that makes data look like rows/columns
for data warehouse type activities
Why Big Data in the Azure Cloud?
• Instantly access data born in the cloud
• Easily, cheaply load, share, and merge public or private data
• Data exists independently across clusters (separation of storage and
compute) via WASB on Azure storage accounts
Azure Subscription
Get an Azure Subscription
Trial: http://azure.microsoft.com/en-us/pricing/free-trial/
MSDN Subscription: http://azure.microsoft.com/en-us/pricing/member-offers/msdn-benefits/
Startup BizSpark: http://azure.microsoft.com/en-us/pricing/member-offers/bizspark-startups/
Classroom: http://www.microsoftazurepass.com/azureu
Pay-As-You-Go or Enterprise Agreement: http://azure.microsoft.com/en-us/pricing/
Login to Azure
Subscription
1. Log in on the Azure Portal
https://manage.windowsazure.com
2. Use a Microsoft Account
http://www.microsoft.com/en-us/account/default.aspx
Note: Some companies have
federated their accounts and
can use company accounts.
Choose
Subscription
Most accounts will only have one
Azure subscription associated with
them. But if you seem to have
unexpected resources, check to
make sure you are in the expected
subscription. The Subscriptions
button is on the upper right of the
Azure portal.
Add Accounts
Option: Add more
Microsoft Accounts as
admins of the Azure
Subscription.
1. Choose SETTINGS at the
very bottom on the left.
2. Then choose
ADMINISTRATORS at
the top. Click on the
ADD button at the very
bottom.
3. Enter a Microsoft
Account or federated
enterprise account that
will be an admin.
Azure Storage - WASB
Create a
Storage
Account
1. Click on STORAGE in the
left menu then NEW.
2. URL: Choose a storage
account name that is
unique within
*.core.windows.net.
3. LOCATION: Choose the
same location for the
SQL Azure metastore
database, the storage
account(s), and
HDInsight.
4. REPLICATION: Locally
redundant stores fewer
copies and costs less.
Repeat if you need additional
storage.
Create a Container
1. Click on your storage account in the left
menu then CONTAINERS on the top.
2. Choose CREATE A CONTAINER or choose
the NEW button at the bottom.
3. Enter a lower-case NAME for the
container, unique within that storage
account.
4. Choose either Private or Public ACCESS.
If there is any chance of sensitive or PII
data being loaded to this container
choose Private. Private access requires a
key. HDInsight can be configured with
that key during creation or keys can be
passed in for individual jobs.
This will be the default container for the
cluster. If you want to manage your data
separately you may want to create additional
containers.
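The same storage account and container can be scripted with the classic (Service Management) Azure PowerShell module that matches this deck's era. This is a sketch: the account name, container name, and location below are placeholders, not values from the walkthrough.

```powershell
# Sketch using the classic Azure PowerShell cmdlets; names are placeholders.
Add-AzureAccount                                        # interactive login
New-AzureStorageAccount -StorageAccountName "mystorageacct" `
    -Location "West US" -Type Standard_LRS              # locally redundant = fewer copies, lower cost
# Fetch the key, build a storage context, then create a private container
$key = (Get-AzureStorageKey -StorageAccountName "mystorageacct").Primary
$ctx = New-AzureStorageContext -StorageAccountName "mystorageacct" -StorageAccountKey $key
New-AzureStorageContainer -Name "mycontainer" -Permission Off -Context $ctx
```

`-Permission Off` corresponds to Private access in the portal; HDInsight supplies the key for you once the account is registered with the cluster.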
Metastore
Create a Metastore
aka Azure SQL DB
Persist your Hive and Oozie metadata
across cluster instances, even if no
cluster exists, with an HCatalog
metastore in an Azure SQL Database.
This database should not be used for
anything else. While it works to share
a single metastore across multiple
instances it is not officially tested or
supported.
1. Click on SQL DATABASES then NEW
and choose CUSTOM CREATE.
2. Choose a NAME unique to your server.
3. Click on the “?” to help you decide
what TIER of database to create.
4. Use the default database COLLATION.
5. If you choose an existing SERVER you
will share sysadmin access with other
databases.
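When creating clusters from script, the metastore database can be attached to the cluster configuration with the classic HDInsight cmdlets. A minimal sketch, assuming the server and database names below are placeholders and the portal-created SQL login is used for both the Hive and Oozie metastores:

```powershell
# Sketch: attach the Azure SQL DB metastore to a cluster config (classic cmdlets).
$metastoreCred = Get-Credential   # the SQL login for the metastore database
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Add-AzureHDInsightMetastore -SqlAzureServerName "myserver.database.windows.net" `
        -DatabaseName "HDInsightMetastore" -Credential $metastoreCred `
        -MetastoreType HiveMetastore |
    Add-AzureHDInsightMetastore -SqlAzureServerName "myserver.database.windows.net" `
        -DatabaseName "HDInsightMetastore" -Credential $metastoreCred `
        -MetastoreType OozieMetastore
```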
Firewall Rules
In order to refer to the metastore from automated cluster creation scripts such as PowerShell, your workstation must be added to the firewall rules.
1. Click on MANAGE then choose
YES.
2. You can also use the
MANAGE button to connect
to the SQL Azure database
and manage logins and
permissions.
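The firewall rule itself can also be scripted with the classic SQL cmdlets. A sketch, with a placeholder server name and IP address:

```powershell
# Sketch: allow your workstation's public IP through the SQL server firewall.
New-AzureSqlDatabaseServerFirewallRule -ServerName "myserver" `
    -RuleName "workstation" -StartIpAddress "203.0.113.5" -EndIpAddress "203.0.113.5"
```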
Create HDInsight Cluster
How to Create an HDInsight Cluster
• Quick Create through the Azure portal is the fastest way to get started with
all the default settings.
• The Azure portal Custom Create allows you to customize size, storage, and
other configuration options.
• You can customize and automate through code including .NET and
PowerShell. This increases standardization and lets you automate the
creation and deletion of clusters over time.
• For all the examples here we will create a basic Hadoop cluster with Hive,
Pig, and MapReduce.
• A cluster will take several minutes to create; the type and size of the cluster
have little impact on the time for creation.
HDInsight Quick Create
Option 1:
Quick Create
For your first cluster choose a
Quick Create.
1. Click on HDINSIGHT in the
left menu, then NEW.
2. Choose Hadoop. HBase
and Storm also include the
features of a basic Hadoop
cluster but are optimized
for in-memory key value
pairs (HBase) or alerting
(Storm).
3. Choose a NAME unique in
the azurehdinsight.net
domain.
4. Start with a small CLUSTER
SIZE, often 2 or 4 nodes.
5. Choose the admin
PASSWORD.
6. The location of the
STORAGE ACCOUNT
determines the location of
the cluster.
HDInsight Custom Create
Option 2:
Custom Create
You can also customize your size, admin
account, storage, metastore, and more
through the portal. We’ll walk through a basic
Hadoop cluster.
1. Click on HDINSIGHT in the left menu, then
NEW in the lower left.
2. Choose CUSTOM CREATE.
<continued>
Custom Create
Basic Info
1. Choose a NAME unique in the
azurehdinsight.net domain.
2. Choose Hadoop. HBase and Storm
also include the features of a
basic Hadoop cluster but are
optimized for in-memory key-
value pairs (HBase) or alerting
(Storm).
3. Choose Windows or Linux as the
OPERATING SYSTEM. Linux is only
available if you have signed up for
the preview.
4. In most cases you will want the
default VERSION.
<continued>
Custom Create
Size and Location
1. Choose the number of DATA NODES for this cluster. Head nodes and gateway nodes will also be created and they all
use HDInsight cores. For information on how many cores are used by each node see the “Pricing details” link.
2. Each subscription has a billing limit set for the maximum number of HDInsight cores available to that subscription.
To change the number available to your subscription choose “Create a support ticket.” If the total of all HDInsight
cores in use plus the number needed for the cluster you are creating exceeds the billing limit you will receive a
message: “This cluster requires X cores, but only Y cores are available for this subscription”. Note that the messages
are in cores and your configuration is specified in nodes.
3. The storage account(s), metastore, and cluster will all be in the same REGION.
<continued>
Custom Create
Cluster Admin
1. Choose an administrator USER
NAME. It is more secure to
avoid “admin” and to choose a
relatively obscure name. This
account will be added to the
cluster and doesn’t have to
match any existing external
accounts.
2. Choose a strong PASSWORD of
at least 10 characters with
upper/lower case letters, a
number, and a special character.
Some special characters may
not be accepted.
<continued>
Custom Create
Metastore (HCatalog)
On the same page as the Hadoop
cluster admin account you can
optionally choose to use a common
metastore (HCatalog).
1. Click on the blue box to the right
of “Enter the Hive/Oozie
Metastore”. This makes more
fields available.
2. Choose the SQL Azure database
you created earlier as the
METASTORE.
3. Enter a login (DATABASE USER)
and PASSWORD that allow you to
access the METASTORE database.
If you encounter errors, try
logging in to the database
manually from the portal. You
may need to open firewall ports
or change permissions.
<continued>
Custom Create
Default Storage
Account
Every cluster has a default
storage account. You can
optionally specify additional
storage accounts at cluster
create time or at run time.
1. To access existing data on an
existing STORAGE ACCOUNT,
choose “Use Existing
Storage”.
2. Specify the NAME of the
existing storage account.
3. Choose a DEFAULT
CONTAINER on the default
storage account. Other
containers (units of data
management) can be used
as long as the storage
account is known to the
cluster.
4. To add ADDITIONAL
STORAGE ACCOUNTS that
will be accessible without
the user providing the
storage account key, specify
that here.
<continued>
Custom Create
Additional Storage
Accounts
If you specified there will be
additional accounts you will see
this screen.
1. If you choose “Use Existing
Storage” you simply enter
the NAME of the storage
account.
2. If you choose “Use Storage
From Another Subscription”
you specify the NAME and
the GUID KEY for that
storage account.
<continued>
Custom Create
Script Actions
You can add additional components
or configure existing components as
the cluster is deployed. This is
beyond the scope of this demo.
1. Click “add script action” to show
the remaining parameters.
2. Enter a unique NAME for your
action.
3. The SCRIPT URI points to code
for your custom action.
4. Choose the NODE TYPE for
deployment.
<continued>
Create is Done!
Once you click on the final
checkmark Azure goes to work and
creates the cluster. This takes
several minutes. When the cluster
is ready you can view it in the
portal.
Query with Hive
Hive Console
The simplest, most relatable way for most people to use
Hadoop is via the SQL-like, Database-like Hive and HiveQL
(HQL).
1. Put focus on your HDInsight cluster and choose QUERY
CONSOLE to open a new tab in your browser. In my case it
opens: https://dragondemo1.azurehdinsight.net//
2. Click on Hive Editor.
Query Hive
The query console defaults to selecting the first
10 rows from the pre-loaded sample table. This
table is created when the cluster is created.
1. Optionally edit or replace the default query:
Select * from hivesampletable LIMIT 10;
2. Optionally name your query to make it
easier to find in the job history.
3. Click Submit.
Hive is a batch system optimized for processing
huge amounts of data. It spends several
seconds up front splitting the job across the
nodes and this overhead exists even for small
result sets. If you are doing the equivalent of a
table scan in SQL Server and have enough
nodes in Hadoop, Hadoop will probably be
faster than SQL Server. If your query uses
indexes in SQL Server, then SQL Server will likely
be faster than Hive.
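The same query can be submitted without the console, from PowerShell, using the classic HDInsight job cmdlets. A sketch, with a placeholder cluster name:

```powershell
# Sketch: run the sample Hive query from PowerShell instead of the Hive Editor.
Use-AzureHDInsightCluster -Name "mycluster"          # select the target cluster
Invoke-AzureHDInsightHiveJob -Query "SELECT * FROM hivesampletable LIMIT 10;"
```

Like the console, this submits a batch Hive job, so expect the same several seconds of startup overhead even for a 10-row result.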
View Hive Results
1. Click on the Query you just
submitted in the Job Session.
This opens a new tab.
2. You can see the text of the Job
Query that was submitted. You
can Download it.
3. The first few lines of the Job
Output (query result) are
available. To see the full output
choose Download File.
4. The Job Log has details
including errors if there are any.
5. Additional information about
the job is available in the upper
right.
View Hive Data in
Excel Workbook
At this point HDInsight is “just
another data source” for any
application that supports
ODBC.
1. Install the Microsoft Hive
ODBC driver.
2. Define an ODBC data
source pointing to your
HDInsight instance.
3. From DATA choose From
Other Sources and From
Data Connection Wizard.
View Hive Data in
PowerPivot
At this point HDInsight is “just
another data source” for any
application that supports ODBC.
1. Install the Microsoft Hive
ODBC driver.
2. Define an ODBC data source
pointing to your HDInsight
instance.
3. Click on POWERPIVOT then
choose Manage. This opens a
new PowerPivot for Excel
window.
4. Choose Get External Data
then Others (OLEDB/ODBC).
Now you can combine the Hive
data with other data inside the
tabular PowerPivot data model.
Load Demo Data
Load Data
In the cloud you don’t have to load
data to Hadoop, you can load data to
an Azure Storage Account. Then you
point your HDInsight or other WASB
compliant Hadoop cluster to the
existing data source. There are many
ways to load data; for this demo we'll
use CloudXplorer.
You use the Accounts button to add
Azure, S3, or other data/storage
accounts you want to manage.
In this example nealhadoop is the
Azure storage account, demo is the
container, and bacon is a “directory”.
The files are bacon1.txt and
bacon2.txt. Any Hive tables would
point to the bacon directory, not to
individual files. Drag and drop files
from Windows Explorer to
CloudXplorer.
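The same upload CloudXplorer performs can be scripted with the classic Azure storage cmdlets. A sketch using the example's `nealhadoop` account and `demo` container; the local file path is a placeholder:

```powershell
# Sketch: upload a file into the demo container's bacon "directory".
$key = (Get-AzureStorageKey -StorageAccountName "nealhadoop").Primary
$ctx = New-AzureStorageContext -StorageAccountName "nealhadoop" -StorageAccountKey $key
Set-AzureStorageBlobContent -File "C:\data\bacon1.txt" -Container "demo" `
    -Blob "bacon/bacon1.txt" -Context $ctx
# A Hive table would point at the directory-level WASB path:
#   wasb://demo@nealhadoop.blob.core.windows.net/bacon
```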
Windows Azure Storage Explorers (2014)
HDInsight Pricing
Pricing
You are charged for the time the
cluster exists, regardless of how
busy it is. Check the website for the
most recent information.
Due to the separation of storage
and compute you can drop your
cluster when it’s not in use and
easily add it back, pointing to
existing data stores that are still
there, when it’s needed again.
HDInsight Automation
Automate with
PowerShell
With PowerShell, .NET,
or the Cross-Platform
cmd line tools you can
specify even more
configuration settings
that aren’t available in
the portal. This includes
node size, a library store,
and changing default
configuration settings
such as Tez and
compression.
Automation allows you
to standardize and with
version control lets you
track your configurations
over time.
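Putting the pieces together, a create-use-drop cycle might be scripted as below. This is a sketch with the classic cmdlets; the names, node count, and credentials are placeholders, and real scripts would add the metastore and extra storage accounts shown earlier.

```powershell
# Sketch: create a cluster on demand, then drop it when idle (classic cmdlets).
$storage = "mystorageacct"
$key  = (Get-AzureStorageKey -StorageAccountName $storage).Primary
$cred = Get-Credential          # cluster admin user name and password
New-AzureHDInsightCluster -Name "mycluster" -Location "West US" `
    -DefaultStorageAccountName "$storage.blob.core.windows.net" `
    -DefaultStorageAccountKey $key -DefaultStorageContainerName "mycontainer" `
    -ClusterSizeInNodes 4 -Credential $cred
# ... submit jobs ...
# Data in WASB survives cluster deletion, so drop the cluster when not in use:
Remove-AzureHDInsightCluster -Name "mycluster"
```

Because the storage account and metastore outlive the cluster, re-running the creation script later picks up the same data and Hive metadata.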
HDInsight WrapUp
HDInsight WrapUp
• HDInsight is Hadoop on Azure as a service, specifically Hortonworks HDP on either
Windows or Linux
• Easy, cost effective, changeable scale out data processing for a lower TCO – easily
add/remove/scale
• Separation of storage and compute allows data to exist across clusters via WASB
• Metastore (HCatalog) exists independently across clusters via SQL DB
• #, size, type of clusters are flexible and can all access the same data
• Instantly access data born in the cloud; Easily, cheaply load, share, and merge public or
private data
• Load data now, add schema later (write once, read many)
• Fail fast – iterate through many questions to find the right question
• Faster time from question to insight
• Hadoop is “just another data source” for BI, Analytics, Machine Learning
Create HDInsight Cluster in
Azure Portal
Cindy Gross
@SQLCindy
http://smallbitesofbigdata.com
A Journey Into the Emotions of Software Developers
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 

Create HDInsight Cluster in Azure Portal (February 2015)

  • 1. Create HDInsight Cluster in Azure Portal Cindy Gross @SQLCindy http://smallbitesofbigdata.com
  • 2. This presentation is available via recordings. Blog: Create HDInsight Cluster in Azure Portal http://blogs.msdn.com/b/cindygross/archive/2015/02/26/create-hdinsight-cluster-in-azure-portal.aspx YouTube Playlist SQLCindy - Getting Started with HDInsight https://www.youtube.com/playlist?list=PLAD2dOpGM3s1R2L5HgPMX4MkTGvSza7gv
  • 4. Why Hadoop? • Scale-out • Load data now, add schema later (write once, read many) • Fail fast – iterate through many questions to find the right question • Faster time from question to insight • Hadoop is “just another data source” for BI, Analytics, Machine Learning
  • 5. Why HDInsight? • HDInsight is Hadoop on Azure as a service • Easy, cost effective, changeable scale out data processing • Lower TCO – easily add/remove/scale • Separation of storage and compute allows data to exist across clusters
  • 6. HDInsight Technology • Hortonworks HDP is one of the 3 major Hadoop distributions, and the most purely open source • HDInsight *IS* Hortonworks HDP as a service in Azure (cloud) • Metastore (HCatalog) exists independently across clusters via SQL DB • The number, size, and type of clusters are flexible and can all access the same data • Hive is a Hadoop component that makes data look like rows/columns for data warehouse-type activities
  • 7. Why Big Data in the Azure Cloud? • Instantly access data born in the cloud • Easily, cheaply load, share, and merge public or private data • Data exists independently across clusters (separation of storage and compute) via WASB on Azure storage accounts
  • 9. Get an Azure Subscription Trial: http://azure.microsoft.com/en-us/pricing/free-trial/ MSDN Subscription: http://azure.microsoft.com/en-us/pricing/member-offers/msdn-benefits/ Startup BizSpark: http://azure.microsoft.com/en-us/pricing/member-offers/bizspark-startups/ Classroom: http://www.microsoftazurepass.com/azureu Pay-As-You-Go or Enterprise Agreement: http://azure.microsoft.com/en-us/pricing/
  • 10. Login to Azure Subscription 1. Log in on the Azure Portal https://manage.windowsazure.com 2. Use a Microsoft Account http://www.microsoft.com/en-us/account/default.aspx Note: Some companies have federated their accounts and can use company accounts.
  • 11. Choose Subscription Most accounts will only have one Azure subscription associated with them. But if you seem to have unexpected resources, check to make sure you are in the expected subscription. The Subscriptions button is on the upper right of the Azure portal.
  • 12. Add Accounts Option: Add more Microsoft Accounts as admins of the Azure Subscription. 1. Choose SETTINGS at the very bottom on the left. 2. Then choose ADMINISTRATORS at the top. Click on the ADD button at the very bottom. 3. Enter a Microsoft Account or federated enterprise account that will be an admin.
  • 14. Create a Storage Account 1. Click on STORAGE in the left menu then NEW. 2. URL: Choose a storage account name that is unique within *.core.windows.net. 3. LOCATION: Choose the same location for the SQL Azure metastore database, the storage account(s), and HDInsight. 4. REPLICATION: Locally redundant stores fewer copies and costs less. Repeat if you need additional storage.
  • 15. Create a Container 1. Click on your storage account in the left menu then CONTAINERS on the top. 2. Choose CREATE A CONTAINER or choose the NEW button at the bottom. 3. Enter a lower-case NAME for the container, unique within that storage account. 4. Choose either Private or Public ACCESS. If there is any chance of sensitive or PII data being loaded to this container choose Private. Private access requires a key. HDInsight can be configured with that key during creation or keys can be passed in for individual jobs. This will be the default container for the cluster. If you want to manage your data separately you may want to create additional containers.
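The portal steps above can also be scripted. A minimal sketch using the classic (Service Management) Azure PowerShell cmdlets of that era; the account and container names are placeholders:

```powershell
# Sketch only: classic (ASM) Azure PowerShell cmdlets, circa 2015.
# "mystorage" and "mycontainer" are placeholder names.
New-AzureStorageAccount -StorageAccountName "mystorage" -Location "West US"

$key = (Get-AzureStorageKey -StorageAccountName "mystorage").Primary
$ctx = New-AzureStorageContext -StorageAccountName "mystorage" -StorageAccountKey $key

# -Permission Off = private access (requires a key), per the slide's guidance
New-AzureStorageContainer -Name "mycontainer" -Permission Off -Context $ctx
```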
  • 18. Create a Metastore aka Azure SQL DB Persist your Hive and Oozie metadata across cluster instances, even if no cluster exists, with an HCatalog metastore in an Azure SQL Database. This database should not be used for anything else. While it works to share a single metastore across multiple instances, it is not officially tested or supported. 1. Click on SQL DATABASES then NEW and choose CUSTOM CREATE. 2. Choose a NAME unique to your server. 3. Click on the “?” to help you decide what TIER of database to create. 4. Use the default database COLLATION. 5. If you choose an existing SERVER you will share sysadmin access with other databases.
  • 19. Firewall Rules In order to refer to the metastore from automated cluster creation scripts such as PowerShell, your workstation must be added to the firewall rules. 1. Click on MANAGE then choose YES. 2. You can also use the MANAGE button to connect to the SQL Azure database and manage logins and permissions.
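For scripted setups, the firewall rule itself can also be added from PowerShell rather than via the MANAGE button. A minimal sketch using the classic (Service Management) cmdlets of that era; the server name and IP address are placeholders:

```powershell
# Sketch only: classic (ASM) Azure PowerShell cmdlet.
# Replace the server name and IP range with your own values.
New-AzureSqlDatabaseServerFirewallRule -ServerName "myserver" `
    -RuleName "workstation" `
    -StartIpAddress "203.0.113.5" -EndIpAddress "203.0.113.5"
```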
  • 21. How to Create an HDInsight Cluster • Quick Create through the Azure portal is the fastest way to get started with all the default settings. • The Azure portal Custom Create allows you to customize size, storage, and other configuration options. • You can customize and automate through code including .NET and PowerShell. This increases standardization and lets you automate the creation and deletion of clusters over time. • For all the examples here we will create a basic Hadoop cluster with Hive, Pig, and MapReduce. • A cluster takes several minutes to create; the type and size of the cluster have little impact on creation time.
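As a preview of the code route mentioned above, a Quick Create corresponds roughly to a single call with the classic (Service Management) HDInsight cmdlets of that era. A hedged sketch; all names are placeholders:

```powershell
# Sketch only: classic (ASM) HDInsight cmdlets; names are placeholders.
$clusterCreds = Get-Credential            # cluster admin user name/password
$key = (Get-AzureStorageKey -StorageAccountName "mystorage").Primary

New-AzureHDInsightCluster -Name "mycluster" `
    -Location "West US" `
    -ClusterSizeInNodes 4 `
    -Credential $clusterCreds `
    -DefaultStorageAccountName "mystorage.blob.core.windows.net" `
    -DefaultStorageAccountKey $key `
    -DefaultStorageContainerName "mycontainer"
```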
  • 23. Option 1: Quick Create For your first cluster choose a Quick Create. 1. Click on HDINSIGHT in the left menu, then NEW. 2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm). 3. Choose a NAME unique in the azurehdinsight.net domain. 4. Start with a small CLUSTER SIZE, often 2 or 4 nodes. 5. Choose the admin PASSWORD. 6. The location of the STORAGE ACCOUNT determines the location of the cluster.
  • 25. Option 2: Custom Create You can also customize your size, admin account, storage, metastore, and more through the portal. We’ll walk through a basic Hadoop cluster. 1. Click on HDINSIGHT in the left menu, then NEW in the lower left. 2. Choose CUSTOM CREATE. <continued>
  • 26. Custom Create Basic Info 1. Choose a NAME unique in the azurehdinsight.net domain. 2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm). 3. Choose Windows or Linux as the OPERATING SYSTEM. Linux is only available if you have signed up for the preview. 4. In most cases you will want the default VERSION. <continued>
  • 27. Custom Create Size and Location 1. Choose the number of DATA NODES for this cluster. Head nodes and gateway nodes will also be created and they all use HDInsight cores. For information on how many cores are used by each node see the “Pricing details” link. 2. Each subscription has a billing limit set for the maximum number of HDInsight cores available to that subscription. To change the number available to your subscription choose “Create a support ticket.” If the total of all HDInsight cores in use plus the number needed for the cluster you are creating exceeds the billing limit you will receive a message: “This cluster requires X cores, but only Y cores are available for this subscription”. Note that the messages are in cores and your configuration is specified in nodes. 3. The storage account(s), metastore, and cluster will all be in the same REGION. <continued>
  • 28. Custom Create Cluster Admin 1. Choose an administrator USER NAME. It is more secure to avoid “admin” and to choose a relatively obscure name. This account will be added to the cluster and doesn’t have to match any existing external accounts. 2. Choose a strong PASSWORD of at least 10 characters with upper/lower case letters, a number, and a special character. Some special characters may not be accepted. <continued>
  • 29. Custom Create Metastore (HCatalog) On the same page as the Hadoop cluster admin account you can optionally choose to use a common metastore (HCatalog). 1. Click on the blue box to the right of “Enter the Hive/Oozie Metastore”. This makes more fields available. 2. Choose the SQL Azure database you created earlier as the METASTORE. 3. Enter a login (DATABASE USER) and PASSWORD that allow you to access the METASTORE database. If you encounter errors, try logging in to the database manually from the portal. You may need to open firewall ports or change permissions. <continued>
  • 30. Custom Create Default Storage Account Every cluster has a default storage account. You can optionally specify additional storage accounts at cluster create time or at run time. 1. To access existing data on an existing STORAGE ACCOUNT, choose “Use Existing Storage”. 2. Specify the NAME of the existing storage account. 3. Choose a DEFAULT CONTAINER on the default storage account. Other containers (units of data management) can be used as long as the storage account is known to the cluster. 4. To add ADDITIONAL STORAGE ACCOUNTS that will be accessible without the user providing the storage account key, specify that here. <continued>
  • 31. Custom Create Additional Storage Accounts If you specified there will be additional accounts you will see this screen. 1. If you choose “Use Existing Storage” you simply enter the NAME of the storage account. 2. If you choose “Use Storage From Another Subscription” you specify the NAME and the GUID KEY for that storage account. <continued>
  • 32. Custom Create Script Actions You can add additional components or configure existing components as the cluster is deployed. This is beyond the scope of this demo. 1. Click “add script action” to show the remaining parameters. 2. Enter a unique NAME for your action. 3. The SCRIPT URI points to code for your custom action. 4. Choose the NODE TYPE for deployment. <continued>
  • 33. Create is Done! Once you click on the final checkmark Azure goes to work and creates the cluster. This takes several minutes. When the cluster is ready you can view it in the portal.
  • 35. Hive Console The simplest, most relatable way for most people to use Hadoop is via the SQL-like, Database-like Hive and HiveQL (HQL). 1. Put focus on your HDInsight cluster and choose QUERY CONSOLE to open a new tab in your browser. In my case it opens: https://dragondemo1.azurehdinsight.net// 2. Click on Hive Editor.
  • 36. Query Hive The query console defaults to selecting the first 10 rows from the pre-loaded sample table. This table is created when the cluster is created. 1. Optionally edit or replace the default query: Select * from hivesampletable LIMIT 10; 2. Optionally name your query to make it easier to find in the job history. 3. Click Submit. Hive is a batch system optimized for processing huge amounts of data. It spends several seconds up front splitting the job across the nodes and this overhead exists even for small result sets. If you are doing the equivalent of a table scan in SQL Server and have enough nodes in Hadoop, Hadoop will probably be faster than SQL Server. If your query uses indexes in SQL Server, then SQL Server will likely be faster than Hive.
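Beyond the default query, ordinary HQL works in the Hive editor. For example (a sketch against the pre-loaded sample table; `country` is one of its columns in the standard HDInsight sample data):

```sql
-- Default query supplied by the console:
SELECT * FROM hivesampletable LIMIT 10;

-- A small aggregate over the same sample table:
SELECT country, COUNT(*) AS hits
FROM hivesampletable
GROUP BY country
ORDER BY hits DESC
LIMIT 5;
```

Even the small aggregate pays the same several-second job-submission overhead described above; the payoff comes as data volume and node count grow.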
  • 37. View Hive Results 1. Click on the Query you just submitted in the Job Session. This opens a new tab. 2. You can see the text of the Job Query that was submitted. You can Download it. 3. The first few lines of the Job Output (query result) are available. To see the full output choose Download File. 4. The Job Log has details including errors if there are any. 5. Additional information about the job is available in the upper right.
  • 38. View Hive Data in Excel Workbook At this point HDInsight is “just another data source” for any application that supports ODBC. 1. Install the Microsoft Hive ODBC driver. 2. Define an ODBC data source pointing to your HDInsight instance. 3. From DATA choose From Other Sources and From Data Connection Wizard.
  • 39. View Hive Data in PowerPivot At this point HDInsight is “just another data source” for any application that supports ODBC. 1. Install the Microsoft Hive ODBC driver. 2. Define an ODBC data source pointing to your HDInsight instance. 3. Click on POWERPIVOT then choose Manage. This opens a new PowerPivot for Excel window. 4. Choose Get External Data then Others (OLEDB/ODBC). Now you can combine the Hive data with other data inside the tabular PowerPivot data model.
  • 41. Load Data In the cloud you don’t have to load data to Hadoop; you can load data to an Azure Storage Account, then point your HDInsight or other WASB-compliant Hadoop cluster to the existing data source. There are many ways to load data; for the demo we’ll use CloudXplorer. Use the Accounts button to add Azure, S3, or other data/storage accounts you want to manage. In this example nealhadoop is the Azure storage account, demo is the container, and bacon is a “directory”. The files are bacon1.txt and bacon2.txt. Any Hive tables would point to the bacon directory, not to individual files. Drag and drop files from Windows Explorer to CloudXplorer. Windows Azure Storage Explorers (2014)
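The WASB naming the cluster uses for this data is just a URI convention. A small illustrative helper (hypothetical, not part of any SDK) showing how the pieces from the example — account nealhadoop, container demo, directory bacon — combine:

```python
def wasb_uri(container: str, account: str, path: str = "") -> str:
    """Compose a WASB URI: wasb://<container>@<account>.blob.core.windows.net/<path>."""
    base = f"wasb://{container}@{account}.blob.core.windows.net"
    return f"{base}/{path}" if path else base

# A Hive table would point at the directory, not the individual files:
print(wasb_uri("demo", "nealhadoop", "bacon"))
# wasb://demo@nealhadoop.blob.core.windows.net/bacon
```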
  • 43. Pricing You are charged for the time the cluster exists, regardless of how busy it is. Check the website for the most recent information. Due to the separation of storage and compute you can drop your cluster when it’s not in use and easily add it back, pointing to existing data stores that are still there, when it’s needed again.
  • 45. Automate with PowerShell With PowerShell, .NET, or the Cross-Platform cmd line tools you can specify even more configuration settings that aren’t available in the portal. This includes node size, a library store, and changing default configuration settings such as Tez and compression. Automation allows you to standardize and with version control lets you track your configurations over time.
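A hedged sketch of what such a script can set that the portal cannot, using the classic (Service Management) cmdlet pipeline of that period; server, database, and account names are illustrative assumptions:

```powershell
# Sketch only: classic (ASM) HDInsight cmdlets; all names are placeholders.
$clusterCreds = Get-Credential      # cluster admin
$metaCreds    = Get-Credential      # metastore database login
$key = (Get-AzureStorageKey -StorageAccountName "mystorage").Primary

New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
  Set-AzureHDInsightDefaultStorage -StorageAccountName "mystorage.blob.core.windows.net" `
      -StorageAccountKey $key -StorageContainerName "mycontainer" |
  Add-AzureHDInsightMetastore -SqlAzureServerName "myserver.database.windows.net" `
      -DatabaseName "hivemeta" -Credential $metaCreds -MetastoreType HiveMetastore |
  New-AzureHDInsightCluster -Name "mycluster" -Location "West US" -Credential $clusterCreds
```

Because the config object is built up before the final creation call, the same script checked into version control documents exactly how each cluster was configured.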
  • 47. HDInsight WrapUp • HDInsight is Hadoop on Azure as a service, specifically Hortonworks HDP on either Windows or Linux • Easy, cost effective, changeable scale out data processing for a lower TCO – easily add/remove/scale • Separation of storage and compute allows data to exist across clusters via WASB • Metastore (HCatalog) exists independently across clusters via SQL DB • The number, size, and type of clusters are flexible and can all access the same data • Instantly access data born in the cloud; easily, cheaply load, share, and merge public or private data • Load data now, add schema later (write once, read many) • Fail fast – iterate through many questions to find the right question • Faster time from question to insight • Hadoop is “just another data source” for BI, Analytics, Machine Learning
  • 48. Create HDInsight Cluster in Azure Portal Cindy Gross @SQLCindy http://smallbitesofbigdata.com

Editor's notes

  1. You can make the system more secure if you create a custom login on the Azure server. Add that login as a user in the database you just created. Grant it minimal read/write permissions in the database. This is not well documented or tested so the exact permissions needed for this are vague. You may see odd errors if you don’t grant the appropriate permissions.
  2. Use Additional Storage Accounts with HDInsight Hive http://blogs.msdn.com/b/cindygross/archive/2014/05/05/use-additional-storage-accounts-with-hdinsight-hive.aspx Using multiple storage accounts lets you manage billing, security, backups, and high availability separately for each account. It also enables cross-subscription access. Generally you want to manage the storage accounts and load data outside of the cluster(s) existence, so choose “use existing storage”. If you let the cluster creation create the storage you lose control. This enables separation of storage and compute so that multiple clusters can access the same data.
  3. Customize HDInsight clusters using Script Action http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
  4. LIMIT is similar to TOP in TSQL. HQL is most similar to MySQL’s implementation of the ANSI-SQL standard.
  5. http://azure.microsoft.com/en-us/documentation/articles/hdinsight-connect-excel-hive-odbc-driver/
  6. http://azure.microsoft.com/en-us/documentation/articles/hdinsight-connect-excel-hive-odbc-driver/
  7. Technically Azure doesn’t have directories, but Hadoop interprets a file named with a / as being in a directory structure. CloudXplorer is the only free GUI storage explorer that makes that easy to visualize and configure.
  8. http://azure.microsoft.com/en-us/pricing/details/hdinsight/
  9. .NET and the Azure Cross-platform (xplat) command line tools are also an option. Sample PowerShell Script: HDInsight Custom Create http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx If your HDInsight and/or Azure cmdlets don’t match the current documentation or return unexpected errors, run Web Platform Installer and check for a new version of “Microsoft Azure PowerShell with Microsoft Azure SDK” or “Microsoft Azure PowerShell (standalone)”