The Fundamentals Guide to HDP and HDInsight
Gert Drapers (#DataDude)
Principal Software Design Engineer
http://www.economist.com/node/15579717?Story_ID=15579717
Copyright © The Economist Newspaper Limited 2012. All rights reserved
The 4Vs of Big Data:
Volume, Velocity, Variability, & Variety
Source: http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
New In Hadoop 2
•YARN
• ResourceManager
• NodeManager
• ApplicationMaster
•HDFS 2
• NameNode HA
• Snapshots
• Federation
Source: http://hortonworks.com/hadoop/yarn/
Hortonworks Data Platform For Windows
• Leverages work from Hortonworks and Microsoft
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.2)
• YARN
• Stinger Phase 2 (Faster queries)
• Only distribution available on Windows Server
• Harness existing .NET and Java skills to write
MapReduce
• Utilize familiar BI tools for analysis including
Microsoft Excel
On-Premise Self-Deploy (Hadoop)
See: http://hortonworks.com/products/releases/hdp-2-windows/
Microsoft Azure HDInsight 3.0
• Microsoft’s cloud Hadoop offering
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.2)
• YARN
• Stinger Phase 2 (Faster queries)
• Up and running in minutes with no hardware to
deploy
• Harness existing .NET and Java skills to write
MapReduce
• Utilize familiar BI tools for analysis including
Microsoft Excel
Cloud, Hadoop
Microsoft Azure
See: http://www.windowsazure.com/en-us/solutions/big-data/
Stinger Phase 2 in Hive 0.12
•Query optimizer (QO) improvements
•Predicate pushdown
•ORC file improvements
http://hortonworks.com/labs/stinger/
Demo: Getting Started with Hadoop
2 in Azure with HDInsight
HDFS
HDFS Architecture
• Block based
(64 MB default in Hadoop 1.x, 128 MB in Hadoop 2.x)
• Hierarchical file
organization of
directories and files
• Write once,
read many
• Highly portable
• Optimized for small
numbers of very large files
Distributed Fault Tolerant File System
Source: http://hortonworks.com/hadoop/hdfs/
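The "block based" and "small numbers of very large files" points can be made concrete with a little arithmetic. The sketch below is illustrative only: `block_count` is a hypothetical helper, not an HDFS API, and it assumes the 64 MB block size quoted above (clusters can configure a different value). Because the NameNode keeps metadata for every file and block, the same amount of data split into many small files produces far more entries to track.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted above (configurable)


def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold a file (hypothetical helper, not an HDFS API)."""
    # Integer ceiling division; an empty file occupies no blocks.
    return (file_size_bytes + block_size - 1) // block_size


ONE_MIB = 1024 ** 2
ONE_GIB = 1024 ** 3

# One 1 GiB file is split into 16 blocks of 64 MB.
large_file_blocks = block_count(ONE_GIB)

# The same 1 GiB stored as 1,024 files of 1 MiB: each file needs its own block.
small_files_blocks = sum(block_count(ONE_MIB) for _ in range(1024))

print(large_file_blocks, small_files_blocks)
```

In this toy comparison the small-file layout needs 64 times as many block entries for the same volume of data, which is one reason HDFS favors fewer, larger files.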
YARN
A long time ago, in a data center far,
far away…
Episode IV
There was Map Reduce
Introduction to Map/Reduce
Map f(k1, v1) → list(k2, v2)
Reduce f(k2, list(v2)) → (k2, v3)
Functionally
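The two signatures above can be sketched as a tiny in-memory driver. This is a single-process illustration, not how Hadoop executes a job: `map_reduce`, the sample records, and the status/bytes example are all assumptions made for this sketch. The grouping step in the middle plays the role of the shuffle.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
V3 = TypeVar("V3")


def map_reduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], Tuple[K2, V3]],
) -> List[Tuple[K2, V3]]:
    """Apply mapper to every (k1, v1), group values by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # the "shuffle": collect values per key
    return [reducer(k2, groups[k2]) for k2 in sorted(groups)]


# Hypothetical input: (record id, (HTTP status, bytes sent)).
records = [(0, (200, 512)), (1, (404, 128)), (2, (200, 1024))]

# Map: emit (status, bytes). Reduce: sum the bytes for each status.
totals = map_reduce(
    records,
    mapper=lambda _id, hit: [(hit[0], hit[1])],
    reducer=lambda status, sizes: (status, sum(sizes)),
)
print(totals)  # [(200, 1536), (404, 128)]
```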
In Practice, WordCount
The quick brown fox jumps over the lazy dog
Map
(the,1), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (the,1), (lazy,1), (dog,1)
Shuffle
(the,(1,1)), (quick,(1)), (brown,(1)), (fox,(1)), (jumps,(1)), (over,(1)), (lazy,(1)), (dog,(1))
Reduce
(the,2), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (lazy,1), (dog,1)
In Code
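The code from the original slide did not survive in this transcript, so the following is a stand-in rather than the presenter's code: a minimal single-process Python sketch of the three WordCount phases. The function names `map_phase`, `shuffle_phase`, and `reduce_phase` are assumptions for illustration, and words are lowercased so that "The" and "the" count together.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(grouped)


def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


line = "The quick brown fox jumps over the lazy dog"
mapped = map_phase(line)
grouped = shuffle_phase(mapped)
counts = reduce_phase(grouped)

print(grouped["the"])  # [1, 1]
print(counts["the"])   # 2
```

On a real cluster each phase runs in parallel across many nodes and the shuffle moves data over the network; the logic per record is the same.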
Then, scale to TB/PB of data over 10s, 100s, or 1000s of nodes
And Map Reduce was… good?
Episode V
Then came the abstractions
A pig who eats everything
logs = LOAD 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs'
    USING PigStorage(' ') AS (datereq:chararray, timereq:chararray, s_sitename:chararray, cs_method:chararray,
    cs_uri_stem:chararray, cs_uri_query:chararray, s_port:chararray, cs_username:chararray, c_ip:chararray,
    cs_User_Agent:chararray, cs_Cookie:chararray, cs_Referer:chararray, cs_host:chararray, sc_status:chararray,
    sc_substatus:chararray, sc_win32_status:chararray, sc_bytes:int, cs_bytes:int, time_taken:int);
SET default_parallel 5;
-- remove header rows
filtered_logs = FILTER logs BY datereq != '#';
-- summarize requests per referrer
referrer_logs = GROUP filtered_logs BY cs_Referer;
summary_referrer = FOREACH referrer_logs GENERATE $0, COUNT($1) AS COUNT, SUM(filtered_logs.sc_bytes) AS TotalEgress,
    AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_referrer BY COUNT DESC;
limit_summary = LIMIT sorted_summary 25;
-- summarize requests per URI stem
grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests, SUM(filtered_logs.sc_bytes) AS TotalEgress,
    AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
limited_summary = LIMIT sorted_summary 25;
-- write the filtered rows and both top-25 summaries
STORE filtered_logs INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/forhive'
    USING PigStorage('\t');
STORE limited_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/stemstats'
    USING PigStorage('\t');
-- no USING clause: falls back to the default PigStorage (tab-delimited)
STORE limit_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/referer_logs';
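For readers new to Pig, the GROUP / FOREACH / ORDER / LIMIT pipeline in the script can be approximated in plain Python. This is only an in-memory analogy under stated assumptions: the sample rows and the `summarize_by_stem` helper are hypothetical, and Pig's handling of nulls and its distributed execution are not modeled.

```python
from collections import defaultdict

# Hypothetical sample rows: (cs_uri_stem, sc_bytes, time_taken)
rows = [
    ("/index.html", 500, 10),
    ("/about.html", 200, 30),
    ("/index.html", 700, 20),
]


def summarize_by_stem(rows, limit=25):
    """Group by URI stem, aggregate, sort by request count, keep the top rows."""
    groups = defaultdict(list)
    for stem, sc_bytes, time_taken in rows:  # GROUP ... BY cs_uri_stem
        groups[stem].append((sc_bytes, time_taken))

    summary = []
    for stem, hits in groups.items():  # FOREACH ... GENERATE
        number_of_requests = len(hits)
        total_egress = sum(b for b, _ in hits)
        average_time_taken = sum(t for _, t in hits) / number_of_requests
        summary.append((stem, number_of_requests, total_egress, average_time_taken))

    summary.sort(key=lambda r: r[1], reverse=True)  # ORDER ... DESC
    return summary[:limit]  # LIMIT


top = summarize_by_stem(rows)
print(top)  # [('/index.html', 2, 1200, 15.0), ('/about.html', 1, 200, 30.0)]
```

The appeal of Pig is that the same few relational steps compile down to MapReduce jobs, so the script scales without the author writing mappers and reducers by hand.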
Hive for those who know SQL
CREATE EXTERNAL TABLE websites_logs_raw (datereq STRING,
timereq STRING,
s_sitename STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
s_port STRING,
cs_username STRING,
c_ip STRING,
cs_User_Agent STRING,
cs_Cookie STRING,
cs_Referer STRING,
cs_host STRING,
sc_status INT,
sc_substatus STRING,
sc_win32_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs2'
tblproperties ("skip.header.line.count"="1");
set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
select count(*) from websites_logs_raw;
Cascading/Scalding to bring a
modern JVM API for analytics
WordCount in Scalding
See: https://github.com/twitter/scalding
But the abstractions all shared one
thing… Map Reduce
WordCount in Scalding…
See: https://github.com/twitter/scalding
Map Phase
Reduce Phase
Map/Reduce v1 Architecture
Source: http://hortonworks.com/wp-content/uploads/2012/08/MRArch.png
Episode VI
One YARN to rule them all
Compute Model != Resource Model
YARN Architecture
Source: http://hortonworks.com/wp-content/uploads/2012/08/YARNArch.png
• Removes contention on the JobTracker, which previously had to do everything
• Becomes more resilient to RM failures
• Scales to a larger number of active jobs
Other Interesting YARN projects
Some Existing YARN apps
• Storm on YARN
• HBase on YARN
• Spark
• Giraph
• Hamster (MPI on YARN)
• Memcached
• Dryad
Source: http://hortonworks.com/
Writing your own YARN app for fun
and profit…
Start by clicking here
Yikes…
See Slide 20 – Enter Abstractions
Tez
http://tez.incubator.apache.org/
Source: http://hortonworks.com/blog/introducing-tez-faster-hadoop-processing/
REEF
http://www.reef-project.org/
Kitten
https://github.com/cloudera/kitten
http://www.lua.org/manual/5.1
What about .NET?
Dryad on YARN
sources
background

The Microsoft Data Platform
Resources
• All about HDInsight
• Getting Started with HDInsight
• Windows HDP 2.0
• Hadoop project
• HadoopSDK Codeplex project
• Getting Started with YARN blog series
• YARN book
Let us know how you feel about this session! Give your
feedback via www.techdaysapp.nl for a chance to win one of
the 20 prizes*. Winners will be announced via Twitter
(#TechDaysNL). Use the personal code on your badge.
* All results are final; prizes shown are examples.
Backup Slides
Moving Data Between Stores
•Sqoop
• Data in or out of relational store
•Pig
• Set of Storage & Loaders (JDBC, Mongo, etc)
•Hive
• Table formats (Mongo, Azure Tables)
Website log processing, Pig, Hive
logs = LOAD 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs' USING PigStorage(' ')
    AS
    (datereq:chararray, timereq:chararray, s_sitename:chararray, cs_method:chararray,
    cs_uri_stem:chararray, cs_uri_query:chararray, s_port:chararray, cs_username:chararray,
    c_ip:chararray, cs_User_Agent:chararray, cs_Cookie:chararray, cs_Referer:chararray,
    cs_host:chararray, sc_status:chararray, sc_substatus:chararray, sc_win32_status:chararray, sc_bytes:int,
    cs_bytes:int, time_taken:int);
SET default_parallel 100;
-- remove header rows
filtered_logs = FILTER logs BY datereq != '#';
grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests,
    SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
limited_summary = LIMIT sorted_summary 1000;
--STORE limited_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/build2014/stats' USING PigStorage('\t');
CREATE EXTERNAL TABLE websites_logs_raw (datereq STRING,
timereq STRING,
s_sitename STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
s_port STRING,
cs_username STRING,
c_ip STRING,
cs_User_Agent STRING,
cs_Cookie STRING,
cs_Referer STRING,
cs_host STRING,
sc_status INT,
sc_substatus STRING,
sc_win32_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs2'
tblproperties ("skip.header.line.count"="1");
set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
select count(*) from websites_logs_raw;
Interacting with SQL DB
bin\sqoop import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales
New-AzureHDInsightSqoopJobDefinition -Command 'import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales'
REGISTER lib/piggybank.jar;
REGISTER c:\apps\dist\sqljdbc_3.0\enu\sqljdbc4.jar;
STORE limited_summary INTO '/doesnotmatter'
    USING org.apache.pig.piggybank.storage.DBStorage('com.microsoft.sqlserver.jdbc.SQLServerDriver',
    'jdbc:sqlserver://[yourserver].database.windows.net;database=AdventureWorks2012;user=[username];password=[password]',
    'INSERT INTO OutputFromPig(cs_uri_stem, NumberOfRequests, TotalEgress, AverageTimeTaken) VALUES (?,?,?,?)');

Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 

The Fundamentals Guide to HDP and HDInsight

  • 1. The Fundamentals Guide to HDP and HDInsight Gert Drapers (#DataDude) Principal Software Design Engineer
  • 2. http://www.economist.com/node/15579717?Story_ID=15579717 Copyright © The Economist Newspaper Limited 2012. All rights reserved
  • 3. The 4Vs of Big Data: Volume, Velocity, Variability, & Variety Source: http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
  • 5. New In Hadoop 2 •YARN • ResourceManager • NodeManager • ApplicationMaster •HDFS 2 • NameNode HA • Snapshots • Federation Source: http://hortonworks.com/hadoop/yarn/
  • 6. Hortonworks Data Platform For Windows • Leverages work from Hortonworks and Microsoft • 100% open source Apache Hadoop • Built on the latest releases across Hadoop (2.2) • YARN • Stinger Phase 2 (Faster queries) • Only distribution available on Windows Server • Harness existing .NET and Java skills to write MapReduce • Utilize familiar BI tools for analysis including Microsoft Excel On-Premise Self-Deploy (Hadoop) See: http://hortonworks.com/products/releases/hdp-2-windows/
  • 7. Microsoft Azure HDInsight 3.0 • Microsoft’s cloud Hadoop offer • 100% open source Apache Hadoop • Built on the latest releases across Hadoop (2.2) • YARN • Stinger Phase 2 (Faster queries) • Up and running in minutes with no hardware to deploy • Harness existing .NET and Java skills to write MapReduce • Utilize familiar BI tools for analysis including Microsoft Excel Cloud, Hadoop Microsoft Azure See: http://www.windowsazure.com/en-us/solutions/big-data/
  • 8. Stinger Phase 2 in Hive 0.12 •QO improvements •Predicate pushdown •ORC file improvements http://hortonworks.com/labs/stinger/
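The predicate pushdown and ORC items above can be illustrated with a toy model. This is only a sketch of the general idea (a columnar file keeps min/max statistics per chunk so a reader can skip chunks that cannot match a filter); the `Stripe` class and `scan_greater_than` function are hypothetical names for illustration and are not part of Hive or the ORC library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Hypothetical stand-in for one chunk of a columnar file."""
    rows: List[int]

    @property
    def max_value(self) -> int:
        # In a real file format this statistic is stored, not recomputed.
        return max(self.rows)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return the rows with value > threshold and how many stripes were read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # no row in this stripe can match, so skip it entirely
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v > threshold)
    return matches, stripes_read


# Made-up status codes grouped into three stripes.
stripes = [Stripe([200, 200, 304]), Stripe([200, 404, 500]), Stripe([500, 503])]
rows, stripes_read = scan_greater_than(stripes, 400)
print(rows, stripes_read)
```

In this sample the first stripe is never read because its maximum is below the threshold, which is the saving that pushing the predicate down to the storage layer is meant to achieve.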
  • 9. Demo: Getting Started with Hadoop 2 in Azure with HDInsight
  • 10. HDFS
  • 11. HDFS Architecture • Block based (64MB default) • Hierarchical file organization of directories and files • Write once, read many • Highly portable • Optimized for small numbers of very large files Distributed Fault Tolerant File System Source: http://hortonworks.com/hadoop/hdfs/
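A small arithmetic sketch may help show why a block-based file system favors a small number of very large files. It assumes the 64 MB default quoted on the slide; the function name `block_count` is made up for illustration, and the sizes are examples rather than measurements.

```python
import math

BLOCK_SIZE_MB = 64  # default block size quoted on the slide


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed to hold one file; the last block may be partial."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / block_size_mb)


# 1 GB stored as a single file versus 1 GB stored as 1024 files of 1 MB each.
one_large_file = block_count(1024)
many_small_files = 1024 * block_count(1)
print(one_large_file, many_small_files)
```

The same amount of data needs far more blocks when it is split into many small files, and each file and block is one more piece of metadata for the NameNode to track.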
  • 12. YARN
  • 13. A long time ago, in a data center far, far away…
  • 14. Episode IV There was Map Reduce
  • 15. Introduction to Map/Reduce. Functionally: Map f(k1, v1) → list(k2, v2); Reduce f(k2, list(v2)) → (k2, v3). In practice (WordCount), input: "The quick brown fox jumps over the lazy dog". Map: (the,1), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (the,1), (lazy,1), (dog,1). Shuffle: (the,(1,1)), (quick,(1)), (brown,(1)), (fox,(1)), (jumps,(1)), (over,(1)), (lazy,(1)), (dog,(1)). Reduce: (the,2), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (lazy,1), (dog,1). In code, the same logic then scales to TB/PB of data over 10s, 100s, or 1000s of nodes.
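The map, shuffle, and reduce steps of the WordCount example can be sketched in a few lines of single-process Python. This is an illustration of the data flow only, not Hadoop's API; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are chosen here for readability.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


line = "The quick brown fox jumps over the lazy dog"
counts = reduce_phase(shuffle_phase(map_phase(line)))
print(counts)
```

On a cluster, the map and reduce functions run in parallel on many nodes and the framework performs the shuffle between them; the functions themselves stay this small.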
  • 17. And Map Reduce was… good?
  • 18. Episode V Then came the abstractions
  • 19. A pig who eats everything
  • 20. logs = LOAD 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs' USING PigStorage(' ')
        AS (datereq:chararray, timereq:chararray, s_sitename:chararray, cs_method:chararray, cs_uri_stem:chararray,
            cs_uri_query:chararray, s_port:chararray, cs_username:chararray, c_ip:chararray, cs_User_Agent:chararray,
            cs_Cookie:chararray, cs_Referer:chararray, cs_host:chararray, sc_status:chararray, sc_substatus:chararray,
            sc_win32_status:chararray, sc_bytes:int, cs_bytes:int, time_taken:int);
    SET default_parallel 5;
    -- remove header rows
    filtered_logs = FILTER logs BY datereq != '#';
    referrer_logs = GROUP filtered_logs BY cs_Referer;
    summary_referrer = FOREACH referrer_logs GENERATE $0, COUNT($1) AS COUNT, SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
    sorted_summary = ORDER summary_referrer BY COUNT DESC;
    limit_summary = LIMIT sorted_summary 25;
    grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
    summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests, SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
    sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
    limited_summary = LIMIT sorted_summary 25;
    STORE filtered_logs INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/forhive' USING PigStorage('\t');
    STORE limited_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/stemstats' USING PigStorage('\t');
    STORE limit_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/referer_logs';
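For readers who do not know Pig, the GROUP / FOREACH / ORDER / LIMIT chain in the script above computes a per-key summary. The sketch below performs the same aggregation in plain Python on a handful of made-up rows; the sample data and the function name `summarize` are assumptions for illustration, and nothing here reads from the blob storage paths in the script.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sample rows: (cs_Referer, sc_bytes, time_taken)
rows = [
    ("a.example", 100, 10),
    ("b.example", 50, 30),
    ("a.example", 300, 20),
]


def summarize(rows: List[Tuple[str, int, int]], limit: int = 25):
    """Group by referer, then compute count, total bytes, and average time taken."""
    groups: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for referer, sc_bytes, time_taken in rows:       # GROUP ... BY cs_Referer
        groups[referer].append((sc_bytes, time_taken))

    summary = []
    for referer, items in groups.items():            # FOREACH ... GENERATE
        count = len(items)
        total_egress = sum(b for b, _ in items)
        average_time = sum(t for _, t in items) / count
        summary.append((referer, count, total_egress, average_time))

    summary.sort(key=lambda r: r[1], reverse=True)   # ORDER ... BY COUNT DESC
    return summary[:limit]                           # LIMIT


result = summarize(rows)
print(result)
```

Pig compiles the equivalent statements into MapReduce jobs, so the grouping step corresponds to the shuffle and the aggregation to the reduce phase.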
  • 21. Hive for those who know SQL
  • 22. CREATE EXTERNAL TABLE websites_logs_raw (
        datereq STRING, timereq STRING, s_sitename STRING, cs_method STRING, cs_uri_stem STRING,
        cs_uri_query STRING, s_port STRING, cs_username STRING, c_ip STRING, cs_User_Agent STRING,
        cs_Cookie STRING, cs_Referer STRING, cs_host STRING, sc_status INT, sc_substatus STRING,
        sc_win32_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE
    LOCATION 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs2'
    tblproperties ("skip.header.line.count"="1");
    set mapred.input.dir.recursive=true;
    set hive.mapred.supports.subdirectories=true;
    select count(*) from websites_logs_raw;
  • 23. Cascading/Scalding to bring a modern JVM API for analytics
  • 24. WordCount in Scalding See: https://github.com/twitter/scalding
  • 25. But the abstractions all shared one thing… Map Reduce
  • 26. WordCount in Scalding… See: https://github.com/twitter/scalding Map Phase Reduce Phase
  • 27. Map/Reduce v1 Architecture Source: http://hortonworks.com/wp-content/uploads/2012/08/MRArch.png
  • 28. Episode VI One YARN to rule them all
  • 29. Compute Model != Resource Model
  • 30. YARN Architecture Source: http://hortonworks.com/wp-content/uploads/2012/08/YARNArch.png • Removes the contention caused by the JobTracker doing everything • More resilient to RM failures • Scales to a larger number of active jobs
  • 33. Some Existing YARN apps • Storm on YARN • HBase on YARN • Spark • Giraph • Hamster (MPI on YARN) • memcached • Dryad Source: http://hortonworks.com/
  • 34. Writing your own YARN app for fun and profit…
  • 37. See Slide 20 – Enter Abstractions
  • 44. Resources • All about HDInsight • Getting Started with HDInsight • Windows HDP 2.0 • Hadoop project • HadoopSDK Codeplex project • Getting Started with YARN blog series • YARN book
  • 45. Let us know how you feel about this session! Give your feedback via www.techdaysapp.nl for a chance to win one of the 20 prizes*. Winners will be announced via Twitter (#TechDaysNL). Use the personal code on your badge. * All results are final; prizes shown are examples.
  • 48. Moving Data Between Stores •Sqoop • Data in or out of relational store •Pig • Set of Storage & Loaders (JDBC, Mongo, etc) •Hive • Table formats (Mongo, Azure Tables)
  • 50. logs = LOAD 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs' USING PigStorage(' ')
        AS (datereq:chararray, timereq:chararray, s_sitename:chararray, cs_method:chararray, cs_uri_stem:chararray,
            cs_uri_query:chararray, s_port:chararray, cs_username:chararray, c_ip:chararray, cs_User_Agent:chararray,
            cs_Cookie:chararray, cs_Referer:chararray, cs_host:chararray, sc_status:chararray, sc_substatus:chararray,
            sc_win32_status:chararray, sc_bytes:int, cs_bytes:int, time_taken:int);
    SET default_parallel 100;
    -- remove header rows
    filtered_logs = FILTER logs BY datereq != '#';
    grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
    summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests, SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
    sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
    limited_summary = LIMIT sorted_summary 1000;
    --STORE limited_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/build2014/stats' USING PigStorage('\t');
  • 51. CREATE EXTERNAL TABLE websites_logs_raw (
        datereq STRING, timereq STRING, s_sitename STRING, cs_method STRING, cs_uri_stem STRING,
        cs_uri_query STRING, s_port STRING, cs_username STRING, c_ip STRING, cs_User_Agent STRING,
        cs_Cookie STRING, cs_Referer STRING, cs_host STRING, sc_status INT, sc_substatus STRING,
        sc_win32_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE
    LOCATION 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs2'
    tblproperties ("skip.header.line.count"="1");
    set mapred.input.dir.recursive=true;
    set hive.mapred.supports.subdirectories=true;
    select count(*) from websites_logs_raw;
  • 53. bin\sqoop import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales
    New-AzureHDInsightSqoopJobDefinition -Command 'import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales'
    REGISTER lib/piggybank.jar;
    REGISTER c:\apps\dist\sqljdbc_3.0\enu\sqljdbc4.jar;
    STORE limited_summary INTO '/doesnotmatter' USING org.apache.pig.piggybank.storage.DBStorage('com.microsoft.sqlserver.jdbc.SQLServerDriver', 'jdbc:sqlserver://[yourserver].database.windows.net;database=AdventureWorks2012;user=[username];password=[password]', 'INSERT INTO OutputFromPig(cs_uri_stem, NumberOfRequests, TotalEgress, AverageTimeTaken) VALUES (?,?,?,?)');