The Fundamentals Guide to HDP and HDInsight

Gert Drapers (#DataDude)
Principal Software Design Engineer

http://www.economist.com/node/15579717?Story_ID=15579717
Copyright © The Economist Newspaper Limited 2012. All rights reserved

The 4Vs of Big Data:
Volume, Velocity, Variability, & Variety
Source: http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data

New In Hadoop 2
• YARN
  • ResourceManager
  • NodeManager
  • ApplicationMaster
• HDFS 2
  • NameNode HA
  • Snapshots
  • Federation
Source: http://hortonworks.com/hadoop/yarn/

Hortonworks Data Platform For Windows
• Leverages work from Hortonworks and Microsoft
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.2)
• YARN
• Stinger Phase 2 (Faster queries)
• Only distribution available on Windows Server
• Harness existing .NET and Java skills to write MapReduce
• Utilize familiar BI tools for analysis, including Microsoft Excel
On-Premises Self-Deploy (Hadoop)
See: http://hortonworks.com/products/releases/hdp-2-windows/

Microsoft Azure HDInsight 3.0
• Microsoft’s cloud Hadoop offering
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.2)
• YARN
• Stinger Phase 2 (Faster queries)
• Up and running in minutes with no hardware to deploy
• Harness existing .NET and Java skills to write MapReduce
• Utilize familiar BI tools for analysis, including Microsoft Excel
Cloud, Hadoop
Microsoft Azure
See: http://www.windowsazure.com/en-us/solutions/big-data/

Stinger Phase 2 in Hive 0.12
• Query optimizer (QO) improvements
• Predicate pushdown
• ORC file improvements
http://hortonworks.com/labs/stinger/

Demo: Getting Started with Hadoop 2 in Azure with HDInsight

HDFS Architecture
Distributed, fault-tolerant file system
• Block based (64 MB default)
• Hierarchical file organization of directories and files
• Write once, read many
• Highly portable
• Optimized for small numbers of very large files
Source: http://hortonworks.com/hadoop/hdfs/
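
The "small numbers of very large files" point can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, it assumes the 64 MB block size quoted on the slide, and it only counts blocks without talking to HDFS.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide


def blocks_for_file(size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a single file of size_bytes occupies (hypothetical helper)."""
    if size_bytes == 0:
        return 0
    return math.ceil(size_bytes / block_size)


def total_blocks(file_sizes, block_size=BLOCK_SIZE):
    """Total block count for a collection of files."""
    return sum(blocks_for_file(size, block_size) for size in file_sizes)


one_gib = 1024 ** 3
# The same 1 GiB of data, stored as one file or as 1024 files of 1 MiB each.
large = total_blocks([one_gib])
small = total_blocks([1024 ** 2] * 1024)
print(large, small)  # prints: 16 1024
```

The NameNode keeps an entry for every file and block in memory, so the many-small-files layout costs far more metadata for the same amount of data.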

A long time ago, in a data center far, far away…

Episode IV
There was Map Reduce

Introduction to Map/Reduce
Functionally
Map: f(k1, v1) → list(k2, v2)
Reduce: f(k2, list(v2)) → (k2, v3)
In Practice, WordCount
Input: The quick brown fox jumps over the lazy dog
Map
(the,1), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (the,1), (lazy,1), (dog,1)
Shuffle
(the,(1,1)), (quick,(1)), (brown,(1)), (fox,(1)), (jumps,(1)), (over,(1)), (lazy,(1)), (dog,(1))
Reduce
(the,2), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (lazy,1), (dog,1)
In Code
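
The code shown on the original slide did not survive extraction. As a stand-in, here is a minimal single-process sketch of the three WordCount steps in Python. It is not Hadoop code: the function names are invented for illustration, and the shuffle that the framework performs across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: one input line -> list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reduce: (word, [1, 1, ...]) -> (word, total)."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["The quick brown fox jumps over the lazy dog"])
print(counts["the"], counts["fox"])  # prints: 2 1
```

Because each call to map_phase and reduce_phase depends only on its own arguments, the real framework can run many of them in parallel on different nodes.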
Then, scale to TB/PB of data over 10s, 100s, or 1000s of nodes

Episode V
Then came the abstractions

logs = LOAD 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs'
    USING PigStorage(' ') AS (
        datereq:chararray, timereq:chararray, s_sitename:chararray, cs_method:chararray,
        cs_uri_stem:chararray, cs_uri_query:chararray, s_port:chararray, cs_username:chararray,
        c_ip:chararray, cs_User_Agent:chararray, cs_Cookie:chararray, cs_Referer:chararray,
        cs_host:chararray, sc_status:chararray, sc_substatus:chararray, sc_win32_status:chararray,
        sc_bytes:int, cs_bytes:int, time_taken:int);
SET default_parallel 5;
-- remove header rows
filtered_logs = FILTER logs BY datereq != '#';
-- summarize requests by referrer and keep the top 25
referrer_logs = GROUP filtered_logs BY cs_Referer;
summary_referrer = FOREACH referrer_logs GENERATE $0, COUNT($1) AS COUNT,
    SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_referrer BY COUNT DESC;
limit_summary = LIMIT sorted_summary 25;
-- summarize requests by URI stem and keep the top 25
grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests,
    SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_stem_summary = ORDER summary_ip BY NumberOfRequests DESC;
limited_summary = LIMIT sorted_stem_summary 25;
-- write tab-delimited results
STORE filtered_logs INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/forhive'
    USING PigStorage('\t');
STORE limited_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/stemstats'
    USING PigStorage('\t');
-- the slide cuts off after the path; the USING clause is assumed to match the two STOREs above
STORE limit_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/referer_logs'
    USING PigStorage('\t');
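
For readers new to Pig, the GROUP, FOREACH ... GENERATE, ORDER, and LIMIT steps in the script above do roughly the following, shown here in plain Python on a few made-up rows. The sample records and the function name summarize_by are hypothetical; only the field names come from the script, and Pig's handling of nulls is ignored.

```python
from collections import defaultdict

# Hypothetical sample rows; field names follow the Pig schema above.
rows = [
    {"cs_uri_stem": "/index.html", "sc_bytes": 500, "time_taken": 10},
    {"cs_uri_stem": "/index.html", "sc_bytes": 700, "time_taken": 30},
    {"cs_uri_stem": "/about.html", "sc_bytes": 200, "time_taken": 5},
]


def summarize_by(rows, key, limit):
    groups = defaultdict(list)                         # GROUP ... BY key
    for row in rows:
        groups[row[key]].append(row)
    summary = [
        (
            group_key,
            len(group),                                    # COUNT($1)
            sum(r["sc_bytes"] for r in group),             # SUM(sc_bytes)
            sum(r["time_taken"] for r in group) / len(group),  # AVG(time_taken)
        )
        for group_key, group in groups.items()         # FOREACH ... GENERATE
    ]
    summary.sort(key=lambda item: item[1], reverse=True)   # ORDER ... DESC
    return summary[:limit]                             # LIMIT


top = summarize_by(rows, "cs_uri_stem", 25)
print(top[0])  # prints: ('/index.html', 2, 1200, 20.0)
```

The difference in Pig is that each of these steps is compiled into MapReduce jobs, so the same few lines run unchanged over the full log set.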

CREATE EXTERNAL TABLE websites_logs_raw (datereq STRING,
timereq STRING,
s_sitename STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
s_port STRING,
cs_username STRING,
c_ip STRING,
cs_User_Agent STRING,
cs_Cookie STRING,
cs_Referer STRING,
cs_host STRING,
sc_status INT,
sc_substatus STRING,
sc_win32_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs2'
tblproperties ("skip.header.line.count"="1");
set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
select count(*) from websites_logs_raw;

Cascading/Scalding to bring a modern JVM API for analytics

WordCount in Scalding
See: https://github.com/twitter/scalding

But the abstractions all shared one thing… Map Reduce

WordCount in Scalding…
See: https://github.com/twitter/scalding
Map Phase
Reduce Phase

Map/Reduce v1 Architecture
Source: http://hortonworks.com/wp-content/uploads/2012/08/MRArch.png

Episode VI
One YARN to rule them all

Compute Model != Resource Model

YARN Architecture
Source: http://hortonworks.com/wp-content/uploads/2012/08/YARNArch.png
• Removes the contention caused by one JobTracker doing everything
• More resilient to ResourceManager (RM) failures
• Scales to a larger number of active jobs
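
The split between resource management and job logic can be illustrated with a toy model. This is not the YARN API: the class and method names are borrowed from the architecture diagram for readability, the slot counts are made up, and real containers, heartbeats, and scheduling policies are left out.

```python
class ResourceManager:
    """Toy cluster-wide arbiter: it only knows about free slots, not about jobs."""

    def __init__(self, node_capacity):
        # node name -> free container slots (reported by NodeManagers in real YARN)
        self.free = dict(node_capacity)

    def allocate(self, wanted):
        """Grant up to `wanted` containers and return the nodes they landed on."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < wanted:
                self.free[node] -= 1
                granted.append(node)
        return granted

    def release(self, nodes):
        for node in nodes:
            self.free[node] += 1


class ApplicationMaster:
    """One per application: owns that job's task list and progress."""

    def __init__(self, name, tasks):
        self.name = name
        self.pending = list(tasks)
        self.done = []

    def run_round(self, rm):
        """Ask for containers, run one task per container, then hand them back."""
        containers = rm.allocate(len(self.pending))
        for _node in containers:
            self.done.append(self.pending.pop(0))
        rm.release(containers)
        return len(containers)


rm = ResourceManager({"node1": 2, "node2": 1})
wordcount = ApplicationMaster("wordcount", ["map-0", "map-1", "map-2", "reduce-0"])
other = ApplicationMaster("pig-job", ["map-0"])

first = wordcount.run_round(rm)   # capped by the 3 free slots in the cluster
second = wordcount.run_round(rm)  # the remaining task runs in the next round
third = other.run_round(rm)       # a second application shares the same cluster
print(first, second, third)  # prints: 3 1 1
```

In this sketch the ResourceManager never looks at task names or job state, which is the point of the "Compute Model != Resource Model" slide: job bookkeeping lives in a per-application master instead of in one central tracker.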

Other Interesting YARN projects

Some Existing YARN apps
• Storm on YARN
• HBase on YARN
• Spark
• Giraph
• Hamster (MPI on YARN)
• Memcached
• Dryad
Source: http://hortonworks.com/

Writing your own YARN app for fun and profit…

See Slide 20 – Enter Abstractions

Tez
http://tez.incubator.apache.org/
Source: http://hortonworks.com/blog/introducing-tez-faster-hadoop-processing/

REEF
http://www.reef-project.org/

Kitten
https://github.com/cloudera/kitten
http://www.lua.org/manual/5.1

Dryad on YARN
sources
background


The Microsoft Data Platform

Resources
• All about HDInsight
• Getting Started with HDInsight
• Windows HDP 2.0
• Hadoop project
• HadoopSDK Codeplex project
• Getting Started with YARN blog series
• YARN book

Let us know how you feel about this session! Give your feedback via www.techdaysapp.nl for a chance to win one of 20 prizes*. Winners will be announced via Twitter (#TechDaysNL). Use the personal code on your badge.
* All results are final; prizes are examples.

Moving Data Between Stores
• Sqoop
  • Data in or out of a relational store
• Pig
  • Set of storage and load functions (JDBC, Mongo, etc.)
• Hive
  • Table formats (Mongo, Azure Tables)

Website log processing, Pig, Hive

logs = LOAD 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs'
    USING PigStorage(' ') AS (
        datereq:chararray, timereq:chararray, s_sitename:chararray, cs_method:chararray,
        cs_uri_stem:chararray, cs_uri_query:chararray, s_port:chararray, cs_username:chararray,
        c_ip:chararray, cs_User_Agent:chararray, cs_Cookie:chararray, cs_Referer:chararray,
        cs_host:chararray, sc_status:chararray, sc_substatus:chararray, sc_win32_status:chararray,
        sc_bytes:int, cs_bytes:int, time_taken:int);
SET default_parallel 100;
-- remove header rows
filtered_logs = FILTER logs BY datereq != '#';
grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests,
    SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
limited_summary = LIMIT sorted_summary 1000;
-- STORE limited_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/build2014/stats' USING PigStorage('\t');

bin\sqoop import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales

New-AzureHDInsightSqoopJobDefinition -Command 'import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales'

REGISTER lib/piggybank.jar;
REGISTER c:\apps\dist\sqljdbc_3.0\enu\sqljdbc4.jar;
-- DBStorage writes through JDBC; the INTO path is required by STORE but not used
STORE limited_summary INTO '/doesnotmatter'
    USING org.apache.pig.piggybank.storage.DBStorage('com.microsoft.sqlserver.jdbc.SQLServerDriver',
    'jdbc:sqlserver://[yourserver].database.windows.net;database=AdventureWorks2012;user=[username];password=[password]',
    'INSERT INTO OutputFromPig(cs_uri_stem, NumberOfRequests, TotalEgress, AverageTimeTaken) VALUES (?,?,?,?)');
