This session gives you an architectural overview of HDP 2.0 (http://hortonworks.com/products/hdp-windows/) and HDInsight, along with an introduction to their inner workings. The world has embraced the Hadoop toolkit to solve data problems ranging from ETL and data warehousing to event-processing pipelines. Because Hadoop consists of many components, services and interfaces, understanding its architecture is crucial before you can successfully integrate it into your own environment.
3. The 4Vs of Big Data:
Volume, Velocity, Variability, & Variety
Source: http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
5. New In Hadoop 2
• YARN
  • ResourceManager
  • NodeManager
  • ApplicationMaster
• HDFS 2
  • NameNode HA
  • Snapshots
  • Federation
Source: http://hortonworks.com/hadoop/yarn/
6. Hortonworks Data Platform For Windows
• Leverages work from Hortonworks and Microsoft
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.2)
• YARN
• Stinger Phase 2 (faster queries)
• Only distribution available on Windows Server
• Harness existing .NET and Java skills to write MapReduce
• Utilize familiar BI tools for analysis, including Microsoft Excel
On-Premise Self-Deploy (Hadoop)
See: http://hortonworks.com/products/releases/hdp-2-windows/
7. Microsoft Azure HDInsight 3.0
• Microsoft's cloud Hadoop offering
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.2)
• YARN
• Stinger Phase 2 (faster queries)
• Up and running in minutes with no hardware to deploy
• Harness existing .NET and Java skills to write MapReduce
• Utilize familiar BI tools for analysis, including Microsoft Excel
Cloud, Hadoop
Microsoft Azure
See: http://www.windowsazure.com/en-us/solutions/big-data/
11. HDFS Architecture
• Block based (64 MB default)
• Hierarchical file organization of directories and files
• Write once, read many
• Highly portable
• Optimized for small numbers of very large files
Distributed Fault Tolerant File System
Source: http://hortonworks.com/hadoop/hdfs/
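To make the "block based" and "small numbers of very large files" points concrete, here is a minimal Python sketch of how a file maps onto fixed-size blocks. It is only an illustration of the arithmetic, not HDFS code: the function names are hypothetical, and the 64 MB figure is simply the default quoted on the slide (the block size is configurable per cluster).

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # default quoted on the slide; an assumption, configurable in practice


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file of the given size occupies (ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size


def block_sizes(file_size_bytes, block_size=BLOCK_SIZE):
    """Sizes of the individual blocks; only the last one may be partial."""
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    # The same 1 GB of data, stored as one large file or as 1024 files of 1 MB each.
    one_large_file = block_count(1024 * MB)
    many_small_files = 1024 * block_count(1 * MB)
    print(one_large_file, many_small_files)
```

The same volume of data needs far more blocks (and file entries) when it is split into many small files, and each of those has to be tracked by the NameNode. That is one reason HDFS favors a small number of very large files.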
15. Introduction to Map/Reduce
Functionally
Map f(k1,v1) → list(k2,v2)
Reduce f(k2, list(v2)) → (k2, v3)
In Practice, WordCount
The quick brown fox jumps over the lazy dog
Map
(the,1), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (the,1), (lazy,1), (dog,1)
Shuffle
(the,(1,1)), (quick,(1)), (brown,(1)), (fox,(1)), (jumps,(1)), (over,(1)), (lazy,(1)), (dog,(1))
Reduce
(the,2), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (lazy,1), (dog,1)
In Code
Then, scale to TB/PB of data over 10s, 100s or 1000s of nodes
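The code shown on the original slide did not survive in this transcript. As a stand-in, here is a minimal single-process Python sketch of the three WordCount phases described above. It is not the Hadoop API: the function names are hypothetical, and lower-casing the words is an assumption made so that "The" and "the" share a key, as in the worked example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    counts = word_count(["The quick brown fox jumps over the lazy dog"])
    print(counts["the"], counts["fox"])
```

On a real cluster the map and reduce functions run in parallel on many nodes and the framework performs the shuffle between them; the logic per key stays the same.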
20.
-- load the space-delimited web logs from Azure Blob storage
logs = LOAD 'wasb://sampledata@mwinklenortheurope.blob.core.windows.net/weblogs'
    USING PigStorage(' ') AS (datereq:chararray, timereq:chararray, s_sitename:chararray, cs_method:chararray,
    cs_uri_stem:chararray, cs_uri_query:chararray, s_port:chararray, cs_username:chararray, c_ip:chararray,
    cs_User_Agent:chararray, cs_Cookie:chararray, cs_Referer:chararray, cs_host:chararray, sc_status:chararray,
    sc_substatus:chararray, sc_win32_status:chararray, sc_bytes:int, cs_bytes:int, time_taken:int);
SET default_parallel 5;
-- remove header rows
filtered_logs = FILTER logs BY datereq != '#';
-- top 25 referrers by number of requests
referrer_logs = GROUP filtered_logs BY cs_Referer;
summary_referrer = FOREACH referrer_logs GENERATE $0, COUNT($1) AS COUNT,
    SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_referrer BY COUNT DESC;
limit_summary = LIMIT sorted_summary 25;
-- top 25 URI stems by number of requests
grouped_by_stem = GROUP filtered_logs BY cs_uri_stem;
summary_ip = FOREACH grouped_by_stem GENERATE $0, COUNT($1) AS NumberOfRequests,
    SUM(filtered_logs.sc_bytes) AS TotalEgress, AVG(filtered_logs.time_taken) AS AverageTimeTaken;
sorted_summary = ORDER summary_ip BY NumberOfRequests DESC;
limited_summary = LIMIT sorted_summary 25;
-- write the results as tab-delimited text
STORE filtered_logs INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/forhive'
    USING PigStorage('\t');
STORE limited_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/stemstats'
    USING PigStorage('\t');
STORE limit_summary INTO 'wasb://output@mwinklenortheurope.blob.core.windows.net/tmp/results5/referer_logs'
    USING PigStorage('\t');
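For readers new to Pig, the GROUP, FOREACH ... GENERATE, ORDER and LIMIT steps of the script can be sketched in plain Python on a few in-memory rows. This is only an illustration of what the aggregation computes per `cs_uri_stem`: the sample rows and the function name are hypothetical, and details such as Pig's handling of null values are ignored.

```python
from collections import defaultdict

# Hypothetical sample rows: (cs_uri_stem, sc_bytes, time_taken)
SAMPLE_ROWS = [
    ("/index.html", 100, 10),
    ("/index.html", 300, 30),
    ("/about.html", 50, 5),
]


def summarize_by_stem(rows, top_n=25):
    """Group rows by URI stem, aggregate, sort by request count, keep the top N."""
    # GROUP ... BY cs_uri_stem
    groups = defaultdict(list)
    for stem, sc_bytes, time_taken in rows:
        groups[stem].append((sc_bytes, time_taken))

    # FOREACH ... GENERATE key, COUNT, SUM(sc_bytes), AVG(time_taken)
    summary = []
    for stem, hits in groups.items():
        total_egress = sum(sc_bytes for sc_bytes, _ in hits)
        average_time = sum(time_taken for _, time_taken in hits) / len(hits)
        summary.append((stem, len(hits), total_egress, average_time))

    # ORDER ... BY NumberOfRequests DESC, then LIMIT
    summary.sort(key=lambda row: row[1], reverse=True)
    return summary[:top_n]


if __name__ == "__main__":
    for row in summarize_by_stem(SAMPLE_ROWS):
        print(row)
```

Pig compiles the same logical steps into distributed jobs, so the script stays this short even when the input is far too large for one machine.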
44. Resources
• All about HDInsight
• Getting Started with HDInsight
• Windows HDP 2.0
• Hadoop project
• HadoopSDK Codeplex project
• Getting Started with YARN blog series
• YARN book
48. Moving Data Between Stores
• Sqoop
  • Data in or out of a relational store
• Pig
  • Set of storage and loader functions (JDBC, Mongo, etc.)
• Hive
  • Table formats (Mongo, Azure Tables)