3. What is Hadoop?
- Open-source data storage and processing API
- Massively scalable, automatically parallelizable
  - Based on work from Google
    - GFS + MapReduce + BigTable
  - Current distributions based on open-source and vendor work
    - Apache Hadoop
    - Cloudera - CDH4 w/ Impala
    - Hortonworks
    - MapR
    - AWS
    - Windows Azure HDInsight
4. Why Use Hadoop?
- Cheaper
  - Scales to petabytes or more
- Faster
  - Parallel data processing
- Better
  - Suited for particular types of BigData problems
5. What types of business problems for Hadoop?
Source: Cloudera, "Ten Common Hadoopable Problems"
19. So, what's the problem?
- "I can just use some 'SQL-like' language to query Hadoop, right?"
- "Yeah, SQL-on-Hadoop… that's what I want."
- "I don't want to learn a new query language and…"
- "I want massive scale for my shiny, new BigData."
22. What is Hive?
- A data warehouse system for Hadoop that
  - facilitates easy data summarization
  - supports ad-hoc queries (still batch, though…)
  - was created by Facebook
- A mechanism to project structure onto this data and query the data using a SQL-like language - HiveQL
  - Interactive console -or-
  - Execute scripts
  - Kicks off one or more MapReduce jobs in the background
- An ability to use indexes and built-in or user-defined functions
23. Is HQL == ANSI SQL? - NO!
-- Non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL):
SELECT a.*
FROM a
JOIN b ON (a.id <> b.id)
Note: Joins are quite different in MapReduce; more on that coming up…
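To see why Hive restricts joins to equality predicates, it helps to look at how an equi-join maps onto MapReduce: the join key becomes the shuffle key, so rows that should pair up land on the same reducer. A non-equality predicate has no single key to shuffle on. Below is a minimal in-process Python sketch of a reduce-side (repartition) join, with hypothetical (key, row) input data:

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Sketch of a reduce-side (repartition) equi-join.

    Each record is bucketed by its join key, mimicking how an equi-join
    becomes a MapReduce shuffle: the join key is the map output key,
    so only equality joins map naturally onto the framework.
    """
    buckets = defaultdict(lambda: ([], []))   # key -> (left rows, right rows)
    for key, row in left:                     # "map" phase: emit (key, tagged row)
        buckets[key][0].append(row)
    for key, row in right:
        buckets[key][1].append(row)
    out = []
    for key, (lrows, rrows) in buckets.items():  # "reduce" phase: pair rows per key
        for l in lrows:
            for r in rrows:
                out.append((key, l, r))
    return out
```

Keys that appear on only one side (like an unmatched `id`) simply produce no output, matching inner-join semantics.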
25. Common Hadoop Shell Commands
hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>
Tips
-- "sudo" means "run as administrator" (super user)
-- Some Hadoop configurations use "hadoop dfs" rather than "hadoop fs"; file paths to Hadoop differ for the former, see the link included for more detail
28. Understanding MapReduce - P1/3
- Map>>
  - (K1, V1) →
    - Info in
    - Input Split
  - list (K2, V2)
    - Key/Value out (intermediate values)
    - One list per local node
    - Can implement a local Reducer (or Combiner)
29. Understanding MapReduce - P2/3
- Map>> (as in P1/3)
- Shuffle/Sort>>
30. Understanding MapReduce - P3/3
- Map>> (as in P1/3)
- Shuffle/Sort>>
- Reduce
  - (K2, list(V2)) →
    - Shuffle/Sort phase precedes Reduce phase
    - Combines Map output into a list
  - list (K3, V3)
    - Usually aggregates intermediate values
(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
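The (K1, V1) → map → (K2, V2) → shuffle/sort → reduce → (K3, V3) pipeline above can be sketched in-process. This assumes a word-count job (the canonical example, not one taken from these slides), with Python stand-ins for the Map and Reduce functions:

```python
def map_fn(_key, line):
    """Map: (K1=byte offset, V1=line of text) -> list of (K2=word, V2=1)."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: (K2=word, list(V2)=counts) -> (K3=word, V3=total)."""
    return (word, sum(counts))

def run_job(lines):
    """Drive map -> shuffle/sort -> reduce for one input split, in-process."""
    intermediate = {}
    for offset, line in enumerate(lines):           # map phase
        for k2, v2 in map_fn(offset, line):
            intermediate.setdefault(k2, []).append(v2)
    # shuffle/sort phase: values are already grouped by key; sort the keys
    return [reduce_fn(k2, v2s) for k2, v2s in sorted(intermediate.items())]
```

A real cluster runs many map tasks in parallel (one per input split) and partitions the intermediate keys across reducers; this sketch collapses that into one process to show the data flow.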
36. Ways to run MapReduce Jobs
- Configure JobConf options
- From a development environment (IDE)
- From a GUI utility
  - Cloudera - Hue
  - Microsoft Azure - HDInsight console
- From the command line
  - hadoop jar <filename.jar> input output
42. Where is your Data coming from?
- On premises
  - Local file system
  - Local HDFS instance
- Private cloud
  - Cloud storage
- Public cloud
  - Input storage buckets
  - Script/code buckets
  - Output buckets
44. Demo â Other Types of MapReduce
Tip: Review the Java MapReduce code in these samples as well.
45. Methods to write MapReduce Jobs
- Typical - usually written in Java
  - MapReduce 2.0 API
  - MapReduce 1.0 API
- Streaming
  - Uses stdin and stdout
  - Can use any language to write Map and Reduce functions
  - C#, Python, JavaScript, etc.
- Pipes
  - Often used with C++
- Abstraction libraries
  - Hive, Pig, etc. - write in a higher-level language, generate one or more MapReduce jobs
49. Using AWS MapReduce
Note: You can select Apache or MapR Hadoop distributions to run your MapReduce job on the AWS cloud.
50. What is Pig?
- ETL library for HDFS, developed at Yahoo
  - Pig runtime
  - Pig language
  - Generates MapReduce jobs
- ETL steps
  - LOAD <file>
  - FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
  - DUMP {to screen for testing} → STORE <newFile>
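As a rough illustration of what the ETL steps above do (plain Python, not Pig Latin itself), here is a LOAD → FILTER → GROUP/COUNT flow over hypothetical tab-separated (user, action) records; on a cluster, Pig would compile each such step into one or more MapReduce jobs:

```python
from collections import Counter

def pig_like_pipeline(lines):
    """In-process sketch of a Pig-style ETL pipeline over (user, action) rows."""
    # LOAD: parse tab-separated records into tuples
    records = [tuple(line.split("\t")) for line in lines]
    # FILTER: keep only "click" actions
    clicks = [(user, action) for user, action in records if action == "click"]
    # GROUP BY user + COUNT: clicks per user
    counts = Counter(user for user, _ in clicks)
    # DUMP / STORE would print or persist the result; here we return it sorted
    return sorted(counts.items())
```

The input format, the "click" filter, and the per-user grouping are all invented for the example; the point is the shape of the pipeline, not the data.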
64. Demo - Unit Testing MapReduce
- Using MRUnit + Asserts
- Optionally using ApprovalTests
Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
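MRUnit itself is a Java library for testing mappers and reducers in isolation; as an analogous sketch of the "unit test + asserts" idea, here is a Python unittest for a streaming-style word-count mapper (the mapper function is a stand-in, not code from the demo):

```python
import unittest

def wordcount_map(line):
    """Streaming-style mapper under test: line -> list of (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

class WordCountMapTest(unittest.TestCase):
    # Like an MRUnit MapDriver, each test feeds one input record to the
    # mapper in isolation and asserts on the exact emitted key/value pairs.
    def test_emits_one_pair_per_word(self):
        self.assertEqual(wordcount_map("hadoop pig hive"),
                         [("hadoop", 1), ("pig", 1), ("hive", 1)])

    def test_empty_line_emits_nothing(self):
        self.assertEqual(wordcount_map(""), [])
```

Run with `python -m unittest <module>`; the benefit, as with MRUnit, is testing map/reduce logic without spinning up a cluster.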
65. A note about MapReduce 2.0 (YARN)
- Splits the existing JobTracker's roles:
  - resource management
  - job lifecycle management
- MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as better scalability
  - through distributed job lifecycle management
  - support for multiple Hadoop MapReduce API versions in a single cluster
66. What is Mahout?
- Library of common machine learning algorithms
- Over 20 algorithms
  - Recommendation (likelihood - e.g., Pandora)
  - Classification (known data and new data - e.g., spam identification)
  - Clustering (new groups of similar data - e.g., Google News)
- Can non-statisticians find value using this library?
68. Setting up Hadoop on Windows
- For local development
- Install the binaries from the Web Platform Installer
- Install the .NET Azure SDK (for Azure BLOB storage)
- Install other tools
  - Neudesic Azure Storage Explorer
71. Clients (Visualizations) for HDFS
- Many clients use Hive
  - Often included in GUI console tools for Hadoop distributions as well
- Microsoft includes clients in Office (Excel 2013)
  - Direct Hive client
  - Connect using ODBC
  - PowerPivot - data mashups and presentation
  - Data Explorer - connect, transform, mash up, and filter
  - Hadoop SDK on CodePlex
- Other popular clients
  - QlikView
  - Tableau
  - Karmasphere
78. Comparing: RDBMS vs. Hadoop

                       Traditional RDBMS          Hadoop / MapReduce
Data size              Gigabytes (Terabytes)      Petabytes (Exabytes)
Access                 Interactive and batch      Batch - NOT interactive
Updates                Read/write many times      Write once, read many times
Structure              Static schema              Dynamic schema
Integrity              High (ACID)                Low
Scaling                Nonlinear                  Linear
Query response time    Can be near immediate      Has latency (due to batch processing)
79. Microsoft alternatives to MapReduce
- Use an existing relational system
  - Scale via cloud or edition (i.e., Enterprise or PDW)
- Use in-memory OLAP
  - SQL Server Analysis Services Tabular Models
- Use "productized" Dremel
  - Microsoft PolyBase - status = beta?
80. Looking Forward - Dremel or Apache Drill
- Based on original research from Google
http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf
Also: http://gigaom.com/2012/06/05/10-ways-companies-are-using-hadoop-to-do-more-than-serve-ads/
Image from http://curiousellie.typepad.com/.a/6a0133ec911c1f970b0168ebe6a2e4970c-500wi
http://hadoop.apache.org/docs/r1.1.2/streaming.html
How to run and compile a Hadoop Java program: https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program
Sample commands to compile a Java class:
javac -classpath ~/hadoop/hadoop-core-1.0.1.jar;commons-cli-1.2.jar -d classes <nameOfJavaFile>.java && jar -cvf <nameOfJarFile>.jar -C classes/
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Tips from Cloudera -- http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ & http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
Download local Hadoop via the Web Platform Installer. Also download the Azure .NET SDK for VS 2012.
Link to download the Windows Azure storage explorer: http://azurestorageexplorer.codeplex.com/
Link for downloading the .NET SDK for Hadoop: http://hadoopsdk.codeplex.com/wikipage?title=roadmap&referringTitle=Home
Image from - http://bluewatersql.files.wordpress.com/2013/04/image12.png