3. What is Hadoop?
Open-source data storage and processing API
Massively scalable, automatically parallelizable
Based on work from Google
GFS + MapReduce + BigTable
Current Distributions based on Open Source and Vendor Work
Apache Hadoop
Cloudera – CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight
4. Why Use Hadoop?
Cheaper
Scales to Petabytes or more
Faster
Parallel data processing
Better
Suited for particular types of BigData problems
5. What types of business problems for Hadoop?
Source: Cloudera “Ten Common Hadoopable Problems”
19. So, what’s the problem?
“I can just use some ‘SQL-like’ language to query Hadoop, right?”
“Yeah, SQL-on-Hadoop…that’s what I want.”
“I don’t want to learn a new query language and…”
“I want massive scale for my shiny, new BigData.”
22. What is Hive?
a data warehouse system for Hadoop that
facilitates easy data summarization
supports ad-hoc queries (still batch though…)
created by Facebook
a mechanism to project structure onto this data and query the data using a
SQL-like language – HiveQL
Interactive console – or – execute scripts
Kicks off one or more MapReduce jobs in the background
an ability to use indexes, built-in user-defined functions
23. Is HQL == ANSI SQL? – NO!
-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)
SELECT a.*
FROM a
JOIN b ON (a.id <> b.id)
Note: Joins are quite different in MapReduce, more on that coming up…
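The restriction above follows from how a reduce-side join works: the shuffle groups records by the equality column, so an arbitrary `<>` predicate has no single key to shuffle on. Below is a minimal Python sketch of a reduce-side equi-join (illustrative names and toy data, not the Hadoop API): the mapper tags each row with its source table, the shuffle groups by join key, and the reducer pairs matching rows.

```python
from collections import defaultdict

# Toy datasets: (id, value) rows standing in for tables a and b.
table_a = [(1, "alpha"), (2, "beta")]
table_b = [(1, "x"), (1, "y"), (3, "z")]

def map_phase():
    """Emit (join_key, (source_tag, row)) -- the key MUST be the equality column."""
    for key, val in table_a:
        yield key, ("a", val)
    for key, val in table_b:
        yield key, ("b", val)

def shuffle(pairs):
    """Group tagged rows by join key, as the framework's shuffle would."""
    grouped = defaultdict(list)
    for key, tagged in pairs:
        grouped[key].append(tagged)
    return grouped

def reduce_phase(grouped):
    """Pair every 'a' row with every 'b' row sharing the same key (inner equi-join)."""
    for key, rows in sorted(grouped.items()):
        a_rows = [v for tag, v in rows if tag == "a"]
        b_rows = [v for tag, v in rows if tag == "b"]
        for av in a_rows:
            for bv in b_rows:
                yield key, av, bv

result = list(reduce_phase(shuffle(map_phase())))
# Only key 1 appears in both tables, so only its pairings survive.
```

Because each reducer only ever sees rows sharing one key, an `a.id <> b.id` condition would require every row to meet every other row – which is exactly what the shuffle is designed to avoid.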
25. Common Hadoop Shell Commands
hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>
Tips
-- ‘sudo’ means ‘run as administrator’ (super user)
-- some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’ – file paths to Hadoop differ for the former; see the included link for more detail
28. Understanding MapReduce – P1/3
Map>>
(K1, V1) – info in input split
list (K2, V2) – key/value out (intermediate values)
One list per local node
Can implement a local Reducer (or Combiner)
29. Understanding MapReduce – P2/3
Map>> (as in P1/3)
Shuffle/Sort>>
30. Understanding MapReduce – P3/3
Map>> (as in P1/3)
Shuffle/Sort>> – precedes the Reduce phase; combines Map output into a list per key
Reduce
(K2, list(V2)) in
list (K3, V3) out – usually aggregates the intermediate values
(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
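The (K1, V1) → list(K2, V2) → (K2, list(V2)) → list(K3, V3) flow above can be simulated locally with the classic word count – a Python sketch with illustrative names, not the Hadoop API:

```python
from collections import defaultdict

# (K1 = byte offset, V1 = line of text), as a TextInputFormat split would supply
lines = {0: "the quick brown fox", 1: "the lazy dog"}

def mapper(k1, v1):
    # (K1, V1) -> list(K2, V2): emit one (word, 1) pair per token
    return [(word, 1) for word in v1.split()]

def shuffle_sort(pairs):
    # Group intermediate values: (K2, V2)* -> (K2, list(V2)), sorted by key
    grouped = defaultdict(list)
    for k2, v2 in pairs:
        grouped[k2].append(v2)
    return dict(sorted(grouped.items()))

def reducer(k2, values):
    # (K2, list(V2)) -> (K3, V3): aggregate the intermediate values
    return (k2, sum(values))

intermediate = [pair for k1, v1 in lines.items() for pair in mapper(k1, v1)]
counts = dict(reducer(k2, vs) for k2, vs in shuffle_sort(intermediate).items())
# counts maps each word to its total, e.g. "the" -> 2
```

A Combiner would simply be `reducer` applied to each node's local output before the shuffle, shrinking the data moved across the network.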
36. Ways to run MapReduce Jobs
Configure JobConf options
From Development Environment (IDE)
From a GUI utility
Cloudera – Hue
Microsoft Azure – HDInsight console
From the command line
hadoop jar <filename.jar> input output
42. Where is your Data coming from?
On premises
Local file system
Local HDFS instance
Private Cloud
Cloud storage
Public Cloud
Input Storage buckets
Script / Code buckets
Output buckets
44. Demo – Other Types of MapReduce
Tip: Review the Java MapReduce code in these samples as well.
45. Methods to write MapReduce Jobs
Typical – usually written in Java
MapReduce 2.0 API
MapReduce 1.0 API
Streaming
Uses stdin and stdout
Can use any language to write Map and Reduce Functions
C#, Python, JavaScript, etc…
Pipes
Often used with C++
Abstraction libraries
Hive, Pig, etc… write in a higher-level language, generate one or more MapReduce jobs
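For the Streaming option, the Map and Reduce functions are just programs reading stdin and writing tab-separated lines to stdout. A minimal Python word-count sketch follows – the function names and the local dry run are illustrative; in a real job, mapper and reducer would be separate scripts passed via the streaming jar's -mapper and -reducer options:

```python
import io
from itertools import groupby

def mapper(stdin, stdout):
    # Emit "word<TAB>1" for every token -- Streaming passes records as text lines.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    # Hadoop sorts mapper output by key before the reducer sees it,
    # so equal keys arrive on consecutive lines and groupby suffices.
    pairs = (line.rstrip("\n").split("\t") for line in stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        stdout.write(f"{word}\t{sum(int(v) for _, v in group)}\n")

# Local dry run, standing in for: cat input | mapper | sort | reducer
map_out = io.StringIO()
mapper(io.StringIO("the quick fox\nthe dog\n"), map_out)
sorted_lines = "".join(sorted(map_out.getvalue().splitlines(keepends=True)))
red_out = io.StringIO()
reducer(io.StringIO(sorted_lines), red_out)
```

The same stdin/stdout contract is what lets C#, JavaScript, or any other language plug into Streaming.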
49. Using AWS MapReduce
Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
50. What is Pig?
ETL Library for HDFS developed at Yahoo
Pig Runtime
Pig Language
Generates MapReduce Jobs
ETL steps
LOAD <file>
FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
DUMP {to screen for testing} STORE <newFile>
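What those ETL steps compute can be mirrored in plain Python to see what the generated MapReduce jobs would do – the sample click data here is made up for illustration:

```python
from collections import Counter

# LOAD <file> -- stand-in: parse CSV rows of (user, action)
raw = "alice,click\nbob,view\nalice,click\ncarol,view\nbob,click\n"
records = [line.split(",") for line in raw.strip().splitlines()]

# FILTER: keep only 'click' events
clicks = [(user, action) for user, action in records if action == "click"]

# GROUP BY user ... COUNT: clicks per user
counts = Counter(user for user, _ in clicks)

# DUMP {to screen for testing}
for user, n in sorted(counts.items()):
    print(f"{user}\t{n}")
```

In Pig, each of those steps is one line of Pig Latin, and the runtime decides how many MapReduce jobs the pipeline compiles into.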
64. Demo – Unit Testing MapReduce
Using MRUnit + Asserts
Optionally using ApprovalTests
Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
65. A note about MapReduce 2.0
Splits the existing JobTracker’s roles
resource management
job lifecycle management
MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as better scalability through distributed job lifecycle management
support for multiple Hadoop MapReduce API versions in a single cluster
66. What is Mahout?
Library with common machine learning algorithms
Over 20 algorithms
Recommendation (likelihood – Pandora)
Classification (known data and new data – spam id)
Clustering (new groups of similar data – Google news)
Can non-statisticians find value using this library?
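To give a flavor of the “recommendation” category, here is a toy co-occurrence recommender in Python – an illustrative sketch of the idea, not the Mahout API, with made-up purchase data:

```python
from collections import Counter

# Toy purchase history: user -> set of items bought (illustrative data)
baskets = {
    "u1": {"book", "pen"},
    "u2": {"book", "pen", "ink"},
    "u3": {"book", "ink"},
}

def recommend(user):
    """Recommend items the user lacks, scored by how many overlapping users have them."""
    have = baskets[user]
    scores = Counter()
    for other, items in baskets.items():
        if other == user or not (have & items):
            continue  # skip self and users with no items in common
        for item in items - have:
            scores[item] += 1
    return [item for item, _ in scores.most_common()]

recs = recommend("u1")  # users who share u1's items also bought "ink"
```

Mahout's value is running this kind of computation – at much larger scale and with far better statistics – as MapReduce jobs over HDFS, which is why non-statisticians can still get useful results from it.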
68. Setting up Hadoop on Windows
For local development
Install the binaries from the Web Platform Installer
Install .NET Azure SDK (for Azure BLOB storage)
Install other tools
Neudesic Azure Storage Viewer
71. Clients (Visualizations) for HDFS
Many clients use Hive
Often included in GUI console tools for Hadoop distributions as well
Microsoft includes clients in Office (Excel 2013)
Direct Hive client
Connect using ODBC
PowerPivot – data mashups and presentation
Data Explorer – connect, transform, mashup and filter
Hadoop SDK on Codeplex
Other popular clients
Qlikview
Tableau
Karmasphere
78. Comparing: RDBMS vs. Hadoop
Traditional RDBMS vs. Hadoop / MapReduce
Data Size: Gigabytes (Terabytes) vs. Petabytes (Exabytes)
Access: Interactive and Batch vs. Batch – NOT Interactive
Updates: Read/Write many times vs. Write once, Read many times
Structure: Static Schema vs. Dynamic Schema
Integrity: High (ACID) vs. Low
Scaling: Nonlinear vs. Linear
Query Response Time: Can be near-immediate vs. Has latency (due to batch processing)
79. Microsoft alternatives to MapReduce
Use existing relational system
Scale via cloud or edition (i.e. Enterprise or PDW)
Use in memory OLAP
SQL Server Analysis Services Tabular Models
Use “productized” Dremel
Microsoft Polybase – status = beta?
80. Looking Forward - Dremel or Apache Drill
Based on original research from Google
Source for “Ten Common Hadoopable Problems”: http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf
Also -- http://gigaom.com/2012/06/05/10-ways-companies-are-using-hadoop-to-do-more-than-serve-ads/
Image from http://curiousellie.typepad.com/.a/6a0133ec911c1f970b0168ebe6a2e4970c-500wi
http://hadoop.apache.org/docs/r1.1.2/streaming.html
How to run and compile a Hadoop Java program -- https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program
Sample commands to compile a Java class and package it into a jar (note: on Linux the classpath separator is ‘:’):
javac -classpath ~/hadoop/hadoop-core-1.0.1.jar:commons-cli-1.2.jar -d classes <nameOfJavaFile>.java && jar -cvf <nameOfJarFile>.jar -C classes/ .
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Tips from Cloudera -- http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ & http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
Download local Hadoop via the Web Platform Installer
Also download the Azure .NET SDK for VS 2012
Link to download Windows Azure storage explorer: http://azurestorageexplorer.codeplex.com/
Link for downloading the .NET SDK for Hadoop: http://hadoopsdk.codeplex.com/wikipage?title=roadmap&referringTitle=Home
Image from - http://bluewatersql.files.wordpress.com/2013/04/image12.png