2. What is Hadoop?
Hadoop started out as a subproject of Nutch, created by
Doug Cutting
Hadoop boosted Nutch’s scalability
Enhanced by Yahoo! and became an Apache top-level
project
System for distributed big data processing
Big data means terabyte and petabyte datasets and
more… exabytes, zettabytes?
7. Hadoop basics
Implements Google’s whitepaper:
http://research.google.com/archive/mapreduce.html
Hadoop is a combination of:
HDFS (storage)
MapReduce (computation)
8. HDFS
Hadoop Distributed File System
It’s a file system
bin/hadoop dfs <command> <options>
<command> is one of:
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus,
expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv,
put, rm, rmr, setrep, stat, tail, test, text, touchz
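The same operations are also available from Java through the FileSystem API; a minimal sketch (the paths here are invented for illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(conf);       // handle to the configured file system

          // equivalent of: bin/hadoop dfs -copyFromLocal /tmp/tags01 /input/tags01
          fs.copyFromLocalFile(new Path("/tmp/tags01"), new Path("/input/tags01"));

          // equivalent of: bin/hadoop dfs -ls /input
          for (FileStatus status : fs.listStatus(new Path("/input"))) {
              System.out.println(status.getPath());
          }
      }
  }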
11. Hadoop Distributed File System
Name Node:
Stores file system metadata
Secondary Name Node(s):
Periodically merges the file system image with the edit log
Data Node(s):
Stores the actual data (blocks)
Allows data to be replicated
Constantly reports to the Name Node
12. MapReduce
A programming model for distributed data
processing
The data processing primitives are functions:
Mappers and Reducers
13. MapReduce
To decompose a problem for MapReduce, think of the data in
terms of keys and values:
<key, value>
<user id, user profile>
<timestamp, apache log entry>
<tag, list of tagged images>
14. MapReduce
Mapper
Function that takes a key and a value and emits
zero or more keys and values
Reducer
Function that takes a key and all of its “mapped”
values and emits zero or more new keys and
values
15. MapReduce example
“Hello World” for Hadoop:
http://wiki.apache.org/hadoop/WordCount
“Tag Cloud” example for Hadoop:
[Figure: a tag cloud of tag1…tag6, each tag rendered at a font size driven by weight(tagi)]
16. Tag Cloud example
Input is taggable content (images, posts,
videos) with space-separated tags:
<posti, “tag1 tag2 … tagn”>
Output is each tagi with its count, plus the total tag count:
<tagi, tag count>
<total tags, total tags count>
Results:
weight(tagi) = tagi count / total tags
font(tagi) = fn(weight(tagi))
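As code, the two formulas above; a sketch (the deck does not define fn(), so the linear mapping onto 10..40 px is an assumption):

  class TagCloudWeights {
      // weight(tagi) = tagi count / total tags
      static double weight(int tagCount, int totalTags) {
          return (double) tagCount / totalTags;
      }

      // font(tagi) = fn(weight(tagi)); fn() here maps weight linearly onto 10..40 px
      static int fontSize(int tagCount, int totalTags) {
          return 10 + (int) Math.round(weight(tagCount, totalTags) * 30.0);
      }
  }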
17. Tag Cloud Mapper
Mapper extends the base class:
org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
Mapper input:
<post1, “tag1 tag3”>
<post2, “tag3”>
<post3, “tag2 tag3 tag4”>
<post4, “tag1 tag2 tag3”>
To simplify the model, make the line number the key:
<line1, “tag1 tag3”>
<line2, “tag3”>
<line3, “tag2 tag3 tag4”>
<line4, “tag1 tag2 tag3”>
Raw tags are written to the input file, one post per line
18. Tag Cloud Mapper
Mapper input:
<0, “tag1 tag3”>
<1, “tag3”>
<2, “tag2 tag3 tag4”>
<3, “tag1 tag2 tag3”>

Mapper output:
<“total tags”, 2>, <“tag1”, 1>, <“tag3”, 1>
<“total tags”, 1>, <“tag3”, 1>
<“total tags”, 3>, <“tag2”, 1>, <“tag3”, 1>, <“tag4”, 1>
<“total tags”, 3>, <“tag1”, 1>, <“tag2”, 1>, <“tag3”, 1>

map() reads the values, lines of space-separated tags (the line offset is the key):

  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line, " ");
  // emit <"total tags", n> for the n tags on this line
  context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
  while (tokenizer.hasMoreTokens()) {
      Text tag = new Text(tokenizer.nextToken());
      context.write(tag, new IntWritable(1)); // emit <tag, 1> to the map output
  }
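The snippet above omits the surrounding class; a possible reconstruction (the class name TagCloudMapper and the TOTAL_TAGS_KEY constant are inferred from the snippet, not taken from the project source):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class TagCloudMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

      // key under which the per-line tag counts are accumulated
      private static final Text TOTAL_TAGS_KEY = new Text("total tags");

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line, " ");
          context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
          while (tokenizer.hasMoreTokens()) {
              context.write(new Text(tokenizer.nextToken()), new IntWritable(1));
          }
      }
  }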
19. Reducer phases
1. Shuffle or Copy phase:
Copies Mapper output to the Reducer’s local file system
2. Sort phase:
Sorts Mapper output by key; this becomes the Reducer input
Mapper output:
<“total tags”, 2>, <“tag1”, 1>, <“tag3”, 1>
<“total tags”, 1>, <“tag3”, 1>
<“total tags”, 3>, <“tag2”, 1>, <“tag3”, 1>, <“tag4”, 1>
<“total tags”, 3>, <“tag1”, 1>, <“tag2”, 1>, <“tag3”, 1>

(shuffle & sort by key)

Reducer input:
<“tag1”, 1>, <“tag1”, 1>
<“tag2”, 1>, <“tag2”, 1>
<“tag3”, 1>, <“tag3”, 1>, <“tag3”, 1>, <“tag3”, 1>
<“tag4”, 1>
<“total tags”, 2>, <“total tags”, 1>, <“total tags”, 3>, <“total tags”, 3>
3. Reduce or Emit phase:
Performs reduce() once for each group of sorted <key, value> input pairs
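The deck does not show the reducer; for this job it is a plain summing reducer. A minimal sketch (the class name TagCloudReducer is an assumption):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class TagCloudReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          // sum the 1s emitted by the mappers for this key;
          // for the "total tags" key this sums the per-line counts
          int sum = 0;
          for (IntWritable value : values) {
              sum += value.get();
          }
          context.write(key, new IntWritable(sum)); // e.g. <"tag3", 4>, <"total tags", 9>
      }
  }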
24. Apache Pig
Higher-level data processing layer on top
of Hadoop
Data-flow oriented language (Pig scripts)
Data types include bags, maps (associative
arrays), and tuples
Developed at Yahoo!
25. Apache Hive
Feature set is similar to Pig’s
SQL-like data warehouse infrastructure
The query language follows SQL more strictly
Supports SELECT, JOIN, GROUP BY, etc.
Developed at Facebook
26. Apache HBase
Column-oriented database (modeled after Google’s
BigTable)
Uses HDFS as its underlying file system
Holds extremely large datasets (multiple TB)
Constrained access model
27. Apache Mahout
Scalable machine learning algorithms on
top of Hadoop:
– collaborative filtering,
– recommendations,
– classification,
– clustering
28. Apache ZooKeeper
Common services for distributed
applications:
- group services,
- configuration management,
- naming services,
- synchronization
29. Oozie
Workflow engine for Hadoop
Orchestrates dependencies between
jobs running on Hadoop (including HDFS,
Pig and MapReduce)
A job scheduler rather than another query-processing API
Developed at Yahoo!
30. Apache Chukwa
System for reliable large-scale log
collection
Includes tools for displaying, monitoring, and analyzing the collected data
Built on top of the Hadoop Distributed File
System (HDFS) and Map/Reduce
Incubated at apache.org
Standalone operation mode:
1. export TAG_CLOUD_HOME=/Users/tazija/Projects/hadoop/tagcloud
2. export HADOOP_HOME=/Users/tazija/Programs/apache-hadoop-0.23.0
3. cd $TAG_CLOUD_HOME
4. mvn clean install
5. $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input
   (the input directory is $TAG_CLOUD_HOME/input)
6. $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
   We use the InputFormat for plain text files: files are broken into lines
   (either linefeed or carriage return signals end of line), keys are the
   positions in the file, and values are the lines of text.
7. $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar
   com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output

Distributed mode:
1. /etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
2. Format the file system: bin/hadoop namenode -format
3. Start the daemons:
./sbin/hadoop-daemon.sh start namenode
./sbin/hadoop-daemon.sh start datanode
./sbin/hadoop-daemon.sh start secondarynamenode
4. Check HDFS status: http://localhost:50070/
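For completeness, a sketch of what the TagCloudJob driver invoked above might contain (a reconstruction, not the actual class from tagcloud-1.0.jar; TagCloudMapper and TagCloudReducer are the classes sketched earlier):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class TagCloudJob {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "tag cloud");
          job.setJarByClass(TagCloudJob.class);
          job.setMapperClass(TagCloudMapper.class);
          job.setReducerClass(TagCloudReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // $TAG_CLOUD_HOME/input
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // $TAG_CLOUD_HOME/output
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }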