2. What is Hadoop?
• Hadoop started out as a subproject of Nutch by Doug Cutting
• Hadoop boosted Nutch's scalability
• Enhanced by Yahoo! and became an Apache top-level project
• System for distributed big data processing
- Big data means terabytes, petabytes and more…
- Exabyte and zettabyte datasets?
7. Hadoop basics
• Implements Google's MapReduce whitepaper:
http://research.google.com/archive/mapreduce.html
• Hadoop is a combination of:
HDFS (storage)
MapReduce (computation)
8. HDFS
Hadoop Distributed File System
• It's a file system:
bin/hadoop dfs <command> <options>
where <command> is one of:
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
10. Hadoop Distributed File System
• It's distributed
• It employs a master/slave architecture
11. Hadoop Distributed File System
• Name Node:
Stores file system metadata
• Secondary Name Node(s):
Periodically merges the file system image
• Data Node(s):
Stores the actual data (blocks)
Allows data to be replicated
12. MapReduce
• A programming model for distributed data processing
• Its data processing primitives are functions: Mappers and Reducers
13. MapReduce
• To decompose MapReduce, think of data in terms of keys and values:
<key, value>
<user id, user profile>
<timestamp, apache log entry>
<tag, list of tagged images>
14. MapReduce
• Mapper
A function that takes a key and a value and emits zero or more keys and values
• Reducer
A function that takes a key and all "mapped" values and emits zero or more new keys and values
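The Mapper/Reducer contract just described can be sketched in plain Java, with no Hadoop types. The class and method names below (MiniMapReduce, map, reduce) are illustrative only, not the Hadoop API; this is a minimal in-memory model, assuming space-separated tags as input:

```java
import java.util.*;

public class MiniMapReduce {
    // Mapper: takes a key and a value, emits zero or more (key, value) pairs
    static List<Map.Entry<String, Integer>> map(int lineNo, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String tag : line.split(" ")) {
            out.add(Map.entry(tag, 1));   // emit <tag, 1>
        }
        return out;
    }

    // Reducer: takes a key and all "mapped" values, emits a new (key, value) pair
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return Map.entry(key, sum);       // emit <key, sum of values>
    }

    public static void main(String[] args) {
        List<String> lines = List.of("tag1 tag3", "tag3", "tag2 tag3 tag4");
        // group each mapper emission by key (stands in for shuffle & sort)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> e : map(i, lines.get(i))) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        // one reduce() call per key group
        grouped.forEach((k, v) -> System.out.println(reduce(k, v)));
    }
}
```

In a real Hadoop job these functions live in Mapper and Reducer subclasses and the framework does the grouping; here a TreeMap plays that role.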
15. MapReduce example
• "Hello World" for Hadoop:
http://wiki.apache.org/hadoop/WordCount
• "Tag Cloud" example for Hadoop:
[figure: a cloud of tags (tag1 … tag6), each rendered at a size determined by weight(tagi)]
16. Tag Cloud example
• Input is taggable content (images, posts, videos) with space-separated tags:
<posti, "tag1 tag2 … tagn">
• Output is each tagi with its count, plus the total number of tags:
<tagi, tag count>
<total tags, total tags count>
• Results:
weight(tagi) = tagi count / total tags
font(tagi) = fn(weight(tagi))
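The weight and font formulas above can be computed directly. A caveat on the sketch below: the slides only say font(tagi) = fn(weight(tagi)), so the linear scaling between a minimum and maximum point size in font() is an assumption, as are the class name TagWeights and the sample counts:

```java
import java.util.Map;

public class TagWeights {
    // weight(tag_i) = tag_i count / total tags
    static double weight(int tagCount, int totalTags) {
        return (double) tagCount / totalTags;
    }

    // One possible fn(): scale weight linearly between minPt and maxPt.
    // This is an assumption; the slides leave fn() unspecified.
    static int font(double weight, int minPt, int maxPt) {
        return (int) Math.round(minPt + weight * (maxPt - minPt));
    }

    public static void main(String[] args) {
        // sample tag counts (illustrative data)
        Map<String, Integer> counts = Map.of("tag1", 2, "tag2", 2, "tag3", 4, "tag4", 1);
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        counts.forEach((tag, c) ->
            System.out.printf("%s weight=%.2f font=%dpt%n",
                tag, weight(c, total), font(weight(c, total), 10, 24)));
    }
}
```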
17. Tag Cloud Mapper
• Mapper implements the interface:
org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
• Mapper input:
<post1, "tag1 tag3">
<post2, "tag3">
<post3, "tag2 tag3 tag4">
<post4, "tag1 tag2 tag3">
To simplify the model, make the line number the key:
<line1, "tag1 tag3">
<line2, "tag3">
<line3, "tag2 tag3 tag4">
<line4, "tag1 tag2 tag3">
and write the raw tags to the input file.
18. Tag Cloud Mapper
• Mapper input:
<0, "tag1 tag3">
<1, "tag3">
<2, "tag2 tag3 tag4">
<3, "tag1 tag2 tag3">
• Mapper output:
<"total tags", 2>
<"tag1", 1>
<"tag3", 1>
<"total tags", 1>
<"tag3", 1>
<"total tags", 3>
<"tag2", 1>
<"tag3", 1>
<"tag4", 1>
<"total tags", 3>
<"tag1", 1>
<"tag2", 1>
<"tag3", 1>
• Read the values (the tags) from the file; the line number is the key:
// "tag1 tag3" -- space separated tags
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line, " ");
context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
while (tokenizer.hasMoreTokens()) {
    Text tag = new Text(tokenizer.nextToken());
    context.write(tag, new IntWritable(1)); // write to HDFS
}
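The mapper logic above can be run stand-alone to see the emitted pairs, using only the JDK's StringTokenizer. TagCloudMapperDemo is a hypothetical class: it returns the pairs as strings rather than writing to the Hadoop context, but the tokenization matches the snippet above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TagCloudMapperDemo {
    // Same logic as the Hadoop mapper: emit <"total tags", n> first,
    // then <tag, 1> for each space-separated tag on the line.
    static List<String> map(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        emitted.add("<\"total tags\", " + tokenizer.countTokens() + ">");
        while (tokenizer.hasMoreTokens()) {
            emitted.add("<\"" + tokenizer.nextToken() + "\", 1>");
        }
        return emitted;
    }

    public static void main(String[] args) {
        String[] input = {"tag1 tag3", "tag3", "tag2 tag3 tag4", "tag1 tag2 tag3"};
        for (String line : input) {
            map(line).forEach(System.out::println);
        }
    }
}
```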
19. Reducer phases
• 1. Shuffle or Copy phase:
Copies output from the Mapper to the Reducer's local file system
• 2. Sort phase:
Sorts Mapper output by key; this becomes the Reducer input
Mapper output:
<"total tags", 2>
<"tag1", 1>
<"tag3", 1>
<"total tags", 1>
<"tag3", 1>
<"total tags", 3>
<"tag2", 1>
<"tag3", 1>
<"tag4", 1>
<"total tags", 3>
<"tag1", 1>
<"tag2", 1>
<"tag3", 1>
After shuffle & sort by key, the Reducer input:
<"tag1", 1>
<"tag1", 1>
<"tag2", 1>
<"tag2", 1>
<"tag3", 1>
<"tag3", 1>
<"tag3", 1>
<"tag3", 1>
<"tag4", 1>
<"total tags", 2>
<"total tags", 1>
<"total tags", 3>
<"total tags", 3>
• 3. Reduce or Emit phase:
Performs reduce() for each group of sorted <key, value> input pairs
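The three phases can be sketched in plain Java: collecting the mapper pairs into one structure stands in for shuffle/copy, a TreeMap stands in for the sort by key, and summing each key group stands in for reduce(). ReducerPhasesDemo is illustrative, not Hadoop code:

```java
import java.util.*;

public class ReducerPhasesDemo {
    // Model of the reducer phases: group pairs by key in sorted order
    // (shuffle + sort), then call one reduction (a sum) per key group.
    static SortedMap<String, Integer> shuffleSortReduce(String[][] mapperOutput) {
        // shuffle & sort: a TreeMap keeps keys in sorted order
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : mapperOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }
        // reduce: sum the values of each group
        SortedMap<String, Integer> reduced = new TreeMap<>();
        grouped.forEach((key, values) -> {
            int sum = 0;
            for (int v : values) sum += v;
            reduced.put(key, sum);
        });
        return reduced;
    }

    public static void main(String[] args) {
        // mapper output pairs, as on the slide
        String[][] mapperOutput = {
            {"total tags", "2"}, {"tag1", "1"}, {"tag3", "1"},
            {"total tags", "1"}, {"tag3", "1"},
            {"total tags", "3"}, {"tag2", "1"}, {"tag3", "1"}, {"tag4", "1"},
            {"total tags", "3"}, {"tag1", "1"}, {"tag2", "1"}, {"tag3", "1"},
        };
        shuffleSortReduce(mapperOutput)
            .forEach((k, v) -> System.out.println("<\"" + k + "\", " + v + ">"));
    }
}
```

Run on the slide's data this yields tag1: 2, tag2: 2, tag3: 4, tag4: 1, and "total tags": 9 -- the counts the tag-cloud weights are computed from.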
24. Apache Pig
• Higher-level data processing layer on top of Hadoop
• Data-flow oriented language (Pig scripts)
• Data types include sets, associative arrays, tuples
• Developed at Yahoo!
25. Apache Hive
• Feature set is similar to Pig
• SQL-like data warehouse infrastructure
• Language is more strictly SQL
• Supports SELECT, JOIN, GROUP BY, etc.
• Developed at Facebook
26. Apache HBase
• Column-store database (after the Google BigTable model)
• HDFS is the underlying file system
• Holds extremely large datasets (multiple terabytes)
• Constrained access model
27. Apache Mahout
• Scalable machine learning algorithms on top of Hadoop:
- filtering,
- recommendations,
- classifiers,
- clustering
28. Apache ZooKeeper
• Common services for distributed applications:
- group services,
- configuration management,
- naming services,
- synchronization
29. Oozie
• Workflow engine for Hadoop
• Orchestrates dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce)
• Another query processing API
• Developed at Yahoo!
30. Apache Chukwa
• System for reliable large-scale log collection
• Includes tools for displaying, monitoring, and analyzing results
• Built on top of the Hadoop Distributed File System (HDFS) and MapReduce
• Incubated at apache.org
DataNodes constantly report to the NameNode. Blocks are stored on the DataNodes.
Standalone operation mode:
1. export TAG_CLOUD_HOME=/Users/tazija/Projects/hadoop/tagcloud
2. export HADOOP_HOME=/Users/tazija/Programs/apache-hadoop-0.23.0
3. cd $TAG_CLOUD_HOME
4. mvn clean install
5. $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input
   The input directory is $TAG_CLOUD_HOME/input
6. $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
   We use an InputFormat for plain text files. Files are broken into lines; either linefeed or carriage return is used to signal end of line. Keys are the positions in the file, and values are the lines of text.
7. $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
Distributed mode:
1. /etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/Users/tazija/Programs/apache-hadoop-0.23.0/data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
2. Format the file system:
   bin/hadoop namenode -format
3. Start the daemons:
   ./sbin/hadoop-daemon.sh start namenode
   ./sbin/hadoop-daemon.sh start datanode
   ./sbin/hadoop-daemon.sh start secondarynamenode
4. Check HDFS status at http://localhost:50070/