tweets = LOAD 'tweets.txt' USING PigStorage() AS (id, text, iso_language);
filtered_tweets = FILTER tweets BY iso_language == 'en';
grouped_tweets = GROUP filtered_tweets BY iso_language;
DUMP grouped_tweets;
This Pig Latin program loads tweet data from a text file, filters it to keep only tweets whose iso_language is 'en', groups the filtered tweets by iso_language, and dumps the results.
2. What is Hadoop?
• A framework for storing and processing big data on
lots of commodity machines.
o Up to 4,000 machines in a cluster
o Up to 20 PB in a cluster
• Open Source Apache project
• High reliability done in software
o Automated fail-over for data and computation
• Implemented in Java
3. Hadoop Development
• Hadoop was created by Doug Cutting.
• It is named after his son's toy elephant.
• It was originally developed to support the Nutch
search engine project.
• Since then, many companies have adopted it and
contributed to the project.
4. Hadoop Ecosystem
• Apache Hadoop is a collection of open-source software
for reliable, scalable, distributed computing.
• Hadoop Common: The common utilities that support the
other Hadoop subprojects.
• HDFS: A distributed file system that provides high
throughput access to application data.
• MapReduce: A software framework for distributed
processing of large data sets on compute clusters.
• Pig: A high-level data-flow language and execution
framework for parallel computation.
• HBase: A scalable, distributed database that supports
structured data storage for large tables.
6. Hadoop, Why?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application.
• Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
• Need common infrastructure
–Efficient, reliable, Open Source Apache License
• The above goals are the same as Condor's, but
o Workloads are I/O-bound rather than CPU-bound
7. Hadoop History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch (search engine) uses MapReduce
• Feb 2006 – Starts as a Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• May 2009 – Hadoop sorts Petabyte in 17 hours
• Aug 2010 – World's largest Hadoop cluster at
o Facebook
o 2900 nodes, 30+ PB
8. Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
9. Applications of Hadoop
• Search
o Yahoo, Amazon, Zvents
• Log processing
o Facebook, Yahoo, ContextWeb, Joost, Last.fm
• Recommendation Systems
o Facebook
• Data Warehouse
o Facebook, AOL
• Video and Image Analysis
o New York Times, Eyealike
10. Who generates the data?
• Lots of data is generated on Facebook
o 500+ million active users
o 30 billion pieces of content shared every month (news stories, photos,
blogs, etc)
• Lots of data is generated for the Yahoo! search
engine.
• Lots of data is generated at the Amazon S3 cloud
service.
11. Data usage
• Statistics per day:
o 20 TB of compressed new data added
o 3 PB of compressed data scanned
o 20K jobs on the production cluster
o 480K compute hours
• Barrier to entry is significantly reduced:
o New engineers go through a Hadoop/Hive training session
o 300+ people run jobs on Hadoop
o Analysts (non-engineers) use Hadoop through Hive
15. Commodity Hardware
• Typically a 2-level architecture
o Nodes are commodity PCs
o 20-40 nodes/rack
o The default Apache Hadoop block size is 64 MB.
o Relational databases typically store data blocks in sizes ranging from 4 KB
to 32 KB.
16. How does HDFS maintain everything?
• Two types of nodes
o Single NameNode and a number of DataNodes
• NameNode
o File names, permissions, modified flags, etc.
o Data locations are exposed so that computations can move to the data.
• DataNode
o Stores and retrieves blocks when told to.
o HDFS is built using the Java language; any machine that supports Java
can run the NameNode or the DataNode software.
18. • The NameNode executes file system namespace
operations like opening, closing, and renaming files
and directories. It also determines the mapping of
blocks to DataNodes.
• The DataNodes are responsible for serving read and
write requests from the file system's clients.
20. MapReduce Overview
• Provides a clean abstraction for programmers to
write distributed applications.
• Factors out many reliability concerns from
application logic
• A batch data processing system
• Automatic parallelization & distribution
• Fault-tolerance
• Status and monitoring tools
21. Programming Model
• The programmer has to implement an interface of
two functions:
– map (in_key, in_value) ->
(out_key, intermediate_value) list
– reduce (out_key, intermediate_value list) ->
out_value list
23. Mapper (indexing example)
• Input is the line number and the actual line.
• Input 1 : ("100", "I Love India")
• Output 1 : ("I", "100"), ("Love", "100"),
("India", "100")
• Input 2 : ("101", "I Love eBay")
• Output 2 : ("I", "101"), ("Love", "101"),
("eBay", "101")
24. Reducer (indexing example)
• Input is a word and its line numbers.
• Input 1 : ("I", "100", "101")
• Input 2 : ("Love", "100", "101")
• Input 3 : ("India", "100")
• Input 4 : ("eBay", "101")
• Output: each word is stored along with its line
numbers (see the Pig Latin sketch below).
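For comparison, the same inverted-index dataflow can be written in a few lines of Pig Latin, the language covered later in this deck. This is a minimal sketch, assuming a comma-delimited input file 'lines.txt' holding (line_no, line) records:

lines = LOAD 'lines.txt' USING PigStorage(',') AS (line_no:chararray, line:chararray);
-- Map step: emit one (word, line_no) pair per word in the line.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word, line_no;
-- Reduce step: collect, for each word, the bag of its line numbers.
index = GROUP words BY word;
DUMP index;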
25. Google PageRank example
• Mapper
o Input is a link and the html content
o Output is a list of outgoing link and pagerank of this page
• Reducer
o Input is a link and a list of pageranks of pages linking to this
page
o Output is the pagerank of this page, which is the weighted
average of all input pageranks
26. Contd.
• Limited atomicity and transaction support.
o HBase supports batched mutations of
single rows only.
o Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
o Programmatic access via Java, REST, or Thrift APIs.
o Scripting via JRuby.
28. OVERVIEW
• HBase is an Apache open source project
whose goal is to provide storage for the
Hadoop Distributed Computing
Environment.
• Data is logically organized into tables, rows
and columns.
30. Conceptual View
• A data row has a sortable row key and an arbitrary number of
columns, each named <family>:<label>.
• A timestamp is assigned automatically if one is not supplied
explicitly.

Row key           Time Stamp  Column "contents:"  Column "anchor:"
"com.apache.www"  t12         "<html>…"
                  t11         "<html>…"
                  t10                             "anchor:apache.com" = "APACHE"
"com.cnn.www"     t15                             "anchor:cnnsi.com" = "CNN"
                  t13                             "anchor:my.look.ca" = "CNN.com"
                  t6          "<html>…"
                  t5          "<html>…"
                  t3          "<html>…"
31. HStore: Physical Storage View
• Physically, tables are stored on a per-column-family basis.
• Empty cells are not stored in this column-oriented storage format.
• Each column family is managed by an HStore, which keeps its data
in a data MapFile (key/value pairs) with an index MapFile, and
buffers recent writes in an in-memory Memcache.

HStore for column family "contents:"
Row key           TS   Column "contents:"
"com.apache.www"  t12  "<html>…"
                  t11  "<html>…"
"com.cnn.www"     t6   "<html>…"
                  t5   "<html>…"
                  t3   "<html>…"

HStore for column family "anchor:"
Row key           TS   Column "anchor:"
"com.apache.www"  t10  "anchor:apache.com" = "APACHE"
"com.cnn.www"     t9   "anchor:cnnsi.com" = "CNN"
                  t8   "anchor:my.look.ca" = "CNN.com"
32. Row Ranges: Regions
• Sort order: row key and column ascending, timestamp descending.
• Physically, tables are broken into row ranges, called regions, each
containing the rows from a start key to an end key.
(Example table elided: row keys "aaaa" through "aaae" with
"contents:" and "anchor:" cells at various timestamps.)
35. Master (HBaseMaster)
• Assigns regions to HRegionServers:
1. The ROOT region locates all the META regions.
2. Each META region maps a number of user regions.
3. User regions are assigned to the HRegionServers.
• Enables/disables tables and changes table schemas.
• Monitors the health of each HRegionServer.
(Diagram elided: a single ROOT region points to the META regions,
which in turn map the USER regions hosted across the servers.)
46. Introduction
• Pig was initially developed at Yahoo!
• The Pig programming language is designed
to handle any kind of data, hence the
name!
• Pig is made of two components:
The language itself, called Pig Latin.
The runtime environment where Pig Latin programs
are executed.
47. Why Pig Latin?
• MapReduce is very powerful, but:
o It requires a Java programmer.
o The user has to re-invent common functionality (join, filter, etc.).
• Pig Latin was introduced for non-Java programmers.
• Pig Latin is a data-flow language rather than a
procedural or declarative one.
• User code and existing binaries can be included
almost anywhere (see the sketch below).
• Metadata is not required, but is used when available.
• Support for nested types.
• Operates on files in HDFS.
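As an illustration of that extensibility, user jars can be registered and tuples can be piped through existing executables. A minimal sketch; the jar, the UDF myudfs.Normalize, and the script my_filter.sh are hypothetical names:

-- Register a jar of user-defined functions and apply one of them.
REGISTER myudfs.jar;
tweets = LOAD 'tweets.txt' AS (id, text);
cleaned = FOREACH tweets GENERATE id, myudfs.Normalize(text);
-- Pipe each tuple through an existing binary via Pig streaming.
filtered = STREAM cleaned THROUGH `my_filter.sh`;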
48. Pig Latin Overview
• Pig provides a higher level language,
Pig Latin, that:
o Increases productivity.
o In one test 10 lines of Pig Latin ≈ 200 lines of Java.
• What took 4 hours to write in Java took
15 minutes in Pig Latin.
o Opens the system to non-Java programmers.
o Provides common operations like join, group,
filter, sort (a word-count sketch follows).
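To make the brevity claim concrete, here is the canonical word count as a complete Pig Latin program in five statements. A minimal sketch; the input path 'input.txt' is assumed:

lines = LOAD 'input.txt' AS (line:chararray);
-- Split each line into words, one word per output tuple.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;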
49. Load Data
• The objects that Hadoop works on are stored in HDFS.
• To access this data, the program must first tell Pig
what file (or files) it will use.
• That is done through the LOAD 'data_file'
statement.
• If the data is stored in a file format that is not
natively accessible to Pig, add the USING clause to
the LOAD statement to specify a user-defined
function that can read in and interpret the data
(see the sketch below).
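For example, the two forms of LOAD might look like this; the file names, the comma delimiter, and the schema are assumed for illustration:

-- Default loader: tab-delimited text via the built-in PigStorage.
tweets_tsv = LOAD 'tweets.txt' AS (id, text, iso_language);
-- Explicit USING clause; PigStorage(',') is shown here, but a
-- user-defined load function can be named in the same position.
tweets_csv = LOAD 'tweets.csv' USING PigStorage(',') AS (id, text, iso_language);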
50. Transform Data
• The transform logic is where all the
data manipulation happens.
• For example (see the sketch below):
FILTER out rows that are not of interest.
JOIN two sets of data files.
GROUP data to build aggregations.
ORDER results.
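A minimal sketch chaining all four operations; the file names, field names, and the retweet threshold are assumptions made for illustration:

tweets = LOAD 'tweets.txt' AS (user_id, text, retweets:int);
users = LOAD 'users.txt' AS (user_id, country);
-- FILTER out rows that are not of interest.
popular = FILTER tweets BY retweets > 10;
-- JOIN two sets of data files.
joined = JOIN popular BY user_id, users BY user_id;
-- GROUP data to build aggregations.
grouped = GROUP joined BY users::country;
counts = FOREACH grouped GENERATE group AS country, COUNT(joined) AS n;
-- ORDER results.
ordered = ORDER counts BY n DESC;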
51. Example of a Pig Program
• Given a file of Twitter feeds, select only those
tweets that use the en (English) iso_language
code, group them by the user who is
tweeting, and display the sum of the
retweets of that user's tweets.
L = LOAD 'hdfs://node/tweet_data' AS (from_user, iso_language_code, retweets:int);
FL = FILTER L BY iso_language_code == 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.retweets);
52. DUMP and STORE
• The DUMP or STORE command generates the results of a
Pig program.
• The DUMP command sends the output to the screen,
which is useful while debugging Pig programs.
• DUMP can be used anywhere in a
program to dump intermediate result sets to the
screen.
• The STORE command writes the results of a
program to a file for further processing and analysis
(see the sketch below).
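A minimal sketch of both commands; the input and output paths are illustrative:

results = LOAD 'tweet_counts' AS (country, n:long);
-- While debugging: print the relation to the screen.
DUMP results;
-- For production runs: write the relation to a directory in HDFS.
STORE results INTO 'output/tweet_counts' USING PigStorage(',');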
53. Pig Runtime Environment
• The Pig runtime is used when a Pig program needs to
run in the Hadoop environment.
• There are three ways to run a Pig program:
Embedded in a script (sketched below).
Embedded in a Java program.
From the Pig command line, called Grunt.
• The Pig runtime environment translates the program
into a set of map and reduce tasks and runs them.
• This greatly simplifies the work associated with the
analysis of large amounts of data.
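As a sketch of the script option, the program from slide 51 can be saved to a file and handed to the runtime; the file and output names are illustrative. (Embedding in Java goes through Pig's PigServer class.)

-- tweets.pig: run in batch with `pig tweets.pig`, or type these
-- statements interactively at the Grunt prompt (`pig` with no arguments).
L = LOAD 'hdfs://node/tweet_data' AS (from_user, iso_language_code, retweets:int);
FL = FILTER L BY iso_language_code == 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.retweets);
STORE RT INTO 'retweet_totals';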
54. What is Pig used for?
• Web log processing.
• Data processing for web search platforms.
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for processing large
data sets.
56. Hadoop@Facebook
• Production cluster
o 4800 cores, 600 machines, 16GB per machine – April 2009
o 8000 cores, 1000 machines, 32 GB per machine – July 2009
o 4 SATA disks of 1 TB each per machine
o 2 level network hierarchy, 40 machines per rack
o Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
o 800 cores, 16 GB each
57. Hadoop@Yahoo
• World's largest Hadoop production application.
• The Yahoo! Search Webmap is a Hadoop
application that runs on a Linux cluster with
more than 10,000 cores.
• Biggest contributor to Hadoop.
• Converting all its batch processing to Hadoop.
58. Hadoop@Amazon
• Hadoop can be run on Amazon Elastic Compute
Cloud (EC2) and Amazon Simple Storage Service
(S3).
• The New York Times used 100 Amazon EC2 instances
and a Hadoop application to process 4TB of raw
image TIFF data (stored in S3) into 11 million finished
PDFs in the space of 24 hours at a computation cost
of about $240
• Amazon Elastic MapReduce is a new web service
that enables businesses, researchers, data analysts,
and developers to easily and cost-effectively
process vast amounts of data. It utilizes a hosted
Hadoop framework.