tweets = LOAD 'tweets.txt' USING PigStorage() AS (id, text, iso_language);
filtered_tweets = FILTER tweets BY iso_language == 'en';
grouped_tweets = GROUP filtered_tweets BY iso_language;
DUMP grouped_tweets;
This Pig Latin program loads tweet data from a text file, filters it to keep only tweets whose iso_language is 'en', groups the filtered tweets by iso_language, and dumps the results.
2. What is Hadoop?
• A framework for storing and processing big data on
lots of commodity machines.
o Up to 4,000 machines in a cluster
o Up to 20 PB in a cluster
• Open Source Apache project
• High reliability done in software
o Automated fail-over for data and computation
• Implemented in Java
3. Hadoop Development
• Hadoop was created by Doug Cutting.
• It is named after his son's toy elephant.
• It was originally developed to support the Nutch
search engine project.
• Since then, many companies have adopted it and
contributed to the project.
4. Hadoop Ecosystem
• Apache Hadoop is a collection of open-source software
for reliable, scalable, distributed computing.
• Hadoop Common: The common utilities that support the
other Hadoop subprojects.
• HDFS: A distributed file system that provides high
throughput access to application data.
• MapReduce: A software framework for distributed
processing of large data sets on compute clusters.
• Pig: A high-level data-flow language and execution
framework for parallel computation.
• HBase: A scalable, distributed database that supports
structured data storage for large tables.
6. Hadoop, Why?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application.
• Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
• Need common infrastructure
–Efficient, reliable, Open Source Apache License
• The above goals are the same as Condor's, but
o Workloads are I/O-bound rather than CPU-bound
7. Hadoop History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch (search engine) uses MapReduce
• Feb 2006 – Starts as a Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• May 2009 – Hadoop sorts Petabyte in 17 hours
• Aug 2010 – World's largest Hadoop cluster at
o Facebook
o 2900 nodes, 30+ PB
8. Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
9. Applications of Hadoop
• Search
o Yahoo, Amazon, Zvents
• Log processing
o Facebook, Yahoo, ContextWeb, Joost, Last.fm
• Recommendation Systems
o Facebook
• Data Warehouse
o Facebook, AOL
• Video and Image Analysis
o New York Times, Eyealike
10. Who generates the data?
• Lots of data is generated on Facebook
o 500+ million active users
o 30 billion pieces of content shared every month (news stories, photos,
blogs, etc)
• Lots of data is generated for the Yahoo! search
engine.
• Lots of data is generated at the Amazon S3 cloud
service.
11. Data usage
• Statistics per day:
o 20 TB of compressed new data added
o 3 PB of compressed data scanned
o 20K jobs on the production cluster
o 480K compute hours
• Barrier to entry is significantly reduced:
o New engineers go through a Hadoop/Hive training session
o 300+ people run jobs on Hadoop
o Analysts (non-engineers) use Hadoop through Hive
15. Commodity Hardware
• Typically a 2-level architecture
o Nodes are commodity PCs
o 20-40 nodes/rack
o The default Apache Hadoop block size is 64 MB.
o Relational databases typically store data blocks in sizes ranging from 4 KB
to 32 KB.
16. How does HDFS maintain everything?
• Two types of nodes
o Single NameNode and a number of DataNodes
• NameNode
o File names, permissions, modified flags, etc.
o Data locations are exposed so that computations can move to the data.
• DataNode
o Stores and retrieves blocks when told to.
o HDFS is built using the Java language; any machine that supports Java
can run the NameNode or the DataNode software.
18. • The NameNode executes file system namespace
operations like opening, closing, and renaming files
and directories. It also determines the mapping of
blocks to DataNodes.
• The DataNodes are responsible for serving read and
write requests from the file system's clients.
20. MapReduce Overview
• Provides a clean abstraction for programmers to
write distributed applications.
• Factors out many reliability concerns from
application logic
• A batch data processing system
• Automatic parallelization & distribution
• Fault-tolerance
• Status and monitoring tools
21. Programming Model
• The programmer has to implement an interface of
two functions:
– map (in_key, in_value) ->
(out_key, intermediate_value) list
– reduce (out_key, intermediate_value list) ->
out_value list
23. Mapper (indexing example)
• Input is the line number and the actual line.
• Input 1 : ("100", "I Love India")
• Output 1 : ("I", "100"), ("Love", "100"),
("India", "100")
• Input 2 : ("101", "I Love eBay")
• Output 2 : ("I", "101"), ("Love", "101"),
("eBay", "101")
24. Reducer (indexing example)
• Input is a word and its line numbers.
• Input 1 : ("I", "100", "101")
• Input 2 : ("Love", "100", "101")
• Input 3 : ("India", "100")
• Input 4 : ("eBay", "101")
• Output: each word is stored along with its line
numbers (see the Pig Latin sketch below).
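For comparison, the same inverted-index dataflow can be written in a few lines of Pig Latin, the language covered later in this deck. This is a minimal sketch, assuming a comma-delimited input file 'lines.txt' holding (line_no, line) records:

lines = LOAD 'lines.txt' USING PigStorage(',') AS (line_no:chararray, line:chararray);
-- Map step: emit one (word, line_no) pair per word in the line.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word, line_no;
-- Reduce step: collect, for each word, the bag of its line numbers.
index = GROUP words BY word;
DUMP index;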
25. Google PageRank example
• Mapper
o Input is a link and the html content
o Output is a list of outgoing link and pagerank of this page
• Reducer
o Input is a link and a list of pageranks of pages linking to this
page
o Output is the pagerank of this page, which is the weighted
average of all input pageranks
26. Contd.
• Limited atomicity and transaction support.
o HBase supports batched mutations of
single rows only.
o Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
o Programmatic access via Java, REST, or Thrift APIs.
o Scripting via JRuby.
28. OVERVIEW
• HBase is an Apache open source project
whose goal is to provide storage for the
Hadoop Distributed Computing
Environment.
• Data is logically organized into tables, rows
and columns.
30. Conceptual View
• A data row has a sortable row key and an arbitrary number of
columns, each named <family>:<label>.
• A timestamp is assigned automatically if one is not supplied
explicitly.

Row key           Time Stamp  Column "contents:"  Column "anchor:"
"com.apache.www"  t12         "<html>…"
                  t11         "<html>…"
                  t10                             "anchor:apache.com" = "APACHE"
"com.cnn.www"     t15                             "anchor:cnnsi.com" = "CNN"
                  t13                             "anchor:my.look.ca" = "CNN.com"
                  t6          "<html>…"
                  t5          "<html>…"
                  t3          "<html>…"
31. HStore: Physical Storage View
• Physically, tables are stored on a per-column-family basis.
• Empty cells are not stored in this column-oriented storage format.
• Each column family is managed by an HStore, which keeps its data
in a data MapFile (key/value pairs) with an index MapFile, and
buffers recent writes in an in-memory Memcache.

HStore for column family "contents:"
Row key           TS   Column "contents:"
"com.apache.www"  t12  "<html>…"
                  t11  "<html>…"
"com.cnn.www"     t6   "<html>…"
                  t5   "<html>…"
                  t3   "<html>…"

HStore for column family "anchor:"
Row key           TS   Column "anchor:"
"com.apache.www"  t10  "anchor:apache.com" = "APACHE"
"com.cnn.www"     t9   "anchor:cnnsi.com" = "CNN"
                  t8   "anchor:my.look.ca" = "CNN.com"
32. Row Ranges: Regions
• Sort order: row key and column ascending, timestamp descending.
• Physically, tables are broken into row ranges, called regions, each
containing the rows from a start key to an end key.
(Example table elided: row keys "aaaa" through "aaae" with
"contents:" and "anchor:" cells at various timestamps.)
35. Master (HBaseMaster)
• Assigns regions to HRegionServers:
1. The ROOT region locates all the META regions.
2. Each META region maps a number of user regions.
3. User regions are assigned to the HRegionServers.
• Enables/disables tables and changes table schemas.
• Monitors the health of each HRegionServer.
(Diagram elided: a single ROOT region points to the META regions,
which in turn map the USER regions hosted across the servers.)
46. Introduction
• Pig was initially developed at Yahoo!
• The Pig programming language is designed
to handle any kind of data, hence the
name!
• Pig is made of two components:
The language itself, called Pig Latin.
The runtime environment where Pig Latin programs
are executed.
47. Why Pig Latin?
• MapReduce is very powerful, but:
o It requires a Java programmer.
o The user has to re-invent common functionality (join, filter, etc.).
• Pig Latin was introduced for non-Java programmers.
• Pig Latin is a data-flow language rather than a
procedural or declarative one.
• User code and existing binaries can be included
almost anywhere (see the sketch below).
• Metadata is not required, but is used when available.
• Support for nested types.
• Operates on files in HDFS.
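As an illustration of that extensibility, user jars can be registered and tuples can be piped through existing executables. A minimal sketch; the jar, the UDF myudfs.Normalize, and the script my_filter.sh are hypothetical names:

-- Register a jar of user-defined functions and apply one of them.
REGISTER myudfs.jar;
tweets = LOAD 'tweets.txt' AS (id, text);
cleaned = FOREACH tweets GENERATE id, myudfs.Normalize(text);
-- Pipe each tuple through an existing binary via Pig streaming.
filtered = STREAM cleaned THROUGH `my_filter.sh`;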
48. Pig Latin Overview
• Pig provides a higher level language,
Pig Latin, that:
o Increases productivity.
o In one test 10 lines of Pig Latin ≈ 200 lines of Java.
• What took 4 hours to write in Java took
15 minutes in Pig Latin.
o Opens the system to non-Java programmers.
o Provides common operations like join, group,
filter, sort (a word-count sketch follows).
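To make the brevity claim concrete, here is the canonical word count as a complete Pig Latin program in five statements. A minimal sketch; the input path 'input.txt' is assumed:

lines = LOAD 'input.txt' AS (line:chararray);
-- Split each line into words, one word per output tuple.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;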
49. Load Data
• The objects that Hadoop works on are stored in HDFS.
• To access this data, the program must first tell Pig
what file (or files) it will use.
• That is done through the LOAD 'data_file'
statement.
• If the data is stored in a file format that is not
natively accessible to Pig, add the USING clause to
the LOAD statement to specify a user-defined
function that can read in and interpret the data
(see the sketch below).
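For example, the two forms of LOAD might look like this; the file names, the comma delimiter, and the schema are assumed for illustration:

-- Default loader: tab-delimited text via the built-in PigStorage.
tweets_tsv = LOAD 'tweets.txt' AS (id, text, iso_language);
-- Explicit USING clause; PigStorage(',') is shown here, but a
-- user-defined load function can be named in the same position.
tweets_csv = LOAD 'tweets.csv' USING PigStorage(',') AS (id, text, iso_language);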
50. Transform Data
• The transform logic is where all the
data manipulation happens.
• For example (see the sketch below):
FILTER out rows that are not of interest.
JOIN two sets of data files.
GROUP data to build aggregations.
ORDER results.
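A minimal sketch chaining all four operations; the file names, field names, and the retweet threshold are assumptions made for illustration:

tweets = LOAD 'tweets.txt' AS (user_id, text, retweets:int);
users = LOAD 'users.txt' AS (user_id, country);
-- FILTER out rows that are not of interest.
popular = FILTER tweets BY retweets > 10;
-- JOIN two sets of data files.
joined = JOIN popular BY user_id, users BY user_id;
-- GROUP data to build aggregations.
grouped = GROUP joined BY users::country;
counts = FOREACH grouped GENERATE group AS country, COUNT(joined) AS n;
-- ORDER results.
ordered = ORDER counts BY n DESC;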
51. Example of a Pig Program
• Given a file of Twitter feeds, select only those
tweets that use the en (English) iso_language
code, group them by the user who is
tweeting, and display the sum of the
retweets of that user's tweets.
L = LOAD 'hdfs://node/tweet_data' AS (from_user, iso_language_code, retweets:int);
FL = FILTER L BY iso_language_code == 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.retweets);
52. DUMP and STORE
• The DUMP or STORE command generates the results of a
Pig program.
• The DUMP command sends the output to the screen,
which is useful while debugging Pig programs.
• DUMP can be used anywhere in a
program to dump intermediate result sets to the
screen.
• The STORE command writes the results of a
program to a file for further processing and analysis
(see the sketch below).
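A minimal sketch of both commands; the input and output paths are illustrative:

results = LOAD 'tweet_counts' AS (country, n:long);
-- While debugging: print the relation to the screen.
DUMP results;
-- For production runs: write the relation to a directory in HDFS.
STORE results INTO 'output/tweet_counts' USING PigStorage(',');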
53. Pig Runtime Environment
• The Pig runtime is used when a Pig program needs to
run in the Hadoop environment.
• There are three ways to run a Pig program:
Embedded in a script (sketched below).
Embedded in a Java program.
From the Pig command line, called Grunt.
• The Pig runtime environment translates the program
into a set of map and reduce tasks and runs them.
• This greatly simplifies the work associated with the
analysis of large amounts of data.
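As a sketch of the script option, the program from slide 51 can be saved to a file and handed to the runtime; the file and output names are illustrative. (Embedding in Java goes through Pig's PigServer class.)

-- tweets.pig: run in batch with `pig tweets.pig`, or type these
-- statements interactively at the Grunt prompt (`pig` with no arguments).
L = LOAD 'hdfs://node/tweet_data' AS (from_user, iso_language_code, retweets:int);
FL = FILTER L BY iso_language_code == 'en';
G = GROUP FL BY from_user;
RT = FOREACH G GENERATE group, SUM(FL.retweets);
STORE RT INTO 'retweet_totals';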
54. What is Pig used for?
• Web log processing.
• Data processing for web search platforms.
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for processing large
data sets.
56. Hadoop@Facebook
• Production cluster
o 4800 cores, 600 machines, 16GB per machine – April 2009
o 8000 cores, 1000 machines, 32 GB per machine – July 2009
o 4 SATA disks of 1 TB each per machine
o 2 level network hierarchy, 40 machines per rack
o Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
• Test cluster
o 800 cores, 16 GB each
57. Hadoop@Yahoo
• World's largest Hadoop production application.
• The Yahoo! Search Webmap is a Hadoop
application that runs on a Linux cluster with
more than 10,000 cores.
• Biggest contributor to Hadoop.
• Converting all its batch processing to Hadoop.
58. Hadoop@Amazon
• Hadoop can be run on Amazon Elastic Compute
Cloud (EC2) and Amazon Simple Storage Service
(S3).
• The New York Times used 100 Amazon EC2 instances
and a Hadoop application to process 4TB of raw
image TIFF data (stored in S3) into 11 million finished
PDFs in the space of 24 hours at a computation cost
of about $240
• Amazon Elastic MapReduce is a new web service
that enables businesses, researchers, data analysts,
and developers to easily and cost-effectively
process vast amounts of data. It utilizes a hosted
Hadoop framework.