2. Who am I?
Director of Platform Engineering at Bronto
Former Googler/FeedBurner(er)
Web Analytics background
Still working this out in therapy
3. What is a Hadoop?
Open source distributed computing framework built on Java
Named by Doug Cutting (Apache Lucene) after his son’s toy elephant
Main components: HDFS and MapReduce
Heavily used and sponsored by Yahoo
Also used by Facebook, Twitter, Rackspace, LinkedIn, countless others
Tremendous community and growing popularity
4. What does Hadoop do?
Networks nodes together to combine storage and computing power
Scales to petabytes of storage
Manages fault tolerance and data replication automagically
Excels at processing semi-structured and unstructured data
Provides framework for analyzing data in parallel (MapReduce)
5. What does Hadoop not do?
No random access (it’s not a database)
Not real-time (it’s batch oriented)
Make things obvious (there’s a learning curve)
6. Where do we start?
1. HDFS & MapReduce
2. ???
3. Profit
7. Hadoop’s Filesystem (HDFS)
Hadoop Distributed File System, based on Google’s GFS whitepaper
Data stored in blocks across cluster
Hadoop manages replication, node failure, rebalancing
Namenode is the master; Datanodes are slaves
Data stored on disk, but not accessible via the local file system; use the Hadoop API/tools
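For a feel of the tooling side, here is a minimal sketch using the stock hadoop fs shell; the paths (/logs, access.log) are made-up examples:

    hadoop fs -mkdir /logs                  # create a directory in HDFS
    hadoop fs -put access.log /logs/        # copy a local file into HDFS
    hadoop fs -ls /logs                     # list HDFS contents
    hadoop fs -cat /logs/access.log | head  # stream file contents back out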
8. How HDFS stores data
[Diagram: the Namenode’s metadata maps each file to the Datanodes holding its blocks, e.g. file001 → Datanodes 1, 2, 3; file006 → Datanode 4 only. The Datanodes hold the actual blocks on their local filesystems.]
Hadoop Client/API talks to the Namenode
Namenode looks up block locations and returns which Datanodes have the data
Hadoop Client/API talks to those Datanodes directly to read the file data
This is the only way to access HDFS data
On each node’s local file system, HDFS data is stored in blocks scattered all over the cluster
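The same read path, sketched against the Java FileSystem API; the path /logs/access.log is a made-up example and error handling is omitted:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);      // connects to the Namenode
        Path file = new Path("/logs/access.log");  // hypothetical path

        // Ask the Namenode where the blocks live
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("block hosts: " + Arrays.toString(loc.getHosts()));
        }

        // Read the file; the stream fetches block data from the Datanodes directly
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
      }
    }

Note the division of labor: the Namenode is only consulted for metadata; the bytes themselves come straight from the Datanodes.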
11. About that Namenode ...
[Diagram: one Namenode coordinating four Datanodes]
Namenode manages the filesystem and file metadata; Datanodes store the actual blocks of data
Namenode keeps track of available Datanodes and file locations across the cluster
Namenode is a SPOF (single point of failure)
If you lose the Namenode metadata, Hadoop has no idea which blocks make up which files
14. HDFS Tips & Tricks
Write Namenode metadata to multiple local devices & a remote one (NFS mount)
No RAID, use JBOD. More disks == more disk I/O
Mount disks with noatime (skip writing last accessed time on file reads)
LZO compression; saves space, speeds network transfer
Tweak and test settings with included JARs: TestDFSIO, sort example
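A minimal sketch of what the first two tips look like in hdfs-site.xml (0.20-era property names; the mount points are illustrative):

    <!-- hdfs-site.xml: illustrative values only -->
    <property>
      <name>dfs.name.dir</name>
      <!-- two local disks plus an NFS mount; the Namenode writes metadata to all three -->
      <value>/disk1/dfs/name,/disk2/dfs/name,/mnt/nfs/dfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <!-- JBOD: one directory per physical disk, no RAID -->
      <value>/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data,/disk4/dfs/data</value>
    </property>

And the noatime tip is a mount option, e.g. an /etc/fstab entry like:

    /dev/sdb1  /disk1  ext3  defaults,noatime  0 0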
16. Hadoop’s MapReduce
Framework for running tasks in parallel, based on Google’s MapReduce whitepaper
JobTracker is the master; schedules tasks on nodes, monitors tasks, and retries failures
TaskTrackers are the slaves; they run the specified task against the specified bits of data on HDFS
Map/Reduce functions operate on smaller parts of the problem, distributed across multiple nodes
18. Oversimplified MapReduce Example
18.106.61.94 - - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 354 company.com "-" "User agent"
77.220.219.58 - - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 238 company.com "-" "User agent"
121.41.7.104 - - [18/Jul/2010:07:02:42 -0400] "GET /index2 HTTP/1.1" 200 2079 company.com "-" "User agent"
42.7.64.102 - - [20/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 173 company.com "-" "User agent"
1. Each line of the log file is input into the map function. The map parses the line and emits a key/value pair representing the page, and that it was viewed once:
    mapper(filename, file-contents):
        for each line in file-contents:
            page = parsePage(line)
            emit(page, 1)
2. The reducer is given a key and all occurrences of values for that key. The reducer sums the values and outputs a key/value pair that represents the page and a total number of views:
    reduce(key, values):
        int views = 0
        for each value in values:
            views++
        emit(key, views)
3. The result is a count of how many times each webpage appeared in this log file: (index1, 3), (index2, 1)
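Here is roughly what that pseudocode looks like as real (0.20-era) Hadoop map and reduce classes. The class names are mine, and splitting on whitespace to grab the request path is a naive stand-in for parsePage:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PageViews {
      public static class PageMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          // Naive parse: the request path is the 7th whitespace-separated field
          // ("GET /index1 HTTP/1.1" -> /index1). Real logs need sturdier parsing.
          String[] fields = line.toString().split(" ");
          if (fields.length > 6) {
            page.set(fields[6]);
            ctx.write(page, ONE);  // emit(page, 1)
          }
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text page, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int views = 0;
          for (IntWritable c : counts) {
            views += c.get();      // sum all the 1s for this page
          }
          ctx.write(page, new IntWritable(views));  // emit(page, views)
        }
      }
    }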
21. Hadoop MapReduce data flow
InputFormat controls where data comes from, breaks it into InputSplits
RecordReader knows how to read an InputSplit, passes data to the map function
Mappers do their thing, output intermediate data to local disk
Hadoop shuffles, sorts keys in the map output so all occurrences of the same key are passed to the reducer together
Reducers do their thing, send output to the OutputFormat
OutputFormat controls where data goes
(chart from the Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/index.html)
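A minimal sketch of the driver that wires this flow together (0.20-era API); it reuses the hypothetical PageViews classes from the earlier sketch, and the paths are made up:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class PageViewsDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(PageViewsDriver.class);

        job.setInputFormatClass(TextInputFormat.class);    // where data comes from
        job.setMapperClass(PageViews.PageMapper.class);
        job.setReducerClass(PageViews.SumReducer.class);   // shuffle/sort happens in between
        job.setOutputFormatClass(TextOutputFormat.class);  // where data goes

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/logs"));       // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/logs-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }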
22. Input/Output Formats
TextInputFormat - Reads text files, each line is an input
TextOutputFormat - Writes output from Hadoop to plain text
DBInputFormat - Reads JDBC sources, rows map to custom DBWritable
DBOutputFormat - Writes to JDBC sources, again using DBWritable
ColumnFamilyInputFormat - Reads rows from a Cassandra ColumnFamily
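Swapping in DBInputFormat might look roughly like this; PageRow is a hypothetical class implementing Writable and DBWritable, and the connection details are made up:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

    public class DbInputSketch {
      // Called from a driver in place of the TextInputFormat lines above
      static void configure(Job job) {
        DBConfiguration.configureDB(job.getConfiguration(),
            "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/stats", "user", "secret");
        job.setInputFormatClass(DBInputFormat.class);
        // table "page_views", no WHERE conditions, ordered by page, two columns
        DBInputFormat.setInput(job, PageRow.class, "page_views",
            null, "page", "page", "views");
      }
    }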
23. MapReduce Tips & Tricks
You don’t have to do it in Java; current MapReduce abstractions are awesome
Pig, Hive - performance is close enough to native MR, with a big productivity boost
Hadoop Streaming - passes data through stdin/stdout so you can use any language; Ruby and Python are popular choices
Amazon’s Elastic MapReduce - on-demand MR jobs on EC2 instances
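A streaming job invocation looks roughly like this (0.20-era contrib path; mapper.py and reducer.py are hypothetical scripts that read stdin and write tab-separated key/value lines to stdout):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input /logs \
      -output /logs-out \
      -mapper mapper.py \
      -reducer reducer.py \
      -file mapper.py -file reducer.py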
24. Hadoop at Bronto
5 node cluster, adding 8 more; each node has 4x 1TB drives, 16GB memory, 8 cores
Mostly Pig scripts, some Java utility MR jobs
Jobs process raw data/mail logs; store aggregate stats in Cassandra
Ad-hoc scripts analyze internal logs for app monitoring/debugging
Using Cassandra with Hadoop (we’re rolling our own InputFormat)
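For the curious, a custom InputFormat boils down to two methods; this is just the shape of the thing (the Text/Text types and comments are stand-ins, not our actual Cassandra code):

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class SketchInputFormat extends InputFormat<Text, Text> {
      @Override
      public List<InputSplit> getSplits(JobContext ctx) throws IOException {
        // Carve the source into independent chunks, one map task per split
        // (for Cassandra, think one split per token range).
        throw new UnsupportedOperationException("sketch only");
      }

      @Override
      public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx)
          throws IOException {
        // Return a reader that iterates key/value pairs out of its split.
        throw new UnsupportedOperationException("sketch only");
      }
    }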
25. Summary
Hadoop excels at big data, analytics, batch processing
Not real-time, no random access; not a database
HDFS makes it all possible: massively scalable, fault tolerant file system
MapReduce provides framework for processing data on HDFS
Pig, Hive: easy to use, big productivity gain, close-enough performance in most cases