Hadoop World, October 2009: Production Deep Dive with High Availability
1. Hadoop at ContextWeb
Alex Dorman, VP Engineering
Paul George, Sr. Systems Architect
October 2009
2. ContextWeb: Traffic
ADSDAQ – Online Advertisement Exchange
Traffic:
Up to 6,000 ad requests per second
7 billion ad requests per month
5,000+ Active Publisher and Advertiser accounts
Account reports are updated every 15 minutes
About 50 internal reports for business users updated nightly
3. ContextWeb Architecture highlights
Pre-Hadoop aggregation framework
Logs are generated on each server and aggregated in memory into 15-minute chunks
Aggregation of logs from different servers into one log
Load to DB
Multi-stage aggregation in DB
About 20 different jobs end-to-end
Could take 2 hours to run through all stages
200 million records was the limit
4. Hadoop Data Set
Up to 120GB of raw log files per day (60GB compressed)
60 different aggregated data sets; 25TB total (compressed) to cover 1 year
50 different reports for Business and End Users
Major data sets are updated every 15 minutes
5. Hadoop Cluster
40 nodes/320 Cores (DELL 2950)
100TB total raw capacity
CentOS 5.3 x86_64
Hadoop 0.18.3-CH-0.2.1 (Cloudera), migrating to 0.20.x
NameNode high availability using DRBD Replication.
Log collection using custom scripts and Scribe
6. Hadoop Cluster
In-house developed Java framework on top of hadoop.mapred.* (see the sketch below)
Pig and Perl streaming for ad-hoc reports
OpsWise scheduler
~2000 MapReduce job executions per day
Exposing data to Windows:
WebDAV server with WebDrive clients
Reporting application: QlikView
Cloudera support for Hadoop
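A minimal sketch of what a job driver on the old org.apache.hadoop.mapred.* API looks like; the class names, log format and paths here are illustrative, not ContextWeb's actual framework code:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class RequestCountMR {

    // Counts ad requests per publisher id, assuming a tab-delimited log line
    // whose first field is the publisher id (an assumed format).
    public static class PublisherMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            String[] fields = line.toString().split("\t");
            out.collect(new Text(fields[0]), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(RequestCountMR.class);
        conf.setJobName("requests-per-publisher");

        conf.setMapperClass(PublisherMapper.class);
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```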
7. Architectural Challenges
How to organize the data set so that aggregated data sets stay fresh
Logs are constantly appended to the main data set; reports and aggregated data sets must be refreshed every 15 minutes
Mix of .NET and Java applications (70%+ .NET, 30% Java)
How do we make .NET applications write logs to Hadoop?
Some 3rd-party applications need to consume the results of MapReduce jobs (e.g. the reporting application)
How do we make 3rd-party or internal legacy applications read data from Hadoop?
Backward and forward compatibility of our data sets
Every month we add 3-5 new data points to our logs
9. Partitioned Data Set: Date/Time
Date/time as the main dimension for partitioning
Segregate the results of MapReduce jobs into monthly, daily or hourly directories
Use MultipleOutputFormat to segregate output into different files
Reprocess only what has changed: check the date/time in the filename to determine what is affected; a data set is regenerated if the input to the MR job contains data for that month/day/hour
Use PathFilter to specify which files to process (see the sketch below)
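A minimal sketch, assuming reducer keys begin with a yyyyMMdd timestamp: a MultipleTextOutputFormat subclass routes each record into a per-day output directory so a later job can reprocess only the days that changed. The key layout and directory naming are assumptions, not the production schema.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class DailyPartitionedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Key is assumed to start with "yyyyMMdd"; use the day as a subdirectory.
        String day = key.toString().substring(0, 8);
        // "name" is the default leaf name (e.g. part-00000); prefix it with the day.
        return day + "/" + name;
    }
}
```

The format is then registered on the job with conf.setOutputFormat(DailyPartitionedOutputFormat.class).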
10. Partitioned Data Set: Revisions
Need overlapping jobs:
Run 1: 12:00-12:10 Job 1.1 A->B; 12:10-12:20 Job 1.2 B->C; 12:20-12:30 Job 1.3 C->D
Run 2: 12:15-12:25 Job 2.1 A->B (!!! Job 1.2 is still reading set B !!!); 12:25-12:35 Job 2.2 B->C; 12:35-12:45 Job 2.3 C->D
Use revisions:
Run 1: 12:00-12:10 Job 1.1 A.r1->B.r1; 12:10-12:20 Job 1.2 B.r1->C.r1; 12:20-12:30 Job 1.3 C.r1->D.r1
Run 2: 12:15-12:25 Job 2.1 A.r2->B.r2; 12:25-12:35 Job 2.2 B.r2->C.r2; 12:35-12:45 Job 2.3 C.r2->D.r2
Assign a revision (timestamp) when generating output
Use MultipleOutputFormat to segregate output into different files
Use the highest available revision number when selecting input
Use PathFilter to specify which revisions to process (see the sketch below)
Clean up “old” revisions after some grace period
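A sketch (not the production framework) of selecting the highest available revision of a partition as job input; the /data/daily path and the RawLogD..._rN naming are illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class RevisionSelector {

    // Matches revision directories such as RawLogD0215_r4 for a given partition prefix.
    static class RevisionFilter implements PathFilter {
        private final String prefix;
        RevisionFilter(String prefix) { this.prefix = prefix; }
        public boolean accept(Path p) {
            return p.getName().startsWith(prefix + "_r");
        }
    }

    // Returns the directory with the highest _rN suffix, or null if none exists.
    static Path highestRevision(FileSystem fs, Path parent, String prefix) throws IOException {
        FileStatus[] candidates = fs.listStatus(parent, new RevisionFilter(prefix));
        if (candidates == null) return null;
        Path best = null;
        int bestRev = -1;
        for (FileStatus s : candidates) {
            String name = s.getPath().getName();
            int rev = Integer.parseInt(name.substring(name.lastIndexOf("_r") + 2));
            if (rev > bestRev) { bestRev = rev; best = s.getPath(); }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(RevisionSelector.class);
        FileSystem fs = FileSystem.get(conf);
        Path input = highestRevision(fs, new Path("/data/daily"), "RawLogD0215");
        FileInputFormat.addInputPath(conf, input);
        // ... set mapper/reducer classes and submit as usual ...
    }
}
```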
11. Partitioned Data Set: processing flow
[Flow diagram] 15-minute logs arrive from the ad serving platform into HDFS (e.g. LogRpt15 yyyy0215_hhmm). The IncomingMR job folds them into the historic daily raw logs (RawLogD 0214_r4, 0215_r4, 0216_r4), writing a new revision (RawLogD 0214_r5) for any day that was touched. The AdvMR job then rebuilds the daily aggregated data for advertisers (AdvD 0214_r3 -> 0214_r4, alongside 0215_r4 and 0216_r4), which feeds reporting and predictions.
13. Logical Schemas and Headers
Metadata repository defines the list of columns in all data sets
Each file has a header as its first line
Job configuration files define source and target
Columns are mapped dynamically based on the schema file and the header information
Each data set can contain individual files of different formats
No need to modify source code if a new column is added or the order of columns has changed
Support for default values if a column is missing in an older file
Easy to export to external applications (DB, reporting apps) (see the sketch below)
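A minimal sketch of header-driven column mapping, assuming tab-delimited files whose first line is a header; the class and column names are illustrative, not the actual metadata repository:

```java
import java.util.HashMap;
import java.util.Map;

public class HeaderSchema {
    private final Map<String, Integer> positions = new HashMap<String, Integer>();

    // headerLine is the first line of a file, e.g. "Date\tPublisherId\tImpressions"
    public HeaderSchema(String headerLine) {
        String[] names = headerLine.split("\t");
        for (int i = 0; i < names.length; i++) {
            positions.put(names[i], i);
        }
    }

    // Returns the named column from a split data line, or defaultValue when the
    // column does not exist in this (older) file.
    public String get(String[] fields, String column, String defaultValue) {
        Integer idx = positions.get(column);
        if (idx == null || idx >= fields.length) {
            return defaultValue;
        }
        return fields[idx];
    }
}
```

Because records are addressed by column name rather than position, adding or reordering columns does not require a code change.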
14. Getting Data in and out
Mix of .NET and Java applications (70%+ .NET, 30% Java)
How do we make .NET applications write logs to Hadoop?
Some 3rd-party applications need to consume the results of MapReduce jobs (e.g. the reporting application)
How do we make 3rd-party or internal legacy applications read data from Hadoop?
15. Getting Data in and out: WebDAV driver
The WebDAV server is part of the Hadoop source code tree
It needed some minor clean-up; it was co-developed with IponWeb
Available at http://www.hadoop.iponweb.net/Home/hdfs-over-webdav
There are multiple commercial Windows WebDAV clients you can use (we use WebDrive, http://www.webdrive.com/)
Linux: mount modules are available from http://dav.sourceforge.net/ (see the example below)
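An illustrative mount command using the davfs module; the hostname and port of the WebDAV server are placeholders, not our actual endpoints:

```
# Mount HDFS (exposed by the WebDAV server) as a local directory on Linux.
mount -t davfs http://hadoop-webdav.example.com:9800/ /mnt/hdfs
```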
16. Getting Data in and out: WebDav
[Architecture diagram] Windows and Linux data consumers run a WebDAV client (e.g. WebDrive) and see HDFS as a network drive. Their file listing, getProperties and data requests go to the WebDAV server, which translates them into HDFS API calls against the master (NameNode) and streams data to and from the data nodes.
17. QlikView Reporting Application
In-memory DB
AJAX support for integration into web portals
TXT files are supported
Understands headers
WebDAV allows it to load data directly from Hadoop
Coming soon: generation of QlikView files as output of Hadoop MR jobs
18. High Availability for NameNode/JobTracker
Goals
Availability! (But not stateful)
Failed jobs are resubmitted by the workflow scheduler
Target: < 5 minutes of downtime per incident
Automatic failover with no human action required
No phone calls, no experts required
Alert that it happened, not that it needs to be fixed
Allow for maintenance windows
Avoid failover at all costs
Whenever possible, use redundancy inside the box:
Disks (RAID 1), network bonding, dual power supplies
19. Redundant Network Architecture
• Use Linux bonding
• See bonding.txt from Linux kernel docs.
• Throughput advantage
– Observed at 1.76Gb/s
• We use LACP, aka 802.3ad, aka mode=4.
– http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol
– Must be supported by your switches.
• On the data nodes, too. Great for rebalancing.
• Keep nodes on different switches
• Use a dedicated crossover connection, too (a bonding config sketch follows)
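A sketch of LACP (mode=4) bonding on CentOS 5; the device names and addresses are placeholders, not our production values:

```
# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=4 miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=10.0.0.11
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```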
20. Software Packages We Use for HA
Linux-HA Project’s Heartbeat
(http://www.linux-ha.org)
Default resource manager, haresources
Manages multiple resources:
Virtual IP address
DRBD Disk and file system
Hadoop init scripts (from Cloudera’s distribution)
DRBD by LINBIT
(http://www.drbd.org)
“DRBD can be understood as network based raid-1.”
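A minimal sketch of the Heartbeat v1 configuration described above; node names, addresses, devices and init-script names are placeholders rather than our actual values:

```
# /etc/ha.d/ha.cf (excerpt)
deadtime 10                 # seconds of silence before the peer is declared dead
bcast eth0
node nn-master nn-standby
auto_failback off

# /etc/ha.d/haresources: preferred node, then resources in start order
nn-master drbddisk::r0 \
          Filesystem::/dev/drbd0::/hadoop::ext3 \
          IPaddr::10.0.0.100/24/bond0 \
          hadoop-namenode hadoop-jobtracker
```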
21. Replication of NameNode Metadata
DRBD Replication.
Block level replication, file system agnostic
File system is active on only one node at a time
We use synchronous replication
Move only the data that you need! (metadata, not the whole system)
2.6 million files, 33k dirs, 60TB = 1.3GB of metadata (not a lot to move)
Still consider running your secondary namenode on another machine and/or an NFS dir!
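A sketch of the DRBD resource that would hold the NameNode metadata volume; hostnames, devices and addresses are placeholders:

```
# /etc/drbd.conf (excerpt)
resource r0 {
  protocol C;                  # synchronous replication
  on nn-master {
    device    /dev/drbd0;
    disk      /dev/sdb1;       # volume holding the NameNode metadata
    address   10.0.0.11:7788;
    meta-disk internal;
  }
  on nn-standby {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.12:7788;
    meta-disk internal;
  }
}
```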
Also protect the metadata with LVM snapshots and periodic copies of the image and edits log over HTTP:
/getimage?getimage=1
/getimage?getedit=1
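For illustration, the image and edits log can be pulled from the NameNode’s web interface (50070 is the default HTTP port; the hostname is a placeholder):

```
curl -o fsimage.bak 'http://namenode:50070/getimage?getimage=1'
curl -o edits.bak   'http://namenode:50070/getimage?getedit=1'
```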
22. In the Unlikely Event of a Water Landing
Order of Events, the magic of Heartbeat
• Detect the failure (“deadtime” from ha.cf)
• Virtual IP fails over.
• DRBD system switches primary node. (/proc/drbd status)
• File system fsck and mount at /hadoop.
• Hadoop processes started via Cloudera init scripts.
• Optionally, the original master is rebooted (if it’s still alive)
• End-to-end failover time is approximately 15 seconds.
Does it work?
• Yes!! 6 failovers in the past 18 months
• (only 3 were planned)
26. Other Options to Consider
(or: How I Learned to Stop Worrying and Start Over From the
Beginning)
Explore additional resource management systems
• e.g., OpenAIS + Pacemaker: N+1, N-to-N
• Be resource aware, not just machine aware
Consider additional file system replication methods
• e.g., GlusterFS, Red Hat GFS
• SAN/iSCSI backed
Virtualized solutions?
Other things I don’t know about yet
• Solutions to the problem exist
• Work with something you’re comfortable with
http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/