5. • HDFS - Distributed Resilient Filesystem
• Massively scalable (> 50 PB in production)
• Replicated data storage, self-healing, fault-tolerant
• Master/Slave (Name Node/Data Nodes)
• MapReduce - Distributed Parallel Processing
• Schema-less data stream processing
• Paradigm shift: processing to where data exists
• Master/Slave (Job Tracker/Task Trackers)
• Hive - SQL Data Warehouse with ODBC/JDBC Access
• SQL-like queries for ETL and data warehousing
• Oracle and Business Objects connectors
• HBase - NoSQL Database with REST/Thrift Access
• Sub-second reads/writes
A few important things...
6. • Very Large Distributed File System
• >10K nodes, >100 million files, >50 PB
• Assumes Commodity Hardware
• Files are replicated to handle hardware failure
• Detects failures and recovers from them
• 128 MB blocks replicated at least thrice
• CRC32 checksums detect corruption on each node; bad blocks are re-replicated from healthy copies
• Optimized for Batch Processing
• Computation moves to where the data resides (see the block-location sketch below)
• Provides very high aggregate bandwidth
• Runs in user space on heterogeneous operating systems
• Transaction Log stored on Name Node
• Replicated locally and to NFS/CIFS
• ZooKeeper quorum of Name Nodes
• No single point of failure (SPOF)
HDFS - Overview
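Data locality and replication can be observed directly from the client side: the Name Node reports, for every 128 MB block, which Data Nodes hold a replica. Below is a minimal sketch using the standard Hadoop FileSystem API; the NameNode URI and file path are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Ask the Name Node for the file's metadata: size, block size, replication
        FileStatus status = fs.getFileStatus(new Path("/data/events.log")); // assumed path
        System.out.printf("blockSize=%d replication=%d%n",
                status.getBlockSize(), status.getReplication());

        // Each block lists the Data Nodes holding a replica; the scheduler uses this
        // to run computation on (or near) the nodes that already store the data
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block.getOffset() + " -> " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}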
8. HDFS - Write
[Diagram: the client first registers the new file with the Name Node (1. Create Metadata), then streams blocks 1, 2 and 3 directly to the Data Nodes (2. Put Blocks); each block is replicated across several Data Nodes while the Name Node performs control and monitoring only.]
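A minimal client-side sketch of the same write flow with the standard FileSystem API: create() registers the file with the Name Node, and the output stream then writes replicated blocks to the Data Nodes. The URI, path, and replication factor are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // 1. Create Metadata: the Name Node records the new file entry
        // 2. Put Blocks: the stream writes data in blocks straight to the Data Nodes,
        //    which replicate each block (replication factor 3, 128 MB block size)
        Path path = new Path("/data/example.txt"); // assumed path
        try (FSDataOutputStream out =
                 fs.create(path, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}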
9. HDFS - Read
[Diagram: the client asks the Name Node for the file's block locations (1. Get Metadata), then fetches blocks 1-4 in parallel directly from the Data Nodes that hold them (2. Fetch Blocks); the Name Node performs control and monitoring only, so no file data passes through it.]
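The matching read path as a sketch: open() fetches the block list from the Name Node, and the stream then pulls the blocks directly from the Data Nodes. URI and path are again assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // 1. Get Metadata: open() obtains the block locations from the Name Node
        // 2. Fetch Blocks: the stream reads block data directly from the Data Nodes
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/example.txt"))))) { // assumed path
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}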
10. • Pioneered by Google
• Moves processing to data
• Works like a Unix pipeline (units of work are called Jobs)
• cat input | grep | sort | uniq -c | cat > output
• input | Map | Shuffle/Sort | Reduce | Output
• Job pipeline made of Mappers and Reducers (sketched below)
• Mappers: map input records to key-value pairs
• Reducers: receive all values for a key and output an aggregation
• Jobs submitted to the JobTracker for execution
• Mappers/Reducers sent to the Data Node that holds the data block
• Mapper/Reducer state managed and maintained by the JobTracker
• Innate fault tolerance: the JobTracker restarts failed tasks
MapReduce - Overview
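To make the Mapper/Reducer roles concrete, here is a sketch of the canonical word-count job in the Hadoop Java API: the Mapper emits (word, 1) pairs, the shuffle/sort phase groups them by key, and the Reducer sums the counts. Class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: turns each input line into (word, 1) key-value pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives all counts for a word (after shuffle/sort) and outputs the sum
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}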
11. MapReduce – Job Submit
[Diagram: the client sets up the job (1. Setup Job) and submits it to the Job Tracker, which distributes Map (M) and Reduce (R) tasks across the Task Trackers and performs control and monitoring of the running tasks.]
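A matching driver sketch for the "1. Setup Job" step: the client configures the job and hands it to the cluster (the Job Tracker in classic MapReduce), which farms the Map and Reduce tasks out to the Task Trackers. It uses the mapper/reducer classes sketched above; input/output paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: sets up the job and submits it for execution on the cluster
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // assumed path
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // assumed path
        // Blocks until the job completes; tasks run on the nodes holding the data blocks
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}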
14. • Data warehouse infrastructure built on Hadoop
• Providing data summarization, query, analysis, etc.
• ETL, structure, and access to different storage types (HDFS, HBase)
• Query execution via MapReduce.
• Key Building Principles:
• SQL-like queries (HiveQL)
• Web based UI for ad-hoc queries (Hue)
• Extensibility – Types, Functions, Formats, Scripts
• Performance, Scalability, Reliability
• Data units are Databases, Tables, Partitions & Buckets (Clusters)
• ODBC and JDBC connectors readily available (see the JDBC sketch below)
• Used extensively by Facebook, Reuters, UBS
Hive - Overview
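Because JDBC connectors are readily available, any JDBC client can submit HiveQL; here is a minimal sketch against HiveServer2, where the host, port, credentials, and the page_views table are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Standard Hive JDBC driver for HiveServer2
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "user", ""); // assumed host/user
             Statement stmt = conn.createStatement();
             // The HiveQL query is compiled by Hive into MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}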
15. Hive - Components
[Diagram: clients reach Hive through the Hive CLI (DDL, queries, browsing), the Hue web UI, a management web UI, ODBC, and the Thrift API. Hive QL passes through a Parser, Planner, and Execution engine, backed by the MetaStore for table metadata and a SerDe layer (Thrift, Jute, JSON, ...) for serialization. Queries execute as MapReduce over the data warehouse stored in HDFS.]
16. Hive – Data Warehousing at Facebook
> 800M users, > 600TB of data, > 2TB a day, > 4000 queries a day
17. HBase - Overview
• Very large scale analytic processing
• Big queries – typically range or table scans.
• Big databases (100s of TB)
• Sub second query response time
• RDBMS performance is good for transaction processing but very inefficient for very large scale analytic processing
• Column oriented database
• No SQL queries
• Tables have one primary key and index
• No join operations
• Data is unstructured and not typed
• Data is versioned with timestamp
• Ultra-fast lookup by row key and optional timestamp (see the Put/Get sketch below)
• Full table scans & range scans remain efficient
• Built on HDFS
• Horizontal scalability, highly available, high performance
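A sketch of the row-key read/write path with the classic HBase Java client, using the webtable layout from the data model slide that follows; the table name, column families, and row key are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable"); // assumed table name

        // Write: row key + column family:qualifier + value (each cell versioned by timestamp)
        Put put = new Put(Bytes.toBytes("com.cnn.www"));
        put.add(Bytes.toBytes("contents"), Bytes.toBytes(""), Bytes.toBytes("<html>...</html>"));
        put.add(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"), Bytes.toBytes("CNN"));
        table.put(put);

        // Read: sub-second lookup by row key; returns the latest version by default
        Get get = new Get(Bytes.toBytes("com.cnn.www"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("contents"), Bytes.toBytes(""))));
        table.close();
    }
}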
18. HBase – Data Model
Row key            Timestamp   Column “contents:”   Column “anchor:”
“com.apache.www”   t12         “<html>…”
                   t11         “<html>…”
                   t10                              “anchor:apache.com” = “APACHE”
“com.cnn.www”      t15                              “anchor:cnnsi.com” = “CNN”
                   t13                              “anchor:my.look.ca” = “CNN.com”
                   t6          “<html>…”
                   t5          “<html>…”
                   t3          “<html>…”
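Range scans follow the same pattern because rows are stored sorted by row key; here is a sketch scanning an assumed key range of the same webtable (the stop row is exclusive).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable"); // assumed table name

        // Rows are stored sorted by row key, so a range scan is a sequential read
        // from the start key up to (but excluding) the stop key
        Scan scan = new Scan(Bytes.toBytes("com.apache."), Bytes.toBytes("com.apache/"));
        scan.addFamily(Bytes.toBytes("anchor"));
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        }
        table.close();
    }
}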
20. • Pig – English-like “Pig Latin” language to build queries
• Rapid prototyping
• Sqoop – Structured data ingest
• Highly available structured data ingest
• Supports JDBC connections to Oracle, MySQL, etc.
• Flume – Real-time event ingest
• Ingests real-time events, scaling to >100k events per second
• Scalding – Pipeline construction in Scala
• Created by Twitter, built on Cascading
• Production-ready code with TDD and CI
• Oozie – Job scheduler and coordinator
• Ideal for building job dependencies
• Schedules jobs to run automatically
• Mahout – Distributed machine learning algorithms
• R & RMR for statistical analysis and insight
Big Data – More tools