5. • HDFS - Distributed Resilient Filesystem
• Massively scalable (> 50 PB in production)
• Replicated data storage, self-healing, fault-tolerant
• Master/Slave (Name Node/Data Nodes)
• MapReduce - Distributed Parallel Processing
• Schema-less data stream processing
• Paradigm shift: processing to where data exists
• Master/Slave (Job Tracker/Task Trackers)
• Hive - SQL Data Warehouse with ODBC/JDBC Access
• SQL-like queries for ETL and data warehousing
• Oracle and Business Objects connectors
• HBase - NoSQL Database with REST/Thrift Access
• Sub-second reads/writes
A few important things...
6. • Very Large Distributed File System
• >10K nodes, >100 million files, >50 PB
• Assumes Commodity Hardware
• Files are replicated to handle hardware failure
• Detects failures and recovers from them
• 128 MB blocks replicated at least thrice
• CRC32 checksums detect corruption on each node; bad blocks are re-replicated from healthy copies
• Optimized for Batch Processing
• Computation moves to where the data resides (see the block-location sketch below)
• Provides very high aggregate bandwidth
• Runs in user space on heterogeneous operating systems
• Transaction Log stored on Name Node
• Replicated locally and to NFS/CIFS
• ZooKeeper quorum of Name Nodes
• No single point of failure (SPOF)
HDFS - Overview
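Data locality and replication can be observed directly from the client side: the Name Node reports, for every 128 MB block, which Data Nodes hold a replica. Below is a minimal sketch using the standard Hadoop FileSystem API; the NameNode URI and file path are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Ask the Name Node for the file's metadata: size, block size, replication
        FileStatus status = fs.getFileStatus(new Path("/data/events.log")); // assumed path
        System.out.printf("blockSize=%d replication=%d%n",
                status.getBlockSize(), status.getReplication());

        // Each block lists the Data Nodes holding a replica; the scheduler uses this
        // to run computation on (or near) the nodes that already store the data
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block.getOffset() + " -> " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}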
8. HDFS - Write
[Diagram: the client first registers the new file with the Name Node (1. Create Metadata), then streams blocks 1, 2 and 3 directly to the Data Nodes (2. Put Blocks); each block is replicated across several Data Nodes while the Name Node performs control and monitoring only.]
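A minimal client-side sketch of the same write flow with the standard FileSystem API: create() registers the file with the Name Node, and the output stream then writes replicated blocks to the Data Nodes. The URI, path, and replication factor are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // 1. Create Metadata: the Name Node records the new file entry
        // 2. Put Blocks: the stream writes data in blocks straight to the Data Nodes,
        //    which replicate each block (replication factor 3, 128 MB block size)
        Path path = new Path("/data/example.txt"); // assumed path
        try (FSDataOutputStream out =
                 fs.create(path, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}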
9. HDFS - Read
[Diagram: the client asks the Name Node for the file's block locations (1. Get Metadata), then fetches blocks 1-4 in parallel directly from the Data Nodes that hold them (2. Fetch Blocks); the Name Node performs control and monitoring only, so no file data passes through it.]
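The matching read path as a sketch: open() fetches the block list from the Name Node, and the stream then pulls the blocks directly from the Data Nodes. URI and path are again assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // 1. Get Metadata: open() obtains the block locations from the Name Node
        // 2. Fetch Blocks: the stream reads block data directly from the Data Nodes
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/example.txt"))))) { // assumed path
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}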
10. • Pioneered by Google
• Moves processing to data
• Works like a Unix pipeline (units of work are called Jobs)
• cat input | grep | sort | uniq -c | cat > output
• input | Map | Shuffle/Sort | Reduce | Output
• Job pipeline made of Mappers and Reducers (sketched below)
• Mappers: map input records to key-value pairs
• Reducers: receive all values for a key and output an aggregation
• Jobs submitted to the JobTracker for execution
• Mappers/Reducers sent to the Data Node that holds the data block
• Mapper/Reducer state managed and maintained by the JobTracker
• Innate fault tolerance: the JobTracker restarts failed tasks
MapReduce - Overview
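To make the Mapper/Reducer roles concrete, here is a sketch of the canonical word-count job in the Hadoop Java API: the Mapper emits (word, 1) pairs, the shuffle/sort phase groups them by key, and the Reducer sums the counts. Class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: turns each input line into (word, 1) key-value pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives all counts for a word (after shuffle/sort) and outputs the sum
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}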
11. MapReduce – Job Submit
[Diagram: the client sets up the job (1. Setup Job) and submits it to the Job Tracker, which distributes Map (M) and Reduce (R) tasks across the Task Trackers and performs control and monitoring of the running tasks.]
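A matching driver sketch for the "1. Setup Job" step: the client configures the job and hands it to the cluster (the Job Tracker in classic MapReduce), which farms the Map and Reduce tasks out to the Task Trackers. It uses the mapper/reducer classes sketched above; input/output paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: sets up the job and submits it for execution on the cluster
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // assumed path
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // assumed path
        // Blocks until the job completes; tasks run on the nodes holding the data blocks
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}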
14. • Data warehouse infrastructure built on Hadoop
• Providing data summarization, query, analysis, etc.
• ETL, structure, and access to different storage types (HDFS, HBase)
• Query execution via MapReduce.
• Key Building Principles:
• SQL-like queries (HiveQL)
• Web based UI for ad-hoc queries (Hue)
• Extensibility – Types, Functions, Formats, Scripts
• Performance, Scalability, Reliability
• Data units are Databases, Tables, Partitions & Buckets (Clusters)
• ODBC and JDBC connectors readily available (see the JDBC sketch below)
• Used extensively by Facebook, Reuters, UBS
Hive - Overview
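Because JDBC connectors are readily available, any JDBC client can submit HiveQL; here is a minimal sketch against HiveServer2, where the host, port, credentials, and the page_views table are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Standard Hive JDBC driver for HiveServer2
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "user", ""); // assumed host/user
             Statement stmt = conn.createStatement();
             // The HiveQL query is compiled by Hive into MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}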
15. Hive - Components
[Diagram: clients reach Hive through the Hive CLI (DDL, queries, browsing), the Hue web UI, a management web UI, ODBC, and the Thrift API. Hive QL passes through a Parser, Planner, and Execution engine, backed by the MetaStore for table metadata and a SerDe layer (Thrift, Jute, JSON, ...) for serialization. Queries execute as MapReduce over the data warehouse stored in HDFS.]
16. Hive – Data Warehousing at Facebook
> 800M users, > 600TB of data, > 2TB a day, > 4000 queries a day
17. HBase - Overview
• Very large scale analytic processing
• Big queries – typically range or table scans.
• Big databases (100s of TB)
• Sub second query response time
• RDBMS performance is good for transaction processing but very inefficient for very large scale analytic processing
• Column oriented database
• No SQL queries
• Tables have one primary key and index
• No join operations
• Data is unstructured and not typed
• Data is versioned with timestamp
• Ultra-fast lookup by row key and optional timestamp (see the Put/Get sketch below)
• Full table scans & range scans remain efficient
• Built on HDFS
• Horizontal scalability, highly available, high performance
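A sketch of the row-key read/write path with the classic HBase Java client, using the webtable layout from the data model slide that follows; the table name, column families, and row key are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable"); // assumed table name

        // Write: row key + column family:qualifier + value (each cell versioned by timestamp)
        Put put = new Put(Bytes.toBytes("com.cnn.www"));
        put.add(Bytes.toBytes("contents"), Bytes.toBytes(""), Bytes.toBytes("<html>...</html>"));
        put.add(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"), Bytes.toBytes("CNN"));
        table.put(put);

        // Read: sub-second lookup by row key; returns the latest version by default
        Get get = new Get(Bytes.toBytes("com.cnn.www"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("contents"), Bytes.toBytes(""))));
        table.close();
    }
}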
18. HBase – Data Model
Row key            Timestamp   Column “contents:”   Column “anchor:”
“com.apache.www”   t12         “<html>…”
                   t11         “<html>…”
                   t10                              “anchor:apache.com” = “APACHE”
“com.cnn.www”      t15                              “anchor:cnnsi.com” = “CNN”
                   t13                              “anchor:my.look.ca” = “CNN.com”
                   t6          “<html>…”
                   t5          “<html>…”
                   t3          “<html>…”
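Range scans follow the same pattern because rows are stored sorted by row key; here is a sketch scanning an assumed key range of the same webtable (the stop row is exclusive).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable"); // assumed table name

        // Rows are stored sorted by row key, so a range scan is a sequential read
        // from the start key up to (but excluding) the stop key
        Scan scan = new Scan(Bytes.toBytes("com.apache."), Bytes.toBytes("com.apache/"));
        scan.addFamily(Bytes.toBytes("anchor"));
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        }
        table.close();
    }
}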
20. • Pig – English-like “Pig Latin” language to build queries
• Rapid prototyping
• Sqoop – Structured data ingest
• Highly available structured data ingest
• Supports JDBC connections to Oracle, MySQL, etc.
• Flume – Real-time event ingest
• Ingests real-time events, scaling to >100k events per second
• Scalding – Pipeline construction in Scala
• Created by Twitter, built on Cascading
• Production-ready code with TDD and CI
• Oozie – Job scheduler and coordinator
• Ideal for building job dependencies
• Schedules jobs to run automatically
• Mahout – Distributed machine learning algorithms
• R & RMR for statistical analysis and insight
Big Data – More tools