Introducing Apache Hadoop: The Modern Data Operating System - Stanford EE3801. 11/16/2011, Stanford EE380 Computer Systems Colloquium
Introducing Apache Hadoop:
The Modern Data Operating System
Dr. Amr Awadallah | Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah
2. Limitations of Existing Data Analytics Architecture
BI Reports + Interactive Apps Can’t Explore Original
High Fidelity Raw Data
RDBMS (aggregated data)
ETL Compute Grid
Moving Data To
Compute Doesn’t Scale
Storage Only Grid (original raw data)
Archiving =
Mostly Append
Premature
Collection Data Death
Instrumentation
2
©2011 Cloudera, Inc. All Rights Reserved.
3. So What is Apache Hadoop ?
• A scalable fault-tolerant distributed system for
data storage and processing (open source
under the Apache license).
• Core Hadoop has two main systems:
– Hadoop Distributed File System: self-healing
high-bandwidth clustered storage.
– MapReduce: distributed fault-tolerant resource
management and scheduling coupled with a
scalable data programming abstraction.
3
©2011 Cloudera, Inc. All Rights Reserved.
4. The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS): Schema-on-Read (Hadoop):
• Schema must be created before • Data is simply copied to the file
any data can be loaded. store, no transformation is needed.
• An explicit load operation has to • A SerDe (Serializer/Deserlizer) is
take place which transforms applied during read time to extract
data to DB internal structure. the required columns (late binding)
• New columns must be added • New data can start flowing anytime
explicitly before new data for and will appear retroactively once
such columns can be loaded the SerDe is updated to parse it.
into the database.
• Read is Fast • Load is Fast
Pros
• Standards/Governance • Flexibility/Agility
4
©2011 Cloudera, Inc. All Rights Reserved.
6. Flexibility: Complex Data Processing
1. Java MapReduce: Most flexibility and performance, but tedious
development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): Allows you to develop in
any programming language of your choice, but slightly lower
performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java
(modeled After Google’s FlumeJava)
4. Pig Latin: A high-level language out of Yahoo, suitable for batch
data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a meta-
store mapping files to their schemas and associated SerDes.
6. Oozie: A PDL XML workflow engine that enables creating a
workflow of jobs composed of any of the above.
6
©2011 Cloudera, Inc. All Rights Reserved.
7. Scalability: Scalable Software Development
Grows without requiring developers to
re-architect their algorithms/application.
AUTO SCALE
7
©2011 Cloudera, Inc. All Rights Reserved.
8. Scalability: Data Beats Algorithm
Smarter Algos More Data
A. Halevy et al, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009
8
©2011 Cloudera, Inc. All Rights Reserved.
9. Scalability: Keep All Data Alive Forever
Archive to Tape and Extract Value From
Never See It Again All Your Data
9
©2011 Cloudera, Inc. All Rights Reserved.
10. Use The Right Tool For The Right Job
Relational Databases: Hadoop:
Use when: Use when:
• Interactive OLAP Analytics (<1sec) • Structured or Not (Flexibility)
• Multistep ACID Transactions • Scalability of Storage/Compute
• 100% SQL Compliance • Complex Data Processing
10
©2011 Cloudera, Inc. All Rights Reserved.
11. HDFS: Hadoop Distributed File System
A given file is broken down into blocks
(default=64MB), then blocks are
replicated across cluster (default=3).
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
Block Replication for:
• Durability
• Availability
• Throughput
Block Replicas are distributed
across servers and racks.
11
©2011 Cloudera, Inc. All Rights Reserved.
12. MapReduce: Computational Framework
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
(words, counts)
Split 1 (docid, text) Map 1 (sorted words, counts)
Output
Be, 5 Reduce 1
(sorted words,
sum of counts)
File 1
“To Be
Or Not Be, 30
To Be?”
Be, 12
Output
(sorted words,
Reduce i File i
Split i (docid, text) Map i sum of counts)
Be, 7
Be, 6
Shuffle Output
(sorted words,
Reduce R File R
sum of counts)
Split N (docid, text) Map M (words, counts) (sorted words, counts)
Map(in_key, in_value) => list of (out_key, intermediate_value) Reduce(out_key, list of intermediate_values) => out_value(s)
12
©2011 Cloudera, Inc. All Rights Reserved.
13. MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks,
then tasks are scheduled to be as
close to data as possible.
Three levels of data locality:
• Same server as data (local disk)
• Same rack as data (rack/leaf switch)
• Wherever there is a free slot (cross rack)
Optimized for:
• Batch Processing
• Failure Recovery
System detects laggard tasks and
speculatively executes parallel tasks
on the same slice of data.
13
©2011 Cloudera, Inc. All Rights Reserved.
14. But Networks Are Faster Than Disks!
Yes, however, core and disk density per server
are going up very quickly:
• 1 Hard Disk = 100MB/sec (~1Gbps)
• Server = 12 Hard Disks = 1.2GB/sec (~12Gbps)
• Rack = 20 Servers = 24GB/sec (~240Gbps)
• Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps)
• Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps)
• Scanning 4.8TB at 100MB/sec takes 13 hours.
14
©2011 Cloudera, Inc. All Rights Reserved.
15. Hadoop High-Level Architecture
Hadoop Client
Contacts Name Node for data
or Job Tracker to submit jobs
Name Node Job Tracker
Maintains mapping of file names Tracks resources and schedules
to blocks to data node slaves. jobs across task tracker slaves.
Data Node Task Tracker
Stores and serves Runs tasks (work units)
blocks of data within a job
Share Physical Node
15
©2011 Cloudera, Inc. All Rights Reserved.
16. Changes for Better Availability/Scalability
Hadoop Client
Federation partitions Contacts Name Node for data Each job has its own
or Job Tracker to submit jobs
out the name space, Application Manager,
High Availability via Resource Manager is
an Active Standby. decoupled from MR.
Name Node Job Tracker
Data Node Task Tracker
Stores and serves Runs tasks (work units)
blocks of data within a job
Share Physical Node
16
©2011 Cloudera, Inc. All Rights Reserved.
17. CDH: Cloudera’s Distribution Including Apache Hadoop
File System Mount UI Framework/SDK Data Mining
Build/Test: APACHE BIGTOP
FUSE-DFS HUE APACHE MAHOUT
Workflow Scheduling Metadata
APACHE OOZIE APACHE OOZIE APACHE HIVE
Languages / Compilers Fast
Data APACHE PIG, APACHE HIVE
Read/Write
Integration
Access
APACHE FLUME,
APACHE SQOOP APACHE HBASE
Coordination APACHE ZOOKEEPER
SCM Express (Installation Wizard for CDH)
17
©2011 Cloudera, Inc. All Rights Reserved.
18. Books
18
©2011 Cloudera, Inc. All Rights Reserved.
19. Conclusion
• The Key Benefits of Apache Hadoop:
– Agility/Flexibility (Quickest Time to Insight).
– Complex Data Processing (Any Language, Any Problem).
– Scalability of Storage/Compute (Freedom to Grow).
– Economical Storage (Keep All Your Data Alive Forever).
• The Key Systems for Apache Hadoop are:
– Hadoop Distributed File System: self-healing high-
bandwidth clustered storage.
– MapReduce: distributed fault-tolerant resource
management coupled with scalable data processing.
19
©2011 Cloudera, Inc. All Rights Reserved.
21. Unstructured Data is Exploding
Complex, Unstructured
Relational
• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2
“zettabytes” this year
Source: IDC White Paper - sponsored by EMC.
As the Economy Contracts, the Digital Universe Expands. May 2009.
21
©2011 Cloudera, Inc. All Rights Reserved.
22. Hadoop Creation History
• Fastest sort of a TB,
62secs over 1,460 nodes
• Sorted a PB in 16.25hours
over 3,658 nodes
22
©2011 Cloudera, Inc. All Rights Reserved.
23. Hadoop in the Enterprise Data Stack
Data Scientists Analysts Business Users
Enterprise
IDEs BI, Analytics
Reporting
Development Tools Business Intelligence Tools
System
Operators
ODBC, JDBC,
Cloudera
NFS, Native
Mgmt Suite Enterprise
ETL Tools
Data
Warehouse
Sqoop
Data
Architects Customers
Low-Latency Web
Flume Flume Flume Sqoop
Serving Application
Relational Systems
Logs Files Web Data
Databases
23
©2011 Cloudera, Inc. All Rights Reserved.
24. MapReduce Next Gen
Main idea is to split up the JobTracker functions:
• Cluster resource management (for tracking and
allocating nodes)
• Application life-cycle management (for MapReduce
scheduling and execution)
Enables:
• High Availability
• Better Scalability
• Efficient Slot Allocation
• Rolling Upgrades
• Non-MapReduce Apps
24
©2011 Cloudera, Inc. All Rights Reserved.
25. Two Core Use Cases Common Across Many Industries
Use Case Application Industry Application Use Case
Web
ADVANCED ANALYTICS
Social Network Analysis Clickstream Sessionization
DATA PROCESSING
Content Optimization Media Clickstream Sessionization
Network Analytics Telco Mediation
Loyalty & Promotions Retail Data Factory
Fraud Analysis Financial Trade Reconciliation
Entity Analysis Federal SIGINT
Sequencing Analysis Bioinformatics Genome Mapping
Product Quality Manufacturing Mfg Process Tracking
25
©2011 Cloudera, Inc. All Rights Reserved.
26. What is Cloudera Enterprise?
Cloudera Enterprise makes open CLOUDERA ENTERPRISE COMPONENTS
source Apache Hadoop enterprise-easy
Simplify and Accelerate Hadoop Deployment Cloudera Production-
Management Level Support
Reduce Adoption Costs and Risks
Suite
Lower the Cost of Administration
Comprehensive Our Team of Experts
Increase the Transparency & Control of Hadoop On-Call to Help You
Toolset for Hadoop
Leverage the Experience of Our Experts Administration Meet Your SLAs
3 of the top 5 telecommunications, mobile services, defense &
intelligence, banking, media and retail organizations depend on Cloudera
EFFECTIVENESS EFFICIENCY
Ensuring Repeatable Value from Enabling Apache Hadoop to be
Apache Hadoop Deployments Affordably Run in Production
26
©2011 Cloudera, Inc. All Rights Reserved.
27. Hive vs Pig Latin (count distinct values > 0)
• Hive Syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig Latin Syntax:
mytable = LOAD ‘myfile’ AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT col1;
mytable = GROUP mytable BY col1;
mytable = FOREACH mytable GENERATE COUNT(mytable);
DUMP mytable;
27
©2011 Cloudera, Inc. All Rights Reserved.
28. Apache Hive Key Features
• A subset of SQL covering the most common statements
• JDBC/ODBC support
• Agile data types: Array, Map, Struct, and JSON objects
• Pluggable SerDe system to work on unstructured files directly
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• Partitions and Buckets (for performance optimization)
• Microstrategy/Tableau Compatibility (through ODBC)
• In The Works: Indices, Columnar Storage, Views, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive
28
©2011 Cloudera, Inc. All Rights Reserved.
29. Hive Agile Data Types
• STRUCTS:
– SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
– SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
– SELECT mytable.mycolumn[5] FROM …
• JSON:
– SELECT get_json_object(mycolumn, objpath) FROM …
29
©2011 Cloudera, Inc. All Rights Reserved.