Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanced Data Analytics at Yahoo - Amr Awadallah, Cloudera (November 1, 2011)
How Apache Hadoop is Revolutionizing
Business Intelligence and Data Analytics
Dr. Amr Awadallah | Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah
2. Business Intelligence Before Adopting Apache Hadoop
[Diagram: BI data pipeline before Hadoop, top to bottom]
• BI Reports + Interactive Apps
• RDBMS (processed data)
• ETL Compute Grid
• Storage Only Grid (original raw data), Mostly Append
• Collection
• Instrumentation
Callouts:
• Can’t Explore Original High Fidelity Raw Data
• Moving Data To Compute Doesn’t Scale
• Archiving = Premature Data Death
©2011 Cloudera, Inc. All Rights Reserved.
3. Business Intelligence After Adopting Apache Hadoop
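[Diagram: BI data pipeline after adopting Hadoop, top to bottom]
• BI Reports + Interactive Apps, alongside Data Exploration & Advanced Analytics
• RDBMS
• Hadoop: Storage + Compute Grid
• Collection
• Instrumentation
Processing labels: ETL and Aggregations; Complex Data Processing
Callout: Keep Data Alive Forever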
4. So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage
and processing (open source under the Apache license).
• Core Hadoop has two main components:
– Hadoop Distributed File System: self-healing high-bandwidth
clustered storage.
– MapReduce: fault-tolerant distributed processing.
• Key business values:
– Flexible – Store any data, Run any analysis (Mine First, Govern Later).
– Scalable – Start at 1TB/3-nodes then grow to petabytes/1000s of nodes.
– Affordable – Cost per TB at a fraction of traditional options.
– Open Source – No Lock-In, Rich Ecosystem, Large developer community.
– Broadly adopted – A large and active ecosystem, Proven to run at scale.
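The MapReduce component named above can be illustrated with a minimal single-process sketch. It is not Hadoop's API: `map_reduce`, `count_words`, and `sum_counts` are hypothetical names chosen for this example, and a real cluster distributes each phase across nodes and retries failed tasks.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Imitate the map, shuffle, and reduce phases in one process."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: gather every value emitted for the same key.
            groups[key].append(value)
    # Reduce phase: fold each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_counts(key, values):
    """Hypothetical reducer: add up the counts for one word."""
    return sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]
word_counts = map_reduce(lines, count_words, sum_counts)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, both phases can be split across many machines, which is what makes the model scale.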
5. The Main Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before data is loaded.
• An explicit load operation has to take place, which transforms the data into the DB's internal structure.
• New columns must be added explicitly before data for such columns can be loaded into the database.
• Read is Fast.
• Benefits: Standards/Governance.
Schema-on-Read (Hadoop):
• Data is simply copied to the file store; no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns.
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
• Load is Fast.
• Benefits: Flexibility/Agility.
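A small sketch may make the schema-on-read idea concrete. It assumes the raw records are JSON lines, and `make_deserializer` is a hypothetical stand-in for a SerDe rather than Hive's actual SerDe interface.

```python
import json

# Raw events are stored exactly as they arrived; nothing is transformed at load time.
raw_store = [
    '{"user": "u1", "url": "/home", "ms": 120}',
    '{"user": "u2", "url": "/search", "ms": 340, "referrer": "/home"}',
]


def make_deserializer(columns):
    """Return a read-time function that projects raw lines onto the given columns."""
    def deserialize(line):
        record = json.loads(line)
        # Fields missing from a record come back as None.
        return {name: record.get(name) for name in columns}
    return deserialize


# First version of the read-time schema exposes only two columns.
serde_v1 = make_deserializer(["user", "url"])
rows_v1 = [serde_v1(line) for line in raw_store]

# Updating the deserializer exposes "referrer" retroactively, with no reload.
serde_v2 = make_deserializer(["user", "url", "referrer"])
rows_v2 = [serde_v2(line) for line in raw_store]
```

The stored lines never change between the two versions; only the function applied at read time does, which is why the new column appears for data that was already loaded.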
6. What is Complex Data Processing?
1. Java MapReduce: Most flexibility and performance, but tedious
development cycle (the “assembly language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in
any programming language of your choice, but slightly lower
performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java.
4. Pig Latin: A high-level language out of Yahoo, suitable for batch
data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore
mapping files to their schemas and associated SerDes.
6. Oozie: A workflow engine whose jobs are defined in an XML process
definition language (PDL), enabling a workflow of jobs composed of
any of the above.
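As an illustration of the Streaming option in the list above, the sketch below follows the Streaming convention of passing tab-separated key/value text lines between phases, with the reducer receiving its input sorted by key. The click record format (`user url`) and the function names are assumptions made for this example; here the two phases are plain functions over lists so the flow can be run locally.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'url<TAB>1' line per click record (assumed format: 'user url')."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2:  # skip malformed records
            yield f"{fields[1]}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for url, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{url}\t{sum(int(count) for _, count in group)}"


clicks = ["u1 /home", "u2 /search", "u1 /search", "bad-record"]
# sorted() stands in for the sort the framework performs between the two phases.
totals = list(reducer(sorted(mapper(clicks))))
```

In an actual Streaming job the mapper and reducer would be separate scripts that read standard input and write standard output, which is what lets them be written in any language.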
7. What This Means For You: Agility
Up Front Design vs. Just in Time
8. What This Means For You: Innovation
Data Committee vs. Data Scientist
9. What This Means For You: Consolidation
Silos vs. Sharing
10. What This Means For You: Extract Value from Latent Data
Archive to Tape vs. Keep Data Alive
11. What This Means For You: Ability to Grow Fluidly
12. What This Means For You: Data Beats Algorithm
Smarter Algos vs. More Data
13. Where Does Hadoop Fit in the Enterprise Data Stack?
[Diagram: Hadoop in the enterprise data stack]
• Users: Data Scientists, Analysts, Business Users, System Operators, Data Architects, Customers
• Development Tools: IDEs
• Business Intelligence Tools: BI, Analytics; Enterprise Reporting
• Operations and integration: Cloudera Mgmt Suite, ETL Tools
• Downstream systems: Enterprise Data Warehouse, Low-Latency Serving Systems, Web Application
• Data sources: Logs, Files, Web Data, Relational Databases
14. Use The Right Tool For The Right Job
Relational Databases, use when:
• Interactive OLAP Analytics (<1sec)
• Multistep ACID Transactions
• 100% SQL Compliance
Hadoop, use when:
• Structured or Not (Flexibility)
• Scalability of Storage/Compute
• Complex Data Processing
15. Two Core Use Cases Common Across Many Industries
ADVANCED ANALYTICS         Industry          DATA PROCESSING
(application)                                (application)
Social Network Analysis    Web               Clickstream Sessionization
Content Optimization       Media             Clickstream Sessionization
Network Analytics          Telco             Mediation
Loyalty & Promotions       Retail            Data Factory
Fraud Analysis             Financial         Trade Reconciliation
Entity Analysis            Federal           SIGINT
Sequencing Analysis        Bioinformatics    Genome Mapping
Product Quality            Manufacturing     Mfg Process Tracking
16. CDH: Cloudera’s Distribution Including Apache Hadoop
The #1 commercial and non-commercial Apache Hadoop distribution.
Components:
• File System Mount: FUSE-DFS
• UI Framework/SDK: HUE
• Data Mining: APACHE MAHOUT
• Workflow: APACHE OOZIE
• Scheduling: APACHE OOZIE
• Metadata: APACHE HIVE
• Languages / Compilers: APACHE PIG, APACHE HIVE
• Fast Read/Write Access: APACHE HBASE
• Data Integration: APACHE FLUME, APACHE SQOOP
• Coordination: APACHE ZOOKEEPER
• Open Source – 100% Apache licensed, 100% Open Source, 100% Free, No Forks.
• Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA.
• Proven at Scale – Deployed at hundreds of enterprises across many industries.
• Integrated – All required component versions & dependencies are managed for you.
• Industry Standard – Existing RDBMS, ETL and BI systems work best with it.
• Many Form Factors – Public Cloud, Private Cloud, RHEL, Ubuntu, 32/64bit, etc.
17. CDH Integrates with Existing IT Infrastructure
BI/Analytics ETL Databases Cloud/OS Hardware
Cloudera’s Distribution including Apache Hadoop
18. What is Cloudera Enterprise?
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy:
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency & Control of Hadoop
• Leverage the Experience of Our Experts
CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
3 of the top 5 telecommunications, mobile services, defense &
intelligence, banking, media and retail organizations depend on Cloudera
EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production
19. SCM Express: Simplifies Installation and Configuration
Service & Configuration Manager (SCM) Express takes the complexity out
of deploying and configuring CDH.
• Provision a complete Hadoop stack in minutes
• Centrally manage system services through a user-friendly interface
• Manages services for up to 50 nodes
• FREE to download
KEY FEATURES
1. Automated, wizard-based installation of the complete Hadoop stack
2. Central, real-time dashboard for configuration management
3. Ability to configure the cluster while it’s running
4. Incorporates comprehensive validation and error checking
5. Automates the expansion of services to new nodes when they come online
20. What I Would Like You To Remember:
• The Key Benefits of the Apache Hadoop Data Platform:
– Agility/Flexibility (Enables Exploration/Innovation).
– Complex Data Processing (Any Language, Any Problem).
– Scalability of Storage/Compute (Freedom to Grow).
– Economical Active Archive (Keep All Your Data Alive).
• Cloudera Enterprise helps you:
– Lower the Cost of Management and Administration.
– Simplify and Accelerate Hadoop Deployment.
– Increase the Transparency & Control of Hadoop.
– Get Firm SLAs on Issue Resolution.