Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011

Apache Hadoop in the Enterprise

Cloudera, Inc.
Amr Awadallah, Founder, CTO, VP of Engineering.
aaa@cloudera.com, twitter: @awadallah

Microstrategy World – January 2011 – Las Vegas

Unstructured Data Explosion

Complex, Unstructured

Relational

• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
Source: IDC White Paper, sponsored by EMC: “As the Economy Contracts, the Digital Universe Expands,” May 2009.
Copyright © 2011, Cloudera, Inc. All Rights Reserved.

Dramatic Changes in Enterprise Data Needs

Data Explosion
• Any Type of Data
• From Many Sources
• Instrument Everything

Hard Problems
• Complex Analysis
• At Lowest Granularity
• Data Beats Algorithm


What is Hadoop?
• A scalable fault-tolerant distributed system for data storage and
processing (open source under the Apache license)

• Core Hadoop has two main components:
  • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
  • MapReduce: fault-tolerant distributed processing

• Key business values
• Flexible -> Store any data, run any analysis (Mine First, Govern Later)
• Affordable -> Cost per TB at a fraction of traditional options
• Broadly adopted -> A large and active ecosystem
• Proven at scale -> Several petabyte deployments in production today
• Open Source -> No Lock-In, low cost, large developer community.

Cloudera’s Data Operating System (CDH)

[Diagram: CDH component stack: Hue, Hue SDK, Oozie, Pig, Hive, Avro, Flume, Sqoop, HBase, Zookeeper]

• Open Source – 100% Apache licensed
• Simplified – Component versions & dependencies managed for you
• Reliable – Predictable release schedules, patched with fixes to improve stability
• Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32 or 64bit, etc.
• Integrated – All components & functions interoperate through standard APIs
• Supported – Founders, committers, contributors across all projects

Benefit #1: Agility

Schema-on-Write (RDBMS):
• Schema must be created before data is loaded
• Explicit load operation has to take place which transforms data to database internal structure
• New columns must be added explicitly before data for such columns can be loaded into the database
• Benefits: Read is Fast, Standards/Governance

Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no special transformation is needed
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
• Benefits: Load is Fast, Evolving Schemas/Agility
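
The schema-on-read idea can be illustrated with a minimal Python sketch. This is not Hive's SerDe API: the JSON-lines format, the sample records, and the function name `read_with_schema` are all assumptions made for illustration. It only shows that raw data is stored untouched and a schema is applied when the data is read.

```python
import json

# Raw records land in the file store untouched (hypothetical sample lines).
RAW_LINES = [
    '{"user": "alice", "amount": 30}',
    '{"user": "bob", "amount": 12, "country": "US"}',  # a new field starts flowing
]

def read_with_schema(lines, columns):
    """Parse each raw line at read time and project the requested columns.

    Stands in for the role the slide gives to a SerDe; a field that is
    missing from a record comes back as None.
    """
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)

# The first reader only knows about two columns.
v1 = list(read_with_schema(RAW_LINES, ["user", "amount"]))

# Updating the reader exposes "country" retroactively; nothing is reloaded.
v2 = list(read_with_schema(RAW_LINES, ["user", "amount", "country"]))
```

Only the reader changes between `v1` and `v2`; the stored lines are the same, which is the agility argument of the slide.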


Benefit #2: Data Consolidation

Complex Data
Documents SharePoint
Web feeds Sensor data
System logs EMB archives
Online forums Images/Video

Structured Data (“relational”)
CRM Inventory
Financials Sales records
Logistics HR records
Data Marts Web Profiles

A single data system to enable processing across
the universe of data types.

Benefit #3: Any Programming Language (Not Only SQL)
1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but with slightly lower performance and less flexibility.
3. Cascading: A thin Java library that sits on top of MapReduce and lets developers assemble complex processes.
4. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook that also includes a metastore mapping files to their schemas and associated SerDe.
6. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.

Benefit #4: Balancing Return on Investment (or Byte!)
• Return on Byte = value to be extracted from that byte
divided by the cost of storing that byte

• If ROB is < 1, the data will be buried in the tape wasteland; thus we need more economical active storage.
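
The Return on Byte ratio can be worked through in a few lines of Python. The dollar figures below are hypothetical and chosen only to show how a lower storage cost moves the same data from ROB < 1 to ROB > 1; the function name is also an assumption.

```python
def return_on_byte(value_extracted, storage_cost):
    """ROB = value extracted from the data / cost of storing it (same units)."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost

# Hypothetical figures: the same terabyte yields $50 of value either way.
value_per_tb = 50.0
expensive = return_on_byte(value_per_tb, 200.0)  # costly storage: 0.25, below 1
cheap = return_on_byte(value_per_tb, 20.0)       # economical storage: 2.5, above 1
```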

High ROB

Low ROB


Use The Right Tool For The Right Job

Relational Databases, use when:
• Interactive OLAP Analytics (<1sec)
• Multistep ACID Transactions
• SQL Compliance

Hadoop, use when:
• Structured or Not (Agility)
• Scalable Storage/Compute
• Complex Data Processing

Where Does Hadoop Fit in the Enterprise Data Stack?

[Diagram: Hadoop in the enterprise data stack. Labels: Data Scientists, Analysts, Business Users, System Administrators, Data Architects, Users; IDEs, BI, Analytics, Enterprise Reporting, Cloudera Mgmt Apps, Enterprise Data Warehouse, Low-Latency Serving Systems, Web Application; data sources: Logs, Files, Web Data, Relational Databases]

Apache Hive Features

• A subset of SQL covering the most common statements
• JDBC/ODBC support
• Agile data types: Array, Map, Struct, and JSON objects
• Pluggable SerDe system to work on unstructured files directly.
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• Partitions and Buckets (for performance optimization)
• In The Works: Indices, Columnar Storage, Views, Microstrategy
compatibility, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive

Broad Adoption in Key Verticals
Financial Services
• Example application: Risk management: “Examine purchase behavior across debit and credit properties to better identify high-risk customers.”
• Stakeholders: Risk Analysts

Telecom
• Example application: BSS: “Analyze calling patterns among users and current capacity to forecast traffic growth and locate new towers.”
• Stakeholders: Research

Retail
• Example application: Brand Equity: “Monitor customer and product data recorded across internal & external sources to trend brand valuation.”
• Stakeholders: Insight Team

Government
• Example application: Traffic Analysis: “Use multimedia data from various sources to build an actionable graph of relationships among targets.”
• Stakeholders: Intelligence

Stakeholders also listed: IT: Operations, IT: Data Engineering


Customers


How are Customers Using Cloudera?
Answering Questions that Were Impossible to Ask Before
• Analyze search terms and subsequent user purchase decisions to tune search results, increase conversion rates
• Digest long-term historical trade data to identify fraudulent activity and build real-time fraud prevention
• Model site visitor behavior with analytics that deliver better recommendations for new purchases
• Continually refine predictive models for advertising response rates to deliver more precisely targeted advertisements
• Replace expensive legacy ETL system with more flexible, cheaper infrastructure that is 20 times faster
• Correlate educational outcomes with programs and student histories to improve results
• Big Bank: Examine customer behavior to improve loan risk scoring
More: http://www.cloudera.com/company/press-center/hadoop-world-nyc/

Cloudera Offerings
Facilitating enterprise adoption of Hadoop

Software Services Training


Cloudera Enterprise
Enterprise Support and Management Applications

• Improves conformance to important IT SLAs, policies and procedures
• Lowers the cost of management and administration
• Increases reliability and consistency of the platform
• Certified integration with RDBMS, ETL, BI, Server, and Cloud Systems

Integrating with Existing IT Infrastructure

BI/Analytics ETL RDBMS Cloud/OS Hardware


MicroStrategy (for interactive Dashboards)


Informatica (for Extract-Transform-Load, aka ETL)


Summary

• Cloudera’s Data OS (CDH) enables:
  • Data Agility (Evolving Schemas)
  • Consolidation (Structured or Not)
  • Complex Data Processing (Any Language)
  • Economical Storage (Enable Return-on-Byte > 1)
• Cloudera Enterprise enables:
  • Conformance to important IT SLAs, policies and procedures
  • Lower cost of management and administration
  • Increased reliability and consistency
  • Certified integration with existing IT infrastructure


Contact Information and Free Hadoop Book

Amr Awadallah
CTO, Cloudera, Inc.
aaa@cloudera.com
650-644-3921
twitter.com/awadallah
twitter.com/cloudera


Appendix


Cloudera Overview
Hadoop…
• Jeff Hammerbacher, Chief Scientist
• Amr Awadallah, CTO, VP Engineering
• Doug Cutting, Chief Architect
• Mike Olson, CEO

… meets enterprise
• Omer Trajman, VP, Customer Solutions
• John Kreisa, VP, Marketing
• Charles Zedlewski, VP, Product Management
• Ed Albanese, Head of Business Development

Investors: Accel Partners, Greylock Partners, Meritech Capital Partners
Product category: Data Management
Business model: Cloudera offers Software, Support, Training, and Professional Services
Employees: 70+
Customers: 75+
Headquarters: Palo Alto, California
Elevator pitch: The leading provider of Apache Hadoop-based software and services for the enterprise
Vision: We enable organizations to profit from all of their data


Why CDH (Cloudera Distribution for Hadoop)?
Features and Benefits
• It’s packaged: Much easier for users to install CDH than any other form of Hadoop.
• It’s patched: This makes CDH more stable and secure than just downloading an Apache branch.
• It’s proven: Thousands of organizations already use CDH today, so risk is lower.
• It’s highly functional: CDH will cover more use cases and users will be more productive than if they were just using core Hadoop.
• It’s integrated: Save time (of piecing a system together yourself) and lower risk (of choosing the wrong combination of versions or patches).
• It’s the accepted standard: More of your preexisting investments in RDBMS, ETL and BI work best with CDH.
• It’s supported: CDH is one of only two distributions that have a commercial entity standing behind them.
• It’s 100% Apache licensed: Investment in this technology is insured.

Hadoop Timeline

Events along the 2002 to 2009 timeline, in approximate order:
• Doug Cutting & Mike Cafarella started working on Nutch
• Google publishes GFS & MapReduce papers
• Cutting adds DFS & MapReduce support to Nutch
• Yahoo! hires Cutting, Hadoop spins out of Nutch
• NY Times converts 4TB of image archives over 100 EC2s
• Web-scale deployments at Y!, Facebook, Last.fm
• Fastest sort of a TB, 3.5mins over 910 nodes
• Cloudera Founded
• Fastest sort of a TB, 62secs over 1,460 nodes
• Sorted a PB in 16.25hours over 3,658 nodes
• Cloudera hires Cutting
• Hadoop Summit 2009, 750 attendees


10 Common Hadoop-able Problems

1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data “sandbox”


Case Studies: Hadoop World 2009

• VISA: Large Scale Transaction Analysis
• JP Morgan Chase: Data Processing for Financial Services
• China Mobile: Data Mining Platform for Telecom Industry
• Rackspace: Cross Data Center Log Processing
• Booz Allen Hamilton: Protein Alignment using Hadoop
• eHarmony: Matchmaking in the Hadoop Cloud
• General Sentiment: Understanding Natural Language
• Yahoo!: Social Graph Analysis
• Visible Technologies: Real-Time Business Intelligence

Slides and Videos: http://www.cloudera.com/hadoop-world-nyc


Case Studies: Hadoop World 2010

• eBay: Hadoop at eBay
• Twitter: The Hadoop Ecosystem at Twitter
• Yale University: MapReduce and Parallel Database Systems
• General Electric: Sentiment Analysis powered by Hadoop
• Facebook: HBase in Production
• AOL: AOL’s Data Layer
• Raytheon: SHARD: Storing and Querying Large-Scale Data
• StumbleUpon: Mixing Real-Time and Batch Processing

More Info: http://www.cloudera.com/company/press-center/hadoop-world-nyc/


Hadoop Design Axioms

1. System Shall Manage and Heal Itself

2. Performance Shall Scale Linearly

3. Compute Should Move to Data

4. Simple Core, Modular and Extensible


HDFS: Hadoop Distributed File System
Block Size = 64MB
Replication Factor = 3

Cost/GB is a few ¢/month vs $/month
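
The block size and replication factor above translate into simple arithmetic, sketched here in Python. The 1,000 MB file size and the function name `hdfs_footprint` are assumptions for illustration; the sketch also assumes the last block occupies only the bytes it actually holds.

```python
import math

BLOCK_SIZE_MB = 64       # from the slide
REPLICATION_FACTOR = 3   # from the slide

def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    # The final block is stored at its actual size, so raw usage is
    # simply the file size multiplied by the replication factor.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb

# Hypothetical 1,000 MB file: 16 blocks, 48 block replicas, 3,000 MB raw.
blocks, replicas, raw_mb = hdfs_footprint(1000)
```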

MapReduce: Distributed Processing


MapReduce Example for Word Count
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
[Diagram: input splits 1..N of (docid, text), e.g. “To Be Or Not To Be?”, feed Map 1..M, which emit (words, counts) such as (Be, 5), (Be, 12), (Be, 7), (Be, 6). The shuffle delivers (sorted words, counts) to Reduce 1..R, each of which writes an output file of (sorted words, sum of counts), e.g. (Be, 30).]

Map(in_key, in_value) => list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) => out_value(s)
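
The map, shuffle, and reduce phases of word count can be sketched in a single Python process. This is an in-memory illustration under stated assumptions, not Hadoop's API: the function names, the lowercasing, and the punctuation stripping are all choices made for this example.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    pairs = []
    for token in text.split():
        word = token.strip('?.,!"').lower()
        if word:  # skip tokens that were pure punctuation
            pairs.append((word, 1))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key and sort the keys, like the shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)

def word_count(docs):
    """Run map over every document, shuffle, then reduce each key."""
    intermediate = []
    for doc_id, text in docs.items():
        intermediate.extend(map_fn(doc_id, text))
    return dict(reduce_fn(word, counts) for word, counts in shuffle(intermediate))

counts = word_count({"doc1": "To Be Or Not To Be?"})
```

In a real cluster the map calls run in parallel on separate splits and the shuffle moves data between nodes; here everything runs sequentially to keep the data flow visible.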


Hadoop High-Level Architecture
• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
• Name Node: Maintains mapping of file blocks to data node slaves
• Job Tracker: Schedules jobs across task tracker slaves
• Data Node: Stores and serves blocks of data
• Task Tracker: Runs tasks (work units) within a job
• Data Node and Task Tracker share a physical node


Hive vs Pig Example (count distinct values > 0)

• Hive syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;

• Pig syntax:
mytable = LOAD 'myfile' AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT mytable;
grouped = GROUP mytable ALL;
result = FOREACH grouped GENERATE COUNT(mytable);
DUMP result;
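
For readers unfamiliar with either syntax, the same data flow (project one column, keep values > 0, deduplicate, count) can be traced step by step in plain Python. The sample rows are hypothetical stand-ins for `mytable`.

```python
# Hypothetical rows standing in for mytable: (col1, col2, col3)
rows = [(3, "a", 1.0), (-1, "b", 2.0), (3, "c", 3.0), (7, "d", 4.0), (0, "e", 5.0)]

col1_values = [row[0] for row in rows]        # project col1
positive = [v for v in col1_values if v > 0]  # keep values > 0
distinct = set(positive)                      # drop duplicates
distinct_count = len(distinct)                # count what remains
```

With these rows the positive values are 3, 3, and 7, so the distinct count is 2.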


Hive Agile Data Types

• STRUCTS:
• SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
• SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
• SELECT mytable.mycolumn[5] FROM …
• JSON:
• SELECT get_json_object(mycolumn, objpath

