Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
1. Apache Hadoop in the Enterprise
Cloudera, Inc.
Amr Awadallah, Founder, CTO, VP of Engineering.
aaa@cloudera.com, twitter: @awadallah
Microstrategy World - January 2011 - Las Vegas
2. Unstructured Data Explosion
Complex, Unstructured
Relational
• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year
Source: IDC White Paper, sponsored by EMC: "As the Economy Contracts, the Digital Universe Expands," May 2009.
Copyright © 2011, Cloudera, Inc. All Rights Reserved.
3. Dramatic Changes in Enterprise Data Needs
Data Explosion
• Any Type of Data
• From Many Sources
• Instrument Everything
Hard Problems
• Complex Analysis
• At Lowest Granularity
• Data Beats Algorithm
4. What is Hadoop?
• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Core Hadoop has two main components:
  • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
  • MapReduce: fault-tolerant distributed processing
• Key business values:
  • Flexible -> Store any data, run any analysis (Mine First, Govern Later)
  • Affordable -> Cost per TB at a fraction of traditional options
  • Broadly adopted -> A large and active ecosystem
  • Proven at scale -> Several petabyte deployments in production today
  • Open Source -> No Lock-In, low cost, large developer community
5. Cloudera's Data Operating System (CDH)
[Diagram: CDH component stack. Labels: Hue, Hue SDK, Oozie, Pig, Hive, Avro, Flume, Sqoop, HBase, Zookeeper]
• Open Source: 100% Apache licensed
• Simplified: Component versions & dependencies managed for you
• Reliable: Predictable release schedules, patched with fixes to improve stability
• Many Form Factors: Public Cloud, Private Cloud, Ubuntu, RHEL, 32 or 64bit, etc.
• Integrated: All components & functions interoperate through standard APIs
• Supported: Founders, committers, contributors across all projects
6. Benefit #1: Agility
Schema-on-Write (RDBMS):
• Schema must be created before data is loaded
• Explicit load operation has to take place which transforms data to database internal structure
• New columns must be added explicitly before data for such columns can be loaded into the database
• Read is Fast
• Benefits: Standards/Governance

Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no special transformation is needed
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them
• Load is Fast
• Benefits: Evolving Schemas/Agility
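The schema-on-read side of this comparison can be illustrated with a minimal Python sketch. It is not Hive's SerDe API: the record layout, the `read_with_schema` helper, and the column names are hypothetical, chosen only to show that raw data is stored untouched and that columns are extracted when the data is read.

```python
import json

# Raw records are stored exactly as they arrived; no schema is enforced at load time.
# (Hypothetical sample data for illustration.)
RAW_LINES = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 5, "country": "US"}',
]

def read_with_schema(lines, columns):
    """Apply a schema at read time: parse each raw line and project the requested columns."""
    for line in lines:
        record = json.loads(line)
        # Fields missing from an older record come back as None.
        yield tuple(record.get(col) for col in columns)

# Original "schema": two columns.
v1 = list(read_with_schema(RAW_LINES, ["user", "clicks"]))

# Updated "schema": the new column appears retroactively, with no reload of the data.
v2 = list(read_with_schema(RAW_LINES, ["user", "clicks", "country"]))
```

Only the reader changes between `v1` and `v2`; the stored lines are never rewritten, which is the agility argument the slide makes.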
7. Benefit #2: Data Consolidation
Complex Data
Documents SharePoint
Web feeds Sensor data
System logs EMB archives
Online forums Images/Video
Structured Data ("relational")
CRM Inventory
Financials Sales records
Logistics HR records
Data Marts Web Profiles
A single data system to enable processing across
the universe of data types.
8. Benefit #3: Any Programming Language (Not Only SQL)
1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the "assembly language" of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but with slightly lower performance and less flexibility.
3. Cascading: A thin Java library that sits on top of MapReduce; it lets developers assemble complex processes.
4. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook; it also includes a metastore mapping files to their schemas and associated SerDe.
6. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.
9. Benefit #4: Balancing Return on Investment (or Byte!)
• Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte
• If ROB is < 1 then the data will be buried into tape wasteland, thus we need more economical active storage.
High ROB
Low ROB
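The Return on Byte ratio defined on this slide can be worked through with a small Python sketch. The dollar figures are hypothetical and only show how cheaper active storage can move the same data from ROB < 1 to ROB > 1.

```python
def return_on_byte(value_extracted, storage_cost):
    """Return on Byte = value extracted from the data / cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost

# Hypothetical figures (same currency, same time period), for illustration only.
log_value = 40.0               # value expected from analyzing a data set
cost_premium_storage = 100.0   # cost of keeping it on expensive storage
cost_cheap_storage = 10.0      # cost of keeping it on economical active storage

rob_premium = return_on_byte(log_value, cost_premium_storage)  # below 1: not worth keeping online
rob_cheap = return_on_byte(log_value, cost_cheap_storage)      # above 1: worth keeping online
```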
10. Use The Right Tool For The Right Job
Relational Databases, use when:
• Interactive OLAP Analytics (<1sec)
• Multistep ACID Transactions
• SQL Compliance

Hadoop, use when:
• Structured or Not (Agility)
• Scalable Storage/Compute
• Complex Data Processing
11. Where Does Hadoop Fit in the Enterprise Data Stack?
[Diagram: Hadoop in the enterprise data stack. Labels recovered from the slide: Data Scientists, Analysts, Business Users, System Administrators, Data Architects, Users; IDEs; Enterprise BI, Analytics, Reporting; Cloudera Mgmt Apps; Enterprise Data Warehouse; Low-Latency Serving Systems; Web Application; Relational Databases; Logs; Files; Web Data]
12. Apache Hive Features
• A subset of SQL covering the most common statements
• JDBC/ODBC support
• Agile data types: Array, Map, Struct, and JSON objects
• Pluggable SerDe system to work on unstructured files directly
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• Partitions and Buckets (for performance optimization)
• In The Works: Indices, Columnar Storage, Views, Microstrategy compatibility, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive
13. Broad Adoption in Key Verticals
Financial Services
• Example application: Risk management: "Examine purchase behavior across debit and credit properties to better identify high-risk customers."
• Stakeholders: Risk Analysts

Telecom
• Example application: BSS: "Analyze calling patterns among users and current capacity to forecast traffic growth and locate new towers."
• Stakeholders: Research

Retail
• Example application: Brand Equity: "Monitor customer and product data recorded across internal & external sources to trend brand valuation."
• Stakeholders: Insight Team

Government
• Example application: Traffic Analysis: "Use multimedia data from various sources to build an actionable graph of relationships among targets."
• Stakeholders: Intelligence

Stakeholders across verticals: IT: Operations, IT: Data Engineering
14. Customers
15. How are Customers Using Cloudera?
Answering Questions that Were Impossible to Ask Before
Analyze search terms and subsequent user purchase decisions
to tune search results, increase conversion rates
Digest long-term historical trade data to identify fraudulent
activity and build real-time fraud prevention
Model site visitor behavior with analytics that deliver better
recommendations for new purchases
Continually refine predictive models for advertising response
rates to deliver more precisely targeted advertisements
Replace expensive legacy ETL system with more flexible,
cheaper infrastructure that is 20 times faster
Correlate educational outcomes with programs and student
histories to improve results
Big Bank: Examine customer behavior to improve loan risk scoring
More: http://www.cloudera.com/company/press-center/hadoop-world-nyc/
17. Cloudera Enterprise
Enterprise Support and Management Applications
• Improves conformance to important IT SLAs, policies and procedures
• Lowers the cost of management and administration
• Increases reliability and consistency of the platform
• Certified integration with RDBMS, ETL, BI, Server, and Cloud Systems
18. Integrating with Existing IT Infrastructure
BI/Analytics ETL RDBMS Cloud/OS Hardware
21. Summary
• Cloudera's Data OS (CDH) enables:
  • Data Agility (Evolving Schemas)
  • Consolidation (Structured or Not)
  • Complex Data Processing (Any Language)
  • Economical Storage (Enable Return-on-Byte > 1)
• Cloudera Enterprise enables:
  • Conformance to important IT SLAs, policies and procedures
  • Lower cost of management and administration
  • Increased reliability and consistency
  • Certified integration with existing IT infrastructure
22. Contact Information and Free Hadoop Book
Amr Awadallah
CTO, Cloudera, Inc.
aaa@cloudera.com
650-644-3921
twitter.com/awadallah
twitter.com/cloudera
24. Appendix
25. Cloudera Overview
Hadoop… meets enterprise
Jeff Hammerbacher, Chief Scientist
Amr Awadallah, CTO, VP Engineering
Doug Cutting, Chief Architect
Mike Olson, CEO
Omer Trajman, VP, Customer Solutions
John Kreisa, VP, Marketing
Charles Zedlewski, VP, Product Management
Ed Albanese, Head of Business Development
Investors: Accel Partners, Greylock Partners, Meritech Capital Partners
Product category: Data Management
Business model: Cloudera offers Software, Support, Training, and Professional Services
Employees: 70+
Customers: 75+
Headquarters: Palo Alto, California
Elevator pitch: The leading provider of Apache Hadoop-based software and services for the enterprise
Vision: We enable organizations to profit from all of their data
26. Why CDH (Cloudera Distribution for Hadoop)?
Features and their benefits:
• It's packaged: Much easier for users to install CDH than any other form of Hadoop.
• It's patched: This makes CDH more stable and secure than just downloading an Apache branch.
• It's proven: Thousands of organizations already use CDH today so risk is lower.
• It's highly functional: CDH will cover more use cases and users will be more productive than if they were just using core Hadoop.
• It's integrated: Save time (of piecing a system together yourself) and lower risk (of choosing the wrong combination of versions or patches).
• It's the accepted standard: More of your preexisting investments in RDBMS, ETL and BI work best with CDH.
• It's supported: CDH is one of only two distributions that has a commercial entity standing behind it.
• It's 100% Apache licensed: Investment in this technology is insured.
27. Hadoop Timeline
[Timeline spanning 2002 to 2009; events as recovered from the slide, listed roughly in timeline order]
• Doug Cutting & Mike Cafarella started working on Nutch
• Google publishes GFS & MapReduce papers
• Cutting adds DFS & MapReduce support to Nutch
• Yahoo! hires Cutting, Hadoop spins out of Nutch
• NY Times converts 4TB of image archives over 100 EC2s
• Web-scale deployments at Y!, Facebook, Last.fm
• Fastest sort of a TB, 3.5mins over 910 nodes
• Cloudera Founded
• Fastest sort of a TB, 62secs over 1,460 nodes
• Sorted a PB in 16.25hours over 3,658 nodes
• Cloudera hires Cutting
• Hadoop Summit 2009, 750 attendees
28. 10 Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data "sandbox"
29. Case Studies: Hadoop World 2009
• VISA: Large Scale Transaction Analysis
• JP Morgan Chase: Data Processing for Financial Services
• China Mobile: Data Mining Platform for Telecom Industry
• Rackspace: Cross Data Center Log Processing
• Booz Allen Hamilton: Protein Alignment using Hadoop
• eHarmony: Matchmaking in the Hadoop Cloud
• General Sentiment: Understanding Natural Language
• Yahoo!: Social Graph Analysis
• Visible Technologies: Real-Time Business Intelligence
Slides and Videos: http://www.cloudera.com/hadoop-world-nyc
30. Case Studies: Hadoop World 2010
• eBay: Hadoop at eBay
• Twitter: The Hadoop Ecosystem at Twitter
• Yale University: MapReduce and Parallel Database Systems
• General Electric: Sentiment Analysis powered by Hadoop
• Facebook: HBase in Production
• AOL: AOL's Data Layer
• Raytheon: SHARD: Storing and Querying Large-Scale Data
• StumbleUpon: Mixing Real-Time and Batch Processing
More Info: http://www.cloudera.com/company/press-center/hadoop-world-nyc/
31. Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
32. HDFS: Hadoop Distributed File System
Block Size = 64MB
Replication Factor = 3
Cost/GB is a few
¢/month vs $/month
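The HDFS block size and replication factor quoted on this slide translate into simple arithmetic. The Python sketch below uses the slide's two values; the `hdfs_footprint` helper and the 1,000 MB example file are hypothetical, and the sketch assumes the last block is not padded, so raw storage is the file size times the replication factor.

```python
import math

BLOCK_SIZE_MB = 64        # block size quoted on the slide
REPLICATION_FACTOR = 3    # replication factor quoted on the slide

def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB stored) for a single file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)   # the final block may be partial
    replicas = blocks * REPLICATION_FACTOR             # each block is stored three times
    raw_mb = file_size_mb * REPLICATION_FACTOR         # assumes a partial block is not padded
    return blocks, replicas, raw_mb

# Hypothetical 1,000 MB file.
blocks, replicas, raw_mb = hdfs_footprint(1000)
```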
34. MapReduce Example for Word Count
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
[Diagram: Splits 1..N of (docid, text) feed Map 1..M, which emit (words, counts). The Shuffle delivers (sorted words, counts) to Reduce 1..R, and each reducer writes an Output File of (sorted words, sum of counts). Example: for the text "To Be Or Not To Be?", partial counts for "Be" of 5, 12, 7 and 6 reduce to "Be, 30".]
Map(in_key, in_value) => list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) => out_value(s)
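The Map and Reduce signatures above can be exercised with a single-process Python sketch of word count. It does not use any Hadoop API; `map_fn`, `shuffle`, and `reduce_fn` are hypothetical names that stand in for the map, shuffle/sort, and reduce stages of the pipeline shown on this slide.

```python
from collections import defaultdict
import re

def map_fn(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    return [(word.lower(), 1) for word in re.findall(r"[A-Za-z']+", text)]

def shuffle(pairs):
    """Group intermediate values by key and sort the keys, like the shuffle/sort stage."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)

# One input document, using the slide's example text.
docs = {"doc1": "To Be Or Not To Be?"}

intermediate = [pair for doc_id, text in docs.items() for pair in map_fn(doc_id, text)]
word_counts = [reduce_fn(word, counts) for word, counts in shuffle(intermediate)]
```

In a real cluster the map calls run in parallel over splits and the shuffle moves data between nodes; here all three stages run in one process purely to show the data flow.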
35. Hadoop High-Level Architecture
• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
• Name Node: Maintains mapping of file blocks to data node slaves
• Job Tracker: Schedules jobs across task tracker slaves
• Data Node: Stores and serves blocks of data
• Task Tracker: Runs tasks (work units) within a job
• Data Node and Task Tracker share a physical node
36. Hive vs Pig Example (count distinct values > 0)
• Hive syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig syntax:
mytable = LOAD 'myfile' AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT mytable;
mytable = GROUP mytable ALL;
mytable = FOREACH mytable GENERATE COUNT(mytable);
DUMP mytable;
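For comparison, the same computation (count the distinct values of col1 that are greater than 0) can be sketched in plain Python. The sample rows are hypothetical; the comments map each step to the corresponding Pig operator.

```python
# Hypothetical rows of (col1, col2, col3), for illustration only.
rows = [(3, "x", 1.0), (0, "y", 2.0), (3, "z", 3.0), (-1, "w", 4.0), (7, "v", 5.0)]

col1 = (row[0] for row in rows)           # FOREACH ... GENERATE col1
positive = (v for v in col1 if v > 0)     # FILTER ... BY col1 > 0
distinct_count = len(set(positive))       # DISTINCT, then COUNT
```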
37. Hive Agile Data Types
• STRUCTS:
  • SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
  • SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
  • SELECT mytable.mycolumn[5] FROM …
• JSON:
  • SELECT get_json_object(mycolumn, objpath) FROM …
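As a rough analogue only, the four access patterns on this slide correspond to ordinary nested-data lookups. The Python sketch below uses hypothetical column names and values; it mirrors the shape of each Hive expression and does not call Hive.

```python
import json

# One hypothetical row with a struct, a map, an array, and a JSON string column.
row = {
    "mycolumn_struct": {"myfield": "abc"},
    "mycolumn_map": {"mykey": 42},
    "mycolumn_array": [10, 11, 12, 13, 14, 15],
    "mycolumn_json": '{"owner": {"name": "amy"}}',
}

struct_value = row["mycolumn_struct"]["myfield"]   # like mycolumn.myfield
map_value = row["mycolumn_map"]["mykey"]           # like mycolumn[mykey]
array_value = row["mycolumn_array"][5]             # like mycolumn[5]

# like get_json_object(mycolumn, objpath): parse the string, then follow a path
json_value = json.loads(row["mycolumn_json"])["owner"]["name"]
```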