Yahoo & Hadoop
- 1. YAHOO & HADOOP
USING AND IMPROVING APACHE HADOOP AT YAHOO!
Eric Baldeschwieler
VP, Hadoop Software
1 © 2011 IBM Corporation
- 2. AGENDA
Brief Overview
Hadoop @ Yahoo!
Hadoop Momentum
The Future of Hadoop
- 3. what’s happening
- Big Data is here!
- unstructured data
- petabyte scale
- operationally critical
Flickr : sub_lime79
- 4. turning data into insights
- machine learning
- logic regression
- time series
- content clustering algorithms
- ad inventory modeling
- user interest prediction
- factorization models
Flickr : NASA Goddard Photo and Video
- 5. making YAHOO relevant
Flickr : ogimogi
- 6. hadoop: Powering Yahoo!
science + big data + insight = personal relevance = VALUE
Flickr : DDFic
- 7. WHAT IS HADOOP?
Pig, Hive, MapReduce, HDFS
Commodity
• Computers
• Network
Focus on
• Simplicity
• Redundancy
• Scale
• Availability
Transforms commodity equipment into a service that:
• HDFS – Stores petabytes of data reliably
• Map-Reduce – Allows huge distributed computations
Key Attributes
• Redundant and reliable – Doesn’t stop or lose data even as hardware fails
• Easy to program – Our rocket scientists use it directly!
• Very powerful – Allows the development of big data algorithms & tools
• Batch processing centric
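The Map-Reduce model named on this slide can be illustrated with a minimal, single-process sketch. This is not Hadoop code and uses no Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration. It only shows the shape of the computation that a framework like Hadoop would distribute across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key.

    In a real cluster the framework does this across machines;
    here it is a plain in-memory grouping (illustrative assumption).
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is here", "big data at petabyte scale"]
    print(word_count(sample))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can run in parallel on separate machines, which is what makes the model a fit for commodity clusters.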
- 8. WHAT HADOOP ISN’T
A replacement for relational and data
warehouse systems
A transactional / online / serving system
A low latency or streaming solution
- 9. HADOOP IN THE ENTERPRISE
Business Intelligence Applications
HADOOP CLUSTER(S), RDBMS, EDW, Data Marts
• Interactions: Semi-Structured or Un-Structured Data (Web Logs, Server Logs, Social Media, etc.)
• Transactions, Structured Data: Business Applications
- 11. HADOOP @ YAHOO!
“Where Science meets Data”
HADOOP CLUSTERS
• Tens of thousands of servers
• 10s of Petabytes
• Terabytes / Day (compressed)
Inputs
• DIMENSIONAL DATA
• CONTENT
• DATA PIPELINES
PRODUCTS
• Data Analytics
• Content Optimization
• Content Enrichment
• Yahoo! Mail Anti-Spam
• Advertising Products
• Ad Optimization
• Ad Selection
• Big Data Processing & ETL
APPLIED SCIENCE
• User Interest Prediction
• Ad inventory prediction
• Machine learning - search ranking
• Machine learning - ad targeting
• Machine learning - spam filtering
- 12. FROM PROJECT TO CORE PLATFORM
• 40K+ Servers
• 170 PB Storage
• 5M+ Monthly Jobs
[Chart: Thousands of Servers and Petabytes by year, 2006 to 2010]
- 13. HADOOP POWERS THE YAHOO! NETWORK
• advertising optimization
• data analytics
• machine learning
• search ranking
• advertising data systems
• Yahoo! Mail anti-spam
• audience, ad and search pipelines
• ad selection
• Yahoo! Homepage
• Content Optimization
• ad inventory prediction
• user interest prediction
- 14. CASE STUDY YAHOO! HOMEPAGE
Personalized for each visitor
Result: twice the engagement
• Recommended links: +79% clicks vs. randomly selected
• News Interests: +160% clicks vs. one size fits all
• Top Searches: +43% clicks vs. editor selected
- 15. CASE STUDY YAHOO! HOMEPAGE
• Serving Maps
• Users - Interests
• Five Minute Production
• Weekly Categorization models
SCIENCE HADOOP CLUSTER
» Machine learning to build ever better categorization models
» USER BEHAVIOR in, CATEGORIZATION MODELS out (weekly)
PRODUCTION HADOOP CLUSTER
» Identify user interests using Categorization models
» USER BEHAVIOR in, SERVING MAPS out (every 5 minutes)
SERVING SYSTEMS
» Build customized home pages with latest data (thousands / second)
» ENGAGED USERS
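The two cadences on this slide (weekly model building, five-minute serving-map refresh, per-request lookup) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the click-event shape, and the majority-vote "model" are hypothetical stand-ins, not the machine learning Yahoo! actually ran, which the slide does not describe.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def train_categorization_model(
    labeled_clicks: Iterable[Tuple[str, str]]
) -> Dict[str, str]:
    """Weekly step (science cluster): learn keyword -> category.

    Majority vote over (keyword, category) pairs is a stand-in
    for the real machine learning (illustrative assumption).
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for keyword, category in labeled_clicks:
        votes[keyword][category] += 1
    return {kw: counts.most_common(1)[0][0] for kw, counts in votes.items()}


def build_serving_map(
    model: Dict[str, str],
    recent_clicks: Iterable[Tuple[str, str]],
    top_n: int = 2,
) -> Dict[str, List[str]]:
    """Five-minute step (production cluster): user -> top interests."""
    interests: Dict[str, Counter] = defaultdict(Counter)
    for user, keyword in recent_clicks:
        category = model.get(keyword)
        if category is not None:  # skip keywords the model has never seen
            interests[user][category] += 1
    return {
        user: [category for category, _ in counts.most_common(top_n)]
        for user, counts in interests.items()
    }


def serve_homepage(
    serving_map: Dict[str, List[str]],
    user: str,
    default: Tuple[str, ...] = ("top stories",),
) -> List[str]:
    """Per-request step (serving systems): a cheap dictionary lookup."""
    return list(serving_map.get(user, default))


if __name__ == "__main__":
    model = train_categorization_model(
        [("nba", "sports"), ("nba", "sports"), ("nba", "news"), ("ipo", "finance")]
    )
    serving_map = build_serving_map(
        model, [("u1", "nba"), ("u1", "nba"), ("u1", "ipo"), ("u2", "unknown")]
    )
    print(serve_homepage(serving_map, "u1"))
    print(serve_homepage(serving_map, "u2"))
```

The design point the sketch tries to capture is the split of work: the expensive steps run in batch on their own schedules, so the request path only reads a precomputed map.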
- 16. CASE STUDY YAHOO! MAIL
Enabling quick response in the spam arms race
• 450M mail boxes
• 5B+ deliveries/day
SCIENCE
• Antispam models retrained every few hours on Hadoop
PRODUCTION
“40% less spam than Hotmail and 55% less spam than Gmail”
- 17. YAHOO! & APACHE HADOOP
Yahoo! has contributed 70+% of Apache Hadoop code to date
Hadoop is not our business, but Hadoop is key to our business
• Yahoo! benefits from the open source ecosystem around Hadoop
• Hadoop drives revenue at Yahoo! by making our core products better
We need Hadoop to be rock solid
• We invest heavily in core Hadoop development
• We focus on scalability, reliability, availability
We fix bugs before you see them
• We run very large clusters
• We have a large QA effort
• We run a huge variety of workloads
We are good Apache Hadoop citizens
• We contribute our work to Apache
• We share the exact code we run
- 18. HADOOP MOMENTUM
- 19. HADOOP IS GOING MAINSTREAM
2007 2008 2009 2010
The Datagraph Blog
- 20. THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
Apache Hadoop
Enhance Hadoop Ecosystem
and other Early Adopters
• Scale and productize Hadoop
Orgs with Internet Scale Problems
• Add tools / frameworks, enhance Hadoop
Service Providers
• Grow ecosystem - Training, support, enhancements
Mainstream / Enterprise adoption
• Drive further development, enhancements
Virtuous Circle!
• Investment -> Adoption
• Adoption -> Investment
- 22. MAKING HADOOP ENTERPRISE-READY: WHAT’S NEXT
Hadoop is far from “done”
• Current implementation is showing its age
• Need to address several deficiencies in scalability, flexibility, ease of use & performance
Yahoo! is working on the Next Generation of Hadoop
• MapReduce: Rewrite to improve performance; pluggable support for new programming models
• HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS
Apache should remain the hub of the Hadoop ecosystem
• Yahoo! contributes all Hadoop changes back to Apache Hadoop
• Everyone benefits from a shared neutral foundation