Architecting the Future of Big Data & Search - Eric Baldeschwieler
1. Architecting the Future of Big Data
and Search
Eric Baldeschwieler, Hortonworks
e14@hortonworks.com, 19 October 2011
2. What I Will Cover
• Architecting the Future of Big Data and Search
  - Lucene, a technology for managing big data
  - Hadoop, a technology built for search
  - Could they work together?
• Topics:
  - What is Apache Hadoop?
  - History and use cases
  - Current state
  - Where Hadoop is going
  - Investigating Apache Hadoop and Lucene
4. Apache Hadoop is…
A set of open source projects owned by the Apache Foundation that transforms commodity computers and networks into a distributed service
• HDFS - Stores petabytes of data reliably
• MapReduce - Allows huge distributed computations
Key attributes:
• Reliable and redundant - Doesn't slow down or lose data even as hardware fails
• Simple and flexible APIs - Our rocket scientists use it directly!
• Very powerful - Harnesses huge clusters, supports best-of-breed analytics
• Batch processing-centric - Hence its great simplicity and speed; not a fit for all use cases
7. MapReduce
• MapReduce is a distributed computing programming model
• It works like a Unix pipeline:
  - cat input | grep | sort | uniq -c > output
  - Input | Map | Shuffle & Sort | Reduce | Output
• Strengths:
  - Easy to use! Developer just writes a couple of functions
  - Moves compute to data
    - Schedules work on an HDFS node holding the data when possible
  - Scans through data, reducing seeks
  - Automatic reliability and re-execution on failure
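The Input | Map | Shuffle & Sort | Reduce pipeline above can be sketched in plain Python. This is a minimal local word-count simulation of the model, not the Hadoop API; all names here are illustrative:

```python
from itertools import groupby
from operator import itemgetter

# Map: emit (key, value) pairs for each input record.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce: combine all values that share a key.
def reduce_fn(word, counts):
    yield (word, sum(counts))

def mapreduce(records):
    # Map phase
    pairs = [kv for rec in records for kv in map_fn(rec)]
    # Shuffle & sort: bring equal keys together, like `sort | uniq -c`
    pairs.sort(key=itemgetter(0))
    # Reduce phase
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reduce_fn(key, (v for _, v in group)))
    return out

print(mapreduce(["the quick fox", "the lazy dog"]))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In real Hadoop the map and reduce phases run on many machines and the shuffle moves data over the network, but the developer still only writes the two functions.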
8. HDFS: Scalable, Reliable, Manageable
Scale IO, storage, CPU:
• Add commodity servers & JBODs
• 4K nodes in cluster, 80
Fault tolerant & easy management:
• Built-in redundancy
• Tolerates disk and node failures
• Automatically manages addition/removal of nodes
• One operator per 8K nodes!!
Storage server used for computation:
• Move computation to data
• Not a SAN, but high-bandwidth network access to data via Ethernet
Immutable file system:
• Read, write, sync/flush
• No random writes
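The built-in redundancy idea can be sketched as a toy model: each block is replicated on several nodes, so a node failure never loses data, and re-replication restores the target copy count. All class and method names below are invented for illustration; this is not the HDFS code, and real HDFS placement is rack-aware:

```python
import random

REPLICATION = 3  # HDFS's default replication factor

class Cluster:
    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.blocks = {}  # block id -> set of node names holding a replica

    def write_block(self, block_id):
        # Place REPLICATION copies on distinct nodes.
        targets = random.sample(sorted(self.nodes), REPLICATION)
        self.blocks[block_id] = set(targets)

    def fail_node(self, node):
        # A node dies: drop its replicas, then re-replicate
        # any block that is now under-replicated.
        self.nodes.discard(node)
        for holders in self.blocks.values():
            holders.discard(node)
            while len(holders) < REPLICATION and len(self.nodes) > len(holders):
                spare = random.choice(sorted(self.nodes - holders))
                holders.add(spare)

cluster = Cluster([f"node{i}" for i in range(6)])
cluster.write_block("blk_0001")
cluster.fail_node(next(iter(cluster.blocks["blk_0001"])))
# Data is still fully replicated after the failure:
assert len(cluster.blocks["blk_0001"]) == REPLICATION
```

The same mechanism covers planned node removal, which is why addition/removal of nodes can be managed automatically.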
9. HBase
• Hadoop ecosystem "NoSQL store"
  - Very large tables, interoperable with Hadoop
  - Inspired by Google's BigTable
• Features
  - Multidimensional sorted map
    - Table => Row => Column => Version => Value
  - Distributed column-oriented store
  - Scale - sharding etc. done automatically
    - No SQL; CRUD etc.
    - Billions of rows X millions of columns
  - Uses HDFS for its storage layer
11. A Brief History
• Early adopters, 2006 - present: scale and productize Hadoop (Apache Hadoop)
• Other Internet companies, 2008 - present: add tools / frameworks, enhance Hadoop
• Service providers, 2010 - present: provide training, support, hosting (Cloudera, MapR, Microsoft, IBM, EMC, Oracle, …)
• Wide enterprise adoption, nascent / 2011: funds further development, enhancements
12. Early Adopters & Uses
Use cases: analyzing web logs, data analytics, advertising optimization, machine learning, text mining, web search, mail anti-spam, content optimization, customer trend analysis, ad selection, video & audio processing, data mining, user interest prediction, social media
13. CASE STUDY
YAHOO! WEBMAP
• What is a WebMap?
  - Gigantic table of information about every web site, page and link Yahoo! knows about
  - Directed graph of the web
  - Various aggregated views (sites, domains, etc.)
  - Various algorithms for ranking, duplicate detection, region classification, spam detection, etc.
• Why was it ported to Hadoop?
  - Custom C++ solution was not scaling
  - Leverage scalability, load balancing and resilience of Hadoop infrastructure
  - Focus on application vs. infrastructure
© Yahoo 2011
14. CASE STUDY
WEBMAP PROJECT RESULTS
• 33% time savings over previous system on the same cluster (and Hadoop keeps getting better)
• Was largest Hadoop application, drove scale
  - Over 10,000 cores in system
  - 100,000+ maps, ~10,000 reduces
  - ~70 hours runtime
  - ~300 TB shuffling
  - ~200 TB compressed output
• Moving data to Hadoop increased number of groups using the data
15. CASE STUDY
YAHOO SEARCH ASSIST™
• Database for Search Assist™ is built using Apache Hadoop
• Several years of log data
• 20 steps of MapReduce

                   Before Hadoop    After Hadoop
Time               26 days          20 minutes
Language           C++              Python
Development Time   2-3 weeks        2-3 days
16. HADOOP @ YAHOO!
TODAY
40K+ servers
170 PB storage
5M+ monthly jobs
1000+ active users
17. CASE STUDY
YAHOO! HOMEPAGE
• Personalized for each visitor
• Result: twice the engagement

Recommended links: +79% clicks vs. randomly selected
News Interests:    +160% clicks vs. one size fits all
Top Searches:      +43% clicks vs. editor selected
18. CASE STUDY
YAHOO! HOMEPAGE
• Serving maps: users - interests
• Five-minute production
• Weekly categorization models

SCIENCE HADOOP CLUSTER: machine learning on user behavior to build ever better categorization models (weekly)
PRODUCTION HADOOP CLUSTER: identify user interests from user behavior using categorization models, producing serving maps (every 5 minutes)
SERVING SYSTEMS: build customized home pages with latest data (thousands / second), driving engaged users
19. CASE STUDY
YAHOO! MAIL
Enabling quick response in the spam arms race
• 450M mailboxes
• 5B+ deliveries/day
• Antispam models retrained every few hours on Hadoop
• 40% less spam than Hotmail and 55% less spam than Gmail
21. Adoption Drivers
• Business drivers
  - ROI and business advantage from mastering big data
  - High-value projects that require use of more data
  - Opportunity to interact with customers at point of procurement
  (Gartner predicts 800% data growth over next 5 years)
• Financial drivers
  - Growing cost of data systems as percentage of IT spend
  - Cost advantage of commodity hardware + open source
• Technical drivers
  - Existing solutions not well suited for volume, variety and velocity of big data
  - Proliferation of unstructured data
  (80-90% of data produced today is unstructured)
22. Key Success Factors
• Opportunity
  - Apache Hadoop has the potential to become a center of the next-generation enterprise data platform
  - My prediction is that 50% of the world's data will be stored in Hadoop within 5 years
• In order to achieve this opportunity, there is work to do:
  - Make Hadoop easier to install, use and manage
  - Make Hadoop more robust (performance, reliability, availability, etc.)
  - Make Hadoop easier to integrate and extend, to enable a vibrant ecosystem
  - Overcome current knowledge gaps
• Hortonworks' mission is to enable Apache Hadoop to become the de facto platform and unified distribution for big data
23. Our Roadmap
Phase 1 - Making Apache Hadoop Accessible (2011)
• Release the most stable version of Hadoop ever
  - Hadoop 0.20.205
• Release directly usable code from Apache
  - RPMs & .debs…
• Improve project integration
  - HBase support
Phase 2 - Next-Generation Apache Hadoop (2012; alphas in Q4 2011)
• Address key product gaps (HA, management…)
  - Ambari
• Enable ecosystem innovation via open APIs
  - HCatalog, WebHDFS, HBase
• Enable community innovation via modular architecture
  - Next-generation MapReduce, HDFS federation
25. Developer Questions
• We know we want to integrate Lucene into Hadoop
  - How is this best done?
• Log & merge problems (search indexes & HBase)
  - Are there opportunities for Solr and HBase to share?
  - Knowledge? Lessons learned? Code?
• Hadoop is moving closer to online
  - Lower latency and fast batch
• Outsource more indexing work to Hadoop?
  - HBase maturing
• Better crawlers, document processing and serving?
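The "log & merge" problem both communities share - Lucene merges immutable index segments, HBase compacts immutable store files - boils down to a k-way merge of sorted runs where newer runs win on duplicate keys. A toy sketch of that shared pattern (invented function names, not either project's implementation):

```python
import heapq

def compact(segments):
    """Merge sorted (key, value) runs; the newest run wins on duplicate keys."""
    # Tag each entry with its segment's age so newer segments take precedence.
    tagged = [[(key, age, val) for key, val in seg]
              for age, seg in enumerate(segments)]
    merged = {}
    # heapq.merge streams the runs in key order without loading everything.
    for key, age, val in heapq.merge(*tagged):
        if key not in merged or age > merged[key][0]:
            merged[key] = (age, val)
    return [(k, v) for k, (_, v) in sorted(merged.items())]

old = [("a", 1), ("c", 3)]
new = [("b", 2), ("c", 30)]   # the newer run overwrites key "c"
print(compact([old, new]))
# [('a', 1), ('b', 2), ('c', 30)]
```

Because both projects append immutable sorted runs and periodically merge them, merge policies, IO scheduling and compaction heuristics are plausible places for the code and lessons sharing the slide asks about.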
26. Business Questions
• Users of Hadoop are natural users of Lucene
  - How can we help them search all that data?
• Are users of Solr natural users of Hadoop?
  - How can we improve search with Hadoop?
  - How many of you use both?
• What are the opportunities?
  - Integration points? New projects? Training?
  - Win-win if communities help each other