This talk presents a fundamental approach to thinking about scalability and shows how Hadoop, HBase, and Lucene enable companies to process very large amounts of data. It also argues that social media is making the traditional RDBMS irrelevant.
8. Social Media and Scaling
• Scalability Matters Now.
• SM produces large, complex data
• Anyone can collect the web
• Make a Twitter in a few days
• Easy to get TBs of data
• Big Data enabling new fields for companies
37. Goals for New Platform
• "Golden Timeline"
• Search/Analyze *any* data
• Linear Cost
• Not Hacked Together
• "Collect the Social Internet"
45. Avoiding Impedance Mismatch
• Most problems can be divided into High or Low latency
• Get a lot of data eventually, or a little now
• MapReduce vs. Sharding/Indexing
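The split between "a lot of data eventually" and "a little now" can be illustrated with a minimal Python sketch. It is not taken from the talk: the toy corpus, the shard count, and the names `shard_for`, `build_sharded_index`, `search`, and `batch_term_counts` are all assumptions made for illustration. The batch function scans everything (the MapReduce side), while the sharded index pays its cost up front so a single-term lookup is cheap (the sharding/indexing side).

```python
from collections import defaultdict

# Hypothetical toy corpus: (doc_id, text) pairs standing in for collected posts.
DOCS = [
    ("a1", "iron maiden rules"),
    ("b1", "hadoop crunches data"),
    ("c2", "lucene indexes data"),
]

NUM_SHARDS = 2  # assumed shard count, chosen only for illustration


def shard_for(doc_id: str) -> int:
    """Pick a shard deterministically from the bytes of the document ID."""
    return sum(doc_id.encode("utf-8")) % NUM_SHARDS


def build_sharded_index(docs):
    """Low-latency path: pay the indexing cost up front, one index per shard."""
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for doc_id, text in docs:
        index = shards[shard_for(doc_id)]
        for term in text.split():
            index[term].add(doc_id)
    return shards


def search(shards, term):
    """Answer a small question now by asking every shard for one term."""
    hits = set()
    for index in shards:
        hits |= index.get(term, set())
    return sorted(hits)


def batch_term_counts(docs):
    """High-latency path: scan every document and aggregate."""
    counts = defaultdict(int)
    for _, text in docs:
        for term in text.split():   # map: one occurrence per term seen
            counts[term] += 1       # reduce: sum occurrences per term
    return dict(counts)


SHARDS = build_sharded_index(DOCS)
```

Calling `search(SHARDS, "data")` touches only the prebuilt per-shard indexes, whereas `batch_term_counts(DOCS)` has to read the whole corpus before it can answer anything.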
52. Hadoop + MR
• Special: Crunch web-scale data fast
• Sacrifice: Low-Latency, Transactions, Random Access, Updates
• Structure: Chunked flat files
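As a rough illustration of processing chunked flat files, the sketch below runs a word count in plain Python. It does not use Hadoop's API. The chunk size, the in-memory list of lines standing in for a flat file, and the names `chunked`, `map_chunk`, `reduce_counts`, and `word_count` are assumptions. Each chunk is counted independently and the partial results are merged afterwards, which is the property that lets this kind of work spread across machines.

```python
from collections import Counter
from itertools import islice


def chunked(lines, chunk_size):
    """Split an iterable of lines into fixed-size chunks (stand-ins for file blocks)."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk):
    """Map step: count words within one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, chunk_size=2):
    """Run the map step over each chunk, then reduce the partial counts."""
    return reduce_counts(map_chunk(c) for c in chunked(lines, chunk_size))


# Hypothetical input: three lines standing in for a flat file.
LINES = [
    "Iron Maiden rules",
    "Maiden rules",
    "iron",
]
RESULT = word_count(LINES)
```

Nothing here supports updating a single record or a fast lookup of one line, which mirrors the sacrifices listed on the slide.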
53. Structured Processing Cluster
[Architecture diagram. Recoverable labels: Unstructured Cluster, Hadoop Cluster, Structured Analysis, Enriched Data, Store in HBase, HBase Records, Store in Search Indexing, Sharded Lucene Index (shown twice)]
54. Document Structure
ContentID: 00BAC189
Title: Iron Maiden Rules
Body: I think Janick Gers is an amazing guitarist blah blah
PostDT: 20090718
ParentID: 0FDEADBEEF
Permalink: www.roadtofailure.com/post?=20
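One way to handle a record like this in code is sketched below in Python. The `Document` class, its snake_case attribute names, and the `parse_document` helper are hypothetical and not part of the talk. The sketch also assumes that `PostDT` is a `YYYYMMDD` date, which matches the sample value but is not stated on the slide.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Document:
    """Hypothetical in-memory form of the slide's document fields."""
    content_id: str
    title: str
    body: str
    post_dt: date
    parent_id: str
    permalink: str


def parse_document(text: str) -> Document:
    """Parse 'Field: value' lines like the ones shown on the slide."""
    fields = {}
    for line in text.strip().splitlines():
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return Document(
        content_id=fields["ContentID"],
        title=fields["Title"],
        body=fields["Body"],
        # Assumption: PostDT is formatted as YYYYMMDD.
        post_dt=datetime.strptime(fields["PostDT"], "%Y%m%d").date(),
        parent_id=fields["ParentID"],
        permalink=fields["Permalink"],
    )


SAMPLE = """
ContentID: 00BAC189
Title: Iron Maiden Rules
Body: I think Janick Gers is an amazing guitarist blah blah
PostDT: 20090718
ParentID: 0FDEADBEEF
Permalink: www.roadtofailure.com/post?=20
"""

DOC = parse_document(SAMPLE)
```

Keeping `content_id` and `parent_id` as plain strings leaves the record usable both as a keyed row and as a set of indexed fields, without assuming anything about how the IDs are generated.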
83. Recap: Rules for Scaling
• RDBMS is not a Swiss-Army Knife
• Know your sacrifices
• Know your specialness
• Know your data structure
• Ponder Latency