3. Initial Hadoop Deployment
▪ Tested in mid-2006: not great performance, small community
▪ Already had Cheetah and another Hadoop-like project underway
▪ Strong resistance to Java
▪ Early adopters: Yahoo!, Powerset, Quantcast, Last.fm
▪ First serious cluster: spring 2007
▪ Pulled sixty web server boxes and put 3 x 500 GB SATA disks in the back
▪ Loaded two separate log files: clickstream and activity logs
▪ Clickstream was nearly 600 GB per day, activity logs around 200 GB
▪ Lots of difficulties just getting data into the system
▪ All sorts of fun learning to operate the file system
4. Initial Hadoop Applications
Hadoop Streaming
▪ Almost all applications at Facebook use Hadoop Streaming
▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe
▪ Facebook users write in Python, PHP, and C++ (though Hadoop Pipes would be a better fit for C++)
▪ Allows for library reuse, faster development
▪ Eats way too much CPU
▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
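To illustrate the pipe-based contract described above, here is a minimal Python sketch of a Streaming-style mapper and reducer. It is not taken from the Facebook codebase: the word-count task, the function names, and the "map"/"reduce" role argument are all assumptions made for illustration. The only thing it relies on is that Streaming passes records as lines on standard input, treats a tab as the default key/value separator, and sorts by key before the reducer runs.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key. Input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run(role, stdin=sys.stdin, stdout=sys.stdout):
    """Read records from a pipe and write results to a pipe."""
    step = map_lines if role == "map" else reduce_lines
    for record in step(stdin):
        stdout.write(record + "\n")


# Hypothetical entry point: uncomment when saving this as a standalone script,
# then invoke it as "script.py map" or "script.py reduce".
# run(sys.argv[1])
```

With the entry point enabled, the same script can be checked locally with an ordinary shell pipeline (map, then `sort`, then reduce) before it is handed to the streaming jar through its `-mapper` and `-reducer` options.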
5. Initial Hadoop Applications
Unstructured text analysis
▪ Intern asked to understand brand sentiment and influence
▪ First began by building an online language classifier for wall posts
▪ Ported application to Hadoop for offline processing
▪ Many tools for supporting his project had to be built
▪ Understanding serialization format of wall post logs
▪ Common data operations: project, filter, join, group by
▪ Developed using Hadoop streaming for rapid prototyping in Python
▪ Scheduling regular processing and recovering from failures
▪ Making it easy to regularly load new data
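The four common data operations listed above (project, filter, join, group by) can be sketched in a few lines of plain Python. This is only an in-memory illustration of what each operation means, not the tooling that was built at Facebook; the record layout (`user`, `lang`, `country`) and every function name are hypothetical.

```python
from collections import defaultdict


def project(rows, fields):
    """Keep only the named fields of each record."""
    return [{f: row[f] for f in fields} for row in rows]


def filter_rows(rows, predicate):
    """Keep only the records for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Inner hash join of two record lists on a shared key."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def group_by(rows, key, aggregate):
    """Group records by key and apply an aggregate to each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: aggregate(group) for k, group in groups.items()}


# Hypothetical sample data: wall posts and a user dimension table.
posts = [
    {"user": 1, "lang": "en", "text": "hi"},
    {"user": 2, "lang": "fr", "text": "salut"},
    {"user": 1, "lang": "en", "text": "yo"},
]
users = [{"user": 1, "country": "US"}, {"user": 2, "country": "FR"}]

english = filter_rows(posts, lambda row: row["lang"] == "en")
joined = join(project(english, ["user", "lang"]), users, "user")
posts_per_country = group_by(joined, "country", len)
```

On a cluster, each of these steps would typically become a map or reduce stage, with the join and the group by relying on the shuffle to bring records with the same key together.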
7. Initial Hadoop Applications
Lexicon: Future Directions
▪ Further segmentation and visualization of term intensities
▪ Age
▪ Gender
▪ Geography
▪ TF-IDF
▪ Topic modeling
▪ Sentiment analysis
▪ Augment with data sources from around the internet
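For readers unfamiliar with TF-IDF, the sketch below shows one common textbook formulation: term frequency within a document multiplied by the log of the inverse document frequency. The slides do not say which variant Lexicon would use, so the weighting scheme, the function name, and the sample tokens are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Weight = (count / doc length) * ln(N / df),
    where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Hypothetical tokenized wall posts.
sample = [["party", "tonight"], ["party", "exam"]]
scores = tf_idf(sample)
```

A term that appears in every document ("party" here) gets weight zero, while terms confined to a single document score higher, which is what makes the measure useful for surfacing distinctive terms.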
8. Initial Hadoop Applications
Ensemble Learning
▪ Build a lot of Decision Trees and average them
▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
▪ Can be used for regression or classification
▪ See “Random Forests” by Leo Breiman
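The idea of building many randomized trees and combining them can be sketched as follows. This is a deliberately simplified stand-in, not Breiman's full algorithm and not the Facebook implementation: each "tree" is a depth-one stump trained on a bootstrap sample using a single randomly chosen feature, and predictions are combined by majority vote (for regression one would average instead). All names and defaults are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Find the threshold on one feature that minimizes training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps, each on its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random vector for this tree: here just one feature index.
        feature = rng.randrange(len(rows[0]))
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(train_stump(sample_rows, sample_labels, feature))
    return forest


def predict(forest, row):
    """Classify a row by majority vote across all stumps."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, two well-separated classes.
rows = [(0, 0), (1, 1), (2, 2), (8, 8), (9, 9), (10, 10)]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
```

Because every tree is trained independently of the others, the training loop parallelizes naturally, which is one reason ensembles of this kind map well onto a cluster.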
9. More Hadoop Applications
Insights
▪ Monitor performance of your Facebook Ad, Page, Application
▪ Regular aggregation of high volumes of log file data
▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently only running partially on Hadoop
11. More Hadoop Applications
Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform applications
▪ Now, every Platform Application has a set of quotas
▪ Notifications
▪ News Feed story insertion
▪ Invitations
▪ Emails
▪ Quotas determined by calculating a “reputation score” for the application
13. More Hadoop Applications
Recommendation Engines and Affinity Scores
▪ People You May Know (PYMK)
▪ Other application areas
▪ Pages
▪ Applications
▪ News Feed
▪ Search
▪ Ads
▪ Chat
14. More Hadoop Applications
Miscellaneous
▪ Experimentation Platform back end
▪ A/B Testing
▪ Champion/Challenger Testing
▪ Lots of internal analyses
▪ Export smaller data sets to R
▪ Ad targeting optimization
▪ Search index building
▪ Load testing for new storage systems
▪ Language prediction for translation targeting
15. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.