20080528dublinpt2

Hadoop Applications at Facebook

Jeff Hammerbacher
Manager, Data
May 28 - 29, 2008

Initial Hadoop Deployment
▪ Tested in mid-2006: not great performance, small community
▪ Already had Cheetah and another Hadoop-like project underway
▪ Strong resistance to Java
▪ Early adopters: Yahoo!, Powerset, Quantcast, Last.fm
▪ First serious cluster: spring 2007
▪ Pulled sixty web server boxes and put 3 x 500 GB SATA disks in the back
▪ Loaded two separate log files: clickstream and activity logs
▪ Clickstream was nearly 600 GB per day, activity logs around 200 GB
▪ Lots of difficulties just getting data into the system
▪ All sorts of fun learning to operate the file system

Initial Hadoop Applications
Hadoop Streaming
▪ Almost all applications at Facebook use Hadoop Streaming
▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe
▪ Facebook users write in Python, PHP, C++ (though Pipes would be better)
▪ Allows for library reuse, faster development
▪ Eats way too much CPU
▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
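
A minimal sketch of the pipe contract described above, assuming a word-count job. The function names (`map_lines`, `reduce_lines`, `run`) and the tab-separated key/value layout are illustrative choices, not code from the talk. Streaming feeds each task its input on stdin and collects whatever the task writes to stdout; the reducer sees its input sorted by key.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key.

    Assumes the input is already sorted by key, which is how the
    framework delivers mapper output to a reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run(role, stream_in, stream_out):
    """Hypothetical entry point: read from one pipe, write to the other.

    In a job script this would be called as
    run(sys.argv[1], sys.stdin, sys.stdout).
    """
    step = map_lines if role == "map" else reduce_lines
    for record in step(stream_in):
        stream_out.write(record + "\n")
```

The same script can then be passed to the streaming jar through its `-mapper` and `-reducer` options, alongside `-input` and `-output`; the jar location depends on the Hadoop release.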

Unstructured text analysis
▪ Intern asked to understand brand sentiment and influence
▪ Began by building an online language classifier for wall posts
▪ Ported application to Hadoop for offline processing
▪ Many tools for supporting his project had to be built
▪ Understanding serialization format of wall post logs
▪ Common data operations: project, filter, join, group by
▪ Developed using Hadoop streaming for rapid prototyping in Python
▪ Scheduling regular processing and recovering from failures
▪ Making it easy to regularly load new data
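
The four common data operations named above (project, filter, join, group by) can be sketched in plain Python over lists of dictionaries. This is only an in-memory illustration of what each operation does; the record fields (`user_id`, `lang`, `country`) are hypothetical and say nothing about the actual wall post log format.

```python
from collections import defaultdict


def project(rows, fields):
    """Keep only the named fields of every record."""
    return [{f: row[f] for f in fields} for row in rows]


def filter_rows(rows, predicate):
    """Keep only the records for which predicate(record) is true."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Inner join: pair each left record with right records sharing the key."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def group_by(rows, key, aggregate):
    """Group records by a field, then reduce each group with aggregate()."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: aggregate(members) for k, members in groups.items()}


# Hypothetical sample data for illustration only.
posts = [
    {"user_id": 1, "lang": "en", "text": "hi"},
    {"user_id": 2, "lang": "fr", "text": "salut"},
    {"user_id": 1, "lang": "en", "text": "yo"},
]
users = [{"user_id": 1, "country": "IE"}, {"user_id": 2, "country": "FR"}]

english = filter_rows(posts, lambda row: row["lang"] == "en")
joined = join(english, users, "user_id")
posts_per_country = group_by(joined, "country", len)
slim = project(joined, ["user_id", "country"])
```

In a streaming job the same operations are usually split across the mapper (project, filter) and the reducer (join, group by), with the join or grouping field emitted as the key.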

Lexicon: Future Directions
▪ Further segmentation and visualization of term intensities
▪ Age
▪ Gender
▪ Geography
▪ TF-IDF
▪ Topic modeling
▪ Sentiment analysis
▪ Augment with data sources from around the internet
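
As a rough illustration of the TF-IDF item in the list above, the sketch below scores terms in tokenized documents. It assumes one common variant (term frequency as a share of the document's tokens, inverse document frequency as log(N / df)); the slides do not say which weighting was planned, and the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. Assumed weighting:
    score = (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Number of documents each term appears in at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical wall-post tokens for illustration only.
scores = tf_idf([["happy", "birthday"], ["happy", "friday"]])
```

A term that appears in every document ("happy" here) scores zero, while a term unique to one document keeps a positive weight.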

Ensemble Learning
▪ Build a lot of Decision Trees and average them
▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
▪ Can be used for regression or classification
▪ See “Random Forests” by Leo Breiman
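
The "build a lot of trees and average them" idea can be sketched with the standard library alone. This is a deliberately simplified stand-in, not Breiman's full algorithm: each tree is a depth-1 regression stump, and the per-tree "random vector" is a bootstrap sample plus one randomly chosen feature. All names (`fit_stump`, `fit_forest`, `predict`) are hypothetical.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Depth-1 regression tree on one feature.

    Picks the threshold that minimizes squared error and returns
    (feature, threshold, left_mean, right_mean).
    """
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        if not left or not right:
            continue
        lm, rm = mean(left), mean(right)
        err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    if best is None:
        # No valid split (single distinct value): predict the sample mean.
        m = mean(targets)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Fit n_trees stumps, each on its own bootstrap sample and feature."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)             # random feature per tree
        forest.append(
            fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature)
        )
    return forest


def predict(forest, row):
    """Average the predictions of all trees (regression)."""
    return mean(lm if row[f] <= thr else rm for f, thr, lm, rm in forest)
```

Because every tree is fitted independently of the others, the fitting loop parallelizes naturally, for example one tree (or batch of trees) per map task.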

More Hadoop Applications
Insights
▪ Monitor performance of your Facebook Ad, Page, Application
▪ Regular aggregation of high volumes of log file data
▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently runs only partially on Hadoop

Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform applications
▪ Now, every Platform Application has a set of quotas
▪ Notifications
▪ News Feed story insertion
▪ Invitations
▪ Emails
▪ Quotas determined by calculating a “reputation score” for the application

Recommendation Engines and Affinity Scores
▪ People You May Know (PYMK)
▪ Other application areas
▪ Pages
▪ Applications
▪ News Feed
▪ Search
▪ Ads
▪ Chat

Miscellaneous
▪ Experimentation Platform back end
▪ A/B Testing
▪ Champion/Challenger Testing
▪ Lots of internal analyses
▪ Export smaller data sets to R
▪ Ad targeting optimization
▪ Search index building
▪ Load testing for new storage systems
▪ Language prediction for translation targeting
