Facebook Retrospective - Big data-world-europe-2012
1. Data Infrastructure at Facebook
A retrospective
Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole
2. Intro
• File/Database Systems developer (ex- Netapp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)
• @Facebook:
– SysAdmin: operated massive Hadoop/Hive installs
– Architect: conceived/wrote Apache Hive, made HBase@FB happen
– Herded cats: first manager of Data Infra team
– IT engineer/DBA: built ETL tools, warehouse/reporting for
FB Virtual Currency
– Vested my stock options!
• Founder Qubole Inc. (2011-)
3. What not to do: Yahoo
• Want to add ‘feed’ in warehouse?
Fill form, schmooze PM, wait 2 months.
• Want to justify project?
Take $100M, double count 5 times.
• Hard to find out what data exists in the company: silos
• Lots of grand architecture, but no progress
4. Goals going in
• Universal ability to log data and compute against it
• Build infrastructure for data processing
– Help people help themselves
– Get out of the way
• Done is better than perfect, Move Fast.
– Iterate, Fix Failures Fast, Do everything twice
5.
6. State of the Union
• Sep, 2007:
– Use Case: compute relationship strength between friends
– Data Sets: user graph, interaction and page-view logs
– ~10TB cluster
…
• July, 2011:
– Ads reporting/data-mining, News Feed ranking, Spam
classification, PYMK, Search Indexing, Entitization,
Sentiment Analysis, Fraud Analysis ..
– ~10k queries a day, hundreds of users, scores concurrent
– 50PB cluster, 15 engineers/ops in total manning.
7. User Feedback
• Ex-Yahoo Senior Director, Ads Product Mgmt.:
"I haven't done SQL for ages - but I can use this stuff easily"
• Ex-Yahoo Data Scientist:
"This is so amazing. That all data is stored in one place and I can
get access instantly without having to wait months and contact
multiple groups/silos"
• Ex-Paypal Fraud Analyst:
"So much better data and infrastructure than I have ever had in
the past"
8. Key Highways
• Hive
– Centrally managed Hadoop service, no setup
– SQL is easy, add scripts for map-reduce
– Browser based query wizards for SQL dummies
• Download results to Excel
• Schedule queries periodically with a few clicks
• Scribe
– Just log data using Scribe from any application
– Dead simple to add attributes to user page views
– Easy to pull data from RDBMS
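The "add scripts for map-reduce" bullet refers to plugging an external script into a Hive query. A minimal sketch of such a script follows, assuming rows arrive as tab-separated text lines, which is how Hive's TRANSFORM ... USING clause streams rows by default. The column layout (user_id, url) and the function names are hypothetical, chosen only for illustration.

```python
import io
from urllib.parse import urlparse


def extract_domain(line):
    """Turn one 'user_id<TAB>url' row into 'user_id<TAB>domain'."""
    user_id, url = line.rstrip("\n").split("\t", 1)
    # Fall back to a marker value when the URL has no host part.
    domain = urlparse(url).netloc or "unknown"
    return f"{user_id}\t{domain}"


def transform(rows_in, rows_out):
    """Read rows from one stream and write one transformed row per input row."""
    for line in rows_in:
        if line.strip():  # skip blank lines
            rows_out.write(extract_domain(line) + "\n")


# Small in-memory demo standing in for the streams a query engine would provide.
demo_in = io.StringIO("42\thttp://example.com/a\n7\tnot-a-url\n")
demo_out = io.StringIO()
transform(demo_in, demo_out)
```

In a real script the last three lines would be replaced by a call to transform(sys.stdin, sys.stdout), so the same file can be handed to the query as the map or reduce step.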
9. Key Highways
• Simple Workflow authoring system (Databee)
• Reporting is easy
– Provision MySQL Data-marts in hours
– Easy self-service charting/dashboarding software
• Data Explorer
– Wiki like system for documenting tables, columns, types
– Keyword Search, find table authors, users
– Help people help people
10. Democracies – Ugh!
“Democracy may not be the perfect … but it is better
than the alternatives.”
“The family that poops together stays together”
11. Maintaining Order
• Hadoop Fair Scheduler
– Guarantee resources to projects/users. Share excess capacity
• Multiple Compute tiers
– Production, Large Ad-hoc, Small Ad-hoc, Local-mode queries
• Kill the bad guys
– Code to hunt down bad queries/apps
– Track cpu/disk usage – go after biggies
• Ban assault rifles
– Basic ACLs – can’t delete important tables, directories
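The "guarantee resources, share excess capacity" idea can be sketched as a two-step allocation. This is a simplified illustration, not the actual Hadoop Fair Scheduler algorithm: it assumes integer task slots, assumes the guarantees sum to no more than the cluster capacity, and the pool names and numbers are made up.

```python
def fair_share(capacity, pools):
    """pools maps name -> (guaranteed_min, demand); returns name -> allocated slots."""
    # Step 1: each pool gets its guarantee, capped at what it actually asks for.
    alloc = {name: min(guar, demand) for name, (guar, demand) in pools.items()}
    spare = capacity - sum(alloc.values())

    # Step 2: hand out spare slots one at a time, round-robin,
    # to pools whose demand is not yet satisfied.
    while spare > 0:
        hungry = [name for name, (_, demand) in pools.items() if alloc[name] < demand]
        if not hungry:
            break  # everyone is satisfied; leftover capacity stays idle
        for name in hungry:
            if spare == 0:
                break
            alloc[name] += 1
            spare -= 1
    return alloc


example = fair_share(
    100,
    {"ads": (40, 90), "feed": (30, 10), "adhoc": (10, 50)},
)
```

Here "feed" uses only 10 of its 30 guaranteed slots, so the unused guarantee plus the unreserved capacity is split between "ads" and "adhoc".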
12. Why did we succeed?
All Hail
Data Consolidation
(9pm, FB Hack Night)
Ads Engineering Director:
“Hey Joy, I want to join user fb-currency purchases with friend
request data to test a thesis – pointers?”
13. Hadoop
• Cheap
– Can consolidate everything.
– We made it cheaper (RCFile, HDFS-RAID)
• Reduces governance cost
– Only worry about really really large stuff.
– Fewer data replication processes to manage
• Separates compute from storage
– Most legacy vendors don’t get this
• Disk Based analytic systems degrade gracefully
– No tipping point (vs. in-memory only)
– Ability to catch up, go back into the past (vs. real-time stream processing
only)
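The "made it cheaper" point about HDFS-RAID comes down to storage-overhead arithmetic: once parity blocks exist, data blocks can be kept with fewer replicas. A rough sketch follows; the stripe sizes and replica counts are illustrative assumptions, not figures from the talk.

```python
def raw_bytes_per_logical_byte(data_blocks, parity_blocks, data_replicas, parity_replicas):
    """Raw storage consumed per byte of user data for one stripe layout."""
    raw = data_blocks * data_replicas + parity_blocks * parity_replicas
    return raw / data_blocks


# Plain triple replication, no parity blocks.
plain = raw_bytes_per_logical_byte(10, 0, 3, 0)
# Assumed XOR layout: 10 data blocks + 1 parity block, everything stored twice.
xor_parity = raw_bytes_per_logical_byte(10, 1, 2, 2)
# Assumed Reed-Solomon layout: 10 data blocks + 4 parity blocks, stored once.
reed_solomon = raw_bytes_per_logical_byte(10, 4, 1, 1)
```

Under these assumptions the overhead drops from 3.0x to 2.2x and 1.4x respectively, which is the kind of saving that makes consolidating everything on one cluster affordable.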
15. Things we missed
• SLOOOOOOW
– Extensive work on FB Hadoop repo for faster scheduling
– Make testing faster (approx. queries)
– Watch @Qubole
• SQL as rope
– Need higher level templates. Don’t need 10 versions of a 30-day
moving average calculator
• Duplication of queries/jobs
– How to discover if there’s existing summaries?
– People help people, but still ..
• Didn’t build enough APIs
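The "higher level templates" point can be illustrated with a tiny query generator: one shared function instead of ten hand-written moving-average queries. This is a sketch only. It assumes the SQL engine supports standard window functions and that the table holds one row per day (per key); the table and column names are hypothetical.

```python
def moving_average_sql(table, metric, date_col="ds", days=30, key=None):
    """Build a trailing N-day moving-average query for one metric."""
    if days < 1:
        raise ValueError("days must be at least 1")
    # Optional grouping column, e.g. one moving average per country.
    select_key = f"{key}, " if key else ""
    partition = f"PARTITION BY {key} " if key else ""
    return (
        f"SELECT {select_key}{date_col}, "
        f"AVG({metric}) OVER ({partition}ORDER BY {date_col} "
        f"ROWS BETWEEN {days - 1} PRECEDING AND CURRENT ROW) AS {metric}_avg_{days}d "
        f"FROM {table}"
    )


query = moving_average_sql("daily_revenue", "revenue")
per_country = moving_average_sql("daily_revenue", "revenue", days=7, key="country")
```

Identifiers are interpolated without escaping, so a helper like this is only suitable for trusted, internally supplied names.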
16. Final Words
• It’s not the software, stupid
– Software is easy to write and fix
– Can be slow
• It’s the service that matters
– Making everything work seamlessly
– Ability to fix/improve things FAST