LinkedIn is the premier professional social network with over 60 million users and a new user joining every second. One of LinkedIn's strategic advantages is its unique data. While most organizations treat data as a service function, LinkedIn considers data a cornerstone of its product portfolio.
To develop these products rapidly, LinkedIn leverages a number of technologies, including open source, third-party solutions, and some that LinkedIn had to invent along the way.
This LinkedIn talk at the NYC Hadoop Meetup held 3/18 at ContextWeb focused on best practices for quickly uncovering patterns, visualizing trends, and generating actionable insights from large datasets.
2. Outline
• Overview: LinkedIn Biz, Tech, & Analytics
• Rapid Data Exploration 101
- Spatial Analytics Pig Code
- Trend detection with Pig & Python
- R Streaming Example
• Deep Dive: Our Data Analysis Approach
• Building Data Products
• LinkedIn Data Insights
3. Connect the world’s professionals to make them more productive and successful
5. LinkedIn at a glance
• Founded in 2003
• #17 site in the US (Alexa)
• 60+ million members
• First million members = 477 days
• Latest million = 9 days
• 500K+ company profiles
• 12+ million small business professionals
• 1 billion people searches in 2009
• Average age: 41
• Household income $107,000
• 42% are “decision makers”
6. How International?
• More than 50% international
(members in over 200 countries & territories)
• 13+ million in Europe
• 4+ million in India
• 3+ million in UK
• #13 site in UK (Alexa)
7. How do we keep the lights on?
• Profitable since 2007
• Valued at over $1B at the last funding round
• Subscriptions
• Ads
• Job Postings
• Enterprise Client
8. Hadoop on LinkedIn
1,400+ members list “Hadoop” on their profile
What other skills do they have?
• HBase, Lucene, Solr, MapReduce, Nutch...
Where are they?
• 36% in Bay Area
• 8% in India
• 6% in NYC
• 4% in Seattle
• 4% in Los Angeles
Who do they work for?
• 11% Yahoo!
• 2% Apache Software Foundation
• 1% LinkedIn
• 1% Google
• 1% Facebook
10. Voldemort Data Storage
Compact, compressed, binary data (something like Avro)
Type can be any combination of int, double, float, String, Map, List, etc. => Sequence Files
Example member definition:
{
'member_id': 'int32',
'first_name': 'string',
'last_name': 'string',
'age': 'int32'
…
}
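A member definition like the one above can be used to check records before they are stored. The following is only a minimal sketch in Python, not the actual Voldemort serializer: `MEMBER_SCHEMA`, `check_field`, and `validate` are hypothetical names, only the four fields shown on the slide are included, and the meaning of each type name (for example, the 32-bit range for 'int32') is an assumption.

```python
# Hypothetical schema mirroring the member definition on the slide.
# Only the four fields shown are included; the slide elides the rest.
MEMBER_SCHEMA = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def check_field(value, type_name):
    """Return True if value fits the named type (assumed semantics)."""
    if type_name == "int32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (isinstance(value, int) and not isinstance(value, bool)
                and INT32_MIN <= value <= INT32_MAX)
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError(f"unsupported type name: {type_name}")


def validate(record, schema=MEMBER_SCHEMA):
    """Return a list of error strings; an empty list means the record conforms."""
    errors = []
    for field, type_name in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not check_field(record[field], type_name):
            errors.append(f"bad value for {field}: expected {type_name}")
    return errors
```

Calling `validate` on a record that has all four fields with in-range values returns an empty list; a missing field or an out-of-range integer produces one error string per problem.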
11. Getting Data In
• From databases (user data, news, jobs, etc.)
- Need a way to get data reliably and periodically
- Need tests to verify data
- Support for incremental replication
- Solution: Transmogrify Driver Program
  - InputReader: JDBCReader, CSV Reader
  - Output Writer: JDBCWriter, HDFS writers
• From web logs (page views, search, clicks, etc.)
- Weblog files are rsynced and loaded into HDFS
- Hadoop jobs for data cleaning and transformation
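The reader/writer split with incremental replication and a verification step can be sketched roughly as follows. This is not the Transmogrify code, which the slides do not show: `CsvReader`, `ListWriter`, `run_transfer`, and `verify` are hypothetical names, the in-memory writer stands in for a JDBC or HDFS writer, and using a numeric `id` column as the high-water mark is an assumption.

```python
import csv
import io


class CsvReader:
    """Hypothetical reader: yields one dict per CSV row."""

    def __init__(self, text):
        self._text = text

    def read(self):
        yield from csv.DictReader(io.StringIO(self._text))


class ListWriter:
    """Hypothetical writer that collects rows in memory,
    standing in for a JDBC or HDFS writer."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def run_transfer(reader, writer, key="id", last_seen=0):
    """Copy rows whose key exceeds last_seen.

    Returns (rows_copied, new_high_water_mark) so the next run
    can resume from where this one stopped.
    """
    copied = 0
    high = last_seen
    for row in reader.read():
        row_key = int(row[key])
        if row_key <= last_seen:
            continue  # already replicated in an earlier run
        writer.write(row)
        copied += 1
        high = max(high, row_key)
    return copied, high


def verify(writer, expected_count):
    """Simple test of the load: the row count must match what was copied."""
    return len(writer.rows) == expected_count
```

A run would pass the previous high-water mark as `last_seen`, store the returned mark for the next run, and call `verify` before declaring the load complete.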
27. We can also leverage...
• Connection Graph
• Recommendations
• Address Book Uploads
• Search Logs
• Profile Views & Activity
• Job Postings
• LinkedIn Groups
• LinkedIn Questions
• Company Pages
• Talent Match
• Web Referrals
• 1M+ Twitter Accounts
• Wikipedia Data
• Mechanical Turk
• Census, BLS, & Data.gov
• Much more...
31. Data Scientist Lessons
• Follow the data, avoid assumptions
• Sanity check the extremes (0, infinity)
• Don’t get mired in rare edge cases
• Data Jujitsu: solve easier auxiliary problems
• Build smaller, consistent samples to test code
• Establish a baseline model quickly, iterate often
• Use the right tool for the job at hand
• Iterate quickly with high-level languages
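One way to build a smaller, consistent sample, as the lessons above recommend, is to hash each member ID and keep only the IDs that land in a fixed set of buckets. The same members are then selected on every run and in every dataset keyed by that ID. This is a sketch of one common technique, not necessarily the approach used at LinkedIn; `in_sample`, `sample`, and the `salt` value are hypothetical.

```python
import hashlib


def in_sample(member_id, percent=1, salt="sample-v1"):
    """Return True if member_id falls in a stable sample of roughly percent%.

    The decision depends only on the ID and the salt, so it is the same
    on every run and on every machine.
    """
    digest = hashlib.md5(f"{salt}:{member_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return bucket < percent


def sample(ids, percent=1, salt="sample-v1"):
    """Filter an iterable of IDs down to the stable sample."""
    return [i for i in ids if in_sample(i, percent, salt)]
```

Because membership is decided by `bucket < percent`, a 1% sample is always a subset of the 10% sample drawn with the same salt, which makes it easy to test code on the small sample and then widen it. Changing the salt draws a different, independent sample.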