Data infrastructure and Hadoop at LinkedIn
1. Big data and Hadoop
September 2012
Hari Shankar Menon
Software engineer
LinkedIn
2. About me
LinkedIn Engineering
Data warehouse team
Previously, Software engineer @Clickable
– Worked on building the reporting and analytics platform on Hadoop and HBase.
Hadoop and Open-source enthusiast
5. LinkedIn by numbers
175M+ members
~2/sec new members joining
>2M Company Pages
85% of Fortune 100 companies use LinkedIn to hire**
~4.2B professional searches in 2011
LinkedIn members (millions), 2004 to 2010: 2, 4, 8, 17, 32, 55, 90
*as of Nov 4, 2011
**as of June 30, 2011
6. About LinkedIn
Data Infrastructure overview
Hadoop@LinkedIn
Challenges
7. What is big data?
* Chart from Philip Russom, Research Director, TDWI
8. Infrastructure technologies
Primary data store (Front-end)
Document-oriented store
Distributed key-value store
Distributed PubSub messaging
Database change replication
Search technologies: SenseiDB, Zoie, Bobo
10. About LinkedIn
Data Infrastructure overview
Hadoop@LinkedIn
Challenges
11. What is Hadoop
Evolution of Hadoop
Impact
12. Hadoop@LinkedIn
Recommendation systems
– Generating recommendations
– Modeling
– A/B Testing
– Grandfathering
Data warehouse/ETL
– Raw data storage
– Aggregations
– Heavy lifting
Data sciences
– Strategic analyses
– Experimentation sandbox
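The aggregation work listed under Data warehouse/ETL can be pictured as a map, shuffle and reduce pass over raw events. The following is only a minimal in-process sketch of that pattern, not Hadoop's API and not LinkedIn's actual jobs; the event fields (`date`, `page`) and all function names are hypothetical, chosen for illustration.

```python
from collections import defaultdict

def map_event(event):
    """Map step: emit one ((date, page), 1) pair per raw event."""
    yield (event["date"], event["page"]), 1

def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)

def run_job(events):
    """Run map, shuffle and reduce in one process (illustration only)."""
    pairs = (pair for event in events for pair in map_event(event))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())

# Hypothetical raw page-view events, standing in for raw data storage.
raw_events = [
    {"date": "2012-09-01", "page": "profile"},
    {"date": "2012-09-01", "page": "profile"},
    {"date": "2012-09-01", "page": "jobs"},
    {"date": "2012-09-02", "page": "profile"},
]
daily_views = run_job(raw_events)
```

On a real cluster the map and reduce steps would run in parallel across many machines; the sketch keeps only the data flow.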
13. The Recommendations opportunity
• Relevance/Latency
• Offline computation
• Caching
[Figure: recommendation examples, including Pandora, Search for People, Events You May Be Interested In, Groups, browse maps]
23. Real-time processing
• Challenges
• Random reads/writes
• Warm-up time
• Solutions
• Parts of the problem that can be moved offline?
• HBase, Voldemort
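One way to read the solutions above: do the heavy computation offline as a batch pass, then serve the precomputed results through single-key lookups. The sketch below illustrates that split under stated assumptions. The `KeyValueStore` class is a hypothetical in-memory stand-in for a store such as HBase or Voldemort, not either system's client API, and the co-view counting is an illustrative choice rather than LinkedIn's algorithm.

```python
from collections import Counter, defaultdict

def build_recommendations(view_log, top_n=2):
    """Offline step: scan the whole log and precompute, for each item,
    the items most often viewed by the same members."""
    items_by_member = defaultdict(set)
    for member, item in view_log:
        items_by_member[member].add(item)

    co_views = defaultdict(Counter)
    for items in items_by_member.values():
        for a in items:
            for b in items:
                if a != b:
                    co_views[a][b] += 1

    # Sort by count (descending), then by name, so results are deterministic.
    return {
        item: [other for other, _ in
               sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for item, counts in co_views.items()
    }

class KeyValueStore:
    """Hypothetical in-memory stand-in for a read-only key-value store."""
    def __init__(self, data):
        self._data = dict(data)

    def get(self, key, default=None):
        return self._data.get(key, default)

# Hypothetical (member, item) view log, standing in for data on Hadoop.
view_log = [
    ("m1", "jobA"), ("m1", "jobB"),
    ("m2", "jobA"), ("m2", "jobB"), ("m2", "jobC"),
    ("m3", "jobA"), ("m3", "jobC"),
]

# The batch output is loaded into the store once, ahead of any request.
store = KeyValueStore(build_recommendations(view_log))

def recommend(item):
    """Online step: a single key lookup, with no computation at request time."""
    return store.get(item, [])
```

The request path never touches the raw log, so its latency depends only on the lookup, at the cost of results being as stale as the last batch run.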
Being part of LinkedIn, a social media company, we deal with a lot of data, and we face a lot of the challenges that come with it. (Notes: sell LI; Hadoop user group.)
For us, fundamentally changing the way the world works begins with our mission statement: to connect the world’s professionals and entrepreneurs to make them more productive and successful. This means not only helping people to find their dream jobs, but also enabling them to be great at the jobs they’re already in. A platform that lets us become more productive.
Talent is THE driving force for success and economic opportunity; that holds true for both individual professionals and the companies they work for. At our core, LinkedIn is in the business of connecting talent with opportunity at massive scale. We are able to do this in an unprecedented way due to the convergence of two unique trends:
• Scalable infrastructure that connects hundreds of millions of people in milliseconds, and
• Extraordinary shifts in online behavior related to the way people represent their identities, build their networks and share information and knowledge.
This is fundamentally changing the world in the way we live, play, and, of course, work. And that’s where LinkedIn is focused: on fundamentally transforming the way the world works. These factors enable LI to connect talent and opportunity.
With north of 175 million members, we’re making great strides toward our mission of connecting the world’s professionals to make them more productive and successful. For us this means not only helping people to find their dream jobs, but also enabling them to be great at the jobs they’re already in. With terabytes of data flowing through our systems, generated from members’ profiles, their connections and their activity on LinkedIn, we have amassed rich, structured data on one of the most influential, affluent and highly educated audiences on the web. This huge semi-structured data set is updated in real time and is growing at a tremendous pace. We are all very excited about the data opportunity at LinkedIn.
The power of LinkedIn’s platform grows exponentially as we continue to:
• Add more members
• Get them to come back more often, and
• Give them more reasons to engage on the site
These three actions drive network effects that form a virtuous cycle on LinkedIn. As membership grows and activity on the platform increases, the quantity and quality of data propagated throughout the network improve, and we then use that data to create better and more relevant products and services for our members and customers. We have recommendation solutions for everyone: individuals, recruiters and advertisers. In our view, recommendations are ubiquitous and they permeate the whole site. This enables professionals to be more productive.
Volume: generally large, several TB and sometimes PB
Variety: 80% of the data is unstructured, growing at 15 times the rate of structured data
Velocity: high
• User data (more structured)
• Traffic data (real-time)
• 3rd party data (batch data, but unstructured)
Example
Need for various technologies. One size doesn’t fit all.
History: Google paper, Doug Cutting, Yahoo. Storage and computation. Synonymous with big data.
Empowering. Made a lot of new ideas feasible and spawned a new bunch of startups.
Ability to store and process => more data to store.
May be 2 slides.
NAS systems, OLAP. But not feasible. Hadoop democratized scalable data processing.
We have recommendation solutions for everyone: individuals, recruiters and advertisers. In our view, recommendations are ubiquitous and they permeate the whole site.
Very visible value addition: the right information to the right user at the right time.
Integral to virality of the network.
Problems:
• Computation-intensive algorithms
• Variety of recommendations
• Lots of A/B testing required
50% of job views/applications by members are a direct result of recommendations. Similar results across all recommendations.
• Aggregations
• Complex transformations
• Long-term data storage
• Load sharing (?)
The Hadoop impact: moving ETL jobs to Hadoop has helped make data available for ad hoc queries by data scientists.
We have a unique perspective into data. Before the collapse, we saw substantial spikes in user activity for the following 5 companies during major financial events:
One hypothesis is that many of the employees left the financial industry. According to the LinkedIn data set, that just isn’t true. (Bank of America acquired Merrill Lynch, and Nomura acquired Lehman Brothers’ franchise in the Asia Pacific region.) Barclays was by far the biggest beneficiary, scooping up 10% of the laid-off talent, followed by Credit Suisse at 1.5% and Citigroup at 1.1%.
ENG SLIDE: What is a data scientist? What are the different technologies, big data, challenges and opportunities? Open source: InMaps (hackday projects), full-fledged products.
Add images for SQL/MapReduce
Hadoop is, and will always be, optimized for sequential reads and throughput rather than speed of completion.