Hadoop Summit 2010 - application track
Data Applications and Infrastructure at LinkedIn
Jay Kreps, LinkedIn

1. Data Applications and Infrastructure at LinkedIn
   Jay Kreps, LinkedIn
2. Plan
   - `whoami`
   - Data products
   - Data infrastructure
3. Data-centric engineering at LinkedIn
   - LinkedIn's Search, Network & Analytics team
   - Domain: derived data
   - Products
     - Search
     - People You May Know
     - Social graph services
     - Job matching
     - Collaborative filtering
   - Infrastructure
4. People You May Know
5. Other products
6. People You May Know
   - 120 billion relationships scored... every day
   - 82 Hadoop jobs (not counting ETL)
   - Around 16 TB of intermediate data
   - Machine learning model to predict the probability of a connection
   - Bloom filters for approximate filtering joins, a 10x performance improvement (see the sketch below)
   - About 5 test algorithms per week
   - 2 engineers
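The Bloom-filter point is worth a concrete illustration. Below is a minimal sketch, assuming Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter: a filter built from the smaller join side is loaded into each mapper, and records from the larger side whose keys cannot possibly match are dropped before the shuffle. The class name CandidateFilterMapper, the side-file path, and the tab-delimited record layout are all invented for illustration; this is not the actual PYMK job code.

    // Sketch only: pre-filter the large side of a join with a Bloom filter
    // built from the small side's keys, so non-matching records never reach
    // the shuffle. Paths and class names are hypothetical.
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;

    public class CandidateFilterMapper extends Mapper<Object, Text, Text, Text> {
      private final BloomFilter memberFilter = new BloomFilter();

      @Override
      protected void setup(Context context) throws IOException {
        // Assumption: a filter serialized from the smaller join side was
        // shipped to every mapper (e.g. via the distributed cache).
        Path side = new Path("pymk/member-ids.bloom");
        try (FSDataInputStream in =
                 FileSystem.get(context.getConfiguration()).open(side)) {
          memberFilter.readFields(in);
        }
      }

      @Override
      protected void map(Object offset, Text line, Context context)
          throws IOException, InterruptedException {
        String joinKey = line.toString().split("\t")[0];
        // False positives only cost a little extra shuffle data; true
        // negatives are dropped here, which is where the speedup comes from.
        if (memberFilter.membershipTest(new Key(joinKey.getBytes()))) {
          context.write(new Text(joinKey), line);
        }
      }
    }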
7. Relevance Products
   - You must fly entirely by the instruments
   - Scale and relevance are very closely linked
     - More is often better
     - Iteration time is essential
   - UI matters, really
   - We threw out custom non-Hadoop code that was faster
   - Opportunity to work directly on the business
8. Infrastructure as an Ecosystem
   - An isolated infrastructure team is usually a bad solution
     - Too isolated from the problems
   - The data product team has crushing problems
     - This area is extremely immature
   - People should want to use it
   - Treat it like a product
     - Either make money off it or give it away
     - Open source is a great solution
     - Custom software should be the best
9. Open Source
   - Zoie – real-time search indexing
   - Bobo – faceted search
   - Decomposer – very large matrix decomposition routines (now in Mahout)
   - Norbert – partition-aware cluster management & RPC
   - Voldemort – key/value storage
   - Kamikaze – compression package
   - Sensei – distributed search
   - Azkaban – Hadoop workflow
10. Azkaban: workflow = cron + make
11. Azkaban: workflow : Hadoop :: web framework : web app
12. Azkaban
13. Azkaban Examples
   - Example job source (a sketch of a job file follows below)
   - Example workflow UI
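The actual job source shown on this slide is not preserved in the transcript, so here is a minimal sketch of what an Azkaban job definition looks like, assuming the simple properties-file style the later "simple text files for config" bullet describes. The job name, jar, class, paths, and dependency names are all hypothetical, and the exact property keys may differ between Azkaban versions.

    # score-connections.job -- hypothetical job definition for illustration
    # type=command runs a shell command; dependencies= wires jobs into a DAG
    type=command
    command=hadoop jar pymk.jar com.linkedin.pymk.ScoreConnections /data/derived/pymk/candidates /data/derived/pymk/scores
    dependencies=extract-connections,build-bloom-filter

Given a directory of such files, the scheduler knows the order to run jobs in and, after a failure, can restart from the failed node rather than from the top of the flow.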
14. Workflow
15. Azkaban
   - 82 jobs running every day just for PYMK
     - ...need to run in the right order
     - ...need to restart from failure
     - ...need to enforce dependencies
   - GUI is important for operations
   - Alerting, resource locking, config management, etc.
   - Deployable zip files of code represent a job flow
   - Everyone works independently, releases/deploys independently
   - Simple text files for config (but can use the GUI in a pinch)
   - Aggregated logs and run times
   - Restart from the point of failure
16. Data Deployment
   How do you get your multi-billion-edge probabilistic relationship graph to the live website to serve queries?
17. Voldemort
   - LinkedIn had many prior passes at this problem, all bad
     - MySQL
     - Oracle
     - Etc.
   - Fully distributed, partitioned, decentralized key-value storage (a client sketch follows below)
   - Supports pluggable storage engines
   - Online/offline cycle
   - Is this a good fit?
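For the online side of that cycle, a rough sketch of what a read from a Voldemort store looks like in the web application, using the standard Java client API; the bootstrap URL, store name, and key format are placeholders, not LinkedIn's real configuration.

    // Rough sketch of an online read from Voldemort; host, port, store name
    // and key are placeholders for illustration.
    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class RecommendationReader {
      public static void main(String[] args) {
        StoreClientFactory factory = new SocketStoreClientFactory(
            new ClientConfig().setBootstrapUrls("tcp://voldemort-host:6666"));
        StoreClient<String, String> store = factory.getStoreClient("pymk");

        // The website's request path is just a key lookup by member id.
        Versioned<String> recs = store.get("member:12345");
        System.out.println(recs == null ? "no recommendations" : recs.getValue());
      }
    }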
18. Voldemort Data Deployment
19. Voldemort Data Deployment
   - Building a multi-TB lookup structure is really, really hard work... it is a batch operation
   - Solution: build this structure in Hadoop
   - Tradeoff: build time vs. lookup time
     - Minimal perfect hashing requires only 2.5 bits per key, but is slow to build
     - Sorted indexes are a fast, simple alternative (see the lookup sketch below)
   - The build is a no-op map/reduce (just sorting)
   - The data load will saturate the network even for a small cluster
   - Voldemort gives
     - failover
     - load balancing
     - monitoring
     - remote access
     - partitioning
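To make the build-time vs. lookup-time tradeoff concrete, here is a small illustration of the sorted-index idea. It is not Voldemort's actual on-disk read-only store format: it just assumes fixed-width (key hash, offset) entries that Hadoop has already sorted, so the server answers a lookup with a binary search over the index plus one seek into the data file.

    // Illustration of the sorted-index approach, with a hypothetical layout:
    // 16-byte key hash followed by an 8-byte data-file offset, entries sorted
    // by key hash. Lookup is binary search over the index file.
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SortedIndexLookup {
      private static final int KEY_BYTES = 16;    // e.g. an MD5 of the key
      private static final int OFFSET_BYTES = 8;
      private static final int ENTRY_BYTES = KEY_BYTES + OFFSET_BYTES;

      private final RandomAccessFile index;

      public SortedIndexLookup(RandomAccessFile index) {
        this.index = index;
      }

      /** Returns the data-file offset for keyHash, or -1 if it is absent. */
      public long findOffset(byte[] keyHash) throws IOException {
        long lo = 0, hi = index.length() / ENTRY_BYTES - 1;
        byte[] candidate = new byte[KEY_BYTES];
        while (lo <= hi) {
          long mid = (lo + hi) >>> 1;
          index.seek(mid * ENTRY_BYTES);
          index.readFully(candidate);
          int cmp = compare(candidate, keyHash);
          if (cmp == 0) return index.readLong();  // offset follows the key
          if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
      }

      private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < KEY_BYTES; i++) {
          int diff = (a[i] & 0xff) - (b[i] & 0xff);
          if (diff != 0) return diff;
        }
        return 0;
      }
    }

Building such a file needs nothing beyond sorting by key hash, which is why the build can be a no-op map/reduce, while lookups stay at a handful of seeks per request.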
20. Voldemort Data Deployment
   - If data takes 24 hours to generate, it may take 24 hours to fix
     - Need a faster rollback strategy
   - Cold disk space is cheap
     - Store the live copy
     - Store the copy currently being updated
     - Store N backup copies
     - "Atomic" swap (a swap sketch follows below)
   - The cache needs to start warm
   - I/O and network throttling to limit the impact of deployment
   - Our production latency is < 3 ms from the client side
   - A 900 GB store takes ~1:30 to build on a 45-node dev cluster
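A sketch of the versioned-directory idea behind the "atomic" swap and cheap rollback: every build lands in its own version directory, the live copy is whatever a single link points at, and both promotion and rollback are one link swap. The directory layout here is invented for the example; Voldemort's read-only stores expose their own swap and rollback operations rather than this exact symlink scheme.

    // Illustration only: each build lands in version-N, "current" is a
    // symlink to the live copy, and promotion or rollback is a single swap.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class StoreSwap {
      public static void pointCurrentAt(Path storeDir, int version) throws IOException {
        Path target = storeDir.resolve("version-" + version);
        Path current = storeDir.resolve("current");
        Path staging = storeDir.resolve("current.tmp");
        // Build the new link off to the side, then rename it over the old
        // one; on a POSIX filesystem the rename is a single atomic step, so
        // readers always see either the old copy or the new one.
        Files.deleteIfExists(staging);
        Files.createSymbolicLink(staging, target);
        Files.move(staging, current, StandardCopyOption.REPLACE_EXISTING);
      }

      public static void main(String[] args) throws IOException {
        Path store = Paths.get("/data/stores/pymk");
        pointCurrentAt(store, 42);  // promote the freshly built copy
        pointCurrentAt(store, 41);  // rollback is the same operation
      }
    }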
21. Questions?
