Proprietary & Confidential 6/7/2020 1
Large Scale Computing
@ Linkedin
Bhupesh Bansal
Software Engineer
Linkedin
The Plan
• What Is Linkedin
• Large Scale problems @ Linkedin
• Scalable Architectures
• Hadoop : The mighty elephant
• Questions
What is Linkedin?
• The largest professional social network
• Some high-level statistics:
– 50M active users, in 147 Industries and 244 Countries (580+ Regions)
– ~10M unique daily visitors
– ~25M weekly searches (150 QPS at peak)
– ~50M weekly profile views
– ~2M connections per day
Why should you care?
• Control your brand
• Market yourself
• Find inside connections
• Learn from the wisdom of the crowd
Build Your Brand
What is your Identity?
How do you want the world to see and know about you?
My Network My Strength
1. Explore opportunities before it's too late
2. Make your network go far. (well really really far !!)
3. Seek expert advice
4. Stand on the shoulders of giants.
Wisdom of your network
1. Slideshare : See slides uploaded by your network
2. Answers : See expert answers & who they are
3. Groups
4. …
Large Scale Problems @ Linkedin
Linkedin Network Cache
• Third Degree network
• Friends of Friends of Friends
• Scales exponentially
• First degree: 100
• Second degree: (first degree)²
• Third degree: (first degree)³
• Ranges from 10,000 to 20 M.
• Is Unique for every user
• How do we solve it?
• Custom graph engine app.
• Cache the entire third degree for logged-in members
• Graph algorithms
• Minimize memory footprint of cache
• Bit vector optimizations
• Compression: p4Delta ..
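The caching idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not LinkedIn's graph engine: the toy graph, the function names, and the use of a Python integer as the bit vector are all hypothetical. It walks the graph breadth-first to depth three, then packs the resulting member ids into a bit vector so that a membership test is a single bit operation.

```python
from collections import deque


def third_degree_network(graph, member, max_depth=3):
    """Return {member_id: degree} for everyone within max_depth hops."""
    degree = {member: 0}
    queue = deque([member])
    while queue:
        current = queue.popleft()
        if degree[current] == max_depth:
            continue  # do not expand beyond the third degree
        for neighbor in graph.get(current, ()):
            if neighbor not in degree:
                degree[neighbor] = degree[current] + 1
                queue.append(neighbor)
    del degree[member]  # the member is not part of their own network
    return degree


def to_bit_vector(member_ids):
    """Pack member ids into one Python int used as a bit vector."""
    bits = 0
    for member_id in member_ids:
        bits |= 1 << member_id
    return bits


def contains(bits, member_id):
    """Test whether member_id is set in the bit vector."""
    return (bits >> member_id) & 1 == 1


# Hypothetical toy graph: a chain of connections 0-1-2-3-4.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
network = third_degree_network(graph, 0)
cache_entry = to_bit_vector(network)
```

Here member 4 is four hops away from member 0, so it falls outside the cached network. A real cache would also need the compression step the slide mentions, which this sketch leaves out.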
Linkedin Search
• Linkedin Search
• 50 M documents
• Business need to filter/rank
with third degree cache
• Response time : 30-50 ms.
• How do we solve it?
• Distributed search engine.
• Documents partitioned
• 0-5M, 5-10M, ..
• Each partition fetches the network in its range
• Filter/rank and return results.
Search & Filter on Third Degree
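A minimal sketch of the partitioned search described above, assuming documents are split into fixed id ranges and each partition receives only the part of the searcher's network that falls inside its range. The partition size, the profile data, and every function name are hypothetical; ranking by degree alone is a stand-in for the real relevance function.

```python
PARTITION_SIZE = 5  # stand-in for the 5M-document ranges on the slide


def partition_for(doc_id):
    """Map a document id to the partition that owns its id range."""
    return doc_id // PARTITION_SIZE


def search_partition(partition, docs, query, network):
    """Search one partition, keeping only hits inside the searcher's network."""
    low = partition * PARTITION_SIZE
    high = low + PARTITION_SIZE
    # Each partition only needs the slice of the network in its own range.
    network_in_range = {m: d for m, d in network.items() if low <= m < high}
    hits = []
    for doc_id, text in docs.items():
        if doc_id in network_in_range and query in text.lower():
            hits.append((network_in_range[doc_id], doc_id))
    return hits


def search(partitions, query, network):
    """Scatter the query to every partition, then merge and rank by degree."""
    merged = []
    for partition, docs in partitions.items():
        merged.extend(search_partition(partition, docs, query, network))
    return [doc_id for _, doc_id in sorted(merged)]


# Hypothetical profiles keyed by member id.
profiles = {
    1: "Hadoop engineer",
    3: "Java developer",
    6: "Hadoop architect",
    8: "hadoop admin",
}
partitions = {}
for doc_id, text in profiles.items():
    partitions.setdefault(partition_for(doc_id), {})[doc_id] = text

network = {1: 2, 3: 1, 6: 1}  # member id -> degree; member 8 is not connected
results = search(partitions, "hadoop", network)
```

In this toy run, member 6 (first degree) ranks ahead of member 1 (second degree), and member 8 is filtered out because it is not in the network. In a real deployment the loop over partitions would be a parallel fan-out to separate machines.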
Data Analytics
• Data is LinkedIn's primary asset
• Linkedin Data has immense value
• Career trends
• Bubble bursts
• Hot keywords
• find experts/ Influencers
• Big Data VS Better Algorithm
• Big fight in machine learning
community
• personal opinion : big data rocks
if you know your data
Image1 source : http://otec.uoregon.edu/images/data-to-wisdom.gif
Image2 source : http://www.nature.com/ki/journal/v62/n5/images/4493262f1b.gif
Data Analytics Examples
People You May Know
• Shows people you should be connected to
• For 50 M members
• Potential pairs: (50 M)² = 2,500,000 billion
• Can be Narrowed down
• Triangle closing
• School/Company overlap
• .. Other factors
• Secret sauce
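Triangle closing, the first narrowing factor listed above, can be illustrated by counting shared connections: two members who are not yet connected but have many connections in common are likely candidates. The sketch below is an assumption about how such a count might be computed on a small in-memory graph. The names and data are hypothetical, and the slide does not describe the actual scoring.

```python
from collections import Counter


def people_you_may_know(connections, member, top_n=3):
    """Rank non-connections by the number of connections shared with member."""
    direct = connections.get(member, set())
    common = Counter()
    for friend in direct:
        for candidate in connections.get(friend, set()):
            if candidate != member and candidate not in direct:
                common[candidate] += 1  # one more open triangle to close
    # Sort by shared-connection count (descending), then name for stable output.
    ranked = sorted(common.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical connection graph.
connections = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
suggestions = people_you_may_know(connections, "ann")
```

Only candidates reachable through an existing connection are ever scored, which is what keeps the work far below the full member-pair count.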
Scalable Architectures
• What is a scalable architecture ?
• Ability to do more if desired
• With minimum effort
• w/o software rearchitecture
• How to scale ?
• Scale Vertically (easy but
costly)
• Scale Horizontally (Harder
but very scalable if done right)
• Divide & Conquer is the only
way to scale horizontally
• Sharding/Partitioning
Scalable Architectures
Respect the elephant
Hadoop Origin?
• “For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy.”
• Hadoop was born to fulfill this very need
What is Hadoop?
• An open-source, Java-based implementation of the distributed MapReduce paradigm
• Apache project with a very active community
– Heavy support from Yahoo
• Distributed Storage + Distributed Processing + Job Management
• Runs on commodity hardware (heterogeneous cluster)
• Scalable, reliable, fault tolerant and very easy to learn
• Yahoo currently runs clusters with 10,000 nodes that process 10 TB of compressed data for its production search relevance processing
• Written by Lucene author Doug Cutting (and others)
• Plethora of valuable sub-projects
– Pig, Hive, HBase, Mahout, Katta
What is Hadoop used for?
• Search Relevance: Yahoo, Amazon, Powerset, Zvents
• Log Processing: Facebook, Yahoo, Joost, Last.fm
• Recommendation System: Facebook
• Data Warehouse: Facebook, AOL
• Video and Image Analysis: New York Times, Eyealike
• Prediction Models: Quantcast, Goldman Sachs
• Genome Sequencing: University of Maryland
• Academia: University of Washington, Cornell, Stanford, Carnegie Mellon
Why Use Hadoop
• Distributed/parallel programs are very very hard to
write
• Tip of iceberg is the actual program/business logic
• Map/Reduce code
• The hidden iceberg is
• Parallelization (Divide & Conquer)
• Data Management (TB to PB)
• Fault Tolerance
• Scalability
• Data-locality optimizations
• Monitoring
• Job isolation
Hadoop core components
• HDFS distributed file system
• User space distributed data management
• Not a real file system (no POSIX, not in the
kernel)
• Replication
• Rebalancing
• Very high aggregate throughput
• Files are immutable
• MapReduce layer
• Very simple but powerful programming model
• Move computation to the data
• Cope with hardware failures
• Non-goal: low latency
Map Reduce
• Google paper started it all:
•“MapReduce: Simplified Data Processing on Large Clusters”, OSDI
2004
• Map = per “record” computation (extract/transform phase)
• Each map task gets a block or piece of input data
• Shuffle = copy/collect data together (hidden from users)
• Collect all values for same key together.
• Can provide custom sort implementations.
• Reduce = final computation (aggregation/summarize phase)
• All key/value pairs for one key go to one reducer.
• Reduce is optional
Map Reduce data flow
Map Reduce Example : WordCount
Image source : http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
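The WordCount example can be mimicked in a single process to show the three phases from the Map Reduce slide. This is a sketch that simulates map, shuffle, and reduce in memory. It does not use the Hadoop API, and the function names and input lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit (word, 1) for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: collect all values for the same key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: aggregate the list of values for one key."""
    return key, sum(values)


def word_count(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"])
```

In Hadoop the map calls run on many machines and the shuffle moves data over the network, but the user still writes only the equivalents of `map_phase` and `reduce_phase`.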
Map Reduce Implementation
• Jobs are divided into small pieces (Tasks)
• Improves load balancing
• Faster recovery from failed nodes
• Master
• What to run?
• Where to run?
• Slaves
• One tasktracker per machine
• Can configure how many mappers / reducers per tasktracker
• Manage task failures by re-starting on different node.
• Speculative execution if a task is slow or has failed earlier
• Maps and reduces run as single-threaded, individual JVMs
• Kill -9
• Resource Isolation
• Complete cleanup
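The failure-handling bullet (restart a failed task on a different node) can be sketched as a simple retry loop. This is a toy model under stated assumptions: the node names, the `flaky_map_task` function, and the attempt limit are hypothetical, and a real scheduler would also handle slow tasks through speculative execution, which is not modeled here.

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Try a task on successive nodes until one attempt succeeds."""
    attempts = []
    for node in nodes[:max_attempts]:
        try:
            result = task(node)
        except RuntimeError as error:
            # Record the failure and move on to a different node.
            attempts.append((node, f"failed: {error}"))
            continue
        attempts.append((node, "ok"))
        return result, attempts
    raise RuntimeError(f"task failed on every node tried: {attempts}")


def flaky_map_task(node):
    """Hypothetical task that always fails on one bad node."""
    if node == "node-a":
        raise RuntimeError("disk error")
    return f"output written by {node}"


result, attempts = run_with_retries(
    flaky_map_task, ["node-a", "node-b", "node-c"]
)
```

Because tasks are small and independent, rerunning one of them elsewhere costs little, which is the point of dividing jobs into small pieces.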
Map Reduce
• Job tracker
• Splits the input and assigns the pieces to map tasks
• Schedules and monitors map tasks (heartbeat)
• On completion, schedules reduce tasks
• Task tracker
• Execute map tasks – call mapper for every input record
• Execute reduce tasks – call reducer for every intermediate key, list of
values pair
• Handle partitioning of map outputs
• Handle sorting and grouping of reducer input
[Diagram: Jobtracker and three tasktrackers. Labels: Input Job (mapper, reducer, input); Assign tasks; Data transfer.]
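The partitioning and grouping steps handled by the task tracker can be illustrated with a hash-modulo partitioner, which guarantees that every value for a given key lands on the same reducer. The sketch below uses `zlib.crc32` as a stable hash. That choice, the function names, and the sample data are assumptions for illustration, not Hadoop's code.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key with a stable hash modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route_map_output(pairs, num_reducers):
    """Group map output by reducer, then sort and group by key within each."""
    buckets = defaultdict(lambda: defaultdict(list))
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return {
        reducer: dict(sorted(keys.items()))
        for reducer, keys in buckets.items()
    }


map_output = [("car", 1), ("deer", 1), ("car", 1), ("bear", 1), ("car", 1)]
reducer_input = route_map_output(map_output, num_reducers=2)
owners = [reducer for reducer, keys in reducer_input.items() if "car" in keys]
```

Whichever reducer the hash selects, `owners` contains exactly one entry, so a single reduce call sees the complete list of values for "car".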
HDFS II
• Namenode (master)
– Maps file name to set of blocks
– Maps each block to the list of datanodes where it resides
– Replication engine for blocks
– Checksum-based data coherency checker
– Issues
  • Metadata in memory (single point of failure)
  • Secondary namenode checkpoints metadata
• Datanodes (slaves)
– Store blocks on the local file system
– Store checksums of blocks
– Read/write data to clients directly
– Periodically send a report of all existing blocks to the Namenode
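The two namenode mappings and the periodic block report can be modeled with a pair of dictionaries. This is a toy sketch, not the HDFS implementation: the class name `ToyNamenode`, its methods, and the block and datanode ids are all hypothetical.

```python
class ToyNamenode:
    """In-memory model of namenode metadata (illustrative only)."""

    def __init__(self, replication=3):
        self.replication = replication
        self.file_to_blocks = {}      # file name -> ordered list of block ids
        self.block_to_datanodes = {}  # block id -> set of datanodes holding it

    def add_file(self, name, block_ids):
        self.file_to_blocks[name] = list(block_ids)
        for block_id in block_ids:
            self.block_to_datanodes.setdefault(block_id, set())

    def block_report(self, datanode, block_ids):
        """Record a datanode's full list of blocks, replacing its last report."""
        for holders in self.block_to_datanodes.values():
            holders.discard(datanode)
        for block_id in block_ids:
            if block_id in self.block_to_datanodes:
                self.block_to_datanodes[block_id].add(datanode)

    def locate(self, name):
        """Return (block id, datanodes) pairs so clients can read directly."""
        return [
            (block_id, sorted(self.block_to_datanodes[block_id]))
            for block_id in self.file_to_blocks[name]
        ]

    def under_replicated(self):
        """Blocks the replication engine would need to copy again."""
        return sorted(
            block_id
            for block_id, holders in self.block_to_datanodes.items()
            if len(holders) < self.replication
        )


namenode = ToyNamenode(replication=2)
namenode.add_file("/logs/day1", ["blk_1", "blk_2"])
namenode.block_report("dn1", ["blk_1", "blk_2"])
namenode.block_report("dn2", ["blk_1"])
```

Since both dictionaries live only in memory in this model, losing the process loses the whole file system view, which is the single point of failure the slide lists under issues.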
HDFS I
Hadoop @ Linkedin
• People You May Know
• 40 individual jobs:
• Graph analysis
• School & Company overlap
• Collaborative filtering
• Misc. other factors
• Regression model combines all values
• Content scoring
• Collaborative Filtering
• Sessionization and cause analysis
• Phrase analysis
• Other derived data: company data, matching fields, seniority, etc.
• Search relevance
• Peoplerank
• Link keywords to results based on user behavior
• User bucketing
Data Visualization
• Can be very very useful.
• Saves a ton of time
• Very insightful
• Pictures = 1000+ words
• Challenges
• Amount of data is huge
• Need to write visualization for each problem.
• Time spent per visualization is huge
• Hard to make it distributed/scalable
Data Visualization
Questions
References
1. Hadoop. http://hadoop.apache.org/
2. Jeffrey Dean and Sanjay Ghemawat. MapReduce:
Simplified Data Processing on Large Clusters.
http://labs.google.com/papers/mapreduce.html
3. http://code.google.com/edu/parallel/index.html
4. http://www.youtube.com/watch?v=yjPBkvYh-ss
5. http://www.youtube.com/watch?v=-vD6PUdf3Js
6. S. Ghemawat, H. Gobioff, and S. Leung. The Google
File System. http://labs.google.com/papers/gfs.html
7. Linkedin.com
8. DJ Patil Talk at Scale Unlimited 2009


Editor's notes

  1. What is batch computing? Cloud example
  2. Data driven features - Examples from google, facebook, etc
  3. Data driven features - Examples from google, facebook, etc
  4. Data driven features - Examples from google, facebook, etc
  5. Data driven features - Examples from google, facebook, etc
  6. Data driven features - Examples from google, facebook, etc
  7. Data driven features - Examples from google, facebook, etc
  8. Data driven features - Examples from google, facebook, etc
  9. Data driven features - Examples from google, facebook, etc
  10. Transition:
  11. http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
  12. Show blocks
  13. Talk through other people’s example
  14. Talk through other people’s example
  15. Talk through other people’s example
  16. Talk through other people’s example