HOW LUCENE POWERS LINKEDIN SEGMENTATION & TARGETING PLATFORM
Hien Luu & Raj Rangaswamy
About Us

Hien Luu
Rajasekaran Rangaswamy
Agenda
•  Little bit about LinkedIn
•  Segmentation & Targeting Platform Overview
•  How Lucene powers Segmentation & Targeting Platform
•  Q&A
Our Mission
Connect the world’s professionals to make them
more productive and successful.

Our Vision
Create economic opportunity for every
professional in the world.

Members First!
The world’s largest professional network
Over 65% of members are now international

>30M
>90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
>3M Company Pages
19 languages
>5.7B professional searches in 2012
Other Company Facts
•  Headquartered in Mountain View, Calif., with offices around the world
•  LinkedIn has ~4200 full-time employees located around the world
Segmentation & Targeting Platform Overview
1. Create attributes
§  Name
§  Email
§  State
§  Occupation
§  Etc.

2. Attributes Added to Table

Name        Email            State       Occupation
John Smith  jsmith@blah.com  California  Engineer
Jane Smith  smithj@mail.com  Nevada      HR Manager
Jane Doe    jdoe@email.com   California  Engineer

3. Create Target Segment: California, Engineer

Name        Email            State       Occupation
John Smith  jsmith@blah.com  California  Engineer
Jane Doe    jdoe@email.com   California  Engineer

4. Export List & Send to Vendor

…
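The four steps above can be sketched in a few lines of Python. This is only an illustration of the workflow, not the platform's code: the record layout and the names MEMBERS, build_segment and export_list are assumptions made for the example.

```python
import csv
import io

# Hypothetical member records mirroring the sample table on this slide.
MEMBERS = [
    {"name": "John Smith", "email": "jsmith@blah.com", "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "smithj@mail.com", "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jdoe@email.com", "state": "California", "occupation": "Engineer"},
]


def build_segment(members, **criteria):
    """Step 3: keep the members whose attributes equal every given criterion."""
    return [m for m in members if all(m.get(k) == v for k, v in criteria.items())]


def export_list(segment, fields=("name", "email")):
    """Step 4: render the segment as CSV text (header plus one row per member)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(segment)
    return buf.getvalue()


segment = build_segment(MEMBERS, state="California", occupation="Engineer")
print(export_list(segment))
```

A linear scan like this is fine for three rows; the rest of the deck is about making the same kind of filter fast at much larger scale.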
  
Segmentation & Targeting Platform Overview

•  Business definition
–  Business would like to launch new campaigns often
–  Business would like to specify targeting criteria using an
arbitrary set of attributes
–  Attributes need to be computed to fulfill the targeting
criteria
–  The attribute data resides on Hadoop or TD
–  Business is most comfortable with a SQL-like language
Segmentation & Targeting Platform Overview

Attribute Computation Engine

Attribute Serving Engine
Segmentation & Targeting Platform Overview
Attribute consolidation

Self-service

Attribute Computation Engine

Support various data sources

Attribute availability
Segmentation & Targeting Platform Overview

[Diagram: Attribute computation. Labels: PB, TB, TB, ~238M, ~440]
Segmentation & Targeting Platform Overview
Build segments

Self-service

Attribute Serving Engine

Attribute predicate expression

Build lists
Segmentation & Targeting Platform Overview
[Diagram: Attribute Serving Engine supporting count, filter, sum and complex expressions. Labels: $, 1234, Σ, ~238M, ~440]
Segmentation & Targeting Platform Overview

Who are the job seekers?
Who are the LinkedIn Talent Solutions prospects
in Europe?
Who are the North American recruiters that
don’t work for a competitor?
How Lucene powers Segmentation & Targeting Platform

•  Architecture
–  Indexer Architecture
–  Serving Architecture
•  Load Balanced Model
•  Next Steps - Distributed Model
•  DocValues
•  Lessons Learnt
•  Why not use an existing solution?
Architecture

[Diagram: platform components]
Attribute Serving Engine
Attribute Computation Engine
Data Storage Layer
Attribute Indexing
Attribute Creation Engine
Attribute Serving Engine
Attribute Materialization Engine
Attribute Metastore
Architecture

[Diagram: indexer architecture]
Attribute Definitions: mysql attribute store
Avro data in HDFS → Hadoop Indexer MR → shard 1, shard 2, … shard n (HDFS) → Index Merger → Web Servers

Mapper
K => AvroKey<GenericRecord>
V => AvroValue<NullWritable>

Reducer
K => NullWritable
V => LuceneDocumentWrapper

LuceneOutputFormat
RecordWriter: LuceneDocumentWrapper, Document, Index
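As a rough, single-process stand-in for the indexer pipeline on this slide, the sketch below partitions records into shards, builds a small inverted index per shard, and merges the per-shard indexes afterwards. It does not use Hadoop or Lucene. The hash-based sharding rule, the record layout and all function names are assumptions for illustration, and whether the real Index Merger combines shards in this way is not stated on the slide.

```python
import zlib
from collections import defaultdict


def shard_for(member_id, num_shards):
    """Assumed sharding rule: a stable hash of the member id, modulo the shard count."""
    return zlib.crc32(str(member_id).encode("utf-8")) % num_shards


def build_shard_indexes(records, num_shards):
    """'Indexer' step: one inverted index per shard, mapping (field, value) -> member ids."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for rec in records:
        index = shards[shard_for(rec["id"], num_shards)]
        for field, value in rec.items():
            if field != "id":
                index[(field, value)].add(rec["id"])
    return shards


def merge_indexes(indexes):
    """'Merger' step: union the posting lists of several partial indexes."""
    merged = defaultdict(set)
    for index in indexes:
        for term, ids in index.items():
            merged[term] |= ids
    return {term: sorted(ids) for term, ids in merged.items()}


records = [
    {"id": 1, "state": "California", "occupation": "Engineer"},
    {"id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"id": 3, "state": "California", "occupation": "Engineer"},
]
shards = build_shard_indexes(records, num_shards=4)
print(merge_indexes(shards)[("state", "California")])
```

Because every record lands in exactly one shard, the shards can be built independently and merged later in a separate step without double counting.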
  
Architecture
[Diagram: serving architecture]
JSON Predicate Expression → JSON Lucene Query Parser → Inverted Index (×3) → Segment & List
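The serving path (a JSON predicate expression parsed into a query and run against inverted indexes) can be illustrated with plain Python sets. The deck does not show the JSON format, so the "op"/"args"/"field"/"value" schema below is entirely hypothetical, and the set operations only stand in for what a Lucene boolean query would do.

```python
import json

# Hypothetical inverted index: (attribute, value) -> set of member ids.
INDEX = {
    ("state", "California"): {1, 3},
    ("state", "Nevada"): {2},
    ("occupation", "Engineer"): {1, 3},
    ("occupation", "HR Manager"): {2},
}
ALL_IDS = {1, 2, 3}


def evaluate(node):
    """Recursively turn a predicate tree into the set of matching member ids."""
    op = node["op"]
    if op == "term":
        return set(INDEX.get((node["field"], node["value"]), set()))
    if op == "not":
        return ALL_IDS - evaluate(node["arg"])
    if op in ("and", "or"):
        # Assumes "args" is a non-empty list of sub-expressions.
        results = [evaluate(child) for child in node["args"]]
        return set.intersection(*results) if op == "and" else set.union(*results)
    raise ValueError(f"unknown operator: {op}")


def run(expression_json):
    """Parse the JSON text and return the matching member ids in sorted order."""
    return sorted(evaluate(json.loads(expression_json)))


expression = """
{"op": "and", "args": [
    {"op": "term", "field": "state", "value": "California"},
    {"op": "not", "arg": {"op": "term", "field": "occupation", "value": "HR Manager"}}
]}
"""
print(run(expression))
```

The length of the returned list gives a segment count, and the ids themselves give the list to export.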
  
Serving – Load Balanced Model
[Diagram: load balanced serving]
HTTP Request → Load Balancer → Web Server 1 (Shard 1), Web Server 2 (Shard 2), … Web Server n (Shard n)
Shared Drive
Serving – Load Balanced Model

But wait…
•  Is load balancing alone good enough?
•  What about distribution and failover?
Next Steps – Distributed Model

Apache Helix
•  A generic cluster management framework
•  Manages partitioned and replicated resources in distributed systems
•  Built on top of ZooKeeper; hides the complexity of ZK primitives
•  Provides distributed features such as leader election, two-phase
commit etc. via a state machine model

http://helix.incubator.apache.org/
Next Steps – Distributed Model

HTTP Request → Load Balancer → Scatter Gather

Web Server 1: Shard 1 active, Shard 2 standby
Web Server 2: Shard 2 active, Shard 3 standby
Web Server 3: Shard 3 active, Shard 1 standby
Next Steps – Distributed Model

HTTP Request → Load Balancer → Scatter Gather

Web Server 1: Shard 1 active, Shard 2 standby
Web Server 2: Shard 2 active, Shard 3 active (was standby)
Web Server 3: Shard 3 failure, Shard 1 failure
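A minimal sketch of scatter-gather with failover, assuming the replica layout the diagram suggests (each shard active on one web server and standby on another). It is a simplification: the query fans out sequentially rather than in parallel, and the caller simply retries the standby, whereas in the proposed design a cluster manager would be expected to promote standbys. All class and function names are hypothetical.

```python
class ShardReplica:
    """One copy of a shard hosted on a web server."""

    def __init__(self, shard_id, server, doc_ids, healthy=True):
        self.shard_id = shard_id
        self.server = server
        self.doc_ids = doc_ids
        self.healthy = healthy

    def search(self, predicate):
        if not self.healthy:
            raise ConnectionError(f"{self.server} is down")
        return [d for d in self.doc_ids if predicate(d)]


def build_cluster(failed_servers=()):
    """Three shards, each listed as [active, standby], laid out as in the diagram."""
    docs = {1: [1, 4], 2: [2, 5], 3: [3, 6]}
    layout = {1: ["ws1", "ws3"], 2: ["ws2", "ws1"], 3: ["ws3", "ws2"]}
    return {
        shard_id: [ShardReplica(shard_id, srv, docs[shard_id], srv not in failed_servers)
                   for srv in servers]
        for shard_id, servers in layout.items()
    }


def scatter_gather(replicas_by_shard, predicate):
    """Query every shard once, trying the active replica first and then the standby."""
    hits = []
    for shard_id, replicas in sorted(replicas_by_shard.items()):
        for replica in replicas:
            try:
                hits.extend(replica.search(predicate))
                break
            except ConnectionError:
                continue
        else:
            raise RuntimeError(f"no healthy replica for shard {shard_id}")
    return sorted(hits)


# Web Server 3 is down: shard 3 is answered by its standby on Web Server 2.
print(scatter_gather(build_cluster(failed_servers={"ws3"}), lambda d: True))
```

With one server down every shard is still answered exactly once; losing both hosts of the same shard is reported as an error rather than returning partial results.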
DocValues – Use Case

•  Once segments are built, users want to forecast, see a
target revenue projection for the campaigns that they
want to run.
•  Campaigns can be run on various Revenue Models
•  This involves adding per member Propensity Scores and
Dollar Amounts
DocValues – Why not Stored Fields?

•  Stored fields have one indirection per document, resulting in two disk seeks per document
   Document ID → .fdx (fetch file pointer to field data) → .fdt (scan by id until field is found)
•  Performance cost quickly adds up when fetching millions of documents
DocValues – Why not Field Cache?

•  Why not use Field Cache?
–  Is memory resident
–  Works fine when there is enough memory
–  But keeping millions of un-inverted values in memory is
impossible
–  Additional cost to parse values (from String and to String)
DocValues
•  Dense column based storage
–  (1 Value per Document and 1 Column per field and segment)
•  Accepts primitives
•  No conversion from/to String needed
•  Loads 80x-100x faster than building a FieldCache
•  All the work is done during Indexing
•  DocValue fields can be indexed and stored too
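The idea of a dense column with one primitive value per document can be sketched with Python's array module. This is not Lucene's DocValues API. The class name, the column names and the forecast formula (propensity score multiplied by dollar amount, summed over the segment) are assumptions chosen to match the use case described earlier.

```python
from array import array


class DocValuesColumn:
    """Dense column: exactly one primitive value per document, addressed by doc id."""

    def __init__(self, typecode, num_docs, default=0):
        # A typed array stores raw primitives, so no String parsing is needed on read.
        self.values = array(typecode, [default] * num_docs)

    def set(self, doc_id, value):
        self.values[doc_id] = value

    def get(self, doc_id):
        return self.values[doc_id]


def forecast_revenue(doc_ids, propensity, dollars):
    """Assumed formula: sum of propensity score times dollar amount over the segment."""
    return sum(propensity.get(d) * dollars.get(d) for d in doc_ids)


# Values are written once at indexing time, one slot per document.
propensity = DocValuesColumn("d", num_docs=4)
dollars = DocValuesColumn("d", num_docs=4)
for doc_id, (score, amount) in enumerate([(0.5, 100.0), (0.25, 200.0), (1.0, 50.0), (0.0, 10.0)]):
    propensity.set(doc_id, score)
    dollars.set(doc_id, amount)

# Documents 0, 1 and 2 form the segment returned by a query.
print(forecast_revenue([0, 1, 2], propensity, dollars))
```

Each lookup is a direct array access by document id, in contrast to the two-step .fdx/.fdt lookup that stored fields require per document.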
Lessons Learnt

Indexing
•  Reuse index writers, field and document instances
•  Create many partitions and merge them in a different
process
•  Rebuild (bootstrap) entire index if possible
•  Use partial updates with caution
•  Analyze the index
Lessons Learnt

Serving
•  Reuse a single instance of IndexSearcher
•  Limit usage of stored fields and term vectors
•  Plan for load balancing and failover
•  Cache term frequencies
•  Use different machines for serving and indexing
Why not use existing solutions?
•  Doesn’t allow dynamic schema
•  Difficult to bootstrap indexes built in Hadoop
•  Indexing elevates query latency

•  Doesn’t allow dynamic schema
•  Difficult to bootstrap indexes built in Hadoop
•  Larger memory overhead
•  Comparatively slow
Questions?

More info: data.linkedin.com
