4. Agenda
• Little bit about LinkedIn
• Segmentation & Targeting Platform Overview
• How Lucene powers Segmentation & Targeting Platform
• Q&A
5. Our Mission
Connect the world’s professionals to make them
more productive and successful.
Our Vision
Create economic opportunity for every
professional in the world.
Members First!
6. The world’s largest professional network
Over 65% of members are now international
>30M
>90% of Fortune 100 Companies use LinkedIn Talent Solutions to hire
>3M Company Pages
19 Languages
>5.7B Professional searches in 2012
7. Other Company Facts
• Headquartered in Mountain View, Calif., with offices around the world!
• LinkedIn has ~4200 full-time employees located around the world
11. Segmentation & Targeting Platform Overview
1. Create attributes
§ Name
§ Email
§ State
§ Occupation
§ Etc.
2. Attributes Added to Table
Name | Email | State | Occupation
John Smith | jsmith@blah.com | California | Engineer
Jane Smith | smithj@mail.com | Nevada | HR Manager
Jane Doe | jdoe@email.com | California | Engineer
3. Create Target Segment: California, Engineer
Name | Email | State | Occupation
John Smith | jsmith@blah.com | California | Engineer
Jane Doe | jdoe@email.com | California | Engineer
4. Export List & Send to Vendor
…
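The four steps on this slide can be sketched in a few lines of Python. This is only an illustration using the sample rows shown above; the `Member` class and the `create_segment` and `export_list` helpers are hypothetical names, not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    # Step 1: the attributes (hypothetical field names mirroring the slide)
    name: str
    email: str
    state: str
    occupation: str


# Step 2: attributes added to a table (sample rows from the slide)
MEMBERS = [
    Member("John Smith", "jsmith@blah.com", "California", "Engineer"),
    Member("Jane Smith", "smithj@mail.com", "Nevada", "HR Manager"),
    Member("Jane Doe", "jdoe@email.com", "California", "Engineer"),
]


def create_segment(members, **criteria):
    """Step 3: keep members whose attributes equal every given criterion."""
    return [
        m for m in members
        if all(getattr(m, attr) == value for attr, value in criteria.items())
    ]


def export_list(segment):
    """Step 4: render the segment as CSV lines, header first."""
    lines = ["name,email,state,occupation"]
    lines += [f"{m.name},{m.email},{m.state},{m.occupation}" for m in segment]
    return lines


segment = create_segment(MEMBERS, state="California", occupation="Engineer")
exported = export_list(segment)
```

A linear scan like this is fine for three rows; the rest of the deck is about serving the same kind of query over a much larger member base.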
12. Segmentation & Targeting Platform Overview
• Business definition
– Business would like to launch new campaigns often
– Business would like to specify targeting criteria using an arbitrary set of attributes
– Attributes need to be computed to fulfill the targeting criteria
– The attribute data resides on Hadoop or Teradata (TD)
– Business is most comfortable with a SQL-like language
18. Segmentation & Targeting Platform Overview
Who are the job seekers?
Who are the LinkedIn Talent Solution prospects
in Europe?
Who are the North American recruiters that
don’t work for a competitor?
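Questions like these reduce to boolean combinations of attribute terms, which is the kind of query an inverted index answers well. The sketch below is a toy in-memory inverted index, not Lucene; the member records, the attribute names (`region`, `occupation`, `employer_type`) and the `search` helper are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical member records; attribute names and values are illustrative only.
MEMBERS = {
    1: {"region": "north_america", "occupation": "recruiter", "employer_type": "competitor"},
    2: {"region": "north_america", "occupation": "recruiter", "employer_type": "other"},
    3: {"region": "europe", "occupation": "recruiter", "employer_type": "other"},
    4: {"region": "north_america", "occupation": "engineer", "employer_type": "other"},
}


def build_inverted_index(members):
    """Map each (field, value) term to the set of member ids that carry it."""
    index = defaultdict(set)
    for doc_id, attrs in members.items():
        for field, value in attrs.items():
            index[(field, value)].add(doc_id)
    return index


def search(index, must=(), must_not=()):
    """Intersect the postings of all `must` terms, then remove `must_not` postings."""
    if not must:
        return set()
    result = set.intersection(*(index.get(term, set()) for term in must))
    for term in must_not:
        result -= index.get(term, set())
    return result


index = build_inverted_index(MEMBERS)

# "North American recruiters that don't work for a competitor"
recruiters = search(
    index,
    must=[("region", "north_america"), ("occupation", "recruiter")],
    must_not=[("employer_type", "competitor")],
)
```

Each criterion becomes a set lookup, so adding another attribute to the targeting criteria adds one more intersection rather than another pass over every member.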
23. Architecture
[Diagram: indexer pipeline]
• Attribute Definitions: mysql attribute store
• Input: Avro data in HDFS
• Hadoop Indexer MR
– Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
– Reducer: K => NullWritable, V => LuceneDocumentWrapper
– LuceneOutputFormat → RecordWriter → LuceneDocumentWrapper → Document → Index
• Shards in HDFS: shard 1, shard 2, …, shard n
• Index Merger
• Web Servers
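The partition-then-merge shape of this pipeline can be sketched without Hadoop or Lucene. The code below is a rough analogue under stated assumptions, not the actual MapReduce job: the records, the `shard_for` hashing scheme and the dictionary-based "index" are hypothetical stand-ins for the Avro input, the partitioner and the Lucene index.

```python
import zlib
from collections import defaultdict

# Hypothetical input records standing in for the Avro data.
RECORDS = [
    {"id": "m1", "state": "California", "occupation": "Engineer"},
    {"id": "m2", "state": "Nevada", "occupation": "HR Manager"},
    {"id": "m3", "state": "California", "occupation": "Engineer"},
]


def shard_for(member_id, num_shards):
    """Pick a shard with crc32, which is stable across runs (unlike hash() on str)."""
    return zlib.crc32(member_id.encode("utf-8")) % num_shards


def map_phase(records, num_shards):
    """Emit (shard, record) pairs, the way a mapper keys records by partition."""
    for record in records:
        yield shard_for(record["id"], num_shards), record


def reduce_phase(pairs):
    """Build one small inverted index per shard."""
    shards = defaultdict(lambda: defaultdict(set))
    for shard, record in pairs:
        for field, value in record.items():
            if field != "id":
                shards[shard][(field, value)].add(record["id"])
    return shards


def merge(shards):
    """Union the per-shard postings into a single index (the Index Merger step)."""
    merged = defaultdict(set)
    for shard_index in shards.values():
        for term, postings in shard_index.items():
            merged[term] |= postings
    return merged


shards = reduce_phase(map_phase(RECORDS, num_shards=3))
merged = merge(shards)
```

Because every record lands in exactly one shard, the shards can be built in parallel and merged afterwards in a separate process.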
25. How Lucene powers Segmentation & Targeting Platform
• Architecture
– Indexer Architecture
– Serving Architecture
• Load Balanced Model
• Next Steps - Distributed Model
• DocValues
• Lessons Learnt
• Why not use an existing solution?
26. Serving – Load Balanced Model
[Diagram: HTTP Request → Load Balancer → Web Server 1 (Shard 1), Web Server 2 (Shard 2), …, Web Server n (Shard n); Shared Drive]
27. Serving – Load Balanced Model
But wait…
• Is load balancing alone good enough?
• What about distribution and failover?
29. Next Steps – Distributed Model
• Apache Helix: a generic cluster management framework
• Manages partitioned and replicated resources in distributed systems
• Built on top of ZooKeeper; hides the complexity of ZK primitives
• Provides distributed features such as leader election, two-phase commit etc. via a state machine model
http://helix.incubator.apache.org/
30. Next Steps – Distributed Model
[Diagram: HTTP Request → Load Balancer → Scatter Gather → Web Servers]
Web Server 1 | Shard 1 active | Shard 2 standby
Web Server 2 | Shard 2 active | Shard 3 standby
Web Server 3 | Shard 3 active | Shard 1 standby
31. Next Steps – Distributed Model
[Diagram: HTTP Request → Load Balancer → Scatter Gather → Web Servers]
Web Server 1 | Shard 1 active | Shard 2 standby
Web Server 2 | Shard 2 active | Shard 3 active
Web Server 3 | Shard 3 failure | Shard 1 failure
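The failover scenario in these two diagrams can be sketched as a scatter-gather loop that queries one replica per shard and falls back to the next replica when a server is down. This is a simplified client-side model, not how Helix performs state transitions; the `Replica` class, the `ws1`..`ws3` names and the shard-to-server layout are assumptions based on one reading of the diagram.

```python
class Replica:
    """One copy of a shard hosted on a web server (hypothetical model)."""

    def __init__(self, server, docs):
        self.server = server
        self.docs = docs
        self.healthy = True

    def search(self, predicate):
        if not self.healthy:
            raise ConnectionError(f"{self.server} is down")
        return [doc for doc in self.docs if predicate(doc)]


def build_cluster():
    """Each shard lists its active replica first, then its standby."""
    docs = {1: ["a1", "b1"], 2: ["a2"], 3: ["a3", "b3"]}
    return {
        1: [Replica("ws1", docs[1]), Replica("ws3", docs[1])],
        2: [Replica("ws2", docs[2]), Replica("ws1", docs[2])],
        3: [Replica("ws3", docs[3]), Replica("ws2", docs[3])],
    }


def fail_server(cluster, server):
    """Mark every replica hosted on `server` as unavailable."""
    for replicas in cluster.values():
        for replica in replicas:
            if replica.server == server:
                replica.healthy = False


def scatter_gather(cluster, predicate):
    """Query every shard once, using the first healthy replica of each."""
    hits, served_by = [], {}
    for shard, replicas in sorted(cluster.items()):
        for replica in replicas:
            try:
                hits.extend(replica.search(predicate))
            except ConnectionError:
                continue  # try the standby
            served_by[shard] = replica.server
            break
        else:
            raise RuntimeError(f"no healthy replica for shard {shard}")
    return hits, served_by


cluster = build_cluster()
hits_before, served_before = scatter_gather(cluster, lambda d: d.startswith("a"))
fail_server(cluster, "ws3")
hits_after, served_after = scatter_gather(cluster, lambda d: d.startswith("a"))
```

With Web Server 3 down, shard 3 is answered by its standby on Web Server 2 and the result set is unchanged, which is the property plain load balancing does not give.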
33. DocValues – Use Case
• Once segments are built, users want to forecast a target revenue projection for the campaigns they want to run.
• Campaigns can be run on various revenue models
• This involves adding per-member Propensity Scores and Dollar Amounts
34. DocValues – Why not Stored Fields?
Why not use Stored Fields?
• Stored fields have one indirection per document, resulting in two disk seeks per document
– Document ID → .fdx: fetch file pointer to field data
– .fdt: scan by id until field is found
• Performance cost quickly adds up when fetching millions of documents
35. DocValues – Why not Stored Fields?
• Why not use Field Cache?
– It is memory resident
– Works fine when there is enough memory
– But keeping millions of un-inverted values in memory is impossible
– Additional cost to parse values (to and from String)
36. DocValues
• Dense column based storage
– (1 Value per Document and 1 Column per field and segment)
• Accepts primitives
• No conversion from/to String needed
• Loads 80x-100x faster than building a FieldCache
• All the work is done during Indexing
• DocValues fields can also be indexed and stored
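The idea of dense column-based storage can be sketched with Python's `array` module: one typed column per field, with position i holding the value for document i, so reading a value is an index lookup with no String parsing. This is a toy model, not Lucene's DocValues format; the `ColumnStore` class, the field names and the revenue formula (propensity × dollar amount) are assumptions for illustration.

```python
from array import array


class ColumnStore:
    """One dense typed column per field; position i holds the value for doc i."""

    def __init__(self, num_docs):
        self.num_docs = num_docs
        self.columns = {}

    def add_column(self, field, typecode, values):
        # Dense storage: exactly one primitive value per document.
        if len(values) != self.num_docs:
            raise ValueError("need exactly one value per document")
        self.columns[field] = array(typecode, values)

    def value(self, field, doc_id):
        # Direct positional lookup; no per-document indirection or parsing.
        return self.columns[field][doc_id]


def projected_revenue(store, doc_ids):
    """Assumed forecast: sum of propensity score times dollar amount per member."""
    propensity = store.columns["propensity"]
    dollars = store.columns["dollar_amount"]
    return sum(propensity[d] * dollars[d] for d in doc_ids)


store = ColumnStore(num_docs=4)
store.add_column("propensity", "d", [0.5, 0.25, 0.75, 0.5])
store.add_column("dollar_amount", "d", [100.0, 200.0, 40.0, 10.0])

# Doc ids 0 and 2 stand in for the members matched by a segment query.
forecast = projected_revenue(store, [0, 2])
```

All the layout work happens when the columns are written, which mirrors the point above that the cost is paid during indexing rather than at query time.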
38. Lessons Learnt
Indexing
• Reuse index writers, field and document instances
• Create many partitions and merge them in a different
process
• Rebuild (bootstrap) entire index if possible
• Use partial updates with caution
• Analyze the index
39. Lessons Learnt
Serving
• Reuse a single instance of IndexSearcher
• Limit usage of stored fields and term vectors
• Plan for load balancing and failover
• Cache term frequencies
• Use different machines for serving and indexing
41. Why not use existing solutions?
• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Indexing elevates query latency

• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Larger memory overhead
• Comparatively slow