4. Background
• Bloomberg Vault – hosted communication archive
  • Explosive growth of enterprise data communications
  • Compliance for Regulated Industries (e.g. e-mail, chat, mobile, voice, social media, files)
  • Private Cloud
• E-Discovery - large historical data sets, but small
query volume
  • Search to respond to litigation requests accurately and in a timely manner
  • Reconstruct communications across all channels and types
  • Extraction of large data sets from special storage (WORM)
[Diagram labels: Query, User, Index, Results, Extraction]
5. Sizing
• 80 billion documents
  • And growing
• Average document size is 50KB
  • Large variance - 1KB to hundreds of MB
• Hundreds of indexed fields
  • There is a lot of metadata that goes along with communication
• <10 searches/second
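A rough back-of-the-envelope check of what these figures imply for raw data volume. This is only a sketch: it assumes decimal units (1 KB = 1000 bytes) and treats the 50KB average as exact; the variable names are illustrative.

```python
# Back-of-the-envelope sizing from the figures on this slide.
# Assumption: decimal units (1 KB = 1000 bytes, 1 PB = 10**15 bytes).
DOCUMENTS = 80 * 10**9          # 80 billion documents
AVG_DOC_BYTES = 50 * 1000       # 50 KB average document size

total_bytes = DOCUMENTS * AVG_DOC_BYTES
total_petabytes = total_bytes / 10**15

print(f"Raw document volume: about {total_petabytes:.0f} PB")
```

This counts raw document bytes only. The size of the index itself is not stated in the slides and is not estimated here.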
7. Architecture
• Massive scale - shards have to be left
offline until needed
• Load only the shards needed to serve
a search request
  • Searches normally require ~30 shards, but can range from 1 to several hundred depending on application
• Open shards cached in case they are
needed again
• Indexing is an external batch process
[Diagram labels: Search Manager, Shard Mapping, Solr instances, Shards]
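The load-on-demand and caching behaviour described on this slide can be sketched as a small least-recently-used cache. This is an illustration, not the actual Search Manager: the `ShardCache` class, the `loader` callback, and the account-keyed `mapping` are all hypothetical, and the slides do not say which eviction policy is used.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class ShardCache:
    """Keeps at most `capacity` shards open, evicting the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[str], object]):
        self.capacity = capacity
        self.loader = loader              # hypothetical: brings a shard online
        self.open_shards: "OrderedDict[str, object]" = OrderedDict()
        self.loads = 0                    # counts expensive loads, for illustration

    def get(self, shard_id: str) -> object:
        if shard_id in self.open_shards:
            self.open_shards.move_to_end(shard_id)   # mark as recently used
            return self.open_shards[shard_id]
        shard = self.loader(shard_id)                # expensive: load from offline
        self.loads += 1
        self.open_shards[shard_id] = shard
        if len(self.open_shards) > self.capacity:
            self.open_shards.popitem(last=False)     # unload the coldest shard
        return shard


def shards_for_request(mapping: Dict[str, List[str]], account: str) -> List[str]:
    """Hypothetical shard mapping lookup: which shards a request must touch."""
    return mapping.get(account, [])


mapping = {"acme": ["acme-2014", "acme-2013"], "globex": ["globex-2014"]}
cache = ShardCache(capacity=2, loader=lambda shard_id: f"<open {shard_id}>")

for shard_id in shards_for_request(mapping, "acme"):
    cache.get(shard_id)       # both shards are loaded on first use
cache.get("acme-2014")        # already open: served from the cache
cache.get("globex-2014")      # over capacity: the coldest shard is unloaded
```

After this run three loads have happened, and `acme-2013` has been unloaded because it was the least recently used shard.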
9. Incremental Search
• Calculating the full result set is time
consuming
  • Query cache is usually cold because shards are unloaded
  • Shard loading takes time
• Users want to review a subset before
exporting
• Shards and results are date sorted
• Search shards sequentially, and
return partial results as available
• Creates a streaming interface
[Diagram labels: Applications, Search Manager, Solr instances, Shards]
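One way to picture the streaming interface is a generator that searches date-ordered shards one at a time and yields hits as soon as each shard answers. The sketch below is a toy model under stated assumptions: `FAKE_SHARDS`, `search_shard`, and the year-named shard ids are invented for illustration, and the query string is ignored by the fake search.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-in for real shards: shard id -> (date, doc id) hits.
FAKE_SHARDS: Dict[str, List[Tuple[str, str]]] = {
    "2014": [("2014-03-01", "doc-a"), ("2014-01-15", "doc-b")],
    "2013": [("2013-11-30", "doc-c")],
    "2012": [("2012-06-02", "doc-d"), ("2012-02-20", "doc-e")],
}

searched: List[str] = []   # records which shards were actually queried


def search_shard(shard_id: str, query: str) -> List[Tuple[str, str]]:
    """Pretend to run `query` on one shard; hits come back newest first."""
    searched.append(shard_id)
    return sorted(FAKE_SHARDS[shard_id], reverse=True)


def incremental_search(shard_ids: List[str], query: str) -> Iterator[Tuple[str, str]]:
    """Yield hits shard by shard so callers can consume them as a stream."""
    for shard_id in sorted(shard_ids, reverse=True):   # newest shard first
        for hit in search_shard(shard_id, query):
            yield hit


stream = incremental_search(["2012", "2014", "2013"], "subject:merger")
first_page = [next(stream) for _ in range(3)]   # stops before the 2012 shard
```

Because the generator is lazy, the oldest shard is never touched unless the caller asks for more results than the newer shards provide.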
10. Pinned Shards
• Incremental search starts with the most recent data
• `Pin` shards for most recent data
  • Subset of shards to be kept loaded at all times
• Shards already loaded for the beginning of the stream
• User doesn’t see the load times for the rest since it happens while they review
initial results
• Allows query caches to be more effective
• User sees results in seconds rather than minutes
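Pinning can be sketched as a shard cache in which a fixed set of shards is loaded up front and is exempt from eviction. This is a minimal illustration, not the real implementation: `PinnedShardCache`, the `load` helper, and the quarter-named shard ids are hypothetical, and LRU eviction for the unpinned shards is an assumption.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List


class PinnedShardCache:
    """LRU cache of open shards in which pinned shards are never unloaded."""

    def __init__(self, capacity: int, loader: Callable[[str], object],
                 pinned: Iterable[str]):
        self.capacity = capacity          # max number of unpinned shards kept open
        self.loader = loader
        # Pinned shards are loaded once, up front, and kept for the cache's lifetime.
        self.pinned = {shard_id: loader(shard_id) for shard_id in pinned}
        self.unpinned: "OrderedDict[str, object]" = OrderedDict()

    def get(self, shard_id: str) -> object:
        if shard_id in self.pinned:
            return self.pinned[shard_id]             # always available, no load
        if shard_id in self.unpinned:
            self.unpinned.move_to_end(shard_id)      # mark as recently used
            return self.unpinned[shard_id]
        shard = self.loader(shard_id)                # slow path: load on demand
        self.unpinned[shard_id] = shard
        if len(self.unpinned) > self.capacity:
            self.unpinned.popitem(last=False)        # only unpinned shards are evicted
        return shard


loaded: List[str] = []   # records every load, in order


def load(shard_id: str) -> str:
    loaded.append(shard_id)
    return f"<open {shard_id}>"


cache = PinnedShardCache(capacity=1, loader=load, pinned=["2014-q4", "2014-q3"])
cache.get("2014-q4")   # pinned: no load needed
cache.get("2013-q1")   # loaded on demand
cache.get("2012-q1")   # loaded on demand, evicts 2013-q1
cache.get("2014-q3")   # pinned: still open despite the eviction above
```

In the workflow the slide describes, the on-demand loads for older shards would overlap with the user reviewing results from the pinned shards; that overlap is not modelled here.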
12. Security
• What if each user has a different view
of a document?
  • User 1 has permission to view the red text
  • User 2 has permission to view the green text
  • User 3 has permission to view everything
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
13. Security
• Post process each document
  • Ends up being horribly slow
  • Ties application logic to backend
• Generate a unique document for each view
  • 1000s of unique views make for an unmanageable index
  • Trillions of documents is a whole different problem!
• Dynamic fields
  • text_view1:value1, text_view2:value2, text_view3:"value1 value2"
  • Solr doesn’t have a max number of fields, but string interning becomes an issue
• Mangle field values
  • text:"view1_value1 view2_value2 view3_value1 view3_value2"
  • Works pretty well
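The value-mangling approach can be sketched as two small helpers: one prefixes every indexed token with its view, the other rewrites a user's query terms with the view that user is allowed to see. This is a sketch under stated assumptions: `mangle_document` and `mangle_query` are hypothetical names, and the slides do not describe how the query side is actually rewritten.

```python
from typing import Dict, List


def mangle_document(views: Dict[str, List[str]]) -> str:
    """Build one field value in which every token is prefixed by its view."""
    tokens: List[str] = []
    for view, values in views.items():
        tokens.extend(f"{view}_{value}" for value in values)
    return " ".join(tokens)


def mangle_query(view: str, terms: List[str]) -> str:
    """Rewrite a user's terms so they can only match tokens from that view."""
    return " AND ".join(f"text:{view}_{term}" for term in terms)


# The three views from the slide's example.
doc_views = {
    "view1": ["value1"],
    "view2": ["value2"],
    "view3": ["value1", "value2"],
}
field_value = mangle_document(doc_views)
query = mangle_query("view3", ["value1", "value2"])

print(field_value)   # view1_value1 view2_value2 view3_value1 view3_value2
print(query)         # text:view3_value1 AND text:view3_value2
```

All views share the single `text` field, so the number of fields stays fixed no matter how many views exist, which avoids the string interning issue noted for dynamic fields.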