DevEX - reference for building teams, processes, and platforms
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
1. Search
Discover
Analyze
Large
Scale
Search,
Discovery
and
Analy5cs
with
Solr,
Mahout
and
Hadoop
Grant
Ingersoll
Chief
Scien:st
Lucid
Imagina:on
|
1
2. Search
is
Dead,
Long
Live
Search
l Good
keyword
search
is
a
Documents
commodity
and
easy
to
get
up
and
running
l The
Bar
is
Raised
Content User
Relationships Interaction
– Relevance
is
(always
will
be?)
hard
l Holis:c
view
of
the
data
AND
the
users
is
cri:cal
Access
|
2
3. Topics
l Quick
Background
and
needs
l Architecture
– Abstract
– Prac:cal
l SDA
In
Prac:ce
– Components
– Challenges
and
Lessons
Learned
l Wrap
Up
|
3
4. Why
Search,
Discovery
and
Analy8cs
(SDA)?
l User
Needs
– Real-‐:me,
ad
hoc
access
to
content
– Aggressive
Priori:za:on
based
on
Importance
Search
– Serendipity
– Feedback/Learning
from
past
l Business
Needs
Analytics Discovery – Deeper
insight
into
users
– Leverage
exis:ng
internal
knowledge
– Cost
effec:ve
|
4
5. What
Do
Developers
Need
for
SDA?
l Fast, efficient, scalable search
– Bulk and Near Real Time Indexing
– Handle billions of records w/ sub-second search and faceting
l Large scale, cost effective storage and processing capabilities
– Need whole data consumption and analysis
– Experimentation/Sampling tools
– Distributed In Memory where appropriate
l NLP and machine learning tools that scale to enhance discovery and
analysis
|
5
6. Abstract
-‐>
Prac8cal
SDA
Architecture
Access (API, UI,Visualization)
Search, Discovery and Analytics Glue
Stats Mahout, R, GATE, Others
Pig, Machine Docs User Admin
Package Learning Access Modeling
Experiment Mgmt Service
Mgmt
Content Computation and Storage
Acquisition
DB
Dist. Data
Search NoSQL
Process Mgmt
KV
Shards Shards Shards
Shards Shards Shards
Shards Logs DFS
Provisioning, Monitoring, Infrastructure
|
6
7. Computa8on
and
Storage
Solr Hadoop HBase
• Document Index • Stores Logs, • Metric Storage
• Document Raw files, • User Histories
Storage? intermediate • Document
files, etc. Storage?
• SolrCloud makes • WebHDFS
sharding easy
• Small file are an
unnatural act
Challenges
• Who
is
the
authorita:ve
store?
Solr
or
HBase?
• Real
:me
vs.
Batch
• Where
should
analysis
be
done?
|
7
8. Search
In
Prac8ce
l Three
primary
concerns
– Performance/Scaling
– Relevance
– Opera:ons:
monitoring,
failover,
etc.
l Business
typically
cares
more
about
relevance
l Devs
more
about
performance
(and
then
ops)
|
8
9. Search
with
Solr:
Scaling
and
NRT
l SolrCloud
takes
care
of
distributed
indexing
and
search
needs
– Transac:on
logs
for
recovery
– Automa:c
leader
elec:on,
so
no
more
master/worker
– Have
to
declare
number
of
shards
now,
but
spliang
coming
soon
– Use
CloudSolrServer
in
SolrJ
l NRT
Config
:ps:
– 1
second
sod
commits
for
NRT
updates
– 1
minute
hard
commits
(no
searcher
reopen)
|
9
10. Search:
Relevance
l ABT
–
Always
Be
Tes:ng
– Experiment
management
is
cri:cal
– Top
X
+
Random
Sampling
of
Long
Tail
– Click
logs
l Track
Everything!
– Queries
– Clicks
– Displayed
Documents
– Mouse/Scroll
tracking???
l Phrases
are
your
friend
|
10
11. Discovery
Components
Serendipity Organization Data Quality
• Trends • Importance • Document factor
• Topics • Clustering Distributions
• Recommendations • Classification • Length
• Related Items • Named Entities • Boosts
• More Like This • Time Factors • Duplicates
• Did you mean? • Faceting
• Stat. Interesting
Phrases
Challenges
• Many
of
these
are
intense
calcula:ons
or
itera:ve
• Many
are
subjec:ve
and
require
a
lot
of
experimenta:on
|
11
12. Discovery
with
Mahout
l Mahout’s
3
“C”s
provide
tools
for
helping
across
many
aspects
of
discovery
– Collabora:ve
Filtering
– Classifica:on
– Clustering
l Also:
– Colloca:ons
(Sta:s:cally
Interes:ng
Phrases)
– SVD
– Others
l Challenges:
– High
cost
to
itera:ve
machine
learning
algorithms
– Mahout
is
very
command
line
oriented
– Some
areas
less
mature
|
12
13. Aside:
Experiment
Management
l Plan
for
running
experiments
from
the
beginning
across
Search
and
Discovery
components
– Your
analy:cs
engine
should
help!
l Types
of
Experiments
to
consider
– Indexing/Analysis
– Query
parsing
– Scoring
formulas
– Machine
Learning
Models
– Recommenda:ons,
many
more
l Make
it
easy
to
do
A/B
tes:ng
across
all
experiments
and
compare
and
contrast
the
results
|
13
14. Analy8cs
Components
l Commonly
used
components
– Solr
– R
Stats
– Hive
– Pig
– Commercial
l Star:ng
with
Search
and
Discovery
metrics
and
analysis
gives
context
into
where
to
make
investments
for
broader
analy:cs
|
14
15. Analy8cs
in
Prac8ce
l Simple
Counts:
– Facets
– Term
and
Document
frequencies
– Clicks
l Search
and
Discovery
example
metrics
– Relevance
measures
like
Mean
Reciprocal
Rank
– Histograms/Drilldowns
around
Number
of
Results
– Log
and
naviga:on
analysis
l Data
cleanliness
analysis
is
helpful
for
finding
poten:al
issues
in
content
|
15
16. Wrap
l Search,
Discovery
and
Analy:cs,
when
combined
into
a
single,
coherent
system
provides
powerful
insight
into
both
your
content
and
your
users
l Lucid
has
combined
many
of
these
things
into
LucidWorks
Big
Data
– hrp://www.lucidimagina:on.com/products/lucidworks-‐search-‐plasorm/
lucidworks-‐big-‐data
l Design
for
the
big
picture
when
building
search-‐based
applica:ons
|
16