2. Agenda
• Presenters
• Query time Nonparametric Regression
• Demo – Suggesting tagged images
• Time Routed Aliases
• Demo – Creating a Time Routed Alias
3. Presenters
Patrick (Gus) Heck
• Solr Contributor since 2013
• Consulting since 2012
• Enterprise Search since
2010
• Apache Ant contributor
2003-2004
• Web Applications since 2003
David Smiley
• Lucene/Solr Committer
(PMC)
• Consulting
• Author of first book on Solr
• Presentations & Training
4. In the beginning...
Dave Mackey - Vision for an online coaching
platform leveraging machine learning
technology to help companies help their
middle managers.
Engaged me first as Chief Architect and later
as CTO to bring this vision to reality.
It worked, but was not funded...
5. Brief Overview of Simply Coached
Simply Coached provided an online career coaching
system
Key Goal was connecting users with relevant
curated content
Relevance was determined by
• Identifying topics the user is interested in by
observing and learning from click-throughs
• Several other customized metrics
6. Learning What Interests the User
Goals:
• Suggest things of interest, things the user is willing
to learn
• Make predictions based on the user’s behavior
• Unique prediction per user
• Predictions based on entire user base
• Avoid calcification of the predictive model: old data
needs to be sunset regularly
7. Modeling Candidate - Neural Nets
Simply Coached started right at the dawn of
the “3rd wave” of neural networks
Limitations of Neural Networks
• They don’t adapt well to a changing
problem
• We would need to retrain predictive
network regularly
• Require massive data for training
8. Modeling Candidate - Regression
Not as “cool” as deep learning, but more suitable
• Can converge faster with less data
• But most techniques assume that the
residuals are normally distributed
• Many of our features will be binary
• Normality is right out the window
I decided to look for a Non-Parametric Regression
technique
9. Non-Parametric Multiplicative Regression
• Can predict continuous variable
• Good for sorting/ranking
• Kernel smoother, presence/absence
kernels already well known
• Non-parametric
Nice discrete sub-calculations...
Used in Habitat Ecology for predicting habitat suitability
https://ir.library.oregonstate.edu/concern/defaults/z029p910z
10. NPMR Equations
Performance across multiple factors (features)
Local Mean estimator
Example kernels continuous or
presence/absence:
Key realization: the products and sums for i,j can
be precalculated:
I called these pre-calculated portions “Partials” and
added temporal metadata
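A minimal sketch of a local mean estimator with multiplicative kernel weights, assuming the commonly published NPMR form: the prediction is a weighted average of observed responses, and each observation's weight is the product of its per-feature kernel values. The function names, the `tolerances` convention (`None` marks a presence/absence feature) and the sample data are illustrative assumptions, not the Simply Coached implementation.

```python
import math

def gaussian_weight(x, v, s):
    """Continuous kernel: weight decays with distance between x and target v."""
    return math.exp(-0.5 * ((x - v) / s) ** 2)

def presence_weight(x, v):
    """Presence/absence kernel: weight 1 when the binary feature matches, else 0."""
    return 1.0 if x == v else 0.0

def local_mean(observations, target, tolerances):
    """Weighted average of responses y_i, with weight_i = product over features j.

    observations: list of (features, y) pairs
    target:       feature values at the point being predicted
    tolerances:   per-feature kernel width, or None for a presence/absence feature
    """
    numerator = 0.0    # sum over i of y_i * product over j of w_ij
    denominator = 0.0  # sum over i of product over j of w_ij
    for features, y in observations:
        w = 1.0
        for x, v, s in zip(features, target, tolerances):
            w *= presence_weight(x, v) if s is None else gaussian_weight(x, v, s)
        numerator += y * w
        denominator += w
    # No observation carries any weight near the target: no estimate.
    return numerator / denominator if denominator > 0 else None

# Hypothetical data: (continuous feature, binary feature) -> reaction time in seconds
observations = [((1.0, 1), 2.0), ((1.0, 0), 10.0)]
print(local_mean(observations, target=(1.0, 1), tolerances=(0.5, None)))
```

The numerator and denominator are plain running sums, which is what makes it possible to store them as pre-calculated pieces and refresh them incrementally rather than rescanning every observation per query.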
11. But wait... What are we predicting?
Needed a continuous statistic to predict
Thesis: Given a good teaser etc., the user
will click more rapidly on more interesting
articles.
Suggest documents in order of predicted
reaction time
12. Demo
Three users:
• Sporty Black likes black sports cars
• Cooper Blue likes blue coupes
• Racy Yeller likes yellow formula cars
System was trained by presenting ~20 sets of 9
randomly selected cars and clicking with
varying speed
Images copyright Andrew Mauldin, used with permission
16. Indexing With JesterJ
Continuously running, recalculating
sums/products once a minute
Included traditional document indexing
(article scanner) and also streaming
expression update()
(send_per_user_sum_update)
By end, running continuously for months
at a time unattended.
http://www.jesterj.org for more info about
JesterJ
17. Adapting and Scaling
• Adapt at query time with filters, could also
filter in other ways with additional metadata
• Four Dimensions:
1. Content to recommend - constant set
2. Users - per user pricing to the rescue
3. Activity - price for average activity level
4. Time - data accumulates over time
Time Routed Aliases are a solution to #4
19. Have a Lot of Timestamped Data?
And you need keyword search and/or analytics
(faceting/aggregations) capabilities.
Examples:
• Logs
• Sensor data (IoT)
• Social media posts
Characteristics: tons of docs, continuously flowing in, limited
retention
20. Strategies
Hash Partitioned:
• One collection, hash routed shards (built-in)
Time Partitioned:
• One collection, time routed shards (DIY)
• Time partitioned collections (DIY)
• Time partitioned collections via TRAs (built-in)
“Partitioned” = “Routed” = “Organized”
21. One Collection, Hash Routed Shards
Hash on ID, even distribution
router.name=compositeId
+ Easy (default)
+ High write throughput
- Deleted dead-weight
- Poor realtime search
- Queries execute everywhere
- Inflexible sizing
- … thus Expensive (uniform
hardware requirements)
See DocExpirationUpdateProcessorFactory
(Figure: shards 1-9; legend: deleted docs, live docs)
22. Time Partitioned Data
Generally better...
• New data always goes to most recent partition
• Variable or equal sized (depends on approach)
• Addresses all negatives of hash routing
• Write throughput can be addressed by sharding within partition
• Opportunities to “optimize” aged indexes
• Flexibility in assigning partitions to better or cheaper hardware
• … all leads to cost savings
How?...
(Figure: monthly partitions 2017-01 through 2017-09)
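A rough sketch of the time-partitioning idea, assuming monthly partitions named `YYYY-MM` as in the figure. The helper names are hypothetical and this is not Solr code; it only illustrates how a timestamp determines where a document lands, and why a time-bounded query can skip most partitions.

```python
from datetime import datetime, timezone

def monthly_partition(ts: datetime) -> str:
    """Name of the monthly partition (UTC) that a timestamp belongs to."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m")

def partitions_in_range(start: datetime, end: datetime) -> list:
    """Partitions a query over [start, end] has to touch; all others are skipped."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

doc_time = datetime(2017, 9, 15, 12, 0, tzinfo=timezone.utc)
print(monthly_partition(doc_time))
print(partitions_in_range(datetime(2017, 7, 20, tzinfo=timezone.utc),
                          datetime(2017, 9, 5, tzinfo=timezone.utc)))
```

With hash routing the same query would fan out to every shard; here only the partitions overlapping the time window are involved.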
23. Time Partitioned Data
Implementation Strategies:
• One Collection, router.name=implicit
• See this sample code: (DIY) https://github.com/cga-harvard/hhypermap-bop/tree/master/bop-core/solr-plugins/src/main/java/edu/harvard/gis/hhypermap/bop/solrplugins by me
• Multiple Collections
• See this blog: (DIY) http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/ by Mark Miller
• TRAs… (built-in)
(Figure: monthly partitions 2017-01 through 2017-09)
24. About Collection Aliases
SolrCloud supports collection aliases
Aliases point to one or more collections
Ex: Alias “tra-demo” → 2017-09, 2017-08, 2017-07, ...
(Figure: monthly partitions 2017-01 through 2017-09)
25. Time Routed Aliases
Aliases have new tricks up their sleeves…
• Aliases now have metadata (API to read & edit)
• Can create a “time routed alias” w/ first collection
• A TRA can
• Route update requests to the correct collection
• Add/delete collections automatically
• data driven
• just-in-time or preemptively
Mutable configuration, mostly
New in Solr 7.3
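A hedged sketch of what creating a time routed alias can look like through the Collections API `CREATEALIAS` action. It only builds the request URL and sends nothing. The host, the field name `timestamp_dt`, the start date and the configset name are assumptions for illustration; check the parameter list against the reference guide for the Solr version in use.

```python
from urllib.parse import urlencode

def create_tra_url(base_url, alias, config_name):
    """Build a Collections API URL that would create a time routed alias."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "router.name": "time",                   # makes the alias time routed
        "router.field": "timestamp_dt",          # hypothetical date field on each doc
        "router.start": "2017-01-01T00:00:00Z",  # assumed start of the first collection
        "router.interval": "+1MONTH",            # date math: one collection per month
        "create-collection.collection.configName": config_name,
        "create-collection.numShards": 1,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"

# Hypothetical local Solr node and configset name
url = create_tra_url("http://localhost:8983/solr", "tra-demo", "_default")
print(url)
```

The `create-collection.*` parameters act as the template for each collection the alias creates as data arrives, which is why they are supplied once at alias creation time.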
27. TRAs, the fine print
TODOs
• Size capped (e.g. 5M docs / collection)
• Query routing to subset of collections
• Better “auto-scaling” tie-ins
• “optimize” of older collections
• TRA deletion, ease of use
TRAs may not be for everyone
• Partitioning is strict and must be adjacent (no gaps)
• Doesn’t work with CDCR