The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks
1. October 11-14, 2016 • Boston, MA
2. The Evolution of Lucene & Solr Numerics
from Strings to Points
Steve Rowe
Senior Software Engineer, Lucidworks
@steven_a_rowe
3. Agenda
1. {Long time ago, yesterday}: History
2. Today: Benchmarks
3. Tomorrow: Future developments
Not on the agenda: geospatial; stats/analytics; streaming expressions
10. Trie numerics
From http://www.thetaphi.de/share/Schindler-TrieRange.ppt:
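[Figure: trie over the indexed values 421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644, with prefix terms 42, 44, 52, 63, 64 at the second level and 4, 5, 6 at the top level; labeled "Range"]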
1. Fast range queries
2. Fewer terms required than term range queries
3. 7-bit encoded to minimize disk footprint
4. Adjustable “precisionStep”: number of bits to shift when generating synthetic terms
5. Synthetic prefix terms created by stripping low bits and prepending the shift amount in the first byte
1. E.g.: For 423, synthetic terms 42 and 4 are also indexed
2. When searching range [423, 642]: the lowest-precision terms covering the range are used: 423, 44, 5, 63, 641, 642 (6 terms), versus 11 terms required by a term range query.
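The range example above can be reproduced with a small sketch. It assumes a decimal trie (one digit per level) to mirror the slide's diagram, whereas real trie fields shift by precisionStep bits. The function names are hypothetical and the value list is read off the diagram; none of this is Lucene API.

```python
DIGITS = 3  # assumption: every value in the example has three decimal digits


def indexed_terms(values, digits=DIGITS):
    """Full-precision terms plus synthetic prefix terms, as (shift, prefix) pairs."""
    return {(shift, v // 10 ** shift) for v in values for shift in range(digits)}


def split_range(lo, hi, digits=DIGITS):
    """Cover [lo, hi] with (shift, lo_prefix, hi_prefix) sub-ranges,
    lifting the middle of the range to coarser levels where possible."""
    ranges = []
    shift = 0
    while lo <= hi:
        lo_partial = lo % 10 != 0  # lower edge does not start a full block
        hi_partial = hi % 10 != 9  # upper edge does not end a full block
        next_lo = lo // 10 + (1 if lo_partial else 0)
        next_hi = hi // 10 - (1 if hi_partial else 0)
        if shift == digits - 1 or next_lo > next_hi:
            ranges.append((shift, lo, hi))  # nothing left to lift to a coarser level
            break
        if lo_partial:
            ranges.append((shift, lo, lo // 10 * 10 + 9))
        if hi_partial:
            ranges.append((shift, hi // 10 * 10, hi))
        lo, hi = next_lo, next_hi
        shift += 1
    return ranges


def matching_terms(values, lo, hi):
    """Indexed terms that fall inside one of the sub-ranges at their own level."""
    return sorted(
        (shift, prefix)
        for shift, prefix in indexed_terms(values)
        for s, a, b in split_range(lo, hi)
        if s == shift and a <= prefix <= b
    )


# Values as they appear in the slide's trie diagram.
values = [421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644]
trie_terms = matching_terms(values, 423, 642)
plain_terms = [v for v in values if 423 <= v <= 642]
print(len(trie_terms), len(plain_terms))  # 6 11
```

The six matched terms are 423, 44, 5, 63, 641 and 642: the edges of the range are matched at full precision, and the middle is matched by shorter prefix terms.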
15. Dimensional Points
1. All point values in a field have the same fixed width (max 128 bits)
2. 1D - 8D
3. Block k-d tree
4. Points are sorted; recursively partitioned along the longest dimension; then at a target cardinality, the “leaf block” is written out.
5. An in-memory binary tree index points to the leaf blocks.
6. Adaptive optimal partitioning (versus trie numerics, which generate terms irrespective of local density.)
[Figure labels: 1-8 dimensions; 1-16 bytes per dimension]
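The partitioning described above can be sketched roughly as follows. This is a minimal in-memory illustration, not Lucene's implementation: the names `build_tree` and `range_query`, the dictionary layout, and the leaf size of 4 are all assumptions made for the example, and the leaf blocks stay in memory here rather than being written out.

```python
def build_tree(points, max_leaf_size=4):
    """Recursively split on the widest dimension until a block is small enough."""
    if len(points) <= max_leaf_size:
        return {"leaf": list(points)}  # stands in for a "leaf block"
    dims = range(len(points[0]))
    # Pick the dimension whose values span the largest interval.
    split_dim = max(
        dims, key=lambda d: max(p[d] for p in points) - min(p[d] for p in points)
    )
    ordered = sorted(points, key=lambda p: p[split_dim])
    mid = len(ordered) // 2
    return {
        "dim": split_dim,
        "value": ordered[mid][split_dim],  # left values <= value <= right values
        "left": build_tree(ordered[:mid], max_leaf_size),
        "right": build_tree(ordered[mid:], max_leaf_size),
    }


def range_query(node, lo, hi):
    """Collect points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if "leaf" in node:
        return [
            p for p in node["leaf"]
            if all(l <= x <= h for x, l, h in zip(p, lo, hi))
        ]
    hits = []
    d, v = node["dim"], node["value"]
    if lo[d] <= v:  # the left subtree may hold values inside the range
        hits += range_query(node["left"], lo, hi)
    if hi[d] >= v:  # the right subtree may hold values inside the range
        hits += range_query(node["right"], lo, hi)
    return hits


# Deterministic 2D sample data (hypothetical).
points = [(x, (x * 7) % 23) for x in range(40)]
tree = build_tree(points)
hits = range_query(tree, lo=(5, 3), hi=(20, 15))
```

Because the split points follow the data, dense regions end up with more, smaller cells, which is the adaptive behaviour the slide contrasts with trie terms.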
16. Dimensional Points
1. Lucene-only - no Solr support yet
2. Optimized for query types:
range, distance, nearest-neighbor, and point-in-polygon
3. Multi-valued support
4. Not supported: value retrieval (store if you need this)
5. Not supported: sorting or faceting (use DocValues for these)
17. Dimensional Points
1D Native
Implementations: LongPoint, IntPoint, DoublePoint, FloatPoint, BinaryPoint
Supported queries: any in set, exact, range

1D 128-bit
Implementations: BigIntegerPoint, InetAddressPoint
Supported queries: any in set, exact, range

1D-4D Range
Implementations: LongRangeField, IntRangeField, DoubleRangeField, FloatRangeField
Supported queries: intersects, contains, within (given a range)

2D Geospatial
Implementations: LatLonPoint
Supported queries: within box, within distance, within polygon, nearest neighbor

3D Geospatial
Implementations: Geo3DPoint
Supported queries: within shape
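For the range field types, the three relations can be illustrated in one dimension. The sketch below assumes closed intervals and reads each relation as describing the indexed range relative to the query range; the helper names and sample data are hypothetical and are not the Lucene API.

```python
def intersects(indexed, query):
    """The indexed range shares at least one value with the query range."""
    return indexed[0] <= query[1] and query[0] <= indexed[1]


def contains(indexed, query):
    """The indexed range fully covers the query range."""
    return indexed[0] <= query[0] and query[1] <= indexed[1]


def within(indexed, query):
    """The indexed range lies entirely inside the query range."""
    return query[0] <= indexed[0] and indexed[1] <= query[1]


# Hypothetical indexed ranges, keyed by document name.
indexed_ranges = {"a": (1, 5), "b": (4, 12), "c": (6, 8), "d": (20, 30)}
query = (5, 10)


def matches(relation):
    return sorted(name for name, r in indexed_ranges.items() if relation(r, query))


print(matches(intersects))  # ['a', 'b', 'c']
print(matches(contains))    # ['b']
print(matches(within))      # ['c']
```

With more dimensions, the same test would be applied per dimension and all dimensions would have to agree.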
18. Today
Mike McCandless benchmarked pre-6.0 1D points and found*:
1. Points were substantially faster at both index and query time than the equivalent Trie numeric type.
2. Index size was smaller with points.
3. Query-time heap usage with points was much lower.
Adrien Grand re-ran Mike’s benchmark against a Lucene 6.2 snapshot**, and drew similar conclusions: “36% faster at query time, 71% faster at index time and used 66% less disk and 85% less memory”
* https://www.elastic.co/blog/lucene-points-6.0
** https://www.elastic.co/blog/searching-numb3rs-in-5.0
19. Today
I benchmarked fixed range queries against trie and point long, int and double fields in 25 million NYC taxi trips using modified tools from luceneutil.
I created an index with three versions of each long, int and double field:
1. Trie numerics with the default precision step
2. Point fields
3. Trie numerics with a precision step the same width as the numbers - this should provide a maximum performance threshold for String ranges.
20. Today
                        Indexing time   Index size
Points                  31s             1.2GiB
Trie                    53s             1.6GiB
Single-precision trie   19s             0.7GiB
The index has 24 fields defined: 6 string fields, 1 text field, 2 long fields, 1 int field, and 14 double fields.
22. Tomorrow
1. Add support for PointFields in Solr: SOLR-8396
2. David Smiley will be working on adding a Solr adaptor for LatLonPoint in the near future.
3. Trie numerics will be removed from Lucene in 7.0, but Solr may take ownership to provide a longer backcompat timeframe.
4. FieldCache may be removed from Lucene / moved to Solr: LUCENE-7283
23. References
1. Numeric Range Queries with Lucene TrieRange:
http://www.thetaphi.de/share/Schindler-TrieRange.ppt
2. Generic XML-based Framework for Metadata Portals:
http://epic.awi.de/17813/1/Sch2007br.pdf
3. Fun with flexible indexing:
http://blog.mikemccandless.com/2010/10/fun-with-flexible-indexing.html
4. Searching numb3rs in 5.0: https://www.elastic.co/blog/searching-numb3rs-in-5.0
5. Multi-dimensional points, coming in Apache Lucene 6.0:
https://www.elastic.co/blog/lucene-points-6.0
6. Bkd-tree: A Dynamic Scalable kd-tree:
http://www.madalgo.au.dk/~large/Papers/bkdsstd03.ps
7. Luceneutil: https://github.com/mikemccand/luceneutil/