Native Code & Off-Heap Data Structures for Solr: Presented by Yonik Seeley, Heliosearch
1. Native Code & Off-Heap Data Structures for Solr
Yonik Seeley
Lucene/Solr Revolution 2014
Washington, D.C.
2. My Background
• Creator of Solr
• Heliosearch Founder
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford
3. Heliosearch Project
• The Next Evolution of Solr
• Forked from Solr, Developing at github
– Started Jan 2014
– Well aligned community
– Open Source, Apache licensed
• Bring back to Apache in the future?
• Currently drop-in replacement for Solr at the HTTP-API level
– A super-set… we continually merge in upstream changes
– Latest version of Heliosearch includes latest Solr
• Current Features: Off-heap filters, Off-heap fieldcache, facet-by-function, sub-facets, native code performance enhancements
5. Garbage Collection Basics
[Diagram: a thread and the JVM heap regions: Eden Space, Survivor Space 1, Survivor Space 2, Tenured Space, Permanent Space]
• New objects allocated in Eden
• Find live objects by tracing from GC “roots” (threads, stack locals, etc)
• Make a copy of live objects, leaving “garbage” behind
• Eden + Survivor Space copied together to other Survivor space
• Tenured from Survivor when old enough
• “stop-the-world” needed when GC can’t keep up
• Out of memory when too much time spent in GC
6. Java Memory Waste
- Need to size for worst case scenario
- OS needs free memory to cache index files
- JVMs aren’t good at “sharing” with rest of the system
- mmap allocations managed by OS, can be immediately reused on free
[Diagram: OS real memory divided among two JVMs (each with max heap, unused heap, heap in use), two C processes (each with unused heap, C heap in use), and mmap allocations. “Free” memory includes the buffer cache, which is important for caching index files.]
7. GC Impact
• GC Reduces Throughput
– Time to copy all that memory around could be spent better!
• Stop-the-world pauses
– Seconds to Minutes long
– Pause time proportional to heap size
– Still exists in all Hotspot GCs… CMS, G1GC, etc
– Breaks Application SLAs (request timeouts, etc)
– Can cause SolrCloud Zookeeper session timeouts
– Reducing max pause size normally means reduced throughput
• Non-graceful degradation
– If you don't size your heap big enough… BOOM!
9. GC Reduction
• Reuse objects – cause less garbage
• Move certain things off-heap (invisible to GC)
• Option 1: Direct ByteBuffers
– Limited to “int” (2GB)
– No way to directly “free” – still relies on GC
• Option 2: sun.misc.Unsafe
– malloc() + free() + direct memory access
– Supported on all major JVMs
– Widely used: Java (nio, concurrent), JSR166, Google Guava, objenesis (which is used in Kryo, which is used in Twitter Storm), Apache DirectMemory, Lightning, Hazelcast, snappy, gson, …
– Being considered for Java 9
11. Off-Heap Filters Test
Observed max process sizes
Solr: 3.8GB – 4.3GB
Heliosearch: 3.6GB – 3.7GB
12. Off-Heap FieldCache
Normal (on-heap) FieldCache
• Typically the largest data structures kept on the heap
• Used for sorting, function query values, single-valued faceting, grouping
• Uses weak references
Heliosearch nCache (n is for “native”)
• Allocated off-heap
• First-class managed Solr cache
• Configure size, warming policies
• View statistics
• Per-segment (NRT friendly)
• No weak references
14. nCache admin stats
item_id:{
"field":"id",
"uses":8,
"class":"StrTopValues",
"refcount":2,
"numSegments":7,
"carriedOver":6,
"size":612}
item_popularity:{
"field":"popularity",
"uses":5,
"class":"IntTopValues",
"refcount":2,
"numSegments":7,
"carriedOver":6,
"size":106}
item_price:{
"field":"price",
"uses":0, -- the number of top-level uses for searcher
"class":"FloatTopValues",
"refcount":2,
"numSegments":5, -- number of segments populated
"carriedOver":5, -- number of segments carried over from last searcher
"size":272 -- size in bytes for all populated segments
}
15. Off-Heap Integer Field
• 50M document index
• Sorting on 6 different integer fields (10,100,1000,10000,1M unique values)
• 4 request threads
Results
• 42% faster sorting
• 73% faster functions
16. String Field Sorting
• 10M document index
• 10 different string fields, each field 80% populated
• Median latency
17. String Field Sorting Throughput
• Concurrent throughput sorting on random fields in random order (asc/desc)
• ~50% performance gain
19. Native Code
• The Idea: create native accelerators for CPU hotspots
• Faceting anyone?
• But…. JNI Sucks! (and it’s GC’s fault again)
jint *buf = (*env)->GetIntArrayElements(env, arr, 0);
for (i = 0; i < len; i++) {
  sum += buf[i];
}
• GetArrayElements() – makes a *copy* of the array!
• GetPrimitiveArrayCritical() – blocks garbage collection!
• Tons of other restrictions… it’s a “critical section”
• Defeats the purpose of going to native code in the first place
• But… our data is already off-heap, we’re good!
20. Native Single Valued String Faceting
• Top-Level off-heap String cache
• Improves Sorting and Faceting speed
• Eliminates FieldCache “insanity”
• Native Code
• Written in C++, compiled with GCC 4.7, 4.8
• Currently supports 64 bit Windows, OS-X, Linux (x86)
• Static compilation avoids JVM hotspot warmup period, mis-compilation bugs, and variations between runs
25. Facet Module Goals
• Replace the aging “SimpleFacets”
• First class JSON support
• Easier programmatic construction of complex nested facet commands
• Canonical response format that is easier for clients to parse
• First class analytics support
• Cleaner distributed search support
• Fully pluggable
• Better base for integration of other search features
Heliosearch is a Solr super-set, so you can still choose to use the old faceting or mix-n-match.
26. API Comparison
Old Style
&facet=true
&facet.range={!key=age_ranges}age
&f.age_ranges.facet.range.start=0
&f.age_ranges.facet.range.end=100
&f.age_ranges.facet.range.gap=10
&facet.range={!key=price_ranges}price
&f.price_ranges.facet.range.start=0
&f.price_ranges.facet.range.end=1000
&f.price_ranges.facet.range.gap=50
New JSON API
{
  age_ranges: { // facet name
    range: { // facet type
      field : age, // facet params
      start : 0,
      end : 100,
      gap : 10
    }
  },
  price_ranges: {
    range: {
      field : price,
      start : 0,
      end : 1000,
      gap : 50
    }
  }
}
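The correspondence between the two styles can be sketched in a few lines of Python. This is only an illustration of how the names line up, not code from Heliosearch: the helper `to_old_style` is hypothetical, it handles range facets only, and it emits the parameters unencoded, exactly as they appear on the slide.

```python
def to_old_style(facets):
    """Translate {name: {"range": {...}}} into old-style facet parameters.

    Hypothetical helper for illustration; only range facets are handled.
    """
    params = ["facet=true"]
    for name, spec in facets.items():
        rng = spec["range"]
        # The facet name becomes the {!key=...} local param.
        params.append("facet.range={!key=%s}%s" % (name, rng["field"]))
        # Each facet parameter becomes a per-key f.<name>.facet.range.* entry.
        for opt in ("start", "end", "gap"):
            params.append("f.%s.facet.range.%s=%s" % (name, opt, rng[opt]))
    return params


facets = {
    "age_ranges": {"range": {"field": "age", "start": 0, "end": 100, "gap": 10}},
    "price_ranges": {"range": {"field": "price", "start": 0, "end": 1000, "gap": 50}},
}

print("&" + "\n&".join(to_old_style(facets)))
```

The nested form keeps each facet's name, type, and parameters together, whereas the flat form repeats the key in every parameter name.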
27. Facet Functions
• Sort/Report by things other than “count”
Aggregation Functions / Stats:
count
sum(function)
avg(function)
sumsq(function)
min(function)
max(function)
unique(string_field)
any “function query” that yields a numeric value!
Example: sum(mul(num_units, unit_price))
• Stats are calculated “per bucket”
• Buckets created by Query, Range, or Terms (field) facets
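To make “per bucket” concrete, here is a small in-memory sketch in Python. It is not how Heliosearch computes stats; the documents, the `brand` field, and the helper names are made up for illustration. It groups documents into terms buckets and then evaluates the equivalent of sum(mul(num_units, unit_price)), avg, and max inside each bucket.

```python
from collections import defaultdict

# Hypothetical documents, invented for this illustration.
docs = [
    {"brand": "acme", "num_units": 2, "unit_price": 10.0},
    {"brand": "acme", "num_units": 1, "unit_price": 5.0},
    {"brand": "zeta", "num_units": 3, "unit_price": 4.0},
]


def terms_buckets(docs, field):
    """Group documents by the value of a single-valued field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return buckets


def bucket_stats(bucket):
    """Evaluate mul(num_units, unit_price) per document, then aggregate."""
    revenue = [d["num_units"] * d["unit_price"] for d in bucket]
    return {
        "count": len(bucket),
        "sum": sum(revenue),
        "avg": sum(revenue) / len(revenue),
        "max": max(revenue),
    }


stats = {val: bucket_stats(b) for val, b in terms_buckets(docs, "brand").items()}
print(stats)
```

Each statistic is computed only over the documents that fall into that bucket, which is why the same function yields a different value in every bucket.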
28. Simple Request + Response
$ curl http://localhost:8983/solr/query -d 'q=widgets&
json.facet=
{ // Comments can help with clarity
  /* traditional C-style comments are also supported */
  x : "avg(price)" , // Simple strings can occur unquoted
  y : 'unique(brand)' // Strings can also use single quotes
}
'
[…]
"facets" : {
  "count" : 314,
  "x" : 102.5,
  "y" : 28
}
(“count”: Number of documents in the facet bucket)
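A client would typically build the same request body programmatically. The sketch below, in Python, only constructs the form-encoded body and does not send it. It assumes that strict JSON (as produced by `json.dumps`) is accepted wherever the relaxed syntax shown on the slide is; the helper name `build_body` is hypothetical.

```python
import json
from urllib.parse import urlencode, parse_qs


def build_body(query, facet):
    """Return a form-encoded body with q and json.facet parameters.

    Hypothetical helper; assumes the server accepts strict JSON here.
    """
    return urlencode({"q": query, "json.facet": json.dumps(facet)})


body = build_body("widgets", {"x": "avg(price)", "y": "unique(brand)"})
print(body)

# Round-trip check: decoding the body recovers the original facet command.
decoded = parse_qs(body)
print(json.loads(decoded["json.facet"][0]))
```

Generating the facet command from a dictionary avoids hand-escaping quotes inside a shell string.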
30. Sub-Facets
• Any facet that produces buckets can have sub-facets (terms/field, range, query)
• Sub-facets can have facet functions (stats) or their own sub-facets (no limit to nesting).
• A subfacet can be any type (field, range, query)
• Multiple subfacets can be added to any given facet
• Subfacets are first-class facets - can be configured independently like any other facet.
• Different offsets, limits, stats, sorts, etc
31. Sub-Facet Example
json.facet={
  shoes:{
    terms:{
      field: shoe_style,
      sort: {x : desc},
      facet:{
        x : "avg(price)",
        y : "unique(brand)",
        colors :{terms:color} // Short-form for terms facet simply specifies the field. Sorts buckets by count descending.
      }
    }
  }
}
"facets": {
  "count" : 472,
  "shoes": {
    "buckets" : [
      {
        "val" : "Hiking",
        "count" : 34,
        "x" : 135.25,
        "y" : 17,
        "colors" : {
          "buckets" : [
            { "val" : "brown", "count" : 12 },
            { "val" : "black", "count" : 10 },
            […]
          ]
        } // end of colors sub-facet
      }, // end of Hiking bucket
      {
        "val" : "Running",
        "count" : 45,
        "x" : 110.75,
        "y" : 24,
        "colors" : {
          "buckets" : […]
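Because sub-facets nest inside each bucket, a client can walk the response recursively. The following Python sketch flattens a response of this shape into (path, count) rows. The sample dictionary copies only the values visible on the slide; the parts the slide elides are left out or empty, and the function name `flatten` is hypothetical.

```python
# Sample shaped like the response above; elided parts are omitted or empty.
response = {
    "facets": {
        "count": 472,
        "shoes": {
            "buckets": [
                {
                    "val": "Hiking", "count": 34, "x": 135.25, "y": 17,
                    "colors": {"buckets": [
                        {"val": "brown", "count": 12},
                        {"val": "black", "count": 10},
                    ]},
                },
                {
                    "val": "Running", "count": 45, "x": 110.75, "y": 24,
                    "colors": {"buckets": []},
                },
            ]
        },
    }
}


def flatten(facet, path=()):
    """Return (path, count) rows for every bucket, descending into sub-facets."""
    rows = []
    for bucket in facet.get("buckets", []):
        here = path + (bucket["val"],)
        rows.append((here, bucket["count"]))
        for value in bucket.values():
            # Any dict carrying a "buckets" list is treated as a sub-facet.
            if isinstance(value, dict) and "buckets" in value:
                rows.extend(flatten(value, here))
    return rows


rows = flatten(response["facets"]["shoes"])
for path, count in rows:
    print(" > ".join(path), count)
```

Stats such as "x" and "y" are plain numbers in the bucket, so the walk skips them and recurses only into values that themselves contain buckets.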
32. Terms Facet
Terms facet creates buckets of docs with the same value in a field
- field – The field name to facet over.
- offset – Used for paging, this skips the first N buckets. Defaults to 0.
- limit – Limits the number of buckets returned. Defaults to 10.
- mincount – Only return buckets with a count of at least this number. Defaults to 1.
- sort – Specifies how to sort the buckets produced. “count” specifies document count,
“index” sorts by the index (natural) order of the bucket value. One can also sort by any
facet function / statistic that occurs in the bucket. The default is “count desc”. This
parameter may also be specified in JSON like sort:{count:desc}. The sort order may
either be “asc” or “desc”
- missing – A boolean that specifies if a special “missing” bucket should be returned that is
defined by documents without a value in the field. Defaults to false.
- numBuckets – A boolean. If true, adds “numBuckets” to the response, an integer
representing the number of buckets for the facet (as opposed to the number of buckets
returned). Defaults to false.
- allBuckets – A boolean. If true, adds an “allBuckets” bucket to the response, representing
the union of all of the buckets. For multi-valued fields, this is different than a bucket for all
of the documents in the domain since a single document can belong to multiple buckets.
Defaults to false.
- prefix – Only produce buckets for terms starting with the specified prefix.
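The interaction of these parameters can be illustrated with a small in-memory model in Python. This is a sketch of the described semantics, not the Heliosearch implementation. It covers single-valued string fields and the “count” and “index” sorts only, and it assumes that mincount is applied before offset and limit and that ties on count are broken by value.

```python
from collections import Counter


def terms_facet(values, offset=0, limit=10, mincount=1,
                sort="count desc", prefix=None):
    """Bucket a list of field values the way the parameters above describe."""
    counts = Counter(v for v in values if prefix is None or v.startswith(prefix))
    key, direction = sort.split()
    reverse = direction == "desc"
    if key == "count":
        # Sort by value first so the stable count sort has deterministic ties.
        ordered = sorted(counts.items(), key=lambda kv: kv[0])
        ordered.sort(key=lambda kv: kv[1], reverse=reverse)
    elif key == "index":
        ordered = sorted(counts.items(), key=lambda kv: kv[0], reverse=reverse)
    else:
        raise ValueError("only 'count' and 'index' sorts are modelled here")
    kept = [(v, c) for v, c in ordered if c >= mincount]
    return [{"val": v, "count": c} for v, c in kept[offset:offset + limit]]


colors = ["brown", "black", "brown", "red", "black", "brown"]
print(terms_facet(colors))
print(terms_facet(colors, sort="index asc", limit=2))
print(terms_facet(colors, mincount=2, offset=1))
```

With the defaults, the sample values produce buckets for brown, black, and red in descending count order.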
33. Query Facet
Query facet creates a single bucket of documents matching the
query.
{ // simple example
  highpop:{
    query:{
      q:"inStock:true AND popularity:[8 TO 10]"
    }
  }
}
{ // example with multiple sub-facets
  highpop:{
    query:{
      q : "inStock:true AND popularity:[8 TO 10]",
      facet : {
        average_price : "avg(price)",
        available_colors : { terms : color },
        price_ranges : {
          range : { field:price, start:0, end:200, gap:10 }
        }
      }
    }
  }
}
34. Range Facet
Creates buckets over ranges on a numeric or date field
Parameter names/values "in sync" with Solr range parameters:
field – The numeric field or date field to produce range buckets from
start – Lower bound of the ranges
end – Upper bound of the ranges
gap – Size of each range bucket produced
hardend – A boolean, which if true means that the last bucket will end at “end” even if it is less than “gap” wide. If false,
the last bucket will be “gap” wide, which may extend past “end”.
other – This param indicates that in addition to the counts for each range constraint between facet.range.start and
facet.range.end, counts should also be computed for…
– "before" all records with field values lower than the lower bound of the first range
– "after" all records with field values greater than the upper bound of the last range
– "between" all records with field values between the start and end bounds of all ranges
– "none" compute none of this information
– "all" shortcut for before, between, and after
include – By default, the ranges used to compute range faceting between facet.range.start and facet.range.end are
inclusive of their lower bounds and exclusive of the upper bounds. The “before” range is exclusive and the “after” range is
inclusive. This default, equivalent to lower below, will not result in double counting at the boundaries. This behavior can be
modified by the facet.range.include param, which can be any combination of the following options…
– "lower" all gap based ranges include their lower bound
– "upper" all gap based ranges include their upper bound
– "edge" the first and last gap ranges include their edge bounds (ie: lower for the first one, upper for the last one)
even if the corresponding upper/lower option is not specified
– "outer" the “before” and “after” ranges will be inclusive of their bounds, even if the first or last ranges already
include those boundaries.
– "all" shorthand for lower, upper, edge, outer
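The effect of hardend and of the default include behaviour is easiest to see with numbers. The Python sketch below computes bucket boundaries for integer start, end, and gap, and assigns a value to a bucket using the default rule described above (lower bound inclusive, upper bound exclusive). It illustrates the semantics only; the function names are hypothetical and date gaps are not handled.

```python
def range_buckets(start, end, gap, hardend=False):
    """Return (low, high) pairs covering start..end in steps of gap."""
    if gap <= 0:
        raise ValueError("gap must be positive")
    buckets = []
    low = start
    while low < end:
        high = low + gap
        if hardend and high > end:
            high = end  # the last bucket is cut off at "end"
        buckets.append((low, high))
        low = high
    return buckets


def bucket_index(value, buckets):
    """Default include=lower: low <= value < high. None if outside all ranges."""
    for i, (low, high) in enumerate(buckets):
        if low <= value < high:
            return i
    return None


soft = range_buckets(0, 25, 10)
hard = range_buckets(0, 25, 10, hardend=True)
print(soft)
print(hard)
print(bucket_index(10, hard), bucket_index(25, hard))
```

With start 0, end 25, and gap 10, the last bucket spans 20 to 30 when hardend is false and 20 to 25 when it is true. A value of exactly 10 lands in the second bucket only, so boundaries are not double counted.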
36. [Mockup: book sales dashboard]
Fantasy ($1045)
– Top Authors: $423 George R.R. Martin, $347 Brandon Sanderson, $155 JK Rowling
– Top Books: $252 A Game of Thrones, $113 Emperor of Thorns, $101 Nine Princes in Amber, $82 Steel Heart
Sci-Fi ($898)
– Top Authors: $321 Iain M Banks, $218 Neal Asher, $155 Neal Stephenson
– Top Books: $113 Gridlinked, $101 Use of Weapons, $93 Snow Crash, $82 The Skinner
Mystery ($645)
– Top Authors: $191 James Patterson, $145 Patricia Cornwell, $126 John Grisham
– Top Books: $85 One for the Money, $77 Angels & Daemons, $64 Shutter Island, $35 The Firm
Filter By
– State: $852 NJ (14 stores), $658 NY (11 stores), $421 CT (8 stores)
– Chain: $984 Amazoon (14 stores), $734 Houses&Royalty (9 stores), $387 Books-r-us (7 stores)
– Store: $108 Amazoon Branchburg, $93 Books-r-us Bridgewater, $87 H&R NYC
Number of Books
– Chain: 201K Houses&Royalty, 183K Amazoon, 98K Books-r-us
– Store: 193K H&R NYC, 77K Books-r-us Bridgewater, 68K Amazoon Branchburg
37. Implementation
date_breakout : { // Creates series of facet buckets based on date
  range: {
    field: sale_date,
    start : ...,
    end : ...,
    gap : "+1MONTH",
    facet : {
      top_genre : { // For each date bucket, facet by genre, taking the top 4 by revenue
        terms : {
          field : genre,
          sort : "revenue desc",
          limit : 4,
          facet : {
            revenue : "sum(sales)" // For each genre bucket, report revenue
          }
      }},
      by_chain: {
        terms : {
          field : chain,
          facet : {
            revenue : "sum(sales)"
          }
      }}
[…]
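Nested commands like this one lend themselves to programmatic construction, which is one of the stated goals of the facet module. The Python sketch below builds the same structure from small helper functions. The helpers `terms`, `date_range`, and `date_breakout` are hypothetical, and start and end are left as caller-supplied arguments because the slide elides them.

```python
import json


def terms(field, facet=None, **options):
    """Build a terms facet; extra options (sort, limit, ...) pass through."""
    body = {"field": field, **options}
    if facet:
        body["facet"] = facet
    return {"terms": body}


def date_range(field, start, end, gap, facet=None):
    """Build a range facet over a date field."""
    body = {"field": field, "start": start, "end": end, "gap": gap}
    if facet:
        body["facet"] = facet
    return {"range": body}


def date_breakout(start, end):
    """Monthly buckets, each with top genres by revenue and revenue by chain."""
    revenue = {"revenue": "sum(sales)"}
    return {
        "date_breakout": date_range(
            "sale_date", start, end, "+1MONTH",
            facet={
                "top_genre": terms("genre", sort="revenue desc", limit=4,
                                   facet=dict(revenue)),
                "by_chain": terms("chain", facet=dict(revenue)),
            },
        )
    }


# Placeholder bounds; the slide does not give actual start and end values.
command = date_breakout("<start>", "<end>")
print(json.dumps(command, indent=2))
```

Because every facet is an ordinary dictionary, adding another sub-facet is a matter of adding one more key rather than inventing a new parameter name.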
38. [Mockup repeated from slide 36: Fantasy, Sci-Fi, and Mystery panels, each with Top Authors and Top Books]
top_genres:{
  terms:{
    field: genre,
    facet : {
      rev : "sum(sales)",
      top_authors:{
        terms:{
          field : author,
          sort : "rev desc",
          limit : 3,
          facet : {
            rev : "sum(sales)"
          }
      }},
      top_books:{
        terms:{
          field : title,
          sort : "rev desc",
          limit : 4,
          facet : {
            rev : "sum(sales)"
          }
      }}
[…]
41. Parameter Substitution
• Parameters / macros substituted across whole request
• Happens before any parsing, so usable in any context
q=price:[ ${low} TO ${high} ]
&low=100
&high=200
• Default values
q=price:[ ${low:0} TO ${high:100} ]
• Nested
q=${price_query}
&price_query=${price_field}:[ ${low} TO ${high} ] AND inStock:true
&price_field=specialPrice
&low=50
&high=100
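The three forms above (plain, default value, nested) can be modelled with a short recursive expansion. This Python sketch is an approximation of the described behaviour, not the Heliosearch code. The function name `expand`, the depth limit, and the choice to leave unknown macros without a default untouched are all assumptions.

```python
import re

# ${name} or ${name:default}
_MACRO = re.compile(r"\$\{(\w+)(?::([^}]*))?\}")


def expand(text, params, depth=0):
    """Substitute ${name} and ${name:default}, expanding nested macros."""
    if depth > 10:
        raise ValueError("macro expansion is nested too deeply")

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in params:
            # The substituted value may itself contain macros.
            return expand(params[name], params, depth + 1)
        if default is not None:
            return default
        return match.group(0)  # unknown macro without default: left as-is

    return _MACRO.sub(replace, text)


print(expand("price:[ ${low} TO ${high} ]", {"low": "100", "high": "200"}))
print(expand("price:[ ${low:0} TO ${high:100} ]", {}))
print(expand("${price_query}", {
    "price_query": "${price_field}:[ ${low} TO ${high} ] AND inStock:true",
    "price_field": "specialPrice",
    "low": "50",
    "high": "100",
}))
```

Since substitution is purely textual and runs before parsing, the same mechanism works inside a query string, a filter, or a facet command.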
42. New Query Parser Features
• Filters in queries - just like “fq” parameters, but may appear anywhere in a query
q=(text:elephant -(filter(*:* -price:[ 0 TO 100 ]) OR filter(date:[0 TO 2013])))
• Constant Score Queries
q=color:(blue OR green)^=1 text:shoes
• Comments in Queries (can nest)
q=+text:elephant /* the main query */ /* boosting part – WIP {!func}mul(pop,rank)^10 */
43. Thank You
Help Develop the Next Generation of Solr!
Resources:
• http://heliosearch.org
• https://github.com/Heliosearch/heliosearch
• https://groups.google.com/forum/#!forum/heliosearch
• https://groups.google.com/forum/#!forum/heliosearch-dev