1. Document relations
Martijn van Groningen
@mvgroningen
Tuesday, November 6, 12
2. Overview
• Background
• Document relations with joining.
• Various solutions in Lucene and Elasticsearch
3. Background - Lucene model
• Lucene is document based.
• Lucene doesn’t store information about relations
between documents.
• Data often holds relations.
• Good free text search over relational data.
4. Background - Common solutions
• Compound documents.
• May result in documents with many fields.
• Subsequent searches.
• May cause a lot of network overhead.
• Non Lucene based approach:
• Use Lucene in combination with a relational database.
5. Background - Example
• Product
• Name
• Description
• Product-item
• Color
• Size
• Price
6. Background - Example
• Compound Product & Product-items document.
• Each product-item has its own field prefix.
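A minimal sketch of the compound approach, assuming a hypothetical `item_<n>_` field prefix and illustrative sample values; the slides do not prescribe a naming scheme.

```python
def to_compound_document(product, items):
    """Flatten a product and its product-items into one flat document.

    The "item_<n>_" prefix is a hypothetical naming scheme for this sketch.
    """
    doc = {"name": product["name"], "description": product["description"]}
    for n, item in enumerate(items):
        for field, value in item.items():
            # Every product-item gets its own prefix so fields do not collide.
            doc[f"item_{n}_{field}"] = value
    return doc


product = {"name": "Polo shirt", "description": "Made of 100% cotton"}
items = [
    {"color": "red", "size": "s", "price": 999},
    {"color": "blue", "size": "m", "price": 1099},
]
compound = to_compound_document(product, items)
```

The field count grows with every product-item, which is the "many fields" drawback of compound documents.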
7. Background - Other solutions
• Lucene offers solutions for 'relational'-like
search.
• Joining
• Result grouping
• Elasticsearch builds on top of the joining
capabilities.
• These solutions aren't naturally supported.
9. Joining
• Join support available since Lucene 3.4
• Not a SQL join!
• Two distinct joining types:
• Index time join
• Query time join
• Joining provides a solution to handle document
relations.
10. Joining - What is out there?
• Index time:
• Lucene’s block join implementation.
• Elasticsearch’s nested filter, query and facets.
• Built on top of Lucene’s block join support.
• Query time:
• Lucene’s query time join utility.
• Solr’s join query parser.
• Elasticsearch’s various parent-child queries and filters.
11. Index time join
And nested documents.
13. Joining - Block indexing
• Atomically adding documents.
• A block of documents.
• Each document gets a sequentially assigned Lucene
document id.
• IndexWriter#addDocuments(docs);
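Block indexing can be pictured with a toy model. This is a sketch in plain Python, not Lucene's implementation: `ToyIndex` and `add_documents` are hypothetical names, and a list position stands in for the Lucene document id.

```python
class ToyIndex:
    """Toy stand-in for a single index segment; not Lucene's implementation."""

    def __init__(self):
        # The position in this list plays the role of the Lucene document id.
        self.docs = []

    def add_documents(self, block):
        """Append a whole block at once and return the assigned doc ids."""
        first_id = len(self.docs)
        self.docs.extend(block)
        return list(range(first_id, len(self.docs)))


index = ToyIndex()
shirt_ids = index.add_documents([
    {"type": "offer", "color": "red", "size": "s"},
    {"type": "offer", "color": "blue", "size": "m"},
    {"type": "product", "name": "Polo shirt"},
])
jeans_ids = index.add_documents([
    {"type": "offer", "color": "black", "size": "l"},
    {"type": "product", "name": "Jeans"},
])
```

Each block ends up as one contiguous run of ids, which is the property the block join relies on.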
14. Joining - Block indexing
• Index doesn't record blocks.
• App is responsible for identifying block documents.
• Segment merging doesn’t re-order documents in a
segment.
• Adding a document to a block requires you to
reindex the whole block.
15. Joining - block join query
• Parent is the last document in a block.
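Because the parent closes its block, a single pass over the document ids is enough to join matching children up to their parent. The sketch below is a toy model of that idea, not Lucene's block join code; `to_parent_block_join` and the sample documents are hypothetical.

```python
def to_parent_block_join(docs, is_parent, child_matches):
    """Return doc ids of parents that have at least one matching child.

    Relies on the block layout: children first, parent last in each block.
    """
    parent_ids = []
    block_has_match = False
    for doc_id, doc in enumerate(docs):
        if is_parent(doc):
            if block_has_match:
                parent_ids.append(doc_id)
            block_has_match = False  # the next block starts after this parent
        elif child_matches(doc):
            block_has_match = True
    return parent_ids


docs = [
    {"type": "offer", "size": "s"},
    {"type": "offer", "size": "m"},
    {"type": "product", "name": "Polo shirt"},
    {"type": "offer", "size": "l"},
    {"type": "product", "name": "Jeans"},
]
hits = to_parent_block_join(
    docs,
    is_parent=lambda d: d["type"] == "product",
    child_matches=lambda d: d.get("size") == "m",
)
```

Here only the polo shirt (doc id 2) has an offer in size m, so it is the only parent returned.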
18. Block join - ToChildBlockJoinQuery
• Parent filter marks the parent documents.
• The wrapped query is executed in the parent space;
the children of the matching parents are returned.
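A toy sketch of the reverse direction, assuming (as in Lucene's ToChildBlockJoinQuery) that the wrapped query selects parent documents and the join returns their children. The function name and sample documents are hypothetical.

```python
def to_child_block_join(docs, is_parent, parent_matches):
    """Return doc ids of all children whose parent matches.

    Relies on the block layout: children first, parent last in each block.
    """
    child_ids = []
    block_start = 0
    for doc_id, doc in enumerate(docs):
        if is_parent(doc):
            if parent_matches(doc):
                # Everything between the previous parent and this one is a child.
                child_ids.extend(range(block_start, doc_id))
            block_start = doc_id + 1
    return child_ids


docs = [
    {"type": "offer", "size": "s"},
    {"type": "offer", "size": "m"},
    {"type": "product", "name": "Polo shirt"},
    {"type": "offer", "size": "l"},
    {"type": "product", "name": "Jeans"},
]
children = to_child_block_join(
    docs,
    is_parent=lambda d: d["type"] == "product",
    parent_matches=lambda d: d["name"] == "Polo shirt",
)
```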
19. Block join & Elasticsearch
• In Elasticsearch exposed as nested objects.
• Documents are constructed as JSON.
• JSON’s nested structure works nicely with block
indexing.
• Elasticsearch takes care of block indexing and also
keeps track of the nested documents.
20. Elasticsearch’s nested support
• Support for a nested type in mappings.
• Nested query.
• Nested filter.
• Nested facets.
21. Nested type
• The nested type enables Lucene’s block indexing.
• Index: products. Type: product. Nested field: offers.
curl -XPUT 'localhost:9200/products' -d '{
"mappings" : {
"product" : {
"properties" : {
"offers" : { "type" : "nested" }
}
}
}
}'
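With such a mapping in place, a nested query names the nested path and wraps an inner query. The sketch below only builds the request body; it assumes the offers objects carry a `size` field (addressed by its full path `offers.size`), which the mapping above does not spell out.

```python
import json

# Request body for a nested query against the "offers" nested field.
# "offers.size" is an assumed field name; adjust it to the real document shape.
nested_query = {
    "query": {
        "nested": {
            "path": "offers",
            "query": {"term": {"offers.size": "m"}},
        }
    }
}

body = json.dumps(nested_query, indent=2)
print(body)
```

The printed JSON could be sent as the body of a search request against the products index.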
25. Query time join
and parent & child relations.
26. Query time joining
• Documents are joined during query time.
• More expensive, but more flexible.
• Two types of query time joins:
• Parent child joining.
• Field based joining.
27. Lucene’s query time join
• Query time joining is executed in two phases.
• Field based joining:
• ‘from’ field
• ‘to’ field
• Doesn’t require block indexing.
28. Query time join - JoinUtil
• The first phase collects all the terms in the fromField
for the documents that match with the original query.
• The second phase returns the documents that match
with the collected terms from the previous phase in the
toField.
• One public method:
• JoinUtil#createJoinQuery(...)
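The two phases can be modeled in a few lines. This is a toy illustration over Python dicts, not the JoinUtil API: `join_query`, the field names `product_id` and `id`, and the sample documents are all hypothetical.

```python
def join_query(docs, from_field, to_field, from_matches):
    """Toy two-phase query time join; returns the doc ids of the 'to' side."""
    # Phase 1: collect the from_field terms of all docs matching the original query.
    terms = {
        doc[from_field]
        for doc in docs
        if from_field in doc and from_matches(doc)
    }
    # Phase 2: return the docs whose to_field holds one of the collected terms.
    return [
        doc_id
        for doc_id, doc in enumerate(docs)
        if to_field in doc and doc[to_field] in terms
    ]


docs = [
    {"id": "1", "name": "Polo shirt"},
    {"id": "2", "name": "Jeans"},
    {"product_id": "1", "color": "red", "size": "s"},
    {"product_id": "1", "color": "blue", "size": "m"},
    {"product_id": "2", "color": "black", "size": "l"},
]
hits = join_query(docs, "product_id", "id", lambda d: d.get("size") == "m")
```

No block layout is needed here: the join works on field values alone, which is what makes it more flexible and more expensive.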
31. Joining - JoinUtil
Join utility
• Result will contain one product.
• Possible to do ‘join’ across indices.
32. Elasticsearch’s query time join
• A parent child solution.
• Not related to Lucene’s query time join.
• Support consists of:
• The _parent field.
• The top_children query.
• The has_parent & has_child filter & query.
• Scoped facets.
33. The _parent field
• Points to the parent type.
• Mapping attribute to be defined on the child type.
curl -XPUT 'localhost:9200/products' -d '{
"mappings" : {
"offer" : {
"_parent" : {
"type" : "product"
}
}
}
}'
• Elasticsearch uses the _parent field to build an id
cache.
• Makes parent/child queries & filters fast.
34. Indexing parent & child documents
• Parent document:
curl -XPOST 'localhost:9200/products/product/1' -d '{
"name" : "Polo shirt",
"description" : "Made of 100% cotton"
}'
• Child documents:
• The parent parameter holds the id of the parent
document. Also used for routing.
curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{
"color" : "red",
"size" : "s",
"price" : 999
}'
curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{
"color" : "blue",
"size" : "s",
"price" : 999
}'
curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{
"color" : "red",
"size" : "m",
"price" : 1099
}'
35. The ‘top_children’ query
curl -XPOST 'localhost:9200/products/_search' -d '{
"query" : {
"top_children" : {
"type" : "offer",
"query" : {
"term" : {
"size" : "m"
}
},
"score" : "sum"
}
}
}'
• type is the child type, query is the child query and
score is the score mode.
• Internally the child query is potentially executed
several times in order to get enough parent hits.
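The retry behaviour can be sketched as a loop that widens the child search until enough distinct parents turn up. This is a toy model under stated assumptions, not Elasticsearch's code: the `factor` and `incremental_factor` parameter names and their defaults are assumptions of this sketch, and slicing a pre-sorted list stands in for running the child query.

```python
def top_children(child_hits, wanted_parents, factor=5, incremental_factor=2):
    """Toy model of the retry loop.

    child_hits: list of (parent_id, score) pairs, sorted by score descending.
    Returns (parents ranked by summed child score, number of child-query rounds).
    """
    limit = wanted_parents * factor
    rounds = 0
    while True:
        rounds += 1
        totals = {}
        # "Run" the child query for the top `limit` child hits.
        for parent_id, score in child_hits[:limit]:
            totals[parent_id] = totals.get(parent_id, 0.0) + score  # score mode: sum
        # Stop when enough parents were found or no more child hits exist.
        if len(totals) >= wanted_parents or limit >= len(child_hits):
            break
        limit *= incremental_factor
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:wanted_parents], rounds


child_hits = [("p1", 3.0), ("p1", 2.0), ("p1", 1.5), ("p2", 1.0)]
result = top_children(child_hits, wanted_parents=2, factor=1)
```

In this run the first round sees only children of p1, so the child query is widened and executed a second time before two parents are available.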
36. The ‘has_child’ query
curl -XPOST 'localhost:9200/products/_search' -d '{
"query" : {
"has_child" : {
"type" : "offer",
"query" : {
"term" : {
"size" : "m"
}
}
}
}
}'
• type is the child type and query is the child query.
• Doesn’t map the child scores into the matching
parent doc. Works as a filter.
• The has_parent query matches child documents
instead.
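The two directions can be compared with a toy model over Python dicts. It is a sketch of the matching semantics only, not of Elasticsearch's id cache; the `_type`, `_id` and `_parent` keys mimic the metadata fields, and the function names and sample documents are hypothetical.

```python
def has_child(docs, parent_type, child_type, child_matches):
    """Ids of parents with at least one matching child; no scores involved."""
    parent_ids = {
        d["_parent"]
        for d in docs
        if d["_type"] == child_type and child_matches(d)
    }
    return [d["_id"] for d in docs
            if d["_type"] == parent_type and d["_id"] in parent_ids]


def has_parent(docs, parent_type, child_type, parent_matches):
    """Ids of children whose parent matches; the reverse direction."""
    parent_ids = {
        d["_id"]
        for d in docs
        if d["_type"] == parent_type and parent_matches(d)
    }
    return [d["_id"] for d in docs
            if d["_type"] == child_type and d["_parent"] in parent_ids]


docs = [
    {"_type": "product", "_id": "1", "name": "Polo shirt"},
    {"_type": "product", "_id": "2", "name": "Jeans"},
    {"_type": "offer", "_id": "a", "_parent": "1", "size": "s"},
    {"_type": "offer", "_id": "b", "_parent": "1", "size": "m"},
    {"_type": "offer", "_id": "c", "_parent": "2", "size": "l"},
]
products = has_child(docs, "product", "offer", lambda d: d["size"] == "m")
offers = has_parent(docs, "product", "offer", lambda d: d["name"] == "Jeans")
```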
38. Conclusion
• Block join & nested objects are fast and efficient,
but lack flexibility.
• Query time and parent child join are flexible at the
cost of performance and memory.
• Field based query time joining is the most flexible.
• Parent child based joining is the fastest.
• Faceting in combination with document relations
gives a nice analytical view.