7. What is ElasticSearch?
“flexible and powerful open-source, distributed (NoSQL), RESTful search engine built on
top of Lucene”
(http://www.elastic.co)
Features: real-time data, real-time analytics, distributed, high availability, multi-tenancy,
full-text search, document-oriented, conflict management, schema-free, RESTful API,
per-operation persistence, Apache 2 open-source license, built on top of Apache Lucene.
8. Installation
Procedure
Java-based; requires Java 7 or later
The same JVM version is required on all nodes
Set the required environment variables (such as JAVA_HOME)
Fill in the ElasticSearch config files
Streamlined installation available for Windows (runs as a local service)
https://github.com/rgl/elasticsearch-setup/releases
10. Scalability
NoSQL databases tend to scale more easily and can
perform better than relational databases, because their
data model addresses several issues that the relational
model is not designed to address
- Dynamic data model instead of a structured, fixed one
- Efficient scale-out architecture instead of an expensive,
monolithic scale-up architecture
- Object-oriented programming that is easy to use and
flexible
Data representation in JSON
11. Scalability - Architecture
Cluster
logical grouping of multiple nodes
Node
an elasticsearch server instance
Master – the single node in charge of managing cluster-wide operations
Not a bottleneck for queries
Shard
low-level worker instance that holds a slice of all data
Each document belongs to a single primary shard
The number of primary shards is set during index creation
This number determines how much data is stored in each shard
Replica
A copy of a primary shard on a different node
Can be added at any time
Spread over the nodes automatically
12. Create an index
POST /<index name>
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
[Figure: shard allocation on 1 node, 2 nodes, 3 nodes, and 3 nodes with 2 replicas]
Having more replica shards on the same
number of nodes does not increase
performance, because each shard has
access to a smaller fraction of its node's
resources, but it does add redundancy.
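The trade-off above can be made concrete with a little arithmetic. The sketch below counts the shard copies an index needs and spreads them evenly over the nodes. The helper name `shards_per_node` is hypothetical, and the even spread is a simplifying assumption rather than ElasticSearch's actual allocation logic. Python is used only for illustration.

```python
import math

def shards_per_node(primaries: int, replicas: int, nodes: int) -> int:
    """Upper bound on shard copies per node, assuming an even spread (hypothetical helper)."""
    total_copies = primaries * (1 + replicas)  # each primary plus its replicas
    return math.ceil(total_copies / nodes)

# 3 primary shards with 1 replica each on 3 nodes
one_replica = shards_per_node(primaries=3, replicas=1, nodes=3)

# Same 3 nodes, 2 replicas each: more copies share the same hardware
two_replicas = shards_per_node(primaries=3, replicas=2, nodes=3)

print(one_replica, two_replicas)
```

With more replicas on the same nodes, every node hosts more shard copies, so each copy gets a smaller share of CPU, memory and disk, while the data survives the loss of more nodes.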
13. Default Routing
Hashes the ID of a document and uses the hash to pick a shard, both to store and to retrieve the document.
Gives an even distribution of documents across the entire set of shards
But what about search?
Incoming request
Broadcast the query to all shards
Aggregate all results and send them back
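A minimal sketch of the two paths described above: retrieval by ID touches one shard, while a search is broadcast to all shards and the results are merged. The `zlib.crc32` hash is a stand-in chosen for illustration (ElasticSearch uses its own hash function), and the in-memory `shards` list and the helper names are hypothetical.

```python
import zlib

NUM_SHARDS = 3  # same value as number_of_shards in the index settings example

def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from a stable hash of the routing value (the document ID by default)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards

# One dict per shard, mapping document ID -> document
shards = [dict() for _ in range(NUM_SHARDS)]

def index(doc_id: str, doc: dict) -> None:
    shards[shard_for(doc_id)][doc_id] = doc

def get(doc_id: str):
    # Retrieval by ID only has to ask the one shard the hash points to
    return shards[shard_for(doc_id)].get(doc_id)

def search(predicate) -> list:
    # A search has no ID to hash, so every shard is queried
    # and the partial results are merged before returning
    hits = []
    for shard in shards:
        hits.extend(doc for doc in shard.values() if predicate(doc))
    return hits

for i in range(100):
    index(str(i), {"id": i, "name": "product %d" % i})
```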
14. Custom Routing
Configure routing for a certain type:
PUT /<index name>/<type>/_mapping
{
  "order": {
    "_routing": {
      "required": true,
      "path": "customerID"
    }
  }
}
Search only the documents of user user123:
GET /<index name>/<type>/_search?routing=user123
{
  "query": {
    "match_all": {}
  }
}
Tell ElasticSearch which property
to use to determine routing,
e.g. zipcode or age
Default routing ensures that distribution is fairly
uniform across all shards.
Once you start implementing your own custom
schemes, it is entirely possible that this uniformity is
lost.
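The loss of uniformity can be illustrated with made-up data. The sketch below routes the same hypothetical orders once by order ID and once by `customerID`. The `zlib.crc32` hash and the order data are assumptions for illustration only, not ElasticSearch's own hash or real figures.

```python
import zlib
from collections import Counter

NUM_SHARDS = 3

def shard_for(routing_value: str) -> int:
    """Stand-in routing hash (assumption: ElasticSearch uses a different hash function)."""
    return zlib.crc32(routing_value.encode("utf-8")) % NUM_SHARDS

# Hypothetical orders: one very active customer and 50 small ones
orders = [{"orderID": "A%d" % i, "customerID": "user123"} for i in range(900)]
orders += [{"orderID": "B%d" % i, "customerID": "user%d" % (i % 50)} for i in range(100)]

# Default routing: hash the document ID
by_id = Counter(shard_for(o["orderID"]) for o in orders)

# Custom routing: hash the customerID instead
by_customer = Counter(shard_for(o["customerID"]) for o in orders)

# All orders of user123 live on one shard, so a search with
# routing=user123 only needs to query that single shard ...
user123_shards = {shard_for(o["customerID"]) for o in orders
                  if o["customerID"] == "user123"}

# ... but that shard now holds at least 900 of the 1000 documents
print(dict(by_id), dict(by_customer))
```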
16. Dealing with human language
Indexing
Example: <div>Here is some example text including an extract of 9 poems</div>
Analyzers
Character filters
convert 9 to nine
strip HTML and extract the actual text
Tokenizer
create individual terms or tokens from text, splitting on commas, whitespace, periods, hyphens, …
Token filters
lower-case all words
remove stop words like ‘an’, ‘the’, …
stemming: reduce verbs and other words to their stem
{here} {is} {some} {example} {text} {including} {extract} {nine} {poems}
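The three analysis stages can be sketched as three small functions applied in order. This is a simplified illustration and not ElasticSearch's implementation: the stop-word list is a tiny made-up subset, the function names are hypothetical, and stemming is left out.

```python
import re

STOP_WORDS = {"an", "of", "the"}  # tiny illustrative list, not ElasticSearch's

def char_filter(text: str) -> str:
    """Character filters work on the raw string before it is split into tokens."""
    text = re.sub(r"<[^>]+>", " ", text)   # strip HTML tags
    return re.sub(r"\b9\b", "nine", text)  # convert 9 to nine

def tokenize(text: str) -> list:
    """Tokenizer: split the text on whitespace and punctuation."""
    return re.findall(r"\w+", text)

def token_filter(tokens: list) -> list:
    """Token filters work on the individual tokens."""
    lowered = [t.lower() for t in tokens]               # lower-case all words
    return [t for t in lowered if t not in STOP_WORDS]  # remove stop words

def analyze(text: str) -> list:
    return token_filter(tokenize(char_filter(text)))

terms = analyze("<div>Here is some example text including an extract of 9 poems</div>")
print(terms)
```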
17. Text Analysis - Experiments
Whitespace
Whitespace tokenizer - A tokenizer of type whitespace that divides text at whitespace.
Sentence: Convert the title-case text using the ToLower(string) command.
Result: {Convert} {the} {title-case} {text} {using} {the} {ToLower(string)} {command.}
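Splitting on whitespace alone is easy to reproduce. The sketch below uses Python's `str.split()` as a stand-in for the whitespace tokenizer; punctuation, hyphens and case are left untouched.

```python
sentence = "Convert the title-case text using the ToLower(string) command."

# Split on runs of whitespace only; nothing else is changed
tokens = sentence.split()
print(tokens)
```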
18. Text Analysis - Experiments
Simple
Standard tokenizer - A grammar-based tokenizer of type standard that works well for documents in
most European languages.
Lower-case token filter
Sentence: Convert the title-case text using the ToLower(string) command.
Result: {convert} {the} {title} {case} {text} {using} {the} {tolower} {string} {command}
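A rough approximation of this experiment: split on every character that is not a letter or digit, then lower-case the tokens. The regular expression is an assumption made for illustration; the real standard tokenizer follows more elaborate word-boundary rules.

```python
import re

def standard_lowercase(text: str) -> list:
    # Approximation: every run of letters or digits becomes a token
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Lower-case token filter
    return [t.lower() for t in tokens]

tokens = standard_lowercase("Convert the title-case text using the ToLower(string) command.")
print(tokens)
```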
19. Text Analysis - Experiments
Stop analyzer:
Standard tokenizer
Lower-case token filter
Stop token filter
A token filter of type stop that removes stop words (meaningless words for search) from token streams.
Support for multiple languages
Sentence: Convert the title-case text using the ToLower(string) command.
Result: {convert} {title} {case} {text} {using} {tolower} {string} {command}
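The stop token filter is simply one more step after tokenizing and lower-casing. In the sketch below, the stop-word set is a small illustrative subset rather than the list ElasticSearch ships with, and the tokenizer is the same rough regular-expression approximation.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # illustrative subset

def stop_analyze(text: str) -> list:
    # Tokenize and lower-case first
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    # Stop token filter: drop words that carry no meaning for search
    return [t for t in tokens if t not in STOP_WORDS]

tokens = stop_analyze("Convert the title-case text using the ToLower(string) command.")
print(tokens)
```

With this list, both occurrences of "the" are dropped from the token stream.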
20. Text Analysis - Experiments
Snowball
Standard tokenizer
Lower-case token filter
Stop token filter
Stemming (snowball generated stemmer)
A filter that stems words (reduces each word to its root form) using a Snowball-generated stemmer
Support for multiple languages
Sentence: Convert the title-case text using the ToLower(string) command.
Result: {convert} {title} {case} {text} {use} {tolower} {string} {command}
21. Text Analysis- Adding Custom Analyzers
PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "YourCustomAnalyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball" ]
        }
      }
    }
  }
}
A list of available analysis tools:
Character filters: http://bit.ly/1H3hgJF
Tokenizers: http://bit.ly/1zIU2IO
Token filters: http://bit.ly/1AJXCO2
Possible to create your own combination!
22. Text Analysis – Define analyzer
Create a mapping type (cf. a table in a relational database)
Assign fields
Define field types (string, int, date, …)
Define the analyzer to be used
Define the boost value on a field
Define the routing
…
PUT /my_index/_mapping/my_type
{
  "my_type": {
    "properties": {
      "english_title": {
        "type": "string",
        "analyzer": "english"
      }
    }
  }
}
24. What is NEST?
NEST
• All request & response objects represented
• Strongly typed Query DSL implementation
• Supports fluent syntax
• Uses ElasticSearch.NET
ElasticSearch.NET
• Low-level, dependency-free client
• All ES endpoints are available as methods
ElasticSearch RESTful API
http://nest.azurewebsites.net/
25. NEST – Connection Initialization
Initialize an ElasticClient:
All actions on the ElasticSearch cluster are performed using the ElasticClient
For example:
Search
Index
DeleteIndex/CreateIndex
…
Uri node = new Uri("http://192.168.137.73:9200");
ConnectionSettings settings = new ConnectionSettings(node, defaultIndex: "products");
ElasticClient client = new ElasticClient(settings);
26. Index your content
Index the raw JSON string (PUT /products/product/1)
Index a Type
Automatically infers
Index
Type
ID
Use ElasticType to define type behavior
Use ElasticProperty to define field behavior
Define explicit values for inferred ones
More information:
http://nest.azurewebsites.net/nest/index-type-inference.html
http://localhost:9200/products/product/1
{
  "id": "1",
  "name": "MacBook Air",
  "price": 1099,
  "descr": "Some lengthy never-read description",
  "attributes": {
    "color": "silver",
    "display": 13.3,
    "ram": 4
  }
}
27. Index your Content - .NET
Raw JSON string
client.Raw.Index("products", "product", new JavaScriptSerializer().Serialize(prod));
Type-based indexing
client.Index(product);
Modify out-of-the-box behavior using attributes
[ElasticType(Name = "Product", IdProperty = "id")]
public class Product
{
    public int id { get; set; }

    [ElasticProperty(Name = "name", Index = FieldIndexOption.Analyzed, Type = FieldType.String, Analyzer = "standard")]
    public string name { get; set; }

    // ... (remaining properties not shown on the slide)
}
30. Query your content – Query DSL .NET
Retrieve all products from an index using a MatchAll search
result = client.Search<Product>(s => s.MatchAll());
Retrieve matching products using a term query (typed or string-based field name)
result = client.Search<Product>(s => s.Query(q => q.Term(t => t.name, "macbook")));
result = client.Search<Product>(s => s.Query(q => q.Term("name", "macbook")));
Search on all fields using the _all built-in property
result = client.Search<Product>(s => s.Query(q => q.Term("_all", "macbook")));
Search on a combination of fields using boolean operators (see Fiddler result)
result = client.Search<Product>(s => s.Query(q => q.Term("name", "macbook") ||
    q.Term("descr", "macbook")));
31. Query your content – Query DSL
Search on a combination of fields using boolean operators and a range query on the price field
result = client.Search<Product>(s => s
    .Query(q => (q.Term("name", "macbook") || q.Term("descr", "macbook"))
        && q.Range(r => r
            .OnField("price")
            .Greater(1000)
            .LowerOrEquals(2000)
        )));
Some more advanced query examples:
Wildcard query - use wildcards to search for relevant documents
Span near - search for word combinations within a certain span in the document
More like this query - finds documents that are ‘like’ a given set of documents using representative terms
More information: http://bit.ly/1A6wpKs
32. Query your content – Fuzzy searches
Perform a fuzzy search to tolerate spelling errors in the query string
result = client.Search<Product>(s => s
.Query(q => q
.Match(m => m
.Query("makboek")
.OnField("name")
.Fuzziness(10)
.PrefixLength(1)
)));
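Fuzzy matching is based on edit distance: the number of single-character insertions, deletions and substitutions needed to turn one term into another. The sketch below is a textbook Levenshtein implementation, shown only to illustrate the idea. It is not how ElasticSearch or Lucene compute fuzzy matches internally, and the function name is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

# The misspelled query from the example versus the indexed term
distance = edit_distance("makboek", "macbook")
print(distance)
```

The misspelling differs from the indexed term in two characters (k instead of c, e instead of o), so a fuzzy query that allows two edits can still match it.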
33. Query your content - Paging
Select pages from the full result set using the From and Size parameters
result = client.Search<Product>(s => s
.Query(q => q.Term("name", "macbook") || q.Term("descr", "macbook"))
.From(0)
.Size(1));
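From is a zero-based offset into the ordered result set and Size is the number of hits to return. The sketch below mimics that on a plain list. The helper names `page` and `offset_for` are hypothetical and the hits are made up.

```python
def page(hits: list, from_: int, size: int) -> list:
    """Return `size` hits starting at zero-based offset `from_`."""
    return hits[from_:from_ + size]

def offset_for(page_number: int, size: int) -> int:
    """From offset for a 1-based page number (hypothetical helper)."""
    return (page_number - 1) * size

hits = ["hit %d" % i for i in range(1, 8)]  # 7 made-up hits

first = page(hits, 0, 1)                     # same as .From(0).Size(1)
last_page = page(hits, offset_for(3, 3), 3)  # page 3 with 3 hits per page
print(first, last_page)
```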
34. Query your content – Hit Highlighting
Hit Highlighting
Possible to set different pre- and post-tags on specific fields
result = client.Search<Product>(s => s
.Query(q => q.Term("name", "macbook"))
.Highlight(h => h
.PreTags("<b>")
.PostTags("</b>")
.OnFields(f => f
.OnField(e => e.name))));
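What the pre- and post-tags do to a hit can be sketched with a regular expression that wraps each match. This is an illustration of the output format only, not ElasticSearch's highlighter. The function name and the case-insensitive whole-word matching are assumptions.

```python
import re

def highlight(text: str, term: str, pre_tag: str = "<b>", post_tag: str = "</b>") -> str:
    """Wrap every whole-word, case-insensitive match of `term` in the given tags."""
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: pre_tag + m.group(0) + post_tag, text)

snippet = highlight("MacBook Air", "macbook")
custom = highlight("MacBook Air", "macbook", pre_tag="<em>", post_tag="</em>")
print(snippet, custom)
```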
35. Query your content – Aggregations
Aggregations group documents based on term values
Useful to create a faceted search interface
result = client.Search<Product>(s => s
.Aggregations(a => a
.Terms("color", st => st
.Field(o => o.attributes.color))));
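A terms aggregation can be pictured as counting documents per distinct field value. The sketch below does this in memory for a few made-up products. The `key` and `doc_count` field names mirror the bucket fields of an ElasticSearch terms aggregation response, while the helper and the data are hypothetical.

```python
from collections import Counter

# Made-up documents for illustration
products = [
    {"name": "MacBook Air", "attributes": {"color": "silver"}},
    {"name": "MacBook Pro", "attributes": {"color": "silver"}},
    {"name": "Hypothetical Laptop", "attributes": {"color": "black"}},
]

def terms_aggregation(docs: list, field_path: str) -> list:
    """Count documents per value of a (possibly nested) field, most frequent first."""
    counts = Counter()
    for doc in docs:
        value = doc
        for key in field_path.split("."):  # walk down e.g. attributes -> color
            value = value[key]
        counts[value] += 1
    return [{"key": k, "doc_count": n} for k, n in counts.most_common()]

buckets = terms_aggregation(products, "attributes.color")
print(buckets)
```

Each bucket maps directly to one facet entry in a search interface, for example "silver (2)".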
36. Query your content – Suggesters
Did you mean
Term suggester
Suggests terms based on edit distance (the number of single-character edits needed to turn one term into another)
More info: http://bit.ly/1FDFPwr
Phrase suggester
Adds logic on top of the term suggester to select entire corrected phrases instead
of individual tokens, weighted using n-gram language models.
Provides better suggestions because it takes co-occurrence and frequency into account
More info: http://bit.ly/1FbfAKg
37. Query your content – Suggesters
Search as you type
Completion suggester
a so-called prefix suggester
does not do spelling correction like the term or phrase suggesters, but provides basic auto-complete functionality
Uses FST (finite state transducer) models that are stored as part of the index for faster querying
More info: http://bit.ly/1HwFKbO
Example terms: hotel, marriot, mercure, munchen and munich
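Prefix completion over the example terms can be sketched with a sorted list and a binary search. This is a stand-in for illustration only: the real completion suggester uses FST structures, and the function name and `limit` parameter are hypothetical.

```python
import bisect

# Sorted so that all terms sharing a prefix sit next to each other
TERMS = sorted(["hotel", "marriot", "mercure", "munchen", "munich"])

def complete(prefix: str, terms: list = TERMS, limit: int = 5) -> list:
    """Return up to `limit` terms that start with `prefix`."""
    start = bisect.bisect_left(terms, prefix)  # first term >= prefix
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # past the block of terms sharing the prefix
        matches.append(term)
        if len(matches) == limit:
            break
    return matches

print(complete("m"), complete("mu"))
```

No spelling correction happens here: a prefix such as "mn" returns nothing, which is the behavior that distinguishes completion from the term and phrase suggesters.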