How to use Elasticsearch Analyzers by EmergiNet

Analyzers
Pablo Musa
EmergiNet
May 5, 2014

Outline
1 Motivation
2 Elasticsearch and EmergiNet
3 Basic Concepts
4 Building an Analyzer
5 Common Problems
6 Other Work

Motivation
Use case
Online store
Full-text search in SQL is complex and slow
We need a search system that is:
faster
more accurate
simpler to develop

Elasticsearch
Fast (about 100x faster on average)
Excellent results
Easy to consume
Very simple to install, and scalable
Simple RESTful API using JSON
“The schema is automatic”

Elasticsearch and EmergiNet
The default is not always the best choice
Nobody knows your data better than you do
Custom mapping
EmergiNet: consulting or project delivery
Optimize the application and add features
1 Sorting
2 Aggregations
3 Auto-complete, suggester
4 Support for SEO (Search Engine Optimization)

Elasticsearch
Empty Index
{
  "settings": {
    "analysis": {
      "filter": {},
      "analyzer": {
        "my_analyzer": {
          "type": "",
          "char_filter": [],
          "tokenizer": "",
          "filter": []
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "",
          "index": "",
          "analyzer": ""
        }
      }
    }
  }
}
“Empty” analysis and mappings: a skeleton of the structure to be filled in.

Stages of an analyzer
1 Tidy
2 Split
3 Normalize
Elasticsearch ships with predefined analyzers
For example: standard, simple, whitespace, and the language analyzers
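
The three stages can be sketched in plain Python. This is only an illustration of the order in which the stages run, not how Elasticsearch implements them; the function names `char_filter`, `tokenizer`, `token_filters` and `analyze` are hypothetical.

```python
import re
import unicodedata

def char_filter(text):
    """Stage 1 (tidy): replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    """Stage 2 (split): break the string into word-like terms."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """Stage 3 (normalize): lowercase each term and drop its accents."""
    out = []
    for tok in tokens:
        decomposed = unicodedata.normalize("NFD", tok.lower())
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return out

def analyze(text):
    # The stages always run in this order: tidy, split, normalize.
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>São Paulo</b> é GRANDE"))  # ['sao', 'paulo', 'e', 'grande']
```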

Tidy
Character Filters
“Pre-processing”
Cleans up the string
Optional
There are currently 3 types:
mapping (e.g. "ph" => "f")
html_strip (removes tags and decodes entities, "&aacute;" => "á")
pattern_replace (regular expression)
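
A rough Python approximation of what each of the three character filter types does to a string. The helper names are hypothetical, and the real filters handle more cases (for example, how html_strip treats block-level tags); this is a sketch only.

```python
import html
import re

def mapping_filter(text, rules):
    """Replace each source string with its target, e.g. "ph" -> "f"."""
    for src, dst in rules.items():
        text = text.replace(src, dst)
    return text

def html_strip_filter(text):
    """Drop tags, then decode entities such as &aacute;."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))

def pattern_replace_filter(text, pattern, replacement):
    """Rewrite every match of a regular expression."""
    return re.sub(pattern, replacement, text)

print(mapping_filter("photo", {"ph": "f"}))             # foto
print(html_strip_filter("<p>f&aacute;cil</p>"))         # fácil
print(pattern_replace_filter("2014-05-05", r"-", "/"))  # 2014/05/05
```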

Tidy
Analysis with Character Filters
"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "",
      "filter": []
    }
  }
}
Analysis with only the character filter filled in.

Split
Tokenizers
“Processing”
Splits the string into individual terms
Required
standard
keyword
whitespace
ngram, edge_ngram
letter, lowercase (opt), pattern, uax_url_email, path_hierarchy
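
To make the difference between tokenizers concrete, here is a small Python sketch of three of them. The functions are hypothetical stand-ins that only mimic the general behavior; defaults such as `min_gram` and `max_gram` are chosen for the example and are not the Elasticsearch defaults.

```python
def whitespace_tokenizer(text):
    """One term per whitespace-separated chunk."""
    return text.split()

def keyword_tokenizer(text):
    """The whole input becomes a single term."""
    return [text]

def edge_ngram_tokenizer(text, min_gram=1, max_gram=3):
    """Prefixes of the input, from min_gram to max_gram characters."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]

print(whitespace_tokenizer("São Paulo"))  # ['São', 'Paulo']
print(keyword_tokenizer("São Paulo"))     # ['São Paulo']
print(edge_ngram_tokenizer("casa"))       # ['c', 'ca', 'cas']
```

The keyword behavior is what the sort analyzer later in the deck relies on, and prefix terms like the ones from the edge n-gram sketch are a common basis for auto-complete.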

Split
Analysis with Character Filters and Tokenizers
"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "standard",
      "filter": []
    }
  }
}
Analysis with the character filter and tokenizer filled in.

Normalize
Token Filters
“Post-processing”
Normalizes the tokens (changes or removes them)
Optional
asciifolding
lowercase, uppercase
stop
stemmer
ngram, edge_ngram, length, snowball, synonym, ...

Normalize
Analysis Complete
"analysis": {
  "filter": {},
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "asciifolding"
      ]
    }
  }
}
Analysis using all three stages.

Normalize
stop token filter
Stop Words
Removes unwanted words
Works from a word list; the filter must be defined manually
"stop_noise": {
  "type": "stop",
  "stopwords_path": "sw.txt"
}
"stop_noise": {
  "type": "stop",
  "stopwords": ["o", "a", "no", "na", "de", "da", "as", "os"]
}
Stop word token filter definition, with the list read from a file (first form) or given inline (second form). ignore_case and remove_trailing are boolean settings.
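
The effect of the stop filter, and of the ignore_case setting, can be sketched in Python. `stop_filter` is a hypothetical helper, not the Elasticsearch implementation; it also shows why the analyzers in this deck place lowercase before stop_noise.

```python
STOPWORDS = {"o", "a", "no", "na", "de", "da", "as", "os"}

def stop_filter(tokens, stopwords=STOPWORDS, ignore_case=False):
    """Keep only the tokens that are not in the stop word list."""
    kept = []
    for tok in tokens:
        probe = tok.lower() if ignore_case else tok
        if probe not in stopwords:
            kept.append(tok)
    return kept

tokens = ["O", "dia", "de", "maio"]
print(stop_filter(tokens))                    # ['O', 'dia', 'maio']
print(stop_filter(tokens, ignore_case=True))  # ['dia', 'maio']
```

Without ignore_case, the capitalized "O" survives because the list holds only lowercase entries; running a lowercase filter first has the same effect as enabling the setting.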

Normalize
Analysis Complete with stop words
"analysis": {
  "filter": {
    "stop_noise": {
      "type": "stop",
      "stopwords": ["o", "a", "no", "na", "de", "da", "as", "os"]
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "stop_noise",
        "asciifolding"
      ]
    }
  }
}
Analysis using all three stages and my own stop words filter.

Normalize
stemmer token filter
Stemmer (word derivations)
“Trims” words down to a stem ("jogar" => "joga" or "jogar" => "jog")
Uses one of the existing stemmers, but the filter must be defined manually
"my_stemmer": {
  "type": "stemmer",
  "name": "light_portuguese"
}
Stemmer token filter definition. minimal_portuguese and portuguese are other Portuguese options.
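
To show why stemming helps matching, here is a deliberately naive suffix stripper in Python. It is NOT the light_portuguese algorithm; the suffix list and the `toy_stem` name are invented for this illustration.

```python
# Hypothetical suffix list, longest suffixes first so they win over shorter ones.
TOY_SUFFIXES = ("amos", "ando", "ar")

def toy_stem(word, suffixes=TOY_SUFFIXES):
    """Strip the first matching suffix, keeping at least three characters."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

for word in ["jogar", "jogando", "jogamos", "mar"]:
    print(word, "=>", toy_stem(word))
# jogar => jog
# jogando => jog
# jogamos => jog
# mar => mar
```

Because all three verb forms collapse to the same term, a search for one form matches documents containing the others.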

Normalize
Analysis Complete with stop words and stemmer
"analysis": {
  "filter": {
    "stop_noise": {
      "type": "stop",
      "stopwords": ["o", "a", "no", "na", "de", "da", "as", "os"]
    },
    "light_pt": {
      "type": "stemmer",
      "name": "light_portuguese"
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "html_strip"
      ],
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "stop_noise",
        "asciifolding",
        "light_pt"
      ]
    }
  }
}
Analysis using all three stages, with my own stop words and light Portuguese stemmer filters.

One Field Mapping
"mappings": {
  "my_type": {
    "properties": {
      "title": {
        "type": "string",
        "index": "analyzed",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Simple mapping with one string field using my analyzer.

Problems
Sorting
Aggregation
SEO (Search Engine Optimization)

Problems
Sorting
Sorting on analyzed fields gives seemingly random results
"Telha" < "casa"
New analyzer
"sort": {
  "type": "custom",
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    "asciifolding"
  ]
}
Sort analyzer. It uses the keyword tokenizer with the lowercase and asciifolding filters.
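
The ordering problem and the fix can be reproduced in Python. `sort_key` is a hypothetical stand-in for what the sort analyzer produces (the whole value as one lowercased, accent-free term); the sample titles are invented.

```python
import unicodedata

def sort_key(value):
    """Whole value as one term, lowercased and with accents removed."""
    decomposed = unicodedata.normalize("NFD", value.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

titles = ["casa", "Telha", "Árvore", "banco"]

# Raw comparison: uppercase sorts before lowercase, accented letters last.
print(sorted(titles))                # ['Telha', 'banco', 'casa', 'Árvore']

# Normalized comparison: the order a reader expects.
print(sorted(titles, key=sort_key))  # ['Árvore', 'banco', 'casa', 'Telha']
```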

Problems
Aggregation
How it works: "sao", "paulo", "rio"
What we want: "São Paulo"
In other words, we do not want analysis
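
A small Python sketch of the difference between counting analyzed terms and counting raw values. The city list and the `analyzed_terms` helper are invented for the example and only imitate an analyzer.

```python
from collections import Counter

cities = ["São Paulo", "Rio", "São Paulo"]

def analyzed_terms(value):
    """Crude imitation of an analyzer: lowercase, fold "ã", split on spaces."""
    return value.lower().replace("ã", "a").split()

# Counting terms of an analyzed field splits "São Paulo" into two buckets.
analyzed_counts = Counter(t for city in cities for t in analyzed_terms(city))
print(analyzed_counts)  # Counter({'sao': 2, 'paulo': 2, 'rio': 1})

# Counting the untouched values gives the buckets we actually want.
raw_counts = Counter(cities)
print(raw_counts)       # Counter({'São Paulo': 2, 'Rio': 1})
```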

Problems
Search Engine Optimization
Stemming is a bad fit here
New analyzer
"url_analyzer": {
  "type": "custom",
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "stop_noise",
    "asciifolding"
  ]
}
URL analyzer for SEO. It will not be used in mappings.

Problems
Search Engine Optimization
We do not need to map it to a field
analyze API
curl -XPOST "http://localhost:9200/my_index/_analyze?analyzer=my_analyzer" -d '{
"O Meetup Elasticsearch RJ será no dia 05 de maio as 18h."
}'
> meetup elasticsearch rj sera dia 05 maio 18h
analyze API Example.
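
One way the analyzed tokens might feed an SEO-friendly URL is to join them into a slug. This step is an assumption, not something the slides show. The sketch expects a dictionary shaped like the analyze API response (a "tokens" list whose items carry a "token" string); the sample below is hand-written and trimmed to that one field, and `slug_from_analyze` is a hypothetical helper.

```python
def slug_from_analyze(response, separator="-"):
    """Join the token strings of an analyze-style response into a slug."""
    return separator.join(item["token"] for item in response["tokens"])

# Hand-written sample carrying the tokens listed on the slide above.
sample_response = {
    "tokens": [
        {"token": "meetup"},
        {"token": "elasticsearch"},
        {"token": "rj"},
        {"token": "sera"},
        {"token": "dia"},
        {"token": "05"},
        {"token": "maio"},
        {"token": "18h"},
    ]
}

print(slug_from_analyze(sample_response))
# meetup-elasticsearch-rj-sera-dia-05-maio-18h
```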

Result
{
  "settings": {
    "analysis": {
      "filter": {
        "stop_noise": {
          "type": "stop",
          "stopwords": ["o", "a", "no", "na", "de", "da", "as", "os"]
        },
        "light_pt": {
          "type": "stemmer",
          "name": "light_portuguese"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop_noise",
            "asciifolding",
            "light_pt"
          ]
        },
        "sort": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "url_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop_noise",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "my_analyzer",
          "fields": {
            "sort": {
              "type": "string",
              "analyzer": "sort"
            },
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
Complete mapping for one field, using sub-fields for full-text search, sorting, and aggregation.
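
Before sending a body like this to Elasticsearch, a quick local sanity check can catch two easy mistakes: a custom analyzer without a tokenizer, and a field that references an analyzer that was never defined. The checker below is a hypothetical sketch, and `BUILTIN_ANALYZERS` is a partial list assumed for the example.

```python
# Partial list of predefined analyzer names (assumption for this sketch).
BUILTIN_ANALYZERS = {"standard", "simple", "whitespace", "keyword", "stop"}

def check_index_body(body):
    """Return a list of human-readable problems found in an index body."""
    problems = []
    analyzers = body.get("settings", {}).get("analysis", {}).get("analyzer", {})
    for name, spec in analyzers.items():
        if spec.get("type") == "custom" and not spec.get("tokenizer"):
            problems.append("analyzer %r has no tokenizer" % name)

    def walk(properties, prefix=""):
        # Visit each field and its sub-fields, checking the analyzer it names.
        for field, spec in properties.items():
            used = spec.get("analyzer")
            if used and used not in analyzers and used not in BUILTIN_ANALYZERS:
                problems.append(
                    "field %r uses undefined analyzer %r" % (prefix + field, used)
                )
            walk(spec.get("fields", {}), prefix + field + ".")

    for type_mapping in body.get("mappings", {}).values():
        walk(type_mapping.get("properties", {}))
    return problems

body = {
    "settings": {"analysis": {"analyzer": {
        "sort": {"type": "custom", "tokenizer": "keyword",
                 "filter": ["lowercase", "asciifolding"]},
        "broken": {"type": "custom", "filter": ["lowercase"]},
    }}},
    "mappings": {"my_type": {"properties": {"title": {
        "type": "string",
        "analyzer": "my_analyzer",
        "fields": {"sort": {"type": "string", "analyzer": "sort"}},
    }}}},
}

print(check_index_body(body))
# ["analyzer 'broken' has no tokenizer", "field 'title' uses undefined analyzer 'my_analyzer'"]
```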

Other Work
Boost
Parent/Child
Log storage (Logstash + Kibana)
Infrastructure consulting for ELK

Thank you
www.emergi.net - pmusa@emergi.net
“Keep it simple, but not simpler.”
