ESKibana
- 1. © 2014 MapR Technologies
Elasticsearch & Kibana
- 2.
Agenda
• Brief overview of search
• Brief overview of Elasticsearch
• Brief overview of Kibana 4
• MapR-DB integration with Elasticsearch
• Demo
- 4.
Information Retrieval (IR)
“Information retrieval is the activity of obtaining information
resources (in the form of documents) relevant to an information
need from a collection of information resources. Searches can be
based on metadata or on full-text (or other content-based)
indexing”
~ Wikipedia
- 5.
Basics
• Term t : a word or compound word; the basic unit that is indexed and searched
• tf (t in d) : term frequency in a document
– The number of times term t appears in the currently scored document d
• idf (t) : inverse document frequency
– A measure of how rare the term is across the corpus of documents (the index): the fewer documents contain the term, the higher its idf
• boost (index) : boost of the field at index-time
• boost (query) : boost of the field at query-time
- 6.
What is TF-IDF?
TF-IDF = Term Frequency × Inverse Document Frequency
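The product above can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's scoring code: the function name `tf_idf`, the toy corpus, and the smoothed idf form `log(N / (1 + df)) + 1` are assumptions chosen for the example.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score `term` for one document: raw term frequency times smoothed idf."""
    tf = Counter(doc_tokens)[term]
    # Number of documents that contain the term at least once.
    df = sum(1 for tokens in corpus if term in tokens)
    # Smoothed idf (an assumption of this sketch) so df == 0 cannot divide by zero.
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

# Toy corpus of already-tokenized documents (hypothetical data).
corpus = [
    ["river", "safari", "kayaking"],
    ["river", "rafting"],
    ["bird", "watching"],
]
for term in ("river", "kayaking"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

In the first document both terms occur once, but "kayaking" appears in fewer documents than "river", so it receives the higher score.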
- 7.
Lucene
• Fast, high-performance, scalable search/IR library
• Open source
• Initially developed by Doug Cutting (also co-creator of Hadoop)
• Indexing and Searching
• Inverted Index of documents
• Provides advanced search options such as synonyms, stopwords,
and similarity- and proximity-based matching.
• http://lucene.apache.org/
- 8.
What is an inverted index?
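An inverted index maps each term to the documents that contain it, so a lookup starts from the term rather than scanning every document. A minimal sketch, assuming whitespace tokenization and hypothetical sample documents:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical documents keyed by document id.
docs = {
    1: "river safari",
    2: "river kayaking",
    3: "bird watching",
}
index = build_inverted_index(docs)
print(index["river"])     # ids of documents containing "river"
print(index["kayaking"])  # ids of documents containing "kayaking"
```

A real Lucene index also stores positions and frequencies per posting; this sketch keeps only document ids.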
- 9.
Indexing pipeline
Analyzer: creates tokens using a Tokenizer and then applies zero or more Filters (Token Filters).
Each field can define an Analyzer for index time, for query time, or for both.
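The tokenizer-then-filters chain can be sketched as plain functions. This is an illustration of the idea only, not Lucene's Analyzer API; the function names and the stop list are assumptions.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of"}  # illustrative stop list

def tokenize(text):
    """Tokenizer: split text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase_filter(tokens):
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]

def stopword_filter(tokens):
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, filters=(lowercase_filter, stopword_filter)):
    """Analyzer: run the tokenizer, then each token filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("The River Safari and Bird-Watching"))
```

Filter order matters here: lowercasing runs first so that "The" matches the lowercase stop list.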
- 10.
What is a search engine?
- 12.
Elasticsearch
• Enterprise search platform built on top of Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing, replication, and load-balanced querying
• Distributed RESTful search server
• Document oriented
• Schemaless
• Easy to scale horizontally
• http://www.elasticsearch.org/
- 13.
Features
• Highlighting
• Spelling Suggestions
• Aggregations – Bucketing and Metric
• Query DSL
– based on JSON to define queries
• Automatic shard replication, routing, splitting and rebalancing
• Zen discovery
– Unicast
– Multicast
• Master election
– Re-election if the master node fails
• Schemaless or schema on the fly
• Percolation queries
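As a rough sketch of the JSON-based Query DSL and of bucketing plus metric aggregations, the request body below combines a `match` query with a `terms` bucket aggregation that nests an `avg` metric. The field names (`name`, `rating`, `price`) and the aggregation labels are assumptions for illustration; the body would be sent to an index's `_search` endpoint.

```python
import json

# Request body: a full-text match plus one bucketing aggregation ("terms")
# with a nested metric aggregation ("avg"). Field names are hypothetical.
query = {
    "query": {"match": {"name": "kayaking"}},
    "aggs": {
        "by_rating": {
            "terms": {"field": "rating"},
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        }
    },
}

body = json.dumps(query, indent=2)
print(body)
```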
- 18.
High-level Client Architecture
- 19.
Language clients
• Java
• Ruby
• Python
• Scala
• Go
• PHP
• Perl
• Groovy
• JavaScript
• C#
• .NET
• Haskell
• Clojure
• Erlang
• OCaml
• Smalltalk
• ColdFusion
• Node.js
• R
• EventMachine
- 20.
ES and Friends
• Hadoop and YARN
• Spark
• Storm
• Samza
• Kafka
• Hive
• Cascading
• Pig
Reference: https://speakerdeck.com/elastic/elasticsearch-hadoop-and-friends-spark-storm-and-more
- 21.
Elasticsearch vs. Solr
[Chart: categories Download, Expanded, First run; axis scale 0 to 300]
Ref: http://www.slideshare.net/arafalov/solr-vs-elasticsearch-case-by-case
- 23.
Kibana Features
• Seamless Integration with Elasticsearch
• Give shape to data easily
• Sophisticated Analysis
• Flexible interface
• Empower more team members – easy to share
• Easy setup
• Visualize data from many sources: Logstash, Hadoop, Flume,
Fluentd, etc.
• Simple data import
• Support for aggregations and sub-aggregations
- 28.
MapR-DB - ES Integration
Mansi Shah
- 29.
Agenda
• Architecture
• Setup and Monitoring
• Default Conversion
• Custom Conversion
• Performance Considerations
• Gateway Configuration
• Q & A
- 30.
MapR-DB Replication
• 4.1
• DR
• Async and sync replication across geographically distributed
clusters
• Connector to ES 4.2/5.0
• Connector to Spark
• …
- 31.
Table Replication Architecture
[Diagram labels: SRC Cluster and DST Cluster, each with MapR Servers, volumes, and tables 1 to n; Gateway Nodes running Replication Gateways; client operations on both clusters; a Replication Stream Write path between the clusters.]
- 32.
Table Replication Architecture
[Diagram labels: Cluster1 and Cluster2, each with MapR Servers, volumes, tables 1 to n, and Gateway Nodes running Replication Gateways; client operations on both clusters; Replication Stream Write paths between the clusters.]
- 33.
ES Replication Architecture
[Diagram labels: MapR-DB Cluster with MapR Servers, volumes, and tables 1 to n; client operations; Gateway Nodes running Replication Gateway & ES Client; Repl Stream Write to the Elasticsearch Cluster.]
- 34.
Setup and Monitoring
- 35.
Register Elasticsearch Cluster
[Diagram: MapR Cluster (MFS Nodes + Gateway Nodes) and Elasticsearch Cluster]
Register the Elasticsearch cluster with MapR.
- 36.
Create a replication to ES
[Diagram: MapR-DB Table, ES Index + Type]
● Register the ES cluster with MapR
● Create a source table
● Start replication on the source table.
maprcli table replica elasticsearch autosetup -path /test1 -target elasticsearch -index demoidx -type demotype
* Will work with MCS in later versions, possibly 5.0
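For scripted setups, the autosetup command above can be assembled as an argument list. This is a sketch only: the helper name `autosetup_command` is hypothetical, and the flags are exactly those shown in the command above.

```python
import shlex

def autosetup_command(table_path, target, index, doc_type):
    """Build the argument list for the autosetup command shown above."""
    return [
        "maprcli", "table", "replica", "elasticsearch", "autosetup",
        "-path", table_path,
        "-target", target,
        "-index", index,
        "-type", doc_type,
    ]

argv = autosetup_command("/test1", "elasticsearch", "demoidx", "demotype")
print(shlex.join(argv))
```

On a node where `maprcli` is installed, the list could be passed to `subprocess.run(argv, check=True)`; passing a list avoids shell quoting problems in table paths.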
- 37.
DEMO
- 38.
Data Conversion
- 39.
Default Data Conversion
● Converts the byte stream stored in MapR-DB to basic ES data types using mappings stored in ES.
● Data types supported: String, Int, Float, Double, Long, Short, Date (as epoch), Geo-point / Geo-hash, Boolean, Binary, IP, etc.
● The gateway reads the data type mapping from ES and then converts data based on this mapping.
● Example mapping added to Elasticsearch during index creation:
PUT /costarica/_mapping/activities
{
  "activities" : {
    "properties" : {
      "CF1" : {
        "type" : "nested",
        "properties" : {
          "name" : {"type" : "string"},
          "price" : {"type" : "integer"},
          "rating" : {"type" : "float"},
          "location" : {"type" : "geo_point"}
        }
      }
    }
  }
}
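The mapping-driven conversion described on this slide can be sketched as a lookup from ES field type to a byte decoder. This is an illustration, not the gateway's implementation: the big-endian byte layout, the `DECODERS` table, and `convert_row` are all assumptions, and `rating` is typed as `double` here only so the value round-trips exactly.

```python
import struct

# Decoders keyed by Elasticsearch field type. Big-endian encoding is an
# assumption of this sketch, not a documented MapR-DB storage format.
DECODERS = {
    "string": lambda raw: raw.decode("utf-8"),
    "integer": lambda raw: struct.unpack(">i", raw)[0],
    "long": lambda raw: struct.unpack(">q", raw)[0],
    "float": lambda raw: struct.unpack(">f", raw)[0],
    "double": lambda raw: struct.unpack(">d", raw)[0],
    "boolean": lambda raw: raw != b"\x00",
}

def convert_row(cells, field_types):
    """Decode {column: bytes} into typed values using {column: es_type}."""
    return {col: DECODERS[field_types[col]](raw) for col, raw in cells.items()}

# Hypothetical field types, as they might be read from an ES mapping.
field_types = {"name": "string", "price": "integer", "rating": "double"}
cells = {
    "name": b"kayaking",
    "price": struct.pack(">i", 50),
    "rating": struct.pack(">d", 8.6),
}
print(convert_row(cells, field_types))
```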
- 40.
Mapping Continued ...
PUT /costarica/_mapping/activities
{
  "activities" : {
    "properties" : {
      "CF1" : {
        "type" : "nested",
        "properties" : {
          "name" : {"type" : "string"},
          "price" : {"type" : "integer"},
          "rating" : {"type" : "float"},
          "location" : {"type" : "geo_point"}
        }
      }
    }
  }
}
GET /costarica/activities/row1
{
  "CF1" : {
    "name" : "kayaking",
    "price" : 50,
    "rating" : 8.6,
    "location" : "78.45, 14.33"
  },
  "CF2" : {
    "description" : "river safari, animals, bird-watching"
  }
}
- 41.
Custom Conversion
GET /costarica/activities/row1
{
  "CF1" : {
    "name" : "kayaking",
    "price" : 50,
    "rating" : 8.6,
    "location" : "78.45, 14.33"
  },
  "CF2" : {
    "description" : "river safari, animals"
  }
}
GET /costarica/activities/row1
{
  "name" : "kayaking",
  "price" : 90,
  "rating" : 8,
  "location" : "78.45, 14.33",
  "tags" : ["river safari", "animals", "birds"]
}
● Custom mapping, conversion, and data manipulation.
● Unsupported data types: arrays, nested, etc.
● Special settings such as replication, routing, scripts, and scripting language.
● Multiple JSON documents per source table update
● Delete something on a row update.
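The kind of reshaping a custom conversion performs can be sketched as a plain function that flattens the CF1 column family and turns the CF2 description into a tags array. This is not the gateway's extension API; `custom_convert` and the comma-splitting rule are hypothetical, chosen only to illustrate the idea.

```python
def custom_convert(row):
    """Flatten the CF1 column family and turn CF2's description into tags."""
    doc = dict(row.get("CF1", {}))
    description = row.get("CF2", {}).get("description", "")
    # Split the free-text description on commas to build an array field.
    doc["tags"] = [tag.strip() for tag in description.split(",") if tag.strip()]
    return doc

# Row shaped like the default-conversion document on this slide.
row = {
    "CF1": {"name": "kayaking", "price": 50, "rating": 8.6,
            "location": "78.45, 14.33"},
    "CF2": {"description": "river safari, animals"},
}
print(custom_convert(row))
```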
- 42.
Gateway configuration
- 43.
Gateways overview
• mapr-gateway is a new package.
• It has no dependency on mapr-fileserver or mapr-cldb and can run
independently.
• Warden monitors its status and restarts the service if it terminates.
• MCS shows the health of the gateway service.
• A gateway node is not counted as a licensed node (only nodes running MFS are
counted).
- 44.
Gateways discovery
• MFS nodes on the source cluster need to discover the gateways
for a destination cluster.
• Gateways can be specified via DNS
– Add a DNS entry with key gateway.dstClusterName whose value is
a list of hostnames or IPs
• Gateways can be specified via maprcli
maprcli cluster gateway set -dstcluster hyd -gateway "gw1.hyd.maprtech.com gw2.hyd.maprtech.com"
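To check what a DNS-based gateway entry would return, a lookup of the `gateway.<dstClusterName>` key can be sketched as below. How MFS performs the lookup is not described in these slides; this sketch assumes an ordinary address lookup, and the helper name, the stand-in resolver, and the addresses are hypothetical.

```python
import socket

def discover_gateways(dst_cluster, resolve=socket.gethostbyname_ex):
    """Resolve gateway.<dst_cluster> and return the addresses DNS reports."""
    _hostname, _aliases, addresses = resolve(f"gateway.{dst_cluster}")
    return addresses

# Stand-in resolver so the sketch runs without a real DNS entry.
def fake_resolve(name):
    return (name, [], ["10.0.0.11", "10.0.0.12"])

print(discover_gateways("hyd", resolve=fake_resolve))
```

Dropping the `resolve` argument makes the function query the system resolver instead.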
- 45.
Q & A
- 46.
Q&A
Engage with us!
@mapr, maprtech, mapr-technologies
mgunturu@mapr.com