ElasticSearch - Search in the Age of Clouds
Christian Meder
Bernhard Pflugfelder
inovex GmbH

Background
‣  open source (free software)
‣  Linux
‣  Web
‣  Java
‣  Android
‣  CTO@inovex
‣  Christian Meder
Christian Meder
Speaker

Background
‣  Lucene
‣  Solr
‣  Text Mining Technologies, Information Retrieval
‣  Hadoop
‣  Java
‣  Big Data Engineer@inovex
‣  bpflugfelder@inovex.de
Bernhard Pflugfelder
Speaker

‣  Search is everywhere
‣  Elasticsearch
‣  Examples
‣  Overview
‣  Features
Agenda

Enterprise search
Search applications

Online shops
Search applications

Semantic search
Search applications

Navigation & information access
Search applications

Data analysis
Search applications
http://datarpm.com/product

Log-file analysis
Search applications
http://kibana.org/

Document store
Search applications

‣  Can you think of other scenarios where search applications would also do a good job?
‣  Recall the key capabilities of search technologies:
    ‣  Persistence
    ‣  Flexible data model
    ‣  Unstructured data, but not only
    ‣  Extremely quick access to data
    ‣  Horizontal scalability
There are plenty of application scenarios out there where search technologies should be considered!
Document store
Search applications

Open source
Search technologies
http://lucene.apache.org
http://lucene.apache.org/solr/
http://www.elasticsearch.org

Lucene is an open source, pure Java API for information retrieval
‣  Originally developed by Doug Cutting in 1999; joined the Apache Software Foundation (Jakarta) in 2001 and became a top-level project (TLP) in 2005
‣  Licensed under the Apache License 2.0
‣  Pure Java library with ports and bindings for other languages:
    ‣  Lucene.NET (http://lucenenet.apache.org)
    ‣  PyLucene (http://lucene.apache.org/pylucene/)
    ‣  and more: http://wiki.apache.org/lucene-java/LuceneImplementations
‣  Large and very active developer community, well documented and supported (38 active committers!)
‣  Current stable release: 4.2.1
‣  Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/lucene-java/PoweredBy
Overview
http://lucene.apache.org/

Solr is a standalone enterprise search server & document store based on Lucene
‣  Created by Yonik Seeley at CNET Networks in 2004
‣  Entered the Apache Incubator in 2006 and graduated to a Lucene subproject in 2007
‣  Seeley and others founded Lucid Imagination (now LucidWorks)
‣  Large and very active developer community, well documented and supported (also a strong relationship to the Lucene community)
‣  Current stable release: 4.2.1
‣  Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/solr/PublicServers
Overview
http://lucene.apache.org/solr/

“You know, for search” (Shay Banon)
Search technologies

Elasticsearch is a “distributed-from-scratch” search server based on Lucene
Created by Shay Banon, with a first version made public in 02/2010:
“Elasticsearch itself was born out of my frustration with the fact that there isn’t really a good, open source, solution for distributed search engine out there, which also combines what I expect of search engines after building Compass (and on that, I will blog later…).
I have been working on this for the past several months, pouring my search and distributed knowledge into this (and portions of my heart and time ;) )”
[http://www.elasticsearch.org/blog/2010/02/08/youknowforsearch.html]
Motivation
http://www.elasticsearch.org/

‣  Current stable version: 0.20.6, based on Lucene 3.6
‣  Version 0.90 RC2 is available and integrates Lucene 4.2.1
‣  Small but growing group of core developers
‣  Strong support from respected Lucene committers
‣  Company elasticsearch.com founded in 2012
    ‣  By the people behind elasticsearch.org
    ‣  www.elasticsearch.com
Overview

Customers

‣  Code search runs on a dedicated cluster
    ‣  26 storage nodes holding the searchable data
    ‣  8 client nodes coordinating query requests
‣  Each storage node has 2 TB of SSD-based storage
‣  17 TB of indexed data is stored in the cluster
    ‣  sharded across the cluster with a replication factor of 1
    ‣  which makes 34 TB of indexed data overall
GitHub
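The 34 TB figure follows from simple replica arithmetic: each replica is a full copy of the primary index data. A minimal sketch in Python (the function name is made up for illustration, only the numbers come from the slide):

```python
def total_index_size_tb(primary_tb, replicas):
    """Total index size when every shard has `replicas` full copies."""
    return primary_tb * (1 + replicas)


# Figures from the slide: 17 TB of primary index data, replication factor 1.
total_tb = total_index_size_tb(17, 1)
print(total_tb)  # 34
```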

‣  Question-and-answer website
    ‣  aggregates questions and answers by topic
    ‣  sources are the web in general and social media
‣  Goals for search:
    ‣  low latency for queries
    ‣  increased relevancy of results
‣  Evaluated Elasticsearch against Solr and Sphinx
    ‣  “After much benchmarking with our data set, we discovered that ElasticSearch was clearly the fastest of the possible search platforms we were considering.”
Quora

Quora
http://www.quora.com/Full-Text-Search-on-Quora/What-technology-does-Quora-use-for-its-full-text-search-infrastructure/answer/Adrien-Lucas-Ecoffet?srid=pilt&share=1

SoundCloud
http://bed-con.org/2013/wp-content/uploads/2013/04/Wie_SoundCloud_skaliert.pdf

Moloch
https://github.com/aol/moloch

Huffington Post
http://blogs.vmware.com/vfabric/2013/03/scaling-real-time-comments-huffpost-live-with-rabbitmq.html

‣  Scalable, High-Performance Indexing
    ‣  over 95GB/hour on modern hardware
    ‣  small RAM requirements
    ‣  incremental indexing as fast as batch indexing
    ‣  index size roughly 20-30% the size of text indexed
‣  Powerful, Accurate and Efficient Search Algorithms
    ‣  ranked searching -- best results returned first
    ‣  many powerful query types
    ‣  fielded searching (e.g., title, author, contents)
    ‣  date-range searching
    ‣  sorting by any field
    ‣  multiple-index searching with merged results
    ‣  allows simultaneous update and searching
[From http://lucene.apache.org/core/features.html]
Highlights
http://lucene.apache.org/

‣  Pure Java application
‣  Powered by Lucene
‣  Document-oriented
‣  Schema-less
‣  HTTP API with JSON in & out
    ‣  Indexing / Updating
    ‣  Searching
    ‣  Administration / Monitoring
‣  Extensible by plugins
‣  Distribution is a fundamental paradigm of Elasticsearch
Overview
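"HTTP API with JSON in & out" means that indexing a document is an ordinary HTTP request with a JSON body. The sketch below only builds such a request with the Python standard library and does not send it. The node address and the sample document are assumptions for illustration; the index and type names are borrowed from the mapping example in this deck.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    url = "{}/{}/{}/{}".format(base_url, index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical sample document and node address.
request = build_index_request(
    "http://localhost:9200", "gutenberg", "books", "1",
    {"title": "Moby Dick", "author": "Herman Melville"},
)
print(request.get_method(), request.full_url)

# To send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read().decode("utf-8")))
```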

[Figure: a cluster of one master node and two further nodes; shards 1, 2 and 3 are distributed across the nodes as primary and replica shards]
Architecture

‣  Index distribution by auto sharding
‣  Automatic replication and balancing
‣  Fault tolerance + high availability
‣  Cluster building & management
    ‣  node detection through zen discovery
    ‣  nodes communicate via unicast / multicast
    ‣  automatic master election
    ‣  master / data node assignment can be influenced
‣  Master is responsible for
    ‣  routing the search request
    ‣  including new nodes in the cluster
‣  Index / query routing (automatic / custom)
Architecture
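Auto sharding boils down to mapping a routing value (by default the document id) to one primary shard: hash the value and take it modulo the number of primary shards. The sketch below illustrates the idea only. `zlib.crc32` is a stand-in for Elasticsearch's internal hash function, so the shard numbers will not match a real cluster, and the document ids are made up.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (by default the document id) to a primary shard.

    Assumption: zlib.crc32 stands in for Elasticsearch's internal hash.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


# Hypothetical document ids, spread over 3 primary shards.
doc_ids = ["book-1", "book-2", "book-3", "book-4"]
placement = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
print(placement)
```

With custom routing, all documents that share a routing value land on the same shard, so a query carrying that value only has to visit one shard. The modulo also explains why the number of primary shards is fixed when the index is created.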

Elasticsearch-head
https://github.com/mobz/elasticsearch-head

‣  Define a mapping for type books:
cat > book.json <<'EOF'
{
  "books" : {
    "properties" : {
      "id" : { "type" : "string" },
      "title" : { "type" : "string" },
      "author" : { "type" : "string" },
      "subject" : { "type" : "string" },
      "view_count" : { "type" : "integer" },
      "created" : { "type" : "date", "format" : "dateOptionalTime" }
    }
  }
}
EOF
# The put-mapping endpoint expects the type name as top-level key (no "mappings" wrapper)
curl -XPUT 'localhost:9200/gutenberg/books/_mapping' -d @book.json
‣  Retrieve the current mapping for type books:
curl 'localhost:9200/gutenberg/books/_mapping?pretty=1'
Schema-less, but
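Without an explicit mapping, field types are derived from the first document that contains the field. The sketch below is a rough approximation of that idea, not Elasticsearch's actual detection logic. The sample document and the single date pattern are assumptions for illustration.

```python
import json
from datetime import datetime


def guess_field_type(value):
    """Very rough approximation of type inference for an unmapped field."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
            return "date"
        except ValueError:
            return "string"
    if isinstance(value, dict):
        return "object"
    return None  # arrays and nulls are not handled in this sketch


# Hypothetical document without a predefined mapping.
doc = json.loads(
    '{"title": "Faust", "view_count": 12, "created": "2013-04-05T10:00:00"}'
)
guessed = {field: guess_field_type(value) for field, value in doc.items()}
print(guessed)
```

Here view_count would be guessed as long, whereas the explicit mapping declares it as integer. That control over types is one reason to define a mapping even though the index is schema-less.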

‣  Search on terms, numeric values, dates, numeric ranges, date/time ranges
‣  Many query types
    ‣  terms, phrases, fuzzy, wildcard, ranges
    ‣  faceting, filtering
‣  Geospatial search via the GeoShape query
‣  Configurable caching for
    ‣  filter queries
    ‣  field values
‣  Near-real-time (NRT) search with a separate API
‣  Sorting, highlighting
‣  MoreLikeThis
‣  Multi-tenancy
Search highlights
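Several of these features can be combined in one JSON search body. The sketch below builds such a body in the query DSL of the 0.90 era (`filtered` query, `facets`); later releases replaced both, so treat it as an illustration for the versions discussed here. Field names follow the books mapping in this deck; the search text and threshold are made up.

```python
import json


def build_book_search(text, min_views, page_size=10):
    """Build a search body: full-text match, range filter, facet, highlight, sort."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"title": text}},
                "filter": {"range": {"view_count": {"gte": min_views}}},
            }
        },
        "facets": {"by_author": {"terms": {"field": "author"}}},
        "highlight": {"fields": {"title": {}}},
        "sort": [{"view_count": {"order": "desc"}}],
        "size": page_size,
    }


body = build_book_search("whale", min_views=100)
print(json.dumps(body, indent=2))
# Would be sent as: POST localhost:9200/gutenberg/books/_search
```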

Faceted search

Suggestion

Highlighting

Local search

Multi-tenancy

‣  Gateway module stores cluster metadata to:
    ‣  local FS, shared FS, Hadoop, Amazon S3
‣  River:
    ‣  Pluggable service that continuously pulls data into the index
    ‣  Managed over a dedicated REST endpoint
    ‣  Implementations for CouchDB, MongoDB, JDBC, Solr, …
‣  Bulk indexing
    ‣  Default: single-document indexing
    ‣  Bulk indexing over a dedicated REST endpoint
‣  Lucene analyzers are specified in elasticsearch.yml or via the API
Some more features
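The bulk endpoint takes newline-delimited JSON: one action line followed by one source line per document. A minimal sketch of building such a body, with made-up sample documents and the index and type names from the mapping example:

```python
import json


def build_bulk_body(index, doc_type, documents):
    """Serialize documents into the newline-delimited bulk format.

    Each document becomes two lines: an action line and the document source.
    """
    lines = []
    for doc_id, source in documents.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Hypothetical sample documents.
books = {
    "1": {"title": "Moby Dick", "author": "Herman Melville"},
    "2": {"title": "Faust", "author": "Johann Wolfgang von Goethe"},
}
bulk_body = build_bulk_body("gutenberg", "books", books)
print(bulk_body, end="")
# Would be sent as: POST localhost:9200/_bulk
```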

‣  Query types such as term, terms, match, wildcard, fuzzy, range, …
‣  Multi Search
‣  Get
‣  Multi Get
‣  Filter
‣  Facets
‣  Highlighting
‣  Suggest
‣  MoreLikeThis
‣  Index boosting
‣  Explain
‣  Percolate
Search API

‣  Create, Delete, Exists, Open, Close, Optimize, Refresh, Flush, Settings
‣  Index templates (mappings + settings)
‣  Get, Put, Delete Mapping
‣  Get, update settings
‣  Snapshot
‣  Aliases
‣  Warmers
‣  Statistics, Status
Indices API
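Aliases are one common way to approach multi-tenancy: each tenant gets an alias that applies a filter and a routing value on top of a shared index. The sketch below builds a body for the `_aliases` endpoint. The `tenant_id` field and the tenant name are hypothetical.

```python
import json


def build_tenant_alias(index, tenant):
    """Build an _aliases request body that adds one filtered alias for a tenant."""
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": "{}_{}".format(index, tenant),
                    "filter": {"term": {"tenant_id": tenant}},
                    "routing": tenant,
                }
            }
        ]
    }


alias_body = build_tenant_alias("gutenberg", "library_a")
print(json.dumps(alias_body, indent=2))
# Would be sent as: POST localhost:9200/_aliases
```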

‣  Live configuration of cluster settings
‣  minimum master nodes
‣  cache sizes
‣  routing
‣  allocation
‣  moving shards
‣  Moving replicas
‣  Cluster health & status
‣  Nodes info & stats, Shutdown all / specific nodes
Cluster API
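The minimum master nodes setting is usually chosen as a quorum of the master-eligible nodes, which guards against two halves of a split cluster each electing their own master. A minimal sketch that computes the quorum and builds a body for a live settings update, assuming a cluster with three master-eligible nodes:

```python
import json


def minimum_master_nodes(master_eligible_nodes):
    """Quorum of master-eligible nodes: more than half of them."""
    return master_eligible_nodes // 2 + 1


def build_settings_update(master_eligible_nodes):
    """Build the body for a live cluster settings update."""
    quorum = minimum_master_nodes(master_eligible_nodes)
    return {"persistent": {"discovery.zen.minimum_master_nodes": quorum}}


# Assumption: three master-eligible nodes.
settings_body = build_settings_update(3)
print(json.dumps(settings_body))
# Would be sent as: PUT localhost:9200/_cluster/settings
```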

+  Elasticsearch feels lightweight
+  Simple but effective architecture
+  Easy to use, even for distributed search
+  Highly mature, even though ES is young
+  High-performance search (at least based on the benchmarks seen so far)
+  Modern technologies used (HTTP, JSON, no XML, Java, Guava)
-  Still a small community and a small group of core developers
-  Missing data connectors (e.g. DataImportHandler)
-  Missing search features: result grouping & search result clustering
-  Fewer query types than Solr
-  Fewer options for boosting than Solr (e.g. function queries)
-  Fewer analyzers than Solr
Pros & Cons

‣  The world is becoming data-driven and user-driven
    ‣  large data volumes
    ‣  multiple sources
    ‣  many users need access
‣  Therefore search technologies such as Elasticsearch are becoming important:
    ‣  Easy aggregation of data from multiple sources
    ‣  Unified access layer through search
    ‣  Scalable with respect to data volume and users
    ‣  Highly configurable
‣  Elasticsearch is easy to use, distributed and scalable, and search is fast
Wrap up
