Fast search that returns relevant results across large data sets is now something we all take for granted, anytime and anywhere. Search is no longer used only in classic scenarios such as enterprise search and web search; it also organizes access to data and information in a wide range of applications (keyword: search-based applications). Most of the search technologies in common use are based on the Apache Lucene project. Among Lucene-based search servers, Apache Solr has now been joined by a new star of the open-source scene: Elasticsearch. This talk introduces Elasticsearch and its usage scenarios in depth and contrasts its capabilities with those of Lucene and Solr, particularly for large data volumes.
2. Background
‣ open source (free software)
‣ Linux
‣ Web
‣ Java
‣ Android
‣ CTO@inovex
‣ Christian Meder
Christian Meder (Speaker)
3. Background
‣ Lucene
‣ Solr
‣ Text Mining Technologies, Information Retrieval
‣ Hadoop
‣ Java
‣ Big Data Engineer@inovex
‣ bpflugfelder@inovex.de
Bernhard Pflugfelder (Speaker)
4. ‣ Search is everywhere
‣ Elasticsearch
‣ Examples
‣ Overview
‣ Features
Agenda
13. ‣ Can you think of other scenarios where search applications
will also do a good job?
‣ Recall the key capabilities of search technologies:
‣ Persistence
‣ Flexible data model
‣ Unstructured data, but not only
‣ Extremely quick access to data
‣ Horizontal scalability
There are plenty of application scenarios out there where
search technologies should be considered!
Document store / Search applications
15. Lucene is an open source, pure Java API
for enabling information retrieval
‣ Originally developed by Doug Cutting in 1999; joined the Apache Software Foundation (Jakarta) in 2001 and became a top-level project (TLP) in 2005
‣ Licensed under the Apache License 2.0
‣ Pure Java library, with ports and bindings for other platforms:
‣ Lucene.NET (http://lucenenet.apache.org)
‣ PyLucene (http://lucene.apache.org/pylucene/)
‣ and more:
http://wiki.apache.org/lucene-java/LuceneImplementations
‣ Large and very active developer community, well documented and supported (38
active committers!)
‣ Current stable release: 4.2.1
‣ Widely used and adopted for commercial / non-commercial projects:
http://wiki.apache.org/lucene-java/PoweredBy
Overview
http://lucene.apache.org/
16. Solr is a standalone enterprise search server & document
store based on Lucene
‣ Created by Yonik Seeley at CNET Networks in 2004
‣ Entered the Apache Incubator in 2006, became TLP in 2007
‣ Licensed under the Apache License 2.0
‣ Seeley and others founded Lucid Imagination (later renamed LucidWorks)
‣ Large and very active developer community, well documented and supported
(strong relationship to Lucene community also)
‣ Current stable release: 4.2.1
‣ Widely used and adopted for commercial / non-commercial projects:
http://wiki.apache.org/solr/PublicServers
Overview
http://lucene.apache.org/solr/
17. “You know, for search” (Shay Banon)
Search technologies
18. Elasticsearch is a “distributed-from-scratch” search server
based on Lucene
Created by Shay Banon; a first version was made public in February 2010:
“Elasticsearch itself was born out of my frustration with the fact that there isn’t really a
good, open source, solution for distributed search engine out there, which also
combines what I expect of search engines after building Compass (and on that, I will
blog later…).
I have been working on this for the past several months, pouring my search and
distributed knowledge into this (and portions of my heart and time ;) )”
[http://www.elasticsearch.org/blog/2010/02/08/youknowforsearch.html]
Motivation
http://www.elasticsearch.org/
19. ‣ Current stable version: 0.20.6, built on Lucene 3.6
‣ Version 0.90 RC2 is available and integrates Lucene 4.2.1
‣ Licensed under the Apache License 2.0
‣ Small but growing group of core developers
‣ Strong support from respected Lucene committers
‣ Company elasticsearch.com founded in 2012
‣ By the people behind elasticsearch.org
‣ www.elasticsearch.com
Overview
21. ‣ Code search runs on a cluster
‣ 26 storage nodes holding the searchable data
‣ 8 client nodes coordinating query requests
‣ Each storage node has 2 TB of SSD-based storage
‣ 17 TB of indexed data is stored in the cluster
‣ distributed across the cluster with a replication factor of 1
‣ which makes 34 TB of indexed data overall
GitHub
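The storage figures on this slide follow from simple arithmetic: every primary copy of the index is stored once more per replica. A minimal sketch of that calculation (the function name is hypothetical and chosen only for illustration):

```python
def total_index_size_tb(primary_tb, replicas):
    """Total stored size: one primary copy plus `replicas` replica copies."""
    return primary_tb * (1 + replicas)


# 17 TB of primary index data with a replication factor of 1
print(total_index_size_tb(17, 1))  # 34
```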
22. ‣ Question-and-answer website
‣ aggregates questions and answers by topic
‣ Sources are the web in general and social media
‣ Goals for search:
‣ low latency for queries
‣ increased relevancy of results
‣ evaluated Elasticsearch against Solr and Sphinx
‣ “After much benchmarking with our data set, we discovered that ElasticSearch
was clearly the fastest of the possible search platforms we were considering.”
Quora
28. ‣ Scalable, High-Performance Indexing
‣ over 95GB/hour on modern hardware
‣ small RAM requirements
‣ incremental indexing as fast as batch indexing
‣ index size roughly 20-30% the size of text indexed
‣ Powerful, Accurate and Efficient Search Algorithms
‣ ranked searching -- best results returned first
‣ many powerful query types
‣ fielded searching (e.g., title, author, contents)
‣ date-range searching
‣ sorting by any field
‣ multiple-index searching with merged results
‣ allows simultaneous update and searching
[From http://lucene.apache.org/core/features.html]
Highlights
29. ‣ Pure Java application
‣ Powered by Lucene
‣ Document-oriented
‣ Schema-less
‣ HTTP API with JSON In & Out
‣ Indexing / Updating
‣ Searching
‣ Administration / Monitoring
‣ Extensible through plugins
‣ Distribution is a fundamental paradigm of Elasticsearch
Overview
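The "HTTP API with JSON In & Out" point can be illustrated with a minimal sketch that uses only the Python standard library. The address `localhost:9200` (Elasticsearch's default HTTP port), the index `talks`, the type `talk`, and the document fields are assumptions made for illustration. The requests are only constructed here; nothing is sent until `urlopen` is called against a running node.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default HTTP port


def index_request(index, doc_type, doc_id, document):
    """Build a PUT request that stores `document` as JSON under /index/type/id."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


def get_request(index, doc_type, doc_id):
    """Build a GET request that fetches the stored document by id."""
    return Request(f"{BASE_URL}/{index}/{doc_type}/{doc_id}", method="GET")


# Hypothetical document; no schema has to be declared beforehand.
doc = {"title": "Elasticsearch in practice", "speaker": "Bernhard Pflugfelder"}
put = index_request("talks", "talk", 1, doc)
get = get_request("talks", "talk", 1)
print(put.get_method(), put.full_url)
print(get.get_method(), get.full_url)

# To actually send a request against a running node:
#   from urllib.request import urlopen
#   print(json.load(urlopen(put)))
```

The `/index/type/id` URL layout matches the versions discussed in this talk; later Elasticsearch releases dropped mapping types from the URL.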
31. ‣ Index distribution by auto sharding
‣ Automatic replication and balancing
‣ Fault tolerant + high availability
‣ Cluster building & management
‣ node detection through zen discovery
‣ nodes communicate via unicast / multicast
‣ automatic master election
‣ master / data node assignment can be influenced
‣ Master is responsible for
‣ routing the search request
‣ including new nodes in the cluster
‣ Index / query routing (automatic / individual)
Architecture
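Auto sharding and index routing rest on a hash-modulo idea: a routing value (by default the document id) is hashed, and the result modulo the number of primary shards selects the shard. The sketch below shows only that principle. CRC32 is a stand-in chosen because it is deterministic and in the standard library; it is not the hash function Elasticsearch uses, and `shard_for` is a hypothetical helper.

```python
import zlib


def shard_for(routing_key, number_of_primary_shards):
    """Map a routing key (by default the document id) to a primary shard."""
    # Stand-in hash: CRC32 is NOT the hash Elasticsearch uses; it is used
    # here only because it is deterministic across runs.
    hashed = zlib.crc32(routing_key.encode("utf-8"))
    return hashed % number_of_primary_shards


# Distribute ten hypothetical document ids over five primary shards.
shards = 5
placement = {doc_id: shard_for(doc_id, shards) for doc_id in map(str, range(10))}
for doc_id, shard in placement.items():
    print(f"document {doc_id} -> shard {shard}")
```

Individual routing corresponds to passing a custom key instead of the document id, for example a user name, so that all documents sharing that key land on the same shard. The modulo also suggests why the number of primary shards is fixed when an index is created: changing it would move existing documents to different shards.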
42. ‣ Gateway module stores cluster metadata to:
‣ Local FS, Shared FS, Hadoop, Amazon S3
‣ River:
‣ Pluggable service that continuously pulls data
‣ Managed via a dedicated REST endpoint
‣ Implementations for CouchDB, MongoDB, JDBC, Solr, …
‣ Bulk indexing
‣ Default: single document indexing
‣ Bulk indexing via a dedicated REST endpoint
‣ Lucene analyzers are specified in elasticsearch.yml or via the API
Some more features
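Bulk indexing goes through the `_bulk` endpoint, which expects a newline-delimited body: one action line followed by one source line per document. The following is a minimal sketch of building such a body; the index `talks`, the type `talk`, and the documents are hypothetical, and the `_type` field reflects the versions discussed in this talk.

```python
import json


def bulk_body(index, doc_type, documents):
    """Serialize (doc_id, document) pairs into a newline-delimited bulk body."""
    lines = []
    for doc_id, document in documents:
        # Action line: what to do and where the document goes.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(document))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


docs = [("1", {"title": "Lucene"}), ("2", {"title": "Solr"})]
body = bulk_body("talks", "talk", docs)
print(body)
# The body would be POSTed to the `_bulk` endpoint,
# e.g. http://localhost:9200/_bulk (assumed local node).
```

Sending many documents in one request avoids one HTTP round trip per document, which is the main reason to prefer it over the default single-document indexing for large imports.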
43. ‣ Query types such as term, terms, match, wildcard, fuzzy, range, …
‣ Multi Search
‣ Get
‣ Multi Get
‣ Filter
‣ Facets
‣ Highlighting
‣ Suggest
‣ MoreLikeThis
‣ Index boosting
‣ Explain
‣ Percolate
Search API
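Several of the features listed here combine in a single JSON request body. The sketch below builds a body with a `match` query, a terms facet, and highlighting. The field names `title` and `speaker` and the helper `search_body` are hypothetical. Facets are the mechanism of the versions covered in this talk; later Elasticsearch releases replaced them with aggregations.

```python
import json


def search_body(text, facet_field, size=10):
    """Build a search request body: match query, terms facet, highlighting."""
    return {
        "size": size,
        # Full-text query against the (assumed) `title` field.
        "query": {"match": {"title": text}},
        # Terms facet: counts of the most frequent values of `facet_field`.
        "facets": {
            "by_" + facet_field: {"terms": {"field": facet_field}}
        },
        # Ask for highlighted fragments of the matched field.
        "highlight": {"fields": {"title": {}}},
    }


body = search_body("distributed search", "speaker")
print(json.dumps(body, indent=2))
# The JSON would be sent to the `_search` endpoint of an index,
# e.g. POST http://localhost:9200/talks/_search (assumed local node).
```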
44. ‣ Create, Delete, Exists, Open, Close, Optimize, Refresh, Flush, Settings
‣ Index templates (mappings + settings)
‣ Get, Put, Delete Mapping
‣ Get, update settings
‣ Snapshot
‣ Aliases
‣ Warmers
‣ Statistics, Status
Indices API
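Two of these operations, creating an index with explicit settings and switching an alias, can be sketched as JSON bodies. The index and alias names are hypothetical, and the helper functions exist only for this illustration. The bodies would be sent with `PUT /<index>` and `POST /_aliases` respectively.

```python
import json


def create_index_body(shards, replicas):
    """Settings body for creating an index (sent with PUT /<index>)."""
    return {"settings": {"number_of_shards": shards,
                         "number_of_replicas": replicas}}


def switch_alias_body(alias, old_index, new_index):
    """Actions body for POST /_aliases that repoints an alias in one call."""
    return {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}


create = create_index_body(shards=5, replicas=1)
switch = switch_alias_body("talks", "talks_v1", "talks_v2")
print(json.dumps(create))
print(json.dumps(switch))
```

Pointing applications at an alias rather than a concrete index is a common pattern: a rebuilt index can be swapped in by changing the alias, without touching client code.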
45. ‣ Live configuration of cluster settings
‣ minimum master nodes
‣ cache sizes
‣ routing
‣ allocation
‣ moving shards
‣ moving replicas
‣ Cluster health & status
‣ Nodes info & stats, Shutdown all / specific nodes
Cluster API
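As an example of live cluster configuration, the minimum number of master nodes is commonly set to a majority of the master-eligible nodes so that a partitioned cluster cannot elect two masters. A minimal sketch, assuming the zen discovery setting name `discovery.zen.minimum_master_nodes` and a hypothetical cluster with three master-eligible nodes:

```python
import json


def quorum(master_eligible_nodes):
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1


def minimum_master_nodes_body(master_eligible_nodes):
    """Body for PUT /_cluster/settings that sets the value persistently."""
    return {"persistent": {
        "discovery.zen.minimum_master_nodes": quorum(master_eligible_nodes)
    }}


body = minimum_master_nodes_body(3)
print(json.dumps(body))
```

Using `persistent` rather than `transient` keeps the setting across full cluster restarts. Cluster health can be read in the same style with `GET /_cluster/health`.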
46. + Elasticsearch feels lightweight
+ Simple but effective architecture
+ Ease of use, even with distributed search
+ High maturity, even though ES is young
+ High-performance search (at least based on the benchmarks seen so far)
+ Modern technologies used (HTTP, JSON, NoXML, Java, Guava)
- Still a small community and a small group of core developers
- Missing data connectors (e.g. Solr's DataImportHandler)
- Missing search features: grouping and search result clustering
- Fewer query types
- Fewer options for boosting (e.g. function queries)
- Fewer analyzers
Pros & Cons
47. ‣ The world is becoming data-driven and user-driven
‣ large data volumes
‣ multiple sources
‣ many users need access
‣ Therefore search technologies such as Elasticsearch are becoming important:
‣ Easy aggregation of data from multiple sources
‣ Provide a unified access layer through search
‣ Scalable regarding data volume and users
‣ Highly configurable
‣ Elasticsearch is easy to use, distributed, and scalable, and its search is fast
Wrap up