
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...

3,714 views


In the age of information and big data, the ability to quickly and easily find a needle in a haystack is extremely important. Elasticsearch is a distributed and scalable search engine which provides rich and flexible search capabilities. Social networks (Facebook, LinkedIn), media services (Netflix, SoundCloud), Q&A sites (StackOverflow, Quora, StackExchange) and even GitHub all find data for you using Elasticsearch. In conjunction with Logstash and Kibana, Elasticsearch becomes a powerful log engine which lets you process, store, analyze, search through and visualize your logs.
Video: https://www.youtube.com/watch?v=GL7xC5kpb-c
Scripts for the Demo: https://github.com/opanchenko/morning-at-lohika-ELK

Published in: Data & Analytics


  1. 1. ELASTICSEARCH, LOGSTASH, KIBANA: COOL SEARCH, ANALYTICS, DATA MINING AND MORE… OLEKSIY PANCHENKO / LOHIKA / 2015
  2. 2. MY NAME IS… Oleksiy Panchenko Software engineer, Lohika E-mail: oleksij@gmail.com Twitter: oleskiyp LinkedIn: https://ua.linkedin.com/in/opanchenko
  3. 3. AGENDA • Introduction. What is it all about? • Jump start Elastic. Demo time • Architecture and deployment. Why is Elasticsearch elastic? • Case studies. 4 real-life projects • Query API in depth + Demo • Elasticsearch ecosystem. ELK Stack + Demo • Q & A
  4. 4. INTRODUCTION: WHAT IS IT ALL ABOUT?
  5. 5. HOW TO MAKE YOUR SITE SEARCHABLE? http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png
  6. 6. • Google search • Why not use plain vanilla SQL? RDBMS rocks! select * from books join authors on … where … • Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C); Xapian • Lucene Family: Apache Lucene, Elasticsearch, Apache Solr, Amazon CloudSearch, …
  7. 7. WHO HAS EVER USED ELASTICSEARCH? http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
  8. 8. LUCENE AS A CORE • Lucene = Low-level Java library (JAR) which implements search functionality • Can be used in both web and standalone applications (desktop, mobile) • Lucene stores its index as a local binary file • Implemented in Java, ports to other languages available • Initial version: 1999 • Apache project since 2001 • Latest stable release: 5.2.1 (15 June 2015)
  9. 9. LUCENE AS A CORE • Lucene was originally written in 1999 by Doug Cutting (creator of Hadoop and Nutch) http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg
  10. 10. MORE ABOUT SEARCH ENGINES Riak Search
  11. 11. TIME TO TALK ABOUT ELASTICSEARCH https://www.elastic.co/products/elasticsearch Near Real-Time Data (NRT) Full-Text Search Multilingual search, geolocation, fuzzy search, did-you-mean suggestions, autocomplete
  12. 12. https://www.elastic.co/products/elasticsearch High Availability Multitenancy Distributed, Horizontally Scalable
  13. 13. https://www.elastic.co/products/elasticsearch Document-Oriented Schema-Free Conflict Management Optimistic Concurrency Control
  14. 14. https://www.elastic.co/products/elasticsearch Apache 2 Open Source License Awesome documentation Large community Developer-Friendly, RESTful API Client libraries available for many programming languages and frameworks.
  15. 15. ELASTICSEARCH USERS https://www.elastic.co/use-cases https://en.wikipedia.org/wiki/Elasticsearch#Users
  16. 16. ELASTICSEARCH – PAST & PRESENT • 2004. Shay Banon (aka Kimchy) started working on Compass – Java Search Engine on top of Lucene • 2010. Initial release of Elasticsearch • Latest stable release: 1.7.1 (July 29, 2015) • 500K downloads per month • https://github.com/elastic/elasticsearch http://opensource.hk/sites/default/files/u1/shay-banon.jpg
  17. 17. ELASTICSEARCH AS A COMPANY • 2012. Elasticsearch BV; Funding: $104M in 3 rounds, 100+ employees • https://www.elastic.co/ • Product portfolio: – Elasticsearch, Logstash, Kibana (ELK stack) – Watcher – Shield – Marvel – es-hadoop – found
  18. 18. JUMP START ELASTIC: DEMO TIME
  19. 19. INSTALLATION & CONFIGURATION • Prerequisites: – JDK 6 or above (recommended: JDK 8) – RAM: min. 2 GB (recommended: 16–64 GB for production) – CPU: number of cores over clock rate – Disks: recommended SSD • Homebrew, apt, yum: apt-get install elasticsearch • Download (ZIP, TAR, DEB, RPM): https://www.elastic.co/downloads/elasticsearch • Installation is absolutely straightforward and easy:
  20. 20. LET’S TALK ABOUT TERMINOLOGY Index ~ DB Schema; Type ~ DB Table; Document ~ Record, JSON object; Mapping ~ Schema definition in RDBMS
  21. 21. DEMO #1 http://www.telikin.com/cms/images/shocked_senior_computer_user.jpg
  22. 22. http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
  23. 23. ARCHITECTURE AND DEPLOYMENT: WHY IS ELASTICSEARCH ELASTIC?
  24. 24. Cluster: one or more nodes which share the same cluster name. Node: running instance of Elasticsearch which belongs to a cluster. Shard: a portion of data, a single Lucene instance (default: 5 shards in an index). Primary Shard: master copy of data. Replica Shard: exact copy of a primary shard (default: 1 replica).
  25. 25. SINGLE-NODE CLUSTER 0 1 2 3 4 Hash Function* { "id": "123", "name": "john", … } { "id": "124", "name": "patricia", … } { "id": "125", "name": "scott", … } * Also consider custom routing
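A minimal sketch of the idea on this slide, assuming a plain "hash of the routing key, modulo the number of primary shards" scheme. `zlib.crc32` is only a stand-in hash picked for the illustration, not the function Elasticsearch itself uses, and `shard_for` is a hypothetical helper name.

```python
import zlib


def shard_for(routing_key: str, num_primary_shards: int = 5) -> int:
    """Pick a primary shard (0..num_primary_shards-1) for a routing key.

    By default the routing key is the document ID.
    zlib.crc32 is a stand-in hash used for this sketch only.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# The three documents from the slide, routed by their IDs.
docs = [
    {"id": "123", "name": "john"},
    {"id": "124", "name": "patricia"},
    {"id": "125", "name": "scott"},
]
placement = {doc["id"]: shard_for(doc["id"]) for doc in docs}
print(placement)
```

The same key always maps to the same shard, which is what lets any node locate a document by ID without a lookup table.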
  26. 26. TWO-NODE CLUSTER Node 1: 0, 1, R2, 3, R4. Node 2: R0, R1, 2, R3, 4 (R = replica shard). * Ability to ‘route’ indexes to particular nodes (tag-based, e.g.: ‘strong’, ‘medium’, ‘weak’)
  27. 27. BENEFITS OF SHARDING • Take advantage of multi-core CPUs (one shard is a single Lucene instance, and shards are processed in parallel) • Horizontal scalability. Dynamic rebalancing • Fault tolerance and cluster resilience • NB! The number of shards cannot be changed on the fly: a full reindexing is required • Max number of documents per shard: 2,147,483,519 – imposed by Lucene
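Two of the points above can be checked with a few lines of arithmetic. The per-shard limit equals Java's `Integer.MAX_VALUE` minus 128, and a modulo-based routing scheme explains why the shard count is fixed: changing the divisor sends most documents to a different shard. The sketch below assumes a simple "hash modulo shard count" scheme, with `zlib.crc32` as a stand-in hash and hypothetical helper names.

```python
import zlib

# Lucene's per-index document cap: Java's Integer.MAX_VALUE minus 128.
MAX_DOCS_PER_SHARD = (2**31 - 1) - 128


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    # zlib.crc32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


keys = [str(i) for i in range(10_000)]
fraction = moved_fraction(keys, old_count=5, new_count=6)
print(f"max docs per shard: {MAX_DOCS_PER_SHARD:,}")
print(f"{fraction:.0%} of documents would land on a different shard")
```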
  28. 28. CUSTOM ROUTING • Social network. Users, events • event_id: 17567654, 17567655, 17567656, … user_id: 10300, 10301, … • No Elasticsearch ID provided: ID will be auto-generated → Events will be equally distributed across the shards • Obvious approach: Elasticsearch ID = event_id → Events will be equally distributed across the shards • Elasticsearch ID = user_id → Events which belong to the same user will be stored in the same shard
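The effect of the routing key can be illustrated with a small sketch. It assumes a "hash modulo shard count" scheme with `zlib.crc32` as a stand-in hash; the event records and the helper `shards_per_user` are hypothetical. Keyed by `user_id`, every event of one user goes through the same hash input and so maps to a single shard; keyed by `event_id`, a user's events may spread over several shards.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 5


def shard_for(routing_key, num_shards: int = NUM_SHARDS) -> int:
    # Stand-in hash for illustration only.
    return zlib.crc32(str(routing_key).encode("utf-8")) % num_shards


# Hypothetical events using the ID ranges shown on the slide.
events = [
    {"event_id": 17567654, "user_id": 10300},
    {"event_id": 17567655, "user_id": 10301},
    {"event_id": 17567656, "user_id": 10300},
    {"event_id": 17567657, "user_id": 10301},
]


def shards_per_user(events, key_field: str) -> dict:
    """Map each user to the set of shards that hold that user's events."""
    touched = defaultdict(set)
    for event in events:
        touched[event["user_id"]].add(shard_for(event[key_field]))
    return dict(touched)


by_user = shards_per_user(events, "user_id")    # one shard per user
by_event = shards_per_user(events, "event_id")  # possibly several per user
print(by_user)
print(by_event)
```

A search limited to one user then only has to visit one shard instead of all of them.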
  29. 29. ELASTICSEARCH NODE TYPES • Data node node.data = true • Master node node.master = true • Communication client http.enabled = true • TCP ports 9200 (ext), 9300 (int) • A node can play 2 or 3 roles at the same time • Multicast discovery (true by default): discovery.zen.ping.multicast.enabled
  30. 30. DEPLOYMENT DIAGRAM
  31. 31. INDEXING A DOCUMENT https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html
  32. 32. RETRIEVING A DOCUMENT https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-read.html • In terms of retrieving documents, primary and replica shards are equivalent: data can be read from either primary or replica shard
  33. 33. DISTRIBUTED SEARCH • Given search query, retrieve 10 most relevant results https://www.elastic.co/guide/en/elasticsearch/guide/current/_query_phase.html
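The scatter/gather idea behind "retrieve the 10 most relevant results" can be sketched as follows. This is a simplified model under stated assumptions: each shard is a plain list of `(score, doc_id)` pairs, and the function names are hypothetical. Every shard returns its own top k, and the coordinating node merges those candidates into the global top k.

```python
import heapq


def shard_top_k(hits, k: int):
    """Query phase on one shard: return its k best (score, doc_id) pairs."""
    return heapq.nlargest(k, hits)


def distributed_search(shards, k: int = 10):
    """Coordinating node: gather each shard's local top k, keep the global top k."""
    candidates = []
    for hits in shards:
        candidates.extend(shard_top_k(hits, k))
    return heapq.nlargest(k, candidates)


# Three toy shards with pre-computed relevance scores.
shards = [
    [(0.90, "a"), (0.20, "b"), (0.70, "c")],
    [(0.80, "d"), (0.10, "e")],
    [(0.95, "f"), (0.30, "g")],
]
top = distributed_search(shards, k=3)
print(top)
```

Each shard has to return k candidates, not k divided by the number of shards, because all of the best hits may live on a single shard.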
  34. 34. CASE STUDIES: 4 REAL-LIFE PROJECTS http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru
  35. 35. GENERAL INFO • 4 projects, ~2 years • RDBMS (MySQL, PostgreSQL) as a primary data storage • Both on-premise Elasticsearch installation (AWS, MS Azure) and SaaS (Bonsai @ Heroku) • 1 or 2 instances in a cluster • Data volume: Gigabytes; millions of documents • Back-end: Java, Ruby
  36. 36. #1. SOCIAL INFLUENCER MARKETING PLATFORM http://www.nclurbandesign.org/wp-content/uploads/2015/05/blog-pic-b2c.jpg
  37. 37. • Document types: Blog Posts, Bloggers (Influencers) • Elasticsearch usage: – search and rank Influencers by category, keywords, tags, location, audience, influence – search blog posts by keywords etc. • Amount of data: – Influencers: hundreds of thousands – Blog Posts: millions • ES cluster size: 2 instances • Technology stack: Java, MySQL, Dynamo
  38. 38. #2. JOB SITE http://www.roberthalf.com/sites/default/files/Media_Root/Images/RH-Images/Using-a-job-search-site.jpg
  39. 39. • Document types: Job Postings, Jobseekers • Find relevant jobs – Simple one-click search – Advanced search (title, keywords, industry, location/distance, salary, requirements) • Elasticsearch as a Recommendation Engine Recommend jobs based on: previously applied/viewed jobs, location, distance, schedule etc. • 2 types of recommendations: – Side banner (You also might be interested in…) – E-mail subscriptions every 2 weeks
  40. 40. SEARCH QUERIES • No fixed document structure (jobs from different providers) • Full-text search • Fuzzy search • Geolocation (distance) • Weighted search: Boosted search clauses • Dynamic scripting (MVEL until v1.4.0, then Groovy)
  41. 41. SOME MORE FACTS • Amount of data: – Job postings: ~1M – Applicants: ~20K • Cluster size: 2 ‘medium’ EC2 instances • Technology stack: – Ruby on Rails – Elasticsearch, PostgreSQL, Redis – Heroku + add-ons, AWS (S3, EC2) – Lots of 3rd party APIs and integrations
  42. 42. IMPLEMENTATION (RUBY) • A Model is ActiveRecord (Ruby on Rails ORM) • ActiveRecord can persist itself to the database • ActiveRecord::Callbacks: – after_commit on [:create, :update] { index_document } – after_commit on [:destroy] { delete_document } – after_create… – after_save … – after_destroy… • Rake tasks to drop/recreate index, reindex documents • Zero-downtime reindexing using aliases • Ruby/Rails client: https://github.com/elastic/elasticsearch-rails
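Zero-downtime reindexing with aliases usually comes down to one atomic alias swap once the new index has been filled. The sketch below only builds the request body for the `_aliases` endpoint; the alias and index names (`jobs`, `jobs_v1`, `jobs_v2`) are hypothetical, and sending the request is left out.

```python
import json


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for POST /_aliases that repoints an alias in one step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the application always queries the alias "jobs".
body = alias_swap_body("jobs", old_index="jobs_v1", new_index="jobs_v2")
print(json.dumps(body, indent=2))
```

Because the application only ever talks to the alias, it keeps searching the old index until the swap and the new one right after it.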
  43. 43. LESSONS LEARNED • On-premise deployment (EC2) vs. SaaS (Bonsai @ Heroku) • Dynamic scripting • PostgreSQL as a backup search engine sucks
  44. 44. #3. CAR TRADING http://bigskybeetles.com/wp-content/uploads/2014/12/restored-beetle-car.png
  45. 45. PARSING ADS Price $3900
  46. 46. 1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG WAT??? • Fuzzy Search (Levenshtein Distance Algorithm) used to parse ads and classify cars • Elasticsearch index contains dictionary (Year, Make, Model, Trim) • Used in conjunction with other approaches: regular expressions, dictionaries of synonyms (VW → Volkswagen, Chevy → Chevrolet), normalization (e.g. LX-370 → LX370) • Algorithm approach: – Parse Year (1996) – Search most relevant Make (VW, volkswagon → Volkswagen) – Search most relevant Model (Passat) for Make = Volkswagen, Year = 1996 – Search most relevant Trim (TDi 4dr Sedan) • Parsing quality: 90% https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
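To make the fuzzy-matching step concrete, here is a small, self-contained Levenshtein distance with a dictionary lookup on top. It is an illustrative sketch, not the project's code: the `MAKES` list, the `closest_make` helper and the distance threshold of 2 are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


# Hypothetical dictionary of makes.
MAKES = ["Volkswagen", "Volvo", "Chevrolet", "Toyota"]


def closest_make(token: str, max_distance: int = 2):
    """Return the dictionary entry nearest to token, or None if nothing is close."""
    best = min(MAKES, key=lambda make: levenshtein(token.lower(), make.lower()))
    if levenshtein(token.lower(), best.lower()) <= max_distance:
        return best
    return None


print(closest_make("volkswagon"))
print(closest_make("VW"))
```

A misspelling such as "volkswagon" is one edit away from the dictionary entry, while an abbreviation such as "VW" is too far from every entry, which is why the slide pairs fuzzy search with a synonym dictionary.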
  47. 47. #4. [NDA] http://cdn.4glaza.ru/images/products/large/0/bresser-junior-loupe-2x-4x-dop6.jpg
  48. 48. SOME UNCOVERED INFO • Check documents against duplicate content • Shingle analysis (commonly used by copywriters and SEO experts) – I have a dream that one day this nation will rise up and live… – Normalization: I have a dream that one day this nation will rise up and live… – Splitting a text into shingles (n-grams), n = 3..10: have dream that; dream that this; that this nation; this nation will; … – Replacement: Latin ‘c’ → Cyrillic ‘c’ https://en.wikipedia.org/wiki/W-shingling
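The shingling steps can be sketched in a few lines. The normalization rule used here (lowercase, keep only words of four or more letters) is an assumption inferred from the slide's example shingles, and comparing shingle sets with Jaccard similarity is one common choice rather than necessarily the project's method.

```python
import re


def normalize(text: str, min_len: int = 4) -> list:
    """Lowercase, strip punctuation, drop short words (assumed rule)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if len(word) >= min_len]


def shingles(words: list, n: int = 3) -> set:
    """All runs of n consecutive words, joined into strings."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap of two shingle sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "I have a dream that one day this nation will rise up and live"
suspect = "I have a dream that some day this nation will rise up and live"

original_shingles = shingles(normalize(original))
suspect_shingles = shingles(normalize(suspect))
similarity = jaccard(original_shingles, suspect_shingles)
print(sorted(original_shingles))
print(round(similarity, 2))
```

A high similarity between two documents' shingle sets flags the pair as likely duplicates even when a few words were changed.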
  49. 49. QUERY API IN DEPTH + DEMO
  50. 50. FILTERS VS. QUERIES As a general rule, filters should be used: • for binary yes/no searches • for queries on exact values. Filters are much faster than queries. Filters are usually great candidates for caching. 27 Filters available (Elasticsearch 1.7.1) https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html
  51. 51. QUERIES VS. FILTERS As a general rule, queries should be used instead of filters: • for full text search • where the result depends on a relevance score Common approach: Filter as many records as possible, then query them. 38 Queries available (Elasticsearch v 1.7.1) https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
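The "filter first, then query" approach maps onto the `filtered` query of the Elasticsearch 1.x query DSL, the version covered by these slides (later major versions changed this syntax). The sketch below only builds the request body as a Python dict; the field names (`title`, `industry`, `remote`) and the helper `filtered_search` are hypothetical.

```python
import json


def filtered_search(text: str, field: str, exact_filters: dict) -> dict:
    """Build a 1.x-style search body: exact-value filters plus a scored match."""
    return {
        "query": {
            "filtered": {
                # Scored part: full-text relevance.
                "query": {"match": {field: text}},
                # Unscored, cacheable part: binary yes/no conditions.
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {name: value}}
                            for name, value in exact_filters.items()
                        ]
                    }
                },
            }
        }
    }


# Hypothetical job search: full text on "title", exact values elsewhere.
body = filtered_search(
    "java developer", "title", {"industry": "finance", "remote": True}
)
print(json.dumps(body, indent=2))
```

The filters narrow the candidate set without computing scores, so only the remaining documents pay the cost of relevance ranking.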
  52. 52. DEMO #2 http://www.socialtalent.co/wp-content/uploads/blog-content/computer-user-confused.jpg
  53. 53. SOME THEORY BEHIND RELEVANCE SCORING full AND text AND search AND (elasticsearch OR lucene) • Term Frequency: How often does the term appear in the document? • Inverse Document Frequency: How often does the term appear in all documents in the collection? • Field-length norm: How long is the field? https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
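The three factors can be written down as small functions. The formulas follow the classic Lucene TF/IDF similarity as described in the Elasticsearch guide linked above; `term_weight` is a simplified, hypothetical combination that leaves out query normalization, coordination and boosts.

```python
import math


def tf(term_freq: int) -> float:
    """Term frequency: more occurrences in the field give a higher weight."""
    return math.sqrt(term_freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: rarer terms across the index weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms: int) -> float:
    """Field-length norm: a match in a shorter field weighs more."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(term_freq: int, num_docs: int, doc_freq: int, num_terms: int) -> float:
    """Simplified per-term weight (ignores queryNorm, coord and boosts)."""
    return tf(term_freq) * idf(num_docs, doc_freq) * field_norm(num_terms)


# A rare term in a short title versus a common term in a long body.
rare_in_title = term_weight(term_freq=1, num_docs=1000, doc_freq=9, num_terms=4)
common_in_body = term_weight(term_freq=1, num_docs=1000, doc_freq=499, num_terms=400)
print(rare_in_title, common_in_body)
```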
  54. 54. MORE COOL FEATURES • Indexing attachments: MS Office, ePub, PDF (Apache Tika) • Autocomplete suggestion: • Did-you-mean suggestion: • Highlight results:
  55. 55. SEARCH IMAGES https://www.theloopyewe.com/shop/search/cd/0-100~75-90-50~18-12-12/g/59A9BAC5/ https://github.com/kzwang/elasticsearch-image
  56. 56. http://orig06.deviantart.net/a893/f/2008/017/1/f/coffee_break____by_dragonshy.jpg
  57. 57. ELASTICSEARCH ECOSYSTEM. ELK STACK + DEMO
  58. 58. CLIENTS http://blog.euranova.eu/wp-content/uploads/2014/04/programming-languages.png
  59. 59. • Java: 1 native client + 1 community supported • Python: 1 official + 7 community supported • Ruby: 1 official + 7 community supported • JavaScript: 1 official + 4 • PHP: 1 official + 4 • C#/.NET: 1 official + 2 • Scala: 4 • Groovy (1), Haskell (1), Perl (1), Clojure (1), Go (3), R (2), Erlang (3), OCaml (2), Smalltalk (1), ColdFusion (1), C++ (1) • Command Line (2) https://www.elastic.co/guide/en/elasticsearch/client/community/current/clients.html
  60. 60. INTEGRATIONS • Django • Ruby on Rails • Spring, Spring Data • Node.js • Symfony, Drupal, WordPress • Grails • Play! Framework https://www.elastic.co/guide/en/elasticsearch/client/community/current/integrations.html
  61. 61. FRONT ENDS http://php.archive.razorflow.com/assets/img/header_v1.png
  62. 62. ELASTICSEARCH-HEAD http://mobz.github.io/elasticsearch-head/
  63. 63. ESCLIENT https://github.com/rdpatil4/ESClient
  64. 64. AVAILABLE FRONT ENDS https://www.elastic.co/guide/en/elasticsearch/client/community/current/front-ends.html • elasticsearch-head: A web front end for an Elasticsearch cluster. • browser: Web front-end over elasticsearch data. • Inquisitor: Front-end to help debug/diagnose queries and analyzers • Hammer: Web front-end for elasticsearch • Calaca: Simple search client for Elasticsearch • ESClient: Simple search, update, delete client for Elasticsearch
  65. 65. HEALTH AND PERFORMANCE http://www.transcend-marketing.co.uk/wp-content/uploads/2014/09/health-check2.png
  66. 66. ELASTICSEARCH-HEAD https://github.com/mobz/elasticsearch-head
  67. 67. BIGDESK https://github.com/lukas-vlcek/bigdesk
  68. 68. WHATSON https://github.com/xyu/elasticsearch-whatson
  69. 69. ELASTICOCEAN https://itunes.apple.com/us/app/elasticocean/id955278030
  70. 70. HEALTH AND PERFORMANCE https://www.elastic.co/guide/en/elasticsearch/client/community/current/health.html • bigdesk: Live charts and statistics for elasticsearch cluster. • Kopf: Live cluster health and shard allocation monitoring with administration toolset. • paramedic: Live charts with cluster stats and indices/shards information. • ElasticsearchHQ: Free cluster health monitoring tool • SPM for Elasticsearch: Performance monitoring with live charts showing cluster and node stats, integrated alerts, email reports, etc. • check-es: Nagios/Shinken plugins for checking on elasticsearch • check_elasticsearch: An Elasticsearch availability and performance monitoring plugin for Nagios. • opsview-elasticsearch: Opsview plugin written in Perl for monitoring Elasticsearch • SegmentSpy: Plugin to watch Lucene segment merges across your cluster • es2graphite: Send cluster and indices stats and status to Graphite for monitoring and graphing. • Scout: Provides plugins for monitoring Elasticsearch nodes, clusters, and indices. • ElasticOcean: Elasticsearch & DigitalOcean iOS Real-Time Monitoring
  71. 71. 10 ES METRICS TO WATCH http://radar.oreilly.com/2015/04/10-elasticsearch-metrics-to-watch.html 1. Cluster health — nodes and shards 2. Node performance — CPU 3. Node performance — memory usage 4. Node performance — disk I/O 5. Java — heap usage and garbage collection 6. Java — JVM pool size 7. Search performance — request latency and request rate 8. Search performance — filter cache 9. Search performance — field data cache 10.Indexing performance — refresh times and merge times
  72. 72. RIVERS (DEPRECATED IN 1.5.0) http://acuate.typepad.com/.a/6a0120a5e84a91970c01539381efff970b-pi
  73. 73. • JDBC River Plugin, CSV River Plugin • MongoDB, CouchDB, Solr, Redis, Neo4j, DynamoDB, RethinkDB, Hazelcast, … • JMS, RabbitMQ, ActiveMQ, Amazon SQS, Kafka, … • Twitter, Wikipedia, Git, GitHub, Subversion, RSS, … • FileSystem, Dropbox, Google Drive, Amazon S3, … • IMAP/POP3, Web, LDAP https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#river
  74. 74. OTHER PLUGINS https://d2wucpkmh57zie.cloudfront.net/wp-content/uploads/2015/04/plugins-together.jpg
  75. 75. • Internationalization, normalization, analysis, language support (Chinese, Japanese, Khmer, Thai etc.), transliteration etc. • Discovery plugins: Amazon AWS, MS Azure, Google GCE, ZooKeeper • Transport plugins: allow using the Elasticsearch REST API over Servlet, ZeroMQ, Jetty, Redis, Memcached • Scripting in Elasticsearch queries: Groovy, JavaScript, Python, Clojure, SQL (!) • Front-ends (CRUD operations) & data visualization • Snapshot/Restore Repository: HDFS, AWS S3, GridFS • Misc: Attachments handling (uses Apache Tika) https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html
  76. 76. ELASTICSEARCH PRODUCT PORTFOLIO http://blog.archisnapper.com/wp-content/uploads/architecture-portfolio.jpg
  77. 77. FOUND ($) • Elasticsearch as a service • Starts from $45/mo (1GB RAM, 8GB SSD, 1 data center) • No deployment and maintenance overhead https://www.elastic.co/products/found
  78. 78. SHIELD ($) • Authentication • Authorization: RBAC • Encrypted communication, IP filtering • Audit logging • Other approaches: • Jetty instead of embedded server • Nginx as a front-end https://www.elastic.co/products/shield
  79. 79. MARVEL ($) • Elasticsearch cluster health check, monitoring, performance • Real-time and historical analysis • Customizable dashboards https://www.elastic.co/products/marvel
  80. 80. WATCHER • Alerts about anomalies in data • Proactive monitoring of ES cluster (in conjunction with Marvel) • Many notification channels: e-mail, SMS, webhooks • Retrospective analysis • High availability https://www.elastic.co/products/watcher
  81. 81. ELK https://pbs.twimg.com/media/CCAkRqVXIAA9cDE.png
  82. 82. LOGSTASH + ELASTIC + KIBANA
  83. 83. LOGSTASH ADVANCED
  84. 84. LOGSTASH • Variety of inputs and outputs (165 plugins) • 120 predefined patterns + custom log formats • Flexible DSL to parse/normalize/enrich logs • Implemented in Ruby, running on JRuby https://www.elastic.co/products/logstash
  85. 85. SOME LOGSTASH INPUTS https://www.elastic.co/guide/en/logstash/current/input-plugins.html • file • stdin • syslog • eventlog • jdbc • varnishlog • websocket • log4j • jmx • s3 • sqs • rss • redis • rabbitmq • zeromq • kafka • twitter • elasticsearch • github • lumberjack
  86. 86. SOME LOGSTASH OUTPUTS https://www.elastic.co/guide/en/logstash/current/output-plugins.html • file • stdout • csv • exec • elasticsearch • email • nagios • syslog • redis • loggly • jira • hipchat • irc • graphite • http • s3 • sqs • sns • rabbitmq • zeromq
  87. 87. KIBANA • Variety of charts: bar charts, line and scatter plots, histograms, pie charts, maps • Flexible and customizable UI, responsive design • Slice and dice data to get necessary details • Seamless integration with Elasticsearch • Simple data export https://www.elastic.co/products/kibana
  88. 88. DEMO #3 http://25.media.tumblr.com/tumblr_mbduvkuspZ1qe6vsbo1_400.jpg
  89. 89. ELASTICSEARCH DRAWBACKS • No transaction support. Elasticsearch is not a database. • No joins, constraints and other RDBMS features • Durability and consistency issues, data loss: – https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0 – https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
  90. 90. PERFORMANCE? http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/ http://solr-vs-elasticsearch.com/ • Apache Solr can be faster than ES in search-only scenarios, while Elasticsearch usually outperforms Solr when doing writes and reads concurrently • Sphinx is faster at indexing (up to 15MB/s per core) • Performance issues can usually be fixed by horizontal scaling
  91. 91. SUMMARY • ES is not a silver bullet, but it is a really powerful tool • Elasticsearch is not an RDBMS and is not supposed to act as a database. Choose your tools properly. Leverage the synergy of DB + ES • Elasticsearch is dead simple at the start but can get sophisticated as you go • Kick off easily, then hire a good DevOps engineer for best results • The ecosystem around Elasticsearch is just amazing • Give it a try – it can bring a lot of value to your product and your CV ;) http://www.aperfectworld.org/clipart/gestures/rockhard11.png
  92. 92. QUESTIONS? http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
  93. 93. THANK YOU! http://conveyancingderby.co/wp-content/uploads/2011/07/cat-card.jpg
  94. 94. USEFUL LINKS • Elasticsearch: https://www.elastic.co/products/elasticsearch • Logstash: https://www.elastic.co/products/logstash • Kibana: https://www.elastic.co/products/kibana • Scripts for the demos: https://github.com/opanchenko/morning-at-lohika-ELK
