ElasticSearch on AWS
                      Real Estate portal Case Study (Spitogatos.gr)


                                                    AWSUG GR meetup #7
                                                      27 September 2012




                                                    Andreas Chatzakis
                                     co-founder / IT Director – Spitogatos.gr

Event sponsored by:                                       @achatzakis on twitter
http://geekandpoke.typepad.com/geekandpoke/2010/09/instant-search.html
#about_us
Helping you find a property

Finding a property in Greece is complex and lacks transparency.
We make life easier for househunters via:
     Powerful search functionality
          Web & Mobile
          Location & Criteria
     Quality content
          Listings (we love photos)
          Articles
     mySpitogatos
          Email alerts
          Save your search
          Favorite listings & notes
          Contact the realtors


Realtors love us too!

Professionals need help in these turbulent times.
We add value in multiple ways:
     Cost effective promotion & high quality leads
          Targeted channel (very)
          Leads already filtered (“we've seen the photos!”)
     Technology services for realtors
          Turnkey web site solution
          Listing synchronization web service
     B2B via Spitogatos Network (SpiN): business network / collaboration tool
      for realtors
     Channel for foreign buyers via the English version




#background
To Search is to Find

Search is central to what we do
   Users searching for property come with structured criteria of huge variety
        Athens Center, residential - flat or studio, for sale, 100-150k €, 85-120 sq meter,
         with a garage
        Athens Center & N.Kosmos, residential - flat, for sale, 75-100k €, 70-100 sq meter,
         2+ bedrooms, only show listings with photos
        Piraeus centre or Mikrolimano, commercial – store, for rent, 500-750 € per
         month, only listings with recently reduced price
        Monetize: # of Listings grouped by paying member + above criteria
        iPhone app → Listings within geo-rectangle + above criteria
        As a result, caching is rarely our friend!
   We used to think Lucene/Solr, ElasticSearch, CloudSearch etc. were only useful
    for text search, not adding value for structured search
   We had been insisting on trying to optimize MySQL (multi-column indices etc.)
    while throwing replicas at the problem.
Why ElasticSearch

Selected ElasticSearch after (very) brief research* into alternatives:
   AWS's own CloudSearch:
        Zero management service: nice!
        Not available on eu-west-1
        Currently lacks ES functionality (e.g. geospatial, non-English analyzers)
   Sphinx
        Easy MySQL integration
        How do you scale it?*
   Solr
        Industry standard
        Seems to be perceived as somewhat harder to scale/operate?*
   ElasticSearch:
        Piece of cake to set up on AWS (stay tuned!)
        Super distributed, scales & is easy on IT ops (more on that later!)

* Disclaimer: We did not go through a detailed product selection process!
#elasticsearch
ElasticSearch basics

A distributed, RESTful Search engine built on top of Lucene
   Schema free
        JSON documents
        Analyzers
        Boost levels
   Easy & flexible Search
        Lucene query string or JSON based search query DSL
        Facets & Highlighting
        Spatial search
        Custom scripts
   Multi Tenancy
        Store & search across multiple indices
        Each with its own settings
        Use-case: Logs – recent in memory, old on disk
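
To make the JSON query DSL concrete, here is a minimal sketch of how the first search example from the background slide (Athens Center, flat or studio, for sale, 100-150k €, 85-120 sq meter, with a garage) might be expressed as filters plus a terms facet. It targets the facet-era DSL of 2012 (`filtered` query, `bool` filter, `facets`), which later ElasticSearch versions replaced. All field names (`location`, `property_type`, `transaction`, `price`, `sq_meters`, `has_garage`, `agent_id`) and the value `athens-center` are assumptions for illustration, not the actual Spitogatos mapping.

```python
import json


def build_search_body(area, types, min_price, max_price, min_sqm, max_sqm, garage):
    """Build a filter-only search body for one structured property search.

    Field names are hypothetical; only the DSL constructs are real.
    """
    filters = [
        {"term": {"location": area}},
        {"terms": {"property_type": types}},
        {"term": {"transaction": "sale"}},
        {"range": {"price": {"from": min_price, "to": max_price}}},
        {"range": {"sq_meters": {"from": min_sqm, "to": max_sqm}}},
    ]
    if garage:
        filters.append({"term": {"has_garage": True}})
    return {
        "query": {
            "filtered": {
                # No text relevance needed: match everything, then filter.
                "query": {"match_all": {}},
                "filter": {"bool": {"must": filters}},
            }
        },
        # Number of matching listings per agent (hypothetical field name).
        "facets": {
            "paying_agents": {"terms": {"field": "agent_id", "size": 50}}
        },
    }


body = build_search_body(
    "athens-center", ["flat", "studio"], 100000, 150000, 85, 120, True
)
print(json.dumps(body, indent=2))
```

The printed body is what would be sent to the `_search` endpoint of an index; the sketch only builds it and does not talk to a cluster.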

Scaling ElasticSearch

Designed from the ground up to be Scalable & Highly Available
   Distributed
        Indices automatically broken into shards
        Replicas for read performance & availability
        Multiple cluster nodes, each hosting 1+ shards/replicas
        Peer-to-peer: each node can delegate operations to other nodes
        Add or remove nodes at will
              Rebalancing & routing automagically behind the scenes
   Discovery
        Multicast or unicast (declarative)
   Gateway
        Allows recovery in case all nodes go down
        Local or shared storage
        Async replication in case of shared storage

A scale-up example

Assume a cluster with 4 shards and 1 replica configuration
   1 node example – Status Yellow



   2 nodes example – Status Green



   3 nodes example




   [Diagrams: shard layout per node. Legend: primary shard, replica shard, master node, regular node]

The master node maintains cluster state and reacts when nodes join or leave the cluster by reassigning shards.
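
The yellow/green statuses above follow from one rule: a replica is never placed on the same node as another copy of the same shard. The sketch below simulates that rule with a simple greedy placement (least-loaded eligible node first). It is a toy model under that assumption, not ElasticSearch's actual allocation algorithm, and the node names are made up.

```python
def allocate(num_shards, num_replicas, nodes):
    """Place primaries and replicas so no node holds two copies of a shard.

    Returns (placement, unassigned): placement maps node -> [(shard, kind)],
    unassigned counts shard copies that could not be placed anywhere.
    """
    placement = {node: [] for node in nodes}
    unassigned = 0
    for shard in range(num_shards):
        for copy in range(num_replicas + 1):  # copy 0 is the primary
            # Only nodes that do not already hold a copy of this shard qualify.
            candidates = [
                n for n in nodes if all(s != shard for s, _ in placement[n])
            ]
            if not candidates:
                unassigned += 1
                continue
            target = min(candidates, key=lambda n: len(placement[n]))
            placement[target].append((shard, "P" if copy == 0 else "R"))
    return placement, unassigned


def health(num_shards, num_replicas, nodes):
    """Green: every copy placed. Yellow: all primaries placed, some replicas not."""
    placement, unassigned = allocate(num_shards, num_replicas, nodes)
    primaries = sum(
        1 for copies in placement.values() for _, kind in copies if kind == "P"
    )
    if primaries < num_shards:
        return "red"
    return "yellow" if unassigned else "green"


for cluster in (["node1"], ["node1", "node2"], ["node1", "node2", "node3"]):
    print(len(cluster), "node(s):", health(4, 1, cluster))
```

With a single node the four replicas have nowhere to go, so the status is yellow; from two nodes onward every copy can be placed and the status turns green.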
ElasticSearch on AWS

2 modules make deployment on AWS a breeze
   EC2 discovery
        Filter by security group, AZ, tags
              Requires IAM user with certain EC2 privileges:
               DescribeAvailabilityZones, DescribeInstances, DescribeRegions,
               DescribeSecurityGroups, DescribeTags
       Very useful in autoscaling setups with ephemeral servers
   S3 gateway
        Long-term, reliable async persistence of cluster state and indices
        Allows deployment without EBS volumes
        Still, local gateway with EBS volumes performs better (less network used,
         faster recovery)
        Won't protect from accidental deletion of index (deletion will propagate to
         shared storage)
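
As an illustration of the IAM requirement for EC2 discovery, the following sketch generates a policy document that grants exactly the five privileges listed above. The action names come from the slide; the `Version` string is the current IAM policy language version, and `Resource: "*"` is used because EC2 Describe actions are not scoped to individual resources. The function name is hypothetical.

```python
import json

# The five EC2 privileges named on the slide, in IAM action syntax.
EC2_DISCOVERY_ACTIONS = [
    "ec2:DescribeAvailabilityZones",
    "ec2:DescribeInstances",
    "ec2:DescribeRegions",
    "ec2:DescribeSecurityGroups",
    "ec2:DescribeTags",
]


def discovery_policy():
    """Return a read-only IAM policy document for EC2-based node discovery."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": EC2_DISCOVERY_ACTIONS,
                "Resource": "*",
            }
        ],
    }


print(json.dumps(discovery_policy(), indent=2))
```

The resulting JSON can be attached to the IAM user whose credentials the nodes use; it grants no write access to EC2.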


#implementation
Indexation

Indexation of Spitogatos.gr ads
   DB is still the “source of truth”
        We propagate DELETEs synchronously, INSERTs & UPDATEs asynchronously
              KISS: Cron job (re) indexes never or least-recently indexed listings
              ORM marks new/modified listings as never-indexed (so they go first)
   Location: Multivalue field instead of nested set model in the DB
        e.g. this property is in Greece, Attica, Piraeus, Port of Piraeus
        Property will be included in results when I search for any of the above.
   Flat schema
        Searchable listing owner fields are included in the document (vs a JOIN in our DB)
        Changes to other tables might lead to large # of listings requiring reindexation
         (e.g. real estate agent becomes a paying member)
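
The indexing approach above can be sketched roughly as follows: pick never-indexed listings first, then the least recently indexed ones, and flatten each listing together with its owner fields and its area hierarchy into one document. Every name here (`indexed_at`, `owner_is_paying`, the sample data) is an assumption for illustration; the real cron job and schema are not shown in the slides.

```python
from datetime import datetime


def next_batch(listings, size):
    """Never-indexed listings (indexed_at is None) first, then oldest first."""
    ordered = sorted(
        listings,
        key=lambda l: (l["indexed_at"] is not None, l["indexed_at"] or datetime.min),
    )
    return ordered[:size]


def to_document(listing, owner, area_path):
    """Flatten a listing, its owner and its area hierarchy into one document."""
    return {
        "id": listing["id"],
        "price": listing["price"],
        # Multivalue field: the listing matches a search on any of these areas.
        "location": list(area_path),
        # Owner fields are copied in, replacing the JOIN done in the DB.
        "owner_id": owner["id"],
        "owner_is_paying": owner["is_paying"],
    }


listings = [
    {"id": 1, "price": 120000, "indexed_at": datetime(2012, 9, 1)},
    {"id": 2, "price": 95000, "indexed_at": None},  # new or modified
    {"id": 3, "price": 80000, "indexed_at": datetime(2012, 8, 1)},
]
owner = {"id": 77, "is_paying": True}
area_path = ["Greece", "Attica", "Piraeus", "Port of Piraeus"]

batch = next_batch(listings, 2)
docs = [to_document(l, owner, area_path) for l in batch]
print([d["id"] for d in docs])
```

Because owner fields are copied into every document, a change such as an agent becoming a paying member would mean resetting `indexed_at` on all of that agent's listings so the job picks them up again.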




Index Integrity

Making sure our index is consistent with the DB
   Scrutineer ( https://github.com/Aconex/scrutineer )
        Compares DB and ElasticSearch index for mismatches
             exists in ES but not in the DB (or vice versa)
             ES version not up to date
        Relies on “_version” field - is incremented via our ORM onChange
        When indexing we explicitly set versioning to “external”
        Had to “hack” it as it doesn't work with EC2 discovery module
           http://labs.spitogatos.gr/?p=45
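
The two ideas on this slide can be modelled in a few lines: with external versioning, a write is accepted only if its version is higher than the stored one, and a consistency check compares id/version pairs from both sides. This is a plain-Python simulation under those assumptions; it is not Scrutineer's implementation and does not call ElasticSearch. All function names and sample data are hypothetical.

```python
def index_external(index, doc_id, version, source):
    """Accept a write only if its version is newer than the stored one."""
    current = index.get(doc_id)
    if current is not None and version <= current["version"]:
        return False  # stale or duplicate write is rejected
    index[doc_id] = {"version": version, "source": source}
    return True


def find_mismatches(db_versions, es_versions):
    """Compare {id: version} maps from the DB and from the search index."""
    db_ids, es_ids = set(db_versions), set(es_versions)
    return {
        "missing_in_es": sorted(db_ids - es_ids),
        "missing_in_db": sorted(es_ids - db_ids),
        "stale_in_es": sorted(
            i for i in db_ids & es_ids if es_versions[i] < db_versions[i]
        ),
    }


index = {}
print(index_external(index, 1, 2, {"price": 100000}))  # accepted
print(index_external(index, 1, 1, {"price": 90000}))   # rejected: older version

db = {1: 3, 2: 1, 3: 2}
es = {1: 2, 3: 2, 4: 1}
report = find_mismatches(db, es)
print(report)
```

Since the DB increments the version on every change, a late-arriving asynchronous update can never overwrite a newer copy of the same listing in the index.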




Search – Shards & Routing

How does ElasticSearch decide in which shard to store a doc?
   By default this is done based on hash of document id
   Can be overridden while indexing and while searching (routing parameter)
   We shard based on a hash of the area id
       - Most users search for listings within a specific area
       - We hit only a single shard for a large percentage of the searches.
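
The effect of routing can be sketched as "shard = hash(routing value) mod number of shards". The example below uses CRC32 purely as a stand-in for a stable hash (ElasticSearch uses its own hash function), and the shard count of 4 and the area ids are assumptions. It shows why a search routed by area id touches one shard while an unrouted search fans out to all of them.

```python
import zlib

NUM_SHARDS = 4  # assumed shard count for the example


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Map a routing value to a shard using a stable stand-in hash (CRC32)."""
    return zlib.crc32(str(routing_value).encode("utf-8")) % num_shards


def shards_to_query(area_id=None, num_shards=NUM_SHARDS):
    """Without routing every shard is searched; with routing only one."""
    if area_id is None:
        return list(range(num_shards))
    return [shard_for(area_id, num_shards)]


# Listings indexed with routing=area_id end up on the shard of their area.
listings = [{"id": 1, "area_id": 42}, {"id": 2, "area_id": 42}, {"id": 3, "area_id": 7}]
for listing in listings:
    print(listing["id"], "-> shard", shard_for(listing["area_id"]))

print("unrouted search hits", len(shards_to_query()), "shards")
print("search routed by area hits", len(shards_to_query(area_id=42)), "shard")
```

The same routing value must be supplied at index time and at search time; a search spanning several areas would have to pass each of their routing values or fall back to querying all shards.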




   [Diagrams: no routing specified vs. routing by a specific areaId]
Search – Flat Schema, Facets & Scoring

We rely a lot on ElasticSearch's Flat Schema, Facets & Scoring
   No joins due to flat schema => fast!
   Multivalue fields => fast filtering for listings in areas of various hierarchy levels
   Facets functionality returns list of paying agents with # listings matching criteria
   Old slow ranking algorithm replaced by ElasticSearch scoring functionality
        We used to go through our DB and refresh the score
             ad age is part of the equation
        Now ES computes this dynamically on every search
        We use custom scoring
        We can modify scoring algorithm and see changes instantly
             no need to recalculate scores for all listings
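
To illustrate why query-time scoring removes the need for a batch refresh, here is a small sketch in which ad age enters the score as an exponential decay. The formula, the 30-day half-life and the `base_quality` input are entirely hypothetical; the slides only state that ad age is part of the equation. In ElasticSearch the equivalent logic would live in a custom scoring script evaluated per search, whereas this is plain Python.

```python
from datetime import date


def listing_score(base_quality, published_on, today, half_life_days=30.0):
    """Score that halves every `half_life_days` as the ad gets older."""
    age_days = (today - published_on).days
    decay = 0.5 ** (age_days / half_life_days)
    return base_quality * decay


today = date(2012, 9, 27)
fresh = listing_score(1.0, date(2012, 9, 27), today)
month_old = listing_score(1.0, date(2012, 8, 28), today)
print(fresh, month_old)
```

Because the age term is computed from the current date on every search, nothing stored in the index goes out of date as ads get older, and changing `half_life_days` takes effect on the next query.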




Monitoring

Sematext SPM offers a (currently free) ES monitoring solution
   Cluster Health       Search rate & latency      Disk
   Index Stats          Cache                      Network
   Shard Stats          CPU & RAM                  JVM & GC




Tooling

ElasticSearch-Head is a GUI for browsing / interacting with a cluster




Backups

 We take periodic copies from the Gateway
    Because the Gateway is no cure for accidental deletions or bugs
    S3cmd syncs S3 gateway contents to local folder
         Expect some errors here as files get deleted/modified
    Disables snapshots to gateway
    Syncs again (no errors this time and much faster)
    Reenables snapshots to gateway
    Zips local folder contents, splits into smaller files & uploads to secondary S3 bucket




Get the script here: http://labs.spitogatos.gr/?p=17


Learnings

Issues & lessons learned:
   Faceted search can return wrong (smaller) counts when run across multiple shards
        Due to the way sorting/merging is done
        Increase the facet size parameter depending on the cardinality of the faceted field
   We use Elastica – a PHP client for ElasticSearch - https://github.com/ruflin/Elastica
        Lacking Document Routing and Version Type support
        Our own Jerry Manolarakis is working on a pull request to add setRouting, setVersionType
   Filters vs queries (Query DSL)
        Filters perform an order of magnitude better than plain queries since no scoring is
         performed and they are automatically cached.
   Do it! Your DB will thank you
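
The facet issue listed first above can be reproduced with a toy model: each shard returns only its own top `size` terms, and the coordinating node sums what it receives. Terms that fall outside a shard's local top list are silently dropped, so merged counts come out too small. This is a simplified simulation with made-up data, not ElasticSearch's actual merge code.

```python
from collections import Counter


def shard_top_terms(values, size):
    """What a single shard reports: its local top `size` terms with counts."""
    return Counter(values).most_common(size)


def merged_facet(shards, size):
    """Sum the per-shard top lists, then keep the overall top `size` terms."""
    total = Counter()
    for values in shards:
        for term, count in shard_top_terms(values, size):
            total[term] += count
    return dict(total.most_common(size))


# Two shards; terms are hypothetical agent ids attached to matching listings.
shards = [
    ["a"] * 5 + ["b"] * 4 + ["c"] * 3,
    ["c"] * 5 + ["d"] * 4 + ["a"] * 3,
]

print(merged_facet(shards, 2))  # counts for "a" and "c" are too small
print(merged_facet(shards, 4))  # size >= cardinality: counts are exact
```

With `size=2` the merged result reports 5 for both "a" and "c" although each really occurs 8 times; raising the size to the field's cardinality (4 here) makes every shard report all terms and the counts become exact.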




[Charts: CPU utilization; response time pattern]
Read more
    Useful resources:

   https://speakerdeck.com/u/jmikola/p/symfony-live-london-elasticsearch
   http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/
   http://www.slideshare.net/elasticsearch/elasticsearch-at-berlinbuzzwords-2010
   http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext


    Need help integrating ElasticSearch into your app?




    http://bacterials.net/


                                                     Follow us on twitter: @spitogatosLabs
                                                 Check out our blog: http://labs.spitogatos.gr

#questions

More Related Content

What's hot

What's hot (20)

Consolidate MySQL Shards Into Amazon Aurora Using AWS Database Migration Serv...
Consolidate MySQL Shards Into Amazon Aurora Using AWS Database Migration Serv...Consolidate MySQL Shards Into Amazon Aurora Using AWS Database Migration Serv...
Consolidate MySQL Shards Into Amazon Aurora Using AWS Database Migration Serv...
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
Scaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsScaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique Visitors
 
Aws Kinesis
Aws KinesisAws Kinesis
Aws Kinesis
 
Optimizing Storage for Big Data/Analytics Workloads
Optimizing Storage for Big Data/Analytics WorkloadsOptimizing Storage for Big Data/Analytics Workloads
Optimizing Storage for Big Data/Analytics Workloads
 
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
 
Compare DynamoDB vs. MongoDB
Compare DynamoDB vs. MongoDBCompare DynamoDB vs. MongoDB
Compare DynamoDB vs. MongoDB
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
 
A Deeper Dive into Apache MXNet - March 2017 AWS Online Tech Talks
A Deeper Dive into Apache MXNet - March 2017 AWS Online Tech TalksA Deeper Dive into Apache MXNet - March 2017 AWS Online Tech Talks
A Deeper Dive into Apache MXNet - March 2017 AWS Online Tech Talks
 
AWS re:Invent 2016 Recap: What Happened, What It Means
AWS re:Invent 2016 Recap: What Happened, What It MeansAWS re:Invent 2016 Recap: What Happened, What It Means
AWS re:Invent 2016 Recap: What Happened, What It Means
 
Beyond EC2 and S3
Beyond EC2 and S3Beyond EC2 and S3
Beyond EC2 and S3
 
Fast Data at Scale - AWS Summit Tel Aviv 2017
Fast Data at Scale - AWS Summit Tel Aviv 2017Fast Data at Scale - AWS Summit Tel Aviv 2017
Fast Data at Scale - AWS Summit Tel Aviv 2017
 
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
AWS re:Invent 2016: Case Study: Librato's Experience Running Cassandra Using ...
 
Amazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto MeetupAmazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto Meetup
 
Strategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud StorageStrategic Uses for Cost Efficient Long-Term Cloud Storage
Strategic Uses for Cost Efficient Long-Term Cloud Storage
 
Cloud Storage in Azure, AWS and Google Cloud
Cloud  Storage in Azure, AWS and Google CloudCloud  Storage in Azure, AWS and Google Cloud
Cloud Storage in Azure, AWS and Google Cloud
 
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
 

Viewers also liked

Sharding with MongoDB (Eliot Horowitz)
Sharding with MongoDB (Eliot Horowitz)Sharding with MongoDB (Eliot Horowitz)
Sharding with MongoDB (Eliot Horowitz)
MongoSF
 
AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigDat...
AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigDat...AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigDat...
AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigDat...
Amazon Web Services
 
Logging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & KibanaLogging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & Kibana
Amazee Labs
 

Viewers also liked (20)

DynamoDB for PHP sessions
DynamoDB for PHP sessionsDynamoDB for PHP sessions
DynamoDB for PHP sessions
 
Scalr Demo
Scalr DemoScalr Demo
Scalr Demo
 
Ansible pill09wp
Ansible pill09wpAnsible pill09wp
Ansible pill09wp
 
Key considerations when adopting cloud: expectations vs hurdles
Key considerations when adopting cloud: expectations vs hurdlesKey considerations when adopting cloud: expectations vs hurdles
Key considerations when adopting cloud: expectations vs hurdles
 
Scalr cost analytics talk
Scalr cost analytics talkScalr cost analytics talk
Scalr cost analytics talk
 
Perl and Elasticsearch
Perl and ElasticsearchPerl and Elasticsearch
Perl and Elasticsearch
 
CCCEU14 - A Real World Outlook on Hybrid Cloud: Why and How
CCCEU14 - A Real World Outlook on Hybrid Cloud: Why and HowCCCEU14 - A Real World Outlook on Hybrid Cloud: Why and How
CCCEU14 - A Real World Outlook on Hybrid Cloud: Why and How
 
03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data Out03. ElasticSearch : Data In, Data Out
03. ElasticSearch : Data In, Data Out
 
Selling Umbraco - CodeGarden 2015
Selling Umbraco - CodeGarden 2015Selling Umbraco - CodeGarden 2015
Selling Umbraco - CodeGarden 2015
 
Personalize Expedia Hotel Searches
Personalize Expedia Hotel SearchesPersonalize Expedia Hotel Searches
Personalize Expedia Hotel Searches
 
Scalr - Open Source Cloud Management
Scalr - Open Source Cloud Management Scalr - Open Source Cloud Management
Scalr - Open Source Cloud Management
 
Elasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuningElasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuning
 
Sharding with MongoDB (Eliot Horowitz)
Sharding with MongoDB (Eliot Horowitz)Sharding with MongoDB (Eliot Horowitz)
Sharding with MongoDB (Eliot Horowitz)
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Machine Learning Travel Industry
Machine Learning   Travel IndustryMachine Learning   Travel Industry
Machine Learning Travel Industry
 
Selling umbraco
Selling umbracoSelling umbraco
Selling umbraco
 
Tuning Elasticsearch Indexing Pipeline for Logs
Tuning Elasticsearch Indexing Pipeline for LogsTuning Elasticsearch Indexing Pipeline for Logs
Tuning Elasticsearch Indexing Pipeline for Logs
 
AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigDat...
AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigDat...AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigDat...
AWS Partner Presentation - PetaByte Scale Computing on Amazon EC2 with BigDat...
 
Logging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & KibanaLogging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & Kibana
 

Similar to ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)

Elastic search apache_solr
Elastic search apache_solrElastic search apache_solr
Elastic search apache_solr
macrochen
 

Similar to ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr) (20)

ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learned
 
Zenko @Cloud Native Foundation London Meetup March 6th 2018
Zenko @Cloud Native Foundation London Meetup March 6th 2018Zenko @Cloud Native Foundation London Meetup March 6th 2018
Zenko @Cloud Native Foundation London Meetup March 6th 2018
 
The Power of Elasticsearch
The Power of ElasticsearchThe Power of Elasticsearch
The Power of Elasticsearch
 
AWS case study: real estate portal
AWS case study: real estate portalAWS case study: real estate portal
AWS case study: real estate portal
 
Keynote: Your Future With Cloud Computing - Dr. Werner Vogels - AWS Summit 2...
Keynote: Your Future With Cloud Computing - Dr. Werner Vogels  - AWS Summit 2...Keynote: Your Future With Cloud Computing - Dr. Werner Vogels  - AWS Summit 2...
Keynote: Your Future With Cloud Computing - Dr. Werner Vogels - AWS Summit 2...
 
Kubernetes in 15 minutes
Kubernetes in 15 minutesKubernetes in 15 minutes
Kubernetes in 15 minutes
 
SQL for Elasticsearch
SQL for ElasticsearchSQL for Elasticsearch
SQL for Elasticsearch
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
 
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
 
Introduction to AWS tools
Introduction to AWS toolsIntroduction to AWS tools
Introduction to AWS tools
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Scaling the Content Repository with Elasticsearch
Scaling the Content Repository with ElasticsearchScaling the Content Repository with Elasticsearch
Scaling the Content Repository with Elasticsearch
 
Clouds in Your Coffee Session with Cleversafe & Avere
Clouds in Your Coffee Session with Cleversafe & AvereClouds in Your Coffee Session with Cleversafe & Avere
Clouds in Your Coffee Session with Cleversafe & Avere
 
20141021 AWS Cloud Taekwon - Startup Best Practices on AWS
20141021 AWS Cloud Taekwon - Startup Best Practices on AWS20141021 AWS Cloud Taekwon - Startup Best Practices on AWS
20141021 AWS Cloud Taekwon - Startup Best Practices on AWS
 
Elastic search apache_solr
Elastic search apache_solrElastic search apache_solr
Elastic search apache_solr
 
Zenko: Enabling Data Control in a Multi-cloud World
Zenko: Enabling Data Control in a Multi-cloud WorldZenko: Enabling Data Control in a Multi-cloud World
Zenko: Enabling Data Control in a Multi-cloud World
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
 
Building a Resilient, Scalable, Storage System with OpenStack
Building a Resilient, Scalable, Storage System with OpenStackBuilding a Resilient, Scalable, Storage System with OpenStack
Building a Resilient, Scalable, Storage System with OpenStack
 
OpenStack Architected Like AWS (and GCP)
OpenStack Architected Like AWS (and GCP)OpenStack Architected Like AWS (and GCP)
OpenStack Architected Like AWS (and GCP)
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)

  • 1. ElasticSearch on AWS Real Estate portal Case Study (Spitogatos.gr) AWSUG GR meetup #7 27 September 2012 Andreas Chatzakis co-founder / IT Director – Spitogatos.gr Event sponsored by: @achatzakis on twitter
  • 4. Helping you find a property Finding a property in Greece is complex, lacks transparency. We make life easier for househunters via:  Powerful search functionality  Web & Mobile  Location & Criteria  Quality content  Listings (we love photos)  Articles  mySpitogatos  Email alerts  Save your search  Favorite listings & notes  Contact the realtors 4
  • 5. Realtors love us too! Professionals need help in those turbulent times. We add value in multiple ways:  Cost effective promotion & high quality leads  Targeted channel (very)  Leads already filtered (we ve seen the fotos!)  Technology services for realtors  Turnkey web site solution  Listing synchronization web service  B2B via Spitogatos Network (SpiN) business network / collaboration tool for realtors  Channel for foreign buyers via the English version 5
  • 7. To Search is to Find Search is central to what we do  Users searching for property come with structured criteria of huge variety  Athens Center, residential - flat or studio, for sale, 100-150k €, 85-120 sq meter, with a garage  Athens Center & N.Kosmos, residential - flat, for sale, 75-100k €, 70-100 sq meter, 2+ bedrooms, only show listings with photos  Piraeus centre or Mikrolimano, commercial – store, for rent, 500-750 € per month, only listings with recently reduced price  Monetize: # of Listings grouped by paying member + above criteria  IPhone app → Listings within geo-rectangle + above criteria  As a result, caching is rarely our friend!  We used to think Lucene/Solr, ElasticSearch, CloudSearch etc were only useful for text search, not adding value for structured search G  Have been insisting on trying to optimize MySQL (multi column indices etc) N while throwing replicas to the problem. O R 7
  • 8. Why ElasticSearch
    We selected ElasticSearch after (very) brief research* into alternatives:
      - AWS's own CloudSearch:
          - Zero management service: nice!
          - Not available on eu-west-1
          - Currently lacks ES functionality (e.g. geospatial, non-English analyzers)
      - Sphinx
          - Easy MySQL integration
          - How do you scale it?*
      - Solr
          - Industry standard
          - Seems to be perceived as somewhat harder to scale/operate*?
      - ElasticSearch:
          - Piece of cake to set up on AWS (stay tuned!)
          - Super distributed, scales & is easy on IT ops (more on that later!)
    * Disclaimer: We did not go through a detailed product selection process!
  • 10. ElasticSearch basics
    A distributed, RESTful search engine built on top of Lucene.
      - Schema-free
          - JSON documents
          - Analyzers
          - Boost levels
      - Easy & flexible search
          - Lucene query string or JSON-based search query DSL
          - Facets & highlighting
          - Spatial search
          - Custom scripts
      - Multi-tenancy
          - Store & search across multiple indices
          - Each with its own settings
          - Use case: logs - recent in memory, old on disk
  • 11. Scaling ElasticSearch
    Designed from the ground up to be scalable & highly available.
      - Distributed
          - Indices automatically broken into shards
          - Replicas for read performance & availability
          - Multiple cluster nodes, each hosting 1+ shards/replicas
          - Peer-to-peer: each node can delegate operations to other nodes
          - Add or remove nodes at will
          - Rebalancing & routing happen automagically behind the scenes
      - Discovery
          - Multicast or unicast (declarative)
      - Gateway
          - Allows recovery in case all nodes go down
          - Local or shared storage
          - Async replication in case of shared storage
  • 12. A scale-up example
    Assume a cluster configured with 4 shards and 1 replica:
      - 1 node example: status Yellow
      - 2 nodes example: status Green
      - 3 nodes example
    Diagram legend (diagrams not reproduced): primary shard, replica shard, master node, regular node.
    The master node maintains cluster state and reacts when nodes join or leave the cluster by reassigning shards.
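The Yellow/Green statuses in the scale-up example can be reproduced with a small simulation. This is a simplified sketch, not ElasticSearch's actual allocation logic: it assumes a round-robin placement (the real master balances shards with its own heuristics) and relies only on the rule that a replica is never placed on the same node as another copy of the same shard. The names `allocate` and `cluster_health` are hypothetical.

```python
def allocate(num_nodes, num_shards=4, num_replicas=1):
    """Place primaries (P) and replicas (R) round-robin, one copy of a shard per node.

    Assumption: round-robin placement; real ElasticSearch allocation differs.
    """
    nodes = [[] for _ in range(num_nodes)]
    for shard in range(num_shards):
        # A shard can have at most one copy per node, so extra copies stay unassigned.
        for copy in range(min(num_replicas + 1, num_nodes)):
            role = "P" if copy == 0 else "R"
            nodes[(shard + copy) % num_nodes].append(f"{role}{shard}")
    return nodes


def cluster_health(num_nodes, num_shards=4, num_replicas=1):
    """Green: every copy assigned. Yellow: all primaries assigned, some replicas not."""
    if num_nodes < 1:
        return "red"  # no node can host the primaries
    assigned = sum(len(node) for node in allocate(num_nodes, num_shards, num_replicas))
    expected = num_shards * (num_replicas + 1)
    return "green" if assigned == expected else "yellow"


for n in (1, 2, 3):
    print(n, cluster_health(n), allocate(n))
```

With a single node the four replicas have nowhere to go, hence Yellow; from two nodes onward every copy finds a home, and adding a third node simply spreads the same eight copies more thinly.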
  • 13. ElasticSearch on AWS
    Two modules make deployment on AWS a breeze.
      - EC2 discovery
          - Filter by security group, AZ, tags
          - Requires an IAM user with certain EC2 privileges: DescribeAvailabilityZones, DescribeInstances, DescribeRegions, DescribeSecurityGroups, DescribeTags
          - Very useful in autoscaling setups with ephemeral servers
      - S3 gateway
          - Long-term, reliable, async persistence of cluster state and indices
          - Allows deployment without EBS volumes
          - Still, a local gateway with EBS volumes performs better (less network used, faster recovery)
          - Won't protect from accidental deletion of an index (the deletion will propagate to shared storage)
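The IAM privileges listed for EC2 discovery can be expressed as a policy document. The sketch below only assembles the JSON from the five actions named on the slide; the helper name `build_discovery_policy` is hypothetical, and the wildcard resource is an assumption based on these being read-only `Describe*` calls.

```python
import json

# The five EC2 privileges named on the slide, in IAM action notation.
EC2_DISCOVERY_ACTIONS = [
    "ec2:DescribeAvailabilityZones",
    "ec2:DescribeInstances",
    "ec2:DescribeRegions",
    "ec2:DescribeSecurityGroups",
    "ec2:DescribeTags",
]


def build_discovery_policy():
    """Return a read-only IAM policy document for the discovery user (a sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": EC2_DISCOVERY_ACTIONS,
                "Resource": "*",  # assumption: Describe* calls are not scoped per resource
            }
        ],
    }


print(json.dumps(build_discovery_policy(), indent=2))
```

Keeping the discovery user limited to these read-only calls means that leaked node credentials cannot be used to start, stop or modify instances.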
  • 15. Indexation
    Indexation of Spitogatos.gr ads.
      - The DB is still the “source of truth”
          - We propagate DELETEs synchronously, INSERTs & UPDATEs asynchronously
          - KISS: a cron job (re)indexes never-indexed or least-recently-indexed listings
          - The ORM marks new/modified listings as never-indexed (so they go first)
      - Location: multivalue field instead of the nested set model in the DB
          - e.g. this property is in Greece, Attica, Piraeus, Port of Piraeus
          - The property will be included in results when I search for any of the above
      - Flat schema
          - Searchable listing owner fields are included in the document (vs. a JOIN in our DB)
          - Changes to other tables might lead to a large # of listings requiring reindexation (e.g. a real estate agent becomes a paying member)
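The indexation approach above can be sketched in a few lines. This is an illustration only, not the Spitogatos code: the field names (`last_indexed_at`, `area_ids`), the area ids and the helpers `next_batch` and `to_document` are all hypothetical. It shows the two ideas from the slide: never-indexed listings go first, and each document carries every level of its location hierarchy in one multivalue field.

```python
from datetime import datetime

# Hypothetical area hierarchy: area id -> ids of its ancestors, root first.
AREA_ANCESTORS = {
    1: [],         # Greece
    2: [1],        # Attica
    3: [1, 2],     # Piraeus
    4: [1, 2, 3],  # Port of Piraeus
}


def next_batch(listings, batch_size):
    """Pick listings to (re)index: never-indexed first, then least recently indexed."""
    def sort_key(listing):
        last = listing["last_indexed_at"]
        # False sorts before True, so listings with no timestamp come first.
        return (last is not None, last or datetime.min)
    return sorted(listings, key=sort_key)[:batch_size]


def to_document(listing):
    """Flatten a listing into a document with a multivalue location field."""
    return {
        "id": listing["id"],
        "price": listing["price"],
        "area_ids": AREA_ANCESTORS[listing["area_id"]] + [listing["area_id"]],
    }


listings = [
    {"id": 10, "area_id": 4, "price": 120000, "last_indexed_at": datetime(2012, 9, 20)},
    {"id": 11, "area_id": 3, "price": 95000, "last_indexed_at": None},
    {"id": 12, "area_id": 2, "price": 80000, "last_indexed_at": datetime(2012, 9, 1)},
]
batch = next_batch(listings, batch_size=2)
print([to_document(listing) for listing in batch])
```

Because listing 10 carries the ids of Greece, Attica, Piraeus and the Port of Piraeus, a filter on any one of those ids matches it without walking a tree at query time.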
  • 16. Index Integrity
    Making sure our index is consistent with the DB.
      - Scrutineer ( https://github.com/Aconex/scrutineer )
          - Compares the DB and the ElasticSearch index for mismatches:
              - exists in ES but not in the DB (or vice versa)
              - ES version not up to date
          - Relies on the “_version” field, which is incremented via our ORM onChange
          - When indexing we explicitly set versioning to “external”
          - Had to “hack” it as it doesn't work with the EC2 discovery module
          - http://labs.spitogatos.gr/?p=45
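The kind of check described above boils down to comparing (id, version) pairs from both sides. The following is a minimal in-memory sketch of that comparison, not Scrutineer's implementation; `find_mismatches` and the sample data are hypothetical.

```python
def find_mismatches(db_versions, es_versions):
    """Compare {listing_id: version} maps taken from the DB and from the index."""
    db_ids, es_ids = set(db_versions), set(es_versions)
    return {
        # In the DB but never indexed (or dropped from the index).
        "missing_in_es": sorted(db_ids - es_ids),
        # Still in the index although the DB row is gone.
        "missing_in_db": sorted(es_ids - db_ids),
        # Indexed, but with an older version than the DB holds.
        "stale_in_es": sorted(
            i for i in db_ids & es_ids if es_versions[i] < db_versions[i]
        ),
    }


db = {1: 3, 2: 1, 3: 7}
es = {1: 3, 3: 5, 4: 2}
print(find_mismatches(db, es))
```

Using the ORM-maintained version number as the external version is what makes the "stale" check meaningful: both sides count changes to the same record.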
  • 17. Search - Shards & Routing
    How does ElasticSearch decide in which shard to store a doc?
      - By default this is based on a hash of the document id
      - Can be overridden while indexing and while searching (routing parameter)
      - We route based on a hash of the area id
          - Most users search for listings within a specific area
          - We hit only a single shard for a large percentage of the searches
    Diagrams (not reproduced): no routing vs. routing by a specific areaId.
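The effect of routing can be sketched as "hash the routing value, take it modulo the number of shards". In the sketch below `zlib.crc32` is only a stand-in for ElasticSearch's own hash function, and `shard_for` and `shards_to_search` are hypothetical helper names.

```python
import zlib

NUM_SHARDS = 4


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Map a routing value to a shard. crc32 is a stand-in, not ElasticSearch's hash."""
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_shards


def shards_to_search(area_id=None, num_shards=NUM_SHARDS):
    """Without a routing value every shard is queried; with one, a single shard."""
    if area_id is None:
        return list(range(num_shards))
    return [shard_for(area_id, num_shards)]


listings = [
    {"id": 101, "area_id": 7},
    {"id": 102, "area_id": 7},
    {"id": 103, "area_id": 9},
]
# Default: the document id decides the shard, so one area may span several shards.
by_doc_id = {l["id"]: shard_for(l["id"]) for l in listings}
# Routing by area id: every listing of an area lands on the same shard.
by_area = {l["id"]: shard_for(l["area_id"]) for l in listings}
print(by_doc_id, by_area, shards_to_search(area_id=7))
```

The trade-off is that the same routing value has to be supplied at both index and search time, and that popular areas can make one shard larger or busier than the others.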
  • 18. Search - Flat Schema, Facets & Scoring
    We rely a lot on ElasticSearch's flat schema, facets & scoring.
      - No joins due to the flat schema => fast!
      - Multivalue fields => fast filtering for listings in areas of various hierarchy levels
      - Facets functionality returns the list of paying agents with # of listings matching the criteria
      - Old slow ranking algorithm replaced by ElasticSearch scoring functionality
          - We used to go through our DB and refresh the score
          - Ad age is part of the equation
          - Now ES computes this dynamically on every search
      - We use custom scoring
          - We can modify the scoring algorithm and see changes instantly
          - No need to recalculate scores for all listings
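The benefit of computing scores at query time can be illustrated without ElasticSearch at all. The formula, weights and field names below are invented for illustration and are not the Spitogatos ranking; the point is only that a score derived from ad age at search time needs no stored value to refresh, and that changing a weight changes the ranking immediately.

```python
from datetime import date


def score(listing, today, weights):
    """Hypothetical query-time score: photo count plus a freshness term that decays with age."""
    age_days = (today - listing["published_on"]).days
    freshness = 1.0 / (1.0 + age_days / weights["age_scale_days"])
    photos = min(listing["num_photos"], 10)
    return weights["photos"] * photos + weights["freshness"] * freshness


def rank(listings, today, weights):
    """Order listing ids by score, best first. Nothing is stored or precomputed."""
    ordered = sorted(listings, key=lambda l: score(l, today, weights), reverse=True)
    return [l["id"] for l in ordered]


listings = [
    {"id": "fresh", "published_on": date(2012, 9, 20), "num_photos": 2},
    {"id": "photo_rich", "published_on": date(2012, 6, 1), "num_photos": 10},
]
today = date(2012, 9, 27)
favor_fresh = {"photos": 0.01, "freshness": 1.0, "age_scale_days": 30}
favor_photos = {"photos": 0.1, "freshness": 0.5, "age_scale_days": 30}
print(rank(listings, today, favor_fresh), rank(listings, today, favor_photos))
```

With a stored score, the second ranking would only appear after a batch job had rewritten every listing; here it is just a different set of parameters on the next search.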
  • 19. Monitoring
    Sematext SPM offers a (currently free) ES monitoring solution:
      - Cluster health
      - Index stats
      - Shard stats
      - Search rate & latency
      - Cache
      - CPU & RAM
      - Disk
      - Network
      - JVM & GC
  • 20. Tooling
    ElasticSearch-Head is a GUI for browsing and interacting with a cluster.
  • 21. Backups
    We take periodic copies from the Gateway, because the Gateway is no cure for accidental deletions or bugs. The backup script:
      - Uses s3cmd to sync the S3 gateway contents to a local folder
          - Expect some errors here as files get deleted/modified
      - Disables snapshots to the gateway
      - Syncs again (no errors this time, and much faster)
      - Re-enables snapshots to the gateway
      - Zips the local folder contents, splits them into smaller files & uploads them to a secondary S3 bucket
    Get the script here: http://labs.spitogatos.gr/?p=17
  • 22. Learnings
    Issues & lessons learned:
      - Faceted search can return wrong (smaller) counts when run across multiple shards
          - Due to the way sorting/merging is done
          - Increase the facet size field depending on the cardinality of the faceted field
      - We use Elastica, a PHP client for ElasticSearch - https://github.com/ruflin/Elastica
          - Lacking document routing and version type support
          - Our own Jerry Manolarakis is working on a pull request to add setRouting, setVersionType
      - Filters vs. queries (Query DSL)
          - Filters perform an order of magnitude better than plain queries since no scoring is performed and they are automatically cached
      - Do it! Your DB will thank you
    Charts (not reproduced): CPU utilization, response time pattern.
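The facet undercount mentioned above is easy to reproduce on paper. The simulation below assumes a simplified model in which each shard returns only its own top `size` terms and the coordinating node sums what it receives; the agent names and counts are made up.

```python
from collections import Counter


def shard_top_terms(shard_counts, size):
    """What a single shard reports: only its own top `size` terms."""
    return dict(Counter(shard_counts).most_common(size))


def merged_facet(shards, size):
    """Sum the per-shard top lists and keep the overall top `size` terms."""
    total = Counter()
    for shard_counts in shards:
        total.update(shard_top_terms(shard_counts, size))
    return total.most_common(size)


# Listings per agent on each of two shards (hypothetical data).
shards = [
    {"agent_a": 10, "agent_b": 8, "agent_c": 7},
    {"agent_c": 9, "agent_d": 8, "agent_a": 2},
]
print(merged_facet(shards, size=2))      # undercounts agent_a and agent_c
print(merged_facet(shards, size=3)[:2])  # larger size, truncated by the caller
```

In this model asking for two terms reports agent_a with 10 and agent_c with 9, while the true totals are 12 and 16. Requesting a larger size and truncating on the client side recovers the right order and counts here, at the cost of more data per shard response.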
  • 23. Read more
    Useful resources:
      - https://speakerdeck.com/u/jmikola/p/symfony-live-london-elasticsearch
      - http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/
      - http://www.slideshare.net/elasticsearch/elasticsearch-at-berlinbuzzwords-2010
      - http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext
    Need help integrating ElasticSearch into your app? http://bacterials.net/
    Follow us on Twitter: @spitogatosLabs
    Check out our blog: http://labs.spitogatos.gr