SCALING THE DOCUMENT REPOSITORY WITH ELASTICSEARCH
SOME CONTEXT
What we Do and What Problems We Try to Solve
NUXEO
Nuxeo
we provide a Platform that developers can use to build highly
customized Content Applications
we provide components, and the tools to assemble them
everything we do is open source (for real)
various customers - various use cases
me: developer & CTO - joined the Nuxeo project 10+ years ago
Track game builds
Electronic Flight Bags
Central repository for Models
Food industry PLM
https://github.com/nuxeo
DOCUMENT REPOSITORY
Store Documents / Assets / Objects
Blob objects
Complex data Structures
Hierarchy, references and links
Audit trail & Versioning
Data level security & encryption
Lifecycle, workflows ...
API (REST, CMIS, Java, JS...)
CRUD
Search
Service API
Heavily configurable: all data structures are flexible / customizable
Used by developers to build
Content Applications on top of
the Nuxeo Repository
OUR CHALLENGES
CRUD on large repository works
inject at 6,000 docs/s up to 1 Billion
not so many companies have that many documents anyway
Queries are the main scalability issue
impact of create / update / delete (C_UD) vs search
multi-criteria queries + full-text
security filtering
configurable data structures
user defined queries
UI heavily depends on search
Search API is the most used:
search is the main scalability challenge
HISTORY: NUXEO & LUCENE
2006: Nuxeo CPS 3.6
(Python / Zope based)
Replace built-in index with
Lucene + XML-RPC server
PyLucene
(GCJ build + Python bindings!)
Complex setup
2007: Nuxeo Platform 5.1
JCR: query (and backup) issues
Integrate Compass Core
transactional & storage abstraction
Missing sync & concurrency issues
2009: Nuxeo 5.2
VCS: home-grown SQL-based repository
Search in the database, but with some real limitations
2013 / 2014: Nuxeo 5.9.3
Reintroduce Lucene in the stack via elasticsearch
Learn from our past mistakes
Leverage elasticsearch architecture
easy deployment
safe indexing
powerful search
... we are now happy with Elasticsearch
Lucene and Nuxeo have a long story ...
REPOSITORY & SEARCH
Understanding the Issue
COMPLEX SQL QUERIES
Configurable Data Structure
+ User defined multi-criteria searches
=> multiple & complex SQL queries
SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE
("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
AND ((TO_TSQUERY('english', 'sydney') @@NX_TO_TSVECTOR("fulltext"."fulltext")))
AND ("hierarchy"."isversion" IS NULL)
AND ("_F1"."lifecyclestate" <> 'deleted')
AND ("_F2"."created" IS NOT NULL )
ORDER BY "_F2"."created" DESC
LIMIT 201 OFFSET 0;
ABOUT SQL LIMITATIONS
Scaling queries is complex
depends on indexes, I/O speed and available memory
cannot satisfy all types of queries
poor performance on unselective multi-criteria queries
some types of queries simply cannot be fast in SQL
Scalability
Scale up is expensive
Scale out is complex at best (XA & MVCC)
Sharding requires a global index
Fulltext support is usually poor
limitations on features & impact on performance
SQL technology is not the solution
IS NOSQL THE SOLUTION!?
USING NOSQL FOR THE REPOSITORY
ABOUT THE NOSQL OPTION
(sadly) NoSQL is no magic
it does work very well for CRUD and it scales easily, but
query options are limited and performance is not that good
multi-document transactions are usually not safe
better suited to DBs with billions of entries and simple queries
SQL has some real advantages
ACID (and MVCC) is good
Workflows and bulk updates are a typical use case
(even transient) lack of consistency is complex to explain to users
lots of existing tools (BI & reporting), lots of existing skills (DBA)
PGSQL (or AWS RDS) can be very cost effective
Neither a SQL nor a NoSQL repository is, by itself, the solution
KEEP THE REPOSITORY
SQL OR NOSQL
BUT
FIND A SUPER FAST INDEX ENGINE
REPOSITORY & ELASTICSEARCH
Toward a Hybrid Storage
HYBRID STORAGE
Use each storage solution for what it does best
SQL DB
store content in an ACID way
store & retrieve
queries that need ACID and MVCC
elasticsearch
provide powerful and scalable queries
do the heavy lifting that the RDBMS cannot do
scoring, native full-text, aggregates
distributed search
Route the query to the correct index depending
on requirements
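The routing rule on this slide can be sketched as follows. This is a minimal illustration in Python, not the nuxeo-elasticsearch implementation: `QueryRequest`, `choose_backend` and the two backend names are hypothetical, and the only rule encoded is the one stated above (queries that need ACID and MVCC go to the repository, everything else goes to elasticsearch).

```python
from dataclasses import dataclass


@dataclass
class QueryRequest:
    """Hypothetical description of a query and what the caller needs from it."""
    nxql: str
    # True when the caller must see its own uncommitted changes (ACID / MVCC).
    needs_transactional_view: bool = False


def choose_backend(request: QueryRequest) -> str:
    """Pick a backend name for the request (the names are assumptions)."""
    if request.needs_transactional_view:
        # Only the SQL repository can answer from inside the transaction.
        return "repository"
    # Default: let elasticsearch do the heavy lifting.
    return "elasticsearch"


listing = QueryRequest("SELECT * FROM Document WHERE dc:title = 'foo'")
in_transaction = QueryRequest("SELECT * FROM Document", needs_transactional_view=True)
print(choose_backend(listing))
print(choose_backend(in_transaction))
```

Keeping the decision in one small function means the caller still writes a single query and does not need to know which backend answers it.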
ELASTICSEARCH & REPOSITORY
One query
Several possible backends
PERFORMANCE RESULTS
Fast indexing
No ACID constraints / No impedance issue
3,500 documents/s when using SQL backend
10,000 documents/s when using MongoDB
Super query performance
query on term using inverted index
very efficient caching
native full text support & distributed architecture
3,000 queries/s with 1 elasticsearch node
6,000 queries/s with 2 elasticsearch nodes
SOME REAL LIFE FEEDBACK
Customer: "We are now testing the Nuxeo 6 stack in AWS. The DB is PostgreSQL on a db.r3.8xlarge, which has 32 CPUs. Between 350 and 400 tps the DB CPU is maxed out."
Nuxeo support: "Please activate nuxeo-elasticsearch!"
Customer: "We are now able to do about 1200 tps with almost 0 DB activity. Question though, Nuxeo and ES do not seem to be maxed out?"
Nuxeo support: "It looks like you have some network congestion between your client and the servers."
Customer: "...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ..."
SQL VS ELASTICSEARCH
Scalability is simply of
another order of magnitude
SCALE OUT
UNIFIED INDEX ON SHARDED REPOSITORY
Tested with 10 PgSQL databases
10 x 100 Million documents => 1 Billion documents
1 elasticsearch cluster
IS THIS MAGIC?
For users
it really looks like magic
For sales guys & solution architects
it is magic: it unleashes a lot of possibilities
performance is just one aspect
For Nuxeo Core Dev team
it was almost magic: some integration work was needed
INTEGRATING ELASTICSEARCH
Inside nuxeo-elasticsearch Plugin
CHALLENGES TO ADDRESS
Keep index in sync with the repository
No transaction management
Do not lose anything
Without support for update
Mitigate the effects of eventual consistency
Avoid displaying transient inconsistent state
Handle security filtering
Without join
Without post-filtering
SECURITY FILTERING
Constraints
Filtering must be done at index level: no post-filtering
Join is not an option
cannot join with the DB or within Lucene (previously tested without success)
Solution
index the ReadACL as part of the JSON Document
list of groups / users who can read the resource
automatically add a filter clause on ACL
Consequences
Recursive indexing is needed
More pressure on the re-indexing process
as a last resort: the Document security is checked by the repository anyway
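The solution above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's code: the field name `ecm:acl` and both helper functions are assumptions, and the filter uses the `bool` / `filter` / `terms` query syntax of Elasticsearch 2.0 and later (the versions current at the time of this talk used a different filter syntax).

```python
from typing import Any, Dict, List

ACL_FIELD = "ecm:acl"  # assumed name of the indexed ReadACL field


def to_index_document(properties: Dict[str, Any], read_acl: List[str]) -> Dict[str, Any]:
    """Return the JSON document to index, with the ReadACL stored inside it."""
    doc = dict(properties)
    doc[ACL_FIELD] = list(read_acl)  # groups / users who can read the resource
    return doc


def with_acl_filter(query: Dict[str, Any], principals: List[str]) -> Dict[str, Any]:
    """Wrap a user query so only documents readable by the principals match."""
    return {
        "bool": {
            "must": [query],
            # A document matches if its ACL contains at least one principal.
            "filter": [{"terms": {ACL_FIELD: list(principals)}}],
        }
    }


doc = to_index_document({"dc:title": "Sydney report"}, ["alice", "members"])
secured = with_acl_filter({"match": {"dc:title": "sydney"}}, ["bob", "members"])
print(doc)
print(secured)
```

Because the ACL is denormalized into every document, a permission change on a folder has to be propagated to all of its descendants, which is why recursive re-indexing is listed as a consequence.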
SAFE INDEXING FLOW
Do not try to make it transactional
Collect and de-duplicate Repository Events during Transaction
Wait for commit to be done at the repository level
then call elasticsearch
Do not lose any update
run Indexing Tasks in a distributed Job infrastructure
Jobs should be persisted
Jobs should be retried
Jobs should be monitored
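The "collect, de-duplicate, wait for commit, then index" flow can be sketched as below. This is a simplified Python illustration under stated assumptions: `IndexingCollector`, its method names and the `submit_job` callback are hypothetical, and the persisted, retried and monitored job infrastructure is reduced to a plain callback.

```python
from typing import Callable, Dict, List, Tuple


class IndexingCollector:
    """Collects indexing commands during a transaction (hypothetical sketch)."""

    def __init__(self, submit_job: Callable[[str, str], None]) -> None:
        self._submit_job = submit_job
        self._pending: Dict[str, str] = {}  # doc id -> latest command

    def on_repository_event(self, doc_id: str, command: str) -> None:
        # De-duplicate: several events on one document yield a single job.
        self._pending[doc_id] = command

    def after_commit(self) -> None:
        # Only now, once the repository has committed, hand work to the jobs.
        pending, self._pending = self._pending, {}
        for doc_id, command in pending.items():
            self._submit_job(doc_id, command)

    def after_rollback(self) -> None:
        # Nothing was committed, so nothing must reach the index.
        self._pending.clear()


jobs: List[Tuple[str, str]] = []
collector = IndexingCollector(lambda doc_id, command: jobs.append((doc_id, command)))
collector.on_repository_event("doc-1", "index")
collector.on_repository_event("doc-2", "index")
collector.on_repository_event("doc-1", "index")
collector.after_commit()
print(jobs)
```

Nothing is sent to elasticsearch before the commit, so a rolled-back transaction never leaves a trace in the index.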
ASYNC INDEXING FLOW
MITIGATE EVENTUAL CONSISTENCY
In the code:
use case: need to see results from within the transaction
query directly on the repository
leverage ACID and MVCC of SQL repository
full-text search and facets are usually not needed by the code
For the users:
use case: see changes in listings in "real time"
use pseudo-real time indexing
indexing actions triggered by UI threads are flagged
run as afterCompletion listener
refresh elasticsearch index
PSEUDO-SYNC INDEXING FLOW
DOES THIS WORK?
Live for about 18 months now
No missing sync issue
some customers asked for verification tools
but no problem was found
re-index in bulk mode is very fast anyway
No consistency issues
good usage of hybrid query engines
elasticsearch helped address several scaling challenges
but elasticsearch brings us much more than just scalability
BONUS FROM ELASTICSEARCH
More than Raw Speed
LEVERAGE AGGREGATES
Leverage elasticsearch aggregates
integrate with the Query system (PageProvider)
integrate with the Listing / UI model (ContentView)
Makes it easy to build and configure faceted search
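As an illustration of what a faceted search request looks like on the elasticsearch side, here is a minimal Python sketch that builds a search body with one `terms` aggregation per facet. The helper name, the `all_field` full-text field and the `by_` prefix are assumptions made for the example; how PageProvider and ContentView actually generate their requests is not shown in the slides.

```python
import json
from typing import Any, Dict, List


def faceted_search_body(text: str, facet_fields: List[str],
                        fulltext_field: str = "all_field",
                        page_size: int = 20) -> Dict[str, Any]:
    """Build a search body with one terms aggregation per facet field."""
    aggs = {
        f"by_{field}": {"terms": {"field": field, "size": 10}}
        for field in facet_fields
    }
    return {
        "size": page_size,
        "query": {"match": {fulltext_field: text}},
        "aggs": aggs,
    }


body = faceted_search_body("sydney", ["dc:creator", "ecm:primaryType"])
print(json.dumps(body, indent=2))
```

Each aggregation returns the most frequent values of its field among the matching documents, which is the data a facet widget needs to display counts next to each filter value.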
ADVANCED INDEXING
Fine tuning of elasticsearch indexing
multi language support using multiple analyzers and copy_to
compound fields created using Groovy scripts
Introduce elasticsearch hints into NXQL
select a specific elasticsearch index / analyzer
leverage elasticsearch operators
do geolocation search
-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'
-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'
-- Use ES for a geo query: geo_hash_cell matches locations near a point using a geohash
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell)*/ osm:location IN ('40','-74','5')
leverage what comes for free with elasticsearch
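The multi-language setup mentioned on this slide (multiple analyzers plus `copy_to`) could look roughly like the mapping built below. This is a sketch under assumptions: the helper name and the generated field names such as `dc:title_en` are hypothetical, `english` and `french` are built-in Elasticsearch language analyzers, and the `text` field type is the one used by Elasticsearch 5 and later (older versions used `string`).

```python
import json
from typing import Any, Dict


def multilingual_mapping(source_field: str, analyzers: Dict[str, str]) -> Dict[str, Any]:
    """Build a mapping that copies one field into one analyzed field per language."""
    targets = {lang: f"{source_field}_{lang}" for lang in sorted(analyzers)}
    properties: Dict[str, Any] = {
        # The source field feeds every language-specific field via copy_to.
        source_field: {"type": "text", "copy_to": list(targets.values())},
    }
    for lang, target in targets.items():
        properties[target] = {"type": "text", "analyzer": analyzers[lang]}
    return {"properties": properties}


mapping = multilingual_mapping("dc:title", {"en": "english", "fr": "french"})
print(json.dumps(mapping, indent=2))
```

A query can then target the field whose analyzer matches the user's language, which is the kind of choice the `INDEX(...)` hint shown above makes explicit.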
INDEX AUDIT TRAIL WITH ELASTICSEARCH
Use elasticsearch to store & index Audit trail
all events are serialized in JSON and stored inside elasticsearch
Unleash Audit system power
can store a lot of events
can store and query arbitrary JSON structure
ELASTICSEARCH PASS-THROUGH
Expose an HTTP pass-through API on top of Nuxeo integration
Integrate Authentication & Authorization
not all users can access workflow index
Integrate Security Filtering
activate data level security filtering
Expose "virtual index" via HTTP
index + filter
Use elasticsearch API related components on Nuxeo data
Documents + Audit log
With embedded security
Easy real time data analytics on business data
DATA ANALYTICS WITH ELASTICSEARCH
Queries on Documents + Audit: flexible reporting on workflows
READ DOCUMENTS FROM ELASTICSEARCH
Full JSON Document is stored in elasticsearch
required to be able to do fast re-indexing
We can retrieve Documents from elasticsearch
execute full search & retrieve without touching the DB
By controlling indexing we can use the elasticsearch index
as a persistent cache on top of the repository
as a staging area for queries
_source
NEXT STEPS
Leveraging Even More elasticsearch
Leverage elasticsearch percolator
push updates to the nuxeo-drive clients
notify users about saved searches
automatic categorization
Search result highlighting
not sure why it is still not there ...
Plug automatic denormalization
ANY QUESTIONS?
Thank You!
https://github.com/nuxeo
http://www.nuxeo.com/careers/

Scaling the Content Repository with Elasticsearch

  • 1. SCALINGSCALING THE DOCUMENT REPOSITORYTHE DOCUMENT REPOSITORY WITH ELASTICSEARCHWITH ELASTICSEARCH
  • 2. SOME CONTEXTSOME CONTEXT What we Do and What Problems We Try to Solve
  • 3. NUXEONUXEO Nuxeo ​we provide a Platform that developers can use to build highly customized Content Applications we provide components, and the tools to assemble them everything we do is open source (for real) various customers - various use cases me: developer & CTO - joined the Nuxeo project 10+ years ago Track game builds Electronic Flight Bags Central repository for Models Food industry PLM https://github.com/nuxeo
  • 4. DOCUMENT REPOSITORYDOCUMENT REPOSITORY Store Documents / Assets / Objects Blob objects Complex data Structures Hierarchy, references and links Audit trail & Versioning Data level security & encryption Lifecycle, workflows ... API (REST, CMIS, Java, JS...) CRUD Search Service API Heavily configurable : all data structures are flexible / customizable Used by developers to build Content Applications on top of the Nuxeo Repository
  • 5. OUR CHALLENGESOUR CHALLENGES CRUD on large repository works inject at 6,000 docs/s up to 1 Billion not so many companies have that many documents anyway Queries are the main scalability issue impact of c_ud vs search ​multi-criteria queries + full-text security filtering configurable data structures user defined queries UI heavily depends on search Search API is the most used: search is the main scalability challenge
  • 6. HISTORY : NUXEO & LUCENEHISTORY : NUXEO & LUCENE 2006: Nuxeo CPS 3.6 (Python / Zope based) Replace built-in index with lucene + XML-RPC server pyLucene (GCJ build+ python bindings!) Complex setup 2007: Nuxeo Platform 5.1 JCR : queries (and backup) issues Integrate Compass Core transactionnal & storage abstraction Missing sync & concurrency issues 2009: Nuxeo 5.2 VCS : Homebrew SQL based repository Search in database but some real limitations 2013 / 2014: Nuxeo 5.9.3 Reintroduce Lucene in the stack via elasticsearch Learn from our past mistakes Leverage elasticsearch architecture ​easy deployment safe indexing powerful search ... we are now happy with Elasticsearch Lucene and Nuxeo have a long story ...
  • 7. REPOSITORY & SEARCH Understanding the issue
  • 8. REPOSITORY & SEARCH The Search API is the most used: search is the main scalability challenge.
  • 9. COMPLEX SQL QUERIES Configurable data structures + user-defined multi-criteria searches => multiple & complex SQL queries, for example:
      SELECT "hierarchy"."id" AS "_C1"
      FROM "hierarchy"
      JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
      LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
      LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
      WHERE ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
        AND ((TO_TSQUERY('english', 'sydney') @@ NX_TO_TSVECTOR("fulltext"."fulltext")))
        AND ("hierarchy"."isversion" IS NULL)
        AND ("_F1"."lifecyclestate" <> 'deleted')
        AND ("_F2"."created" IS NOT NULL)
      ORDER BY "_F2"."created" DESC
      LIMIT 201 OFFSET 0;
  • 10. ABOUT SQL LIMITATIONS Scaling queries is complex: performance depends on indexes, I/O speed and available memory; no setup can satisfy all types of queries; performance is poor on unselective multi-criteria queries; some types of queries simply cannot be fast in SQL. Scalability: scaling up is expensive; scaling out is complex at best (XA & MVCC); sharding requires a global index. Full-text support is usually poor: limited features and an impact on performance. SQL technology is not the solution.
  • 11. IS NOSQL THE SOLUTION!?
  • 12. USING NOSQL FOR THE REPOSITORY
  • 13. ABOUT THE NOSQL OPTION (Sadly) NoSQL is no magic: it does work very well for CRUD and it scales easily, but query options are limited and performance is not that good; multi-document transactions are usually not safe; it is better suited to DBs with billions of entries and simple queries. SQL has some real advantages: ACID (and MVCC) is good; workflows and bulk updates are a typical use case; a lack of consistency (even transient) is complex to explain to users; lots of existing tools (BI & reporting) and lots of existing skills (DBA); PGSQL (or AWS RDS) can be very cost effective. The SQL or NoSQL repository is not the solution.
  • 14. KEEP THE REPOSITORY, SQL OR NOSQL, BUT FIND A SUPER FAST INDEX ENGINE
  • 15. REPOSITORY & ELASTICSEARCH Toward a Hybrid Storage
  • 16. HYBRID STORAGE Use each storage solution for what it does best. SQL DB: store content in an ACID way; serve the store & retrieve queries that need ACID and MVCC. Elasticsearch: provide powerful and scalable queries; do the heavy lifting that the RDBMS cannot do (scoring, native full-text, aggregates, distributed search). Route each query to the correct index depending on its requirements.
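
The routing idea on this slide can be illustrated with a minimal sketch. This is not Nuxeo's actual implementation: the `QueryRequirements` flags, the `Backend` enum and the `route` function are hypothetical names introduced only for illustration, and the rule "ACID visibility wins" is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    REPOSITORY = "repository"        # SQL (or NoSQL) repository: ACID + MVCC
    ELASTICSEARCH = "elasticsearch"  # index: full-text, scoring, aggregates


@dataclass(frozen=True)
class QueryRequirements:
    """Hypothetical description of what a query needs (not a Nuxeo class)."""
    needs_acid_view: bool = False    # must see uncommitted / just-committed data
    needs_fulltext: bool = False     # scoring or native full-text search
    needs_aggregates: bool = False   # facets / aggregates


def route(req: QueryRequirements,
          default: Backend = Backend.ELASTICSEARCH) -> Backend:
    """Pick a backend for one query, assuming ACID visibility wins over features."""
    if req.needs_acid_view:
        return Backend.REPOSITORY
    if req.needs_fulltext or req.needs_aggregates:
        return Backend.ELASTICSEARCH
    return default
```

With this rule, code running inside a transaction stays on the repository, while listings and full-text searches go to the index.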
  • 17. ELASTICSEARCH & REPOSITORY One query, several possible backends
  • 18. PERFORMANCE RESULTS Fast indexing: no ACID constraints / no impedance issue; 3,500 documents/s when using the SQL backend; 10,000 documents/s when using MongoDB. Super query performance: queries on terms use the inverted index; very efficient caching; native full-text support & distributed architecture; 3,000 queries/s with 1 Elasticsearch node; 6,000 queries/s with 2 Elasticsearch nodes.
  • 19. SOME REAL LIFE FEEDBACK Customer: "We are now testing the Nuxeo 6 stack in AWS. DB is Postgres SQL db.r3.8xlarge, which is a 32-CPU instance. Between 350 and 400 tps the DB CPU is maxed out." Nuxeo support: "Please activate nuxeo-elasticsearch!" Customer: "We are now able to do about 1200 tps with almost 0 DB activity. Question though, Nuxeo and ES do not seem to be maxed out?" Nuxeo support: "It looks like you have some network congestion between your client and the servers." Customer: "...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ..."
  • 20. SQL VS ELASTICSEARCH Scalability is simply of another order of magnitude
  • 22. UNIFIED INDEX ON SHARDED REPOSITORY Tested with 10 PgSQL databases: 10 x 100 million documents => 1 billion documents, 1 Elasticsearch cluster
  • 23. IS THIS MAGIC? For users, it really looks like magic. For sales guys & solution architects, it is magic: it unleashes a lot of possibilities, and performance is just one aspect. For the Nuxeo Core Dev team, it was almost magic: some integration work was needed.
  • 25. CHALLENGES TO ADDRESS Keep the index in sync with the repository: no transaction management; do not lose anything; without support for update. Mitigate the eventually consistent effect: avoid displaying a transient inconsistent state. Handle security filtering: without join; without post-filtering.
  • 26. SECURITY FILTERING Constraints: filtering must be done at index level (no post-filtering); join is not an option, as we cannot join with the DB or within Lucene (previously tested without success). Solution: index the ReadACL as part of the JSON document (the list of groups / users who can read the resource); automatically add a filter clause on the ACL. Consequences: recursive indexing is needed; more pressure to maintain re-indexing processing; as a last resort, the document security is checked by the repository anyway.
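
A minimal sketch of this approach, expressed as Elasticsearch query DSL built from Python dictionaries. The field names `ecm:uuid`, `dc:title` and `ecm:acl` and both helper functions are assumptions made for illustration, not a description of the exact Nuxeo mapping.

```python
def to_indexed_json(doc_id, title, read_acl):
    """Build the JSON document to index, carrying its ReadACL with it.

    read_acl is the list of users / groups allowed to read the document.
    Field names are illustrative assumptions.
    """
    return {"ecm:uuid": doc_id, "dc:title": title, "ecm:acl": sorted(read_acl)}


def with_acl_filter(query, user, groups):
    """Wrap any query so that only documents readable by the user can match.

    The ACL check is a `terms` filter inside a `bool` query, so it is applied
    at index level: no join and no post-filtering of results.
    """
    principals = sorted({user, *groups})
    return {
        "bool": {
            "must": [query],
            "filter": [{"terms": {"ecm:acl": principals}}],
        }
    }
```

Because the ACL is copied into every indexed document, changing the permissions on a folder means re-indexing its descendants, which is the recursive indexing cost mentioned on the slide.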
  • 27. SAFE INDEXING FLOW Do not try to make it transactional: collect and de-duplicate repository events during the transaction; wait for the commit to be done at the repository level, then call Elasticsearch. Do not lose any update: run indexing tasks in a distributed job infrastructure; jobs should be persisted; jobs should be retried; jobs should be monitored.
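
The collect, de-duplicate and post-commit steps can be sketched as follows. The `IndexingCollector` class, the event names and the `enqueue` callback are hypothetical; in a real deployment `enqueue` would hand the job to a persisted, retried and monitored job queue.

```python
class IndexingCollector:
    """Collects repository events for one transaction (illustrative sketch)."""

    def __init__(self, enqueue):
        self._enqueue = enqueue   # callback that persists one indexing job
        self._pending = {}        # doc_id -> "index" or "delete", insertion ordered

    def on_event(self, doc_id, event):
        """Record an event; several events on one document collapse into one job."""
        if event == "removed":
            self._pending[doc_id] = "delete"
        elif self._pending.get(doc_id) != "delete":
            # created / modified: a single "index" job is enough
            self._pending[doc_id] = "index"

    def after_commit(self):
        """Called once the repository commit succeeded: schedule the jobs."""
        jobs = [{"doc_id": d, "command": c} for d, c in self._pending.items()]
        self._pending.clear()
        for job in jobs:
            self._enqueue(job)
        return len(jobs)

    def after_rollback(self):
        """Nothing was committed, so nothing must reach the index."""
        self._pending.clear()
```

Nothing is sent to Elasticsearch before the commit, so a rolled-back transaction never leaks into the index, and a document touched ten times in one transaction is indexed once.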
  • 28. ASYNC INDEXING FLOW
  • 29. MITIGATE EVENTUALLY CONSISTENT In the code, use case: need to see results from within the transaction. Query directly on the repository; leverage ACID and MVCC of the SQL repository; full-text search and facets are usually not needed by the code. For the users, use case: see changes in listings in "real time". Use pseudo-real-time indexing: indexing actions triggered by UI threads are flagged; they run as an afterCompletion listener; the Elasticsearch index is refreshed.
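
A rough sketch of the pseudo-real-time path, assuming each indexing command carries a `sync` flag set when it was triggered by a UI thread. The function name and the three callbacks (`index_now`, `refresh`, `enqueue`) are hypothetical stand-ins for the real indexing service.

```python
def dispatch_indexing(commands, index_now, refresh, enqueue):
    """Run after transaction completion (illustrative, not the Nuxeo listener).

    Flagged commands are indexed immediately and followed by one index
    refresh, so the user's next listing shows the change. Everything else
    goes to the asynchronous job queue.
    """
    commands = list(commands)
    sync_cmds = [c for c in commands if c.get("sync")]
    async_cmds = [c for c in commands if not c.get("sync")]

    for cmd in sync_cmds:
        index_now(cmd)
    if sync_cmds:
        refresh()   # one refresh for the whole batch, not one per document

    for cmd in async_cmds:
        enqueue(cmd)
    return len(sync_cmds), len(async_cmds)
```

Refreshing only when a flagged command was present keeps bulk imports on the fully asynchronous path.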
  • 31. DOES THIS WORK? Live for about 18 months now. No missing-sync issue: some customers asked for verification tools, but no problem was found; re-indexing in bulk mode is very fast anyway. No consistency issues: good usage of the hybrid query engines. Elasticsearch helped address several scaling challenges, but Elasticsearch brings us much more than just scalability.
  • 32. BONUS FROM ELASTICSEARCH More than raw speed
  • 33. LEVERAGE AGGREGATES Leverage Elasticsearch aggregates: integrate with the query system (PageProvider); integrate with the listing / UI model (ContentView). This makes it easy to build and configure faceted search.
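
For illustration, a faceted search request can be expressed as one query plus one `terms` aggregation per facet. This sketch only builds the request body and reads the response; the helper names and the field names used in the test are assumptions, and the PageProvider / ContentView integration is not shown.

```python
def build_faceted_search(fulltext, facets, page_size=20, bucket_count=10):
    """Build a search body with one `terms` aggregation per facet.

    facets maps a facet name to the indexed field it is computed on.
    """
    return {
        "size": page_size,
        "query": {"simple_query_string": {"query": fulltext}},
        "aggs": {
            name: {"terms": {"field": field, "size": bucket_count}}
            for name, field in facets.items()
        },
    }


def facet_counts(response):
    """Turn the `aggregations` part of a search response into plain dicts."""
    return {
        name: {bucket["key"]: bucket["doc_count"] for bucket in agg["buckets"]}
        for name, agg in response.get("aggregations", {}).items()
    }
```

The same response carries both the page of hits and the counts per facet value, so a listing and its filter sidebar can be rendered from a single round trip.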
  • 34. ADVANCED INDEXING Leverage what comes for free with Elasticsearch. Fine tuning of Elasticsearch indexing: multi-language support using multiple analyzers and copy_to; compound fields created using Groovy scripts. Introduce Elasticsearch hints into NXQL: select a specific Elasticsearch index / analyzer; leverage Elasticsearch operators; do geolocation search.
      -- Use an explicit Elasticsearch field
      SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'
      -- Use ES operators not present in NXQL
      SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
      SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'
      -- Use ES for a geo query based on geo_hash_cell: location near a point using geohash
      SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell) */ osm:location IN ('40','-74','5')
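
As an illustration of the multi-language point, here is a sketch that generates a mapping where each text field gets one sub-field per language analyzer and is also copied into a catch-all field through `copy_to`. The helper, the `all_field` name and the typeless `text` mapping layout (as used by recent Elasticsearch versions) are assumptions; the Nuxeo mapping of the time may have looked different.

```python
def multilingual_text_mapping(fields, languages, catch_all="all_field"):
    """Build an index mapping with per-language sub-fields and a catch-all field.

    languages must name built-in Elasticsearch language analyzers,
    for example "english" or "french".
    """
    properties = {catch_all: {"type": "text"}}
    for field in fields:
        properties[field] = {
            "type": "text",
            "copy_to": catch_all,
            # one sub-field per language, e.g. dc:title.french
            "fields": {
                lang: {"type": "text", "analyzer": lang} for lang in languages
            },
        }
    return {"mappings": {"properties": properties}}
```

A query can then target `dc:title.french` for stemmed French matching, or the catch-all field for a broad full-text search.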
  • 35. INDEX AUDIT TRAIL WITH ELASTICSEARCH Use Elasticsearch to store & index the audit trail: all events are serialized in JSON and stored inside Elasticsearch. Unleash the audit system's power: it can store a lot of events; it can store and query arbitrary JSON structures.
  • 36. ELASTICSEARCH PASS-THROUGH Expose an HTTP pass-through API on top of the Nuxeo integration. Integrate authentication & authorization: not all users can access the workflow index. Integrate security filtering: activate data-level security filtering. Expose a "virtual index" via HTTP: index + filter. Use Elasticsearch API related components on Nuxeo data: documents + audit log, with embedded security. Easy real-time data analytics on business data.
  • 37. DATA ANALYTICS WITH ELASTICSEARCH Queries on documents + audit: flexible reporting on workflows
  • 38. READ DOCUMENTS FROM ELASTICSEARCH The full JSON document is stored in Elasticsearch (_source): this is required to be able to do fast re-indexing. We can retrieve documents from Elasticsearch: execute a full search & retrieve without touching the DB. By controlling indexing, we can use the Elasticsearch index as a persistent cache on top of the repository, or as a staging area for queries.
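
A small sketch of the "search & retrieve without touching the DB" idea: since each hit in a search response carries the stored JSON under `_source`, documents can be rebuilt from the response alone. The helper names and the field names in the test are illustrative assumptions.

```python
def search_body(query, fields=None, size=20):
    """Build a search body; `fields` optionally restricts what `_source` returns."""
    body = {"size": size, "query": query}
    if fields is not None:
        body["_source"] = list(fields)
    return body


def documents_from_response(response):
    """Rebuild documents from the hits alone, with no repository lookup."""
    return [
        {"_id": hit["_id"], **hit.get("_source", {})}
        for hit in response["hits"]["hits"]
    ]
```

This only stays correct as long as indexing is kept in sync with the repository, which is why the slide ties it to controlling indexing.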
  • 39. NEXT STEPS Leveraging even more Elasticsearch
  • 40. NEXT STEPS Leverage the Elasticsearch percolator: push updates to the nuxeo-drive clients; notify users about saved searches; automatic categorization. Search result highlighting: not sure why it is still not there ... Plug in automatic denormalization.
  • 41. ANY QUESTIONS? Thank you! https://github.com/nuxeo http://www.nuxeo.com/careers/