This talk explains how to leverage Elasticsearch to make your content repository scale to the sky while still relying on standard SQL-based technologies and ensuring data security and integrity. It discusses the design choices behind this hybrid Elasticsearch / PgSQL architecture and demonstrates the technical integration with Elasticsearch.
Watch the recorded webinar: http://www.nuxeo.com/resources/scaling-the-document-repository-with-elasticsearch/
3. NUXEO
Nuxeo
we provide a Platform that developers can use to build highly customized Content Applications
we provide components, and the tools to assemble them
everything we do is open source (for real)
various customers - various use cases
me: developer & CTO - joined the Nuxeo project 10+ years ago
Track game builds
Electronic Flight Bags
Central repository for Models
Food industry PLM
https://github.com/nuxeo
4. DOCUMENT REPOSITORY
Store Documents / Assets / Objects
Blob objects
Complex data Structures
Hierarchy, references and links
Audit trail & Versioning
Data level security & encryption
Lifecycle, workflows ...
API (REST, CMIS, Java, JS...)
CRUD
Search
Service API
Heavily configurable : all data structures are flexible / customizable
Used by developers to build Content Applications on top of the Nuxeo Repository
5. OUR CHALLENGES
CRUD on large repository works
inject at 6,000 docs/s up to 1 Billion
not so many companies have that many documents anyway
Queries are the main scalability issue
impact of C_UD (create, update, delete) vs search
multi-criteria queries + full-text
security filtering
configurable data structures
user defined queries
UI heavily depends on search
Search API is the most used:
search is the main scalability challenge
6. HISTORY : NUXEO & LUCENE
2006: Nuxeo CPS 3.6
(Python / Zope based)
Replace built-in index with
Lucene + XML-RPC server
PyLucene
(GCJ build + Python bindings!)
Complex setup
2007: Nuxeo Platform 5.1
JCR : queries (and backup) issues
Integrate Compass Core
transactional & storage abstraction
Missing sync & concurrency issues
2009: Nuxeo 5.2
VCS : Homebrew SQL based repository
Search in database but some real limitations
2013 / 2014: Nuxeo 5.9.3
Reintroduce Lucene in the stack via elasticsearch
Learn from our past mistakes
Leverage elasticsearch architecture
easy deployment
safe indexing
powerful search
... we are now happy with Elasticsearch
Lucene and Nuxeo have a long story ...
9. COMPLEX SQL QUERIES
Configurable Data Structure
+ User defined multi-criteria searches
=> multiple & complex SQL queries
SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE
("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
AND ((TO_TSQUERY('english', 'sydney') @@NX_TO_TSVECTOR("fulltext"."fulltext")))
AND ("hierarchy"."isversion" IS NULL)
AND ("_F1"."lifecyclestate" <> 'deleted')
AND ("_F2"."created" IS NOT NULL )
ORDER BY "_F2"."created" DESC
LIMIT 201 OFFSET 0;
10. ABOUT SQL LIMITATIONS
Scaling queries is complex
depends on indexes, I/O speed and available memory
cannot satisfy all types of queries
poor performance on unselective multi-criteria queries
some types of queries can simply not be fast in SQL
Scalability
Scale up is expensive
Scale out is complex at best (XA & MVCC)
Sharding requires a global index
Fulltext support is usually poor
limitations on features & impact on performance
SQL technology is not the solution
12. USING NOSQL FOR THE REPOSITORY
13. ABOUT THE NOSQL OPTION
(sadly) NoSQL is no magic
it does work very well for CRUD and it scales easily, but
query options are limited and performance is not that good
multi-document transactions are usually not safe
better suited to DBs with billions of entries and simple queries
SQL has some real advantages
ACID (and MVCC) is good
Workflows and bulk updates are a typical use case
(even transient) lack of consistency is complex to explain to users
lots of existing tools (BI & reporting), lots of existing skills (DBA)
PGSQL (or AWS RDS) can be very cost effective
Choosing a SQL or a NoSQL repository is not the solution
14. KEEP THE REPOSITORY
SQL OR NOSQL
BUT
FIND A SUPER FAST INDEX ENGINE
16. HYBRID STORAGE
Use each storage solution for what it does best
SQL DB
store content in an ACID way
store & retrieve
queries needing ACID and MVCC
elasticsearch
provide powerful and scalable queries
do the heavy lifting that the RDBMS can not do
scoring, native full-text, aggregates
distributed search
Route the query to the correct index depending on requirements
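The routing rule above can be sketched as follows. This is a minimal illustration, not the Nuxeo implementation: `QueryRequest`, `choose_backend` and the backend names are hypothetical, and the only rule encoded is the one stated in the slides (queries that need ACID / MVCC go to the SQL repository, everything else goes to elasticsearch).

```python
from dataclasses import dataclass


@dataclass
class QueryRequest:
    """Hypothetical description of what the caller needs from a query."""
    nxql: str
    # True when the caller must see changes made inside the current,
    # not yet committed transaction (needs ACID / MVCC semantics).
    needs_transactional_view: bool = False


def choose_backend(request: QueryRequest) -> str:
    """Return the name of the backend that should run the query."""
    if request.needs_transactional_view:
        # Only the SQL repository offers ACID and MVCC guarantees.
        return "repository"
    # Default: let elasticsearch do the heavy lifting
    # (scoring, full-text, aggregates, distributed search).
    return "elasticsearch"
```

A caller building a UI listing would leave the flag at its default, while code that has just created a document in the current transaction would set `needs_transactional_view=True`.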
18. PERFORMANCE RESULTS
Fast indexing
No ACID constraints / No impedance issue
3,500 documents/s when using SQL backend
10,000 documents/s when using MongoDB
Super query performance
query on term using inverted index
very efficient caching
native full text support & distributed architecture
3,000 queries/s with 1 elasticsearch node
6,000 queries/s with 2 elasticsearch nodes
19. SOME REAL LIFE FEEDBACK
Customer: “We are now testing the Nuxeo 6 stack in AWS. DB is Postgres SQL db.r3.8xlarge which is a 32 cpus. Between 350 and 400 tps the DB cpu is maxed out.”
Nuxeo support: “Please activate nuxeo-elasticsearch !”
Customer: “We are now able to do about 1200 tps with almost 0 DB activity. Question though, Nuxeo and ES do not seem to be maxed out ?”
Nuxeo support: “It looks like you have some network congestion between your client and the servers.”
Customer: “...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...”
20. SQL VS ELASTICSEARCH
Scalability is simply of another order of magnitude
22. UNIFIED INDEX ON SHARDED REPOSITORY
Tested with 10 PgSQL databases
10 x 100 Million documents => 1 Billion documents
1 elasticsearch cluster
23. IS THIS MAGIC?
For users
it really looks like magic
For sales guys & solution architects
it is magic: it unleashes a lot of possibilities
performance is just one aspect
For Nuxeo Core Dev team
it was almost magic: some integration work was needed
25. CHALLENGES TO ADDRESS
Keep index in sync with the repository
No transaction management
Do not lose anything
Without support for update
Mitigate the effects of eventual consistency
Avoid displaying transient inconsistent state
Handle security filtering
Without join
Without post-filtering
26. SECURITY FILTERING
Constraints
Filtering must be done at index level : no post filtering
Join is not an option
cannot join with the DB or within Lucene (previously tested without success)
Solution
index the ReadACL as part of the JSON Document
list of groups / users who can read the resource
automatically add a filter clause on ACL
Consequences
Recursive indexing is needed
More pressure on re-indexing processing
as a last resort: the Document security is checked by the repository anyway
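The solution above (index the ReadACL inside the JSON document, then add a filter clause automatically) can be sketched like this. It is an illustration under assumptions: the field name `ecm:acl`, the sample document and the helper `with_read_acl_filter` are hypothetical, and the query uses the `bool` / `filter` / `terms` form of the elasticsearch query DSL.

```python
from typing import Sequence

# Hypothetical indexed document: the ReadACL is flattened into the JSON
# as the list of users / groups allowed to read the resource.
indexed_document = {
    "dc:title": "Sydney opera house",
    "ecm:acl": ["alice", "marketing"],  # assumed field name
}


def with_read_acl_filter(query: dict, principals: Sequence[str],
                         acl_field: str = "ecm:acl") -> dict:
    """Wrap a user query so that only readable documents can match.

    principals is the current user name plus the groups the user belongs to.
    The filter is applied at index level, so no join and no post-filtering
    of the result page is needed.
    """
    return {
        "bool": {
            "must": [query],
            "filter": [{"terms": {acl_field: list(principals)}}],
        }
    }


secured = with_read_acl_filter({"match": {"dc:title": "sydney"}},
                               ["bob", "marketing"])
```

Because the ACL lives in every indexed document, a permission change on a folder means re-indexing the whole subtree, which is the recursive indexing cost listed under Consequences.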
27. SAFE INDEXING FLOW
Do not try to make it Transactional
Collect and de-duplicate Repository Events during Transaction
Wait for commit to be done at the repository level
then call elasticsearch
Do not lose any update
run Indexing Tasks in a distributed Job infrastructure
Jobs should be persisted
Jobs should be retried
Jobs should be monitored
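The flow above can be sketched as follows. This is a simplified, single-process illustration: `IndexingCollector` and `run_jobs` are hypothetical names, and a plain Python list stands in for the persisted, distributed job infrastructure that a real deployment would need.

```python
class IndexingCollector:
    """Collects repository events during a transaction, de-duplicated by document id."""

    def __init__(self, job_queue: list):
        self._pending = {}       # doc_id -> last operation seen in this transaction
        self._queue = job_queue  # stand-in for a persisted job queue

    def on_event(self, doc_id: str, operation: str) -> None:
        # A later event on the same document replaces the earlier one.
        self._pending[doc_id] = operation

    def after_commit(self) -> None:
        # Only once the repository commit is done do we schedule indexing.
        for doc_id, operation in self._pending.items():
            self._queue.append({"doc_id": doc_id, "operation": operation, "attempts": 0})
        self._pending.clear()

    def after_rollback(self) -> None:
        # Nothing was committed, so nothing must reach the index.
        self._pending.clear()


def run_jobs(job_queue: list, index_fn, max_attempts: int = 3) -> list:
    """Run queued jobs, retrying failures; return the jobs that were given up on."""
    failed = []
    while job_queue:
        job = job_queue.pop(0)
        try:
            index_fn(job["doc_id"], job["operation"])
        except Exception:
            job["attempts"] += 1
            if job["attempts"] < max_attempts:
                job_queue.append(job)  # retry later
            else:
                failed.append(job)     # surface to monitoring
    return failed
```

The list returned by `run_jobs` is where monitoring would hook in: a non-empty result means some updates have not reached the index yet.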
29. MITIGATE EVENTUALLY CONSISTENT
In the code :
use case : need to see results from within the transaction
query directly on the repository
leverage ACID and MVCC of SQL repository
full-text search and facets are usually not needed by the code
For the users :
use case : see changes in listings in "real time"
use pseudo-real time indexing
indexing actions triggered by UI threads are flagged
run as afterCompletion listener
refresh elasticsearch index
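The pseudo-real time indexing described above can be sketched as follows. Everything here is hypothetical and only meant to show the mechanism: `FakeIndex` is an in-memory stand-in that mimics the fact that indexed documents become searchable only after a refresh, and `after_completion` plays the role of the listener run once the transaction is finished.

```python
class FakeIndex:
    """In-memory stand-in for an index with near-real-time search semantics."""

    def __init__(self):
        self._buffer = {}      # indexed but not yet searchable
        self._searchable = {}  # visible to searches

    def index(self, doc_id, doc):
        self._buffer[doc_id] = doc

    def refresh(self):
        # Make everything indexed so far visible to searches.
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def is_searchable(self, doc_id):
        return doc_id in self._searchable


def after_completion(commands, index, async_queue):
    """Handle indexing commands once the transaction has completed.

    commands is a list of (doc_id, doc, sync) tuples; sync is the flag set
    on indexing actions triggered by UI threads.
    """
    did_sync_work = False
    for doc_id, doc, sync in commands:
        if sync:
            index.index(doc_id, doc)  # index right away
            did_sync_work = True
        else:
            async_queue.append((doc_id, doc))  # regular asynchronous path
    if did_sync_work:
        # Refresh so the user sees the change in the next listing.
        index.refresh()
```

Only flagged commands pay the cost of an immediate refresh; bulk or background work keeps using the asynchronous path.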
31. DOES THIS WORK ?
Live for about 18 months now
No missing sync issue
some customers asked for verification tools
but no problem was found
re-index in bulk mode is very fast anyway
No consistency issues
good usage of hybrid query engines
elasticsearch helped address several scaling challenges
but elasticsearch brings us much more than just scalability
33. LEVERAGE AGGREGATES
Leverage elasticsearch aggregates
integrate with the Query system (PageProvider)
integrate with the Listing / UI model (ContentView)
Makes it easy to build and configure faceted search
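As an illustration of how aggregates map to facets, here is a minimal sketch. It is not the PageProvider or ContentView code: the helper names and the field names used in the example are assumptions. It builds a search body with one `terms` aggregation per facet and turns the `aggregations` part of a response into facet counts.

```python
def faceted_search_body(query: dict, facet_fields, size: int = 10) -> dict:
    """Build a search request body with one terms aggregation per facet field."""
    return {
        "query": query,
        "size": size,
        "aggs": {field: {"terms": {"field": field}} for field in facet_fields},
    }


def facets_from_response(response: dict) -> dict:
    """Convert aggregation buckets into {facet: {value: document count}}."""
    return {
        name: {bucket["key"]: bucket["doc_count"] for bucket in agg["buckets"]}
        for name, agg in response.get("aggregations", {}).items()
    }


body = faceted_search_body({"match_all": {}}, ["dc:creator", "ecm:primaryType"])

# Hand-written sample in the shape of a terms aggregation response.
sample_response = {
    "aggregations": {
        "ecm:primaryType": {
            "buckets": [
                {"key": "File", "doc_count": 12},
                {"key": "Picture", "doc_count": 3},
            ]
        }
    }
}
facets = facets_from_response(sample_response)
```

The resulting dictionary is the kind of structure a listing UI needs to render facet values with their counts next to the result page.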
34. ADVANCED INDEXING
Fine tuning of elasticsearch indexing
multi language support using multiple analyzers and copy_to
compound fields created using groovy scripts
Introduce elasticsearch hints into NXQL
select a specific elasticsearch index / analyzer
leverage elasticsearch operators
do geolocation search
-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'
-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'
-- Use ES for a geo query: geo_hash_cell matches locations near a point using a geohash
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell)*/ osm:location IN ('40','-74','5')
leverage what comes for free with elasticsearch
35. INDEX AUDIT TRAIL WITH ELASTICSEARCH
Use elasticsearch to store & index Audit trail
all events are serialized in JSON and stored inside elasticsearch
Unleash Audit system power
can store a lot of events
can store and query arbitrary JSON structure
36. ELASTICSEARCH PASS-THROUGH
Expose an HTTP pass-through API on top of Nuxeo integration
Integrate Authentication & Authorization
not all users can access workflow index
Integrate Security Filtering
activate data level security filtering
Expose "virtual index" via http
index + filter
Use elasticsearch API related components on Nuxeo data
Documents + Audit log
With embedded security
Easy real time data analytics on business data
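The "virtual index" idea (index + filter, behind authentication and authorization) can be sketched as a request rewriter. This is an assumption-laden illustration, not the Nuxeo pass-through code: `secure_passthrough`, the `allowed_indexes` set and the `ecm:acl` field name are hypothetical, and applying the same ACL filter to every index is a simplification.

```python
def secure_passthrough(index: str, body: dict, principals, allowed_indexes,
                       acl_field: str = "ecm:acl") -> dict:
    """Check index access, then inject the data-level security filter.

    Returns a new request body; the incoming body is left untouched.
    """
    # Authorization: not all users can access every index.
    if index not in allowed_indexes:
        raise PermissionError(f"access to index '{index}' is not allowed")

    user_query = body.get("query", {"match_all": {}})
    secured = dict(body)  # keep size, aggs, sort, ... as sent by the client
    secured["query"] = {
        "bool": {
            "must": [user_query],
            # Security filtering: only documents readable by the caller.
            "filter": [{"terms": {acl_field: list(principals)}}],
        }
    }
    return secured
```

Since the client body is otherwise forwarded as is, components that speak the elasticsearch HTTP API can run aggregations on business data without being able to see documents the user cannot read.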
37. DATA ANALYTICS WITH ELASTICSEARCH
Queries on Documents + Audit: flexible reporting on workflows
38. READ DOCUMENTS FROM ELASTICSEARCH
Full JSONDocument is stored in elasticsearch
required to be able to do fast re-indexing
We can retrieve Documents from elasticsearch
execute full search & retrieve without touching the DB
By controlling indexing we can use the elasticsearch index
as a persistent cache on top of the repository
as a staging area for queries
documents are read back from the elasticsearch _source field
40. NEXT STEPS
Leverage elasticsearch percolator
push update on the nuxeo-drive clients
notify users about saved search
automatic categorization
Search result highlighting
not sure why it is still not there ...
Plug automatic denormalization
41. ANY QUESTIONS ?
Thank You !
https://github.com/nuxeo
http://www.nuxeo.com/careers/