Learning Lessons: Building a CMS on top of NoSQL technologies
1. Learning Lessons
Building a content repository on top of NoSQL Technologies
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
6. We Prefer Sophistication
» the challenge for us was to scale ...
without dropping features
10. The typical CMS ‘architecture’
even more cache
more cache
application cache
database (+opt. filesystem) (+ opt. full-text indexes)
11. The typical CMS ‘architecture’
client
even more cache
more cache
application cache
database (+opt. filesystem) (+ opt. full-text indexes)
12. The typical CMS ‘architecture’
client (+cache)
even more cache
more cache
application cache
database (+opt. filesystem) (+ opt. full-text indexes)
13. What we found hard to scale
» access control
» facet browsing
» all the nifty stuff people were using our
software for
» ... anything that required random access
to in-memory-cache data for computations
14. Beyond the ‘scaling’ problem
» three-prong data layer: MySQL, Lucene, fs
» result set merging (between MySQL & Lucene)
» happened in appcode/memory
» ‘transactions’, set operations = hard
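To make the merging problem concrete, here is a minimal sketch of combining a structured result set with full-text hits in application memory. The function name and the sample data are hypothetical; this is an illustration of the general pattern, not the original CMS code.

```python
def merge_results(sql_ids, fulltext_hits):
    """Keep full-text hits (doc_id, score) whose id also passed the SQL filter."""
    # The whole structured result set has to sit in application memory
    # before the two result sets can be intersected.
    allowed = set(sql_ids)
    return [(doc_id, score) for doc_id, score in fulltext_hits if doc_id in allowed]


# Hypothetical sample data: ids from a SQL query, ranked hits from a full-text index.
sql_ids = [1, 2, 3, 5, 8]
fulltext_hits = [(8, 0.9), (4, 0.7), (2, 0.4)]
merged = merge_results(sql_ids, fulltext_hits)
```

The memory needed grows with the size of the structured result set, which is one reason this approach gets harder as data volume increases.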
16. If we were able to add more nodes ...
» True Distribution: scalability, availability, performance
... in the line of fire
17. Solution 1
» do MORE inside the database
27. Enter The Cambrian Explosion
Cassandra
NoSQL
neo4j
28. Requirements, phase I
» automatic scaling to large data sets
» fault-tolerance: replication, automatic handling of failing nodes
» a flexible data model supporting sparse data
» runs on commodity hardware
» efficient random access to data
» open source, with the ability to participate in the development and thus
drive the direction of the project
» some preference for a Java-based solution
29. Requirements, phase II
» After careful consideration, we realized the
important choices were also:
» consistency: no chance of having two conflicting
versions of a row
» atomic updates of a single row, single-row
transactions
» bonus points for MapReduce integration
» e.g. full-text index rebuilding
30. That brought us to HBase, which bought us:
» a data model in which some column families
keep all versions and others do not, which
fits our CMS document model very well
» ordered tables with the ability to do range
scans on them, which allows us to build
scalable indexes on top of them
» HDFS, a convenient place to store large blobs
» Apache license and community, a familiar
environment for us
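The versioning point can be illustrated with a toy in-memory row. This is a sketch under stated assumptions: it does not use the HBase client API, and the class and the family names "versioned" and "nonversioned" are hypothetical. It only mimics the idea that the number of versions kept is configured per column family.

```python
from collections import defaultdict


class Row:
    """Toy row: each column family maps column -> list of (timestamp, value)."""

    def __init__(self, max_versions):
        # family name -> number of versions to keep (None keeps all of them)
        self.max_versions = max_versions
        self.cells = defaultdict(lambda: defaultdict(list))

    def put(self, family, column, timestamp, value):
        versions = self.cells[family][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first
        limit = self.max_versions[family]
        if limit is not None:
            del versions[limit:]  # drop everything older than the limit

    def get(self, family, column, all_versions=False):
        versions = self.cells[family][column]
        if all_versions:
            return [value for _, value in versions]
        return versions[0][1] if versions else None


row = Row({"versioned": None, "nonversioned": 1})
row.put("versioned", "title", 1, "Draft title")
row.put("versioned", "title", 2, "Final title")
row.put("nonversioned", "state", 1, "draft")
row.put("nonversioned", "state", 2, "published")
```

Here the document content keeps its full history, while bookkeeping data such as a workflow state keeps only the latest value.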
31. » OK, so now we had a data store !
32. » However, content repository =
store + search !
ouch
33. That was easy!
(however...)
34. Search ponderings
» CMS = two types of search
» structured search
» numbers, strings
» based on logic (SQL, anyone?)
» information retrieval (or: full-text search)
» text
» based on statistics
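The difference between the two search types can be sketched in a few lines. The documents, field names and scoring formula below are hypothetical and heavily simplified (a basic term-frequency times inverse-document-frequency score); real engines such as Lucene use more elaborate ranking.

```python
import math
from collections import Counter

# Hypothetical documents with one structured field and one text field.
docs = {
    1: {"year": 2009, "text": "hbase stores sparse data in column families"},
    2: {"year": 2010, "text": "solr adds facets and a schema on top of lucene"},
    3: {"year": 2010, "text": "lucene is a full text search library"},
}


def structured_search(min_year):
    """Logic-based: a document either satisfies the predicate or it does not."""
    return sorted(doc_id for doc_id, d in docs.items() if d["year"] >= min_year)


def fulltext_search(query):
    """Statistics-based: documents are ranked by a relevance score."""
    n = len(docs)
    tokens = {doc_id: Counter(d["text"].split()) for doc_id, d in docs.items()}
    scores = {}
    for term in query.split():
        df = sum(1 for counts in tokens.values() if term in counts)
        if df == 0:
            continue  # term occurs nowhere, contributes nothing
        idf = math.log(1 + n / df)  # rarer terms weigh more
        for doc_id, counts in tokens.items():
            if counts[term]:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

The structured query returns an unranked set of matches, while the full-text query returns an ordering in which the document matching both query terms comes first.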
35. Search ponderings
» All of that, at scale
36. Structured Search
» HBase Indexing Library
» idea from Google App Engine datastore indexes
» http://code.google.com/appengine/articles/index_building.html

content table              index table (rows ordered by rowkey)
rowkey  col   col          rowkey  col
A       val3  foo6         val2-B
B       val2  foo7         val3-A
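The index table idea can be sketched with a sorted list standing in for an ordered table. This is not the HBase Indexing Library API: the column names, the "-" separator handling and the helper functions are assumptions made for illustration, using the sample rows from the slide.

```python
import bisect

# Content table from the slide; the column names col1/col2 are hypothetical.
content = {
    "A": {"col1": "val3", "col2": "foo6"},
    "B": {"col1": "val2", "col2": "foo7"},
}

# Index row key = indexed value + "-" + content row key. Keeping the keys
# sorted mimics an ordered table.
index_keys = sorted(f"{row['col1']}-{key}" for key, row in content.items())


def range_scan(keys, start, stop):
    """Return all keys k with start <= k < stop, like a scan over an ordered table."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return keys[lo:hi]


def lookup(value):
    """Find content row keys whose indexed column equals value."""
    prefix = value + "-"
    # "." is the ASCII character right after "-", so value + "." is an
    # exclusive upper bound for every key starting with the prefix.
    hits = range_scan(index_keys, prefix, value + ".")
    return [key[len(prefix):] for key in hits]
```

An equality query becomes a short range scan over the index, and a range query such as `range_scan(index_keys, "val2-", "val4")` works the same way. A real index would need a more careful key encoding, since this sketch breaks if values contain the separator.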
37. Full-text / IR search
» Lucene?
» no sharding (for scale)
» no replication (for availability)
» batched index updates (not real-time)
38. Beyond Lucene
» Katta
» scalable architecture, however only search, no indexing
» Elastic Search
» very young (sorry)
» hbasene et al.
» stores inverted index in HBase, might not scale all features
» SOLR
» widely used, schema, facets, query syntax, cloud branch
More info: http://lilycms.org/lily/prerelease/technology.html
39. [logo] + [logo] = ?
Easy! Or ?
40. Remember distribution ?
Remember secondary indexes ?
➙ Need for reliable queuing
41. Connecting things
» we needed a reliable bridge between our
main storage (HBase) and our index/search
server(s) (SOLR)
» indexing, reindexing, mass reindexing (M/R)
» we need a reliable method of updating
HBase secondary indexes
» all of that eventually to run distributed
» distribution means coping with failure
42. Solution
» ACMEMessageQueue ? Bzzzzzt.
We wanted fault-safe HBase persistence for
the queues.
Also for ease of administration.
» ➙ WAL & Queue implemented on top of
HBase tables
43. WAL / Queue
» WAL
» guaranteed execution of synchronous actions
» call doesn’t return before secondary action finishes
» e.g. update secondary actions
» if all goes well, size = #concurrent ops
» Queue
» triggering of async actions
» e.g. (re)index (updated) record with SOLR back-end
» size depends on speed of back-end process
» will be useful/made available outside of Lily context as well!
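A rough in-memory sketch of how a WAL and a queue divide the work is given below. It is not Lily's implementation: the class, its fields and the recovery step are hypothetical, and plain dictionaries stand in for HBase tables.

```python
from collections import deque


class Repository:
    """Toy repository: the WAL guards synchronous work, the queue feeds async work."""

    def __init__(self):
        self.records = {}     # primary storage
        self.index = {}       # secondary index: value -> set of record ids
        self.wal = {}         # pending operations, keyed by operation id
        self.queue = deque()  # messages for the asynchronous indexer
        self._next_op = 0

    def put(self, record_id, value):
        op_id = self._next_op
        self._next_op += 1
        self.wal[op_id] = (record_id, value)  # 1. log the intent first
        self._apply(record_id, value)         # 2. do the work before returning
        del self.wal[op_id]                   # 3. finished: drop the WAL entry

    def _apply(self, record_id, value):
        old = self.records.get(record_id)
        if old is not None:
            self.index[old].discard(record_id)
        self.records[record_id] = value
        self.index.setdefault(value, set()).add(record_id)
        self.queue.append(record_id)          # trigger the async follow-up

    def recover(self):
        """Replay operations that were logged but never completed."""
        for op_id in sorted(self.wal):
            self._apply(*self.wal[op_id])
            del self.wal[op_id]

    def drain_queue(self, search_index):
        """Stand-in for the back-end process that (re)indexes updated records."""
        while self.queue:
            record_id = self.queue.popleft()
            search_index[record_id] = self.records[record_id]


repo = Repository()
repo.put("r1", "draft")
repo.put("r1", "published")
search_index = {}
pending_before = len(repo.queue)
repo.drain_queue(search_index)
```

In this sketch the WAL is empty whenever no call is in flight, while the queue length depends on how quickly `drain_queue` runs. Replaying a WAL entry may enqueue the same record twice, so the consumer has to tolerate repeated messages.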
44. The Sum
» Lily model (records & fields)
» mapped onto HBase (=storage)
» indexed and searchable through
SOLR
» using a WAL/Queue mechanism
implemented in HBase
» runtime based on Kauri
» with client/server comms via
Avro
47. Roadmap
» Today = release of learning material
(architecture, model, API, Javadoc)
➥ www.lilycms.org
➥ bit.ly/lilyprerelease
» Mid July = ‘proof of architecture’ release (Nearly there!)
» from there on, ca. 3-monthly releases
leading up to Lily 1.0