A walkthrough of Traackr's experience choosing a NoSQL solution and how we ended up switching from HBase to MongoDB. This deck goes through some in-depth technical aspects, like schema design and our use of secondary indexes.
Editor's Notes
While there were definitely people who mistook the rise of NoSQL for a complete replacement of RDBMS, there were equal misunderstandings in the RDBMS camp:
- Eventual consistency is not the only way to operate MongoDB: write-ahead journaling and commit acknowledgement (via the fsync and j options) are available just as in RDBMS systems
- One does not need to be a distributed search engine managing petabytes of data to use these types of tools (which is our point)
Taking a look at the amount of storage we are using as of a month ago in Mongo; this includes indexes
The point is that we don’t need to track the entire web: just the subset belonging to influencers!
There is a different perspective on “Web Scale” that has to do with the nature of the data on the web
Take the approach of using a simplified entity model…
…with semi-structured data storage formats like JSON:
- Facilitate capturing related attribute structures
- Enable the flexibility of defining new attributes as they are discovered
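The flexibility point above can be sketched in a few lines. This is a minimal illustration, not Traackr's actual schema: the field names are made up.

```javascript
// Hypothetical influencer record in a semi-structured (JSON) store.
// Related attributes nest naturally instead of spreading across join tables.
const influencer = {
  name: "Jane Doe",
  sites: [{ url: "janedoe.example.com" }]
};

// A newly discovered attribute is simply added to the document --
// no ALTER TABLE migration as in an RDBMS.
influencer.twitterHandle = "@janedoe";

console.log(JSON.stringify(influencer));
```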
Pre-web: we knew exactly the questions we wanted to ask and how to model the data for them
Post-web: questions and data are hard to predict; we need storage tools that are built to support this
CLOB pre-allocated space
Sparse maps
- This is something we thought we needed back in early 2010
- Traackr needs to score its entire DB of influencers on a weekly basis to adjust the weighted averages and stats that drive the scores. This means processing north of 750K sites, over 650K influencers and, soon, millions of posts (post-level attributes)
Graph Databases: while we can model our domain as a graph we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
Memcache: memory-based; we need true persistence
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches
CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away although we did try early prototypes
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options
Riak: very close but in early 2010, we had adoption questions
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib module, and support for batch processing using Hadoop/MapReduce. Hadoop and its maturity were a big reason we picked HBase
Had to deal with a complex setup right from the start:
- minimum number of data nodes to support replication
- odd number of ZooKeeper nodes to avoid voting deadlocks
- co-locating region servers = paying close attention to JVM resources
- Master = SPOF
- co-locating job trackers = paying close attention to JVM resources
- Quick overview of how we modeled a list in HBase
- This is what our customers see
- Let's consider the name, the ranks of the influencers and the influencer references
Each row has a unique key: the A-list id
We would group general attributes under one family of columns appropriately named “attributes”. Benefit: can get A-list information without loading all the influencers
We would group the influencer references under another family of columns named “influencerIds”
Column prefixes = family namesColumn suffixes = attribute names
Now we can see where the attributes we see on the screen are stored
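The row layout described above can be sketched as a plain object. This is an illustrative approximation of the HBase data model (row key plus column families), with made-up ids and names, not the real Traackr schema.

```javascript
// Sketch of one HBase row for an A-list.
// In HBase, columns are addressed as "family:qualifier".
const alistRow = {
  key: "alist-42",            // unique row key: the A-list id
  attributes: {               // "attributes" family: general A-list info
    name: "Top Tech Bloggers"
  },
  influencerIds: {            // "influencerIds" family: influencer references
    "1": "inf-100",           // qualifier = rank, value = influencer id
    "2": "inf-205"
  }
};

// Reading only the "attributes" family returns A-list information
// without loading all the influencer references.
function getFamily(row, family) {
  return row[family];
}

console.log(getFamily(alistRow, "attributes").name);
```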
- We coded the pagination and indexing features ourselves and contributed them back
- Felt really good about it!
As if it weren’t bad enough that we had to write our own code to support our indexing needs, we now had to maintain a third-party code base that was quickly becoming outdated!
Simplified example for posts
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
Exacerbated when we started tracking people’s content on a daily basis in mid-2011
Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before
CouchDB: more mature but still no ad-hoc queries
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important
Riak: strong contender still but adoption questions
MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)
siteId indexed for “find influencers connected to site X”
Embedded list of influencer references augmented with “usernames” (useful for content attribution)
Indexed for “find sites associated to influencer X”
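The two document shapes and lookups above can be sketched as follows. Field names (siteId, percentContribution, username) are assumptions for illustration, not the actual Traackr schema, and the index/query lines in the comment show the general MongoDB pattern rather than Traackr's exact calls.

```javascript
// Influencer document: embedded site references with
// influencer-specific attributes.
const influencer = {
  _id: "inf-100",
  name: "Jane Doe",
  sites: [
    { siteId: "site-7", percentContribution: 80 }
  ]
};

// Site document: embedded influencer references with usernames,
// used for content attribution.
const site = {
  _id: "site-7",
  url: "janedoe.example.com",
  influencers: [
    { influencerId: "inf-100", username: "janedoe" }
  ]
};

// In MongoDB the lookup would be backed by an index on the embedded field:
//   db.influencers.createIndex({ "sites.siteId": 1 })
//   db.influencers.find({ "sites.siteId": "site-7" })
// Simulated here over an in-memory array:
const influencers = [influencer];
const findInfluencersBySite = (siteId) =>
  influencers.filter(i => i.sites.some(s => s.siteId === siteId));

console.log(findInfluencersBySite("site-7").map(i => i.name));
```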
- This is an example of a simple report written in JavaScript meant to count the number of Twitter profiles we have counted total retweets for
- Easy to write and test if you know JavaScript (no complicated Java MR jobs)
- Easy to execute as a cron job and pipe the results to an email
- MR is slightly more involved but still much more approachable than Java MR (or Pig)
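A report of this kind can be sketched in a few lines of JavaScript. The data and field names below are made up for illustration; the mongo-shell line in the comment shows the general pattern, not Traackr's actual script.

```javascript
// Count how many Twitter profiles have a retweet total recorded.
// Sample data, hypothetical field names.
const profiles = [
  { handle: "@a", totalRetweets: 12 },
  { handle: "@b" },                     // no retweet count yet
  { handle: "@c", totalRetweets: 0 }
];

// Against MongoDB this would be a single filtered count, e.g.:
//   db.profiles.count({ totalRetweets: { $exists: true } })
// Simulated over the in-memory array:
const withRetweets = profiles.filter(p => "totalRetweets" in p).length;

console.log(withRetweets); // 2
```

The same logic runs unchanged in the mongo shell, which is what makes these reports easy to cron and pipe to email.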