A walkthrough of Traackr's experience choosing a NoSQL solution and how we ended up switching from HBase to MongoDB. This deck goes through some in-depth technical aspects, like schema design and our use of secondary indexes.
Editor's Notes
While there were definitely people who mistook the rise of NoSQL for a complete replacement of RDBMS, there were equal misunderstandings in the RDBMS camp:
- Eventual consistency is not the only way to operate MongoDB: write-ahead journaling and commit acknowledgement (via the fsync and j options) are available just as in RDBMS systems
- One does not need to be a distributed search engine managing petabytes of data to use these types of tools (which is our point)
Taking a look at the amount of storage we are using as of a month ago in Mongo; this includes indexes
The point is that we don’t need to track the entire web: just the subset belonging to influencers!
There is a different perspective on “Web Scale” that has to do with the nature of the data on the web
Take the approach of using a simplified entity model…
…with semi-structured data storage formats like JSON:
- Facilitate capturing related attribute structures
- Enable the flexibility of defining new attributes as they are discovered
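The flexibility point above can be sketched in a few lines. This is a minimal illustration, not Traackr's actual schema: the field names are made up.

```javascript
// Hypothetical influencer record in a semi-structured (JSON) store.
// Related attributes nest naturally instead of spreading across join tables.
const influencer = {
  name: "Jane Doe",
  sites: [{ url: "janedoe.example.com" }]
};

// A newly discovered attribute is simply added to the document --
// no ALTER TABLE migration as in an RDBMS.
influencer.twitterHandle = "@janedoe";

console.log(JSON.stringify(influencer));
```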
Pre-web: we knew exactly the questions we wanted to ask and how to model the data for them
Post-web: questions and data are hard to predict; we need storage tools that are built to support this
CLOB pre-allocated space
Sparse maps
- This is something we thought we needed back in early 2010
- Traackr needs to score its entire DB of influencers on a weekly basis to adjust the weighted averages and stats that drive the scores. This means processing north of 750K sites, over 650K influencers and, soon, millions of posts (post-level attributes)
Graph Databases: while we can model our domain as a graph we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
Memcache: memory-based; we need true persistence
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches
CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away although we did try early prototypes
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options
Riak: very close but in early 2010, we had adoption questions
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib module, and support for batch processing using Hadoop/MapReduce. Hadoop and its maturity were a big reason we picked HBase
Had to deal with a complex setup right from the start:
- minimum number of data nodes to support replication
- odd number of ZooKeeper nodes to avoid voting deadlocks
- co-locating region servers = paying close attention to JVM resources
- Master = SPOF
- co-locating job trackers = paying close attention to JVM resources
- Quick overview of how we modeled a list in HBase
- This is what our customers see
- Let's consider the name, the ranks of the influencers and the influencer references
Each row has a unique key: the A-list id
We would group general attributes under one family of columns appropriately named “attributes”. Benefit: can get A-list information without loading all the influencers
We would group the influencer references under another family of columns named “influencerIds”
Column prefixes = family namesColumn suffixes = attribute names
Now we can see where the attributes we see on the screen are stored
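The row layout described above can be sketched as a plain object. This is an illustrative approximation of the HBase data model (row key plus column families), with made-up ids and names, not the real Traackr schema.

```javascript
// Sketch of one HBase row for an A-list.
// In HBase, columns are addressed as "family:qualifier".
const alistRow = {
  key: "alist-42",            // unique row key: the A-list id
  attributes: {               // "attributes" family: general A-list info
    name: "Top Tech Bloggers"
  },
  influencerIds: {            // "influencerIds" family: influencer references
    "1": "inf-100",           // qualifier = rank, value = influencer id
    "2": "inf-205"
  }
};

// Reading only the "attributes" family returns A-list information
// without loading all the influencer references.
function getFamily(row, family) {
  return row[family];
}

console.log(getFamily(alistRow, "attributes").name);
```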
- We coded the pagination and indexing features ourselves and contributed them back
- Felt really good about it!
As if it weren’t bad enough that we had to write our own code to support our indexing needs, we now had to maintain a third-party code base that was quickly becoming outdated!
Simplified example for posts
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
Exacerbated when we started tracking people’s content on a daily basis in mid-2011
Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before
CouchDB: more mature but still no ad-hoc queries
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important
Riak: strong contender still but adoption questions
MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)
siteId indexed for “find influencers connected to site X”
Embedded list of influencer references augmented with “usernames” (useful for content attribution)
Indexed for “find sites associated to influencer X”
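The two document shapes and lookups above can be sketched as follows. Field names (siteId, percentContribution, username) are assumptions for illustration, not the actual Traackr schema, and the index/query lines in the comment show the general MongoDB pattern rather than Traackr's exact calls.

```javascript
// Influencer document: embedded site references with
// influencer-specific attributes.
const influencer = {
  _id: "inf-100",
  name: "Jane Doe",
  sites: [
    { siteId: "site-7", percentContribution: 80 }
  ]
};

// Site document: embedded influencer references with usernames,
// used for content attribution.
const site = {
  _id: "site-7",
  url: "janedoe.example.com",
  influencers: [
    { influencerId: "inf-100", username: "janedoe" }
  ]
};

// In MongoDB the lookup would be backed by an index on the embedded field:
//   db.influencers.createIndex({ "sites.siteId": 1 })
//   db.influencers.find({ "sites.siteId": "site-7" })
// Simulated here over an in-memory array:
const influencers = [influencer];
const findInfluencersBySite = (siteId) =>
  influencers.filter(i => i.sites.some(s => s.siteId === siteId));

console.log(findInfluencersBySite("site-7").map(i => i.name));
```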
- This is an example of a simple report written in JavaScript meant to count the number of Twitter profiles we have counted total retweets for
- Easy to write and test if you know JavaScript (no complicated Java MR jobs)
- Easy to execute as a cron job and pipe the results to an email
- MR is slightly more involved but still much more approachable than Java MR (or Pig)
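A report of this kind can be sketched in a few lines of JavaScript. The data and field names below are made up for illustration; the mongo-shell line in the comment shows the general pattern, not Traackr's actual script.

```javascript
// Count how many Twitter profiles have a retweet total recorded.
// Sample data, hypothetical field names.
const profiles = [
  { handle: "@a", totalRetweets: 12 },
  { handle: "@b" },                     // no retweet count yet
  { handle: "@c", totalRetweets: 0 }
];

// Against MongoDB this would be a single filtered count, e.g.:
//   db.profiles.count({ totalRetweets: { $exists: true } })
// Simulated over the in-memory array:
const withRetweets = profiles.filter(p => "totalRetweets" in p).length;

console.log(withRetweets); // 2
```

The same logic runs unchanged in the mongo shell, which is what makes these reports easy to cron and pipe to email.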