C* @Disqus · July 31, 2013
Cassandra SF Meetup
1Thursday, August 1, 13
INTRO
Who am I?
Software Engineer at Disqus
Built the current Data Pipeline
Enjoy working on large ecosystems
SO YOU MADE SOME ANALYTICS
We needed to build a system to track events from across the
Disqus network. On a given day we have
200,000 unique users creating
1,000,000 unique comments on
1,000,000 unique articles on
20,000 unique websites
4*10^21
4,000,000,000,000,000,000,000
4 sextillion (zetta)
potential combinations PER DAY
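The figure above is simply the product of the four daily cardinalities. A quick sanity check, using only the numbers quoted on this slide (the variable names are illustrative):

```python
# Daily cardinalities quoted on the slide.
users = 200_000
comments = 1_000_000
articles = 1_000_000
websites = 20_000

# Every (user, comment, article, website) tuple is a potential combination.
combinations = users * comments * articles * websites

print(f"{combinations:.0e}")  # 4e+21
print(f"{combinations:,}")    # 4,000,000,000,000,000,000,000
```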
THE BIG ONE
DESIGNING THE SYSTEM
GOALS
1. SCALABLE AND AVAILABLE DATA PIPELINE
2. ABILITY TO QUERY AND JOIN LARGE DATA SETS
3. ABILITY TO ACCESS A SUBSET IN REAL TIME
This is where Cassandra comes in
DATA FORMAT
You need a format for your data
Avro
Thrift
Protobuf
JSON
We chose JSON
At Disqus we do comments
{
    "category": "comment",
    "data": {
        "text": "What's going on",
        "author": "gjcourt"
    },
    "meta": {
        "endpoint": "/event.js",
        "useragent": {
            "flavor": { "version": "X" },
            "browser": { "version": "6.0", "name": "Safari" }
        }
    },
    "timestamp": 1375228800
}
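As a rough illustration of consuming such an event, the sketch below parses the sample document with the standard library and reads the fields used later as dimensions. Treating `timestamp` as UTC epoch seconds is an assumption; the slide does not state the unit or zone.

```python
import json
from datetime import datetime, timezone

# The sample event from the slide, as a JSON string.
raw = """
{
    "category": "comment",
    "data": {"text": "What's going on", "author": "gjcourt"},
    "meta": {
        "endpoint": "/event.js",
        "useragent": {
            "flavor": {"version": "X"},
            "browser": {"version": "6.0", "name": "Safari"}
        }
    },
    "timestamp": 1375228800
}
"""

event = json.loads(raw)

# Assumption: the timestamp is epoch seconds in UTC.
when = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)

print(event["category"], event["data"]["author"], when.isoformat())
# comment gjcourt 2013-07-31T00:00:00+00:00
```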
RANDOM ASIDE
Handling time in Python is a pain in the ass
time.time()
Return the time in seconds since the epoch as a floating point number. Note that even
though the time is always returned as a floating point number, not all systems provide time
with a better precision than 1 second. While this function normally returns non-decreasing
values, it can return a lower value than a previous call if the system clock has been set back
between the two calls.
>>> print time.time(); print time.mktime(time.gmtime())
1375244678.64
1375273478.0
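The two values in the session above differ by 28800 seconds (8 hours). `time.gmtime()` returns a UTC `struct_time`, but `time.mktime()` interprets its argument as local time, so the round trip is shifted by the machine's UTC offset. A minimal sketch of a round trip that stays in UTC, using the standard library's `calendar.timegm()`:

```python
import calendar
import time

ts = 1375228800  # epoch seconds from the sample event

utc_struct = time.gmtime(ts)               # struct_time expressed in UTC
round_trip = calendar.timegm(utc_struct)   # inverse of gmtime(), stays in UTC

print(round_trip == ts)  # True

# time.mktime(utc_struct) would instead read the UTC fields as local time,
# so its result depends on the time zone of the machine running it.
```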
PICKING A DATABASE IS HARD
PICKING A DATABASE
Mainly because there are so many choices
In an early startup, opportunity cost is king
While the choice of a system is important, there is a range of possible choices.
A system that provides value is more important than choosing a local maximum.
We need a large sparse matrix
Requires horizontal scalability
Fast reads and inserts
High cardinality
This all but rules out most RDBMSs
We chose Cassandra
What made the difference
We wanted counters, and Cassandra 0.8.0 has this capability
Fast inserts and reads
Tunable consistency guarantees
Simple data model
DESIGNING A DATA MODEL
DATA THAT SCALES
1. HIGH VOLUME SPARSE MATRIX (billions of dimensions)
2. FAST AND ACCURATE COUNTERS
3. SCALABLE AND AVAILABLE
DATA MODEL
How do you store arbitrary dimensionality over time?
Cassandra is a 2D sorted array
A simple way to build a counter
CREATE TABLE counts (
    key text,
    time_dimension text,
    value counter,
    PRIMARY KEY (key, time_dimension)
);
+---------+------+--------+-----------+-------------+
|         | 2013 | 2013.7 | 2013.7.30 | 2013.7.30.0 |
| comment +------+--------+-----------+-------------+
|         | 1000 |    100 |        10 |           1 |
+---------+------+--------+-----------+-------------+
+------------------------+------+--------+-----------+-------------+
|                        | 2013 | 2013.7 | 2013.7.30 | 2013.7.30.0 |
| comment.author.gjcourt +------+--------+-----------+-------------+
|                        |   23 |     17 |         7 |           1 |
+------------------------+------+--------+-----------+-------------+
Dimensions are easy
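The layout above can be sketched in plain Python: each event fans out into one row key per tracked dimension, and each row gets one column per time bucket. This is an in-memory stand-in, not the Disqus pipeline. The helper names (`time_buckets`, `row_keys`, `record`) and the choice of UTC are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def time_buckets(ts):
    """Column names for the year, month, day and hour containing ts (UTC assumed)."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return [
        f"{t.year}",
        f"{t.year}.{t.month}",
        f"{t.year}.{t.month}.{t.day}",
        f"{t.year}.{t.month}.{t.day}.{t.hour}",
    ]


def row_keys(event, dimensions=("author",)):
    """The category itself plus one key per tracked dimension value."""
    keys = [event["category"]]
    for dim in dimensions:
        keys.append(f'{event["category"]}.{dim}.{event["data"][dim]}')
    return keys


# counts[row_key][column] stands in for the counts table.
counts = defaultdict(lambda: defaultdict(int))


def record(event):
    """Increment every (row key, time bucket) cell touched by one event."""
    for key in row_keys(event):
        for column in time_buckets(event["timestamp"]):
            counts[key][column] += 1


record({"category": "comment", "data": {"author": "gjcourt"}, "timestamp": 1375228800})
print(dict(counts["comment.author.gjcourt"]))
# {'2013': 1, '2013.7': 1, '2013.7.31': 1, '2013.7.31.0': 1}
```

Adding a dimension only adds row keys; the column layout stays the same.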
And if you increment the time bucket 2013-07-31
+---------+------+--------+-----------+-------------+
|         | 2013 | 2013.7 | 2013.7.30 | 2013.7.30.0 |
| comment +------+--------+-----------+-------------+
|         | 1001 |    101 |        10 |           1 |
+---------+------+--------+-----------+-------------+
+------------------------+------+--------+-----------+-------------+
|                        | 2013 | 2013.7 | 2013.7.30 | 2013.7.30.0 |
| comment.author.gjcourt +------+--------+-----------+-------------+
|                        |   24 |     18 |         7 |           1 |
+------------------------+------+--------+-----------+-------------+
Some major disadvantages
All time intervals are in the same row
Queries are non-linear
Time buckets sort in lexical order, not chronological order
Dimensions cannot be indexed
Rows can grow unbounded
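The lexical-order problem is easy to reproduce: when bucket names are stored as text, they sort as strings, so October lands before July. A small sketch (the bucket values are made up for illustration):

```python
buckets = ["2013.7", "2013.9", "2013.10", "2013.11"]

# Text columns compare character by character, so "2013.10" < "2013.7".
print(sorted(buckets))
# ['2013.10', '2013.11', '2013.7', '2013.9']


def chronological(bucket):
    """Compare buckets by their numeric parts instead of as strings."""
    return tuple(int(part) for part in bucket.split("."))


print(sorted(buckets, key=chronological))
# ['2013.7', '2013.9', '2013.10', '2013.11']
```

A range scan over text buckets therefore does not return a contiguous time range.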
A better version of counters
+---------------+------+
|               | 2013 |
| comment.year  +------+
|               | 1000 |
+---------------+------+
+---------------+--------+--------+--------+
|               | 2013.5 | 2013.6 | 2013.7 |
| comment.month +--------+--------+--------+
|               |     96 |     78 |    100 |
+---------------+--------+--------+--------+
+---------------+-----------+-----------+-----------+
|               | 2013.7.28 | 2013.7.29 | 2013.7.30 |
| comment.day   +-----------+-----------+-----------+
|               |         8 |         6 |        13 |
+---------------+-----------+-----------+-----------+
This is a large improvement
Efficient range queries
Rollups are possible
However, it still has some problems
Dimensions are not indexed
Rows can grow unbounded
Remember the schema
CREATE TABLE counts (
    key text,
    time_dimension text,
    value counter,
    PRIMARY KEY (key, time_dimension)
);
Should time_dimension be a <timestamp>?
A better version of counters
CREATE TABLE better_counts (
    key text,
    time_dimension timestamp,
    value counter,
    PRIMARY KEY (key, time_dimension)
) WITH CLUSTERING ORDER BY (time_dimension DESC);
-- DESC stores the column as org.apache.cassandra.db.marshal.ReversedType<timestamp>
The problem with counters
Operations are NOT idempotent
Limited protection against overcounting
https://issues.apache.org/jira/browse/CASSANDRA-4775
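Why non-idempotent increments overcount can be shown with a toy model. This is not Cassandra's actual behaviour, only a sketch of the failure mode. `FlakyCounter` is a hypothetical stand-in whose first write is applied but never acknowledged.

```python
class FlakyCounter:
    """Hypothetical counter: applies every increment, but the first call 'times out'."""

    def __init__(self):
        self.value = 0
        self.calls = 0

    def increment(self, amount=1):
        self.calls += 1
        self.value += amount  # the write lands on the server...
        if self.calls == 1:
            raise TimeoutError("no acknowledgement")  # ...but the client never hears back


def increment_with_retry(counter, retries=1):
    """Naive client: retry the increment whenever it times out."""
    for _ in range(retries + 1):
        try:
            counter.increment()
            return
        except TimeoutError:
            continue


counter = FlakyCounter()
increment_with_retry(counter)
print(counter.value)  # 2: one logical event, counted twice
```

The client cannot tell a lost write from a lost acknowledgement, so retrying risks overcounting and not retrying risks undercounting.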
And you end up having to write code like this
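import logging
from functools import wraps

# Import paths are assumed for pycassa. CassandraError is the application's
# own exception type and is not shown on the slide.
from pycassa import TimedOutException, UnavailableException
from pycassa.pool import MaximumRetryException

logger = logging.getLogger(__name__)


def swallow_cassandra_timeouts(func):
    @wraps(func)
    def inner(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except TimedOutException:
            # swallowed rather than retried
            logger.warning("processor.pycassa.exception.timeout")
        except UnavailableException as e:
            # raise so that we retry this batch
            logger.error("processor.pycassa.exception.unavailable")
            raise CassandraError(e)
        except MaximumRetryException:
            logger.warning("processor.pycassa.exception.max_retry")
        except Exception:
            logger.error("processor.pycassa.exception.unknown")
            raise
    return inner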
And this
# LOCAL and hostname are not defined on this slide.
if LOCAL:
    CASSANDRA_TIMEOUT = 60
    CASSANDRA_RETRIES = 0
elif "prod" in hostname:
    CASSANDRA_TIMEOUT = 2  # Seconds
    CASSANDRA_RETRIES = 0  # None
elif "storm" in hostname:
    CASSANDRA_TIMEOUT = 0.2
    CASSANDRA_RETRIES = 0
else:  # proxy (read only)
    CASSANDRA_TIMEOUT = 60
    CASSANDRA_RETRIES = 3
And this too
CASSANDRA_CONFIG = {
    'stats': {
        'pool': PoolConfig(CASSANDRA_TIMEOUT, CASSANDRA_RETRIES, CASSANDRA_POOL_SIZE),
        'cf': {
            'counts': ColumnFamilyConfig(ConsistencyLevel.LOCAL_QUORUM, ConsistencyLevel.ONE),
            'durable_counts': ColumnFamilyConfig(ConsistencyLevel.LOCAL_QUORUM, ConsistencyLevel.LOCAL_QUORUM),
            'sets': ColumnFamilyConfig(ConsistencyLevel.LOCAL_QUORUM, ConsistencyLevel.LOCAL_QUORUM),
        }
    }
}
And operations to Cassandra look like this
@swallow_cassandra_timeouts
def side_effecting_function():
    # insert/update into cassandra
    pass
Durable counts
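CREATE TABLE durable_counts (
    key text,
    time_dimension timestamp,
    random uuid,
    value int,
    PRIMARY KEY (key, time_dimension, random)
) WITH CLUSTERING ORDER BY (time_dimension DESC, random ASC);
-- DESC stores time_dimension as org.apache.cassandra.db.marshal.ReversedType<timestamp>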
+---------------+--------------------------------------+--------------------------------------+
|               | 2013-07-30 05:21:38+0000             | 2013-07-30 05:23:44+0000             |
|               | eb401386-f420-11e2-a26b-002590024b08 | b320a95c-f240-11e2-a26b-002590024b08 |
| comment.year  +--------------------------------------+--------------------------------------+
|               |                                   20 |                                   50 |
+---------------+--------------------------------------+--------------------------------------+
+---------------+--------------------------------------+--------------------------------------+
|               | 2013-07-30 05:21:38+0000             | 2013-07-30 05:23:44+0000             |
|               | eb401386-f420-11e2-a26b-002590024b08 | b320a95c-f240-11e2-a26b-002590024b08 |
| comment.month +--------------------------------------+--------------------------------------+
|               |                                   20 |                                   50 |
+---------------+--------------------------------------+--------------------------------------+
+---------------+--------------------------------------+--------------------------------------+
|               | 2013-07-30 05:21:38+0000             | 2013-07-30 05:23:44+0000             |
|               | eb401386-f420-11e2-a26b-002590024b08 | b320a95c-f240-11e2-a26b-002590024b08 |
| comment.day   +--------------------------------------+--------------------------------------+
|               |                                   20 |                                   50 |
+---------------+--------------------------------------+--------------------------------------+
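The point of the extra `random` column is that each batch of counts becomes an ordinary insert keyed by a unique id, so replaying the same batch overwrites one cell instead of adding to it. A dictionary-based sketch of that idea follows. The `write` and `total` helpers are hypothetical, and summing on read is an assumption about how the totals are produced.

```python
import uuid

# (key, time_dimension, random) -> value; stands in for durable_counts.
durable_counts = {}


def write(key, time_dimension, random_id, value):
    """Plain insert: writing the same primary key again just overwrites it."""
    durable_counts[(key, time_dimension, random_id)] = value


def total(key):
    """Sum every cell stored under one row key."""
    return sum(v for (k, _, _), v in durable_counts.items() if k == key)


batch_id = uuid.uuid4()
for _ in range(3):  # the same batch is retried three times
    write("comment.day", "2013-07-30 05:21:38+0000", batch_id, 20)

write("comment.day", "2013-07-30 05:23:44+0000", uuid.uuid4(), 50)

print(total("comment.day"))  # 70
```

The cost moves to the read path, which has to sum the cells instead of reading one counter.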
And even doing all that hackery
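+------------+------------+-----------+-------------------+-----------+
| Hive count | C* counter | % Similar | C* durable counts | % Similar |
+------------+------------+-----------+-------------------+-----------+
|       8101 |       8179 | 99.046338 |              8179 | 99.046338 |
|       7328 |       7390 | 99.161028 |              7390 | 99.161028 |
|       6255 |       6304 | 99.222715 |              6304 | 99.222715 |
|       6604 |       6665 | 99.084771 |              6665 | 99.084771 |
|       7700 |       7766 | 99.150141 |              7766 | 99.150141 |
+------------+------------+-----------+-------------------+-----------+
5 weekdays of countable data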
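The "% Similar" column appears to be the Hive count divided by the Cassandra count, expressed as a percentage. That reading is inferred from the numbers and is not stated on the slide. A short sketch that recomputes it from the two count columns:

```python
# (Hive count, Cassandra count) pairs from the table above.
rows = [(8101, 8179), (7328, 7390), (6255, 6304), (6604, 6665), (7700, 7766)]


def percent_similar(hive, cassandra):
    """Assumed definition: Hive count as a percentage of the Cassandra count."""
    return 100 * hive / cassandra


percents = [percent_similar(h, c) for h, c in rows]

for (hive, cassandra), pct in zip(rows, percents):
    print(hive, cassandra, f"{pct:.6f}")
```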
Over 99% accuracy
100% (allegedly) counter parity
Since our data is a time series, what if you could view it that way?
With arbitrary dimensionality
With arbitrary multi-dimensionality
Sets (our first iteration)
CREATE TABLE sets (
    key text,
    time_dimension timestamp,
    element blob,
    value double,
    PRIMARY KEY (key, time_dimension)
);
Insert-only workload. Items are deleted by TTL.
Better Sets
CREATE TABLE sets (
    key text,
    time_dimension timestamp,
    element blob,
    deleted boolean,
    value double,
    PRIMARY KEY (key, time_dimension)
);
Insert-only workload. When you want to delete, you insert with deleted set to true.
Reads require you to iterate over all columns in chronological order. You sum values to calculate a score.
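The read path described above can be sketched as a replay of the row in time order. The sample rows are invented, and treating a `deleted` entry as removing the element's earlier value is one plausible reading of the slide, not a confirmed detail.

```python
# (time_dimension, element, deleted, value) entries for one key, in any order.
rows = [
    (1, b"a", False, 1.0),
    (2, b"b", False, 2.5),
    (3, b"a", True, 0.0),   # delete marker: "a" is removed at time 3
    (4, b"c", False, 0.5),
]


def live_elements(rows):
    """Replay entries chronologically, dropping elements marked as deleted."""
    live = {}
    for _, element, deleted, value in sorted(rows, key=lambda r: r[0]):
        if deleted:
            live.pop(element, None)
        else:
            live[element] = value
    return live


members = live_elements(rows)
score = sum(members.values())

print(sorted(members), score)  # [b'b', b'c'] 3.0
```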
Counters with indexable dimensions
CREATE TABLE catalog (
    key text,
    time_dimension timestamp,
    dimension_1 text,
    dimension_1_val text,
    dimension_2 text,
    dimension_2_val text,
    ...
    value counter,
    PRIMARY KEY (key, time_dimension)
) WITH CLUSTERING ORDER BY (time_dimension DESC);
Dimension Catalog
CREATE TABLE catalog (
    key text,
    dimension text,
    value text,
    PRIMARY KEY (key, dimension)
);
cqlsh:> insert into catalog (key, dimension, value) values ('comment', 'author', 'gjcourt');
cqlsh:> insert into catalog (key, dimension, value) values ('comment', 'forum', 'disqus');
cqlsh:> select dimension from catalog where key='comment';
dimension
-----------
author
forum
WHERE ARE WE GOING
OUR 2013 MISSIONS
1. EVOLVE CONTENT RECOMMENDATION AND ADVERTISING
2. PRODUCTIZE OUR DATA PIPELINE
3. EXPLORE NEW AND INTERESTING DATA PRODUCTS
THE FUTURE
casscached
Comparable performance
2 GB max “key” (instead of 1 MB)
Tunable consistency levels
Useful for SSI, mat-views
Postgres Foreign Data Wrapper
Could use a cass_fdw
Graph of users and views
g.V('username','gjcourt').out('thread_views').in('thread_views').except('username', 'gjcourt')
The Netflix algorithm:
All articles that have also been viewed by the people who viewed the thread I’m currently viewing.
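The traversal can be approximated without a graph database. The sketch below walks a small in-memory view graph: it starts from one user, finds others who share a viewed thread, and ranks the threads those users viewed. The data, the user names other than gjcourt, and the `co_viewed` helper are all hypothetical.

```python
from collections import Counter

# user -> set of threads that user has viewed (made-up data).
thread_views = {
    "gjcourt": {"t1", "t2"},
    "alice": {"t1", "t3"},
    "bob": {"t2", "t3", "t4"},
    "carol": {"t5"},
}


def co_viewed(user):
    """Threads viewed by people who share at least one viewed thread with user."""
    mine = thread_views[user]
    scores = Counter()
    for other, theirs in thread_views.items():
        if other == user or not (mine & theirs):
            continue  # skip the user themselves and anyone with no overlap
        scores.update(theirs - mine)  # only count threads the user has not seen
    return scores.most_common()


print(co_viewed("gjcourt"))  # [('t3', 2), ('t4', 1)]
```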
C* @Disqus · July 31, 2013
Cassandra SF Meetup
Thanks for listening
We’re hiring http://disqus.com/jobs/
63Thursday, August 1, 13

Weitere ähnliche Inhalte

Ähnlich wie Cassandra at Disqus — SF Cassandra Users Group July 31st

Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...xu liwei
 
My SYSAUX tablespace is full - please help
My SYSAUX tablespace is full - please helpMy SYSAUX tablespace is full - please help
My SYSAUX tablespace is full - please helpMarkus Flechtner
 
Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Jeremy Schneider
 
Introduction to PostgreSQL
Introduction to PostgreSQLIntroduction to PostgreSQL
Introduction to PostgreSQLJim Mlodgenski
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)Toshiyuki Shimono
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptxAndrew Lamb
 
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinGuido Schmutz
 
MySQL 5.7 Tutorial Dutch PHP Conference 2015
MySQL 5.7 Tutorial Dutch PHP Conference 2015MySQL 5.7 Tutorial Dutch PHP Conference 2015
MySQL 5.7 Tutorial Dutch PHP Conference 2015Dave Stokes
 
MySQL 5.7. Tutorial - Dutch PHP Conference 2015
MySQL 5.7. Tutorial - Dutch PHP Conference 2015MySQL 5.7. Tutorial - Dutch PHP Conference 2015
MySQL 5.7. Tutorial - Dutch PHP Conference 2015Dave Stokes
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...InfluxData
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...Andrew Lamb
 
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...Citus Data
 
Butter Web Browsing with Margarine
Butter Web Browsing with MargarineButter Web Browsing with Margarine
Butter Web Browsing with MargarineWayne Walls
 
Running a Realtime Stats Service on MySQL
Running a Realtime Stats Service on MySQLRunning a Realtime Stats Service on MySQL
Running a Realtime Stats Service on MySQLKazuho Oku
 
Cassandra Community Webinar | The World's Next Top Data Model
Cassandra Community Webinar | The World's Next Top Data ModelCassandra Community Webinar | The World's Next Top Data Model
Cassandra Community Webinar | The World's Next Top Data ModelDataStax
 
Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!Jonathan Katz
 
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0Tugdual Grall
 
MySQL Built-In Functions
MySQL Built-In FunctionsMySQL Built-In Functions
MySQL Built-In FunctionsSHC
 
Big Data Expo 2015 - Gigaspaces Making Sense of it all
Big Data Expo 2015 - Gigaspaces Making Sense of it allBig Data Expo 2015 - Gigaspaces Making Sense of it all
Big Data Expo 2015 - Gigaspaces Making Sense of it allBigDataExpo
 
Building a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management ApplicationBuilding a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management ApplicationJonathan Katz
 

Ähnlich wie Cassandra at Disqus — SF Cassandra Users Group July 31st (20)

Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...
 
My SYSAUX tablespace is full - please help
My SYSAUX tablespace is full - please helpMy SYSAUX tablespace is full - please help
My SYSAUX tablespace is full - please help
 
Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Wait! What’s going on inside my database?
Wait! What’s going on inside my database?
 
Introduction to PostgreSQL
Introduction to PostgreSQLIntroduction to PostgreSQL
Introduction to PostgreSQL
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
 
MySQL 5.7 Tutorial Dutch PHP Conference 2015
MySQL 5.7 Tutorial Dutch PHP Conference 2015MySQL 5.7 Tutorial Dutch PHP Conference 2015
MySQL 5.7 Tutorial Dutch PHP Conference 2015
 
MySQL 5.7. Tutorial - Dutch PHP Conference 2015
MySQL 5.7. Tutorial - Dutch PHP Conference 2015MySQL 5.7. Tutorial - Dutch PHP Conference 2015
MySQL 5.7. Tutorial - Dutch PHP Conference 2015
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
Data Modeling, Normalization, and Denormalisation | PostgreSQL Conference Eur...
 
Butter Web Browsing with Margarine
Butter Web Browsing with MargarineButter Web Browsing with Margarine
Butter Web Browsing with Margarine
 
Running a Realtime Stats Service on MySQL
Running a Realtime Stats Service on MySQLRunning a Realtime Stats Service on MySQL
Running a Realtime Stats Service on MySQL
 
Cassandra Community Webinar | The World's Next Top Data Model
Cassandra Community Webinar | The World's Next Top Data ModelCassandra Community Webinar | The World's Next Top Data Model
Cassandra Community Webinar | The World's Next Top Data Model
 
Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!
 
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
 
MySQL Built-In Functions
MySQL Built-In FunctionsMySQL Built-In Functions
MySQL Built-In Functions
 
Big Data Expo 2015 - Gigaspaces Making Sense of it all
Big Data Expo 2015 - Gigaspaces Making Sense of it allBig Data Expo 2015 - Gigaspaces Making Sense of it all
Big Data Expo 2015 - Gigaspaces Making Sense of it all
 
Building a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management ApplicationBuilding a Complex, Real-Time Data Management Application
Building a Complex, Real-Time Data Management Application
 

Mehr von DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

Mehr von DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra at Disqus — SF Cassandra Users Group July 31st

  • 1. C* @Disqus · July 31, 2013 Cassandra SF Meetup 1Thursday, August 1, 13
  • 2. INTRO: Who am I? Software Engineer at Disqus. Built the current Data Pipeline. Enjoy working on large ecosystems.
  • 3. SO YOU MADE SOME ANALYTICS: Needed to build a system to track events from across the Disqus network. On a given day we have 200,000 unique users creating 1,000,000 unique comments on 1,000,000 unique articles on 20,000 unique websites. That is 4*10^21 (4,000,000,000,000,000,000,000, or 4 sextillion, zetta) potential combinations PER DAY.
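The 4*10^21 figure is the product of the four daily cardinalities quoted on the slide. A minimal check in Python 3 (the variable names are illustrative, not from the talk):

```python
# Daily cardinalities quoted on the slide.
users = 200_000
comments = 1_000_000
articles = 1_000_000
websites = 20_000

# Upper bound on distinct (user, comment, article, website) combinations per day.
combinations = users * comments * articles * websites
print(f"{combinations:.0e}")  # 4e+21
```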
  • 6-7. GOALS: 1. SCALABLE AND AVAILABLE DATA PIPELINE; 2. ABILITY TO QUERY AND JOIN LARGE DATA SETS; 3. ABILITY TO ACCESS A SUBSET IN REAL TIME. This is where Cassandra comes in.
  • 8-9. DATA FORMAT: You need a format for your data: Avro, Thrift, Protobuf, or JSON.
  • 10. DATA FORMAT: We chose JSON.
  • 11-15. DATA FORMAT: At Disqus we do comments
      {
        "category": "comment",
        "data": {
          "text": "What's going on",
          "author": "gjcourt"
        },
        "meta": {
          "endpoint": "/event.js",
          "useragent": {
            "flavor": { "version": "X" },
            "browser": { "version": "6.0", "name": "Safari" }
          }
        },
        "timestamp": 1375228800
      }
  • 16-17. RANDOM ASIDE: Handling time in Python is a pain in the ass.
      time.time(): Return the time in seconds since the epoch as a floating point number. Note that even though the time is always returned as a floating point number, not all systems provide time with a better precision than 1 second. While this function normally returns non-decreasing values, it can return a lower value than a previous call if the system clock has been set back between the two calls.
      >>> print time.time(); print time.mktime(time.gmtime())
      1375244678.64
      1375273478.0
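The two numbers on the slide differ by about 28,800 seconds (8 hours). That is the kind of gap you get because `time.mktime()` interprets its argument as local time, so feeding it the UTC struct from `time.gmtime()` shifts the result by the host's UTC offset. A minimal Python 3 sketch of the pitfall, using `calendar.timegm()` as the UTC inverse of `time.gmtime()` (variable names are illustrative):

```python
import calendar
import time

now = time.time()                     # float seconds since the epoch
utc_struct = time.gmtime(now)         # broken-down time in UTC
as_local = time.mktime(utc_struct)    # pitfall: mktime() treats the struct as local time
as_utc = calendar.timegm(utc_struct)  # inverse of gmtime(): whole seconds since the epoch

# On a host whose clock is not on UTC, as_local is off by the local UTC offset.
print(int(now), as_utc, as_local)
```

Storing event timestamps as integer epoch seconds, as in the JSON event above, sidesteps the problem at write time.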
  • 18. PICKING A DATABASE IS HARD
  • 19. PICKING A DATABASE: Mainly because there are so many choices.
  • 20. PICKING A DATABASE: In an early startup, opportunity cost is king. While the choice of a system is important, there is a range of possible choices. A system that provides value is more important than choosing a local maximum.
  • 21-22. PICKING A DATABASE: We need a large sparse matrix. It requires horizontal scalability, fast reads and inserts, and high cardinality. That almost rules out most RDBMS.
  • 23-24. PICKING A DATABASE: We chose Cassandra.
  • 25. PICKING A DATABASE: What made the difference: we wanted counters, and 0.8.0 has this capability; fast inserts and reads; tunable consistency guarantees; simple data model.
  • 26. DESIGNING A DATA MODEL
  • 27. DATA THAT SCALES: 1. HIGH VOLUME SPARSE MATRIX (billions of dimensions); 2. FAST AND ACCURATE COUNTERS; 3. SCALABLE AND AVAILABLE
  • 28. DATA MODEL: How do you store arbitrary dimensionality over time? Cassandra is a 2D sorted array.
  • 29. DATA MODEL: A simple way to build a counter
      CREATE TABLE counts (
        key text,
        time_dimension text,
        value counter,
        PRIMARY KEY (key, time_dimension)
      );
  • 30-31. DATA MODEL: A simple way to build a counter. Dimensions are easy.
      +------------------------+------+--------+-----------+-------------+
      | row key                | 2013 | 2013.7 | 2013.7.30 | 2013.7.30.0 |
      +------------------------+------+--------+-----------+-------------+
      | comment                | 1000 |    100 |        10 |           1 |
      | comment.author.gjcourt |   23 |     17 |         7 |           1 |
      +------------------------+------+--------+-----------+-------------+
  • 32. DATA MODEL: And if you increment the time bucket 2013-07-31. Dimensions are easy.
      +------------------------+------+--------+-----------+-------------+
      | row key                | 2013 | 2013.7 | 2013.7.30 | 2013.7.30.0 |
      +------------------------+------+--------+-----------+-------------+
      | comment                | 1001 |    101 |        10 |           1 |
      | comment.author.gjcourt |   24 |     18 |         7 |           1 |
      +------------------------+------+--------+-----------+-------------+
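A minimal sketch of how one event could fan out into the rows and time buckets shown on these slides. It is a Python 3 illustration, not the Disqus implementation: the `Counter` stands in for the Cassandra counter table, and `time_buckets`, `row_keys` and `record` are hypothetical helpers. Skipping the free-text field is also an assumption.

```python
from collections import Counter
from datetime import datetime, timezone

def time_buckets(ts):
    """Return year, month, day and hour bucket names in the slide's dotted style."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return [
        f"{t.year}",
        f"{t.year}.{t.month}",
        f"{t.year}.{t.month}.{t.day}",
        f"{t.year}.{t.month}.{t.day}.{t.hour}",
    ]

def row_keys(event):
    """One row for the category, plus one row per top-level data dimension."""
    category = event["category"]
    keys = [category]
    for name, value in event["data"].items():
        if name != "text":  # assumption: free text is not a useful dimension
            keys.append(f"{category}.{name}.{value}")
    return keys

counts = Counter()  # stands in for the counter table: (key, time_dimension) -> value

def record(event):
    """Increment every (row key, time bucket) cell touched by the event."""
    for key in row_keys(event):
        for bucket in time_buckets(event["timestamp"]):
            counts[(key, bucket)] += 1

record({
    "category": "comment",
    "data": {"text": "What's going on", "author": "gjcourt"},
    "timestamp": 1375228800,  # 2013-07-31 00:00:00 UTC
})
print(sorted(counts))
```

Each event touches (number of row keys) x (number of time buckets) cells, here 2 x 4 = 8, which is why adding a dimension is cheap to express but multiplies the writes.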
  • 33. DATA MODEL: Some major disadvantages: all time intervals are in the same row; queries are non-linear; time buckets are in lexical order; dimensions cannot be indexed; rows can grow unbounded.
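The "lexical order" problem can be seen with plain string sorting. The sketch below uses Python's string comparison as a stand-in for a text-typed column comparator (an assumption that holds for ASCII bucket names); the month buckets are illustrative.

```python
buckets = ["2013.7", "2013.8", "2013.9", "2013.10", "2013.11"]

# Text comparison puts October and November before July.
print(sorted(buckets))
# ['2013.10', '2013.11', '2013.7', '2013.8', '2013.9']

# Zero-padding the components restores chronological order under text comparison.
padded = ["2013.07", "2013.08", "2013.09", "2013.10", "2013.11"]
print(sorted(padded) == padded)  # True
```

Zero-padding is one workaround; using a real timestamp type for the time dimension removes the issue altogether.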
  • 34. DATA MODEL: A better version of counters
      +---------------+------+
      | row key       | 2013 |
      +---------------+------+
      | comment.year  | 1000 |
      +---------------+------+

      +---------------+--------+--------+--------+
      | row key       | 2013.5 | 2013.6 | 2013.7 |
      +---------------+--------+--------+--------+
      | comment.month |     96 |     78 |    100 |
      +---------------+--------+--------+--------+

      +---------------+-----------+-----------+-----------+
      | row key       | 2013.7.28 | 2013.7.29 | 2013.7.30 |
      +---------------+-----------+-----------+-----------+
      | comment.day   |         8 |         6 |        13 |
      +---------------+-----------+-----------+-----------+
  • 35. DATA MODEL: This is a large improvement: efficient range queries; rollups are possible.
  • 36. DATA MODEL: However, it still has some problems: dimensions are not indexed; rows can grow unbounded.
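With one row per granularity, a rollup is a sum over a contiguous run of finer-grained columns. A small Python 3 sketch using the three day buckets shown for `comment.day` (the `rollup_to_month` helper is hypothetical, and the result covers only those three days, not the whole month):

```python
from collections import defaultdict

# Day buckets from the "comment.day" row on the slide.
day_counts = {"2013.7.28": 8, "2013.7.29": 6, "2013.7.30": 13}

def rollup_to_month(days):
    """Sum day buckets into their enclosing month buckets."""
    months = defaultdict(int)
    for bucket, count in days.items():
        year, month, _day = bucket.split(".")
        months[f"{year}.{month}"] += count
    return dict(months)

print(rollup_to_month(day_counts))  # {'2013.7': 27}
```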
  • 37-39. DATA MODEL: Remember the schema. Should time_dimension be a <timestamp>?
      CREATE TABLE counts (
        key text,
        time_dimension text,
        value counter,
        PRIMARY KEY (key, time_dimension)
      );
  • 40. DATA MODEL: A better version of counters
      CREATE TABLE better_counts (
        key text,
        time_dimension 'org.apache.cassandra.db.marshal.ReversedType'<timestamp>,
        value counter,
        PRIMARY KEY (key, time_dimension)
      );
  • 41. DATA MODEL: The problem with counters: operations are NOT idempotent; limited protection against overcounting. https://issues.apache.org/jira/browse/CASSANDRA-4775
  • 42. DATA MODEL: And you end up having to write code like this
      from functools import wraps

      # Assumed context: the three Cassandra exception classes are imported from
      # pycassa; logger and CassandraError are defined elsewhere in the project.
      def swallow_cassandra_timeouts(func):
          @wraps(func)
          def inner(*args, **kwargs):
              try:
                  return func(*args, **kwargs)
              except TimedOutException as e:
                  logger.warning("processor.pycassa.exception.timeout")
              except UnavailableException as e:
                  # raise so that we retry this batch
                  logger.error("processor.pycassa.exception.unavailable")
                  raise CassandraError(e)
              except MaximumRetryException as e:
                  logger.warning("processor.pycassa.exception.max_retry")
              except Exception as e:
                  logger.error("processor.pycassa.exception.unknown")
                  raise
          return inner
  • 43. DATA MODEL: And this
      if LOCAL:
          CASSANDRA_TIMEOUT = 60
          CASSANDRA_RETRIES = 0
      elif "prod" in hostname:
          CASSANDRA_TIMEOUT = 2  # Seconds
          CASSANDRA_RETRIES = 0  # None
      elif "storm" in hostname:
          CASSANDRA_TIMEOUT = 0.2
          CASSANDRA_RETRIES = 0
      else:  # proxy (read only)
          CASSANDRA_TIMEOUT = 60
          CASSANDRA_RETRIES = 3
  • 44. DATA MODEL: And this too
      CASSANDRA_CONFIG = {
          'stats': {
              'pool': PoolConfig(CASSANDRA_TIMEOUT, CASSANDRA_RETRIES, CASSANDRA_POOL_SIZE),
              'cf': {
                  'counts': ColumnFamilyConfig(ConsistencyLevel.LOCAL_QUORUM, ConsistencyLevel.ONE),
                  'durable_counts': ColumnFamilyConfig(ConsistencyLevel.LOCAL_QUORUM, ConsistencyLevel.LOCAL_QUORUM),
                  'sets': ColumnFamilyConfig(ConsistencyLevel.LOCAL_QUORUM, ConsistencyLevel.LOCAL_QUORUM),
              }
          }
      }
  • 45. DATA MODEL: And operations to Cassandra look like this
      @swallow_cassandra_timeouts
      def side_effecting_function():
          # insert/update into cassandra
          pass
  • 46. DATA MODEL: Durable counts
      CREATE TABLE durable_counts (
        key text,
        time_dimension 'org.apache.cassandra.db.marshal.ReversedType'<timestamp>,
        random uuid,
        value int,
        PRIMARY KEY (key, time_dimension, random)
      );
  • 47. DATA MODEL: Durable counts
      +---------------+--------------------------------------+--------------------------------------+
      | row key       | 2013-07-30 05:21:38+0000             | 2013-07-30 05:23:44+0000             |
      |               | eb401386-f420-11e2-a26b-002590024b08 | b320a95c-f240-11e2-a26b-002590024b08 |
      +---------------+--------------------------------------+--------------------------------------+
      | comment.year  | 20                                   | 50                                   |
      | comment.month | 20                                   | 50                                   |
      | comment.day   | 20                                   | 50                                   |
      +---------------+--------------------------------------+--------------------------------------+
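A toy in-memory model of why the durable layout tolerates retries when a counter increment does not. This Python 3 sketch is not Cassandra code: a plain integer models the counter column, a dict keyed by (time, uuid) models one durable_counts row, and the "timeout then retry" scenario is simulated by applying the same write twice.

```python
import uuid

counter = 0       # models a counter column
durable_row = {}  # models one durable_counts row: (time, uuid) -> value

def counter_increment(delta):
    """Non-idempotent: every call changes the total."""
    global counter
    counter += delta

def durable_insert(ts, event_id, delta):
    """Idempotent: writing the same (time, uuid) cell again overwrites it."""
    durable_row[(ts, event_id)] = delta

event_id = uuid.uuid1()
ts = "2013-07-30 05:21:38+0000"

# The first attempt is applied but the client sees a timeout, so it retries.
for _attempt in range(2):
    counter_increment(20)
    durable_insert(ts, event_id, 20)

print(counter)                    # 40: the retry was counted twice
print(sum(durable_row.values()))  # 20: the retry landed on the same cell
```

The price of the durable layout is that a read has to sum the cells in the row instead of fetching a single value.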
  • 48. DATA MODEL: And even doing all that hackery (5 week days of countable data)
      Hive count | C* counter | % Similar | C* durable counts | % Similar
      -----------+------------+-----------+-------------------+-----------
      8101       | 8179       | 99.046338 | 8179              | 99.046338
      7328       | 7390       | 99.161028 | 7390              | 99.161028
      6255       | 6304       | 99.222715 | 6304              | 99.222715
      6604       | 6665       | 99.084771 | 6665              | 99.084771
      7700       | 7766       | 99.150141 | 7766              | 99.150141
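The "% Similar" column appears to be the Hive count divided by the Cassandra count, times 100; that reading is an assumption, but it reproduces the first row exactly. A short Python 3 check computed directly from the count pairs on the slide:

```python
# (Hive count, Cassandra count) pairs from the slide.
rows = [(8101, 8179), (7328, 7390), (6255, 6304), (6604, 6665), (7700, 7766)]

def percent_similar(hive, cassandra):
    """Hive count as a percentage of the Cassandra count."""
    return 100.0 * hive / cassandra

for hive, cassandra in rows:
    print(hive, cassandra, round(percent_similar(hive, cassandra), 6))
```

In every row the Cassandra figure is the larger one, which fits the overcounting concern raised earlier in the talk.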
  • 49. DATA MODEL: Over 99% accuracy. 100% (allegedly) counter parity.
  • 50. DATA MODEL: Since our data is time series, what if you could view it that way?
  • 51. DATA MODEL: With arbitrary dimensionality
  • 52. DATA MODEL: With arbitrary multi-dimensionality
  • 53. DATA MODEL: Sets (our first iteration). Insert-only workload. Items are deleted by TTL.
      CREATE TABLE sets (
        key text,
        time_dimension timestamp,
        element blob,
        value double,
        PRIMARY KEY (key, time_dimension)
      );
  • 54. DATA MODEL: Better Sets. Insert-only workload. When you want to delete, you insert with deleted set to true. Reads require you to iterate over all columns in chronological order. You sum values to calculate a score.
      CREATE TABLE sets (
        key text,
        time_dimension timestamp,
        element blob,
        deleted boolean,
        value double,
        PRIMARY KEY (key, time_dimension)
      );
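A sketch of the read path described for "Better Sets": replay a row's columns oldest-first, add up values, and honour delete markers. This is an in-memory Python 3 illustration. The tuple layout, the `score` helper, and the choice that a delete marker resets the running score are assumptions, since the slide does not spell out the exact semantics.

```python
def score(columns, element):
    """Replay one row's columns oldest-first and return the element's score, or None."""
    total = 0.0
    present = False
    for _ts, elem, deleted, value in sorted(columns, key=lambda c: c[0]):
        if elem != element:
            continue
        if deleted:
            total, present = 0.0, False  # assumption: a delete marker resets the element
        else:
            total, present = total + value, True
    return total if present else None

# Hypothetical row contents: (time_dimension, element, deleted, value)
row = [
    (1, b"thread-1", False, 1.0),
    (2, b"thread-1", False, 2.5),
    (3, b"thread-2", False, 1.0),
    (4, b"thread-2", True, 0.0),
]
print(score(row, b"thread-1"))  # 3.5
print(score(row, b"thread-2"))  # None
```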
  • 55. DATA MODEL: Counters with indexable dimensions
      CREATE TABLE catalog (
        key text,
        time_dimension 'org.apache.cassandra.db.marshal.ReversedType'<timestamp>,
        dimension_1 text,
        dimension_1_val text,
        dimension_2 text,
        dimension_2_val text,
        ...
        value counter,
        PRIMARY KEY (key, time_dimension)
      );
  • 56-57. DATA MODEL: Dimension Catalog
      CREATE TABLE catalog (
        key text,
        dimension text,
        value text,
        PRIMARY KEY (key, dimension)
      );

      cqlsh:> insert into catalog (key, dimension, value) values ('comment', 'author', 'gjcourt');
      cqlsh:> insert into catalog (key, dimension, value) values ('comment', 'forum', 'disqus');
      cqlsh:> select dimension from catalog where key='comment';

       dimension
      -----------
       author
       forum
  • 58. WHERE ARE WE GOING
  • 59. OUR 2013 MISSIONS: 1. EVOLVE CONTENT RECOMMENDATION AND ADVERTISING; 2. PRODUCTIZE OUR DATA PIPELINE; 3. EXPLORE NEW AND INTERESTING DATA PRODUCTS
  • 60. THE FUTURE: casscached. Comparable performance; 2 GB max “key” (instead of 1 MB); tunable consistency levels; useful for SSI, mat-views.
  • 61. THE FUTURE: Postgres Foreign Data Wrapper. Could use a cass_fdw.
  • 62. THE FUTURE: Graph of users and views
      g.V('username','gjcourt').out('thread_views').in('thread_views').except('username', 'gjcourt')
    The Netflix algorithm: all articles that people who have viewed the thread I’m currently viewing have also viewed.
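The sentence on this slide can be sketched without a graph database. The Python 3 example below works on a hypothetical in-memory view log and follows the prose description rather than translating the Gremlin traversal; `views` and `also_viewed` are illustrative names.

```python
from collections import Counter

# Hypothetical view log: username -> set of threads viewed.
views = {
    "gjcourt": {"t1", "t2"},
    "alice": {"t1", "t3"},
    "bob": {"t2", "t3", "t4"},
    "carol": {"t5"},
}

def also_viewed(current_thread, me):
    """Threads viewed by other people who viewed current_thread, most common first."""
    tally = Counter()
    for user, threads in views.items():
        if user != me and current_thread in threads:
            tally.update(threads - {current_thread})
    return [thread for thread, _count in tally.most_common()]

print(also_viewed("t1", "gjcourt"))  # ['t3']
```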
  • 63. C* @Disqus · July 31, 2013 · Cassandra SF Meetup. Thanks for listening. We’re hiring: http://disqus.com/jobs/