Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
1. Cassandra 1.0
and the future of big data
Jonathan Ellis
Tuesday, October 4, 2011
2. About me
✤ Project chair, Apache Cassandra
✤ Active since Dec 2008
✤ First non-Facebook committer
✤ Wrote ~30% of committed patches, reviewed ~40% of the rest
✤ Distributed systems background
✤ At Mozy, built a multi-petabyte, scalable storage system based on Reed-Solomon encoding
✤ Founder and CTO, DataStax
3. About DataStax
✤ Founded in April 2010
✤ Commercial leader in Apache Cassandra
✤ 100+ customers
✤ 30+ employees
✤ Home to Apache Cassandra Chair & most committers
✤ Headquartered in the San Francisco Bay Area, California
✤ Secured $11M in Series B funding in Sep 2011
6. Big data
Analytics (Hadoop)   ?   Realtime (“NoSQL”)
7. Some Cassandra users
✤ Financial
✤ Social Media
✤ Advertising
✤ Entertainment
✤ Energy
✤ E-tail
✤ Health care
✤ Government
8. Common use cases
✤ Time series data
✤ Messaging
✤ Ad tracking
✤ Data mining
✤ User activity streams
✤ User sessions
✤ Anything requiring scalability + performance + high availability
15. Level-based Compaction
✤ SSTables are non-overlapping within a level
✤ Bounds the number of SSTables that can contain a given row
L0: newly flushed
L1: 100 MB
L2: 1000 MB
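To illustrate why non-overlapping levels bound the read cost, here is a minimal Python sketch. It is not Cassandra's implementation: the `SSTable` tuple, the `candidates` function, and the sample key ranges are all hypothetical. It assumes L0 ranges may overlap, while each higher level is sorted by first key with disjoint ranges.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class SSTable(NamedTuple):
    first_key: str  # smallest row key stored in this sstable
    last_key: str   # largest row key stored in this sstable


def candidates(levels: List[List[SSTable]], key: str) -> List[SSTable]:
    """Return the sstables that may contain `key`.

    levels[0] is L0 (newly flushed, ranges may overlap), so every L0
    sstable whose range covers the key is a candidate. Each higher level
    is assumed sorted by first_key with non-overlapping ranges, so a
    binary search yields at most one candidate per level.
    """
    found = [t for t in levels[0] if t.first_key <= key <= t.last_key]
    for level in levels[1:]:
        starts = [t.first_key for t in level]
        i = bisect_right(starts, key) - 1  # last sstable starting at or before key
        if i >= 0 and key <= level[i].last_key:
            found.append(level[i])
    return found


# Hypothetical layout: two overlapping L0 sstables, three each in L1 and L2.
levels = [
    [SSTable("a", "z"), SSTable("c", "m")],
    [SSTable("a", "f"), SSTable("g", "p"), SSTable("q", "z")],
    [SSTable("a", "c"), SSTable("d", "h"), SSTable("i", "z")],
]
hits = candidates(levels, "e")
```

In this layout a read for key "e" touches both L0 sstables but only one sstable in each of L1 and L2, no matter how many sstables those levels hold.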
16. Read performance: maxtimestamp
✤ Sort sstables by maximum (client-provided) timestamp
✤ Only merge sstables until we have the columns requested
✤ Allows pre-merging highly fragmented rows without waiting for compaction
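The early-exit idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Cassandra's read path: `SSTable`, `read_columns`, and the sample data are hypothetical, each sstable holds a single row, and tombstones are ignored.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

Column = Tuple[str, int]  # (value, write timestamp)


class SSTable(NamedTuple):
    max_timestamp: int          # newest timestamp of any column in the file
    columns: Dict[str, Column]  # column name -> (value, timestamp) for one row


def read_columns(sstables: Iterable[SSTable], wanted: List[str]):
    """Merge sstables newest-first and stop once older files cannot matter."""
    result: Dict[str, Column] = {}
    scanned = 0
    for table in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        # Stop once every wanted column has been found with a timestamp newer
        # than anything this (and therefore any later) sstable can hold.
        if all(name in result and result[name][1] > table.max_timestamp
               for name in wanted):
            break
        scanned += 1
        for name in wanted:
            col = table.columns.get(name)
            if col is not None and (name not in result or col[1] > result[name][1]):
                result[name] = col
    return result, scanned


# Hypothetical row fragmented across three sstables.
tables = [
    SSTable(10, {"name": ("old", 5), "state": ("TX", 10)}),
    SSTable(30, {"name": ("new", 30)}),
    SSTable(20, {"state": ("UT", 20)}),
]
row, scanned = read_columns(tables, ["name", "state"])
```

Here the oldest sstable is never opened, because both requested columns were already found with timestamps newer than its maximum.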
18. CQL
cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970;
KEY | birth_date | full_name | state |
bsanderson | 1975 | Brandon Sanderson | UT |
19. CQL 2.0
✤ ALTER
✤ Counter support
✤ TTL support
✤ SELECT count(*)
20. Post-1.0 features
✤ Ease of use
✤ CQL
✤ “Native” transport
✤ Composite columns
✤ Prepared statements
✤ Triggers
✤ Entity groups
✤ Smarter range queries
✤ Enables more-efficient analytics
21. The evolution of Analytics
Analytics + Realtime
22. The evolution of Analytics
replication
Analytics Realtime
27. Data model: Realtime
LiveStocks
        last
GOOG    $95.52
AAPL    $186.10
AMZN    $112.98
Portfolios
            GOOG  LNKD  P   AMZN  AAPL
Portfolio1  80    20    40  100   20
StockHist
        2011-01-01  2011-01-02  2011-01-03
GOOG    $79.85      $75.23      $82.11
28. Data model: Analytics
HistLoss
worst_date loss
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-03-11 -$11432.24
Portfolio3 2011-05-21 -$1476.93
29. Data model: Analytics
10dayreturns
ticker rdate return
GOOG 2011-07-25 $8.23
GOOG 2011-07-24 $6.14
GOOG 2011-07-23 $7.78
AAPL 2011-07-25 $15.32
AAPL 2011-07-24 $12.68
INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker,
b.column_name rdate,
b.value - a.value
FROM StockHist a
JOIN StockHist b
ON (a.row_key = b.row_key
AND date_add(a.column_name,10) = b.column_name);
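For readers less familiar with HiveQL, the self-join above can be sketched in plain Python. This is only an illustration of the query's logic: `ten_day_returns` is a hypothetical helper, and the 2011-01-11 price in the sample is made up for the example.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def ten_day_returns(stock_hist: Dict[str, Dict[date, float]],
                    days: int = 10) -> List[Tuple[str, date, float]]:
    """Mimic the StockHist self-join: pair each price with the one `days` later."""
    rows = []
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            end = start + timedelta(days=days)  # date_add(a.column_name, 10)
            if end in prices:                   # inner join: unmatched dates drop out
                rows.append((ticker, end, prices[end] - start_price))
    return rows


# One wide row per ticker, as in StockHist; the second price is hypothetical.
hist = {"GOOG": {date(2011, 1, 1): 79.85, date(2011, 1, 11): 85.00}}
returns = ten_day_returns(hist)
```

As in the query, the result is keyed by the later date (`rdate`), and dates without a price ten days earlier produce no row.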
30. Data model: Analytics
        2011-01-01  2011-01-02  2011-01-03
GOOG    $79.85      $75.23      $82.11

row_key  column_name  value
GOOG     2011-01-01   $8.23
GOOG     2011-01-02   $6.14
GOOG     2011-01-03   $7.78
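The mapping from one wide row to one relational row per column can be sketched in Python. The `flatten` helper is hypothetical and only illustrates the shape of the transformation, using the StockHist prices shown earlier in the deck.

```python
from typing import Dict, List, Tuple


def flatten(wide_rows: Dict[str, Dict[str, float]]) -> List[Tuple[str, str, float]]:
    """Turn each (row key -> {column name: value}) into (row_key, column_name, value) tuples."""
    return [
        (row_key, column_name, value)
        for row_key, columns in wide_rows.items()
        for column_name, value in columns.items()
    ]


stock_hist = {"GOOG": {"2011-01-01": 79.85, "2011-01-02": 75.23, "2011-01-03": 82.11}}
narrow = flatten(stock_hist)
```

Each column of the wide row becomes its own tuple, which is the form the Hive queries in this section operate on.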
31. Data model: Analytics
portfolio_returns
portfolio rdate preturn
Portfolio1 2011-07-25 $118.21
Portfolio1 2011-07-24 $60.78
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-07-25 $2143.92
Portfolio3 2011-07-24 -$10.19
INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio,
rdate,
SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;
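The join and aggregation above can be sketched in Python as follows. The `portfolio_returns` function is a hypothetical stand-in for the query, fed with the Portfolio1 holdings and 10dayreturns rows from the earlier slides. Like the query as written, it sums the per-ticker return for each ticker held and does not weight by the number of shares.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def portfolio_returns(portfolios: Dict[str, Dict[str, int]],
                      returns: List[Tuple[str, str, float]]
                      ) -> Dict[Tuple[str, str], float]:
    """Sum ticker returns per (portfolio, rdate), joining on ticker."""
    by_ticker = defaultdict(list)
    for ticker, rdate, ret in returns:
        by_ticker[ticker].append((rdate, ret))

    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for portfolio, holdings in portfolios.items():
        for ticker in holdings:  # a.column_name = b.ticker
            for rdate, ret in by_ticker.get(ticker, []):
                totals[(portfolio, rdate)] += ret  # GROUP BY row_key, rdate
    return dict(totals)


portfolios = {"Portfolio1": {"GOOG": 80, "AAPL": 20}}
returns = [
    ("GOOG", "2011-07-25", 8.23),
    ("GOOG", "2011-07-24", 6.14),
    ("AAPL", "2011-07-25", 15.32),
]
totals = portfolio_returns(portfolios, returns)
```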
32. Data model: Analytics
HistLoss
worst_date loss
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-03-11 -$11432.24
Portfolio3 2011-05-21 -$1476.93
INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
SELECT portfolio, min(preturn) as minp
FROM portfolio_returns
GROUP BY portfolio
) a
JOIN portfolio_returns b
ON (a.portfolio = b.portfolio and a.minp = b.preturn);
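The "minimum per group, then join back" pattern in this query can be sketched in Python. The `hist_loss` function is a hypothetical illustration, run here on the Portfolio1 rows from the portfolio_returns slide.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]  # (portfolio, rdate, preturn)


def hist_loss(rows: List[Row]) -> List[Row]:
    """Return, for each portfolio, the row(s) holding its smallest return."""
    # Inner query: min(preturn) grouped by portfolio.
    minp: Dict[str, float] = {}
    for portfolio, _rdate, preturn in rows:
        if portfolio not in minp or preturn < minp[portfolio]:
            minp[portfolio] = preturn
    # Join back on (portfolio, preturn) to recover the date of the worst return.
    return [row for row in rows if row[2] == minp[row[0]]]


rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
]
worst = hist_loss(rows)
```

As with the Hive query, a portfolio whose minimum return occurs on several dates yields one row per date.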
33. Portfolio Demo dataflow
Portfolios Portfolios
Historical Prices Live Prices for today
Intermediate Results
Largest loss Largest loss
34. Operations
✤ “Vanilla” Hadoop
✤ 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, ZooKeeper, Region Server, ...)
✤ Single points of failure
✤ Can't separate online and offline processing
✤ DataStax Enterprise
✤ Single, simplified component
✤ Self-organizes based on workload
✤ Peer to peer
✤ JobTracker failover
✤ No additional Cassandra config