6. The Traditional Hadoop Stack
✤ Slave nodes: Data Node, Task Tracker, Region Server
✤ Master nodes: Name Node, Secondary Name Node, Job Tracker, HBase Master, ZooKeeper, MetaStore
✤ Client nodes: Pig, Hive
Monday, July 25, 2011
9. Brisk Highlights
✤ Easy to deploy and operate
✤ No single points of failure
✤ Scale and change nodes with no downtime
✤ Cross-DC, multi-master clusters
✤ Allocate resources for OLAP vs OLTP
✤ With no ETL
10. Cassandra data model
✤ ColumnFamilies contain rows + columns
✤ (Not really schemaless for a while now)
row key   password   name               site
zznate    *          Nate McCall
driftx    *          Brandon Williams
jbellis   *          Jonathan Ellis     datastax.com
11. Sparse
zznate:    password = *, name = Nate McCall
driftx:    password = *, name = Brandon Williams
jbellis:   password = *, name = Jonathan Ellis, site = datastax.com
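The sparse layout above can be pictured as a map from row key to a per-row map of columns. The Python sketch below only illustrates that idea and says nothing about Cassandra's storage engine; the `users` variable and the `columns_of` helper are hypothetical names introduced for the example.

```python
# Each row key maps to its own dict of column name -> value.
# A row stores only the columns it actually has (sparse layout).
users = {
    "zznate": {"password": "*", "name": "Nate McCall"},
    "driftx": {"password": "*", "name": "Brandon Williams"},
    "jbellis": {"password": "*", "name": "Jonathan Ellis", "site": "datastax.com"},
}


def columns_of(rows, key):
    """Return the sorted column names present in one row (hypothetical helper)."""
    return sorted(rows[key])


for key in users:
    print(key, columns_of(users, key))
```

Only the `jbellis` row carries a `site` column; the other rows simply omit it rather than storing an empty value.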
14. CassandraFS
✤ Data is stored as ByteBuffer internally, an excellent fit for blocks
✤ Local reads mmap data directly (no RPC)
✤ Blocks are compressed with Google Snappy
✤ Copy existing data from HDFS: hadoop distcp hdfs:///mydata cfs:///mydata
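As a rough illustration of block storage with per-block compression, the Python sketch below splits a byte string into fixed-size blocks and compresses each one independently. It is not CassandraFS code: `zlib` stands in for Snappy (which is not in the Python standard library), and the block size and the helpers `split_into_blocks` and `read_back` are assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed block size, for illustration only


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and compress each block separately."""
    return [
        zlib.compress(data[offset:offset + block_size])  # zlib stands in for Snappy
        for offset in range(0, len(data), block_size)
    ]


def read_back(blocks):
    """Decompress every block and concatenate them into the original bytes."""
    return b"".join(zlib.decompress(block) for block in blocks)


payload = b"example line of file content\n" * 10000
blocks = split_into_blocks(payload)
print(len(blocks), "blocks,", sum(len(b) for b in blocks), "compressed bytes")
```

Compressing each block on its own means a reader can fetch and decompress a single block without touching the rest of the file.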
15. Hive support
✤ Hive MetaStore in Cassandra
✤ Unified schema view from any node, with no external systems and no SPOF
✤ Automatically maps Cassandra column families to Hive tables
✤ Supports static and dynamic column families (and supercolumns)
16. Hive: CFS and ColumnFamilies
-- Hive-managed table whose data is stored in CFS
CREATE TABLE users (name STRING, zip INT);
LOAD DATA LOCAL INPATH 'kv2.txt' OVERWRITE INTO TABLE users;

-- Static column family mapped to a Hive table
CREATE EXTERNAL TABLE Keyspace1.Users (name STRING, zip INT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';

-- Dynamic column family: one Hive row per (row key, column) pair
CREATE EXTERNAL TABLE Keyspace1.Users
  (row_key STRING, column_name STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';
17. Pig Support
✤ With standard Cassandra:
$ export PIG_HOME=/path/to/pig
$ export PIG_INITIAL_ADDRESS=localhost
$ export PIG_RPC_PORT=9160
$ export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
$ contrib/pig/bin/pig_cassandra
grunt>
✤ With Brisk:
$ bin/brisk pig
grunt>
18. Pig: CFS and ColumnFamilies
grunt> data = LOAD 'cfs:///example.txt' using PigStorage()
           AS (name:chararray, value:long);
grunt> data = LOAD 'cassandra://Demo1/Scores' using CassandraStorage()
           AS (key, columns: {T: tuple(name, value)});
grunt> data = LOAD 'cassandra://Demo1/Scores&slice_start=M&slice_end=S'
           using CassandraStorage()
           AS (key, columns: {T: tuple(name, value)});
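The `slice_start` and `slice_end` parameters restrict each row to a range of column names. The Python sketch below shows one way to picture that, assuming columns are kept sorted by name and both bounds are inclusive. The `slice_columns` helper and the sample row are hypothetical and do not reflect CassandraStorage internals.

```python
from bisect import bisect_left, bisect_right


def slice_columns(columns, start, end):
    """Return (name, value) pairs whose name falls within [start, end].

    `columns` must be a list of (name, value) tuples sorted by name.
    """
    names = [name for name, _ in columns]
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    return columns[lo:hi]


# Hypothetical row from a 'Scores' column family: column name -> score.
row = sorted({"Alice": 10, "Mallory": 7, "Nina": 12, "Sam": 3, "Zed": 9}.items())
print(slice_columns(row, "M", "S"))
```

In this sketch a column named `Sam` falls outside the slice because the string `"Sam"` sorts after the end bound `"S"`.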
24. Data model: Analytics
portfolio_returns
portfolio    rdate        preturn
Portfolio1   2011-07-25   $118.21
Portfolio1   2011-07-24   $60.78
Portfolio1   2011-07-23   -$34.81
Portfolio2   2011-07-25   $2143.92
Portfolio3   2011-07-24   -$10.19
INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key AS portfolio,
       rdate,
       SUM(b.return) AS preturn
FROM portfolios a JOIN 10dayreturns b
  ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;
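The query above joins each portfolio's tickers to per-ticker returns and sums them per portfolio and date. A minimal Python sketch of the same join-then-aggregate logic follows. The input rows are hypothetical (the slides do not show the contents of `portfolios` or `10dayreturns`), and the function name `portfolio_returns` is borrowed from the target table purely for readability.

```python
from collections import defaultdict

# Hypothetical inputs: the slides do not show these tables' contents.
# portfolios: (row_key, column_name) pairs, one per ticker held.
portfolios = [
    ("Portfolio1", "AAA"),
    ("Portfolio1", "BBB"),
    ("Portfolio2", "AAA"),
]
# returns: (ticker, rdate, return) rows, standing in for 10dayreturns.
returns = [
    ("AAA", "2011-07-25", 5.0),
    ("BBB", "2011-07-25", -2.0),
    ("AAA", "2011-07-24", 1.5),
]


def portfolio_returns(portfolios, returns):
    """Join on ticker, then sum returns per (portfolio, rdate)."""
    by_ticker = defaultdict(list)
    for ticker, rdate, ret in returns:
        by_ticker[ticker].append((rdate, ret))

    totals = defaultdict(float)
    for row_key, ticker in portfolios:
        for rdate, ret in by_ticker.get(ticker, []):
            totals[(row_key, rdate)] += ret
    return dict(totals)


result = portfolio_returns(portfolios, returns)
for (portfolio, rdate), preturn in sorted(result.items()):
    print(portfolio, rdate, preturn)
```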
25. Data model: Analytics
HistLoss
portfolio    worst_date   loss
Portfolio1   2011-07-23   -$34.81
Portfolio2   2011-03-11   -$11432.24
Portfolio3   2011-05-21   -$1476.93
INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, MIN(preturn) AS minp
  FROM portfolio_returns
  GROUP BY portfolio
) a
JOIN portfolio_returns b
  ON (a.portfolio = b.portfolio AND a.minp = b.preturn);
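The query works in two steps: find each portfolio's minimum return, then join back to recover the date on which it occurred. The Python sketch below mirrors those steps using only the Portfolio1 rows shown in the portfolio_returns table on the earlier slide, with amounts written as plain floats. The `worst_days` helper is a hypothetical name.

```python
# Rows copied from the portfolio_returns table on the earlier slide
# (Portfolio1 only); amounts are written as plain floats.
rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
]


def worst_days(rows):
    """Return {portfolio: (rdate, preturn)} for each portfolio's minimum return."""
    # Step 1: the inner query, min(preturn) per portfolio.
    minimum = {}
    for portfolio, _, preturn in rows:
        if portfolio not in minimum or preturn < minimum[portfolio]:
            minimum[portfolio] = preturn
    # Step 2: join back to recover the date on which the minimum occurred.
    worst = {}
    for portfolio, rdate, preturn in rows:
        if preturn == minimum[portfolio]:
            worst[portfolio] = (rdate, preturn)
    return worst


print(worst_days(rows))
```

Unlike the join in the query, which would emit one row per tied date, this sketch keeps only the last date when several dates share the same minimum.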
26. Portfolio Demo dataflow
[Diagram labels: Portfolios; Historical Prices; Intermediate Results; Largest loss; Web-based Portfolios; Live Prices for today; Largest loss]