Real Time Analytics with Cassandra, Hive, and Solr

LA Cassandra User’s Group, March 2013
By Aaron Stannard and Caglar Oner

What is MarkedUp?

MarkedUp: powerful analytics tools for native apps.

• Understand your audience. Gain valuable data on your users.
• Monitor your app’s health. Log errors and crashes remotely.
• Drive more sales. Better data = more revenue.

What Our Customers See

The Analytics CAP Theorem - SCV

The Truth about Analytics

Real time analytics isn’t inherently superior or necessary.

Analytic Speed is a Business Decision

MarkedUp is a Blend of Both

Why Cassandra for Real Time?

Our Technology Stack

Cassandra Setup on EC2

Time Series Schema 0: All Knowns

Time Series Schema 1: Bounded Number of Unknowns

Time Series Schema 2: Unbounded Number of Unknowns

Schema Design Considerations

Data, data and more data

More Data = More Problems
• Every day is our peak day.
• Every day we sign up more apps and our apps sign up more users.

How Do We Manage this Data?

Managing a Cluster

Hadoop Overview

• Two components:
• HDFS
• Map / Reduce
• Mimics Google File System and Map / Reduce
• Created by Doug Cutting and Mike Cafarella, with much early development at Yahoo!
• Part of the Apache Software Foundation
• Mature and has rich ecosystem
• Scalable
• Highly available
• SLOOOOOWW

Hive Overview

• A SQL Map / Reduce abstraction
• Originally developed as a BI tool for data warehousing @ Facebook
• Rich data types (structs, lists and maps)
• Efficient implementation of joins, group-by's
• Scalable – easy to shard across nodes
• Tunable performance
• Extensible
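
To make "a SQL Map / Reduce abstraction" concrete, here is a rough Python sketch of how a filtered GROUP BY count could break down into map, shuffle, and reduce steps. It is an illustration only, not Hive's actual execution plan; the sample rows and the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this example.

```python
from collections import defaultdict

# Hypothetical input rows: (applicationId, date) pairs standing in for a Hive table.
rows = [
    ("app-a", "2013-03-12"),
    ("app-b", "2013-03-12"),
    ("app-a", "2013-03-12"),
    ("app-a", "2013-03-11"),
]

def map_phase(rows, day):
    """Apply the WHERE filter and emit one (key, 1) pair per matching row."""
    for app_id, date in rows:
        if date == day:
            yield app_id, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key, which gives the COUNT per GROUP BY key."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(rows, "2013-03-12")))
print(counts)
```

Hive generates and schedules jobs of roughly this shape from the SQL text, which is why the query stays short while the work still runs across the cluster.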

Hive in Production

• Simple Aggregations
SELECT applicationId, COUNT(1) AS dataPoints
FROM ApplicationTable
WHERE date = '2013-03-12'
GROUP BY applicationId

• Data Mining
User Retention
User Engagement
• Ad hoc Analysis
How many sales by X?

Hive Syntax

Query: count the number of items where “key” is greater than 100

RDBMS> select key, count(1) from kv1
where key > 100 group by key;

Hive> select key, count(1) from kv1
where key > 100 group by key;
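
Because the query text is identical on both sides, the RDBMS half can be tried locally. The sketch below runs it against an in-memory SQLite database through Python's standard `sqlite3` module. The `kv1` columns and sample rows are assumptions made for illustration, and an `order by` is appended only so the printed result has a stable order.

```python
import sqlite3

# In-memory database standing in for the RDBMS side of the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv1 (key INTEGER, value TEXT)")  # assumed schema
conn.executemany(
    "INSERT INTO kv1 (key, value) VALUES (?, ?)",
    [(50, "a"), (150, "b"), (150, "c"), (200, "d")],  # made-up sample rows
)

# Same statement as on the slide, plus ORDER BY for deterministic output.
query = "select key, count(1) from kv1 where key > 100 group by key order by key"
result = conn.execute(query).fetchall()
print(result)
conn.close()
```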

Hive Tables vs. Cassandra Column Families

• When working with Cassandra, Hive still requires its own tables
• These are typically stored in HDFS
• In DataStax Enterprise Edition this storage is implemented in a dedicated Cassandra keyspace
• Hive tables are mapped directly onto Cassandra column families

Mapping Hive to Cassandra

CREATE EXTERNAL TABLE AverageTime (rowkey binary, dates bigint, total bigint)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.ks.name" = "myKeySpace",
  "cassandra.cf.name" = "myColumnFamily",
  "cassandra.columns.mapping" = ":key, :column, :value" )
TBLPROPERTIES ( "cassandra.range.size" = "100", "cassandra.slice.predicate.size" = "100" );
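
One way to picture the ":key, :column, :value" mapping is as a flattening of each Cassandra cell into one three-field Hive row. The Python sketch below assumes that reading of the mapping; the column family contents and the helper name `to_hive_rows` are hypothetical, and the storage handler's real behavior should be checked against its documentation.

```python
# Hypothetical column family contents: row key -> {column name: value}.
column_family = {
    b"app-a": {20130311: 40, 20130312: 55},
    b"app-b": {20130312: 12},
}

def to_hive_rows(cf):
    """Flatten every (row key, column, value) cell into one Hive-style row."""
    for rowkey, columns in cf.items():
        # Sort by column name so each row key's cells come out in a stable order.
        for column, value in sorted(columns.items()):
            yield rowkey, column, value

hive_rows = list(to_hive_rows(column_family))
for row in hive_rows:
    print(row)  # fields line up with (rowkey, dates, total) in the table above
```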

Writing from Hive to Cassandra

INSERT OVERWRITE TABLE AverageTime
SELECT secondQuery.Project, secondQuery.SessionDay, AVG(secondQuery.SessionSum)
FROM
(
  SELECT firstQuery.Project, firstQuery.SessionDay,
         SUM(firstQuery.SessionLength) as SessionSum
  FROM
  (
    SELECT HSessions.Project as Project, HSessions.Day as SessionDay,
           HSessions.Length as SessionLength, HSessions.User as UserId
    FROM HSessions
  ) firstQuery
  GROUP BY firstQuery.Project, firstQuery.SessionDay, firstQuery.UserId
) secondQuery
GROUP BY secondQuery.Project, secondQuery.SessionDay;
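
The nested query computes, for each project and day, the average of each user's total session time. The same two-stage aggregation can be sketched in plain Python to check the logic on a handful of rows. The sample data and the function name `average_time` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical HSessions rows: (project, day, user, session_length).
sessions = [
    ("p1", 1, "u1", 10),
    ("p1", 1, "u1", 20),
    ("p1", 1, "u2", 50),
    ("p1", 2, "u1", 5),
]

def average_time(sessions):
    # Inner GROUP BY: total session length per (project, day, user).
    per_user = defaultdict(int)
    for project, day, user, length in sessions:
        per_user[(project, day, user)] += length

    # Outer GROUP BY: average of those per-user totals per (project, day).
    totals = defaultdict(list)
    for (project, day, _user), session_sum in per_user.items():
        totals[(project, day)].append(session_sum)
    return {key: sum(values) / len(values) for key, values in totals.items()}

result = average_time(sessions)
print(result)
```

On day 1 the two users total 30 and 50, so the average is 40.0. Averaging the raw session lengths instead would give a different number, which is why the query sums per user first.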

Hive User-Defined Functions

# export HIVE_AUX_JARS_PATH=/home/user/scripts
# custom function
hive> create temporary function myFunc as 'com.markedup.hive.udf.Compositor';

hive> SELECT myFunc(Apps.AppId, 'DailySessions') FROM Apps WHERE ….

Hive Tips and Tricks

• Don’t write data from Hive back to a hot Cassandra column family
• Create dedicated Cassandra column families for Hive
• You can write to multiple places on a single Hive read
• Use sampling to test Hive queries on scaled-down data sets
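
For the sampling tip, Hive offers a TABLESAMPLE clause. The idea behind bucket-style sampling can be sketched in Python: hash a key, keep only the rows that land in one bucket, and get the same scaled-down subset on every run. The helper `in_sample`, the choice of CRC32 as the hash, and the generated rows are all assumptions made for this illustration.

```python
import zlib

def in_sample(key: str, buckets: int = 10, bucket: int = 0) -> bool:
    """Keep a row when its key hashes into the chosen bucket (roughly 1/buckets of keys)."""
    return zlib.crc32(key.encode("utf-8")) % buckets == bucket

# Made-up rows: (user id, some measurement).
rows = [(f"user-{i}", i % 7) for i in range(10_000)]

# Deterministic subset: the same keys are selected on every run.
sample = [row for row in rows if in_sample(row[0])]
print(len(sample), "of", len(rows), "rows kept")
```

Sampling on a key rather than at random keeps all rows for a selected user together, so per-user queries still behave sensibly on the smaller set.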

Solr

• Solr: Lucene-based indexing engine
• Part of the Apache Software Foundation
• Full-text search
• Faceted search
• Distributed
• Integrates well with Cassandra

Real World Example

How do you count a billion distinct items in real-time?

We Are Hiring!

We’re hiring.
markedup.com/jobs

THANK YOU

Questions? markedup.com
team@markedup.com
