Aaron Stannard and Caglar Oner, founding engineers at MarkedUp Analytics (www.markedup.com), on how they use Cassandra, Hive, and Solr to create real-time reports for hundreds of customers under burst loads.
Learn how to take advantage of Cassandra schema design, counter columns, Hive analysis, and Solr indexing to create robust, scalable analytics solutions.
This talk was presented at the Los Angeles Cassandra Users meetup (http://www.meetup.com/Los-Angeles-Cassandra-Users/events/104649892/) on March 12, 2013.
Real Time Analytics with Cassandra, Hive, and Solr
1. Real Time Analytics with Cassandra, Hive, and Solr
LA Cassandra User’s Group
By Aaron Stannard and Caglar Oner
March 2013
2. What is MarkedUp?
MarkedUp: powerful analytics tools for native apps.
• Understand your audience. Gain valuable data on your users.
• Monitor your app’s health. Log errors and crashes remotely.
• Drive more sales. Better data = more revenue.
19. Data, data and more data
More Data = More Problems
• Every day is our peak day.
• Every day we sign up more apps and our apps sign up more users.
22. Hadoop Overview
• Two components:
• HDFS
• Map / Reduce
• Mimics Google File System and Map / Reduce
• Started by Yahoo!
• Part of Apache Foundation
• Mature and has rich ecosystem
• Scalable
• Highly available
• SLOOOOOWW
23. Hive Overview
• A SQL Map / Reduce abstraction
• Originally developed as a BI tool for data warehousing @ Facebook
• Rich data types (structs, lists and maps)
• Efficient implementation of joins, group-by's
• Scalable – easy to shard across nodes
• Tunable performance
• Extensible
24. Hive in Production
• Simple Aggregations
SELECT applicationId, COUNT(1) AS dataPoints
FROM ApplicationTable
WHERE date = '2013-03-12'
GROUP BY applicationId
• Data Mining
User Retention
User Engagement
• Ad hoc Analysis
How many sales by X?
25. Hive Syntax
Query: count the number of items where “key” is greater than 100
RDBMS> select key, count(1) from kv1
where key > 100 group by key;
Hive> select key, count(1) from kv1
where key > 100 group by key;
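Because the statement is plain SQL, it can be tried against a small relational table before it is run on a cluster. The following is a minimal sketch using Python's built-in sqlite3 module. The table layout (an integer key and a text value) and the sample rows are assumptions made for illustration, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

def count_keys_over(rows, threshold=100):
    """Run the slide's query against an in-memory SQLite table named kv1."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: the slide does not show how kv1 is defined.
    conn.execute("CREATE TABLE kv1 (key INTEGER, value TEXT)")
    conn.executemany("INSERT INTO kv1 (key, value) VALUES (?, ?)", rows)
    result = conn.execute(
        # Same query as on the slide, plus ORDER BY for a stable result order.
        "SELECT key, COUNT(1) FROM kv1 WHERE key > ? GROUP BY key ORDER BY key",
        (threshold,),
    ).fetchall()
    conn.close()
    return result

# Hypothetical sample data: (key, value) pairs.
sample_rows = [(50, "a"), (101, "b"), (101, "c"), (250, "d"), (100, "e")]
print(count_keys_over(sample_rows))  # [(101, 2), (250, 1)]
```

Keys 50 and 100 are filtered out by the WHERE clause, and the two rows with key 101 collapse into a single group with a count of 2.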
26. Hive Tables vs. Cassandra Column Families
• When working with Cassandra, Hive still requires its own tables
• These are typically stored in HDFS
• In DataStax Enterprise Edition they are stored in a dedicated Cassandra keyspace
• Hive tables are mapped directly onto Cassandra column families
28. Writing from Hive to Cassandra
INSERT OVERWRITE TABLE AverageTime
SELECT secondQuery.Project, secondQuery.SessionDay, AVG(secondQuery.SessionSum)
FROM
(
SELECT firstQuery.Project, firstQuery.SessionDay, SUM(firstQuery.SessionLength) as SessionSum
FROM
(
SELECT HSessions.Project as Project, HSessions.Day as SessionDay,
HSessions.Length as SessionLength, HSessions.User as UserId
FROM HSessions
) firstQuery
GROUP BY firstQuery.Project, firstQuery.SessionDay, firstQuery.UserId
) secondQuery
GROUP BY secondQuery.Project, secondQuery.SessionDay;
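The nested query does two things: the inner GROUP BY totals session length per project, day, and user, and the outer GROUP BY averages those per-user totals per project and day. The same logic can be sketched in plain Python to check expected results on a handful of rows. The function name, tuple layout, and sample data below are illustrative assumptions, not part of the original deck.

```python
from collections import defaultdict

def average_session_time(sessions):
    """sessions: iterable of (project, day, user, length) tuples.

    Returns {(project, day): average of the per-user total session length}.
    """
    # Stage 1 (inner GROUP BY): total session length per project, day, and user.
    per_user = defaultdict(float)
    for project, day, user, length in sessions:
        per_user[(project, day, user)] += length

    # Stage 2 (outer GROUP BY): collect the per-user totals per project and day.
    totals = defaultdict(list)
    for (project, day, _user), session_sum in per_user.items():
        totals[(project, day)].append(session_sum)

    # AVG over the per-user totals.
    return {key: sum(values) / len(values) for key, values in totals.items()}

# Hypothetical rows standing in for the HSessions table.
rows = [
    ("app1", "2013-03-12", "u1", 30.0),
    ("app1", "2013-03-12", "u1", 10.0),
    ("app1", "2013-03-12", "u2", 20.0),
    ("app1", "2013-03-13", "u1", 5.0),
]
print(average_session_time(rows))
```

On March 12, user u1 totals 40 and user u2 totals 20, so the average per user is 30. Averaging the raw session lengths directly would give 20 instead, which is why the query needs the intermediate per-user sum.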
29. Hive User-Defined Functions
29
# export
HIVE_AUX_JARS_PATH=/home/user/scripts
# custom function
hive> create temporary function myFunc as
'com.markedup.hive.udf.Compositor';
SELECT myFunc(Apps.AppId, “DailySessions”) FROM
Apps WHERE ….
30. Hive Tips and Tricks
• Don’t write data from Hive back to a hot Cassandra column family
• Create dedicated Cassandra column families for Hive
• You can write to multiple places on a single Hive read
• Use sampling to test Hive queries on scaled-down data sets
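The sampling tip can be illustrated with deterministic bucket sampling: hash a key, keep only the rows that fall into one bucket, and test the query logic on that subset. Hive has its own TABLESAMPLE clause for this purpose; the sketch below only shows the underlying idea in Python and does not reproduce Hive's hashing. The function names and the generated events are hypothetical.

```python
import hashlib

def in_sample(key, bucket=0, num_buckets=10):
    """Return True if the key hashes into the chosen bucket."""
    # MD5 is used here only as a stable, well-mixed hash (an assumption of this sketch).
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets == bucket

def sample_rows(rows, key_of, bucket=0, num_buckets=10):
    """Keep roughly 1/num_buckets of the rows, chosen by hashing key_of(row)."""
    return [row for row in rows if in_sample(key_of(row), bucket, num_buckets)]

# Hypothetical data set: one event per user.
events = [("user-%d" % i, i) for i in range(10000)]
subset = sample_rows(events, key_of=lambda row: row[0])
print(len(subset), "of", len(events), "rows kept")
```

Hashing on the user key rather than picking rows at random keeps all rows for a sampled user together, so per-user aggregates computed on the sample remain meaningful.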
31. Solr
• Solr: Lucene-based indexing engine
• Part of Apache Foundation
• Full-text search
• Faceted search
• Distributed
• Integrates well with Cassandra
32. Real World Example
How do you count a billion distinct items in real-time?