2. Agenda
Ooyala – quick company overview
What do we mean by “video analytics”?
What are the challenges?
Cassandra at Ooyala - technical details
Lessons learned
Q&A
18. State of Analytics Today
Collect vast amounts of data
Aggregate, slice in various dimensions
Report and visualize
Personalize and recommend
Scalable, fault tolerant, near real-time using Hadoop + Cassandra
20. Challenge: Scale
150M+ unique monthly users
15M+ monthly video hours
Daily inflow: billions of log pings, TBs of uncompressed logs
10TB+ of historical analytics data in C* covering a period of about 4 years
Exponential data growth in C*: currently 1TB+ per month
21. Challenge: Processing Speed
Large “fan-out” to multiple dimensions + per-video-asset analytics = lots of data being written. Parallelizable!
“Analytics delay” metric = time from a log ping hitting a server to being visible to a publisher in the analytics UI
Current avg. delay: 10-25 minutes depending on time of day
Target max analytics delay: <30 minutes (Hadoop system)
Would like <1 minute (future real-time processing system)
22. Challenge: Depth
Per-video-asset analytics means millions of new rows added and/or updated in each CF every day
10+ dimensions (CFs) for slicing data in different ways
Queries range from “everything in my account for all time” to “video X in city Y on date Z”
We’d like 1-hour granularity, but that’s up to 24x more rows
Or even 1-minute granularity in real-time, but that could be >1000x more rows …
23. Challenge: Accuracy
Publishers make business decisions based on analytics data
Ooyala makes business decisions based on analytics data
Ooyala bills publishers based on analytics data
Analytics need to be accurate and verifiable
24. Challenge: Developer Speed
We’re still a small company with limited developer resources
Like to iterate fast and release often, but …
… we use Hadoop MR for large-scale data processing
Hadoop is a Java framework
So, MapReduce jobs have to be written in Java … right?
28. Challenge: Developer Speed
Word Count MR – Language Comparison

Language   Lines   Characters   Development Speed   Runtime Speed   Hadoop API
Java       69      2395         Low                 High            Native
Ruby       30      738          High                Low             Streaming
Scala      35      1284         Medium              High            Native
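For a sense of what the Scala row means in practice, here is a minimal word-count mapper and reducer in Scala against the native Hadoop API (the org.apache.hadoop.mapreduce classes). This is an illustrative reconstruction, not the code that was measured above.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Emit (word, 1) for every whitespace-separated token in the input line.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
  }
}

// Sum the counts emitted for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val sum = values.asScala.foldLeft(0)(_ + _.get)
    ctx.write(key, new IntWritable(sum))
  }
}

Because Scala compiles to JVM bytecode, these classes plug into the native Hadoop API directly, avoiding the Streaming overhead that makes the Ruby version slow at runtime.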
30. A bit of history
2008 – 2009: Single MySQL DB
Early 2010:
Too much data
Want higher granularity and more ways to slice data
Need a scalable data store!
31. Why Cassandra?
Linear scaling (space, load) – handles Scale & Depth challenges
Tunable consistency – QUORUM/QUORUM R/W allows accuracy
Very fast writes, reasonably fast reads
Great community support, rapidly evolving and improving codebase – 0.6.13 => 0.8.7 increased our performance by >4x
Simpler and fewer dependencies than HBase, richer data model than a simple K/V store, more scalable than an RDBMS, …
32. Data Model - Overview
Row keys specify the entity and time (and some other stuff …)
Column families specify the dimension
Column names specify a data point within that dimension
Column values are maps of key/value pairs that represent a collection of related metrics
Different groups of related metrics are stored under different row keys
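As a rough sketch of how one logical data point is addressed under this model (all identifiers below are made up for the example, not Ooyala's actual schema):

// Illustrative only – invented names, not the real schema.
case class RowKey(entity: String, granularity: String, bucketStart: Long)

val rowKey  = RowKey("video:123", "month", 1317427200L) // 2011-10-01
val cf      = "country"                                  // CF = dimension
val column  = "US"                                       // data point within the dimension
val metrics = Map("plays"          -> 1052L,             // column value: a map of
                  "displays"       -> 4210L,             // related metrics
                  "timeWatchedSec" -> 98231L)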
34. Data Model - Timestamps
Row keys have a timestamp component
Row keys have a time granularity component
Allows for efficient queries over large time ranges (few row keys with big numbers)
Preserves granularity at smaller time ranges
Currently Month/Week/Day. Maybe Hour/Minute in the future?
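A sketch of how such keys might be composed. The actual key encoding is not shown in the talk, so this layout is an assumption:

import java.util.{Calendar, TimeZone}

// Truncate a timestamp (ms) to the start of its day/week/month bucket.
def bucketStart(ts: Long, granularity: String): Long = {
  val cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
  cal.setTimeInMillis(ts)
  cal.set(Calendar.HOUR_OF_DAY, 0)
  cal.set(Calendar.MINUTE, 0)
  cal.set(Calendar.SECOND, 0)
  cal.set(Calendar.MILLISECOND, 0)
  granularity match {
    case "day"   => ()
    case "week"  => cal.set(Calendar.DAY_OF_WEEK, cal.getFirstDayOfWeek)
    case "month" => cal.set(Calendar.DAY_OF_MONTH, 1)
  }
  cal.getTimeInMillis
}

// One row key per (entity, granularity, bucket): a month-long query
// hits one "month" row instead of 28-31 "day" rows.
def rowKey(entity: String, granularity: String, ts: Long): String =
  entity + ":" + granularity + ":" + bucketStart(ts, granularity)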
36. Data Model – Metrics
Performance – plays, displays, unique users, time watched, bytes downloaded, etc.
Sharing – tweets, Facebook shares, diggs, etc.
Engagement – how many users watched through certain time buckets of a video
QoS – bitrates, buffering events
Ad – ad requests, impressions, clicks, mouse-overs, failures, etc.
38. Data Model - Dimensions
Analytics data is sliced in different dimensions == CFs
Example: country. Column names are “US”, “CA”, “JP”, etc
Column values are aggregates of the metric for the row key in that country
For example: the video performance metrics for the month of 2011-10-01 in the US for video asset 123
Example: platform. Column names: “desktop:windows:chrome”, “tablet:ipad”, “mobile:android”, “settop:ps3”.
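A sketch of the aggregate-on-write path into a dimension CF. The MetricStore trait below is a stand-in for a real Cassandra client, not an actual library:

// Hypothetical store API – stands in for a real Cassandra client.
trait MetricStore {
  def readMetrics(cf: String, row: String, col: String): Map[String, Long]
  def writeMetrics(cf: String, row: String, col: String, m: Map[String, Long]): Unit
}

// Merge a batch of metric deltas into one column of a dimension CF, e.g.
// addToDimension(store, "country", monthRowKey, "US", Map("plays" -> 17L)).
// Note this is read-modify-write – see "The Bad Stuff" later.
def addToDimension(store: MetricStore, cf: String, row: String,
                   col: String, delta: Map[String, Long]): Unit = {
  val current = store.readMetrics(cf, row, col)
  val merged = (current.keySet ++ delta.keySet).map { k =>
    k -> (current.getOrElse(k, 0L) + delta.getOrElse(k, 0L))
  }.toMap
  store.writeMetrics(cf, row, col, merged)
}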
40. Data Model – Indices
Need to efficiently answer “Top N” queries over an aggregate of multiple rows, sorted by some field in the metrics object
But, column sort order is “CA” < “JP” < “US” regardless of field values
Would like to support multiple fields to sort on, anyway
Naïve implementation – read entire rows, aggregate, sort in RAM – pretty slow
Solution: write additional index rows to C*
41. Data Model – Indices
Every data row may have 0 or more index rows, depending on the metrics type
Index rows – empty column values, column names are prepended with the value of the indexed field, encoded as a fixed-width byte array (see the sketch below)
Rely on C* to order the columns according to the indexed field
Index rows are stored in separate CFs which have “i_” prepended to the dimension name.
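A sketch of the column-name encoding this implies, assuming a long-valued indexed field (the exact byte layout is our assumption):

import java.nio.ByteBuffer

// Prepend the indexed field as a fixed-width big-endian long, so
// Cassandra's byte ordering of column names matches numeric order.
def indexColumnName(fieldValue: Long, dataColumn: String): Array[Byte] = {
  val colBytes = dataColumn.getBytes("UTF-8")
  val buf = ByteBuffer.allocate(8 + colBytes.length)
  buf.putLong(fieldValue) // ByteBuffer is big-endian by default
  buf.put(colBytes)
  buf.array
}

// e.g. in CF "i_country", the column for "US" with plays = 1052:
// indexColumnName(1052L, "US")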
43. Data Model – Indices
Trivial to answer a “Top N” query for a single row if the field we sort on has an index: just read the last N columns of the index row (see the sketch after this list)
What if the query spans multiple rows?
Use the 3-pass uniform threshold algorithm. Guaranteed to get the top-N columns in any multi-row aggregate in 3 RPC calls. See: [http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf]
Has some drawbacks: can’t do bottom-N, computing top-N-to-2N is impossible, have to do top-2N and drop half.
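For the single-row case, a sketch; reversedSlice is a hypothetical client call, and decodeIndexColumn inverts the fixed-width encoding sketched earlier:

import java.nio.ByteBuffer

// Hypothetical client call: the last `limit` column names of a row,
// highest-sorting first. Not a real library API.
trait SliceClient {
  def reversedSlice(cf: String, row: String, limit: Int): Seq[Array[Byte]]
}

// Inverse of the fixed-width encoding: recover (fieldValue, dataColumn).
def decodeIndexColumn(name: Array[Byte]): (Long, String) = {
  val value = ByteBuffer.wrap(name).getLong
  (value, new String(name, 8, name.length - 8, "UTF-8"))
}

// Top-N for one row: just the last N columns of its index row.
def topN(client: SliceClient, dim: String, row: String, n: Int): Seq[(Long, String)] =
  client.reversedSlice("i_" + dim, row, n).map(decodeIndexColumn)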
44. Data Model – Drilldowns
All cities in the world are stored in one row, allowing us to do a global sort. What if we need cities within some region only?
Solution: use “drilldown” indices.
Just a special kind of index that includes only a subset of all data in the parent row (see the sketch below).
Example: all cities in the country “US”
Works like regular index otherwise
Not free – more than 1/3rd of all our C* disk usage
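One plausible layout for such a drilldown row; the actual key scheme is our assumption:

// A drilldown index row narrows a parent dimension to a single value;
// the key layout here is illustrative only.
def drilldownRow(baseRow: String, parentDim: String, parentValue: String): String =
  baseRow + ":" + parentDim + ":" + parentValue

// e.g. the cities-within-US index for video 123, month of 2011-10-01,
// stored in CF "i_city" alongside the global city index row:
// drilldownRow("video:123:month:1317427200", "country", "US")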
45. The Bad Stuff
Read-modify-write is slow, because in C* read latency >> write latency
Having a write-only pipeline would greatly speed up processing, but makes reading data more expensive (aggregate-on-read) – sketched below
And/or requires more complicated asynchronous aggregation
Minimum granularity of 1 day is not that good; would like to do 1-hour or 1-minute
But, storage requirements go up very fast
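To make the trade-off concrete, a sketch of the write-only alternative: append deltas as new columns and aggregate on read. Same hypothetical MetricStore as in the dimensions sketch:

// Hypothetical store API, as before.
trait MetricStore {
  def readMetrics(cf: String, row: String, col: String): Map[String, Long]
  def writeMetrics(cf: String, row: String, col: String, m: Map[String, Long]): Unit
}

// Write path: no read required – each processing batch appends its own column.
def writeDelta(store: MetricStore, cf: String, row: String,
               batchId: String, delta: Map[String, Long]): Unit =
  store.writeMetrics(cf, row, "delta:" + batchId, delta)

// Read path: the aggregation cost is now paid on every read instead.
def readAggregated(store: MetricStore, cf: String, row: String,
                   batchIds: Seq[String]): Map[String, Long] =
  batchIds
    .map(id => store.readMetrics(cf, row, "delta:" + id))
    .foldLeft(Map.empty[String, Long]) { (acc, m) =>
      (acc.keySet ++ m.keySet).map { k =>
        k -> (acc.getOrElse(k, 0L) + m.getOrElse(k, 0L))
      }.toMap
    }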
46. The Bad Stuff
Synchronous updates of time rollups and index rows make processing slower and increase delays
But, asynchronous is harder to get right
Reprocessing of data is currently difficult because of lack of locking – have to pause regular pipeline
Also have to reprocess log files in batches of full days