Scaling Video Analytics
With Apache Cassandra
             ILYA MAYKOV | Dec 6th, 2011
Agenda
Ooyala – quick company overview
What do we mean by “video analytics”?
What are the challenges?
Cassandra at Ooyala - technical details
Lessons learned
Q&A


Analytics Overview




1   Aggregate and Visualize Data


2   Give Insights


3   Enable experimentation


4   Optimize automagically



Analytics Overview




Go from this …
Analytics Overview




   … to this …
Analytics Overview




           … and this!
System Architecture




State of Analytics Today

Collect vast amounts of data
Aggregate, slice in various dimensions
Report and visualize
Personalize and recommend
Scalable, fault tolerant, near real-time
using Hadoop + Cassandra

Analytics Challenges

Scale
Processing Speed
Depth
Accuracy
Developer speed


Challenge: Scale

150M+ unique monthly users

15M+ monthly video hours

Daily inflow: billions of log pings, TBs of uncompressed logs

10TB+ of historical analytics data in C* covering a period of
about 4 years

Exponential data growth in C*: currently 1TB+ per month



Challenge: Processing Speed

Large “fan-out” to multiple dimensions + per-video-asset
analytics = lots of data being written. Parallelizable!

“Analytics delay” metric = time from log ping hitting a server to
being visible to a publisher in the analytics UI

Current avg. delay: 10-25 minutes depending on time of day

Target max analytics delay: <30 minutes (Hadoop system)

Would like <1 minute (future real-time processing system)


Challenge: Depth

Per-video-asset analytics means millions of new rows added
and/or updated in each CF every day

10+ dimensions (CFs) for slicing data in different ways

Queries range from “everything in my account for all time” to “video
X in city Y on date Z”

We’d like 1-hour granularity, but that’s up to 24x more rows

Or even 1-minute granularity in real-time, but that could be >1000x
more rows …
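
The multipliers quoted above follow from the number of buckets per day. A quick check in Python, assuming as a worst case that every day-level row has activity in every finer bucket (sparse assets would produce fewer rows); the helper name is illustrative:

```python
# Worst-case row multipliers when moving from 1-day buckets to finer ones.
# Assumption: every day-level row is split into one row per finer bucket.
MINUTES_PER_DAY = 24 * 60


def row_multiplier(bucket_minutes: int) -> int:
    """Number of fine-grained rows that replace one day-level row."""
    return MINUTES_PER_DAY // bucket_minutes


hourly = row_multiplier(60)    # one row per hour
minutely = row_multiplier(1)   # one row per minute

print(f"1-hour buckets:   up to {hourly}x more rows")
print(f"1-minute buckets: up to {minutely}x more rows")
```

This matches the "up to 24x" and ">1000x" figures: 24 hourly buckets and 1440 one-minute buckets per day.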


Challenge: Accuracy

Publishers make business decisions based on analytics data

Ooyala makes business decisions based on analytics data

Ooyala bills publishers based on analytics data

Analytics need to be accurate and verifiable




Challenge: Developer Speed
We’re still a small company with limited developer resources

Like to iterate fast and release often, but …

… we use Hadoop MR for large-scale data processing

Hadoop is a Java framework

So, MapReduce jobs have to be written in Java … right?




Word Count Example: Java




Word Count Example: Ruby




Word Count Example: Scala




Challenge: Developer Speed
Word Count MR – Language Comparison

Language  Lines  Characters  Development Speed  Runtime Speed  Hadoop API
Java      69     2395        Low                High           Native
Ruby      30     738         High               Low            Streaming
Scala     35     1284        Medium             High           Native
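
For readers unfamiliar with the "Streaming" entry in the Hadoop API column: Hadoop Streaming runs arbitrary executables as mapper and reducer, passing records over stdin and stdout as tab-separated lines. The sketch below is a word count written in that style in Python. It is not the Ruby code from the slide (the slide images are not reproduced in this transcript), the function names are illustrative, and the last lines simulate the shuffle step locally with a plain sort.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word.

    The input must be sorted by key; in a real job the shuffle phase
    between map and reduce takes care of that.
    """
    pairs = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on a single input line.
records = sorted(mapper(["to be or not to be"]))
for out in reducer(records):
    print(out)
```

In an actual Streaming job the two functions would live in separate scripts that read `sys.stdin` and print their records.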


Why Cassandra?




A bit of history

2008 – 2009: Single MySQL DB

Early 2010:

  Too much data

  Want higher granularity and more ways to slice data

  Need a scalable data store!




Why Cassandra?

Linear scaling (space, load) – handles Scale & Depth challenges

Tunable consistency – QUORUM/QUORUM R/W allows accuracy

Very fast writes, reasonably fast reads

Great community support, rapidly evolving and improving
codebase – 0.6.13 => 0.8.7 increased our performance by >4x

Simpler and fewer dependencies than HBase, richer data model
than a simple K/V store, more scalable than an RDBMS, …
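
The accuracy claim for QUORUM/QUORUM rests on an overlap argument: with replication factor N, a read of R replicas must include at least one replica that acknowledged the latest write to W replicas whenever R + W > N. A small Python check of that arithmetic follows; the replication factors are chosen for illustration only, since the deck does not state the one Ooyala used.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of replicas (the QUORUM consistency level)."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True if every read set of size r must overlap every write set of size w."""
    return r + w > n


for n in (3, 5):  # illustrative replication factors, not from the deck
    q = quorum(n)
    print(f"N={n}: QUORUM={q}, QUORUM/QUORUM overlaps: "
          f"{reads_see_latest_write(n, q, q)}")
    # ONE/ONE is cheaper but gives no overlap guarantee.
    print(f"N={n}: ONE/ONE overlaps: {reads_see_latest_write(n, 1, 1)}")
```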



Data Model - Overview

Row keys specify the entity and time (and some other stuff …)

Column families specify the dimension

Column names specify a data point within that dimension

Column values are maps of key/value pairs that represent a
collection of related metrics

Different groups of related metrics are stored under different row
keys
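
One way to picture this layout is as nested maps: column family, then row key, then column name, then a metrics map. The Python sketch below uses the numbers from the example slide that follows. The tuple-based row key and the `read_metrics` helper are illustrative assumptions, not Ooyala's actual key encoding or API.

```python
from typing import Dict, Tuple

Metrics = Dict[str, int]
RowKey = Tuple[Tuple[str, object], ...]  # e.g. (("video", 123),)

# column family (dimension) -> row key -> column name -> metrics map
store: Dict[str, Dict[RowKey, Dict[str, Metrics]]] = {
    "country": {
        (("video", 123),): {
            "CA": {"displays": 50, "plays": 40},
            "US": {"displays": 100, "plays": 75},
        },
        (("publisher", 456),): {
            "CA": {"displays": 5000, "plays": 4100},
            "US": {"displays": 1100, "plays": 756},
        },
    }
}


def read_metrics(cf: str, key: RowKey, column: str) -> Metrics:
    """Look up one data point; an empty map means it does not exist."""
    return store.get(cf, {}).get(key, {}).get(column, {})


print(read_metrics("country", (("video", 123),), "US"))
```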



Data Model – Example

CF: Country

Row key               Column “CA”                         Column “US”                        …
{video: 123, … }      { displays: 50, plays: 40, … }      { displays: 100, plays: 75, … }    …
{publisher: 456, … }  { displays: 5000, plays: 4100, … }  { displays: 1100, plays: 756, … }  …
…                     …                                   …                                  …




Data Model - Timestamps

Row keys have a timestamp component

Row keys have a time granularity component

Allows for efficient queries over large time ranges (few row keys
with big numbers)

Preserves granularity at smaller time ranges

Currently Month/Week/Day. Maybe Hour/Minute in the future?
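
The "few row keys with big numbers" idea can be sketched as a greedy cover of a date range: at each step, use the coarsest rollup row that fits entirely inside the remaining range. The Python below is a minimal illustration, not the deck's implementation. It assumes weeks start on Monday (consistent with the week key 2011/10/31 on the next slide) and months start on the 1st. The greedy choice is simple but not always the minimal set of keys.

```python
from datetime import date, timedelta
from typing import List, Tuple


def month_end(d: date) -> date:
    """Last day of the month containing d."""
    first_of_next = (d.replace(day=28) + timedelta(days=4)).replace(day=1)
    return first_of_next - timedelta(days=1)


def cover(start: date, end: date) -> List[Tuple[str, date]]:
    """Cover [start, end] (inclusive) with month, week and day row keys."""
    keys: List[Tuple[str, date]] = []
    d = start
    while d <= end:
        if d.day == 1 and month_end(d) <= end:
            keys.append(("month", d))          # whole month fits
            d = month_end(d) + timedelta(days=1)
        elif d.weekday() == 0 and d + timedelta(days=6) <= end:
            keys.append(("week", d))           # whole Monday-based week fits
            d += timedelta(days=7)
        else:
            keys.append(("day", d))            # fall back to a single day
            d += timedelta(days=1)
    return keys


keys = cover(date(2011, 10, 1), date(2011, 11, 8))
print(len(keys), "row keys instead of 39 daily rows")
for granularity, first_day in keys:
    print(granularity, first_day.isoformat())
```

For this range the cover is one month row for October plus eight day rows for 1 to 8 November.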




Data Model – Timestamps

Row key                           “CA”                “US”              …
{ video: 123, day: 2011/10/31 }   { plays: 1, … }     { plays: 1, … }   …
{ video: 123, day: 2011/11/01 }   { plays: 2, … }     { plays: 1, … }   …
{ video: 123, day: 2011/11/02 }   { plays: 4, … }     null              …
{ video: 123, day: 2011/11/03 }   { plays: 8, … }     { plays: 1, … }   …
{ video: 123, day: 2011/11/04 }   { plays: 16, … }    { plays: 1, … }   …
{ video: 123, day: 2011/11/05 }   { plays: 32, … }    { plays: 1, … }   …
{ video: 123, day: 2011/11/06 }   { plays: 64, … }    { plays: 1, … }   …
{ video: 123, week: 2011/10/31 }  { plays: 127, … }   { plays: 6, … }   …
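
The week row in the table above is the column-wise sum of its seven day rows. A short Python check of that invariant, using only the plays values shown (the metrics elided with "…" on the slide are left out); the `rollup` helper is illustrative:

```python
from collections import Counter
from typing import Dict, List

# plays per country for video 123, one entry per day,
# 2011/10/31 through 2011/11/06
daily_plays: List[Dict[str, int]] = [
    {"CA": 1, "US": 1},
    {"CA": 2, "US": 1},
    {"CA": 4},              # no "US" column on 2011/11/02
    {"CA": 8, "US": 1},
    {"CA": 16, "US": 1},
    {"CA": 32, "US": 1},
    {"CA": 64, "US": 1},
]


def rollup(rows: List[Dict[str, int]]) -> Dict[str, int]:
    """Sum each column across rows; a missing column counts as zero."""
    total: Counter = Counter()
    for row in rows:
        total.update(row)
    return dict(total)


week = rollup(daily_plays)
print(week)
```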
Data Model – Metrics

Performance – plays, displays, unique users, time watched, bytes
downloaded, etc

Sharing – tweets, facebook shares, diggs, etc

Engagement – how many users watched through certain time
buckets of a video

QoS – bitrates, buffering events

Ad – ad requests, impressions, clicks, mouse-overs, failures, etc



Data Model - Metrics

CF: Country

Row key                           Column “CA”                        Column “US”                        …
{video: 123, metrics: video, … }  { displays: 50, plays: 40, … }     { displays: 100, plays: 75, … }    …
{video: 123, metrics: ad, … }     { clicks: 3, impressions: 40, … }  { clicks: 7, impressions: 61, … }  …
…                                 …                                  …                                  …




Data Model - Dimensions
Analytics data is sliced in different dimensions == CFs

Example: country. Column names are “US”, “CA”, “JP”, etc

Column values are aggregates of the metric for the row key in that
country

For example: the video performance metrics for the month of
2011-10-01 in the US for video asset 123

Example: platform. Column names: “desktop:windows:chrome”,
“tablet:ipad”, “mobile:android”, “settop:ps3”.




Data Model - Dimensions

Row key: {video: 123, …}

CF        Column                 Value
Country   “CA”                   { plays: 20, … }
Country   “US”                   { plays: 30, … }
DMA       “SF Bay Area”          { plays: 12, … }
DMA       “NYC”                  { plays: 5, … }
Platform  “desktop:mac:chrome”   { plays: 60, … }
Platform  “settop:ps3”           { plays: 7, … }




Data Model – Indices

Need to efficiently answer “Top N” queries over an aggregate of
multiple rows, sorted by some field in the metrics object

But, column sort order is “CA” < “JP” < “US” regardless of field
values

Would like to support multiple fields to sort on, anyway

Naïve implementation – read entire rows, aggregate, sort in RAM –
pretty slow

Solution: write additional index rows to C*


Data Model – Indices

Every data row may have 0 or more index rows, depending on the
metrics type

Index rows – empty column values, column names are prepended
with the value of the indexed field, encoded as a fixed-width byte
array

Rely on C* to order the columns according to the indexed field

Index rows are stored in separate CFs which have “i_” prepended
to the dimension name.
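
The reason for a fixed-width byte encoding is that byte-wise ordering then matches numeric ordering, which a decimal string prefix would not give (the string "75" sorts after "5000"). A minimal Python sketch, assuming an 8-byte big-endian unsigned prefix and non-negative values; the actual width and layout Ooyala used are not stated in the deck, and the function names are illustrative.

```python
import struct
from typing import Dict, List, Tuple


def index_column_name(value: int, name: str) -> bytes:
    """Prefix the column name with the value as 8 big-endian bytes.

    Big-endian fixed width makes byte order equal numeric order
    (for non-negative values only).
    """
    return struct.pack(">Q", value) + name.encode("utf-8")


def decode(column: bytes) -> Tuple[int, str]:
    """Split an index column name back into (value, original name)."""
    return struct.unpack(">Q", column[:8])[0], column[8:].decode("utf-8")


def top_n(index_row: List[bytes], n: int) -> List[Tuple[int, str]]:
    """Read the last n columns of an index row, largest value first.

    index_row is assumed to be sorted ascending by bytes, as a
    byte-ordered column comparator would keep it.
    """
    if n <= 0:
        return []
    return [decode(c) for c in reversed(index_row[-n:])]


plays: Dict[str, int] = {"CA": 40, "US": 75, "JP": 9}
index_row = sorted(index_column_name(v, k) for k, v in plays.items())
print(top_n(index_row, 2))
```

The column values stay empty, as described above; the whole payload is in the column name.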



Data Model - Indices

CF: country

Row key              Column “CA”                         Column “US”                        …
{video: 123, …}      { displays: 50, plays: 40, … }      { displays: 100, plays: 75, … }    …
{publisher: 456, …}  { displays: 5000, plays: 4100, … }  { displays: 1100, plays: 756, … }  …

CF: i_country

Row key                            Column                          Column                          …
{video: 123, index: plays}         Name: “40:CA”, Value: null      Name: “75:US”, Value: null      …
{publisher: 456, index: displays}  Name: “5000:CA”, Value: null    Name: “1100:US”, Value: null    …
…                                  …                               …                               …



Data Model – Indices
Trivial to answer a “Top N” query for a single row if the field we sort
on has an index: just read the last N columns of the index row

What if the query spans multiple rows?

Use 3-pass uniform threshold algorithm. Guaranteed to get the
top-N columns in any multi-row aggregate in 3 RPC calls. See:
[http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf]

Has some drawbacks: can’t do bottom-N, and fetching ranks N+1 to 2N
directly is impossible, so have to do top-2N and drop the first half.
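
The three passes can be sketched as follows. This is an illustration of the three-phase uniform threshold idea as it is commonly described, not Ooyala's code: in-memory dictionaries stand in for the index rows, each pass corresponds to one round of reads, all values are assumed non-negative, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]  # column name -> non-negative indexed value


def top_k_of(row: Row, k: int) -> Row:
    """The k largest columns of one row (the tail of its index row)."""
    return dict(sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:k])


def partial_sums(seen: List[Row]) -> Dict[str, int]:
    """Sum the values fetched so far, per column name."""
    out: Dict[str, int] = {}
    for row in seen:
        for name, value in row.items():
            out[name] = out.get(name, 0) + value
    return out


def kth_largest(sums: Dict[str, int], k: int) -> int:
    """k-th largest partial sum, or 0 if fewer than k columns were seen."""
    values = sorted(sums.values(), reverse=True)
    return values[k - 1] if len(values) >= k else 0


def tput_top_k(rows: List[Row], k: int) -> List[Tuple[str, int]]:
    m = len(rows)
    # Pass 1: top k of every row. The k-th largest partial sum is a
    # lower bound on the k-th largest true total.
    seen = [top_k_of(r, k) for r in rows]
    tau1 = kth_largest(partial_sums(seen), k)
    # Pass 2: every column with value >= tau1 / m. A column that stays
    # below this threshold in all m rows cannot reach tau1 in total.
    t = tau1 / m
    for fetched, row in zip(seen, rows):
        fetched.update({n: v for n, v in row.items() if v >= t})
    sums = partial_sums(seen)
    tau2 = kth_largest(sums, k)
    # Prune: an unseen value is below t, so partial sum + t per unseen
    # row is an upper bound on the true total.
    candidates = [
        n for n, s in sums.items()
        if s + t * sum(1 for f in seen if n not in f) >= tau2
    ]
    # Pass 3: exact values for the surviving candidates only.
    exact = {n: sum(r.get(n, 0) for r in rows) for n in candidates}
    return sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


rows = [
    {"SF": 50, "NYC": 40, "LA": 30, "SEA": 5},
    {"NYC": 60, "LA": 35, "SF": 10, "CHI": 8},
    {"LA": 45, "CHI": 44, "SEA": 20, "NYC": 1},
]
print(tput_top_k(rows, 2))
```

Because the thresholds are lower bounds on large sums, the same trick does not yield the bottom N, which is one of the drawbacks listed above.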




Data Model – Drilldowns
All cities in the world stored in one row, allowing us to do a global
sort. What if we need cities within some region only?

Solution: use “drilldown” indices.

Just a special kind of index that includes only a subset of all data in
the parent row.

Example: all cities in the country “US”

Works like regular index otherwise

Not free – more than 1/3rd of all our C* disk usage



The Bad Stuff

Read-modify-write is slow, because in C* read latency >> write
latency

Having a write-only pipeline would greatly speed up processing,
but makes reading data more expensive (aggregate-on-read)

And/or requires more complicated asynchronous aggregation

Minimum granularity of 1 day is not that good, would like to do
1-hour or 1-minute

But, storage requirements go up very fast


The Bad Stuff

Synchronous updates of time rollups and index rows make
processing slower and increase delays

But, asynchronous is harder to get right

Reprocessing of data is currently difficult because of lack of locking
– have to pause regular pipeline

Also have to reprocess log files in batches of full days




LESSONS LEARNED


DATA MODEL CHANGES ARE PAINFUL
… so design to make them less so


EVERYTHING WILL BREAK
… so test accordingly




SEPARATE LOGICALLY DIFFERENT DATA
… it will improve performance AND make your life simpler

PERF TEST WITH PRODUCTION LOAD
… if you can afford a second cluster


http://cassandra.apache.org

http://www.datastax.com/dev

  http://www.ooyala.com




THANK YOU
Cassandra nyc 2011   ilya maykov - ooyala - scaling video analytics with apache cassandra

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Kürzlich hochgeladen (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Cassandra NYC 2011 - Ilya Maykov - Ooyala - Scaling Video Analytics with Apache Cassandra

  • 1. Scaling Video Analytics With Apache Cassandra ILYA MAYKOV | Dec 6th, 2011
  • 2. Agenda
    Ooyala – quick company overview
    What do we mean by “video analytics”?
    What are the challenges?
    Cassandra at Ooyala - technical details
    Lessons learned
    Q&A
  • 12. Analytics Overview
    1 Aggregate and Visualize Data
    2 Give Insights
    3 Enable experimentation
    4 Optimize automagically
  • 14. Analytics Overview … to this …
  • 15. Analytics Overview … and this!
  • 18. State of Analytics Today
    Collect vast amounts of data
    Aggregate, slice in various dimensions
    Report and visualize
    Personalize and recommend
    Scalable, fault tolerant, near real-time using Hadoop + Cassandra
  • 20. Challenge: Scale
    150M+ unique monthly users
    15M+ monthly video hours
    Daily inflow: billions of log pings, TBs of uncompressed logs
    10TB+ of historical analytics data in C* covering a period of about 4 years
    Exponential data growth in C*: currently 1TB+ per month
  • 21. Challenge: Processing Speed
    Large “fan-out” to multiple dimensions + per-video-asset analytics = lots of data being written. Parallelizable!
    “Analytics delay” metric = time from log ping hitting a server to being visible to a publisher in the analytics UI
    Current avg. delay: 10-25 minutes depending on time of day
    Target max analytics delay: <30 minutes (Hadoop system)
    Would like <1 minute (future real-time processing system)
  • 22. Challenge: Depth
    Per-video-asset analytics means millions of new rows added and/or updated in each CF every day
    10+ dimensions (CFs) for slicing data in different ways
    Queries range from “everything in my account for all time” to “video X in city Y on date Z”
    We’d like 1-hour granularity, but that’s up to 24x more rows
    Or even 1-minute granularity in real-time, but that could be >1000x more rows …
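The row-count multipliers on this slide follow from simple arithmetic. A minimal sketch, assuming one row per entity per time bucket (the names `BUCKETS_PER_DAY` and `row_multiplier` are illustrative, not from the talk):

```python
# Buckets per day at each granularity. One row per entity per bucket is an
# assumption made for illustration, not a statement about Ooyala's schema.
BUCKETS_PER_DAY = {"day": 1, "hour": 24, "minute": 24 * 60}

def row_multiplier(granularity: str) -> int:
    """Worst-case growth in row count relative to daily granularity."""
    return BUCKETS_PER_DAY[granularity]

for granularity in ("day", "hour", "minute"):
    print(granularity, row_multiplier(granularity))
```

The multiplier is an upper bound ("up to 24x") because a bucket with no activity presumably needs no row at all.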
  • 23. Challenge: Accuracy
    Publishers make business decisions based on analytics data
    Ooyala makes business decisions based on analytics data
    Ooyala bills publishers based on analytics data
    Analytics need to be accurate and verifiable
  • 24. Challenge: Developer Speed
    We’re still a small company with limited developer resources
    Like to iterate fast and release often, but …
    … we use Hadoop MR for large-scale data processing
    Hadoop is a Java framework
    So, MapReduce jobs have to be written in Java … right?
  • 28. Challenge: Developer Speed
    Word Count MR – Language Comparison
    Language  Lines  Characters  Development Speed  Runtime Speed  Hadoop API
    Java      69     2395        Low                High           Native
    Ruby      30     738         High               Low            Streaming
    Scala     35     1284        Medium             High           Native
  • 30. A bit of history
    2008 – 2009: Single MySQL DB
    Early 2010: Too much data
    Want higher granularity and more ways to slice data
    Need a scalable data store!
  • 31. Why Cassandra?
    Linear scaling (space, load) – handles Scale & Depth challenges
    Tunable consistency – QUORUM/QUORUM R/W allows accuracy
    Very fast writes, reasonably fast reads
    Great community support, rapidly evolving and improving codebase – 0.6.13 => 0.8.7 increased our performance by >4x
    Simpler and fewer dependencies than HBase, richer data model than a simple K/V store, more scalable than an RDBMS, …
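The accuracy claim for QUORUM/QUORUM rests on the read and write replica sets overlapping. A small sketch of that arithmetic, using the standard quorum size of floor(RF / 2) + 1 (the function names are illustrative, and this ignores failure handling and repair):

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM read or write."""
    return replication_factor // 2 + 1

def read_overlaps_write(n: int, r: int, w: int) -> bool:
    """True when every read set shares at least one replica with every write set."""
    return r + w > n

rf = 3
q = quorum(rf)
print(q, read_overlaps_write(rf, q, q))   # QUORUM reads + QUORUM writes
print(read_overlaps_write(rf, 1, 1))      # ONE/ONE gives no such guarantee
```

With a replication factor of 3, a quorum is 2 replicas, and 2 + 2 > 3, so a QUORUM read always touches at least one replica that took the latest QUORUM write.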
  • 32. Data Model - Overview
    Row keys specify the entity and time (and some other stuff …)
    Column families specify the dimension
    Column names specify a data point within that dimension
    Column values are maps of key/value pairs that represent a collection of related metrics
    Different groups of related metrics are stored under different row keys
  • 33. Data Model – Example
    CF => Country
    Keys                   Column => “CA”                       “US”                                …
    {video: 123, … }       { displays: 50, plays: 40, … }       { displays: 100, plays: 75, … }     …
    {publisher: 456, … }   { displays: 5000, plays: 4100, … }   { displays: 1100, plays: 756, … }   …
    …                      …                                    …                                   …
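The layout on the two slides above can be mimicked with nested dictionaries: column family, then row key, then column name, then a metrics map. This is only an in-memory sketch. The `row_key` helper and its tuple encoding are hypothetical, and the key fields the slide elides with "…" are left out rather than guessed.

```python
def row_key(**parts):
    """Hypothetical row-key encoding: a sorted tuple of (field, value) pairs."""
    return tuple(sorted(parts.items()))

# column family -> row key -> column name -> metrics map
store = {
    "country": {
        row_key(video=123): {
            "CA": {"displays": 50, "plays": 40},
            "US": {"displays": 100, "plays": 75},
        },
        row_key(publisher=456): {
            "CA": {"displays": 5000, "plays": 4100},
            "US": {"displays": 1100, "plays": 756},
        },
    },
}

def get_metrics(cf, key, column):
    """Look up one data point; returns None when the column is absent."""
    return store.get(cf, {}).get(key, {}).get(column)

print(get_metrics("country", row_key(video=123), "US"))
```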
  • 34. Data Model - Timestamps
    Row keys have a timestamp component
    Row keys have a time granularity component
    Allows for efficient queries over large time ranges (few row keys with big numbers)
    Preserves granularity at smaller time ranges
    Currently Month/Week/Day. Maybe Hour/Minute in the future?
  • 35. Data Model – Timestamps
    Keys                                “CA”                “US”              …
    { video: 123, day: 2011/10/31 }     { plays: 1, … }     { plays: 1, … }   …
    { video: 123, day: 2011/11/01 }     { plays: 2, … }     { plays: 1, … }   …
    { video: 123, day: 2011/11/02 }     { plays: 4, … }     null              …
    { video: 123, day: 2011/11/03 }     { plays: 8, … }     { plays: 1, … }   …
    { video: 123, day: 2011/11/04 }     { plays: 16, … }    { plays: 1, … }   …
    { video: 123, day: 2011/11/05 }     { plays: 32, … }    { plays: 1, … }   …
    { video: 123, day: 2011/11/06 }     { plays: 64, … }    { plays: 1, … }   …
    { video: 123, week: 2011/10/31 }    { plays: 127, … }   { plays: 6, … }   …
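The week row in the table above is the column-wise sum of the seven day rows. A minimal sketch of that rollup over the slide's numbers, assuming metrics are simple additive counters (the function name `roll_up` is illustrative):

```python
from collections import Counter

# Day rows for video 123, keyed by day; values are column -> metrics map.
daily_rows = {
    "2011/10/31": {"CA": {"plays": 1}, "US": {"plays": 1}},
    "2011/11/01": {"CA": {"plays": 2}, "US": {"plays": 1}},
    "2011/11/02": {"CA": {"plays": 4}},  # no "US" column that day (null on the slide)
    "2011/11/03": {"CA": {"plays": 8}, "US": {"plays": 1}},
    "2011/11/04": {"CA": {"plays": 16}, "US": {"plays": 1}},
    "2011/11/05": {"CA": {"plays": 32}, "US": {"plays": 1}},
    "2011/11/06": {"CA": {"plays": 64}, "US": {"plays": 1}},
}

def roll_up(rows):
    """Sum metrics column by column across rows of a finer granularity."""
    totals = {}
    for columns in rows:
        for name, metrics in columns.items():
            totals.setdefault(name, Counter()).update(metrics)  # update() adds counts
    return {name: dict(counter) for name, counter in totals.items()}

week_row = roll_up(daily_rows.values())
print(week_row)
```

A query over the whole week then reads one row instead of seven, which is the "few row keys with big numbers" point from the previous slide.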
  • 36. Data Model – Metrics
    Performance – plays, displays, unique users, time watched, bytes downloaded, etc.
    Sharing – tweets, Facebook shares, diggs, etc.
    Engagement – how many users watched through certain time buckets of a video
    QoS – bitrates, buffering events
    Ad – ad requests, impressions, clicks, mouse-overs, failures, etc.
  • 37. Data Model - Metrics
    CF => Country
    Keys                                Column => “CA”                     “US”                               …
    {video: 123, metrics: video, … }    { displays: 50, plays: 40, … }     { displays: 100, plays: 75, … }    …
    {video: 123, metrics: ad, … }       { clicks: 3, impressions: 40, … }  { clicks: 7, impressions: 61, … }  …
    …                                   …                                  …                                  …
  • 38. Data Model - Dimensions
    Analytics data is sliced in different dimensions == CFs
    Example: country. Column names are “US”, “CA”, “JP”, etc.
    Column values are aggregates of the metric for the row key in that country
    For example: the video performance metrics for month of 2011-10-01 in the US for video asset 123
    Example: platform. Column names: “desktop:windows:chrome”, “tablet:ipad”, “mobile:android”, “settop:ps3”.
  • 39. Data Model - Dimensions
    Key: {video: 123, …}
    CF: Country    columns “CA”, “US”
    CF: DMA        columns “SF Bay Area”, “NYC”
    CF: Platform   columns “desktop:mac:chrome”, “settop:ps3”
    Column values (left to right): { plays: 20, … } { plays: 30, … } { plays: 12, … } { plays: 5, … } { plays: 7, … } { plays: 60, … }
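Because each dimension is its own column family, a single log ping turns into one write per dimension under the same row key. A rough sketch of that fan-out follows. The ping fields, the tuple-shaped row key, and the `fan_out` helper are all hypothetical, chosen only to mirror the three CFs named on the slide.

```python
DIMENSIONS = ("country", "dma", "platform")  # one column family per dimension

def fan_out(ping):
    """Turn one ping into (cf, row_key, column_name, metrics_delta) writes."""
    key = (("day", ping["day"]), ("video", ping["video"]))
    delta = {"plays": ping["plays"]}
    return [(cf, key, ping[cf], delta) for cf in DIMENSIONS]

ping = {
    "video": 123,
    "day": "2011/10/31",
    "plays": 1,
    "country": "US",
    "dma": "SF Bay Area",
    "platform": "desktop:mac:chrome",
}
for write in fan_out(ping):
    print(write)
```

Time rollups and per-publisher rows would multiply the write count further, which is the "large fan-out" the processing-speed slide refers to.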
  • 40. Data Model – Indices
    Need to efficiently answer “Top N” queries over an aggregate of multiple rows, sorted by some field in the metrics object
    But, column sort order is “CA” < “JP” < “US” regardless of field values
    Would like to support multiple fields to sort on, anyway
    Naïve implementation – read entire rows, aggregate, sort in RAM – pretty slow
    Solution: write additional index rows to C*
  • 41. Data Model – Indices
    Every data row may have 0 or more index rows, depending on the metrics type
    Index rows – empty column values, column names are prepended with the value of the indexed field, encoded as a fixed-width byte array
    Rely on C* to order the columns according to the indexed field
    Index rows are stored in separate CFs which have “i_” prepended to the dimension name.
  • 42. Data Model - Indices
    CF => country
    Keys                              Column Name => “CA”                  “US”                                …
    {video: 123, …}                   { displays: 50, plays: 40, … }       { displays: 100, plays: 75, … }     …
    {publisher: 456, …}               { displays: 5000, plays: 4100, … }   { displays: 1100, plays: 756, … }   …
    CF => i_country
    Keys                              Columns
    {video: 123, index: plays}        Name: “40:CA”, Value: null           Name: “75:US”, Value: null          …
    {publisher: 456, index: displays} Name: “5000:CA”, Value: null         Name: “1100:US”, Value: null        …
    …                                 …                                    …                                   …
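The fixed-width encoding matters because columns sort by bytes, not by number. The sketch below builds index column names as an 8-byte big-endian integer followed by the original column name, then reads the last N. The 8-byte width, the helper names, and the use of Python's `sorted` as a stand-in for C*'s column ordering are assumptions for illustration. The slide's “40:CA” notation is a readable rendering, not the byte layout.

```python
import struct

def index_column_name(value: int, column: str) -> bytes:
    """Indexed field value as 8 big-endian bytes, then the original column name."""
    return struct.pack(">Q", value) + column.encode("utf-8")

def decode(name: bytes):
    (value,) = struct.unpack(">Q", name[:8])
    return value, name[8:].decode("utf-8")

def top_n(index_row, n):
    ordered = sorted(index_row)  # stand-in for byte-ordered column storage
    return [decode(name) for name in reversed(ordered[-n:])]

plays_by_country = {"CA": 40, "US": 75, "JP": 9}
index_row = [index_column_name(v, c) for c, v in plays_by_country.items()]
print(top_n(index_row, 2))  # [(75, 'US'), (40, 'CA')]

# Variable-width text would sort incorrectly: "9:JP" lands after "75:US".
print(sorted(["40:CA", "75:US", "9:JP"]))
```

Big-endian, fixed-width encoding makes byte order agree with numeric order for non-negative integers, so "the last N columns" really are the N largest values.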
  • 43. Data Model – Indices
    Trivial to answer a “Top N” query for a single row if the field we sort on has an index: just read the last N columns of the index row
    What if the query spans multiple rows?
    Use 3-pass uniform threshold algorithm. Guaranteed to get the top-N columns in any multi-row aggregate in 3 RPC calls.
    See: [http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf]
    Has some drawbacks: can’t do bottom-N, computing top-N-to-2N is impossible, have to do top-2N and drop half.
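The three passes can be sketched in memory as follows. This follows the general three-phase uniform-threshold idea and is not necessarily the exact variant in the cited report or in Ooyala's code. It assumes non-negative values, treats each row as a plain dict, and uses the illustrative name `top_n_multi_row`. In a real deployment each pass would be one batched read across the rows involved.

```python
import heapq
from collections import Counter

def top_n_multi_row(rows, n):
    """Top-n columns by summed value across rows (dicts of column -> value >= 0)."""
    m = len(rows)

    # Pass 1: top n columns of every row; partial sums give a lower bound.
    partial = Counter()
    for row in rows:
        for name, value in heapq.nlargest(n, row.items(), key=lambda kv: kv[1]):
            partial[name] += value
    sums = sorted(partial.values(), reverse=True)
    tau1 = sums[n - 1] if len(sums) >= n else 0
    threshold = tau1 / m  # the uniform per-row threshold

    # Pass 2: every column whose value in a row reaches the threshold.
    seen = {}  # column -> {row index: value}
    for i, row in enumerate(rows):
        for name, value in row.items():
            if value >= threshold:
                seen.setdefault(name, {})[i] = value
    lower = {name: sum(vals.values()) for name, vals in seen.items()}
    lows = sorted(lower.values(), reverse=True)
    tau2 = lows[n - 1] if len(lows) >= n else 0
    # Rows that did not report a column hold less than `threshold` for it.
    candidates = [name for name, vals in seen.items()
                  if lower[name] + threshold * (m - len(vals)) >= tau2]

    # Pass 3: exact values for the surviving candidates only.
    exact = {name: sum(row.get(name, 0) for row in rows) for name in candidates}
    return sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

days = [
    {"US": 75, "CA": 40, "JP": 10, "DE": 3},
    {"US": 20, "CA": 55, "JP": 15},
    {"US": 5, "JP": 50, "DE": 30, "FR": 1},
]
print(top_n_multi_row(days, 2))  # [('US', 100), ('CA', 95)]
```

Any column in the true top n has a total of at least `tau1`, so at least one row holds `tau1 / m` or more of it, and pass 2 is guaranteed to see it. With index rows sorted by value, passes 1 and 2 become "last N columns" and "all columns above a value" slices. The same bound does not help for bottom-N, which matches the drawback noted on the slide.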
  • 44. Data Model – Drilldowns
    All cities in the world stored in one row, allowing us to do a global sort.
    What if we need cities within some region only?
    Solution: use “drilldown” indices.
    Just a special kind of index that includes only a subset of all data in the parent row.
    Example: all cities in the country “US”
    Works like regular index otherwise
    Not free – more than 1/3rd of all our C* disk usage
  • 45. The Bad Stuff
    Read-modify-write is slow, because in C* read latency >> write latency
    Having a write-only pipeline would greatly speed up processing, but makes reading data more expensive (aggregate-on-read)
    And/or requires more complicated asynchronous aggregation
    Minimum granularity of 1 day is not that good, would like to do 1-hour or 1-minute
    But, storage requirements go up very fast
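The trade-off between read-modify-write and a write-only pipeline can be shown with two toy stores. Both classes are hypothetical in-memory stand-ins, not Cassandra clients. The first pays one read per update, while the second only appends deltas and sums them when queried.

```python
class ReadModifyWriteStore:
    """Keeps one aggregated value per key; every update needs a read first."""
    def __init__(self):
        self.cells = {}
        self.reads = 0

    def add(self, key, plays):
        self.reads += 1                      # the slow step when reads >> writes
        current = self.cells.get(key, 0)
        self.cells[key] = current + plays

    def get(self, key):
        return self.cells.get(key, 0)

class WriteOnlyStore:
    """Appends deltas blindly; aggregation happens at read time."""
    def __init__(self):
        self.deltas = {}

    def add(self, key, plays):
        self.deltas.setdefault(key, []).append(plays)

    def get(self, key):
        return sum(self.deltas.get(key, []))  # aggregate-on-read

rmw, wo = ReadModifyWriteStore(), WriteOnlyStore()
for plays in (1, 2, 4):
    rmw.add("video:123", plays)
    wo.add("video:123", plays)
print(rmw.get("video:123"), wo.get("video:123"), rmw.reads)
```

Both stores answer the same query, but the write-only one defers the cost to readers unless something compacts the deltas asynchronously, which is the extra complexity the slide mentions.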
  • 46. The Bad Stuff
    Synchronous updates of time rollups and index rows make processing slower and increase delays
    But, asynchronous is harder to get right
    Reprocessing of data is currently difficult because of lack of locking – have to pause regular pipeline
    Also have to reprocess log files in batches of full days
  • 48. DATA MODEL CHANGES ARE PAINFUL … so design to make them less so
  • 49. EVERYTHING WILL BREAK … so test accordingly
  • 50. SEPARATE LOGICALLY DIFFERENT DATA … it will improve performance AND make your life simpler
  • 51. PERF TEST WITH PRODUCTION LOAD … if you can afford a second cluster