SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Realtime Analytics
  with Apache
   Cassandra
        Tom Wilkie
 Founder & CTO, Acunu Ltd
       @tom_wilkie
Combining “big” and “real-time” is hard

    Live & historical                    Drill downs
                         Trends...
      aggregates...                      and roll ups




2
                                                        Analytics
Solution              Con

                       Scalability
                         $$$


                      Not realtime


               Spartan query semantics =>
                 complex, DIY solutions

3
                                            Analytics
Example I
    eg “show me the number of mentions of
        ‘Acunu’ per day, between May and
          November 2011, on Twitter”


    Batch (Hadoop) approach would require
    processing ~30 billion tweets, or ~4.2
                 TB of data
                  http://blog.twitter.com/2011/03/numbers.html


4
                                                                 Analytics
Okay, so how are we going to
                   do it?

    For each tweet,
    increment a bunch of counters,
    such that answering a query
    is as easy as reading some counters


5
                                          Analytics
Preparing the data
                              12:32:15 I like #trafficlights
Step 1: Get a feed of    12:33:43 Nobody expects...
        the tweets     12:33:49 I ate a #bee; woe is...
                      12:34:04 Man, @acunu rocks!

Step 2: Tokenise the
        tweet

Step 3: Increment counters            [1234, man]   +1
        in time buckets for           [1234, acunu] +1
        each token                    [1234, rock] +1
6
                                                              Analytics
Querying
                            start: [01/05/11, acunu]
Step 1: Do a range query    end:   [30/05/11, acunu]

                                       Key            #Mentions
                              [01/05/11 00:01, acunu]    3
Step 2: Result table          [01/05/11 00:02, acunu]    5
                                        ...              ...


                              90

Step 3: Plot pretty graph     45
                               0
                                   May Jun Jul Aug Sept Oct Nov
7
                                                              Analytics
Instead of this...
                                  Key            #Mentions
                         [01/05/11 00:01, acunu]    3
                         [01/05/11 00:02, acunu]    5
                                   ...              ...




                         We do this
                      Key           00:01       00:02        ...
                [01/05/11, acunu]     3           5          ...
                [02/05/11, acunu]    12           4          ...
                        ...           ...                    ...

    Row key is ‘big’                     Column key is ‘small’
     time bucket                             time bucket
8
                                                                   Analytics
Towards a more
    general solution...
      (Example II)



9
                          Analytics
count
                grouped by ...
                    day
  count
 distinct
(session)
     count       ... geography

avg(duration)
                  ... browser


10
                          Analytics
21:00      all→1345    :00→45      :01→62      :02→87       ...

                         22:00      all→3221    :00→22      :00→19     :02→104       ...
{
     cust_id: user01,      ...                                                       ...

     session_id: 102,      UK        all→228    user01→1   user14→12   user99→7      ...
     geography: UK,
                           US        all→354    user01→4   user04→8    user56→17     ...
     browser: IE,
     time: 22:02,          ...

}                       UK, 22:00   all→1904       ...

                           ∅        all→87314   UK→238     US→354         ...




11
                                                                                 Analytics
21:00      all→1345     :00→45     :01→62      :02→87       ...

                         22:00      all→3222     :00→22     :00→19     :02→105       ...
{
     cust_id: user01,      ...                                                       ...

     session_id: 102,      UK        all→229    user01→2   user14→12   user99→7      ...
     geography: UK,
                           US        all→354    user01→4   user04→8    user56→17     ...
     browser: IE,
     time: 22:02,          ...

}                       UK, 22:00   all→1905       ...

                           ∅        all→87315   UK→239     US→354         ...




12
                                                                                 Analytics
21:00      all→1345    :00→45      :01→62      :02→87       ...

      22:00      all→3221    :00→22      :00→19     :02→104       ...

        ...                                                       ...

        UK        all→228    user01→1   user14→12   user99→7      ...

        US        all→354    user01→4   user04→8    user56→17     ...

        ...

     UK, 22:00   all→1904       ...

        ∅        all→87314   UK→238     US→354         ...




13
                                                              Analytics
where time 21:00-22:00
 count(*)
                          21:00      all→1345    :00→45      :01→62      :02→87       ...

                          22:00      all→3222    :00→22      :01→19     :02→105       ...

                            ...                                                       ...

                            UK        all→229    user01→2   user14→12   user99→7      ...

                            US        all→354    user01→4   user04→8    user56→17     ...

                            ...

                         UK, 22:00   all→1905       ...

                            ∅        all→87315   UK→239     US→354         ...




14
                                                                                  Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345    :00→45      :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...

                             US        all→354    user01→4   user04→8    user56→17     ...

                             ...

                          UK, 22:00   all→1905       ...

                             ∅        all→87315   UK→239     US→354         ...




15
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

                             ∅        all→87315   UK→239      US→354        ...




16
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

count all                    ∅        all→87315   UK→239      US→354        ...




17
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

count all                    ∅        all→87315   UK→239      US→354        ...




group all by geo
18
                                                                                   Analytics
What about more than
       just aggregates?



19
                            Analytics
Approximate Analytics
                 Exact




     Real-time           Large Scale


20
                                       Analytics
Count Distinct

     Plan A: keep a list of all the things you’ve seen
               count them at query time


                Quick to update
                  ... but at scale ...
                Takes lots of space
                Takes a long time to query
21
                                                         Analytics
Approximate Distinct

     max # leading zeroes seen so far
         item          hash        leading zeroes   max so far

         x        00101001110...          2            2
         y        11010100111...          0            2
         z        00011101011...          3            3
                       ...
     ... to see a max of M takes about        2M    items

22
                                                                 Analytics
Approximate Distinct

            to reduce var, average over m=2k sub-streams

     item          hash          index, zeroes   max so far

     x       00101001110...          0, 0        0,0,0,0
     y       11010100111...          3, 1        0,0,0,1
     z       00011101011...          0, 1        1,0,0,1
                   ...
            take the harmonic mean
23
                                                              Analytics
Okay... now what?




                    Analytics
Analytics

                                     counter
                                     updates
Click stream    events
                          Acunu
Sensor data
                         Analytics
     etc




     •   Aggregate incrementally, on the fly
     •   Store live + historical aggregates
10x vs MySQL...




                  Analytics
Dashboard UI




27
                    Analytics
“Up and running in about 4 hours”
“We found out a competitor
  was scraping our data”

                      “We keep discovering use cases
                         we hadn’t thought of ”




                                                 Analytics
"Quick, efficient and easy to
        get started"
                       "We're still finding new and
                     interesting use cases, which just
                         aren't possible with our
                           current datastores."

                                                         Analytics
Thanks!

     Questions?


30
                  Analytics

Weitere ähnliche Inhalte

Mehr von JAX London

Everything I know about software in spaghetti bolognese: managing complexity
Everything I know about software in spaghetti bolognese: managing complexityEverything I know about software in spaghetti bolognese: managing complexity
Everything I know about software in spaghetti bolognese: managing complexityJAX London
 
Devops with the S for Sharing - Patrick Debois
Devops with the S for Sharing - Patrick DeboisDevops with the S for Sharing - Patrick Debois
Devops with the S for Sharing - Patrick DeboisJAX London
 
Busy Developer's Guide to Windows 8 HTML/JavaScript Apps
Busy Developer's Guide to Windows 8 HTML/JavaScript AppsBusy Developer's Guide to Windows 8 HTML/JavaScript Apps
Busy Developer's Guide to Windows 8 HTML/JavaScript AppsJAX London
 
It's code but not as we know: Infrastructure as Code - Patrick Debois
It's code but not as we know: Infrastructure as Code - Patrick DeboisIt's code but not as we know: Infrastructure as Code - Patrick Debois
It's code but not as we know: Infrastructure as Code - Patrick DeboisJAX London
 
Locks? We Don't Need No Stinkin' Locks - Michael Barker
Locks? We Don't Need No Stinkin' Locks - Michael BarkerLocks? We Don't Need No Stinkin' Locks - Michael Barker
Locks? We Don't Need No Stinkin' Locks - Michael BarkerJAX London
 
Worse is better, for better or for worse - Kevlin Henney
Worse is better, for better or for worse - Kevlin HenneyWorse is better, for better or for worse - Kevlin Henney
Worse is better, for better or for worse - Kevlin HenneyJAX London
 
Java performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJava performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJAX London
 
Clojure made-simple - John Stevenson
Clojure made-simple - John StevensonClojure made-simple - John Stevenson
Clojure made-simple - John StevensonJAX London
 
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias WessendorfHTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias WessendorfJAX London
 
Play framework 2 : Peter Hilton
Play framework 2 : Peter HiltonPlay framework 2 : Peter Hilton
Play framework 2 : Peter HiltonJAX London
 
Complexity theory and software development : Tim Berglund
Complexity theory and software development : Tim BerglundComplexity theory and software development : Tim Berglund
Complexity theory and software development : Tim BerglundJAX London
 
Why FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave GruberWhy FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave GruberJAX London
 
Akka in Action: Heiko Seeburger
Akka in Action: Heiko SeeburgerAkka in Action: Heiko Seeburger
Akka in Action: Heiko SeeburgerJAX London
 
NoSQL Smackdown 2012 : Tim Berglund
NoSQL Smackdown 2012 : Tim BerglundNoSQL Smackdown 2012 : Tim Berglund
NoSQL Smackdown 2012 : Tim BerglundJAX London
 
Closures, the next "Big Thing" in Java: Russel Winder
Closures, the next "Big Thing" in Java: Russel WinderClosures, the next "Big Thing" in Java: Russel Winder
Closures, the next "Big Thing" in Java: Russel WinderJAX London
 
Java and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJava and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJAX London
 
Mongo DB on the JVM - Brendan McAdams
Mongo DB on the JVM - Brendan McAdamsMongo DB on the JVM - Brendan McAdams
Mongo DB on the JVM - Brendan McAdamsJAX London
 
New opportunities for connected data - Ian Robinson
New opportunities for connected data - Ian RobinsonNew opportunities for connected data - Ian Robinson
New opportunities for connected data - Ian RobinsonJAX London
 
HTML5 Websockets and Java - Arun Gupta
HTML5 Websockets and Java - Arun GuptaHTML5 Websockets and Java - Arun Gupta
HTML5 Websockets and Java - Arun GuptaJAX London
 
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian PloskerThe Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian PloskerJAX London
 

Mehr von JAX London (20)

Everything I know about software in spaghetti bolognese: managing complexity
Everything I know about software in spaghetti bolognese: managing complexityEverything I know about software in spaghetti bolognese: managing complexity
Everything I know about software in spaghetti bolognese: managing complexity
 
Devops with the S for Sharing - Patrick Debois
Devops with the S for Sharing - Patrick DeboisDevops with the S for Sharing - Patrick Debois
Devops with the S for Sharing - Patrick Debois
 
Busy Developer's Guide to Windows 8 HTML/JavaScript Apps
Busy Developer's Guide to Windows 8 HTML/JavaScript AppsBusy Developer's Guide to Windows 8 HTML/JavaScript Apps
Busy Developer's Guide to Windows 8 HTML/JavaScript Apps
 
It's code but not as we know: Infrastructure as Code - Patrick Debois
It's code but not as we know: Infrastructure as Code - Patrick DeboisIt's code but not as we know: Infrastructure as Code - Patrick Debois
It's code but not as we know: Infrastructure as Code - Patrick Debois
 
Locks? We Don't Need No Stinkin' Locks - Michael Barker
Locks? We Don't Need No Stinkin' Locks - Michael BarkerLocks? We Don't Need No Stinkin' Locks - Michael Barker
Locks? We Don't Need No Stinkin' Locks - Michael Barker
 
Worse is better, for better or for worse - Kevlin Henney
Worse is better, for better or for worse - Kevlin HenneyWorse is better, for better or for worse - Kevlin Henney
Worse is better, for better or for worse - Kevlin Henney
 
Java performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha GeeJava performance: What's the big deal? - Trisha Gee
Java performance: What's the big deal? - Trisha Gee
 
Clojure made-simple - John Stevenson
Clojure made-simple - John StevensonClojure made-simple - John Stevenson
Clojure made-simple - John Stevenson
 
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias WessendorfHTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
 
Play framework 2 : Peter Hilton
Play framework 2 : Peter HiltonPlay framework 2 : Peter Hilton
Play framework 2 : Peter Hilton
 
Complexity theory and software development : Tim Berglund
Complexity theory and software development : Tim BerglundComplexity theory and software development : Tim Berglund
Complexity theory and software development : Tim Berglund
 
Why FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave GruberWhy FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave Gruber
 
Akka in Action: Heiko Seeburger
Akka in Action: Heiko SeeburgerAkka in Action: Heiko Seeburger
Akka in Action: Heiko Seeburger
 
NoSQL Smackdown 2012 : Tim Berglund
NoSQL Smackdown 2012 : Tim BerglundNoSQL Smackdown 2012 : Tim Berglund
NoSQL Smackdown 2012 : Tim Berglund
 
Closures, the next "Big Thing" in Java: Russel Winder
Closures, the next "Big Thing" in Java: Russel WinderClosures, the next "Big Thing" in Java: Russel Winder
Closures, the next "Big Thing" in Java: Russel Winder
 
Java and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJava and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk Pepperdine
 
Mongo DB on the JVM - Brendan McAdams
Mongo DB on the JVM - Brendan McAdamsMongo DB on the JVM - Brendan McAdams
Mongo DB on the JVM - Brendan McAdams
 
New opportunities for connected data - Ian Robinson
New opportunities for connected data - Ian RobinsonNew opportunities for connected data - Ian Robinson
New opportunities for connected data - Ian Robinson
 
HTML5 Websockets and Java - Arun Gupta
HTML5 Websockets and Java - Arun GuptaHTML5 Websockets and Java - Arun Gupta
HTML5 Websockets and Java - Arun Gupta
 
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian PloskerThe Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
 

Kürzlich hochgeladen

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Kürzlich hochgeladen (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Realtime analytics with Apache Cassandra - Tom Wilkie

  • 1. Realtime Analytics with Apache Cassandra Tom Wilkie Founder & CTO, Acunu Ltd @tom_wilkie
  • 2. Combining “big” and “real-time” is hard Live & historical Drill downs Trends... aggregates... and roll ups 2 Analytics
  • 3. Solution Con Scalability $$$ Not realtime Spartan query semantics => complex, DIY solutions 3 Analytics
  • 4. Example I eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter” Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data http://blog.twitter.com/2011/03/numbers.html 4 Analytics
  • 5. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters 5 Analytics
  • 6. Preparing the data 12:32:15 I like #trafficlights Step 1: Get a feed of 12:33:43 Nobody expects... the tweets 12:33:49 I ate a #bee; woe is... 12:34:04 Man, @acunu rocks! Step 2: Tokenise the tweet Step 3: Increment counters [1234, man] +1 in time buckets for [1234, acunu] +1 each token [1234, rock] +1 6 Analytics
  • 7. Querying start: [01/05/11, acunu] Step 1: Do a range query end: [30/05/11, acunu] Key #Mentions [01/05/11 00:01, acunu] 3 Step 2: Result table [01/05/11 00:02, acunu] 5 ... ... 90 Step 3: Plot pretty graph 45 0 May Jun Jul Aug Sept Oct Nov 7 Analytics
  • 8. Instead of this... Key #Mentions [01/05/11 00:01, acunu] 3 [01/05/11 00:02, acunu] 5 ... ... We do this Key 00:01 00:02 ... [01/05/11, acunu] 3 5 ... [02/05/11, acunu] 12 4 ... ... ... ... Row key is ‘big’ Column key is ‘small’ time bucket time bucket 8 Analytics
  • 9. Towards a more general solution... (Example II) 9 Analytics
  • 10. count grouped by ... day count distinct (session) count ... geography avg(duration) ... browser 10 Analytics
  • 11. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ... { cust_id: user01, ... ... session_id: 102, UK all→228 user01→1 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ... } UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ... 11 Analytics
  • 12. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :00→19 :02→105 ... { cust_id: user01, ... ... session_id: 102, UK all→229 user01→2 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ... } UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 12 Analytics
  • 13. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ... ... ... UK all→228 user01→1 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ... 13 Analytics
  • 14. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :01→19 :02→105 ... ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 14 Analytics
  • 15. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 15 Analytics
  • 16. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 16 Analytics
  • 17. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... count all ∅ all→87315 UK→239 US→354 ... 17 Analytics
  • 18. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... count all ∅ all→87315 UK→239 US→354 ... group all by geo 18 Analytics
  • 19. What about more than just aggregates? 19 Analytics
  • 20. Approximate Analytics Exact Real-time Large Scale 20 Analytics
  • 21. Count Distinct Plan A: keep a list of all the things you’ve seen count them at query time Quick to update ... but at scale ... Takes lots of space Takes a long time to query 21 Analytics
  • 22. Approximate Distinct max # leading zeroes seen so far item hash leading zeroes max so far x 00101001110... 2 2 y 11010100111... 0 2 z 00011101011... 3 3 ... ... to see a max of M takes about 2M items 22 Analytics
  • 23. Approximate Distinct to reduce var, average over m=2k sub-streams item hash index, zeroes max so far x 00101001110... 0, 0 0,0,0,0 y 11010100111... 3, 1 0,0,0,1 z 00011101011... 0, 1 1,0,0,1 ... take the harmonic mean 23 Analytics
  • 24. Okay... now what? Analytics
  • 25. Analytics counter updates Click stream events Acunu Sensor data Analytics etc • Aggregate incrementally, on the fly • Store live + historical aggregates
  • 26. 10x vs MySQL... Analytics
  • 27. Dashboard UI 27 Analytics
  • 28. “Up and running in about 4 hours” “We found out a competitor was scraping our data” “We keep discovering use cases we hadn’t thought of ” Analytics
  • 29. "Quick, efficient and easy to get started" "We're still finding new and interesting use cases, which just aren't possible with our current datastores." Analytics
  • 30. Thanks! Questions? 30 Analytics