SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Realtime Analytics
  with Apache
   Cassandra
        Tom Wilkie
 Founder & CTO, Acunu Ltd
       @tom_wilkie
Combining “big” and “real-time” is hard

    Live & historical                    Drill downs
                         Trends...
      aggregates...                      and roll ups




2
                                                        Analytics
Solution              Con

                       Scalability
                         $$$


                      Not realtime


               Spartan query semantics =>
                 complex, DIY solutions

3
                                            Analytics
Example I
    eg “show me the number of mentions of
        ‘Acunu’ per day, between May and
          November 2011, on Twitter”


    Batch (Hadoop) approach would require
    processing ~30 billion tweets, or ~4.2
                 TB of data
                  http://blog.twitter.com/2011/03/numbers.html


4
                                                                 Analytics
Okay, so how are we going to
                   do it?

    For each tweet,
    increment a bunch of counters,
    such that answering a query
    is as easy as reading some counters


5
                                          Analytics
Preparing the data
                              12:32:15 I like #trafficlights
Step 1: Get a feed of    12:33:43 Nobody expects...
        the tweets     12:33:49 I ate a #bee; woe is...
                      12:34:04 Man, @acunu rocks!

Step 2: Tokenise the
        tweet

Step 3: Increment counters            [1234, man]   +1
        in time buckets for           [1234, acunu] +1
        each token                    [1234, rock] +1
6
                                                              Analytics
Querying
                            start: [01/05/11, acunu]
Step 1: Do a range query    end:   [30/05/11, acunu]

                                       Key            #Mentions
                              [01/05/11 00:01, acunu]    3
Step 2: Result table          [01/05/11 00:02, acunu]    5
                                        ...              ...


                              90

Step 3: Plot pretty graph     45
                               0
                                   May Jun Jul Aug Sept Oct Nov
7
                                                              Analytics
Instead of this...
                                  Key            #Mentions
                         [01/05/11 00:01, acunu]    3
                         [01/05/11 00:02, acunu]    5
                                   ...              ...




                         We do this
                      Key           00:01       00:02        ...
                [01/05/11, acunu]     3           5          ...
                [02/05/11, acunu]    12           4          ...
                        ...           ...                    ...

    Row key is ‘big’                     Column key is ‘small’
     time bucket                             time bucket
8
                                                                   Analytics
Towards a more
    general solution...
      (Example II)



9
                          Analytics
count
                grouped by ...
                    day
  count
 distinct
(session)
     count       ... geography

avg(duration)
                  ... browser


10
                          Analytics
21:00      all→1345    :00→45      :01→62      :02→87       ...

                         22:00      all→3221    :00→22      :00→19     :02→104       ...
{
     cust_id: user01,      ...                                                       ...

     session_id: 102,      UK        all→228    user01→1   user14→12   user99→7      ...
     geography: UK,
                           US        all→354    user01→4   user04→8    user56→17     ...
     browser: IE,
     time: 22:02,          ...

}                       UK, 22:00   all→1904       ...

                           ∅        all→87314   UK→238     US→354         ...




11
                                                                                 Analytics
21:00      all→1345     :00→45     :01→62      :02→87       ...

                         22:00      all→3222     :00→22     :00→19     :02→105       ...
{
     cust_id: user01,      ...                                                       ...

     session_id: 102,      UK        all→229    user01→2   user14→12   user99→7      ...
     geography: UK,
                           US        all→354    user01→4   user04→8    user56→17     ...
     browser: IE,
     time: 22:02,          ...

}                       UK, 22:00   all→1905       ...

                           ∅        all→87315   UK→239     US→354         ...




12
                                                                                 Analytics
21:00      all→1345    :00→45      :01→62      :02→87       ...

      22:00      all→3221    :00→22      :00→19     :02→104       ...

        ...                                                       ...

        UK        all→228    user01→1   user14→12   user99→7      ...

        US        all→354    user01→4   user04→8    user56→17     ...

        ...

     UK, 22:00   all→1904       ...

        ∅        all→87314   UK→238     US→354         ...




13
                                                              Analytics
where time 21:00-22:00
 count(*)
                          21:00      all→1345    :00→45      :01→62      :02→87       ...

                          22:00      all→3222    :00→22      :01→19     :02→105       ...

                            ...                                                       ...

                            UK        all→229    user01→2   user14→12   user99→7      ...

                            US        all→354    user01→4   user04→8    user56→17     ...

                            ...

                         UK, 22:00   all→1905       ...

                            ∅        all→87315   UK→239     US→354         ...




14
                                                                                  Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345    :00→45      :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...

                             US        all→354    user01→4   user04→8    user56→17     ...

                             ...

                          UK, 22:00   all→1905       ...

                             ∅        all→87315   UK→239     US→354         ...




15
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

                             ∅        all→87315   UK→239      US→354        ...




16
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

count all                    ∅        all→87315   UK→239      US→354        ...




17
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

count all                    ∅        all→87315   UK→239      US→354        ...




group all by geo
18
                                                                                   Analytics
What about more than
       just aggregates?



19
                            Analytics
Approximate Analytics
                 Exact




     Real-time           Large Scale


20
                                       Analytics
Count Distinct

     Plan A: keep a list of all the things you’ve seen
               count them at query time


                Quick to update
                  ... but at scale ...
                Takes lots of space
                Takes a long time to query
21
                                                         Analytics
Approximate Distinct

     max # leading zeroes seen so far
         item          hash        leading zeroes   max so far

         x        00101001110...          2            2
         y        11010100111...          0            2
         z        00011101011...          3            3
                       ...
     ... to see a max of M takes about        2M    items

22
                                                                 Analytics
Approximate Distinct

            to reduce var, average over m=2k sub-streams

     item          hash          index, zeroes   max so far

     x       00101001110...          0, 0        0,0,0,0
     y       11010100111...          3, 1        0,0,0,1
     z       00011101011...          0, 1        1,0,0,1
                   ...
            take the harmonic mean
23
                                                              Analytics
Okay... now what?




                    Analytics
Analytics

                                     counter
                                     updates
Click stream    events
                          Acunu
Sensor data
                         Analytics
     etc




     •   Aggregate incrementally, on the fly
     •   Store live + historical aggregates
10x vs MySQL...




                  Analytics
Dashboard UI




27
                    Analytics
“Up and running in about 4 hours”
“We found out a competitor
  was scraping our data”

                      “We keep discovering use cases
                         we hadn’t thought of ”




                                                 Analytics
"Quick, efficient and easy to
        get started"
                       "We're still finding new and
                     interesting use cases, which just
                         aren't possible with our
                           current datastores."

                                                         Analytics
Thanks!

     Questions?


30
                  Analytics

Weitere ähnliche Inhalte

Mehr von Acunu

Acunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinAcunu
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsAcunu
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu
 
All Your Base
All Your BaseAll Your Base
All Your BaseAcunu
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time CassandraAcunu
 
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Acunu
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with CassandraAcunu
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your businessAcunu
 
Realtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraRealtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraAcunu
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: CassandraAcunu
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Acunu
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraAcunu
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsAcunu
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation CassandraAcunu
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Acunu
 
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixCassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixAcunu
 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Acunu
 

Mehr von Acunu (20)

Acunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on Cassandra
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational Aspirin
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra Apps
 
All Your Base
All Your BaseAll Your Base
All Your Base
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time Cassandra
 
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with Cassandra
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra London
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your business
 
Realtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraRealtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with Cassandra
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: Cassandra
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into Cassandra
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation Cassandra
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
 
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-FelixCassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix
 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
 

Kürzlich hochgeladen

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Kürzlich hochgeladen (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Realtime Analytics with Apache Cassandra - JAX London

  • 1. Realtime Analytics with Apache Cassandra Tom Wilkie Founder & CTO, Acunu Ltd @tom_wilkie
  • 2. Combining “big” and “real-time” is hard Live & historical Drill downs Trends... aggregates... and roll ups 2 Analytics
  • 3. Solution Con Scalability $$$ Not realtime Spartan query semantics => complex, DIY solutions 3 Analytics
  • 4. Example I eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter” Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data http://blog.twitter.com/2011/03/numbers.html 4 Analytics
  • 5. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters 5 Analytics
  • 6. Preparing the data 12:32:15 I like #trafficlights Step 1: Get a feed of 12:33:43 Nobody expects... the tweets 12:33:49 I ate a #bee; woe is... 12:34:04 Man, @acunu rocks! Step 2: Tokenise the tweet Step 3: Increment counters [1234, man] +1 in time buckets for [1234, acunu] +1 each token [1234, rock] +1 6 Analytics
  • 7. Querying start: [01/05/11, acunu] Step 1: Do a range query end: [30/05/11, acunu] Key #Mentions [01/05/11 00:01, acunu] 3 Step 2: Result table [01/05/11 00:02, acunu] 5 ... ... 90 Step 3: Plot pretty graph 45 0 May Jun Jul Aug Sept Oct Nov 7 Analytics
  • 8. Instead of this... Key #Mentions [01/05/11 00:01, acunu] 3 [01/05/11 00:02, acunu] 5 ... ... We do this Key 00:01 00:02 ... [01/05/11, acunu] 3 5 ... [02/05/11, acunu] 12 4 ... ... ... ... Row key is ‘big’ Column key is ‘small’ time bucket time bucket 8 Analytics
  • 9. Towards a more general solution... (Example II) 9 Analytics
  • 10. count grouped by ... day count distinct (session) count ... geography avg(duration) ... browser 10 Analytics
  • 11. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ... { cust_id: user01, ... ... session_id: 102, UK all→228 user01→1 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ... } UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ... 11 Analytics
  • 12. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :00→19 :02→105 ... { cust_id: user01, ... ... session_id: 102, UK all→229 user01→2 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ... } UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 12 Analytics
  • 13. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ... ... ... UK all→228 user01→1 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ... 13 Analytics
  • 14. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :01→19 :02→105 ... ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 14 Analytics
  • 15. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 15 Analytics
  • 16. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 16 Analytics
  • 17. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... count all ∅ all→87315 UK→239 US→354 ... 17 Analytics
  • 18. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... count all ∅ all→87315 UK→239 US→354 ... group all by geo 18 Analytics
  • 19. What about more than just aggregates? 19 Analytics
  • 20. Approximate Analytics Exact Real-time Large Scale 20 Analytics
  • 21. Count Distinct Plan A: keep a list of all the things you’ve seen count them at query time Quick to update ... but at scale ... Takes lots of space Takes a long time to query 21 Analytics
  • 22. Approximate Distinct max # leading zeroes seen so far item hash leading zeroes max so far x 00101001110... 2 2 y 11010100111... 0 2 z 00011101011... 3 3 ... ... to see a max of M takes about 2M items 22 Analytics
  • 23. Approximate Distinct to reduce var, average over m=2k sub-streams item hash index, zeroes max so far x 00101001110... 0, 0 0,0,0,0 y 11010100111... 3, 1 0,0,0,1 z 00011101011... 0, 1 1,0,0,1 ... take the harmonic mean 23 Analytics
  • 24. Okay... now what? Analytics
  • 25. Analytics counter updates Click stream events Acunu Sensor data Analytics etc • Aggregate incrementally, on the fly • Store live + historical aggregates
  • 26. 10x vs MySQL... Analytics
  • 27. Dashboard UI 27 Analytics
  • 28. “Up and running in about 4 hours” “We found out a competitor was scraping our data” “We keep discovering use cases we hadn’t thought of ” Analytics
  • 29. "Quick, efficient and easy to get started" "We're still finding new and interesting use cases, which just aren't possible with our current datastores." Analytics
  • 30. Thanks! Questions? 30 Analytics