Data modelling workshop

                                  Richard Low



                           rlow@acunu.com @richardalow



Wednesday, 28 March 2012
Outline
                   • What is data modelling?
                   • What do I need to know to come up with a
                           model?
                   • Options and available tools
                   • Denormalisation
                   • Example and demo: scalable messaging
                           application


What is data modelling?




Data modelling

                   • How you organise your data
                   • Store all in one big value?
                   • Store as columns in one row or lots of rows?
                   • Use counters?
                   • Can I avoid read-modify-write?

Why care about it?

                   • Performance
                   • Ensure good load balancing
                   • Disk usage
                   • Future proofing

Performance

                   • Bad data model: do read-modify-write on
                           large column
                   • Good data model: just overwrite updated data
                   • Difference? Could be 100 ops/s vs. 100k ops/s
                           (1000x improvement)
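
The contrast between the two models can be sketched in a few lines of Python. This is only an illustration: `CountingStore` and its methods are hypothetical names, and a plain dict stands in for a row so that reads and writes can be counted. It does not model any real Cassandra client.

```python
import json


class CountingStore:
    """Hypothetical in-memory stand-in for one row; counts operations."""

    def __init__(self):
        self.columns = {}
        self.reads = 0
        self.writes = 0

    def get(self, name):
        self.reads += 1
        return self.columns.get(name)

    def put(self, name, value):
        self.writes += 1
        self.columns[name] = value


def update_bad(store, field, value):
    """Bad model: everything lives in one large column, so every update
    has to read it, modify it and write the whole thing back."""
    blob = store.get("profile")
    profile = json.loads(blob) if blob else {}
    profile[field] = value
    store.put("profile", json.dumps(profile))


def update_good(store, field, value):
    """Good model: one column per field, so an update is a blind overwrite."""
    store.put(field, value)


bad, good = CountingStore(), CountingStore()
for i in range(3):
    update_bad(bad, "status", f"v{i}")
    update_good(good, "status", f"v{i}")

print(bad.reads, bad.writes)    # 3 reads and 3 writes
print(good.reads, good.writes)  # 0 reads and 3 writes
```

The good model never reads before writing, which is what lets updates stay cheap as the stored value grows.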




Performance

                   • Cacheability
                    • Ensure your cache isn’t polluted by
                           uncacheable things
                    • Cached reads are ~100x faster than
                           uncached



What do you need?



Optimise for queries


                   • Data model design starts with queries
                   • What are the common queries?


Workload

                   • How many inserts?
                   • How many reads?
                   • Do inserts depend on current data?
                   • Is data write-once?

Sizes
                   • How big are the values?
                   • Are some ‘users’ bigger than others?
                   • How cacheable is your data?




How do I get this?
        • Back of the envelope calculation
        • Monitor existing solution
        • Prototype a solution




Options and tools




Keyspaces and Column Families
                   SQL          Cassandra
                   Database     Keyspace
                   Table        Column Family

                   A table / column family holds rows: row/key, col_1, col_2




Options and tools

                   • Rows
                   • Columns
                    • Supercolumns
                    • Composite columns

Rows and columns
                           col1   col2   col3   col4   col5   col6   col7
                row1               x                    x      x
                row2        x      x      x      x      x
                row3               x      x             x      x      x
                row4               x      x      x             x
                row5               x             x      x      x
                row6               x
                row7        x      x             x



Column options

                   • Regular columns
                   • Super columns: columns within columns
                   • Composite columns: multi-dimensional
                           column names




Composite columns
                           alice: {
                              m2: {
                                 Sender: bob,
                                 Subject: ‘paper!’, ...
                              }
                           }

                           bob: {
                              m1: {
                                  Sender: alice,
                                  Subject: ‘rock?’, ...
                              }
                           }

                           charlie: {
                              m1: {
                                 Sender: alice,
                                 Subject: ‘rock?’, ...
                              },
                              m2: {
                                 Sender: bob,
                                 Subject: ‘paper!’, ...
                              }
                           }


Tools

                   • Counters: atomic inc and dec
                   • Expiring columns: TTL
                   • Secondary indexes: your WHERE clause


Rows vs columns
                   • Row key is the shard key
                   • Need lots of rows for scalability
                   • Don’t be afraid of large-ish rows
                    • But don’t make them too big
                   • Avoid range queries across rows, but use
                           them within rows
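
Why many rows matter can be sketched as follows. This is a simplified illustration, not Cassandra's actual placement logic: the node names are hypothetical, and taking the MD5 token modulo the node count stands in for real token ranges.

```python
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster


def token(row_key):
    """Hash the row key to a large integer (MD5, as a random partitioner would)."""
    return int.from_bytes(hashlib.md5(row_key.encode()).digest(), "big")


def node_for(row_key):
    """Simplification: pick a node by token modulo the number of nodes."""
    return NODES[token(row_key) % len(NODES)]


# One giant row: all the data and all the load land on a single node.
few_rows = Counter(node_for(k) for k in ["all-messages"])

# One row per user: the rows spread roughly evenly over the nodes.
many_rows = Counter(node_for(f"user{i}") for i in range(10_000))

print(few_rows)
print(many_rows)
```

Because the row key alone decides placement, a model with only a handful of rows cannot use more than a handful of nodes, however large the cluster is.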


Range queries
               • Within a row:
                      SELECT col3..col5 FROM
                      Standard1 WHERE KEY=row1


             row1          col1   col2   col5   col6   col8
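
A slice within a row can be sketched like this, assuming (as the example suggests) that column names in a row are kept sorted. The function name `get_slice` only mirrors the call named later in the deck; this is a plain Python list, not the Thrift API.

```python
from bisect import bisect_left, bisect_right

# Column names of row1 from the slide, kept in sorted order.
row1 = ["col1", "col2", "col5", "col6", "col8"]


def get_slice(columns, start, finish):
    """Return the columns whose names fall in [start, finish].

    Two binary searches locate the bounds; everything in between is
    contiguous, so the read is sequential.
    """
    lo = bisect_left(columns, start)
    hi = bisect_right(columns, finish)
    return columns[lo:hi]


print(get_slice(row1, "col3", "col5"))  # ['col5']
```

Only `col5` exists in the requested range, so it is the only column returned; `col3` and `col4` are simply absent from this row.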




Range queries
             • Across rows:
                    SELECT * FROM table WHERE key >
                    row2 LIMIT 2




Range queries
                   SELECT * FROM table
                   WHERE key > row2
                   LIMIT 2
                   > row2, row1

                   Position on the ring (clockwise): row4, row2, row1, row3
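
The surprising result can be reproduced with a small sketch. It assumes rows are placed on the ring by a hash of the key (MD5 here), so a scan walks rows in token order rather than key order. The helper names are hypothetical, and the actual order printed depends on the hash values, so it will not necessarily match the slide.

```python
import hashlib


def token(row_key):
    """Hash the row key to its position on the ring."""
    return int.from_bytes(hashlib.md5(row_key.encode()).digest(), "big")


keys = ["row1", "row2", "row3", "row4"]
ring = sorted(keys, key=token)  # the order a scan across rows visits them


def range_from(start_key, limit):
    """Walk the ring from start_key (inclusive) in token order, no wrap-around."""
    i = ring.index(start_key)
    return ring[i:i + limit]


print(ring)                   # token order, generally not row1..row4
print(range_from("row2", 2))  # starts at row2, then whatever follows it on the ring
```

The rows that follow `row2` are determined by their hashes, which is why a range query across rows returns keys that look unrelated to the start key.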



Range queries

                   • Range queries within rows ‘get_slice’ are
                           fine
                   • Avoid range queries across rows
                           ‘get_range_slices’




Batching
                   • Overhead on each call
                   • Batch together inserts, better if in the same
                           row
                   • Reduce read ops, use large get_slice reads



Denormalisation




Denormalisation

                   • Hard drive performance constraints:
                    • Sequential IO at 100s MB/s
                    • Seek at 100 IO/s
                   • Avoid random IO
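
A rough worked example of these constraints, assuming 100 MB/s for sequential IO (the low end of "100s MB/s") and ten 1 KB items to read; the item size and count are illustrative assumptions, not figures from this slide.

```python
SEEK_S = 1 / 100           # 100 seeks per second, so 10 ms per seek
SEQ_BYTES_PER_S = 100e6    # assumed 100 MB/s sequential throughput


def read_time(n_seeks, n_bytes):
    """Seconds spent seeking plus seconds spent transferring."""
    return n_seeks * SEEK_S + n_bytes / SEQ_BYTES_PER_S


payload = 10 * 1000                  # ten items of 1 KB each
scattered = read_time(10, payload)   # one seek per item
contiguous = read_time(1, payload)   # one seek, then a sequential read

print(f"scattered:  {scattered * 1000:.1f} ms")
print(f"contiguous: {contiguous * 1000:.1f} ms")
```

The transfer time is negligible in both cases; the seeks dominate, which is why laying data out contiguously matters far more than its size.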

Denormalisation
                   • Store columns accessed at similar times near
                           to each other
                   • => put them in the same row
                   • Involves copying
                   • Copying isn’t bad: disk is cheap (pre-flood
                           prices < $100 per TB)



Messaging application

Messaging application

                   • Users can send messages to other users
                   • Horizontally scalable
                   • Expect users to send to lots of recipients


Messaging

                   • In an RDBMS we might have a table for:
                    • Users
                    • Messages (sender is unique)
                    • Mappings, Message → Receiver


A relational model
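             Example relational DB model

             Users:        Id, username
             Messages:     Id, Subject, Content, Date, Sender_Id
             Msg_Receipt:  Id, Message_Id, User_Id, Is_read

             Users (1) to (∞) Msg_Receipt, via User_Id
             Messages (1) to (∞) Msg_Receipt, via Message_Id
             Users (1) to (∞) Messages, via Sender_Id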

Querying
        Most recent 10 messages sent by a user:
                SELECT *
                    FROM Messages
                    WHERE Messages.Sender_Id = <id>
                    ORDER BY Messages.Date DESC
                    LIMIT 10;



         Most recent 10 messages received by a user:
                SELECT Messages.*
                    FROM Messages, Msg_Receipt
                    WHERE Msg_Receipt.User_Id = <id>
                    AND Msg_Receipt.Message_Id = Messages.Id
                    ORDER BY Messages.Date DESC
                    LIMIT 10;


Under the hood
             Msg_Receipt
             id     msg_id   user_id
             0      0        0
             1      3        1
             2      4        2
             3      6000     0

             Messages
             id     subject  ...
             0      a
             1      b
             2      c
             3      d
             4      e
             ...
             6000   x


Under the hood

                   • Normalisation => seeks
                   • So denormalise
                    • Hit capacity limit of one node quickly


Back of the envelope...

                   • 1 M users
                   • Message size 1 KB
                   • Each user has 5000 messages
                   • => 5 TB data
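
The estimate above, written out as arithmetic. Decimal units are assumed (1 KB = 1000 bytes, 1 TB = 10^12 bytes), and replication is not included.

```python
users = 1_000_000
messages_per_user = 5_000
message_bytes = 1_000          # 1 KB per message, decimal units assumed

total_bytes = users * messages_per_user * message_bytes
total_tb = total_bytes / 1e12  # bytes to terabytes

print(total_tb)  # 5.0
```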

Back of the envelope...

                   • Reading 10 messages => 10 seeks
                   • If 10k active at once, need 100k seeks/s
                   • => need 1000 disks
                   • With 8 disks per node, RF 3, that’s 375
                           nodes



Back of the envelope...

                   • Denormalise: messages are immutable
                   • Insert them into everyone’s inbox
                   • Read 10 messages is one seek
                   • Paging is sequential
                   • => 10x fewer nodes: 38 nodes now!
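
The node counts for both models follow from the same arithmetic. The sketch below uses the figures from the slides (10k active users, 100 seeks/s per disk, 8 disks per node, RF 3) and multiplies by the replication factor as the slides do; `nodes_needed` is a hypothetical helper name.

```python
import math


def nodes_needed(active_users, seeks_per_read,
                 seeks_per_disk=100, disks_per_node=8, rf=3):
    """Nodes required to serve the seek load, following the slide arithmetic."""
    seeks_per_second = active_users * seeks_per_read
    disks = seeks_per_second / seeks_per_disk
    return math.ceil(disks / disks_per_node * rf)


normalised = nodes_needed(10_000, seeks_per_read=10)   # 10 seeks per inbox read
denormalised = nodes_needed(10_000, seeks_per_read=1)  # 1 seek per inbox read

print(normalised)    # 375
print(denormalised)  # 38
```

Cutting the seeks per read from ten to one reduces the node count by the same factor, with 37.5 rounded up to 38.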

In Cassandra

                   • Use a row per user
                   • Composite columns, with TimeUUID as ID
                   • Gives time ordering on messages
                   • Inserts go to all recipients
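
The layout can be sketched with plain Python dicts. This is an illustration under stated assumptions: an increasing counter stands in for a TimeUUID, a tuple `(msg_id, field)` stands in for a composite column name, and `inboxes` and `send` are hypothetical names rather than any client API.

```python
from collections import defaultdict
from itertools import count

inboxes = defaultdict(dict)  # row key (user) -> {(msg_id, field): value}
_next_id = count(1)          # stand-in for a TimeUUID: increases with time


def send(sender, recipients, subject):
    """Write the message into every recipient's row (denormalised fan-out)."""
    msg_id = next(_next_id)
    for user in recipients:
        inboxes[user][(msg_id, "sender")] = sender
        inboxes[user][(msg_id, "subject")] = subject
    return msg_id


send("alice", ["bob", "charlie"], "rock?")
send("bob", ["alice", "charlie"], "paper!")

# Sorting composite names groups each message's fields together and
# orders messages by ID, which here means by time.
for user in ("alice", "bob", "charlie"):
    print(user, sorted(inboxes[user].items()))
```

Each message is copied once per recipient, which is the price paid for reading an inbox from a single row.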

Messaging example
                   From:    alice
                   To:      bob, charlie
                   Subject: rock?

                   Row       m1
                   alice
                   bob       sender: alice, subject: rock?
                   charlie   sender: alice, subject: rock?

Messaging example
                   From:    bob
                   To:      alice, charlie
                   Subject: paper!

                   Row       m1                              m2
                   alice                                     sender: bob, subject: paper!
                   bob       sender: alice, subject: rock?
                   charlie   sender: alice, subject: rock?   sender: bob, subject: paper!

Data
                           alice: {
                              m2: {
                                 Sender: bob,
                                 Subject: ‘paper!’, ...
                              }
                           }

                           bob: {
                              m1: {
                                  Sender: alice,
                                  Subject: ‘rock?’, ...
                              }
                           }

                           charlie: {
                              m1: {
                                 Sender: alice,
                                 Subject: ‘rock?’, ...
                              },
                              m2: {
                                 Sender: bob,
                                 Subject: ‘paper!’, ...
                              }
                           }


Demo

                   • Pycassa
                   • Send message
                   • List messages
                   • Unread count
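
The three demo operations can be approximated in plain Python. This is not the Pycassa code from the demo, which is not part of these slides. It is a sketch with in-memory dicts standing in for column families, an increasing counter standing in for a TimeUUID, and a per-user counter for unread messages as one plausible approach; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import count

inbox = defaultdict(dict)   # user -> {msg_id: {"sender", "subject", "is_read"}}
unread = defaultdict(int)   # user -> unread counter
_ids = count(1)             # stand-in for TimeUUIDs


def send_message(sender, recipients, subject):
    """Insert the message into each recipient's row and bump their counter."""
    msg_id = next(_ids)
    for user in recipients:
        inbox[user][msg_id] = {"sender": sender, "subject": subject,
                               "is_read": False}
        unread[user] += 1
    return msg_id


def list_messages(user, limit=10):
    """Most recent messages first, read from the user's own row only."""
    newest = sorted(inbox[user], reverse=True)[:limit]
    return [(m, inbox[user][m]["sender"], inbox[user][m]["subject"])
            for m in newest]


def mark_read(user, msg_id):
    """Flag a message as read and decrement the counter exactly once."""
    message = inbox[user][msg_id]
    if not message["is_read"]:
        message["is_read"] = True
        unread[user] -= 1


send_message("alice", ["bob", "charlie"], "rock?")
send_message("bob", ["alice", "charlie"], "paper!")
mark_read("charlie", 1)

print(list_messages("charlie"))
print(unread["charlie"])
```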

Wednesday, 28 March 2012

Weitere ähnliche Inhalte

Ähnlich wie Cassandra EU 2012 - Data modelling workshop by Richard Low

Mansoura University CSED & Nozom web development sprint
Mansoura University CSED & Nozom web development sprintMansoura University CSED & Nozom web development sprint
Mansoura University CSED & Nozom web development sprintAl Sayed Gamal
 
3/15 - Intro to Spring Data Neo4j
3/15 - Intro to Spring Data Neo4j3/15 - Intro to Spring Data Neo4j
3/15 - Intro to Spring Data Neo4jNeo4j
 
AN INTRODUCTION TO AUTO-ML EDGE-ML (VIDEO 1/4)
AN INTRODUCTION TO AUTO-ML EDGE-ML (VIDEO 1/4)AN INTRODUCTION TO AUTO-ML EDGE-ML (VIDEO 1/4)
AN INTRODUCTION TO AUTO-ML EDGE-ML (VIDEO 1/4)Alexis Bondu
 
Three Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data ScienceThree Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data ScienceAditya Parameswaran
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataMelissa Hornbostel
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hakky St
 
Hadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using HadoopHadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using HadoopYahoo Developer Network
 
李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning台灣資料科學年會
 
Big Data is a Big Scam Most of the Time! (MySQL Connect Keynote 2012)
Big Data is a Big Scam Most of the Time! (MySQL Connect Keynote 2012)Big Data is a Big Scam Most of the Time! (MySQL Connect Keynote 2012)
Big Data is a Big Scam Most of the Time! (MySQL Connect Keynote 2012)Daniel Austin
 
Intro to NoSQL and MongoDB
 Intro to NoSQL and MongoDB Intro to NoSQL and MongoDB
Intro to NoSQL and MongoDBMongoDB
 
Schema less table & dynamic schema
Schema less table & dynamic schemaSchema less table & dynamic schema
Schema less table & dynamic schemaDavide Mauri
 
Modeling Data in MongoDB
Modeling Data in MongoDBModeling Data in MongoDB
Modeling Data in MongoDBlehresman
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQLYan Cui
 
MongoDB, E-commerce and Transactions
MongoDB, E-commerce and TransactionsMongoDB, E-commerce and Transactions
MongoDB, E-commerce and TransactionsSteven Francia
 
Program Synthesis, DreamCoder, and ARC
Program Synthesis, DreamCoder, and ARCProgram Synthesis, DreamCoder, and ARC
Program Synthesis, DreamCoder, and ARCAndrey Zakharevich
 
Slide presentation pycassa_upload
Slide presentation pycassa_uploadSlide presentation pycassa_upload
Slide presentation pycassa_uploadRajini Ramesh
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceCS, NcState
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQLDon Demcsak
 
NoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture PatternsNoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture PatternsDATAVERSITY
 

Ähnlich wie Cassandra EU 2012 - Data modelling workshop by Richard Low (20)

Mansoura University CSED & Nozom web development sprint
Mansoura University CSED & Nozom web development sprintMansoura University CSED & Nozom web development sprint
Mansoura University CSED & Nozom web development sprint
 
3/15 - Intro to Spring Data Neo4j
3/15 - Intro to Spring Data Neo4j3/15 - Intro to Spring Data Neo4j
3/15 - Intro to Spring Data Neo4j
 
AN INTRODUCTION TO AUTO-ML EDGE-ML (VIDEO 1/4)
AN INTRODUCTION TO AUTO-ML EDGE-ML (VIDEO 1/4)AN INTRODUCTION TO AUTO-ML EDGE-ML (VIDEO 1/4)
AN INTRODUCTION TO AUTO-ML EDGE-ML (VIDEO 1/4)
 
Three Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data ScienceThree Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data Science
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big Data
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
 
Hadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using HadoopHadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using Hadoop
 
李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning李俊良/Feature Engineering in Machine Learning
李俊良/Feature Engineering in Machine Learning
 
Big Data is a Big Scam Most of the Time! (MySQL Connect Keynote 2012)
Big Data is a Big Scam Most of the Time! (MySQL Connect Keynote 2012)Big Data is a Big Scam Most of the Time! (MySQL Connect Keynote 2012)
Big Data is a Big Scam Most of the Time! (MySQL Connect Keynote 2012)
 
Intro to NoSQL and MongoDB
 Intro to NoSQL and MongoDB Intro to NoSQL and MongoDB
Intro to NoSQL and MongoDB
 
Schema less table & dynamic schema
Schema less table & dynamic schemaSchema less table & dynamic schema
Schema less table & dynamic schema
 
Modeling Data in MongoDB
Modeling Data in MongoDBModeling Data in MongoDB
Modeling Data in MongoDB
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
MongoDB, E-commerce and Transactions
MongoDB, E-commerce and TransactionsMongoDB, E-commerce and Transactions
MongoDB, E-commerce and Transactions
 
Program Synthesis, DreamCoder, and ARC
Program Synthesis, DreamCoder, and ARCProgram Synthesis, DreamCoder, and ARC
Program Synthesis, DreamCoder, and ARC
 
lecture1.ppt
lecture1.pptlecture1.ppt
lecture1.ppt
 
Slide presentation pycassa_upload
Slide presentation pycassa_uploadSlide presentation pycassa_upload
Slide presentation pycassa_upload
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data Science
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
NoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture PatternsNoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture Patterns
 

Mehr von Acunu

Acunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinAcunu
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsAcunu
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu
 
All Your Base
All Your BaseAll Your Base
All Your BaseAcunu
 
Realtime Analytics with Apache Cassandra
Realtime Analytics with Apache CassandraRealtime Analytics with Apache Cassandra
Realtime Analytics with Apache CassandraAcunu
 
Realtime Analytics with Apache Cassandra - JAX London
Realtime Analytics with Apache Cassandra - JAX LondonRealtime Analytics with Apache Cassandra - JAX London
Realtime Analytics with Apache Cassandra - JAX LondonAcunu
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time CassandraAcunu
 
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Acunu
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with CassandraAcunu
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your businessAcunu
 
Realtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraRealtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraAcunu
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: CassandraAcunu
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Acunu
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraAcunu
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsAcunu
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation CassandraAcunu
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Acunu
 

Mehr von Acunu (20)

Acunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on Cassandra
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational Aspirin
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problems
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra Apps
 
All Your Base
All Your BaseAll Your Base
All Your Base
 
Realtime Analytics with Apache Cassandra
Realtime Analytics with Apache CassandraRealtime Analytics with Apache Cassandra
Realtime Analytics with Apache Cassandra
 
Realtime Analytics with Apache Cassandra - JAX London
Realtime Analytics with Apache Cassandra - JAX LondonRealtime Analytics with Apache Cassandra - JAX London
Realtime Analytics with Apache Cassandra - JAX London
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time Cassandra
 
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with Cassandra
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra London
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your business
 
Realtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraRealtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with Cassandra
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: Cassandra
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into Cassandra
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation Cassandra
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
 

Kürzlich hochgeladen

CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Cassandra EU 2012 - Data modelling workshop by Richard Low

  • 1. Data modelling workshop Richard Low rlow@acunu.com @richardalow Wednesday, 28 March 2012
  • 2. Outline
      • What is data modelling?
      • What do I need to know to come up with a model?
      • Options and available tools
      • Denormalisation
      • Example and demo: scalable messaging application
  • 3. What is data modelling?
  • 4. Data modelling
      • How you organise your data
      • Store all in one big value?
      • Store as columns in one row or lots of rows?
      • Use counters?
      • Can I avoid read-modify-write?
  • 5. Why care about it?
      • Performance
      • Ensure good load balancing
      • Disk usage
      • Future proofing
  • 6. Performance
      • Bad data model: do read-modify-write on large column
      • Good data model: just overwrite updated data
      • Difference? Could be 100 ops/s vs. 100k ops/s (1000x improvement)
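The read-modify-write point on slide 6 can be illustrated with a small in-memory sketch. The `Store` class and its counters are hypothetical stand-ins for a database client, not a Cassandra API, and the field names and sizes are made up. The sketch only shows that changing one field inside a large serialised value costs a read plus a full rewrite, while a column-per-field layout needs a single small write.

```python
import json

class Store:
    """Toy key-value store that counts operations and bytes moved (hypothetical)."""
    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0
        self.bytes_moved = 0

    def get(self, key):
        self.reads += 1
        value = self.data[key]
        self.bytes_moved += len(value)
        return value

    def put(self, key, value):
        self.writes += 1
        self.bytes_moved += len(value)
        self.data[key] = value

# Made-up profile: 1000 fields of 100 characters each.
profile = {f"field{i}": "x" * 100 for i in range(1000)}

# Bad model: the whole profile is one big serialised value.
blob_store = Store()
blob_store.put("alice", json.dumps(profile))
blob_store.reads = blob_store.writes = blob_store.bytes_moved = 0  # ignore setup

def update_blob(store, key, field, new_value):
    doc = json.loads(store.get(key))     # read the whole value
    doc[field] = new_value               # modify one field
    store.put(key, json.dumps(doc))      # write the whole value back

update_blob(blob_store, "alice", "field7", "y" * 100)

# Good model: one entry per (row, column); an update overwrites that column only.
column_store = Store()

def update_column(store, key, field, new_value):
    store.put((key, field), new_value)

update_column(column_store, "alice", "field7", "y" * 100)

print("blob:  ", blob_store.reads, blob_store.writes, blob_store.bytes_moved)
print("column:", column_store.reads, column_store.writes, column_store.bytes_moved)
```

The blob update moves the full value twice, while the column update moves only the 100 bytes that changed and never reads.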
  • 7. Performance
      • Cacheability
          • Ensure your cache isn’t polluted by uncacheable things
          • Cached reads are ~100x faster than uncached
  • 8. What do you need?
  • 9. Optimise for queries
      • Data model design starts with queries
      • What are the common queries?
  • 10. Workload
      • How many inserts?
      • How many reads?
      • Do inserts depend on current data?
      • Is data write-once?
  • 11. Sizes
      • How big are the values?
      • Are some ‘users’ bigger than others?
      • How cacheable is your data?
  • 12. How do I get this?
      • Back of the envelope calculation
      • Monitor existing solution
      • Prototype a solution
  • 14. Keyspaces and Column Families
      • SQL Database ↔ Cassandra Keyspace
      • SQL Table ↔ Cassandra Column Family
      [Figure: a column family drawn as rows, each with a row key and columns such as col_1, col_2]
  • 15. Options and tools
      • Rows
      • Columns
      • Supercolumns
      • Composite columns
  • 16. Rows and columns
      [Figure: sparse grid of rows row1 to row7 against columns col1 to col7; each row has values in only some of the columns]
  • 17. Column options
      • Regular columns
      • Super columns: columns within columns
      • Composite columns: multi-dimensional column names
  • 18. Composite columns
      alice: { m2: { Sender: bob, Subject: ‘paper!’, ... } }
      bob: { m1: { Sender: alice, Subject: ‘rock?’, ... } }
      charlie: { m1: { Sender: alice, Subject: ‘rock?’, ... }, m2: { Sender: bob, Subject: ‘paper!’, ... } }
  • 19. Tools
      • Counters: atomic inc and dec
      • Expiring columns: TTL
      • Secondary indexes: your WHERE clause
  • 20. Rows vs columns
      • Row key is the shard key
      • Need lots of rows for scalability
      • Don’t be afraid of large-ish rows
      • But don’t make them too big
      • Avoid range queries across rows, but use them within rows
  • 21. Range queries
      • Within a row: SELECT col3..col5 FROM Standard1 WHERE KEY=row1
      [Figure: row1 holding columns col1, col2, col5, col6, col8]
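A range query within a row is cheap because the columns of a row are stored sorted by name. The sketch below mimics that with a sorted Python list and a binary search. The function name `slice_columns` is made up for illustration and is not a client-library call; the row contents are the ones shown on slide 21.

```python
from bisect import bisect_left, bisect_right

# Row from slide 21: column names are kept in sorted order within the row.
row1 = ["col1", "col2", "col5", "col6", "col8"]

def slice_columns(columns, start, finish):
    """Return the columns whose names fall in [start, finish], inclusive."""
    lo = bisect_left(columns, start)     # first column >= start
    hi = bisect_right(columns, finish)   # one past the last column <= finish
    return columns[lo:hi]

print(slice_columns(row1, "col3", "col5"))  # ['col5']
```

Only col5 comes back: col3 and col4 do not exist in this row, and the slice is one contiguous run of the sorted columns.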
  • 22. Range queries
      • Across rows: SELECT * FROM table WHERE key > row2 LIMIT 2
  • 23. Range queries
      SELECT * FROM table WHERE key > row2 LIMIT 2
      [Figure: diagram of rows row1, row2, row3 and row4 with the "> row2" range marked]
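One reason to be wary of range queries across rows is that, with a hash-based partitioner, rows are not stored in key order. The sketch below assumes an MD5-style partitioner (as with Cassandra's RandomPartitioner) and uses made-up helper names `token` and `range_after`. It is a simplified model, not the server's actual algorithm.

```python
import hashlib

def token(key):
    """MD5-based token, mimicking how a hash partitioner places a row key."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = ["row1", "row2", "row3", "row4"]
ring = sorted(keys, key=token)   # storage order follows the tokens, not the keys

def range_after(start_key, limit):
    """Rows whose token is greater than start_key's token, in token order."""
    start = token(start_key)
    return [k for k in ring if token(k) > start][:limit]

print("storage order:", ring)
print("key > row2 LIMIT 2:", range_after("row2", 2))
```

The result depends on the hash values rather than on the alphabetical order of the keys, so "key > row2" does not mean "row3, row4" in general.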
  • 24. Range queries
      • Range queries within rows ‘get_slice’ are fine
      • Avoid range queries across rows ‘get_range_slices’
  • 25. Batching
      • Overhead on each call
      • Batch together inserts, better if in the same row
      • Reduce read ops, use large get_slice reads
  • 27. Denormalisation
      • Hard drive performance constraints:
          • Sequential IO at 100s MB/s
          • Seek at 100 IO/s
      • Avoid random IO
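The disk figures on slide 27 explain why random IO hurts. As a rough sketch, compare reading 10 small items scattered across the disk with reading them stored next to each other. The 1 KB item size and the choice of 100 MB/s (the low end of "100s MB/s") are assumptions made for the arithmetic.

```python
SEEKS_PER_SECOND = 100            # slide 27: seek at 100 IO/s
SEQUENTIAL_BYTES_PER_S = 100e6    # assumption: low end of "100s MB/s"
ITEM_BYTES = 1000                 # assumption: 1 KB per item

seek_ms = 1000 / SEEKS_PER_SECOND  # 10 ms per seek

def scattered_read_ms(n):
    """n items in n different places: one seek each, transfer time ignored."""
    return n * seek_ms

def contiguous_read_ms(n):
    """n items stored together: one seek, then a sequential transfer."""
    transfer_ms = n * ITEM_BYTES / SEQUENTIAL_BYTES_PER_S * 1000
    return seek_ms + transfer_ms

print(scattered_read_ms(10))    # about 100 ms
print(contiguous_read_ms(10))   # about 10.1 ms
```

Under these assumptions the contiguous read is roughly 10x faster, and almost all of its cost is the single seek.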
  • 28. Denormalisation
      • Store columns accessed at similar times near to each other
          • => put them in the same row
      • Involves copying
      • Copying isn’t bad - pre flood prices <$100 per TB
  • 30. Messaging application
      • Users can send messages to other users
      • Horizontally scalable
      • Expect users to send to lots of recipients
  • 31. Messaging
      • In an RDBMS we might have a table for:
          • Users
          • Messages (sender is unique)
          • Mappings, Message → Receiver
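  • 32. A relational model (example relational DB model)
      • Users: Id, username
      • Messages: Id, Subject, Content, Date, Sender_Id
      • Msg_Receipt: Id, Message_Id, User_Id, Is_read
      • Relationships (1 to ∞): Users → Messages via Sender_Id; Messages → Msg_Receipt via Message_Id; Users → Msg_Receipt via User_Id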
  • 33. Querying
      Most recent 10 messages sent by a user:
          SELECT * FROM Messages
          WHERE Messages.Sender_Id = <id>
          ORDER BY Messages.Date DESC LIMIT 10;
      Most recent 10 messages received by a user:
          SELECT Messages.* FROM Messages, Msg_Receipt
          WHERE Msg_Receipt.User_Id = <id>
            AND Msg_Receipt.Message_Id = Messages.Id
          ORDER BY Messages.Date DESC LIMIT 10;
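The relational model and the two queries can be tried out with Python's built-in `sqlite3` module. This is a sketch only: the column types, the sample users, messages and dates are made up for illustration, and the queries select just the subject to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (Id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE Messages (
    Id INTEGER PRIMARY KEY, Subject TEXT, Content TEXT, Date TEXT,
    Sender_Id INTEGER REFERENCES Users(Id));
CREATE TABLE Msg_Receipt (
    Id INTEGER PRIMARY KEY,
    Message_Id INTEGER REFERENCES Messages(Id),
    User_Id INTEGER REFERENCES Users(Id),
    Is_read INTEGER DEFAULT 0);
""")

# Hypothetical sample data: alice sends "rock?", bob replies "paper!".
conn.executemany("INSERT INTO Users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "charlie")])
conn.executemany("INSERT INTO Messages VALUES (?, ?, ?, ?, ?)", [
    (1, "rock?", "", "2012-03-28T10:00:00", 1),
    (2, "paper!", "", "2012-03-28T10:05:00", 2),
])
conn.executemany(
    "INSERT INTO Msg_Receipt (Message_Id, User_Id) VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 1), (2, 3)])

def sent_by(user_id):
    """Subjects of the 10 most recent messages sent by a user."""
    rows = conn.execute(
        "SELECT Subject FROM Messages WHERE Sender_Id = ? "
        "ORDER BY Date DESC LIMIT 10", (user_id,))
    return [r[0] for r in rows]

def received_by(user_id):
    """Subjects of the 10 most recent messages received by a user (needs a join)."""
    rows = conn.execute(
        "SELECT Messages.Subject FROM Messages, Msg_Receipt "
        "WHERE Msg_Receipt.User_Id = ? "
        "AND Msg_Receipt.Message_Id = Messages.Id "
        "ORDER BY Messages.Date DESC LIMIT 10", (user_id,))
    return [r[0] for r in rows]

print(sent_by(1))       # ['rock?']
print(received_by(3))   # ['paper!', 'rock?']
```

The second query has to visit Msg_Receipt and then look up each matching row in Messages, which is the access pattern the following slides examine.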
  • 34. Under the hood
      Msg_Receipt                   Messages
      id   msg_id   user_id         id     subject   ...
      0    0        0               0      a
      1    3        1               1      b
      2    4        2               2      c
      3    6000     0               3      d
                                    4      e
                                    ...
                                    6000   x
  • 35. Under the hood
      • Normalisation => seeks
      • So denormalise
      • Hit capacity limit of one node quickly
  • 36. Back of the envelope...
      • 1 M users
      • Message size 1 KB
      • Each user has 5000 messages
      • => 5 TB data
  • 37. Back of the envelope...
      • Reading 10 messages => 10 seeks
      • If 10k active at once, need 100k seeks/s
      • => need 1000 disks
      • With 8 disks per node, RF 3, that’s 375 nodes
  • 38. Back of the envelope...
      • Denormalize: messages are immutable
      • Insert them into everyone’s inbox
      • Read 10 messages is one seek
      • Paging is sequential
      • => 10x fewer nodes: 38 nodes now!
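The arithmetic on slides 36 to 38 can be reproduced in a few lines. The constants come from the slides; two things are assumptions made here to get the slide's numbers: 1 KB is taken as 1000 bytes, and each active user is taken to issue one read per second.

```python
import math

USERS = 1_000_000
MESSAGES_PER_USER = 5_000
MESSAGE_BYTES = 1_000          # 1 KB, taken as 1000 bytes (assumption)
ACTIVE_USERS = 10_000          # each assumed to issue one read per second
MESSAGES_PER_READ = 10
SEEKS_PER_DISK = 100           # seeks per second per disk (slide 27)
DISKS_PER_NODE = 8
REPLICATION_FACTOR = 3

total_tb = USERS * MESSAGES_PER_USER * MESSAGE_BYTES / 1e12

def nodes_needed(seeks_per_read):
    """Nodes required to serve the read load at a given seek cost per read."""
    seeks_per_second = ACTIVE_USERS * seeks_per_read
    disks = seeks_per_second / SEEKS_PER_DISK
    return math.ceil(disks / DISKS_PER_NODE * REPLICATION_FACTOR)

normalised = nodes_needed(MESSAGES_PER_READ)  # one seek per message
denormalised = nodes_needed(1)                # one seek per inbox read

print(total_tb)       # 5.0
print(normalised)     # 375
print(denormalised)   # 38
```

The denormalised figure is 37.5 before rounding up, which is where the slide's 38 nodes comes from.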
  • 39. In Cassandra
      • Use a row per user
      • Composite columns, with TimeUUID as ID
      • Gives time ordering on messages
      • Inserts go to all recipients
  • 40. Messaging example
      From: alice  To: bob, charlie  Subject: rock?
      Row       m1
      alice     (empty)
      bob       sender: alice, subject: rock?
      charlie   sender: alice, subject: rock?
  • 41. Messaging example
      From: bob  To: alice, charlie  Subject: paper!
      Row       m1                               m2
      alice                                      sender: bob, subject: paper!
      bob       sender: alice, subject: rock?
      charlie   sender: alice, subject: rock?    sender: bob, subject: paper!
  • 42. Data
      alice: { m2: { Sender: bob, Subject: ‘paper!’, ... } }
      bob: { m1: { Sender: alice, Subject: ‘rock?’, ... } }
      charlie: { m1: { Sender: alice, Subject: ‘rock?’, ... }, m2: { Sender: bob, Subject: ‘paper!’, ... } }
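The inbox layout on slides 39 to 42 can be modelled in memory to see how the fan-out works. This is a sketch under stated assumptions, not the demo code: a plain dictionary stands in for the column family, an increasing counter stands in for the TimeUUID, and the function names `send` and `latest` are made up.

```python
from collections import defaultdict
from itertools import count

# Row key (user) -> {message id -> columns}; stands in for the column family.
inboxes = defaultdict(dict)

# Stand-in for a TimeUUID: ids increase in send order (assumption for the sketch).
_next_id = count(1)

def send(sender, recipients, subject):
    """Copy the message into every recipient's row (denormalised write)."""
    msg_id = next(_next_id)
    for user in recipients:
        inboxes[user][msg_id] = {"sender": sender, "subject": subject}
    return msg_id

def latest(user, n=10):
    """Newest n messages for a user, read from that user's single row."""
    row = inboxes[user]
    return [row[i] for i in sorted(row, reverse=True)[:n]]

send("alice", ["bob", "charlie"], "rock?")    # m1
send("bob", ["alice", "charlie"], "paper!")   # m2

for user in ("alice", "bob", "charlie"):
    print(user, latest(user))
```

Each write touches one row per recipient, and each inbox read touches exactly one row, which matches the trade described on slide 38.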
  • 43. Demo
      • Pycassa
      • Send message
      • List messages
      • Unread count