Cassandra as an Email Store

    Rustam Aliyev • 20 Feb 2012
Emails sent worldwide

4,500,000/sec
Source: Email Statistics Report 2009-2013, The Radicati Group.


Email storage problem

[Diagram: MTA, LDA]

Filesystem + RDBMS ≠ Scalability + Availability

ElasticInbox 1000 ft view

[Architecture diagram]
MTA → elasticinbox nodes (load-balancing, share-nothing)
  Message metadata → Metadata Store (Cassandra cluster)
  Original message → Blob Store (OpenStack, AWS S3, others)
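The split shown in the diagram can be sketched roughly as follows. This is a toy illustration, not ElasticInbox code: plain dictionaries stand in for the Cassandra cluster and the blob store, and the class name, method names and the `blob://sketch/` URI prefix are all hypothetical.

```python
import uuid


class ElasticInboxSketch:
    """Toy model of the metadata/blob split; dicts replace the real stores."""

    def __init__(self):
        # Stand-in for the Cassandra metadata store: account -> {message_id: metadata}
        self.metadata_store = {}
        # Stand-in for the blob store (S3, OpenStack, ...): uri -> raw bytes
        self.blob_store = {}

    def deliver(self, account: str, raw_message: bytes, subject: str) -> str:
        """Store the original message as a blob and its metadata separately."""
        message_id = str(uuid.uuid1())        # time-based ID
        uri = f"blob://sketch/{message_id}"   # hypothetical URI scheme
        self.blob_store[uri] = raw_message
        self.metadata_store.setdefault(account, {})[message_id] = {
            "subject": subject,
            "uri": uri,
            "size": len(raw_message),
        }
        return message_id

    def fetch_raw(self, account: str, message_id: str) -> bytes:
        """Follow the URI kept in the metadata to load the original message."""
        uri = self.metadata_store[account][message_id]["uri"]
        return self.blob_store[uri]
```

Listing a mailbox only touches the small metadata records; the large original message is loaded only when someone asks for it.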


Why Cassandra?

Horizontal scalability

High availability, no SPOF, and automatic replication

Flexible schema

Counters

Email storage does more writes than reads
  spam, sent mails, notifications, mailing lists, unread emails, ...

Why not Cassandra for BLOBs?
Thrift does not support streaming

  Value has to fit into memory

  Default max Thrift frame size is 5MB

Possible solution: split large files into 1MB chunks

  Less than 2% of emails are larger than 1MB (in our case)
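The chunking idea could look something like the sketch below. It only shows the splitting and reassembly; the `chunk:NNNNNN` column naming is an assumption made for illustration, not something the slides specify.

```python
CHUNK_SIZE = 1024 * 1024  # 1MB, the chunk size suggested on the slide


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut a large value into pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_columns(data: bytes, chunk_size: int = CHUNK_SIZE) -> dict[str, bytes]:
    """Map zero-padded column names (hypothetical scheme) to chunks.

    Zero padding makes the names sort in chunk order.
    """
    chunks = split_into_chunks(data, chunk_size)
    return {f"chunk:{i:06d}": chunk for i, chunk in enumerate(chunks)}


def join_chunks(columns: dict[str, bytes]) -> bytes:
    """Reassemble the original value by reading chunks in name order."""
    return b"".join(columns[name] for name in sorted(columns))
```

Each chunk then stays well under the frame limit, at the cost of several reads and writes for the few messages that exceed 1MB.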

Why not Cassandra for BLOBs?
Wasted RAM / JVM heap

  200 × 5MB messages R/W = 1GB RAM

Wasted disk space

  When RF=3, disk space = 6 × data

  1TB data → 6TB storage required!

Wasted CPU

  More CPU used during compactions

Leveled Compaction Strategy?
  New (1.0+), less wasted storage but more I/O.

BLOB Stores for BLOBs
BLOB stores are designed for storing BLOBs

Can store an unlimited number of objects in a single container.

AWS S3, OpenStack Object Store, and 15 others are supported (thanks @jclouds!).

40%-50% more space efficient than BLOBs in Cassandra (w/RF=3; 1TB → 3.5TB, rather than 6TB).

Cons: much slower than Cassandra (no memtable).
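The RAM and disk figures quoted for Cassandra and for BLOB stores can be checked with a few lines of arithmetic. The slides do not say where the 6× factor comes from; one plausible reading, assumed here, is 3 replicas times roughly 2× disk headroom for compaction. The function names and the `compaction_headroom` parameter are illustrative.

```python
MB_PER_GB = 1000  # decimal units, matching the round numbers on the slides


def heap_for_inflight_gb(messages: int, size_mb: int) -> float:
    """Heap needed if every in-flight message has to fit in memory at once."""
    return messages * size_mb / MB_PER_GB


def cassandra_disk_tb(data_tb: float, replication_factor: int = 3,
                      compaction_headroom: float = 2.0) -> float:
    """Disk required; compaction_headroom = 2.0 is an assumption, not from the slides."""
    return data_tb * replication_factor * compaction_headroom


heap_gb = heap_for_inflight_gb(200, 5)   # 200 x 5MB messages
cassandra_tb = cassandra_disk_tb(1.0)    # 1TB of raw data at RF=3
blob_store_tb = 3.5                      # figure quoted on the slide
saving = 1 - blob_store_tb / cassandra_tb

print(f"heap: {heap_gb} GB, Cassandra disk: {cassandra_tb} TB, saving: {saving:.0%}")
```

Under these assumptions the saving works out to about 42%, which sits inside the 40%-50% range quoted on the slide.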

Polyglot Persistence
Martin Fowler: “any decent sized enterprise will have a variety of different data storage technologies for different kinds of data”

Martin Fowler, 16 Nov 2011
Don't take the example in the diagram too seriously.


Data Model
NoSQL data model is driven by data access patterns:

   Email is immutable

   Mostly, very recent messages are accessed and updated

But sometimes, access patterns are driven by the NoSQL data model:

   Synergy between programming model and data model

   Some Gmail features driven by BigTable limitations?

      Labels instead of folders

      No custom sorting, only by time

   Other examples: “More” pagination
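One way to read the “More” pagination example: when messages are only ordered by time, the cheap operation is "give me the next batch older than the last one I saw", not "jump to page N". The sketch below assumes a time-ordered list of IDs and a hypothetical `next_page` helper; it is not taken from any real mail service.

```python
from bisect import bisect_left


def next_page(sorted_ids, page_size, before=None):
    """Return (page, cursor) from IDs sorted oldest-first.

    page   -- up to page_size IDs, newest first, all older than `before`
    cursor -- pass as `before` to get the following page; None when exhausted
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    end = len(sorted_ids) if before is None else bisect_left(sorted_ids, before)
    start = max(0, end - page_size)
    page = sorted_ids[start:end][::-1]
    cursor = page[-1] if start > 0 else None
    return page, cursor


ids = [1, 2, 3, 4, 5, 6, 7]            # stand-ins for time-ordered message IDs
first, cursor = next_page(ids, 3)      # newest three
second, cursor = next_page(ids, 3, before=cursor)
```

The client only ever holds a cursor, so the store never has to count rows or skip over earlier pages.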


Data Model ‒ Column Families
4 Column Families:
  MessageMetadata
  IndexLabels
  Accounts
  Counters

Account ID: String (user@domain.tld)
Message ID: TimeUUID
Label ID:   Integer


Data Model ‒ Accounts
Column Family
Reserved Labels: 0 = All Mails, 1 = Inbox, 2 = Drafts, ...

        "Accounts" {
            "user@elasticinbox.com" {
                "label:0" : "all",
                "label:1" : "inbox",
                "label:2" : "drafts",
                "label:230": "Custom Label",
                ...
            }
        }




Data Model ‒ IndexLabels
Column Family
Composite key: Account ID + Label ID
Messages ordered by time
 "IndexLabels" {
     "user@elasticinbox.com:0" {   # All Mails
         "550e8400-e29b-41d4-a716-446655440000"   : null,
         "892e8300-e29b-41d4-a716-446655440000"   : null,
         "a0232400-e29b-41d4-a716-446655440000"   : null,
         ...
     }
     "user@elasticinbox.com:1" {   # Inbox
         "550e8400-e29b-41d4-a716-446655440000"   : null,
         "892e8300-e29b-41d4-a716-446655440000"   : null,
         "a0232400-e29b-41d4-a716-446655440000"   : null,
         ...
     }
 }

Data Model ‒ MessageMetadata
SuperColumn Family

Stores message metadata and pre-parsed contents

 Message headers, body and attachment info

TimeUUID as unique Message ID, ordered by time
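A TimeUUID is a version 1 (time-based) UUID, so the creation time can be recovered from the ID itself and IDs can be ordered by it. The sketch below uses Python's standard `uuid` module; the helper names are illustrative, and the two sample IDs are ones that appear elsewhere in this deck.

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(u: uuid.UUID) -> datetime:
    """Extract the embedded timestamp from a version 1 UUID."""
    if u.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    unix_100ns = u.time - UUID_EPOCH_OFFSET
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
        microseconds=unix_100ns // 10
    )


def newest_first(ids):
    """Order message IDs by embedded timestamp, not by string comparison."""
    return sorted(ids, key=lambda u: u.time, reverse=True)


a = uuid.UUID("e5586600-f81d-11df-8cc2-080027267700")
b = uuid.UUID("e5595060-f81d-11df-bc91-080027267700")
print(timeuuid_to_datetime(a), newest_first([a, b])[0] == b)
```

Sorting on the `time` field matters because the textual form of a UUID starts with the low bits of the timestamp, so plain string order is not time order.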




Data Model ‒ MessageMetadata
"MessageMetadata" {
    "user@elasticinbox.com" {
        "550e8400-e29b-41d4-a716-446655440000" {
            "from"    : "[['Test','test@elasticinbox.com']]",
            "to"      : "[['Me','user@elasticinbox.com'],[…]]",
            "subject" : "Hello world!",
            "date"    : "12 March 2011 01:12:00",
            "uri"     : "blob://aws-s3/550e8400-e29b-41d4-a716-446655440000",
            "l:1"     : null,   # Label ID
            "m:1"     : null,   # Marker ID
            "html"    : "<html><body>This is message body</body></html>",
            "parts"   : "{'2.1': {'filename': 'image.png', ...}}",
            ...
        }
        "892e8300-e29b-41d4-a716-446655440000" {
            ...
        }
        ...
    }
}

Data Model ‒ MessageMetadata
Query: List 30 newest messages with label “Inbox”
   ids[] = SliceQuery("IndexLabels", "user@dom.tld:1", 30)

   msg[] = MultigetQuery("MessageMetadata", "user@dom.tld", ids[])

 Row Key         SuperColumn                            SubColumns
 user@dom.tld    550e8400-e29b-41d4-a716-446655440000   "from": "...", "to": "...", "subject": "...", "html": "..."
                 892e8300-e29b-41d4-a716-446655440000   - // -
                 a0232400-e29b-41d4-a716-446655440000   - // -
 some@dom2.tld   e5586600-f81d-11df-8cc2-080027267700   - // -
                 e5595060-f81d-11df-bc91-080027267700   - // -
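The two-step lookup on this slide can be mimicked with in-memory structures. This is only a sketch of the access pattern: `slice_query` and `multiget_query` here are hypothetical Python functions over dicts, not calls into a Cassandra client library.

```python
def slice_query(index_labels, row_key, count):
    """Return up to `count` newest message IDs from one index row.

    Each row is assumed to hold IDs oldest-first, so we read from the end.
    """
    if count <= 0:
        return []
    columns = index_labels.get(row_key, [])
    return columns[-count:][::-1]


def multiget_query(message_metadata, account, ids):
    """Fetch metadata for the given IDs from the account's row, keeping order."""
    row = message_metadata.get(account, {})
    return [row[i] for i in ids if i in row]


def list_messages(index_labels, message_metadata, account, label_id, count=30):
    """Step 1: read IDs from the label index. Step 2: load their metadata."""
    ids = slice_query(index_labels, f"{account}:{label_id}", count)
    return multiget_query(message_metadata, account, ids)


index_labels = {"user@dom.tld:1": ["id-1", "id-2", "id-3"]}
message_metadata = {
    "user@dom.tld": {
        "id-1": {"subject": "first"},
        "id-2": {"subject": "second"},
        "id-3": {"subject": "third"},
    }
}
inbox = list_messages(index_labels, message_metadata, "user@dom.tld", 1, count=2)
```

Because both the index row and the metadata row are keyed by account, each step reads a single row.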


Data Model ‒ Counters

SuperColumn Family
All of an account's counters are on the same node
"Counters" {
    "user@elasticinbox.com"    {
        "l:0" {
             "total_bytes" :   18239090,
             "total_msg"   :   394,
             "new_msg"     :   12
        }
        "l:1" {
             "total_msg"   :   144,
             "new_msg"     :   10
        }
        ...
    }
}


Non-atomic counters: it's easy to miscount
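The slides do not spell out how miscounts arise. One scenario, assumed here purely for illustration, is that the index write and the counter update are two separate operations, so a failure between them leaves the counter behind. The class below is a toy model with hypothetical names; it also shows a recount from the index as one possible repair.

```python
class CounterSketch:
    """Toy model: the label index is the source of truth, counters can drift."""

    def __init__(self):
        self.index = {}     # label row -> set of message IDs
        self.counters = {}  # label row -> total_msg

    def add_message(self, row, message_id, fail_before_counter=False):
        # Step 1: write the message ID into the label index.
        self.index.setdefault(row, set()).add(message_id)
        # Simulated crash between the two independent writes.
        if fail_before_counter:
            raise RuntimeError("simulated failure before the counter update")
        # Step 2: bump the counter; nothing ties this to step 1.
        self.counters[row] = self.counters.get(row, 0) + 1

    def recount(self, row):
        """Rebuild the counter from the index."""
        self.counters[row] = len(self.index.get(row, ()))
        return self.counters[row]


store = CounterSketch()
store.add_message("user@dom.tld:1", "m1")
try:
    store.add_message("user@dom.tld:1", "m2", fail_before_counter=True)
except RuntimeError:
    pass  # the index now has 2 messages, the counter still says 1
```

Calling `store.recount("user@dom.tld:1")` brings the counter back in line with the index.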


ElasticInbox in Production
In production since Nov 2011

~200K accounts, 30M+ messages

4-node cluster, RF=3, Cassandra 0.8.x

Each 1TB of raw mails = 70GB in Cassandra

  Metadata + LZF compressed email text/html body




ElasticInbox in Production
Cassandra load: 40 requests per second per node

Cassandra latency: 10ms read average, 0.02ms write

Write to Read ratio:
     CF Name            W:R Ratio
     MessageMetadata    3:1
     IndexLabels        2:1
     Accounts           1:50
     Counters           2:3


Future work

Performance improvements (may involve minor schema changes)

Full-text search (preferably on top of Cassandra)

POP3 and IMAP

Built-in filtering rules

Message threads / conversations



Questions?


www.elasticinbox.com

github.com/elasticinbox

@elasticinbox   @rstml

 

ElasticInbox

  • 1. Cassandra as an Email Store Rustam Aliyev • 20 Feb 2012
  • 2. Emails sent worldwide: 4,500,000/sec (Email Statistics Report 2009-2013, The Radicati Group).
  • 4. Email storage problem: MTA, LDA. Filesystem + RDBMS ≠ Scalability + Availability.
  • 5. ElasticInbox 1000 ft view: MTA → elasticinbox nodes (load-balancing, share-nothing). Message metadata → Metadata Store (Cassandra cluster); original message → Blob Store (OpenStack, AWS S3, others).
  • 6. Why Cassandra? Horizontal scalability; high availability, no SPOF and automatic replication; flexible schema; counters. Email storage does more writes than reads (spam, sent mails, notifications, mailing lists, unread emails, ...).
  • 7. Why not Cassandra for BLOBs? Thrift does not support streaming: a value has to fit into memory, and the default max Thrift frame size is 5MB. Possible solution: split large files into 1MB chunks (less than 2% of emails are >1MB in our case).
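
The chunking workaround mentioned above could look roughly like the following. This is only a sketch of the idea, not ElasticInbox code: the function names and the index-keyed dictionary (standing in for one column per chunk) are assumptions made for illustration.

```python
CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as suggested on the slide


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> dict:
    """Map a zero-based chunk index to each slice of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return {
        index: data[offset:offset + chunk_size]
        for index, offset in enumerate(range(0, len(data), chunk_size))
    }


def join_chunks(chunks: dict) -> bytes:
    """Reassemble the original value by concatenating chunks in index order."""
    return b"".join(chunks[index] for index in sorted(chunks))
```

Each chunk then fits comfortably within the Thrift frame limit, at the cost of several reads and writes per large message.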
  • 11. Why not Cassandra for BLOBs? Wasted RAM / JVM heap: 200 × 5MB messages R/W = 1GB RAM. Wasted disk space: when RF=3, disk space = 6 × data (1TB data → 6TB storage required!). Wasted CPU: more CPU used during compactions. Leveled Compaction Strategy? New (1.0+), less wasted storage but more I/O.
  • 12. BLOB Stores for BLOBs. BLOB stores are designed for storing BLOBs and can store an unlimited number of objects in a single container. AWS S3, OpenStack Object Store, and 15 others are supported (thanks @jclouds!). 40%-50% more space efficient than BLOBs in Cassandra (w/RF=3; 1TB → 3.5TB, rather than 6TB). Cons: much slower than Cassandra (no memtable).
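
The sizing figures quoted for RAM and disk can be checked with a few lines of arithmetic. The sketch below is illustrative only. In particular, the slides do not explain where the factor of 6 comes from; reading it as RF=3 multiplied by roughly 2× free-space headroom for compaction is an assumption here, and the function and parameter names are hypothetical.

```python
def heap_for_inflight_mb(messages: int, message_mb: int) -> int:
    """Heap needed if every in-flight message value must fit in memory at once."""
    return messages * message_mb


def cassandra_disk_tb(data_tb: float, rf: int = 3,
                      compaction_headroom: float = 2.0) -> float:
    """Disk needed for BLOBs in Cassandra.

    ASSUMPTION: the 2x headroom factor models free space reserved for
    compaction; the slides only state the overall result (6 x data).
    """
    return data_tb * rf * compaction_headroom


def space_saving(blob_store_tb: float, cassandra_tb: float) -> float:
    """Fraction of disk saved by the blob store relative to Cassandra."""
    return 1 - blob_store_tb / cassandra_tb
```

With the numbers from the slides, 200 messages of 5MB come to 1000MB of heap, 1TB of data at RF=3 comes to 6TB, and 3.5TB versus 6TB is a saving of a little over 40%.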
  • 13. Polyglot Persistence. “any decent sized enterprise will have a variety of different data storage technologies for different kinds of data” (Martin Fowler, 16 Nov 2011). Don't take the example in the diagram too seriously.
  • 24. Data Model. NoSQL data model is driven by data access patterns: email is immutable; mostly, very recent messages are accessed and updated. But sometimes, access patterns are driven by the NoSQL data model: synergy between programming model and data model. Some Gmail features driven by BigTable limitations? Labels instead of folders; no custom sorting, only by time. Other examples: “More” pagination.
  • 25. Data Model ‒ Column Families. 4 Column Families: MessageMetadata, IndexLabels, Accounts, Counters. Account ID: String (user@domain.tld); Message ID: TimeUUID; Label ID: Integer.
  • 26. Data Model ‒ Accounts Column Family. Reserved labels: 0 = All Mails, 1 = Inbox, 2 = Drafts, ...
        "Accounts" {
          "user@elasticinbox.com" {
            "label:0"   : "all",
            "label:1"   : "inbox",
            "label:2"   : "drafts",
            "label:230" : "Custom Label",
            ...
          }
        }
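
As a rough illustration of how the Accounts row might be read, the sketch below models the column family as nested Python dictionaries and extracts the label ID to label name mapping from the "label:<id>" column names. The dictionary model and the `labels_for` helper are assumptions for illustration; no Cassandra client is involved.

```python
# In-memory stand-in for the "Accounts" column family shown on the slide.
ACCOUNTS = {
    "user@elasticinbox.com": {
        "label:0": "all",
        "label:1": "inbox",
        "label:2": "drafts",
        "label:230": "Custom Label",
    }
}

LABEL_PREFIX = "label:"


def labels_for(accounts: dict, account_id: str) -> dict:
    """Return {label_id: label_name} for one account (empty if unknown)."""
    row = accounts.get(account_id, {})
    return {
        int(column[len(LABEL_PREFIX):]): value
        for column, value in row.items()
        if column.startswith(LABEL_PREFIX)
    }
```

Keeping all labels of an account in one row means a single row read is enough to render the label list.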
  • 27. Data Model ‒ IndexLabels Column Family. Composite key: Account + Label ID. Messages ordered by time.
        "IndexLabels" {
          "user@elasticinbox.com:0" {  # All Mails
            "550e8400-e29b-41d4-a716-446655440000" : null,
            "892e8300-e29b-41d4-a716-446655440000" : null,
            "a0232400-e29b-41d4-a716-446655440000" : null,
            ...
          }
          "user@elasticinbox.com:1" {  # Inbox
            "550e8400-e29b-41d4-a716-446655440000" : null,
            "892e8300-e29b-41d4-a716-446655440000" : null,
            "a0232400-e29b-41d4-a716-446655440000" : null,
            ...
          }
        }
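
The composite row key and the valueless index columns can be sketched as follows. This is an in-memory model, not the ElasticInbox implementation. It assumes that every message is also indexed under label 0 (All Mails), which the example above suggests but the slides do not state outright; the helper names are hypothetical.

```python
ALL_MAILS = 0  # reserved label ID for "All Mails"


def index_key(account_id: str, label_id: int) -> str:
    """Build the composite row key "<account>:<label id>"."""
    return f"{account_id}:{label_id}"


def index_message(index: dict, account_id: str, message_id: str,
                  label_ids: list) -> None:
    """Add a message to the index row of each of its labels.

    The column name is the message ID and the value is empty (None),
    mirroring the null values on the slide.
    ASSUMPTION: label 0 (All Mails) is always included.
    """
    for label_id in {ALL_MAILS, *label_ids}:
        index.setdefault(index_key(account_id, label_id), {})[message_id] = None
```

Because the message ID is the column name, the store keeps each label's messages sorted without any extra index structure.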
  • 28. Data Model ‒ MessageMetadata SuperColumn Family. Stores message metadata and pre-parsed contents: message headers, body and attachment info. TimeUUID as unique Message ID, ordered by time.
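
To make the "ordered by time" property concrete, the sketch below builds version 1 (time-based) UUIDs from a timestamp using only Python's standard `uuid` module and sorts them by their embedded time. It is an illustration under assumptions: the helper names are hypothetical, the clock sequence and node are fixed to zero for reproducibility, and a real deployment would rely on the client library's TimeUUID support instead.

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_uuid(moment: datetime, clock_seq: int = 0, node: int = 0) -> uuid.UUID:
    """Build a version 1 UUID whose timestamp is `moment` (timezone-aware)."""
    micros = (moment - UNIX_EPOCH) // timedelta(microseconds=1)
    ticks = micros * 10 + UUID_EPOCH_OFFSET
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000      # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def uuid_datetime(message_id: uuid.UUID) -> datetime:
    """Recover the timestamp embedded in a version 1 UUID."""
    micros = (message_id.time - UUID_EPOCH_OFFSET) // 10
    return UNIX_EPOCH + timedelta(microseconds=micros)


def newest_first(message_ids: list) -> list:
    """Sort message IDs by embedded timestamp, most recent first."""
    return sorted(message_ids, key=lambda u: u.time, reverse=True)
```

Note that sorting the textual form of these UUIDs would not give chronological order, because the low bits of the timestamp come first in the string; ordering has to use the embedded timestamp.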
  • 29. Data Model ‒ MessageMetadata
        "MessageMetadata" {
          "user@elasticinbox.com" {
            "550e8400-e29b-41d4-a716-446655440000" {
              "from"    : "[['Test','test@elasticinbox.com']]",
              "to"      : "[['Me','user@elasticinbox.com'],[…]]",
              "subject" : "Hello world!",
              "date"    : "12 March 2011 01:12:00",
              "uri"     : "blob://aws-s3/550e8400-e29b-41d4-a716-446655440000",
              "l:1"     : null,  # Label ID
              "m:1"     : null,  # Marker ID
              "html"    : "<html><body>This is message body</body></html>",
              "parts"   : "{'2.1': {'filename': 'image.png', ...}}",
              ...
            }
            "892e8300-e29b-41d4-a716-446655440000" { ... }
            ...
          }
        }
  • 30. Data Model ‒ MessageMetadata. Query: list the 30 newest messages with label “Inbox”.
        ids[] = SliceQuery("IndexLabels", "user@dom.tld:1", 30)
        msg[] = MultigetQuery("MessageMetadata", "user@dom.tld", ids[])

        Row Key        SuperColumn                            SubColumns
        user@dom.tld   550e8400-e29b-41d4-a716-446655440000   "from" : "...", "to" : "...", "subject" : "...", "html" : "..."
                       892e8300-e29b-41d4-a716-446655440000   - // -
                       a0232400-e29b-41d4-a716-446655440000   - // -
        some@dom2.tld  e5586600-f81d-11df-8cc2-080027267700   - // -
                       e5595060-f81d-11df-bc91-080027267700   - // -
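
The two-step lookup (a slice over the label index, then a multiget of metadata) can be imitated in memory to show the data flow. This is a sketch, not a Cassandra client call: `slice_newest` and `list_newest` are hypothetical names, and plain integers stand in for TimeUUIDs so that "newest" simply means "largest ID".

```python
def slice_newest(index_labels: dict, row_key: str, count: int) -> list:
    """Step 1: take the `count` newest message IDs from one index row.

    ASSUMPTION: message IDs are integers that grow over time, standing
    in for TimeUUID column names.
    """
    columns = index_labels.get(row_key, {})
    return sorted(columns, reverse=True)[:count]


def list_newest(index_labels: dict, metadata: dict, account_id: str,
                label_id: int, count: int = 30) -> list:
    """Step 2: fetch the metadata supercolumns for the sliced IDs."""
    ids = slice_newest(index_labels, f"{account_id}:{label_id}", count)
    row = metadata.get(account_id, {})
    return [(message_id, row[message_id]) for message_id in ids
            if message_id in row]
```

Both steps touch a single row each, so listing a mailbox page does not require scanning other accounts' data.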
  • 32. Data Model ‒ Counters SuperColumn Family. All of an account's counters are on the same node. Non-atomic counters: it's easy to miscount.
        "Counters" {
          "user@elasticinbox.com" {
            "l:0" {
              "total_bytes" : 18239090,
              "total_msg"   : 394,
              "new_msg"     : 12
            }
            "l:1" {
              "total_msg" : 144,
              "new_msg"   : 10
            }
            ...
          }
        }
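
One way such counters can drift is when an increment is applied twice, for example after a client retries a write whose acknowledgement was lost. The in-memory sketch below only illustrates that failure mode; it is not how ElasticInbox updates counters. It assumes `total_bytes` is tracked on label 0 only (as the example above suggests) and that every message counts toward label 0; the function name is hypothetical.

```python
from collections import defaultdict


def new_counters() -> defaultdict:
    """Counters[account]["l:<label id>"][counter name] -> int."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(int)))


def apply_message(counters, account_id: str, label_ids: list,
                  size_bytes: int, is_new: bool) -> None:
    """Increment the counters touched by one incoming message.

    ASSUMPTIONS: label 0 (All Mails) always counts the message, and
    total_bytes is kept on label 0 only.
    """
    for label_id in sorted({0, *label_ids}):
        row = counters[account_id][f"l:{label_id}"]
        row["total_msg"] += 1
        if is_new:
            row["new_msg"] += 1
        if label_id == 0:
            row["total_bytes"] += size_bytes
```

Calling `apply_message` twice for the same message, as a blind retry would, leaves `total_msg` at 2 for a single stored message, because increments are not idempotent.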
  • 33. ElasticInbox in Production. In production since Nov 2011. ~200K accounts, 30M+ messages. 4 node cluster, RF=3, Cassandra 0.8.x. Each 1TB of raw mails = 70GB in Cassandra (metadata + LZF compressed email text/html body).
  • 34. ElasticInbox in Production. Cassandra load: 40 requests per second per node. Cassandra latency: 10ms read average, 0.02ms write. Write to read ratio:
        CF Name          W:R Ratio
        MessageMetadata  3:1
        IndexLabels      2:1
        Accounts         1:50
        Counters         2:3
  • 35. Future work. Performance improvements (may involve minor schema changes); full-text search (preferably on top of Cassandra); POP3 and IMAP; built-in filtering rules; message threads / conversations.

Editor's Notes

  8-11. Compaction everywhere, Outlook example
  12. OpenStack Swift has similarities with Cassandra design.
  29. SuperColumns implementation planned to be replaced in Cassandra 1.2
  30-32. Designed for ad/page impression count at Digg/Twitter