SlideShare ist ein Scribd-Unternehmen logo
1 von 52
Downloaden Sie, um offline zu lesen
CASSANDRA & SOLID
STATE DRIVES
Rick Branson, DataStax
FACT

CASSANDRA’S STORAGE
ENGINE WAS OPTIMIZED
 FOR SPINNING DISKS
LSM-TREES
WRITE PATH
insert({ cf1: { row1: { col3: foo } } })




      Client                         Cassandra




     On-Disk Node Commit Log

{ cf1: { row1: { col1: abc } } }
                                                   In-Memory Memtable for “cf1”
{ cf1: { row1: { col2: def } } }

{ cf1: { row1: { col1: <del> } } }
                                                 row1   col1: [del]   col2: “def”   col3: “foo”

{ cf1: { row2: { col1: xyz } } }
                                                 row2   col1: “xyz”
{ cf1: { row1: { col3: foo } } }




                                                              COMMIT
In-Memory Memtable for “cf1”

                    row1         col1: [del]       col2: “def”   col3: “foo”


                    row2         col1: “xyz”




SSTable   SSTable      SSTable             SSTable




 1         2               3                   4

                                                                               FLUSH
SSTable   SSTable             SSTable   SSTable




        1         2                    3         4


                           SSTable




SSTables are merged to maintain read performance


                                           COMPACT
X X X X
  SSTable   SSTable         SSTable   SSTable




SSTable
                      New SSTable is streamed
                      to disk and old SSTables
                             are erased
TAKEAWAYS
• All disk writes are sequential, append-
  only operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction is linear
  complexity O(N)
• SSTables are completely immutable
TAKEAWAYS
• All disk writes are sequential, append-
  only operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction is linear
                    IMPORTANT
  complexity O(N)
• SSTables are completely immutable
COMPARED
• Most popular data storage engines
  rewrite modified data in-place: MySQL
  (InnoDB), PostgreSQL, Oracle,
  MongoDB, Membase, BerkeleyDB, etc
• Most perform similar buffering of
  writes before flushing to disk
• ... but flushes are RANDOM writes
SPINNING DISKS
• Dirt cheap: $0.08/GB
• Seek time limited by time it takes for drive
  to rotate: IOPS = RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best: 125MB/
  sec for modern SATA drives
THAT WAS THE WORLD
IN WHICH CASSANDRA
     WAS BUILT
2012: MLC NAND FLASH*
            • Affordable: ~$1.75/GB street
            • Massive IOPS: 39,500/sec read, 23,000/
                  sec write
            • Latency of less than 100µs
            • Good sequential throughput: 270MB/
                  sec read, 205MB/sec write
            • Way cheaper per IOPS: $0.02 vs $1.25
* based on specifications provided by Intel for 300GB Intel 320 drive
WITH RANDOM ACCESS
STORAGE, ARE CASSANDRA’S
  LSM-TREES OBSOLETE?
SOLID STATE HAS
SOME MAJOR BUTS...
... BUT
• Cannot overwrite directly: must erase
  first, then write
• Can write in small increments (4KB),
  but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC)
  for each erase block
WEAR LEVELING is used
to reduce the number of
 total erase operations
WEAR LEVELING
WEAR LEVELING
Erase Block
WEAR LEVELING
WEAR LEVELING
WEAR LEVELING

 Disk Page
WEAR LEVELING

  Write 1
WEAR LEVELING

  Write 1
  Write 2
WEAR LEVELING

  Write 1
  Write 2
  Write 3
Remember: the whole block must be erased


         Write 1
         Write 2
         Write 3

                   How is data from only
                     Write 2 modified?
Mark Garbage
Empty Block




Mark Garbage    Append
                Modified
                 Data
Wait... GARBAGE?
THAT MEANS...
... fragmentation,
WHICH MEANS...
Garbage Collection!
GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations drag on performance
• Modern SSDs do this in the
  background... as much as possible
• If no empty blocks are available, GC
  must be done before ANY writes can
  complete
WRITE AMPLIFICATION
• When only a few kilobytes are written,
  but fragmentation causes a whole
  block to be rewritten
• The smaller & more random the writes,
  the worse this gets
• Modern “mark and sweep” GC reduces
  it, but cannot eliminate it
Torture test shows massive
             write performance drop-off
            for heavily fragmented drive

Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
Some poorly designed drives
      COMPLETELY fall apart


Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
Even a well-behaved drive
suffers significantly from the
         torture test

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
Post-torture, all disk blocks
were marked empty, and the
   “fast” comes back...

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
“TRIM”
• Filesystems don’t typically immediately
  erase data when files are deleted, they just
  mark them as deleted and erase later
• TRIM allows the OS to actively tell the drive
  when a region of disk is no longer used
• If an entire erase block is marked as
  unused, GC is avoided, otherwise TRIM
  just hastens the collection process
TRIM only reduces the
write amplification effect,
   it can’t eliminate it.
THEN THERE’S
 LIFETIME...
AnandTech estimates that modern MLC SSDs
only last about 1.5 years under heavy MySQL load,
   which causes around 10x write amplification
REMEMBER THIS?
TAKEAWAYS
• All disk writes are sequential, append-
  only operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction is linear
  complexity O(N)
• SSTables are completely immutable
CASSANDRA
 ONLY WRITES
SEQUENTIALLY
“For a sequential write workload,
     write amplification is equal to 1,
          i.e., there is no write
              amplification.”


Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write
        Performance: Understanding, Analysis, and Performance Modeling”
THANK YOU.
     ~ @rbranson

Weitere ähnliche Inhalte

Was ist angesagt?

TechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSTechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSPythian
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...DataStax Academy
 
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache CassandraCassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache CassandraDataStax Academy
 
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1DataStax Academy
 
Performance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationPerformance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationRamkumar Nottath
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...ScyllaDB
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast EnoughScylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast EnoughScyllaDB
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Community
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashCeph Community
 
Compaction, Compaction Everywhere
Compaction, Compaction EverywhereCompaction, Compaction Everywhere
Compaction, Compaction EverywhereDataStax Academy
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...JAXLondon2014
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonUri Cohen
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyDataStax Academy
 
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph Community
 

Was ist angesagt? (18)

TechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSTechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWS
 
Stabilizing Ceph
Stabilizing CephStabilizing Ceph
Stabilizing Ceph
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
 
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache CassandraCassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
 
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
 
Performance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationPerformance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migration
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast EnoughScylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
 
ceph-barcelona-v-1.2
ceph-barcelona-v-1.2ceph-barcelona-v-1.2
ceph-barcelona-v-1.2
 
Bluestore
BluestoreBluestore
Bluestore
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
 
Compaction, Compaction Everywhere
Compaction, Compaction EverywhereCompaction, Compaction Everywhere
Compaction, Compaction Everywhere
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al Tobey
 
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance Barriers
 

Ähnlich wie Cassandra and Solid State Drives

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State DrivesDataStax Academy
 
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016Tomas Vondra
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptxJawaharPrasad3
 
FlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkFlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkI Goo Lee
 
Some analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDBSome analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDBXiao Yan Li
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensMatthew Ahrens
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localyticsandrew311
 
Design Tradeoffs for SSD Performance
Design Tradeoffs for SSD PerformanceDesign Tradeoffs for SSD Performance
Design Tradeoffs for SSD Performancejimmytruong
 
Storage structure
Storage structureStorage structure
Storage structureMohd Arif
 
20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updated20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updatedWerner Fischer
 
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASMSAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASMAlex Zaballa
 
My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016Alex Chistyakov
 
Exploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient WorkflowsExploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient Workflowsjasonajohnson
 

Ähnlich wie Cassandra and Solid State Drives (20)

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
 
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
FlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkFlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalk
 
Some analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDBSome analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDB
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
Cassandra compaction
Cassandra compactionCassandra compaction
Cassandra compaction
 
Design Tradeoffs for SSD Performance
Design Tradeoffs for SSD PerformanceDesign Tradeoffs for SSD Performance
Design Tradeoffs for SSD Performance
 
Storage structure
Storage structureStorage structure
Storage structure
 
20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updated20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updated
 
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASMSAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
 
SSD PPT BY SAURABH
SSD PPT BY SAURABHSSD PPT BY SAURABH
SSD PPT BY SAURABH
 
My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
Exploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient WorkflowsExploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient Workflows
 

Kürzlich hochgeladen

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Kürzlich hochgeladen (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Cassandra and Solid State Drives

  • 1. CASSANDRA & SOLID STATE DRIVES Rick Branson, DataStax
  • 2. FACT CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED FOR SPINNING DISKS
  • 5. insert({ cf1: { row1: { col3: foo } } }) Client Cassandra On-Disk Node Commit Log { cf1: { row1: { col1: abc } } } In-Memory Memtable for “cf1” { cf1: { row1: { col2: def } } } { cf1: { row1: { col1: <del> } } } row1 col1: [del] col2: “def” col3: “foo” { cf1: { row2: { col1: xyz } } } row2 col1: “xyz” { cf1: { row1: { col3: foo } } } COMMIT
  • 6. In-Memory Memtable for “cf1” row1 col1: [del] col2: “def” col3: “foo” row2 col1: “xyz” SSTable SSTable SSTable SSTable 1 2 3 4 FLUSH
  • 7. SSTable SSTable SSTable SSTable 1 2 3 4 SSTable SSTables are merged to maintain read performance COMPACT
  • 8. X X X X SSTable SSTable SSTable SSTable SSTable New SSTable is streamed to disk and old SSTables are erased
  • 9. TAKEAWAYS • All disk writes are sequential, append- only operations • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N) • SSTables are completely immutable
  • 10. TAKEAWAYS • All disk writes are sequential, append- only operations • On-disk tables (SSTables) are written in sorted order, so compaction is linear IMPORTANT complexity O(N) • SSTables are completely immutable
  • 11. COMPARED • Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc • Most perform similar buffering of writes before flushing to disk • ... but flushes are RANDOM writes
  • 12. SPINNING DISKS • Dirt cheap: $0.08/GB • Seek time limited by time it takes for drive to rotate: IOPS = RPM/60 • 7,200 RPM = ~120 IOPS • 15,000 RPM has been the max for decades • Sequential operations are best: 125MB/ sec for modern SATA drives
  • 13. THAT WAS THE WORLD IN WHICH CASSANDRA WAS BUILT
  • 14. 2012: MLC NAND FLASH* • Affordable: ~$1.75/GB street • Massive IOPS: 39,500/sec read, 23,000/ sec write • Latency of less than 100µs • Good sequential throughput: 270MB/ sec read, 205MB/sec write • Way cheaper per IOPS: $0.02 vs $1.25 * based on specifications provided by Intel for 300GB Intel 320 drive
  • 15. WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S LSM-TREES OBSOLETE?
  • 16.
  • 17. SOLID STATE HAS SOME MAJOR BUTS...
  • 18. ... BUT • Cannot overwrite directly: must erase first, then write • Can write in small increments (4KB), but only erase in ~512KB blocks • Latency: write is ~100µs, erase is ~2ms • Limited durability: ~5,000 cycles (MLC) for each erase block
  • 19. WEAR LEVELING is used to reduce the number of total erase operations
  • 25. WEAR LEVELING Write 1
  • 26. WEAR LEVELING Write 1 Write 2
  • 27. WEAR LEVELING Write 1 Write 2 Write 3
  • 28. Remember: the whole block must be erased Write 1 Write 2 Write 3 How is data from only Write 2 modified?
  • 30. Empty Block Mark Garbage Append Modified Data
  • 35. GARBAGE COLLECTION • Compacts fragmented disk blocks • Erase operations drag on performance • Modern SSDs do this in the background... as much as possible • If no empty blocks are available, GC must be done before ANY writes can complete
  • 36. WRITE AMPLIFICATION • When only a few kilobytes are written, but fragmentation causes a whole block to be rewritten • The smaller & more random the writes, the worse this gets • Modern “mark and sweep” GC reduces it, but cannot eliminate it
  • 37. Torture test shows massive write performance drop-off for heavily fragmented drive Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
  • 38. Some poorly designed drives COMPLETELY fall apart Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
  • 39. Even a well-behaved drive suffers significantly from the torture test Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
  • 40. Post-torture, all disk blocks were marked empty, and the “fast” comes back... Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
  • 41.
  • 42. “TRIM” • Filesystems don’t typically immediately erase data when files are deleted, they just mark them as deleted and erase later • TRIM allows the OS to actively tell the drive when a region of disk is no longer used • If an entire erase block is marked as unused, GC is avoided, otherwise TRIM just hastens the collection process
  • 43. TRIM only reduces the write amplification effect, it can’t eliminate it.
  • 45.
  • 46.
  • 47. AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load, which causes around 10x write amplification
  • 49. TAKEAWAYS • All disk writes are sequential, append- only operations • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N) • SSTables are completely immutable
  • 51. “For a sequential write workload, write amplification is equal to 1, i.e., there is no write amplification.” Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”
  • 52. THANK YOU. ~ @rbranson