FlashCache



Mohan Srinivasan
April 2011
FlashCache at Facebook
▪   What
        ▪   We want to use some Flash storage on existing servers
        ▪   We want something that is simple to deploy and use
        ▪   Our IO access patterns benefit from a cache

▪   Who
    ▪   Mohan Srinivasan – design and implementation
    ▪   Paul Saab – platform and MySQL integration
    ▪   Michael Jiang – testing, performance and capacity planning
    ▪   Mark Callaghan - benchmarketing
Introduction
▪   Block cache for Linux - write back and write through modes
▪   Layered below the filesystem at the top of the storage stack
▪   Cache Disk Blocks on fast persistent storage (Flash, SSD)
▪   Loadable Linux Kernel module, built using the Device Mapper (DM)
▪   Primary use case is InnoDB, but general purpose
▪   Based on dm-cache by Prof. Ming
Caching Modes
▪   Write Back
     ▪   Lazy writing to disk
     ▪   Persistent across reboot
     ▪   Persistent across device removal
▪   Write Through, Write Around
     ▪   Non-persistent
     ▪   Are you a pessimist?
Cache Structure
▪   Set associative hash
▪   Hash with fixed-size buckets (sets) with linear probing within a set
▪   512-way set associative by default
▪   dbn: Disk Block Number, address of block on disk
▪   Set = (dbn / block size / set size) mod (number of sets)
▪   A sequential range of dbns maps onto a single set
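As a concrete illustration of the set-mapping formula above, a minimal userspace sketch (4KB blocks and 512-way sets per the deck; treating dbn as a 512-byte-sector address is an assumption made for this example):

/* Sketch of: Set = (dbn / block size / set size) mod (number of sets)
 * Assumes dbn is in 512-byte sectors, 4KB cache blocks, 512-way sets. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SECTORS 8ULL     /* 4KB block / 512-byte sector */
#define SET_SIZE      512ULL   /* blocks per set              */

static uint64_t dbn_to_set(uint64_t dbn, uint64_t num_sets)
{
    return (dbn / BLOCK_SECTORS / SET_SIZE) % num_sets;
}

int main(void)
{
    uint64_t num_sets = 1024;  /* hypothetical cache size */
    /* A sequential range of dbns (here one set's worth) maps onto a single set. */
    printf("dbn 0    -> set %llu\n", (unsigned long long)dbn_to_set(0, num_sets));
    printf("dbn 4095 -> set %llu\n", (unsigned long long)dbn_to_set(4095, num_sets));
    printf("dbn 4096 -> set %llu\n", (unsigned long long)dbn_to_set(4096, num_sets));
    return 0;
}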
Cache Structure
[Diagram: cache sets Set 0 … SET i … Set N-1; each set contains cache set blocks 0–511; groups of 512 disk blocks (Block 0 … Block 511), spaced N sets' worth of blocks apart, map onto the same cache set.]
Replacement and Memory Footprint
▪   Replacement policy is FIFO (default) or LRU within a set
▪   Switch on the fly between FIFO/LRU (sysctl)
▪   Metadata per cache block: 16 bytes in memory, 16 bytes on ssd
▪   On-ssd metadata per slot:
     ▪   <dbn, block state>

▪   In-memory metadata per slot:
     ▪   <dbn, block state, LRU chain pointers, misc>
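As a rough picture of where those 16 bytes go, an illustrative sketch; the struct names and fields below are guesses for exposition, not FlashCache's actual definitions:

/* Illustrative per-slot metadata layouts; names and exact fields are assumptions. */
#include <stdio.h>
#include <stdint.h>

/* On-ssd metadata per slot: <dbn, block state> -- 16 bytes */
struct ssd_slot_md {
    uint64_t dbn;
    uint64_t cache_state;        /* INVALID / VALID / DIRTY */
};

/* In-memory metadata per slot: <dbn, block state, LRU chain pointers, misc> -- 16 bytes */
struct mem_slot_md {
    uint64_t dbn;
    uint16_t lru_prev, lru_next; /* per-set LRU chain, slot indices within the set */
    uint8_t  cache_state;
    uint8_t  nr_queued;          /* misc: IOs queued against this slot */
};

int main(void)
{
    printf("on-ssd slot: %zu bytes, in-memory slot: %zu bytes\n",
           sizeof(struct ssd_slot_md), sizeof(struct mem_slot_md));
    return 0;
}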
Reads
▪   Compute cache set for dbn
▪   Cache Hit
     ▪   Verify checksums if configured
     ▪   Serve read out of cache

▪   Cache Miss
     ▪   Find free block or reclaim block based on replacement policy
     ▪   Read block from disk and populate cache
     ▪   Update block checksum if configured
     ▪   Return data to user
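A toy, single-set model of this read path (FIFO reclaim, checksums omitted); the code is purely illustrative and not the kernel implementation:

/* Toy single-set read path: a hit serves from the cache, a miss reclaims a
 * slot FIFO-style and populates it from "disk". Illustrative only. */
#include <stdio.h>

#define SET_SIZE 4              /* tiny set so reclaim is easy to see */

static long slots[SET_SIZE];    /* dbn cached in each slot, -1 = free */
static int  fifo_next = 0;      /* FIFO reclaim cursor                */

static void read_block(long dbn)
{
    for (int i = 0; i < SET_SIZE; i++)          /* linear probe within the set */
        if (slots[i] == dbn) {
            printf("dbn %ld: cache hit, serve from ssd slot %d\n", dbn, i);
            return;
        }
    int victim = fifo_next;                     /* free or reclaimed slot */
    fifo_next = (fifo_next + 1) % SET_SIZE;
    printf("dbn %ld: miss, read from disk, populate slot %d, return to user\n",
           dbn, victim);
    slots[victim] = dbn;
}

int main(void)
{
    for (int i = 0; i < SET_SIZE; i++) slots[i] = -1;
    long pattern[] = {10, 11, 10, 12, 13, 14, 10};  /* dbn 10 is reclaimed before the last read */
    for (int i = 0; i < 7; i++) read_block(pattern[i]);
    return 0;
}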
Write Through - writes
▪   Compute cache set for dbn
▪   Cache hit
      ▪   Get cached block

▪   Cache miss
      ▪   Find free block or reclaim block

▪   Write data block to disk
▪   Write data block to cache
▪   Update block checksum
Write Back - writes
▪   Compute cache set for dbn
▪   Cache Hit
     ▪   Write data block into cache
     ▪   If data block not DIRTY, synchronously update on-ssd cache metadata to
         mark block DIRTY

▪   Cache miss
     ▪   Find free block or reclaim block based on replacement policy
     ▪   Write data block to cache
     ▪   Synchronously update on-ssd cache metadata to mark block DIRTY
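A toy, single-set model of the Write Back write path that also counts the synchronous on-ssd metadata updates discussed later under overheads; illustrative only (a dirty reclaim victim would first have to be cleaned, which is not modeled here):

/* Toy Write Back write path for one set: write hits to an already-DIRTY block
 * need no metadata update; transitions to DIRTY need one synchronous update. */
#include <stdio.h>

#define SET_SIZE 4

static long slots[SET_SIZE];
static int  dirty[SET_SIZE];
static int  fifo_next = 0;
static int  md_updates = 0;      /* synchronous on-ssd metadata writes */

static void write_block(long dbn)
{
    for (int i = 0; i < SET_SIZE; i++)
        if (slots[i] == dbn) {                              /* cache hit  */
            if (!dirty[i]) { dirty[i] = 1; md_updates++; }  /* VALID -> DIRTY */
            printf("dbn %ld: hit, write to cache slot %d\n", dbn, i);
            return;
        }
    int victim = fifo_next;                                 /* cache miss */
    fifo_next = (fifo_next + 1) % SET_SIZE;
    slots[victim] = dbn;
    dirty[victim] = 1;
    md_updates++;                                           /* mark new block DIRTY */
    printf("dbn %ld: miss, write to cache slot %d\n", dbn, victim);
}

int main(void)
{
    for (int i = 0; i < SET_SIZE; i++) { slots[i] = -1; dirty[i] = 0; }
    long pattern[] = {10, 10, 10, 11};   /* repeated write hits to a DIRTY block */
    for (int i = 0; i < 4; i++) write_block(pattern[i]);
    printf("4 writes, %d on-ssd metadata updates\n", md_updates);
    return 0;
}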
Small or uncacheable requests
▪   First invalidate the cache blocks that overlap the request
      ▪   There are at most 2 such blocks
      ▪   For Write Back, if the overlapping blocks are DIRTY they are cleaned
          first then invalidated

▪   Uncacheable full-block reads are still served from the cache on a cache hit.
▪   Perform disk IO
▪   Repeat invalidation to close races which might have caused the block
    to be cached while the disk IO was in progress
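A small sketch of why at most two cache blocks need invalidating: a request no larger than one cache block can straddle at most one block boundary. The 4KB block / 512-byte sector sizes below are the usual assumption:

/* Which cache blocks does a small request overlap?  A request of at most one
 * cache block in size straddles at most one block boundary, so it overlaps
 * at most two blocks. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SECTORS 8ULL      /* 4KB block / 512-byte sector */

static void overlapping_blocks(uint64_t start_sector, uint64_t nr_sectors)
{
    uint64_t first = start_sector / BLOCK_SECTORS;
    uint64_t last  = (start_sector + nr_sectors - 1) / BLOCK_SECTORS;
    printf("request [%llu, +%llu): invalidate block dbn %llu",
           (unsigned long long)start_sector, (unsigned long long)nr_sectors,
           (unsigned long long)(first * BLOCK_SECTORS));
    if (last != first)
        printf(" and block dbn %llu", (unsigned long long)(last * BLOCK_SECTORS));
    printf("\n");
}

int main(void)
{
    overlapping_blocks(16, 2);   /* fits inside one block       -> 1 block  */
    overlapping_blocks(22, 4);   /* straddles a block boundary  -> 2 blocks */
    return 0;
}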
Write Back policy
▪   Dirty blocks not recently accessed
      ▪   A clock-like algorithm picks off Dirty blocks not accessed in the last
          15 minutes (configurable) for cleaning
▪   When the number of dirty blocks in a set exceeds a configurable threshold, clean some blocks
      ▪   Blocks selected for writeback based on replacement policy
      ▪   Default dirty threshold is 20%. Set it higher for write-heavy workloads

▪   Sort the selected blocks and pick up any other blocks in the set that can be
    contiguously merged with them (sketched below)
▪   Writes merged by the IO scheduler
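A toy version of the cleaning step: sort the dirty blocks selected for writeback and group contiguous dbns so their disk writes can be merged; simplified for illustration, with hypothetical dbn values:

/* Toy cleaning pass: sort dirty blocks picked for writeback by dbn and group
 * contiguous ones so the IO scheduler can merge the disk writes.
 * Assumes 4KB blocks (8 sectors); dbn values are sector addresses. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define BLOCK_SECTORS 8ULL

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* dirty dbns picked for cleaning in one set (hypothetical values) */
    uint64_t dbns[] = {4096, 32, 4104, 24, 4112, 128};
    size_t n = sizeof(dbns) / sizeof(dbns[0]);

    qsort(dbns, n, sizeof(dbns[0]), cmp_u64);

    for (size_t i = 0; i < n; ) {
        size_t j = i + 1;
        while (j < n && dbns[j] == dbns[j - 1] + BLOCK_SECTORS)
            j++;                                  /* extend a contiguous run */
        printf("merged write: dbn %llu, %zu block(s)\n",
               (unsigned long long)dbns[i], j - i);
        i = j;
    }
    return 0;
}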
Write Back – overheads
▪   In-memory cache metadata footprint
     ▪   300GB/4KB cache -> ~1.2GB
     ▪   160GB/4KB cache -> ~640MB

▪   Cache metadata writes per file system write
     ▪   Worst case is 2 cache metadata updates per write
           ▪   (VALID->DIRTY, DIRTY->VALID)
     ▪   Average case is much lower because of cache write hits and batching of
         cache metadata updates
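The footprint numbers above follow directly from 16 bytes of in-memory metadata per 4KB cache block:

    300GB / 4KB = 78,643,200 cache blocks; x 16 bytes ≈ 1.2GB of metadata
    160GB / 4KB = 41,943,040 cache blocks; x 16 bytes = 640MB of metadata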
Write Through/Around – cache overheads
▪   In-Memory Cache metadata footprint
     ▪   300GB/4KB cache -> ~1.2GB
     ▪   160GB/4KB cache -> ~640MB

▪   Cache metadata writes per file system write
     ▪   1 cache data write per file system write (Write Through)
     ▪   No overhead (for Write Around)
Write Back – metadata updates
▪   Cache (on-ssd) metadata only updated on writes and block cleanings
    (VALID->DIRTY or DIRTY->VALID)
▪   Cache (on-ssd) metadata not updated on cache population for reads
▪   Reload after an unclean shutdown only loads DIRTY blocks
▪   Fast and Slow cache shutdowns
     ▪   Only metadata is written on fast shutdown. Reload loads both dirty and
         clean blocks
     ▪   Slow shutdown writes all dirty blocks to disk first, then writes out
         metadata to the ssd. Reload only loads clean blocks.

▪   Metadata updates to multiple blocks in same sector are batched
Torn Page Problem
▪   Handles partial block writes caused by power failure or other causes
▪   Problem exists for Flashcache in Write Back mode
▪   Detected via block checksums
     ▪   Checksums are disabled by default
     ▪   Pages with bad checksums are not used

▪   Checksums increase cache metadata writes and memory footprint
     ▪   Update cache metadata checksums on DIRTY->DIRTY block transitions
         for Write Back
      ▪   Each cache slot's metadata grows by 8 bytes to hold the checksum (a 50%
          increase from 16 bytes to 24 bytes for the Write Back case)
Cache controls for Write Back
▪   Works best with O_DIRECT file access
▪   Global modes – Cache All or Cache Nothing
     ▪   Cache All has a blacklist of pids and tgids
     ▪   Cache Nothing has a whitelist of pids and tgids

▪   tgids can be used to tag all pthreads in the group as cacheable
▪   Exceptions for threads within a group are supported
▪   List changes done via FlashCache ioctls
▪   Cache can be read but is not written for non-cacheable tgids and pids
▪   We modified MySQL and scp to use this support
Cache Nothing policy
▪   If the thread id is whitelisted, cache all IOs for this thread
▪   If the tgid is whitelisted, cache all IOs for this thread
▪   If the thread id is blacklisted, do not cache its IOs
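A compact model of this Cache Nothing decision (the pid/tgid lists are plain arrays here for illustration; the module's real bookkeeping differs):

/* Illustrative Cache Nothing policy: cache an IO only if its thread id or its
 * thread group id is whitelisted, and the thread id itself is not blacklisted. */
#include <stdio.h>
#include <stdbool.h>

static bool in_list(int id, const int *list, int n)
{
    for (int i = 0; i < n; i++)
        if (list[i] == id)
            return true;
    return false;
}

static bool cache_nothing_should_cache(int pid, int tgid,
                                       const int *whitelist, int nw,
                                       const int *blacklist, int nb)
{
    if (in_list(pid, blacklist, nb))          /* per-thread exception */
        return false;
    return in_list(pid, whitelist, nw) || in_list(tgid, whitelist, nw);
}

int main(void)
{
    int whitelist[] = {1234};                 /* e.g. the mysqld tgid          */
    int blacklist[] = {5678};                 /* e.g. a mysqldump scan thread  */

    printf("pid 2000 tgid 1234 -> %s\n",
           cache_nothing_should_cache(2000, 1234, whitelist, 1, blacklist, 1)
               ? "cache" : "don't cache");
    printf("pid 5678 tgid 1234 -> %s\n",
           cache_nothing_should_cache(5678, 1234, whitelist, 1, blacklist, 1)
               ? "cache" : "don't cache");
    printf("pid 9999 tgid 9999 -> %s\n",
           cache_nothing_should_cache(9999, 9999, whitelist, 1, blacklist, 1)
               ? "cache" : "don't cache");
    return 0;
}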
Cache control example
▪   We use Cache Nothing mode for MySQL servers
▪   The mysqld tgid is added to the whitelist
     ▪   All IO done by it is cacheable
     ▪   Writes done by other processes do not update the cache

▪   Full table scans done by mysqldump use a hint that directs mysqld to
    add the query’s thread id to the blacklist to avoid wiping FlashCache
     ▪   select /* SQL_NO_FCACHE */ pk, col1, col2 from foobar
Utilities
▪   flashcache_create
     ▪   flashcache_create -b 4k -s 10g mysql /dev/flash /dev/disk

▪   flashcache_destroy
      ▪   flashcache_destroy /dev/flash

▪   flashcache_load
sysctl -a | grep flash
dev.flashcache.cache_all = 0
dev.flashcache.fast_remove = 0
dev.flashcache.zero_stats = 1
dev.flashcache.write_merge = 1
dev.flashcache.reclaim_policy = 0
dev.flashcache.pid_expiry_secs = 60
dev.flashcache.max_pids = 100
dev.flashcache.do_pid_expiry = 0
dev.flashcache.max_clean_ios_set = 2
dev.flashcache.max_clean_ios_total = 4
dev.flashcache.debug = 0
dev.flashcache.dirty_thresh_pct = 20
dev.flashcache.stop_sync = 0
dev.flashcache.do_sync = 0
Removing FlashCache
▪   umount /data
▪   dmsetup remove mysql
▪   flashcache_destroy /dev/flash
cat /proc/flashcache_stats
reads=4 writes=0 read_hits=0 read_hit_percent=0 write_hits=0
 write_hit_percent=0 dirty_write_hits=0 dirty_write_hit_percent=0
 replacement=0 write_replacement=0 write_invalidates=0
 read_invalidates=0 pending_enqueues=0 pending_inval=0
 metadata_dirties=0 metadata_cleans=0 cleanings=0 no_room=0
 front_merge=0 back_merge=0 nc_pid_adds=0 nc_pid_dels=0
 nc_pid_drops=0 nc_expiry=0 disk_reads=0 disk_writes=0
 ssd_reads=0 ssd_writes=0 uncached_reads=169
 uncached_writes=128
Future Work
▪   Cache mirroring
        ▪   SW RAID 0 block device as a cache

▪   Online cache resize
        ▪   No shutdown and recreate

▪   Support for ATA trim
    ▪   Discard blocks no longer in use
▪   Fix the torn page problem
    ▪   Use shadow pages
Resources
▪   GitHub : facebook/flashcache
▪   Mailing list : flashcache-dev@googlegroups.com
▪   http://facebook.com/MySQLatFacebook
▪   Email :
        ▪   mohan@fb.com (Mohan Srinivasan)
        ▪   ps@fb.com (Paul Saab)
        ▪   michael@fb.com (Michael Jiang)
        ▪   mcallaghan@fb.com (Mark Callaghan)
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
