SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Optimal usage of SSDs
under Linux:
Optimize your I/O Subsystem

Werner Fischer,
Technology Specialist Thomas-Krenn.AG

LinuxCon Europe 2011,
October 26th - 28th 2011,
Prague, Czech Republic
Introduction

                  who I am                         who I am not
 Werner Fischer   from Australia    Linux user        Kernel
                                    since 2001       developer




 working for a      freelancer     Piano learner       SSD
 Server vendor     IT journalist                     developer




                                                      slide 2/30
Agenda

 1) SSD layout

 2) I/O performance metrics

 3) Configurations tips




                     Source: maximumpc.com


                                             slide 3/30
1) SSD layout: memory cell

  • memory cells
    – NAND memory cell =
      MOS transistor with
      floating gate
    – permanently store
      charge                         Source: Intel




        • programming puts electrons on floating gate
        • erase takes them off
    – one program/erase (p/e) cycle is a round trip by the electrons
    – back-and-forth round trips gradually damage the tunnel oxide
    – endurance is limited, measured in number of p/e cycles:
        • 50nm MLC ~ 10.000 p/e cycles
        • 34nm/25nm/20nm MLC ~ 3.000 – 5.000 p/e cycles

                                                           slide 4/30
1) SSD layout: memory cell

  • memory cells
    – SLC (Single Level Cell) → 1 bit per memory cell
    – MLC (Multi Level Cell) → 2 bits per memory cell




          Source: anandtech.com


    – TLC (Triple Level Cell) → 3 bits per memory cell
    – 16LC (16 Level Cell) → 4 bits per memory cell

  • multiple memory cells (e.g. 16.384) build up a “page”
    – page = smallest area, which can be read/written

                                                         slide 5/30
1) SSD layout: page

        blackboard #1                     • one line = page within a SSD
   1 Stiegl Bier
   2 Budweiser
                                               – 8.192 Bytes (8 kiB)
   3 Hofstettner                               – can be read/written individually
   4 Heinecken                                 – cannot be changed/erased individually
   5
   6
   7
   8




 Note: example sizes of pages and blocks are taken from Intel's Series 320 SSDs (with IMFT's 25nm Flash chips)


                                                                                                 slide 6/30
1) SSD layout: block

        blackboard #1                     • one line = page within a SSD
   1 Stiegl Bier
   2 Budweiser
                                               – 8.192 Bytes (8 kiB)
   3 Hofstettner                               – can be read/written individually
   4 Heinecken                                 – cannot be changed/erased individually
   5
   6                                      • one blackboard = block within a SSD
   7                                           – consists of 256 lines (pages),
   8                                             2.097.152 Bytes (2 MiB)
                                               – smallest area which
                                                 can be individually erased
                                                 (we have only
                                                 watering-cans for that ;-)
 Note: example sizes of pages and blocks are taken from Intel's Series 320 SSDs (with IMFT's 25nm Flash chips)


                                                                                                 slide 7/30
1) SSD layout: change page

      blackboard #1   • easier way to change lines?
  1 Stiegl Bier
  2 Budweiser
  3 Hofstettner
  4 Heinecken
  5
  6
  7
  8




                                                 slide 8/30
1) SSD layout: change page

      blackboard #1   • easier way to change lines!
  1 Stiegl Bier
  2 Budweiser
                        – (1) mark old line as invalid
  3 Hofstettner
  4 Heinecken
  5
  6
  7
  8




                                                         slide 9/30
1) SSD layout: change page

      blackboard #1   • easier way to change lines!
  1 Stiegl Bier
  2 Budweiser
                        – (1) mark old line as invalid
  3 Hofstettner         – (2) store new content in a free line
  4 Heinecken
  5 Pilsener
  6
  7
  8




                                                       slide 10/30
1) SSD layout: change page

      blackboard #1   • easier way to change lines!
  1 Stiegl Bier
  2 Budweiser
                        – (1) mark old line as invalid
  3 Hofstettner         – (2) store new content in a free line
  4 Heinecken
  5 Pilsener          • this can be done as long as there are
  6                     enough free lines left...
  7
  8                                        we need spare
                                           blackboards to have
                                           enough free lines!




                                                       slide 11/30
1) SSD layout: spare area

• SSDs need spare area
                                  cont (LBA 14)       cont (LBA 13)    cont (LBA 19)
                                  cont (LBA 93)       cont (LBA 92)    cont (LBA 91)


  – avoids the erasement          cont (LBA 15)
                                  cont (LBA 27)
                                  cont (LBA 04)
                                                      cont (LBA 11)
                                                      cont (LBA 26)
                                                      cont (LBA 09)
                                                                       cont (LBA 15)
                                                                       cont (LBA 29)
                                                                       cont (LBA 04)

    of a block when a single      cont (LBA 43)
                                  cont (LBA 58)
                                                       cont (LBA 42)
                                                       cont (LBA 57)
                                                                       cont (LBA 43)
                                                                       cont (LBA 59)


    page is changed
                                  cont (LBA 84)        cont (LBA 77)   cont (LBA 89)




  – after some time spare
    area will be filled up, too   cont (LBA 02)
                                  cont (LBA 93)
                                  cont (LBA 26)


  – cleaning gets necessary       cont (LBA 57)




    (garbage collection)




                                                  visible capacity,                       spare area,
                                                    e.g. 300 GB                           e.g. 44 GB



                                                                                       slide 12/30
1) SSD layout: blocks→planes→dies→TSOPs

 • planes
   – multiple blocks make up a plane
   – e.g. 1.024 blocks = 1 plane
 • dies
   – multiple planes make up a die
   – e.g. 4 planes = 1 die             Source: http://newsroom.intel.com/community/intel_newsroom/blog/2011/04/14




 • TSOP (thin small outline package)
   – multiple dies (e.g. 1 - 8)
 • SSDs
   – e.g. 10 TSOPs

                                       Source: maximumpc.com




                                                                               slide 13/30
Agenda

 1) SSD layout

 2) I/O performance metrics

 3) Configurations tips




                     Source: maximumpc.com


                                             slide 14/30
2) I/O performance metrics

  • throughput
    – MByte/s
    – throughput analogy:
       • # of persons/h from Berlin → Prague

  • # of I/O operations per second
    – IOPS analogy:
       • # of individual trips to Prague
         (from Berlin, Vienna, Paris, Rome, ...)

  • latency because of queue depth
    – queue depth analogy:
       • vehicles must use a ferry to reach destination
       • with how many vehicles does
         the ferry depart?
                                                          slide 15/30
Agenda

 1) SSD layout

 2) I/O performance metrics

 3) Configurations tips
    ●   AHCI (NCQ+DIPM)
    ●   TRIM (discard)
    ●   noatime
    ●   tmpfs
    ●   alignment
    ●   over-provisioning


                       Source: maximumpc.com


                                               slide 16/30
3) Configuration tips: AHCI (NCQ+DIPM)

  • NCQ (Native Command Queuing)
    – allows SSD to execute multiple I/O requests in parallel
    – boosts throughput
    – configure queue depth to get your optimal balance
      between max. # of IOPS and lowest latency




                                                          slide 17/30
3) Configuration tips: AHCI (NCQ+DIPM)

  • DIPM (Device Initiated Interface Power Management)
     – reduces idle power down to 0,1 Watt




  • Aggressive
    Link Power
    Management
     – min_power
     – medium_power
     – max_performance
     More Details: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/ALPM.html



updated slide! (has been corrected after the talk)                                                       slide 18/30
3) Configuration tips: TRIM (discard)

  • ATA TRIM
                             cont (LBA 14)       cont (LBA 13)    cont (LBA 19)
                             cont (LBA 93)       cont (LBA 92)    cont (LBA 91)


    – tells SSD which data   cont (LBA 15)
                             cont (LBA 27)
                             cont (LBA 04)
                                                 cont (LBA 11)
                                                 cont (LBA 26)
                                                  cont (LBA 09)
                                                                  cont (LBA 15)
                                                                  cont (LBA 29)
                                                                  cont (LBA 04)

      can be discarded       cont (LBA 43)
                             cont (LBA 58)
                                                  cont (LBA 42)
                                                  cont (LBA 57)
                                                                  cont (LBA 43)
                                                                  cont (LBA 59)
                             cont (LBA 84)        cont (LBA 77)   cont (LBA 89)




                             cont (LBA 02)




                                             visible capacity,                       spare area,
                                               e.g. 300 GB                           e.g. 44 GB



                                                                                  slide 19/30
3) Configuration tips: TRIM (discard)

  • ATA TRIM
                             cont (LBA 14)       cont (LBA 13)    cont (LBA 19)
                             cont (LBA 93)       cont (LBA 92)    cont (LBA 91)


    – tells SSD which data   cont (LBA 15)
                             cont (LBA 27)
                             cont (LBA 04)
                                                 cont (LBA 11)
                                                 cont (LBA 26)
                                                  cont (LBA 09)
                                                                  cont (LBA 15)
                                                                  cont (LBA 29)
                                                                  cont (LBA 04)

      can be discarded       cont (LBA 43)
                             cont (LBA 58)
                                                  cont (LBA 42)
                                                  cont (LBA 57)
                                                                  cont (LBA 43)
                                                                  cont (LBA 59)
                             cont (LBA 84)        cont (LBA 77)   cont (LBA 89)




                             cont (LBA 02)
                             cont (LBA 93)




                                             visible capacity,                       spare area,
                                               e.g. 300 GB                           e.g. 44 GB



                                                                                  slide 20/30
3) Configuration tips: TRIM (discard)

  • ATA TRIM
                                 cont (LBA 14)       cont (LBA 13)    cont (LBA 19)
                                 cont (LBA 93)       cont (LBA 92)    cont (LBA 91)


    – tells SSD which data       cont (LBA 15)
                                 cont (LBA 27)
                                 cont (LBA 04)
                                                     cont (LBA 11)
                                                     cont (LBA 26)
                                                      cont (LBA 09)
                                                                      cont (LBA 15)
                                                                      cont (LBA 29)
                                                                      cont (LBA 04)

      can be discarded           cont (LBA 43)
                                 cont (LBA 58)
                                                      cont (LBA 42)
                                                      cont (LBA 57)
                                                                      cont (LBA 43)
                                                                      cont (LBA 59)
                                 cont (LBA 84)        cont (LBA 77)   cont (LBA 89)


    – without TRIM:
       • deleting a big file
         (e.g. 100 GB) would     cont (LBA 02)
                                 cont (LBA 93)


         lead to keep
         unusable data
       • unusable data will be
         maintained during
         garbage collection!
       • more overhead →
         lower performance &                     visible capacity,                       spare area,
         lower endurance!                          e.g. 300 GB                           e.g. 44 GB



                                                                                      slide 21/30
3) Configuration tips: TRIM (discard)

  • ATA TRIM using discard infrastructure in Linux
    – online discard
       • Ext4: since Kernel 2.6.33
       • XFS: since Kernel 3.0
       • Btrfs: since Kernel 2.6.32
    – batched discard (using fstrim command)
       • Ext4: since Kernel 2.6.37
       • XFS: since Kernel 2.6.38
       • Btrfs: since Kernel 2.6.39
    – pre-discard on format
       • E2fsprogs >= 1.41.10
       • Xfsprogs >= 3.1.0



                                                 slide 22/30
3) Configuration tips: TRIM (discard)

  • ATA TRIM using discard infrastructure in Linux
    – I/O stack discard support (device mapper):
        • since Kernel 2.6.36: DM targets delay, linear, mpath, stripe
        • since Kernel 2.6.38: DM mirror target
    – no I/O stack discard support yet:
        • MD raid

  • alternatives to discard: wiper.sh / raid1ext4trim.sh
    – use hdparm --trim-sector-ranges-stdin
    – read warnings in the source of those scripts
    – cannot be used with device mapper



                                                                 slide 23/30
3) Configuration tips: TRIM (discard)

  • “does TRIM work?” howto




                                        slide 24/30
3) Configuration tips: noatime & tmpfs

  • noatime / relatime
    – omits writes of metadata on every read




  • tmpfs
    – for temporary data like /tmp/, /var/tmp/, /var/cache/, ...


                                                              slide 25/30
4) Configuration tips: alignment

  • align partition and file systems
    – wrong alignment:




    – use fdisk parameters: fdisk -c -u /dev/sda
    – correct alignment:




                                                   slide 26/30
4) Configuration tips: over-provisioning

• do not use full normal
  visible capacity
                                 cont (LBA 14)   cont (LBA 13)
                                 cont (LBA 93)   cont (LBA 92)
                                 cont (LBA 15)   cont (LBA 11)
                                 cont (LBA 27)   cont (LBA 26)
                                 cont (LBA 04)   cont (LBA 09)


  – activate HPA                 cont (LBA 43)
                                 cont (LBA 58)
                                 cont (LBA 84)
                                                 cont (LBA 42)
                                                 cont (LBA 57)
                                                 cont (LBA 77)

    (host protected area)
      • ATA8-ACS SET MAX
        ADDRESS                  cont (LBA 02)
                                 cont (LBA 93)


      • use hdparm -N
                                 cont (LBA 26)
                                 cont (LBA 57)




  – or simply do not partition
    full visible capacity
  – in either case if SSD has
    been used before:            HPA limited capacity
                                    e.g. 200 GB
      • do a secure erase to
                                         normal visible capacity,      spare area,
        TRIM all blocks                      e.g. 300 GB               e.g. 44 GB



                                                                    slide 27/30
4) Configuration tips: over-provisioning

  • over-provisioning is useful when discard cannot be
    used yet (e.g. MD-RAID, hardware RAID, ...)
  • measurements by Intel:




    Source: Intel




                                                 slide 28/30
Review:

 1) SSD layout

 2) I/O performance metrics

 3) Configurations tips
    ●   AHCI (NCQ+DIPM)
    ●   TRIM (discard)
    ●   noatime
    ●   tmpfs
    ●   alignment
    ●   over-provisioning


                       Source: maximumpc.com


                                               slide 29/30
Thanks for your time!




Image sources:
© Robynmac | Dreamstime.com | http://www.dreamstime.com/stock-photography-uluru-australia-kangaroo-sign-image19716692
© stormpic | aboutpixel.de | http://www.aboutpixel.de/foto/im-wartezimmer/stormpic/77740
© meisterleise | aboutpixel.de | http://www.aboutpixel.de/foto/tinnitus-g/meisterleise/113752
© goenz | aboutpixel.de | http://www.aboutpixel.de/foto/computer-1/goenz/104076
© Mr. Monk | aboutpixel.de | http://www.aboutpixel.de/foto/los-beschreib-mich%21/mr_-monk/39871
© RancoR | aboutpixel.de | http://www.aboutpixel.de/foto/giesskanne/rancor/29812
© Michael Hirschka | pixelio.de | http://www.pixelio.de/media/427783
© Sommaruga Fabio | pixelio.de | http://www.pixelio.de/media/448050
© Dieter Schütz | pixelio.de | http://www.pixelio.de/media/537686

Weitere ähnliche Inhalte

Ähnlich wie 20111026 optimal-usage-of-ssds-under-linux-updated

Secondary storage devices
Secondary storage devices Secondary storage devices
Secondary storage devices
Slideshare
 
"Mobage DBA Fight against Big Data" - NHN TE
"Mobage DBA Fight against Big Data" - NHN TE"Mobage DBA Fight against Big Data" - NHN TE
"Mobage DBA Fight against Big Data" - NHN TE
Ryosuke IWANAGA
 
W1.1 i os in database
W1.1   i os in databaseW1.1   i os in database
W1.1 i os in database
gafurov_x
 

Ähnlich wie 20111026 optimal-usage-of-ssds-under-linux-updated (20)

TokuDB vs RocksDB
TokuDB vs RocksDBTokuDB vs RocksDB
TokuDB vs RocksDB
 
04 cache memory
04 cache memory04 cache memory
04 cache memory
 
Flash! (Modern File Systems)
Flash! (Modern File Systems)Flash! (Modern File Systems)
Flash! (Modern File Systems)
 
Secondarystoragedevices1 130119040144-phpapp02
Secondarystoragedevices1 130119040144-phpapp02Secondarystoragedevices1 130119040144-phpapp02
Secondarystoragedevices1 130119040144-phpapp02
 
DaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionDaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solution
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
Linux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKBLinux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKB
 
Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
 
Reshaping Core Genomics Software Tools for the Manycore Era
Reshaping Core Genomics Software Tools for the Manycore EraReshaping Core Genomics Software Tools for the Manycore Era
Reshaping Core Genomics Software Tools for the Manycore Era
 
Tainted LOB
Tainted LOBTainted LOB
Tainted LOB
 
Things you should know about Oracle truncate
Things you should know about Oracle truncateThings you should know about Oracle truncate
Things you should know about Oracle truncate
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Secondary storage devices
Secondary storage devices Secondary storage devices
Secondary storage devices
 
Flash 101
Flash 101Flash 101
Flash 101
 
"Mobage DBA Fight against Big Data" - NHN TE
"Mobage DBA Fight against Big Data" - NHN TE"Mobage DBA Fight against Big Data" - NHN TE
"Mobage DBA Fight against Big Data" - NHN TE
 
Improving the Performance of the qcow2 Format (KVM Forum 2017)
Improving the Performance of the qcow2 Format (KVM Forum 2017)Improving the Performance of the qcow2 Format (KVM Forum 2017)
Improving the Performance of the qcow2 Format (KVM Forum 2017)
 
W1.1 i os in database
W1.1   i os in databaseW1.1   i os in database
W1.1 i os in database
 
cache memory introduction, level, function
cache memory introduction, level, functioncache memory introduction, level, function
cache memory introduction, level, function
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Kürzlich hochgeladen (20)

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

20111026 optimal-usage-of-ssds-under-linux-updated

  • 1. Optimal usage of SSDs under Linux: Optimize your I/O Subsystem Werner Fischer, Technology Specialist Thomas-Krenn.AG LinuxCon Europe 2011, October 26th - 28th 2011, Prague, Czech Republic
  • 2. Introduction who I am who I am not Werner Fischer from Australia Linux user Kernel since 2001 developer working for a freelancer Piano learner SSD Server vendor IT journalist developer slide 2/30
  • 3. Agenda 1) SSD layout 2) I/O performance metrics 3) Configurations tips Source: maximumpc.com slide 3/30
  • 4. 1) SSD layout: memory cell • memory cells – NAND memory cell = MOS transistor with floating gate – permanently store charge Source: Intel • programming puts electrons on floating gate • erase takes them off – one program/erase (p/e) cycle is a round trip by the electrons – back-and-forth round trips gradually damage the tunnel oxide – endurance is limited, measured in number of p/e cycles: • 50nm MLC ~ 10.000 p/e cycles • 34nm/25nm/20nm MLC ~ 3.000 – 5.000 p/e cycles slide 4/30
  • 5. 1) SSD layout: memory cell • memory cells – SLC (Single Level Cell) → 1 bit per memory cell – MLC (Multi Level Cell) → 2 bits per memory cell Source: anandtech.com – TLC (Triple Level Cell) → 3 bits per memory cell – 16LC (16 Level Cell) → 4 bits per memory cell • multiple memory cells (e.g. 16.384) build up a “page” – page = smallest area, which can be read/written slide 5/30
  • 6. 1) SSD layout: page blackboard #1 • one line = page within a SSD 1 Stiegl Bier 2 Budweiser – 8.192 Bytes (8 kiB) 3 Hofstettner – can be read/written individually 4 Heinecken – cannot be changed/erased individually 5 6 7 8 Note: example sizes of pages and blocks are taken from Intel's Series 320 SSDs (with IMFT's 25nm Flash chips) slide 6/30
  • 7. 1) SSD layout: block blackboard #1 • one line = page within a SSD 1 Stiegl Bier 2 Budweiser – 8.192 Bytes (8 kiB) 3 Hofstettner – can be read/written individually 4 Heinecken – cannot be changed/erased individually 5 6 • one blackboard = block within a SSD 7 – consists of 256 lines (pages), 8 2.097.152 Bytes (2 MiB) – smallest area which can be individually erased (we have only watering-cans for that ;-) Note: example sizes of pages and blocks are taken from Intel's Series 320 SSDs (with IMFT's 25nm Flash chips) slide 7/30
  • 8. 1) SSD layout: change page blackboard #1 • easier way to change lines? 1 Stiegl Bier 2 Budweiser 3 Hofstettner 4 Heinecken 5 6 7 8 slide 8/30
  • 9. 1) SSD layout: change page blackboard #1 • easier way to change lines! 1 Stiegl Bier 2 Budweiser – (1) mark old line as invalid 3 Hofstettner 4 Heinecken 5 6 7 8 slide 9/30
  • 10. 1) SSD layout: change page blackboard #1 • easier way to change lines! 1 Stiegl Bier 2 Budweiser – (1) mark old line as invalid 3 Hofstettner – (2) store new content in a free line 4 Heinecken 5 Pilsener 6 7 8 slide 10/30
  • 11. 1) SSD layout: change page blackboard #1 • easier way to change lines! 1 Stiegl Bier 2 Budweiser – (1) mark old line as invalid 3 Hofstettner – (2) store new content in a free line 4 Heinecken 5 Pilsener • this can be done as long as there are 6 enough free lines left... 7 8 we need spare blackboards to have enough free lines! slide 11/30
  • 12. 1) SSD layout: spare area • SSDs need spare area cont (LBA 14) cont (LBA 13) cont (LBA 19) cont (LBA 93) cont (LBA 92) cont (LBA 91) – avoids the erasement cont (LBA 15) cont (LBA 27) cont (LBA 04) cont (LBA 11) cont (LBA 26) cont (LBA 09) cont (LBA 15) cont (LBA 29) cont (LBA 04) of a block when a single cont (LBA 43) cont (LBA 58) cont (LBA 42) cont (LBA 57) cont (LBA 43) cont (LBA 59) page is changed cont (LBA 84) cont (LBA 77) cont (LBA 89) – after some time spare area will be filled up, too cont (LBA 02) cont (LBA 93) cont (LBA 26) – cleaning gets necessary cont (LBA 57) (garbage collection) visible capacity, spare area, e.g. 300 GB e.g. 44 GB slide 12/30
  • 13. 1) SSD layout: blocks→planes→dies→TSOPs • planes – multiple blocks make up a plane – e.g. 1.024 blocks = 1 plane • dies – multiple planes make up a die – e.g. 4 planes = 1 die Source: http://newsroom.intel.com/community/intel_newsroom/blog/2011/04/14 • TSOP (thin small outline package) – multiple dies (e.g. 1 - 8) • SSDs – e.g. 10 TSOPs Source: maximumpc.com slide 13/30
  • 14. Agenda 1) SSD layout 2) I/O performance metrics 3) Configurations tips Source: maximumpc.com slide 14/30
  • 15. 2) I/O performance metrics • throughput – MByte/s – throughput analogy: • # of persons/h from Berlin → Prague • # of I/O operations per second – IOPS analogy: • # of individual trips to Prague (from Berlin, Vienna, Paris, Rome, ...) • latency because of queue depth – queue depth analogy: • vehicles must use a ferry to reach destination • with how many vehicles does the ferry depart? slide 15/30
  • 16. Agenda 1) SSD layout 2) I/O performance metrics 3) Configurations tips ● AHCI (NCQ+DIPM) ● TRIM (discard) ● noatime ● tmpfs ● alignment ● over-provisioning Source: maximumpc.com slide 16/30
  • 17. 3) Configuration tips: AHCI (NCQ+DIPM) • NCQ (Native Command Queuing) – allows SSD to execute multiple I/O requests in parallel – boosts throughput – configure queue depth to get your optimal balance between max. # of IOPS and lowest latency slide 17/30
  • 18. 3) Configuration tips: AHCI (NCQ+DIPM) • DIPM (Device Initiated Interface Power Management) – reduces idle power down to 0,1 Watt • Aggressive Link Power Management – min_power – medium_power – max_performance More Details: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/ALPM.html updated slide! (has been corrected after the talk) slide 18/30
  • 19. 3) Configuration tips: TRIM (discard) • ATA TRIM cont (LBA 14) cont (LBA 13) cont (LBA 19) cont (LBA 93) cont (LBA 92) cont (LBA 91) – tells SSD which data cont (LBA 15) cont (LBA 27) cont (LBA 04) cont (LBA 11) cont (LBA 26) cont (LBA 09) cont (LBA 15) cont (LBA 29) cont (LBA 04) can be discarded cont (LBA 43) cont (LBA 58) cont (LBA 42) cont (LBA 57) cont (LBA 43) cont (LBA 59) cont (LBA 84) cont (LBA 77) cont (LBA 89) cont (LBA 02) visible capacity, spare area, e.g. 300 GB e.g. 44 GB slide 19/30
  • 20. 3) Configuration tips: TRIM (discard) • ATA TRIM cont (LBA 14) cont (LBA 13) cont (LBA 19) cont (LBA 93) cont (LBA 92) cont (LBA 91) – tells SSD which data cont (LBA 15) cont (LBA 27) cont (LBA 04) cont (LBA 11) cont (LBA 26) cont (LBA 09) cont (LBA 15) cont (LBA 29) cont (LBA 04) can be discarded cont (LBA 43) cont (LBA 58) cont (LBA 42) cont (LBA 57) cont (LBA 43) cont (LBA 59) cont (LBA 84) cont (LBA 77) cont (LBA 89) cont (LBA 02) cont (LBA 93) visible capacity, spare area, e.g. 300 GB e.g. 44 GB slide 20/30
  • 21. 3) Configuration tips: TRIM (discard) • ATA TRIM cont (LBA 14) cont (LBA 13) cont (LBA 19) cont (LBA 93) cont (LBA 92) cont (LBA 91) – tells SSD which data cont (LBA 15) cont (LBA 27) cont (LBA 04) cont (LBA 11) cont (LBA 26) cont (LBA 09) cont (LBA 15) cont (LBA 29) cont (LBA 04) can be discarded cont (LBA 43) cont (LBA 58) cont (LBA 42) cont (LBA 57) cont (LBA 43) cont (LBA 59) cont (LBA 84) cont (LBA 77) cont (LBA 89) – without TRIM: • deleting a big file (e.g. 100 GB) would cont (LBA 02) cont (LBA 93) lead to keep unusable data • unusable data will be maintained during garbage collection! • more overhead → lower performance & visible capacity, spare area, lower endurance! e.g. 300 GB e.g. 44 GB slide 21/30
  • 22. 3) Configuration tips: TRIM (discard) • ATA TRIM using discard infrastructure in Linux – online discard • Ext4: since Kernel 2.6.33 • XFS: since Kernel 3.0 • Btrfs: since Kernel 2.6.32 – batched discard (using fstrim command) • Ext4: since Kernel 2.6.37 • XFS: since Kernel 2.6.38 • Btrfs: since Kernel 2.6.39 – pre-discard on format • E2fsprogs >= 1.41.10 • Xfsprogs >= 3.1.0 slide 22/30
  • 23. 3) Configuration tips: TRIM (discard) • ATA TRIM using discard infrastructure in Linux – I/O stack discard support (device mapper): • since Kernel 2.6.36: DM targets delay, linear, mpath, stripe • since Kernel 2.6.38: DM mirror target – no I/O stack discard support yet: • MD raid • alternatives to discard: wiper.sh / raid1ext4trim.sh – use hdparm --trim-sector-ranges-stdin – read warnings in the source of those scripts – cannot be used with device mapper slide 23/30
  • 24. 3) Configuration tips: TRIM (discard) • “does TRIM work?” howto slide 24/30
  • 25. 3) Configuration tips: noatime & tmpfs • noatime / relatime – omits writes of metadata on every read • tmpfs – for temporary data like /tmp/, /var/tmp/, /var/cache/, ... slide 25/30
  • 26. 4) Configuration tips: alignment • align partition and file systems – wrong alignment: – use fdisk parameters: fdisk -c -u /dev/sda – correct alignment: slide 26/30
  • 27. 4) Configuration tips: over-provisioning • do not use full normal visible capacity cont (LBA 14) cont (LBA 13) cont (LBA 93) cont (LBA 92) cont (LBA 15) cont (LBA 11) cont (LBA 27) cont (LBA 26) cont (LBA 04) cont (LBA 09) – activate HPA cont (LBA 43) cont (LBA 58) cont (LBA 84) cont (LBA 42) cont (LBA 57) cont (LBA 77) (host protected area) • ATA8-ACS SET MAX ADDRESS cont (LBA 02) cont (LBA 93) • use hdparm -N cont (LBA 26) cont (LBA 57) – or simply do not partition full visible capacity – in either case if SSD has been used before: HPA limited capacity e.g. 200 GB • do a secure erase to normal visible capacity, spare area, TRIM all blocks e.g. 300 GB e.g. 44 GB slide 27/30
  • 28. 4) Configuration tips: over-provisioning • over-provisioning is useful when discard cannot be used yet (e.g. MD-RAID, hardware RAID, ...) • measurements by Intel: Source: Intel slide 28/30
  • 29. Review: 1) SSD layout 2) I/O performance metrics 3) Configurations tips ● AHCI (NCQ+DIPM) ● TRIM (discard) ● noatime ● tmpfs ● alignment ● over-provisioning Source: maximumpc.com slide 29/30
  • 30. Thanks for your time! Image sources: © Robynmac | Dreamstime.com | http://www.dreamstime.com/stock-photography-uluru-australia-kangaroo-sign-image19716692 © stormpic | aboutpixel.de | http://www.aboutpixel.de/foto/im-wartezimmer/stormpic/77740 © meisterleise | aboutpixel.de | http://www.aboutpixel.de/foto/tinnitus-g/meisterleise/113752 © goenz | aboutpixel.de | http://www.aboutpixel.de/foto/computer-1/goenz/104076 © Mr. Monk | aboutpixel.de | http://www.aboutpixel.de/foto/los-beschreib-mich%21/mr_-monk/39871 © RancoR | aboutpixel.de | http://www.aboutpixel.de/foto/giesskanne/rancor/29812 © Michael Hirschka | pixelio.de | http://www.pixelio.de/media/427783 © Sommaruga Fabio | pixelio.de | http://www.pixelio.de/media/448050 © Dieter Schütz | pixelio.de | http://www.pixelio.de/media/537686