Andrei Khurshudov gave a presentation on solid state drives (SSDs) at the Symposium on Magnetic Storage Tribology and Reliability in Miami, Florida on October 20, 2008. In the presentation, he discussed SSD technology trends, challenges relating to reliability over the life of SSDs, and the need for standardization of SSD reliability testing methods. He noted that while SSDs offer benefits over hard disk drives like improved performance and lower power consumption, challenges remain regarding cost, reliability over the lifetime of the product, and write performance.
SSD Reliability Expert Discusses SSD Quality and Failure Modes
1. Andrei Khurshudov
Sr. Director
SSD Q&R
Seagate Technology
October 20, 2008
Symposium on Magnetic Storage Tribology and Reliability
Miami, Florida
2. SSD – In One Page
SSD ≡ Solid State Drive
SSD is a storage device
◦ using solid state memory as components instead of heads and
disks
◦ appearing to the user as a drive similar to a hard disk drive (HDD)
SSD uses non-volatile memory (NAND Flash) or volatile
semiconductor memory (RAM) with a battery
Current SSD products utilize either SLC (single-level cell)
or MLC (multi-level cell) NAND Flash
SSD benefits: read performance, higher reliability, low
power consumption
SSD challenges: cost, product reliability over life, and write
performance
3. Today and Tomorrow of SSD
Today’s total revenue ~ $400 M
Projected 2011 revenue ~ $5 B
Today’s unit shipments ~ 4M units
◦ Dominated by the industrial applications
◦ Dominated by capacities <1 GB
Projected 2011 unit shipments ~ 50M units
◦ Dominated by shipments to portable PCs
◦ Dominated by capacities from 64 GB to 128 GB
The Total Cost of Ownership (TCO) is expected to drive the
transition from HDDs to SSDs
◦ Conclusion: there is no need for the complete price parity at
equivalent capacity points
| Source: IDC
4. Basic Flash Operation
Flash stores data by trapping charge at the floating
gate
Direct access to data:
◦ Program (write) a “page” (2 KB or 4 KB + ECC bytes)
◦ Read a page
◦ Erase: the smallest unit is a block (64, 128, or more pages)
◦ Over-write = Erase (block) + Write (page)
Program / Erase operations:
◦ Program: forces electrons in the substrate to tunnel through the oxide layer and
become trapped on the floating gate (“0”)
◦ Erase: forces the electrons back to the substrate (“1”)
Read operation:
◦ Apply voltage to the control gate and sense the current
through the inversion channel:
“1” if there is a current flow
“0” if there is no current flow
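The page and block rules above can be sketched as a toy model. This is a minimal illustration, not a real Flash interface: the class name FlashBlock, its methods, and the default sizes are assumptions chosen to match the figures on the slide (2 KB pages, 64 pages per block).

```python
class FlashBlock:
    """Toy NAND block: pages are programmed one at a time, erased only as a whole block."""

    def __init__(self, pages_per_block=64, page_size=2048):
        self.page_size = page_size
        # Erased cells read as logical "1", so an erased page is all 0xFF bytes.
        self.pages = [bytes([0xFF]) * page_size for _ in range(pages_per_block)]
        self.programmed = [False] * pages_per_block
        self.erase_count = 0

    def program(self, page_no, data):
        """Write one page; a page cannot be programmed twice without an erase."""
        if len(data) != self.page_size:
            raise ValueError("data must be exactly one page")
        if self.programmed[page_no]:
            raise RuntimeError("page already programmed; erase the block first")
        self.pages[page_no] = bytes(data)
        self.programmed[page_no] = True

    def read(self, page_no):
        return self.pages[page_no]

    def erase(self):
        """Erase works on the whole block and returns every cell to logical "1"."""
        n = len(self.pages)
        self.pages = [bytes([0xFF]) * self.page_size for _ in range(n)]
        self.programmed = [False] * n
        self.erase_count += 1

    def overwrite(self, page_no, data):
        """Over-write = erase (block) + write (page).

        The other programmed pages are saved and re-programmed after the erase.
        Returns the number of pages physically programmed.
        """
        saved = [(i, p) for i, p in enumerate(self.pages)
                 if self.programmed[i] and i != page_no]
        self.erase()
        for i, p in saved:
            self.program(i, p)
        self.program(page_no, data)
        return len(saved) + 1
```

In this model, changing one page in a block that holds two programmed pages costs one block erase and two page programs, which is why an over-write is more expensive than a first write.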
5. Program and Erase Cycle
[Figure: two cell cross-sections (control gate, dielectric, floating gate, gate oxide, source, drain). Left, program: 20 V at the top of the cell, 0 V at the bottom. Right, erase: 0 V at the top, 20 V at the bottom.]
Program:
◦ Equivalent to “data write” in HDD
◦ Electrons are moved from the substrate and trapped in the floating gate
◦ Programming is done by “pages”
◦ Results in a logical “0”
◦ Uses Fowler-Nordheim tunneling
Erase:
◦ Equivalent to “data erase” in HDD
◦ Electrons are moved from the floating gate into the substrate
◦ Erasures are done by “blocks”
◦ Results in a logical “1”
◦ Uses Fowler-Nordheim tunneling
6. Flash Technology Trends
| Source: J. Cooke, Micron Technology
| Source: Samsung
Future roadmap for NAND charge
storage technology:
Scaling down and increasing
complexity
10X reduction in reliability that
needs to be compensated for by
other means
Transition from SLC (single-level cells) to MLC (multi-level
cells) will represent a significant challenge to Flash reliability
Reads, not just writes, have a degrading effect on Flash
data retention
7. Quality Assurance: HDD vs. SSD
HDD (mature industry: mature tests)
◦ Development and qualification tests: very similar across the industry
◦ Test conditions: consistent across the industry
◦ Test sample size and environments: very similar across the industry
◦ Firmware testing, validation, and issue handling: years of experience
◦ Acceleration factors: temperature (similar), usage (unclear), voltage (not well defined)
◦ Reliability demonstration: standard RDT tests and standard data interpretation
◦ Reliability focus: head-disk interface, handling robustness
SSD (immature industry: non-uniform, inconsistent)
◦ Development and qualification tests: inconsistent across the industry
◦ Test conditions: inconsistent across the industry
◦ Test sample size, environments, and failure criteria: inconsistent across the industry
◦ Firmware testing, validation, and issue handling: little experience
◦ Acceleration factors: temperature (understood), usage (understood), voltage (understood)
◦ Reliability demonstration: inconsistent across the industry
◦ Reliability focus: endurance (wearout), data retention, read and write disturb, wear-leveling algorithms
8. Major Failure Modes of NAND Flash
• Flash-specific failure modes include:
• Program disturb: cells other than those being programmed receive elevated
voltage. This can affect pages that are not supposed to be programmed. Erase
will return cells to the “normal” state
• Read disturb: affects pages within the block being read that are not themselves
being read. Erase will return cells to the “normal” state
• Data retention: charge loss or gain occurs in the cell over time. Erase will
return cells to the “normal” state
• Endurance (Wear-out): cell fails due to charge trapped in the dielectric layer.
Not recoverable by erase.
[Figure: programmed cell after P/E cycling, showing control gate, dielectric, floating gate, gate oxide (SiO2), source, drain, and P-substrate, with trapped electrons]
• Other SSD failure modes:
• Handling damage
• EOS/ESD
• Firmware / ASIC failures
• Other failures
9. SSD Endurance
[Figure: failure rate (%) plotted against time, GB written, or true P/E cycles, with regions labeled β<1, β=1, and β>1, up to P/Emax]
Electrical effects:
◦ Faster programming due to charges trapped inside the dielectric instead of the FG
◦ Slower erasure because the trapped charges are harder to remove than those in the FG
Program/Erase (P/E) cycles cause charge to be trapped in the dielectric layer
This causes a permanent shift in cell characteristics, which is not recovered
by erase
Observed as failed program or erase status
In most cases, data could be recovered from the failed block
Blocks that fail should be retired (marked as bad and no longer used)
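The three shape-parameter regions on the failure-rate curve (less than, equal to, and greater than 1) read like the shape parameter of a Weibull distribution. Assuming that interpretation, which the slide does not state outright, the sketch below shows how the shape value decides whether the failure rate falls, stays flat, or rises with use. The function names and the scale parameter eta are illustrative choices.

```python
def weibull_hazard(t, beta, eta=1.0):
    """Weibull failure rate h(t) = (beta / eta) * (t / eta) ** (beta - 1), for t > 0.

    t can be time, GB written, or P/E cycles; eta is the characteristic life
    in the same unit. Both names are assumptions made for this sketch.
    """
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta, and eta must be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


def failure_rate_trend(beta, t_early=0.5, t_late=2.0):
    """Compare the failure rate at an early and a late point of life."""
    early = weibull_hazard(t_early, beta)
    late = weibull_hazard(t_late, beta)
    if abs(late - early) < 1e-12:
        return "constant"
    return "increasing" if late > early else "decreasing"


for shape in (0.5, 1.0, 3.0):
    print(shape, failure_rate_trend(shape))
```

Under this reading, a shape below 1 corresponds to early-life failures, a shape of 1 to a constant failure rate, and a shape above 1 to wear-out as the device approaches its P/E limit.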
10. SSD Endurance: Major Factors
Stress:
Number of P/E cycles
External P/E cycles (host write data rate)
Internal Write multiplication
External data entropy (block size distribution application
specific)
Internal data handling (data buffering, Flash architecture, etc.)
Wear-leveling efficiency (write uniformity across Flash cells)
Operating environment
Temperature (could both stress and help)
Strength:
Flash Endurance robustness
Device ECC power
Design redundancies or excess capacities
Bad block identification and data re-assign mechanism
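The stress and strength factors above can be combined into a rough endurance-life estimate. This is a back-of-the-envelope sketch, not a method given in the presentation: the formula, the function name, and all example inputs other than the P/E rating are assumptions.

```python
def endurance_life_years(capacity_gb, pe_cycles, host_gb_per_day,
                         write_multiplication=1.0, wear_leveling_efficiency=1.0):
    """Rough years until the rated P/E cycles are used up.

    Strength: capacity_gb * pe_cycles, derated by wear_leveling_efficiency
              (1.0 means writes are spread perfectly evenly across cells).
    Stress:   host_gb_per_day * write_multiplication
              (internal Flash writes per unit of host writes).
    """
    if min(capacity_gb, pe_cycles, host_gb_per_day, write_multiplication) <= 0:
        raise ValueError("inputs must be positive")
    if not 0 < wear_leveling_efficiency <= 1:
        raise ValueError("wear_leveling_efficiency must be in (0, 1]")
    total_writable_gb = capacity_gb * pe_cycles * wear_leveling_efficiency
    flash_gb_per_day = host_gb_per_day * write_multiplication
    return total_writable_gb / flash_gb_per_day / 365.0


# Hypothetical drive: 64 GB, 10,000 rated P/E cycles, 20 GB of host writes per day,
# write multiplication of 4, wear-leveling efficiency of 0.8.
print(round(endurance_life_years(64, 10_000, 20, 4.0, 0.8), 1))
```

The estimate ignores ECC strength, spare capacity, and temperature, all of which the slide lists as further factors.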
11. Endurance: SLC vs. MLC
Multi-level cells use
different charge levels
to store two or more
bits in one cell
Read/Write design
margins (and the gaps
between the Vt levels)
are much smaller for
MLC resulting in lower
endurance
| Source: W. Hutsell, Texas Memory Systems
Transition to MLC would represent a significant
reliability challenge
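The shrinking gaps between Vt levels can be illustrated with simple arithmetic. The sketch below assumes a fixed threshold-voltage window shared evenly among the charge levels; both the even spacing and the 4.0 V window are hypothetical simplifications, not values from the presentation.

```python
def charge_levels(bits_per_cell):
    """A cell storing n bits must distinguish 2**n charge levels."""
    if bits_per_cell < 1:
        raise ValueError("bits_per_cell must be at least 1")
    return 2 ** bits_per_cell


def level_spacing(vt_window_volts, bits_per_cell):
    """Gap between adjacent Vt levels, assuming evenly spaced levels in a fixed window."""
    return vt_window_volts / (charge_levels(bits_per_cell) - 1)


WINDOW = 4.0  # hypothetical usable Vt window, in volts
for bits in (1, 2, 3, 4):
    print(bits, charge_levels(bits), round(level_spacing(WINDOW, bits), 3))
```

Each added bit per cell doubles the number of levels, so the same amount of charge drift consumes a larger share of the remaining margin.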
12. SSD Data Retention
[Figure: four cell cross-sections (control gate, dielectric, floating gate, gate oxide, source, drain, P-substrate): programmed cell; programmed cell after P/E cycling; programmed cell after long NOP (non-operating) storage; programmed cell after P/E cycling and long NOP storage]
Non-operating storage causes charge to leak from the floating gate
P/E cycling leads to even faster charge dissipation and eventual data loss
13. Data Retention vs. Time and P/E cycles
P/E cycling shortens data retention
[Figure: two charts, captioned “No P/E cycling impact on endurance” and “Strong endurance dependence on P/E cycling”]
| Source: Samsung
| Source: Jim Cooke, Micron
Newer technologies shorten data retention
Exercising flash reduces its long-term data retention
This problem gets worse as the Flash scales down (60
nm → 4x nm) and increases in complexity (SLC → MLC)
14. Understand and Overcome Fundamental Technology limitations
Write Endurance (max. program/erase cycles)
Degrades with device scaling
100k for SLC NAND, 10k for MLC-2b, 1k for MLC-3b, 100 for MLC-4b
Data Retention
Degrades with device scaling
Depends on temperature and P/E cycling
10 year retention @ up to 10% P/E cycles, 1 year retention @ 100% P/E cycles
Read disturb
Degrades with device scaling
1M for SLC NAND, 100k for MLC-2b
Write multiplication
Block erasure might lead to many additional internal writes for every host write
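The endurance and retention figures above can be captured in a small lookup. This is a sketch under stated assumptions: the table holds the slide's rated P/E cycles, the function names are illustrative, and because the slide gives retention only at two points (up to 10% and at 100% of rated cycles), anything else is returned as unknown rather than interpolated.

```python
# Rated program/erase cycles per cell type, as listed on the slide.
RATED_PE_CYCLES = {"SLC": 100_000, "MLC-2b": 10_000, "MLC-3b": 1_000, "MLC-4b": 100}


def endurance_used(cell_type, pe_cycles_done):
    """Fraction of the rated P/E cycles already consumed."""
    if pe_cycles_done < 0:
        raise ValueError("pe_cycles_done must be non-negative")
    return pe_cycles_done / RATED_PE_CYCLES[cell_type]


def retention_years(cell_type, pe_cycles_done):
    """Retention figure from the slide, or None where the slide gives no number."""
    if pe_cycles_done < 0:
        raise ValueError("pe_cycles_done must be non-negative")
    rated = RATED_PE_CYCLES[cell_type]
    if pe_cycles_done * 10 <= rated:   # up to 10% of rated cycles
        return 10
    if pe_cycles_done == rated:        # exactly 100% of rated cycles
        return 1
    return None                        # not specified by the source


print(endurance_used("MLC-2b", 2_500), retention_years("MLC-2b", 1_000))
```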
Mitigate Flash Limitations with Advanced Reliability & Test
Technologies
Static and dynamic wear leveling to maximize life of the device
Write reduction solutions
Deploying increased ECC power
SSD-specific Test and Qualification process (CERT, DMT, RDT, ORT, etc.)
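A minimal sketch of wear leveling with bad-block retirement follows. It is an assumed, simplified policy (always write to the least-erased usable block, and retire a block once it reaches its P/E limit), not the algorithm of any particular SSD. The class name WearLeveler and its methods are hypothetical, and static wear leveling, which relocates rarely changed data, is not modeled.

```python
class WearLeveler:
    """Toy wear leveler: spread erases evenly and retire worn-out blocks."""

    def __init__(self, num_blocks, max_pe_cycles):
        self.erase_counts = [0] * num_blocks
        self.max_pe_cycles = max_pe_cycles
        self.bad_blocks = set()

    def pick_block(self):
        """Return the usable block with the fewest erases (lowest index on ties)."""
        usable = [b for b in range(len(self.erase_counts)) if b not in self.bad_blocks]
        if not usable:
            raise RuntimeError("no usable blocks left")
        return min(usable, key=lambda b: self.erase_counts[b])

    def erase_and_write(self):
        """Erase and rewrite the least-worn block; retire it at the P/E limit."""
        block = self.pick_block()
        self.erase_counts[block] += 1
        if self.erase_counts[block] >= self.max_pe_cycles:
            self.bad_blocks.add(block)  # marked as bad and no longer used
        return block


leveler = WearLeveler(num_blocks=4, max_pe_cycles=3)
for _ in range(8):
    leveler.erase_and_write()
print(leveler.erase_counts)
```

With four blocks and eight writes, every block ends up with the same erase count, which is the write uniformity that wear leveling aims for.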
15. Predictive Life Modeling
SSD reliability modeling could potentially be more accurate than
that for HDD. However, …
Failure mechanisms are highly interdependent and supplier-specific,
which makes modeling difficult
Flash Component Quality and Reliability
Superb quality control is required to compensate for high lot &
part variability in high-volume environment
The correlation of component reliability to system, integration, and field
reliability needs to be established
Standardization of the most critical tests and methodologies
Need to establish common language and definitions
16. SSD future is bright and promising but dependent on several critical
areas, including reliability
◦ HDD to SSD transition rate will be a strong function of the total cost of ownership (TCO)
◦ Reliability plays a critical role in reducing the TCO
◦ SSD technology scaling is expected to have a negative impact on reliability
SSD reliability efforts should focus on the following major areas:
◦ Endurance
◦ Data retention
◦ Read / Program disturb
◦ Reliability enhancing technologies (wear leveling, ECC, etc.)
SSD test standardization is required:
◦ No “apple-to-apple” comparison will be possible otherwise
◦ TCO is difficult to estimate without having standard tests