This presentation was given as part of an industry panel at DSN 2008 (Dependable Systems and Networks). The topic was "Is SDC a myth or reality?". This presentation gives the SDC perspective in the enterprise server class domain.
Silent Data Corruption in Servers
1. SDC in Enterprise Class Servers
Ishwar Parulkar
Sun Microsystems, Inc.
DSN 2008 Panel: SDC – Myth or Reality? Slide 1
2. Outline
• Sources of SDC
• Examples of cases of SDC
• How big a concern is SDC?
– Application space
– Server sensitivity to SDC
• Design/Measurement for SDC mitigation
• Solution trends
• Conclusions
3. Silent Data Corruption (SDC)
SDC is defined as incorrect data being
generated in hardware and the incorrect data
being communicated to the application layer
without being detected for a period of time (it
might get detected eventually).
5. Sources of SDC in Servers
1. Cosmic radiation induced bit flips in silicon
2. Design and process marginalities
3. Very corner case logic design bugs
4. Defects occurring in silicon due to ageing
7. Example – Cosmic Radiation
• Sun UltraSPARC-II servers had a noticeable
crash rate in the field in 2000
– symptom was system panic, NOT SDC
• Traced to cosmic-radiation-induced soft
errors in the external cache
– symptom exhibited by SRAM from one vendor (IBM)
• Several examples and experiments from
aerospace, NASA, medical implant electronics
industries
8. Example - Design Marginality
• “AMD Opterons suffer heat issue” - CNET 4/28/0
• From AMD web site:
http://www.amd.com/usen/0,,3715_13965,00.html?redir=CORPR01
– “A few processors have been observed to produce
inconsistent results in a non-production
synthetic test environment with the convergence
of the following three simultaneous conditions:
• The running of FP intensive code sequences,
• elevated CPU temperatures, and
• elevated ambient temperatures”
• In general, temperature gradients in silicon can be up
to 30 °C per mm on large dice
Question: Design, Manufacturing test or In-field reliability issue?
9. Example - Process Marginality
• Very infrequent, intermittent parity errors noticed in
the field (NOT SDC)
• Symptom seen on a few parts
– long, unpredictable time to failure
– parts were from one manufacturing line
• Traced to a long route with multiple jogs
– no DFM rule violation
– combination of
• location of die on wafer
• mechanical warping
• electrical use condition (load)
10. Example - Logic Design Bug - (1)
Famous Pentium FDIV Bug in 1994
• Discovered by a user running code to enumerate primes
• Symptom: Reduction in precision of division operations
• Concern in scientific/engineering and financial
engineering fields
• Source: A few missing entries in a look-up table used in
floating point divide operations, not detected in
verification
• Intel estimated an MTBSDC (mean time between SDC events) of
27,000 years; IBM estimated 24 days
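A division result can be sanity-checked by multiplying it back and looking at the residual. The sketch below is only an illustration of that idea, not something shown in the talk: the function name `division_residual` and the tolerance are assumptions, and the operand pair is the one widely circulated in 1994 as a test for the flawed divider.

```python
def division_residual(x: float, y: float) -> float:
    """Return x - (x / y) * y, which is close to zero when division is correct."""
    return x - (x / y) * y


# Operand pair widely circulated in 1994 as a test for the flawed divider.
X, Y = 4195835.0, 3145727.0

# Hypothetical tolerance: anything above ordinary rounding error is suspect.
TOLERANCE = 1e-6

residual = division_residual(X, Y)
# On a correct divider the residual is at most a rounding error;
# flawed Pentiums were reported to return 256 for this expression.
status = "suspect" if abs(residual) > TOLERANCE else "ok"
print(f"residual = {residual!r} ({status})")
```

A check like this only catches errors on the operands it happens to run, which is part of why such a bug can stay silent for a long time.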
11. Example - Logic Design Bug - (2)
A more subtle case
• Multithreaded processor with multiple strands sharing
resources
• A 1-3 cycle window of vulnerability is created when
– more than 1 strand is using an execution pipe with
specific combinations of operations
• SDC occurs if all of the following arrive at the trap
commit unit within 1-3 cycle window of vulnerability
– A checkpoint state
– A trap
– A park request
• Scenario pathologically possible; probability of
occurring in code is close to 0
12. Examples – Silicon Degradation
• Several phenomena
– Electromigration
– Gate Oxide Breakdown
– Channel Hot Carrier Effect
– Negative Bias Temperature Instability
• Addressed by DFM rules, guard-banding in design,
and burn-in during manufacturing to accelerate early failures
• Not a major concern for SDC, because they are not
silent for long
14. Server Market Segments
Back Office
• CRM
• ERP
• BIDW
• Database
HPC, Mainstream, Web
• Finance
• Manufacturing
• Infrastructure
• Oil and Gas
• Life Sciences
• Web 2.0
• Government
• Storage
• Service Providers
15. Server SDC and Availability
Typical Targets
Server Type       MTBSDC (years)   Availability (%)
Data Centric      100-1000         99.999
Web Centric       10-100           99.999-99.9999
Compute Centric   100-1000         99.990
MTBF in years = 10^9 / (FIT * 24 Hours * 365 Days)
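The conversion between FIT (failures per 10^9 device-hours) and MTBF in years can be sketched as below. The function names are illustrative, and the 24 x 365 hours-per-year convention is taken from the formula on the slide.

```python
HOURS_PER_YEAR = 24 * 365  # same convention as the slide's formula


def fit_to_mtbf_years(fit: float) -> float:
    """Convert a failure rate in FIT to a mean time between failures in years."""
    return 1e9 / (fit * HOURS_PER_YEAR)


def mtbf_years_to_fit(years: float) -> float:
    """Convert a mean time between failures in years to a failure rate in FIT."""
    return 1e9 / (years * HOURS_PER_YEAR)


# FIT budgets implied by the ends of the MTBSDC ranges in the table.
for years in (10, 100, 1000):
    print(f"{years:>5} years -> {mtbf_years_to_fit(years):8.1f} FIT")
```

Under this convention a 100-year MTBSDC target corresponds to a budget of roughly 1142 FIT for the whole server.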
22. Classification of Silicon Errors from a User Perspective
Universe of silicon errors in a server chip:
               Corrected (C)                      Uncorrected (U)
Silent (S)     SC: customer does not care         SU: Silent Data Corruption
Reported (R)   RC: required by Service/Customer   RU: System Crash
                   to monitor health
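The four quadrants of this classification can be written down as a small lookup. This is a minimal sketch; the names `ErrorClass` and `classify` are made up for illustration, and the descriptions paraphrase the slide's annotations.

```python
from enum import Enum


class ErrorClass(Enum):
    SC = "silent, corrected: customer does not care"
    SU = "silent, uncorrected: silent data corruption"
    RC = "reported, corrected: logged so service/customer can monitor health"
    RU = "reported, uncorrected: system crash"


def classify(reported: bool, corrected: bool) -> ErrorClass:
    """Map the two user-visible properties of an error to its quadrant."""
    if reported:
        return ErrorClass.RC if corrected else ErrorClass.RU
    return ErrorClass.SC if corrected else ErrorClass.SU


# An error that is neither reported nor corrected is the SDC case.
print(classify(reported=False, corrected=False).name)
```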
23. A Typical Data Centric Server
Component                        Approx. Count   Comments
Processors                       8-64            8-64 way systems
ASICs                            320             Memory controllers, IO bridges, Crypto, etc.
Memory DIMMs                     640             Depends on memory capacity
AC/DC Power Supplies             8-10            Main power supply
DC/DC Power Supplies             640             High and low voltage supplies
Clocking                         64              Clock synthesizers and distribution
Service Processor                4               Small processors, FPGA
Miscellaneous Small Components   1000-10000      Resistors, Capacitors, Pins, Connectors
25. Server Sensitivity to Processor SDC
[Plot: Sensitivity of Server to Processor SU Rate. X axis: Processor SU (Silent Uncorrected) FIT, 100 to 700. Y axis: Server MTBSDC (Years), 0 to 120. Two points are marked: 89 years and 42 years.]
• A 150 FIT increase in processor implies:
– 52.8% degradation of MTBSDC
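The quoted degradation follows directly from the two marked points (89 and 42 years). The sketch below reproduces that arithmetic and adds a simple series model in which component FIT rates add up. The model, the processor count and the FIT of the rest of the system are assumptions for illustration only; the slide does not state how its curve was computed.

```python
HOURS_PER_YEAR = 24 * 365


def degradation_percent(before_years: float, after_years: float) -> float:
    """Relative loss of MTBSDC, in percent."""
    return 100.0 * (before_years - after_years) / before_years


def server_mtbsdc_years(proc_su_fit: float, n_procs: int, other_su_fit: float) -> float:
    """Assumed series model: silent-uncorrected FIT rates of all parts add up."""
    total_fit = n_procs * proc_su_fit + other_su_fit
    return 1e9 / (total_fit * HOURS_PER_YEAR)


# The two points marked on the plot.
print(f"{degradation_percent(89, 42):.1f}% degradation")

# Hypothetical system: 8 processors plus 300 FIT from everything else.
for proc_fit in (100, 250, 400):
    years = server_mtbsdc_years(proc_fit, n_procs=8, other_su_fit=300.0)
    print(f"processor SU = {proc_fit} FIT -> server MTBSDC = {years:.0f} years")
```

Because the per-processor rate is multiplied by the processor count, a modest change in a single chip's SU FIT moves the server-level figure substantially.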
37. Design for SDC Mitigation
[Flow diagram, built up over slides 27-37; final state shown]
Top-down targets:
• VOC, Field Data, Marketing
• System Level MTBSDC, MTBUSI Targets
• Chip Level FIT Targets
Bottom-up estimate:
• Raw Soft Error Rate
– SER estimation from SPICE simulations
– Raw static SER measurement at LANL
• Raw Hard Error Rate
– GOI, NBTI, CHC, EM reliability modeling
– Accelerated test of samples
• Circuit, Logic, Architecture, SW Detection, Correction, Recovery Solutions
• Electrical, Logical and Architectural Derating
• Actual Chip Level FIT
Comparison:
• Chip Level FIT Targets = Actual Chip Level FIT
• Not Equal (marked twice in the diagram: above the comparison and above the detection/correction/recovery solutions)
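The bottom-up half of this flow can be sketched as a small calculation: raw error rates are derated, reduced by whatever the detection/correction/recovery solutions catch, and compared against the chip-level target. Everything below is a hedged illustration. The multiplicative model, the uniform derating of soft and hard errors, the "less than or equal" pass criterion and all numbers are assumptions, not values from the talk.

```python
def actual_chip_fit(raw_soft_fit: float, raw_hard_fit: float,
                    derating: float, coverage: float) -> float:
    """Estimate the chip-level silent-uncorrected FIT (assumed model).

    derating: fraction of raw errors that would affect the outcome (0..1)
    coverage: fraction of those handled by detection/correction/recovery (0..1)
    """
    raw_fit = raw_soft_fit + raw_hard_fit
    return raw_fit * derating * (1.0 - coverage)


# Hypothetical inputs for illustration only.
TARGET_FIT = 100.0
RAW_SOFT_FIT, RAW_HARD_FIT = 5000.0, 50.0
DERATING = 0.10


def first_coverage_meeting_target(candidates):
    """Return the first candidate coverage whose estimate meets the target."""
    for coverage in candidates:
        fit = actual_chip_fit(RAW_SOFT_FIT, RAW_HARD_FIT, DERATING, coverage)
        if fit <= TARGET_FIT:
            return coverage
    return None  # none of the candidate solutions is sufficient


print(first_coverage_meeting_target([0.0, 0.5, 0.8, 0.9]))
```

The loop mirrors the "Not Equal" feedback path: when the estimate misses the target, more protection is added and the estimate is recomputed.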
39. Solution Trends for SDC
• Unit level redundancy is too costly
• Logic and flops need to be protected
• Circuit level solutions can be limiting
• Logic/architectural solutions more promising
• Periodic on-line testing for predicting degradation
• Trillions of random verification cycles
41. Conclusions
• SDC is a reality
– criticality and investment in mitigation highly dependent
on application space
• Solutions to SDC need to be low overhead –
mainframe level reliability/availability at server
price points
• Need more accurate estimation of SDC
• SDC due to design bugs and design/process
marginalities still hard to estimate
43. Using Sun Processor “Ranch”
(Testing in Broomfield, CO)
44. Broomfield Test Setup
• Altitude and geomagnetic location give ~4.1x
acceleration over sea-level
• 600 US-III Processors
• 3 months of testing
• Used modified POST code to write 0's and 1's
to memory arrays and observe bit flips
• Monitored power supply failures as well
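Two pieces of this setup lend themselves to a short sketch: counting bit flips by comparing a read-back against the written pattern, and converting test time into sea-level-equivalent device-hours using the acceleration factor. The code is a host-side simulation, not the modified POST code. The function names are made up, the upset is injected artificially, and "3 months" is taken as 90 days.

```python
def count_bit_flips(expected: bytes, observed: bytes) -> int:
    """Count differing bits between two equal-length buffers."""
    # zip() stops at the shorter buffer, so callers should pass equal lengths.
    return sum(bin(e ^ o).count("1") for e, o in zip(expected, observed))


def sea_level_equivalent_device_hours(n_devices: int, test_hours: float,
                                      acceleration: float) -> float:
    """Device-hours at sea level that the accelerated test stands in for."""
    return n_devices * test_hours * acceleration


# Write a pattern of all-0 and all-1 bytes, then simulate one upset on read-back.
pattern = bytes([0x00, 0xFF] * 4)
readback = bytearray(pattern)
readback[3] ^= 0b0000_0100  # artificially injected single-bit flip
print("bit flips:", count_bit_flips(pattern, bytes(readback)))

# 600 processors, about 3 months (assumed 90 days), ~4.1x acceleration.
hours = sea_level_equivalent_device_hours(600, 90 * 24, 4.1)
print(f"sea-level equivalent: {hours:,.0f} device-hours")
```

Under the 90-day assumption this comes to roughly 5.3 million sea-level-equivalent device-hours.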
45. Soft Error Testing of Sun Processors - A Chronology
Date               Process Node   Device Under Test    Location     Test Type
8/2000             250nm, 180nm   US III               Los Alamos   Neutron Irradiation
11/2000 – 2/2001   250nm, 180nm   US III               Broomfield   Large Volume (600 CPUs)
11/2002            150nm, 130nm   US III               Los Alamos   Neutron Irradiation
11/2003            130nm, 90nm    US IIIi, IIIi+       Los Alamos   Neutron Irradiation
8/2004             -              Commodity SRAM       Berkeley     Neutron Irradiation
4/2005             90nm           US IIIi+             Los Alamos   Neutron Irradiation
11/2005            90nm           US T1, IIIi+, IV+    Los Alamos   Neutron Irradiation
12/2005            90nm           US T1                Los Alamos   Neutron Irradiation
12/2006            65nm           US T2                Los Alamos   Neutron Irradiation
12/2007            65nm           US T2/Nextgen Proc   Los Alamos   Neutron Irradiation
46. A Typical LANL Test Setup
• Recently tested UltraSPARC T2 and a next
generation processor in 65nm technology
• Ran multiple systems in parallel
• Different parts, voltages & test patterns
• Beam time efficiency
– 12% beam off
– 5% of time in setup, debug
– 83% of beam time gave useful data
• Cumulative 775 hours of data gathered
47. Design/Process Marginality
Where do you solve it?
• Design: guard-bands (cost: loss of performance)
• Field: in-line correction (cost: area/power)
• Manufacturing: wider test box (cost: loss of yield)