SDC in Enterprise Class Servers

                                    Ishwar Parulkar
                             Sun Microsystems, Inc.




DSN 2008 Panel: SDC – Myth or Reality?                Slide 1
Outline
          • Sources of SDC
          • Examples of cases of SDC
          • How big a concern is SDC?
               – Application space
               – Server sensitivity to SDC
          • Design/Measurement for SDC mitigation
          • Solution trends
          • Conclusions


Silent Data Corruption (SDC)


         SDC is defined as incorrect data that is
         generated in hardware and passed to the
         application layer without being detected for a
         period of time (it might be detected eventually).




Sources of SDC in Servers

         1. Cosmic radiation induced bit flips in silicon
         2. Design and process marginalities
         3. Extreme corner-case logic design bugs
         4. Defects occurring in silicon due to ageing




Example – Cosmic Radiation

     • Sun UltraSPARC-II servers had a noticeable
       crash rate in the field in 2000
           – symptom was system panic, NOT SDC
     • Diagnosed as cosmic-radiation-induced soft
       errors in external cache
           – symptom exhibited by SRAM from one vendor (IBM)
     • Several examples and experiments from
       aerospace, NASA, medical implant electronics
       industries


Example - Design Marginality
       • “AMD Options suffer heat issue” - CNET 4/28/0
       • From AMD web site:
           http://www.amd.com/usen/0,,3715_13965,00.html?redir=CORPR01
            – “A few processors have been observed to produce
              inconsistent results in a non-production
              synthetic test environment with the convergence
              of the following three simultaneous conditions:
                • The running of FP intensive code sequences,
                • elevated CPU temperatures, and
                • elevated ambient temperatures”
       • In general, temperature gradients in silicon can be up
         to 30 °C per mm on large dice

       Question: Design, Manufacturing test or In-field reliability issue?
Example - Process Marginality

     • Very infrequent, intermittent parity errors noticed in
       the field (NOT SDC)
     • Symptom seen on a few parts
           – long, unpredictable time to failure
           – parts were from one manufacturing line
     • Diagnosed as a long route with multiple jogs
           – no DFM rule violation
           – combination of
                • location of die on wafer
                • mechanical warping
                • electrical use condition (load)


Example - Logic Design Bug - (1)
     Famous Pentium FDIV Bug in 1994


     • Discovered by a user running code to enumerate primes
     • Symptom: Reduction in precision of division operations
     • Concern in scientific/engineering and financial
       engineering fields
     • Source: a few missing entries in a look-up table used in
       floating point divide operations, not detected in
       verification
     • Intel estimated MTBSDC of 27000 years, IBM estimated
       24 days
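
A widely cited check for this flaw, which is not part of the slides, divides 4195835 by 3145727 and multiplies back. Correct hardware leaves a residual of essentially zero, while affected Pentiums were reported to leave 256. The sketch below runs that check on whatever floating point divider the host provides; the operands and function names are assumptions introduced for illustration.

```python
from fractions import Fraction

# Operands of the widely cited FDIV check (an assumption here, not from the slides).
X, Y = 4195835, 3145727


def fdiv_residual(x: int, y: int) -> float:
    """Return x - (x / y) * y, computed with the host's floating point divider."""
    return x - (x / y) * y


def relative_error(x: int, y: int) -> float:
    """Relative error of the hardware quotient against exact rational arithmetic."""
    exact = Fraction(x, y)
    hardware = Fraction(x / y)  # exact value of the double returned by the divider
    return float(abs(hardware - exact) / exact)


if __name__ == "__main__":
    # On a correct divider the residual is at most a rounding error,
    # far below the value of 256 reported for affected parts.
    print("residual:", fdiv_residual(X, Y))
    print("relative error:", relative_error(X, Y))
```

A software cross-check like this only catches the error because the operands are known in advance; in the field the same wrong quotient would pass silently, which is what makes it SDC.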



Example - Logic Design Bug - (2)
     A more subtle case
       • Multithreaded processor with multiple strands sharing
         resources
       • 1-3 cycle window of vulnerability created when
             – more than 1 strand is using an execution pipe with
               specific combinations of operations
       • SDC occurs if all of the following arrive at the trap
         commit unit within 1-3 cycle window of vulnerability
             – A checkpoint state
             – A trap
             – A park request
       • Scenario pathologically possible; probability of
         occurring in code is close to 0


Examples – Silicon Degradation

       • Several phenomena
          – Electromigration
          – Gate Oxide Breakdown
          – Channel Hot Carrier Effect
          – Negative Bias Temperature Instability
       • Addressed by DFM rules, guard-banding in design,
         and accelerated burn-in during manufacturing
       • Not a major concern for SDC, because they are not
         silent for long



Server Market Segments

     Back Office
         • CRM
         • ERP
         • BIDW
         • Database
     HPC Mainstream
         • Finance
         • Manufacturing
         • Oil and Gas
         • Life Sciences
         • Government
     Web Infrastructure
         • Web 2.0
         • Storage
         • Service Providers




Server SDC and Availability
  Typical Targets

                    Server Type         MTBSDC            Availability (%)
                    Data Centric        100-1000 years    99.999
                    Web Centric         10-100 years      99.999-99.9999
                    Compute Centric     100-1000 years    99.990




      MTBF in years = 10^9 / (FIT * 24 Hours * 365 Days)
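
The conversion above can be sketched in a few lines. FIT counts failures per 10^9 device-hours, so the formula is just a change of units; the function names are illustrative, not from the slides.

```python
HOURS_PER_YEAR = 24 * 365  # same convention as the slide's formula


def fit_to_mtbf_years(fit: float) -> float:
    """Convert a FIT rate (failures per 1e9 device-hours) to MTBF in years."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return 1e9 / (fit * HOURS_PER_YEAR)


def mtbf_years_to_fit(years: float) -> float:
    """Inverse conversion: the FIT budget implied by an MTBF target in years."""
    if years <= 0:
        raise ValueError("MTBF must be positive")
    return 1e9 / (years * HOURS_PER_YEAR)


if __name__ == "__main__":
    # FIT budgets implied by the ends of the 100-1000 year MTBSDC range.
    for target_years in (100, 1000):
        print(target_years, "years ->", round(mtbf_years_to_fit(target_years), 1), "FIT")
```

Under this formula a 100-year MTBSDC target leaves roughly 1140 FIT for silent errors across the whole server, and a 1000-year target roughly 114 FIT.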


Classification of Silicon Errors from
  a User Perspective

     Universe of Silicon Errors in a Server Chip

                       Corrected (C)                   Uncorrected (U)
     Silent            SC: Customer does not care      SU: Silent Data Corruption
     Reported          RC: Required by Service/        RU: System Crash
                           Customer to monitor health
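
The four quadrants of this classification can be written as a small lookup keyed on two booleans. This is a minimal sketch; the enum and function names are assumptions, and the descriptions are the slide's own labels.

```python
from enum import Enum


class ErrorClass(Enum):
    """The four quadrants, with the user-visible consequence of each."""
    SC = "customer does not care"
    RC = "required by service/customer to monitor health"
    SU = "silent data corruption"
    RU = "system crash"


def classify(reported: bool, corrected: bool) -> ErrorClass:
    """Map (reported?, corrected?) to the matching quadrant."""
    if corrected:
        return ErrorClass.RC if reported else ErrorClass.SC
    return ErrorClass.RU if reported else ErrorClass.SU


if __name__ == "__main__":
    for reported in (False, True):
        for corrected in (True, False):
            quadrant = classify(reported, corrected)
            print(f"{quadrant.name}: {quadrant.value}")
```

Only the silent, uncorrected quadrant (SU) counts toward MTBSDC; a reported, uncorrected error (RU) costs availability instead.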

A Typical Data Centric Server

          Component                        Approx. Count   Comments
          Processors                       8-64            8-64 way systems
          ASICs                            320             Memory controllers, IO bridges, Crypto, etc.
          Memory DIMMs                     640             Depends on memory capacity
          AC/DC Power Supplies             8-10            Main power supply
          DC/DC Power Supplies             640             High and low voltage supplies
          Clocking                         64              Clock synthesizers and distribution
          Service Processor                4               Small processors, FPGA
          Miscellaneous Small Components   1000-10000      Resistors, Capacitors, Pins, Connectors


Server Sensitivity to Processor SDC

     [Figure: Sensitivity of Server to Processor SU Rate.
      x-axis: Processor SU (Silent Uncorrected) FIT, 100 to 700;
      y-axis: Server MTBSDC (Years), 0 to 120;
      two points are annotated on the curve: 89 years and 42 years.]

             • A 150 FIT increase in processor implies:
                – 52.8% degradation of MTBSDC
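
The 52.8% figure follows directly from the two annotated values, 89 years and 42 years. The sketch below reproduces that arithmetic and adds a simple series model of the server, in which silent-error FIT rates add across components. The series model, the processor count and the FIT of the remaining components are assumptions for illustration; the slides do not give the configuration behind the chart.

```python
HOURS_PER_YEAR = 24 * 365


def mtbsdc_degradation(before_years: float, after_years: float) -> float:
    """Fractional loss of MTBSDC when it drops from before_years to after_years."""
    return (before_years - after_years) / before_years


def server_mtbsdc_years(processor_su_fit: float, n_processors: int,
                        other_su_fit: float = 0.0) -> float:
    """Server MTBSDC under a series model: component SU FIT rates simply add.

    n_processors and other_su_fit are hypothetical inputs, not slide data.
    """
    total_fit = processor_su_fit * n_processors + other_su_fit
    return 1e9 / (total_fit * HOURS_PER_YEAR)


if __name__ == "__main__":
    # Values read off the chart annotations.
    print(f"degradation: {mtbsdc_degradation(89.0, 42.0):.1%}")

    # Hypothetical 8-way server: MTBSDC falls off hyperbolically with processor FIT.
    for fit in range(100, 701, 150):
        print(fit, "FIT ->", round(server_mtbsdc_years(fit, n_processors=8), 1), "years")
```

Because MTBSDC is inversely proportional to the summed FIT, the same absolute FIT increase costs more years at the low-FIT end of the curve than at the high end.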
Design for SDC Mitigation

     Targets (top down)
          • VOC, Field Data, Marketing
          • System Level MTBSDC, MTBUSI Targets
          • Chip Level FIT Targets

     Estimates (bottom up)
          • Raw Soft Error Rate
               – SER Estimation from SPICE Simulations
               – Raw Static SER Measurement at LANL
          • Raw Hard Error Rate
               – GOI, NBTI, CHC, EM Reliability Modeling
               – Accelerated Test of Samples
          • Circuit, Logic, Architecture, SW
            Detection, Correction, Recovery Solutions
          • Electrical, Logical and Architectural Derating
          • Actual Chip Level FIT

     Check: Chip Level FIT Targets = Actual Chip Level FIT
          • Not Equal: feedback to System Level Targets
          • Not Equal: feedback to Detection, Correction, Recovery Solutions
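
One way to picture the bottom-up half of this flow is a per-structure roll-up: each raw error rate is scaled by a derating factor and by the share of errors that escape detection or correction, and the results are summed and compared with the target. This is a minimal sketch under that assumed model; the structure names, rates, derating factors and coverage values are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    name: str
    raw_fit: float   # raw soft + hard error rate, in FIT (hypothetical values)
    derating: float  # fraction of raw errors that could affect a result
    coverage: float  # fraction of those caught by detection/correction


def silent_fit(s: Structure) -> float:
    """FIT that survives derating and escapes detection/correction."""
    return s.raw_fit * s.derating * (1.0 - s.coverage)


def actual_chip_fit(structures: list[Structure]) -> float:
    """Sum the surviving FIT over all structures on the chip."""
    return sum(silent_fit(s) for s in structures)


def meets_target(structures: list[Structure], target_fit: float) -> bool:
    """The 'equal / not equal' check: does the estimate fit within the target?"""
    return actual_chip_fit(structures) <= target_fit


if __name__ == "__main__":
    chip = [
        Structure("cache arrays (ECC)", raw_fit=1000.0, derating=0.5, coverage=0.99),
        Structure("unprotected flops", raw_fit=200.0, derating=0.1, coverage=0.0),
    ]
    print("actual chip FIT:", round(actual_chip_fit(chip), 2))
    print("meets 50 FIT target:", meets_target(chip, target_fit=50.0))
```

In this toy configuration the unprotected flops dominate the surviving FIT even though their raw rate is far lower than that of the arrays.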

Solution Trends for SDC

    • Unit level redundancy is too costly
    • Logic and flops need to be protected
    • Circuit level solutions can be limiting
    • Logic/architectural solutions more promising
    • Periodic on-line testing for predicting degradation
    • Trillions of random verification cycles



Conclusions

    • SDC is a reality
         – criticality and investment in mitigation highly dependent
           on application space
    • Solutions to SDC need to be low overhead –
      mainframe level reliability/availability at server
      price points
    • Need more accurate estimation of SDC
    • SDC due to design bugs and design/process
      marginalities still hard to estimate

Backup Slides




Using Sun Processor “Ranch”
   (Testing in Broomfield, CO)




Broomfield Test Setup

    • Altitude and geomagnetic location give ~4.1x
      acceleration over sea-level
    • 600 US-III Processors
    • 3 months of testing
    • Used modified POST code to write 0's and 1's
      to memory arrays and observe bit flips
    • Monitored power supply failures as well
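
The value of this setup is the number of sea-level-equivalent device-hours it accumulates. A rough sketch of that arithmetic follows, assuming "3 months" means about 91 days of continuous running; the observed error count used for the FIT estimate is purely hypothetical, since the slides do not report one.

```python
def equivalent_device_hours(n_devices: int, test_hours: float,
                            acceleration: float) -> float:
    """Sea-level-equivalent device-hours accumulated by an accelerated test."""
    return n_devices * test_hours * acceleration


def fit_point_estimate(n_errors: int, device_hours: float) -> float:
    """Point estimate of FIT (errors per 1e9 device-hours) from an observed count."""
    return n_errors / device_hours * 1e9


if __name__ == "__main__":
    test_hours = 91 * 24  # assumption: 3 months taken as 91 days, run continuously
    hours = equivalent_device_hours(600, test_hours, acceleration=4.1)
    print("equivalent device-hours:", round(hours))

    # Hypothetical error count, for illustration only.
    print("FIT per processor:", round(fit_point_estimate(10, hours), 1))
```

Under these assumptions the run is worth a little over five million equivalent device-hours, so it can only resolve per-processor rates of a few hundred FIT and above; lower rates need beam testing.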



Soft Error Testing of Sun Processors
  - A Chronology


       Date         Process Node         Device Under Test     Location           Test Type
      8/2000        250nm, 180nm               US III         Los Alamos      Neutron Irradiation
 11/2000 – 2/2001   250nm, 180nm               US III         Broomfield   Large Volume (600 CPUs)
      11/2002       150nm, 130nm               US III         Los Alamos      Neutron Irradiation
      11/2003        130nm, 90nm            US IIIi, IIIi+    Los Alamos      Neutron Irradiation
      8/2004               -             Commodity SRAM        Berkeley       Neutron Irradiation
      4/2005             90nm                 US IIIi+        Los Alamos      Neutron Irradiation
      11/2005            90nm             US T1, IIIi+, IV+   Los Alamos      Neutron Irradiation
      12/2005            90nm                 US T1           Los Alamos      Neutron Irradiation
      12/2006            65nm                 US T2           Los Alamos      Neutron Irradiation
      12/2007            65nm        US T2/Nextgen Proc       Los Alamos      Neutron Irradiation




A Typical LANL Test Setup

       • Recently tested UltraSPARC T2 and a next
         generation processor in 65nm technology
       • Ran multiple systems in parallel
       • Different parts, voltages & test patterns
       • Beam time efficiency
            – 12% beam off
            – 5% of time in setup, debug
            – 83% of beam time gave useful data
       • Cumulative 775 hours of data gathered


Design/Process Marginality
   Where do you solve it?

     • Design: Guard-bands (Loss of Performance)
     • Manufacturing: Wider Test Box (Loss of Yield)
     • Field: In-line Correction (Area/Power Cost)


DSN 2008 Panel: SDC – Myth or Reality?                                  Slide 47

Weitere ähnliche Inhalte

Ähnlich wie Silent Data Corruption in Servers

Cielution imaps short_presentation_public
Cielution imaps short_presentation_publicCielution imaps short_presentation_public
Cielution imaps short_presentation_publicKamal Karimanal
 
2011 年會-IC封測產業技術發展現況與未來挑戰
2011 年會-IC封測產業技術發展現況與未來挑戰2011 年會-IC封測產業技術發展現況與未來挑戰
2011 年會-IC封測產業技術發展現況與未來挑戰CHENHuiMei
 
Open source and open communities will play a big role in SDN and networking i...
Open source and open communities will play a big role in SDN and networking i...Open source and open communities will play a big role in SDN and networking i...
Open source and open communities will play a big role in SDN and networking i...Open Networking Summits
 
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdfAI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdfObject Automation
 
Industry Perspectives of SDN: Technical Challenges and Business Use Cases
Industry Perspectives of SDN: Technical Challenges and Business Use CasesIndustry Perspectives of SDN: Technical Challenges and Business Use Cases
Industry Perspectives of SDN: Technical Challenges and Business Use CasesOpen Networking Summits
 
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...Edge AI and Vision Alliance
 
Using Low Cost of Ownership Direct Bonding Technologies For MEMS Application
Using Low Cost of Ownership Direct Bonding Technologies For MEMS ApplicationUsing Low Cost of Ownership Direct Bonding Technologies For MEMS Application
Using Low Cost of Ownership Direct Bonding Technologies For MEMS ApplicationInvensas
 
Ciel mech june2014_webinarpresentation
Ciel mech june2014_webinarpresentationCiel mech june2014_webinarpresentation
Ciel mech june2014_webinarpresentationKamal Karimanal
 
HiPEAC-CSW 2022_Pedro Trancoso presentation
HiPEAC-CSW 2022_Pedro Trancoso presentationHiPEAC-CSW 2022_Pedro Trancoso presentation
HiPEAC-CSW 2022_Pedro Trancoso presentationVEDLIoT Project
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
3D-IC Designs require 3D tools
3D-IC Designs require 3D tools3D-IC Designs require 3D tools
3D-IC Designs require 3D toolschiportal
 
Design Verification: The Past, Present and Futurere
Design Verification: The Past, Present and FuturereDesign Verification: The Past, Present and Futurere
Design Verification: The Past, Present and FuturereDVClub
 
Design verification--the-past-present-and-future
Design verification--the-past-present-and-futureDesign verification--the-past-present-and-future
Design verification--the-past-present-and-futureObsidian Software
 
Ceramic Solutions Enabling the Evolution of Semiconductor Processing
Ceramic Solutions Enabling the Evolution of Semiconductor ProcessingCeramic Solutions Enabling the Evolution of Semiconductor Processing
Ceramic Solutions Enabling the Evolution of Semiconductor ProcessingCoorsTek, Inc.
 
How to leverage Quantum Computing and Generative AI for Clean Energy Transiti...
How to leverage Quantum Computing and Generative AI for Clean Energy Transiti...How to leverage Quantum Computing and Generative AI for Clean Energy Transiti...
How to leverage Quantum Computing and Generative AI for Clean Energy Transiti...Sayonsom Chanda
 
The Latest in Cloud Computing Standards
The Latest in Cloud Computing StandardsThe Latest in Cloud Computing Standards
The Latest in Cloud Computing StandardsCA API Management
 
System On Chip
System On ChipSystem On Chip
System On ChipA B Shinde
 
The 2012 transition from dfm to pdfd leor nevo-intel
The 2012 transition from dfm to pdfd  leor nevo-intelThe 2012 transition from dfm to pdfd  leor nevo-intel
The 2012 transition from dfm to pdfd leor nevo-intelchiportal
 

Ähnlich wie Silent Data Corruption in Servers (20)

Cielution imaps short_presentation_public
Cielution imaps short_presentation_publicCielution imaps short_presentation_public
Cielution imaps short_presentation_public
 
2011 年會-IC封測產業技術發展現況與未來挑戰
Silent Data Corruption in Servers

  • 1. SDC in Enterprise Class Servers Ishwar Parulkar Sun Microsystems, Inc. DSN 2008 Panel: SDC – Myth or Reality? Slide 1
  • 2. Outline
      • Sources of SDC
      • Examples of cases of SDC
      • How big a concern is SDC?
          – Application space
          – Server sensitivity to SDC
      • Design/Measurement for SDC mitigation
      • Solution trends
      • Conclusions
  • 3. Silent Data Corruption (SDC)
      SDC is defined as incorrect data that is generated in hardware and communicated to the application layer without being detected for a period of time (it might be detected eventually).
  • 5. Sources of SDC in Servers
      1. Cosmic-radiation-induced bit flips in silicon
      2. Design and process marginalities
      3. Extreme corner-case logic design bugs
      4. Defects occurring in silicon due to ageing
  • 7. Example – Cosmic Radiation
      • Sun UltraSPARC-II servers had a noticeable crash rate in the field in 2000
          – symptom was system panic, NOT SDC
      • Diagnosed as cosmic-radiation-induced soft errors in the external cache
          – symptom exhibited by SRAM from one vendor (IBM)
      • Several examples and experiments from the aerospace, NASA, and medical implant electronics industries
  • 8. Example - Design Marginality
      • “AMD Opterons suffer heat issue” - CNET 4/28/0
      • From AMD web site: http://www.amd.com/usen/0,,3715_13965,00.html?redir=CORPR01
          – “A few processors have been observed to produce inconsistent results in a non-production synthetic test environment with the convergence of the following three simultaneous conditions:
              • The running of FP intensive code sequences,
              • elevated CPU temperatures, and
              • elevated ambient temperatures”
      • In general, temperature gradients in silicon can be up to 30 °C per mm on large dice
      • Question: Is this a design, manufacturing test, or in-field reliability issue?
  • 9. Example - Process Marginality
      • Very infrequent, intermittent parity errors noticed in the field (NOT SDC)
      • Symptom seen on a few parts
          – long, unpredictable time to failure
          – parts were from one manufacturing line
      • Diagnosed to a long route with multiple jogs
          – no DFM rule violation
          – combination of
              • location of die on wafer
              • mechanical warping
              • electrical use condition (load)
  • 10. Example - Logic Design Bug - (1): Famous Pentium FDIV Bug in 1994
      • Discovered by a user running code to enumerate primes
      • Symptom: Reduction in precision of division operations
      • Concern in scientific/engineering and financial engineering fields
      • Source: A few missing entries in a look-up table used in floating-point divide operations, not detected in verification
      • Intel estimated an MTBSDC of 27000 years; IBM estimated 24 days
  • 11. Example - Logic Design Bug - (2): A more subtle case
      • Multithreaded processor with multiple strands sharing resources
      • A 1-3 cycle window of vulnerability is created when
          – more than 1 strand is using an execution pipe with specific combinations of operations
      • SDC occurs if all of the following arrive at the trap commit unit within the 1-3 cycle window of vulnerability
          – A checkpoint state
          – A trap
          – A park request
      • Scenario pathologically possible; probability of occurring in code is close to 0
  • 12. Examples – Silicon Degradation
      • Several phenomena
          – Electromigration
          – Gate Oxide Breakdown
          – Channel Hot Carrier Effect
          – Negative Bias Temperature Instability
      • Addressed by DFM rules, guard-banding in design, and acceleration via burn-in during manufacturing
      • Not a major concern for SDC, because they are not silent for long
  • 14. Server Market Segments
      • Back Office: CRM, ERP, BIDW, Database
      • HPC, Mainstream, Web Infrastructure: Finance, Manufacturing, Oil and Gas, Life Sciences, Web 2.0, Government, Storage, Service Providers
  • 15. Server SDC and Availability: Typical Targets
      Server Type     | MTBSDC         | Availability (%)
      Data Centric    | 100-1000 years | 99.999
      Web Centric     | 10-100 years   | 99.999-99.9999
      Compute Centric | 100-1000 years | 99.990
      MTBF in years = 10^9 / (FIT * 24 hours * 365 days)
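The FIT-to-MTBF formula on slide 15 can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are hypothetical and not from the deck, and FIT is taken in its usual sense of failures per 10^9 device-hours.

```python
HOURS_PER_YEAR = 24 * 365  # same constant as the slide's formula


def fit_to_mtbf_years(fit: float) -> float:
    """Convert a FIT rate (failures per 1e9 device-hours) to MTBF in years."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return 1e9 / (fit * HOURS_PER_YEAR)


def mtbf_years_to_fit(years: float) -> float:
    """Inverse conversion: the FIT budget that yields a given MTBF in years."""
    if years <= 0:
        raise ValueError("MTBF must be positive")
    return 1e9 / (years * HOURS_PER_YEAR)


if __name__ == "__main__":
    # FIT budgets corresponding to the MTBSDC range boundaries in the table.
    for years in (10, 100, 1000):
        print(f"MTBSDC {years:>4} years -> {mtbf_years_to_fit(years):8.1f} FIT")
```

Read this way, a 100-year MTBSDC target corresponds to a budget of roughly 1140 FIT for silent errors across the whole server.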
  • 16-22. Classification of Silicon Errors from a User Perspective (diagram built up over seven slides)
      The universe of silicon errors in a server chip is split along two axes: Corrected (C) vs. Uncorrected (U), and Silent (S) vs. Reported (R).
      • SC (silent, corrected): customer does not care
      • SU (silent, uncorrected): Silent Data Corruption
      • RC (reported, corrected): required by Service/Customer to monitor health
      • RU (reported, uncorrected): System Crash
  • 23. A Typical Data Centric Server
      Component                      | Approx. Count | Comments
      Processors                     | 8-64          | 8-64 way systems
      ASICs                          | 320           | Memory controllers, IO bridges, Crypto, etc.
      Memory DIMMs                   | 640           | Depends on memory capacity
      AC/DC Power Supplies           | 8-10          | Main power supply
      DC/DC Power Supplies           | 640           | High and low voltage supplies
      Clocking                       | 64            | Clock synthesizers and distribution
      Service Processor              | 4             | Small processors, FPGA
      Miscellaneous Small Components | 1000-10000    | Resistors, Capacitors, Pins, Connectors
  • 24-25. Server Sensitivity to Processor SDC
      [Plot: Sensitivity of Server to Processor SU Rate. x-axis: Processor SU (Silent Uncorrected) FIT, ticks up to 700; y-axis: Server MTBSDC (Years), ticks up to 120. Two points on the curve are annotated: 89 years and 42 years.]
      • A 150 FIT increase in processor implies:
          – 52.8% degradation of MTBSDC
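The 52.8% figure on slide 25 is the relative drop between the two annotated points, 89 years and 42 years. The sketch below reproduces that arithmetic and converts each MTBSDC into the total server SDC FIT it implies, using the slide 15 formula. The function names are hypothetical, and the plot annotations are assumed to be rounded readings.

```python
HOURS_PER_YEAR = 24 * 365  # same constant as the slide 15 formula


def degradation_percent(before_years: float, after_years: float) -> float:
    """Relative loss of MTBSDC, in percent, going from `before` to `after`."""
    return 100.0 * (before_years - after_years) / before_years


def server_fit(mtbsdc_years: float) -> float:
    """Total server SDC FIT implied by a given MTBSDC in years."""
    return 1e9 / (mtbsdc_years * HOURS_PER_YEAR)


if __name__ == "__main__":
    before, after = 89.0, 42.0  # annotated points on the plot
    print(f"MTBSDC degradation: {degradation_percent(before, after):.1f}%")  # prints 52.8%
    print(f"Server SDC FIT at {before:.0f} years: {server_fit(before):.0f}")
    print(f"Server SDC FIT at {after:.0f} years: {server_fit(after):.0f}")
```

Because a server contains many processors, a per-processor FIT increase is multiplied at the system level, which is why a modest change per chip moves the server MTBSDC so sharply.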
  • 27-37. Design for SDC Mitigation (flow diagram built up over eleven slides)
      • VOC, Field Data, Marketing
      • System Level MTBSDC, MTBUSI Targets
      • Chip Level FIT Targets, compared against Actual Chip Level FIT (“=” or “Not Equal”)
      • Electrical, Logical and Architectural Derating
      • Circuit, Logic, Architecture, SW Detection, Correction, Recovery Solutions
      • Raw Soft Error Rate
          – SER Estimation from SPICE Simulations
          – Raw Static SER Measurement at LANL
      • Raw Hard Error Rate
          – GOI, NBTI, CHC, EM Reliability Modeling
          – Accelerated Test of Samples
      • Two “Not Equal” labels appear in the final build (slide 37), one beside the system-level targets and one beside the detection, correction and recovery solutions
  • 39. Solution Trends for SDC
      • Unit level redundancy is too costly
      • Logic and flops need to be protected
      • Circuit level solutions can be limiting
      • Logic/architectural solutions more promising
      • Periodic on-line testing for predicting degradation
      • Trillions of random verification cycles
  • 41. Conclusions
      • SDC is a reality
          – criticality and investment in mitigation highly dependent on application space
      • Solutions to SDC need to be low overhead
          – mainframe level reliability/availability at server price points
      • Need more accurate estimation of SDC
      • SDC due to design bugs and design/process marginalities still hard to estimate
  • 42. Backup Slides
  • 43. Using Sun Processor “Ranch” (Testing in Broomfield, CO)
  • 44. Broomfield Test Setup
      • Altitude and geomagnetic location give ~4.1x acceleration over sea level
      • 600 US-III processors
      • 3 months of testing
      • Used modified POST code to write 0's and 1's to memory arrays and observe bit flips
      • Monitored power supply failures as well
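To get a feel for what the Broomfield setup buys, the sketch below turns the slide's numbers (600 processors, 3 months, ~4.1x acceleration) into equivalent sea-level device-hours and shows how an observed upset count would map to a FIT estimate. Several things here are assumptions rather than data from the deck: "3 months" is taken as 90 days of continuous running, the function names are made up, and the upset count in the example is purely hypothetical.

```python
def equivalent_device_hours(devices: int, days: float, acceleration: float) -> float:
    """Sea-level-equivalent device-hours accumulated by an accelerated field test."""
    return devices * days * 24 * acceleration


def fit_estimate(upsets: int, device_hours: float) -> float:
    """Point estimate of FIT (failures per 1e9 device-hours) from an observed count."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return upsets / device_hours * 1e9


if __name__ == "__main__":
    # 600 CPUs and 4.1x come from the slide; 90 days is an assumed reading of "3 months".
    hours = equivalent_device_hours(devices=600, days=90, acceleration=4.1)
    print(f"Equivalent sea-level device-hours: {hours:,.0f}")

    # Hypothetical count, used only to show the conversion; not a result from the test.
    hypothetical_upsets = 1
    print(f"FIT per observed upset: {fit_estimate(hypothetical_upsets, hours):.1f}")
```

Under these assumptions a single observed upset corresponds to a little under 200 FIT per processor, which gives a sense of the resolution limit of a large-volume field test compared with neutron-beam irradiation.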
  • 45. Soft Error Testing of SUN Processors - A Chronology
      Date             | Process Node | Device Under Test   | Location   | Test Type
      8/2000           | 250nm, 180nm | US III              | Los Alamos | Neutron Irradiation
      11/2000 – 2/2001 | 250nm, 180nm | US III              | Broomfield | Large Volume (600 CPUs)
      11/2002          | 150nm, 130nm | US III              | Los Alamos | Neutron Irradiation
      11/2003          | 130nm, 90nm  | US IIIi, IIIi+      | Los Alamos | Neutron Irradiation
      8/2004           | -            | Commodity SRAM      | Berkeley   | Neutron Irradiation
      4/2005           | 90nm         | US IIIi+            | Los Alamos | Neutron Irradiation
      11/2005          | 90nm         | US T1, IIIi+, IV+   | Los Alamos | Neutron Irradiation
      12/2005          | 90nm         | US T1               | Los Alamos | Neutron Irradiation
      12/2006          | 65nm         | US T2               | Los Alamos | Neutron Irradiation
      12/2007          | 65nm         | US T2/Nextgen Proc  | Los Alamos | Neutron Irradiation
  • 46. A Typical LANL Test Setup
      • Recently tested UltraSPARC T2 and a next generation processor in 65nm technology
      • Ran multiple systems in parallel
      • Different parts, voltages & test patterns
      • Beam time efficiency
          – 12% beam off
          – 5% of time in setup, debug
          – 83% of beam time gave useful data
      • Cumulative 775 hours of data gathered
  • 47. Design/Process Marginality: Where do you solve it?
      • Design: Guard-bands (Loss of Performance)
      • Field: In-line Correction (Area/Power Cost)
      • Manufacturing: Wider Test Box (Loss of Yield)