International Journal of Electrical Engineering and Technology (IJEET), ISSN 0976 – 6545(Print),
ISSN 0976 – 6553(Online) Volume 5, Issue 5, May (2014), pp. 57-73 © IAEME
INTERMITTENT FAILURES IN HARDWARE AND SOFTWARE
Dr. Michael Pecht, Anwar Mohammed
CALCE Electronic Products and Systems Center, University of Maryland, College Park, MD 20742,
USA
Flextronics, 847 Gibraltar Drive, Milpitas, CA 95035, USA
ABSTRACT
Intermittent failures are a major concern in electronic systems because they are unpredictable
and non-repeatable. They can be very expensive for companies, damage the reputation of a company,
or cause catastrophic damage in safety-critical systems such as nuclear plants. This paper discusses,
both at the hardware and software level, the causes of intermittent failures and the methodology to
diagnose the causes. Mitigation strategies to help reduce the occurrence of these failures are
discussed and new, emerging technologies designed to minimize intermittent failures are also
reviewed. The paper concludes with recommendations designed to minimize the occurrence of
intermittent failures.
1. INTRODUCTION
Intermittent failures are sporadic failures that are not easily repeatable. According to IEEE,
intermittent failure (IF) can be defined as the failure of an item for a limited period of time,
following which the item recovers its ability to perform its required function without being subjected
to any external corrective action [1]. When a product can no longer perform its designed function
over the intended time frame, it is considered to have failed. When the product manifests a loss of
some of its function or performance characteristics for a limited time, but shows subsequent
recovery, it has experienced intermittent failure. Intermittent failures are hard to replicate because of
their erratic behavioral pattern. They are often called “ghost failures” because they come
and go and are hard to reproduce on the bench [2].
Therefore, it is more difficult to conduct failure analysis for intermittent failures, understand
their root causes, and isolate their failure sites than it is for permanent failures. An intermittent
failure is not necessarily repeatable; however, it often is [3]. An intermittent failure may lead to
permanent failures in later stages of the life cycle.
During the inspection process in manufacturing, intermittent failures may be reported as
rejected parts with no failure found (NFF). This means that a failure was observed in the system, but
when the device was re-tested, a failure mode could not be identified or the failure could not be
duplicated. This is also known as trouble not identified (TNI), no trouble found (NTF), cannot
duplicate (CND), or retest ok (RTOK) [3]. These failures are hard to identify or replicate, even
though they are recurrent. Many different factors can cause intermittent failures, such as process
variations like a change in the humidity level, manufacturing residuals like solder fluxes and epoxy
bleed outs, radiation, vibration, wear out leading to opens, and voltage and temperature fluctuations
[3]. Such transient causes, seen both in hardware and software, are hard to reproduce and can lead to
negative consequences such as mission aborts and flight and train delays or cancellations. They can
increase system downtime and decrease system availability. A reduction in IF will increase system
availability more than a reduction in failure rate [4].
An intermittent failure can lead to unintended consequences such as increased operation cost,
higher downtime, and a perception of lower quality, especially in sensitive industries such as
aerospace. A system which has failed previous testing and then suddenly starts passing testing,
showing no signs of failure, can erode the trust in the testing methodology [5] and can cause an IF to
be identified as a false alarm even though a real failure exists in the system.
Intermittent failures inflict a heavy toll on companies. During retesting, when a failed part
cannot be validated as a failed part, extra testing must be conducted to identify the failure. These
extra tests impose additional costs. In the case of IFs, since the failures cannot be replicated
consistently, the retest and repair costs are higher than those for permanent failures. This is because
an effective repair cannot be made until the failure is validated. Maintenance can consume time and labor in
an attempt to identify a failure without any success, sometimes resulting in blind replacement of
parts that are suspected of having a defect (without finding any specific problem), which increases
the cost of inventory. For example, in 2001, fighter plane customers spent $10 million to replace
parts that were tested as intermittent failures at the shop level [6]. In another case, in the 1980s, the
thick film integrated (TFI) ignition modules of an automotive company were afflicted by intermittent
failures, leading to a lawsuit settlement by the company [3]. A study carried out in 2005 found that
IFs account for about 63% of the mobile phones returned to the manufacturer, costing the industry
$4.5 billion per year [7]. Kimseng et al. [8] carried out a study on intermittent failures in the
digital electronic cruise control modules made by a manufacturer for various automobiles and found
that 96% of the modules returned to the manufacturer passed the manufacturer's bench tests.
Kimseng et al. concluded that the bench tests were not representative of the actual
automotive environment and that the testing was not appropriate to assess the original failure.
A holistic approach is helpful to understand and eliminate intermittent failures. This approach
would include better diagnostic capability and efficient mitigation techniques. Therefore, this paper
discusses both hardware and software intermittent failures, including their causes, diagnosis, and
mitigation methodologies. Emerging developments in this technology space are also reviewed to
help formulate better solutions.
2. HARDWARE INTERMITTENT FAILURES
Tentative or temporary hardware malfunctions can cause intermittent failure in electronic
devices. This section describes common hardware components that experience intermittent failures
and their failure mechanisms. The diagnosis and mitigation of hardware intermittent failures are also
examined, and some recent technologies designed to overcome these problems are covered.
2.1 Causes of Failure
Unlike permanent failures with persistent causes, the failure cause in intermittent failures
may no longer exist during testing, because of changes in the working environment. Hardware
intermittent failures can have different root causes, such as mismatched thermal expansion, vibration,
corrosion, and electromigration. In this section, some key intermittent failure causes for hardware
components are investigated.
2.1.1 Wire Bond and Connectors Failures
Wire bonds and connectors cause a high percentage of hardware intermittent failures [9].
Some common causes include coefficient of thermal expansion (CTE) mismatch, component wear
out caused by age or repeated usage, and corrosion. For example, the CTE mismatch between the
wire bonds and the copper bonding pads on a PCB can cause intermittent opens and shorts during
temperature excursions. In another example, the contact resistance of a new, tin-plated contact may
be a few milliohms, but after a thousand contact cycles, the resistance can become as high as several
ohms. With more usage, intermittent failures that disappear in the next contact cycle may also occur
[9]. The thermal and mechanical vibrations in the connectors can lead to fretting corrosion, causing
the contact resistance to increase, thus inducing intermittent connection failures [10, 11]. Loose PCB
interconnectors and aging connectors and components have been identified [3] as some of the
common causes of electronic system failure. Gibson et al. [12] concluded that over 50% of all
electronic failures are triggered by interconnector-related problems. Other common causes are
vibration, stress relaxation, and the movement of the wiring harness generated by the magnetic field
[9]. The following paragraphs describe some of these failures in more detail.
Wire bond related intermittent failure occurs when a poorly connected wire bond temporarily
dislodges because of thermal expansion at temperatures above room temperature. The wire bond
may then return to its normal state once the thermal stress caused by CTE mismatch is removed.
The failure mode in such cases is usually an open circuit. On the other hand, a loose conductive
material floating on the package may connect with a wire bond on another part of the circuit,
resulting in a short circuit. When this floating piece moves away from the failure site, because of
vibration for example, the failure is no longer observed [13]. Loose materials can be detected by
using appropriate screening methods, including X-ray, vibration, and acoustic testing. Screening and
testing methodologies are designed based on the potential causes and effects of the short circuit on
the component performance [14]. Intermittent wire bond failures may also be induced by the
molding process, which can damage wire bonds. This damage is not easily detectable and is attributed
to the weakening and lifting of the gold bond during the molding process on the side of the package
opposite to where the injection molding occurred [15]. Proper molding process control parameters
and effective detection techniques would minimize such intermittent failures.
In a study of military aircraft, Sorensen [16] noted that 50% of all the failures
were intermittent failures and 80% of those were related to solder joints and connector pins. For the
aircraft industry, aging devices will lead to IFs, quite often as a prelude to permanent failures. Many
IFs are the result of the gradual degradation of a component or system. They may initially appear as
small noise fluctuations but could lead to permanent failures. Filho et al. [17] point out that for
continuous monitoring methods, intermittent failures can appear long before open circuits are
detected.
Corrosion can cause electrical degradation of the contact, which is initiated by a galvanic
reaction between two metals within the electrical circuit. Corrosion on electronic parts can result in
one of two scenarios: short circuits or an increase in the electrical resistance of the components.
When corrosion occurs, it is rarely uniform on the affected surfaces, which may result in the
appearance of an intermittent failure. With respect to the contacts of electronic parts, intermittent
failures occur because of frequent connections and disconnections, as seen in the corrosion of copper
connectors that have layers of nickel and gold to protect against wear out. In harsh environments
(high relative humidity and the presence of H2S), formation of the corrosive component Cu2S causes
intermittent failure behaviors [18]. With vibration and temperature fluctuations, this conductive path
can be connected and disconnected, resulting in intermittent failures. Intermittent failures due to
corrosion generally occur in the early stages (the first 50%) of the product life cycle. Intermittent
behavior aggravated by CTE mismatch or vibration generally appears during the later stages (the last
50%) of the life cycle of a product [19]. For example, electrochemical migration, which occurs
between anodes and cathodes (and can be a reason behind IF reports), is a corrosion-related failure
mechanism that forms dendrites between opposite biases and eventually results in short circuits. The
driving forces for this corrosion process are the potential voltage bias, contaminated surfaces (lack of
environmental control), and the fact that the metals that are commonly used (Sn, Pb, Cu and Ag) are
susceptible to corrosion. Since this process is not time induced, the intermittent failures are
manifested early in the product life cycle. Tin whiskering has been identified [20] as another
common cause for intermittent failures. A PCB with a pure tin finish, having non-compressive
internal stress, is known to create tin dendrites that can cause short failures. However, at elevated
temperatures the dendrites may melt away and repair the short.
2.1.2 Digital Integrated Circuit Failures
Integrated chip devices are being scaled down rapidly. This reduction in size makes digital
integrated circuits more susceptible to permanent and intermittent behavior. Intermittent failure
modes in digital logic integrated circuits (ICs) have been categorized as timing violations,
stuck-at-zero or stuck-at-one failures, intermittent shorts or opens, or electromigration failures [21].
An increase in the resistance of interconnects due to thermal or mechanical loads,
electromigration, or material diffusion, increases the time for signal propagation and leads to a
timing violation [22]. These failures are manifested because of thermal and electrical loads and
signal frequency variations. Kothawade et al. [23] found that timing violation in a processor can be
attributed to multiple factors such as process variations, negative bias temperature instability (NBTI),
temperature fluctuations, hot carrier injection (HCI), and voltage fluctuations. Since timing
violations can be caused by many factors, it is challenging for processor designers to design fault
tolerance mechanisms. Time dependent HCI failures are generally permanent in nature. NBTI
failures caused by AC stress tend to be intermittent failures whereas failures caused by static stress
usually manifest as permanent failures. Within an integrated circuit, the thin oxide layers separating
the adjacent metal traces can also lead to intermittent shorting or opens caused by traces coming in
contact with each other or losing contact. Constantinescu [21] also studied the causes of intermittent
behavior in ICs. The study identified voltage fluctuations across ICs as the cause
of oxide layer breakdown. As ICs have become smaller, the thickness of the oxide layers has
decreased, which increases the risk of oxide layer breakdown. When this oxide
layer breaks down, it creates a conducting path, thereby increasing the leakage current. The
introduction of high-k dielectrics reduces the rate of oxide breakdown, enabling the use of thinner
dielectrics. However, this can also lead to timing violation failures. Before a complete dielectric
breakdown leads to a permanent failure, there is a stage known as
dielectric soft breakdown. During this stage, a device may exhibit intermittent failures.
Intermittent stuck-at-zero or stuck-at-one failures occur in storage elements. Digital circuits
have two states, 0 or 1, and a fault occurs when a particular signal is tied to either 1 or 0. This
produces a logical error. Pan et al. [24] developed a metric for stuck-at-zero/stuck-at-one to
characterize the vulnerability of a microprocessor to intermittent failures based on its structure.
Experimental results show that the susceptibility varies significantly across different structures, and
the vulnerability of the reorder buffer is much higher than that of the register file. These storage
element intermittent failures have an active time and an inactive time. The active time is the time
during which the failure is in process and causes unexpected behavior, while the inactive time is the
time when the failure does not affect performance. The length of this active time determines how
significantly the failure affects the performance of a microprocessor.
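The active and inactive times described above can be illustrated with a minimal sketch. The class `IntermittentStuckAt`, its fields, and the strictly periodic active/inactive pattern are assumptions made for illustration; they are not taken from [24], and real intermittent faults do not follow a fixed period.

```python
from dataclasses import dataclass


@dataclass
class IntermittentStuckAt:
    """Hypothetical fault model: one bit is forced to a fixed value while active."""
    bit: int              # index of the affected bit (0 = least significant)
    stuck_value: int      # 0 for stuck-at-zero, 1 for stuck-at-one
    active_cycles: int    # assumed length of each active window, in cycles
    inactive_cycles: int  # assumed length of each inactive window, in cycles

    def is_active(self, cycle: int) -> bool:
        # Assumption: the fault repeats with a fixed period.
        period = self.active_cycles + self.inactive_cycles
        return (cycle % period) < self.active_cycles

    def read(self, stored: int, cycle: int) -> int:
        """Value seen when reading the storage element at the given cycle."""
        if not self.is_active(cycle):
            return stored
        mask = 1 << self.bit
        return (stored | mask) if self.stuck_value else (stored & ~mask)


def count_corrupted_reads(fault: IntermittentStuckAt, stored: int, cycles: int) -> int:
    """Count the cycles in which a read would return a wrong value."""
    return sum(1 for c in range(cycles) if fault.read(stored, c) != stored)


fault = IntermittentStuckAt(bit=0, stuck_value=1, active_cycles=2, inactive_cycles=8)
print(count_corrupted_reads(fault, 0b1010, 20))  # bit 0 is 0, so active reads are wrong
print(count_corrupted_reads(fault, 0b1011, 20))  # bit 0 is already 1, so the fault is masked
```

In this sketch, a longer active window corrupts more reads, and a stored value that already matches the stuck value masks the fault entirely.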
ICs are susceptible to intermittent failures due to electromigration. Electromigration is the
movement of metal atoms in a conductor caused by the flow of electrons through it. This movement of atoms can
lead to an open or short circuit failure. In both cases, the failures appear initially as intermittent
failures and end up as permanent failures. As IC technology scales down, the wire widths
are reduced. When current flow is not scaled down proportionally, the ICs become vulnerable to
electromigration [24].
2.1.3 Component Connection Failures
Another area of concern for intermittent failures is component pins, whether on
a multi-pin IC, a resistor network, or a simple two-lead capacitor. Intermittent failures can be caused
by imperfections in the solder process or by a fractured lead whose two broken ends
intermittently make and break contact. Once the pin is broken, the failure may show up
during thermal cycling or vibration testing. Resolution for these types of failure includes better
methods of attaching larger components, such as a resistor network or large capacitor, to the circuit
card. Studies [25] have shown that intrinsic design flaws and substandard manufacturing processes
such as soldering play a big role in creating intermittent failures.
2.2 Diagnosis
The Failure Modes, Mechanisms, and Effects Analysis (FMMEA) can be used to detect
intermittent and permanent failures in hardware. Mathew et al. [26] have proposed the following
methodology.
Figure 1: FMMEA Methodology [26]
The first two steps identified in Figure 1 are ‘define system and identify elements and
functions to be analyzed’ and ‘identify potential failure modes.’ They are more challenging for
intermittent failures than for permanent failures. This is because, in the case of intermittent failures,
it is difficult to define which system has the failure in a complex system consisting of several
subsystems intermeshed together. A failure in one of the subsystems could affect another subsystem
and result in its failure. Finding the subsystem with the initial failure is challenging, since
intermittent failures are not always detected when the system is tested for faults. Identifying the
correct failure modes is also not easy because of the erratic nature of intermittent failures; this
requires extra work.
Kirkland [27] describes a variety of methods to detect failure modes for intermittent failures
in electronic devices, including signal looping, pattern looping, signal stepping, frequency deviation,
pattern adjustment in critical areas, signal strength variation, current path duplication, measuring
capacitance variations, Vcc adjustments, resistive or impedance rebounce, temperature change
application, and noise dissimilarity testing. Using these methods can help identify failure modes,
such as increased gate delays, degraded signals, increased leakage, and high frequency failures. A
minimum set of conditions (such as voltage drop threshold and temperature variations) needs to be
present to make the failure mode observable.
Another systematic approach for analyzing intermittent failures is employing a cause and
effect diagram, which is also known as a fishbone or Ishikawa diagram. An example is depicted
in Figure 2 [28].
Figure 2: Fishbone diagram for intermittent failures in hardware and software [28]
A cause and effect diagram defines the key failure (also known as key effect) and investigates
the possible causes of each of the effects and offers a list of all the possible causes leading to the
failure. It is an effective method for analyzing failures in complex systems. For example, intermittent
failures in plastic ball grid array packages were analyzed using this method, which narrowed down the
possible causes of failure and finally identified solder joint failure as the main cause of intermittent failure [28].
Steadman et al. [29] developed a test methodology for intermittent faults in aircraft. This method
subjected an avionics system to thermal and vibrational loads, while simultaneously monitoring the
system for faulty components, thus reducing the occurrence of intermittent failures. An improved
approach should include online monitoring of critical avionics components while the system is in
operation. This would reduce the overhead cost incurred by offline monitoring that uses load profiles
that do not accurately replicate the operating conditions. The monitoring of current to detect
intermittent failure has been recommended [30] because normal circuits would carry a significantly
different current load when compared to damaged circuits.
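As a rough illustration of current monitoring, the sketch below flags intervals in which a sampled current leaves a tolerance band around its nominal value. The function name, the fixed nominal value, and the tolerance are assumptions for illustration; the method recommended in [30] may differ in how it defines abnormal current.

```python
def find_anomalous_intervals(samples, nominal, tolerance):
    """Return (start, end) index pairs, end exclusive, where the sampled
    current deviates from the nominal value by more than the tolerance."""
    intervals = []
    start = None
    for i, value in enumerate(samples):
        out_of_range = abs(value - nominal) > tolerance
        if out_of_range and start is None:
            start = i                      # deviation begins
        elif not out_of_range and start is not None:
            intervals.append((start, i))   # current recovered on its own
            start = None
    if start is not None:
        intervals.append((start, len(samples)))  # still abnormal at end of record
    return intervals


# Hypothetical current samples in amperes, with a brief dropout and a brief surge.
samples = [1.0, 1.02, 0.1, 0.0, 0.98, 1.01, 1.6, 1.0]
events = find_anomalous_intervals(samples, nominal=1.0, tolerance=0.2)
# An interval that ends before the record does shows recovery without repair,
# which is the signature of an intermittent rather than a permanent failure.
recovered = [iv for iv in events if iv[1] < len(samples)]
print(events, recovered)
```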
In 1978, Savir [31] presented a paper on developing a model to detect intermittent failures in
a sequential circuit, which is a type of circuit with memory logic and is found in most digital
systems. He recommends leveraging both deterministic (non-random) and random test
procedures for optimizing the probability of IF detection. The intermittent failures are divided into two
major categories: stationary failures (such as loose connections) and transient failures
(such as failures induced by electro-magnetic interference). In sequential circuits, the first
manifestation of an active fault may induce the circuit to enter an incorrect state without producing
an immediate output error. This state change may generate an output error later when the fault has
become inactive. The optimal value of detection probability is obtained by developing a graph of all
the input sequences and determining which sequences lead to intermittent failures.
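The delayed output error described above can be reproduced on a toy sequential circuit. The sketch below is not Savir's model [31]; it assumes a hypothetical 2-bit counter whose output is 1 only in state 3, and a fault that forces the enable input to zero while active.

```python
def run_counter(inputs, fault_active_cycles=frozenset()):
    """Simulate a 2-bit counter; the output is 1 only while the state is 3.

    Assumed fault: during the listed cycles the enable line is stuck at zero,
    so the counter silently misses an increment.
    """
    state = 0
    outputs = []
    for cycle, enable in enumerate(inputs):
        outputs.append(1 if state == 3 else 0)  # output depends on state only
        if cycle in fault_active_cycles:
            enable = 0
        if enable:
            state = (state + 1) % 4
    return outputs


def first_output_error(inputs, fault_active_cycles):
    """Return the first cycle where the faulty output differs, or None."""
    good = run_counter(inputs)
    bad = run_counter(inputs, fault_active_cycles)
    for cycle, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return cycle
    return None


# The fault is active only in cycle 1, but the first wrong output appears later.
print(first_output_error([1] * 8, {1}))
```

Here the fault corrupts the state in cycle 1 without changing the output in that cycle; the mismatch only becomes observable at cycle 3, after the fault has gone inactive.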
To detect intermittent failures, a minimum set of conditional requirements is necessary to
manifest the failure [32]. The challenge is in determining the environmental conditions when the
failure occurred and re-creating them. Harsh ambient conditions, such as high humidity and the
presence of halides, can initiate unintended conductive pathways on insulating surfaces. Such a
pathway could eventually become a permanent failure, but it could manifest itself in the earlier
stages as intermittent failure. Figure 3 offers a brief list of potential causes for hardware related
intermittent failures.
Figure 3: List of Causes for Intermittent Failure
Component shifting during solder reflow
Contamination (including oxidation at test sites)
Chemical degradation (including creep corrosion, fretting, whiskers, electron migration, etc.)
Cracked substrates
Damaged circuits
ESD induced
Floating leads (or other conductive pieces)
Ionizing radiation in semiconductors
Insulation oxide layer breakdown
Irregular or altered current path
Loose connections (wire bonds, connectors, etc.)
Magnetic field variations
Materials degradation (aging, chemical, stress, etc.)
Overstress (for example, high voltage on capacitor dielectric)
Partial delamination
PCB (warpage, via cracking, black pad, etc.)
Poor wire bonding (on high K dielectric, etc.)
Temperature sensitivity (CTE mismatch, etc.)
Vibration induced
Voltage overstress
Weak solder joints (varying with temperature/stress)
Weak structural integrity
Wire sweep during molding
2.3 Mitigation
Integrated circuits try to compensate for breakdowns by having failure tolerance built into
them. Failure tolerance masks the occurrence of failures from the end user (it prevents end users
from experiencing performance drops). For example, most processors choose a maximum clock rate after
having guard-banded against unpredictable interactions and variations in the actual clock rate. ICs
also have chip-level failure tolerance, such as error correcting codes, self-checking circuits, and
hardware-implemented checkpointing and retries [22].
Three main methodologies to mitigate the intermittent behavior in ICs are dynamic instruction
delaying, core frequency scaling, and thread migration. When the processor takes longer than
expected to execute a process, a delay and a timing violation occur. This fault may be avoided
by using techniques such as dynamic instruction delaying. This is a type of algorithm that calculates
the scheduling priorities during the execution of the system. The objective is to respond dynamically
to the changing conditions and form a self-sustained, optimized configuration. Another approach to
mitigating delay is core frequency scaling, which scales down the performance of the CPU to a lower
frequency when less is needed and scales it up to a higher frequency when more is needed. Thread
migration is another technique used to overcome intermittent failure. A thread is an ordered set of
instructions that tells a computer exactly what to do. When a specific thread encounters failures, the
content of the thread within the faulty computer core is transferred to another thread within an idle
core, where the problem is addressed and solved.
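The core frequency scaling idea can be sketched as a simple step governor. This is only an illustration under stated assumptions: the function name, the utilization thresholds, and the frequency levels are hypothetical, and real processors use vendor-specific policies.

```python
def select_frequency(utilization, current_mhz, levels_mhz, low=0.3, high=0.8):
    """Pick the next core frequency from a list of supported levels.

    Assumed policy: step up one level when utilization exceeds `high`,
    step down one level when it falls below `low`, otherwise hold.
    """
    levels = sorted(levels_mhz)
    idx = levels.index(current_mhz)  # raises ValueError for an unsupported level
    if utilization > high and idx < len(levels) - 1:
        idx += 1
    elif utilization < low and idx > 0:
        idx -= 1
    return levels[idx]


levels = [800, 1600, 2400]  # hypothetical supported frequencies in MHz
print(select_frequency(0.9, 1600, levels))  # heavy load: step up
print(select_frequency(0.1, 1600, levels))  # light load: step down
```

A lower frequency means a longer clock period (1.25 ns at 800 MHz versus roughly 0.42 ns at 2400 MHz), which leaves more slack for a signal path whose delay has increased.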
The intermittent failures in some avionic systems can be caused by failures in solder joints
and multi-layer ribbon cables [29]. These failures may be initiated by the variations in operating
conditions, such as temperature or current, and may disappear due to re-melting of the solder, closing
of the crack, or filling of the void due to thermal fluctuations. Development of robust soldering
processes which include appropriate material selection would mitigate soldering related intermittent
failures. The plethora of solder choices, which include leaded solder, lead-free solder, low
temperature solder, low silver solder, and soft solder, makes it even more critical to develop
appropriate processes for solder attach and solder reflow. Since there is no known, effective method
to mitigate solder joints and multi-layer ribbon cable failures, more research on improving the
robustness and consistency of solder joints is necessary, and self-repairing wire bonds should also be
developed.
2.4 New Technology Trends
Recent technological developments to solve hardware intermittent failures offer insight into
future solutions. The industry is addressing the IF problem by developing innovative approaches.
The focus is also shifting from failure detection to failure avoidance.
Intermittent failures on a silicon chip, such as Time Dependent Dielectric Breakdown
(TDDB) and Electromigration (EM), are caused by gate wear out because of extensive usage. Gate
usage can be monitored in the form of gate toggles [33]. Researchers [34] discovered that the
vulnerability to intermittent failure could be monitored by tracking the number of gate toggles. They
studied four OpenSPARC RTL modules and tracked how each instruction moved through these four
modules while toggling different gates. The four modules studied were the IFU, EXU, FFU, and
LSU modules. They discovered that certain submodules, such as exu-alu within the EXU module
and lsu-dcdp within the load store unit, display a relatively high amount of toggling regardless of
the type of instruction being executed. This revealed that there could be groups of modules and
submodules which would have higher susceptibility to wear out failures, resulting in intermittent
failures. Higher vulnerability by itself cannot be a good predictor for a failure rate, but when
combined with operating conditions such as temperature, the degradation of a gate structure can be
forecasted. Preemptive steps could also be taken during the design stage to avoid the occurrence of
such intermittent failures.
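Toggle tracking of this kind can be sketched in a few lines. The trace format and the gate names `g_alu` and `g_fpu` are hypothetical; the tooling used in [34] on the OpenSPARC RTL is not described in this paper.

```python
def count_toggles(trace):
    """Count 0->1 and 1->0 transitions per gate.

    `trace` is a list with one dict per clock cycle, mapping a gate name
    to its logic value (0 or 1) in that cycle.
    """
    if not trace:
        return {}
    toggles = {gate: 0 for gate in trace[0]}
    for prev, curr in zip(trace, trace[1:]):
        for gate, value in curr.items():
            if value != prev[gate]:
                toggles[gate] += 1
    return toggles


# Hypothetical four-cycle trace for two gates.
trace = [
    {"g_alu": 0, "g_fpu": 0},
    {"g_alu": 1, "g_fpu": 0},
    {"g_alu": 0, "g_fpu": 0},
    {"g_alu": 1, "g_fpu": 1},
]
counts = count_toggles(trace)
# Gates with the highest counts are candidates for closer wear-out monitoring.
print(sorted(counts, key=counts.get, reverse=True))
```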
The intermittent loss of connection between connectors is a very common failure in electrical
systems [35]. In spite of the extra caution during connector installation, this remains a problem in
avionics and military equipment. In 2012 an approach was suggested [36] to create an online
methodology to detect intermittent failures caused by intermittent connections. The idea is premised
around the principle derived from the Lorentz Law that any sudden flux change should create a large
voltage manifesting as an arc which would propagate along the circuitry as a traveling wave. The arc
is defined as the electrical discharge initiated by improper cable connections. Intermittent failures
caused by loose connector connections can be detected by monitoring for the presence of this arc.
Their research describes the online monitoring methodology to detect the presence of this arc to flag
any connector disconnection failures.
Advances in semiconductor scaling technology have revealed that there is now greater
exposure and vulnerability to not only single event upsets (SEUs) in integrated memories but also to
single-event transients (SETs) in high speed logic [37]. SEUs are induced by environmental causes
such as cosmic radiation or alpha particle radiation. They initiate current pulses at random times and
locations in a digital circuit. SETs are caused by transient charge displacements which generate logic
errors in subsequent circuits. Both SEUs and SETs are responsible for creating intermittent failures.
This is a problem which is getting worse because of industry demand for semiconductor scaling. An
estimation methodology to monitor the SEUs and SETs in combinatorial circuits using CMOS
technology has been proposed [38]. The sources of alpha particle contamination are some packaging
materials, such as the filler materials used in molding compound, and the presence of lead in
solders that are not lead-free. SEU problems initiated by alpha particles have been essentially solved by the
industry, but cosmic rays still pose significant SEU problems [28].
A paper published in 2012 by Pan et al. [39] strives to address the CMOS technology scaling
problem from a different perspective. The paper proposes the quantitative characterization of the
vulnerability of the microprocessor structure to intermittent failures. This is called the intermittent
vulnerability factor (IVF), and it is the probability that an intermittent fault in the microprocessor
structure will manifest as an externally visible failure. Their research revealed that it is the intermittent
stuck-at-one fault model which has the most serious impact on program execution. The IVF is
calculated after listing the causes of the intermittent failures, classifying them into different fault
models and setting parameters to determine when the intermittent fault will result in a visible error.
This information is used to develop IVF computational algorithms for different intermittent fault
models within a processor. The IVF data can then be used to improve the microprocessor quality,
reliability, and durability (QRD) by proper interventions during the design stage. The IVF could also
be used for intermittent fault detection and error recovery.
Correcher et al. [40], in a paper published in 2012, introduce the concept of modeling
intermittent failure dynamics. They propose two methodologies for characterizing the dynamics: a
probabilistic model and a temporal model. The probabilistic model allows the intermittent failure
probability to be computed at any time; however, it needs historical data, which may not always
be available. The temporal model is more practical, and it offers a measurement of failure density.
Research shows that the duration and frequency of intermittent failures increase with time, and the
failure density and pseudo-period can help in predicting this growth. The pseudo-period is the average time
difference between failures, normalized by the number of failures. It is related to MTBF
(mean time between failures) and is used to model the reliability of repairable systems. The
pseudo-period can be used to predict the number of operations before replacement by determining whether
the model should follow a linear or an exponential fit. A limitation of this approach is the difficulty of
deriving optimal values for the failure density and pseudo-period.
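As a rough illustration of the temporal quantities described above, the following minimal sketch computes a pseudo-period as the mean gap between consecutive failure onsets, and a failure density as the number of failures per unit time in an observation window. The function names, the window-based definition of failure density, and the sample data are assumptions made for illustration; they are not taken from [40].

```python
from statistics import mean

def pseudo_period(failure_times):
    """Average time between consecutive failure onsets."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures")
    gaps = [later - earlier
            for earlier, later in zip(failure_times, failure_times[1:])]
    return mean(gaps)

def failure_density(failure_times, window_start, window_end):
    """Failures per unit time inside [window_start, window_end).

    This window-based definition is an assumption for illustration.
    """
    count = sum(window_start <= t < window_end for t in failure_times)
    return count / (window_end - window_start)

# Hypothetical failure onset times in hours; the gaps shrink as the part degrades.
onsets = [10.0, 50.0, 80.0, 100.0, 110.0]

period = pseudo_period(onsets)                      # mean of 40, 30, 20, 10
early_density = failure_density(onsets, 0.0, 60.0)  # 2 failures in 60 h
late_density = failure_density(onsets, 60.0, 120.0) # 3 failures in 60 h
```

A rising density across successive windows, together with a shrinking gap between failures, is the kind of trend that would then be fitted with a linear or exponential model.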
Recent research on component residual life is helpful for predictive maintenance systems.
The approach focuses not on avoiding intermittent failures but on predicting when their
negative effects are no longer tolerable. A stochastic model has been proposed [41]
to predict the residual life of live components of a coherent system. A coherent system is a system
where, when a failed component is replaced by a new component, the system does not fail. The
conditional reliability of components within a working system exhibiting an increasing failure rate
has been shown to decrease with time. Also, when two coherent working systems comprising similar
components have the hazard rates sequenced, the corresponding residual lives are also stochastically
ordered.
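The statement that conditional reliability decreases with age under an increasing failure rate can be checked numerically. The sketch below assumes a Weibull lifetime with shape parameter greater than one (an increasing failure rate); the Weibull choice, the parameter values, and the function names are illustrative assumptions and are not the model of [41]. The conditional reliability is computed as R(t + x) / R(t), the probability of surviving a further interval x given survival to age t.

```python
import math

def weibull_reliability(t, shape, scale):
    """Survival probability R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def conditional_reliability(x, t, shape, scale):
    """P(T > t + x | T > t) = R(t + x) / R(t)."""
    return (weibull_reliability(t + x, shape, scale)
            / weibull_reliability(t, shape, scale))

# Hypothetical values: 100 h mission, shape 2 (increasing failure rate).
mission = 100.0
ages = [0.0, 500.0, 1000.0]
values = [conditional_reliability(mission, age, shape=2.0, scale=1000.0)
          for age in ages]
```

With these assumed parameters, the probability of completing the same 100-hour mission drops as the component gets older, which is the behavior a predictive maintenance policy would track.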
New approaches from Kleer et al. [42] offer a framework for diagnosing intermittent failures
in a continuously operating piece of machinery, where objects are transferred from one module to the
next, as in the case of a copying machine involving the transfer of paper from one site in the copier
to another. Research has shown [43, 44, 45] that, by leveraging in-situ sensors, physics-of-failure
models, and life cycle monitoring, one can predict the occurrence of failure and measure degradation
and remaining useful life. Such information could become the building blocks for developing
methods to troubleshoot intermittent failures.
3. SOFTWARE INTERMITTENT FAILURES
Software intermittent failures are generated when some conditions occur simultaneously. For
example, if the available memory and CPU processing power are both below a certain threshold due
to other applications running on a computer, a selected program can exhibit intermittent failures due
to insufficient resources.
Software intermittent failures can also occur when two or more processes (or threads)
are running simultaneously and “collide”. When this happens, the computer can end up in a
lock-up condition in which the software does not have a clear exit point, which may result in a “frozen
screen” condition showing on the computer monitor. These potential collisions may not be obvious
when the software code is being written for the many different subroutine modules used in the
computer.
An example of one such collision of processes involves a bank ATM: a customer may insert
their ATM card to open a session at the same time that branch personnel open the rear
safe door of the ATM (out of view of the customer). The resulting condition causes the computer
to “freeze up” and the screen to be stuck in one view, making the ATM non-responsive to the
customer.
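One common form of the lock-up described above is a deadlock caused by two threads acquiring the same two locks in opposite order. The following minimal sketch reproduces that situation deterministically. The thread names, the two locks, and the use of barriers and timeouts (so that the demonstration terminates instead of hanging) are assumptions made for illustration and do not describe any real ATM software.

```python
import threading

lock_a = threading.Lock()   # hypothetical resource, e.g. the customer session
lock_b = threading.Lock()   # hypothetical resource, e.g. the service door state
both_hold_first = threading.Barrier(2)
both_tried_second = threading.Barrier(2)
results = {}

def worker(name, first, second):
    with first:
        both_hold_first.wait()             # force the unlucky interleaving
        got = second.acquire(timeout=0.2)  # a real deadlock would wait forever
        results[name] = got
        both_tried_second.wait()           # keep 'first' held until both have tried
        if got:
            second.release()

def run_demo():
    t1 = threading.Thread(target=worker, args=("session", lock_a, lock_b))
    t2 = threading.Thread(target=worker, args=("service", lock_b, lock_a))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return dict(results)

outcome = run_demo()   # neither thread obtains its second lock
```

In normal operation the two threads rarely reach this point at the same moment, which is why such a defect shows up only intermittently. Acquiring the locks in one fixed global order in every thread is a standard way to remove it.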
Software may also contain bugs and exhibit intermittent failure whenever a user encounters
the buggy parts of the program. In the next sections, the causes of software intermittent behavior are
investigated, and then the methods for identification and mitigation of these failures are described.
Some recent research in this area is also briefly discussed.
3.1 Causes
Even though software intermittent failures occur in most software-based systems, the end
user may not always experience a drop in performance. The ability to perceive a failure is known as
observability of faults. The observability of software intermittent failures is affected by three factors:
processor speed, memory capacity, and processor load. A low processor speed increases the
likelihood of intermittent failures, whereas with a high processor speed, intermittent
failures may be observed less frequently. A high memory capacity reduces the observability of
software intermittent failures, whereas an increase in the processor load could increase the
occurrence of intermittent failures. To reduce the frequency of intermittent behavior, both these factors
and the fault causes of the intermittent behavior must be addressed.
Gracia et al. [46] classify the causes of software-related intermittent failures as timing
failures, errors in memory, unhandled exceptions, errors in disks, and concurrency-related failures.
Timing failures occur when process executions are delayed during processing or when the sequence
of their execution is disturbed. For example, because process executions are time-sensitive, the
timing of parallel processes running simultaneously can experience a delay if one of the processes
does not get completed within the expected time. Memory leaks and memory errors occur because of
improper memory allocation or de-allocation. This can happen when the memory footprint, which is
the amount of main memory a program uses or references, becomes very high. This may be caused
by prolonged memory usage and can result in intermittent freezes and crashes. Software failures
because of unhandled exceptions happen when an unexpected error occurs during execution and this
error is not handled by the software. For example, when the software tries to divide one by zero, an
error is generated. If this error is not handled, it could lead to an intermittent failure. Disk error
failures are software intermittent failures resulting from physical errors in the disk drives.
Concurrency-related failures occur when concurrent tasks are being executed, leading to heavy usage
of the system.
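The divide-by-zero case mentioned above can be sketched in a few lines. The function names and the latency scenario are hypothetical. The unguarded version fails only when the input happens to be zero, which is what makes the failure intermittent from the user's point of view, while the guarded version handles the exception and returns a defined result.

```python
def average_latency(total_ms, request_count):
    """Unguarded: raises ZeroDivisionError when no requests were served."""
    return total_ms / request_count

def safe_average_latency(total_ms, request_count):
    """Guarded: the exception is handled and a defined result is returned."""
    try:
        return total_ms / request_count
    except ZeroDivisionError:
        return None   # caller must treat None as "no data for this interval"

busy_interval = safe_average_latency(120.0, 4)   # normal case
idle_interval = safe_average_latency(0.0, 0)     # rare case, handled
```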
3.2 Diagnosis
In software, many different configurations are possible. It is difficult, if not impossible,
to test a product under all of these configurations, and intermittent failures can occur on configurations
that have not been fully tested.
While testing for intermittent behavior, the interaction between the hardware and software
needs to be considered, because hardware configuration can influence the frequency and length of
intermittent software failure. Syed et al. [47] observed that software testing results in a different
frequency of intermittent failures based upon the hardware configuration. For example, parameters
such as processor speed, memory, hard drive capacity, and processor load led to a variation in the
number of intermittent failures observed. Wei et al. [48] developed a test methodology to inject
faults at the hardware architecture level to understand the effect of hardware intermittent failures on
software failures. The authors discovered that different sites of the processor architecture affected the
software execution differently. They observed that the impact of a hardware fault on software will
depend upon the origination site and length of the hardware fault.
For the detection of intermittent software failures, five techniques [47] are used. The first
technique is known as deterministic replay debugging (DRB). It is the ability to replay precisely the
same set of instructions that led up to a software failure. Essentially, the engineer records all
instructions up to the point where the system crashes and then replays that recording to determine the
roots of the failure. It is used for bug detection, fault tolerance studies, and intrusion analysis [47]. It
is effective in debugging issues that arise in multithreaded and distributed applications. The second
technique is called fuzz testing (FT), also referred to as fuzzy testing. It feeds the system random, invalid, or
unexpected data and observes how the system reacts. Fuzz testing is generally used for detecting failures related to corrupted data,
memory leaks, software crashes, and failed assertions [47]. FT is also used to enhance software security.
The third commonly used technique is termed high volume test automation (HVTA). In this
approach, the software automatically generates, executes, and evaluates a large number of test cases
to detect failures. The high volume of testing, which is automatically generated, offers a higher
probability of detecting failures. HVTA techniques are generally used in detecting failures such as
buffer overruns, stack overflows, resource exhaustion, and timing-related errors. The fourth failure
detection technique is load testing, which includes tests such as stress testing (testing at the operating
condition limits until the system breaks) and volume testing (operating very large tasks). Load
testing involves a demand which is exerted on a system or device while the response is being
monitored. It assists in determining the maximum operating capacity and identifying the bottlenecks
and weak links in a system. The last technique is called disturbance testing (DT). In this case, the
normal operation of the system is disrupted by introducing physical failures, such as unplugging
the power cord. This technique is used for testing the fault tolerance and the overall quality of a
system.
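The fuzz (fuzzy) testing technique described above can be sketched as a loop that feeds random byte strings to a function and records every input that raises an exception. The parser under test, its defects, and all names below are hypothetical. Seeding the random generator is a design choice made here so that a failing campaign can be repeated exactly, which borrows the idea behind deterministic replay in a very small way.

```python
import random

def parse_record(data: bytes) -> int:
    """Toy parser under test (hypothetical): a length byte followed by a payload."""
    length = data[0]                      # IndexError on empty input
    payload = data[1:1 + length]
    return sum(payload) // len(payload)   # ZeroDivisionError on empty payload

def fuzz(target, runs, seed):
    """Call target with random inputs and collect (input, exception name) pairs."""
    rng = random.Random(seed)             # fixed seed makes the campaign repeatable
    failures = []
    for _ in range(runs):
        size = rng.randint(0, 8)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except Exception as exc:
            failures.append((data, type(exc).__name__))
    return failures

failures = fuzz(parse_record, runs=500, seed=1234)
replayed = fuzz(parse_record, runs=500, seed=1234)
```

Each recorded input can be turned into a regression test once the underlying defect is fixed.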
3.3. Mitigation
The aim of fault mitigation is to prevent unexpected outputs and control errors. Anderson et
al. [49] discussed the phases that constitute fault mitigation: error detection, damage assessment, and
error recovery. Error detection is used to identify the source of intermittent faults, while damage
assessment determines the extent of disruption and losses suffered by the system. Once the nature of
the fault is clearly identified, the next phase, error recovery, mitigates these faults. This stage
minimizes the negative effects experienced by the end user.
There are three techniques for error recovery: recovery blocks, N-version software, and
self-checking software [50]. Recovery blocks were originally developed by Randell [51] to prevent faults
in software components from affecting functionality at the system level. In this approach, results
from sequences in a software component are verified by adjudicator software. Each of the outputs of
the software component needs to pass an acceptance test by the adjudicator. N-version programming
(NVP) is also known as multi-version programming. In this method, multiple versions of
functionally equivalent software are created independently from the same original specification.
This assumes that independently developed software will have a sharply reduced probability of
containing the same software faults. Statistical techniques are employed to determine the most common response
among these multiple versions, and measures are undertaken to mitigate the divergent responses. N-version
software combines the advantages of redundancy (multiple software versions) with those of
statistical techniques [52]. Even though the NVP approach is commonly used in software developed
for electronic voting and train switching, it is not free of controversy. Critics do not
agree that independently developed software versions will reduce common errors. Self-checking
software [53] detects the occurrence of software errors, locates and identifies their causes, and stops the
propagation of errors. For self-checking software to perform successfully, the system needs to
monitor both the functional aspects of the process and the data. Functional monitoring checks for infinite
loops and incorrect loop terminations in a software program, while data monitoring checks the
integrity of defined data structures in software.
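The first two recovery techniques described above can be sketched as follows. The recovery block tries alternates in order until one passes an acceptance test, and the N-version voter returns the majority output of independently written versions. The three integer square root routines, one of them deliberately buggy, and all function names are hypothetical. This simplified recovery block is stateless, so it omits the state restoration that a full recovery block performs before trying the next alternate.

```python
import math
from collections import Counter

def recovery_block(alternates, acceptance_test, x):
    """Return the first alternate result that passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue                       # a crash counts as a failed alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def n_version_vote(versions, x):
    """Run every version and return the strict-majority output."""
    outputs = [version(x) for version in versions]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value

def isqrt_primary(n):                      # deliberately wrong on perfect squares
    r = math.isqrt(n)
    return r - 1 if r * r == n else r

def isqrt_backup(n):
    return int(math.sqrt(n))

def isqrt_search(n):
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r

def accept(n, r):
    """Acceptance test: r is the integer square root of n."""
    return r * r <= n < (r + 1) * (r + 1)

recovered = recovery_block([isqrt_primary, isqrt_backup], accept, 49)
voted = n_version_vote([isqrt_primary, isqrt_backup, isqrt_search], 49)
```

In both cases the fault in the primary routine is masked: the acceptance test rejects its answer for 49, and the voter outvotes it.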
3.4 New Technology Trends
New approaches are being developed to overcome software related intermittent failures. Data
race issues can cause many intermittent failures in software. They are non-deterministic, hard to
debug, and cause problems at runtime [54]. A data race occurs when two threads access the same
memory location without synchronization and at least one of the
accesses is a write operation. Because of the complexity involved, the C and C++ language specifications leave
the behavior of such programs undefined [55], and the Java specification for such programs is complicated
and known to be buggy [56]. Multithreaded programs are increasingly common because of
the use of multicore processors, and multithreading is prone to data race issues. One approach to
making data race detection practical was presented in 2013 by Wester et al. [57]. It is called
parallelizing data race detection. They point out that traditional data race detectors are too slow to be
used regularly. Wester et al. propose to increase the speed by spreading the detection work across
multiple cores. Their strategy builds on uniparallelism, which executes different time
intervals of a program in parallel to provide scalability, while executing all threads of a given interval on a
single core to eliminate locking.
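The unsynchronized read-modify-write pattern behind a data race, and its usual remedy, can be sketched as below. The class and function names are hypothetical. In Python the unsafe variant is a race condition rather than undefined behavior as in C or C++, and whether it actually loses updates depends on the interpreter and on scheduling, so only the locked variant has a guaranteed result.

```python
import threading

class SharedCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # Read-modify-write without synchronization: two threads can read the
        # same old value, and one of the two updates is then lost.
        current = self.value
        self.value = current + 1

    def safe_increment(self):
        # The lock makes the read-modify-write sequence mutually exclusive.
        with self._lock:
            self.value += 1

def run(increment_name, threads=4, per_thread=10000):
    counter = SharedCounter()
    increment = getattr(counter, increment_name)

    def work():
        for _ in range(per_thread):
            increment()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value

safe_total = run("safe_increment")        # always threads * per_thread
unsafe_total = run("unsafe_increment")    # may be lower on some runs
```

The fact that the unsafe total is correct on many runs and wrong on a few is exactly what makes such defects intermittent and hard to debug.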
Another emerging research area is automated software repair. Heuristic and algorithmic
approaches are leveraged for generating, evaluating, and repairing defective sites. This approach has
received attention in the fields of programming languages [58], operating systems [59], and software
engineering [60]. Automated repair is effective in solving concurrency bugs which lead to IF issues
[58]. Schulte et al. [61] presented a paper in 2013 outlining a methodology to employ automated
repair on arbitrary and non-repeatable software defects in embedded systems. This process has been
implemented on Nokia N900 smartphones. The algorithm used for localizing fault sites is based on
Gaussian convolution and stochastic sampling. It reduces memory requirements by 85% for
embedded systems. It is ten times faster and is suited for devices where direct instrumentation is not
feasible.
Sahoo et al. [62] published a paper in 2013 wherein automatic diagnostic techniques are
proposed for isolating root causes for software-related intermittent failures. Self-generated likely
program invariants are used with filtering techniques at sites close to the fault-triggering point to
select a set of candidate program locations as possible root causes. Likely program invariants are effective
tools for detecting and diagnosing software errors [63]. They are program properties that are
observed to hold in some set of successful executions, but not necessarily in all executions.
The set of candidate sites is trimmed down by dynamic backward slicing, a technique that
can pinpoint precisely which instructions affect a particular value in a single execution of a program
[64]. The list of candidates is further reduced by dependence filtering, which is based upon the
premise that if an invariant on one instruction fails, then a dependent instruction may also
exhibit an invariant failure, but the underlying cause is the first invariant and not the second.
A second filtering approach assumes that if multiple similar inputs result in the same failure
symptom, they are likely to have the same cause. This is a promising approach for the automatic
diagnosis of software root causes; however, this approach only works on deterministic detectors.
Future work is planned to include non-deterministic detectors.
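The idea of likely invariants and dependence filtering can be illustrated with a toy example. The sketch below learns a value range for each program location from passing runs, flags the locations whose values fall outside that range in a failing run, and then drops any flagged location that depends on another flagged location. The location names, the range-only invariant form, and the dependence map are hypothetical simplifications and do not reproduce the technique of [62].

```python
def infer_range_invariants(passing_runs):
    """Learn a (min, max) range per program location from passing runs."""
    invariants = {}
    for run in passing_runs:
        for site, value in run.items():
            low, high = invariants.get(site, (value, value))
            invariants[site] = (min(low, value), max(high, value))
    return invariants

def violated_sites(invariants, failing_run):
    """Locations whose value in the failing run is outside the learned range."""
    return [site for site, value in failing_run.items()
            if site in invariants
            and not (invariants[site][0] <= value <= invariants[site][1])]

def dependence_filter(candidates, depends_on):
    """Drop a candidate if it depends on another candidate (the likelier cause)."""
    flagged = set(candidates)
    return [site for site in candidates
            if not (depends_on.get(site, set()) & flagged)]

passing = [
    {"parse:len": 12, "alloc:size": 64, "copy:index": 3},
    {"parse:len": 40, "alloc:size": 64, "copy:index": 17},
    {"parse:len": 7, "alloc:size": 64, "copy:index": 0},
]
failing = {"parse:len": 300, "alloc:size": 64, "copy:index": 299}
depends_on = {"copy:index": {"parse:len"}}   # hypothetical data dependence

candidates = violated_sites(infer_range_invariants(passing), failing)
root_causes = dependence_filter(candidates, depends_on)
```

Here two invariants fail, but the out-of-range index is a consequence of the out-of-range length, so only the length check survives as a candidate root cause.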
The use of multicore processors has resulted in concurrency errors in multithreaded
programs. These errors can lead to intermittent failures arising from schedule-dependent failures.
These failures are caused by interactions between threads that were not anticipated by the program
developer [65]. An atomicity violation is another schedule-dependent failure that can cause intermittent failures.
It occurs when an access to shared state by one thread is inadvertently allowed to interleave between a
pair of accesses in another thread. A paper from the University of Washington [65] in 2013 discusses
the development of automated techniques for avoiding schedule-dependent failures such as
concurrency errors and atomicity violations. The authors established a system for collecting relevant program events at
run time. When a program fails, the information collected is analyzed to generate hypotheses about
failure causes. Leveraging the multiple instances of the deployed software in operation, a predictive
statistical model and an empirical framework have been developed to identify which hypothesis is
most likely to be correct. Corrective actions are taken by manipulating future program executions.
The emphasis of the study is not on failure detection but on failure avoidance.
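The schedule dependence of an atomicity violation can be made visible with a small deterministic simulation. In the sketch below, each simulated thread performs a check-then-act withdrawal in two steps, and an explicit schedule string decides which thread runs next. The account scenario, the generator-based scheduler, and all names are hypothetical. Real threads would interleave nondeterministically, which is why only some executions fail.

```python
def withdraw(account, amount):
    """Check-then-act in two steps; each yield is a point where another thread may run."""
    allowed = account["balance"] >= amount   # step 1: check
    yield
    if allowed:                              # step 2: act on a possibly stale check
        account["balance"] -= amount
    yield

def run_schedule(schedule, start_balance=100, amount=80):
    """Run two simulated threads, A and B, in the order given by the schedule string."""
    account = {"balance": start_balance}
    threads = {"A": withdraw(account, amount), "B": withdraw(account, amount)}
    for name in schedule:
        next(threads[name])                  # advance that thread by one step
    return account["balance"]

atomic_result = run_schedule("AABB")         # each check-act pair runs undisturbed
interleaved_result = run_schedule("ABAB")    # B's check lands between A's check and act
```

Under the first schedule the second withdrawal is correctly refused. Under the second schedule both checks pass before either update, and the balance goes negative. Steering future executions away from the second kind of schedule is the failure avoidance strategy the paragraph above describes.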
4. RECOMMENDATIONS
Intermittent failures should be treated seriously not only because of their massive cost but also
because they could be early indicators of permanent failures. For intermittent failures, it is better to
focus on failure avoidance rather than failure detection or failure mitigation. From the hardware
design perspective it is recommended that the specification of minimum spacing requirements for
circuit traces should be dependent upon the current usage. With the increase in semiconductor
scaling, preemptive design strategies need to be developed that leverage data like IVF (Intermittent
Vulnerability Factor) discussed in this paper. On the packaging side it would be valuable to develop
new materials which offer better shielding from cosmic radiation to prevent SEUs (Single Event
Upsets). Self-repairing wire bonds and self-healing solder joints may sound futuristic but they can
diminish the occurrence of intermittent failures in hardware. Since connector disconnections are a
common cause of intermittent failures, it is recommended to develop effective methodologies for
monitoring traveling waves caused by sudden connector disconnections. For some avionic systems, it is
recommended to develop an online test methodology rather than performing lab testing to increase
the probability of detecting intermittent failures.
Software intermittent failures should always be studied within the context of the hardware
being used and it is important to focus on fault causes rather than on the observability of intermittent
failures. There is a need for more detailed studies in solving system-level intermittent failures. With
the increase in multicore processor usage, it is recommended to anticipate and preempt IF problems
caused by data races in multithreaded programs. Parallelizing techniques should be
employed where possible to detect data race failures. It is recommended to use automated software
repair for solving concurrency issues, and likely program invariants are encouraged in automatic
diagnostic techniques for solving deterministic failures.
5. CONCLUSIONS
Intermittent failures are difficult to diagnose because, when they are investigated, the faults
cannot be replicated consistently. This paper takes a broad approach by describing the various
causes, diagnosis, and mitigation strategies for intermittent failures manifested at the hardware and
software levels. Some promising upcoming technologies are highlighted that might help develop
future solutions for intermittent failures. Since diagnosing intermittent failure is challenging, helpful
tables and methodologies have been presented to detect the causes of hardware and software
intermittent failures. Recommendations have been offered to help minimize the occurrence of
intermittent failures in hardware and software. The paper strives to advance the state of the art and
practice by covering a wide range of intermittent failures, in both hardware and software, while
offering an understanding of the underlying causes and proposing approaches and methodologies for
diagnosis and mitigation.
6. ACKNOWLEDGEMENTS
The authors would like to acknowledge the personnel associated with the University of
Maryland and CALCE (Center for Advanced Life Cycle Engineering) for their constant support and
assistance in developing this paper. Special appreciation and thanks are due to Diganta Das, Kelly
Smith, Mark Zimmerman, Faye Chai, Weifeng Liu and Ken Neubeck for guidance in the content,
structure and presentation of this paper.
7. REFERENCES
[1] Authoritative Dictionary of IEEE Standard Terms, 7th edition, published by Standards
Information Network IEEE Press, 2000 IEEE 100.
[2] K. Neubeck, “Practical Reliability Analysis”, (Prentice Hall, 2004).
[3] D. A. Thomas, K. Ayers, and M. Pecht, “The ‘trouble not identified’ phenomenon in
automotive electronics,” Microelectronics Reliability, vol. 42, no. 4–5, pp. 641–651,
Apr. 2002.
[4] I. James, D. Lumbard, I. Willis, and J. Goble, “Investigating no fault found in the aerospace
industry,” in Reliability and Maintainability Symposium, 2003. Annual, 2003, pp. 441 – 446.
[5] P. Söderholm, “A system view of the No Fault Found (NFF) phenomenon,” Reliability
Engineering & System Safety, vol. 92, no. 1, pp. 1–14, Jan. 2007.
[6] B. Steadman, T. Pombo, I. Madison, J. Shively, and L. Kirkland, “Reducing No Fault Found
using statistical processing and an expert system,” in AUTOTESTCON Proceedings, 2002.
IEEE, 2002, pp. 872 – 878.
[7] WDS Global white paper, “No Fault Found returns cost the mobile industry $4.5 Billion per
year”, 2006. <online>
http://www.wds.co/news/whitepapers/20060717/MediaBulletinNFF.pdf.
[8] Kimseng K., Hoit M, Pecht M, “ Physics of failure assessment of a cruise control module”
Microelectronics Reliability, 1999, 39(10):423-444.
[9] C. Maul, J. W. McBride, and J. Swingler, “Intermittency phenomena in electrical
connectors,” Components and Packaging Technologies, IEEE Transactions on, vol. 24, no. 3,
pp. 370 –377, Sep. 2001.
[10] M. Antler, “Contact fretting of electronic connectors”, IEICE Trans. Electron, Vol E82-C, #1,
1994, pp 3-12.
[11] C. Maul, J. McBride and J. Swingler, “On the nature of intermittence in electrical contacts”,
in 20th Int. Conf. Electrical Contacts, Stockholm, 2000, pp 23-28.
[12] A. Gibson, S. Choi, T. Bieler and K. Subramanian, Environmental concerns and materials
issues in manufactured solder joints, Proceedings of the 1997 IEEE International Symposium,
In Electronics and the Environment (1997) 246–251.
[13] H. A. Schafft, “Failure Analysis of Wire Bonds,” in Reliability Physics Symposium, 1973.
11th Annual, 1973, pp. 98 –104.
[14] R. E. McCullough, “Screening Techniques for Intermittent Shorts,” in Reliability Physics
Symposium, 1972. 10th Annual, 1972, pp. 19 –22.
[15] T. Koch, W. Richliug, J. Whitlock, and D. Hall, “A Bond Failure Mechanism,” in Reliability
Physics Symposium, 1986. 24th Annual, 1986, pp. 55 –60.
[16] Sorensen B. Digital averaging-the smoking gun behind No-Fault-Found, Air Safety Week,
February, 24, 2003.
[17] W.C. Maia Filho, M. Brizoux, H.Fremont, Y. Danto, “Improved Physical Understanding of
Intermittent Failure in Continuous Monitoring Method”, Proceedings of 14th IPFA, 2007,
pp.141-146.
[18] M. Reid, J. Punch, G. Grace, L. F. Garfias, and S. Belochapkine, “Corrosion Resistance of
Copper-Coated Contacts,” Journal of The Electrochemical Society, vol. 153, no. 12, p. B513,
2006.
[19] D. Minzari, M. S. Jellesen, P. Møller, and R. Ambat, “On the electrochemical migration
mechanism of tin in electronics,” Corrosion Science, vol. 53, no. 10, pp. 3366–3379,
Oct. 2011.
[20] B. Sood, M. Osterman and M. Pecht, Tin whisker analysis of Toyotas electronic throttle
control, CircuitWorld 37(3) (2011) 4–9.
[21] C. Constantinescu, “Intermittent faults and effects on reliability of integrated circuits,” in
Reliability and Maintainability Symposium, 2008. RAMS 2008. Annual, 2008, pp. 370 –374.
[22] D. T. Blaauw, C. Oh, V. Zolotov, and A. Dasgupta, “Static electromigration analysis for on-
chip signal interconnects,” Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, vol. 22, no. 1, pp. 39 – 48, Jan. 2003.
[23] S. Kothawade, K. Chakraborty, S. Roy, and Y. Han, “Analysis of intermittent timing fault
vulnerability,” Microelectronics Reliability, vol. 52, no. 7, pp. 1515–1522, Jul. 2012.
[24] S. Pan, Y. Hu, and X. Li, “IVF: Characterizing the vulnerability of microprocessor structures
to intermittent faults,” in Design, Automation Test in Europe Conf. Exhibition, 2010,
pp. 238 –243.
[25] N. Vichare and M. Pecht, Prognostics and health management of electronics IEEE
Transactions on Components and Packaging Technologies, 29(1) (2006) 222–229
[26] S. Mathew, D. Das, R. Rossenberger, and M. Pecht, “Failure mechanisms based prognostics,”
in Prognostics and Health Management, 2008. PHM 2008. International Conference, 2008,
pp. 1 –6.
[27] L. V. Kirkland, “When should intermittent failure detection routines be part of the legacy re-
host TPS?” in AUTOTESTCON, 2011 IEEE, 2011, pp. 54 –59.
[28] H. Qi, S. Ganesan, and M. Pecht, “No-fault-found and intermittent failures in electronic
products,” Microelectronics Reliability, vol. 48, no. 5, pp. 663–674, May 2008.
[29] Bryan Steadman, Floyd Berghout, Nathan Olsen, “Intermittent Fault Detection and Isolation
System”, IEEE AUTOTESTCON, 2008.
[30] M. Pecht, Prognostics and health monitoring of electronics, John Wiley & Sons, Ltd, 2008.
[31] J. Savir, “Detection of Intermittent Faults in Sequential Circuits” Stanford University, Rep.
TR-120, 1978.
[32] L. Kirkland, “When should intermittent failure detection routines be part of the Legacy
Re-Host TPS”, IEEE, Autotestcon, 2011, pp 54-59.
[33] R. Vattikonda, W. Wang and Y. Cao, “Modeling and minimization of PMOS NBTI effect for
robust nanometer design”, in proceedings of the Design Automation Conference, DAC 2006.
[34] M. Demertzi, B. Zandian, R. Rojas and M. Annavaram, “Benchmarking ISA Reliability to
Intermittent Failures”, IEEE International Symposium on Workload Characterization (IISWC),
2012, pp. 86-87.
[35] S. Hannel, S. Fouvry, P. Kapsa and L. Vincent “The fretting sliding transition as a criterion
for electrical contact performance” WEAR, Vol 49, 2001, pp 761-770.
[36] A.Ginart, I. Ali, J. Goldwin, P. Kalgren, M. Roemer, E. Balaban and J. Celaya “Sensing and
characterization of EMI during Intermittent Connector Anomalies” Aerospace Conference,
IEEE, March 3-10, 2012, pp 1-7.
[37] R. Rao, K. Chopra, D. Blaauw and D. Sylvester, “An efficient static algorithm for computing
the soft error rates of combinatorial circuits,” in Proceedings of Design, Automation and Test
in Europe, Vol. 1, March 2006, pp1-6.
[38] N. Kehl and W. Rosenstiel, “An efficient SER estimation method for Combinatorial
Circuits”, IEEE Transactions on Reliability, vol 60, number 4, 2011, pp 742-747.
[39] S. Pan, Y. Hu and X. Li, “IVF: Characterizing the Vulnerability of Microprocessor Structures
to Intermittent Faults”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
Vol. 20, number 5, 2012, pp 777-790.
[40] A. Correcher, E. Garcia, F. Morant, E. Quiles and L. Rodriguez, “Intermittent Failure
Dynamics Characterization”, IEEE Transactions on Reliability, Vol 61, Number 3,
pp 649-658, Sep. 2012.
[41] N. Balakrishnan and M. Asadi, “A proposed measure of Residual Life of Live Components
of a Coherent System”, IEEE Trans. Rel. Vol. 61, #1, pp 41-49.
[42] J. Kleer, B. Price, L.Kuhn, M. Doh, R. Zhou, “A framework for continuously estimating
persistent and intermittent failure probabilities”, Palo Alto Research Center Publications,
2008.
[43] J. Xie and M. Pecht, Applications of in-situ health monitoring and prognostic sensors, The
9th Pan Pacific microelectronics Symposium, Exhibits and Conference (2004) 10–12.
[44] S. Mathew, D. Das, M. Osterman, M. Pecht and N. Ferebee, Prognostic assessment of
aluminum support structure on printed circuit boards, ASME Journal of Electronic Packaging
128(4) (2006), 339–345.
[45] V. Shetty, D. Das, M. Pecht, D. Hiemstra and S. Martin, Remaining life assessment of shuttle
remote manipulator system end effector, Proceedings of the 22nd Space Simulation
Conference (2002), 21–23.
[46] J. Gracia, L. Saiz, J. C. Baraza, D. Gil, and P. Gil, “Analysis of the influence of intermittent
faults in a microcontroller,” in Design and Diagnostics of Electronic Circuits and Systems,
2008. DDECS 2008. 11th IEEE Workshop on, 2008, pp. 1 –6.
[47] R. A. Syed, B. Robinson, and L. Williams, “Does Hardware Configuration and Processor
Load Impact Software Fault Observability?” in Software Testing, Verification and Validation
(ICST), 2010 Third International Conference on, 2010, pp. 285 –294.
[48] J. Wei, L. Rashid, K. Pattabiraman, and S. Gopalakrishnan, “Comparing the effects of
intermittent and transient hardware faults on programs,” in Dependable Systems and
Networks Workshops (DSN-W), 2011 IEEE/IFIP 41st International Conference on, 2011,
pp. 53 –58.
[49] T. Anderson and J. C. Knight, “A Framework for Software Fault Tolerance in Real-Time
Systems,” IEEE Transactions on Software Engineering, vol. SE-9, no. 3, pp. 355 – 364,
May 1983.
[50] M. R. Lyu, Software Fault Tolerance. New York, NY, USA: John Wiley & Sons, Inc.,
1995.
[51] B. Randell, “System structure for software fault tolerance,” in Proceedings of the
international conference on Reliable software, New York, NY, USA, 1975, pp. 437–449.
[52] A. Avizienis, “The N-Version Approach to Fault-Tolerant Software,” IEEE Transactions on
Software Engineering, vol. SE-11, no. 12, pp. 1491 – 1501, Dec. 1985.
[53] Ronitt A. Rubinfeld, A mathematical theory of self-checking, self-testing and self-correcting
programs, University of California at Berkeley, Berkeley, CA, 1991.
[54] N. Leveson and C. Turner, “An investigation of the Therac-25 accidents”, IEEE Computer,
26(7): 18-41, July 1993.
[55] H. Boehm and S. Adve, “Foundations of the C++ concurrency memory model”, In Proc.
2008 ACM Conference on Programming Language Design and Implementation, pp. 69-78.
[56] J. Seveik and D. Aspinall, “On validity of Program Transformations in the Java memory
Model”, in Proc. 2008 European Conference on Object-Oriented Programming. Pp 27-51.
[57] B. Wester, D. Devecsery, P. Chen, J. Flinn and S. Narayanasamy, “Parallelizing Data Race
Detection”, In APLOS 2013, Houston Texas, March 16-20, 2013.
[58] G. Jin, L. Song, W. Zhang, S. Lu and B. Liblit, “Automated atomicity violation fixing”, In
Programming Language Design and Implementation, 2011, pp. 389-400.
[59] J. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, and M. Carbin, “Automatically
patching errors in deployed software”, In Symposium on Operating Systems Principles, 2009,
pp. 87-102.
[60] Y. Wei, Y. Pei, C. Furia, L. Silva, S. Buchholz, B. Meyer and A. Zeller, “Automated fixing
of programs with contracts”, in International Symposium on Software Testing and Analysis,
2010, pp. 61-72.
[61] E. Schulte, J. DiLorenzo, W. Weimer, S. Forrest, “Automated repair of binary and assembly
programs for cooperating embedded devices”, In ASPLOS 2013, Houston, Texas, March
16-20, 2013.
[62] S. Sahoo, J. Criswell, C. Geigle and V. Adve, “Using Likely Invariants for Automated
Software Fault Localization”, In ASPLOS 2013, Houston, Texas, March 16-20, 2013.
[63] M. Ernst, J. Cockrell, W. Griswold, and D. Notkin, “Dynamically discovering likely program
invariants to support program evolution”, IEEE Trans. Software Eng., 2001.
[64] X. Zhang, R. Gupta and Y. Zhang, “Precise dynamic slicing algorithms”, In Proceedings of
the 25th International Conference on Software Engineering, 2003.
[65] B. Lucia and L. Ceze, “Cooperative Empirical Failure Avoidance for Multithreaded Programs”,
In ASPLOS 2013, Houston, Texas, March 16-20, 2013.
[66] V. Yuvaraj and T. Vasanth, “Simulation, Control and Analysis of HTS Resistive and Power
Electronic FCL for Fault Current Limitation and Voltage Sag Mitigation in Electrical
Network”, International Journal of Electrical Engineering & Technology (IJEET), Volume 4,
Issue 3, 2013, pp. 82 - 94, ISSN Print : 0976-6545, ISSN Online: 0976-6553.

International Journal of Electrical Engineering and Technology (IJEET), ISSN 0976 – 6545(Print), ISSN 0976 – 6553(Online) Volume 5, Issue 5, May (2014), pp. 57-73 © IAEME
INTERMITTENT FAILURES IN HARDWARE AND SOFTWARE
Dr. Michael Pecht, Anwar Mohammed
CALCE Electronic Products and Systems Center, University of Maryland, College Park, MD 20742, USA
Flextronics, 847 Gibraltar Drive, Milpitas, CA 95035, USA
ABSTRACT
Intermittent failures are a major concern in electronic systems because they are unpredictable and non-repeatable. They can be very expensive for companies, damage the reputation of a company, or cause catastrophic damage in safety-critical systems such as nuclear plants. This paper discusses, at both the hardware and software level, the causes of intermittent failures and the methodology to diagnose the causes. Mitigation strategies to help reduce the occurrence of these failures are discussed, and new, emerging technologies designed to minimize intermittent failures are also reviewed. The paper concludes with recommendations designed to minimize the occurrence of intermittent failures.
1. INTRODUCTION
Intermittent failures are sporadic failures that are not easily repeatable. According to IEEE, intermittent failure (IF) can be defined as the failure of an item for a limited period of time, following which the item recovers its ability to perform its required function without being subjected to any external corrective action [1]. When a product can no longer perform its designed function over the intended time frame, it is considered to have failed. When the product manifests a loss of some of its function or performance characteristics for a limited time, but shows subsequent recovery, it has experienced intermittent failure. Intermittent failures are hard to replicate because of their erratic behavioral pattern.
Intermittent failures are often called “ghost failures” because they come and go and are hard to reproduce on the bench [2]. Therefore, it is more difficult to conduct failure analysis for intermittent failures, understand their root causes, and isolate their failure sites than it is for permanent failures. An intermittent failure is not necessarily repeatable; however, it often is [3]. An intermittent failure may lead to permanent failures in later stages of the life cycle.
During the inspection process in manufacturing, intermittent failures may be reported as rejected parts with no failure found (NFF). This means that a failure was observed in the system, but when the device was retested, a failure mode could not be identified or the failure could not be duplicated. This is also known as trouble not identified (TNI), no trouble found (NTF), cannot duplicate (CND), or retest OK (RTOK) [3]. These failures are hard to identify or replicate, even though they are recurrent.
Many different factors can cause intermittent failures, such as process variations (for example, a change in the humidity level), manufacturing residuals such as solder fluxes and epoxy bleed outs, radiation, vibration, wear out leading to opens, and voltage and temperature fluctuations [3]. Such transient causes, seen in both hardware and software, are hard to reproduce and can lead to negative consequences such as mission aborts and flight and train delays or cancellations. They can increase system downtime and decrease system availability. A reduction in IFs will increase system availability more than a reduction in failure rate [4]. An intermittent failure can lead to unintended consequences such as increased operating cost, higher downtime, and a perception of lower quality, especially in sensitive industries such as aerospace. A system that has failed previous testing and then suddenly starts passing, showing no signs of failure, can erode trust in the testing methodology [5] and can cause an IF to be identified as a false alarm even though a real failure exists in the system.
Intermittent failures inflict a heavy toll on companies.
During retesting, when a failed part cannot be validated as a failed part, extra testing must be conducted to identify the failure. These extra tests impose additional costs. Because IFs cannot be replicated consistently, their retest and repair costs are higher than those for permanent failures: an effective repair cannot be made until the failure is validated. Maintenance can consume time and labor in unsuccessful attempts to identify a failure, sometimes resulting in blind replacement of parts that are suspected of having a defect (without finding any specific problem), which increases the cost of inventory. For example, in 2001, fighter plane customers spent $10 million to replace parts that were tested as intermittent failures at the shop level [6]. In another case, in the 1980s, the thick film integrated (TFI) ignition modules of an automotive company were afflicted by intermittent failures, leading to a lawsuit settlement by the company [3]. A study carried out in 2005 found that IFs account for about 63% of the mobile phones returned to the manufacturer, costing the industry $4.5 billion per year [7]. Kimseng et al. [8] carried out a study on intermittent failures in the digital electronic cruise control modules made by a manufacturer for various automobiles and found that 96% of the modules returned to the manufacturer passed the manufacturer's bench tests. Kimseng et al. concluded that the bench tests were not representative of the actual automotive environment, nor was the testing appropriate to assess the original failure.
A holistic approach is helpful to understand and eliminate intermittent failures. This approach would include better diagnostic capability and efficient mitigation techniques. Therefore, this paper discusses both hardware and software intermittent failures, including their causes, diagnosis, and mitigation methodologies. Emerging developments in this technology space are also reviewed to help formulate better solutions.
2. HARDWARE INTERMITTENT FAILURES
Tentative or temporary hardware malfunctions can cause intermittent failure in electronic devices. This section describes common hardware components that experience intermittent failures and their failure mechanisms. The diagnosis and mitigation of hardware intermittent failures are also examined, and some recent technologies designed to overcome these problems are covered.
2.1 Causes of Failure
Unlike permanent failures with persistent causes, the cause of an intermittent failure may no longer exist during testing because of changes in the working environment. Hardware intermittent failures can have different root causes, such as mismatched thermal expansion, vibration, corrosion, and electromigration. In this section, some key intermittent failure causes for hardware components are investigated.
2.1.1 Wire Bond and Connectors Failures
Wire bonds and connectors cause a high percentage of hardware intermittent failures [9]. Some common causes include coefficient of thermal expansion (CTE) mismatch, component wear out caused by age or repeated usage, and corrosion. For example, the CTE mismatch between the wire bonds and the copper bonding pads on a PCB can cause intermittent opens and shorts during temperature excursions. In another example, the contact resistance of a new, tin-plated contact may be a few milliohms, but after a thousand contact cycles, the resistance can become as high as several ohms. With more usage, intermittent failures that disappear in the next contact cycle may also occur [9]. The thermal and mechanical vibrations in the connectors can lead to fretting corrosion, causing the contact resistance to increase, thus inducing intermittent connection failures [10, 11]. It has been identified [3] that loose PCB interconnectors and aging connectors and components are some of the common causes for electronic systems failure. Gibson et al. [12] concluded that over 50% of all electronic failures are triggered by interconnector related problems. Other common causes are vibration, stress relaxation, and the movement of the wiring harness generated by the magnetic field [9].
The following paragraphs describe some of these failures in more detail.
Wire bond related intermittent failure occurs when a poorly connected wire bond temporarily dislodges because of thermal expansion at temperatures above room temperature. The wire bond may then return to its normal state once the thermal stress caused by CTE mismatch is removed. The failure mode in such cases is usually an open circuit. On the other hand, a loose conductive material floating on the package may connect with a wire bond on another part of the circuit, resulting in a short circuit. When this floating piece moves away from the failure site, because of vibration for example, the failure is no longer observed [13]. Loose materials can be detected by using appropriate screening methods, including X-ray, vibration, and acoustic testing. Screening and testing methodologies are designed based on the potential causes and effects of the short circuit on the component performance [14].
Intermittent wire bond failures may also be induced by the molding process, which can damage wire bonds. This damage is not easily detectable and is attributed to the weakening and lifting of the gold bond during the molding process on the side of the package opposite to where the injection molding occurred [15]. Proper molding process control parameters and effective detection techniques would minimize such intermittent failures.
In a study of military aircraft, Sorensen [16] noted that 50% of all failures were intermittent failures and that 80% of those were related to solder joints and connector pins. For the aircraft industry, aging devices will lead to IFs, quite often as a prelude to permanent failures. Many IFs are the result of the gradual degradation of a component or system. They may initially appear as small noise fluctuations but could lead to permanent failures. Filho et al. [17] point out that for continuous monitoring methods, intermittent failures can appear long before open circuits are detected.
Corrosion can cause electrical degradation of the contact, which is initiated by a galvanic reaction between two metals within the electrical circuit. Corrosion on electronic parts can result in either of two scenarios: short circuits or an increase in the electrical resistance of the components. When corrosion occurs, it is rarely uniform on the affected surfaces, which may result in the appearance of an intermittent failure. With respect to the contacts of electronic parts, intermittent failures occur because of frequent connections and disconnections, as seen in the corrosion of copper connectors that have layers of nickel and gold to protect against wear out. In harsh environments (high relative humidity and the presence of H2S), formation of the corrosion product Cu2S causes intermittent failure behaviors [18]. With vibration and temperature fluctuations, this conductive path can be connected and disconnected, resulting in intermittent failures. Intermittent failures due to corrosion generally occur in the early stages (the first 50%) of the product life cycle. Intermittent behavior aggravated by CTE mismatch or vibration generally appears during the later stages (the last 50%) of the life cycle of a product [19].
For example, electrochemical migration, which occurs between anodes and cathodes (and can be a reason behind IF reports), is a corrosion-related failure mechanism that forms dendrites between conductors at opposite bias and eventually results in short circuits. The driving forces for this corrosion process are the voltage bias, contaminated surfaces (lack of environmental control), and the fact that the metals that are commonly used (Sn, Pb, Cu, and Ag) are susceptible to corrosion. Since this process is not time induced, the intermittent failures are manifested early in the product life cycle.
Tin whiskering has been identified [20] as another common cause for intermittent failures. A PCB with a pure tin finish, having non-compressive internal stress, is known to create tin dendrites that can cause short failures. However, at elevated temperatures the dendrites may melt away and repair the short.
2.1.2 Digital Integrated Circuit Failures
Integrated chip devices are being scaled down rapidly. This reduction in size makes digital integrated circuits more susceptible to permanent and intermittent behavior.
Intermittent failure modes in digital logic integrated circuits (ICs) have been categorized as timing violations, stuck-at-zero or stuck-at-one failures, intermittent shorts or opens, and electro-migration failures [21]. An increase in the resistance of interconnects due to thermal or mechanical loads, electromigration, or material diffusion increases the time for signal propagation and leads to a timing violation [22]. These failures are manifested because of thermal and electrical loads and signal frequency variations. Kothawade et al. [23] found that timing violations in a processor can be attributed to multiple factors such as process variations, negative bias temperature instability (NBTI), temperature fluctuations, hot carrier injection (HCI), and voltage fluctuations. Since timing violations can be caused by many factors, it is challenging for processor designers to design fault tolerance mechanisms. Time dependent HCI failures are generally permanent in nature. NBTI failures caused by AC stress tend to be intermittent failures, whereas failures caused by static stress usually manifest as permanent failures.
Within an integrated circuit, the thin oxide layers separating the adjacent metal traces can also lead to intermittent shorting or opens caused by traces coming in contact with each other or losing contact. Constantinescu [21] also studied the causes of intermittent behavior in ICs. The study attributed oxide layer breakdown to voltage fluctuations across ICs. As ICs have become smaller, the thickness of the oxide layers has decreased. This leads to an increased risk of oxide layer breakdown. When this oxide layer breaks down, it creates a conducting path, thereby increasing the leakage current. The introduction of high-k dielectrics reduces the rate of oxide breakdown, enabling the use of thinner dielectrics. However, this can also lead to timing violation failures.
Before dielectric breakdown becomes complete and leads to a permanent failure, there is a stage known as dielectric soft breakdown. During this stage a device may exhibit intermittent failures.
Intermittent stuck-at-zero or stuck-at-one failures occur in storage elements. Digital circuits have two states, 0 or 1, and a fault occurs when a particular signal is tied to either 1 or 0. This produces a logical error. Pan et al. [24] developed a metric for stuck-at-zero/stuck-at-one to characterize the vulnerability of a microprocessor to intermittent failures based on its structure. Experimental results show that the susceptibility varies significantly across different structures, and the vulnerability of the reorder buffer is much higher than that of the register file. These storage element intermittent failures have an active time and an inactive time. The active time is the time during which the failure is in process and causes unexpected behavior, while the inactive time is the time when the failure does not affect performance. The length of this active time determines how significantly the failure affects the performance of a microprocessor.
ICs are susceptible to intermittent failures due to electro-migration. Electro-migration is the movement of metal atoms when electrons flow through those atoms. This movement of atoms can lead to an open or short circuit failure. In both cases the failures appear initially as intermittent failures and end up as permanent failures. As IC chip technology becomes smaller, the wire widths are reduced. When current flow is not scaled down proportionally, the ICs become vulnerable to electro-migration [24].
2.1.3 Component Connection Failures
Another area of concern for intermittent failures is component pins, whether on a multi-pin IC, a resistor network, or a simple two-lead capacitor. Intermittent failures can be caused by imperfections in the solder process or a fractured lead where the two broken ends are intermittently making and breaking connections. Once the pin is broken, the failure may show up during thermal cycling or vibration testing. Resolution for these types of failure includes better methods of attaching larger components, like a resistor network or large capacitor, to the circuit card. Studies [25] have shown that intrinsic flaws in design and substandard manufacturing processes like soldering play a big role in creating intermittent failures.
2.2 Diagnosis
The Failure Modes, Mechanisms, and Effects Analysis (FMMEA) can be used to detect intermittent and permanent failures in hardware. Mathew et al. [26] have proposed the following methodology.
Figure 1: FMMEA Methodology [26]
The first two steps identified in Figure 1 are ‘define system and identify elements and functions to be analyzed’ and ‘identify potential failure modes.’ They are more challenging for intermittent failures than for permanent failures. This is because, in the case of intermittent failures, it is difficult to define which system has the failure in a complex system consisting of several subsystems intermeshed together. A failure in one of the subsystems could affect another subsystem and result in its failure. Finding the subsystem with the initial failure is challenging, since intermittent failures are not always detected when the system is tested for faults. Identifying the correct failure modes is also not easy because of the erratic nature of intermittent failures; this requires extra work.
Kirkland [27] describes a variety of methods to detect failure modes for intermittent failures in electronic devices, including signal looping, pattern looping, signal stepping, frequency deviation, pattern adjustment in critical areas, signal strength variation, current path duplication, measuring capacitance variations, Vcc adjustments, resistive or impedance rebounce, temperature change application, and noise dissimilarity testing. Using these methods can help identify failure modes such as increased gate delays, degraded signals, increased leakage, and high frequency failures. A minimum set of conditions (such as voltage drop threshold and temperature variations) needs to be present to make the failure mode observable.
Another systematic approach for analyzing intermittent failures is employing a cause and effect diagram, which is also known as the fishbone diagram. An example is depicted in the Ishikawa fishbone diagram in Figure 2 [28].
Figure 2: Fishbone diagram for intermittent failures in hardware and software [28]
A cause and effect diagram defines the key failure (also known as key effect), investigates the possible causes of each of the effects, and offers a list of all the possible causes leading to the failure. It is an effective method for analyzing failures in complex systems. For example, intermittent failures in plastic ball grid array packages were analyzed using this method, which narrowed down the possible causes of failure and finally identified solder joint failure as the main cause of intermittent failure [28].
Steadman et al. [29] developed a test methodology for intermittent faults in aircraft. This method subjected an avionics system to thermal and vibrational loads while simultaneously monitoring the system for faulty components, thus reducing the occurrence of intermittent failures. An improved approach should include online monitoring of critical avionics components while the system is in operation. This would reduce the overhead cost incurred by offline monitoring that uses load profiles that do not accurately replicate the operating conditions. The monitoring of current to detect intermittent failure has been recommended [30] because normal circuits would carry a significantly different current load when compared to damaged circuits.
In 1978, Savir [31] presented a paper on developing a model to detect intermittent failures in a sequential circuit, which is a type of circuit with memory logic and is found in most digital systems. He recommends leveraging both deterministic (non-random) and random test procedures to optimize the probability of IF detection. The intermittent failures are divided into two major categories: stationary failures (such as loose connections) and transient failures (such as failures induced by electro-magnetic interference).
In sequential circuits, the first manifestation of an active fault may induce the circuit to enter an incorrect state without producing an immediate output error. This state change may generate an output error later when the fault has become inactive. The optimal value of detection probability is obtained by developing a graph of all the input sequences and determining which sequences lead to intermittent failures.
To detect intermittent failures, a minimum set of conditional requirements is necessary to manifest the failure [32]. The challenge is in determining the environmental conditions when the failure occurred and re-creating them. Harsh ambient conditions, such as high humidity and the presence of halides, can initiate unintended conductive pathways on insulating surfaces. Such a pathway could eventually become a permanent failure, but it could manifest itself in the earlier stages as intermittent failure. Figure 3 offers a brief list of potential causes for hardware related intermittent failures.
Figure 3: List of Causes for Intermittent Failure
Component shifting during solder reflow; contamination (including oxidation at test sites); chemical degradation (including creep corrosion, fretting, whiskers, electron migration, etc.); cracked substrates; damaged circuits; ESD induced; floating leads (or other conductive pieces); ionizing radiation in semiconductors; insulation oxide layer breakdown; irregular or altered current path; loose connections (wire bonds, connectors, etc.); magnetic field variations; materials degradation (aging, chemical, stress, etc.); overstress (for example, high voltage on a capacitor dielectric); partial delamination; PCB (warpage, via cracking, black pad, etc.); poor wire bonding (on high-k dielectric, etc.); temperature sensitivity (CTE mismatch, etc.); vibration induced; voltage overstress; weak solder joints (varying with temperature/stress); weak structural integrity; wire sweep during molding.
2.3 Mitigation
Integrated circuits try to compensate for breakdowns by having failure tolerance built into them. Failure tolerance masks the occurrence of failures from the end user (it prevents end users from experiencing performance drops). For example, most processors choose a maximum clock rate after having guard-banded against unpredictable interactions and variations in the actual clock rate. ICs also have chip-level failure tolerance, such as error correcting codes, self-checking circuits, and hardware-implemented checkpointing and retries [22].
Three main methodologies to mitigate intermittent behavior in ICs are dynamic instruction delaying, core frequency scaling, and thread migration. When the processor takes more than the expected time to execute a process, time delay and timing violation occur. This fault may be avoided by using techniques such as dynamic instruction delaying. This is a type of algorithm that calculates the scheduling priorities during the execution of the system. The objective is to respond dynamically to the changing conditions and form a self-sustained, optimized configuration. Another approach to mitigating delay is core frequency scaling, which scales down the performance of the CPU to a lower frequency when less is needed and scales it up to a higher frequency when more is needed. Thread migration is another technique used to overcome intermittent failure. A thread is an ordered set of instructions that tells a computer exactly what to do. When a specific thread encounters failures, the content of the thread within the faulty computer core is transferred to another thread within an idle core, where the problem is addressed and solved.
The intermittent failures in some avionic systems can be caused by failures in solder joints and multi-layer ribbon cables [29]. These failures may be initiated by variations in operating conditions, such as temperature or current, and may disappear due to re-melting of the solder, closing of the crack, or filling of the void due to thermal fluctuations. Development of robust soldering processes that include appropriate material selection would mitigate soldering related intermittent failures.
The plethora of solder choices, which include leaded solder, lead free solder, low temperature solder, low silver solder, and soft solder, makes it even more critical to develop appropriate processes for solder attach and solder reflow. Since there is no known, effective method to mitigate solder joint and multi-layer ribbon cable failures, more research on improving the robustness and consistency of solder joints is necessary, and self-repairing wire bonds should also be developed.
2.4 New Technology Trends
Recent technological developments to solve hardware intermittent failures offer insight into future solutions. The industry is addressing the IF problem by developing innovative approaches. The focus is also shifting from failure detection to failure avoidance. Intermittent failures on a silicon chip, such as Time Dependent Dielectric Breakdown (TDDB) and Electromigration (EM), are caused by gate wear out because of extensive usage. Gate usage can be monitored in the form of gate toggles [33]. Researchers [34] discovered that the vulnerability to intermittent failure could be monitored by tracking the number of gate toggles. They studied four OpenSPARC RTL modules and tracked how each instruction moved through these four modules while toggling different gates. The four modules studied were the IFU, EXU, FFU, and LSU modules. They discovered that certain sub modules, such as the exu-alu within the EXU module and the lsu-dcdp within the load store unit, display a relatively high amount of toggling regardless of the type of instruction being executed. This revealed that there could be groups of modules and sub modules with higher susceptibility to wear out failures, resulting in intermittent failures. Higher vulnerability by itself cannot be a good predictor of failure rate, but when combined with operating conditions such as temperature, the degradation of a gate structure can be forecast.
Preemptive steps could also be taken during the design stage to avoid the occurrence of such intermittent failures.
The intermittent loss of connection between connectors is a very common failure in electrical systems [35]. In spite of extra caution during connector installation, this remains a problem in avionics and military equipment. In 2012 an approach was suggested [36] to create an online methodology to detect intermittent failures caused by intermittent connections. The idea is premised on the principle derived from the Lorentz Law that any sudden flux change should create a large voltage manifesting as an arc, which would propagate along the circuitry as a traveling wave. The arc is defined as the electrical discharge initiated by improper cable connections. Intermittent failures caused by loose connector connections can be detected by monitoring for the presence of this arc. Their research describes the online monitoring methodology to detect the presence of this arc to flag any connector disconnection failures.
Advances in semiconductor scaling technology have revealed that there is now greater exposure and vulnerability not only to single event upsets (SEUs) in integrated memories but also to single-event transients (SETs) in high speed logic [37]. SEUs are induced by environmental causes such as cosmic radiation or alpha particle radiation. They initiate current pulses at random times and locations in a digital circuit. SETs are caused by transient charge displacements which generate logic errors in subsequent circuits. Both SEUs and SETs are responsible for creating intermittent failures. This is a problem which is getting worse because of industry demand for semiconductor scaling. An estimation methodology to monitor the SEUs and SETs in combinatorial circuits using CMOS technology has been proposed [38]. The sources of alpha particle contamination are some packaging materials, such as the filler materials deployed in molding compound, and the presence of lead in non-lead-free solders. SEU problems initiated by alpha particles have been essentially solved by the industry, but cosmic rays still pose significant SEU problems [28].
A paper published in 2012 by Pan et al. [39] strives to address the CMOS technology scaling problem from a different perspective.
The paper proposes the quantitative characterization of the vulnerability of the microprocessor structure to intermittent failures. This is called the intermittent vulnerability factor (IVF), and it is the probability that an intermittent fault in the microprocessor structure will manifest as an external visible failure. Their research revealed that it is the intermittent stuck at one fault model which has the most serious impact on program execution. The IVF factor is calculated after listing the causes of the intermittent failures, classifying them into different fault models and setting parameters to determine when the intermittent fault will result in a visible error. This information is used to develop IVF computational algorithms for different intermittent fault models within a processor. The IVF data could now be used to improve the microprocessor quality, reliability, and durability (QRD) by proper interventions during the design stage. The IVF could also be used for intermittent fault detection and error recovery. Correcher et al. in their paper [40] published in 2012 introduce the concept of modeling intermittent failure dynamics. They propose two methodologies for characterizing the dynamics: the probabilistic model and the temporal model. The probabilistic model allows the computing of intermittent failure probability at any time; however, it needs historical data which may not always be available. The temporal model is more practical, and it offers the measurement of failure density. Research shows that the duration and frequency of intermittent failures increase with time, and the failure density and pseudo-period can help us in predicting it. The pseudo-period is the average time difference between failures, which is normalized by the number of failures. It is related to MTBF (mean time between failures) and used to model the reliability of repairable systems. 
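The two temporal quantities described above can be illustrated with a minimal Python sketch. It follows the description given here (the pseudo-period as the average time difference between consecutive failures) and assumes, since the text does not define it, that failure density means the number of failures observed per unit time within a window. The function names and the timestamp data are hypothetical and are not taken from [40].

```python
from statistics import mean
from typing import Sequence


def pseudo_period(failure_times: Sequence[float]) -> float:
    """Average time difference between consecutive failure timestamps."""
    if len(failure_times) < 2:
        raise ValueError("at least two failures are needed")
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return mean(gaps)


def failure_density(failure_times: Sequence[float],
                    window_start: float,
                    window_end: float) -> float:
    """Failures per unit time inside [window_start, window_end).

    Assumption: this counting definition is illustrative only.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")
    count = sum(1 for t in failure_times if window_start <= t < window_end)
    return count / (window_end - window_start)
```

Computing both quantities over successive observation windows gives a simple trend: a shrinking pseudo-period together with a rising failure density would match the reported tendency of intermittent failures to become more frequent with time.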
The pseudo-period can be used to predict the number of operations before replacement and to decide whether the model should follow a linear or an exponential fit. A limitation of this approach is the difficulty of deriving optimal values for the failure density and pseudo-period.
Recent research on component residual life is helpful for predictive maintenance systems. The approach focuses not on avoiding intermittent failures but on predicting when their negative effects are no longer tolerable. A stochastic model has been proposed [41] to predict the residual life of live components of a coherent system. A coherent system is a system where, when a failed component is replaced by a new component, the system does not fail. The conditional reliability of components within a working system exhibiting an increasing failure rate has been shown to decrease with time. Also, when two coherent working systems comprising similar
components have ordered hazard rates, the corresponding residual lives are also stochastically ordered.
New approaches from Kleer et al. [42] offer a framework for diagnosing intermittent failures in continuously operating machinery in which objects are transferred from one module to the next, as in a copying machine that moves paper from one site in the copier to another. Research has shown [43, 44, 45] that by leveraging in-situ sensors, physics-of-failure models, and life cycle monitoring, one can predict the occurrence of failure and measure degradation and remaining useful life. Such information could become a building block for developing ways to troubleshoot intermittent failures.
3. SOFTWARE INTERMITTENT FAILURES
Software intermittent failures are generated when certain conditions occur simultaneously. For example, if the available memory and CPU processing power both fall below a certain threshold because other applications are running on the computer, a given program can exhibit intermittent failures due to insufficient resources. Software intermittent failures can also occur when two or more processes (called threads) are running simultaneously and "collide". When this happens, the computer can end up in a lock-up condition in which the software does not have a clear exit point, and the result may be a "frozen screen" on the computer monitor. These potential collisions may not be obvious when the code is being written for the many different subroutine modules used in the computer.
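One common form of such a lock-up is a deadlock, where two threads each hold one lock and wait for the other. The Python sketch below is an illustration under stated assumptions, not a method taken from the paper: the two locks, the `worker` function, and the ordering rule (sort locks by `id()`) are all hypothetical. It shows two threads that name the same pair of locks in opposite order, and avoids the lock-up by always acquiring the locks in one global order.

```python
import threading


def in_global_order(*locks):
    """Sort locks into one fixed order (here by id()) so every thread acquires them alike."""
    return sorted(locks, key=id)


def worker(first, second, log, name, rounds=1000):
    """Repeatedly take both locks, then record one unit of work."""
    for _ in range(rounds):
        outer, inner = in_global_order(first, second)
        with outer:
            with inner:
                log.append(name)


def run_demo():
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    log = []
    # The threads name the locks in opposite order. Without in_global_order()
    # this is the classic recipe for an intermittent lock-up.
    t1 = threading.Thread(target=worker, args=(lock_a, lock_b, log, "module-1"), daemon=True)
    t2 = threading.Thread(target=worker, args=(lock_b, lock_a, log, "module-2"), daemon=True)
    t1.start()
    t2.start()
    t1.join(timeout=10)
    t2.join(timeout=10)
    finished = not t1.is_alive() and not t2.is_alive()
    return finished, len(log)


finished, completed = run_demo()
print(finished, completed)
```

If each worker instead acquired `first` and then `second` directly, the program would usually run correctly and only occasionally freeze, which is what makes this class of fault hard to reproduce.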
An example of one such collision of processes involves a bank ATM, where a customer may insert an ATM card to open a session at the same time as branch personnel open the rear safe door of the ATM (out of the customer's view). The resulting condition causes the computer to "freeze up" and the screen to be stuck in one view, making the ATM unresponsive to the customer. Software may also contain bugs and exhibit intermittent failure whenever a user encounters the buggy parts of the program.
In the next sections, the causes of software intermittent behavior are investigated, and then the methods for identifying and mitigating these failures are described. Some recent research in this area is also briefly discussed.
3.1 Causes
Even though software intermittent failures occur in most software-based systems, the end user may not always experience a drop in performance. The ability to perceive a failure is known as the observability of faults. The observability of software intermittent failures is affected by three factors: processor speed, memory capacity, and processor load. A low processor speed increases the possibility of intermittent failures, whereas with a high processor speed, intermittent failures may be observed less frequently. A high memory capacity reduces the observability of software intermittent failures, whereas an increase in the processor load could increase the occurrence of intermittent failures. To reduce the frequency of intermittent behavior, these factors and the fault causes of the intermittent behavior must be addressed.
Gracia et al. [46] classify the causes of software-related intermittent failures as timing failures, errors in memory, unhandled exceptions, errors in disks, and concurrency-related failures. Timing failures occur when process executions are delayed during processing or when the sequence of their execution is disturbed.
For example, because process executions are time-sensitive, parallel processes running simultaneously can be delayed if one of them does not complete within the expected time. Memory leaks and memory errors occur because of improper memory allocation or de-allocation. This can happen when the memory footprint, which is
the amount of main memory a program uses or references, becomes very high. This may be caused by prolonged memory usage and can result in intermittent freezes and crashes. Software failures due to unhandled exceptions happen when an unexpected error occurs during execution and the error is not handled by the software. For example, when the software tries to divide a number by zero, an error is generated. If this error is not handled, it could lead to an intermittent failure. Disk error failures are software intermittent failures resulting from physical errors in the disk drives. Concurrency-related failures occur when concurrent tasks are being executed, leading to heavy usage of the system.
3.2 Diagnosis
In software, many different configurations are possible. It is difficult, if not impossible, to test a product under all of these configurations, and intermittent failures can occur on configurations that have not been fully tested. When testing for intermittent behavior, the interaction between the hardware and software needs to be considered, because the hardware configuration can influence the frequency and length of intermittent software failures. Syed et al. [47] observed that software testing yields a different frequency of intermittent failures depending on the hardware configuration. For example, parameters such as processor speed, memory, hard drive capacity, and processor load led to a variation in the number of intermittent failures observed. Wei et al. [48] developed a test methodology that injects faults at the hardware architecture level to understand the effect of hardware intermittent failures on software failures. The authors discovered that faults at different sites in the processor architecture affected the software execution differently.
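The idea of fault injection can be illustrated with a toy model. The Python sketch below is not the tool or methodology of [48]; the `checksum` workload, the fault tuple, and the stuck-at-1 behavior are assumptions chosen to show that the same kind of fault can be masked or visible depending on where and when it is injected.

```python
def checksum(data, fault=None):
    """Sum the data modulo 256 in a one-byte accumulator.

    `fault` is an optional (bit, start_cycle, length) tuple: while the burst is
    active, the chosen accumulator bit is forced to 1 (intermittent stuck-at-1).
    """
    acc = 0
    for cycle, value in enumerate(data):
        acc = (acc + value) & 0xFF
        if fault is not None:
            bit, start, length = fault
            if start <= cycle < start + length:
                acc |= 1 << bit  # fault is present only during the burst
    return acc


def is_visible(data, fault):
    """True if the injected fault changes the final output of the workload."""
    return checksum(data, fault) != checksum(data)


data = [1, 2, 3, 4]
for fault in [(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 2)]:
    print(fault, "visible" if is_visible(data, fault) else "masked")
```

A fault that forces a bit which already holds 1 leaves no trace, while the same bit forced at a different cycle corrupts the result. Repeating such injections over many sites and burst lengths and counting the visible cases gives a simple estimate in the spirit of a vulnerability factor.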
Wei et al. also observed that the impact of a hardware fault on software depends on the origination site and the length of the hardware fault.
For the detection of intermittent software failures, five techniques [47] are used. The first technique is known as deterministic replay debugging (DRB). It is the ability to replay precisely the same set of instructions that led up to a software failure. Essentially, the engineer records all instructions up to the point where the system crashes and then replays that recording to determine the root causes of the failure. It is used for bug detection, fault tolerance studies, and intrusion analysis [47]. It is effective in debugging issues in multithreaded and distributed applications.
The second technique is fuzz testing (FT), sometimes called fuzzy testing. It feeds the system random, invalid, or unexpected data and observes how the system reacts. Fuzz testing is generally used for detecting failures related to corrupted data, memory leaks, software crashes, and failed assertions [47]. FT is also used to enhance software security.
The third commonly used technique is high volume test automation (HVTA). In this approach the software automatically generates, executes, and evaluates a large number of test cases to detect failures. The high volume of automatically generated tests offers a higher probability of detecting failures. HVTA techniques are generally used in detecting failures such as buffer overruns, stack overflows, resource exhaustion, and timing-related errors.
The fourth failure detection technique is load testing, which includes tests such as stress testing (testing at the operating condition limits until the system breaks) and volume testing (running very large tasks). Load testing exerts a demand on a system or device while its response is monitored. It assists in determining the maximum operating capacity and in identifying the bottlenecks and weak links in a system.
The last technique is disturbance testing (DT). Here, the normal operation of the system is disrupted by introducing physical failures, for example by unplugging the power cord. This technique is used for testing the fault tolerance and the overall quality of a system.
3.3 Mitigation
The aim of fault mitigation is to prevent unexpected outputs and to control errors. Anderson et al. [49] discussed the phases that constitute fault mitigation: error detection, damage assessment, and
error recovery. Error detection is used to identify the source of intermittent faults, while damage assessment determines the extent of disruption and losses suffered by the system. Once the nature of the fault is clearly identified, the next phase, error recovery, mitigates these faults. This stage minimizes the negative effects experienced by the end user.
There are three techniques for error recovery: recovery blocks, n-version software, and self-checking software [50]. Recovery blocks were originally developed by Randell [51] to prevent faults in software components from affecting functionality at the system level. In this approach, results from sequences in a software component are verified by adjudicator software. Each output of the software component needs to pass an acceptance test by the adjudicator.
N-version programming (NVP) is also known as multi-version programming. In this method, multiple versions of functionally equivalent software are created independently from the same original specification. The assumption is that independently developed versions will have a sharply reduced probability of containing the same software faults. Statistical techniques are employed to determine the most common response across these versions, and mitigation measures are applied accordingly. N-version software combines the advantages of redundancy (multiple software versions) and statistical techniques [52]. Even though the NVP approach is commonly used in software developed for electronic voting and switching trains, it is not free of controversy. Some critics do not agree that independently developed software versions will reduce common errors.
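The two techniques described so far can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the schemes of [51] or [52]: the integer square root task, the three versions (one deliberately faulty), the acceptance test, and the strict-majority voter are all hypothetical choices.

```python
import math
from collections import Counter


def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in turn; return the first result that passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crash in one alternate counts as a failed attempt
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version_vote(versions, x):
    """Run every version and return the answer given by a strict majority."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashed version simply casts no vote
    if not results:
        raise RuntimeError("no version produced a result")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(versions):
        raise RuntimeError("no majority among the versions")
    return value


def isqrt_newton(n):
    """Integer square root by Newton iteration (an independently written version)."""
    if n < 2:
        return n
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x


def isqrt_faulty(n):
    """Deliberately wrong version, standing in for a latent software fault."""
    return n // 2


def accept(n, r):
    """Acceptance test: r is the integer square root of n."""
    return r * r <= n < (r + 1) * (r + 1)


print(recovery_block([isqrt_faulty, math.isqrt], accept, 17))
print(n_version_vote([math.isqrt, isqrt_newton, isqrt_faulty], 17))
```

The recovery block needs a trustworthy acceptance test but runs the alternates only until one passes; the voter needs no acceptance test but always pays for running every version, and it offers no protection when a majority of versions share the same fault.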
Self-checking software [53] detects the occurrence of software errors, locates and identifies their causes, and stops the propagation of errors. For self-checking software to perform successfully, the system needs to monitor both the functional aspects of the process and the data. Functional monitoring checks for infinite loops and incorrect loop terminations in a software program, while data monitoring checks the integrity of defined data structures in software.
3.4 New Technology Trends
New approaches are being developed to overcome software-related intermittent failures. Data race issues can cause many intermittent failures in software. They are non-deterministic, hard to debug, and cause problems at runtime [54]. A data race occurs when two threads access the same memory location without synchronization and at least one of the accesses is a write operation. Because of this complexity, the C and C++ language specifications leave the behavior of such programs undefined [55], and the Java specification for such programs is complicated and known to be buggy [56]. The use of multithreaded programs is increasing because of multicore processors, and multithreading is prone to data race issues.
One approach to overcoming data race detection issues was presented in 2013 by Wester et al. [57]. It is called parallelizing data race detection. They point out that traditional data race detectors are too slow to be used regularly. Wester et al. propose to increase the speed by spreading the detection work across multiple cores. Their strategy involves a process called uniparallelism, which executes time intervals of the program in parallel to provide scalability, while executing all threads of an interval on a single core to eliminate locking.
Another emerging research area is automated software repair. Heuristic and algorithmic approaches are leveraged for generating, evaluating, and repairing defective sites.
This approach has received attention in the fields of programming languages [58], operating systems [59], and software engineering [60]. Automated repair is effective in solving concurrency bugs that lead to intermittent failures [58]. Schulte et al. [61] presented a paper in 2013 outlining a methodology for applying automated repair to arbitrary and non-repeatable software defects in embedded systems. This process has been implemented on Nokia N9000 smart phones. The algorithm used for localizing fault sites is based on Gaussian convolution and stochastic sampling. It reduces memory requirements by 85% for embedded systems, is ten times faster, and is suited to devices where direct instrumentation is not feasible.
Sahoo et al. [62] published a paper in 2013 in which automatic diagnostic techniques are proposed for isolating the root causes of software-related intermittent failures. Self-generated likely program invariants are used with filtering techniques at sites close to the fault-triggering point to select a set of candidates as possible root causes. Likely program invariants are effective tools for detecting and diagnosing software errors [63]. They are program properties that are observed to hold in some set of successful executions but not necessarily in all executions. The set of candidate sites is trimmed down by dynamic backward slicing, a technique that can pinpoint precisely which instructions affect a particular value in a single execution of a program [64]. The list of candidates is further reduced by dependence filtering, which is based on the premise that if an invariant on one instruction fails, then a different, dependent instruction may also show an invariant failure, but the underlying cause is the first invariant and not the second. The second filtering approach assumes that if multiple similar inputs result in the same failure symptom, they are likely to have the same cause. This is a promising approach for the automatic diagnosis of software root causes; however, it only works with deterministic detectors. Future work is planned to include non-deterministic detectors.
The use of multicore processors has resulted in concurrency errors in multithreaded programs. These errors can lead to intermittent failures arising from schedule-dependent failures. These failures are caused by interactions between threads that were not anticipated by the program developer [65].
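To make the notion of a schedule-dependent failure concrete, the Python sketch below enumerates every interleaving of two threads that each increment a shared counter without synchronization. It is a deterministic simulation written for illustration, not a technique from [65]; the two-step read/write model of an increment and the function names are assumptions.

```python
from itertools import permutations


def run_schedule(schedule):
    """Run two unsynchronized increments of a shared counter under one interleaving.

    `schedule` is a sequence of thread ids (0 or 1). Each thread performs two
    steps in order: read the shared counter, then write back the value plus one.
    """
    shared = 0
    local = {0: None, 1: None}   # value each thread read
    step = {0: 0, 1: 0}          # next step of each thread
    for tid in schedule:
        if step[tid] == 0:
            local[tid] = shared        # read
        else:
            shared = local[tid] + 1    # write
        step[tid] += 1
    return shared


def all_schedules():
    """Every distinct ordering of the four steps (two per thread)."""
    return sorted(set(permutations([0, 0, 1, 1])))


outcomes = {schedule: run_schedule(schedule) for schedule in all_schedules()}
for schedule, final in outcomes.items():
    status = "ok" if final == 2 else "lost update"
    print(schedule, final, status)
```

Only the schedules in which one thread finishes both steps before the other starts give the expected result; every other interleaving loses an update. On real hardware the scheduler picks among such interleavings unpredictably, which is why the failure appears only intermittently.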
An atomicity violation is another schedule-dependent failure that can cause intermittent failures. This occurs when a thread accessing a shared state is inadvertently allowed to interleave between a pair of accesses in another thread. A 2013 paper from the University of Washington [65] discusses the development of automated techniques for avoiding schedule-dependent failures such as concurrency errors and atomicity violations. The authors established a system for collecting relevant program events at run time. When a program fails, the information collected is analyzed to generate hypotheses for the causes of the failure. Leveraging the multiple instances of the deployed software in operation, a predictive statistical model and an empirical framework have been developed to identify which hypothesis is most likely to be correct. Corrective actions are taken by manipulating future program executions. The emphasis of the study is not on failure detection but on failure avoidance.
4. RECOMMENDATIONS
Intermittent failures should be treated seriously not only because of their massive cost but also because they could be early indicators of permanent failures. For intermittent failures, it is better to focus on failure avoidance than on failure detection or failure mitigation. From the hardware design perspective, it is recommended that the specification of minimum spacing requirements for circuit traces depend on the current usage. With the increase in semiconductor scaling, preemptive design strategies need to be developed that leverage data such as the intermittent vulnerability factor (IVF) discussed in this paper. On the packaging side, it would be valuable to develop new materials that offer better shielding from cosmic radiation to prevent single-event upsets (SEUs). Self-repairing wire bonds and self-healing solder joints may sound futuristic, but they can diminish the occurrence of intermittent failures in hardware.
Since connector disconnections are a common cause of intermittent failures, it is recommended to develop effective methodologies for monitoring the traveling waves caused by sudden connector disconnections. For some avionic systems, it is recommended to develop an online test methodology, rather than performing lab testing, to increase the probability of detecting intermittent failures. Software intermittent failures should always be studied within the context of the hardware being used, and it is important to focus on fault causes rather than on the observability of intermittent failures. There is a need for more detailed studies on solving system-level intermittent failures. With the increase in multicore processor usage, it is recommended to anticipate and preempt IF problems
caused by data races in multithreaded programs. Parallelizing techniques should be employed where possible to detect data race failures. It is recommended to use automated software repair for solving concurrency issues, and likely program invariants are encouraged in automatic diagnostic techniques for solving deterministic failures.
5. CONCLUSIONS
Intermittent failures are difficult to diagnose because, when they are investigated, the faults cannot be replicated consistently. This paper takes a broad approach by describing the various causes, diagnosis methods, and mitigation strategies for intermittent failures manifested at the hardware and software levels. Some promising upcoming technologies are highlighted that might help develop future solutions for intermittent failures. Since diagnosing intermittent failure is challenging, helpful tables and methodologies have been presented to detect the causes of hardware and software intermittent failures. Recommendations have been offered to help minimize the occurrence of intermittent failures in hardware and software. The paper strives to advance the state of the art and practice by covering a wide diversity of intermittent failures, both in hardware and software, while offering an understanding of the underlying causes and proposing approaches and methodologies for diagnosis and mitigation.
6. ACKNOWLEDGEMENTS
The authors would like to acknowledge the personnel associated with the University of Maryland and CALCE (Center for Advanced Life Cycle Engineering) for their constant support and assistance in developing this paper. Special appreciation and thanks are due to Diganta Das, Kelly Smith, Mark Zimmerman, Faye Chai, Weifeng Liu and Ken Neubeck for guidance in the content, structure and presentation of this paper.
7. REFERENCES
[1] Authoritative Dictionary of IEEE Standard Terms, 7th edition, published by Standards Information Network IEEE Press, 2000 IEEE 100.
[2] K. Neubeck, “Practical Reliability Analysis”, (Prentice Hall, 2004).
[3] D. A. Thomas, K. Ayers, and M. Pecht, “The ‘trouble not identified’ phenomenon in automotive electronics,” Microelectronics Reliability, vol. 42, no. 4–5, pp. 641–651, Apr. 2002.
[4] I. James, D. Lumbard, I. Willis, and J. Goble, “Investigating no fault found in the aerospace industry,” in Reliability and Maintainability Symposium, 2003. Annual, 2003, pp. 441–446.
[5] P. Söderholm, “A system view of the No Fault Found (NFF) phenomenon,” Reliability Engineering & System Safety, vol. 92, no. 1, pp. 1–14, Jan. 2007.
[6] B. Steadman, T. Pombo, I. Madison, J. Shively, and L. Kirkland, “Reducing No Fault Found using statistical processing and an expert system,” in AUTOTESTCON Proceedings, 2002. IEEE, 2002, pp. 872–878.
[7] WDS Global white paper, “No Fault Found returns cost the mobile industry $4.5 Billion per year”, 2006. Online: http://www.wds.co/news/whitepapers/20060717/MediaBulletinNFF.pdf.
[8] Kimseng K., Hoit M, Pecht M, “Physics of failure assessment of a cruise control module” Microelectronics Reliability, 1999, 39(10):423-444.
[9] C. Maul, J. W. McBride, and J. Swingler, “Intermittency phenomena in electrical connectors,” Components and Packaging Technologies, IEEE Transactions on, vol. 24, no. 3, pp. 370–377, Sep. 2001.
[10] M. Antler, “Contact fretting of electronic connectors”, IEICE Trans. Electron, Vol E82-C, #1, 1994, pp 3-12.
[11] C. Maul, J. McBride and J. Swingler, “On the nature of intermittence in electrical contacts”, in 20th Int. Conf. Electrical Contacts, Stockholm, 2000, pp 23-28.
[12] A. Gibson, S. Choi, T. Bieler and K. Subramanian, Environmental concerns and materials issues in manufactured solder joints, Proceedings of the 1997 IEEE International Symposium, In Electronics and the Environment (1997) 246–251.
[13] H. A. Schafft, “Failure Analysis of Wire Bonds,” in Reliability Physics Symposium, 1973. 11th Annual, 1973, pp. 98–104.
[14] R. E. McCullough, “Screening Techniques for Intermittent Shorts,” in Reliability Physics Symposium, 1972. 10th Annual, 1972, pp. 19–22.
[15] T. Koch, W. Richling, J. Whitlock, and D. Hall, “A Bond Failure Mechanism,” in Reliability Physics Symposium, 1986. 24th Annual, 1986, pp. 55–60.
[16] Sorensen B. Digital averaging-the smoking gun behind No-Fault-Found, Air Safety Week, February, 24, 2003.
[17] W.C. Maia Filho, M. Brizoux, H. Fremont, Y. Danto, “Improved Physical Understanding of Intermittent Failure in Continuous Monitoring Method”, Proceedings of 14th IPFA, 2007, pp. 141-146.
[18] M. Reid, J. Punch, G. Grace, L. F. Garfias, and S. Belochapkine, “Corrosion Resistance of Copper-Coated Contacts,” Journal of The Electrochemical Society, vol. 153, no. 12, p. B513, 2006.
[19] D. Minzari, M. S. Jellesen, P. Møller, and R. Ambat, “On the electrochemical migration mechanism of tin in electronics,” Corrosion Science, vol. 53, no. 10, pp. 3366–3379, Oct. 2011.
[20] B. Sood, M. Osterman and M. Pecht, Tin whisker analysis of Toyota's electronic throttle control, Circuit World 37(3) (2011) 4–9.
[21] C. Constantinescu, “Intermittent faults and effects on reliability of integrated circuits,” in Reliability and Maintainability Symposium, 2008. RAMS 2008. Annual, 2008, pp. 370–374.
[22] D. T. Blaauw, C. Oh, V. Zolotov, and A. Dasgupta, “Static electromigration analysis for on-chip signal interconnects,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 22, no. 1, pp. 39–48, Jan. 2003.
[23] S. Kothawade, K. Chakraborty, S. Roy, and Y. Han, “Analysis of intermittent timing fault vulnerability,” Microelectronics Reliability, vol. 52, no. 7, pp. 1515–1522, Jul. 2012.
[24] S. Pan, Y. Hu, and X. Li, “IVF: Characterizing the vulnerability of microprocessor structures to intermittent faults,” in Design, Automation Test in Europe Conf. Exhibition, 2010, pp. 238–243.
[25] N. Vichare and M. Pecht, Prognostics and health management of electronics, IEEE Transactions on Components and Packaging Technologies, 29(1) (2006) 222–229.
[26] S. Mathew, D. Das, R. Rossenberger, and M. Pecht, “Failure mechanisms based prognostics,” in Prognostics and Health Management, 2008. PHM 2008. International Conference, 2008, pp. 1–6.
[27] L. V. Kirkland, “When should intermittent failure detection routines be part of the legacy re-host TPS?” in AUTOTESTCON, 2011 IEEE, 2011, pp. 54–59.
[28] H. Qi, S. Ganesan, and M. Pecht, “No-fault-found and intermittent failures in electronic products,” Microelectronics Reliability, vol. 48, no. 5, pp. 663–674, May 2008.
[29] Bryan Steadman, Floyd Berghout, Nathan Olsen, “Intermittent Fault Detection and Isolation System”, IEEE AUTOTESTCON, 2008.
[30] M. Pecht, Prognostics and health monitoring of electronics, John Wiley & Sons, Ltd, 2008.
[31] J. Savir, “Detection of Intermittent Faults in Sequential Circuits” Stanford University, Rep. TR-120, 1978.
[32] L. Kirkland, “When should intermittent failure detection routines be part of the Legacy Re-Host TPS”, IEEE, Autotestcon, 2011, pp 54-59.
[33] R. Vattikonda, W. Wang and Y. Cao, “Modeling and minimization of PMOS NBTI effect for robust nanometer design”, in Proceedings of the Design Automation Conference, DAC 2006.
[34] M. Demertzi, B. Zandian, R. Rojas and M. Annavaram, “Benchmarking ISA Reliability to Intermittent Failures”, IEEE International Symposium on Workload Characterization (IISWC), 2012, pp. 86-87.
[35] S. Hannel, S. Fouvry, P. Kapsa and L. Vincent, “The fretting sliding transition as a criterion for electrical contact performance” WEAR, Vol 49, 2001, pp 761-770.
[36] A. Ginart, I. Ali, J. Goldwin, P. Kalgren, M. Roemer, E. Balaban and J. Celaya, “Sensing and characterization of EMI during Intermittent Connector Anomalies” Aerospace Conference, IEEE, March 3-10, 2012, pp 1-7.
[37] R. Rao, K. Chopra, D. Blaauw and D. Sylvester, “An efficient static algorithm for computing the soft error rates of combinatorial circuits,” in Proceedings of Design, Automation and Test in Europe, Vol. 1, March 2006, pp 1-6.
[38] N. Kehl and W. Rosenstiel, “An efficient SER estimation method for Combinatorial Circuits”, IEEE Transactions on Reliability, vol 60, number 4, 2011, pp 742-747.
[39] S. Pan, Y. Hu and X. Li, “IVF: Characterizing the Vulnerability of Microprocessor Structures to Intermittent Faults”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 20, number 5, 2012, pp 777-790.
[40] A. Correcher, E. Garcia, F. Morant, E. Quiles and L. Rodriguez, “Intermittent Failure Dynamics Characterization”, IEEE Transactions on Reliability, Vol 61, Number 3, pp 649-658, Sep. 2012.
[41] N. Balakrishnan and M. Asadi, “A proposed measure of Residual Life of Live Components of a Coherent System”, IEEE Trans. Rel. Vol. 61, #1, pp 41-49.
[42] J. Kleer, B. Price, L. Kuhn, M. Doh, R. Zhou, “A framework for continuously estimating persistent and intermittent failure probabilities”, Palo Alto Research Center Publications, 2008.
[43] J. Xie and M. Pecht, Applications of in-situ health monitoring and prognostic sensors, The 9th Pan Pacific Microelectronics Symposium, Exhibits and Conference (2004) 10–12.
[44] S. Mathew, D. Das, M. Osterman, M. Pecht and N. Ferebee, Prognostic assessment of aluminum support structure on printed circuit boards, ASME Journal of Electronic Packaging 128(4) (2006), 339–345.
[45] V. Shetty, D. Das, M. Pecht, D. Hiemstra and S. Martin, Remaining life assessment of shuttle remote manipulator system end effector, Proceedings of the 22nd Space Simulation Conference (2002), 21–23.
[46] J. Gracia, L. Saiz, J. C. Baraza, D. Gil, and P. Gil, “Analysis of the influence of intermittent faults in a microcontroller,” in Design and Diagnostics of Electronic Circuits and Systems, 2008. DDECS 2008. 11th IEEE Workshop on, 2008, pp. 1–6.
[47] R. A. Syed, B. Robinson, and L. Williams, “Does Hardware Configuration and Processor Load Impact Software Fault Observability?” in Software Testing, Verification and Validation (ICST), 2010 Third International Conference on, 2010, pp. 285–294.
[48] J. Wei, L. Rashid, K. Pattabiraman, and S. Gopalakrishnan, “Comparing the effects of intermittent and transient hardware faults on programs,” in Dependable Systems and Networks Workshops (DSN-W), 2011 IEEE/IFIP 41st International Conference on, 2011, pp. 53–58.
[49] T. Anderson and J. C. Knight, “A Framework for Software Fault Tolerance in Real-Time Systems,” IEEE Transactions on Software Engineering, vol. SE-9, no. 3, pp. 355–364, May 1983.
[50] M. R. Lyu, Software Fault Tolerance. New York, NY, USA: John Wiley & Sons, Inc., 1995.
[51] B. Randell, “System structure for software fault tolerance,” in Proceedings of the International Conference on Reliable Software, New York, NY, USA, 1975, pp. 437–449.
[52] A. Avizienis, “The N-Version Approach to Fault-Tolerant Software,” IEEE Transactions on Software Engineering, vol. SE-11, no. 12, pp. 1491–1501, Dec. 1985.
[53] Ronitt A. Rubinfeld, A mathematical theory of self-checking, self-testing and self-correcting programs, University of California at Berkeley, Berkeley, CA, 1991.
[54] N. Leveson and C. Turner, “An investigation of the Therac-25 accidents”, IEEE Computer, 26(7): 18-41, July 1993.
[55] H. Boehm and S. Adve, “Foundations of the C++ concurrency memory model”, in Proc. 2008 ACM Conference on Programming Language Design and Implementation, pp. 69-78.
[56] J. Ševčík and D. Aspinall, “On validity of Program Transformations in the Java memory Model”, in Proc. 2008 European Conference on Object-Oriented Programming, pp. 27-51.
[57] B. Wester, D. Devecsery, P. Chen, J. Flinn and S. Narayanasamy, “Parallelizing Data Race Detection”, in ASPLOS 2013, Houston, Texas, March 16-20, 2013.
[58] G. Jin, L. Song, W. Zhang, S. Lu and B. Liblit, “Automated atomicity violation fixing”, in Programming Language Design and Implementation, 2011, pp. 389-400.
[59] J. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, and M. Carbin, “Automatically patching errors in deployed software”, in Symposium on Operating Systems Principles, 2009, pp. 87-102.
[60] Y. Wei, Y. Pei, C. Furia, L. Silva, S. Buchholz, B. Meyer and A. Zeller, “Automated fixing of programs with contracts”, in International Symposium on Software Testing and Analysis, 2010, pp. 61-72.
[61] E. Schulte, J. DiLorenzo, W. Weimer, S. Forrest, “Automated repair of binary and assembly programs for cooperating embedded devices”, in ASPLOS 2013, Houston, Texas, March 16-20, 2013.
[62] S. Sahoo, J. Criswell, C. Geigle and V. Adve, “Using Likely Invariants for automated Software Fault Localization”, in ASPLOS 2013, Houston, Texas, March 16-20, 2013.
[63] M. Ernst, J. Cockrell, W. Griswold, and D. Notkin, “Dynamically discovering likely program invariants to support program evolution” IEEE Trans. Software Eng., 2001.
[64] X. Zhang, R. Gupta and Y. Zhang, “Precise dynamic slicing algorithms”, in Proceedings of the 25th International Conference on Software Engineering, 2003.
[65] B. Lucia and L. Ceze, “Cooperative Empirical Failure Avoidance for Multithread programs”, in ASPLOS 2013, Houston, Texas, March 16-20, 2013.
[66] V. Yuvaraj and T. Vasanth, “Simulation, Control and Analysis of HTS Resistive and Power Electronic FCL for Fault Current Limitation and Voltage Sag Mitigation in Electrical Network”, International Journal of Electrical Engineering & Technology (IJEET), Volume 4, Issue 3, 2013, pp. 82-94, ISSN Print: 0976-6545, ISSN Online: 0976-6553.