Reliability growth planning (RGP) is emerging as a promising technique to address the reliability challenges arising from the distributed manufacturing environment. Unlike RGT (reliability growth testing), RGP drives the reliability growth of new products across the product’s lifecycle, from design and prototyping through manufacturing to field use. It is a lifetime commitment to product reliability via systematic failure analysis, rigorous corrective actions, and cost-effective financial investment. RGP has been shown to be very effective, particularly in new product introductions under fast time-to-market requirements.
The RGP process will be introduced based on the three-phase product lifecycle: 1) design for reliability during early product development; 2) accelerated lifetime testing and corrective actions in the pilot line stage; and 3) continuous reliability improvement following volume shipment. Trade-offs among reliability investment, warranty cost reduction, and customer satisfaction will be investigated from the perspective of both the manufacturer and the customer. Reliability growth tools such as Crow/AMSAA, Pareto graphs, failure mode run charts, FIT (failure-in-time), and FMECA will be reviewed, and their roles in the RGP process will be discussed and demonstrated. Case studies drawn from the electronics equipment industry will be used to demonstrate RGP applications and justify its benefits.
In parallel with RGP, efforts have been devoted to developing optimal preventative maintenance programs based on either time-based or usage-based strategies. Recently, CBM (condition based maintenance) has shown great potential to achieve just-in-time maintenance or zero-downtime equipment. RGP and maintenance strategies share a common objective, i.e. achieving high system reliability and availability. In this presentation, optimal maintenance policies will be devised in the context of system reliability growth.
2. ASQ Reliability Division
English Webinar Series
One of the monthly webinars
on topics of interest to
reliability engineers.
To view recorded webinars (available to ASQ Reliability
Division members only), visit asq.org/reliability
To sign up for the live webinars, which are free and open
to anyone, visit reliabilitycalendar.org and select English
Webinars to find links to register for upcoming events
http://reliabilitycalendar.org/The_Reliability_Calendar/Webinars_-_English/Webinars_-_English.html
3. 1
Reliability Growth Planning: Its Concept,
Applications, and Challenges
Tongdan Jin
Assistant Prof. of Industrial Engineering
Ingram School of Engineering
Texas State University-San Marcos
November 11, 2010
4. 2
Contents
• RGT vs. RGP
• Design for Reliability
• New Reliability Monitoring Metrics
• Reliability Growth under Budget Constraints
• Conclusion
5. 3
RGT vs. RGP
Product Life Cycle phases:
• Design and Development
• Prototype and Pilot Phase
• Volume Production, Field Use, and End of Life
[Figure: timeline comparing the spans of Reliability Growth Testing (RGT) and Reliability Growth Planning (RGP) across these phases.]
6. 4
Why Need RGP?
• Design Cycle Shrinks
• Cut-off of Testing Budget
• Different Design/Development Schedule
[Figure 3: Compressed System Design Cycle (automatic test equipment). Basic subsystems 1 to 3 and advanced subsystems 4 to 6 are shown on a time axis from t0 to t4, spanning basic design through volume manufacturing and shipping.]
7. 5
System Reliability vs. Shipment
[Figure: system MTBF and field system population (system installs) plotted against chronological time, with the target MTBF marked.]
8. 6
Reliability Growth Planning Across Lifecycle Time
[Figure: design for reliability driving reliability growth across lifecycle time. Labeled elements: hardware, design, software, mfg, process, NFF, failure mode Pareto, CA effectiveness, optimization, budget.]
Note: mfg = manufacturing, NFF = no fault found, CA = corrective action
10. 8
System Failure Mode Categories
[Figure: Failures Breakdown by Root-Cause Category. Bar chart with a vertical axis from 0% to 50%, series A, B, C, and D, and categories Hardware (components), NFF, Design, Software, Mfg, and Process.]
12. 10
Modeling Hardware Failure Rate
$$\lambda = \lambda_0 \, \pi_T \, \pi_E \, \pi_Q \, \pi_{FT} \, \pi_R$$
$\lambda_0$ = base failure rate.
$\pi_T$ = temperature factor.
$\pi_E$ = electrical stress factor.
$\pi_Q$ = quality factor.
$\pi_{FT}$ = fault tolerance factor.
$\pi_R$ = redundancy factor.
For a given design, $\pi_T$ and $\pi_E$ play essential roles in
the actual component reliability.
13. 11
Aggregate Failure Rate for Hardware
$$\lambda_{hw} = \sum_{i=1}^{k} n_i \lambda_i = \sum_{i=1}^{k} n_i \lambda_{0i} \, \pi_{Ti} \, \pi_{Ei}$$
$$E[\lambda_{hw}] = \sum_{i=1}^{k} n_i \lambda_{0i} \, E[\pi_{Ti}] \, E[\pi_{Ei}]$$
$$\mathrm{var}(\lambda_{hw}) = \sum_{i=1}^{k} n_i^2 \lambda_{0i}^2 \, \mathrm{var}(\pi_{Ti} \, \pi_{Ei})$$
[Figure: ASIC Temperature Distribution. Histogram (quantity) with a fitted pdf over temperature in degrees Celsius, binned as <65, [65, 70), [70, 75), [75, 80), [80, 85), [85, 90), >90.]
Where
k = number of types of devices used in the product.
$n_i$ = quantity of the ith type of device used in the product.
$\lambda_{0i}$ = base failure rate for the ith type of device.
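The mean and variance formulas above can be evaluated directly once stress-factor samples are available for each device type. The following is a minimal sketch, not the presenter's implementation: the function name, the dictionary layout, and all numbers are hypothetical, and the mean formula treats the temperature and electrical factors as independent, as the expression for E[λhw] implies.

```python
from statistics import mean, variance

def hardware_rate_moments(devices):
    """Return (E[lambda_hw], var(lambda_hw)) from per-type factor samples.

    Each entry in `devices` is a dict with (hypothetical layout):
      n        quantity of this device type in the product
      lambda0  base failure rate of this device type
      pi_T     list of sampled temperature factors
      pi_E     list of sampled electrical stress factors (paired with pi_T)
    """
    expected = 0.0
    var_total = 0.0
    for d in devices:
        n, lam0 = d["n"], d["lambda0"]
        # E[lambda_hw] term: n_i * lambda_0i * E[pi_Ti] * E[pi_Ei]
        expected += n * lam0 * mean(d["pi_T"]) * mean(d["pi_E"])
        # var term: n_i^2 * lambda_0i^2 * var(pi_Ti * pi_Ei), sample variance
        products = [t * e for t, e in zip(d["pi_T"], d["pi_E"])]
        var_total += (n * lam0) ** 2 * variance(products)
    return expected, var_total

# Illustrative, made-up data for two device types.
example_devices = [
    {"n": 4, "lambda0": 2e-9, "pi_T": [1.0, 1.2, 1.5], "pi_E": [1.1, 1.0, 1.3]},
    {"n": 10, "lambda0": 5e-10, "pi_T": [0.9, 1.1, 1.0], "pi_E": [1.0, 1.2, 1.1]},
]
e_hw, var_hw = hardware_rate_moments(example_devices)
print(f"E[lambda_hw] = {e_hw:.3e}, var(lambda_hw) = {var_hw:.3e}")
```

The sample variance is used here as one reasonable estimator; a fitted temperature pdf, as in the ASIC figure, could replace the raw samples.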
14. 12
Challenges in Modeling Non-Hardware Failures
1. Quite often data is not well recorded
2. Varies from one product line to another
3. Process related
4. Design experience
5. Other random factors
15. 13
Triangle Models for Non-Hardware Failures
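$$g(\lambda) = \begin{cases} \dfrac{2(\lambda - a)}{(c - a)(b - a)} & a \le \lambda \le c \\[2ex] \dfrac{2(\lambda - b)}{(c - b)(b - a)} & c < \lambda \le b \\[2ex] 0 & \text{otherwise} \end{cases}$$
[Figure: triangular density g(λ) versus λ, rising from a to a peak of height h at c and falling to zero at b.]
Where:
a = the smallest possible value of the failure rate
b = the largest possible value of the failure rate
c = the most likely value, and $c = 3\bar{\lambda} - b - a$
$\bar{\lambda}$ = the sample mean for the dataset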
16. 14
Example for Non-Hardware Failure Estimate
Example:
Historical data from predecessor products show that the
failure rates pertaining to manufacturing issues
are (faults/hour):
1.2×10^-6, 1.4×10^-6, and 2.4×10^-6.
Then:
$\bar{\lambda}$ = (1.2×10^-6 + 1.4×10^-6 + 2.4×10^-6)/3 ≈ 1.67×10^-6
a = 1.2×10^-6
b = 2.4×10^-6
c = 3$\bar{\lambda}$ − b − a = 1.4×10^-6
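The triangle-model estimate can be reproduced with a few lines of code. This is a sketch under stated assumptions: the function names are hypothetical, a and b are taken as the smallest and largest observed rates, and c follows from the relation c = 3 × (sample mean) − b − a given on the previous slide. The inputs are the three historical rates 1.2×10^-6, 1.4×10^-6, and 2.4×10^-6 faults/hour.

```python
def triangle_parameters(rates):
    """Estimate (a, b, c) of the triangular model from observed failure rates."""
    a = min(rates)                 # smallest possible value
    b = max(rates)                 # largest possible value
    sample_mean = sum(rates) / len(rates)
    c = 3 * sample_mean - b - a    # most likely value, from mean = (a + b + c) / 3
    if not a < c < b:
        # For strongly skewed data the implied mode can leave [a, b].
        raise ValueError("implied mode c is outside (a, b)")
    return a, b, c

def triangle_pdf(lam, a, b, c):
    """Triangular density g(lambda) with support [a, b] and mode c."""
    if a <= lam <= c:
        return 2 * (lam - a) / ((c - a) * (b - a))
    if c < lam <= b:
        return 2 * (lam - b) / ((c - b) * (b - a))
    return 0.0

historical_rates = [1.2e-6, 1.4e-6, 2.4e-6]   # faults/hour
a, b, c = triangle_parameters(historical_rates)
print(f"a = {a:.2e}, b = {b:.2e}, c = {c:.2e}")
```

The density peaks at λ = c with height 2/(b − a), which `triangle_pdf(c, a, b, c)` returns.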
17. 15
Combining HW and Non-HW Failure Rate
$$\lambda_{sys} = \lambda_d + \lambda_s + \lambda_m + \lambda_p + \lambda_o + \sum_{i=1}^{k} n_i \lambda_i$$
Where:
$\lambda_d$ = failure rate of design weakness
$\lambda_s$ = failure rate of software
$\lambda_m$ = failure rate of manufacturing
$\lambda_p$ = failure rate of process
$\lambda_o$ = failure rate of other issues (e.g. NFF)
k = total number of HW component types
$\lambda_i$ = failure rate for component type i
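18. 16
Confidence Intervals for Failure Rate
$$\lambda_{sys} = \lambda_d + \lambda_s + \lambda_m + \lambda_p + \lambda_o + \sum_{i=1}^{k} n_i \lambda_i$$
$$\sigma^2_{\lambda_{sys}} = \sigma^2_{\lambda_d} + \sigma^2_{\lambda_s} + \sigma^2_{\lambda_m} + \sigma^2_{\lambda_p} + \sigma^2_{\lambda_o} + \sum_{i=1}^{k} n_i^2 \sigma^2_{\lambda_i}$$
[Figure: distribution of the system failure rate centered at $\lambda_{sys}$, with the interval from $\lambda_{sys} - 2\sigma_{\lambda_{sys}}$ to $\lambda_{sys} + 2\sigma_{\lambda_{sys}}$ marked.]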
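As a rough illustration of how the system failure rate and its two-sigma band might be computed, the sketch below sums the means and variances term by term. The function name, the tuple layout, and all numbers are hypothetical, and adding the variances assumes the individual failure rates are independent.

```python
from math import sqrt

def system_rate_interval(non_hw, hw):
    """Return (lambda_sys, sigma_sys, (lower, upper)) for a 2-sigma band.

    non_hw: list of (rate, sigma) for design, software, mfg, process, other
    hw:     list of (n, rate, sigma) per hardware component type
    """
    lam_sys = sum(rate for rate, _ in non_hw) + sum(n * rate for n, rate, _ in hw)
    # Variances add under the independence assumption; hardware terms scale by n^2.
    var_sys = sum(s ** 2 for _, s in non_hw) + sum((n * s) ** 2 for n, _, s in hw)
    sigma_sys = sqrt(var_sys)
    return lam_sys, sigma_sys, (lam_sys - 2 * sigma_sys, lam_sys + 2 * sigma_sys)

# Made-up rates in faults/hour, for illustration only.
non_hw_terms = [(3e-6, 5e-7), (1e-6, 2e-7), (2e-6, 4e-7), (1.5e-6, 3e-7), (4e-6, 6e-7)]
hw_terms = [(4, 5e-7, 1e-7), (10, 2e-7, 5e-8)]
lam, sigma, (low, high) = system_rate_interval(non_hw_terms, hw_terms)
print(f"lambda_sys = {lam:.3e}, interval = [{low:.3e}, {high:.3e}]")
```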
22. 20
Pareto Chart for Failure Modes
[Figure: two Pareto charts by failure mode, one for January to March and one for April to June. Bars show quantity (Qty) split into No C/A, C/A In Process, and C/A Complete, with a percentage line; the failure modes are No Fault Found, Relays, Cold Solder, Resistors, Software Bug, and Op-Amp.]
Difficulties:
• Static View
• No Trend of Each Failure Mode
• Fail to Reflect Product MTBF
Note: C/A = corrective action
23. 21
Failure Mode Rate (FMR)
$$\mathrm{FMR} = \frac{\text{failures for a type of FM}}{\text{field product installations}}$$
24. 22
FMR Estimation: Example
$$\mathrm{FMR} = \frac{\text{failure quantity for a type of FM}}{\text{field product installations}}$$
For example:
Assume 120 PCBs were shipped and installed
in the field in the first quarter, and 5 were returned
as failures due to poor solder joints. Then the FMR for poor
solder joints in the first quarter is
$$\mathrm{FMR} = \frac{5}{120} = 0.042 \text{ faults/board/quarter}$$
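The FMR calculation is a single division, shown here as a small sketch with a hypothetical function name so it can be applied per failure mode and per reporting period.

```python
def failure_mode_rate(failures, installations):
    """Failures of one failure mode divided by field product installations."""
    if installations <= 0:
        raise ValueError("installations must be positive")
    return failures / installations

# Poor solder joints, first quarter: 5 returns out of 120 installed PCBs.
fmr_solder = failure_mode_rate(5, 120)
print(f"{fmr_solder:.3f} faults/board/quarter")  # prints 0.042 faults/board/quarter
```

Tracking this value quarter by quarter gives the per-mode trend that a static Pareto chart does not show.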
27. 25
Estimate for PCB Failure Rate
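$$\lambda_{sys} = \lambda_d + \lambda_s + \lambda_m + \lambda_p + \lambda_o + \sum_{i=1}^{k} n_i \lambda_i$$
Notice $\mathrm{FIT} = \lambda \times 10^9$, so
$$\mathrm{FIT}_{sys} = \mathrm{FIT}_d + \mathrm{FIT}_s + \mathrm{FIT}_m + \mathrm{FIT}_p + \mathrm{FIT}_o + \sum_{i=1}^{k} n_i \, \mathrm{FIT}_i$$
Where
$\lambda_d$ = failure rate of design errors
$\lambda_s$ = failure rate of software bugs
$\lambda_m$ = failure rate of manufacturing
$\lambda_p$ = failure rate of process
$\lambda_o$ = failure rate of other issues
$\lambda_i$ = failure rate for component type i
k = total number of new component types
$n_i$ = quantity of component type i used in the product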
28. 26
FIT-Based Reliability Driven: Example (1)
FM Category              Target MTBF (hrs)   Target FIT
Overall Product                     50,000       20,000
Components (hardware)              117,647        8,500
Others (NFF)                       250,000        4,000
Design                             333,333        3,000
Manufacturing                      500,000        2,000
Process                            666,667        1,500
Software                         1,000,000        1,000
Notice $\mathrm{FIT} = \dfrac{10^9}{\mathrm{MTBF}}$
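The FIT and MTBF columns are related by FIT = 10^9 / MTBF, and the category targets add up to the overall product target. The sketch below checks both with the table's figures; the function and dictionary names are illustrative only.

```python
def fit_to_mtbf(fit):
    """Convert FIT (failures per 10^9 hours) to MTBF in hours."""
    return 1e9 / fit

def mtbf_to_fit(mtbf_hours):
    """Convert MTBF in hours to FIT."""
    return 1e9 / mtbf_hours

# Category FIT targets taken from the table above.
category_fit_targets = {
    "Components (hardware)": 8_500,
    "Others (NFF)": 4_000,
    "Design": 3_000,
    "Manufacturing": 2_000,
    "Process": 1_500,
    "Software": 1_000,
}

overall_fit = sum(category_fit_targets.values())
for category, fit in category_fit_targets.items():
    print(f"{category:<24}{fit_to_mtbf(fit):>12,.0f} hrs")
print(f"Overall: {overall_fit:,} FIT, MTBF {fit_to_mtbf(overall_fit):,.0f} hrs")
```

Because FIT values add while MTBF values do not, budgeting in FIT makes it straightforward to split a product target across categories.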
29. 27
FIT-Based Reliability Driven: Example (2)
Product: PCB (target FIT 20,000)

Category (target FIT)    Failure Mode          Target FIT   Current FIT   Ownership
Component (8,500)        Relay                      2,000         2,491   Tom
                         Op-Amp                     3,000         4,097   Jones
                         Resistor                   1,500         2,786   Carlos
                         DC-DC converter              800         1,393   Jesson
                         ASICs                      1,200         1,716   Jim
Design (3,000)           Eng Change Order           1,300         2,383   David
                         FPGA Rev Upgrade             900         1,643   Kim
                         Change relay type            800         1,498   John
Manufacturing (2,000)    Cold solder                1,600         3,092   Tony
                         Backward component           250           355   Joe
                         Faked component              150           255   Paul
Process (1,500)          Broken part                  700           942   Jen
                         Missing part                 300           447   Chris
                         OES                          500           515   Andrew
Software (1,000)         Severe bugs                  200           398   Eileen
                         Medium bugs                  400           665   Ed
                         Trivial bugs                 400           497   Eric
Others (4,000)           NFF                        3,000           457   Mark
                         PCFD                       1,000         1,669   Jeff
31. 29
Crow/AMSAA Growth Model
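Failure intensity:
$$\lambda(t) = \hat{\alpha} \hat{\beta} \, t^{\hat{\beta} - 1}$$
where
$$\hat{\beta} = \frac{N}{\sum_{i=1}^{N} \ln\left(\dfrac{t_s}{t_i}\right)}, \qquad \hat{\alpha} = \frac{N}{t_s^{\hat{\beta}}}$$
$t_s$ = termination time, $t_i$ = ith failure arrival time
[Figure: Various Failure Intensity Models. Failure intensity versus time (0 to 5) for beta = 0.5, beta = 1, and beta = 1.5, with α = 1 for all curves.]
Hypothesis testing:
H0: β = 1, HPP
H1: β ≠ 1, NHPP
Reject H0 if
$$\frac{2N}{\hat{\beta}} < \chi^2_{2N,\,1-\theta/2} \quad \text{or} \quad \frac{2N}{\hat{\beta}} > \chi^2_{2N,\,\theta/2}$$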
32. 30
An Example
Cumulative   Failure Arrival   Interarrival    ln(ts/ti)
Failures     Time (hours)      Time (hours)
 1                67                67           3.23
 2               150                83           2.43
 3               234                84           1.98
 4               360               126           1.55
 5               533               173           1.16
 6               720               187           0.86
 7               912               192           0.62
 8              1102               190           0.43
 9              1345               243           0.23
10              1632               287           0.04
ts = 1700                           sum =       12.55
With N = 10:
$$\hat{\beta} = \frac{N}{\sum_{i=1}^{N} \ln\left(\dfrac{t_s}{t_i}\right)} = 0.797$$
$$\hat{\alpha} = \frac{N}{t_s^{\hat{\beta}}} = 0.0266$$
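The estimates in this example can be reproduced from the ten failure arrival times and the termination time of 1700 hours. The sketch below applies the time-terminated Crow/AMSAA estimators from the previous slide; the function names are hypothetical.

```python
from math import log

def crow_amsaa_mle(arrival_times, t_s):
    """Return (alpha_hat, beta_hat) for a time-terminated test ending at t_s."""
    n = len(arrival_times)
    log_sum = sum(log(t_s / t_i) for t_i in arrival_times)
    beta_hat = n / log_sum               # N / sum of ln(ts / ti)
    alpha_hat = n / t_s ** beta_hat      # N / ts^beta
    return alpha_hat, beta_hat

def failure_intensity(t, alpha, beta):
    """Crow/AMSAA failure intensity alpha * beta * t^(beta - 1)."""
    return alpha * beta * t ** (beta - 1)

arrivals = [67, 150, 234, 360, 533, 720, 912, 1102, 1345, 1632]  # hours
alpha_hat, beta_hat = crow_amsaa_mle(arrivals, t_s=1700)
print(f"beta = {beta_hat:.3f}, alpha = {alpha_hat:.4f}")
print(f"intensity at ts = {failure_intensity(1700, alpha_hat, beta_hat):.5f} failures/hour")
```

A beta estimate below 1 corresponds to a decreasing failure intensity, that is, reliability growth over the observation period.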
33. 31
Failure Modes (FM) Pareto Chart
Given a limited budget of $10 for corrective actions,
which FM should be fixed?
Cumulative operating time is 4800 hours,
total failures is 14.
Current MTBF = 4800/14 = 343 hours.
Option one: fix relays
MTBF = 4800/(14 − 2.5) = 417 hours
Option two: fix all others
MTBF = 4800/(14 − 9) = 960 hours
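The comparison between the two options is simple arithmetic: each option removes some expected number of failures from the total of 14 observed over 4800 operating hours. A small sketch, with a hypothetical function name and the slide's figures:

```python
def mtbf_after_fix(operating_hours, total_failures, failures_removed=0.0):
    """Projected MTBF once a corrective action removes some failures."""
    remaining = total_failures - failures_removed
    if remaining <= 0:
        raise ValueError("no failures remain; MTBF is unbounded")
    return operating_hours / remaining

HOURS, FAILURES = 4800, 14
print(f"Current:               {mtbf_after_fix(HOURS, FAILURES):.0f} hours")
print(f"Option one (relays):   {mtbf_after_fix(HOURS, FAILURES, 2.5):.0f} hours")
print(f"Option two (all else): {mtbf_after_fix(HOURS, FAILURES, 9):.0f} hours")
```

With these figures, option two removes more failures and gives the larger projected MTBF (960 hours versus 417 hours).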
34. 32
New Reliability Growth Model
1. Failure mode based growth prediction
2. Reliability growth subject to CA budget
constraints
3. No assumption of parametric models
4. CA effectiveness function
35. 33
Why Need the CA Effectiveness Function?
Limited resources ($) are spent on CA through:
1. Retrofit
2. ECO
The CA effectiveness function links this spending to the goal of maximizing reliability growth.
36. 34
An Example: ECO or Retrofit
A type of relay used on a PCB module fails repeatedly due to
a known failure mechanism. Two options are available for
corrective action:
1. Replace all on-board relays when a failed module is
returned
2. Proactively recall all modules and replace the relays with
a new type having much higher reliability
CA Option   Cost ($)   CA Effectiveness
ECO         Low        Low
Retrofit    High       High
37. 35
Modeling CA Effectiveness
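Effectiveness model:
$$h(x) = \left(\frac{x}{c}\right)^{b}$$
where x is the CA budget ($), and b and c are to be determined.
[Figure: effectiveness h(x) versus CA budget x ($), rising from 0 at x = 0 to 1 at x = c, with curves for b < 1, b = 1, and b > 1.]
$$\text{Effectiveness} = \frac{\text{failure rate before CA} - \text{failure rate after CA}}{\text{failure rate before CA}}$$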
38. 36
An Example
The current failure rate of a type of relay is 2×10^-8 faults per
hour. After the CA is implemented, the rate is reduced to
5×10^-9 faults per hour.
The CA effectiveness is therefore 0.75, that is
$$\frac{2 \times 10^{-8} - 5 \times 10^{-9}}{2 \times 10^{-8}} = 0.75$$
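The effectiveness ratio can be computed from the failure rates observed before and after a corrective action. The function below is a hypothetical helper, shown with the relay figures from this example.

```python
def ca_effectiveness(rate_before, rate_after):
    """Fractional reduction in failure rate achieved by a corrective action."""
    if rate_before <= 0:
        raise ValueError("rate_before must be positive")
    return (rate_before - rate_after) / rate_before

# Relay example: 2e-8 faults/hour before CA, 5e-9 faults/hour after CA.
relay_effectiveness = ca_effectiveness(2e-8, 5e-9)
print(f"CA effectiveness = {relay_effectiveness:.2f}")  # prints CA effectiveness = 0.75
```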
39. 37
Incorporate h(x) into System Failure Rate
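System failure rate before CA (the first sum covers HW, the second non-HW):
$$\lambda_s(t) = \sum_{i=1}^{k} n_i \lambda_i(t) + \sum_{i=k+1}^{m} \lambda_i(t)$$
With the effectiveness function
$$h(x) = \left(\frac{x}{c}\right)^{b}$$
the system failure rate after CA becomes
$$\lambda_{s,CA}(t) = \sum_{i=1}^{k} n_i \left(1 - h_i(x_i)\right) \lambda_i(t) + \sum_{i=k+1}^{m} \left(1 - h_i(x_i)\right) \lambda_i(t)$$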
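To illustrate how a budget allocation feeds into the system failure rate, the sketch below evaluates h(x) = (x/c)^b for each failure mode and applies the (1 − h) reduction at a fixed time t. The function names, the tuple layout, and every number are hypothetical; in practice b and c would have to be estimated for each failure mode.

```python
def h(x, b, c):
    """CA effectiveness (x / c)^b for a budget x between 0 and c."""
    if not 0 <= x <= c:
        raise ValueError("budget x must lie in [0, c]")
    return (x / c) ** b

def system_rate_after_ca(hw, non_hw):
    """System failure rate after corrective actions, at a fixed time t.

    hw:     list of (n, rate, x, b, c) per hardware component type
    non_hw: list of (rate, x, b, c) per non-hardware failure mode
    """
    hw_part = sum(n * (1 - h(x, b, c)) * rate for n, rate, x, b, c in hw)
    non_hw_part = sum((1 - h(x, b, c)) * rate for rate, x, b, c in non_hw)
    return hw_part + non_hw_part

# Made-up example: two hardware types and one non-hardware failure mode.
hw_modes = [(4, 2e-8, 500.0, 0.5, 2000.0), (10, 5e-9, 0.0, 1.0, 1000.0)]
non_hw_modes = [(3e-8, 1000.0, 2.0, 4000.0)]
rate_before = sum(n * r for n, r, *_ in hw_modes) + sum(r for r, *_ in non_hw_modes)
rate_after = system_rate_after_ca(hw_modes, non_hw_modes)
print(f"before CA: {rate_before:.3e}, after CA: {rate_after:.3e} faults/hour")
```

Choosing the budgets x_i to minimize the after-CA rate under a total budget limit is the optimization that the planning model calls for; this sketch only evaluates one given allocation.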
42. 40
Reliability Growth Planning Process
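[Figure: reliability growth planning process. The system manufacturer, repair center, in-service systems, stocks, and retrofit team are connected by an ECO loop and a retrofit loop, supported by FRACAS.]
Listed under the system manufacturer:
1. Failure analysis
2. CA decisions
3. Reliability prediction
ECO = Engineering Change Order
CA = Corrective Actions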
43. 41
Conclusions
1. Design for reliability (DFR) should incorporate hardware and
non-hardware issues along with the variation of the failure
rates.
2. Trade-off should be made between the reliability growth and
the associated availability of CA resources.
3. The CA effectiveness function links the CA budget with the
expected failure mode reduction rate.
4. A reliability database system such as FRACAS is essential for
performing RGP.