2. Why Fault Tolerance?
ï
Offers many advantages:
âŠ
âŠ
âŠ
âŠ
ï
Avoids costly packet retransmissions
Avoids catastrophic data loss
Can increase chip yield
Allows higher speed operation
In NoC specifically
⊠Ensures success of interconnect
⊠Grows in importance as technology scales
3. Fault Classes
ï±
ï
ï±
ï
ï
ï
ï±
ï
ï
ï
ï
Transient faults (or soft errors) : Random appearance and
disappearance
Alpha particles, Cosmic-ray-induced neutrons etc.
Intermittent faults: appear only under certain conditions like
Occur repeatedly at the same location
Tend to occur in bursts
Replacement of the faulty component removes the fault
Permanent faults (or Hard errors): occur always but may be
masked
Static (occurring at manufacture-time)
Process Variability (PV), Manufacturing imperfections
Dynamic (occurring at run-time,)
Electro-Migration (EM), Negative Bias Temperature Instability
(NBTI), Oxide breakdown, Stress-Induced Voiding (SIV), Hot
Carrier Injection (HCI), etc.
4. Making NoCâs Reliable
Current Methods
T-error tolerant NoC design
ï Error Control
ï
⊠Error detection and correction codes
⊠HBH retransmission mechanism
âą
Reliable task mapping
ï Fault tolerant rerouting
8. Power consumption
Observations
ï
ï
ï
The ee-par scheme has higher power
consumption than ee-crc and hybrid
scheme.
The flit based scheme incurs more
power consumption because as the no.
of flits per packet increases the useful
bits decreases.
The packet buffer requirements impact
the power consumption. Hence, as the
number of hops increases, the power
overhead of ss-flit scheme increases.
14. ROBUST: SELF HEALING
ROUTER
Universal Logic Block
Crossbar protection using multiple
ULB blocks
Advantages
It has higher silicon protection factor and a higher reliability improvement
factor.
15. Future challenges
⊠All the schemes presented to improve the reliability of the
NoC architecture have power overhead associated with
them. This increases the power dissipated which can
reduce the mean time to failure (MTTF).
⊠All the techniques should be thermal aware in order to
prevent the above mentioned phenomena.
⊠Instead of evenly wearing out all cores in MPSoCs, a
method should be deigned to self heal failed cores.
⊠Most error resilient schemes today focus primarily on
making router, links fault tolerant. There should be some
focus on making memories more reliable
16. Conclusion
ï
The ideas presented in this paper make the NoC
architecture resilient to permanent and intermittent
errors. To improve the reliability several techniques like
t-error tolerant mechanism, self healing router
architecture, reliability driven task mapping, deadlock
recovery mechanism, error detection and correction
schemes are employed. Several techniques make use
of redundancy in hardware component which is good in
terms of area since because of âdark siliconâ it is
impossible to turn on every component on the die
anyways. However, most techniques increase the
power consumption in the NoC architecture which is by
far the only drawback in using them. Designing systems
to make them resilient to errors is very crucial in
exploiting the advantages of using Network on chips.