2. Overview
Introduction
Background
Disk Terminology
Data Paths
Technology Trends
Disk Array Basics
Data Striping and Redundancy
Basic RAID Organizations
Performance and Cost Comparisons
Reliability
Implementation Considerations
3. Overview
Advanced Topics
Improving Small Write Performance for RAID Level 5
Declustered Parity
Exploiting On-Line Spare Disks
Data Striping in Disk Arrays
Performance and Reliability Modeling
Opportunities for Future Research
Experience with Disk Arrays
Interaction among New Organizations
Scalability, Massively Parallel Computers and Small Disks
Latency
4. Introduction
RAID: Redundant Arrays of Inexpensive / Independent Disks
Improvements in microprocessors and memory
systems require larger, higher-performance secondary
storage systems
Microprocessor performance is increasing faster than disk
performance
Disk arrays: multiple, independent disks → one large,
high-performance logical disk
9. Disk Array Basics
Data Striping and Redundancy
Basic RAID Organizations
Performance and Cost Comparisons
Reliability
Implementation Considerations
10. Data Striping and Redundancy
Data Striping
Distribute data over multiple disks
Service in parallel
More disks → more performance
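The round-robin distribution behind striping can be sketched as follows (a minimal illustration; the function and names are ours, not from the slides):

```python
# Minimal sketch of data striping: logical blocks are laid out
# round-robin across the disks, so consecutive blocks live on
# different disks and can be serviced in parallel.

def map_block(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block to (disk index, block offset on that disk)."""
    return logical_block % num_disks, logical_block // num_disks

# On a 4-disk array, logical blocks 0..3 land on disks 0..3, so a
# large request touching those blocks keeps all four disks busy at once.
layout = [map_block(b, 4) for b in range(8)]
```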
11. Data Striping and Redundancy
More disks → less reliable
100 disks → 1/100 the reliability of a single disk
Redundancy
Two categories
Granularity of data interleaving
Method of computing redundant information and
distributing it across the disk array
12. Data Striping and Redundancy
Data interleaving
Fine grained
Advantages:
access all the disks
high transfer rate
Disadvantages:
only one I/O request serviced at any time
All disks waste time positioning for every request
13. Data Striping and Redundancy
Data interleaving
Coarse grained
Advantages:
Multiple small requests serviced simultaneously
Large requests can access all the disks
14. Data Striping and Redundancy
Redundancy
Two main problems
Computing the redundant information: Parity
Selecting a method for distributing the redundant
information across the disk array
15. Basic RAID Organizations
Nonredundant (RAID Level 0)
Lowest cost
Best write performance
Not the best read performance
Any single disk failure results in data loss
16. Basic RAID Organizations
Mirrored (RAID Level 1)
Twice the number of disks
Data also written to redundant disk
If a disk fails, the other copy is used
17. Basic RAID Organizations
Memory Style ECC (RAID Level 2)
Contains multiple parity disks (Hamming code)
Number of parity disks proportional to the log of the number of data disks
Storage efficiency increases as the number of data disks increases
Multiple parity disks are needed to identify the failed
disk, but only one is needed to recover
18. Basic RAID Organizations
Bit-Interleaved Parity (RAID Level 3)
Data is interleaved bit-wise over the disks
Disk controller can identify which disk has failed
A single parity disk is used
Reads access all data disks; writes access all data disks plus the parity disk
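The parity used here and on the following slides is plain XOR. A small sketch (illustrative, not controller code) shows both sides of it: computing the parity disk, and rebuilding a failed disk from the survivors:

```python
from functools import reduce

def compute_parity(disks: list[bytes]) -> bytes:
    """XOR the contents of the given disks byte-by-byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))

# Three data disks and their parity disk:
data = [b"\x0f", b"\xf0", b"\x33"]
parity = compute_parity(data)  # 0x0f ^ 0xf0 ^ 0x33 = 0xcc

# Disk 1 fails: XOR the surviving data disks with the parity disk
# to regenerate its contents.
rebuilt = compute_parity([data[0], data[2], parity])  # recovers 0xf0
```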
19. Basic RAID Organizations
Block-Interleaved Parity (RAID Level 4)
Same as Level 3 but blocks (striping units) are used
Reads and writes smaller than the striping unit access only one disk
Parity update: XOR the old data, the new data, and the old parity
Four I/Os: read old data, read old parity, write new data,
write new parity
Bottleneck at parity disk
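The four-I/O small write can be sketched as follows (illustrative): the new parity is computable from the old data, old parity, and new data alone, without reading the rest of the stripe:

```python
def update_parity(old_data: int, new_data: int, old_parity: int) -> int:
    """new parity = old parity XOR old data XOR new data."""
    return old_parity ^ old_data ^ new_data

# Check against a full-stripe recomputation: rewrite d1 in a
# three-disk stripe and confirm the parity stays consistent.
d0, d1, d2 = 0x0F, 0xF0, 0x33
parity = d0 ^ d1 ^ d2
new_d1 = 0xAA
new_parity = update_parity(d1, new_d1, parity)
assert new_parity == d0 ^ new_d1 ^ d2
```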
20. Basic RAID Organizations
Block-Interleaved Distributed-Parity (RAID Level 5)
Solves bottleneck problem at Level 4
Best small read, large read, and large write performance
Small writes are inefficient because of the read-modify-write cycle
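Distributing the parity can be sketched with the rotation of the common "left-symmetric" layout (one of several possible placements; the function is illustrative):

```python
def parity_disk(stripe: int, num_disks: int) -> int:
    """Disk holding the parity block of the given stripe: the parity
    position rotates one disk per stripe, so no single disk becomes
    a parity bottleneck."""
    return (num_disks - 1) - (stripe % num_disks)

# Over 5 consecutive stripes of a 5-disk array, every disk
# holds parity exactly once.
placement = [parity_disk(s, 5) for s in range(5)]  # [4, 3, 2, 1, 0]
```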
21. Basic RAID Organizations
P + Q Redundancy (RAID Level 6)
Uses stronger, Reed-Solomon-style codes to tolerate double disk failures
Operates in much the same manner as Level 5
Small writes are inefficient: six I/O requests, since both
the P and Q information must be updated
22. Performance and Cost Comparisons
Ground Rules and Observations
Reliability, performance and cost
Disk arrays are throughput oriented
I/Os per second per dollar
Configuration
RAID 5 can emulate RAID 1 or RAID 3 by appropriate choice
of the striping unit
26. Reliability
Basic Reliability
RAID 5
MTTF: mean time to failure, MTTR: mean time to repair
N: total number of disks, G: parity group size
100 disks each with an MTTF of 200,000 hours, MTTR of 1 hour,
and a parity group size of 16 → mean time to failure of the
system is about 3,000 years!
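The formula this example plugs into (reconstructed here; it is the standard single-parity MTTF estimate, and the slide's numbers check out: 200,000² / (100 · 15 · 1) ≈ 2.7 × 10⁷ hours ≈ 3,000 years):

```latex
\mathrm{MTTF}_{\mathrm{RAID5}} \approx
  \frac{\mathrm{MTTF}_{\mathrm{disk}}^{2}}{N\,(G-1)\,\mathrm{MTTR}}
```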
27. Reliability
Basic Reliability
RAID 6
MTTF: mean time to failure, MTTR: mean time to repair
N: total number of disks, G: parity group size
100 disks each with an MTTF of 200,000 hours, MTTR of 1 hour,
and a parity group size of 16 → mean time to failure of the
system is about 38,000,000 years!
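The double-failure analogue (reconstructed; the standard P+Q estimate adds a second failure term and squares the repair window):

```latex
\mathrm{MTTF}_{\mathrm{RAID6}} \approx
  \frac{\mathrm{MTTF}_{\mathrm{disk}}^{3}}{N\,(G-1)\,(G-2)\,\mathrm{MTTR}^{2}}
```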
28. System Crashes and Parity Inconsistency
System crash: power failure, operator error, hardware
breakdown, software crash etc.
Causes parity inconsistencies in both bit-interleaved
and block-interleaved disk arrays
System crash may occur more frequently than disk
failures
To avoid the loss of parity on system crashes,
information sufficient to recover the parity must
be logged to non-volatile storage (NVRAM) before
each write operation.
29. Uncorrectable Bit Errors
What causes a bit error is unclear:
data may be written incorrectly, or the magnetic media
may gradually degrade
Some manufacturers have developed an approach that monitors
the warnings given by disks and notifies an operator
when a disk appears to be about to fail.
31. Reliability Revisited
Double disk failure
System crash followed by a disk failure
Disk failure followed by an uncorrectable bit error
during reconstruction
35. Implementation Considerations
Avoiding Stale Data
When a disk fails, it must be marked invalid.
The invalid mark prevents users from reading corrupted data
on the failed disk
When an invalid logical sector is reconstructed to a
spare disk, the logical sector must be marked as valid.
36. Implementation Considerations
Regenerating Parity after a System Crash
Before servicing any write request, the corresponding
parity sectors must be marked inconsistent
When bringing a system up from a system crash, all
inconsistent parity sectors must be regenerated
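A minimal sketch of this bookkeeping (the set stands in for an NVRAM log; all names are ours, not an actual controller API):

```python
inconsistent: set[int] = set()   # stands in for an NVRAM log of parity sectors

def begin_write(parity_sector: int) -> None:
    # Logged *before* the data and parity writes are issued.
    inconsistent.add(parity_sector)

def end_write(parity_sector: int) -> None:
    # Both writes completed: the parity sector is consistent again.
    inconsistent.discard(parity_sector)

def recover_after_crash() -> list[int]:
    # Only sectors whose writes were in flight need regeneration.
    return sorted(inconsistent)

begin_write(7); begin_write(12); end_write(7)
# A crash here would require regenerating parity only for sector 12.
```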
37. Implementation Considerations
Operating with a Failed Disk
Demand reconstruction: access to a parity stripe with an
invalid sector triggers reconstruction of the appropriate
data immediately onto a spare disk. A background
process scans the entire disk.
Parity sparing: before servicing a write request, the
invalid sector is reconstructed and relocated to
overwrite its corresponding parity sector
39. Advanced Topics
Improving Small Write Performance for RAID Level 5
Declustered Parity
Exploiting On-Line Spare Disks
Data Striping in Disk Arrays
Performance and Reliability Modeling
40. Improving Small Write Performance for RAID Level 5
Buffering and Caching
Write buffering (asynchronous writes): collect small writes in a
buffer and write them out as one large write
Read caching: reduces the four I/O accesses to three; the old data
is read from the cache
41. Improving Small Write Performance for RAID Level 5
Floating Parity
Shortens the read-modify-write time
Unallocated blocks are kept near the parity blocks
The new parity block is written to the rotationally nearest
unallocated block following the old parity block
Implemented in the disk controller
43. Improving Small Write Performance for RAID Level 5
Parity Logging
Delays the read of the old parity and the write of the new
parity
The parity update (XOR difference) is temporarily logged
Logged updates are grouped together, so large contiguous blocks
are updated more efficiently
47. Data Striping in Disk Arrays
Disk positioning time is wasted work
Idle time is wasted in the same way as positioning time
Data striping (interleaving) distributes data
among multiple disks.
Researchers study the striping unit size that
maximizes throughput
48. Data Striping in Disk Arrays
P: average disk positioning time
X: average disk transfer rate
L: concurrency
Z: request size
N: array size in disks
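The variable list above matches the striping-unit rule of thumb from Chen and Lee's study; as best we can reconstruct it, the throughput-maximizing striping unit is on the order of

```latex
\text{striping unit} \approx \sqrt{\frac{P \cdot X \cdot (L-1) \cdot Z}{N}}
```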
49. Performance and Reliability Modeling
Performance
Kim: response time equations
Kim & Tantawi: approximate service time equations
Chen & Towsley
Lee & Katz
Reliability
Markov models
50. Opportunities for Future Research
Experience with Disk Arrays
Interaction among New Organizations
Scalability, Massively Parallel Computers and Small Disks
Latency