This paper will address the application and file system performance
differences between RAID 3 and RAID 5 when used in a high performance
computing environment. The reader is expected to have a working knowledge of
RAID architecture and parity management for the different RAID types.
RAID 3 typically organizes data by segmenting a user data record into either
bit- or byte-sized chunks and evenly spreading the data across "N" drives in
parallel. One of the drives is a parity drive. In this manner, every record
that is accessed is delivered at the full media rate of the "N" drives that
comprise the stripe group. The drawback is that every record I/O must access
every drive in the stripe group.
Additionally, the stripe width chosen forces the user record length to be a
multiple of each disk's underlying atomic format. Assume each disk has an
atomic 4 KByte format. Atomic format, in this sense, is the blocksize on the
disk (i.e. the data you get when you issue a single atomic read or write
operation). An 8+P array would have 32 KByte logical blocks. A 5+P would have
20 KByte logical blocks, and so on. You may not write to a RAID 3 except in
full stripe logical blocks. This limits application design flexibility and
limits the user's ability to change the physical array characteristics.
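To make the stripe arithmetic concrete, here is a minimal Python sketch; the
function names and the 4 KByte atomic format are assumptions for illustration,
not any vendor's interface. It computes the RAID 3 logical block size and
checks whether a write request is a legal full-stripe multiple:

    # RAID 3 full-stripe write constraint (illustrative sketch)
    ATOMIC_BLOCK = 4 * 1024          # assumed 4 KByte atomic disk format

    def logical_block_size(data_disks, atomic=ATOMIC_BLOCK):
        """Logical (full-stripe) block size for an N+P RAID 3 group."""
        return data_disks * atomic   # the parity disk adds no user capacity

    def is_legal_raid3_write(nbytes, data_disks, atomic=ATOMIC_BLOCK):
        """RAID 3 accepts writes only in whole full-stripe logical blocks."""
        return nbytes % logical_block_size(data_disks, atomic) == 0

    print(logical_block_size(8))                # 8+P -> 32768 bytes (32 KByte)
    print(logical_block_size(5))                # 5+P -> 20480 bytes (20 KByte)
    print(is_legal_raid3_write(64 * 1024, 8))   # True  (two full stripes)
    print(is_legal_raid3_write(40 * 1024, 8))   # False (not a stripe multiple)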
For a single long request delivering a single stream of sequential data,
RAID 3 is excellent. For multiple requests, RAID 3 performs very poorly. A
single record request (e.g. file system metadata) would tie up 100% of the
disks and, more importantly, pull the disk heads away from the actual user
data to read the metadata. Multiple requests from two or more active
processes (jobs) will only serve to exacerbate this problem. The disk heads
are being pulled from one user's data area to the other every time each
process makes a request. Each head movement requires a "seek" and a disk
latency. Often, this is on the order of 15-20 ms. Two processes typically
require four (4) movements each time an I/O request ping-pongs between the
two jobs: two metadata requests and two data transfer requests.
The resulting disk head positioning time of possibly up to 80 ms., coupled
with the latencies built into any controller, contributes to poor overall
throughput. In RAID 3 mode, disk latency is not 1/2 a revolution as it is for
a single disk, but is statistically (N-1)/N of a revolution, where N is the
width of the stripe group. For example, an 8-way stripe would have a latency
of (8-1)/8, or 7/8ths of a revolution. Additionally, on a 1.063 GBit FC-AL
loop or a 100 MBytes/sec. HIPPI channel sustaining 75 MBytes/sec. data
transfers, just two operations, each reading a megabyte of data with a 20 ms
latency to switch between them, would result in an average throughput of
about 30 MBytes/sec. (i.e. 20 ms position, 13.333 ms data transfer, 20 ms
position, 13.333 ms transfer, and so on). This example, in which we lost 45
MBytes/sec. of potential throughput, assumed NO metadata fetches. If you add
metadata I/O, assuming each metadata fetch is less than 64 KByte, the
throughput drops to between 10 and 15 MBytes/sec. Any host-induced delays
will further reduce this number. A good rule of thumb is to assume that a
good RAID 3 controller can sustain about 60-65 unique short record I/Os per
second (60 x 15 ms = 900 ms) or less. Long I/Os decrease this rate as a
function of the transfer time involved for each request.
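The figures above can be reproduced with a short back-of-the-envelope Python
sketch. The rates and switch times are the round numbers assumed in the text,
not measurements from any particular controller:

    # Two-process ping-pong on a RAID 3 (illustrative, using the text's figures)
    CHANNEL_RATE = 75.0       # MBytes/sec sustained transfer rate (assumed)
    REQUEST_SIZE = 1.0        # MBytes per read
    SWITCH_TIME  = 0.020      # seconds of seek + latency per job switch (assumed)

    def raid3_latency_revs(width):
        """Expected rotational latency of an N-wide RAID 3 stripe, in revolutions."""
        return (width - 1) / width              # vs. 1/2 revolution for a single disk

    def ping_pong_throughput(rate, size, switch):
        """Average MBytes/sec when every request pays a repositioning penalty."""
        transfer_time = size / rate             # 1 MB / 75 MB/s = 13.333 ms
        return size / (switch + transfer_time)

    print(raid3_latency_revs(8))                                         # 0.875 (7/8 rev)
    print(ping_pong_throughput(CHANNEL_RATE, REQUEST_SIZE, SWITCH_TIME)) # ~30 MB/s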
It is clear that RAID 3 is not well suited for multiple process I/O (long or
short), and it is especially unsuited for any application that requires a
high I/O per second rate (IOPS) with any degree of randomness. On the other
hand, RAID 3 will deliver excellent performance for single process/single
stream long sequential I/O requests.
Note: Long "skip sequential" processing (regular stride access) is a special
case and is covered in the RAID 5 section for both types of arrays.
RAID 3 Performance Mitigation
Some users have proposed to mitigate the head movement problem and limited
I/O per second rate of RAID 3 by making only very large requests and caching
vast quantities of user data in main memory. This method has been shown to
work in some cases where there is a high degree of locality of reference and
data reuse, but is not very cost effective given the 100X cost delta between
RAM and disk storage. Most users have realized that it is much more cost
effective to use a flexible RAID 5 architecture rather than try to hide the
shortcomings of the RAID 3 architecture. They would rather use expensive
main memory for application space and data manipulation vs. allocating giant
in-memory disk caches.
It is for these reasons of overall inflexibility and cost that the industry
has gravitated to full function RAID 5 type devices, which better handle both
long and short requests, better multiplex many concurrent users, and still
deliver the very high data rates desired for long requests.
RAID 5 devices interleave data on a logical block-by-block basis. (A logical
block is usually defined in this sense as what the user application or file
system sees and expects to perform atomic I/O on.) Logical blocks may be any
size that is a multiple of the physical disk block. Most high performance
RAID 5s allow the user to specify the logical blocksize as a multiple of the
underlying physical disk block. This design allows a single logical block
read to access only a single disk. A 16-block read would access all disks
twice in an 8+P configuration. Parity is typically distributed (rotated)
across all the disks to eliminate "hot spots" when accessing parity drives.
The most common method is called "right orthogonal" parity, wherein the
parity block starts on the rightmost drive for track zero and rotates to the
left on a track or cylinder basis. Numerous other schemes have been
supported, but the right orthogonal scheme seems to be the most prevalent.
MAXSTRAT's Gen5 family of storage servers uses a variant of this method.
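As a rough illustration of distributed parity, the following Python sketch
maps a logical block number to a (disk, stripe) pair with the parity position
rotating left by one disk per stripe. It is a simplification: real
controllers, including right orthogonal implementations, rotate parity on a
track or cylinder basis and differ in detail from vendor to vendor:

    # Simplified RAID 5 mapping: logical block -> (disk index, stripe number)
    def raid5_map(logical_block, data_disks):
        """Parity starts on the rightmost disk for stripe 0 and rotates left
        by one disk per stripe; data blocks fill the remaining disks."""
        total_disks = data_disks + 1
        stripe = logical_block // data_disks
        parity_disk = (total_disks - 1 - stripe) % total_disks
        offset = logical_block % data_disks
        disk = offset if offset < parity_disk else offset + 1   # skip the parity disk
        return disk, stripe

    # In an 8+P array, a single-block read touches one disk, while a
    # 16-block read spreads its accesses across all the drives in the group.
    for block in range(16):
        print(block, raid5_map(block, 8))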
Not all RAID 5s are alike...
Within the RAID 5 category, there are two fundamental types of RAID 5s:
"entry level" arrays and "full function" arrays. The "entry" arrays are
typically found in small SCSI arrays designed for PC servers and UNIX
workstations that cannot use the data rate but want the data protection
offered by parity arrays. Larger hosts require the "full function" variety,
since they want RAID 3 speed for long requests but also want the improved
short, random block performance offered by RAID 5s.
Skip Sequential (Stride) Operations
Skip sequential operations (regular stride access) degrade the performance of
RAID 3s in direct proportion to the "stride" length. If you read one block
and skip two (a stride of 3), then your data rate will be 1/3 of the
theoretical burst rate. A stride of 4 delivers 1/4th the rate, and so on. A
RAID 5, on the other hand, will access every third disk in parallel and, in
the case of a stride of 3 mapped to an 8+P array, would sustain the full data
rate. (Disks 1, 4 & 7 hold requests 1, 2 & 3. Disks 2, 5 & 8 hold records
4, 5 & 6, and disks 3 & 6 hold records 7 & 8.) Other strides and other stripe
widths map in a similar manner but may not always deliver 100% of the
potential bandwidth due to the resultant topology of the data.
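A small Python sketch makes the stride mapping explicit; the modulo
arithmetic below is an idealized model of sequential block layout, not any
specific controller's placement policy:

    # Which data disks does a stride-S read touch in an N-wide RAID 5?
    def disks_touched(stride, data_disks, n_requests):
        """0-based data-disk index hit by each request of a strided read."""
        return [(i * stride) % data_disks for i in range(n_requests)]

    # Stride of 3 against an 8+P array: requests land on disks
    # 0, 3, 6, 1, 4, 7, 2, 5 -- all eight data disks, so the full
    # parallel rate is available (the 1-based example given in the text).
    print(disks_touched(3, 8, 8))

    # A stride of 4 against the same array reuses only disks 0 and 4,
    # so the achievable bandwidth drops accordingly.
    print(disks_touched(4, 8, 8))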
The Read-Modify-Write Cycle
Short block writes are a special case for RAID 5s. In order to keep the
parity consistent when an entire stripe is not completely updated (i.e. fewer
records are written than the width of the stripe group), a function called
"read-modify-write" must take place. This function requires that the old data
block, the new data block and the old parity all be XOR'd together to create
the new parity block. In the worst case, this can cause four (4) disk
operations (2 reads/2 writes) to logically perform a single block write.
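The parity update itself is a simple three-way XOR. The sketch below is a
hypothetical helper, with byte strings standing in for disk blocks; it shows
the computation and the four disk operations of the worst case:

    # Read-modify-write parity update for a single-block RAID 5 write
    def new_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
        """XOR the old data out of the parity and the new data into it."""
        return bytes(o ^ n ^ p for o, n, p in zip(old_data, new_data, old_parity))

    # Worst case, four disk operations stand behind one logical write:
    #   1. read old data block       2. read old parity block
    #   3. write new data block      4. write new parity block
    old_data   = bytes([0x0F] * 8)
    new_data   = bytes([0xF0] * 8)
    old_parity = bytes([0xAA] * 8)
    print(new_parity(old_data, new_data, old_parity).hex())   # '55' repeated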
Modern RAID 5s tend to have built-in caching to hold the old data in internal
cache and possibly pre-read the old parity so that, when the new write block
is sent to the unit, all the necessary disk blocks are in cache and do not
have to be fetched to do the three-way XOR. The MAXSTRAT Gen5 does this via
24 MBytes of write-behind cache and up to 96 MBytes of read-ahead/
store-through cache. Obviously, the cache hit ratio is paramount to the
performance of a RAID 5 when there are many short, random write updates.
As a second rule of thumb, high performance, full function RAID 5s are
capable of sustaining between 500 and 2000 short random disk read operations
a second. The Gen5 is at the high end of this range. Note: These numbers
represent actual disk reads, not cache hits. Cache hits typically are
higher. Short writes operate at about half this speed. Long writes (greater
than a stripe width) perform in a manner similar to RAID 3 since parity can
be generated on the fly and complete stripes, including new parity, can be
written out in parallel.
Read to Write Ratios
Given the effect of the read-modify-write cycle, the actual read to write
I/O ratio becomes important to any user considering a RAID 5 device. Most
commercial processing shows at least a 4:1 read-to-write ratio (per an IBM
study), and scientific processing is typically even more read-heavy.
Intermediate "swap" I/Os are typically large, so the read-modify-write case
is not a concern during job execution. Now that visual rather than textual
output is the rule, final output involves few short end-of-job writes. In
summary, the read-modify-write overhead may be of less concern than
previously thought.
Multiple Processes - RAID 5
Returning to the single and multiple process examples used above in the
RAID 3 discussion, the single process case is very
similar. In a full function RAID 5 array, a long request will cause all the
disks to be accessed in parallel thus resulting in approximately the same
data rate that RAID 3 offers. (Entry level RAID 5 arrays will be gated by
the serial nature of their ability to access more than one disk.) Accessing
metadata will tend to have a significantly lesser effect on overall
performance since only one disk (vs. all the disks in the RAID 3 case) will
receive a sudden command to move its read head to a different position. If
that disk is not the next one in line to transfer data for the long read, the
head movement may be masked, because the point at which the long transfer
begins within the stripe group is statistically distributed over the duration
of the entire operation. This "mask" is statistically half
of the stripe group width since we can never predict which disk in the
stripe group actually holds the first block of the request.
In any specific case, the mask time is a function of the file system design
and metadata layout, as well as of the width of the stripe group. For
instance, file system layout may or may not attempt to issue well formed
(complete stripe group) I/O. To the extent that the metadata fetch operation
can be masked, we recognize that we have, on the average, the time it takes
to read the data off half of a stripe group to recover from a metadata
induced disk head movement. This time is a disk rotational latency period
plus the disk transfer time for one logical block of data. Assuming 64 KByte
logical disk blocks and 7200 RPM drives, you would require about 11.7 ms. to
perform this operation.
We can calculate that the average time required to return the disk head on
the metadata drive (to a position where it will be able to access user data)
will be approximately 12.2 ms. (4.2 ms. latency and 8 ms. average seek).
Note that this number is less than the RAID 3 number since latency is always
1/2 a revolution for RAID 5 arrays vs. 7/8th of a revolution in the RAID 3
example. This value tells us that in most cases a fast RAID 5 controller can
recover from a metadata fetch without impacting user data transfers.
Unfortunately, our analysis above also shows us that this "mask" window is
only available 50% of the time. However, half a loaf is better than none!
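For the curious, the two timing figures can be reconstructed with the Python
sketch below; the per-disk media rate is inferred from the 11.7 ms figure
rather than taken from any drive specification:

    # Metadata "mask" window vs. head-recovery time on a RAID 5
    RPM         = 7200
    BLOCK_BYTES = 64 * 1024        # 64 KByte logical disk block
    MEDIA_RATE  = 8.7e6            # bytes/sec per disk (assumed/inferred)
    AVG_SEEK    = 0.008            # 8 ms average seek, from the text

    half_rev    = 60.0 / RPM / 2                        # ~4.2 ms rotational latency
    mask_window = half_rev + BLOCK_BYTES / MEDIA_RATE   # ~11.7 ms
    recovery    = half_rev + AVG_SEEK                   # ~12.2 ms

    print(f"mask window ~{mask_window * 1e3:.1f} ms, recovery ~{recovery * 1e3:.1f} ms")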
Given the assumptions, we can then extrapolate that the "two process"
performance hit will occur about half of the time on a well designed RAID 5.
The data rate will not degrade to 33 MBytes/sec from 75 MBytes/sec, as in
the RAID 3 example, but only to about 54 MBytes/sec. This is quite an
improvement! A four process access will scale in the same manner. Experience
has shown these assumptions to be correct.
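A back-of-the-envelope calculation, again using the round numbers from the
examples above rather than measured rates, shows where the roughly 54
MBytes/sec figure comes from:

    # Two-process degradation: RAID 3 vs. a well designed RAID 5 (MBytes/sec)
    FULL_RATE     = 75.0    # single-stream sequential rate from the examples above
    DEGRADED_RATE = 33.0    # rate when every request pays a repositioning penalty
    HIT_FRACTION  = 0.5     # a RAID 5 pays the penalty only about half the time

    raid3_avg = DEGRADED_RATE                                                  # ~33
    raid5_avg = (1 - HIT_FRACTION) * FULL_RATE + HIT_FRACTION * DEGRADED_RATE  # ~54

    print(raid3_avg, raid5_avg)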
Given the above, it is difficult to understand why anybody would choose a
RAID 3 architecture over a full function RAID 5 design except in the case
where the user is virtually guaranteed that there will be only a single, long
process accessing sequential data. Even in that case, RAID 3 is limited
enough overall that it remains a poor choice.
Chris Wood
Email: chrisw@maxstrat.com
The examples used in this paper are not representative of any particular RAID system or vendor unless explicitly so stated. The plethora of RAID types, models, architectures and implementations available today may or may not operate in the manner(s) described herein. This paper is designed to be used as a general guide to RAID types and not as a performance warranty or implementation design document. All opinions are those of the author and do not represent MAXSTRAT Corporation in any way.