This paper will address the application and file system performance
differences between RAID 3 and RAID 5 when used in a high performance
computing environment. The reader is expected to have a working knowledge of
RAID architecture and parity management for the different RAID types.
RAID 3 typically organizes data by segmenting a user data record into either
bit- or byte-sized chunks and evenly spreading the data across "N" drives in
parallel. One of the drives is a parity drive. In this manner, every record
that is accessed is delivered at the full media rate of the "N" drives that
comprise the stripe group. The drawback is that every record I/O must access
every drive in the stripe group.
Additionally, the stripe width chosen forces the user record length to be a
multiple of each disk's underlying atomic format. Assume each disk has an
atomic 4 KByte format. Atomic format, in this sense, is the blocksize on the
disk (i.e. the data you get when you issue a single atomic read or write
operation). An 8+P array would have 32 KByte logical blocks. A 5+P would have
20 KByte logical blocks, and so on. You may not write to a RAID 3 except in
full stripe logical blocks. This limits application design flexibility and
limits the user's ability to change the physical array characteristics.
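To make the stripe arithmetic concrete, here is a minimal Python sketch; the
function names and the 4 KByte atomic format are assumptions for illustration,
not any vendor's interface. It computes the RAID 3 logical block size and
checks whether a write request is a legal full-stripe multiple:

    # RAID 3 full-stripe write constraint (illustrative sketch)
    ATOMIC_BLOCK = 4 * 1024          # assumed 4 KByte atomic disk format

    def logical_block_size(data_disks, atomic=ATOMIC_BLOCK):
        """Logical (full-stripe) block size for an N+P RAID 3 group."""
        return data_disks * atomic   # the parity disk adds no user capacity

    def is_legal_raid3_write(nbytes, data_disks, atomic=ATOMIC_BLOCK):
        """RAID 3 accepts writes only in whole full-stripe logical blocks."""
        return nbytes % logical_block_size(data_disks, atomic) == 0

    print(logical_block_size(8))                # 8+P -> 32768 bytes (32 KByte)
    print(logical_block_size(5))                # 5+P -> 20480 bytes (20 KByte)
    print(is_legal_raid3_write(64 * 1024, 8))   # True  (two full stripes)
    print(is_legal_raid3_write(40 * 1024, 8))   # False (not a stripe multiple)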
For a single long request delivering a single stream of sequential data,
RAID 3 is excellent. For multiple requests, RAID 3 performs very poorly. A
single record request (e.g. file system metadata) would tie up 100% of the
disks and, more importantly, pull the disk heads away from the actual user
data to read the metadata. Multiple requests from two or more active
processes (jobs) will only serve to exacerbate this problem. The disk heads
are being pulled from one user's data area to the other every time each
process makes a request. Each head movement requires a "seek" and a disk
latency. Often, this is on the order of 15-20 ms. Two processes typically
require four (4) movements each time an I/O request ping-pongs between the
two jobs: two metadata requests and two data transfer requests.
The resulting disk head positioning time of possibly up to 80 ms., coupled
with the latencies built into any controller, contributes to poor overall
throughput. In RAID 3 mode, disk latency is not 1/2 a revolution as it is for
a single disk, but is statistically (N-1)/N of a revolution, where N is the
width of the stripe group. For example, an 8-way stripe would have a latency
of (8-1)/8, or 7/8ths of a revolution. Additionally, on a 1.063 GBit FC-AL
loop or a 100 MBytes/sec. HIPPI channel sustaining 75 MBytes/sec. data
transfers, just two operations, each reading a megabyte of data with a 20 ms
latency to switch between them, would result in an average throughput of
about 30 MBytes/sec. (i.e. 20 ms position, 13.333 ms data transfer, 20 ms
position, 13.333 ms transfer, and so on). This example, in which we lost 45
MBytes/sec. of potential throughput, assumed NO metadata fetches. If you add
metadata I/O, assuming each metadata fetch is less than 64 KByte, the
throughput drops to between 10 and 15 MBytes/sec. Any host-induced delays
will further reduce this number. A good rule of thumb is to assume that a
good RAID 3 controller can sustain about 60-65 unique short record I/Os per
second (60 x 15 ms = 900 ms) or less. Long I/Os decrease this rate as a
function of the transfer time involved for each request.
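The figures above can be reproduced with a short back-of-the-envelope Python
sketch. The rates and switch times are the round numbers assumed in the text,
not measurements from any particular controller:

    # Two-process ping-pong on a RAID 3 (illustrative, using the text's figures)
    CHANNEL_RATE = 75.0       # MBytes/sec sustained transfer rate (assumed)
    REQUEST_SIZE = 1.0        # MBytes per read
    SWITCH_TIME  = 0.020      # seconds of seek + latency per job switch (assumed)

    def raid3_latency_revs(width):
        """Expected rotational latency of an N-wide RAID 3 stripe, in revolutions."""
        return (width - 1) / width              # vs. 1/2 revolution for a single disk

    def ping_pong_throughput(rate, size, switch):
        """Average MBytes/sec when every request pays a repositioning penalty."""
        transfer_time = size / rate             # 1 MB / 75 MB/s = 13.333 ms
        return size / (switch + transfer_time)

    print(raid3_latency_revs(8))                                         # 0.875 (7/8 rev)
    print(ping_pong_throughput(CHANNEL_RATE, REQUEST_SIZE, SWITCH_TIME)) # ~30 MB/s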
It is clear that RAID 3 is not well suited for multiple process I/O (long or
short), and it is especially unsuited for any application that requires a
high I/O per second rate (IOPS) with any degree of randomness. On the other
hand, RAID 3 will deliver excellent performance for single process/single
stream long sequential I/O requests.
Note: Long "skip sequential" processing (regular stride access) is a special
case and is covered in the RAID 5 section for both types of arrays.
RAID 3 Performance Mitigation
Some users have proposed to mitigate the head movement problem and limited
I/O per second rate of RAID 3 by making only very large requests and caching
vast quantities of user data in main memory. This method has been shown to
work in some cases where there is a high degree of locality of reference and
data reuse, but is not very cost effective given the 100X cost delta between
RAM and disk storage. Most users have realized that it is much more cost
effective to use a flexible RAID 5 architecture rather than try to hide the
shortcomings of the RAID 3 architecture. They would rather use expensive
main memory for application space and data manipulation vs. allocating giant
in-memory disk caches.
It is for these reasons of overall inflexibility and cost that the industry
has gravitated to full function RAID 5 type devices, which better handle both
long and short requests, better multiplex many concurrent users, and still
deliver the very high data rates desired for long requests.
RAID 5 devices interleave data on a logical block-by-block basis. (A logical
block is usually defined in this sense as what the user application or file
system sees and expects to perform atomic I/O on.) Logical blocks may be any
size that is a multiple of the physical disk block. Most high performance
RAID 5s allow the user to specify the logical blocksize as a multiple of the
underlying physical disk block. This design allows a single logical block
read to access only a single disk. A 16-block read would access all disks
twice in an 8+P configuration. Parity is typically distributed (rotated)
across all the disks to eliminate "hot spots" when accessing parity drives.
The most common method is called "right orthogonal" parity, wherein the
parity block starts on the rightmost drive for track zero and rotates to the
left on a track or cylinder basis. Numerous other schemes have been
supported, but the right orthogonal scheme seems to be the most prevalent.
MAXSTRAT's Gen5 family of storage servers uses a variant of this method.
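As a rough illustration of distributed parity, the following Python sketch
maps a logical block number to a (disk, stripe) pair with the parity position
rotating left by one disk per stripe. It is a simplification: real
controllers, including right orthogonal implementations, rotate parity on a
track or cylinder basis and differ in detail from vendor to vendor:

    # Simplified RAID 5 mapping: logical block -> (disk index, stripe number)
    def raid5_map(logical_block, data_disks):
        """Parity starts on the rightmost disk for stripe 0 and rotates left
        by one disk per stripe; data blocks fill the remaining disks."""
        total_disks = data_disks + 1
        stripe = logical_block // data_disks
        parity_disk = (total_disks - 1 - stripe) % total_disks
        offset = logical_block % data_disks
        disk = offset if offset < parity_disk else offset + 1   # skip the parity disk
        return disk, stripe

    # In an 8+P array, a single-block read touches one disk, while a
    # 16-block read spreads its accesses across all the drives in the group.
    for block in range(16):
        print(block, raid5_map(block, 8))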
Not all RAID 5s are alike...
Within the RAID 5 category, there are two fundamental types of RAID 5s:
"entry level" arrays and "full function" arrays. The "entry" arrays are
typically found in small SCSI arrays designed for PC servers and UNIX
workstations that cannot use the data rate but want the data protection
offered by parity arrays. Larger hosts require the "full function" variety,
since they want RAID 3 speed for long requests but also want the improved
short, random block performance offered by RAID 5s.
Skip Sequential (Stride) Operations
Skip sequential operations (regular stride access) degrade the performance of
RAID 3s in direct proportion to the "stride" length. If you read one block
and skip two (a stride of 3), then your data rate will be 1/3 of the
theoretical burst rate. A stride of 4 delivers 1/4th the rate, and so on. A
RAID 5, on the other hand, will access every third disk in parallel and, in
the case of a stride of 3 mapped to an 8+P array, would sustain the full data
rate. (Disks 1, 4 & 7 hold requests 1, 2 & 3. Disks 2, 5 & 8 hold records
4, 5 & 6, and disks 3 & 6 hold records 7 & 8.) Other strides and other stripe
widths map in a similar manner but may not always deliver 100% of the
potential bandwidth due to the resultant topology of the data.
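A small Python sketch makes the stride mapping explicit; the modulo
arithmetic below is an idealized model of sequential block layout, not any
specific controller's placement policy:

    # Which data disks does a stride-S read touch in an N-wide RAID 5?
    def disks_touched(stride, data_disks, n_requests):
        """0-based data-disk index hit by each request of a strided read."""
        return [(i * stride) % data_disks for i in range(n_requests)]

    # Stride of 3 against an 8+P array: requests land on disks
    # 0, 3, 6, 1, 4, 7, 2, 5 -- all eight data disks, so the full
    # parallel rate is available (the 1-based example given in the text).
    print(disks_touched(3, 8, 8))

    # A stride of 4 against the same array reuses only disks 0 and 4,
    # so the achievable bandwidth drops accordingly.
    print(disks_touched(4, 8, 8))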
The Read-Modify-Write Cycle
Short block writes are a special case for RAID 5s. In order to keep the
parity consistent when an entire stripe is not completely updated (i.e. fewer
records are written than the width of the stripe group), a function called
"read-modify-write" must take place. This function requires that the old data
block, the new data block and the old parity all be XOR'd together to create
the new parity block. In the worst case, this can cause four (4) disk
operations (2 reads/2 writes) to logically perform a single block write.
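The parity update itself is a simple three-way XOR. The sketch below is a
hypothetical helper, with byte strings standing in for disk blocks; it shows
the computation and the four disk operations of the worst case:

    # Read-modify-write parity update for a single-block RAID 5 write
    def new_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
        """XOR the old data out of the parity and the new data into it."""
        return bytes(o ^ n ^ p for o, n, p in zip(old_data, new_data, old_parity))

    # Worst case, four disk operations stand behind one logical write:
    #   1. read old data block       2. read old parity block
    #   3. write new data block      4. write new parity block
    old_data   = bytes([0x0F] * 8)
    new_data   = bytes([0xF0] * 8)
    old_parity = bytes([0xAA] * 8)
    print(new_parity(old_data, new_data, old_parity).hex())   # '55' repeated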
Modern RAID 5s tend to have built-in caching to hold the old data in internal
cache and possibly pre-read the old parity so that, when the new write block
is sent to the unit, all the necessary disk blocks are in cache and do not
have to be fetched to do the three-way XOR. The MAXSTRAT Gen5 does this via
24 MBytes of write-behind cache and up to 96 MBytes of read-ahead/
store-through cache. Obviously, the cache hit ratio is paramount to the
performance of a RAID 5 when there are many short, random write updates.
As a second rule of thumb, high performance, full function RAID 5s are
capable of sustaining between 500 and 2000 short random disk read operations
a second. The Gen5 is at the high end of this range. Note: These numbers
represent actual disk reads, not cache hits. Cache hits typically are
higher. Short writes operate at about half this speed. Long writes (greater
than a stripe width) perform in a manner similar to RAID 3 since parity can
be generated on the fly and complete stripes, including new parity, can be
written out in parallel.
Read to Write Ratios
Given the effect of the read-modify-write cycle, the actual read to write
I/O ratio becomes important to any user considering a RAID 5 device. Most
commercial processing shows at least a 4:1 read-to-write ratio (per an IBM
study), and scientific processing is typically even more read-heavy.
Intermediate "swap" I/Os are typically large, so the read-modify-write case
is not a concern during job execution. Now that visual rather than textual
output is the rule, final output involves few short end-of-job writes. In
summary, the read-modify-write overhead may be of less concern than
previously thought.
Multiple Processes - RAID 5
Returning to the single and multiple process examples used above in the
RAID 3 discussion, the single process case is very
similar. In a full function RAID 5 array, a long request will cause all the
disks to be accessed in parallel thus resulting in approximately the same
data rate that RAID 3 offers. (Entry level RAID 5 arrays will be gated by
the serial nature of their ability to access more than one disk.) Accessing
metadata will tend to have a significantly lesser effect on overall
performance since only one disk (vs. all the disks in the RAID 3 case) will
receive a sudden command to move its read head to a different position. If
that disk is not the next one in line to transfer data for the long read, the
head movement may be masked, because the point at which the long transfer
begins within the stripe group is statistically distributed over the duration
of the entire operation. This "mask" is statistically half
of the stripe group width since we can never predict which disk in the
stripe group actually holds the first block of the request.
In any specific case, the mask time is a function of the file system design
and metadata layout, as well as of the width of the stripe group. For
instance, file system layout may or may not attempt to issue well formed
(complete stripe group) I/O. To the extent that the metadata fetch operation
can be masked, we recognize that we have, on the average, the time it takes
to read the data off half of a stripe group to recover from a metadata
induced disk head movement. This time is a disk rotational latency period
plus the disk transfer time for one logical block of data. Assuming 64 KByte
logical disk blocks and 7200 RPM drives, you would require about 11.7 ms. to
perform this operation.
We can calculate that the average time required to return the disk head on
the metadata drive (to a position where it will be able to access user data)
will be approximately 12.2 ms. (4.2 ms. latency and 8 ms. average seek).
Note that this number is less than the RAID 3 number since latency is always
1/2 a revolution for RAID 5 arrays vs. 7/8th of a revolution in the RAID 3
example. This value tells us that in most cases a fast RAID 5 controller can
recover from a metadata fetch without impacting user data transfers.
Unfortunately, our analysis above also shows us that this "mask" window is
only available 50% of the time. However, half a loaf is better than none!
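For the curious, the two timing figures can be reconstructed with the Python
sketch below; the per-disk media rate is inferred from the 11.7 ms figure
rather than taken from any drive specification:

    # Metadata "mask" window vs. head-recovery time on a RAID 5
    RPM         = 7200
    BLOCK_BYTES = 64 * 1024        # 64 KByte logical disk block
    MEDIA_RATE  = 8.7e6            # bytes/sec per disk (assumed/inferred)
    AVG_SEEK    = 0.008            # 8 ms average seek, from the text

    half_rev    = 60.0 / RPM / 2                        # ~4.2 ms rotational latency
    mask_window = half_rev + BLOCK_BYTES / MEDIA_RATE   # ~11.7 ms
    recovery    = half_rev + AVG_SEEK                   # ~12.2 ms

    print(f"mask window ~{mask_window * 1e3:.1f} ms, recovery ~{recovery * 1e3:.1f} ms")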
Given the assumptions, we can then extrapolate that the "two process"
performance hit will occur about half of the time on a well designed RAID 5.
The data rate will not degrade to 33 MBytes/sec from 75 MBytes/sec, as in
the RAID 3 example, but only to about 54 MBytes/sec. This is quite an
improvement! A four process access will scale in the same manner. Experience
has shown these assumptions to be correct.
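A back-of-the-envelope calculation, again using the round numbers from the
examples above rather than measured rates, shows where the roughly 54
MBytes/sec figure comes from:

    # Two-process degradation: RAID 3 vs. a well designed RAID 5 (MBytes/sec)
    FULL_RATE     = 75.0    # single-stream sequential rate from the examples above
    DEGRADED_RATE = 33.0    # rate when every request pays a repositioning penalty
    HIT_FRACTION  = 0.5     # a RAID 5 pays the penalty only about half the time

    raid3_avg = DEGRADED_RATE                                                  # ~33
    raid5_avg = (1 - HIT_FRACTION) * FULL_RATE + HIT_FRACTION * DEGRADED_RATE  # ~54

    print(raid3_avg, raid5_avg)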
Given the above, it is difficult to understand why anybody would choose a
RAID 3 architecture over a full function RAID 5 design except in the case
where the user is virtually guaranteed that there will be only a single, long
process accessing sequential data. Even in that case, RAID 3 is limited
enough overall that it remains a poor choice.
Chris Wood
Email: chrisw@maxstrat.com
The examples used in this paper are not representative of any particular RAID system or vendor unless explicitly so stated. The plethora of RAID types, models, architectures and implementations available today may or may not operate in the manner(s) described herein. This paper is designed to be used as a general guide to RAID types and not as a performance warranty or implementation design document. All opinions are those of the author and do not represent MAXSTRAT Corporation in any way.