Software RAID for BSD: Vinum

Software RAID for BSD: Vinum By Greg Lehey

This is an abridged version of the introduction to Vinum on the LEMIS web site. See that site for more details.

Many BSD systems have storage needs which current generation disks can't fulfill by themselves: they may want more storage, more performance or more reliability than an individual disk can provide. There are several alternatives solutions to these issues, generally known by the generic term RAID (Redundant Array of Inexpensive Disks).

One solution is to use a special disk controller, called a RAID controller. This controller creates an interface to a virtual disk which shows the desired characteristics.

The problem with RAID controllers is that they are not compatible with other controllers, so special drivers are needed. Currently, only FreeBSD and NetBSD support any RAID controllers, and in each case only the DPT SmartRAID and SmartCache III and IV are supported. These are old models which are no longer in production. Drivers for newer controllers from DPT, Mylex and AMD are on their way, though.

An alternative is a ``SCSI-SCSI'' RAID controller. This kind of controller doesn't interface to the system directly, it interfaces to the SCSI bus. This means it doesn't need a special driver, but it also limits performance somewhat.

Vinum

The third alternative is software RAID, which performs the necessary virtualization in software. FreeBSD offers this functionality in Vinum, a device driver which implements virtual disk drives. Vinum is an volume manager implemented under FreeBSD. It was inspired by the VERITAS® volume manager and maintains many of the concepts of VERITAS®. Its key features are:

Vinum implements RAID-0 (striping), RAID-1 (mirroring) and RAID-5 (rotated block-interleaved parity). In RAID-5, a group of disks are protected against the failure of any one disk by an additional disk with block checksums of the other disks.
Drive layouts can be combined to increase robustness, including striped mirrors (so-called ``RAID-10'').
Vinum implements only those features which appear useful. Some commercial volume managers appear to have been implemented with the goal of maximizing the size of the spec sheet. Vinum does not implement ``ballast'' features such as RAID-4. It would have been trivial to do so, but the only effect would have been to further confuse an already confusing topic.
Volume managers initially emphasized reliability and performance rather than ease of use. The results are frequently down time due to misconfiguration, with consequent reluctance on the part of operational personnel to attempt to use the more unusual features of the product. Vinum attempts to provide an easier-to-use non-GUI interface.

Let's look again at the problems we're trying to solve:

Disks are too small

The ufs file system can theoretically span more than a petabyte (2**50 or 1,125,899,906,842,624 bytes) of storage, but no current disk drive comes close to this size. Although this problem is not as acute as it was ten years ago, there is a simple solution: the disk driver can create an abstract device which stores its data on a number of disks.

Vinum solves this problem with virtual disks, which it calls volumes, a term borrowed from VERITAS. These disks have essentially the same properties as a UNIX disk drive, though there are some minor differences. Volumes have no size limitations.

Access bottlenecks

Modern systems frequently need to access data in a highly concurrent manner. For example, wcarchive.cdrom.com maintains up to 5000 concurrent FTP sessions and transfers about 1.4 TB of data a day.

Current disk drives can transfer data sequentially at up to 30 MB/s, but this value is of little importance in an environment where many independent processes access a drive, where they may achieve only a fraction of these values. In such cases it's more interesting to view the problem from the viewpoint of the disk subsystem: the important parameter is the load that a transfer places on the subsystem, in other words the time for which a transfer occupies the drives involved in the transfer.

In any disk transfer, the drive must first position the heads, wait for the first sector to pass under the read head, and then perform the transfer. These actions can be considered to be atomic: it doesn't make any sense to interrupt them.

Consider a typical transfer of about 10 kB: the current generation of high-performance disks can position the heads in an average of 6 ms. The fastest drives spin at 10,000 rpm, so the average rotational latency (half a revolution) is 3 ms. At 30 MB/s, the transfer itself takes about 350 盜, almost nothing compared to the positioning time. In such a case, the effective transfer rate drops to a little over 1 MB/s and is clearly highly dependent on the transfer size.

The traditional and obvious solution to this bottleneck is ``more spindles'': rather than using one large disk, it uses several smaller disks with the same aggregate storage space. Each disk is capable of positioning and transferring independently, so the effective throughput increases by a factor close to the number of disks used.

The exact throughput improvement is, of course, smaller than the number of disks involved: although each drive is capable of transferring in parallel, there is no way to ensure that the requests are evenly distributed across the drives. Inevitably the load on one drive will be higher than on another.

The evenness of the load on the disks is strongly dependent on the way the data is shared across the drives. In the following discussion, it's convenient to think of the disk storage as a large number of data sectors which are addressable by number, rather like the pages in a book. The most obvious method is to divide the virtual disk into groups of consecutive sectors the size of the individual physical disks and store them in this manner, rather like taking a large book and tearing it into smaller sections. This method is called concatenation and has the advantage that the disks do not need to have any specific size relationships. It works well when the access to the virtual disk is spread evenly about its address space. When access is concentrated on a smaller area, the improvement is less marked. The following figure illustrates the sequence in which storage units are allocated in a concatenated organization.

Concatenated organization

Concatenated organization

An alternative mapping is to divide the address space into smaller, even-sized components and store them sequentially on different devices. For example, the first 256 sectors may be stored on the first disk, the next 256 sectors on the next disk and so on. After filling the last disk, the process repeats until the disks are full. This mapping is called striping or RAID-0, though the latter term is somewhat misleading: it provides no redundancy. Striping requires somewhat more effort to locate the data, and it can cause additional I/O load where a transfer is spread over multiple disks, but it can also provide a more constant load across the disks. The following figure illustrates the sequence in which storage units are allocated in a striped organization.

Striped organization

Striped organization

Vinum implements both concatenation and striping. Since it exists within the UNIX disk storage framework, it would be possible to use UNIX partitions as the building block for multi-disk plexes, but in fact this turns out to be too inflexible: UNIX disks can have only a limited number of partitions. Instead, Vinum subdivides a single UNIX partition into contiguous areas called subdisks, which it uses as building blocks for plexes.

Data integrity

The final problem with current disks is that they are unreliable. Although disk drive reliability has increased tremendously over the last few years, they are still the most likely core component of a server to fail. When they do, the results can be catastrophic: replacing a failed disk drive and restoring data to it can take days.

The traditional way to approach this problem has been mirroring, keeping two copies of the data on different physical hardware. Since the advent of the RAID levels, this technique has also been called RAID level 1 or RAID-1. Any write to the volume writes to both locations; a read can be satisfied from either, so if one drive fails, the data is still available on the other drive.

Mirroring has two problems:

The price. It requires twice as much disk storage as a non-redundant solution.
The performance impact. Writes must be performed to both drives, so they take up twice the bandwidth of a non-mirrored volume. Reads do not suffer from a performance penalty: it even looks as if they are faster.

An alternative solution is parity, implemented in the RAID levels 2, 3, 4 and 5. Of these, RAID-5 is the most interesting. As implemented in Vinum, it is a variant on a striped organization which dedicates one block of each stripe to parity of the other blocks: As implemented by Vinum, a RAID-5 plex is similar to a striped plex, except that it implements RAID-5 by including a parity block in each stripe. As required by RAID-5, the location of this parity block changes from one stripe to the next. The numbers in the data blocks indicate the relative block numbers.

RAID-5 organization

RAID-5 organization

Compared to mirroring, RAID-5 has the advantage of requiring significantly less storage space. Read access is similar to that of striped organizations, but write access is significantly slower, approximately 25% of the read performance. If one drive fails, the array can continue to operate in degraded mode: a read from one of the remaining accessible drives continues normally, but a read from the failed drive is recalculated from the corresponding block from all the remaining drives.

Vinum implements both mirroring and RAID-5. It implements mirroring by providing objects called plexes, each of which is a representation of the data in a volume. A volume may contain between one and eight plexes.

From an implementation viewpoint, it is not practical to represent a RAID-5 organization as a collection of plexes. We'll look at this issue below.

The big picture

As a result of these considerations, Vinum provides a total of four kinds of abstract storage structures:

At the lowest level is the UNIX disk partition, which Vinum calls a drive. With the exception of a small area at the beginning of the drive, which is used for storing configuration and state information, the entire drive is available for data storage.
Next come subdisks, which are part of a drive. They are used to build plexes.
A plex is a copy of the data of a volume. It is built out of subdisks, which may be organized in one of three manners:
- A concatenated plex uses the address space of each subdisk in turn.
- A striped plex stripes the data across each subdisk. The subdisks must all have the same size, and there must be at least two subdisks to distinguish it from a concatenated plex.
- Like a striped plex, a RAID-5 plex stripes the data across each subdisk. The subdisks must all have the same size, and there must be at least three subdisks, since otherwise mirroring would be more efficient.
Although a plex represents the complete data of a volume, it is possible for parts of the representation to be physically missing, either by design (by not defining a subdisk for parts of the plex) or by accident (as a result of the failure of a drive).
A volume is a collection of between one and eight plexes. Each plex represents the data in the volume, so more than one plex provides mirroring. As long as at least one plex can provide the data for the complete address range of the volume, the volume is fully functional.

RAID-5

Conceptually, RAID-5 is used for redundancy, but in fact the implementation is a kind of striping. This poses problems for the implementation of Vinum: should it be a kind of plex or a kind of volume? In the end, the implementation issues won, and RAID-5 is a plex type. This means that there are two different ways of ensuring data redundancy: either have more than one plex in a volume, or have a single RAID-5 plex. These methods can be combined.

Which plex organization?

Vinum implements only that subset of RAID organizations which make sense in the framework of the implementation. It would have been possible to implement all RAID levels, but there was no reason to do so. Each of the chosen organizations has unique advantages:

Concatenated plexes are the most flexible: they can contain any number of subdisks, and the subdisks may be of different length. The plex may be extended by adding additional subdisks. They require less CPU time than striped or RAID-5 plexes, though the difference in CPU overhead from striped plexes is not measurable. On the other hand, they are most susceptible to hot spots, where one disk is very active and others are idle.
The greatest advantage of striped (RAID-0) plexes is that they reduce hot spots: by choosing an optimum sized stripe (empirically determined to be in the order of 256 kB), the load on the component drives can be made more even. The disadvantages of this approach are (fractionally) more complex code and restrictions on subdisks: they must be all the same size, and extending a plex by adding new subdisks is so complicated that Vinum currently does not implement it. Vinum imposes an additional, trivial restriction: a striped plex must have at least two subdisks, since otherwise it is indistinguishable from a concatenated plex.
RAID-5 plexes are effectively an extension of striped plexes. Compared to striped plexes, they offer the advantage of fault tolerance, but the disadvantages of higher storage cost and significantly higher CPU overhead, particularly for writes. The code is an order of magnitude more complex than for concatenated and striped plexes. Like striped plexes, RAID-5 plexes must have equal-sized subdisks and cannot currently be extended. Vinum enforces a minimum of three subdisks for a RAID-5 plex, since any smaller number would not make any sense.

Some examples

Vinum maintains a configuration database which describes the objects known to an individual system. Initially, the user creates the configuration database from one or more configuration files with the aid of the vinum(8) utility program. Vinum stores a copy of its configuration database on each disk slice (which Vinum calls a device) under its control. This database is updated on each state change, so that a restart accurately restores the state of each Vinum object.

The configuration file

The configuration file describes individual Vinum objects. The definition of a simple volume might be:

drive a device /dev/da3h
volume myvol
  plex org concat
    sd length 512m drive a

This file describes a four Vinum objects:

The drive line describes a disk partition (drive) and its location relative to the underlying hardware. It is given the symbolic name a. This separation of the symbolic names from the device names allows disks to be moved from one location to another without confusion.
The volume line describes a volume. The only required attribute is the name, in this case myvol.
The plex line defines a plex. The only required parameter is the organization, in this case concat. No name is necessary: the system automatically generates a name from the volume name by adding the suffix .px, where x is the number of the plex in the volume. Thus this plex will be called myvol.p0.
The sd line describes a subdisk. The minimum specifications are the name of a drive on which to store it, and the length of the subdisk. As with plexes, no name is necessary: the system automatically assigns names derived from the plex name by adding the suffix .sx, where x is the number of the subdisk in the plex. Thus Vinum gives this subdisk the name myvol.p0.s0

After processing this file, vinum(8) produces the following output:

vinum -> create config1
Configuration summary

Drives:         1
Volumes:        1
Plexes:         1
Subdisks:       1

D a                     State: up       Device /dev/da3h        Avail: 2061/2573 MB (80%)

V myvol                 State: up       Plexes:       1 Size:        512 MB

P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB

S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB

This output shows the brief listing format of vinum(8). It is represented graphically in the following figure.

A simple Vinum volume

A simple Vinum volume

This figure, and the ones which follow, represent a volume, which contains the plexes, which in turn contain the subdisks. In this trivial example, the volume contains one plex, and the plex contains one subdisk.

This particular volume has no specific advantage over a conventional disk partition. It contains a single plex, so it is not redundant. The plex contains a single subdisk, so there is no difference in storage allocation from a conventional disk partition. The following sections illustrate various more interesting configuration methods.

Increased resilience: mirroring

The resilience of a volume can be increased either by mirroring or by using RAID-5 plexes. When laying out a mirrored volume, it is important to ensure that the subdisks of each plex are on different drives, so that a drive failure will not take down both plexes. The following configuration mirrors a volume:

drive b device /dev/da4h
volume mirror
  plex org concat
    sd length 512m drive a
  plex org concat
    sd length 512m drive b

In this example, it was not necessary to specify a definition of drive a again, since Vinum keeps track of all objects in its configuration database. After processing this definition, the configuration looks like:

Drives:         2
Volumes:        2
Plexes:         3
Subdisks:       3

D a                     State: up       Device /dev/da3h        Avail: 1549/2573 MB (60%)
D b                     State: up       Device /dev/da4h        Avail: 2061/2573 MB (80%)

V myvol                 State: up       Plexes:       1 Size:        512 MB
V mirror                State: up       Plexes:       2 Size:        512 MB

P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB
P mirror.p0           C State: up       Subdisks:     1 Size:        512 MB
P mirror.p1           C State: initializing     Subdisks:     1 Size:        512 MB

S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB
S mirror.p0.s0          State: up       PO:        0  B Size:        512 MB
S mirror.p1.s0          State: empty    PO:        0  B Size:        512 MB

The following figure shows the structure graphically.

A mirrored Vinum volume

A mirrored Vinum volume

In this example, each plex contains the full 512 MB of address space. As in the previous example, each plex contains only a single subdisk.

Optimizing performance

The mirrored volume in the previous example is more resistant to failure than an unmirrored volume, but its performance is less: each write to the volume requires a write to both drives, using up a greater proportion of the total disk bandwidth. Performance considerations demand a different approach: instead of mirroring, the data is striped across as many disk drives as possible. The following configuration shows a volume with a plex striped across four disk drives:

drive c device /dev/da5h
drive d device /dev/da6h
volume stripe
  plex org striped 512k
    sd length 128m drive a
    sd length 128m drive b
    sd length 128m drive c
    sd length 128m drive d

As before, it is not necessary to define the drives which are already known to Vinum. After processing this definition, the configuration looks like:

Drives:         4
Volumes:        3
Plexes:         4
Subdisks:       7

D a                     State: up       Device /dev/da3h        Avail: 1421/2573 MB (55%)
D b                     State: up       Device /dev/da4h        Avail: 1933/2573 MB (75%)
D c                     State: up       Device /dev/da5h        Avail: 2445/2573 MB (95%)
D d                     State: up       Device /dev/da6h        Avail: 2445/2573 MB (95%)

V myvol                 State: up       Plexes:       1 Size:        512 MB
V mirror                State: up       Plexes:       2 Size:        512 MB
V striped               State: up       Plexes:       1 Size:        512 MB

P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB
P mirror.p0           C State: up       Subdisks:     1 Size:        512 MB
P mirror.p1           C State: initializing     Subdisks:     1 Size:        512 MB
P striped.p1            State: up       Subdisks:     1 Size:        512 MB

S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB
S mirror.p0.s0          State: up       PO:        0  B Size:        512 MB
S mirror.p1.s0          State: empty    PO:        0  B Size:        512 MB
S striped.p0.s0         State: up       PO:        0  B Size:        128 MB
S striped.p0.s1         State: up       PO:      512 kB Size:        128 MB
S striped.p0.s2         State: up       PO:     1024 kB Size:        128 MB
S striped.p0.s3         State: up       PO:     1536 kB Size:        128 MB

This volume is represented in the following figure. The darkness of the stripes indicates the position within the plex address space: the lightest stripes come first, the darkest last.

A striped Vinum volume

A striped Vinum volume

Increased resilience: RAID-5

The alternative approach to resilience is RAID-5. A RAID-5 configuration might look like:

drive e device /dev/da6h
volume raid5
  plex org raid5 512k
    sd length 128m drive a
    sd length 128m drive b
    sd length 128m drive c
    sd length 128m drive d
    sd length 128m drive e

Although this plex has five subdisks, its size is the same as the plexes in the other examples, since the equivalent of one subdisk is used to store parity information. After processing the configuration, the system configuration is:

Drives:         5
Volumes:        4
Plexes:         5
Subdisks:       12

D a                     State: up       Device /dev/da3h        Avail: 1293/2573 MB (50%)
D b                     State: up       Device /dev/da4h        Avail: 1805/2573 MB (70%)
D c                     State: up       Device /dev/da5h        Avail: 2317/2573 MB (90%)
D d                     State: up       Device /dev/da6h        Avail: 2317/2573 MB (90%)
D e                     State: up       Device /dev/da6h        Avail: 2445/2573 MB (95%)

V myvol                 State: up       Plexes:       1 Size:        512 MB
V mirror                State: up       Plexes:       2 Size:        512 MB
V striped               State: up       Plexes:       1 Size:        512 MB
V raid5                 State: up       Plexes:       1 Size:        512 MB

P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB
P mirror.p0           C State: up       Subdisks:     1 Size:        512 MB
P mirror.p1           C State: initializing     Subdisks:     1 Size:        512 MB
P striped.p0          S State: up       Subdisks:     1 Size:        512 MB
P raid5.p0            R State: up       Subdisks:     1 Size:        512 MB

S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB
S mirror.p0.s0          State: up       PO:        0  B Size:        512 MB
S mirror.p1.s0          State: empty    PO:        0  B Size:        512 MB
S striped.p0.s0         State: up       PO:        0  B Size:        128 MB
S striped.p0.s1         State: up       PO:      512 kB Size:        128 MB
S striped.p0.s2         State: up       PO:     1024 kB Size:        128 MB
S striped.p0.s3         State: up       PO:     1536 kB Size:        128 MB
S raid5.p0.s0           State: init     PO:        0  B Size:        128 MB
S raid5.p0.s1           State: init     PO:      512 kB Size:        128 MB
S raid5.p0.s2           State: init     PO:     1024 kB Size:        128 MB
S raid5.p0.s3           State: init     PO:     1536 kB Size:        128 MB
S raid5.p0.s4           State: init     PO:     1536 kB Size:        128 MB

The following figure represents this volume graphically.

A RAID-5 Vinum volume

A RAID-5 Vinum volume

As with striped plexes, the darkness of the stripes indicates the position within the plex address space: the lightest stripes come first, the darkest last. The completely black stripes are the parity stripes.

On creation, RAID-5 plexes are in the init state: before they can be used, the parity data must be created. Vinum currently initializes RAID-5 plexes by writing binary zeros to all subdisks, though a conceivable alternative would be to rebuild the parity blocks, which would allow better recovery of crashed plexes.

Resilience and performance

With sufficient hardware, it is possible to build volumes which show both increased resilience and increased performance compared to standard UNIX partitions. Mirrored disks will always give better performance than RAID-5, so a typical configuration file might be:

volume raid10
  plex org striped 512k
    sd length 102480k drive a
    sd length 102480k drive b
    sd length 102480k drive c
    sd length 102480k drive d
    sd length 102480k drive e
  plex org striped 512k
    sd length 102480k drive c
    sd length 102480k drive d
    sd length 102480k drive e
    sd length 102480k drive a
    sd length 102480k drive b

The subdisks of the second plex are offset by two drives from those of the first plex: this helps ensure that writes do not go to the same subdisks even if a transfer goes over two drives.

The following figure represents the structure of this volume.

A mirrored, striped Vinum volume

A mirrored, striped Vinum volume

Object naming

As described above, Vinum assigns default names to plexes and subdisks, although they may be overridden. Overriding the default names is not recommended: experience with the VERITAS volume manager, which allows arbitrary naming of objects, has shown that this flexibility does not bring a significant advantage, and it can cause confusion.

Names may contain any non-blank character, but it is recommended to restrict them to letters, digits and the underscore characters. The names of volumes, plexes and subdisks may be up to 64 characters long, and the names of drives may up to 32 characters long.

Vinum objects are assigned device nodes in the hierarchy /dev/vinum. The configuration shown above would cause Vinum to create the following device nodes:

The control devices /dev/vinum/control and /dev/vinum/controld, which are used by vinum(8) and the Vinum d熤on respectively.
Device entries for each volume. These are the main devices used by Vinum, and the name corresponds to the name of the volume. Thus the configuration above would include the devices /dev/vinum/myvol, /dev/vinum/mirror, /dev/vinum/striped, /dev/vinum/raid5 and /dev/vinum/raid10.
A directory /dev/vinum/drive with entries for each drive. These entries are in fact symbolic links to the corresponding disk nodes.
A directory /dev/vinum/volume with entries for each volume. It contains subdirectories for each plex, which in turn contain subdirectories for their component subdisks.
The directories /dev/vinum/plex and /dev/vinum/sd, which contain device nodes for each plex and subdisk.

Although it is recommended that plexes and subdisks should not be allocated specific names, Vinum drives must be named. This makes it possible to move a drive to a different location and still recognize it automatically.

Creating file systems

Volumes appear to the system to be identical to disks, with one exception. Unlike UNIX drives, Vinum does not partition volumes, which thus do not contain a partition table. This has required modification to some disk utilities, notably newfs, which previously tried to interpret the last letter of a Vinum volume name as a partition identifier. For example, a disk drive may have a name like /dev/wd0a or /dev/da2h. These names represent the first partition (a) on the first (0) IDE disk (wd) and the eight partition (h) on the third (2) SCSI disk (da) respectively. By contrast, a Vinum volume may be called /dev/vinum/concat.

Normally, newfs(8) interprets the name of the disk and complains if it cannot understand it. For example:

# newfs /dev/vinum/concat
newfs: /dev/vinum/concat: can't figure out file system partition

In order to create a file system on this volume, use the -v option to newfs(8):

# newfs -v /dev/vinum/concat

Startup

Vinum stores configuration information on the disk slices in a form similar to that in the configuration files. Once you have configured Vinum, it remembers the configuration even after rebooting. You can even move the disks to other locations (giving them different IDs), and Vinum will still find the correct drives. This is the reason why drives must have a name separate from the device name. To start Vinum automatically, place this line in your /etc/rc.conf file:

start_vinum=YES

Where to get more information

This article just scrapes the surface of using Vinum. There's a lot more information in the man pages vinum(4) and vinum(8), and also on the web site. If you're interested in having Vinum on NetBSD or OpenBSD, please contact me.