How to Set Up a ZFS Root Pool Mirror in Oracle Solaris 11 Express

Mirroring the root pool with ZFS

One of the first things to do when setting up a new system is to mirror your boot disk. This protects you against system disk failures: If one of the two mirrored boot disks fails, the system can continue running from the other disk without downtime. You can even boot from the surviving mirror half and continue using the system normally, until you have replaced the failed half.

Given today's low prices for boot-drive-sized disks, this is a no-brainer for increasing your system's availability, even for a home server system.

Unfortunately, the steps required to get to a mirrored ZFS root pool are not yet a no-brainer. While there is a piece of documentation entitled How to Configure a Mirrored Root Pool, it only covers how to add a second disk to your root pool; it does not cover how to prepare and lay out a fresh disk so Solaris will accept it as a bootable second half of an rpool mirror.

And that, for historical reasons, is slightly more complicated than just saying zpool attach.

Over the weekend, I sat down and played a bit with the current Oracle Solaris 11 Express release in VirtualBox and tested, re-tested and investigated all currently necessary steps to get your root pool mirrored, including some common issues and variations.

Here's a complete, step-by-step guide with background information on how to mirror your ZFS root pool:

The Basic Plan

After a standard install of Oracle Solaris 11 Express, we'll have our system disk configured as a ZFS root pool called rpool. The rpool disk is set up as an fdisk partition with some SMI partitions (= "slices") on top. The fdisk part is for compatibility with other OSes; the SMI slicing is done in order to reserve some room on the physical disk for the boot blocks and GRUB.

This is different from a regular ZFS data disk which would normally use EFI (not fdisk) labels and no further partitioning.

So here's the basic plan on how to turn a fresh disk into an rpool mirror:

  1. First, we'll figure out what disks we have on the system and what their device names are.
  2. For x86 systems, we need to create an fdisk(1M) primary partition on the second disk (the one to be mirrored), so it uses the same partitioning technology as the original rpool disk and so it can support booting.
  3. We'll then set up an SMI label on the second disk with the same layout (slices) as on the first disk. This includes the slice that reserves space for boot blocks and GRUB and the slice that will contain the rpool's second mirror half.
  4. Now that we have sliced and diced the second disk the same way as the original rpool disk, we can use zpool attach to let ZFS mirror its data on top of it.
  5. For x86 systems, and after ZFS has finished resilvering the rpool mirror, we'll install GRUB onto the second disk so it becomes equivalent to the original rpool disk in terms of bootability.
  6. As a final step, you need to configure your BIOS or OpenBoot PROM to try booting from the second disk if the first one is not available.
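
The plan above can be sketched as a small dry-run helper for the x86 case. It only prints the commands that the following sections walk through; it does not touch any disk. The function name print_mirror_plan and its arguments are made up for this illustration, and steps 1 and 6 are left out because they are not single commands.

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) the x86 mirroring commands for two disks.
# print_mirror_plan is a hypothetical helper name, not a Solaris tool.
print_mirror_plan() {
    src=$1              # existing rpool disk, e.g. c7t0d0
    dst=$2              # fresh disk that becomes the second mirror half
    pool=${3:-rpool}

    if [ -z "$src" ] || [ -z "$dst" ] || [ "$src" = "$dst" ]; then
        echo "usage: print_mirror_plan <source-disk> <target-disk> [pool]" >&2
        return 1
    fi

    # Step 2: default Solaris2 fdisk partition on the new disk
    echo "fdisk -B ${dst}p0"
    # Step 3: copy the slice layout from the first disk
    echo "prtvtoc /dev/rdsk/${src}s0 | fmthard -s - /dev/rdsk/${dst}s0"
    # Step 4: attach the mirror (-f because of the slice 2 overlap warning)
    echo "zpool attach -f ${pool} ${src}s0 ${dst}s0"
    echo "zpool status ${pool}"
    # Step 5: run only after the resilver has finished
    echo "installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/${dst}s0"
}

print_mirror_plan c7t0d0 c7t1d0
```

Printing instead of executing is deliberate: every one of these commands is destructive if pointed at the wrong disk, so it seems safer to review the list first and then run the lines by hand.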

You see, the official documentation only covers step 4 above and lets you guess about the other steps. Here's the full sequence of steps for creating a proper mirror, in more detail:

1. Figure Out Your System's Disks

Hard drives in Solaris show up in the /dev/rdsk directory as raw devices, and the same drives with the same names show up again in /dev/dsk. The former are used to perform raw partitioning and low-level operations, while the latter are the standard way to access disks from a day-to-day point of view, such as setting up ZFS pools.

Here's a typical device name: c0t0d0s0. The naming convention is simple: Controller 0, SCSI target 0, disk 0 and Solaris slice 0.

Of course, the digits may vary and even become multi-digits in larger systems such as c12t18d5s8, but the convention is always the same.

PATA systems omit the t0 part, because PATA doesn't support "targets" like SCSI or SATA does. This will give you devices like: c0d0s0.

Sometimes, when dealing with DOS (fdisk) partitions, you'll see a p part instead of the (Solaris-specific) s part. On x86, p0 addresses the whole disk in fdisk mode, while p1 through p4 address the four DOS primary partitions.
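
To make the naming convention concrete, here is a minimal sketch that splits a device name into its parts using plain shell parameter expansion. The helper name describe_device and its output format are my own invention for this example; "target=-" marks PATA-style names without a target, and "part=none" marks names without a slice or partition suffix.

```shell
#!/bin/sh
# Sketch: split names like c0t0d0s0, c0d0s0 or c7t1d0p0 into their parts.
# describe_device is a hypothetical helper, not a Solaris command.
describe_device() {
    name=$1
    case $name in
        c[0-9]*d[0-9]*) ;;
        *) echo "unrecognized device name: $name" >&2; return 1 ;;
    esac

    rest=${name#c}
    ctrl=${rest%%[td]*}           # digits up to the first "t" or "d"
    rest=${rest#"$ctrl"}

    target=-
    case $rest in
        t*) rest=${rest#t}
            target=${rest%%d*}    # digits up to the "d"
            rest=${rest#"$target"} ;;
    esac

    rest=${rest#d}
    disk=${rest%%[sp]*}           # digits up to an optional "s" or "p"
    rest=${rest#"$disk"}
    part=${rest:-none}            # e.g. s0, p0, or "none"

    echo "controller=$ctrl target=$target disk=$disk part=$part"
}

describe_device c0t0d0s0
describe_device c12t18d5s8
describe_device c0d0s0
```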

So before we do anything, we need to figure out what disks we are dealing with, what device names they have and if they're used somewhere else already. Two commands will help us here:

  • zpool status will print information about running zpools. This should tell you what the device name for your existing root pool ("rpool") is. On my system, I get this:
    admin@s11test:~$ zpool status
      pool: rpool
     state: ONLINE
     scan: resilvered 2.61G in 0h16m with 0 errors on Sun Mar 13 21:01:06 2011
    config:
     
            NAME        STATE     READ WRITE CKSUM
            rpool       ONLINE       0     0     0
              c7t0d0s0  ONLINE       0     0     0
     
    errors: No known data errors

    This means my rpool sits on controller 7, target 0, disk 0 and slice 0.

  • The easiest, interactive way of figuring out all of your disks in the system would be the format command, but we don't want to spend time going through menus and needless interactivity. Here's a less common, but effective option: cfgadm. This command will tell you what disks we have in the system:
    admin@s11test:~$ cfgadm -s "select=type(disk)"
    Ap_Id                          Type         Receptacle   Occupant     Condition
    sata0/0::dsk/c7t0d0            disk         connected    configured   ok
    sata0/1::dsk/c7t1d0            disk         connected    configured   ok

    Not surprisingly, the second disk in our system therefore sits on target 1 of the same controller. Since cfgadm only knows about hardware, not (software) slices, it omits any "s" part.

Now we know what disks we have, which of them is used for rpool already, and which ones are available as a second mirror half for our rpool.
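
If you'd rather not compare the two outputs by eye, the same bookkeeping can be sketched with a few filters. This is only an illustration under some assumptions: the helper names pool_devices, cfgadm_disks and candidate_disks are made up, a POSIX-style awk is assumed, and the sample text is simply the output shown above.

```shell
#!/bin/sh
# Print the cXtYdZsN-style device names found in `zpool status` output (stdin).
pool_devices() {
    awk '$1 ~ /^c[0-9]+(t[0-9]+)?d[0-9]+(s[0-9]+)?$/ { print $1 }'
}

# Print the disk names from `cfgadm -s "select=type(disk)"` output (stdin).
cfgadm_disks() {
    awk '$2 == "disk" { n = split($1, parts, "/"); print parts[n] }'
}

# Print every disk from list $2 that has no device in list $1.
candidate_disks() {
    used=$1
    all=$2
    for disk in $all; do
        in_use=no
        for dev in $used; do
            case $dev in
                "$disk"|"$disk"s[0-9]*) in_use=yes ;;
            esac
        done
        if [ "$in_use" = no ]; then echo "$disk"; fi
    done
}

# Sample text copied from the outputs above.
zpool_sample='  pool: rpool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c7t0d0s0  ONLINE       0     0     0
'
cfgadm_sample='Ap_Id                          Type         Receptacle   Occupant     Condition
sata0/0::dsk/c7t0d0            disk         connected    configured   ok
sata0/1::dsk/c7t1d0            disk         connected    configured   ok
'

used=$(printf '%s\n' "$zpool_sample" | pool_devices)
all=$(printf '%s\n' "$cfgadm_sample" | cfgadm_disks)
candidate_disks "$used" "$all"
```

On a live system you would pipe the real commands into the filters instead of the sample strings, for example zpool status | pool_devices. Treat the result as a hint only: a disk that is not part of any pool may still be in use for something else.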

2. x86 only: Set Up a Single Fdisk Partition on Your Second Disk

Solaris disk partitioning works differently in the SPARC and in the x86 world:

  • SPARC: Disks are labeled using special, Solaris-specific "SMI labels". No need for special boot magic or GRUB, etc. here, as the SPARC systems' OpenBoot PROM is intelligent enough to handle the boot process by itself.
  • x86: For reasons of compatibility with the rest of the x86 world, Solaris uses a primary fdisk partition labeled Solaris2, so it can coexist with other OSes. Solaris then treats its fdisk partition as if it were the whole disk and proceeds by using an SMI label on top of that to further slice the disk into smaller partitions. These are then called "slices".
    The boot process uses GRUB, again for compatibility reasons, with a special module that is capable of booting off a ZFS root pool.

So for x86, the first thing to do now is to make sure that the disk has an fdisk partition of type "Solaris2" that spans the whole disk. For SPARC, we can skip this step.

fdisk doesn't know about Solaris slices; it only cares about DOS-style partitions. Therefore, device names are different when dealing with fdisk: we address the disk as "p0", which on x86 stands for the whole disk in fdisk mode rather than for a Solaris slice. This works even if there are no partitions defined on the disk yet; it's just a way to address the disk in DOS partition mode.

Again, we could use fdisk in interactive mode and wiggle ourselves through the menus, but I prefer the command line way. Here's how to check if your disk already has some kind of DOS partitioning:

admin@s11test:~# fdisk -W - c7t1d0p0
 
* /dev/rdsk/c7t1d0p0 default fdisk table
* Dimensions:
*    512 bytes/sector
*     63 sectors/track
*    255 tracks/cylinder
*   2088 cylinders
*
* systid:
*    1: DOSOS12
*    2: PCIXOS
*    4: DOSOS16

(lots of id specifications omitted...)

*  191: SUNIXOS2
*  238: EFI_PMBR
*  239: EFI_FS
*
 
* Id    Act  Bhead  Bsect  Bcyl    Ehead  Esect  Ecyl    Rsect      Numsect
  0     0    0      0      0       0      0      0       0          0         
  0     0    0      0      0       0      0      0       0          0         
  0     0    0      0      0       0      0      0       0          0         
  0     0    0      0      0       0      0      0       0          0

The second - tells the -W option to write to standard output instead of to a file. SUNIXOS2 (191) really means SOLARIS2. This is the partition type that we'll create soon.

Here's how to apply a default Solaris fdisk partition to a disk in one simple step:

admin@s11test:~# fdisk -B c7t1d0p0

That's it. Be careful and double-check that you got the device name right! If you're unsure, you can still use the interactive version (fdisk c7t1d0p0) and work through the menus by hand.

Now let's verify that we got what we wanted:

admin@s11test:~# fdisk -W - c7t1d0p0
 
* /dev/rdsk/c7t1d0p0 default fdisk table
* Dimensions:
*    512 bytes/sector
*     63 sectors/track
*    255 tracks/cylinder
*   2088 cylinders
*
* systid:
*    1: DOSOS12
*    2: PCIXOS
*    4: DOSOS16

(stuff omitted...)

*  191: SUNIXOS2
*  238: EFI_PMBR
*  239: EFI_FS
*
 
* Id    Act  Bhead  Bsect  Bcyl    Ehead  Esect  Ecyl    Rsect      Numsect
  191   128  0      1      1       254    63     1023    16065      33527655  
  0     0    0      0      0       0      0      0       0          0         
  0     0    0      0      0       0      0      0       0          0         
  0     0    0      0      0       0      0      0       0          0

Here's the fdisk partition we wanted. Its type is 191, which corresponds to SOLARIS2 (you can double-check using the interactive version of fdisk), and it spans the whole disk.
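
The numbers in that table can be cross-checked with a little arithmetic. The sketch below uses the geometry and the partition entry exactly as printed above; the helper has_solaris2 is a made-up name for a filter that looks for a non-empty entry of type 191 in fdisk -W output, and it assumes a POSIX-style awk.

```shell
#!/bin/sh
# Geometry and partition entry copied from the fdisk -W output above.
heads=255          # "tracks/cylinder"
sectors=63         # "sectors/track"
cylinders=2088
rsect=16065        # Rsect: first sector of the partition
numsect=33527655   # Numsect: number of sectors in the partition

cyl_size=$((heads * sectors))       # sectors per cylinder
start_cyl=$((rsect / cyl_size))
span_cyls=$((numsect / cyl_size))

echo "sectors per cylinder: $cyl_size"
echo "partition starts at cylinder $start_cyl"
echo "partition covers $span_cyls of $cylinders cylinders"

# Hypothetical helper: succeeds if fdisk -W output on stdin contains a
# non-empty entry of type 191 (SUNIXOS2).
has_solaris2() {
    awk '$1 == 191 && $10 > 0 { found = 1 } END { exit !found }'
}
```

With these values the partition starts one cylinder (16065 sectors) into the disk and covers the remaining 2087 cylinders, which is what "spans the whole disk" means here. On a live system, the check would be used as fdisk -W - c7t1d0p0 | has_solaris2 && echo "Solaris2 partition present".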

3. Set Up an SMI Label With the Same Partitioning on the Second Disk

Before ZFS can do its magic, we need to tell it where on the disk the rpool's mirror is supposed to be, and what blocks are off-limits because they're supposed to host the GRUB bootloader. This is done by using a Solaris SMI label that breaks down our Solaris2 fdisk partition into Solaris "slices".

Again, there's an interactive possibility using the format command, which involves many interactive steps (print out the original disk's layout, set it up step by step on the second disk, write the label), but we want to be cool here, so we'll do it in a single step, again:

admin@s11test:~# prtvtoc /dev/rdsk/c7t0d0s0 | fmthard -s - /dev/rdsk/c7t1d0s0
fmthard:  New volume table of contents now in place.

That's it. You can check what the new Solaris-style partitioning looks like on the second disk and compare it to the first one. Here's my first disk:

admin@s11test:~# format
Searching for disks...done
 
 
AVAILABLE DISK SELECTIONS:
       0. c7t0d0 <ATA    -VBOX HARDDISK  -1.0  cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,2829@d/disk@0,0
       1. c7t1d0 <ATA    -VBOX HARDDISK  -1.0  cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,2829@d/disk@1,0
Specify disk (enter its number): 0
selecting c7t0d0
[disk formatted]
/dev/dsk/c7t0d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
 
 
FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        fdisk      - run the fdisk program
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> p
 
 
PARTITION MENU:
        0      - change `0' partition
        1      - change `1' partition
        2      - change `2' partition
        3      - change `3' partition
        4      - change `4' partition
        5      - change `5' partition
        6      - change `6' partition
        7      - change `7' partition
        select - select a predefined table
        modify - modify a predefined partition table
        name   - name the current table
        print  - display the current table
        label  - write partition map and label to the disk
        !<cmd> - execute <cmd>, then return
        quit
partition> p
Current partition table (original):
Total disk cylinders available: 2085 + 2 (reserved cylinders)
 
Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       1 - 2084       15.96GB    (2084/0/0) 33479460
  1 unassigned    wm       0               0         (0/0/0)           0
  2     backup    wu       0 - 2084       15.97GB    (2085/0/0) 33495525
  3 unassigned    wm       0               0         (0/0/0)           0
  4 unassigned    wm       0               0         (0/0/0)           0
  5 unassigned    wm       0               0         (0/0/0)           0
  6 unassigned    wm       0               0         (0/0/0)           0
  7 unassigned    wm       0               0         (0/0/0)           0
  8       boot    wu       0 -    0        7.84MB    (1/0/0)       16065
  9 unassigned    wm       0               0         (0/0/0)           0
 
partition> q

And here's my second disk:

admin@s11test:~# format
Searching for disks...done
 
 
AVAILABLE DISK SELECTIONS:
       0. c7t0d0 <ATA    -VBOX HARDDISK  -1.0  cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,2829@d/disk@0,0
       1. c7t1d0 <ATA    -VBOX HARDDISK  -1.0  cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,2829@d/disk@1,0
Specify disk (enter its number): 1
selecting c7t1d0
[disk formatted]
 
 
FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        fdisk      - run the fdisk program
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> p
 
 
PARTITION MENU:
        0      - change `0' partition
        1      - change `1' partition
        2      - change `2' partition
        3      - change `3' partition
        4      - change `4' partition
        5      - change `5' partition
        6      - change `6' partition
        7      - change `7' partition
        select - select a predefined table
        modify - modify a predefined partition table
        name   - name the current table
        print  - display the current table
        label  - write partition map and label to the disk
        !<cmd> - execute <cmd>, then return
        quit
partition> p
Current partition table (original):
Total disk cylinders available: 2085 + 2 (reserved cylinders)
 
Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       1 - 2084       15.96GB    (2084/0/0) 33479460
  1 unassigned    wu       0               0         (0/0/0)           0
  2     backup    wu       0 - 2084       15.97GB    (2085/0/0) 33495525
  3 unassigned    wu       0               0         (0/0/0)           0
  4 unassigned    wu       0               0         (0/0/0)           0
  5 unassigned    wu       0               0         (0/0/0)           0
  6 unassigned    wu       0               0         (0/0/0)           0
  7 unassigned    wu       0               0         (0/0/0)           0
  8       boot    wu       0 -    0        7.84MB    (1/0/0)       16065
  9 unassigned    wu       0               0         (0/0/0)           0
 
partition> q

Note: This is a typical x86 layout. It's likely different on SPARC systems as they don't use a special slice for boot block hosting. But the basic idea on how to replicate the partition table is the same.
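
If clicking through format twice feels tedious, the comparison can also be sketched non-interactively. The helpers below are hypothetical names of my own; they assume that prtvtoc marks its header lines with a leading "*" and prints the slice number, first sector and sector count in columns 1, 4 and 5, and that a POSIX-style awk is available.

```shell
#!/bin/sh
# Reduce prtvtoc output (stdin) to "slice first-sector sector-count" lines.
vtoc_layout() {
    awk '$1 !~ /^\*/ && NF >= 5 { print $1, $4, $5 }' | sort -n
}

# Succeed if two saved prtvtoc outputs describe the same slice layout.
same_layout() {
    [ "$(vtoc_layout < "$1")" = "$(vtoc_layout < "$2")" ]
}

# Intended use on a live system (not run here):
#   prtvtoc /dev/rdsk/c7t0d0s0 > /tmp/disk0.vtoc
#   prtvtoc /dev/rdsk/c7t1d0s0 > /tmp/disk1.vtoc
#   same_layout /tmp/disk0.vtoc /tmp/disk1.vtoc && echo "layouts match"
```

Only the slice geometry is compared. Tags and flags are ignored on purpose, since the flags of the unassigned slices differ between the two disks in the format output above without affecting the layout.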

Great! We're almost there.

4. Set Up the ZFS Rpool Mirror

Now that our second disk is prepared, the rest is quite easy. From now on, we can just follow the standard Solaris documentation for mirroring the root pool.

The right command to use here is zpool attach. Notice that this is different from zpool add: By attaching a disk to an existing disk, we mean attaching it to its mirror (you can attach more than one disk to a mirror). By adding a disk to a pool, we mean expanding the pool size in the sense of striping in another disk (or sets of mirrored/RAID-Z disks). For mirroring, zpool attach is the way to go. Remember? Slice 0 is the one we reserved for the rpool's mirrored data:

admin@s11test:~# zpool attach rpool c7t0d0s0 c7t1d0s0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c7t1d0s0 overlaps with /dev/dsk/c7t1d0s2

Wait, what happened? ZFS is complaining that two slices are overlapping. If ZFS uses slice 0, and something else uses slice 2, it may overwrite some of ZFS' data!

In this particular case, ZFS' worries are unfounded: Slice 2 by convention spans the whole disk and is named "backup" (see the output of format above), so traditional disk backup solutions have a way of easily performing raw backups of whole disks. Today it's hardly used, but the convention remains for historical reasons.

Therefore, we can safely override this little nit and get our mirror done:

admin@s11test:~# zpool attach -f rpool c7t0d0s0 c7t1d0s0
Make sure to wait until resilver is done before rebooting.
admin@s11test:~# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Tue Mar 15 18:17:32 2011
    13.9M scanned out of 2.72G at 594K/s, 1h19m to go
    13.3M resilvered, 0.50% done
config:
 
        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c7t0d0s0  ONLINE       0     0     0
            c7t1d0s0  ONLINE       0     0     0  (resilvering)
 
errors: No known data errors

Great! Everything's working fine now. Before we make the second disk bootable, we should really wait until it has finished resilvering. We don't want to boot into a half-baked root pool, do we?
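
Waiting can be scripted, too. Here is a minimal sketch that polls zpool status until the "resilver in progress" line from the output above disappears. The function names and the 60-second interval are arbitrary choices of mine, and the wording of the status line is taken from this release's output, so it may need adjusting elsewhere.

```shell
#!/bin/sh
# Succeeds while the `zpool status` text on stdin reports a running resilver.
resilver_running() {
    grep -q 'resilver in progress'
}

# Block until the given pool (default: rpool) is no longer resilvering.
wait_for_resilver() {
    pool=${1:-rpool}
    interval=${2:-60}     # seconds between checks
    while zpool status "$pool" | resilver_running; do
        sleep "$interval"
    done
}

# Intended use on a live system (not run here):
#   wait_for_resilver rpool && zpool status rpool
```

Note that the loop only tells you the resilver has stopped, not that it succeeded, so it is still worth reading the final zpool status output for errors.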

Here's the end state, freshly resilvered:

admin@s11test:~# zpool status
  pool: rpool
 state: ONLINE
 scan: resilvered 2.72G in 0h15m with 0 errors on Tue Mar 15 18:33:23 2011
config:
 
        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c7t0d0s0  ONLINE       0     0     0
            c7t1d0s0  ONLINE       0     0     0
 
errors: No known data errors

5. x86 only: Make the Second Mirror Half Bootable

Since x86 systems depend on a bootloader that is installed on disk, we need to perform a final step so that the system can boot off the second disk, too, in case the first one fails completely.

This is a simple install of GRUB onto the second disk. GRUB, ZFS and Solaris will then figure it out automatically in case you have to boot from the second disk instead of the original one.

admin@s11test:~# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c7t1d0s0
stage2 written to partition 0, 277 sectors starting at 50 (abs 16115)
stage1 written to partition 0 sector 0 (abs 16065)

Since we're dealing with a low-level operation (boot blocks etc.), we want to address the devices using the raw device paths. The s0 part is still needed so GRUB knows what slice to boot from.
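
As a small cross-check of the installgrub output: the "abs" values look like sector numbers counted from the start of the disk, that is, the start of the Solaris2 fdisk partition (Rsect 16065 in the fdisk table from step 2) plus the offset inside it. That reading is my interpretation of the numbers, not something taken from the installgrub documentation.

```shell
#!/bin/sh
# Values copied from the fdisk table and the installgrub output above.
part_start=16065   # Rsect of the Solaris2 fdisk partition
stage1_rel=0       # "stage1 written to partition 0 sector 0"
stage2_rel=50      # "stage2 written to partition 0, ... starting at 50"

echo "stage1 abs: $((part_start + stage1_rel))"
echo "stage2 abs: $((part_start + stage2_rel))"
```

The two results line up with the "abs 16065" and "abs 16115" values that installgrub reported.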

Almost done!

6. Add the Second Disk to the BIOS' or OpenBoot PROM's List of Bootable Devices

This is one of the little things that often gets overlooked but then becomes critical in case of a real failure: The system crashes because the first disk is completely borked, or you force a reboot and the first disk fails to come up again. How does the system know it's supposed to boot from the second half of the mirror?

Managing the Solaris boot behavior and its mechanism is described thoroughly in the documentation:

  • SPARC: Here you usually set up aliases for your bootable mirror halves in the OpenBoot PROM, then assign them to the boot-device variable as a list of possible devices to boot from (e.g.: "disk1 disk2 net"). Check out the SPARC Enterprise Servers section of the Oracle System Documentation area, find the administration guide for your particular system, then consult the sections on booting.
  • x86: Most BIOSes have a section where you can configure what disks to boot from, in what order, and what to do if a disk is not bootable. Here's a list of current Oracle Sun x86 system documentation. Again, look for the boot section of your system's admin manual.

Play With It! And Check Out Some Man Pages!

How do you know if this really works? How do you develop confidence for something critical like booting from a second mirror half, surviving a disk disaster, etc.?

Here's the easiest option: Use VirtualBox to set up a test system like I did. It comes with ready-to-use suggestions for a standard Solaris machine. Then, configure a second virtual disk and play with the commands above. Set up a mirrored rpool, bring down the machine, unconfigure the original disk, then see if it can boot from the second mirror half, and so on.

BTW: I did not find a way to tell VirtualBox what disk to boot from (it only lets you specify what type of device to boot from, not which individual disk), so I resorted to just pulling out (figuratively speaking) the original boot disk, then testing if it boots from the mirrored one.

In short: Play, experiment, break it, etc., until you know what's going on and are confident to make it happen on your real system.

Finally, here's a list of useful man pages to check out, including links:

I hope this article has made rpool mirroring a little easier for you from now on!

Your Take

There are endless variations to the above, and sometimes I've been more verbose, or more simplified for the sake of ease-of-use. I'm sure there are many different ways to achieve the same result, so here's your chance to share your favorite mirrored rpool tricks!

What's your routine for mirroring rpools? Did you find other good tutorials to share? ('cause I didn't, at least nothing obvious in Google...) What are your preferred rpool mirroring tricks?

Feel free to write a comment!

Update: Wow, this article got a lot of comments, thank you! Make sure you check them out as they contain a lot of useful additional information.

Thank you for reading Constant Thinking.