This post is obsolete

This blog post is quite old, and a lot has changed since then. Meanwhile, I have changed employer, blogging platform, software stack, infrastructure, interests, and more.
You will probably find more recent and relevant information about the topics discussed here elsewhere.
Still, this content is provided here for historical reasons, but please don’t expect it to be current or authoritative at this point.
Thanks for stopping by!


A Closer Look at ZFS, Vdevs and Performance

[Image: vdevs.jpg]

When looking at the mails and comments I get about my ZFS optimization and my RAID-Greed posts, the same type of question tends to pop up over and over again. Here’s an example from a reader email: “I was reading about ZFS on your blog and you mention that if I do a 6 drive array for example, and a single RAID-Z the speed of the slowest drive is the maximum I will be able to achieve, now I thought that ZFS would be better in terms of speed. Please let me know if there is a newer ZFS version that improved this or if it does not apply anymore.” This is just an example, but the basic theme is the same for many of the reactions I see: Many people think that RAID-Z will always give them good performance and are surprised when it doesn’t, assuming it’s a software, an OpenSolaris or a ZFS issue.

In reality, it’s just pure logic and physics, and to understand that we should look a little closer at what vdevs are in ZFS and how they work.

Before we start, a quick TOC:

  • What is a Vdev?

  • Vdev examples

  • Single Disks

  • Mirrors

  • RAID-Z

  • Vdev Performance Summary

  • Putting Vdevs Together

  • Theory vs. Practice

  • The Fastest Vdev

  • Keeping the Cake and Eating it Too

  • Conclusion

And now let’s dive in by looking at vdevs:

What is a Vdev?

Another reader pointed out that I should define vdevs as briefly and simply as possible, so here we go:

A ZFS vdev (aka “virtual device”) is either:

  • a single disk, or

  • two or more disks that are mirrored, or

  • a group of disks that are organized using RAID-Z.

There are also special kinds of vdevs like hot-spares, ZIL or cache devices, etc. but we’ll leave that to another post. Look up the full definition in the zpool (1M) (no link, sun.com no longer exists) man page.

So there you have it: A disk, a mirror or a RAID-Z group.

A ZFS pool is always a stripe of one or more vdevs that supplies blocks for your ZFS file systems to store data in.

Vdev examples

The following typical command creates a new pool out of one vdev which is a mirror of two disks:

  zpool create tank1 mirror c0t0d0 c0t1d0

And this example creates a new pool out of two vdevs that are RAID-Z groups with two data disks and one parity disk each:

  zpool create tank2 raidz c0t0d0 c0t1d0 c0t2d0 raidz c0t3d0 c0t4d0 c0t5d0

And when you decide to grow your pool, you just add one or more vdevs to it:

  zpool add tank1 mirror c0t2d0 c0t3d0
  zpool add tank2 raidz c0t6d0 c0t7d0 c0t8d0

You see: Whenever the keywords “mirror”, “raidz”, “raidz2”, “raidz3” etc. show up in a zpool command, a new vdev is created out of the disks that follow the keyword.

And in all cases, a ZFS pool is always a stripe of vdevs, no matter what.

Since a vdev can be three different things, let’s look at them in more detail:

Single Disks

The simplest of vdevs is a single disk. Not much to say here other than whenever we say “disk”, it could be a physical disk, an SSD, a LUN in a SAN, a USB stick or even a simple file sitting on another filesystem (useful for testing and demo purposes): Anything that looks like a block device in Solaris can be used as a “disk” in vdevs.
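
For example, here’s a minimal sketch of how you could build a throwaway test pool out of plain files (the paths and sizes below are just placeholders for illustration):

  # Create four 256 MB files to act as pretend disks (absolute paths are required).
  mkfile 256m /var/tmp/vdisk1 /var/tmp/vdisk2 /var/tmp/vdisk3 /var/tmp/vdisk4

  # Build a test pool out of them; ZFS treats each file like a block device.
  zpool create testpool /var/tmp/vdisk1 /var/tmp/vdisk2 /var/tmp/vdisk3 /var/tmp/vdisk4

  # Inspect it, play with it, then throw it away when done.
  zpool status testpool
  zpool destroy testpool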

Single disks don’t offer any protection: If you lose a vdev that is a single disk, all data on it is lost.

Single disks also don’t offer any special performance characteristics, you’ll get the IOPS and bandwidth that the physics of the disk gives you, not more, not less.

The other vdev types combine single disks into groups of disks:

Mirrors

ZFS mirrored vdevs function similarly to traditional mirrors: Two or more disks can form a mirror and data is replicated across them in an identical fashion.

If you have a mirror vdev with n disks, you can lose n-1 disks and ZFS will still be able to recover all data from this vdev. Same as with RAID controllers or traditional volume managers.

But ZFS can do even more: If any block is corrupted and the hardware didn’t notice (aka silent block errors), ZFS will still detect the error (through the block’s checksum) and correct it using the other disk’s copy of the block (assuming it was stored correctly).

This works even if both halves of the mirror are affected by data corruption, as long as for any given block there is at least one disk in the mirror group that carries the correct version of that block.

Mirrors in ZFS are always more robust than mirrors in traditional RAID systems.
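
If you already have a pool, you can also widen one of its vdevs by attaching another disk to it. A small sketch, reusing the tank1 example from above and a made-up disk name:

  # Attach c0t9d0 to the vdev that contains c0t0d0, turning that two-way
  # mirror into a three-way mirror. ZFS resilvers the new disk in the background.
  zpool attach tank1 c0t0d0 c0t9d0

  # Watch the resilver progress.
  zpool status tank1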

Mirrored Performance Considerations

When writing to mirrored vdevs, ZFS will write in parallel to all of the mirror disks at once and return from the write operation when all of the individual disks have finished writing. This means that for writes, each mirrored vdev will have the same performance as its slowest disk. This is true for both IOPS (number of IO operations per second) and bandwidth (MB/s).

When reading from mirrored vdevs, ZFS will read blocks off the mirror’s individual disks in a round-robin fashion, thereby increasing both IOPS and bandwidth performance: You’ll get the combined aggregate IOPS and bandwidth performance of all disks.

But from the point of view of a single application that issues a single write or read IO, then waits until it is complete, the time it will take will be the same for a mirrored vdev as for a single disk: Eventually it has to travel down one disk’s path, there’s no shortcut for that.

RAID-Z

RAID-Z vdevs are a variant of RAID-5 and RAID-6 with interesting properties:

  • You can choose the number of data disks and the number of parity disks. Today, the number of parity disks is limited to 3, but this may become larger in the future.

  • Each data block that is handed over to ZFS is split up into its own stripe of multiple disk blocks at the disk level, across the RAID-Z vdev. This is important to keep in mind: Each individual I/O operation at the filesystem level will be mapped to multiple, parallel and smaller I/O operations across members of the RAID-Z vdev.

  • When writing to a RAID-Z vdev, ZFS may choose to use less than the maximum number of data disks. For example, you may be using a 3+2 (5 disks) RAID-Z2 vdev, but ZFS may choose to write a block as 2+2 because it fits better.

  • Write transactions in ZFS are always atomic, even when using RAID-Z: Each write transaction is only considered complete once the überblock has been successfully written to disk. This means there’s no way to suffer from the traditional RAID-5 write hole, in which a power failure can leave behind a partially written (and therefore broken) RAID-5 stripe.

  • Due to the copy-on-write nature of ZFS, there’s no read-modify-write cycle for changing blocks on disk: ZFS writes are always full stripe writes to free blocks. This allows ZFS to choose blocks that are in sequence on the disk, essentially turning random writes into sequential writes, maximizing disk write capabilities.

  • Since all writes are atomic and since they naturally map to sequential writes, there’s no need for a battery-backed cache with RAID-Z: There’s no possibility of an inconsistent state on disk, and ZFS already writes at maximum disk write speed. This saves money and lets you leverage cheap disks.

Just like traditional RAID-5, you can lose as many disks as you have parity disks without losing any data. And just like with ZFS mirroring, ZFS can try to reconstruct each filesystem-level block out of partially working disks, as long as it can find enough intact disk blocks to reconstruct the original RAID-Z stripe.
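
To make the parity choices concrete, here’s a quick sketch of creating vdevs with one, two and three parity disks (each command is an independent example with placeholder disk names):

  # 4 data disks + 1 parity disk per vdev (single parity, comparable to RAID-5)
  zpool create tank3 raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0

  # 4 data disks + 2 parity disks per vdev (double parity, comparable to RAID-6)
  zpool create tank4 raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

  # 4 data disks + 3 parity disks per vdev (triple parity)
  zpool create tank5 raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0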

ZFS RAID-Z is always better than RAID-5, RAID-6 or other RAID schemes on traditional RAID controllers.

RAID-Z Performance Considerations

When writing to RAID-Z vdevs, each filesystem block is split up into its own stripe across (potentially) all devices of the RAID-Z vdev. This means that each write I/O will have to wait until all disks in the RAID-Z vdev are finished writing. Therefore, from the point of view of a single application waiting for its IO to complete, you’ll get the IOPS write performance of the slowest disk in the RAID-Z vdev.

Granted: A large IO may be broken down into smaller IOs across the data disks and they’ll take less time to complete. But the seek time of the disk outweighs the actual write time, so the size of the IO is not of much significance.

For write bandwidth, you may get more: Large blocks at the file system level are split into smaller blocks at the disk level that are written in parallel across the vdev’s individual data disks and therefore you may get up to n times an individual disk’s bandwidth for an n+m type RAID-Z vdev.

Unfortunately, in practice, the cases where performance matters most (mail servers, file servers, iSCSI storage for virtualization, database servers) also happen to be the cases that care a lot about IOPS performance and not so much about bandwidth performance.

When reading from RAID-Z vdevs, the same rules apply, as the process is essentially reversed (no round robin shortcut like in the mirroring case): Better bandwidth if you’re lucky (and read the same way as you’ve written) and a single disk’s IOPS read performance in the majority of cases that matter.

I hope that it’s now starting to become apparent why copying a single large file from one directory to another is not a realistic way to measure performance, if ultimately one is interested in delivering storage to a virtualization server over NFS or iSCSI :).
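
If you want a more realistic picture of what your pool is actually doing, one simple (and admittedly rough) approach is to watch per-vdev operations and bandwidth while your real application is running, instead of timing a single file copy:

  # Print read/write operations and bandwidth per vdev every 5 seconds.
  zpool iostat -v tank2 5

  # For per-device service times and queue lengths, the regular iostat helps, too.
  iostat -xn 5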

Vdev Performance Summary

Mirrored vdevs outperform RAID-Z vdevs in all cases that matter.

The only case where RAID-Z performance can beat mirrored performance for the same number of disks is when you look at sequential bandwidth. Good for a tape-replacement archive, but not exactly the majority of real-world performance situations.

Is that all there is to say about performance? No! Vdevs are only the beginning:

Putting Vdevs Together

ZFS pools are always created out of one or more vdevs. When using more than one vdev, they’re always striped.

Striping is good: ZFS will send reads and writes down to all the vdevs in parallel, maximizing throughput whenever it can.

Simply said: The more vdevs you stripe together, the faster your pool becomes in terms of aggregate bandwidth and aggregate IOPS, for both reads and writes.

Notice the caveat involved in the little word “aggregate”: Your single little app waiting for its single IO to finish won’t see a shorter wait time if your pool has many vdevs, because it’ll get assigned only one of them.

“Aggregate” here means that the server as a whole will be able to sustain a higher total load of IOPS and bandwidth, across multiple parallel IO streams, when using a pool with many vdevs.

How much higher? For n vdevs that make up your pool, it will be able to deliver n times the IOPS and n times the bandwidth of a single vdev (assuming all are equal etc.).
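
For example, a pool built out of three mirrored vdevs in one go might look like this (placeholder disk names again); ZFS will then stripe all reads and writes across the three mirrors:

  # Three two-way mirrors, striped together: roughly three times the aggregate
  # IOPS and bandwidth of a single mirror vdev.
  zpool create tank6 mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0 mirror c2t4d0 c2t5d0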

Theory vs. Practice

Now, all of this is just theory. It’s based on logical considerations of the way ZFS handles data to and from disks, so it tells you what performance to expect in the worst case, which typically means lots of random read or write IO operations.

If you want the raw, unfiltered practical data, down to milliseconds and RPMs, check Richard Elling’s excellent post ZFS RAID recommendations: space, performance, and MTTDL (no link, sun.com no longer exists). Lots of good, foundational stuff.

In practice, you’ll almost always see better numbers, because of many possible reasons:

  • You were lucky because the disk head happened to have been near the position it needed to be.

  • You were lucky and your app uses large IOs that were split up into multiple smaller IOs by the system which could be handled in parallel.

  • You were lucky and your app uses some portion of asynchronous IO operations so the system could take advantage of caching and other optimizations that rely on async IO.

  • You were lucky and your app’s performance is more dependent on disk bandwidth than latency.

  • You were lucky and your app has a bottleneck elsewhere.

  • You’re benchmarking your ZFS pool in a way that has nothing to do with real-world performance.

So, if you see better numbers, be thankful for your luck. If it’s because of the last case, use a more realistic approach to measure your disk performance.

The Fastest Vdev

So if you have a number n of disks to create a single vdev with, what would be the fastest vdev?

This one is tricky. For writes, the situation is as follows:

  • Mirroring n disks will give you a single disk’s IOPS and bandwidth performance.

  • RAID-Z with one parity drive will give you a single disk’s IOPS performance, but n-1 times aggregate bandwidth of a single disk.

Does it make RAID-Z the winner? Let’s check reads:

  • Mirroring n disks will give you n times a single disk’s IOPS and bandwidth read performance. And on top, each disk gets to position its head independently of the others, which will help random read performance for both IOPS and bandwidth.

  • RAID-Z would still give you a single disk’s performance for IOPS, but n-1 times aggregate bandwidth of a single disk. This time, though, the reads need to be correlated, because ZFS can only read groups of blocks across the disks that are supposed to hang out together. Good for sequential reads, bad for random reads.

Assuming a workload that cares about both reads and writes, and assuming that when the going gets tough, IOPS is more important than bandwidth, I’d opt for mirroring whenever I can.
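
As a rough back-of-the-envelope illustration (assuming, say, about 200 random IOPS per disk as a rule of thumb, not a measurement), six disks could shake out like this:

  # One 6-disk RAID-Z vdev (5 data + 1 parity):
  #   random writes: ~200 IOPS   (1 vdev; every write waits for the whole stripe)
  #   random reads:  ~200 IOPS   (reads are correlated across the stripe)
  #
  # Three 2-way mirror vdevs (3 x 1+1):
  #   random writes: ~600 IOPS   (3 vdevs x 1 disk's IOPS each)
  #   random reads:  ~1200 IOPS  (3 vdevs x 2 disks serving reads independently)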

Keeping the Cake and Eating it Too

So is that it? Is there no way to make IOPS faster than a single disk?

There is, if you don’t use regular disks.

The place where write performance hurts most is typically where the application is writing in single blocks in synchronous mode:

  1. The app sends the block down the ZFS pipe using write(2) (no link, sun.com no longer exists), waiting for the system call to finish.

  2. ZFS sends the block to its ZIL structure on the disk.

  3. The disk positions its head to write the block, then writes it. This takes a few milliseconds: about half a rotation of the platter on average. In the case of a very fast 15,000 rpm disk, that’s 2 ms, or 4 million CPU cycles of a typical 2 GHz CPU.

  4. After writing the block, the disk reports success to the driver, ZFS finishes the write(2) call and the application continues its thing.

  5. During the next pool update, ZFS writes the data again, this time to the regular pool structure, then discards the ZIL data.

  6. In the event of a power loss, ZFS will replay the ZIL data into the pool structure as soon as it imports the pool, so data on the ZIL is always as safe as if it was written to the pool already.

The longest part is taken up by the disk drive positioning its head. Even though ZFS optimizes synchronous writes by writing them to a special ZIL structure on disk (instead of going through the whole ZFS on-disk pool update procedure), it still needs to wait for the disk to complete the transaction.

The only way to accelerate that is to use a faster disk. A much faster disk, an SSD.

Because if you assign an SSD as a ZIL to your ZFS pool, the following happens:

  1. The app sends the block down the ZFS pipe using write(2) (no link, sun.com no longer exists), waiting for the system call to finish.

  2. ZFS sends the block to its ZIL structure on the SSD.

  3. The SSD doesn’t need to position anything. It just writes the data into flash memory. Granted, flash is not the fastest memory, but it is much cheaper than RAM and it’s still about 100x faster than a disk. Now your program only needs to wait about 40,000 CPU cycles.

  4. After writing the block, the disk reports success to the driver, ZFS finishes the write(2) call and the application continues its thing.

  5. Pool updates and ZIL replay work the same as described above.

That’s why SSDs are becoming so popular, especially with ZFS.
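
In practice, this usually means adding the SSD to the pool as a dedicated log device, roughly like this (the SSD device names are just placeholders):

  # Add an SSD as a separate ZIL (log) device to the pool.
  zpool add tank1 log c3t0d0

  # Log devices can also be mirrored for extra safety:
  # zpool add tank1 log mirror c3t0d0 c3t1d0

  # A different SSD can serve as an L2ARC read cache device, too.
  zpool add tank1 cache c3t2d0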

Conclusion

A ZFS vdev is either a single disk, a mirror or a RAID-Z group.

RAID performance can be tricky, independently of the file system. ZFS does its best to optimize, but ultimately it comes down to disk latency (seek time, rotation speed, etc.) for the cases where performance becomes critical.

Mirrors are almost always faster than RAID-Z groups, especially for the cases that are interesting to databases, fileservers etc.

The best way to accelerate your ZFS pool is to use SSDs.

Your Turn

What are your experiences with ZFS and/or RAID performance? How did you measure it? What kind of applications do you use and with what ZFS setups? Share your experience in the comment section below!

