Ten Ways to Easily Improve Oracle Solaris ZFS Filesystem Performance

This is a long article, but I hope you’ll still find it interesting to read. Let me know if you want me to break down future long articles into multiple parts instead.

One of the most frequently asked questions around ZFS is: “How can I improve ZFS performance?”.

This is not to say that ZFS performance is bad. ZFS can be a very fast file system. ZFS is mostly self-tuning, and the inherent nature of the algorithms behind ZFS helps you reach better performance than most RAID-controllers and RAID-boxes - but without the expensive “controller” part.

Most of the ZFS performance problems that I see are rooted in incorrect assumptions about the hardware, or just unrealistic expectations of the laws of physics.

So let’s look at ten ways to easily improve ZFS performance that everyone can implement without being a ZFS expert.

For ease of reading, here’s a table of contents:

But before we start with our performance tips, let’s cover the basics:

The Basics of File System Performance

It’s important to distinguish between the two basic types of file system operation:

Reads, and
Writes.

This may sound stupidly simple, but data travels very different paths through the ZFS I/O subsystem for reads vs. writes and this means there are different ways to improve read performance than there are to make writes faster.

Use zpool iostat (no link, sun.com no longer exists) or iostat(1M) (no link, sun.com no longer exists) and verify what read/write performance the system sees and whether it matches your observations and expectations.

Then there are two kinds of file system performance:

Bandwidth: Measured in MB/s (or GB/s if you’re lucky), telling you how much overall data passes through the system over time.
IOPS: The number of IO operations that are carried out per second.

Again, these different ways of looking at performance can be optimized by different means; you just need to know which category your particular problem falls into.
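The two measures are linked by the size of each I/O: bandwidth is roughly IOPS times I/O size. Here is a minimal Python sketch of that relation. The 166 IOPS figure and the 8 KiB and 128 KiB I/O sizes are illustrative assumptions, not measurements.

```python
def bandwidth_mib_per_s(iops: float, io_size_kib: float) -> float:
    """Approximate bandwidth in MiB/s for a given IOPS rate and I/O size in KiB."""
    return iops * io_size_kib / 1024.0


if __name__ == "__main__":
    # Assumed example values: the same IOPS rate with small and large I/Os.
    for io_size_kib in (8, 128):
        mib = bandwidth_mib_per_s(166, io_size_kib)
        print(f"166 IOPS at {io_size_kib} KiB per I/O: {mib:.2f} MiB/s")
```

The same number of IOPS yields very different bandwidth depending on I/O size, which is why it matters which of the two you are actually short of.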

There are also two patterns of read/write performance:

Sequential: Predictable, one block after the other, lined up like pearls on a string.
Random: Unpredictable, unordered, difficult to grasp.

The good news here is that ZFS automatically turns random writes into sequential writes through the magic of copy-on-write. One less class of performance problems to take care of.

And finally, for write I/Os, you should know about the difference between:

Synchronous Writes: Writes that are only complete after they have been successfully written to stable storage. In ZFS, they’re implemented through the ZFS Intent Log, or ZIL (no link, sun.com no longer exists). These are most often found in file and database servers and these kinds of writes are very sensitive to latency or IOPS performance.
Asynchronous Writes: Write operations that may return after being cached in RAM, before they are committed to disk. Performance is easy to get here, at the expense of reliability: If the power fails before the buffer is written to disk, data can be lost.

Performance Expectations, Goals and Strategy

We’re almost there, but before we get to the actual performance tips, we need to discuss a few methodical things:

Set realistic expectations: ZFS is great, yes. But you need to observe the laws of physics. A disk with 10000 rpm can’t deliver more than 166 random IOPS because 10000 revolutions divided by 60 seconds (per minute) means the head can only position itself above a random block 166 times per second. Any more than that and your data is not really random. That’s just how the numbers play out. Similarly, RAID-Z means that you’ll only get the IOPS performance of a single disk per RAID-Z group, because each filesystem IO will be mapped to all the disks in a RAID-Z group in parallel. Make sure you know what the limits of your storage are and what performance you can realistically expect, when analyzing your performance and setting performance goals. By the way…
Define performance goals: What exactly is “too slow”? What performance would be acceptable? Where are you now, and where do you want to be? Performance goals are important to set, because they tell you when you’re done. There are always ways to improve performance, but there’s no use in improving performance at all costs. Know when you’re done, then celebrate!
Be systematic: It happens so many times: We try this, then we try that, we measure with cp(1) even though our app is actually a database, then we tweak here and there, and before we know it, we realize: We really know nothing. Being systematic means defining how to measure the performance we want, establishing the status quo, in a way that is directly related to the actual application we’re interested in, then sticking to the same performance measurement method through the whole performance analysis and optimization process. Otherwise, things become confusing, we lose sight of where we are and we won’t be able to tell if we reached our goal or not.
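The back-of-the-envelope numbers from the first point can be written down as a small Python sketch. It only encodes the simplified model used above (one random positioning per revolution, one disk's worth of random IOPS per RAID-Z group). The function names are made up for illustration.

```python
def max_random_iops(rpm: float) -> float:
    """Upper bound from the text: at most one random positioning per revolution."""
    return rpm / 60.0


def raidz_pool_random_iops(raidz_groups: int, single_disk_iops: float) -> float:
    """Each RAID-Z group delivers roughly the random IOPS of a single disk."""
    return raidz_groups * single_disk_iops


if __name__ == "__main__":
    per_disk = max_random_iops(10_000)
    print(f"10000 rpm disk: at most about {int(per_disk)} random IOPS")
    # Assumed example: a pool striped across two RAID-Z groups.
    print(f"2 RAID-Z groups: about {int(raidz_pool_random_iops(2, per_disk))} random IOPS")
```

Real disks will usually come in below this bound, because seek time adds to rotational latency.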

Now that we have an understanding of the kind of performance we want, we know what we can expect from today’s hardware, we defined some realistic goals and have a systematic approach to performance optimization, let’s begin:

#1: Add Enough RAM

A small amount of data on your disks is spent for storing ZFS metadata. This is the data that ZFS needs, so it knows where your actual data is. In a way, this is the roadmap that ZFS needs to find its way through your disks and the data structures there.

If your server doesn’t have enough RAM to store metadata, then it will need to issue extra metadata read IOs for every data read IO to figure out where your data actually is on disk. This is slower than necessary, and you really want to avoid that. If you’re really short on RAM, this could have a massive impact!

How much RAM do you need? As a rough rule of thumb, divide the size of your total storage by 1000, then add 1 GB so the OS has some extra RAM of its own to breathe. This means for every TB of data, you’ll want at least 1GB of RAM for caching ZFS metadata, in addition to one GB for the OS to feel comfortable in.

Having enough RAM will benefit all of your reads, no matter if they’re random or sequential, just because they’ll be easier for ZFS to find on your disks, so make sure you have at least n/1000 + 1 GB of RAM, where n is the number of GB in your storage pool.
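As a quick sketch, the rule of thumb looks like this in Python. The function name and the default of 1 GB for the OS are taken from the rule above; treat the result as a minimum, not a sizing guarantee.

```python
def minimum_ram_gb(pool_size_gb: float, os_reserve_gb: float = 1.0) -> float:
    """Rule of thumb from the text: n/1000 GB for metadata plus 1 GB for the OS."""
    return pool_size_gb / 1000.0 + os_reserve_gb


if __name__ == "__main__":
    # Assumed example pool sizes, given in GB.
    for pool_gb in (1_000, 10_000, 48_000):
        print(f"{pool_gb} GB pool: at least {minimum_ram_gb(pool_gb):.0f} GB of RAM")
```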

#2: Add More RAM

ZFS uses every piece of RAM it finds to cache data. It has a very sophisticated caching algorithm that tries to cache both most frequently used data, and most recently used data, adapting their balance while it’s used. ZFS also has some advanced prefetching abilities that can greatly improve performance for different kinds of sequential reads.

All of this works the better the more RAM you give to ZFS. But when do you know if more RAM will give you breakthrough performance, or just a small improvement?

This is where your working set comes in.

Your working set is the part of your data that is used most often: Your top products/websites/customers in an e-commerce database. Your clients with the biggest traffic in your hosting environment. Your most popular files etc.

If your working set fits into RAM, the vast majority of reads can be serviced from RAM most of the time, without having to create any IOs to slow-spinning disks.

Try to figure out what the most popular subset of your data is, then add enough RAM to your ZFS server to help it live there. This will give you the biggest read performance boost.

If you want something more automated, Ben Rockwood has written a great tool called arc_summary (ARC is the ZFS Adaptive Replacement Cache). The two “Ghost” values tell you exactly how much more memory would have helped you to handle the load that your server has seen in the past.

If you want to influence the balance between user data and metadata in the ZFS ARC cache, check out the primarycache filesystem property that you can set using the zfs(1M) (no link, sun.com no longer exists) command. For RAM-starved servers with a lot of random reads, it may make sense to restrict the precious RAM cache to metadata and use an L2ARC, explained in tip #4 below.

#3: Boost Deduplication Performance With Even More RAM

In an earlier article, I wrote about the basics of ZFS Deduplication. If you plan to use it, keep in mind that ZFS will assemble a table of all the blocks stored in your filesystem and their checksums, so it can determine whether a specific block has already been written and can thus safely be marked as a duplicate.

Deduplication will save you space and it can also add to your performance because it saves you unnecessary read and write IOPS. But the cost of this is the need to keep the dedup table as handy as possible, ideally in RAM.

How big is the ZFS dedup table? Richard Elling (no link, sun.com no longer exists) pointed out in a recent mailing list post that a ZFS dedup table entry uses about 250 Bytes per data block (no link, opensolaris.org no longer exists). Assuming an average block size of 8K, a TB of user data would need about 32GB of RAM if you want to be really fast. If your data tends to be spread over large files, you’ll have a bigger average blocksize, say, 64K, and then you’d only need about 4GB of RAM for the dedup table.
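The arithmetic behind these estimates is simply the number of blocks times the size of one table entry. A minimal Python sketch, assuming the roughly 250 bytes per entry quoted above and binary units (1 TB taken as 2^40 bytes):

```python
DDT_ENTRY_BYTES = 250  # approximate size of one dedup table entry, per the text


def dedup_table_bytes(data_bytes: int, avg_block_bytes: int,
                      entry_bytes: int = DDT_ENTRY_BYTES) -> int:
    """Estimate dedup table size: one entry per stored data block."""
    blocks = data_bytes // avg_block_bytes
    return blocks * entry_bytes


if __name__ == "__main__":
    one_tb = 2 ** 40
    for block_kib in (8, 64):
        size = dedup_table_bytes(one_tb, block_kib * 1024)
        print(f"{block_kib}K blocks: about {size / 2 ** 30:.1f} GB of dedup table per TB")
```

This is only an estimate: the real average block size of your data decides the outcome, so measure it rather than guess.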

If you don’t have that amount of RAM, there’s no need to despair, there’s always the possibility to…

#4: Use SSDs to Improve Read Performance

If you can’t add any more RAM to your server (or if your purchasing department won’t allow you), the next best way to increase read performance is to add solid state disks (aka flash memory) as a level 2 ARC cache (L2ARC) to your system.

You can easily configure them with the zpool(1M) (no link, sun.com no longer exists) command; read the “Cache devices” section of its man-page.

SSDs can deliver two orders of magnitude better IOPS than traditional harddisks, and they’re much cheaper on a per-GB basis than RAM. They form an excellent layer of cache between the ZFS RAM-based ARC and the actual disk storage.

You don’t need to observe any reliability requirements when configuring L2ARC devices: If they fail, no data is lost because it can always be retrieved from disk.

This means that L2ARC devices can be cheap, but before you start putting USB sticks into your server, you should make sure they deliver a good performance benefit over your rotating disks :).

SSDs come in various sizes: From drop-in-replacements for existing SATA disks in the range of 32GB to the Oracle Sun F20 PCI card (no link, page no longer exists) with 96GB of flash and built-in SAS controllers (which is one of the secrets behind Oracle Exadata V2’s (no link, page no longer exists) breakthrough performance), to the mighty fast Oracle Sun F5100 flash array (no link, page no longer exists) (which is the secret behind Oracle’s current TPC-C and other world records (no link, sun.com no longer exists)) with a whopping 1.96TB of pure flash memory and over a million IOPS. Nice!

And since the dedup table is stored in the ZFS ARC and consequently spills off into the L2ARC if available, using SSDs as cache devices will also benefit deduplication performance.

#5: Use SSDs to Improve Write Performance

Most write performance problems are related to synchronous writes. These are mostly found in file servers and database servers.

With synchronous writes, ZFS needs to wait until each particular IO is written to stable storage, and if that’s your disk, then it’ll need to wait until the rotating rust has spun into the right place, the harddisk’s arm has moved to the right position, and finally, until the block has been written. This is mechanical, it’s latency-bound, it’s slow.

See Roch’s excellent article on ZFS NFS performance (no link, sun.com no longer exists) for a more detailed discussion on this.

SSDs can change the whole game for synchronous writes because they have 100x better latency: No moving parts, no waiting, instant writes, instant performance.

So if you’re suffering from a high load in synchronous writes, add SSDs as ZFS log devices (aka ZIL, Logzillas) and watch your synchronous writes fly. Check out the zpool(1M) (no link, sun.com no longer exists) man page under the “Intent Log” section for more details.

Make sure you mirror your ZIL devices: They are there to guarantee the POSIX requirement for “stable storage” so they must function reliably, otherwise data may be lost on power or system failure.

Also, make sure you use high quality SLC Flash Memory devices, because they can give you reliable write transactions. Cheaper MLC cells can damage existing data if the power fails during write operations, something you really don’t want.
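Both SSD tips come down to a single zpool add command: cache devices for the L2ARC and log devices for the ZIL. The Python sketch below only builds and prints the command lines and does not run anything. The pool name tank and the device names are hypothetical placeholders, so substitute your own and check the zpool(1M) man page before running the result.

```python
from typing import List


def add_cache_devices(pool: str, devices: List[str]) -> List[str]:
    """Build `zpool add <pool> cache <dev>...` for L2ARC devices (no mirroring needed)."""
    if not devices:
        raise ValueError("at least one cache device is required")
    return ["zpool", "add", pool, "cache", *devices]


def add_mirrored_log(pool: str, devices: List[str]) -> List[str]:
    """Build `zpool add <pool> log mirror <dev> <dev>...` for a mirrored log device."""
    if len(devices) < 2:
        raise ValueError("a mirrored log needs at least two devices")
    return ["zpool", "add", pool, "log", "mirror", *devices]


if __name__ == "__main__":
    # Hypothetical pool and device names, printed for review rather than executed.
    print(" ".join(add_cache_devices("tank", ["c2t0d0"])))
    print(" ".join(add_mirrored_log("tank", ["c3t0d0", "c3t1d0"])))
```

The log helper refuses a single device on purpose, reflecting the advice above to mirror your log devices.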

#6: Use Mirroring

Many people configure their storage for maximum capacity. They just look at how many TB they can get out of their system. After all, storage is expensive, isn’t it?

Wrong. Storage capacity is cheap. Every 18 months or so, the same disk only costs half as much, or you can buy double the capacity for the same price, depending on how you view it.

But storage performance can be precious. So why squeeze the last GB out of your storage if capacity is cheap anyway? Wouldn’t it make more sense to trade in capacity for speed?

This is what mirroring disks offer as opposed to RAID-Z or RAID-Z2:

RAID-Z(2) groups several disks into RAID groups, called vdevs. This means that every I/O operation at the file system level is going to be translated into a parallel group of I/O operations to all of the disks in the same vdev. The result: Each RAID group can only deliver the IOPS performance of a single disk, because the transaction always has to wait until all of the disks in the same vdev are finished. This is true both for reads and for writes: The whole pool can only deliver as many IOPS as the total number of striped vdevs times the IOPS of a single disk. There are cases where the total bandwidth of RAID-Z can take advantage of the aggregate performance of all drives in parallel, but if you’re reading this, you’re probably not seeing such a case.
Mirroring behaves differently: For writes, the rules are the same: Each mirrored pair of disks will deliver the write IOPS of a single disk, because each write transaction will need to wait until it has completed on both disks. But a mirrored pair of disks is a much smaller granularity than your typical RAID-Z set (with up to 10 disks per vdev). For 20 disks, this could be the difference between 10x the IOPS of a disk in the mirror case vs. only 2x the IOPS of a disk in a wide-stripe RAID-Z2 scenario (8+2 disks per RAID-Z2 vdev). A 5x performance difference! For reads, the difference is even bigger: ZFS will round-robin across all of the disks when reading from mirrors. This will give you 20x the IOPS of a single disk in a 20 disk scenario, but still only 2x if you use wide stripes of the 8+2 kind. Of course, the numbers can change when using smaller RAID-Z stripes, but the basic rules are the same and the best performance is always achieved with mirroring.
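To make the 20-disk comparison concrete, here is a small Python sketch of the simplified model described above. Results are expressed as multiples of a single disk's IOPS, and the function name is made up for illustration.

```python
from typing import Tuple


def pool_iops_multiplier(total_disks: int, disks_per_vdev: int,
                         mirrored: bool) -> Tuple[int, int]:
    """Return (read, write) IOPS as multiples of one disk's IOPS.

    Model from the text: writes scale with the number of vdevs; reads scale
    with the number of disks for mirrors, but only with the number of vdevs
    for RAID-Z(2).
    """
    vdevs = total_disks // disks_per_vdev
    write = vdevs
    read = total_disks if mirrored else vdevs
    return read, write


if __name__ == "__main__":
    mirror_read, mirror_write = pool_iops_multiplier(20, 2, mirrored=True)
    raidz_read, raidz_write = pool_iops_multiplier(20, 10, mirrored=False)
    print(f"10 mirrored pairs: reads {mirror_read}x, writes {mirror_write}x")
    print(f"2 x RAID-Z2 (8+2): reads {raidz_read}x, writes {raidz_write}x")
```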

For a more detailed discussion on this, I highly recommend Richard Elling’s (no link, sun.com no longer exists) post on ZFS RAID recommendations: Space, performance and MTTDL (no link, sun.com no longer exists).

Also, there’s some more discussion on this in my earlier RAID-GREED-article.

Bottom line: If you want performance, use mirroring.

#7: Add More Disks

Our next tip was already buried inside tip #6: Add more disks. The more vdevs ZFS has to play with, the more shoulders it can place its load on and the faster your storage performance will become.

This works both for increasing IOPS and for increasing bandwidth, and it’ll also add to your storage space, so there’s nothing to lose by adding more disks to your pool.

But keep in mind that the performance benefit of adding more disks (and of using mirrors instead of RAID-Z(2)) only accelerates aggregate performance. The performance of every single I/O operation is still confined to that of a single disk’s I/O performance.

So, adding more disks does not substitute for adding SSDs or RAM, but it’ll certainly help aggregate IOPS and bandwidth for the cases where lots of concurrent IOPS and bigger overall bandwidth are needed.

#8: Leave Enough Free Space

Don’t wait until your pool is full before adding new disks, though.

ZFS uses copy-on-write, which means that it writes new data into free blocks, and the new state only becomes valid once the überblock has been updated.

This is great for performance because it gives ZFS the opportunity to turn random writes into sequential writes - by choosing the right blocks out of the list of free blocks so they’re nicely in order and thus can be written to quickly.

That is, when there are enough blocks.

Because if you don’t have enough free blocks in your pool, ZFS will be limited in its choice, and that means it won’t be able to choose enough blocks that are in order, and hence it won’t be able to create an optimal set of sequential writes, which will impact write performance.

As a rule of thumb, don’t let your pool become more full than about 80% of its capacity. Once it reaches that point, you should start adding more disks so ZFS has enough free blocks to choose from in sequential write order.
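A check for this rule is easy to script. The Python sketch below assumes you already have the used and total capacity of the pool as numbers (for example from your monitoring system). The 80% threshold is the rule of thumb from above, not a hard limit built into ZFS.

```python
def pool_fill_ratio(used_bytes: float, total_bytes: float) -> float:
    """Fraction of the pool's capacity that is in use."""
    if total_bytes <= 0:
        raise ValueError("total capacity must be positive")
    return used_bytes / total_bytes


def needs_more_disks(used_bytes: float, total_bytes: float,
                     threshold: float = 0.80) -> bool:
    """True once the pool is at or above the rule-of-thumb fill level."""
    return pool_fill_ratio(used_bytes, total_bytes) >= threshold


if __name__ == "__main__":
    # Assumed example: 17 TB used out of 20 TB.
    used, total = 17e12, 20e12
    print(f"Pool is {pool_fill_ratio(used, total):.0%} full")
    print("Time to add disks" if needs_more_disks(used, total) else "Enough free space")
```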

#9: Hire A ZFS Expert

There’s a reason why this point comes up almost last: In the vast majority of ZFS performance cases, one or more of #1-#8 above is the solution.

And they’re cheaper than hiring a ZFS performance expert who will likely tell you to add more RAM, or add SSDs or switch from RAID-Z to mirroring after looking at your configuration for a couple of minutes anyway!

But sometimes, a performance problem can be really tricky. You may think it’s a storage performance problem, but instead your application may be suffering from an entirely different effect.

Or maybe there are some complex dependencies going on, or some other unusual interaction between CPUs, memory, networking, I/O and storage.

Or perhaps you’re hitting a bug or some other strange phenomenon?

So, if all else fails and none of the above options seem to help, contact your favorite Oracle/Sun representative (or send me a mail) and ask for a performance workshop quote. If your performance problem is really that hard, we want to know about it.

#10: Be An Evil Tuner - But Know What You Do

If you don’t want to go for option #9 and if you know what you’re doing, you can check out the ZFS Evil Tuning Guide (no link, solarisinternals.com no longer exists).

There’s a reason it’s called “evil”: ZFS is not supposed to be tuned. The default values are almost always the right values, and most of the time, changing them won’t help, unless you really know what you’re doing. So, handle with care.

Still, when people encounter a ZFS performance problem, they tend to Google “ZFS tuning”, then they’ll find the Evil Tuning Guide, then think that performance is just a matter of setting that magic variable in /etc/system.

This is simply not true.

Measuring performance in a standardized way, setting goals, then sticking to them helps. Adding RAM helps. Using SSDs helps. Thinking about the right number and RAID level of disks helps. Letting ZFS breathe helps.

But tuning kernel parameters is reserved for very special cases, and then you’re probably much better off hiring an expert to help you do that correctly.

Bonus: Some Miscellaneous Settings

If you look through the zfs(1M) (no link, sun.com no longer exists) man page, you’ll notice a few performance related properties you can set. They’re not general cures for all performance problems (otherwise they’d be set by default), but they can help in specific situations. Here are a few:

atime: This property controls whether ZFS records the time of last access for reads. Switching this to off will save you extra write IOs when reading data. This can have a big impact if your application doesn’t care about the time of last access for a file and if you have a lot of small files that need to be read frequently.
checksum and compression can be double-edged swords: The stronger the checksum, the better your data is protected against corruption (and this is even more important when using dedup). But a stronger checksum method will incur some more load on the CPU for both reading and writing. Similarly, using compression may save a lot of IOPS if the data can be compressed well, but may be in the way for data that isn’t easily compressed. Again, compression costs some extra CPU time. Keep an eye on CPU load while running tests and if you find that your CPU is under heavy load, you might need to tweak one of these.
recordsize: Don’t change this property unless you’re running a database in this filesystem. ZFS automatically figures out what the best blocksize is for your filesystems. In case you’re running a database (where the file may be big, but the access pattern is always in fixed-size chunks), setting this property to your database record size may help performance a lot.
primarycache and secondarycache: We already introduced the primarycache property in tip #2 above. It controls whether your precious RAM cache should be used for metadata or for both metadata and user data. In cases where you have an SSD configured as a cache device and if you’re using a large filesystem, it may help to set primarycache=metadata so the RAM is used for metadata only. secondarycache does the same for cache devices, but it should be used to cache metadata only in cases where you have really big file systems and almost no real benefit from caching data.
logbias: When executing synchronous writes, there’s a tradeoff to be made: Do you want to wait a little, so you can accumulate more synchronous write requests to be written into the log at once, or do you want to service each individual synchronous write as fast as possible, at the expense of throughput? This property lets you decide which side of the tradeoff you want to favor.
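All of these properties are set with zfs set property=value dataset. The Python sketch below only builds and prints such command lines, so nothing is changed on your system. The dataset name tank/db and the chosen values (including the 8K record size) are hypothetical examples for a database-style workload, not recommendations.

```python
from typing import Dict, List


def zfs_set_command(dataset: str, prop: str, value: str) -> List[str]:
    """Build the argument list for `zfs set <prop>=<value> <dataset>`."""
    return ["zfs", "set", f"{prop}={value}", dataset]


# Illustrative values only: pick the ones that match your own workload.
EXAMPLE_SETTINGS: Dict[str, str] = {
    "atime": "off",              # skip access-time updates on reads
    "recordsize": "8K",          # assumed database record size
    "primarycache": "metadata",  # keep RAM cache for metadata only
    "logbias": "throughput",     # favor throughput over per-write latency
}

if __name__ == "__main__":
    for prop, value in EXAMPLE_SETTINGS.items():
        print(" ".join(zfs_set_command("tank/db", prop, value)))
```

Change one property at a time and measure again with the same method, otherwise you won’t know which setting made the difference.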

Your Turn

Sorry for the long article. I hope the table of contents at the beginning makes it more digestible, and I hope it’s useful to you as a little checklist for ZFS performance planning and for dealing with ZFS performance problems.

Let me know if you want me to split up longer articles like these (though this one is really meant to remain together).

Now it’s your turn: What is your experience with ZFS performance? What options from the above list did you implement for what kind of application/problem and what were your results? What helped and what didn’t and what are your own ZFS performance secrets?

Share your ZFS performance expertise in the comments section and help others get the best performance out of ZFS!