Solaris ZFS, Synchronous Writes and the ZIL Explained

When talking to customers, partners and colleagues about Oracle Solaris ZFS performance, one topic almost always seems to pop up: Synchronous writes and the ZIL.

In fact, most ZFS performance problems I see are related to synchronous writes, how they are handled by ZFS through the ZIL and how they impact IOPS load on the pool’s disks.

Many people blame the ZIL for bad performance, and some even try to turn it off (no link, solarisinternals.com no longer exists), but that puts your data at risk. Actually, the opposite is true: The ZIL is there to help you.

In this article, we’ll learn what synchronous writes are, how they’re processed by ZFS, what the ZIL is, how it works, how to measure ZIL activity and how to accelerate synchronous write performance, which is at the root of many, if not the majority of ZFS performance problems.

So let’s start by understanding how synchronous writes work and how they are processed by ZFS, as opposed to asynchronous writes.

The POSIX Mandate for Synchronous Writes

The Oracle Solaris man page for the open(2) system call (no link, sun.com no longer exists) contains a number of flags that applications can use when opening files. One of them looks rather innocent:

`O_SYNC`: Write I/O operations on the file descriptor complete as defined by synchronized I/O file integrity completion.

Now, what is “synchronized I/O file integrity completion”?

We find more information in the description of the fcntl.h header file (no link, sun.com no longer exists):

`O_SYNC`: When opening a regular file, this flag affects subsequent writes. If set, each `write(2)` will wait for both the file data and file status to be physically updated. Write I/O operations on the file descriptor complete as defined by synchronized I/O file integrity completion.

In other words: If a file has been opened with the O_SYNC flag, then each write(2) call will only return when both the data and the file status have been physically updated.

Which means: O_SYNC writes can't just sit in a memory cache. They have to be completely written to stable storage before the write(2) call returns to the application.

Depending on how often you encounter O_SYNC writes and how much data is written per write(2) call, this can feel slow. Very slow.
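
As a minimal sketch of what such an application does, the following Python snippet opens a file with O_SYNC and writes to it. The helper name `write_synchronously` and the file name `ledger.dat` are made up for illustration; `os.O_SYNC` is assumed to be available, which is the case on Unix-like systems.

```python
import os
import tempfile


def write_synchronously(path: str, payload: bytes) -> int:
    """Write payload through a descriptor opened with O_SYNC (illustrative helper)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        # With O_SYNC set, this call should only return once the OS reports
        # the data and file status as written to stable storage.
        return os.write(fd, payload)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "ledger.dat")  # hypothetical file name
        written = write_synchronously(target, b"debit 699\n")
        print(written, "bytes written synchronously")
```

Nothing else in the application changes: the write call looks the same, it just takes longer to return.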

To illustrate, this is how normal, asynchronous writes are handled by a POSIX compliant OS:

1. The application issues a write(2) call, asking the OS to write some data to disk.
2. The OS caches the data in RAM and returns immediately to the application.
3. Now, the application can continue to do its thing.
4. Later, when the time is right, the OS will flush its caches, writing all the data accumulated so far to disk in an orderly and efficient fashion.

In contrast, a synchronous write works quite differently on modern OSes:

1. The application issues a write(2) call, asking the OS to write some data to disk.
2. The OS caches the data in RAM but it also needs to ensure that the data is written to stable storage before returning from the system call. Since writing to the regular on-disk data structures now would take too much time, it decides to store that data in a special area of the disk instead. This is called the intent log.
3. After the data has been written to the intent log on stable storage, the write(2) call returns its status back to the application.
4. Now, the application can continue to do its thing.
5. Later, when the time is right, the OS will flush its caches, writing all the data accumulated so far to disk in an orderly and efficient fashion. The data in the intent log is then marked as “completed”.
6. Should there be a fatal error that prevents the OS from flushing its caches, it’ll try to replay the log as soon as it mounts the file system again, e.g. after a reboot.
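
The steps above can be sketched as a toy model. This is not how any real file system is implemented: the class `ToyIntentLogStore` and all of its method names are invented for illustration, and plain Python containers stand in for RAM, the intent log and the regular on-disk structures.

```python
class ToyIntentLogStore:
    """In-memory model of the synchronous write path (illustration only)."""

    def __init__(self):
        self.ram_cache = {}     # volatile: lost when the system crashes
        self.intent_log = []    # stands in for the log on stable storage
        self.main_storage = {}  # stands in for the regular on-disk structures

    def sync_write(self, key, value):
        self.ram_cache[key] = value
        # The log record must be on stable storage before the call returns.
        self.intent_log.append((key, value))
        return True

    def flush(self):
        # Regular, orderly update of the on-disk structures from the cache.
        self.main_storage.update(self.ram_cache)
        self.ram_cache.clear()
        self.intent_log.clear()  # logged records are now "completed"

    def crash(self):
        # Power failure: everything held only in RAM is gone.
        self.ram_cache.clear()

    def mount(self):
        # Replay whatever is still in the log, then discard it.
        for key, value in self.intent_log:
            self.main_storage[key] = value
        self.intent_log.clear()


if __name__ == "__main__":
    store = ToyIntentLogStore()
    store.sync_write("balance", 100)
    store.crash()   # happens before the cache was flushed
    store.mount()   # replay recovers the acknowledged write
    print(store.main_storage)
```

The point of the model is the ordering: the write is acknowledged only after the log append, so a crash between acknowledgement and flush loses nothing that the application was told is safe.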

Non-modern OSes don’t even have an intent log. They have to go through writing full data structures to disk every time they encounter a synchronous write. Or they break POSIX rules by treating synchronous writes just like regular writes, then accusing modern OSes of “being slow”.

Why Synchronous Writes are Bad for Performance

As you see, synchronous writes are a lot more complicated and time-consuming:

The time that the application spends waiting for the function call to return is determined by the time the OS needs to ensure the data is written to the intent log, which needs to be on stable storage (i.e. physical disk, not RAM, nor any other volatile cache). If your application issues a lot of synchronous writes, it will have to wait many times. Wait times add up and the app feels slow.
But worse yet: The data needs to be written twice to disk: First to the intent log, and second, to its regular data structure on-disk, where it actually belongs. This means that synchronous writes actually create twice as many IO operations to disk as normal, asynchronous writes!
Even worse yet: Each time the intent log is updated, the disk’s arms need to be repositioned (because intent logs tend to be located in different areas of the disk), thereby disrupting the disk’s regular write operations. This increases random write load for the disk (which is slow) and prevents the disk from performing sequential writes (which would have been fast). As Victor pointed out in the comments, this issue doesn’t apply to ZFS, as we’ll see below.
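
To get a feeling for the wait time on your own system, a rough micro-benchmark along these lines can help. It is only a sketch: the function name `time_writes`, the block size and the write count are arbitrary choices, and results depend heavily on the file system and hardware (a temporary directory backed by RAM will show little or no difference).

```python
import os
import tempfile
import time


def time_writes(path, count, block, extra_flags=0):
    """Return the seconds spent in `count` write calls of `block` (illustrative)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    block = b"x" * 4096
    with tempfile.TemporaryDirectory() as tmp:
        buffered = time_writes(os.path.join(tmp, "async.dat"), 200, block)
        synced = time_writes(os.path.join(tmp, "sync.dat"), 200, block, os.O_SYNC)
    print(f"buffered: {buffered:.4f}s  O_SYNC: {synced:.4f}s")
```

Point the paths at the file system you actually care about rather than a temporary directory to make the comparison meaningful.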

By the way: Another way to get the same synchronous write behavior is to call the fsync(3C) (no link, sun.com no longer exists) function. This will force the OS to write all outstanding IO operations for the given file descriptor to stable storage before returning from the function call.

If synchronous writes are so slow, why bother going through all that fuss?
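
A minimal sketch of the fsync approach in Python, assuming a hypothetical helper `append_and_fsync` and made-up record contents: the writes themselves are ordinary buffered writes, and a single `os.fsync()` call at the end makes them durable together.

```python
import os
import tempfile


def append_and_fsync(path, records):
    """Append records with buffered writes, then force them to stable storage."""
    with open(path, "ab") as f:
        for record in records:
            f.write(record)
        f.flush()             # hand Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write this file's data to stable storage


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        journal = os.path.join(tmp, "journal.log")  # hypothetical file name
        append_and_fsync(journal, [b"debit 699\n", b"credit 699\n"])
        print(os.path.getsize(journal), "bytes on disk")
```

Batching several writes behind one fsync is how many applications keep the number of synchronous operations down.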

Why Synchronous Writes are Good for You

If we want things to be fast, why go through all this shiitake? Why not just cache everything in RAM (which is super-fast), return quickly to the application and win every benchmark?

Because we care about our data.

Imagine you’re paying your bills online, your account has been debited with the $699 that your shiny new iPad cost you and Apple’s account has been credited the same $699. Your bank is super-fast because it doesn’t care about synchronous writes, so your transaction sits in memory, waiting for the next update to disk.

Now the power fails.

After everything is back online, you discover that your account is missing the $699 (because that portion of the transaction happened to have made it to disk), but Apple’s portion of the transaction didn’t happen (because unfortunately, it was cached in RAM when the power failed, and never made it to disk).

Now you just spent $699 on nothing. Good luck convincing your bank to give you back your money.

We can easily see how this bank won’t stay in business for long, and why sacrificing integrity for performance wasn’t a good idea in this case.

Now if the bank had decided to do things the proper way, the story would be like this: After the power failure, the systems reboot. When mounting file systems, the OS notices that there’s still data in the intent log that hasn’t been written to disk yet, and proceeds with updating it. After updating all data, the databases are consistent again, Apple gets your money, proceeds with shipping your iPad and we have a happy ending.

Essentially, the intent log of a file system is nothing more than insurance against power failures, a to-do list if you will, that keeps track of the stuff that needs to be updated on disk, even if the power fails (or something else happens that prevents the system from updating its disks).

This simple example is indeed one of the two most common places where synchronous writes happen: Databases.

The other place is also common: File servers. Both databases and file servers need to ensure transactional integrity towards their clients, so they often use synchronous writes to handle those operations. Roch Bourbonnais describes the reasoning behind NFS, synchronous writes and ZFS in a classic blog article (no link, sun.com no longer exists).

Synchronous Writes in ZFS

So far, so good: The above is valid for any OS that claims POSIX compliance and is worth its salt. Now, how are synchronous writes handled in Solaris and ZFS?

We find the answer in the man page for zpool(1M) (no link, sun.com no longer exists):

The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous transactions. For instance, databases often require their transactions to be on stable storage devices when returning from a system call. NFS and other applications can also use fsync() to ensure data stability. By default, the intent log is allocated from blocks within the main pool. However, it might be possible to get better performance using separate intent log devices such as NVRAM or a dedicated disk.

There we have it. If you are using a regular ZFS pool and your application is issuing synchronous writes, they’ll be written to the ZIL first. Later, when it’s time to update the ZFS on-disk data structures, the data will be written again as part of the regular pool update sequence (typically every 5 seconds, unless the pool is under heavy load).

As Victor pointed out below in the comments, ZFS doesn’t use a special area of the disk for its ZIL (“the intent log is allocated from blocks within the main pool”), so the extra seeks for log writes are not an issue for ZFS. On the other hand, excessive use of the ZIL will create a “swiss cheese” effect and increase fragmentation of the disk’s blocks, potentially hurting read and write performance for pools that are near the top of their capacity.

More information on how the ZFS ZIL is implemented can be found in Neil Perrin’s article: “ZFS: The Lumberjack (no link, sun.com no longer exists)”.

Using Separate Log Devices in ZFS

Using a ZIL is faster than writing to the regular pool structure, because it doesn’t come with the overhead of updating file system metadata and other housekeeping tasks. But having the ZIL on the same disk as the rest of the pool introduces a competition between the ZIL and the regular pool structure, fighting over precious IOPS resources, resulting in bad performance whenever there’s a significant portion of synchronous writes.

The good news is that ZFS permits you to place the ZIL on one or more separate devices. Just say something like:

zpool add pool log c2d0

And your pool will use c2d0 as a separate ZIL device.

Mirroring log devices is recommended to prevent data loss if the power fails and one of the ZIL devices happens to fail, too:

zpool add pool log mirror c2d0 c3d0

Finally, you can even have many ZIL devices (if you really have a high synchronous write load):

zpool add pool log mirror c2d0 c3d0 log mirror c4d0 c5d0 log mirror c6d0 c7d0

Log Device Types

Using regular disks as log devices can help a great deal. As we have seen above, using synchronous writes introduces competition between ZIL writes and regular pool writes over scarce disk head positioning resources. By adding a separate log device, the regular pool disks can concentrate on executing sequential writes (which are naturally generated by ZFS’ CoW algorithms and which can be handled quickly by the drives), while the separate log devices can concentrate on writing ZIL blocks.

So if you’re suffering from a large amount of synchronous writes (i.e. your NFS or database server is very slow when writing data), you’ll probably experience a lot of relief by adding a separate log device to your ZFS pool.

But the biggest benefit can be obtained by using ZIL devices that can execute writes fast. Really fast. And that’s why we at Oracle like to make such a great fuss about Flash memory devices. These range from Oracle’s Sun F5100 Flash Storage Array (no link, page no longer exists) and the Sun Flash Accelerator F20 (no link, page no longer exists), both running on cool, space-saving Flash Modules (no link, page no longer exists), to standard, enterprise-class SATA/SAS SSDs.

Of course, other types of ZFS log devices are possible. The important things are: Reliability against power failure and low latency for write IO operations. Adam Leventhal, for instance, recently tested a DDR RAM based solution which works really well (no link, sun.com no longer exists), provided you can supply it with a reliable power source and you don’t care about cluster takeovers of the ZIL (which are a whole different topic).

The ZFS Logbias Property

Recent versions of OpenSolaris (build 122 and newer) as well as the Oracle Sun Storage 7000 Systems (no link, page no longer exists) introduced a new property called logbias.

This property lets you tune the log destination on a per-filesystem basis. The default, logbias=latency, is to use a separate log device if the pool has one, as described above, which improves latency for synchronous writes for the application. If you set logbias=throughput, then no separate log device is used. Instead, ZFS will allocate intent log blocks from the main pool, writing data immediately and spreading the load across the pool’s devices. This improves bandwidth at the expense of latency for synchronous writes. Depending on the application, you may prefer one or the other. In fact, the whole motivation behind implementing this property was to optimize ZFS for Oracle databases.
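
The property is set per dataset with the regular `zfs set` command. As a small sketch, the Python helper below only builds the command line and does not run it; the function name `logbias_command` and the dataset name `pool/oradata` are hypothetical, while `latency` and `throughput` are the two values described above.

```python
VALID_LOGBIAS = ("latency", "throughput")


def logbias_command(dataset, bias):
    """Build (but do not run) the argv for setting logbias on a dataset."""
    if bias not in VALID_LOGBIAS:
        raise ValueError(f"logbias must be one of {VALID_LOGBIAS}, got {bias!r}")
    return ["zfs", "set", f"logbias={bias}", dataset]


if __name__ == "__main__":
    # "pool/oradata" is a made-up dataset name used for illustration.
    print(" ".join(logbias_command("pool/oradata", "throughput")))
```

On a system with ZFS, the resulting list could be passed to `subprocess.run()` with suitable privileges, and `zfs get logbias` on the same dataset would show the current value.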

Check out the ARC case for logbias (no link, opensolaris.org no longer exists) for details, as well as Roch’s article on logbias (no link, sun.com no longer exists).

Do You Need a ZIL Device?

Before spending money on an expensive Flash or other ZIL solution, it is useful to confirm that the performance problem you’re experiencing is indeed caused by a large number of write IOPS to the ZIL. After all, you could be hitting a different problem, who knows?

Roch (no link, sun.com no longer exists) and Richard Elling have put together a nice tool for this: The aptly named zilstat.ksh will show you statistics on your current ZIL usage so you can see for yourself how much data and how many IOPS are going to the ZIL on your system, plus some more details. Very cool!

Conclusion

I hope this article sheds some light on the story behind ZFS, Synchronous Writes and the ZIL. The bottom line is:

The ZIL is there to prevent data loss in the event of power or other failures.
It is actually an optimization technique that avoids updating the whole data structure on disk every time a synchronous write is issued.
Unfortunately, high numbers of synchronous writes increase both the number of write IOPS to disk and the occurrence of random writes, both of which slow down disk performance.
ZFS allows the ZIL to be placed on a separate device, which relieves the main pool disks of the additional burden.
Even better: Use flash memory instead of regular disks as ZIL devices to further speed up synchronous write performance.
zilstat.ksh will show you exactly how much ZIL activity is on your system so you can judge for yourself how much a separate ZIL will help you.

Your ZIL Experiences

Are you suffering from lots of synchronous write requests? Are you using a separate ZIL device? Maybe a flash memory based one? What are your experiences? How much faster has your ZFS pool become as a result of separating the ZIL?

Post a comment below and share your ZIL experience!

Update: Added a pointer to Roch’s article on logbias (no link, sun.com no longer exists). Very well worth a read!

Update 2: Victor pointed out in the comments that ZFS doesn’t use a special area for the ZIL. Instead, it allocates ZIL blocks similarly to the rest of the file system. This mitigates the effect of extra seeks, but instead introduces a “swiss cheese” effect, because the ZIL blocks are short-lived and tend to increase fragmentation in the pool. I just corrected the relevant section above.

This post is obsolete