A few weeks ago, a reader asked me a couple of questions about SSDs and ZFS, hinting that this might be a good topic to write a blog post about.
Sure enough, just last week, a couple of similar questions came up, this time from a customer and a colleague at work.
Well, if that's not a sign from heaven, I don't know what is, so here's a collection of frequently asked questions about flash memory (also known as solid state disks, or SSDs) and ZFS, with answers and some useful links, and an index, too.
"SSD" stands for "Solid State Disk". In short: A disk that doesn't have any moving parts. Today, most SSDs are made using flash memory, but some SSDs use a combination of RAM and some form of battery back up to make them permanent. The key is to leave out the moving parts so the SSD has much faster performance characteristics, especially in terms of IOPS. Read more about SSDs on Wikipedia.
Technically, no: SSDs are a drive category, and flash memory is a memory technology. But in practice, people use both terms synonymously: a drive that looks like a hard drive but uses flash memory instead of rotating rust to store data is an SSD, which is the same as a "Flash Memory Drive". And much, if not all, of what is said about flash memory SSDs also applies to any other storage medium with very fast IO performance characteristics, especially in terms of IOPS.
It depends on what you're looking at: flash memory SSDs are about two orders of magnitude faster than traditional hard disks in terms of IOPS (the number of IO operations per second). But in terms of throughput (MB/s), they're in a similar league. Since there are no moving parts, SSDs consume a lot less energy than traditional hard disks. For the same reason, they're also more reliable. On the other hand, the cost per GB is much higher for SSDs. This creates an interesting phenomenon: while it would be cool to throw out all the traditional disks from your datacenter, PCs and laptops, it would also be expensive. Therefore, the key to taking advantage of SSDs is to use them in the right place, where their benefits matter most. Think of SSDs as a new layer of memory between RAM and traditional hard disks: faster than disks, slower than RAM, cheaper than RAM on a per-GB basis, but more expensive than disks.
No. There are different categories of flash memory, and occasionally new technologies are invented that expand the scope of SSD variants. Today, the two most important types of flash memory are SLC (Single Level Cell) and MLC (Multi Level Cell). SLC flash memory uses a single level per cell to store one bit of data. This results in faster performance, better reliability and lower power consumption than MLC. MLC flash, on the other hand, uses multiple levels per cell, which allows each cell to store more values than just 0 and 1: an MLC cell can store two (most commonly) or more bits of data. This makes MLC flash less expensive on a per-GB basis and lets more data be stored in the same form factor. When selecting SSDs, you need to weigh these factors according to the way you want to use your SSD: do you need better reliability and performance? Or are you after a lower-cost, more dense option?
There are two ways in which SSDs can be used with ZFS (See the ZFS source tour diagram to better understand the following):
ZFS uses a logging mechanism, the ZFS intent log (ZIL), to store synchronous writes until they're safely written to the main data structure on the pool. The speed at which data can be written to the ZIL determines the speed at which synchronous write requests can be serviced: the faster the ZIL, the faster most database, NFS and other important write operations become. Normally, the ZIL is part of the regular pool on disk. But ZFS offers the possibility of using a dedicated device for the ZIL. This is then called a "log device". By using a fast SSD as a ZFS log device, you accelerate the ZIL, and synchronous write performance improves. See also Solaris ZFS, Synchronous Writes and the ZIL Explained.
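As a sketch, adding an SSD as a log device takes a single command (the pool name "tank" and the device name are placeholders; substitute your own):

```shell
# Add an SSD as a dedicated ZIL log device to pool "tank".
# c4t0d0 is a placeholder device name; use your SSD's device here.
zpool add tank log c4t0d0

# Verify that the log device shows up in the pool layout.
zpool status tank
```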
ZFS also has a sophisticated cache called the "Adaptive Replacement Cache" (ARC), where it stores both the most frequently used blocks of data and the most recently used ones. The ARC is stored in RAM, so each block of data that is found there can be delivered quickly to the application instead of having to be fetched again from disk. When RAM is full, data needs to be evicted from the cache and is no longer available to accelerate reads. SSDs can be used as a second-level cache: blocks that can't be kept in the RAM-based ARC can be stored on SSDs, and if they're needed, they can still be delivered to the application more quickly than by fetching them again from disk. An SSD used as a second-level ARC is therefore called an L2ARC, or a "cache device".
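Adding a cache device is just as easy (again, pool and device names are placeholders):

```shell
# Add an SSD as an L2ARC cache device to pool "tank".
# Cache devices hold copies of data only, so they can be
# added or removed at any time without risk to pool data.
zpool add tank cache c4t1d0
```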
In summary, SSDs can be used in two important parts of the ZFS architecture: A log device will accelerate synchronous writes, and a cache device will accelerate reads.
Good question. Sometimes, the situation is so borked (or the bottleneck so non-obvious) that an SSD for the ZIL won't help you much. Before you invest money in an SSD, it's better to check. Richard Elling has written a cool utility called zilstat that answers the question: how stressed is my ZIL? Run this script, put your server under a typical load and watch for yourself. If you see a lot of ZIL activity, you'll likely benefit from a ZIL that is offloaded onto an SSD.
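A typical run might look like the following (this invocation is a sketch assuming you saved the script as zilstat.ksh; check the script's own usage output for the exact options):

```shell
# Sample ZIL traffic in 10-second intervals, six times,
# while the server is under a typical workload.
./zilstat.ksh 10 6
```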
Again, a very good question. Sometimes, the most popular data fits into your RAM, so an L2ARC won't give you much extra benefit. Sometimes, though, the ARC is so busy that it really could use some extra help from an SSD. Again, there's a cool script that will help you decide, this time from Ben Rockwood: arc_summary will collect a number of useful ARC statistics, and his Explore Your ZFS Adaptive Replacement Cache (ARC) article will teach you which numbers to look for. The short version: if you see a lot of ghosts (blocks that were evicted from the ARC but later turned out to be useful after all), then you should buy an SSD for an L2ARC to keep them from becoming ghosts.
SLC flash is faster for writes and more reliable, so it's the best choice for a ZIL. You can still use it as an L2ARC, too: reliability and speed never hurt.
MLC flash will give you more capacity for your money, but it's slower and less reliable than SLC. Therefore, MLC makes a good read accelerator when you're budget-constrained. But don't use MLC flash for a ZIL: if a write to a single cell fails, other bits may be affected, and the failed write may corrupt blocks that were written previously. The ZIL is an important last line of defense for your data; you don't want to compromise that.
Interesting question. On one hand, hardware breaks all the time, and SSDs are no exception. On the other hand, SSDs don't have any moving parts and so they're statistically much less susceptible to failures. It really depends on your risk tolerance:
The ZIL is the last resort the system turns to if it crashes before data that was promised to the application as "safe" is actually written to disk. Upon reboot, the system reads back the ZIL and replays the missing updates onto the actual ZFS storage pool. Since the ZIL is so important for ensuring data integrity, it should be mirrored, and ZFS supports that quite nicely.
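Mirroring the log device is the same one-liner with a mirror vdev (device names are placeholders):

```shell
# Mirror the ZIL across two SSDs so a single SSD failure
# can't lose in-flight synchronous writes.
zpool add tank log mirror c4t0d0 c5t0d0
```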
The L2ARC is a read cache: it stores data for convenience and speed only, and every bit of data in the L2ARC is also available elsewhere. So mirroring an L2ARC SSD is not really necessary (though Marcelo has a very good point that a dramatic loss of performance may actually justify an L2ARC mirror). Instead, ZFS will use any extra SSDs you give it for L2ARC to expand the amount of space available for caching data.
Yes, you can. (I always wanted to say that phrase, btw...) In theory, you can split an SSD into two slices with the format(1M) command. In practice, this means you'll have two streams of data (ZIL writes plus L2ARC reads and writes) instead of one, competing for the limited resources of the SSD's connection and controller. That may compromise your ZIL performance as the two mechanisms step on each other's feet. Better to try it out: split up the SSD, configure the ZIL part, see how much it improves your write performance, then hook up the L2ARC part while checking whether the ZIL performance is still good. Use zilstat to monitor ZIL performance.
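The experiment could look like this (slice names are placeholders for the two slices you created with format(1M)):

```shell
# Use one slice of the SSD as a log device...
zpool add tank log c4t0d0s0

# ...measure synchronous write performance, then add the
# other slice as a cache device and check whether ZIL
# performance holds up under the combined load.
zpool add tank cache c4t0d0s1
```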
The role of the ZIL is to store a transaction group until it has safely been written to disk. After that, its contents can be safely discarded and the space reused for the next transaction group. So the question becomes: how much transaction group data is "in flight" (i.e., not yet written to disk) at any time? ZFS issues a new transaction group (and consequently a new pool update) every 5 seconds at the latest (more often if the load is higher). While one transaction group is being written to the ZIL, the previous one may still be in the process of being written to disk, so we need enough space to store two transaction groups, which means 10 seconds' worth of data at maximum write throughput. What's the maximum amount of data your server writes in 10 seconds? An upper bound would be the maximum write speed of your SSD. At the time of this writing, that was about 170 MB/s for an Intel X25-E; times 10, that would be just short of 2 GB for a typical ZIL. So for ZILs, a little can go a long way.
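The back-of-the-envelope calculation above can be sketched in a few lines (the 170 MB/s figure is the X25-E's rated sequential write speed at the time; plug in your own SSD's numbers):

```python
def zil_size_estimate_mb(max_write_mb_per_s, txg_interval_s=5, groups_in_flight=2):
    """Upper bound for ZIL size: enough space for all transaction
    groups that can be in flight at once (here: two 5-second groups,
    i.e. 10 seconds of maximum write throughput)."""
    return max_write_mb_per_s * txg_interval_s * groups_in_flight

# Intel X25-E example: 170 MB/s sustained for 10 seconds of writes
mb = zil_size_estimate_mb(170)
print(mb, "MB, about", round(mb / 1024, 2), "GB")  # 1700 MB, just short of 2 GB
```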
This is more difficult, or easier, depending on how you put it. More is always better, but too much would be a waste if it's not used. Check your L2ARC usage with arc_summary: if you still see a significant number of ghosts after adding an L2ARC, you'll likely benefit from even more L2ARC space. Another way to estimate L2ARC needs is to look at your working set: the amount of data that is used most frequently. Depending on your application, this could be your top 10 research projects, your top 20% of recurring customers, your 100 most popular products, etc.
For more details, check out the zpool(1M) man page, in particular the sections on the Intent Log and Cache Devices, the zpool add and zpool attach subcommands, and the EXAMPLES section.
It's actually not complicated: ZFS can be administered with just two commands! But if you're looking for a GUI version, check out the Oracle ZFS Storage Appliance products for an easy-to-use NAS version of ZFS. Some third-party vendors offer other appliances or appliance-ready OS distributions that are based on OpenSolaris or FreeBSD. They don't support all the latest features that the Oracle ZFS Storage Appliances (or Oracle Solaris 11 Express) offer, but they may be good enough for some. Examples include Nexenta, EON and FreeNAS.
Certainly. Here are a few common scenarios:
Any filesystem that you place on an SSD is going to be faster due to the physical nature of SSDs versus rotating rust.
Databases often allow you to place their database log onto a separate file system. SSDs are a great choice for database log files. Actually, a database log and the ZFS ZIL are very similar mechanisms.
The Oracle database in particular has a Smart Flash Cache Feature that works similarly to the L2ARC described above: It's basically an extension of the SGA so that frequently needed data remains accessible from fast flash memory even if the SGA (which is in memory) is not big enough.
I'm sure you can come up with other great uses, but they almost always involve some sort of intelligent distinction between data that is accelerated with flash and data that is stored on regular disks. The beauty of ZFS is that it does this job automatically for you.
That was quite a few questions, but I think these are the most common ones. Hopefully this FAQ is useful to you!
What are your most common SSD+ZFS related questions? Which ones have been left unanswered? What do you think would make a valuable addition to the FAQ above?
As always, let me know in the comments!
Update: Thanks to Cedar for the idea for this post!