OpenSolaris ZFS Deduplication: Everything You Need to Know


Since November 1st, 2009, when ZFS Deduplication was integrated into OpenSolaris, a lot has happened: We learned how it works, people got to play with it and use it in production, and it became part of the Oracle Sun Storage 7000 Unified Storage System.

Here's everything you need to know about ZFS Deduplication and a few links to help you dig deeper into the subject:

What is ZFS Deduplication?

In short: ZFS Deduplication automatically avoids writing the same data twice on your drive by detecting duplicate data blocks and keeping track of the multiple places where the same block is needed. This saves space and avoids unnecessary IO operations, which can also improve performance.

ZFS Dedup is synchronous (it happens instantly during writes, without any need for background dedup processes), safe (there's no realistic chance that two data blocks are mistakenly treated as equal) and efficient (designed to scale with any size of ZFS filesystem).
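To make the idea concrete, here is a minimal Python sketch of block-level deduplication. It is not how ZFS is implemented: the DedupStore class, the tiny 4-byte block size and the in-memory dictionaries are assumptions made purely for illustration. It only shows the principle of checksumming each block at write time, storing a block once and counting references to it.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to follow


class DedupStore:
    """Toy block store: identical blocks are kept once and reference-counted."""

    def __init__(self):
        self.blocks = {}    # checksum -> block data (each unique block stored once)
        self.refcount = {}  # checksum -> number of places that need this block

    def write(self, data):
        """Split data into blocks, dedup them inline, return the block checksums."""
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            key = hashlib.sha256(block).hexdigest()
            if key not in self.blocks:
                # First time we see this block: actually store it.
                self.blocks[key] = block
                self.refcount[key] = 0
            # Known or new, the block now has one more reference.
            self.refcount[key] += 1
            pointers.append(key)
        return pointers

    def read(self, pointers):
        """Reassemble the original data from its list of block checksums."""
        return b"".join(self.blocks[key] for key in pointers)


store = DedupStore()
file_a = store.write(b"AAAABBBBCCCC")
file_b = store.write(b"AAAABBBBDDDD")

# Six blocks were written, but only four unique blocks are stored.
print(len(store.blocks))
print(store.read(file_b))
```

Because the lookup happens inside write(), the sketch is synchronous in the same sense as described above: there is no separate pass that looks for duplicates later.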

But, who would be better to explain this than the author himself? Read Jeff Bonwick's blog entry about ZFS Deduplication. There's also a video with George Wilson from the ZFS team available.

How Much Can I Save?

It depends. If your data has a high potential for duplicate data blocks, then the savings can be substantial. Here are a few examples:

  • Virtualization Storage: Multiple installations of the same virtualized operating system share the same kernel, libraries, system files and applications. With deduplication, these will only be stored once, but still be available to any number of virtualized OS images.
  • File Servers: Of course it depends on what your users will store on the file server. But the chances are good that they'll end up storing a lot of documents multiple times through collaboration, versioning and the viral effect of sending good stuff around.
  • Mail Servers: The same effect can be expected from mail servers. Some mail servers try to detect this through different means, but ZFS can be sure even if the duplicate data comes in through obscure ways.
  • Backup to Disk: With multiple people backing up their stuff to disk, there's again a lot of potential for multiple copies of the same data: applications, system files, documents, images, etc.
  • Web 2.0 and Social Sharing Websites: Social networking on the web almost always follows viral patterns: Someone finds something cool, then passes it on to their friends. This involves a lot of copying and re-using through different means, creating a lot of potential for deduplication.

I'm sure you can come up with examples of your own!

ZFS will tell you exactly how much you saved through deduplication: Just run zpool list <pool> and look at the DEDUP column, or run zpool get dedupratio <pool> to read the dedupratio property directly. It tells you how much duplication has occurred in the pool since dedup was enabled and gives you a feeling for how much space you have saved.
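If you want to translate a dedup ratio into a space saving, the arithmetic is simple. The following Python sketch assumes that the ratio is the amount of data referenced divided by the space actually allocated for it; the function names and the sample numbers are made up for illustration and are not taken from any real pool.

```python
def dedup_ratio(referenced_bytes, allocated_bytes):
    """Ratio of data referenced to space actually allocated (assumed definition)."""
    return referenced_bytes / allocated_bytes


def space_saved_fraction(ratio):
    """Fraction of space saved for a given dedup ratio, e.g. 2.0 -> 0.5."""
    return 1.0 - 1.0 / ratio


# Hypothetical example: 300 GB of data referenced, 100 GB allocated on disk.
ratio = dedup_ratio(300, 100)
print(f"dedup ratio: {ratio:.2f}x")
print(f"space saved: {space_saved_fraction(ratio):.0%}")
```

A ratio of 1.00x means nothing was saved, 2.00x means the data takes half the space it would otherwise need, and so on.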

What Are The Costs?

It depends. As Jeff explained, deduplication involves using a stronger checksum algorithm and/or some extra checking, which can have a slight impact on performance. If the deduplication table fits into memory, then the performance hit is minimal. If ZFS needs to fetch dedup table data from disk, then the hit may be bigger.

On the other hand, you'll likely save a lot of IO operations for data that doesn't need to be written or read a second time, so the performance may actually improve as a result of deduplication.

The rule of thumb is: The more dedup saves in terms of space, the more the benefits will outweigh the costs. But if your data is unique all the time, there won't be a benefit from deduplication and the cost will become more prevalent in terms of performance.
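Whether the deduplication table fits into memory can be estimated with some back-of-the-envelope arithmetic. The Python sketch below is only a rough guide under stated assumptions: the figure of about 320 bytes per table entry is a commonly quoted approximation and does not come from this article, and the 64 KiB average block size is a hypothetical value that you should replace with one that matches your own data.

```python
def estimate_ddt_bytes(unique_data_bytes, avg_block_bytes, bytes_per_entry=320):
    """Rough dedup table size: one entry per unique block.

    bytes_per_entry=320 is an assumed, commonly quoted approximation,
    not an exact figure for any particular ZFS release.
    """
    entries = -(-unique_data_bytes // avg_block_bytes)  # ceiling division
    return entries * bytes_per_entry


TIB = 2 ** 40
GIB = 2 ** 30

# Hypothetical pool: 1 TiB of unique data, 64 KiB average block size.
ddt = estimate_ddt_bytes(1 * TIB, 64 * 1024)
print(f"estimated dedup table size: {ddt / GIB:.1f} GiB")
```

Smaller average block sizes mean more entries and therefore a bigger table, so the same amount of data can need noticeably more memory when it consists of many small blocks.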

The good news is that you can try it out at no risk: You can easily switch dedup on, then test for a week or so, and the dedup ratio (see above) will tell you how much space you saved, while you'll be able to observe how performance is impacted. If you determine that dedup doesn't give you (enough) benefits, then you can easily switch it off again. Keep in mind that switching dedup off only affects new writes: blocks that are already deduplicated stay that way until they are rewritten.

And no, there are no extra monetary costs for ZFS dedup: It's free.

Tell Me More About Performance!

Let me introduce you to Roch Bourbonnais from the Oracle Solaris Kernel Performance Engineering group. He recently wrote a very thorough article on ZFS Deduplication performance that is a must-read on this topic.

Darren Moffat optimized the performance of the SHA256 checksum used by ZFS Dedup shortly after it was integrated and shared his results in another blog post.

How About Real-World Testing?

A number of people shared their experience with ZFS Deduplication:

Ok, Where Can I Get It?

ZFS Deduplication is available in a number of ways:

Where's The Documentation?

Simple: It's built into the current man pages for zpool(1M) and zfs(1M). The ZFS Administration Guide now has a section on ZFS Deduplication, too, and there's a ZFS Deduplication FAQ available on OpenSolaris.org.

Conclusion

ZFS Deduplication is real, and ready for you to profit from. It's powerful, easy-to-use and in the majority of cases, it saves a lot of space and can even improve performance. It's also risk-free to try it out. What more could one want?

Well, there's still some stuff to do: As Jeff points out in his blog article, there are three places where you can dedup: disk, memory and network. So we can expect even more capacity and performance gains as deduplication makes its way into RAM and networking.

And then there's still a lot of experience to gain from real-world applications: How much is the average space gain per virtual machine for a deduped VMware installation? How much faster does a full backup-to-disk run if there's already a backup stored? Will dedup work across different representations of the same data (like a ZVOL storing data in some file format vs. the same data stored in a filesystem)?

This is where you come in:

Your Turn

What are your experiences with ZFS Deduplication? Have you tested it already? Are you using it in production? What are your average space savings, and how does it impact performance for you? Which use cases benefit the most from dedup for you, and in which cases should it stay off? What other ZFS Deduplication resources did you find useful?

As always, feel free to share your thoughts in the comments!

Update: Added a link to the ZFS Deduplication FAQ in the Documentation section and fixed a typo: the correct command is zpool get dedupratio <pool>.
