Building your own ZFS fileserver

sub.mesa

As some people have suggested, i'm creating this thread to help people set up their fileservers using the ZFS filesystem. I will also try to explain some things about ZFS, and give some links to get you started.



What is ZFS?
Sun Microsystems developed the Zettabyte File System (ZFS), which first showed up in Sun's Solaris operating system in 2005. ZFS is different from other filesystems like FAT, NTFS, Ext3/4, UFS, JFS and XFS because ZFS is both a filesystem and a RAID engine in one package. This yields emergent properties; things that aren't possible if the two systems were separate.



Why is it so cool?
ZFS is the most advanced single-system filesystem available, with unique properties:
  • ZFS is both Filesystem and RAID-engine.
  • ZFS protects your data from corruption using checksums.
  • ZFS is maintenance-free and requires no filesystem check; it automatically repairs any corruption it detects, as long as redundancy is available.
  • ZFS can also act as (incremental) backup using snapshots, much like Windows Restore Points.
  • ZFS is versatile; it allows you to grow your filesystem by adding more disks.

Some more technical features:
  • It allows you to make any combination of RAID0 (striping), RAID1 (mirroring), RAID5 (single parity) and RAID6 (double parity)
  • Because of its Copy-on-Write design, it's very resilient against crashes and other problems, and will never need a filesystem check!
  • Because ZFS is both RAID and Filesystem, it can use dynamic stripesizes to adapt to the I/O workload.
  • Aggressive caching and buffering make ZFS consume lots of RAM, but benefit I/O performance.
  • Transparent compression can significantly reduce on-disk size in some cases.
  • ZFS can use SSDs as cache devices, increasing the performance of the entire array and adapting to your usage pattern.

In essence, ZFS offers excellent protection against data loss of almost any kind, short of disasters that affect the whole system physically, like flooding or fire. It's also very flexible, allowing you to change almost anything.



What do i need to use ZFS?
Sadly, ZFS doesn't run on Windows, and Apple has withdrawn its plans to include it in Mac OS X. Windows falls short in offering advanced storage technology - all the hot stuff is on Linux or on UNIXes like FreeBSD. So for the purpose of building our fileserver, we want something more modern. To use ZFS, you need:

  • 64-bit dual-core/multi-core processor (AMD64)
  • Lots of RAM (2GB minimum; 4GB+ recommended)
  • Modern motherboard with onboard SATA + gigabit ethernet and PCI-express
  • Operating system that supports ZFS. Currently: OpenSolaris, FreeBSD and FreeNAS


Tell me more about FreeNAS
FreeNAS is an open source OS based on FreeBSD, adapted to function as an easy NAS operating system that is configurable over the network. FreeNAS installs very easily and should be easy to configure, but it has limited features. It also offers ZFS, but a slightly older version: v6 instead of the v13/v14 that FreeBSD/OpenSolaris use. It is generally your best option if you want something set up quickly; but for large filesystems or more complicated setups, a full-fledged operating system is required.



So what about FreeBSD?
FreeBSD version 7.0 - 7.2 supports ZFS version 6. This is what FreeNAS uses.
FreeBSD version 8.0 supports ZFS version 13.
FreeBSD version 8.1 supports ZFS version 14.

FreeBSD 8.1 is not released yet, so FreeBSD 8.0 is the current stable release that packs the full kernel-based implementation of ZFS.
FreeBSD is a UNIX operating system and may be difficult to master. The installation in particular is a bit tricky, as it uses an outdated text-based installer.



Wait, hold on! So you're saying this FreeBSD is command line stuff?
Well yes. The installation and setup in particular will be hard, and you should follow a howto document or be guided by someone familiar with FreeBSD. After that, you can use a Windows SSH client like PuTTY to connect to your FreeBSD server, so you can configure the server from your Windows PC while still working on a command line.



So what do i need to do in this command-line; i prefer something graphical!
Yes, well, it can be useful to work with commands so you know exactly how you got there. Meaning: if you write down the commands you used to create your filesystem, you know how to do it a second time. So while commands may be scary at first, they will feel more logical in the long run.



So how do these commands look like, can you give me an example?
Certainly!

Code:
# creates a RAID-Z (RAID5) array called "tank" from disks 1, 2 and 3
zpool create tank raidz label/disk1 label/disk2 label/disk3
# create filesystems
zfs create tank/pictures
zfs create tank/documents
# enable compression for our documents directory only
zfs set compression=gzip tank/documents
# also store each file in the documents directory on all three disks, for maximum safety
zfs set copies=3 tank/documents
# snapshot the documents directory, creating a "restore point"
zfs snapshot tank/documents@2010-03-04
# made a mistake? simply roll back to the last snapshot
zfs rollback tank/documents@2010-03-04
# get status from your array
zpool status tank



But ZFS is still Software RAID right? Should i get a real Hardware RAID card instead?
No, certainly not! Doing so would mean you lose ZFS's ability to heal itself. You should let ZFS do the RAID and just use the onboard SATA ports. If those are not enough, expand with PCI-express controllers that present the disks to ZFS as plain SATA devices. Never use PCI for anything! Only PCI-express.



But still isn't software RAID slower than hardware RAID?
On paper it's not; on paper, software RAID is superior to hardware RAID. Hardware RAID adds unavoidable latency, while software RAID already has access to a very fast CPU and lots of RAM, which makes it easier to implement an advanced RAID engine in software. Note that in the case of hardware RAID, it is still software - the controller's firmware - that actually implements the RAID, and that implementation may be simple and unsophisticated compared to ZFS.

As for speed, ZFS is fast enough, but never at the expense of data safety. Some unprotected filesystems may be a fraction faster, but ZFS adds a lot of reliability without sacrificing much speed.



So exactly how do i setup this ZFS?
I will be explaining this in detail later, but generally:
  • First, install your operating system, i'm assuming FreeBSD here. The OS should be on a different system drive, which can be a USB pendrive or compactflash card, a parallel ATA disk or just a SATA drive. It's best if the system drive is completely separate from the disks that will be used by ZFS.
  • Then, connect your HDDs, let FreeBSD find them. Label the drives so each drive has a name like label/disk1 or label/disk2 etc. This avoids confusion, and makes sure that it will be found and identified correctly, regardless of how it was connected.
  • ZFS comes pre-installed with FreeBSD, so there is no need to install ZFS separately.
  • Create ZFS RAID pool using the "zpool create" command
  • Create ZFS filesystems
  • Set various ZFS options
  • Set permissions
  • Set up Samba or NFS so you can use the filesystem from your networked computers (a rough sketch of these last steps follows this list)
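
To give an idea of those last two steps, here is a minimal NFS sketch. The pool/filesystem names, the user "alice" and the network range are assumptions; Samba is covered further down.

Code:
# give your user ownership of the filesystem so it can write to it
chown -R alice:alice /tank/pictures
# enable the NFS server by adding to /etc/rc.conf:
#   rpcbind_enable="YES"
#   nfs_server_enable="YES"
#   mountd_enable="YES"
# then export the filesystem over NFS to your local network
zfs set sharenfs="-network 192.168.1.0 -mask 255.255.255.0" tank/pictures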



Can i expand my ZFS RAID array?
Yes, but some restrictions apply.

What you cannot do, is expand an existing RAID-Z (RAID5) or RAID-Z2 (RAID6) array with one or more disks.
But, you can add new disks or RAIDs to an existing pool. So if you have a 4-disk RAID-Z, you can add another 4-disk RAID-Z so you have 8 disks. The second array would share free space with the first; in essence it would be a RAID0 of two RAID5 arrays. ZFS can expand this way.
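
For example, adding a second 4-disk RAID-Z to an existing pool is a one-liner; a sketch, with the label names as placeholders:

Code:
# add a second RAID-Z vdev to the existing pool "tank"; new data is striped over both
zpool add tank raidz label/disk5 label/disk6 label/disk7 label/disk8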

What you can do is expand mirrors and stripes (RAID0). In the example above, that's what actually happened: a new array is RAID0-ed with the existing array. Newly created files will be striped across both arrays, for additional speed. Setting copies=2 would make files in that directory be stored twice, which ZFS tries to place on different arrays, for extra redundancy.



What if a disk fails?
You identify which disk is causing problems with the "zpool status" command, then replace the disk with a new one using "zpool replace".
As long as the failures do not exceed the redundancy offered by ZFS, everything will continue to work, including write access.



Do i need to use TLER or RAID edition harddrives?
No, and if your drives have TLER you should disable it when using ZFS. TLER is only useful for mission-critical servers that cannot afford to freeze for 10-60 seconds, and to cope with poor-quality RAID controllers that panic when a drive stops responding for several seconds because it's performing recovery on some sector. Do not use TLER with ZFS!

Instead, allow the drive to recover from its errors. ZFS will wait, and the wait time can be configured. You won't end up with broken RAID arrays, which is common with Windows-based FakeRAID.



How future-proof is ZFS?
Now that Sun has been acquired by Oracle, the future of ZFS may be uncertain. However, it is open source and still in development. Several non-Sun operating systems now have ZFS integrated, and projects like kFreeBSD may bring FreeBSD's ZFS implementation to distributions like Ubuntu.

That said, ZFS is not very portable; only a few systems can read it.
However, you can connect the disks to a Windows machine and use VirtualBox/VMware to let FreeBSD inside a VM access the RAID and export it over the network. That works, but Windows should not touch your disks in any way. Simply choosing to 'initialize' the disks would lead to data loss and perhaps total corruption as key sectors get overwritten. ZFS is resilient, but such tampering may exceed the metadata redundancy of three copies per metadata block.



How do i maintain ZFS? Defragment etc?
You don't. You don't need to. :)
The only thing you need to do is make sure you get an email or status update when one of your drives fails or shows corruption, so you are aware and can intervene at the earliest opportunity.
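
On FreeBSD, one way to arrange that (a sketch; the daily status knob and the monthly scrub schedule are assumptions about your setup) is to enable the daily ZFS status report and scrub the pool regularly:

Code:
# /etc/periodic.conf: include "zpool status" output in the daily status mail to root
daily_status_zfs_enable="YES"

# /etc/crontab: scrub the pool "tank" on the first day of every month at 04:00
0  4  1  *  *  root  /sbin/zpool scrub tank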



Can ZFS replace a backup?
RAID alone can never replace a backup; RAID doesn't protect against accidental file deletion, filesystem corruption or a virus that wipes the drive. But ZFS can protect against those: using snapshots, you can make incremental backups and go back in time to each day's version of the filesystem.

A nightly snapshot is very useful, and snapshots use no additional storage space unless you modify or delete files after the snapshot is taken.
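
Such a nightly snapshot can be automated with cron; a minimal sketch, where the filesystem name and the 03:00 schedule are just examples:

Code:
# /etc/crontab: snapshot tank/documents every night at 03:00, named after the date
0  3  *  *  *  root  /sbin/zfs snapshot tank/documents@$(date +\%Y-\%m-\%d)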

So yes, ZFS can replace a backup. But note that ZFS is advanced software with many lines of code, and any bug may still threaten your data. So for data you can't afford to lose, a real backup at another physical location is still highly recommended.



What about the ZFS support in Linux?
The Linux kernel uses the GPL license, which is not compatible with the CDDL license that ZFS uses. That means ZFS can't be integrated directly into the Linux kernel, which would be the best possible implementation. Instead, the zfs-fuse project implements ZFS in userspace, which has big performance drawbacks and is generally not suitable for most users.

Another effort implements ZFS as a kernel-level CDDL module linked against the Linux kernel; it has a working prototype but appears unmaintained. If you want ZFS, you need FreeBSD, FreeNAS or OpenSolaris.



How fast does ZFS go?
Real performance is too complicated to be reduced to simple numbers, and ZFS's buffering and caching make benchmarking it quite hard. But it's very fast in real-world scenarios and its speed should never be an issue - as long as you do not use PCI in your system!



What are ZFS cache devices?
ZFS is able to use SSDs in a special configuration, where they act as a cache for the HDDs. This is like having more RAM as file cache, except an SSD holds much more than your RAM can. Whenever you read something from the RAID array that is cached, the SSD serves the read request instead, with very low access times. ZFS tracks which data is accessed most and puts that on the SSD, so it automatically adapts to your usage pattern. You can have an array of many terabytes and a small SSD that serves the files you access every day, and make a real improvement to the performance of the array.
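
Adding an SSD as a cache device is a single command; a sketch, assuming the SSD was labeled "ssd0" the same way as the data disks:

Code:
# add the SSD as a cache (L2ARC) device to the pool "tank"
zpool add tank cache label/ssd0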



Can i use hot-spares with ZFS?
Yes, you can add one or more hot-spare disks as 'spare' devices. These will be available to any array that is degraded, so you can share one hot-spare disk across multiple RAID-Z arrays, for example.
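
A sketch of adding one, again assuming a labeled disk:

Code:
# add a shared hot-spare to the pool "tank"
zpool add tank spare label/disk9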



How much RAM can ZFS use?
A lot; the largest i've seen was 6.3GB. RAM usage depends on settings, the number of disks, the stripesize and, most of all, the workload. The harder you make ZFS work, the more memory it will consume - but it will be well spent. For low-memory systems, you can limit the memory ZFS uses, though this also limits performance. Generally, you should not use ZFS with less than 2GB of RAM without conservative tuning that disables a lot of the fancy ZFS features.
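
On FreeBSD these limits go in /boot/loader.conf; a sketch for a low-memory (2GB) system, where the exact values are assumptions that need tuning for your hardware:

Code:
# /boot/loader.conf - conservative ZFS memory tuning (example values)
vm.kmem_size="1536M"
vm.kmem_size_max="1536M"
vfs.zfs.arc_max="512M"
vfs.zfs.prefetch_disable="1"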



Why can't i use ZFS on 32-bit?
You can, but the limited kernel address space on 32-bit restricts ZFS to around 512MiB of memory, where only minimal settings work. In those conditions, a heavy workload can cause ZFS to panic and crash. That isn't catastrophic - just reboot and it works again without you needing to do anything - but it's not the way you should use ZFS. ZFS is a 128-bit filesystem and feels at home on a 64-bit CPU with a 64-bit operating system.



How do i access ZFS from my Windows PC?
For that you need a network protocol. Windows filesharing is the common choice, which uses the CIFS/SMB protocol. Samba can be used to export your ZFS filesystem to your Windows PCs; you get a drive letter like X:\ that contains your ZFS filesystem. Other protocols work better though - NFS and iSCSI in particular - but unfortunately they are not natively supported by Windows. While Samba works, it may limit throughput. It's a shame if your ZFS array does 400MB/s internally but over the network you're stuck at 40MB/s; that's a common issue with Samba.
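
For reference, exporting a ZFS filesystem with Samba only takes a small share definition; a sketch, where the share name, path, user and config location (the FreeBSD Samba port uses /usr/local/etc/smb.conf) are assumptions:

Code:
# /usr/local/etc/smb.conf - minimal share for /tank/pictures
[pictures]
   path = /tank/pictures
   valid users = alice
   read only = no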



How do you access your ZFS filesystem?
I use ZFS as a mass-storage disk and access it using NFS (Network File System) - the preferred way to share files on Linux and the like.
I also use ZFS to store the system volumes of my five Ubuntu workstations. So my desktop PCs don't have internal drives - everything is on the network, on ZFS. This makes backups much easier, as i can snapshot my system disks. The system drives are accessed using iSCSI, which also works on ZFS under FreeBSD. Booting happens over the network too, using PXE and specifically pxelinux.
The upside is that i have a lot of control over my data, especially because i can make incremental snapshots really easily. The downside is that performance is capped by network bandwidth, as i'm still using 1Gbps ethernet. 10Gbps NICs are available but at a steep cost - more than $500 per NIC - and switches are even more exotic. I suspect prices will drop significantly in 2011, bringing 10-gigabit to enthusiasts as well as the server market.



Please use this thread to discuss setting up ZFS and talk about its features. Feel free to ask questions.

Version history:
1.0 - initial version
1.1 - added Hot-Spare section, added section about how i access my ZFS
 
Very nice article. Props for the effort!

My only concern with ZFS is that sometime in the near future it will be abandoned by Oracle in favour of btrfs, which isn't really ready for primetime yet. Obviously your existing system will be fine, but I foresee a big slowdown in ZFS development in the next year or two.
 
Well, it's integrated in the FreeBSD trunk now, so it'll stay there regardless of Oracle or Sun. It will continue to be patched and hacked by Pawel Jakub Dawidek (pjd@), who ported ZFS to FreeBSD.
Currently, missing features are:
  • Kernel-level CIFS driver (FreeBSD-only; present in OpenSolaris). Workaround: Samba
  • Kernel-level iSCSI driver (FreeBSD-only; present in OpenSolaris). Workaround: istgt port
  • Expanding RAID-Z and RAID-Z2
  • Transparent encryption (zfs-crypto project integration target Q1 2010; FreeBSD might import it after inclusion in OpenSolaris)

Other than that, i think ZFS works like a charm. Even if it stayed as it is, it would still be great - much more modern and sleek than any of the other filesystems. And i'm not concerned that ZFS will disappear from FreeBSD.

Once the kFreeBSD kernel project works well in Ubuntu, it would also be possible to have FreeBSD's ZFS implementation running in Ubuntu - now that would be kick-ass, wouldn't it? ZFS for Ubuntu desktops or servers. Still, i prefer FreeBSD, because i know it reasonably well and like its consistency.
 
What about people with large numbers of hotswap drives, possibly with SAS expanders and external enclosures? Can any FreeBSD tools blink the appropriate lights on the hot-swap enclosures to indicate a failed drive that needs to be replaced?

At the end, you mention using iSCSI to mount a ZFS. I have always used iSCSI to mount a block device, with the consequence (unlike NFS) that it can only be mounted by one machine at a time, and more importantly, the machine mounting it must understand the filesystem on the block device. Is this what you had in mind when you mentioned using iSCSI with ZFS? Or is there another way to use iSCSI?
 
What about people with large numbers of hotswap drives, possibly with SAS expanders and external enclosures? Can any FreeBSD tools blink the appropriate lights on the hot-swap enclosures to indicate a failed drive that needs to be replaced?
As you label each disk, you can put a sticker on each drive with its number. So if a particular disk is called "label/disk3", you know which physical disk that is.

As for blinking LEDs: i use 2.5" enclosures that go in the 5¼" bays also used for CD-ROM drives. In the space of one CD-ROM drive you get 4 x 2.5" HDD/SSD hotswap bays. They all have activity LEDs, driven by the SATA interface itself rather than a separate header (you only have one of those on your motherboard anyway). So as you read from one drive, you can see which LED is blinking.

At the end, you mention using iSCSI to mount a ZFS. I have always used iSCSI to mount a block device, with the consequence (unlike NFS) that it can only be mounted by one machine at a time, and more importantly, the machine mounting it must understand the filesystem on the block device. Is this what you had in mind when you mentioned using iSCSI with ZFS? Or is there another way to use iSCSI?
iSCSI is a SAN protocol, meaning that the filesystem lives on the client system. In this case, ZFS exports a ZVOL: a block-level device that looks like a blank hard drive to the iSCSI target daemon (the server), which exports it to the client.

Let's show a small zvol example, in which we create a UFS filesystem (native to FreeBSD) on a zvol running on ZFS:
Code:
# zfs create -V 10g tank/ufs
# newfs /dev/zvol/tank/ufs
# mount /dev/zvol/tank/ufs /ufs

So, if you create a zvol and export it using the iSCSI target daemon, Ubuntu can use it as if it were an internal drive and partition it with Ext4. It can serve as a system disk or as an extra storage disk. In this case, ZFS is not the filesystem the client sees, but you still get the reliability of the underlying ZFS pool.

The downside of SAN-protocols like iSCSI is that the files can only be accessed by one machine. If you want to make central storage space available to your local network, you need a NAS-protocol where the filesystem is on the server and a network protocol acts as abstraction layer between the two systems.

NFS is the best shared-access protocol i know. It's quite easy, it works, it's bloody fast and it's available on anything that isn't Windows. And for Windows, i'm sure third-party commercial NFS drivers exist; that would be a faster alternative to using Samba with the SMB/CIFS protocol.

On the other hand, SMB/CIFS works on all operating systems (Windows, Linux, Mac, UNIX), so it's very compatible. Just know that performance will be lower than NFS or FTP.
 
So it sounds like FreeBSD/ZFS is not convenient for large hot-swap file systems. For a system with several large external hot-swap enclosures, it would be a hassle to try to match up and label all the drives with a device code in the OS. But with any decent hardware RAID controller, no labeling is necessary. When a drive fails, it will light up a red LED or whatever next to the failed drive, and you can easily swap it. No need to even log into the OS.

As for iSCSI with linux and ZFS, I do not understand the usage you mentioned. Have you actually mounted a ZFS volume on linux via iSCSI? If so, you must be using some method that I am not aware of. Whenever I mount an XFS volume in linux via iSCSI, the client sees the volume as a block device with XFS, and must understand XFS just as if the block device were attached directly.

What is the UFS in your example?
 
So it sounds like FreeBSD/ZFS is not convenient for large hot-swap file systems. For a system with several large external hot-swap enclosures, it would be a hassle to try to match up and label all the drives with a device code in the OS. But with any decent hardware RAID controller, no labeling is necessary. When a drive fails, it will light up a red LED or whatever next to the failed drive, and you can easily swap it. No need to even log into the OS.

As for iSCSI with linux and ZFS, I do not understand the usage you mentioned. Have you actually mounted a ZFS volume on linux via iSCSI? If so, you must be using some method that I am not aware of. Whenever I mount an XFS volume in linux via iSCSI, the client sees the volume as a block device with XFS, and must understand XFS just as if the block device were attached directly.

What is the UFS in your example?

I'd be interested in this as well.

How could mounting the volume with iSCSI in Windows work? Windows treats it like a block device, and would want to format it.

Also, you said that it could not be used in Linux, but isn't OpenSolaris Linux?
 
Well, its integrated in FreeBSD trunk now, so it'll stay there regardless of Oracle or Sun. It will continue to be patched and hacked by Pawel Jakub Dawidek (pjd@) who ported ZFS to FreeBSD.
I have no doubt that it'll kick around for quite a while, but if Oracle/Sun drop their support for it, it'll stagnate pretty quickly.

Also, lack of OCE (online capacity expansion) is pretty much a showstopper for any home setup I'd use myself, though others likely have different requirements, and if you don't need that feature I'd agree ZFS is a great choice today. I wish I could use it myself. OCE support for ZFS has been discussed, developers have even proposed a mechanism by which it could be accomplished, but Sun is apparently unwilling to fund development of this feature (or puts it low on the priority list), and the community hasn't thus far stepped up. See this pretty interesting blog post.

it would be a hassle to try to match up and label all the drives with a device code in the OS.
I really don't think it's that big a deal in a home setting. Generate some constant I/O to the array. The disk with no access is the failed one.

Also, you said that it could not be used in Linux, but isn't OpenSolaris Linux?

Not even close.
 
OpenSolaris is UNIX, like FreeBSD is. UNIX traditionally uses UFS, the UNIX File System. It is much like Ext2/3 or NTFS: nothing special, but not terribly outdated like FAT.

In my example i did three things:

# zfs create -V 10g tank/ufs
this creates a 10GiB block-level device ("ZVOL") on a ZFS filesystem

# newfs /dev/zvol/tank/ufs
this creates UFS filesystem on the ZVOL

# mount /dev/zvol/tank/ufs /ufs
mounts the UFS filesystem to /ufs
Congratulations! You now have a UFS filesystem running on top of ZFS.

The iSCSI story is similar: an Ubuntu client connects to the FreeBSD server using iSCSI, gets a 10GiB block-level device, and formats it with its own filesystem such as Ext4. But this Ext4 filesystem actually resides on ZFS, so you still get the ZFS benefits. The client is not aware of ZFS, though.

It's much like an ISO file: it has its own filesystem (ISO 9660), but it's just a file that resides on a FAT/NTFS filesystem. ZFS can reserve blocks in its pool as ZVOLs, which can be used for almost anything.
 
As for lack of capacity expansion, i don't think this is such a big problem.

As you add more drives, relying on a single parity drive becomes kind of weak, so RAID6 comes into play. But a RAID0 of two RAID5s would be a better solution, and you can do that with ZFS:

  • You start with 4-disk RAID-Z array
  • Later, you buy 4 more disks, create a second RAID-Z array and add to the existing pool
  • Now it's a bit like running one 8-disk array with double parity: you have 2 parity drives in total (one per RAID-Z)
  • Add a 3-disk RAID-Z? Sure! Add a mirror? Fine too. Add a single device? Also fine, but single devices won't have redundancy. SSDs could be added as single devices, though i only recommend that if the entire pool is SSDs; otherwise using the SSD as a cache device is more logical.
  • Best yet: any device you add doesn't have to be of the same size as your previous array

So you can expand, just not in the usual way. ZFS is still very flexible.
Maybe even better: you can start with 4x 1.5TB drives now, later add 4x 2.0TB drives, and then some 5.0TB drives as they become available in the future; you aren't stuck using smaller drives just because that's what your RAID array originally used.
 
My storage needs don't justify buying a large number of disks at a time, and it also makes the efficiency of the system go down over time with no corresponding increase in redundancy. It could make sense if you're the type building up 20-disk arrays and buying disks a half dozen at a time, but for my piddly little 6TB array, I can't justify doing things that way when I could just use RAID-6 on Linux.
 
kudos, sub.mesa -- i was gonna recommend that you start your own ZFS thread since you've been so .. um... *passionate* about ZFS in many other threads on here :)

thanks for your time and efforts.
 
So it sounds like FreeBSD/ZFS is not convenient for large hot-swap file systems. For a system with several large external hot-swap enclosures, it would be a hassle to try to match up and label all the drives with a device code in the OS. But with any decent hardware RAID controller, no labeling is necessary. When a drive fails, it will light up a red LED or whatever next to the failed drive, and you can easily swap it. No need to even log into the OS.

As for iSCSI with linux and ZFS, I do not understand the usage you mentioned. Have you actually mounted a ZFS volume on linux via iSCSI? If so, you must be using some method that I am not aware of. Whenever I mount an XFS volume in linux via iSCSI, the client sees the volume as a block device with XFS, and must understand XFS just as if the block device were attached directly.

What is the UFS in your example?

It's worse than that. When you add new disks on new controllers, the disk device names can change on you. This isn't a problem for ZFS - it reads the label off the disk and isn't distracted by the device names changing - but it is a major pain for a user trying to figure out which disk is which.

Basically, you have to keep track of the disk serial numbers and which slot each is in, then use SMART tools on the system to find the serial number of the disk that had the problem, and then manually look that up in the cabinet to find the right disk. You can't see the serial number without pulling the disk out of the enclosure, which is a problem.

Also, in the case of expanders, changing cables around and inserting disks into empty slots in the expander can also change device names. The inability to signal via the enclosure LED indicators can lead to the mistake of pulling a good drive instead of a bad one during a critical outage, taking the volume down when it could have been avoided.

BTW, this is a big problem in Unix systems in general - hardware raid vendors have invested in GUI tools but the core OS has none of this. Solaris included.
 
So you are saying the FreeBSD/ZFS is actually simulating a block device, but in reality there is a filesystem, ZFS, running underneath the block device?

I've never heard of such a thing. Would you happen to have a link that discusses this further?
 
So you are saying the FreeBSD/ZFS is actually simulating a block device, but in reality there is a filesystem, ZFS, running underneath the block device?
ZFS is kind of weird. It's both a filesystem and a volume manager. It's a little confusing and really I'd prefer the two to be separated, but it's borne out of necessity by some of ZFS's features (e.g. only mirroring used blocks).
 
ZFS is kind of weird. It's both a filesystem and a volume manager. It's a little confusing and really I'd prefer the two to be separated, but it's borne out of necessity by some of ZFS's features (e.g. only mirroring used blocks).

So, in the iSCSI example being discussed, say you set up a ZFS device (pool?) as an iSCSI target on the FreeBSD system, and then mount that block device in linux via iSCSI, and format the device with, say XFS.

Will you still get all the benefits of ZFS, even though you would be running XFS on a block device on top of ZFS? I am confused since, as you say, it seems many ZFS advantages rely on coordinating the filesystem with the volume management, but if XFS is running on top, how can such coordination be achieved?
 
@odditory: thanks, and yes passionate is a kind word for my spamming on these boards ;-)
I don't work for Sun/Oracle or on ZFS though, but it changed my perception of storage. ZFS is essentially the next generation of filesystems: no more partitions and such, but a real filesystem that centralizes storage in a flexible way and deals with multiple disks intelligently. It never needs to be taken offline for a disk check and isn't susceptible to the crashes and power problems that plague other filesystems, especially when using a large write-back cache.

@keenan: well, building a NAS pretty much means multiple disks - between 4 and 20, i'd say. If your storage demands do not reach that level, a single-disk setup with a backup is usually sufficient. For those who need more storage and want to centralize it, a NAS comes into play. And for the same price as one of the many inflexible, unreliable, slow and expensive commercial NAS products around, you can opt to build your own. When you do, you get the best filesystem and RAID engine for free, if you choose to use it.

ZFS offers reliability features no RAID alone can provide, and the incremental snapshots are great. Essentially, you can go back to any day in the past and see what the filesystem looked like at that point in time. All you need to do is have ZFS create a snapshot every night using a cron job, much like Task Scheduler on Windows.

Also, the checksumming will keep you informed of any corruption, and whether it was corrected or not. You would never have silent corruption go unnoticed again when using ZFS.

Of course, ZFS can and does contain bugs as well. Since its integration into FreeBSD in 2007, Pawel has been working hard at solving them, and in FreeBSD 8.0 the 'experimental' warning was finally removed and ZFS was labeled production-ready. I've been running ZFS for years now, and it works like a charm. Anything else feels unsafe, as i wouldn't know whether i could trust my data.
 
So, in the iSCSI example being discussed, say you set up a ZFS device (pool?) as an iSCSI target on the FreeBSD system, and then mount that block device in linux via iSCSI, and format the device with, say XFS.
In ZFS lingo I think a logical volume is called a zvol. A zpool is a collection of physical devices.

I'm curious about your other questions as well. I'm wondering how checksumming would work, for one.

Edit: sub.mesa - Yeah, I understand what I get out of ZFS, but really it has few real-world advantages for me over Linux software RAID, and the significant disadvantage that I need to buy 3-4 disks at a time whenever I want to expand (and lose extra disks to parity while I'm at it), as opposed to 1. Otherwise most of what ZFS does I can do or approximate with Linux softraid and LVM.
 
How does ZFS handle power outages since it uses RAM for writeback?
Do I need a UPS to prevent data loss of what is still in cache?
 
It's worse than this. When you add new disks on new controllers, the disk device names can change on you. This isn't a problem for ZFS, it reads the label off the disk and isn't distracted by the device names changing. It is a major pain for a user trying to figure out which disk is which.

Basically, you have to keep track of the disk serial numbers etc...
No no no!

That's why you use labeling. When you connect your new disks to FreeBSD for the first time, the first thing you do is label all the drives:

Code:
# read from disk to see the LED blinking, so you know which physical disk this is
# attach a physical label on the drive or hot-swap cage saying "disk1"
dd if=/dev/ad14 of=/dev/null bs=1m count=1000
# optional: create fdisk partition so it doesn't confuse Windows when you connect this disk:
fdisk -I /dev/ad14
# now you have a /dev/ad14s1 device, s1 being the first partition that spans the entire drive
# now attach the label with the name to the disk
glabel label disk1 /dev/ad14s1

Repeat for all disks, so you get labels for your disks:
/dev/label/disk1 = actually /dev/ad14s1
/dev/label/disk2 = actually /dev/ad16s1
etc.

Then you use the labels to create your ZFS RAID:
Code:
zpool create tank raidz label/disk1 label/disk2 label/disk3 label/disk4

Done! Now when ZFS says "disk3" has failed, you know which physical disk that is and can replace it with ease.
 
@keenan: well building a NAS pretty much means multiple disks, between 4 and 20 i'd say. If your storage demands do not reach this level, a single disk setup with a backup is mostly sufficient. For those who have more storage needs and need to centralize their storage, a NAS comes into play. And for the same price as one of the many inflexible, unreliable, slow and expensive commercial NAS products around, you can opt to build your own. And when you do, you get the best filesystem and RAID engine for free, if you opt to use it.

I'm not sure you are thinking of the same hardware as we are. Imagine a server with a SAS controller in it connected to 4 external enclosures, with 24 hot-swap drives each, via cascaded SAS expanders. You can imagine one big RAID set, or multiple smaller RAID sets, spanning those 96 drives in various configurations. Now one drive fails and needs to be replaced. There are LEDs next to each drive in the enclosures. The easiest way to indicate which drive to replace is to light up an LED next to the failed drive. All decent hardware RAID controllers can do that. It sounds like FreeBSD/ZFS cannot light up an LED next to the failed drive.

ZFS offers reliability features no RAID can ever provide, the incremental snapshots are great. Essentially, you can go back to any day in the past and see what the filesystem was at that point in time. All you need to do is make zfs create a snapshot every night using a cronjob-script, much like task scheduler on Windows.


This can be done easily in linux using hardlinks in conjunction with rsync. For example:

http://www.mikerubel.org/computers/rsync_snapshots/
 
No no no!

That's why you use labeling. When you connect your new disks to FreeBSD for the first time, the first thing you do is label all the drives:

What if you change around the connections on the external enclosures and/or SAS expanders, as mikesm mentioned? I think your labels are likely to be invalid.
 
Last edited:
It sounds like FreeBSD/ZFS cannot light up an LED next to the failed drive.
The hardware to do so is not available, or at least not standardized.
What if you change around the connections on the external enclosures and/or SAS expanders, as keenan mentioned? I think your labels are likely to be invalid.
I assume they're stored in the partition metadata. That's the whole point.
 
So you are saying the FreeBSD/ZFS is actually simulating a block device, but in reality there is a filesystem, ZFS, running underneath the block device?
Oh yes, absolutely. Ubuntu uses the Ext4 filesystem, which resides on a block device on the ZFS pool. I am using this setup right now. In fact, all five of my Ubuntu Linux workstations run this way: their system disks are mounted over iSCSI and reside on my ZFS array, on a RAID-Z with an SSD as cache device. So in my setup, most of my system disk reads are handled by the SSD. Isn't that neat? :D

I can instantly back up my system drive and roll it back to a working state should it ever break (after a bad update, for example). The actual data is stored across all disks in the RAID-Z and gets parity protection, so i'm protected from corruption even though the front-end filesystem is Ext4.

I've never heard of such a thing. Would you happen to have a link that discusses this further?
I'm not sure, try googling for "zfs zvol". Also, this Quick-Start guide for ZFS on FreeBSD may be useful:
http://wiki.freebsd.org/ZFSQuickStartGuide
 
What if you change around the connections on the external enclosures and/or SAS expanders, as keenan mentioned? I think your labels are likely to be invalid.
As the label is tied to the data on the physical disk, the disk will be identified no matter how it is connected. You can put it in an external enclosure and connect it over USB instead, and it will still be detected by its label.

So no matter whether the controller, the cable or the interface changes, you still get a label that identifies the physical drive. The label metadata is stored in the last sector of the underlying storage device - in my example the partition spanning the bare disk, so effectively the last sector of the HDD. The label produces a block device (at /dev/label/<name>) that is 512 bytes smaller: exactly the one sector in which the metadata is located.
 
As the label is tied the data on the physical disk, it will be identified by the physical disk. You can put it in a USB external enclosure and connect it using USB instead and it will still be detected by its label.

So no matter if your controller changes, the cable changes, the interface changes; you still get a label that identifies the physical drive. The configuration data (metadata) is stored on the last sector of the underlying storage device, in my example thats the partition on the bare disks; so the last sector on the HDD. The label produces a block-device (at /dev/label/<name>) with a size of 512 bytes less - exactly one sector in which the metadata is located.

Ok but then how do you go about locating the failed drive? I can see in a home environment this could be relatively easy, but at work where you have racks of drives what do you do?
 
How does ZFS handle power outages since it uses RAM for writeback?
Do I need a UPS to prevent data loss of what is still in cache?
You don't need a UPS with a proper filesystem.

Three ways to deal with this:
  • enforcing BIO_FLUSH commands, forcing the disks to empty their buffers and wait until this is complete - to guarantee transactions do not overlap.
  • UFS's soft updates delay metadata updates; after a crash you will see an older version of the filesystem. This does not guarantee that transactions do not overlap.
  • ZFS's copy-on-write model ensures that each write goes to a new location and never overwrites existing data. That way the old data and older metadata are still present and ZFS can simply fall back to them; if a power failure occurs while ZFS is flushing to disk, after a reboot it will be as if roughly the last 30 seconds of writes were reversed and never happened. On top of that, the ZFS Intent Log (ZIL) acts as a journal for synchronous writes, protecting against the large RAM write-back.

So yes, your data is safe. As long as you don't disable your ZIL. ;-)
 
Ok but then how do you go about locating the failed drive? I can see in a home environment this could be relatively easy, but at work where you have racks of drives what do you do?
When you set up the drives for first use, you label each disk in software, but also physically, with a label on the hot-swap drive cage showing the number. So if you have drive cages for 48 disks, you can number them 1 to 48 and give the disks in those cages matching software labels.

This is something you only have to do once. If you have to replace a failed drive, you set up the label again, preferably using the same name. The old drive should no longer be connected to the system if you reuse the name, or it will cause obvious problems.

The idea is that you have a link between ZFS telling you "disk17 has failed!" and the physical disk it refers to. Once you pull disk17 out of its cage, you check the status of your pool again to make sure you didn't pull the wrong drive. If you did, the pool will be in state UNAVAIL (meaning no access) unless you have a RAID-Z2 with 2 parity drives. If you pulled the wrong drive, insert it again and your array returns to state DEGRADED. Try another drive until you hit the jackpot and your array is still DEGRADED with one disk pulled out.

So it's quite foolproof, i would say.
 
Now to see if I can get OpenSolaris to run in a Hyper-V VM......hmmm

Thanks for all the info sub.mesa :) I didn't think ZFS had advanced this much.
 
So, in the iSCSI example being discussed, say you set up a ZFS device (pool?) as an iSCSI target on the FreeBSD system, and then mount that block device in linux via iSCSI, and format the device with, say XFS.

Will you still get all the benefits of ZFS, even though you would be running XFS on a block device on top of ZFS? I am confused since, as you say, it seems many ZFS advantages rely on coordinating the filesystem with the volume management, but if XFS is running on top, how can such coordination be achieved?
Not all of the benefits; ZFS no longer sees your individual files. But ZFS still checksums the blocks of the zvol, so silent corruption is detected and repaired at the block level, and you also get the protection of RAID-Z and its other features.

Even cooler, you can also snapshot your zvols, so you can make snapshots of your XFS filesystem and, should you ever mess it up, return to a point in the past where it still worked.
 
As the label is tied the data on the physical disk, it will be identified by the physical disk. You can put it in a USB external enclosure and connect it using USB instead and it will still be detected by its label.

So no matter if your controller changes, the cable changes, the interface changes; you still get a label that identifies the physical drive. The configuration data (metadata) is stored on the last sector of the underlying storage device, in my example thats the partition on the bare disks; so the last sector on the HDD. The label produces a block-device (at /dev/label/<name>) with a size of 512 bytes less - exactly one sector in which the metadata is located.

My experience in Linux and older versions of BSD was that you could try to create links or labels with friendly names that wouldn't change, and then issue shell commands using those names to avoid problems. But when the kernel reported an error with the drive, or other OS logging occurred, the physical device name (which changes with hardware configuration changes) would be used. This is not good, since your error reporting didn't use the non-transitory name.

Are you saying that in the latest FreeBSD with ZFS that if you issue a glabel command, all the kernel error reporting uses that symbolic name and not the device name? That would be a very good change, but completely inconsistent with my experience.
 
In ZFS lingo I think a logical volume is called a zvol. A zpool is a collection of physical devices.

I'm curious about your other questions as well. I'm wondering how checksumming would work, for one.

Edit: sub.mesa - Yeah, I understand what I get out of ZFS, but really it has few real-world advantages for me over Linux software RAID, and the significant disadvantage that I need to buy 3-4 disks at a time whenever I want to expand (and lose extra disks to parity while I'm at it), as opposed to 1. Otherwise most of what ZFS does I can do or approximate with Linux softraid and LVM.

LVM is a POS. It can reduce performance by almost 2/3 and greatly complicate filesystem recovery when you have an underlying media fault. In my opinion, integrating volume management with the filesystem makes perfect sense. When the volume manager understands the filesystem content, the volume management can be greatly simplified. The OS is not going to allow you to assemble segments in incorrect ways, which I think is a good thing.

I think the way ZFS integrated volume management is a great idea, and works very well in practice. A lot of people talk about why it's good to have LVM separate from the filesystem, but I have yet to find a system admin who uses LVM extensively who likes it, or is glad that it's implemented separately from the filesystem. I have had more than my fair share of pain dealing with cleaning up LVM's issues.
 
Are you saying that in the latest FreeBSD with ZFS that if you issue a glabel command, all the kernel error reporting uses that symbolic name and not the device name? That would be a very good change, but completely inconsistent with my experience.
I/O errors that appear in the kernel logs will use the bare device name like /dev/ad14 and not the label. Normally, however, you would issue a "zpool status" command and that uses the label names and will say which disk failed. If for some reason you want to know which label /dev/ad14 has, for example, you can issue:

Code:
$ glabel dump /dev/ad14s1
Metadata on /dev/ad14s1:
Magic string: GEOM::LABEL
Metadata version: 2
Label: disk6

Aha, so it's disk6 that failed. Always rely on the label. :)
 
When you setup the drives for first use, you label each disk using software, but also physically like a label on the hot-swap drive cage telling the number. So if you have a drive-cage of 48 disks, you can number them 1 to 48 and make each disk in those cages have labels accordingly.

This is something you only have to do once. If you have to replace a failed drive, you would have to setup the label again, preferably using the same name. The old drive should not be connected to the system anymore if you use the same name, or it may cause obvious problems.

Okay, thanks for the explanation. I can see how it could be made to work, but personally, I think it is too much of a hassle for large systems of disks. Even discounting the hassle of labeling the drives when the system is set up, labeling is not actually something you only do once. As you point out, you have to label every drive you replace. So, if a drive fails in such a system, first you have to look up in the OS which drive failed, then go and match up the physical label on the drive, then swap the drive, then go back to the OS and set up a new label.

As opposed to a hardware controller, where you never need to touch the OS. You just go to the hardware, look for the red light, swap the drive, done.
 
You make it sound like it's a lot of work. If a drive fails:
  • You execute "zpool status" command, and see that disk4 has failed
  • You look at the drive-cage, see the number 4 next to the fourth drive, eject it
  • Execute "zpool status" command, and see that you did not disconnect the wrong device
  • Replace the drive with a new one; a new device name will appear in the logs, say /dev/ad14
  • Relabel the device using "glabel label disk4 /dev/ad14" (or create a partition first; whatever you use)
  • Tell ZFS to rebuild onto the new disk using "zpool replace tank label/disk4" (concrete commands below)
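
Putting those steps into concrete commands - a sketch, where the pool name "tank", the device /dev/ad14 and the label "disk4" match the example above:

Code:
# check which disk failed and note its label name
zpool status tank
# after physically swapping the drive, label the new disk with the same name
glabel label disk4 /dev/ad14
# tell ZFS to rebuild (resilver) onto the relabeled disk
zpool replace tank label/disk4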

During the resilvering process, you may use the array as normal. This ZFS manual for FreeBSD may also be useful:
http://www.freebsd.org/doc/handbook/filesystems-zfs.html
 
That is a lot of work! The whole point of computers is to take the tedious jobs and do them for us!
 
It's a lot of work to restore from a broken RAID configuration too, or from one of the million other things that can go wrong with the poor technology many commercial NAS products use.

For example, someone using driver-based RAID5 has a crash, reboots, and sees that 3 of the 8 drives are no longer part of the RAID, and the RAID shows as failed. The data on the disks is in fact fine; it's essentially just the RAID layer that crashed, no longer knowing which disk goes in which order, or some other metadata situation it can't recover from. For most users that means bye-bye data.

Also, silent corruption can cause many headaches. If you want reliable storage, that means learning about ZFS and spending some time on it. Many home users interested in storage won't be too bothered by the time and effort this takes; they care more about whether the solution suits their needs as far as reliability and features go.

Your arguments are more arguments against custom-built hardware than specific issues with ZFS. If you prefer a ready-made commercial solution, then you fall into a different category than the people who want to build their own fileserver - saving money and gaining higher performance, higher reliability, better features and more flexibility, but requiring more time and effort to set up, study and master.
 
By the way, you can set the pool property autoreplace=on so that a new device that shows up in place of the old one is automatically formatted and used as the replacement. That is safe to use in conjunction with labeling, so you won't need to run a replace command; you only replace the physical disk.
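
A sketch of enabling that on the example pool:

Code:
# let ZFS automatically use a new disk found in place of a failed one
zpool set autoreplace=on tank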
 
Your arguments are more an argument against custom-built hardware, than a specific issue versus ZFS. If you prefer a read-made commercial solution, then you fall into a different category of people that want to build their own fileserver, saving up money, having higher performance, higher reliability, better features and more flexibility; but requiring more time and effort to setup, study and master.

No, anyone can build a box and put a hardware RAID controller in it. Having a hardware RAID controller has nothing to do with whether a box was purchased as a turnkey system or custom-built.

As for the other things you mention, it sounds like you lose a lot of them if you mount your block device via iSCSI. And some of them are not true anyway (ZFS is not higher performance, and the reliability claim is debatable).
 