Building your own ZFS fileserver

Hi,

I've been interested in ZFS for a while and have some questions-

1- Does the poor BER (bit error rate) of large hard drives - 1.5TB and bigger - still have a negative impact if you build a RAID 5 or 6 with these big drives?

My concern with using these cheap large drives is that if I have to rebuild one drive (in a RAID), others will ultimately fail too, because - statistically and in reality - a 1.5TB+ drive with a poor BER will encounter an error when rebuilding.

So, does ZFS help with this at all?

2- For clarification: if my one ZFS 'volume', we'll call it Data, gets filled up, all I need to do is add another physical RAID to the system and then add it to the Data pool, and the Data volume will automatically get bigger?

3- I'm planning on deploying a Solaris setup with a dual-core 3GHz Intel and 4GB of RAM; will this be enough?

4- How simple is it to move the zfs drives to another machine if the computer or OS dies?

Thanks for any help!!
 
1) ZFS is much smarter here. You don't have to worry about BER. If you use a RAID-Z, any bit errors that occur are automatically fixed, because ZFS uses checksums to validate the data - if it's invalid, a redundant copy is used and the original (corrupt) version is repaired. This applies to both data and metadata.

Also, if you have to rebuild a disk, only files are rebuilt, not the whole disk bit by bit. If a bit error occurs, ZFS will correct it once the data has been read, written or modified. So in short, if you have a 10TB array with only 2GB of data, a (full) rebuild would only involve those 2GB of data.
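If you want to see this in action, a minimal sketch (pool and device names are made up) would be to build a RAID-Z and let a scrub verify every block against its checksum:

Code:
zpool create tank raidz da0 da1 da2 da3   # 4-disk RAID-Z pool named 'tank'
zpool scrub tank                          # read every block, repair anything that fails its checksum
zpool status -v tank                      # per-disk READ/WRITE/CKSUM counters, plus any unrecoverable files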


2) Yes, it would be as if you expanded an array, leaving you with lots of new free disk space for all filesystems on the volume. Think of a filesystem as a directory, as that's basically how ZFS works.
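For example (device names hypothetical), growing the pool is a single command and every filesystem in it immediately sees the extra space:

Code:
zpool add tank raidz da4 da5 da6 da7   # add a second RAID-Z vdev to the existing pool
zfs list                               # all filesystems in 'tank' now report more free space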


3) 4GB RAM is enough, but certainly not a lot for ZFS. Any less would disable the prefetch optimizations. I opted for 8GB instead, but since memory prices are going up again, you may opt for 4 instead. Understand that ZFS will be using most of this, so don't run too many other memory-hungry applications.

May I ask why you opt for OpenSolaris instead of FreeBSD? Remember, since I'm a lot more familiar with FreeBSD, I can help you with any specific issues you run into when first setting up your system; I'm not so sure that would be the case with OpenSolaris. If you feel confident in that area or have some other form of help, then it's certainly a viable option.


4) You can migrate the array to any other OS that supports the same or higher ZFS version. That means:

You can migrate from FreeNAS to FreeBSD 7.0 or higher (zfs version 6 versus 6/13/14)
You can migrate from FreeBSD 8.0 to OpenSolaris (zfs version 13 versus 14)
You can migrate from OpenSolaris to FreeBSD 8.1 (zfs version 14 versus 14)
You cannot migrate from FreeBSD 8.0 to FreeNAS (zfs version 13 versus 6)
You cannot migrate from FreeBSD 8.0 to FreeBSD 7.x (zfs version 13 versus 6)
You cannot migrate from OpenSolaris to FreeBSD 8.0 (zfs version 14 versus 13)
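If you're not sure what a given installation supports, you can check and (one-way) upgrade the pool version from the command line; 'tank' is a made-up pool name here:

Code:
zpool upgrade -v      # list the ZFS pool versions this OS supports
zpool upgrade tank    # upgrade 'tank' to the newest supported version - one-way, so only do this
                      # once you're sure you won't need to move back to an older OS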

If you connect the array to another machine, you have to use the "zpool import" command to take ownership of that array, or ZFS won't touch it. A "zpool import" command without any arguments will list any ZFS pools that can be imported but aren't yet.
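A rough sketch of that workflow, assuming the pool is called 'tank':

Code:
zpool export tank     # on the old machine, if it still runs: cleanly release the pool
zpool import          # on the new machine: list pools available for import
zpool import tank     # take ownership of the pool
zpool import -f tank  # force it if the pool was never exported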

In short, it's possible to migrate your array. There's one danger though: if you use disk labeling like I suggested earlier, that only works on FreeBSD and FreeNAS, not on other operating systems like OpenSolaris. Also, you must be careful that Windows does not interfere with the drives, and DO NOT initialize the drives when asked to do so - that destroys data!

If you use disk labeling and your FreeBSD install has gone bad, here's what you do:
  • Connect your physical disks to a Windows or Linux PC
  • Install the VirtualBox VM software, then install FreeBSD in a VM
  • Use raw disks so the VM has direct/physical access to your RAID disks (see the sketch after this list)
  • Boot up the VM; FreeBSD should be able to recognise the disk labels, and once you start ZFS you can import the array with "zpool import".
  • You can move the data off to your host Windows filesystem via the network, using CIFS/NFS or even FTP if you prefer that.
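For the raw-disk step, on a Windows host it would look roughly like this (drive number and filename are hypothetical); repeat it for each member disk and attach the resulting .vmdk files to the FreeBSD VM:

Code:
VBoxManage internalcommands createrawvmdk -filename raiddisk1.vmdk -rawdisk \\.\PhysicalDrive1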
 
Do you have a source for this claim?
Here's the actual commit that implements this:
http://www.mail-archive.com/[email protected]/msg10328.html

So prefetching on ZFS is disabled by default, for all 32-bit i386 systems, and for 64-bit systems that have less than 4GB RAM. If you try this you'll get this message:

ZFS NOTICE: system has less than 4GB and prefetch enable is not set... disabling.
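If you do want prefetch on such a system anyway, the FreeBSD loader tunable is - as far as I know - vfs.zfs.prefetch_disable; forcing it to 0 overrides the auto-disable:

Code:
# /boot/loader.conf
vfs.zfs.prefetch_disable="0"   # force-enable ZFS prefetch even with <4GB RAM; use with care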

Nice thread, I'm joining the ZFS bandwagon:

Just ordered an AMD X4 600e/mobo/RAM + 4x2TB to be my new NAS and webserver etc. (replacing a ReadyNAS) - going in a normal ATX case until I expand beyond 4 disks/need HBA cards. Will post how I get on; I'll also probably set up some form of monitoring for the drive temps etc. - might post a guide if anyone is interested.
Nice to hear that! I might add that a quad-core plays nice with FreeBSD, as FreeBSD is highly optimized for multiple cores. The FreeBSD kernel can make good use of multiple cores and process work in parallel, so it really puts them to use. This is especially apparent when you apply encryption or compression to your storage pool; a quad-core will pump out almost double the numbers a dual-core produces.

So in the case of FreeBSD, a quad-core running at a low clock speed is your best bet - your CPU choice is excellent for heavy workloads. And with 8GB RAM you won't have any issues with memory either; that's very important for making ZFS perform fast.

If you are in need of an HBA or a simple SATA/SAS controller, this Supermicro product provides 2GB/s full-duplex bandwidth to 8 SATA/300 ports, without using a port multiplier or any of that stuff:
http://www.supermicro.com/products/accessories/addon/AOC-USAS-L8i.cfm

It should also work fine with FreeBSD, while other - similar - products use a Marvell chipset, which has issues under BSD as far as I've read on the mailing lists. It sells for about 130 euro in my area. It's a UIO card, which is PCI-Express x8 but with the components on the 'wrong' side. This shouldn't be a problem as the card doesn't carry anything tall, but you may have to remove the bracket and run without one. Still, it's fully PCI-Express x8 compatible (first generation, so 2GB/s full-duplex bandwidth).
 
Just wanted to share my latest experience with ZFS:

As I run Ubuntu, I get an OS upgrade every 6 months. But sometimes it's nice to 'sneak peek', meaning to upgrade Ubuntu to an alpha or beta version of the next release. That's not without risk of course, but as I run ZFS, I have much more control over my data.

So here's what I did: before upgrading to Lucid, I made a snapshot of my iSCSI disk image, which resides on ZFS:
zfs snapshot volume/image/V2@backupbeforeupgradetolucid

I then proceeded to upgrade to Lucid. It took about 20 minutes before terrible error messages appeared and I was left with a non-working system. Oops! Whatever it was, it was ugly, and I would have had to spend a lot of time restoring it to a usable state. But thanks to ZFS, life can be made so much more enjoyable, so I did a simple rollback:
zfs rollback volume/image/V2@backupbeforeupgradetolucid

Then I booted my workstation again, and it came up in exactly the state it was in before the upgrade to Lucid. So, in essence, I feel I have more control over my data with ZFS. That feeling grows as you learn more about ZFS and how to benefit from its features.
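For anyone copying this trick: listing and cleaning up the snapshot afterwards is just as simple (dataset name as in my example above):

Code:
zfs list -t snapshot                                    # see which snapshots exist
zfs destroy volume/image/V2@backupbeforeupgradetolucid  # remove it once you're happy with the upgrade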
 
1) ZFS is much smarter here. You don't have to worry about BER. If you use a RAID-Z, any bit errors that occur are automatically fixed, because ZFS uses checksums to validate the data - if it's invalid, a redundant copy is used and the original (corrupt) version is repaired. This applies to both data and metadata.

-- So, while rebuilding an array, if it comes across an error - which it will, since the BER on 1.5TB drives is such that going through 1.5TB of data will produce an error - everything should be A-OK?

Also, if you have to rebuild a disk, only files are rebuilt, not the whole disk bit by bit. If a bit error occurs, ZFS will correct it once the data has been read, written or modified. So in short, if you have a 10TB array with only 2GB of data, a (full) rebuild would only involve those 2GB of data.

-- that is very nice

2) Yes, it would be as if you expanded an array, leaving you with lots of new free disk space for all filesystems on the volume. Think of a filesystem as a directory, as that's basically how ZFS works.

-- woohoo! like LVM I suppose?

3) 4GB RAM is enough, but certainly not a lot for ZFS. Any less would disable the prefetch optimizations. I opted for 8GB instead, but since memory prices are going up again, you may opt for 4 instead. Understand that ZFS will be using most of this, so don't run too many other memory-hungry applications.

-- I'll probably go to 8G. It's going to be used for file serving and backing up (remotely and locally).


May I ask why you opt for OpenSolaris instead of FreeBSD? Remember, since I'm a lot more familiar with FreeBSD, I can help you with any specific issues you run into when first setting up your system; I'm not so sure that would be the case with OpenSolaris. If you feel confident in that area or have some other form of help, then it's certainly a viable option.

-- The software for backing up is only for Solaris or Redhat.

4) You can migrate the array to any other OS that supports the same or higher ZFS version. That means:

-- Excellent

In short, it's possible to migrate your array. There's one danger though: if you use disk labeling like I suggested earlier, that only works on FreeBSD and FreeNAS, not on other operating systems like OpenSolaris. Also, you must be careful that Windows does not interfere with the drives, and DO NOT initialize the drives when asked to do so - that destroys data!

-- I'll have to see what Solaris' options are

Thanks for your help!! very informative!! Time to buy more drives and build this beast!

Btw-- I loved your example of the snapshot with your Ubuntu upgrade!! priceless!!

John
 
-- So, while rebuilding an array, if it comes across an error - which it will, since the BER on 1.5TB drives is such that going through 1.5TB of data will produce an error - everything should be A-OK?
Hmm, I have to restate this part, I think. ZFS can restore any corruption as long as it still has one intact copy. This applies to sectors, not to complete drives. Let's compare some examples. Assume you have an 8-disk RAID-Z, and multiple disks develop bit errors.

If your disks have lots of bit errors, but all at different locations, then ZFS still has at least one good copy of each block, as it can use the redundant parity data to reconstruct the non-corrupt version. In this case, as you scrub/rebuild the volume, all errors will be fixed by ZFS without any data loss. You can see exactly which disks had corruption, too, so you're not left in the dark about any ongoing corruption.

If bad luck strikes you and your RAID-Z array has corruption/bad sectors where two blocks collide, then the parity protection is broken - it would be as if 2 disks were down, and not enough redundancy exists to cope with this. In this case, ZFS will fix as much damage as possible, list the files where the corruption was uncorrectable, and disable any access to those files. This way, the rest of your filesystem is still healed and you at least know which files were affected; they may not be important anyway and you can delete them to clear the errors. But data loss can occur in this case.

If the last scenario unfolds, but you have set copies=2 or copies=3 on a filesystem/directory, then ZFS will - in addition to the parity redundancy offered by RAID-Z - store multiple copies of each file. This strengthens ZFS' ability to correct widespread bit errors involving all disk members. It can be enabled only for documents you deem very important - for example your business/home administration and personal documents, perhaps also your personal photos. These shouldn't be all that big.

So, in addition to RAID, ZFS can flexibly add extra protection to give your most valued files even better protection.
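A minimal sketch of how you'd set that up, with hypothetical dataset names:

Code:
zfs create tank/documents
zfs set copies=2 tank/documents   # every block under tank/documents is stored twice, on top of the RAID-Z parity
zfs get copies tank/documents     # verify the setting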
 
Change of plans: getting an i3 530 (2 physical cores with HT), plus a mATX board with two PCI-e slots for future HBA madness.

Anyway - some questions that might be nice for this thread: can you have 'hot spares' in ZFS, and/or can you force the filesystem to unmount if it's 'resilvering'?
 
If bad luck strikes you and your RAID-Z array has corruption/bad sectors where two blocks collide, then the parity protection is broken - it would be as if 2 disks were down, and not enough redundancy exists to cope with this. In this case, ZFS will fix as much damage as possible, list the files where the corruption was uncorrectable, and disable any access to those files. This way, the rest of your filesystem is still healed and you at least know which files were affected; they may not be important anyway and you can delete them to clear the errors. But data loss can occur in this case.

If the last scenario unfolds, but you have set copies=2 or copies=3 on a filesystem/directory, then ZFS will - in addition to the parity redundancy offered by RAID-Z - store multiple copies of each file. This strengthens ZFS' ability to correct widespread bit errors involving all disk members. It can be enabled only for documents you deem very important - for example your business/home administration and personal documents, perhaps also your personal photos. These shouldn't be all that big.

So, in addition to RAID, ZFS can flexibly add extra protection to give your most valued files even better protection.

In that scenario - if one drive IS physically bad, clicks, etc. - is ZFS 'smart' enough to allow me to access whatever files are still intact on the broken RAID? Or is the RAID completely lost (like traditional RAIDs would be if you were to lose 2 drives, for example, in a RAID 5)?

I like the copies thingy, pretty slick.

thanks again!!
 
Here's the actual commit that implements this:
http://www.mail-archive.com/[email protected]/msg10328.html

So prefetching on ZFS is disabled by default, for all 32-bit i386 systems, and for 64-bit systems that have less than 4GB RAM. If you try this you'll get this message:

ZFS NOTICE: system has less than 4GB and prefetch enable is not set... disabling.



If the last scenario unfolds, but you have set copies=2 or copies=3 on a filesystem/directory, then ZFS will - in addition to the parity redundancy offered by RAID-Z - store multiple copies of each file. This strengthens ZFS' ability to correct widespread bit errors involving all disk members. It can be enabled only for documents you deem very important - for example your business/home administration and personal documents, perhaps also your personal photos. These shouldn't be all that big.

So, in addition to RAID, ZFS can flexibly add extra protection to give your most valued files even better protection.

sub.mesa, thank you for the excellent info. I was unaware of both of these characteristics.



Change of plans: getting an i3 530 (2 physical cores with HT), plus a mATX board with two PCI-e slots for future HBA madness.

Anyway - some questions that might be nice for this thread: can you have 'hot spares' in ZFS, and/or can you force the filesystem to unmount if it's 'resilvering'?

Zpools support however many hot spares you specify. A zpool may be manually taken offline, but resilvering is performed online.
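For example (pool/device names made up), adding a spare is one command and it shows up in its own section of the status output:

Code:
zpool add tank spare da8   # da8 will automatically replace a failed member
zpool status tank          # the 'spares' section lists it as AVAIL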



In that scenario - if one drive IS physically bad, clicks, etc. - is ZFS 'smart' enough to allow me to access whatever files are still intact on the broken RAID? Or is the RAID completely lost (like traditional RAIDs would be if you were to lose 2 drives, for example, in a RAID 5)?

I like the copies thingy, pretty slick.

thanks again!!



I imagine that if a drive's data is completely inaccessible and the zpool is unrecoverable, and if ZFS provided access to the dead pool, only some very small files might be accessible - the ones that do not span across the missing drives but reside wholly on the active drives.

If a drive's data is still accessible, with accumulating errors or intermittent problems, ZFS will keep working to read data for as long as possible.




I am testing an OpenSolaris machine with a Xeon 5520 w/ ECC RAM. What is the method to scrub or verify the memory controller's error count status? Is there any scrubbing feature for higher-level memory?
 
All my above effort for nothing. I tried turning on ZFS compression, my primary reason for using ZFS. Speeds slowed to a crawl and my SSH and console sessions kept freezing. These freezes appeared to affect my media-streaming LUN, even though it wasn't compressed. I found a few complaints of the same thing on Google.

I installed Solaris 10u8 on another VM with the same configuration. Turning on ZFS compression slowed it down a bit, but not by too much, and there were no freezes or pauses. When I install it on my hardware Xeon X3430 w/8GB RAM, it'll be perfect.

Unfortunately, the FreeBSD camp still has a lot to do with the ZFS part. I had high hopes!
 
I'm looking to build a ZFS system for home NAS use. Probably a Norco 4220 case + 700W PSU (decent quality) + i5/i7 (or Xeon?) + 8-12gb RAM (will I need ECC?).

I'm just confused about what controllers and motherboard I should be looking into, since I'd want something that works near-perfectly in FreeBSD. I'd be populating this server with 2TB Greens (old version*).

Any ideas? Thanks!
 
For anyone contemplating running OpenSolaris, I recommend the LSI SAS3081E-R (aka Intel SASUC8i).
Unlike some other controllers, it is transparent to the OS by default. Other cards often require you to create a RAID0 array for each drive before it is advertised to the OS.
BTW, steer clear of the Gigabyte MA78* boards, they don't work well with LSI cards. (I tried an MA785GT-UD3H)
http://forum.giga-byte.co.uk/index.php?topic=772.0

If you try out OpenSolaris, you will have fun trying to change to a static IP. Fortunately, you are not alone. Howto here:
http://blogs.sun.com/observatory/entry/beyond_dhcp_with_dns_and

If you have a Mac, you'll probably want to run AFP. Here's how:
http://cafenate.wordpress.com/2009/02/08/building-netatalk-on-opensolaris-200811/

Then you probably want to have your shiny new server advertised via Bonjour (avahi is already installed):
http://www.kremalicious.com/2008/06/ubuntu-as-mac-file-server-and-time-machine-volume/#netatalk1


Bonnie++ output with 5x 2TB Hitachis in RAIDZ1:
Code:
Version 1.03c     ------Sequential Output------   --Sequential Input- --Random-
                  -Per Chr- --Block--- -Rewrite-- -Per Chr- --Block--  --Seeks--
Machine      Size K/sec %CP K/sec  %CP K/sec %CP  K/sec %CP K/sec %CP  /sec %CP
Ui             8G 73715  97 328363  47 211968  31 54220  99 482575  27 729.5 2
                  ------Sequential Create------ --------Random Create--------
                  -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
            files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
               16 +++++ +++ +++++ +++ +++++ +++ 28246  99 +++++ +++ +++++ +++
480MB/s read and 330MB/s write is quite satisfactory ;)
(Athlon II X2 235e (2,7Ghz), 4GB 1333Mhz DDR3)
 
Thanks for the info!! especially the AFP stuff! you must have a Mac(s) also :)

Is that the cheapest card? $200 plus expanders!! $$$$ Hoping to spend $100 for 6 more ports or so.. hmmm

I'm going to throw OpenSolaris on an E5200 dual-core 3GHz setup on a GA-EP45-UD3L board with a bunch of 1.5TB drives (and possibly 2TB depending upon prices)

This setup is currently a hackintosh, but it's going to become OpenSolaris for storage, and then I'm setting up a quad-core Black Edition Phenom for my new hackintosh :) (should give my real Mac Pro a run for its money!!)

For anyone contemplating running OpenSolaris, I recommend the LSI SAS3081E-R (aka Intel SASUC8i).
Unlike some other controllers, it is transparent to the OS by default. Other cards often require you to create a RAID0 array for each drive before it is advertised to the OS.
BTW, steer clear of the Gigabyte MA78* boards, they don't work well with LSI cards. (I tried an MA785GT-UD3H)
http://forum.giga-byte.co.uk/index.php?topic=772.0

If you try out OpenSolaris, you will have fun trying to change to a static IP. Fortunately, you are not alone. Howto here:
http://blogs.sun.com/observatory/entry/beyond_dhcp_with_dns_and

If you have a Mac, you'll probably want to run AFP. Here's how:
http://cafenate.wordpress.com/2009/02/08/building-netatalk-on-opensolaris-200811/

Then you probably want to have your shiny new server advertised via Bonjour (avahi is already installed):
http://www.kremalicious.com/2008/06/ubuntu-as-mac-file-server-and-time-machine-volume/#netatalk1

(Athlon II X2 235e (2,7Ghz), 4GB 1333Mhz DDR3)
 
How big is the performance loss with FreeNAS' ZFS v6 compared to the v13/14 that FreeBSD is using?
I need a new "home server" for streaming movies, pictures and music in my home. My current setup is:
*20" imac with osx snow leopard
*htpc for streaming with win xp
*laptop with win xp
*workstation with win 7

I need a storage solution with RAID 5, 4TB+, and easy config (HTTP). I'm streaming 1080p from my workstation now and have 4x1TB that I plan to move to the new server.

Server conf:
*Gigabyte MA790FXT-UD5P
*Low power dualcore cpu
*80gb ssd or ide-CF for os
*4gb ram
*4x1,5/2tb @ raid 5 + 4x1tb @ raid 5

Is FreeNAS my best choice as a Linux novice?
 
How big is the performance loss with FreeNAS' ZFS v6 compared to the v13/14 that FreeBSD is using?
I need a new "home server" for streaming movies, pictures and music in my home. My current setup is:
*20" imac with osx snow leopard
*htpc for streaming with win xp
*laptop with win xp
*workstation with win 7

I need a storage solution with RAID 5, 4TB+, and easy config (HTTP). I'm streaming 1080p from my workstation now and have 4x1TB that I plan to move to the new server.

Server conf:
*Gigabyte MA790FXT-UD5P
*Low power dualcore cpu
*80gb ssd or ide-CF for os
*4gb ram
*4x1,5/2tb @ raid 5 + 4x1tb @ raid 5

Is FreeNAS my best choice as a Linux novice?

If you are a linux novice, I would stay away from building a linux based NAS. Go with a platform you are more comfortable with. You don't want to make mistakes when something goes wrong that costs you your data....
 
All my above effort for nothing. I tried turning on ZFS compression, my primary reason for using ZFS. Speeds slowed to a crawl and my SSH and console sessions kept freezing.
So what speeds were you getting with compression enabled? I just did a test, and it's still close to 80MB/s for a 1GB memory disk filled with random junk, with compression enabled.
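If you want to compare numbers, this is roughly how I'd test it (dataset name hypothetical; /dev/urandom gives incompressible data, so it's close to the worst case for the compressor):

Code:
zfs create tank/test
zfs set compression=on tank/test
dd if=/dev/urandom of=/tank/test/junk bs=1m count=1024   # write 1GB; dd reports bytes/sec at the end
zfs get compressratio tank/test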

Have you looked at top? Do you see the interrupt usage, system usage and user usage? Perhaps something is set wrong on your system that causes this behavior, because I don't see it on my FreeBSD 8.0 system. All my workstations are on iSCSI, and compressing 1GB doesn't seem to affect my iSCSI performance either.

Perhaps if you were using PCI, I would understand. Heavy interrupt usage can bring a multicore system to its knees; interrupts can be very bad for performance. Sometimes using polling for some devices instead of interrupts leads to much better performance (fewer spikes).

It could also be that FreeBSD does not like your VM solution; some recommend using the "hz=100" tweak with certain VM solutions and FreeBSD. I would not recommend running a production FreeBSD machine as a guest VM. It would make a good host OS, though. The FreeBSD CPU scheduler is quite advanced, and both MySQL and the Linux kernel have incorporated parts of its design. But perhaps it doesn't work too well when contained by a virtualisation solution.

Unfortunately, the FreeBSD camp still has a lot to do with the ZFS part. I had high hopes!
As I said, it might have been something with your configuration. I'm willing to diagnose this with you, unless you feel it's not worth the trouble because you've settled on OpenSolaris instead. However, I feel you shouldn't be too quick to judge this as FreeBSD's fault.

Do you have any links to the articles you mentioned?
 
If you are a linux novice, I would stay away from building a linux based NAS. Go with a platform you are more comfortable with. You don't want to make mistakes when something goes wrong that costs you your data....
Linux newbie can mean several things. If it means he can work with Ubuntu linux, even basic things, then that's already the same level of expertise required to navigate through FreeNAS web-interface and configure some storage.

FreeNAS should be comparable to Windows Home Server, especially if you want simple things.

I'm not sure how ZFS v6 performs, though. I suggest you get the 64-bit package of FreeNAS and try it. If it doesn't suit your wishes you can always go and buy Windows Home Server. Be sure to install fixes or you might run into corruption issues in some specific cases. Google more on the corruption issue for details.
 
Linux newbie can mean several things. If it means he can work with Ubuntu linux, even basic things, then that's already the same level of expertise required to navigate through FreeNAS web-interface and configure some storage.

FreeNAS should be comparable to Windows Home Server, especially if you want simple things.

I'm not sure how ZFS v6 performs, though. I suggest you get the 64-bit package of FreeNAS and try it. If it doesn't suit your wishes you can always go and buy Windows Home Server. Be sure to install fixes or you might run into corruption issues in some specific cases. Google more on the corruption issue for details.
I have used Ubuntu and some other 'basic' Linux OSes... I have good computer knowledge, but this will be a crucial data migration: moving 4TB of NTFS data to a 'new unknown' filesystem and a web-based GUI OS.

I will always be more familiar with Server 2003/2007 because of some field work for the Swedish army...;)

But since this is for my 'home server', I want an easy-to-configure, well-performing home server.
I'm willing to try new things, and I'm missing RAID 5 in WHS.
 
Well you could go with Hardware RAID5 + WHS.

If you go the FreeNAS route you avoid the cost of hardware RAID, as software RAID works just fine. Remember though: you need a backup! RAID alone cannot protect your data enough. If you cannot afford a backup, using ZFS with snapshots would be a good alternative. That gives you RAID5 (called RAID-Z), and with snapshots you can make incremental backups. FreeNAS should allow this, though I haven't really tested its interface myself since ZFS was integrated into the FreeNAS GUI.

Especially if you want to do more than just a fileserver/NAS, Windows might be more comfortable. If all you need to do is have a RAID5 and store files, setting up FreeNAS takes only a few minutes. If you go for the ZFS+snapshots route, you would need to set up a script or do this manually, say every week or so.
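A sketch of what such a weekly job could boil down to (pool, dataset and snapshot names are all made up); the incremental send only transfers what changed since the previous week:

Code:
zfs snapshot tank/data@week14
zfs send -i tank/data@week13 tank/data@week14 | zfs receive backup/data   # incremental backup to another pool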

So FreeNAS would save you money (hardware RAID controller + WHS costs) while giving you most basic features and even some advanced ones. If you need more than just a NAS, I would stick with WHS instead, as FreeNAS can't easily be extended into a multi-purpose server.
 
Is anyone running a Linux host OS with BSD/Solaris as a guest via KVM/Xen, passing an HBA to the guest with PCI passthrough?

I'm also thinking about just jumping on btrfs. I've been messing around with it and it's pretty slick.
 
@sub.mesa

What advantages does running RAID-Z or RAID-Z2 offer over just RAID5/RAID6 with a ZFS volume?

Wouldn't error checking and the benefits of ZFS still be realized?
 
I've read that a RAIDZ/RAIDZ2 is platform independent, so it's not going to depend on your RAID controller. Also, with RAIDZ you can do stuff like zfs set copies=3, which stores three copies of your data on top of the RAIDZ redundancy.
 
The thing that kills RAIDZ for me is enclosure management. Determining which physical drive in a hotswap has failed is not easy without it. I love the other ZFS features, though.
 
@sub.mesa

What advantages does running RAID-Z or RAID-Z2 offer over just RAID5/RAID6 with a ZFS volume?

Wouldn't error checking and the benefits of ZFS still be realized?
You mean RAID-Z on a plain SATA controller versus hardware RAID5 on a RAID controller? If both would be running ZFS, the 'software' solution (RAID-Z) is superior.

If you are running hardware RAID (like Areca), the operating system - and thus also ZFS - never sees or communicates with the physical drives. All it sees is one big disk from a SCSI/RAID controller. That also means ZFS cannot access the parity data itself; only the hardware RAID controller can do that, and it won't do it at ZFS' request.

So when ZFS encounters some disk corruption, it cannot fix the damage by using parity data to reconstruct a good copy. You lose that benefit; it's like running a RAID0 as far as fixing corruption is concerned. If you set copies=2 or copies=3, any corruption might affect both/all of your copies at once, for example if one disk has severe corruption (rare though) - ZFS can't distinguish between drive 1 and drive 2, so it can't guarantee that the additional copies from copies=2 actually end up on different physical disks. This makes the feature less powerful.

Also, write flushing may be a problem, as some RAID controllers appear not to support it, or choose to ignore it, even in write-through mode. This may affect stability and cause filesystem corruption if the system crashes or is powered down abruptly. ZFS prefers to be close to the physical disks, so that the BIO_FLUSH commands actually reach the disk. That more or less guarantees ZFS won't end up with out-of-sync transactions; like journaling filesystems, it needs some form of synchronisation, since it does its I/O in transactions. So hardware RAID could even be dangerous, though I have no knowledge of actual affected products in relation to ZFS.

As far as performance is concerned, sequential I/O will be good, but random I/O - especially random writes on RAID5 - may be much worse. For one, your hardware RAID controller cannot do dynamic stripe sizes - another benefit you lose when using hardware RAID. This is especially important for random writes on RAID5/6, as otherwise the RAID controller has to read the whole stripe, recalculate all the parity, then write the data; this is very slow. A small 4KiB write may cause more than 1MB of actual disk I/O and, worse, it requires multiple seeks. This is known as write amplification, and it also affects SSDs (which are 'striped' internally in some way).

So, when using hardware RAID you lose the self-healing capability on redundant arrays and the ability to use dynamic stripe sizes, you may get issues with write flushing, and copies=2 becomes less reliable. Other than that you're fine.

Generally, I'd say ZFS has made hardware RAID controllers obsolete. A static, hard-to-improve RAID engine on proprietary hardware is ultimately not a good solution. The software has the most 'knowledge' of the relevant information and thus can apply the best optimizations. Software RAID should be superior by design, especially when combined with a good filesystem - or, in this case, a combination of both in one package: ZFS.
 
That is all true, but it's worth noting that simply detecting corruption is a huge step forward in data protection. RAID-Z is better than ZFS on hardware RAID, but ZFS on hardware RAID is better than XFS/NTFS/whatever on hardware RAID.

Actually, my backup plan, if I can't find a nice way to run BSD/Solaris without resorting to building another system, is to run btrfs on top of a software RAID5 array. I'll be able to pop in one new drive at a time, expand the RAID5 array, and then expand the btrfs to use the new space - something I cannot do with ZFS. No, it won't be ideal, but again, simply being able to detect corruption is much better than nothing at all.
 
You're absolutely correct about the importance of detecting corruption. If you have a backup, you might not need self-healing this way; though it would require manually restoring files from backup.

I would like to remind you and others that expanding a RAID5/6 is a dangerous operation and may put the data at risk if anything goes wrong during the procedure (and it happens). I would only trust it with data I have backed up as well. The ZFS expand option is very safe, as it doesn't require shuffling data around or putting the array in a transient state where it loses its parity protection during the expansion process.

Expanding disk by disk with ZFS is possible if you use no redundancy - thus a RAID0 array - and then set copies=2 on all data; this should correct any corruption/bit errors as well. It would be the fastest solution and has no restrictions on expansion. The only downside is that a complete drive failure ("crash") would make the array inaccessible, even with copies=2. This is a limitation of ZFS at the moment, as ZFS actually has all the data necessary to access those files even with one disk down. Essentially, copies=2 is like mirroring at the filesystem level.

So it may be an alternative, if you have a backup in addition to the RAID0 array. Since you're using copies=2, the data takes twice as much space, just like mirroring. The advantage here is that you can set it per directory/filesystem, thus leaving large, bulky, unimportant data alone and only protecting the most crucial files against corruption. Generally, corruption in some video file isn't that serious, for example, so it may not even need corruption protection.

RAID works on all files, so it's inflexible if you have mixed important and unimportant files. The copies=2 solution could be very good in this case, especially once ZFS improves and is able to access arrays with one disk missing and view any content that has copies=2 set. For now, its protection is limited to bit errors and corruption, not complete disk failures or missing disks (which are basically the same).
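In commands, that setup would look something like this (all names hypothetical): a plain striped pool for the bulk data, with only the important dataset getting the extra copies:

Code:
zpool create stripe da0 da1 da2    # no redundancy: a striped pool
zfs create stripe/documents
zfs set copies=2 stripe/documents  # bit-error protection for the important stuff; everything else stays at the default of 1 copy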
 
The thing that kills RAIDZ for me is enclosure management. Determining which physical drive in a hotswap has failed is not easy without it. I love the other ZFS features, though.

I don't understand your problem, this is an 'OS' thing; I noted down my drive serial numbers and where I plugged them in and worked out which was which just fine (didn't need serials in the end):

Code:
batcave#  egrep 'ad[0-9]|cd[0-9]' /var/run/dmesg.boot
ad4: 76319MB <WDC WD800JB-00JJC0 05.01C05> at ata2-master UDMA100
acd0: DVDROM <Pioneer DVD-ROM ATAPIModel DVD-115 0111/E1.11> at ata2-slave UDMA33
ad6: 1907729MB <Hitachi HDS722020ALA330 JKAOA28A> at ata3-master UDMA33
ad7: 1907729MB <Hitachi HDS722020ALA330 JKAOA28A> at ata3-slave UDMA33
ad8: 1907729MB <Hitachi HDS722020ALA330 JKAOA28A> at ata4-master UDMA33
ad9: 1907729MB <Hitachi HDS722020ALA330 JKAOA28A> at ata4-slave UDMA33

I'm still testing the water with various things, not moving my data over just yet, not sure if that 'udma33' is superficial or not - played a 720p mkv over nfs last night; does anyone know of a drive benchmark that gives more 'human' readings than bonnie?

I'm struggling to find any decent NFS documentation; has anyone found a good resource? I've got everything working with the basic 'everyone' setup at the moment, but things are 'read-only' so it's not ideal.
 
UDMA33 is because you have your SATA drives in BIOS setup with IDE compatibility.
Go in BIOS and setup that as AHCI. It will show you SATA-300.
 
I don't understand your problem, this is an 'OS' thing; I noted down my drive serial numbers and where I plugged them in and worked out which was which just fine (didn't need serials in the end):
It'd be nice if the OS could light up the failed drive indicator on the chassis. But I agree, unless you've got a rack full of them, it's not a big deal.

I'm still testing the water with various things, not moving my data over just yet, not sure if that 'udma33' is superficial or not - played a 720p mkv over nfs last night; does anyone know of a drive benchmark that gives more 'human' readings than bonnie?
Maybe iozone would suit you better? Or just a plain old stream copy with dd? Bonnie isn't hard to read if you RTFM to see what all the numbers mean.
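For a quick 'human' number, a plain dd run works too (device and file names are hypothetical; FreeBSD's dd prints bytes/sec at the end):

Code:
dd if=/dev/ad6 of=/dev/null bs=1m count=4096         # sequential read from the raw disk, ~4GB
dd if=/dev/zero of=/tank/testfile bs=1m count=4096   # sequential write through ZFS (beware: compression would inflate this)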
 
UDMA33 is because you have your SATA drives in BIOS setup with IDE compatibility.
Go in BIOS and setup that as AHCI. It will show you SATA-300.

Yeah I thought as much, will change that over tonight :) wonder if it will improve performance at all:

Code:
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         3552 72873 56.8 135461 23.5 84433 10.2 80403 65.1 216059 10.1 119.0  0.3
 
I don't understand your problem, this is an 'OS' thing; I noted down my drive serial numbers and where I plugged them in and worked out which was which just fine (didn't need serials in the end):

It's easy enough to make a change that alters how the drives are presented to the OS. ZFS and other software RAID doesn't really care, but it still makes determining the failed drive harder than it needs to be. Example: you remove a drive from a degraded array and have to reboot. The drives after the removed drive will have their /dev designation changed. Or you add another controller card with a few drives.

Now, if there were a way to consistently map drive slots to devices in Solaris or Linux, so that Slot 1 -> whatever SATA port -> /dev/sda is always true, then I'd be happy. Make it so the software RAID management can light up the failure light using enclosure management and I'd be ecstatic.
 
It's easy enough to make a change that alters how the drives are presented to the OS. ZFS and other software RAID doesn't really care, but it still makes determining the failed drive harder than it needs to be.
Not if you use geom labeling like discussed earlier in this topic.
Example - you remove a drive from a degraded array and have to reboot.
Why? If you pulled the wrong drive, put it back in. If the SATA supports hot-plug you should not need to reboot. Depends on SATA controller though.

The drives after the removed drive will have their /dev designation changed. Or, you add another controller card with a few drives.
The /dev/ name is irrelevant if you use geom labeling.

Now, if there was a way to consistently map drive slots to devices in Solaris or Linux so that Slot 1 -> whatever sata port -> /dev/sda is always true, then I'd be happy.
Again, look at disk labeling. That way a physical disk will be known as "disk5" no matter how you connect it to your system.
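On FreeBSD that labeling is a one-time step per disk; a minimal sketch (device and label names made up), after which the pool is built on the label names so the physical port never matters:

Code:
glabel label disk5 /dev/ad6    # repeat for each member disk; it then also appears as /dev/label/disk5
zpool create tank raidz label/disk1 label/disk2 label/disk3 label/disk4 label/disk5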

And about the LED thing: as I said earlier, reading from your array will light up all the LEDs except the one on the failed/offline drive. So you already have LEDs and can locate the failed disk within a few seconds. I really don't understand the problem.

Even if you do everything wrong and pull disk after disk, what's going to be the problem? It's not like all the data is instantly gone if you pull two drives... Put the drive back and your array runs.
 
Not if you use geom labeling like discussed earlier in this topic.

That's only useful if you're using FreeBSD. FreeBSD seems to do well for basic ZFS functions, but the iSCSI support was borderline poor and fancy features like ZFS compression are unusable.

Why? If you pulled the wrong drive, put it back in. If the SATA supports hot-plug you should not need to reboot. Depends on SATA controller though.

Pulling the wrong device on a RAIDZ2 would eliminate your redundancy, assuming that you already had a failed device.

The /dev/ name is irrelevant if you use geom labeling.

Again, look at disk labeling. That way a physical disk will be known as "disk5" no matter how you connect it to your system.

This is a wonderful feature of FreeBSD, but it doesn't work for Linux Software RAID or Solaris/OpenSolaris.

And about the LED thing: as I said earlier, reading from your array will light up all the LEDs except the one on the failed/offline drive. So you already have LEDs and can locate the failed disk within a few seconds. I really don't understand the problem.

This is one option, but it is somewhat less idiot proof than I would like.

Even if you do everything wrong and pull disk after disk, what's going to be the problem? It's not like all the data is instantly gone if you pull two drives... Put the drive back and your array runs.

And lose another drive during the rebuild?
 
Rebuild of what? On an 8-disk RAID-Z, you could pull 4 drives; it will be UNAVAIL/FAILED. Put 3 drives back in and it will be ONLINE/DEGRADED. Put the last drive back and it will be ONLINE.

It's not like your data is gone if you pull the wrong drive. You also don't need a rebuild - only if, after you pulled one drive, you wrote to the array while it was in a degraded state. Even then, only the 'differencing' data will be 'rebuilt', not a full disk rebuild that takes hours.

So it should be very idiot proof. If one disk failed, you should be able to disconnect one disk at a time, until you reach the disk that was faulty. So what isn't "idiot proof" here? You can make unlimited mistakes with pulling/disconnecting the wrong drive; and still have no risk at all.

I would have to test this myself though, but I believe it works like I stated, assuming hot-plug on the SATA ports is supported. If it's not, you indeed have to reboot. But even then, pulling multiple drives out of the system wouldn't cause anything catastrophic.
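If you want to try it without physically yanking cables, ZFS lets you simulate the whole thing (pool and label names hypothetical):

Code:
zpool offline tank label/disk5   # pool goes DEGRADED but stays available
zpool online tank label/disk5    # the disk rejoins; only the changed data is resilvered
zpool clear tank                 # reset the error counters afterwards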

I agree about disk labeling; that would not work in OpenSolaris. But the idea is not new. UUIDs are kind of comparable. Perhaps OpenSolaris has some equivalent of geom_label. This issue isn't exactly relevant only to FreeBSD; it applies to all OS. The FreeBSD solution is just very elegant; though not portable to other operating systems.
 
I'm jumping on the ZFS bandwagon. I was interested before, and after reading all the posts in this thread I'm convinced. My plan is to run an OpenSolaris box with ZFS, share out an NFS mount, and have my ESXi box use that for storage. I'll post back once I have it all set up. :p
 
Nice to hear that! Be sure to post some benchmarks if you can, both sequential (network/local) and random I/O if possible.

Currently, I'm trying to set up my spare server so I can do some benchmarking of my own. It would be nice to compare speeds on FreeBSD+ZFS and OpenSolaris+ZFS.
 
Rebuild of what? On an 8-disk RAID-Z, you could pull 4 drives; it will be UNAVAIL/FAILED. Put 3 drives back in and it will be ONLINE/DEGRADED. Put the last drive back and it will be ONLINE.

It's not like your data is gone if you pull the wrong drive. You also don't need a rebuild - only if, after you pulled one drive, you wrote to the array while it was in a degraded state. Even then, only the 'differencing' data will be 'rebuilt', not a full disk rebuild that takes hours.

So it should be very idiot proof. If one disk failed, you should be able to disconnect one disk at a time, until you reach the disk that was faulty. So what isn't "idiot proof" here? You can make unlimited mistakes with pulling/disconnecting the wrong drive; and still have no risk at all.

I would have to test this myself though, but I believe it works like I stated, assuming hot-plug on the SATA ports is supported. If it's not, you indeed have to reboot. But even then, pulling multiple drives out of the system wouldn't cause anything catastrophic.

I agree about disk labeling; that would not work in OpenSolaris. But the idea is not new. UUIDs are kind of comparable. Perhaps OpenSolaris has some equivalent of geom_label. This issue isn't exactly relevant only to FreeBSD; it applies to all OS. The FreeBSD solution is just very elegant; though not portable to other operating systems.

This is one thing that I have never understood. OS vendors put virtually no work into utilities that aid storage management, whether that's code that blinks the lights on drives or GUI-style management. This is one area that I really came to appreciate after transitioning to hardware RAID under Windows: the management utilities are way better.

That said, there is no reason at all that Solaris or Linux can't do all this with better results than a hardware RAID vendor - and expose SMART stats and a bunch of other info, and have sane treatment of device labels and GUI mapping to enclosure geometries. But do any of these OS folks do any of this? No!

What a waste. I still believe software RAID is architecturally superior in 99.9% of use cases. But hardware RAID is just a lot easier (as implemented today) to manage, because of where the different vendors have decided to add value (or not add any value, as the case may be).
 
I have a Norco case, an LSI1068E, and a Chenbro SAS expander. The Norco is all plugged into the SAS expander. Let's pretend I paid attention to how I connected the expander to the Norco's backplanes, which is possible - the expander has little labels on each mini-SAS connector, I just shoulda connected them in order. If I really cared I would have, but whatever.

So now I got my setup.

I use my software array. Let's check its status:

bexamous@nine:~/src/smp_utils/smp_utils-0.94$ ./sample.sh
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md3 : active raid5 sdj3[6] sdi3[3] sdh3[5] sdg3[4] sde3[7] sdd3[2] sdc3[1] sdb3[0]
6837318656 blocks level 5, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]

unused devices: <none>


/dev/sdj is connected to port 19
/dev/sdi is connected to port 18
/dev/sdh is connected to port 17
/dev/sdg is connected to port 16
/dev/sde is connected to port 7
/dev/sdd is connected to port 6
/dev/sdc is connected to port 5
/dev/sdb is connected to port 4
bexamous@nine:~/src/smp_utils/smp_utils-0.94$

=)

Oh sdh failed? That is connected to port 17.

Actually I hid one extra step: you would need to run this BEFORE a disk failed. It queries the disk to see where it is; if the disk has failed it can't query it :p. So really you would have this script (well, an improved one) run when the system boots up and record the /dev/sd? -> port mapping. Later, when a disk fails, you look up where it is.

Btw, there is no error checking at all in this script, it assumes everything is connected to an expander, and that there is only 1 expander... but really, it took 10 minutes to do this just to show it's possible.

Install smp_utils (http://sg.danny.cz/sg/smp_utils.html) and sg3_utils (apt-get in ubuntu)

Here is simple script:

Code:
#!/bin/bash

# Print the current md RAID status, then map each member disk to its expander port.
cat /proc/mdstat
echo ""
echo ""
cat /proc/mdstat 2>&1 | grep -o sd. | while read disk; do
    disk=/dev/$disk
    # smp_discover lists the expander phys; match this disk's SAS address (from sg_vpd) to find its port.
    sudo ./smp_discover --multiple --brief --sa="$(cat $(sudo find /sys | grep "expander-.:./sas_address$"))" /dev/mptctl | grep "$(sudo sg_vpd --page=di_port --quiet $disk | sed 's/0x//')" | grep -o "phy[ ]*[0-9]*" | sed 's:phy[ ]*:'$disk' is connected to port :'
done
 
Hrm, this is very interesting; I wasn't aware of the SMP tools. It looks like they should be able to control chassis indicators as well, via smp_write_gpio. That should mean it wouldn't be very difficult to write a script to track insertion state on each expander port, and then another to handle events from the RAID layer and light the appropriate LEDs on the backplane.

Now I'm wishing I had the hardware to play with this at home so I could try it out :p
 