Migrating Linux RAID 5 to larger disks online...

The Hunter

So, having found a pretty good deal on 3x 500gig 7200.10's, I've started setting up the box I'm going to use for my raid5 fileserver and playing around with mdadm using some virtual disks for practice. After getting the basics down, I decided to try something a little more adventurous - migrating an array from 3 'disks' of one size, to 3 larger 'disks'. I thought I would share my results.

For the impatient, here are the basic steps:
  1. Create your starting array with the smaller disks
  2. Add the larger disks to the array as spares
  3. Mark the old disks as failed *one at a time*
  4. Remove the old disks from the array
  5. Grow the array to encompass new space
  6. Wonder if this is really smart to try with live data and real disks?

This was all done using Ubuntu 6.10 Server, mdadm 2.4.1 and the 2.6.17.14 kernel, compiled to support raid5 expand (Ubuntu 6.10 doesn't include this by default).
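If you're not sure whether your kernel has that support built in, and your kernel left a config file in /boot, a quick grep is an easy sanity check. Going from memory the option should be CONFIG_MD_RAID5_RESHAPE, so treat the exact name as an assumption:
Code:
# grep -i raid5_reshape /boot/config-$(uname -r)
If nothing comes back, the running kernel was built without it and you'll need to rebuild before the grow step will work.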

Here are the specifics of what I did for the curious:
Create 3 10MB images and 3 20MB images and attach them as loop devices:
Code:
$ dd if=/dev/zero of=img0 bs=1M count=10
$ dd if=/dev/zero of=img1 bs=1M count=10
$ dd if=/dev/zero of=img2 bs=1M count=10
$ dd if=/dev/zero of=img3 bs=1M count=20
$ dd if=/dev/zero of=img4 bs=1M count=20
$ dd if=/dev/zero of=img5 bs=1M count=20

# losetup /dev/loop0 img0
# losetup /dev/loop1 img1
# losetup /dev/loop2 img2
# losetup /dev/loop3 img3
# losetup /dev/loop4 img4
# losetup /dev/loop5 img5

Create a 3 device raid5 array composed of the 10M 'disks':
Code:
# mdadm --create /dev/md0 --auto=yes -l 5 -n 3 /dev/loop0 /dev/loop1 /dev/loop2

# cat /proc/mdstat
Personalities : [raid5] [raid4] 
md0 : active raid5 loop2[2] loop1[1] loop0[0]
      20352 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>

Create a filesystem and mount it - I used xfs and mounted at /testdir. If you were doing this on a real system you'd probably want to unmount for safety before expanding, but for demonstration purposes, I kept it mounted throughout.
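For completeness, that step is just something along these lines (adjust the mount point to taste):
Code:
# mkfs.xfs /dev/md0
# mkdir /testdir
# mount /dev/md0 /testdir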

Now add your larger 'disks' to your array (if doing this for real, you'd probably add them one by one as you remove each smaller disk in the next step):
Code:
# mdadm /dev/md0 --add /dev/loop3 /dev/loop4 /dev/loop5
mdadm: added /dev/loop3
mdadm: added /dev/loop4
mdadm: added /dev/loop5

# cat /proc/mdstat 
Personalities : [raid5] [raid4] 
md0 : active raid5 loop5[5](S) loop4[4](S) loop3[3](S) loop2[2] loop1[1] loop0[0]
      20352 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>

Now we mark our smaller 'disks' as failed. IMPORTANT: do this one at a time, and make sure the rebuild onto the spare finishes (check with cat /proc/mdstat) before you fail the next disk.
Code:
# mdadm /dev/md0 -f /dev/loop0
mdadm: set /dev/loop0 faulty in /dev/md0
Before failing the next one, your /proc/mdstat should look like this (rebuild finished):
Code:
# cat /proc/mdstat 
Personalities : [raid5] [raid4] 
md0 : active raid5 loop5[0] loop4[3](S) loop3[4](S) loop2[2] loop1[1] loop0[5](F)
      20352 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>
NOT like this (still rebuilding onto the spare):
Code:
# cat /proc/mdstat 
Personalities : [raid5] [raid4] 
md0 : active raid5 loop5[3] loop4[4](S) loop3[5](S) loop2[2] loop1[1] loop0[6](F)
      20352 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
      [==================>..]  recovery = 90.0% (9600/10176) finish=0.0min speed=1600K/sec
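The simplest way to make sure the rebuild has finished before failing the next disk is to just keep an eye on mdstat, for example:
Code:
# watch -n 5 cat /proc/mdstat
and wait until the recovery line disappears and you're back to [3/3] [UUU].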

Once you have repeated that process for your three smaller 'disks', remove them from the array:
Code:
# mdadm /dev/md0 -r /dev/loop0 /dev/loop1 /dev/loop2
mdadm: hot removed /dev/loop0
mdadm: hot removed /dev/loop1
mdadm: hot removed /dev/loop2

You should end up with something like this:
Code:
# cat /proc/mdstat 
Personalities : [raid5] [raid4] 
md0 : active raid5 loop5[0] loop4[1] loop3[2]
      20352 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>

Now that we have our array consisting only of larger 'disks', we expand it to use the new space:
Code:
# mdadm /dev/md0 --grow --size=max

# cat /proc/mdstat 
Personalities : [raid5] [raid4] 
md0 : active raid5 loop5[0] loop4[1] loop3[2]
      40832 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>

Expand the filesystem to match:
Code:
# xfs_growfs /testdir/
And we've migrated a full array to larger disks while still online.

Before:
Code:
# df -h /dev/md0
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               15M   72K   15M   1% /testdir
After:
Code:
# df -h /dev/md0
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               35M  100K   35M   1% /testdir

So, some questions remain:
This was all done with loopback devices as virtual drives; is there something I've overlooked that would keep this procedure from working with actual drives?

If it's technically possible to do with live data and drives, is it actually a good idea? It seems to me that aside from the delays of waiting for each drive to resync as you swap them one by one, there aren't really any more risks here than in a normal raid5 expand.
Slightly off-topic: what are resync times like when a spare disk gets put into use? I've heard online expansion can sometimes take almost a full day; is a resync as long? (If so, this almost becomes prohibitively slow, since you'd lose two or three days just swapping the drives in before you even expand the capacity.)

Has anyone actually tried this or thought about it?
 
Very nice write up!! I've been wondering how to take my 8x320g drives in a Gentoo Raid5 to something more uber. ;)

Anyway... I had an array that was 4x250g and then added in 2 more 250g drives, one into the array and the other as a spare. It took around 20 hours or so to sync the array with the one drive added in. But then I also have this on a 1.4GHz machine, so that may have added to the time for all the computations, as this is software raid.

My problem is bringing up 8x500 drives in the machine... I have no more SATA ports. :( I was thinking of building a temp machine with those 8, bringing it up, rsyncing the data, then moving the disks into the live server in place of the 320s. But then this also gives me the excuse to buy more toys. :)
 
This was all done with loopback devices as virtual drives; is there something I've overlooked that would keep this procedure from working with actual drives?
It should work the same. I ran a similar test with VMware and virtual disks a while back.
If it's technically possible to do with live data and drives, is it actually a good idea? It seems to me that aside from the delays of waiting for each drive to resync as you swap them one by one, there aren't really any more risks here than in a normal raid5 expand.
Slightly off-topic: what are resync times like when a spare disk gets put into use? I've heard online expansion can sometimes take almost a full day; is a resync as long? (If so, this almost becomes prohibitively slow, since you'd lose two or three days just swapping the drives in before you even expand the capacity.)
Resync doesn't take quite as long, since you're not moving data from any disk to itself. Depending on how good the drivers for your controller are, though, you might have to wait a day or so for the resync.
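If the resync seems slower than the disks should manage, md's throttle settings are worth a look. These are the standard md proc knobs; the value below is just an example:
Code:
# cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
# echo 50000 > /proc/sys/dev/raid/speed_limit_min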

Did you keep any data on the XFS filesystem, or just the FS itself? Try creating a large random file, md5sum it, and then do the expansion. I must admit I'm a little surprised this works - I wouldn't've thought of using them as spares and then failing the originals.

Note that adding another disk of the same size works pretty consistently. Keeping backups, though, is always recommended. *Especially* if you're running a brand-new kernel.
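For reference, the 'normal' expand mentioned above (adding a fourth same-size disk rather than swapping them all out) looks roughly like this; the device name is just a placeholder, and --raid-devices assumes you're going from 3 to 4 disks:
Code:
# mdadm /dev/md0 --add /dev/sdX
# mdadm --grow /dev/md0 --raid-devices=4
# xfs_growfs /testdir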
 
My problem is bringing up 8x500 drives in the machine... I have no more SATA ports. :( I was thinking of building a temp machine with those 8, bringing it up, rsyncing the data, then moving the disks into the live server in place of the 320s. But then this also gives me the excuse to buy more toys. :)
You can still use this process without having to get any more SATA ports: just remove one 320gig drive, put a 500gig in its place, and let it resync. Do it one by one till you've replaced them all, then grow the array as I did above. Should work perfectly fine.
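For each swap, the mdadm side would look roughly like this (a sketch; /dev/sdX stands in for whichever device the old drive is, and I'm assuming the replacement shows up under the same name):
Code:
# mdadm /dev/md0 --fail /dev/sdX --remove /dev/sdX
Swap the 320 for the 500 on that port, then:
Code:
# mdadm /dev/md0 --add /dev/sdX
and let /proc/mdstat show the rebuild is finished before you touch the next drive.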

Did you keep any data on the XFS filesystem, or just the FS itself? Try creating a large random file, md5sum it, and then do the expansion. I must admit I'm a little surprised this works - I wouldn't've thought of using them as spares and then failing the originals.

I tried this again with a random file on it:
Code:
# dd if=/dev/urandom of=/testdir/testfile bs=1M count=12
12+0 records in
12+0 records out
12582912 bytes (13 MB) copied, 6.90992 seconds, 1.8 MB/s

# md5sum /testdir/testfile 
dffe16a0832eab4e50332de84aa0fdab  /testdir/testfile

After growing the array and filesystem, the md5 is the same:
Code:
# md5sum /testdir/testfile 
dffe16a0832eab4e50332de84aa0fdab  /testdir/testfile

So it looks pretty safe, but as you say, backups are always a good idea when working with live data like this.
 
Sorry to reply to such an old thread, but I'm getting ready to buy some bigger drives and give this a shot. I just have to figure out how to back up my data beforehand, which is problematic since it's around 1.3TB.


Anyway, I was hoping someone could possibly answer some questions for me. I have my raid5 set up on 8 drives, with the array carved up using LVM2; md0 is the physical volume in the LVM2 setup.

I'm guessing that using this method I won't have to worry about resizing the LVM2 side, since after the array grows, LVM2 will just see that md0 got bigger and let me allocate the free space to the 'partitions' or make new ones. Or would I need to run the 'vgextend' command on the grown md0?

I'm also assuming that I won't have to worry about resizing the reiserfs filesystems I'm using, b/c the array itself doesn't have a filesystem on it, just the LVM2 slices.


Any input or comments would be greatly appreciated.

Thanks
 
Ok,

I decided to try this out on a VM. I installed Ubuntu 7.10 in VMware Server and went to town. Following The Hunter's instructions, I created my loops.

I went one step further and put md0 into an LVM2 volume group, carved it up, and made a file to test with as 'data'.

This is what I came up with.

Code:
root@hoth-ubuntu:/home/trmentry# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 loop2[2] loop1[1] loop0[0]
      20352 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

So far so good... got md0 online as raid5. :)

Code:
root@hoth-ubuntu:/home/trmentry# pvcreate /dev/md0
  Physical volume "/dev/md0" successfully created

root@hoth-ubuntu:/home/trmentry# vgcreate vg0 /dev/md0
  Volume group "vg0" successfully created

root@hoth-ubuntu:/home/trmentry# vgchange -a y vg0
  0 logical volume(s) in volume group "vg0" now active

root@hoth-ubuntu:/home/trmentry# lvcreate  -l 4 vg0 -n lv0
  Logical volume "lv0" created

(This is using the total space on the vg0.)

root@hoth-ubuntu:/home/trmentry# mkfs.xfs /dev/vg0/lv0
meta-data=/dev/vg0/lv0           isize=256    agcount=1, agsize=4096 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=4096, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

Created a PV on md0, then created the VG and LV and formatted it as XFS.

Code:
root@hoth-ubuntu:/# mount /dev/vg0/lv0 test
root@hoth-ubuntu:/# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/hoth--ubuntu-root
                       23G  777M   21G   4% /
varrun                125M  180K  125M   1% /var/run
varlock               125M     0  125M   0% /var/lock
udev                  125M   64K  125M   1% /dev
devshm                125M     0  125M   0% /dev/shm
/dev/sda1             236M   25M  199M  11% /boot
/dev/mapper/vg0-lv0    12M   64K   12M   1% /test

Mounted the filesystem.

Code:
root@hoth-ubuntu:/test# dd if=/dev/zero of=file0 bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.171985 seconds, 61.0 MB/s
root@hoth-ubuntu:/test# ls -l
total 10240
-rw-r--r-- 1 root root 10485760 2007-10-20 01:17 file0
root@hoth-ubuntu:/test#
root@hoth-ubuntu:/test# md5sum file0
f1c9645dbc14efddc7d8a322685f26eb  file0

Created a file and md5sum'ed it to keep it honest. :)

I then followed The Hunter's instructions on failing one loop device at a time and bringing up the bigger spares. Once that was all done, I did the following.

Code:
/* Before */
root@hoth-ubuntu:/# pvs
  PV                     VG          Fmt  Attr PSize  PFree
  /dev/md0               vg0         lvm2 a-   16.00M    0

root@hoth-ubuntu:/# pvresize /dev/md0
  Physical volume "/dev/md0" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized

/* After */
root@hoth-ubuntu:/# pvs
  PV                     VG          Fmt  Attr PSize  PFree
  /dev/md0               vg0         lvm2 a-   36.00M 20.00M


So far so good. The md5 is still the same.

Code:
root@hoth-ubuntu:/test# md5sum file0
f1c9645dbc14efddc7d8a322685f26eb  file0

However, /test still isn't the whole size... so let's grow it as well. I messed up on this part and did it in two steps: I didn't grow it to the full 20M available the first time because I fat-fingered the command, but got it on the second try.

Code:
root@hoth-ubuntu:/# lvextend -L+16M /dev/vg0/lv0
  Extending logical volume lv0 to 36.00 MB
  Logical volume lv0 successfully resized
root@hoth-ubuntu:/#
root@hoth-ubuntu:/# xfs_growfs /test
meta-data=/dev/mapper/vg0-lv0    isize=256    agcount=2, agsize=4096 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=5120, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 5120 to 9216

All said and done, df shows:

Code:
/dev/mapper/vg0-lv0    32M   11M   22M  33% /test

And md5

Code:
root@hoth-ubuntu:/test# md5sum file0
f1c9645dbc14efddc7d8a322685f26eb  file0

So it appears that if you have a raid setup LVM2'ed, you can grow the array and add the new space on the array to your existing LVM2 'partitions'.
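Side note on my two-step fumble above: lvextend can also take all of the free space in one shot with the percentage syntax. I didn't actually run it this way, so consider it untested here:
Code:
# lvextend -l +100%FREE /dev/vg0/lv0
# xfs_growfs /test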

of course.. BACK UP YOUR DATA for the Murphy's Law thing.

hopefully this will help out some people. :) Now to go off to Newegg and get my 750g drives. muhahahaha.
 
Cool to see this revisited, and glad it's helping people. trmentry, are you going to actually do this with a live array? If so, be sure to let us know how it goes. I've still got about 40% of my 3x500 RAID5 available, and 500 is still the cheapest price point in $/GB, so I don't think I'll be putting this into practice any time soon (need to start downloading more or get into the HD DVD thing, I guess :p)
 

I'm torn about doing this on a live array. I tend to be a bad luck magnet for shit going completely south on me when I attempt stuff like this.

So I'm thinking of bringing up the new drives in a new machine (I've ordered the mobo/cpu/memory) and then copying the data over with rsync or some such. Partly b/c I'm thinking of using XFS; I'm currently using ReiserFS.

o Asus P5GC-VM
o Celeron D 532
o 4g mem

I'll use a Supermicro SAT2-MV8 for the controller (it works in a normal PCI slot). I still need to get my new drives. I added 8x750 to my cart, and when I woke up from the shock, I'm thinking 8x500 is looking much better. But that doesn't gain me much over the 8x400 I currently have. I currently have about 120g free on the array (I'm a pack rat w/ no HD. What do I have on here?). I guess I could make the spare disk active and put data on it, but see the comment above about being a bad luck magnet. Or I guess I could get 8 more 400s and use both my Supermicros for 16x400 goodness.
 

Since the P5GC-VM has AHCI support, why not use SATA port multipliers? You can fit 5 SATA drives onto each AHCI SATA port on the motherboard. With all the money you save on the controller, you can buy more drives... :)

thx
mike
 
Port multipliers aren't as widely supported as one would hope. I'm not sure whether the AHCI chipsets work under Linux, but in any case it's not a sure thing.

Signing your posts is unnecessary. The forum software keeps track of who said what.
 