Inexplicably Decreased Performance - RAID Array on Live Server vs Test Server

Hurin

Hi All,

Relevant Hardware
Core 2 Quad @ 2.5GHz
Supermicro Q35 Motherboard
8GB RAM
Adaptec 6405
Intel SAS Expander
Norco 4020 case
Two 16TB RAID 0 Arrays of 4 x 4TB Drives each.

Software

Windows Server 2008 R2
Microsoft Data Protection Manager (DPM)
Hyper-V (prereq of DPM for restoring VMs)
MS SQL Server (lite edition prereq for DPM)

Problem
In a nutshell, if I replace the boot drive with a clean install of Windows Server 2008 R2 with all the same drivers but none of the additional software (DPM, etc.), I see data transfer rates of 450 MB/s between the two RAID 0 arrays.

When I boot to the production DPM server that will soon be using these arrays, with the exact same hardware, I see data transfer rates of only 235 MB/s between the two RAID 0 arrays.

I have made the faster system stack as similar to the production server as I possibly can short of also installing DPM on it. Same drivers, using the same arrays, everything. On the production server, I have tried disabling the DPM, SQL, and Hyper-V services and yet the performance delta remains.

Can anyone think of any registry setting, policy, or other buried setting that could account for a Windows Server with this software stack installed behaving so differently from a fresh Windows Server install on the same hardware? Write caching is enabled on the Adaptec controller at the hardware level and it's disabled in Windows on both systems.
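
If it helps anyone reproduce the comparison, the "transfer rate" I keep quoting is just a timed sequential copy of a large file from one array to the other. Here's a minimal sketch of that kind of test in Python; the paths and sizes are placeholders rather than my actual layout, and the fsync at the end is just there so the OS file cache doesn't inflate the number.

Code:
# Minimal sequential-copy benchmark between the two arrays.
# Paths, file size, and block size are placeholders, not my real layout.
import os
import time

SRC = r"E:\bench\testfile.bin"   # large test file on the first array (hypothetical path)
DST = r"F:\bench\testfile.bin"   # destination on the second array (hypothetical path)
BLOCK = 4 * 1024 * 1024          # copy in 4 MiB chunks

def copy_and_time(src, dst, block=BLOCK):
    size = os.path.getsize(src)
    start = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(block)
            if not chunk:
                break
            fout.write(chunk)
        fout.flush()
        os.fsync(fout.fileno())   # push writes out of the OS cache before stopping the clock
    elapsed = time.perf_counter() - start
    return size / (1024 * 1024) / elapsed   # MB/s

print(f"{copy_and_time(SRC, DST):.0f} MB/s")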

Sorry, but this mystery is getting the best of me. And it's driving me a bit nutty! =P
 
You have to use DPM as well? My condolences..
Helpful.

We get Microsoft software for (almost) free. So we are running a lot of Hyper-V VMs (both Linux and Windows guests). DPM makes backing up and restoring Hyper-V VMs very, very easy.

Free software on top of already free software infrastructure with no learning curve or complications (beyond this strange one) = win.

There are better alternatives. But not at the pricing we get.
 
I don't know that you have sufficient information in your post for anyone to do anything but wildly guess as to the cause. But, to be as helpful as the last guy... are you insane? :)

RAID-0 arrays of 4 x 4 TB drives? You'll lose 16 TB of data if you lose a single disk. And if that's not a really smart RAID controller with TLER-enabled drives, 'lose' could simply mean a drive hits a bad sector, spends too long trying to recover it, and the RAID controller decides the drive is bad and ejects it from the array. Poof, bye bye array.

And you do realize that even if you have some form of continual off-server backup, if a single one of those drives hiccups you lose that array and all the VMs on it? That means immediate downtime, and getting back up could take a significant amount of time: you'll have to find a new disk, put it in, create a new array, and then restore everything that was on there from backups. Even at 10 GbE, and even assuming you had a spare disk on hand and the array was only half full, that's still easily a 3-hour window where those VMs are dead. If the restore speed is only your aforementioned /fastest/ rate of 450 MB/s, that's at least 5-6 hours. 200 MB/s? 11. 12 TB used at a 200 MB/s restore rate? 17 hours.
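
To put rough numbers on that (back-of-the-envelope only; the capacities and rates are just the figures quoted above):

Code:
# Back-of-the-envelope restore windows: hours = data (TB) / rate (MB/s) / 3600
def restore_hours(tb_used, mb_per_s):
    return tb_used * 1_000_000 / mb_per_s / 3600

for tb, rate in [(8, 450), (8, 200), (12, 200)]:
    print(f"{tb} TB at {rate} MB/s ~ {restore_hours(tb, rate):.1f} hours")
# roughly 4.9, 11.1, and 16.7 hours respectively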

I hope there's no SLA whatsoever on whatever you're building, and that having to restore the thing from backups is a task you're prepared to do multiple times during the life of this environment.
 
The RAID 0 is temporary while I test. It will be RAID 6 when I'm finished if I determine that I can go over the 17TB LUNs that DPM has been tested (by MS) to support. Or a RAID 10 at 16TB if I decide to play it safe and not tempt fate (likely to be this one; backups are important, after all).
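
For anyone following along, the capacity math behind that choice, assuming I end up pooling all eight 4 TB drives into a single array (that's the working plan, not something I've committed to):

Code:
# Usable capacity for the two layouts under consideration,
# assuming all eight 4 TB drives end up in one array.
drives, size_tb = 8, 4

raid6_tb = (drives - 2) * size_tb    # two drives' worth of parity -> 24 TB, over the 17 TB tested-LUN limit
raid10_tb = (drives // 2) * size_tb  # mirrored pairs -> 16 TB, under the limit

print(f"RAID 6:  {raid6_tb} TB usable")
print(f"RAID 10: {raid10_tb} TB usable")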

Also keep in mind, this is the D2D backup server, which also writes weekly to tape. It's not hosting the VMs itself, just backing them up. But, that said, I'm still not going to run RAID 0. ;)

I understand that there isn't a lot of info to go on regarding the original issue. But, I actually am looking for "wild guesses" to check here since I've run out of ideas. The performance delta is, after all, inexplicable unless there's something inherent to DPM, MSSQL (lite), or Hyper-V that somehow turns on some conservative disk/array usage in Windows.

I suppose I'll make time in the next few weeks to wipe the backups server and start over. Might be a good opportunity to go to Server 2012 R2 and DPM 2012.

Still looking for possible causes though if anyone has any ideas! :D
 
Whew!

Well, if it helps, your description of the problem seems to implicate DPM? That 'upon installing DPM' the performance drops? That wouldn't really be all that surprising, depending on what DPM is and does. It could be installing an agent, it could be modifying Windows properties related to storage to make it sync or sync more often or some other things to improve reliability, it could be analyzing all disk writes (can you tell I have no idea what DPM does?), etc. 'Disabling' it might not be enough. If you have a fresh install that works faster, try just installing DPM and then seeing if the performance again drops. If so, at least you've found your culprit, and it would let you ask more targeted questions.
 
Yep, I was stopping short of spending that much time with our backup server offline while it's running the other boot drive. But really the only way to be sure it's just something about DPM is to install it and see if the behavior can be reproduced. As you said.

If I'm going to spend that much time on this though, I'd rather do it while also upgrading the OS and DPM version. ;)
 
Well, neither DPM 2010 nor 2012 has some kind of 450 MB/s performance limiter on the drives it uses. I routinely see 1.0-1.3 GB/s reads from the SAN disks.

The fact that it's so poorly coded that it's actually NEEDED to perform a backup is another discussion.
 
Yep, there must be some other X factor on my production server causing it to behave so differently when compared to the clean yet identical (hardware-wise) system.

Ah well, I have more than enough throughput to the tape drive even at the lower speed. So I guess it doesn't even matter. Was just a mystery that was bugging me.
 
Quick update. . . Hyper-V appears to be the catalyst. I replicated the entire software stack of my production server. The transfers between arrays remained zippy all the way to the end, when the only thing left uninstalled was Hyper-V. Upon installing Hyper-V, the performance decreased to the exact same level as my production server.

I'm not blaming Hyper-V. But it is a factor, perhaps combined with others on this system, that for some bizarre reason causes a substantial performance drop-off in transfers between my arrays on the Adaptec 6405 controller. Other factors may still be involved, but Hyper-V is apparently at least one of the catalysts.
 