Calling all File Server gurus...what is the best File Server money can buy?

joecool234

My company has a client with lots of data: 326,000 files totaling 150GB, and this will only continue to grow. He currently has all of this data residing on a single HP DL380 G5. The OS is on 2 72GB 10K SAS drives and the data on a separate RAID5 of 3 146GB 10K SAS drives. For several weeks, he has been complaining about severe latency issues. It can take a minute or two to open a simple 5MB Access database across the LAN. As far as hardware issues go, we have pretty much ruled out every possible scenario. The entire LAN is gigabit with 3Com SuperStacks. The server even has both NICs in teaming mode. Not a single application is installed on this server. It is simply a file server running Windows Server 2003 R2 Standard. I even rebuilt the server from scratch, and the latency is still there. There are no complaints about latency with the other servers. The network consists of a DC, an Exchange server, 2 SQL servers, SharePoint, and another file server. Everything else runs just fine across the LAN (yes, even the other file server).

We have pretty much decided that the issue lies with the disk subsystem. Even browsing the data partition on the server locally is incredibly slow (10-15 seconds to go from folder to folder). During all of these tests, CPU usage never goes above 10%.
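
To put a number on the folder-to-folder slowness, a rough Python sketch like this times each top-level folder enumeration (assumes Python is available; the D:\ path is just an example):

import os
import time

ROOT = "D:\\"  # hypothetical data partition root

for entry in os.scandir(ROOT):
    if entry.is_dir():
        start = time.perf_counter()
        count = sum(1 for _ in os.scandir(entry.path))  # enumerate one folder
        elapsed = time.perf_counter() - start
        print(f"{entry.path}: {count} entries in {elapsed:.2f}s")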

We have decided to go with a DL320 with a dual-core Xeon 2.6GHz CPU and 4GB of RAM. We even opted for the 512MB upgrade kit for the RAID controller cache. We picked this server because it is the only one HP offers (within reason) that can support 146GB 15K SAS drives. We are going to put 6 of these in a single RAID5 array. The OS will be on a 10-15GB partition and the rest will be the data partition. Does anyone see a problem with this? Will the extra cache and extra drives (6 as opposed to 3) result in faster read speeds? I know write speeds will take a hit from the RAID5 parity overhead, but people are primarily complaining about opening files, not saving them.
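
For a rough sense of what the extra spindles buy, here is a back-of-envelope random-I/O estimate in Python (the per-drive IOPS figures for 10K/15K SAS and the 90% read mix are assumptions, not measurements):

# RAID5 small-write penalty: each host write costs ~4 disk I/Os
def raid5_iops(drives, per_drive_iops, read_fraction):
    reads = per_drive_iops * drives * read_fraction
    writes = per_drive_iops * drives * (1 - read_fraction) / 4
    return reads + writes

print(raid5_iops(3, 170, 0.9))  # current: 3x 10K SAS -> ~472 IOPS
print(raid5_iops(6, 200, 0.9))  # proposed: 6x 15K SAS -> ~1110 IOPS

Under this model the extra spindles help writes as well as reads; the parity penalty is a constant factor, so 6 drives should not write any slower than 3.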

Any input will be greatly appreciated. I will be more than happy to provide any additional information.
 
The question is: How much money do you want to spend and what class of file server are you looking for? I can tell you what to look for in x86/64, or IA32/IA64 server space, but not anything beyond that.
 
In all honesty 150GB is nothing for a database file.

1. You should check the hard drives for errors using the manufacturer's tools.
2. If possible, try swapping the drives into another server/computer and testing the performance; if it's better, then some other hardware is at fault.
3. Try posting your problem on an Oracle/DB forum or thread, because I strongly believe your problem lies in the DB itself.

It is always better to buy a server and only build workstations from scratch, because of WARRANTY, and also because pre-built servers go through rigorous testing and are made 100% compatible with all the hardware.

DELL or HP; avoid IBM, too expensive.

ALSO, don't put the OS on the same drive as your DB. It is always better for performance and maintenance to have the OS on a separate drive from your DB files, because what you want to achieve is to have each spindle either writing or reading. Yes, 6 drives will improve read and write speeds compared to 3 drives.
 

I appreciate the response, but I just want to clarify a few things. This client can have NO downtime. Moving the drives to another server requires 2 servers to go down. He also wants to limit the number of man-hours. While a new server is certainly not cheap, we will use the "old" server as a new SQL server, and so on. He feels that at this point, more money thrown at a server is more economical than money thrown at manpower.

So in response to these points:

1) This is a high-end DL380. Almost every aspect of the server is tied into the server's management processor, which is monitored in real time by several services. There are no other utilities from HP that we can run. The only other option is to bring it back to our lab and benchmark the hell out of it.

2) Moving the drives to another server is a very good idea that we had not thought of yet. Unfortunately, as some permissions are applied with local groups (not on the domain), I don't want to mess with the NTFS ACLs. He also doesn't want downtime, and our after-hours rate is kind of excessive.

3) This server is by no means a database server. Most of the 326,000 files are AutoCAD drawings. He just happens to see the most latency when opening the dozen or so Access .mdb files. And in case you're thinking it, the databases are fine. They were copied to another file server and opened right away across the LAN.

As for the comment about not rebuilding servers from scratch, I may not have been clear before. All I did was reinstall the OS. I formatted the C drive (on its own RAID1 array) and reinstalled Windows. This does not void the warranty, as the box was purchased from HP without an OS.
 
Forgive me if I am giving an idea that is completely off base, since I am still somewhat new to the server game. We have used ASR at a few clients to minimize downtime and set up servers in parallel with each other. The only problem is that the partitions you start with on the original server are what you start with on the new server. I don't know if this is a viable solution for minimizing downtime at all, but I figured I'd throw it out there.
 
I don't have much experience with ASR, but since the two servers are completely different, I would think this is out of the question. Thanks for the suggestion though.
 
Figured it was worth a shot. The ASR is pretty easy to use as well.
 
I'm assuming you have "teaming," aka link aggregation, configured on both ends, server and switch, since an improperly configured team can have negative results.
 
I would think this would be evidenced by occasional flat lines in the network monitor in Task Manager. This is never the case. All we see are spikes that never really go above 5%. And Windows sees this as a 2Gb adapter, as NIC teaming is set up on the server. As far as link aggregation on the switch, I would think that is only necessary when teaming is NOT set up on the server. The HP teaming utility has an algorithm that sets up both NICs to transmit.

I have been trying to find information on the ideal stripe size for a RAID5 array, and everything I've read basically says to use the controller's default. In the case of the P400, I think it's 64KB. Any suggestions? Again, most if not all of the files on this server are going to be less than 20MB.
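
For what it's worth, a quick worked example of why the controller default is probably fine for files this size (numbers are illustrative):

stripe_kb = 64
data_disks = 5  # a 6-drive RAID5 holds 5 data disks' worth per stripe
file_mb = 20

full_stripe_kb = stripe_kb * data_disks   # 320 KB per full stripe
print(file_mb * 1024 / full_stripe_kb)    # -> 64.0 full stripes per file

Even a 20MB file spans dozens of full stripes, so sequential reads already touch every spindle; stripe size mostly matters when files are smaller than one stripe.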
 
My bet is a failing drive, since you say it can take 10-15 seconds to switch between folders.

My bet is IBM servers, although I am a little biased as I work for them as a Service Rep. The higher-end IBM stuff is expensive, but you cannot negate the fact that IBM has ACTUAL employees who come onsite for your service calls and are located within 2-3 hours of every tiny hick town in North and South America, LOL. The actual onsite service is almost never subcontracted out, which is why you pay the enormous price for the hardware and the maintenance service.


Steve
 
What metrics did you look at to conclude it was the disk subsystem? Does the network perform like you'd expect if you take the disks out of the picture? Why are you running RAID 5 and not 10? Have you checked for incorrect checksums going out on the wire? Are there any differences between this machine and the other file servers that are working? What does the array management software say?
 
Which controller is supporting the RAID5? Do you have the drives split across individual channels?
 
Spindles Man!! Add more spindles! Get a P800 controller for that DL380G5 and an MSA60 with 12 small 15K SAS drives. You will have bumped your disk I/O by 5x+. I'd keep it RAID 5 with a hotspare for maximum read speed.

OR, if you're feelin' frisky get some fast NetApp FC disk. $$$$
 
If you absolutely cannot afford downtime, you need clustered NAS. Literally, no single point of failure. For a system like that, you pay for the robust software that makes sure it fails over - you pay a lot more than just the price of hardware. If you need it for a revenue generating application, then it is worth it.
 
Paging ockie to thread 1279433

Ockie reporting in! :D

It sounds to me as if the OP has some sort of software or config difference on this server. This does not seem to be a server hardware issue.

I appreciate the response, but I just want to clarify a few things. This client can have NO downtime. Moving the drives to another server requires 2 servers to go down. He also wants to limit the number of man-hours. While a new server is certainly not cheap, we will use the "old" server as a new SQL server, and so on. He feels that at this point, more money thrown at a server is more economical than money thrown at manpower.

So in response to these points:

1) This is a high-end DL380. Almost every aspect of the server is tied into the server's management processor, which is monitored in real time by several services. There are no other utilities from HP that we can run. The only other option is to bring it back to our lab and benchmark the hell out of it.

Our DL380s are screaming fast; this must be either a configuration issue or a software issue.

2) Moving the drives to another server is a very good idea that we had not thought of yet. Unfortunately, as some permissions are applied with local groups (not on the domain), I don't want to mess with the NTFS ACLs. He also doesn't want downtime, and our after-hours rate is kind of excessive.

I wouldn't worry much about moving the drives.

3) This server is by no means a database server. Most of the 326,000 files are AutoCAD drawings. He just happens to see the most latency when opening the dozen or so Access .mdb files. And in case you're thinking it, the databases are fine. They were copied to another file server and opened right away across the LAN.

Can you post performance metrics and baseline information? What about other server baselines?

Also, what is EVERY SINGLE program running on this machine?


I would think this would be evidenced by occasional flat lines in the network monitor in Task Manager. This is never the case.

You would only see a flat line if you start to saturate your connection; however, since you are saying the performance is weak, you would never know.

All we see are spikes that never really go above 5%.

Processor or network utilization? If this is your net util., I have a feeling that something isn't right there.

And Windows sees this as a 2Gb adapter, as NIC teaming is set up on the server. As far as link aggregation on the switch, I would think that is only necessary when teaming is NOT set up on the server. The HP teaming utility has an algorithm that sets up both NICs to transmit.

You should configure the switch to do the same. There are many types of aggregation options, and the switch won't know what you are trying to get it to do. Some advanced and newer switches may have this option already; some don't. I would configure both ends so that you eliminate that variable entirely.

I have been trying to find information on the ideal stripe size for a RAID5 array, and everything I've read basically says to use the controller's default. In the case of the P400, I think it's 64KB. Any suggestions? Again, most if not all of the files on this server are going to be less than 20MB.

Leave it at default until you figure out your problem. Have you tried perhaps dropping the network teaming and just going with a single link for the time being? You shouldn't get any downtime.

If you absolutely cannot afford downtime, you need clustered NAS. Literally, no single point of failure. For a system like that, you pay for the robust software that makes sure it fails over - you pay a lot more than just the price of hardware. If you need it for a revenue generating application, then it is worth it.

qft.
 
What happens if you open the files locally? Is it slow too? If it is slow too, then it has nothing to do with the network. Otherwise, it is most likely a network issue.
 
Ummmm, before you start shelling out bucks for a new setup, you may want to dig into the perfmon stats and prove it's something within the disk subsystem first. Specifically, Disk Queue lengths and IOPS during these moments of hiccups. If you use 3rd-party disk management, you can see "hot spots" on the disk, and too many spread across the disk can result in some bad disk readouts.
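
A minimal way to capture those counters during a slow spell, sketched in Python around the built-in typeperf tool (the sample interval, count, and output file name are arbitrary choices):

import subprocess

counters = [
    r"\PhysicalDisk(_Total)\Avg. Disk Queue Length",
    r"\PhysicalDisk(_Total)\Disk Transfers/sec",      # rough IOPS
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Transfer",  # per-I/O latency
]
# 5-second samples for 5 minutes, saved to CSV for later review
subprocess.run(["typeperf", *counters, "-si", "5", "-sc", "60",
                "-f", "CSV", "-o", "diskstats.csv"])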

My guess? Windows file system/file sharing. Depending on your file structure, Windows can be extremely laggy. For example:

Have a client that has a share comprised of 600,000 files, very small PDF files. Oh yeah, they are in the SAME FOLDER. The program that writes them is Java-based and does not have to read the file structure before writing the file. HOWEVER, when the users browse the folder through a Windows client, it takes 1-2 minutes to display the files, and it responds very laggy. BTW, this is a Windows file cluster running on an extremely fast HP SAN @ 4Gbps. Disk queue length couldn't be closer to 0 on this box. Have other shares that have upwards of 3 million files in a single folder, and it's almost impossible to display anything. Friggin nightmare.
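
If the writing application could ever be repointed, one standard workaround is to shard a flat folder into fixed buckets so no single directory gets huge. A rough sketch; the paths and the 256-bucket choice are purely illustrative:

import hashlib
import os
import shutil

SRC = "D:\\bigshare"        # hypothetical flat folder
DST = "D:\\bigshare_split"  # hypothetical bucketed layout

for name in os.listdir(SRC):
    # first two hex chars of the name's hash pick one of 256 buckets,
    # so 600,000 files land at roughly 2,300 per folder
    bucket = hashlib.md5(name.encode()).hexdigest()[:2]
    target = os.path.join(DST, bucket)
    os.makedirs(target, exist_ok=True)
    shutil.move(os.path.join(SRC, name), os.path.join(target, name))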

So my question to you: How does the file structure lay out?

I agree with Ockie. The DL380 G5 is a snappy box. I have many servers running many configs on this hardware, and have never questioned the physical hardware as a limitation.
 
Ok, so this thread is jumping now. Sorry for the delayed response.

After reinstalling the OS, there are only 3 applications installed. One is the HP Management Tools suite. The others are McAfee Enterprise 8.0i and the Backup Exec 11d remote agent. That's it. But after digging around some more, I see one serious culprit: the RAID5 array is split across 2 channels. Channel 1 serves both disks of the RAID1 logical drive (C:) and 2 disks of the RAID5 array (D:). Channel 2 serves the other disk of the RAID5. I have no idea why it was set up this way. I wonder if this could lead to performance issues.

As for perfmon, I could not find a single metric that proves my theory that it is disk related. The average DQL is usually about 0.5. I haven't looked at IOPS though; maybe that is worse.

When I mentioned 5% before, I was referring to 5% of the 2Gbps teamed NIC, not CPU.

Considering that he has an identical DL380 configured as an additional File Server connected to the same switch, I seriously doubt the switch configuration is the problem. I'm not 100% sure how the disk arrays are configured though.

At first, we did think the way the files were organized could be the problem. On the RAID5 D: drive, he had a single folder with 8 subfolders, and all 8 subfolders were shared, so all 326,000 files were stored under a single top-level folder. We then had the client move all 8 folders to the root of D: and reshare them... no difference in performance. And yes, when users are complaining about latency (during times of heavy use), browsing the D: drive locally is incredibly slow.

Ok... so I now have sitting behind me a DL320s with 6 15K SAS 146GB drives. I just put Server 2003 R2 on it and it's flying high. I put 5 drives in the RAID5 array, all on the same channel; in fact, this is the only option, as all 12 bays are on a single channel of the P400. I guess it has some kind of multiplexer on the backplane. The sixth drive is a hot spare.

After booting into Windows, I see the ACU is reporting a 25% read / 75% write cache ratio. I went ahead and changed this to 50/50. All other options are left at default. Oh yeah, and I upgraded to the 512MB BBWC option.

I then ran the following commands:
Fsutil behavior set disable8dot3 1 (stops generating 8.3 short names, which gets expensive in folders with huge file counts)
Fsutil behavior set disablelastaccess 1 (stops updating last-access timestamps on every read)
Fsutil behavior set mftzone 2 (reserves a larger region of the volume for the MFT, which helps with lots of small files)

Otherwise, I don't know what else to do. And tonight is the install. Hopefully I'll have a progress report posted here by the end of the week.

The client is also going to give us the old server for a couple of weeks. I am going to try and run every benchmark I can think of (any suggestions?). I will then reconfigure the RAID arrays to NOT be split across channels and then re-benchmark. Hopefully that proves my theory. If not, I'm gonna beg HP to give us a new server (like that'll work).
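
One suggestion while it's in the lab: a crude random-read test you can run identically before and after reconfiguring the arrays (block size, sample count, and the test-file path are arbitrary; use a test file much larger than RAM so the Windows cache doesn't hide the disks):

import os
import random
import time

PATH = "D:\\testfile.bin"  # hypothetical large test file
BLOCK = 64 * 1024
SAMPLES = 2000

size = os.path.getsize(PATH)
with open(PATH, "rb", buffering=0) as f:  # unbuffered reads
    start = time.perf_counter()
    for _ in range(SAMPLES):
        f.seek(random.randrange(0, size - BLOCK))
        f.read(BLOCK)
    elapsed = time.perf_counter() - start

print(f"{SAMPLES / elapsed:.0f} random {BLOCK // 1024}KB reads/sec")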
 
IMO it sounds like you are gung-ho on pushing a new server without figuring out why this problem is happening.


Well best of luck to you.
 
If it were up to me, I would spend more time and get the problem solved. Unfortunately, the client is rather impatient and simply wants a new server. He needs a new server anyway to replace his aging SQL server; he's just jumping the gun by a few months. The DL380 giving us issues will come back to the lab to undergo extra diagnostic tests and benchmarks before going back into production as the "new" SQL server.
 
To the OP: I am going with a few others in the thread; that workload should in no way saturate those drives. There is something else going on.

If he wants the new server, buy it; set up RAID 5 or 10 with some 15K RPM drives, set up DFS between the two servers, let it replicate, then rename the new one, DNS alias, whatever.

To Rocco123: no OS is going to handle 600k small files in one folder gracefully; don't blame the lag on Windows.
 
Have a client that has a share comprised of 600,000 files, very small PDF files. Oh yeah, they are in the SAME FOLDER. The program that writes them is Java-based and does not have to read the file structure before writing the file. HOWEVER, when the users browse the folder through a Windows client, it takes 1-2 minutes to display the files, and it responds very laggy.

With 600k files in one folder, I'm not surprised it is so slow. My 2TB fileserver with 2.01TB of files in one folder (6,500 files) is snappy, but each time I open the folder it is 1MB of traffic just to do the folder listing.
So, at roughly 100x the amount of files I have, you are probably looking at 100MB of traffic each time you open the folder on a client machine. Plus, I can't think of anything that would do well displaying 600k files.
 
NTFS is a poor file system for a situation where you have more than 100k files in a folder - with 600k files you need a file server that uses a better file system. You could go nuts and get a full-on NetApp cluster if you need that kind of reliability, or go with their small-office brand (StoreVault) - both use the WAFL file system, which organizes and pages directory metadata in a way that scales well with lots of files in one directory. A NetApp FAS270 "shrunken head" may be another good alternative for this class of application - basically it is a 3RU disk shelf, but it has an embedded server in the slot where the FC-AL loop card would normally plug in, and you can run redundant embedded servers all within the single enclosure just by adding a 2nd card - they synchronize over the backplane.

You may also want to look at HP's PolyServe offering, which provides clustering on top of Windows but also replaces NTFS with their own file system. This solution requires shared storage, either iSCSI or Fibre Channel, attached to both servers in the cluster - and in fact it scales to clusters of 16 servers all sharing a common pool of disks. I believe EMC makes a small Clariion array with redundant (mirrored) RAID controllers - CX100 or CX1000 - which is one candidate for the shared storage. Obviously HP will try to sell you an array in a package deal, but don't be afraid to look elsewhere - regardless of what HP tells you, this solution will work with any array that supports Windows clustering (look at Microsoft's approved vendor list on the Server 2003 web site).

What you don't need are faster disks - 7200 RPM SATA disks are fine for the rate of change in this application, as long as you have 5-8 spindles to keep up with the I/O rate. What you do need is a better file system than NTFS.
 
I agree with Bryan's post regarding NetApp. If you really need the uptime, you pay for it, and the NetApps are an excellent platform. To get the best out of them, profile the current data load when you buy and they'll help you sort out a system that meets your requirements. NetApp sizes systems in IOPS (I/O operations per second). You've said uptime is important, so you'll need dual heads.

Sun has recently produced a NAS system. The cost/performance ratio is excellent, but it comes with a large amount of storage as standard.

As has been said a few times, go for lots of spindles and smaller disks. Generally speaking you don't need anything more than 10K SCSI. 7200rpm would probably be fine.

If you don't go with an off-the-shelf NAS, don't use RAID5 - it's too expensive during rebuilds. Go with RAID 1+0, making sure to put the mirror pairs on different controllers. Have a couple of hot spares as well.

Whatever they get, make sure they have appropriate onsite support. Failed disks and a 3-day lead time are no fun, especially as when disks start to fail there tend to be multiple failures relatively close together.

Oh, and make sure they have a good backup/restore policy that is regularly tested.

And don't switch the disks between servers. It's too great a risk and you can't test it before you do it. Restore the data from a backup to the new system, then sync data from the old machine to the new machine before rollout (do this a few times over a week or so to make sure you're 100% happy with it).
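
For the sync step, something along these lines works; a sketch driving robocopy from Python, where the server name and paths are placeholders and the flags are the usual mirror-with-ACLs set:

import subprocess

# /MIR mirrors the tree (including deletions), /COPYALL keeps NTFS ACLs,
# /R:1 /W:1 keeps retries from stalling the run on locked files
subprocess.run(["robocopy", "D:\\data", r"\\NEWSRV\D$\data",
                "/MIR", "/COPYALL", "/R:1", "/W:1"])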

Cut over at night during a scheduled outage. Make sure you have a back-out plan that is simple. If something goes wrong when they don't want the downtime, there'll be trouble! :)
 
A few things you can try if you are willing to move stuff around:

1. I know for a fact that sometimes certain card combinations in HP servers can cause problems if they are all using the same bus. Try making sure that if you have multiple cards installed that they are not using a shared bus. I have no idea if you are using a PCI-X interface or a PCI-e interface, so this may or may not be an issue.

2. Network port speed mismatch. I have seen problems like this when the speed of the NIC interface is set to auto and the switch is not. For the most part, the NIC interface speed should be hard-set to match the switch.

3. Your OS should be on a different set of disks than your data disks, and your swap file should even be on another set if possible.

4. If accessing databases I recommend RAID 10. We used it exclusively in the banking business where databases are hit hard.

5. Remember Access databases don't scale well. Is it possible that you have outgrown your database app?
 