Most of the servers in our data center run on CentOS. As Linux admins know, software Raid in Linux (mdadm, ZFS RAID-Z, etc.) has far surpassed hardware Raid in terms of feature set and reliability. However, several of our server admins prefer running CentOS and WHM inside virtual private servers (VPS) on a box running Windows Server, because that’s the OS they are most comfortable with.
Windows software Raid is very different. We have extensively tested software Raid on Windows Server 2008 & 2012 (and Windows 7 Professional) over the last 3-4 years. Here is what our data center techs have learned.
NOTE: This article refers primarily to challenges with Windows Software Raid 1, 5 & 10. Our testing has shown extremely high reliability and stability with Windows Raid 0 and with all forms of Linux software Raid.
When it comes to striping, Software Raid 0 offers 85-95% of the speed of hardware Raid 0. Software Raid 0 also seems to be incredibly stable, more stable in fact than Raid 0 on many of the hardware Raid cards we have tested. In our tests, Raid 0 arrays on LSI & Areca Raid cards failed 30-40% more often than Windows software Raid running on the same hardware and using the same hard drives. We now have an internal policy that requires software Raid for all our striped server arrays, except when a client requests something different.
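For readers unfamiliar with how striping gets its speed, here is a minimal Python sketch of the block-to-disk mapping behind Raid 0. The 64 KiB stripe unit and the function name are illustrative assumptions, not Windows internals:

```python
# Minimal sketch of Raid 0 striping: logical blocks rotate across disks.
# The 64 KiB stripe unit is an assumed (though common) default.

STRIPE_SIZE = 64 * 1024  # bytes per stripe unit

def locate_block(logical_offset: int, num_disks: int):
    """Return (disk_index, offset_on_disk) for a logical byte offset."""
    stripe_index = logical_offset // STRIPE_SIZE
    disk = stripe_index % num_disks              # stripes rotate across disks
    stripe_on_disk = stripe_index // num_disks   # how deep into that disk
    offset_on_disk = stripe_on_disk * STRIPE_SIZE + (logical_offset % STRIPE_SIZE)
    return disk, offset_on_disk

# With 2 disks, consecutive 64 KiB stripes alternate between them:
print(locate_block(0, 2))             # first stripe lands on disk 0
print(locate_block(64 * 1024, 2))     # second stripe lands on disk 1
```

Because consecutive stripes land on different disks, large sequential reads and writes are serviced by all members in parallel, which is why striped throughput approaches the sum of the individual drives' speeds.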
Windows Software Raid becomes far more problematic with all other Raid types. For running Raid 1 (mirrored), Raid 5, or any other Raid type on a Windows box (striped Raid 0 is the exception to this rule), our internal policy requires an LSI or Intel Raid card, as those are the only two we have found to be stable enough. Here are the largest issues we have consistently seen with Windows Software Raid 1, 5 & 10:
– A power loss, reset or reboot of the server while the Raid array is in use (even during read-only operations where no changes are being made to the array) causes Windows to initiate a complete array rebuild. The same thing happens if you shut down or reboot the server and Windows ends some of the services too quickly or has to force them to quit.
– Windows Raid is not a true Raid. With Windows software mirrored and parity arrays, when a drive mechanically fails or you physically remove one of the drives, you almost always lose all of the data. With any non-Windows Raid 1, 5 or 10 array (hardware or software), when a drive is removed from (or mechanically fails in) a mirrored Raid 1 array, we are still able to access the data from the remaining drive. With Windows Raid 1, however, if we completely remove one of the drives, the entire array shows as offline or corrupted, and we are unable to recover any of the data until the problem drive is repaired and re-connected to the array. In our testing, adding a new drive to a Windows Raid 1 or 5 array resulted in a rebuild with no lost data only 28% of the time.
– High I/O activity on a Windows Software Raid 1 or 5 array frequently forces a rebuild (never run a database or a virtual machine on a Windows Raid 1 or 5 array).
– Rebuild time: It’s understandable that Software Raid took 4-7 times longer for rebuilds than Hardware Raid, since Software Raid requires more CPU and system resources. For example, rebuilding a 4TB hard drive on an Adaptec Raid controller takes 8-12 hours on one of our servers, depending on how high the I/O load is. A Software Raid rebuild on that same box for that same drive, however, can take as long as 48-96 hours, and the rebuild process significantly slows down every other process on that server. We’ve seen Server 2008 consume as much as 11GB of RAM and 5 CPU cores JUST FOR THE REBUILD PROCESS ITSELF.
– Rebuild failure: The ONLY way to ensure a good chance of avoiding data loss during a Windows Software Raid rebuild is to remove the server from the network and keep its load as close to zero as possible for the duration of the rebuild. More than half the time, it was faster for us to simply move the data onto another array, fix the bad array, and move the data back. During our testing, the most common cause of a complete, unrecoverable array failure was medium-to-heavy I/O against the array while it was rebuilding. This is the single biggest caution for running Windows Software Raid in any kind of production environment (again, Raid 0 is the exception). We have never seen a hardware array fail during a rebuild on any of our Adaptec Raid controllers, even during peak I/O times, except for the few cases where two drives mechanically failed. Even when all drives are mechanically sound, a Windows Software Raid rebuild is EXTREMELY fragile.
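The "not a true Raid" complaint above is easiest to see against how parity Raid is supposed to behave. In a conventional Raid 5 set, the parity block is simply the XOR of the data blocks, so any single missing block can be recomputed from the survivors. A minimal Python sketch of the idea (illustrative only, not Windows or controller code):

```python
# Why a true parity array survives one drive failure: parity = XOR of the
# data blocks, so any one missing block equals the XOR of everything left.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three drives
parity = xor_blocks(data)            # parity block on a fourth drive

# Drive 1 "fails": rebuild its block from the two survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]            # the lost block is fully recovered
```

A hardware controller (or Linux mdadm) performs exactly this recovery on the fly, which is why a degraded array stays readable; the behavior we describe above, where Windows marks the whole volume offline instead, is what makes its mirrored and parity modes so risky.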
Windows Software Raid 0 (striping) has been the surprising exception to all of these rules. In fact, we have seen data transfer rates between two Windows Raid 0 arrays as high as 1.2GB/sec. The highest transfer rate we have seen between two striped hardware arrays is 1GB/sec. Also surprising is the fact that none of our techs have ever had a Software Striped array fail from a server reboot or power failure, even during data writes.
Windows Software Raid offers many benefits, as long as you know what to expect and you always keep an additional backup. It’s free, it works with practically any type/size/brand of hardware, it’s easy to set up, and it’s extremely fast for data transfers. The most important lesson to take away from this is the critical importance of NEVER relying on a Windows Software array for the only copy of your data. Regardless of the array type, always make sure you have a minimum of one backup copy. Many of our clients’ virtual servers run on Software Striped arrays because of the speed increase, but we never store production data on Windows Software arrays, and every byte of data is backed up a minimum of twice to other locations (Hardware Raid 60, tape, offsite, etc.).
For servers or IT departments running on a tight budget, Windows Software Raid can be a godsend when used correctly in tandem with data backups.
Our Data Center
This extensive testing has helped us provide rock-solid stability for our clients’ data. Our server techs exclusively run Hardware Raid 60 for every one of our secure data storage servers. This provides the extreme fault tolerance of Raid 6 combined with the high speed of Raid 0 striping. Raid 60 allows up to 8 drives to fail simultaneously on one of our servers, with zero data loss or server downtime. And if that ever happened (it hasn’t yet), 4 of those drives would rebuild simultaneously while the server kept running. The only cards we have found that can provide this are Adaptec SAS/SATA Hardware Raid cards. Our card of choice is the Adaptec 2274600-R 71605Q. It supports Hardware Raid 60, and up to 4 cards can run simultaneously in the same physical server, for a total of 64 drives, configured in any number of different array types and combinations. This card also has an onboard battery to protect data in case of power loss or server reset, and supports “spare” Raid drives. A “spare” is a drive that sits idle but is used immediately to rebuild an array when one of its drives fails.
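The 8-drive fault-tolerance figure follows from the layout: Raid 60 stripes (Raid 0) across multiple Raid 6 groups, and each Raid 6 group tolerates two simultaneous drive failures. A quick Python sketch of that arithmetic, assuming an illustrative 4-group layout (the figure is the best case, where the failures fall two per group):

```python
# Illustrative arithmetic only; the group layout is an assumption, not a
# vendor spec. Raid 60 = Raid 0 striped across several Raid 6 groups, and
# each Raid 6 group survives up to two simultaneous drive failures.

def raid60_max_failures(num_groups: int, per_group_tolerance: int = 2) -> int:
    """Best-case simultaneous drive failures a Raid 60 set can absorb."""
    return num_groups * per_group_tolerance

# With 4 Raid 6 groups, up to 8 drives can fail (two in each group)
# with no data loss:
print(raid60_max_failures(4))
```

Note the asymmetry: a third failure inside any single Raid 6 group loses that group (and with it the whole striped set), which is why the best-case number only holds when failures are spread across groups.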