Averting disaster with redundant hardware.
Hard disk failures can be catastrophic. They can cause three days of downtime while a replacement drive arrives, is installed and has its data restored from a backup tape. It can take as long as three weeks to recover data initially, and three months to recover it completely.
Initial recovery replaces damaged data or programs on a priority basis. In the remaining months, data or programs of minor value can be replaced.
Backups are traditionally the mainstay of recapturing and restoring lost data, but they are not foolproof for restoring failed drives. The most recent day's data is generally missing, and damaged portions of the drive that were faithfully backed up are of no value when restored.
The crisis that results from a hard drive failure can be averted by purchasing a redundant array of inexpensive disks (RAID) subsystem. RAID removes the single point of failure by spreading the risk of hard disk failure over multiple disks instead of one. The redundancy offered by multiple disks offsets the lower reliability of a single disk. RAID can be either hardware-based or simulated in software. A hardware-based RAID system provides more robust features than its slower software-based alternative. RAID offers various mixes of performance, reliability and cost, which have been standardized and categorized in six levels known as RAID levels 0 through 5. (The Microsoft TechNet document entitled Reliability and Fault Tolerance, which describes RAID as implemented in the Microsoft Windows NT Server network operating system, can be found at http://technet.microsoft.com/cdonline/content/complete/boes/bo/winntas/technote/reliabil.htm.)
RAID fault tolerance. RAID allows the information on a failed drive to be restored without a backup tape. In fact, when a failing drive is removed, the server continues to operate normally and no data is lost. This is possible because the RAID hard disk controller reconstructs the missing information from the information that resides on the remaining healthy disks.
Disk striping. A RAID controller keeps track of the different physical disks and allocates storage space across all of them in a particular manner called striping. Striping spreads the contents of a single data file or program over multiple disks (at least three for RAID level 5). This can provide faster access to the data than if the file were located on a single drive, because the drives can read and write their portions simultaneously. With striping alone, however, the data on a failed drive would be lost. RAID level 5, for example, adds parity data (check digits) to disk striping, allowing the data on a failed drive to be reconstructed from the data remaining on the other drives.
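The round-robin placement that striping implies can be sketched in a few lines of Python. This is a hypothetical illustration, not how any particular RAID controller works: the function names `stripe` and `unstripe` and the chunk size are assumptions, and real controllers operate on fixed-size disk blocks rather than Python byte strings. It also shows why striping alone is not fault-tolerant: each disk holds chunks that exist nowhere else.

```python
from typing import List


def stripe(data: bytes, num_disks: int, chunk_size: int) -> List[bytearray]:
    """Deal fixed-size chunks of data round-robin across num_disks disks (no parity)."""
    disks = [bytearray() for _ in range(num_disks)]
    for start in range(0, len(data), chunk_size):
        chunk_index = start // chunk_size
        disks[chunk_index % num_disks] += data[start:start + chunk_size]
    return disks


def unstripe(disks: List[bytearray], chunk_size: int, total_len: int) -> bytes:
    """Reassemble the original bytes by reading chunks back in round-robin order."""
    out = bytearray()
    chunk_index = 0
    while len(out) < total_len:
        disk = disks[chunk_index % len(disks)]
        offset = (chunk_index // len(disks)) * chunk_size
        piece = disk[offset:offset + chunk_size]
        if not piece:  # no more data stored (e.g. total_len was too large)
            break
        out += piece
        chunk_index += 1
    return bytes(out)


disks = stripe(b"ABCDEFGHIJ", num_disks=3, chunk_size=2)
print([bytes(d) for d in disks])  # [b'ABGH', b'CDIJ', b'EF']
print(unstripe(disks, chunk_size=2, total_len=10))  # b'ABCDEFGHIJ'
```

Because the chunk `EF` on the third disk is stored nowhere else, losing any one disk in this layout loses part of the file. That is the gap RAID level 5's parity data closes.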
Hot swapping. Some RAID subsystems allow a technician to remove a failing hard disk drive without shutting down the server. This ability is known as "hot swapping." Many older drives are not hot-swappable, so even when a RAID controller supports hot swapping, the feature also requires hot-swappable drives to work.
RAID hardware recovery. When a new drive is slipped into a server, it contains no data and is completely unformatted. This is when a RAID controller really works best: it detects the new drive, determines that it is unformatted and proceeds to format it without anyone's intervention. The RAID controller then adds the newly formatted drive to the existing array, reconstructs the missing data and writes it to the new drive. It can take as little as 20 minutes for a RAID controller to finish writing the missing data to the new drive, completing the recovery process.
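The rebuild step can be illustrated with XOR parity, the operation commonly used to compute RAID level 5 check digits. In this hedged sketch, each drive is modeled as a Python list of equal-length blocks (one block per stripe), and the names `xor_blocks` and `rebuild_drive` are hypothetical. Because the data and parity blocks of a stripe XOR to zero, every block of the replacement drive equals the XOR of the surviving drives' blocks in the same stripe, regardless of which drive held the parity.

```python
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)


def rebuild_drive(drives: List[List[bytes]], replaced: int) -> List[bytes]:
    """Recompute every stripe's block for the blank drive at index `replaced`."""
    survivors = [d for i, d in enumerate(drives) if i != replaced]
    num_stripes = len(survivors[0])
    return [xor_blocks([d[s] for d in survivors]) for s in range(num_stripes)]


# Two stripes on three drives; parity sits on drive 2, then on drive 1.
d = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b", b"\xa0\xb0"]
drives = [
    [d[0], d[2]],
    [d[1], xor_blocks([d[2], d[3]])],
    [xor_blocks([d[0], d[1]]), d[3]],
]
original = drives[0]
drives[0] = [b"\x00\x00", b"\x00\x00"]  # blank replacement drive
drives[0] = rebuild_drive(drives, replaced=0)
print(drives[0] == original)  # True
```

The rebuild reads every stripe on every surviving drive, which is why rebuild time grows with drive size.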
Low- vs. High-End RAID Subsystems
It takes a minimum of three identical drives to obtain RAID level-5 fault tolerance. To an operating system (such as Windows NT), these three drives appear as one large drive that can then be partitioned into logical sections. In effect, two drives' worth of space holds actual data, while one drive's worth holds check digits. In reality, the data is striped across all three drives, so that only part of the data resides on any one drive. The check-digit (parity data) portion rotates from one drive to the next, so that each drive contains some data and some check digits.
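The rotating placement of check digits can be visualized with a small table generator. The rotation used here (parity moving one drive to the left on each successive stripe) is an assumption for illustration; actual controllers use several different rotation schemes. `raid5_layout` is a hypothetical name; `D` marks a data block and `P` a parity block.

```python
from typing import List


def raid5_layout(num_stripes: int, num_drives: int) -> List[List[str]]:
    """Label each drive's block in each stripe as data (Dn) or parity (Pn)."""
    rows = []
    next_data = 0
    for s in range(num_stripes):
        parity_drive = (num_drives - 1 - s) % num_drives  # assumed rotation scheme
        row = []
        for drive in range(num_drives):
            if drive == parity_drive:
                row.append(f"P{s}")
            else:
                row.append(f"D{next_data}")
                next_data += 1
        rows.append(row)
    return rows


for row in raid5_layout(num_stripes=3, num_drives=3):
    print("  ".join(row))
# D0  D1  P0
# D2  P1  D3
# P2  D4  D5
```

Each drive ends up holding one parity block for every three stripes, so two-thirds of the raw capacity carries data. This matches the description of two drives' worth of data and one drive's worth of check digits.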
Striping with parity data across the three drives is what provides the fault tolerance and ensures that no data is lost if one entire drive is removed. If one of the three drives is removed, the RAID controller recreates the missing data in real time from the data and parity remaining on the other two drives. While the controller can reconstruct the missing data, the RAID controller and the remaining two drives are no longer fault-tolerant; the third drive must be replaced to regain fault tolerance. Should a second drive fail before the third drive is replaced, all data would be lost.
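Real-time reconstruction and the two-failure limit can be sketched for a single stripe. In this hypothetical Python model, a missing drive's block is represented as `None`; `read_stripe` and its `parity_index` parameter are illustrative names, not a real controller interface. One missing block is recovered by XOR of the surviving blocks, while a second missing block leaves too little information, so the function raises an error.

```python
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)


def read_stripe(stripe: List[Optional[bytes]], parity_index: int) -> List[bytes]:
    """Return the data blocks of one stripe, rebuilding at most one missing block."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise IOError("two or more drives missing: stripe cannot be reconstructed")
    blocks = list(stripe)
    if missing:
        # Degraded mode: the lost block is the XOR of every surviving block.
        blocks[missing[0]] = xor_blocks([b for b in stripe if b is not None])
    return [b for i, b in enumerate(blocks) if i != parity_index]


a, b = b"\x01\x02", b"\xf0\x0f"
parity = xor_blocks([a, b])
print(read_stripe([a, None, parity], parity_index=2) == [a, b])  # True
try:
    read_stripe([a, None, None], parity_index=2)
except IOError as err:
    print(err)  # two or more drives missing: stripe cannot be reconstructed
```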
Replacing and hot swapping drives.
Low-end RAID controllers require a user to turn off the server when removing or adding drives. High-end RAID controllers allow drives to be removed and added while the server is running (the hot swapping described earlier). A system must have hot-swappable drives to use this feature. When a new drive replaces a bad drive, low-end RAID controllers require running a rebuild utility before booting the operating system. High-end RAID controllers rebuild the new drive when the server is restarted, and high-end RAID controllers with hot-swappable drives rebuild it without even turning off the server.
Adding and increasing drive space. When adding drives, low-end RAID controllers require a user to reformat all the drives from scratch. The operating system and the backup software must then be reinstalled to restore the rest of the programs and data from the backup tape. High-end RAID controllers allow drives to be added to a controller and dynamically add them to the existing drives, thereby increasing the usable space without reformatting the entire drive subsystem. High-end RAID controllers with hot-swappable drives allow this while the server is still running.
Recovery costs. The cost of RAID level-5 fault tolerance is the premium paid for the RAID controller plus the cost of one extra hard drive. For example, three 9GB drives provide only 18GB of usable space, because the equivalent of the remaining 9GB is used for parity data. If an entry-level RAID controller costs $800 while a standard controller costs $400, and 9GB drives cost $800 each, the RAID disk subsystem would cost $3,200 ($800 + 3 × $800). The additional cost of the RAID hardware would be $1,200 ($400 extra for the RAID controller plus $800 for the third drive), so users would be paying a 38% premium ($1,200 of $3,200) for RAID-5 protection. However, 27GB of usable space (instead of 18GB) can be obtained by adding just one more 9GB drive; the additional cost remains $1,200 while the total cost becomes $4,000, dropping the premium to 30%.
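The cost arithmetic above can be reproduced with a short function, which makes it easy to rerun with current prices. The function name and parameters are hypothetical. Following the example, the premium is the extra cost (RAID controller markup plus one parity drive) expressed as a share of the total RAID subsystem cost, and the non-RAID baseline is a standard controller with enough drives for the same usable space.

```python
def raid5_premium(num_drives: int, drive_gb: int, drive_cost: float,
                  raid_controller_cost: float, standard_controller_cost: float):
    """Return (usable GB, total RAID cost, extra cost over non-RAID, premium share)."""
    usable_gb = (num_drives - 1) * drive_gb  # one drive's worth of space holds parity
    total = raid_controller_cost + num_drives * drive_cost
    # Non-RAID system with the same usable space: standard controller, one fewer drive
    baseline = standard_controller_cost + (num_drives - 1) * drive_cost
    extra = total - baseline
    return usable_gb, total, extra, extra / total


for n in (3, 4):
    gb, total, extra, share = raid5_premium(n, 9, 800, 800, 400)
    print(f"{n} drives: {gb}GB usable, ${total:,} total, ${extra:,} extra, {share:.0%} premium")
# 3 drives: 18GB usable, $3,200 total, $1,200 extra, 38% premium
# 4 drives: 27GB usable, $4,000 total, $1,200 extra, 30% premium
```

Because the extra cost stays fixed at one drive plus the controller markup, the premium shrinks as more drives are added to the array.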
The actual cost of RAID hardware varies with the features purchased: the more features, the higher the RAID premium. A fully loaded RAID controller can cost between $2,000 and $4,000, while hot-swappable 9GB drives with 5ms access times and spindle speeds of 10,000 rpm can cost upwards of $2,000. Of course, computer hardware prices drop every month. Drives purchased in January 1998 for $1,600, for example, can now be purchased for $800.
While RAID hardware provides fault tolerance and redundancy, it does not provide any recovery from a complete hardware meltdown. RAID technology removes a single point of potential failure and replaces it with multiple redundant drives. If any one drive fails, it can be replaced and the missing data reconstructed on the new drive from the data remaining on the other drives.
If more than one RAID drive fails simultaneously, however, the fault tolerance will not work. The only source of recovery from multiple-drive failures or complete hardware failure is a backup tape.
Hardware alone will not guard against a catastrophic disaster. Offsite storage of duplicate backup tapes is helpful, but a comprehensive disaster plan should be designed and tested. The disaster plan should include a disk and data replacement plan, as well as a plan to supply a stable, conditioned source of power to the computer. The availability of spare parts and the technical expertise needed to install and configure them should also be considered. An IT professional who services and maintains a network should be able to help design an appropriate strategy.
Editor's note: Messrs. Maida and Brown are members of the AICPA Tax Division's Tax Technology Committee.
If you would like additional information about this article, contact Mr. Maida at (609) 882-6874 or firstname.lastname@example.org or Mr. Brown at (703) 848-2502 or email@example.com.
FROM STEVEN D. BROWN, CPA, BROWN & BROWN, PC, MCLEAN, VA
Joseph C. Maida, Shareholder, Nicholas C. Maida C.P.A., Chartered, Princeton, NJ
Steven D. Brown, Managing Partner, Brown & Brown, PC, McLean, VA