Data protection: the #1 storage priority; there's no ILM process without it.
Different levels of data protection exist and, as expected, higher levels cost more to implement. If hardware devices had been perfectly reliable in the early years of the IT industry, businesses would only have needed straightforward backup and recovery processes. Software errors, human errors, natural disasters, power failures, building damage, and intrusions such as worms and viruses have turned data protection into a complex process. Data protection and security have evolved over the years from simply improving the MTBF (Mean Time Between Failures) of devices to implementing local backup, remote backup, and campus and remote hot sites. Three objectives, defined by SNIA (the Storage Networking Industry Association), are now routinely used as the criteria for building an optimal data protection strategy.
RPO (Recovery Point Objective): The maximum acceptable amount of data loss, expressed as the desired amount of time between data protection events.
RTO (Recovery Time Objective): The time needed to recover from a data loss event and return to service. In other words, this requires classifying data or an application by its criticality or value to the business and determining how long the business can survive without this data.
DPW (Data Protection or Backup Window): This is the acceptable amount of time available for an application to be interrupted while data is copied to another physical location for backup purposes.
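The three objectives can be treated as simple pass/fail thresholds when evaluating a protection plan. The sketch below is illustrative only; the `ProtectionPlan` fields and `meets_objectives` helper are hypothetical names, not part of any SNIA specification.

```python
from dataclasses import dataclass

@dataclass
class ProtectionPlan:
    backup_interval_hours: float   # time between data protection events
    restore_hours: float           # measured time to restore and return to service
    copy_window_hours: float       # time the application is interrupted per backup

def meets_objectives(plan, rpo_hours, rto_hours, dpw_hours):
    """Return a pass/fail result for each SNIA objective."""
    return {
        "RPO": plan.backup_interval_hours <= rpo_hours,
        "RTO": plan.restore_hours <= rto_hours,
        "DPW": plan.copy_window_hours <= dpw_hours,
    }
```

For example, a nightly backup meets a 24-hour RPO, but a 4-hour restore fails a 2-hour RTO; the gap tells the administrator which part of the strategy to strengthen.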
Note: Replication differs from backup in that replication moves data to another location where it remains accessible from both locations. A replicated image is ready to go almost immediately, whereas a backup copy must be restored, by backup software or intelligence in the fabric, before it becomes accessible again. A replicated image therefore improves the RTO significantly.
Most businesses have developed multiple strategies to reduce and, where possible, eliminate downtime. Computer downtime can cost from $50,000 to nearly $3 million per hour, depending on the business and application, so minimizing downtime is potentially the most critical IT activity. Businesses that haven't measured the cost of downtime should do so in order to establish their own value-of-data metrics. The computer industry has made significant strides in reducing the failure rate of the IT infrastructure by providing enhancements such as RAID, a variety of replication capabilities, hot (non-disruptive) code loads, and many redundancy features. Natural disasters require careful development of contingency plans, while power failures mandate expensive provisioning of alternative energy sources. All of these failures can be managed with enough financial resources.
While devices have become significantly more reliable in protecting against device and component failures, valuable data is now exposed to even higher risks from destructive worms, viruses, and spam as the wave of hackers and terrorists worldwide gains momentum. Recovery from an intrusion is difficult, and its impact is destructive: permanent data loss frequently results unless special, and often complex, procedures are implemented. The looming threat to delivering high data availability is now the "intrusion factor," and storage security has become the newest storage management discipline. In reality, there is no silver bullet yet for implementing a bulletproof and secure IT infrastructure.
Planned downtime is unpleasant but often occurs as a business choice. The most common causes of planned downtime are maintenance, hardware and software upgrades, and database backup; these are presently unavoidable, but many non-disruptive capabilities are in development. The downtime required for database backup is the most challenging, as it requires the database to completely stop service or be placed in read-only mode.
The Value of Uptime
Businesses often calculate availability indexes for key applications in terms of "the number of nines." The estimated average costs of system failures would be nearly fatal to some companies and can reach nearly $3 million per hour of downtime in certain industries. A server that is 99% available may seem highly available but will actually be unavailable more than 5,000 minutes per year! Availability figures range from approximately 99% for Windows-based servers to 99.999+% (or "five nines") for enterprise z/Series mainframe servers. Revenues lost per hour of outage reflect the criticality of the IT function to a particular business. Higher availability systems normally cost more but are often justified by the reduction in revenue lost when an outage occurs.
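The "number of nines" maps directly to minutes of downtime per year. A minimal calculation (the function name is invented for illustration):

```python
def downtime_minutes_per_year(availability_pct):
    """Minutes of unavailability per year for a given availability percentage."""
    return (1 - availability_pct / 100) * 365 * 24 * 60

# 99% ("two nines")      -> about 5,256 minutes/year (over 3.5 days)
# 99.999% ("five nines") -> about 5.3 minutes/year
```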
The number of minutes per year of unavailability or availability index is a good starting point to measure availability--but it can be misleading. Beyond the number of 9s, a new set of metrics has emerged defining the impact of lost availability on the level of service delivered. QoS, or Quality of Service, is a better way to look at the type of service being delivered when a failure actually occurs. QoS takes the availability percentages to the next level and begins to add meaning to the real impact of an outage. A variety of data recovery architectures provide increasing levels of availability at a corresponding higher cost. The path to the "high 9s" describes new computing architectures that will ultimately implement advanced self-healing capabilities using embedded nanotechnology components.
Data Protection Options
Backup/restore is the most traditional disaster recovery method, moving data, usually a complete file or full volume, from primary disk to either disk or tape. The backup copy is not executable and must be restored to become accessible. In most cases, traditional backup impacts, or even stops, the application being backed up. Tradeoffs exist when choosing an effective backup strategy.
Backing up full disk volumes or files can become very time consuming and may be difficult to schedule. In addition to full backups, incremental and differential backups represent further options.
A differential backup copies all data that has changed since the last full backup. Because each differential includes everything the previous differential did plus any subsequent changes, differentials grow in size between full backups. Daily backups get gradually larger, but restore time is minimized compared to incremental backups: a full restore requires only the last full backup and the last differential copy.
For incremental backups, only the data that has changed since the last incremental backup is copied. This minimizes the amount of data backed up and therefore shortens the backup window, the key difference from differential backup. A full restore takes longer, and is generally a more complex process, as each incremental backup must be restored in turn to bring all files to their last known state. Often a full backup is performed weekly, while an incremental backup is performed daily. In short, incremental backup minimizes the backup time, differential backup minimizes the restore time, and the specific application may require one or the other.
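The restore-chain difference between the two strategies can be sketched as follows; `restore_chain` is a hypothetical helper that returns the indices (oldest first) of the backups needed for a full restore, given a chronological list of backup types.

```python
def restore_chain(backups, strategy):
    """Indices of backups needed for a full restore to the latest state.

    backups  -- chronological list of 'full', 'incr', or 'diff' entries
    strategy -- 'incremental' or 'differential'
    """
    # Every restore starts from the most recent full backup.
    last_full = max(i for i, b in enumerate(backups) if b == "full")
    if strategy == "differential":
        # Last full plus only the most recent differential (if any).
        last = len(backups) - 1
        return [last_full] + ([last] if last > last_full else [])
    # Incremental: last full plus every incremental after it, in order.
    return list(range(last_full, len(backups)))
```

With a weekly full and daily copies, an incremental restore replays every daily copy since the full, while a differential restore needs just two pieces, matching the tradeoff described above.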
Mirroring is implemented as a block-for-block replica of a file, a logical unit, or a physical disk volume normally using disks for all copies. Once the mirrored data element is established by copying the original data element, the mirror is maintained by replicating all write operations in two (or more) places creating identical copies. Mirroring eliminates the backup window but doubles the amount of disk storage required adding expense. Storage administrators must choose to implement asynchronous or synchronous mirroring and tradeoffs exist for each case.
Synchronous mirroring is frequently used in z/OS (mainframe) environments, given the critical nature of its applications. In synchronous mirroring, both the source and the target devices must acknowledge the write is completed before the next write can occur. This degrades application performance but keeps the mirrored elements synchronized as true mirror images of each other. For asynchronous mirroring, the source and target devices do not have to synchronize their writes and the second and subsequent writes occur independently. Therefore, asynchronous mirroring is faster than synchronous mirroring but the secondary copies are slightly out-of-sync with the primary copy. This is sometimes referred to as a fuzzy copy. Asynchronous mirroring is often used with an IP storage protocol to replicate data to locations hundreds of miles away. In reality, the secondary data element is usually no more than one minute behind or out-of-sync with the primary copy. This can be a significant exposure for write-intensive applications.
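A toy model of the two mirroring modes, using an invented `Mirror` class: the synchronous path completes the secondary write before `write()` returns, while the asynchronous path queues it, so the secondary copy can lag behind (the "fuzzy copy").

```python
import queue

class Mirror:
    """Toy model of synchronous vs. asynchronous mirroring."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []                 # source device contents
        self.secondary = []               # target (mirror) device contents
        self.pending = queue.Queue()      # writes not yet replicated (async only)

    def write(self, block):
        self.primary.append(block)
        if self.synchronous:
            # Target must acknowledge before the write completes.
            self.secondary.append(block)
        else:
            # Return immediately; replicate later, so the copy may be fuzzy.
            self.pending.put(block)

    def drain(self):
        """Replicate queued writes, bringing the target back in sync."""
        while not self.pending.empty():
            self.secondary.append(self.pending.get())
```

The model shows why synchronous mirroring degrades write latency (every write waits on the target) while asynchronous mirroring leaves a window where the secondary is out of sync.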
Mirroring is used for many mission-critical applications and is the fastest way to recover data from a device or subsystem failure, since a restore can occur in no more than a few seconds by switching to a mirrored copy. Mirroring does not protect against data corruption (hacker, worm, virus, intrusion, human or software error), as it simply produces two or more copies of the corrupted data. As a best practice, mirroring should always be accompanied by point-in-time copies so that a restore can occur from clean data that existed before the corruption. Mirroring is commonly referred to as RAID-1.
Snapshot copy presents a consistent point-in-time view of changing data. There are many variations of snapshot copy. When using snapshot copy and write operations occur, the changed areas (writes) are saved in a separate area or partition of disk storage specifically reserved for snapshot activity. Here, the old value of the affected area or block can be saved in case the new block(s) are corrupted, or to permit a fuzzy data image that can be used for a non-disruptive backup. Snapshots provide data protection from intrusion and data corruption but not from a device failure.
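The copy-on-write behavior described above can be sketched as follows; the `Snapshot` class and its block-to-value dictionary model are invented for illustration.

```python
class Snapshot:
    """Toy copy-on-write snapshot: before a block is overwritten, its old
    value is saved to a reserved area, preserving the point-in-time view."""

    def __init__(self, volume):
        self.volume = volume   # live, changing data: {block_number: value}
        self.saved = {}        # reserved snapshot area: original values only

    def write(self, block, value):
        # Copy-on-write: save the old data the first time a block changes.
        if block in self.volume and block not in self.saved:
            self.saved[block] = self.volume[block]
        self.volume[block] = value

    def read_snapshot(self, block):
        # Snapshot view: saved old value if the block changed, else current.
        return self.saved.get(block, self.volume.get(block))
```

Note that the reserved area only grows with the amount of changed data, which is why snapshots are far cheaper than a full mirror, and why they cannot survive the failure of the underlying device.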
PIT (Point-In-Time) copy provides an executable image of data at a specific point in time. Like a series of still images, PIT copies are complete data images taken at specified points in time. PIT copies enable an administrator to go back in time and restore data from a stable state prior to a corruption or other disruption. This is the most complete method of protection from human errors, software problems, viruses and intrusions, and data corruption, and it should accompany any mirroring implementation.
Again, tradeoffs exist. The more frequently the PIT copy is taken, the more storage is required and the more time it takes to determine which copy is the correct one to restore from.
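Choosing which copy to restore from amounts to finding the most recent PIT copy taken before the corruption; `best_pit_copy` below is a hypothetical helper operating on copy timestamps.

```python
def best_pit_copy(copy_times, corruption_time):
    """Return the timestamp of the most recent point-in-time copy taken
    before the corruption, or None if no clean copy exists."""
    clean = [t for t in copy_times if t < corruption_time]
    return max(clean) if clean else None
```

The tradeoff in the text is visible here: more frequent copies shrink the gap between `corruption_time` and the chosen copy (less data lost), at the cost of more storage and more candidates to sift through.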
Journaling is another method to enable data recovery, in which every write and update operation is continuously written to another device that may or may not be the same as the primary device. Unlike mirroring, however, the secondary copy is a sequential history of write events. All write operations are queued to the secondary device, or the journal device, which may be disk or tape. Journals are typically kept as a continuous history for 2-4 days, covering the period of maximum likelihood for a data recovery action to occur. Journals are especially good for protecting against intrusion and data corruption, enabling restores to go back in time to a point before the corruption occurred.
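Recovering from a journal means replaying the write history onto a base image up to a chosen point in time; `replay_journal` below is an illustrative sketch, assuming a time-ordered list of (timestamp, block, value) entries.

```python
def replay_journal(base, journal, until):
    """Rebuild a volume by replaying journaled writes up to and including
    timestamp `until` -- i.e., restore to a point before a corruption event.

    base    -- dict of {block_number: value} at the start of the journal
    journal -- time-ordered list of (timestamp, block_number, value) writes
    """
    volume = dict(base)                    # don't disturb the base image
    for ts, block, value in journal:
        if ts > until:
            break                          # stop before the corrupting writes
        volume[block] = value
    return volume
```

Because the journal is a sequential history rather than a mirror, any point within the retained 2-4 day window can be reconstructed, which is exactly what makes journaling effective against corruption.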
Data protection is a critical IT discipline, yet businesses often choose the simplest approach after sustaining three years of downsizing and cutbacks. The simplest approach, however, may not provide the highest availability, and severe business impact may result. Today's IT environments demand a more comprehensive strategy for data protection, security, and high availability than ever before. Replication options must match each application's specific business requirements in order to yield the highest probability of success. Data protection solutions are now available to deliver ultra-high availability, increasing the probability that a business will survive almost all types of outages. This is critical, since most businesses will not survive without IT. Because no infrastructure is immune to machine and human imperfections such as intrusions, mistakes, accidents, and cyber-terrorism, paying the price to implement data protection is not optional.
Title Annotation: Data Protection; Information Lifecycle Management
Publication: Computer Technology Review
Date: Nov 1, 2004