Rapid restores from data disasters.
This article will examine the most frequent causes of data disasters--hardware failure, admin/user error, application corruption, and site disasters. It will also describe an architecture that is equally effective at protecting transactional applications--such as databases and mail--and ordinary files, using Zetta Server, a 64-bit operating system that can be installed on standard Intel servers.
Appliances based on Zetta Server provide advanced data protection, high availability, and rapid restores for Windows, Unix, and Mac OS X platforms--all managed from a single unified Web interface. The secret sauce is a combination of redundant hardware, unified file/block technology, sophisticated application-aware (e.g., Exchange, SQL Server) snapshot technology, and advanced replication that lets you replicate both file and block data to a remote facility.
There are four major causes of data disasters:
* Hardware failure
* Admin/user error
* Application corruption
* Site disasters
Hardware Failure: Hardware failures can be mitigated by using redundant power supplies, RAID, mirrored memory, and an Active/Passive failover system.
Admin/User Error: Administrator and user errors can arise in many ways, and the traditional backup process does little to catch them: restores are largely untested, so errors often surface only during an emergency, and there is no easy way to verify that backups have executed successfully.
Application Corruption: Applications such as Microsoft Exchange, SQL Server, and Oracle sometimes corrupt data. Almost every system or database administrator with any experience has spent hours wrestling with a corrupt database or a corrupt information store. This is difficult to deal with because the backups themselves may be corrupt, and there is often no recourse other than extracting backup after backup from tape until the application recognizes a consistent one. Application corruption can also cause significant data loss.
Site Disasters: Site disasters can be prompted by earthquakes, fires, floods, terrorism, or something as simple as the failure of air conditioning in the server room. A site disaster is often catastrophic: fewer than 40% of companies recover successfully from one. Even companies that use offsite backup must contend with data loss, because offsite backups are typically over a week old. Recreating servers from backups is a monumental task, especially when databases are involved.
Data Protection: It's All About Rapid Restores
Traditionally, backup has been one of the chores relegated to the junior system administrator. Most companies religiously back up to tape or tape libraries--a full backup once a week and incremental backups every day. In the event of a disaster (especially with data continuously generated by applications such as Exchange, SQL Server, and Oracle) current backup techniques are woefully inadequate for the following reasons.
First, a backup restored from tape is likely to be at least 24 hours old. Second, it is quite likely that the most recent backup is not usable. Third, it is not possible to determine whether Exchange data or databases are corrupt until the backup is restored and working correctly with the application. Most often, the application server is down for the better part of a day while backup administrators figure out which backup to use. After this painful process, the backup that is finally restored is often a week old.
There are a number of companies touting disk-based backup as the panacea for these problems. Disk-based backups are great for decreasing the time to back up vast amounts of data, but have the same problems associated with backing up to tape, because it is very difficult to back up application-generated data continuously to disk.
Point-in-Time Copy or Snapshot Technology
Snapshot technologies allow companies to create point-in-time copies or online backups as the system is running. In the event of data corruption or data loss, users can roll the state back to the last snapshot or the last uncorrupted snapshot and restore. According to Randy Kerns, an analyst with the Evaluator Group, "With snapshots, you do not have to take your application out of service."
Most users are used to traditional backup methodologies: Level 0 backups every week and Level 1 backups every day. Restoring from these backups involves first loading a Level 0 backup and then a Level 1 backup. Unlike traditional backups, snapshots have the following important features:
* Each snapshot is complete in itself. It is not necessary to load the original snapshot, and then add incremental snapshots to create a final image.
* Advanced snapshot algorithms are incremental, using copy-on-write technology. Copy-on-write works by intercepting calls that modify files or blocks and making a copy prior to the modification.
The benefits of good copy-on-write algorithms are:
* There is no need to copy the data itself at the point a snapshot is taken; instead, only the pointers to the data are copied. This greatly speeds the snapshot process, which matters because access to the data must be suspended while the snapshot is taken in order to guarantee consistency and prevent corruption.
* Copy-on-write ensures that only changed blocks are copied. Consequently, data is not unnecessarily duplicated.
* If a virus were to attack a system, it would automatically trigger a copy-on-write on any files it corrupts. Consequently, it is easy for a sysadmin to roll back to a previous snapshot taken before the corruption.
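The copy-on-write mechanism described above can be sketched in a few lines. This is a hypothetical toy model, not Zetta's actual code: a write to a block first preserves the old contents for any snapshot that has not yet captured that block, so snapshots keep seeing their original data while unchanged blocks are shared rather than duplicated.

```python
# Toy block-level copy-on-write volume (illustrative only; all names here
# are invented for this sketch, not part of any real product's API).

class CowVolume:
    def __init__(self, nblocks, block_size=8):
        self.blocks = [bytes(block_size) for _ in range(nblocks)]
        self.snapshots = []          # each snapshot: {block_index: old_bytes}

    def snapshot(self):
        self.snapshots.append({})    # empty map: nothing copied yet
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Copy-on-write: preserve the old block for every snapshot that
        # has not yet captured it, *before* overwriting in place.
        for snap in self.snapshots:
            if index not in snap:
                snap[index] = self.blocks[index]
        self.blocks[index] = data

    def read(self, index, snap_id=None):
        if snap_id is not None:
            snap = self.snapshots[snap_id]
            if index in snap:
                return snap[index]   # block changed since snapshot: use copy
        return self.blocks[index]    # unchanged blocks are shared, not copied

vol = CowVolume(nblocks=4)
vol.write(0, b"original")
sid = vol.snapshot()
vol.write(0, b"modified")
print(vol.read(0))                   # active data: b'modified'
print(vol.read(0, snap_id=sid))      # snapshot still sees b'original'
```

Note that taking a snapshot allocates only an empty map; the copying cost is deferred to, and amortized over, subsequent writes.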
Point-in-Time Copy Technology
Zetta Systems' point-in-time copy technology, ZSnap, is a significant advancement over current snapshot or point-in-time technologies. These advancements include:
Low snapshot overhead: Most current copy-on-write snapshot technologies are fast for small volumes of data, but the time required for a snapshot grows, often linearly, with data size. Zetta Systems' patented snapshot algorithm is constant-time: regardless of the size of the data, the same number of pointers is copied. It takes less than 100 milliseconds to snapshot a Zetta Server SAN-NAS device, independent of the volume of data, and the disk overhead at snapshot time is a constant 1KB. Lightweight snapshots are key to decreasing the granularity of data protection from days to minutes.
Number of active snapshots: Most current technologies limit the number of snapshots per volume: Microsoft supports 64 snapshots per volume; Network Appliance supports 255. Scaling beyond this is difficult without the right algorithm and file system support. Zetta Server places no practical limit on the number of active snapshots. Unlimited snapshots are another important consideration in obtaining maximum data protection.
Unified block+file snapshots: Storage connected to a Zetta Server can be accessed as both a block (SAN) and a file (NAS) device. Zetta Server's snapshot technology allows users to snapshot both files as well as blocks using the same mechanism. Snapshotting files is very useful for recovering single files, while snapshotting blocks is essential for preserving the integrity of databases and the Exchange server information store.
Rapid Restoration: Many snapshot technologies do not facilitate rapid restores. Restores can take hours with even moderate amounts of data. Zetta Server has been designed so that any snapshot can be swapped with the active file system in less than 5 seconds, facilitating rapid restores.
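The constant-time snapshot and near-instant restore claims can be illustrated with a toy model (hypothetical, not Zetta's implementation): if the file system is reached through a single root pointer and updates build new state rather than mutating old state, then a snapshot need only record that pointer, and a restore is just a pointer swap--both independent of data volume.

```python
# Toy model of O(1) snapshot and O(1) restore via a root pointer.
# Illustrative only; the names are invented, not Zetta Server's API.

class Fs:
    def __init__(self):
        self.root = {}               # immutable view of the namespace
        self.snapshots = {}

    def write(self, name, data):
        # Persistent update: build a new root instead of mutating the old
        # one, so every saved root remains a valid, consistent snapshot.
        new_root = dict(self.root)
        new_root[name] = data
        self.root = new_root

    def snapshot(self, label):
        self.snapshots[label] = self.root    # O(1): record one pointer

    def restore(self, label):
        self.root = self.snapshots[label]    # O(1): swap the active root

fs = Fs()
fs.write("db.mdf", "consistent")
fs.snapshot("nightly")
fs.write("db.mdf", "corrupt")
fs.restore("nightly")
print(fs.root["db.mdf"])            # back to "consistent"
```

In a real file system the root would reference a tree of copy-on-write blocks rather than a Python dict, but the cost structure is the same: capturing or swapping a snapshot touches a constant amount of metadata.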
What is Zetta Server?
Zetta Server includes a number of patent pending technologies resident on an enhanced version of the FreeBSD operating system, including:
ZFS: A high-performance, adaptive caching file system that is core to implementing ZSnap, a powerful, lightweight snapshot technology that works uniformly with blocks as well as files.
Zunified: A unified driver that implements block and file access to a common pool of storage. Users can access Zetta Server-based appliances as network attached storage; in this mode, Zetta Server supports both Windows and Unix clients, with integration into Active Directory and NIS. Zetta Server-based appliances can also be accessed through a Fibre Channel connection, which presents a block SCSI interface to clients. Multiple servers can be connected to a Zetta Server via a Fibre Channel switch.
ZDitto: Replication technology that replicates block as well as file data to a remote off-site server. ZDitto is ideal for recovery from site disasters.
The Figure (on page 28) shows an ideal disaster-proof implementation using Zetta Server appliances. Each Zetta appliance is configured with one gigabit Ethernet card and one QLogic Fibre Channel HBA. The Zetta appliances are connected to each other over gigabit Ethernet, which serves as a heartbeat for sensing whether the paired Zetta Server is still active. Both Zetta Servers are connected to a high-end storage array either through a separate Fibre Channel connection or through SCSI.
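The active/passive pairing relies on the passive node detecting that its peer has gone silent. A minimal sketch of that heartbeat logic follows; it is hypothetical and heavily simplified--production failover also involves fencing, shared-storage arbitration, and IP takeover, none of which is shown here.

```python
# Hypothetical heartbeat monitor for an active/passive pair (not Zetta's
# actual protocol). The passive node promotes itself to active if it
# misses several consecutive heartbeats from its peer.

import time

class PassiveNode:
    def __init__(self, interval=1.0, missed_limit=3):
        self.interval = interval          # expected heartbeat period (s)
        self.missed_limit = missed_limit  # tolerated consecutive misses
        self.last_beat = time.monotonic()
        self.active = False

    def on_heartbeat(self):
        # Called whenever a heartbeat arrives over the dedicated link.
        self.last_beat = time.monotonic()

    def check(self):
        # Promote to active if the peer has been silent too long.
        silent = time.monotonic() - self.last_beat
        if not self.active and silent > self.interval * self.missed_limit:
            self.active = True            # take over serving storage
        return self.active

node = PassiveNode(interval=0.01, missed_limit=3)
node.on_heartbeat()
print(node.check())                       # peer healthy: stays passive
time.sleep(0.05)                          # simulate peer silence
print(node.check())                       # missed heartbeats: fails over
```

The dedicated gigabit link in the figure exists precisely so this liveness signal is not disturbed by client traffic.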
Data is replicated to a remote site, to a pair of Zetta Servers. In the event of a site disaster, data can be recovered quickly from the remote server.
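Snapshot-based replication of this kind typically ships only what changed between two consecutive snapshots. The sketch below is a hypothetical illustration of that idea, not ZDitto's actual protocol: given the last snapshot the remote site holds and the current one, send only the differing blocks.

```python
# Hypothetical incremental replication between two snapshots (not ZDitto's
# wire protocol): diff the snapshots block by block and ship the deltas.

def changed_blocks(prev, curr):
    """Yield (index, data) for blocks that differ between two snapshots."""
    for i, (old, new) in enumerate(zip(prev, curr)):
        if old != new:
            yield i, new

def replicate(remote, prev, curr):
    # 'remote' stands in for the off-site copy; in practice each delta
    # would be sent over the WAN link to the remote Zetta Server pair.
    for i, data in changed_blocks(prev, curr):
        remote[i] = data

prev = [b"aa", b"bb", b"cc"]        # snapshot the remote already holds
curr = [b"aa", b"BB", b"cc"]        # newest local snapshot
remote = list(prev)
replicate(remote, prev, curr)
print(remote)                       # [b'aa', b'BB', b'cc']
```

Because both endpoints work from stable snapshots rather than the live volume, the remote copy is always a consistent point-in-time image, which is what makes recovery after a site disaster straightforward.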
Zetta Server-based appliances protect data from the primary causes of data loss: disk failures through RAID, hardware failures through automated failover, administrator errors and application corruption through highly granular snapshotting, and site failures through file-plus-block replication. Zetta Server is a cost-effective approach to creating disaster-resistant file and block network storage.
Dr. Ganapathy Krishnan is founder and CEO of Zetta Systems (Woodinville, WA).
Publication: Computer Technology Review, February 1, 2004