
Beyond the data protection dilemma.

Disk-based data protection has recently emerged as an important element of any data center infrastructure. Its benefits--reduced backup windows, accelerated restore times, improved backup success rates and improved efficiency in data protection environments--have all been widely noted. While the use of disk as a backup storage medium brings great benefits, the sheer volume of primary data growth in the enterprise creates extraordinary disk capacity requirements in the data protection infrastructure.

Disk continues to be the most expensive element of the infrastructure, and for most companies demand for data protection capacity will outpace the storage budget. Because of this, companies have not been able to optimize their use of disk for data protection. Typically, relatively small-capacity disk systems are deployed as temporary caches before data is offloaded to tape. In such deployments, tape is still heavily relied upon for backup and recovery, which is a growing problem.

The essence of the problem is simply this: reliable recovery is compromised to the degree that data is stored on tape media that strays beyond systems' control. As we all know, in the course of manual de-staging and offsite storage, tape media moves beyond systems' control. In short, data that is stored on tape is compromised in terms of both data protection and data availability.


This fact, combined with the current perception of the economics of disk storage, has caused many to declare that disk-based backup and recovery (sometimes called "D2D" backup) is only a temporary fix for a growing problem. The cost of disk on one hand, and the risks and limitations associated with tape on the other, represent the first dilemma faced by IT shops. There is consensus that a long-term solution is required.

The growth of primary storage in the typical enterprise is a well-understood phenomenon and any long-term solution to data backup problems must address this state of affairs. A recent study (by META Group) suggests that primary storage capacity is growing at 90% per year. While high, this growth is manageable in primary storage, and simply requires predictable purchases of storage per year.

The problem is the exponential growth in the data protection environment that is spawned by the primary storage growth. 1 TB of primary storage can require 26 TB of tape capacity over its lifecycle when following a standard grandfather-father-son (GFS) operation, according to Michael Peterson of Strategic Research Corp. To utilize disk for this purpose is typically not reasonable.
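To see how retention multiplies capacity, consider a minimal sketch of a GFS rotation at steady state. The retention counts used here (6 daily "sons," 4 weekly "fathers," 12 monthly "grandfathers") and the function name are illustrative assumptions, not the schedule behind the 26 TB figure cited above; the point is only that every retained copy adds to the total.

```python
def gfs_retained_tb(primary_tb, sons_kept, fathers_kept, grandfathers_kept,
                    son_fraction=1.0):
    """Capacity held at steady state under a simple, assumed GFS rotation.

    primary_tb        -- size of the primary data set, in TB
    sons_kept         -- number of daily backups retained
    fathers_kept      -- number of weekly full backups retained
    grandfathers_kept -- number of monthly full backups retained
    son_fraction      -- size of each daily backup relative to a full
                         (1.0 = daily fulls, 0.1 = incrementals at 10%)
    """
    sons = sons_kept * primary_tb * son_fraction
    fathers = fathers_kept * primary_tb
    grandfathers = grandfathers_kept * primary_tb
    return sons + fathers + grandfathers


# Assumed schedule: 6 daily fulls, 4 weekly fulls, 12 monthly fulls of 1 TB.
print(gfs_retained_tb(1.0, 6, 4, 12))
```

With these assumed numbers, 1 TB of primary data already holds 22 TB of backup capacity; longer retention or additional offsite copies push the multiple higher.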

In parallel to the need for capacity is the growing need for increased levels of data availability. Extended downtimes are no longer tolerable, and in many environments even large data restores of several hundred gigabytes are expected to be completed within an hour. Beyond typical file or volume recoveries is the growing requirement for accessing older data quickly. Reasons include historical data analysis, regulatory compliance, legal discovery and traditional archiving. The result of all of these drivers is a growing need to keep ever more backup data on disk.

The traditional infrastructure cannot reasonably--or economically--scale to meet these growing requirements. The net result of this confluence of problems is a forced tradeoff in data protection. IT shops must choose between tape-centric backup and recovery, which offers lower levels of data protection, characterized by slower and unreliable restores and data that strays beyond systems' control, or disk-centric data protection which offers great benefits in data protection and availability, but comes at a steep price that will only accelerate with data growth. Many argue that the industry has solved this problem with commonality factoring solutions.

Optimizing capacity solutions

Within the last two years, several products that provide commonality factoring, which requires neither more tape nor more disk, have entered the market. The aim of such solutions is to "filter out" duplicated data from the backup environment, thus allowing customers to store more data on a given disk-storage system and improving the economics of disk as a backup medium.

These solutions typically use hashing algorithms, such as MD5 or SHA-1. These algorithms work by assigning a value--or a hash--to unique data chunks, and maintaining a database (or index) of hash-to-chunk assignments. When new data is received, the system creates a hash for the new chunk and compares it against the existing hash database. Whenever a seemingly exact matching chunk is found, the system does not record it again, but rather records a small pointer to the matching chunk that is already housed in the system.
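The hash-and-pointer scheme just described can be sketched in a few lines. This is a generic illustration rather than any vendor's implementation: the class name, the fixed-size chunking and the in-memory dictionary standing in for the hash database are all assumptions made for brevity.

```python
import hashlib


class HashDedupStore:
    """Toy hash-based de-duplication store (illustrative only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed-size chunking
        self.chunks = {}              # hash -> chunk bytes (the "index")

    def write(self, data):
        """Store data; return the list of hashes (pointers) that rebuilds it."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()
            # A matching hash is trusted to mean a matching chunk, so the
            # chunk is stored only the first time its hash is seen.
            if digest not in self.chunks:
                self.chunks[digest] = chunk
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original data by following the stored pointers."""
        return b"".join(self.chunks[digest] for digest in recipe)

    def stored_bytes(self):
        """Physical bytes actually held, after duplicates are filtered out."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = HashDedupStore(chunk_size=4)
recipe = store.write(b"AAAABBBBAAAA")   # the third chunk repeats the first
print(len(recipe), store.stored_bytes())
```

Note that the store never compares chunk contents, only hashes; that design choice is what makes the approach fast and is also the source of the integrity concern discussed next.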

These products are interesting in that they have applied an existing 10-year-old technology to a new problem. These technologies are appropriately deployed in Content Addressable Storage (CAS) subsystems, when the purpose is that of an immutable storage archive. However, when applied to the challenge of de-duplication, the hashing algorithm approach falls short on several fronts. Hash-based solutions suffer from an inherent risk that the same hash might be assigned to two data chunks that are not in fact identical. While improbable, a hash collision is a real statistical risk. A hash collision means lost data. To be clear, hash algorithms are well-suited for immutable CAS archive systems, but offer modest value for data-coalescence systems due to the associated compromise in data integrity. While painful tradeoffs also exist in performance, scalability, and achieved factoring ratio, by far the greatest toll is the risk to data integrity. IT shops can at least theoretically spend more money to address performance and capacity, to the economic detriment of the solution; however, no amount of money can mitigate the risk of a hash collision or recover data lost in a real collision event.
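The statistical risk can be put in rough numbers with the standard birthday approximation. The sketch below assumes an ideal hash whose outputs are uniformly distributed and ignores deliberately constructed collisions; the function name and the example chunk count are illustrative assumptions, not figures from any product.

```python
import math


def collision_probability(num_chunks, hash_bits):
    """Approximate chance that at least two distinct chunks share a hash.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2**bits)),
    which assumes an ideal, uniformly distributed hash function.
    """
    pairs = num_chunks * (num_chunks - 1) / 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


# Assumed repository of 2**40 unique chunks, for illustration only.
for name, bits in (("MD5", 128), ("SHA-1", 160)):
    print(name, collision_probability(2 ** 40, bits))
```

The probability is small but never zero, and it grows roughly with the square of the number of chunks stored.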

Another commonality factoring solution available to the data center, ProtecTIER by Diligent Technologies, addresses the issue of ever-increasing disk capacity needs.

ProtecTIER is a software data protection platform running on a standard enterprise server that connects to any standard Fibre Channel-attached disk array. These three elements--the server, the ProtecTIER software, and the disk array--combine to become a high-capacity, high-performance and highly scalable target for backup and archive data.

At its core is a technology that is based on a series of algorithms that identify and filter out the elements of a data stream that have previously been stored. Data matches are located without any I/O to the disk, utilizing a very efficient RAM-based index, which enables a high data throughput rate. The index maps 1 Petabyte (PB) of physical storage utilizing just 4 GB of RAM on the hosting server, representing a 250,000:1 ratio of storage to memory.
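The 250,000:1 figure follows directly from the two capacities when decimal units are used, as a quick check shows. The function name is a hypothetical helper; the binary-unit variant is included only for comparison.

```python
def storage_to_memory_ratio(storage_bytes, ram_bytes):
    """How many bytes of physical storage each byte of index RAM maps."""
    return storage_bytes / ram_bytes


# Decimal units: 1 PB = 10**15 bytes, 1 GB = 10**9 bytes.
print(storage_to_memory_ratio(10 ** 15, 4 * 10 ** 9))

# Binary units for comparison: 1 PiB = 2**50 bytes, 1 GiB = 2**30 bytes.
print(storage_to_memory_ratio(2 ** 50, 4 * 2 ** 30))
```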

When a data stream is sent to the ProtecTIER platform, the HyperFactor commonality factoring algorithm scans all the data in the stream and uses the memory-resident index to filter out the identical items at a very fine level of granularity. Only the new data elements are stored, along with pointers to all of the existing matched data elements that were already in the repository.

HyperFactor saves space by taking advantage of the fact that only a small percentage of data actually changes between successive data streams (for example, between two backups of the same policy). The amount of space saved is a function of many factors, but mostly of the backup policies, the retention periods and the variance of the data between backups. The more full backups retained on ProtecTIER, and the more intervening incremental backups, the more space is saved overall, resulting in increased economic value.
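The dependence on retention and change rate can be illustrated with a deliberately simple model. It assumes that only full backups are retained, that each full differs from the previous one by a fixed fraction, and that no compression is applied; the function and parameter names are hypothetical, and the model is not a description of HyperFactor's measured behavior.

```python
def factoring_ratio(full_backups_retained, change_fraction):
    """Nominal-to-physical capacity ratio under a simple, assumed model.

    The first full backup is stored whole; every later full adds only the
    fraction of data that changed since the previous one.
    """
    nominal = full_backups_retained                       # in units of one full
    physical = 1 + (full_backups_retained - 1) * change_fraction
    return nominal / physical


# Assumed 5% change between fulls, with growing retention.
for retained in (1, 5, 20):
    print(retained, round(factoring_ratio(retained, 0.05), 2))
```

Under this model the ratio rises as more full backups are retained and falls as the change rate between them increases, which matches the qualitative relationship described above.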

Key elements to consider when selecting data capacity optimizing software

When evaluating disk-based data protection solutions, here are a few questions you should ask the vendors on your short list:

1. Do they provide Redundancy Elimination? -- Does the technology filter out any duplicated incoming data, such that each unique data element is only stored once by the system?

2. Can this be implemented simply and non-disruptively? -- Does the technology enable smooth deployment, such that existing operations (such as your existing policies, practices and procedures) continue uninterrupted as the new technology is introduced?

3. Will this work with any vendor's hardware? -- Is it non-reliant on specialized hardware, and can it leverage open hardware standards?

4. Will it provide enterprise-class performance? -- Does the solution perform the factoring function at high data throughput speeds, such that the overall solution meets the performance requirements of the high end data center?

5. How will this solution scale in my current and future environment? -- Is the solution "enterprise scale," allowing for the storage and management of many petabytes (not just terabytes) of storage?

6. Will it provide me 100% data integrity? -- Does the solution have no risk--however slight--of corrupting data based on false data matches at the hash reference level?


Today's IT shops and data center managers can avoid the typical painful tradeoffs and the dilemmas associated with burgeoning data growth and backup requirements. More than this, an industry-leading factoring solution solves the dilemma with zero risk and in a non-disruptive manner. Diligent Technologies' ProtecTIER also addresses the need to scale, massively if needed, and thus future-proofs enterprises for their next stage of planned or unplanned data growth. Disk storage has historically been the most expensive resource in the data protection solution. By reducing the capacity needs, ProtecTIER lowers the acquisition cost of a disk-based solution to below that of an equivalent tape-based system across a wide array of typical usage scenarios.

Neville Yates is chief technology officer at Diligent Technologies Corp., Framingham, MA.
COPYRIGHT 2005 West World Productions, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2005, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Title Annotation:Disaster Recovery & Backup/Restore
Author:Yates, Neville
Publication:Computer Technology Review
Date:Aug 1, 2005