Beyond the data protection dilemma.

Disk-based data protection has recently emerged as an important element of any data center infrastructure. The benefits (reduced backup windows, accelerated restore times, improved backup success rates and improved efficiencies in data protection environments) have all been widely noted. While the use of disk as a backup storage medium brings great benefits, the sheer volume of primary data growth in the enterprise causes extraordinary disk capacity requirements in the data protection infrastructure. Disk continues to be the most expensive element of the infrastructure, and for most companies demand for data protection capacity will outpace the storage budget. Because of this, companies have not been able to optimize their use of disk for data protection.

Typically, relatively small capacity disk systems are deployed as temporary caches before data is offloaded to tape. In such deployments, tape is still heavily relied upon for backup and recovery, which is a growing problem. The essence of the problem is simply this: reliable recovery is compromised to the degree that data stored on tape media strays beyond systems' control. As we all know, in the course of manual de-staging and offsite storage, tape media moves beyond systems' control. In short, data that is stored on tape is compromised in terms of the level of data protection and data availability.
[FIGURE 1 OMITTED]

This fact, combined with the current perception of the economics of disk storage, has caused many to declare that disk-based backup and recovery (sometimes called "D2D" backup) is only a temporary fix for a growing problem. The cost of disk on one hand, and the risks and limitations associated with tape on the other, represent the first dilemma faced by IT shops. There is consensus that a long-term solution is required.

The growth of primary storage in the typical enterprise is a well-understood phenomenon, and any long-term solution to data backup problems must address this state of affairs. A recent study (by META Group) suggests that primary storage capacity is growing at 90% per year. While high, this growth is manageable in primary storage, and simply requires predictable purchases of storage per year. The problem is the exponential growth in the data protection environment that is spawned by the primary storage growth. 1 TB of primary storage can require 26 TB of tape capacity over its lifecycle when following a standard grandfather-father-son (GFS) operation, according to Michael Peterson of Strategic Research Corp. To utilize disk for this purpose is typically not reasonable.
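To make the scaling concern concrete, the following sketch projects primary capacity at the 90% annual growth rate cited above and applies the 26:1 protection-to-primary ratio to each year's capacity. The 10 TB starting point and the five-year horizon are hypothetical inputs, and applying the lifecycle ratio uniformly to every year is a simplifying assumption rather than something either cited source states.

```python
# Illustrative arithmetic only: the starting capacity and horizon are made-up inputs.
GROWTH_RATE = 0.90        # 90% per year, the figure attributed to META Group
PROTECTION_RATIO = 26     # TB of backup capacity per TB of primary (GFS lifecycle figure)


def project_capacity(start_tb: float, years: int) -> list[tuple[int, float, float]]:
    """Return (year, primary_tb, protection_tb) for each year from 0 to `years`."""
    rows = []
    primary = start_tb
    for year in range(years + 1):
        rows.append((year, primary, primary * PROTECTION_RATIO))
        primary *= 1 + GROWTH_RATE   # compound the annual growth
    return rows


if __name__ == "__main__":
    for year, primary, protection in project_capacity(start_tb=10.0, years=5):
        print(f"year {year}: primary {primary:8.1f} TB, protection {protection:9.1f} TB")
```

Even with these rough assumptions, the protection tier grows by the same factor as primary storage each year while starting from a base 26 times larger, which is why holding all of it on disk is hard to budget for.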
In parallel to the need for capacity is the growing need for increased levels of data availability. Extended downtimes are no longer tolerable, and in many environments even large data restores of several hundred gigabytes are expected to be completed within an hour. Beyond typical file or volume recoveries is the growing requirement for accessing older data quickly. Reasons include historical data analysis, regulatory compliance, legal discovery, and traditional archiving. The result of all of these drivers is a growing need to keep an increasing amount of backup data on disk. The traditional infrastructure cannot reasonably--or economically--scale to meet these growing requirements.

The net result of this confluence of problems is a forced tradeoff in data protection. IT shops must choose between tape-centric backup and recovery, which offers lower levels of data protection, characterized by slower and unreliable restores and data that strays beyond systems' control, or disk-centric data protection, which offers great benefits in data protection and availability but comes at a steep price that will only accelerate with data growth. Many argue that the industry has solved this problem with commonality factoring solutions.

Optimizing capacity solutions

Within the last two years, several products that provide commonality factoring, which requires neither more tape nor more disk, have entered the market.
The aim of such solutions is to "filter out" duplicated data from the backup environment, thus allowing customers to store more data on a given disk storage system and improving the economics of disk as a backup medium. These solutions typically use hashing algorithms, such as MD5 or SHA-1. These algorithms work by assigning a value--or a hash--to unique data chunks, and keeping a database (or index) of hash-to-chunk assignments. When new data is received, the system creates a hash for the new chunk and compares it against the existing hash database. Whenever a seemingly exact matching chunk is found, the system does not record it again, but rather records a small pointer to the matching chunk that is already housed in the system.

These products are interesting in that they have applied an existing 10-year-old technology to a new problem. These technologies are appropriately deployed in Content Addressable Storage (CAS) subsystems, when the purpose is that of an immutable storage archive. However, when applied to the challenge of de-duplication, the hashing algorithm approach falls short on several fronts.
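The hash-based scheme described above can be sketched in a few lines. This is a minimal illustration, assuming fixed-size chunks and an in-memory dictionary as the hash index. The class name HashDedupStore, the 4 KB default chunk size, and the choice of SHA-1 through Python's hashlib are illustrative assumptions and do not describe any particular vendor's product.

```python
import hashlib


class HashDedupStore:
    """Toy chunk store that keeps each distinct chunk once, keyed by its hash."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size   # hypothetical fixed chunk size
        self.chunks = {}               # hash digest -> chunk bytes (the "index")

    def write(self, data: bytes) -> list[str]:
        """Store `data` and return the list of digests (pointers) that describe it."""
        pointers = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()
            # A matching digest is *assumed* to mean identical content;
            # the chunk bytes themselves are never compared.
            if digest not in self.chunks:
                self.chunks[digest] = chunk
            pointers.append(digest)
        return pointers

    def read(self, pointers: list[str]) -> bytes:
        """Rebuild the original data by following the stored pointers."""
        return b"".join(self.chunks[d] for d in pointers)


if __name__ == "__main__":
    store = HashDedupStore(chunk_size=4)
    first = store.write(b"AAAABBBBAAAA")
    second = store.write(b"AAAACCCC")
    print(len(first) + len(second), "chunks written,", len(store.chunks), "stored")
```

Note that `write` trusts a digest match without comparing the underlying bytes, which is the behavior the phrase "seemingly exact matching chunk" refers to.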
Hash-based solutions suffer from an inherent risk that the same hash might be assigned to two data chunks that are not in fact identical. While improbable, a hash collision is a real statistical risk. A hash collision means lost data. To be clear, hash algorithms are well-suited for immutable CAS archive systems, but offer modest value for data-coalescence systems due to the associated compromise in data integrity. While painful tradeoffs also exist in performance, scalability, and achieved factoring ratio, by far the greatest toll is the risk to data integrity. IT shops can at least theoretically spend more money to address performance and capacity, to the economic detriment of the solution; however, no amount of money can mitigate the risk of a hash collision or recover data lost in a real collision event.

Another commonality factoring solution available to the data center, ProtecTIER by Diligent Technologies, addresses the issue of ever-increasing disk capacity needs. ProtecTIER is a software data protection platform running on a standard enterprise server that connects to any standard Fibre Channel attached disk array. These three elements--the server, the ProtecTIER software, and the disk array--combine to become a high-capacity, high-performance, highly scalable target for backup and archive data. At its core is a technology that is based on a series of algorithms that identify and filter out the elements of a data stream that have previously been stored.
Data matches are located without any I/O to the disk, utilizing a very efficient RAM-based index, which enables a high data throughput rate.
The index maps 1 petabyte (PB) of physical storage utilizing just 4 GB of RAM on the hosting server, representing a 250,000:1 ratio of storage to memory. When a data stream is sent to the ProtecTIER platform, the HyperFactor commonality factoring algorithm scans all the data in the stream and uses the memory-resident index to filter out the identical items at a very fine level of granularity. Only the new data elements are stored, along with pointers to all of the existing matched data elements that were already in the repository.

HyperFactor saves space by taking advantage of the fact that only a small percentage of data actually changes between output data streams (for example, between two backups of the same policy). The amount of space saved is a function of many factors, but mostly of the backup policies, the retention periods and the variance of the data between them. The more full backups retained on ProtecTIER, and the more intervening incremental backups, the more space is saved overall, resulting in increased economic value.

Key elements to consider when selecting data capacity optimizing software

When evaluating disk-based data protection solutions, here are a few questions you should ask your short list of vendors:

1. Do they provide redundancy elimination? -- Does the technology filter out any duplicated incoming data, such that each unique data element is only stored once by the system?
2. Can this be implemented simply and non-disruptively? -- Does the technology enable smooth deployment, such that existing operations (such as your existing policies, practices and procedures) continue uninterrupted as the new technology is introduced?
3. Will this work with any vendor's hardware? -- Is it non-reliant on specialized hardware, and can it leverage open hardware standards?
4. Will it provide enterprise-class performance? -- Does the solution perform the factoring function at high data throughput speeds, such that the overall solution meets the performance requirements of the high-end data center?
5. How will this solution scale in my current and future environment? -- Is the solution "enterprise scale," allowing for the storage and management of many petabytes (not just terabytes) of storage?
6. Will it provide me 100% data integrity? -- Does the solution have no risk--however slight--of corrupting data based on false data matches at the hash reference level?

Conclusion

Today's IT shops and data center managers can avoid the typical painful tradeoffs and the dilemmas associated with burgeoning data growth and backup requirements. More than this, an industry-leading factoring solution solves the dilemma with zero risk and in a non-disruptive manner.
Diligent Technologies' ProtecTIER also addresses the need to scale, massively if needed, and thus future-proofs enterprises for their next stage of planned or unplanned data growth. Disk storage has historically been the most expensive resource in the data protection solution. By reducing capacity needs, ProtecTIER lowers the acquisition cost of a disk-based solution to below that of an equivalent tape-based system across a wide array of typical usage scenarios.

Neville Yates is chief technology officer at Diligent Technologies Corp., Framingham, MA

www.diligent.com