The value of compression for data protection over TCP/IP WANs.


Data compression generally enables devices to transmit or store the same amount of data with fewer bits. The primary objective is to minimize the amount of data to be transmitted or stored. Data compression transforms a string of bytes into a new string containing the same information with a much smaller length. Data transmission and storage cost a nontrivial amount of money. There is a direct correlation between the amount of data transmitted or stored and cost. Compacting the data both in motion and at rest would minimize this cost.

Contrary to popular views, the vast majority of data is neither transmitted nor stored in a compressed form. Data is usually transmitted or stored in the way that makes it easiest for an application to use. Examples of this are ASCII text for e-mail, word processing, spreadsheets, etc., or computer OS executable binary code. Typically, these easy-to-use encoding methods require data files that are 2x to 40x or more larger than required to represent the information. Data compression optimizes data for compactness. Data decompression restores the data to its original form.

There are two principal types of data compression/decompression that address this situation. The first is lossless data compression. Just as it sounds, lossless data compression/decompression means that restored data is identical to the original data. Lossless data compression is used for data in which not even a single bit may change. This is the standard for business data. LZO compression is an example of a portable lossless data compression library written in ANSI C. It provides fast compression and very fast decompression. Decompression requires no additional working memory.

The second type of data compression is called lossy. Lossy data compression means that the restored data may not be completely identical to the original data. Lossy data compression is usually used for data that may have some random noise, where additional noise or losses will not matter. Photos and video are examples of this type of data. JPEG (photos) compression and MPEG (video) compression are examples of lossy data compression. Both provide very fast compression and decompression capabilities. Lossy data compression is not appropriate for business data. Both lossless and lossy compression can be found in software, drivers, firmware, and in some cases even ASICs. Each has its place.

This article will focus on the use and implementation of lossless data compression, specifically when used with data protection applications over TCP/IP WANs.

Data Protection, Bandwidth and TCP/IP

WAN bandwidth costs have been in a steep decline since the Telco bubble burst at the turn of the century. If unemployment numbers in Dallas, RTP, Ontario, Morristown, and San Jose are any indication, Telco has not yet made a dent in its supply of bandwidth and costs continue to decline worldwide. Even with the decreasing costs, bandwidth is one of the largest operating expenses for the IT organization. It is not free.

TCP/IP is the principal WAN protocol of choice for data protection applications. This is because of the continuing myth that TCP/IP makes bandwidth free for data protection applications. Conventional wisdom is that data protection applications usually occur at night or on weekends when the TCP/IP network is sparsely utilized. In this way, it piggybacks on the same WAN links at no additional charge. Hence the perception that it is free. The logic is flawed.

Data protection applications have significantly increased requirements beyond the day-to-day business applications. In some cases, they have been known to overwhelm the IP routers. It is also a false notion that the data protection applications will run only in the "off" hours. Depending on the type of data needing protection, regulations involved, and requirements for recovery, these applications will be running during the prime business day.

The Market Problem: Data Protection Throughput Over TCP/IP

Another common myth is that standard TCP/IP will always meet the unique needs of these applications. Although this is for the most part true, it is not always true. TCP/IP over the WAN was never designed to handle the large amounts of bulk data that a data protection program can and often does generate. And when that TCP/IP WAN has the typical packet loss of approximately 1%, data protection windows for operations such as complete volume replications can be, and are, missed.

Packet loss is a direct result of bit error rate (BER), jitter, network congestion, distance, router buffer overruns, and multiple service providers. One way to mitigate the packet loss problem is through lossless data compression.

TCP Bandwidth Long Haul Problems That Limit Data Protection Throughput

Several characteristics of TCP/IP cause it to perform poorly over high bandwidth and long distances.

Packet Loss: Most TCP/IP WANs are designed around an average packet loss of 1%. This is a relatively low number for standard interactive business traffic. It is a high number for storage data protection applications. Packet loss increases when the bit error rate (BER) is high (10^-10 to 10^-6), when jitter becomes an issue, or when congestion is high. Multiple service providers typically use different network vendors, which increases the probability of bit errors and jitter.

Window Size: Window size is the amount of data allowed to be outstanding (in the air) at any given point in time. The available window size on a given bandwidth pipe is the bandwidth times the round-trip delay or latency. An OC-3 link across the North American continent (approximately 60ms, based on a total 3000-mile round trip) creates an available data window of 155Mbps x 60ms = 1,163 Kbytes. A DS3 satellite connection (540ms round trip) creates an available data window of 45Mbps x 540ms = 3,038 Kbytes.
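The window arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming 1 Kbyte = 1,000 bytes and 1 Mbps = 1,000,000 bits per second; the function name is illustrative and not from the article.

```python
def window_kbytes(bandwidth_mbps: float, rtt_ms: float) -> float:
    """Bandwidth times round-trip delay, expressed in Kbytes (1 Kbyte = 1,000 bytes)."""
    bits_in_flight = bandwidth_mbps * 1_000_000 * (rtt_ms / 1000)
    return bits_in_flight / 8 / 1000


# Link figures taken from the article.
oc3 = window_kbytes(155, 60)    # cross-continent OC-3
ds3 = window_kbytes(45, 540)    # DS3 satellite connection

print(f"OC-3 window: {oc3:,.1f} Kbytes")
print(f"DS3 satellite window: {ds3:,.1f} Kbytes")
```

The results are 1,162.5 and 3,037.5 Kbytes, which the article rounds to 1,163 and 3,038 Kbytes.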

When this is contrasted with standard and even enhanced versions of TCP, there is a very large gap between the available window and the window utilized. Most standard TCP implementations are limited to 65 Kbyte windows. There are a few enhanced TCP versions that may be capable of using up to 512 Kbytes or larger windows. Either case means an incredibly large amount of "dead air" and very inefficient bandwidth utilization. The amount a packet can be compressed is very dependent on the size of the packet. The larger the window size, the larger the packet. The larger the packet, the more it can be compressed.
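The cost of a small window can be estimated the same way. The sketch below assumes a simplified model that is not from the article: the sender transmits at most one full window per round trip, with no loss, no slow start, and no protocol overhead, and 1 Kbyte = 1,000 bytes. It gives a ceiling, not a measured result.

```python
def max_throughput_mbps(window_kbytes: float, rtt_ms: float) -> float:
    """Upper bound on throughput when at most one window is sent per round trip."""
    bits_per_window = window_kbytes * 1000 * 8
    return bits_per_window / (rtt_ms / 1000) / 1_000_000


# Link speeds and round-trip times from the article: (Mbps, ms).
links = {"OC-3, 60ms": (155, 60), "DS3 satellite, 540ms": (45, 540)}

for name, (link_mbps, rtt_ms) in links.items():
    for window in (65, 512):  # standard and enhanced TCP windows, in Kbytes
        ceiling = min(max_throughput_mbps(window, rtt_ms), link_mbps)
        print(f"{name}, {window} Kbyte window: {ceiling:.2f} Mbps "
              f"({ceiling / link_mbps:.0%} of link)")
```

Under these assumptions even the 512 Kbyte window leaves most of either link idle, which is the "dead air" described above.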

Acknowledgement Scheme: TCP causes the entire stream from any lost portion to be retransmitted in its entirety. In high bit-error-rate scenarios this will cause large amounts of bandwidth to be wasted in resending data that has already been successfully received, all with the long latency time of the path. Each retransmission is additionally subjected to the performance penalty issues of "Slow Start".

Slow Start: TCP data transfers start slowly to avoid congestion due to possible large numbers of sessions competing for the bandwidth, and ramp-up to their maximum transfer rate, resulting in poor performance for short sessions.
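The ramp-up can be put in rough numbers. The sketch below assumes a simplified model that is not from the article: the sender starts with one 1,460-byte segment, doubles its window every round trip, and sees no loss. Real TCP stacks differ in initial window and growth details, so treat the output as an illustration only.

```python
import math


def slow_start_round_trips(target_window_bytes: int,
                           mss_bytes: int = 1460,
                           initial_segments: int = 1) -> int:
    """Round trips needed for a doubling window to reach the target size."""
    target_segments = math.ceil(target_window_bytes / mss_bytes)
    cwnd = initial_segments
    round_trips = 0
    while cwnd < target_segments:
        cwnd *= 2          # idealized slow start: window doubles each round trip
        round_trips += 1
    return round_trips


rtt_ms = 60  # cross-continent round trip used in the article
for label, window_bytes in (("65 Kbyte window", 65_000),
                            ("full OC-3 window", 1_162_500)):
    trips = slow_start_round_trips(window_bytes)
    print(f"{label}: {trips} round trips, about {trips * rtt_ms} ms of ramp-up")
```

A short session may finish before the window ever reaches its maximum, and each retransmission that resets the window pays this ramp-up again.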

Session Free-For-All: Each TCP session is throttled and contends for network resources independently, which can cause over-subscription of resources relative to each individual session. This increases the congestion and packet loss.

Lossless Compression: One Piece in Solving the Problem

Lossless data compression can mitigate some of the throughput decreases caused by TCP/IP packet loss. It does that by increasing the payload of each packet.

There are limits to how much lossless data compression can compress and increase the payload of a TCP/IP data packet. If the data within the packet is already compressed, the gain is small. If there is a lot of null data (blanks) within that block, the gain can be substantial. The amount a lossless compression algorithm can compress a packet also depends on the size of the packet. The larger the packet, the more likely it is that measurable compression gains will take place. Small packets do not compress well. And typically TCP/IP local networks are limited by Ethernet's maximum packet size of 1500 bytes and by TCP/IP window sizes.
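These effects are easy to observe with any general-purpose lossless compressor. The sketch below uses Python's standard zlib module as a stand-in (the article does not say which algorithm any vendor uses), a made-up record format as sample business data, and random bytes as a stand-in for data that is already compressed.

```python
import os
import zlib


def ratio(payload: bytes) -> float:
    """Original size divided by zlib-compressed size (higher is better)."""
    return len(payload) / len(zlib.compress(payload))


# Hypothetical business records: a fixed layout with a few changing fields.
records = "".join(
    f"Policy {i:06d}; status=ACTIVE; premium={(i * 37) % 5000:07d}; region=MIDWEST\n"
    for i in range(2000)
).encode()

blanks = b" " * 64_000                 # null data (blanks)
incompressible = os.urandom(64_000)    # stand-in for already-compressed data

for size in (100, 1460, 64_000):
    print(f"records, {size:>6} bytes: ratio {ratio(records[:size]):.2f}")
print(f"blanks,  {len(blanks):>6} bytes: ratio {ratio(blanks):.2f}")
print(f"random,  {len(incompressible):>6} bytes: ratio {ratio(incompressible):.2f}")
```

Exact ratios depend on the data and the compressor, but the ordering matches the text: blanks compress best, already-compressed data barely at all, and a small slice of the records compresses worse than a large one.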

Some data protection network vendors have developed very clever lossless data compression schemes that overcome the limits of TCP/IP packets. One of these is Network Executive Software (Maple Grove, MN) with its HyperIP compression engine.

HyperIP Compression Engine

The HyperIP compression engine compresses aggregated or concatenated HyperIP packets, rather than individual TCP/IP packets, and uses much larger window sizes. The increased compression comes from compressing the gaps between the packets as well as the packets themselves. This implementation can increase throughput by up to fifteen times more than standard lossless data compression. FTP test results at a very large insurance company (see Table 1) demonstrated effective throughput of up to approximately 10 times the bandwidth available and 187 times the native TCP throughput. These are impressive numbers.
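The benefit of compressing aggregated data rather than individual packets can be illustrated with a general-purpose compressor. The sketch below is not HyperIP's algorithm, which the article does not describe. It uses Python's zlib as a stand-in, a made-up record format, and an assumed 1,460 bytes of data per packet.

```python
import zlib

MTU_PAYLOAD = 1460  # assumed bytes of data per Ethernet-sized packet

# Hypothetical backup stream: similar records with a few changing fields.
stream = "".join(
    f"Policy {i:06d}; status=ACTIVE; premium={(i * 37) % 5000:07d}; region=MIDWEST\n"
    for i in range(2000)
).encode()

# Split the stream into packet-sized pieces.
packets = [stream[i:i + MTU_PAYLOAD] for i in range(0, len(stream), MTU_PAYLOAD)]

# Option 1: compress every packet on its own.
per_packet_bytes = sum(len(zlib.compress(p)) for p in packets)

# Option 2: concatenate the packets and compress the aggregate once.
aggregated_bytes = len(zlib.compress(b"".join(packets)))

print(f"original:            {len(stream):>7} bytes in {len(packets)} packets")
print(f"compressed per pkt:  {per_packet_bytes:>7} bytes")
print(f"compressed together: {aggregated_bytes:>7} bytes")
```

Compressing the aggregate lets the compressor reuse patterns across packet boundaries and pay its fixed overhead once, which is why the second figure comes out smaller.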

Summary and Conclusion

Data protection applications such as backup, volume replication, snapshot and mirroring typically use standard TCP/IP WANs for long haul. Standard TCP/IP WANs are less than ideal for the throughput required. TCP/IP packet loss limits data protection throughput to the point where protection windows can be, and often are, missed.

Lossless data compression can increase packet payload so that it mitigates throughput decreases from TCP/IP packet loss. Smart implementations such as Network Executive Software's HyperIP increase throughput up to eight times more than standard lossless compression.

Lossless data compression is only part of the solution to increase throughput for data protection applications over TCP/IP WANs. Ultimately, the total solution must shield the application from the impact of TCP/IP WAN packet loss while maximizing bandwidth utilization.

Table 1: HyperIP® Compression Test Results vs. Native TCP
(35Mbps bandwidth w/60ms one-way delay; all throughput values in Mbps)

                                FTP Native TCP   FTP w/HyperIP   FTP w/HyperIP & Compression

Throughput w/no bit errors          33.3             34.2                  327
Throughput w/.01% bit errors        18               34.2                  324
Throughput w/1% bit errors           1.73            34.2                  324

Note: Table made from bar graph.
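
The multiples quoted in the text follow from the table values. A short check, using only the figures in Table 1 (the variable names are illustrative):

```python
# Values from Table 1, in Mbps.
link_bandwidth = 35.0
native_tcp_1pct = 1.73        # FTP over native TCP with 1% bit errors
hyperip_comp_clean = 327.0    # HyperIP with compression, no bit errors
hyperip_comp_1pct = 324.0     # HyperIP with compression, 1% bit errors

vs_bandwidth = hyperip_comp_clean / link_bandwidth
vs_native_tcp = hyperip_comp_1pct / native_tcp_1pct

print(f"vs. available bandwidth: {vs_bandwidth:.1f}x")
print(f"vs. native TCP at 1% bit errors: {vs_native_tcp:.0f}x")
```

This gives roughly 9.3 times the available bandwidth and 187 times the native TCP throughput at 1% bit errors.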


www.netex.com

Steve Thompson is director, Storage Networking, NetEx Software, Inc. (Maple Grove, MN)
COPYRIGHT 2004 West World Productions, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2004, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:Disaster Recovery & Backup/Restore
Author:Thompson, Steve
Publication:Computer Technology Review
Geographic Code:1USA
Date:Jun 1, 2004
Words:1666