The value of compression for data protection over TCP/IP WANs

Data compression generally enables devices to transmit or store the same amount of data with fewer bits. The primary objective is to minimize the amount of data to be transmitted or stored. Data compression transforms a string of bytes into a new string containing the same information with a much smaller length.

Data transmission and storage cost a nontrivial amount of money. There is a direct correlation between the amount of data transmitted or stored and cost. Compacting the data both in motion and at rest would minimize this cost.

Contrary to popular views, the vast majority of data is neither transmitted nor stored in a compressed form. Data is usually transmitted or stored in the form that is easiest for an application to use. Examples are ASCII text for e-mail, word processing, spreadsheets, etc., or computer OS executable binary code. Typically, these easy-to-use encoding methods require data files that are 2x to 40x or more larger than required to represent the information.

Data compression optimizes data for compactness. Data decompression restores the data to its original form. There are two principal types of data compression/decompression that address this situation.

The first is lossless data compression. Just as it sounds, lossless data compression/decompression means that restored data is identical to the original data. Lossless data compression is used for data in which not even a single bit may change. This is the standard for business data. LZO (Lempel-Ziv-Oberhumer) is an example of a portable lossless data compression library written in ANSI C. It provides fast compression and very fast decompression. Decompression requires no additional working memory.
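The lossless property described above can be checked directly. The following is a minimal sketch, assuming Python's standard zlib module as a stand-in (it is not LZO, which has no binding in the Python standard library); the record contents and the function name lossless_roundtrip are hypothetical, chosen only for illustration.

```python
import zlib


def lossless_roundtrip(data: bytes) -> tuple[int, int, bool]:
    """Compress and restore data; report sizes and whether the copy is bit-identical."""
    compressed = zlib.compress(data)
    restored = zlib.decompress(compressed)
    return len(data), len(compressed), restored == data


# Hypothetical "business data": a fixed-width text record padded with blanks.
record = b"0001,ACME CORP".ljust(80) + b"\n"
sample = record * 500

original_size, compressed_size, identical = lossless_roundtrip(sample)
print(f"original:   {original_size} bytes")
print(f"compressed: {compressed_size} bytes")
print(f"identical after decompression: {identical}")
```

Blank-padded, repetitive records like these are the kind of easy-to-use encoding that leaves a lot of room for compression; already-compressed input would shrink far less.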
The second type of data compression is called lossy. Lossy data compression means that the restored data may not be completely identical to the original data. It is usually used for data that already contains some random noise, where additional noise or loss will not matter. Photos and video are examples of this type of data. JPEG (photos) and MPEG (video) compression are examples of lossy data compression. Both provide very fast compression and decompression capabilities. Lossy data compression is not appropriate for business data.

Both lossless and lossy compression can be found in software, drivers, firmware, and in some cases even ASICs. Each has its place. This article focuses on the use and implementation of lossless data compression, specifically when used with data protection applications over TCP/IP WANs.
Data Protection, Bandwidth and TCP/IP

WAN bandwidth costs have been in steep decline since the Telco bubble burst at the turn of the century. If unemployment numbers in Dallas, RTP, Ontario, Morristown, and San Jose are any indication, Telco has not yet made a dent in its supply of bandwidth, and costs continue to decline worldwide. Even with the decreasing costs, bandwidth is one of the largest operating expenses for the IT organization. It is not free.

TCP/IP is the principal WAN protocol of choice for data protection applications. This is because of the continuing myth that TCP/IP makes bandwidth free for data protection applications. Conventional wisdom is that data protection applications usually run at night or on weekends, when the TCP/IP network is sparsely utilized. In this way, the data protection traffic piggybacks on the same WAN links at no additional charge. Hence the perception that it is free.

The logic is flawed. Data protection applications have requirements significantly beyond those of day-to-day business applications. In some cases, they have been known to overwhelm the IP routers. It is also a false notion that data protection applications will run only in the "off" hours. Depending on the type of data needing protection, the regulations involved, and the requirements for recovery, these applications will be running during the prime business day.

The Market Problem: Data Protection Throughput Over TCP/IP

Another common myth is that standard TCP/IP will always meet the unique needs of these applications. Although this is true for the most part, it is not always true.
TCP/IP over the WAN was never designed to handle the large amounts of bulk data that a data protection program can and often does generate. When that TCP/IP WAN has the typical packet loss of approximately 1%, data protection windows for operations such as complete volume replications can be, and are, missed. Packet loss is a direct result of bit error rate (BER), jitter, network congestion, distance, router buffer overruns, and multiple service providers. One way to mitigate the packet loss problem is through lossless data compression.

TCP Bandwidth Long Haul Problems That Limit Data Protection Throughput

Several characteristics of TCP/IP cause it to perform poorly over high bandwidth and long distances.

Packet Loss: Most TCP/IP WANs are designed around an average packet loss of 1%. This is a relatively low number for standard interactive business traffic. It is a high number for storage data protection applications. Packet loss increases when there is a high bit error rate, or BER (10^-10 to 10^-6), when jitter becomes an issue, or when congestion is high.
Multiple service providers typically use different network vendors, increasing the probability of bit errors and jitter.

Window Size: Window size is the amount of data allowed to be outstanding (in the air) at any given point in time. The available window size on a given bandwidth pipe is the bandwidth times the round-trip delay, or latency. A cross-continent North American OC-3 link (approximately 60ms, based on a total 3000-mile roundtrip) creates an available data window of 155Mbps x 60ms = 1,163 Kbytes. A DS3 satellite connection (540ms roundtrip) creates an available data window of 45Mbps x 540ms = 3,038 Kbytes. When this is contrasted with standard and even enhanced versions of TCP, there is a very large gap between the available window and the window utilized. Most standard TCP implementations are limited to 65 Kbyte windows. A few enhanced TCP versions may be capable of using windows of 512 Kbytes or larger. Either case means an incredibly large amount of "dead air" and very inefficient bandwidth utilization. The amount a packet can be compressed depends heavily on the size of the packet. The larger the window size, the larger the packet. The larger the packet, the more it can be compressed.

Acknowledgement Scheme: TCP causes the entire stream from any lost portion onward to be retransmitted. In high bit-error-rate scenarios this wastes large amounts of bandwidth resending data that has already been successfully received, all with the long latency of the path.
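The window-size arithmetic in the Window Size discussion (bandwidth times round-trip delay) can be reproduced with a short sketch. The function names are hypothetical, a Kbyte is assumed to be 1,000 bytes (which matches the article's figures), and the 65 and 512 Kbyte TCP windows are the ones quoted in the text.

```python
def available_window_kbytes(bandwidth_mbps: float, round_trip_ms: float) -> float:
    """Bandwidth times round-trip delay, in Kbytes (1 Kbyte = 1,000 bytes)."""
    # 1 Mbps for 1 ms carries 1,000 bits, so Mbps x ms gives kilobits.
    kilobits_in_flight = bandwidth_mbps * round_trip_ms
    return kilobits_in_flight / 8


def window_utilization(tcp_window_kbytes: float, available_kbytes: float) -> float:
    """Fraction of the available window that a TCP window of the given size can fill."""
    return min(1.0, tcp_window_kbytes / available_kbytes)


links = {
    "OC-3, 60ms round trip": (155, 60),
    "DS3 satellite, 540ms round trip": (45, 540),
}

for name, (mbps, rtt_ms) in links.items():
    available = available_window_kbytes(mbps, rtt_ms)
    standard = window_utilization(65, available)
    enhanced = window_utilization(512, available)
    print(f"{name}: {available:,.1f} Kbytes available; "
          f"65 Kbyte window fills {standard:.1%}, 512 Kbyte window fills {enhanced:.1%}")
```

The computed windows are 1,162.5 and 3,037.5 Kbytes, which the article rounds to 1,163 and 3,038. The utilization fractions are a rough illustration of the "dead air" gap, not a full model of TCP throughput.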
Slow Start: TCP data transfers start slowly, to avoid congestion from possibly large numbers of sessions competing for the bandwidth, and then ramp up to their maximum transfer rate. The result is poor performance for short sessions.

Session Free-For-All: Each TCP session is throttled and contends for network resources independently, which can cause over-subscription of resources relative to each individual session. This increases congestion and packet loss.

Lossless Compression: One Piece in Solving the Problem

Lossless data compression can mitigate some of the throughput decreases caused by TCP/IP packet loss. It does so by increasing the effective payload of each packet. There are limits to how much lossless data compression can compress, and thereby increase the payload of, a TCP/IP data packet. If the data within the packet is already compressed, the answer is: not much. If there is a lot of null data (blanks) within the block, the answer is: quite a bit.

The amount a lossless compression algorithm can compress a packet also depends on the size of the packet. The larger the packet, the more likely it is that measurable compression gains will take place. Small packets do not compress well. And typically, TCP/IP local networks are limited by Ethernet's maximum packet size of 1500 bytes and by TCP/IP window sizes.

Some data protection network vendors have developed very clever lossless data compression schemes that overcome the limits of TCP/IP packets. One of these is Network Executive Software (Maple Grove, MN), with its HyperIP compression engine.

HyperIP Compression Engine

The HyperIP compression engine compresses aggregated, or concatenated, HyperIP packets with much larger window sizes, rather than individual TCP/IP packets.
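The general point that small packets compress worse than a larger aggregated stream can be illustrated with a rough sketch. This is not HyperIP's implementation, which the article does not describe in detail; it assumes Python's standard zlib module as a generic lossless compressor, and the sample records and function names are hypothetical.

```python
import zlib

PACKET_SIZE = 1500  # Ethernet maximum packet size cited in the text


def compressed_size_per_packet(data: bytes, packet_size: int = PACKET_SIZE) -> int:
    """Total bytes when every packet-sized chunk is compressed on its own."""
    chunks = (data[i:i + packet_size] for i in range(0, len(data), packet_size))
    return sum(len(zlib.compress(chunk)) for chunk in chunks)


def compressed_size_aggregated(data: bytes) -> int:
    """Total bytes when the whole concatenated stream is compressed at once."""
    return len(zlib.compress(data))


# Hypothetical bulk data: text records that repeat across packet boundaries.
stream = b"".join(
    f"record {i:06d} status=OK region=north\n".encode("ascii") for i in range(5000)
)

per_packet = compressed_size_per_packet(stream)
aggregated = compressed_size_aggregated(stream)
print(f"uncompressed:          {len(stream)} bytes")
print(f"compressed per packet: {per_packet} bytes")
print(f"compressed aggregated: {aggregated} bytes")
```

Compressing each 1500-byte chunk separately restarts the compressor every time, so redundancy that spans chunk boundaries is never exploited and per-chunk overhead is paid repeatedly. Compressing the concatenated stream avoids both costs.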
The increased compression comes from compressing the gaps between the packets as well as the packets themselves. This implementation can increase throughput by up to fifteen times more than standard lossless data compression. FTP test results at a very large insurance company (see Table 1) demonstrated compression rates up to approximately 10 times the bandwidth available and 187 times the native TCP throughput. These are impressive numbers.

Summary and Conclusion

Data protection applications such as backup, volume replication, snapshot and mirroring typically use standard TCP/IP WANs for long haul. Standard TCP/IP WANs are less than ideal for the throughput required. TCP/IP packet loss limits data protection throughput to the point where protection windows can be, and often are, missed. Lossless data compression can increase packet payload so that it mitigates throughput decreases from TCP/IP packet loss. Smart implementations such as Network Executive Software's HyperIP increase throughput up to eight times more than standard lossless compression. Lossless data compression is only part of the solution to increase throughput for data protection applications over TCP/IP WANs. Ultimately, the total solution must shield the application from the impact of TCP/IP WAN packet loss while maximizing bandwidth utilization.

Table 1: HyperIP® Compression Test Results vs. Native TCP
(FTP throughput in Mbps; 35Mbps bandwidth with 60ms one-way delay)

Condition          FTP Native TCP   FTP w/HyperIP   FTP w/HyperIP & Compression
No bit errors      33.3             34.2            327
.01% bit errors    18               34.2            324
1% bit errors      1.73             34.2            324

Note: Table made from bar graph.

www.netex.com

Steve Thompson is director, Storage Networking, NetEx Software, Inc. (Maple Grove, MN)