The New Holy Grail: Information Lifecycle Management. Has it been found? Not yet.
Important questions remain about how data should be managed and where data should, ideally, reside during its existence. In particular, the probability of reuse has historically been one of the most meaningful metrics for determining optimal data placement, and it remains a key metric for effective HSM (Hierarchical Storage Management) implementation.
For most data types, the number of references to data declines significantly as the data ages. This observation underpins cost-effective storage management, since it allows less active data to be moved to lower-cost tiers of storage. Declining access frequency with age has been a fundamental storage management principle for over 25 years.
We are now witnessing a new effect on the life cycle of data: the amount of data increases as it ages. In a reversal of past patterns, fixed content and archival storage have become the fastest-growing segments of the storage industry. Storage demand grew at over 100% per year during the dot-com boom of the late 1990s; today, the industry is generating new data at roughly 50-70% per year. In addition, some current demand for storage is being absorbed by existing, unused capacity left over from the excessive buying of the past several years. Regardless of the growth rate, the continual increase in digital data has made storage management more difficult, and, as a result, more data is being accumulated for longer periods. Much of this data lives digitally without effective storage management services.
The percentage of digital data that has lost its value, and therefore should be deleted, is quickly declining, because obsolete data is often just kept around forever. In many cases this approach is perceived as easier than managing the data throughout its life cycle. The probability of reusing data typically falls by 50% once the data is three days old; thirty days after creation, it normally falls below a few percentage points. E-mail and medical imaging applications are good examples of this aging profile. Keeping very low-activity, archival, and inactive data on spinning disk for long periods is uneconomical, given electrical consumption, security exposure, and the tangible price differential between disk and tape per unit of storage purchased.
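The aging profile above can be sketched in a few lines. This is a minimal illustration, not a model from any product or study: the exponential decay, tier cutoffs, and tier names are all assumptions chosen only to match the figures cited (roughly 50% reuse probability at three days, below a few percent at thirty).

```python
# Illustrative aging model: reuse probability halves every three days,
# per the figures cited above. Cutoffs and tier names are assumptions.

def reuse_probability(age_days: float) -> float:
    """Exponential decay with an assumed three-day half-life."""
    half_life_days = 3.0
    return 0.5 ** (age_days / half_life_days)

def suggested_tier(age_days: float) -> str:
    """Map reuse probability to a storage tier (cutoffs are assumptions)."""
    p = reuse_probability(age_days)
    if p > 0.25:
        return "primary disk"
    if p > 0.05:
        return "low-cost disk"
    return "tape archive"
```

Under these assumptions, a thirty-day-old file has a reuse probability near 0.1%, consistent with the "few percentage points" figure, and would be placed on tape.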
Data Retention Requirements Change
When the Nearline concept was becoming widely accepted in the 1990s, the common belief was that archival status was the last stop for data before deletion or end-of-life. One- to two-year retention periods were then viewed as a reasonable time to keep digital data accessible. Fifteen years later, the game and rules are different. New government regulations, the Sarbanes-Oxley Act, and HIPAA requirements for the transmission and retention of data have changed the way we look at data as it ages. Several major health care providers face generating and storing more than 500TB of data over the next few years that must be managed for a person's lifetime plus seven years, a period that could exceed 100 years. SEC Rule 17a-4(f) mandates digital archiving requirements as they relate to storage, including what storage format may be used, how long data must be retained, and where and how long duplicate copies of data must be stored. The back end of the data life cycle is swelling, not shrinking as was previously the case, and retention policies are now based on data value and legal requirements, not just reference activity. For lifetime data management, it doesn't matter if the data is ever used; it does matter that the data is there. This change in the storage landscape calls for new management policies based on the value of data and requires that a universal, standard classification scheme for data emerge. All data is not created equal.
Life Cycle Management and Policies
How does someone actually implement an information life cycle strategy? Is managing data for its lifetime realistically possible? It won't become a reality without major enhancements to existing data management capabilities. Data is growing faster than our ability to manage it. As storage networks and SAN deployments continue to evolve, optimal data placement and movement between levels of the storage hierarchy will occur automatically, without human involvement. As these functions move outboard of the application servers, they will be implemented as either in-band or out-of-band functions in the storage fabric itself, most likely delivered as blades or appliances. Advanced policy-driven SRM (Storage Resource Management) software will be required; it should evolve to measure reference patterns and trigger management policies that, in conjunction with HSM or a similar function, move data to the optimal storage location throughout its lifetime. In the future, SRM tools may become the optimal storage management function for assigning data values.
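The policy-driven placement described above amounts to a set of ordered rules evaluated against each file's reference pattern. A minimal sketch follows; the rules, thresholds, and tier names are hypothetical, and real SRM products expose far richer policy languages than this.

```python
# Sketch of policy-driven data placement. First matching rule wins.
# All thresholds and tier names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    days_since_last_access: int

# Ordered policy rules: the first matching predicate decides the tier.
MIGRATION_POLICIES = [
    (lambda f: f.days_since_last_access > 90, "tape library"),
    (lambda f: f.days_since_last_access > 14, "nearline disk"),
    (lambda f: True, "primary disk"),
]

def target_tier(stats: FileStats) -> str:
    """Evaluate the policies in order and return the tier the file belongs on."""
    for predicate, tier in MIGRATION_POLICIES:
        if predicate(stats):
            return tier
    return "primary disk"  # unreachable: the catch-all rule always matches

def migration_candidates(files):
    """Pair each file with its policy-assigned tier, as an HSM mover would."""
    return [(f.path, target_tier(f)) for f in files]
```

In practice, an SRM engine would collect the access statistics itself and hand the resulting move list to an HSM data mover rather than returning it to a caller.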
Data Life Cycle Management Needs a Solution
Ideally, the data life cycle management solution should be completely transparent to applications and to users, who do not need to know where their data is stored as long as it is accessible. Under a tiered storage migration policy, data is typically moved from expensive hard disk either to less expensive online storage or to tape. Storage administrators should not have to inform users that their files are in a new location, nor should they have to visit client systems to change file location pointers. The migration of data from one level of the storage hierarchy to another should be transparent; users should not even know that their data has moved to less costly media. A data life cycle management solution must track the new location when data is relocated and make the data available to the user or application on request. One common technique is to separate the file's attributes from the actual data in the file. When the data is migrated, the file's attributes in the local system still contain all the important descriptive information about the file (new location, file name, security information, etc.), while the data now resides in another, typically lower-cost, storage subsystem.
When a user or application retrieves a file that has moved down the storage hierarchy, the management software retrieves that file from the new migrated target location.
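The stub technique just described can be sketched as a small catalog: the file's attributes stay where they were, only the data's location changes on migration, and recall is transparent because the caller never names a tier. The class and method names below are hypothetical; actual HSM implementations embed this logic in the file system or fabric.

```python
# Sketch of attribute/data separation with transparent recall.
# Names are hypothetical illustrations, not any product's API.

class StubCatalog:
    def __init__(self):
        # path -> {"attributes": descriptive info, "location": data's tier}
        self._entries = {}

    def add(self, path, attributes, location="primary disk"):
        self._entries[path] = {"attributes": dict(attributes),
                               "location": location}

    def migrate(self, path, new_location):
        # Only the data's location changes; attributes remain in place.
        self._entries[path]["location"] = new_location

    def attributes(self, path):
        # Clients still see the full descriptive information locally.
        return self._entries[path]["attributes"]

    def read(self, path):
        # Transparent recall: the caller never specifies where the data lives.
        entry = self._entries[path]
        return f"data for {path} recalled from {entry['location']}"
```

A medical image, for example, could be added on primary disk, migrated to tape years later, and still be opened by the same `read` call with no change on the client side.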
Intelligent Storage Architecture
It isn't yet well understood whether the overall cost of the additional server I/O traffic required to move data within the hierarchy exceeds the cost of simply leaving the data on higher-priced disk indefinitely. What we do know is that the overhead, or I/O tax, is very high. A new trend called draining the server, which moves storage management functions off the server and into the storage fabric to minimize host resource consumption and improve storage management speeds, is emerging as a primary direction for the storage industry.
The initial application expected to move outboard has been referred to as server-less backup and recovery. Representing a fundamental change in the way large data centers operate, server-less backup will allow businesses to perform a variety of operations such as full backup, snapshots and incremental backups at any time without consuming computing and I/O bandwidth resources from application servers. With server-less backup, the server initiates the backup or recovery function but doesn't sit in the data movement path. The movement of data directly from disk arrays to and from automated libraries across a dedicated network for backup and recovery applications is now highly desirable. For recovery, the data moves directly from tape storage back to disk. This capability further leverages the SAN infrastructure by providing significant management benefits for storage administrators.
After outboard or server-less backup, look for HSM to become a primary candidate to move into a SAN appliance or blade, basically bringing HSM to life in the fabric. As stated earlier, server-less or outboard storage management technologies will eventually progress beyond backup and recovery to include mirroring, replication, snapshot copy, and a variety of virtualization functions. Advanced SRM products make possible proactive, anticipatory data movement that further optimizes the storage hierarchy. One set of management tools and utility software, accessed through a single interface, can enable storage administrators to manage far more storage than ever before, finally shrinking the gap between installed storage capacity and what can actually be managed.
As storage becomes cheaper to buy, it becomes harder to manage. In parallel, the value of data is increasing irrespective of economic and other pressing global issues. Because the value of data now changes significantly as it ages, storage management has become a lifetime activity. The place where data is initially stored is not necessarily where it will finally be stored. Everyone can state the problem of data life cycle management; building and delivering a solution to this growing problem will take the best minds in the industry. Given the anticipated growth rates for digital storage, the time to begin has already passed.
Publication: Computer Technology Review, Feb 1, 2004.