Preservation risk management for Web resources: preserving Web content requires substantial resource commitments and flexible and innovative approaches to new technologies, organizational missions, and user expectations.


At the Core

This article:

* Discusses current Web preservation efforts

* Defines a risk-based preservation management program

* Introduces Cornell University's Project Prism

Actuaries spend their careers figuring out what benefits a company should offer, at what price, and for how long. Their job is to make sense of all the empirical and statistical evidence of age, gender, health, heredity, lifestyles, physical habits, and living and working conditions that serve as indicators of longevity, productivity, and obligation. How well they do their job depends on how good their evidence is, how skilled they are at reading it, and how risk tolerant their customers are.

Archivists and research librarians interested in preserving Web resources face a similar challenge. Libraries increasingly depend on digital assets they neither own nor manage. Academic libraries have dramatically increased their offerings of online resources. A 2001 survey of the 21 members of the Digital Library Federation revealed that 40 percent of their costs for digital libraries in 2000 went for commercial content. The big-ticket items were electronic scholarly journals that libraries license rather than own. Yet little direct evidence shows that publishers have developed full-scale digital preservation capabilities to protect this material, and research libraries continue to purchase the print versions for preservation purposes. However, none appears ready to forgo access to the licensed content just because its long-term accessibility might be in question.

Research libraries also are including in their catalogs and gateways more open-access Web resources that are not covered by licenses or other formal arrangements. A spring 2001 survey of Cornell University's and the University of Michigan's Making of America (MOA) collections revealed that nearly 250 academic institutions link directly to the MOA collections, although neither university has committed to provide other entities with long-term access. Similarly, a review of the holdings of several research library gateways over the past few years indicates growth in the number of links to open-access Web resources that are managed with varying degrees of control. Approximately 65 percent of the electronic resources on Cornell's gateway are unrestricted, and additional open resources are included in aggregated sets that are available only to the campus community. In contrast, only six percent of Michigan's electronic resources are open-access materials.

Current Web Preservation Efforts

Estimates put the average life expectancy of a Web page between 44 days and two years, and a significant proportion of those that survive undergo some change in content within a year. Since 1998, the Online Computer Library Center's (OCLC) Web Characterization Project has tracked trends in growth and content of the publicly available Web space. One of the more revealing statistics, IP address volatility, identifies the percentage of IP [Internet Protocol] addresses that remain extant from one year to the next. In a fairly consistent trend since 1998, slightly over half (55-56 percent) of the IP addresses identified in one year are still available the next. Within two years, a little over a third (35-37 percent) remain. Four years later, only 25 percent of the sample 1998 IP addresses could be located, according to OCLC.

OCLC's annual review points to the instability of Web resources; it doesn't indicate whether those resources still exist elsewhere on the Web or whether the content has changed. While some resources disappear, others become unfindable due to the well-known problem that URLs change. A recent preservation review of the 75 Smithsonian Institution Web sites noted that an exhaustive search could not locate a copy of the first Smithsonian Web site, created in 1995. A URL may persist while content changes wildly: the editors of RLG DigiNews discovered that links in several past issues pointed to lapsed domain names that had been converted by others into pornography sites.

Much attention has been paid to unstable URLs and to creating administrative/preservation metadata, but to date no evidence suggests that research libraries are privileging open access sites that utilize some form of URN [Uniform Resource Name] or that document content change.

With the growing dependence on external digital assets, libraries and archives are undertaking some measures to protect their continued use of these resources. Efforts can be grouped into three areas: collaborating with publishers to preserve licensed content, developing policies and guidelines for creating and maintaining Web sites, and assuming archival custody for Web resources of interest.

Licensed Content

Publishers are developing their own preservation strategies as they realize the commercial benefits of creating deep content databases. Several are working with third parties to back up, store, and refresh digital content. OCLC recently announced the formation of the Digital and Preservation Resources Division to provide integrated solutions for creating, accessing, and preserving digital collections. With planning grants received in 2001 from The Andrew W. Mellon Foundation, seven research libraries and key commercial and scholarly publishers began exploring formal archiving arrangements for e-journals and developing plans for moving toward implementation.

Creating and Maintaining Sites

The World Wide Web Consortium's (W3C) "Web Content Accessibility Guidelines, Techniques, and Checklist" provide some recommendations for good resource management (e.g., use of standard formats and backward-compatible software) and have had a major impact on the development of Web materials worldwide. However, the W3C guidelines do not expressly address content stability, documentation of change, or good database management. In fact, preservation and records management issues are noticeably absent.

In the United States, Web preservation is more directly supported through government policies and guidelines to promote accountability, spurred in part by such legislation as the Paperwork Reduction Act. Governments also are promulgating specific policies and recommendations for preserving government-supported Web content. In January 2001, the U.S. National Commission on Libraries and Information Science published "A Comprehensive Assessment of Public Information Dissemination," which recommended legislation that would "formally recognize and affirm the concept that public information is a strategic national resource." Another recommendation is to "partner broadly, in and outside of government, to ensure permanent public availability of public information resources."

The archivist's perspective has been quite influential, as arguments are advanced to treat Web sites as important records in their own right. National archives in many countries are developing policies and guidelines. The U.S. Federal Records Act, as amended, requires that agencies identify and transfer Web site records to agency recordkeeping systems, including the National Archives and Records Administration (NARA), for permanent retention. NARA has issued several bulletins on the disposition of electronic records that include Web sites. It has also slowly begun to respond to this new form of recordkeeping and has appraised at least one federal Web site as a permanent record. In late 2000, NARA established an initiative to capture a snapshot of all federal Web sites at the end of the Clinton Administration. NARA also has contracted with the San Diego Supercomputer Center for a project to investigate the preservation of presidential Web sites.

The National Library of Australia (NLA) has been a world leader in promulgating guidelines for preservation. In December 2000 the NLA issued "Safeguarding Australia's Web Resources," which provides advice on creating, describing, naming, and managing Web resources. The Council on Library and Information Resources funded NLA's Safekeeping Project, which targets 170 key items accessible through Preserving Access to Digital Information (PADI). NLA staff wrote to the resource managers, encouraging them to voluntarily preserve these materials and outlining nine strategies for long-term access. According to Susan Thomas, PADI administrator, 116 resource owners responded and safekeeping arrangements have been made for 77 items to date. Negotiations are in progress for an additional 33 resources. Eight resource owners lacked the appropriate infrastructures to comply with the recommendations. Alternative "safekeepers" have been approached for four of these. By the end of 2001, 54 resource owners had not responded.

Assuming Archival Custody

The third major focus of Web preservation has been to identify and ingest Web content into digital repositories. The best-known example is the Internet Archive, a not-for-profit organization associated with Alexa Internet, which has been automatically collecting all open access HTML [hypertext markup language] pages since 1996. Also in 1996, the NLA's Pandora adapted Web crawling to archive selected Australian online publications. That same year, the Royal Library of Sweden launched Kulturarw3 to collect, preserve, and make accessible Swedish electronic documents published online. For Pandora, ingest includes manual creation and/or clean up of metadata and the establishment of content boundaries. This approach may be cost effective for a few highly valuable documents but may be prohibitively expensive for large collections.

In 2001, the Internet Archive released the Wayback Machine, which lets users view snapshots of Web sites as they appeared at various points in the past. With more than 10 billion Web pages exceeding 100 terabytes of data and growing at a rate of 12 terabytes a month, the Internet Archive provides the best view of the early Web as well as a panoramic record of its rapid evolution over the past five years. It provides an invaluable tool for documenting change and filling some of the void in recordkeeping in the Web's early days.

However, this approach to Web preservation is only part of the solution to a much larger problem. The Internet Archive and similar efforts to preserve the Web by copying suffer from common weaknesses. Snapshots may or may not capture important changes in content and structure. Technology development, including robot exclusions, password protection, JavaScript, and server-side image maps, inhibits full capture. A Web page may serve as the front end to a database, image repository, or library management system, and Web crawlers capture none of the material contained in these so-called "deep" Web resources.

The sheer volume of material on the Web is staggering. The high-speed crawlers used by the Internet Archive traverse the entire Web every two months--even more time would be needed to treat anomalies associated with downloading. Not all sites merit the same level of attention, especially given limited resources, and means must be devised for honing selection and treating materials according to their needs.

Automated approaches to collecting Web data tend to stop short of incorporating the means to manage the risks of content loss to valuable Web documents. File copying by itself fails to meet the criteria RLG and OCLC have identified. For example, the Internet Archive has not overtly committed to continued access through changing file formats, encoding standards, and software technologies. In addition, legal constraints limit the ability of crawlers to copy and preserve the Web.

Project Prism

Current Web preservation efforts fail to consider the challenge of preserving content that an institution does not control or for which it cannot negotiate formal archiving arrangements or assume direct custody. Over time, preserving Web content will require substantial resource commitments, as well as flexible and innovative approaches to changes in technologies, organizational missions, and user expectations.

The National Science Foundation (NSF) has funded Cornell University's Project Prism, which is a joint research effort by the Computer Science Department and the University Library to support libraries and archives as they extend their role from custodians of physical artifacts to managers of selected digital objects distributed over the network. Digital curatorial responsibilities will need to be reconsidered and undertaken in light of cost, level of participation by cooperative or uncooperative partners, and technical feasibility. At the same time, the project aims to design archiving tools and services that will enable non-librarians to raise the information integrity of research collections that are now managed haphazardly, if at all. Ultimately, the goal is to create an approach to archiving distributed Web content that takes custody of digital files as a last resort, though the methodology also could be used for pre-ingest management.

Project Prism is producing a framework for developing an ongoing comprehensive monitoring program that is scalable, extensible, and cost effective. Its approach begins with characterizing the nature of preservation risks in the Web environment, develops a risk management methodology for establishing a preservation monitoring and evaluation program, and leads to the creation of management tools and policies for virtual remote control. The approach will demonstrate how Web crawlers and other automated tools and utilities can be used to identify and quantify risks; to implement appropriate and effective measures to prevent, mitigate, and recover from damage to and loss of Web-based assets; and to support post-event remediation.

The project is exploring a noncustodial, distributed model for archiving, in which resources are managed along a spectrum from, at the highest level, a formal repository to, at the lowest level, the unmanaged Web. One of the goals is to show how the integrity of unmanaged resources can be raised at minimal cost, using automated routines for monitoring and validating files according to policies established by organizations that value the longevity of those resources. The overall goal is to create archiving tools that will enable libraries, archives, commercial database providers, scholarly organizations, and individual authors to manage different sets of risks affecting the same resources remotely.

Risk Management

A risk-based preservation management program begins with two key questions: What assets may be at risk and should be included in the program, and what constitutes risks to those assets? Risk management programs should be developed and implemented within an organizational context: Each institution will need to define its own "worry radius"--the context that provides definitions of perceived risk and acceptable loss. Effective risk management also requires determining the scope and value of assets. The cost of implementing the program should be appropriate to the estimated value of the assets and the impact of their loss on operations and services.

Risk management implementation defines policies, procedures, and mechanisms to manage and respond to identifiable risks. The implemented program should balance the value of assets and the direct and indirect costs of preventing or recovering from damage or loss. The program should be known and understood both within the organization and by relevant stakeholders. An effective program includes comprehensive scope, regular audits, tested responses and strategies, built-in redundancies, and openly available, assigned responsibilities.

Automated Support Strategies

Project Prism is exploring technologies that will form the basis for a suite of tools to support risk-based preservation monitoring and evaluation of Web resources. From a technical perspective, its goal is to design feasible and appropriate mechanisms for off-site monitoring. Assuming that over time libraries and other information intermediaries will extend their collecting scope over greatly increasing amounts of distributed content and that the longevity of these resources will be a primary concern, automatic methods will be needed to deal with such volume cost effectively and with consistent results that are less prone to human error. The methods will need to accommodate both content providers who cooperate in the effort, for example by contributing metadata, and content providers who, while not hostile to the idea of monitoring, are not collaborating. The methods also will need to be flexible enough to suit the variety of management requirements of diverse institutions.

These monitoring mechanisms should be deployable in a range of systems contexts. For a university research library, that context might be a management system used to collect lists of URLs that faculty and librarians have deemed important through some rating scale. The library might then employ the monitoring schemes outlined in the rest of this section as it assumes a role of "managing agent" for those external resources. At the other end of the spectrum, a preservation service might be a program that users could install on their own workstations to monitor Web resources of their own choice. This tool could be launched like other utility tools such as a disk defragmenter or an anti-virus scanner.

The Web resources within an organization's worry radius might be a Web site, a subset of resources in a Web site, or a single Web page or document. Furthermore, a Web resource might live in an individual's informally managed Web page or in an organization's highly controlled Web site. Defining the boundaries of a Web resource for preservation monitoring is not easy. Mechanisms for preservation risk management must address four levels of context:

* A Web page as a stand-alone object, ignoring its hyperlinks

* A Web page in local context, considering the links into it and out from it

* A Web site as a semantically coherent set of linked Web pages

* A Web site as an entity in a broader technical and organizational context

For risk analysis, some threats can be detected from the examination of a single static snapshot of a resource, while other threats become visible through analysis of how the resource changes over time. Project Prism is concerned with both the snapshot view and the time-elapsed view. For each of the four contexts, the team hypothesizes appropriate technical approaches for risk detection. By testing these hypotheses, the team can transform the results into the suite of tools it needs.

Monitoring a Web Page As a Standalone Object

As a stand-alone object, a Web page must be considered without regard to its hyperlinked context. What risk attributes are visible by looking at a single Web resource minus its link structure? Given a one-time snapshot of a single Web page, automated tools can observe these significant features:

* Tidiness of HTML formatting: Just as sloppy work habits reflect badly on an employee, untidy HTML is a reason for some unease about the management of a Web resource. While early versions of HTML had poorly defined structure, the recent redefinition of HTML in the context of XML [extensible markup language] has now formally defined HTML structure. The TIDY tool makes it possible to determine how well an HTML document conforms to this structure, revealing the sophistication and care of the page's manager.

* Standards conformance: Data format standards change over time, sometimes making previous versions unreadable. A monitoring mechanism could automatically determine whether a Web resource conformed to current standards. Conformance to open standards also could be considered. Arguably, Web resources formatted according to a nonpublic standard--for example, Microsoft Word documents--may be a greater longevity risk than those formatted to public standards.

* Document structure: As with HTML formatting, a document that manifests good structure may be more dependable than one that consists of text with no apparent order. Automated digital libraries such as ResearchIndex have had success with heuristics for deriving structure from PDF [portable document format], HTML, and other documents. These techniques could be used to measure the level of structure in a Web resource.

* Metadata: The presence or absence of metadata tags conforming to standards such as Dublin Core may indicate the level of management.
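
Two of the snapshot features listed above, metadata presence and a rough notion of tidiness, can be illustrated with a short Python sketch. This is not the TIDY tool and not Project Prism's implementation; it is a minimal stand-in that assumes Dublin Core tags appear as HTML meta elements whose names begin with "DC." and that treats mismatched or unclosed tags as a crude signal of untidy markup. The names SnapshotScanner and scan_snapshot are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class SnapshotScanner(HTMLParser):
    """Collects a few risk-related features from one HTML snapshot (illustrative)."""

    def __init__(self):
        super().__init__()
        self.dublin_core = []  # names of Dublin Core meta tags found
        self.open_tags = []    # stack of currently open elements
        self.mismatched = 0    # closing tags that did not match the stack top

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # Assumption: Dublin Core metadata is embedded as <meta name="DC.*">.
            name = (dict(attrs).get("name") or "").lower()
            if name.startswith("dc."):
                self.dublin_core.append(name)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.mismatched += 1


def scan_snapshot(html_text):
    """Return simple metadata and tidiness indicators for one page snapshot."""
    scanner = SnapshotScanner()
    scanner.feed(html_text)
    scanner.close()
    return {
        "dublin_core_tags": scanner.dublin_core,
        "mismatched_close_tags": scanner.mismatched,
        "unclosed_tags": len(scanner.open_tags),
    }
```

A monitoring program might record these indicators for each snapshot and compare them across resources. Valid HTML allows some end tags to be omitted, so a real conformance check would need a full validator rather than this strict tag-matching heuristic.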

Automatic mechanisms could track the following characteristics over time:

* HTTP [hypertext transfer protocol] response code: The HTTP protocol defines response codes that indicate transfer error or success. An off-site monitor could record the incidence of HTTP response codes over time, and certain patterns of codes, such as a high frequency of 404 ("page-not-found") codes, could be used to measure risk.

* Response time: A server with widely fluctuating response times or consistently slow response time indicates a higher level of risk than one that is responsive.

* Page changes: For certain types of pages, no changes at all might indicate complete lack of management or maintenance. On the other hand, unpredictable and large changes might indicate chaotic management. Pages that change on some predictable schedule with some predictable delta might indicate high-integrity management. Monitoring mechanisms that employ copy detection methods or page-similarity metrics would be useful for developing a measurement for page changes over time.

* Page relocation: The lack of persistence of URLs is a well-known problem. Certainly, the disappearance of a selected resource, evidenced by consistent "page-not-found" errors, should be a cause for alarm. Techniques such as "robust hyperlinks" might make it possible to track the movement of a resource across the Web and use that movement and/or replication to determine risk.
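
The time-elapsed measures above can be sketched in Python once a monitor has logged its observations. The sketch assumes the observations (status codes, response times in seconds, and page text) have already been collected by some other process; the function names are hypothetical, and the use of difflib as a page-similarity metric is an illustrative choice, not a method named by the project.

```python
from collections import Counter
from difflib import SequenceMatcher


def not_found_rate(status_codes):
    """Fraction of recorded HTTP responses that were 404."""
    if not status_codes:
        return 0.0
    return Counter(status_codes)[404] / len(status_codes)


def response_time_spread(seconds):
    """Difference between the slowest and fastest observed response."""
    return max(seconds) - min(seconds) if seconds else 0.0


def change_ratio(old_text, new_text):
    """0.0 means identical snapshots; 1.0 means nothing in common."""
    return 1.0 - SequenceMatcher(None, old_text, new_text).ratio()
```

For example, not_found_rate([200, 200, 404, 404]) gives 0.5. How such values translate into risk levels is a policy decision for each institution; the thresholds are deliberately left out of the sketch.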

Monitoring a Web Page in a Hyperlinked Context

The hyperlinked structure of a Web page, its in-links and out-links, has been exploited successfully in the development of better Web search engines. Similarly, such "link context," the links out from a page and the links from other pages to that page, may prove useful in deducing longevity risks.

Using a page snapshot, risks can be detected by analyzing:

* Out-link structure: Consider a page that links to a number of pages on the same server, in contrast to another page that either has no out-links or only links to pages on other servers. Intuitively, the "intralinked page" may be more integrated into a site and at lower risk. Pages with no links at all might be considered highly suspicious, having the appearance of "one-offs" rather than long-term Web resources.

* In-link structure: An equal if not greater indicator of longevity risk is the number of links from other pages to a page and the nature of those links. Ascertaining the absence of in-links in the Web context is hard because it requires crawling the entire Web.

* Page provenance: The URL of a Web page can itself provide metadata about the page's provenance and management structure. The host name often provides useful information on the identity (the "address") of the Web server hosting a page and, less reliably, the name of the institution responsible for publishing the page. A top-level domain name can help classify a publishing organization by type (.edu, .gov, .com). Also, the path name may provide clues about organizational subunits that may be responsible for managing a Web page or site. Project Prism will investigate the correlation between top-level domain name and preservation risks.

* Link volatility: Once the nature of the links to and from a page is determined, it is useful to compare changes in those links over time. If out-links are added or updated, a page is evidently being maintained and is at reduced risk. A decrease in in-links may indicate approaching isolation and should cause concern.
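
Several of these link-context signals can be read directly from a page snapshot. The following Python sketch uses only the standard library's urllib.parse; the function names and the returned dictionary keys are hypothetical, chosen for illustration, and in-link counts are left out because, as noted above, they require a crawl of the wider Web.

```python
from urllib.parse import urljoin, urlparse


def classify_out_links(page_url, hrefs):
    """Split a page's out-links into same-host and other-host URLs."""
    page_host = urlparse(page_url).hostname
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        (internal if parsed.hostname == page_host else external).append(absolute)
    return internal, external


def provenance(page_url):
    """Host name, top-level domain, and path segments taken from a URL."""
    parsed = urlparse(page_url)
    host = parsed.hostname or ""
    return {
        "host": host,
        "top_level_domain": host.rsplit(".", 1)[-1] if "." in host else "",
        "path_segments": [part for part in parsed.path.split("/") if part],
    }


def link_changes(old_links, new_links):
    """Links added and removed between two snapshots of the same page."""
    old, new = set(old_links), set(new_links)
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```

A page whose internal list is empty would match the "one-off" profile, and repeated calls to link_changes across snapshots give a rough measure of link volatility.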

Assessing the Risk

Assessing the longevity risk of a Web site will require algorithms for aggregating the risk metrics of its individual pages. Additionally, the structure of the site might serve as an indicator of risk. To analyze this structure we can exploit the wealth of work and algorithms on graphs and the characterization of the Web as a directed graph. In this characterization, resources (documents) at URLs are nodes and the hyperlinks from documents at URLs to documents at other URLs are directed edges in the graph. The organization of a site's internal structure might be appropriate for risk analysis, just as for an individual page. Using graph analysis methods to derive cliques or strongly connected components from graph representations of site structure may make it possible to develop a set of patterns that reflect good site management.
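
To make the idea of strongly connected components concrete, here is a minimal Python sketch that treats a site as a dictionary mapping each URL to the URLs it links to and applies Kosaraju's algorithm. The choice of algorithm and the input format are assumptions for illustration; the article does not say which graph methods the project uses.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm. graph maps each URL to the URLs it links to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Build the reversed graph (every hyperlink flipped).
    reverse = {node: [] for node in nodes}
    for source, targets in graph.items():
        for target in targets:
            reverse[target].append(source)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    visited, finish_order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                finish_order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse completion order.
    assigned, components = set(), []
    for start in reversed(finish_order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return components
```

Pages that fall into one large component can all reach one another by following links, which is one plausible signature of an integrated, managed site; many single-page components would suggest the opposite.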

Based on the static analysis of a site's structure, it would then be possible to analyze changes to it over time. How the Web site evolves should be considered another indicator of risk. A site where links are added or modified regularly and which conforms to a discernable structure exemplifies good management practices and, thus, lower risk.

A Web site is a collection of Web pages, but it also resides on a server within an administrative context, all of which may be affected by the external technical, economic, legal, organizational, and cultural environment. Identifying, monitoring, and managing the ecology of a Web site involves the individual and collective analysis of a number of factors at these different levels--more than just checking for HTTP codes that indicate a page is unavailable or has moved. Problems can be caused by server software misconfiguration, bad cables and router failure, denial-of-service attacks, and many other factors. It is entirely possible that the biggest threat to the continued health of a Web site has nothing to do with how well the site is maintained or even how often it is backed up but whether the backup tapes are stored in the same room as the server--increasing the chance that a single catastrophic event (fire, flood, earthquake) could destroy them both.

Some environmental factors can be monitored remotely, in tandem with direct monitoring of the Web site itself. Slowness or unresponsiveness could indicate hardware failure or power interruption, excessive load on the server from legitimate use, Web crawling, hacker attack, or a network problem. Network utilities such as Ping and Traceroute can help determine whether the problem is confined to Web services, the particular machine, or the larger network. Specialized software for the Web can reveal internal security hazards such as viruses, Trojan horses, outdated software, missing patches, and incorrect configurations. Adapting these tools and utilities will add to Project Prism's preservation risk management toolkit.

Assessing Technological Watersheds

Through the longevity study (www.library.cornell.edu/preservation/prism.html) and future crawls of the Internet Archive, Project Prism is identifying significant technology watersheds that may put Web sites at risk. The Web crawler and other tools can be used to analyze the use of markup languages, MIME [Multipurpose Internet Mail Extensions] types, and other attributes of Web pages that reflect evolving standards and practice. Certain periods may merit closer scrutiny than others. Times of intense and rapid growth generally coincide with greater competition and the need to be more agile and flexible to survive. Periods when many new standards and features are introduced also would present greater risk to content. The Web sites that have been captured in the Internet Archive provide an ideal set of materials for testing these hypotheses by allowing characterization of the introduction and domination of markup languages and formats, the introduction of various types of dynamic behavior, and changes in the use of header fields and tags.
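
One simple way to look for such watersheds is to tally the formats seen in each year's crawl and watch how their shares shift. The Python sketch below assumes, purely for illustration, that each archived page has been reduced to a (year, Content-Type header) pair; the function name and input format are hypothetical.

```python
from collections import Counter, defaultdict


def mime_type_shares(observations):
    """Share of each MIME type per crawl year.

    observations is an iterable of (year, content_type_header) pairs.
    """
    counts = defaultdict(Counter)
    for year, content_type in observations:
        # Drop parameters such as "; charset=UTF-8" and normalise case.
        mime_type = content_type.split(";", 1)[0].strip().lower()
        counts[year][mime_type] += 1
    return {
        year: {mime: n / sum(tally.values()) for mime, n in tally.items()}
        for year, tally in counts.items()
    }
```

A format whose share rises sharply from one year to the next, or falls away, marks a period that may merit the closer scrutiny described above.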

Risk Research

Project Prism is using the Web crawler to study risk factors for Web pages and Web sites. At the server level, it is reviewing the kinds of tools that can be developed or adapted to analyze and mitigate potential risks. While an organization may take on the preservation management of its own Web sites, the project is interested in scenarios that must consider two organizational players: the entities that control the Web sites and the entities that are interested in the longevity of those Web sites. In the first round, significant factors in the administrative context and external environment are being identified, but in-depth work in these areas will be part of the team's follow-up research.

References

"A Comprehensive Assessment of Public Information Dissemination." Available at www.nclis.gov/govt/assess/assess.vol1.pdf (accessed 29 July 2002).

Arms, William Y., Roger Adkins, Cassy Ammen, and Arlene Haynes. "Collecting and Preserving the Web: The Minerva Prototype." RLG News. April 15, 2001. Available at www.rlg.org/preserv/diginews/diginews5-2.html#feature1 (accessed 29 July 2002).

Bergman, Michael. "The Deep Web: Surfacing Hidden Value." The Journal of Electronic Publishing. August 2001. Available at www.press.umich.edu/jep/07-01/bergman.html (accessed 11 July 2002).

Brin, S., and L. Page. "Anatomy of a Large-Scale Hypertextual Web Search Engine." Computer Networks and ISDN Systems. 1998.

Byrnes, Christian. "Information Risk Management: Why Now?" www.trusecure.com/html/tspub/whitepapers/irm.pdf

Fielding, R., J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. "Hypertext Transfer Protocol--HTTP/1.1." The Internet Society. RFC 2616, June 1999. Available at www.ietf.org/rfc/rfc2616.txt (accessed 11 July 2002).

Flecker, Dale. "Preserving Scholarly E-Journals." D-Lib Magazine. September 2001. Available at www.dlib.org/dlib/september01/flecker/09flecker.html (accessed 11 July 2002).

Global Association of Risk Professionals (GARP). Available at www.garp.com/index-b.htm (accessed 29 July 2002).

Greenstein, D., S. Thorin, and D. Mckinney. "Draft report of a meeting held on 10 April in Washington, D.C., to discuss preliminary results of a survey issued by the DLF to its members." April 23, 2001. Available at www.diglib.org/roles/prelim.htm (accessed 11 July 2002).

Information Management Forum Internet and Intranet Working Group (Government of Canada). "An Approach to Managing Internet and Intranet Information for Long Term Access and Accountability." Available at www.imforumgi.gc.ca/iapproach_e.html (accessed 29 July 2002).

Kleinberg, J. M. "Authoritative Sources in a Hyperlinked Environment." Journal of the ACM. 1999.

Kleindorfer, Paul R. "Industrial Ecology and Risk Analysis." Available at http://grace.wharton.upenn.edu/risk/downloads/01-23-PK.pdf (accessed 29 July 2002).

Kumar, S. R., P. Raghavan, S. Rajagopalan, D. Sivakumar, A. S. Tomkins, and E. Upfal. "The Web as a Graph." Presented at Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Dallas, 2000.

Kunreuther, Howard, Patricia Grossi, Nano Seeber, and Andrew Smyth. "A Framework for Evaluating the Cost-Effectiveness of Mitigation Measures." Presented at the Bogazici University/Columbia University Workshop. Available at http://grace.wharton.upenn.edu/risk/downloads/01-18-HK.pdf (accessed 11 July 2002).

Lawrence, Gregory W., William R. Kehoe, Oya Y. Rieger, William H. Walters, and Anne R. Kenney. Risk Management of Digital Information: A File Format Investigation. Available at www.Clir.org/pubs/abstract/pub93abst.html (accessed 11 July 2002).

Lawrence, S., K. Bollacker, and C. L. Giles. "Digital Libraries and Autonomous Citation Indexing." IEEE Computer 32, No. 6, 1999.

National Archives & Records Administration. "Records Management Requirements." Available at www.archives.gov/records_management/policy_and_guidance (accessed 29 July 2002).

National Archives of Australia. "Web Policy and Guidelines." Available at www.naa.gov.au/recordkeeping/er/summary.html (accessed 29 July 2002).

National Library of Australia. "Safekeeping Strategies." Available at www.nla.gov.au/padi/safekeeping/safekeeping.html#ss (accessed 29 July 2002).

The NEDLIB Harvester Project. Available at www.csc.fi/sovellus/nedlib (accessed 29 July 2002).

Nonprofit Risk Management Center. "Making Net Gains: Staying Safe While Making a Name for Your Nonprofit on the Internet." Available at www.nonprofitrisk.org/nwsltr/current/nl901_3.htm (accessed 11 July 2002).

OCLC Web Characterization Web Site. Available at http://wcp.oclc.org (accessed 29 July 2002).

Phelps, T. A., and R. Wilensky. "Robust Hyperlinks: Cheap, Everywhere, Now." Presented at Digital Documents and Electronic Publishing (DDEP00), Munich, 2000.

Raggett, D. "Clean up your Web pages with HTML TIDY." W3C. 2000. Available at www.w3.org/People/Raggett/tidy/ (accessed 11 July 2002).

RLG-OCLC. "Attributes of a Trusted Digital Repository: Meeting the Needs of Research Resources." 2001. Available at www.rlg.org/longterm/attributes01.pdf (accessed 11 July 2002).

Rivard, Catherine L. and Michael A. Rossi. "Is Computer Data `Tangible Property' or Subject to `Physical Loss or Damage'?--Part 1 and Part 2." Insurance Law Group Inc., August 2001 and November 2001. Available at www.irmi.com/expert/articles/rossi008.asp (accessed 11 July 2002).

Shivakumar, N., and H. Garcia-Molina. "Finding Near-Replicas of Documents on the Web." Presented at WebDB'98, 1998.

Smithsonian Institution. "Archival Preservation of Smithsonian Web Resources: Strategies, Principles, and Best Practices." Available at www.si.edu/archives/archives/dollar%20report.html (accessed 29 July 2002).

Wood, Angus. "Integrating Risk Assessment into the Enterprise Information Management Strategy." Presented at the Sixth International Pipeline Reliability Conference, November 19-22, 1996, Houston, Texas. Available at www.itpapers.com/cgi/PSummaryIT.pl?paperid=8433&scid=88 (accessed 11 July 2002).

World Wide Web Consortium. XHTML 1.0: The Extensible HyperText Markup Language, 2nd ed., 2001. Available at www.w3.org/TR/2001/WD-xhtml1-20011004/ (accessed 11 July 2002).

Zhang, K., J. T. L. Wang, and D. Shasha. "On the Editing Distance between Undirected Acyclic Graphs and Related Problems." Presented at CPM (Combinatorial Pattern Matching), 1995.

READ MORE ABOUT IT

Additional Risk Management Resources. www.library.cornell.edu/iris/research/prism/rm-resources.html (accessed 29 July 2002).

Dublin Core Metadata Initiative. Available at http://dublincore.org (accessed 1 August 2002).

The International Risk Management Benchmarking Association (IRMBA). Available at www.irmba.com (accessed 1 August 2002); Risk Management Reports. Available at www.riskreports.com (accessed 1 August 2002).

The Internet Archive. Available at www.archive.org (accessed 1 August 2002).

Kleindorfer, Paul R. "Industrial Ecology and Risk Analysis" in Handbook of Industrial Ecology. L. Ayres and R. Ayres, eds. United Kingdom: Edward Elgar, 2001.

Kunreuther, Howard, and Patricia Grossi. "The Role of Uncertainty on Alternative Disaster Management Strategies." April 2001. Available at http://grace.wharton.upenn.edu/risk/downloads/01-15-HK.pdf (accessed 11 July 2002).

Library Project Prism. Available at www.library.cornell.edu/preservation/prism.html (accessed 1 August 2002).

McClure, Charles R., J. Timothy Sprehe, and Kristen Eschenfelder. Performance Measures for Federal Agency Websites. Available at www.defenselink.mil/webmasters/technical/measures/measures.pdf (accessed 11 July 2002).

McNamee, David. "Assessing Risk Assessment," in New Perspectives on Healthcare Internal Auditing. Available at www.mc2consulting.com/riskart2.htm (accessed 11 July 2002).

Mercator Web Crawler. Available at www.research.compaq.com/SRC/mercator/ (accessed 1 August 2002).

Miller, Jean C. "Risk Management for Your Web Site." IRMI.com. 2000. Available at www.irmi.com/expert/articles/schoenfeld003.asp (accessed 11 July 2002).

The National Risk Management Research Laboratories. Available at www.epa.gov/ordntrnt/ORD/NRMRL/ (accessed 1 August 2002).

Paperwork Reduction Act. Available at http://frwebgate.access.gpo.gov (accessed 11 July 2002).

PANDORA Archive. Available at http://pandora.nla.gov.au/index.html (accessed 1 August 2002).

Preserving Presidential Library Websites. San Diego Supercomputer Center. Available at www.sdsc.edu/TR/TR-2001-03.pdf (accessed 1 August 2002).

Project Prism. Available at www.prism.cornell.edu/ (accessed 1 August 2002).

The Royal Swedish Web Archive. Available at www.ifla.org/IV/ifla66/papers/154-157e.htm (accessed 1 August 2002).

The Four Phases of Project Prism

Project Prism's four main phases map well to the typical stages of risk management programs.

Phase 1: Risk Identification--the process of detecting potential risks or hazards through data collection. The team is using both automated and manual techniques to collect data and characterize potential risks to Web resources. Web crawling is one effective way to collect information about the state of Web pages and sites. The Prism team employs the Mercator Web crawler to collect and analyze data to test hypotheses about the relationship between observable characteristics of Web resources and threats to longevity.

Phase 2: Risk Classification--the process of developing a structured model to categorize risk and fitting observable risk attributes and events into the model. The Prism team combines quantitative and qualitative methods to characterize and classify the risks to Web pages, Web sites, and the hosting servers.

Phase 3: Risk Assessment--variables to consider include the value of assets, possible threats, known vulnerabilities, likelihood of loss, and potential safeguards. The team is defining a data model for storing risk-significant information. This model reflects key attributes about Web assets, observed events in the life of these resources, and information about the resources' environment. A key aspect of risk assessment in Prism is defining and detecting significant patterns that may exist in this data.

Phase 4: Risk Analysis--determines the potential impact of risk patterns or scenarios, the possible extent of loss, and the direct and indirect costs of recovery. This step identifies vulnerabilities, considers the willingness of the organization to accept risk given potential consequences, and develops mitigation responses. Artificial intelligence methods, decision support systems, and profiles of organizations all support risk analysis. The resulting knowledge and exposure databases provide evolving sources of information for analyzing potential risks. Project Prism is developing a knowledge base that could be characterized as a risk analysis engine.
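
As a rough illustration of the Phase 3 data model, the Python sketch below groups the three kinds of information named there: attributes of a Web asset, observed events in its life, and facts about its environment. The class names, field names, and event labels are all hypothetical; the article does not publish the project's schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ObservedEvent:
    """Something noticed about a resource on a given day."""
    observed_on: date
    kind: str          # e.g. "http_404", "content_changed" (hypothetical labels)
    detail: str = ""


@dataclass
class WebAsset:
    """A monitored page or site with its attributes, events, and environment."""
    url: str
    attributes: dict = field(default_factory=dict)
    environment: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def record(self, event):
        """Append one observed event to this asset's history."""
        self.events.append(event)

    def count(self, kind):
        """Number of recorded events of one kind, a starting point for pattern detection."""
        return sum(1 for event in self.events if event.kind == kind)
```

Counting events of one kind is the simplest possible pattern; the risk patterns and scenarios of Phase 4 would presumably combine many such observations.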

Web Site Care

Comprehensive care of a Web site must include:

* Hardware and software environment, including any upgrades to the operating system and Web server, the installation of security patches, the removal of insecure services, use of firewalls, etc.

* Administrative procedures, such as contracting with reputable service providers, renewing domain name registration, etc.

* Network configuration and maintenance, including load balancing, traffic management, and usage monitoring

* Backup and archiving policies and procedures, including the choice of backup media, media replacement interval, number of backups made, and storage location

* Physical location of the server and its vulnerability to fire, flood, earthquake, electric power anomalies, power interruption, temperature fluctuations, theft, and vandalism

Anne R. Kenney is assistant university librarian at Cornell University Library. She can be reached at ark3@cornell.edu. Nancy Y. McGovern is the digital preservation officer and heads the Research Department in Instruction, Research, and Information Sciences (IRIS) at Cornell University Library. She also is the co-editor of RLGDigiNews. She can be reached at nm84@cornell.edu. Peter Botticelli, Richard Entlich, Carl Lagoze, and Sandra Payette also co-wrote this article.
COPYRIGHT 2002 Association of Records Managers & Administrators (ARMA)
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2002, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Author:McGovern, Nancy Y.
Publication:Information Management Journal
Geographic Code:1USA
Date:Sep 1, 2002
Words:6014