Recordkeeping in the 21st Century.
According to the Web site guide to Buddhist memorial services, on October 24 of every year, at the Daioh Temple of Rinzai Zen Buddhism in Kyoto, Japan, the head priest conducts a prayer for lost information. Recognizing that "many 'living' documents and software are thoughtlessly discarded or erased without even a second thought," the sect hopes that through the holding of its "information service" the "'information void' will cease to exist."
Paradoxically, at the same time as institutions in the United States and elsewhere may be in danger of losing their collective memory due to routine deletion of information in electronic form, the typical end user is most likely experiencing the opposite sensation: drowning in information overload. A recent Washington Post cover story characterized the time we live in as the "Too-Much-Information Age," going so far as to declare in a bold headline: "Tidal Wave of Information Threatens to Swamp Civilization" (Achenbach 1999).
Both the perceived infoglut and the infovoid are increasing exponentially and in lock step with each other, especially over the past decade, due to the advent of networked computer systems, the Internet, and the World Wide Web. According to Stuart Madnick (1999), "Advances in computing and networking technologies now allow huge amounts of data to be gathered and shared on an unprecedented scale. Unfortunately, these new-found capabilities by themselves are only marginally useful if the information cannot be easily extracted and gathered from disparate sources, if the information is represented differently with different interpretations, and if it must satisfy differing user needs."
This raises a question: How can organizations get a better handle on managing information flow for both the short and the long term? Assuming that saving every bit and byte of data now being created is out of the question, how can public institutions go about managing what is perceived as appropriate for preservation (i.e., creating official records out of the deluge of data created and received in electronic form)? Are there methods or tools available that can assist?
There has been increased interest and attention in scientific, academic, and governmental circles on the subject of using computer-generated metadata as an information tool to preserve the context, content, and structure of electronic records (Madnick 1999).(1) In its most generic sense, metadata simply means "data about data." As one recent paper put out by the U.S. Patent and Trademark Office notes (Purcell et al. 1999), "Traditionally, the term 'metadata' has been widely used to characterize the descriptive information that will support search and retrieval of both paper and electronic material. Over the past three or four years the use of the term metadata has expanded to include additional information that must be acquired and retained in order to effectively manage electronic records over long period[s] of time, including permanently."
Even before the advent of computers, we lived for a long time in a world populated by metadata -- we just may not have viewed the world in such terms. The library community has employed metadata systems for more than 100 years, classifying information by means of Dewey Decimal numbers and the Library of Congress' alphanumeric system. Documents in libraries have also been made accessible by using preexisting lists of subject headings or descriptors. Rating systems such as that of the Motion Picture Association and the recent television ratings developed in connection with the V-chip also function as standardized metadata about the contents of those media. Even classification schemes that describe the contents of consumer goods -- such as the long-standing requirements for labeling on cigarettes, food, and tires -- are types of metadata specifications that add important, useful knowledge of the objects they describe. Arguably, all contextual information that classifies or interprets data may be validly thought of as forms of metadata.
However, the introduction of computers, and particularly the Internet, into everyday life has exponentially expanded the universe of metadata that each of us is responsible for creating but of which we remain largely unaware. For every document created on a popular computer software interface such as Windows or Lotus Notes, a wealth of metadata is retained that does not appear on the screen -- encompassing everything from descriptions of the properties of documents (e.g., character, word, and line counts), personal settings and preferences for fonts and styles used, and document revision information, to embedded codes that are virtually inaccessible. Most people have no idea of the quantity or type of metadata generated in association with individual computer-generated documents.
The Microsoft Corporation was publicly criticized recently for collecting information on its customers at the time of registration that included an embedded identification code in Windows 98 allowing for the matching up of individually created documents with their human creators (Markoff 1999). At least partially in response to this criticism, Microsoft has published a series of papers on "How to Minimize Metadata" in some of its proprietary applications.
Indeed, the very nature of the Internet (i.e., its reliance on packet-switching as the mode for information being propagated) means that message traffic accumulates an audit trail of metadata in the form of routing information which can be easily traced -- unless one takes concerted action to maintain anonymity. It has become common knowledge (although it is still easily overlooked) that virtually every move one makes surfing in cyberspace -- literally every keystroke entered on a home or desktop computer -- is potentially traceable by being recorded on some server or hard disk, either locally or on a remote Web site. Much of this is embedded in subterranean places in computers and is inaccessible to users without sophisticated knowledge of the inner workings of the computer hardware and software.
Focusing solely on end users, one key question regarding using metadata as an information tool is whether it is possible to create, out of a sea of potential proprietary and ad hoc end-user generated metadata, one or more standardized metadata sets that add to the value of the underlying data so as to contribute to what is termed here the underlying trustworthiness of the records themselves (i.e., their completeness, authenticity, and preservability over time).(2) But which metadata is itself worthy of preservation? Obviously, choices have to be made because the logical structure of electronic documents does not necessarily translate into their physical appearance on the computer screen. This may greatly complicate the question of what constitutes a complete, authentic, and reliable record of transactions in cyberspace and how such records are to be preserved over time -- something the information management community has been contemplating.
In the paper-based world, there are well-settled rules and expectations and centuries of case law that govern what establishes trustworthiness in terms of rules concerning what constitutes "best evidence" as to originals or copies of a document, what challenges may be brought to question the authenticity of a document, and whether a document may be admitted into evidence for the truth of its contents as an exception to the rule against hearsay (Perritt 1992; Peritz 1986). Charles Merrill has noted that "[w]hen computers were first introduced into the business environment, and electronic records were still novel, proof of process was frequently required as a condition to admission under ... [the] business records exception to the hearsay rule" (1999). However, "[s]ince that time, courts have become more accepting of computer printouts without much authentication, overruling 'Garbage In, Garbage Out' arguments" (Merrill 1999; Jablon 1997).
In Merrill's view, the fact that up until now electronic records have resided in essentially closed systems has meant that chain-of-custody objections have not been given much attention by the courts. The Internet, as the paradigmatic open system, now potentially changes everything by demanding increased emphasis on "authentication of identity of originators of information" and "authentication of record integrity."
The following is a brief overview of three very different types of policy initiatives that may yet contribute to a comprehensive metadata approach to managing information as we head into the next century. Because e-mail was arguably the first "killer application" in cyberspace, it is appropriate to look first at e-mail and its metadata issues, and then at the U.S. government's response to the litigation threat first posed by the PROFS case. Other initiatives, including the use of digital signatures in building a public key infrastructure, as well as the development of open systems and specifications represented by XML and RDF, will also be addressed.
Front-End Metadata Tagging and Other Fallout from the PROFS Litigation
For the past decade, the U.S. government has struggled with one aspect of the problem of managing information, having been confronted in a series of high-profile federal court cases with the not-so-simple task of figuring out (1) how to get a handle on enormous quantities of electronic mail and word processing documents and (2) how to preserve those designated as long-term or permanent records. Although the holdings of the courts in these cases have not been expressly framed in terms of metadata, it is clear that the courts have viewed electronic records as more complete than corresponding hard copy printouts. It is therefore worth examining the lessons of the PROFS case for what it may instruct about public institutions using standardized metadata for preserving electronic records.
As previously reported in Records Management Quarterly (Pasterczyk 1998), the PROFS e-mail case -- formally captioned Armstrong v. Executive Office of the President -- was brought in January 1989 as an effort to preserve, under the Federal Records Act, electronic records on National Security Council (NSC) PROFS e-mail backup tapes that pertained substantially to the Iran-Contra Affair. The 1993 core Armstrong holding was that electronic versions of e-mail, as compared with hard-copy printouts required to be maintained under existing policies, contain "qualitatively different" information of "tremendous historical value in demonstrating ... what officials knew, and when they knew it," and therefore must be managed separately (Armstrong v. Executive Office of the President, 810 F. Supp. 335, 341 [D.D.C. 1993]). A higher appeals court agreed with the need for "who knew what when" information, going so far as to say that electronic versions of e-mail were "at most, kissing cousins" to their paper counterparts, and graphically describing the resulting hard-copy records that lacked missing transmission and receipt information as "dismembered," "amputated," and "lopped-off" (Armstrong v. Executive Office of the President, 1 F.3d 1282, 1285, 1286 [D.C. Cir. 1993]).
What these courts had in mind in terms of missing metadata was transmission and receipt information that amounted to an intelligent representation of the name of the sender of the e-mail, the names of all recipients of the e-mail, the date of the e-mail's transmission, and the date and time of any acknowledgments of receipt.(4) Armstrong was the first judicial holding in the area of the federal records laws that emphasized the importance of managing the electronic versions of records due to their value-added metadata aspects.(3) In the wake of the decision, at least three important developments with the potential for governmentwide effect have emerged.
First, the NSC and, in 1994, the remaining components of the Executive Office of the President (EOP) (including the Office of Management and Budget, the Office of the U.S. Trade Representative, and others) implemented a form of electronic recordkeeping for e-mail. This was handled by (1) customizing their existing proprietary e-mail packages (such as All-in-1, Lotus cc:Mail, and MS Outlook) to provide for embedded metadata in the form of a record status field or label and (2) preserving the electronic versions of such records as presumptively permanent records for transfer in CD-ROM form for eventual accessioning at the National Archives under approved records schedules. Users of the various EOP e-mail systems are prompted to tag each of their e-mail messages as record or non-record before they are sent, in accordance with published guidance. The consequences of failure to tag are either that messages cannot be sent, or in the case of the largest system, the e-mail is designated as "record" by default (Armstrong v. EOP, 877 F. Supp. at 715).
Arguably, the requirement that users contemporaneously designate the record status of messages at the front end of the records life cycle constitutes a key element in ensuring that such messages are preserved in a manner that maximizes their trustworthiness. Once sent to a centralized recordkeeping "bucket," the record-designation metadata with the accompanying records is bound to the record and is non-revisable by the end user (subject only to being changed by an authorized records officer). The system in place at the EOP for the past five years has been rudimentary but has worked to preserve large quantities of e-mail in electronic form, including those e-mails with any word processing attachments.
In building on this model, one question is whether end users in the next century will tolerate increasing calls by archivists and records managers to add more complexity to front-end document management labeling schemes (through means of drop-down menus to be responded to, etc.), providing for more nuanced front-end designations of record type and retention period. For example, how many keystrokes would the reader be willing to add to each e-mail message to ensure long-term preservation in the appropriate bucket in cyberspace?
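The front-end tagging scheme described above can be reduced to a simple sketch. In the hypothetical Python below, only the "record"/"non-record" labels and the default-to-"record" rule are drawn from the EOP practice described in the text; all function and field names are illustrative assumptions.

```python
# Illustrative sketch of front-end record-status tagging for e-mail.
# The "record"/"non-record" labels and the default-to-"record" rule
# mirror the EOP practice described in the text; everything else
# (names, structure) is hypothetical.
from types import MappingProxyType

VALID_STATUSES = {"record", "non-record"}

def tag_message(message, status=None):
    """Bind a record-status label to a message before it is sent."""
    if status is None:
        status = "record"  # untagged mail defaults to "record"
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown record status: {status!r}")
    tagged = dict(message, record_status=status)
    # A read-only view: once bound, the designation is non-revisable
    # by the end user (only the underlying dict, held by an authorized
    # records officer, could change it).
    return MappingProxyType(tagged)

msg = tag_message({"from": "user@example.gov", "subject": "schedule"})
print(msg["record_status"])  # -> record
```

The read-only mapping stands in for the "centralized recordkeeping bucket": the end user can read the designation but cannot revise it after sending.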
Second, in November 1997 the Department of Defense issued DoD Standard 5015.2-STD, a design criteria standard for electronic records management software applications, and has been testing and certifying proprietary software for compliance with the standard (see http://jitc-emh.army.mil/recmgt/#standard). The standard, which builds on a conceptual model developed at the University of British Columbia (see http://www.slais.ubc.ca/users/duranti/intro.htm), includes the first-ever federal definition of recordkeeping metadata.(5) Standard 5015.2 takes into account the core Armstrong metadata while more broadly encompassing a "minimum set of baseline functional requirements" consistent with applicable law that purports to be "applicable to all records management applications regardless of organizational and site-specific implementations." The standard expressly includes provision for end-user contemporaneous tagging of record status.
In a letter to Arthur Money in November 1998, Archivist of the United States John Carlin gave his qualified endorsement that the DoD standard conforms with certain baseline recordkeeping requirements (Carlin 1998). It is expected over the next decade that sectors of the federal government will take advantage of these certified software products (and their successors) for at least some portions of their office records after thorough assessment is made of each agency's current business needs, technical capabilities, and other legal requirements.
Third, in March 1999, in light of a subsequent post-Armstrong round of litigation in Public Citizen v. Carlin (2 F.Supp.2d 1 (D.D.C. 1997), appeal pending [D.C. Cir.]), the Archivist of the United States issued Bulletin 99-4, which requires that all federal agencies retool and update their existing records schedules to account for the disposition of the electronic versions of e-mail messages and word processing documents that constitute federal records.(6) Although the bulletin does not require agencies to convert to electronic recordkeeping, the effort involved in undertaking review and appraisal of thousands of existing records schedules will undoubtedly result in increased governmentwide focus and attention on what constitutes proper management and disposition of electronic versions of records created on e-mail and word processing systems.
Use of Digital and Electronic Signatures
Metadata in the form of digital signatures bound to the contents of individual documents will become more commonplace in the next decade. With the advent of the Internet and the streaming of information from the uncharted, open environment which the Internet represents, it appears that public institutions will consider and incorporate as part of their best practices the use of new technologies, such as digital signatures and public key encryption, to ensure that authentic and trustworthy information is captured as part of their dealings with the public at large. The impetus for digital signature technology can be seen as a variant of the same basic "who knew what when" Armstrong metadata formulation, where the reasons for utilizing digital signatures translate broadly into four categories (Merrill 1998):
1. Ensuring the true authentication of senders through the use of asymmetric cryptography (the "who")
2. Providing for the data integrity of transactions by means of secure hash functions (the "what")
3. Providing for an accurate time stamp (the "when")
4. Allowing for a built-in method of non-repudiation of a given transaction
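Two of these elements -- data integrity via a secure hash (the "what") and a time stamp (the "when") -- can be illustrated with Python's standard library. This is only a conceptual sketch: a true digital signature uses asymmetric public/private key cryptography, whereas the keyed HMAC below is a symmetric stand-in used here solely to show hash-based integrity checking.

```python
# Sketch of the "what" (hash-based integrity) and "when" (time stamp)
# elements of a sealed transaction. A real digital signature would use
# asymmetric cryptography; the keyed HMAC here is only a stand-in.
import hashlib
import hmac
import time

def seal(document: bytes, key: bytes) -> dict:
    """Compute an integrity digest and time stamp for a document."""
    digest = hmac.new(key, document, hashlib.sha256).hexdigest()  # the "what"
    return {"digest": digest, "timestamp": time.time()}           # the "when"

def verify(document: bytes, key: bytes, seal_data: dict) -> bool:
    """Recompute the digest and compare in constant time."""
    expected = hmac.new(key, document, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, seal_data["digest"])

doc = b"Transfer 100 units to account 42."
s = seal(doc, b"shared-secret")
assert verify(doc, b"shared-secret", s)             # unaltered: passes
assert not verify(doc + b"!", b"shared-secret", s)  # tampered: fails
```

Because any change to the document changes the digest, tampering is detectable; the asymmetric-key version additionally authenticates the sender (the "who") and supports non-repudiation, since only the holder of the private key could have produced the signature.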
Following in the footsteps of the American Bar Association's 1996 Digital Signature initiative (see http://www.abanet.org/scitech/ec/isc/dsgfree.html) and various state legislative enactments (see http://www.mbc.com/ds_sum.html), the federal government is poised to embrace the use of digital signatures. For the past several years, digital signature requirements have been embodied in regulations of the Food and Drug Administration, 21 CFR Part 11 (62 Federal Register 13429 [March 20, 1997]). In August 1998, the Health Care Financing Administration proposed its own set of security and electronic signature standards for use by health plans, health care clearinghouses, and health care providers (63 Federal Register 43242 [August 12, 1998]).
Most recently, Congress passed the Government Paperwork Elimination Act (GPEA), Title XVII of Public Law 105-277, with an effective date of October 21, 1998. The GPEA requires all federal agencies to provide for the optional use and acceptance of electronic documents and signatures, and electronic recordkeeping, where practicable, as a substitute for paper by October 2003. The GPEA specifically states that electronic records and electronic signatures developed in accordance with guidance implementing the GPEA "shall not be denied legal effect, validity, or enforceability because such records are in electronic form" (Pub. L. 105-277, [sections] 1707; 112 Stat. 2681-751).
On March 5, 1999, the Office of Management and Budget issued a proposed rule for implementation of the GPEA, emphasizing that the government's electronic systems "must protect the information's confidentiality, assure that the information is not altered in any unauthorized way, and be available when needed" (64 Federal Register 10896 [March 5, 1999]). Under OMB's fleshing out of the definition of electronic signatures provided in the GPEA, such signatures would include not only digital signatures but other forms such as digitized signatures, personal identification numbers (PINs), smart cards, and biometrics. OMB expects that federal agencies will perform a thorough risk analysis in planning and implementing electronic signatures or electronic recordkeeping, including evaluating the relationships of the parties to a transaction, the value of the transaction, and the "likely need for accessible, persuasive information regarding the transaction at a later point."
Emerging Non-Proprietary Metadata Languages, Systems, and Specifications
There is no end in sight to the explosive growth of the World Wide Web and the growing dominance of standards and protocols developed expressly for the Web. A case can be made that increasingly in the next century documents will be communicated and business will be transacted via Web-based protocols, including HTML and its successors, rather than in today's proprietary software formats. To the extent this indeed comes to pass, several initiatives now being carried forward by the Massachusetts Institute of Technology's World Wide Web Consortium (W3C) hold out the possibility that future software platforms will easily allow for sets of standardized recordkeeping metadata to be incorporated into Web-authored documents.
The W3C claims a "strong interest" in metadata elements since "[m]etadata will facilitate searching, helping authors to describe their documents in ways that search engines, browsers, and Web crawlers can understand." Extensible Markup Language (XML) is a "metalanguage" constituting an "extremely simple dialect" of Standard Generalized Markup Language (SGML). (See "The XML FAQ," [subsections] A.1, A.2, available at http://www.ucc.ie/xml/)
One of the ways XML differs from HTML is that it allows information providers to define new tags and attribute names at will. These features have the potential to "become a common metadata representation in Web objects" (W3C 1998).
The W3C's resource description framework (RDF) is an application of XML that operates as a "language for defining the vocabularies for use with particular applications," which leaves the metadata authors "free to choose the vocabulary of their choice" (see http://www.w3.org/Metadata/Activity.html). The platform for Internet content selection (PICS) was originally designed as "a suite of specifications which enable people to distribute information about the content of digital material in a simple, computer-readable form." This feature allows parents and teachers to screen inappropriate material from minors. According to W3C, a future version of PICS "will be reformulated as an application of RDF."
Because XML, RDF, and PICS were all designed to absorb particular metadata vocabularies, they may yet turn out to be the open, non-proprietary platform on which future standardized recordkeeping metadata architectures are built. For example, with respect to documents encoded in XML, it would be exceedingly easy to use XML to allow for contemporaneous tagging of record type and retention period for purposes of long-term preservation.
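Such contemporaneous XML tagging might look like the following sketch, using Python's standard xml.etree.ElementTree module. The element and attribute names ("record", "type", "retention") are hypothetical, not drawn from any published schema.

```python
# Minimal sketch of contemporaneous XML tagging of record type and
# retention period. The "record", "type", and "retention" names are
# hypothetical illustrations, not part of any published schema.
import xml.etree.ElementTree as ET

def tag_record(body_text: str, record_type: str, retention: str) -> str:
    """Wrap document text in an element carrying recordkeeping metadata."""
    record = ET.Element("record", type=record_type, retention=retention)
    body = ET.SubElement(record, "body")
    body.text = body_text
    return ET.tostring(record, encoding="unicode")

xml_doc = tag_record("Quarterly budget memo.", "memorandum", "permanent")
print(xml_doc)
# e.g. <record type="memorandum" retention="permanent">...</record>
```

Because XML tag sets are user-definable, a records program could standardize such a vocabulary once and have every authoring tool that emits XML carry the designation along with the document itself.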
Recordkeeping in the 21st century will have to confront the fact that the very definition of what constitutes a record is dynamically changing. The 1990s have been marked by the widespread deployment of electronic mail, followed by the Internet and the Web. Although not yet similarly ubiquitous, the virtual record of the future is at our doorstep: word processing documents with embedded hyperlinks, voice and video mail, videoconferencing, and other forms of electronic objects in varying multimedia forms. The challenge for forward-looking organizations will be to manage this brave new world of information in ways that maximize knowledge. This may well include the possible use of management strategies keyed to metadata concepts.
In the future, digitally signed multimedia documents that are contemporaneously tagged with record status and other forms of standardized metadata in an XML-compatible format may come to be commonplace. One can hope that the scientific and information management communities, as well as various governmental and standards bodies active in this area, will find opportunities to pool their knowledge in an effort to develop recordkeeping tools they each need to help confront the common infoglut challenge.
(1) See also "Metadata Architecture," World Wide Web (W3) Consortium Web site, available at http://www.w3.org/DesignIssues/Metadata; Reference Model for Acceptable Business Communications, University of Pittsburgh Research Project on Electronic Records, available at http://www.sis.pitt.edu/~nhprc/meta96.html; David Wallace, "Managing the Present: Metadata as Archival Description," Archivaria 39 (Spring 1995), available at http://www.lis.pitt.edu/~nhprc/Pub10.html; Sue McKemmish and Dagmar Parer, "Towards Frameworks for Standardizing Recordkeeping Metadata," Archives and Manuscripts 26, no. 1 (1998). Also, a national standards committee ("C22") in the Association for Information and Image Management International has been chartered to pursue a specification for binding record metadata to specific information objects when the objects are communicated (the reliability of electronic business information or "REBI" project); see http://www.aiim.org/industry/standards/index.html.
(2) A more complete description of factors to be considered by an agency when determining the "trustworthiness" of records is to be found in the ANSI/AIIM TR31 Series "Performance Guideline for the Legal Acceptance of Records Produced By Information Technology Systems, Part IV: Model Act and Rule," [sections] 3.2.1 (1994).
(3) For a further discussion of the Armstrong case and its metadata implications, see Jason R. Baron, "E-mail Metadata in a Post-Armstrong World," Paper in Metadata '99: Third IEEE Computer Society Metadata Conference, Bethesda, Maryland, April 6-7, 1999, available at http://computer.org/conferen/proceed/meta/1999/papers/83/jbaron.html.
(4) The courts were concerned that NSC staff names (such as Oliver North) were designated on printouts only by two letters (e.g., "ON"), and that sending e-mail to a personal group called "List A" might fail to give latter-day historians needed context. Ironically, by the time of the appellate court's decision in the case, much of the EOP had switched to a proprietary e-mail system (All-in-1) which did provide the full names of senders and recipients in hard copy printouts (Armstrong v. Executive Office of the President, 877 F. Supp. 690, 715 [D.D.C. 1995]). No attempt is made here to provide a full legal analysis of the Armstrong holdings on the many jurisdictional and substantive points of law covered in the case.
(5) The DoD standard defines metadata as "[d]ata describing stored data; that is, data describing the structure, data elements, interrelationships, and other characteristics of electronic records." DoD Standard 5015.2-STD, AP1.39.
(6) Bulletin 99-4 and documents related to the Carlin case may be found on the National Archives and Records Administration's comprehensive Web site devoted to the U.S. Archivist's response to the Carlin lawsuit, available at http://www.nara.gov/records/grs20.
"A Buddhist Prayer for Lost Information." Available at http://www.thezen.or.jp/jomoh/kuyo.html.
Achenbach, Joel. "The Too-Much-Information Age." The Washington Post. March 12, 1999.
Carlin, John W. Letter to Arthur L. Money. November 18, 1998. Available at http://jitc-emh.army.mil/recmgt/nara.htm and at http://www.nara.gov/nara/pressrelease/nr99-26.html.
Jablon, Andrew. "`God Mail': Authentication and Admissibility of Electronic Mail in Federal Courts." 32 American Criminal Law Review. 1997.
Madnick, Stuart E. "Metadata Jones and the Tower of Babel: The Challenge of Large-Scale Semantic Heterogeneity." Paper in Proceedings: Metadata '99: Third IEEE Computer Society Metadata Conference, Bethesda, Maryland, April 6-7, 1999. Available at http://computer.org/conferen/proceed/meta/1999/papers/84/smadnick.html.
Markoff, John. "Microsoft Will Alter Its Software in Response to Privacy Concerns." The New York Times. March 7, 1999.
Merrill, Charles R. "Legislative Initiatives on the Shift from Paper to Electronic Paradigm: Understanding the Difference Between Closed and Open Systems." Presented at Internet Security Summit, Washington, D.C., February 8-9, 1999. Available at http://www.pkilaw.com/9902issp-Book/sld025.htm.
--. "Proof of WHO, WHAT, and WHEN in Electronic Commerce Under the Digital Signature Guidelines." April 1998. Available at http://www.pkilaw.com/proof_gc_3.htm.
Microsoft Corp. "WD97: How to Minimize Metadata in Word Documents." Available at http://support.microsoft.com/support/kb/articles/q223/7/90.asp.
Pasterczyk, Catherine E. "Federal E-mail Management: A Records Manager's View of Armstrong v. Executive Office of the President and Its Aftermath." Records Management Quarterly. April 1998.
Peritz, Rudolph J. "Computer Data and Reliability: A Call for Authentication of Business Records Under the Federal Rules of Evidence." 80 Northwestern University Law Review. 1986.
Perritt Jr., Henry H. "Electronic Records Management and Archives." 53 Univ. of Pittsburgh Law Review. 1992.
Purcell, Arthur F., et al. "Metadata Requirements for Long-Term Access and Retention of Electronic Patent and Trademark Case Files." Office of the Chief Information Officer, U.S. Patent and Trademark Office. 26 February 1999.
"SGML v. XML," Electronic Public Information Newsletter. December 1998. Available on Westlaw.
World Wide Web Consortium. "W3C Metadata Activity Statement." 1999. Available at http://www.w3.org/Metadata/Activity.html.
Jason R. Baron, J.D., is a trial attorney in the Civil Division of the U.S. Department of Justice. For the past seven years he has represented the government in Armstrong v. Executive Office of the President and related litigation. Baron's law degree is from the Boston University School of Law. In 1995, he received an Achievement Award from the National Archives and Records Administration (NARA) in recognition of "contributing significantly to the management of federal records by the development of effective electronic mail regulations." All views expressed herein are solely the author's and do not purport to represent the views of the U.S. Department of Justice or any other component of the federal government. Also, Baron wishes to acknowledge the helpful comments and suggestions of Brian Kennedy, Miriam Nisbet, and Dan Schneider in preparation of this paper. The author may be reached at Jason.Baron@usdoj.gov.
Author: Jason R. Baron
Publication: Information Management Journal
Date: July 1, 1999