
Classifying electronic documents: a new paradigm: the U.S. Department of Education set out to determine whether large volumes of electronic data can be indexed cost-effectively. (LessonsLearned).


At the Core

This article:

* Explains how the Department of Education (DoEd) used artificial neural network software to classify electronic documents

* Reveals the challenges in analyzing and categorizing mass amounts of electronically stored data

The introduction of the desktop computer and its widespread adoption by government and the private sector in the 1980s raised new problems for records managers. The computer increased information volume and made data easier to manipulate, and networks made information easier to transmit, further adding to information mass.

When e-mail was added to the mix, quantity began to overwhelm traditional information management systems. Records managers simply could not apply the old classification and storage systems to the huge amount of information generated.

Volume was not the only problem, of course. There was also the issue of what medium to store the information on and how to migrate it forward as new versions of software were developed.

The underlying assumption of the current paradigm for records and information management is that there is now too much information to manage at the document level. The file folder has become the basic unit of control. However, computer users often do not file the information in neat folders. Attempts to force users to do this filing have been largely unsuccessful. The time has come for a new paradigm.

The monster that has created this mountain of information can be tamed and turned into the engine that controls it -- and to a far better degree than has been possible in the past. With the power of today's computers -- those found in virtually every office -- indexing can be done down to the word level, and retrieval can be virtually instantaneous. If only there were a way to categorize the information.

Enter the new paradigm.

Case in Point

The U.S. Department of Education (DoEd) had used artificial neural network technology to analyze and categorize some electronic materials at the end of the Clinton Administration. That project was successful, and the department wanted to see if the technology could be applied to its vast electronic information holdings, consisting of word processing documents, spreadsheets in various formats, databases (both off-the-shelf and proprietary), and e-mail messages. The documents could not be deleted because some were record material deserving of retention for varying time periods according to the department's records retention schedule. The cost of storing and maintaining this material was a drain on the budget and hampered ongoing activities.

To see what could be accomplished, the department set up a demonstration project using e-mail and word processing documents from individuals who left the agency at the end of the Clinton Administration. Approximately 4 gigabytes of e-mail and half a gigabyte of word processing documents were provided for the project. Fairfax, Virginia-based STG Inc., the same company that had done the earlier project, was employed to undertake this one, with the significant addition of an experienced records manager.

Prior experience within the department and elsewhere in the federal government with desktop-deployed records management applications had been less than satisfactory. Users were reluctant or adamantly unwilling to use the software, results were spotty, and constant training and retraining were necessary for staff, particularly where constant turnover was the norm. With these factors in mind, the project's goal was to see if a system could be devised to deal with the material using artificial neural network technology and to see if it could be employed as a network service.

The artificial neural network software was Hummingbird's Knowledge Manager Workstation. It uses a mathematical construct that analyzes the frequency and placement of words and concepts within documents to place them within a multidimensional grid. By manipulating the grid's various parameters, one can control the level of inclusiveness within the "clusters" of documents or groups of documents around specific ideas or concepts, thus increasing categorization process accuracy. The project team nicknamed the software "ANNie."
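The internals of the Hummingbird product are not described here, so the following is only a rough sketch of the general idea of placing documents by word frequency. It swaps in a plain word-count vector and cosine similarity rather than a neural network, and it ignores word placement. The cluster profiles, the function names, and the `inclusiveness_threshold` parameter are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word occurrences in a document."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_cluster(text, cluster_profiles, inclusiveness_threshold=0.2):
    """Return the best-matching cluster name, or None if nothing is close enough.

    Lowering the (hypothetical) threshold makes clusters more inclusive.
    """
    vector = term_frequencies(text)
    best_name, best_score = None, 0.0
    for name, profile in cluster_profiles.items():
        score = cosine_similarity(vector, profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= inclusiveness_threshold else None


# Invented cluster profiles, standing in for what training on a sample would produce.
profiles = {
    "grants": term_frequencies("grant award application funding grant"),
    "personnel": term_frequencies("leave hiring personnel position vacancy"),
}
```

With these invented profiles, a message about a grant application lands in the "grants" cluster, while a one-word message such as "yes" matches nothing and is left uncategorized.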

It quickly became apparent that the technology was very powerful and could categorize massive amounts of information, do it in a very short time span, and attain accuracy levels greater than could be expected from even the best of file clerks. Furthermore, accuracy was greatly enhanced when the number of possible categories was narrowed, leading to the project's first major decision: Focus would be not on all possible data within the agency but instead on individual work groups where the number of subjects addressed was limited by the work group's scope. This decision also made it easier to deal with related questions. For instance, access levels could be maintained at the office level just as they were for local area network access. It also allowed for ready application of the "office of record" principle. A memo from the secretary of education to the staff, for example, would show up in every individual's mailbox, but it was only necessary to maintain the outgoing copy from the secretary's office.

The approach mimics what traditional records managers do when dealing with paper records: They begin with an office and analyze the materials found there. Categorization leads to grouping documents into similar types, and each type is then given a retention period. Once established for an office, the retention schedule created can be applied until some future event triggers a need for change.

Using the materials provided by the DoEd, the neural network software was applied to a sample of each work group's documents. The sample generated the words and concepts used to create clusters of knowledge around which the documents could be grouped. Any group of documents may be divided into a number of clusters as long as the number is a square number (e.g., 4, 9, 16, 25). Clusters may be further divided into subclusters (again a square number), which may be subdivided into other subclusters until the desired level of categorization is achieved. Focusing on a particular office or work group keeps the number of clusters and subclusters smaller and, therefore, easier for the records manager to work with.
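The square-number rule and the nesting of subclusters can be sketched in a few lines. The dotted numbering scheme and both function names below are assumptions made for illustration, not the product's own identifiers.

```python
import math


def grid_side(cluster_count):
    """Return the side length of the square grid, or raise if the count is not square."""
    side = math.isqrt(cluster_count)
    if side < 1 or side * side != cluster_count:
        raise ValueError(f"{cluster_count} is not a square number")
    return side


def cluster_paths(counts_per_level):
    """List hierarchical cluster IDs; [4, 4] yields '1.1' through '4.4'."""
    paths = [""]
    for count in counts_per_level:
        grid_side(count)  # every level must itself be a square number
        paths = [f"{p}.{i}" if p else str(i)
                 for p in paths
                 for i in range(1, count + 1)]
    return paths
```

Four clusters each split into four subclusters already gives 16 leaf categories, which illustrates why limiting the scope to one work group keeps the map manageable.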

Examining and refining the clusters is a process called "training" the artificial neural network. During the training process, the records manager sees samples of the documents as they would be categorized by clusters and can determine how well the clusters would work against the whole body of data. The end result of training is a "cluster map" that is saved so it can be applied to the work group's whole body of data. The software categorizes the entire corpus of documents around each cluster and subcluster.

If the records manager does the job correctly, these clusters and subclusters equate to categories of documents that have the same retention period. The team found that the current, approved DoEd retention schedule could be used with minimal modification -- adding numbers in a hierarchical arrangement -- to define each cluster and subcluster as containing documents falling under one category in the retention schedule. This enabled the software to deal with the contents of each cluster and subcluster as a unit to which the same retention criteria were applied.
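One way to picture the hierarchical numbering is a lookup table keyed by cluster number, where a subcluster without its own entry falls back to its parent's. The schedule entries, category names, and retention periods below are invented for illustration and are not taken from the DoEd schedule.

```python
# Hypothetical schedule: cluster ID -> (category name, retention in years).
RETENTION_BY_CLUSTER = {
    "1": ("Correspondence", 3),
    "1.2": ("Policy correspondence", 10),
    "2": ("Grant case files", 6),
}


def retention_for(cluster_id, schedule=RETENTION_BY_CLUSTER):
    """Find the most specific schedule entry for a cluster ID such as '1.2.3'."""
    parts = cluster_id.split(".")
    while parts:
        key = ".".join(parts)
        if key in schedule:
            return schedule[key]
        parts.pop()  # fall back to the parent cluster
    return None
```

Here subcluster "1.2.3" inherits the entry for "1.2", while "1.4" falls back to the top-level entry for cluster "1".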

The Test of Time

For the demonstration project, the records manager examined all documents and e-mails to see that the software properly categorized them. Proper categorization was defined as ensuring that all documents that should be saved for a certain period of time according to the records retention schedule were placed in categories that were scheduled for that period, or longer. Of the 4.5 gigabytes of information provided by DoEd, the test sample of 3.8 gigabytes was from five working groups. (The remainder were from small groups or individuals; the study's time constraints did not allow analysis of this material.) Of the more than 90,000 e-mails and word processing documents, the records manager was unable to find even one whose categorization would have resulted in an incorrect retention period.
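The definition of proper categorization given above reduces to a simple comparison: the retention period assigned by the software must be at least as long as the period the schedule requires. A minimal sketch, with hypothetical function names and a made-up tuple layout for the audit data:

```python
def is_properly_categorized(required_years, assigned_years):
    """A document is acceptable if it is kept for the required period or longer."""
    return assigned_years >= required_years


def audit(documents):
    """Return IDs of documents whose assigned retention is too short.

    `documents` is an iterable of (doc_id, required_years, assigned_years).
    """
    return [doc_id
            for doc_id, required, assigned in documents
            if not is_properly_categorized(required, assigned)]
```

Under this definition, keeping a document longer than required counts as correct; only a period that is too short is an error.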

The software did not successfully categorize all documents. Each cluster map provided an area for documents that could not be categorized correctly, and these were examined as part of the process. Most were short documents that defy categorization even by the most experienced records manager: for example, e-mail messages that made sense only to the recipient, such as those consisting of the single word "yes," "no," or "3:00." Also in this uncategorized area were jokes and cartoons distributed by e-mail. Since none of these are records that need to be kept for any period of time, their inclusion in the uncategorized area is not a problem.

Not all information created by an office has enduring value, but the higher the individual's level in an organization's hierarchy, the greater the percentage of material with enduring value is likely to be. This is true in paper and was true in electronic form as well.

While no count was kept, it appeared there was more material of transitory value in electronic media than in paper. For example, scheduling a meeting may generate numerous e-mails back and forth before a time is found for all participants.

Artificial neural network software categorizes large documents and those with more than one subject in more than one cluster. This provides assurance that all documents are properly categorized, for the more complex documents will tend to move to the category with the longest retention period, thus keeping such documents from being destroyed too early.
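The effect described here, where a document that falls into several clusters ends up governed by the longest applicable period, can be sketched as taking a maximum. The function name and the sample periods are assumptions for illustration only.

```python
def effective_retention(cluster_ids, years_by_cluster):
    """Return the longest retention period among the clusters a document falls into."""
    periods = [years_by_cluster[c] for c in cluster_ids if c in years_by_cluster]
    return max(periods) if periods else None


# Hypothetical retention periods, in years, for three clusters.
years = {"1": 3, "1.2": 10, "2": 6}
```

A long report placed in both cluster "1" and cluster "1.2" would be held for the longer of the two periods, so it cannot be destroyed on the shorter schedule.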

Having determined that artificial neural network software could categorize documents into groups useful for records management, the next task was to determine whether it could be adapted into a network service. Because of experience with user reluctance to deal with records management software, the system had to be transparent. Users could produce documents and e-mails as they always had, storing them in whatever manner they found most useful to their day-to-day operations. Behind the scenes, however, the software would capture those documents with retention requirements and ensure their retention and retrievability.

The project's next phase was to develop software that would transport the documents from the clusters to records management application (RMA) software. The requirement was twofold: 1) to transport not only the document, but all the metadata required by DoD 5015.2 and 2) to use at least two different RMA software packages to prove the concept was not software specific. This was a complex process. The project's programmer was able to develop and install such software based on project specifications. From a records management standpoint, categories created in the RMA software corresponded to the DoEd's records retention schedule. The RMA software then stored copies of the documents in an offline storage area according to the time periods required. This left the documents online and available to the users.

The Business Application

The scope of the demonstration project did not include implementation in the agency, but it is identified as the next step and some thought has been given to accomplishing it.

The records manager began by identifying work groups, just as in a paper world, and gaining access to their electronic documents. He then built a cluster map for each individual work group. Once satisfied with the results, he applied the cluster map to all the data from that group. The software took over, assigning documents to clusters and moving the documents and their metadata into the RMA if the retention period warranted. The records manager examined the uncategorized material to make sure none of it should be retained. This took care of the legacy material from the last several years when no application was available to deal with electronic documents.

The records manager then periodically applied the cluster map to newly created material. The timing of this review was a business decision based on staff levels and the work group's level of importance. For example, it may be desirable to review the documents in the secretary of education's office weekly, while the deputy secretaries' files might merit monthly attention, and the Office of Records Management's files could wait for an annual review. The software again took over, sorting the data created since the last application of the system, categorizing it, and sending the appropriate documents and their metadata to the RMA. The records manager again checked the uncategorized documents. As time went on and the agency developed new categories of documents, more items began to show up in the uncategorized area. This was a signal to the records manager that the cluster map was outdated and that he needed to take another sample and begin the process all over again. In government agencies this will occur when a new agency head arrives or when new programs are developed.
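The signal described above, a growing uncategorized area, could be watched with a simple ratio check at each review. The article gives no numeric threshold, so the baseline share, the tolerance, and the function names below are purely illustrative assumptions.

```python
def uncategorized_share(uncategorized, total):
    """Fraction of documents in a review run that landed in no cluster."""
    return uncategorized / total if total else 0.0


def cluster_map_outdated(latest_run, baseline_share, tolerance=0.05):
    """Flag the map for retraining when the uncategorized share drifts upward.

    `latest_run` is (uncategorized_count, total_count) for the newest review;
    `baseline_share` is the share observed when the map was first applied.
    """
    uncategorized, total = latest_run
    return uncategorized_share(uncategorized, total) > baseline_share + tolerance
```

In practice the records manager would still inspect the uncategorized material by hand; a check like this only suggests when it is time to take a new sample.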

The New Paradigm

The new paradigm controls documents, not folders. It is, in fact, a whole new approach. Under the 19th century's copybook system, items were filed chronologically, and retrieval depended on the indexer's skill. With the computer, it doesn't matter how the item is filed, just that it is retrievable. Indexing is done on every word within the document -- excepting conjunctions, articles (a, an, the), and interjections -- and retrieval is nearly instantaneous and foolproof because the software also indexes phrases and concepts. In this case, Hummingbird's Fulcrum FIND software was used to search for documents within the system and view the results.
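Word-level indexing of this kind is commonly done with an inverted index, which maps each word to the documents that contain it. The sketch below is a generic illustration, not a description of how Fulcrum FIND works; the short stop-word list is an assumption, and phrase and concept indexing are left out.

```python
import re
from collections import defaultdict

# Illustrative stop words: articles plus a few conjunctions and interjections.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "nor", "oh", "ah", "wow"}


def tokenize(text):
    """Split text into lowercase words, dropping stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOP_WORDS]


def build_index(documents):
    """Map every indexed word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every indexed word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Because lookup goes straight from word to document, it does not matter where or how the user originally filed the item.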

Another change is the approach to categorization. With control and retrieval of documents handled by the computer, records managers are now more interested in how long a document must be kept rather than its subject category. Classification used with retention rules will enable storing information offline in groups with like retention periods, making disposal at the end of a document's useful life easier. If, for example, optical disc is chosen for storage and all documents scheduled to be retained until January 1, 2025, are stored on one disc, on that date the records manager will simply dispose of that disc. Since the computer doesn't care on which disc information is stored, it might as well be stored in the manner most convenient for the records managers.
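The optical disc example amounts to grouping documents by disposal date so that each storage volume can be discarded as a whole. A minimal sketch, in which the function names and the sample document IDs and dates are hypothetical:

```python
from collections import defaultdict
from datetime import date


def group_by_disposal_date(documents):
    """Group document IDs into storage volumes keyed by disposal date.

    `documents` is an iterable of (doc_id, disposal_date) pairs.
    """
    volumes = defaultdict(list)
    for doc_id, disposal in documents:
        volumes[disposal].append(doc_id)
    return dict(volumes)


def volumes_due(volumes, today):
    """Return the disposal dates of volumes that can now be discarded."""
    return sorted(d for d in volumes if d <= today)


# Invented sample: two documents share a disposal date and so share a disc.
sample = [
    ("memo-17", date(2025, 1, 1)),
    ("report-4", date(2030, 1, 1)),
    ("memo-22", date(2025, 1, 1)),
]
```

Everything due on January 1, 2025, sits on one volume, so disposal on that date is a single action rather than a document-by-document review.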

Under the old paradigm, records managers spent a good deal of their time educating the staff about how to file, including cajoling and encouraging them to follow a file plan. The new paradigm makes file plans for electronic documents unnecessary. Records managers will spend more time managing information and far less time coaching.

Finally, the new paradigm increases the importance of records managers. They must control the application of the software that manages the agency's official records. Under the old paradigm, records managers established a system which was implemented by secretaries and file clerks. With the new paradigm, records managers not only design the systems, but they also supervise and oversee implementation, assisted by a staff of records management experts.

(Editor's Note: The products mentioned in this article are examples from the author and do not constitute endorsement by ARMA International.)

Donald B. Schewe, Ph.D., CPM, FAI, currently works part-time for STG Inc. as a Records Management Consultant with clients in the federal government. He may be reached at dschewe@mindspring.com.
COPYRIGHT 2002 Association of Records Managers & Administrators (ARMA)
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2002, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Author:Schewe, Donald B.
Publication:Information Management Journal
Geographic Code:1USA
Date:Mar 1, 2002
Words:2482