Automatic categorization: how it works, related issues, and impacts on records management. (Cover Story).
Automatic categorization is currently being applied to electronic records management. Anyone hoping to effectively apply categorization needs to understand how automatic categorization works, its benefits, its limitations, and the potential impact it has on recordkeeping operations. Ultimately, automatic categorization and other text analytical tools will provide potential new career opportunities for records managers.
In order to better understand the process of automatic categorization, these key terms should be defined:
Categorization: The assigning of an object to a pre-existing subject heading in a file plan or assigning it to a given class within the taxonomy (also called classification)
Cluster: A group of objects with members that are more similar to each other than to members of any other group
Data visualization: A visual representation of corpus contents, often a topographical map or network of linked nodes
Structured data: Fielded data, or data that is generally contained in a relational database
Summarization: An abstract or a synopsis of a document
Unstructured data: Data not contained in fields (e.g., free text, audio, video, and images)
Over the last two decades, the computer's ability to process data has evolved from the domain of structured data to unstructured data. Structured data typically takes the form of tables with rows and columns. A formal mathematical model, the relational model, defines the table structures and the complete set of operations that can be performed on the data.
Structured data represents less than 20 percent of the information available. More than 80 percent of all information resides in unstructured documents. Initially, this data could not be processed in its native form. Data elements contained in documents had to be extracted and entered into structured databases before they could be processed. The primary raison d'etre for forms is to be able to easily enter data into database management systems (DBMS). Products exist today to "read" data from forms, including intelligent character recognition (ICR), but this technology is actually processing structured data. For example, ICR depends upon the position of information in the form to determine the DBMS field into which the data is to be entered.
A records manager's primary responsibility has always been to process unstructured data, generally hardcopy documents. To create a file plan, the records manager analyzes a collection of documents and creates a taxonomy that adds structure to the collection. Assigning documents to a file requires an indexing clerk to extract keywords from the document. The creation of a records control schedule requires the records manager to extract the business and legal relevance of a file series. According to a report from Autonomy Corp., the volume of unstructured information is estimated to double every three months. The rise in the portion of this material that is electronic has created an environment in which the records manager can no longer manage records without new tools at hand. These tools, as well as their advantages and limitations, are discussed later in the article. The focus will be primarily on text-based electronic records, including e-mail, Web documents, Word documents, PDF files, and text documents.
The Electronic Records Environment
An organization that has implemented a standard electronic file structure that is universally followed is extremely fortunate. In most organizations, each person has their own directory structure and e-mail folder structure. Some companies have implemented electronic records management systems (RMS), but most have used a day-forward approach, in which all newly generated and received records are placed under the automated RMS on a certain date. Electronic files on existing servers and electronic records in off-line storage (back-office files) are rarely addressed. Generally, metadata does not exist to place the back-office files under RMS control, and surveying the corpus is cost prohibitive. However, these documents are just as vulnerable to discovery as the newer documents.
Automatic categorization attempts to associate electronic records with either a predefined taxonomy or self-defining categories. An understanding of the strengths and potential limitations of automatic categorization in managing records is important if it is to be used successfully. A number of text analysis tools act as a suite to assist in this process. These include feature extraction, clustering, visualization, and summarization tools. Commercial off-the-shelf (COTS) products often combine these tools into a single categorization product, but to understand how these products work, it is crucial to understand each of the related technologies. Feature extraction and clustering are integral to the categorization process. Text visualization and text summarization are an adjunct to categorization but are useful in gaining insight into the collection prior to developing categories or to ascertain the quality of the categorization results.
The feature extraction process can be viewed as a series of filters through which the document passes. Each filter attempts to further reduce the document to its key conceptual elements and assign numeric values to these elements. The first filter segments the document into individual linguistic components. The next series of filters identify and eliminate words, phrases, and sentences that have low content value. Individual features that describe the remaining content are then identified by the feature extractor. Finally, the feature vector is generated for the feature set.
To the computer, text is a collection of characters and nothing more. Words, phrases, sentences, and paragraphs must be identified through a parsing algorithm. The parser partitions the document into paragraphs and sentences and then into parts of speech. These analyses give the computer a working representation of the text and aid in determining the document's content.
Once text of low information content has been removed from the document, feature identification can begin. One feature type often used is the frequency of occurrence of words or phrases that have high discriminating value. Words with high discriminating value relate strongly to the subject of the given document but occur infrequently in documents with a different meaning. A word may appear in several forms: singular, plural, prefixed, suffixed, and hyphenated. Stemming normalizes these word forms and improves the accuracy of the frequency counts.
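The filter pipeline described above can be sketched in a few lines of Python. The stopword list and the suffix-stripping stemmer here are simplified illustrations, not any commercial product's actual algorithm:

```python
import re
from collections import Counter

# Illustrative low-content words; real systems use much larger lists.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}

def stem(word: str) -> str:
    """Naive stemmer: strip common English suffixes to normalize word forms."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_features(text: str) -> Counter:
    """Parse, drop low-content words, stem, and count what remains."""
    words = re.findall(r"[a-z']+", text.lower())      # parsing filter
    words = [w for w in words if w not in STOPWORDS]  # low-content filter
    return Counter(stem(w) for w in words)            # normalize and count

features = extract_features("The records manager files records into the file plan.")
# "records" and "files" are stemmed, so their counts combine with
# "record" and "file" forms.
```

Because "records" stems to "record" and "files" stems to "file", the frequency counts reflect concepts rather than surface word forms, which is the point of the stemming filter.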
Authors often use different words to refer to the same concept in order to reduce redundancy, which depresses the frequency count associated with that concept. It is therefore often desirable to count concepts rather than individual words. The frequency of occurrence of discriminating words or phrases, along with other metrics, can provide a structured representation, or signature, for the document. These signatures can then be manipulated by the computer and used to assign the associated documents to appropriate categories.
A vector is a physical quantity that has a magnitude and direction. Vectors are represented as a series of numbers, in which each number represents a magnitude, or distance, in a given direction. A feature vector is used to represent the document's feature set, or the document's signature. Each number in the vector is a magnitude associated with a feature. If a categorization system determined that "file plan," "records inventory," and "records control schedule" were highly discriminating features for the entire corpus that the system was to address, then every document in the corpus would have an element in its associated feature vector for each of these three records-management-oriented features. Tables 1 and 2 illustrate this concept.
The five articles in Table 1 form a small corpus. Column 1 lists the titles of the five documents. The remaining columns are the features that the categorization system has decided to use to build the feature vector representing each article in the corpus. The feature values are the frequencies with which each phrase occurs in each article. The resulting feature vectors are listed in Table 2.
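Under the assumptions of this example, computing a feature vector amounts to counting how often each discriminating phrase occurs in a document. A minimal Python sketch, using the nuclear-energy features from the tables (the document text is invented for illustration):

```python
# Discriminating phrases chosen by the categorization system (here,
# three of the features from Table 1).
FEATURES = ["nuclear energy", "nuclear waste", "environmental protection"]

def feature_vector(text: str) -> list[int]:
    """Count how often each discriminating phrase occurs in the text."""
    lowered = text.lower()
    return [lowered.count(phrase) for phrase in FEATURES]

doc = ("Nuclear energy policy must address nuclear waste. "
       "Nuclear energy growth raises environmental protection concerns.")
vec = feature_vector(doc)  # one frequency per feature, as in Table 2
```

Each document in the corpus would be reduced to such a vector, and from that point on the computer manipulates the vectors rather than the text itself.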
Remember that vectors have a magnitude and direction. One of the challenges is to visualize how the categorizer works. Most people can visualize only three dimensions: length, width, and depth. Each feature in the feature vector represents a dimension, so even in this simple example, each feature vector lies in a six-dimensional space. In a real classification system, feature vectors may have several hundred elements, and the vectors therefore exist in a space with several hundred dimensions. Fortunately, the same mathematical principles that work in three-dimensional space also work in n-space (see Records Inventory Axis figure, next page).
Feature extraction results in a collection of feature vectors, one for each document in the collection. If the endpoints of a group of feature vectors are closely grouped together, the documents they represent form a cluster; that is, they are about the same topic. It is possible to determine which subjects are contained in a corpus by calculating the feature vectors for each document and then partitioning the feature vector endpoints into clusters. The user of the clustering software specifies the maximum distance between feature vector endpoints for them to be considered within the same cluster. Each cluster can be envisioned as a sphere in n-space whose center is called the centroid and whose radius is the user-specified, cluster-defining distance. The centroid provides a mathematical signature for the topic of the documents forming the cluster.
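A minimal sketch of this kind of distance-based clustering in Python. Real products use far more sophisticated algorithms; the single-pass, threshold-based grouping and the sample vectors here are only illustrative:

```python
import math

def cluster(vectors, radius):
    """Group vectors whose endpoints lie within `radius` of a cluster's
    centroid; otherwise start a new cluster. Single-pass and order-
    dependent, for illustration only."""
    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for v in vectors:
        for c in clusters:
            if math.dist(v, c["centroid"]) <= radius:  # Euclidean distance in n-space
                c["members"].append(v)
                # Recompute the centroid as the mean of the members.
                n = len(c["members"])
                c["centroid"] = [sum(dim) / n for dim in zip(*c["members"])]
                break
        else:
            clusters.append({"centroid": list(v), "members": [v]})
    return clusters

docs = [[9, 11, 2], [12, 6, 3], [0, 0, 0], [1, 0, 0]]
groups = cluster(docs, radius=7.0)  # nuclear articles vs. the rest
```

With the sample vectors, the two nuclear-related documents fall within the same sphere and form one cluster, while the two near-zero vectors form another.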
Text visualization displays relationships between documents in a corpus. It is difficult to visualize more than three dimensions, yet feature vectors may have hundreds of dimensions. The Pacific Northwest National Laboratory of the Department of Energy is one organization working in the area of text visualization. Two of its products are outlined below: Galaxies and Themeview.
Galaxies provides an overview of the corpus. Each point in the display is a feature vector end-point representing a document in the corpus. The bright areas are the result of documents being closely associated with each other. These are document clusters and represent naturally occurring topics within the corpus. Clusters can be selected for further analysis through the use of Galaxies' analytical tools, which can test whether an organization's file plan is appropriate for a given corpus by comparing the naturally occurring clusters with the documents and terms normally associated with each of the file plan categories. File plan categories can be used to define Galaxies' search criteria, plotting only compliant feature vectors. The resulting graphic can be used to determine whether the documents in the displayed clusters match the categories in the file plan.
Clusters in the display can be selected graphically for further analysis through Themeview, which provides another visualization perspective to view the associations of feature vectors. Themeview also provides a powerful set of analytical tools to support further analysis of the visual display.
There are a number of other products providing text visualization tools and many different approaches that provide the user with a means to analyze a corpus' contents. All of these tools use document features to determine the associations between various documents and the concept of clustering to determine topics or themes.
Summarization is another text analysis tool that supports the user in reviewing a large number of documents. The simplest version of a summarization tool provides a title listing of the documents associated with selected areas from the display using available metadata. Key phrases extracted from the selected documents can also be used to generate a list of titles without the existence of metadata. More sophisticated summarization tools use gisting techniques to generate a narrative summary of the document. Summarization techniques can assist the records manager in determining and refining the required training sets for the categorization system.
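A crude extractive summarizer can illustrate the idea: score each sentence by the document-wide frequency of its content words and keep the top-scoring sentence. This sketch is far simpler than commercial gisting tools, and the stopword list is an assumption:

```python
import re
from collections import Counter

# Illustrative low-content words to exclude from scoring.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}

def summarize(text: str) -> str:
    """Return the sentence whose content words are most frequent overall."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower())
                   if w not in STOPWORDS)

    return max(sentences, key=score)

text = ("Records management governs records. The weather was nice. "
        "Records retention schedules support records management.")
gist = summarize(text)
```

The off-topic sentence scores poorly because its words are rare in the document, so the summary keeps a sentence dense with the document's recurring terms.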
Categorization systems can file documents into multiple categories. The categorization system uses a user-defined distance parameter as the radius of a sphere surrounding each subject heading's centroid. If the candidate document's feature vector endpoint lies within any subject heading's sphere, the document is filed under that subject heading.
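This sphere test can be sketched directly: file the document under every heading whose centroid lies within the user-defined radius of the document's feature vector endpoint. The centroid values and heading names below are invented for illustration:

```python
import math

# Hypothetical centroids for three subject headings (invented values).
CENTROIDS = {
    "Nuclear Energy": [10.0, 8.0, 2.0],
    "Recordkeeping":  [3.0, 1.0, 4.0],
    "Environmental":  [2.0, 1.0, 6.0],
}

def assign(vector, radius):
    """Return every subject heading whose sphere contains the vector."""
    return [heading for heading, centroid in CENTROIDS.items()
            if math.dist(vector, centroid) <= radius]

headings = assign([3.0, 1.0, 5.0], radius=2.0)
# The vector lies within both the "Recordkeeping" and "Environmental"
# spheres, so the document is filed under both headings.
```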
While automatic categorization seems straightforward in theory, in practice, it is not. Its accuracy is highly dependent upon the selection of the proper training set for each of the subject headings in the file plan. A training set is the collection of documents selected to represent a subject heading and is then used to determine its associated centroid. This is an empirical problem and one in which the records manager must play the key role.
A set of representative documents is selected for each of the file plan's subject headings. The training function of the categorization system then calculates the centroid associated with each set. The accuracy of the training set selection can be evaluated by automatically categorizing the corpus from which the training sets were taken. If the training sets were perfect, all of the electronic documents previously filed in each subject heading would be assigned to that heading by the automatic classification system. This is very unlikely to happen, for two reasons: 1) humans file documents incorrectly, and 2) the training sets are not perfect. The training sets can be tuned by deleting some documents and adding others until acceptable results are achieved when tested against the existing electronic file plan. This process should be performed by the records manager. Once acceptable results have been achieved, new documents can be categorized using the centroids created from the training sets.
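The train-then-evaluate loop can be sketched as follows. Centroids are the mean of each heading's training vectors, and evaluation measures how many already-filed documents the system would return to their original heading; all vectors and headings here are invented:

```python
import math

def centroid(vectors):
    """Mean of a set of feature vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def train(training_set):
    """training_set maps subject heading -> list of feature vectors."""
    return {h: centroid(vs) for h, vs in training_set.items()}

def categorize(vector, centroids):
    """Assign the vector to the nearest centroid's heading."""
    return min(centroids, key=lambda h: math.dist(vector, centroids[h]))

# Invented training sets for two subject headings.
training = {"Nuclear": [[9, 11], [12, 6]], "Records": [[1, 8], [3, 1]]}
centroids = train(training)

# Evaluate against documents already filed in the existing file plan.
labeled = [("Nuclear", [10, 9]), ("Records", [2, 4]), ("Records", [8, 7])]
correct = sum(categorize(v, centroids) == h for h, v in labeled)
accuracy = correct / len(labeled)
```

Documents that come back under a different heading than the one they were filed in are either human misfiles or evidence that the training set needs tuning, which is exactly the iterative process described above.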
Automatic categorization systems can generate their own sets of categories through the use of clustering. These self-defined categories can then be fine-tuned by stripping out documents that are not relevant. The refined set of documents can be used as a training set, recalculating the centroids for each category. The fine-tuning of the categories should be performed by the records manager, who should determine if the self-defined categories meet business needs.
The level of investment in building a training set should not be underestimated. Generally, the size of the training set is directly proportional to the accuracy of the automatic categorization. Given that cost estimates for reclassifying documents range from $25 to $100 per document, building a training set can be a significant one-time cost. Not developing a representative training set, however, will result in a significantly higher recurring cost.
Categorization accuracy is an important issue with automatic categorization systems. A categorization accuracy of 80 percent is considered fairly good for an automatic categorization system. However, this metric relates to a single level of the taxonomy. Statistically, the expected accuracy at any given level of the taxonomy is the single-level accuracy of the categorization system raised to the power of the hierarchy level in which the subject heading resides. For example, if the subject heading is at the third level and the accuracy of the categorization system is 80 percent, the expected accuracy for the proper assignment of a document is about 51 percent (0.8 raised to the third power is 0.512).
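The compounding effect described above is simple exponentiation:

```python
def expected_accuracy(per_level: float, depth: int) -> float:
    """Accuracy of filing at a given taxonomy depth when each level of
    the hierarchy is categorized independently with the same accuracy."""
    return per_level ** depth

third_level = expected_accuracy(0.80, 3)  # about 51 percent
```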
Records Management Implications
Understanding the strengths and potential limitations of automatic categorization is important if it is to be used successfully. The records manager must play a key role in establishing an automatic categorization system. Only the records manager understands the filing system and its applicability to the business enterprise, and this knowledge must be embedded into the categorization system's knowledge base. The records manager must learn new skills to use this important tool and to contribute to its integration into the records management process.
ABOUT THE AUTHOR: R. Kirk Lubbes, CRM, is President of Records Engineering LLC. He has held positions at Pattern Analysis and Recognition Technology, The Analytical Sciences Corp., and Disclosure Inc. and has managed programs for the National Security Agency, Central Intelligence Agency, and the Air Force in the areas of sensor exploitation, information retrieval, text processing, data visualization, and records declassification. Lubbes has been working as an information technology contractor and consultant for more than 35 years. He can be reached at firstname.lastname@example.org.
Editor's Note: The Products mentioned in this article serve as examples and do not constitute endorsement by ARMA International.
THIS ARTICLE EXAMINES:
* how automatic categorization and other document tools are impacting records management
* the strengths and potential limitations of automatic categorization
* the importance of categorization accuracy
TABLE 1

                                      File   Records    Records Control   Nuclear   Nuclear   Environmental
Article Title                         Plan   Inventory  Schedule          Energy    Waste     Protection
Preparation for the CRM Exam           4        6           7                0         0           0
Managing Electronic Records            1        8           4                0         0           0
Advances in Nuclear Reactors           0        0           0                9        11           2
The New Energy Crisis                  0        0           0               12         6           3
Recordkeeping for Energy Companies     3        1           4                7         3           5

TABLE 2

Article Title                         Feature Vector
Preparation for the CRM Exam          [4, 6, 7, 0, 0, 0]
Managing Electronic Records           [1, 8, 4, 0, 0, 0]
Advances in Nuclear Reactors          [0, 0, 0, 9, 11, 2]
The New Energy Crisis                 [0, 0, 0, 12, 6, 3]
Recordkeeping for Energy Companies    [3, 1, 4, 7, 3, 5]
Autonomy Corp. "Autonomy Technology White Paper." www.autonomy.com.
Hobbs, Jerry R. "Generic Information Extraction System." Artificial Intelligence Center SRI. www.itl.nist.gov/iaui/894.02/related_projects/tipster/gen_ie.htm.
"Feature Extraction." www.case.ogi.edu/class/cse580ir/handouts/6-20/text_processing/sldO14.htm.
"SPIRE: A new visual text analysis tool." www.pnl.gov/infoviz.
Turban, Efraim. Decision Support and Expert Systems: Management Support Systems, 2nd Edition. Upper Saddle River, NJ: Prentice Hall, 1990.
RELATED ARTICLE: ADVANTAGES AND DISADVANTAGES OF AUTOMATIC CATEGORIZATION
* Supports placing new electronic records into an existing file plan: Automatic categorization can be trained, using electronic records in an established file plan, and then used to categorize new documents. This approach can also be used to bring previously unmanaged electronic records on existing file servers into an established and managed file structure.
* Suggests a "natural" organization for an existing electronic record corpus: Automatic categorization can identify clusters of documents. A cluster's contents can be inspected and refined in an iterative process. Appropriate file tags can be assigned to the refined clusters. The file tags can then be used as the basis for a file plan.
* Identifies topics within an existing corpus: Often it is difficult to determine the contents of unmanaged servers. Automatic categorization can treat the server contents as a single corpus and can identify topics contained within the server using its clustering ability. This unconstrained clustering will also provide insight into the range of material contained within the collection.
* Identifies unknown associations between documents: This is one of the most powerful capabilities of automatic categorization. While primarily an information analysis tool, it can help the records manager identify better ways of organizing information.
* Identifies relevant information from non-relevant information: Clusters may form identifying information that is being improperly retained and/or is not relevant to the business needs. Identification of such clusters will help records managers better manage their records.
* Limited accuracy: Limited accuracy is an issue with automatic categorization. A balance must be struck among the investment in developing a sound training set, the effort of correcting the system's misfiled documents, and an acceptable level of error in the system. One should remember that humans misfile documents at a high rate, whereas the automatic categorization system is generally consistent. Consistency may be more important than accuracy: knowing where to find a document is of major importance, even if the place it is filed is not ideal.
* Does not work well on very short documents, large documents, or documents without uniform contents: Very short documents have little information content. Many e-mail messages are very short (e.g., an e-mail responding to a question may be simply "yes"). Without context, the word "yes" has little information content; the categorization algorithm has little to go on and will place the message in a miscellaneous category. Large documents, such as books, are difficult to categorize because they contain a wide range of topics and will tend to be associated with many clusters, resulting in too many copies being filed. This issue can be addressed by partitioning the document into sections or chapters, but such an approach fragments the document, which conflicts with good records management practice. The problem of multiple subjects exists for other types of documents as well (e.g., status reports, calendars, or forms). These documents have no coherent theme and are very difficult for the automatic categorization system to assign to individual categories. Automatic categorization systems should identify these troublesome documents and place them into a known category where they can be addressed by human intervention.
* Potentially misleading: The basis for automatic categorization is statistical. Centroids are initially calculated based upon the training set. As the system continues to add documents to each subject heading, categorization systems recalculate the centroid location. If misfiles are not removed, the centroid's location gradually drifts. Eventually the subject heading no longer represents the contents of the particular category. Effectively the file plan has unknowingly changed. The records manager must occasionally review and correct misfiles to avoid this situation. The same situation may occur even if all documents are being properly filed. This may be due to a subcategory developing within the category originally defined by the file plan. Consistent with hardcopy document practices, the category should be split to create a new category. Adding a new category, however, requires that the records manager identify new training sets for both the original category and the new category and rerun the categorization process.
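The drift effect described above can be demonstrated with a running-mean centroid update: starting from a centroid trained on one topic, each misfiled document pulls the centroid toward the misfiled topic's region of feature space. The numbers are invented:

```python
def update_centroid(centroid, count, new_vector):
    """Incrementally fold one new document into a running-mean centroid.
    Returns the updated centroid and the new document count."""
    updated = [(c * count + x) / (count + 1)
               for c, x in zip(centroid, new_vector)]
    return updated, count + 1

# A centroid trained on 10 documents about one topic (high on axis 0).
centroid, n = [10.0, 0.0], 10

# Three misfiled documents about a different topic (high on axis 1).
for misfile in ([0.0, 10.0], [0.0, 12.0], [1.0, 9.0]):
    centroid, n = update_centroid(centroid, n, misfile)
# The centroid has drifted away from its original topic: its first
# coordinate has dropped and its second has risen.
```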
Author: Lubbes, R. Kirk
Publication: Information Management Journal
Article Type: Statistical Data Included
Date: Oct 1, 2001