Automatic categorization: how it works, related issues, and impacts on records management. (Cover Story).
A records manager's primary responsibility has always been to process unstructured data. The increase in unstructured documents and the rise in the portion of the material that is electronic have created an environment where records managers can no longer manage records without new, automated tools at their command.
Automatic categorization is currently being applied to electronic records management. Anyone hoping to effectively apply categorization needs to understand how automatic categorization works, its benefits, its limitations, and the potential impact it has on recordkeeping operations. Ultimately, automatic categorization and other text analytical tools will provide potential new career opportunities for records managers.
In order to better understand the process of automatic categorization, these key terms should be defined:
Categorization: The assigning of an object to a pre-existing subject heading in a file plan or assigning it to a given class within the taxonomy (also called classification)
Cluster: A group of objects with members that are more similar to each other than to members of any other group
Data visualization: A visual representation of corpus contents, often a topographical map or network of linked nodes
Structured data: Fielded data, or data that is generally contained in a relational database
Summarization: An abstract or a synopsis of a document
Unstructured data: Data not contained in fields (e.g., free text, audio, video, and images)
Over the last two decades, the computer's ability to process data has evolved from the domain of structured data to unstructured data. Structured data can include a series of tables with rows and columns. A formal mathematical model, the relational model, defines the table structures and the complete set of operations that can be performed on the data.
Structured data represents less than 20 percent of the information available. More than 80 percent of all information resides in unstructured documents. Initially, this data could not be processed in its native form. Data elements contained in documents had to be extracted and entered into structured databases before they could be processed. The primary raison d'etre for forms is to be able to easily enter data into database management systems (DBMS). Products exist today to "read" data from forms, including intelligent character recognition (ICR), but this technology is actually processing structured data. For example, ICR depends upon the position of information in the form to determine the DBMS field into which the data is to be entered.
A records manager's primary responsibility has always been to process unstructured data, generally hardcopy documents. To create a file plan, the records manager analyzes a collection of documents and creates a taxonomy that adds structure to the collection. Assigning documents to a file requires an indexing clerk to extract keywords from the document. The creation of a records control schedule requires the records manager to extract the business and legal relevance of a file series. According to a report from Autonomy Corp., the volume of unstructured information is estimated to be doubling every three months. The rise in the portion of the material that is electronic has created an environment where records managers can no longer manage records without new tools at their command. These tools, as well as their advantages and limitations, are discussed later in the article. The focus will be primarily on text-based electronic records, including e-mail, Web URL documents, Word documents, PDF files, and text documents.
The Electronic Records Environment
An organization that has implemented a standard electronic file structure that is universally followed is extremely fortunate. In most organizations, each person has their own directory structure and e-mail folder structure. Some companies have implemented electronic records management systems (RMS), but most have used a day-forward approach, in which all newly generated and received records are placed under the automated RMS on a certain date. Electronic files on existing servers and electronic records in off-line storage (back-office files) are rarely addressed. Generally, metadata does not exist to place the back-office files under RMS control, and surveying the corpus is cost prohibitive. However, these documents are just as vulnerable to discovery as the newer documents.
Automatic categorization attempts to associate electronic records with either a predefined taxonomy or self-defining categories. An understanding of the strengths and potential limitations of automatic categorization in managing records is important if it is to be used successfully. A number of text analysis tools act as a suite to assist in this process. These include feature extraction, clustering, visualization, and summarization tools. Commercial off-the-shelf (COTS) products often combine these tools into a single categorization product, but to understand how these products work, it is crucial to understand each of the related technologies. Feature extraction and clustering are integral to the categorization process. Text visualization and text summarization are an adjunct to categorization but are useful in gaining insight into the collection prior to developing categories or to ascertain the quality of the categorization results.
The feature extraction process can be viewed as a series of filters through which the document passes. Each filter attempts to further reduce the document to its key conceptual elements and assign numeric values to these elements. The first filter segments the document into individual linguistic components. The next series of filters identifies and eliminates words, phrases, and sentences that have low content value. Individual features that describe the remaining content are then identified by the feature extractor. Finally, the feature vector is generated for the feature set.
To the computer, text is a collection of characters and nothing more. Words, phrases, sentences, and paragraphs must be identified through a parsing algorithm. The parser partitions the document into individual paragraphs and sentences and then into the parts of speech. Computer programs perform these analyses to approximate the text's meaning, which in turn aids in determining the document's content.
Once text of low information content has been removed from the document, feature identification can begin. One feature type often used is the frequency of occurrence of words or phrases that have high discriminating value. Words that have a high discriminating value relate strongly to the subject of the given document but occur infrequently in documents that have a different meaning. A word may appear in several forms: singular, plural, prefixed, suffixed, and hyphenated. Stemming is the process used to normalize all word forms, which improves the accuracy of the frequency counts.
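The filtering, stemming, and counting steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the suffix list, and the function names are all hypothetical, and the suffix-stripping stemmer is deliberately naive compared with the stemmers a commercial categorization product would likely use.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "each"}

# Hypothetical suffix list for a deliberately naive stemmer.
SUFFIXES = ("ing", "es", "s")

def tokenize(text):
    """Segment text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Drop low-content words, normalize word forms, and count what remains."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(naive_stem(t) for t in tokens)

sample = "The records manager files records. Filing each record is the job of the manager."
freqs = term_frequencies(sample)
print(freqs.most_common())
```

Note how "records" and "record" collapse into one count, as do "files" and "Filing". The crude stem "fil" is not a real word, which is acceptable here because stems only need to be consistent, not readable.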
Because authors vary their wording to avoid repetition, a single concept is often spread across several different words, which lowers the frequency count associated with that concept. Therefore, it is often desirable to count concepts rather than individual words. The frequency of occurrence of discriminating words or phrases, along with other metrics, can provide a structured representation, or signature, for the document. These signatures can then be manipulated by the computer and used to assign the associated documents to appropriate categories.
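One simple way to count concepts rather than words is to map each word to a shared concept label before counting. The sketch below assumes a small, hand-built thesaurus; the `CONCEPTS` table and the `concept_counts` function are hypothetical names introduced for illustration, not part of any product discussed in this article.

```python
from collections import Counter

# Hypothetical thesaurus mapping individual words to a shared concept label.
CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "invoice": "billing",
    "bill": "billing",
}

def concept_counts(tokens):
    """Count concepts rather than surface words; unmapped words count as themselves."""
    return Counter(CONCEPTS.get(token, token) for token in tokens)

tokens = ["car", "automobile", "invoice", "truck", "bill", "warranty"]
counts = concept_counts(tokens)
print(counts)
```

Counted word by word, no term in this sample occurs more than once. Counted by concept, "vehicle" occurs three times and "billing" twice, which gives a much stronger signal about the document's subject.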
A vector is a physical quantity that has a magnitude and direction. Vectors are represented as a series of numbers, in which each number represents a magnitude or distance in a given direction. A feature vector is used to represent the document's feature set or the document's signature. Each number in the vector is a magnitude associated with a feature. If a categorization system determined that "file plan," "records inventory," and "records control schedule" were highly discriminating features for the entire corpus that the system was to address, then every document in the corpus would have an element in its associated feature vector for each of these three records-management-oriented features. Tables 1 and 2 illustrate this concept.
The five articles form a small corpus. Column 1 of Table 1 lists the titles of the five documents. The remaining columns are the features that the categorization system has chosen to build the feature vectors that represent each article in the corpus. The feature values are the frequencies with which each of the phrases occurs in each article. The feature vectors are listed in Table 2.
Remember that vectors have a magnitude and direction. One of the challenges is to visualize how the categorizer works. Most people can visualize only three dimensions: length, width, and depth. Each feature in the feature vector represents a dimension. Even in this simple example, each feature vector lies in a six-dimensional space. In a real classification system, feature vectors may have several hundred elements, and the vectors therefore exist in a space with several hundred dimensions. Fortunately, the same mathematical principles that work in three-dimensional space also work in n-space (see Records Inventory Axis figure, next page).
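To make the n-space idea concrete, the sketch below computes ordinary straight-line (Euclidean) distance between three of the six-element feature vectors listed in Table 2. Euclidean distance is an assumption made for illustration; the article does not say which distance measure a given categorization product uses.

```python
import math

# Three of the six-element feature vectors listed in Table 2.
vectors = {
    "Managing Electronic Records": [1, 8, 4, 0, 0, 0],
    "Advances in Nuclear Reactors": [0, 0, 0, 9, 11, 2],
    "The New Energy Crisis": [0, 0, 0, 12, 6, 3],
}

def distance(a, b):
    """Euclidean distance; the same formula works for any number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

reactors = vectors["Advances in Nuclear Reactors"]
crisis = vectors["The New Energy Crisis"]
records = vectors["Managing Electronic Records"]

near = distance(reactors, crisis)    # two energy articles
far = distance(reactors, records)    # an energy article and a records article
print(round(near, 2), round(far, 2))
```

The two energy articles end up much closer to each other than either is to the records management article, even though nobody can picture the six-dimensional space in which they sit.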
Feature extraction results in a collection of feature vectors, one for each document in the collection. If the endpoints of a group of feature vectors are closely grouped together, the documents they represent form a cluster, meaning that they are about the same topic. It is possible to determine which subjects are contained in a corpus by calculating the feature vectors for each document and then partitioning the feature vector endpoints into clusters. The user of the clustering software specifies the maximum distance between feature vector endpoints for them to be considered within the same cluster. Each cluster can be envisioned as a sphere in n-space. The center of the sphere is called the centroid, and the radius of the sphere is the user-specified, cluster-defining distance. The centroid provides a mathematical signature for the topic of the documents forming the cluster.
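The clustering idea can be sketched with a very simple procedure: each document joins the first cluster whose centroid lies within the user-specified radius, and otherwise starts a new cluster. This single-pass approach is an assumption chosen for brevity, not the method of any particular product, and the six-element vectors are hypothetical values modeled on Table 2.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vector endpoints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a group of feature vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cluster(vectors, radius):
    """Single-pass clustering: join the first cluster within radius, else start a new one."""
    clusters = []  # each cluster is a list of member vectors
    for v in vectors:
        for members in clusters:
            if distance(v, centroid(members)) <= radius:
                members.append(v)
                break
        else:
            clusters.append([v])
    return clusters

# Hypothetical six-element feature vectors modeled on Table 2.
docs = [
    [1, 8, 4, 0, 0, 0],
    [0, 0, 0, 9, 11, 2],
    [0, 0, 0, 12, 6, 3],
    [4, 6, 7, 0, 0, 0],
]
groups = cluster(docs, radius=7.0)
for members in groups:
    print(centroid(members), len(members))
```

With a radius of 7.0 the four documents fall into two clusters, one for records management and one for nuclear energy. A smaller radius would produce more, tighter clusters, and a larger one would eventually merge everything into a single topic.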
Text visualization displays relationships between documents in a corpus. It is difficult to visualize more than three dimensions, and feature vectors may have dimensionality in the hundreds of features. The Pacific Northwest National Laboratory of the Department of Energy is an example of one organization working in the area of text visualization. Two of its products are outlined below: Galaxies and Themeview.
Galaxies provides an overview of the corpus. Each point in the display is a feature vector end-point representing a document in the corpus. The bright areas are the result of documents being closely associated with each other. These are document clusters and represent naturally occurring topics within the corpus. Clusters can be selected for further analysis through the use of Galaxies' analytical tools, which can test whether an organization's file plan is appropriate for a given corpus by comparing the naturally occurring clusters with the documents and terms normally associated with each of the file plan categories. File plan categories can be used to define Galaxies' search criteria, plotting only compliant feature vectors. The resulting graphic can be used to determine whether the documents in the displayed clusters match the categories in the file plan.
Clusters in the display can be selected graphically for further analysis through Themeview, which provides another visualization perspective to view the associations of feature vectors. Themeview also provides a powerful set of analytical tools to support further analysis of the visual display.
There are a number of other products providing text visualization tools and many different approaches that provide the user with a means to analyze a corpus' contents. All of these tools use document features to determine the associations between various documents and the concept of clustering to determine topics or themes.
Summarization is another text analysis tool that supports the user in reviewing a large number of documents. The simplest version of a summarization tool provides a title listing of the documents associated with selected areas from the display using available metadata. Key phrases extracted from the selected documents can also be used to generate a list of titles without the existence of metadata. More sophisticated summarization tools use gisting techniques to generate a narrative summary of the document. Summarization techniques can assist the records manager in determining and refining the required training sets for the categorization system.
Categorization systems can file documents into multiple categories. To do so, the categorization system uses a user-defined distance parameter as the radius of a sphere surrounding each subject heading's centroid. If the endpoint of a candidate document's feature vector lies within any subject heading's sphere, the document is filed within that subject heading.
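The radius test described above can be sketched as follows. The centroid values, the heading names, and the `assign` function are hypothetical and chosen only to show how one document can land in two, one, or no subject headings depending on the radius.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vector endpoints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(vector, centroids, radius):
    """Return every subject heading whose sphere contains the vector endpoint."""
    return [heading for heading, c in centroids.items()
            if distance(vector, c) <= radius]

# Hypothetical centroids for two subject headings.
centroids = {
    "Records Management": [2.5, 7.0, 5.5, 0.0, 0.0, 0.0],
    "Nuclear Energy": [0.0, 0.0, 0.0, 10.5, 8.5, 2.5],
}

# "Recordkeeping for Energy Companies" as listed in Table 2.
candidate = [3, 1, 4, 7, 3, 5]

print(assign(candidate, centroids, radius=12.0))  # filed under both headings
print(assign(candidate, centroids, radius=9.0))   # filed under one heading
print(assign(candidate, centroids, radius=5.0))   # filed under none
```

The last case matters in practice: a document that falls inside no sphere still has to go somewhere, typically a holding category for human review.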
While automatic categorization seems straightforward in theory, in practice, it is not. Its accuracy is highly dependent upon the selection of the proper training set for each of the subject headings in the file plan. A training set is the collection of documents selected to represent a subject heading and is then used to determine its associated centroid. This is an empirical problem and one in which the records manager must play the key role.
A set of representative documents is selected for each of the file plan's subject headings. The training function of the categorization system then calculates the centroid associated with each set. The accuracy of the training set selection can be evaluated by automatically categorizing the corpus from which the training sets were taken. If the training sets are perfect, all of the electronic documents previously filed in each subject heading will be assigned to that heading by the automatic classification system. This is very unlikely to happen for two reasons: 1) humans file documents incorrectly and 2) the training sets are not perfect. The training sets can be tuned by deleting documents and adding others until acceptable results are achieved when tested against the existing electronic file plan. This process should be performed by the records manager. Once acceptable results have been achieved, new documents can be categorized using the centroids created by the training set.
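The train-then-test loop described above might look like the following sketch. It makes two simplifying assumptions that are not in the article: each document is assigned to the single nearest centroid rather than tested against a radius, and the training sets and filed documents are small hypothetical examples modeled on Table 2.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vector endpoints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a group of feature vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def train(training_sets):
    """Calculate one centroid per subject heading from its training set."""
    return {heading: centroid(vs) for heading, vs in training_sets.items()}

def nearest_heading(vector, centroids):
    """Simplification: pick the single closest subject heading."""
    return min(centroids, key=lambda heading: distance(vector, centroids[heading]))

def agreement(filed_documents, centroids):
    """Fraction of already-filed documents that the system files the same way."""
    matches = sum(1 for heading, vector in filed_documents
                  if nearest_heading(vector, centroids) == heading)
    return matches / len(filed_documents)

# Hypothetical training sets, one per subject heading.
training_sets = {
    "Records Management": [[1, 8, 4, 0, 0, 0], [4, 6, 7, 0, 0, 0]],
    "Nuclear Energy": [[0, 0, 0, 9, 11, 2], [0, 0, 0, 12, 6, 3]],
}
centroids = train(training_sets)

# Hypothetical existing file plan: (heading chosen by a person, feature vector).
filed_documents = [
    ("Records Management", [1, 8, 4, 0, 0, 0]),
    ("Records Management", [4, 6, 7, 0, 0, 0]),
    ("Nuclear Energy", [0, 0, 0, 9, 11, 2]),
    ("Nuclear Energy", [0, 0, 0, 12, 6, 3]),
    ("Records Management", [3, 1, 4, 7, 3, 5]),
]
score = agreement(filed_documents, centroids)
print(score)
```

The one disagreement is the mixed-topic document. Whether that is a human misfile or a weak training set is exactly the judgment the records manager has to make while tuning.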
Automatic categorization systems can generate their own sets of categories through the use of clustering. These self-defined categories can then be fine-tuned by stripping out documents that are not relevant. The refined set of documents can be used as a training set, recalculating the centroids for each category. The fine-tuning of the categories should be performed by the records manager, who should determine if the self-defined categories meet business needs.
The level of investment in building a training set should not be underestimated. Generally, the accuracy of the automatic categorization is directly proportional to the size of the training set. Given that cost estimates of reclassifying documents range from $25 to $100 per document, the cost of building a training set can be a significant one-time cost. Not developing a representative training set, however, will result in a significantly higher recurring cost.
Categorization accuracy is an important issue with automatic categorization systems. A categorization accuracy of 80 percent is considered fairly good for an automatic categorization system. However, this metric relates to a single level of the taxonomy. Statistically, the accuracy at any given level of the taxonomy would be the accuracy of the categorization system raised to the power of the hierarchy level in which the subject heading resides. For example, if the subject heading was in the third level and the accuracy of the categorization system was 80 percent, the expected accuracy for the proper assignment of a document would be about 51 percent.
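The compounding effect is easy to check with a short calculation. The sketch assumes, as the paragraph above implicitly does, that errors at each level of the taxonomy are independent and that the per-level accuracy is the same at every level; `expected_accuracy` is a hypothetical helper name.

```python
def expected_accuracy(per_level_accuracy, level):
    """Accuracy compounds multiplicatively down the taxonomy (independence assumed)."""
    return per_level_accuracy ** level

for level in range(1, 5):
    print(level, round(expected_accuracy(0.80, level), 3))
```

At the third level, 0.80 x 0.80 x 0.80 = 0.512, which is the "about 51 percent" figure quoted above. One level deeper, the expected accuracy falls to roughly 41 percent.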
Records Management Implications
Understanding the strengths and potential limitations of automatic categorization is important if it is to be used successfully. The records manager must play a key role in establishing an automatic categorization system. Only the records manager understands the filing system and its applicability to the business enterprise, and this knowledge must be embedded in the categorization system's knowledge base. The records manager must learn new skills to use this important tool in order to make a contribution toward its integration into the records management process.
ABOUT THE AUTHOR: R. Kirk Lubbes, CRM, is President of Records Engineering LLC. He has held positions at Pattern Analysis and Recognition Technology, The Analytical Sciences Corp., and Disclosure Inc. and has managed programs for the National Security Agency, Central Intelligence Agency, and the Air Force in the areas of sensor exploitation, information retrieval, text processing, data visualization, and records declassification. Lubbes has been working as an information technology contractor and consultant for more than 35 years. He can be reached at firstname.lastname@example.org.
Editor's Note: The products mentioned in this article serve as examples and do not constitute endorsement by ARMA International.
THIS ARTICLE EXAMINES:
* how automatic categorization and other document tools are impacting records management
* the strengths and potential limitations of automatic categorization
* the importance of categorization accuracy
TABLE 1

Article Title                         Nuclear   Nuclear   Environmental
                                      Energy    Waste     Protection
Preparation for the CRM Exam          0         0         0
Managing Electronic Records           0         0         0
Advances in Nuclear Reactors          9         11        2
The New Energy Crisis                 12        6         3
Recordkeeping for Energy Companies    7         3         5

TABLE 2

Article Title                         Feature Vector
Preparation for the CRM Exam          [4,6,7,0,0,0]
Managing Electronic Records           [1,8,4,0,0,0]
Advances in Nuclear Reactors          [0,0,0,9,11,2]
The New Energy Crisis                 [0,0,0,12,6,3]
Recordkeeping for Energy Companies    [3,1,4,7,3,5]
Autonomy Corp. "Autonomy Technology White Paper." www.autonomy.com.
Hobbs, Jerry R. "Generic Information Extraction System." Artificial Intelligence Center SRI. www.itl.nist.gov/iaui/894.02/related_projects/tipster/gen_ie.htm.
"Feature Extraction." www.case.ogi.edu/class/cse580ir/handouts/6-20/text_processing/sldO14.htm.
"SPIRE: A new visual text analysis tool." www.pnl.gov/infoviz.
Turban, Efraim. Decision Support and Expert Systems, Management Support Systems, 2nd Edition. Prentice Hall. Upper Saddle River, New Jersey, 1990.
RELATED ARTICLE: ADVANTAGES AND DISADVANTAGES OF AUTOMATIC CATEGORIZATION
* Supports placing new electronic records into an existing file plan: Automatic categorization can be trained, using electronic records in an established file plan, and then used to categorize new documents. This approach can also be used to bring previously unmanaged electronic records on existing file servers into an established and managed file structure.
* Suggests a "natural" organization for an existing electronic record corpus: Automatic categorization can identify clusters of documents. A cluster's contents can be inspected and refined in an iterative process. Appropriate file tags can be assigned to the refined clusters. The file tags can then be used as the basis for a file plan.
* Identifies topics within an existing corpus: Often it is difficult to determine the contents of unmanaged servers. Automatic categorization can treat the server contents as a single corpus and can identify topics contained within the server using its clustering ability. This unconstrained clustering will also provide insight into the range of material contained within the collection.
* Identifies unknown associations with documents: This is one of the most powerful capabilities of automatic categorization. While primarily an information analysis tool, it can help the records manager identify better ways of organizing information.
* Distinguishes relevant information from non-relevant information: Clusters may form that identify information that is being improperly retained and/or is not relevant to the business needs. Identification of such clusters will help records managers better manage their records.
* Limited accuracy: Limited accuracy is an issue with automatic categorization. A balance must be struck among the investment expended in developing a sound training set, the effort of correcting the system's misfiled documents, and an acceptable level of error in the system. One should remember that humans have a high incidence of misfiling documents. The automatic categorization system is generally consistent. Consistency may be more important than accuracy, in that knowing where to find a document is of major importance, even if the place it is filed is not ideal.
* Does not work well on very short documents, large documents, or on documents without uniform contents: Very short documents have little information content. Many e-mail messages are very short (e.g., an e-mail responding to a question may be simply "yes"). Without the context, the word "yes" has little information content. The categorization algorithm has little to go on and will place the message in a miscellaneous category. Large documents, such as books, are difficult to categorize because they contain a wide range of topics and will tend to be associated with many clusters, resulting in too many copies being filed. This issue can be addressed by partitioning the document into sections or chapters, but such an approach will fragment the document, which goes against good records management practice. The problem of multiple subjects exists for other types of documents as well (e.g., status reports, calendars, or forms). These documents have no coherent theme and are very difficult for the automatic categorization system to assign to individual categories. Automatic categorization systems should identify these troublesome documents and place them into a known category where they can be addressed by human intervention.
* Potentially misleading: The basis for automatic categorization is statistical. Centroids are initially calculated based upon the training set. As the system continues to add documents to each subject heading, categorization systems recalculate the centroid location. If misfiles are not removed, the centroid's location gradually drifts. Eventually the subject heading no longer represents the contents of the particular category. Effectively the file plan has unknowingly changed. The records manager must occasionally review and correct misfiles to avoid this situation. The same situation may occur even if all documents are being properly filed. This may be due to a subcategory developing within the category originally defined by the file plan. Consistent with hardcopy document practices, the category should be split to create a new category. Adding a new category, however, requires that the records manager identify new training sets for both the original category and the new category and rerun the categorization process.
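The centroid drift described in the last point above can be illustrated with a running-mean update. This is a minimal sketch under stated assumptions: the two-feature centroid, the document counts, and the `update_centroid` function are hypothetical, and real systems may recalculate centroids differently.

```python
def update_centroid(centroid, count, new_vector):
    """Incrementally fold one newly filed document into a running-mean centroid."""
    new_count = count + 1
    updated = [c + (x - c) / new_count for c, x in zip(centroid, new_vector)]
    return updated, new_count

# Hypothetical subject heading trained on 4 documents, using only 2 features.
centroid, count = [0.0, 10.0], 4

# Four misfiled documents about a different topic are added and never removed.
misfiles = [[10.0, 0.0]] * 4
for doc in misfiles:
    centroid, count = update_centroid(centroid, count, doc)
    print([round(c, 2) for c in centroid])
```

After four uncorrected misfiles the centroid sits halfway between the original topic and the misfiled one, so the subject heading no longer describes either set of documents well. Periodic review by the records manager is what keeps this from happening silently.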