So you want to implement automatic categorization? Automatic categorization can be a powerful tool despite its limitations, but it is still important to test and evaluate before making a commitment to using it. (TechTrends).


At the Core

This article

* presents a high-level overview of automatic categorization technology

* explains the benefits and limitations of various automatic categorization approaches

* describes the key role records managers play in successfully implementing automatic categorization

The massive amounts of data available through the Internet, extranets, and internal corporate databases have created the need for new techniques to organize information. An excellent example is automatic categorization, an information management tool designed to assist enterprises in filing and retrieving the vast numbers of electronic records that they generate or use today. Automatic categorization attempts to assign electronic records to either predefined file structures or to self-defined categories through computer-based processes.

Successful use of automatic categorization requires the melding of technical and records management perspectives. Insight into the theory behind various vendor implementations, the benefits and limitations of each, and an understanding of what is "under the covers," will aid records and information managers in making intelligent decisions in selecting and implementing automatic categorization.

Automatic categorization technology has two principal approaches: pattern matching and rule-based systems. Pattern matching systems use word patterns and concepts within the electronic record to associate the record with a predefined file structure. Pattern matching systems can be further divided based on the technique used to associate patterns with a given category. The four principal pattern matching techniques used by vendors today are k-nearest neighbor, Bayesian, neural networks, and support vector machines.

Rule-based systems depend on a user-specified set of rules to associate the occurrence (or exclusion) of names, phrases, or concepts contained in documents with specific file plan subject headings. The computer parses the document, identifies the user-specified entities, and assigns the document to the appropriate category based upon the rule set.

Pattern Matching Approaches

Pattern matching categorization requires providing the system with representative sample documents for each subject heading in the file plan. Using the sample documents, the categorization software generates an internal representation for each subject heading. The software compares any new documents entering the system to internal subject heading representations and assigns the new document to the subject where it fits best. There are two phases to this process: the training phase, which consists of providing sample documents, and the classification phase, in which new documents are assigned.

The training phase requires the records manager to identify document sets that represent each subject heading in the file plan. This is the training set. Identifying a training set is an empirical problem, one in which the records manager's knowledge of existing records and the current file structure is critical to automatic categorization's success. The manner in which the training set is used differentiates each of the four pattern recognition techniques.

In the classification phase, new documents entering the system are assigned to one or more categories using algorithms fine-tuned during the training phase. This category assignment is equivalent to the records manager's indexing the document in order to assign it to a specific subject heading.

Software developers currently use four primary methods to assign documents to subject headings (categories). The methods are drawn from various mathematics and computer science disciplines. The k-nearest neighbor algorithm is based upon algebra and geometry. Bayesian modeling uses probability theory. Neural networks are an outgrowth of the computer science field of artificial intelligence. Support vector machines (SVM) are founded on machine learning theory.

K-nearest Neighbor

K-nearest neighbor is the easiest categorization approach to understand because the mathematics it uses has a physical analog in the real world.

In k-nearest neighbor, the records manager constructs a training set. The categorization software produces an internal representation in which each document in the training set is a point on a graph. The training set clusters graphic below shows the three-dimensional graph of a training set as produced by the product SERprivateBrain Learnset Viewer. The viewer allows humans to visualize a training set the same way that the software does. The points, representing the documents in each category, form groupings called clusters.

In the training set clusters example, the file plan has five categories, each containing documents created from one of five different books (The Age of Reason, Holy Bible, Dracula, Moby Dick, and Zarathustra). There are five clusters--one for each book title--represented by different colors as shown in the legend. A cluster exists for each category (as it would for each file plan heading). In the k-nearest neighbor approach, the software generates a sphere that contains the documents (points) in the subject heading (cluster) and calculates the center of the sphere. The simulated spherical k-nearest neighbor boundaries graphic above shows spheres drawn around the clusters representing Moby Dick and The Age of Reason to illustrate this concept. The center of the sphere is called the centroid. The centroid represents the subject heading in the file plan to the computer.

New documents are filed (categorized) during the classification phase. The radius for the centroid sphere defines the maximum distance that any new document's representative point can be from that subject heading's centroid. When the point associated with a new document falls within any cluster's sphere, the document will be filed in that associated subject heading.
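
The centroid-and-radius filing test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: the three-dimensional coordinates are invented, the function names (centroid, train, classify) are hypothetical, and the radius is assumed to be the distance from the centroid to the farthest training point.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def train(training_set):
    """Map each subject heading to (centroid, radius).

    Assumption: the radius is the distance from the centroid to the
    farthest training point, so the sphere encloses the whole cluster.
    """
    model = {}
    for heading, points in training_set.items():
        c = centroid(points)
        r = max(math.dist(c, p) for p in points)
        model[heading] = (c, r)
    return model

def classify(model, point):
    """Return every heading whose sphere contains the point."""
    return sorted(
        heading
        for heading, (c, r) in model.items()
        if math.dist(c, point) <= r
    )

# Invented 3-D coordinates; a real system derives them from document text.
training_set = {
    "Moby Dick": [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    "Dracula": [(10.0, 10.0, 10.0), (12.0, 10.0, 10.0)],
}
model = train(training_set)
print(classify(model, (1.0, 0.5, 0.0)))  # -> ['Moby Dick']
print(classify(model, (5.0, 5.0, 5.0)))  # -> [] (outside every sphere)
```

Because classify returns a list, a point that falls inside two overlapping spheres is reported under both headings, and a point outside every sphere is left unfiled.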

Bayesian Modeling

The Bayesian modeling approach is based on the concept that knowledge about the distribution of previous outcomes helps determine the probability of current outcomes. In automatic categorization, Bayesian modeling asserts that if the assignment of a set of documents from the corpus (the training set) to categories (subject headings) is known, then this information can aid in predicting the assignment of a new document to the appropriate category. In other words, information obtained from the training set can be used to fine-tune a statistical model so that it can assign documents to categories.

One of the Bayesian model's drawbacks is that it assumes that words or phrases are independent. For example, the words records and manager, when considered independently, have a different meaning than the phrase records manager. The words records and manager are therefore not independent. Bayesian modeling generally provides reasonable categorization results. However, there is no way to guarantee the independence between all terms that the Bayesian model will use to assign documents to subject headings.
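
A small Python sketch may make the independence assumption concrete. It is an illustration only, assuming a simple word-count model with add-one smoothing (the smoothing, the sample texts, and the function names are assumptions, not details from any product). Each word adds its own term to the score, so the phrase records manager is treated as two unrelated words.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for category, text in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary

def classify(model, text):
    """Pick the category with the highest log-probability score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Prior: share of training documents filed under this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            # Independence assumption: each word contributes separately.
            # Add-one smoothing (an assumption here) keeps an unseen
            # word from driving the probability to zero.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Invented training texts for two hypothetical subject headings.
training = [
    ("records management", "retention schedule for records manager"),
    ("records management", "file plan and records retention"),
    ("nuclear power", "reactor safety and nuclear fuel"),
    ("nuclear power", "nuclear plant reactor operations"),
]
model = train(training)
print(classify(model, "records retention schedule"))  # -> records management
print(classify(model, "nuclear reactor fuel"))        # -> nuclear power
```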

Neural Networks

The basis for neural networks, more correctly called artificial neural networks, lies in the computer science field of artificial intelligence. Neural networks are the result of attempts to model the human brain. In the general case, a neural network can accept a document described by its differentiating words and phrases and classify it into a predefined set of categories.

The neural network must be trained to assign the document to a category. Other pattern recognition systems use mathematical formulas to extract the parameters from the training set during the training phase.

Neural network training is unique in that it uses a trial and error method for determining the parameters it will use to assign documents to various categories.
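
The trial and error idea can be illustrated with a deliberately simplified Python sketch. It trains a single artificial neuron (a perceptron) rather than a full network, which is a simplification chosen for brevity; the term counts, labels, and function names are all hypothetical. The weights change only when a guess turns out to be wrong.

```python
def predict(weights, bias, features):
    """Return 1 if the weighted sum is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0

def train(examples, epochs=20, rate=1.0):
    """Adjust the weights only after a wrong guess (trial and error)."""
    n = len(examples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:
            error = label - predict(weights, bias, features)
            if error != 0:
                mistakes += 1
                weights = [w + rate * error * x
                           for w, x in zip(weights, features)]
                bias += rate * error
        if mistakes == 0:  # every training document is filed correctly
            break
    return weights, bias

# Features: counts of two hypothetical terms ("reactor", "retention").
# Label 1 = "nuclear power", label 0 = "records management".
examples = [
    ((3, 0), 1),
    ((2, 1), 1),
    ((0, 2), 0),
    ((1, 3), 0),
]
weights, bias = train(examples)
print([predict(weights, bias, f) for f, _ in examples])  # -> [1, 1, 0, 0]
```

With only two terms the neuron settles after two passes over the training set. Each additional differentiating term adds a weight, which hints at why closely related file topics make such systems larger and more costly to train.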

One of the drawbacks with the neural network approach is its complexity. If large numbers of words and phrases are required to differentiate documents into the various categories (i.e., the file topics are closely related), the network becomes large, increasing the processing power required to both train and use the system. The question of scalability should be investigated before settling on neural network-based categorization software.

Support Vector Machine (SVM)

One later entry in the stable of automatic categorization systems is a mathematical technique called support vector machines (SVM), which for purposes of explanation may be considered an enhancement to the k-nearest neighbor approach.

The k-nearest neighbor approach assumes that a sphere will appropriately represent the boundaries between various subject headings for electronic documents, but this may not be the case. An irregular shape might better represent the subject heading boundaries that the software will use to determine the best assignment for a new document.

The simulated SVM boundary graphic below shows a conceptual closed surface that might be generated by the SVM approach as a boundary for the Moby Dick cluster. Compare this boundary to the Moby Dick spherical boundary in the simulated spherical k-nearest neighbor boundaries graphic and notice that there are several Dracula and Zarathustra documents erroneously classified in the Moby Dick spherical boundary; this number is greatly reduced by the SVM boundary.

Spheres may overlap, but irregular shapes may better partition the file headings. Overlap is not necessarily bad because the document in many cases should be considered in multiple categories. The objective of categorization techniques, however, is to associate the object being categorized with the single "best" category. SVMs attempt to offset these two issues. While SVMs are showing some promise in automatic categorization, leading researchers in the field indicate that SVMs are no "silver bullet."

* "Discover More: A Technical White Paper on the Stratify Discovery System" notes that "SVMs pay more attention to outlying training documents. Given high-quality training sets, SVMs will focus on the crucial documents that help define the borders of the group. With poor quality training sets, however, they tend to focus on erroneous outliers, and their performance suffers markedly."

* In an IEEE Intelligent Systems Magazine article, "SVMs--A Practical Consequence of Learning Theory," B. Schölkopf wrote: "We are still missing an application where SVM methods significantly outperform any other available algorithm or solve a problem that has so far been impossible to tackle."

The Rule-Based Systems Approach

Rule-based automatic categorization systems represent a different approach in that they do not require training. According to Fabrizio Sebastiani's A Tutorial on Automated Text Categorization, "Rule-based systems are popular because users of these systems can precisely define the criteria by which a document is classified. Rule-based systems can support complex operations and decision trees and produce very accurate results."

The rule-based approach's first phase is the definition of a set of rules that will be used during the classification phase to assign documents to subject headings. The rules are defined in the form of "IF conditional statement, THEN action."

Rule-based systems, unlike most pattern-recognition approaches, can also take advantage of document metadata to improve categorization accuracy (e.g., rules could be created that would assign all of the documents written by a particular author or written during a given date range to specific categories).

The rule-based system organizes the user-provided rules into a decision tree. A decision tree will only assign a document to the appropriate category if the rules are consistent; if they conflict with each other, the appropriate decision may not be reached. For example, rules 1 and 3 in the example below are not consistent. If the "nuclear" condition was the first tested by the decision tree, then rule 1 would classify the document in the "nuclear power" category even if the phrase "records manager" appeared later in the document. Rule 3, which would have properly classified the document, would not be tested because rule 1 asserted that only the condition "nuclear" was required to classify the document.
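
The ordering problem can be reproduced with a short Python sketch. The rules here are hypothetical stand-ins written for illustration, not the article's numbered rule set, and the category names and function names are assumptions. A rule that tests only for "nuclear" is checked first and wins before the more specific rule is reached. Testing the most specific rules first is shown as one possible remedy, not as a vendor's method.

```python
# Hypothetical rules: (phrases that must all appear, category).
RULES_NUCLEAR_FIRST = [
    (("nuclear",), "nuclear power"),
    (("records manager",), "records management"),
    (("nuclear", "records manager"), "nuclear records management"),
]

def classify(text, rules):
    """Return the category of the first rule whose phrases all appear."""
    lowered = text.lower()
    for phrases, category in rules:
        if all(phrase in lowered for phrase in phrases):
            return category
    return None  # no rule fired; the document stays unfiled

def most_specific_first(rules):
    """Reorder so rules with more conditions are tested earlier."""
    return sorted(rules, key=lambda rule: len(rule[0]), reverse=True)

doc = "The records manager reviewed the nuclear plant files."
print(classify(doc, RULES_NUCLEAR_FIRST))
# -> nuclear power (the broad rule fires first)
print(classify(doc, most_specific_first(RULES_NUCLEAR_FIRST)))
# -> nuclear records management
```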

As the number of categories increases, the number and complexity of the rules must also increase to differentiate between categories. The number and complexity of the rules is also a function of how closely categories (subject headings) in the corpus are related to each other. For example, if documents in the corpus only contained two categories, records management and nuclear power, it might be possible to differentiate between them using rules 1 and 2. More complex rules, such as rules 3 and 4, would be needed to properly categorize documents discussing the management of nuclear records versus documents about managing nuclear power projects.

A major issue with rule-based systems is that the rules required to differentiate subject headings for large, complex collections of documents become difficult to manage and maintain consistently. In the article "Using SVMs for Text Categorization," S.T. Dumais wrote:
   [Another] drawback of this "manual"
   approach to the construction of automatic
   classifiers is the existence of a
   knowledge acquisition bottleneck,
   similar to what happens in expert
   systems [a type of rule-based, artificial
   intelligence system]. That is, rules
   must be manually defined by a
   knowledge engineer with the aid of a
   domain expert (in this case, an expert
   in document relevance to the chosen
   set of categories). If the set of categories
   is updated, then these two professional
   figures must intervene
   again, and if the classifier is ported to
   a completely different domain (i.e.,
   set of categories), the work has to be
   repeated anew.


The knowledge engineer is someone familiar with capturing rules and the internal workings of the vendor's rule-management system. The domain expert is the records manager. More than in any other automatic categorization approach, the records manager's understanding of records content and the file plan is critical to the success of a rule-based approach.

Automatic categorization technology, while the result of significant research and development, must prove cost effective in an operational environment. It can be successful only if the proper records management tasks are performed. (See sidebar below.)

Importance of the Training Set Definition

The accuracy of all automatic categorization systems is highly dependent upon the effort and care taken during the training or rule definition phase. In systems using pattern matching, the selection of a training set that accurately represents the content of each subject heading in the file plan is critical. Vendors suggest that training sets should contain from 10 to 50 documents for each subject heading. However, training set quality is more dependent on the content of the documents than on the number of documents per subject heading.

The content of documents selected for training sets should be highly representative of the subject headings for which they are chosen. It may even be necessary to create documents that meet this requirement. For instance, an existing electronic document might contain a paragraph that is not focused on the document's primary subject, but otherwise might be highly representative of the subject heading desired. Rather than using the document as is for the training set, the records manager could electronically copy the document, assign it a different but easily recognized name, delete the paragraph or paragraphs not consistent with the general subject, and use the newly created document for training purposes. After completion of the software package's training phase and testing phase, the created document could be deleted from the corpus.

For products that use Bayesian modeling, care should also be taken to create a training set where the ratio of documents associated with each subject heading is consistent with the number of documents under each subject heading in the original corpus. In other words, if a specific subject heading in the original corpus contains 7 percent of all the documents, then the training set for that subject heading should ideally contain 7 percent of the documents in the training set.
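
The proportionality guideline amounts to simple arithmetic, sketched here in Python. The subject headings, corpus counts, training set size, and function name are invented for illustration.

```python
def proportional_training_sizes(corpus_counts, training_size):
    """Allocate training documents in proportion to corpus shares.

    Rounding means the allocations may not sum exactly to
    training_size for every input.
    """
    total = sum(corpus_counts.values())
    return {
        heading: round(training_size * count / total)
        for heading, count in corpus_counts.items()
    }

# Hypothetical corpus of 10,000 documents; "contracts" holds 7 percent.
corpus_counts = {"contracts": 700, "invoices": 4300, "correspondence": 5000}
sizes = proportional_training_sizes(corpus_counts, 200)
print(sizes)
# -> {'contracts': 14, 'invoices': 86, 'correspondence': 100}
```

Here 14 of the 200 training documents, or 7 percent, come from the heading that holds 7 percent of the corpus.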

Guidelines for creating training sets include:

* The larger the number of subject headings in the corpus, the larger the training set should be.

* The more difficult it is to discriminate between subject headings in the corpus, the larger the training set should be.

* It is better to have fewer highly representative documents per subject heading than many poorly representative documents for the subject heading.

Training set creation is normally an iterative process. After the initial training phase is completed, a collection of test documents, whose proper assignment to subject headings is known, is input to the classification phase. The records manager then evaluates the classification phase's accuracy. If the results are not acceptable, the training set is modified, the system is retrained, and the test is repeated. The training set modification involves placing misclassified documents into the training set, associating them with the correct subject heading, retraining the system, and rerunning the test. Only after an acceptable misclassification error rate is achieved can the training phase be considered complete.

Importance of Rule Definition

In rule-based systems, the iterative process of defining rules, testing, tuning, and re-testing rule performance against a test corpus is critical. The records manager's understanding of the document collection is the key to success in selecting the training set or defining the rules.

Rule-based automatic categorization system vendors generally supply software to aid the records manager in defining and managing rules. The records manager may not even be aware that he or she is generating rules; rather the records manager specifies words and phrases that, when encountered in a document, either cause the document to be assigned to a given category or prevent it from being assigned to the category. Some rule-based package vendors bundle pre-established taxonomies that have key words and phrases related to each category already defined. The records manager can fine-tune these lists by adding or deleting words and phrases. The software generates the rules and the associated decision tree for use in the classification phase.

Once the records manager has defined a set of rules, system performance, or rule accuracy, must be tested. The records manager runs the system's classification phase using an input test corpus where the proper categorization of the documents is known. The system's categorization results are then compared with the known, or correct, categorization of the test documents. If the level of error is not acceptable, the user must adjust the rules. Much like the development of the training set, this is an iterative process and must be repeated until an acceptable level of categorization accuracy is reached.
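
The compare-and-adjust step, which applies equally to training sets and to rules, can be sketched in Python as follows. The document identifiers, headings, function names, and the 10 percent threshold are hypothetical; an acceptable error rate is a decision for the records manager.

```python
def error_rate(predicted, known):
    """Fraction of test documents filed under the wrong heading."""
    wrong = sum(1 for doc_id, heading in known.items()
                if predicted.get(doc_id) != heading)
    return wrong / len(known)

def misclassified(predicted, known):
    """Documents to review before adjusting rules or the training set."""
    return sorted(doc_id for doc_id, heading in known.items()
                  if predicted.get(doc_id) != heading)

# Known (correct) filing of a four-document test corpus.
known = {"doc1": "contracts", "doc2": "invoices",
         "doc3": "contracts", "doc4": "invoices"}
# What the system produced in its classification phase.
predicted = {"doc1": "contracts", "doc2": "contracts",
             "doc3": "contracts", "doc4": "invoices"}

ACCEPTABLE_ERROR = 0.10  # hypothetical threshold
rate = error_rate(predicted, known)
print(rate, misclassified(predicted, known))  # -> 0.25 ['doc2']
print("retune" if rate > ACCEPTABLE_ERROR else "accept")  # -> retune
```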

Is Automatic Categorization Ready for Prime Time?

Automatic categorization systems have limitations, as evidenced by an experiment performed by Microsoft Research. Microsoft used automatic categorization software to categorize 12,902 Reuters news stories into 118 categories. The stories had previously been manually filed into the 118 categories, providing a baseline for comparison with the automatic classification software results. The experiment was unusual in that it used 75 percent (9,603) of the stories to train the system to classify the remaining 25 percent (3,299), far in excess of the number traditionally recommended by vendors to train their systems, according to Dumais.

But there is a limit to the accuracy of automatic classification, even when trained with large numbers of documents, according to "Verity Intelligent Classification, Turn Information Assets into Competitive Advantage," a report by P. Prahahar Paghavan.
   According to a study by Microsoft
   Research, over 9,000 documents were
   required to teach Bayesian and neural
   network-based systems to classify
   new data with a maximum first category-level
   accuracy of 80 percent.
   When data is broken down into subcategories,
   the accuracy drops even
   further. For example, with only 80 percent
   of information correctly categorized
   in the first level of a hierarchy, at
   the second level only 64 percent will be
   in the appropriate subcategories (0.8 at
   the first level multiplied by 0.8 at the
   second, multiplied by 100 to convert to
   a percent). Accuracy drops to 51 percent
   at the third level, and 41 percent at
   the fourth. At $25 to $100 per document
   to reclassify manually, this is an
   expensive problem to fix. To limit their
   accuracy problems, some automatic
   classification systems restrict taxonomies
   to two levels. This solution
   attempts to limit the proportion of
   misclassified documents to 36 percent--over
   one-third of the entire corpus.
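
The compounding arithmetic in the passage above can be checked with a short Python sketch. It assumes, as the quoted report does, the same accuracy at every level and that errors at each level are independent; the function name is hypothetical.

```python
def accuracy_at_level(per_level_accuracy, level):
    """Expected share of documents still correctly filed at a depth,
    assuming each level is classified independently."""
    return per_level_accuracy ** level

for level in range(1, 5):
    percent = round(accuracy_at_level(0.8, level) * 100)
    print(level, percent)
# -> 1 80
# -> 2 64
# -> 3 51
# -> 4 41
```

At two levels, 100 minus 64 leaves the 36 percent of misclassified documents cited in the report.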


These results are somewhat disappointing because many enterprise-wide file plans have far more than 118 subject headings, as well as multiple subject heading levels. Yet some researchers have reported success in the use of automatic classification systems. D. Schewe's article "Classifying Electronic Documents: A New Paradigm" points out that, despite its limitations in some application areas, automatic categorization can be a viable tool for supporting records management. According to Schewe:
   Of the more than 90,000 e-mails and
   word processing documents [analyzed
   at the U.S. Department of
   Education], the records manager was
   unable to find even one whose categorization
   would have resulted in an
   incorrect retention period. The software
   did not successfully categorize
   all documents. Each cluster map provided
   for incorrectly categorized documents,
   which were examined as part
   of the process. Most were short documents
   that defy categorization even
   by the most experienced records
   manager.


The difference in the results of the Microsoft and Schewe examples may be explained by the scope and evaluation criteria of the two different applications. Schewe limited the number of categories to a relatively small number:
   Focus would be not on all possible
   data within the agency but instead on
   individual work groups where the
   number of subjects addressed was
   limited by the work group's scope.
   Focusing on a particular office or
   work group keeps the number of
   clusters and subclusters smaller and,
   therefore, easier for the records manager
   to work with.


Also, Schewe's measure of success was with respect to retention period, not content.
   For the demonstration project, the
   records manager examined all documents
   and e-mails to see that the
   software properly categorized them.
   Proper categorization was defined as
   ensuring that all documents that
   should be saved for a certain period
   of time according to the records
   retention schedule were placed in categories
   that were scheduled for that
   period, or longer.
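
Schewe's success criterion is easy to state as a test: a document counts as properly categorized if the category it landed in is retained at least as long as the retention schedule requires for that document. The sketch below is one way such a check might look. The class, the field names, and the sample schedule are hypothetical and are not taken from the Department of Education project.

```python
from dataclasses import dataclass


@dataclass
class FiledDocument:
    doc_id: str
    required_retention_years: int  # what the retention schedule demands
    assigned_category: str         # category chosen by the software


def retention_safe(doc, category_retention_years):
    """True if the assigned category is kept for the required period, or longer."""
    return category_retention_years[doc.assigned_category] >= doc.required_retention_years


def retention_failures(docs, category_retention_years):
    """IDs of documents that would be destroyed too early."""
    return [d.doc_id for d in docs if not retention_safe(d, category_retention_years)]


# Hypothetical schedule: category name -> retention period in years.
schedule = {"Correspondence": 3, "Contracts": 7}

docs = [
    FiledDocument("memo-1", 3, "Correspondence"),  # correct category
    FiledDocument("memo-2", 3, "Contracts"),       # wrong subject, but kept longer
    FiledDocument("deal-1", 7, "Correspondence"),  # would be destroyed too early
]

print(retention_failures(docs, schedule))
```

Note that the second document is filed under the wrong subject yet still passes, which is exactly why this measure is more forgiving than a content-based one.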


Schewe's article illustrates that there are potentially useful records management applications of automatic categorization despite the limitations in current software systems. Schewe ameliorated many of these limitations in two ways: by focusing at the work-group level rather than at the enterprise level, which restricted the number of categories, and by using retention period rather than subject heading as the key classification criterion.

Re-thinking Old Paradigms

Filing and retrieval of vast amounts of electronic records will require using automated tools such as automatic categorization. Records managers may have to re-think the current records management paradigm to facilitate the practical use of new automated tools in order to meet the legal and operational requirements necessitated by electronic records.

There may be a large class of problems where automatic categorization is a powerful tool even with its limited accuracy. After all, the purpose of filing is to be able to retrieve information and support record retention requirements. In practice, combining automatic categorization and full-text indexing may make it possible to meet the information filing and retrieval needs of a large organization with fewer, well-differentiated categories. Automatic categorization also can be a powerful tool in applications where the cost of miscategorization is not significant.

Deciding whether to adopt automatic categorization will require records managers to

* identify the specific application, e.g., retrieval, browsing, discovery, information organization, routing, filtering, retention management

* determine an acceptable level of categorization error for the given application

* evaluate any policy changes that might improve categorization performance (e.g., using fewer file subject headings for electronic files or managing at work-group levels)

* experiment with and evaluate multiple products, spending maximum effort on defining the training set or rules

* estimate the life-cycle costs of maintaining the system (e.g., the cost of defining new rules and/or refiling to restore accuracy)

* select the best tool based on empirical results and cost/benefit analysis
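
The last three steps in the list above can be sketched as a small comparison: measure each product's error rate on a test set that the records manager has filed by hand, add the cost of refiling the expected errors to the purchase cost, and pick the cheapest product whose error rate is acceptable. Everything specific here is hypothetical, including the product names, the test set, the purchase prices, and the 60,000-document volume. The $25 refiling cost is the low end of the range quoted earlier in the article.

```python
def error_rate(predicted, actual):
    """Fraction of test documents filed under the wrong category."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def life_cycle_cost(purchase_cost, error, documents, refile_cost):
    """Purchase cost plus the cost of manually refiling the expected errors."""
    return purchase_cost + error * documents * refile_cost


def select_product(candidates, max_error):
    """Cheapest candidate whose measured error is acceptable, else None."""
    acceptable = [c for c in candidates if c["error"] <= max_error]
    return min(acceptable, key=lambda c: c["cost"], default=None)


# Hypothetical hand-filed test set and each product's output on it.
actual = ["HR", "Legal", "HR", "Finance", "Legal", "HR", "Finance", "Legal"]
outputs = {
    "Product A": ["HR", "Legal", "HR", "HR", "HR", "HR", "Finance", "Legal"],
    "Product B": ["HR", "Legal", "HR", "Finance", "Legal", "HR", "Finance", "HR"],
}
purchase = {"Product A": 50_000, "Product B": 120_000}  # hypothetical prices

candidates = []
for name, predicted in outputs.items():
    err = error_rate(predicted, actual)
    cost = life_cycle_cost(purchase[name], err, documents=60_000, refile_cost=25)
    candidates.append({"name": name, "error": err, "cost": cost})

best = select_product(candidates, max_error=0.15)
print(best)
```

A real evaluation would use a far larger test set and would also count the cost of building training sets or rules, which this sketch leaves out.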

The number of products that provide automatic categorization continues to increase. As with most new technologies, the market is still shaping up; new entrants, mergers, and acquisitions make it difficult to keep track of product names and owners, but it is important to research various products. Many vendor Web sites also contain white papers that provide additional helpful background information on automatic categorization.

Products have evolved from different applications of automatic categorization: portal management, information retrieval, information management, and records management. A specific vendor product may be stronger than others for a given application; it is worthwhile to understand which market the specific product is attempting to address. Some records management software packages include automatic categorization functionality. Records managers should work with their information technology departments when considering purchasing a given product to determine its ease of integration with existing products.

According to Geoffrey Bock, "There is no commercially oriented benchmark for determining the effectiveness of one particular text-analysis solution or another. Thus, a company choosing between [a product] and its competitors has to do extensive comparisons on its own to determine the costs and benefits of alternative approaches."

In the final analysis, one should not expect miracles from automatic categorization. Look for applications where it can assist in improving efficiencies while taking into account its limitations. Determine the requirements and measures of success for the given application. Test and evaluate before making a commitment. Remember that humans make a significant number of errors when filing. Human accuracy, not perfection, should be the benchmark for measuring automatic categorization.

Rule Set Examples

Rule   If   Conditional Statement        Then   Action Statement

 1     If   "nuclear" appears in         then   classify in
            the document                        "Nuclear Power"

 2     If   "records manager"            then   classify in
            appears in the document             "Records Management"

 3     If   "nuclear" and "records       then   classify in
            manager" appear in the              "Records Management"
            document

 4     If   "nuclear" and "records"      then   classify in
            and "manager" but not               "Nuclear Power"
            "records manager"
            appear in the document
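
The rule set above can be expressed as a short function. The table does not say which rule wins when several match (a document containing both "nuclear" and "records manager" satisfies rules 1, 2, and 3), so this sketch assumes the more specific rules are tested first. That ordering, the function name, and the simple word matching are assumptions made for illustration; they do not describe any vendor's rule engine.

```python
import re


def classify(text):
    """Return (rule_number, category) for the first matching rule, or None."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z]+", lowered))

    has_nuclear = "nuclear" in words
    # The phrase must appear as adjacent words; \s+ tolerates line breaks.
    has_phrase = re.search(r"\brecords\s+manager\b", lowered) is not None
    has_separate_words = "records" in words and "manager" in words and not has_phrase

    # Assumed precedence: most specific rules first.
    if has_nuclear and has_phrase:
        return 3, "Records Management"
    if has_nuclear and has_separate_words:
        return 4, "Nuclear Power"
    if has_phrase:
        return 2, "Records Management"
    if has_nuclear:
        return 1, "Nuclear Power"
    return None


print(classify("The records manager toured the nuclear plant."))
print(classify("The plant manager keeps nuclear safety records."))
```

Rule 4 is what separates the two sample sentences: both contain "nuclear," "records," and "manager," but only the first contains the phrase "records manager."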


References

Bock, Geoffrey. "Meta Tagging and Text Analysis from ClearForest, Identifying and Organizing Unstructured Content for Dynamic Delivery through Digital Networks." Patricia Seybold Group. 21 February 2002.

Dumais, S.T. "Using SVMs for Text Categorization." IEEE Intelligent Systems Magazine. July/August 1998.

Lubbes, R. Kirk. "Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management." The Information Management Journal. September/October 2001.

Meyers, J. "Automatic Categorization, Taxonomies, and the World of Information: Can't Live With Them, Can't Live Without Them." E-docs. November/December 2002.

Paghavan, P. Prahahar. "Verity Intelligent Classification, Turn Information Assets into Competitive Advantage." Verity Inc. November 2000.

Schewe, D. "Classifying Electronic Documents: A New Paradigm." The Information Management Journal. March/April 2002.

Schölkopf, B. "SVMs--A Practical Consequence of Learning Theory." IEEE Intelligent Systems Magazine. July/August 1998.

Sebastiani, Fabrizio. A Tutorial on Automated Text Categorization, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Via S. Maria, 46-56126 Pisa (Italy).

Stratify Inc. "Discover More: A Technical White Paper on the Stratify Discovery System." August 2001.

READ MORE ABOUT IT

Lubbes, R. Kirk. "Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management." The Information Management Journal. September/October 2001.

R. Kirk Lubbes, CRM, is President of Records Engineering LLC, in Reston, Virginia. He may be contacted at klubbes@recordsengineering.com.
COPYRIGHT 2003 Association of Records Managers & Administrators (ARMA)
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2003, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:Records managers play important role in implementing automatic categorization
Author:Lubbes, R. Kirk
Publication:Information Management Journal
Geographic Code:1USA
Date:Mar 1, 2003
Words:4327