
A roadmap for proper taxonomy design: Part 1 of 2.

Burgeoning information quantities, regulatory compliance requirements and competitive drivers for speedier, more accurate data analysis continue to spur the development of many new, innovative information management technologies. Yet, pivotal to the success of these technologies is a rather old technology--arguably defined by Aristotle and later refined by Linnaeus--called a taxonomy.

A taxonomy is a hierarchical system describing the descending relationships between species and genera. Species derive from a common genus and, within a taxonomy, are hierarchically represented according to their essential characteristics and differences. For example, a thoroughbred is a type of horse, which is an equid, which is a mammal and so on. Another useful term to know when defining a taxonomy is ontology. An ontology is a foundation of categories representing a particular organization's view of its world. It also reflects the organization's commonly used and trusted breakdown of those categories. For example, the logical breakdown that a news broadcast organization might use for its news items (World, Sports, Politics, etc.) is ontological.
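As a minimal sketch of the genus-to-species idea, the thoroughbred example can be written as a map in which each term points to its single genus. The names PARENT and lineage are illustrative assumptions, not part of any particular product.

```python
# Illustrative parent map: each species points to exactly one genus.
PARENT = {
    "thoroughbred": "horse",
    "horse": "equid",
    "equid": "mammal",
}


def lineage(term, parent=PARENT):
    """Return the path from a term up to its root genus."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


# Read from the root down, the path is a single genus-to-species chain.
print(" > ".join(reversed(lineage("thoroughbred"))))
# prints: mammal > equid > horse > thoroughbred
```

Because every term has at most one parent, there is only one way to walk from any term to the root, which is the property the later structure tests rely on.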

Taxonomies, used in conjunction with company ontologies, have proven to be a highly efficient structure for organizing structured and unstructured content. Taxonomies are therefore highly desirable, indeed a critical component of any sizable information management structure, be it a content management system, Intranet or portal.

Properly used and designed, taxonomies provide a consistent, scalable, stable means of organizing even vast quantities of data. They provide a navigable foundation enabling the logical, intuitive access of data. Improperly designed, taxonomies become a maintenance nightmare. The following is a conceptual explanation of the tenets of good taxonomy design as the basis for an information or knowledge management system.

Taxonomy development and (document) content indexing processes should be completed as the first phase of any information management or knowledge management design. Here's why.

Taxonomies are collective tools that reduce complexity by suggesting a logical, ontological, i.e. culturally founded, hierarchical representation of categories. Indexing represents the process of applying these ontologies to any particular document's content in order to normalize its content. By passing through the indexing process, the document is consistently aligned against a taxonomic standard.

If, on the other hand, you organize data based on the experience or perspective of a given set of people--even subject matter experts--you will need to rebuild your organizational structure each time something changes, for example when a "new" expert is added, or a different or additional set of terms needs to be considered, etc. Indexing on the basis of a group of individual perspectives effectively shatters any attempt at establishing corporate consistency into a multitude of insupportable idiosyncrasies.

The taxonomy development and indexing phase therefore is intentionally user independent and driven entirely by content. Its goal is the creation of a stable, foundational structure that supports a corporation's need to properly and consistently index its corpus of data.

It is not until the second phase of knowledge management system design, the dynamic classification phase, that you tap into individual expertise. This phase of development supports the real-time classification of information from the user's perspective by providing tools that enable users to "slice and dice" data in the way that makes the most sense to them, given their unique perspective and the problem they are trying to solve at the time. The success of an individual's ability to classify data in this way, however, depends upon the proper and successful completion of the taxonomy design and content indexing phase.

This article will focus on indexing and taxonomy design but will touch on the issues that companies face when they blur or reverse these two development phases, which happens surprisingly frequently.

Where to Begin

First, you will need to identify the paradigms of information your customers/employees are interested in accessing. From this you can construct an ontology--reflecting the unique way your corporation chooses to view these paradigms or groupings of data. Next, you should find and incorporate thesauri that best describe this ontology. The thesauri will provide a controlled, agreed-upon vocabulary for that information, generally standard to an industry, including source terms, related terms and synonyms. For example, a pharmaceutical company may wish to use MeSH (Medical Subject Headings), while a defense company might want to use the DTIC (Defense Technical Information Center) thesaurus. These thesauri, layered upon ontologies, form the basis of your taxonomies. So what does it take to create a good taxonomy?

Tips for Good Taxonomy Design

Generally, it's better to develop multiple taxonomies each focused on a particular sphere of interest, versus creating a single, multi-purpose taxonomy. Good taxonomies will also contain certain characteristics described as follows.

Depth: Taxonomies should include no more than seven descending levels. Five levels is closer to optimal. Figure 1 shows a taxonomy design canon including the characteristics of each level.

[FIGURE 1 OMITTED]

The dotted line is where the "ontology" created by the company meets reality. Generic categories and terms found above the middle of the schema represent the level at which we start to describe the real world with our mental and linguistic tools; meaning we can touch a gun or a sign, whereas it would be hard to touch an ordnance. There are typically two ontological levels above the dotted lines. These are group nouns, collective terms used to "glue" the descending concepts together. The items at and below the dotted lines become increasingly specific, as does our ability to describe the world at a more granular and specific level. This way of deriving and compounding words to extend our grasp of the world is consistent in any language.
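The depth guideline (no more than seven levels) lends itself to a mechanical check. The sketch below assumes a taxonomy stored as nested dictionaries, where a leaf is an empty dictionary. The ordnance branch is a hypothetical illustration built from the terms mentioned above, not a reproduction of Figure 1.

```python
MAX_LEVELS = 7  # upper bound suggested in the text


def max_depth(node):
    """Count the levels in a nested-dict taxonomy; a leaf is an empty dict."""
    if not node:
        return 0
    return 1 + max(max_depth(child) for child in node.values())


# Hypothetical branch: group nouns on top, touchable things lower down.
WEAPONS = {"ordnance": {"weapon": {"gun": {"pistol": {}, "rifle": {}}}}}

depth = max_depth(WEAPONS)
print(depth, depth <= MAX_LEVELS)
# prints: 4 True
```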

Width: Some taxonomies are top-heavy, including too many higher levels or levels of equal definition. Synonyms start appearing as levels, creating ambiguities in logic and therefore instability, as shown in Figure 2.

[FIGURE 2 OMITTED]

The problem with the levels in this taxonomy should be apparent. What's the universally understood difference between unwelcome and unpleasant? Why should one term be on top of the other? Wouldn't the logic work perfectly well in reverse? This "taxonomy" is really a classification--meaning it probably reflects the view of an individual, but does not indicate a genus-to-species sequence of relationships. In a taxonomy, this collection of synonyms should be collapsed into a single level.
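Collapsing synonym levels can be sketched as a lookup against the thesaurus. The mapping and the sample path below are hypothetical; only "unwelcome" and "unpleasant" come from the discussion above, and which term is preferred is an arbitrary assumption.

```python
# Hypothetical thesaurus entries: synonym -> preferred term.
PREFERRED = {"unwelcome": "unpleasant", "disagreeable": "unpleasant"}


def collapse_synonyms(path, preferred=PREFERRED):
    """Merge consecutive levels that resolve to the same preferred term."""
    collapsed = []
    for term in path:
        canonical = preferred.get(term, term)
        if not collapsed or collapsed[-1] != canonical:
            collapsed.append(canonical)
    return collapsed


path = ["experiences", "unwelcome", "unpleasant", "disagreeable", "noise"]
print(collapse_synonyms(path))
# prints: ['experiences', 'unpleasant', 'noise']
```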

Balance: Another problem can occur when an ontology is not used as a starting point or as an overlay for a taxonomy. In Figure 3, the root words are too numerous, creating an overly flat structure. This company has used a thesaurus but not an ontology. You need both. In a proper taxonomy you would expect to find perhaps a single root, such as Accounting with second level categories such as accountants, accounting firms, each with their subsequent levels and so on. You would not see non-related root words, such as acceptance and accountability, within this taxonomy.

Another characteristic to watch out for is unbalanced structure. This is shown in Figure 4 where you have one category (Acceptance) indicating one path and another category (Accidents) providing hundreds of paths.
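One rough way to spot this kind of imbalance is to count the root-to-leaf paths under each root. The sketch below uses a small subset of the Figure 4 terms. The nesting shown is an assumption, since the figure's indentation did not survive in this copy.

```python
def count_paths(node):
    """Number of root-to-leaf paths below a node; a leaf counts as one path."""
    if not node:
        return 1
    return sum(count_paths(child) for child in node.values())


def paths_per_root(taxonomy):
    """Map each root category to the number of paths it provides."""
    return {root: count_paths(children) for root, children in taxonomy.items()}


# Subset of the Figure 4 terms; the parent/child nesting is assumed.
FIGURE_4_SUBSET = {
    "Acceptance": {"Product Acceptance": {}},
    "Accidents": {
        "Accident Prevention": {},
        "Aircraft Accidents and Safety": {"Air Traffic Control": {}, "Hijacking": {}},
        "Occupational Accidents": {"Industrial Accidents": {}},
        "Ship Accidents and Safety": {"Lighthouses": {}},
        "Traffic Accidents and Safety": {"Hit and Run Accidents": {}},
    },
}

print(paths_per_root(FIGURE_4_SUBSET))
# prints: {'Acceptance': 1, 'Accidents': 6}
```

A large ratio between the busiest and the quietest root is a hint that the roots do not sit at the same level of generality.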

Single Path Progression:

Consider the following real-life example of the confusion created when information is organized based on the experience of experts, versus from a taxonomical structure (Figure 5). This example was taken from a "taxonomy" used by a tax analyst firm. You see that assets and liabilities are duplicated in two different paths. Also note that if you reverse the structure, putting individuals and corporations under assets or assets and liabilities under individuals, the same logic applies.

[FIGURE 5 OMITTED]

This doesn't pass the first test of a proper taxonomy. An individual is not an asset, nor is a liability a corporation. These relationships don't reflect a proper genus to species relationship. Whenever a path is duplicated or if a path can be correctly reversed using the same logic, the structure is not taxonomical. Remember, a classification is a user-definable way of slicing and dicing data; a taxonomy is a uniform, non-changing structure for organizing data. The view above is a classification. It is a reflection of the way tax experts may wish to view information given their experience.
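The duplicated-path test can be approximated by looking for terms that hang under more than one parent. The edge list below is an assumed reading of the tax example (assets and liabilities repeated under individuals and corporations), not a transcription of Figure 5.

```python
from collections import defaultdict


def terms_with_multiple_parents(edges):
    """Return each child term that appears under more than one parent."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


# Assumed (parent, child) pairs for the tax analyst structure.
TAX_EDGES = [
    ("Individuals", "Assets"),
    ("Individuals", "Liabilities"),
    ("Corporations", "Assets"),
    ("Corporations", "Liabilities"),
]

print(terms_with_multiple_parents(TAX_EDGES))
# prints: {'Assets': ['Corporations', 'Individuals'], 'Liabilities': ['Corporations', 'Individuals']}
```

An empty result does not prove the structure is taxonomical, but any term reported here marks a duplicated path of the kind described above.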

Figure 6 shows the proper taxonomical structure for this same information. These two dimensions should actually constitute two separate taxonomies.

[FIGURE 6 OMITTED]

In this structure, the information is equally accessible, but it is organized in a non-changing and universally understood hierarchy.

Once your taxonomies have passed these simple, glanceable structure tests, you should further test them by running them against your information corpus. This next step is a further test of the quality of your taxonomies, showing how well they perform against real data, and giving you another opportunity to adjust and refine their structure before you move to the next step in knowledge management design.

Test Your Taxonomies

Before framing your documents against a taxonomy--essentially an indexing process--keywords, concepts and entities must be recognized in order to provide an accurate and precise understanding of the content of each document. In the best systems, a sophisticated linguistic analysis process is employed to correctly identify the phrases, concepts, entities, etc. (collectively referred to as tokens) found within documents. This linguistic analysis process generates a first level of normalization (stemming) on textual content. For example, if your organization chooses to use the word "gun" to represent everything from a pistol to a rifle, then all documents referencing these sorts of firearms will appear under the category of "gun."
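The "gun" example can be illustrated with a deliberately simple stand-in for linguistic analysis: a dictionary lookup that maps surface words to the organization's preferred term. A real system would use far richer analysis. The table and function names here are assumptions for illustration.

```python
import re

# Hypothetical normalization table: surface form -> preferred term.
PREFERRED_TERM = {
    "gun": "gun", "guns": "gun",
    "pistol": "gun", "pistols": "gun",
    "rifle": "gun", "rifles": "gun",
}


def extract_tokens(text):
    """Return the normalized term for every recognized word, in order."""
    words = re.findall(r"[a-z]+", text.lower())
    return [PREFERRED_TERM[word] for word in words if word in PREFERRED_TERM]


print(extract_tokens("Two rifles and a pistol were recovered."))
# prints: ['gun', 'gun']
```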

Once the tokens are extracted, they can be run--or latched--against one or multiple taxonomies, resulting in a very rich index of the contents of your documents. This indexing process creates a semantic signature for each document. The resulting semantic signature ranks all the taxonomic categories, which have been linked to each document token, along with additional information including the location of these tokens in the document, etc. The semantic signature--or collection of metadata--within each document is then retrievable as an XML representation of the contents of each document.
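A toy version of latching and of the resulting signature might look like the following. The category table, the element and attribute names, and the ranking rule (most latches first) are all assumptions; the article does not specify the actual XML schema or scoring.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical token -> taxonomy category table.
CATEGORY_OF = {"gun": "Weapons", "rifle": "Weapons", "sign": "Signage"}


def semantic_signature(tokens):
    """tokens: (token, position) pairs. Rank categories by number of latches."""
    hits = defaultdict(list)
    for token, position in tokens:
        category = CATEGORY_OF.get(token)
        if category is not None:
            hits[category].append(position)
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))


def to_xml(doc_id, signature):
    """Serialize a signature; element and attribute names are illustrative."""
    root = ET.Element("document", id=doc_id)
    for category, positions in signature:
        node = ET.SubElement(root, "category", name=category,
                             latches=str(len(positions)))
        node.text = " ".join(str(p) for p in positions)
    return ET.tostring(root, encoding="unicode")


signature = semantic_signature([("gun", 3), ("sign", 10), ("rifle", 17), ("tree", 20)])
print(to_xml("doc-1", signature))
```

Tokens with no matching category ("tree" in this sample) simply fail to latch, which is itself useful information when judging taxonomy coverage.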

As a result of this process, you can view the "latching scores" of your documents, another indicator of the health of your design. You'll see concentrations of latches in certain categories, which may be overly deep or wide, or clearly duplicated. Taxonomically based indexing technology will produce tables with additional metrics. These metrics will show which categories are overproducing tags and which aren't producing enough. Figure 7 shows the distribution of category production on a vertical scale of depth through the taxonomy. You are looking for a nice bell curve on both sides.
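The kind of metrics described here can be sketched with two small summaries: latches aggregated by depth, and categories that take an outsized share or none at all. The counts, depths and the 50 percent threshold are invented for illustration; actual indexing tools will define their own measures.

```python
from collections import Counter


def latches_by_depth(latch_counts, depth_of):
    """Total latches at each taxonomy depth, ordered from the top level down."""
    totals = Counter()
    for category, count in latch_counts.items():
        totals[depth_of[category]] += count
    return dict(sorted(totals.items()))


def flag_categories(latch_counts, high_share=0.5):
    """Return (overproducing, unproductive) category lists."""
    total = sum(latch_counts.values())
    over = [c for c, n in latch_counts.items() if total and n / total > high_share]
    under = [c for c, n in latch_counts.items() if n == 0]
    return over, under


# Invented sample data.
LATCHES = {"Accidents": 2, "Traffic Accidents and Safety": 40,
           "Hit and Run Accidents": 6, "Lighthouses": 0}
DEPTH = {"Accidents": 1, "Traffic Accidents and Safety": 2,
         "Hit and Run Accidents": 3, "Lighthouses": 3}

print(latches_by_depth(LATCHES, DEPTH))
# prints: {1: 2, 2: 40, 3: 6}
print(flag_categories(LATCHES))
# prints: (['Traffic Accidents and Safety'], ['Lighthouses'])
```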

[FIGURE 7 OMITTED]

Using this and other latching metrics, you can go back, adjust the ontology as appropriate and re-run the documents, testing the results until you see a proper, balanced structure. Once achieved, the data is ready for import to your content management system, portal, or whatever information management or viewing mechanism you choose.

This description is intended to provide a conceptual view of the components of proper taxonomy design and its function within the initial development phase of a taxonomical knowledge management structure. Like most design projects, the discipline is in the details, and good taxonomy design generally requires the specialized expertise of a librarian or taxonomy expert.

To summarize, companies should begin by identifying the groupings of data their employees will need to access. From this, an ontology reflective of the company's unique view of its world can be created. Next, companies should investigate and incorporate thesauri that best describe their ontology. Companies can then begin taxonomy design, making sure each taxonomy used has the right number of levels, depth, width and structure balance. Using the results generated when documents are initially latched to these taxonomies, companies should then re-review all categories, making sure that the document levels are balanced and reasonable. If not, the company can go back and clean up its ontology, re-running the documents until the structure is correct.

And yet, as important a foundation as a taxonomical structure is to knowledge management systems, ideally it should be invisible to the end user. Users should be able to take for granted that they have reliable, comprehensive access to all the data they need. Their attention should go instead to connecting data--identifying previously unrecognized relationships and building their own unique webs of related data that may be disconnected in time, geography and even domain. But all of this begins with the sound underlying structure of a taxonomically indexed corpus of data.

"A Roadmap for Proper Taxonomy Design: Part 2" will appear in an upcoming edition of Computer Technology Review.

Figure 3

Acceptance Product Acceptance

Accountability Social Responsibility Social Investing

Accountants Public Accountants CPAs Attorney CPAs

Accounting Firms Big Five Accountant Firms Big Six Accounting Firms

Figure 4

Acceptance Product Acceptance

Accidents Accident Prevention Aircraft Accidents and Safety Air Traffic Control Hijacking

Boating Accidents and Safety Construction Accidents and Safety Electrocutions Falls Firearm Accidents and Safety Household Accidents and Safety Nuclear Accidents and Safety

Occupational Accidents Industrial Accidents

Occupational Safety Indoor Air Quality

Railroad Accidents and Safety

Ship Accidents and Safety Lighthouses

Swimming Accidents and Safety Drownings

Traffic Accidents and Safety Hit and Run Accidents

www.convera.com

Dr. Claude Vogel is CTO of Convera (Vienna, Virginia).
COPYRIGHT 2003 West World Productions, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2003, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:Internet
Author:Vogel, Claude
Publication:Computer Technology Review
Date:Jul 1, 2003