
A roadmap for proper taxonomy design: Part 1 of 2.


Burgeoning information quantities, regulatory compliance requirements and competitive drivers for speedier, more accurate data analysis continue to spur the development of many new, innovative information management technologies. Yet, pivotal to the success of these technologies is a rather old technology--arguably defined by Aristotle and later refined by Linnaeus--called a taxonomy.

A taxonomy is a hierarchical system describing the descending relationships between species and genera. Species derive from a common genus and, within a taxonomy, are hierarchically represented according to their essential characteristics and differences. For example, a thoroughbred is a type of horse, which is an equid, which is a mammal and so on. Another useful term to know when defining a taxonomy is ontology. An ontology is a foundation of categories representing a particular organization's view of its world. It also reflects the organization's commonly used and trusted breakdown of those categories. For example, the logical breakdown that a news broadcast organization might use for its news items (World, Sports, Politics, etc.) is ontological.
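The genus-to-species chain in the thoroughbred example can be sketched as a simple child-to-parent map. This is only an illustration: the `PARENT` table and the `lineage` helper are hypothetical names, and the four terms are the ones used in the text.

```python
# Hypothetical child -> parent links for the example in the text.
PARENT = {
    "thoroughbred": "horse",
    "horse": "equid",
    "equid": "mammal",
}


def lineage(term, parent=PARENT):
    """Return the path from a term up to its root, most specific first."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


# Print the chain from the most general term down to the most specific one.
print(" > ".join(reversed(lineage("thoroughbred"))))
```

Each term has exactly one parent, which is what makes the structure a hierarchy rather than a free-form set of tags.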

Taxonomies, used in conjunction with company ontologies, have proven to be a highly efficient structure for organizing structured and unstructured content. Taxonomies are therefore highly desirable, indeed a critical component of any sizable information management structure, be it a content management system, Intranet or portal.

Properly used and designed, taxonomies provide a consistent, scalable, stable means of organizing even vast quantities of data. They provide a navigable foundation enabling the logical, intuitive access of data. Improperly designed, taxonomies become a maintenance nightmare. The following is a conceptual explanation of the tenets of good taxonomy design as the basis for an information or knowledge management system.

Taxonomy development and (document) content indexing processes should be completed as the first phase of any information management or knowledge management design. Here's why.

Taxonomies are collective tools that reduce complexity by suggesting a logical, ontological, i.e. culturally founded, hierarchical representation of categories. Indexing represents the process of applying these ontologies to any particular document's content in order to normalize its content. By passing through the indexing process, the document is consistently aligned against a taxonomic standard.

If, on the other hand, you organize data based on the experience or perspective of a given set of people--even subject matter experts--you will need to rebuild your organizational structure each time something changes, for example when a "new" expert is added, or a different or additional set of terms needs to be considered, etc. Indexing on the basis of a group of individual perspectives effectively shatters any attempt at establishing corporate consistency into a multitude of insupportable idiosyncrasies.

The taxonomy development and indexing phase therefore is intentionally user independent and driven entirely by content. Its goal is the creation of a stable, foundational structure that supports a corporation's need to properly and consistently index its corpus of data.

It is not until the second phase of knowledge management system design, the dynamic classification phase, that you tap into individual expertise. This phase of development supports the real-time classification of information--from the user's perspective--by providing tools enabling the user to "slice and dice" data in the way that makes the most sense to them given their unique perspective and the problem they are trying to solve at the time. The success of an individual's ability to classify data in this way, however, depends upon the proper and successful completion of the taxonomy design and content indexing phase.

This article will focus on indexing and taxonomy design but will touch on the issues that companies face when they blur or reverse these two development phases, which happens surprisingly frequently.

Where to Begin

First, you will need to identify the paradigms of information your customers/employees are interested in accessing. From this you can construct an ontology--reflecting the unique way your corporation chooses to view these paradigms or groupings of data. Next, you should find and incorporate thesauri that best describe this ontology. The thesauri will provide a controlled, agreed upon vocabulary for that information, generally standard to an industry, including source terms, related terms and synonyms. For example, a pharmaceutical company may wish to use MeSH (Medical Subject Headings), while a defense company might want to use the DTIC (Defense Technical Information Center) thesaurus. These thesauri layered upon ontologies form the basis of your taxonomies. So what does it take to create a good taxonomy?

Tips for Good Taxonomy Design

Generally, it's better to develop multiple taxonomies each focused on a particular sphere of interest, versus creating a single, multi-purpose taxonomy. Good taxonomies will also contain certain characteristics described as follows.

Depth: Taxonomies should include no more than seven descending levels. Five levels is closer to optimal. Figure 1 shows a taxonomy design canon including the characteristics of each level.

[FIGURE 1 OMITTED]

The dotted line is where the "ontology" created by the company meets reality. Generic categories and terms found above the middle of the schema represent the level at which we start to describe the real world with our mental and linguistic tools; meaning we can touch a gun or a sign, whereas it would be hard to touch an ordnance. There are typically two ontological levels above the dotted line. These are group nouns, collective terms used to "glue" the descending concepts together. The items at and below the dotted line become increasingly specific, as does our ability to describe the world at a more granular and specific level. This way of deriving and compounding words to extend our grasp of the world is consistent in any language.
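The depth guideline above lends itself to a quick automated check. The following is a minimal sketch, assuming the taxonomy is held as nested dictionaries (category name mapped to its children); the `weapons` tree, including the intermediate "firearm" level, is a hypothetical illustration and not the content of Figure 1.

```python
MAX_LEVELS = 7  # upper bound suggested in the text; five is the preferred target


def depth(tree):
    """Number of descending levels in a nested-dict taxonomy."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())


# Hypothetical taxonomy loosely echoing the ordnance/gun example.
weapons = {"ordnance": {"firearm": {"gun": {"pistol": {}, "rifle": {}}}}}

levels = depth(weapons)
print(levels, "levels;", "OK" if levels <= MAX_LEVELS else "too deep")
```

A check like this only measures the longest path. It says nothing about whether each level is a true genus-to-species step, which still needs human review.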

Width: Some taxonomies are top-heavy, including too many higher levels or levels of equal definition. Synonyms start appearing as levels, creating ambiguities in logic and therefore instability, as shown in Figure 2.

[FIGURE 2 OMITTED]

The problem with the levels in this taxonomy should be apparent. What's the universally understood difference between unwelcome and unpleasant? Why should one term be on top of the other? Wouldn't the logic work perfectly well in reverse? This "taxonomy" is really a classification--meaning it probably reflects the view of an individual, but does not indicate a genus to species sequence of relationships. In a taxonomy, this collection of synonyms should be collapsed into a single level.
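One way to spot synonyms posing as levels is to compare each parent-child pair against the synonym sets of the thesaurus in use. The sketch below assumes a nested-dict taxonomy and a list of synonym sets. The function name and the sample data are hypothetical, and the term "disagreeable" is added purely for illustration.

```python
def synonym_levels(taxonomy, synonym_sets):
    """Return (parent, child) pairs that the thesaurus treats as synonyms."""
    flagged = []

    def walk(children, parent):
        for name, sub in children.items():
            if parent is not None and any(
                parent in group and name in group for group in synonym_sets
            ):
                flagged.append((parent, name))
            walk(sub, name)

    walk(taxonomy, None)
    return flagged


# Hypothetical fragment in the spirit of the unwelcome/unpleasant example.
fragment = {"unwelcome": {"unpleasant": {"disagreeable": {}}}}
synonyms = [{"unwelcome", "unpleasant", "disagreeable"}]

for parent, child in synonym_levels(fragment, synonyms):
    print(f"collapse candidate: {parent} > {child}")
```

Every flagged pair is a candidate for collapsing into a single level, with the extra terms kept as synonyms in the thesaurus rather than as categories.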

Balance: Another problem can occur when an ontology is not used as a starting point or as an overlay for a taxonomy. In Figure 3, the root words are too numerous, creating an overly flat structure. This company has used a thesaurus but not an ontology. You need both. In a proper taxonomy you would expect to find perhaps a single root, such as Accounting with second level categories such as accountants, accounting firms, each with their subsequent levels and so on. You would not see non-related root words, such as acceptance and accountability, within this taxonomy.

Another characteristic to watch out for is unbalanced structure. This is shown in Figure 4 where you have one category (Acceptance) indicating one path and another category (Accidents) providing hundreds of paths.
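A rough balance check is to count how many paths each root category provides. This sketch assumes a nested-dict taxonomy. The sample uses a few category names from Figure 4, but the nesting shown is an assumption, since the figure's original layout is not recoverable here.

```python
def count_paths(children):
    """Count root-to-leaf paths beneath a node in a nested-dict taxonomy."""
    if not children:
        return 1
    return sum(count_paths(sub) for sub in children.values())


def balance_report(taxonomy):
    """Map each root category to the number of paths it provides."""
    return {root: count_paths(sub) for root, sub in taxonomy.items()}


# Small assumed fragment; the real Accidents branch is far larger.
sample = {
    "Acceptance": {"Product Acceptance": {}},
    "Accidents": {
        "Accident Prevention": {},
        "Railroad Accidents and Safety": {},
        "Swimming Accidents and Safety": {"Drownings": {}},
        "Traffic Accidents and Safety": {"Hit and Run Accidents": {}},
    },
}

print(balance_report(sample))
```

A large spread between the smallest and largest counts is the signal to look for. What counts as "too large" is a judgment call that depends on the corpus.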

Single Path Progression:

Consider the following real-life example of the confusion created when information is organized based on the experience of experts, versus from a taxonomical structure (Figure 5). This example was taken from a "taxonomy" used by a tax analyst firm. You see that assets and liabilities are duplicated in two different paths. Also note that if you reverse the structure, putting individuals and corporations under assets or assets and liabilities under individuals, the same logic applies.

[FIGURE 5 OMITTED]

This doesn't pass the first test of a proper taxonomy. An individual is not an asset, nor is a liability a corporation. These relationships don't reflect a proper genus to species relationship. Whenever a path is duplicated or if a path can be correctly reversed using the same logic, the structure is not taxonomical. Remember, a classification is a user-definable way of slicing and dicing data; a taxonomy is a uniform, non-changing structure for organizing data. The view above is a classification. It is a reflection of the way tax experts may wish to view information given their experience.
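The duplicated-path test can be automated by recording the parents of every category and flagging any category that appears under more than one. The sketch below assumes a nested-dict taxonomy. The `tax_view` structure is a guess at the shape described for Figure 5, not a copy of it.

```python
from collections import defaultdict


def duplicated_categories(taxonomy):
    """Return categories that appear under more than one parent."""
    parents = defaultdict(set)

    def walk(children, parent):
        for name, sub in children.items():
            parents[name].add(parent)
            walk(sub, name)

    walk(taxonomy, None)
    return sorted(name for name, seen in parents.items() if len(seen) > 1)


# Assumed shape of the tax analyst view described in the text.
tax_view = {
    "Individuals": {"Assets": {}, "Liabilities": {}},
    "Corporations": {"Assets": {}, "Liabilities": {}},
}

print(duplicated_categories(tax_view))
```

The reversal test (would the same logic hold with the levels swapped?) is harder to automate and is probably best left to a reviewer.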

Figure 6 shows the proper taxonomical structure for this same information. These two dimensions should actually constitute two separate taxonomies.

[FIGURE 6 OMITTED]

In this structure, the information is equally accessible, but it is organized in a non-changing and universally understood hierarchy.

Once your taxonomies have passed these simple, glanceable structure tests, you should further test them by running them against your information corpus. This next step is a further test of the quality of your taxonomies, showing how well they perform against real data, and giving you another opportunity to adjust and refine their structure before you move to the next step in knowledge management design.

Test Your Taxonomies

Before framing your documents against a taxonomy--essentially an indexing process--keywords, concepts and entities must be recognized in order to provide an accurate and precise understanding of the content of each document. In the best systems, a sophisticated linguistic analysis process is employed to correctly identify the phrases, concepts, entities, etc. (collectively referred to as tokens) found within documents. This linguistic analysis process generates a first level of normalization (stemming) on textual content. For example, if your organization chooses to use the word "gun" to represent everything from a pistol to a rifle, then all documents referencing these sorts of firearms will appear under the category of "gun."
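The "gun" example can be illustrated with a plain lookup table that maps variant terms to a preferred term. This is a deliberately simplified stand-in: it is not stemming and not the linguistic analysis the text describes, and the `PREFERRED` entries and the `tokens` helper are hypothetical.

```python
import re

# Hypothetical thesaurus entries: variant term -> preferred term.
PREFERRED = {
    "pistol": "gun",
    "pistols": "gun",
    "rifle": "gun",
    "rifles": "gun",
    "guns": "gun",
}


def tokens(text):
    """Lowercase the text, split it into words, and apply preferred terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return [PREFERRED.get(word, word) for word in words]


print(tokens("Two rifles and a Pistol"))
```

A production system would also need to recognize multi-word phrases and entities, which a word-by-word lookup like this cannot do.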

Once the tokens are extracted, they can be run--or latched--against one or multiple taxonomies, resulting in a very rich index of the contents of your documents. This indexing process creates a semantic signature for each document. The resulting semantic signature ranks all the taxonomic categories, which have been linked to each document token, along with additional information including the location of these tokens in the document, etc. The semantic signature--or collection of metadata--within each document is then retrievable as an XML representation of the contents of each document.
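A semantic signature of the kind described above might be assembled as follows. This is a sketch under stated assumptions: the element and attribute names (`signature`, `category`, `latches`, `positions`) are invented for illustration and do not reflect any particular product's XML schema, and the token-to-category map is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def semantic_signature(doc_tokens, token_to_category):
    """Group token positions by category, ranked by number of latches."""
    hits = defaultdict(list)
    for position, token in enumerate(doc_tokens):
        category = token_to_category.get(token)
        if category is not None:
            hits[category].append(position)
    # Most-latched categories first; ties broken alphabetically.
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))


def to_xml(doc_id, signature):
    """Serialize a signature to an XML string (assumed element names)."""
    root = ET.Element("signature", {"doc": doc_id})
    for category, positions in signature:
        ET.SubElement(root, "category", {
            "name": category,
            "latches": str(len(positions)),
            "positions": " ".join(str(p) for p in positions),
        })
    return ET.tostring(root, encoding="unicode")


# Hypothetical tokens and category map.
signature = semantic_signature(
    ["gun", "sign", "gun"], {"gun": "Firearms", "sign": "Signage"}
)
xml_text = to_xml("doc-001", signature)
print(xml_text)
```

Keeping token positions alongside the ranking preserves the "location of these tokens" information the text mentions, so later tools can point back into the document.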

As a result of this process, you can view the "latching scores" of your documents, another indicator of the health of your design. You'll see concentrations of latches in certain categories, which may be overly deep or wide, or clearly duplicated. Taxonomically based indexing technology will produce tables with additional metrics. These metrics will show which categories are overproducing tags and which aren't producing enough. Figure 7 shows the distribution of category production on a vertical scale of depth through the taxonomy. You are looking for a nice bell curve on both sides.
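The depth distribution described for Figure 7 can be approximated by summing latch counts per taxonomy level. The sketch below assumes you already have a latch count per category and the level of each category. All names and numbers are invented for illustration.

```python
from collections import Counter


def latches_by_depth(latch_counts, category_depth):
    """Sum latch counts per taxonomy level, ordered from top level down."""
    totals = Counter()
    for category, count in latch_counts.items():
        totals[category_depth[category]] += count
    return dict(sorted(totals.items()))


# Hypothetical latch counts and category levels.
latch_counts = {"ordnance": 2, "firearm": 9, "gun": 14, "pistol": 8, "rifle": 3}
category_depth = {"ordnance": 1, "firearm": 2, "gun": 3, "pistol": 4, "rifle": 4}

for level, total in latches_by_depth(latch_counts, category_depth).items():
    print(f"level {level}: {'#' * total} ({total})")
```

In this made-up data, production peaks in the middle levels and tapers toward the top and bottom, which is the general shape the text says to look for.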

[FIGURE 7 OMITTED]

Using this and other latching metrics, you can go back, adjust the ontology as appropriate and re-run the documents, testing the results until you see a proper, balanced structure. Once achieved, the data is ready for import to your content management system, portal, or whatever information management or viewing mechanism you choose.

This description is intended to provide a conceptual view of the proper design and components of a taxonomy and its function within the initial development phase of a taxonomical knowledge management structure. Like most design projects, the discipline is in the details and good taxonomy design generally requires the specialized expertise of a librarian or taxonomy expert.

To summarize, companies should begin by identifying the groupings of data their employees will need to access. From this, an ontology reflective of the company's unique view of its world can be created. Next, companies should investigate and incorporate thesauri that best describe their ontology. Companies can then begin taxonomy design, making sure each taxonomy used has the right number of levels, depth, width and structure balance. Using the results generated when documents are initially latched to these taxonomies, companies should then re-review all categories, making sure that the document levels are balanced and reasonable. If not, the company can go back and clean up its ontology, re-running the documents until the structure is correct.

And yet, as important a foundation as a taxonomical structure is to knowledge management systems, ideally it should be invisible to the end user. Users should take for granted that they have reliable, comprehensive access to all the data they need. Instead, users should be empowered to connect data--to identify previously un-recognized relationships and build their own unique webs of related data that may be disconnected in time, geography and even domain. But all of this begins with the sound underlying structure of a taxonomically indexed corpus of data.

"A Roadmap for Proper Taxonomy Design: Part 2" will appear in an upcoming edition of Computer Technology Review.

Figure 3

Acceptance Product Acceptance

Accountability Social Responsibility Social Investing

Accountants Public Accountants CPAs Attorney CPAs

Accounting Firms Big Five Accountant Firms Big Six Accounting Firms

Figure 4

Acceptance Product Acceptance

Accidents Accident Prevention Aircraft Accidents and Safety Air Traffic Control Hijacking

Boating Accidents and Safety Construction Accidents and Safety Electrocutions Falls Firearm Accidents and Safety Household Accidents and Safety Nuclear Accidents and Safety

Occupational Accidents Industrial Accidents

Occupational Safety Indoor Air Quality

Railroad Accidents and Safety

Ship Accidents and Safety Lighthouses

Swimming Accidents and Safety Drownings

Traffic Accidents and Safety Hit and Run Accidents

www.convera.com

Dr. Claude Vogel is CTO (Chief Technical Officer) of Convera (Vienna, Virginia).
COPYRIGHT 2003 West World Productions, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2003, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:Internet
Author:Vogel, Claude
Publication:Computer Technology Review
Date:Jul 1, 2003
Words:2182