The truth about taxonomies.
This article
* defines a taxonomy
* explains how an organization can use and develop taxonomies
* identifies types of taxonomies
Imagine opening up file cabinet drawers, credenzas, or desk drawers and seeing papers and materials piled up and scattered with no rhyme or reason. Imagine information on a computer stored in one or two big dumping grounds according to the name of a person or titles that only make sense to the creator, with no breakdowns according to specified groupings. Chances are, in either case, it will take a long time to locate files. And what happens when there are new items to add? The valuable space being occupied in these examples is not being used and organized to provide the best benefit in terms of space and time efficiency.
In an office situation, a taxonomy or classification scheme to organize the paper and/or electronic documentation is required. Most organizations use some form of structure to manage their paper documentation. This may or may not be a documented procedure. It may or may not be a system that is widely understood by all employees. It may or may not reflect the business needs of the organization. When it comes to the electronic information in most organizations, it is often every computer or shared drive for itself. Often there are no guidelines or procedures for how these repositories of corporate information and knowledge are to be handled. Organizations frequently overlook the management of one of their most important business assets--information. Information is the fuel that keeps an organization running smoothly. Why then do organizations not give more time and attention to the management of this important asset? Unfortunately, no one discusses the need for better management of information until a crisis hits.
WHAT IS A TAXONOMY?
According to www.whatis.com:
Taxonomy (from Greek "taxis" meaning arrangement or division and "nomos" meaning law) is the science of classification according to a pre-determined system, with the resulting catalog used to provide a conceptual framework for discussion, analysis, or information retrieval. In theory, the development of a good taxonomy takes into account the importance of separating elements of a group ("taxon") into subgroups ("taxa") that are mutually exclusive, unambiguous, and taken together, include all possibilities. In practice, a good taxonomy should be simple, easy to remember, and easy to use.
Another definition, according to Jean Graef of the Montague Institute, is:
"... structures that provide a way of classifying things--living organisms, products, books--into a series of hierarchical groups to make them easier to identify, study, or locate. Taxonomies consist of two parts--structures and applications. Structures consist of the categories (or terms) themselves and the relationships that link them together. Applications are the navigation tools available to help users find information."
Other terms associated with taxonomy development and implementation are controlled vocabulary, thesaurus, and user warrant. A controlled vocabulary is an indexing language (i.e., a standardized set of terms and phrases authorized for use in an indexing system to describe a subject area or information domain). A thesaurus is a type of controlled vocabulary that shows the hierarchical (parent-child), associative (related), and equivalent (synonymous) relationships among terms. Often, controlled vocabulary, thesaurus, and classification structure (taxonomy) are used interchangeably. User warrant is a justification for the representation of a concept or for the selection of a preferred term because of individual user needs.
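The three relationship types that a thesaurus records can be pictured as a small data structure. The Python below is only a minimal sketch under stated assumptions: the class name `ThesaurusEntry`, its field names, the helper `build_lookup`, and the sample terms are all invented for illustration and do not come from any thesaurus standard or product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThesaurusEntry:
    """One preferred term in a controlled vocabulary (hypothetical structure)."""
    preferred: str
    broader: List[str] = field(default_factory=list)   # hierarchical: parent terms
    narrower: List[str] = field(default_factory=list)  # hierarchical: child terms
    related: List[str] = field(default_factory=list)   # associative: related terms
    use_for: List[str] = field(default_factory=list)   # equivalent: synonyms


def build_lookup(entries):
    """Map every preferred term and synonym (lower-cased) to its preferred term."""
    lookup = {}
    for entry in entries:
        lookup[entry.preferred.lower()] = entry.preferred
        for synonym in entry.use_for:
            lookup[synonym.lower()] = entry.preferred
    return lookup


# Sample terms are invented for illustration only.
entries = [
    ThesaurusEntry(
        preferred="Invoices",
        broader=["Accounts Payable"],
        related=["Purchase Orders"],
        use_for=["Bills", "Vendor Invoices"],
    ),
]
lookup = build_lookup(entries)
print(lookup["bills"])  # the synonym resolves to the preferred term
```

Resolving synonyms to one preferred term is what lets two people who describe the same document differently still file it, and find it, under the same heading.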
In essence, a taxonomy is a hierarchical classification of headings constructed using the principles of classification, and a thesaurus supplies the commentary and links to navigate the taxonomy. In today's information-dependent environment, where we are receiving, accessing, and using information in its many forms, it is absolutely imperative that there are well-defined and documented structures in place. This ensures that the person needing the information receives it in the timeframe required.
However, the reality in most organizations is a
* lack of standardized procedures
* "stovepipe" approach to information process
* lack of an information-sharing culture
* proliferation of legacy systems
Taxonomies can provide:
1. Identification--The taxonomy can help control the glut of information and identify where information should be stored by filtering, categorizing, and labeling information.
2. Discovery--Additional information on a topic can be inferred from where the entry is placed within the taxonomy, and that context can provide serendipitous guidance to the person working on the issue.
3. Delivery--The taxonomy can improve the retrieval process. The use of the taxonomy's controlled vocabulary enhances searching via browsing. The use of navigation paths or "breadcrumbs" based on the taxonomy's hierarchy provides context and enhances searching via free text. For example, if a free text search returns 100 hits for the word "bridge," the navigation path for each hit provides the context required to show whether the record refers to a structure, a card game, or financing. It is not necessary to open each returned record to see how the word "bridge" is used.
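The "bridge" example can be sketched in a few lines of Python. This is an illustration only: the headings in `PARENT`, the records in `HITS`, and the function name `breadcrumb` are hypothetical and stand in for whatever a real search tool would return.

```python
# Hypothetical taxonomy: each heading maps to its parent
# (None marks a top-level heading).
PARENT = {
    "Engineering": None,
    "Structures": "Engineering",
    "Recreation": None,
    "Card Games": "Recreation",
    "Finance": None,
    "Loans": "Finance",
}

# Hypothetical free-text hits for "bridge": (record title, heading it is filed under).
HITS = [
    ("Bridge inspection report", "Structures"),
    ("Bridge club schedule", "Card Games"),
    ("Bridge financing terms", "Loans"),
]


def breadcrumb(heading, parent=PARENT):
    """Return the navigation path from the top-level heading down to `heading`."""
    path = []
    while heading is not None:
        path.append(heading)
        heading = parent[heading]
    return " > ".join(reversed(path))


# Showing the path beside each hit gives the context without opening the record.
for title, heading in HITS:
    print(f"{breadcrumb(heading)}: {title}")
```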
In addition to performing these basic functions, Graef suggests "A taxonomy should also inspire trust. The user should feel confident that the taxonomy will help him find the information he seeks--if it exists ... As more information gets into electronic format and becomes available over global networks, it gets harder to ensure that any one taxonomy is both sufficiently specific and comprehensive."
Although many logically different structures address taxonomies or classifications, two of the most widely known are the generic relationship and the whole-part relationship. In her article, "The Role of Classification in Knowledge Representation and Discovery," Barbara H. Kwasnik defines these relationships and provides the pros and cons of each.
1. Generic Relationship (Genus/Species) This theorem is probably the most true of all taxonomies. It adheres to strict structural requirements and contains the following properties: genus/species, inclusiveness, inheritance, transitivity, rules for association and distinction, and mutual exclusivity. The following, taken from the Medical Subject Headings (National Library of Medicine), is an example of such a structure:
* Eye Diseases
  * Conjunctival Diseases
    * Conjunctival Neoplasm
  * Corneal Diseases
Genus/species: A true hierarchy has only one type of relationship between its super and subclasses, which is known as the "IS-A" relationship. In a generic relationship, keratoconjunctivitis is a kind of conjunctivitis, which in turn is a kind of conjunctival disease, which in turn is a kind of eye disease.
Inclusiveness: The top class is the most inclusive class and describes the domain of the classification. The top class includes all of its sub-classes. Everything below eye diseases is an eye disease.
Inheritance: This ensures that everything that is true for entities in a given class is also true for entities in a subclass. Whatever is true of eye diseases (as a whole) is also true of conjunctival diseases, and so on. Attributes are inherited by a subclass from its super class. This is a downward flow of information.
Transitivity: All subclasses are members of not only their immediate super class but of every super class above that one. If keratoconjunctivitis is a kind of conjunctivitis, and conjunctivitis is a kind of conjunctival disease, then by the rule of transitivity, keratoconjunctivitis is also a kind of conjunctival disease. This is an upward flow of information.
Systematic and Predictable Rules for Association and Distinction: All entities in a given class are like each other in some predictable and predetermined way, and these entities differ from entities in sibling classes in some predictable and predetermined way. Conjunctival diseases and corneal diseases are alike in that they are both kinds of eye diseases. They differ from each other in some predictable and systematic criterion of distinction (in this case, the "part of the eye affected").
Mutual Exclusivity: A given entity can belong to only one class.
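The properties of a generic relationship can be made concrete with a short Python sketch. It is a simplified illustration, not a representation of how the Medical Subject Headings are actually stored: the headings are those named in the text, while the dictionary names `IS_A` and `ATTRIBUTES`, the function names, and the attribute values are assumptions made for this example.

```python
# Each class names exactly one parent, which enforces mutual exclusivity.
IS_A = {
    "Eye Diseases": None,
    "Conjunctival Diseases": "Eye Diseases",
    "Corneal Diseases": "Eye Diseases",
    "Conjunctivitis": "Conjunctival Diseases",
    "Keratoconjunctivitis": "Conjunctivitis",
}

# Attributes are recorded where they first apply (values are illustrative only).
ATTRIBUTES = {
    "Eye Diseases": {"organ": "eye"},
    "Conjunctival Diseases": {"part_affected": "conjunctiva"},
    "Corneal Diseases": {"part_affected": "cornea"},
}


def ancestors(cls):
    """Transitivity: every super class above `cls`, nearest first."""
    result = []
    parent = IS_A[cls]
    while parent is not None:
        result.append(parent)
        parent = IS_A[parent]
    return result


def is_a(cls, candidate):
    """True if `cls` is a kind of `candidate`, directly or transitively."""
    return candidate in ancestors(cls)


def inherited_attributes(cls):
    """Inheritance: merge attributes from the top class down to `cls`."""
    merged = {}
    for name in reversed([cls] + ancestors(cls)):
        merged.update(ATTRIBUTES.get(name, {}))
    return merged


print(is_a("Keratoconjunctivitis", "Eye Diseases"))
print(inherited_attributes("Keratoconjunctivitis"))
```

The sibling classes share the inherited `organ` attribute and differ only in `part_affected`, which mirrors the rule for association and distinction.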
A well-known example of this type of taxonomy is found in biology, the Kingdom-Phylum-Class-Order-Family-Genus-Species classification of life as developed by Karl von Linne (also known as Linnaeus) and published in 1758. This marked the beginning of modern classification of plants and animals.
A generic relationship is most useful for representing knowledge in mature domains in which the nature of the entities and the nature of meaningful relationships are already known. It is useful for entities that are well defined and have clear class boundaries, for example, a subject body of knowledge.
2. Whole-Part Relationship (Tree) This type of relationship also progresses from the more general to the more specific. The following is an example of such a relationship:
* Engine Block
The most marked difference between this classification theorem and the generic relationship is that the whole-part relationship does not assume the rule of genus/species and, therefore, also does not assume the rule of inheritance. A body is not a type of automobile. Upholstery is not a type of interior.
As opposed to genus/species hierarchies, where the flow of information is both vertical and lateral, in whole-part classifications the flow of information is only vertical. There are systematic and predictable rules for distinction. Pistons and valves are known to be different parts of an engine block. However, this relationship does not assume systematic and predictable rules for association; pistons and valves are not both kinds of engine blocks, nor can it be assumed that they share many attributes because of their sibling position within the hierarchy. They share the attribute that they are part of the engine block, but that is only a partial explanation of what they are.
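A whole-part structure can be sketched the same way as a hierarchy, with one important difference: nothing is inherited. The Python below is a minimal illustration. The part names come from the discussion above, but their exact arrangement under "Automobile", the names `PART_OF` and `ATTRIBUTES`, and the attribute value are assumptions made for this example.

```python
# part -> whole. Placing "Engine Block" directly under "Automobile"
# is an assumption made for this sketch.
PART_OF = {
    "Automobile": None,
    "Body": "Automobile",
    "Interior": "Automobile",
    "Upholstery": "Interior",
    "Engine Block": "Automobile",
    "Pistons": "Engine Block",
    "Valves": "Engine Block",
}

# Illustrative attribute recorded for one part only.
ATTRIBUTES = {"Engine Block": {"material": "cast iron"}}


def wholes(part):
    """Vertical flow: every assembly that contains `part`, nearest first."""
    result = []
    whole = PART_OF[part]
    while whole is not None:
        result.append(whole)
        whole = PART_OF[whole]
    return result


def parts(whole):
    """Direct parts of `whole`, in the order they were listed."""
    return [part for part, parent in PART_OF.items() if parent == whole]


def attributes(part):
    """No inheritance: a part reports only the attributes recorded for itself."""
    return dict(ATTRIBUTES.get(part, {}))


print(wholes("Pistons"))
print(parts("Engine Block"))
print(attributes("Pistons"))  # empty: nothing flows down from the engine block
```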
Whole-part classifications or taxonomies are found in the organization of most Web sites and, more generally, in function-based enterprise classifications and geographic-based classifications. They are also used as corporate directories. They are more popular than generic classifications. Some argue, however, that they are not true taxonomies.
An interesting nuance in the relationship between those versed in traditional classification theory and those in the records management profession is the way in which each regards the utility of a taxonomy. The former seeks to define and represent the relationships between entities for the purposes of identification and retrieval and may adhere to a more strict structure for a taxonomy. The latter seeks to classify information not only for identification and retrieval but also, it can be argued, ultimately to apply retention requirements--something that would not be considered by the former. The necessity to incorporate retention requirements may result in a looser application of classification rules. Determining the reason for constructing a taxonomy and the needs of its users is paramount.
DEVELOPING A TAXONOMY
Whether the taxonomies are rooted in the generic theorem or the whole-part theorem, various formats may be employed when building taxonomies for organizations. The choice of a primary method of organizing the information may be supplemented by the addition of metadata fields, which act as additional entry points to the information. Taxonomies (classification structures) and metadata (cataloguing) work well together to provide a rich description of information.
In some cases, more than one taxonomy is required within an organization. A subject-based taxonomy may already be in use within a document management software application; however, there also may be a need to organize corporate documents within a functional taxonomy. A functional-based taxonomy already may be developed, yet an organization also may wish to represent the more esoteric elements of what each department does. In this case, a companion taxonomy based on organization structure may be useful. There also may be different taxonomies in use for each element in a metadata set. Each organization must assess its needs and requirements to decide on the taxonomy structure that is right for it.
According to Alan Gilchrist, a consultant with TFPL Ltd., research has shown that most organizations
* were aware of the need to develop a better information structure
* were prepared to commit substantial financial resources to the project
* understood the importance of using a standardized terminology
* recognized that user participation and feedback were important
Unlike libraries, which have classification systems such as Dewey Decimal or Library of Congress, there are no universal standard taxonomies for businesses. Each business taxonomy must be developed based on the uniqueness of an organization's individual business requirements, its users, and its industry.
The following are suggested steps to follow in developing a taxonomy.
Plan and Gather Data
This first step is the most important because it provides the direction and foundation for the development of the taxonomy and its accompanying tools (controlled vocabulary, thesaurus). It requires a variety of activities. First, survey the organization to define the stakeholders. Include a cross sampling of the various levels (i.e., frontline workers, management, and executive) to ensure that each has had an opportunity to express his/her interests and experiences.
Then define the goals, scope, and rules for the project--much easier said than done--in order to provide the basis for measuring success. Next, develop a communications plan to keep the project team members and the organization as a whole informed about the importance of the project and its progress.
Once these items are in place, data gathering can begin. A taxonomy is media-independent; it represents the intellectual content of any format of materials. Use questionnaires, individual interviews, walkthroughs, and reviews of documentation (e.g., annual reports, research summaries, existing file plans) to gather enough information that represents the body of the data set to be classified. This material will provide not only the basis for the logical arrangement of the taxonomy but also the names of the individual headings within it. This process also provides an opportunity to develop key contacts with subject experts.
Build a Draft Taxonomy
During this phase, the project team will need to confirm the type of taxonomy that will be used. It is a good idea to contact other organizations within the same industry to review their taxonomy structures or conduct other research to identify existing taxonomies or thesauri elsewhere. Organizations may be able to benefit from the work done by others and avoid "reinventing the wheel."
If a business plans to use a technology application (e.g., electronic document management systems, categorizing software), it will need to research it thoroughly and understand its capabilities/limitations. Many categorizing software applications require a pre-built taxonomy.
Whichever taxonomy format is chosen, the most important decision is determining the "first cut" or how the rules of distinction will be applied. Kwasnik says, "this determines the shape and the representational eloquence of the classification." Is it best to build the taxonomy from the top-down or the bottom-up? Should an organization build the top-level buckets first (based on pre-defined business requirements) or should it organize the buckets at the most granular level first (based on the content of the information) and decide on the major groupings for the top levels later? Some argue that a top-down approach, with its pre-defined business requirements view of the world, can unduly influence the development of the taxonomy--that a different approach of organizing the information may be overlooked. In actual practice, the method of building a taxonomy usually employs a marriage of the two approaches.
Now begins the arduous task of grouping similar documents together, deciding on the group names, and developing the controlled vocabulary and/or thesaurus. Users interviewed in step 1 will, for the most part, have supplied the vocabulary for these headings. This part of the development process is a matter of trial and error and requires a good deal of patience. Some people write the concepts or terms on cards; some use Post-It notes. Others use an Excel spreadsheet or employ specialized software packages, such as Visual Mind or Mind Mapper, to construct tree diagrams. A rule of thumb for the number of top buckets is seven plus or minus two; this is especially relevant for Web site taxonomies. As for its depth, anything beyond four levels may inhibit a user's ability to navigate easily within the structure. Although a document may be classified in more than one place in the taxonomy, the document itself should be stored in only one location. Names of the bucket headings should not repeat unless--and this is critical--a navigation path showing the super classes can be supplied to provide the context of the bucket.
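The rules of thumb above (seven plus or minus two top buckets, no more than four levels, no repeated heading names without a navigation path) lend themselves to a simple automated check on a draft. The Python below is a sketch under stated assumptions: the nested-dictionary representation, the function names, and the sample `draft` headings are all hypothetical.

```python
from collections import Counter


def depth(tree):
    """Number of levels in a nested-dict taxonomy (an empty dict has depth 0)."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())


def headings(tree):
    """Yield every heading name at every level."""
    for name, children in tree.items():
        yield name
        yield from headings(children)


def check(tree, min_top=5, max_top=9, max_depth=4):
    """Return a list of warnings based on the rules of thumb in the text."""
    warnings = []
    if not min_top <= len(tree) <= max_top:
        warnings.append(
            f"{len(tree)} top-level buckets (expected {min_top}-{max_top})"
        )
    if depth(tree) > max_depth:
        warnings.append(f"depth {depth(tree)} exceeds {max_depth} levels")
    repeated = sorted(n for n, c in Counter(headings(tree)).items() if c > 1)
    if repeated:
        warnings.append(
            "repeated headings need a navigation path: " + ", ".join(repeated)
        )
    return warnings


# Hypothetical draft with too few top buckets and a repeated heading.
draft = {
    "Finance": {"Budgets": {}, "Reports": {}},
    "Human Resources": {"Recruitment": {}, "Reports": {}},
    "Operations": {},
}
for warning in check(draft):
    print(warning)
```

A check like this only flags candidates for review. Whether a flagged draft actually needs to change remains a judgment for the project team and its users.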
There may be multiple facets or characteristics of topics that must be represented throughout the taxonomy. For example, policies, standards, research, or a geographic location may apply to numerous topics. A taxonomy may be constructed with those headings as subclasses under a super class. However, this may result in a very large taxonomy. Instead, consider constructing a list of such terms as metadata in a separate field that can be accessed via a controlled vocabulary search.
Once a draft has been constructed, it must be distributed and reviewed by a cross sampling of the organization in order to promote buy-in and provide feedback to refine the hierarchy. It also may be necessary to consult with subject matter experts at this time. When developing only a small area of the taxonomy for a pilot, keep in mind how other areas will affect the taxonomy's structure once they are added. Try to anticipate what may need to be changed. The draft then should be presented to a validation committee for approval.
Remember that the taxonomy will continue to develop over time. It is never truly finished; instead, it may be thought of as a "living" document. What is approved now may need to be modified later as the organization grows and evolves in order to ensure that the organization's assets are represented in the taxonomy.
Now it is time to populate the taxonomy with records or information. At this point, an automatic categorization software application may be used to assist with the categorization of the concepts across the pre-developed taxonomy. Reviewing the pilot project with users and getting their input during this phase is crucial. Time well spent in the development of the taxonomy ensures its success.
Refine and Finalize
As a result of user feedback during the pilot project, the taxonomy will need to be refined; the controlled vocabulary and/or thesaurus may require modification. At this point, continue to launch the taxonomy in "chunks" if the organization is large; a small- to medium-sized organization, or one with limited resources, may decide to launch it in its entirety. It is crucial at this point to market and build on the success of the pilot project. This will encourage other groups to come forward to participate.
User training must be designed, developed, and implemented based on user needs. Front-line workers will need more information and hands-on experience than the executives. However, all employees will need to be informed. This is a great opportunity to reiterate the value of the taxonomy and how it enables better management of and access to an organization's most important asset--information.
Ensure Continued Development
The taxonomy and its related tools will evolve over time. To ensure their continued value to the organization, policies and procedures must be written to outline such things as who owns the taxonomy and at what level, if at all, an end user or content manager will be able to add a heading.
Ensure there is a mechanism in place for updating and revising the taxonomy through the use of a committee, a regular auditing process, or informal reviews. Proper staffing is also necessary to ensure that the development and ongoing maintenance of the taxonomy are sustained. What began as a project with specific goals and objectives evolves into an ongoing process that needs to be managed and maintained over time.
What is the role of records managers in taxonomy development? They have a good understanding of the organization and business processes, as well as the terminology that may be unique to their industry. In addition, they have a deep appreciation of the needs and benefits of information retrieval. Those who develop a taxonomy, especially a functional one, become invaluable to an organization because they become experts on its overall operations. In addition, a byproduct of the process is the development of a strong network of key contacts--an important marketing asset.
AUTOMATIC CATEGORIZATION AND TAXONOMY
Automatic categorization is the process by which technology is used to create clusters of documents based on criteria specified by the user, usually via a pre-supplied taxonomy, thesaurus, or controlled vocabulary list. Automatic categorization software provides the potential means to automatically file documents into either a predefined taxonomy or self-defined categories. Although it sometimes may be referred to as taxonomy or classification software, it does not currently perform this function well, if at all. Automatic categorization software cannot produce a well-formed hierarchical taxonomy or classification scheme. In reality, the software acts as an indexing agent. Its strengths lie in its ability to process (i.e., categorize or index) large data sets and to identify concepts and relationships that may not be readily apparent from the raw data in order for an individual to build or refine a taxonomy. [Editor's Note: See related article, "So You Want to Implement Automatic Categorization?" on page 60.]
There are two main types of categorization software: rules-based (e.g., Entrieva, formerly Semio, and Inxight) and statistical clustering (e.g., Autonomy, Mohomine, Hummingbird). The former employs predefined "if-then" statements to define the clusters. It uses linguistic analysis (the rules of grammar, language detection, proximity analysis, stemming) to extract concepts, not keywords, from the documents and assigns the documents into clusters. This method is not dependent on the information in the collection itself; that is, the rules may be applied against multiple collections. The latter employs mathematical algorithms to cluster like concepts together. Popular methods include term co-occurrence analysis and neural networks. This method categorizes documents based on the information in the collection itself.
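The rules-based approach can be illustrated with a deliberately simplified Python sketch. It is not how any of the named products work: commercial tools apply linguistic analysis to extract concepts, whereas this example only matches literal words. The rules in `RULES`, the category names, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical "if-then" rules: if every listed term appears in the
# document, then assign the category on the right.
RULES = [
    ({"invoice", "payment"}, "Finance/Accounts Payable"),
    ({"resume", "interview"}, "Human Resources/Recruitment"),
    ({"lease"}, "Facilities/Property"),
]


def tokens(text):
    """Lower-cased word tokens (no stemming or grammar analysis in this sketch)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def categorize(text, rules=RULES):
    """Return every category whose rule terms all occur in `text`."""
    words = tokens(text)
    return [category for terms, category in rules if terms <= words]


print(categorize("Payment is due on receipt of this invoice."))
print(categorize("Lunch menu for Friday"))  # no rule fires
```

Because the rules are defined independently of any one collection, the same `RULES` list can be run against a different set of documents without change, which is the property the text attributes to rules-based tools.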
Catalogue-by-example is an additional technique used by either camp (e.g., Inxight, Autonomy) to refine results. This technique compares new documents to a collection of exemplary documents, usually 20-30 documents for each heading, which is known as the "training set." During the training process, humans evaluate the appropriateness of the software-categorized documents and shift documents from one category to another as required. The software learns through an iterative process which documents should be categorized into which clusters. In this way, the software refines its understanding of a concept. This technique works best if the taxonomy is stable; otherwise, time must be taken to retrain the system. It also works well when dealing with disparate data, for example, televisions and telephones, not robins and sparrows.
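The idea of comparing a new document with a training set can be sketched in Python using word counts and cosine similarity. This is a stand-in technique chosen for brevity, not the method used by any named vendor, and the training set here is far smaller than the 20-30 documents per heading mentioned above. All sample documents and function names are hypothetical.

```python
import math
import re
from collections import Counter


def vector(text):
    """Bag-of-words counts for a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(training_set):
    """Build one word-count profile per heading from its exemplary documents."""
    profiles = {}
    for heading, documents in training_set.items():
        profile = Counter()
        for document in documents:
            profile.update(vector(document))
        profiles[heading] = profile
    return profiles


def classify(text, profiles):
    """Assign `text` to the heading whose profile it most resembles."""
    doc = vector(text)
    return max(profiles, key=lambda heading: cosine(doc, profiles[heading]))


# Tiny, invented training set for two disparate headings.
training_set = {
    "Televisions": [
        "flat screen television with remote control",
        "television screen size and picture resolution",
    ],
    "Telephones": [
        "cordless telephone handset with voicemail",
        "telephone call forwarding and dial tone",
    ],
}
profiles = train(training_set)
print(classify("replacement remote for a large screen television", profiles))
```

In this sketch, the human review step would amount to moving a misfiled document into the correct list in `training_set` and calling `train` again. That is also why an unstable taxonomy is costly: every change to the headings means rebuilding the profiles.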
The terms classification and taxonomy will continue to be used in their respective professional spheres; however, according to Liz Edols' article, "Taxonomies Are What?" in Free Pint e-zine: "Good taxonomies, based on the use of classification and controlled vocabularies, result in more efficient information retrieval. This ensures better productivity and less user frustration. Where do taxonomies fit into the information architecture paradigm? They are one part of it, though they may not always be referred to as a taxonomy."
Isn't this the ultimate goal? By using the standards and principles set out for the development of a hierarchical structure, whether it is called a taxonomy or a classification scheme, more efficient information retrieval, better productivity, and less user frustration can be achieved.
Types of Taxonomies
The following are some ways of representing information within taxonomies:
Functional: This type of taxonomy organizes itself along the different functions performed by an organization--both administrative and operational.
Advantages:
* is most in tune with organizational goals and business processes
* reduces silos of information
* reduces duplication
* makes it easier to find the most recent official document
* shows the flow of information
* naming of headings is unaffected by department name changes
* is recommended by ISO Technical Report 15489
Disadvantages:
* requires a new way of thinking about information
* needs buy-in from everyone
* requires one person to oversee the major shared "buckets"
* requires a liaison within each department that contributes to the "bucket"
* requires more training of employees
Department: This type of taxonomy is department-based and mirrors an organizational chart.
Advantages:
* is easy to build
* is easy to understand
* preserves the chain of command, avoiding internal "politics"
* allows an individual to work in only one area of the taxonomy
Disadvantages:
* requires taxonomy headings to be changed frequently
* department mergers and splits will force parallel changes
* splits information on a project or topic across the taxonomy if two or more people from different departments contribute information
* encourages a proprietary way of thinking
* does not represent what an organization actually does
* is difficult for new employees to use
* requires the management of departing employees' documents/files
Subject: This type of taxonomy is based on the subjects of information with which an organization might deal.
Advantages:
* is appealing if there is a need to classify a discrete body of knowledge
* allows for greater depth if required
* is an excellent application for an EDMS
Disadvantages:
* is limited to that one body of knowledge
* may be difficult to select the terminology for the subject headings if the users of the taxonomy are both novices and experts in the subject field
Product/Services: This type of taxonomy is based on the products and services that the organization provides.
Advantages:
* provides good representation of information for product- or service-centered organizations
Disadvantages:
* is more of a stand-alone taxonomy rather than a way to represent an entire organization
Location: This type of taxonomy is based on the organization's geographic location.
Advantages:
* is ideal for large multinational organizations
* allows for customization based on location
* allows for the incorporation of customs, culture, and regulations that are specific to the location
Disadvantages:
* is challenging to split information between the corporate office and branch locations
* requires a specialist in each country to create the taxonomy because of language and cultural nuances
* is difficult for centralized control
Managing Taxonomies Strategically
The following pearls of wisdom are gleaned from experience:
* Prepare for politics. The development of a taxonomy requires an enormous amount of diplomacy, tact, and negotiation.
* Taxonomy development is a process, not just a project. A taxonomy is never truly "finished."
* Serve the real needs of users rather than produce an "ideal" textbook taxonomy. Remember the concept of user warrant.
* Write policies and procedures for everything.
* Ensure there is a mechanism in place for the taxonomy's continued development and maintenance. This includes budgeting for the appropriate staff support required.
References
Adams, Katherine C. "Word Wranglers: Automatic Classification Tools Transform Enterprise Documents from 'Bags of Words' into Knowledge Resources." Available at www.intelligentkm.com/feature/010101/featl.shtml (accessed 9 January 2003).
BRINT Institute. "Classification." Available at http://portal.brint.com/cgi-bin/getit/links/Reference/Knowledge_Management/Knowledge_Retrieval/Classification (accessed 9 January 2003).
Content Wire. "Taxonomies." Available at www.content-wire.com/Taxonomies/Index.cfm (accessed 9 January 2003).
Edols, Liz. "Taxonomies Are What?" Free Pint e-zine. October 2001.
Graef, Jean. "Managing Taxonomies Strategically." Montague Institute Review. March 2001.
Hagedorn, Kat. "Extracting Value From Automated Classification Tools." Argus Center for Information Architecture, March 2001.
ISO Technical Report 15489-2. "Information and Documentation--Records Management, Part 2: Guidelines" 2001.
Kwasnik, Barbara H. "The Role of Classification in Knowledge Representation and Discovery." Library Trends, Vol. 48, No. 1, Summer 1999.
Lubbes, R. Kirk, 2001. "Automatic Categorization: How Does it Work, Related Issues, and Impact on Records Management." Presented at ARMA International 46th Annual Conference and Expo, 30 September 2001, Montreal, Quebec, Canada.
New South Wales Government. "Designing and Implementing Recordkeeping Systems (DIRKS)" and "Australian Standard 4390--Records Management." Available at www.records.nsw.gov.au (accessed 9 January 2003).
NISO Standard Z39.19-1993. "Guidelines for the Construction, Format, and Management of Monolingual Thesauri." Available at www.niso.org/standards/standard_detail.cfm?std_id=518 (accessed 9 January 2003).
Nua Knowledge Base. "Classification: The Cure for Information Overload." Available at www.nua.ie/nkb/classification/index.shtml (accessed 9 January 2003).
Olsen, Christopher J., 2001. "Buy It, Build It, Steal It--But Your Organization Needs a Taxonomy for Its Information Survival." Presented at ARMA International 46th Annual Conference and Expo, 30 September 2001, Montreal, Quebec, Canada.
Rosenfeld, Louis and Peter Morville. Information Architecture for the World Wide Web: Designing Large-Scale Web Sites, Second Edition. Sebastopol, CA: O'Reilly & Associates, 2002.
"Where to Find Taxonomy Information." Montague Institute Review. Available at www.montague.com/review/taxoninfo.html (accessed 9 January 2003).
Willpower Information. "Publications on Thesaurus Construction and Use." Available at www.willpower.demon.co.uk/thesbibl.htm (accessed 9 January 2003).
Denise Bruno, MLS, is an independent information management consultant.
Heather Richmond, CRM, is Vice President, Marketing and Sales, at CONDAR Consulting Inc.
Publication: Information Management Journal
Date: Mar 1, 2003