Creating order out of chaos with taxonomies: the increasing volume of electronic records and the frequency with which those records change require the development and implementation of taxonomies--a classification system of topics or subject categories--to maximize efficient retrieval of records for legal, business, and regulatory purposes.
The higher the level of public scrutiny by regulators and stakeholders, the greater the need organizations have for applying management controls. Some types of comprehensive record searches (e.g., divestitures, due diligence investigations, and electronic discovery in response to courts and regulators) are difficult to conduct without taxonomies. To maximize efficient and effective retrieval of records for legal, business, and regulatory purposes, organizations must develop and implement taxonomies and metadata to complement text searching, provide multiple access points to information, and incorporate retention requirements.
Methods for Organizing Information
Historically, classification systems were expressly developed to classify physical objects that existed in physical locations (see "A Little History" sidebar), but technological advancements in the twentieth century brought an explosion of information--both digital and physical--that forever changed notions of classifying an "information collection."
The information contained in physical and digital records changes frequently--daily, weekly, monthly--and sometimes without warning. Frequent updates, such as modification or deletion of items, are critical when time-sensitive information is involved, but updates can be very disconcerting to users who find themselves hunting for moving targets. A 2004 Delphi Group research report indicated that constantly changing information was the biggest impediment to relocating and retrieving information. Seventy-three percent of survey respondents reported they spend 10 to 20 percent of their work week searching for information.
Compounding the problem of frequently changing information is the increasing volume of hardcopy and digital records that organizations maintain for legal and business purposes. A 2000 Reuters study indicated that "Every day, approximately 20m [20 million] words of technical information are recorded." (As a testament to the volatility of information, the Reuters article is no longer available online.) While the largest library collection in the world, the Library of Congress, consists of nearly 128 million items, a large organization can easily maintain tens of millions of physical and digital records.
One tool that is instrumental for managing increasing records volume is a taxonomy: a structured, often hierarchical, classification system of topics or subject categories. Taxonomies speed up the process of retrieving records because end users can select from subject categories or topics, enabling them to narrow the search field and find relevant information rather than relying solely on the blank text search field and their ability to construct an effective query. Taxonomies also provide "serendipitous guidance," according to a 2003 The Information Management Journal article by Denise Bruno and Heather Richmond, because additional information can be inferred from seeing where a topic resides in the taxonomy's context.
End users who are not knowledgeable about a particular topic might begin a search process by navigating through the taxonomy. When an area of interest is discovered, a text search against only the records in this particular area of the taxonomy could be executed. Conversely, the user might start with a text search producing hundreds or thousands of records. Through the integration of a taxonomy, the results could be displayed as a customized set of folders that organize the content by related topics. According to the Delphi Group report, enterprise content management (ECM) products enable taxonomy integration, allowing users to search across repositories, present records from multiple repositories in response to user queries, and personalize these responses based on the requestor's relationship to the enterprise.
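The two search paths described above (browse first, then search within a branch; or search first, then view the hits as topic folders) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how any ECM product works: the category paths, record titles, and the helper names `search_within` and `group_by_category` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records, each tagged with one taxonomy path (broadest term first).
RECORDS = [
    {"title": "Master services contract", "path": ("Legal", "Contracts")},
    {"title": "Patent filing checklist", "path": ("Legal", "Intellectual Property")},
    {"title": "Vendor invoice contract terms", "path": ("Finance", "Accounts Payable")},
    {"title": "Quarterly tax filing", "path": ("Finance", "Tax")},
]


def search_within(records, branch, keyword):
    """Browse first: text-search only the records filed under one taxonomy branch."""
    branch = tuple(branch)
    keyword = keyword.lower()
    return [
        r["title"]
        for r in records
        if r["path"][: len(branch)] == branch and keyword in r["title"].lower()
    ]


def group_by_category(records, keyword):
    """Search first: run the text search, then file the hits into topic folders."""
    keyword = keyword.lower()
    folders = defaultdict(list)
    for r in records:
        if keyword in r["title"].lower():
            folders[" > ".join(r["path"])].append(r["title"])
    return dict(folders)
```

Calling `search_within(RECORDS, ("Legal",), "contract")` searches only the Legal branch, while `group_by_category(RECORDS, "contract")` searches everything and returns the matches grouped by their taxonomy path.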
In most organizations, there is still no way to search for electronic records in multiple repositories except to search each repository separately. Despite compelling arguments for using taxonomies in records and information management, according to Gartner, more than 70 percent of organizations that invest in such initiatives do not achieve their target return on investment due to under-investment in taxonomy development. It is worthwhile then to compare trade-offs associated with alternatives such as buying pre-built taxonomies, building taxonomies, and automatically generating taxonomies.
Buying Pre-existing Taxonomies
Pre-built taxonomies covering common business functions, including legal, information technology, human resources, and sales and marketing, are available from search-technology and ECM companies. Vendors also offer taxonomy templates with specific industry terminology for corporate, government, and education sectors. For the corporate sector, for example, there are taxonomies for aerospace; architecture and design; automotive; finance and accounting businesses; commodities; chemistry; earth science; engineering; international business; law; life and medical sciences; pharmaceuticals; physics and astronomy; textiles; and utilities.
Industry associations are another source of pre-built taxonomies. In the oil and gas industry, for example, the Petrotechnical Open Standards Consortium (POSC) and the PPDM Association, with work done by Shell Expro and Flare Consultants, have produced an exploration and production (E&P) taxonomy catalog. The catalog, which includes a standard set of metadata attributes for E&P information, provides a logical, standardized way to index and catalog information so that it can be easily identified and retrieved in the right context.
There are also worthwhile pre-built taxonomies in the public domain. The Taxonomy Warehouse (www.taxonomywarehouse.com) provides a free directory of 501 taxonomies, thesauri, classification schemes, and other authority files from around the world, plus information about taxonomy references, resources, and events. The taxonomies are classified by 73 subject domains, such as patents, real estate, and taxation, and each has ordering information.
The use of pre-built taxonomies has pros and cons. Clearly, pre-built taxonomies can speed up the taxonomy creation process, enabling organizations to deliver immediate results while still allowing taxonomies to be fine-tuned for organization-specific requirements. Pre-built taxonomies have been checked for consistency so that an accounts payable invoice is not called a "bill" in one subject category and a "posting" in another. Furthermore, they incorporate industry best practices and can introduce a more efficient and effective method for organizing records.
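The kind of consistency check mentioned above, where an accounts payable invoice should not appear as a "bill" in one category and a "posting" in another, can be approximated with a small controlled vocabulary. The sketch below is an assumption-laden illustration: the `PREFERRED_TERMS` mapping and the function names are hypothetical, and a real vocabulary would be far larger and maintained by subject experts.

```python
# Hypothetical controlled vocabulary: each variant maps to one preferred term.
PREFERRED_TERMS = {
    "bill": "accounts payable invoice",
    "posting": "accounts payable invoice",
    "ap invoice": "accounts payable invoice",
}


def clean(term):
    """Lower-case a label and collapse repeated whitespace."""
    return " ".join(term.lower().split())


def normalize_term(term):
    """Return the preferred label for a term, or the cleaned term if unmapped."""
    cleaned = clean(term)
    return PREFERRED_TERMS.get(cleaned, cleaned)


def find_inconsistencies(category_labels):
    """Report categories that use a known variant, with the term to use instead."""
    report = {}
    for category, label in category_labels.items():
        preferred = normalize_term(label)
        if preferred != clean(label):
            report[category] = (label, preferred)
    return report
```

Running `find_inconsistencies` over the labels used in each subject category gives a reviewer a list of places where a variant term should be replaced by the preferred one.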
A significant disadvantage of pre-built taxonomies is that because they are not specific to an organization and its objectives, they have limited applicability. Each organization has its own culture and its own way of categorizing. Using pre-built taxonomies will introduce unfamiliar terminology and make user training more time-consuming.
Unlike pre-built taxonomies, a custom-developed taxonomy can be very specific to an organization, its objectives, and its culture. The developer has control over the selection of terminology to make sure it reflects the understanding and needs of an intended audience, as well as the range of content to which it will be applied. In some cases, building a taxonomy is the only solution because there are no other existing taxonomies that cover a particular area of interest.
The primary disadvantage of building a taxonomy is the time it takes. It is much faster to use a pre-existing taxonomy--or even to customize a pre-existing one that is compatible in scope and application. Trying to customize an incompatible taxonomy could be just as time-consuming as building a new one and even more challenging. Another disadvantage to building a taxonomy is that it is usually more expensive than buying a pre-existing one. Despite the disadvantages, however, most companies still build their own taxonomies while leveraging the use of pre-built taxonomies when possible.
In constructing and implementing a taxonomy, the goal is to develop a conceptual organizational structure that can be used to classify and search for information. The general process is roughly the same whether a manual or an automatic approach is used. Four interrelated phases must be considered:
Phase 1: Planning and analysis
Phase 2: Design, development, and testing
Phase 3: Implementation
Phase 4: Maintenance
Phase 1: Planning and Analysis
Planning and analysis is the most critical phase of taxonomy development. It requires gaining a thorough understanding of the total information environment in which the taxonomy will be implemented and developing a realistic strategy for integration. Activities in this phase are designed to do the following:
* assess resources involved in the taxonomy project and determine how the taxonomy will be used. If necessary, identify outside consulting resources to assist
* identify categories to be used and decide on the taxonomy's structure
* select a development strategy and identify appropriate technology for developing the taxonomy as well as categorizing content
* budget for development and ongoing maintenance
Information from the planning phase is used to firm up the project plan and determine key milestones that demonstrate success.
Phase 2: Design, Development, and Testing
Changes to a taxonomy can be painful after implementation, so taxonomies should be designed for both short-term and long-term needs to minimize change when the organization's structure changes. "People do not like information architecture to change," content management consultant and author Gerry McGovern said. "Spend the time to get it as right as possible [the] first time." Design and development, therefore, should be an iterative process based on feedback from stakeholders at every major stage of the process. Develop a high-level structure and test with stakeholders. Modify the structure based on their feedback and then test again until general consensus indicates that taxonomy objectives are being met.
Phase 3: Implementation
Good planning and design provide a solid foundation for implementing the taxonomy. However, smooth implementation can be achieved only if people, processes, and technologies have been identified and prepared for this phase. The change management process begins early in the project through open communication and expectation setting. It is formalized with stakeholder training on the processes to be used for categorizing new information, searching and retrieving information, and using the technologies employed in the effort.
Phase 4: Maintenance
Even when taxonomy developers consider short-term and long-term needs in the planning process, change is inevitable. A taxonomy is a strategic part of an organization's information architecture that will be maintained for many years. It will evolve as business needs change and as sophistication and understanding grow around records management. Documentation of decisions made throughout the development and implementation process will be instrumental for efficiently assessing requests for change and making changes to the taxonomy as necessary. The change management infrastructure (people, process, and technology) that was implemented in Phase 3 should be maintained for the life of the taxonomy.
Developing Manual and Automated Taxonomies
There are two basic strategies to building taxonomies: top-down or bottom-up. In "Best Practices in Taxonomy Development and Management," authors Laura Ramos and Daniel Rasmus said that a top-down strategy--usually developed manually--offers control over the broad general concepts found at the highest taxonomy levels and is useful for aligning the taxonomy with business strategy and goals. A bottom-up strategy uses automated technologies to extract basic concepts from the content itself and make generalizations about them. Both strategies have advantages and disadvantages, and both are important for taxonomy development and implementation.
Manual taxonomy development offers significant control over the meaning and arrangement of concepts and can be deliberately shaped to reflect common knowledge and practice in an organization. However, manual categorization of documents to the concepts in the taxonomy is low in accuracy simply because of the human judgment involved. Where automated classification methods are used with manually developed taxonomies, it is a significant task to "train" the tools to categorize documents to the taxonomy, and it may be impossible to train an automated tool if there is not sufficient distinction in the meaning of categories. The cost of developing and maintaining a manual taxonomy is high because it is a resource-intensive process.
Automatic classification tools can automate the process of categorizing content for an already developed taxonomy or generate the taxonomy structure itself. Tools that automatically generate the taxonomy structure apply various algorithms (statistical analysis, Bayesian probability, and clustering) to a corpus of documents in a bottom-up strategy. An automatically generated taxonomy offers little control over the meaning and arrangement of high-level concepts and, consequently, requires significant refinement in order to make sense to users and be more reflective of the way they view information. These tools can categorize a larger number of documents more accurately and faster than humans. However, addition of new concepts requires that the tool be trained to recognize each new concept before content can be automatically classified. The cost of automatically deriving a taxonomy structure is also high because some time-intensive tasks still require human intervention. People must still examine each category to see if it is fit for purpose and if it is named appropriately. Human judgment must determine if some categories should be deleted or new ones added. They must also determine if the final taxonomy "matches" human understanding and purpose.
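The point that a tool must be trained on each concept before it can classify content can be illustrated with a small naive Bayes categorizer, one simple form of the Bayesian probability approach mentioned above. This is a minimal sketch, not the method of any particular product: the class name, the category names, and the training sentences are hypothetical, and a usable classifier would need far more training text per category.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesCategorizer:
    """Tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocabulary = set()

    def train(self, category, text):
        """Teach the tool one example document for a taxonomy category."""
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the trained category with the highest log probability."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the smoothed log likelihood of each word.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training set: two example documents per category.
clf = NaiveBayesCategorizer()
clf.train("Contracts", "master services agreement signed by vendor")
clf.train("Contracts", "lease agreement renewal terms")
clf.train("Invoices", "invoice payment due to vendor")
clf.train("Invoices", "overdue invoice payment reminder")
```

With this training data, `clf.classify("vendor agreement terms")` returns `"Contracts"`. A category that has never been passed to `train` can never be returned, which is the training requirement in miniature: adding a new concept to the taxonomy means supplying examples for it first.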
Selection of a taxonomy strategy and associated tools should be based on the goals of the taxonomy development project. The best solutions will use a combination of the strategies--a top-down approach to develop the higher-level categories in the taxonomy aligned with business strategy, and a bottom-up approach to refine lower-level concepts and enable automatic categorization of content.
Identifying Best Practices
As business awareness and use of taxonomies have grown over the years, the following best practices have emerged for successful taxonomy development:
* Make sure the taxonomy is clearly related to business strategy. This will provide one standard against which to measure success and help in controlling the scope of work to be done.
* Incorporate existing taxonomy and metadata resources whenever possible. Some resources may be available internally. Others may be found in the public domain (e.g., country codes published by the United Nations Code for Trade and Transport Locations and API well numbers published by the American Petroleum Institute).
* Even in large and complex taxonomies, make sure categories are well-defined and distinct. If the meaning of categories is too similar, it will be hard for both people and machines (automatic tagging and categorization) to make distinctions.
* Iterative development is key. Develop a high-level taxonomy, test it with users, expand it, and test again. This technique will increase the probability that the right concepts are identified and encourage buy-in from stakeholders.
* Keep the taxonomy as simple as possible. Decompose to a useful level but avoid so much detail that information becomes fragmented.
* Provide for adequate resources to maintain the taxonomy. Taxonomies are not static and will change over time. The appropriate change management infrastructure (people, process, and technology) must be put into place to support necessary change.
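One way to screen for categories that are not well-defined and distinct is to compare the terms that characterize each one. The sketch below uses Jaccard similarity (shared terms divided by all terms) as a rough signal; this measure, the 0.5 threshold, and the sample categories are all assumptions chosen for illustration, and flagged pairs would still need human review.

```python
from itertools import combinations


def jaccard(terms_a, terms_b):
    """Share of terms two categories have in common (0 = disjoint, 1 = identical)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlapping_categories(category_terms, threshold=0.5):
    """Return category pairs whose defining terms overlap at or above the threshold."""
    flagged = []
    for name_a, name_b in combinations(sorted(category_terms), 2):
        score = jaccard(category_terms[name_a], category_terms[name_b])
        if score >= threshold:
            flagged.append((name_a, name_b, round(score, 2)))
    return flagged


# Hypothetical categories described by a handful of characteristic terms.
CATEGORY_TERMS = {
    "Invoices": {"invoice", "payment", "vendor", "due"},
    "Bills": {"invoice", "payment", "vendor", "bill"},
    "Contracts": {"agreement", "terms", "signature"},
}
```

Here `overlapping_categories(CATEGORY_TERMS)` flags only the "Bills" and "Invoices" pair, which share three of five distinct terms. A pair like this is a candidate for merging or for sharper definitions before people or automatic tools are asked to tell the two apart.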
Planning for the Long Term
Faced with an ever-growing challenge to provide efficient search and retrieval across growing record repositories, organizations are looking for ways to create order out of chaos, and taxonomies are a primary tool. Taxonomies enhance searching for records because end users can select from standardized categories and hierarchical structures of information, enabling them to narrow the search field and find relevant information faster. Good planning and design provide a solid foundation for establishing the taxonomy, and people, processes, and technologies must be identified and prepared for implementation. A taxonomy, like a records retention schedule, is a strategic part of an organization's information architecture, and maintenance will require a long-term investment of human and financial resources.
A Little History ...
Taxonomy originated in the life sciences and can be traced back to Aristotle's theory of categories. "He espoused the idea that things are placed into the same category on the basis of what they have in common," author Arlene Taylor wrote in her book The Organization of Information, and they are arranged hierarchically with things either inside or outside the container.
Among the earliest applications of classification to knowledge were 10 broad categories used by Callimachus, a Greek poet and scholar, for classifying works in the Library of Alexandria. These 10 categories remained fairly stable until the late Middle Ages but expanded significantly in the nineteenth century with the rapid growth of libraries and an increased need to provide users with easier access to books. Two large classification systems were developed to address this need and were put into widespread use: the Dewey Decimal Classification and the Library of Congress Classification.
By the end of the nineteenth century, a movement was underway in Europe to go beyond providing access to books. The Universal Decimal Classification system was not developed as a library classification, Taylor said, but rather as a means to organize and analyze documents.
At the Core
* explains how instrumental taxonomies are to managing the e-records deluge
* describes advantages and disadvantages of buying versus building taxonomies
* details the four phases of building taxonomies
* provides best practices for developing taxonomies
Bruno, Denise and Heather Richmond. "The Truth about Taxonomies," The Information Management Journal 37, No. 2 (March/April 2003).
Delphi Group. "Information Intelligence: Content Classification and the Enterprise Taxonomy Practice." June 2004. Available at www.delphigroup.com/research/whitepapers/20040601-taxonomy-WP.pdf (accessed 28 March 2005).
Ingram, Brian. "Locate Smoking Guns Electronically." LAW.COM. 29 September 2003. Available at www.law.com/special/supplement/e_discovery/smoking_gun.shtml (accessed 28 March 2005).
Knox, Rita E. and Debra Logan. "What Taxonomies Do for the Enterprise." Gartner Research. 10 September 2003. Available at www.gartner.com/resources/117200/117204/117204.pdf (accessed 28 March 2005).
McGovern, Gerry. "A Step-by-Step Approach to Web Classification Design." October 2002. Available at www.gerrymcgovern.com/la/wcd.pdf (accessed 28 March 2005).
Petrotechnical Open Standards Consortium. "Work Program Summary for 2003." 3 February 2004. Available at www.posc.org/workprgm/summary_2003.shtml (accessed 28 March 2005).
Ramos, Laura. "Decision Criteria for Undertaking a Taxonomy Development Project." Giga Information Group. 8 January 2002.
Ramos, Laura and Daniel Rasmus. "Best Practices in Taxonomy Development and Management." Giga Information Group. 8 January 2003.
Reuters Studies. "The Reuters Guide to Good Information Strategy." Dow Jones Reuters Business Interactive Limited, 2000.
Taylor, Arlene. The Organization of Information. Englewood, Colorado: Libraries Unlimited Inc, 1999.
Susan L. Cisco, Ph.D., CRM, FM, is a Project Manager with Iron Mountain Enterprise Solutions and Services. She holds an M.L.S. and Ph.D. in Library and Information Science from The University of Texas at Austin and is a published RIM author and educator. She may be contacted at firstname.lastname@example.org.
Wanda K. Jackson, Ph.D., is an IT professional with special expertise in taxonomy development that has enabled her to develop enterprise-wide taxonomies for several large companies. She holds a Ph.D. in Library and Information Science from the University of Texas at Austin and is a certified Project Management Professional. She may be contacted at email@example.com.
|Title Annotation:||instrumental taxonomies|
|Author:||Jackson, Wanda K.|
|Publication:||Information Management Journal|
|Date:||May 1, 2005|