Getting just what you need.
That, in a nutshell, summarizes the demands placed on information professionals and the information content industry today. But how are these demands addressed?
Current Market Drivers
The information marketplace is evolving. Contextual information distribution is no longer just about the bulk supply of content to one demand-side entity. It is increasingly about the targeted distribution of very specific--sometimes small--data points to those people within the workflow who must have the content to complete their part.
This, in turn, demands payment methods that can cater to specific data elements as well as to large-scale licenses or subscriptions, and it will drive different payment and collection methods in the future.
The key to successful delivery of future information services is customization. It truly boils down to customer intimacy, or the vendors' ability to partner with the customer to understand the day-to-day requirements of specific user groups.
This is an early view of the next generation of information content delivery. It will be based on today's content holdings, on technology partners who use XML and Web services, and on specific proprietary technology partners who can enable the profitable offering of value-added services.
Thomson Scientific, a market segment of The Thomson Corporation, has for over 50 years been helping people find mission-critical scientific and technical information more efficiently and effectively. One of the key enabling factors in this is consistent classification of information.
One of the most important sources of technical information is patent literature. Patent documents represent a major source of technical information since, in return for protection of an invention, the inventor is required to fully disclose the invention in a way that makes it reproducible. Technical innovation occurs across all fields of technology; therefore, classification schemes form an integral part of patent information.
Patent classification schemes are constructed and maintained by and for patent examiners, and their primary purpose is to help the examiners in their work. Some of the earliest systems were devised by national patent offices such as the US Patent and Trademark Office and the UK Patent Office.
However, these classification systems are applied individually. This makes them difficult to use efficiently across all the different patent systems. To address that difficulty, the International Patent Classification (IPC) system was developed in 1968. It remains the foundation of patent classification today and is currently used by more than 70 patent authorities to classify and index the subject matter of published patent specifications.
There remain, however, numerous issues in the use of the IPC for consistent and reliable information retrieval:
* The classification policy of individual examining authorities may vary. Local practice may place emphasis on different features of the invention;
* Some patent offices only apply the IPC at a very general level; and
* The USPTO assigns IPC codes to its specifications via an automated concordance that does not always provide a reliable match.
These shortcomings of the IPC were recognized early on. Consequently, Thomson Derwent developed its own classification system for patent information. Codes are used to describe the significant features of an invention and are applied by a team of highly trained experts with specialist knowledge in each of the areas of technology with which they are concerned. In this way, indexing terms are applied consistently, therefore assuring reliable, consistent information retrieval.
Similarly, in the arena of scientific publishing, taxonomies are classically used to provide consistent and reliable information retrieval. One of the best-known examples of this is the BIOSIS taxonomy.
A taxonomy defines the hierarchical relationships among categories (or, more technically, "nodes of information"). Because taxonomies specify the kinds of relationships among terms, taxonomies also allow reasoning about the relationships among classes of information. For example, an apple "is a kind of" fruit.
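The "is a kind of" reasoning described above can be sketched in a few lines of Python. This is a minimal illustration only: the PARENT table and the ancestors and is_a helpers are hypothetical names invented for this example, and the terms are toy data rather than entries from any real taxonomy.

```python
# Hypothetical toy taxonomy: each term points to its parent ("is a kind of").
PARENT = {
    "apple": "fruit",
    "pear": "fruit",
    "fruit": "food",
    "food": None,  # top of this small hierarchy
}

def ancestors(term):
    """Return every broader category of `term`, nearest first."""
    chain = []
    parent = PARENT.get(term)
    while parent is not None:
        chain.append(parent)
        parent = PARENT.get(parent)
    return chain

def is_a(term, category):
    """True if `term` is a kind of `category`, directly or through intermediate nodes."""
    return category in ancestors(term)

print(is_a("apple", "fruit"))  # True
print(is_a("apple", "food"))   # True (inferred through "fruit")
print(is_a("fruit", "apple"))  # False
```

Because the relationship is stored once per node, the fact that an apple is also a food never has to be recorded explicitly; it follows from walking up the hierarchy.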
From the BIOSIS viewpoint, there are two important goals in employing taxonomies. The first strives to achieve consistency in the selection of terms. BIOSIS has adopted an approach that employs "controlled vocabularies" to work in conjunction with author-provided natural language to reach an appropriate level of consistency. The natural language in life-science documents introduces a considerable number of synonyms or ambiguous terms. Since ambiguous terms present real problems for search and retrieval, controlled vocabularies help to standardize the meaning of these terms and can dramatically increase precision in search results.
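As a rough sketch of how a controlled vocabulary can sit alongside author-provided language, the Python below maps synonyms to a single preferred term before matching. The PREFERRED table, the sample documents and the function names are all assumptions made for illustration; they are not taken from the BIOSIS vocabularies.

```python
# Hypothetical controlled vocabulary: synonyms and variant spellings map to one preferred term.
PREFERRED = {
    "heart attack": "myocardial infarction",
    "cardiac infarction": "myocardial infarction",
    "tumour": "neoplasm",
    "tumor": "neoplasm",
}

def normalize(term):
    """Lower-case, collapse whitespace, and replace a known synonym with its preferred term."""
    key = " ".join(term.lower().split())
    return PREFERRED.get(key, key)

def index_terms(author_keywords):
    """Return the sorted set of preferred terms for a document's author keywords."""
    return sorted({normalize(k) for k in author_keywords})

def search(query, documents):
    """Return IDs of documents whose normalized index terms include the normalized query."""
    wanted = normalize(query)
    return [doc_id for doc_id, kws in documents.items() if wanted in index_terms(kws)]

# Illustrative documents with the natural language an author might supply.
DOCUMENTS = {
    "doc1": ["Heart attack", "aspirin"],
    "doc2": ["Myocardial infarction"],
    "doc3": ["Tumour"],
}

print(search("cardiac infarction", DOCUMENTS))  # ['doc1', 'doc2']
```

The author's original wording is left untouched in the document; only the index and the query are standardized, so a search on any one synonym finds documents that used another.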
The second goal is to establish semantically meaningful categories of terms. Beyond the problem of synonyms in natural language, there is the problem of homonyms--terms that look alike but have very different meanings. Think of the term "glycine," which is both an amino acid and the genus name of a crop plant (Glycine max, the soybean). To differentiate terms like these, BIOSIS developed semantic tags that indicate the subject categories to which these terms belong, such as "Organisms," "Chemicals and Biochemicals" or "Geography."
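One way to picture semantic tagging is as an extra field stored with each index term, which a searcher can use as a filter. The sketch below assumes a hypothetical IndexEntry record and a made-up three-document index; only the tag names come from the text above.

```python
from collections import namedtuple

# Hypothetical index record: a term plus the semantic tag assigned at indexing time.
IndexEntry = namedtuple("IndexEntry", ["doc_id", "term", "tag"])

# Made-up index in which the same string carries two different meanings.
INDEX = [
    IndexEntry("doc1", "glycine", "Chemicals and Biochemicals"),  # the amino acid
    IndexEntry("doc2", "glycine", "Organisms"),                   # the plant genus
    IndexEntry("doc3", "glycine", "Chemicals and Biochemicals"),
]

def search_tagged(term, tag=None):
    """Return document IDs matching `term`, optionally restricted to one semantic tag."""
    term = term.lower()
    return [e.doc_id for e in INDEX
            if e.term == term and (tag is None or e.tag == tag)]

print(search_tagged("glycine"))                   # ['doc1', 'doc2', 'doc3']
print(search_tagged("Glycine", tag="Organisms"))  # ['doc2']
```

Without the tag, a search on "glycine" mixes chemistry and botany; with it, the searcher retrieves only the sense intended.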
BIOSIS has developed a relational structure that moves beyond the potentially rigid systems of ordinary taxonomies. In terms of both search and browse, what begins to emerge is a representation of knowledge that is partly hierarchical and partly associative. Using hierarchical term relationships, searchers can easily broaden or narrow searches for greater recall or precision. Or they can associate terms with specific categories to manage the vexing problem of ambiguity.
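Broadening and narrowing a search through hierarchical term relationships might look something like the following. The NARROWER table and the helper functions are hypothetical and use toy terms; this is a sketch of the general idea, not a description of the BIOSIS relational structure itself.

```python
# Hypothetical hierarchy: each term lists its immediate narrower terms.
NARROWER = {
    "fruit": ["apple", "pear"],
    "apple": ["crab apple"],
}

def narrower_terms(term):
    """Return all narrower terms of `term` at any depth, in breadth-first order."""
    found = []
    queue = list(NARROWER.get(term, []))
    while queue:
        current = queue.pop(0)
        found.append(current)
        queue.extend(NARROWER.get(current, []))
    return found

def broader_term(term):
    """Return the immediate broader term, or None if `term` is at the top."""
    for parent, children in NARROWER.items():
        if term in children:
            return parent
    return None

def expand_query(term):
    """Raise recall by searching on the term together with everything narrower than it."""
    return [term] + narrower_terms(term)

print(expand_query("fruit"))   # ['fruit', 'apple', 'pear', 'crab apple']
print(expand_query("apple"))   # ['apple', 'crab apple']
print(broader_term("apple"))   # fruit
```

A searcher who starts at "apple" and finds too little can step up to the broader term and expand from there; one who starts at "fruit" and finds too much can drop down to a single narrower term for greater precision.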
Search--Progress through Partnership
Another key component of delivering the right information to the right people at the right time is the ability to search across an organization's complete electronic browser-accessible resources through a single search interface, or a "federated" search system.
Through a partnership with WebFeat, Inc., Thomson Scientific provides a federated search system called WebFeat Prism. Unlike cross-search environments or protocol-based systems, each database remains in its native format, and WebFeat Prism uses a system of translators to complete each search--one translator for each database. It knows how to translate search terms into the proper syntax for each electronic resource, then uses the search engines of the native databases to retrieve results.
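The translator-per-database approach can be illustrated with a short Python sketch. Everything in it is an assumption made for the example: the Source class, the two query syntaxes and the stand-in search engines are invented, and none of it reflects how WebFeat Prism is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Source:
    """A hypothetical database paired with its own translator and native search engine."""
    name: str
    translate: Callable[[List[str]], str]      # search terms -> native query syntax
    native_search: Callable[[str], List[str]]  # native query -> results

# Invented query syntaxes, one translator per database.
def to_database_a(terms):
    return " AND ".join(terms)

def to_database_b(terms):
    return "+".join(f'"{t}"' for t in terms)

# Stand-in for a native engine; a real one would query the remote database.
def fake_engine(native_query):
    return [f"hit for: {native_query}"]

SOURCES = [
    Source("Database A", to_database_a, fake_engine),
    Source("Database B", to_database_b, fake_engine),
]

def federated_search(terms: List[str], sources: List[Source]) -> Dict[str, List[str]]:
    """Translate the terms for each source, run its native search, and group results by source."""
    results = {}
    for source in sources:
        native_query = source.translate(terms)
        results[source.name] = source.native_search(native_query)
    return results

for name, hits in federated_search(["patent", "classification"], SOURCES).items():
    print(name, hits)
```

The point of the design is that no content is copied or reindexed: adding a new resource means writing one more translator, while each database continues to answer queries with its own search engine.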
In this way, WebFeat Prism acts as a bridge to quickly and easily lead researchers from a library or organization portal homepage into the databases they need for day-to-day R&D activities.
Looking toward the future, Thomson ISI and NEC have announced a collaborative venture to create a comprehensive, multidisciplinary citation index for open access resources. This Web Citation Index Pilot Initiative will make use of technology developed for NEC's "CiteSeer" experimental, Web-based citation indexing environment, along with Thomson ISI's editorial content selection process and the capabilities of ISI Web of Knowledge. Pilot testing is being conducted during 2004, with full availability of this unique resource projected for early 2005.
The Value of Retrieval
The real value of information suppliers today lies in the ability to accurately locate the information required. Content itself is no longer enough--the user interface is part of the search capability. It's not just the information, but also the way it's accessed and presented. As Richard J. Harrington put it in a recent interview with Dick Kaser of Information Today magazine: "Content no longer is king. It's extremely important, obviously. But we say content is key. It's the marriage of the content with the technology that creates the value."
It's clear that underlying tools like CiteSeer, with its automated citation index creation tools for the Web of Science, must be teamed with front-end interfaces that reflect the needs of users.
Thomson Scientific has worked closely with customers to develop a new version of the Web of Science interface, due for release shortly. In addition to many navigation and presentation enhancements, limitations of the former interface have also been addressed in response to customer feedback, resulting in such improvements as the removal of the limit of 500 results displayed from a search.
What Price Improved Retrieval?
A recent study conducted by IDC to quantify the impact of inefficient information retrieval concluded that the time spent looking for but not finding information costs an organization $6 million a year. And that's not including opportunity costs or the costs of reworking information that exists but can't be located. If these costs are included, an additional $27 million annually can be added.
But the cost of missing or incomplete information may be much higher than this. An example occurred in summer 2001, when a volunteer on a Johns Hopkins research project died after being given hexamethonium to inhale. Researchers had searched PubMed and the Web to find out whether there were adverse effects associated with its use. What the researchers didn't know was that PubMed only goes back to 1966. The research on hexamethonium was done in the 1950s, and the critical reference describing death associated with hexamethonium therapy was published in 1953 in the New England Journal of Medicine.
By applying the strategies for search, taxonomy and classification described here, Thomson Scientific aims to enable focused delivery of just the right information to just the right people at just the right time. This helps our customers make better, more informed decisions faster. This, in turn, can help them bring products to market more quickly and avoid potentially costly, or even fatal, mistakes caused by failing to identify critically important information.
Thomson products and features mentioned herein are trademarks, service marks and registered trademarks used under license. Thomson has no proprietary interest in the marks or names of others.
Thomson Pharma Information Portal
Making informed decisions at every step of the research process--from the initial stages of R&D through marketing of the finished product--is a significant challenge. Different users of information, each with their specific needs, are involved in this process, but the requirement to access focused, relevant information in a digestible format in a timely manner is common to all.
Thomson Scientific is working on a solution for the pharmaceutical industry. It will provide a wide range of information sources integrated into a uniform platform that can be customized to individual customer needs.
One of the critical development efforts required to deliver such a solution is standardization of the indexing and classification underlying the various disparate information sources. This involves substantial development work, including: mapping of drug activity/mechanism keywords across indexing systems; creation of master indexes for drug synonyms, company names and chemical compounds; and matching of bibliographic citations and patent numbers.
|Title Annotation:||Strategies for Search, Taxonomy and Classification|
|Date:||Jul 1, 2004|