Designing a knowledge discovery system, Part 2: now that we have categorized, let's ... classify!
To handle this two-step requirement, knowledge management system design should be split into two consecutive phases. The first phase should focus on the organization and its need for a maintainable, reliable and universally understandable information repository. The focus in phase one is internal. This phase depends on the proper use of ontologies and taxonomies, as described in Part One of this series ("A Roadmap to Proper Taxonomy Design," Computer Technology Review, July 2003).
The second design phase (and the focus of this article) is the user-centric, externally focused dynamic classification phase--which, when layered on top of a solid taxonomically based informational foundation, constitutes a powerful, scalable and flexible system capable of complex problem-solving support. The beauty of dynamic classification is this flexibility and ability to adjust in the face of the huge and constantly changing information assets available today.
What's required is a set of tools that help individuals extract small details or serendipitously discover data relationships within the information foundation, in ways that make unique sense to them personally.
What is a Classification?
A classification can be visualized as a tree representation of what is actually a multi-dimensional matrix. "Dynamic" classification is the ability to cross and combine these dimensions--essentially slicing and dicing information as desired and in real-time--to place information into the perspective most meaningful to the user within a unique, time-specific problem-solving context.
It is this specific ability, the user-definable slicing and dicing of data, which supports knowledge discovery versus merely information retrieval.
These information dimensions or trees may be shifted or reversed, dramatically affecting the resulting classification even though the information latched to the categories within a dimension remains unchanged. In other words, the trees are permutable. In the two classifications in Figure 1, the dimensions used are the same. However, when the dimension order shifts, an entirely different perspective is generated.
In Classification 1, we can see diseases within an African context. In Classification 2, we can see the epidemiology of Alzheimer's disease across different geographical contexts.
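The effect of reordering dimensions can be sketched in a few lines of Python. The documents, the tag names and the `classify` helper below are hypothetical illustrations rather than part of any particular product. The point is only that the same tagged documents yield a different tree when the dimension order changes.

```python
from collections import defaultdict

# Hypothetical tagged documents: each carries one value per dimension.
DOCS = [
    {"id": "doc1", "geography": "Africa", "disease": "Alzheimer"},
    {"id": "doc2", "geography": "Africa", "disease": "Anthrax"},
    {"id": "doc3", "geography": "America", "disease": "Alzheimer"},
]

def classify(docs, dimension_order):
    """Nest documents into folders, one tree level per dimension, in the given order."""
    if not dimension_order:
        return [d["id"] for d in docs]
    first, rest = dimension_order[0], dimension_order[1:]
    groups = defaultdict(list)
    for d in docs:
        groups[d[first]].append(d)
    return {value: classify(members, rest) for value, members in groups.items()}

# Same documents, same dimensions, two different perspectives.
by_geography = classify(DOCS, ["geography", "disease"])
by_disease = classify(DOCS, ["disease", "geography"])
```

Here `by_geography` groups diseases under each region, while `by_disease` groups regions under each disease. The tags attached to the documents never change.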
The ability to shift dimension order is a true benefit of dynamic classification. Individuals need to understand many variables in order to make a good decision--particularly in an urgent situation. Moreover, each individual will go about this process in a different way. For example, if a terrorist attack were imminent, the local police, FBI and medical personnel would all want essentially the same information but from their own different perspectives.
Dynamic classifications generally occur in identifiable patterns. These are geography/topic, horizontal/vertical and vertical/vertical. This is useful to keep in mind as you begin to design your dynamic classification tools and identify the dimensions you will offer your users. Geography is the most commonly used dimension because it is an analytical element of so many decision-making processes. Terrorism in the Philippines or criminal law in Texas or domestic sales, for example, would all involve a geographic tree. An example of the horizontal/vertical pattern would be the petroleum business or anti-money laundering regulations. The horizontal tree (business) includes broad categories such as marketing, research, health and safety, etc. By crossing a horizontal category with a vertical dimension such as Petroleum, which may contain such categories as crude oil or solid waste, you would derive categories populated by documents highly specific to that type of business. The other major structure, vertical/vertical, really displays the power of dynamic classification. An example would be "MeSH Proteins" and "MeSH Diseases." In this case you would see all categories containing documents with information matching these two trees.
You can see how quickly this process becomes complex. There are thousands of diseases, and if you cross them with the dimension of proteins alone, you could have millions of possible combinations. If you add a third dimension, chemical compounds, you move quickly into the billions. The virtual space containing your multi-dimensional operation is huge, even when you use only a small part of it.
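The arithmetic behind this growth is a simple product of dimension sizes. The sketch below uses assumed, round category counts chosen only to match the orders of magnitude mentioned above. They are not real figures for any taxonomy.

```python
from math import prod

# Assumed category counts per dimension (illustrative only).
dimension_sizes = {
    "diseases": 4_000,
    "proteins": 1_000,
    "chemical_compounds": 1_000,
}

def virtual_space(sizes):
    """Number of possible folders when every category is crossed with every other."""
    return prod(sizes)

two_dims = virtual_space([dimension_sizes["diseases"], dimension_sizes["proteins"]])
three_dims = virtual_space(dimension_sizes.values())

print(f"{two_dims:,} combinations with two dimensions")
print(f"{three_dims:,} combinations with three dimensions")
```

With these assumed sizes, two dimensions give four million possible combinations and three give four billion.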
This highlights another design consideration. If you allow your users to cross too many large dimensions, the number of relevant documents resulting is likely to be low and rather unsatisfying for the user. While this sounds counter-intuitive, it happens because very few documents will co-locate a reference to disease A, protein B and chemical compound C. So, while the virtual space required is huge, it will be populated by a correspondingly small number of documents. You need to design an equilibrium between the number of dimensions you use and the number of documents you can actually target to populate the resulting classification. To do this successfully you need to understand your users. You will need to determine what type of information is most useful, which intersections of information will be most valuable, and then ensure that these individual dimensions are represented in your indexing and taxonomy development. For example, expert users are likely to want extremely specific information throughout their classification. General users will want to see a broader selection of data at the top and then drill down as knowledge increases. In some cases, you may need to design two systems to address each user group's need.
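One way to see this thinning effect is to count how many intersections actually hold documents as dimensions are added. The following is a minimal sketch on a made-up collection. The tag values and the `populated_cells` helper are assumptions for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical collection: each document lists its tags per dimension.
DOCS = [
    {"disease": {"A"}, "protein": {"B"}, "compound": {"C"}},
    {"disease": {"A"}, "protein": {"B"}, "compound": set()},
    {"disease": {"A", "D"}, "protein": {"E"}, "compound": set()},
    {"disease": {"D"}, "protein": set(), "compound": {"C"}},
]

def populated_cells(docs, dimensions):
    """Count documents per combination of categories across the chosen dimensions."""
    cells = Counter()
    for doc in docs:
        # A document lands in a cell only if it has a tag in every dimension.
        for combo in product(*(sorted(doc[dim]) for dim in dimensions)):
            cells[combo] += 1
    return cells

two = populated_cells(DOCS, ["disease", "protein"])
three = populated_cells(DOCS, ["disease", "protein", "compound"])
```

In this toy collection, crossing two dimensions populates three folders, while adding the third leaves a single folder holding a single document. Running the same count on a sample of your own collection gives a quick reading on whether a proposed combination of dimensions is worth offering.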
Once a number of template classifications have been designed, the next step is to benchmark the classifications and analyze how they become populated. Visualize the population process as a waterfall: imagine documents dropped at the top of the cliff and cascading down into various streams. Documents flow through the classification design and go as deep as they can to find their "best" folder(s). A document can pass through a node and explore further if it satisfies the rules established for that folder. At a minimum the document's tags must match the folder's name. A rule can, of course, be more sophisticated and combine taxonomic categories with semi-structured information otherwise extracted from the documents. Metadata included in forms, for example, or clinical trial reports, may be an extremely important information source. As well, an individual's personal database of notes, thoughts and clipped articles may be significant to their particular line of research. These little "treasure houses" must also be indexed and accessible through the folder's rules.
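The waterfall can be sketched as a recursive descent through the template. The folder names and the `matches` rule below are hypothetical. The rule implements only the minimum condition described above (the folder's name must appear among the document's tags), and a real system would substitute richer rules.

```python
# Hypothetical template classification: each folder is (name, [children]).
TREE = ("Summer", [
    ("Domestic Sales", [
        ("Convertibles", []),
        ("SUVs", []),
    ]),
    ("Exports", []),
])

def matches(folder_name, doc_tags):
    """Minimal folder rule: the document's tags must include the folder's name."""
    return folder_name in doc_tags

def best_folders(folder, doc_tags, path=()):
    """Return the deepest folder paths a document can reach from this folder."""
    name, children = folder
    if not matches(name, doc_tags):
        return []  # the document cannot pass through this node
    here = path + (name,)
    deeper = [p for child in children for p in best_folders(child, doc_tags, here)]
    return deeper or [here]  # stop here only if no child accepts the document

placed = best_folders(TREE, {"Summer", "Domestic Sales", "Convertibles"})
stalled = best_folders(TREE, {"Summer", "Domestic Sales"})
rejected = best_folders(TREE, {"Winter"})
```

A fully tagged document flows down to the leaf folder, a partially tagged one settles at the deepest level whose rule it satisfies, and a document that fails the top-level rule is not filed at all.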
In Figure 2, the fact that a given document has all the tags required by the path (date::geography::business::type) is not enough. To ensure maximum quality, you also want to make sure that there is enough "mutual information" between the occurrences of these tags--in short, that these tags are consistently found together in the document. For example, if you were looking for documents referencing red SUVs, you would not want to see a document dealing with blood or pigmentation. You would only want to see documents in which the words "red" and "SUVs" occurred in close proximity, indicating a relationship. The ability to latch concepts and accurately identify them as having additional meaning based on their proximity is critical to automating classification.
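A very simple stand-in for this idea is a proximity window: two terms count as related only if they occur within a few words of each other. The sketch below is not a true mutual-information measure, and the `window` size, tokenizer and sample sentences are assumptions for illustration.

```python
import re

def tokens(text):
    """Lowercase word tokens; a deliberately naive tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def co_occur(text, term_a, term_b, window=3):
    """True if term_a and term_b appear within `window` tokens of each other."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

related = co_occur("The dealer sold three red SUVs in June.", "red", "suvs")
unrelated = co_occur(
    "Red blood cells were studied. Separately, a survey of drivers "
    "covered sedans, trucks and SUVs.",
    "red", "suvs",
)
```

Both sample texts contain both words, but only the first passes the proximity check. A folder rule could require such a check in addition to the tag match.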
Once a template classification has been populated, you should check the "behavior" of the classification against your collection of documents to verify results in terms of both accuracy and efficiency. The following are a few tips for verifying the accuracy of your classifications.
Quality Controls
Folder Population
Folders should be scanned for population quantity. Some folders will be overpopulated and some will be quite thin. Clearly the documents should flow as deeply as possible into the classification. If the folders appear overpopulated, check for a bottleneck. If this occurs, the folders will need a bit more "room" so documents can flow appropriately into "children" folders. Use your taxonomy to break the folder into more categories.
On the other hand, you may have folders containing only one document. That's not good either. In these cases you may have one folder containing one document that you open only to see another subfolder containing one document and so on. To eliminate these "strings," you should take the "end" folder and shrink all the intermediate levels leading to it. The intermediate levels should be collapsed so that you see the folder containing the end document in the sub-category of the first level. If a large area of your classification is unpopulated, you may need to release some constraints. Perhaps the combination of dimensions is too rigid, or you don't have documents dealing with the right combination of information, or you should incorporate a different type of source material.
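Collapsing a "string" can be sketched as follows. This is one reading of the procedure, under the assumption that an intermediate level is a folder with no documents of its own and exactly one subfolder. The `folder` helper and the sample tree are hypothetical.

```python
def folder(name, docs=(), children=()):
    """Build a folder record; the structure is an assumption for this sketch."""
    return {"name": name, "docs": list(docs), "children": list(children)}

def collapse_strings(node):
    """Return a copy of the tree with single-child, document-free levels skipped."""
    new_children = []
    for child in node["children"]:
        # Walk past intermediate levels until a folder holds documents or branches.
        while not child["docs"] and len(child["children"]) == 1:
            child = child["children"][0]
        new_children.append(collapse_strings(child))
    return {"name": node["name"], "docs": list(node["docs"]), "children": new_children}

tree = folder("Africa", children=[
    folder("Diseases", children=[
        folder("Infectious", children=[
            folder("Anthrax", docs=["doc2"]),
        ]),
    ]),
    folder("Alzheimer", docs=["doc1"]),
])

collapsed = collapse_strings(tree)
print([c["name"] for c in collapsed["children"]])
```

After collapsing, the end folder appears directly under the first level, and the original tree is left untouched so the two can be compared.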
Interrupted Bell Curve
In a proper classification you would expect to see the natural distribution of your documents represented as a bell curve. You would see a few folders at the top, more in the middle and fewer again at the end--let's say the fourth and fifth levels of classification. If this is not the case, and you see larger document quantities at the end levels, then you need to add additional levels to increase specificity and derive a more natural bell curve.
The average size of your folders should also be reasonable: preferably in the double digits, rather than in the single or triple digits.
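Both checks are easy to automate once you can list each folder's level and document count. The sketch below uses invented counts, and `is_bell_shaped` is only a rough heuristic (one peak, not at the first or last level), offered as an assumption about how the check might be expressed.

```python
from statistics import mean

# Hypothetical folder populations: (level, documents in folder).
FOLDERS = [(1, 12), (1, 9), (2, 40), (2, 35), (2, 28),
           (3, 55), (3, 61), (4, 30), (4, 22), (5, 8)]

def documents_per_level(folders):
    """Total documents filed at each level of the classification."""
    totals = {}
    for level, count in folders:
        totals[level] = totals.get(level, 0) + count
    return dict(sorted(totals.items()))

def is_bell_shaped(totals):
    """Rough check: counts rise to a single interior peak and then fall."""
    counts = list(totals.values())
    peak = counts.index(max(counts))
    rising = all(a <= b for a, b in zip(counts[:peak], counts[1:peak + 1]))
    falling = all(a >= b for a, b in zip(counts[peak:], counts[peak + 1:]))
    return rising and falling and 0 < peak < len(counts) - 1

totals = documents_per_level(FOLDERS)
average_size = mean(count for _, count in FOLDERS)
end_heavy = documents_per_level([(1, 5), (2, 10), (3, 20), (4, 60)])
```

With these sample numbers, `totals` peaks at the third level and the average folder holds a double-digit number of documents, whereas `end_heavy` fails the shape check because its largest level is the last one.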
Quality Tests
Needle Test
One quick way to verify the quality of your classification is to find a bit of relevant information that occurs only once among all of your documents. Check to make sure that you can actually find that one, unique combination of data.
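The needle test lends itself to a small script. In the sketch below, `find_needles` looks for tag pairs that occur together in exactly one document, and `needle_test` reports any that a classification fails to surface. The `lookup` argument stands in for whatever query your own system exposes, and the sample data and both lookups are hypothetical.

```python
from itertools import combinations

# Hypothetical tagged collection.
DOCS = {
    "doc1": {"Africa", "Anthrax"},
    "doc2": {"Africa", "Alzheimer"},
    "doc3": {"America", "Alzheimer"},
    "doc4": {"Africa", "Alzheimer"},
}

def find_needles(docs):
    """Tag pairs that occur together in exactly one document."""
    seen = {}
    for doc_id, tags in docs.items():
        for pair in combinations(sorted(tags), 2):
            seen.setdefault(pair, []).append(doc_id)
    return {pair: ids[0] for pair, ids in seen.items() if len(ids) == 1}

def needle_test(needles, lookup):
    """Return the needles whose document the classification does not return."""
    return [pair for pair, doc_id in needles.items() if doc_id not in lookup(pair)]

def working_lookup(pair):
    """Stand-in for a classification that files documents correctly."""
    return [d for d, tags in DOCS.items() if set(pair) <= tags]

def broken_lookup(pair):
    """Stand-in for a classification that loses documents."""
    return []

needles = find_needles(DOCS)
missing_ok = needle_test(needles, working_lookup)
missing_bad = needle_test(needles, broken_lookup)
```

An empty result means every unique combination was found. Any pair that comes back identifies a needle the classification has lost.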
Full Discovery Test
Another quick test is to open each of the documents in a particular category. Verify that the information cited is not too far apart to represent a relationship. Also check to make sure that the information is correctly categorized.
While the recent leaps in search technology are astounding, the mind is still the better tool. Our innate ability to balance multiple variables and shift variable priorities when emergencies arise is a skill we should complement, not duplicate, with computing power. Next-generation knowledge discovery systems are beginning to exploit the power of the user, through capabilities like dynamic classification. By layering dynamic classification on any properly designed, taxonomically aligned information repository, corporations can immediately empower their knowledge workers to begin working more precisely, efficiently and with significantly more satisfying results.
Figure 1--Geography and diseases
Classification 1
  Africa
    Alzheimer
    Anthrax
Classification 2
  Alzheimer
    Africa
    America
Figure 2--Example
  Summer (metadata: date)
    Domestic Sales (taxonomic: geography and business)
      Convertibles (metadata: type)
Dr. Claude Vogel is chief technology officer at Convera (Vienna, VA).
Publication: Computer Technology Review
Date: Oct 1, 2003