Designing a knowledge discovery system, Part 2: now that we have categorized, let's ... classify!
Really, all people want to do when they use a search engine, portal, or even a full-blown knowledge management system is answer a question. They want all the information relevant to their question so they can formulate an answer. To do this, knowledge management systems must shift from a "retrieval-" to a "discovery-"based orientation. Next-generation knowledge discovery systems will introduce users to a new way of searching information assets by better complementing the user's own cognitive approaches to finding information. These systems must simultaneously manage vast and continually changing stores of information, as well as the idiosyncratic nature of the user.
To handle this two-part requirement, knowledge management system design should be split into two consecutive phases. The first phase should focus on the organization and its need for a maintainable, reliable and universally understandable information repository. The focus in phase one is internal. This phase depends on the proper use of ontologies and taxonomies, as described in Part One ("A Roadmap to Proper Taxonomy Design," Computer Technology Review, July 2003).
The second design phase (and the focus of this article) is the user-centric, externally focused dynamic classification phase, which, when layered on top of a solid taxonomically based informational foundation, constitutes a powerful, scalable and flexible system capable of complex problem-solving support. The beauty of dynamic classification is this flexibility and ability to adjust in the face of the huge and constantly changing information assets available today. What's required is a set of tools that help individuals extract small details or serendipitously discover data relationships within the information foundation, in ways that make unique sense to them personally.
What is a Classification?
A classification can be visualized as a tree representation of what is actually a multi-dimensional matrix. "Dynamic" classification is the ability to cross and combine these dimensions, essentially slicing and dicing information as desired and in real time, to place information into the perspective most meaningful to the user within a unique, time-specific problem-solving context. It is this specific ability, the user-definable slicing and dicing of data, that supports knowledge discovery rather than mere information retrieval.
These information dimensions or trees may be shifted or reversed, dramatically affecting the resulting classification even though the information latched to the categories within a dimension remains unchanged. It is easy to see that these trees are permutable. In the two classifications in Figure 1, the dimensions used are the same. However, when the dimension order shifts, an entirely different perspective is generated. In Classification 1, we can see diseases within an African context. In Classification 2, we can see the epidemiology of Alzheimer across different geographical contexts.
The ability to shift dimension order is a true benefit of dynamic classification. Individuals need to understand many variables in order to make a good decision, particularly in an urgent situation. Moreover, each individual will go about this process in a different way. For example, if a terrorist attack were imminent, the local police, FBI and medical personnel would all want essentially the same information, but each from their own perspective.
Dynamic classifications generally occur in identifiable patterns: geography/topic, horizontal/vertical and vertical/vertical. This is useful to keep in mind as you begin to design your dynamic classification tools and identify the dimensions you will offer your users. Geography is the most commonly used dimension because it is an analytical element of so many decision-making processes. [...] populated by documents highly specific to that type of business.
The other major structure, vertical/vertical, really displays the power of dynamic classification. An example would be "MeSH Proteins" and "MeSH Diseases." In this case you would see all categories containing documents with information matching these two trees. You can see how quickly this process becomes complex. There are thousands of diseases, and if you cross them with the dimension of proteins alone, you could have millions of possible combinations. If you add a third dimension, chemical compounds, you move quickly into the billions. The virtual space containing your multi-dimensional operation is huge, even when you use only a small part of it.
This highlights another design consideration. If you allow your users to cross too many large dimensions, the number of relevant documents resulting is likely to be low and rather unsatisfying for the user. While this sounds counter-intuitive, it happens because very few documents will co-locate references to disease A, protein B and chemical compound C. So, while the virtual space required is huge, it will be populated by a dismally small number of documents. You need to design an equilibrium between the number of dimensions you use and the number of documents you can actually target to populate the resulting classification. To do this successfully you need to understand your users.
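The crossing and reordering of dimensions described above, and the sparsity it produces, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three sample documents, the `classify` and `occupancy` helper names, and the `_docs` key used to hold document ids are hypothetical and do not describe any particular product.

```python
from math import prod

# Hypothetical documents, each tagged once per dimension (illustrative data only).
DOCS = [
    {"id": "d1", "geography": "Africa", "disease": "Alzheimer"},
    {"id": "d2", "geography": "Africa", "disease": "Anthrax"},
    {"id": "d3", "geography": "America", "disease": "Alzheimer"},
]


def classify(docs, dimension_order):
    """Build a nested-dict tree by crossing dimensions in the given order."""
    tree = {}
    for doc in docs:
        node = tree
        for dim in dimension_order:
            # Descend into (or create) the branch for this document's category.
            node = node.setdefault(doc[dim], {})
        # "_docs" is an assumed reserved key holding the ids filed at this node.
        node.setdefault("_docs", []).append(doc["id"])
    return tree


def occupancy(docs, dimensions):
    """Return (populated cells, total cells in the virtual space)."""
    populated = {tuple(doc[d] for d in dimensions) for doc in docs}
    possible = prod(len({doc[d] for doc in docs}) for d in dimensions)
    return len(populated), possible


# Same tags, two dimension orders, two different perspectives (compare Figure 1).
by_geography = classify(DOCS, ["geography", "disease"])
by_disease = classify(DOCS, ["disease", "geography"])
```

The documents and their tags never change between the two calls; only the order of the dimensions does. `occupancy` makes the sparsity argument concrete: the virtual space grows as the product of the dimension sizes, while the number of populated cells can never exceed the number of documents.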
You will need to determine what type of information is most useful and which intersections of information will be most valuable, and then ensure that these individual dimensions are represented in your indexing and taxonomy development. For example, expert users are likely to want extremely specific information throughout their classification. General users will want to see a broader selection of data at the top and then drill down as their knowledge increases. In some cases, you may need to design two systems to address each user group's needs.
Populating Classifications
Once a number of template classifications have been designed, the next step is to benchmark the classifications and analyze how they become populated. Visualize the population process as a waterfall: imagine documents dropped at the top of the cliff and cascading down into various streams. Documents flow through the classification design and go as deep as they can to find their "best" folder(s). A document can pass through a node and explore further if it satisfies the rules established for that folder. At a minimum, the document's tags must match the folder's name. A rule can, of course, be more sophisticated and combine taxonomic categories with semi-structured information otherwise extracted from the documents. Metadata included in forms, for example, or in clinical trial reports, may be an extremely important information source. In addition, an individual's personal database of notes, thoughts and clipped articles may be significant to their particular line of research. These little "treasure houses" must also be indexed and accessible through the folder's rules.
In Figure 2, the fact that a given document has all the tags required by the path (date::geography::business::type) is not enough. To ensure maximum quality, you also want to make sure that there is enough "mutual information" between the occurrences of these tags; in short, that these tags are consistently found together in the document. For example, if you were looking for documents referencing red SUVs, you would not want to see a document dealing with blood or pigmentation. You would only want to see documents in which the words "red" and "SUVs" occurred in close proximity, indicating a relationship. The ability to latch concepts and accurately identify them as having additional meaning based on their proximity is critical to automating classification.
Once a template classification has been populated, you should check the "behavior" of the classification against your collection of documents to verify results in terms of both accuracy and efficiency.
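The waterfall and the proximity requirement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(name, children)` tuple layout, the `place` and `within_proximity` names, the extra "SUVs" folder and the three-word window are all hypothetical, and a simple word-distance check is a crude stand-in for a real mutual-information measure.

```python
import re

# Hypothetical folder tree as (name, [children]) tuples, following Figure 2's path.
TREE = ("Summer", [("Domestic Sales", [("Convertibles", []), ("SUVs", [])])])


def place(doc_tags, folder, path=()):
    """Return the deepest folder paths a document can reach.

    The rule here is the minimum one: the document's tags must
    include the folder's name for the document to pass through it.
    """
    name, children = folder
    if name not in doc_tags:
        return []
    here = path + (name,)
    deeper = [p for child in children for p in place(doc_tags, child, here)]
    # If no child accepts the document, it settles in the current folder.
    return deeper or [here]


def within_proximity(text, term_a, term_b, window=3):
    """True if the two terms occur within `window` words of each other."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)
```

A document tagged only with "Summer" and "Domestic Sales" stops at the second level, while one that also carries "Convertibles" flows one level deeper. A richer folder rule could call `within_proximity` before letting a document pass, so that "red" near "SUVs" qualifies but "red" in a passage about blood does not.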
The following are a few tips for verifying the accuracy of your classifications.
Quality Controls
Folder Population
Folders should be scanned for population quantity. Some folders will be overpopulated and some will be quite thin. Clearly the documents should flow as deeply as possible into the classification. If the folders appear overpopulated, check for a bottleneck. If this occurs, the folders will need a bit more "room" so documents can flow appropriately into "children" folders. Use your taxonomy to break the folder into more categories.
On the other hand, you may have folders containing only one document. That's not good either. In these cases you may open one folder containing one document only to find another subfolder containing one document, and so on. To eliminate these "strings," you should take the "end" folder and shrink all the intermediate levels leading to it. The intermediate levels should be collapsed so that you see the folder containing the end document as a sub-category of the first level.
If a large area of your classification is unpopulated, you may need to release some constraints.
Perhaps the combination of dimensions is too rigid, or you don't have documents dealing with the right combination of information, or you should incorporate a different type of source material.
Interrupted Bell Curve
In a proper classification you would expect to see the natural distribution of your documents represented as a bell curve. You would see a few folders at the top, more in the middle and fewer again at the end, let's say the fourth and fifth levels of classification. If this is not the case, and you see larger document quantities at the end levels, then you need to add additional levels to increase specificity and derive a more natural bell curve.
Folder Size
The average size of your folders must be a reasonable number, preferably neither in the single digits nor in the triple digits but somewhere in between.
Quality Tests
Needle Test
One quick way to verify the quality of your classification is to find a bit of relevant information that occurs only once among all of your documents. Check to make sure that you can actually find that one, unique combination of data.
Full Discovery Test
Another quick test is to open each of the documents in a particular category. Verify that the information cited is not too far apart to represent a relationship. Also check to make sure that the information is correctly categorized.
While leaps made recently in search technology are astounding, the mind is still the better tool.
Our innate ability to balance multiple variables and shift variable priorities when emergencies arise is a skill we should complement, not duplicate, with computing power. Next-generation knowledge discovery systems are beginning to exploit the power of the user through capabilities like dynamic classification. By layering dynamic classification on any properly designed, taxonomically aligned information repository, corporations can immediately empower their knowledge workers to begin working more precisely, more efficiently and with significantly more satisfying results.
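For readers who want to automate two of the quality controls described earlier, collapsing "strings" of pass-through folders and checking the distribution of documents by level, the following Python sketch shows one possible approach. It is a simplification under stated assumptions: the `Folder` class and both function names are hypothetical, and each document is assumed to be stored only in the deepest folder it reached, so intermediate folders in a string hold no documents of their own.

```python
from dataclasses import dataclass, field


@dataclass
class Folder:
    """Hypothetical folder node: documents filed here plus subfolders."""
    name: str
    docs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def collapse_strings(folder):
    """Collapse chains of pass-through folders below `folder`, in place.

    A pass-through folder has no documents of its own and exactly one
    subfolder; it is replaced by that subfolder, so the "end" folder
    appears directly under the level where the string started.
    """
    collapsed = []
    for child in folder.children:
        while not child.docs and len(child.children) == 1:
            child = child.children[0]
        collapse_strings(child)
        collapsed.append(child)
    folder.children = collapsed
    return folder


def docs_per_level(folder, depth=0, counts=None):
    """Count documents filed at each depth, to compare against a bell curve."""
    counts = {} if counts is None else counts
    counts[depth] = counts.get(depth, 0) + len(folder.docs)
    for child in folder.children:
        docs_per_level(child, depth + 1, counts)
    return counts
```

Calling `collapse_strings` on a first-level folder such as "Africa" whose only content is a chain of single-subfolder levels leaves the end folder directly beneath it. The counts returned by `docs_per_level` can then be inspected to see whether documents pile up at the deepest levels, the sign described above that more levels are needed.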
Figure 1--Geography and diseases
Classification 1
  Africa
    Alzheimer
    Anthrax
Classification 2
  Alzheimer
    Africa
    America
Figure 2
Example:
Summer (metadata: date)
  Domestic Sales (taxonomic: geography and business)
    Convertibles (metadata: type)
www.in.convera.com
Dr. Claude Vogel is chief technology officer at Convera (Vienna, VA).