Discovering Semantic Patterns in Bibliographically Coupled Documents

ABSTRACT

Issues in discovering knowledge in bibliographic databases are addressed. An example of semantic pattern analysis is used to demonstrate the methodological aspects of knowledge discovery in bibliographic databases. The semantic pattern analysis is based on the keywords selected from the documents grouped by bibliographical coupling. The frequency distribution patterns suggest the existence of a common intellectual base with a wide range of specialties and marginal areas in the antibiotic resistance literature. The resulting values for keyword density per rank show a difference of ten times between the specialty and marginal keyword densities. The possibilities and further studies of incorporating knowledge discovery results into information retrieval are discussed.

INTRODUCTION

Knowledge discovery in databases (KDD) is considered a process of nontrivial extraction of implicit, previously unknown, and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases (Chen, Han, & Yu, 1996, p. 866). Most research on KDD has focused on applications in business operations and well-structured data. Knowledge discovery in textual databases has been
underemphasized (Trybula, 1997). Among the limited publications on KD in
textual databases, the full-text document data are the primary source of
analysis. Lent, Agrawal, and Srikant (1997) developed a patent mining
system at IBM for identifying trends in large textual databases over a
period of time. They used sequential pattern mining to identify
recurring phrases and generate histories of phrases, after which they extracted phrases that satisfied a specific trend.

Discovering associations among the keywords in texts is another area of research in KD in textual databases. Using background knowledge about the relationships of keywords, Feldman and Hirsh (1996) studied associations among the keywords or concepts representing the documents. The knowledge base they built supplies unary or binary relations among the keywords representing the documents. Feldman, Dagan, and Hirsh (1998) developed a system for Knowledge Discovery in Text (KDT) that extracts keywords to represent document contents and allows users to browse a list of keywords that co-occur with another keyword(s) for knowledge discovery purposes. Mining in full-text documents attempts to extract useful associations and patterns for representing the document content, including clustering, categorization, summarization, and feature extraction.

While many studies using data from bibliographic databases were not conducted in terms of KDD or data mining, they nevertheless bear the marks of KDD's techniques and analysis. Such examples can be found in citation and cocitation analysis (Kassler, 1965; Small, 1973; Small & Sweeney, 1985; Braam, Moed, & van Raan, 1991), keyword classifications (Sparck Jones & Jackson, 1970), investigation of indexing similarities between keywords and controlled vocabularies (Shaw, 1990; Qin, in press), and author mapping (Logan & Shaw, 1987).

Discovering knowledge through mining textual data in bibliographic databases presents more problems than mining numerical data. One problem is that most fields in a bibliographic database have long character strings--e.g., author name, title, affiliation, journal title, and indexing terms (from both keywords and controlled vocabularies). Such long strings are usually difficult for statistical packages or data mining software to process. Unlike the full-text document source, bibliographic data are semi-structured. Although this may be an advantage over completely unstructured full-text documents, it also creates a challenge for mining tools: the data in the structured fields must not be mixed up when extracting data sets and performing analysis. Linguistic problems (such as singulars and plurals, stems and suffixes) and inconsistencies in abbreviating journal titles and institution names can also be challenging issues in mining bibliographic data. To obtain valid and reliable data for discovering trends and patterns in subject fields and research, data preprocessing and cleansing can become very time-consuming, both labor-intensive and intellectually demanding.

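The linguistic problems listed above (singular and plural forms, shared stems with different suffixes) are the kind that a small lookup table can absorb. The sketch below is only an illustration under stated assumptions: the `VARIANT_TO_CODE` table, the `normalize` helper, and the fallback rule for unmapped keywords are hypothetical, and the few entries simply borrow codes that this study reports in its preprocessing section.

```python
# Hypothetical lookup table: surface variants of a keyword -> one short code.
# The entries mirror codes reported in this study; the table itself is an
# assumption made for illustration.
VARIANT_TO_CODE = {
    "child": "child",
    "children": "child",
    "penicillin-resistance": "pn-r",
    "penicillin-resistant": "pn-r",
    "therapy": "ther",
    "therapeutic": "ther",
}


def normalize(keyword):
    """Return the code for a keyword, or the cleaned keyword if unmapped."""
    # Lower-case and collapse runs of whitespace before the lookup.
    cleaned = " ".join(keyword.lower().split())
    return VARIANT_TO_CODE.get(cleaned, cleaned)


coded = [normalize(k) for k in ["Children", "child", "Penicillin-Resistant"]]
```

Unmapped keywords pass through in cleaned form, so gaps in the table stay visible instead of being silently dropped.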
However, the most challenging issue remains whether there is a chance for information retrieval systems to "be extended to become knowledge discovery systems," or whether "the kinds of record existing in bibliographical and textual databases offer any possibility of analysis in ways similar to those in more structured factual databases" (Vickery, 1997, pp. 119-20). This study selected a set of bibliographic records as the data source for discovering semantic patterns among the keywords in these records. The purpose of this keyword analysis was to discover whether any semantic patterns existed in the keywords extracted from bibliographically coupled documents regarding antibiotic resistance in pneumonia and, if such patterns did exist, how the discovered knowledge about a subject field could be used to improve the effectiveness of knowledge representation and information retrieval. A preliminary test of antibiotic resistance in pneumonia literature found that documents citing the same publication not only co-cited other publications but also contained semantically similar or identical keywords in the titles of cited publications. The frequency distributions of these keywords characterized three distinctive strata: a very small number of keywords falling into the highest frequency region, a relatively larger group with moderate occurrences, and a majority of them appearing only once or twice. If the terms occurring most frequently represent the intellectual base in this subject area (Small, 1973; Small & Sweeney, 1985) and the ones with medium occurrences represent the specialties, then the terms occurring least frequently represent the marginal terms.
These marginal terms may be the links between the mainstream of antibiotic resistance research and the less overt but promising research. The citation-semantic analysis is aimed at discovering semantic patterns of the antibiotic resistance literature so that the analysis process and semantic patterns can be programmed into tools that can assist information searchers in building search queries and customizing their postsearch analysis. Specifically, this project studied whether the distribution follows the three strata described earlier, how such distribution can be measured, and to what extent the keywords in these strata reflect the research front in antibiotic resistance. The methods used to preprocess and analyze the data are discussed in detail in the following sections.

RESEARCH DESIGN

The first and most important step in KDD is to clarify what kinds of knowledge are to be discovered, because this decides what types of data or database one needs to work on and what techniques to use for discovering the knowledge anticipated. In general, mining data in any type of database includes association rule generalization, multilevel data characterization, data classification, data clustering, pattern-based similarity search, and mining path traversal patterns (Chen, Han, & Yu, 1996). This project sought to identify semantic patterns in antibiotic resistance literature, which would be based on the frequency analysis of keyword occurrences. To achieve this goal, one can obtain a set of working data either by selecting keywords directly from individual records or by obtaining a more coherent pool(s) of source documents by applying a citation restriction such as bibliographical coupling. When the bibliographical coupling method is used to select source documents, at least one publication is cited in common by all the source documents of a bibliographical coupling pool. By this criterion, the documents can be considered coherent in content. Because of this, the keyword data were collected from pools of source documents through bibliographical coupling.

DATA COLLECTION

The Science Citation Index (SCI) database was used to collect data.
The following search query was formulated to achieve relative precision and recall:

SELECT (ANTIBIOTIC? (W) RESISTAN?) AND PNEUMONI?

The query was executed in May 1996 and resulted in a total of 360 postings. After ranking by CR (Cited Reference) field, the number of records was reduced to 340 because some records did not include references. In Figure 1, these articles are represented by a_1, a_2, a_3, ..., a_n. A total of 8,753 publications (c_1, c_2, c_3, ..., c_k in Figure 1) were cited in the 340 papers. The most frequently cited paper was cited seventy-two times, which means the largest pool of source documents identified via bibliographical coupling contained seventy-two articles (see Table 1). The pools with the same number of source documents were treated as the same rank. All thirty-three ranks in this data set were grouped into three categories: 1 through 10 were large pools, those from 11 to 20 the medium, and the rest the small. The first five pools of source documents were selected from each category for extracting keyword data because of the time constraints for the project. Separate keyword files (i.e., w_1, w_2, w_3, ..., w_j in Figure 1) were downloaded for each pool of documents.

[Figure 1 ILLUSTRATION OMITTED]

Table 1. TOP 10 MOST FREQUENTLY CITED DOCUMENTS IN ANTIBIOTIC RESISTANCE IN PNEUMONIA LITERATURE

Rank   Frequency of    Author Name and Source
       Being Cited
1      72              KLUGMAN KP, 1990, V3, P171, CLIN MICROBIOL REV
2      45              MARTON A, 1991, V163, P542, J INFECT DIS
3      41              JACOBS MR, 1978, V299, P735, NEW ENGL J MED
4      38              FENOLL A, 1991, V13, P56, REV INFECT DIS
5      34              HANSMAN D, 1967, V2, P264, LANCET
6      33              APPELBAUM PC, 1992, V15, P77, CLIN INFECT DIS
7      32              SPIKA JS, 1991, V163, P1273, J INFECT DIS
8      28              PALLARES R, 1987, V317, P18, NEW ENGL J MED
9      26              APPELBAUM PC, 1987, V6, P367, EUR J CLIN MICRO
9      26              WARD J, 1981, V3, P254, REV INFECT DIS
10     25              JORGENSEN JH, 1990, V34, P2075, ANTIMICROB AGE
10     25              MUNOZ R, 1991, V164, P302, J INFECT DIS
10     25              PHILIPPON A, 1989, V33, P1131, ANTIMICROB AGEN

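The pooling and ranking behind Table 1 can be pictured with a short sketch. It is a minimal illustration, not the procedure actually run against SCI: the record layout (article ID mapped to its list of cited references), the function names `build_pools` and `rank_pools`, and the four toy records are all assumptions made for the example.

```python
from collections import defaultdict


def build_pools(records):
    """Group source articles into pools keyed by a reference they all cite."""
    pools = defaultdict(set)
    for article_id, cited_refs in records.items():
        for ref in cited_refs:
            pools[ref].add(article_id)
    return pools


def rank_pools(pools):
    """Rank pools by size, largest first; pools of equal size share a rank."""
    sizes = sorted({len(members) for members in pools.values()}, reverse=True)
    rank_of_size = {size: rank for rank, size in enumerate(sizes, start=1)}
    ranked = [(rank_of_size[len(m)], ref, len(m)) for ref, m in pools.items()]
    return sorted(ranked, key=lambda row: (row[0], row[1]))


# Toy, invented records: article ID -> cited references.
records = {
    "a1": ["c1", "c2"],
    "a2": ["c1", "c3"],
    "a3": ["c1", "c2"],
    "a4": ["c3"],
}
pools = build_pools(records)
ranking = rank_pools(pools)  # rows of (rank, cited reference, pool size)
```

Each pool is the set of articles that cite one reference, so every pair of articles inside a pool is bibliographically coupled through that reference. Giving equal-sized pools the same rank matches the tied ranks 9 and 10 in Table 1.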
DATA PREPROCESSING

The first step in preprocessing is cleaning the downloaded keyword files and converting them into tables. This can be done easily with a word processor's FIND and REPLACE functions. Macros or programs can also be written to read the text files into database tables. Data preprocessed either way would need to be checked for errors, missing values, and the irregular labels missed by the REPLACE command. The next step is then to assign to keywords the text codes that can be computed by analytic tools (see Appendix). As mentioned earlier, textual data mining faces difficulty in handling long character strings and normalizing terms linguistically. Long strings would not be suitable for calculating frequencies or performing other statistical analysis. The text codes designed for the keywords in this subject field are mnemonic and, in most cases, comprehensible without the help of the original forms. A dictionary or a knowledge base for linking text codes to their keywords can be built for automatic coding. In coding keywords for this data set, a general rule was made to maintain as much of the original form and semantics of the keywords as possible. Other coding rules were set as follows:

* The same codes were assigned to both singular and plural forms of the same keywords, e.g., invit = invitro activity/invitro activities, child = child/children.
* The same codes were assigned to those having the same stem but different suffixes, e.g., pn-r = penicillin-resistance/penicillin-resistant, ther = therapeutic and therapy.
* Key phrases were coded by the noun with its modifying adjective or noun as a modifier, e.g., pnu-k = klebsiella pneumoniae, pnu-r = resistant pneumoniae, pnu-s = streptococcus pneumoniae.
* A few keywords that were semantically the same but morphologically different were given the same code for the purpose of joining those with the same meanings. Only two keywords fell into this category in this data set: child was used for coding child, children, infants, and pediatric patients; and 3rdw for third-world and developing-countries.

The text coding process was done semi-manually since building the initial code dictionary often needs human intelligence to analyze and translate a keyword or phrase into an appropriate code. The coding consistency (i.e., the same keyword is given the same code or vice versa throughout the data set) was double checked by sorting the data in the order of keyword and text code and then the order of text code and keyword.

DATA ANALYSIS

Data analysis in KDD processes is associated with data generalization and summarization which "presents the general characteristics or a summarized high-level view over a set of user-specified data in a database" (Chen, Han, & Yu, 1996, p. 866). The semantic patterns of keywords can be generalized
from different perspectives--the simple frequency of occurrences and co-occurrences, or the number of unique keywords per rank (frequency), each of which uses a different measure to analyze the data.

The simple frequency of occurrences counts how many times a keyword appears in a bibliographic coupling pool. It draws a high-level view of the semantic patterns from keyword frequency distribution. How often a keyword occurred is often decided by the size of the keyword pool. Obviously, the larger a keyword pool is, the more likely it is for a keyword to occur more frequently. When comparing the simple keyword frequency of a large pool of source documents with that of a smaller one, the result can be misleading because of the uneven bases for comparison. A more meaningful and reliable measure would be relative occurrences--i.e., percentage of times that a keyword appears in the total occurrences.

The frequency of co-occurrences is useful for measuring the importance of a keyword in the subject area, but it needs to be used with care. This data set was divided into large, medium, and small groups of source document pools. A complete coordination of all possible co-occurrences would involve those between groups 1 and 2, 1 and 3, 2 and 3, and among all three. Even though a keyword may appear in two or three groups at the same time, its frequency of occurrences may vary greatly in different groups. There were also large variations in the numbers of total ranks or frequencies of keyword occurrences: thirty-three in the large group, twenty-four in the medium, and eleven in the small. These can lead to an invalid comparison for the same keyword with the same rank number but in different groups. For instance, a keyword ranked at eleven in the large group, which was considered high in its group, would have meant the lowest rank in the small group.

To normalize the frequency of occurrences, a measure of keyword density per rank was used. The keyword density per rank can be interpreted as the ratio of the number of unique keywords to the number of ranks at which the unique keywords occurred. It can be expressed in the following formula:

$$D(t) = \frac{\sum_{i=1}^{n} t_i}{r_i} \qquad [1]$$

where D(t) = average number of keywords $t_1, t_2, \ldots, t_i$ per rank, $r_i$ = number of ranks, and $\sum_{i=1}^{n} t_i$ = total number of unique keywords included from ranks 1 through n. Figure 2 shows how the keyword density was calculated.

[Figure 2 ILLUSTRATION OMITTED]

This measure eliminates the defects of simple frequency and co-occurrences and focuses on how many unique keywords scatter in a region. This region is denoted by the frequency rank, and its size can be set according to the distribution shape. In Equation [1], the least possible D(t) is 1, that is, the number of unique keywords and the number of ranks are the same. For example, if three unique keywords were found to have appeared in three different frequencies (or frequency ranks), then D(t) = 3/3 = 1. In theory D(t) has no upper bound: it is largest when all unique keywords appear at a single rank. It is clear that the keyword density will increase as a rank contains more unique keywords.

FINDINGS

Frequency Distribution

There were a total of 2,994 keywords in the fifteen pools of source documents. The keywords in the large group (source document pools 1-5) made up 54.5 percent of the total. The medium group had about 33 percent of the keywords, and the small group only about 12 percent (see Table 2). The decrease in the number of keywords was mainly due to the decrease in the size of document pools; the average number of keywords (7) per record remained approximately the same for each pool. Nonetheless, the frequency distribution of keywords in all three groups was very similar: a majority of the keywords appeared fewer than five times in each of the groups; as the occurrences increased, the percentage of keywords decreased (see Figure 3).

[Figure 3 ILLUSTRATION OMITTED]

Table 2. NUMBER OF KEYWORDS IN INDEXING RECORDS FOR THE SOURCE DOCUMENTS IDENTIFIED THROUGH BIBLIOGRAPHICAL COUPLING

Group Size by   Document   Number of   Number of   Percentage   Cumulative
Number of       Pool       Documents   Keywords                 Percentage
Documents
Large           1          72          512         17.1         17.1
                2          45          316         10.6         27.7
                3          41          291          9.7         37.4
                4          38          273          9.1         46.5
                5          34          241          8.0         54.5
Medium          6          25          200          6.7         61.2
                7          25          208          7.0         68.2
                8          23          207          7.0         75.2
                9          23          171          5.7         80.9
                10         23          202          6.7         87.6
Small           11         11           78          2.6         90.2
                12         11           77          2.6         92.8
                13         10           73          2.4         95.2
                14         10           67          2.2         97.4
                15         10           78          2.6        100.0
Total                      401       2,994        100.0

Note: Pools are pools of source documents identified through bibliographic coupling.

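The two normalized measures described under DATA ANALYSIS, relative occurrences and keyword density per rank, can be sketched in a few lines. This is a hedged illustration rather than the study's own computation: the function names are hypothetical, the toy list of coded keywords is invented, and the sketch assumes that each distinct frequency value counts as one rank and that the density is taken over the whole distribution rather than a chosen region.

```python
from collections import Counter


def relative_occurrences(keyword_counts):
    """Percentage of all keyword occurrences contributed by each keyword."""
    total = sum(keyword_counts.values())
    return {kw: 100.0 * n / total for kw, n in keyword_counts.items()}


def keyword_density(keyword_counts):
    """Number of unique keywords divided by the number of frequency ranks."""
    # Assumption: every distinct frequency value is one rank.
    ranks = set(keyword_counts.values())
    return len(keyword_counts) / len(ranks)


# Toy, invented coded keywords for a single pool.
coded = ["pn-r"] * 3 + ["pnu-s"] * 2 + ["child", "ther", "invit"]
counts = Counter(coded)
shares = relative_occurrences(counts)   # e.g. share of "pn-r" in the pool
density = keyword_density(counts)       # 5 unique keywords over 3 ranks
```

Three keywords at three different frequencies give a density of 3/3 = 1, the minimum noted in the text; the density rises as more keywords share the same frequency.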
Although the percentage of keywords declined dramatically as the group size decreased, all three groups shared the same top three keywords--antibiotic resistance, antimicrobial resistance, and streptococcus-pneumoniae. This suggests that a common "intellectual base" existed among all three groups (see Table 3). The percentage of these three keywords dropped in medium and small groups compared to the large group. A close examination of data revealed that the lower occurrences were caused mainly by fluctuations in individual groups (see Figure 4). Figure 4 suggests that such fluctuations became wider as the group size shrank to the next level.

[Figure 4 ILLUSTRATION OMITTED]

Table 3. RELATIVE FREQUENCIES OF KEYWORDS IN THE FIRST 25TH PERCENTILES IN THREE GROUPS

Keywords                              Large Group         Medium Group        Small Group
                                      Rank   Rel. freq.   Rank   Rel. freq.   Rank   Rel. freq.
Antibiotic-resistance                 1      7.8          1      6.0          1      4.6
Antimicrobial resistance              2      5.2          2      5.2          2      3.2
Streptococcus-pneumoniae              3      5.1          3      4.0          3      2.9
Children/Infants/Pediatric patients   4      4.2          6      2.6          4      2.2
Susceptibility                        5      3.7                              6      1.9
Infection/infections                                                          6      1.9
Day-care center/centers                                   7      2.2
United States                                             5      3.0          6      1.9
Haemophilus-influenza                                     4      3.4
3rd-generation cephalosporins                                                 5      2.1
Escherichia-co                                                                6      1.9
Mechanically ventilated patients                                              6      1.9
Penicillin-binding protein                                7      2.2

Co-Occurrence of Keywords
In addition to the base keywords, other keywords co-occurred in either all three groups or two of the three. The largest number of keywords (eighty-five) co-occurred in all three groups. Besides these eighty-five, only seven keywords co-occurred in both the large and small groups (see Table 4). The number of unique keywords that occurred in only one group was surprisingly similar: 25, 33, and 33 in the large, medium, and small groups respectively. Among the eighty-five keywords in "all groups" in Table 4, the same keyword often appeared with widely varying frequencies in different groups. The highest occurrences were concentrated in the large group and declined as the rank of the document pool went down (see Table 5). For example, "children" had sixty-nine occurrences in the large group but only twenty-six and nine in the medium and small groups respectively. While the numbers of unique keywords in the three groups did not differ significantly (p < 0.05), the relative occurrences (4.2, 2.6, and 2.5 percent respectively) show that more records in the large group had this keyword. A similar phenomenon occurred throughout most other keywords co-occurring in all three groups or in any two of them. Very few keywords occurred more often in the small group than in the medium or large group--i.e., although keywords co-occurred in different groups, they did not appear at the same frequency. Keywords co-occurring in only two groups were mostly those with lower frequencies. Figure 4 depicts the number of unique keywords having occurrences one through more than nineteen. Most unique keywords occurred frequently in the large group but much less frequently (only one, two, or three times) in the small group.
Table 4. NUMBER OF UNIQUE KEYWORDS THAT OCCURRED OR CO-OCCURRED IN DIFFERENT GROUPS
Large Group Medium Group Small Group All Groups
Large Group 25 46 7
Medium Group 46 33 28
Small Group 7 28 33
All Groups 85 85 85 85
Total 163 192 153
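The co-occurrence counts in Table 4 can be read as a partition of each group's keyword set: keywords unique to the group, keywords shared with exactly one other group, and keywords shared by all three. The following is a minimal sketch of that bookkeeping, not the procedure used in the study; the function name `cooccurrence_counts` and the toy keyword sets are hypothetical. The last lines check that the cell values printed in Table 4 add up to the reported totals.

```python
from typing import Dict, Set


def cooccurrence_counts(groups: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Partition each group's keywords into unique, pairwise-shared, and shared-by-all.

    Intended for exactly three groups, as in Table 4.
    """
    shared_by_all = set.intersection(*groups.values())
    table: Dict[str, Dict[str, int]] = {}
    for name, keywords in groups.items():
        others = [other for other in groups if other != name]
        row = {"all": len(shared_by_all)}
        for other in others:
            # Keywords found in this group and `other`, but not in all three groups.
            row[other] = len((keywords & groups[other]) - shared_by_all)
        in_other_groups = set().union(*(groups[other] for other in others))
        row[name] = len(keywords - in_other_groups)  # unique to this group
        row["total"] = len(keywords)
        table[name] = row
    return table


# Toy keyword sets (hypothetical, for illustration only).
toy = {
    "large": {"a", "b", "c", "d"},
    "medium": {"a", "b", "e"},
    "small": {"a", "c", "f", "g"},
}
toy_table = cooccurrence_counts(toy)

# Cell values as printed in Table 4: unique, pairwise, and all-group counts.
TABLE_4 = {
    "large": {"large": 25, "medium": 46, "small": 7, "all": 85},
    "medium": {"large": 46, "medium": 33, "small": 28, "all": 85},
    "small": {"large": 7, "medium": 28, "small": 33, "all": 85},
}
table_4_totals = {group: sum(row.values()) for group, row in TABLE_4.items()}
```

Summing each row reproduces the totals of 163, 192, and 153 unique keywords for the large, medium, and small groups.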
Table 5. PORTION OF THE FREQUENCY AND PERCENTAGE OF THE KEYWORDS THAT CO-OCCURRED
Keywords Occurring                     Large Group    Medium Group   Small Group
in All Three Groups                    Freq.  %       Freq.  %       Freq.  %
Children/Infants/Pediatric patients    69     4.2     26     2.6     9      2.5
Susceptibility                         60     3.7     3      0.3     7      1.9
Pneumococci                            47     2.9     21     2.1     4      1.1
Infection/infections                   44     2.7     13     1.3     7      1.9
Day-care                               42     2.6     22     2.2     5      1.3
United States                          37     2.3     6      0.6     7      1.9
Haemophilus-influenza                  32     2.0     34     3.4     3      0.8
Therapy/therapeutic                    30     1.8     1      0.1     7      1.9
Penicillin resistance                  28     1.7     19     1.9     4      1.1
Penicillin-binding protein             24     1.5     22     2.2     5      1.3
Penicillin                             22     1.3     6      0.6     1      0.3
Strains                                21     1.3     2      0.2     1      0.3
Disease                                18     1.1     6      0.6     3      0.8
Epidemiology                           17     1.0     11     1.1     2      0.5
Failure                                17     1.0     9      0.9     4      1.1
Otitis-media                           16     1.0     9      0.9     1      0.3
Vaccine/conjugate vaccine              16     1.0     2      0.2     2      0.5
THE KEYWORD DENSITY
To compute the keyword density per rank, the frequency distribution of keyword occurrences was plotted for each of the three groups after the intellectual base keywords had been excluded. Figure 6 reveals a sharp turn at four, which was then used as a dividing point between the marginal and specialty keywords in the sample. In other words, keywords occurring three or fewer times in the sample were assumed to be marginal in the subject under study, and those occurring four or more times to be the specialties.
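The dividing rule above can be sketched in a few lines. This is an illustrative sketch only: the function name `classify_keywords` and the region labels are hypothetical, the sample frequencies are large-group values read from Tables 5, 9, and 10, and the count of 129 for the base keyword is inferred from the top rank in Table 6 rather than stated directly in the text.

```python
from typing import Dict, Iterable

SPECIALTY_THRESHOLD = 4  # the "sharp turn at four" read from Figure 6


def classify_keywords(
    frequencies: Dict[str, int],
    base_keywords: Iterable[str],
    threshold: int = SPECIALTY_THRESHOLD,
) -> Dict[str, str]:
    """Assign each keyword to the intellectual base, specialty, or marginal region."""
    base = set(base_keywords)
    regions: Dict[str, str] = {}
    for keyword, count in frequencies.items():
        if keyword in base:
            regions[keyword] = "intellectual base"
        elif count >= threshold:
            regions[keyword] = "specialty"  # four or more occurrences
        else:
            regions[keyword] = "marginal"  # three or fewer occurrences
    return regions


# Large-group frequencies; the base keyword count is an inference (see lead-in).
large_group = {
    "Antibiotic-resistance": 129,
    "Children/Infants/Pediatric patients": 69,
    "Susceptibility": 60,
    "Ciprofloxacin": 4,
    "Pseudomonas-aeruginosa": 3,
    "Broth": 1,
}
regions = classify_keywords(large_group, base_keywords=["Antibiotic-resistance"])
```

Keeping the threshold as a parameter makes it easy to test how sensitive the marginal/specialty split is to the choice of dividing point, which the article derives from a single data set.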
Applying Formula [1] in the Data Analysis section, the keyword density was calculated according to the data in Table 6. When calculating the keyword density, ranks that had no keyword occurrences were treated as missing cases and ignored because only the actual number of frequency ranks reflected the keyword density. Thus the number of ranks for specialty keywords in the large group would be 42 minus 3 (intellectual base ranks) minus 3 (marginal ranks) minus 5 (missing cases), which equals 31, and so forth for the other two groups. Results in Table 7 show that the density of marginal keywords is approximately ten times greater than that of specialty keywords in all three groups. Further studies are needed to explore whether this is only a coincidence for this particular data set or a phenomenon existing across disciplines.
[Figure 6 ILLUSTRATION OMITTED]
Table 6. FREQUENCY DISTRIBUTION OF KEYWORD OCCURRENCES IN THREE GROUPS EXCLUDING THE THREE INTELLECTUAL BASE KEYWORDS
Number of Unique
Keywords ([t.sub.i])
Keyword Large Medium Small
No. Occurrences Group Group Group
1 1 33 72 83
2 2 26 30 24
3 3 18 26 15
4 4 10 9 9
5 5 8 5 7
6 6 9 8 2
7 7 1 4 8
8 8 6 5 1
9 9 5 6 1
10 10 5
11 11 2 4 1
12 12 2 2 1
13 13 5 2
14 14 4 2
15 15 1 1
16 16 4 3
17 17 3 1
18 18 1 2
19 19 3
20 20 1 1
21 21 1 2
22 24 1
23 28 1
24 30 1 1
25 32 2
26 34 1
27 36 1
28 37 1
29 40 1
30 42 1
31 43 1
32 44 1
33 47 1
34 48 1
35 51 1
36 52 1
37 59 1
38 60 1
39 69 1
40 84 1
41 85 1
42 129 1
Total 163 192 153
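The densities reported in Table 7 can be reproduced from the counts in Table 6. The sketch below assumes that Formula [1] is the ratio of unique keywords in a region to the number of occupied frequency ranks in that region, which is what the worked values in Table 7 suggest; the variable names are hypothetical. It also assumes that the totals in Table 6 include the three intellectual base keywords, since subtracting them is what yields the specialty counts of 83, 61, and 28.

```python
BASE_KEYWORDS = 3  # intellectual base keywords shared by all groups

# Totals and the first three rows (1, 2, and 3 occurrences) of Table 6.
TOTAL_UNIQUE = {"large": 163, "medium": 192, "small": 153}
MARGINAL_ROWS = {
    "large": (33, 26, 18),
    "medium": (72, 30, 26),
    "small": (83, 24, 15),
}
# Occupied frequency ranks per region, as used in Table 7.
SPECIALTY_RANKS = {"large": 31, "medium": 19, "small": 6}
MARGINAL_RANKS = 3


def keyword_density(unique_keywords: int, occupied_ranks: int) -> float:
    """Unique keywords per occupied frequency rank (assumed form of Formula [1])."""
    return unique_keywords / occupied_ranks


densities = {}
for group, total in TOTAL_UNIQUE.items():
    marginal = sum(MARGINAL_ROWS[group])
    specialty = total - marginal - BASE_KEYWORDS
    d_specialty = keyword_density(specialty, SPECIALTY_RANKS[group])
    d_marginal = keyword_density(marginal, MARGINAL_RANKS)
    densities[group] = {
        "specialty": round(d_specialty, 2),
        "marginal": round(d_marginal, 2),
        "ratio": round(d_marginal / d_specialty, 1),
    }
```

The marginal-to-specialty ratios come out between roughly 9 and 13 across the three groups, which is the "approximately ten times" difference described in the text.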
Table 7. KEYWORD DENSITY IN GROUPS
Keyword             Intellectual Base    Specialty            Marginal
Density (D)         Keywords (i)         Keywords (s)         Keywords (m)
Large Group (l)     D(li)=3/3=1          D(ls)=83/31=2.68     D(lm)=77/3=25.67
Medium Group (m)    D(mi)=3/3=1          D(ms)=61/19=3.21     D(mm)=128/3=42.67
Small Group (s)     D(si)=3/3=1          D(ss)=28/6=4.67      D(sm)=122/3=40.67
A further examination was made for keywords in the specialty and marginal groups. Several patterns emerged in the specialty keywords (see Tables 8, 9, and 10):
* Keywords co-occurring in two or three groups tended to be more generic or disciplinarily generic than non-co-occurring ones. Examples included children, day-care, failure, infections, prevalence, United States, genes.
* There were more microbial names and related infections among the co-occurring keywords than among the non-co-occurring ones. Examples included pneumococci, enterococcus/enterococci, Escherichia-coli, Klebsiella-pneumonia, Neisseria-gonorrhoea, haemophilus-influenza, streptococcus-pneumonicoccal meningitis, Branhamella-catarrhalis.
* There was a clear tendency in both the co-occurring and non-co-occurring keywords (the latter happened in the first two groups only) for antibiotic resistance in pneumonia to be investigated from the perspectives of genetics (binding protein gene, penicillin-binding proteins, multiresistant clone), microbiology (invitro activities), and immunology (pneumococcal polysaccharide). However, this tendency leaned more toward pharmaceutical aspects in relation to the microbes and the infections they caused in the co-occurring keywords (spectrum betalactam, chloramphenicol therapy, third-generation cephalosporins), and was more pathologically oriented in the non-co-occurring keywords (serotype distribution, strains, antimicrobial susceptibility, plasmids).
* Keywords in the double-digit density regions were generally more specific than those in the single-digit regions, though a few general ones did exist (see Table 10).
Table 8. SPECIALTY KEYWORDS THAT CO-OCCURRED IN TWO OR THREE GROUPS
Large Group Medium Group
No. Keywords Freq. % Freq. %
1 Children/Infants/ 69 4.2 26 2.6
Pediatric patients
2 Pneumococci 47 2.9 21 2.1
3 Infection/infections 44 2.7 13 1.3
4 Day-care/Day-care 42 2.6 22 2.2
centers
5 United States 37 2.3 6 0.6
6 Penicillin resistance 28 1.7 19 1.9
7 Penicillin-binding 24 1.5 22 2.2
proteins
8 Failure 17 1.0 9 0.9
9 Gene/genes 14 0.9 15 1.5
10 Pneumococcal 14 0.9 7 0.7
polysaccharide
11 Prevalence 13 0.8 11 1.1
12 Enterococcus/
enterococci 6 0.4 16 1.6
13 Escherichia-coli 5 0.3 16 1.6
14 Spectrum beta-lactam 5 0.3 18 1.8
15 Klebsiella-pneumoni 4 0.2 9 0.9
16 Neisseria-gonorrhoea 4 0.2 4 0.4
17 Meningitis 52 3.2 16 1.6
18 Chloramphenicol 36 2.5 8 0.8
therapy
19 Haemophilus-influenza 32 2.0 34 3.4
20 Penicillin 22 1.3 6 0.6
21 Disease 18 1.1 6 0.6
22 Epidemiology 17 1.0 11 1.1
23 Binding protein gene 16 1.0 14 1.4
24 Otitis-media 16 1.0 9 0.9
25 Streptococcus- 15 0.9 14 1.4
pneumonicoccal
meningitis
26 Bacterial-meningitis 14 0.9 11 1.1
27 Bacteria 13 0.8 6 0.6
28 Beta-lactam 13 0.8 8 0.8
antibiotics
29 Branhamella- 12 0.7 7 0.7
catarrhalis
30 Multiresistant clone 12 0.7 8 0.8
31 Antibiotics 11 0.7 9 0.9
32 Influenzae type-b 11 0.7 11 1.1
33 Invitro activities 10 0.6 5 0.5
34 Carriage 9 0.6 4 0.4
35 Diagnose 9 0.6 4 0.4
36 Resistance 9 0.6 6 0.6
37 Erythromycin 8 0.5 4 0.4
38 Staphylococcus-aureu 8 0.5 11 1.1
39 Influenzae 4 0.2 5 0.5
40 Tuberculosis 4 0.2 6 0.6
41 Susceptibility 60 3.7
42 Therapy/Therapeutic 30 1.8
43 Emergence 8 0.5
44 Protective efficacy 5 0.3
45 3rd-generation 6 0.6
cephalosporins
46 Enterobacter/ 7 0.7
Enterobacteriaceae
47 Respiratory-tract 4 0.4
infection
Small Group
No. Keywords Freq. %
1 Children/Infants/ 9 2.4
Pediatric patients
2 Pneumococci 4 1.1
3 Infection/infections 7 1.9
4 Day-care/Day-care 5 1.3
centers
5 United States 7 1.9
6 Penicillin resistance 4 1.1
7 Penicillin-binding 5 1.3
proteins
8 Failure 4 1.1
9 Gene/genes 4 1.1
10 Pneumococcal 7 1.9
polysaccharide
11 Prevalence 5 1.3
12 Enterococcus/
enterococci 4 1.1
13 Escherichia-coli 7 1.9
14 Spectrum beta-lactam 5 1.3
15 Klebsiella-pneumoni 6 1.6
16 Neisseria-gonorrhoea 6 1.6
17 Meningitis
18 Chloramphenicol
therapy
19 Haemophilus-influenza
20 Penicillin
21 Disease
22 Epidemiology
23 Binding protein gene
24 Otitis-media
25 Streptococcus-
pneumonicoccal
meningitis
26 Bacterial-meningitis
27 Bacteria
28 Beta-lactam
antibiotics
29 Branhamella-
catarrhalis
30 Multiresistant clone
31 Antibiotics
32 Influenzae type-b
33 Invitro activities
34 Carriage
35 Diagnose
36 Resistance
37 Erythromycin
38 Staphylococcus-aureu
39 Influenzae
40 Tuberculosis
41 Susceptibility 7 1.9
42 Therapy/Therapeutic 7 1.9
43 Emergence 4 1.1
44 Protective efficacy 4 1.1
45 3rd-generation 8 2.2
cephalosporins
46 Enterobacter/ 5 1.3
Enterobacteriaceae
47 Respiratory-tract 4 1.1
infection
Table 9. SPECIALTY KEYWORDS OCCURRING IN A SINGLE GROUP
Keywords Unique in the
No. Large Group Freq. %
1 Serotype distribution 48 2.9
2 Spain 32 2.0
3 Strains 21 1.3
4 New-Guinea/Papua-New-Guinea 17 1.0
5 Vaccine/Conjugate vaccine 16 1.0
6 Pneumococcal meningitis 14 0.9
7 Systemic infections 13 0.8
8 Penicillin-resistant pneumoniae 13 0.8
9 Resistant pneumococcus pneumoniae 10 0.6
10 Upper respiratory-tract 10 0.6
11 Community-acquired pnuemoniae 10 0.6
12 Immune-deficiency syndrome 9 0.6
13 Antibody 9 0.6
14 Vancomycin 8 0.5
15 Horizontal transfer 8 0.5
16 Antimicrobial susceptibility 8 0.5
17 Hungary 7 0.4
18 Iceland 6 0.4
19 Invasive disease 6 0.4
20 Patterns 6 0.4
21 Pneumococcal infections 6 0.4
22 Acquired immuno deficiency syndrome 6 0.4
23 Bacterial pneumonia 6 0.4
24 Cerebrospinal-fluid 6 0.4
25 Co-trimoxazole 6 0.4
26 Requiring hospitalization 5 0.3
27 Sensitivity 5 0.3
28 Septic arthritis 5 0.3
29 Human-Immuno deficiency Virus (HIV) 5 0.3
30 Capsular polysaccharide 5 0.3
31 Molecular epidemiology 4 0.2
32 Invasive pneumococcal infections 4 0.2
33 Anemia 4 0.2
34 Binding proteins 4 0.2
35 Capsular types 4 0.2
36 Ciprofloxacin 4 0.2
37 UK 30 3
38 Sulbactam 19 1.9
39 Resistant staphylococcuci 18 1.8
40 Nucleotide-sequences 13 1.3
41 Sri-Lanka 12 1.2
42 Cephalosporins 9 0.9
43 Pneumococcal serotype 9 0.9
44 Tetracycline 8 0.8
45 Transpeptidase 8 0.8
46 Mechanism 6 0.6
47 Catarrhalis beta-lactamase 5 0.5
48 Pneumococcal vaccine 5 0.5
49 Postsplenectomy sepsis 5 0.5
50 Ampicillin 4 0.4
51 Gram-negative bacilli 4 0.4
52 Plasmid/plasmids 4 0.4
53 Mechanically ventilated patients 7 1.9
54 Patients/critically ill patients 7 1.9
55 Intensive-care unit (AIDS) 5 1.3
56 Nosocomial infection 5 1.3
57 Transferable resistance 4 1.1
Table 10. MARGINAL KEYWORDS THAT CO-OCCURRED IN EITHER ALL THREE OR TWO OF THE THREE GROUPS
Large Group Medium Group
Keywords Freq. % Freq. %
Pseudomonas-aeruginosa 3 0.2 3 0.3
Management 3 0.2 3 0.3
Etiology 3 0.2 1 0.1
Isolate/clinical isolate 2 0.1 2 0.2
Coagulase-negatives 2 0.1 2 0.2
Aminoglycoside resistance 2 0.1 1 0.1
Legionnaires-disease 2 0.1 1 0.1
Microdilution system 2 0.1 1 0.1
Blood cultures 2 0.1 2 0.2
Trimethoprim-sulfame 1 0.1 2 0.2
Methicillin-resistant 1 0.1 2 0.2
Refractory periodont 1 0.1 1 0.1
Outer-membrane permeability 1 0.1 3 0.3
Outbreak 1 0.1 3 0.3
Nosocomial outbreak 1 0.1 3 0.3
Colonization 1 0.1 3 0.3
2x 2 0.1 2 0.2
Anti-inflammatory agent 3 0.2 1 0.1
Antibiotic-therapy 1 0.1 1 0.1
Aspiration 1 0.1 1 0.1
Broth 1 0.1 1 0.1
Cefamandole 3 0.2 1 0.1
Ceftriaxone 3 0.2 1 0.1
Clarithromycin 1 0.1 2 0.2
Clindamycin 1 0.1 1 0.1
Clones 2 0.1 2 0.2
Common organization 2 0.1 1 0.1
D-alanine ligase 2 0.1 3 0.3
Directions 1 0.1 1 0.1
Group-a 1 0.1 1 0.1
High-level resistance 2 0.1 3 0.3
Invasive pneumococcal infections 1 0.1 1 0.1
Nasopharyngeal carriage 3 0.2 2 0.2
Neisseria-meningitis 1 0.1 2 0.2
Pathogen 1 0.1 1 0.1
Populations 3 0.2 2 0.2
Quinolones 1 0.1 1 0.1
South-Africa 2 0.1 1 0.1
Spread 3 0.2 1 0.1
Streptococcus-pneumoniae strains 2 0.1 1 0.1
Structural-changes 1 0.1 1 0.1
Ampicillin 3 0.2
Antimicrobial agents 3 0.2
Bacterium legionella 2 0.1
Catarrhalis beta-lactamase 3 0.2
Cephalosporins 2 0.1
Dirithromycin 1 0.1
Mechanism 2 0.1
Norfloxacin 1 0.1
Nucleotide-sequences 2 0.1
Plasmid/plasmids 2 0.1
Pneumococcal vaccine 1 0.1
Spectrum 1 0.1
Affairs-medical-center 2 0.2
Anemia 1 0.1
Antibody 3 0.3
Aztreonam 1 0.1
Broad-spectrum cepha 1 0.1
Calcoaceticus var anitratus 2 0.2
Capsular polysaccharide 3 0.3
Ceftazidime resistance 3 0.3
Cerebrospinal-fluid 3 0.3
Ciprofloxacin 3 0.3
Classification 2 0.2
Community-acquired pnuemoniae 2 0.2
Digestive-tract 1 0.1
DNA 3 0.3
Enzymatic resistance 3 0.3
Horizontal transfer 1 0.1
Identification 1 0.1
Imipenem-cilastatin 3 0.3
India 1 0.1
Nursing-home patient 3 0.3
Patterns 1 0.1
Penicillin-resistant pneumoniae 3 0.3
Pneumococcal meningitis 3 0.3
Resistant pneumococcus pneumoniae 3 0.3
Salmonella-typhi 1 0.1
Selective decontamination 2 0.2
Staphylococcus-aureu 3 0.3
Steady-state treatment 2 0.2
Strains 2 0.2
Substitution 1 0.1
Systemic infections 2 0.2
Third-world countries 1 0.1
Transposition 1 0.1
Upper respiratory-tract 2 0.2
Vancomycin 1 0.1
Small Group
Keywords Freq. %
Pseudomonas-aeruginosa 2 0.5
Management 1 0.3
Etiology 1 0.3
Isolate/clinical isolate 1 0.3
Coagulase-negatives 1 0.3
Aminoglycoside resistance 2 0.6
Legionnaires-disease 1 0.3
Microdilution system 1 0.3
Blood cultures 1 0.3
Trimethoprim-sulfame 1 0.3
Methicillin-resistant 1 0.3
Refractory periodont 1 0.3
Outer-membrane permeability 1 0.3
Outbreak 1 0.3
Nosocomial outbreak 2 0.5
Colonization 3 0.8
2x
Anti-inflammatory agent
Antibiotic-therapy
Aspiration
Broth
Cefamandole
Ceftriaxone
Clarithromycin
Clindamycin
Clones
Common organization
D-alanine ligase
Directions
Group-a
High-level resistance
Invasive pneumococcal infections
Nasopharyngeal carriage
Neisseria-meningitis
Pathogen
Populations
Quinolones
South-Africa
Spread
Streptococcus-pneumoniae strains
Structural-changes
Ampicillin 1 0.3
Antimicrobial agents 1 0.3
Bacterium legionella 1 0.3
Catarrhalis beta-lactamase 1 0.3
Cephalosporins 3 0.8
Dirithromycin 1 0.3
Mechanism 2 0.5
Norfloxacin 1 0.3
Nucleotide-sequences 2 0.5
Plasmid/plasmids 2 0.5
Pneumococcal vaccine 1 0.3
Spectrum 1 0.3
Affairs-medical-center 1 0.3
Anemia 1 0.3
Antibody 1 0.3
Aztreonam 1 0.3
Broad-spectrum cepha 1 0.3
Calcoaceticus var anitratus 1 0.3
Capsular polysaccharide 1 0.3
Ceftazidime resistance 1 0.3
Cerebrospinal-fluid 2 0.5
Ciprofloxacin 1 0.3
Classification 1 0.3
Community-acquired pnuemoniae 2 0.5
Digestive-tract 3 0.8
DNA 3 0.8
Enzymatic resistance 2 0.5
Horizontal transfer 2 0.5
Identification 1 0.3
Imipenem-cilastatin 3 0.8
India 1 0.3
Nursing-home patient 1 0.3
Patterns 1 0.3
Penicillin-resistant pneumoniae 1 0.3
Pneumococcal meningitis 3 0.8
Resistant pneumococcus pneumoniae 1 0.3
Salmonella-typhi 1 0.3
Selective decontamination 3 0.8
Staphylococcus-aureu 2 0.5
Steady-state treatment 1 0.3
Strains 1 0.3
Substitution 1 0.3
Systemic infections 1 0.3
Third-world countries 2 0.5
Transposition 1 0.3
Upper respiratory-tract 3 0.8
Vancomycin 1 0.3
CONCLUSION
Knowledge discovery in bibliographic databases is distinctive compared to KD in full-text document and numerical databases. One challenge is transforming semi-structured textual data into the types and structures suitable for calculations and modeling. In the case of subject keywords, all the idiosyncrasies existing in natural language, including suffixes, different spellings for the same word, and synonyms, need to be normalized before analysis. Similar work on this type of term normalization has been done in automatic indexing, such as stem stripping (Paice, 1990; Porter, 1980). Harman and Candela (1990) argue that term normalization such as stem stripping is not worth the effort for large full-text databases because this operation has little impact on other methods (e.g., frequency counts) of indexing. While this may be true in indexing full-text documents, preprocessing of this kind is a necessity in discovering knowledge from bibliographic databases. The reason is obvious: semantic analysis [...] specialized and interdisciplinary subject fields.
Presentation of the knowledge discovered is an important part of the KDD process. Visualizing the patterns, trends, and associations in a subject field can be very challenging because of the size of the screen and the number of text values that one data field can contain. This study of semantic patterns in keywords was by no means a large one in scale, but the total number of keywords made it difficult to draw any legible charts for the whole data set. Including even one group of keywords would clutter the chart badly and make the keywords on the chart unrecognizable. Substituting shorter, mnemonic text codes for long keywords normalized the inconsistencies in keywords and left more room for visual presentation of the knowledge discovered.
The semantic patterns discovered in this data set suggest that different keyword density regions may be used as a controlling mechanism for better targeted searching. Traditionally, query expansion is one of the main techniques used to improve retrieval performance (Sparck Jones & Jackson, 1970; Salton, Fox, & Voorhees, 1985; Salton & Buckley, 1990; Harman, 1992). Query expansion allows searchers to browse the indexing term list or provides relevance feedback to them through frequency ranking or term weighting. While performance was reported to have improved by a high percentage in small test collections, it is still unproven that these techniques would achieve the same performance in large collections in the real world (Korfhage, 1997). Keyword density may provide a new solution to this uncertainty because it is computed on the basis of a collection of keywords extracted from bibliographically coupled source documents. By implementing keyword density analysis in an algorithm, it is possible that a simple search query to the database(s) would generate a group of keywords stratified by their density regions. Information searchers can then select keywords from different density regions according to their own definition of relevance.
The semantic patterns found in non-co-occurring and co-occurring keywords suggest that it is necessary and possible to design new search tools that will deliver "analyzed" search results to users. In retrieving information from terabyte databases, the most challenging task in information retrieval is probably how to find the most relevant information, in a manageable amount, in the easiest way. Conventional information systems have applied various sophisticated methods to accomplish this task but were limited by their design, which requires equally sophisticated search techniques to find the information and leaves the information filtration to users themselves. The development of science and the growth of scientific literature have made filtering relevant information more difficult than ever in highly interdisciplinary scientific research areas. The semantic pattern analysis of the keywords from bibliographical coupling shows a possibility that simple semantic processing of natural language (keywords extracted from citations in this case) may be programmed and serve as a tool for providing "analyzed" search results to users.
The results in this study are only preliminary. It is unknown whether the semantic patterns identified in this data set are a coincidence or a common phenomenon across subject fields. Further studies are needed to discover whether the subject category of keywords is related to the density region and whether the stratified keyword distribution and density can contribute to customizing the selection of a targeted group of documents and post-search analysis.
[Figure 5 ILLUSTRATION OMITTED]
REFERENCES
Braam, R. R.; Moed, H. F.; & van Raan, A. F. J. (1991a). Mapping of science by combined co-citation and word analysis. Part I: Structural aspects. Journal of the American Society for Information Science, 42(4), 233-251.
Braam, R. R.; Moed, H. F.; & van Raan, A. F. J. (1991b). Mapping of science by combined co-citation and word analysis. Part II: Dynamical aspects. Journal of the American Society for Information Science, 42(4), 252-266.
Chen, M. Y.; Han, J.; & Yu, P. S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8, 866-883.
Feldman, R., & Hirsh, H. (1996). Mining associations in text in the presence of background knowledge. In E. Simoudis, J. Han, & U. Fayyad (Eds.), KDD '96 (Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, August 1996) (pp. 343-346). Menlo Park, CA: AAAI Press.
Feldman, R.; Dagan, I.; & Hirsh, H. (1998). Mining text using keyword distributions. Journal of Intelligent Information Systems, 10(3), 281-300.
Harman, D. (1992). User-friendly systems instead of user-friendly front-ends: Four end-user systems employing probabilistic ranking: PRISE, CITE, MUSCAT and News Retrieval Tool. Journal of the American Society for Information Science, 43(2), 164-174.
Harman, D. K., & Candela, G. (1990). Retrieving records from a gigabyte of text on a minicomputer using statistical ranking. Journal of the American Society for Information Science, 41(8), 581-589.
Kessler, M. M. (1965). Comparison of the results of bibliographic coupling and analytic subject indexing. American Documentation, 16(3), 223-233.
Korfhage, R. R. (1997). Information storage and retrieval. New York: Wiley.
Lent, B.; Agrawal, R.; & Srikant, R. (1997). Discovering trends in text databases. In D. Heckerman, H. Mannila, & D. Pregibon (Eds.), KDD '97 (Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, California, August 14-17, 1997) (pp. 227-230). Menlo Park, CA: AAAI Press.
Logan, E. L., & Shaw, W. M. (1987). An investigation of the coauthor graph. Journal of the American Society for Information Science, 38(4), 262-268.
Paice, C. (1990). Another stemmer: Natural language processing. SIGIR Forum, 24(3), 56-61.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
Qin, J. (In press). Indexing similarities in a keyword database and a controlled vocabulary database: Antibiotic resistance in pneumonia. Journal of the American Society for Information Science.
Salton, G.; Fox, E. A.; & Voorhees, E. M. (1985). Advanced feedback methods in information retrieval. Journal of the American Society for Information Science, 36(2), 200-210.
Salton, G., & Buckley, C. (1990). Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4), 288-297.
Shaw, W. M. (1990). Subject indexing and citation indexing: Clustering structure in the cystic fibrosis document. Information Processing and Management, 26(6), 693-718.
Small, H. (1973). Co-citation in scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4), 265-269.
Small, H., & Sweeney, E. (1985). Clustering the science citation index using co-citations: I. A comparison of methods. Scientometrics, 7(3/6), 391-409.
Sparck Jones, K., & Jackson, D. M. (1970). The use of automatically-obtained keyword classifications for information retrieval. Information Storage and Retrieval, 5(4), 175-201.
Travis, J. (1994). Reviving the antibiotic miracle. Science, 264(5157), 360-362.
Trybula, W. J. (1997). Data mining and knowledge discovery. Annual Review of Information Science and Technology, 32, 197-229.
Vickery, B. (1997). Knowledge discovery from databases: An introductory review. Journal of Documentation, 53(2), 107-122.
ADDITIONAL REFERENCE
Small, H.; Sweeney, E.; & Greenlee, E. (1985). Clustering the Science Citation Index using co-citations: II. Mapping science. Scientometrics, 8(5/6), 321-340.
APPENDIX
EXAMPLES OF KEYWORDS AND THEIR SEMANTIC CODES IN THE SAMPLE
Keyword                       Code      Keyword                           Code
2x                            2x        HUMAN-IMMUNODEFICIENCY VIRUS      hiv
ANTI-INFLAMMATORY AGENT       agnt      HUNGARY                           hungary
AIDS                          aids      IMMUNE-DEFICIENCY SYNDROME        ids
ANTIMICROBIAL SUSCEPTIBILITY  sus-am    INVASIVE PNEUMOCOCCAL INFECTIONS  inf-pnu
ASPIRATION                    aspirat   PNEUMOCOCCAL INFECTIONS           inf-pnu
BARCELONA                     barc      INVASIVE DISEASE                  invasiv
BINDING PROTEINS              bp        MENINGITIS/MENINGEAL              meningi
BINDING PROTEIN GENE          bpg       MOLECULAR EPIDEMIOLOGY            epi-mol
BROTH                         broth     NEW-GUINEA/PAPUA-NEW              ng
CAPSULAR TYPES                capt      COMMON ORGANIZATION               org
CARRIAGE                      carrig    PATHOGEN                          pathoge
NASOPHARYNGEAL CARRIAGE       carr-n    BACTERIAL PNEUMONIA               pnu-b
CEFAMANDOLE                   cefam     QUINOLONES                        quinolo
CEFTRIAXONE                   ceftri    HIGH-LEVEL RESISTANCE             res-hi
CHLORAMPHENICOL THERAPY       ther-chl  SOUTH-AFRICA                      sa
CLARITHROMYCIN                clarith   SENSITIVITY                       sensiti
CLINDAMYCIN                   clindam   POSTSPLENECTOMY SEPSIS            sepsi-p
CLONES                        clone     SPREAD                            spread
MULTIRESISTANT CLONE          clone-m   STREPTOCOCCUS-PNEUMONIAE STRAINS  stra-s
DIRECTIONS                    direct    SOUTH-AFRICAN STRAIN              stra-sa
D-ALANINE LIGASE              ligase    STRUCTURAL-CHANGES                struct
ERYTHROMYCIN                  erythr    TETRACYCLINE                      tetracy
GROUP-A                       grp-a     ANTIBIOTIC-THERAPY                ther-a

Jian Qin, School of Information Studies, Syracuse University, 4-206 Center for Science and Technology, Syracuse, NY 13244

JIAN QIN is Assistant Professor at the School of Information Studies at Syracuse University in Syracuse, New York. She was the recipient of the OCLC LIS Research Grant in 1997 and the ISI Citation Research Grant in 1997. Ms. Qin is the author of over twenty journal articles, technical reports, and conference papers dealing with scientific communication, metadata, keyword semantic pattern analysis, and bibliometrics. More recently she co-edited a topical issue on Web research and information retrieval for Information Processing and Management.