Template Mining for Information Extraction from Digital Documents.ABSTRACT WITH THE RAPID GROWTH OF DIGITAL INFORMATION RESOURCES (1) The data and information assets of an organization, department or unit. See data administration. (2) Another name for the Information Systems (IS) or Information Technology (IT) department. See IT. , information extraction In natural language processing, information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured (IE)--the process of automatically extracting information from natural language texts--is becoming more important. A number of IE systems, particularly in the areas of news/fact retrieval and in domain-specific areas, such as in chemical and patent information retrieval information retrieval Recovery of information, especially in a database stored in a computer. Two main approaches are matching words in the query against the database index (keyword searching) and traversing the database using hypertext or hypermedia links. , have been developed in the recent past using the template mining approach that involves a natural language processing Natural language processing Computer analysis and generation of natural language text. The goal is to enable natural languages, such as English, French, or Japanese, to serve either as the medium through which users interact with computer systems such as (NLP (Natural Language Processing) The capability of understanding human language. If the language is spoken, voice recognition plays an important role in converting the sounds to individual words. Then, natural language processing figures out what the words mean. ) technique to extract data directly from text if either the data and/or text surrounding the data form recognizable patterns. When text matches a template, the system extracts data according to according to prep. 1. As stated or indicated by; on the authority of: according to historians. 2. In keeping with: according to instructions. 3. the instructions associated with that template. This article briefly reviews template mining research. It also shows how templates are used in Web search engines--such as Alta Vista--and in meta-search engines--such as Ask Jeeves--for helping end-users generate natural language search expressions. Some potential areas of application of template mining for extraction of different kinds of information from digital documents are highlighted, and how such applications are used are indicated. It is suggested that, in order to facilitate template mining, standardization standardization In industry, the development and application of standards that make it possible to manufacture a large volume of interchangeable parts. Standardization may focus on engineering standards, such as properties of materials, fits and tolerances, and drafting in the presentation, and layout of information within digital documents has to be ensured, and this can be done by generating various templates that authors can easily download and use while preparing digital documents. INFORMATION EXTRACTION AND TEMPLATE MINING Information extraction (IE), the process of automatically extracting information from natural language texts, is gaining more and more importance due to the fast growth of digital information resources. Most work on IE has emerged from research into rule-based systems in natural language processing. Croft CROFT, obsolete. A little close adjoining to a dwelling-house, and enclosed for pasture or arable, or any particular use. Jacob's Law Dict. (1995) suggested that IE techniques, primarily developed in the context of the Advanced Research Projects Agency (ARPA ARPA - Defense Advanced Research Projects Agency ) Message Understanding Conferences (MUCs), are designed to identify database entities, attributes, and relationships in full text. Gaizauskas and Wilks (1998) defined IE as the activity of automatically extracting pre-specified sorts of information from short natural language texts typically, but by no means exclusively, newswire articles. Although works related to IE date back to the 1960s, perhaps the first detailed review of IE as an area of research interest in its own right was by Cowie and Lehnert (1996). However, a detailed review dividing the literature on IE into three different groups--namely, the early work on template filling, the Message Understanding Conferences (MUCs), and other works on information extraction--has recently been published by Gaizauskas and Wilks (1998). Template mining is a particular technique used in IE. Lawson et al. (1996) defined template mining as a natural language processing (NLP) technique used to extract data directly from text if either the data and/ or text surrounding the data form recognizable patterns. When text matches a template, the system extracts data according to instructions associated with that template. Although different techniques are used for information extraction and knowledge discovery--as described by Cowie and Lehnert (1996), Gaizauskas and Wilks (1998), and Vickery (1997)--template mining is probably the oldest information extraction technique. Gaizauskas and Wilks (1998) reported that templates were used to extract data from natural language texts against which "fact retrieval" could be carried out in the Linguistic String Project at New York University New York University, mainly in New York City; coeducational; chartered 1831, opened 1832 as the Univ. of the City of New York, renamed 1896. It comprises 13 schools and colleges, maintaining 4 main centers (including the Medical Center) in the city, as well as the that began in the mid-1960s and continued into the 1980s (reported by Sager, 1981). Numerous studies have been conducted, though most of them are domain-specific, using templates for extracting information from texts. This article briefly reviews some of these works. It also shows how templates are used for information retrieval purposes in major Web search engines A Web site that maintains an index and short summaries of billions of pages on the Web, Google being the world's largest. Most search engine sites are free and paid for by advertising banners, while others charge for the service. like AltaVista (http:// www.altavista.com). This discussion proposes that template mining has great potential in extracting different kinds of information from documents in a digital library environment. To justify this proposition, this article reports some preliminary tests carried out on digital documents, more specifically on some articles published in the D-Lib Magazine D-Lib Magazine is an on-line magazine dedicated to digital library research and development. Content of current and past issues are available free of charge. The publication is financially supported by the Defense Advanced Research Projects Agency (as part of the Digital (http://www.dilib.org/dilib). WORKS ON TEMPLATE MINING Template mining has been used successfully in different area: * extraction of proper names by Coates-Stephens (1992), Wakao et al. (1996), and by Cowey and Lehnert (1996); * extraction of facts from press releases related to company and financial information in systems like ATRANS (Lytinen & Gershman, 1986), SCISOR (Jacobs & Rau, 1990), JASPER (Andersen, et al., 1992; Andersen & Huettner, 1994), LOLITA (Costantino, Morgan, & Collingham, 1996), and FIES FIES Family Income and Expenditure Survey FIES Federazione Italiana Esercizi Spirituali (Italian) FIES Fellow of the Institution of Engineers and Shipbuilders in Scotland (Chong & Goh, 1997); * abstracting scientific papers by Jones and Paice (1992); * summarizing new product information by Shuldberg et al. (1993); * extraction of data from analytical chemistry analytical chemistry: see under chemistry. papers by Postma et al. (1990a, 1990b) and Postma and Kateman (1993); * extraction of reaction information from experimental sections of papers in chemistry journals by Zamora and Blower (1984a,1984b); * processing of generic and specific chemical designations from chemical patents by Chowdhury and Lynch (1992a, 1992b) and by Kemp (1995); and * extraction of bibliographic citations from the full texts of patents by Lawson et al. (1996). Template mining has largely been used for extraction of information from news sources and from texts in a specific domain. Gaizauskas and Wilks (1998) reported that applied work on filling structured records from natural language texts originated in two long-term research projects: The Linguistic String project (Sager, 1981) at New York University and the research on language understanding and story comprehension carried out at Yale University Yale University, at New Haven, Conn.; coeducational. Chartered as a collegiate school for men in 1701 largely as a result of the efforts of James Pierpont, it opened at Killingworth (now Clinton) in 1702, moved (1707) to Saybrook (now Old Saybrook), and in 1716 was by Schank and his associates (Schank, 1975; Schank & Abelson, 1977; Schank & Riesbeck, 1981). The first research was conducted in the medical science domain, particularly involving radiology reports and hospital discharge summaries discharge summary A document prepared by the attending physician of a hospitalized Pt that summarizes the admitting diagnosis, diagnostic procedures performed, therapy received while hospitalized, clinical course during hospitalization, prognosis, and plan of , while the second research led to many other research works in the early 1980s that used the principles and techniques of IE to develop practical applications such as the FRUMP frump n. 1. A girl or woman regarded as dull, plain, or unfashionable. 2. A person regarded as colorless and primly sedate. system developed by De Jong De Jong is the most common Dutch surname. Many people bear this name, including many important historical figures. Some of these people are mentioned below. De Jong may mean:
ATRANS (Lytinen & Gershman, 1986), another IE system, was soon developed and commercially applied. ATRANS used the script approach (Schank & Abelson, 1977; Schank & Riesbeck, 1981) for automatic processing of money transfer messages between banks. Another successful application of IE has produced a commercial online news extraction system called SCISOR (Jacobs & Rau, 1990) that extracts information about corporate mergers and acquisitions from online news sources. JASPER (Andersen et al., 1992; Andersen & Huettner, 1994) was another IE system developed for fact extraction for Reuters. JASPER uses a template-driven approach and partial analysis techniques to extract certain key items of information from a limited range of texts such as company press releases. LOLITA (Costantino, Morgan, & Collingham, 1996) is a financial IE system that uses three pre-defined groups of templates designed according to a "financial activities approach," namely, company related templates, company restructuring templates, and general macroeconomic mac·ro·ec·o·nom·ics n. (used with a sing. verb) The study of the overall aspects and workings of a national economy, such as income, output, and the interrelationship among diverse economic sectors. templates. In addition, the user-definable template allows the user to define new templates using natural language sentences. Chong and Goh (1997) developed a similar template-based financial information extraction system, called FIES, that extracts key facts from online news articles. Applications of template mining techniques for automatic abstracting can be traced back to 1981 when Paice (1981) used what he called indicator phrases (such as "the results of this study imply that ...") to extract topics and results reported in scientific papers for generating automatic abstracts. Paice continued his work to improve on this technique and for resolving a number of issues in natural language processing (see, for example, Jones & Paice, 1992; Paice & Husk husk (husk) an outer covering or shell, as of some fruits and seeds. psyllium husk the cleaned, dried seed coat from the seeds of Plantago , 1987). Shuldberg et al. (1993) described a system that digests large volumes of text, filtering out irrelevant articles and distilling the remainder into templates that represent information from the articles in simple slot/filler pairs. The system consists of a series of programs each of which contributes information to the text to help determine which strings constitute appropriate values for the slots in the template. Chemical and patent information systems have been the prominent areas for the application of templates for IE. TICA TICA The International Cat Association TICA Thailand Incentive and Convention Association TICA Timpanogos Cave National Monument (US National Park Service) TICA Tenant In Common Association TICA Turbine Inlet Cooling Association (Postma et al., 1990a, 1990b; Postma & Kateman, 1993) used templates to extract information from the abstracts of papers on inorganic tritimetric analysis. The parsing See parse. parsing - parser program used in TICA followed an expectation-driven approach where words or groups of words expect other words or concepts to appear. Zamora and Blower (1984a, 1984b) developed a system that automatically generates reaction information forms (RIFs) from the descriptions of syntheses of organic chemicals in the Journal of the American Chemical Society
adj. Serving or tending to prepare or make ready; preliminary. n. Something that prepares for or acts as a preliminary to something following. reactions from the experimental sections of the Journal of Organic Chemistry The Journal of Organic Chemistry (abbreviated as J. Org. Chem. or JOC) is a scientific journal for original contributions of fundamental research in organic and bioorganic chemistry. It is published by the American Chemical Society. papers. This work identified seven sequences of events that were used for building templates for the text of an experimental paper. Chowdhury and Lynch (1992a, 1992b) developed a template-based method for converting to GENSAL (a generic structure language developed at the University of Sheffield The University of Sheffield is a research university, located in Sheffield in South Yorkshire, England. Reputation Sheffield was the Sunday Times University of the Year in 2001 and has consistently appeared as their top 20 institutions. ) those parts of the Derwent Documentation Abstracts that specify generic chemical structures. Templates for processing both the variable and multiplier expressions, which predominate in the assignment statements in the Derwent Documentation Abstracts, were identified for further processing. As part of this research, Chowdhury (1992) also conducted a preliminary discourse analysis Discourse analysis (DA), or discourse studies, is a general term for a number of approaches to analyzing written, spoken or signed language use. The objects of discourse analysis—discourse, writing, , conversation, communicative event, etc. of European chemical patents that identified the common patterns of expressions occurring in different parts of patent texts. This work prompted further research (Kemp, 1995; Lawson et al., 1996) leading to the use of template mining in the full text of chemical patents. Lawson et al. (1996) reported their work using the template mining approach to isolate and extract automatically bibliographic citations to patents, journal articles, books, and other sources from the full texts of English-language patents. There is also some work that examines the development of specific tools and techniques for information extraction using templates. For example, Sasaki (1998) reported an ongoing project on building an information extraction system that extracts information from a real-world text corpus In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). They are used to do statistical analysis, checking occurrences or validating linguistic rules on a specific universe. such as newspaper articles and Web pages. As part of this project, an inductive logic programming Inductive logic programming (ILP) is a subfield of machine learning which uses logic programming as a uniform representation for examples, background knowledge and hypotheses. (ILP ILP Inductive Logic Programming ILP Instruction-Level Parallelism ILP Individual Learning Plan ILP Independent Labour Party ILP Independent Living Program ILP Institut Latihan Perindustrian (Malaysia) ) system has been developed to generate IE rules from examples. Gaizauskas and Humphreys (1997) described the approach taken to knowledge representation in the LaSIE information extraction system, particularly the knowledge representation formalisms, their use in the IE task, and how the knowledge represented in them is acquired. LaSIE first translates individual sentences to a quasi [Latin, Almost as it were; as if; analogous to.] In the legal sense, the term denotes that one subject has certain characteristics in common with another subject but that intrinsic and material differences exist between them. logical form and then constructs a discourse model of the entire text from which template fills are derived. Guarino (1997) argued that the task of information extraction can be seen as a problem of semantic matching between a user-defined template and a piece of information written in natural language. He further suggested that the ontological on·to·log·i·cal adj. 1. Of or relating to ontology. 2. Of or relating to essence or the nature of being. 3. assumptions of the template need to be suitably specified and compared with the ontological implications of the text. Baralis and Psaila (1997) argued that the current approaches to data mining usually address specific user requests, while no general design criteria Noun 1. design criteria - criteria that designers should meet in designing some system or device; "the job specifications summarized the design criteria" criterion, standard - the ideal in terms of which something can be judged; "they live by the standards of their for the extraction of association rules are available for the end-user. To solve this problem, they have proposed a classification of association rule types that provides a general framework for the design of association rule mining applications and predefined templates as a means to capture the user specification of mining applications. Although numerous research projects have been undertaken, and some are currently ongoing, Croft (1995) suggested that the current state of information extraction tools is such that it requires a considerable investment to build a new extraction application, and certain types of information are very difficult to identify. However, Croft further commented that extraction of simple categories of information is practical and can be an important part of a text-based information system. This article highlights some potential areas of application of template mining in a digital library environment. USE OF TEMPLATES IN WEB SEARCH ENGINES Gaizauskas and Wilks (1998) suggested that there is a contrast between the aims of information extraction and information retrieval systems in the sense that IR retrieves relevant documents from collections, while IE extracts relevant information from documents. However, the two are complementary, and their use in combination has the potential to create powerful new tools in text processing and retrieval. Indeed, IE and IR are equally important in the electronic information environment, particularly the World Wide Web, and templates have been used both for IR and IE. Many applications of template mining mentioned above handle digital texts available on the Web, while search engines use templates to facilitate IR. Search engines are one of the most essential tools on the Internet--they help find Web sites relating to relating to relate prep → concernant relating to relate prep → bezüglich +gen, mit Bezug auf +acc a particular subject or topic. Search engines are basically huge databases containing millions of records that include the URL URL in full Uniform Resource Locator Address of a resource on the Internet. The resource can be any type of file stored on a server, such as a Web page, a text file, a graphics file, or an application program. of a particular Web page along with information relating to the content of the Web page supplied in the HTML HTML in full HyperText Markup Language Markup language derived from SGML that is used to prepare hypertext documents. Relatively easy for nonprogrammers to master, HTML is the language used for documents on the World Wide Web. by the author. A search engine obtains this information via a submission from the author or by the search engine doing a "crawl" using "robot crawlers" of the Internet for information. The most popular search engines include: AltaVista, Excite, Hotbot, Infoseek, Lycos, Webcrawler, Yahoo, and so on. Some search engines use templates to help end-users submit natural language queries A query expressed by typing English, French or any other spoken language in a normal manner. For example, "how many sales reps sold more than a million dollars in any eastern state in January?" In order to allow for spoken queries, both a voice recognition system and natural language used by search engines to conduct a search on specific topics. Two small sets of tests were conducted to see how this is done in a large search engine--AltaVista--and in a meta search engine--Ask Jeeves. The following section shows how these search engines use templates for natural language query formulation in their interfaces. USE OF TEMPLATES IN ALTA VISTA See AltaVista. (World-Wide Web) Alta Vista - A World-Wide Web site provided by Digital which features a very fast Web and Usenet search engine. As of April 1996 its word index is 33GB in size. The Alta Vista search engine (http://www.altavista.com) helps users find information on the Web. One interesting feature of this search engine is that a user can enter one or more search terms/phrases or can type a natural language statement such as "What is the capital of Alaska?" or "Where can I find quotations by Ingmar Bergman Noun 1. Ingmar Bergman - Swedish film director who used heavy symbolism and explored the psychology of the characters (born 1918) Bergman ?" Taking the second option, a simple query statement, "Where can I find information on Web search engines?" was typed in the specified box of the Alta Vista search interface (see Figure 1). Along with the results, Alta Vista came up with two templates that contain natural language sentences related to the search topic (see Figure 2). By clicking on the box at the end of the statement "How do I (Internet skill)?" the system shows a box containing various options (Figure 3), any of which can be chosen to complete the sentence, the default one being "search through ALL web sites." By choosing this, or any other option from the box, a user can formulate a sentence-like query such as: "How do I search through all Web sites?" or "How do I learn HTML?" or "How do I use the Internet as a telephone?" and so on. [Figures 1-3 ILLUSTRATION OMITTED] The above examples show that the search engine uses templates and various options for the "(Internet skill)" slot in the template "How do I (Internet skill)." By clicking on the "More answers" option (see Figure 2), a user can get more such templates (see Figure 4). It may be noted that for many slots, such as "computing term," "Internet term," "search engine," and so on (see Figure 4), there are various options that can be displayed by clicking on the appropriate box. Some of these options are shown in Figure 5. Thus, the search engine uses various templates and provides options to fill in the slots to prepare sentence-like queries. Once a user prepares a search sentence by choosing the appropriate option from the drop-down box and clicks on the "Answer" button, the system conducts a search and fetches the relevant hits. However, it may be noted that the format of the query templates, and quite obviously the contents of the box showing the various options for the slots, vary from query to query. For example, when the system was asked "When will the next World Cup Football Games be held?" the templates that came up on the output screen were different (see Figures 6, 7, and 8). However, the system does not always come up with natural language query templates. For example, when the query, "What is a hurricane?" was given, the system simply produced a list of hits and no templates (see figure 9). [Figures 4-9 ILLUSTRATION OMITTED] USE OF TEMPLATES IN ASK JEEVES Noun 1. Ask Jeeves - a widely used search engine accepting plain English questions or phrases or terms trademark - a formally registered symbol identifying the manufacturer or distributor of a product Ask Jeeves (http://www.askjeeves.com) is a meta-search engine that represents a model of an application using knowledge management techniques in order to better organize disparate information sources (Stanley, 1998). It draws on the expertise of experienced human Web searchers and encapsulates this expertise in a database so that it can be put to use by others. Various questions and their answers are manually selected by human editors who scan resources on the Web on a daily basis to build up a knowledge base of information about sites which might be used to answer common questions. The questions and the Web pages which answer them are then stored as a series of templates in the Ask Jeeves knowledge base, and keywords and concepts in a search string are matched against them in order to retrieve the questions and their corresponding Web sites (Stanley, 1998). Ask Jeeves was asked a simple question: "What is a hurricane?" (see Figure 10), and the system created a number of templates as shown in Figure 11. [Figures 10-11 ILLUSTRATION OMITTED] Thus, the search engines use templates, though not for IE but for IR, more specifically to help users formulate a natural language query. However, the way these templates are created is quite interesting. Human experts conduct searches on the topics and, based on the search results, organize them into different groups. These groupings could be on different topics such as "computing terms," "Internet terms," "Internet skills," and so on. These are then created as slots in sentence-like queries, and the various options for the slots (each "Internet term," "Internet skill," and so on) are then presented in boxes for the end-user to select while searching. TEMPLATE MINING IN DIGITAL LIBRARIES The British Library British Library, national library of Great Britain, located in London. Long a part of the British Museum, the library collection originated in 1753 when the government purchased the Harleian Library, the library of Sir Robert Bruce Cotton, and groups of manuscripts. DL (digital library) Program (The British Library ..., 1997) defines digital library as the widely accepted descriptor (1) A word or phrase that identifies a document in an indexed information retrieval system. (2) A category name used to identify data. (operating system) descriptor for the use of digital technologies to acquire, store, conserve, and provide access to information and materials in whatever form it was originally published. The Stanford digital library working paper (Reich & Winograd, 1995) defines a digital library as a coordinated collection of services that are based on collections of materials, some of which may not be directly under the control of the organization providing a service in which they play a role. The contents of a digital library, being digital information sources, provide ample opportunities for applying template mining resulting in the extraction of valuable information in a number of areas in a digital library environment. Four areas of such an application have been identified and how this is done is discussed in the following sections. AUTOMATIC CREATION OF CITATION DATABASES OF DIGITAL DOCUMENTS Recently there have been some works on citation studies in the Web environment. Almind and Ingwersen (1998) introduced the concept of "Webometrics"--i.e., the application of informetric methods to the WWW. They have argued that citation analysis Citation Analysis is the most common method of bibliometrics. Citation analysis uses citations in scholarly works to establish links to other works or other researchers. Co-citation coupling and bibliographic coupling are specific kinds of citation analysis. on the WWW has not been tested in practice. In another publication, Ingwersen and Hjortgaard (1997) discussed the advantages and disadvantages of using the WWW for informetric analysis and examined the pitfalls of online informetric analysis using the ISI ISI International Sensitivity Index, see there (Institute of Scientific Information) files. A quick and simple examination of some issues of the electronic journal D-Lib Magazine (http://www.dilib.org/dilib) revealed that template mining can be used to develop citation databases automatically from the online articles. Such databases may contain information somewhat similar to the ISI databases, such as the citing author, address of the citing author, title of the citing article, keywords, and so on as well as the authors, titles, and bibliographic details of the cited articles. The simple template mining approach may be used to extract the information for each of these fields in the citation database that can later be used for various citation analysis and other purposes. Analysis of the articles (they are called stories in D-Lib Magazine) published in the nine 1998 issues of D-Lib Magazine (January, February, March, April, May, June, July/August, September, and October) revealed a general structure of the articles (see Figure 12). [Figure 12 ILLUSTRATION OMITTED] Figure 12 provides a general idea of the different kinds of templates that can be generated to extract information from the various sections of each article. Text that appears in bold in Figure 12 shows that the concerned text is constant--i.e., it appears in each article. Similarly, empty boxes indicate that texts appear there to indicate value for the given slot--e.g., title, author's name Noun 1. author's name - the name that appears on the by-line to identify the author of a work writer's name name - a language unit by which a person or thing is known; "his name really is George Washington"; "those are two names for the same thing" , and so on. Texts in some slots appear in a specific format, for example, in the "Issue Date" slot, the particular issue appears in the format "Month Year"--e.g., "February 1998." In some slots, the heading varies. For example, the heading used for references is usually "References" but the headings "Bibliography" or "Notes" are also used. This preliminary study has shown that, although this is the general structure of the articles in D-Lib Magazine, some articles may not have some of the slots--i.e, "Keywords," "Abstract," "References," or "Acknowledgments." The values for some of the slots remain constant while, for most of the slots, they vary from one article to the other. For example, in the "Author" and "Address" slots, different patterns have been noticed. Articles may be written by only one author, by two authors, or by more than two authors. Again, when more than one author is involved, they may have the same or different addresses. Articles also differ in terms of layout for writing the authors' names and addresses: while in most cases they appear vertically, one after the other, in some cases they appear horizontally, one after the other. The general structure of these slots is shown in Figure 13. [Figure 13 ILLUSTRATION OMITTED] Similarly, the contents of the "References" slot vary significantly depending on the type of item referred to. Some articles have only print versions while others are available only in electronic form. Therefore, the citations relating to the sources vary: there may be journal title, volume number, and so on or there may only be the URL or both the bibliographic details and the URL. Although the general pattern of citations as they appear at the end of each article appears to be quite simple (see Figure 14), and therefore amenable to template mining, a closer look at the references indicates that there are a number of irregularities there. For example, different authors use different citation styles and, sometimes, within the same article, authors follow different citation styles for similar types of material (see Figure 15). If these irregularities are sorted out, the template matching Template matching is a technique in Digital image processing for finding small parts of an image which match a template image. It can be used in manufacturing as a part of quality control,[1] a way to navigate a mobile robot,[2] technique can be used to extract relevant information, including URLs, from the citations. There are two ways to sort out these irregularities and to ensure a standard citation style. The first one could be to impose rigorous editorial practice, but that would be more expensive and time consuming, causing delays in publication. The second, and less expensive, approach may be to prepare templates for each type of citation and make them available online. Authors can download and make use of these templates for preparing the list of references. This will ensure a standard citation style. The same practice may be followed for the other parts of the article, and eventually the whole structure of the articles can be standardized, thereby facilitating the use of template mining. Figure 14. Simple Structure of References in the D-Lib Magazine Sample References from D-Lib Magazine [Chen et al., 1996] Hsinchun Chen, Chris Schuffels, and Rich Orwig, "Internet Categorization and Search: A Machine Learning Approach," Journal of Visual Communication and Image Representation, Special Issue on Digital Libraries, Volume 7, Number 1, Pages 88-102, 1996. [5] Harnad E., Print Archive and Psycoloquy and BBS (1) (Bulletin Board System) A computer system used as an information source and forum for a particular interest group. They were widely used in the U.S. Journal Archives <http://www.princeton.edu/~harnad> [11] Trudi Bellardo Hahn, Text Retrieval Online: Historical Perspective on Web Search Engines, pp. 7-10, Bulletin of the American Society for Information Science, April/May, 1998. Figure 15. Sample References from Different Issues of D-Lib Magazine Showing the Lack of Standards in the Citation Style. [ieeePC97] Bacon J, Bates Bates , Katherine Lee 1859-1929. American educator and writer best known for her poem "America the Beautiful," written in 1893 and revised in 1904 and 1911. J and Halls D, Location oriented multimedia, IEEE (Institute of Electrical and Electronics Engineers, New York, www.ieee.org) A membership organization that includes engineers, scientists and students in electronics and allied fields. Personal Communications 4(5), pp 48-57, October 1997. [Phelps and Wilensky, 1996b] Thomas A. Phelps and Robert Wilensky, "Multivalent multivalent /mul·ti·va·lent/ (-val´ent) 1. having the power of combining with three or more univalent atoms. 2. active against several strains of an organism. Documents: Inducing Structure and Behaviors in Online Digital Documents", Proceedings of Hawaii International Conference on System Sciences The Hawaii International Conference on System Sciences (HICSS) is an annual conference for Information Systems and Information Technology academics and professionals sponsored by the University of Hawaii at Manoa. '96 (Best Paper Award, Digital Documents Track). [Salton, 1989]Gerard Salton Gerard Salton (8 March, 1927 in Nuremberg - 28 August, 1995) was a Professor of Computer Science at Cornell University. Salton was perhaps the leading computer scientist working in the field of information retrieval during his time. , Automatic Text Processing, Addison-Wesley, 1989. [Weiss et al., 1996]Ron Weiss, David Gifford et al., "HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext hypertext, technique for organizing computer databases or documents to facilitate the nonsequential retrieval of information. Related pieces of information are connected by preestablished or user-created links that allow a user to follow associative trails across the Clustering", Proceedings of the Seventh ACM (Association for Computing Machinery, New York, www.acm.org) A membership organization founded in 1947 dedicated to advancing the arts and sciences of information processing. In addition to awards and publications, ACM also maintains special interest groups (SIGs) in the computer field. Conference on Hypertext, March 1996, Washington, DC. [Weiss and Indurkhya, 1993] S. Weiss and N. Indurkhya, "Optimized rule induction Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data. ," IEEE Expert 8, 6, 61-69. [Yahoo]Yahoo [http://www.yahoo.com/] [Zhu et al., 1997] Quan Zhu et al., "Searching for Parts and Services on the Web", Proceedings of International Symposium on Research, Development, and Practice in Digital Libraries, Nov. 18 - 21, 1997, Tsukuba, Japan. [1] Tom Sanville of the OhioLINK consortium has noted in public presentations a correlation between ease of access and use and the amount of use. [2] TULIP: The University Licensing Program <http://www.elsevier.nl/locate/ tulip> [3] Red Sage Red sage either refer to:
[5] Harnad E. Print Archive and Psycoloquy and BBS Journal Archives <http://www.princeton.edu/~harnad> [9] JSTOR JSTOR Journal Storage http://www.Ostor.org [Arms 1995] R. Kahn and R. Wilensky, A Framework for Distributed Digital Object Services, (May 1995). [Bearman 1998] D. Bearman and J. Trant, Authenticity of Digital Resources:Towards a Statement of Requirements in the Research Process, DLIB DLIB Digital Library on-line magazine, (June 1998). C. Lynch et al., A White Paper on Authentication (1) Verifying the integrity of a transmitted message. See message integrity, e-mail authentication and MAC. (2) Verifying the identity of a user logging into a network. and Access Management Issues in Cross-organizational Use of Networked Information Resources, (Note: The underlines in the last three references indicate that they are hyperlinked to the respective URLs) AUTOMATIC EXTRACTION OF INFORMATION FROM NEWS ITEMS IN ELECTRONIC JOURNALS Electronic journals often contain news items, and important information can be extracted for the creation of databases or for any other use by a simple template mining approach. A scan through the pages of the nine issues of D-Lib Magazine revealed that there is a section called "Goings on" that provides information on conferences/seminars/workshops, and so on under the heading "Clips & Pointers." A scan through the items appearing under the heading "Goings on" in all the 1998 issues of D-Lib Magazine revealed that the seminar/conference/workshop announcements follow a general pattern as shown in Figure 16. This shows that a simple template can extract information about the various forthcoming seminars, conferences, and so on. Particular information, such as the place, date, Web address, and so on, can also be extracted by such templates. However, once again such a template mining approach calls for a standard format and layout. Templates can also be generated to extract further information such as specific topics, contact address, deadlines, and so on. [Figure 16 ILLUSTRATION OMITTED] AUTOMATIC IDENTIFICATION OF FUNDING/SPONSORING AGENCIES FOR RESEARCH A scan through the "Acknowledgment acknowledgment, in law, formal declaration or admission by a person who executed an instrument (e.g., a will or a deed) that the instrument is his. The acknowledgment is made before a court, a notary public, or any other authorized person. " section in the articles revealed that they contain information about the funding/sponsoring agency's name, address, grant number, and so on. A study of the various articles appearing in the electronic journals can help generate a pattern, and thereby appropriate templates, that will be able to extract the relevant information for further use. Again, in order to standardize the practice of providing these items of information, the editorial board of the journal can make appropriate templates available to the authors. INFORMATION EXTRACTION USING METADATA AND TEMPLATE MINING Metadata are data about data. However, this definition is too simple and does not tell us their characteristic features. A better definition has been provided by Dempsey and Heery (1998) according to whom metadata are data associated with objects which relieve their potential users of having to have full advance knowledge of the existence of characteristics. Younger (1997) defined metadata as documentation about documents and objects; they describe resources, indicate where they are located, and outline what is required in order to use them successfully. There are several metadata schemes created by library and information professionals over the years, the Years, The the seven decades of Eleanor Pargiter’s life. [Br. Lit.: Benét, 1109] See : Time most prominent ones being the MARC formats, the AACR AACR American Association for Cancer Research AACR Anglo-American Cataloging Rules AACR Australasian Association of Cancer Registries AACR African Armed Conflicts Resolved 2 catalog formats, subject headings lists (such as the LCSH LCSH Library of Congress Subject Headings LCSH Lee County Senior High (Sanford, NC, USA) ), and classification schemes such as LC, DDC See VESA DDC. , UDC UDC abbr. universal decimal system UDC (Brit) n abbr (= Urban District Council) → Stadtverwaltung f , and so on. Each of these schemes is constructed by experts in the field from an understanding of the specific domains, information resource needs, and the requirements for describing documents. While these schemes have been used for bibliographic access and control for decades, there remains the question of how to catalog and index materials available on the Internet using these schemes. This has given rise to a thought that electronic documents need to be self-indexed (as opposed to the assignment of cataloging and indexing tags and value added Value Added The enhancement a company gives its product or service before offering the product to customers. Notes: This can either increase the products price or value. by cataloging and indexing agencies or library staff). However, it is obvious that, in order for the documents to be self-indexed, a core set of metadata elements must be identified, and each creator of electronic documents should be able to implement it in the record that he creates. With this objective, a simple resource description set of data has emerged--the Dublin Core A set of meta-data descriptions about resources on the Internet. Used for resource discovery, it contains data elements such as title, creator, subject, description, date, type, format and so on. Dublin Core descriptions are often included in HTML meta tags. (http://purl.org/metadata/ dublin_core). The Dublin Core metadata set prescribes fifteen elements, namely (http://www.lub.lu.se/cgi-bin/nmdc.pl): title, creator, subject (keywords, controlled vocabulary Controlled vocabularies are used in subject indexing schemes, subject headings, thesauri and taxonomies. Controlled vocabulary schemes mandate the uses of predefined, authorised terms that have been preselected by the designer of the controlled vocabulary as opposed to natural , and classification), description (abstract and content description), publisher, contributor (other than the creator), date, type (category of the resource), format (HTML, Postscript, etc.), identifier (URL, string or number used to identify the resource), source (from which this resource is derived), language, relation (with other resources), coverage (spatial and/or temporal characteristics of the resources), and rights (link to a copyright notice, etc.). Weibel (1995,1996) suggested that, in order to enable information creators to apply metadata, a mechanism for embedding 1. (mathematics) embedding - One instance of some mathematical object contained with in another instance, e.g. a group which is a subgroup. 2. (theory) embedding - (domain theory) A complete partial order F in [X -> Y] is an embedding if the data within HTML documents had to be established. The Dublin Core looks at one aspect of metadata--i.e., the simple description--but, as Dempsey and Heery (1998) suggested, there is a need for more complex description for particular specialist domains. In 1996, there was a conference organized by UKOLN UKOLN United Kingdom Office for Library and Information Networking and OCLC OCLC - Online Computer Library Center to examine the various metadata issues including the Dublin Core. This meeting gave rise to a proposal, the Warwick Framework proposal (Lagoze, 1996) (named after the place of the conference), calling for an architecture for the interchange of various metadata packages. Metadata are an important tool for resource discovery from digital documents. They not only help users locate the required information resources, they also help in the examination and selection (or rejection) of the retrieved items. Various fields, such as subject, description, language, sources, and so on, can provide necessary information for examining the relevance of the retrieved resources. Web search engines, as an aid during the examination phase, generally construct an information surrogate to display the search hits. These surrogates generally consist of the URL, title of the Web page, and some summary text that is derived with the aid of some heuristics heu·ris·tic adj. 1. Of or relating to a usually speculative formulation serving as a guide in the investigation or solution of a problem: (Lagoze, 1997). Lagoze (1997) argued that, recognizing the limitations of the current search engines, researchers are now actively pursuing both standards for descriptive surrogates for networked objects and methods for associating surrogates with those objects. Such studies aim at developing a number of surrogate templates that would facilitate the resource discovery process. The template mining approach can be used effectively for the resource discovery process. The metadata tags, such as subject and description in the Dublin Core metadata format, for example, can be extended and be made more structured by specifying various templates. The following section briefly describes how this can be achieved. USE OF THE PRINCIPLES OF PRE-COORDINATE INDEXING SYSTEMS IN BUILDING SUBJECT TEMPLATES Information retrieval systems have used two different types of indexing systems, namely, pre-coordinate indexing and post-coordinate indexing (for detailed discussions, see Lancaster, 1998; Foskett, 1996). Precoordinate indexing systems--classification schemes like the Dewey Decimal Classification Dewey Decimal Classification or Dewey Decimal System System for organizing the contents of a library based on the division of all knowledge into 10 groups. Each group is assigned 100 numbers. (DDC) (Dewey, 1996), Universal Decimal Classification The Universal Decimal Classification is a system of library classification developed by the Belgian bibliographers Paul Otlet and Henri la Fontaine at the end of the 19th century. It is based on the Dewey Decimal Classification, but is much more powerful. (UDC, 1985), and Colon Classification Colon classification (CC) is a system of library classification developed by S. R. Ranganathan. It was the first ever faceted (or analytico-synthetic) classification. It is especially used in libraries in India. (CC) (Ranganathan, 1965), and so on that use artificial notations or subject indexing Subject indexing is the act of describing a document by index terms to indicate what the document is about or to summarize its content. The index terms are often selected from some form of controlled vocabulary. systems such as relational indexing (Farradane, 1980a, 1980b), PRECIS (Austin & Dykstra, 1984), and POPSI POPSI Postulate Based Permuted Subject Indexing (Bhattacharyya, 1981) that use natural language terms --represent the content of an information resource by synthesizing the various components of the subject and organizing these in a specific order. On the contrary, post-coordinate indexing systems do not rely on the a priori a priori In epistemology, knowledge that is independent of all particular experiences, as opposed to a posteriori (or empirical) knowledge, which derives from experience. relations or organization of the constituent search terms; rather, retrieval is performed by searching through each individual term and then the output is generated based on the coordination of the terms at the retrieval stage; for example, according to the principles of set theory. Most of the modern day IR systems, including the search engines, follow the principles of post-coordinate indexing systems. However, it is proposed that, for resource discovery from digital libraries or from the Web, one can use the principles of pre-coordinate indexing systems. The following paragraph suggests a simple approach to this. The basic tenet TENET. Which he holds. There are two ways of stating the tenure in an action of waste. The averment is either in the tenet and the tenuit; it has a reference to the time of the waste done, and not to the time of bringing the action. 2. of this approach is that pre-coordinate indexing systems help indexers generate subject index entries that represent the subject matter of the document concerned. This approach has been successfully used in libraries for organizing library materials on the shelves based on classification numbers. These class numbers are created according to the principles of pre-coordinate indexing systems, and the notations represent the content of the document concerned. Similarly, this approach has been used in preparing alphabetical subject index entries for documents in national bibliographies and in other bibliographic databases For computer programs to manage an individual's bibliographic references, see Reference management software A bibliographic or library database is a database of bibliographic information. . One major drawback of these pre-coordinate indexing systems is that it is largely human dependent because human indexers need to analyze the documents and prepare the subject statements, which are then manipulated by computers for generating multiple entries or are used to prepare the class number. In other words Adv. 1. in other words - otherwise stated; "in other words, we are broke" put differently , this process involves a significant amount of human expertise and time and therefore is a slow and expensive process. As a result, adopting this approach is almost impossible with a large collection and is impossible in the Web environment with millions of documents. The aforementioned problem could be solved if this task was accomplished by the author or the generator of information resources. If authors can somehow indicate the key concepts treated in the documents, along with an indication of the appropriate categories where they belong, then subject statements or simple surrogates can be prepared automatically. A template can be provided to the generator of information resources--e.g., an author of an article or a report may fill in the various slots with appropriate information. Such templates can be generated using the various categories proposed in pre-coordinate indexing systems--e.g., the fundamental categories of Ranganathan (1967), the nine relational categories proposed by Farradane (see for discussion, Farradane, 1980a, 1980b; Chowdhury, 1989), the various role operators proposed in PRECIS by Austin (Austin & Dykstra, 1984), the various categories proposed by Bhattacharyya (1981) in POPSI, and so on. The author or the generator of the information resource should be able to understand the connotation con·no·ta·tion n. 1. The act or process of connoting. 2. a. An idea or meaning suggested by or associated with a word or thing: of each category and thus should be able to fill in the template according to the semantic content of the concerned information resource. For example, the author of a document entitled "Internet as a Tool for Management of OPACs in the Libraries in Singapore This is a list of libraries in Singapore: National libraries These libraries are managed by the National Library Board.
[Figure 17 ILLUSTRATION OMITTED] Such templates, prepared according to the principles of any pre-coordinate indexing system or a modified version of that, can be used both for better retrieval and for the preparation of document surrogates. Better retrieval can be achieved because of the semantic values attached to the terms; this would help reduce the false drops Search results that meet your search criteria but have nothing to do with the information you were trying to find. For example, you want biographical details about Grover Cleveland, but you get information about Cleveland, Ohio. See full-text search. . Document surrogates can be prepared automatically according to the prescribed rules of the concerned indexing system--e.g., according to the principle of generating the index entries in PRECIS (Austin & Dykstra, 1984). CONCLUSION The explosive growth in our capabilities to collect and store data over the past decades has given rise to a new field of study, called knowledge discovery in databases, that is concerned with the creation of a new generation of tools and techniques for automated and intelligent database analysis. Raghavan et al. (1998) suggest that KDD KDD Knowledge Discovery and Data Mining (International Conference) KDD Knowledge Discovery in Databases KDD Kokusai Denshin Denwa (Japan) KDD Key Distribution Device refers to the whole process in the path from data to knowledge and to use descriptive phrases for specific tasks in the process, such as pattern extraction methods, pattern evaluation methods, or data cleaning methods. Thus, simply speaking, KDD is the process of deriving useful knowledge from real-world databases through the application of pattern extraction techniques. The grand challenge of KDD is, therefore, to automatically process large quantities of raw data, identify the most significant and meaningful patterns, and present these as knowledge appropriate for achieving a user's goals. This discussion has proposed that template mining, which is based on pattern recognition and pattern matching 1. pattern matching - A function is defined to take arguments of a particular type, form or value. When applying the function to its actual arguments it is necessary to match the type, form or value of the actual arguments against the formal arguments in some definition. in natural language texts, can be used for extracting different kinds of information from digital documents or text databases. Citation databases can be built automatically based on the template mining approach, from digital documents, such as from articles in electronic journals. Template mining can also be used to extract different types of information, such as information about the funding/sponsoring agencies of research projects as they appear in the acknowledgments section of articles. Another application could be to identify inter-document links by tracing the hypertext links using the "http://...." template, and thus a network of articles in a specific domain can be built that would be useful for researchers in the subject concerned. This article has also indicated that template mining can be used to extract various items of news from digital documents such as in the extraction of conference information from electronic journals, and so on. This article has also indicated that the template mining approach can be incorporated within the metadata format in order to facilitate better information retrieval and to enable automatic generation of document surrogates that would help the end-user filter the search output. This is very necessary, particularly in the WWW environment where a given search may retrieve several thousands, even millions, of records. This could be achieved by using the concept categorization and organization principles of the pre-coordinate indexing systems. Such pre-coordinate systems will improve the quality of the output of search engines that basically follow the principles of post-coordinate indexing systems. In other words, while post-coordinate indexing principles used in the search engines will retrieve digital information resources, the principles of pre-coordinate indexing systems may be used to filter them. However, all the above-mentioned applications will be possible provided the digital information resources appear in standard format and layout. The first few applications of template mining mentioned above will require strict adherence to formats. This should not be too difficult, as authors are used to following author instructions issued by publishers/ editors of printed journals and publication houses. It should not be a big problem to implement the same in the electronic environment. The only requirement will be to formulate appropriate guidelines and, if possible, appropriate templates can be made available to the authors that could be used online while preparing the documents. For example, simple templates may be prepared for each type of publication, and authors may be required to just click on the type of document, such as journal article, conference paper, online sources, and so on, to get the appropriate slots to fill-in with data for author, title, source details, and so on. This would not only ensure consistency in the references but would also facilitate template mining applications. For the last application proposed above, much more work needs to be done. It may be incorporated as an element in the metadata format, or may be added as a required field in any digital document, or embedded Inserted into. See embedded system. in HTML. However, some experiments must be conducted in order to determine which pre-coordinate indexing system will be more appropriate and yet easy for the authors to use while preparing the digital information sources. This may be kept at a simple level just by creating slots for each category of the chosen pre-coordinate indexing system or may be made more complex by incorporating the phase relations proposed by Ranganathan (1967, 1987) as well. However, with the increasing complexity of the system, it may be more difficult, and therefore more inhibiting, for the authors who are required to fill in the slots in the templates. Nevertheless, this job has to be done by the generators of digital information resources, otherwise it may be too difficult and expensive for any agency to analyze each digital document and fill in the slots as done in traditional libraries for classifying and indexing materials. Initially, it may seem an extra burden to the authors but, as we need to follow instructions from editors and publishers and follow HTML and similar standards while preparing hard copy and/or digital documents, we may have to do this in order to make our generated information more widely and easily available to potential users. Eventually, all these additional activities may be incorporated within popular software and, as we can now use editors for creating HTML documents rather than coding them by hand, authors may just fill in the templates as part of their document creation task. Detailed experiments are currently underway and, upon successful completion of this research, we expect to develop a model for template mining from digital information resources. ACKNOWLEDGMENT The author gratefully acknowledges the encouragement and support obtained while writing this article from Michael F. Lynch, Professor Emeritus, Department of Information Studies, University of Sheffield, Sheffield, England. REFERENCES Ai, C. S.; Blower, P. E.; & Ledwith, R. H. (1990). Extraction of chemical reaction information from primary journal text. Journal of Chemical Information and Computer Sciences, 30(2), 163-169. Almind, T. C., & Ingwersen, P. (1998). Informetric analyses on the World Wide Web: Methodological approaches to "Webmetrics." Journal of Documentation, 54(4), 404-426. Andersen, P. M.; Hayes, P.J.; Huettner, A. K.; Schmandt, L. M.; Nirenburg, I. B.; & Weinstein, S. P. (1992). Automatic extraction of facts from press releases to generate news stories. In Third Conference on Applied Natural Language Processing (31 March-3 April, 1992, Trento, Italy) (pp. 170-177). Morristown, NJ: Association of Computational Linguistics computational linguistics (CL) Use of digital computers in linguistics research. The simplest examples are the use of computers to scan text and produce such aids as word lists, frequency counts, and concordances. . Andersen, P. M., & Huettner, A. K. (1994). Knowledge engineering for the JASPER fact extraction system. Integrated Computer-Aided Engineering See CAE. Computer-aided engineering Any use of computer software to solve engineering problems. With the improvement of graphics displays, engineering workstations, and graphics standards, computer-aided engineering (CAE) has come to mean the computer , 1 (6), 473493. Austin, D., & Dykstra, M. (1984). PRECIS: A manual of concept analysis and subject indexing (2d ed.). London, England: British Library. Baralis, E., & Psaila, G. (1997). Designing templates for mining association rules. Journal of Intelligent Information Systems, 9(1), 7-32. Bhattacharyya G. (1981). Some significant results of current classification research in India. International Forum on Information and Documentation, 6(1), 11-18. The British Library Research and Innovation Centre. (1998). The British Library Digital Library Programme. Retrieved December 16, 1998 from the World Wide Web: http://www.bl.uk/services/ric/diglib/digilib.html. Chong, W., & Goh, A. (1997). FIES: Financial information extraction system. Information Services See Information Systems. & Use, 17(4), 215-223. Chowdhury, G. G. (1989). Nature and applicability of Farradane's relational analysis with particular reference to the preparation of linear index entries. Unpublished doctoral dissertation, Jadavpur University Jadavpur University (Bengali: যাদবপুর বিশ্ববিদ্যালয়) (JU) is a premier educational and research institution in India. , Calcutta, India. Chowdhury, G. G. (1992). Application of natural language processing to chemical patents. Unpublished doctoral dissertation, University of Sheffield, Sheffield, England. Chowdhury, G. G. (1995). PRECIS: A workbook work·book n. 1. A booklet containing problems and exercises that a student may work directly on the pages. 2. A manual containing operating instructions, as for an appliance or machine. 3. . Calcutta, India: IASLIC. Chowdhury, G. G., & Lynch, M. F. (1992a). Automatic interpretation of the texts of chemical patent abstracts. Part 1: Lexical analysis (programming) lexical analysis - (Or "linear analysis", "scanning") The first stage of processing a language. The stream of characters making up the source program or other input is read one at a time and grouped into lexemes (or "tokens") - word-like pieces such as keywords, and categorization. Journal of Chemical Information and Computer Sciences, 32(5), 463-467. Chowdhury, G. G., & Lynch, M. F. (1992b). Automatic interpretation of the texts of chemical patent abstracts. Part 2: Processing and results. Journal of Chemical Information and Computer Sciences, 32(5), 468-473. Coates-Stephens, S. (1992). The analysis and acquisition of proper names for robust text understanding. Unpublished doctoral dissertation, City University, London City University London is a British university based at Northampton Square in Finsbury, London (). Its official name is The City University.[3] The University is famous for its excellent graduate employment records. , England. Costantino, M.; Morgan, R. G.; & Collingham, R. J. (1996). Financial information extraction using pre-defined user-definable templates in the LOLITA system. Journal of Computing & Information Technology, 4(4), 241-255. Cowie, J., & Lehnert, W. (1996). Information extraction. Communications of the ACM (publication) Communications of the ACM - (CACM) A monthly publication by the Association for Computing Machinery sent to all members. CACM is an influential publication that keeps computer science professionals up to date on developments. , 39(1), 80-91. Croft, W. B. (1995). What do people want from information retrieval?: The top 10 research issues for companies that use and sell IR systems. D-Lib Magazine. Retrieved December 7, 1998 from the World Wide Web: http://www.lib.org/dlib/november95/11croft.html. De Jong, G. (1982). An overview of the FRUMP system. In W. Lehnert & M. H. Ringle (Eds.), Strategies for natural language processing (pp. 149-176). Hillsdale, NJ: Lawrence Erlbaum Associates. Dempsey, L., & Heery, R. (1998). Metadata: A current view of practice and issues. Journal of Documentation, 54(2), 145-172. Dewey, M. (1996). Dewey decimal classification and relative index. Albany, NY: Forest Press. Farradane, J. E. L. (1980a). Relational indexing. Part 1. Journal of Information Science, 1(5), 267-276. Farradane, J. E. L (1980b). Relational indexing. Part 2. Journal of Information Science, 1(6), 313-324. Foskett, A. C. (1996). The subject approach to information (5th ed.). London, England: Library Association Publishing. Gaizauskas, R., & Wilks, Y. (1998). Information extraction: Beyond document retrieval The ability to search for documents by keywords and other attributes such as date and author. It implies that the documents have been indexed on all pertinent fields and that keywords have been chosen based upon title and textual content. See document imaging and document management system. . Journal of Documentation, 54(1), 70-105. Gaizauskas, R., & Humphreys, K. (1997). Using a semantic network (data) semantic network - A graph consisting of nodes that represent physical or conceptual objects and arcs that describe the relationship between the nodes, resulting in something like a data flow diagram. for information extraction. Natural Language Engineering, 3(2/3), 147-169. Guarino, N. (1997). Semantic matching: Formal ontological distinctions for information organization, extraction, and integration. In M. T. Pazienza (Ed.), Information extraction: A multidiciplinary approach to an emerging information technology international summer school (No. 1299: Lecture Notes in Artificial Intelligence) (pp. 139-168). New York New York, state, United States New York, Middle Atlantic state of the United States. It is bordered by Vermont, Massachusetts, Connecticut, and the Atlantic Ocean (E), New Jersey and Pennsylvania (S), Lakes Erie and Ontario and the Canadian province of : Springer springer a North American term commonly used to describe heifers close to term with their first calf. . Ingwersen, P., & Hjortgaard C. F. (1997). Data set isolation for bibliometric online analysis of research publication: Fundamental methodological issues. Journal of the American Society for Information Science, 48(3), 205-217. Jacobs, P., & Rau, L. F. (1990). SCISOR: Extracting information from on-line news. Communications of the ACM, 33(11), 88-97. Jones, P. A., & Paice, C. D. (1992). A "select and generate" approach to automatic abstracting. In T. McEnery & C. D. Paice (Eds.), Proceedings of the BCS (1) (The British Computer Society, Swindon, Wiltshire, England, www.bcs.org) The chartered body for information technology professionals in the U.K., founded in 1957. 14th Information Retrieval Colloquium col·lo·qui·um n. pl. col·lo·qui·ums or col·lo·qui·a 1. An informal meeting for the exchange of views. 2. An academic seminar on a broad field of study, usually led by a different lecturer at each meeting. (pp. 141-154). Berlin, Germany: Springer-Verlag. Kemp, N. M. (1995). The application of natural language processing to chemical patents. Unpublished doctoral dissertation, University of Sheffield, Sheffield, England. Lagoze, C. (1996). The Warwick Framework: A container architecture for diverse sets of metadata. D-Lib Magazine. Retrieved December 7, 1998 from the World Wide Web: http://www.dlib.org/dlib/july96/lagoze/07lagoze.html Lagoze, C. (1997). From static to dynamic surrogates: Resource discovery in the digital age. D-Lib Magazine. Retrieved December 7, 1998 from the World Wide Web: http://www.lib.org/dlib/june97/06lagoze.html Lancaster, F. W. (1998). Indexing and abstracting in theory and practice (2d ed.). Urbana-Champaign: Graduate School of Library and Information Science A School of Library and Information Science (SLIS) is a university-based institution that provides a Master's degree or other advanced degrees associated with Library science, Information Science, or a combination of the two. , University of Illinois University of Illinois may refer to:
Lawson, M.; Kemp, N.; Lynch, M. F.; & Chowdhury, G. G. (1996). Automatic extraction of citations from the text of English language English language, member of the West Germanic group of the Germanic subfamily of the Indo-European family of languages (see Germanic languages). Spoken by about 470 million people throughout the world, English is the official language of about 45 nations. patents: An example of template mining. Journal of Information Science, 22(6), 423-436. Lytinen, S. L., & Gershman, A. (1986). ATRANS: Automatic processing of money transfer messages. In AAAI AAAI American Association for Artificial Intelligence AAAI Association for the Advancement of Artificial Intelligence (Menlo Park, California) AAAI American Academy of Allergy, Asthma, and Immunology '86 (The Fifth National Conference on Artificial Intelligence, August 11-15, 1986, Philadelphia, Pennsylvania) (pp. 1089-1093). Los Altos Los Altos (lôs ăl`tōs, lŏs), residential city (1990 pop. 26,303), Santa Clara co., W Calif.; inc. 1952. There is diversified light manufacturing. , CA: Morgan Kaufmann. Paice, C. D. (1981). The automatic generation of literature abstracts: An approach based on the identification of self-indicating phrases. In R. N. Oddy, S. E. Robertson, C. J. Van Rijsbergen C. J. "Keith" van Rijsbergen (Cornelis Joost van Rijsbergen) (born 1943) is a professor of computer science and the leader of the Glasgow Information Retrieval Group based at the University of Glasgow. , & P. W. Williams (Eds.), Information retrieval research (pp. 172-191). London, England: Butterworths. Paice, C. D., & Husk, G.D. (1987). Towards the automatic recognition of anaphoric a·naph·o·ra n. 1. The deliberate repetition of a word or phrase at the beginning of several successive verses, clauses, or paragraphs; for example, features in English text: The impersonal pronoun pronoun, in English, the part of speech used as a substitute for an antecedent noun that is clearly understood, and with which it agrees in person, number, and gender. "it." Computer Speech and Language, 2(2), 109-132. Postma, G. J., & Kateman, G. (1993). A systematic representation of analytical chemical actions, journal of Chemical Information and Computer Sciences, 33(3), 350-368. Postma, G. J.; van der Linden Linden, city, United States Linden, city (1990 pop. 36,701), Union co., NE N.J., in the New York metropolitan area; inc. 1925. During the first half of the 20th cent. , J. R.; Smits, J. R. M.; & Kateman, G. (1990a). TICA: A system for the extraction of data from analytical chemical texts. Chemometrics & Intelligent Laboratory Systems, 9(1), 65-74. Postma, G. J; van der Linden, J. R.; Smits, J. R. M.; & Kateman, G. (1990b). TICA: A system of extraction of analytical chemical information from texts. In E.J. Karjalainen J. Karjalainen is a Finnish singer-songwriter. (Ed.), Scientific computing and automation (Europe) (pp. 407-414). Amsterdam: Elsevier. Raghavan, V. V.; Deogun, J. S.; & Sever TO SEVER, practice. When defendants who are sued jointly have separate defences, they may in general sever, that is, each one rely on his own separate defence; each may plead severally and insist on his own separate plea. See Severance. , H. (1998). Introduction (to the special issue on knowledge discovery and data mining). Journal of the American Society for Information Science, 49(5), 397-402. Ranganathan, S. R. (1987). Colon classification (7th ed.). Bangalore, India: Sarada Ranganathan Endowment for Library Science. Ranganathan, S. R. (1967). Prolegomena to library classification (3d ed.). Bangalore, India: Sarada Ranganathan Endowment for Library Science. Reich, V., & Winograd, T. Working assumptions about the digital library (Stanford Digital Library Working Paper Feb. 23, 1995). Retrieved December 16, 1998 from the World Wide Web: http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1995-0006. Sager, N. (1981). Natural language information processing information processing: see data processing. information processing Acquisition, recording, organization, retrieval, display, and dissemination of information. Today the term usually refers to computer-based operations. : A computer grammar of English and its applications. Reading, MA: Addison-Wesley. Sasaki, Y. (1998). Learning of information extraction rules using ILP-programming report. In PADD PADD Personal Access Display Device (Star Trek) PADD Person Authorized to Direct Disposition (of human remains) PADD Parallel and Distributed Databases PADD Protection and Advocacy for Developmental Disabilities 98 (Proceedings of the Second International Conference on the Practical Application of Knowledge Discovery and Data Mining, London, 25-27 March 1998) (pp. 195-205). Blackpool, England: Practical Application. Schank, R. C. (1975). Conceptual information processing. Amsterdam: North-Holland. Schank, R. C., & Abelson, R.P. (1977). Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Hillsdale, NJ: Lawrence Erlbaum. Schank, R. C., & Riesbeck, C.K. (1981). Inside computer understanding: Five programs plus miniatures. Hillsdale, NJ: Lawrence Erlbaum. Shuldberg, H. K; Macpherson, M.; Humphrey, P.; & Corely, J. (1993). Distilling information from text: The EDS template filler system. Journal of the American Society for Information Science, 44(9), 493-507. Stanley, T. (1998). Ask Jeeves: The knowledge management search engine. Ariadne, Vol. 17. Retrieved December 11, 1998 from the World Wide Web: http:// www.ariadne.ac.uk/issue17/search-engines/intro.html UDC: BS 1000 International Medium Edition. English text. (1985). London, England: British Standards British Standards are the national standards of the UK. The standards body which produces them is BSI British Standards, a division of BSI Group. It is incorporated under a Royal Charter and is formally designated as the National Standards Body (NSB) for the UK. Institution. Vickery, B. (1997). Knowledge discovery from databases: An introductory review. Journal of Documentation, 53(2), 107-122. Wakao, T.; Gaizauskas, R.; & Wilks, Y. (1996). Evaluation of an algorithm for the recognition and classification of proper names. In COLING COLING Computational Linguistics '96 (The 16th International Conference on Computational Linguistics, August 5-9, 1996, Copenhagen, Denmark) (pp. 418-423). Copenhagen, Denmark: Center for Sprogteknologi. Weibel, S. (1995). Metadata: The foundations of resource description. D-lib Magazine. Retrieved December 11, 1998 from the World Wide Web: http://www.lib.org/dlib/July95/07Weibel.html. Weibel, S. (1996). A proposed convention for embedding metadata in HTML. Retrieved December 10, 1998 from the World Wide Web: http://www.clc.org/-weibel/htmlmeta.html. Younger, J. A. (1997). Resources description in the digital age. Library Trends, 45(3), 462- 487. Zamora, E., & Blower, P. E. (1984a). Extraction of chemical reaction information from primary journal text using computational linguistic techniques: 1. Lexical and syntactic Dealing with language rules (syntax). See syntax. phases. Journal of Chemical Information and Computer Sciences, 24(3), 176-181. Zamora, E., & Blower, P. E. (1984b). Extraction of chemical reaction information from primary journal text using computational linguistic techniques: 2. Semantic phase. Journal of Chemical Information and Computer Sciences, 24(3), 181-188. Gobinda G. Chowdhury, Division of Information Studies, School of Applied Science, Nanyang Technological University Nanyang Technological University (Abbreviation: NTU) is a major research university in Singapore. The University's garden campus, known as the Yunnan Garden campus is in the southwestern part of Singapore. , N4#2a--32, Nanyang Avenue, Singapore 639798 GOBINDA G. CHOWDHURY is a Senior Lecturer senior lecturer n. Chiefly British A university teacher, especially one ranking next below a reader. in the Division of Information Studies, School of Applied Science, at Nanyang Technological University, Singapore, where he teaches and guides research in different areas of information retrieval, knowledge organization, information sources and searching, and digital libraries. He has been teaching library and information science for the past thirteen years and has taught at various universities in India India has both private and public universities. India has a large number of universities supported by the Government of India and the state governments. Apart from these there are private universities supported by various bodies and societies also exist. , England, Africa, and Singapore. He has wide international teaching and research experience and over fifty publications that include four books Four Books Chinese Sishu Ancient Confucian texts used as the basis of study for civil service examinations (see Chinese examination system) in China (1313–1905). and a number of research papers that appeared in a number of reputed journals, seminars, and conferences. |
|
||||||||||||||||||

Printer friendly
Cite/link
Email
Feedback
Reader Opinion