The LAURIN Interface Suite: a software package for newspaper clipping archives.
The LAURIN Interface Suite was developed during the R&D project LAURIN (1998-2000) (Muhlberger, 2000) and enhanced and completed in the follow-up project LAURIN+ (2000-2002). As the major part of the LAURIN System (Calvanese et al., 2001), the LAURIN Interface Suite has been used by the electronic clipping archive of the Innsbrucker Zeitungsarchiv since 1999 (cf. http://iza.uibk.ac.at). The aim of the LAURIN project was to create a software package for clipping archives which would allow them to digitise the clipping, indexing, storing, and retrieval of the archived material entirely. The LAURIN System comprises several components: an acquisition tool for scanning newspaper pages and clipping articles; the LAURIN Database, which holds all textual data of the system; the LAURIN Thesaurus, a multilingual, standards-compliant thesaurus (Retti & Stehno, 2003); and the LAURIN Interface Suite, which is made up of several tools for managing the data and for indexing and retrieval. This paper focuses on the tools of the LAURIN Interface Suite.
2. Architecture and Platforms
From a technical point of view, the LAURIN System can be regarded as consisting of four parts: a data repository, i.e. a relational database (Retti 2003); an image repository, i.e. a hierarchical file system holding files with image data and OCR text; a scanning and clipping station for data acquisition; and an application server providing all the tools needed to work with the clipping archive's data. The database has been implemented in Oracle, although it should be rather easy to migrate to an Open Source equivalent like PostgreSQL. Owing to this architecture, the database server can be separated from the image repository as well as from the application server. The application server is web-based, so any current web browser may be used with the system, regardless of the operating system of the client computer. The LAURIN Interface Suite is written in Perl (Wall et al., 1996) and runs under the Apache HTTP Server (cf. http://httpd.apache.org). Linux is used as the development environment; the production environment at the University of Innsbruck runs partly under Sun Solaris.
3. Indexing Workflow and Indexing Tools
During the analysis conducted in the LAURIN project (Habitzel & Retti, 1999), two major steps in the indexing process were isolated and described in detail: bibliographic indexing and content indexing. Bibliographic indexing refers to the exact description of the object in question, i.e. the clipped article, and the required metadata can usually be found on the object itself. The only exception may be the author's name, which, depending on the newspaper or journal as well as on national traditions, may appear in an abbreviated form or as a pseudonym. Content indexing, on the other hand, should be understood as a description of the content of the article by determining its text type, e.g. "news: breaking news" or "opinion: editorial", and by associating the article with keywords from the LAURIN Thesaurus (Retti, et al. 2000; Retti & Stehno 2003).
Obviously the skills required to perform these two different tasks are not the same. Therefore, two different tools have been designed and implemented to complete the two different steps of indexing: LBIX, the "LAURIN Bibliographic Indexing Tool", is a front-end to display, check, change, and commit the articles' data one record after the other.
[FIGURE 1 OMITTED]
LBIX, article and author editor
LCIX, the "LAURIN Content Indexing Tool", includes a search facility to retrieve keywords from the LAURIN Thesaurus and a thesaurus browser to select and associate the correct thesaurus entries for indexing. Both tools provide a revision component for corrections at a later step.
[FIGURE 2 OMITTED]
LCIX, thesaurus browser
New indexing terms may be added to the thesaurus during the process of content indexing. Those new entries are to be checked, corrected, accepted, or rejected later on during the thesaurus maintenance workflow. It should be noted that from within LBIX and LCIX the scanned images of the current article are always accessible as a one- or multi-page PDF file, which is created on the fly from the images stored in TIFF format.
4. Information Retrieval
The "LAURIN Search Interface" (LAUS) is used to search for articles in the digital archive. The basic search feature uses a word index, which is built up from the thesaurus entries associated with the articles. This approach fits the behaviour of the casual web user, who is used to entering a term into one field and hitting the search button or the enter key. A scored list of thesaurus entries is then presented to the user, and a click on one of the list items displays the articles referenced by that item.
[FIGURE 3 OMITTED]
LAUS, search result with scored list
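The basic search described above can be illustrated with a small sketch. It is not the LAUS implementation (which is written in Perl); the sample entries, the article ids, and in particular the scoring rule are assumptions made for illustration, since the text does not say how scores are computed.

```python
from collections import defaultdict

# Hypothetical data: thesaurus entry name -> ids of articles indexed with it.
ENTRY_ARTICLES = {
    "European Union": [101, 102, 107],
    "European Parliament": [102],
    "Tyrol": [103, 104],
}


def build_word_index(entry_articles):
    """Map each lower-cased word to the thesaurus entries containing it."""
    index = defaultdict(set)
    for entry in entry_articles:
        for word in entry.lower().split():
            index[word].add(entry)
    return index


def search(term, index):
    """Return (entry, score) pairs, best match first.

    The score is an assumption: the share of the entry's words that the
    query matches, so entries matched more completely rank higher.
    """
    query_words = set(term.lower().split())
    hits = set()
    for word in query_words:
        hits |= index.get(word, set())
    scored = []
    for entry in hits:
        entry_words = set(entry.lower().split())
        score = len(entry_words & query_words) / len(entry_words)
        scored.append((entry, score))
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


index = build_word_index(ENTRY_ARTICLES)
results = search("european", index)

# "Clicking" a list item corresponds to looking up its articles.
articles_for_first_hit = ENTRY_ARTICLES[results[0][0]]
```

Searching the thesaurus entries first, and resolving articles only after the user picks an entry, keeps the index small and steers the user towards controlled vocabulary.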
LAUS uses frames to display the basic search form, the list of thesaurus entries, and the list of articles. Furthermore, an additional frame at the bottom of the browser window shows a thesaurus browser for the user to navigate through the data. The clipped articles can be downloaded as PDF files, if the copyright allows the archive to deliver them, or they may be ordered as print-outs through a simple interface resembling an online shop. The question of copyright was addressed during the LAURIN Project, but it was impossible to find a general solution. Therefore, the Innsbrucker Zeitungsarchiv took a different approach and asked the newspaper publishers for permission. Fortunately, as a non-profit research institution, it was granted the right to deliver the articles electronically without charge by almost half of the publishers asked.
[FIGURE 4 OMITTED]
LAUS, search result with articles and thesaurus browser
Searches may also be performed through an advanced search form, which includes popup lists of the journals and newspapers covered by the archive, rubrics in these periodicals, text types, and time ranges in months. These lists may be combined with a request for a specific author or a generic search term. The latter will be looked up in the word index mentioned above as well as in an additional word index, which is built from the data in the title, sub-title, and abstract fields of the article, thus providing a restricted full-text retrieval. Full-text search on the OCR text of the articles is currently on the agenda for future developments.
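How the criteria of the advanced search form combine can be sketched as follows. This is a simplified illustration, not the LAUS code: the `Article` record, its field names, and the sample data are hypothetical, and the generic term is checked directly against the words of each article rather than through the two precomputed word indexes described above.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical article record; field names are illustrative only."""
    periodical: str
    rubric: str
    text_type: str
    month: str          # "YYYY-MM"
    author: str
    title: str
    subtitle: str = ""
    abstract: str = ""
    keywords: list = field(default_factory=list)  # thesaurus entries


def words(*texts):
    """Lower-cased set of words from the given text fields."""
    return {w for text in texts for w in text.lower().split()}


def matches(article, periodical=None, rubric=None, text_type=None,
            month_from=None, month_to=None, author=None, term=None):
    """True if the article satisfies every criterion that was supplied."""
    if periodical and article.periodical != periodical:
        return False
    if rubric and article.rubric != rubric:
        return False
    if text_type and article.text_type != text_type:
        return False
    # "YYYY-MM" strings compare chronologically as plain strings.
    if month_from and article.month < month_from:
        return False
    if month_to and article.month > month_to:
        return False
    if author and author.lower() not in article.author.lower():
        return False
    if term:
        # Stand-ins for the thesaurus word index and the title/abstract index.
        keyword_words = words(*article.keywords)
        text_words = words(article.title, article.subtitle, article.abstract)
        if term.lower() not in keyword_words | text_words:
            return False
    return True


ARTICLES = [
    Article("Daily Example", "Culture", "opinion: editorial", "2001-03",
            "A. Example", "Theatre season opens", keywords=["Innsbruck"]),
    Article("Daily Example", "Politics", "news: breaking news", "2001-05",
            "B. Sample", "Budget debate",
            abstract="Parliament votes on budget",
            keywords=["European Union"]),
]

found = [a.title for a in ARTICLES
         if matches(a, periodical="Daily Example",
                    month_from="2001-04", term="budget")]
```

Criteria that are left empty are simply ignored, which mirrors a form where every popup list is optional.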
5. Thesaurus Maintenance
The LAURIN Thesaurus complies with the relevant standards for mono- and multilingual thesauri (ISO 2788, 1986; ISO 5964, 1985). Thesaurus entries are modelled on the well-known linguistic sign model (Saussure, 1967), differentiating expression and meaning as concept and name (Retti & Stehno, 2003). The LAURIN Thesaurus has been filled with data from the Getty Thesaurus of Geographic Names™ (Harpring 1998) and the Nomenclature of Territorial Units for Statistics for the European Community (NUTS 1995). A top-level structure according to the IPTC Subject Reference System was added (IPTC), and most of the OECD Macrothesaurus was merged into the LAURIN Thesaurus (OECD 1997). Furthermore, a database maintained at the Innsbrucker Zeitungsarchiv holding about 35,000 records on writers and poets was imported. Thus, the LAURIN Thesaurus started from the very beginning with an inventory of approximately 200,000 concepts, i.e. thesaurus entries, and 250,000 names.
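The separation of concept and name can be pictured with a minimal data-structure sketch. The class and field names below are assumptions made for illustration and are not taken from the LAURIN Database schema; the sample concept is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Name:
    """An expression for a concept in one language (fields are assumed)."""
    text: str
    language: str          # e.g. a language code such as "de" or "en"
    preferred: bool = False


@dataclass
class Concept:
    """A meaning, identified independently of how it is named."""
    concept_id: int
    names: list = field(default_factory=list)

    def preferred_name(self, language):
        """Return the preferred name in a language, or None if absent."""
        for name in self.names:
            if name.language == language and name.preferred:
                return name.text
        return None


# One concept carrying several names, as in a multilingual thesaurus.
vienna = Concept(1, [
    Name("Wien", "de", preferred=True),
    Name("Vienna", "en", preferred=True),
    Name("Vindobona", "la"),
])
```

Because names hang off concepts rather than the other way round, the inventory can hold more names than concepts, as the figures above show.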
Due to the fact that the LAURIN System was originally designed as a network of online archives with a thesaurus distributed throughout that network on local nodes grouped around a central node, the issue of quality control regarding the maintenance of the LAURIN Thesaurus had to be addressed from the very beginning of the project. The LAURIN Interface Suite provides a tool for the local level, LNTM, the "Local Node Thesaurus Manager", and one for the central level, CNTM, the "Central Node Thesaurus Manager". The following schema gives an overview of this workflow:
[FIGURE 5 OMITTED]
State diagram for the thesaurus entries
When a new item is added to the thesaurus, it is marked as a temporary keyword. It then has to be decided whether this temporary keyword will be removed, whether it will be promoted to a thesaurus candidate, or whether it will be turned into a free keyword. Free keywords are not really integrated into the thesaurus, as they do not maintain relations to other thesaurus entries. On the other hand, free keywords are very useful for the new or upcoming terms often found in newspapers. Some of those terms may be turned into regular thesaurus entries later, but others are just a fashion and soon disappear from the media. When such free keywords first appear, it is often difficult to obtain an exact definition and, therefore, to determine the correct place for them within the thesaurus. Thesaurus candidates are forwarded to be examined on the central level, where they may be accepted or rejected. If the model of the thesaurus entries as made up of concepts and names is taken into account, the picture of the workflow gets a little more complicated:
[FIGURE 6 OMITTED]
State diagram for the thesaurus entries including concept and name
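The simpler workflow for thesaurus entries, without the concept and name distinction, can be sketched as a small state machine. The state names and transitions are read from the prose only, because the diagrams are not reproduced here; the step from free keyword to thesaurus candidate is an assumption about how a free keyword later becomes a regular entry.

```python
from enum import Enum


class State(Enum):
    TEMPORARY = "temporary keyword"
    FREE = "free keyword"
    CANDIDATE = "thesaurus candidate"
    ACCEPTED = "thesaurus entry"
    REJECTED = "rejected"
    REMOVED = "removed"


# Allowed transitions. Decisions on temporary keywords are taken on the
# local level; candidates are accepted or rejected on the central level.
# FREE -> CANDIDATE is an assumption, not stated in the text.
TRANSITIONS = {
    State.TEMPORARY: {State.REMOVED, State.CANDIDATE, State.FREE},
    State.FREE: {State.CANDIDATE},
    State.CANDIDATE: {State.ACCEPTED, State.REJECTED},
    State.ACCEPTED: set(),
    State.REJECTED: set(),
    State.REMOVED: set(),
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


# A new term is promoted locally and then accepted centrally.
state = State.TEMPORARY
state = advance(state, State.CANDIDATE)
state = advance(state, State.ACCEPTED)
```

Keeping the allowed moves in one table makes it hard for a temporary keyword to become a regular entry without passing the central review.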
CNTM, the "Central Node Thesaurus Manager", is used to review thesaurus candidates, to apply structural changes to the thesaurus by merging or moving entries, and to add new thesaurus entries if required.
The LAURIN System, though designed and planned to be run by a network of clipping archives, is used in a production environment only by the Innsbrucker Zeitungsarchiv at the University of Innsbruck. As a matter of fact, the software was hardly usable after the first phase of the project was finished. Therefore, the other partners from the EU-funded project did not deploy the system. In the second phase of the project, which was conducted only on a national level, the LAURIN Interface Suite was enhanced and completed. Besides the tools mentioned above, there are interfaces to edit users of the system, to edit periodical data, to display statistics on acquired and indexed data, etc. Together with the LAURIN Database it is available for free download at http://laurin.uibk.ac.at/.
The LAURIN Project
Department of German Language, Literature and Literary Criticism
University of Innsbruck
Email : firstname.lastname@example.org
Title Annotation: Technical Report
Publication: Journal of Digital Information Management
Date: Dec 1, 2003