Current developments and future trends for the OAI Protocol for Metadata Harvesting.


access to diverse e-print archives through metadata harvesting and aggregation, the protocol has demonstrated its potential usefulness to a broad range of communities. Two years out from the release of the stable production version of the protocol (2.0), there are many interesting developments within the OAI community. Communities of interest have begun to use the protocol to aggregate metadata relevant to their needs. The development of a registry of OAI data providers with browsing and searching capabilities as well as accessibility to machine processing is helping to provide a scalable solution to the question of who is providing what via the OAI protocol. Work is progressing on the technical infrastructure for extending the OAI protocol beyond the traditional harvesting structure. However, serious challenges, particularly for service providers, still exist. This article provides an overview of the current OAI environment and speculates on future directions for the protocol and OAI community.


This article provides a brief overview of the OAI environment, two years out from the release of the production version of the protocol. We assume a relatively high level of familiarity with how the protocol works and only give a brief overview. We delve into some of the interesting developments within the OAI world, particularly the use of the protocol within specific communities of interest, the development of a comprehensive registry of OAI data providers, and a resolver for OAI identifiers that extends the protocol beyond its traditional use. We also document some of the current challenges for both data and service providers. We end the article by noting some of the possible future directions for the OAI protocol and community.


Community- and Domain-Specific OAI Services

The different foci were indicative of the future progress of service providers. No one service provider can serve the needs of the entire public, hence user group-specific service providers have become the norm. Many communities have adopted or are in the process of adopting the OAI protocol to help provide federated access to dispersed resources. These communities of interest are significant not only because they have adopted the protocol for a specific domain but also because they have developed additional standards, tools, and metadata schemas to use along with the OAI protocol--much as the originators of the protocol had hoped. Indeed, these domain- and user-specific services may be the best example of what the OAI protocol has to offer.

We highlight three notable community- or domain-specific services in various stages of development below. For a fuller documentation of community-specific service providers and data providers, see the 2003 Digital Library Federation report (Brogan, 2003) and the recent series of profiles of service providers in Library Hi Tech News (McKiernan, 2003a, 2003b, 2004).

Sheet Music Consortium

The Sheet Music Consortium is a group of four academic libraries--UCLA, Johns Hopkins University, Indiana University, and Duke University--that are building a freely available collection of digitized sheet music. Sheet music presents a particular problem for cataloging because of its various elements: cover art, the sheet music itself, the lyrics, etc. (Davison, Requardt, & Brancolini, 2003). The consortium provides standards for using unqualified Dublin Core to describe sheet music and guidelines for implementation of data provider services. The search service allows the creation of "virtual collections" and allows users to annotate the metadata records (Sheet Music Consortium, n.d.). While work on this service is still in progress, the focus on building a service provider based on a specific type of material makes it well worth watching.

As the OAI community has matured, and especially as the number of OAI repositories and the number of data sets served by those repositories has grown, it has become increasingly difficult for service providers to discover and effectively utilize the myriad repositories. In order to address this difficulty the OAI research group at UIUC has developed a comprehensive, searchable registry of OAI repositories (Experimental OAI Registry at UIUC, n.d.).

Developing the Experimental OAI Registry

In developing OAI service providers for various projects within the UIUC Library, the issues of completeness and discoverability have become more evident. The UIUC research group thus built the Experimental OAI Registry to address these problems. Moreover, based on feedback after the first public announcement of the Registry on the OAI-Implementers listserv, the group realized that the Registry also could be utilized to meet various other needs in the OAI community, such as the need for various output formats to support machine processing of the Registry.

Finally, requests to manually add repositories to the registry are accepted. In the future, self-registration should become an automated procedure.

The primary supported search is for keywords appearing in the various OAI responses, namely Identify, ListSets, and the sample records. A key observation resulting from our search system is that repositories that include rich collection-level metadata, either in the optional Identify description containers or in the optional ListSets setDescription containers, will fare better in terms of discoverability. This suggests the desirability of broader use of collection-level metadata by the OAI community.

Amenable to Machine Processing

The third major goal was to expose the registry's data in ways that were useful for machine processing. The most obvious way to make the registry accessible for machine processing was by making it an OAI repository itself. Thus, basic Dublin Core records about each OAI repository contained in the registry can be harvested via the OAI-PMH. The ERRoL service, described below, is an example of an application that utilizes the OAI-PMH interface to the registry. In the future, additional metadata formats might be harvestable as well, such as the ZeeRex format used by the Search and Retrieval Web/URL Service (SRW/U) protocol (ZeeRex, n.d.). In addition, the registry is an RDF Site Summary (RSS) news feed provider. Using RSS, a person can monitor the registry for new or modified repository records. The RSS feed is available from the registry Web site (Experimental OAI Registry at UIUC, n.d.). There are also a number of ways to export repository records from the registry. Any list of repositories resulting from a search or a browsable view can be exported using the XML schema of the "friends" description container.
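To make the harvesting step concrete, the following is a minimal sketch of how a client might request and read Dublin Core records from any OAI-PMH repository, the registry included. The base URL, the identifier, and the sample response are hypothetical placeholders rather than the registry's actual address or data; only the verb, the metadataPrefix argument, and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_request(base_url, verb, **params):
    """Build an OAI-PMH request URL such as ...?verb=ListRecords&metadataPrefix=oai_dc."""
    return base_url + "?" + urlencode({"verb": verb, **params})


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "titles": [t.text for t in rec.iterfind(".//dc:title", NS)],
        })
    # An absent or empty token means the list is complete.
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Hypothetical response, trimmed to the elements the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:repo-1</identifier>
        <datestamp>2004-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Repository</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_request("http://example.org/oai", "ListRecords", metadataPrefix="oai_dc"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

A real harvester would fetch each URL over HTTP and repeat the request with the returned resumptionToken until none is returned; that loop is omitted here to keep the sketch self-contained.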

Work is also progressing on a "harvest bag" feature. This would allow a user to accumulate a custom list of repositories, including sets and metadata formats, that they could export in a standard XML schema. This would be similar to the "book bag" feature of other digital library portals, which allows users to save and export lists of bibliographic citations. The vision is that the "harvest bag" list could then be imported into harvesting software to initiate a harvest of the selected sites.
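Because the "harvest bag" is still under development, its export schema is not yet defined. The sketch below is therefore only an illustration of the idea: the class names, the XML element names (harvestBag, repository, metadataPrefix, setSpec), and the example base URLs are all hypothetical. It shows how a saved list of repositories, sets, and metadata formats could be exported as XML and turned into ListRecords requests for harvesting software.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class BagEntry:
    """One selected repository, optionally narrowed to a set (hypothetical structure)."""
    base_url: str
    metadata_prefix: str = "oai_dc"
    set_spec: Optional[str] = None


@dataclass
class HarvestBag:
    entries: List[BagEntry] = field(default_factory=list)

    def add(self, entry):
        # Ignore duplicates so the same selection is not harvested twice.
        if entry not in self.entries:
            self.entries.append(entry)

    def to_xml(self):
        """Serialize the bag using made-up element names, for illustration only."""
        root = ET.Element("harvestBag")
        for e in self.entries:
            repo = ET.SubElement(root, "repository", baseURL=e.base_url)
            ET.SubElement(repo, "metadataPrefix").text = e.metadata_prefix
            if e.set_spec:
                ET.SubElement(repo, "setSpec").text = e.set_spec
        return ET.tostring(root, encoding="unicode")

    def harvest_urls(self):
        """Turn each entry into an OAI-PMH ListRecords request URL."""
        urls = []
        for e in self.entries:
            params = {"verb": "ListRecords", "metadataPrefix": e.metadata_prefix}
            if e.set_spec:
                params["set"] = e.set_spec
            urls.append(e.base_url + "?" + urlencode(params))
        return urls


if __name__ == "__main__":
    bag = HarvestBag()
    bag.add(BagEntry("http://example.org/oai", set_spec="sheetmusic"))
    bag.add(BagEntry("http://example.net/oai"))
    print(bag.to_xml())
    print(bag.harvest_urls())
```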

Future Work

While the registry is now fully operational, there remain a number of improvements the group would like to make to increase its usefulness. The following, in no particular order, are some plans for future enhancements to the registry:

* Enhance the collection-level description of the repositories to enable better search and discovery. This might include both manual cataloging and the application of automated classification algorithms to the repository's records.

* Provide more automated maintenance of the registry, including the ability of OAI data providers to securely add or modify their repository's records in the registry, including collection-level descriptive data.

* Improve the automated discovery of new repositories, such as automatically running the Google SOAP-based harvester.

* Delegate the creation and maintenance of virtual collections of repositories, including collection-level metadata.

* Improve the view of search results, especially the context of the search hit. The current system does not identify the context of a search hit, which could be the Identify or ListSets responses or the sample records.


As mentioned above, according to the conventional model of OAI, the world is divided into data providers and service providers. As it happens, though, a few simple tricks with style sheets and HTTP redirects allow an OAI repository to stand alone as an independent Web application. Early examples of this were created by enhancing individual repositories, as discussed elsewhere (Van de Sompel, Young, & Hickey, 2003). Frustration with changing the OAI world one repository at a time, however, led to the development of the ERRoL resolution service (Extensible Repository Resource Locators, n.d.), which automatically extends these same features and more to any OAI repository in the UIUC registry.

Keep in mind that the ERRoL service is stripping these XML documents from OAI GetRecord responses that it retrieves from the home repository. Each shares the same oai-identifier as the oai_dc metadata record that describes it, which, as explained above, can be obtained by changing the extension to "oai_dc." Having content and metadata in such close proximity makes it easy to build lightweight, interactive, self-descriptive, content-based, automated systems using XSLT and other thin clients.

ERRoLs work with any OAI repository that has a unique repository-identifier registered at the UIUC Experimental OAI Registry. In the case of the ".html" extension, the repository displays integrated identity and branding information gleaned from the repository's "Identify" response, but otherwise the repositories share the same look and feel. It is possible, however, for individual repositories to instruct the ERRoL service to use an alternate style sheet by inserting a <description> element in their "Identify" response. Thus, the GSAFD Thesaurus repository (OCLC, n.d. a) looks and acts differently from the default style shown above. The list of custom style sheets is currently limited to an approved set, but a mechanism is planned that will open this up to arbitrary style sheets.

Other extensions are available at the repository and item levels, and new ones are in the works. It is even possible for individual repositories to specify custom extensions by defining them in "Identify" response <description> elements, although this feature is not fully developed yet. Having shown the promise of ERRoLs, though, a few words of caution are needed. ERRoLs operate by dynamically interacting with data providers via the OAI-PMH protocol. If these repositories are offline, slow, or less than fully OAI-compliant (which is frequently the case), the ERRoL functions will suffer. Nevertheless, these examples should show that ERRoLs are an interesting alternative to the conventional OAI model.


We have highlighted a number of developments and ongoing work within the OAI community (and there are many more). But as the number of OAI data providers has grown, two broad areas of concern have arisen, particularly for service providers. These center on the variations and problems with data provider implementations and on the metadata itself. A third concern is the lack of communication among service and data providers. The metadata issues in particular have been well documented (Shreeves, Kaczmarek, & Cole, 2003; Halbert, 2003; Hagedorn, 2003; Arms, Dushay, Fulker, & Lagoze, 2003), but we highlight some of the major issues in all areas of concern below.

Metadata Variation

Normalizing data (such as date or type elements) so search limiters can be used requires the development of common values among many disparate ones. The normalization of the subject element--with many different controlled vocabularies (or merely keywords) used by the different data providers--is, for most service providers, prohibitively resource intensive.
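As an illustration of what deriving common values from disparate ones involves, the sketch below reduces free-form date strings to a four-digit year and maps a handful of type strings onto a small common vocabulary. The sample inputs and the mapping table are invented for the example and are not drawn from any particular data provider; a production service would need a far larger table and more careful date handling.

```python
import re

# A four-digit year (1000-2099) that is not part of a longer run of digits.
YEAR = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")

# Illustrative mapping only; real harvested values are far more varied.
TYPE_MAP = {
    "text": "Text",
    "article": "Text",
    "image": "Image",
    "photograph": "Image",
    "sound": "Sound",
    "audio": "Sound",
}


def normalize_year(raw):
    """Return the first plausible year in a date string, or None if there is none."""
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None


def normalize_type(raw):
    """Map a harvested type value to the common vocabulary, or None if unknown."""
    return TYPE_MAP.get(raw.strip().lower())


if __name__ == "__main__":
    for value in ["1923-05-01", "[1923?]", "c1923", "May 1, 1923", "n.d."]:
        print(repr(value), "->", normalize_year(value))
    for value in ["Text", " Photograph ", "score"]:
        print(repr(value), "->", normalize_type(value))
```

Values that cannot be normalized come back as None so they can be set aside for review instead of being silently mislabeled.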

Metadata Formats

In the same vein, harvesting a data repository's additional metadata formats (beyond unqualified Dublin Core) can be difficult. For a large service provider with a standard method for processing harvested metadata, including new formats involves adding additional paths to the processing routines. The more formats, the more complex the processing becomes. Additionally, large service providers may have developed interfaces conforming to the simple Dublin Core standard and may not have the ability to integrate more complex and more varied formats. For this, service providers need more all-encompassing game plans and better internal support.
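The cost of "adding additional paths" can be seen in a small sketch of one possible design, in which each metadata format gets its own handler that maps records into a shared internal shape. The handler names, the "marc21" prefix, the internal dictionary layout, and the sample records are assumptions made for illustration; only the Dublin Core and MARCXML namespaces are taken from the respective standards.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
MARC = "{http://www.loc.gov/MARC21/slim}"

# One processing path per metadataPrefix; every new format adds an entry here.
HANDLERS = {}


def handles(prefix):
    """Decorator that registers a handler for one metadataPrefix."""
    def register(func):
        HANDLERS[prefix] = func
        return func
    return register


@handles("oai_dc")
def from_dublin_core(metadata):
    return {"title": [e.text for e in metadata.iter(DC + "title")]}


@handles("marc21")
def from_marcxml(metadata):
    titles = []
    for datafield in metadata.iter(MARC + "datafield"):
        if datafield.get("tag") == "245":  # MARC title statement
            titles += [s.text for s in datafield.iter(MARC + "subfield")
                       if s.get("code") == "a"]
    return {"title": titles}


def to_common(prefix, xml_text):
    """Convert one harvested metadata payload to the shared internal shape."""
    handler = HANDLERS.get(prefix)
    if handler is None:
        raise ValueError(f"no processing path for metadata format {prefix!r}")
    return handler(ET.fromstring(xml_text))


# Hypothetical sample payloads describing the same item in two formats.
DC_SAMPLE = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>After the Ball</dc:title></oai_dc:dc>"
)
MARC_SAMPLE = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">After the Ball</subfield></datafield></record>'
)

if __name__ == "__main__":
    print(to_common("oai_dc", DC_SAMPLE))
    print(to_common("marc21", MARC_SAMPLE))
```

Even in this toy form, each format needs its own extraction logic for every field the service indexes, which is why supporting many formats quickly becomes expensive.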

OAI Data Provider Implementation Practices

Communication Issues

The OAI community is very loosely federated. There are general and technical listservs available through the Open Archives Initiative. However, as some of the issues above illustrate, a serious need for best practices and guidelines exists for both data and service providers. An informal community of service providers has appeared who advise each other on the technicalities of harvesting and maintaining their services. While this ad hoc community is welcome, a more formal method of communication between data and service providers is needed.


We have discussed above just some of the current developments in the OAI community. Below we outline some future directions. This list is not meant to be all inclusive but rather a taste of some of the ongoing research and practices in the OAI community.

Best Practices

Static Repository Gateway

The technical hurdle is still sometimes too great for potential data providers. The Static Repository Gateway, developed at the Los Alamos National Laboratory, is the most recent option for OAI data providers and provides a very low entry point (Van de Sompel, Lagoze, Nelson, & Warner, 2002; Hochstenbach, Jerez, & Van de Sompel, 2003). Essentially, a resource developer can post a single large XML file containing the metadata and OAI wrappers on its Web server. This file can be accessed through an OAI gateway service. Currently two service providers, UIUC and the University of Michigan, have been working to shepherd potential data providers to one gateway, which has proved very simple for both the service and data providers.

Mod_oai Project

The OAI-rights committee is working toward a means of incorporating structured rights statements about the resources exposed (that is, the metadata) through the protocol (Lagoze, Van de Sompel, Nelson, & Warner, 2003). The committee does not intend to define a new rights language but only to provide the means of communicating a structured, defined language within the protocol.

Controlled Vocabularies and OAI

Controlled vocabularies will become more important as data and service providers try to cope with the chaos that develops from aggregating metadata from diverse sources. Controlled vocabularies will become particularly important within self-archiving systems such as institutional repositories and e-print archives (many of which are also OAI data providers); in many cases there is no cataloger to exert quality and authority control. A lightweight solution to this would be for authority agencies to mount their thesauri as an SRW/U search service, register it with the UIUC registry, and use ERRoLs to provide an HTML interface and URL access to items in the repository (OCLC, n.d. a).

SRW/U-to-OAI Gateway to the ERRoL Service

This service will allow institutions to load their data as an SRW/U search service, register it with the UIUC gateway, and automatically get OAI-PMH and ERRoL functionality for free. The OCLC Research Publications OM repository is the first demonstration of this. This configuration adds searching capability to the mix of ERRoL features (OCLC, n.d. b).


Hochstenbach, P., Jerez, H., & Van de Sompel, H. (2003). The OAI-PMH static repository and static repository gateway. In C. C. Marshall, G. Henry, & L. Delcambre (Eds.), Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries: May 27-31, 2003, Houston, Texas (pp. 210-217).
Lagoze, C., Hoehn, W., Millman, D., Arms, W., Gan, S., Hillmann, D., et al. (2002). Core services ... New York: ACM Press.
Lagoze, C., & Van de Sompel, H. (2001). The Open Archives Initiative: Building a low-barrier interoperability framework. In E. A. Fox & C. L. Borgman (Eds.), Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries: June 24-28, 2001, Roanoke, Virginia, USA (pp. 54-62). New York: ACM Press.

