Current developments and future trends for the OAI Protocol for Metadata Harvesting.



ABSTRACT

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has been widely adopted since its initial release in 2001. Initially developed as a means to federate access to diverse e-print archives through metadata harvesting and aggregation, the protocol has demonstrated its potential usefulness to a broad range of communities. Two years out from the release of the stable production version of the protocol (2.0), there are many interesting developments within the OAI community. Communities of interest have begun to use the protocol to aggregate metadata relative to their needs. The development of a registry of OAI data providers with browsing and searching capabilities as well as accessibility to machine processing is helping to provide a scalable solution to the question of who is providing what via the OAI protocol. Work is progressing on the technical infrastructure for extending the OAI protocol beyond the traditional harvesting structure. However, serious challenges, particularly for service providers, still exist. This article provides an overview of the current OAI environment and speculates on future directions for the protocol and OAI community.

**********

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has been widely adopted since its initial release in 2001. Initially developed as a means to federate access to diverse e-print archives through metadata harvesting (Lagoze & Van de Sompel, 2003), the protocol has demonstrated its potential usefulness to a broad range of communities. According to the Experimental OAI Registry at the University of Illinois Library at Urbana-Champaign (UIUC) (Experimental OAI Registry at UIUC, n.d.), there are currently over 300 active data providers using the production version (2.0) of the protocol from a wide variety of domains and institution types. Developers of both open source and commercial content management systems (such as D-Space and CONTENTdm) are including OAI data provider services as part of their products. Service providers range from large-scale efforts with a wide scope, such as the National Science Digital Library (n.d.), to small, tightly focused, community-specific services, such as the Sheet Music Consortium (n.d.).

This article provides a brief overview of the OAI environment, two years out from the release of the production version of the protocol. We assume a relatively high level of familiarity with how the protocol works and only give a brief overview. We delve into some of the interesting developments within the OAI world, particularly the use of the protocol within specific communities of interest, the development of a comprehensive registry of OAI data providers, and a resolver for OAI identifiers that extends the protocol beyond its traditional use. We also document some of the current challenges for both data and service providers. We end the article by noting some of the possible future directions for the OAI protocol and community.

CURRENT DEVELOPMENTS IN OAI WORK

The mission of the Open Archives Initiative, the entity responsible for the protocol, is to "develop and promote interoperability standards that aim to facilitate the efficient dissemination of content" (Open Archives Initiative, n.d. a). The Protocol for Metadata Harvesting, a tool developed through the OAI, facilitates interoperability between disparate and diverse collections of metadata through a relatively simple protocol based on common standards (XML, HTTP, and Dublin Core). The OAI world is divided into data providers or repositories, which traditionally make their metadata available through the protocol, and service providers or harvesters, who completely or selectively harvest metadata from data providers, again through the use of the protocol (Lagoze & Van de Sompel, 2001). The OAI protocol requires that data providers expose metadata in at least unqualified Dublin Core; however, the use of other metadata schemas is possible and encouraged. The protocol can provide access to parts of the "invisible Web" that are not easily accessible to search engines (such as resources within databases) (Sherman & Price, 2003) and can provide ways for communities of interest to aggregate resources from geographically diffuse collections. The protocol promotes a structure in which data providers can focus on building collections and content, and service providers can focus on building services for these collections and content. While the protocol itself says nothing about what happens to metadata once harvested, usually service providers aggregate, index, and build search/retrieval and other value-added services around the harvested metadata. It has been two years now since the production version of the protocol was introduced (Lagoze, Van de Sompel, Nelson, & Warner, 2002a). Below we discuss just some of the current trends and developments within the OAI community.
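As a concrete illustration of the harvesting model described above, the following minimal Python sketch shows how a harvester might form an OAI-PMH request as an ordinary HTTP GET URL and pull Dublin Core titles out of the XML that comes back. The base URL, the function names, and the abbreviated sample response are assumptions made for illustration only; they are not taken from any repository discussed in this article, and no network access is performed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the unqualified Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical base URL, used only for illustration.
BASE_URL = "http://repository.example.org/oai"

# Hand-written, abbreviated stand-in for a real ListRecords response.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example sheet music</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def build_request(base_url, verb, **params):
    """Return the HTTP GET URL for an OAI-PMH request."""
    query = {"verb": verb}
    query.update(params)
    return base_url + "?" + urlencode(query)


def extract_titles(response_xml):
    """Collect the text of every dc:title element in a response document."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter("{%s}title" % DC_NS)]


if __name__ == "__main__":
    print(build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
    print(extract_titles(SAMPLE_RESPONSE))
```

What a service provider does with the extracted metadata (indexing, searching, aggregation) is outside the protocol, which is why the sketch stops at parsing.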

Community- and Domain-Specific OAI Services

As mentioned above, the Open Archives Initiative emerged from and was initially designed to meet the needs of the e-print archives community (Warner, 2003). However, it was recognized fairly early in the protocol's development that it could be applicable in a broad range of communities, including, but not limited to, libraries, museums, and archives. In fact, the implementation guidelines (Lagoze, Van de Sompel, Nelson, & Warner, 2002b) are deliberately nonspecific so as to provide room for community-specific applications of the protocol (Lagoze & Van de Sompel, 2003).

The initial push for developing OAI service providers was in part due to the Andrew W. Mellon Foundation grants in 2001 (Waters, 2001). The foundation issued seven grants to institutions interested in researching the development of service providers. Three institutions developed publicly accessible services predicated on their research: the AmericanSouth.org project at Emory University; the Digital Gateway to Cultural Heritage Materials at the University of Illinois at Urbana-Champaign (UIUC); and the OAIster project at the University of Michigan. Each service had a different focus. The AmericanSouth.org project focused on aggregating content related to the culture and history of the American South while involving scholars in the process of selection and interpretation (Halbert, 2003). The UIUC project aggregated metadata relating to cultural heritage resources, including finding aids (Shreeves, Kaczmarek, & Cole, 2003), and the OAIster project harvested all possible repositories but kept only those records that pointed to actual digital objects (Hagedorn, 2003).

The different foci were indicative of the future progress of service providers. No one service provider can serve the needs of the entire public, hence user group-specific service providers have become the norm. Many communities have adopted or are in the process of adopting the OAI protocol to help provide federated access to dispersed resources. These communities of interest are significant not only because they have adopted the protocol for a specific domain but also because they have developed additional standards, tools, and metadata schemas to use along with the OAI protocol--much as the originators of the protocol had hoped. Indeed, these domain- and user-specific services may be the best example of what the OAI protocol has to offer.

We highlight three notable community- or domain-specific services in various stages of development below. For a fuller documentation of community-specific service providers and data providers, see the 2003 Digital Library Federation report (Brogan, 2003) and the recent series of profiles of service providers in Library Hi Tech News (McKiernan, 2003a, 2003b, 2004).

Open Language Archives Community. The mission of the Open Language Archives Community (OLAC) is to create "a worldwide virtual library of language resources" through development of community-based standards for archiving and interoperability and a "network of interoperable repositories" (Open Language Archives Community, n.d. a). OLAC uses the OAI protocol as a means to the latter end. OLAC has extended the protocol to meet the needs for its particular community, specifically through the maintenance of a specialized metadata schema (based loosely on unqualified Dublin Core), data provider tools (including a range of options for organizations without the technical infrastructure to support full-fledged OAI data providers), and service provider tools (Simons & Bird, 2003). Currently OLAC provides access to metadata harvested from twenty-seven data providers through search services hosted at the Linguist List (n.d.) and the Linguistic Data Consortium (n.d.). This integration of search services within important community Web sites increases the visibility and value of OLAC.

Sheet Music Consortium. The Sheet Music Consortium is a group of four academic libraries--UCLA, Johns Hopkins University, Indiana University, and Duke University--that are building a freely available collection of digitized sheet music. Sheet music presents a particular problem for cataloging because of its various elements: cover art, the sheet music itself, the lyrics, etc. (Davison, Requardt, & Brancolini, 2003). The consortium provides standards for using unqualified Dublin Core to describe sheet music and guidelines for implementation of data provider services. The search service allows the creation of "virtual collections" and allows users to annotate the metadata records (Sheet Music Consortium, n.d.). While work on this service is still in progress, the focus on building a service provider based on a specific type of material makes it well worth watching.

National Science Digital Library. The National Science Digital Library (NSDL) provides access to the content of collections of science-based learning objects (National Science Digital Library, n.d.). The OAI protocol is the primary means of aggregating the metadata describing this content, although other means are used as well (Lagoze et al., 2002). Funded by the National Science Foundation (NSF), the NSDL has the broadest vision of the service providers described here in that it is attempting to build and aggregate not just a series of digital collections and content but also services to use these resources and the infrastructure to support both. As such, NSF has invested significant resources in the development of content, services, and infrastructure. The NSDL maintains standards for metadata and guidance for data providers. The NSDL aims for a broad user base (K-12), but its core mission remains to develop this "learning environment and resources network" for science education (Zia, 2001).

COMPREHENSIVE OAI REGISTRY OF DATA PROVIDERS

As the OAI community has matured, and especially as the number of OAI repositories and the number of data sets served by those repositories have grown, it has become increasingly difficult for service providers to discover and effectively utilize the myriad repositories. To address this difficulty, the OAI research group at UIUC has developed a comprehensive, searchable registry of OAI repositories (Experimental OAI Registry at UIUC, n.d.).

Shortcomings of Existing Registries

There were and continue to be several other registries of OAI repositories such as those maintained by the Open Archives Initiative Web site (Open Archives Initiative, n.d. b) and OLAC (Open Language Archives Community, n.d. b). However, nearly all of these suffer from a number of shortcomings. Probably foremost is that the registries typically maintain very sparse records about the individual repositories, usually nothing other than flat lists of base URLs and possibly the repository name. Typically, there is no search mechanism and fairly limited browsing capabilities. An onerous amount of manual snooping using the OAI-PMH verbs directly in a Web browser is usually required by potential service providers before they can assess the utility of a specific repository for their needs.

A second shortcoming of the existing registries is completeness. The registries are usually populated by self-registration or maintained to support the specific needs of a unique community, so few of the registries approach a complete list of all available repositories. "Googling" or following friends or provenance links reveals many new OAI repositories that are not listed in any of the existing registries, even taken as a whole.

Developing the Experimental OAI Registry

In developing OAI service providers for various projects within the UIUC Library, the issues of completeness and discoverability have become more evident. The UIUC research group thus built the Experimental OAI Registry to address these problems. Moreover, based on feedback after the first public announcement of the Registry on the OAI-Implementers listserv, the group realized that the Registry also could be utilized to meet various other needs in the OAI community, such as the need for various output formats to support machine processing of the Registry.

Completeness. The UIUC research group addressed the completeness issue by employing three different strategies. The first strategy was a simple inventory of existing registries, both formal and informal, that listed different repositories. The second strategy involved following various links that were contained within the OAI responses. The first source of links was the "friends" container (Lagoze, Van de Sompel, Nelson, & Warner, 2002b). This container could be included as one of the optional description elements in an OAI "Identify" response. It allows an OAI repository to list other confederate repositories that may be of interest to a harvester. It is also commonly used by aggregator repositories. The other source of links was the "provenance" container (Lagoze, Van de Sompel, Nelson, & Warner, 2002b). This container could be included as one of the optional "about" elements of an OAI record. The provenance container stores data about the original source of a record that has been aggregated into a different repository. Using "friends" and "provenance," it was possible to recursively crawl webs of related OAI repositories. The registry maintains this linking information about each repository to produce a network graphic. The third strategy involved using the Google™ SOAP-based Web toolkit (Google, n.d.). Using this toolkit the research group was able to programmatically search the Google Web indexes to find OAI repositories. The group developed a number of search strategies, from using OAI related keywords such as "OAI" or "Open Archives," to using special Google keywords such as "allinurl:verb=Identify," which will find Web sites that contain the string "verb=Identify" in their URL. This latter strategy proved the most successful. Once a candidate base URL is discovered, it is tested to determine whether it can respond to the OAI "verb=Identify" request. If it responds, it is assumed to be a valid OAI repository and it is added to the registry.
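The link-following strategy and the "verb=Identify" test described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the registry's actual code: the function names, the example.org URLs, and the canned responses are hypothetical, and the fetching step is passed in as a function so that the sketch runs without network access. Only "friends" links are followed here; "provenance" links found in harvested records could be fed into the same queue.

```python
from collections import deque
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
FRIENDS_NS = "http://www.openarchives.org/OAI/2.0/friends/"


def parse_identify(xml_text):
    """Return the friend base URLs listed in an Identify response.

    Returns None when the text is not a well-formed Identify response.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    if root.tag != "{%s}OAI-PMH" % OAI_NS:
        return None
    if root.find("{%s}Identify" % OAI_NS) is None:
        return None
    return [el.text.strip()
            for el in root.iter("{%s}baseURL" % FRIENDS_NS) if el.text]


def crawl(seed_urls, fetch_identify):
    """Breadth-first crawl over "friends" links.

    fetch_identify(base_url) is supplied by the caller and must return the
    text of the verb=Identify response, or None if the request failed.
    """
    queue = deque(dict.fromkeys(seed_urls))  # de-duplicated, order kept
    seen = set(queue)
    accepted = []
    while queue:
        url = queue.popleft()
        body = fetch_identify(url)
        friends = parse_identify(body) if body is not None else None
        if friends is None:
            continue  # no valid Identify answer: not added to the registry
        accepted.append(url)
        for friend in friends:
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return accepted


def identify_xml(friends=()):
    """Build a tiny hand-written Identify response for the demo data."""
    links = "".join("<baseURL>%s</baseURL>" % f for f in friends)
    return ('<OAI-PMH xmlns="%s"><Identify>'
            '<description><friends xmlns="%s">%s</friends></description>'
            '</Identify></OAI-PMH>' % (OAI_NS, FRIENDS_NS, links))


# Hypothetical canned responses standing in for real HTTP requests.
CANNED = {
    "http://a.example.org/oai": identify_xml(
        ["http://b.example.org/oai", "http://c.example.org/oai"]),
    "http://b.example.org/oai": identify_xml(["http://a.example.org/oai"]),
    "http://c.example.org/oai": "<html>not an OAI response</html>",
}

if __name__ == "__main__":
    print(crawl(["http://a.example.org/oai"], CANNED.get))
```

The `seen` set is what keeps the crawl from looping when two repositories list each other as friends, as the first two canned entries do.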

Finally, requests to manually add repositories to the registry are accepted. In the future, self-registration should become an automated procedure.

Searchable and Browsable. The second major objective was to make it possible to search for OAI repositories using various criteria and browse through different views of the registry without any manual cataloging of the various OAI repositories. To accomplish this the research group developed processes to automatically harvest and index various data from each repository. Essentially, a specialized harvest of each repository is performed. This harvest collects data from the Identify, ListSets, and ListMetadataFormats responses, supplying these data to various tables and fields in a relational database. In addition, sample records from each OAI repository are collected for each combination of set and metadataPrefix supported by the provider. These data are also added to the relational database. Once these data are indexed, including the full-text of each response, various searches and views of the registry are possible.

The primary supported search is for keywords appearing in the various OAI responses, namely Identify, ListSets, and the sample records. A key observation resulting from our search system is that repositories that include rich collection-level metadata, either in the optional Identify description containers or in the optional ListSets setDescription containers, fare better in terms of discoverability. This suggests the desirability of broader use of collection-level metadata by the OAI community.
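To make the indexing and keyword search just described more concrete, here is a minimal Python sketch using an in-memory SQLite database. The table layout, function names, and sample data are assumptions for illustration; the article does not describe the registry's actual schema, and a plain substring match stands in for whatever full-text indexing the registry uses.

```python
import itertools
import sqlite3


def sample_requests(sets, prefixes):
    """List one (set, metadataPrefix) pair per sample-record harvest."""
    return list(itertools.product(sets, prefixes))


def build_index(conn, responses):
    """Store raw response text so that it can be searched by keyword.

    responses is an iterable of (base_url, verb, response_text) tuples.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS response "
        "(base_url TEXT, verb TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO response VALUES (?, ?, ?)", responses)
    conn.commit()


def search(conn, keyword):
    """Return base URLs with at least one response containing keyword.

    Uses SQL LIKE, so '%' and '_' inside keyword act as wildcards.
    """
    rows = conn.execute(
        "SELECT DISTINCT base_url FROM response "
        "WHERE body LIKE ? ORDER BY base_url",
        ("%" + keyword + "%",),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical harvested fragments; real responses are full XML documents.
    build_index(conn, [
        ("http://a.example.org/oai", "Identify",
         "<repositoryName>Sheet music archive</repositoryName>"),
        ("http://b.example.org/oai", "ListSets",
         "<setName>Historical maps</setName>"),
    ])
    print(sample_requests(["maps", "music"], ["oai_dc", "marc21"]))
    print(search(conn, "music"))
```

Because the search runs over the stored response text, a repository whose Identify or ListSets responses carry descriptive collection-level text has more for a keyword to match, which is the discoverability effect noted above.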

Amenable to Machine Processing. The third major goal was to expose the registry's data in ways that were useful for machine processing. The most obvious way to make the registry accessible for machine processing was by making it an OAI repository itself. Thus, basic Dublin Core records about each OAI repository contained in the registry can be harvested via the OAI-PMH. The ERRoL service, described below, is an example of an application that utilizes the OAI-PMH interface to the registry. In the future, additional metadata formats might be harvestable as well, such as the ZeeRex format used by the Search and Retrieval Web/URL Service (SRW/U) protocol (ZeeRex, n.d.). In addition, the registry is also an RDF Site Summary (RSS) news feed provider. Using RSS a person can monitor the registry for new or modified repository records. The RSS feed is available from the registry Web site (Experimental OAI Registry at UIUC, n.d.). There are also a number of ways to export repository records from the registry. Any list of repositories resulting from a search or a browsable view can be exported using the XML schema of the "friends" description container.

Work is also progressing on a "harvest bag" feature. This would allow a user to accumulate a custom list of repositories, including sets and metadata formats, that they could export in a standard XML schema. This would be similar to the "book bag" feature of other digital library portals, which allows users to save and export lists of bibliographic citations. The vision is that the "harvest bag" list could then be imported into harvesting software to initiate a harvest of the selected sites.

In addition, the research group is working on an SRW/U search service for the registry (SRW, n.d.). This would allow SRW/U clients to search the registry in a manner similar to that provided by the Web forms search interface. The record formats available via the SRW/U interface would be the same as those available via the registry's OAI provider.

Future Work

While the registry is now fully operational, there remain a number of improvements the group would like to make to increase its usefulness. The following, in no particular order, are some plans for future enhancements to the registry:

* Enhance the collection-level description of the repositories to enable better search and discovery. This might include both manual cataloging and the application of automated classification algorithms to the repository's records.

* Provide more automated maintenance of the registry, including the ability of OAI data providers to securely add or modify their repository's records in the registry, including collection-level descriptive data.

* Improve the automated discovery of new repositories, such as automatically running the Google SOAP-based harvester.

* Delegate the creation and maintenance of virtual collections of repositories, including collection-level metadata.

* Improve the view of search results, especially the context of the search hit. The current system does not identify the context of a search hit, which could be the Identify or ListSets responses or the sample records.

EXTENSIBLE REPOSITORY RESOURCE LOCATORS (ERRoLs)

As mentioned above, according to the conventional model of OAI, the world is divided into data providers and service providers. As it happens, though, a few simple tricks with style sheets and HTTP redirects allow an OAI repository to stand alone as an independent Web application. Early examples of this were created by enhancing individual repositories, as discussed elsewhere (Van de Sompel, Young, & Hickey, 2003). Frustration with changing the OAI world one repository at a time, however, led to the development of the ERRoL resolution service (Extensible Repository Resource Locators, n.d.), which automatically extends these same features and more to any OAI repository in the UIUC registry.

ERRoLs are "Cool URLs" (Berners-Lee, 1998) to content and services related to information in an OAI repository. In essence, the ERRoL service is a resolver for oai-identifiers. In its simplest form, the oai-identifier for an item (such as "oai:lcoal.loc.gov:loc.pnp/cph.3b37282") can be resolved by appending it to the end of the ERRoL service URL "http://errol.oclc.org/," as in "http://errol.oclc.org/oai:lcoal.loc.gov:loc.pnp/cph.3b37282." The ERRoL service begins the resolution process by parsing the repository identifier ("lcoal.loc.gov") from the URL and using it to obtain the official OAI base URL from the UIUC registry. With this, the ERRoL service constructs a standard OAI GetRecord (oai_dc) request to the home repository, which is what the client sees in response.
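The resolution steps described above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the ERRoL service's actual code: the function names are hypothetical, and the dictionary stands in for the lookup against the UIUC registry, with a placeholder base URL rather than the repository's real one.

```python
from urllib.parse import urlencode

# Stand-in for the UIUC registry lookup. The base URL is a hypothetical
# placeholder, not the real address of this repository.
REGISTRY = {"lcoal.loc.gov": "http://repository.example.org/oai"}


def repository_identifier(oai_identifier):
    """Extract the repository identifier from an oai-identifier.

    oai-identifiers have the form oai:<repository-identifier>:<local-id>.
    """
    parts = oai_identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError("not an oai-identifier: %r" % oai_identifier)
    return parts[1]


def resolve(oai_identifier, registry=REGISTRY, prefix="oai_dc"):
    """Build the GetRecord request URL for the item's home repository."""
    base_url = registry[repository_identifier(oai_identifier)]
    query = urlencode({
        "verb": "GetRecord",
        "identifier": oai_identifier,
        "metadataPrefix": prefix,
    })
    return base_url + "?" + query


if __name__ == "__main__":
    print(resolve("oai:lcoal.loc.gov:loc.pnp/cph.3b37282"))
```

Splitting on at most two colons keeps any colons or slashes in the local part of the identifier intact, and `urlencode` percent-encodes them when the identifier is placed in the query string.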

As a resolution result, however, an XML OAI GetRecord response is of marginal interest at best. Fortunately, appending various extensions to the basic URL form can produce different kinds of results. For example, if we want this same oai_dc record stripped from the OAI GetRecord wrapper, we can append the "oai_dc" metadataPrefix to the URL, as in "http://errol.oclc.org/oai:lcoal.loc.gov:loc.pnp/cph.3b37282.oai_dc." This home repository can also supply a "marcxml" record for this same oai-identifier, which can be obtained by appending a ".marcxml" extension, as in "http://errol.oclc.org/oai:lcoal.loc.gov:loc.pnp/cph.3b37282.marc21." Any metadataPrefix available for this item can be added as an extension. This ability to strip a record from its OAI GetRecord wrapper becomes particularly interesting when OAI repositories contain XML content, beyond metadata. Here are examples for a repository that can disseminate XHTML (metadataPrefix = xhtml), XSL Stylesheets (metadataPrefix = xsl), and XML Schemas (metadataPrefix = xsd) respectively:

* http://errol.oclc.org/oai:xmlregistry.oclc.org:xoai/xoaiharvester.xhtml

* http://errol.oclc.org/oai:xmlregistry.oclc.org:xoai/xoaiharvester.xsl

* http://errol.oclc.org/oai:xmlregistry.oclc.org:xoai/config.xsd

Keep in mind that the ERRoL service is stripping these XML documents from OAI GetRecord responses that it retrieves from the home repository. Each shares the same oai-identifier as the oai_dc metadata record that describes it, which, as explained above, can be obtained by changing the extension to "oai_dc." Having content and metadata in such close proximity makes it easy to build lightweight, interactive, self-descriptive, content-based, automated systems using XSLT and other thin clients.
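The extension handling and wrapper stripping described above might look roughly like the following Python sketch. It rests on two assumptions that the article does not spell out: that a trailing extension counts as a metadataPrefix only when it matches a prefix known to be available for the item (local identifiers can themselves contain dots), and that "stripping" means returning the first child of the record's metadata element. The function names and the sample response are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def split_errol_path(path, known_prefixes):
    """Split an ERRoL path into (oai_identifier, metadataPrefix or None)."""
    stem, dot, suffix = path.rpartition(".")
    if dot and suffix in known_prefixes:
        return stem, suffix
    return path, None


def unwrap(get_record_xml):
    """Return the payload inside <metadata>, serialized on its own.

    Returns None when the response carries no metadata payload.
    """
    root = ET.fromstring(get_record_xml)
    metadata = root.find(".//{%s}metadata" % OAI_NS)
    if metadata is None or len(metadata) == 0:
        return None
    return ET.tostring(metadata[0], encoding="unicode")


# Hand-written, abbreviated stand-in for a GetRecord response.
SAMPLE_GET_RECORD = (
    '<OAI-PMH xmlns="%s"><GetRecord><record>'
    '<header><identifier>oai:repository.example.org:1</identifier></header>'
    '<metadata><doc xmlns="http://example.org/ns">payload</doc></metadata>'
    '</record></GetRecord></OAI-PMH>' % OAI_NS
)

if __name__ == "__main__":
    prefixes = {"oai_dc", "marc21"}
    print(split_errol_path(
        "oai:lcoal.loc.gov:loc.pnp/cph.3b37282.oai_dc", prefixes))
    print(split_errol_path("oai:lcoal.loc.gov:loc.pnp/cph.3b37282", prefixes))
    print(unwrap(SAMPLE_GET_RECORD))
```

Checking the suffix against a known set is one way to avoid mistaking the ".3b37282" portion of the identifier for an extension; the real service may resolve that ambiguity differently.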

These examples demonstrate that ERRoLs are a simple mechanism for accessing various manifestations of OAI data, but it cannot be said that they elevate an OAI repository to the level of a human-interactive Web application yet. But just as ERRoLs transformed standard OAI responses into other forms in the examples above, they can just as easily transform them into HTML using the ".html" extension, as in "http://errol.oclc.org/oai:lcoal.loc.gov:loc.pnp/cph.3b37282.html." The ".html" extension, as well as others, not only works at the item level with oai-identifiers but also at the repository level with repository-identifiers. In the case of repository-identifier "lcoal.loc.gov," URL patterns like "http://errol.oclc.org/lcoal.loc.gov.html" are possible. Furthermore, standard OAI parameters can be appended to this URL to produce HTML renderings of all the OAI-PMH responses, as in "http://errol.oclc.org/xmlregistry.oclc.org.html?verb=ListRecords&metadataPrefix=oai_dc&set=XSLStylesheets."
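The repository-level form, with standard OAI parameters appended as a query string, can be sketched the same way. The function name errol_repository_url is hypothetical, and the code simply reproduces the URL shape quoted above; it makes no claim about how the service parses these requests.

```python
from urllib.parse import urlencode


def errol_repository_url(repository_identifier: str,
                         extension: str = "html",
                         **oai_params: str) -> str:
    """Build base/repository-identifier.extension[?OAI parameters]."""
    url = f"http://errol.oclc.org/{repository_identifier}.{extension}"
    if oai_params:
        # urlencode keeps the parameters in the order they were passed.
        url += "?" + urlencode(oai_params)
    return url


print(errol_repository_url(
    "xmlregistry.oclc.org",
    verb="ListRecords",
    metadataPrefix="oai_dc",
    set="XSLStylesheets",
))
# prints http://errol.oclc.org/xmlregistry.oclc.org.html?verb=ListRecords&metadataPrefix=oai_dc&set=XSLStylesheets
```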

ERRoLs work with any OAI repository that has a unique repository-identifier registered at the UIUC Experimental OAI Registry. In the case of the ".html" extension, the repository displays integrated identity and branding information gleaned from the repository's "Identify" response, but otherwise the repositories share the same look and feel. It is possible, however, for individual repositories to instruct the ERRoL service to use an alternate style sheet by inserting a <description> element in their "Identify" response. Thus, the GSAFD Thesaurus repository (OCLC, n.d. a) looks and acts differently from the default style shown above. The list of custom style sheets is currently limited to an approved set, but a mechanism is planned that will open this up to arbitrary style sheets.

Other extensions are available at the repository and item levels, and new ones are in the works. It is even possible for individual repositories to specify custom extensions by defining them in "Identify" response <description> elements, although this feature is not fully developed yet. Having shown the promise of ERRoLs, though, a few words of caution are needed. ERRoLs operate by dynamically interacting with data providers via the OAI-PMH protocol. If these repositories are offline, slow, or less than fully OAI-compliant (which is frequently the case), the ERRoL functions will suffer. Nevertheless, these examples should show that ERRoLs are an interesting alternative to the conventional OAI model.
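Because a resolver of this kind depends on live responses from the home repository, any client built on top of it would probably want a timeout and a small number of retries. The following is a generic sketch of that defensive pattern using only the Python standard library; the function name fetch_with_retry and its default values are assumptions made for illustration and are not part of ERRoL or OAI-PMH.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=3, timeout=10.0, delay=1.0,
                     opener=urllib.request.urlopen):
    """Fetch a URL, retrying when the remote repository is slow or offline.

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as error:
            last_error = error
            # Pause before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise RuntimeError(
        f"giving up on {url} after {attempts} attempts"
    ) from last_error
```

A caller would pass one of the resolver URLs shown earlier and treat a RuntimeError as "repository currently unavailable" rather than as missing content.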

ONGOING CHALLENGES FOR THE OAI COMMUNITY

We have highlighted a number of developments and ongoing work within the OAI community (and there are many more). But as the number of OAI data providers has grown, two broad areas of concern have arisen, particularly for service providers. These center on the variations and problems with data provider implementations and on the metadata itself. A third concern is the lack of communication among service and data providers. The metadata issues in particular have been well documented (Shreeves, Kaczmarek, & Cole, 2003; Halbert, 2003; Hagedorn, 2003; Arms, Dushay, Fulker, & Lagoze, 2003), but we highlight some of the major issues in all areas of concern below.

Metadata Variation

While metadata must be created using unqualified Dublin Core (DC) encoding, as well as any other kind of encoding the data provider wishes, the choice of how to use the encoding standard and/or how to fit the encoding to metadata values that already exist varies widely among data providers. One institution's choice of how to use the DC Type tag can vary greatly from another's (for example, "HTML" vs. "Preprint"). This can make it difficult to create a search environment in which users feel certain they are receiving what they need. For instance, normalizing data (such as date or type elements) so that search limiters can be used requires the development of common values among many disparate ones. The normalization of the subject element--with many different controlled vocabularies (or merely keywords) used by the different data providers--is, for most service providers, prohibitively resource intensive.
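To make the normalization problem concrete, here is a minimal sketch of how a service provider might collapse disparate type and date values into common ones. The mapping table, the target vocabulary, and the function names are all hypothetical; a real aggregator would need a far larger table built from the values its data providers actually supply.

```python
import re

# Hypothetical mapping from provider-specific DC Type values to a small
# shared vocabulary that a search limiter could use.
TYPE_MAP = {
    "html": "text",
    "preprint": "text",
    "e-print": "text",
    "photograph": "image",
    "sheet music": "notated music",
}

# A four-digit year between 1000 and 2099 that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def normalize_type(value: str) -> str:
    """Map a raw DC Type value to the shared vocabulary, or "unknown"."""
    return TYPE_MAP.get(value.strip().lower(), "unknown")


def normalize_year(value: str):
    """Return the first plausible year found in a raw DC Date value, or None."""
    match = YEAR.search(value)
    return int(match.group(1)) if match else None


for raw in ["HTML", " Preprint ", "Map"]:
    print(repr(raw), "->", normalize_type(raw))
for raw in ["c1902.", "2003-05-14", "[ca. 1890-1900]", "undated"]:
    print(repr(raw), "->", normalize_year(raw))
```

Even this toy version shows why subject normalization is so much harder: type and date have a small target space, whereas subject terms drawn from many vocabularies do not.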

Metadata Formats

In the same vein, harvesting a data repository's additional metadata formats (beyond unqualified Dublin Core) can be difficult. For a large service provider with a standard method for processing harvested metadata, including new formats involves adding paths to the processing routines. The more formats there are, the more complex the processing becomes. Additionally, large service providers may have developed interfaces conforming to the simple Dublin Core standard and may not be able to integrate more complex and more varied formats. For this, service providers need more comprehensive plans and better internal support.
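The "additional paths" can be pictured as one extraction routine per metadata format, selected by metadataPrefix. The sketch below pulls a title from either an oai_dc record or a MARCXML record; the function names and the dispatch table are hypothetical, and the prefix strings used as keys are illustrative, since prefix names other than oai_dc vary from repository to repository.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element namespace
MARC_NS = "http://www.loc.gov/MARC21/slim"   # MARCXML namespace


def title_from_oai_dc(record):
    """Return the first dc:title in an oai_dc record, or None."""
    element = record.find(f".//{{{DC_NS}}}title")
    return element.text if element is not None else None


def title_from_marcxml(record):
    """Return MARC field 245, subfield a, from a MARCXML record, or None."""
    for field in record.iter(f"{{{MARC_NS}}}datafield"):
        if field.get("tag") == "245":
            for subfield in field.findall(f"{{{MARC_NS}}}subfield"):
                if subfield.get("code") == "a":
                    return subfield.text
    return None


# One processing path per format; each new format means a new entry here.
EXTRACTORS = {
    "oai_dc": title_from_oai_dc,
    "marc21": title_from_marcxml,
}


def extract_title(metadata_prefix, xml_text):
    """Dispatch to the extraction routine registered for this format."""
    if metadata_prefix not in EXTRACTORS:
        raise ValueError(f"no processing path for format {metadata_prefix!r}")
    return EXTRACTORS[metadata_prefix](ET.fromstring(xml_text))
```

Every field the service indexes needs a routine like these for every format it accepts, which is how the complexity grows with the number of formats.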

OAI Data Provider Implementation Practices

The OAI protocol is flexible in that there are relatively few required pieces for implementation: valid responses to OAI verbs, the use of oai_dc, a unique and persistent OAI identifier, and a date stamp. The OAI Guidelines for Implementation have a limited technical scope, are intended for a general audience of implementers, and do not describe the consequences of not implementing some of the optional features of the protocol (Lagoze, Van de Sompel, Nelson, & Warner, 2002b). This has meant that many of the features of OAI, such as sets, use of descriptive containers, etc., that are quite helpful for service providers, have been underutilized. Data providers also need to be aware of how their implementation of required items such as date stamps impacts service providers.
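One way to see why date stamps and sets matter to service providers is to look at the request a harvester sends for selective, incremental harvesting. The sketch below builds an OAI-PMH ListRecords URL using the protocol's from, until, and set arguments; the function name and the base URL http://repository.example.org/oai are placeholders invented for this example.

```python
from datetime import date
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request for a selective harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date.isoformat()    # day granularity, YYYY-MM-DD
    if until_date is not None:
        params["until"] = until_date.isoformat()
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Ask only for records in one set whose date stamp is on or after the
# harvester's last visit.
print(list_records_url(
    "http://repository.example.org/oai",  # hypothetical base URL
    from_date=date(2004, 11, 1),
    set_spec="XSLStylesheets",
))
```

A request like this returns only what the repository's date stamps say has changed, so a data provider whose date stamps do not track real changes to its records can cause an incremental harvester to miss updates or to re-harvest everything.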

Communication Issues

The OAI community is very loosely federated. There are general and technical listservs available through the Open Archives Initiative. However, as some of the issues above illustrate, a serious need for best practices and guidelines exists for both data and service providers. An informal community of service providers has appeared who advise each other on the technicalities of performing harvesting and maintaining their service. While this ad hoc community is welcome, a more formal method of communication between data and service providers is needed.

FUTURE DIRECTIONS

We have discussed above just some of the current developments in the OAI community. Below we outline some future directions. This list is not meant to be all inclusive but rather a taste of some of the ongoing research and practices in the OAI community.

Best Practices

As indicated above, service providers face serious challenges in both their harvesting and aggregating activities. The development of community-specific best practices and implementation guidelines has been an important part of OLAC and other domain-based service providers. A group of service providers within the Digital Library Federation (DLF) has now begun work on some more general best practices to be used with the DLF and beyond.

Static Repository Gateway

The technical hurdle is still sometimes too great for potential data providers. The Static Repository Gateway, developed at the Los Alamos National Laboratory, is the most recent option for OAI data providers and provides a very low entry point (Van de Sompel, Lagoze, Nelson, & Warner, 2002; Hochstenbach, Jerez, & Van de Sompel, 2003). Essentially, a resource developer can post a single large XML file containing the metadata and OAI wrappers on its Web server. This file can be accessed through an OAI gateway service. Currently two service providers, UIUC and the University of Michigan, have been working to shepherd potential data providers to one gateway, which has proved very simple for both the service and data providers.

Mod_oai Project

The mod_oai project, funded by the Andrew W. Mellon Foundation, is developing a tool that makes content that is accessible from Apache open-source Web servers available through the OAI protocol. This tool will essentially extend the benefits of selective and incremental harvesting available through the OAI protocol to the general Web community (Mod_oai, n.d.).

OAI-rights

The OAI-rights committee is working toward a means of incorporating structured rights statements about the resources exposed (that is, the metadata) through the protocol (Lagoze, Van de Sompel, Nelson, & Warner, 2003). The committee does not intend to define a new rights language but only to provide the means of communicating a structured, defined language within the protocol.

Controlled Vocabularies and OAI

Controlled vocabularies will become more important as data and service providers try to cope with the chaos that develops from aggregating metadata from diverse sources. Controlled vocabularies will become particularly important within self-archiving systems such as institutional repositories and e-print archives (many of which are also OAI data providers); in many cases there is no cataloger to exert quality and authority control. A lightweight solution to this would be for authority agencies to mount their thesauri as an SRW/U search service, register it with the UIUC registry, and use ERRoLs to provide an HTML interface and URL access to items in the repository (OCLC, n.d. a).

SRW/U-to-OAI Gateway to the ERRoL Service

This service will allow institutions to load their data as an SRW/U search service, register it with the UIUC gateway, and automatically get OAI-PMH and ERRoL functionality for free. The OCLC Research Publications OM repository is the first demonstration of this. This configuration adds searching capability to the mix of ERRoL features (OCLC, n.d. b).

REFERENCES

Arms, W. Y., Dushay N., Fulker, D., & Lagoze, C. (2003). A case study in metadata harvesting: The NSDL. Library Hi Tech, 21(2), 228-237.

Berners-Lee, T. (1998). Hypertext style: Cool URIs don't change. Retrieved November 20, 2004, from http://www.w3.org/Provider/Style/URI.

Brogan, M. (2003). A survey of Digital Library Aggregation services. Washington, DC: Digital Library Federation. Retrieved November 20, 2004, from http://www.diglib.org/pubs/brogan/.

Davison, S., Requardt, C., & Brancolini, K. (2003). A Specialized Open Archives Initiative harvester for sheet music: A project report and examination of issues. Paper presented at the fourth International Conference on Music Information Retrieval, October 26-30, 2003, Baltimore, MD. Retrieved November 20, 2004, from http://ismir2003.ismir.net/papers/Davison.PDF.

Experimental OAI Registry at UIUC. (n.d.). Retrieved November 20, 2004, from http://oai.grainger.uiuc.edu/registry/.

Extensible Repository Resource Locators (ERROLs) for OAI Identifiers. (n.d.). Retrieved November 20, 2004, from http://www.oclc.org/research/projects/oairesolver/.

Google. (n.d.). Google Web APIs: Develop your own applications using Google. Retrieved November 20, 2004, from http://www.google.com/apis/.

Hagedorn, K. (2003). OAIster: A "no dead ends" OAI service provider. Library Hi Tech, 21(2), 170-181.

Halbert, M. (2003). The Metascholar Initiative: AmericanSouth.Org and MetaArchive.Org. Library Hi Tech, 21(2), 182-198.

Hochstenbach, P., Jerez, H., & Van de Sompel, H. (2003). The OAI-PMH static repository and static repository gateway. In C. C. Marshall, G. Henry, & L. Delcambre (Eds.), Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries: May 27-31, 2003, Houston, Texas (pp. 210-217). Los Alamitos, CA: IEEE Computer Society.

Lagoze, C., Hoehn, W., Millman, D., Arms, W., Gan, S., Hillmann, D., et al. (2002). Core services in the architecture of the National Science Digital Library (NSDL). In G. Marchionini & W. R. Hersh (Eds.), Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries: July 14-18, 2002, Portland, Oregon (pp. 201-209). New York: ACM Press.

Lagoze, C., & Van de Sompel, H. (2001). The Open Archives Initiative: Building a low-barrier interoperability framework. In E. A. Fox & C. L. Borgman (Eds.), Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries: June 24-28, 2001, Roanoke, Virginia, USA (pp. 54-62). New York: ACM Press.

Lagoze, C., & Van de Sompel, H. (2003). The making of the Open Archives Initiative protocol for metadata harvesting. Library Hi Tech, 21(2), 118-128.

Lagoze, C., Van de Sompel, H., Nelson, M., & Warner, S. (2002a). The Open Archives Initiative protocol for metadata harvesting--Version 2.0. Retrieved November 20, 2004, from http://www.openarchives.org/OAI/openarchivesprotocol.html.

Lagoze, C., Van de Sompel, H., Nelson, M., & Warner, S. (2002b). Implementation guidelines for the Open Archives Initiative protocol for metadata harvesting. Retrieved November 20, 2004, from http://www.openarchives.org/OAI/2.0/guidelines.htm.

Lagoze, C., Van de Sompel, H., Nelson, M., & Warner, S. (2003). OAI-Rights white paper. Retrieved November 20, 2004, from http://www.openarchives.org/documents/OAIRightsWhitePaper.html.

Linguistic Data Consortium. (n.d.). Home page. Retrieved November 20, 2004, from http://wave.ldc.upenn.edu/.

Linguist List. (n.d.). Home page. Retrieved November 20, 2004, from http://cf.linguistlist.org/.

McKiernan, G. (2003a). E-profile: Open Archives Initiative service providers. Part I: Science and technology. Library Hi Tech News, 20(9), 30-38.

McKiernan, G. (2003b). E-profile: Open Archives Initiative service providers. Part II: Social sciences and humanities. Library Hi Tech News, 20(10), 24-31.

McKiernan, G. (2004). E-profile: Open Archives Initiative service providers. Part III: General. Library Hi Tech News, 21(1), 38-46.

Mod_oai. (n.d.). Home page. Retrieved November 20, 2004, from http://www.modoai.org/.

National Science Digital Library. (n.d.). Home page. Retrieved November 20, 2004, from http://www.nsdl.org/.

OCLC. (n.d. a). GSAFD thesaurus. Retrieved November 20, 2004, from http://errol.oclc.org/gsafd.oclc.org.html.

OCLC. (n.d. b). OCLC research publications. Retrieved November 20, 2004, from http://errol.oclc.org/orpubs.oclc.org.html.

Open Archives Initiative. (n.d. a). FAQ. Retrieved November 20, 2004, from http://www.openarchives.org/documents/FAQ.html.

Open Archives Initiative. (n.d. b). Registered data providers. Retrieved November 20, 2004, from http://www.openarchives.org/Register/BrowseSites.pl.

Open Language Archives Community. (n.d. a). Home page. Retrieved November 20, 2004, from http://www.language-archives.org/.

Open Language Archives Community. (n.d. b). Participating archives. Retrieved November 20, 2004, from http://www.language-archives.org/archives.php4.

Sheet Music Consortium. (n.d.). Home page. Retrieved November 20, 2004, from http://digital.library.ucla.edu/sheetmusic/.

Sherman, C., & Price, G. (2003). The invisible Web: Uncovering sources search engines can't see. Library Trends, 52(2), 282-298.

Shreeves, S. L., Kaczmarek, J. S., & Cole, T. W. (2003). Harvesting cultural heritage metadata using the OAI protocol. Library Hi Tech, 21(2), 159-169.

Simons, G., & Bird, S. (2003). Building an Open Language Archives Community on the OAI foundation. Library Hi Tech, 21(2), 210-218.

SRW-Search/Retrieve Web Service. (n.d.). Home page. Retrieved November 20, 2004, from http://www.loc.gov/z3950/agency/zing/srw/.

Van de Sompel, H., Lagoze, C., Nelson, M., & Warner, S. (2002). Implementation guidelines for the Open Archives Initiative for Metadata Harvesting: The OAI static repository and static repository gateway. Retrieved November 20, 2004, from http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm.

Van de Sompel, H., Young, J. A., & Hickey, T. B. (2003). Using the OAI-PMH ... differently. D-Lib Magazine, 9(7/8). Retrieved November 20, 2004, from http://www.dlib.org/dlib/july03/young/07young.html.

Warner, S. (2003). E-prints and the Open Archives Initiative. Library Hi Tech, 21(2), 151-158.

Waters, D. (2001). The metadata harvesting initiative of the Mellon Foundation. ARL Bimonthly Report, 217. Retrieved November 20, 2004, from http://www.arl.org/newsltr/217/waters.

ZeeRex. (n.d.). What is ZeeRex? Retrieved November 20, 2004, from http://explain.z3950.org/.

Zia, L. L. (2001). Growing a national learning environments and resources network for science, mathematics, engineering, and technology education. D-Lib Magazine, 7(3). Retrieved November 20, 2004, from http://www.dlib.org/dlib/march01/zia/03zia.html.

Sarah L. Shreeves, Visiting Assistant Professor of Library Administration, IMLS DCC, University of Illinois at Urbana-Champaign, 1301 W. Springfield, Room 52, Urbana, IL 61801; Thomas G. Habing, Research Programmer, Room 155, Grainger Engineering Library, University of Illinois at Urbana-Champaign, 1301 W. Springfield Ave., Urbana, IL 61801; Kat Hagedorn, OAIster/Metadata Harvesting Librarian, University of Michigan Libraries, 920 North University Ave., Ann Arbor, MI 48109; and Jeffrey A. Young, Software Architect, Online Computer Library Center (OCLC), 6565 Frantz Rd., Dublin, OH 43017.
COPYRIGHT 2005 University of Illinois at Urbana-Champaign
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2005, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Author:Young, Jeffrey A.
Publication:Library Trends
Geographic Code:1USA
Date:Mar 22, 2005
Words:6102