Collaboration enabling Internet resource collection-building software and technologies.

ABSTRACT

Over the last decade the Library of the University of California, Riverside, and its collaborators have developed a number of systems, service designs, and projects that utilize innovative technologies to foster better Internet finding tools in libraries and more cooperative and efficient effort in Internet link and metadata collection building. The open-source software and projects discussed represent appropriate technologies and sustainable strategies that we believe will help Internet portals, digital libraries, virtual libraries, library catalogs-with-portal-like-capabilities (IPDVLCs), and related collection-building efforts in academia to better scale and more accurately anticipate and meet the needs of scholarly and educational users.

**********

Our work and its intent are best introduced by an overview of the projects, services, and software that we have been working on for the last several years: iVia, INFOMINE, and Data Fountains. iVia will be described in depth from the standpoints of its overall system, content and uses supported, end-user features, content development and management features for institutional collaborators, features for individual expert content builders, and incentives for collaborative collection building.

IVIA

iVia (http://infomine.ucr.edu/iVia/) is a portal or virtual library collection-building software platform (Mitchell et al., 2003). It was designed to support multiple institutions and projects in collaborative collection-building efforts.
The system (or components) is used by INFOMINE and the National Science Digital Library (NSDL) of the National Science Foundation, among others. The software, written primarily in C++, is licensed as open source and is available to all. iVia features a very large number of custom-configurable user interfaces and information retrieval options to support the institutional identity management (that is, branding) and user finding needs of diverse, collaborating organizations. Institutional collaborators will also be able to avail themselves of multiple metadata creation options, including support for multiple "production lines" and levels of editorial control. Resource- and labor-saving machine assistance is featured and used to semi- and/or fully-automate a number of tasks in both Internet resource identification and metadata generation. The former is made possible through new work in focused crawling and the latter through innovations in automated classification (which include the assignment of Library of Congress Subject Headings
[LCSH] and Library of Congress Classifications [LCC]). iVia support has come from the Library of the University of California at Riverside, the U.S. Institute of Museum and Library Services (IMLS), NSDL, and the Fund for the Improvement of Post-Secondary Education of the U.S. Department of Education (FIPSE).

[FIGURE 1 OMITTED]

INFOMINE

The INFOMINE (http://infomine.ucr.edu) virtual library service was conceived from inception as a multi-institutional, collaborative effort and has served the academic community since 1994. It has the mission of identifying, describing, and therefore making visible and useful to the academic community the significant scholarly and educational resources on the Internet. More than 230,000 resources populate the collection.
These represent all major academic research disciplines and are the product of the collaborative efforts of librarians, faculty, and graduate students at the University of California (Riverside, Los Angeles, Santa Cruz, and Irvine campuses), Wake Forest University, and California State University.

INFOMINE draws upon a hybrid collection design that consists of metadata created by (1) subject experts (at INFOMINE and at collaborating institutions); (2) machine processes or machine processes with expert refinement; and (3) external collaborating institutions that share data streams of records, which are imported through OAI-PMH or other means, translated as needed, and then added to the INFOMINE collection (for example, MARC records of the University of California Shared Cataloging Project and Dublin Core records from some collections within the NSDL). INFOMINE is a collection of records with rich metadata. For example, the subject and keyword terms applied in expert-created records to describe resource themes are much more numerous than in standard library catalogs. INFOMINE is used for both end-user searching and collection development on the part of other Internet portals, digital libraries, virtual libraries, and library catalogs-with-portal-like-capabilities (IPDVLCs).
It uses iVia software as its system platform. INFOMINE support has come from the Library of the University of California at Riverside and the collaborating libraries mentioned above, as well as from IMLS, NSDL, and FIPSE.

DATA FOUNTAINS

Data Fountains (http://infomine.ucr.edu/Data_Fountains/) is an open-source software system and a service for automated or semi-automated Internet resource discovery and metadata generation. Based on the iVia system, it expands beyond iVia considerably by creating an array of independent, though federated, collection-building systems for collaborating projects with the goal of generating the basic "ore" (links to important Internet resources and associated metadata records and rich full text) for these projects. It also improves upon core crawling and classification techniques. Each collaborating project and/or subject community works with and fine-tunes its own Data Fountain, that is, its own set of focused crawler(s) and classifier(s). The records and full text derived are exported to and utilized within the collaborator's own native interface, backend system, and databases. In iVia these crawlers and classifiers are shared, as is the backend.
Expert-machine interaction, which relies upon the subject domain expertise and the wisdom and conventions in collection building of participating librarians, is emphasized more in Data Fountains than currently in iVia and should result in more accurate content. That is, semi-automated approaches are more fully designed into and featured in the system and are critical to improving its performance. Given that Data Fountains is currently under development, much of the following instead addresses iVia, its close relative. Data Fountains work is supported by IMLS and the Library of the University of California at Riverside. Please contact us if you are interested in implementing Data Fountains in your project.

[FIGURES 2-3 OMITTED]

COLLABORATIVE SERVICE AND PARTICIPATORY TECHNOLOGY DEVELOPMENT

We designed the technology behind the INFOMINE, iVia, and Data Fountains projects to enable and facilitate cooperative service building and effort. That is, we wanted the technology providing the foundation for these systems to be collaborative and participatory and to gain significant increases in accuracy and resource savings through this. While the system strongly supports fully automated and fully manual processes for collection building, the technology also supports semi-automated processes emphasizing interactive subject domain expertise. We see our work as building machine-assisted IPDVLC community-ware. We are developing and bringing to the library community new, machine learning-based technologies that are

* Enabling: These technologies provide systems that scale better in the Internet environment and save expert labor and other resources.
They enable collaborative efforts of many types at the same time that they are supportive of multiple modes of collection building and user access. These technologies also enable us to reduce redundant effort by better distributing collection and metadata development efforts among similar projects.
* Participatory: Collaborating institutions, as co-designers, participate in developing and customizing the software to fit their needs (for example, in interface, data views/landscapes, record creation, and retrieval). Collaborators work in co-designing systems that emphasize identifying, enhancing, and/or developing synergies among collaborating projects. This is done as well by identifying promising expert/machine processes and interactions that will augment and improve the performance of both. Experts actively participate in improving machine processes and vice versa.
* Supportive of Librarian Community Expertise, Values, and Effort: These technologies help amplify and facilitate the transfer of academic librarian subject expertise, organizing expertise, public domain orientation, objectivity, service orientation, and other scholarly and educational community values and capabilities into efficient and effective Internet-based information. Tools such as iVia allow us to build very useful collections that are based on and express our considerable wealth of knowledge in subject domains, fully featured interfaces, sophisticated (that is, precise) user access, and rich, well-organized metadata.

While Google-level accuracy and approaches suffice for many information-finding needs, they do not generally serve the in-depth finding needs of academics.
Google may partially "disintermediate" the role of the expert librarian in some areas, but, in the long term, this will not extend to areas where superior information quality, sophisticated access, and accurate provenance verification are critical to major research and fact-finding efforts. It is incumbent upon the library community to work with this technology, to adapt it to its needs, and to come to own it just as physical collections usually own the facilities in which they are located. This is what our projects are about: bringing public domain community-ware and machine-learning technology in resource discovery and metadata generation, among other areas, into the library.

FOCUS ON IVIA--AN OPEN-SOURCE SOFTWARE PLATFORM FOR COLLABORATIVE INTERNET COLLECTION BUILDING

Hardware

The following hardware supports the INFOMINE application of iVia:

* Public search interface server: end-user and content-builder (including expert-guided crawler) interfaces are supported
* Public search interface server backup
* Database server (both the metadata and full-text databases are here)
* Database server backup
* Crawler/classifier processes server (for example, vlcrawler, Nalanda iVia focused crawler)
* OAI import/export server
* Additional mass storage equipment: 2 terabytes of storage including a RAID array (1 terabyte
of storage) accessible via Network File System (NFS) (networked storage)

A standard machine would have an AMD XP 3200+ CPU, 1.5 GiB of high-speed RAM, and 80 GB of disk storage.

Software

iVia software is licensed as open source (GNU GPL and LGPL). Open-source software is free software intended to be of use to and be further developed and refined by its users. In iVia's case this would be users in the library and Internet portal community. The open-source approach enables institutions to pool resources and inexpensively develop and refine software that meets their needs. In fact, in addition to the software we have developed, our system is based on many very successful and well-known open-source packages, including the Linux operating system (including Debian, RedHat, and Suse variants), MySQL and Berkeley DB
database management packages, and Apache Web server software. iVia code is in C++, this being one of the most powerful, flexible, and standardized of programming languages. Some of our interface code is in Java. Currently the iVia program size is close to 10 MB (>230k lines).

Standards

iVia is based on standards. Metadata standards include Dublin Core and MARC (we use Dublin Core but can translate from/to MARC). Subject schema standards include Library of Congress Subject Headings (LCSH) and Library of Congress Classifications (LCC), these long being standards in the U.S. academic library. Using these will eventually allow iVia, as finding tool software, seamless subject access (no translations involved) to both the Internet and print records of knowledge. For data transfer among collaborators, iVia uses the Open Archives Initiative (OAI-PMH) approach as well as standard delimited formats (SDF).
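To make the OAI-PMH transfer concrete, the following is a minimal sketch of how a harvester might build a ListRecords request and read Dublin Core titles from the response. It is not iVia code (iVia itself is written in C++). The base URL and the sample response are hypothetical, while the verb, metadataPrefix, and from arguments and the two XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed since this datestamp.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return every Dublin Core title found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [title.text for title in root.iterfind(".//dc:title", NS)]


# Hypothetical response body, trimmed to a single record for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample resource</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # "http://example.org/oai" is a placeholder, not a real iVia endpoint.
    print(list_records_url("http://example.org/oai", from_date="2005-01-01"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would also follow the resumptionToken that repositories return when a result set is split across several responses. That step is omitted here for brevity.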
OAI-PMH is also used internally to transfer and harvest records from our crawling and classification databases and our user databases.

Fields Supported

Forty-seven fields are supported in our database. Of most direct value to users are URL, title, alternative title, creator (author), subject--LCSH, subject--LCC, keywords, description, selected full-text (1-3 pages of rich text), MyI (a field that helps institutions create custom data views), and local URL (often of value for collaborators in accessing fee-based material). Other fields of note are general subject categories (for example, biological, agricultural, and medical sciences); created at; created by; modified by; last modified by; access restrictions; restricted to; publisher; audience levels; resource types; language; coverage begin; and coverage end.

Content Managed

Format types represented through iVia include HTML resources and, shortly, PDF, PostScript, and others. Metadata as well as representative, rich full text is generated or harvested from the resource being described and makes up the content of our databases.
This data represents free and fee-based resources and includes resource types as varied as digital libraries, other virtual libraries and portals, e-journals, e-books, e-print archives, databases, hypertext fiction, maps, and more. Content retrieval is robust and quick. Berkeley DB indexing capabilities are used to augment performance through MySQL.

iVia Uses

Major applications of iVia to date have included INFOMINE, one of the first Web-based services offered by a library. INFOMINE (an Internet resources virtual library-type finding tool) has been supported by iVia in serving academic researcher and student end users both nationally and at specific institutions (for example, the University of California at Riverside and Wake Forest University). Collection development for others has been another major function, with many other academic virtual libraries using iVia/INFOMINE as a resource discovery service for their own collection-building efforts. iVia/INFOMINE is also used by librarians in creating Web-based subject guides or pathfinders in various subjects (this is facilitated through using our "canned search" generator and MyI field), as well as by faculty creating Web resource modules on their course pages in support of curriculum units. While INFOMINE has been the major application of iVia so far, with most aspects of iVia as described in this article being applied in INFOMINE, we have been working with the National Science Digital Library (NSDL) to develop an NSDL iVia.
Among the major goals of this project is the integration of our Web crawlers and classification software into NSDL's core system for purposes of open Internet resource discovery and related classification (that is, resource identification and metadata generation). Just as crucial here will be the use of this software to generate metadata for existent, "deep Web" collections (for example, article databases or e-print collections or other databases where access is through a search front-end) in many different document formats other than HTML.

IVIA USER FEATURES

Through the INFOMINE application, iVia has demonstrated sophisticated and flexible user features geared toward varying levels of searching expertise. Most searchers will use defaults that are transparent to them as they use the basic search (http://infomine.ucr.edu/). Librarians, information specialists, and researchers may choose to use the many user-configurable features found in Advanced Search and Browse (http://infomine.ucr.edu/cgi-bin/search). Advanced Search and Browse features are present in each individual collection (for example, http://infomine.ucr.edu/cgi-bin/search?category=bioag).

In more detail, iVia's search and browse features include the following: multiple subject and resource type collections or categories, including Biological, Agricultural and Medical Sciences; Business and Economics; Cultural and Ethnic Diversity; E-journals; Government Information; Maps and GIS; Physical Sciences, Engineering, Computer Science, and Math; Social Sciences and Humanities; and Visual and Performing Arts.
The availability of standardized, fielded metadata, as well as rich full-text, enables advanced searching capabilities including Boolean (for example, and, or, not) and Proximity operators (for example, near 1-20); exact searching using quotes or stem searching using asterisk; nested searching using parentheses; and various types of limit searching. One can limit to expert or expert plus robot-originated records (the latter being those that have been automatically identified and described), or combine general subject categories (for example, BioAgMed or E-journals), any combination of fields (for example, title, keywords, subjects, and/or description, and so on), resource type (for example, article databases, electronic journals, or e-print collections), and/or type of access to resource (such as free, fee, or a mix). In iVia, search interfaces are presented on the bottom of each results page if search modification is desired. In the event of zero result searches, spelling is checked and possible spelling alternatives are suggested. Finally, in full display, most indexing terms are presented as links, which can be clicked on to narrow or broaden a user's search.

Browse indexes are available for both all subject categories and individual subject categories. Specific browse indexes are available for titles, creators (including authors), subjects--LCSH, subjects--LCC, keywords (these often include minor subjects and lay-person terminology), resource types (for example, standards, style manuals) and What's New! (that is, recent expert additions to the collection).
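The search operators listed above can be illustrated with a small sketch. This is a toy under stated assumptions, not the iVia query engine. The function names are hypothetical, "near N" is assumed to mean that two terms occur within N word positions of each other, and a trailing asterisk is assumed to mean a prefix (stem) match. Boolean and nested searching are expressed with Python's own and, or, not, and parentheses.

```python
import re


def tokenize(text):
    """Lowercase a field value and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_positions(tokens, term):
    """Token positions matching a term; a trailing '*' requests a stem match."""
    term = term.lower()
    if term.endswith("*"):
        stem = term[:-1]
        return [i for i, tok in enumerate(tokens) if tok.startswith(stem)]
    return [i for i, tok in enumerate(tokens) if tok == term]


def has_term(tokens, term):
    """True if the term (exact or stemmed) occurs anywhere in the tokens."""
    return bool(term_positions(tokens, term))


def near(tokens, first, second, distance):
    """Proximity search: both terms occur within `distance` positions."""
    return any(
        abs(i - j) <= distance
        for i in term_positions(tokens, first)
        for j in term_positions(tokens, second)
    )


def phrase(tokens, words):
    """Exact (quoted) search: the words occur adjacently and in order."""
    words = [w.lower() for w in words]
    span = len(words)
    return any(tokens[i:i + span] == words for i in range(len(tokens) - span + 1))


if __name__ == "__main__":
    description = tokenize("Soil erosion in agricultural watersheds of California")
    # Nested Boolean query: (soil near/5 watershed*) and not "air pollution"
    matched = near(description, "soil", "watershed*", 5) and not phrase(
        description, ["air", "pollution"]
    )
    print(matched)
```

Keeping each operator as a separate predicate over token positions is one simple design. A production engine would normally evaluate the same operators against an inverted index rather than scanning each record.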
Records are displayed in three formats: title only, regular (title, description, and origin of record as either expert or robot created), and long (accessed by clicking on "More Info" in the full display). The long format includes a great number of fields of interest to users or collection builders including URL, title, description, broad subject categories, creators, subject--LCSH, subject--LCC, keywords, access, audience level (academic, K-12, or lifelong learner), institutional owner (which collaborator contributed the record if expert in origin), URL checker information, and INFOMINE collection information (mostly for record keeping: who added, who modified, record number, record origin). Results pages can be displayed in groups of 30, 50, or 100. They can be ordered alphabetically by title or by relevance to the query as judged by how many query terms were hits, how many were hits in major or minor fields (for example, title being more highly weighted than keyword, which is more highly weighted than full text), and whether terms in a specified phrase were found in exact or approximate adjacency.

IVIA CONTENT DEVELOPMENT AND MANAGEMENT--FEATURES, TOOLS, AND MACHINE ASSISTANCE FOR INSTITUTIONAL COLLABORATORS AND EXPERT CONTENT BUILDERS

iVia emphasizes numerous innovations for improving and making more efficient collection development and management efforts for both individual and multiple collaborating projects. These translate into significant labor and resource savings in building collections. These innovations can be best understood from the standpoints of institutional collaborators and individual experts creating new content, as detailed below.
Support for Institutional Collaborators

Institutional identity management or branding is important for iVia collaborators. Access to collaborative resources needs to reflect, within reason, the established ongoing Web presence and interface of the collaborating institution. To this end iVia provides multiple interfaces and methods of accessing data in collections it supports. The user interfaces and desired data views of collaborator project sites are supported. For example, the interface that the user is accessing from can be detected by iVia, which activates searching and other interface capabilities that meet existent profiles set up for this by the collaborating institution. Access is also enabled for selected external collections that rely on metasearching.

Custom Data Views and Access Supported

iVia provides pre-constructed interface modules that can be quickly assembled and customized by collaborators in building interfaces to iVia data. These interface modules reflect the themes and presentation of the collaborating project while still taking full advantage of unique iVia retrieval and other user features. The suite of programs that facilitate this is known as "Theme-ing." Special fields, such as MyI (which allows institutions to create custom data views), support Theme-ing and custom interface access. For example, retrieval filters can be created by participating institutions to channel user searches through selected subsets of iVia data (for example, perhaps only the records for fee-based resources in the collection that have been subscribed to by the particular institution). This is done by identifying and tagging, in the MyI field, those records that the institution wants its users to view. Parallel fields are also supported for similar reasons. For example, some collaborators want short descriptions and others long. Hence, there are two parallel description fields. Users coming from the institution desiring short descriptions will see only these.
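A minimal sketch may help show how MyI tagging and parallel description fields could combine into an institution-specific data view. The record layout, profile fields, tags, and function name are all hypothetical and chosen only for illustration. The text above specifies only that records are tagged in the MyI field and that two parallel description fields exist.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    title: str
    short_description: str
    long_description: str
    # Hypothetical representation of MyI: a set of institution tags.
    myi: set = field(default_factory=set)


@dataclass
class InstitutionProfile:
    myi_tag: str        # tag this institution places in the MyI field
    wants_short: bool   # which of the two parallel descriptions to show


def institution_view(records, profile):
    """Return (title, description) pairs visible to one institution's users."""
    view = []
    for rec in records:
        if profile.myi_tag not in rec.myi:
            continue  # record was not tagged for this institution
        description = (
            rec.short_description if profile.wants_short else rec.long_description
        )
        view.append((rec.title, description))
    return view


# Invented sample data; "ucr" and "wfu" are illustrative tags only.
RECORDS = [
    Record("Soils Database", "Soil data.",
           "A fee-based database of soil survey data.", {"ucr"}),
    Record("Open Maps", "Free maps.",
           "A free collection of topographic maps.", {"ucr", "wfu"}),
]

if __name__ == "__main__":
    print(institution_view(RECORDS, InstitutionProfile("wfu", wants_short=True)))
```

In this sketch the filter runs before any search logic, so a user's query only ever sees the subset that the institution has tagged.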
Metasearching Access

iVia also enables access to its content through the interfaces of selected, completely external finding tools, which rely on general methods of metasearching. For example, the Ex Libris online public access library catalog system provides access to INFOMINE content, as does the California Digital Library Searchlight system. The advantage of metasearching is that large numbers of diverse collections from multiple projects can be searched simultaneously. However, significant downsides exist because of the need to include generally very simplified, lowest common denominator searching of only the shared fields among the databases searched, which can be very few; this eliminates search access to unique, useful fields. Another problem is the limited ability to eliminate duplicate, overlapping results returned from the databases searched.
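Both downsides can be shown in a few lines. The sketch below is an illustration under stated assumptions, not the behavior of any particular metasearch product. The database names and field lists are invented, and duplicates are detected by a deliberately crude URL normalization that would miss many real-world duplicates.

```python
from urllib.parse import urlsplit


def shared_fields(schemas):
    """Fields every target database supports: the lowest common denominator."""
    return set.intersection(*(set(fields) for fields in schemas.values()))


def normalize_url(url):
    """Crude duplicate key: lowercase host, path without trailing slash, query."""
    parts = urlsplit(url.strip())
    key = parts.netloc.lower() + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_results(result_lists):
    """Merge result lists in order, keeping the first record seen per URL key."""
    seen = set()
    merged = []
    for results in result_lists:
        for record in results:
            key = normalize_url(record["url"])
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


# Invented schemas for three hypothetical target databases.
SCHEMAS = {
    "virtual_library": ["title", "creator", "subject_lcsh", "myi"],
    "library_catalog": ["title", "creator", "subject_lcsh", "call_number"],
    "eprint_archive": ["title", "creator", "abstract"],
}

if __name__ == "__main__":
    print(sorted(shared_fields(SCHEMAS)))
    print(merge_results([
        [{"url": "http://Example.org/data/", "source": "virtual_library"}],
        [{"url": "http://example.org/data", "source": "library_catalog"}],
    ]))
```

With these invented schemas only title and creator survive the intersection, so fields such as LCSH subjects or MyI become unsearchable through the metasearch interface even though two of the three databases hold them.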
Multiple Modes of Content-Building Supported

Collaborating institutions may have been building Internet resource collections for some time and may have established ways or styles of doing things. iVia takes this into account by providing multiple means for new collaborators to ramp up and begin creating content in ways with which they are comfortable. To this end iVia supports from one to three levels of editorial review as well as a pending record database that holds records in the process of being built and reviewed prior to their being approved and moved to the main working database. Some collaborators use just one level of review, that of the editor of the subject file (for example, the BioAgMed file in INFOMINE). Others have developed a well-defined division of labor whereby catalogers review the subject content of records created by public service librarians or metadata specialists prior to review by the editor of the subject file.

Similarly, in support of various divisions of labor and optimum utilization of staff with varying skill sets, each content builder can be assigned a different level of access to iVia content-building features. Managing editors of a subject file have full permissions of many kinds, including batch deletes and batch changes, over the content of the whole database. Metadata specialists, on the other hand, may only be allowed to add content to the pending record database, with their records going through multiple levels of review before being added, by the subject file editor, to the working database.
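The review workflow and access levels just described might be modeled as in the following Python sketch. The role names, permission names, and the rule that the final required approval moves a record to the working database are simplifying assumptions made for illustration; they are not taken from the iVia code base.

```python
from dataclasses import dataclass, field

PENDING, WORKING = "pending", "working"

# Hypothetical role-to-permission table mirroring the division of labor above.
ROLE_PERMISSIONS = {
    "metadata_specialist": {"add_pending"},
    "cataloger": {"add_pending", "review"},
    "managing_editor": {"add_pending", "review", "batch_edit", "batch_delete"},
}


def can(role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


@dataclass
class PendingRecord:
    title: str
    reviews_required: int  # one to three levels, chosen by the collaborator
    approvals: list = field(default_factory=list)
    database: str = PENDING

    def __post_init__(self):
        if not 1 <= self.reviews_required <= 3:
            raise ValueError("reviews_required must be between 1 and 3")


def approve(record, reviewer):
    """Log one approval; move the record to the working database when complete."""
    if record.database != PENDING:
        raise ValueError("record is not pending review")
    record.approvals.append(reviewer)
    if len(record.approvals) >= record.reviews_required:
        record.database = WORKING
    return record.database
```

A collaborator using a single level of review would create records with `reviews_required=1`, while one with a cataloger step before the subject file editor would use `reviews_required=2`.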
Hybrid Collections of Heterogeneous Metadata: Support for Multiple Incoming Data Streams and Types of Records

Just as one of the main benefits of collaboration in mutual content building is sharing the collection development load among participants, iVia also makes it possible to utilize the work of other collection-building projects that choose not to be an integral part of the project. To do this, iVia has a hybrid collection design that supports diverse, heterogeneous record types and record origins (Mason, Mitchell, Mooney, Reasoner, & Rodriguez, 2000). As manifested in the INFOMINE application of iVia, the system builds content by ingesting and threading together a number of diverse data streams.

The first of these is, of course, the records created within the iVia system by experts. Sources for these currently include content builders from the University of California at Riverside, UCLA, and individuals from other UCs; Wake Forest University; and California State University at Fresno and Sacramento. There are about 20,000 of these expert-built records internally created for and through INFOMINE's iVia system. INFOMINE's iVia also imports and, as needed, translates records from collaborating external data streams. For example, MARC records for Internet resources cataloged by the UC Shared Cataloging Project (SCP) are imported, translated to Dublin Core, and utilized (about 25,000 records in INFOMINE are of this origin). Through collaborators at UC Santa Cruz, Lexis Nexis serial titles are imported (accounting for close to 6,000 records). INFOMINE's iVia also uses OAI-PMH to import records from selected NSDL-associated collections (about 10,000). In total, INFOMINE holds close to 60,000 expert-created records, either created internally by closely allied institutions or created externally by sharing institutions and imported. All of these expert-driven data streams form a first tier of records in the architecture of iVia.

The second-tier collection supported by iVia consists of records that have been created automatically by crawler/classifier robots. There are also records that are of robot origin but that have been refined, augmented, and vetted by experts. This is an example of semi-automation, with experts receiving machine assistance in resource discovery and metadata development. Currently, there are three crawler/classifiers (to be described below) that have created over 170,000 records. As in Google, these records, while far from MARC-perfect, remain very useful and have been created relatively inexpensively. In the architecture of iVia they form a large second-tier collection that is used to support the first-tier collection of expert-built records. Complemented by the 60,000 expert-created records, INFOMINE's total collection size is around 230,000 records and growing rapidly.

Importantly, the content of iVia records ranges from just metadata to metadata augmented by selected, rich full text that has been robotically harvested from the resource itself. Judicious use of full text is of great help to user retrieval because it drastically increases the amount of material that can be searched and therefore the granularity or detail in searching that can be supported. Full text also helps correct for controlled subject vocabularies that are often too removed from common parlance and/or too general or specialized to adequately serve a wide variety of user audiences.

The collection designs discussed above have been very successful. They have been able to reflect and provide intelligent organization and access to content from many different sources and of many different types. In a world of multitudes of important collections and approaches to metadata, the iVia hybrid collection approach has been very useful for end-user access.

Support for Expert Content Builders

Just as iVia provides means for facilitating and aggregating the mutual efforts of multiple institutions, it also provides a great deal of time-saving machine assistance and other means of expediting the work of expert collection builders. Machine assistance is provided in new resource discovery (that is, collection development), metadata generation (that is, indexing), and in a great number of smaller collection-building tasks.

Machine Assistance through Automated and Semi-Automated Resource Discovery

Automated and semi-automated resource discovery (that is, collection development) is a major boost in collection building and saves experts time in finding relevant new resources. iVia uses several Web crawlers to scour the Web (or selected parts of it) to identify scholarly and educational resources of interest (Chakrabarti, 2003). The crawling technology can run fully automatically, but it has been built to include important roles for experts in guidance, refinement, and truing. For example, experts work with the crawlers to monitor and adjust resource acceptance weighting thresholds, or the criteria by which a crawler will identify a resource as relevant. Screening for duplicates or resources already in the database is a perennial challenge. This is done through automated means as well as through experts monitoring lists of potential duplicates found through either exact or fuzzy matches of title and URL information. For irrelevant sites that keep recurring in crawls, iVia content-builder community blacklists are maintained that prohibit future crawler visits.

For custom, finite crawls, we have built crawlers that are fully expert guided in the sense that well-defined crawling targets are provided by experts and crawling occurs in a very directed manner. iVia's "Expert Guided Crawler with Drill Down/Drill Out" takes expert-provided individual or multiple URLs and crawls them. Experts specify the number of levels down into a site that should be crawled (most sites being organized hierarchically) as well as the distance of other sites linked to from the expert-provided site that should be pursued (for example, options are one to two jumps from the original URL).
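The drill-down/drill-out behavior can be sketched in Python as a bounded breadth-first traversal. This is an illustration under stated assumptions, not the iVia crawler: pages are not fetched, the `links` dictionary stands in for the Web's link structure, a site is identified by its host name, and the drill-down depth is assumed to restart at zero whenever the crawl jumps to a new site. The function and parameter names are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def guided_crawl(seeds, links, max_down=2, max_out=1, blacklist=frozenset()):
    """Breadth-first crawl bounded by same-site depth and off-site jumps.

    seeds     -- expert-provided starting URLs
    links     -- dict mapping a URL to the URLs it links to (stand-in for fetching)
    max_down  -- levels to drill down within a site
    max_out   -- jumps allowed out to other sites
    blacklist -- host names that must never be visited
    """
    def host(url):
        return urlsplit(url).hostname

    # Each state is (url, levels drilled down in this site, jumps out so far).
    queue = deque((seed, 0, 0) for seed in seeds if host(seed) not in blacklist)
    seen_states = set(queue)
    found = set()
    while queue:
        url, down, out = queue.popleft()
        found.add(url)
        for target in links.get(url, ()):
            if host(target) in blacklist:
                continue
            if host(target) == host(url):
                state = (target, down + 1, out)  # one level deeper in the same site
            else:
                state = (target, 0, out + 1)     # one jump out to another site
            if state[1] > max_down or state[2] > max_out or state in seen_states:
                continue
            seen_states.add(state)
            queue.append(state)
    return found
```

Setting `max_out=0` restricts the crawl to the expert-provided site, while `max_out=1` or `2` pulls in the closely linked community around it. Tracking visited states rather than visited URLs keeps the depth and jump budgets exact when a page is reachable by more than one route.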
This semi-automated crawler gives the expert the ability to "mine" for new resources/links in a very precise way. A single page or site can be crawled, or a community of closely linked sites can be crawled. Likewise, we are building a focused crawler that will take a topic that is very well defined by experts and concentrate on just that topic. This is a semi-automated focused crawler that will depend on feedback and truing from participating experts for best results.

Just as experts interact with and improve crawler processes and accuracy, the interaction can be reversed, with crawlers suggesting the most promising sites as needing expert attention from content builders. That is, the most highly weighted sites that are automatically included in the crawler collection are flagged for expert review and refinement. Similarly, iVia database and record usage statistics are kept so that the most used or visited records of crawler origin can be flagged for expert attention, whereby the automatically created metadata can be improved. Such a record is then moved from the second-tier, robot-created collection to the first-tier, expert-created collection. Both mechanisms are important collection development tools and provide useful assists for experts.

Machine Assistance through Automated Record/Metadata Generation or Import

Automated and semi-automated metadata generation provides expert content builders with a great advantage (Chakrabarti, 2003; Frank & Paynter, 2004). Collection size and depth are greatly improved through records created in these ways. Specifically, iVia's second-tier collection of records, those that have been created fully automatically, provides a great boost to the utility and value of the collection as a whole to users and greatly augments and complements expert content-building work.
At the same time, the existence of automatically created records greatly assists expert record-building activities when they are viewed as "foundation records," or records that have been partially built (from a librarian standpoint) and that can be improved upon through some expert effort. Working with these automatically created records as foundation records and improving them saves expert time compared with creating records from scratch. Foundation records can be seen as the basic "ore" that can be easily refined for more demanding or discerning uses where more rigorous (though more expensive) metadata may be the norm.

Expert content builders are also aided, as mentioned above, by iVia's ability to import and share records with other collections through OAI-PMH and standard delimited formats. This also contributes to boosting collection size, depth, and value for the end user.

Specific Machine Assistance to Experts in Record Building

Numerous small machine assists are supplied by iVia to make expert record building more efficient. In the aggregate, these are crucial and save much expert time. For example, iVia supports:

* Duplicate checking: prior to building an expert record, the iVia checker finds both exact and fuzzy matches within the URL and title fields for experts to review. Mirror sites are also identified and deleted by checking for exact matches of lengthy character strings.
* Record cloning: multiple records can be built representing closely related sites, authors, or organizations. Similarly, multiple records on the same or related subjects can be cloned and the subject and keyword indexing, among other metadata, saved and re-utilized.
* Batch editing: just as multiple records can be imported or exported in batches, their metadata can be edited and changed globally in batches. This saves much time in cases, for example, where a convention on naming a resource type has changed.
* URL canonization: variants of URLs are canonized to proper form when this is needed.
* URL change notification: keeping up with changing URLs is always a challenge. To do this iVia has developed a "URL Checker and Pursuit" utility that flags problem URLs, notes the nature of the problem, notes potential locations indicated by forwarding messages, and (after three consecutive failures of a URL over a period of three weeks) alerts the editor of the subject file containing the record with the problem URL and suggests possible working URLs.
* Pull-down menus of various controlled vocabularies: these include resource types, keywords, and broad subject disciplines.
* User corrections/suggestions/new content: these are encouraged and funneled to content builders. This has been a major source for identifying possible new content and correcting errors.
* Online and point-of-need guidance: help is provided via manuals, style guides, and pop-up screens with pointers.
* Collection development assistance: this is supplied to other collections through iVia's email-based "New Resources Alert Service" and through the Whats New! index.

Under the Hood

The techniques, approaches, and algorithms that make machine assistance to experts in collection building and, more generally, iVia possible are described in more depth at the iVia site, http://infomine.ucr.edu/iVia.

A COLLABORATION-INDUCING SYSTEM

There are a number of catalysts that should stimulate increasing collaboration with iVia and its participants. The foremost is that, through joint effort, a powerful, far-reaching, and high-quality finding tool, together with both internally developed and allied, externally developed collections of proven value to researchers and students, will continue to grow and thrive. Working together, collaborators reduce redundant efforts by sharing and distributing collection development tasks and by unifying system building and support activities. Collaborators participate in a state-of-the-art system incorporating resource-saving machine assistance in numerous tasks. Furthermore, the iVia system is in the public domain, free, and open to custom development. At the same time, iVia and the collections it provides access to can be utilized through custom interfaces and data views that meld well with the Web presence of the collaborating institution. Additionally, as one of the first library-based Web services, iVia/INFOMINE developers have a great deal of experience in meeting scholarly Internet user finding needs. Finally, the collections that populate iVia through INFOMINE are significant, well-organized, and useful. INFOMINE is among the largest librarian-built collections of its type.

SUMMARY

iVia is a powerful and flexible, collaboration-enabling, open-source system for building Internet collections and finding tools. It is of use in building Internet collections of metadata and full-text data representing resources from the Web, as exemplified through INFOMINE, one of the earliest and more significant academic virtual libraries. The metadata generated includes library standard subject schema. iVia supports single or multiple subject focuses as well as both single and multiple institutional efforts. It is intended as community-ware and has proven itself to be of value in multi-institutional collaborations such as INFOMINE, NSDL, iVia, and, shortly, Data Fountains. User retrieval options are numerous for both fielded and full-text data and support both beginning and advanced searchers. iVia supports custom branding, interfaces, and data views for those accessing its collections. Numerous modes of content building are possible, featuring varying levels of editorial review, styles of indexing, and divisions of labor. iVia is noteworthy because it saves resources and labor by integrating fully automated, semi-automated, and fully manual modes of record building. Resource discovery through various iVia Web crawlers and metadata generation through iVia classifiers (and other means) result in collections that require fewer resources and less expert labor to reach significant size. iVia emphasizes collaboration and empowers the librarian expert through the use of machine assistance.

REFERENCES

Chakrabarti, S. (2003). Mining the Web: Discovering knowledge from hypertext data. San Francisco: Morgan Kaufmann.

Frank, E., & Paynter, G. W. (2004). Predicting Library of Congress classifications from Library of Congress subject headings. Journal of the American Society for Information Science and Technology, 55(3), 214-227.

Mason, J., Mitchell, S., Mooney, M., Reasoner, L., & Rodriguez, C. (2000). INFOMINE: Promising directions in virtual library development. First Monday, 5(6). Retrieved November 20, 2004, from http://www.firstmonday.dk/issues/issue5_6/mason/index.html.

Mitchell, S., Mooney, M., Mason, J., Paynter, G., Ruscheinski, J., Kedzierski, A., et al. (2003). iVia open source virtual library system. D-Lib Magazine, 9(1). Retrieved November 20, 2004, from http://www.dlib.org/dlib/january03/mitchell/01mitchell.html.

Steve Mitchell, iVia and Data Fountains Projects Coordinator, Science Library, University of California at Riverside, Riverside, CA 92521