
Scalable Internet resource discovery: research problems and approaches.

In its roots as the ARPANET, the Internet was conceived primarily as a means of remote login and experimentation with data communication protocols. However, the predominant usage quickly became email in support of collaboration. This trend continues into the present incarnation of the Internet, but with increasingly diverse support for collaborative data-sharing. Email has been supplemented by a variety of wide-area filing, information retrieval, publishing, and library access systems. At present, the Internet provides access to hundreds of gigabytes each of software, documents, sounds, images, and other file system data; library catalog and user directory data; weather, geography, telemetry, and other physical science data; and many other types of information.

To make effective use of this wealth of information, users need ways to locate information of interest. In the past few years, a number of such resource discovery tools have been created and have gained wide popular acceptance in the Internet [1, 9, 13, 17, 19, 22].(1) Our goal here is to examine the impact of scale on resource discovery tools, and place these problems into a coherent framework. We focus on three scalability dimensions: the burgeoning diversity of information systems, the growing user base, and the increasing volume of data available to users.

Table 1 summarizes these dimensions, suggests a set of corresponding conceptual layers, and indicates problems being explored by the authors, who comprise the Internet Research Task Force (IRTF) Research Group on Resource Discovery and Directory Service. Users perceive the available information at the information interface layer. This layer must support scalable means of organizing, browsing, and searching. The information dispersion layer is responsible for replicating, distributing, and caching information. This layer must support access to information by a large, widely distributed user populace. The information-gathering layer is responsible for collecting and correlating the information from many incomplete, inconsistent, and heterogeneous repositories.

Information System Diversity

An important goal for resource discovery systems is to provide a consistent, organized view of information. Since information about a resource exists in many repositories--within the object itself and within other information systems--resource discovery systems need to identify a resource, collect information about it from several sources, and convert the representation to a format that can be indexed for efficient searching.

As an example, consider the problem of constructing a user directory. In a typical environment, several existing systems contain information about users. The Sun NIS [23] database usually has information about a user's account name, personal name, address, group memberships, password, and home directory. The ruserd [23] server has information about a user's workstation and its idle time. Additionally, users often place information in a ".plan" file that might list the user's travel schedule, home address, office hours, and research interests.

As this example illustrates, information in existing Internet repositories has the following characteristics:

* It is heterogeneous: Each repository maintains the information it needs about resources. For example, the primary purpose of the NIS database is to maintain information about user accounts, while a user's ".plan" file often contains more personal information. Also, the two repositories represent the information differently: records in an NIS database have a fixed format, but a ".plan" file contains unstructured text.

* It is inconsistent: Most information contained in Internet repositories is dynamic. Certain properties change frequently, such as which workstations a person is using. Other properties change more slowly, such as a user's email address. Because information is maintained by several repositories that perform updates at different times using different algorithms, often the information in the various repositories will conflict. For example, information about account name, address, and phone number may be maintained by both the NIS database and a user directory server. When a user's address or phone number changes, the directory service will probably be updated first. However, if the account changes, the NIS database will usually be the first to reflect the change.

* It is incomplete: Additional attributes of a resource can often be obtained by combining information from several repositories. For example, a bibliographic database does not contain explicit information about a person's research interests. However, keywords might be extracted from the person's research papers to infer research interests.

There are two common approaches to these information-gathering layer problems. The first approach--data mapping--generates an aggregate repository from multiple information sources. The second approach--operation mapping--constructs a "gateway" between existing systems; it maps the functionality of one system into another without actually copying the data.

Data Mapping

The first approach for accommodating diversity is to collect data from a set of underlying repositories and combine it into a homogeneous whole. Doing so involves two parts: mapping algorithms for collecting information, and agreement algorithms for correlating information [4].

A mapping algorithm is implemented as a function that collects information from a repository and reformats it. There may be several implementations of mapping algorithms, each customized for an existing repository. The most common mapping algorithms are implemented as clients of an existing service. For example, Netfind extracts user information from several common Internet services [22], including the Domain Naming System (DNS), the finger service, and the Simple Mail Transfer Protocol.

The agreement algorithm defines a method for handling conflicts between different repositories. For example, Figure 1 illustrates data for the Enterprise [2] user directory system, which is built on top of the Univers name service [4]. This figure shows three mapping algorithms that gather information from the NIS database, the ruserd server, and the user's email. Several attributes can be generated by more than one mapping algorithm. For example, address information exists potentially in both the NIS database and the information supplied by the user. The agreement algorithm considers information gathered directly from the user as the most reliable. Depending on the attribute, the agreement algorithm may permit some properties to have several values, such as the two address attributes that describe the user in Figure 1.
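To make this concrete, the following sketch (in Python) shows how several mapping algorithms might feed a simple agreement algorithm that trusts user-supplied values most, while letting some attributes, such as address, carry several values. The mapping functions, attribute names, and reliability ranking are illustrative assumptions, not the Enterprise or Univers implementation.

# A minimal illustration of mapping and agreement algorithms.
# The mapping functions below are hypothetical stand-ins for clients
# of NIS, ruserd, and user-supplied ".plan"/email information.

def map_nis(user):
    """Mapping algorithm: reformat NIS-style account data."""
    return {"source": "nis", "name": "C. User", "address": "Office 4-101",
            "account": user}

def map_ruserd(user):
    """Mapping algorithm: reformat ruserd-style login data."""
    return {"source": "ruserd", "workstation": "host1.cs.example.edu",
            "idle_minutes": 12}

def map_user_supplied(user):
    """Mapping algorithm: reformat data the user supplied directly."""
    return {"source": "user", "address": "123 Home Street",
            "interests": "resource discovery, naming"}

# Attributes that may keep several values from different sources.
MULTI_VALUED = {"address"}
# Higher rank means the agreement algorithm trusts the source more.
RELIABILITY = {"user": 3, "nis": 2, "ruserd": 1}

def agree(records):
    """Agreement algorithm: merge mapped records into one directory entry."""
    entry = {}
    for rec in sorted(records, key=lambda r: RELIABILITY[r["source"]]):
        src = rec["source"]
        for attr, value in rec.items():
            if attr == "source":
                continue
            if attr in MULTI_VALUED:
                entry.setdefault(attr, []).append((src, value))
            else:
                # Later (more reliable) sources overwrite earlier ones.
                entry[attr] = value
    return entry

if __name__ == "__main__":
    records = [map_nis("cuser"), map_ruserd("cuser"), map_user_supplied("cuser")]
    print(agree(records))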

Data mining represents a special form of agreement algorithm, which works by cross-correlating information available from multiple repositories. This can have two benefits. First, it can flag inconsistencies. For example, Enterprise could inform users if it detected conflicts between the email addresses listed in different repositories. Second, data mining can deduce implicit information by cross-correlating existing information. For example, Netfind collects and cross-correlates data continuously from a number of sources, to form a far-reaching database of Internet sites. One source might discover a new host called "astro.columbia.edu" from DNS traversals, and cross-correlate this information with the existing site database record for columbia.edu ("columbia university, new york, new york"), its understanding of the nesting relationships of the Domain name space, and a database of departmental abbreviations, to derive a new record for the Astronomy Department at Columbia University.

Mapping and agreement algorithms generally operate best when they exploit the semantics of specific resource discovery applications. In the Netfind example described above, this was possible because the data were gathered from particular services, each with semantics that Netfind understands.

More generally, data gathering depends on some data type structure to help select semantically specific gathering methods. We are exploring a variety of data-typing approaches in the context of gathering file system data. We now consider briefly the problems that arise in typing and gathering this data.

To gather information effectively from file system data, it is helpful to extract information in different ways depending on file type. For example, while it is possible to index every token in program source code, the index will be smaller and more useful if it distinguishes the variables and procedures defined in the file. In contrast, applying this data-gathering algorithm to a program's associated textual documentation will not yield the most useful information, since it has a different structure. By typing data, information can be extracted in whichever way is most appropriate for each type of file.

File data can be typed explicitly or implicitly. Explicitly typing files has the advantage that it simplifies data gathering. An explicitly typed file conforms to a well-known structure, which supports a set of conventions for what data should be extracted. Explicit typing is most naturally performed by the user when a file is exported into the resource discovery system. We are exploring this approach in the Nebula file system [3] and Indie discovery tool [5].

Many files exist without explicit type information, as in most current anonymous FTP files.(2) An implicit typing mechanism can help in this case. For example, Essence [12] uses a variety of heuristics to recognize files as fitting into certain common classes, such as word processor source text or binary executables. These heuristics include common naming conventions and identifying features in the data. The MIT Semantic File System uses similar techniques [11].
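For illustration, a minimal sketch of implicit typing in this spirit might combine naming conventions with identifying features in the data. The particular extensions and header bytes below are illustrative assumptions, not Essence's actual rule set.

# A sketch of implicit file typing using naming conventions and
# identifying features in the data.  The rules below are illustrative.

NAME_RULES = {
    ".tex": "tex-document",
    ".c": "c-source",
    ".ps": "postscript",
    ".tar": "tar-archive",
}

def implicit_type(path, data):
    """Guess a file's type from its name, then from its contents."""
    for suffix, ftype in NAME_RULES.items():
        if path.endswith(suffix):
            return ftype
    if data.startswith(b"%!PS"):          # PostScript header
        return "postscript"
    if data.startswith(b"\x7fELF"):       # ELF binary executable
        return "binary-executable"
    if all(32 <= b < 127 or b in (9, 10, 13) for b in data[:512]):
        return "plain-text"
    return "unknown"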

Given a typed file, the next step is to extract indexing information. This can be accomplished most easily through automatic content extraction, using a grammar that describes how to extract information from each type of file. For example, given a TeX word processing document, the grammar could describe where to extract author, title, and abstract information. For cases in which more complex data extraction methods are needed, one can provide an "escape" mechanism that allows arbitrary data extraction programs to be run.
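As a rough sketch, the extraction step can be driven by a small table of per-type rules. The Python fragment below uses regular expressions in place of a true extraction grammar; the rules for TeX documents and C source are illustrative assumptions.

import re

# Per-type "grammar": which fields to extract and how.  Illustrative only.
EXTRACTION_RULES = {
    "tex-document": {
        "author": re.compile(r"\\author\{([^}]*)\}"),
        "title": re.compile(r"\\title\{([^}]*)\}"),
    },
    "c-source": {
        "functions": re.compile(r"^\w[\w\s\*]*\s(\w+)\s*\(", re.MULTILINE),
    },
}

def extract(ftype, text):
    """Extract indexing fields for a typed file; unknown types get nothing."""
    fields = {}
    for name, pattern in EXTRACTION_RULES.get(ftype, {}).items():
        matches = pattern.findall(text)
        if matches:
            fields[name] = matches
    return fields

if __name__ == "__main__":
    sample = r"\title{Scalable Discovery}\author{A. Author}"
    print(extract("tex-document", sample))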

Automatic extraction methods have the advantage that they can provide an immediate base of usable information, but in general will generate some inappropriate keywords and miss generating other, desirable keywords. For this reason, it is prudent to augment these methods with means by which people can override the automatically extracted data.

In many cases file-indexing information can be extracted from each file in isolation. In some cases, however, it is useful to apply extraction procedures based on the relationships among files. For example, binary executable files can sometimes be indexed by gathering keywords from their corresponding documentation. One can use heuristics to exploit such implicit interfile relationships for common cases, augmented by means of specifying explicit relationships. For example, we are exploring an approach that allows users to create files in the file system tree that specify relationships among groups of files.

One final observation about data type structure is that the index should preserve type information to help identify context during searches. For example, keywords extracted from document titles can be tagged in the index so that a query will be able to specify that only data extracted from document titles should match the query. This stands in contrast to the approach (used by WAIS [13], for example) of allowing a free association between query keywords and extracted data. We discuss indexing schemes further later.
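A minimal sketch of such a type-preserving index follows: each posting records the field (e.g., "title") from which a keyword was extracted, so a query can be restricted to matches in that context. The field names and structure are assumptions made for illustration.

from collections import defaultdict

class TaggedIndex:
    """Inverted index whose postings remember the field a keyword came from."""

    def __init__(self):
        self.postings = defaultdict(list)   # keyword -> [(doc, field), ...]

    def add(self, doc, field, keywords):
        for kw in keywords:
            self.postings[kw.lower()].append((doc, field))

    def search(self, keyword, field=None):
        """Return documents matching the keyword, optionally in one field only."""
        hits = self.postings.get(keyword.lower(), [])
        return {doc for doc, f in hits if field is None or f == field}

if __name__ == "__main__":
    idx = TaggedIndex()
    idx.add("paper1.tex", "title", ["scalable", "discovery"])
    idx.add("paper2.tex", "body", ["scalable"])
    print(idx.search("scalable", field="title"))   # only paper1.tex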

Operation Mapping

A gateway between two resource discovery systems translates operations from one system into operations in another system. Ideally, the systems interoperate seamlessly, without the need for users to learn the details of each system. Sometimes, however, users must learn how to use each system separately.

Building seamless gateways can be hindered if one system lacks operations needed by another system's user interface [22]. For example, it would be difficult to provide a seamless gateway from a system (like WAIS) that provides a search interface to users, to a system (like Prospero [19]) that supports browsing only. Even if two systems support similar operations, building seamless gateways may be hindered by another problem: providing appropriate mappings between operations in the two systems. To illustrate the problem, consider the current interim gateway from Gopher [17] to Netfind, illustrated in Figure 2.(3) Because the gateway simply opens a telnet window to a Unix program that provides the Netfind service, users perceive the boundaries between the two systems.

In contrast, we have built a system called Dynamic WAIS [12], which extends the WAIS paradigm to support information from remote search systems (as opposed to the usual collection of static documents). The prototype supports gateways to Archie and to Netfind, using the Z39.50 information retrieval protocol to integrate the information spaces seamlessly. The Dynamic WAIS interface to Netfind is shown in Figure 3.

The key behind the Dynamic WAIS gateways is the conceptual work of constructing the mappings between the WAIS search-and-retrieve operations and the underlying Archie and Netfind operations. In the case of Netfind, for example, when the Dynamic WAIS user requests a search using the "dynamic-netfind.src" WAIS database, the search is translated to a lookup in the Netfind site database to determine potential domains to search. The Netfind domain selection request is then mapped into a WAIS database selection request (the highlighted selection in the XWAIS Question window). Once the user selects one of the domains to search, the WAIS retrieval phase is mapped into an actual domain search in Netfind (the upper-most window).
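The flavor of this mapping can be sketched as follows. The netfind_* functions are hypothetical stand-ins for the underlying Netfind operations, and the sketch shows only how one system's search and retrieve operations can be translated into another system's lookup and search phases.

# Sketch of operation mapping: translating search-and-retrieve style
# operations into a two-phase lookup-then-search service.  The
# netfind_* functions are hypothetical stand-ins, not a real client.

def netfind_site_lookup(keywords):
    """Stand-in for the Netfind site-database lookup."""
    return ["cs.colorado.edu", "colorado.edu"]      # candidate domains

def netfind_domain_search(domain, person):
    """Stand-in for an actual Netfind search within one domain."""
    return ["%s found in domain %s" % (person, domain)]

class DynamicGateway:
    """Maps 'search' and 'retrieve' operations onto the two Netfind phases."""

    def search(self, query):
        # The front end's search operation becomes a site-database lookup;
        # each candidate domain is returned as a selectable item.
        person, *site_keywords = query.split()
        self.person = person
        return netfind_site_lookup(site_keywords)

    def retrieve(self, domain):
        # The front end's retrieve operation triggers the real domain search.
        return netfind_domain_search(domain, self.person)

if __name__ == "__main__":
    gw = DynamicGateway()
    domains = gw.search("schwartz boulder colorado")
    print(gw.retrieve(domains[0]))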

We are developing these techniques further, to support gateways to complex forms of data, such as scientific databases.

User Base Scale

New constituencies of users will make the Internet grow significantly beyond its present 2 million nodes. This growth will overburden the network's resource discovery services unless we address four problems of scale. First, discovery services should monitor data access patterns to determine how best to replicate themselves, to determine whether to create specialized services that manage hot subsets of their data, and to diagnose accidentally looping clients. Second, discovery services should support significantly higher levels of replication, using algorithms specifically designed to function in the Internet's dynamic patchwork of autonomous systems. Third, the range of user expertise and needs requires that user interfaces and search strategies be highly customizable. Fourth, the Internet will need a hierarchically structured, extensible, object-caching service through which clients can retrieve data objects, once they are discovered. We consider these issues later.

Server Instrumentation

The designers of new information systems can never fully anticipate how their systems will be used. For this reason, we believe that new Internet services should instrument their query streams.

Self-instrumented servers could help determine where to place additional replicas of an entire service: If some X.500 server in Europe finds that half of its clients are in the U.S., the server itself could suggest that a strategically located replica be created.

Self-instrumented servers could also identify the access rate of items in their databases for use by more specialized services. For example, an instrumented Archie server would note that properly formed user queries only touch about 16% of Archie's database. Such instrumentation would enable the creation of a complementary service that reported only popular, duplicate-free, or nearby objects. We discuss these ideas more later.
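A self-instrumented server can be as simple as a wrapper that counts queries per database item and per client, making it possible to notice hot subsets of the data, distant client populations, or a client looping on the same request. The thresholds and record layout in the sketch below are illustrative assumptions.

import time
from collections import Counter

class InstrumentedServer:
    """Wraps a query handler and records simple access statistics."""

    def __init__(self, handler, loop_threshold=100):
        self.handler = handler
        self.item_hits = Counter()       # which database items are popular
        self.client_hits = Counter()     # which clients query most
        self.loop_threshold = loop_threshold
        self.log = []

    def query(self, client, item):
        self.item_hits[item] += 1
        self.client_hits[(client, item)] += 1
        self.log.append((time.time(), client, item))
        # Flag a client that keeps asking for the same item.
        if self.client_hits[(client, item)] > self.loop_threshold:
            print("possible looping client:", client, "item:", item)
        return self.handler(item)

    def hot_items(self, n=10):
        """The n most requested items, e.g., for a specialized subset service."""
        return self.item_hits.most_common(n)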

Self-instrumented servers could also identify server-client or server-server communication run amok. Large distributed systems frequently suffer from undiagnosed, endless cycles of requests. For example, self-instrumented Internet name servers show that DNS traffic consumes 20 times more bandwidth than it should, because of unanticipated interactions between clients and servers [6]. Self-instrumentation could identify problematic specifications and implementations before they become widespread and difficult to correct.

Server Replication

Name servers scale well because their data are typically partitioned hierarchically. Because resource discovery tools search flat, rather than hierarchical or otherwise partitionable, views of data, the only way to make these tools scale is by replicating them. To gain an appreciation for the degree of replication required, consider the Archie file location service.

Currently, the global collection of Archie servers processes approximately 100,000 queries per day, generated by a few thousand users worldwide. Every month or two of Internet growth requires yet another replica of Archie. Thirty Archie servers replicate a continuously evolving 150MB database of 2.1 million records. While a query posed on a Saturday night receives a response in seconds, it can take several hours to answer the same query on a Thursday afternoon. Even with no new Internet growth, for the current implementation of Archie to yield five-second response times during peak hours we would need at least 60 times more Archie servers, i.e., 1,800 servers. Because of its success and the continual rapid growth of the Internet, in time Archie will require thousands of replicas. Other successful tools that cannot partition their data easily will also require massive replication.

We believe massive replication requires additional research. On the one hand, we clearly know how to replicate and cache data that partitions easily, as in the case of name servers [18]. Primary copy replication works well because name servers do not perform associative access, and organizational boundaries limit the size of a domain, allowing a handful of servers to meet performance demands.(4) We have learned many lessons to arrive at this understanding [6]. On the other hand, we have little experience deploying replication and caching algorithms to support massively replicated, flat, yet autonomously managed databases.

What do existing replication schemes for wide-area services lack? First, existing replication systems ignore network topology. They do not route their updates in a manner that makes efficient use of the Internet. One day, a streaming reliable multicast protocol might serve this purpose. Today, we believe, it is necessary to calculate the topology over which updates travel and to manage replication groups that exploit the Internet's partitioning into autonomous domains.

Second, existing schemes do not guarantee timely and efficient updates in the face of frequent changes in physical topology, network partition, and temporary or permanent node failure. In essence, they treat all physical links as having equal bandwidth, delay, and reliability, and do not recognize administrative domains.

We believe that flooding-based replication algorithms can be extended to work well in the Internet. Both Archie and network news replicate using flooding algorithms. However, for lack of good tools, administrators of both Archie and network news manually configure the flooding topology over which updates travel, and reconfigure this topology manually when the physical network changes. This is not an easy task because Internet topology changes often, and manually composed maps are never current. While we are developing tools to map the Internet [25], full network maps will not automate topology-update calculation.

Avoiding topology knowledge by using today's multicast protocols [8] for data distribution fails for other reasons. First, these protocols are limited to single routing domains or require manually placed tunnels between such domains. Second, Internet multicast attempts to minimize message delay, which is the wrong metric for bulk transport. At the very least, we see the need for different routing metrics. Third, more research is needed into reliable, bulk transport multicast that deals efficiently with site failure, network partition, and changes in the replication group.

We are exploring an approach to providing massively replicated, loosely consistent services [5]. This approach extends ideas presented in Lampson's Global name service [14] to operate in the Internet. Briefly, our approach organizes the replicas of a service into groups, imitating the idea behind the Internet's autonomous routing domains. Group replicas estimate the physical topology and then create an update topology between the group members. The left-hand side of Figure 4 shows three replication domains. Inside each replication domain, a logical update topology is established that best uses the physical topology connecting the replicas. Whenever the algorithm detects a meaningful change in physical topology among members of a replication domain, it modifies the logical update topology accordingly. Because it does not require a separate recovery algorithm, it is simpler than solutions based on Internet multicast.
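A rough sketch of the core idea, assuming the replicas in one domain have already estimated pairwise link costs: build a cheap spanning topology over those estimates and flood updates along it, rebuilding the topology whenever the estimates change significantly. This illustrates the approach in miniature; it is not the algorithm of [5] or [14].

# Sketch: derive a logical update topology (a spanning tree) from
# estimated link costs among replicas in one replication domain,
# then flood updates along that tree.  Costs below are illustrative.

def spanning_topology(replicas, cost):
    """Prim-style construction of a cheap spanning tree over the replicas."""
    in_tree = {replicas[0]}
    edges = []
    while len(in_tree) < len(replicas):
        best = None
        for a in in_tree:
            for b in replicas:
                if b not in in_tree:
                    c = cost[frozenset((a, b))]
                    if best is None or c < best[0]:
                        best = (c, a, b)
        _, a, b = best
        in_tree.add(b)
        edges.append((a, b))
    return edges

def flood(update, origin, edges):
    """Propagate an update from its origin along the tree edges."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    seen, frontier = {origin}, [origin]
    while frontier:
        node = frontier.pop()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                print("send", update, "from", node, "to", nxt)
                seen.add(nxt)
                frontier.append(nxt)

if __name__ == "__main__":
    replicas = ["r1", "r2", "r3", "r4"]
    cost = {frozenset(("r1", "r2")): 1, frozenset(("r1", "r3")): 4,
            frozenset(("r1", "r4")): 3, frozenset(("r2", "r3")): 2,
            frozenset(("r2", "r4")): 5, frozenset(("r3", "r4")): 1}
    edges = spanning_topology(replicas, cost)
    flood("db-update-42", "r1", edges)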

Data Object Caching

We believe that the Internet needs an object-caching service, through which search clients can retrieve the data objects they discover. We say this for two reasons.

First, people are using Mosaic, FTP, and email as a cacheless, distributed file system [7]. Since the quantity and size of read-mostly data objects grow as we add new information services, an object cache would improve user response times and decrease network load (e.g., we found that one-fifth of all NSFNET backbone traffic could be avoided by caching FTP data). Second, caches protect the network from looping client programs that repeatedly retrieve the same object. While the cache does not fix the faulty components, it does isolate the fault. A caching service would obviate the need for every new client and Internet service to implement its own data cache.

We believe the object cache should be hierarchically organized, as illustrated in Figure 5. The dark circles in this figure represent file caches residing on secure machines placed near the entry points to regional networks and near the core of backbone networks. The organization of these caches could be similar to the organization of the Domain Name System. Clients would send their requests to one of their default cache servers. If the request missed the cache, the cache would resolve the request recursively with one of its parent caches, or directly from the FTP archive.
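A minimal sketch of this recursive resolution follows, with an in-memory dictionary standing in for disk storage and a stub fetch standing in for the FTP archive; all names are illustrative.

class CacheNode:
    """One cache in the hierarchy; misses are resolved via the parent."""

    def __init__(self, name, parent=None, origin_fetch=None):
        self.name = name
        self.parent = parent              # next cache toward the backbone
        self.origin_fetch = origin_fetch  # used only by the root cache
        self.store = {}                   # object name -> data

    def get(self, obj):
        if obj in self.store:
            print(self.name, "hit", obj)
            return self.store[obj]
        print(self.name, "miss", obj)
        if self.parent is not None:
            data = self.parent.get(obj)      # resolve recursively upward
        else:
            data = self.origin_fetch(obj)    # root contacts the archive
        self.store[obj] = data               # cache on the way back down
        return data

if __name__ == "__main__":
    def fetch_from_archive(obj):
        return "contents of " + obj

    backbone = CacheNode("backbone-cache", origin_fetch=fetch_from_archive)
    regional = CacheNode("regional-cache", parent=backbone)
    print(regional.get("ftp://archive/pub/tool.tar"))   # miss, miss, fetch
    print(regional.get("ftp://archive/pub/tool.tar"))   # hit at regional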

Client Customization

As the number of Internet information system users grows, the user base naturally becomes more diverse. Allowing users flexible customization can be a great help, because people have different search styles and needs. To see how this can be done, consider the analogy of a newcomer to a town. One first establishes general acquaintance with the town layout and major services. One then learns about services close to one's heart--for example, clubs, specialized stores, and recreation facilities--usually by word of mouth, the media, or advertising. After a while, one develops networks of friends, better knowledge of different services, and experience based on habits and interests. This is a continuing process, because new facilities appear and one's interests evolve. The same process occurs on a much larger scale in the Internet and suggests ways that interactions between users and discovery services can be made flexible. Next we discuss three types of customization that can be addressed in discovery tools, based on some of our experimental systems.

The first type of customization involves tracking a person's search history. For example, one of this article's authors recently discovered, while browsing Gopherspace via Mosaic, that the New Republic magazine was available on-line. But only two days later it took him 15 minutes to navigate back to the same place. To address this problem, the user interface can keep track of previous successful and unsuccessful queries, comments a user made on past queries, and browsing paths. Existing systems record some of this information (e.g., the ability to set "bookmarks" in Gopher and "Hotlists" in Mosaic); saving the whole history can help further. For example, we built a system that records the paths users traverse when browsing FTP directories and allows this information to be searched [21]. It is also helpful to record search history and allow the searches themselves to be searched. For example, the X-Windows interface to agrep translates every command into the corresponding agrep statement, and allows flexible retrieval of this information. This type of history can support queries such as "What was the name of the service that allowed me to search for XYZ that I used about a year ago?"
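A minimal sketch of such a history mechanism: each query, its outcome, and any user comment are appended to a log that can itself be searched later. The record fields are assumptions made for illustration.

import time

class SearchHistory:
    """Records past queries and browsing steps so they can be searched later."""

    def __init__(self):
        self.entries = []

    def record(self, service, query, successful, comment=""):
        self.entries.append({
            "when": time.time(),
            "service": service,       # e.g., "archie", "gopher", "wais"
            "query": query,
            "successful": successful,
            "comment": comment,
        })

    def search(self, keyword):
        """Find past queries or comments mentioning the keyword."""
        kw = keyword.lower()
        return [e for e in self.entries
                if kw in e["query"].lower() or kw in e["comment"].lower()]

if __name__ == "__main__":
    h = SearchHistory()
    h.record("gopher", "new republic magazine", True, "found via Mosaic")
    print(h.search("republic"))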

The second type of customization is the ability to choose not only according to topic but also according to context and other attributes. If the search is for papers on induction, for example, knowledge of whether the searcher is a mathematician or an electrical engineer can be very useful. When one looks for a program named ZYX using Archie, for example, it would be useful to specify a preference for, say, a Unix implementation of the program.

The third type of customization is ranking. WAIS provides one form of ranking, in which matched documents are ordered by frequency of occurrence of the specified keywords. However, it would be useful to allow users to customize their notion of ranking, so they could choose to rank information by quality or reliability. Clearly, these notions are very subjective. If we are naive users, we may want information from someone who knows how to build easy-to-use systems. If we are academics, we may want information from someone with deep understanding. If we absolutely positively need it by tomorrow, we want someone who can deliver, and so on. We believe that in time the Internet will support many types of commercial and nonprofit review services, and people will subscribe and follow recommendations from reviewers they trust (just like they go to movies or buy refrigerators based on reviews).

Customizations like the ones discussed here are missing from most current resource discovery client programs because they were not designed to be extensible.

Data Volume

The amount of information freely available in the Internet is huge and growing rapidly. New usage modes will contribute additional scale. For example, as multimedia applications begin to proliferate, users will create and share voluminous audio, image, and video data, adding several orders of magnitude of data. Many users already store voluminous data, but do not share it over the Internet because of limited wide-area network bandwidths. For example, earth and space scientists collect sensor data at rates as high as gigabytes per day. To share data with colleagues, they send magnetic tapes through the postal mail. As Internet link capacities increase, more scientists will use the Internet to share data, adding several more orders of magnitude to the information space.

While resource discovery tools must deal with this growth, the scaling problems are not quite as bad as they may seem, because the number and size of "searchable items" need not grow as fast. Resource discovery tools may be needed to find the existence and location of gigabytes or terabytes of raw sensor data, but they will probably not search or otherwise process all of this data. A reasonably small descriptor object will be sufficient to point anyone to this data. Only this descriptor object will need to be searched and indexed. The same holds for sound, video, and many other types of nontextual data. Searching image files is desirable, but current pattern-matching techniques are still too slow to allow large-scale image processing on the fly. Again, a descriptor object can be associated with every image, describing it in words.

User Interaction Paradigms: Browsing vs. Searching

Generally, there are two resource discovery paradigms in common use in the Internet--organizing/browsing and searching. Organizing refers to the human-guided process of deciding how to interrelate information, usually by placing it into some sort of a hierarchy (e.g., the hierarchy of directories in an FTP file system). Browsing refers to the corresponding human-guided activity of exploring the organization and contents of a resource space. Searching is a process in which the user provides some description of the resources being sought, and a discovery system locates information that matches the description.

We believe that a general discovery system will have to employ a combination of both paradigms--here we analyze the strengths and weaknesses of each. The main weakness of organizing is that it is typically done by "someone else" and is not easy to change. For example, the Library of Congress Classification System has Cavalry and Minor Services of Navies as two second-level topics (under "Military Science" and "Naval Science," respectively), each equal to all of Mathematical Sciences (a second-level topic under "Science"), of which computer science is but a small part.(5) Ironically, this is also the main strength of organizing, because people prefer a fixed system they can get used to, even if it is not the most efficient.

Browsing also suffers from this problem because it typically depends heavily on the quality and relevance of the organization. Keeping a large amount of data well organized is difficult. In fact, the notion of "well organized" is highly subjective and personal. What one user finds clear and easy to browse may be difficult for users who have different needs or backgrounds. Browsing can also lead to navigation problems, and users can get disoriented. To some extent this problem can be alleviated by systems that support multiple views of information [19]. Yet doing so really pushes the problem "up" a level--users must locate appropriate views, which in itself is another discovery problem. Moreover, because there are few barriers to "publishing" information in the Internet (and we strongly believe there should not be any), there is a great deal of information that is useful to only very few users, and often for only a short period of time. To other users, this information clutters the "information highway," making browsing difficult.

Searching is much more flexible and general than organizing/browsing, but it is also harder for the user. Forming good queries can be a difficult task, especially in an information space unfamiliar to the user. On the other hand, users are less prone to disorientation; the searching paradigm can handle change much better; and different services can be connected by searching more easily than by interfacing their organizations.

Many current systems, such as WAIS, Gopher, and World-Wide Web [1], employ an organization that is typically based on the location of the data, with limited per-item or per-location searching facilities. Browsing is the first paradigm that users see,(6) but once a server or an archive is located, some type of searching is also provided. Searching can be comprehensive throughout the archive (for example, WAIS servers provide full-text indexes), or limited to titles.

Indexing Schemes

The importance of searching can be seen in the recent emergence of file system search tools [11, 12, 16] and the Veronica system (a search facility for Gopher). Moreover, it is interesting to note that while the Prospero model [19] focuses on organizing and browsing, Prospero has found its most successful application as an interface to the Archie search system.

To support efficient searching, various means of indexing data are required. As illustrated in Figure 6, Internet indexing tools can be placed on a spectrum of indexing space vs. representativeness. The upper-left corner of this figure is occupied by systems that achieve very space-efficient indexes, but only represent the names of the files or menus that they index. Archie and Veronica, for example, index the file and menu names of FTP and Gopher servers, respectively. The compact nature of these indexes means that a single index can support far-reaching searches. Yet, these indexes support very limited queries. For example, Archie supports only name-based searches; searches based on content are possible only when the file names happen to reflect some of the contents.

The lower-right corner of Figure 6 is occupied by systems that provide full-text indexing of the data found at individual sites. For example, a WAIS index represents every keyword in a set of documents located at a single site. Similar indexes are available for individual Gopher and World-Wide Web servers. Some recent advances have lowered the space requirements for full-text indexing, to as little as 2% to 4% of the data size for the index used in Glimpse [16]. There is usually (but not always) a time-space trade-off; systems that use less space require more time for searching.

The middle region of Figure 6 is occupied by systems that represent some of the contents of the objects they index, based on selection procedures for including important keywords or excluding less meaningful keywords. For example, Essence and MIT's Semantic File System (SFS) select keys based on application-specific semantics of individual documents (e.g., locating the author lines within TeX documents). The Netfind site database includes keywords for each component of a site's Domain name, plus each keyword included in an organizational description that was constructed from an understanding of the semantics of the many information sources used to gather that information [22]. Whois++ [24] indexes templates that were manually constructed by site administrators wishing to describe the resources at their site.

By selecting relatively small but important information such as file names, Archie quickly became a popular search-based Internet resource discovery tool. But as we discussed earlier, Archie has scale problems. One way to overcome some of these problems is to introduce some hierarchy into the indexing mechanism in Archie (and similar tools). In addition to one flat database of all file names, it is possible to maintain much smaller, slightly limited databases that can be replicated widely. For example, we can detect duplicates (exact and/or similar, see [15]) and keep only one copy (e.g., based on locality) in the smaller databases. Most queries will be satisfied by the smaller databases, and only a few of them will have to go further.

The scale problems for full texts are much more difficult. Full-text indexes are almost always inverted indexes, which store pointers to every occurrence of every word (possibly excluding some very common words). The main drawbacks of inverted indexes are their size--usually 50% to 150% of the original text [10]--and the time it takes to build and/or update them. Therefore, maintaining an inverted index of the whole Internet FTP space is probably out of the question. But separate local indexes, which is what we have now, do not present a general enough view of much of the available information. Often, users must perform a lengthy browsing session to find the right indexes, and they often miss them. This problem will only get worse with scale.

We envision a search system that connects local indexes and other pieces of information via a multilevel indexing scheme. The main principle behind such a scheme is that the number of search terms is quite limited no matter how much data exists. Search terms are mostly words, names, technical terms, and so forth, and the number of those is on the order of 10^6 to 10^7. But more important, this number grows more slowly than the general growth of information. Inverted indexes require enormous space because they catalog all occurrences of all terms. But if we index, for each term, only pointers to places that may be relevant, we end up with a much smaller index. Searching will be similar to browsing in the sense that the result of each query may be a list of suggestions for further exploration.

A big advantage of such a scheme is that the index can be partitioned in several ways, which makes it scalable. Specialized archives will, of course, keep their local indexes. Several local indexes can be combined to form a two-level index, in which the top level only filters queries to a subset of the local indexes. For example, separate collections of technical reports can form a combined index. Also, indexes can be combined according to topics or other shared attributes. The directory of services could be another index (but more widely replicated), which contains pointers to information about common terms and local information. There could also be a more detailed directory maintained at some servers with knowledge about more terms. Users could navigate by a combination of browsing and searching. In contrast with fixed browsing, this type of navigation will allow users to skip many levels, to customize their searches better, and to combine information from different sources more easily. Of course, many issues need to be resolved, including ranking, classification, replication, consistency, data extraction, privacy, and more.
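To make the two-level case concrete, the following sketch builds a compact top-level index that maps each term only to the local indexes that may contain it, and forwards a query just to that subset. The toy local indexes are illustrative.

# Sketch of a two-level index: the top level stores, for each term,
# only pointers to local indexes that may be relevant; the query is
# then forwarded to that subset.  Data below is illustrative.

class LocalIndex:
    def __init__(self, name, docs):
        self.name = name
        self.docs = docs                       # doc name -> text

    def terms(self):
        return {t for text in self.docs.values() for t in text.lower().split()}

    def search(self, term):
        t = term.lower()
        return [d for d, text in self.docs.items() if t in text.lower().split()]

class TopLevelIndex:
    def __init__(self, local_indexes):
        self.pointers = {}                     # term -> [local index, ...]
        for li in local_indexes:
            for term in li.terms():
                self.pointers.setdefault(term, []).append(li)

    def search(self, term):
        """Filter the query to candidate local indexes, then search them."""
        results = []
        for li in self.pointers.get(term.lower(), []):
            results.extend((li.name, doc) for doc in li.search(term))
        return results

if __name__ == "__main__":
    cs = LocalIndex("cs-reports", {"tr-94-1": "scalable resource discovery"})
    bio = LocalIndex("bio-archive", {"seq-12": "gene sequence data"})
    top = TopLevelIndex([cs, bio])
    print(top.search("discovery"))             # searched only in cs-reports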

An example (at a smaller scale) of the multilevel approach is Glimpse [16], an indexing and searching scheme designed for file systems. Glimpse divides the entire file system into blocks, and in contrast with inverted indexes, it stores for each word only the block numbers containing it. The index size is typically only 2% to 4% of the text size. The search is done first on the index and then on the blocks that match the query. Glimpse supports approximate matching, regular expression matching, and many other options.
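The block-index idea can be sketched as follows, assuming the data are simply concatenated and split into fixed-size blocks: the index maps each word to block numbers, and a search scans only the matching blocks. This is a toy illustration of the space/time trade-off, not Glimpse itself.

# Toy sketch of a Glimpse-style block index: record, for each word,
# only the blocks that contain it, then scan matching blocks at
# query time.  Block size and data are illustrative.

BLOCK_SIZE = 64          # characters per block, unrealistically small

def build_index(text):
    blocks = [text[i:i + BLOCK_SIZE] for i in range(0, len(text), BLOCK_SIZE)]
    index = {}
    for bno, block in enumerate(blocks):
        for word in block.lower().split():
            index.setdefault(word, set()).add(bno)
    return blocks, index

def search(blocks, index, word):
    """Consult the small index first, then scan only the candidate blocks."""
    hits = []
    for bno in sorted(index.get(word.lower(), ())):
        if word.lower() in blocks[bno].lower():
            hits.append((bno, blocks[bno]))
    return hits

if __name__ == "__main__":
    text = "resource discovery tools must scale " * 20
    blocks, index = build_index(text)
    print(len(index), "index terms for", len(blocks), "blocks")
    print(search(blocks, index, "scale")[:1])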

Essence and the MIT Semantic File System do selective indexing of documents, by selecting keywords based on knowledge of the structure of the documents being indexed. For example, Essence understands the structure of several common word processing systems (as well as most other common file types in Unix environments), and uses this understanding to extract authors, titles, and abstracts from text documents. In this way, it is able to select fewer keywords for the index, yet retain many of the keywords that users would likely use to locate documents. Because these types of systems exploit knowledge of the structure of the documents being indexed, it would be possible to include document structure information in the index. (This idea was discussed earlier.)

Topic Specialization, Classification, and Information Quality

It is safe to say that at least 99% of the available data is of no interest to at least 99% of the users. The obvious solution to this problem is to construct specialized archives for particular domains of interest. We believe new techniques are needed to simplify the process of managing specialized archives.

Currently, specialized archives rely on 1) direct contributions from their communities and 2) administrators who contribute time and expertise to keep the collections well structured. Except for replication (or mirrors as they are usually called in the Internet) of FTP files, archives do not cooperate among themselves.

We see the need for a discovery architecture and set of tools that make it easy for people to create specialized services, and that automatically discover information from other services, summarize it, keep it consistent, and present it to archive administrators for their editorial decisions. An important component of any complete solution is a directory of services in which archives describe their interest specialization and keep this definition current.

This approach amounts essentially to defining archives in terms of queries [2, 5]. A server uses its query periodically to hunt throughout the Internet for relevant data objects. One type of server could, for example, be specialized to scan FTP archives, summarizing files and making these summaries available to yet other services.
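The idea of an archive defined by a query can be sketched as a periodic filter over object summaries gathered from other services. The query representation, summary format, and editorial queue below are assumptions made for illustration.

# Sketch of a topic-specialized archive defined by a query: it
# periodically asks other services for object summaries, keeps the
# ones matching its query, and queues them for the administrator.

class QueryDefinedArchive:
    def __init__(self, name, query_terms):
        self.name = name
        self.query_terms = {t.lower() for t in query_terms}
        self.pending = []        # candidate objects awaiting editorial review
        self.holdings = []       # objects the administrator accepted

    def matches(self, summary):
        words = set(summary["description"].lower().split())
        return bool(self.query_terms & words)

    def harvest(self, summaries):
        """Periodic hunt: keep only objects relevant to this archive's query."""
        for s in summaries:
            if self.matches(s):
                self.pending.append(s)

    def accept(self, summary):
        """Administrator's editorial decision to include an object."""
        self.holdings.append(summary)
        self.pending.remove(summary)

if __name__ == "__main__":
    archive = QueryDefinedArchive("networking-reports", ["multicast", "routing"])
    archive.harvest([
        {"url": "ftp://a/x.ps", "description": "multicast routing survey"},
        {"url": "ftp://b/y.ps", "description": "database join algorithms"},
    ])
    print([s["url"] for s in archive.pending])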

Such an architecture could greatly reduce the manual steps that archive administrators perform to incorporate users' contributions. This architecture would also help users discover smaller, highly specialized archives.

To help support topic-specialized servers, we believe information must also be classified according to topic- or community-specific taxonomies. For example, an archive of computer science technical reports might be classified using the ACM Computing Reviews taxonomy. Taxonomies allow a more uniform search space than is possible solely by content-indexing documents. Because a particular document may be of value to several communities, it should be possible to classify documents according to multiple taxonomies, and to register classification information for the document in several specialized archives.

Classification should also include some description of information quality. Often, for example, scientists need to understand such things as the methods by which data were gathered and what error controls were used. At this point it is not clear how to specify quality, because there are many associated issues. At a minimum it would be useful to record the author, data collection process, and something about the review process to which the data were subjected before being published. Other possibilities might include such things as pointers to conflicting points of view and support for "voting" by users who have perused the data.

We believe tools should be developed that allow authors to mark up their documents with classification terms from some selected set of taxonomies. In turn, these classification data should be included in topic indexes, with attributes indicating that the source of the information was a classification process rather than just a content-indexing process (since the latter tends to generate many terms that are less well focused or well defined).

Summary

The Internet's rapidly growing data volume, user base, and data diversity will create difficult problems for the current set of resource discovery tools. Future tools must scale with the diversity of information systems, number of users, and size of the information space.

With growing information diversity, techniques are needed to gather data from heterogeneous sources and sort through the inherent inconsistency and incompleteness. Internet Research Task Force efforts in this realm focus on application-specific means of extracting and cross-correlating information, based on both explicit and implicit data-typing schemes.

With a growing user base, significantly more load will be placed on Internet links and servers. This load will require much more heavily replicated servers and more significant use of data caching, highly customizable client interfaces, and self-instrumenting servers to help guide replication and specialization. IRTF efforts in this realm focus on flooding-based replication algorithms that adapt to topology changes, and on customized clients.

As the volume of information continues to grow, organizing and browsing data break down as primary means for supporting resource discovery. At this scale, discovery systems will need to support scalable content-based search mechanisms. Current systems tend to strike a compromise between index representativeness and space efficiency. Future systems will need to support indexes that are both representative and space efficient. IRTF efforts in this realm focus on scalable content-based searching algorithms, and on servers specialized to support particular user communities.

1 The reader interested in an overview of resource discovery systems and their approaches is referred to [20].

2 FTP is an Internet standard protocol that supports transferring files between interconnected hosts. Anonymous FTP is a convention for allowing Internet users to transfer files to and from machines on which they do not have accounts, for example, to support distribution of public domain software.

3 Efforts are under way to improve this gateway.

4 Because it is both a name service and a discovery tool, X.500 could benefit from massive replication.

5 There was a time, of course, when the study of cavalry was much more important than the study of computers.

6 WAIS supports searching at the top level via the directory of servers, but many users simply browse a local copy of this server list.

References

1. Berners-Lee, T., Cailliau, R., Groff, J.F., and Pollermann, B. World-Wide Web: The information universe. Elec. Netw. Res. Appl. Pol. 1, 2 (Spring 1992), 52-58.

2. Bowman, C.M. and Dharap, C. The Enterprise distributed white-pages service. In Proceedings of the USENIX Winter Conference. USENIX Association, Berkeley, Calif., 1993, pp. 349-360.

3. Bowman, M., Dharap, C., Baruah, M., Camargo, B., and Potti, S. A file system for information management. In Proceedings of the Conference on Intelligent Information Management Systems (Washington, D.C., June 1994).

4. Bowman, M., Peterson, L.L., and Yeatts, A. Univers: An attribute-based name server. Softw. Prac. Exp. 20, 4 (Apr. 1990), 403-424.

5. Danzig, P.B., Li, S.-H., and Obraczka, K. Distributed indexing of autonomous Internet services. Comput. Syst. 5, 4 (Fall 1992), 433-459.

6. Danzig, P.B., Obraczka, K., and Kumar, A. An analysis of wide-area name server traffic: A study of the Domain Name System. In ACM SIGCOMM '92 Conference. ACM, New York, 1992, pp. 281-292.

7. Danzig, P.B., Schwartz, M., and Hall, R. A case for caching file objects inside internetworks. In ACM SIGCOMM '93 Conference. ACM, New York, 1993, pp. 239-248.

8. Deering, S. and Cheriton, D. Multicast routing in datagram internetworks and extended LANs. ACM Trans. Comput. Syst. 8, 2 (May 1990), 85-110.

9. Emtage, A. and Deutsch, P. Archie: An electronic directory service for the Internet. In Proceedings of the USENIX Winter Conference. USENIX Association, Berkeley, Calif., 1992, pp. 93-110.

10. Faloutsos, C. Access methods for text. ACM Comput. Surv. 17 (Mar. 1985), 49-74.

11. Gifford, D.K., Jouvelot, P., Sheldon, M.A., and O'Toole, J.W., Jr. Semantic file systems. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles. ACM, New York, 1991, pp. 16-25.

12. Hardy, D. and Schwartz, M.F. Essence: A resource discovery system based on semantic file indexing. In Proceedings of the USENIX Winter Conference. USENIX Association, Berkeley, Calif., 1993, pp. 361-374.

13. Kahle, B. and Medlar, A. An information system for corporate users: Wide Area Information Servers. ConneXions Interoper. Rep. 5, 11 (Nov. 1991), 2-9.

14. Lampson, B. Designing a global name service. In Proceedings of ACM Principles of Distributed Computing. ACM, New York, 1986, pp. 1-10.

15. Manber, U. Finding similar files in a large file system. In Proceedings of the USENIX Winter Conference. USENIX Association, Berkeley, Calif., 1994, pp. 1-10.

16. Manber, U. and Wu, S. Glimpse: A tool to search through entire file systems. In Proceedings of the USENIX Winter Conference. USENIX Association, Berkeley, Calif., 1994, pp. 23-32.

17. McCahill, M. The Internet Gopher: A distributed server information system. ConneXions Interoper. Rep. 6, 7 (July 1992), 10-14.

18. Mockapetris, P. RFC 1034: Domain names--concepts and facilities. Internet Req. Comm. (Nov. 1987).

19. Neuman, B.C. Prospero: A tool for organizing Internet resources. Elec. Netw. Res. Appl. Pol. 2, 1 (Spring 1992), 30-37.

20. Schwartz, M.F., Emtage, A., Kahle, B., and Neuman, B.C. A comparison of Internet resource discovery approaches. Comput. Syst. 5, 4 (Fall 1992), 461-493.

21. Schwartz, M.F., Hardy, D.R., Heinzman, W.K., and Hirschowitz, G. Supporting resource discovery among public internet archives using a spectrum of information quality. In Proceedings of the Eleventh International Conference on Distributed Computing Systems. 1991, pp. 82-89.

22. Schwartz, M.F. and Pu, C. Applying an information gathering architecture to Netfind: A White Pages tool for a changing and growing Internet. IEEE/ACM Trans. Netw. To be published.

23. Sun Microsystems. SunOS Reference Manual. Sun Microsystems, Mountain View, CA, 1990.

24. Weider, C., Fullton, J., and Spero, S. Architecture of the Whois++ index service. Tech. Rep., Nov. 1992. Available by anonymous FTP from nri.reston.va.us, in internet-drafts/draft-ietf-wnils-whois-00.txt.

25. Wood, D.C.M., Coleman, S.S., and Schwartz, M.F. Fremont: A system for discovering network characteristics and problems. In Proceedings of the USENIX Winter Conference. USENIX Association, Berkeley, Calif., pp. 335-348.

C. MIC BOWMAN is a member of technical staff at Transarc Corporation. Current research interests include descriptive naming systems for local and wide-area file systems, structured file systems, resource discovery, and protocols for remote computing. Author's Present Address: Transarc Corp., The Gulf Tower, 707 Grant Street, Pittsburgh, PA 15219; email: mic

PETER B. DANZIG is an assistant professor of computer science at the University of Southern California. Current research interests include the measurement and performance debugging of Internet services, distributed system architectures for resource discovery, and mathematical modeling of communication networks. Author's Present Address: Computer Science Department, University of Southern California, 941 W. 37th Place, Los Angeles, CA 90089-0781; email: danzig

UDI MANBER is a professor of computer science at the University of Arizona. Current research interests include design of algorithms, pattern matching, computer networks, and software tools. Author's Present Address: Computer Science Department, University of Arizona, Tucson, AZ 85721; email: udi

MICHAEL E SCHWARTZ is an associate professor of computer science at the University of Colorado. Current research interests include issues raised by international networks and distributed systems, with particular focus on resource discovery and wide-area network measurement. Author's Present Address: Computer Science Department, University of Colorado, Boulder, CO 80309-0430; email: schwartz
COPYRIGHT 1994 Association for Computing Machinery, Inc.
