Beyond keywords and hierarchies.

Abstract: As our ability to store information increases, the mechanisms we employ to access that information become ever more important. In this paper, we present Archosum, a prototype of an organizational system that attempts to encapsulate the benefits of both hierarchical and keyword systems. By introducing abstract entities, Archosum provides a simple interface with which users can build and maintain powerful relationship-based organizations. We compared Archosum to two alternative systems in a user study. Through this study we begin to expose some of the advantages and disadvantages of each of these three approaches to designing an organizational system. Furthermore, we begin to consider how organizational systems will work when distinct users create organizations for collections and how sharing might be facilitated using Archosum.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 [Content Analysis and Indexing]; H.3.2 [Information Storage]

General Terms

Information Processing, Indexing

Keywords: Information architecture, Key words

Received 30 Oct. 2004; Reviewed and accepted 30 Jan. 2005

1. Introduction

The continuing, rapid expansion of the Internet [25] supplies an ever-increasing quantity of information to the fingertips of connected users around the world. This is a mixed blessing; as the quantity of information increases, our ability to find a particular piece of information decreases. As the size of a collection grows, there is an increasing need for mechanisms to organize that collection. The purpose of any organizational system is to allow a user or group of users to quickly and accurately find a subset of entities within a collection. An information retrieval system is a type of organizational system. This type of system informs the user of the existence and whereabouts of information relating to his/her request [23].

[FIGURE 1 OMITTED]

Since organizational systems provide a method to access large collections, many companies have found a new interest in these systems. Companies ranging from Google, an internet search engine startup, to operating system vendors including Microsoft and Apple, as well as the open source community, have begun looking at alternatives and modifications to traditional organizational systems to accommodate ever increasing collections (from individual file systems to the collected works of the internet) [6,15,17,19].

There are many different approaches to building organizational systems. Examples include SFS, a file system based on semantic file attributes similar to those used in database management systems [9], and, more recently, VennFS, a Venn-diagram based organizational system [5]. Yet most large scale, deployed applications are either based on hierarchies (e.g. computer file systems) or keywords (e.g. web search engines and Google Desktop Search [10]). By learning from the features present in both of the standard approaches and alternative organizational systems, we hope to design an information retrieval system that will allow users to access their ever increasing collections with accuracy, speed and ease and allow them to share collections with other users.

To build an intuitive and accurate organizational system, we looked to several sources including the original ideas of hypertext as advanced by Dr. Vannevar Bush in his article in 1945 describing Memex [3], an approach to organizing information through the use of associations. Tim Berners-Lee used this idea in 1989 to implement a protocol for hypertext: a set of interlinked digital documents which became the basis of the World Wide Web [1] and subsequently advanced this idea in the Semantic Web.

We took a subset of features from the Semantic Web as the starting point for our organizational system, which we call Archosum. In building a semantic web organization, users make associations between entities, but as the size of a collection grows, the complexity also increases. As we will see in section 3, managing an organization for a changing collection becomes very complicated. Archosum abstracts the details and assists users in creating and managing complex organizations, thus allowing users to utilize a Memex style organization without the usual upkeep costs.

Organizational systems can be applied to collections of entities of any type (e.g. text documents, books, music, movies, files on a computer disk or web addresses (URLs)). In the context of this paper, we define entities as anything that can be referenced by a Uniform Resource Identifier (URI), e.g. URIs of websites of books, movies, research groups or personal homepages, on-line research articles and others.

2. Organizational Approaches

Library scientists have been working on ways to organize collections of physical documents for decades [13]. With the introduction of computer catalog systems, many libraries have chosen to utilize more than one information retrieval system to give patrons multiple methods of finding documents within their collection. Although many systems exist, there are two general approaches in wide use that we will discuss before presenting the details of our approach.

2.1 Hierarchies

Hierarchical tree structures can be found in a variety of fields dating back to the 18th Century from Biology (Linnaean taxonomy) to Psychology (Maslow's hierarchy of needs) to Linguistics (Chomsky hierarchy) [13]. The world's most famous hierarchy dates back to 1873, the first implementation of the Dewey Decimal System. From computer desktops to corporate power structures, hierarchies have become an extremely popular way to organize information due to several benefits:

1. Computational Efficiency--hierarchies are very space- and computation-time-efficient.

2. Understandable--hierarchical structures have been woven into the fabric of society and hierarchical metaphors are used in everyday life (e.g. placing an object in a drawer in a desk, which is in a room, in a house, on a street, in a city, in a country, on a continent, etc.). The average user understands intuitively how hierarchies are organized and uses these intuitions when organizing information in a hierarchical way.

3. Ease of Recognition--as users browse a hierarchy, slowly refining their query through each level, they need only recognize directory names as they are presented; this cognitive process is much easier than recalling keywords or labels or formulating queries.

In small collections, these advantages of hierarchies make them a good choice for organization. However, as the collections increase in size, certain problems become evident:

1. Static Organization--the structure of a collection cannot adapt to changes in the collection, as the organization is specified explicitly by the user.

2. Single Tree Order--users must arbitrarily decide the order of directories even when this ordering is not intuitive and, furthermore, remember that ordering when looking for a document. For example, we might have decided to save this paper in either of the following directories: "/Docs/Papers/Projects/Archosum/" or "/Projects/Archosum/Docs/Papers/". It isn't obvious whether the first choice is better or more likely than the other [22].

3. Growing Complexity--as a collection increases in size, the organizational tree must grow in either depth or child cardinality. Either growth is undesirable; as depth increases, the number of refinements from the root to leaves increases; and as child cardinality increases, the number of choices at each refinement increases.

4. Lost Relationships--a single hierarchy represents ideally one semantic relationship. Any other relationships between entities are lost.
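The trade-off described in item 3 can be made concrete with a short calculation. As an illustration only, assume a balanced tree in which every directory has the same number of children; the symbols N (number of leaf categories), b (child cardinality) and d (depth) are introduced here and do not appear in the study.

```latex
\[
N \le b^{d}
\;\Longrightarrow\;
d \ge \lceil \log_b N \rceil,
\qquad
N = 10^{6}:\quad
b = 10 \Rightarrow d = 6,\quad
b = 100 \Rightarrow d = 3,\quad
b = 1000 \Rightarrow d = 2 .
\]
```

Under this assumption, holding a million leaf categories means either six refinements with ten choices each or two refinements with a thousand choices each; neither the depth nor the child cardinality can be kept small as the collection grows.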

Because of these problems, classifying and finding documents in hierarchies becomes more difficult as the size of the collection increases. The Dewey Decimal System is standardized and maintained exclusively by librarians trained in classification, but many library patrons still prefer to search a keyword catalog to navigate the massive library collections.

Although there have been various attempts to organize the Internet into a hierarchy, the Open Directory Project is the most widely distributed and largest human edited directory online [19]. In 1998, Richard Skrenta and Bob Truel started the Open Directory Project (ODP) which today has over 65,000 volunteer editors attempting to organize the internet.

As the hierarchy grew in complexity and content expanded in diversity, the rate at which editors could add content diminished to less than 1% growth per month in 2004 (compared with over 7% in 1999). This suggests that as hierarchies become more complicated they become less useful.

If editors are having increasing difficulty in classifying websites, it is not for a lack of new websites, but rather it reflects the crippling effect of the increasing complexity of the hierarchy. If the editors (the creators of the organization) are having increasing difficulty organizing the directory, it seems a logical conclusion that consumers of the directory will have equal or greater difficulty finding information in the directory.

2.2 Keyword-based

A keyword-based approach organizes documents by creating an inverted index of keywords related to each entity. In a keyword-based approach, users need not classify individual documents as the system can build an index of keywords based on the contents of any given document (and more recently keywords in documents known to be related [2]). Building an index is an automated process, which builds an organization based solely on the document collection. Since the process of building an organization is automated, the scalability of keyword-based systems is not limited by human interaction, but rather by computation speed.
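A minimal sketch of such an inverted index follows. It is not the indexing scheme of any particular search engine; the function names, the tokenization rule and the three sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words (a deliberately naive rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample collection.
documents = {
    "doc1": "Hierarchies organize files in a tree",
    "doc2": "Keyword search uses an inverted index",
    "doc3": "An inverted index maps keywords to files",
}
index = build_inverted_index(documents)
print(search(index, "inverted index"))
print(search(index, "keyword"))
```

No human classification is involved: the index is derived entirely from the document text. The second query returns only doc2, because doc3 contains "keywords" rather than "keyword"; exact matching of this kind is one source of the vocabulary problems listed below.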

Due to this automation, keyword-based information retrieval systems allow us to organize larger document collections than ever before. Google has an index of over 8 billion documents, eclipsing the size of the ODP by nearly 2000 times. Google's index is also updated periodically, whereas the ODP's index is relatively static. However, keyword systems are not without their problems:

1. Human Recall--users must synthesize keywords that describe the entity they are looking for.

2. Word Meanings--if a keyword has more than one meaning, query revision may be required to obtain the correct result set. The vocabulary of the user may change his/her result set, if it is different than that of the content creator (and thus the index). Although queries can be expanded by use of thesauri like WordNet [7], the problem is rooted deeper in the approach since context is difficult to determine and store.

3. Abstract Concept--the keywords associated with a document by the system may not make sense to all users. These keywords also may change with the content or indexing algorithm causing users confusion and frustration when they attempt to find a document for a second time.

4. Misspellings, translation--if either the content producer or the directory consumer misspells a word, it will not constitute a match. Again, text-processing techniques (e.g. word stemming), the use of dictionaries and other approaches can be used to limit, but not eliminate this problem [4].

3. Approach

We propose an approach that encompasses the benefits of both hierarchical and keyword-based systems while allowing users to create organizations and use them in a familiar way. Our system makes use of a directed graph to organize entities, where an entity is a unit of data (like a file, book, movie, paper, person's homepage etc.). Thus as users organize entities, a web of associations is built, resembling hyperlinks on the Internet. The problem with direct association among entities (the approach used to link hypertext or web documents) is that if we wish to add a new entity, there may be many related entities that we must associate with it. Similarly, when we delete an entity we must remove many associations. Updates to the collection would be complicated and consume both human and computer time.

To deal with this problem, we introduce the concept of abstract entities. An abstract entity is a named concept (e.g. a directory / class / category / genre). Users can relate abstract entities in the same way as they would relate non-abstract (existential) entities, but the relationship of an existential entity to an abstract entity is automatically propagated to all other existential entities related to the abstract entity. Let's consider an example of 10 entities (5 songs, 5 reviews). We want to relate all the songs together; all the reviews together; and each review to the song it covers.

[FIGURE 2 OMITTED]

When an existential entity (Song) is associated with an abstract entity (Music), it is automatically related to all the other entities associated with that abstract entity; thus one review is associated with every other review even though the user created only one association (to the abstract entity).

Since abstract entities propagate relationships, Figure 2a and 2b express equivalent organizations. Abstract entities help simplify complicated relationships while preserving their meaning. This simplification also lends itself to scalability; if we wish to add a new song to the organization, the user needs only relate it to music (and all other songs will be related via propagation).

Our solution thus bears similarity to the ideas of Vannevar Bush, which inspired the development of hypertext, the concept of semantic networks in AI, the web and the semantic web. However, it extends these ideas in the following ways:
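The propagation idea can be sketched in a few lines of code. The following is an illustrative toy model, not the Archosum implementation: the class `Organization`, its method names and the category name "Reviews" are assumptions, and the data mirrors the 5 songs and 5 reviews of the example.

```python
from collections import defaultdict
from itertools import combinations


class Organization:
    """Toy model: abstract entities propagate relationships among members."""

    def __init__(self):
        self.members = defaultdict(set)  # abstract entity -> existential entities
        self.links = set()               # direct entity-to-entity associations

    def associate(self, entity, category):
        """One user action: relate an entity to an abstract entity."""
        self.members[category].add(entity)

    def link(self, a, b):
        """One user action: relate two existential entities directly."""
        self.links.add(frozenset((a, b)))

    def user_actions(self):
        """Number of associations the user created by hand."""
        return sum(len(m) for m in self.members.values()) + len(self.links)

    def expanded_pairs(self):
        """All entity pairs implied once category membership is propagated."""
        pairs = set(self.links)
        for m in self.members.values():
            pairs.update(frozenset(p) for p in combinations(sorted(m), 2))
        return pairs

    def related(self, entity):
        """Entities related to `entity` after propagation."""
        return {other
                for pair in self.expanded_pairs() if entity in pair
                for other in pair if other != entity}


org = Organization()
for i in range(1, 6):
    org.associate(f"song{i}", "Music")
    org.associate(f"review{i}", "Reviews")
    org.link(f"review{i}", f"song{i}")

print(org.user_actions(), len(org.expanded_pairs()))
```

In this sketch the user creates 15 associations by hand, while the propagated organization contains 25 pairwise relationships. Adding a sixth song costs one further association, whereas the direct-association approach would need one new link per existing song.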

* Users manage a group of relationships (via abstract entities) while Archosum manages the resulting individual relationships.

* Users create abstract entities in a way similar to creating directories in a hierarchy.

* The underlying structure of the organization is abstracted from the user, yet relationships are based solely on the decisions made by a user.

When building an organization, users create abstract entities (from now on we will call them "categories" for simplicity) and organize entities (for simplicity from now on we will call existential entities just "entities") by associating them with other categories. The system does not restrict the number of associations created so a given entity can have more than one association. Users have the sensation that they can put entities in more than one place and consequently they can access entities by browsing along multiple distinct paths in the organization. Furthermore, users can associate entities not only with categories, but with other entities (e.g. "Mark" is an entity. An email (another entity) from "Mark" is associated with the category "E-mails" but also associated with the entity "Mark"). The resulting organization can be thought of as a graph (with entities as vertices and relationships as edges).

4. Archosum: An Architecture for the Proposed Organization Approach

We have developed an architecture called Archosum for the organization approach described in the previous section. The architecture encompasses a user who creates an organization of entities, a representation of the organization (consisting of a graph with entities and categories) and a search interface, which allows the user to navigate the organization by browsing. The architecture also provides a communication channel allowing users to share their organizations on a peer-to-peer basis. However, sharing is not the focus of this paper. Rather, our focus is on the single user: allowing easy creation of scalable organizations and search by browsing.

To use Archosum, a user must have some way of viewing and browsing the organization. Hierarchies can be viewed level by level (as in file systems); however, a graph can be very difficult to visualize, especially as it grows [12]. To allow our approach to scale, the user should not see the entire graph but only a portion based on some starting point. In hierarchies there is a static root node that works as a starting point from which any other node can be reached. In our approach we eliminate the notion of the root node used in hierarchies in favor of allowing any entity or group of entities to act as a starting point.
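One way to show only a portion of the graph is to restrict the view to nodes within a few hops of the chosen starting point. The sketch below illustrates that idea with a breadth-first traversal; the function `neighborhood`, the hop limit and the sample graph are assumptions for illustration and are not taken from the Archosum prototype.

```python
from collections import deque


def neighborhood(graph, start, max_hops=1):
    """Return {node: hops from start} for every node within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the visible horizon
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen


# Hypothetical organization: categories and entities are both vertices.
graph = {
    "Bookmarks": {"news-site", "band-page"},
    "news-site": {"Bookmarks", "News"},
    "band-page": {"Bookmarks", "Music"},
    "News": {"news-site"},
    "Music": {"band-page", "song"},
    "song": {"Music"},
}

print(neighborhood(graph, "Bookmarks", max_hops=1))
```

Any vertex can serve as `start`, so no root node is needed; widening `max_hops` reveals more of the graph without ever requiring the whole structure to be drawn.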

In Archosum, the starting point is dynamically chosen based on some initial context of the user query. If a user is looking for bookmarks on their computer, a good starting point is the category "Bookmarks". The capturing of this initial context is beyond the scope of Archosum, but we believe applications could easily provide this. For example, when a user wishes to open a document in a word processor, the application could tell the organizational system to use "Word Processor Documents" as a starting point. This is similar to the way in which many applications limit results in a file system to those files ending with a specific extension (e.g. ".JPG"). We believe users will create more effective organizations in Archosum because the underlying structure of the organization is abstracted from the user. Users do not need to worry about how relationships are managed. Users simply classify entities into categories as they would in familiar systems (hierarchy and keyword based). Once users build an organization in this way, they can choose to browse or search, but more importantly Archosum can exploit the relationships formed between entities to suggest methods for expanding or refining user queries or show similar and related documents.

5. User Study

The aspirations of our approach are:

* To provide a powerful mechanism for organizing large, heterogeneous collections that is easy to use and easily modifiable.

* To provide both the ability to search directly (by query) and to browse in the organization.

* To allow the comparison and sharing of organizations created by distinct users.

We have not evaluated our approach with respect to all these aspirations yet, but we have started working on the first one by performing a user study to explore how our approach compares with keyword and hierarchical organization approaches. We created a web-based prototype implementation of Archosum, which organizes entities in a fashion similar to a traditional web directory. For the purpose of comparison we created two alternative organizational systems: a keyword based system and a hierarchy based system.

[FIGURE 3 OMITTED]

To compare our approach with hierarchical and keyword based organizations, we need to define some metrics on which we can compare the three systems. Precision and Recall are standard metrics for comparing information retrieval systems as defined by the Text Retrieval Conference (TREC) [9]. Information retrieval systems are typically "user-neutral": their index is based on the collection; therefore the organization is a function of the collection. In our system the user is active in building the organization; thus the organization is a function of both the collection and the user.

Since we have introduced a new variable (the user), we evaluate the process of building an organization and finding entities in this organization as two separate steps. In our experiment, we have only considered the first step of this process, the organization step. The second step, showing that the organization allows more effective search, will be explored in future studies. To evaluate the first step, we define the following metrics on which we can judge an organizational system:

Organizational Metrics:

1. Cross User Similarity--the amount of similarity between the organizations created by different users in a system

2. Depth of entity vs. Breadth of category--This describes the continuum of classification. In one extreme all entities could be classified in one category (high breadth of category). In the other extreme, entities could each be classified in a unique, very specific category (high depth of entity).

3. Time to Classify--the amount of human time required to build an organization.

In our experiment, 12 users were asked to organize 50 documents in each of three organizational systems over a 10 day period. We compiled a collection of 150 web addresses from 6 general categories including: news, people, companies, movies, musicians and academic papers. For each organizational system we selected 50 entities with an approximately even proportion of representation of each category in each system. To minimize user learning between systems, the entities in each system had no overlap.

Each user filled out pre- and post-experiment questionnaires. For each system, users went through a four step process:

1. Short training for how the system works.

2. Chance to browse a working organization in that system.

3. Training for how to organize documents in a system.

4. Classification of 50 entities using the system within 30 minutes.

[FIGURE 4 OMITTED]

[FIGURE 5 OMITTED]

All participants completed this process for each of the three systems (hierarchy-based, keyword-based and Archosum). We analyzed the timing of actions and resulting organizations in each system.

6. Results

Although each system may use different methods to build and utilize an organization, we need to compare all systems using common metrics. To do so, we introduce the following quasi-formal description of a generic organizational system:

In each system we have a set of entities ($E_k \in E$) and a set of users ($U_i \in U$). Each user ($U_i$) creates a set of categories ($C_{ij} \in C_i$). Building an organization in any of the systems involves placing entities into categories ($E_k \in C_{ij} \in C_i$ for $U_i$). In each system there are 50 entities ($E_k : k \in [1,50]$), 12 users ($U_i : i \in [1,12]$) and some number of categories created by each user ($C_{ij} : i \in [1,12],\ j \in [1,\lambda_i]$, where $\lambda_i$ is the number of categories created by $U_i$).

Time to Classify

The time to classify the i-th entity for a user ($TC(E_i, U_j)$) was measured as the time between seeing the i-th entity and classifying it. Since users might take a break or leave, we ignored any time over a threshold (10 minutes). Times over the threshold were rare (less than 0.5% of classification times were ignored).

$$TC(E_i, U_j) = [\text{Time } E_i \text{ classified by } U_j] - [\text{Time } E_i \text{ was shown to } U_j]$$

$$TC(E_i)_{ave} = \frac{\sum_{j} TC(E_i, U_j)}{|U|}$$
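The measurement can be sketched as follows. The event format, the function name and the sample timestamps are hypothetical. The sketch also makes one assumption the formulas leave open: when a time is discarded for exceeding the threshold, the average is taken over the remaining measurements rather than over all users.

```python
THRESHOLD_SECONDS = 10 * 60  # the 10 minute cut-off described above


def average_time_per_entity(events, threshold=THRESHOLD_SECONDS):
    """Average classification time per entity, skipping over-threshold times.

    events: iterable of (entity, user, shown_at, classified_at), in seconds.
    """
    totals = {}
    counts = {}
    for entity, _user, shown_at, classified_at in events:
        elapsed = classified_at - shown_at  # TC(E_i, U_j)
        if elapsed > threshold:
            continue  # the user probably took a break; ignore this time
        totals[entity] = totals.get(entity, 0) + elapsed
        counts[entity] = counts.get(entity, 0) + 1
    return {entity: totals[entity] / counts[entity] for entity in totals}


# Hypothetical log: three users classify E1, two of them also classify E2.
events = [
    ("E1", "U1", 0, 30),
    ("E1", "U2", 0, 50),
    ("E1", "U3", 0, 700),   # over the threshold, ignored
    ("E2", "U1", 30, 60),
    ("E2", "U2", 50, 60),
]
print(average_time_per_entity(events))
```

With this sample log the averages are 40 seconds for E1 and 20 seconds for E2; the 700 second measurement does not contribute.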

We can see that in each system users took less time to classify entities as they gained experience with each system. The hierarchy curve is slightly skewed by two users who did not fully understand the interface for this system. These users spent most of their time on the first few entities and rushed on the rest, thus producing the above curve.

Depth vs. Breadth

The depth of an entity ($D(E_i, U_j)$) measures the number of categories with which a user associated that entity. Once we had the depth of each entity, we computed the average depth across all entities for each user and plotted it in the left graph of Figure 5.

Note: Since all users classified the same set of entities, $|E| = 50$.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
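The expression omitted here is not recoverable from the source. Based only on the surrounding description (a per-user average of entity depth over all entities), it plausibly has the following form. The category symbol $C_{jc}$ (the c-th category of user $U_j$) and the subscript "ave" are assumptions chosen to match the time-to-classify formula.

```latex
\[
D(E_i, U_j) = \bigl|\{\, c : E_i \in C_{jc} \,\}\bigr|,
\qquad
D(U_j)_{ave} = \frac{\sum_{i} D(E_i, U_j)}{|E|}
\]
```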

The breadth of a category ($B(E_i, U_j)$) measures the number of entities classified in that category for a given user. The right graph in Figure 5 plots the average breadth for each user in each system.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
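As with depth, the omitted expression cannot be recovered exactly. A form consistent with the description (the average number of entities per category for each user) is sketched below. Because breadth is a property of a category, this sketch indexes it by category $C_{jc}$ rather than by entity; $\lambda_j$ denotes the number of categories created by $U_j$. Both choices are assumptions.

```latex
\[
B(C_{jc}, U_j) = \bigl|\{\, i : E_i \in C_{jc} \,\}\bigr|,
\qquad
B(U_j)_{ave} = \frac{\sum_{c=1}^{\lambda_j} B(C_{jc}, U_j)}{\lambda_j}
\]
```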

The results in Figure 5 show that the keywords organization had a higher average depth (more keywords associated with an entity) while the hierarchical organization had a higher average breadth (more entities classified in a given category). These results are not surprising: it is harder for the user to create many hierarchical categories (creating a new category requires identifying the parent category), therefore the few that are created tend to be associated by users with more entities (higher average breadth for hierarchies). Conversely, creating new keywords is very easy, and users easily forget keywords that they have previously used, therefore they associated many keywords with each entity (higher average depth for keywords). Archosum appears more similar to hierarchies with respect to these two measures than to keywords.
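The two measures can be computed directly from one user's organization. The sketch below assumes an organization is stored as a mapping from category name to the set of entities placed in it; the two sample organizations are invented to mimic a keyword-style and a hierarchy-style user and are not data from the study.

```python
def average_depth(org, entities):
    """Mean number of categories per entity for one user's organization."""
    total = sum(
        sum(1 for members in org.values() if entity in members)
        for entity in entities
    )
    return total / len(entities)


def average_breadth(org):
    """Mean number of entities per category for one user's organization."""
    return sum(len(members) for members in org.values()) / len(org)


entities = ["a", "b", "c", "d"]

# Hypothetical keyword-style user: many small categories.
keyword_org = {
    "rock": {"a"}, "guitar": {"a"}, "live": {"a", "b"}, "concert": {"b"},
    "news": {"c"}, "daily": {"c"}, "politics": {"c", "d"}, "film": {"d"},
}
# Hypothetical hierarchy-style user: few large categories.
hierarchy_org = {"Music": {"a", "b"}, "Media": {"c", "d"}}

print(average_depth(keyword_org, entities), average_breadth(keyword_org))
print(average_depth(hierarchy_org, entities), average_breadth(hierarchy_org))
```

In this toy data the keyword-style organization has average depth 2.5 and average breadth 1.25, while the hierarchy-style organization has average depth 1.0 and average breadth 2.0. Both measures count the same entity-to-category placements, once divided by the number of entities and once by the number of categories, which is why creating many categories pushes depth up and breadth down.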

Cross User Similarity

Through this metric we are trying to explore the similarity between organizations built by different users. To examine the similarity, we can begin by looking at the categories created by different users (see Figure 6). We believe the labels (names) given by users to the categories provide little benefit in comparing organizations, as they are highly dependent on the vocabulary of a given user; thus, we have chosen to compare the similarity between the patterns of relating entities to categories across users and ignore the labels (names) as provided by users.

[FIGURE 6 OMITTED]

In each system, the users provided different numbers of categories; thus enumerating and comparing them would be difficult and non-intuitive. To facilitate the comparison, we introduce the concept of "relationships".

Each user ($U_i$) has a set of relationships ($R_{ijk} \in R_i$) that relate two entities ($E_j$ and $E_k$). A relationship ($R_{ijk}$) exists if the user ($U_i$) has placed two entities ($E_j$ and $E_k$) in a common category ($C_{ic}$). All relationships are symmetrical, thus $R_{ijk} \in R_i$ implies $R_{ikj} \in R_i$.

The benefit of introducing relationships is that we can easily enumerate the distinct relationships from a given entity (an entity $E_j$ can be related by a user $U_i$ to at most all other entities, thus $|R_{ij}| \le |E| - 1$). If we know all the relationships ($R_{ij}$) from an entity ($E_j$) established by a user ($U_i$), we can compare this set to the set of relationships ($R_{kj}$) established for the same entity ($E_j$) by each other user ($U_k : k \ne i$). First, we count the average number of relationships made by each user in each system. This tells us how likely users are to relate different entities using each system. The left graph in Figure 7 plots these values.

$$R_{i,ave} = \frac{\sum_{j} |R_{ij}|}{|E|}$$
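The relationship count can be illustrated with a short sketch. It assumes, as above, that one user's organization is a mapping from category name to a set of entities; the function names and sample data are hypothetical.

```python
def related_entities(org, entity):
    """Entities sharing at least one category with `entity` (the set R_ij)."""
    related = set()
    for members in org.values():
        if entity in members:
            related |= members
    related.discard(entity)  # an entity is not related to itself
    return related


def average_relationships(org, entities):
    """Average number of relationships per entity for one user."""
    return sum(len(related_entities(org, e)) for e in entities) / len(entities)


# Hypothetical organization of five entities by a single user.
entities = ["a", "b", "c", "d", "e"]
org = {"Music": {"a", "b", "c"}, "Live": {"c", "d"}}

print(related_entities(org, "c"))
print(average_relationships(org, entities))
```

Here entity "c" is related to three others, which respects the bound of at most four for a collection of five, and the unclassified entity "e" contributes zero. The average is 8 / 5 = 1.6, where each symmetric relationship is counted once from each of its two endpoints.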

[FIGURE 7 OMITTED]

The most important value is not the number of relationships established, but rather the number of relationships established by a user that are shared by other users. We measure this by constructing the weighted common set of relationships RC.

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
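The omitted definition cannot be recovered verbatim. Judging from the description of RC as a weighted set and from the worked example that follows, each pair of entities appears to be weighted by the number of users who related that pair. Under that assumption, a consistent formulation is:

```latex
\[
RC_{jk} = \bigl|\{\, i : R_{ijk} \in R_i \,\}\bigr|,
\qquad j \ne k
\]
```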

For example, if $U_1$ placed $E_1$, $E_2$, $E_3$ and $E_4$ in the same category ($C_{1,5}$) while $U_2$ placed $E_2$ and $E_4$ in $C_{2,3}$ and $E_3$ and $E_4$ in $C_{2,5}$, then the weighted common set (RC) would consist of the following values:

$RC_{1,2} = 1$, $RC_{1,3} = 1$, $RC_{1,4} = 1$, $RC_{2,3} = 1$, $RC_{2,4} = 2$, $RC_{3,4} = 2$

Note: the symmetric equivalents are not shown but would also exist in RC.

Now that we have compiled the weighted common set, we can extract the number of relationships shared across some number of users (x). In Figure 7b, we have plotted the number of shared relationships for eleven of the twelve users, $x \in [1,12]$. As can be seen in Figures 7a and 7b, users created more relationships in Archosum and, moreover, there was a higher number of shared relationships in Archosum across multiple users.
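A sketch of how the weighted common set and the per-x counts might be computed is given below. The three users and their categories are invented and are unrelated to the study data. Whether the plotted value counts relationships shared by exactly x users or by at least x users is not stated, so the sketch counts exactly x; the cumulative variant follows by summing.

```python
from collections import Counter
from itertools import combinations


def user_pairs(org):
    """Unordered entity pairs that one user placed in a common category."""
    pairs = set()
    for members in org.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def weighted_common_set(orgs):
    """Map each entity pair to the number of users who related that pair."""
    rc = Counter()
    for org in orgs.values():
        rc.update(user_pairs(org))  # a user counts each pair at most once
    return rc


def shared_by_exactly(rc, x):
    """Number of relationships established by exactly x users."""
    return sum(1 for users in rc.values() if users == x)


# Hypothetical organizations of four entities by three users.
orgs = {
    "alice": {"music": {"a", "b", "c"}},
    "bob": {"songs": {"a", "b"}, "live": {"b", "d"}},
    "carol": {"audio": {"a", "b", "c"}, "misc": {"d"}},
}
rc = weighted_common_set(orgs)
print(dict(rc))
print([shared_by_exactly(rc, x) for x in (1, 2, 3)])
```

Pairs are stored once in sorted order, so the symmetric equivalents are implied rather than duplicated. Category labels never enter the computation, which matches the decision to compare placement patterns rather than names.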

7. Discussion

Our results show the differences between the three organizational systems tested. These differences translate into distinct performance characteristics for each system. These characteristics were identified in our small experimental data set and would likely be amplified in a full-scale implementation; thus, making an incorrect choice of organizational system could have far-reaching consequences. It is therefore very important to know how the strengths and weaknesses can be exploited or avoided when implementing an organizational system.

While a keyword based approach provides high depth of entity, it suffers from low breadth of category. The opposite is true for hierarchical systems. The Archosum approach gives users the freedom to associate entities to many categories and abstracts the structure of the organization from the user. In the end Archosum builds organizations that have both moderate depth of entity and breadth of category.

Since the ideal level of depth and breadth is determined by the purpose and intended audience for a given organization, we cannot state that any organizational system is inherently better than another. However, we can state that for any situation there is an ideal choice since each system has a distinct behavior.

When looking at the cross user similarity metrics we observed major differences between the three systems. In the keyword-based approach, users specified over four times as many categories per entity yet established fewer relationships between entities than in the hierarchical system or Archosum. The number of relationships shared across users in Archosum was also higher than in the hierarchical system.

In an increasingly connected world, it seems archaic to consider only a single user's collection. The cross user similarity metric allows us to consider how well we can combine the collections of two or more users to build a larger collection or group of collections. The benefit of being able to combine collections is not evident in our experiment as all users classified the same set of entities. In a real world situation, users would be much more likely to organize document collections of inconsistent contents.

Consider a real world situation in which some user $U_1$ has organized a collection of entities $E^1$. Is there a way that $U_1$ can automatically find new entities that are similar to those in a given category of their current organization? If we examined the collection of entities $E^2$ organized by $U_2$, we could find the intersection of both collections ($E^1 \cap E^2$). From the intersection, we could develop a pattern based on the similarity between both users' organizations, and use that pattern to find entities in $E^2$ that $U_2$ organized similarly to the entities organized by $U_1$.

Obviously, the ability to intelligently combine different semantically organized collections is not trivial. Thus, a system that can identify and represent similarities between organizations of different users may provide a way to organize large collections in a more coherent way than ever before.
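The paper only outlines this idea, so the following is one speculative way it could be realized, not the authors' method. It assumes each organization is a mapping from category name to a set of entities, and it scores an unseen entity by how many entities from the chosen category of $U_1$ share a category with it in the organization of $U_2$. The function name, scoring rule and sample data are all hypothetical.

```python
def suggest(org1, category, org2):
    """Rank entities unknown to user 1 by co-placement with user 1's category.

    org1, org2: dict mapping category name -> set of entities.
    """
    known = set().union(*org1.values())  # E^1, everything user 1 has organized
    target = org1[category]              # the category user 1 wants to extend
    scores = {}
    for members in org2.values():
        overlap = len(members & target)  # shared entities user 2 grouped here
        if overlap == 0:
            continue
        for candidate in members - known:  # entities new to user 1
            scores[candidate] = scores.get(candidate, 0) + overlap
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical paper collections of two users.
org1 = {"Jazz": {"p1", "p2", "p3"}, "Rock": {"p4"}}
org2 = {
    "Bebop": {"p1", "p2", "p9"},
    "Swing": {"p3", "p7"},
    "Metal": {"p4", "p8"},
    "Other": {"p6"},
}
print(suggest(org1, "Jazz", org2))
```

In this example p9 ranks first because the second user grouped it with two of the first user's "Jazz" entities, and p7 ranks second with one. The labels "Bebop" and "Swing" play no role, so the users need not share a vocabulary.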

It is this ability to fuse collections that gave Archosum its name. Archosum is the combination of two words: the Greek archos, meaning ruler, and the Latin visum, meaning view. The analogy is that by fusing organizations we can create an organization (or view) that is customized and controlled by each user (the ruler). Archosum may provide a way for humans to organize collections of constantly increasing size without increasing the complexity of organization.

8. Conclusion

Our experiment and analysis evaluate the process of creating an organization of entities in three different systems. The results of this experiment show significant advantages for each system in certain situations. It therefore seems that alternatives to the standard keyword and hierarchy-based systems may be of significant value and warrant more investigation.

This experiment also casts a new focus on the creation of organizations. TREC has evaluated information retrieval systems for many years based on their query performance (judged by recall and precision), but has not directly examined the creation of organizations since this is usually automatic. The metrics we have presented allowed us to compare systems that do not automatically generate indices, but rely on users to classify the contents of their collections.

Despite the recent surge of development in consumer file systems and their organizational systems, few large implementations use systems that are not keyword or hierarchy-based, even though alternative systems are available in academia [5,9]. We hope our results help the development of future alternative organizational systems as well as their use in practical implementations.

We believe Archosum could be applied to environments where many users are attempting to find new entities related to their current organization based on the recommendations of other users. Of specific interest would be systems like Comtella [24], a peer-to-peer application developed to share papers in academia. Archosum would allow each user to organize their collection of papers as they chose (rather than relying on some centrally determined hierarchy), and would then fuse these organizations based on their similarities to discover new papers relevant to each user. Furthermore, Archosum would better facilitate cross-field work, as a paper could be classified into more than one category and a search could be conducted on more than one category.

9. Further Research

The first avenue of further research is to expand our experiment to include the second step of an organizational system, the actual retrieval of documents given an index. If we could test all three systems using the metrics defined by TREC, we might discover new evidence to support the implementation of a particular organizational system, or possibly some criteria for building an even better system. We believe that our work on metrics for building organizations may be used for performance comparisons by other organizational systems as they are developed.

The Resource Description Framework (RDF) is a language designed by the World Wide Web Consortium (W3C) for representing information about resources in the World Wide Web [16]. RDF uses triples to describe resources and relationships between resources. We believe that our approach could easily be implemented using RDF and possibly even expanded beyond simple associations to include multiple types and strengths of relationships using RDF.
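To make the triple structure concrete, the sketch below stores entity-to-category associations as (subject, predicate, object) triples using plain Python tuples. It is an illustration only: the `http://example.org/archosum/` namespace, the `memberOf` predicate, and the paper identifiers are hypothetical placeholders rather than a published vocabulary, and no RDF library is assumed.

```python
# Hypothetical vocabulary; these URIs are placeholders, not a published schema.
EX = "http://example.org/archosum/"

# Each triple reads: subject --predicate--> object.
triples = {
    (EX + "paper42", EX + "memberOf", EX + "InformationRetrieval"),
    (EX + "paper42", EX + "memberOf", EX + "FileSystems"),
    (EX + "paper7", EX + "memberOf", EX + "FileSystems"),
}


def members_of(category, graph):
    """Return every subject linked to `category` by the memberOf predicate."""
    return {s for (s, p, o) in graph
            if p == EX + "memberOf" and o == EX + category}
```

Here `members_of("FileSystems", triples)` returns both papers, and `paper42` appears in two categories at once. Additional relationship types would simply be additional predicates in the same set of triples.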

An RDF implementation of Archosum brings many new features but also many new questions:

1. What is the nature of relationships?

In Archosum, entities are related to categories in a directed, equally weighted sense. RDF allows us to build unlimited types of relationships, including undirected and weighted relationships. Are these concepts intuitive, and do they give us a better tool with which to describe relationships, or do they allow us too much freedom? Innovative file-management systems like VennFS [5] already explore the benefits of relationship weights by organizing documents in a plane; documents that are far apart are loosely related, while those close together are tightly related.

2. Should organizations be concerned with time?

Should relationships decay over time to create a process of natural selection? Could this be used to build an inherent junk or spam deterrent?
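One possible reading of this question is sketched below, purely as an assumption: each relationship carries a weight that halves after a fixed half-life, and relationships whose weight falls below a cutoff are dropped. The function names, the 180-day half-life, and the 0.1 cutoff are all hypothetical choices, not values proposed in this paper.

```python
import math


def decayed_weight(initial, age_days, half_life_days=180.0):
    """Weight of a relationship after age_days, halving every half_life_days."""
    return initial * math.pow(0.5, age_days / half_life_days)


def prune(relationships, now_day, min_weight=0.1):
    """Keep only relationships whose decayed weight is at least min_weight.

    relationships: iterable of (entity, category, weight, last_used_day).
    """
    return [
        (entity, category, weight, last_used)
        for entity, category, weight, last_used in relationships
        if decayed_weight(weight, now_day - last_used) >= min_weight
    ]
```

Under this scheme, a relationship that the user keeps using has its `last_used_day` refreshed and survives, while one that is never revisited eventually falls below the cutoff and disappears.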

3. How well can machines classify documents?

Can we combine the concepts presented in this paper with those of machine classification? Can the system provide classification suggestions, and then adapt them based on user acceptance and further suggestions?

4. Can collections organized by different users be shared seamlessly?

We will explore sharing collections, and automatic search across collections organized subjectively by different users, using relationships between entities as defined by the measures presented in this paper.

References

[1.] Berners-Lee, Tim. (1989). "Information Management: A Proposal". CERN. Internet Available: http://www.w3.org/History/1989/proposal.html (23 Oct 2004).

[2.] Brin, Sergey and Page, Lawrence. (1998). "The Anatomy of a Large-Scale Hypertextual Web Search Engine." WWW7 / Computer Networks 30: 107-117. Internet Available: http://www-db.stanford.edu/~backrub/google.html (8 Mar 2005).

[3.] Bush, Vannevar. (1945). "As We May Think." The Atlantic Monthly, July 1945: 101-108. Internet Available: http://sloan.stanford.edu/mousesite/Secondary/Bush.html (23 Oct 2004).

[4.] Dalianis, Hercules. (2002). "Evaluating a spelling support in a search engine." NLDB 2002, The Eight International Workshop on the Applications of Natural Language to Data Bases (June 27-28, 2002). Internet Available: http://www.nada.kth.se/~hercules/papers/SpellingIR.pdf (6 Mar 2005).

[5.] De Chiara, R., Erra, Ugo., Scarano, V. (2003). "VennFS: A Venn-Diagram File Manager." IEEE International Conference on Information Visualization.

[6.] "Data Access and Storage Developer Center: Building WinFS Solutions." Microsoft Developer Network. Internet Available: http://msdn.microsoft.com/data/winfs/ (23 Oct 2004).

[7.] Fellbaum, Christiane, ed. (1994). WordNet: An Electronic Lexical Database. Cambridge: The MIT Press. Internet Available: http://mitpress.mit.edu/book-home.tcl?isbn=026206197X (23 Oct 2004).

[8.] Giampaolo, Dominic. (1998). Practical File System Design with the Be File System. San Francisco: Morgan Kaufmann.

[9.] Gifford, D., Jouvelot, P., Sheldon, M., O'Toole, J. (1991). "Semantic File Systems." Operating Systems Review, v25 n5.

[10.] "Google Desktop Search Beta." Google. Internet Available: http://desktop.google.com/ (6 Mar 2005).

[11.] "Google Stock Information Summary." Yahoo! Finance. Internet Available: http://finance.yahoo.com/q?s=goog (23 Aug. 2004).

[12.] Herman, I., Melançon, G., and Marshall, M. S. (2000). "Graph visualization and navigation in information visualization: A survey." IEEE Transactions on Visualization and Computer Graphics, 6(1): 24-43. Internet Available: http://citeseer.ist.psu.edu/herman00graph.html (23 Oct 2004).

[13.] "Hierarchy." Wikipedia, the free encyclopedia. Internet Available: http://en.wikipedia.org/wiki/Hierarchy (24 Oct 2004).

[14.] "History of The Open Directory." DeemozWatch. Feb. 2004. Internet Not Available: http://www.deemozwatch.org/history.htm (15 Aug. 2004).

[15.] "Mac OS X Tiger: Spotlight." Apple Computer Inc. 28 Jun. 2004. Internet Available: http://www.apple.com/macosx/tiger/spotlight.html (23 Oct 2004).

[16.] Manola, F., Eric M. ed. (2004). "RDF Primer: W3C Recommendation." World Wide Web Consortium. Internet Available: http://www.w3.org/TR/rdf-primer/ (23 Oct 2004).

[17.] Nickell, Seth. "A Cognitive Defense of Associative interfaces for Object Reference." Internet Available: http://www.gnome.org/~seth/storage/associative-interfaces.pdf (23 Oct 2004).

[18.] "ODP and Yahoo Size Charts." Geniac.net. 10 Jan 2004. Internet Available: http://www.geniac.net/odp/ (15 Aug 2004).

[19.] "Open Directory Project." Netscape. Internet Available: http://www.dmoz.org/ (24 Oct 2003).

[20.] Robertson, S.E., Walker, S., Zaragoza, H. (2001). "Microsoft Cambridge at TREC-10: Filtering and web tracks." NIST: Text REtrieval Conference 2001, pg. 378. Internet Available: http://trec.nist.gov/pubs/trec10/papers/msr_cambridge.pdf (24 Oct. 2004).

[21.] Sayers, Craig. (2004). "Node-centric RDF Graph Visualization." HP Laboratories: Mobile and Media Systems Laboratory. Internet Available: http://www.hpl.hp.com/techreports/2004/HPL-2004-60.pdf (23 Oct 2004).

[22.] Soules, Craig, Ganger, Gregory. (2003). "Why Can't I Find My Files? New methods for automating attribute assignment." Hot Topics in Operating Systems, May 2003: 115-120. USENIX Association. Internet Available: http://www.pdl.cmu.edu/PDL-FTP/Storage/hotOS03.pdf (23 Oct 2004).

[23.] Van Rijsbergen, C. J. (1979). Information Retrieval. London: Butterworths. Internet Available: http://citeseer.ist.psu.edu/vanrijsbergen79information.html (23 Oct 2004).

[24.] Vassileva, J. (2002). "Supporting Peer-to-Peer User Communities." In R. Meersman, Z. Tari et al. (Eds.), On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, Coordinated International Conferences Proceedings, Irvine, 29 Oct-1 Nov 2002, Springer Verlag: Berlin-Heidelberg, 230-247.

Ian Hopkins and Julita Vassileva

Department of Computer Science,

University of Saskatchewan,

57 Campus Drive, Saskatoon S7N 5A9, Canada

[ikh328@mail.usask.ca; jiv@cs.usask.ca]
COPYRIGHT 2005 Digital Information Research Foundation

Article Details
Author:Hopkins, Ian; Vassileva, Julita
Publication:Journal of Digital Information Management
Date:Jun 1, 2005