
Interactive exploration of non-indexed data.

Creating and storing large archives of rich media such as digital photographs, videos, X-rays, MRIs and CAT scans is becoming easier: for example, a digitized medical institution generates approximately 12 terabytes of medical images annually. However, finding a vaguely specified object in such an archive is very difficult. Unlike text data, which is easily indexed by search engines, rich media is difficult to search by content. In such cases, one has no alternative but to perform a brute-force search--examining each item in the repository to determine whether it matches the user's needs.

Unfortunately, naive implementations of brute-force search are very inefficient for several reasons. First, testing the query against every item in a large collection can be prohibitively expensive in terms of computation. Second, the available network bandwidth may be insufficient to ship all of the data to the user's computer, particularly when the data repository is distributed across several sites on the internet. Third, the effort involved in shipping and processing most of the data is wasted, since only a very small fraction of the data is likely to be relevant to any given query. For these reasons, most institutions simply store the data without offering any means to search it. This problem becomes worse as the cost of storage decreases and the volume of stored data increases--leading to the curse of cheap storage.

Early Discard

Imagine that your goal is to find a needle in a large haystack. The traditional approach to brute-force search is analogous to shipping the entire haystack to your office, only to throw most of it away. Ideally, one would like to send an agent into the field to search through the haystack and bring back the needle. Unfortunately, despite decades of research, automated search systems are still significantly inferior to human performance in most domains; thus, for the foreseeable future, most real-world applications will require a human in the loop. However, although our automated systems may not be smart enough to recognize every needle, they are smart enough to discard much of the hay. The idea of early discard is to filter the data where it is stored--and to eliminate obviously irrelevant data as early as possible.
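
The mechanism can be pictured as a simple storage-side pipeline. The following Python sketch is our own illustration, not the Diamond API: cheap predicates run where the data lives, so obviously irrelevant objects never cross the network.

```python
# A minimal sketch of early discard (illustration only, not the Diamond API).
# Cheap predicates run at the storage node; only surviving objects are
# shipped over the network to the user's front-end machine.

def early_discard(objects, storage_side_filters):
    """Yield only the objects that pass every storage-side filter."""
    for obj in objects:
        if all(f(obj) for f in storage_side_filters):
            yield obj  # ship this survivor to the front end
        # rejected objects consume no network bandwidth or front-end cycles
```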

Diamond: Efficient Non-Indexed Search

Diamond, an open-source research collaboration between Intel Research Pittsburgh and Carnegie Mellon University, enables efficient search of non-indexed data using early discard. This capability is realized through code fragments called "searchlets" that execute close to storage. Diamond provides additional efficiency by re-ordering search tasks and by dynamically partitioning search computation between the front-end computer used for interaction and the back-end nodes where data is stored. This allows Diamond to adapt to different hardware configurations and network bandwidths in a manner that is completely transparent to applications. Diamond further improves performance by making heavy use of caching, which minimizes wasted computation by reusing partial results from prior queries.

Design Considerations

The Diamond project's goal is to provide a highly portable open-source software implementation that supports proprietary applications and data formats. This enables ISVs to add unique value on top of Diamond while encouraging broad adoption of the open API. Diamond is designed to be agnostic to the storage system and works equally well with repositories on server farms, SAN clusters, machines on the internet and (in the future) active disks. This flexibility rests on Diamond's self-tuning capabilities, which insulate applications from hardware upgrades and enable the system to adapt dynamically to changes in network and computation resources.

Searches in Diamond are implemented using searchlets, which are fragments of binary code along with configuration parameters. A searchlet can be further decomposed into a set of filters that independently process (and discard) data. Each filter examines a single object, determines whether it is relevant to the current search, and (optionally) adds attributes to the object. For instance, a particular filter might discard images unless they contain a green-colored region. Filters can depend upon the results of other filters, and these dependencies can be expressed as a partial ordering. For example, a face recognition filter (e.g., "is this George?") might require that a face detection filter (e.g., "does this image contain any human faces, and if so, where?") be run as a prerequisite step.
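
A hypothetical rendering of this filter model in Python (our names and structure, not Diamond's actual API) makes the partial ordering concrete: each filter sees one object, may attach attributes for downstream filters, and can discard the object outright.

```python
# A sketch of the searchlet/filter model (hypothetical names, not Diamond's
# actual API). Each filter examines one object, may attach attributes for
# later filters, and returns False to discard the object.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Filter:
    name: str
    evaluate: Callable[[Dict], bool]           # may read/write obj attributes
    depends_on: List["Filter"] = field(default_factory=list)  # partial order

def run_searchlet(obj: Dict, filters: List[Filter]) -> bool:
    """Run filters in some order consistent with the (acyclic) dependency
    partial order, stopping as soon as one filter rejects the object."""
    done, pending = set(), list(filters)
    while pending:
        f = next(f for f in pending
                 if all(d.name in done for d in f.depends_on))
        if not f.evaluate(obj):
            return False                       # early discard
        done.add(f.name)
        pending.remove(f)
    return True                                # object survives every filter

# Example partial order: recognition depends on detection.
face_detect = Filter("face_detect", lambda o: bool(o.get("faces")))
george = Filter("george", lambda o: "george" in o["faces"],
                depends_on=[face_detect])
assert run_searchlet({"faces": ["alice", "george"]}, [george, face_detect])
```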

Within these dependency constraints, Diamond is free to execute the filters in any order, and it automatically discovers efficient filter orderings.

Self-Tuning in Diamond

Diamond automatically tunes itself to improve performance based on the characteristics of the system, the type of data and even the current query. Diamond employs three forms of self-tuning:

Filter ordering -- The order in which a sequence of filters is executed can have a huge impact on system performance. For instance, running a cheap filter before an expensive filter is generally a good idea unless the expensive filter is much more selective. Diamond's goal is to reject irrelevant data as cheaply as possible. Within the dependency constraints, Diamond is free to evaluate filters in any order, and it selects good orderings by trying different sequences while taking measurements. A major benefit of this approach is that filter ordering can be done without access to the source code of any filter.
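
One standard heuristic captures the trade-off described above: for independent filters, expected work is minimized by running filters in increasing order of cost per unit of data discarded. The sketch below is our illustration of that rule of thumb; per the text, Diamond itself learns orderings empirically from run-time measurements.

```python
# A classic cost/selectivity heuristic for ordering independent filters
# (our illustration; Diamond measures costs and pass rates at run time).
# Cheap, highly selective filters rise to the front of the pipeline.

def order_filters(filters, cost, pass_rate):
    """cost[f]: measured CPU time per object for filter f.
    pass_rate[f]: fraction of objects f lets through (its selectivity).
    Sort by cost per unit of discarded data, ascending."""
    return sorted(filters,
                  key=lambda f: cost[f] / max(1e-9, 1.0 - pass_rate[f]))
```

Note that this uses only black-box measurements, consistent with Diamond's ability to order filters without their source code.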

Dynamic partitioning of computation -- Diamond evaluates a subset of the filters at the storage system and the remainder on the host machine. However, deciding where to run each filter is a tricky problem, and the optimal choice depends on several factors, such as the number of nodes, the processing power available at each node and the network bandwidth. Fortunately, Diamond frees the application developer from worrying about these issues: it dynamically reorganizes the filters executed at each node to improve overall system performance.
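
A back-of-the-envelope model shows the shape of this decision. In the sketch below (our simplification, which ignores parallelism across storage nodes), the system picks how many of the ordered filters to run at the storage side by minimizing estimated per-object time.

```python
# A simplified partitioning decision (our illustration): run filters[:k] at
# the storage node and filters[k:] at the front end, choosing k to minimize
# estimated per-object time. Ignores parallelism across storage nodes.

def best_split(costs, surviving_bytes, server_speed, client_speed, bandwidth):
    """costs: per-object CPU cost of each filter, in pipeline order.
    surviving_bytes[k]: expected bytes still shipped after the first k
    filters (length len(costs) + 1)."""
    def estimate(k):
        server = sum(costs[:k]) / server_speed   # storage-side compute
        ship = surviving_bytes[k] / bandwidth    # network transfer
        client = sum(costs[k:]) / client_speed   # front-end compute
        return server + ship + client
    return min(range(len(costs) + 1), key=estimate)
```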

Smart caching -- Diamond recognizes that interactive searches often involve a great deal of refinement: the user poses a query, observes some partial results, and adds a new filter or tweaks an existing one in response. Rather than naively recomputing all of the filters, Diamond uses a smart hashing technique to identify the binary filters that have changed between two searches and reuses whatever computation it can. This means that even though Diamond is searching non-indexed data, it can do so very quickly.
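
The hashing idea can be sketched as follows (our illustration; Diamond's actual cache design may differ): key each filter's per-object verdict by a digest of the filter binary and its parameters, so that between two refinements only the filters that actually changed are re-executed.

```python
# A sketch of result reuse via hashing (our illustration, not Diamond's
# actual cache design). A filter's per-object verdict is keyed by a digest
# of its binary code and parameters, so unchanged filters hit the cache.

import hashlib

cache = {}

def result_key(filter_binary: bytes, params: bytes, object_id: str) -> str:
    return hashlib.sha1(filter_binary + params + object_id.encode()).hexdigest()

def cached_evaluate(filter_binary, params, object_id, evaluate):
    """Recompute only when the filter's code or parameters have changed."""
    key = result_key(filter_binary, params, object_id)
    if key not in cache:
        cache[key] = evaluate(object_id)
    return cache[key]
```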

Diamond Applications

The Diamond framework is not specific to any particular application or type of data. For example, the objects in the repository could be images, sound, video or medical data. We illustrate Diamond with an example from content-based image retrieval.

SnapFind: Searching Digital Photographs

SnapFind is a prototype application for Diamond (also released open source) that allows the user to search through large collections of digital photos. SnapFind queries are constructed using examples; the user builds Diamond filters for desired colors or visual textures by selecting sample patches. For instance, the user could quickly build a custom filter for a carpet pattern in her living room and then use it to retrieve photos taken at home.
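
The query-by-example idea can be sketched in a few lines of Python (our illustration, not SnapFind's actual code): build a coarse color histogram from the user's sample patch, then accept a candidate region only if its histogram is sufficiently similar. A real filter would also scan windows across each image and could use texture features as well.

```python
# Query-by-example with a coarse RGB histogram (illustration only, not
# SnapFind's actual code). A real filter would scan windows across each
# image; this sketch compares a single candidate patch to the example.

def histogram(pixels, bins=4):
    """pixels: iterable of (r, g, b) values in 0..255.
    Returns a normalized joint RGB histogram with bins**3 cells."""
    h = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        h[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1
    total = sum(h) or 1.0
    return [v / total for v in h]

def patch_filter(example_patch, threshold=0.8):
    """Build a filter from a user-selected patch: accept a candidate whose
    histogram intersection with the example exceeds `threshold`."""
    target = histogram(example_patch)
    def accept(candidate_patch):
        cand = histogram(candidate_patch)
        return sum(min(a, b) for a, b in zip(target, cand)) >= threshold
    return accept
```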

This approach is quite different from most current photo retrieval systems, where the user is typically limited to two options. The first is to manually search through all of the images, which ensures a high recall rate but is extremely time-consuming. The second approach, advocated by many content-based image retrieval systems, is to index all of the images using precomputed search criteria (such as global color histograms). The latter allows fast retrieval, but if the precomputed criteria do not fit the user's needs, the user can do little to find the desired images. SnapFind is an interactive search system that explores the space between these two extremes, leveraging the user's understanding of the semantic content to drive the search.

Medical Imaging

We are now exploring applications for Diamond in the medical imaging domain. Initial explorations in dermatopathology (with the University of Pittsburgh Medical Center) and in cell microscopy (with Merck Research) have yielded promising results. Diamond could become a key component in future management systems for picture archiving and communication system (PACS) databases or in computer-assisted diagnosis.

Diamond was primarily developed by L. Huston, M. Satyanarayanan and R. Sukthankar. R. Wickremesinghe, D. Hoiem, Y. Ke, C. Helfrich, L. Fix and S. Schlosser made significant contributions.

www.intel.com