Interactive exploration of non-indexed data.Creating and storing large archives of rich media such as digital photographs, videos, X-rays, MRIs and CAT scans CAT scan (kăt) [computerized axial tomography], X-ray technique that allows relatively safe, painless, and rapid diagnosis in previously inaccessible areas of the body; also called CT scan. is becoming easier: for example, a digitized medical institution generates approximately 12 terabytes of medical images annually. However, finding a vaguely-specified object in such an archive is very difficult. Unlike text data, which is easily indexed by search engines, rich media is difficult to search by content. In such cases, one has no alternative but to perform a brute-force search--examining each item in the repository to determine whether it matches the user's needs. Unfortunately, native implementations of brute-force search In computer science, brute-force search or exhaustive search is a trivial but very general problem-solving technique, that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's statement. are very inefficient for several reasons. First, testing the query against every item in a large collection can be prohibitively pro·hib·i·tive also pro·hib·i·to·ry adj. 1. Prohibiting; forbidding: took prohibitive measures. 2. expensive in terms of computation. Second, the available network bandwidth may be insufficient to ship all of the data to the user's computer, particularly when the data repository See repository. is distributed at several sites on the internet. Third, the effort involved in shipping and processing most of the data is wasted since only a very small fraction of the data is likely to be relevant to any given query. For these reasons, most institutions simply store the data without offering means to search it. This problem becomes worse as the cost of storage decreases and the volume of stored data increases--leading to the curse of cheap storage. Early Discard Imagine that your goal is to find a needle in a large haystack. The traditional approach to brute-force search is analogous to shipping the entire haystack to your office, only to throw most of it away. Ideally, one would like to send an agent into the field to search through the haystack and to bring back the needle. Unfortunately, despite decades of research, automated search systems are still significantly inferior to human performance in most domains; thus, for the foreseeable future, most real-world applications will require a human in the loop. However, although our automated systems may not be smart enough to recognize every needle they are smart enough to discard much of the hay. The idea of early discard is to filter the data where it is stored--and to eliminate obviously irrelevant data as early as possible. Diamond: Efficient Non-Indexed Search Diamond is an open-source research collaboration between Intel Research Pittsburgh and Carnegie Mellon University Carnegie Mellon University, at Pittsburgh, Pa.; est. 1967 through the merger of the Carnegie Institute of Technology (founded 1900, opened 1905) and the Mellon Institute of Industrial Research (founded 1913). that enables efficient search of non-indexed data using early discard. This capability is realized through code fragments called "searchlets" that execute close to storage. Diamond provides additional efficiency by re-ordering search tasks and by dynamically partitioning search computation between the front-end computer used for interaction and the back-end nodes where data is stored. This allows Diamond to dynamically adapt to different hardware configurations and network bandwidths in a manner that is completely transparent to applications. Diamond further improves performance by making heavy use of caching technology which minimizes wasted computation by reusing partial results from prior queries. Design Considerations The Diamond project's goal is to provide a highly-portable open source software implementation that supports proprietary applications and data formats. This enables ISVs to add unique value on top of Diamond while encouraging broad adoption of the open API Open API (often referred to as OpenAPI) is a word used to describe sets of technologies that enable websites to interact with each other by using SOAP, Javascript any other web technology. . Diamond is designed to be agnostic ag·nos·tic n. 1. a. One who believes that it is impossible to know whether there is a God. b. One who is skeptical about the existence of God but does not profess true atheism. 2. to the storage system and works equally well with repositories on server farms, SAN clusters, machines on the internet and (in the future) on active disks. This is made possible by Diamond's self-tuning capabilities that insulate in·su·late tr.v. in·su·lat·ed, in·su·lat·ing, in·su·lates 1. To cause to be in a detached or isolated position. See Synonyms at isolate. 2. applications from hardware upgrades and enable the system to dynamically adapt to changes in network and computation resources. Searches in Diamond are implemented using searchlets, which are fragments of binary code binary code Code used in digital computers, based on a binary number system in which there are only two possible states, off and on, usually symbolized by 0 and 1. Whereas in a decimal system, which employs 10 digits, each digit position represents a power of 10 (100, 1,000, along with configuration parameters. A searchlet can be further decomposed de·com·pose v. de·com·posed, de·com·pos·ing, de·com·pos·es v.tr. 1. To separate into components or basic elements. 2. To cause to rot. v.intr. 1. into a set of filters that independently process (and discard) data. Each filter examines a single object, determines whether it is relevant to the current search, and (optionally) adds attributes to the object. For instance, a particular filter might discard images unless they contain a green-colored region. Filters can depend upon the results of other filters and these dependencies can be expressed as a partial ordering partial ordering - A relation R is a partial ordering if it is a pre-order (i.e. it is reflexive (x R x) and transitive (x R y R z => x R z)) and it is also antisymmetric (x R y R x => x = y). . For example, a face recognition filter (e.g., "is this George?") might require that a face detection filter (e.g., "does this image contain any human faces, and if so where?") be run as a prerequisite step. Within the dependency constraints, Diamond is free to execute the filters in any order and is set up to automatically discover efficient filter orderings. Self-Tuning in Diamond Diamond automatically tunes itself to improve performance based on the characteristics of the system, the type of data and even the current query. Diamond employs three forms of self-tuning: Filter ordering -- The order in which a sequence of filters is executed can have a huge impact on the system performance. For instance, running a cheap filter before an expensive filter is generally a good idea unless the expensive filter is much more selective. Diamond's goal is to reject irrelevant data as cheaply as possible. Within the dependency order constraints, Diamond is free to evaluate filters in any order and selects good orderings by trying different sequences while taking measurements. A major benefit of Diamond's approach is that filter ordering can be done without needing access to the source code of any filter. Dynamic partitioning In a symmetric multiprocessing (SMP) system, the ability to reassign processors, memory and I/O to specific applications on the fly without shutting down the machine. The reassignment can be done by the operator or automatically from a script that monitors conditions such as time of day of computation -- Diamond evaluates a subset of the filters at the storage system and the remainder on the host machine. However, deciding where to run each filter is a tricky problem and the optimal solution depends on several factors such as the number of nodes, the processing power available at each node and the network bandwidth. Fortunately, Diamond frees the application developer from having to worry about these problems and dynamically reorganizes the filters that are executed at each node so that the overall system performance is improved. Smart caching -- Diamond recognizes that interactive searches often require a lot of refinement; the user poses a query, observes some partial results, and adds a new filter or tweaks an existing filter in response. Rather than naively recomputing all of the filters, Diamond uses a smart hashing Creating hash totals or hash tables. See hash total and hash table. hashing - hash coding technique to identify the binary filters that have changed between two searches and reuses whatever computation that it can. This means that even though Diamond is searching non-indexed data, it can do so very quickly. Diamond Applications The Diamond framework is not specific to any particular application or type of data. For example, the objects in the repository could be images, sound, video or medical data. We illustrate Diamond with an example from content-based image retrieval Content-based image retrieval (CBIR), also known as query by image content (QBIC) and content-based visual information retrieval (CBVIR) is the application of computer vision to the image retrieval problem, that is, the problem of searching for . SnapFind: Searching Digital Photographs SnapFind is a prototype application for Diamond (also released open source) that allows the user to search through large collections of digital photos. SnapFind queries are constructed using examples; the user builds Diamond filters for desired colors or visual textures by selecting sample patches. For instance, the user could quickly build a custom filter for a carpet pattern in her living room and then use it to retrieve photos taken at home. This approach is quite different from most current photo retrieval systems, where the user is typically limited to two options. The first is to manually search through all of the images, which ensures a high recall rate but is extremely time-consuming. The second approach, advocated by many content-based image retrieval systems, is to index all of the images on pre-computed search criteria (such as global color histograms Not to be confused with Image histogram. In computer graphics and photography, a color histogram is a representation of the distribution of colors in an image, derived by counting the number of pixels of each of given set of color ranges in a typically two-dimensional (2D) or ). The latter allows fast retrieval, but if the precomputed criteria do not fit the user's needs, the user can do little to find the desired images. SnapFind is an interactive search system that explores the space between these two extremes. It leverages the user's understanding of the semantic content to drive the search. Medical Imaging We are now exploring applications for Diamond in the medical imaging domain. Initial explorations in dermatopathology (with University of Pittsburgh Medical Center The University of Pittsburgh Medical Center (UPMC) is a leading American healthcare provider and institution for medical research. It consistently ranks in US News and World Report's "Honor Roll" of the approximately 15 best hospitals in America. ) and in cell microscopy microscopy /mi·cros·co·py/ (mi-kros´kah-pe) examination under or observation by means of the microscope. mi·cros·co·py n. 1. The study of microscopes. 2. (with Merck Research) have yielded promising results. Diamond could become a key component in future management systems for PACS (Picture ArChiving System) A storage and management system for high-resolution images. Typically pertaining to the medical field, images such as X-rays, MRIs and CAT scans require a greater amount of storage than other industries. databases or computer-assisted diagnosis computer-assisted diagnosis Any use of computer algorithms to arrive at a clinical diagnosis. See Algorithm, Artificial intelligence, Cyberdoc, Expert system. Cf Computer-assisted therapeutics management. . Diamond was primarily developed by L. Huston, M. Satyanarayanan Mahadev Satyanarayanan is an experimental computer scientist who has pioneered research in mobile and pervasive computing. One outcome is the Coda file system, which supports mobility in low-bandwidth and intermittent wireless networks through disconnected and bandwidth-adaptive and R. Sukthankar. R. Wickremesinghe, D. Hoiem, Y. Ke, C. Helfrich, L. Fix and S. Schlosser made significant contributions. www.intel.com |
|
||||||||||||||||

Printer friendly
Cite/link
Email
Feedback
Reader Opinion