
Exploiting Multimodal Context in Image Retrieval.


THIS RESEARCH EXPLORES THE INTERACTION of textual and photographic information in multimodal documents. The World Wide Web (WWW) may be viewed as the ultimate, large-scale, dynamically changing, multimedia database. Finding useful information from the WWW without encountering numerous false positives (the current case) poses a challenge to multimedia information retrieval systems (MMIR). The fact that images do not appear in isolation, but rather with accompanying collateral text, is exploited. Taken independently, existing techniques for picture retrieval using collateral text-based methods and image-based methods have several limitations. Text-based methods, while very powerful in matching context, do not have access to image content. Image-based methods compute general similarity between images and provide limited semantics. This research focuses on improving precision and recall in an MMIR system by interactively combining text processing with image processing (IP) in both the indexing and retrieval phases. A picture search engine is demonstrated as an application.


This research explores the interaction of textual and photographic information in multimodal documents. The World Wide Web (WWW) may be viewed as the ultimate, large-scale, dynamically changing, multimedia database. Finding useful information from the WWW poses a challenge in the area of multimodal information indexing and retrieval. The word "indexing" is used here to denote the extraction and representation of semantic content. This research focuses on improving precision and recall in a multimodal information retrieval system by interactively combining text processing with image processing.

The fact that images do not appear in isolation but rather with accompanying text, which is referred to as collateral text, is exploited. Figure 1 illustrates such a case. The interaction of text and image content takes place in both the indexing and retrieval phases. An application of this research--namely, a picture search engine that permits a user to retrieve pictures of people in various contexts--is presented.


Taken independently, existing techniques for text and image retrieval have several limitations. Text-based methods, while very powerful in matching context (Salton, 1989), do not have access to image content. There has been a flurry of interest in using textual captions to retrieve images (Rowe & Guglielmo, 1993). Searching captions for keywords and names will not necessarily yield the correct information, as objects mentioned in the caption are not always in the picture. This results in a large number of false positives that need to be eliminated or reduced. In a recent test, a query was posed to a search engine to find pictures of Clinton and Gore and resulted in 941 images. After applying our own filters to eliminate graphics and spurious images (e.g., white space), 547 potential pictures that satisfied the query remained. A manual inspection revealed that only 76 of the 547 pictures contained pictures of Clinton or Gore. This illustrates the tremendous need to employ image-level verification and to use text more intelligently.

Typical image-based methods compute general similarity between images based on statistical image properties (Flickner et al., 1995). Examples of such properties are texture and color (Swain & Ballard, 1991). While these methods are robust and efficient, they provide very limited semantic indexing capabilities. There are some techniques that perform object identification; however, these techniques are computationally expensive and not sufficiently robust for use in a content-based retrieval system. This is due to a need to balance processing efficiency with indexing capabilities. If object recognition is performed in isolation, this is probably true. More recently, attempts to extract semantic properties of images based on spatial distribution of color and texture properties have also been made (Smith & Chang, 1996). Such techniques have drawbacks, primarily due to their weak disambiguation. These are discussed later. Webseer represents an attempt to utilize both image and text content in a picture search engine. However, text understanding is limited to processing of HTML tags; no attempt to extract descriptions of the picture is made. More important, it does not address the interaction of text and image processing in deriving semantic descriptions of a picture.

In this article, a system for finding pictures in context is described. A sample query would be Find pictures of victims of natural disasters. Specifically, experiments have been conducted to effectively combine text content with image content in the retrieval stage. Text indexing is accomplished through standard statistical text indexing techniques and is used to satisfy the general context that the user specifies. Image processing consists of face detection and recognition. This is used to present the resulting set of pictures based on various visual criteria (e.g., the prominence of faces). Experiments have been conducted on two different scenarios for this task; results from both are presented. Preliminary work in the intelligent use of collateral text in determining pictorial attributes is also presented. Such techniques can be used independently or combined with image processing techniques to provide visual verification. Thus this represents the integration of text and image processing techniques in the indexing stage.


Before techniques for extracting picture properties from text and images are described, it is useful to examine typical queries used in retrieving pictures. Jorgensen (1996) describes experimental work in the relative importance of picture attributes to users. Twelve high-level attributes--literal object, people, human attributes, art historical information, visual elements, color, location, description, abstract, content/story, viewer response, and external relationship--were measured. It is interesting to note that literal object accounted for up to thirty-one of the responses. Human form and other human characteristics accounted for approximately fifteen responses. Color, texture, and so on ranked much lower compared to the first two categories. The role of content/story varied widely from insignificant to highly important. In other words, users dynamically combine image content and context in their queries.

Romer (1993) describes a wish list for image archive managers, specifically the types of data descriptions necessary for practical retrieval. The heavy reliance on text-based descriptions is questioned. Furthermore, the adaptation of such techniques to multimodal content is required. The need for visual thesauri (Srihari & Burhans, 1994; Chang & Lee, 1991) is also stressed, since these provide a natural way of cataloging pictures, an important task. An ontology of picture types would be desirable. Finally, Romer (1995) describes the need for "a precise definition of image elements and their proximal relationship to one another." This would permit queries such as Find a man sitting in a carriage in front of Niagara Falls.
Based on the above analysis, it is clear that object recognition is a highly desirable component of picture description. Although object recognition in general is not possible, for specific classes of objects, and with feedback from text processing, object recognition may be attempted. It is also necessary to extract further semantic attributes of a picture by mapping low-level image features such as color and texture into semantic primitives. Efforts in this area (see Smith & Chang, 1996) are a start but suffer from weak disambiguation and hence can be applied in select databases; our work aims to improve this. Improved text-based techniques for predicting image elements and their structural relationships are presented.


To demonstrate the effectiveness of combining text and image content, a robust, efficient, and sophisticated picture search engine has been developed; specifically, Webpic will selectively retrieve pictures of people in various contexts. A sample query could be Find outdoor pictures of Bill Clinton with Hillary talking to reporters on Martha's Vineyard. This should generate pictures where (1) Bill and Hillary Clinton actually appear in the picture (verified by face detection/recognition), and (2) the collateral text supports the additional contextual requirements. The word "robust" means the ability to perform under various data conditions; potential problems could be lack of, or limited, accompanying text/HTML, complex document layout, and so on. The system should degrade gracefully under such conditions. Efficiency refers primarily to the time required for retrievals, which are performed online. Since image indexing operations are time-consuming, they are performed offline. Finally, sophistication refers to the specificity of the query/response. In order to provide adequate responses to specific queries, it is necessary to perform more complex indexing of these data.

Figure 2 depicts the overall structure of the system. It consists of three phases. Phase 1 is the data acquisition phase--multimodal documents from WWW news sites (e.g., MSNBC, CNN, USA Today) are downloaded. In order to control the quality of data that are initially downloaded, a Web crawler in Java has been implemented to do more extensive filtering of both text and images.


The inputs to the system are a set of name keys (names of people) and an initial set of URLs to initiate the search. Some preprocessing tools are employed during this phase. One such tool is an image-based photograph versus graphic filter. This filter is designed and implemented based on histogram analysis. Presumably, a photograph histogram has a much wider spectrum than that of a graphic image.
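A minimal sketch of such a histogram filter follows. The 20 percent level-occupancy threshold and the toy pixel data are illustrative assumptions, not values from this system:

```python
import random
from collections import Counter

def is_photograph(pixels, threshold=0.2):
    """Classify 8-bit grayscale pixels as photograph vs. graphic.

    Photographs tend to occupy many of the 256 intensity levels;
    graphics (logos, banners) concentrate on a few. The threshold
    is an illustrative assumption, not a tuned value.
    """
    hist = Counter(pixels)
    occupied = sum(1 for level in range(256) if hist[level] > 0)
    return occupied / 256 >= threshold

# Two-tone banner: only 2 of 256 intensity levels occupied -> graphic.
graphic = [0] * 500 + [255] * 500
# Noisy stand-in for a natural image: most levels occupied -> photograph.
random.seed(0)
photo = [random.randint(0, 255) for _ in range(2000)]
```

A production filter would of course operate on decoded image data and a tuned threshold, but the spectrum-width intuition is the same.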

A collateral text extractor, whose task is to determine the scope of text relevant to a given picture, is also employed. Caption text appears in a wide variety of styles. News sites such as CNN and MSNBC use explicit captions for pictures. These are indicated through the use of special fonts and careful placement using HTML commands, as illustrated in Figure 1. In other Web pages, captions are not set off explicitly but, rather, are implicit by virtue of their proximity to the picture.

Explicit captions are detected based on the presence of strong HTML clues as well as the usage of key phrases such as "left," "foreground," "rear," and so on. These can be used to predict picture contents. General collateral text is detected based on the presence of words from the "ALT" tag, caption words, spatial proximity to the picture, and so on. Such text, while not a powerful predictor of the contents of a picture, establishes the context of a picture. An image-based caption extractor that extracts ASCII text embedded in images (a common practice among news-oriented sites) has been developed in our laboratory and is available for use.
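A simplified illustration of this kind of extraction follows. The cue-phrase list, the HTML patterns, and the sample page are assumptions made for the sketch; the actual extractor handles far more layout variety:

```python
import re

# Positional cue phrases that suggest an explicit caption (assumed list).
CAPTION_CUES = re.compile(r"\b(left|right|foreground|rear|above|pictured)\b",
                          re.I)

def extract_collateral(html):
    """Pull candidate caption text for each image in an HTML fragment.

    Sketch only: takes the ALT text of each <img> tag plus any
    italicized run immediately following it, and flags text containing
    positional cue phrases as a likely explicit caption.
    """
    results = []
    pattern = r'<img[^>]*?alt="([^"]*)"[^>]*>\s*(?:<i>(.*?)</i>)?'
    for m in re.finditer(pattern, html, re.I | re.S):
        alt, italic = m.group(1), m.group(2) or ""
        text = (alt + " " + italic).strip()
        results.append({"text": text,
                        "explicit": bool(CAPTION_CUES.search(text))})
    return results

page = ('<img src="di.jpg" alt="Princess Diana tribute">'
        '<i>A woman, left, adds to the floral tribute.</i>')
```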

Phase 2 is the content analysis or indexing phase (performed offline). Phase 2 illustrates that both natural language processing (NLP) and image processing result in factual assertions to the database. This represents a more semantic analysis of the data than general text and image indexing based on statistical features. This is discussed in later sections.

Phase 3, retrieval, demonstrates the need to decompose the query into its constituent parts. A Web-based graphical user interface (GUI) has been developed for this. As Figure 3 illustrates, the system permits users to view the results of a match based on different visual criteria. This is especially useful in cases where the user knows the general context of the picture but would like to interactively browse and select pictures containing his or her desired visual attributes. The interface also illustrates that further query refinement using techniques such as image similarity is possible. Finally, although the example illustrates a primary context query, it is possible for the original query to be based on pure image matching techniques. The basic database infrastructure for a multimodal database has been built using Illustra (a relational database management system from Informix Inc.).


This is used for data storage as well as representing factual (exact) information. Illustra's ability to define new data types and associated indexing and matching functions is useful for this project.


For each picture and its accompanying text, the following metadata are extracted and stored. The metadata model described here is currently applicable only to text and image sources. However, it can be easily extended to accommodate audio and video sources as well:

* Text_Indx: text index, using statistical vector-space indexing techniques. This is useful in judging similarity of two contexts.

* Img_Indx1, Img_Indx2, ..., Img_Indxk: indexes for various image features based on statistical techniques. This includes color, texture, and shape, as well as other properties useful in judging the similarity of two images.

* PDT: this is a template containing information about people, objects, events, locations, and dates mentioned in the text accompanying a picture. Such information is extracted through NLP techniques and will be discussed in the text processing section. Similarity of these templates involves a sophisticated unification algorithm.

* Objects: this is a template containing information about objects detected in the image (image coordinates) and their spatial relationships. It also includes information pertaining to general scene classification (e.g., indoor/outdoor, man-made/natural, and so on).


Text Processing

The goal of natural language processing research in this project is to examine the use of language patterns in collateral text to indicate scene contents in an accompanying picture. In this section, NLP techniques to achieve this goal are described. The objective is to extract properties of the accompanying picture as well as cataloging the context in which the picture appeared. Specifically, the interest is in deriving the following information that photo archivists have deemed to be important in picture retrieval:

* Determining which objects and people are present in the scene; the location and time are also of importance, as is the focus of the picture.

* Preserving event (or activity) as well as spatial relationships that are mentioned in the text. Spatial information, when present, can be used for automatically identifying people in pictures.

Consider the caption President Clinton and his family visited Niagara Falls yesterday. The First Lady and Chelsea went for a ride on the Maid of the Mist. This should not match the query find pictures of Clinton on the Maid of the Mist. However, the caption Clinton rode the Maid of the Mist Sunday should be returned. Current IR systems that rely on statistical processing would return both captions. NLP techniques are required for correct processing in this case.

* Determining further attributes of the picture such as indoor versus outdoor, mood, and so on.
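The distinction in the Maid of the Mist example can be shown by matching at the level of extracted relations rather than keywords. The triples below are hand-built stand-ins for parser output, not the system's actual representation:

```python
def matches(query_triple, caption_triples):
    """Relation-level match: the query's (agent, activity, object)
    triple must appear among the caption's extracted triples, not
    merely in its bag of words."""
    return query_triple in caption_triples

# Assumed parser output for the two example captions:
cap1 = {("Clinton", "visit", "Niagara Falls"),
        ("First Lady", "ride", "Maid of the Mist"),
        ("Chelsea", "ride", "Maid of the Mist")}
cap2 = {("Clinton", "ride", "Maid of the Mist")}

query = ("Clinton", "ride", "Maid of the Mist")
```

A bag-of-words matcher would return both captions, since each contains all of the query's keywords; relation-level matching returns only the second.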

* Representing and classifying the general context indicated by the text--e.g., political, entertainment, and so on.

Some organizations, such as Kodak, are manually annotating picture and video clip databases to permit flexible retrieval. Annotation consists of adding logical assertions regarding important entities and relationships in a picture. These are then used in an expert system for retrieval. Aslandogan et al. (1997) describe a system for image retrieval based on matching manually entered entities and attributes of pictures, whereas our objective is to automatically extract as much information as possible from natural language captions.

Specifically, the goal is to complete picture description templates (PDTs), which represent image characteristics. Templates of this type are used by photo repository systems, such as the Kodak Picture Exchange (Romer, 1993). The templates carry information about people, objects, relationships, and location, as well as other image properties. These properties include: (1) indoor versus outdoor setting, (2) active versus passive scene--i.e., an action shot versus a posed photo, (3) individual versus crowd scene, (4) daytime versus night-time, and (5) mood.

As an example, consider Figure 4, which shows the output template from processing the caption A woman adds to the floral tribute to Princess Diana outside the gates of Kensington Palace of Figure 1. Information extraction (IE) techniques (Sundheim, 1995), particularly shallow techniques, can be used effectively for this purpose. Unlike text understanding systems, IE is concerned only with extracting relevant data that have been specified a priori using fixed templates. Such is the situation here.

Figure 4. Picture Description Template (PDT).

People: person (female, PER1)

Objects: flowers

Activity: pay_tribute (PER1, Princess Diana)

Location: Kensington Palace, "outdoor"

Event Date: Monday, Sept. 2, 1997

Focus: PER1

Specific techniques for deriving the above information are now presented. The techniques fall into three general categories: statistical text indexing, light parsing, and extracting picture attributes.

Statistical Text Indexing

The goal here is to capture the general context represented by collateral text. Though not useful in deriving exact picture descriptions, statistical text indexing plays a key role in a robust multimodal information retrieval system. There has been considerable research in the area of document indexing and retrieval, particularly the vector space indexing techniques (Salton, 1989). The problem being faced here differs from traditional document matching since the text being indexed--viz., collateral text--is frequently very sparse. Minor adjustments are made to existing techniques in order to overcome the sparseness problem. This includes: (1) the use of word triggers (computed from a large corpus) to expand each content word into a set of semantically similar words, and (2) the use of natural language pre-processing in conjunction with statistical indexing. Word triggers refer to the frequent co-occurrence of certain word pairs in a given window size of text (e.g., fifty words). Natural language pre-processing refers to methods, such as Named Entity Tagging (described below), which classify groups of words as person name, location, and so on. While the use of NLP in document indexing and retrieval has met with limited success, the brevity of collateral text calls for more advanced processing.
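A toy version of trigger-based expansion over a vector-space index follows. The trigger table and the raw term-frequency weighting are stand-ins for corpus-derived triggers and full tf-idf weighting:

```python
import math
from collections import Counter

# Assumed trigger pairs; in the system these are mined from a large corpus.
TRIGGERS = {
    "flood": ["victim", "disaster", "rescue"],
    "earthquake": ["victim", "disaster", "damage"],
}

def index(text):
    """Expand each content word with its trigger words, then weight by
    raw term frequency (a simplification of tf-idf)."""
    words = text.lower().split()
    expanded = list(words)
    for w in words:
        expanded.extend(TRIGGERS.get(w, []))
    return Counter(expanded)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

caption = index("flood hits village")
query = index("disaster victim")
```

Without expansion, the caption and query share no terms and their similarity is zero; the triggers for "flood" supply the overlap that lets the sparse caption match the query's context.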

Light Parsing: Extracting Patterns of Interest

The previous section described general content indexing; these techniques are based on statistics of word and word-pair frequencies, and so on. In this subtask, the focus is on more in-depth syntactic processing of the relevant text; this is treated as an information extraction task. Such systems consist of several hierarchical layers, each of which attempts to extract more specific information from unformatted text.

In the case of photographs, template entities are the objects and people appearing in the photograph; template relationships include spatial relationships between objects/people, as well as event/activity information. The first layer consists of named entity tagging; this is an extremely useful pre-processing technique and has been the subject of considerable research.

Named entity (NE) tagging refers to the process of grouping words and classifying these groups as person name, organization name, place, date, and so on. For example, in the phrase Tiger Woods at the River Oaks Club, River Oaks Club would be classified as a location. Applying NE tagging to collateral text reduces errors typically associated with words having multiple uses. For example, a query to "Find pictures of oaks along a river" should not retrieve the above caption since River Oaks Club is tagged as a location. Bikel et al. (1997) describe a statistical method for NE tagging; given a manually truthed corpus of captions and collateral text, it is straightforward to develop an NE tagger. At this point, a rule-based system for NE tagging has been implemented which gives better than 90 percent accuracy.

The next layers of the hierarchical grammar are used for recognizing domain-independent syntactic structures such as noun and verb groupings (assuming that named entity tagging has already taken place); this leads to identification of template entities and basic relationships (i.e., subject-verb-object (SVO) structure). The processing in these layers is confined to the bounds of single sentences. The final layer is where intersentential information is correlated, thus leading to merging of templates. It is here that the final decisions on entries in the picture description template are made. For example, one sentence in a caption may refer to Princess Diana seen at her country estate, while a later sentence may refer to the fact that the estate is located outside the village of Althorp, England. In such a situation, template merging would result in the information that, in the specified picture, the location is Althorp, England. This is a form of co-reference that is being exploited. The template also includes general characteristics of the picture which may be detected from either the caption or collateral text. This is discussed in the next section.

The demands for efficient and robust natural language processing systems have caused researchers to investigate alternate formalisms for language modeling. Current information extraction requirements call for the processing of up to 80 MB of text per hour. Researchers have increasingly turned to finite-state processing techniques (Roche & Schabes, 1997). Roche (1997) says that "for the problem of parsing natural language sentences, finite-state models are both efficient and very accurate even in complex linguistic situations" (p. 241). A finite state transducer (FST) is a special case of a finite state automaton (FSA) in which each arc is labeled by a pair of symbols (input and output) rather than a single symbol. A rule compiler (Karttunen & Beesley, 1992) takes regular relations as input and constructs the corresponding FST. Operations supported by FSTs that are useful in grammar construction are union, intersection, and, particularly, composition. Domain-specific pattern rules (to extract special attributes for a select domain) can be written as a new FST; this new FST can easily be composed with the base system. Hobbs et al. (1997) employ a cascaded set of FSTs to implement a hierarchical grammar for IE. The picture description grammar is currently being implemented as a cascaded FST.
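To make the FST machinery concrete, the following is a minimal sketch (not the authors' implementation): a transducer whose arcs carry (input, output) symbol pairs, plus the composition operation used to layer domain rules over a base system. The toy symbols and rules are assumptions for illustration.

```python
# A finite state transducer as arcs labeled with (input, output) pairs.
class FST:
    def __init__(self, arcs, start, finals):
        # arcs: {(state, in_sym): [(out_sym, next_state), ...]}
        self.arcs, self.start, self.finals = arcs, start, finals

    def transduce(self, symbols):
        """Return the first output sequence accepted for `symbols`, else None."""
        def walk(state, rest, out):
            if not rest:
                return out if state in self.finals else None
            for out_sym, nxt in self.arcs.get((state, rest[0]), []):
                result = walk(nxt, rest[1:], out + [out_sym])
                if result is not None:
                    return result
            return None
        return walk(self.start, list(symbols), [])

def compose(f, g):
    """Compose two FSTs: feed f's output tape into g's input tape."""
    arcs = {}
    for (fs, a), f_out in f.arcs.items():
        for b, fn in f_out:
            for (gs, gin), g_out in g.arcs.items():
                if gin == b:
                    for c, gn in g_out:
                        arcs.setdefault(((fs, gs), a), []).append((c, (fn, gn)))
    finals = {(x, y) for x in f.finals for y in g.finals}
    return FST(arcs, (f.start, g.start), finals)

# A base layer tags words; a domain rule layer maps tags to template roles.
tagger = FST({(0, "clinton"): [("PERSON", 0)], (0, "lawn"): [("LOC", 0)]},
             start=0, finals={0})
roles = FST({(0, "PERSON"): [("who", 0)], (0, "LOC"): [("where", 0)]},
            start=0, finals={0})
cascade = compose(tagger, roles)
```

Because composition yields an ordinary FST, a cascade of such layers can itself be treated as a single transducer, which is what makes the cascaded-grammar approach efficient.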

Extracting Picture Attributes

Once the parsing process has been completed, it is possible to attach further attributes to the picture. These include attributes such as indoor versus outdoor, mood, and so on. By employing the roles that entities take on in the picture description templates, as well as by referring to ontologies and gazetteers, it is possible, in some cases, to extract further attributes. For example, if a caption refers to Clinton on the White House lawn, the picture is characterized as an outdoor picture. This is essentially a unification process between location types. Chakravarthy (1994) discusses the use of WordNet in performing such characterization.
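The indoor/outdoor characterization step might be sketched as below, with a toy ontology standing in for WordNet; all entries and names here are illustrative assumptions, not the system's actual data.

```python
# Toy location-type ontology (WordNet would supply this in practice).
LOCATION_TYPE = {
    "lawn": "outdoor", "garden": "outdoor", "beach": "outdoor",
    "auditorium": "indoor", "office": "indoor", "courtroom": "indoor",
}

def characterize(template):
    """Attach an indoor/outdoor attribute by unifying on location type."""
    loc = template.get("location")
    setting = LOCATION_TYPE.get(loc, "unknown")
    return {**template, "setting": setting}

pdt = {"who": "Clinton", "location": "lawn"}
```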


Imagery is probably the most frequently encountered modality, next to text, in multimedia information retrieval. Most existing techniques in the literature of content-based retrieval, or image indexing and retrieval, use low-level or intermediate-level image features such as color, texture, shape, and/or motion for indexing and retrieval. Although these methods may be efficient in retrieval, retrieval precision may suffer, since image features do not always reflect the semantic content of an image.

In this article, the focus is mainly on image retrieval of people or scenes in a general context. This requires capabilities of face detection and/or recognition in the general image domain. By a general image domain, it is meant that the appearances of the objects in question (e.g., faces) in different images may vary in size, pose, orientation, expression, background, as well as contrast. Since color images are very popular and easy to obtain, they have been chosen for experimentation.

The potential applications of the capability of face detection and/or face recognition include: (1) filtering--i.e., determining whether or not a particular image contains a human being; (2) identifying individuals--i.e., handling queries for certain well-known people using face recognition; and (3) improving the accuracy of similarity matching. For images involving human faces, it is very difficult to check similarity based on histograms of the entire images; color histogram techniques do not work well for images containing faces. However, after applying face detection to the original images, the face areas may be automatically "cropped" out, and the rest of the image may still be used for histogram-based similarity matching.
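A minimal sketch of this idea, under assumed details (one quantized channel, list-of-pixels input, and histogram intersection as the similarity measure):

```python
def histogram(pixels, mask=None, bins=4, maxval=256):
    """Quantized color histogram over pixels not covered by `mask`."""
    counts = [0] * bins
    for i, value in enumerate(pixels):
        if mask and mask[i]:
            continue  # skip pixels inside a detected face box
        counts[value * bins // maxval] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

def intersection(h1, h2):
    """Histogram intersection similarity in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

With a face mask supplied by the detection module, the histogram is computed only over the background, so two pictures of different people in the same scene can still match.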

Face detection and/or recognition has received focused attention in the literature of computer vision and pattern recognition for years. A good survey on this topic may be found in Chellappa et al. (1995). Typically, face detection and recognition are treated separately in the literature, and the solutions proposed are normally independent of each other. In this task, a streamlined solution to both face detection and face recognition is pursued. By a streamlined solution, it is meant that both detection and recognition are conducted in the same color feature space, and the output of the detection stage is directly fed into the input of the recognition stage. Another major difference between the present research and work described earlier in the literature is that the proposed system is a self-learning system, meaning that the face library used in face recognition is obtained through face detection and text understanding using the earlier research system PICTION (Srihari, 1995b). This allows the stage of face data collection for construction of the face library as an automatic part of data mining, as opposed to interactive manual data collection usually conducted for face recognition. Note that in many situations it is impossible to do manual data collection for certain individuals, such as Bill Clinton. For those people, their face samples can only be obtained through the WWW, newspapers, and so on. Thus, automatic data collection is not only efficient but is also necessary.

Face detection is approached as pattern classification in a color feature space. The detection process is accomplished in two major steps: feature classification and candidate generation. In the feature classification stage, each pixel is classified as face or nonface based on a standard Bayesian rule (Fukunaga, 1990). The classification is conducted based on pre-tuned regions for the human face in a color feature space. The color features used in this approach are hue and chrominance. The pre-tuning of the classification region in the color feature space is conducted by sampling over 100 faces of different races from different Web sites. In the candidate generation stage, first a morphological operation is applied to remove the noise, and then a connected component search is used to collect all the "clusters" that indicate the existence of human faces. Since the pre-tuned color feature region may also classify other parts of the human body as candidates, let alone certain other objects that may happen to fall within the region in the color feature space, heuristic checking is used to verify the shape of the returned bounding box to see if it conforms to the "golden ratio" law.(1) Figure 5 shows the whole process of face detection and recognition for a Web image. Note that each detected face is automatically saved into the face library if it has a strong textual indication of who this person is (self-learning to build up the face library), or the face image is searched in the face library to find who the person is, if the query asks to retrieve images of this individual (query stage).
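An illustrative sketch of this pipeline follows: pixel classification reduced here to a pre-tuned hue/chrominance box, a 4-connected component search, and the golden-ratio aspect check. The thresholds, tolerance, and tiny mask are assumptions for illustration, not the tuned values from the paper.

```python
HUE = (0, 50)       # pre-tuned skin region (illustrative values only)
CHROMA = (20, 120)

def is_skin(hue, chroma):
    """Stand-in for the Bayesian face/nonface pixel classifier."""
    return HUE[0] <= hue <= HUE[1] and CHROMA[0] <= chroma <= CHROMA[1]

def components(mask):
    """4-connected component search over a binary mask (list of rows)."""
    h, w = len(mask), len(mask[0])
    seen, boxes = set(), []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and (y, x) not in seen:
                stack, box = [(y, x)], [y, x, y, x]  # ymin, xmin, ymax, xmax
                seen.add((y, x))
                while stack:
                    cy, cx = stack.pop()
                    box = [min(box[0], cy), min(box[1], cx),
                           max(box[2], cy), max(box[3], cx)]
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] \
                                and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                boxes.append(box)
    return boxes

def face_like(box, tolerance=0.35):
    """Heuristic golden-ratio check: width/height near 2 / (1 + sqrt(5))."""
    height = box[2] - box[0] + 1
    width = box[3] - box[1] + 1
    return abs(width / height - 2 / (1 + 5 ** 0.5)) < tolerance
```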


In the face recognition stage, there are two modes of operation. In the mode of face library construction, it is assumed that each face image has its collateral textual information to indicate identities of the people in the image. Face detection is first applied to detect all the faces. Based on the collateral information (Srihari, 1995b), the identities for each face may be found and thus saved into the library automatically. In the mode of query, on the other hand, the detected face needs to be searched in the library to find out the identity of this individual.

Figure 5 (e) and (g) are two face images of the same individual. Identifying whether or not two face images contain the same individual is a problem of finding semantic similarity between two face images in the general image domain. This is one of the current research directions underway. Promising experimental results based on preliminary tests show that it is possible to include the capability of querying individuals in image retrieval by conducting semantic similarity matching.

To summarize, image processing capability currently consists of: (1) a face detection module based on color feature classification to determine whether or not an image contains human faces, and (2) a histogram-based similarity matching module to determine whether or not two images "look" similar.


Even though there has been much success recently in text-based information retrieval systems, there is still a feeling that the needs of users are not being adequately met. Multimodal IR presents an even greater challenge, since it adds more data types/modalities, each having its own retrieval models. The body of literature in multimodal IR is vast, ranging from logic formalisms for expressing the syntax and semantics of multimodal queries (Meghini, 1995) to MPEG-4 standards for video coding, which call for explicit encoding of semantic scene contents. A popular approach has been to add a layer representing meta querying on top of the individual retrieval models. An agent-based architecture for decomposing and processing multimodal queries is discussed in Merialdo and Dubois (1997). In focusing so much on formalisms, especially in the logic-based approaches, researchers sometimes make unrealistic assumptions about the quality of information that can be automatically extracted (e.g., the detection of complex temporal events in video).

The present research focuses not on the formalism used to represent the queries; rather, the focus is on the effect of utilizing automatically extracted information from multimodal data to improve retrieval. Processing queries requires the use of: (1) information generated from statistical text indexing, (2) information generated from natural language processing of text, and (3) information generated from image indexing--in this case, face detection and recognition--as well as color, shape, and texture indexing.

Thus, matching a query to a captioned image in the database could involve four types of similarity computation:

1. SIM ([Text_Indx.sub.q], [Text_Indx.sub.CapImg]): text-based similarity, statistical approach;

2. SIM ([Img_Indx(j).sub.q], [Img_Indx(j).sub.CapImg]): j = 1, ..., k: image similarity for each image feature, statistical approach;

3. SIM ([PDT.sub.q], [PDT.sub.CapImg]): text-based concept similarity, symbolic approach; and

4. SIM ([Objects.sub.q], [Objects.sub.CapImg]): image-based content similarity, symbolic approach.
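One plausible way to fold the four SIM components into a single ranking score is a weighted sum; the weights and field names below are illustrative assumptions, not values from the system.

```python
def combined_score(scores, weights=None):
    """Weighted sum of the four SIM components, each assumed in [0, 1]."""
    weights = weights or {"text_stat": 0.4, "image_stat": 0.2,
                          "pdt": 0.3, "objects": 0.1}
    return sum(weights[k] * scores.get(k, 0.0) for k in weights)

candidate = {"text_stat": 0.9, "image_stat": 0.5, "pdt": 0.8, "objects": 1.0}
```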

Syntax and Semantics of Multimodal Queries

Similarity matching techniques for each information source are discussed in the next section. Here the discussion centers on the interpretation of the query, as handled by the procedure Interpret_Query which attempts to understand the user's request and decompose it accordingly.

User input includes one or more of the following: (1) text_query, a text string; (2) image_query, an image; (3) topic_query, one or more concepts selected from a pre-defined set of topics, such as sports, politics, entertainment, and so on; and (4) user_preferences, a set of choices made by the user indicating preferred display choices and so on. These are used by the Interpret_Query module in determining ranking schemes.

The specific objective of the Interpret_Query procedure is: (1) to determine the arguments to each of the SIM(x, y) components mentioned above, and (2) to determine the set of ranking schemes that will be used in presenting the information to the user. Determining arguments to the text and image similarity functions are straightforward. The text string comprising the query is processed, resulting in content terms to be used in a vector-space matching algorithm. In the case of a query image, the image features are available already, or are computed if necessary. Determining the arguments to the picture description template similarity and object similarity are more involved. Some natural language processing analysis of the Text_String is required to determine which people, objects, events, and spatial relationships are implied by the query.

Another important issue is to decide on how information should be combined. For example, for an unambiguous query such as Find pictures of Bill Clinton, the face detection and recognition results will be automatically applied to produce a single ranking of images satisfying the query. However, for a more subjective query, such as Find pictures of victims of natural disasters, the general context is first applied. The results are then sorted based on various visual criteria, thus allowing the user to browse and make a selection.

Each ranking scheme RSk defines a ranking (CapImg(k,1), CapImg(k,2), ..., CapImg(k,nk)) of the images in the database. Currently, a simple technique to generate ranking schemes is employed. For each information source that is involved in a query, several sort criteria are applied in varying order. These sort criteria reflect the relative importance of each information source. For example, for queries involving finding people in various contexts, two sorted lists will be presented to the user. The first weights the context more and the second weights the face detection results more--i.e., presence of face, relative size of face.
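Generating multiple ranking schemes by reordering sort criteria might look like the sketch below; the field names are assumptions for illustration.

```python
def rank(images, criteria):
    """Sort by the given criteria in priority order (descending)."""
    return sorted(images, key=lambda im: tuple(im[c] for c in criteria),
                  reverse=True)

images = [
    {"id": "a", "context": 0.9, "face_score": 0.1},
    {"id": "b", "context": 0.6, "face_score": 0.9},
]
context_first = rank(images, ["context", "face_score"])  # RS1
faces_first = rank(images, ["face_score", "context"])    # RS2
```

The same candidate set thus yields one ranking per criteria ordering, which is exactly what lets the user browse by whichever emphasis fits their intent.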

Matching Queries to Data

Text-based similarity is based on statistical indexing techniques; while not as precise as natural language processing techniques, it is very robust. Image-based similarity techniques using color, shape, texture, and so on have been discussed extensively in the content-based image retrieval literature. Image-based content similarity includes any visual information that has been verified by using object recognition techniques (e.g., number of faces, gender) or semantic classification (e.g., indoor versus outdoor).

When matching based on the similarity of picture description templates, it is necessary to employ unification techniques. For example, a search for Dalmatian should match a picture whose PDT contains dog. That is, Unify(Dalmatian, dog) should return a non-zero value. An approach similar to that of Aslandogan et al. (1997) to perform inexact matching is being adopted. The use of ontologies is required for several purposes in this phase. First, they are required to map entities into their basic categories (Rosch et al., 1976); research has shown that people most often query by basic categories (e.g., dog rather than Doberman). If the caption refers to the location as an auditorium, for example, it is necessary to map this into building for the purpose of retrieval. Similar mapping needs to take place on query terms. Srihari (1995a) and Aslandogan et al. (1997) discuss the use of WordNet in matching picture entities with queries. WordNet provides critical information in determining hierarchical relationships between entity classes and event classes.
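A toy version of the Unify test, with a hand-built hypernym table standing in for WordNet (the hierarchy and the 1/(1+steps) scoring are assumptions for illustration):

```python
# Assumed mini-hierarchy; WordNet hypernym chains play this role in practice.
HYPERNYM = {"dalmatian": "dog", "doberman": "dog",
            "dog": "animal", "auditorium": "building"}

def unify(term, target):
    """Return 1/(1+steps) if `term` equals `target` or is a descendant of it,
    else 0.0; fewer hypernym steps means a closer match."""
    steps = 0
    while term is not None:
        if term == target:
            return 1 / (1 + steps)
        term = HYPERNYM.get(term)
        steps += 1
    return 0.0
```

Note the asymmetry: a Dalmatian is a dog, so unify("dalmatian", "dog") is non-zero, but unify("dog", "dalmatian") is zero, matching the basic-category behavior described above.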

Query Refinement and Relevance Feedback

Since users are not always sure of what they are looking for, an adaptive system is required. After specifying an initial query, the results are sorted into various classes based on the ranking schemes suggested by Interpret_Query. Users may choose to refine the query by modifying the text query or the concept query, or by selecting images that best match their needs. The latter are used in a relevance feedback process, where users can interactively select pictures that satisfy their needs. Although the technique is well understood in the text domain (Chang, 1998; Robertson, 1986; Rocchio, 1971; Ide, 1971; Croft & Harper, 1979; Fuhr & Buckley, 1991), it is still in the experimental stage in the image domain (Smith, 1997). Popular techniques include Rocchio's (1971) relevance feedback formula for the vector model and its variations (Ide, 1971), and the Croft-Harper formula (1979) for the probabilistic retrieval model and its modifications (Fuhr & Buckley, 1991; Robertson, 1986). Query refinement consists of adjusting the weights assigned to each feature; this is the technique adopted in the text domain. Of course, the difficult aspect is determining which features are important. The multiple ranking scheme described in the previous section is of use here, since each ranking corresponds to the importance of certain features (or metadata). By selecting images in certain ranking schemes, the system is able to learn which features are useful. This process can continue iteratively until the user finds the required picture. The user interface supports the visual browsing that is an integral part of image retrieval.
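Rocchio's (1971) formula can be sketched over plain term-weight dictionaries as follows; the alpha/beta/gamma values are common defaults and an assumption here, as is clipping negative weights to zero.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*mean(relevant docs) - gamma*mean(nonrelevant docs)."""
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        non = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
        w = alpha * query.get(t, 0.0) + beta * rel - gamma * non
        new_q[t] = max(w, 0.0)  # negative weights are clipped (an assumption)
    return new_q

q = rocchio({"clinton": 1.0},
            relevant=[{"clinton": 1.0, "lawn": 1.0}],
            nonrelevant=[{"senate": 1.0}])
```

Terms from user-selected pictures' captions (here "lawn") gain weight, pulling the refined query toward the images the user actually wants.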


There were two experiments conducted in picture retrieval from multimodal documents. Each reflected a different strategy of combining information obtained by text indexing and image indexing. Both of these are now described.

Single Ranking Method

In this experiment, the queries are first processed using text indexing methods. This produces a ranking Px1, ..., Pxn as indicated in Figure 6. Those pictures Pxi which do not contain faces are subsequently eliminated; this information is obtained by running the automatic face detection module.
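The single ranking strategy reduces to a filter over the text ranking, as in this sketch; `has_face` stands in for the face detection module and the image IDs are assumptions.

```python
def single_ranking(text_ranked, has_face):
    """Keep the text-indexing order; drop images without detected faces."""
    return [img for img in text_ranked if has_face(img)]

text_ranked = ["p1", "p2", "p3", "p4"]        # output of text indexing
detections = {"p1": True, "p2": False, "p3": True, "p4": True}
filtered = single_ranking(text_ranked, detections.get)
```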


Figure 7 presents the results of an experiment conducted on 198 images that were downloaded from various news Web sites. The original data set consisted of 941 images. Of these, 117 were empty (white space) and 277 were discarded as being graphics. From these, a subset of 198 images was chosen for this experiment.

Figure 7. Results of Single Ranking Strategy of Combining Text and Image Content. The last column indicates the result of text indexing combined with face detection. The number in parentheses indicates the number of images in the given quantile that were discarded due to failure to detect faces.
               Text Only     Text + Manual Insp    Text + Face Det

At 5 docs        1.0                1.0                1.0(3)
At 10 docs       1.0                0.70               1.0(3)
At 15 docs       0.80               0.75               1.0(2)
At 30 docs       0.77               0.67                NA

There were ten queries, each involving the search for pictures of named individual(s); some also specified contexts, such as find pictures of Hillary Clinton at the Democratic Convention. Due to the demands of truthing, the results for one query are reported; more comprehensive evaluation is currently underway. Figure 7 indicates precision rates using various criteria for content verification: (1) using text indexing (SMART) alone, (2) using text indexing and manual visual inspection, and (3) using text indexing and automatic face identification. As the table indicates, using text alone can be misleading--when inspected, many of the pictures do not contain the specified face. By applying face detection to the result of text indexing, photographs that do not have a high likelihood of containing faces are discarded. The last column indicates that this strategy is effective in increasing precision rates. The number in parentheses indicates the number of images in the given quantile that were discarded due to failure to detect faces.

Sample output is shown in Figures 8 and 9. Figure 8 illustrates the output based on text indexing alone. The last picture illustrates that text alone can be misleading. Figure 9 illustrates the re-ranked output based on results from face detection. This has the desired result that the top images are all relevant. However, a careful examination reveals that, due to the face detector's occasional failure to detect faces in images, relevant images are inadvertently being discarded. Thus this technique increases precision but lowers recall. However, if the source of images is the WWW, this may not be of concern. The face detector is continually being improved to make it more robust to varied lighting conditions.


Multiple Ranking Method

In this experiment, a multiple ranking method for presenting candidate images to the user is employed. This strategy is depicted in Figure 10. The context is first verified using statistical text indexing. These candidate images are then sorted based on various visual properties. The first property is the presence of faces, the second represents the absence of faces (reflecting an emphasis on general scene context rather than individuals). This reflects the assumption that users do not know a priori exactly what kind of pictorial attributes they are looking for--i.e., that they would like to browse. Figure 11 depicts the top ranked images for the query victims of disasters.


Many of these refer to the recent air crash in Indonesia, partially blamed on heavy smoke from forest fires. Some images depict victims; some depict politicians discussing the situation. Based on an imposed threshold, only the top ten images returned by text retrieval were considered. As the results show, this produces a different ranking of images, where the lower row clearly emphasizes people. Had a lower threshold for text retrieval been used, the difference would have been more dramatic.

Evaluating precision and recall for such a technique is challenging. The precision rate for a given sorting criterion is based on both the text relevance and the presence of the required pictorial attributes (e.g., presence of faces). The text retrieval precision for the top ten images is 90 percent. However, when "presence of faces" is used as a sorting criterion, the precision in the top ten images drops to 40 percent. This is primarily due to the presence of very small faces in the image which are found by the face detector. Since the manual annotators were instructed to disregard faces below a certain size, these are judged to be erroneous (e.g., the last picture in the second row of Figure 11). Thus, assigning relevance judgments based on pictorial attributes must be reinvestigated.


Future directions include improvements on several fronts. First, it is necessary to incorporate information derived from natural language processing as well as statistical image indexing into the retrieval model. Second, the experiments conducted so far have involved only a single query modality, namely text. The next step is to permit multimodal queries, whereby the user can specify an information request using a combination of text (representing contextual constraints) and images (representing exemplars). A relevance feedback mechanism whereby the system can "learn" from user feedback is called for.

Finally, there is a need for more comprehensive testing and evaluation of the techniques developed thus far. The development of evaluation frameworks suitable for multimedia information retrieval systems is still an emerging research area. It is the focus of the MIRA (1999) project, a consortium of IR researchers in Europe. They make a strong case for dynamic evaluation techniques for such applications, as opposed to the static evaluation techniques used in text retrieval systems. Rather than evaluating the initial results of a single query, researchers are proposing that the evaluation should be associated with an entire session consisting of continuously refined queries. For example, a monotonically increasing performance curve indicates a good session. They also suggest that new interaction-oriented tasks (apart from search and retrieval) must be supported and evaluated. An example of the latter would be the ability to clarify and formulate information needs.
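The session-level criterion mentioned above, judging a refinement session by whether its precision curve never decreases, is simple enough to sketch directly; the sample precision values are illustrative.

```python
def good_session(precisions):
    """True if precision never decreases across successive query refinements."""
    return all(b >= a for a, b in zip(precisions, precisions[1:]))
```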

In this research effort, the following measures of performance are of interest: (1) effectiveness of the ranking scheme generated based on the user's query input and preferences, (2) performance of each individual ranking scheme, and (3) performance of the face detection and recognition modules.


This article has presented a system for searching multimodal documents for pictures in context. Several techniques for extracting metadata from both images and text have been introduced. Two different techniques for combining information from text processing and image processing in the retrieval stage have been presented. This work represents efforts toward satisfying users' needs to browse efficiently for pictures. It is also one of the first efforts to automatically derive semantic attributes of a picture and to subsequently use them in content-based retrieval. Retrieval experiments discussed in this article have utilized only two of the four indexing schemes that have been developed. These show the promise of integrating several modalities in both the indexing and retrieval stages.


(1) It is believed that for a typical human face, the ratio of the width to the height of the face is always around the magic value of 2/(1 + √5), which is called the golden ratio (Farkas & Munro, 1987).


References

Aslandogan, Y. A.; Thier, C.; Yu, C. T.; Zou, J.; & Rishe, N. (1997). Using semantic contents and WordNet in image retrieval. In SIGIR '97 (Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 27-31, 1997, Philadelphia, PA) (pp. 286-295). New York: Association for Computing Machinery.
Chang, C. C., & Lee, S. Y. (1991). Retrieval of similar pictures on pictorial databases. Pattern Recognition, 24(7), 675-680.

Chang, W. C. (1998). A framework for global integration of distributed visual information systems. Unpublished doctoral dissertation, State University of New York, Buffalo.

Chakravarthy, A. S. (1994). Representing information need with semantic relations. In COLING-94 (The 15th International Conference on Computational Linguistics, August 5-9, 1994, Kyoto, Japan) (pp. 737-741). Morristown, NJ: ACL.
Chellappa, R.; Wilson, C.; & Sirohey, S. (1995). Human and machine recognition of faces: A survey. Proceedings of the IEEE, 83(5), 705-741.

Croft, W., & Harper, D. (1979). Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4), 285-295.

Farkas, L. G., & Munro, I. R. (1987). Anthropometric facial proportions in medicine. Springfield, IL: Charles C. Thomas.

Fuhr, N., & Buckley, C. (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3), 223-248.

Fukunaga, K. (1990). Introduction to statistical pattern recognition (2d ed.). Boston: Academic Press.

Hobbs, J. R.; Appelt, D.; Bear, J.; Israel, D.; Kameyama, M.; Stickel, M.; & Tyson, M. (1997). FASTUS: A cascaded finite-state transducer for extracting information from natural language text. In E. Roche & Y. Schabes (Eds.), Finite-state language processing (pp. 383-406). Cambridge, MA: MIT Press.

Ide, E. (1971). New experiments in relevance feedback. In G. Salton (Ed.), The SMART retrieval system: Experiments in automatic document processing (pp. 337-354). Englewood Cliffs, NJ: Prentice-Hall.

Jorgensen, C. (1996). An investigation of pictorial image attributes in descriptive tasks. In B. E. Rogowitz & J. P. Allebach (Eds.), Human vision and electronic imaging (Proceedings of SPIE, vol. 2657, pp. 241-251). Bellingham, WA: SPIE.

Karttunen, L., & Beesley, K. R. (1992). Two-level rule compiler (Unpublished Tech. Rep. No. ISTL-92-2). Palo Alto, CA: Xerox PARC.

Meghini, C. (1995). An image retrieval model based on classical logic. In SIGIR '95 (Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 9-13, 1995, Seattle, WA) (pp. 300-309). New York: Association for Computing Machinery Press.

Merialdo, B., & Dubois, F. (1997). An agent-based architecture for content-based multimedia browsing. In M. T. Maybury (Ed.), Intelligent multimedia information retrieval (pp. 281-294). Cambridge, MA: AAAI Press.

MIRA. (1999). Evaluation frameworks for interactive multimedia information retrieval applications. Retrieved July 7, 1999 from the World Wide Web: mira.

Robertson, S. (1986). On relevance weight estimation and query expansion. Journal of Documentation, 42(3), 182-188.

Rocchio, J.J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART retrieval system: Experiments in automatic document processing (pp. 313-323). Englewood Cliffs, NJ: Prentice-Hall.

Roche, E. (1997). Parsing with finite-state transducers. In E. Roche & Y. Schabes (Eds.), Finite-state language processing (pp. 241-280). Cambridge, MA: MIT.

Romer, D. M. (1993). A keyword is worth 1,000 images (Kodak Internal Tech. Rep.). Rochester, NY: Eastman Kodak.

Romer, D. M. (1995). Research agenda for cultural heritage on information networks. Retrieved July 7, 1999 from the World Wide Web:

Rosch, E.; Mervis, C. B.; Gray, W. D.; Johnson, D. M.; & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8(3), 382-439.

Rowe, N., & Guglielmo, E. (1993). Exploiting captions in retrieval of multimedia data. Information Processing and Management, 29(4), 453-461.

Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.

Smith, J. R. (1997). Integrated spatial and feature image systems: Retrieval, analysis, and compression. Unpublished doctoral dissertation, Columbia University, New York.

Smith, J. R., & Chang, S.-F. (1996). VisualSEEK: A fully automated content-based image query system. In Proceedings of ACM Multimedia '96 (November 18-22, 1996, Boston, MA) (pp. 87-98). New York: Association for Computing Machinery Press.

Srihari, R. K. (1995a). Automatic indexing and content-based retrieval of captioned images. Computer, 28(9), 49-56.

Srihari, R. K. (1995b). Use of captions and other collateral text in understanding photographs. Artificial Intelligence Review, 8(5-6), 409-430.

Srihari, R. K., & Burhans, D. T. (1994). Visual semantics: Extracting visual information from text accompanying pictures. In Proceedings of the Twelfth National Conference on Artificial Intelligence (pp. 793-798). Menlo Park, CA: AAAI Press.

Sundheim, B. (Ed.). (1995). MUC-6 (Proceedings of the 6th Message Understanding Conference, November 6-8, 1995, Columbia, MD). San Francisco: Morgan Kaufmann.

Swain, M. J., & Ballard, D. H. (1991). Color indexing. International Journal of Computer Vision, 7(1), 11-32.


Bikel, D. M.; Miller, S.; Schwartz, R.; & Weischedel, R. (1997). Nymble: A high-performance learning name-finder. In Proceedings of the 5th Conference on Applied Natural Language Processing (March 31-April 3 1997, Washington DC) (pp. 194-201). Boston: MIT Press.

Rohini K. Srihari, Department of Computer Science, Center for Document Analysis and Recognition (CEDAR), UB Commons, 520 Lee Entrance--Suite 202, State University of New York, Buffalo, NY 14228-2567

Zhongfei Zhang, Computer Science Department, Watson School of Engineering and Applied Science, State University of New York at Binghamton, Vestal, NY 13902

LIBRARY TRENDS, Vol. 48, No. 2, Fall 1999, pp. 496-520

ROHINI K. SRIHARI is an Associate Professor of Computer Science and Engineering at SUNY at Buffalo. She has worked in both natural language processing and computer vision. Dr. Srihari's current research interests include multimedia information retrieval and multimodal image annotation systems. She is presently working on three projects. The first project, Show&Tell--a multimodal system (combining speech, deictic input, and computer vision) for aerial image annotation and retrieval--was sponsored by the DOD as part of the RADIUS image exploitation program. The second project, WebPiction--for combining text and image context in image retrieval for the World Wide Web--is an extension of a DOD-sponsored effort on the use of collateral text in image understanding. The third project, Imagination--an Image Annotation and Metadata Generation System for Consumer Photos--is being sponsored by Kodak as part of their digital imaging initiative. Dr. Srihari recently organized a workshop on Multimedia Indexing and Retrieval in conjunction with the ACM conference on Information Retrieval, SIGIR '99.

ZHONGFEI ZHANG is an Assistant Professor in the Computer Science Department at SUNY at Binghamton. Prior to that he was a Research Assistant Professor at the Department of Computer Science and Engineering at SUNY Buffalo. His research interests include multimedia information indexing and retrieval, image understanding and processing, pattern recognition, artificial intelligence, and robotics. He has published over twenty academic papers in international and national journals and conferences.
COPYRIGHT 1999 University of Illinois at Urbana-Champaign
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 1999, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.
