Content-based image retrieval systems--reviewing and benchmarking.

1. Introduction

Research in Content-Based Image Retrieval (CBIR) is devoted to developing techniques and methods that fulfil users' image-centered information needs and manage large amounts of image data. It is nowadays an even more vivid research area than when Veltkamp and Tanase provided an overview of the CBIR landscape [VT02], which showed the state of the art at that time. Since then there has been much development in multiple directions. The question arises how those projects evolved: which have been concluded, which canceled, which are still active or have spawned new projects. Yet another question the report by Veltkamp and Tanase didn't answer is how the systems perform. This question alludes to the broad topic of benchmarking information retrieval systems in general and content-based image retrieval systems specifically. A number of research papers have addressed this topic ([MMS+01b, MMW06, MMS01a]), focusing on how to establish standards for benchmarking in this area.

We tried to follow the paths of the systems described in [VT02] and reviewed new developments. Assuming a bird's eye perspective, we found that on the one hand many of the older projects were finished or have not been developed further. For many of these efforts it remains unclear what further developments have occurred, if any. On the other hand there are a lot of new projects. Among them we found a tendency to implement standards like MPEG-7, as well as a simplification of user interfaces. Of the recent systems, none expects the user to provide image features as a query, for example. Most of the efforts try to address the problem of CBIR in a general fashion; they don't focus on specific domains. The projects are often spin-offs or specialisations of other, general CBIR systems or frameworks. Since presenting all of our findings would exceed the scope of this article, we just present an overview of the systems we see as "live": active and ongoing projects, which are either under active development or research, or are being used in products. Furthermore, the availability of the source or runtime code for testing was an obvious prerequisite. The complete report can be requested from Paul Maier (1).

In this article we focus on the mentioned systems and their development as well as several new systems and on benchmarking a small subset of these systems. We don't review the development of CBIR as a whole. In their recent work Datta et al. ([DJLW08]) provide a broad overview of recent trends, ideas and influences in CBIR.

The rest of the article is organized as follows: section 2 provides the overview of "live" projects. Also, it identifies the systems chosen for the benchmark. Section 3 explains how the systems were benchmarked; in particular, our new measure R̃_WRN is introduced. Section 4 discusses the benchmark results and section 5 finally sums up and provides an outlook on future work.

2. Reviewed Systems

Table 1 shows an overview of all "live" systems. The leftmost column lists the systems, sorted alphabetically. The second column holds references for the systems; then follow the columns showing the features: support of texture, color, shape, search by keywords, interactive relevance feedback, MPEG-7 descriptors [DK08] (for example DominantColor), whether they are specifically designed to be used as a web search engine, and whether they use classification techniques or segment images into regions.


The column designed for web is primarily a matter of scalability: a web search engine must be able to handle millions of images, performing searches on vast databases within seconds. But it also means that some kind of image accumulation mechanism, such as a web crawler, must be provided.

The column classification specifies if a system uses some kind of classification mechanism, clustering images and assigning them to classes. The last column, image regions, indicates whether a system applies some kind of image segmentation. This is not the same as shape: image regions might be used to identify shapes in an image, but they could also simply be a means of structuring the feature extraction (see for example VIPER in section 4.5).

One can recognize that keyword search is very often supported as well as relevance feedback. There seems to be a consensus that image retrieval in general cannot solely rely on content-based search but needs to be combined with text search and user interaction.

MPEG-7 image descriptors are still seldom used, but especially new systems or new versions of systems tend to incorporate these features. Many of the systems are designed to work on large image sets (>100 000 images in the database), but they are rarely designed specifically as a web search engine. Currently, only Cortina seems to be designed that way.

Seven CBIR systems have been chosen for our benchmark:

SIMBA, SIMPLIcity, PictureFinder, ImageFinder, Caliph&Emir, VIPER/GIFT, Oracle Intermedia

The systems were chosen following three criteria: (1) availability and setup: we needed to obtain and set up the CBIRSs in order to benchmark them within our own environment. Setup failed with some obtained systems, which were therefore excluded; (2) visual similarity only: in our benchmark we focus on visual similarity search. Therefore, systems were required to use only visual similarity properties for their search. We excluded additional features, like text search or semantic analysis (e.g., automatic annotation); (3) query by example: this method was required to be supported by all CBIRSs.

There are a number of interesting systems which comply with the mentioned requirements, but have not been tested for other reasons. There is for example the Blobworld [CTB+99] system, which turned out too time-consuming to set up. Three other interesting systems which have not been tested, mainly due to time constraints, are PicSOM [KLLO00], Quicklook (2) [CGS01] and FIRE [DKN04] (4).

3. Benchmark Structure and Measurement

A new benchmark was developed to test and compare the chosen systems. This section outlines the aim of the benchmark, describes its overall structure, and states which images were used in the database and how a ground truth was built from those images. Finally, our performance measure is introduced, as well as two new methods to compare CBIR systems based on these measures.

3.1 Benchmark Methods

In recent years efforts were made to analyse the possibilities for benchmarking CBIR systems and to propose standard methods and data sets ([MMS+01b, MMW06, MMS01a, ben]). Since 2003 the cross-language image retrieval track ImageCLEF ([ima05]) has been run every year as part of the Cross Language Evaluation Forum. The focus of this track is on image retrieval from multilingual documents. A review of evaluation strategies for CBIR is presented in [DJLW08]. Still, there seems to be no widely accepted standard for CBIR benchmarking, neither a standard set of performance measures nor a standard data set of images for the benchmarks.

A CBIR system has many aspects which could be subjected to a benchmark: the retrieval performance of course, its capability to handle large and very large image sets, and the indexing mechanism, to name just a few. In this work we focus on the retrieval performance aspect.

Aim Of The Benchmark We compared CBIRSs to each other based on their accuracy, i.e. their ability to produce results which accurately match ideal results defined by a ground truth. We assume a CBIRS does a good job if it finds images which the user himself would choose. No other cues were used, like e.g. relevance feedback. We used query by example only, no query by keyword, sketch or others. Also, no optimization for the image sets was done. The intent was to test the systems without providing them with prior knowledge of the image database. We concentrated on visual similarity only; the benchmark does not address any other aspects, e.g. resource usage and scalability. Those aspects, while of at least equal importance, were left out mainly due to time constraints.

The basic requirement of a CBIRS is to fulfil a user's information need: finding useful information in the form of images. We assume a CBIRS fulfils an information need if it finds visually similar images. Accordingly, we define similarity of images as follows:

Definition: Image Similarity Two or more images are taken to be similar if spontaneously viewed as visually similar by human judges.

Accuracy in this work is simply defined through the ideal order imposed on result images by similarity: images more similar to a query appear before less similar images in the result. If the actual ordering in the result violates this ideal ordering, accuracy decreases. In the following we characterize a system's performance by its accuracy.

3.2 Image Database

The images used for our benchmark are divided into domains of roughly similar images. The intention was to have a fairly representative sample of images, roughly grouped by some sort of image type. The following list explains which image domain represents which common type of image (and lists the query images):

Band: (5) Images of a street band. All the images are grey-scale. They represent black and white photographs and typical images showing a group of people.

Blumen (flowers): (6) Photographs of single flowers, taken in a natural environment. Typical examples are photographs of single (natural) objects with nature as background (meadows, rocks etc.).

Kirschen (cherry): (7) Photographs of blossoming cherry trees, taken in some kind of park. This is representative for trees, bushes, colorful images, mixed with artificial structures like pathways and buildings.

Textur (texture): (8) These are texture images. That means, they are photographs of various surface structures and intended for use as texture in computer graphics. Typical examples are texture-only images, like photographs of tissue, cloth, wall paint and the like.

Urlaub (vacation): (9) Photographs from someone's vacation. Typical examples are vacation/leisure time photographs.

Verfalschung-Wind (distortion-wind): A graphical filter (creating a motion blur which looks like wind) was applied to a source image (which is used as the query image), creating a sequence of images with increasing filter strength. This domain will also be referred to as Verfalschung for short.

Each domain additionally contains several images from all other domains to create a "background noise".

Image Sources

The images have been taken from several sources which provided them for free use. The sources are indicated as footnotes in the above list except for Verfalschung. The images of this domain are altered versions of an image which is distributed together with Caliph&Emir [LBK03].

The images from the University of Washington and Benchathlon.net seem to be public domain; no licence is given. Benchathlon.net clearly states that uploaders should ensure that images are not "copyright encumbered". The terms for images from ImageAfter allow free use except for redistributing them through an online resource like ImageAfter (10).

3.3 The Benchmark

The CBIRSs are benchmarked by running and evaluating a set of well-defined queries on them. A query is simply an image for which the search engines look for similar images. The search is done within the images of the query's domain. The reason is that building the ground truth was manageable this way, since only the images within a domain needed to be evaluated for similarity. Query images were chosen such that a reasonable number of similar results was possible.

Each CBIRS has a number of parameters which allow one to adapt and modify it in various ways. Changes in those settings can improve or worsen the result for certain or even all queries. Consequently, we benchmarked different parameter settings, or psets, for each CBIRS. For example, for VIPER/GIFT one pset uses texture features only, another is restricted to color features. In the following, we refer to a CBIRS with a specific pset as a system. We ran the benchmark on 23 psets overall. From these, 20 were fully evaluated; the 3 psets of SIMPLIcity had to be excluded due to the missing-images problem explained below.


Psets are named with letters of the alphabet starting with 'b' (11). For example, Caliph&Emir has three psets: b, c, d. Across different CBIRSs, psets have nothing in common, i.e., two psets c for two different CBIRSs only share the letter. The only exception is pset b, which for all CBIRSs is their standard setup. In the results section, the psets for each CBIRS will be described in detail.

3.4 Ground Truth

The ground truth for this benchmark has been established through a survey among 5 randomly chosen students. The students had no prior knowledge of CBIR techniques and were not familiar with the image sets (other than what one could expect from fairly normal photographs). 14 queries have been defined; for each, the students had to compare the query image with all images from the query's domain. They were asked to judge the similarity from 0 (no similarity) up to 4 (practically equal images) (12). The ground truth thus consists of tuples of query image and result image which are mapped to a similarity value. This value is the mean of all similarity values from the different survey results and thus is an element of [0, 4].

3.5 Performance Measures

The information in this section has mainly been excerpted from [MMS+01b, MMW06].

From the domain of text-based information retrieval a number of methods are known which can also be applied to content-based image retrieval. One of the simplest, but most time-consuming methods is before-after user comparison: A user chooses the best result from a set of different results to a predefined query. 'Best' in this context means the most relevant result.

Then, a wide variety of single-valued measures exists. Rank of the best match measures whether the most relevant image is among, say, the first 50 or the first 500 images retrieved. In [MMS+01b] the Normalized Average Rank (NAR) is proposed as a performance measure for CBIR systems. We use it as the basis for our own measure, which will be explained later. Popular, and a standard in text retrieval, are the measures precision and recall:

precision = (no. relevant images retrieved) / (total no. images retrieved)

recall = (no. relevant images retrieved) / (total no. relevant images in the collection)

They are good indicators of system performance, but they have to be used in conjunction, e.g. precision at 0.5 recall or the like. An entirely different method is target testing, where a user is given a target image to find. During the search, the number of images is recorded which the user has to examine before finding the target. The binary preference measure keeps track of how often non-relevant images are misplaced, i.e. ranked before relevant images. Further single-valued measures are error rate, retrieval efficiency, and correct and incorrect detection. For the sake of brevity these are only listed here; for detailed explanations the reader is referred to [MMS+01b].
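For concreteness, the two standard measures can be sketched as a small function (a minimal illustration; the function and argument names are ours, not taken from any of the reviewed systems):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: ordered list of image ids returned by the system
    relevant:  set of image ids judged relevant in the ground truth
    """
    hits = sum(1 for img in retrieved if img in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a result list of four images containing two of three relevant images yields precision 0.5 and recall 2/3.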

A third category of measures can be subsumed as graphical methods. They are graphs of two measures, the most popular being precision vs. recall graphs. They are also a standard measure in IR and can be readily interpreted by many researchers due to their popularity. Another kind of graphs are precision/recall vs. number of images retrieved graphs. The drawback of these two graphical methods is their dependency on the number of relevant images. Retrieval accuracy vs. noise graphs are used to show the change in retrieval performance as noise is added to the query image. A number of other graph methods exist, again the reader is referred to [MMS+01b] for further reading.

3.6 Worst Normalized Rank

The aforementioned measures all use binary similarity/relevance values, i.e. only the distinction relevant/not relevant or similar/not similar is made. In contrast, our measure uses multiple similarity values and thus allows fair and accurate comparisons of different CBIRSs with respect to different queries. None of the other benchmarks allows this comparison.

To account for five degrees of similarity we developed the new measure Worst Normalized Rank R̃_WRN, based on the NAR measure [MMS+01b]. NAR is defined as

R̃ = 1/(N·N_R) · ( Σ_{i=1}^{N_R} R_i - Σ_{i=1}^{N_R} i )
N: no. documents total

N_R: no. relevant documents total

R_i: i-th relevant document's rank in the result list

For a given query result, a ranked list of images, R̃ essentially computes a distance between the query result and the ideal result for that query. It does so by summing up the ranks R_i of the similar images and subtracting the sum of ranks of the ideal result, which is Σ_{i=1}^{N_R} i (13). Thus, the best possible measured value is always 0, while the upper bound for the worst value is 1. The measure was chosen because it expresses CBIRS performance (for a specific query result) in a single value and is fairly easy to adapt. In the following, the term NAR will be used in a general fashion when addressing properties which hold true for both the original measure and the modified one. When referring exactly to one of them, the expressions R̃ for the original measure and R̃_WRN for the modified measure will be used.
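Under the definitions above, the NAR computation can be sketched as follows (variable names are ours; ranks are assumed to be 1-based positions in the result list):

```python
def normalized_average_rank(ranks, n_images):
    """NAR after [MMS+01b].

    ranks:    1-based positions of the relevant images in the result list
    n_images: database size N
    """
    n_rel = len(ranks)
    ideal = n_rel * (n_rel + 1) // 2  # sum of ranks 1..N_R of the ideal result
    return (sum(ranks) - ideal) / (n_images * n_rel)
```

A perfect result (relevant images at ranks 1, 2, 3) yields 0; pushing every relevant image one rank down in a 100-image database yields 0.01.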

As a first step, R̃ is adapted to account for multiple similarity values in the following way:

R̃_m = 1/(N·N'_R) · ( Σ_{i=1}^{N} R_i·s_i/maxS - Σ_{i=1}^{N} i·s_i/maxS )
N: like above

i: images are sorted by similarity (the most similar has the lowest i), thus the ideal ranking would be R_i = i

s_i: similarity value for image i

N'_R: no. of relevant images, that is, all images with s_i > 0

maxS: maximum similarity value, 4 in this case (five degrees, 0-4)

To account for multiple similarity values s_i, the ranks R_i are weighted with s_i/maxS. The same is done for the ideal result: Σ_{i=1}^{N} i·s_i/maxS. The reasoning is that more similar images, when misplaced, should yield a larger distance from the ideal result. Note that the original measure is obtained if one chooses a ground truth with similarity values {0, maxS}: non-similar images yield R_i·0 = 0, similar images R_i·maxS/maxS = R_i.

This also explains why it doesn't matter whether one sums up to N_R/N'_R or N: the non-similar images can be left out and the remaining images are exactly the N_R/N'_R similar images for R̃ and R̃_m respectively. Note that the s_i need to be sorted by similarity, i.e. as stated above the highest s_i have the lowest i. Finally, the distance is normalized over the number of images in the DB and the number of similar images for the given query, N and N_R/N'_R.
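A sketch of the similarity-weighted variant, under the assumption that the relevant images are passed in sorted by decreasing similarity as described above (function and variable names are ours):

```python
def nar_multi(sim_and_rank, n_images, max_s=4):
    """NAR adapted to graded similarities.

    sim_and_rank: list of (s_i, R_i) pairs for all relevant images
                  (s_i > 0), sorted by decreasing similarity s_i
    n_images:     database size N
    """
    n_rel = len(sim_and_rank)
    # actual result: each rank weighted by its similarity
    actual = sum(r * s / max_s for (s, r) in sim_and_rank)
    # ideal result: the i-th most similar image sits at rank i
    ideal = sum(i * s / max_s for i, (s, _) in enumerate(sim_and_rank, start=1))
    return (actual - ideal) / (n_images * n_rel)
```

With a binary ground truth (s_i either 0 or maxS) this reduces to the original measure, as noted above.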

R̃_m allows one to compare CBIRSs based on multiple similarity degrees. However, the measurements are not comparable across different queries or different image sets: the worst possible measurement depends on the ratio of N_R to N, i.e. it differs from query to query and from image database to image database. Additionally, the worst value depends on the similarity distribution for a given query: it makes a difference whether for a given query we have one very similar image and four moderately similar images or the other way round.

Therefore, as a second step, we introduce the Worst Result Normalization: we additionally normalize the performance measure over the worst possible result for the given query and image set. The worst result for a query is simply the reversed ideal result, i.e. the most similar images are ranked worst and appear at the end of the list while all non-similar images are ranked first. The measure modified in this way is then:

R̃_WRN = ( Σ_{i=1}^{N} R_i·s_i - Σ_{i=1}^{N} i·s_i ) / ( Σ_{i=1}^{N} R^worst_i·s_i - Σ_{i=1}^{N} i·s_i )

where R^worst_i is the rank image i receives in the worst possible result.
The factor 1/(N·N'_R) is left out, as well as 1/maxS. R̃_WRN yields values in [0, 1], with 0 being the ideal and 1 the worst result. This makes it possible to compare the performance of systems across different queries and image sets.
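The worst-result normalization can be sketched as follows; as above, the relevant images are assumed sorted by decreasing similarity, and the worst rank of the i-th most similar image is taken to be N - i + 1, i.e. the most similar image ends up last (names are ours):

```python
def worst_normalized_rank(sim_and_rank, n_images):
    """Distance from the ideal result, divided by the distance of the
    worst possible (reversed) result, yielding a value in [0, 1].

    sim_and_rank: (s_i, R_i) pairs for relevant images, sorted by
                  decreasing similarity s_i
    """
    actual = sum(r * s for (s, r) in sim_and_rank)
    ideal = sum(i * s for i, (s, _) in enumerate(sim_and_rank, start=1))
    # worst result: the i-th most similar image appears at rank N - i + 1
    worst = sum((n_images - i + 1) * s
                for i, (s, _) in enumerate(sim_and_rank, start=1))
    if worst == ideal:
        return 0.0
    return (actual - ideal) / (worst - ideal)
```

The factors 1/(N·N'_R) and 1/maxS cancel between numerator and denominator, so they are omitted.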

Missing Images Problem

NAR requires that every similar image (s_i > 0) in the searched data set or image domain is ranked by the tested CBIRS; in other words, the result list must contain all similar images (recall = 1). Since the CBIRS does not know beforehand which images are similar (unless it is a perfect system), all images must be returned. All search engines which have been tested provide some sort of threshold parameter to adjust the result list length, so in theory this was no problem. In practice, however, a number of systems failed to return all images, probably due to bugs. The result is that some systems produced rather bad results: they are ranked worse than they probably would have been had they returned all images.

To deal with this problem, a number of approaches were tested, but none of them led to a satisfying solution. In order to get comparable measures, the result lists with missing images were manually completed by simply adding those images, giving them the worst possible rank. The worst possible rank is the rank of the last image in a list of all images in the domain, or simply the number of images in the domain. This is a worst-case assumption which punishes systems that leave out similar images rather hard. But any other solution would be based on unwarranted assumptions about how the CBIRS in question would have ranked the missing images, giving it an unfair edge.
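The completion step can be sketched as a small helper (names are ours):

```python
def pad_missing(result_list, relevant_ids, domain_size):
    """Assign each relevant image its rank in the system's result list;
    relevant images the system failed to return get the worst possible
    rank, i.e. the number of images in the domain."""
    ranked = {img: rank for rank, img in enumerate(result_list, start=1)}
    return {img: ranked.get(img, domain_size) for img in relevant_ids}
```

For instance, a relevant image missing from the result list of a 50-image domain is assigned rank 50.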

The bottom line is that some systems produce results which are not really comparable. A solution might be to use an altogether different performance measure than NAR which would allow measuring performance based on result lists of arbitrary length. How much this problem affected the various CBIRSs is explained in subsections 4.2 through 4.6.

3.7 Comparing CBIR Systems

The performance measure introduced in the previous section allows one to measure the quality of single query results. To compare systems to one another, one needs to somehow merge these measurements into a single value representing the overall performance of the system, such that a ranking of the systems can be built. We used two methods, a ranking method and a scoring method.

3.7.1 Ranking

The ranking method is reminiscent of rankings used in sports competitions, where athletes have to perform in different disciplines. The single queries are analogous to the sport disciplines, the systems to the athletes. First, the systems are ranked based on single queries only. Each system then has a rank for each query, or within each discipline. Now, for each system the worst and the best ranking are discarded and an average rank is computed from the rest. This average rank is then used to build an overall ranking of all systems. Systems might receive the same rank in the end, but it is unlikely. Figure 2 shows the resulting rank list.

The method can be slightly modified by simply not discarding the worst and best rankings when computing the average rank. This results in a slightly more distinct ranking (i.e. fewer systems are ranked equally), but essentially it doesn't differ much from the first method.

3.7.2 Scoring

The ranking produces a clear result, but doesn't say anything about the distances between systems. How close, for instance, are the first three systems? We used a scoring method to answer this question. A value in [0, 1] is computed for each system, where 0 means worst and 1 best. To be more precise, 0 represents a hypothetical system that scores lowest on all queries and 1 such a system that scores highest on all queries.

As with the ranking, in the first step, performance is evaluated separately for each query. Each system's performance on a given query, computed through [[??].sub.WRN], is mapped onto the interval [0, 1] using the following simple formula:

sc_i = (w - r_i) / (w - b)

The terms are

r_i: R̃_WRN result of system i for the current query

b: best result for the current query

w: worst result for the current query

sc_i: score of system i for the current query

The resulting scores sc_i show the performance of system i in relation to all other systems for the given query, with the worst system having sc_i = 0 and the best sc_i = 1. This preserves the relative distances from the original measures. Now, the overall score for each system is simply the mean of all per-query scores of this system. Figure 2 shows the scores computed in this way for all systems.
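A sketch of the per-query scoring and overall score, assuming the scores are oriented so that the best system on a query (the one with the lowest R̃_WRN value) receives 1 and the worst 0, as stated above (names are ours):

```python
def scores(per_query_results):
    """per_query_results: {system: [R_WRN value per query]}, where 0 is
    the best and 1 the worst possible R_WRN value. Maps each query onto
    [0, 1] relative to the best and worst system for that query and
    returns each system's mean score over all queries."""
    systems = list(per_query_results)
    n_queries = len(per_query_results[systems[0]])
    totals = {s: 0.0 for s in systems}
    for q in range(n_queries):
        results = {s: per_query_results[s][q] for s in systems}
        b, w = min(results.values()), max(results.values())  # best, worst
        for s in systems:
            totals[s] += 1.0 if w == b else (w - results[s]) / (w - b)
    return {s: totals[s] / n_queries for s in systems}
```

Because each query is rescaled by its own best and worst result, relative distances between systems within a query are preserved in the final mean.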

The scoring method essentially produces the same ordering among the systems as the ranking method, yet there are some noticeable differences. Some systems which are clearly better than other systems within the scoring fall behind those systems within the ranking. The ranking essentially just counts how often a system wins or loses against the other systems, while within the scoring it is also important by how much the system wins or loses. So while one victory is easily outweighed by lots of losses in the ranking, in scoring a clear victory might still outweigh a number of close losses. In other words, within the ranking a system is the better the more often it ranks high on the single queries, and within the scoring a system is the better the greater its distance to the other systems is on the single queries.

4. Results

Figure 2 shows the ranking and scoring for all systems. As can be seen, the VIPER/GIFT CBIRS in its standard configuration (pset b) clearly leads the field, followed by Caliph&Emir, SIMBA and then PictureFinder and Oracle Intermedia. As mentioned in section 3.7.2, one can see that the two comparison methods differ slightly: in the ranked list, for example, Caliph&Emir pset d is behind SIMBA pset e, whereas in the scoring Caliph&Emir pset d clearly has a better score than the SIMBA system.

ImageFinder's performance is rather bad compared with the other systems. The most probable cause for this is that ImageFinder needs to be adjusted to the task at hand; a more detailed discussion is given in section 4.1.


Considering the different parameter sets, one can see a clear tendency for the standard sets (psets b) to outperform the other sets. Only in the scoring do several CBIRSs (Oracle Intermedia, ImageFinder, SIMBA) score better with a different set than the default one, but the difference is small, if not marginal. An explanation for this would be that the default settings already constitute an optimized parameter set for the respective CBIRS. The changes that we applied (with no explicit parameter optimization in mind) to create the other parameter sets most likely produce a less optimal set.

Table 2 shows an additional ranking of the best five systems per domain. This ranking shows that GIFT/VIPER also leads within four out of the six per-domain rankings. Caliph&Emir also shows good performance, scoring second within the three domains Band, Textur and Urlaub. The ranking method employed here is the one which also includes worst and best rankings (as explained in section 3.7.1) (14).

A comparison of all queries, based on the average results over all CBIRSs, showed that the Blumen queries seem to be the hardest ones. The easiest query seems to be the one from the Verfalschung domain, but it exhibits quite a big standard deviation.

4.1 SIMPLIcity and ImageFinder

The SIMPLIcity CBIRS implements some interesting technologies, namely region-based feature extraction, the Integrated Region Match measure and the classification of images in order to improve search performance. With the setting pset c (15) in the Blumen domain, it is the best system, showing high potential. Unfortunately, for most of the other domains and psets the results were distorted by missing images. Therefore we decided to exclude SIMPLIcity from the overall comparison and to include it only in the per-domain ranking (figure 2).

ImageFinder provides a kind of basic image search/recognition/matching platform which a user modifies and tweaks to her specific needs. That is, compared to the others, ImageFinder is not solely meant for CBIR. Its vast number of parameters needs to be optimized for CBIR, which was out of this work's scope. Therefore the overall results for ImageFinder are rather poor. However, for the hardest of all queries, Blumen-10, ImageFinder produced a good result (6th place out of 20), which shows its potential (16).

4.2 SIMBA

SIMBA uses invariant feature histograms, which are invariant against rotations and translations of parts of the image, e.g. objects. The feature extraction is based on kernel functions. The size of the kernel influences the size of objects SIMBA "recognizes" in images, i.e., with a big kernel, features are invariant only against rotation and translation of bigger objects [Sig02, page 24].

For SIMBA we created three psets besides the default set. Psets c and d were first tests and turned out less meaningful. Pset e uses a bigger kernel function than the default set. Specifically, the kernel (40, 0), (0, 80) was used for all color channels.

The results show that SIMBA performs well and is third in our benchmark. It performs poorly within the Blumen domain (its best pset is e, with rank 14; see table 2) and very well within the domain Textur (SIMBA pset b). From figure 3 one recognizes that the bigger kernel of pset e results only in minor changes in the domain Band with respect to the default set.

4.3 PictureFinder

PictureFinder models images with sets of polygons of different sizes, colors and textures. Basically, the polygons represent prominent forms in the images. PictureFinder can thus be seen as shape focussed, which might explain the good results in the Blumen and Band domains.

We modified one parameter, which determines whether the position of a polygon influences the similarity assessment or not. The default setting (pset b) was that the polygon position is important: roughly speaking, two images are more similar if they contain similar prominent forms in similar positions. Accordingly, in pset c the position was not important: only size, color and texture of polygons determined similarity.

Regarding the position as important (pset b) shows generally better results than not doing so (pset c). An exception is an outlier in the domain Urlaub. The reason might be that for this query there exist lots of similar images (according to the GT) with similar prominent forms in different positions. Pset c was generally worse: we suspect that too many less similar images (according to the GT), which however contain similar forms in other positions, are seen as similar to the query by PictureFinder.

Missing images in the domain Textur distorted these results somewhat, causing the outlier.


4.4 Caliph&Emir

Caliph&Emir is a combination of two applications for managing image data. Caliph can be used to annotate the images, and Emir is the corresponding search engine. A strong feature of Caliph is its representation of annotations as semantic graphs, which greatly enhances the search. For our benchmark however, we focussed on Emir's visual similarity search. To use Emir as CBIR system, relevant features in form of MPEG-7 descriptors have to be extracted and stored as MPEG-7 documents by Caliph.

Caliph&Emir offers the three descriptors ColorLayout, EdgeHistogram and ScalableColor to choose from when performing a search. Unfortunately, ScalableColor didn't seem to work (the search didn't seem to start with this descriptor), so only ColorLayout and EdgeHistogram were used. The parameter settings were as follows: the default setting b used ColorLayout, pset c used only EdgeHistogram, and pset d combined the two descriptors for the search.

The Caliph&Emir CBIRS is the second best system in this test behind VIPER/GIFT, as shown by the ranking and scoring (see figure 2). There is one outlier with pset c in the Verfalschung domain caused by missing images, but otherwise the results are rather balanced.

It turned out that ColorLayout in general is clearly better than EdgeHistogram or the combination of the two descriptors, see figure 5. There are, however, a few queries where EdgeHistogram beats ColorLayout, for example the query Blumen-10, shown in figure 6.


4.5 VIPER/GIFT

The design of the VIPER search mechanism, which is now part of the GIFT architecture as a plugin, borrows much from text retrieval. It uses an inverted index of up to 80,000 features, which function like terms in text documents. For each image, roughly 1,000 features are extracted.

The numerous features can be grouped into color features and texture features. Each group in turn consists of histogram features and block (local) features. The color features are based on the HSV color space, which is quantized into bins; from these, the color histogram features are extracted. Unlike other CBIRSs, in VIPER each bin of a histogram counts as a feature. As local color features, the mode colors (17) of the small blocks into which the images are split are used. The texture features are based on Gabor filters (18). The filters yield local texture features, which are also based on small blocks of the image. Additionally, histograms of the average filter outputs are stored as a measure of an image's overall texture qualities.
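VIPER's exact quantization and feature set are described in [Mul02]. As an illustration only, the following Python sketch shows the general idea of treating each occupied HSV histogram bin as an index "term"; the bin counts and function names here are our own assumptions, not VIPER's actual parameters:

```python
from collections import Counter

def quantize_hsv(pixel, h_bins=18, s_bins=3, v_bins=3):
    """Map an HSV pixel (h in [0, 360), s and v in [0, 1]) to a bin index."""
    h, s, v = pixel
    hi = min(int(h / 360.0 * h_bins), h_bins - 1)
    si = min(int(s * s_bins), s_bins - 1)
    vi = min(int(v * v_bins), v_bins - 1)
    return (hi * s_bins + si) * v_bins + vi

def color_histogram_features(pixels):
    """Each occupied histogram bin becomes one weighted 'term',
    mirroring how VIPER treats histogram bins as index terms."""
    counts = Counter(quantize_hsv(p) for p in pixels)
    total = sum(counts.values())
    return {f"hist_bin_{b}": n / total for b, n in counts.items()}
```

An image is thus represented as a sparse term-frequency vector, which is exactly the representation an inverted index expects.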



The similarity measure in VIPER resembles methods used in text retrieval. Two methods are used for feature weighting and the computation of a similarity score: (1) classical IDF, the classical method from text retrieval, which weights and normalizes all features together; (2) separate normalisation, which weights and normalizes the different feature groups (color histogram, color blocks, Gabor histogram, Gabor blocks) separately and merges the results.
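To make the analogy to text retrieval concrete, here is a minimal Python sketch of the two weighting schemes as we understand them. The exact formulas used by VIPER are given in [Mul02]; the function names and the normalization details below are our simplifications, not the system's actual implementation:

```python
import math

def idf_weights(index, n_images):
    """index maps each feature to the set of image ids containing it."""
    return {f: math.log(n_images / len(imgs)) for f, imgs in index.items()}

def score_classical_idf(query_feats, image_feats, idf):
    # (1) Classical IDF: weight all features together, text-retrieval style.
    return sum(idf.get(f, 0.0) * qv * image_feats.get(f, 0.0)
               for f, qv in query_feats.items())

def score_separate_norm(query_feats, image_feats, idf, groups):
    # (2) Separate normalisation: score each feature group on its own,
    # normalize the partial score, then merge -- so a group with many
    # features (e.g. Gabor blocks) cannot dominate the total.
    total = 0.0
    for feats in groups.values():
        part = sum(idf.get(f, 0.0) * query_feats.get(f, 0.0)
                   * image_feats.get(f, 0.0) for f in feats)
        norm = sum(idf.get(f, 0.0) * query_feats.get(f, 0.0) ** 2
                   for f in feats) or 1.0
        total += part / norm
    return total / len(groups)
```

The per-group normalization is the crucial difference: under classical IDF, a group with far more features contributes proportionally more raw weight, which matches the imbalance reported in [Mul02].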

VIPER/GIFT ranks and scores best of all systems. Furthermore, it ranks best in 4 of the 6 separate domain rankings. This clearly marks VIPER/GIFT as the best CBIRS in this benchmark. Note that, except for the Blumen domain, it is always VIPER/GIFT's default setting (pset b) that achieves the best results.

The parameter sets are as follows: pset b uses separate normalization and all feature sets; pset c is like b, but with classical IDF instead of separate normalization; pset d uses separate normalization with color blocks and histogram only (color focussed); pset e uses separate normalization with Gabor blocks and histogram only (texture focussed).

Unfortunately, this CBIRS is also affected by the missing-images problem. The outlier with pset d in the domain Textur is caused by it, see figure 7 (6th bar from the right). Apart from this, the results are not influenced by the problem.

Another outlier can be seen in the texture domain Textur with pset e (texture features only). The reason could be that the texture images of this particular query also have strong overall color features, which pset e ignores. Interestingly, for both queries in the domain Textur, the color-centered pset d seems to outperform the texture-centered pset e (19). This trend is even stronger in the overall result: color features and the combination of all features are better than texture features alone, see figure 7.

Our benchmark confirms a result from [Mul02], namely that the separate-normalisation weighting (pset b) is better than classical IDF (pset c). Classical IDF gives the texture feature groups (i.e., Gabor histogram and blocks) too much weight due to their much higher numbers of features (according to [Mul02, page 50]). We conclude that VIPER/GIFT performs best employing separate normalization and color features (psets b and d). Emphasizing texture too much (directly in pset e, indirectly through classical IDF in pset c) degrades its performance.

4.6 Oracle Intermedia

Intermedia is an extension to Oracle DB, providing services and data types for handling media data (sounds, images, videos, etc.). Specifically, images and corresponding signatures can be stored as objects in a database. A score function compares images by calculating distances between them based on their signatures. This can be used to write SQL queries with example images that produce a ranked list of similar images. Four parameters govern the image search: color, texture, shape and location. The score is computed as a Manhattan distance: a weighted sum of the single distances for color, texture and shape. It remained unclear to us how location enters the score. The weights are normalized so that their sum equals 1.
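Assuming the per-attribute distances between two signatures have already been computed, the scoring step can be sketched as follows. This is a minimal illustration of a normalized weighted Manhattan sum, not Oracle's actual implementation:

```python
def intermedia_score(distances, weights):
    """Weighted Manhattan-style score between two image signatures.

    distances: per-attribute distances (color, texture, shape, location)
               between the two signatures.
    weights:   user-chosen attribute weights; normalized here so that
               they sum to 1, as Intermedia does.
    A lower score means the images are more similar.
    """
    total_w = sum(weights.values())
    if total_w <= 0:
        raise ValueError("at least one weight must be positive")
    return sum((weights[a] / total_w) * distances[a] for a in weights)
```

With pset b's weights, for example, color differences dominate the score while shape and location contribute only marginally, which matches the color-centered results reported below.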

The Oracle Intermedia CBIRS shows very good results, with the exception of one big outlier. The parameter sets b to e are based on the four parameters and are defined as follows: pset b: (color = 0.6, texture = 0.2, shape = 0.1, location = 0.1); pset c: (color = 0.5, texture = 0.5, shape = 0, location = 0); pset d: (shape = 1.0, all others 0); pset e: (color = 0.33, texture = 0.33, shape = 0.34, location = 0).


The mentioned outlier is easily seen in the Urlaub domain. Its cause remained unclear to us: the query (urlaub-Image15) doesn't seem special, and none of the other CBIR systems shows similar behaviour. The outlier is less dominant in psets d and e, but still clearly visible. These parameter sets put more weight on shape search, which seems to improve performance on this query but degrade most of the others.

As mentioned, the results are otherwise very good, especially in the Blumen domain, where Oracle Intermedia (pset b) ranks second behind SIMPLIcity. It would be interesting to see how the outlier could be remedied and how this would influence Oracle Intermedia's ranking and scoring.

Focussing on shape in pset d degrades performance for practically all queries. Also, weighting color, texture and shape equally is not as good as concentrating on color (pset b) or on color and texture (pset c); see the ranking and scoring in figure 2. The shape-only pset d is actually much worse than the other three sets, which could mean that the shape parameter is only useful in special cases.

5. Summary and outlook

We have benchmarked seven CBIR systems: SIMBA, SIMPLIcity, PictureFinder, ImageFinder, Caliph&Emir, VIPER/GIFT and Oracle Intermedia. Different parameter settings of the CBIRSs were created by modifying interesting parameters.


The ground truth for the benchmarks was determined through an online survey. The image database was built from freely usable images taken from various sources. The images were grouped roughly by similarity, forming image domains.

The systems were benchmarked by running a number of defined queries using the query-by-example approach, and were then compared to each other based on their visual retrieval accuracy. The results of the runs were measured using the custom accuracy measure [Rank.sub.WRN], which we derived from the Normalized Average Rank [MMS+01b]. Based on these measurements the systems were ranked and scored.
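The base measure, the Normalized Average Rank of [MMS+01b], can be computed as below; the derivation of [Rank.sub.WRN] itself is not reproduced in this article, so only the underlying measure is shown:

```python
def normalized_average_rank(ranks, n_collection):
    """Normalized Average Rank from Mueller et al. [MMS+01b].

    ranks:        1-based ranks at which the relevant images were
                  retrieved for one query.
    n_collection: total number of images in the collection.
    Returns 0.0 for a perfect result (all relevant images ranked
    first) and approaches 1.0 for the worst possible result.
    """
    n_rel = len(ranks)
    # Subtract the ideal rank sum 1 + 2 + ... + n_rel, then normalize
    # by collection size and number of relevant images.
    return (sum(ranks) - n_rel * (n_rel + 1) / 2) / (n_collection * n_rel)
```

Because the measure is normalized by collection size, scores remain comparable across queries with different numbers of relevant images.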

Results of the Benchmarks

VIPER/GIFT (with its default parameter setting) was found to be the best system in our tests. It ranks and scores best in the overall comparison and is also first in many of the domain-specific rankings. Caliph&Emir and SIMBA are second and third, respectively. We found that some systems have weak spots, for example Oracle Intermedia.

Color features seem to work best for most systems; other features seem to be more domain-specific. Different parameter settings produce significant differences, but mostly do not result in big jumps in overall performance. The very good performance of Caliph&Emir shows that MPEG-7 descriptors are a good choice for CBIR, especially the ColorLayout descriptor. Changes to the default parameters tend to worsen the results, most likely because the defaults are already optimized.

Two significant problems were encountered. First, ImageFinder's parameters obviously need to be optimized for specific domains. Second, SIMPLIcity kept crashing during the benchmark runs, resulting in missing images, which distorted the results. We therefore decided not to include SIMPLIcity in the overall ranking and scoring.


Clearly the trends go towards semantic search, for example by using automatic annotation techniques, and towards Web engine capabilities (addressing scaling problems). It is also agreed that relying on standards for interoperability is important. These seem to be among the most important challenges in content-based image retrieval right now: closing the semantic gap (at least a little) and agreeing on standards, for benchmarks as well as for the CBIR systems themselves. Interesting future directions would be to combine our benchmark with text retrieval benchmarks and to measure system performance in terms of memory and CPU usage.

6. Acknowledgements

We would like to thank the authors and developers of SIMBA, SIMPLIcity, PictureFinder, ImageFinder, Caliph&Emir, VIPER/GIFT and Oracle Intermedia for their support and many helpful thoughts and suggestions. Specifically, we would like to thank Gerd Brunner and Hans Burkhardt from the University of Freiburg for granting access to a server running SIMBA, Prof. James Z. Wang for providing SIMPLIcity, Thorsten Hermes from TZI in Bremen for providing PictureFinder, the staff at Attrasoft for providing a copy of ImageFinder and Wolfgang Muller from the University of Bamberg for his helpful suggestions regarding VIPER/GIFT. Furthermore we would like to thank Markus Koskela and Jorma Laaksonen for their help and support with PicSOM, and Gianluigi Ciocca and Raimondo Schettini for their efforts and help with Quicklook (2).

Received: 11 July 2009; Revised 13 August 2009; Accepted 19 August 2009


[1] Attrasoft, company web page., May 22 2007.

[2] Benchathlon website.

[3] Cord, M., Gosselin, P.-H., Philipp-Foliguet, S. (2007). Stochastic exploration and active learning for image retrieval. Image and Vision Computing, 25, 14-23.

[4] Gianluigi Ciocca, Isabella Gagliardi, Raimondo Schettini. (2001). Quicklook2: An integrated multimedia system. Journal of Visual Languages & Computing, 12(1) 81-103.

[5] Chad Carson, Megan Thomas, Serge Belongie, Joseph M. Hellerstein, Jitendra Malik. (1999). Blobworld: A system for region-based image indexing and retrieval. In: Third International Conference on Visual Information Systems. Springer.

[6] Ritendra Datta, Dhiraj Joshi, Jia Li, James Z. Wang. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2) 60.

[7] Mario Doller, Harald Kosch. (2008). The MPEG-7 multimedia database system (MPEG-7 MMDB). Journal of Systems and Software, 81 (9) 1559-1580.

[8] Thomas Deselaers, Daniel Keysers, Hermann Ney. (2004). FIRE--flexible image retrieval engine: ImageCLEF 2004 evaluation. In: Working Notes of the CLEF Workshop, Bath, UK, September 2004, 535-544.

[9] Fournier, J., Cord, M., Philipp-Foliguet. S. (2001). Retin: A content-based image indexing and retrieval system. Pattern Analysis and Applications Journal, Special issue on image indexation, 4(2/3) 153-173.

[10] Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, Peter Yanker.(2001). Query by image and video content: the QBIC system, 255-264.

[11] Gabbouj,M., Kiranyaz, S.(2004). Audio-visual content-based multimedia indexing and retrieval--the MUVIS framework. In Proceedings of the 6th International Conference on Digital Signal Processing and its Applications, (DSPA), Moscow, Russia, March 31 - April 2 2004, 300-306.

[12] Hermes, Th., Miene, A., Herzog,O. (2005). Graphical search for images by PictureFinder. Journal of Multimedia Tools Applications, 27 (2) 229-250.

[13] Imageclef., November 2005.

[14] Markus Koskela, Jorma Laaksonen, Sami Laakso, Erkki Oja. (2000). The picsom retrieval system: Description and evaluations. In: Proceedings of Challenge of Image Retrieval (CIR 2000). Brighton, UK. P.O. BOX 5400, Fin-02015 HUT, Finland, May 2000. Laboratory of Computer and Information Science, Helsinki University of Technology.

[15] Mathias Lux, Jutta Becker, Harald Krottmaier.(2003). Caliph & emir: Semantic annotation and retrieval in personal digital photo libraries. In: Proceedings of CAiSE '03 Forum at 15th Conference on Advanced Information Systems Engineering, Velden, Austria, 16-20 June 2003, 85-89.

[16] Ltu technologies., August 2007.

[17] Jia Li., James Z. Wang., Gio Wiederhold. (2000). IRM: Integrated region matching for image retrieval. In Proc. ACM Multimedia, Los Angeles, CA, October 2000, 147-156.

[18] Henning Muller, Wolfgang Muller, David McG. Squire. (2001). Automated benchmarking in content-based image retrieval. In: IEEE International Conference on Multimedia & Expo, 2001, 209.

[20] Henning Muller, Wolfgang Muller, David McG. Squire, Stephane Marchand-Maillet, Thierry Pun. (2001). Performance evaluation in content-based image retrieval: overview and proposals. Pattern Recogn. Lett., 22 (5) 593-601.

[21] Stephane Marchand-Maillet, MarcelWorring.(2006). Benchmarking image and video retrieval: an overview. In: MIR '06: Proceedings of the 8th ACM international workshop on Multimedia Information Retrieval, New York, NY, USA, 2006, 297-300.

[22] Henning Muller. (2002). User Interaction and Performance Evaluation in Content-Based Visual Information Retrieval. PhD thesis, l'Universite de Geneve, 2002.

[23] Oracle Intermedia. B14302_01/toc.htm, August 2007.

[24] Till Quack, Ullrich Monich, Lars Thiele, B.S., Manjunath. (2004). Cortina: a system for large-scale, content-based web image retrieval. In: MULTIMEDIA '04: Proceedings of the 12th annual ACM international conference on Multimedia, New York, NY, USA, 2004, 508-511.

[25] Sven Siggelkow. (2002). Feature Histograms for Content-Based Image Retrieval. PhD thesis, Albert-Ludwigs-Universitat Freiburg im Breisgau, 2002.

[26] Remco C. Veltkamp., Mirla Tanase. (2002). Content-based image retrieval systems: A survey. Technical report, Department of Computing Science, Utrecht University, 2002.

[27] Jules Vleugels, Remco C. Veltkamp. (1999). Efficient image retrieval through vantage objects. In: Visual Information and Information Systems, 575-584.

[28] James Z. Wang, Jia Li., Gio Wiederhold. (2001). Simplicity: Semantics-sensitive integrated matching for picture libraries. In: IEEE Trans. on Pattern Analysis and Machine Intelligence, 23, 947-963.

[29] Yavlinsky, A., Heesch, D., Ruger. S. (2006). A large scale system for searching and browsing images from the world wide web. In: Proceedings of the International Conference on Image and Video Retrieval, Lecture Notes in Computer Science 3789, Springer, 2006, 530-534.

(1) Contact by email:

(2) Keyword search in Oracle Intermedia depends on the design of the database being used to store the images. Generally, multimedia data can be arbitrarily combined with e.g. text data and searched with corresponding SQL queries.

(3) Whether a feature in SIMBA is actually a color or texture feature depends on the kernel function used.

(4) The authors of PicSOM and Quicklook2 both offered to run the tests within their environments and send back the results. FIRE is freely available as open source software.





(9) http://www.cannonbeach/

(10) For the licence see the web page at

(11) A spread sheet has originally been used to describe the parameter sets, and since the first column 'a' was taken all sets start with 'b'.

(12) The motivation behind the 5 similarity values is simply the assumption that human judges with no prior knowledge of CBIR or familiarity with the images can fairly easily differentiate 5 (or maybe 6 or 7) similarity levels, but not more. This bears some similarity to a Likert scale, although it was not the aim to implement one.

(13) Since the ideal result is simply the list with all relevant images at the beginning, thus having ranks from 1 to NR.

(14) As mentioned in section 3.7.1 this method doesn't really differ much from the other ranking method. The reason for applying this method is basically that some domains only have two queries to begin with and thus no worst and best per-query ranking can be left out.

(15) For pset c of SIMPLIcity, the program classify was called with parameter -f 5.

(16) The result was produced with ImageFinder pset c: an unsupervised filter was used, and images were processed with the Laplace5 filter.

(17) The most frequent color within the block.

(18) These are often used to describe the neuron layers in the visual cortex which react to orientation and frequency.

(19) Assuming that the outlier result for pset d, caused by missing images, would be significantly better without the missing-images problem.

Harald Kosch

Chair of Distributed Information Systems

University Passau


Paul Maier

Chair for Image Understanding and Knowledge-Based Systems

Technical University Munich

Table 1. Overview of all live CBIR systems. For each CBIRS
(rows), its features are listed (columns). A star ('*') in a
column marks the support for a feature and a question mark ('?')
if the support is unknown. If a system doesn't support the
feature, the table entry is left empty

System reference texture color shape keywords

Behold [YHR06] * * *
Caliph&Emir [LBK03] * * *
Cortina [QMTM04] * * *
FIRE [DKN04] * * *
ImageFinder [att07] ? ? ?
LTU ImageSeeker [LTU07] * * * ?
MUVIS [GK04] * *
Oracle Intermedia [oim07] * * *
PictureFinder [HMH05] * * * *
PicSOM [KLLO00] * * *
QBIC [FSN+01] * * * *
Quicklook [CGS01] * * * *
RETIN [CGPF07, * *
SIMBA [Sig02] *
SIMPLIcity [LWW00, * * *
SMURF [VV99] * *
VIPER/GIFT [Mul02] * *

 relevance MPEG-7 designed
System feedback support for web classification

Caliph&Emir *
Cortina * * * *
ImageFinder ?
LTU ImageSeeker ? *
Oracle Intermedia
PictureFinder *
PicSOM * *
Quicklook *

SIMPLIcity *


System regions

Behold *
ImageFinder *
LTU ImageSeeker *
Oracle Intermedia
PictureFinder *
PicSOM *

SIMPLIcity *


Table 2. The five best systems per domain

Blumen:
1. SIMPLIcity pset c
2. Oracle Intermedia pset b, PictureFinder pset b
3. Oracle Intermedia pset c
4. Oracle Intermedia pset e
5. GIFT pset d

Urlaub:
1. GIFT pset b, GIFT pset e
2. Caliph&Emir pset d
3. SIMPLIcity pset b
4. GIFT pset d, SIMPLIcity pset c
5. Caliph&Emir pset c

Kirschen:
1. GIFT pset b
2. SIMBA pset c
3. SIMPLIcity pset b
4. GIFT pset d, SIMBA pset b, SIMBA pset d
5. SIMPLIcity pset d

Band:
1. GIFT pset d
2. Caliph&Emir pset b
3. PictureFinder pset b
4. GIFT pset b
5. SIMBA pset e

Textur:
1. SIMBA pset b, SIMBA pset c
2. Caliph&Emir pset b, SIMBA pset e
3. SIMBA pset d
4. Oracle Intermedia pset e
5. Oracle Intermedia pset b

Verfalschung:
1. GIFT pset b
2. SIMPLIcity pset c
3. GIFT pset d
4. Oracle Intermedia pset b
5. Caliph&Emir pset b, SIMBA pset b, SIMBA pset c
COPYRIGHT 2010 Digital Information Research Foundation

Author:Kosch, Harald; Maier, Paul
Publication:Journal of Digital Information Management
Article Type:Report
Date:Feb 1, 2010