Printer Friendly

Robust method for codebook generation using embedding algorithm for large image collections.


Today with the development of multimedia systems and the decline of writing, we use more and more video content as a communication medium in different areas. Indeed, the image and digital video are part of such systems by the density and richness of content. The same image can have several meanings at different levels: analysis, description, recognition and interpretation. In this framework, information research covers the treatment of digital documents involving the structure, analysis, storage and organization of data. In the past, the term was seeking information related to the concept of textual information. Currently information search is associated with any kind of information, text, visual or others. However, due to limitations of textual methods, methods development based on the visual content has become essential. This explains the intense research activity devoted to CBIR (content based image retrieval) system in recent years. The CBIR is often faced with the relevance of the research problem and search time.

The objective of any CBIR system is to satisfy the request of a user by the relevance of the results. Such as access to a document via its pure semantics is impossible, traditional CBIR systems use a low-level representation paradigm for the image content, color, texture, shape, etc.... and others by a combination of them. Image search is done well by comparing descriptors.

The analysis and content representation of data used as feature vector. The information obtained in this step is a kind of summary of the database images (region segmentation, color, texture, spatial relationships, ...). The transformation is generally time-consuming calculations.

Methods providing a compromise between the computational complexity and quality of results are methods based on bag of visual words model introduced by [1]. Because of its effectiveness in the classification domain and image annotation, the bag of visual words model has become popular in recent years [1][2][3][4] [5][17][18]. The essential aspect of it is to extract global image descriptors and represent images as a collection of local properties calculated from a set of small sub-images called patches.

To apply the bag of words model in the visual domain, it's necessary to create a vocabulary of visual words. This work is performed by a quantification of regions descriptor values. In general, the quantization uses a clustering algorithm (eg, K-means [6]) that takes as input a sample descriptions and creates from this sample sets (clusters), each centroid is represent a visual word, all the visual words is called visual vocabulary (codebook).

After the allocation of visual words regions, images are represented by a collections (bags) of visual words, and construct a histograms whose dimension is equal to the visual vocabulary size (number of clusters), each bin contains frequency of the corresponding word in the visual image or in a part of the image.

However this approach has several drawbacks. One of them is based on the computational complexity of the dictionary construction. For large databases, the dictionary is constructed from a sample of the base to remove this limitation. However, it is difficult to estimate the correct sample size to be considered to ensure good performance. This depends on the data performance (structure, size and cluster density, dimension) and therefore requires making strong hypothesis, and mostly limiting, on clusters or their distribution. In addition, the performance of clustering methods degrade when the size of data increases, limiting the quality of the extracted dictionary [7], [8], [9].

In this paper we focus our effort to improving this drawback that concerned the time construction of vocabulary by employing the embedding technique that is introduced in [10] by Indck et al. 2003. We only give a brief review on it in the next section. The method proposed in this paper avoids the computational costs associated with distance computation. In the same cadre several Embedding algorithm are proposed in the literature [11], [12].

The rest of the paper is organized as follows. Section 2 Describe the embedding technique. Section 3 Present our contribution. Section 4 evaluates the construction time and the result search quality of our proposed approach experimentally. Finally, Section 5 concludes the paper.


In the majority of retrieval system, either dedicated to concept detection, scene categorization, classification or similarity retrieval, a very common approach exploits the bag-of-words paradigm, in which the centroids (vocabulary) of a clustered set of training descriptors is used to count the occurrences of descriptors belonging to every class within the objects of interest [21]. Although to count the occurrences for each descriptor is realized by searching its nearest cluster centroid. Hence, various unsupervised clustering methods have been applied to this particular problem, whose most popular k-means clustering [6]. However, this method is quite effective, its time and space complexity is extremely high, making it even unpractical for handling a large set of training samples. To improve the computational efficiency, a number of methods [20], [22], [23], [24], [25] have been designed to efficiently construct visual vocabularies. Other methods that are proposed for the same problem we find mean-shift [19], tree-based coding (e.g., tree induced by hierarchical k-means) [20]. However, most of all these methods construct visual vocabularies by quantizing the training samples in their original feature space, which need to perform calculation of distances between the original descriptors, and thus, lengthens the construction time. To improve the construction time, we construct a new visual vocabulary (EBVW) by embedding all the original features in the other space avoiding the computation distances between them.


The goal of our proposed method is to quickly build visual vocabulary by using the embedding method. Indeed, the use of this method to dive descriptors vectors that represent the image databases to an other space. This method allows a thin projection. The decisive advantage of embedding is to allow a rapid quantification descriptor vectors. More generally, this method interested analyzing data from any complex system, which requires a space of multidimensional representation. Figure 1 show an example of embedding algorithm to embed a set of 2-dimensional points.

For example, we take the following 2-dimensional points an embedded:

A(0, 0), B(1.5, 1.3), C(2.2, 3.5), D(4, 2.9), E(0.6, 3.7), F(2.3, 4), G(1.7, 1.1), H(2.1, 2.5), I(0.5, 2.1) and J(0.1, 2.8).

The diameter values for theses 2-dimensional points are 4 for each dimension. If we divide each diameter in four components, we have the following schema.

The grids that are not containing any point are ignored. The average points for the other grids represent the vocabulary of 2-dimensional points. For this example (one level) the 2-dimensional points that constitute this vocabulary are:

V1 = A (0,0);

V2 = E (0.6, 3.7)

V3 = Average (B(1.5, 1.3), G(1.7, 1.1)) = (1.6, 1.2)

V4 = Average (I(0.5, 2.1), J(0.1, 2.9)) = (0.3, 2)

V5 = H(2.1, 2.5):

V6 = Average (C(2.2, 3.5), D(4, 2.9)) = (3.1, 3.2)

V7 = F(2.3, 3.4):

Exploitation of the embedding method allows the immediate construction of visual vocabularies of a database without to compare descriptor vectors of database with them, significantly reducing construction time. Figure 2 presents the different steps of our approach for visual vocabulary generation.

In the following paragraph we present the construction algorithm of visual vocabulary using the embedding method.

Algorithm Codebook Generation Using Embedding Algorithm

Input: Image signatures of database S = {s1, ..., sn}

Output: Set of visual words V = {v1, ***, vn}

1. Calculate the diameters of each dimension of all the image signatures D = {[d.sub.1], ..., [d.sub.p]}

2. Initialization of partitioning parameters for each dimension (a few segments for each dimension. A value of d (dimension) higher is highly recommended for a more accurate data separation).

3. The vocabulary consists of all the centers of element grids. Note that the grids that are empty are ignored. In addition, the use of this method is simple and fast because there is no calculation between image signatures.

4. Description of each image following the product visual vocabulary. This description is performed by assignment of each local image descriptor to the nearest visual word from the vocabulary and increment it by one.

It is important to note that the algorithmic complexity of this algorithm is O(N logd(P)), where d (base of the logarithm) is the segment number of each dimension, P is feature dimension, and N is total number of feature databases.


The purpose of this section is to evaluate our approach for image description by content. The evaluation cover both the quality of results returned to the user and the visual vocabulary construction time.

4.1 Experimental Environment

The programs used in our tests are implemented in Java. Performance evaluation is made on a quad core of an Intel processor running at 1.6 GHz PC with 4 GB memory, Windows 7 operating system and 400 GB of local disk. Two parameters are used to measure the performance of our method, the construction time and the quality of the returned results. BOF construction times are measured using the getTimeInMillis () function of the Calendar class.

4.2 Data description

The image description methods are evaluated on two types of descriptors. The first is the SURF descriptor shape descriptor. SURF [13] descriptors extracted from each image are the basis vectors of dimension 64. The second is a color descriptor represented by the signatures introduced in [14]. We have adjusted a few parameters such as the dimensions of the signatures are smaller than 16. These evaluations are performed on three image databases:

The First one Coil-100 [15] image database of Columbia University, which contains 100 categories, each category contains 72 images. One image per category is presented in Figure 3.

The second one Oxford Flowers database is available online [16]. This dataset consists of 1360 labeled images of 17 class's, with 80 images per class. Sample image categories are presented in Figure 4.

The third one is a synthetic descriptor databases, whose different sizes, and where descriptors have been created randomly, and distributed uniformly in the range [01] with several dimensions.

The first two databases are suitable for accuracy evaluation while the last one has been used to study the response time of the proposed approach.

4.3 Vocabulary construction time

To show the difference in construction time between our approach and the one that is built by Hierarchical KMeans (BoF (HKmeans)), we have present in separate figures the construction time to build different sizes of visual vocabularies. The results are shown in Figure 5. We recall that The algorithmic complexity of a single k-means iteration in HKMeans is O(N logK) where the base of the logarithm is the cluster number of each node in the vocabulary tree.

In these tests, we estimate the construction time compared of our method to BoF (HKmeans). We note that the construction time of these descriptors is growing faster depending on the number of vectors in the database, and the dimension of the vector descriptors and the performances of our approach beyond that of the BoF (HKmeans). The gain in construction time becomes more and more interesting when the number of vectors and the dimensions of vector descriptors believe.

4.4 Quality of similarity search

We propose in this section to evaluate research performance by the content based on our method (EBVW). As seen in Figure.6, the accuracies produced by EBVW are superior to BoF (HKMeans) for different descriptors using for building a visual vocabularies and different image databases. We note that, we adapt the embedding parameters to have the same size to the BoF(HK-Means). The original descriptors used for vocabulary generations are SURF and Color signatures.

For these experiments, the interrogation protocol of three bases is the same. Thus, for each base, 200 images drawn at random are considered query and then we study the returned responses.

Curves precision/ recall are summarized in figure 6 shows for each original descriptor category that our approach improves more the accuracy of image search by content. This result is another favorable point of our approach.


This paper describes a new approach to building visual vocabulary. The method bypasses the computational effort associated with computing distances between feature vectors. Advantage more, the search quality result using our method is performed. The result in terms of response time shows that our method is effective in large and scalability. We also evaluated our method to the case of image retrieval by content. Our results are promising.


[1.] SIVIC J., ZISSERMAN A.: Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV (2003), vol. 2, pp. 1470-1477.

[2.] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluation of color descriptors for object and scene recognition. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 0:1-8, 2008.

[3.] Y. G. Jiang, C. W. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR '07: Proceedings of the 6th ACM international conference on Image and video retrieval, pages 494-501. ACM, 2007.

[4.] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 2169-2178. IEEE Computer Society, October 2006.

[5.] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. In CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)--Volume 2, volume 2, pages 524-531 vol. 2, Washington, DC, USA, June 2005. IEEE Computer Society.

[6.] J. B. Macqueen. Some methods for classification and analysis of ultivariate observations. In Procedings of the Fifth Berkeley Symposium on Math, Statistics, and Probability, volume 1, pages 281-297. University of California Press, 1967

[7.] G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing). The MIT Press, 2006.

[8.] E. Valle, M. Cord, and S. Philipp-Foliguet. High-Dimensional Descriptor Indexing for Large Multimedia Databases. In CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, page 739-748, Napa Valley--CA, USA, 2008. ACM.

[9.] Eric Bruno and Stephane Marchand-Maillet. Multimodal preference aggregation for multimedia information retrieval. Journal of Multimedia, 4(5):321-329, 2009.

[10.] Indyk P. and N. Thaper. Fast image retrieval via embeddings. In 3rd International Workshop on Statistical and Computational Theories of Vision (at ICCV), 2003.

[11.] H. Jegou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in Proc. Eur. Conf. Comput. Vision, vol. 5302. 2008, pp. 304-317.

[12.] H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117-128, Jan. 2011.

[13.] Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool, "SURF: Speeded Up Robust Features", Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346-359, 2008.

[14.] Y. Rubner and C. Tomasi. Perceptual metrics for image database navigation. PhD thesis, Stanford University, 2001.

[15.] Nene S.A., Nayar S.K. & Murase H. (1996)Columbia object image library (coil-100). Technical Report CUCS-006-96.

[16.] M. Nilsback and A. Zisserman. 17 Category Flower Dataset., 2006.

[17.] E. Nowak, F. June, and B. Triggs, "Sampling strategies for bag-offeatures image classification," in Proc. Eur. Conf. Comput. Vision, vol. 3954. 2006, pp. 490-503.

[18.] S. Wei, Y. Zhao, C. Zhu, C. Xu, and Z. Zhu, "Frame fusion for video copy detection," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 1, pp. 15-28, Jan. 2011.

[19.] Jurie, F., Triggs, B.: Creating efficient codebooks for visual recognition. In: ICCV. (2005) 604-610

[20.] Nist_er, D., Stewfienius, H.: Scalable recognition with a vocabulary tree. In: CVPR (2). (2006) 2161-2168

[21.] C. R. Dance, G. Csurka, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in ECCV Workshop on Statistical Learning in Computer Vision, 2004, pp. 1-22.

[22.] F. Moosmann, W. Triggs, and F. Jurie, "Randomized clustering forests for building fast and discriminative visual vocabularies," in Proc. Neural Inform. Process. Syst. Conf., Nov. 2006, pp.1-7.

[23.] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., vol. 3613. Jun. 2007, pp. 1575-1589.

[24.] T. Tuytelaars and C. Schmid, "Vector quantizing feature space with a regular lattice," in Proc. IEEE Int. Conf. Comput. Vision, Oct. 2007, pp. 1-8.

[25.] Y. Mu, J. Sun, T. Han, L. Cheong, and S. Yan, "Randomized locality sensitive vocabularies for bag-of-features model," in Proc. Eur. Conf. Comput. Vision, vol. 6313. 2010, pp. 748-761.

Abdelkhalak Bahri, Hamid Zouaki, Youssef Bourass

Equipe: Modelisation Mathematiques et Informatique Decisionnelle, Faculty of Science Eljadida, Morocco

Bahri. abdelkhala@gmail. com,, bourass.youssef@gmail. com
COPYRIGHT 2015 Springfield Publishing Corporation
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2015 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Author:Bahri, Abdelkhalak; Zouaki, Hamid; Bourass, Youssef
Publication:International Journal of Emerging Sciences
Article Type:Report
Geographic Code:1USA
Date:Mar 1, 2015
Previous Article:PLS mechanism for local search algorithm (PPCA) for medical clustering problem.
Next Article:Review of image defect detection and inpainting techniques and scope for improvements.

Terms of use | Privacy policy | Copyright © 2019 Farlex, Inc. | Feedback | For webmasters