Heuristic Image Recognition

1. Introduction

Internet users have already become accustomed to a new category of spam messages, one in which the content is delivered in the form of a picture attachment. As they have probably noticed, each image is slightly different from its predecessors, because some noise has been added.

A year ago, such "image spam" accounted for approximately 10% of the total amount of spam. Such message series typically consisted of about 10-15 spam images with minor modifications. In the last six months, however, spammers noticed that many current AntiSpam solutions are almost ineffective against this trick, so they started exploiting this niche in earnest. Image spam increased to 30-40% of the total amount of circulating spam, with the random noise changing in almost every image sent, and detection rates dropped even further.

The noise usually consisted of:

* Adding random pixels to the image

* Animated GIFs with random bogus frames

* Modifying the size of the image

* A long line at the end of the image (a sort of border) with a random part missing

* Splitting the image into sub-images and using HTML tables to reconstruct it

* Using similar colors for different parts of the text in the image

* GIFs with transparent frames, with parts of the text placed in different frames to confuse OCR filters

As one may notice, some of these noise types are also encountered in image retrieval systems. Let us consider a professional photographer who has to take a picture of the Eiffel Tower. He will fix his camera on a tripod at the spot with the best view of the tower and take a few pictures. Even though he takes consecutive pictures within a small time interval, and to his eye the pictures are quite similar, computationally speaking the pictures are different. If a bird or a plane flies over the tower, the pictures will be even more different.

As an example, the following spam images show the same information, but computationally speaking, they are totally different.

The information in Fig. 1 and Fig. 2 is identical, but the images are different (random pixels were changed). Both pictures were reduced to black and white in order to make the difference easy to notice; the original versions had different text and pixel colors and a different background. It is also easy to notice that the images have different sizes.

While in spam images this process is deliberate, in reality it can happen by accident, or simply because of the medium in which the pictures are taken.

Another problem is that the information contained in the image can be rotated at various angles in order to confuse OCR filters. The same problem appears if the photographer wants to take a landscape picture or a diagonal picture. A sample of a rotated spam image can be seen in Fig. 3.

[FIGURE 1 OMITTED]

[FIGURE 2 OMITTED]

BitDefender addressed this problem by analyzing only the histograms of the spam images, using a special distance developed for this type of picture, called SID (Spam Image Distance). Using a frequently updated database of histograms of images encountered in spam emails, almost 98% of the spam images were correctly identified.

[FIGURE 3 OMITTED]

2. Proposed Method

Many histogram distances have been proposed so far, among which we can enumerate the histogram Euclidean distance, the histogram intersection distance and the histogram quadratic (cross) distance. If we consider h and g to be two color histograms, then:

1. Histogram Euclidean Distance:

d_E(h, g) = \sqrt{ \sum_{a \in A} \sum_{b \in B} \sum_{c \in C} \bigl( h(a,b,c) - g(a,b,c) \bigr)^2 }   (1)

In this formula, only identical bins in the respective histograms are compared. Two different bins may represent perceptually similar colors, but they are not compared cross-wise. All bins contribute equally to the distance.
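As a minimal illustration (not part of the original text), the Euclidean distance of Eq. (1) could be computed as follows, assuming the histograms are stored as equally sized 3-D NumPy arrays of pixel counts over quantized (R, G, B) bins:

```python
import numpy as np

def euclidean_distance(h: np.ndarray, g: np.ndarray) -> float:
    """Histogram Euclidean distance, Eq. (1): only identical bins are compared."""
    assert h.shape == g.shape, "both histograms must use the same bin layout"
    diff = h.astype(float) - g.astype(float)
    return float(np.sqrt(np.sum(diff * diff)))
```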

2. Histogram Intersection Distance:

d_\cap(h, g) = 1 - \frac{ \sum_{a \in A} \sum_{b \in B} \sum_{c \in C} \min\bigl( h(a,b,c),\, g(a,b,c) \bigr) }{ \min\bigl( \lVert h \rVert, \lVert g \rVert \bigr) }   (2)

where ||h|| and ||g|| give the magnitude of each histogram, which is equal to the number of samples. Colors not present in the user's query image do not contribute to the intersection distance, which reduces the contribution of the background colors. The sum is normalized by the histogram with the fewest samples.
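Under the same assumed 3-D array layout, a sketch of the intersection distance of Eq. (2); the normalization by the smaller histogram follows the description above:

```python
import numpy as np

def intersection_distance(h: np.ndarray, g: np.ndarray) -> float:
    """Histogram intersection distance, Eq. (2): 0 for identical content,
    normalized by the histogram with the fewest samples."""
    overlap = float(np.sum(np.minimum(h, g)))
    smaller = float(min(h.sum(), g.sum()))   # min(||h||, ||g||)
    return 1.0 - overlap / smaller
```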

3. Histogram quadratic (cross) distance:

d_q(h, g) = (h - g)^{T} A \, (h - g), \quad A = [a_{ij}]   (3)

The cross distance formula considers the cross-correlation between histogram bins based on the perceptual similarity of the colors represented by the bins.
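A sketch of the quadratic (cross) distance of Eq. (3); the Gaussian-shaped perceptual-similarity matrix A used here is our own illustrative choice, since the text does not specify one:

```python
import numpy as np

def quadratic_distance(h: np.ndarray, g: np.ndarray, sigma: float = 1.0) -> float:
    """Histogram quadratic (cross) distance, Eq. (3): (h - g)^T A (h - g)."""
    d = (h.astype(float) - g.astype(float)).ravel()
    # Bin centers in (a, b, c) space, flattened in the same order as d.
    coords = np.indices(h.shape).reshape(3, -1).T.astype(float)
    # Illustrative similarity matrix: nearby bin centers get a higher a_ij.
    # Note: O(N^2) in the number of bins, so intended for coarse quantizations.
    pair_dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.exp(-(pair_dists ** 2) / (2.0 * sigma ** 2))
    return float(d @ A @ d)
```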

Of all these distances, the most appealing one for an AntiSpam engine would be the histogram intersection distance, but even this one has some problems. It is useful because it can ignore changes in background colors, but it is problematic when changes appear in the foreground. Also, if the noise added by the spammers uses the same colors but in different quantities, these colors will still be counted in the distance.

The algorithm used was quite simple: it needed a component that cleaned the image of as much of the spammers' noise as possible, and a distance that compared two histograms. Since this type of noise is particular to spam images, one way to avoid unnecessary problems was to ignore all colors that occurred in less than a given percentage of the image. Some noise persisted, but most of it was removed. Of course, this eliminates legitimate colors too, but only insignificant ones.

The cleaning function C_h accepts a given histogram and returns a new, shorter histogram in which the unnecessary colors are eliminated. C_h can be defined as follows:

C_h(h)(a,b,c) = \begin{cases} h(a,b,c), & \text{if } h(a,b,c) / \lVert h \rVert \ge \gamma \\ 0, & \text{otherwise} \end{cases}   (4)

In other words, only colors that occur in a sufficiently large amount (normalized to the surface of the picture) are kept. Experimentally, a good value for γ is 0.01.

Algorithmically speaking, this function can be implemented by using the following decision block:

if h(a,b,c) / ||h|| ≥ γ then keep bin (a,b,c), else discard it   (5)
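A minimal sketch of the cleaning step of Eqs. (4)-(5), assuming the histogram is a 3-D array of pixel counts; the default γ = 0.01 follows the experimental value mentioned above:

```python
import numpy as np

def clean_histogram(h: np.ndarray, gamma: float = 0.01) -> np.ndarray:
    """Cleaning function C_h, Eqs. (4)-(5): zero every bin holding less
    than a fraction gamma of the image's pixels."""
    total = float(h.sum())                   # ||h||: number of pixels in the image
    cleaned = h.astype(float)                # work on a copy of the histogram
    cleaned[cleaned / total < gamma] = 0.0   # discard insignificant colors
    return cleaned
```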

SID is a modified intersection distance which can be defined like this:

\mathrm{SID}(h, g) = \frac{ \sum_{a \in A} \sum_{b \in B} \sum_{c \in C} \; \max_{|a - a'| \le \delta,\ |b - b'| \le \delta,\ |c - c'| \le \delta} s\bigl( h(a,b,c),\, g(a',b',c') \bigr) }{ \min\bigl( \lVert h \rVert, \lVert g \rVert \bigr) }   (6)

where

s(x, y) = \begin{cases} \min(x, y), & \text{if } |x - y| / \max(x, y) \le \Delta \\ 0, & \text{otherwise} \end{cases}

wherein h and g denote the pixel counts of the target and reference histograms, respectively, a coordinate triplet (a, b, c) tags an individual color bin, A, B and C denote the basis colors of the color space, the parameters δ and Δ denote a color neighborhood size and a color bin similarity threshold, respectively, while ||h|| and ||g|| denote the magnitude (i.e., total number of samples) of the target and reference histograms, respectively. The distance of Eq. (6) is symmetrical with respect to h and g, i.e. the distance depends on the contents of the two images, but not on which of the two images provides the target histogram and which the reference histogram.

In other words, we do not consider only identical colors, but also similar ones, which may differ from the original by no more than δ values, with the restriction that an element is included in the sum only if the difference between the sizes of the bins is smaller than Δ%. These two parameters can be determined using ROC curves, trial and error, or any machine learning technique.

In this way, SID compares each bin of one histogram representation to a range of bins of the other histogram representation in order to determine the best match between that bin and the range of bins; the two histogram representations are the target and the reference histograms, taken in either order. These per-bin matches are then combined to determine the histogram distance between the target and the reference histogram representations.
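The sketch below is one possible reading of Eq. (6), not the published implementation: for each non-empty bin of one histogram it scans the δ-neighborhood of the corresponding bin in the other histogram, accepts a neighbor only when the relative difference in bin size is within Δ, and normalizes the accumulated overlap by the smaller histogram, so that higher scores mean more similar images.

```python
import numpy as np

def sid(h: np.ndarray, g: np.ndarray, delta: int = 1, Delta: float = 0.2) -> float:
    """Spam Image Distance, Eq. (6), read as a similarity score: higher
    means more similar, matching the ">= threshold" tests used below."""
    norm = float(min(h.sum(), g.sum()))            # min(||h||, ||g||)
    total = 0.0
    for a, b, c in zip(*np.nonzero(h)):            # every non-empty bin of h
        ha = float(h[a, b, c])
        # delta-neighborhood of bin (a, b, c) inside g, clipped to the array.
        window = g[max(a - delta, 0):a + delta + 1,
                   max(b - delta, 0):b + delta + 1,
                   max(c - delta, 0):c + delta + 1]
        best = 0.0
        for gv in window.ravel().astype(float):
            if gv == 0.0:
                continue
            # Keep a neighbor only if bin sizes differ by at most Delta (relative).
            if abs(ha - gv) / max(ha, gv) <= Delta:
                best = max(best, min(ha, gv))
        total += best
    return total / norm
```

With this reading, two identical cleaned histograms score exactly 1.0.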

A better representation of the SID algorithm can be seen in Fig. 4.

[FIGURE 4 OMITTED]

Basically, this filter compares the query image with all the images contained in the database in order to find a match. The database contains only cleaned histograms.

The algorithm can be presented in the following steps (a code sketch follows the list):

1. Extract histogram

2. q = Clean histogram (skip the unnecessary colors)

3. y = max_i SID(q, h_i)

4. If y ≥ Ω, return "We have a match"
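Putting the steps together, a hedged sketch of the matching loop; the histogram extraction via numpy.histogramdd, the bin count, and the threshold value Ω = 0.9 are illustrative assumptions, and clean_histogram and sid refer to the sketches given earlier:

```python
import numpy as np

def is_spam_image(pixels: np.ndarray, spam_histograms: list,
                  bins: int = 16, omega: float = 0.9) -> bool:
    """Steps 1-4 above: extract, clean, take the best SID score against a
    database of pre-cleaned spam histograms, and compare it with Omega."""
    # Step 1: extract a bins^3 color histogram from an (H, W, 3) RGB image.
    q, _ = np.histogramdd(pixels.reshape(-1, 3).astype(float),
                          bins=(bins, bins, bins),
                          range=((0, 256), (0, 256), (0, 256)))
    # Step 2: drop the insignificant colors.
    q = clean_histogram(q)
    # Step 3: best similarity against the known spam histograms.
    y = max(sid(q, h_i) for h_i in spam_histograms)
    # Step 4: declare a match when the score reaches the threshold.
    return y >= omega
```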

It is obvious that this filter can be modified in order to develop an image retrieval system. The only difference is that the database has to contain the original images and not only the pre-computed histograms. All we have to do in the algorithm is add another step and modify the one where we extract only the maximum hit, since we want all the hits. The algorithm takes an image as input and searches a given database for similar pictures (those whose SID score exceeds the similarity threshold).

[FIGURE 5 OMITTED]

[FIGURE 6 OMITTED]

The image retrieval algorithm is presented in the following steps (a code sketch follows the list):

1. Extract query histogram

2. For i = 1 to Database.size:

a. h_i = Extract the histogram of image i

b. Clean h_i

c. If SID(q, h_i) ≥ Σ, display Image(h_i)
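A corresponding sketch of the retrieval loop, again reusing the clean_histogram and sid sketches from above; the raw-image database format, the bin count, and the threshold value standing in for Σ are our own illustrative assumptions:

```python
import numpy as np

def extract_histogram(pixels: np.ndarray, bins: int = 16) -> np.ndarray:
    """bins^3 color histogram of an (H, W, 3) RGB image."""
    hist, _ = np.histogramdd(pixels.reshape(-1, 3).astype(float),
                             bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist

def retrieve_similar(query: np.ndarray, image_database: list,
                     sigma: float = 0.9) -> list:
    """Return the indices of database images whose SID score against the
    query reaches the similarity threshold Sigma."""
    q = clean_histogram(extract_histogram(query))          # step 1
    hits = []
    for i, image in enumerate(image_database):             # step 2
        h_i = clean_histogram(extract_histogram(image))    # steps 2a, 2b
        if sid(q, h_i) >= sigma:                           # step 2c
            hits.append(i)
    return hits
```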

3. Results

Our preliminary results show that SID can be very precise when searching for similar pictures. For example, the pictures shown in Fig. 1 and Fig. 2 are identical judging by SID's score. The pictures shown in Fig. 5 and Fig. 6 are also identical, although there is a change in luminosity. SID notices that there is a change of color, but since the exact percentages of the colors are the same, the images are declared identical.

On our test corpus, our image retrieval system has an accuracy of 97.9%, but since our query system permits the user to choose the degree of accepted noise, the degree of similarity between colors, and the degree of similarity between pixel counts, its accuracy can rise to as much as 100%.

Also, working with complex pictures with a high number of colors might cause some problems for SID, but if this method were combined with other image similarity techniques, we believe the results would be impressive.

4. Conclusions

By comparing each bin with a plurality of bins, SID is able to identify, with high accuracy, images that are slightly different from the original. Of course, since retrieval is performed only on the histograms of the images, we cannot say that this process is flawless, but we are confident that this distance is a step forward in image retrieval research.

Used together with a caption filter, this technology could bring value to any image search engine on the Internet. Since search engines currently classify images based only on the document topic, using image features will improve the accuracy of the final results. Alone, neither technology will achieve outstanding results, but together the rate of false positives and false negatives will tend to 0.

DOI: 10.2507/daaam.scibook.2009.39


This Publication has to be referred as: Cosoi, A[lexandru] C[atalin]; Vlad, M[adalin] S[tefan] & Sgarciu, V[alentin] (2009). Heuristic Image Recognition, Chapter 39 in DAAAM International Scientific Book 2009, pp. 377-384, B. Katalinic (Ed.), Published by DAAAM International, ISBN 978-3-901509-69-8, ISSN 1726-9687, Vienna, Austria

Authors' data: PhD Eng. Cosoi, A[lexandru] C[atalin]; Dr. Eng. Vlad, M[adalin] S[tefan]; Prof. Dr. Eng. Sgarciu, V[alentin], University "Politehnica" Bucharest, Romania, Splaiul Independentei 313, Bucharest, PO BOX 060042; catalin.cosoi@gmail.com, madalinv@ac.pub.ro, vsgarciu@cs.pub.ro