# Spam Blog Filtering with Bipartite Graph Clustering and Mutual Detection between Spam Blogs and Words

1. Introduction

The blogosphere is not only one of the richest sources of information for individual opinion and social reputation, but also a highly effective medium for advertising. Advertising is a mechanism that creates matches between consumer demand and corporations' supply. Hence advertising based on direct consumer experience is valuable for readers who seek referrals to products or services. Moreover, consumer-generated advertising gives bloggers financial incentives to create advertising content by allowing them opportunities to generate income through referrals.

However, there are many spam blogs that reap benefits from advertising without this type of unique content. Spam blogs tend to take advantage of affiliate-advertising benefits by using copied-and-pasted sentences from other blogs and news articles, even while they may not contain any useful information about goods or services. Spam blogs are typically posted on multiple pages and/or sites to increase opportunity for exposure.

Detecting spam blogs is a difficult task for filtering methods based on machine-learning mechanisms, e.g. SVM (support vector machine), because the cycle of creation and deletion of spam blogs is very rapid compared to normal blogs. Preparing the vast amount of training data required for machine learning is no doubt a costly task.

In this paper, a new filtering method is developed in order to reduce filtering costs. This method is based on a mutual-detection mechanism with co-occurrence cluster seeds extracted from bipartite graphs between blogs and words.

2. Spam Blog Filtering

A spam blog is often composed of parts of other blogs and news articles. It may contain repetitive posts or be posted on multiple sites to maximize the chances of attracting a large audience. These characteristics suggest that spam blogs may share common spam words and that the blogs and words form large bipartite graph clusters. This paper assumes that these large clusters are an invariant feature of spam blogs, and it uses blogs in these clusters as spam blog seeds for filtering.

In this paper, one important criterion to evaluate spam blogs is how much value a blogger adds to his or her blog. Spammers tend to copy parts of other blogs and paste them onto their blogs with many commercial affiliate links, either automatically or semi-automatically. As a result, they add no value to their blogs. In contrast, blogs with value tend to contain individuals' opinions, experiences, and creative content. When these kinds of blogs contain advertising links, the links are likewise valuable because readers are interested in the content and can find goods or services useful based on their relevance.

The spam filtering process is as follows: bipartite graphs between blogs and words are extracted from updated blog data. From these graphs, the spam seed is extracted. Mutual detection is performed based on the spam seed.

2.1. Bipartite Graph Clustering

There are many established methods and algorithms for extracting clusters, e.g. hierarchical agglomerative clustering and K-means clustering [1][2]. Hierarchical agglomerative clustering can extract a large number of clusters, but suffers from a high computational cost on large data sets. K-means clustering can process a data set in a short time, but requires a predefined number of clusters, which cannot be determined before the spam seed is prepared.

Extracting huge co-occurrence clusters from the blogosphere requires high computational efficiency without predefined parameters for the number or size of clusters, since the data set is very large and the size and number of clusters for spam filtering cannot be determined in advance. To meet this requirement and to extract huge co-occurrence clusters between spam blogs and spam words from the blogosphere, the Shared Interest (SI) algorithm has been developed.

To describe SI, co-occurrence data between blogs and words is illustrated as a set of graphs, each composed of nodes and edges. The nodes represent blogs and words, and the edges represent occurrences of words on blogs.

SI employs an edge, rather than a node, as the element used to form clusters, since an occurrence of a word on a blog represents the intention of a blogger. A local similarity between any two edges is defined as a measure of the degree of shared interest. To extract dense clusters without predefined parameters of size or number of clusters and to reduce calculation cost, a partial order on an edge set is defined in terms of in and out degrees, and a traversal algorithm on the edge set is also developed.

2.2 Partially Ordered Edge Set

To construct a partially ordered edge set, a partially ordered node set is defined first, since a partially ordered edge set can be constructed as the Cartesian product of two partially ordered node sets, with structurally unrelated node pairs removed from the product. Index numbers are assigned to all nodes in ascending order of degree. Hence, for nodes n_i, n_j, n_k (i < j < k) in a graph, d(n_i) ≤ d(n_j) ≤ d(n_k), where d(n_i) is the degree of node n_i. Based on the degrees, the partial orders are n_i < n_j and n_j < n_k. For example, two partially ordered node sets can be derived from the graph in Fig. 1. The bipartite graph is composed of blogs and the words occurring on those blogs. The blog set and word set can be illustrated as co-occurrence graphs, as shown in Fig. 2. The partial orders of the blog set (the upper half of Fig. 2) are b_2 = b_3, b_1 < b_4, b_2 < b_4, and b_3 < b_4, whereas those of the word set (the lower half of Fig. 2) are w_1 = w_2, w_1 < w_4, w_2 < w_4, and w_3 < w_4.

The partial order on the edge set can be defined by the Cartesian product of the two partially ordered node sets in Fig. 2, excluding node pairs that are unlinked in Fig. 1. Generally, the Cartesian product of two partially ordered sets (S_1, ≤_1) and (S_2, ≤_2) is a partially ordered set with domain S_1 × S_2 and relation ≤_1 × ≤_2 given by the rule (a_1, a_2) (≤_1 × ≤_2) (b_1, b_2) if and only if a_1 ≤_1 b_1 and a_2 ≤_2 b_2 [4]. Hence, the partial edge order can be described as follows.
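As a minimal sketch in Python, the degree-induced node order and the resulting product order on edges can be expressed as follows. The edge list is an assumed toy graph in the spirit of Fig. 1, and the degree-only comparison is a simplification: the paper additionally restricts comparisons to co-occurring node pairs.

```python
# Toy bipartite graph (assumption): two complete bipartite subgraphs
# plus one extra edge, in the spirit of Fig. 1.
edges = [("b1", "w1"), ("b1", "w2"), ("b4", "w1"), ("b4", "w2"),
         ("b2", "w3"), ("b2", "w4"), ("b3", "w3"), ("b3", "w4"),
         ("b4", "w4")]

# Node degrees induce the partial order on each node set.
degree = {}
for b, w in edges:
    degree[b] = degree.get(b, 0) + 1
    degree[w] = degree.get(w, 0) + 1

def node_le(n, m):
    # Simplified order: n <= m when d(n) <= d(m).
    return degree[n] <= degree[m]

def edge_le(e, f):
    # Product order on the edge set, restricted to actual edges:
    # (b, w) <= (b', w')  iff  b <= b' and w <= w'.
    return node_le(e[0], f[0]) and node_le(e[1], f[1])

print(edge_le(("b1", "w1"), ("b4", "w2")))  # d(b1)<=d(b4) and d(w1)<=d(w2)
```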

[FIGURE 1 OMITTED]

{b_i, w_j} ≤ {b_k, w_l} if and only if b_i ≤ b_k and w_j ≤ w_l, for edges {b_i, w_j} and {b_k, w_l} in the original graph

Based on the partial-edge order, we can construct a Hasse diagram, which is a directed graph that represents a partially ordered set (poset). A Hasse diagram of a poset can be a line graph of the original graph, i.e., a graph with interchanged nodes and edges of an original graph. The diagram in Fig. 3 illustrates the relationship between two edges in the original graph in Fig. 1.

For the sake of simplicity, the terms "edge" and "node" are employed when discussing an original graph, and likewise the terms "element" and "path" in the Hasse diagram. For example, a node circled in the original graph in Fig. 1 corresponds to a path circled in the Hasse diagram of Fig. 3, whereas an edge drawn between two nodes in the original graph corresponds to an element illustrated as a rounded rectangle in the Hasse diagram. When appending elements to a cluster based on similarities, two different dense clusters may be merged into one cluster, even though the merged cluster is sparser than each of the two, as shown in Fig. 4 (a): a high-degree element can join the two clusters. As shown in Fig. 4 (b), when a high-degree element (an edge in the original graph) is placed at an order higher than the others, the two dense clusters are always kept separate while elements are appended from the bottom elements of both sides. As seen in Fig. 4, to prevent this unnecessary join, an incremental partially ordered element (edge) set, which is an edge set in the original graph, is defined.

[FIGURE 2 OMITTED]

[FIGURE 3 OMITTED]

[FIGURE 4 OMITTED]

2.3 Local Similarity between Two Edges

The similarity of two edges in an original graph is defined by the co-occurrences of words shared by the two different nodes n_i and n_j (i < j) of the two edges (Fig. 5).

The two edges share a word n_s in the co-occurrence word set; that is, n_s is the common node between edge {n_i, n_s} and edge {n_j, n_s}. To define a local similarity between the edges, the Jaccard coefficient between the two different nodes, n_i and n_j, is employed.

J(n_i, n_j) = s(n_i, n_j) / (d(n_i) + d(n_j) - s(n_i, n_j)) (1)

where s(n_i, n_j) is the number of co-occurrence words shared by n_i and n_j.
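For concreteness, Eq. (1) can be computed directly from the two degrees and the shared-word count; this small Python helper is illustrative:

```python
def jaccard(d_i, d_j, s_ij):
    # Eq. (1): J(n_i, n_j) = s / (d(n_i) + d(n_j) - s)
    return s_ij / (d_i + d_j - s_ij)

# Example from the worked example in Sec. 2.5: b1 and b4 have degrees
# 2 and 3 and share two words, so J(b1, b4) = 2 / (2 + 3 - 2) = 2/3.
print(jaccard(2, 3, 2))
```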

A cluster with a high local similarity between each edge pair (an element pair in a Hasse diagram) tends to be dense. Density and local similarity are analyzed in detail below in terms of local similarity and the traversal on a Hasse diagram used to extract the clusters in Fig. 6.

[FIGURE 5 OMITTED]

2.4 Upper and Lower Bounds of Co-occurrences

To understand the characteristics of a sequential path between two elements, i.e., {n_i, n_j} and {n_j, n_k}, in a Hasse diagram, the upper and lower bounds of the number of co-occurrences between the two nodes in the original graph are defined. The upper bound is trivial from the definition of local similarity: it is simply the minimum degree of the two nodes n_i and n_j. Hence, when i < j,

s^max(n_i, n_j) = min(d(n_i), d(n_j)) = d(n_i) (2)

To find the lower bound of the number of co-occurrences, a similarity threshold δ is defined as follows:

J(n_i, n_j) ≥ δ (3)

Based on the similarity threshold, we can simplify the inequality

δ ≤ J(n_i, n_j) = s(n_i, n_j) / (d(n_i) + d(n_j) - s(n_i, n_j))

s(n_i, n_j) ≥ δ(d(n_i) + d(n_j)) / (δ + 1) (4)

Hence, the lower bound is expressed as

s^min(n_i, n_j) = δ(d(n_i) + d(n_j)) / (δ + 1) (5)

Based on the lower bound, unnecessary paths (which are nodes in the original graph) can be removed to extract high-density multi-clusters. Unnecessary paths are those between two elements whose similarity falls under δ in a Hasse diagram.
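A sketch of both bounds in Python (function names are illustrative):

```python
def s_max(d_i, d_j):
    # Eq. (2): the co-occurrence count can never exceed the smaller degree.
    return min(d_i, d_j)

def s_min(d_i, d_j, delta):
    # Eq. (5): the smallest co-occurrence count compatible with J >= delta.
    return delta * (d_i + d_j) / (delta + 1)

def pair_feasible(d_i, d_j, delta):
    # A node pair can reach similarity delta only if s_min <= s_max;
    # otherwise the corresponding path can be pruned.
    return s_min(d_i, d_j, delta) <= s_max(d_i, d_j)

print(pair_feasible(2, 3, 0.5))   # degrees 2 and 3 can reach J >= 0.5
print(pair_feasible(2, 10, 0.5))  # degrees 2 and 10 cannot
```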

To find the relation between a path and δ, we can derive a relationship between d(n_i) and d(n_j) from the lower- and upper-bound conditions.

δ(d(n_i) + d(n_j)) / (δ + 1) ≤ min(d(n_i), d(n_j)) = d(n_i) (6)

δd(n_j) ≤ d(n_i) (7)

In the same way, we can also derive

δd(n_k) ≤ d(n_j) (8)

Based on the conditions of (7) and (8), we have

δd(n_j) + δd(n_k) ≤ d(n_i) + d(n_j)

δ ≤ (d(n_i) + d(n_j)) / (d(n_j) + d(n_k)) (9)

The conditions concerning node degrees in an original graph can be satisfied by removing all elements under the lower bound of similarity δ in a Hasse diagram.
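The chain (6)-(7) can be sanity-checked numerically: the bound condition s^min ≤ s^max is exactly the degree condition δ·d(n_j) ≤ d(n_i). A brute-force check in Python, using exact rational arithmetic to avoid rounding artifacts:

```python
from fractions import Fraction

def bound_condition(d_i, d_j, delta):
    # s_min <= s_max, where s_max = d_i because d_i <= d_j (Eq. 2)
    return delta * (d_i + d_j) / (delta + 1) <= d_i

# Verify equivalence with Eq. (7) over a grid of degrees and thresholds.
for d_i in range(1, 20):
    for d_j in range(d_i, 20):
        for delta in (Fraction(1, 5), Fraction(1, 2), Fraction(4, 5)):
            assert bound_condition(d_i, d_j, delta) == (delta * d_j <= d_i)
print("Eq. (7) is equivalent to the bound condition")
```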

2.5 Simple Example

To demonstrate how the Shared Interest algorithm functions, it is applied to the graph G_1 shown in Fig. 1. The graph is composed of two complete bipartite graphs, G_a and G_b, and an edge {b_4, w_4}. The example is defined as follows:

G_1 = (B_1 ∪ W_1, E_1), B_1 = {b_1, b_2, b_3, b_4}, W_1 = {w_1, w_2, w_3, w_4}, E_1 = {{b_1, w_1}, {b_1, w_2}, {b_4, w_1}, {b_4, w_2}, {b_2, w_3}, {b_2, w_4}, {b_3, w_3}, {b_3, w_4}, {b_4, w_4}}

where B_1 is the set of bloggers in G_1 and is called the blog set, W_1 is the set of words occurring on the blogs in G_1 and is called the word set, and E_1 is the set of links between B_1 and W_1.

The node degrees are

d(b_1) = 2, d(b_2) = 2, d(b_3) = 2, d(b_4) = 3
d(w_1) = 2, d(w_2) = 2, d(w_3) = 2, d(w_4) = 3

Partial orders between nodes in two sets are

b_2 = b_3, b_1 < b_4, b_2 < b_4, b_3 < b_4 and w_1 = w_2, w_1 < w_4, w_2 < w_4, w_3 < w_4

The co-occurrence relations in the blog and word sets are

CB_1 = {{b_1, b_4}, {b_3, b_4}, {b_2, b_4}, {b_2, b_3}}
CW_1 = {{w_1, w_2}, {w_1, w_4}, {w_2, w_4}, {w_3, w_4}}

where CB_1 is the set of co-occurrence relations in the blog set and CW_1 is that in the word set.

The numbers of co-occurrences and similarities of each pair in the blog set are

s(b_1, b_4) = 2, J(b_1, b_4) = 2/3; s(b_2, b_3) = 2, J(b_2, b_3) = 1
s(b_2, b_4) = 1, J(b_2, b_4) = 1/4; s(b_3, b_4) = 1, J(b_3, b_4) = 1/4

The numbers of co-occurrences and similarities of each pair in the word set are

s(w_1, w_2) = 2, J(w_1, w_2) = 2/3; s(w_3, w_4) = 2, J(w_3, w_4) = 2/3
s(w_1, w_4) = 1, J(w_1, w_4) = 1/4; s(w_2, w_4) = 1, J(w_2, w_4) = 1/4

Assuming that the lower bound of the similarity is 1/2 (δ = 1/2), the co-occurrence relations {b_2, b_4} and {b_3, b_4} in CB_1, and {w_1, w_4} and {w_2, w_4} in CW_1, which are represented by the dotted lines of the co-occurrence graphs in Fig. 2, are ignored. The edge set E_1 in the original graph G_1 is defined by the Cartesian product of B_1 and W_1 without the unlinked node pairs, i.e., {b_1, w_3}, {b_1, w_4}, {b_2, w_1}, {b_2, w_2}, {b_3, w_1}, {b_3, w_2}, and {b_4, w_3}, which are the ignored elements in the Hasse diagram of Fig. 3. Fig. 6 shows the similarity matrix of elements, i.e., of element pairs in terms of the partially ordered set. In the ordered element set, the minimal elements are {b_1, w_1}, {b_3, w_3}, and {b_4, w_4} ({b_1, w_2} and {b_2, w_3} can also be used as minimal elements instead of {b_1, w_1} and {b_3, w_3}, since they are equal in the partially ordered element set).
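The blog-side pruning in this example can be reproduced in a few lines of Python; the edge list below is a possible reconstruction of G_1, consistent with the degrees and co-occurrence sets given above:

```python
from itertools import combinations

# Assumed edges of G_1: G_a = {b1,b4} x {w1,w2}, G_b = {b2,b3} x {w3,w4},
# plus the extra edge {b4, w4}.
edges = {("b1", "w1"), ("b1", "w2"), ("b4", "w1"), ("b4", "w2"),
         ("b2", "w3"), ("b2", "w4"), ("b3", "w3"), ("b3", "w4"),
         ("b4", "w4")}

words_of = {}
for b, w in edges:
    words_of.setdefault(b, set()).add(w)

def jaccard(x, y):
    shared = len(words_of[x] & words_of[y])
    return shared / (len(words_of[x]) + len(words_of[y]) - shared)

# Blog pairs whose similarity falls under delta = 1/2 are pruned.
delta = 0.5
pruned = {frozenset((x, y))
          for x, y in combinations(words_of, 2)
          if 0 < jaccard(x, y) < delta}
print(pruned)  # the pairs {b2, b4} and {b3, b4}, as in the text
```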

In a traversal from a minimal element to a maximal element in the partial order, ineffective paths (those under the lower bound of the similarity) are ignored, i.e., {b_3, w_4}-{b_4, w_4}, {b_2, w_4}-{b_4, w_4}, {b_4, w_1}-{b_4, w_4}, and {b_4, w_2}-{b_4, w_4}, which are removed from the Hasse diagram of Fig. 3. Three clusters are extracted by the traversal in Fig. 6.

[FIGURE 6 OMITTED]

To avoid duplicate traversals, the implementation of the SI algorithm traverses from maximal elements to minimal elements, the reverse of the method explained in the example of Fig. 6. In a regular traversal from minimal elements to maximal elements, there are generally many overlapping regions around maximal elements, because when δ is small, many traversals lead to a few maximal elements on a Hasse diagram. In a traversal from maximal elements to minimal elements, however, there are no overlapping regions. The reverse implementation of the SI algorithm is described below:

    main:
        while PO_edge_set is not empty
            h_edge = pop_highest(PO_edge_set)
            traverse_edges(h_edge)

    traverse_edges(h_edge):
        push_edge(h_edge, proc_edge_set)
        while proc_edge_set is not empty
            c_edge = pop_highest_edge(proc_edge_set)
            remove(c_edge, PO_edge_set)
            lower_eset = get_all_lower(c_edge, PO_edge_set)
            if lower_eset is not null
                push(lower_eset, proc_edge_set)
                remove(c_edge, start_edge_set)
            else
                push(c_edge, start_edge_set)

The total time complexity of the two "while" loops in main and traverse_edges is O(e), where e is the number of edges in a graph set. This is because both loops consume one edge set, PO_edge_set, and processing an element removes it from the set. The linear time complexity was also confirmed experimentally [3].
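As an illustrative Python sketch (an assumed interpretation of the pseudocode above, not the authors' implementation), the reverse traversal can be expressed with an explicit stack; `lower_of` and `rank` are assumed data structures:

```python
def si_reverse_traverse(lower_of, rank):
    """lower_of maps each element to the elements directly below it in
    the partially ordered edge set; rank orders elements (highest first).
    Each downward walk consumes its elements and yields one cluster, so
    every element is processed exactly once (O(e) overall)."""
    remaining = set(lower_of)
    clusters = []
    while remaining:                        # main loop
        h_edge = max(remaining, key=rank)   # pop_highest
        stack, cluster = [h_edge], set()
        while stack:                        # traverse_edges
            c_edge = stack.pop()
            if c_edge not in remaining:
                continue
            remaining.remove(c_edge)        # remove from PO_edge_set
            cluster.add(c_edge)
            stack.extend(lower_of[c_edge])  # get_all_lower
        clusters.append(cluster)
    return clusters

# Two components: e3 is isolated; e2 sits above e1.
lower_of = {"e1": set(), "e2": {"e1"}, "e3": set()}
print(si_reverse_traverse(lower_of, {"e1": 1, "e2": 2, "e3": 3}.get))
```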

2.6. Extracting Spam Seed

Employing the SI algorithm, which is a bipartite-graph clustering algorithm, the spam seed is extracted from bipartite graphs between blogs and the words they use. Spam blogs and words can be extracted as large co-occurrence clusters because spam blogs share many common words, due to copied-and-pasted content from other blogs and postings on multiple blogs and sites. However, some normal blogs may still be extracted in spam blog clusters, since all blog pages share certain high-frequency terms generated by blog sites, e.g. the name of the blog site, category names within the site, text banners, etc. To avoid incorrectly extracting these blogs, high-frequency terms are removed from the bipartite graphs of blogs and words based on threshold W, an upper limit on the document frequency of each word. Co-occurrence clusters of low-frequency words are then extracted.

[FIGURE 7 OMITTED]

The procedure for spam seed extraction is described as follows:

Step 1: All co-occurrence data between blogs and words in a data set is prepared ((a) of Fig. 7). The data is represented as a set of bipartite graphs G(ABlog ∪ AWord, AEdge). ABlog, AWord and AEdge denote the set of all blogs, the set of all words, and the set of all edges between blogs and the words used in the blogs, respectively.

Step 2: A set of bipartite graphs with low-frequency words ((b) of Fig. 7), denoted LG(ABlog ∪ LWord, LEdge), is extracted from the set of all bipartite graphs G, where LWord denotes the set of low-frequency words and LEdge the set of edges between those words and blogs. LWord is composed of words whose document frequency is under threshold W. In the following pseudocode, deg(w_i) denotes the document frequency of word w_i and to(e_j) denotes the word on one side of edge e_j.

    LWord = Φ
    for w_i ∈ AWord do
        if deg(w_i) < W then
            LWord = LWord ∪ {w_i}
        end if
    end for

    LEdge = Φ
    for e_j ∈ AEdge do
        if to(e_j) ∈ LWord then
            LEdge = LEdge ∪ {e_j}
        end if
    end for

Step 3: A set of huge and dense clusters, denoted Clst, is extracted from LG with the SI algorithm ((c) of Fig. 7). Based on a preliminary experiment, the lower bound of the similarity (δ) is set to 0.2 [3].

Step 4: A spam score sscore(c_i) is calculated for each cluster c_i ∈ Clst, defined as sscore(c_i) = |blog(c_i)| × |word(c_i)|, where blog(c_i) and word(c_i) denote the set of blogs and the set of words in cluster c_i. CBlog is the set of all blogs which belong to some c_i ∈ Clst. The spam seed SSeed is composed of the part of CBlog whose clusters have a high spam score sscore(c_i); the number of clusters contributing blogs to SSeed is defined by Z × |Clst|, where threshold Z is the rate of clusters employed for the spam seed. Because of the characteristics of spam blogs, i.e. copied-and-pasted content and multiple postings, spam blogs and words form huge co-occurrence clusters. Hence, blogs in a cluster c_i with a high spam score sscore(c_i) are likely to be spam blogs.

    for c_i ∈ Clst do
        sscore(c_i) = |blog(c_i)| × |word(c_i)|
    end for
    Clst = {c_i : sorted by sscore(c_i) in descending order}
    SSeed = Φ
    for i = 1 to Z × |Clst| do
        SSeed = SSeed ∪ blog(c_i)
    end for
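The extraction steps above can be sketched in Python; `cluster_fn` stands in for the SI clustering of Step 3, and all names and default thresholds here are illustrative assumptions:

```python
def extract_spam_seed(blog_words, cluster_fn, W=100, Z=0.5):
    """blog_words maps each blog to the set of words it uses; cluster_fn
    must return a list of (blog_set, word_set) clusters (SI stand-in)."""
    # Step 2: keep only low-frequency words (document frequency < W).
    df = {}
    for words in blog_words.values():
        for w in words:
            df[w] = df.get(w, 0) + 1
    low = {b: {w for w in ws if df[w] < W} for b, ws in blog_words.items()}

    # Step 3: dense co-occurrence clusters (delegated to the SI algorithm).
    clusters = cluster_fn(low)

    # Step 4: rank clusters by spam score |blog(c)| x |word(c)| and take
    # the blogs of the top Z fraction of clusters as the seed.
    clusters.sort(key=lambda c: len(c[0]) * len(c[1]), reverse=True)
    seed = set()
    for blogs, _words in clusters[: int(Z * len(clusters))]:
        seed |= blogs
    return seed

# Illustrative stand-in for the SI step: two ready-made clusters.
demo = lambda lg: [({"s1", "s2"}, {"k1", "k2", "k3"}), ({"n1"}, {"k4"})]
print(extract_spam_seed({"s1": {"k1"}, "s2": {"k2"}, "n1": {"k4"}}, demo))
```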

A large cluster has a high spam score and is usually a spam cluster, as a result of copied-and-pasted content and multiple postings; in contrast, a small cluster is seldom a spam cluster. Hence, blogs in clusters with a high spam score can be safely appended as spam seeds.

This paper conducted a preliminary experiment ranking the extracted clusters by spam score, dividing them into 10 groups, sampling 20 blogs from each group, and checking for spam blogs in each sample. The experiment showed that groups with high spam scores contain many spam blogs. Therefore this paper uses the blogs in the top 5 groups by spam score as the spam seed. This method of spam-seed extraction works reliably because large co-occurrence clusters with low-frequency words are an invariant feature of copied-and-pasted content and multiple postings, even though characteristics like words, design, and URLs are often changed.

2.7. Mutual Detection

Spam blogs and words are mutually detected starting from the spam seed. The number of blogs extracted as spam in the mutual detection is determined by multiplying the number of all blogs by a parameter S, the assumed rate of spam blogs among all blogs. S is determined from the results of sampling all blogs. The procedure of mutual detection between spam blogs and words is as follows:

Step 1: For initialization, the spam blogs in the spam seed are appended to an initial spam blog list SBlog. The spam word list SWord is empty. In the following pseudocode, UBlog and UWord denote the set of undecided blogs and the set of undecided words, respectively.

    SBlog = {b_i | b_i ∈ SSeed}
    SWord = Φ
    UBlog = ABlog \ SBlog
    UWord = AWord

Step 2: The spam rate swrate of each word w_i ∈ UWord is calculated as |sblog(w_i)| / |ablog(w_i)|, where sblog(w_i) and ablog(w_i) denote the set of spam blogs which use word w_i and the set of all blogs which use the word. The results are recorded in a list of word spam rates. If a word's spam rate is equal to or higher than the threshold R, the word is judged to be a spam word and is added to SWord. The threshold R is the lower bound of the spam rate for a word to be judged spam.

    for w_i ∈ UWord do
        swrate = |sblog(w_i)| / |ablog(w_i)|
        if swrate ≥ R then
            SWord = SWord ∪ {w_i}
            UWord = UWord \ {w_i}
        end if
    end for

Step 3: The spam rate sbrate of each blog b_i ∈ UBlog is calculated as |sword(b_i)| / |aword(b_i)|, where sword(b_i) and aword(b_i) denote the set of spam words used in blog b_i and the set of all words used in the blog. The results are recorded in a list of blog spam rates. Blogs whose spam rate is equal to or over threshold C, the lower bound of the blog spam rate, are added to SBlog.

    for b_i ∈ UBlog do
        sbrate = |sword(b_i)| / |aword(b_i)|
        if sbrate ≥ C then
            SBlog = SBlog ∪ {b_i}
            UBlog = UBlog \ {b_i}
        end if
    end for

Step 4: The rate of detected spam blogs is calculated as |SBlog| / |ABlog|. If the rate is equal to or greater than threshold F, the lowest required rate of possible spam blogs, mutual detection is terminated and the procedure proceeds to step 5. F is defined to be greater than S, so that enough candidates are available to determine the spam blogs. Otherwise, if there are undecided blogs (UBlog ≠ ∅), mutual detection continues from step 2; if not, proceed to step 6.

    if |SBlog| / |ABlog| ≥ F then
        goto step 5
    else if |UBlog| > 0 then
        goto step 2
    else
        goto step 6
    end if

Step 5: For each extracted spam blog b_i ∈ SBlog, sbrate(b_i) is recalculated. The set of final spam blogs FSBlog is composed of the blogs with the highest spam rates. The number of spam blogs in FSBlog is defined as S × |ABlog|, where S is the rate of spam blogs among all blogs. Spam blog extraction has then completed successfully.

    for b_i ∈ SBlog do
        sbrate(b_i) = |sword(b_i)| / |aword(b_i)|
    end for
    SBlog = {b_i : sorted by sbrate(b_i) in descending order}
    FSBlog = Φ
    for i = 1 to S × |ABlog| do
        FSBlog = FSBlog ∪ {b_i}
    end for

Step 6: The final spam blog set is empty (FSBlog = ∅) because not enough possible spam blogs could be extracted. Spam blog extraction has then failed.
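A minimal Python sketch of the whole loop (Steps 1-6), assuming dictionary-of-sets inputs; the round limit and the guard against empty word sets are added assumptions, not from the paper:

```python
def mutual_detect(blog_words, sseed, S, R, C, F, max_rounds=100):
    """blog_words maps each blog to its word set; sseed is the spam seed.
    Returns the final spam blog set FSBlog, or an empty set on failure."""
    ablog = set(blog_words)
    word_blogs = {}
    for b, ws in blog_words.items():
        for w in ws:
            word_blogs.setdefault(w, set()).add(b)

    sblog = set(sseed)                        # Step 1
    sword = set()
    ublog, uword = ablog - sblog, set(word_blogs)

    def sbrate(b):  # guard against blogs with no words (assumption)
        return len(blog_words[b] & sword) / max(len(blog_words[b]), 1)

    for _ in range(max_rounds):
        for w in list(uword):                 # Step 2: word spam rates
            if len(word_blogs[w] & sblog) / len(word_blogs[w]) >= R:
                sword.add(w)
                uword.remove(w)
        for b in list(ublog):                 # Step 3: blog spam rates
            if sbrate(b) >= C:
                sblog.add(b)
                ublog.remove(b)
        if len(sblog) / len(ablog) >= F:      # Step 4: enough candidates
            # Step 5: keep the S * |ABlog| highest-rate blogs
            ranked = sorted(sblog, key=sbrate, reverse=True)
            return set(ranked[: int(S * len(ablog))])
        if not ublog:
            break
    return set()                              # Step 6: extraction failed

# Toy data in the spirit of Fig. 8: B3 joins the seed {B1, B2} via the
# spam words K1 and K2, while B4 stays non-spam.
blogs = {"B1": {"K1", "K2"}, "B2": {"K1", "K2", "K3"},
         "B3": {"K2", "K4"}, "B4": {"K5"}}
print(mutual_detect(blogs, {"B1", "B2"}, S=0.75, R=0.6, C=0.005, F=0.5))
```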

Because parameter S is the estimated rate of spam blogs, based on an observation of the data set, and the number of extracted spam blogs in step 5 is S × |ABlog|, the precision of spam filtering is equal to the recall if S is the exact spam rate of the data set. For example, when precision is P, the recall is also P, since S of all blogs are observed to be spam blogs and P of these spam blogs are extracted. Hence the F-measure, defined as (Precision × Recall × 2) / (Precision + Recall), is also P in this case because Precision = Recall = P.

Fig. 8 illustrates an example of the mutual detection process. In this example, blogs B_1 and B_2 are included in the spam seed and are appended to the initial spam blog list. Based on this list, word spam rates are calculated, and words over the threshold R are added to the spam word list. In this example, words K_1 and K_2 are spam words based on R = 0.6. Based on the spam words, blog spam rates are calculated, and blogs with spam rates greater than C are added to the spam blog list. In this example, blog B_3 is added to the spam blog list because C = 0.005. As illustrated, mutual detection works by updating the spam blog and spam word lists and by recalculating the spam rates of blogs and words.

3. Experiment in Filtering

To examine spam filtering performance, this paper investigates the results of filtering over approximately six months in a brief examination, and one day's updated data in a more detailed examination.

[FIGURE 8 OMITTED]

The parameters of the filtering method are as follows:

* S: rate of spam blogs in all blogs (the rate is defined by the results of sampling from all blogs)

* W: upper threshold of document frequency of word for clustering (60, 80, 100, 500)

* R: lower threshold of spam rate for words (0.4, 0.6, 0.8)

* C: lower threshold of spam rate for blogs (0.001, 0.005, 0.01, 0.05)

* F: rate of final detected blogs in all blogs (0.2, 0.3, 0.4, 0.5, 0.6, 0.7)

To find the most effective combination of the parameters' values, a preliminary experiment was conducted and standard parameter values were determined (Table 1). Using the standard parameters, updated blog data was filtered every day beginning January 1st, 2008.

3.1. Preliminary Experiment

The preliminary experiment employed approximately six months of data, from January 1st to June 21st, 2008, comprising 3,007,005 unique blogs. These were selected from a huge database of updated blogs begun in 2005 and from 25 RSS feeds of blog sites and ping servers. To investigate filtering precision, 20 samples of the detected spam blogs (selected from the first day of each month) were checked manually and judged as spam or not. The observed rates of precision, or F-measure, varied between 75% and 95% (1/1: 85%, 2/1: 95%, 3/1: 90%, 4/1: 85%, 5/1: 75%, 6/1: 80%).

3.2. Parameters for Spam-Seed Extraction

To verify that the standard parameters are effective, filtering precision was investigated using one day's data on August 20th, 2008. The parameter checked here is "W: upper threshold of document frequency of word for clustering." To verify the spam rate in the spam seed, which was extracted from all blog data with bipartite clustering, the precision of each value of W (60, 80, 100, 500) was observed. W = 100 was determined to be the most effective value with a precision rate of 85% (Fig. 9).

To verify the spam blog distribution in the case of W = 100, the blogs in the extracted co-occurrence clusters were sorted in descending order of spam score and divided into 10 groups; the spam score of group 1 is high, whereas that of group 10 is low. According to a sampling observation, spam rates among the 10 groups range from 40% to 100%. The precision rates of groups 1 and 2 are both 100%, which means that a group with a high spam score consists almost entirely of spam blogs (Fig. 10). This paper uses groups 1 to 5 as the spam seed. (Half of the groups are employed for automated spam-seed selection, although it is worth noting that the spam rate of group 5 may be unusually low for one day's data.)

[FIGURE 9 OMITTED]

[FIGURE 10 OMITTED]

3.3. Parameters for Mutual Detection

The parameters for mutual detection are R, the lower threshold of the spam rate for a word; C, the lower threshold of the spam rate for a blog; and F, the rate of final detected blogs among all blogs. When altering R in the standard parameter set, the most effective value of R is 0.6 (Fig. 11). In the case of R = 0.8, mutual detection cannot detect the estimated number of spam blogs (defined by S) from the updated blog data, due to the narrow condition on R. When altering C in the standard parameter set, the most effective value of C is 0.005 (Fig. 12).

In the case of C = 0.05, mutual detection cannot detect the estimated number of spam blogs (defined by S) from the updated blog data, due to the narrow condition on C. Moreover, when altering F in the standard parameter set, the most effective value of F is 0.5 (Fig. 13).

[FIGURE 11 OMITTED]

[FIGURE 12 OMITTED]

[FIGURE 13 OMITTED]

[FIGURE 14 OMITTED]

3.4. Parameter Adjustment for Change in Number of Spam Blogs

To estimate the number of spam blogs in one day's data (August 20th, 2008), 100 samples were selected from the data and 14 spam blogs were counted, which suggests an estimated spam rate of 14%. This is lower than the 20% observed in the preliminary experiment in 3.1. Hence, with S = 0.14 in the standard parameters, the most effective value of F can be re-determined; according to the results, it is 0.4 (the precision, or F-measure, is 95%) (Fig. 14). To maintain a high rate of precision when the number of spam blogs increases or decreases, parameter S can be re-defined by observing the spam rate in the data, and parameter F can be adjusted accordingly.

4. Discussion

The spam filtering method in this paper uses spam seeds, which are extracted as co-occurrence clusters arising from copied-and-pasted content and multiple postings; mutual detection is based on this seed. One advantage of this method is that it does not require the huge training data set required by machine-learning mechanisms, e.g. the SVMs employed for spam blog filtering in [5][6]. To clarify this advantage, experiments are provided in the following subsections.

4.1. Spam Filtering with SVM

To illustrate the advantages and disadvantages of machine learning, an SVM is employed to filter spam blogs in the data set of August 20th, 2008, which was also processed in section 3.

The data set is divided into two subsets. One is the spam data set, which is the spam seed explained in 3.2. The other is the unknown data set, the remainder of all blogs in the entire data set excluding the spam data set. The spam data set contains 5,286 blogs; the unknown data set contains 205,702 blogs.

The blogs in the spam data set belong to groups 1 to 5 in Fig. 10. The figure shows that the data set contains a number of non-spam blogs, due to the automatic extraction with co-occurrence clusters. The estimated number of spam blogs is approximately 4,175, and the estimated number of non-spam blogs is approximately 1,110.

To select words as features for the SVM, document frequency (DF) is employed to find words that are effective for filtering spam blogs. Applying this measure to the spam data set, the top 100 words are selected as features to represent the characteristics of each blog.

Each blog in the spam data set is represented using these selected features. As a result, the spam data set is expressed as a two-dimensional matrix of 5,286 blogs by the term frequencies of 100 words.
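The DF-based feature selection and the blog-by-term-frequency matrix described above can be sketched as follows. This is a minimal stdlib illustration, assuming each blog is already tokenized into a list of words; it is not the paper's actual implementation.

```python
from collections import Counter

def select_df_features(blogs, k=100):
    """Pick the k words with the highest document frequency (DF),
    i.e., the number of blogs in which each word appears at least once."""
    df = Counter()
    for words in blogs:          # each blog is a list of word tokens
        df.update(set(words))    # count each word once per blog
    return [w for w, _ in df.most_common(k)]

def tf_matrix(blogs, features):
    """Represent each blog as term frequencies over the selected features,
    yielding a blogs-by-features matrix."""
    index = {w: j for j, w in enumerate(features)}
    matrix = []
    for words in blogs:
        row = [0] * len(features)
        for w in words:
            if w in index:
                row[index[w]] += 1
        matrix.append(row)
    return matrix
```

Applied to the spam data set with k = 100, `tf_matrix` yields the 5,286-by-100 matrix used to train the SVM.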

To run the SVM on the data sets, R (1) and kernlab (2) are employed, which are a statistical computing environment and a kernel-based machine-learning package, respectively. The SVM function in the package, i.e., ksvm, is used with default parameters and 3-fold cross-validation (cross = 3).
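The 3-fold split implied by the cross = 3 setting can be sketched as follows. This is a stdlib illustration of the fold construction only; the paper itself relies on R's ksvm to perform both the cross-validation and the classification.

```python
import random

def three_fold_indices(n, seed=0):
    """Shuffle n sample indices and deal them into 3 folds, as in
    3-fold cross-validation (ksvm's cross = 3 setting). Each fold
    serves once as validation data while the other two train the model."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::3] for i in range(3)]
```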

4.2. Filtering within Spam Data Set

To show an advantage of machine learning, i.e., high precision of data separation, the spam data set is used for training and testing the SVM. The spam data set, which contains 5,286 blogs, is randomly divided into two subsets: a training set of 2,643 blogs for the SVM and a test set of 2,644 blogs for evaluating it.

To prepare a training data set for SVM learning that contains 2,643 spam and 2,643 non-spam blogs, 2,643 blogs are randomly selected as non-spam blogs from the unknown data set of 205,702 blogs, although some of the selected blogs may in fact be spam. The selected blogs are added to the spam subset, yielding a training set of 5,286 blogs, each labeled as either "spam" or "non-spam." The SVM is then trained with this training data set.
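The construction of the balanced training set described above can be sketched as follows. This is a hypothetical stdlib helper, assuming blogs are represented by identifiers; as the text notes, some blogs drawn from the unknown set may actually be spam despite the "non-spam" label.

```python
import random

def build_training_set(spam_blogs, unknown_blogs, seed=0):
    """Build a balanced SVM training set: half of the spam seed plus an
    equal number of blogs drawn at random from the unknown data set,
    which are labeled "non-spam" (some may in fact be spam)."""
    rng = random.Random(seed)
    spam_half = rng.sample(spam_blogs, len(spam_blogs) // 2)
    pseudo_nonspam = rng.sample(unknown_blogs, len(spam_half))
    data = ([(b, "spam") for b in spam_half] +
            [(b, "non-spam") for b in pseudo_nonspam])
    rng.shuffle(data)
    return data
```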

After training with the ksvm function, the SVM is tested using the test data set, i.e., the remainder of the spam data set after excluding the blogs in the training data set. Based on the spam distribution in Fig. 10, the estimated numbers of spam and non-spam blogs in the test data set are 2,089 and 555, respectively. The trained SVM divides the test data set into two groups, spam and non-spam, containing 1,991 and 654 blogs, respectively. 20 blogs are randomly selected from the spam group and checked manually; all 20 are spam blogs. Therefore the precision of the SVM is 100%, even though the test data set contains approximately 555 non-spam blogs. This high precision indicates an advantage of machine learning: when the test data are similar to the training data, machine learning can separate the test data with high precision.

4.3. Filtering Entire Data Set

To show a disadvantage of machine learning, i.e., lower precision on test data of a different type from the training data, the entire data set is used for training and testing the SVM.

To prepare a training data set, the spam data set containing 5,286 blogs and 5,286 blogs randomly selected from the unknown data set are combined. The SVM is trained on this training data set. The trained SVM then divides the unknown data set of 205,702 blogs into two groups, spam and non-spam, containing 32,743 and 167,676 blogs, respectively. 20 blogs are selected randomly from the spam group and checked manually; 13 of the 20 are spam blogs. Therefore, the precision of the SVM is 65%. Feature selection with 1,000 words (instead of 100) was also examined, and the precision remained 65%, indicating that increasing the number of features does not improve filtering precision. This rather low precision illustrates a disadvantage of machine learning: when the test data differ in type from the training data, machine learning cannot separate the test data precisely.
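The precision estimates in 4.2 and 4.3 are obtained by manually checking a random sample of 20 blogs from the predicted spam group. The following sketch computes that estimate together with a rough 95% normal-approximation confidence interval; the interval is an addition of this sketch, not something reported in the paper, and it illustrates how much uncertainty a 20-blog sample carries.

```python
import math

def sampled_precision(hits, n):
    """Precision estimated from a manually checked random sample of
    predicted-spam blogs, with a rough 95% normal-approximation
    confidence interval (clamped to [0, 1])."""
    p = hits / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)
```

For the 13-of-20 check in 4.3 this yields a point estimate of 0.65 with a wide interval, which suggests the 65% figure should be read as approximate.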

4.4. Similarity between Training and Test Data Sets

When researchers employ SVMs and evaluate the results of data separation, they typically use a set of blogs manually marked as "spam" or "non-spam" and divided into two data sets for training and testing. The similarity between the two sets tends to be very high because both are randomly drawn from the same manually marked data set. As a result, an SVM trained on the training set can separate spam blogs in the test set with high precision; for example, the result in 4.2 of this paper indicates 100% precision. This result rests on using the same type of data for both training and testing. In this paper, the set is collected based on co-occurrence clusters, whereas other researchers collect and mark data manually. Hence, high precision on training and test sets derived from a single original data set may be a somewhat trivial result.

However, previous research does not test on different types of data sets, collected from different points of view, unlike the unknown data set in this paper, which is not organized into large co-occurrence clusters. Using such a different data set, this paper shows that the precision of the SVM is not high, i.e., 65%. Hence, machine learning is not very effective at separating spam in an unknown data source that differs from the training data.

4.5. Advantage of Mutual Detection

Machine learning is an effective method for identifying spam blogs if their characteristics do not change frequently. However, spam blogs vary and change easily. Hence, to maintain filtering precision, new learning data and frequent re-learning are required. The manual work of preparing huge volumes of learning data is costly, although Lin et al. [6] developed an annotation tool to collect spam blogs as machine-learning data.

With the method in this paper, it is relatively easy to maintain high precision by periodically sampling the spam rate and adjusting the two parameters. For example, when the spam rate changes from 20% to 14%, filtering precision can be maintained by redefining parameter S as 0.14 instead of 0.20 and adjusting F for this S as in Fig. 14.

5. Summary

This paper developed a spam filtering method based on the extraction of co-occurrence cluster seeds and the mutual detection of spam blogs and spam words. Based on a preliminary experiment using six months' data and a more detailed experiment using one day's data, the maximum precision and F-measure of filtering were 95%. The advantage of this method over a machine-learning approach to spam blog filtering is also supported by the experiments with SVM.

Acknowledgments

This research was supported in part by Grants-in-Aid for Young Scientists (B) (No. 20700222, Representative: Kazunari Ishida) from Japan Society for the Promotion of Science (JSPS).

Received: 14 June 2009; Revised 11 September 2009; Accepted 1 October 2009

References

[1] Bramer, M. (2007). Principles of Data Mining, Springer-Verlag London Limited.

[2] Chakrabarti, S. (2002). Mining the Web, Morgan Kaufmann.

[3] Ishida, K. (2007). Mining Collective Opinions from Blogospheres, In: Proc. of International Conference on Cybernetics and Information Technologies, Systems and Applications (CITSA 2007), V. 1 pp. 154-161, July 12-15.

[4] Rosen, K. H., Michaels, J. G., Gross, J. L., Grossman, J. W., Shier, D. R. (1998). Handbook of Discrete and Combinatorial Mathematics, CRC Press, July 31.

[5] Kolari, P., Finin, T., Joshi, A. (2006). SVMs for the Blogosphere: Blog Identification and Splog Detection, In: Proc. of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pp. 92-99, Stanford.

[6] Lin, Y., Chen, W., Shi, X., Sia, R., Song, X., Chi, Y., Hino, K., Sundaram, H., Tatemura, J., Tseng, B. (2006). The Splog Detection Task and a Solution Based on Temporal and Link Properties, In: Proc. of the Fifteenth Text-Retrieval Conference (TREC 2006).

Faculty of Applied Information Science

Hiroshima Institute of Technology

k.ishida.p7@it-hiroshima.ac.jp

(1) http://www.r-project.org/

(2) http://cran.r-project.org/web/packages/kernlab/

Table 1. Standard Parameters

S     W     R     C     F
0.20  100   0.60  0.01  0.50

Author: Ishida, Kazunari

Publication: Journal of Digital Information Management

Article Type: Technical report

Date: Apr 1, 2010
