# Mining Association Rules for Constructing a Network Intrusion Detection Model

Introduction

As network-based computer systems play increasingly vital roles in modern society, they have become targets for adversaries and criminals. Therefore, we need to find the best ways possible to protect our systems. The security of a computer system is compromised when an intrusion takes place. An intrusion can be defined [1] as "any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource", for example, illegally gaining super-user privileges or rendering a system out of service (i.e., denial of service).

The term "data mining" means many different things to different people. Originally, the term was attached to blind, undirected search for probabilistic associations within large data sets. In [2], it is said that, "many data mining algorithms come from the fields of machine learning and statistics. A common view in machine learning is that machine learning algorithms performs a search (typically heuristic) through a space of hypotheses (patterns) that explain (are valid in) the data at hand. Similarly, the data mining algorithms can be viewed as searching, exhaustively or heuristically, a space of patterns in order to find interesting patterns that are valid in the given data."

A conceptually simple yet interesting example is that of analysing a large database of supermarket transactions with the aim of finding association rules. This is called association rule mining or market basket analysis [3]. It involves searching for interesting customer habits by looking at associations. The classical example is the one where an American store was reported to have discovered that people buying nappies tended also to buy beer. It is not quite clear whether this is really true, but it makes an interesting example!

Association rule mining has many applications beyond market basket analysis, including marketing, customer segmentation, medicine, e-commerce, classification, clustering, and the detection of fraud, waste and abuse. In this paper, the aim is to discover significant association rules for constructing an efficient network intrusion detection model, based on the Apriori algorithm applied to the KDDCup'99 dataset. Instead of a single minimum support, mining association rules with multiple minimum supports is proposed to build the network intrusion detection model. Since the number of discovered association rules can be very large, pruning techniques along with various interestingness measures are applied to select the relevant and interesting rules for the intrusion detection process. Experimental results show that this technique is effective in discovering significant association rules.

The rest of the paper is organised as follows: Section 2 presents association rule mining and the problem statement, along with the main aspects of the Apriori and MS Apriori algorithms. Pruning techniques and interestingness measures are presented in Section 3. Section 4 describes the methodology used. Section 5 describes the effective network intrusion detection model based on the experimental results, followed by a discussion of related work in Section 6. Finally, the conclusion and references are given in Sections 7 and 8, respectively.

Association Rule Mining

Association rule mining finds interesting association or correlation relationships among a large set of data items [4]. Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold [5]. A more formal definition is the following [6]. Let I = {I1, I2, I3, ..., Im} be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ I. An association rule is an implication of the form X → Y, where X, Y ⊂ I are sets of items called itemsets, and X ∩ Y = ∅. X is called the antecedent and Y the consequent; the rule means X implies Y.

There are two important basic measures for association rules: support (s) and confidence (c). Since the database is large and users are concerned only with frequently occurring items, thresholds of support and confidence are usually predefined by users to drop rules that are not interesting or useful. The two thresholds are called minimal support and minimal confidence, respectively; additional constraints on interesting rules can also be specified by the users. The two measures can be defined as follows.

The support (s) of an association rule is the ratio (in percent) of the records that contain X ∪ Y to the total number of records in the database. Support is the statistical significance of an association rule. The confidence (c) is the ratio (in percent) of the number of records that contain X ∪ Y to the number of records that contain X. Confidence is a measure of a rule's strength, and a large confidence is often required for association rules. The problem of mining association rules can be decomposed into two subproblems [7], as stated in Algorithm 1, shown in Fig. 1.
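As an illustration, the two measures can be computed directly over a toy transaction database (the transactions and item names below are invented for the example):

```python
# Toy transaction database of item sets.
transactions = [
    {"tcp", "http", "SF"},
    {"tcp", "http", "SF"},
    {"udp", "domain_u", "SF"},
    {"tcp", "smtp", "SF"},
]

def support(itemset, db):
    # fraction of transactions containing every item of `itemset`
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    # conf(X -> Y) = support(X ∪ Y) / support(X)
    return support(x | y, db) / support(x, db)

s = support({"tcp", "http"}, transactions)       # 2 of 4 transactions
c = confidence({"tcp"}, {"http"}, transactions)  # 2 of the 3 containing "tcp"
```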

The first step in Algorithm 1 finds large, or frequent, item sets; item sets other than those are referred to as small. Here an item set is a subset of the total set of items of interest from the database. An interesting (and useful) observation about large item sets is the following: if an item set X is small, any superset of X is also small.

Of course the contrapositive of this statement (if X is a large item set then so is any subset of X) is also important to remember. The second step in Algorithm 1 finds association rules using the large item sets found in the first step. Here, it is desired to generate strong association rules from the frequent item sets; by definition, these rules must satisfy both minimum support and minimum confidence.

Figure 1: Basic association rule mining algorithm.

Algorithm 1. Basic Association Rule Mining
Input: I, D, s, c
Output: Association rules satisfying s and c
Method:
1) Find all sets of items which occur with a frequency that is greater than or equal to the user-specified threshold support s.
2) Generate the desired rules using the large item sets, which have user-specified threshold confidence c.

Association rule mining finds association rules that satisfy the predefined minimum support and confidence in a given database [7]. The problem is usually decomposed into two subproblems. The first is to find those item sets whose occurrences exceed a predefined threshold in the database; these item sets are called frequent or large item sets. The second is to generate association rules from those large item sets under the constraint of minimal confidence. Suppose one of the large item sets is Lk = {I1, I2, ..., Ik-1, Ik}; association rules over this item set are generated in the following way: the first rule is {I1, I2, ..., Ik-1} → {Ik}, and by checking its confidence this rule can be determined to be interesting or not. Further rules are generated by deleting the last item of the antecedent and inserting it into the consequent, and the confidences of the new rules are checked to determine their interestingness. This process iterates until the antecedent becomes empty. Since the second subproblem is quite straightforward, most research focuses on the first.
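The iterative rule-generation scheme described above can be sketched as follows (a minimal illustration; the database and threshold are invented):

```python
def gen_rules(itemset, db, min_conf):
    # Generate rules from one frequent itemset by repeatedly moving the
    # last antecedent item into the consequent, as described above.
    items = sorted(itemset)

    def supp(s):
        return sum(1 for t in db if set(s) <= t) / len(db)

    rules = []
    antecedent, consequent = items[:-1], items[-1:]
    while antecedent:
        conf = supp(antecedent + consequent) / supp(antecedent)
        if conf >= min_conf:
            rules.append((tuple(antecedent), tuple(consequent), conf))
        # shift the last antecedent item into the consequent
        antecedent, consequent = antecedent[:-1], antecedent[-1:] + consequent
    return rules

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
rules = gen_rules({"a", "b", "c"}, db, 0.3)
# yields ("a","b") -> ("c",) and ("a",) -> ("b","c")
```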

Apriori Algorithm

Apriori represents a great improvement in the history of association rule mining; the algorithm was first proposed in [7]. Apriori is more efficient during the candidate generation process for two reasons: it employs a different candidate generation method and a new pruning technique. Fig. 2 gives an overview of the Apriori algorithm for finding all frequent item sets in the database [8].

Apriori generates the candidate item sets by joining the large item sets of the previous pass and deleting those whose subsets were small in the previous pass, without considering the transactions in the database. By considering only the large item sets of the previous pass, the number of candidate large item sets is significantly reduced.

The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large item sets Lk-1 found in the (k-1)th pass are used to generate the candidate item sets Ck, using the Apriori candidate generation function described above. Next, the database is scanned and the support of the candidates in Ck is counted. For fast counting, an efficient determination of which candidates in Ck are contained in a given transaction t is needed; a hash-tree data structure [7] is used for this purpose. The apriori-gen function takes as argument Lk-1, the set of all large (k-1)-item sets, and returns a superset of the set of all large k-item sets, as described in [7].

Extended Association Rules

Previous work as described earlier has focussed on mining association rules in large databases with a single support. Since a single threshold support is used for the whole database, it is assumed that all items in the data are of the same nature and/or have similar frequencies. In reality, some items may be very frequent while others rarely appear, yet the latter may be more informative and more interesting. For example, some items in a supermarket, such as a food processor and a cooking pan, are sold less frequently but are more profitable [9]. Therefore, it might be very interesting to discover a useful rule food processor → cooking pan with a support of 2%.

Figure 2: Apriori algorithm.

Input: database D, minimum support s, minimum confidence c
Output: Rt, all association rules
Method:
    L1 = {large 1-itemsets};
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk-1);      // new candidates
        for all transactions t ∈ D do begin
            Ct = subset(Ck, t);      // candidates contained in t
            for all candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count ≥ min support}
    end
    Lf = ∪k Lk;
    Rt = GenerateRules(Lf, c)
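A minimal Python sketch of the frequent-itemset phase of Figure 2 follows (rule generation omitted; the join and prune steps are plain set operations rather than the hash-tree of [7], and the toy database is invented):

```python
from itertools import combinations

def apriori(db, min_sup):
    # Frequent-itemset phase of the Apriori algorithm (sketch).
    n = len(db)
    items = sorted({i for t in db for i in t})
    # L1: large 1-itemsets
    L = [frozenset([i]) for i in items
         if sum(i in t for t in db) / n >= min_sup]
    frequent = list(L)
    k = 2
    while L:
        # join step: L_{k-1} joined with itself
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must be large
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # one scan of the database to count candidate supports
        counts = {c: sum(c <= t for t in db) for c in candidates}
        L = [c for c, cnt in counts.items() if cnt / n >= min_sup]
        frequent.extend(L)
        k += 1
    return frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
freq = apriori(db, 0.5)  # three 1-itemsets and three 2-itemsets
```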

If the threshold support is set too high, rules involving rare items will not be found. To obtain rules involving both frequent and rare items, the threshold support has to be set very low. Unfortunately, this may cause combinatorial explosion and produce too many rules, because frequent items become associated with one another in all possible ways and many of the resulting rules are meaningless. This dilemma is called the "rare item problem" [9]. A single threshold support for the entire database is therefore inadequate for discovering important association rules, because it cannot capture the inherent natures and/or frequency differences of items in the database. In [9], the existing association rule model is extended to allow the user to specify multiple threshold supports. The extended algorithm is named MS Apriori, as shown in Figure 3. In this method, the threshold support is expressed in terms of the minimum item support (MIS) of the items that appear in the rule. The main feature of this technique is that the user can specify a different threshold support for each item; therefore, it can discover rare-item rules without causing frequent items to generate too many unnecessary rules.

MS Apriori Algorithm

Let Lk denote the set of large k-itemsets. Each item set c is of the form ⟨c[1], c[2], ..., c[k]⟩, consisting of items c[1], c[2], ..., c[k], where MIS(c[1]) ≤ MIS(c[2]) ≤ ... ≤ MIS(c[k]). The algorithm is illustrated in Figure 3.

Similar to conventional algorithms, MS Apriori generates all large item sets by making multiple passes over the data. In the first pass, it counts the supports of individual items and determines whether they are large. In each subsequent pass, it uses the large item sets of the previous pass to generate candidate item sets and, by computing the actual supports of these candidates, determines which of them are actually large at the end of the pass. However, the generation of large item sets in the second pass differs from other algorithms. A key operation in MS Apriori is the sorting of the items I in ascending order of their MIS values; this ordering is used in the subsequent operation of the algorithm.

Figure 3: MS Apriori algorithm.

    M = sort(I, MS);        /* according to the MIS(i)'s stored in MS */
    F = init-pass(M, T);    /* make the first pass over T */
    L1 = {⟨f⟩ | f ∈ F, f.count ≥ MIS(f)};
    for (k = 2; Lk-1 ≠ ∅; k++) do
        if k = 2 then C2 = level2-candidate-gen(F)
        else Ck = candidate-gen(Lk-1)
        end
        for each transaction t ∈ T do
            Ct = subset(Ck, t);
            for each candidate c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count ≥ MIS(c[1])}
    end
    Lf = ∪k Lk;
    Rt = GenerateRules(Lf, c)
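The first pass of Figure 3 can be sketched as follows. This is a deliberate simplification (the full init-pass of [9] is slightly more involved): items are sorted in ascending order of their MIS values, and an item is kept when its support reaches its own MIS. The toy database and MIS values are invented:

```python
def ms_large_1(db, mis):
    # Simplified first pass of MS Apriori: sort items by MIS, then keep
    # each item whose support meets its own minimum item support.
    n = len(db)
    items = sorted({i for t in db for i in t})
    items.sort(key=lambda i: mis[i])          # M = sort(I, MS)
    return [i for i in items
            if sum(i in t for t in db) / n >= mis[i]]

db = [{"a", "b"}, {"a"}, {"b", "c"}]
mis = {"a": 0.5, "b": 0.5, "c": 0.2}  # rare item "c" gets a lower MIS
large = ms_large_1(db, mis)
```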

Measuring Interestingness

The number of association rules discovered using Apriori is very large, and many of the discovered rules are not meaningful. Therefore, it is important to apply pruning and interestingness measures so as to obtain the most interesting generalised profile association rules. In this paper, we implement a template-based pruning technique and interestingness measures to select the most interesting profile rules among the set of generalised rules.

Template Based Pruning

One way of selecting interesting association rules is to specify template information in which we explicitly state what is interesting and what is not [10]. We implement template-based pruning to extract a relevant set of association rules, because not all rules that pass the minimum support and minimum confidence thresholds are interesting. This technique filters the discovered association rules, selecting the ones that match our template criteria while rejecting the others [11].
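A minimal sketch of such template-based filtering, assuming hypothetical attribute=value items and a template that keeps only rules predicting an attack class:

```python
def prune_by_template(rules, template):
    # Keep only the rules whose consequent satisfies the template predicate.
    return [r for r in rules if template(r[1])]

# Hypothetical (antecedent, consequent) rules over KDDCup'99-style items.
rules = [({"service=http"}, {"class=normal"}),
         ({"flag=S0"}, {"class=neptune"}),
         ({"service=http"}, {"flag=SF"})]

# Template: we are only interested in rules whose consequent is a class label.
kept = prune_by_template(rules,
                         lambda y: any(i.startswith("class=") for i in y))
```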

It is a fact that strong association rules are not necessarily interesting [12]. Several measures besides confidence have been proposed to better capture the correlation between X and Y, including chi-square testing, R-interestingness measures, lift, correlation, conviction and cosine.

Chi-Square Testing

To perform the chi-square test, a table of expected frequencies is first calculated using P(X) and P(Y) from the contingency table; the expected frequency for (X and Y) is given by the product P(X)P(Y). Summing over observed versus expected frequencies gives the test statistic. We use the chi-square (χ²) test to measure the degree of independence between two attributes by comparing their observed pattern of occurrence (actual support) with the expected pattern of occurrence (expected support). The χ² value is calculated as follows:

    χ² = Σ (fo − fe)² / fe

where fo represents an observed frequency (actual support) and fe an expected frequency (expected support). From the association rule information, the expected support and expected confidence can be calculated as shown in Table 2 and Table 4, using the contingency tables of observed support and observed confidence illustrated in Table 1 and Table 3, respectively. A contingency table is a frequency table in which a sample from the population is classified according to two or more attributes. The expected support/confidence is calculated as the product of the corresponding row and column totals divided by the grand total in the lower right cell of the observed support/confidence table.

The calculated value of χ² is compared with the table value of χ² for the given degrees of freedom at a specified level of significance. If, at the stated level (generally 5%), the calculated value of χ² exceeds the table value, the difference between theory and observation is considered significant, i.e. it could not have arisen from fluctuations of simple sampling. If, on the other hand, the calculated value of χ² is less than the table value, the difference is not considered significant, i.e. it is regarded as due to fluctuations of simple sampling and hence ignored.

The observed value of χ² is compared with a cut-off value read from a chi-square table or the SurfStat chi-square calculator. The degrees of freedom are calculated as (r−1)(c−1), where r and c represent the numbers of rows and columns in the contingency table under consideration. For a probability value of 0.05 with one degree of freedom, the cut-off value is obtained. If the χ² value is greater than this, X and Y are regarded as correlated with a 95% confidence level; otherwise, they are regarded as non-correlated, also with a 95% confidence level.
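For a 2×2 contingency table, the χ² computation described above can be sketched as follows (the counts are invented for the example; the 5% cut-off for one degree of freedom is 3.84):

```python
def chi_square(n11, n10, n01, n00):
    # n11 = #records with X and Y, n10 = X without Y,
    # n01 = Y without X, n00 = neither.
    n = n11 + n10 + n01 + n00
    row = [n11 + n10, n01 + n00]   # row totals (X present / absent)
    col = [n11 + n01, n10 + n00]   # column totals (Y present / absent)
    observed = [[n11, n10], [n01, n00]]
    chi = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n   # fe = row total * col total / n
            chi += (observed[i][j] - expected) ** 2 / expected
    return chi

# df = (2-1)(2-1) = 1, cut-off 3.84 at the 5% level; here the statistic
# exceeds the cut-off, so X and Y would be regarded as correlated.
chi = chi_square(30, 10, 10, 50)
```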

R-Interesting Measures

This technique uses the information provided in taxonomies to find the rules that are interesting relative to their ancestors, based on the assumption of statistical independence and the strength of the ancestor rule. R is a threshold value specified by the user: a rule X → Y is interesting if it passes the R-specified threshold with respect to its ancestors. This technique is based on the idea implemented in [13], which states that:

A rule X → Y is considered interesting in a given set of rules if it has no ancestors or if it is R-interesting with respect to its ancestor X → Ŷ. We call a rule X → Y R-interesting with respect to an ancestor X → Ŷ if the support of X → Y is R times the expected support based on X → Ŷ, or its confidence is R times the expected confidence based on X → Ŷ.
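This criterion can be sketched as a simple predicate (the supports, confidences and expected values passed in are assumed to be precomputed elsewhere):

```python
def is_r_interesting(sup, conf, exp_sup, exp_conf, R=1.1):
    # A rule is R-interesting w.r.t. its ancestor when its support or
    # confidence is at least R times the value expected from the ancestor.
    return sup >= R * exp_sup or conf >= R * exp_conf

keep = is_r_interesting(0.05, 0.90, exp_sup=0.04, exp_conf=0.88)
drop = is_r_interesting(0.04, 0.90, exp_sup=0.04, exp_conf=0.90)
```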

Lift

Lift(X → Y) = conf(X → Y)/P(Y); an equivalent definition is P(X, Y)/(P(X)P(Y)). Lift is a symmetric measure. A lift well above 1 indicates a strong correlation between X and Y; a lift around 1 says that P(X, Y) = P(X)P(Y). In terms of probability, this means that the occurrence of X and the occurrence of Y in the same transaction are independent events, hence X and Y are not correlated. Another definition can be found in [14].

Correlation

Correlation(X → Y) = [P(X, Y) − P(X)P(Y)] / √[P(X)P(Y)(1 − P(X))(1 − P(Y))]. Correlation is a symmetric measure. A correlation around 0 indicates that X and Y are not correlated; a negative value indicates that they are negatively correlated, and a positive value that they are positively correlated. Note that the denominator is positive and smaller than 1, so the absolute value |cor(X → Y)| is greater than |P(X, Y) − P(X)P(Y)|. In other words, even if the lift is around 1, the correlation can still be significantly different from 0.

Conviction

Conviction(X → Y) = [1 − P(Y)] / [1 − conf(X → Y)]. Conviction is not a symmetric measure. A conviction around 1 says that X and Y are independent, while conviction tends to infinity as conf(X → Y) tends to 1. Note that if P(Y) is high, 1 − P(Y) is small; in that case, even if conf(X → Y) is strong, the conviction of X → Y may be small.

Cosine

Cosine(X → Y) = P(X, Y)/√[P(X)P(Y)], where √[P(X)P(Y)] is the square root of the product P(X)P(Y). An equivalent definition is Cosine(X → Y) = |{ti containing both X and Y}| / √(|{ti containing X}| · |{ti containing Y}|). Cosine is a number between 0 and 1, because P(X, Y) ≤ P(X) and P(X, Y) ≤ P(Y). A value close to 1 indicates a good correlation between X and Y. In contrast with the previous measures, the total number of transactions n is not taken into account by the cosine measure: only the number of transactions containing both X and Y, the number containing X, and the number containing Y are used.

Methodology

The conceptual flow of our proposed model for discovering association rules is described below:

* Identify the variables for inclusion in the profile, such as, protocol, service, flag, attack class, etc.

* Apply Apriori algorithm to generate all association rules that have minimum support and minimum confidence.

* Apply the pruning techniques to select the suitable general rules that describe the behaviour of network intrusions.

* Apply MS Apriori algorithm to discover significant rules based on multiple support item sets.

* Perform the Chi-Square statistical test to measure the degree of independence among the attributes that are forming the rules.

* A significance test based on various other interestingness measures is used to select the most interesting rules among their ancestors.

* The most interesting rules are then used to construct the generalised intrusion detection profiles, which are stored in the database for off-line analysis. The fact that the rule set can be used to model "new" behaviour when gathering audit data suggests that we can use the final rule set directly to detect anomalies. Also, by mining and comparing specific patterns from audit data, we can discover whether certain known intrusion patterns exist [15].

Experimental Results and Discussions

In this section, we summarize our experimental results to detect network intrusion detections using Apriori algorithm over KDDCup'99 data set. We first describe the dataset used in this experiment and then discuss the results obtained. Finally, we evaluate our approach and compare the results with the results obtained by other researchers.

Dataset and Pre processing

Under the sponsorship of the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL), MIT Lincoln Laboratory has collected and distributed datasets for the evaluation of computer network intrusion detection systems [16, 17]. The DARPA dataset is the most popular dataset used to test and evaluate a large number of intrusion detection systems (IDSs). The KDDCup'99 dataset is a subset of the DARPA dataset prepared by Sal Stolfo and Wenke Lee [18]. The data set was pre-processed by extracting 41 features from the tcpdump data in the 1998 DARPA data set. The KDDCup'99 dataset can be used without further time-consuming pre-processing, and different IDSs can be compared with each other by working on the same dataset. Therefore, we carry out our experiments on the 10% KDDCup'99 dataset, which contains 65,525 connections.

Run with Single and Multiple Supports

Single Support

This is the learning engine for all the mining methods. Single support means that there is only one minimum support during learning; this setting corresponds to normal association rule mining. In this case, in order to address the skewed class distribution problem, the minimum support specified by the user is distributed over the class labels, so that each class label has its own minimum support. As association rule generation is exponential in nature, a hard memory limit is necessary to prevent the program from running beyond the machine's limits. In general, if the rule set is limited to 50,000 rules, 4-5 MB of memory is allocated.

Multiple Supports

Here, the upper limit, lower limit and step levels are to be specified; the appropriate values depend largely on the application. Multiple supports means that there is more than one minimum support during mining. With a single minimum support, associations involving rare items (of very low frequency) cannot be discovered, even though they can be very important to an application. The concept of multiple minimum supports allows a different minimum support to be assigned to each item. Of course, assigning multiple minimum supports manually could be a tedious process for the user; in our case, a multiple minimum support factor f is used to automatically compute the minimum support of each individual item as f × freq(item i).
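This MIS assignment can be sketched as follows, following the scheme of [9]; the floor value (a lowest allowable support) and the item frequencies are assumptions for illustration:

```python
def assign_mis(frequencies, f, lowest=0.001):
    # MIS(i) = f * freq(i), floored at a lowest allowable support.
    # (The floor value `lowest` is an illustrative assumption.)
    return {item: max(f * freq, lowest)
            for item, freq in frequencies.items()}

# Hypothetical item frequencies from a KDDCup'99-style dataset.
mis = assign_mis({"service=http": 0.6, "service=rare": 0.0001}, f=0.5)
```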

Experiments with the Apriori algorithm under single and multiple minimum supports are carried out on the 10% KDDCup'99 dataset using the Interestingness Analysis System (IAS) [19]. First, the Apriori algorithm is applied to the dataset with three different minimum support thresholds (≥ 10%) and a minimum confidence threshold ≥ 90%. Then, the Apriori algorithm with multiple minimum supports is applied to the dataset to generate significant association rules. The execution time is plotted in MATLAB 7.0 against the minimum support threshold, as shown in Figure 4. Finally, to select the set of interesting rules among the generalised rules, various interestingness measures are calculated and compared, as shown in the tables below. These experiments are carried out on an Intel Pentium 4 CPU, 2.8 GHz, 512 MB RAM PC.

Evaluation and Discussion

As observed from Figure 4, the execution times of the two algorithms differ for different values of the support factor on a data set of 65,525 connections. We can also notice that the Apriori algorithm has lower performance than MS Apriori.

[FIGURE 4 OMITTED]

The measures of interestingness given in the previous section differ not only in their definitions but also in their results: they do not rate the same sets the same way. They are illustrated as follows:

Chi-Square Test

In our case, the contingency table has 14 rows and 5 columns, which gives 52 degrees of freedom, as per the calculation discussed earlier. For 52 degrees of freedom at the 5% significance level, the cut-off χ² value obtained using the SurfStat chi-square calculator [20] is 69.83. Comparing each rule's calculated χ² value with this cut-off, the rules whose values exceed it are considered (their attributes being regarded as correlated), while the others are discarded.

R-Interesting Measures (R ≥ 1.1)

From Table 5, it is significant to consider the rules that satisfy these criteria. With an R threshold greater than or equal to 1.1, Rule 1 is considered interesting in comparison to the others, since it has no ancestors. It can also be seen that Rules 3, 6, 8, 9, 11, 13 and 14 are interesting, because their support and confidence are more than 1.1 times the expected support and confidence of Rules 2, 5, 7, 8, 10, 12 and 13, respectively, whereas the other rules, i.e. Rules 2, 4, 5, 7, 10 and 12, are considered redundant, since they do not conform to this interestingness criterion. Based on the information provided in Table 5, the R-interesting measure prunes away many of the less interesting rules.

Lift

From Table 6, it can be observed that the rules with high lift values are highly correlated with each other, while the rules with lift values around 1 are treated as more or less independent of each other.

Correlation

As all the rules shown in Table 6 have positive correlation values, they are treated as positively correlated. The greater the value of the correlation, the stronger the interrelation between them.

Conviction

It is observed from Table 6 that a conviction of infinite value indicates that the rule's confidence is 1, so the attributes are highly correlated. In the other cases, the degree of correlation is measured by the corresponding conviction values, which indicates that the confidence is not equal to 1.

Cosine

Finally, cosine is the only measure here that always rates X and Y as correlated. This is because the cosine calculation is independent of n, the size of the population, and considers only the number of transactions containing X, the number containing Y, and the number containing both. From Table 6, it can be observed that all rules have cosine values close to unity, which indicates that they are highly correlated.

As a result of choosing multiple minimum supports and the various interestingness measures discussed above, the large number of discovered rules is reduced to a small set of interesting rules, which is treated as a general profile of association rules for detecting network intrusions. It is also seen that the proposed method reduces the execution time significantly.

Related Work

In [15, 21], it is explained that by applying proper subintervals, the system reduces the length of the user records while keeping historical records of the activities in the database. Using the user records, the system generates a rule set for the activities within the network; at this stage it can notice irregularities and identify them. The problem with this approach is that the adaptability of the record system requires that someone always keep the rule sets up to date; including an automated adaptation feature in the intrusion detection system could be a big challenge.

In [22], the proposed graph-based induction (GBI) approach using entropy-based data mining aims to provide the system with the ability to detect newly encountered attacks; however, this claim is not supported by the experimental results in that paper. Association rule mining has been studied extensively in the past [22]; however, the model used in that work is based on only one user-specified minimum support threshold. [23] presents a generalised multiple-level association rule mining technique, where an association rule can involve items at any level of a hierarchy; however, the model still uses a single minimum support. In [24], comparing Apriori, Partition, DIC and Eclat, it is concluded that the algorithms show quite similar runtime behaviour and that no algorithm fundamentally beats the others.

It is easy to see that our algorithm, MS Apriori, is a generalisation of the Apriori algorithm for single-minimum-support mining: when all minimum support values are the same as the lowest minimum support, it reduces to the Apriori algorithm. For multiple minimum supports, however, it differs from the Apriori algorithm from the initialisation through candidate item set generation to the pruning of candidate item sets. It provides more flexibility along with the discovery of the most significant rules, which makes the proposed model an efficient and effective one for mining association rules in network intrusion detection systems.

Conclusion

This research has demonstrated the effectiveness of the proposed techniques for discarding unproductive rules (false discoveries). The techniques are directly applicable to any statistical hypothesis test, allowing users to identify and discard false discoveries using whatever definition of false discovery applies to their specific applications. In this paper, we investigated the interestingness of the association rules found in the KDDCup'99 dataset. The Apriori algorithm is used to look for strong association rules among the specified attributes in order to detect network intrusions and to show the effectiveness of the proposed model.

References

[1] Heady, R., Lugar, G., McCabe, A. and Servilla, M., 1990. "The Architecture of a Network Level Intrusion Detection System", Technical Report CS90-20, Computer Science Dept., University of New Mexico, Albuquerque, NM 87131, USA.

[2] Saso Dzeroski, 2001. "Data Mining in a Nutshell". In Saso Dzeroski and Nada Lavrac, Editors, Relational Data Mining, Springer Verlag, pp. 3-28.

[3] Agrawal, R., Imielinski, T. and Swami, A., 1993. "Mining association rules between sets of items in large databases", SIGMOD-1993, pp. 207-216.

[4] J. Han and M. Kamber, 2001. "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, San Francisco, USA, ISBN 1558604898.

[5] C. Gyorodi and R. Gyorodi, 2002. "Mining association rules using large databases", Proceedings of Oradea EMES'02, Oradea, Romania, pp. 45-50.

[6] R. Gyorodi and C. Gyorodi, 2002. "Architecture of data mining system", Proceedings of Oradea ENES'02, Oradea, Romania, pp. 141-146.

[7] Rakesh Agrawal and Ramakrishnan Srikant, 1994. "Fast algorithms for mining association rules in large databases", Proceedings of the 20th International Conference on VLDB, Santiago, Chile, pp. 487-499.

[8] Qiankun Zhao and Sourav S. Bhowmick, 2003. "Association rule mining: A survey", Technical Report No. 2003116, CAIS, Nanyang Technological University, Singapore.

[9] Bing Liu, Wynne Hsu and Yiming Ma, 1999. "Mining association rules with multiple minimum supports", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99), August 15-18, San Diego, CA, USA, pp. 337-341.

[10] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen and H. Mannila, 1995. "Pruning and grouping of discovered association rules", ECML-95 Workshop on Statistics, Machine Learning and Knowledge Discovery in Databases, pp. 47-52.

[11] Fatma E. Giha, Y.P. Singh and H.T. Ewe, 2006. "Mining generalised customer profiles", AIML'06 International Conference, 13-15 June, pp. 141-147.

[12] Tan, P.N., Kumar, V. and Srivastava, J., 2001. "Selecting the right interestingness measure for association patterns", 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, pp. 67-76.

[13] R. Srikant and R. Agrawal, 1995. "Mining generalized association rules", in Proceedings of the 21st International Conference on VLDB, Switzerland, vol. 13(2-3), pp. 161-180.

[14] Geoffrey I. Webb, 2006. "Discovering significant rules", Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 20-23, Philadelphia, USA, pp. 434-443.

[15] Wenke Lee, Salvatore J. Stolfo and Kui W. Mok, 2001. "Adaptive Intrusion Detection: A data mining approach", Artificial intelligence review, vol.14, pp.533-567.

[16] M. Mahoney and P. Chan, 2003. "An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection", RAID-2003, Pittsburgh, USA, Sept. 8-10, LNCS, vol. 2820, pp. 220-237.

[17] MIT Lincoln Laboratory, DARPA Intrusion Detection Evaluation, http://www.ll.mit.edu.

[18] Charles Elkan, 2000. "Results of the KDD'99 Classifier learning", SIGKDD Explorations, ACM SIGKDD, vol.1, Issue.2, pp.63-64.

[19] Michael Goebel and Le Gruenwald, 1999. "A survey of data mining and knowledge discovery software tools", SIGKDD Explorations, ACM SIGKDD, vol. 1, Issue 1, pp. 20-33.

[20] SurfStat Chi-Square Calculator:

[21] http://www.anu.edu.au/nceph/surfstat/surfstat-home/tables/chi.php

[22] W. Lee, Salvatore J. Stolfo and Kui W. Mok, 1998. "Mining audit data to build network intrusion detection models", in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD'98), New York, USA. American Association for Artificial Intelligence (www.aaai.org).

[23] Ken Yoshida, 2003. "Entropy based intrusion detection", in Proc. of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM 2003), vol. 2, IEEE, pp. 840-843.

[24] Park, J.S., Chen, M.S. and Yu, P.S., 1995. "An effective hash based algorithm for mining association rules", SIGMOD-95, pp. 175-186.

[25] Jochen Hipp, Ulrich Güntzer and Gholamreza Nakhaeizadeh, 2000. "Algorithms for association rule mining: A general survey and comparison", SIGKDD Explorations, ACM SIGKDD, vol. 2, Issue 1, pp. 58-64.

Mrutyunjaya Panda (1) and Manas Ranjan Patra (2)

(1) Department of ECE, G.I.E.T., Gunupur, 765022, India. E-mail: panda_mrutyunjaya@yahoo.com

(2) Department of Computer Science, Berhampur University, 760007, India. E-mail: mrpatra12@gmail.com

Table 1: Contingency table for observed support.

| Service/Flag | Normal | Nmap | Port sweep | Smurf | Neptune | Total |
|---|---|---|---|---|---|---|
| ntp_u | 151 | 0 | 0 | 0 | 0 | 151 |
| pop_3 | 29 | 0 | 0 | 0 | 0 | 29 |
| SH | 0 | 103 | 0 | 0 | 0 | 103 |
| auth | 84 | 0 | 0 | 0 | 0 | 84 |
| other | 15 | 0 | 0 | 0 | 0 | 15 |
| RSTOS0 | 0 | 0 | 10 | 0 | 0 | 10 |
| http | 29037 | 0 | 0 | 0 | 0 | 29037 |
| smtp | 3083 | 0 | 0 | 0 | 0 | 3083 |
| domain_u | 1236 | 0 | 0 | 0 | 0 | 1236 |
| ftp_data | 1079 | 0 | 0 | 0 | 0 | 1079 |
| ecr_i | 0 | 0 | 0 | 11258 | 0 | 11258 |
| finger | 156 | 0 | 0 | 0 | 0 | 156 |
| ftp | 146 | 0 | 0 | 0 | 0 | 146 |
| S0 | 0 | 0 | 0 | 0 | 416 | 416 |
| Total | 35016 | 103 | 10 | 11258 | 416 | 46803 |

Table 2: Contingency table for expected support.

| Service/Flag | Normal | Nmap | Port sweep | Smurf | Neptune | Total |
|---|---|---|---|---|---|---|
| ntp_u | 113 | 0.332 | 0.032 | 36.321 | 1.335 | 151.02 |
| pop_3 | 22 | 0.063 | 0.006 | 7 | 0.3 | 29.369 |
| SH | 77 | 0.226 | 0.022 | 25 | 0.916 | 103.164 |
| auth | 63 | 0.185 | 0.018 | 20.202 | 0.747 | 84.152 |
| other | 11 | 0.033 | 0.003 | 4 | 0.133 | 15.169 |
| RSTOS0 | 7 | 0.022 | 0.002 | 2.405 | 0.0891 | 9.518 |
| http | 21724 | 63.902 | 6.2 | 6983.398 | 258 | 29035.5 |
| smtp | 2306 | 6.785 | 0.658 | 741 | 27.438 | 3081.881 |
| domain_u | 925 | 2.72 | 0.264 | 297.258 | 11 | 1236.242 |
| ftp_data | 807 | 2.374 | 0.230 | 260 | 9.603 | 1079.207 |
| ecr_i | 8423 | 24.775 | 2.405 | 2708 | 100 | 11258.18 |
| finger | 117 | 0.343 | 0.033 | 38 | 1.388 | 156.764 |
| ftp | 110 | 0.321 | 0.031 | 35.113 | 1.229 | 146.764 |
| S0 | 311 | 0.92 | 0.09 | 100 | 3.702 | 415.712 |
| Total | 35016 | 103 | 10 | 11258 | 416 | 46803 |

Table 3: Contingency table for observed confidence.

| Service/Flag | Normal | Nmap | Port sweep | Smurf | Neptune | Total |
|---|---|---|---|---|---|---|
| ntp_u | 1 | 0 | 0 | 0 | 0 | 1 |
| pop_3 | 0.96667 | 0 | 0 | 0 | 0 | 0.96667 |
| SH | 0 | 1 | 0 | 0 | 0 | 1 |
| auth | 1 | 0 | 0 | 0 | 0 | 1 |
| other | 1 | 0 | 0 | 0 | 0 | 1 |
| RSTOS0 | 0 | 0 | 1 | 0 | 0 | 1 |
| http | 0.93538 | 0 | 0 | 0 | 0 | 0.93538 |
| smtp | 0.99903 | 0 | 0 | 0 | 0 | 0.99903 |
| domain_u | 0.99919 | 0 | 0 | 0 | 0 | 0.99919 |
| ftp_data | 0.99082 | 0 | 0 | 0 | 0 | 0.99082 |
| ecr_i | 0 | 0 | 0 | 0.98954 | 0 | 0.98954 |
| finger | 0.98734 | 0 | 0 | 0 | 0 | 0.98734 |
| ftp | 0.96689 | 0 | 0 | 0 | 0 | 0.96689 |
| S0 | 0 | 0 | 0 | 0 | 0.95853 | 0.95853 |
| Total | 9.84532 | 1 | 1 | 0.98954 | 0.95853 | 13.79339 |

Table 4: Contingency table for expected confidence.

| Service/Flag | Normal | Nmap | Port sweep | Smurf | Neptune | Total |
|---|---|---|---|---|---|---|
| ntp_u | 0.71377 | 0.0725 | 0.0725 | 0.07174 | 0.06949 | 1 |
| pop_3 | 0.68998 | 0.07008 | 0.07008 | 0.069349 | 0.06717 | 0.96667 |
| SH | 0.71337 | 0.0725 | 0.0725 | 0.07174 | 0.06949 | 1 |
| auth | 0.71337 | 0.0725 | 0.0725 | 0.07174 | 0.06949 | 1 |
| other | 0.71337 | 0.0725 | 0.0725 | 0.07174 | 0.06949 | 1 |
| RSTOS0 | 0.71337 | 0.0725 | 0.0725 | 0.07174 | 0.06949 | 1 |
| http | 0.66764 | 0.068 | 0.068 | 0.06710 | 0.06500 | 0.93358 |
| smtp | 0.7131 | 0.07242 | 0.07242 | 0.07167 | 0.06942 | 0.99903 |
| domain_u | 0.71319 | 0.072439 | 0.072439 | 0.0717 | 0.06943 | 0.99919 |
| ftp_data | 0.707228 | 0.072 | 0.072 | 0.07108 | 0.06885 | 0.99082 |
| ecr_i | 0.70630 | 0.072 | 0.072 | 0.07098 | 0.06876 | 0.98954 |
| finger | 0.70473 | 0.07158 | 0.07158 | 0.07083 | 0.06861 | 0.98734 |
| ftp | 0.69013 | 0.0701 | 0.0701 | 0.06936 | 0.06719 | 0.96689 |
| S0 | 0.68417 | 0.0695 | 0.0695 | 0.06876 | 0.06661 | 0.959 |
| Total | 9.8454 | 1.0078 | 1.00078 | 0.98951 | 0.95849 | 13.79339 |

Table 5: Applying interestingness measures to select interesting rules (Part I).

| No. | Rule | Support | Expected support | Confidence | Expected confidence | Chi-square | R ≥ 1.1 |
|---|---|---|---|---|---|---|---|
| 1 | Service=ntp_u → class=normal | 151 | 113 | 1 | 0.71734 | 50.778 | -- |
| 2 | Flag=SH → class=nmap | 103 | 0.226 | 1 | 0.71377 | 46839.447 | 0.9115 |
| 3 | Service=auth → class=normal | 84 | 63 | 1 | 0.71377 | 28 | 371.681 |
| 4 | Service=other → class=normal | 15 | 11 | 1 | 0.71377 | 5.454 | 0.238 |
| 5 | Flag=RSTOS0 → class=port sweep | 10 | 0.002 | 1 | 0.0725 | 49990 | 0.909 |
| 6 | Service=smtp → class=normal | 3083 | 2306 | 0.99903 | 0.7131 | 1038.8 | 1541500 |
| 7 | Service=domain_u → class=normal | 1236 | 925 | 0.99919 | 0.71319 | 415.56 | 0.5359 |
| 8 | Service=ftp_data → class=normal | 1079 | 807 | 0.99082 | 0.707228 | 363.677 | 1.1664 |
| 9 | Service=ecr_i → class=smurf | 11258 | 2708 | 0.98954 | 0.07098 | 35545 | 13.9504 |
| 10 | Service=finger → class=normal | 156 | 117 | 0.98734 | 0.70473 | 52 | 0.0576 |
| 11 | Service=ftp → class=normal | 146 | 110 | 0.96689 | 0.69013 | 47.78 | 1.2478 |
| 12 | Service=pop_3 → class=normal | 29 | 22 | 0.96667 | 0.68998 | 9.596 | 0.2636 |
| 13 | Flag=S0 → class=neptune | 416 | 3.702 | 0.95853 | 0.06661 | 46330.62 | 18.909 |
| 14 | Service=http → class=normal | 29036 | 21724 | 0.93538 | 0.66764 | 9774.79 | 7843.3279 |

Table 6: Applying interestingness measures to select interesting rules (Part II).

| No. | Rule | Lift (X → Y) | Correlation (X → Y) | Conviction (X → Y) | Cosine (X → Y) |
|---|---|---|---|---|---|
| 1 | Service=ntp_u → class=normal | 331.12582 | 1.001 | ∞ | 1 |
| 2 | Flag=SH → class=nmap | 485.436 | 1.014 | ∞ | 1 |
| 3 | Service=auth → class=normal | 595.238 | 1.02 | ∞ | 1 |
| 4 | Service=other → class=normal | 3333.333 | 0.9534 | ∞ | 1 |
| 5 | Flag=RSTOS0 → class=port sweep | 5000 | 3.65 | ∞ | 1 |
| 6 | Service=smtp → class=normal | 16.217 | 16.51 | 967.42 | 0.999 |
| 7 | Service=domain_u → class=normal | 40.45 | 1.008 | 1219.125 | 0.999 |
| 8 | Service=ftp_data → class=normal | 46.04 | 0.996 | 106.59 | 0.993 |
| 9 | Service=ecr_i → class=smurf | 4.397 | 0.995 | 74.09 | 0.996 |
| 10 | Service=finger → class=normal | 329.1 | 0.998 | 78.75 | 0.996 |
| 11 | Service=ftp → class=normal | 333.379 | 0.99698 | 30.11 | 0.983 |
| 12 | Service=pop_3 → class=normal | 1666.55 | 0.9843 | 29.98 | 1.05 |
| 13 | Flag=S0 → class=neptune | 115.48 | 0.9762 | 23.91 | 0.976 |
| 14 | Service=http → class=normal | 1.612 | 0.92 | 6.499 | 0.966 |
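The measures in Tables 3-6 can be recomputed from the contingency counts of Table 1. The sketch below uses the standard textbook definitions of support, confidence, lift, conviction and cosine; the paper does not reproduce its exact formulas here, so the resulting numbers need not match the tables cell for cell (the lift and cosine columns of Table 6, in particular, appear to follow different conventions).

```python
from math import sqrt, inf

def rule_measures(n_xy, n_x, n_y, n):
    """Textbook interestingness measures for a rule X -> Y, given the
    co-occurrence count n_xy, antecedent count n_x, consequent count
    n_y and total transaction count n."""
    p_xy, p_x, p_y = n_xy / n, n_x / n, n_y / n
    conf = n_xy / n_x
    return {
        "support": p_xy,
        "confidence": conf,
        "lift": conf / p_y,
        # conviction diverges when confidence is exactly 1
        "conviction": inf if conf == 1 else (1 - p_y) / (1 - conf),
        "cosine": p_xy / sqrt(p_x * p_y),
    }

# Counts for the rule Service=ntp_u -> class=normal, taken from Table 1
m = rule_measures(n_xy=151, n_x=151, n_y=35016, n=46803)
# confidence is 1, so conviction is infinite, matching the
# "[infinity]" conviction entries for the perfect rules in Table 6
```

Rules 1-5 all have confidence 1, which is why their conviction values are infinite regardless of which lift or cosine convention is used.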

Published in: International Journal of Applied Engineering Research, March 1, 2009.