
INCREMENTAL DISTANCE ASSOCIATION RULES MINING.

Byline: Ahmed Z. Emam and Lama A. Al-Zaben

ABSTRACT: Association Rules Mining (ARM) is mining a given dataset for items that frequently appear together in order to form rules, while Distance-Based Association Rule Mining (DARM) is an enhanced Apriori algorithm that also takes the distance between each pair of items within an itemset into consideration. It is crucial to maintain such discovered rules over large datasets; this is the main idea behind Incremental Association Rules Mining (IARM), which has recently received much attention from data mining researchers. In this paper we add an incremental feature to the distance-enhanced association rule mining algorithm. The proposed algorithm updates a set of distance preserving frequent itemsets to reflect changes made to the original dataset. This update is done by making use of previous knowledge that was generated from an earlier mining process on the original dataset and was kept for future updates.

This knowledge includes the number of occurrences of each frequent itemset, in addition to the mean and standard deviation of the distances within these itemsets' occurrences. When the original dataset is changed, access to the original dataset is needed only to gather information about itemsets that were not previously frequent and have the potential to be currently frequent. Finally, the resulting algorithm has been implemented in Weka.

I. INTRODUCTION

Association Rules Mining (ARM) is mining a given database for items that frequently appear together in order to form rules. Rules that are derived state that if some specific values have occurred then it is likely that some other specific values will occur too. In ARM, these values are called items and a set of values is an itemset.

This idea was initially introduced as "market basket analysis", in which experts wanted to know what items are usually bought together; for example, mining the transactional database of some supermarket might show that milk and bread are commonly bought together. Using this information can help in developing a marketing strategy to increase profits. ARM is not restricted only to business areas; rather it may reveal valuable information in many domains, and therefore a lot of research and algorithms have been developed to improve this field. To formalize the problem, a rule X → Y means that whenever X occurs, there is a chance that Y will occur too. The left-hand side of a rule is called the antecedent or premise and the right-hand side is called the consequent or succedent. An itemset can be of multiple sizes; a k-itemset is an itemset with k items. To state whether an itemset appears frequently, a user-defined threshold is needed.

This threshold is usually a percentage value defined by the user and multiplied by the number of transactions in the database. For an itemset to be frequent it needs to be supported by a set of transactions equal to or greater than this threshold, which is referred to as the minimum support. Moreover, rules that are formed from these frequent itemsets also have to pass a threshold to produce valuable knowledge.

This threshold is called the minimum confidence and is a percentage value defined by the user. The confidence of a rule is the ratio of the support of the whole itemset to the support of the antecedent part of the rule.
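In symbols, using the standard formulation consistent with the definitions above, the support of an itemset X over a database D and the confidence of a rule X ⇒ Y can be written as:

```latex
\mathrm{supp}(X) = \frac{\left|\{\, T \in D : X \subseteq T \,\}\right|}{|D|},
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
```

An itemset is frequent when its support meets the minimum support, and a rule is reported only when its confidence meets the minimum confidence.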

II. RELATED WORK

In this section we present three types of ARM. First we present the general association rule mining in which all we are concerned about is the presence of items in a transaction for a fixed size database. Then we present an algorithm in which not only the presence of an item is considered, but moreover the distance between specific items within a transaction is considered as well. Finally, we present an incremental association rule mining technique which is one of the solutions to the problem of maintaining frequent itemsets.

A. Apriori Algorithm

Apriori [2] is one of the earliest algorithms in ARM. The algorithm takes as input a set of transactions (a dataset), a user-defined minimum support and a minimum confidence. The output is a set of rules along with their confidence and support. Apriori is a level-wise algorithm; it starts by extracting frequent 1-itemsets and increases the size by one in each succeeding iteration. It uses prior knowledge gained from the previous iteration, hence its name. On each iteration it first forms the current level's candidate set, then after a database scan it prunes the infrequent itemsets so that the remaining itemsets are all the frequent itemsets of the current level. These frequent itemsets are self-joined in the next iteration to generate the candidates. Before scanning the database, Apriori prunes any itemset with an infrequent subset because it cannot be part of any frequent itemset.

This monotonicity property of Apriori helps in minimizing the set of candidates generated at each next level. The algorithm works as follows. In the first iteration it starts with a set of candidates composed of all items in the database. Then, a whole database scan is conducted to determine the support of each candidate. Next, candidates with a support below the minimum support threshold are pruned and the remaining candidates form the set of frequent 1-itemsets, labeled L1. The next level's candidates C2 are formed by self-joining L1 and are processed as C1 was to generate L2. Starting from the third iteration, candidates are formed by joining each pair of frequent itemsets from the previous level if their first k-1 items are identical and the kth item of the first itemset is less than the kth item of the second.

Furthermore, it removes candidates that contain an infrequent subset. Next, the process continues as in the first two iterations by scanning the database and pruning infrequent itemsets. The algorithm continues until the set of candidates is empty. Once the set of all frequent itemsets is available, all that is needed is to generate the set of rules. First, for each frequent itemset we generate all possible subsets and then form a rule by placing the subset in the antecedent part and all other items that belong to the same itemset in the consequent part. Such a rule is only kept if its confidence is equal to or greater than the minimum confidence threshold. Extracting frequent itemsets is the major part before generating the rules and consumes a lot of memory, CPU time and I/O operations. Generating rules, on the other hand, is a straightforward process; therefore much research has focused on improving the first part (a sketch of this frequent-itemset phase is given below).
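The level-wise process just described can be illustrated as follows. This is a minimal sketch, not the Weka implementation referred to later in the paper: the class and method names are our own, the minimum support is taken as an absolute count, and the self-join is simplified to the union of two frequent itemsets followed by the subset check, which yields the same candidates as the prefix join described above.

```java
import java.util.*;

// A minimal sketch of Apriori frequent-itemset mining as described above.
public class AprioriSketch {

    public static List<Set<SortedSet<String>>> apriori(List<Set<String>> transactions, int minSupport) {
        List<Set<SortedSet<String>>> levels = new ArrayList<>();

        // First iteration: count all 1-itemsets with one database scan.
        Map<SortedSet<String>, Integer> counts = new HashMap<>();
        for (Set<String> t : transactions)
            for (String item : t)
                counts.merge(new TreeSet<>(Collections.singleton(item)), 1, Integer::sum);
        Set<SortedSet<String>> current = prune(counts, minSupport);   // L1

        while (!current.isEmpty()) {
            levels.add(current);
            Set<SortedSet<String>> candidates = selfJoin(current);
            // One database scan per level to count the surviving candidates.
            counts = new HashMap<>();
            for (Set<String> t : transactions)
                for (SortedSet<String> c : candidates)
                    if (t.containsAll(c)) counts.merge(c, 1, Integer::sum);
            current = prune(counts, minSupport);                      // next level
        }
        return levels;
    }

    // Join frequent k-itemsets into (k+1)-candidates and drop any candidate
    // that has an infrequent k-subset (the monotonicity-based pruning).
    private static Set<SortedSet<String>> selfJoin(Set<SortedSet<String>> frequent) {
        Set<SortedSet<String>> candidates = new HashSet<>();
        for (SortedSet<String> a : frequent)
            for (SortedSet<String> b : frequent) {
                SortedSet<String> union = new TreeSet<>(a);
                union.addAll(b);
                if (union.size() == a.size() + 1 && allSubsetsFrequent(union, frequent))
                    candidates.add(union);
            }
        return candidates;
    }

    private static boolean allSubsetsFrequent(SortedSet<String> candidate, Set<SortedSet<String>> frequent) {
        for (String item : candidate) {
            SortedSet<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequent.contains(subset)) return false;
        }
        return true;
    }

    // Keep only candidates whose support count reaches the minimum support.
    private static Set<SortedSet<String>> prune(Map<SortedSet<String>, Integer> counts, int minSupport) {
        Set<SortedSet<String>> result = new HashSet<>();
        counts.forEach((itemset, count) -> { if (count >= minSupport) result.add(itemset); });
        return result;
    }
}
```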

Many algorithms have been developed based on the Apriori algorithm that reduce the number of database scans, such as FAST Apriori and others.

B. Distance-Enhanced Association Rule Mining

DARM [3, 4] is an enhanced Apriori algorithm that takes the distance between each pair of items within an itemset into consideration. For a given itemset, meeting the minimum support threshold is not enough. Each itemset maintains the variation of distances between each pair of items over all of its occurrences. For an itemset to be considered frequent, this variation of distances must not exceed a specific value, in addition to the itemset having a support greater than the minimum support. Variation in distances is captured using the coefficient of variation of distances (cvd).

An itemset has a cvd for each pair of numeric items that belongs to that itemset. Each cvd is the ratio between the standard deviation and the mean of the distances within that pair over every occurrence of the itemset. The DARM parameters are the minimum support, minimum confidence and maximum cvd. It works like Apriori, in two phases. In the first phase it generates frequent itemsets that deviate properly. "Deviate properly" is a property used to prune itemsets whose supersets will either fail to reach the minimum support threshold or will contain a pair of items with a cvd above the maximum cvd. For an itemset to deviate properly, the cvd between each pair of items needs to be below the maximum cvd in some subset of occurrences whose size is at least the minimum support threshold. Such a property has no effect on the output but was added to enhance the efficiency of the DARM algorithm by minimizing the number of candidates.
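As an illustration, the cvd for one pair of numeric items could be computed as in the minimal sketch below (our own naming, not the authors' code); the sample standard deviation is used because it matches the worked example given later in Section III.

```java
import java.util.List;

// Sketch: coefficient of variation of distances (cvd) for one pair of numeric
// items, computed over the distances observed in every occurrence of an itemset.
public class CvdSketch {

    public static double cvd(List<Double> distances) {
        int n = distances.size();
        double mean = distances.stream().mapToDouble(Double::doubleValue).sum() / n;
        double sumSquaredDev = 0.0;
        for (double d : distances) sumSquaredDev += (d - mean) * (d - mean);
        double stdDev = Math.sqrt(sumSquaredDev / (n - 1));  // sample standard deviation
        return stdDev / mean;                                 // cvd = std dev / mean
    }

    public static void main(String[] args) {
        // Distances 10, 4, 5 and 3 give mean 5.5, standard deviation ~3.11 and
        // cvd ~0.57, so this pair would violate a maximum cvd of 0.5.
        System.out.println(cvd(List.of(10.0, 4.0, 5.0, 3.0)));
    }
}
```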

Additionally, the max-cvd constraint is a non-monotonic property. In the next phase, it generates rules similarly to Apriori but with an additional step. Itemsets that have a support greater than the minimum support are removed if any cvd between any pair of their items is above the maximum cvd when all occurrences, not only a subset, are considered. The remaining itemsets are used to form Distance-based Association Rules (DAR). The DARM algorithm was introduced to analyze gene expression data. It can be further utilized in other domains such as finance, retail and many others. DARM is not restricted to distance; it can use any other measurement that needs to be captured between a pair of items, such as time or cost. The limitation here is that DARM considers only a single measurement. Association rule algorithms such as Apriori generate the list of association rules by processing the whole database.

Processing the whole database whenever a set of records is added or deleted is costly. Therefore a number of algorithms have been proposed to solve the issue of maintaining association rules for dynamic databases. The basic dilemma associated with updating the dataset is that adding a number of transactions increases the number of transactions in the whole dataset, which results in a higher minimum support threshold, so previously frequent itemsets may now become infrequent. The next sections explore the basic algorithms that are used as a basis for incremental association rule mining.

C. Fast Update Algorithm (FUP)

The first algorithm to maintain association rules against a dynamic database was FUP [5]. This algorithm is built upon Apriori [2] with the aim of minimizing the number of candidates as much as possible. This minimization is achieved by using previous information, namely the set of frequent itemsets with their support counts that was generated for the original database. To update the set of frequent itemsets, FUP separates old frequent itemsets from new, potentially frequent itemsets. The reason for such a separation is that the support counts of old frequent itemsets are known, and all that is needed is to scan the new set of records and count their occurrences there to be added to their old support. The result represents the support of a given itemset over the whole updated database. Thus, for old frequent itemsets, FUP can decide whether a candidate remains frequent or not without a database scan.

Unfortunately, this is not the case for new frequent itemsets, as we do not know their support in the original database; therefore a scan is needed. The FUP algorithm in this case tries to minimize the number of candidates as much as possible. In order for an itemset to be frequent in the updated database, it at least needs to be frequent in the newly added set of records. Therefore FUP first scans that set to identify itemsets that are frequent there. If there is any frequent itemset within the new set then the original database needs to be scanned. The best case is when there are no potentially frequent itemsets. As with Apriori, FUP is a level-wise algorithm which goes through an iterative process starting from itemsets of size 1 up to k-itemsets, and a set of frequent itemsets is generated in each iteration (a sketch of the first-iteration logic for the addition case is given below). The FUP algorithm deals only with addition; deletion also needs to be considered since it too changes the list of frequent itemsets.
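The following is a minimal sketch of that first-iteration logic, under our own naming and data structures: the kept knowledge is assumed to be each old frequent itemset mapped to its support count in the original database, and supports are handled as absolute counts.

```java
import java.util.*;

// Sketch of FUP's first iteration for the addition-only case.
public class FupAdditionSketch {

    // Old frequent itemsets: decidable by scanning only the increment and
    // adding its counts to the stored support from the original database.
    public static Set<Set<String>> stillFrequent(Map<Set<String>, Integer> oldSupport,
                                                 List<Set<String>> increment,
                                                 int originalSize, double minSup) {
        int threshold = (int) Math.ceil(minSup * (originalSize + increment.size()));
        Set<Set<String>> result = new HashSet<>();
        for (Map.Entry<Set<String>, Integer> e : oldSupport.entrySet())
            if (e.getValue() + countIn(increment, e.getKey()) >= threshold)
                result.add(e.getKey());
        return result;
    }

    // Previously infrequent candidates: only those frequent within the increment
    // itself can become frequent overall, so only they require a scan of the
    // original database (that scan is omitted from this sketch).
    public static Set<Set<String>> needOriginalScan(Set<Set<String>> candidates,
                                                    Map<Set<String>, Integer> oldSupport,
                                                    List<Set<String>> increment, double minSup) {
        int incThreshold = (int) Math.ceil(minSup * increment.size());
        Set<Set<String>> result = new HashSet<>();
        for (Set<String> c : candidates)
            if (!oldSupport.containsKey(c) && countIn(increment, c) >= incThreshold)
                result.add(c);
        return result;
    }

    private static int countIn(List<Set<String>> transactions, Set<String> itemset) {
        int n = 0;
        for (Set<String> t : transactions) if (t.containsAll(itemset)) n++;
        return n;
    }
}
```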

By deleting a set of records, some infrequent itemsets may become frequent, as the minimum support threshold is now lower due to the decrease in the number of transactions. On the other hand, some frequent itemsets may become infrequent because many of their occurrences were in the deleted set of records. An updated version of FUP is presented in the next section. Parthasarathy et al. (1999) proposed an active association rule mining algorithm that combined the features of Apriori and FUP, while utilizing a client-server mechanism that allowed the user to query interactively without accessing the dataset. As extensions and refinements of the aforementioned algorithms, several other algorithms were also introduced for mining association rules. These include a frequent pattern-growth method, multidimensional association rules, the HITS and LOGSOM algorithms, and parallel and distributed algorithms.

Interested readers should refer to Dunham (2003) and Kantardzic (2003) for further details of these algorithms.

D. Fast Update Algorithm 2 (FUP2)

In this section we present an updated version of FUP named FUP2 [6], which is also based on Apriori [2]. This algorithm is the same as FUP in the case of addition and is its complement in the case of deletion. In the case of deletion, the algorithm starts by forming a set of all 1-itemsets as candidates. Then it separates the candidate 1-itemsets into two lists. The first is the list of old frequent itemsets and the second contains all remaining candidates. After scanning the deleted set, the support count of each itemset within the decrement is known. For each itemset we need to determine whether or not it is frequent in the remaining set of instances in the database. For the first list, the algorithm subtracts each itemset's support in the decrement from its total support in the original database. For the second list, on the other hand, the support in the original database is not known. As in FUP, the objective is to minimize the number of candidates.

Therefore, the algorithm uses the fact that an itemset cannot be frequent in the updated database if it was infrequent in the original database and frequent in the decrement. So by scanning the decrement we can identify such itemsets and remove them. For the remaining itemsets a database scan is required to determine the newly frequent ones.

As in Apriori, all discovered frequent itemsets are joined to form the next level's candidates and the process is repeated. When adding and deleting are done at the same time, the candidates are divided similarly into old frequent itemsets and all remaining itemsets. The algorithm starts by scanning the decrement to determine the corresponding support count of every candidate. For old frequent itemsets, it checks whether there is a possibility that such an itemset will remain frequent. At this point it does not yet know their support in the increment, but at most they can occur in every transaction of the increment. Therefore, it can minimize the number of candidates by removing those that cannot reach the minimum threshold even in this best case. So it subtracts each itemset's support in the decrement from its support in the original database, adds the number of transactions in the increment, and prunes the itemset if the result is less than the minimum support for the updated database.

After scanning the increment, it computes the final support of all remaining candidates in the first list. To determine whether an itemset is frequent or not, the result is compared to the minimum support threshold for the whole updated database. The total support of an itemset in the second list is its support in the original database plus its support in the increment minus its support in the decrement. A database scan is needed to determine its support in the original database, so the number of candidates is minimized before scanning the database. Since the support of any itemset in the second list is below the minimum support threshold for the original database, an itemset is pruned if its support in the increment minus its support in the decrement is less than the minimum support threshold applied to the increment size minus the decrement size; such an itemset has no potential to be frequent (these pruning conditions are summarized in the sketch below).
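The pruning conditions in the last two paragraphs can be expressed as follows. This is a sketch under our own naming, with supports as absolute counts and minSup as the user's fractional threshold; it is not the authors' code.

```java
// Sketch of FUP2's first-iteration pruning bounds when records are both
// added (increment) and removed (decrement) at the same time.
public class Fup2PruningSketch {

    // Minimum support count required over the updated database.
    static long updatedThreshold(double minSup, int dbSize, int incSize, int decSize) {
        return (long) Math.ceil(minSup * (dbSize + incSize - decSize));
    }

    // Old frequent itemset, after scanning only the decrement: keep it as a
    // candidate only if it could still be frequent when every increment
    // transaction contained it (the best case).
    static boolean keepOldFrequent(int supportInDb, int supportInDec,
                                   double minSup, int dbSize, int incSize, int decSize) {
        long bestCase = (long) supportInDb - supportInDec + incSize;
        return bestCase >= updatedThreshold(minSup, dbSize, incSize, decSize);
    }

    // Previously infrequent itemset: its support in the original database is
    // below minSup * dbSize, so it can only become frequent in the updated
    // database if its gain in the increment outweighs its loss in the decrement.
    static boolean keepPreviouslyInfrequent(int supportInInc, int supportInDec,
                                            double minSup, int incSize, int decSize) {
        return supportInInc - supportInDec >= Math.ceil(minSup * (incSize - decSize));
    }
}
```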

For the remaining candidates in the second list a database scan is required to determine their total support. After defining the set of all frequent itemsets of size 1, the algorithm generates the next level of candidates as Apriori does, and the process is repeated. To improve efficiency, the number of candidates in the second iteration and beyond can be reduced before scanning the increment. To do this we need to keep track of the support of each itemset in the increment in addition to its total support. The increment support is used to determine an upper bound for each itemset in the next iteration: the minimum increment support of the two joined itemsets is the upper bound of the resulting itemset. Using this upper bound the number of candidates can be reduced. The algorithm treats this bound as if it were the support in the increment and deletes itemsets that cannot meet the minimum support threshold.

After that, the increment will be scanned to update the increment supports of the remaining set of itemsets. The algorithm will continue as described previously.

III. INCREMENTAL DISTANCE-BASED ASSOCIATION RULES MINING (IDARM)

The main goal of Incremental Distance-Based Association Rules Mining (IDARM) is to update a set of DARs when a number of records are added to or deleted from a dataset. The idea is to modify FUP2 to be applicable to DARs. We want to update a set of DARs without the need to apply DARM on the updated dataset as if it were a new set. The mining process can be minimized by making use of previous knowledge gained from the last mining process and by making use of FUP2 concepts. Basically, the framework of our algorithm is similar to that of the FUP2 algorithm, but what needs to be kept as knowledge after a mining process is the main problem of IDARM.

A. Frequent Itemsets vs. DPF Itemsets

In IDARM we make use of FUP2 concepts. Therefore a set of itemsets needs to be kept as part of the gained knowledge. FUP2 keeps the set of frequent itemsets with their supports, but in IDARM the issue is which set is sufficient: the set of frequent itemsets as in FUP2, or the set of distance preserving frequent (DPF) itemsets. Because the DPF set is part of the frequent set, keeping only the DPF set would be beneficial in terms of space. Unfortunately, it is not sufficient as previous knowledge in IDARM, because an itemset's cvd might increase or decrease whenever an occurrence is added or deleted.

Assume an itemset x is frequent but not DPF within a dataset. Then a set of records containing occurrences of x is added to this dataset in such a way that x is not frequent within the new set, but remains frequent within the whole updated dataset. Since the occurrences of x have increased, its cvd will change and may become lower than the maximum cvd. Therefore, if we keep only DPF itemsets as knowledge, then x (which is a DPF itemset within the updated dataset) would be missed, because x is not frequent in the new set of records and is not DPF in the original dataset. The following example clarifies the basic idea. Assume we have a dataset of 6 records, a minimum support of 50%, a minimum confidence of 70% and a maximum cvd of 0.5. A frequent itemset AB is found with 4 occurrences (a support of 66.6%), and the distances between A and B are 10, 4, 5 and 3.

The corresponding cvd is calculated as follows: the mean is 5.5, the standard deviation is 3.11, and the cvd is 0.57. Therefore, AB is not a DPF itemset since its cvd is greater than the maximum cvd. Later the dataset is updated by adding 6 records. The itemset AB has two occurrences in this new set of records, which makes it infrequent in the new set. As a result, if only DPF itemsets are kept, AB will not be found. On the other hand, if we mine DPF itemsets from the whole dataset (12 records) we find that AB is frequent (a support of 50%). If the distances between A and B in the new set of records are 7 and 8, the corresponding cvd is calculated as follows: the mean is 6.17, the standard deviation is 2.64, and the cvd is 0.43. AB becomes a DPF itemset within the updated dataset since its cvd is now less than the maximum cvd. As a result, we need to keep frequent itemsets, not only DPF itemsets, as knowledge for future updates.
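One way to make the stored per-pair mean and standard deviation updatable when new occurrences arrive is to carry them as running sums, as in the sketch below. This is our own illustrative representation, equivalent to keeping the count, mean and standard deviation, and not taken from the authors' Weka implementation; the example reproduces the numbers above.

```java
// Sketch: summary statistics per itemset pair that allow mean, standard
// deviation and cvd to be updated without rescanning the original dataset.
public class DistanceSummary {
    private long count;     // number of occurrences seen so far
    private double sum;     // sum of pairwise distances
    private double sumSq;   // sum of squared pairwise distances

    public void add(double distance) {
        count++;
        sum += distance;
        sumSq += distance * distance;
    }

    // Fold the statistics gathered on an increment into the stored knowledge.
    public void merge(DistanceSummary other) {
        count += other.count;
        sum += other.sum;
        sumSq += other.sumSq;
    }

    public double mean()   { return sum / count; }
    public double stdDev() { return Math.sqrt((sumSq - sum * sum / count) / (count - 1)); }
    public double cvd()    { return stdDev() / mean(); }

    public static void main(String[] args) {
        // Original dataset: distances 10, 4, 5, 3 (cvd ~0.57, not DPF for max cvd 0.5).
        DistanceSummary original = new DistanceSummary();
        for (double d : new double[]{10, 4, 5, 3}) original.add(d);
        // Increment: distances 7 and 8; merging gives mean 6.17, std dev 2.64, cvd 0.43.
        DistanceSummary increment = new DistanceSummary();
        for (double d : new double[]{7, 8}) increment.add(d);
        original.merge(increment);
        System.out.printf("mean=%.2f sd=%.2f cvd=%.2f%n",
                original.mean(), original.stdDev(), original.cvd());
    }
}
```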

Since we need to keep all frequent itemsets, the "deviate properly" DARM concept is not helpful in IDARM and thus is not considered.

We still need to distinguish between DPF itemsets and itemsets that are frequent but not distance preserving, because the generated rules must be derived only from DPF itemsets.

V. CONCLUSION AND FUTURE WORK

The dataset may be changed by the user either frequently or infrequently, and these changes modify the characteristics of the dataset, especially if one uses a distance-based association rule mining approach. Insert, delete and update are the most significant operations affecting any incremental mining process. In this research we have combined DARM [3, 4] and FUP2 [6] to obtain an incremental version of DARM. Our algorithm generates distance-based association rules in addition to the set of all frequent itemsets. Each itemset is augmented with a set of means and standard deviations in addition to its support count and the items it represents. The frequent set is used as knowledge to be accessed whenever the original dataset is updated. IDARM and DARM have been implemented in the Weka workbench. IDARM's performance will be close or equal to that of FUP2 if IDARM is applied to a set of instances that contains no numeric items.

Therefore, implementing IDARM is an implementation of FUP2 as well. The proposed algorithms were implemented using the Weka API, and synthetic datasets of different sizes were used to evaluate their performance. Several sets of experiments were conducted using synthetic datasets with different minimum supports; the confidence and accuracy of IDARM were close to those of DARM and better than traditional Apriori. IDARM shows significantly better execution time for finding the large itemsets than DARM and Apriori. The developed algorithm, using the incremental distance association rules mining technique, could reduce the time and computational power needed to mine a dynamic dataset staged over a multi-server, mirrored array, which would lead to more efficient processing.

In the future, IDARM could be implemented and tested over a set of servers distributed over a network, where these servers maintain mirrors of the same dataset that needs to be mined. Each server updates its dataset in isolation from the others. At some point these updates are sent to the main server as patches. The main server, using IDARM, can use previous knowledge to update its DARs and minimize the number of candidates when scanning the whole dataset. Further discussion is also needed on applying the IDARM technique in different application areas, such as security contexts; in other words, what types of systems would benefit most from adopting this technique and what that means for the chief information officer, analyst supervisor, policymaker or other beneficiary of this technical advance. Finally, future work should consider the criteria for setting significance thresholds and the influence of exogenous factors on dataset behavior over time, so that updates preserve the initial distance and variability characteristics of the data.

ACKNOWLEDGMENT

This work was supported by the Research Center of College of Computer and Information Sciences (CCIS), King Saud University, Saudi Arabia.

REFERENCES

[1] B. Liu, W. Hsu and Y. Ma, "Integrating Classification and Association Rule Mining," 1998.

[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," 1994.

[3] A. Icev, "DARM: Distance-Based Association Rule Mining," Master's Thesis, 2003.

[4] A. Icev, "Distance-enhanced association rules for gene expression," in BIOKDD'03, in conjunction with ACM SIGKDD, 2003.

[5] D. W. Cheung, J. Han, V. T. Ng and C. Y. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," 1996.

[6] D. W. Cheung, S. D. Lee and B. Kao, "A General Incremental Technique for Maintaining Discovered Association Rules," in Proceedings of the Fifth International Conference on Database Systems for Advanced Applications, 1997.

[7] C.-H. Lee, C.-R. Lin and M.-S. Chen, "Sliding-Window Filtering: An Efficient Algorithm for Incremental Mining," Information Systems, vol. 30, no. 3, pp. 227-244, May 2005.

[8] "Weka homepage," [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/.

[9] A. Savasere, E. Omiecinski and S. Navathe, "An efficient algorithm for mining association rules in large databases," 1995.

[10] B. Goethals, "Survey on frequent pattern mining," 2002.

[11] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2005.

[12] I. H. Witten, E. Frank and M. A. Hall, Data Mining: Practical machine learning tools and techniques, 3rd Edition ed., 2011.