# Automated Boolean Matrix data representation scheme through AVL tree for mining association rules.

Introduction

Mining Association rule is an appropriate and well researched tool of data mining to discover interesting relationships between variables in large databases. It was first introduced by (Sedgewick, et al., 1983), and has got widespread attention since the publication of his research and the introduction of the Apriori algorithm. This study has included generating association rules from frequent itemsets (Agrawal, et al., 1993; Mannila, et al., 1994; Agrawal, et al., 1996.). The initial research of this method was mainly driven by the analysis of market basket data. For example, from the transaction data of a supermarket, the rule {milk, egg} => {bread} is generated, which represents that customers who purchase milk and eggs also tend to buy bread as well. This relationship explains the behavior of consumers purchasing, and it can help managers to make better decisions in product promotion, shelf design, and customer relationship. Association rule mining has been applied to variant areas, including risk analysis in commercial institutions, medical sectors, basket data analysis, cross-marketing, catalog design, clustering, classification, etc.

In computer science and data mining, Apriori (Agrawal, et al., 1993; Agrawal, et al., 1996.) is a classic algorithm for learning association rules (Han, et al., 2001). Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation). In the first step, frequent itemsets are generated, i.e those itemsets which holds the criteria of minimum support (Mannila, et al., 1994). In the second and final steps, Rule generation is made possible by evaluation of the confidence measure.

In this study, we propose data representation scheme to improve the existing Apriori algorithm for mining association rules. The algorithm improves through representation scheme through AVL-Tree (a balanced tree) to provide more reduced time and condensed rules than the Apriori method. In addition, our method is able to get the last frequent itemsets and generate rules depending on data representation and Boolean Matrix. This paper is organized in six sections: section 2 reviewed related works, section 3 proposed an automated Boolean matrix data representation using AVL-Tree, section 4 demonstrated the benefits of data representation, section 5 presented the algorithms, section 6 discussed the experimental results, and section 7 concluded the proposed representation scheme.

Related work:

Basically, data that are related to businesses grow speedily, making data mining a reality and an essential field in our life, and has increased the interest in the database management system. In data mining, Apriori(Agrawal, et al., 1993; Agrawal, et al., 1996.) and FP-growth (frequent pattern growth) (Han, et al., 2001.)are the main techniques in Association Rule Mining (ARM) which is used to get sequences pattern for items in a database.

Mining frequent pattern is seemingly one of the most important approaches in data mining and is referred to the market basket analysis from the Association Rules Mining task (Yen, et al., 1996.). A pattern is a set of items or subsequences or substructures that occurs frequently in a data set.

An association rule has the support X [??] Y to be the percentage of transactions in the database that contain X and Y, and the formal definition of support(x) is based on (AggrawalandSrikant., 1996)work :

Let I = { i1, i2, ..., im) be set of m literals called items and the database D = (t1, t2, ..., tn) a set of n transactions, each consisting of a set of items from I. A transaction t E D is said to contain itemsetX if X [subset or equal to] t. Hence, the support of itemsetX is defined as in the equation (1) below (Chunjie, et al., 2009.):

support(X) = ft E D X [subset or equal to] {t E D} (1)

The Mining frequent pattern using Apriori algorithm has been used in many applications in data mining such as: Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Each transaction t [euro] D is an item set such that t is a proper subset of I. A unique identifier, which we call it as TID, is associated with each transaction. We say that a transaction t supports X, a set of items inI, if X is a proper subset of t. We assume that the items in a transaction or an itemset are sorted in lexicographic order.

An itemsetX has support s if s% of the transactions support X. Support of X is denoted as Supp(X). An association rule is an implication of the form

X [??] Y, where X and Y are subsets of I and X [intersection] Y = [empty set].

We say that the rule X [??] Y holds in the database D with confidence C if

confidence (X [right arrow] Y) = P(Y|X) = n(X [union] Y)/n(X) [greater than or equal to] C (2)

n( X)

(where [absolute value of A] is the number of occurrences of the set of items A in the set of transactions D, and A occurs in a transaction t, if and only if A is a proper subset of t). We also say that the rule X [??] Y has support S if

support (X [right arrow] Y) = p(X [union] = n(X [union] Y)/N [greater than or equal to] S (3)

where N is the number of transactions in D. Here, the support is a measure of the frequency of a rule, whereas, the confidence is a measure of the strength of the relationship between sets of items.

On the other hand, (Han, et al., 2006) has presented a new set based approach for getting frequent item sets. The best feature of one scan algorithm is that it requires only one scan of the database. They compared their algorithm with Apriori. In addition, (Gautan, et al., 2010) proposed MLBM algorithm only scans the transaction database once, and it adopts the Boolean vector to discover frequent itemset. Other authors (Zhu, et al., 2010)present the combination of the cloud model based on generalization method with Apriori algorithm.

(Coenen, et al., 2004) presented how to compute partial support counts in one DB-pass and how to store them in an enumeration tree (P-Tree). In addition, (Yu, et al., 2006) proposed a novel algorithm so called reduced apriori algorithm with tag (RAAT), which reduces one redundant pruning operation and compares with Apriori algorithm. On the other hand, (Gautan, et al., 2010) proposed Association Rule Based on Boolean Matrix algorithm, which only scans the transaction database once, and discovers frequent item set. In addition, (He, et al., 2004) proposed a frequent pattern discovery method for outlier detection.

AVL tree proposed by G.M Adelson-Velskii and E.M Landis (Sedgewick, 1983), and hence, its name Adelson-Velskii Landis, is a self-balanced tree; that means the heights of the children (any two sub-trees) can differ by at most one. Using an AVL tree is better than simple data structures such as vectors or arrays where complexity of running time usually takes the form of O(log n) time in both the average and worst cases (Gilberg, et al., 2005).

In this study, there are two main phases involved: firstly, a new transactional data representation scheme in the Boolean matrix form is proposed to help us to generate N- frequentitemset, and secondly the improvement to get association rules for the proposed Apriori algorithm, because of its efficiency to detect the last frequent itemsets.

Automated Boolean Matrix with Association Rules (ABMAR) involves several common steps in Apriori algorithm such as follows:

i. Compute unduplicated sorted n items and m transactions during loading process by AVL tree.

ii. Save all unduplicated n items in itemVector.

iii. Get the value of parameters: number_of_items, and number_of_rows automatically, then initialize settings for minSupport and minConfident.

iv. Use regular expression to replace items by itemVector's index.

v. Replacement process will reduce the capacity of Boolean Matrix.

vi. The columns of Boolean Matrix are discretized data while the number of rows is equivalent to the number of transactions.

vii. Apply BinApriori on Boolean Matrix, then generate n-Candidates and get m-Frequent itemsets.

viii. BinApriori will complete all frequent itemsets and save them with their minSupport values in AVL tree.

ix. Capture the last frequent and set confidence value.

x. obtain all possible subsets with their weight by set theory.

xi. Generate all possible association rules depend on weight and confidence value.

In general, the conceptual framework has involved two main phases: First, initial phase is executed by loading data from transaction table and scan the database once. The main phase (I) is called the data representation scheme to represent the data from transaction database to Boolean Matrix automatically. In addition, N-frequent itemsets are generated from the Apriori algorithm by using Boolean Matrix from the previous stage. The implementation has been carried out under Java JDK 1.6 platform and uses object oriented programming concept. Table 1 and Table 2 show the example of the first stage of phase (I). Supposed, if we have a synthetic data 'Sparse' which is a dataset with multi-scattered records with each record contains several data items.

Data representation starts after loading data that are saved in AVL-tree (balanced binary tree). The AVL-tree has several benefits such as saving balanced data, giving sorted data when using Inorder-Traversal-Tree. It also has an ability to save unduplicated data. The Inorder-Traversal-AVL-Tree process gives us sorted unduplicated data that are saved in vector or sorted unduplicated array. In addition, this array is used in data discretization which is a part of data reduction for a given continuous number of values, especially for numerical data. Replacing the values of the array by their index values is the discretization process

The first step is to set the value of minimum support number(S) to 0.22 and the required minimumconfidence (C) to 0.7. Table 1 contains the transaction items, and Table 2 contains the discretized data, which is a part of the data reduction for a given continuous number of values, especially for numerical data. The discretization process is done by replacing the index of array in Table 3 that is generated from AVL-tree, which gives unduplicated sorted frequent patterns.

Automatically, Boolean Matrix in Table 4 is generated by phase (I) - stage (I) in previous section by transferring discretized table to Boolean Matrix. In addition, the number of items in the whole dataset and the transaction table's rows are computed. In this situation, we have 5 items, and 9 transactions (rows). This approach has decided the number of itmes is equivalent to the number of columns, and the number of transaction is equivalent to the number of rows of the Boolean Matrix.

The data representation which is related to AVL-tree for sorting Frequent Pattern and Boolean(Binary) Matrix will be applied during Association Rules Mining.

Here ct counts all 1 's in each column, and r is the number of transactions (rows). In Table 4, the 10th row of the Boolean Matrix presents the count number of all 1's in column one, which is 6, where the value 6 is then divided by 9 transactions (rows) and equals to 0.66 by the following equation:

computed _ weight = count 1's in column/number of trans (rows) [greater than or equal to] S (4)

The last step of phase(I) is generation of rules that is based on the equivalence subset theory that can be represented as a set of specific rules. In this example, that last frequentitemset is {1, 2, 5}. Therefore, several rules generated are Rules1: 1 ^ 2 [right arrow] 5, Rules 2:1 ^ 5 [right arrow] 2, Rules 3: 2 ^ 5 [right arrow] 1, Rules 4:1 [right arrow] 2 ^ 5, Rules 5: 2 [right arrow] 1 ^ 5 and Rules 6:5 [right arrow] 1 ^ 2.

Benefits of data representation:

1. Transformation data from source-data-file to AVL tree that is a balanced data structure for getting non-duplicated items sorted and maintained actual elements of data, and increased processing efficiency.

2. Reduce computational processing time.

3. Obtain multi-data types of represented data storage such as tree, hash table or Binary Matrix.

4. Use AVL tree and Automated Binary Matrix to get Association Rules Mining.

5. Capture Frequent Pattern by apply Apriori algorithm.

The above Figures 3 and 4 show how to create AVL-Tree from the transaction database. The first tree inserts 1, 2, and 5 and rotates its nodes to keep data as balanced. During this process the duplicated data will ignore the insert to the tree. For this reason, we read the data and use AVL-tree to save balanced unduplicated of that data, during which the process of counting the items and transactions is done, and getting sorted unduplicated data by inorder function of AVL-tree. Thus, the sorted unduplicated data is saved in Vector that is used in discretization phase. The data of vector is replaced by index of vector, which starts from 1 to number of n-unduplicated items. In addition, the benefit of this operation is to reduce the occupation of discretized data in Boolean matrix, and hence, applying Apriori algorithm on this matrix to catch the n-frequent itemset.

Algorithms:

In this phase, a new transactional data representation scheme in the automated Boolean matrix form is proposed to help us to generate n-frequentitemset and get association rules. During this representation, the data read from AVL-tree to get unduplicated items and return the number of items and transactions automatically, then generate Boolean matrix, which put 1 if the item is available in original dataset, and otherwise put 0. Applying Apriori algorithm on this matrix to obtain n-frequentitemset is then performed.

The Figures 1, 2, 3 and 4 show the data representation of the Boolean Matrix from the source data file and then apply Apriori algorithm with Boolean Matrix to get n-frequent items and association rules, respectively.

Experimental Results & Discussion:

There are five datasets where two were obtained from the FIMI Repository and the other three datasets were obtained from the UCI Repository. Two synthetic datasets were chosen for the experimental testing in the research. The datasets were chosen based on number of transactions, average length and type (Sparse). The datasets are FIMI Repository's T10I4D100K and T40I10D100K data.

We ran our experiments by using PC with these specifications: Intel(R) Core(TM) Duo CPU 2.20 GHz with 2 GB RAM and Windows XP with 32-bit operating system. We also used the concept of object oriented programming by Java programming language to implement our code.

We used synthetic data 'Sparse', which is a dataset with multi-scattered records that are characterized by normal sparsity. Each record contains several attributes (i.e., columns in a database scheme), which can be viewed as a vector data. Sparsity is related to the average records that are no "similar" length in the multi- dimensional space and are defined by the attributes. The "Sparsity" is analytically well-constituted and related to "long-tail" circumstance: individual transactions attend to irregular sold items (Leskovecet al., 2007)

The developed system returned us the number of items and the number of transaction automatically without the user determining them (e.g. T40I10D100K has 942 items and 100000 rows).

Finally, our experiments computed the n-frequent itemset with their threshold minSupport. Table 8 shows n-frequent itemsets with their threshold minSupport of T10I4D100K dataset.

The process of generation of association rules starts with finding all subset of all the sets that are the last frequent itemset of a dataset. For example, if we choose (354, 494, 501, 696,732) with its threshold minSupportand confidence of 0.00507 and 70%, respectively, which resulted in the rejection of 6 rules and selection of 18 rules that are shown in Table 9.

Three datasets (anneal, which is synthetic and others, which are non-synthetic) obtained from the UCI Repository of machine learning databases repository were chosen for the next experimental testing in the research..

The composition of the matrix depended highly on the discretized data, which constitute the columns' data. Hence, the number of items and transactions were computed automatically using the AVL-tree with the support and confident numbers are set-up during the initial stage. The computational time of the data transformation from datasets to AVL-tree is between 0.109 seconds to 1.812 seconds, providing us with several benefits such as save balanced and sorted data.

Summarizing, data mining employs several techniques to extract association rules, such as Apriori and FP- Growth. Mining Association Rule is an appropriate tool and has been one of the well-researched approaches of data mining used to discover entertaining relationships between items in databases.

Our objective was how to improve efficiency and generate association rules by a new data representation through AVL-Tree from the transaction of data, and generate n-frequent itemsets. The benefit of AVL-Tree is that it keeps the data as balanced values. Thus, the proposed representation scheme transfers the 'dataset' to Automated Boolean Matrix (ABM) and then builds association rules.

Conclusion:

Basically data, which are business related, grow hastily, thus making data mining a reality and essential field in our life, where searching for meaningful information in huge databases has constituted a fundamental issue. That tells why mining association rules is a widespread technique in data miming.

This study has demonstrated that Apriori with data representation scheme can reduces computational time, improves the process of finding n-frequentitemset and generates association rules faster than the original Apriori. In addition, the database only being scan once, before being automatically transform to Boolean Matrix. Finally, applying Apriori to get the frequent itemset and their rules depend on minSupport and confidence values, which have controlled over the number of association rules.

References

Sedgewick, R., Algorithms, Addison-Wesley, 1983. ISBN 0-201-06672-6, page 199, chapter 15: Balanced Trees.

Agrawal, R. and R. Srikant, 1994.Fast algorithms for mining association rules in large databases. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile.

Mannila, H., H. Toivonen and A. Verkamo, 1994."Efficient algorithms for discovering association rules." AAAI Workshop on Knowledge Discovery in Databases (SIGKDD). Seattle, 181-92.

Agrawal, R. and R. Srikant, 1995. "Mining Sequential Patterns", Proc. Internal Conf. Data Engineering, pp: 3-14.

Agrawal, R., M. Mehta, J. Shafer, R. Srikant, A. Arning and T. Bollinger, 1996. "The Quest Data Mining System", Proc. of KDD'96, USA, pp: 244-249.

Yen, S.J. and A.L. P.Chen, 1996.An Efficient Approach to Discovering Knowledge from Large Databases. In Proc. of the IEEE/ACM International Conference on Parallel and Distributed Information Systems, pp: 8-18.

Han, J. and M. Kamber, 2001. Data Mining Concepts and Techniques: Morgan Kaufman.

Han, J., P. Jian, Y. Yiwen and R. Mao, 2004. Mining frequent patterns without candidate generation. Data Mining and Knowledge Discovery, (8): 53-87.

He, Z., J.Z. Huang, X. Xuand D. Shengchun, 2004.A Frequent Pattern Discovery Method for Outlier Detection. In SpringerLink (Ed.), Lecture Notes Computer Science. (3129/2004): 726-732 .Springer Berlin/ Heidelberg.

Coenen, F., G. Goulbourne and P. Leng, 2004. "Tree structures for mining association rules." Data Mining and Knowledge Discovery (8.1): 25-51.

Gilberg, R.F. and B.A. Forouzan, 2005. Data Structures: A Pseudocode Approach with C, 2nd Edition, Thomson Course Technology.

Han, J. and M. Kamber, 2006. Data mining concepts and techniques', Elsevier Inc., Second Edition, San Francisco.

Kotsiantis, S. And D. Kanellopoulos, 2006. Association Rules Mining: A Recent Overview, GESTS International Transactions on Computer Science and Engineering, 32(1): 71-82.

Leskovec, J., L. Adamicand, B. Huberman, 2007. The Dynamics of Viral Marketing. ACM Transactions on the Web (TWEB), 1: 1.

Yu, W., X. Wang, F. Wang, E. Wang and B. Chen, 2008. The research of improved apriori algorithm for mining association rules, the 11th IEEE International conference technology proceedings. Sch. of Inf. & Eng., Northeast Dianli Univ., Jilin 513-516.

Chunjie, B. and W. Xiaoping, 2009. An Algorithm for Mining Association Rules Based on Sets Operation, Second International Workshop on Computer Science and Engineering, China.

Gautan, P. and K.R. Pardasani, 2010. A Fast Algorithm for Mining Multilevel Association Rule Based on Boolean Matrix,IJCSE.

Zhu, Y.,Y. Wang, S. Wang and X. Guo, 2010. Mining Association Rules Based on Cloud Model and Application in Credit Card Marketing, Asia-Pacific Conference on Wearable Computing System, IEEE, China.

Fimi Frequent Itemset Mining Dataset Repository retrieved from the World Wide Web: http://fimi.cs.helsinki.fi/data/

UCI Repository for Machine Learning Databases retrieved from the World Wide Web: http://www.ics.uci.edu or http://archive.ics.uci.edu/ml/

Agrawal, R., T. Imielinski and A. Swami, May 1993.Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington D.C.

Ghassan Saleh Al-Dharhani, Zulaiha Ali Othman, Azuraliza A. Bakar,

Centre for Artificial Intelligence Technology Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia.

GhassanSaleh Al-Dharhani, Zulaiha Ali Othman, AzuralizaA.Bakar: Automated Boolean Matrix Data Representation Scheme through AVL tree for Mining Association Rules

Corresponding Author: GhassanSaleh Al-Dharhani, Centre for Artificial Intelligence Technology, Faculty of Information Science and Technology, UniversitiKebangsaan Malaysia, 43600, Bangi, Selangor DarulEhsan, Malaysia

Tel.: +60 12-2306594; E-mail: ghassanxyz@gmail.com

Mining Association rule is an appropriate and well researched tool of data mining to discover interesting relationships between variables in large databases. It was first introduced by (Sedgewick, et al., 1983), and has got widespread attention since the publication of his research and the introduction of the Apriori algorithm. This study has included generating association rules from frequent itemsets (Agrawal, et al., 1993; Mannila, et al., 1994; Agrawal, et al., 1996.). The initial research of this method was mainly driven by the analysis of market basket data. For example, from the transaction data of a supermarket, the rule {milk, egg} => {bread} is generated, which represents that customers who purchase milk and eggs also tend to buy bread as well. This relationship explains the behavior of consumers purchasing, and it can help managers to make better decisions in product promotion, shelf design, and customer relationship. Association rule mining has been applied to variant areas, including risk analysis in commercial institutions, medical sectors, basket data analysis, cross-marketing, catalog design, clustering, classification, etc.

In computer science and data mining, Apriori (Agrawal, et al., 1993; Agrawal, et al., 1996.) is a classic algorithm for learning association rules (Han, et al., 2001). Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation). In the first step, frequent itemsets are generated, i.e those itemsets which holds the criteria of minimum support (Mannila, et al., 1994). In the second and final steps, Rule generation is made possible by evaluation of the confidence measure.

In this study, we propose data representation scheme to improve the existing Apriori algorithm for mining association rules. The algorithm improves through representation scheme through AVL-Tree (a balanced tree) to provide more reduced time and condensed rules than the Apriori method. In addition, our method is able to get the last frequent itemsets and generate rules depending on data representation and Boolean Matrix. This paper is organized in six sections: section 2 reviewed related works, section 3 proposed an automated Boolean matrix data representation using AVL-Tree, section 4 demonstrated the benefits of data representation, section 5 presented the algorithms, section 6 discussed the experimental results, and section 7 concluded the proposed representation scheme.

Related work:

Basically, data that are related to businesses grow speedily, making data mining a reality and an essential field in our life, and has increased the interest in the database management system. In data mining, Apriori(Agrawal, et al., 1993; Agrawal, et al., 1996.) and FP-growth (frequent pattern growth) (Han, et al., 2001.)are the main techniques in Association Rule Mining (ARM) which is used to get sequences pattern for items in a database.

Mining frequent pattern is seemingly one of the most important approaches in data mining and is referred to the market basket analysis from the Association Rules Mining task (Yen, et al., 1996.). A pattern is a set of items or subsequences or substructures that occurs frequently in a data set.

An association rule has the support X [??] Y to be the percentage of transactions in the database that contain X and Y, and the formal definition of support(x) is based on (AggrawalandSrikant., 1996)work :

Let I = { i1, i2, ..., im) be set of m literals called items and the database D = (t1, t2, ..., tn) a set of n transactions, each consisting of a set of items from I. A transaction t E D is said to contain itemsetX if X [subset or equal to] t. Hence, the support of itemsetX is defined as in the equation (1) below (Chunjie, et al., 2009.):

support(X) = ft E D X [subset or equal to] {t E D} (1)

The Mining frequent pattern using Apriori algorithm has been used in many applications in data mining such as: Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Each transaction t [euro] D is an item set such that t is a proper subset of I. A unique identifier, which we call it as TID, is associated with each transaction. We say that a transaction t supports X, a set of items inI, if X is a proper subset of t. We assume that the items in a transaction or an itemset are sorted in lexicographic order.

An itemsetX has support s if s% of the transactions support X. Support of X is denoted as Supp(X). An association rule is an implication of the form

X [??] Y, where X and Y are subsets of I and X [intersection] Y = [empty set].

We say that the rule X [??] Y holds in the database D with confidence C if

confidence (X [right arrow] Y) = P(Y|X) = n(X [union] Y)/n(X) [greater than or equal to] C (2)

n( X)

(where [absolute value of A] is the number of occurrences of the set of items A in the set of transactions D, and A occurs in a transaction t, if and only if A is a proper subset of t). We also say that the rule X [??] Y has support S if

support (X [right arrow] Y) = p(X [union] = n(X [union] Y)/N [greater than or equal to] S (3)

where N is the number of transactions in D. Here, the support is a measure of the frequency of a rule, whereas, the confidence is a measure of the strength of the relationship between sets of items.

On the other hand, (Han, et al., 2006) has presented a new set based approach for getting frequent item sets. The best feature of one scan algorithm is that it requires only one scan of the database. They compared their algorithm with Apriori. In addition, (Gautan, et al., 2010) proposed MLBM algorithm only scans the transaction database once, and it adopts the Boolean vector to discover frequent itemset. Other authors (Zhu, et al., 2010)present the combination of the cloud model based on generalization method with Apriori algorithm.

(Coenen, et al., 2004) presented how to compute partial support counts in one DB-pass and how to store them in an enumeration tree (P-Tree). In addition, (Yu, et al., 2006) proposed a novel algorithm so called reduced apriori algorithm with tag (RAAT), which reduces one redundant pruning operation and compares with Apriori algorithm. On the other hand, (Gautan, et al., 2010) proposed Association Rule Based on Boolean Matrix algorithm, which only scans the transaction database once, and discovers frequent item set. In addition, (He, et al., 2004) proposed a frequent pattern discovery method for outlier detection.

AVL tree proposed by G.M Adelson-Velskii and E.M Landis (Sedgewick, 1983), and hence, its name Adelson-Velskii Landis, is a self-balanced tree; that means the heights of the children (any two sub-trees) can differ by at most one. Using an AVL tree is better than simple data structures such as vectors or arrays where complexity of running time usually takes the form of O(log n) time in both the average and worst cases (Gilberg, et al., 2005).

In this study, there are two main phases involved: firstly, a new transactional data representation scheme in the Boolean matrix form is proposed to help us to generate N- frequentitemset, and secondly the improvement to get association rules for the proposed Apriori algorithm, because of its efficiency to detect the last frequent itemsets.

Automated Boolean Matrix with Association Rules (ABMAR) involves several common steps in Apriori algorithm such as follows:

i. Compute unduplicated sorted n items and m transactions during loading process by AVL tree.

ii. Save all unduplicated n items in itemVector.

iii. Get the value of parameters: number_of_items, and number_of_rows automatically, then initialize settings for minSupport and minConfident.

iv. Use regular expression to replace items by itemVector's index.

v. Replacement process will reduce the capacity of Boolean Matrix.

vi. The columns of Boolean Matrix are discretized data while the number of rows is equivalent to the number of transactions.

vii. Apply BinApriori on Boolean Matrix, then generate n-Candidates and get m-Frequent itemsets.

viii. BinApriori will complete all frequent itemsets and save them with their minSupport values in AVL tree.

ix. Capture the last frequent and set confidence value.

x. obtain all possible subsets with their weight by set theory.

xi. Generate all possible association rules depend on weight and confidence value.

In general, the conceptual framework has involved two main phases: First, initial phase is executed by loading data from transaction table and scan the database once. The main phase (I) is called the data representation scheme to represent the data from transaction database to Boolean Matrix automatically. In addition, N-frequent itemsets are generated from the Apriori algorithm by using Boolean Matrix from the previous stage. The implementation has been carried out under Java JDK 1.6 platform and uses object oriented programming concept. Table 1 and Table 2 show the example of the first stage of phase (I). Supposed, if we have a synthetic data 'Sparse' which is a dataset with multi-scattered records with each record contains several data items.

Data representation starts after loading data that are saved in AVL-tree (balanced binary tree). The AVL-tree has several benefits such as saving balanced data, giving sorted data when using Inorder-Traversal-Tree. It also has an ability to save unduplicated data. The Inorder-Traversal-AVL-Tree process gives us sorted unduplicated data that are saved in vector or sorted unduplicated array. In addition, this array is used in data discretization which is a part of data reduction for a given continuous number of values, especially for numerical data. Replacing the values of the array by their index values is the discretization process

The first step is to set the value of minimum support number(S) to 0.22 and the required minimumconfidence (C) to 0.7. Table 1 contains the transaction items, and Table 2 contains the discretized data, which is a part of the data reduction for a given continuous number of values, especially for numerical data. The discretization process is done by replacing the index of array in Table 3 that is generated from AVL-tree, which gives unduplicated sorted frequent patterns.

Automatically, Boolean Matrix in Table 4 is generated by phase (I) - stage (I) in previous section by transferring discretized table to Boolean Matrix. In addition, the number of items in the whole dataset and the transaction table's rows are computed. In this situation, we have 5 items, and 9 transactions (rows). This approach has decided the number of itmes is equivalent to the number of columns, and the number of transaction is equivalent to the number of rows of the Boolean Matrix.

The data representation which is related to AVL-tree for sorting Frequent Pattern and Boolean(Binary) Matrix will be applied during Association Rules Mining.

Here ct counts all 1 's in each column, and r is the number of transactions (rows). In Table 4, the 10th row of the Boolean Matrix presents the count number of all 1's in column one, which is 6, where the value 6 is then divided by 9 transactions (rows) and equals to 0.66 by the following equation:

computed _ weight = count 1's in column/number of trans (rows) [greater than or equal to] S (4)

The last step of phase(I) is generation of rules that is based on the equivalence subset theory that can be represented as a set of specific rules. In this example, that last frequentitemset is {1, 2, 5}. Therefore, several rules generated are Rules1: 1 ^ 2 [right arrow] 5, Rules 2:1 ^ 5 [right arrow] 2, Rules 3: 2 ^ 5 [right arrow] 1, Rules 4:1 [right arrow] 2 ^ 5, Rules 5: 2 [right arrow] 1 ^ 5 and Rules 6:5 [right arrow] 1 ^ 2.

Benefits of data representation:

1. Transformation data from source-data-file to AVL tree that is a balanced data structure for getting non-duplicated items sorted and maintained actual elements of data, and increased processing efficiency.

2. Reduce computational processing time.

3. Obtain multi-data types of represented data storage such as tree, hash table or Binary Matrix.

4. Use AVL tree and Automated Binary Matrix to get Association Rules Mining.

5. Capture Frequent Pattern by apply Apriori algorithm.

The above Figures 3 and 4 show how to create AVL-Tree from the transaction database. The first tree inserts 1, 2, and 5 and rotates its nodes to keep data as balanced. During this process the duplicated data will ignore the insert to the tree. For this reason, we read the data and use AVL-tree to save balanced unduplicated of that data, during which the process of counting the items and transactions is done, and getting sorted unduplicated data by inorder function of AVL-tree. Thus, the sorted unduplicated data is saved in Vector that is used in discretization phase. The data of vector is replaced by index of vector, which starts from 1 to number of n-unduplicated items. In addition, the benefit of this operation is to reduce the occupation of discretized data in Boolean matrix, and hence, applying Apriori algorithm on this matrix to catch the n-frequent itemset.

Algorithms:

In this phase, a new transactional data representation scheme in the automated Boolean matrix form is proposed to help us to generate n-frequentitemset and get association rules. During this representation, the data read from AVL-tree to get unduplicated items and return the number of items and transactions automatically, then generate Boolean matrix, which put 1 if the item is available in original dataset, and otherwise put 0. Applying Apriori algorithm on this matrix to obtain n-frequentitemset is then performed.

The Figures 1, 2, 3 and 4 show the data representation of the Boolean Matrix from the source data file and then apply Apriori algorithm with Boolean Matrix to get n-frequent items and association rules, respectively.

Fig. 5: ALGORITHM I (DRBM): Data representation to Boolean Matrix from the source data file. Input: unduplicated sorted items obtained from AVL tree Output: Boolean Matrix Start /*generate Boolean matrix during replacement process by regular expression automatically */ num_of_items = fbobj.retuenNumOfItems(); num_of_rows= fbobj.retuenNumOfrows; itemsVectore= fbobj .reurnNondublicateSorteditems(); wholefile=fbobj.reurnDataFile(); dataReplacement= fbobj.PatternMacher(itemsVectore,wholefile); CallSaveReplacementData(); intbinaryMatrix[] []= newint[num_of_rows][num_of_items]; /*the matrix columns are the data items that are been in replacement file, and the rows of file as same as rows of that matrix*/ while(!EOFile) { int y=Scanner.readItem(); if(newline) matrixrow++; else booleanMtrix[matrixrow][y-1]=1; returnbooleanMtrix; } End Fig. 6: ALGORITHM II: Applied Apriori algorithm with Boolean Matrix. Input: number of items, number of transaction, minSupport, minConfident Output: n-frequent itemsets, association rules Start /*generate all possible candidates of n-frquentitemsets form Boolean Matrix that is generated during data representation scheme */ readDataToAVL_tree(sourceDataFile); numberOfItems=automatedReturnNumOfItems(); numberOfTrans=automatedReturnNumOfTrans(); dataInitialization(sourceDataFile,numberOfItems, numberOfTrans); while (possibleCandidates is not empty) { noItemset= noItemset+1; generateAllPossibleCandidates(noItemset); getFrequentItemsets(); b=captureFrequentItemsets(); if (b==true) then { constructCompletedFrequent (); saveFrequentWithMinSup (); } } End

Experimental Results & Discussion:

There are five datasets where two were obtained from the FIMI Repository and the other three datasets were obtained from the UCI Repository. Two synthetic datasets were chosen for the experimental testing in the research. The datasets were chosen based on number of transactions, average length and type (Sparse). The datasets are FIMI Repository's T10I4D100K and T40I10D100K data.

We ran our experiments by using PC with these specifications: Intel(R) Core(TM) Duo CPU 2.20 GHz with 2 GB RAM and Windows XP with 32-bit operating system. We also used the concept of object oriented programming by Java programming language to implement our code.

We used synthetic data 'Sparse', which is a dataset with multi-scattered records that are characterized by normal sparsity. Each record contains several attributes (i.e., columns in a database scheme), which can be viewed as a vector data. Sparsity is related to the average records that are no "similar" length in the multi- dimensional space and are defined by the attributes. The "Sparsity" is analytically well-constituted and related to "long-tail" circumstance: individual transactions attend to irregular sold items (Leskovecet al., 2007)

The developed system returned us the number of items and the number of transaction automatically without the user determining them (e.g. T40I10D100K has 942 items and 100000 rows).

Finally, our experiments computed the n-frequent itemset with their threshold minSupport. Table 8 shows n-frequent itemsets with their threshold minSupport of T10I4D100K dataset.

The process of generation of association rules starts with finding all subset of all the sets that are the last frequent itemset of a dataset. For example, if we choose (354, 494, 501, 696,732) with its threshold minSupportand confidence of 0.00507 and 70%, respectively, which resulted in the rejection of 6 rules and selection of 18 rules that are shown in Table 9.

Three datasets (anneal, which is synthetic and others, which are non-synthetic) obtained from the UCI Repository of machine learning databases repository were chosen for the next experimental testing in the research..

The composition of the matrix depended highly on the discretized data, which constitute the columns' data. Hence, the number of items and transactions were computed automatically using the AVL-tree with the support and confident numbers are set-up during the initial stage. The computational time of the data transformation from datasets to AVL-tree is between 0.109 seconds to 1.812 seconds, providing us with several benefits such as save balanced and sorted data.

Summarizing, data mining employs several techniques to extract association rules, such as Apriori and FP- Growth. Mining Association Rule is an appropriate tool and has been one of the well-researched approaches of data mining used to discover entertaining relationships between items in databases.

Our objective was how to improve efficiency and generate association rules by a new data representation through AVL-Tree from the transaction of data, and generate n-frequent itemsets. The benefit of AVL-Tree is that it keeps the data as balanced values. Thus, the proposed representation scheme transfers the 'dataset' to Automated Boolean Matrix (ABM) and then builds association rules.

Conclusion:

Basically data, which are business related, grow hastily, thus making data mining a reality and essential field in our life, where searching for meaningful information in huge databases has constituted a fundamental issue. That tells why mining association rules is a widespread technique in data miming.

This study has demonstrated that Apriori with data representation scheme can reduces computational time, improves the process of finding n-frequentitemset and generates association rules faster than the original Apriori. In addition, the database only being scan once, before being automatically transform to Boolean Matrix. Finally, applying Apriori to get the frequent itemset and their rules depend on minSupport and confidence values, which have controlled over the number of association rules.

References

Sedgewick, R., Algorithms, Addison-Wesley, 1983. ISBN 0-201-06672-6, page 199, chapter 15: Balanced Trees.

Agrawal, R. and R. Srikant, 1994.Fast algorithms for mining association rules in large databases. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile.

Mannila, H., H. Toivonen and A. Verkamo, 1994."Efficient algorithms for discovering association rules." AAAI Workshop on Knowledge Discovery in Databases (SIGKDD). Seattle, 181-92.

Agrawal, R. and R. Srikant, 1995. "Mining Sequential Patterns", Proc. Internal Conf. Data Engineering, pp: 3-14.

Agrawal, R., M. Mehta, J. Shafer, R. Srikant, A. Arning and T. Bollinger, 1996. "The Quest Data Mining System", Proc. of KDD'96, USA, pp: 244-249.

Yen, S.J. and A.L. P.Chen, 1996.An Efficient Approach to Discovering Knowledge from Large Databases. In Proc. of the IEEE/ACM International Conference on Parallel and Distributed Information Systems, pp: 8-18.

Han, J. and M. Kamber, 2001. Data Mining Concepts and Techniques: Morgan Kaufman.

Han, J., P. Jian, Y. Yiwen and R. Mao, 2004. Mining frequent patterns without candidate generation. Data Mining and Knowledge Discovery, (8): 53-87.

He, Z., J.Z. Huang, X. Xuand D. Shengchun, 2004.A Frequent Pattern Discovery Method for Outlier Detection. In SpringerLink (Ed.), Lecture Notes Computer Science. (3129/2004): 726-732 .Springer Berlin/ Heidelberg.

Coenen, F., G. Goulbourne and P. Leng, 2004. "Tree structures for mining association rules." Data Mining and Knowledge Discovery (8.1): 25-51.

Gilberg, R.F. and B.A. Forouzan, 2005. Data Structures: A Pseudocode Approach with C, 2nd Edition, Thomson Course Technology.

Han, J. and M. Kamber, 2006. Data mining concepts and techniques', Elsevier Inc., Second Edition, San Francisco.

Kotsiantis, S. And D. Kanellopoulos, 2006. Association Rules Mining: A Recent Overview, GESTS International Transactions on Computer Science and Engineering, 32(1): 71-82.

Leskovec, J., L. Adamicand, B. Huberman, 2007. The Dynamics of Viral Marketing. ACM Transactions on the Web (TWEB), 1: 1.

Yu, W., X. Wang, F. Wang, E. Wang and B. Chen, 2008. The research of improved apriori algorithm for mining association rules, the 11th IEEE International conference technology proceedings. Sch. of Inf. & Eng., Northeast Dianli Univ., Jilin 513-516.

Chunjie, B. and W. Xiaoping, 2009. An Algorithm for Mining Association Rules Based on Sets Operation, Second International Workshop on Computer Science and Engineering, China.

Gautan, P. and K.R. Pardasani, 2010. A Fast Algorithm for Mining Multilevel Association Rule Based on Boolean Matrix,IJCSE.

Zhu, Y.,Y. Wang, S. Wang and X. Guo, 2010. Mining Association Rules Based on Cloud Model and Application in Credit Card Marketing, Asia-Pacific Conference on Wearable Computing System, IEEE, China.

Fimi Frequent Itemset Mining Dataset Repository retrieved from the World Wide Web: http://fimi.cs.helsinki.fi/data/

UCI Repository for Machine Learning Databases retrieved from the World Wide Web: http://www.ics.uci.edu or http://archive.ics.uci.edu/ml/

Agrawal, R., T. Imielinski and A. Swami, May 1993.Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington D.C.

Ghassan Saleh Al-Dharhani, Zulaiha Ali Othman, Azuraliza A. Bakar,

Centre for Artificial Intelligence Technology Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia.

GhassanSaleh Al-Dharhani, Zulaiha Ali Othman, AzuralizaA.Bakar: Automated Boolean Matrix Data Representation Scheme through AVL tree for Mining Association Rules

Corresponding Author: GhassanSaleh Al-Dharhani, Centre for Artificial Intelligence Technology, Faculty of Information Science and Technology, UniversitiKebangsaan Malaysia, 43600, Bangi, Selangor DarulEhsan, Malaysia

Tel.: +60 12-2306594; E-mail: ghassanxyz@gmail.com

Table 1: Transactions Items TID List of Items TID List of Items T100 7, 8, 55 T105 8, 9 T101 8, 10 T106 7, 9 T102 8, 9 T107 7, 8, 9, 55 T103 7, 8, 10 T108 7, 8, 9 T104 7 ,9 Table 2: Transactions Items TID List of Items TID List of Items T1 1, 2, 5 T6 2, 3 T2 2, 4 T7 1, 3 T3 2, 3 T8 1, 2 ,3, 5 T4 1, 2, 4 T9 1, 2, 3 T5 1, 3 Table 3: Sorted Frequent Pattern Array 1 2 3 4 5 7 8 9 10 55 Table 4: Automated Boolean Matrix 1 2 3 4 5 1 1 1 0 0 1 2 0 1 0 1 0 3 0 1 1 0 0 4 1 1 0 1 0 5 1 0 1 0 0 6 0 1 1 0 0 7 1 0 1 0 0 8 1 1 1 0 1 9 1 1 1 0 0 Ct 6 7 6 2 2 ct/r 0.66 0.77 0.66 0.22 0.22 Table 5: shows the Datasets Data set T10I4D100K T40I10D100K No. of Items 870 942 Avg. Length 10 40 No. of Trans. 100, 000 100, 000 Type Sparse Sparse Size 3. 93 MB 14. 8 MB Table 6: shows the meaning of synthetic datasets parameters T Average length of the transaction I Average size of frequent itemsets D Number of transactions K Thousand Table 7: The comparison of execution time (seconds) between Apriori with Boolean Matrix and normal Apriori with 10000 transactions and 868 items thresholdminSup(%) Apriori Data Repres. Apriori Bool. Matix 1 8630.668 111.822 153.899 1.5 4868.884 171.429 62.525 2 2111.979 180.102 33.072 2.5 1062.788 110.464 15.382 3 446.751 171.083 8.412 3.5 255.013 170.305 6.131 4 180.398 163.706 5.101 Table 8: n-frequent itemsets with their threshold minSupport of T10I4D100K dataset n-frequent itemset threshold minSupport>=0.005 32,192,245,298,446 0.00732 184,250,393,773,846 0.00693 192,474,573,806,828 0.00559 302,354,494,501,696 0.00504 302,354,494,501,732 0.00503 302,354,494,696,732 0.00509 302,354,501,696,732 0.00509 302,494,501,696,732 0.00503 354,494,501,696,732 0.00507 Table 9: The subset of frequent {354, 494, 501, 696, 732} with its threshold minSupport of 0.00507. Mining Association Rules minSup s confidence number c 354 -->[494, 501, 696, 732] 2.047 24.768% 494 -->[354, 501, 696, 732] 1.589 31.907% 501 -->[354, 494, 696, 732] 2.164 23.429% Rejected 696 -->[354, 494, 501, 732] 2.237 22.664% Rejected 732 -->[354, 494, 501, 696] 0.752 67.42% Rejected 354, 494 -->[501, 696, 732] 0.61 83.115% Selected 354, 501-->[494, 696, 7 0.617 82.172% Selected 354, 696-->[494, 501, 732] 0.826 61.38% Rejected 354 ,732-->[494, 501, 696] 0.603 84.08% Selected 494, 501-->[354, 696, 732] 0.601 84.359% Selected 494, 696-->[354, 501, 732] 0.604 83.94% Selected 494, 732-->[354, 501, 696] 0.592 85.642% Selected 501, 696-->[354, 494, 732] 0.619 81.906% Selected 501, 732-->[354, 494, 696] 0.6 84.5% 696, 732-->[354, 494, 501] 0.607 83.526% Selected 354, 494, 501 -->[696, 732] 0.56 90.536% Selected 354, 494, 696 -->[501, 732] 0.567 89.418% Selected 354, 494, 732 -->[501, 696] 0.561 90.374% Selected 494, 501, 696 -->[354, 732] 0.559 90.698% Selected 494, 501, 732 -->[354, 696] 0.558 90.86% Selected 501, 696, 732 -->[354, 494] 0.57 88.947% Selected 354, 494, 501 696-->[732] 0.533 95.122% Selected 354, 494, 501 732-->[696] 0.531 95.48% Selected 494, 501, 696 732-->[354] 0.531 95.48% Selected Mining Association Rules Rejected / Selected 354 -->[494, 501, 696, 732] Rejected 494 -->[354, 501, 696, 732] Rejected 501 -->[354, 494, 696, 732] 696 -->[354, 494, 501, 732] 732 -->[354, 494, 501, 696] 354, 494 -->[501, 696, 732] 354, 501-->[494, 696, 7 354, 696-->[494, 501, 732] 354, 732-->[494, 501, 696] 494, 501-->[354, 696, 732] 494, 696-->[354, 501, 732] 494, 732-->[354, 501, 696] 501, 696-->[354, 494, 732] 501, 732-->[354, 494, 696] Selected 696, 732-->[354, 494, 501] 354, 494, 501 -->[696, 732] 354, 494, 696 -->[501, 732] 354, 494, 732 -->[501, 696] 494, 501, 696 -->[354, 732] 494, 501, 732 -->[354, 696] 501, 696, 732 -->[354, 494] 354, 494, 501 696-->[732] 354, 494, 501 732-->[696] 494, 501, 696 732-->[354] Table 10: Results from the UCI Repository Datasets. Data set Name anneal chess waveform Non-Dup-items 71 58 101 No. of Trans. 898 28056 5000 No. of Attributes Avr (13) 7 22 synthetic YES NO NO minSup 0.2 8.861s 16.942s 1327.568s minSup 0.4 1.154s 8.595s 1210.750s minSup 0.6 0.281s 3.822s 708.2756s minSup 0.8 0.141s 3.222s 120.1424s minSup 1.0 0.109s 2.917s 34.4845s

Printer friendly Cite/link Email Feedback | |

Title Annotation: | Original Article |
---|---|

Author: | Al-Dharhani, Ghassan Saleh; Othman, Zulaiha Ali; Bakar, Azuraliza A. |

Publication: | Advances in Natural and Applied Sciences |

Article Type: | Report |

Date: | Jul 1, 2013 |

Words: | 4574 |

Previous Article: | Effect of silica content on the acid resistance of enamels. |

Next Article: | Measuring corporate social responsibility commitments in the Malaysian Financial Services Industry. |

Topics: |