
Genetic association mining in web personalization for the non-functional requirement.

1. INTRODUCTION

Requirements in Web engineering tend to evolve rapidly due to the dynamic nature of the Web [1]. As service-oriented architectures have become more widespread in industry, many complex web services are assembled from independently developed modular services supplied by different vendors. Although the functionality of a composite web service is ensured during the composition process, the non-functional requirements (NFRs) are often ignored at this stage. Since quality of service plays an increasingly important role in modern service-based systems, there is a growing need for effective approaches to verifying that a composite web service not only offers the required functionality but also satisfies the desired NFRs [2].

Personalization is an attempt to address a service provider's desire to push additional information to users visiting its domain while at the same time restricting the flow of irrelevant recommendations. This is achieved by predicting a user's requirements and needs so that accurate results can be generated during the course of the user's web navigation. Different personalization systems give somewhat different recommendations depending on the type of web data they collect and the methods they employ for analyzing it and generating recommendations [3].

Data mining is the process of automatically extracting implicit and useful patterns from databases. Web mining extracts hidden information from web-related data and is divided into web content mining, web structure mining and web usage mining. Web content mining deals with the discovery of useful knowledge from the actual content of web pages, in the form of text, audio or video [4]. Web usage mining deals with the analysis of various log data, including web server logs, client logs and proxy server logs, to discover useful knowledge about web usage. Web usage mining can be applied to the e-learning domain because a site records information regarding learner profiles, web access, academic details of students and evaluation results. Web usage mining can track learning activities and identify web access patterns and user behaviors [5]. Various data mining techniques [6] such as statistical analysis, association rules, clustering, classification and sequential pattern mining have been used for mining web usage logs.

2. NON-FUNCTIONAL REQUIREMENTS

Requirements can be further classified into functional and non-functional requirements. Functional requirements of a system generally concentrate on service behavior, whereas non-functional requirements describe manifold quality attributes of services. Non-functional requirements (NFRs) are often referred to as soft criteria and are considered exclusively for ranking [7]. They describe important constraints upon the development and behavior of a system and specify a broad range of qualities such as security, performance, availability, extensibility, and portability. These qualities play a critical role in architectural design [8] and should therefore be considered and specified as early as possible during system analysis.

NFRs represent requirements that help the application accomplish its goal. Non-functional requirements define the overall qualities or attributes of the resulting system. They place restrictions on the product being developed and on the development process, and they specify external constraints that the product must meet [9]. Examples of NFRs include safety, security, usability, reliability and performance requirements. Project management issues such as cost, time, and schedule are often considered non-functional requirements as well.

NFRs are the needs a system must meet to carry out its operations under a set of constraints and quality expectations. Capturing NFRs and matching them with architectural concerns is harder than dealing with functional requirements, since users find it more difficult to state NFRs precisely at every stage of requirements modeling. Extracting more NFRs and modeling them improves the success and quality of a product.

3. WEB PERSONALIZATION WITH ASSOCIATION RULE MINING

The challenge in designing effective Web personalization systems is to improve the scalability of collaborative filtering through offline pattern discovery, while maintaining or improving overall recommendation effectiveness. Furthermore, the effectiveness of the system must be measured in terms of both the coverage and the accuracy of the produced recommendations [10]. In recent years there has been increasing interest and a growing body of work in Web usage mining as an underlying approach to capturing and modeling Web users' navigational behavior [11]. Automatic personalization based on clustering of user transactions and page views has been considered in [12, 13, 14]. However, this often comes at the cost of reduced recommendation accuracy. One solution for improving accuracy is the use of preprocessing techniques such as normalization and significance filtering [15].

Usage-based web personalization systems [13] involve three phases:

1. Data preparation and transformation,

2. Pattern discovery, and

3. Recommendation

The pattern discovery phase may include the discovery of association rules, sequential navigational patterns, clustering of users or sessions, and clustering of page views or products [10]. The required high-level tasks in the data preparation phase are data cleaning, user identification, session identification, page view identification, and the inference of missing references due to caching. Page view identification is the task of determining which page file accesses contribute to a single browser display. Transaction identification can be performed as a final preprocessing step prior to pattern discovery in order to focus on the relevant subsets of page views in each user session [16].
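To make the session identification step concrete, the following short Python sketch groups raw log requests into per-user sessions using a 30-minute inactivity timeout. It is illustrative only: the log field layout (IP address, ISO timestamp, URL) and the function name sessionize are assumptions, not the preprocessing code used in this work.

from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)          # common inactivity heuristic for session breaks

def sessionize(log_lines):
    """Group page requests per user (here: per IP address) into sessions."""
    requests = []
    for line in log_lines:
        ip, ts, url = line.split()               # assumed layout: "ip timestamp url"
        requests.append((ip, datetime.fromisoformat(ts), url))
    requests.sort(key=lambda r: (r[0], r[1]))    # order by user, then by time

    sessions = defaultdict(list)                 # user -> list of sessions
    for ip, ts, url in requests:
        user_sessions = sessions[ip]
        last_time = user_sessions[-1][-1][0] if user_sessions else None
        if last_time is not None and ts - last_time <= SESSION_TIMEOUT:
            user_sessions[-1].append((ts, url))  # continue the current session
        else:
            user_sessions.append([(ts, url)])    # start a new session
    return sessions

log = ["1.2.3.4 2012-03-01T10:00:00 /index.html",
       "1.2.3.4 2012-03-01T10:05:00 /courses.html",
       "1.2.3.4 2012-03-01T11:30:00 /index.html"]   # the third hit starts a new session
print(dict(sessionize(log)))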

4. PROPOSED METHODOLOGY

Association rules capture relationships among items based on their patterns of co-occurrence across transactions. In the case of web transactions, association rules capture relationships among page views based on the navigational patterns of users. The leading algorithms used for association rule mining are Apriori, FP-growth and Eclat. Apriori and FP-growth work on the horizontal data format, while Eclat works on the vertical format to produce the item sets; a small illustration of the two formats follows. In our previous work [17] we discussed these algorithms in detail and the reasons for choosing Apriori.
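The difference between the two data formats can be shown in a few lines of Python. The sketch below converts a toy set of horizontal transactions (one row per transaction, as consumed by Apriori and FP-growth) into the vertical tid-list format used by Eclat; the data and names are hypothetical.

from collections import defaultdict

# Horizontal format: one row per transaction (user session), as used by Apriori/FP-growth.
horizontal = {
    1: {"A", "B", "C"},      # transaction id -> set of page views
    2: {"A", "C"},
    3: {"B", "D"},
}

def to_vertical(transactions):
    """Vertical (tid-list) format: item -> set of transaction ids, as used by Eclat."""
    tidlists = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidlists[item].add(tid)
    return tidlists

vertical = to_vertical(horizontal)
# In the vertical format the support count of an item set is the size of the
# intersection of its tid-lists, e.g. for the item set {A, C}:
print(len(vertical["A"] & vertical["C"]))    # -> 2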

Here we propose to optimize the association rules using genetic algorithms. A genetic algorithm (GA) is an evolutionary algorithm, inspired by the mechanics of biological reproduction and natural selection, that is used mostly for optimization problems.

GENETIC ALGORITHMS

A standard GA applies genetic operators such as selection, crossover and mutation to an initially random population in order to compute a whole generation of new strings. The GA runs to generate solutions for successive generations. The probability of an individual reproducing is proportional to the goodness of the solution it represents, so the quality of the solutions improves over successive generations. The process terminates when an acceptable or optimum solution is found. GAs are appropriate for problems that require optimization with respect to some computable criterion. The functions of the genetic operators are as follows (a minimal sketch of these operators is given after the list):

1. Selection: Selection deals with the probabilistic survival of the fittest, in that fitter chromosomes are more likely to be chosen to survive, where fitness is a comparable measure of how well a chromosome solves the problem at hand.

2. Crossover: This operation is performed by selecting a random gene position along the length of the chromosomes and swapping all the genes after that point.

3. Mutation: Alters the new solutions so as to add stochasticity to the search for better solutions. It is the chance that a bit within a chromosome will be flipped (0 becomes 1, 1 becomes 0).
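The following self-contained Python sketch illustrates the three operators on bit-string chromosomes with a toy fitness function (the number of 1s in the string); it is a generic illustration of a standard GA, not the implementation used in our experiments.

import random

def fitness(chrom):                      # toy measure of solution quality
    return sum(chrom)

def roulette_select(population):         # selection: fitter strings reproduce more often
    weights = [fitness(c) + 1e-9 for c in population]
    return random.choices(population, weights=weights, k=2)

def crossover(parent1, parent2):         # single-point crossover
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(chrom, rate=0.01):            # bit-flip mutation
    return [1 - g if random.random() < rate else g for g in chrom]

population = [[random.randint(0, 1) for _ in range(16)] for _ in range(20)]
for _ in range(50):                      # evolve for a fixed number of generations
    next_gen = []
    while len(next_gen) < len(population):
        p1, p2 = roulette_select(population)
        c1, c2 = crossover(p1, p2)
        next_gen += [mutate(c1), mutate(c2)]
    population = next_gen
print(max(fitness(c) for c in population))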

USE OF GENETIC ALGORITHM FOR ASSOCIATION RULE MINING

Many existing works apply genetic algorithms to association rule mining. The main motivation for using GAs in the discovery of high-level prediction rules is that they perform a global search and cope better with attribute interaction than the greedy rule induction algorithms often used in data mining [11]. Due to their high degree of randomness and inbuilt parallelism, GAs can be deployed to prioritize the rules needed for further manipulation, and the rule mining process can use the searching capability of the GA to uncover undisclosed rules. Based on the above reasons, we use a GA for multi-level prioritization of the rules.

PSEUDO CODE

The classical Apriori algorithm is used to produce the rules from the web logs for personalization. The genetic algorithm is then used to optimize the rules:

1. Start

2. Load the web log data

3. Apply the Apriori algorithm to find the frequent item sets with the minimum support. Let the frequent set be F.

4. Let F_GA be the output set, which will contain the association rules.

5. Input the termination condition of the GA.

6. Represent each frequent item set of F as a binary string encoding the combination of items it contains.

7. Select two members from the frequent item set using the Roulette Wheel sampling method.

8. Apply crossover and mutation to the selected members to generate the association rules.

9. Compute the fitness function for each rule x -> y and check the following condition:

10. if (fitness function > M)

11. set F_GA = F_GA U {x -> y}

12. If the desired number of generations is not completed, go to Step 3.

13. Stop

The fitness function is designed based on the user's interestingness measure, and M is the threshold value of the interestingness measure considered.
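A minimal Python sketch of steps 4-12 of the pseudo code is given below. It assumes the frequent item sets F have already been produced by Apriori (a small hard-coded set and toy transactions stand in for that output here) and uses support x confidence as the interestingness-based fitness; the helper names, the threshold M = 0.3 and the simplified crossover/mutation on item sets are illustrative assumptions, not the exact operators of the proposed method.

import random
from itertools import combinations

# Toy web-log transactions (stand-in for the msweb.data sessions).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C"}]

# Stand-in for the frequent item sets F produced by Apriori (step 3).
F = [frozenset(s) for s in ({"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"})]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def fitness(x, y):                       # interestingness of rule x -> y (assumed measure)
    sup_xy = support(x | y)
    return sup_xy * (sup_xy / support(x))        # support * confidence

def rules_from(itemset):                 # all non-trivial splits of an item set into x -> y
    for r in range(1, len(itemset)):
        for x in combinations(itemset, r):
            yield frozenset(x), itemset - frozenset(x)

M = 0.3                                  # interestingness threshold (step 10)
F_GA = set()                             # output rule set (step 4)
for _ in range(100):                     # termination condition: fixed generations (step 5)
    # Roulette-wheel sampling of two frequent item sets (step 7).
    a, b = random.choices(F, weights=[support(s) for s in F], k=2)
    # "Crossover": recombine the two item sets; "mutation": occasionally drop an item (step 8).
    child = set(a | b)
    if len(child) > 2 and random.random() < 0.1:
        child.discard(random.choice(sorted(child)))
    # Evaluate every rule x -> y derivable from the recombined set (steps 9-11).
    for x, y in rules_from(frozenset(child)):
        if fitness(x, y) > M:
            F_GA.add((x, y))

print(len(F_GA), "rules kept")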

5. COMPARISON METRICS

The comparison metrics used in this work are:

1. Support


2. Confidence

3. Lift ratio

4. Gini Index

The metrics are explained below; a small computational sketch follows their definitions.

Support

Every association rule has a support. The support is the percentage of transactions that satisfy the rule. An item set is called frequent if its support is equal to or greater than an agreed minimal value, the support threshold. support(X => Y) = support(XY), i.e.,

support(X => Y) = (number of transactions containing both X and Y) / (total number of transactions)

Support is defined on item sets and gives the proportion of transactions that contain the item set; it is therefore used as a measure of the significance (importance) of an item set. Since it basically uses the count of transactions, it is often called a frequency constraint. An item set with a support greater than a set minimum support threshold is called a frequent or large item set.

Confidence

The confidence is the conditional probability that, given X is present in a transaction, Y will also be present. The confidence measure, by definition, is:

Confidence(X => Y) = support(X, Y) / support(X), i.e.,

confidence(X => Y) = (number of transactions containing both X and Y) / (number of transactions containing X)

It is the ratio of transactions containing X and Y to those containing just X, i.e. the percentage of transactions with X that also contain Y.

Lift ratio

Another important concept in association rules is the "lift" of a rule. A high value of confidence suggests a strong association rule. However, this can be deceptive: if the antecedent and/or the consequent have a high support, we can obtain a high value of confidence even when they are independent. A better measure for judging the strength of an association rule is to compare the confidence of the rule with the benchmark value obtained under the assumption that the occurrence of the consequent item set in a transaction is independent of the occurrence of the antecedent. This benchmark is computed from the frequency counts of the frequent item sets: the benchmark confidence value for a rule is the support count of the consequent divided by the number of transactions in the database. This enables us to compute the lift ratio of a rule.

The lift ratio is the confidence of the rule divided by the confidence assuming independence of consequent from antecedent. The Lift Ratio of an Association Rule is defined as follows:

Lift = Confidence / Expected Confidence, where

Expected Confidence = (number of transactions having the consequent items)/ (total number of transactions)

Gini Index

The implication of this criterion is "how much the mean squared error of target values is decreased." The optimal segmentation according to this criterion minimizes the mean squared error.

Gini Index = 1 - sum over i of (p_i)^2, where p_i is the proportion of transactions that fall into group i

The Gini coefficient's main advantage is that it is a measure of inequality by means of a ratio analysis. This makes it easily interpretable, and avoids references to a statistical average.
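The four metrics can be computed directly from a transaction list, as in the following self-contained Python sketch on toy data; the Gini computation uses the standard 1 - sum(p_i^2) form assumed above, and the data and names are illustrative.

# Toy transactions: each set is one user session of page views.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    return support(x | y) / support(x)

def lift(x, y):
    expected_confidence = support(y)       # = count(consequent) / total number of transactions
    return confidence(x, y) / expected_confidence

def gini(proportions):
    return 1 - sum(p * p for p in proportions)

x, y = {"A"}, {"B"}
print("support    :", support(x | y))      # fraction of transactions containing both A and B
print("confidence :", confidence(x, y))    # fraction of transactions with A that also contain B
print("lift       :", lift(x, y))          # > 1 means A and B co-occur more than by chance
print("gini       :", gini([0.6, 0.4]))    # example group proportions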

6. RESULTS AND DISCUSSIONS

Many experiments were performed on web data taken from the public Microsoft web log file (msweb.data) [18]; the results are shown below. The experimental settings for the GA were a population size of 1000, a selection rate of 10%, a crossover rate of 6% and a mutation rate of 1%.
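These settings can be summarized as a simple configuration, for example as the following Python dictionary (values copied from the text above; the dictionary itself and its key names are illustrative, not part of the original implementation).

GA_CONFIG = {
    "population_size": 1000,   # number of candidate chromosomes per generation
    "selection_rate": 0.10,    # fraction of the population selected for reproduction
    "crossover_rate": 0.06,    # probability of applying crossover to a selected pair
    "mutation_rate": 0.01,     # probability of flipping a bit in a chromosome
}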

[FIGURE 1 OMITTED]

Fig 1 gives the following observations.

1. When the Support is increased the number of rules generated decreases

2. The application of the GA reduces the number of interesting rules generated; this is because of the ability of the GA to come out of the local optima

3. When the minimum support is 5% the improvement of the proposed approach is 23% more than the Apriori

4. When the minimum support is 10% the improvement of the proposed approach is 27.6% more than the Apriori

5. When the minimum support is 15% the improvement of the proposed approach is 23.6% more than the Apriori

6. When the minimum support is 20% the improvement of the proposed approach is 29.4% more than the Apriori

It is observed that the performance of the proposed approach is better than that of the other algorithm. It outperforms the Apriori by an average of 8.5% over the four support levels (5%, 10%, 15% and 20%).

The lift ratio has been compared, for the top 500 rules generated at the given support counts, between the Apriori algorithm and the Apriori algorithm optimized with the GA. The lift ratio tells us how much better the rule is at predicting the result than simply assuming the result in the first place. It is defined as the ratio of the number of records that support the entire rule to the number that would be expected assuming there is no relationship between the items.

[FIGURE 2 OMITTED]

Fig 2 shows that the lift ratio of the proposed approach is better than that of the other algorithm. The observations from the figure are:

1. When the minimum support is 5% the improvement of the proposed approach is 7.8% more than the Apriori

2. When the minimum support is 10% the improvement of the proposed approach is 8.2% more than the Apriori

3. When the minimum support is 15% the improvement of the proposed approach is 10.4% more than the Apriori

4. When the minimum support is 20% the improvement of the proposed approach is 19.6% more than the Apriori

It is observed that the performance of the proposed approach is better than that of the other algorithm. It outperforms the Apriori by an average of 6.8% over the four support levels (5%, 10%, 15% and 20%), with a particularly good improvement in lift ratio at the 20% support level.

A comparison based on the run time of the two algorithms has been done for the above four support levels. The computational time is compared and the graph has been plotted. Fig 3 shows the comparison, from which the following interpretations are made. The computational time of the proposed approach is less than that of the other algorithm. This is because the rules generated are filtered by the proposed approach: with fewer interesting rules generated, the time required to produce the rules is reduced overall.

[FIGURE 3 OMITTED]

The observations from the figure show

1. When the minimum support is 5% the improvement of the proposed approach is 11.8% more than the Apriori

2. When the minimum support is 10% the improvement of the proposed approach is 14.2% more than the Apriori

3. When the minimum support is 15% the improvement of the proposed approach is 19.8% more than the Apriori

4. When the minimum support is 20% the improvement of the proposed approach is 17.6% more than the Apriori

The Gini index is a measure of statistical dispersion. A low Gini index indicates a more equal distribution, with 0 corresponding to complete equality, while a higher Gini index indicates a more unequal distribution, with 1 corresponding to complete inequality. Fig 4 shows the comparison of the Gini index at the various support levels for the two algorithms, and the following interpretations are made. The observations from Fig 4 show that the proposed approach outperforms the other algorithm.

[FIGURE 4 OMITTED]

From Fig 4 we can observe that

1. The Gini index of the Apriori is greater than that of the Apriori with GA, which indicates that the distribution of the association rules is more even for the Apriori with GA, i.e. better than the other algorithm

2. When the minimum support is 5% the improvement of the proposed approach is 1.45% more than the Apriori

3. When the minimum support is 10% the improvement of the proposed approach is 1.98% more than the Apriori

4. When the minimum support is 15% the improvement of the proposed approach is 1.23% more than the Apriori

5. When the minimum support is 20% the improvement of the proposed approach is 2.06% more than the Apriori

[FIGURE 5 OMITTED]

Fig 5 gives the following observations.

1. When the Confidence is increased the number of rules generated decreases

2. The application of the GA reduces the number of interesting rules generated; this is because of the ability of the GA to come out of the local optima

3. When the minimum support is 5% the improvement of the proposed approach is 21.67% more than the Apriori

4. When the minimum support is 10% the improvement of the proposed approach is 19.86% more than the Apriori

5. When the minimum support is 15% the improvement of the proposed approach is 13.73% more than the Apriori

6. When the minimum support is 20% the improvement of the proposed approach is 1.26% more than the Apriori

7. CONCLUSION

Web personalization is a domain that has recently been gaining great momentum, not only in the research area, where many research teams have addressed the problem from different perspectives, but also in industry, where a variety of tools and applications address one or more modules of the personalization process. The work proposed in this paper appreciates the NFRs in the web engineering process and improves them using web personalization. Web personalization is achieved through association rules, and the rules are further optimized by genetic algorithms. The results and discussions presented above show that this is a promising approach.

REFERENCES

[1] Ginige, A, Web engineering: managing the complexity of web systems development, SEKE. (2002) 721-729

[2] Sun, Hongyu, "Quantifiable non-functional requirements modeling and static verification for web service compositions" (2010). Theses and Dissertations

[3] Minghao Lu, Web Personalization Based on Association Rules Finding on Both Static and Dynamic Web Data, THE UNIVERSITY OF BRITISH COLUMBIA, September, 2008

[4] A. Anitha, N. Krishnan, A Dynamic Web Mining Framework for E-Learning Recommendations using Rough Sets and Association Rule Mining, International Journal of Computer Applications (0975-8887), Volume 12 No.11, January 2011.

[5] Nasraoui, O., Soliman, M., Saka, E., Badia, A., Germain, R., "A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites", IEEE Transactions on Knowledge and Data Engineering, Volume 20, Issue 2, Feb 2008, pp. 202-215.

[6] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", ACM SIGKDD Explorations, Vol. 1. No. 2, 2000, pp. 12-23.

[7] S. Agarwal, S. Lamparter, and R. Studer, "Making Web services tradable - A policy-based approach for specifying preferences on Web service properties," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, no. 1, pp. 11-20, January 2009.

[8] B. Nuseibeh, "Weaving Together Requirements and Architecture", IEEE Computer, Vol. 34, No. 3, March 2001, pp.115-117.

[9] http://www.iai.unibonn.de/III/lehre/vorlesungen/SWT/RE05/slides/09_NonfunctionalRequirements.pdf

[10] Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa, Effective Personalization Based on Association Rule Discovery from Web Usage Data, WIDM01, 3rd ACM Workshop on Web Information and Data Management, November 9, 2001, Atlanta, Georgia, USA.

[11] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, "Web usage mining: discovery and applications of usage patterns from Web data", ACM SIGKDD Explorations, Vol. 1, No. 2, 2000.

[12] B. Mobasher, R. Cooley, and J. Srivastava. Creating adaptive web sites through usage-based clustering of urls. In IEEE Knowledge and Data Engineering Workshop (KDEX'99), November 1999.

[13] B. Mobasher, R. Cooley, and J. Srivastava. Automatic personalization based on Web usage mining. In Communications of the ACM, (43) 8, August 2000.

[14] B. Mobasher, H. Dai, T. Luo, M. Nakagawa, Y. Sun, and J. Wiltshire, "Discovery of aggregate usage profiles for Web personalization", In Proceedings of the WebKDD 2000 Workshop at the ACM SIGKDD 2000, Boston, August 2000.

[15] B. Mobasher, H. Dai, T. Luo and M. Nakagawa. Improving the effectiveness of collaborative filtering on anonymous Web usage data. In Proceedings of the IJCAI 2001 Workshop on Intelligent Techniques for Web Personalization (ITWP01), August 2001, Seattle.

[16] M. Nakagawa and B. Mobasher, "A hybrid web personalization model based on site connectivity", In Proc. of WebKDD 2003, pp. 59-70.

[17] R. Jayakarthik, Improvement of the non functional requirement using the web personalization with the adaptation of web usage mining, International Journal of Computing Technology, Vol II, No 4, March 2012.

[18] The Public Microsoft Anonymous Web Server Logs Data, available from http://kdd.ics.uci.edu/database/msweb/msweb.data.htm.

R. Jayakarthik and Dr. K. Alagarsamy

Lecturer, Department of Computer Science & Software, Madurai Sivakasi Nadar's Pioneer Meenakshi Women's College, Poovanthi, Sivagangai Dist., Tamilnadu, INDIA. jayapranav15@gmail.com
Director, Dept. of Computer Center, Madurai Kamaraj University, Madurai, Tamilnadu, INDIA. alagarsamymku@gmail.com