# An impudent approach for intelligent data mining using rough set theory.

Introduction

Data mining is one step in the knowledge discovery process, which also employs preprocessing methods to facilitate the mining algorithms and post-processing methods to refine and improve the discovered knowledge [1]. Knowledge discovery aims to extract high-level knowledge, or to create a high-level description, from real-world data sets [2]. Within this process, data mining is the step that applies specific algorithms to extract patterns (models) from data. The additional steps of the Knowledge Discovery (KD) process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the mining results, ensure that useful knowledge is derived from the data [3]. Soft computing techniques, including neural networks, genetic algorithms, fuzzy sets, and rough sets, are the most widely used in the data mining phase of the overall KD process. Fuzzy sets provide a natural framework for dealing with uncertainty [4]. Neural networks [5] and rough sets [6] are widely used for classification and rule generation. Genetic algorithms are used in various optimization and search processes, such as query optimization [7] and template selection [8]. Other approaches, such as Case-Based Reasoning [9] and Decision Trees [10], are also widely used to solve data mining problems. The power of data mining, including its problem-solving capability, performance, and utility, depends on developing both generic and problem-specific algorithms that draw on methods from different fields of science. To move toward intelligent data mining, which obviates the need for human intervention, artificial intelligence must be embedded in data mining tools.

Intelligent data mining uses intelligent search to discover information within data warehouses that queries and reports cannot effectively reveal, to find patterns in the data and infer rules from them, and to use these patterns and rules to guide decision making and forecasting. Tools recently used in intelligent data mining include case-based reasoning, neural computing, intelligent agents, decision trees, rule induction, and data visualization. Rough sets help in granular computation and in the knowledge discovery process. Data mining tools such as Genetic Algorithms (GA) are presently used to recognize patterns, anticipate changes, and learn the buying habits and preferences of electronic commerce customers in Internet-based transactions [11][12]. In this paper we present a Genetic Algorithm based method to derive the rough sets from a set of given transactions.

The rest of this paper is organized as follows. Section 2 gives a brief description of rough set concepts and approximations. Section 3 describes the intelligent data mining problem, its architecture, and some decision rules. Section 4 is dedicated to the experimental results and their analysis.

Related Works

Recently, many researchers have applied various soft computing methodologies to handle the different challenges posed by data mining [13]. The main constituent of soft computing applied to intelligent data mining in this work is rough set theory.

Rough sets

Rough set theory, proposed by the mathematician Zdzislaw Pawlak [14] [15] [16], is an intelligent mathematical tool based on the concept of approximation spaces and on models of sets and concepts. In rough set theory the data are collected in a table called a decision table. Rows of the decision table correspond to objects, and columns correspond to features. Rough sets are commonly used in conjunction with other techniques to discriminate among the objects of a data set. The main feature of rough set data analysis is its ability to handle qualitative data in data mining. Rough set theory is concerned with the analysis of deterministic data dependencies.

Definition 1: An information system is a tuple $(U, A)$, where $U$ is a set of objects and $A$ is a set of features. Every $a \in A$ corresponds to a function $a : U \to V_a$, where $V_a$ is the value set of $a$. In applications, we often distinguish between the conditional features $C$ and the decision feature $D$, where $C \cap D = \emptyset$. In such cases, we define the decision system $(U, C, D)$.

In Table 1, the set {patient2, patient3, patient5} is indiscernible with respect to the headache attribute, and the set {patient1, patient3, patient4} is indiscernible with respect to the vomiting attribute. Patient2 has a viral illness whereas patient5 does not, yet the two are indiscernible with respect to the attributes headache, vomiting, and temperature. Therefore, patient2 and patient5 are the elements of the patients' set with inconclusive symptoms.
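The indiscernibility grouping described above can be sketched in a few lines of Python. This is only an illustration: the dictionary `TABLE1` transcribes Table 1 of this article, and the helper name `indiscernibility_classes` is a hypothetical choice, not something defined by the authors.

```python
from collections import defaultdict

# Table 1 as transcribed in this article (patient number -> attribute values).
TABLE1 = {
    1: {"temperature": "High", "headache": "No", "vomiting": "Yes", "illness": "Yes"},
    2: {"temperature": "High", "headache": "Yes", "vomiting": "No", "illness": "Yes"},
    3: {"temperature": "Very High", "headache": "Yes", "vomiting": "Yes", "illness": "Yes"},
    4: {"temperature": "Normal", "headache": "No", "vomiting": "Yes", "illness": "No"},
    5: {"temperature": "High", "headache": "Yes", "vomiting": "No", "illness": "No"},
    6: {"temperature": "Very High", "headache": "No", "vomiting": "Yes", "illness": "Yes"},
}


def indiscernibility_classes(table, attributes):
    """Partition the objects into classes that agree on every listed attribute."""
    groups = defaultdict(list)
    for obj, row in table.items():
        key = tuple(row[a] for a in attributes)
        groups[key].append(obj)
    return sorted(groups.values())


headache_classes = indiscernibility_classes(TABLE1, ["headache"])
condition_classes = indiscernibility_classes(
    TABLE1, ["temperature", "headache", "vomiting"]
)

print(headache_classes)   # [[1, 4, 6], [2, 3, 5]]
print(condition_classes)  # [[1], [2, 5], [3], [4], [6]]
```

Patients 2 and 5 end up in the same class under all three condition attributes even though their illness values differ, which is exactly the inconclusive pair discussed in the text.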

Definition 2: In rough set theory, the approximation of sets is introduced to deal with inconsistency. A rough set approximates a traditional (crisp) set by a pair of sets called the lower and upper approximations of the set. Given a set $B \subseteq A$, and writing $[x]_B$ for the class of objects indiscernible from $x$ with respect to $B$, the lower and upper approximations of a set $Y \subseteq U$ are defined as follows.

$$\underline{B}Y = \{x \in U : [x]_B \subseteq Y\} \tag{1}$$

$$\overline{B}Y = \{x \in U : [x]_B \cap Y \neq \emptyset\} \tag{2}$$
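To make the two approximations concrete, the following minimal sketch computes them for the patients of Table 1 with Illness = Yes, taking all three condition attributes as the set B. The data layout and the function name `approximations` are assumptions made for this example only.

```python
from collections import defaultdict

# Table 1: patient -> ((temperature, headache, vomiting), illness)
TABLE1 = {
    1: (("High", "No", "Yes"), "Yes"),
    2: (("High", "Yes", "No"), "Yes"),
    3: (("Very High", "Yes", "Yes"), "Yes"),
    4: (("Normal", "No", "Yes"), "No"),
    5: (("High", "Yes", "No"), "No"),
    6: (("Very High", "No", "Yes"), "Yes"),
}


def approximations(table, target):
    """Return (lower, upper) approximations of `target` using all condition attributes."""
    classes = defaultdict(set)
    for obj, (conditions, _decision) in table.items():
        classes[conditions].add(obj)
    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # the class lies entirely inside the target set
            lower |= block
        if block & target:    # the class shares at least one object with the target
            upper |= block
    return lower, upper


ill = {p for p, (_conditions, decision) in TABLE1.items() if decision == "Yes"}
lower, upper = approximations(TABLE1, ill)

print(sorted(lower))  # [1, 3, 6]
print(sorted(upper))  # [1, 2, 3, 5, 6]
```

The lower approximation collects the patients who certainly have the illness, and the upper approximation adds the indiscernible pair of patients 2 and 5.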

The positive region is defined as:

$$POS_C(D) = \bigcup_{X \in U/Ind_D} \underline{C}X \tag{3}$$

where $\underline{C}X$ and $\overline{C}X$ denote the lower and upper approximations of $X$ with respect to $C$. $POS_C(D)$ is the set of all objects in $U$ that can be uniquely classified into the elementary sets of the partition $U/Ind_D$ by means of $C$ [17]. The negative region $NEG_C(D)$ is defined by:

$$NEG_C(D) = U - \bigcup_{X \in U/Ind_D} \overline{C}X \tag{4}$$

It is the set of all objects that can be definitely ruled out as members of $X$. The boundary region is the difference between the upper and lower approximations of a set $X$; it consists of the equivalence classes that have one or more elements in common with $X$ without being contained in $X$, and it is given by the following formula:

$$BND_B(X) = \overline{B}X - \underline{B}X \tag{5}$$

Definition 3: Given a decision system, the degree of dependency of $D$ on $C$ is defined as:

$$\gamma(C, D) = \frac{|POS_C(D)|}{|U|} \tag{6}$$

A reduct is a subset $R \subseteq C$ such that

$$\gamma(C, D) = \gamma(R, D) \tag{7}$$

A reduct is a minimal subset of attributes that preserves the degree of dependency of the decision attributes on the full set of condition attributes. The intersection of all the relative reducts is called the core.
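As an illustration of Definition 3, the sketch below computes the dependency degree and searches for reducts on Table 1 by brute force. The exhaustive subset search, the row layout, and the names `dependency` and `reducts` are assumptions of this example; the approach is only practical for a handful of attributes.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

CONDITIONS = ("temperature", "headache", "vomiting")
# Table 1: one tuple per patient, condition values followed by the decision (illness).
ROWS = [
    ("High", "No", "Yes", "Yes"),
    ("High", "Yes", "No", "Yes"),
    ("Very High", "Yes", "Yes", "Yes"),
    ("Normal", "No", "Yes", "No"),
    ("High", "Yes", "No", "No"),
    ("Very High", "No", "Yes", "Yes"),
]


def dependency(attrs):
    """gamma(attrs, D): share of objects whose class under `attrs` has one decision."""
    idx = [CONDITIONS.index(a) for a in attrs]
    decisions = defaultdict(set)
    sizes = defaultdict(int)
    for row in ROWS:
        key = tuple(row[i] for i in idx)
        decisions[key].add(row[-1])
        sizes[key] += 1
    positive = sum(n for key, n in sizes.items() if len(decisions[key]) == 1)
    return Fraction(positive, len(ROWS))


def reducts():
    """Minimal attribute subsets with the same dependency degree as the full set."""
    full = dependency(CONDITIONS)
    found = []
    for size in range(1, len(CONDITIONS) + 1):
        for subset in combinations(CONDITIONS, size):
            preserves = dependency(subset) == full
            minimal = not any(set(r) <= set(subset) for r in found)
            if preserves and minimal:
                found.append(subset)
    return found


all_reducts = reducts()
core = set(CONDITIONS).intersection(*map(set, all_reducts))

print(dependency(CONDITIONS))  # 2/3
print(all_reducts)
print(core)
```

On this small table four of the six patients fall in the positive region, two attribute pairs preserve that dependency degree, and temperature is the only attribute common to both.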

Boundary Region Approximations

The boundary region of a set X with respect to R is the set of all objects that can be classified neither as X nor as -X (the complement of X) with respect to R. If the boundary region is empty, the set X is considered "crisp", that is, exact in relation to R; otherwise, if the boundary region is non-empty, the set X is considered "rough". The boundary region is given by $BR = D^* - D_*$, the difference between the upper approximation $D^*$ and the lower approximation $D_*$. Rough approximations are shown in Fig. 1.

[FIGURE 1 OMITTED]

Let $X \subseteq U$ be a set, let $D$ be an equivalence relation, and let $K = (U, D)$ be a knowledge base. Then two subsets can be associated with $X$, the lower approximation $D_*(X)$ and the upper approximation $D^*(X)$:

$$D_*(X) = \bigcup \{Y \in U/D : Y \subseteq X\} \tag{8}$$

$$D^*(X) = \bigcup \{Y \in U/D : Y \cap X \neq \emptyset\} \tag{9}$$

In the same way, POS(D), BR(D) and NEG(D) are defined below [18].

POS(D) = $D_*$: certainly a member of X

NEG(D) = $U - D^*$: certainly a non-member of X

BR(D) = $D^* - D_*$: possibly a member of X

Rough sets on Intelligent Data Mining

Intelligent data mining uses intelligent search to discover knowledge within databases and data warehouses that queries and reports cannot effectively reveal, and to find patterns in the data. Rough sets can be used in different phases of the knowledge discovery process, such as attribute selection, attribute extraction, data reduction, decision rule generation, and pattern extraction [19].

[FIGURE 2 OMITTED]

In this approach, the input parameters pass through a rough analysis system, which acts as the data mining core of our system. The output of this system is a new database with reduced rows and columns; that is, redundancies in both the attributes and the entities of the information system are discovered and removed from the database. This block also identifies the condition attributes that strongly affect each decision attribute. After this process, the new set of condition attributes is passed through an artificial neural network, and the corresponding decision attributes appear at the network outputs. We derived the following decision rules for the dengue diagnosis data.

Rule 1: If patient blotched_red_skin = No and muscular_pain_articulations = No and temperature = Normal Then dengue = No.

Rule 2: If patient blotched_red_skin = No and muscular_pain_articulations = No and temperature = Very High Then dengue = Yes.

Rule 3: If patient blotched_red_skin = No and muscular_pain_articulations = Yes and temperature = High Then dengue = Yes.
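Rules of this form can be read off the decision table mechanically: every combination of condition values whose patients all share one decision yields a certain rule. The sketch below does this for Table 2 as transcribed in this article. The function names `certain_rules` and `diagnose` are hypothetical, and this is one plausible reading of the rule-generation step, not necessarily the authors' procedure.

```python
from collections import defaultdict

# Table 2 as transcribed in this article:
# patient -> (blotched_red_skin, muscular_pain_articulations, temperature, dengue)
TABLE2 = {
    "P1": ("No", "No", "Normal", "No"), "P2": ("No", "No", "High", "No"),
    "P3": ("No", "No", "Very High", "Yes"), "P4": ("No", "Yes", "High", "Yes"),
    "P5": ("No", "Yes", "Very High", "Yes"), "P6": ("Yes", "Yes", "High", "Yes"),
    "P7": ("Yes", "Yes", "Very High", "Yes"), "P8": ("No", "No", "High", "No"),
    "P9": ("Yes", "No", "Very High", "Yes"), "P10": ("Yes", "No", "High", "No"),
    "P11": ("Yes", "No", "Very High", "No"), "P12": ("No", "Yes", "Normal", "No"),
    "P13": ("No", "Yes", "High", "Yes"), "P14": ("No", "Yes", "Normal", "No"),
    "P15": ("Yes", "Yes", "Normal", "No"), "P16": ("Yes", "No", "Normal", "No"),
    "P17": ("Yes", "No", "High", "No"), "P18": ("Yes", "Yes", "Very High", "Yes"),
    "P19": ("Yes", "No", "Normal", "No"), "P20": ("No", "Yes", "Normal", "No"),
}


def certain_rules(table):
    """Map each condition triple to its decision when all matching patients agree."""
    decisions = defaultdict(set)
    for *conditions, decision in table.values():
        decisions[tuple(conditions)].add(decision)
    return {cond: next(iter(ds)) for cond, ds in decisions.items() if len(ds) == 1}


RULES = certain_rules(TABLE2)


def diagnose(blotched_red_skin, muscular_pain, temperature):
    """Return 'Yes' or 'No' when a certain rule applies, else None (inconclusive)."""
    return RULES.get((blotched_red_skin, muscular_pain, temperature))


print(diagnose("No", "No", "Normal"))      # No   (Rule 1)
print(diagnose("No", "No", "Very High"))   # Yes  (Rule 2)
print(diagnose("No", "Yes", "High"))       # Yes  (Rule 3)
print(diagnose("Yes", "No", "Very High"))  # None (P9 and P11 disagree)
```

Returning `None` for the conflicting combination keeps the inconclusive cases visible instead of forcing a diagnosis.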

Experiments and Analysis

In this section we describe our experimental results. The dengue fever data were collected from different medical diagnosis labs in Hyderabad, India. Based on these data we created an information table, from which the decision rules for dengue diagnosis can be generated.

Imprecision coefficient $\alpha_D(X)$: the quality of approximation of $X$, denoted by

$$\alpha_D(X) = \frac{|D_*(X)|}{|D^*(X)|} \tag{10}$$

where $|D_*(X)|$ and $|D^*(X)|$ are the cardinalities of the lower and upper approximations, and the approximations are non-empty sets. Therefore $0 \le \alpha_D(X) \le 1$. If $\alpha_D(X) = 1$, $X$ is a definable set with respect to the attributes $D$, that is, $X$ is a crisp set. If $\alpha_D(X) < 1$, $X$ is a rough set with respect to the attributes $D$. Applying equation (10) to Table 1, we get $\alpha_D(X) = 3/5$ for the patients who possibly have the illness. Applying equation (10) to Table 2, the lower approximation of the patients with dengue has 7 elements and the upper approximation has 9 (the 7 certain cases together with the indiscernible pair P9 and P11), so $\alpha_D(X) = 7/9$; for the patients without dengue the lower approximation has 11 elements and the upper approximation has 13, so $\alpha_D(X) = 11/13$.
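The Table 1 value can be checked by hand. In the worked steps below, $X$ is taken to be the patients with Illness = Yes, $U/D$ is the partition obtained by grouping the rows of Table 1 on Temperature, Headache and Vomiting, and $D_*$ and $D^*$ denote the lower and upper approximations; this notation is an assumption used for the illustration.

```latex
\begin{aligned}
X &= \{\#1, \#2, \#3, \#6\} \\
U/D &= \{\{\#1\}, \{\#2, \#5\}, \{\#3\}, \{\#4\}, \{\#6\}\} \\
D_*(X) &= \{\#1, \#3, \#6\}, \qquad D^*(X) = \{\#1, \#2, \#3, \#5, \#6\} \\
\alpha_D(X) &= \frac{|D_*(X)|}{|D^*(X)|} = \frac{3}{5}
\end{aligned}
```

The class $\{\#2, \#5\}$ overlaps $X$ without being contained in it, so it enters the upper approximation only, which is why the coefficient falls below 1.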

Upper approximation ratio $\alpha_D(D^*(X))$: the percentage of all the elements that are possibly classified as belonging to $X$, denoted by

$$\alpha_D(D^*(X)) = \frac{|D^*(X)|}{|U|} \tag{11}$$

where $|U|$ is the number of objects in the table. From Table 1, we get $\alpha_D(D^*(X)) = 5/6$ for the patients who possibly have the illness.

The upper approximation set ($D^*$) of the patients who possibly have dengue is identified as

$D^*$ = {P3, P4, P5, P6, P7, P9, P11, P13, P18}

The upper approximation set ($D^*$) of the patients who possibly do not have dengue is identified as

$D^*$ = {P1, P2, P8, P9, P10, P11, P12, P14, P15, P16, P17, P19, P20}

P9 and P11 have identical condition attributes but different decisions, so both belong to each upper approximation. Using equation (11), $\alpha_D(D^*(X)) = 9/20$ for the patients who possibly have dengue, and $\alpha_D(D^*(X)) = 13/20$ for the patients who possibly do not have dengue.

Lower approximation ratio $\alpha_D(D_*(X))$: the percentage of all the elements that are certainly classified as belonging to $X$, denoted by

$$\alpha_D(D_*(X)) = \frac{|D_*(X)|}{|U|} \tag{12}$$

where $|U|$ is the number of objects in the table. From Table 1, $\alpha_D(D_*(X)) = 3/6 = 1/2$ for the patients who certainly have the illness. The lower approximation set ($D_*$) of the patients who definitely have dengue is identified as

$D_*$ = {P3, P4, P5, P6, P7, P13, P18}

The lower approximation set ($D_*$) of the patients who certainly do not have dengue is identified as

$D_*$ = {P1, P2, P8, P10, P12, P14, P15, P16, P17, P19, P20}

Using equation (12), $\alpha_D(D_*(X)) = 7/20$ for the patients who have dengue, and $\alpha_D(D_*(X)) = 11/20$ for the patients who do not have dengue.

[FIGURE 3 OMITTED]

[FIGURE 4 OMITTED]

[FIGURE 5 OMITTED]

Patients with dengue: $\alpha_D(D_*(X)) = 7/20$, that is, 35% of the patients certainly have dengue. Patients without dengue: $\alpha_D(D_*(X)) = 11/20$, that is, 55% of the patients certainly do not have dengue. The remaining 10% of the patients (P9 and P11) can be classified neither as having dengue nor as not having it: their condition attributes are all identical and only the decision attribute (dengue) differs, which yields an inconclusive diagnosis for dengue.
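These percentages can be reproduced directly from Table 2. The following sketch, with the hypothetical helper names `regions` and `share`, splits the patients into the two lower approximations and the boundary region; the data are transcribed from Table 2 of this article.

```python
from collections import defaultdict

# Table 2: patient -> (blotched_red_skin, muscular_pain_articulations, temperature, dengue)
TABLE2 = {
    "P1": ("No", "No", "Normal", "No"), "P2": ("No", "No", "High", "No"),
    "P3": ("No", "No", "Very High", "Yes"), "P4": ("No", "Yes", "High", "Yes"),
    "P5": ("No", "Yes", "Very High", "Yes"), "P6": ("Yes", "Yes", "High", "Yes"),
    "P7": ("Yes", "Yes", "Very High", "Yes"), "P8": ("No", "No", "High", "No"),
    "P9": ("Yes", "No", "Very High", "Yes"), "P10": ("Yes", "No", "High", "No"),
    "P11": ("Yes", "No", "Very High", "No"), "P12": ("No", "Yes", "Normal", "No"),
    "P13": ("No", "Yes", "High", "Yes"), "P14": ("No", "Yes", "Normal", "No"),
    "P15": ("Yes", "Yes", "Normal", "No"), "P16": ("Yes", "No", "Normal", "No"),
    "P17": ("Yes", "No", "High", "No"), "P18": ("Yes", "Yes", "Very High", "Yes"),
    "P19": ("Yes", "No", "Normal", "No"), "P20": ("No", "Yes", "Normal", "No"),
}


def regions(table, label):
    """Return (lower approximation, boundary region) for the patients with `label`."""
    classes = defaultdict(set)
    for patient, (*conditions, _decision) in table.items():
        classes[tuple(conditions)].add(patient)
    target = {p for p, row in table.items() if row[-1] == label}
    lower, boundary = set(), set()
    for block in classes.values():
        if block <= target:    # every patient in the class has this label
            lower |= block
        elif block & target:   # the class mixes both labels
            boundary |= block
    return lower, boundary


def share(patients):
    """Percentage of the whole table covered by `patients`."""
    return 100 * len(patients) / len(TABLE2)


lower_yes, boundary = regions(TABLE2, "Yes")
lower_no, _ = regions(TABLE2, "No")

print(f"certainly dengue:    {share(lower_yes):.0f}%")
print(f"certainly no dengue: {share(lower_no):.0f}%")
print(f"inconclusive:        {share(boundary):.0f}% {sorted(boundary)}")
```

The three shares add up to the whole table, with the boundary region holding exactly the two patients whose symptoms coincide but whose diagnoses differ.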

Conclusions

In this paper we presented an approach that applies rough sets to intelligent data mining, for eliminating redundant data and developing a set of rules that can aid the doctor in elaborating a patient's diagnosis. Incomplete data are handled through the lower and upper approximations, with a rough set defined as a pair of two crisp sets given by these approximations. We derived an information table from which the decision rules needed to aid dengue diagnosis can be generated. Using Bayesian networks for classification and rule generation within this computational paradigm is the aim of our future work.

References

[1] S. Mitra and T. Acharya, Data Mining: Multimedia, Soft Computing, and Bioinformatics, John Wiley & Sons, Inc., NY, USA, 2003.

[2] M. Kantardzic, Data Mining: Concepts, Models, Methods and Algorithms, John Wiley & Sons, Inc., New York, USA, 2002.

[3] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM, Vol. 39, 1996, pp. 27-34.

[4] Y. Y. Liu and X. Q. Wu, "Evaluation for data fusion system based on uncertainty," Journal of Data Acquisition & Processing, Vol. 20, 2005, pp. 150-155.

[5] A. Ataei, "Design of rubble mound breakwaters using artificial neural networks," M.Sc. thesis, Tarbiat Modares University, Tehran, Iran, 2002.

[6] P. Yang, "Data mining diagnosis system based on rough set theory for boilers in thermal power plants," Frontiers of Mechanical Engineering in China, Vol. 1, 2006, pp. 162-167.

[7] F. Pentaris and Y. Ioannidis, "Query optimization in distributed networks of autonomous database systems," ACM Transactions on Database Systems, Vol. 31, 2006, pp. 537-583.

[8] A. Lumini and L. Nanni, "A clustering method for automatic biometric template selection," Pattern Recognition, Vol. 39, 2006, pp. 495-497.

[9] M. M. Richter and A. Aamodt, "Case-based reasoning foundations," The Knowledge Engineering Review, Vol. 20, 2006, pp. 203-207.

[10] C. Scott and R. D. Nowak, "Minimax-optimal classification with dyadic decision trees," IEEE Transactions on Information Theory, Vol. 52, 2006, pp. 1335-1353.

[11] J. McCarthy, "Phenomenal data mining," Communications of the ACM, Vol. 43, No. 8, 2000, pp. 75-80.

[12] T. K. Sung, N. Chang, and G. Lee, "Dynamics of modeling in data mining: Interpretive approach to bankruptcy prediction," Journal of Management Information Systems, Vol. 16, No. 1, 1999, pp. 63-86.

[13] S. Mitra, "Data mining in soft computing framework: A survey," IEEE Transactions on Neural Networks, Vol. 13, No. 1, January 2002.

[14] Z. Pawlak, "Rough sets," Int. J. Computer and Information Sci., Vol. 11, 1982, pp. 341-356.

[15] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1991.

[16] Z. Pawlak, J. Grzymala-Busse, R. Slowinski, and W. Ziarko, "Rough sets," Communications of the ACM, Vol. 38, No. 11, 1995, pp. 89-95.

[17] A. E. Hassanien and H. Own, "Rough sets for prostate patient analysis," in Proceedings of International Conference on Modeling and Simulation (MS 2006), Malaysia, 2006.

[18] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, ISBN 0-79231472, Norwell, USA, 1991.

[19] J. Komorowski, Z. Pawlak, L. Polkowski, and A. Skowron, "Rough sets perspective on data and knowledge," in The Handbook of Data Mining and Knowledge Discovery, W. Klösgen and J. Żytkow, Eds., Oxford University Press, ISBN 0-19-511831-6, New York, USA, 1999, pp. 134-149.

Madhu G., G. Suresh Reddy and Dr. C. Kiranmai

V.N.R Vignana Jyothi Institute of Engineering & Technology

Batchupally Nizampet (S. O.), Hyderabad- 500 090, Andhra Pradesh, India

E-mail: madhu_g@vnrvjiet.in, ithead@vnrvjiet.in

E-mail: viceprincipal@vnrvjiet.in

Table 1: Information table for Dengue Fever

| Patient | Temperature | Headache | Vomiting | Illness |
| --- | --- | --- | --- | --- |
| #1 | High | No | Yes | Yes |
| #2 | High | Yes | No | Yes |
| #3 | Very High | Yes | Yes | Yes |
| #4 | Normal | No | Yes | No |
| #5 | High | Yes | No | No |
| #6 | Very High | No | Yes | Yes |

Table 2: Dengue symptoms for the patients. Blotched_red_skin, Muscular_pain and Temperature are the conditional attributes; Dengue Fever is the decision attribute.

| Patient Name | Blotched_red_skin | Muscular_pain | Temperature | Dengue Fever |
| --- | --- | --- | --- | --- |
| P1 | No | No | Normal | No |
| P2 | No | No | High | No |
| P3 | No | No | Very High | Yes |
| P4 | No | Yes | High | Yes |
| P5 | No | Yes | Very High | Yes |
| P6 | Yes | Yes | High | Yes |
| P7 | Yes | Yes | Very High | Yes |
| P8 | No | No | High | No |
| P9 | Yes | No | Very High | Yes |
| P10 | Yes | No | High | No |
| P11 | Yes | No | Very High | No |
| P12 | No | Yes | Normal | No |
| P13 | No | Yes | High | Yes |
| P14 | No | Yes | Normal | No |
| P15 | Yes | Yes | Normal | No |
| P16 | Yes | No | Normal | No |
| P17 | Yes | No | High | No |
| P18 | Yes | Yes | Very High | Yes |
| P19 | Yes | No | Normal | No |
| P20 | No | Yes | Normal | No |

Author: | Madhu G.; Reddy, G. Suresh; Kiranmai, C. |
---|---|

Publication: | International Journal of Computational Intelligence Research |

Date: | Apr 1, 2011 |
