
Ontology assisted data mining and pattern discovery approach: a case study on Indian school education system.


In India, according to the 2009 Right to Education Act, education is a fundamental right and schooling should be free and compulsory for all children from the age of 6 to 14. However, improvements are being implemented slowly and disadvantaged groups still do not have adequate access to education. To implement policies for developing the infrastructure, facilities and quality of schools in a country with a population of more than 1.27 billion and 3 million sq. km of land area, a thorough survey and understanding of the current education scenario is a primary necessity.

To extract practical, real-world scenarios, trends and patterns from the available data, data mining techniques must be applied. The problem with mining this kind of miscellaneous data is that it requires domain knowledge and an understanding of the data. The different steps involved in data mining, such as data preprocessing, understanding data formats and structure, modeling the data, choosing an algorithm and assessing the results, demand some level of understanding of the data. Selection of the right features for mining also depends to some extent on domain knowledge, and interpretation of data mining results depends heavily on it. A further problem is the gap between data mining techniques and the inclusion of existing knowledge of the system in data mining. In the process of knowledge discovery, knowledge can be of two types: data mining knowledge and domain knowledge. Data mining knowledge covers the different algorithms, how they are used, the formats of input data, parameter tuning and so on. Domain knowledge covers the understanding of the dataset, relationships among variables, normal ranges of variables, known causal relations and so on (Kuo et al., 2007). Data mining algorithms are syntactical in the sense that they do not take advantage of existing knowledge in the learning process. The most challenging problem is the interpretability of the results.

In this paper, the importance of ontology for representing domain knowledge to assist the data mining process is illustrated. An ontology is simple and flexible, yet powerful enough to capture various degrees of relationship among objects, attributes and their properties. An approach for ontology supported data mining is implemented on the schools dataset in order to find useful patterns existing in the system. The paper is organized as follows. Section 1 briefly discusses different approaches to combining ontology with data mining. Section 2 gives the background of the application dataset; section 3 describes the research framework, including ontology design and development, data preprocessing, cluster analysis and association rule mining. The results and analysis are presented in section 4 along with the findings and discussion. This is followed by the conclusion and future work in section 5.

Related Work:

Data mining is generally seen as a descriptive task where values of attributes are taken as deciding factors. In recent research, ontologies have been used with clustering for biomedical informatics, the semantic web and text document clustering, but fewer approaches try to combine ontology with data mining to add domain knowledge in other fields of application (Peleg, Asbeh, Kuflik et al., 2009). Liu, Wang and Yang proposed an ontology driven subspace clustering framework which creates tendency preserving cluster trees and uses ontology based pruning techniques (Liu, Wang and Yang, 2004). Kuo, Lonie, Sonenberg and Paizis proposed an ontology based data mining approach which uses a new ontology based measurement, 'compared implication', for association rule mining (Kuo et al., 2007). Zhang and Wang created an ontology based domain feature graph with feature weights for clustering multidimensional data (Zhang and Wang, 2010). Liao, Chen and Hsu use ontology to model customer database knowledge for mining in sports marketing (Liao, Chen and Hsu, 2009). Kaur and Sapra use ontology with data mining techniques for grouping research proposals and research reviewers (Kaur and Sapra, 2013).

Application Area--Survey of Indian Schools:

To illustrate the significance of ontology in data mining, the school data set created by the National University of Educational Planning and Administration (NUEPA) through the District Information System for Education (DISE) has been taken up. The dataset consists of data about schools from all 29 states and 7 union territories of India. For each state/UT, the school data is divided into six parts: basic data, general data, facility data, RTE data, enrollment and repeaters data, and teachers data. Basic data consists of district, village/city, school name, code and address, and block and cluster resource centre details. General data consists of general details of each school such as area, school type, management, residential status, shifts, highest and lowest class and so on. Facility data contains details about physical facilities and equipment such as building status, electricity, water, toilets, playground, library, computer aided learning and blackboard facilities. RTE data consists of mid-day meal details of each government and government aided school. Enrollment and repeaters data contains class wise enrollment numbers along with the gender of students (total, SC, ST, OBC, EMBC), repeaters and differently abled students. Teachers data includes the number of male/female teachers and head teachers and their educational qualifications. Since the dataset contains 242 features, it is very difficult for a data mining tool to infer useful knowledge by random clustering. Useful patterns can be inferred only with the help of domain knowledge represented in the form of an ontology.

Proposed Ontology Assisted Mining Framework:

This section describes in detail the steps of ontology assisted data mining for the school data set. An ontology of the domain is designed; it is used to select inputs to the mining algorithms and is consulted again to evaluate and interpret the results. In this paper, the proposed framework uses the ontology together with two data mining techniques, clustering and association rule mining. The system flows through a series of steps which are explained below.


The research framework comprises three major tasks: constructing the ontology and categorizing features, data mining, and evaluation. Figure 2 shows the research framework. It begins with understanding the school domain, its properties and their interrelationships by modeling them as an ontology, which is consulted when choosing the features for the initial grouping and further clustering. Then the data in each cluster is preprocessed and association rule mining is applied, which finds hidden patterns in each cluster individually. The newly found rules are evaluated against the ontology, where needed, to obtain more useful semantics.

Ontology Design and Development:

Ontology design is done in three steps: data modeling, creating class hierarchy and defining relations among the classes and their properties. The role of ontology in this work is described in figure 1. Ontology diagrams are used to find relations among different levels in the hierarchy of the schooling system like national, state, district, city, village, block, cluster resource centre and so on. The districts have different areas which might be remote, rural, semi urban, urban or metro. Schools have different parameters which are modeled as sub classes like students, teachers, building, distance from nearest control centre among others. In this study, Protege Version 4.3.0 is used as the ontology design tool.
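As an illustration of how a class hierarchy can drive feature selection, the following is a minimal sketch in Python. It is not the Protege ontology built in this study: the class names (School, Location, Management, Category, Instruction) and the dictionary layout are assumptions made for the example, and only the feature names RURURB, SCHMGT, SCHCAT and MEDINSTR come from the dataset.

```python
# Hypothetical, simplified stand-in for the ontology: each class maps to its
# sub-classes and to the dataset features attached to it. The real ontology
# was modeled in Protege; this dictionary only illustrates the idea.
ONTOLOGY = {
    "School": {
        "subclasses": ["Location", "Management", "Category", "Instruction"],
        "features": [],
    },
    "Location": {"subclasses": [], "features": ["RURURB"]},
    "Management": {"subclasses": [], "features": ["SCHMGT"]},
    "Category": {"subclasses": [], "features": ["SCHCAT"]},
    "Instruction": {"subclasses": [], "features": ["MEDINSTR"]},
}


def features_under(cls, ontology=ONTOLOGY):
    """Collect the features of a class and of all its descendants."""
    node = ontology[cls]
    found = list(node["features"])
    for child in node["subclasses"]:
        found.extend(features_under(child, ontology))
    return found


def select_features(concepts, ontology=ONTOLOGY):
    """Return the features attached to the named concepts, without duplicates."""
    selected = []
    for concept in concepts:
        for feature in features_under(concept, ontology):
            if feature not in selected:
                selected.append(feature)
    return selected


# Features for clustering schools by area, category and management.
clustering_features = select_features(["Location", "Category", "Management"])
print(clustering_features)  # ['RURURB', 'SCHCAT', 'SCHMGT']
```

Asking for a concept higher in the hierarchy, for example `features_under("School")`, returns every feature attached below it, which is how a broad question can be narrowed to a concrete feature set.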



Data Preprocessing:

The dataset needs considerable preprocessing because it is partially unstructured and has inconsistencies caused by human error, multiple occurrences of features, missing values or data entry errors. The data is cleansed and integrated: missing values are either removed or filled with relevant data, and empty features are removed. Cluster analysis requires numeric values, whereas association rule mining using Apriori requires nominal values. Feature sets to be used for association rule mining are transformed into transactions, which is the input form accepted by Apriori.
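The transformation of nominal records into transactions can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `clean_records` and `to_transactions` are hypothetical, the sample rows are made up, and records with a missing value are simply dropped (filling with relevant data, the other option mentioned above, is not shown).

```python
def clean_records(records, features):
    """Normalise text values and drop records that miss any selected feature."""
    cleaned = []
    for record in records:
        values = {}
        for f in features:
            raw_value = record.get(f)
            values[f] = "" if raw_value is None else str(raw_value).strip().lower()
        if all(values.values()):  # every selected feature has a value
            cleaned.append(values)
    return cleaned


def to_transactions(records, features):
    """Turn each record into a set of 'FEATURE=value' items for Apriori."""
    return [frozenset(f"{f}={r[f]}" for f in features) for r in records]


# Made-up rows for illustration only; not taken from the DISE dataset.
raw = [
    {"SCHMGT": "Pvt Unaided ", "MEDINSTR": "English", "SCHCAT": "p_up_s"},
    {"SCHMGT": "edu dpt", "MEDINSTR": "", "SCHCAT": "pr"},  # missing value
    {"SCHMGT": "edu dpt", "MEDINSTR": "Tamil", "SCHCAT": "pr"},
]
features = ["SCHMGT", "MEDINSTR", "SCHCAT"]
transactions = to_transactions(clean_records(raw, features), features)
print(len(transactions))  # 2
```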

Cluster Analysis:

Clustering is a widely used technique whose goal is to partition a set of patterns into disjoint and homogeneous clusters. The K-means algorithm, hill-climbing and the density-based DBSCAN are among the most popular clustering methods. The goal of the K-means algorithm is to partition the data into k clusters so that the within-group sum of squares is minimized (Liao et al., 2009). This system employs K-means for cluster analysis and divides the dataset into groups based on parameters chosen as the outcome of the ontology and categorizing steps.
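The K-means objective can be made concrete with a short sketch. The code below is a plain implementation of the standard iteration (assign each point to its nearest center, then recompute each center as the mean of its members) together with the within-group sum of squares. It is not the tool used in this study, and the numeric points are made-up codes for (area, category, management) chosen only for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Alternate between nearest-center assignment and center recomputation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing: converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


def within_group_sum_of_squares(points, centers, assignment):
    """The quantity K-means tries to minimize."""
    return sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))


# Made-up numeric codes (area, category, management); not real DISE values.
points = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 5, 5), (2, 5, 6), (2, 6, 5)]
centers, assignment = k_means(points, k=2)
wss = within_group_sum_of_squares(points, centers, assignment)
print(assignment, round(wss, 3))
```

On this toy data the first three points and the last three points end up in separate clusters. K-means only guarantees a local minimum of the sum of squares, so in practice several random starts are usually compared.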

Association Rule Mining:

An association rule expresses a relationship among items in the transactions of a dataset. Discovering association rules is an important data mining problem. In this work, the association rule algorithm is employed mainly to determine relationships between items or features that occur together in the dataset, especially within clusters. The system uses the Apriori algorithm, one of the most common and successful algorithms for obtaining association rules.
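For readers unfamiliar with Apriori, the following is a compact sketch of its two stages: a level-wise search for frequent itemsets, and the derivation of rules whose confidence (support of the whole itemset divided by support of the antecedent) reaches a threshold. It is an illustrative implementation, not the software used in this study, and the transactions are made up.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates are built from frequent (k-1)-sets."""
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s, transactions) >= min_support}
    frequent = {}
    size = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        size += 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Apriori pruning: every (size-1)-subset must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
            and support(c, transactions) >= min_support
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return rules


# Made-up transactions for illustration; not taken from the DISE dataset.
transactions = (
    [frozenset({"SCHMGT=pvt unaided", "MEDINSTR=english"})] * 3
    + [frozenset({"SCHMGT=edu dpt", "MEDINSTR=tamil"})] * 4
    + [frozenset({"SCHMGT=edu dpt", "MEDINSTR=english"})]
)
rules = association_rules(frequent_itemsets(transactions, 0.1), 0.9)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), sup, conf)
```

With these toy transactions only two rules pass the confidence threshold of 0.9: private unaided management implies English medium, and Tamil medium implies management by the education department.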

Data Mining Results and Analysis:

The proposed framework is applied to the datasets described in section 2. Out of the 242 features that exist for each school, those required for analysis are chosen from the ontology. First, clustering is done using the K-Means algorithm, and then each cluster is processed individually with the association rule mining algorithm. In order to apply Apriori, the requisite fields are preprocessed into the required data format. For instance, suppose we want to find patterns about schools in Pondicherry according to school management and school type (which indicates primary/secondary/higher secondary standard) in relation to their location. Consulting the ontology, we choose RURURB, SCHCAT and SCHMGT from General Data for cluster analysis. To find patterns inside each cluster, we associate features such as medium of instruction, school management and school type. The main ontology is used to choose the features for clustering. The part of the ontology consulted is shown in figure 3.


Cluster Analysis:

Clustering is used to categorize data into well defined and meaningful groups. The number of clusters is fixed before running the K-Means algorithm. Clusters were formed on the basis of school area (rural/urban), school category and school management, and are analyzed according to their group centers. With the focus on school area distribution, the clustering results for the schools of Pondicherry are shown in figure 4: there are 3 clusters, cluster0, cluster1 and cluster2, containing 53%, 26% and 22% of the 717 schools respectively.
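The percentages follow from the cluster sizes reported in table 1 (377, 183 and 157 schools). A quick check shows that each share is rounded to the nearest whole number, which is why the three figures add up to 101% rather than 100%.

```python
# Cluster sizes as reported in table 1.
sizes = {"cluster0": 377, "cluster1": 183, "cluster2": 157}
total = sum(sizes.values())

# Share of each cluster, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in sizes.items()}

print(total)   # 717
print(shares)  # {'cluster0': 53, 'cluster1': 26, 'cluster2': 22}
```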


The results of the clustering are summarized in table 1.

Association Patterns:

The Apriori algorithm is applied to each cluster individually and association rules are formed based on school management, medium of instruction and school type. Table 2 lists some of the rules from cluster0, which mostly represents schools in rural areas, with minimum support 0.1 and minimum confidence 0.9.

Association mining applied to cluster1, which mostly consists of schools in urban areas, with minimum support 0.1 and minimum confidence 0.9 gives the rules listed in table 3.

Findings and Discussion:

This study uses K-Means to form meaningful groups. Comparing the values in table 1 with the standards mentioned in UDISE, we conclude that cluster0 consists of schools which are in rural areas, are managed by social welfare groups/local bodies and have classes from 1 to 12. These labels are obtained by matching the final cluster center values with the provided codes. Schools in cluster1 and cluster2 are in urban areas, but cluster1 schools are managed by the department of education and have classes from 1 to 12, whereas schools in cluster2 are private unaided and have classes 6 to 12. The association rules derived from Apriori for cluster0 and cluster1 show a correlation between school management and medium of instruction. For instance, rule 3 in table 2 can be interpreted as: English medium schools in rural areas are mostly private unaided and have classes 1 to 10. Many combinations of interesting clusters and rules can be formed with different feature sets chosen separately. This is possible only with the help of domain knowledge, which was provided by the constructed ontology.
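Matching a cluster center with the provided codes can be sketched as a nearest-code lookup. The code tables and center values below are hypothetical and serve only to show the mechanism; the actual codes are defined in the U-DISE data capture format and are not reproduced here.

```python
# Hypothetical code tables for illustration; the real codes come from the
# U-DISE data capture format and may differ.
CODES = {
    "RURURB": {1: "rural", 2: "urban"},
    "SCHMGT": {1: "dept. of education", 2: "local body", 3: "private unaided"},
}


def decode_center(center, codes=CODES):
    """Map each coordinate of a cluster center to the label of the nearest code."""
    decoded = {}
    for feature, value in center.items():
        nearest = min(codes[feature], key=lambda code: abs(code - value))
        decoded[feature] = codes[feature][nearest]
    return decoded


# Made-up center values; a center is a mean, so it rarely hits a code exactly.
print(decode_center({"RURURB": 1.08, "SCHMGT": 2.3}))
# {'RURURB': 'rural', 'SCHMGT': 'local body'}
```

Because a center is an average over coded values, this reading is meaningful only when one code dominates the cluster; a center halfway between two codes signals a mixed cluster rather than a clear label.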

Conclusion and Future Work:

This paper proposes a framework which uses ontology based cluster analysis and association rule mining to mine interesting knowledge from survey datasets of the schools of India. Knowledge extracted from the data is presented to the user as cluster allocations and association rules, which are interpreted with the help of the ontology. Future enhancements of the work include automating the consultation and querying of the ontology by the mining algorithms, attaching weights to features of the datasets to make the semantics more relevant, and pruning association rules based on improved semantic interestingness.


Article history:

Received 12 October 2014

Received in revised form 26 December 2014

Accepted 1 January 2015

Available online 25 February 2015

REFERENCES

Documents/UDISE_DCF2014-15_26Aug2014.pdf

Documents/U-DISE-SchoolEducationInIndia-2013-14.pdf

Kaur, P., R. Sapra, 2013. Ontology based classification and clustering of research proposals and external research reviewers. International Journal of Computers & Technology, 5(1): 49-53.

Kogilavani, A.A., B.D.P. Balasubramanie, 2009. Ontology enhanced clustering based summarization of medical documents. International Journal of Recent Trends in Engineering, 7(1).

Kuo, Y.T., A. Lonie, L. Sonenberg, K. Paizis, 2007. Domain ontology driven data mining: a medical case study. In Proceedings of the 2007 international workshop on Domain driven data mining, pp: 11-17. ACM.

Liao, S.H., J.L. Chen, T.Y. Hsu, 2009. Ontology-based data mining approach implemented for sport marketing. Expert Systems with Applications, 36(8): 11045-11056.

Liu, J., W. Wang, J. Yang, 2004. A framework for ontology-driven subspace clustering. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp: 623-628. ACM.

Lula, P., G. Paliwoda-Pekosz, 2008. An ontology-based cluster analysis framework. In Proceedings of the first international workshop on Ontology-supported business intelligence, pp: 7. ACM.

Mehta, A.C., 2013. Elementary education in India--Progress towards UEE.

Peleg, M., N. Asbeh, T. Kuflik, M. Schertz, 2009. Onto-clust--A methodology for combining clustering analysis and ontological methods for identifying groups of comorbidities for developmental disorders. Journal of Biomedical Informatics, 42(1): 165-175.

Song, W., C.H. Li, S.C. Park, 2009. Genetic algorithm for text clustering using ontology and evaluating the validity of various semantic similarity measures. Expert Systems with Applications, 36(5): 9095-9104.

Sureka, V., S.C. Punitha, 2012. Approaches to Ontology Based Algorithms for Clustering Text Documents. IJCTA Sept-Oct.

Yuan, S.T., C. Cheng, 2004. Ontology-based personalized couple clustering for heterogeneous product recommendation in mobile marketing. Expert Systems with Applications, 26(4): 461-476.

Zhang, L., Z. Wang, 2010. Ontology-based clustering algorithm with feature weights. Journal of Computational Information Systems, 6(9): 2959-2966.

(1) Prokriti Roy, (2) S. Siva Sathya, (3) Naveen Kumar

(1) M. Tech (Network and Internet Engineering), Department of Computer Science, Pondicherry University, Puducherry 605014, India

(2) Associate Professor, Department of Computer Science, Pondicherry University, Puducherry 605014, India

(3) Ph. D Research Scholar, Department of Computer Science, Pondicherry University, Puducherry 605014, India

Corresponding Author: Prokriti Roy, M. Tech (Network and Internet Engineering), Department of Computer Science, Pondicherry University, Puducherry 605014, India

Table 1: Values of result of clustering on schools of Pondicherry depending on features: rural or urban area, category of school and management of school.

                          Cluster 0         Cluster 1             Cluster 2

# schools (total 717)     377               183                   157
Area                      Rural             Urban                 Urban
School Management         Local body        Dept. of education    Private unaided
School Category           Class I to XII    Class I to XII        Class VI to XII

Table 2: Rules generated on cluster0.

1. SCHMGT=pvt unaided → MEDINSTR=english
2. SCHCAT=pr MEDINSTR=tamil → SCHMGT=edu dpt
3. SCHMGT=pvt unaided SCHCAT=p_up_s → MEDINSTR=english
4. MEDINSTR=tamil → SCHMGT=edu dpt
5. SCHCAT=p_up_s MEDINSTR=english → SCHMGT=pvt unaided
6. SCHMGT=edu dpt MEDINSTR=english → SCHCAT=pr
7. SCHCAT=pr → SCHMGT=edu dpt

Table 3: Rules generated on cluster1.

1. SCHMGT=dpt of edu → RURURB=urban
2. SCHCAT=p → RURURB=urban
3. SCHMGT=dpt of edu SCHCAT=p → RURURB=urban
4. SCHCAT=up_s → RURURB=urban
5. SCHCAT=up_s → SCHMGT=dpt of edu
6. SCHMGT=dpt of edu SCHCAT=up_s → RURURB=urban
7. RURURB=urban SCHCAT=up_s → SCHMGT=dpt of edu
8. SCHCAT=up_s → RURURB=urban SCHMGT=dpt of edu
9. SCHCAT=up_s_hs → RURURB=urban
10. SCHCAT=up_s_hs → SCHMGT=dpt of edu

Similarly, association rule mining on the third cluster (cluster2) yields a different set of rules.

COPYRIGHT 2015 American-Eurasian Network for Scientific Information
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2015 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Roy, Prokriti; Sathya, S. Siva; Kumar, Naveen
Publication:Advances in Natural and Applied Sciences
Article Type:Report
Date:Jun 1, 2015