
A privacy preserving data mining approach for handling data with outliers.


1. Introduction:

Data mining [1] is the process of extracting knowledge from data stored in large repositories. The privacy preserving data mining (PPDM) problem [2] has received growing attention in recent years because large amounts of information about individuals are stored with different vendors for research purposes. PPDM is a comparatively new research topic in data mining and statistical databases, in which data mining algorithms are analyzed to determine whether they preserve the privacy of the data. Protecting individuals' data from disclosure is a central requirement, so privacy plays a major role throughout the data mining process. The problem is that data mining output can reveal an individual's personal data, which threatens that individual's privacy. People expect that their personal information will not become known to others without their consent, but data mining algorithms alone fail to protect it. Privacy is defined as the right of an individual to keep sensitive information from being disclosed: given a set of records, an adversary should not be able to identify the person associated with any record. Because the results of data mining operations can be sensitive, privacy is one of the important properties [4] that such a system must satisfy, and numerous efforts have been devoted to PPDM algorithms that protect information from disclosure. One of the basic data mining problems is outlier detection. An outlier is an observation that deviates from the other observations or from the rest of the data [6]. Outliers can represent novel, abnormal, unusual or noisy information, and they may be real or erroneous.

2. Related work:

In recent years privacy preserving data publishing has drawn considerable attention. To protect [3] the privacy of individuals, a dataset must be anonymized before it is released. Previous studies [3] have shown that merely removing explicit identifiers such as name and SSN (Social Security Number) from the dataset cannot maintain privacy, because quasi-identifiers such as zip code and gender can jointly identify a person. The identity of a person can be revealed easily when the dataset is joined with a public dataset (e.g., a voter list). Sweeney [5] proposed the k-anonymity method, which is treated as the conventional method for anonymization. A quasi-identifier consists of person-specific attribute information. k-anonymity is achieved using generalization and suppression, so that each individual is indistinguishable from at least k-1 other records. Generalization replaces a value with a less specific but semantically consistent one, while suppression [8] does not release the value at all, which reduces the exactness of applications. The k-anonymity method prevents linkage attacks. The authors of [10][5] proved that removing quasi-identifiers from the dataset does not ensure privacy, and suggested that k-anonymity is better suited for publishing microdata. The author of [9] suggested a novel bottom-up approach to group the quasi-identifiers. Optimal k-anonymization [5] has been proved to be NP-hard. Two types of attack remain possible against k-anonymity: the background-knowledge attack and the homogeneity attack.
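As an illustration, k-anonymity through generalization can be sketched as follows. This is a minimal example with hypothetical (zip code, gender) quasi-identifier tuples, not the exact scheme of [5]; `generalize_zip` and `is_k_anonymous` are illustrative helpers.

```python
from collections import Counter

def generalize_zip(zip_code, level=2):
    # Generalization: replace the last `level` digits with '*',
    # making the value less specific but still consistent.
    return zip_code[:-level] + "*" * level

def is_k_anonymous(records, k):
    # k-anonymity holds if every quasi-identifier group
    # contains at least k records.
    groups = Counter(records)
    return all(count >= k for count in groups.values())

# Hypothetical (zip, gender) quasi-identifier tuples.
raw = [("60601", "M"), ("60602", "M"), ("60611", "F"), ("60612", "F")]
generalized = [(generalize_zip(z), g) for z, g in raw]
print(is_k_anonymous(raw, 2))          # False: raw zips are unique
print(is_k_anonymous(generalized, 2))  # True: truncated zips form groups
```

After truncation all four zips collapse to `606**`, so each (zip, gender) group holds two records and 2-anonymity holds.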

The l-diversity [7] model was introduced to protect against attribute disclosure: each equivalence class must contain distinct, well-represented sensitive values. Improved methods such as t-closeness, p-sensitive anonymity and (k, e)-anonymity [11] are described in the literature. Alongside the l-diversity model, several other approaches have been proposed to achieve the privacy principle [13,15,16,18]; they can be classified into partitioning methods and randomization methods. In the partitioning-based method [15,18] the dataset is divided into quasi-identifier groups and only the anonymized groups are published. To increase the utility of the anonymized dataset, a non-homogeneous generalization method was proposed in [12]. In the randomization approach the original values are replaced with noisy or duplicate values [15]. Li et al. proposed the t-closeness method [14], in which the distribution of sensitive values in the released dataset must be close to that in the original. If outliers are present in the original dataset they appear in both the original and the modified dataset, so outliers can easily be detected from the distribution.

A few studies have shown the possibility of attacks on partition-based schemes. Machanavajjhala et al. [10] described some of the attacks on k-anonymity and proposed l-diversity. Our work adapts the l-diversity model.

In recent years a new model for privacy has emerged, known as differential privacy [20]. Under differential privacy, the removal or addition of any one record does not noticeably affect the output computed over the dataset [19]. Numerous techniques have been proposed for publishing different types of data so as to satisfy differential privacy [21,23]. Barak et al. [21] proposed a method to publish the marginals of a dataset.

Blum et al. [22] proposed an approach for releasing one-dimensional data that satisfies differential privacy in a non-interactive way, and the authors of [24] improved its performance. A wavelet-based approach [25] was used by Xiao et al. for publishing multi-dimensional datasets.

3. Privacy preserving method for data containing Outliers:

Organizations increasingly publish microdata that contains non-aggregated information about individuals. Non-aggregated data that contains outliers raises serious privacy concerns: outliers are easier to distinguish from the crowd, so their privacy is breached. In a distinguishability-based attack, the adversary identifies outliers in an anonymized dataset and reveals their private information. The existing plain l-diversity approach protects the dataset from the distinguishability attack, but it results in information loss because it hides only the hideable outliers present in the dataset.

In l-diversity, all records that share the same quasi-identifier values must have distinct values for their sensitive attributes. In previous studies [3] using the l-diversity method, the QI attributes are generalized and the outliers present in the data are hidden in order to maintain privacy. However, generalizing and hiding data containing outliers results in information loss, because only the hideable outliers are hidden while the unhideable outliers are eliminated.
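The distinct l-diversity requirement stated above can be checked with a short sketch. The records and the `is_l_diverse` helper are hypothetical illustrations, not part of the proposed system.

```python
from collections import defaultdict

def is_l_diverse(records, l):
    # records: (quasi_identifier_tuple, sensitive_value) pairs.
    # Distinct l-diversity: every equivalence class (records sharing
    # the same QI values) must hold at least l distinct sensitive values.
    classes = defaultdict(set)
    for qi, sensitive in records:
        classes[qi].add(sensitive)
    return all(len(values) >= l for values in classes.values())

# Hypothetical anonymized records: (generalized QI, disease).
data = [(("606**", "M"), "flu"),
        (("606**", "M"), "cancer"),
        (("606**", "F"), "flu"),
        (("606**", "F"), "flu")]
print(is_l_diverse(data, 2))  # False: the female class has one distinct value
```

Replacing one of the female records' sensitive value with a different disease would make the dataset 2-diverse.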

In the proposed system, information loss is reduced by enhancing the l-diversity model with the KNN algorithm. The proposed system consists of five steps, described in Fig. 3.1.

In the proposed system the dataset is first imported and fuzzy clustering is applied. Fuzzy clustering groups the data into n clusters in which each data point belongs to each cluster to a certain degree; in simple words, a point can belong to more than one cluster. Fuzzy clustering is used because it lets each data point move toward its nearest cluster; the KNN classifier is then applied to find the outliers. Fuzzy clustering finds the combination weights, membership functions and cluster centres that minimize the objective function. Outliers are observations that deviate from the others. KNN classifies new cases (outliers) based on distance functions, and the outliers present in the data are moved to the nearest bucket (cluster).
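The membership and centre updates described above can be sketched with a minimal pure-Python fuzzy c-means. This is a generic illustration with fuzzifier m = 2 and toy two-dimensional points, not the exact clustering configuration used in the experiments.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, iters=50):
    # Returns the cluster centres and the membership matrix u, where
    # u[i][k] is the degree to which point i belongs to cluster k.
    random.seed(0)                      # deterministic for the example
    n, dim = len(points), len(points[0])
    # Initialise random memberships that sum to 1 for each point.
    u = []
    for _ in range(n):
        row = [random.random() for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    for _ in range(iters):
        # Centre update: membership-weighted mean of the points.
        centres = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            wsum = sum(w)
            centres.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / wsum
                for d in range(dim)))
        # Membership update from the distances to every centre.
        for i in range(n):
            dists = [math.dist(points[i], ctr) or 1e-12 for ctr in centres]
            for k in range(c):
                u[i][k] = 1.0 / sum((dists[k] / dists[j]) ** (2.0 / (m - 1.0))
                                    for j in range(c))
    return centres, u

# Two tight groups plus a point midway between them.
pts = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (3.0, 3.0)]
centres, u = fuzzy_c_means(pts, c=2)
# The midway point (3, 3) keeps partial membership in both clusters.
```

The midway point illustrates the "belongs to each cluster to a certain degree" property: its membership is split between the two centres rather than assigned wholly to one.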

KNN algorithm steps:

* Determine the parameter K, the number of nearest neighbours.

* Calculate the distance between the query instance and all training samples.

* Sort the distances, take the K nearest neighbours and gather their categories.

* Use the simple majority of the categories of the nearest neighbours as the prediction for the query instance.
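The four steps above can be sketched as follows. The toy training points and the `normal`/`outlier` labels are hypothetical, chosen only to show the mechanics.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (point, label) pairs; query: a point.
    # Steps 1-2: compute the distance from the query to every sample.
    dists = [(math.dist(point, query), label) for point, label in train]
    # Step 3: sort and keep the labels of the k nearest neighbours.
    nearest = [label for _, label in sorted(dists)[:k]]
    # Step 4: simple majority vote over the neighbour labels.
    return Counter(nearest).most_common(1)[0][0]

train = [((1, 1), "normal"), ((1, 2), "normal"), ((2, 1), "normal"),
         ((8, 8), "outlier"), ((9, 8), "outlier")]
print(knn_predict(train, (1.5, 1.5)))  # normal
print(knn_predict(train, (8.5, 8.0)))  # outlier
```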


The input is the Adult dataset, which is first loaded; the fuzzy clustering method is then applied to cluster the records. It is especially useful for mapping outliers to their nearest cluster.

After that, the KNN classifier is applied to find the outliers based on distance. The Euclidean distance method is used to calculate the distance between records. The outliers are then moved to their nearest bucket (cluster).
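The mapping of a detected outlier to its nearest bucket by Euclidean distance can be sketched as follows; the cluster centres here are hypothetical.

```python
import math

def nearest_bucket(point, centres):
    # Assign an outlier to the cluster whose centre is closest
    # by Euclidean distance.
    return min(range(len(centres)), key=lambda k: math.dist(point, centres[k]))

centres = [(1.0, 1.0), (5.0, 5.0)]
print(nearest_bucket((4.2, 4.9), centres))  # 1: closer to (5, 5)
print(nearest_bucket((0.5, 1.1), centres))  # 0: closer to (1, 1)
```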

In this way the privacy of the dataset is maintained using l-diversity, and information loss is reduced by using the KNN classifier to assign the outliers to their nearest clusters.

Table 1 describes the performance on the information loss metric. The information loss, analyzed in terms of the outlier detection error ratio, is plotted in figure 4.6. Information loss is defined in terms of the following quantities:



D: Dimensionality of the vector (2,3,4,5, ...)

O: Outlier

NDc: The number of training samples per class (>D+1)

Table 2 describes the performance of the system using the silhouette metric; the cluster accuracy results are plotted in figure 4.7. Accuracy is defined as,

Accuracy(i) = (b(i) - a(i)) / max{a(i), b(i)}

where a(i) is the average dissimilarity of i to the other points of its own cluster, and b(i) is the lowest average dissimilarity of i to any other cluster of which i is not a member. The cluster with this lowest average dissimilarity is called the "neighbouring cluster" of i, because it is the next best fit for point i.
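The silhouette value can be computed directly from the definition above. This is a generic sketch using Euclidean distance as the dissimilarity; it assumes every cluster contains at least two points.

```python
import math

def silhouette(point_idx, points, labels):
    # s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean
    # distance to the other points of i's own cluster and b(i) is the
    # lowest mean distance to the points of any other cluster.
    p, own = points[point_idx], labels[point_idx]
    def mean_dist(cluster):
        members = [q for j, (q, c) in enumerate(zip(points, labels))
                   if c == cluster and j != point_idx]
        return sum(math.dist(p, q) for q in members) / len(members)
    a = mean_dist(own)
    b = min(mean_dist(c) for c in set(labels) if c != own)
    return (b - a) / max(a, b)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
print(silhouette(0, pts, labels))  # close to 1: well-separated clusters
```

Values near 1 indicate a point well inside its cluster; negative values indicate a point that would fit better in its neighbouring cluster.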

Table 3 describes the performance of the system using the time metric. The computational time, analyzed in terms of outlier detection, is plotted in figure 4.8. Computational time (CT) is defined as

CT = Process End Time - Process Start Time
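Measuring computational time in this way can be sketched with a generic timing helper (an illustration, not the measurement harness used in the experiments):

```python
import time

def timed(fn, *args):
    # Computational time = end time - start time (note the order).
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

result, ct = timed(sorted, list(range(100000, 0, -1)))
print(f"sorted 100k items in {ct:.4f} seconds")
```

`time.perf_counter` is a monotonic clock, so the end-minus-start difference is always non-negative.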

The information loss is reduced compared to the existing method.


4. Conclusion:

In this paper the problem of publishing data containing outliers in a privacy preserving fashion is studied. Microdata containing outliers are published in a privacy preserving way. The existing plain l-diversity system provides privacy only for the hideable outliers, which results in information loss. We improve the algorithm using the K-Nearest-Neighbour algorithm to reduce this information loss.


[1.] Han, J. and M. Kamber, 2006. "Data Mining: Concepts and Techniques", 2nd ed., The Morgan Kaufmann Series in DataManagement Systems, Jim Gray, Series Editor.

[2.] Ankita Shrivastava, U. Dutta, 2013. "An Emblematic Study of Different Techniques in PPDM", International Journal of Advanced Research in Computer Science and Software Engineering.

[3.] Hui (Wendy) Wang, Ruilin Liu, 2015. "Hiding outliers into crowd: Privacy Preserving data publishing with outliers", Elsevier.

[4.] Elisa Bertino, Dan Lin, and Wei Jiang, "A Survey of Quantification of Privacy Preserving Data Mining Algorithms"

[5.] Sweeney, L., 2002. k-anonymity: a model for protecting privacy, Int. J. Uncertain. Fuzziness Knowl. Based Syst., 10(5): 557-570.

[6.] Williams, G., R. Baxter, H. He, S. Hawkins and L. Gu, 2002. "A Comparative Study for RNN for Outlier Detection in Data Mining". In Proceedings of the 2nd IEEE International Conference on Data Mining, page 709, Maebashi City, Japan.

[7.] Tiancheng Li, Ninghui Li, Jian Zhang, Ian Molloy, 2012. "Slicing: A New Approach for Privacy Preserving Data Publishing", IEEE Transactions on Knowledge & Data Engineering, 24(3): 561-574, doi:10.1109/TKDE.2010.236.

[8.] Samarati, P., "Protecting respondent's privacy in Microdata release", IEEE Transactions on Knowledge and Data Engineering, 13: 1010-1027.

[9.] Tiancheng Li, Ninghui Li, 2008. "Towards Optimal k-anonymization", Data & Knowledge Engineering, Elsevier. 303.

[10.] Machanavajjhala, A., J. Gehrke, D. Kifer, 2007. "l-Diversity: Privacy Beyond k-Anonymity", ACM Transactions on Knowledge Discovery from Data, pp: 24-35.

[11.] Sergio Martinez, David Sanchez, Aida Valls, 2013. "A semantic framework to protect the privacy of electronic health records with non-numerical attributes", Journal of Biomedical Informatics, 46: 294-303.

[12.] Wong, W.K., N. Mamoulis, D.W.L. Cheung, 2010. Non-homogeneous generalization in privacy preserving data publishing, Proceedings of ACM International Conference on Special Interest Group on Management of Data (SIGMOD) pp: 747-758.

[13.] LeFevre, K., D.J. DeWitt, R. Ramakrishnan, 2005. Incognito: efficient full-domain k-anonymity, Proceedings of ACM International Conference on Special Interest Group on Management of Data (SIGMOD) pp: 49-60.

[14.] Li, N., T. Li, 2007. t-Closeness: Privacy beyond k-anonymity and l-diversity, Proceedings of the International Conference on Data Engineering (ICDE) pp: 106-115.

[15.] Koudas, N., D. Srivastava, T. Yu, Q. Zhang, 2009. Distribution based microdata anonymization, Proc. VLDB Endow. 2(1): 958-969.

[16.] Li, J., Y. Tao, X. Xiao, 2008. Preservation of proximity privacy in publishing numerical sensitive data, Proceedings of ACM International Conference on Special InterestGroup on Management of Data (SIGMOD) pp: 473-486.

[17.] Li, N., T. Li, 2007. t-closeness: Privacy beyond k-anonymity and l-diversity, Proceedings of the IEEE 23rd International Conference on Data Engineering, pp: 106-115.

[18.] Chaytor, R., K. Wang, 2010. Small domain randomization: same privacy, more utility, Proc. VLDB Endow. 3(1-2): 608-618.

[19.] Dwork, C., 2006. Differential privacy, Proceedings of International Colloquium on Automata, Languages and Programming (ICALP) pp: 1-12.

[20.] Dwork, C., F. McSherry, K. Nissim, A. Smith, 2006. Calibrating noise to sensitivity in private data analysis, Proceedings of the Conference on Theory of Cryptography(TCC) pp: 265-284.

[21.] Barak, B., K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, K. Talwar, 2007. Privacy, accuracy, and consistency too: a holistic solution to contingency table release, Proceedings of ACM Symposium on Principles of Database Systems (PODS), pp: 273-282.

[22.] Blum, A., K. Ligett, A. Roth, 2008. A learning theory approach to non-interactive database privacy, Proceedings of the ACM Symposium on Theory of Computing (STOC), pp: 609-618.

[23.] Xiao, X., G. Bender, M. Hay, J. Gehrke, 2011. iReduct: differential privacy with reduced relative errors, Proceedings of ACM International Conference on Special InterestGroup on Management of Data (SIGMOD), pp: 229-240.

[24.] Li, C., M. Hay, V. Rastogi, G. Miklau, A. McGregor, 2010. Optimizing linear counting queries under differential privacy, Proceedings of ACM Symposium on Principles of Database Systems (PODS), pp: 123-134.

[25.] Xiao, X., G. Wang, J. Gehrke, 2011. Differential privacy via wavelet transforms, IEEE Trans. Knowl. Data Eng., 23(8): 1200-1214.

(1) V.V. Vishnu Priya, (2) A.K. Ilavarasi, (3) Dr. B. Sathiya Bhama

(1) PG Student-CSE Sona College of Technology Salem, India.

(2) Assistant Professor-CSE Sona College of Technology Salem, India.

(3) HOD-CSE Sona College of Technology Salem, India.

Received 28 January 2017; Accepted 22 May 2017; Available online 28 May 2017

Address For Correspondence:

V.V. Vishnu Priya, PG Student-CSE Sona College of Technology Salem, India. E-mail:

Caption: Fig. 3.1: System Architecture of Proposed System

Caption: Fig. 4.1: Raw Dataset

Caption: Fig. 4.2: Fuzzy Clustering

Caption: Fig. 4.3: Assigning K value for KNN classification

Caption: Fig. 4.4: Outlier Detection

Caption: Fig. 4.5: Outlier Mapping

Caption: Fig. 4.6: Information loss Graph

Caption: Fig. 4.7: Accuracy Graph

Caption: Fig. 4.8: Computational Time Graph
Table 1: Comparisons of Information Loss during number of runs

Methods    1      2      3      4      5
Existing   0.75   0.72   0.6    0.55   0.25
Proposed   0.56   0.42   0.35   0.27   0.17

Table 2: Comparisons of Cluster Accuracy during number of runs

Methods    1    2    3      4      5
Existing   86   88   89     92     93
Proposed   89   99   99.5   99.8   100

Table 3: Comparisons of Computational Time (CPU seconds) during number of runs

Methods    1    2    3    4    5
Existing   26   27   27   28   29
Proposed   15   17   18   20   23

Article Details
Author:Priya, V.V. Vishnu; Ilavarasi, A.K.; Bhama, B. Sathiya
Publication:Advances in Natural and Applied Sciences
Article Type:Report
Date:May 1, 2017