Printer Friendly

Mining frequent patterns through microaggregation in differential privacy.

I. Introduction

Frequent pattern mining has been well studied within the data mining context, and numerous frequent pattern mining algorithms have been developed. Among them, the Apriori algorithm [1] and the FP-Growth algorithm [2] are the two most widely used. However, these algorithms are unable to preserve the privacy of the dataset if sensitive information (e.g., patient health records) is contained therein. Bhaskar et al. [3] proposed the Trun-Fre approach based on a Laplace mechanism to achieve differential privacy. However, this does not perform well when candidate sets become overly large, as the length of the frequent pattern increases. To decrease the magnitude of a candidate set, Li et al. [4] developed the PrivBasis approach, under which:

(1) The pattern basis is generated based on maximal clique; and (2) top-k patterns are mined to achieve differential privacy. However, the released dataset to be mined often lacks utility. For example, the support of patterns {a} and {b} with the original support <3, 3> could become <2.5, 3.2> after adding Laplace noise. As we can see, the consistency constraint has been broken. The DP-topkP algorithm in [5] performs well in terms of data utility, while in this paper noise is reduced through combining the k-anonymity model with the differential privacy model.

Normally, there are two metrics when evaluating an algorithm in differential privacy: privacy and utility. Therefore, a synergy must be achieved. In [6], an approach to enhance data utility in differential privacy via microaggregation-based k-anonymity is proposed.

In this paper, microaggregated data is released by clustering the selected patterns based on the weight of each pattern to achieve k-anonymity, and top-k frequent patterns are selected based on I the microaggregated data through the exponential mechanism. Lastly, the true support of each frequent pattern is perturbed by adding Laplace noise. By adjusting the cluster size, a synergy between privacy and data utility can be achieved.

2. Background And Related Work

2.1 Mining Frequent Patterns

Frequent pattern mining is widely used in the area of knowledge discovery and can serve various purposes, among which mining association rules [7] and predicting user behavior [8] have been greatly studied.

Formally speaking, let t be a transaction record, and D = {[t.sub.1], [t.sub.2],...,[t.sub.N]} be the transaction dataset that contains a few items. A set of items tending to appear together as a pattern shall be denoted asp In this paper, we apply the FP-Growth algorithm to the original dataset D to create a set of frequent patterns with support larger than the minimum support, denoted as [P.sub.set]. For each pattern [p.sub.i] [member of] Pset, assign a weight value w. based on its frequency. The universe of denoted as p stands for the weight value set of all the frequent patterns mined through the FP-Growth algorithm.

2.2 Differential Privacy

The differential privacy model for query operation in an interactive setting was originally proposed in [9]. The differential privacy model provides significantly robust and provable privacy preservation.

Definition 1: [epsilon]--differential privacy

As discussed in [1,10], for two given neighboring datasets D and D', a randomized mechanism A gives for any s e range (A):

Pr [A (D) = s] [less than or equal to] [e.sup.[epsilon]]. Pr [A (D') = S].

In this paper, the two neighboring datasets are or, where denotes a frequent pattern. Specifically, preserving differential privacy in a frequent pattern mining setting refers to the change in the probability distribution of the [e.sup.[epsilon]]. multiplicative--bounded by selecting a given frequent pattern or not.

2.3 Achieving k-anonymity based on Microaggregation

As shown by many researchers, the k-anonymity [11] model and its extensions have a significant influence on the research of privacy preservation. This model defines the attributes related to background knowledge as quasi-identifier attributes. Based on the same quasi-identifier, all records are generalized or suppressed so they can be separated into several equivalent groups. There are at least k records in each group. Although the k-anonymity model performs well in preventing identity disclosure, it cannot protect records against attribute disclosure.

For example, Table 1 shows a 3-anonymity dataset. Alice, Bob, and Carol share the same quasi-identifier attributes in Work (which defines whether the person works or not) and Age, and the AIDS attribute is what needs to be protected. Assume an intruder knows the target is in the group and Alice and Bob are not infected. He can therefore easily establish that Carol is infected. Several extensions of k-anonymity have been proposed such as l-diversity [12], t-closeness [13], ([alpha], k)--anonymity [14], and Min-variance. However, these models all suffer from two vital defects. First, the privacy disclosure is related to the background knowledge of the intruder, and the background knowledge is difficult to fully define. Second, these approaches cannot be strictly proved.

However, the k-anonymity model is preferable to the differential privacy model from the perspective of data utility. Hence, as studied in [6], we preprocess the transaction dataset based on k-anonymity, and a microaggregation [15] algorithm is used as the anonymity algorithm with two procedures including clustering the records in the dataset based on the quasi-identifier attributes and replacing the original attributes with centroids, normally the mean.

[FIGURE 1 OMITTED]

3. Mining Frequent Patterns Through Microaggregation In Differential Privacy

In [6], the MDAV algorithm is applied to cluster the records in the dataset. For frequent pattern mining, each frequent pattern should be treated as a category rather than an arithmetic number. Therefore, the microaggregation operation should be applied to the weight of each frequent pattern, which is an arithmetic number. As discussed in section 2.1, we have achieved the frequent pattern set Pset. Next, the weight of each pattern should be calculated based on the frequency in the exponential mechanism [16]. First, we define the ranking function Rank (D, [P.sub.i]):

Rank (D, [P.sub.i]) = [N.summation over (i=1)] count ([t.sub.i], P), [t.sub.i] [member of] D

s.t if p [subset] t., count ([t.sub.i], p) = 1

else count ([t.sub.i], p) = 0

Since adding or deleting one transaction record will cause the support of a pattern to increase one or decrease one at most, we know that the global sensitivity of the ranking function is [DELTA] Rank = k.. With tis in mind, the weight of each pattern should be depicted formally as:

[w.sub.i] = exp ([epsilon] Rank (D, [p.sub.i]) / Rank

where [w.sub.i] denotes the weight of pattern [p.sub.i], consisting of the weight set Pweight.

3.1 Select top-k Frequent Patterns in the Exponential Mechanism

In this paper, the microaggregation operation is applied to Pweight. (1) The records in Pweight are clustered into several groups with the decreasing weight of each pattern. Each cluster contains at least records. (2) All records in the individual cluster are replaced by a representative value, mostly the mean value within the cluster.

Algorithm 1. Release the differentially private data through microaggregation in the exponential mechanism. let Pweight denote the set of weight value of all the frequent patterns

let m be the minimal cluster size

set i: = 0

while [absolute value of (Pweight)] 2m do

[C.sub.i] [right arrow] m largest weight from Pweight

Pweight : = Pweight \ [C.sub.i]

i: = i +1

end while

[abr.Pweight] [left arrow] replace each value w e Pweight with the mean of the cluster

return [bar.Pweight]

After the microaggregation, we are able to obtain the set of microaggregated weight of each pattern p., denoted as * i * i

Pweight. The top-k frequent patterns should be selected from Pset based on the perturbed weight of each pattern.
Algorithm 2. Select top-k frequent patterns randomly.

For i = 1 to [absolute value of (Pset)] do
set

Pr ([P.sub.i]) = [w.sub.i] / [[absolute value of
(Pset)].summation over (i=1)][w.sub.i]

randomly select a pattern p [member of] Pset

add p to the top-k frequent pattern set Mset

end for


3.2 Perturb the true support in the Laplace mechanism [10]

Since we selected top-k frequent patterns, to prevent support disclosure to the intruders, we need to add Laplace noise to the support of each pattern.
Algorithm 3. Perturb the support of each pattern.

let support (p) return the support of p

for i = 1 to [absolute value of (Mset)] do

support (p) [left arrow] support (p) + Lap (k / [[epsilon].sub.2])

end for

return Mset


3.3 Privacy analysis and utility analysis

In our approach, both the exponential mechanism and Laplace mechanism have been applied, respectively, to select top-k frequent patterns based on the masked weight and to perturb the true support of each pattern. In this section, we prove that the entire proposed mechanism preserves differential privacy.

Proof 1. Selecting top-k patterns preserves [SIGMA] 1--differential privacy

In the counting scenario of mining frequent patterns, the global sensitivity of the ranking function could be denoted as [DELTA] Rank, and the probability of pattern p being selected is formally depicted as:

Pr (p) = exp ([SIGMA] Rank (Pset, p) / 2 [DELTA] Rank/ [summation over (p '[member of] Pset)] exp ([epsilon] Rank (Pset, p ') / 2 [DELTA] Rank

where [epsilon] is the privacy budget [17], We assign [[epsilon].sub.1]/k to be the privacy budget to select an individual pattern p. and denote Pset' as the neighboring set of Pset. Thus, the probability of one pattern being selected should be formally depicted as:

Pr (p) = exp ([[epsilon].sub.1] Rank (Pset, p) / 2k [DELTA] Rank) /[summation over (p '[member of] Pset)] exp ([SIGMA] Rank (Pset, p') / 2 [DELTA] Rank)

Denote the selecting function by select (Pset, i), returning the pattern to be selected for the ith time. The differential privacy constraint for selecting an individual pattern can be proved as follows:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

According to the definition of differential privacy, selecting one pattern preserves differential privacy with privacy budget [[epsilon].sub.1]/k To select k patterns, differential privacy is pre served with privacy budget [[epsilon].sub.1]/k x k = [[epsilon].sub.i] according to sequential composition [18]. In total, k patterns should be selected.

After applying the microaggregation operation to the weight of each pattern, the sensitivity of selecting the individual pattern is upper bounded by [DELTA] Sen (Pset) / m, where [DELTA] Sen (Pset) denotes the sensitivity of the selecting function. As there are n / m clusters, the total sensitivity of the selecting function is upper bounded by n/m x Sen (Pset) / m. The cluster size should be adjusted so that n/m x Sen (Pset) / m [less than or equal to] [DELTA] Sen (Pset). Thus, cluster size m should be larger than sqrt (n) to reduce the magnitude of the exponential noise and improve data utility.

Proof 2. Perturbing the support of each pattern preserves [[epsilon].sub.1]--differential privacy

First, it should be proved that perturbing the support of an individual patternp preserves [[epsilon].sub.2]/k differential privacy. Denote the set of the support of all patterns with perturbed support as Sset, and denote the function to obtain the true support of pattern p from dataset D by tsup (D,p) and the perturbed support bypsup (D,p). We define A as the global sensitivity of the perturbing function:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

As a result, differential privacy is preserved to perturb the support of an individual pattern. Again, differential privacy stands for the perturbation operation according to sequential composition with privacy budget [[epsilon].sub.2]/k x k = [[epsilon].sub.1].

In summary, the total privacy budget should be [epsilon] = [[epsilon].sub.1] + [[epsilon].sub.2], as the selecting operation and perturbation operation run sequentially in the algorithm.

4. Empirical Evaluation

4.1 Evaluation Data and Metrics

To demonstrate the utility of the algorithm statistically, our approach has been applied to mine top-k frequent patterns from retail market basket data [8], which contains 88,162 records and 16,470 items in total. At first, the FP-Growth algorithm is applied to mine all frequent patterns with support larger than the minimum support. In the experiment, we set the minimum support as 100, and n = 6445 frequent patterns are mined.

Two utility metrics are used in the experiment: False Negative Rate (FNR) and Average Relative Error (ARE).

(1) FNRmeasures thetop-k patterns that occur in topk P (D) and not in topkP ([bar.D]). topkP (D) is the frequent pattern set mined from the original transaction dataset and topkP ([bar.D]) is the perturbed dataset.

FNR = [absolute value of (topkP(D) [union] topkP ([bar.D]) - topkP ([bar.D]))]/k

(2) ARE measures the error introduced by the privacy preserving algorithm. trueSup ([p.sub.i], topkP(D)) is the true support of pattern p. in the selected top-k patterns, and perSup ([p.sub.i], topkP(D)) is the perturbed support.

ARE = [summation over ([p.sub.i][member of]topkp(D))] trueSup([p.sub.i], topkP(D)) -perSup([p.sub.i], topKP([bar.D]))/trueSup([p.sub.i], topkP(D))

[FIGURE 2 OMITTED]

For the second group of experiments, k = 100 and [epsilon] ranges from 0.5 to 1.5, while for the microaggregation operation the cluster size m is set as 2 x [square root of (6445)] [approximately equal to] 160.

[FIGURE 3 OMITTED]

5. Conclusion

When mining frequent patterns, the approach proposed here may be employed to enhance data utility while preserving differential privacy through microaggregation. Based on the exponential mechanism and the microaggregated weight of each pattern, the top-k frequent patterns are selected. To enhance privacy, the true support of each selected pattern is perturbed by adding Laplace noise. In conclusion, a synergy between privacy and utility is achieved.

Received: 17 December 2014, Revised 24 January 2015, Accepted 29 January 2015

Acknowledgements

The research work was supported by National Natural Science Foundation of China under Grant No. 61272419, Prospective Research Project of Jiangsu Province under Grant No.BY2013095-3-02, No.BY2014089, No.BY2013039, No.BY2013037 and Lianyungang International Cooperation Project under Grant No. CH1304.

References

[1] Blum, A., Ligett, K., Roth., A. (2008). A learning theory approach to non-interactive database privacy. In: ACM Symposium on Theory of Computing (STOC 2008), Victoria (BC), Canada: ACM, pages 609-618.

[2] Brin, S., Motwani, R.,Silverstein, C. (1998). Beyond market baskets: Generalizing association rules to correlations. Data Mining and Knowledge Discovery, 2 (1)39-68.

[3] Bhaskar, R., Laxman S., Smith, A., et al. (2010). Discovering frequent patterns in sensitive data. In: Proc of the 16th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining (KDD10), Washington, DC, USA :ACM, p 503-512.

[4] Ning-hui Li., Wahbeh Qardaji., Dong Su.,Jian-nen Cao. (2012). Privbasis: Frequent itemset mining with differential privacy. Proceedings of the VLDB Endowment, 5(11)1340-1351.

[5] Xiao-jian Zhang., Miao Wang., Xiao-feng Meng. (2014). An Accurate Method for Mining top-k Frequent Patterns Under Differential Privacy. Journal of Computer Research and Development, 51(1)104-114.

[6] Soria-Comas, J., Domingo-Ferrer, J., Sanchez, D., Martinez, S. (2014). Enhancing data utility in differential privacy via microaggregation-based k-anonymity. The VLDB Journal, 23 (5) 771-794.

[7] Agrawal, R., Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, (VLDB'94), Santiago de Chile, Chile:ACM, p 487-499.

[8] Adar, E., Weld, D S., Bershad., B N., Gribble, S D. (2007). Why we search: Visualizing and predicting user behavior. In: International World Wide Web Conferences, Banff, Alberta, Canada:ACM, pages 161-170.

[9] Dwork, C. (2006). Differential privacy. In: Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP06), S. Servolo, Venice, Italy:ACM, p 1-12.

[10] Dwork, C., McSherry, F., Nissim, K., Smith., A (2006). Calibrating noise to sensitivity in private data analysis. In: Third Theory of Cryptography Conference (TCC 2006), New York, USA:IACR, p 265-284.

[11] Samarati, P., Sweeney, L. (1998). Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression, SRI International Report, SRI-CSL-98-04.

[12] Machanavajjhala, A., Gehrke, J., Kifer D., Venkitasubramaniam, M. (2006). L-Diversity: Privacy beyond k-anonymity. In: IEEE International Conference on Data Engineering (ICDE 2006), Atlanta, GA, USA:IEEE, p 24.

[13] Li, N., Li, T., Venkatasubramanian, S (2007). t Closeness: Privacy beyond k-anonymity and l-diversity. In: IEEE International Conference on Data Engineering (ICDE2007), Istanbul, Turkey:IEEE, pages 106-115.

[14] Wong, R., Li, J., Fu, A., Wang, K. (2006). ([+ or -], k)-Anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), Philadelphia, USA:ACM, p 754-759.

[15] Domingo-Ferrer, J., Mateo-Sanz, J M. (2002). Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans.Knowl. Data Eng, 14 (1)189-201.

[16] McSherry, F., Talwar, K. (2007). Mechanism design via differential privacy. In: Proceedings of the 48th Annual IEEE Symposium on Foundations of Cmnputer Science, Providence, Rhode Island,USA:IEEE, pages 94-103.

[17] Haeberlen, A., Pierce, B C., Narayan., A. (2011). Differential privacy under fire. In: 20th USENIX-Security Symposium, San Francisco, CA, USA: USENIX, pages 33-33.

Z. NI (1), Q.M. LI (1) *, X.Q. LIU (1), T. LI (2), R. HOU (3)

(1) School of Computer Science and Engineering, Nanjing University of Science and Technology No.200 Xiaolingwei, 210094 Nanjing, China

(2) School of Computer Science, Florida International University 11200 SW 8th Street, Miami, FL 33199, U.S.A.

(3) Yahoo Center of Latin America, Blanco Encalada 2120, Santiago, Chile

* qianmu@mail.njust.edu.cn
COPYRIGHT 2015 Digital Information Research Foundation
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2015 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Author:Ni, Z.; Li, Q.M.; Liu, X.Q.; Li, T.; Hou, R.
Publication:Journal of Digital Information Management
Article Type:Report
Date:Apr 1, 2015
Words:2900
Previous Article:Intelligent embedded health care seat cushion of vision robot design by fuzzy neural network.
Next Article:Privacy in Social Networks.
Topics:

Terms of use | Copyright © 2017 Farlex, Inc. | Feedback | For webmasters