If you're an online merchant, ask a visitor to your site a simple question, and you are likely to get a misleading answer, if you get anything at all. That's the dilemma prompting a new privacy-enhancing data mining technique being developed by Dr. Rakesh Agrawal and Dr. Ramakrishnan Srikant, researchers at IBM's Almaden Research Center.

The research project -- one of several underway at IBM's Privacy Research Institute -- scientifically addresses the Catch-22 created by Web users entering false personal data on sites to protect their privacy and e-businesses relying on the data to develop data models and deliver customized services. "Our research institutionalizes the notion of fibbing on the Internet, and does so to preserve the overall reality behind the data," says Dr. Agrawal.

Called Privacy-Preserving Data Mining, the research relies on the notion that one's personal data can be protected by being scrambled, or randomized, prior to being communicated. By applying this technique, a retailer could generate highly accurate data models without ever seeing personal information. "The beauty of this research is that retailers and other Web businesses are able to extract the valuable demographic information they need without necessarily knowing the underlying personal consumer data," said Harriet P. Pearson, IBM's Chief Privacy Officer. "I believe we'll see technological approaches such as this playing a larger role in managing the privacy issues of today and the future."

According to Dr. Agrawal, the Privacy-Preserving Data Mining research has a wide range of potential applications, from medical research and building disease prediction models using randomized individual medical histories, to e-commerce and accurate promotions using randomized demographics of individual users.

A Web user decides to enter a piece of personal data -- e.g., age, salary, weight. Upon entry, that number (say, age 30) is immediately scrambled, or 'randomized,' by IBM software: the software takes the original number that was input and adds (or subtracts) a random value. This randomization step is performed independently for every user who opts to enter their age. So, a 30-year-old's age may be randomized to 42, while a 34-year-old's entry may be randomized to 28. The randomization differs for every single user.
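The per-entry step can be sketched in a few lines (a minimal sketch; the function name and the uniform shape of the random offset are assumptions for illustration, since the article does not specify how the offsets are drawn):

```python
import random

def randomize(value, spread):
    """Return value plus a fresh uniform offset drawn from [-spread, +spread].

    Each call draws its own offset, so every user's entry is perturbed
    independently of everyone else's.
    """
    return value + random.uniform(-spread, spread)

# Two users entering ages 30 and 34 are perturbed independently:
recorded_a = randomize(30, 12)  # lands somewhere in [18, 42]
recorded_b = randomize(34, 12)  # lands somewhere in [22, 46]
```

Only the perturbed numbers are retained, so no one downstream can recover either user's true age from a single record.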

What does not change is the allowed range of the randomization, and that range is directly linked to the desired level of privacy. A larger randomization range increases the uncertainty, and thus the personal privacy, of the users. At the same time, however, larger randomizations can reduce the accuracy of the results ultimately produced by a data mining algorithm that uses the randomized data as input. According to Dr. Agrawal, it is clearly a trade-off. Experiments indicate only a 5-10 percent loss in accuracy, even for 100 percent randomization, after the data mining algorithm has applied corrections to the randomized distributions.

Take the randomization of an IT manager's salary, which, for purposes of this example, ranges between $50,000 and $150,000 per year. Let's say the Web merchant (or Web site owner) sets the software's randomization parameter to add a random value somewhere between -$30,000 and +$30,000.

Jane, who comes to the site and decides to enter her salary in exchange for personalized recommendations, earns $100,000. Upon her entering $100,000, the IBM software happens to pick a random value of -$15,000, so Jane's salary is recorded as $85,000; to protect her privacy, no record is kept of her true salary. Then Bob comes to the site and enters his true salary of $90,000. The software happens to pick +$25,000 for Bob, and his salary is recorded as $115,000. Again, no record is kept of Bob's true salary.
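On the site side, only the perturbed figure would ever be stored. A hypothetical sketch (the name `record_salary` and the uniform draw are assumptions, not IBM's implementation):

```python
import random

SPREAD = 30_000  # randomization parameter: offset drawn from [-$30,000, +$30,000]

def record_salary(true_salary):
    """Perturb the salary and return the figure to store; the true value is discarded."""
    return true_salary + random.uniform(-SPREAD, SPREAD)

# Jane enters $100,000: the recorded figure lands somewhere in $70,000-$130,000.
# (In the article's example the software happened to draw -$15,000, recording $85,000.)
jane_recorded = record_salary(100_000)

# Bob enters $90,000: recorded somewhere in $60,000-$120,000.
# (A draw of +$25,000 would record $115,000, as in the article.)
bob_recorded = record_salary(90_000)
```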

To see the effect of the randomization, compare the true salary distribution of everyone who entered a salary on the site (Jane and Bob included) with the randomized distribution:
Salary range         True distribution   Randomized distribution
$50,000-$60,000       1 visitor           3 visitors
$60,000-$70,000       4 visitors          7 visitors
$70,000-$80,000      20 visitors         12 visitors
$90,000-$100,000     50 visitors         33 visitors
$100,000-$110,000    10 visitors         55 visitors
$110,000-$120,000    45 visitors         23 visitors
$120,000-$130,000    15 visitors         10 visitors
$130,000-$140,000     3 visitors          2 visitors
$140,000-$150,000     2 visitors          5 visitors

Once all the randomized data is in for a large number of users, the privacy-preserving data mining software takes the randomized distribution and reconstructs what the true distribution might have looked like.

The software cannot determine what Jane's or Bob's salary was. It has access only to the randomized values and the parameters of the randomization (i.e., that the random values added or subtracted came from the range -$30,000 to +$30,000), and nothing else. Based only on this information, the software reconstructs a close approximation of the true distribution. This reconstructed distribution is then used in building an accurate data mining model. Jane gets personalized recommendations by having the data mining model shipped to her client and applied locally.
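A discretized sketch of such a reconstruction, in the spirit of the iterative Bayesian procedure described in Agrawal and Srikant's research (the function names, the uniform noise model, and the fixed iteration count are assumptions for illustration, not IBM's implementation):

```python
import random

def reconstruct(randomized, bins, spread, iters=50):
    """Estimate the true per-bin fractions from randomized values alone.

    Inputs: the randomized values, the histogram bins as (lo, hi) pairs,
    and the known randomization range [-spread, +spread]. The true
    values themselves are never seen.
    """
    def noise_density(d):
        # Density of a uniform offset over [-spread, +spread].
        return 1.0 / (2 * spread) if -spread <= d <= spread else 0.0

    centers = [(lo + hi) / 2 for lo, hi in bins]
    p = [1.0 / len(bins)] * len(bins)  # start from a uniform guess

    for _ in range(iters):
        new = [0.0] * len(bins)
        for w in randomized:
            # Posterior probability that w's true value fell in each bin,
            # given the current estimate p of the true distribution.
            post = [noise_density(w - c) * p[k] for k, c in enumerate(centers)]
            total = sum(post)
            if total > 0:
                for k in range(len(bins)):
                    new[k] += post[k] / total
        p = [v / len(randomized) for v in new]
    return p  # estimated fraction of users in each bin
```

Run over a large batch of randomized salaries, this recovers a close approximation of the true histogram even though no individual true salary is ever available to the algorithm.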

Launched in early 2002, the IBM Privacy Research Institute is the industry's first formal technology research effort focused exclusively on developing privacy-enabling and data protection technologies for businesses. Under the direction of Dr. Michael Waidner, the Institute conducts privacy-enabling technology research in IBM's eight research laboratories around the world.
COPYRIGHT 2002 Millin Publishing, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2002, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Publication: EDP Weekly's IT Monitor
Date: Jun 3, 2002

