Utility Independent Privacy Preserving Data Mining on Vertically Partitioned Data

Problem statement: Driven by mutual benefits, or by regulations that require certain data to be published, there is a demand for the exchange and publication of data among various parties. Data publishing is ubiquitous in many domains such as medicine, business and education. Detailed person-specific data, whether held on a centralized server or in a distributed environment, often contains sensitive information about individuals in its original form, and publishing such data immediately violates individual privacy. The main problem in this regard is to develop a method for publishing data in a more hostile environment so that the published data remains practically useful while individual privacy is preserved. Suppose n parties, each having a private database, want to jointly conduct a data mining operation on the union of their databases. How could these parties accomplish this without disclosing their databases to the other parties or to any third party? Approach: To address this issue, we developed a simple technique for transforming categorical and numeric sensitive data using a mapping table and a graded grouping technique, respectively. Typical data mining tasks such as classification, clustering and association rule mining were performed on both the original and the transformed tables. The rules/results/patterns obtained from both tables were compared and the utility of the transformed data was evaluated. Results: The evaluation results demonstrated that the proposed approach achieved 100% utility for any type of mining task as compared to the original table. The classification accuracy obtained on the Adult data set, with education as the class variable, was 40.08%, and the same accuracy was obtained after transformation. Similarly, the number of rules generated for a confidence of 0.9 was 10 for both the original and the transformed table.
Conclusion: The association rules involving categorical sensitive attributes were checked manually for privacy breaches. We found that it is not possible to guess the actual sensitive values from the rules, even though there was no information loss. The results can be interpreted only with the involvement of the data owner or data publisher.


INTRODUCTION
Progress in scientific research depends on the availability and sharing of information and ideas. At the same time, researchers give top priority to protecting the privacy of human participants. Many privacy preserving data mining algorithms have been developed to protect the privacy of individuals even after the data mining process. Some privacy preserving data mining approaches have been developed for centralized data, while others address the distributed data scenario. Distributed data may be horizontally or vertically partitioned. For example, suppose a school maintains the academic records of its students in a database. If a researcher wants to analyze students' performance in an area, the academic records, with the same attributes, of the different schools (sites) in that area must be collected for analysis. Similarly, consider separate hospitals that wish to conduct joint research while preserving the privacy of their patients.
In this scenario privileged information must be protected, yet it must also remain usable for research or for other purposes. In particular, although the parties realize that combining their data has some mutual benefit, none of them is willing to reveal its database to any other party. Data with the same attributes held at different sites is called horizontally partitioned data. But if the researcher wants to find the association between the students' character and their parents' occupation, or between medical diagnosis and attendance, different databases (academic, medical and personal data) of the same set of students must be combined for analysis; such a data set, linked by a single join key (e.g., student id), is called vertically partitioned data. In other words, in vertically partitioned data a portion of each instance is present at each site, but no site contains complete information for any instance.
A researcher can mine very useful rules/patterns if allowed to work on vertically partitioned data. For example, some cancer treatments are highly effective but have debilitating side effects with high variance between populations [1]. The factors determining the efficacy of such treatments can be learnt from decision trees or association rules derived from vertically partitioned data tables such as hospital management data, pharmacy data and insurance data, each of which is prevented by privacy laws from disclosing individuals' identifiable information. Beyond medical research, competing companies may want to perform mining tasks on their combined data to get accurate results, yet be unwilling to disclose their own data to the other party. For example, Ford and Firestone shared a problem with a jointly produced product: Ford Explorers with Firestone tires. Factors such as trade secrets and agreements with other manufacturers stand in the way of the necessary sharing. Even government entities face similar problems, such as limitations on sharing between law enforcement, intelligence agencies and tax collection.
We have developed a simple technique by which vertically partitioned data can be used for any type of mining task while the individuals' privacy is preserved. Privacy preservation can mean many things: protecting specific sensitive values of individuals, hiding the link between attribute values and the individuals they apply to, or protecting the sources. Our goal is that, by applying our technique, each site can supply the required data to the third party without modifying the structure of the data, so that the third party can apply any mining technique or algorithm, unmodified, and obtain the same accurate results as if they had been mined from the actual database. At the same time, the data miner cannot interpret the results/patterns.

Related works:
A simple approach to privacy preserving data mining over multiple sources that are not willing to share data is to apply existing techniques and tools at each site independently and combine the results. But this does not give globally valid results, because of duplicated data at different sites. It also cannot detect cross-site correlations.
Another approach is to perturb the local data (by adding "noise") before the data mining process and to mitigate the impact of the noise on the data mining results by using reconstruction techniques [2]. However, it is impossible to reconstruct the original data set, and the accuracy depends on the reconstruction algorithm [3]. The problem of distributed privacy preserving data mining overlaps closely with the field of cryptography concerned with secure multi-party computation. Many of these techniques work by sending changed or encrypted versions of the inputs to one another in order to compute the function with different alternative versions, followed by an oblivious transfer protocol to retrieve the correct value of the final output. The algorithms for secure multi-party computation over horizontally partitioned data sets include the Naïve Bayes classifier [4], the Support Vector Machine (SVM) classifier with non-linear kernels [5], Association Rule Mining [6] and Clustering [7][8][9].
Vertically partitioned mining has been extended to a variety of data mining applications such as the Naïve Bayes classifier [10], SVM classification [11], decision trees [12], K-means clustering [13] and Association Rule Mining [14,15]. Vaidya and Clifton [16] gave an elegant algebraic solution for vertically partitioned data. However, this solution can leak many linear combinations of each party's private data to the other. Also, to process one candidate frequent item set, its computational overhead is quadratic in the number of transactions. Zhong [17] gives two algorithms whose computational overhead is linear in the number of transactions. But when his technique is used in practice, it should be complemented by another algorithm that computes all frequent item sets without testing candidates one by one.
All of the cryptographic work falls under the theoretical framework of Secure Multiparty Computation. In Agrawal's study [2] the privacy-preserving data mining problem between two parties is solved by a data perturbation method, while Lindell and Pinkas use secure multi-party computation protocols [18] to solve the problem. We propose a framework that allows us to systematically transform normal data mining computations into secure multi-party computations. The problem is defined as follows: n parties, each having a private database, want to jointly conduct a data mining operation on the union of their databases. How could these parties accomplish this without disclosing their databases to the other parties or to any third party?

MATERIALS AND METHOD
The framework of our privacy preserving mining model is shown in Fig. 1. Our model assumes that all sites collect data for exactly the same set of entities. This assumption can be relaxed by deciding how missing values are handled. For example, a missing value may be replaced by the average if the data is numerical, or by the most frequent value if it is categorical. Based on this assumption, attribute A_j is common to all the vertically partitioned data sets (D_1 to D_n) and hence forms the join key. Also, the number of rows is almost the same at all sites.
Data flow: The n parties at Site 1 to Site n have their own data sets D_1 to D_n with only one attribute, A_j, in common, called the join key attribute. In some situations only part of a data set needs to be kept confidential. These attributes are the sensitive attributes; all other attributes need no treatment.
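The missing-value rule described above can be sketched as follows. This is a minimal illustration only: the function name `impute_missing` and the use of `None` to mark a missing entry are assumptions made for the example, not part of the described system.

```python
from collections import Counter
from statistics import mean


def impute_missing(values, numeric):
    """Fill None entries with the mean (numeric) or the most frequent value (categorical)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from
    if numeric:
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical columns with one missing entry each.
ages = impute_missing([30, None, 50], numeric=True)
jobs = impute_missing(["clerk", None, "clerk", "sales"], numeric=False)
```

Applying the rule at each site before the transformation keeps the row sets aligned, so the join key covers the same entities everywhere.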
All the parties want to jointly conduct a data mining operation on a single database D, which is formed by the union of all the data sets D_1{A_j, A_1, A_2, …}, D_2{A_j, A_3, A_4, …}, … and D_n{A_j, A_8, A_9, …}, to get better results. But to preserve the privacy of the actual values of the individual databases, the third party data miner is allowed to work only with a single database D_T, which is formed by the union of all the transformed databases. An attribute is called sensitive if the individual is not willing to disclose its value, or if an adversary must not be allowed to discover its value. The method of converting an attribute A_x to A_xT is explained in the next section. The data miner who works on the transformed single database D_T can perform any data mining task as if working on the original data, but cannot interpret the results/rules/patterns. The data miner can declare the results to all the parties who participated in the data sharing. The individual parties can interpret only those results that contain the transformed attribute values of their own databases. To interpret the remaining results, each site should communicate with the other sites and mutually exchange the actual values involved in the results. Since only the actual values that appear in the results/rules/patterns become known to all the parties, the individuals' privacy is preserved.
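As a rough illustration of how the data miner could assemble D_T from the transformed partitions, the sketch below joins per-site tables on the shared key. The representation of a table as a list of dictionaries, the function name `join_on_key` and the sample rows are all assumptions made for the example.

```python
def join_on_key(tables, key):
    """Inner-join tables (lists of dict rows) on a shared key column."""
    joined = {row[key]: dict(row) for row in tables[0]}
    for table in tables[1:]:
        by_key = {row[key]: row for row in table}
        # Keep only the entities that are present at every site.
        joined = {k: {**r, **by_key[k]} for k, r in joined.items() if k in by_key}
    return [joined[k] for k in sorted(joined)]


# Hypothetical transformed partitions from two sites, sharing the join key "id".
site1 = [{"id": 1, "Age_T": 2.5}, {"id": 2, "Age_T": 4.1}]
site2 = [{"id": 2, "Education_T": "Education_9"},
         {"id": 1, "Education_T": "Education_15"}]

d_t = join_on_key([site1, site2], key="id")
```

The miner sees only transformed values in `d_t`; the structure of the table is otherwise the same as that of the original database.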

Transformation:
The transformation of an attribute A_x to A_xT is based on the data type of A_x. If the data type is numerical we use the graded grouping technique, and if the data type is categorical we use a mapping table for the transformation.
Graded grouping technique: This is a simple transformation method that maintains a correlation factor of nearly 1 between the transformed values and the original values. Our graded grouping approach for numerical attributes is shown in Fig. 2. To convert the actual values of a single numeric attribute, the following steps are followed. The first step is to fix the number of categories (k) for the given range. The second step is to fix the min and max value of each category C_1 … C_k in such a way that the categories form non-overlapping, contiguous ranges. The range of each category may or may not be uniform. If a uniform range is used for every category, the correlation factor between the original and transformed values is one; otherwise it decreases, but a positive linear relationship is maintained. The third step is to determine the category C_i to which each actual value x belongs and to find its membership value m(x) within that category. The fourth step is to replace the actual value with a new, transformed value n(x), which is calculated by adding the category number of C_i and the membership value m(x).
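For the categorical case, a mapping table can be sketched as a one-to-one dictionary from actual values to opaque codes. How the codes are assigned is not specified here, so the random assignment, the function names and the `prefix` and `seed` parameters below are assumptions made for illustration.

```python
import random


def build_mapping_table(values, prefix, seed=None):
    """Give each distinct value an opaque, unique code such as 'Education_3'."""
    distinct = sorted(set(values))
    codes = list(range(1, len(distinct) + 1))
    random.Random(seed).shuffle(codes)  # avoid leaking alphabetical order
    return {value: f"{prefix}_{code}" for value, code in zip(distinct, codes)}


def apply_mapping(values, table):
    """Replace every actual value by its code from the mapping table."""
    return [table[v] for v in values]


education = ["Bachelors", "Preschool", "Doctorate", "Bachelors"]
table = build_mapping_table(education, prefix="Education", seed=0)
education_t = apply_mapping(education, table)

# Only the holder of the table can invert the codes.
inverse = {code: value for value, code in table.items()}
```

Because the mapping is a bijection, equal values stay equal and distinct values stay distinct, which is what frequency-based mining tasks rely on.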
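The four steps can be sketched as below. The membership formula is not reproduced in the text above, so this example assumes m(x) is the relative position of x inside its category, m(x) = (x - min_i) / (max_i - min_i). That choice is consistent with the statement that uniform ranges give a correlation of one, but it remains an assumption. The function names and the sample boundaries are likewise hypothetical.

```python
from bisect import bisect_right


def graded_group(x, boundaries):
    """Return n(x) = i + m(x) for ascending cut points [b_0, ..., b_k].

    Category C_i covers [b_(i-1), b_i). m(x) is ASSUMED to be the relative
    position of x inside its category; the source does not give the formula.
    """
    if not boundaries[0] <= x <= boundaries[-1]:
        raise ValueError("x lies outside the configured range")
    # The upper edge of the whole range is assigned to the last category C_k.
    i = min(bisect_right(boundaries, x), len(boundaries) - 1)
    lo, hi = boundaries[i - 1], boundaries[i]
    return i + (x - lo) / (hi - lo)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)


ages = [17, 25, 38, 40, 52, 66, 90]
uniform = [10, 30, 50, 70, 90]       # k = 4 categories of equal width
non_uniform = [10, 20, 50, 60, 90]   # k = 4 categories of unequal width

ages_uniform = [graded_group(a, uniform) for a in ages]
ages_non_uniform = [graded_group(a, non_uniform) for a in ages]

r_uniform = pearson(ages, ages_uniform)
r_non_uniform = pearson(ages, ages_non_uniform)
```

Under this assumption, `r_uniform` equals one up to floating-point error, while `r_non_uniform` is lower but still positive. Without the number of categories and their boundaries, the transformed values cannot be mapped back to ages.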

Experimental setup:
The data miner's job is to perform the union operation on the various transformed attributes A_1T, A_2T, … A_9T and the other non-sensitive attributes, using the join key attribute A_j, to form a single table D_T, which alone is used for any data mining task. We decided to conduct the experiment on a real data set and hence used the Adult database from the UCI machine learning repository [19], with 35,561 records. The attributes Age, Work class, Education, Marital status, Occupation, Relation, Race, Sex and Country (forming table D) are considered for analysis, assuming that different attributes are received from different sites. We considered Age as the sensitive numerical attribute, and hence Age_T was calculated by our graded grouping algorithm. Similarly, Education is considered as the sensitive categorical attribute, and hence Education_T is formed from the mapping table shown in Table 1. We implemented the algorithm in Java Standard Edition 5.0 and ran it on an Intel® Core2 Duo, 1.8 GHz, 1 GB RAM system, which took only 28 sec to generate the privacy preserving Adult data set D_T. The various data mining tasks were performed on both table D and the new table D_T (with the attributes Age_T, Work class, Education_T, Marital status, Occupation, Relation, Race, Sex and Country) using the tool WEKA [20], and the results were compared.

RESULTS
For evaluation purposes, we performed mining tasks such as classification, association rule mining and clustering on both the original Adult data (D) and the transformed table (D_T), and compared the results. Classification was performed by the decision tree (J48) method and the ZeroR method, with education as the class variable. The parameters compared are shown in Table 2. The results were not affected by the proposed transformation method. With the J48 method, the highest F-measure value was 0.874; it was obtained for the class Doctorate in the original table and for the corresponding class Education_9 in the transformed table.
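One way to see why relabeling the class values leaves such figures unchanged is a toy version of ZeroR, which always predicts the majority class. The sketch below is not the WEKA experiment; apart from the Doctorate to Education_9 correspondence mentioned above, the labels and codes are hypothetical.

```python
from collections import Counter


def zero_r_accuracy(labels):
    """Training-set accuracy of ZeroR: the share of the majority class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Hypothetical class column and mapping table (one-to-one).
education = ["HS-grad"] * 4 + ["Bachelors"] * 3 + ["Doctorate"]
mapping = {"HS-grad": "Education_3", "Bachelors": "Education_7",
           "Doctorate": "Education_9"}
education_t = [mapping[v] for v in education]

acc_original = zero_r_accuracy(education)
acc_transformed = zero_r_accuracy(education_t)
```

Since the mapping is one-to-one, class frequencies are preserved, so both accuracies coincide; only the name of the predicted class differs.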

DISCUSSION
Any privacy preserving data mining technique can be evaluated by parameters such as the performance of the algorithm, data utility after transformation, level of uncertainty and the resistance accomplished by the privacy algorithm [21].

Performance of algorithm:
The performance of a privacy algorithm can be measured by the time needed to hide a specified set of sensitive information. We considered the original Adult data set table, with two sensitive attributes, Age (numeric) and Education (categorical), as the table for transformation. Our algorithm took only 28 sec for the transformation. The time complexity of our algorithm is linear in the size of the table.
Data utility after transformation: The data utility after transformation can be measured by the loss of functionality, or information loss. For example, suppression and generalization are forms of transformation. If suppression is used for an attribute value, the utility of that data is reduced, since missing values cannot be handled by many mining tools.
Sampling does not modify the information stored in the database, but utility is still reduced, since the information is not complete. The measure used to evaluate information loss depends on the specific data mining technique with respect to which a privacy algorithm is applied. For example, in the case of association rule mining, information loss can be measured by counting the number of rules produced for a given support and confidence, before and after transformation. From Table 3, we conclude that the information loss is nil, because we get the same number of rules from the original and the transformed table, whichever algorithm is used.
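The rule-count comparison can be illustrated with a deliberately tiny rule miner. It enumerates only one-item to one-item rules and is not one of the WEKA algorithms used in the experiments; the rows, thresholds and function name are hypothetical.

```python
from itertools import permutations


def mine_rules(rows, min_support, min_confidence):
    """Enumerate single-item => single-item rules over (attribute, value) items."""
    n = len(rows)
    transactions = [frozenset(row.items()) for row in rows]
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        if a[0] == b[0]:
            continue  # one attribute cannot take two values in the same row
        count_a = sum(1 for t in transactions if a in t)
        count_ab = sum(1 for t in transactions if a in t and b in t)
        if count_ab / n >= min_support and count_ab / count_a >= min_confidence:
            rules.append((a, b))
    return rules


rows = [
    {"education": "Preschool", "sex": "Male"},
    {"education": "Preschool", "sex": "Male"},
    {"education": "Doctorate", "sex": "Female"},
    {"education": "Doctorate", "sex": "Male"},
]
mapping = {"Preschool": "Education_15", "Doctorate": "Education_9"}
rows_t = [{**r, "education": mapping[r["education"]]} for r in rows]

rules = mine_rules(rows, min_support=0.5, min_confidence=0.9)
rules_t = mine_rules(rows_t, min_support=0.5, min_confidence=0.9)
information_loss = abs(len(rules) - len(rules_t))
```

Support and confidence depend only on how often items co-occur, and a one-to-one mapping does not change those counts, so the two rule sets have the same size.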

Level of uncertainty:
The level of uncertainty measures how well hidden data can be predicted from the data set given for analysis or from the rules/results declared. For example, suppose randomization is used to hide the data. If, to maintain the information, the randomization is done so that the correlation is very close to one, then the data reconstruction procedure discloses the actual values; otherwise there is loss of information. In our method, by contrast, a numerical attribute preserves the information, while the actual values cannot be guessed without knowing the number of categories and the range of each category. Similarly, without access to the mapping table, the actual value of a categorical attribute cannot be guessed. Consider the snapshots of the association rule mining experiment conducted on D and D_T by the Tertius method, shown in Fig. 3 and 4. The number of rules is the same in both cases.
The rules are also exactly the same, except for Rule number 6, which contains the sensitive attribute Education (Fig. 3) with the value Preschool. The rule with the same number in the transformed table (Fig. 4) contains the sensitive attribute H_Education with the value Education_15, which no one can interpret except the owner of the data set, who holds the mapping table. Since the WEKA association rule mining tool cannot handle numeric attributes, the attribute Age is not considered in this analysis.
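The interpretation step performed by the data owner can be sketched as an inverse lookup in the mapping table. Only the Preschool to Education_15 correspondence comes from the text; the rule layout, the second attribute and the function name are hypothetical.

```python
def decode_rule(rule, mapping_table):
    """Replace transformed codes in a rule by the owner's actual values."""
    inverse = {code: value for value, code in mapping_table.items()}
    # Values without an entry in the table (non-sensitive ones) pass through.
    return {attr: inverse.get(value, value) for attr, value in rule.items()}


owner_table = {"Preschool": "Education_15"}  # held only by the data owner
declared_rule = {"H_Education": "Education_15", "Workclass": "Private"}

interpreted = decode_rule(declared_rule, owner_table)
# A party without the table gets the rule back unchanged.
uninterpreted = decode_rule(declared_rule, {})
```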

Resistance accomplished:
Low resistance means that a sanitization algorithm developed to assure privacy against a particular data mining technique may not attain similar protection against all other possible data mining algorithms. The resistance of our algorithm, however, is high enough that, whatever mining task is performed, the sensitive information does not leak out. For example, association rule mining using the Predictive Apriori algorithm was performed on the original and the transformed table. The time taken by the task was 53 min 45 sec for the original table and 54 min 38 sec for the transformed table, while the number of rules produced was the same in both cases for the given confidence. The fourth rule produced by this task is shown in Table 5.
Limitation: We assume that the parties participating in the process are honest but curious to learn about the others.
Once the rules are declared by the data miner, the parties, being honest, should give the actual values corresponding to the transformed values (if they hold them) to the other parties. Since each party knows its own data and the resultant association rules, there may be some information disclosure. For example, suppose the support of the rule A -> B is 10% and is known to both parties. If item set A and item set B belong to Party I and Party II respectively, both of whom participated in the mining task, then each can infer information about the other party's item set. But it is acceptable to disclose knowledge that could be obtained from the global rules.

CONCLUSION
Many works are limited to Boolean association rule mining. Non-categorical attributes and quantitative association rule mining are significantly more complex, but they can be handled easily using our algorithm. Our goal was to develop methods that enable any data mining task that can be done at a single site to be done across various sources while respecting their privacy policies, and this goal has been achieved. The transformation can easily be implemented at the data source itself, on the user's machine, whatever the number of sensitive attributes. This increases the confidence of the user in providing accurate information, since he or she does not have to trust a third party to carry out the transformation process. Also, many techniques are concerned with output privacy, whereas our focus is on the privacy of the input data given for mining. Mining a distributed database can be significantly more expensive in terms of both time and space than mining the true database [22]. We treat the distributed data as centralized data before any mining tool is applied, and hence the time taken for mining is reduced.