ABSTRACT
Discovering frequent patterns from data is a popular exploratory technique in datamining. However, if the data are sensitive (e.g., patient health records, user behavior records) releasing information about significant patterns or trends carries significant risk to privacy. This paper shows how one can accurately discover and release the most significant patterns along with their frequencies in a data set containing sensitive information, while providing rigorous guarantees of privacy for the individuals whose information is stored there.
We present two efficient algorithms for discovering the k most frequent patterns in a data set of sensitive records. Our algorithms satisfy differential privacy, a recently introduced definition that provides meaningful privacy guarantees in the presence of arbitrary external information. Differentially private algorithms require a degree of uncertainty in their output to preserve privacy. Our algorithms handle this by returning 'noisy' lists of patterns that are close to the actual list of k most frequent patterns in the data. We define a new notion of utility that quantifies the output accuracy of private top-k pattern mining algorithms. In typical data sets, our utility criterion implies low false positive and false negative rates in the reported lists. We prove that our methods meet the new utility criterion; we also demonstrate the performance of our algorithms through extensive experiments on the transaction data sets from the FIMI repository. While the paper focuses on frequent pattern mining, the techniques developed here are relevant whenever the data mining output is a list of elements ordered according to an appropriately 'robust' measure of interest.
Supplemental Material
- Apriori implementation of Ferenc Bodon. http://www.cs.bme.hu/~bodon/en/apriori/.Google Scholar
- Frequent itemset mining implementations repository. http://fimi.helsinki.fi.Google Scholar
- C. Aggarwal, C. C. Aggarwal, and P. S. Yu. A condensation approach to privacy preserving data mining. In Proceedings of the Ninth International Conference on Extending Database Technology (EDBT), pages 183--199, 2004.Google ScholarCross Ref
- R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 207--216, May 1993. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering, Taipei, Taiwan, Mar. 1995. IEEE Computer Society, Washington, DC, USA. Google ScholarDigital Library
- S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. In ICDE, pages 193--204, 2005. Google ScholarDigital Library
- M. Barbaro and T. Zeller. A face is exposed for aol searcher no. 4417749. The New York Times, Aug. 2006.Google Scholar
- R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta. Discovering Frequent Patterns in Sensitive Data . Technical Report NAS-TR-0129-2010, Network and Security Research Center, Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA, Apr. 2010. http://www.cse.psu.edu/~asmith/fim/.Google ScholarDigital Library
- A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, pages 609--618, 2008. Google ScholarDigital Library
- C. Dwork. Differential privacy. In ICALP, LNCS, pages 1--12, 2006. Google ScholarDigital Library
- C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265--284, 2006. Google ScholarDigital Library
- A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211--222, 2003. Google ScholarDigital Library
- S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, pages 265--273, 2008. Google ScholarDigital Library
- N. G.N., B. A., H. J., S. K., and E. I.R. Temporal pattern discovery for trends and transient effects: Its application to patient records. In Proceedings of the Fourteenth International Conference on Knowledge Discovery and Data Mining SIGKDD 2008, pages 963--971, 2008. Google ScholarDigital Library
- M. Götz, A. Machanavajjhala, G. Wang, X. Xiao, and J. Gehrke. Privacy in search logs. CoRR, abs/0904.0682, 2009.Google Scholar
- J. Han and M. Kamber. Data mining: Concepts and techniques. Morgan Kaufmann Publishers, San Fransisco, CA, USA, 2001. Google ScholarDigital Library
- D. Hand, H. Mannila, and P. Smyth. Principles of data mining. MIT Press, Cambridge, MA, USA, 2001. Google ScholarDigital Library
- V. Hristidis, editor. Information Discovery on Electronic Health Records. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, Boca Raton, FL, USA, 2009. Google ScholarDigital Library
- A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas. Releasing search queries and clicks privately. In WWW, pages 171--180, 2009. Google ScholarDigital Library
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. 1(3):259--289, 1997. Google ScholarDigital Library
- Y. Matias, J. S. Vitter, and W.-C. Ni. Dynamic generation of discrete random variates. Theory Comput. Syst., 36(4):329--358, 2003.Google ScholarCross Ref
- F. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In SIGMOD Conference, pages 19--30, 2009. Google ScholarDigital Library
- F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94--103, 2007. Google ScholarDigital Library
- T. Mielikäinen. On inverse frequent set mining. In 2nd Workshop on Privacy Preserving Data Mining (PPDM 2003), pages 18--23. IEEE Computer Society, 2003.Google Scholar
- L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557--570, 2002. Google ScholarDigital Library
- T. Washio and H. Motoda. State of the art of graph-based data mining. SIGKDD Explorations, 5:59--68, 2003. Google ScholarDigital Library
Index Terms
- Discovering frequent patterns in sensitive data
Recommendations
Efficient algorithms for mining constrained frequent patterns from uncertain data
U '09: Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain DataMining of frequent patterns is one of the popular knowledge discovery and data mining (KDD) tasks. It also plays an essential role in the mining of many other patterns such as correlation, sequences, and association rules. Hence, it has been the subject ...
Mining frequent graph patterns with differential privacy
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data miningDiscovering frequent graph patterns in a graph database offers valuable information in a variety of applications. However, if the graph dataset contains sensitive data of individuals such as mobile phone-call graphs and web-click graphs, releasing ...
Dataless Transitions Between Concise Representations of Frequent Patterns
For many data mining problems in order to solve them it is required to discover frequent patterns. Frequent itemsets are useful e.g. in the discovery of association and episode rules, sequential patterns and clusters. Nevertheless, the number of ...
Comments