Abstract
Association rule mining is an important problem in the data mining area. It enumerates and tests a large number of rules on a dataset and outputs rules that satisfy user-specified constraints. Due to the large number of rules being tested, rules that do not represent real systematic effect in the data can satisfy the given constraints purely by random chance. Hence association rule mining often suffers from a high risk of false positive errors. There is a lack of comprehensive study on controlling false positives in association rule mining. In this paper, we adopt three multiple testing correction approaches---the direct adjustment approach, the permutation-based approach and the holdout approach---to control false positives in association rule mining, and conduct extensive experiments to study their performance. Our results show that (1) Numerous spurious rules are generated if no correction is made. (2) The three approaches can control false positives effectively. Among the three approaches, the permutation-based approach has the highest power of detecting real association rules, but it is very computationally expensive. We employ several techniques to reduce its cost effectively.
- H. Abdi. Bonferroni and Šidák corrections for multiple comparisons. Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage, 2007.Google Scholar
- R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, 1993. Google Scholar
- S. D. Bay and M. J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5:213--246, 2001. Google Scholar
- Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57(1):125--133, 1995.Google Scholar
- S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD Conference, pages 265--276, 1997. Google Scholar
- G. E. Dallal. Historical background to the origins of p-values and the choice of 0.05 as the cut-off for significance. (http://www.jerrydallal.com/LHSP/p05.htm), 2007.Google Scholar
- S. Dudoit and M. J. van der Laan. Multiple Testing Procedures with Applications to Genomics. Springer, 2008.Google Scholar
- R. A. Fisher. On the interpretation of χ2 from contingency tables, and the calculation of p. Journal of the Royal Statistical Society, 85(1):87--94, 1922.Google Scholar
- L. Geng and H. J. Hamilton. Interestingness measures for data mining: a survey. ACM Computing Surveys, 38(3), 2006. Google Scholar
- A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal, and F. Vandin. An efficient rigorous approach for identifying statistically significant frequent itemsets. In PODS, pages 117--126, 2009. Google Scholar
- B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In SIGKDD, pages 80--86, 1998.Google Scholar
- G. Liu, H. Lu, and J. X. Yu. Cfp-tree: A compact disk-based structure for storing and querying frequent itemsets. Information Systems, 32(2):295--319, 2007. Google Scholar
- N. Megiddo and R. Srikant. Discovering predictive association rules. In SIGKDD, pages 274--278, 1998.Google Scholar
- N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In ICDT, pages 398--416, 1999. Google Scholar
- T. V. Perneger. What's wrong with bonferroni adjustments. British Medical Journal, 316:1236--1238, 1998.Google Scholar
- R. Rymon. Search through systematic set enumeration. In Proceedings of the Internation Conference on Principles of Knowledge Representation and Reasoning, pages 268--275, 1992.Google Scholar
- P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In SIGKDD, pages 32--41, 2002. Google Scholar
- G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1--33, 2007. Google Scholar
- G. I. Webb. Layered critical values: A powerful direct-adjustment approach to discovering significant patterns. Machine Learning, 71(2-3):307--323, 2008. Google Scholar
- P. H. Westfall and S. S. Young. Resampling-based multiple testing: examples and methods for P-value adjustment. Wiley, 1993.Google Scholar
- X. Yin and J. Han. Cpar: Classification based on predictive association rules. In SDM, 2003.Google Scholar
- M. J. Zaki and K. Gouda. Fast vertical mining using Diffsets. In SIGKDD, 2003. Google Scholar
Index Terms
- Controlling false positives in association rule mining
Recommendations
A Survey on Association Rule Mining
ACCT '15: Proceedings of the 2015 Fifth International Conference on Advanced Computing & Communication TechnologiesTask of extracting useful and interesting knowledge from large data is called data mining. It has many aspects like clustering, classification, association mining, outlier detection, regression etc. Among them association rule mining is one of the ...
Efficient negative association rule mining based on chance thresholds
Typically association rule mining only considers positive frequent itemsets in rule generation, where rules involving only the presence of items are generated. In this paper we consider the complementary problem of negative association rule mining, ...
MICAR: nonlinear association rule mining based on maximal information coefficient
AbstractAssociation rule mining (ARM) is an important research issue in data mining and knowledge discovery. Existing ARM methods cannot discover nonlinear association rules, despite nonlinearity being common and significant in engineering practice. ...
Comments