skip to main content
research-article

Controlling false positives in association rule mining

Published:01 October 2011Publication History
Skip Abstract Section

Abstract

Association rule mining is an important problem in the data mining area. It enumerates and tests a large number of rules on a dataset and outputs rules that satisfy user-specified constraints. Due to the large number of rules being tested, rules that do not represent real systematic effect in the data can satisfy the given constraints purely by random chance. Hence association rule mining often suffers from a high risk of false positive errors. There is a lack of comprehensive study on controlling false positives in association rule mining. In this paper, we adopt three multiple testing correction approaches---the direct adjustment approach, the permutation-based approach and the holdout approach---to control false positives in association rule mining, and conduct extensive experiments to study their performance. Our results show that (1) Numerous spurious rules are generated if no correction is made. (2) The three approaches can control false positives effectively. Among the three approaches, the permutation-based approach has the highest power of detecting real association rules, but it is very computationally expensive. We employ several techniques to reduce its cost effectively.

References

  1. H. Abdi. Bonferroni and Šidák corrections for multiple comparisons. Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage, 2007.Google ScholarGoogle Scholar
  2. R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, 1993. Google ScholarGoogle Scholar
  3. S. D. Bay and M. J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5:213--246, 2001. Google ScholarGoogle Scholar
  4. Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57(1):125--133, 1995.Google ScholarGoogle Scholar
  5. S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD Conference, pages 265--276, 1997. Google ScholarGoogle Scholar
  6. G. E. Dallal. Historical background to the origins of p-values and the choice of 0.05 as the cut-off for significance. (http://www.jerrydallal.com/LHSP/p05.htm), 2007.Google ScholarGoogle Scholar
  7. S. Dudoit and M. J. van der Laan. Multiple Testing Procedures with Applications to Genomics. Springer, 2008.Google ScholarGoogle Scholar
  8. R. A. Fisher. On the interpretation of χ2 from contingency tables, and the calculation of p. Journal of the Royal Statistical Society, 85(1):87--94, 1922.Google ScholarGoogle Scholar
  9. L. Geng and H. J. Hamilton. Interestingness measures for data mining: a survey. ACM Computing Surveys, 38(3), 2006. Google ScholarGoogle Scholar
  10. A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal, and F. Vandin. An efficient rigorous approach for identifying statistically significant frequent itemsets. In PODS, pages 117--126, 2009. Google ScholarGoogle Scholar
  11. B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In SIGKDD, pages 80--86, 1998.Google ScholarGoogle Scholar
  12. G. Liu, H. Lu, and J. X. Yu. Cfp-tree: A compact disk-based structure for storing and querying frequent itemsets. Information Systems, 32(2):295--319, 2007. Google ScholarGoogle Scholar
  13. N. Megiddo and R. Srikant. Discovering predictive association rules. In SIGKDD, pages 274--278, 1998.Google ScholarGoogle Scholar
  14. N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In ICDT, pages 398--416, 1999. Google ScholarGoogle Scholar
  15. T. V. Perneger. What's wrong with bonferroni adjustments. British Medical Journal, 316:1236--1238, 1998.Google ScholarGoogle Scholar
  16. R. Rymon. Search through systematic set enumeration. In Proceedings of the Internation Conference on Principles of Knowledge Representation and Reasoning, pages 268--275, 1992.Google ScholarGoogle Scholar
  17. P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In SIGKDD, pages 32--41, 2002. Google ScholarGoogle Scholar
  18. G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1--33, 2007. Google ScholarGoogle Scholar
  19. G. I. Webb. Layered critical values: A powerful direct-adjustment approach to discovering significant patterns. Machine Learning, 71(2-3):307--323, 2008. Google ScholarGoogle Scholar
  20. P. H. Westfall and S. S. Young. Resampling-based multiple testing: examples and methods for P-value adjustment. Wiley, 1993.Google ScholarGoogle Scholar
  21. X. Yin and J. Han. Cpar: Classification based on predictive association rules. In SDM, 2003.Google ScholarGoogle Scholar
  22. M. J. Zaki and K. Gouda. Fast vertical mining using Diffsets. In SIGKDD, 2003. Google ScholarGoogle Scholar

Index Terms

  1. Controlling false positives in association rule mining

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader