research-article

Controlling false positives in association rule mining

Authors:
Guimei Liu

National University of Singapore

National University of Singapore
View Profile

,
Haojun Zhang

National University of Singapore

National University of Singapore
View Profile

,
Limsoon Wong

National University of Singapore

National University of Singapore
View Profile

Proceedings of the VLDB Endowment Volume 5 Issue 2pp 145–156https://doi.org/10.14778/2078324.2078330

Published:01 October 2011Publication History

Proceedings of the VLDB Endowment

Abstract

Association rule mining is an important problem in the data mining area. It enumerates and tests a large number of rules on a dataset and outputs rules that satisfy user-specified constraints. Due to the large number of rules being tested, rules that do not represent real systematic effect in the data can satisfy the given constraints purely by random chance. Hence association rule mining often suffers from a high risk of false positive errors. There is a lack of comprehensive study on controlling false positives in association rule mining. In this paper, we adopt three multiple testing correction approaches---the direct adjustment approach, the permutation-based approach and the holdout approach---to control false positives in association rule mining, and conduct extensive experiments to study their performance. Our results show that (1) Numerous spurious rules are generated if no correction is made. (2) The three approaches can control false positives effectively. Among the three approaches, the permutation-based approach has the highest power of detecting real association rules, but it is very computationally expensive. We employ several techniques to reduce its cost effectively.

References

H. Abdi. Bonferroni and Šidák corrections for multiple comparisons. Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage, 2007.Google Scholar
R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, 1993. Google Scholar
S. D. Bay and M. J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5:213--246, 2001. Google Scholar
Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57(1):125--133, 1995.Google Scholar
S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD Conference, pages 265--276, 1997. Google Scholar
G. E. Dallal. Historical background to the origins of p-values and the choice of 0.05 as the cut-off for significance. (http://www.jerrydallal.com/LHSP/p05.htm), 2007.Google Scholar
S. Dudoit and M. J. van der Laan. Multiple Testing Procedures with Applications to Genomics. Springer, 2008.Google Scholar
R. A. Fisher. On the interpretation of χ² from contingency tables, and the calculation of p. Journal of the Royal Statistical Society, 85(1):87--94, 1922.Google Scholar
L. Geng and H. J. Hamilton. Interestingness measures for data mining: a survey. ACM Computing Surveys, 38(3), 2006. Google Scholar
A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal, and F. Vandin. An efficient rigorous approach for identifying statistically significant frequent itemsets. In PODS, pages 117--126, 2009. Google Scholar
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In SIGKDD, pages 80--86, 1998.Google Scholar
G. Liu, H. Lu, and J. X. Yu. Cfp-tree: A compact disk-based structure for storing and querying frequent itemsets. Information Systems, 32(2):295--319, 2007. Google Scholar
N. Megiddo and R. Srikant. Discovering predictive association rules. In SIGKDD, pages 274--278, 1998.Google Scholar
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In ICDT, pages 398--416, 1999. Google Scholar
T. V. Perneger. What's wrong with bonferroni adjustments. British Medical Journal, 316:1236--1238, 1998.Google Scholar
R. Rymon. Search through systematic set enumeration. In Proceedings of the Internation Conference on Principles of Knowledge Representation and Reasoning, pages 268--275, 1992.Google Scholar
P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In SIGKDD, pages 32--41, 2002. Google Scholar
G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1--33, 2007. Google Scholar
G. I. Webb. Layered critical values: A powerful direct-adjustment approach to discovering significant patterns. Machine Learning, 71(2-3):307--323, 2008. Google Scholar
P. H. Westfall and S. S. Young. Resampling-based multiple testing: examples and methods for P-value adjustment. Wiley, 1993.Google Scholar
X. Yin and J. Han. Cpar: Classification based on predictive association rules. In SDM, 2003.Google Scholar
M. J. Zaki and K. Gouda. Fast vertical mining using Diffsets. In SIGKDD, 2003. Google Scholar

Index Terms

Controlling false positives in association rule mining
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

A Survey on Association Rule Mining
ACCT '15: Proceedings of the 2015 Fifth International Conference on Advanced Computing & Communication Technologies

Task of extracting useful and interesting knowledge from large data is called data mining. It has many aspects like clustering, classification, association mining, outlier detection, regression etc. Among them association rule mining is one of the ...
Read More
Efficient negative association rule mining based on chance thresholds

Typically association rule mining only considers positive frequent itemsets in rule generation, where rules involving only the presence of items are generated. In this paper we consider the complementary problem of negative association rule mining, ...
Read More
MICAR: nonlinear association rule mining based on maximal information coefficient
Abstract
Association rule mining (ARM) is an important research issue in data mining and knowledge discovery. Existing ARM methods cannot discover nonlinear association rules, despite nonlinearity being common and significant in engineering practice. ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 5, Issue 2
October 2011
72 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 October 2011
Published in pvldb Volume 5, Issue 2
Author Tags
association rule mining
multiple testing correction
statistical hypothesis testing
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 7
  Total Citations
  View Citations
- 285
  Total Downloads
- Downloads (Last 12 months)4
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Controlling false positives in association rule mining

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

A Survey on Association Rule Mining

Efficient negative association rule mining based on chance thresholds

MICAR: nonlinear association rule mining based on maximal information coefficient

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Controlling false positives in association rule mining

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

A Survey on Association Rule Mining

Efficient negative association rule mining based on chance thresholds

MICAR: nonlinear association rule mining based on maximal information coefficient

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media