Abstract
In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Aggarwal CC, Li Y, Wang J, Wang J (2009) Frequent pattern mining with uncertain data. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2009), pp 29–38
Agrawal R, Imielinski T, Swami A (1993) Mining associations between sets of items in massive databases. In: Proceedings of 1993 ACM-SIGMOD international conference on management of data, pp 207–216
Agresti A (1992) A survey of exact inference for contingency tables. Stat Sci 7(1):131–153
Bastide Y, Pasquier N, Taouil R, Stumme G, Lakhal L (2000) Mining minimal non-redundant association rules using frequent closed itemsets. In: Proceedings of first international conference on computational logic, pp 972–986
Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Disc 5(3):213–246
Bayardo RJ Jr, Agrawal R, Gunopulos D (2000) Constraint-based rule mining in large, dense databases. Data Min Knowl Disc 4(2/3):217–240
Ben-Israel A, Greville TNE (2003) Generalized inverses: theory and applications. Springer, New York
Bishop G (2009) Assessing the likely quality of the statistical longitudinal census dataset. Research paper, Australian Bureau of Statistics
Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: generalizing association rules to correlations. In: SIGMOD 1997, proceedings ACM SIGMOD international conference on management of data, pp 265–276
Calders T, Garboni C, Goethals B (2010) Approximation of frequentness probability of itemsets in uncertain data. In: Proceedings of IEEE international conference on data mining (ICDM 2010), pp 749–754
Carvalho JV, Ruiz DD (2013) Discovering frequent itemsets on uncertain data: a systematic review. In: Proceedings of 9th international conference on machine learning and data mining, pp 390–404
Chui CK, Kao B (2008) A decremental approach for mining frequent itemsets from uncertain data. In: Proceedings of 12th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2008), pp 64–75
Chui CK, Kao B, Hung E (2007) Mining frequent itemsets from uncertain data. In: Proceedings of 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007), pp 47–58
Foody GM (2002) Status of land cover classification accuracy assessment. Remote Sens Environ 80:185–201
Fosu GB (2001) Evaluation of population census data through demographic analysis. In: Symposium on global review of 2000 round of population and housing censuses: mid-decade assessment and future prospects. http://unstats.un.org/unsd/demographic/meetings/egm/symposium2001/docs/symposium_11.htm#_Toc7406238. Accessed 22 July 2015
Gray B, Orlowska M (1998) CCAIIA: clustering categorical attributes into interesting association rules. In: Proceedings of 2nd Pacific-Asia conference on knowledge discovery and data mining (PAKDD’98), pp 132–143
Hollister JW, Gonzalez ML, Paul JF, August PV, Copeland JL (2004) Assessing the accuracy of National Land Cover Dataset area estimates at multiple spatial extents. Photogramm Eng Remote Sensing 70:405–414
International Business Machines (1996) IBM intelligent miner user’s guide, version 1, release 1
Jones N, Lewis D (eds, with Aitken A, Hörngren J, Zilhão MJ) (2003) Handbook on improving quality by analysis of process variables. Final report, Eurostat
Mennis J, Liu JW (2005) Mining association rules in spatio-temporal data: an analysis of urban socioeconomic and land cover change. Trans GIS 9(1):5–17
McDonald JH (2014) Handbook of biological statistics, 3rd edn. Sparky House Publishing, Baltimore
Shaffer JP (1995) Multiple hypothesis testing. Annu Rev Psychol 46:561–584
Liu B, Hsu W, Ma Y (1999) Pruning and summarizing the discovered associations. In: Proceedings of 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’99), pp 125–134
Liu B, Hsu W, Ma Y (2001) Identifying non-actionable association rules. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 329–334
Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of 4th international conference on knowledge discovery and data mining (KDD ’98), pp 27–78
Office for National Statistics, The United Kingdom (2014) 2011 Census quality survey. http://www.ons.gov.uk/ons/guide-method/census/2011/census-data/2011-census-user-guide/quality-and-methods/quality/quality-measures/assessing-accuracy-of-answers/2011-census-quality-survey-report.pdf. Accessed 22 July 2015
Olson CE (2008) Is 80% accuracy good enough? In: Proceedings of 17th William T. pecora memorial remote sensing symposium. http://www.asprs.org/a/publications/proceedings/pecora17/0026.pdf. Accessed 27 Feb 2014
Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. In: Piatetsky-Shapiro G, Frawley J (eds) Knowledge discovery in databases. AAAI/MIT Press, Menlo Park, pp 229–248
Penrose R (1955) A generalized inverse for matrices. Math Proc Cambridge Philos 51:406–413
Rao CR, Mitra SK (1972) Generalized inverse of a matrix and its applications. In: Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 1: theory of statistics, pp 601–620
Smith JH, Stehman SV, Wickham JD, Yang L (2003) Effects of landscape characteristics on land-cover class accuracy. Remote Sens Environ 84:342–349
Srikant R, Agrawal R (1995) Mining generalized association rules. In: Proceedings of 21st international conference on very large data bases, pp 407–419
Stehman SV, Wickham JD, Wade TG, Smith JH (2008) Designing a multi-objective, multi-support accuracy assessment of the 2001 National Land Cover Data (NLCD 2001) of the conterminous United States. Photogramm Eng Remote Sensing 74:1561–1571
Sun L, Cheng R, Cheung DW, Cheng J (2010) Mining uncertain data with probabilistic guarantees. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2010), pp 273–282
Taussky O (1949) A recurring theorem on determinants. Am Math Mon 56(10):672–676
The Executive Office for Administration and Finance, Commonwealth of Massachusetts (2012) MassGIS datalayers. http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/datalayers/layerlist.html. Accessed 26 Sept 2013
Ting KM (2011) Confusion matrix. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning, 1st edn. Springer, New York
Tong Y, Chen L, Ding B (2012) Discovering threshold-based frequent closed itemsets over probabilistic data. In: Proceedings of 28th international conference on data engineering, pp 270–281
Webb GI (2007) Discovering significant patterns. Mach Learn 68:1–33
Webb GI, Zhang S (2005) \(K\)-optimal rule discovery. Data Min Knowl Disc 10(1):39–79
Yang L, Stehman SV, Smith JH, Wickham JD (2001) Thematic accuracy of MRLC land cover for eastern United States. Remote Sens Environ 76:418–422
Zaki MJ (2000) Generating non-redundant association rules. In: Proceedings of 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2000), pp 34–43
Zhang H, Padmanabhan B, Tuzhilin A (2004) On the discovery of significant statistical quantitative rules. In: Proceedings of 10th international conference on knowledge discovery and data mining (KDD 2004), pp 374–383
Zhu XQ, Wu XD (2006) Error awareness data mining. In: 2006 IEEE international conference on granular computing, pp 269–274
Zhu XQ, Wu XD, Yang Y (2004) Error detection and impact-sensitive instance ranking in noisy datasets. In: Proceedings of 19th national conference on artificial intelligence, pp 378–383
Acknowledgments
We wish to thank the action editor Mohammed J. Zaki and the anonymous reviewers for their insightful comments which have helped very much on improving the manuscript. This study is partially supported by the Special Fund for Technological Leading Talents, National Administration of Surveying, Mapping and Geoinformation, P.R. China.
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: M.J. Zaki.
Appendices
Appendix 1: Evaluating discrepancy between exact and approximate \(\hat{{s}}_{0}(c_{l})\) values
Let \(f\left( {x_{1}, \ldots , x_{k}} \right) =\sum \limits _{j=1}^{k}{p_{ij}^{-1} z\left( {\sum \limits _{l=1}^{k}{p_{jl} (1-p_{jl})x_{l}}} \right) ^{1/2}} \), then the discrepancy between the exact solution to \(\hat{{s}}_{0}(c_{i})\) from (13) and the approximate solution from (14) is:
and
Let \(\Delta _{l}=s\left( {c_{l}} \right) -\hat{{s}}_{0}\left( {c_{l}} \right) \), then (a1) may be estimated by the first degree Taylor polynomial of \(f\left( {s(c_{1}), \ldots , s(c_{1})} \right) \):
The error of estimating (20) by (22) is the Lagrange form of the remainder of the first degree Taylor polynomial:
where \(0\le \theta \le 1\). Each item in (23) with the same \(\left( {j, l} \right) \) value pair is equal to \(-{\left[ {\left( {\hat{{s}}_{0}(c_{l})} \right) ^{1/2}\Delta _{l}} \right] }/{\left[ {4\left( {\hat{{s}}_{0}(c_{l})+\theta \Delta _{l}} \right) ^{3/2}} \right] }\) times the corresponding item in (22). Typically \(\hat{{s}}_{0}\left( {c_{l}} \right) \) is much larger than \(\Delta _{l}\), thus (22) is much larger than (23) and should be a reasonable estimator of (20).
For each specific attribute in data, the elements of P and \(\mathbf{P}^{-1}\) and \(\hat{{s}}_{0}(c_{1})\cdots \hat{{s}}_{0}(c_{k})\) values can be substituted into (22) for evaluating the discrepancy. An exemplary evaluation was made with the item “\(att_{3}=1\)” in the synthetic experiment (see Sect. 4.1), one of the most affected items by the discrepancy. The computation used the “ideal” data defined in Sect. 4.1 as the original data, and the average z value actually used in the experiment at each error level in Table 5. At the highest error level with 20 % records contained erroneous \(att_{3}\) values, the relative discrepancy with respect to the correction to \(s(att_{3}=1)\) was only \(-0.19\) and \(-0.06\,\%\) for the data size of 4000 and 64,000, respectively. At lower data error levels, the relative discrepancy was even smaller as the z value decreased.
Appendix 2: Numerical synthetic data experiment results
Rights and permissions
About this article
Cite this article
Zhang, A., Shi, W. & Webb, G.I. Mining significant association rules from uncertain data. Data Min Knowl Disc 30, 928–963 (2016). https://doi.org/10.1007/s10618-015-0446-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-015-0446-6