Mining significant association rules from uncertain data

1431 Accesses
13 Citations
1 Altmetric
Explore all metrics

Abstract

In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Significant Association Rule Mining Without Support and Confidence Thresholds

A Framework for Interestingness Measures for Association Rules with Discrete and Continuous Attributes Based on Statistical Validity

Positive and Negative Association Rule Mining Using Correlation Threshold and Dual Confidence Approach

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Aggarwal CC, Li Y, Wang J, Wang J (2009) Frequent pattern mining with uncertain data. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2009), pp 29–38
Agrawal R, Imielinski T, Swami A (1993) Mining associations between sets of items in massive databases. In: Proceedings of 1993 ACM-SIGMOD international conference on management of data, pp 207–216
Agresti A (1992) A survey of exact inference for contingency tables. Stat Sci 7(1):131–153
Article MathSciNet MATH Google Scholar
Bastide Y, Pasquier N, Taouil R, Stumme G, Lakhal L (2000) Mining minimal non-redundant association rules using frequent closed itemsets. In: Proceedings of first international conference on computational logic, pp 972–986
Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Disc 5(3):213–246
Article MATH Google Scholar
Bayardo RJ Jr, Agrawal R, Gunopulos D (2000) Constraint-based rule mining in large, dense databases. Data Min Knowl Disc 4(2/3):217–240
Article Google Scholar
Ben-Israel A, Greville TNE (2003) Generalized inverses: theory and applications. Springer, New York
MATH Google Scholar
Bishop G (2009) Assessing the likely quality of the statistical longitudinal census dataset. Research paper, Australian Bureau of Statistics
Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: generalizing association rules to correlations. In: SIGMOD 1997, proceedings ACM SIGMOD international conference on management of data, pp 265–276
Calders T, Garboni C, Goethals B (2010) Approximation of frequentness probability of itemsets in uncertain data. In: Proceedings of IEEE international conference on data mining (ICDM 2010), pp 749–754
Carvalho JV, Ruiz DD (2013) Discovering frequent itemsets on uncertain data: a systematic review. In: Proceedings of 9th international conference on machine learning and data mining, pp 390–404
Chui CK, Kao B (2008) A decremental approach for mining frequent itemsets from uncertain data. In: Proceedings of 12th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2008), pp 64–75
Chui CK, Kao B, Hung E (2007) Mining frequent itemsets from uncertain data. In: Proceedings of 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007), pp 47–58
Foody GM (2002) Status of land cover classification accuracy assessment. Remote Sens Environ 80:185–201
Article Google Scholar
Fosu GB (2001) Evaluation of population census data through demographic analysis. In: Symposium on global review of 2000 round of population and housing censuses: mid-decade assessment and future prospects. http://unstats.un.org/unsd/demographic/meetings/egm/symposium2001/docs/symposium_11.htm#_Toc7406238. Accessed 22 July 2015
Gray B, Orlowska M (1998) CCAIIA: clustering categorical attributes into interesting association rules. In: Proceedings of 2nd Pacific-Asia conference on knowledge discovery and data mining (PAKDD’98), pp 132–143
Hollister JW, Gonzalez ML, Paul JF, August PV, Copeland JL (2004) Assessing the accuracy of National Land Cover Dataset area estimates at multiple spatial extents. Photogramm Eng Remote Sensing 70:405–414
Article Google Scholar
International Business Machines (1996) IBM intelligent miner user’s guide, version 1, release 1
Jones N, Lewis D (eds, with Aitken A, Hörngren J, Zilhão MJ) (2003) Handbook on improving quality by analysis of process variables. Final report, Eurostat
Mennis J, Liu JW (2005) Mining association rules in spatio-temporal data: an analysis of urban socioeconomic and land cover change. Trans GIS 9(1):5–17
Article Google Scholar
McDonald JH (2014) Handbook of biological statistics, 3rd edn. Sparky House Publishing, Baltimore
Google Scholar
Shaffer JP (1995) Multiple hypothesis testing. Annu Rev Psychol 46:561–584
Article Google Scholar
Liu B, Hsu W, Ma Y (1999) Pruning and summarizing the discovered associations. In: Proceedings of 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’99), pp 125–134
Liu B, Hsu W, Ma Y (2001) Identifying non-actionable association rules. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 329–334
Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of 4th international conference on knowledge discovery and data mining (KDD ’98), pp 27–78
Office for National Statistics, The United Kingdom (2014) 2011 Census quality survey. http://www.ons.gov.uk/ons/guide-method/census/2011/census-data/2011-census-user-guide/quality-and-methods/quality/quality-measures/assessing-accuracy-of-answers/2011-census-quality-survey-report.pdf. Accessed 22 July 2015
Olson CE (2008) Is 80% accuracy good enough? In: Proceedings of 17th William T. pecora memorial remote sensing symposium. http://www.asprs.org/a/publications/proceedings/pecora17/0026.pdf. Accessed 27 Feb 2014
Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. In: Piatetsky-Shapiro G, Frawley J (eds) Knowledge discovery in databases. AAAI/MIT Press, Menlo Park, pp 229–248
Google Scholar
Penrose R (1955) A generalized inverse for matrices. Math Proc Cambridge Philos 51:406–413
Article MathSciNet MATH Google Scholar
Rao CR, Mitra SK (1972) Generalized inverse of a matrix and its applications. In: Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 1: theory of statistics, pp 601–620
Smith JH, Stehman SV, Wickham JD, Yang L (2003) Effects of landscape characteristics on land-cover class accuracy. Remote Sens Environ 84:342–349
Article Google Scholar
Srikant R, Agrawal R (1995) Mining generalized association rules. In: Proceedings of 21st international conference on very large data bases, pp 407–419
Stehman SV, Wickham JD, Wade TG, Smith JH (2008) Designing a multi-objective, multi-support accuracy assessment of the 2001 National Land Cover Data (NLCD 2001) of the conterminous United States. Photogramm Eng Remote Sensing 74:1561–1571
Article Google Scholar
Sun L, Cheng R, Cheung DW, Cheng J (2010) Mining uncertain data with probabilistic guarantees. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2010), pp 273–282
Taussky O (1949) A recurring theorem on determinants. Am Math Mon 56(10):672–676
Article MathSciNet MATH Google Scholar
The Executive Office for Administration and Finance, Commonwealth of Massachusetts (2012) MassGIS datalayers. http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/datalayers/layerlist.html. Accessed 26 Sept 2013
Ting KM (2011) Confusion matrix. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning, 1st edn. Springer, New York
Google Scholar
Tong Y, Chen L, Ding B (2012) Discovering threshold-based frequent closed itemsets over probabilistic data. In: Proceedings of 28th international conference on data engineering, pp 270–281
Webb GI (2007) Discovering significant patterns. Mach Learn 68:1–33
Article Google Scholar
Webb GI, Zhang S (2005) $K$-optimal rule discovery. Data Min Knowl Disc 10(1):39–79
Article MathSciNet Google Scholar
Yang L, Stehman SV, Smith JH, Wickham JD (2001) Thematic accuracy of MRLC land cover for eastern United States. Remote Sens Environ 76:418–422
Article Google Scholar
Zaki MJ (2000) Generating non-redundant association rules. In: Proceedings of 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2000), pp 34–43
Zhang H, Padmanabhan B, Tuzhilin A (2004) On the discovery of significant statistical quantitative rules. In: Proceedings of 10th international conference on knowledge discovery and data mining (KDD 2004), pp 374–383
Zhu XQ, Wu XD (2006) Error awareness data mining. In: 2006 IEEE international conference on granular computing, pp 269–274
Zhu XQ, Wu XD, Yang Y (2004) Error detection and impact-sensitive instance ranking in noisy datasets. In: Proceedings of 19th national conference on artificial intelligence, pp 378–383

Download references

Acknowledgments

We wish to thank the action editor Mohammed J. Zaki and the anonymous reviewers for their insightful comments which have helped very much on improving the manuscript. This study is partially supported by the Special Fund for Technological Leading Talents, National Administration of Surveying, Mapping and Geoinformation, P.R. China.

Author information

Authors and Affiliations

Department of Land Surveying and Geo-Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, People’s Republic of China
Anshu Zhang & Wenzhong Shi
Faculty of Information Technology, Monash University, Melbourne, VIC, 3800, Australia
Geoffrey I. Webb

Authors

Anshu Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Wenzhong Shi
View author publications
You can also search for this author in PubMed Google Scholar
Geoffrey I. Webb
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wenzhong Shi.

Additional information

Responsible editor: M.J. Zaki.

Appendices

Appendix 1: Evaluating discrepancy between exact and approximate $\hat{{s}}_{0}(c_{l})$ values

Let $f\left( {x_{1}, \ldots , x_{k}} \right) =\sum \limits _{j=1}^{k}{p_{ij}^{-1} z\left( {\sum \limits _{l=1}^{k}{p_{jl} (1-p_{jl})x_{l}}} \right) ^{1/2}} $, then the discrepancy between the exact solution to $\hat{{s}}_{0}(c_{i})$ from (13) and the approximate solution from (14) is:

$$\begin{aligned}&\sum \limits _{j=1}^{k}{\left( {p_{ij}^{-1} z\left( {\left( {\sum \limits _{l=1}^{k}{p_{jl} (1-p_{jl})s(c_{l})}} \right) ^{1/2}-\left( {\sum \limits _{l=1}^{k}{p_{jl} (1-p_{jl})\hat{{s}}_{0}(c_{l})}} \right) ^{1/2}} \right) } \right) } \nonumber \\&\quad =f\left( {s(c_{1}), \ldots , s(c_{1})} \right) -f\left( {\hat{{s}}_{0}(c_{1}), \ldots , \hat{{s}}_{0}(c_{k})} \right) , \end{aligned}$$

(20)

and

$$\begin{aligned} \frac{\partial f\left( {x_{1}, \ldots , x_{k}} \right) }{\partial x_{l}}=\frac{z}{2}\sum \limits _{j=1}^{k}{p_{ij}^{-1} \cdot \left[ {p_{jl} \left( {1-p_{jl}} \right) } \right] ^{1/2}\cdot x_{l}^{{-1}/2}} . \end{aligned}$$

(21)

Let $\Delta _{l}=s\left( {c_{l}} \right) -\hat{{s}}_{0}\left( {c_{l}} \right) $, then (a1) may be estimated by the first degree Taylor polynomial of $f\left( {s(c_{1}), \ldots , s(c_{1})} \right) $:

$$\begin{aligned}&\left( {\Delta _{1}\frac{\partial }{\partial \hat{{s}}_{0}(c_{1})}+\cdots +\Delta _{k}\frac{\partial }{\partial \hat{{s}}_{0}(c_{k})}} \right) f\left( {\hat{{s}}_{0}(c_{1}),\cdots ,\hat{{s}}_{0}(c_{k})} \right) \nonumber \\&\quad =\frac{z}{2}\sum \limits _{j=1}^{k}{\sum \limits _{l=1}^{k}{p_{ij}^{-1} \cdot \left[ {p_{jl} \left( {1-p_{jl}} \right) } \right] ^{1/2}\cdot \hat{{s}}_{0}(c_{l})^{{-1}/2}}} \cdot \Delta _{l}. \end{aligned}$$

(22)

Table 8 Numerical synthetic data experiment results for E and R treatments

Full size table

The error of estimating (20) by (22) is the Lagrange form of the remainder of the first degree Taylor polynomial:

$$\begin{aligned} R_{1}\left( {\hat{{s}}_{0}(c_{1}),\cdots ,\hat{{s}}_{0}(c_{k})} \right)= & {} \frac{1}{2!}\left( {\Delta _{1}\frac{\partial }{\partial \hat{{s}}_{0}(c_{1})}+\cdots +\Delta _{k}\frac{\partial }{\partial \hat{{s}}_{0}(c_{k})}} \right) ^{2}f\nonumber \\&\times \left( {\hat{{s}}_{0}(c_{1})+\theta \Delta _{1},\cdots ,\hat{{s}}_{0}(c_{k})+\theta \Delta _{k}} \right) \nonumber \\= & {} -\frac{z}{8}\sum \limits _{j=1}^{k}\sum \limits _{l=1}^{k}p_{ij}^{-1} \cdot \left[ {p_{jl} \left( {1-p_{jl}} \right) } \right] ^{1/2}\nonumber \\&\cdot \left( {\hat{{s}}_{0}(c_{l})+\theta \Delta _{l}} \right) ^{{-3}/2}\cdot \Delta _{l}^{2}, \end{aligned}$$

(23)

where $0\le \theta \le 1$. Each item in (23) with the same $\left( {j, l} \right) $ value pair is equal to $-{\left[ {\left( {\hat{{s}}_{0}(c_{l})} \right) ^{1/2}\Delta _{l}} \right] }/{\left[ {4\left( {\hat{{s}}_{0}(c_{l})+\theta \Delta _{l}} \right) ^{3/2}} \right] }$ times the corresponding item in (22). Typically $\hat{{s}}_{0}\left( {c_{l}} \right) $ is much larger than $\Delta _{l}$, thus (22) is much larger than (23) and should be a reasonable estimator of (20).

For each specific attribute in data, the elements of P and $\mathbf{P}^{-1}$ and $\hat{{s}}_{0}(c_{1})\cdots \hat{{s}}_{0}(c_{k})$ values can be substituted into (22) for evaluating the discrepancy. An exemplary evaluation was made with the item “$att_{3}=1$” in the synthetic experiment (see Sect. 4.1), one of the most affected items by the discrepancy. The computation used the “ideal” data defined in Sect. 4.1 as the original data, and the average z value actually used in the experiment at each error level in Table 5. At the highest error level with 20 % records contained erroneous $att_{3}$ values, the relative discrepancy with respect to the correction to $s(att_{3}=1)$ was only $-0.19$ and $-0.06\,\%$ for the data size of 4000 and 64,000, respectively. At lower data error levels, the relative discrepancy was even smaller as the z value decreased.

Appendix 2: Numerical synthetic data experiment results

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhang, A., Shi, W. & Webb, G.I. Mining significant association rules from uncertain data. Data Min Knowl Disc 30, 928–963 (2016). https://doi.org/10.1007/s10618-015-0446-6

Download citation

Received: 21 January 2015
Accepted: 30 November 2015
Published: 12 January 2016
Issue Date: July 2016
DOI: https://doi.org/10.1007/s10618-015-0446-6

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Significant Association Rule Mining Without Support and Confidence Thresholds

A Framework for Interestingness Measures for Association Rules with Discrete and Continuous Attributes Based on Statistical Validity

Positive and Negative Association Rule Mining Using Correlation Threshold and Dual Confidence Approach

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendices

Appendix 1: Evaluating discrepancy between exact and approximate \(\hat{{s}}_{0}(c_{l})\) values

Appendix 2: Numerical synthetic data experiment results

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Mining significant association rules from uncertain data

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Significant Association Rule Mining Without Support and Confidence Thresholds

A Framework for Interestingness Measures for Association Rules with Discrete and Continuous Attributes Based on Statistical Validity

Positive and Negative Association Rule Mining Using Correlation Threshold and Dual Confidence Approach

Explore related subjects

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendices

Appendix 1: Evaluating discrepancy between exact and approximate \(\hat{{s}}_{0}(c_{l})\) values

Appendix 2: Numerical synthetic data experiment results

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation