Skip to main content
Log in

Mining significant association rules from uncertain data

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

References

  • Aggarwal CC, Li Y, Wang J, Wang J (2009) Frequent pattern mining with uncertain data. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2009), pp 29–38

  • Agrawal R, Imielinski T, Swami A (1993) Mining associations between sets of items in massive databases. In: Proceedings of 1993 ACM-SIGMOD international conference on management of data, pp 207–216

  • Agresti A (1992) A survey of exact inference for contingency tables. Stat Sci 7(1):131–153

    Article  MathSciNet  MATH  Google Scholar 

  • Bastide Y, Pasquier N, Taouil R, Stumme G, Lakhal L (2000) Mining minimal non-redundant association rules using frequent closed itemsets. In: Proceedings of first international conference on computational logic, pp 972–986

  • Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Disc 5(3):213–246

    Article  MATH  Google Scholar 

  • Bayardo RJ Jr, Agrawal R, Gunopulos D (2000) Constraint-based rule mining in large, dense databases. Data Min Knowl Disc 4(2/3):217–240

    Article  Google Scholar 

  • Ben-Israel A, Greville TNE (2003) Generalized inverses: theory and applications. Springer, New York

    MATH  Google Scholar 

  • Bishop G (2009) Assessing the likely quality of the statistical longitudinal census dataset. Research paper, Australian Bureau of Statistics

  • Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: generalizing association rules to correlations. In: SIGMOD 1997, proceedings ACM SIGMOD international conference on management of data, pp 265–276

  • Calders T, Garboni C, Goethals B (2010) Approximation of frequentness probability of itemsets in uncertain data. In: Proceedings of IEEE international conference on data mining (ICDM 2010), pp 749–754

  • Carvalho JV, Ruiz DD (2013) Discovering frequent itemsets on uncertain data: a systematic review. In: Proceedings of 9th international conference on machine learning and data mining, pp 390–404

  • Chui CK, Kao B (2008) A decremental approach for mining frequent itemsets from uncertain data. In: Proceedings of 12th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2008), pp 64–75

  • Chui CK, Kao B, Hung E (2007) Mining frequent itemsets from uncertain data. In: Proceedings of 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007), pp 47–58

  • Foody GM (2002) Status of land cover classification accuracy assessment. Remote Sens Environ 80:185–201

    Article  Google Scholar 

  • Fosu GB (2001) Evaluation of population census data through demographic analysis. In: Symposium on global review of 2000 round of population and housing censuses: mid-decade assessment and future prospects. http://unstats.un.org/unsd/demographic/meetings/egm/symposium2001/docs/symposium_11.htm#_Toc7406238. Accessed 22 July 2015

  • Gray B, Orlowska M (1998) CCAIIA: clustering categorical attributes into interesting association rules. In: Proceedings of 2nd Pacific-Asia conference on knowledge discovery and data mining (PAKDD’98), pp 132–143

  • Hollister JW, Gonzalez ML, Paul JF, August PV, Copeland JL (2004) Assessing the accuracy of National Land Cover Dataset area estimates at multiple spatial extents. Photogramm Eng Remote Sensing 70:405–414

    Article  Google Scholar 

  • International Business Machines (1996) IBM intelligent miner user’s guide, version 1, release 1

  • Jones N, Lewis D (eds, with Aitken A, Hörngren J, Zilhão MJ) (2003) Handbook on improving quality by analysis of process variables. Final report, Eurostat

  • Mennis J, Liu JW (2005) Mining association rules in spatio-temporal data: an analysis of urban socioeconomic and land cover change. Trans GIS 9(1):5–17

    Article  Google Scholar 

  • McDonald JH (2014) Handbook of biological statistics, 3rd edn. Sparky House Publishing, Baltimore

    Google Scholar 

  • Shaffer JP (1995) Multiple hypothesis testing. Annu Rev Psychol 46:561–584

    Article  Google Scholar 

  • Liu B, Hsu W, Ma Y (1999) Pruning and summarizing the discovered associations. In: Proceedings of 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’99), pp 125–134

  • Liu B, Hsu W, Ma Y (2001) Identifying non-actionable association rules. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 329–334

  • Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of 4th international conference on knowledge discovery and data mining (KDD ’98), pp 27–78

  • Office for National Statistics, The United Kingdom (2014) 2011 Census quality survey. http://www.ons.gov.uk/ons/guide-method/census/2011/census-data/2011-census-user-guide/quality-and-methods/quality/quality-measures/assessing-accuracy-of-answers/2011-census-quality-survey-report.pdf. Accessed 22 July 2015

  • Olson CE (2008) Is 80% accuracy good enough? In: Proceedings of 17th William T. pecora memorial remote sensing symposium. http://www.asprs.org/a/publications/proceedings/pecora17/0026.pdf. Accessed 27 Feb 2014

  • Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. In: Piatetsky-Shapiro G, Frawley J (eds) Knowledge discovery in databases. AAAI/MIT Press, Menlo Park, pp 229–248

    Google Scholar 

  • Penrose R (1955) A generalized inverse for matrices. Math Proc Cambridge Philos 51:406–413

    Article  MathSciNet  MATH  Google Scholar 

  • Rao CR, Mitra SK (1972) Generalized inverse of a matrix and its applications. In: Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 1: theory of statistics, pp 601–620

  • Smith JH, Stehman SV, Wickham JD, Yang L (2003) Effects of landscape characteristics on land-cover class accuracy. Remote Sens Environ 84:342–349

    Article  Google Scholar 

  • Srikant R, Agrawal R (1995) Mining generalized association rules. In: Proceedings of 21st international conference on very large data bases, pp 407–419

  • Stehman SV, Wickham JD, Wade TG, Smith JH (2008) Designing a multi-objective, multi-support accuracy assessment of the 2001 National Land Cover Data (NLCD 2001) of the conterminous United States. Photogramm Eng Remote Sensing 74:1561–1571

    Article  Google Scholar 

  • Sun L, Cheng R, Cheung DW, Cheng J (2010) Mining uncertain data with probabilistic guarantees. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2010), pp 273–282

  • Taussky O (1949) A recurring theorem on determinants. Am Math Mon 56(10):672–676

    Article  MathSciNet  MATH  Google Scholar 

  • The Executive Office for Administration and Finance, Commonwealth of Massachusetts (2012) MassGIS datalayers. http://www.mass.gov/anf/research-and-tech/it-serv-and-support/application-serv/office-of-geographic-information-massgis/datalayers/layerlist.html. Accessed 26 Sept 2013

  • Ting KM (2011) Confusion matrix. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning, 1st edn. Springer, New York

    Google Scholar 

  • Tong Y, Chen L, Ding B (2012) Discovering threshold-based frequent closed itemsets over probabilistic data. In: Proceedings of 28th international conference on data engineering, pp 270–281

  • Webb GI (2007) Discovering significant patterns. Mach Learn 68:1–33

    Article  Google Scholar 

  • Webb GI, Zhang S (2005) \(K\)-optimal rule discovery. Data Min Knowl Disc 10(1):39–79

    Article  MathSciNet  Google Scholar 

  • Yang L, Stehman SV, Smith JH, Wickham JD (2001) Thematic accuracy of MRLC land cover for eastern United States. Remote Sens Environ 76:418–422

    Article  Google Scholar 

  • Zaki MJ (2000) Generating non-redundant association rules. In: Proceedings of 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2000), pp 34–43

  • Zhang H, Padmanabhan B, Tuzhilin A (2004) On the discovery of significant statistical quantitative rules. In: Proceedings of 10th international conference on knowledge discovery and data mining (KDD 2004), pp 374–383

  • Zhu XQ, Wu XD (2006) Error awareness data mining. In: 2006 IEEE international conference on granular computing, pp 269–274

  • Zhu XQ, Wu XD, Yang Y (2004) Error detection and impact-sensitive instance ranking in noisy datasets. In: Proceedings of 19th national conference on artificial intelligence, pp 378–383

Download references

Acknowledgments

We wish to thank the action editor Mohammed J. Zaki and the anonymous reviewers for their insightful comments which have helped very much on improving the manuscript. This study is partially supported by the Special Fund for Technological Leading Talents, National Administration of Surveying, Mapping and Geoinformation, P.R. China.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Wenzhong Shi.

Additional information

Responsible editor: M.J. Zaki.

Appendices

Appendix 1: Evaluating discrepancy between exact and approximate \(\hat{{s}}_{0}(c_{l})\) values

Let \(f\left( {x_{1}, \ldots , x_{k}} \right) =\sum \limits _{j=1}^{k}{p_{ij}^{-1} z\left( {\sum \limits _{l=1}^{k}{p_{jl} (1-p_{jl})x_{l}}} \right) ^{1/2}} \), then the discrepancy between the exact solution to \(\hat{{s}}_{0}(c_{i})\) from (13) and the approximate solution from (14) is:

$$\begin{aligned}&\sum \limits _{j=1}^{k}{\left( {p_{ij}^{-1} z\left( {\left( {\sum \limits _{l=1}^{k}{p_{jl} (1-p_{jl})s(c_{l})}} \right) ^{1/2}-\left( {\sum \limits _{l=1}^{k}{p_{jl} (1-p_{jl})\hat{{s}}_{0}(c_{l})}} \right) ^{1/2}} \right) } \right) } \nonumber \\&\quad =f\left( {s(c_{1}), \ldots , s(c_{1})} \right) -f\left( {\hat{{s}}_{0}(c_{1}), \ldots , \hat{{s}}_{0}(c_{k})} \right) , \end{aligned}$$
(20)

and

$$\begin{aligned} \frac{\partial f\left( {x_{1}, \ldots , x_{k}} \right) }{\partial x_{l}}=\frac{z}{2}\sum \limits _{j=1}^{k}{p_{ij}^{-1} \cdot \left[ {p_{jl} \left( {1-p_{jl}} \right) } \right] ^{1/2}\cdot x_{l}^{{-1}/2}} . \end{aligned}$$
(21)

Let \(\Delta _{l}=s\left( {c_{l}} \right) -\hat{{s}}_{0}\left( {c_{l}} \right) \), then (a1) may be estimated by the first degree Taylor polynomial of \(f\left( {s(c_{1}), \ldots , s(c_{1})} \right) \):

$$\begin{aligned}&\left( {\Delta _{1}\frac{\partial }{\partial \hat{{s}}_{0}(c_{1})}+\cdots +\Delta _{k}\frac{\partial }{\partial \hat{{s}}_{0}(c_{k})}} \right) f\left( {\hat{{s}}_{0}(c_{1}),\cdots ,\hat{{s}}_{0}(c_{k})} \right) \nonumber \\&\quad =\frac{z}{2}\sum \limits _{j=1}^{k}{\sum \limits _{l=1}^{k}{p_{ij}^{-1} \cdot \left[ {p_{jl} \left( {1-p_{jl}} \right) } \right] ^{1/2}\cdot \hat{{s}}_{0}(c_{l})^{{-1}/2}}} \cdot \Delta _{l}. \end{aligned}$$
(22)
Table 8 Numerical synthetic data experiment results for E and R treatments

The error of estimating (20) by (22) is the Lagrange form of the remainder of the first degree Taylor polynomial:

$$\begin{aligned} R_{1}\left( {\hat{{s}}_{0}(c_{1}),\cdots ,\hat{{s}}_{0}(c_{k})} \right)= & {} \frac{1}{2!}\left( {\Delta _{1}\frac{\partial }{\partial \hat{{s}}_{0}(c_{1})}+\cdots +\Delta _{k}\frac{\partial }{\partial \hat{{s}}_{0}(c_{k})}} \right) ^{2}f\nonumber \\&\times \left( {\hat{{s}}_{0}(c_{1})+\theta \Delta _{1},\cdots ,\hat{{s}}_{0}(c_{k})+\theta \Delta _{k}} \right) \nonumber \\= & {} -\frac{z}{8}\sum \limits _{j=1}^{k}\sum \limits _{l=1}^{k}p_{ij}^{-1} \cdot \left[ {p_{jl} \left( {1-p_{jl}} \right) } \right] ^{1/2}\nonumber \\&\cdot \left( {\hat{{s}}_{0}(c_{l})+\theta \Delta _{l}} \right) ^{{-3}/2}\cdot \Delta _{l}^{2}, \end{aligned}$$
(23)

where \(0\le \theta \le 1\). Each item in (23) with the same \(\left( {j, l} \right) \) value pair is equal to \(-{\left[ {\left( {\hat{{s}}_{0}(c_{l})} \right) ^{1/2}\Delta _{l}} \right] }/{\left[ {4\left( {\hat{{s}}_{0}(c_{l})+\theta \Delta _{l}} \right) ^{3/2}} \right] }\) times the corresponding item in (22). Typically \(\hat{{s}}_{0}\left( {c_{l}} \right) \) is much larger than \(\Delta _{l}\), thus (22) is much larger than (23) and should be a reasonable estimator of (20).

For each specific attribute in data, the elements of P and \(\mathbf{P}^{-1}\) and \(\hat{{s}}_{0}(c_{1})\cdots \hat{{s}}_{0}(c_{k})\) values can be substituted into (22) for evaluating the discrepancy. An exemplary evaluation was made with the item “\(att_{3}=1\)” in the synthetic experiment (see Sect. 4.1), one of the most affected items by the discrepancy. The computation used the “ideal” data defined in Sect. 4.1 as the original data, and the average z value actually used in the experiment at each error level in Table 5. At the highest error level with 20 % records contained erroneous \(att_{3}\) values, the relative discrepancy with respect to the correction to \(s(att_{3}=1)\) was only \(-0.19\) and \(-0.06\,\%\) for the data size of 4000 and 64,000, respectively. At lower data error levels, the relative discrepancy was even smaller as the z value decreased.

Appendix 2: Numerical synthetic data experiment results

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, A., Shi, W. & Webb, G.I. Mining significant association rules from uncertain data. Data Min Knowl Disc 30, 928–963 (2016). https://doi.org/10.1007/s10618-015-0446-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-015-0446-6

Keywords

Navigation