Re-identification in the absence of common variables for matching. (English) Zbl 07776803

Summary: A basic concern in statistical disclosure limitation is the re-identification of individuals in anonymised microdata. Linking against a second dataset that contains identifying information can result in a breach of confidentiality. Almost all linkage approaches are based on comparing the values of variables that are common to both datasets. It is tempting to think that if datasets contain no common variables, then there can be no risk of re-identification. However, linkage has been attempted between such datasets via the extraction of structural information using ordered weighted averaging (OWA) operators. Although this approach has been shown to perform better than randomly pairing records, it is debatable whether it demonstrates a practically significant disclosure risk. This paper reviews some of the main aspects of statistical disclosure limitation. It then goes on to show that a relatively simple, supervised Bayesian approach can consistently outperform OWA linkage. Furthermore, the Bayesian approach demonstrates a significant risk of re-identification for the types of data considered in the OWA record linkage literature.
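
For orientation, the OWA operator named in the summary can be sketched in a few lines. This is the standard textbook definition (sort the inputs in descending order, then take a weighted sum with non-negative weights that sum to one), not the linkage procedure evaluated in the paper. The function name `owa` and the tolerance used in the weight check are illustrative assumptions.

```python
from typing import Sequence


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted average: weights attach to ranks, not to positions.

    Assumption (illustrative sketch): weights must be non-negative and sum
    to 1 within a small tolerance.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    # Reorder the inputs from largest to smallest before weighting.
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))
```

Because the weights apply to ranks, the choice of weight vector moves the operator between the maximum (all weight on the first rank), the minimum (all weight on the last rank), and the arithmetic mean (equal weights).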

MSC:

62Fxx Parametric inference
05Cxx Graph theory
62Hxx Multivariate analysis

Software:

UCI-ml; FaceTracer

References:

[1] Backstrom, L., Dwork, C. & Kleinberg, J.2007. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International World Wide Web Conference, pp. 181-190: Banff, Alberta, Canada.
[2] Bambauer, J., Muralidhar, K. & Sarathy, R.2014. Fool’s gold: an illustrated critique of differential privacy. Vanderbilt J. Entertain. Technol. Law, 16(4), 701-755.
[3] Bohannon, J.2015. UNMASKED: facial recognition software could soon ID you in any photo. Science, 347, 492-494.
[4] Brownstein, J., Cassa, C. & Mandl, K.2006. No place to hide: reverse identification of patients from published maps. N. Engl. J. Med., 355(16), 1741-1742.
[5] Dawid, A. & Lauritzen, S.1993. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Stat., 21(3), 1272-1317. · Zbl 0815.62038
[6] Dempster, A.P., Laird, N. & Rubin, D.B.1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B, 39(1), 1-38. · Zbl 0364.62022
[7] Dobra, A. & Fienberg, S.E.2000. Bounds for cell entries in contingency tables given marginal totals and decomposable graphs. Proc. Natl. Acad. Sci., 97(22), 11185-11189. · Zbl 0960.62059
[8] Dobra, A. & Fienberg, S.E.2008. The Generalized Shuttle Algorithm. Technical report, University of Washington, 2008. http://www.csss.washington.edu/Papers/wp83.pdf
[9] Duncan, G.T.2002. Confidentiality and statistical disclosure limitation. In International Encyclopedia of the Social and Behavioral Sciences, Pergamon: Oxford, pp. 2521-2525.
[10] Duncan, G.T., Elliot, M. & Salazar‐Gonzáles, J.‐J.2011. Statistical Confidentiality: Principles and Practice, Statistics for Social and Behavioral Sciences. Springer‐Verlag: New York. ISBN 978‐1‐4419‐7801‐1. · Zbl 1233.62204
[11] Dwork, C.2008. An ad omnia approach to defining and achieving private data analysis. In Privacy, Security, and Trust in KDD, Lecture Notes in Computer Science. Springer: Berlin Heidelberg, pp. 1-13.
[12] Elliot, M.J. & Dale, A.1999. Scenarios of attack: the data intruder’s perspective on statistical disclosure risk. Netherlands Official Stat., 14, 6-10.
[13] Elliot, M.J., Manning, A.M. & Ford, R.W.2002. A computational algorithm for handling the special uniques problem. Int. J. Uncertainty Fuzziness Knowledge Based Syst., 10(5), 493-509. · Zbl 1085.68580
[14] Fellegi, I. & Sunter, A.1969. A theory for record linkage. JASA, 64(238), 1183-1210. · Zbl 0186.53903
[15] Fienberg, S.E. & Makov, U.1998. Confidentiality, uniqueness and disclosure limitation for categorical data. J. Off. Stat., 14, 385-397.
[16] Forster, J. & Webb, E.2007. Bayesian disclosure risk assessment: predicting small frequencies in contingency tables. JRSS Ser. C, 56(5), 551-570.
[17] Fortini, M., Liseo, B., Nuccitelli, A. & Scanu, M.2001. On Bayesian record linkage. Res. Off. Stat., 4, 185-198.
[18] Frydenberg, M. & Lauritzen, S.1989. Decomposition of maximum likelihood in mixed graphical interaction models. Biometrika, 76(3), 539-555. · Zbl 0677.62053
[19] Fu, Z., Christen, P. & Zhou, J.2014. A graph matching method for historical census household linkage. Int. J. Humanit. Arts Comput., 8, 204-225.
[20] Garfinkel, S.L., Abowd, J.M. & Powazek, S.2018. Issues encountered deploying differential privacy. In Proceedings of the 2018 Workshop on Privacy in the Electronic Society, WPES’18, pp. 133-137, ACM: New York, NY, USA. http://doi.acm.org/10.1145/3267323.3268949
[21] Getoor, L., Friedman, N., Koller, D. & Taskar, B.2002. Learning probabilistic models of link structure. J. Mach. Learn. Res., 3, 679-707. · Zbl 1112.68441
[22] Gutman, R., Afendulis, C. & Zaslavsky, A.2013. A Bayesian procedure for file linking to analyze end‐of‐life medical costs. JASA, 108(501), 34-47. https://doi.org/10.1080/01621459.2012.726889. PMID: 23645944. · Zbl 1379.62069 · doi:10.1080/01621459.2012.726889
[23] Gymrek, M., McGuire, A., Golan, D., Halperin, E. & Erlich, Y.2013. Identifying personal genomes by surname inference. Science, 339, 321-324.
[24] Hacker, P.2018. Teaching fairness to artificial intelligence: existing and novel strategies against algorithmic discrimination under EU law. Common Market Law Rev., 55, 1143-1185.
[25] Hoeting, J., Madigan, D., Raftery, A. & Volinsky, C.1999. Bayesian model averaging: a tutorial. Stat. Sci., 14(4), 382-417. · Zbl 1059.62525
[26] Institute of Medicine2015. Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. The National Academies Press: Washington, DC. ISBN 978‐0‐309‐31629‐3. URL https://www.nap.edu/catalog/18998/sharing-clinical-trial-data-maximizing-benefits-minimizing-risk
[27] Jaro, M.A.1989. Advances in record linkage methodology as applied to the 1985 census of Tampa Florida. JASA, 84(406), 414-420.
[28] Jensen, F.V.1996. An Introduction to Bayesian Networks. Springer‐Verlag: New York.
[29] Kerckhoffs, A.1883. La cryptographie militaire. J. Sci. Mil., IX, 5-38.
[30] Kruskal, J.1956. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Amer. Math. Soc., 7, 48-50. · Zbl 0070.18404
[31] Kuhn, H.1955. The Hungarian method for the assignment problem. Nav. Res. Logist. Q., 2(2), 83-97. · Zbl 0143.41905
[32] Kumar, N., Belhumeur, P.N. & Nayar, S.K.2008. FaceTracer: a search engine for large collections of images with faces. In IEEE International Conference on Computer Vision, pp. 340-353: Berlin, Heidelberg.
[33] Lauritzen, S. & Spiegelhalter, D.1988. Local computations with probabilities on graphical structures and their application to expert systems. JRSS Ser. B, 50(2), 157-224. · Zbl 0684.68106
[34] Lichman, M.2013. UCI machine learning repository. http://archive.ics.uci.edu/ml
[35] Machanavajjhala, A., Kifer, D., Gehrke, J. & Venkitasubramaniam, M.2007. l‐diversity: privacy beyond k‐anonymity. ACM Trans. Knowl. Discov. Data, 1(1). http://doi.acm.org/10.1145/1217299.1217302
[36] Madigan, D. & Raftery, A.1994. Model selection and accounting for model uncertainty in graphical models using Occam’s window. JASA, 89(428), 1535-1546. · Zbl 0814.62030
[37] Madigan, D. & York, J.1995. Bayesian graphical models for discrete data. Int. Stat. Rev., 63, 215-232. · Zbl 0834.62003
[38] Montjoye, Y.‐A., Radaelli, L., Singh, V. & Pentland, A.2015. Unique in the shopping mall: on the reidentifiability of credit card metadata. Science, 347, 536-539.
[39] Nin, J., Herranz, J. & Torra, V.2008. Rethinking rank swapping to decrease disclosure risk. Data Knowl. Eng., 64(1), 346-364. https://doi.org/10.1016/j.datak.2007.07.006. URL http://www.sciencedirect.com/science/article/pii/S0169023X07001498 · doi:10.1016/j.datak.2007.07.006
[40] Nin, J. & Torra, V.2005. Towards the use of OWA operators for record linkage. In Proceedings of the Joint 4th Conference of the European Society for Fuzzy Logic and Technology and the 11th Rencontres Francophones sur la Logique Floue et ses Applications, pp. 34-39: Barcelona, Spain.
[41] Li, N., Li, T. & Venkatasubramanian, S.2007. t‐closeness: privacy beyond k‐anonymity and l‐diversity. In Proceedings of the IEEE 23rd International Conference on Data Engineering, edited by Chirkova, R., Dogac, A., Ozsu, T. & Sellis, T., pp. 106-115, IEEE: Istanbul, Turkey. https://doi.org/10.1109/ICDE.2007.367856 · doi:10.1109/ICDE.2007.367856
[42] Paass, G.1988. Disclosure risk and disclosure avoidance for microdata. J. Bus. Econ. Stat., 6(4), 487-500.
[43] Purdam, K. & Elliot, M.2007. A case study of the impact of statistical disclosure control on data quality in the individual UK samples of anonymised records. Environ. Plan. A, 39(5), 1101-1118.
[44] Purdam, K., Mackey, E. & Elliot, M.2003. Whose data is it? Personal data and privacy. In Paper Presented to British Sociology Association Annual Conference: York.
[45] Sadinle, M.2017. Bayesian estimation of bipartite matchings for record linkage. JASA, 112(518), 600-612. https://doi.org/10.1080/01621459.2016.1148612 · doi:10.1080/01621459.2016.1148612
[46] Sadinle, M. & Fienberg, S.E.2013. A generalized Fellegi-Sunter framework for multiple record linkage with application to homicide record systems. JASA, 108(502), 385-397. · Zbl 06195947
[47] Skinner, C. & Elliot, M.E.2002. A measure of disclosure risk for microdata. J. R. Stat. Soc. Ser. B, 64(4), 855-867. · Zbl 1067.62015
[48] Skinner, C. & Holmes, D.1998. Estimating the re‐identification risk per record in microdata. J. Off. Stat., 14, 361-372.
[49] Smith, D.2001. The efficient propagation of arbitrary subsets of beliefs in discrete‐valued Bayesian belief networks. In Artificial Intelligence and Statistics, Vol. 2001: Key West, Florida.
[50] Smith, D.2006. An evaluation of strategies for matching population and sample units. In International Conference of the Royal Statistical Society, Queen’s University: Belfast.
[51] Smith, D.2011. Aspects of Statistical Disclosure Control. PhD thesis, University of Manchester.
[52] Smith, D. & Elliot, M.E.2008. A measure of disclosure risk for tables of counts. Trans. Data Priv., 1(1), 34-52.
[53] Smith, D. & Elliot, M.2014. A graph‐based approach to key variable mapping. J. Priv. Confid., 6(2), 81-115.
[54] Smith, D. & Shlomo, N.2014. Report for the data without boundaries project. Technical report, University of Manchester. http://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/reports/2014-01-Data_without_Boundaries_Report.pdf
[55] Steorts, R.C., Hall, R. & Fienberg, S.E.2016. A Bayesian approach to graphical record linkage and deduplication. JASA, 111(516), 1660-1672. https://doi.org/10.1080/01621459.2015.1105807 · doi:10.1080/01621459.2015.1105807
[56] Sweeney, L.2002. k‐anonymity: a model for protecting privacy. Int. J. Uncertainty Fuzziness Knowledge Based Syst., 10(7), 557-570. · Zbl 1085.68589
[57] Tancredi, A. & Liseo, B.2011. A hierarchical Bayesian approach to record linkage and population size problems. Ann. Appl. Stat., 5(2B), 1553-1585. · Zbl 1223.62015
[58] Taskar, B., Wong, M., Abbeel, P. & Koller, D.2004. Link prediction in relational data. In Advances in Neural Information Processing Systems, Vol. 16, MIT Press: Vancouver, Canada.
[59] Tockar, A.2014. Riding with the stars: passenger privacy in the NYC taxicab dataset. https://research.neustar.biz/2014/09/15/. (visited Oct. 31st 2018).
[60] Torra, V.2004. OWA operators in data modeling and reidentification. IEEE Trans. Fuzzy Syst., 12(5), 652-660.
[61] Willenborg, L. & de Waal, T.2001. Elements of Statistical Disclosure Control, Lecture Notes in Statistics, Vol. 155. Springer: New York NY. · Zbl 0973.62009
[62] Winkler, W.E.1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research Methods (American Statistical Association), pp. 354-359: Washington, DC.
[63] Winkler, W.E.2004. Re‐identification methods for masked microdata. Lect. Notes Comput. Sci., 3050, 216-230. https://doi.org/10.1007/978-3-540-25955-8_17 · doi:10.1007/978-3-540-25955-8_17
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.