
HDDA: DataSifter: statistical obfuscation of electronic health records and other sensitive datasets. (English) Zbl 07193723

Summary: There are no practical and effective mechanisms for sharing high-dimensional data that include sensitive information, in fields such as health, financial intelligence, or socioeconomics, without either compromising the utility of the data or exposing private personal or secure organizational information. Excessive scrambling or encoding of the information makes it less useful for modelling or analytical processing. Insufficient preprocessing may compromise sensitive information and introduce a substantial risk of re-identification of individuals by various stratification techniques. To address this problem, we developed a novel statistical obfuscation method (DataSifter) for on-the-fly de-identification of structured and unstructured sensitive high-dimensional data, such as clinical data from electronic health records (EHR). DataSifter provides complete administrative control over the balance between the risk of data re-identification and the preservation of the data information. Simulation results suggest that DataSifter can provide privacy protection while maintaining data utility for different types of outcomes of interest. The application of DataSifter to a large autism dataset provides a realistic demonstration of its promise for practical applications.

MSC:

62-XX Statistics
