Bayesian causal inference with bipartite record linkage. (English) Zbl 07810152

Summary: In some scenarios, the observational data needed for causal inferences are spread over two data files. In particular, we consider scenarios where one file includes covariates and the treatment measured on a set of individuals, and a second file includes responses measured on another, partially overlapping set of individuals. In the absence of error-free direct identifiers like social security numbers, straightforward merging of separate files is not feasible, so that records must be linked using error-prone variables such as names, birth dates, and demographic characteristics. Typical practice in such situations generally follows a two-stage procedure: first link the two files using a probabilistic linkage technique, then make causal inferences with the linked dataset. This does not propagate uncertainty due to imperfect linkages to the causal inference, nor does it leverage relationships among the study variables to improve the quality of the linkages. We propose a joint model for simultaneous Bayesian inference on probabilistic linkage and causal effects that addresses these deficiencies. Using simulation studies and theoretical arguments, we show that the joint model can improve the accuracy of estimated treatment effects, as well as the record linkages, compared to the two-stage modeling option. We illustrate the joint model using a constructed causal study of the effects of debit card possession on household spending.
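The two-stage procedure that the summary describes as typical practice can be illustrated with a minimal sketch. This is not the authors' joint model. It only shows the baseline they compare against, under loudly stated assumptions: the field names, the m/u agreement probabilities, the zero threshold, the toy records, and the greedy one-to-one assignment are all hypothetical choices made for illustration. Stage 1 scores record pairs with Fellegi-Sunter style log-likelihood-ratio weights and links them one-to-one. Stage 2 computes a naive difference in mean responses on the linked pairs, treating the links as if they were known without error.

```python
import math

# Hypothetical agreement probabilities per field:
# (P(agree | true match), P(agree | non-match)). Values are assumptions.
M_U = {"name": (0.8, 0.01), "birth_year": (0.95, 0.1)}


def match_weight(a, b):
    """Fellegi-Sunter style log-likelihood-ratio weight for one record pair."""
    w = 0.0
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            w += math.log(m / u)
        else:
            w += math.log((1 - m) / (1 - u))
    return w


def link(file_a, file_b, threshold=0.0):
    """Stage 1: greedy one-to-one (bipartite) linkage by descending weight."""
    pairs = sorted(
        ((match_weight(a, b), i, j)
         for i, a in enumerate(file_a)
         for j, b in enumerate(file_b)),
        reverse=True,
    )
    used_a, used_b, links = set(), set(), {}
    for w, i, j in pairs:
        if w <= threshold:
            break  # remaining pairs score too low to be declared links
        if i not in used_a and j not in used_b:
            links[i] = j
            used_a.add(i)
            used_b.add(j)
    return links


def difference_in_means(file_a, file_b, links):
    """Stage 2: treated-minus-control mean response on the linked pairs only."""
    treated = [file_b[j]["y"] for i, j in links.items() if file_a[i]["treat"] == 1]
    control = [file_b[j]["y"] for i, j in links.items() if file_a[i]["treat"] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# File A: linking variables plus treatment. File B: linking variables plus response.
file_a = [
    {"name": "rossi", "birth_year": 1960, "treat": 1},
    {"name": "bianchi", "birth_year": 1972, "treat": 0},
    {"name": "russo", "birth_year": 1985, "treat": 1},
    {"name": "conti", "birth_year": 1990, "treat": 0},  # no counterpart in file B
]
file_b = [
    {"name": "rossi", "birth_year": 1960, "y": 12.0},
    {"name": "bianchi", "birth_year": 1972, "y": 9.0},
    {"name": "ruso", "birth_year": 1985, "y": 14.0},     # name recorded with an error
    {"name": "ferrari", "birth_year": 1950, "y": 7.0},   # no counterpart in file A
]

links = link(file_a, file_b)
effect = difference_in_means(file_a, file_b, links)
print(sorted(links.items()), effect)
```

In this toy run the misspelled record is still linked because agreement on birth year outweighs disagreement on name. The second stage then uses that link as if it were certain, which is the loss of linkage uncertainty that the summary identifies as a deficiency of the two-stage approach.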

MSC:

62-XX Statistics
