On algorithmic and modeling approaches to imputation in large data sets. (English) Zbl 1464.62533
Summary: The machine learning and statistical modeling cultures provide contrasting approaches to statistical analysis. In an article in this journal, W.-Y. Loh et al. [Stat. Sin. 29, No. 1, 431–453 (2019; Zbl 1412.62080)] compare these approaches in the setting of imputation of large data sets, recommending machine-learning methods. All the compared methods make assumptions, and I note that these assumptions receive more critical assessment for the model-based approaches than for the tree-based machine-learning methods. I discuss in particular the assumptions about the missing-data mechanism implied by the differing approaches. I question the extent to which general conclusions can be drawn from the simulation study of Loh et al., given the relatively strong performance of the method that discards the incomplete cases, and the limited exploration of the relevant design space.
MSC:
62R07 | Statistical aspects of big data and data science
62H30 | Classification and discrimination; cluster analysis (statistical aspects)
62D10 | Missing data
68T05 | Learning and adaptive systems in artificial intelligence