
Marginal singularity and the benefits of labels in covariate-shift. (English) Zbl 1486.62186

Summary: Transfer Learning addresses common situations in Machine Learning where little or no labeled data is available for a target prediction problem, corresponding to a distribution \(Q\), but much labeled data is available from some related but different data distribution \(P\). This work is concerned with the fundamental limits of transfer, that is, the limits in target performance in terms of \((1)\) sample sizes from \(P\) and \(Q\), and \((2)\) differences between the data distributions \(P\) and \(Q\). In particular, we aim to address practical questions such as how much target data from \(Q\) is sufficient given a certain amount of related data from \(P\), and how to optimally sample such target data for labeling.
We present new minimax results for transfer in nonparametric classification (i.e., for situations where little is known about the target classifier), under the common assumption that the marginal distributions of covariates differ between \(P\) and \(Q\) (often termed covariate-shift). Our results are the first to concisely capture the relative benefits of source and target labeled data in these settings through information-theoretic limits. Namely, we show that the benefits of target labels are tightly controlled by a transfer-exponent \(\gamma\) that encodes how singular \(Q\) is locally with respect to \(P\), and which, interestingly, paints a more favorable picture of transfer than might be believed from insights in previous work. In fact, while previous work relies largely on refinements of traditional metrics and divergences between distributions, and often yields only a coarse view of when transfer is possible or not, our analysis in terms of \(\gamma\) reveals a continuum of new regimes ranging from easy to hard transfer.
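To give a concrete, if simplified, sense of what such a local singularity can look like, the sketch below compares empirical ball masses \(P(B(x,r))\) and \(Q(B(x,r))\) near a point in the support of \(Q\) on simulated one-dimensional data. The distributions, the probe point, and the exponent read off from the decay of the ratio are hypothetical choices for illustration only; they are not the paper's formal definition of \(\gamma\) or its estimation procedure.

```python
# Illustrative sketch (not the paper's procedure): empirically compare local ball
# masses under a source distribution P and a target distribution Q, to get a feel
# for how fast the ratio P(B(x, r)) / Q(B(x, r)) can shrink as r -> 0 near a point
# in the support of Q. All distributions and parameters below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-d example: Q is uniform on [0, 1]; P has density 3x^2 on [0, 1],
# so P puts very little mass near x = 0 and is "locally singular" w.r.t. Q there.
n = 200_000
q_sample = rng.uniform(0.0, 1.0, size=n)
p_sample = rng.uniform(0.0, 1.0, size=n) ** (1.0 / 3.0)  # inverse-CDF sampling of density 3x^2

x0 = 0.0  # probe point in the support of Q
radii = np.logspace(-3, -1, 10)

for r in radii:
    p_mass = np.mean(np.abs(p_sample - x0) <= r)  # empirical P(B(x0, r))
    q_mass = np.mean(np.abs(q_sample - x0) <= r)  # empirical Q(B(x0, r))
    # For this toy pair, P(B(0, r)) ~ r^3 while Q(B(0, r)) ~ r, so the ratio decays
    # roughly like r^2: informally, a local singularity exponent of about 2 at x0.
    print(f"r = {r:8.5f}   P(B)/Q(B) ~ {p_mass / q_mass:9.5f}")
```

Informally, a faster decay of this ratio means the source distribution carries less local information about the target region, which is the sense in which a larger exponent corresponds to harder transfer.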
We then address the practical question of how to efficiently sample target data to label, by showing that a recently proposed semi-supervised procedure, based on \(k\)-NN classification, can be refined to adapt to the unknown \(\gamma\) and therefore request target labels only when beneficial, while achieving nearly minimax-optimal transfer rates without knowledge of distributional parameters. Of independent interest, we obtain new minimax-optimality results for vanilla \(k\)-NN classification in regimes with nonuniform marginals.
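For contrast with the adaptive procedure just described, here is a minimal, self-contained sketch of vanilla \(k\)-NN classification run on pooled labeled source and target samples under a simulated covariate-shift. The data-generating choices and \(k = 5\) are hypothetical, and this pooled baseline does not implement the paper's label-requesting scheme.

```python
# Minimal baseline sketch: vanilla k-NN on pooled labeled source (P) and target (Q)
# samples under a simulated covariate-shift. Hypothetical setup for illustration only.
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority-vote k-NN with Euclidean distances (labels in {0, 1})."""
    preds = np.empty(len(X_test), dtype=int)
    for i, x in enumerate(X_test):
        dists = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(dists)[:k]                 # indices of the k nearest neighbours
        preds[i] = int(y_train[nn].mean() >= 0.5)  # majority vote
    return preds

rng = np.random.default_rng(1)

# Hypothetical covariate-shift: the same labeling rule 1{x_1 > 0.5} under P and Q,
# but source covariates concentrate away from the region where target covariates lie.
n_P, n_Q = 500, 50
X_P = rng.beta(2.0, 5.0, size=(n_P, 2))   # source covariates, skewed toward small x_1
X_Q = rng.beta(5.0, 2.0, size=(n_Q, 2))   # target covariates, skewed toward large x_1
y_P = (X_P[:, 0] > 0.5).astype(int)
y_Q = (X_Q[:, 0] > 0.5).astype(int)

# Pool labeled source and target data, then evaluate on fresh target draws.
X_pool = np.vstack([X_P, X_Q])
y_pool = np.concatenate([y_P, y_Q])
X_test = rng.beta(5.0, 2.0, size=(1000, 2))
y_test = (X_test[:, 0] > 0.5).astype(int)

err = np.mean(knn_predict(X_pool, y_pool, X_test, k=5) != y_test)
print(f"pooled k-NN target error ~ {err:.3f}")
```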

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G08 Nonparametric regression and quantile regression
68T05 Learning and adaptive systems in artificial intelligence
