Abstract
In this study, we consider a transfer-learning problem using the parameter transfer approach, in which a suitable parameter of feature mapping is learned through one task and applied to another objective task. We introduce the notion of local stability and parameter transfer learnability of parametric feature mapping, and derive an excess risk bound for parameter transfer algorithms. As an application of parameter transfer learning, we discuss the performance of sparse coding in self-taught learning. Although self-taught learning algorithms with a large volume of unlabeled data often show excellent empirical performance, their theoretical analysis has not yet been studied. In this paper, we also provide a theoretical excess risk bound for self-taught learning. In addition, we show that the results of numerical experiments agree with our theoretical analysis.
1 Introduction
In traditional machine learning, it is assumed that data are identically drawn from a single distribution. However, this assumption does not always hold in real-world applications, and it is therefore imperative to develop methods that can incorporate samples drawn from different distributions. Transfer learning provides a general framework for such situations. In transfer learning, in addition to the few samples available for the objective task, abundant samples from another domain, not necessarily drawn from the same distribution, can be used. The domain related to the objective task is called the target domain, and the other domain is called the source domain. Transfer learning aims to extract useful knowledge from the source domain and apply it to achieve high task performance in the target domain.
Transfer learning is categorized in Pan and Yang (2010) into three areas: inductive transfer learning, transductive transfer learning, and unsupervised transfer learning. Inductive transfer learning corresponds to the setting in which labeled samples are available in the target domain. In addition, when labeled samples in the source domain are unavailable, the setting is called self-taught learning (Raina et al. 2007). In particular, self-taught learning can be applied to the case in which tasks are different in the source and target domains. Transductive transfer learning corresponds to the setting in which labeled samples are available only in the source domain. Then, tasks in both domains are typically assumed to be the same as in a covariate shift (Shimodaira 2000; Sugiyama et al. 2008) and sample selection bias (Zadrozny 2004; Huang et al. 2007). Domain adaptation (Daume and Marcu 2006; Blitzer et al. 2006) can be regarded as transfer learning in which tasks are the same in both domains; this is closely related to transductive transfer learning. Unsupervised transfer learning corresponds to the setting where labeled samples are unavailable in both domains. In this setting, the purpose is not to achieve high predictive performance but to perform an unsupervised task well in the target domain.
In accordance with the type of knowledge that is transferred, approaches to transfer-learning problems can be classified into types such as instance transfer, feature representation transfer, and parameter transfer (Pan and Yang 2010). In recent years, the parameter transfer approach has attracted particular attention, for example in fine-tuning the weights of a deep neural network trained on a source domain. In the parameter transfer approach, parametric models are assumed in both domains and the transferred knowledge is encoded into parameters. Biased regularization has been studied as a typical method in this approach, where the regularization term added to an empirical loss has a non-zero center (e.g., \(\Vert {\mathbf{w}}-{\mathbf{w}}_{0}\Vert ^2\) instead of \(\Vert {\mathbf{w}}\Vert ^2\)) and the center is learnt on the source domain (Ben-David and Urner 2013; Pentina and Lampert 2014; Tommasi et al. 2014). Recently, a generalization of biased regularization was proposed and theoretically analyzed (Kuzborskij and Orabona 2013, 2017). Owing to its flexibility, the parameter transfer approach can also be applied to algorithms such as sparse coding (Raina et al. 2007; Maurer et al. 2013), multiple kernel learning (Duan et al. 2012), and deep learning (Yosinski et al. 2014).
As the parameter transfer approach typically requires many samples to accurately learn a suitable parameter in the source domain, unsupervised methods are often utilized for the learning process. In this sense, self-taught learning is compatible with the parameter transfer approach. The sparse coding-based method was used in Raina et al. (2007), in which self-taught learning was first introduced; there, the parameter transfer approach was used with a dictionary learnt from images serving as the transferred parameter. However, although self-taught learning has been studied in various contexts (Dai et al. 2008; Lee et al. 2009; Wang et al. 2013; Zhu et al. 2013) and many algorithms based on the parameter transfer approach have empirically demonstrated impressive performance in self-taught learning, some fundamental problems remain. First, the theoretical aspects of the parameter transfer approach have not been sufficiently studied; for example, a generalization error bound applicable to self-taught learning has been considered only in a few studies (Kuzborskij and Orabona 2013, 2017). Furthermore, existing studies treat only restricted hypothesis sets, limiting applicability in areas such as sparse coding, multiple kernel learning, and neural networks. Second, although it is believed that a large amount of unlabeled data helps improve the performance of the objective task in self-taught learning, the required sample size has not been sufficiently clarified. Third, although sparsity-based methods are typically employed in self-taught learning, it is unknown how sparsity works to guarantee the performance of self-taught learning.
In this study, we aim to shed light on the above problems. We focus on inductive transfer learning and consider a general model of parametric feature mapping in the parameter transfer approach. We formulate the local stability of parametric feature mapping and the parameter transfer learnability for this mapping, and provide an excess risk bound for parameter transfer learning algorithms based on these notions. Furthermore, we consider the stability of sparse coding. Finally, we discuss parameter transfer learnability by dictionary learning under the sparse model. By applying the excess risk bound for parameter transfer learning algorithms, we derive an excess risk bound for the sparse coding algorithm in self-taught learning. Moreover, we show that the results of numerical experiments on handwritten digit datasets are in good agreement with the theoretical analysis of transfer learning with sparse coding. Note that our setting differs from the environment-based setting (Baxter 2000; Maurer 2009), in which a distribution over a set of distributions on labeled samples, known as an environment, is assumed. In our formulation, the existence of an environment is not assumed, and labeled data in the source domain are not required.
The remainder of the paper is organized as follows. In Sect. 2, we formulate the stability and parameter transfer learnability of the parametric feature mapping. Then, we present an excess risk bound for parameter transfer learning. In Sect. 3, we show the stability of sparse coding under perturbation of the dictionaries. By imposing sparsity assumptions on samples and considering dictionary learning, we derive the parameter transfer learnability for sparse coding. In particular, an excess risk bound is obtained for sparse coding in the setting of self-taught learning. Section 4 is devoted to numerical experiments of transfer learning with sparse coding. We conclude the paper with Sect. 5.
2 Excess risk bound for parameter transfer learning
2.1 Problem setting of parameter transfer learning
We formulate parameter transfer learning in this section. We first briefly introduce notations and terminology in transfer learning (Pan and Yang 2010). Let \(\mathcal {X}\) and \(\mathcal {Y}\) represent a sample space and label space, respectively. In addition, let \(\mathcal {H}=\{h:\mathcal {X}\rightarrow \mathcal {Y}\}\) be a hypothesis space and \(\ell :\mathcal {Y}\times \mathcal {Y}\rightarrow {\mathbb {R}}_{\ge 0}\) represent a loss function. Then, the expected risk and the empirical risk are defined as \(\mathcal {R}(h) :=\mathbb {E}_{({\mathbf{x}},y)\sim P} \left[ \ell (y, h({\mathbf{x}}))\right] \) and \(\widehat{\mathcal {R}}_n(h) :=\frac{1}{n}\sum _{j=1}^{n} \ell (y_j, h({\mathbf{x}}_j) )\), respectively. In the transfer learning setting, it is assumed that, apart from samples from a domain of interest (i.e., target domain), samples from another domain (i.e., source domain) are also available. We distinguish between the target and source domains by adding a subscript \(\mathcal {T}\) or \(\mathcal {S}\) to each notation introduced above, (e.g., \(P_{\mathcal {T}}\), \(\mathcal {R}_{\mathcal {S}}\)). The homogeneous setting (i.e., \(\mathcal {X}_{\mathcal {S}}=\mathcal {X}_{\mathcal {T}}\)) is not assumed in general, and thus, the heterogeneous setting (i.e., \(\mathcal {X}_{\mathcal {S}}\ne \mathcal {X}_{\mathcal {T}}\)) is used here. We note that self-taught learning, which is discussed in Sect. 3, corresponds to the case in which the label space \(\mathcal {Y}_{\mathcal {S}}\) in the source task is the set of a single element.
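For concreteness, the empirical risk is simply the average loss over the labeled sample; a minimal sketch (the squared loss and the toy hypothesis are illustrative choices, not from the paper):

```python
# Minimal illustration (not from the paper): the empirical risk is the average
# loss over n labeled samples; squared loss and the hypothesis are toy choices.
def empirical_risk(loss, h, samples):
    return sum(loss(y, h(x)) for x, y in samples) / len(samples)

sq = lambda y, y_hat: (y - y_hat) ** 2
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # (x, y) pairs with y = 2x + 1
h = lambda x: 2.0 * x + 1.0                  # hypothesis fitting the data exactly
print(empirical_risk(sq, h, data))           # → 0.0
```

The expected risk replaces this average by an expectation over the sample distribution.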
We consider the parameter transfer approach in which the knowledge to be transferred is encoded in a parameter. The parameter transfer approach aims to learn a hypothesis with low expected risk for the target task by obtaining some knowledge about an effective parameter in the source domain and transferring it to the target domain. We suppose that there are parametric models on both the source and target domains and their parameter spaces are partly shared. Our strategy is to learn an effective parameter in the source domain and then transfer a part of the parameter to the target domain. Next, we describe the formulation. In the target domain, we assume that \(\mathcal {Y}_{\mathcal {T}}\subset {\mathbb {R}}\) and there is a parametric feature mapping \(\psi _{{{\varvec{\theta }}}}:\mathcal {X}_{\mathcal {T}}\rightarrow {\mathbb {R}}^m\) on the target domain such that each hypothesis \(h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}}:\mathcal {X}_{\mathcal {T}}\rightarrow \mathcal {Y}_{\mathcal {T}}\) is represented by
\(h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}}({\mathbf{x}}) = \langle {\mathbf{w}}, \psi _{{{\varvec{\theta }}}}({\mathbf{x}})\rangle \)
with parameters \({{\varvec{\theta }}}\in \varTheta \) and \({\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}\), where \(\varTheta \) is a subset of a normed space with norm \(\Vert \cdot \Vert \) and \(\mathcal {W}_{\mathcal {T}}\) is a subset of \({\mathbb {R}}^m\). Then, the hypothesis set in the target domain is parameterized as
\(\mathcal {H}_{\mathcal {T}} = \left\{ h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}} \,\big |\, {{\varvec{\theta }}}\in \varTheta ,\ {\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}\right\} .\)
In the following discussion, we simply denote \(\mathcal {R}_{\mathcal {T}}(h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}})\) and \(\widehat{\mathcal {R}}_{\mathcal {T},n}(h_{\mathcal {T},{{\varvec{\theta }}},{\mathbf{w}}})\) by \(\mathcal {R}_{\mathcal {T}}({{\varvec{\theta }}}, {\mathbf{w}})\) and \(\widehat{\mathcal {R}}_{\mathcal {T},n}({{\varvec{\theta }}}, {\mathbf{w}})\), respectively. In the source domain, we suppose that there exists some parametric model, such as a sample distribution \(P_{\mathcal {S},{{\varvec{\theta }}},{\mathbf{w}}}\) or a hypothesis \(h_{\mathcal {S},{{\varvec{\theta }}},{\mathbf{w}}}\), with parameters \({{\varvec{\theta }}}\in \varTheta \) and \({\mathbf{w}}\in \mathcal {W}_{\mathcal {S}}\), and that the part \(\varTheta \) of the parameter space is shared with the target domain. Then, let \({{\varvec{\theta }}}_{\mathcal {S}}^{*} \in \varTheta \) and \({\mathbf{w}}_{\mathcal {S}}^{*}\in \mathcal {W}_{\mathcal {S}}\) be parameters that are supposed to be effective in the source domain (e.g., the true parameter of the sample distribution, or the parameter of the optimal hypothesis with respect to the expected risk \(\mathcal {R}_{\mathcal {S}}\)). The parameters \({{\varvec{\theta }}}^*_S\) and \({\mathbf{w}}^*_S\) may be chosen arbitrarily (i.e., no mathematical restrictions are imposed on them), and we do not use any specific property of \({{\varvec{\theta }}}^*_S\) and \({\mathbf{w}}^*_S\). Then, the parameter transfer algorithm treated in this paper is described as follows. Let N- and n-samples be available in the source and target domains, respectively. First, a parameter transfer algorithm outputs the estimator \(\widehat{{{\varvec{\theta }}}}_N\in \varTheta \) of \({{\varvec{\theta }}}_{\mathcal {S}}^{*}\) by using the N-samples. Next, for the parameter
in the target domain, the algorithm outputs its estimator
\(\widehat{{\mathbf{w}}}_{N,n} := \mathop {\mathrm{argmin}}_{{\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}}\ \widehat{\mathcal {R}}_{\mathcal {T},n}\bigl (\widehat{{{\varvec{\theta }}}}_N, {\mathbf{w}}\bigr ) + \rho \, r({\mathbf{w}})\)
by using n-samples, where \(r({\mathbf{w}})\) is a 1-strongly convex function with respect to \(\Vert \cdot \Vert _2\) and \(\rho >0\). If the source domain relates to the target domain in some sense, the effective parameter \({{\varvec{\theta }}}^{*}_{\mathcal {S}}\) in the source domain is also expected to be useful for the target task. In the next section, we regard \(\mathcal {R}_{\mathcal {T}}\left( {{\varvec{\theta }}}^{*}_{\mathcal {S}}, {\mathbf{w}}^{*}_{\mathcal {T}}\right) \) as the baseline of predictive performance and derive an excess risk bound. The validity of the baseline is discussed in Sect. 2.2.
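The two-stage procedure above can be sketched in code. Everything here is a hypothetical stand-in: the feature mapping `psi`, the variance-based source step in `learn_theta_source`, and the ridge-style target step in `learn_w_target` are illustrative choices, not the paper's algorithms.

```python
import numpy as np

# Hypothetical sketch of the two-stage parameter transfer algorithm: estimate
# the shared parameter theta on the source domain, then fit w on the target
# domain by regularized empirical risk minimization.

def psi(theta, X):
    # parametric feature mapping psi_theta : R -> R^m (here m = len(theta))
    return np.tanh(np.outer(X, theta))

def learn_theta_source(X_src, candidates):
    # Stage 1 (source domain, unsupervised stand-in): choose the candidate
    # parameter whose features have the largest total variance.
    scores = [psi(th, X_src).var(axis=0).sum() for th in candidates]
    return candidates[int(np.argmax(scores))]

def learn_w_target(theta_hat, X_tgt, y_tgt, rho):
    # Stage 2 (target domain): minimize (1/2n) sum_j (y_j - <w, psi(x_j)>)^2
    # + rho * ||w||^2 / 2, i.e. r(w) = ||w||^2 / 2; closed-form ridge solution.
    Phi = psi(theta_hat, X_tgt)
    n, m = Phi.shape
    return np.linalg.solve(Phi.T @ Phi / n + rho * np.eye(m), Phi.T @ y_tgt / n)

rng = np.random.default_rng(0)
X_src = rng.standard_normal(200)      # plentiful unlabeled source samples
X_tgt = rng.standard_normal(30)       # small labeled target sample
y_tgt = np.sin(X_tgt)                 # toy target task
candidates = [np.linspace(0.5, 1.5, 5), np.linspace(-2.0, 2.0, 5)]
theta_hat = learn_theta_source(X_src, candidates)
w_hat = learn_w_target(theta_hat, X_tgt, y_tgt, rho=0.1)
```

With \(r({\mathbf{w}})=\Vert {\mathbf{w}}\Vert ^2/2\) and the squared loss, the target-domain step has the closed form used above; for general convex losses one would run a numerical solver instead.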
2.2 Excess risk bound based on stability and learnability
We introduce two new notions, local stability and parameter transfer learnability, as described below. These notions are essential for deriving the excess risk bound in Theorem 1.
Definition 1
(Local Stability) A parametric feature mapping \(\psi _{{{\varvec{\theta }}}}\) is said to be locally stable if there exist \(\epsilon _{{{\varvec{\theta }}}}:\mathcal {X}\rightarrow {\mathbb {R}}_{>0}\) for each \({{\varvec{\theta }}}\in \varTheta \) and \(L_{\psi }>0\) such that, for \({{\varvec{\theta }}}'\in \varTheta \),
\(\Vert {{\varvec{\theta }}}- {{\varvec{\theta }}}'\Vert \le \epsilon _{{{\varvec{\theta }}}}({\mathbf{x}}) ~\Longrightarrow ~ \Vert \psi _{{{\varvec{\theta }}}}({\mathbf{x}}) - \psi _{{{\varvec{\theta }}}'}({\mathbf{x}})\Vert _2 \le L_{\psi }\Vert {{\varvec{\theta }}}- {{\varvec{\theta }}}'\Vert .\)
Local stability implies that the feature is not significantly affected by a small parameter shift. We call \(\epsilon _{{{\varvec{\theta }}}}({\mathbf{x}})\) the permissible radius of perturbation for \({{\varvec{\theta }}}\) at \({\mathbf{x}}\). For samples \({\mathbf{X}}^n=\{{\mathbf{x}}_1,\ldots ,{\mathbf{x}}_n\}\), we define \(\epsilon _{{{\varvec{\theta }}}}({\mathbf{X}}^n):=\min _{j\in [n]}\epsilon _{{{\varvec{\theta }}}}({\mathbf{x}}_j)\), where \([n]:=\{1,\ldots ,n\}\) for a positive integer n.
Next, we formulate the parameter transfer learnability based on the local stability.
Definition 2
(Parameter Transfer Learnability) Suppose that N-samples are available in the source domain and a sample \({\mathbf{x}}\) is available in the target domain. Let the parametric feature mapping \(\{\psi _{{{\varvec{\theta }}}}\}_{{{\varvec{\theta }}}\in \varTheta }\) be locally stable. For \(\bar{\delta }_N\in [0,1)\), \(\{\psi _{{{\varvec{\theta }}}}\}_{{{\varvec{\theta }}}\in \varTheta }\) is said to be parameter transfer learnable with probability \(1-\bar{\delta }_N\) if there exists an algorithm that depends only on N-samples in the source domain such that the output \(\widehat{{{\varvec{\theta }}}}_N\) of the algorithm satisfies
\(\mathrm {Pr}\left[ \Vert \widehat{{{\varvec{\theta }}}}_N - {{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert \le \epsilon _{{{\varvec{\theta }}}^{*}_{\mathcal {S}}}({\mathbf{x}})\right] \ge 1-\bar{\delta }_N.\)
The parameter \(\bar{\delta }_N\) is written as \(\bar{\delta }\) for short as long as no conflict arises.
The parameter transfer learnability describes whether the effective parameter in the source domain can be appropriately transferred to the target domain with high probability. For n-samples \({\mathbf{X}}^n=\{{\mathbf{x}}_1,\ldots ,{\mathbf{x}}_n\}\) in the target domain, the union bound ensures that the inequality \(\Vert \widehat{{{\varvec{\theta }}}}_N - {{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert \le \epsilon _{{{\varvec{\theta }}}^{*}_{\mathcal {S}}}({\mathbf{X}}^n)\) holds with probability greater than or equal to \(1-n\bar{\delta }_N\).
Given training samples \(\{({\mathbf{x}}_j,y_j):j=1,\ldots ,n\}\) in the target domain, let us consider the learning method
\(\min _{{\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}} \frac{1}{n}\sum _{j=1}^{n} \ell \left( y_j, \langle {\mathbf{w}}, \psi _{\widehat{{{\varvec{\theta }}}}_N}({\mathbf{x}}_j)\rangle \right) + \rho \, r({\mathbf{w}}),\)
where \(\widehat{\varvec{\theta }}_N\) is the parameter estimated in the source domain using N training samples. The optimal solution in \(\mathcal {W}_{\mathcal {T}}\) is denoted by \(\widehat{{\mathbf{w}}}_{N,n}\). Then, the following excess risk bound is obtained.
Theorem 1
(Excess Risk Bound) We assume the following conditions.
1. The parametric feature mapping \(\psi _\theta ({\mathbf{x}})\) is bounded and locally stable with the parameter \(L_\psi \). Suppose that \(\sup _{{\varvec{\theta }\in \varTheta },{\mathbf{x}}\in \mathcal {X}}\Vert \psi _{\varvec{\theta }}({\mathbf{x}})\Vert _2\le R_\psi \) holds for some positive constant \(R_\psi \).
2. The estimator \(\widehat{\varvec{\theta }}_N\) on the source domain satisfies the parameter transfer learnability with probability \(1-\bar{\delta }\).
3. The non-negative loss \(\ell (\cdot ,\cdot )\) on the target domain is \(L_\ell \)-Lipschitz and convex in the second argument. Moreover, we assume that \(\sup _y\ell (y,0)\) is bounded above by a positive constant \(L_0\).
4. The non-negative regularization term \(r({\mathbf{w}})\) is 1-strongly convex and \(r({\varvec{0}})=0\) holds.
Then, the excess risk is bounded above by
with probability \(1-\delta -(n+n')\bar{\delta }\), where \(n'\) is an arbitrary natural number and c is a positive constant expressed as a polynomial in \(L_\psi , R_\psi , L_\ell , L_0, r({\mathbf{w}}_{\mathcal {T}}^*)\), and \(\log (1/\delta )\).
Proof
In the proof, we define \(c_i\,(i=1,2,3,4,5)\) as positive constants depending on \(L_\psi , R_\psi , L_\ell , L_0\), and \(\log (1/\delta )\).
Using the boundedness of the non-negative loss \(\ell (\cdot ,\cdot )\) and the strong convexity of \(r({\mathbf{w}})\) with some other conditions, we have
Thus, \(\Vert \widehat{{\mathbf{w}}}_{N,n}\Vert \) is bounded above by \(\sqrt{2L_0/\rho }\). Let \(\widehat{{\mathbf{w}}}_n^*\) be the optimal solution of
\(\min _{{\mathbf{w}}\in \mathcal {W}_{\mathcal {T}}} \widehat{\mathcal {R}}_{\mathcal {T},n}\bigl ({{\varvec{\theta }}}^{*}_{\mathcal {S}}, {\mathbf{w}}\bigr ) + \rho \, r({\mathbf{w}}).\)
Likewise, we see that the norm of \(\widehat{{\mathbf{w}}}_n^*\) has the same upper bound.
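The omitted display can be filled in from the stated assumptions (a reconstruction of the standard argument): since the loss is non-negative with \(\sup _y\ell (y,0)\le L_0\), comparing the regularized objective at \(\widehat{{\mathbf{w}}}_{N,n}\) with its value at \({\mathbf{w}}={\varvec{0}}\) gives

```latex
\rho\, r(\widehat{\mathbf{w}}_{N,n})
  \le \widehat{\mathcal{R}}_{\mathcal{T},n}\bigl(\widehat{\boldsymbol{\theta}}_N, \widehat{\mathbf{w}}_{N,n}\bigr)
      + \rho\, r(\widehat{\mathbf{w}}_{N,n})
  \le \widehat{\mathcal{R}}_{\mathcal{T},n}\bigl(\widehat{\boldsymbol{\theta}}_N, \mathbf{0}\bigr)
      + \rho\, r(\mathbf{0})
  \le \sup_{y}\ell(y,0)
  \le L_0 .
```

Since \(r\) is non-negative with \(r({\varvec{0}})=0\), the origin minimizes \(r\), and 1-strong convexity then gives \(r({\mathbf{w}})\ge \Vert {\mathbf{w}}\Vert ^2/2\); combined with the display, this yields \(\Vert \widehat{{\mathbf{w}}}_{N,n}\Vert \le \sqrt{2L_0/\rho }\).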
The excess risk is decomposed into the following three terms.
Let us consider the upper bound of each term.
For the first term of the excess risk, the following inequality holds:
Here, for an arbitrary natural number \(n'\), we introduce independent random variables \(\bar{{\mathbf{x}}}_1,...,\bar{{\mathbf{x}}}_{n'}\) (called ghost samples) such that the probability distribution of each \(\bar{{\mathbf{x}}}_j\) is the marginal distribution of \(P_{\mathcal {T}}\). Then, we have the following bound with probability greater than \(1-\delta /2\) by Hoeffding’s inequality:
Moreover, since it holds that \(\Vert \psi _{\widehat{\varvec{\theta }}_N}(\bar{{\mathbf{x}}}_i) - \psi _{\varvec{\theta }_{\mathcal {S}}^*}(\bar{{\mathbf{x}}}_i)\Vert \le L_{\psi }\Vert \widehat{\varvec{\theta }}_N - \varvec{\theta }_{\mathcal {S}}^*\Vert \) with probability greater than \(1-\bar{\delta }\) by local stability and parameter transfer learnability, we have the following bound with probability greater than \(1-n'\bar{\delta }\) by the union bound:
From (5)–(7), with probability greater than \(1-\delta /2-n'\bar{\delta }\), we obtain
Next, we provide an upper bound of the second term of the decomposed excess risk. To do so, we first provide an upper bound of \(\Vert \widehat{{\mathbf{w}}}_n^* - \widehat{{\mathbf{w}}}_{N,n}\Vert \) in the following. The 1-strong convexity of r leads to \(\rho \)-strong convexity of the empirical loss with the regularization term in the parameter \({\mathbf{w}}\). Hence, we have
and
Summing up the above two inequalities, we have
The last inequality holds with probability greater than or equal to \(1-n\bar{\delta }\) owing to the parameter transfer learnability and local stability. Thus, \(\Vert \widehat{{\mathbf{w}}}_n^* - \widehat{{\mathbf{w}}}_{N,n}\Vert \le 2^{3/4} (L_\ell L_\psi L_0^{1/2})^{1/2} \Vert \widehat{\varvec{\theta }}_N - \varvec{\theta }_{\mathcal {S}}^*\Vert ^{1/2}/ \rho ^{3/4}\) holds. Hence, the second term of the decomposed excess risk is bounded above by
with probability \(1-n\bar{\delta }\).
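This step can be made explicit as follows (a sketch consistent with the constants above). Write \(F_{{{\varvec{\theta }}}}({\mathbf{w}}):=\widehat{\mathcal {R}}_{\mathcal {T},n}({{\varvec{\theta }}},{\mathbf{w}})+\rho \, r({\mathbf{w}})\), which is \(\rho \)-strongly convex in \({\mathbf{w}}\). Applying strong convexity at the minimizers \(\widehat{{\mathbf{w}}}_{N,n}\) of \(F_{\widehat{{{\varvec{\theta }}}}_N}\) and \(\widehat{{\mathbf{w}}}^*_n\) of \(F_{{{\varvec{\theta }}}^{*}_{\mathcal {S}}}\) and summing the two resulting inequalities gives

```latex
\rho \,\Vert \widehat{\mathbf{w}}^{*}_{n}-\widehat{\mathbf{w}}_{N,n}\Vert^{2}
  \le \bigl(F_{\widehat{\boldsymbol{\theta}}_{N}}(\widehat{\mathbf{w}}^{*}_{n})
            - F_{\boldsymbol{\theta}^{*}_{\mathcal{S}}}(\widehat{\mathbf{w}}^{*}_{n})\bigr)
     +\bigl(F_{\boldsymbol{\theta}^{*}_{\mathcal{S}}}(\widehat{\mathbf{w}}_{N,n})
            - F_{\widehat{\boldsymbol{\theta}}_{N}}(\widehat{\mathbf{w}}_{N,n})\bigr)
  \le 2 L_{\ell} L_{\psi} \sqrt{2L_{0}/\rho }\;
      \Vert \widehat{\boldsymbol{\theta}}_{N}-\boldsymbol{\theta}^{*}_{\mathcal{S}}\Vert ,
```

where the last inequality uses the \(L_\ell \)-Lipschitz loss, the norm bound \(\Vert {\mathbf{w}}\Vert \le \sqrt{2L_0/\rho }\), and local stability on the n samples. Solving for \(\Vert \widehat{{\mathbf{w}}}^*_n-\widehat{{\mathbf{w}}}_{N,n}\Vert \) recovers the bound \(2^{3/4}(L_\ell L_\psi L_0^{1/2})^{1/2}\Vert \widehat{{{\varvec{\theta }}}}_N-{{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert ^{1/2}/\rho ^{3/4}\).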
For the third term of the excess risk, we obtain the following upper bound with probability \(1-\delta /2\) by Theorem 1 of Sridharan et al. (2009):
Combining the above results, we obtain
with probability of at least \(1-\delta -(n+n')\bar{\delta }\). \(\square \)
We mention the relation between our formulation and a fast rate result of the excess risk in “Appendix A”. The optimal \(\rho \) is obtained by minimizing the upper bound of the excess risk.
Corollary 1
Suppose that the conditions 1, 3, and 4 in Theorem 1 hold. In addition, we assume that there exist a real number \(\beta \ge 1\) and a sequence \(\tau _N\) such that
When \(n\tau _N^\beta \) is sufficiently small, the asymptotic upper bound of the excess risk is given as \(\mathcal {R}_{\mathrm {excess}}\le c\max \{n^{-1/2},\, \tau _N^{2/7}\}\)
by setting \(\rho =\varTheta (\max \{n^{-1/2},\,\tau _N^{2/7}\})\).
Proof
The assumptions (8) and Markov’s inequality lead to \({\mathrm{Pr}}\left[ \Vert \widehat{{{\varvec{\theta }}}}_N - {{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert /\tau _N \ge a \right] \le a^{-\beta }\) and
where C is a positive constant. Here, the independence of the source and target samples is used. The second inequality shows that parameter transfer learnability holds by setting \(\bar{\delta }_N=C\tau _N^\beta \). From the first inequality, we have \(\Vert \widehat{{{\varvec{\theta }}}}_N-{{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert /\sqrt{\rho }=O_p(\tau _N/\sqrt{\rho })\) and \(\Vert \widehat{{{\varvec{\theta }}}}_N-{{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert ^{1/2}/\rho ^{3/4}=O_p(\tau _N^{1/2}/\rho ^{3/4})\), where \(O_p\) denotes the probabilistic order. Let \(\delta \) be a small positive constant, and define \(n'\) by \(n'=\delta /\bar{\delta }_N=\delta /(C\tau _N^\beta )\). We have
Suppose that \(\rho \rightarrow 0\) and \(n\rho \rightarrow \infty \) hold as \(n\rightarrow \infty \) and \(\tau _N\rightarrow 0\) while keeping \(n \bar{\delta }_N=Cn\tau _N^\beta \) sufficiently small. For large n and small \(\tau _N\), we have \(\tau _N^{\min \{1,\beta /2\}}/\sqrt{\rho }\le \tau _N^{1/2}/\rho ^{3/4}\). Hence, we obtain
with probability greater than \(1-2\delta -n\bar{\delta }_N\). Substituting
which satisfies the above condition, we have \(\mathcal {R}_{\mathrm {excess}}\le c\max \{n^{-1/2},\, \tau _N^{2/7}\}\) with high probability. \(\square \)
The upper bound of the excess risk is expressed in terms of the bias term \(\tau _N\) induced by the source domain and the sample complexity bound on the target domain. If \(\tau _N\) is large, additional training samples on the target domain will not help attain high prediction accuracy. Conversely, when the bias term \(\tau _N\) is sufficiently small, the excess risk is bounded above by \(\mathcal {O}(n^{-1/2})\), which is the standard asymptotic order of supervised learning using n i.i.d. samples.
Remark 1
Suppose that the bias \(\tau _N\) on the source domain is of the order \(N^{-1/2}\), which is the standard order in parameter estimation. When \(n\tau _N^\beta \) is sufficiently small for some \(\beta \ge 1\), we have \(n=\mathcal {O}(N^{\beta /2})\). If \(c' N^{2/7}\le n\le c'' N^{\beta /2}\) holds for some constants \(c',c''\), the excess risk is of the order \(\mathcal {O}(N^{-1/7})\). For \(n=\mathcal {O}(N^{2/7})\), we have \(\mathcal {R}_{\mathrm {excess}}=\mathcal {O}(n^{-1/2})\). Given an acceptable level of the excess risk, the above result provides a rough estimate of the required sample sizes in the source and target domains.
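The orders stated in Remark 1 follow from Corollary 1 by direct substitution; with \(\tau _N=N^{-1/2}\),

```latex
\tau_{N}^{2/7} = N^{-1/7},
\qquad
\mathcal{R}_{\mathrm{excess}} \le c\,\max\{\, n^{-1/2},\; N^{-1/7} \,\}.
```

If \(n\ge c'N^{2/7}\), then \(n^{-1/2}\le (c')^{-1/2}N^{-1/7}\), so the bound is \(\mathcal {O}(N^{-1/7})\); if \(n=\mathcal {O}(N^{2/7})\), then \(N^{-1/7}=\mathcal {O}(n^{-1/2})\) and the bound is \(\mathcal {O}(n^{-1/2})\). The upper restriction \(n\le c''N^{\beta /2}\) keeps \(n\tau _N^{\beta }=nN^{-\beta /2}\) bounded, as required in Corollary 1.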
One can regard \(\mathcal {R}_{\mathcal {T}}^{*}=\min _{{{\varvec{\theta }}},{\mathbf{w}}}\mathcal {R}_{\mathcal {T}}\left( {{\varvec{\theta }}}, {\mathbf{w}}\right) \) as the baseline instead of \(\mathcal {R}_{\mathcal {T}}\left( {{\varvec{\theta }}}^{*}_{\mathcal {S}}, {\mathbf{w}}^{*}_{\mathcal {T}}\right) \). In this case, the risk bound is decomposed into
\(\mathcal {R}_{\mathcal {T}}(\widehat{{{\varvec{\theta }}}}_N, \widehat{{\mathbf{w}}}_{N,n}) - \mathcal {R}_{\mathcal {T}}^{*} = \underbrace{\mathcal {R}_{\mathcal {T}}(\widehat{{{\varvec{\theta }}}}_N, \widehat{{\mathbf{w}}}_{N,n}) - \mathcal {R}_{\mathcal {T}}({{\varvec{\theta }}}^{*}_{\mathcal {S}}, {\mathbf{w}}^{*}_{\mathcal {T}})}_{=:\,\mathcal {R}_{\mathrm {excess}}} + \underbrace{\mathcal {R}_{\mathcal {T}}({{\varvec{\theta }}}^{*}_{\mathcal {S}}, {\mathbf{w}}^{*}_{\mathcal {T}}) - \mathcal {R}_{\mathcal {T}}^{*}}_{=:\,\mathcal {R}_{\mathrm {gap}}}.\)
The first term, \(\mathcal {R}_{\mathrm {excess}}\), is the excess risk relative to transfer learning with the optimal parameter on the source domain, and its upper bound is presented in Theorem 1 and Corollary 1. The second term, \(\mathcal {R}_{\mathrm {gap}}\), is interpreted as the difference between the source and target domains.
In an ideal situation, transfer learning is regarded as a method to reduce the bias of the model; this is explained next. Suppose that \(\mathcal {R}_{\mathrm {gap}}\) is close to zero and N is sufficiently large. Then, self-taught learning with the optimal parameter \({{\varvec{\theta }}}_{\mathcal {S}}^{*}\) is approximately realized. However, in the common learning setup using samples from only the target domain, the optimal feature representation \(\psi _{{{\varvec{\theta }}}_{\mathcal {S}}^{*}}\) will not be available. This is thought to be the main reason why transfer learning is advantageous over standard learning methods.
By contrast, if \(\mathcal {R}_{\mathrm {gap}}\) is much larger than \(\mathcal {R}_{\mathrm {excess}}\), negative transfer can easily occur, i.e., transfer learning actually degrades the prediction performance. This is because a parameter \({{\varvec{\theta }}}\) superior to \({{\varvec{\theta }}}_{\mathcal {S}}^{*}\) will then easily be found.
Example 1
As an example of \(\mathcal {R}_{\text {gap}}\), let us consider regression analysis using the basis function \(\psi _{{{\varvec{\theta }}}}\). We assume that the labels in the source and target domains are given as \(y={\mathbf{w}}_{\mathcal {S}}^{\top }\psi _{{{\varvec{\theta }}}_\mathcal {S}}({\mathbf{x}})+\xi \) and \(y={\mathbf{w}}_{\mathcal {T}}^{\top }\psi _{{{\varvec{\theta }}}_\mathcal {T}}({\mathbf{x}})+\epsilon \), respectively, where \(\xi \) and \(\epsilon \) are noise random variables with mean 0. In addition, let the loss function be \(\ell (y,y'):=|y-y'|\) and the effective parameters in the source domain be \({{\varvec{\theta }}}^*_\mathcal {S}={{\varvec{\theta }}}_\mathcal {S}\), \({\mathbf{w}}^*_\mathcal {S}={\mathbf{w}}_\mathcal {S}\). Then, it holds that
Thus, this upper bound of \(\mathcal {R}_{\text {gap}}\) shows that, if the parameter \({{\varvec{\theta }}}_{\mathcal {S}}\) of the optimal feature map in the source domain is distant from the parameter \({{\varvec{\theta }}}_{\mathcal {T}}\) in the target domain, the first term can be large, and accordingly, the second term can also be large since \({\mathbf{w}}^*_{\mathcal {T}}\) depends on \({{\varvec{\theta }}}_{\mathcal {S}}\).
A simple way to avoid the negative transfer is to assess the \(\mathcal {R}_{\mathrm {gap}}\). A naive statistic,
is available to estimate \(\mathcal {R}_{\mathrm {gap}}\). When \(\widehat{\mathcal {R}}_{\mathrm {gap}}\) is significantly larger than the order of \(\mathcal {O}(n^{-1/2})\), more elaborate learning on the source domain or fine-tuning (Goodfellow et al. 2016, Sec. 8.7.4) of the parameter \({{\varvec{\theta }}}\) using samples from the target domain will be needed. Domain adaptation is another promising method to avoid a large \(\mathcal {R}_{\mathrm {gap}}\) when samples in the source and target domains are simultaneously available. We do not go into the details of this case here. In this paper, we assume that \(\mathcal {R}_{\mathrm {gap}}\) is sufficiently small and focus on the excess risk \(\mathcal {R}_{\mathrm {excess}}\) via local stability and parameter transfer learnability.
3 Stability and learnability in sparse coding
In this section, we consider sparse coding in self-taught learning, where the source domain essentially consists of the sample space \(\mathcal {X}_{\mathcal {S}}\) without the label space \(\mathcal {Y}_{\mathcal {S}}\). We assume that the sample spaces in both domains are \({\mathbb {R}}^d\). Then, the sparse coding method considered here consists of a two-stage procedure, where a dictionary is learnt on the source domain, and then sparse coding with the learnt dictionary is used for a predictive task in the target domain.
First, we show that sparse coding satisfies local stability in Sect. 3.1, and then explain how appropriate dictionary learning algorithms satisfy parameter transfer learnability in Sect. 3.3. As a consequence of Theorem 1, we obtain the excess risk bound of self-taught learning algorithms based on sparse coding. We note that the results in this section are of independent interest beyond transfer learning.
Next, we summarize the notations used in this section. Let \(\Vert \cdot \Vert _p\) be the p-norm on \({\mathbb {R}}^d\). We define \({\mathrm{supp}}({\mathbf{a}}):=\{i\in [m]|a_i\ne 0\}\) for \({\mathbf{a}}\in {\mathbb {R}}^m\). We denote the number of elements of a set S by |S|. When a vector \({\mathbf{a}}\) satisfies \(\Vert {\mathbf{a}}\Vert _0=|{\mathrm{supp}}({\mathbf{a}})|\le k\), \({\mathbf{a}}\) is said to be k-sparse. We set \(\mathcal {D}:= \{ {\mathbf{D}}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m]\in {\mathbb {R}}^{d\times m}|~\Vert {\mathbf{d}}_j\Vert _2=1~(j=1,\ldots ,m)\}\) and each \({\mathbf{D}}\in \mathcal {D}\) represents a dictionary of size m.
Definition 3
(Induced matrix norm) For an arbitrary matrix \({\mathbf{E}}=[{\mathbf{e}}_1,\ldots ,{\mathbf{e}}_m]\in {\mathbb {R}}^{d\times m}\), the induced matrix norm is defined by \(\Vert {\mathbf{E}}\Vert _{1,2} := \max _{i\in [m]} \Vert {\mathbf{e}}_i\Vert _2\).
We adopt \(\Vert \cdot \Vert _{1,2}\) to measure the difference in dictionaries since it is typically used in the framework of dictionary learning. We note that \(\Vert {\mathbf{D}}- \tilde{{\mathbf{D}}}\Vert _{1,2}\le 2\) holds for arbitrary dictionaries \({\mathbf{D}}, \tilde{{\mathbf{D}}}\in \mathcal {D}\).
3.1 Local stability of sparse representation
In this section, we show the local stability of sparse representation under a sparse model. A sparse representation with dictionary parameter \({\mathbf{D}}\) of a sample \({\mathbf{x}}\in {\mathbb {R}}^d\) is expressed as follows:
\({\varphi }_{{\mathbf{D}}}({\mathbf{x}}) := \mathop {\mathrm{argmin}}_{{\mathbf{z}}\in {\mathbb {R}}^m} \frac{1}{2}\Vert {\mathbf{x}}- {\mathbf{D}}{\mathbf{z}}\Vert _2^2 + \lambda \Vert {\mathbf{z}}\Vert _1,\)
where \(\lambda >0\) is a regularization parameter that induces sparsity. This situation corresponds to the case where \({{\varvec{\theta }}}={\mathbf{D}}\) and \(\psi _{{{\varvec{\theta }}}} = {\varphi }_{{\mathbf{D}}}\) in the setting of Sect. 2.1.
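For illustration, the lasso problem defining the sparse representation (with the standard objective \(\frac{1}{2}\Vert {\mathbf{x}}-{\mathbf{D}}{\mathbf{z}}\Vert _2^2+\lambda \Vert {\mathbf{z}}\Vert _1\), as assumed here) can be solved by iterative soft-thresholding (ISTA); this is a standard algorithm, not the paper's implementation:

```python
import numpy as np

# Illustrative ISTA (iterative soft-thresholding) solver for the lasso problem
# phi_D(x) = argmin_z 0.5*||x - D z||_2^2 + lam*||z||_1.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(D, x, lam, n_iter=500):
    z = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D.T @ D, 2)  # 1/L, L = largest eigenvalue of D^T D
    for _ in range(n_iter):
        z = soft_threshold(z - step * (D.T @ (D @ z - x)), step * lam)
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)     # unit-norm atoms, as required of dictionaries
z_true = np.zeros(50)
z_true[[3, 17]] = [1.0, -0.5]      # a 2-sparse code
x = D @ z_true                     # noiseless sample generated from the dictionary
z_hat = sparse_code(D, x, lam=0.01)
```

With the step size \(1/L\), each ISTA iteration does not increase the lasso objective, so the recovered code is at least as good as the zero vector.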
We define some notions used in the discussion on stability of sparse representation. The following k-margin was introduced by Mehta and Gray (2013).
Definition 4
(k-margin) Given a dictionary \({\mathbf{D}}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m]\in \mathcal {D}\) and a point \({\mathbf{x}}\in {\mathbb {R}}^d\), the k-margin of \({\mathbf{D}}\) on \({\mathbf{x}}\) is
The following \(\mu \)-incoherence is not equal to the k-incoherence defined in Mehta and Gray (2013), although these are related to each other as stated in Remark 2.
Definition 5
(\(\mu \)-incoherence) A dictionary matrix \({\mathbf{D}}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m] \in \mathcal {D}\) is said to be \(\mu \)-incoherent if \(|\langle {\mathbf{d}}_i, {\mathbf{d}}_j \rangle |\le \mu /\sqrt{d}\) for all \(i\ne j\).
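Checking \(\mu \)-incoherence of a given dictionary is direct; the helper name `incoherence_mu` below is ours, not the paper's. It returns the smallest \(\mu \) such that \(|\langle {\mathbf{d}}_i,{\mathbf{d}}_j\rangle |\le \mu /\sqrt{d}\) for all \(i\ne j\):

```python
import numpy as np

# The smallest mu with |<d_i, d_j>| <= mu / sqrt(d) for all i != j is
# mu = sqrt(d) * max_{i != j} |<d_i, d_j>| (atoms are assumed unit-norm).

def incoherence_mu(D):
    d = D.shape[0]
    G = D.T @ D                  # Gram matrix of the atoms
    np.fill_diagonal(G, 0.0)     # ignore the unit diagonal
    return np.sqrt(d) * np.max(np.abs(G))

# Mutually orthogonal atoms are maximally incoherent: mu = 0.
print(incoherence_mu(np.eye(4)[:, :3]))  # → 0.0
```

Theorem 2 below requires \(\mu <\sqrt{d}/k\); for a concrete dictionary this condition can be checked directly from the returned value.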
Then, the following theorem is obtained.
Theorem 2
(Local Stability of Sparse Coding) Let \({\mathbf{D}}\in \mathcal {D}\) be \(\mu \)-incoherent for \(\mu <\sqrt{d}/k\) and \(\Vert {\mathbf{D}}-\tilde{{\mathbf{D}}}\Vert _{1,2}\le \lambda \). When
the following stability bound holds.
From Theorem 2, \(\epsilon _{k,{\mathbf{D}}}({\mathbf{x}})\) becomes the permissible radius of perturbation in Definition 1.
Remark 2
We mention the relation between the \(\mu \)-incoherence defined above and k-incoherence of a dictionary, which is the assumption of the sparse coding stability in Mehta and Gray (2013). For \(k\in [m]\) and \({\mathbf{D}}\in \mathcal {D}\), the k-incoherence \(s_k({\mathbf{D}})\) is defined as
\(s_k({\mathbf{D}}) := \min _{\varLambda \subset [m],\,|\varLambda |=k} \varsigma _k({\mathbf{D}}_{\varLambda }),\)
where \(\varsigma _k({\mathbf{D}}_{\varLambda })\) is the kth singular value of \({\mathbf{D}}_{\varLambda }=[{\mathbf{d}}_{i_1},\ldots ,{\mathbf{d}}_{i_k}]\) for \(\varLambda = \{i_1,\ldots ,i_k\}\). From Lemma 9 in “Appendix B”, when a dictionary \({\mathbf{D}}\) is \(\mu \)-incoherent, the k-incoherence of \({\mathbf{D}}\) satisfies
\(s_k({\mathbf{D}}) \ge \sqrt{1-\mu k/\sqrt{d}}.\)
Thus, a \(\mu \)-incoherent dictionary has positive k-incoherence when \(d> (\mu k)^2\). On the other hand, when \(k\ge 2\), if a dictionary \({\mathbf{D}}\) has positive k-incoherence \(s_k({\mathbf{D}})\), there is \(0<\mu <\sqrt{d}\) such that the dictionary is \(\mu \)-incoherent.Footnote 4 However, we note that positive k-incoherence \(s_k({\mathbf{D}})\) does not imply that \({\mathbf{D}}\) is \(\mu \)-incoherent and \(\mu <\sqrt{d}/k\) in general.Footnote 5
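For small dictionaries, \(s_k({\mathbf{D}})\) can be computed by brute force directly from its definition, which makes the relation to \(\mu \)-incoherence easy to check numerically. An illustrative sketch (the enumeration is exponential in m, so it is suitable for small m only):

```python
import numpy as np
from itertools import combinations

def k_incoherence(D, k):
    """s_k(D): minimum, over all k-column submatrices D_Lambda of D,
    of the k-th (smallest) singular value of D_Lambda.
    Brute-force enumeration; intended for small dictionaries only.
    """
    m = D.shape[1]
    return min(
        np.linalg.svd(D[:, list(L)], compute_uv=False)[-1]
        for L in combinations(range(m), k)
    )
```

For \(k=2\) this recovers the closed form \(s_2({\mathbf{D}})^2=1-\max _{i\ne j}|\langle {\mathbf{d}}_i,{\mathbf{d}}_j\rangle |\) used in Footnote 4.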
Here, we compare Theorem 2 with the sparse coding stability (Theorem 4) of Mehta and Gray (2013), in which the difference between dictionaries is measured by \(\Vert \cdot \Vert _{2,2}\) instead of \(\Vert \cdot \Vert _{1,2}\) and the permissible radius of perturbation is \({\mathcal {M}}_{k}({\mathbf{D}},{\mathbf{x}})^2\lambda \) up to a constant factor. Applying the simple inequality \(\Vert {\mathbf{E}}\Vert _{2,2} \le \sqrt{m}\Vert {\mathbf{E}}\Vert _{1,2}\) for \({\mathbf{E}}\in {\mathbb {R}}^{d\times m}\), one can obtain a variant of the sparse coding stability with the norm \(\Vert \cdot \Vert _{1,2}\); however, the dictionary size m then enters both the permissible radius of perturbation and the stability bound. In contrast, no factor of m appears in Theorem 2, so the result remains effective even for large m. In addition, whereas \(\Vert {\mathbf{x}}\Vert \le 1\) is assumed in Mehta and Gray (2013), Theorem 2 does not require this and makes the dependence on the norm \(\Vert {\mathbf{x}}\Vert \) explicit. In particular, the Lipschitz constant \(L_\psi \) can be taken independent of \({\mathbf{x}}\) on a bounded sample space.
In existing studies related to sparse coding, the sparse representation \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) is modified as \({\varphi }_{{\mathbf{D}}}({\mathbf{x}}) \otimes {\mathbf{x}}\) (Mairal et al. 2009) or \({\varphi }_{{\mathbf{D}}}({\mathbf{x}}) \otimes ({\mathbf{x}}- {\mathbf{D}}{\varphi }_{{\mathbf{D}}}({\mathbf{x}}))\) (Raina et al. 2007), where \(\otimes \) is the tensor product. Owing to the stability of sparse representation (Theorem 2), it can be shown that such modified representations also have local stability.
3.2 Sparse modeling and margin bound
In this section, we assume a sparse structure for samples \({\mathbf{x}}\in {\mathbb {R}}^d\) and specify a lower bound for the k-margin used in (11). The result obtained in this section plays an essential role in demonstrating the parameter transfer learnability in Sect. 3.3.
Assumption 1
(Model) There exists a dictionary matrix \({\mathbf{D}}^{*}\) such that every sample \({\mathbf{x}}\) is independently generated by a representation \({\mathbf{a}}\) and noise \({{\varvec{\xi }}}\) as
\({\mathbf{x}} = {\mathbf{D}}^{*}{\mathbf{a}} + {{\varvec{\xi }}}.\)
Moreover, we impose the following three assumptions on the above model.
Assumption 2
(Dictionary) The dictionary matrix \({\mathbf{D}}^{*}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m] \in \mathcal {D}\) is \(\mu \)-incoherent.
Assumption 3
(Representation) The representation \({\mathbf{a}}\) is a random variable that is k-sparse (i.e., \(\Vert {\mathbf{a}}\Vert _0\le k\)) and the non-zero entries are lower bounded by \(C>0\) (i.e., \(a_i\ne 0\) satisfies \(|a_i|\ge C\)).
Assumption 4
(Noise) The noise \({{\varvec{\xi }}}\) is independent across coordinates and Gaussian with zero mean and a maximum variance \(\sigma ^2k/d\) on each component, where \(\sigma >0\) is a constant.
Remark 3
Note that Assumption 4 is valid under Assumptions 1–3 and the condition \(\mu \le \sqrt{d}/k\) if we assume a situation in which dictionary learning is possible. To learn the true dictionary \({\mathbf{D}}^*\) and the true signal \({\mathbf{a}}\) from a sample \({\mathbf{x}}\), the noise \({{\varvec{\xi }}}\) must be much smaller than the signal \({\mathbf{D}}^*{\mathbf{a}}\) with high probability. This condition is represented by \(\Vert {{\varvec{\xi }}}\Vert \le \Vert {\mathbf{D}}^*{\mathbf{a}}\Vert \). Here, it holds that
where \(|{\mathbf{a}}|\) is the vector whose components are absolute values of those of \({\mathbf{a}}\) and \(a_{max} := \max _{1\le i\le m} |a_{i}|\). Then, each component \(\xi _i\) of \({{\varvec{\xi }}}\) approximately satisfies, with high probability,
Thus, since each component is Gaussian, its variance should be \(\tilde{\mathcal {O}}(k/d)\).
In transfer learning, samples in the source and target domains cannot be assumed to be identically distributed. Assumptions 3 and 4 accommodate this essential point: they allow the samples to be independent but not identically distributed.
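The generative model of Assumptions 1–4 is straightforward to simulate. The sketch below draws one sample \({\mathbf{x}} = {\mathbf{D}}^{*}{\mathbf{a}} + {{\varvec{\xi }}}\); the specific distributions of the support and of the nonzero magnitudes are illustrative choices of ours, since the assumptions only constrain the sparsity k, the lower bound C, and the noise variance:

```python
import numpy as np

def sample_sparse_model(D_star, k, C=0.5, sigma=1.0, rng=None):
    """Draw one sample x = D* a + xi under Assumptions 1-4.

    a is k-sparse with nonzero magnitudes >= C (Assumption 3);
    xi is componentwise Gaussian with variance sigma^2 * k / d (Assumption 4).
    The uniform support and magnitude distributions are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, m = D_star.shape
    a = np.zeros(m)
    support = rng.choice(m, size=k, replace=False)
    signs = rng.choice([-1.0, 1.0], size=k)
    a[support] = signs * (C + rng.random(k))     # |a_i| >= C on the support
    xi = rng.normal(scale=sigma * np.sqrt(k / d), size=d)
    return D_star @ a + xi, a
```

Note that the per-component noise scale \(\sigma \sqrt{k/d}\) shrinks as d grows, which is why the failure probability of the margin bound below becomes negligible in high dimensions.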
Theorem 3
(Margin Bound) Let \(0<t<1\). We set
We assume that \(d\ge \left\{ \left( 1+\frac{6}{(1-t)}\right) \mu k\right\} ^2\) and \(\lambda = d^{-\tau }\) for arbitrary \(1/4\le \tau \le 1/2\). Under Assumptions 1–4, the following inequality holds with a probability of at least \(1-\delta _{t,\lambda }\).
We provide the proof of Theorem 3 in “Appendix C”.
Note that the failure probability of the margin bound in (14) decreases as the dimension increases, since the variance of the noise becomes smaller under Assumption 4.
Next, we analyze the regularization parameter \(\lambda \). To appropriately reflect the sparsity of samples, \(\lambda \) must be set suitably. According to Theorem 4 of Zhao and Yu (2006),Footnote 6 when samples follow the sparse model in Assumptions 1–4 and \(\lambda \cong d^{-\tau }\) for \(1/4\le \tau \le 1/2\), the representation \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) reconstructs the true sparse representation \({\mathbf{a}}\) of a sample \({\mathbf{x}}\) with a small error. In particular, when \(\tau =1/4\) (i.e., \(\lambda \cong d^{-1/4}\)) in Theorem 3, the failure probability \(\delta _{t,\lambda }\cong e^{-\sqrt{d}}\) of the margin bound is sub-exponentially small in the dimension d and is negligible in the high-dimensional case. On the other hand, the typical choice \(\tau =1/2\) (i.e., \(\lambda \cong d^{-1/2}\)) does not provide a useful result because \(\delta _{t,\lambda }\) is not small at all.
3.3 Parameter transfer learnability for dictionary learning
When a true dictionary \({\mathbf{D}}^*\) exists as in Assumption 1, we show that the output \(\widehat{{\mathbf{D}}}_N\) of a suitable dictionary learning algorithm from N-unlabeled samples satisfies the parameter transfer learnability for the sparse coding \({\varphi }_{{\mathbf{D}}}\). Then, Theorem 1 guarantees the excess risk bound in self-taught learning since the discussion in this section does not assume the label space in the source domain. This situation corresponds to the case where \({{\varvec{\theta }}}_{\mathcal {S}}^*={\mathbf{D}}^*\), \(\widehat{{{\varvec{\theta }}}}_N=\widehat{{\mathbf{D}}}_N\) and \(\Vert \cdot \Vert =\Vert \cdot \Vert _{1,2}\) in Sect. 2.1.
We show that an appropriate dictionary learning algorithm satisfies parameter transfer learnability for the sparse coding \({\varphi }_{{\mathbf{D}}}\) by focusing on the permissible radius of perturbation in (11) under some assumptions. When Assumptions 1–4 hold and \(\lambda = d^{-\tau }\) for \(1/4\le \tau \le 1/2\), the margin bound (15) for \({\mathbf{x}}\in \mathcal {X}\) holds with probability \(1-\delta _{t,\lambda }\), and we have
Thus, if a dictionary learning algorithm outputs the estimator \(\widehat{{\mathbf{D}}}_{N}\) such that
with probability \(1-\delta _N\), the estimator \(\widehat{{\mathbf{D}}}_{N}\) of \({\mathbf{D}}^{*}\) satisfies the parameter transfer learnability for the sparse coding \({\varphi }_{{\mathbf{D}}}\) with probability \(\bar{\delta }_N:=\delta _N+\delta _{t,\lambda }\). Then, by the local stability of sparse representation and the parameter transfer learnability of such a dictionary learning algorithm, Theorem 1 guarantees that sparse coding in self-taught learning satisfies the excess risk bound in (4). For n samples \({\mathbf{X}}^n=\{{\mathbf{x}}_1,\ldots ,{\mathbf{x}}_n\}\) in the target domain, a detailed analysis reveals that the inequality \(\Vert \widehat{{{\varvec{\theta }}}}_N - {{\varvec{\theta }}}^{*}_{\mathcal {S}}\Vert \le \epsilon _{{{\varvec{\theta }}}^{*}_{\mathcal {S}}}({\mathbf{X}}^n)\) holds with probability \(1-(\delta _N+n\delta _{t,\lambda })\), which is sharper than \(1-n\bar{\delta }_N\).
Theorem 1 applies to any dictionary learning algorithm as long as (16) is satisfied. For example, from Theorem 12 in Arora et al. (2015), when some conditionsFootnote 7 including Assumptions 1–4 are assumed, there is an iterative algorithm [Algorithm 5 in Arora et al. (2015)] whose output \({{\mathbf{D}}}^s\) at iteration s satisfies
for some \(1/2<\gamma <1\). When \(s\ge C\log d\) for a large constant C and dimension d is large enough, it holds that
We note that the algorithm requires an infinite number of samples at each iteration. However, by modifying Appendix G of Arora et al. (2015), it is expected that there exist a large constant \(C'\) and an alternative stochastic algorithm whose output \(\widehat{{\mathbf{D}}}^s\cong {\mathbf{D}}^s\) at each iteration \(s\ge C'\log d\) satisfies (16) for \(1/4<\tau <1/3\).
Note that, although we imposed the hard-sparsity assumption (Assumption 3) as in Arora et al. (2015), we focused on the LASSO-based encoder \({\varphi }_{{\mathbf{D}}}\) instead of the hard-sparsity encoder treated there. Under the hard-sparsity assumption, we could derive the lower bound on the permissible radius of perturbation \(\epsilon _{k,{\mathbf{D}}}\) for the LASSO-based encoder and use the result on the estimation error in dictionary learning from Arora et al. (2015).
4 Numerical experiments
We report numerical experiments using the U.S. Postal Service (USPS) and MNIST handwritten digit datasets and compare the results with our theoretical conclusions. In particular, we investigate the relationship among \(N, n, \Vert \widehat{{{\varvec{\theta }}}}_{N}-{{\varvec{\theta }}}_{\mathcal {S}}^{*}\Vert , \rho \), and the prediction performance of transfer learning.
The USPS dataset is composed of 7291 training images and 2007 test images of dimension \(d=256\), and each element of the data vectors ranges from \(-1\) to 1. The MNIST dataset has 60,000 training images and 10,000 test images of dimension \(d=784\), and each element of the data vectors ranges from 0 to 255. In the numerical experiments, the MNIST data were scaled so that each element takes a value in the interval [0, 1]. In both datasets, each image has \(\ell _\infty \)-norm 1 and a label in \(\{0,1,\ldots ,9\}\).
4.1 Prediction accuracy of learning with sparse representation
Let us describe the setup of the numerical experiments in which self-taught learning was applied to the USPS dataset. N images out of the USPS training images were randomly chosen as training samples on the source domain; N was set to 3000. The data matrix of the source domain is denoted by \({\mathbf{X}}_{\mathcal {S}}=(\widetilde{{\mathbf{x}}}_1,\ldots ,\widetilde{{\mathbf{x}}}_{N})\in \mathbb {R}^{d\times {N}}\).
The dictionary \(\widehat{{\mathbf{D}}}\in \mathbb {R}^{d\times m}\) was obtained by solving the problem
where \(\Vert {\mathbf{A}}\Vert _p = (\sum _{i,j}|a_{ij}|^p)^{1/p}\) for a matrix \({\mathbf{A}}=(a_{ij})\). The feature map was defined by the sparse representation (10) of \({\mathbf{x}}\in \mathbb {R}^{d}\). The regularization parameter for the sparse representation was set to \(\lambda =1\). The dictionary size m was varied from 16 to 512. The problem on the target domain was 10-class classification of the digit images. In the experiments, n images were randomly chosen out of the remaining 4291 USPS training images, where n was varied from 500 to 3000. The prediction accuracy of the classifier was evaluated using all test images. The sparse representations of the training images, \(\{({\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}}_i),\,y_i):i=1,\ldots ,n\}\), were used to train a linear SVM (Huang and Aviyente 2006; Yang et al. 2009). The Lasso-type sparse representation (10) was employed, since it is quite popular in dictionary learning (Zhang et al. 2015). In addition, a computationally efficient implementation of dictionary learning, SPAMS, is available as an R package (Mairal et al. 2009). The classifier was provided by the kernlab package of the R language (Karatzoglou et al. 2004; R Core Team 2016), and a one-vs-one strategy was used for the multi-class classification problem.
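The pipeline above (dictionary learning on source samples, sparse coding of target samples, then a linear SVM) was implemented with the SPAMS and kernlab R packages. An analogous sketch in Python using scikit-learn might look as follows; the function name and all parameter settings here are our illustrative assumptions, not the original implementation:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder
from sklearn.svm import LinearSVC

def self_taught_pipeline(X_source, X_target, y_target, m=64, lam=1.0, rho=1.0):
    """Sketch of the experimental pipeline in Sect. 4.1:
    (i) learn a dictionary on unlabeled source samples,
    (ii) sparse-code the labeled target samples with it,
    (iii) train a linear SVM on the sparse codes.
    Samples are rows here; scikit-learn stores the dictionary as (m, d).
    """
    dl = DictionaryLearning(n_components=m, alpha=lam, max_iter=20,
                            random_state=0).fit(X_source)
    coder = SparseCoder(dictionary=dl.components_,
                        transform_algorithm="lasso_cd",
                        transform_alpha=lam)
    Z = coder.transform(X_target)              # sparse features phi_D(x_i)
    clf = LinearSVC(C=rho).fit(Z, y_target)    # rho plays the role of the SVM parameter
    return coder, clf
```

scikit-learn's `LinearSVC` handles multi-class problems with a one-vs-rest scheme by default, whereas the paper's kernlab setup used one-vs-one; the difference is immaterial for the sketch.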
In addition, we evaluated the influence of the dictionary shift by analyzing how the estimation error on the source domain affected learning results on the target domain. A shifted dictionary of \(\widehat{{\mathbf{D}}}\) was obtained by adding a random matrix \({\mathbf{M}}\) to \(\widehat{{\mathbf{D}}}\). In experiments, each element of \({\mathbf{M}}\) was assumed to be an i.i.d. copy of Gaussian noise with mean 0 and standard deviation \(\sigma \). Each column of the perturbed matrix \(\widehat{{\mathbf{D}}}+{\mathbf{M}}\) was normalized to obtain the shifted dictionary \(\widetilde{{\mathbf{D}}}\in \mathcal {D}\). The feature map \({\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\) with the shifted dictionary was used to obtain the sparse representation. Numerical experiments were conducted to reveal the relation among the test error, regularization parameter \(\rho \), and noise level \(\sigma \).
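The dictionary-shift procedure described above can be sketched in a few lines (the helper name is ours):

```python
import numpy as np

def shift_dictionary(D_hat, sigma, rng=None):
    """Perturb a dictionary as in Sect. 4.1: add i.i.d. Gaussian noise with
    mean 0 and standard deviation sigma to each entry, then renormalize the
    columns so the shifted dictionary stays in the feasible set."""
    rng = np.random.default_rng() if rng is None else rng
    M = rng.normal(scale=sigma, size=D_hat.shape)  # the random matrix M
    D_tilde = D_hat + M
    return D_tilde / np.linalg.norm(D_tilde, axis=0)
```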
For the MNIST dataset, N was set to 30,000 and n was varied from 500 to 10,000. The dictionary size m was varied from 16 to 1568. All test images were used to evaluate the test error on the target domain. Furthermore, the effect of the dictionary shift was evaluated.
The results are shown in Figs. 1 and 2. The test error on the target domain is plotted in the left column of each figure. The test error of the linear SVM using only samples on the target domain is also reported. The regularization parameter \(\rho \) of the linear SVM was set to the optimal value that achieved the smallest test error. In the right column, the optimal \(\rho \) of learning with the sparse representation is shown as a function of the sample size n.
When large dictionaries were used, we found that the test error of the SVM with sparse representation was smaller than that of the standard SVM trained only on target samples without the sparse representation. Hence, samples on the source domain were effectively used to improve the prediction accuracy.
Furthermore, we investigated the usefulness of samples on the source domain by comparing with a variant of supervised dictionary learning (SDL) proposed by Mairal et al. (2009). In standard SDL, the dictionary and the classifier are simultaneously optimized based only on samples from the target domain. In the experiments, we employed a simple variant of SDL to reduce the computational cost: the dictionary \({\mathbf{D}}\) and the features \({\mathbf{Z}}\) were obtained from the data matrix of the target domain instead of the source domain, and the sparse feature representation \({\mathbf{Z}}\) was then fed into the linear SVM as training input together with the output labels.
Table 1 shows the results. The column “source and target” shows the test error of transfer learning using information from both the source and target domains. The column “target (SDL)” shows the results of the simplified SDL using only samples on the target domain. For the MNIST dataset, the SDL with \(m=1568\) was omitted because the computational cost of training \({\mathbf{D}}\) at each iteration was too high. Overall, learning using both the source and target domains performed better than the simplified SDL, especially for small n. Therefore, transfer learning using \({\mathbf{D}}\) trained on the source domain is expected to be practically useful.
4.2 Setting of regularization parameters
Next, we investigate the relationship between the dictionary shift on the source domain and the regularization parameter \(\rho \) on the target domain. For \(m=16\) on the USPS data in Fig. 1, a large optimal regularization parameter \(\rho \) was required to deal with a large noise level \(\sigma \). The theoretical analysis in Sect. 2.2 showed that a large regularization parameter is needed to suppress a large perturbation of the feature map; hence, the numerical results for small m agree with our theoretical results. On the contrary, when \(m=512\), a small optimal regularization parameter sufficed to suppress a large noise level \(\sigma \). The same tendency was observed on the MNIST dataset in Fig. 2. In summary, when the dictionary is small, a larger regularization parameter is required to deal with a larger noise level; for a large dictionary, the opposite is true.
Next, we explain the relation between the noise level \(\sigma \) and regularization parameter \(\rho \). Theorem 2 shows the relation between the sensitivity of the sparse representation and the dictionary shift. The upper bound of the sensitivity depends mainly on the degree of incoherence and amount of dictionary shift. Let \(\mu _{{\mathbf{D}}}\) be the degree of incoherence for dictionary \({\mathbf{D}}\); see Definition 5. For \({\mathbf{D}}=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_m ]\), \(\mu _{{\mathbf{D}}}\) is computed as \(\sqrt{d}\max \{|\langle {\mathbf{d}}_i,{\mathbf{d}}_j\rangle |\,:\,i\ne {j}\}\). Then, the difference \(\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) is bounded above by \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}/(1-\mu _{\widetilde{{\mathbf{D}}}}/\sqrt{d})\) up to a positive factor depending on \({\mathbf{x}}\) when the 1-margin is used. We numerically confirmed that the upper bound using \(\mu _{\widetilde{{\mathbf{D}}}}\) is tighter than the upper bound using \(\mu _{\widehat{{\mathbf{D}}}}\). This is because \(\widehat{{\mathbf{D}}}\) includes some similar column vectors and the random noise relaxes such similarity. As shown in the proof of Theorem 1, the shift of the feature map \(\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) directly affects the upper bound of the generalization ability. Figure 3 depicts the average value of \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}/(1-\mu _{\widetilde{{\mathbf{D}}}}/\sqrt{d})\) over 20 different random matrices as a function of \(\sigma \). Generally, larger \(\sigma \) leads to larger \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}\) and smaller \(\mu _{\widetilde{{\mathbf{D}}}}\). 
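The quantities in this upper bound are directly computable: \(\Vert \cdot \Vert _{1,2}\) is the maximum Euclidean norm of a column (Footnote 3), and \(\mu _{\widetilde{{\mathbf{D}}}}\) follows Definition 5. A small numpy sketch (the function names are ours):

```python
import numpy as np

def norm_1_2(E):
    """(1,2)-induced norm of E: the maximum Euclidean norm of a column."""
    return np.linalg.norm(E, axis=0).max()

def perturbation_bound(D_hat, D_tilde):
    """Evaluate ||D_hat - D_tilde||_{1,2} / (1 - mu_{D_tilde}/sqrt(d)),
    the upper bound (up to a factor depending on x) on the feature-map
    shift discussed in Sect. 4.2.  Requires mu_{D_tilde} < sqrt(d).
    """
    d = D_tilde.shape[0]
    G = np.abs(D_tilde.T @ D_tilde)
    np.fill_diagonal(G, 0.0)
    mu = np.sqrt(d) * G.max()       # incoherence of the shifted dictionary
    return norm_1_2(D_hat - D_tilde) / (1.0 - mu / np.sqrt(d))
```

Averaging this quantity over independent draws of the noise matrix reproduces the curves in Fig. 3.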
When the size of the dictionary, m, is small, the effect of \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}\) dominates that of \(\mu _{\widetilde{{\mathbf{D}}}}\). On the contrary, for a large dictionary, the decrease of \(\mu _{\widetilde{{\mathbf{D}}}}\) dominates the increase of \(\Vert \widehat{{\mathbf{D}}}-\widetilde{{\mathbf{D}}}\Vert _{1,2}\). As a result, the upper bound of \(\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) becomes small for large noise level.
Based on the above consideration, the numerical results in Figs. 1 and 2 can be interpreted as follows. Let \({\mathbf{D}}^*\) be the true dictionary. The difference \(\Vert {\varphi }_{{\mathbf{D}}^*}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) will affect the test error and the optimal regularization parameter. Let us consider the upper bound of this difference, \(\Vert {\varphi }_{{\mathbf{D}}^*}({\mathbf{x}})-{\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})\Vert _2+\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\). In the numerical experiments, the first term \(\Vert {\varphi }_{{\mathbf{D}}^*}({\mathbf{x}})-{\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) is fixed and the second term \(\Vert {\varphi }_{\widehat{{\mathbf{D}}}}({\mathbf{x}})-{\varphi }_{\widetilde{{\mathbf{D}}}}({\mathbf{x}})\Vert _2\) affects the learning results. As shown above, the effect of the dictionary shift presented in Figs. 1 and 2 agrees with our theoretical findings.
5 Conclusion
We derived an excess risk bound (Theorem 1) for a parameter transfer learning problem based on local stability and parameter transfer learnability, which were newly introduced in this paper. By applying the proposed model to a sparse coding-based algorithm under a sparse model (Assumptions 1–4), we obtained the first theoretical guarantee of an excess risk bound in self-taught learning. Numerical experiments in Sect. 4 showed that transfer learning with an appropriate regularization parameter worked efficiently to achieve high prediction accuracy on the target domain. Moreover, we confirmed that the theoretical analysis of local stability for sparse coding is useful for understanding the relationship among the dictionary size, the regularization parameter, and the prediction accuracy.
The framework of parameter transfer learning includes not only sparse coding but also other promising algorithms such as multiple kernel learning and deep neural networks. Our results are expected to be effective in analyzing the theoretical performance of these algorithms. Finally, we note that our excess risk bound can be applied to applications other than self-taught learning because Theorem 1 covers the case in which labeled samples are available in the source domain.
Notes
A short version of this article was published as a conference paper (Kumagai 2016). In this article, we present new theoretical results that extend our previous work, and include more detailed analysis of excess risk bound. Furthermore, results are provided from numerical experiments conducted to confirm the validity of the theoretical analysis in practical situations.
For example, let us consider the case of fine-tuning in deep learning, which is a typical transfer learning method. There, probabilistic models constructed by neural networks are pre-trained in the source domain and fine-tuned in the target domain. We note that deep neural networks with Lipschitz activation functions (e.g., ReLU, sigmoid, and softmax) satisfy local stability and parameter transfer learnability for \(\epsilon _{\theta }({\mathbf{x}})\equiv \infty \). If the models satisfy regularity conditions around the optimal parameter \({{\varvec{\theta }}}^*\) in the source domain, the bias \(\tau _N\) can attain the order \(\mathcal {O}(N^{-1/2})\). However, models based on neural networks are known to have many singular points. The bias around a singular point is thought to have a different order, and its evaluation is future work related to parameter transfer learning.
In general, the (p, q)-induced norm for \(p,q\ge 1\) is defined by \(\Vert {\mathbf{E}}\Vert _{p,q} := \sup _{ {\mathbf{v}}\in {\mathbb {R}}^m, \Vert {\mathbf{v}}\Vert _p=1} \Vert {\mathbf{E}}{\mathbf{v}}\Vert _q\). Then, \(\Vert \cdot \Vert _{1,2}\) in this general definition coincides with that in Definition 3 by Lemma 17 of Vainsencher et al. (2011).
Since \(s_k({\mathbf{D}})^2=\min _{{\mathbf{b}}:k\text {-sparse}}{\mathbf{b}}^{\top }{\mathbf{D}}^{\top }{\mathbf{D}}{\mathbf{b}}\), we have \(s_2({\mathbf{D}})\ge s_k({\mathbf{D}})\) if \(k\ge 2\). Moreover, \(s_2({\mathbf{D}})^2=1-\max _{i\ne j}|\langle {\mathbf{d}}_i, {\mathbf{d}}_j \rangle |\) from the direct calculation. Thus, if \(s_k({\mathbf{D}})\) is positive, \(\max _{i\ne j}|\langle {\mathbf{d}}_i, {\mathbf{d}}_j \rangle |< 1\). In other words, it holds that \(|\langle {\mathbf{d}}_i, {\mathbf{d}}_j \rangle |\le \mu /\sqrt{d}\) for all \(i\ne j\) when \(\mu :=\sqrt{d}\max _{i\ne j}|\langle {\mathbf{d}}_i, {\mathbf{d}}_j \rangle | <\sqrt{d}\).
For example, when \(d=k=2\) and \({\mathbf{D}}=\frac{1}{\sqrt{2}}\left[ \begin{array}{cc} 1&{}4/3\\ 1&{}\sqrt{2}/3 \end{array} \right] \), it holds that \(s_k({\mathbf{D}})>0\). However, there is no \(\mu \) such that \({\mathbf{D}}\) is \(\mu \)-incoherent and \(\mu <\sqrt{d}/k\).
Theorem 4 of Zhao and Yu (2006) is stated for Gaussian noise; however, it can easily be generalized to sub-Gaussian noise. Our setting corresponds to the case in which \(c_1=1/2, c_2=1, c_3=(\log \kappa +\log \log d)/\log d\) for some \(\kappa >1\) (i.e., \(e^{d^{c_3}}\cong d^{\kappa }\)) and \(c_4=c\) in Theorem 4 of Zhao and Yu (2006). Note that our regularization parameter \(\lambda \) corresponds to \(\lambda _d/d\) in Zhao and Yu (2006).
See page 4 of Arora et al. (2015).
When the number of samples N is large enough, this condition typically holds for an appropriate estimator \(\widehat{{{\varvec{\theta }}}}_N\).
The notations in Zhao and Yu (2006) and this paper relate as follows: \(z_i^n:=z_i, \zeta _i^n:=\zeta _i, \beta _i^n:=a_i, b_i^n\le \frac{\sqrt{k}}{1-\mu k/\sqrt{d}}, \eta _i^n:=1-\mu (2k-1)/\sqrt{d}, \epsilon ^n:={{\varvec{\xi }}}, p:=m, q:=k, \lambda _n:=\lambda , M_1:=1/d, M_2:=1-\mu k/\sqrt{d}\), respectively.
References
Arora, S., Ge, R., Ma, T., & Moitra, A. (2015). Simple, efficient, and neural algorithms for sparse coding. In Conference on learning theory (pp. 113–149).
Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149–198.
Ben-David, S., & Urner, R. (2013). Domain adaptation as learning with auxiliary information. In New directions in transfer and multi-task-workshop@ NIPS.
Blitzer, J., McDonald, R., & Pereira, F. (2006). Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing (pp. 120–128). Association for Computational Linguistics.
Dai, W., Yang, Q., Xue, G. R., & Yu, Y. (2008). Self-taught clustering. In Proceedings of the 25th international conference on Machine learning (pp. 200–207). ACM.
Daumé III, H., & Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26, 101–126.
Duan, L., Tsang, I. W., & Xu, D. (2012). Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 465–479.
Fuchs, J. J. (2004). On sparse representations in arbitrary redundant bases. IEEE Transactions on Information Theory, 50(6), 1341–1344.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge: MIT Press.
Huang, J., Gretton, A., Borgwardt, K. M., Schölkopf, B., & Smola, A. J. (2007). Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems (pp. 601–608).
Huang, K., & Aviyente, S. (2006). Sparse representation for signal classification. In NIPS (Vol. 19, pp. 609–616).
Karatzoglou, A., Smola, A., Hornik, K., & Zeileis, A. (2004). kernlab: An S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1–20.
Kumagai, W. (2016). Learning bound for parameter transfer learning. In Advances in neural information processing systems (pp. 2721–2729).
Kuzborskij, I., & Orabona, F. (2013). Stability and hypothesis transfer learning. In Proceedings of the 30th international conference on machine learning (ICML-13) (pp. 942–950).
Kuzborskij, I., & Orabona, F. (2017). Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2), 171–195.
Lee, H., Raina, R., Teichman, A., & Ng, A. Y. (2009). Exponential family sparse coding with application to self-taught learning. In IJCAI (Vol. 9, pp. 1113–1119). Citeseer.
Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2009). Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning (pp. 689–696). ACM.
Mairal, J., Ponce, J., Sapiro, G., Zisserman, A., & Bach, F. R. (2009). Supervised dictionary learning. In Advances in neural information processing systems (pp. 1033–1040).
Maurer, A. (2009). Transfer bounds for linear feature learning. Machine Learning, 75(3), 327–350.
Maurer, A., Pontil, M., & Romera-Paredes, B. (2013). Sparse coding for multitask and transfer learning. In Proceedings of the 30th international conference on machine learning (ICML-13) (pp. 343–351).
Mehta, N., & Gray, A. G. (2013). Sparsity-based generalization bounds for predictive sparse coding. In Proceedings of the 30th international conference on machine learning (ICML-13) (pp. 36–44).
Mehta, N. A., & Gray, A. G. (2012). On the sample complexity of predictive sparse coding. arXiv:1202.4050.
Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2), 319–337.
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Pentina, A., & Lampert, C. (2014). A PAC-Bayesian bound for lifelong learning. In Proceedings of the 31st international conference on machine learning (ICML-14) (pp. 991–999).
R Core Team. (2016). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning (pp. 759–766). ACM.
Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 227–244.
Sridharan, K., Shalev-Shwartz, S., & Srebro, N. (2009). Fast rates for regularized objectives. In Advances in neural information processing systems (pp. 1545–1552).
Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P. V., & Kawanabe, M. (2008). Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems (pp. 1433–1440).
Tibshirani, R. J. (2013). The lasso problem and uniqueness. Electronic Journal of Statistics, 7, 1456–1490.
Tommasi, T., Orabona, F., & Caputo, B. (2014). Learning categories from few examples with multi model knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 928–941.
Vainsencher, D., Mannor, S., & Bruckstein, A. M. (2011). The sample complexity of dictionary learning. The Journal of Machine Learning Research, 12, 3259–3281.
Wang, H., Nie, F., & Huang, H. (2013). Robust and discriminative self-taught learning. In Proceedings of the 30th international conference on machine learning (pp. 298–306).
Yang, J., Yu, K., Gong, Y., & Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009 (pp. 1794–1801). IEEE.
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (pp. 3320–3328).
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning (p. 114). ACM.
Zhang, Z., Xu, Y., Yang, J., Li, X., & Zhang, D. (2015). A survey of sparse representation: Algorithms and applications. IEEE Access, 3, 490–530.
Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7, 2541–2563.
Zhu, X., Huang, Z., Yang, Y., Shen, H. T., Xu, C., & Luo, J. (2013). Self-taught dimensionality reduction on the high-dimensional small-sized data. Pattern Recognition, 46(1), 215–229.
Acknowledgements
This work was supported by JSPS KAKENHI Grant Numbers 16K00044, 15H03636, 15H01678, 17K12653.
Editor: Steve Hanneke.
Appendices
Fast rate in hypothesis transfer learning
We have studied the two-step learning algorithm, in which an effective parameter \({{\varvec{\theta }}}^*_{\mathcal {S}}\) is first learned in the source domain, and the optimal parameter \({\mathbf{w}}^*_{\mathcal {T}}\) is then learned in the target domain based on an estimator of \({{\varvec{\theta }}}^*_{\mathcal {S}}\). Here, we consider hypothesis transfer learning in Kuzborskij and Orabona (2017) as a special case of the setting introduced in Sect. 2.1, and show that the expected risk of the two-step algorithm attains a fast rate when the number of samples in the target domain is not large.
Let the sample space \(\mathcal {X}\subset {\mathbb {R}}^d\) and the label space \(\mathcal {Y}=[-1,1]\) be common to the source and target domains. In addition, let \(r:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) and \(r':{\mathbb {R}}^K\rightarrow {\mathbb {R}}\) be non-negative 1-strongly convex functions satisfying \(r({\varvec{0}})=0\) and \(r'({\varvec{0}})=0\), respectively. We set \(\mathcal {W}:=\left\{ {\mathbf{w}}\in {\mathbb {R}}^d| r({\mathbf{w}})\le D\right\} \) and \(\varTheta :=\left\{ {{\varvec{\theta }}}\in {\mathbb {R}}^K| r'({{\varvec{\theta }}})\le B\right\} \). In Kuzborskij and Orabona (2017), it is supposed that there exist finitely many hypotheses \(\{h_k:\mathcal {X}\rightarrow \mathcal {Y}\}_{k=1}^K\) such that the source hypothesis is represented by
and the target hypothesis is represented by
for \({{\varvec{\theta }}}\in \varTheta \) and \({\mathbf{w}}\in \mathcal {W}\).
Here, we set the feature mapping
and the set
Then, identifying \({\mathbf{w}}\in \mathcal {W}\) with \(({\mathbf{w}}^{\top }, 1)^{\top }\in \mathcal {W}_\mathcal {T}\), the hypothesis in (1) coincides with that in (19). In this sense, the formulation in Kuzborskij and Orabona (2017) is covered by ours.
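The two-step procedure described above can be sketched numerically. In the sketch below, the data generators, the fixed source hypotheses \(h_k\), and the ridge-regularized squared loss are all illustrative assumptions, not the setting analyzed in the paper; it only shows the shape of the algorithm: learn \({{\varvec{\theta }}}\) from abundant source data, then learn \({\mathbf{w}}\) on few target samples with \(\langle {{\varvec{\theta }}},{\varvec{h}}({\mathbf{x}})\rangle \) held fixed as an offset.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, N, n = 5, 3, 200, 20          # feature dim, #source hypotheses, sample sizes

# Illustrative fixed source hypotheses h_k (clipped linear maps into [-1, 1]).
V = rng.normal(size=(K, d))
def h(X):
    # Returns the K hypothesis outputs for each row of X.
    return np.clip(X @ V.T, -1.0, 1.0)

# Step 1: learn theta on abundant source data (ridge-regularized least squares).
Xs = rng.normal(size=(N, d))
ys = np.clip(Xs @ rng.normal(size=d), -1.0, 1.0)
Hs = h(Xs)
theta = np.linalg.solve(Hs.T @ Hs + 1e-2 * np.eye(K), Hs.T @ ys)

# Step 2: learn the target parameter w on few target samples, keeping the
# source part <theta, h(x)> fixed as an offset (hypothesis transfer).
Xt = rng.normal(size=(n, d))
yt = np.clip(Xt @ rng.normal(size=d), -1.0, 1.0) + 0.5 * (h(Xt) @ theta)
residual = yt - h(Xt) @ theta
w = np.linalg.solve(Xt.T @ Xt + 1e-2 * np.eye(d), Xt.T @ residual)

# Combined hypothesis of the form <w, x> + <theta, h(x)>.
predictions = Xt @ w + h(Xt) @ theta
train_mse = float(np.mean((predictions - yt) ** 2))
```

Since the second step fits the residual of the frozen source predictor, its training error can never exceed that of using the source predictor alone.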
In the above setting, let us consider the excess risk:
where
and \({\mathbf{w}}^*_\mathcal {T}\) was defined in (2). The first term in (20) is bounded above as follows:
where \({\varvec{h}}({\mathbf{x}}) = [h_1({\mathbf{x}}), \ldots , h_K({\mathbf{x}})]^{\top }\), and we used \(\Vert {\varvec{h}}({\mathbf{x}})\Vert _{\infty }\le 1\), which follows from \(h_k({\mathbf{x}})\in \mathcal {Y}=[-1,1]\). The second term in (20) can be evaluated by the following theorem.
Theorem 4
[Theorem 2 of Kuzborskij and Orabona (2017)] Let \(\mathcal {R}_{src}:=\mathcal {R}_{\mathcal {T}}\left( {{\varvec{\theta }}}^*_\mathcal {S}, \mathbf{0}\right) \). When the loss function \(\ell \) is \(\kappa \)-smooth, it holds that
where
which is positive and has a constant order for a small \(\mathcal {R}_{src}\).
We assume that \(\Vert \widehat{{{\varvec{\theta }}}}_N - {\varvec{\theta }}_{\mathcal {S}}^* \Vert _1 = O(1/n)\) holds. Then, from the above discussion, as long as the number of samples in the target domain satisfies \(n =O(1/\mathcal {R}_{src}^{1/5})\), the excess risk has the following order:
In other words, when the estimation error in the source domain is small enough and the number of samples n in the target domain is not too large, the excess risk decreases at the rate \(O(1/n)\), which is faster than the conventional rate \(O(1/\sqrt{n})\) in the non-transfer setting.
Proof of sparse coding stability
The proof of Theorem 2 is almost the same as that of Theorem 1 in Mehta and Gray (2012). However, since a part of the proof cannot be applied to our setting, we provide the full proof of Theorem 2 in this section.
Lemma 1
Let \({\mathbf{a}}\in {\mathbb {R}}^m\) and \({\mathbf{E}}\in {\mathbb {R}}^{d\times m}\). Then, \(\Vert {\mathbf{E}}{\mathbf{a}}\Vert _2 \le \Vert {\mathbf{E}}\Vert _{1,2}\Vert {\mathbf{a}}\Vert _1\).
Proof
\(\square \)
Lemma 2
The sparse representation \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) satisfies \(\left\| {\varphi }_{{\mathbf{D}}}({\mathbf{x}}) \right\| _1 \le \frac{\Vert {\mathbf{x}}\Vert _2^2}{2\lambda }\).
Proof
\(\square \)
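Lemma 2 can be verified numerically. The sketch below solves the sparse coding problem \(\min _{\mathbf{a}}\frac{1}{2}\Vert {\mathbf{x}}-{\mathbf{D}}{\mathbf{a}}\Vert _2^2+\lambda \Vert {\mathbf{a}}\Vert _1\) with a minimal ISTA solver (the solver and the random instance are our assumptions) and checks the \(l_1\) bound; since ISTA starts from \({\varvec{0}}\) and decreases the objective monotonically, the bound \(\lambda \Vert {\mathbf{a}}\Vert _1\le \frac{1}{2}\Vert {\mathbf{x}}\Vert _2^2\) holds at every iterate, not only at the optimum:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(D, x, lam, n_iter=2000):
    """Approximately solve min_a 0.5*||x - D a||_2^2 + lam*||a||_1 by ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - D.T @ (D @ a - x) / L, lam / L)
    return a

rng = np.random.default_rng(0)
d, m, lam = 20, 50, 0.1
D = rng.normal(size=(d, m))
D /= np.linalg.norm(D, axis=0)             # unit-norm dictionary atoms
x = rng.normal(size=d)

a_star = sparse_code(D, x, lam)
l1_norm = float(np.linalg.norm(a_star, 1))
l1_bound = float(np.dot(x, x)) / (2 * lam)  # Lemma 2: ||phi_D(x)||_1 <= ||x||_2^2/(2*lam)
```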
We prepare the following notation:
Let \({\mathbf{a}}^{*}\) and \(\tilde{{\mathbf{a}}}^{*}\) denote the solutions to the LASSO problems for the dictionary \({\mathbf{D}}\) and \(\tilde{{\mathbf{D}}}\), respectively.
Then, the following equation holds by the subgradient optimality condition for \(v_{{\mathbf{D}}}({\mathbf{z}})\) with respect to \({\mathbf{z}}\) [see, e.g., (2.8) of Osborne et al. (2000)].
Lemma 3
Let \(v_{{\mathbf{D}}}\) and \(v_{\tilde{{\mathbf{D}}}}\) be the optimal values of the LASSO problems for dictionary \({\mathbf{D}}\) and \(\tilde{{\mathbf{D}}}\).
Lemma 4
(Optimal Value Stability) If \(\Vert {\mathbf{D}}-\tilde{{\mathbf{D}}}\Vert _{1,2}\le \lambda \), then
Proof
where we use
\(\square \)
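As an illustrative check of the optimal value stability in Lemma 4, the sketch below perturbs a dictionary by \({\mathbf{E}}\) with \(\Vert {\mathbf{E}}\Vert _{1,2}=\epsilon \) and verifies the cross-evaluation inequality that drives the proof: plugging the \({\mathbf{D}}\)-optimal code into the perturbed problem can raise the objective only by terms controlled by \(\Vert {\mathbf{E}}{\mathbf{a}}^{*}\Vert _2\) (itself bounded via Lemma 1). The ISTA solver, the random instance, and the numerical tolerance are our assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso(D, x, lam, n_iter=5000):
    """Approximately solve min_a 0.5*||x - D a||_2^2 + lam*||a||_1 by ISTA.

    Returns (optimal value, minimizer)."""
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - D.T @ (D @ a - x) / L, lam / L)
    return 0.5 * float(np.dot(x - D @ a, x - D @ a)) + lam * float(np.linalg.norm(a, 1)), a

rng = np.random.default_rng(0)
d, m, lam, eps = 15, 30, 0.2, 1e-3
D = rng.normal(size=(d, m))
D /= np.linalg.norm(D, axis=0)
E = rng.normal(size=(d, m))
E *= eps / np.max(np.linalg.norm(E, axis=0))   # ||E||_{1,2} (max column 2-norm) = eps
x = rng.normal(size=d)

v, a = lasso(D, x, lam)                        # optimal value for D
vt, _ = lasso(D + E, x, lam)                   # optimal value for perturbed D + E

# Cross-evaluation step: evaluating the D-optimal code a under D + E increases
# the objective by at most r*||E a|| + 0.5*||E a||^2 with r = ||x - D a||,
# and ||E a||_2 <= ||E||_{1,2} * ||a||_1 by Lemma 1.
r = np.linalg.norm(x - D @ a)
slack = r * np.linalg.norm(E @ a) + 0.5 * np.linalg.norm(E @ a) ** 2
gap = vt - (v + slack)                         # <= 0 up to solver tolerance
```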
The following Lemma 5 follows from the proof of Lemma 11 in Mehta and Gray (2012).
Lemma 5
(Stability of Norm of Reconstructor) If \(\Vert {\mathbf{D}}-\tilde{{\mathbf{D}}}\Vert _{1,2}\le \lambda \), then
Lemma 6
If \(\Vert {\mathbf{D}}-\tilde{{\mathbf{D}}}\Vert _{1,2}\le \lambda \), then
Proof
First, note that
and
where we use Lemma 2. Then, we have
Combining this fact with Lemma 5, we have
\(\square \)
Lemma 7
(Reconstructor Stability) If \(\Vert {\mathbf{D}}-\tilde{{\mathbf{D}}}\Vert _{1,2}\le \lambda \), then
Proof
We set \(\bar{{\mathbf{a}}}^{*}:=\frac{1}{2}({\mathbf{a}}^{*}+\tilde{{\mathbf{a}}}^{*})\). From the optimality of \({\mathbf{a}}^{*}\), it follows that \(v_{{\mathbf{D}}}({\mathbf{a}}^{*})\le v_{{\mathbf{D}}}(\bar{{\mathbf{a}}}^{*})\), i.e.,
We denote \(\epsilon :=\Vert {\mathbf{D}}-\tilde{{\mathbf{D}}}\Vert _{1,2}\), \(c_{\mathbf{x}}:=\left( 1 + \frac{\Vert {\mathbf{x}}\Vert _2}{4}\right) \Vert {\mathbf{x}}\Vert _2^3\) and \(c'_{{\mathbf{x}}}:=\left( \Vert {\mathbf{x}}\Vert _2 +3\right) \Vert {\mathbf{x}}\Vert _2^3\).
By the convexity of the \(l_1\)-norm, the RHS of (23) is bounded as follows:
Combining the expanded LHS of (23) with this upper bound on its RHS yields the inequality:
Replacing \(\lambda \Vert {\mathbf{a}}^{*}\Vert _1\) with \(\langle {\mathbf{x}}- {\mathbf{D}}{\mathbf{a}}^*, {\mathbf{D}}{\mathbf{a}}^*\rangle \) by Lemma 3 yields:
Hence,
Then, we obtain
\(\square \)
Lemma 8
(Preservation of Sparsity) If
then
Proof
In this proof, we denote \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) and \({\varphi }_{\tilde{{\mathbf{D}}}}({\mathbf{x}})\) by \({\mathbf{a}}^*=[a^*_1,\ldots ,a^*_m]^{\top }\) and \(\tilde{{\mathbf{a}}}^*=[\tilde{a}^*_1,\ldots ,\tilde{a}^*_m]^{\top }\), respectively. When \(\tilde{{\mathbf{D}}}={\mathbf{D}}\), Lemma 8 obviously holds. In the following, we assume \(\tilde{{\mathbf{D}}}\ne {\mathbf{D}}\). Since \({\mathcal {M}}_{k}({\mathbf{D}},{\mathbf{x}})>0\) from (25), there exists \(\mathcal {I}\subset [m]\) with \(|\mathcal {I}|=m-k\) such that, for all \(j\in \mathcal {I}\),
To obtain (26), it is enough to show that \(a^{*}_i=0\) and \(\tilde{a}^{*}_i=0\) for all \(i\in \mathcal {I}\).
First, we show \(a^{*}_i=0\) for all \(i\in \mathcal {I}\). From the optimality conditions for the LASSO (Fuchs 2004), we have
Note that the above optimality conditions imply that, if \(a^{*}_j\ne 0\), then
Combining (28) with (27), it holds that \(a^{*}_i=0\) for all \(i\in \mathcal {I}\).
Next, we show \(\tilde{a}^{*}_i=0\) for all \(i\in \mathcal {I}\). To do so, it is sufficient to show that
for all \(i\in \mathcal {I}\). Note that
and
Hence,
Now,
where (30) follows from Lemma 7. Then, (29) is obtained from (25). \(\square \)
Lemma 9
When a dictionary \({\mathbf{D}}\) is \(\mu \)-incoherent, the following bound holds for an arbitrary k-sparse vector \({\mathbf{b}}\).
Proof
We set \({\mathbf{G}}:={\mathbf{D}}^{\top }{\mathbf{D}}- {\mathbf{I}}\), where \({\mathbf{I}}\) is the \(m\times m\) identity matrix. Since \({\mathbf{D}}\) is \(\mu \)-incoherent, the absolute value of each component of \({\mathbf{G}}\) is less than or equal to \(\mu /\sqrt{d}\), and thus, \({\mathbf{b}}^{\top } {\mathbf{G}}{\mathbf{b}}\ge -(\mu /\sqrt{d})\Vert {\mathbf{b}}\Vert _1^2\). Then, we obtain
where we use \(\Vert {\mathbf{b}}\Vert _1\le \sqrt{k}\Vert {\mathbf{b}}\Vert _2\) for the k-sparse vector \({\mathbf{b}}\) in the last inequality. \(\square \)
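Lemma 9 admits a direct numerical check: for any unit-norm dictionary, \(\mu \) can be read off from the largest off-diagonal Gram entry, and the quadratic-form bound must then hold for every k-sparse vector. The random instance below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 64, 100, 3

# Random unit-norm dictionary; its incoherence parameter mu is determined by
# the largest off-diagonal Gram entry via |<d_i, d_j>| <= mu / sqrt(d).
D = rng.normal(size=(d, m))
D /= np.linalg.norm(D, axis=0)
G = D.T @ D - np.eye(m)
mu = float(np.sqrt(d) * np.max(np.abs(G)))

# Check ||D b||_2^2 >= (1 - mu*k/sqrt(d)) * ||b||_2^2 on random k-sparse b.
ok = True
for _ in range(100):
    b = np.zeros(m)
    idx = rng.choice(m, size=k, replace=False)
    b[idx] = rng.normal(size=k)
    lhs = float(np.dot(D @ b, D @ b))
    rhs = (1 - mu * k / np.sqrt(d)) * float(np.dot(b, b))
    ok = ok and lhs >= rhs - 1e-9
```

Note that the check is informative only when \(\mu k/\sqrt{d}<1\); otherwise the right-hand side is negative and the inequality is vacuous.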
Proof of Theorem 2
Following the notation of Mehta and Gray (2012), we denote \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) and \({\varphi }_{\tilde{{\mathbf{D}}}}({\mathbf{x}})\) by \(z_*\) and \(t_*\), respectively. From (23) of Mehta and Gray (2012), we have
Note that the assumption (25) of Lemma 8 follows from (11), and thus, \(\Vert z_{*} - t_{*}\Vert _0\le k\) holds from Lemma 8. Then, \(\Vert z_{*} - t_{*}\Vert _1 \le \sqrt{k}\Vert z_{*} - t_{*}\Vert _2.\)
Setting \({\mathbf{E}}:={\mathbf{D}}- \tilde{{\mathbf{D}}}\), the first term in (32) is evaluated as follows:
where we use \(\Vert {\mathbf{E}}\Vert _{1,2}\le 2\) in the last inequality. The second term in (32) is evaluated as follows:
Thus, we have
On the other hand, we have the following lower bound of (32) from the \(\mu \)-incoherence of \({\mathbf{D}}\) and Lemma 9:
\(\square \)
Proof of margin bound
In this proof, we set
Then, \(\delta _{t,\lambda } = \delta _1+\delta _2+\delta _3\).
The column vectors of a \(\mu \)-incoherent dictionary are in general position. Thus, the solution of the LASSO problem for a \(\mu \)-incoherent dictionary is unique by Lemma 3 of Tibshirani (2013).
The following notions are introduced in Zhao and Yu (2006). Let \({\mathbf{a}}\) be a k-sparse vector. Without loss of generality, we can assume that \({\mathbf{a}}=[a_1,\ldots ,a_k,0,\ldots ,0]^{\top }\). Then, \({\mathbf{a}}(1)=[a_1,\ldots ,a_k]^{\top }\), \({\mathbf{D}}(1)=[{\mathbf{d}}_1,\ldots ,{\mathbf{d}}_k]\) and \({\mathbf{D}}(2)=[{\mathbf{d}}_{k+1},\ldots ,{\mathbf{d}}_m]\). We define \({\mathbf{C}}_{ij}:=\frac{1}{d}{\mathbf{D}}(i)^{\top }{\mathbf{D}}(j)\) for \(i,j\in \{1,2\}\). When a dictionary \({\mathbf{D}}\) is \(\mu \)-incoherent and \((\mu k)^2/d< 1\), \({\mathbf{C}}_{11}\) is positive definite owing to Lemma 9 and hence invertible.
Definition 6
(Strong Irrepresentation Condition) There exists a positive vector \({{\varvec{\eta }}}\) such that
where \({\mathrm{sign}}({\mathbf{a}}(1))\) maps each positive entry of \({\mathbf{a}}(1)\) to 1, each negative entry to \(-1\), and 0 to 0; \({\mathbf{1}}\) is the \((d-k)\times 1\) vector of ones; and the inequality holds element-wise.
Then, the following lemma is derived by modifying the proof of Corollary 2 of Zhao and Yu (2006).
Lemma 10
(Strong Irrepresentation Condition) When a dictionary \({\mathbf{D}}\) is \(\mu \)-incoherent and \(\sqrt{d}>\mu (2k-1)\) holds, the strong irrepresentation condition holds with \({{\varvec{\eta }}}=(1-\mu (2k-1)/\sqrt{d}){\mathbf{1}}\).
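Lemma 10 can also be checked numerically. The sketch below (the random dictionary and the choice \(k=2\) are illustrative assumptions) verifies that whenever \(\sqrt{d}>\mu (2k-1)\), every sign pattern satisfies the element-wise bound with \({{\varvec{\eta }}}=(1-\mu (2k-1)/\sqrt{d}){\mathbf{1}}\); the scaling \(1/d\) in the definition of \({\mathbf{C}}_{ij}\) cancels in \({\mathbf{C}}_{21}{\mathbf{C}}_{11}^{-1}\), so it is omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, k = 256, 40, 2

D = rng.normal(size=(d, m))
D /= np.linalg.norm(D, axis=0)
G = D.T @ D
mu = float(np.sqrt(d) * np.max(np.abs(G - np.eye(m))))  # incoherence of this D

holds = True
if np.sqrt(d) > mu * (2 * k - 1):                       # hypothesis of Lemma 10
    D1, D2 = D[:, :k], D[:, k:]                         # split into D(1), D(2)
    eta = 1 - mu * (2 * k - 1) / np.sqrt(d)
    for s in ([1, 1], [1, -1], [-1, 1], [-1, -1]):      # all sign patterns, k = 2
        # C21 C11^{-1} sign(a(1)) without the 1/d factors (they cancel).
        v = D2.T @ D1 @ np.linalg.solve(D1.T @ D1, np.array(s, dtype=float))
        holds = holds and float(np.max(np.abs(v))) <= 1 - eta + 1e-9
```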
The following lemma is given in the proofs of Theorems 3 and 4 of Zhao and Yu (2006).
Lemma 11
When Assumptions 1–4 hold and \({\mathbf{D}}\) is \(\mu \)-incoherent and \(\sqrt{d}>2\mu (2k-1)\), there exist Gaussian random variables \(\{z_i\}_{i=1}^k\) and \(\{\zeta _i\}_{i=1}^{d-k}\) such that their variances are bounded as \({\mathbf{E}}[z_i^2]\le \sigma ^2k/ d(1-\mu k/\sqrt{d})\) and \({\mathbf{E}}[\zeta _i^2] \le \sigma ^2k/d^2\) and
Proof
The following is derived in Proposition 1 of Zhao and Yu (2006).
Proposition 1
Assume that the Strong Irrepresentable Condition holds with a constant \(\eta >0\). Then,
where
Here, we have
From Lemma 10, we obtain \(\Vert ({\mathbf{C}}^d_{11})^{-1}{\mathrm{sign}}({\mathbf{a}}{(1)})\Vert _2 \le \sqrt{k}/(1-\mu k/\sqrt{d})\). Thus, it holds that
where \({\mathbf{z}}= (z_1,\ldots ,z_k)^{\top } := (C^d_{11})^{-1} W^d(1)\) and \({{\varvec{\zeta }}}= (\zeta _1,\ldots ,\zeta _{d-k})^{\top }:= C^d_{21}(C^d_{11})^{-1} W^d(1) - W^d(2)\). Hence, it suffices to show that \({\mathbf{E}}[z_i^2]\le \sigma ^2k/ d(1-\mu k/\sqrt{d})\) and \({\mathbf{E}}[\zeta _i^2] \le \sigma ^2k/d^2\).
If we write \({\mathbf{z}}={\mathbf{H}}_A^{\top }{{\varvec{\xi }}}\), where \({\mathbf{H}}_A^{\top }=({\mathbf{h}}^A_1,\ldots ,{\mathbf{h}}^A_k)^{\top }=({\mathbf{C}}_{11})^{-1}\sqrt{d}^{-1}{\mathbf{D}}(1)^{\top }\), then
Therefore, \(z_i=h^{A\top }_i {{\varvec{\xi }}}\) with
Similarly, if we write \({{\varvec{\zeta }}}={\mathbf{H}}_B^{\top }{{\varvec{\xi }}}\), where \({\mathbf{H}}_B^{\top }=({\mathbf{h}}^B_1,\ldots ,{\mathbf{h}}^B_{d-k})^{\top }={\mathbf{C}}_{21}({\mathbf{C}}_{11})^{-1}\sqrt{d}^{-1}{\mathbf{D}}(1)^{\top } - \sqrt{d}^{-1}{\mathbf{D}}(2)^{\top }\), then
Then
where we used the fact that \(I - {\mathbf{D}}(1)({\mathbf{D}}(1)^{\top }{\mathbf{D}}(1))^{-1}{\mathbf{D}}(1)^{\top }\) has eigenvalues between 0 and 1. Therefore, \(\zeta _i=h^{B\top }_i{{\varvec{\xi }}}\) with
\(\square \)
Lemma 12
Under Assumptions 1–4, when \({\mathbf{D}}\) is \(\mu \)-incoherent and \(\sqrt{d}>2\mu (2k-1)\), the following holds:
Proof
The following inequality obviously holds:
From Lemma 11, there exist Gaussian random variables \(\{z_i\}_{i=1}^k\) and \(\{\zeta _i\}_{i=1}^{d-k}\) whose variances are bounded as \({\mathbf{E}}[z_i^2]\le \sigma ^2k/ d(1-\mu k/\sqrt{d})\) and \({\mathbf{E}}[\zeta _i^2] \le \sigma ^2k/d^2\), and such that (35) holds.
When \(\lambda \le (1-\mu k/\sqrt{d}) C d/\sqrt{k}\), the inequality \(|a_i| - \frac{\sqrt{k}\lambda }{2 (1-\mu k/\sqrt{d}) d}\ge C/2\) holds since \(|a_i|\ge C\). Then, since \(1-\mu (2k-1)/\sqrt{d} \ge 1/2\) holds, we obtain
where we use the fact that \(z_i\) and \(\zeta _i\) are sub-Gaussian. Thus, the proof is completed. \(\square \)
Lemma 13
Let \({\mathbf{D}}\) be a dictionary. When \({{\varvec{\xi }}}\) satisfies Assumption 4, the following holds:
Proof
Let \(\xi \) be a 1-dimensional Gaussian with variance \(\sigma ^2k/d\). Then, it holds that, for \(t>0\),
Note that \(\langle {\mathbf{d}}_j,{{\varvec{\xi }}}\rangle \) is Gaussian with variance \(\sigma ^2k/d\) because \(\Vert {\mathbf{d}}_j\Vert _2=1\) for every \(j\in [m]\) and components of \({{\varvec{\xi }}}\) are independent and Gaussian with variance \(\sigma ^2k/d\). Thus,
where we used (38) in the last inequality. \(\square \)
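The argument of Lemma 13 can be illustrated by a Monte Carlo check (the instance, the threshold \(t=3\sigma \sqrt{k/d}\), and the union-bound form \(2m\exp (-t^2d/(2\sigma ^2k))\) written below are our assumptions for illustration): each correlation \(\langle {\mathbf{d}}_j,{{\varvec{\xi }}}\rangle \) is Gaussian with variance \(\sigma ^2k/d\), so the maximum over the m atoms is controlled by a union bound.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, k, sigma = 64, 10, 4, 1.0
var = sigma ** 2 * k / d                     # per-component noise variance

D = rng.normal(size=(d, m))
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms

t = 3.0 * np.sqrt(var)                       # threshold: 3 standard deviations
union_bound = 2 * m * np.exp(-t ** 2 / (2 * var))

trials = 5000
xi = rng.normal(scale=np.sqrt(var), size=(trials, d))
corr = xi @ D                                # each <d_j, xi> is N(0, var), ||d_j|| = 1
freq = float(np.mean(np.max(np.abs(corr), axis=1) >= t))
sample_var = float(np.var(corr[:, 0]))       # should be close to var
```

The union bound is loose (it ignores the dependence between atoms), so the empirical frequency sits well below it.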
Lemma 14
Proof
By Assumption 1, \({\mathbf{x}}={\mathbf{D}}{\mathbf{a}}+{{\varvec{\xi }}}\). We denote \({\varphi }_{{\mathbf{D}}}({\mathbf{x}})\) by \({\mathbf{a}}^{*}\) and \({\mathbf{a}}- {\mathbf{a}}^{*}\) by \(\varDelta \). We have the following inequality by the definition of \({\mathbf{a}}^{*}\):
Substituting \({\mathbf{x}}={\mathbf{D}}{\mathbf{a}}+{{\varvec{\xi }}}\), we have
Let \(\varDelta _k\) be the vector whose ith component equals the ith component of \(\varDelta \) if i is in the support of \({\mathbf{a}}\), and equals 0 otherwise. In addition, let \(\varDelta _k^{\perp } = \varDelta - \varDelta _k\). Using \(\varDelta = \varDelta _k + \varDelta _k^{\perp }\), we have
Substituting the above inequality into (39), we have
The inequality \(\lambda \ge 2\Vert {\mathbf{D}}^{\top }{{\varvec{\xi }}}\Vert _{\infty }\) holds with probability \(1- \delta _2\) due to Lemma 13, and then, the following inequality holds:
Thus, we have
Here, \(\Vert \varDelta \Vert _0\le k\) with probability \(1 - \delta _3\) due to Lemma 12, and the following inequality holds by the \(\mu \)-incoherence of the dictionary \({\mathbf{D}}\):
Thus,
\(\square \)
Proof of Theorem 3
From Assumption 1, an arbitrary sample \({\mathbf{x}}\) is represented as \({\mathbf{x}}={\mathbf{D}}^{*}{\mathbf{a}}+{{\varvec{\xi }}}\). Then,
We evaluate the probability that the first and second terms shown above are bounded by \(\frac{1-t}{2}\lambda \).
We evaluate the probability for the first term. Since \(\Vert {\mathbf{d}}_j\Vert =1\) by definition, and \({{\varvec{\xi }}}\) is drawn from a Gaussian distribution with variance \(\sigma ^2k/d\), we have
With probability \(1 -\delta _2 -\delta _3\), the second term is evaluated as follows:
where we used Lemmas 12 and 14 in (40) and \(d\ge \left\{ \left( 1+\frac{6}{(1-t)}\right) \mu k\right\} ^2\) in (41). Thus, with probability \(1-(\delta _1+\delta _2+\delta _3)=1-\delta _{t,\lambda }\),
Thus, the proof of Theorem 3 is completed. \(\square \)
Kumagai, W., Kanamori, T. Risk bound of transfer learning using parametric feature mapping and its application to sparse coding. Mach Learn 108, 1975–2008 (2019). https://doi.org/10.1007/s10994-019-05805-2