Abstract
The formulation
$$\begin{aligned} \min _{x,\,y}~f(x)+g(y),\quad \text {subject to}\quad Ax+By=b, \end{aligned}$$
where f and g are extended-value convex functions, arises, either naturally or after variable splitting, in many application areas such as signal processing, imaging and image processing, statistics, and machine learning. In many common problems, one of the two objective functions is strictly convex and has a Lipschitz continuous gradient. For this class of problems, a very effective approach is the alternating direction method of multipliers (ADM or ADMM), which solves a sequence of f/g-decoupled subproblems. However, its effectiveness has not been matched by a provably fast rate of convergence: only sublinear rates such as \(O(1/k)\) and \(O(1/k^2)\) have recently been established in the literature, though the \(O(1/k)\) rates do not require strong convexity. This paper shows that global linear convergence can be guaranteed under the assumptions of strong convexity and a Lipschitz gradient on one of the two functions, along with certain rank assumptions on A and B. The result applies to various generalizations of ADM that allow the subproblems to be solved faster and less exactly in certain manners. The derived rate of convergence also provides some theoretical guidance for optimizing the ADM parameters. In addition, this paper makes meaningful extensions to the existing global convergence theory of ADM generalizations.
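The linear-convergence claim can be observed numerically. Below is a minimal NumPy sketch (our own illustration, not the paper's general algorithm): ADMM applied to \(\min _{x,y} \Vert x\Vert _1 + \frac{1}{2}\Vert y-c\Vert ^2\) subject to \(x-y=0\), i.e., the formulation with \(A=I\), \(B=-I\), \(b=0\), where g is strongly convex with Lipschitz gradient. All names, problem data, and the multiplier sign convention are our choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_l1_quadratic(c, beta=1.0, iters=100):
    """ADMM for min ||x||_1 + 0.5||y - c||^2 s.t. x - y = 0, penalty beta,
    dual step size gamma = 1 (augmented Lagrangian with -lam^T (x - y) term)."""
    x = np.zeros_like(c)
    y = np.zeros_like(c)
    lam = np.zeros_like(c)  # multiplier for the constraint x - y = 0
    for _ in range(iters):
        # x-step: argmin_x ||x||_1 + (beta/2)||x - (y + lam/beta)||^2
        x = soft_threshold(y + lam / beta, 1.0 / beta)
        # y-step: argmin_y 0.5||y - c||^2 + (beta/2)||x - y - lam/beta||^2
        y = (c - lam + beta * x) / (1.0 + beta)
        # dual update: lam <- lam - gamma*beta*(Ax + By - b), gamma = 1
        lam = lam - beta * (x - y)
    return x, y, lam
```

For this instance the exact minimizer is \(x^*=\text{soft\_threshold}(c,1)\), and the iterates approach it at a geometric (linear) rate, consistent with the theory.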
Notes
The results continue to hold in many cases when strong convexity is relaxed to strict convexity (e.g., \(-\log (x)\) is strictly convex but not strongly convex over \(x>0\)). ADM always generates a bounded sequence \((Ax^k, By^k, {\uplambda }^k)\), where the bound depends only on the starting point and the solution, even when the feasible set is unbounded. When restricted to a compact set, a strictly convex function is strongly convex.
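To illustrate the last claim of the note with the example already mentioned (our worked instance): for \(f(x)=-\log (x)\) restricted to a compact interval \([a,b]\subset (0,\infty )\),
$$\begin{aligned} f''(x)=\frac{1}{x^2}\ge \frac{1}{b^2}>0\quad \text {for all } x\in [a,b], \end{aligned}$$
so f is strongly convex on \([a,b]\) with modulus \(1/b^2\); no uniform modulus exists on all of \((0,\infty )\) since \(f''(x)\rightarrow 0\) as \(x\rightarrow \infty \).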
Suppose a sequence \(\{u^k\}\) converges to \(u^*\). We say the convergence is (in some norm \(\Vert \cdot \Vert \))
-

  Q-linear, if there exists \(\mu \in (0,1)\) such that \(\frac{\Vert u^{k+1}-u^*\Vert }{\Vert u^{k}-u^*\Vert }\le \mu \) for all \(k\);

-

  R-linear, if there exists a sequence \(\{\sigma ^k\}\) such that \(\Vert u^{k}-u^*\Vert \le \sigma ^k\) for all \(k\) and \(\sigma ^k\rightarrow 0\) Q-linearly.
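For instance (our examples): \(u^k=2^{-k}\) converges to 0 Q-linearly with \(\mu =\frac{1}{2}\), whereas \(u^k=(1+(-1)^k)\,2^{-k}\), which vanishes at every odd \(k\), is not Q-linear (the ratio \(\Vert u^{k+1}\Vert /\Vert u^k\Vert \) is undefined whenever \(u^k=0\)) but is R-linear with the envelope \(\sigma ^k=2^{1-k}\). Thus Q-linear convergence implies R-linear convergence, but not conversely.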
Acknowledgments
The authors’ work is supported in part by ARL MURI Grant W911NF-09-1-0383 and NSF Grant DMS-1317602.
Appendices
Appendix 1: Proof of Lemma 2.1
Proof
-

  (i) The optimality conditions of the subproblems of Algorithm 2 yield (2.6) and (2.7).

-

  (ii) By the convexity of f (1.16), combining the optimality conditions (1.13) and (2.6) yields (2.8). Similarly, (2.9) follows from the convexity of g (1.17) and the optimality conditions (1.14) and (2.7).
- (iii)

-

  (iv) By adding (2.8) and (2.9) and using (2.10), we have
$$\begin{aligned}&\frac{1}{\beta }\langle {\uplambda }^{k}-\hat{{\uplambda }},~\hat{{\uplambda }}-{\uplambda }^*-\beta A(x^k-x^{k+1})\rangle + \langle x^{k+1}-x^*, ~\hat{P}(x^k-x^{k+1})\rangle \nonumber \\&\qquad + \langle y^{k+1}-y^*, ~Q(y^k-y^{k+1})\rangle \nonumber \\&\quad \ge \nu _f\Vert x^{k+1}-x^{*}\Vert ^2+\nu _g\Vert y^{k+1}-y^*\Vert ^2, \end{aligned}$$(7.1)which can be simplified as
$$\begin{aligned} (\hat{u}-u^*)^TG_1(u^k-\hat{u})\ge & {} \langle A(x^k-x^{k+1}),~{\uplambda }^k-\hat{{\uplambda }}\rangle \nonumber \\&+\,\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+\nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$(7.2)By rearranging the terms, we have
$$\begin{aligned} (u^k-u^*)^TG_1(u^k-\hat{u})\ge & {} \Vert u^k-\hat{u}\Vert _{G_1}^2+\langle A(x^k-x^{k+1}),{\uplambda }^k-\hat{{\uplambda }}\rangle \nonumber \\&+\,\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+\nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$(7.3)Then, from the identity \(\Vert a-c\Vert _G^2-\Vert b-c\Vert _G^2=2(a-c)^TG(a-b)-\Vert a-b\Vert _G^2\) and (2.5), it follows that
$$\begin{aligned} \Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2=2(u^k-u^*)^TG_1(u^k-\hat{u})-\Vert G_0(u^k-\hat{u})\Vert ^2_G. \end{aligned}$$(7.4)Substituting (7.3) into the right-hand side of the above equation, we have
$$\begin{aligned} \Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2\ge & {} 2\Vert u^k-\hat{u}\Vert _{G_1}^2-\Vert u^k-\hat{u}\Vert ^2_{G_1G_0}\nonumber \\&+\,2\langle A(x^k-x^{k+1}),{\uplambda }^k-\hat{{\uplambda }}\rangle \nonumber \\&+\,2\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2, \end{aligned}$$(7.5)and thus (2.11) follows immediately.
Appendix 2: Proof of Theorem 2.1
Proof
-

  (i) By the Cauchy–Schwarz inequality, we have
$$\begin{aligned} 2({\uplambda }^k-\hat{{\uplambda }})^TA(x^k-x^{k+1})\ge -\frac{1}{\rho }\Vert A(x^k-x^{k+1})\Vert ^2-\rho \Vert {\uplambda }^k-\hat{{\uplambda }}\Vert ^2,~\forall \rho >0. \end{aligned}$$(8.1)Substituting (8.1) into (2.11) and using \(\frac{1}{\gamma }({\uplambda }^k-{\uplambda }^{k+1})={\uplambda }^k-\hat{{\uplambda }}\), we have
$$\begin{aligned}&\Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2\nonumber \\&\quad \ge ~\Vert x^k-x^{k+1}\Vert _{\hat{P}-\frac{1}{\rho }A^TA}^2+\Vert y^k-y^{k+1}\Vert _Q^2+\left( \frac{2-\gamma }{\beta }-\rho \right) \frac{1}{\gamma ^2}\Vert {\uplambda }^k-{\uplambda }^{k+1}\Vert ^2\nonumber \\&\qquad +\,2\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2,\quad \forall \rho >0. \end{aligned}$$(8.2)To show that (2.14) holds for a certain \(\eta >0\), we only need \(\hat{P}-\frac{1}{\rho }A^TA\succ 0\) and \(\frac{2-\gamma }{\beta }-\rho >0\) for a certain \(\rho >0\), which is true if and only if we have \(\hat{P}\succ \frac{\beta }{2-\gamma }A^T A\) or, equivalently, (2.13).
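To spell out the final equivalence (our elaboration): the two requirements pull \(\rho \) in opposite directions, since \(\frac{2-\gamma }{\beta }-\rho >0\) caps \(\rho \) from above while \(\hat{P}-\frac{1}{\rho }A^TA\succ 0\) favors large \(\rho \). Hence
$$\begin{aligned} \exists \,\rho \in \Big (0,\tfrac{2-\gamma }{\beta }\Big ):~\hat{P}-\tfrac{1}{\rho }A^TA\succ 0 \quad \Longleftrightarrow \quad \hat{P}\succ \tfrac{\beta }{2-\gamma }A^TA, \end{aligned}$$
where the forward direction follows from \(\frac{1}{\rho }>\frac{\beta }{2-\gamma }\), and the reverse direction follows by continuity, taking \(\rho \) slightly below \(\frac{2-\gamma }{\beta }\).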
-

  (ii) For \(P=\mathbf {0}\), we first derive a lower bound for the cross term \(({\uplambda }^k-\hat{{\uplambda }})^TA(x^k-x^{k+1})\). Applying (2.6) at two consecutive iterations with \(P=\mathbf {0}\) and in light of the definition of \(\hat{{\uplambda }}\), we have
$$\begin{aligned} \left\{ \begin{array}{l} A^T[{\uplambda }^{k-1}-\beta (Ax^{k}+By^{k}-b)] \in \partial f(x^{k}),\\ A^T\hat{{\uplambda }} \in \partial f(x^{k+1}). \end{array} \right. \end{aligned}$$(8.3)The difference of the two terms on the left in (8.3) is
$$\begin{aligned}&A^T\left[ {\uplambda }^{k-1}-\beta (Ax^{k}+By^{k}-b)-\hat{{\uplambda }}\right] \nonumber \\&\quad =A^T\left( {\uplambda }^k-\hat{{\uplambda }}\right) -(1-\gamma )\beta A^T\left( Ax^{k}+By^{k}-b\right) . \end{aligned}$$(8.4)By (8.3), (8.4) and (1.16), we get
$$\begin{aligned}&\langle A^T({\uplambda }^k-\hat{{\uplambda }}),x^k-x^{k+1}\rangle - \langle (1-\gamma )\beta A^T(Ax^{k}+By^{k}-b),x^k-x^{k+1}\rangle \nonumber \\&\quad \ge \nu _f\Vert x^k-x^{k+1}\Vert ^2, \end{aligned}$$(8.5)to which applying the Cauchy–Schwarz inequality gives
$$\begin{aligned}&({\uplambda }^k-\hat{{\uplambda }})^T A(x^k-x^{k+1})\nonumber \\&\quad \ge \langle \sqrt{\beta } (Ax^{k}+By^{k}-b), (1-\gamma )\sqrt{\beta }A(x^k-x^{k+1})\rangle +\nu _f\Vert x^k-x^{k+1}\Vert ^2\nonumber \\&\quad \ge -\frac{\beta }{2\rho }\Vert Ax^{k}+By^{k}-b\Vert ^2-\frac{(1-\gamma )^2\beta \rho }{2}\Vert A(x^k-x^{k+1})\Vert ^2\nonumber \\&\qquad +\nu _f\Vert x^k-x^{k+1}\Vert ^2,\quad \forall \rho >0. \end{aligned}$$(8.6)Substituting (8.6) into (2.11) and using \(\hat{P}=P+\beta A^T A=\beta A^T A\) and the definition of \(\hat{{\uplambda }}\), we have
$$\begin{aligned}&\Vert u^k-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert Ax^{k}+By^{k}-b\Vert ^2 \nonumber \\&\quad \ge ~\Vert u^{k+1}-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert Ax^{k+1}+By^{k+1}-b\Vert ^2\nonumber \\&\qquad +\,\beta \left( 2-\gamma -\frac{1}{\rho }\right) \Vert Ax^{k+1} +By^{k+1}-b\Vert ^2 +\beta \left( 1-(1-\gamma )^2\rho \right) \Vert A(x^k-x^{k+1})\Vert ^2\nonumber \\&\qquad +\,\Vert y^k-y^{k+1}\Vert ^2_Q+2\nu _f\Vert x^k-x^{k+1}\Vert ^2+2\nu _f\Vert x^{k+1}-x^*\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$(8.7)To prove that such an \(\eta >0\) exists for (2.16), we only need the existence of \(\rho >0\) such that \(2-\gamma -\frac{1}{\rho }>0\) and \(1-(1-\gamma )^2\rho >0\), which holds if and only if \(2-\gamma >(1-\gamma )^2\) or, equivalently, \(\gamma \in (0,\frac{1+\sqrt{5}}{2})\). In this case of \(P=\mathbf {0}\), if we set \(\gamma =1\), (8.6) reduces to \(({\uplambda }^k-\hat{{\uplambda }})^T A(x^k-x^{k+1})\ge \nu _f\Vert x^k-x^{k+1}\Vert ^2\), which, when substituted into (2.11), gives (2.17) with \(\eta = 1\).
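For completeness, the last equivalence is elementary (our expansion):
$$\begin{aligned} 2-\gamma>(1-\gamma )^2 \;\Longleftrightarrow \; \gamma ^2-\gamma -1<0 \;\Longleftrightarrow \; \gamma \in \Big (\tfrac{1-\sqrt{5}}{2},\tfrac{1+\sqrt{5}}{2}\Big ), \end{aligned}$$
since the roots of \(\gamma ^2-\gamma -1\) are \(\frac{1\pm \sqrt{5}}{2}\); intersecting with \(\gamma >0\) recovers the familiar golden-ratio bound \(\gamma \in (0,\frac{1+\sqrt{5}}{2})\).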
Deng, W., Yin, W. On the Global and Linear Convergence of the Generalized Alternating Direction Method of Multipliers. J Sci Comput 66, 889–916 (2016). https://doi.org/10.1007/s10915-015-0048-x