-
A second-order-like optimizer with adaptive gradient scaling for deep learning
Authors:
Jérôme Bolte,
Ryan Boustany,
Edouard Pauwels,
Andrei Purica
Abstract:
In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with the RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory requirements of standard DL methods as AdamW or SGD with momentum.After having recalled our geometrical motivations, we provide quite extensive experiments. On image classif…
▽ More
In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with the RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory requirements of standard DL methods as AdamW or SGD with momentum.After having recalled our geometrical motivations, we provide quite extensive experiments. On image classification (CIFAR-10, ImageNet) and language modeling (GPT-2), INNAprop consistently matches or outperforms AdamW both in training speed and accuracy, with minimal hyperparameter tuning in large-scale settings. Our code is publicly available at \url{https://github.com/innaprop/innaprop}.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.
-
An Efficient and Explainable Transformer-Based Few-Shot Learning for Modeling Electricity Consumption Profiles Across Thousands of Domains
Authors:
Weijie Xia,
Gao Peng,
Chenguang Wang,
Peter Palensky,
Eric Pauwels,
Pedro P. Vergara
Abstract:
Electricity Consumption Profiles (ECPs) are crucial for operating and planning power distribution systems, especially with the increasing numbers of various low-carbon technologies such as solar panels and electric vehicles. Traditional ECP modeling methods typically assume the availability of sufficient ECP data. However, in practice, the accessibility of ECP data is limited due to privacy issues…
▽ More
Electricity Consumption Profiles (ECPs) are crucial for operating and planning power distribution systems, especially with the increasing numbers of various low-carbon technologies such as solar panels and electric vehicles. Traditional ECP modeling methods typically assume the availability of sufficient ECP data. However, in practice, the accessibility of ECP data is limited due to privacy issues or the absence of metering devices. Few-shot learning (FSL) has emerged as a promising solution for ECP modeling in data-scarce scenarios. Nevertheless, standard FSL methods, such as those used for images, are unsuitable for ECP modeling because (1) these methods usually assume several source domains with sufficient data and several target domains. However, in the context of ECP modeling, there may be thousands of source domains with a moderate amount of data and thousands of target domains. (2) Standard FSL methods usually involve cumbersome knowledge transfer mechanisms, such as pre-training and fine-tuning, whereas ECP modeling requires more lightweight methods. (3) Deep learning models often lack explainability, hindering their application in industry. This paper proposes a novel FSL method that exploits Transformers and Gaussian Mixture Models (GMMs) for ECP modeling to address the above-described issues. Results show that our method can accurately restore the complex ECP distribution with a minimal amount of ECP data (e.g., only 1.6\% of the complete domain dataset) while it outperforms state-of-the-art time series modeling methods, maintaining the advantages of being both lightweight and interpretable. The project is open-sourced at https://github.com/xiaweijie1996/TransformerEM-GMM.git.
△ Less
Submitted 22 August, 2024; v1 submitted 15 August, 2024;
originally announced August 2024.
-
Geometric and computational hardness of bilevel programming
Authors:
Jérôme Bolte,
Quoc-Tung Le,
Edouard Pauwels,
Samuel Vaiter
Abstract:
We first show a simple but striking result in bilevel optimization: unconstrained $C^\infty$ smooth bilevel programming is as hard as general extended-real-valued lower semicontinuous minimization. We then proceed to a worst-case analysis of box-constrained bilevel polynomial optimization. We show in particular that any extended-real-valued semi-algebraic function, possibly non-continuous, can be…
▽ More
We first show a simple but striking result in bilevel optimization: unconstrained $C^\infty$ smooth bilevel programming is as hard as general extended-real-valued lower semicontinuous minimization. We then proceed to a worst-case analysis of box-constrained bilevel polynomial optimization. We show in particular that any extended-real-valued semi-algebraic function, possibly non-continuous, can be expressed as the value function of a polynomial bilevel program. Secondly, from a computational complexity perspective, the decision version of polynomial bilevel programming is one level above NP in the polynomial hierarchy ($Σ^p_2$-hard). Both types of difficulties are uncommon in non-linear programs for which objective functions are typically continuous and belong to the class NP. These results highlight the irremediable hardness attached to general bilevel optimization and the necessity of imposing some form of regularity on the lower level.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
On the sequential convergence of Lloyd's algorithms
Authors:
Léo Portales,
Elsa Cazelles,
Edouard Pauwels
Abstract:
Lloyd's algorithm is an iterative method that solves the quantization problem, i.e. the approximation of a target probability measure by a discrete one, and is particularly used in digital applications.This algorithm can be interpreted as a gradient method on a certain quantization functional which is given by optimal transport. We study the sequential convergence (to a single accumulation point)…
▽ More
Lloyd's algorithm is an iterative method that solves the quantization problem, i.e. the approximation of a target probability measure by a discrete one, and is particularly used in digital applications.This algorithm can be interpreted as a gradient method on a certain quantization functional which is given by optimal transport. We study the sequential convergence (to a single accumulation point) for two variants of Lloyd's method: (i) optimal quantization with an arbitrary discrete measure and (ii) uniform quantization with a uniform discrete measure. For both cases, we prove sequential convergence of the iterates under an analiticity assumption on the density of the target measure. This includes for example analytic densities truncated to a compact semi-algebraic set. The argument leverages the log analytic nature of globally subanalytic integrals, the interpretation of Lloyd's method as a gradient method and the convergence analysis of gradient algorithms under Kurdyka-Lojasiewicz assumptions. As a by-product, we also obtain definability results for more general semi-discrete optimal transport losses such as transport distances with general costs, the max-sliced Wasserstein distance and the entropy regularized optimal transport loss.
△ Less
Submitted 2 July, 2024; v1 submitted 31 May, 2024;
originally announced May 2024.
-
Derivatives of Stochastic Gradient Descent
Authors:
Franck Iutzeler,
Edouard Pauwels,
Samuel Vaiter
Abstract:
We consider stochastic optimization problems where the objective depends on some parameter, as commonly found in hyperparameter optimization for instance. We investigate the behavior of the derivatives of the iterates of Stochastic Gradient Descent (SGD) with respect to that parameter and show that they are driven by an inexact SGD recursion on a different objective function, perturbed by the conv…
▽ More
We consider stochastic optimization problems where the objective depends on some parameter, as commonly found in hyperparameter optimization for instance. We investigate the behavior of the derivatives of the iterates of Stochastic Gradient Descent (SGD) with respect to that parameter and show that they are driven by an inexact SGD recursion on a different objective function, perturbed by the convergence of the original SGD. This enables us to establish that the derivatives of SGD converge to the derivative of the solution mapping in terms of mean squared error whenever the objective is strongly convex. Specifically, we demonstrate that with constant step-sizes, these derivatives stabilize within a noise ball centered at the solution derivative, and that with vanishing step-sizes they exhibit $O(\log(k)^2 / k)$ convergence rates. Additionally, we prove exponential convergence in the interpolation regime. Our theoretical findings are illustrated by numerical experiments on synthetic tasks.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Inexact subgradient methods for semialgebraic functions
Authors:
Jérôme Bolte,
Tam Le,
Éric Moulines,
Edouard Pauwels
Abstract:
Motivated by the widespread use of approximate derivatives in machine learning and optimization, we study inexact subgradient methods with non-vanishing additive errors and step sizes. In the nonconvex semialgebraic setting, under boundedness assumptions, we prove that the method provides points that eventually fluctuate close to the critical set at a distance proportional to $ε^ρ$ where $ε$ is t…
▽ More
Motivated by the widespread use of approximate derivatives in machine learning and optimization, we study inexact subgradient methods with non-vanishing additive errors and step sizes. In the nonconvex semialgebraic setting, under boundedness assumptions, we prove that the method provides points that eventually fluctuate close to the critical set at a distance proportional to $ε^ρ$ where $ε$ is the error in subgradient evaluation and $ρ$ relates to the geometry of the problem. In the convex setting, we provide complexity results for the averaged values. We also obtain byproducts of independent interest, such as descent-like lemmas for nonsmooth nonconvex problems and some results on the limit of affine interpolants of differential inclusions.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
Generic Fr{é}chet stationarity in constrained optimization
Authors:
Edouard Pauwels
Abstract:
Minimizing a smooth function f on a closed subset C leads to different notions of stationarity: Fr{é}chet stationarity, which carries a strong variational meaning, and criticallity, which is defined through a closure process. The latter is an optimality condition which may loose the variational meaning of Fr{é}chet stationarity in some settings. We show that, while criticality is the appropriate n…
▽ More
Minimizing a smooth function f on a closed subset C leads to different notions of stationarity: Fr{é}chet stationarity, which carries a strong variational meaning, and criticallity, which is defined through a closure process. The latter is an optimality condition which may loose the variational meaning of Fr{é}chet stationarity in some settings. We show that, while criticality is the appropriate notion in full generality, Fr{é}chet stationarity is typical in practical scenarios. This is illustrated with two main results, first we show that if C is semi-algebraic, then for a generic smooth semi-algebraic function f , all critical points of f on C are actually Fr{é}chet stationary. Second we prove that for small step-sizes, all the accumulation points of the projected gradient algorithm are Fr{é}chet stationary, with an explicit global quadratic estimate of the remainder, avoiding potential critical points which are not Fr{é}chet stationary, and some bad local minima.
△ Less
Submitted 15 February, 2024;
originally announced February 2024.
-
One-step differentiation of iterative algorithms
Authors:
Jérôme Bolte,
Edouard Pauwels,
Samuel Vaiter
Abstract:
In appropriate frameworks, automatic differentiation is transparent to the user at the cost of being a significant computational burden when the number of operations is large. For iterative algorithms, implicit differentiation alleviates this issue but requires custom implementation of Jacobian evaluation. In this paper, we study one-step differentiation, also known as Jacobian-free backpropagatio…
▽ More
In appropriate frameworks, automatic differentiation is transparent to the user at the cost of being a significant computational burden when the number of operations is large. For iterative algorithms, implicit differentiation alleviates this issue but requires custom implementation of Jacobian evaluation. In this paper, we study one-step differentiation, also known as Jacobian-free backpropagation, a method as easy as automatic differentiation and as performant as implicit differentiation for fast algorithms (e.g., superlinear optimization methods). We provide a complete theoretical approximation analysis with specific examples (Newton's method, gradient descent) along with its consequences in bilevel optimization. Several numerical examples illustrate the well-foundness of the one-step estimator.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
On the nature of Bregman functions
Authors:
Edouard Pauwels
Abstract:
Let C be convex, compact, with nonempty interior and h be Legendre with domain C, continuous on C. We prove that h is Bregman if and only if it is strictly convex on C and C is a polytope. This provides insights on sequential convergence of many Bregman divergence based algorithm: abstract compatibility conditions between Bregman and Euclidean topology may equivalently be replaced by explicit cond…
▽ More
Let C be convex, compact, with nonempty interior and h be Legendre with domain C, continuous on C. We prove that h is Bregman if and only if it is strictly convex on C and C is a polytope. This provides insights on sequential convergence of many Bregman divergence based algorithm: abstract compatibility conditions between Bregman and Euclidean topology may equivalently be replaced by explicit conditions on h and C. This also emphasizes that a general convergence theory for these methods (beyond polyhedral domains) would require more refinements than Bregman's conditions.
△ Less
Submitted 6 February, 2023;
originally announced February 2023.
-
Differentiating Nonsmooth Solutions to Parametric Monotone Inclusion Problems
Authors:
Jérôme Bolte,
Edouard Pauwels,
Antonio Silveti-Falls
Abstract:
We leverage path differentiability and a recent result on nonsmooth implicit differentiation calculus to give sufficient conditions ensuring that the solution to a monotone inclusion problem will be path differentiable, with formulas for computing its generalized gradient. A direct consequence of our result is that these solutions happen to be differentiable almost everywhere. Our approach is full…
▽ More
We leverage path differentiability and a recent result on nonsmooth implicit differentiation calculus to give sufficient conditions ensuring that the solution to a monotone inclusion problem will be path differentiable, with formulas for computing its generalized gradient. A direct consequence of our result is that these solutions happen to be differentiable almost everywhere. Our approach is fully compatible with automatic differentiation and comes with assumptions which are easy to check, roughly speaking: semialgebraicity and strong monotonicity. We illustrate the scope of our results by considering three fundamental composite problem settings: strongly convex problems, dual solutions to convex minimization problems and primal-dual solutions to min-max problems.
△ Less
Submitted 15 December, 2022;
originally announced December 2022.
-
The derivatives of Sinkhorn-Knopp converge
Authors:
Edouard Pauwels,
Samuel Vaiter
Abstract:
We show that the derivatives of the Sinkhorn-Knopp algorithm, or iterative proportional fitting procedure, converge towards the derivatives of the entropic regularization of the optimal transport problem with a locally uniform linear convergence rate.
We show that the derivatives of the Sinkhorn-Knopp algorithm, or iterative proportional fitting procedure, converge towards the derivatives of the entropic regularization of the optimal transport problem with a locally uniform linear convergence rate.
△ Less
Submitted 3 August, 2022; v1 submitted 26 July, 2022;
originally announced July 2022.
-
On the complexity of nonsmooth automatic differentiation
Authors:
Jérôme Bolte,
Ryan Boustany,
Edouard Pauwels,
Béatrice Pesquet-Popescu
Abstract:
Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This co…
▽ More
Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends Baur-Strassen's smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation's cheapness contrasts with concurrent forward approaches, which have, to this day, dimensional-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.
△ Less
Submitted 6 February, 2023; v1 submitted 1 June, 2022;
originally announced June 2022.
-
Automatic differentiation of nonsmooth iterative algorithms
Authors:
Jérôme Bolte,
Edouard Pauwels,
Samuel Vaiter
Abstract:
Differentiation along algorithms, i.e., piggyback propagation of derivatives, is now routinely used to differentiate iterative solvers in differentiable programming. Asymptotics is well understood for many smooth problems but the nondifferentiable case is hardly considered. Is there a limiting object for nonsmooth piggyback automatic differentiation (AD)? Does it have any variational meaning and c…
▽ More
Differentiation along algorithms, i.e., piggyback propagation of derivatives, is now routinely used to differentiate iterative solvers in differentiable programming. Asymptotics is well understood for many smooth problems but the nondifferentiable case is hardly considered. Is there a limiting object for nonsmooth piggyback automatic differentiation (AD)? Does it have any variational meaning and can it be used effectively in machine learning? Is there a connection with classical derivative? All these questions are addressed under appropriate nonexpansivity conditions in the framework of conservative derivatives which has proved useful in understanding nonsmooth AD. For nonsmooth piggyback iterations, we characterize the attractor set of nonsmooth piggyback iterations as a set-valued fixed point which remains in the conservative framework. This has various consequences and in particular almost everywhere convergence of classical derivatives. Our results are illustrated on parametric convex optimization problems with forward-backward, Douglas-Rachford and Alternating Direction of Multiplier algorithms as well as the Heavy-Ball method.
△ Less
Submitted 31 May, 2022;
originally announced June 2022.
-
Subgradient sampling for nonsmooth nonconvex minimization
Authors:
Jérôme Bolte,
Tam Le,
Edouard Pauwels
Abstract:
Risk minimization for nonsmooth nonconvex problems naturally leads to first-order sampling or, by an abuse of terminology, to stochastic subgradient descent. We establish the convergence of this method in the path-differentiable case and describe more precise results under additional geometric assumptions. We recover and improve results from Ermoliev and Norkin [Cybern. Syst. Anal., 34 (1998), pp.…
▽ More
Risk minimization for nonsmooth nonconvex problems naturally leads to first-order sampling or, by an abuse of terminology, to stochastic subgradient descent. We establish the convergence of this method in the path-differentiable case and describe more precise results under additional geometric assumptions. We recover and improve results from Ermoliev and Norkin [Cybern. Syst. Anal., 34 (1998), pp. 196--215] by using a different approach: conservative calculus and the ODE method. In the definable case, we show that first-order subgradient sampling avoids artificial critical points with probability one and applies moreover to a large range of risk minimization problems in deep learning, based on the backpropagation oracle. As byproducts of our approach, we obtain several results on integration of independent interest, such as an interchange result for conservative derivatives and integrals or the definability of set-valued parameterized integrals.
△ Less
Submitted 23 July, 2024; v1 submitted 28 February, 2022;
originally announced February 2022.
-
The Iterates of the Frank-Wolfe Algorithm May Not Converge
Authors:
Jérôme Bolte,
Cyrille W. Combettes,
Édouard Pauwels
Abstract:
The Frank-Wolfe algorithm is a popular method for minimizing a smooth convex function $f$ over a compact convex set $\mathcal{C}$. While many convergence results have been derived in terms of function values, hardly nothing is known about the convergence behavior of the sequence of iterates $(x_t)_{t\in\mathbb{N}}$. Under the usual assumptions, we design several counterexamples to the convergence…
▽ More
The Frank-Wolfe algorithm is a popular method for minimizing a smooth convex function $f$ over a compact convex set $\mathcal{C}$. While many convergence results have been derived in terms of function values, hardly nothing is known about the convergence behavior of the sequence of iterates $(x_t)_{t\in\mathbb{N}}$. Under the usual assumptions, we design several counterexamples to the convergence of $(x_t)_{t\in\mathbb{N}}$, where $f$ is $d$-time continuously differentiable, $d\geq2$, and $f(x_t)\to\min_\mathcal{C}f$. Our counterexamples cover the cases of open-loop, closed-loop, and line-search step-size strategies. We do not assume \emph{misspecification} of the linear minimization oracle and our results thus hold regardless of the points it returns, demonstrating the fundamental pathologies in the convergence behavior of $(x_t)_{t\in\mathbb{N}}$.
△ Less
Submitted 17 February, 2022;
originally announced February 2022.
-
Path differentiability of ODE flows
Authors:
Swann Marx,
Edouard Pauwels
Abstract:
We consider flows of ordinary differential equations (ODEs) driven by path differentiable vector fields. Path differentiable functions constitute a proper subclass of Lipschitz functions which admit conservative gradients, a notion of generalized derivative compatible with basic calculus rules. Our main result states that such flows inherit the path differentiability property of the driving vector…
▽ More
We consider flows of ordinary differential equations (ODEs) driven by path differentiable vector fields. Path differentiable functions constitute a proper subclass of Lipschitz functions which admit conservative gradients, a notion of generalized derivative compatible with basic calculus rules. Our main result states that such flows inherit the path differentiability property of the driving vector field. We show indeed that forward propagation of derivatives given by the sensitivity differential inclusions provide a conservative Jacobian for the flow. This allows to propose a nonsmooth version of the adjoint method, which can be applied to integral costs under an ODE constraint. This result constitutes a theoretical ground to the application of small step first order methods to solve a broad class of nonsmooth optimization problems with parametrized ODE constraints. This is illustrated with the convergence of small step first order methods based on the proposed nonsmooth adjoint.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Regularisation for PCA- and SVD-type matrix factorisations
Authors:
Abdolrahman Khoshrou,
Eric J. Pauwels
Abstract:
Singular Value Decomposition (SVD) and its close relative, Principal Component Analysis (PCA), are well-known linear matrix decomposition techniques that are widely used in applications such as dimension reduction and clustering. However, an important limitation of SVD/PCA is its sensitivity to noise in the input data. In this paper, we take another look at the problem of regularisation and show t…
▽ More
Singular Value Decomposition (SVD) and its close relative, Principal Component Analysis (PCA), are well-known linear matrix decomposition techniques that are widely used in applications such as dimension reduction and clustering. However, an important limitation of SVD/PCA is its sensitivity to noise in the input data. In this paper, we take another look at the problem of regularisation and show that different formulations of the minimisation problem lead to qualitatively different solutions.
△ Less
Submitted 24 June, 2021;
originally announced June 2021.
-
Numerical influence of ReLU'(0) on backpropagation
Authors:
David Bertoin,
Jérôme Bolte,
Sébastien Gerchinovitz,
Edouard Pauwels
Abstract:
In theory, the choice of ReLU(0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, 32 bits default precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (f…
▽ More
In theory, the choice of ReLU(0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, 32 bits default precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (fully connected, VGG, ResNet) and datasets (MNIST, CIFAR10, SVHN, ImageNet). We observe considerable variations of backpropagation outputs which occur around half of the time in 32 bits precision. The effect disappears with double precision, while it is systematic at 16 bits. For vanilla SGD training, the choice ReLU'(0) = 0 seems to be the most efficient. For our experiments on ImageNet the gain in test accuracy over ReLU'(0) = 1 was more than 10 points (two runs). We also evidence that reconditioning approaches as batch-norm or ADAM tend to buffer the influence of ReLU'(0)'s value. Overall, the message we convey is that algorithmic differentiation of nonsmooth problems potentially hides parameters that could be tuned advantageously.
△ Less
Submitted 3 November, 2023; v1 submitted 23 June, 2021;
originally announced June 2021.
-
Nonsmooth Implicit Differentiation for Machine Learning and Optimization
Authors:
Jérôme Bolte,
Tam Le,
Edouard Pauwels,
Antonio Silveti-Falls
Abstract:
In view of training increasingly complex learning architectures, we establish a nonsmooth implicit function theorem with an operational calculus. Our result applies to most practical problems (i.e., definable problems) provided that a nonsmooth form of the classical invertibility condition is fulfilled. This approach allows for formal subdifferentiation: for instance, replacing derivatives by Clar…
▽ More
In view of training increasingly complex learning architectures, we establish a nonsmooth implicit function theorem with an operational calculus. Our result applies to most practical problems (i.e., definable problems) provided that a nonsmooth form of the classical invertibility condition is fulfilled. This approach allows for formal subdifferentiation: for instance, replacing derivatives by Clarke Jacobians in the usual differentiation formulas is fully justified for a wide class of nonsmooth problems. Moreover this calculus is entirely compatible with algorithmic differentiation (e.g., backpropagation). We provide several applications such as training deep equilibrium networks, training neural nets with conic optimization layers, or hyperparameter-tuning for nonsmooth Lasso-type models. To show the sharpness of our assumptions, we present numerical experiments showcasing the extremely pathological gradient dynamics one can encounter when applying implicit algorithmic differentiation without any hypothesis.
△ Less
Submitted 5 April, 2022; v1 submitted 8 June, 2021;
originally announced June 2021.
-
Semialgebraic Representation of Monotone Deep Equilibrium Models and Applications to Certification
Authors:
Tong Chen,
Jean-Bernard Lasserre,
Victor Magron,
Edouard Pauwels
Abstract:
Deep equilibrium models are based on implicitly defined functional relations and have shown competitive performance compared with the traditional deep networks. Monotone operator equilibrium networks (monDEQ) retain interesting performance with additional theoretical guaranties. Existing certification tools for classical deep networks cannot directly be applied to monDEQs for which much fewer tool…
▽ More
Deep equilibrium models are based on implicitly defined functional relations and have shown competitive performance compared with the traditional deep networks. Monotone operator equilibrium networks (monDEQ) retain interesting performance with additional theoretical guaranties. Existing certification tools for classical deep networks cannot directly be applied to monDEQs for which much fewer tools exist. We introduce a semialgebraic representation for ReLU based monDEQs which allows to approximate the corresponding input output relation by semidefinite programming (SDP). We present several applications to network certification and obtain SDP models for the following problems : robustness certification, Lipschitz constant estimation, ellipsoidal uncertainty propagation. We use these models to certify robustness of monDEQs w.r.t. a general $L_q$ norm. Experimental results show that the proposed models outperform existing approaches for monDEQ certification. Furthermore, our investigations suggest that monDEQs are much more robust to $L_2$ perturbations than $L_{\infty}$ perturbations.
△ Less
Submitted 2 June, 2021;
originally announced June 2021.
-
Conservative parametric optimality and the ridge method for tame min-max problems
Authors:
Edouard Pauwels
Abstract:
We study the ridge method for min-max problems, and investigate its convergence without any convexity, differentiability or qualification assumption. The central issue is to determine whether the ''parametric optimality formula'' provides a conservative field, a notion of generalized derivative well suited for optimization. The answer to this question is positive in a semi-algebraic, and more gene…
▽ More
We study the ridge method for min-max problems, and investigate its convergence without any convexity, differentiability or qualification assumption. The central issue is to determine whether the ''parametric optimality formula'' provides a conservative field, a notion of generalized derivative well suited for optimization. The answer to this question is positive in a semi-algebraic, and more generally definable, context. The proof involves a new characterization of definable conservative fields which is of independent interest. As a consequence, the ridge method applied to definable objectives is proved to have a minimizing behavior and to converge to a set of equilibria which satisfy an optimality condition. Definability is key to our proof: we show that for a more general class of nonsmooth functions, conservativity of the parametric optimality formula may fail, resulting in an absurd behavior of the ridge method.
△ Less
Submitted 26 June, 2023; v1 submitted 1 April, 2021;
originally announced April 2021.
-
Second-order step-size tuning of SGD for non-convex optimization
Authors:
Camille Castera,
Jérôme Bolte,
Cédric Févotte,
Edouard Pauwels
Abstract:
In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. For doing so, one estimates curvature, based on a local quadratic model and using only noisy gradient approximations. One obtains a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version o…
▽ More
In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. For doing so, one estimates curvature, based on a local quadratic model and using only noisy gradient approximations. One obtains a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.
△ Less
Submitted 21 November, 2021; v1 submitted 5 March, 2021;
originally announced March 2021.
-
A Sublevel Moment-SOS Hierarchy for Polynomial Optimization
Authors:
Tong Chen,
Jean-Bernard Lasserre,
Victor Magron,
Edouard Pauwels
Abstract:
We introduce a sublevel Moment-SOS hierarchy where each SDP relaxation can be viewed as an intermediate (or interpolation) between the d-th and (d+1)-th order SDP relaxations of the Moment-SOS hierarchy (dense or sparse version). With the flexible choice of determining the size (level) and number (depth) of subsets in the SDP relaxation, one is able to obtain different improvements compared to the…
▽ More
We introduce a sublevel Moment-SOS hierarchy where each SDP relaxation can be viewed as an intermediate (or interpolation) between the d-th and (d+1)-th order SDP relaxations of the Moment-SOS hierarchy (dense or sparse version). With the flexible choice of determining the size (level) and number (depth) of subsets in the SDP relaxation, one is able to obtain different improvements compared to the d-th order relaxation, based on the machine memory capacity. In particular, we provide numerical experiments for d=1 and various types of problems both in combinatorial optimization (Max-Cut, Mixed Integer Programming) and deep learning (robustness certification, Lipschitz constant of neural networks), where the standard Lasserre's relaxation (or its sparse variant) is computationally intractable. In our numerical results, the lower bounds from the sublevel relaxations improve the bound from Shor's relaxation (first order Lasserre's relaxation) and are significantly closer to the optimal value or to the best-known lower/upper bounds.
△ Less
Submitted 13 January, 2021;
originally announced January 2021.
-
Sequential convergence of AdaGrad algorithm for smooth convex optimization
Authors:
Cheik Traoré,
Edouard Pauwels
Abstract:
We prove that the iterates produced by, either the scalar step size variant, or the coordinatewise variant of AdaGrad algorithm, are convergent sequences when applied to convex objective functions with Lipschitz gradient. The key insight is to remark that such AdaGrad sequences satisfy a variable metric quasi-Fejér monotonicity property, which allows to prove convergence.
We prove that the iterates produced by, either the scalar step size variant, or the coordinatewise variant of AdaGrad algorithm, are convergent sequences when applied to convex objective functions with Lipschitz gradient. The key insight is to remark that such AdaGrad sequences satisfy a variable metric quasi-Fejér monotonicity property, which allows to prove convergence.
△ Less
Submitted 13 April, 2021; v1 submitted 24 November, 2020;
originally announced November 2020.
-
A Hölderian backtracking method for min-max and min-min problems
Authors:
Jérôme Bolte,
Lilian Glaudin,
Edouard Pauwels,
Mathieu Serrurier
Abstract:
We present a new algorithm to solve min-max or min-min problems out of the convex world. We use rigidity assumptions, ubiquitous in learning, making our method applicable to many optimization problems. Our approach takes advantage of hidden regularity properties and allows us to devise a simple algorithm of ridge type. An original feature of our method is to come with automatic step size adaptatio…
▽ More
We present a new algorithm to solve min-max or min-min problems out of the convex world. We use rigidity assumptions, ubiquitous in learning, making our method applicable to many optimization problems. Our approach takes advantage of hidden regularity properties and allows us to devise a simple algorithm of ridge type. An original feature of our method is to come with automatic step size adaptation which departs from the usual overly cautious backtracking methods. In a general framework, we provide convergence theoretical guarantees and rates. We apply our findings on simple GAN problems obtaining promising numerical results.
△ Less
Submitted 17 July, 2020;
originally announced July 2020.
-
Incremental Without Replacement Sampling in Nonconvex Optimization
Authors:
Edouard Pauwels
Abstract:
Minibatch decomposition methods for empirical risk minimization are commonly analysed in a stochastic approximation setting, also known as sampling with replacement. On the other hands modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analysis are much scarcer. We provide convergence guaranties for the latter variant by analys…
▽ More
Minibatch decomposition methods for empirical risk minimization are commonly analysed in a stochastic approximation setting, also known as sampling with replacement. On the other hands modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analysis are much scarcer. We provide convergence guaranties for the latter variant by analysing a versatile incremental gradient scheme. For this scheme, we consider constant, decreasing or adaptive step sizes. In the smooth setting we obtain explicit complexity estimates in terms of epoch counter. In the nonsmooth setting we prove that the sequence is attracted by solutions of optimality conditions of the problem.
△ Less
Submitted 6 January, 2023; v1 submitted 15 July, 2020;
originally announced July 2020.
-
A mathematical model for automatic differentiation in machine learning
Authors:
Jerome Bolte,
Edouard Pauwels
Abstract:
Automatic differentiation, as implemented today, does not have a simple mathematical model adapted to the needs of modern machine learning. In this work we articulate the relationships between differentiation of programs as implemented in practice and differentiation of nonsmooth functions. To this end we provide a simple class of functions, a nonsmooth calculus, and show how they apply to stochas…
▽ More
Automatic differentiation, as implemented today, does not have a simple mathematical model adapted to the needs of modern machine learning. In this work we articulate the relationships between differentiation of programs as implemented in practice and differentiation of nonsmooth functions. To this end we provide a simple class of functions, a nonsmooth calculus, and show how they apply to stochastic approximation methods. We also evidence the issue of artificial critical points created by algorithmic differentiation and show how usual methods avoid these points with probability one.
△ Less
Submitted 29 October, 2020; v1 submitted 3 June, 2020;
originally announced June 2020.
-
Long term dynamics of the subgradient method for Lipschitz path differentiable functions
Authors:
Jerome Bolte,
Edouard Pauwels,
Rodolfo Rios-Zertuche
Abstract:
We consider the long-term dynamics of the vanishing stepsize subgradient method in the case when the objective function is neither smooth nor convex. We assume that this function is locally Lipschitz and path differentiable, i.e., admits a chain rule. Our study departs from other works in the sense that we focus on the behavoir of the oscillations, and to do this we use closed measures. We recover…
▽ More
We consider the long-term dynamics of the vanishing stepsize subgradient method in the case when the objective function is neither smooth nor convex. We assume that this function is locally Lipschitz and path differentiable, i.e., admits a chain rule. Our study departs from other works in the sense that we focus on the behavoir of the oscillations, and to do this we use closed measures. We recover known convergence results, establish new ones, and show a local principle of oscillation compensation for the velocities. Roughly speaking, the time average of gradients around one limit point vanishes. This allows us to further analyze the structure of oscillations, and establish their perpendicularity to the general drift.
△ Less
Submitted 29 May, 2020;
originally announced June 2020.
-
Semialgebraic Optimization for Lipschitz Constants of ReLU Networks
Authors:
Tong Chen,
Jean-Bernard Lasserre,
Victor Magron,
Edouard Pauwels
Abstract:
The Lipschitz constant of a network plays an important role in many applications of deep learning, such as robustness certification and Wasserstein Generative Adversarial Network. We introduce a semidefinite programming hierarchy to estimate the global and local Lipschitz constant of a multiple layer deep neural network. The novelty is to combine a polynomial lifting for ReLU functions derivatives…
▽ More
The Lipschitz constant of a network plays an important role in many applications of deep learning, such as robustness certification and Wasserstein Generative Adversarial Network. We introduce a semidefinite programming hierarchy to estimate the global and local Lipschitz constant of a multiple layer deep neural network. The novelty is to combine a polynomial lifting for ReLU functions derivatives with a weak generalization of Putinar's positivity certificate. This idea could also apply to other, nearly sparse, polynomial optimization problems in machine learning. We empirically demonstrate that our method provides a trade-off with respect to state of the art linear programming approach, and in some cases we obtain better bounds in less time.
△ Less
Submitted 28 October, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.
-
Curiosities and counterexamples in smooth convex optimization
Authors:
Jerome Bolte,
Edouard Pauwels
Abstract:
Counterexamples to some old-standing optimization problems in the smooth convex coercive setting are provided. We show that block-coordinate, steepest descent with exact search or Bregman descent methods do not generally converge. Other failures of various desirable features are established: directional convergence of Cauchy's gradient curves, convergence of Newton's flow, finite length of Tikhono…
▽ More
Counterexamples to some old-standing optimization problems in the smooth convex coercive setting are provided. We show that block-coordinate, steepest descent with exact search or Bregman descent methods do not generally converge. Other failures of various desirable features are established: directional convergence of Cauchy's gradient curves, convergence of Newton's flow, finite length of Tikhonov path, convergence of central paths, or smooth Kurdyka-Lojasiewicz inequality. All examples are planar. These examples are based on general smooth convex interpolation results. Given a decreasing sequence of positively curved C k convex compact sets in the plane, we provide a level set interpolation of a C k smooth convex function where k $\ge$ 2 is arbitrary. If the intersection is reduced to one point our interpolant has positive definite Hessian, otherwise it is positive definite out of the solution set. Furthermore , given a sequence of decreasing polygons we provide an interpolant agreeing with the vertices and whose gradients coincide with prescribed normals.
△ Less
Submitted 29 January, 2020; v1 submitted 22 January, 2020;
originally announced January 2020.
-
To snipe or not to snipe, that is the question! Transitions in sniping behaviour among competing algorithmic traders
Authors:
Somayeh Kokabisaghi,
Eric J Pauwels,
Andre B Dorsman
Abstract:
In this paper we extend the investigation into the transition from sure to probabilistic sniping as introduced in Menkveld and Zoican \cite{mz2017}. In that paper, the authors introduce a stylized version of a competitive game in which high frequency traders (HFTs) interact with each other and liquidity traders. The authors then show that risk aversion plays an important role in the transition fro…
▽ More
In this paper we extend the investigation into the transition from sure to probabilistic sniping as introduced in Menkveld and Zoican \cite{mz2017}. In that paper, the authors introduce a stylized version of a competitive game in which high frequency traders (HFTs) interact with each other and liquidity traders. The authors then show that risk aversion plays an important role in the transition from sure to mixed (or probabilistic) sniping. In this paper, we re-interpret and extend these conclusions in the context of repeated games and highlight some differences in results. In particular, we identify situations in which probabilistic sniping is genuinely profitable that are qualitatively different from the ones obtained in \cite{mz2017}. It turns out that beyond a specific risk aversion threshold the game resembles the well-known prisoner's dilemma, in that probabilistic sniping becomes a way to cooperate among the HFTs that leaves all the participants better off. In order to turn this into a viable strategy for the repeated game, we show how compliance can be monitored through the use of sequential statistical testing. Keywords: algorithmic trading, bandits, high-frequency exchange, Nash equilibrium, repeated games, sniping, subgame-perfect equilibrium, Sequential probability ratio, transition
△ Less
Submitted 11 September, 2020; v1 submitted 9 December, 2019;
originally announced December 2019.
-
Rate of convergence for geometric inference based on the empirical Christoffel function
Authors:
Mai Trang Vu,
François Bachoc,
Edouard Pauwels
Abstract:
We consider the problem of estimating the support of a measure from a finite, independent, sample. The estimators which are considered are constructed based on the empirical Christoffel function. Such estimators have been proposed for the problem of set estimation with heuristic justifications. We carry out a detailed finite sample analysis, that allows us to select the threshold and degree parame…
▽ More
We consider the problem of estimating the support of a measure from a finite, independent, sample. The estimators which are considered are constructed based on the empirical Christoffel function. Such estimators have been proposed for the problem of set estimation with heuristic justifications. We carry out a detailed finite sample analysis, that allows us to select the threshold and degree parameters as a function of the sample size. We provide a convergence rate analysis of the resulting support estimation procedure. Our analysis establishes that we may obtain finite sample bounds which are comparable to existing rates for different set estimation procedures. Our results rely on concentration inequalities for the empirical Christoffel function and on estimates of the supremum of the Christoffel-Darboux kernel on sets with smooth boundaries, that can be considered of independent interest.
△ Less
Submitted 19 May, 2020; v1 submitted 31 October, 2019;
originally announced October 2019.
-
Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning
Authors:
Jérôme Bolte,
Edouard Pauwels
Abstract:
Modern problems in AI or in numerical analysis require nonsmooth approaches with a flexible calculus. We introduce generalized derivatives called conservative fields for which we develop a calculus and provide representation formulas. Functions having a conservative field are called path differentiable: convex, concave, Clarke regular and any semialgebraic Lipschitz continuous functions are path d…
▽ More
Modern problems in AI or in numerical analysis require nonsmooth approaches with a flexible calculus. We introduce generalized derivatives called conservative fields for which we develop a calculus and provide representation formulas. Functions having a conservative field are called path differentiable: convex, concave, Clarke regular and any semialgebraic Lipschitz continuous functions are path differentiable. Using Whitney stratification techniques for semialgebraic and definable sets, our model provides variational formulas for nonsmooth automatic differentiation oracles, as for instance the famous backpropagation algorithm in deep learning. Our differential model is applied to establish the convergence in values of nonsmooth stochastic gradient methods as they are implemented in practice.
△ Less
Submitted 9 April, 2020; v1 submitted 23 September, 2019;
originally announced September 2019.
-
An Inertial Newton Algorithm for Deep Learning
Authors:
Camille Castera,
Jérôme Bolte,
Cédric Févotte,
Edouard Pauwels
Abstract:
We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both…
▽ More
We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both gradient-descent and Newton-like behaviors as well as inertia. We prove the convergence of INNA for most deep learning problems. To do so, we provide a well-suited framework to analyze deep learning loss functions involving tame optimization in which we study a continuous dynamical system together with its discrete stochastic approximations. We prove sublinear convergence for the continuous-time differential inclusion which underlies our algorithm. Additionally, we also show how standard optimization mini-batch methods applied to non-smooth non-convex problems can yield a certain type of spurious stationary points never discussed before. We address this issue by providing a theoretical framework around the new idea of $D$-criticality; we then give a simple asymptotic analysis of INNA. Our algorithm allows for using an aggressive learning rate of $o(1/\log k)$. From an empirical viewpoint, we show that INNA returns competitive results with respect to state of the art (stochastic gradient descent, ADAGRAD, ADAM) on popular deep learning benchmark problems.
△ Less
Submitted 28 July, 2021; v1 submitted 29 May, 2019;
originally announced May 2019.
-
Semi-algebraic approximation using Christoffel-Darboux kernel
Authors:
Swann Marx,
Edouard Pauwels,
Tillmann Weisser,
Didier Henrion,
Jean Lasserre
Abstract:
We provide a new method to approximate a (possibly discontinuous) function using Christoffel-Darboux kernels. Our knowledge about the unknown multivariate function is in terms of finitely many moments of the Young measure supported on the graph of the function. Such an input is available when approximating weak (or measure-valued) solution of optimal control problems, entropy solutions to non-line…
▽ More
We provide a new method to approximate a (possibly discontinuous) function using Christoffel-Darboux kernels. Our knowledge about the unknown multivariate function is in terms of finitely many moments of the Young measure supported on the graph of the function. Such an input is available when approximating weak (or measure-valued) solution of optimal control problems, entropy solutions to non-linear hyperbolic PDEs, or using numerical integration from finitely many evaluations of the function. While most of the existing methods construct a piecewise polynomial approximation, we construct a semi-algebraic approximation whose estimation and evaluation can be performed efficiently. An appealing feature of this method is that it deals with non-smoothness implicitly so that a single scheme can be used to treat smooth or non-smooth functions without any prior knowledge. On the theoretical side, we prove pointwise convergence almost everywhere as well as convergence in the Lebesgue one norm under broad assumptions. Using more restrictive assumptions, we obtain explicit convergence rates. We illustrate our approach on various examples from control and approximation. In particular we observe empirically that our method does not suffer from the the Gibbs phenomenon when approximating discontinuous functions.
△ Less
Submitted 8 April, 2021; v1 submitted 3 April, 2019;
originally announced April 2019.
-
Data analysis from empirical moments and the Christoffel function
Authors:
Edouard Pauwels,
Mihai Putinar,
Jean-Bernard Lasserre
Abstract:
Spectral features of the empirical moment matrix constitute a resourceful tool for unveiling properties of a cloud of points, among which, density, support and latent structures. It is already well known that the empirical moment matrix encodes a great deal of subtle attributes of the underlying measure. Starting from this object as base of observations we combine ideas from statistics, real algeb…
▽ More
Spectral features of the empirical moment matrix constitute a resourceful tool for unveiling properties of a cloud of points, among which, density, support and latent structures. It is already well known that the empirical moment matrix encodes a great deal of subtle attributes of the underlying measure. Starting from this object as base of observations we combine ideas from statistics, real algebraic geometry, orthogonal polynomials and approximation theory for opening new insights relevant for Machine Learning (ML) problems with data supported on singular sets. Refined concepts and results from real algebraic geometry and approximation theory are empowering a simple tool (the empirical moment matrix) for the task of solving non-trivial questions in data analysis. We provide (1) theoretical support, (2) numerical experiments and, (3) connections to real world data as a validation of the stamina of the empirical moment matrix approach.
△ Less
Submitted 10 January, 2020; v1 submitted 19 October, 2018;
originally announced October 2018.
-
An unexpected connection between Bayes $A-$optimal designs and the Group Lasso
Authors:
Guillaume Sagnol,
Edouard Pauwels
Abstract:
We show that the $A$-optimal design optimization problem over $m$ design points in $\mathbb{R}^n$ is equivalent to minimizing a quadratic function plus a group lasso sparsity inducing term over $n\times m$ real matrices. This observation allows to describe several new algorithms for $A$-optimal design based on splitting and block coordinate decomposition. These techniques are well known and proved…
▽ More
We show that the $A$-optimal design optimization problem over $m$ design points in $\mathbb{R}^n$ is equivalent to minimizing a quadratic function plus a group lasso sparsity inducing term over $n\times m$ real matrices. This observation allows to describe several new algorithms for $A$-optimal design based on splitting and block coordinate decomposition. These techniques are well known and proved powerful to treat large scale problems in machine learning and signal processing communities. The proposed algorithms come with rigorous convergence guaranties and convergence rate estimate stemming from the optimization literature. Performances are illustrated on synthetic benchmarks and compared to existing methods for solving the optimal design problem.
△ Less
Submitted 6 September, 2018;
originally announced September 2018.
-
SVD-based Visualisation and Approximation for Time Series Data in Smart Energy Systems
Authors:
Abdolrahman Khoshrou,
Andre B. Dorsman,
Eric. J. Pauwels
Abstract:
Many time series in smart energy systems exhibit two different timescales. On the one hand there are patterns linked to daily human activities. On the other hand, there are relatively slow trends linked to seasonal variations. In this paper we interpret these time series as matrices, to be visualized as images. This approach has two advantages: First of all, interpreting such time series as images…
▽ More
Many time series in smart energy systems exhibit two different timescales. On the one hand there are patterns linked to daily human activities. On the other hand, there are relatively slow trends linked to seasonal variations. In this paper we interpret these time series as matrices, to be visualized as images. This approach has two advantages: First of all, interpreting such time series as images enables one to visually integrate across the image and makes it therefore easier to spot subtle or faint features. Second, the matrix interpretation also grants elucidation of the underlying structure using well-established matrix decomposition methods. We will illustrate both these aspects for data obtained from the German day-ahead market.
△ Less
Submitted 11 July, 2018;
originally announced July 2018.
-
Quantifying Volatility Reduction in German Day-ahead Spot Market in the Period 2006 through 2016
Authors:
Abdolrahman Khoshrou,
Eric J. Pauwels
Abstract:
In Europe, Germany is taking the lead in the switch from the conventional to renewable energy. This poses new challenges as wind and solar energy are fundamentally intermittent, weather-dependent and less predictable. It is therefore of considerable interest to investigate the evolution of price volatility in this post-transition era. There are a number of reasons, however, that makes the practica…
▽ More
In Europe, Germany is taking the lead in the switch from the conventional to renewable energy. This poses new challenges as wind and solar energy are fundamentally intermittent, weather-dependent and less predictable. It is therefore of considerable interest to investigate the evolution of price volatility in this post-transition era. There are a number of reasons, however, that makes the practical studies difficult. For instance, EPEX prices can be zero or negative. Consequently, the standard approach in financial time series analysis to switch to logarithmic measures is inapplicable. Furthermore, in contrast to the stock market prices which are only available for trading days, EPEX prices cover the whole year, including weekends and holidays. Accordingly, there is a lot of underlying variability in the data which has nothing to do with volatility, but simply reflects diurnal activity patterns. An important distinction of the present work is the application of matrix decomposition techniques, namely the singular value decomposition (SVD), for defining an alternative notion of volatility. This approach is systematically more robust toward outliers and also the diurnal patterns. Our observations show that the day-ahead market is becoming less volatile in recent years.
△ Less
Submitted 19 July, 2018;
originally announced July 2018.
-
Data-driven pattern identification and outlier detection in time series
Authors:
Abdolrahman Khoshrou,
Eric J. Pauwels
Abstract:
We address the problem of data-driven pattern identification and outlier detection in time series. To this end, we use singular value decomposition (SVD) which is a well-known technique to compute a low-rank approximation for an arbitrary matrix. By recasting the time series as a matrix it becomes possible to use SVD to highlight the underlying patterns and periodicities. This is done without the…
▽ More
We address the problem of data-driven pattern identification and outlier detection in time series. To this end, we use singular value decomposition (SVD) which is a well-known technique to compute a low-rank approximation for an arbitrary matrix. By recasting the time series as a matrix it becomes possible to use SVD to highlight the underlying patterns and periodicities. This is done without the need for specifying user-defined parameters. From a data mining perspective, this opens up new ways of analyzing time series in a data-driven, bottom-up fashion. However, in order to get correct results, it is important to understand how the SVD-spectrum of a time series is influenced by various characteristics of the underlying signal and noise. In this paper, we have extended the work in earlier papers by initiating a more systematic analysis of these effects. We then illustrate our findings on some real-life data.
△ Less
Submitted 25 June, 2018;
originally announced July 2018.
-
Relating Leverage Scores and Density using Regularized Christoffel Functions
Authors:
Edouard Pauwels,
Francis Bach,
Jean-Philippe Vert
Abstract:
Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling with applications to low rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite k…
▽ More
Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling with applications to low rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows to elucidate their relationships with the chosen kernel as well as population density. Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings.
△ Less
Submitted 21 November, 2018; v1 submitted 21 May, 2018;
originally announced May 2018.
-
The multiproximal linearization method for convex composite problems
Authors:
Jérôme Bolte,
Zheng Chen,
Edouard Pauwels
Abstract:
Composite minimization involves a collection of smooth functions which are aggregated in a nonsmooth manner. In the convex setting, we design an algorithm by linearizing each smooth component in accordance with its main curvature. The resulting method, called the Multiprox method, consists in solving successively simple problems (e.g. constrained quadratic problems) which can also feature some pro…
▽ More
Composite minimization involves a collection of smooth functions which are aggregated in a nonsmooth manner. In the convex setting, we design an algorithm by linearizing each smooth component in accordance with its main curvature. The resulting method, called the Multiprox method, consists in solving successively simple problems (e.g. constrained quadratic problems) which can also feature some proximal operators. To study the complexity and the convergence of this method we are led to study quantitative qualification conditions to understand the impact of multipliers on the complexity bounds. We obtain explicit complexity results of the form $O(\frac{1}{k})$ involving new types of constant terms. A distinctive feature of our approach is to be able to cope with oracles involving moving constraints. Our method is flexible enough to include the moving balls method, the proximal Gauss-Newton's method, or the forward-backward splitting, for which we recover known complexity results or establish new ones. We show through several numerical experiments how the use of multiple proximal terms can be decisive for problems with complex geometries.
△ Less
Submitted 23 March, 2019; v1 submitted 7 December, 2017;
originally announced December 2017.
-
Qualification Conditions in Semi-algebraic Programming
Authors:
Jérôme Bolte,
Antoine Hochart,
Edouard Pauwels
Abstract:
For an arbitrary finite family of semi-algebraic/definable functions, we consider the corresponding inequality constraint set and we study qualification conditions for perturbations of this set. In particular we prove that all positive diagonal perturbations, save perhaps a finite number of them, ensure that any point within the feasible set satisfies Mangasarian-Fromovitz constraint qualification…
▽ More
For an arbitrary finite family of semi-algebraic/definable functions, we consider the corresponding inequality constraint set and we study qualification conditions for perturbations of this set. In particular we prove that all positive diagonal perturbations, save perhaps a finite number of them, ensure that any point within the feasible set satisfies Mangasarian-Fromovitz constraint qualification. Using the Milnor-Thom theorem, we provide a bound for the number of singular perturbations when the constraints are polynomial functions. Examples show that the order of magnitude of our exponential bound is relevant. Our perturbation approach provides a simple protocol to build sequences of "regular" problems approximating an arbitrary semi-algebraic/definable problem. Applications to sequential quadratic programming methods and sum of squares relaxation are provided.
△ Less
Submitted 7 March, 2018; v1 submitted 23 May, 2017;
originally announced May 2017.
-
On Fienup Methods for Regularized Phase Retrieval
Authors:
Edouard Pauwels,
Amir Beck,
Yonina C. Eldar,
Shoham Sabach
Abstract:
Alternating minimization, or Fienup methods, have a long history in phase retrieval. We provide new insights related to the empirical and theoretical analysis of these algorithms when used with Fourier measurements and combined with convex priors. In particular, we show that Fienup methods can be viewed as performing alternating minimization on a regularized nonconvex least-squares problem with re…
▽ More
Alternating minimization, or Fienup methods, have a long history in phase retrieval. We provide new insights related to the empirical and theoretical analysis of these algorithms when used with Fourier measurements and combined with convex priors. In particular, we show that Fienup methods can be viewed as performing alternating minimization on a regularized nonconvex least-squares problem with respect to amplitude measurements. We then prove that under mild additional structural assumptions on the prior (semi-algebraicity), the sequence of signal estimates has a smooth convergent behaviour towards a critical point of the nonconvex regularized least-squares objective. Finally, we propose an extension to Fienup techniques, based on a projected gradient descent interpretation and acceleration using inertial terms. We demonstrate experimentally that this modification combined with an $\ell_1$ prior constitutes a competitive approach for sparse phase retrieval.
△ Less
Submitted 27 February, 2017;
originally announced February 2017.
-
The empirical Christoffel function with applications in data analysis
Authors:
Jean-Bernard Lasserre,
Edouard Pauwels
Abstract:
We illustrate the potential applications in machine learning of the Christoffel function, or more precisely, its empirical counterpart associated with a counting measure uniformly supported on a finite set of points. Firstly, we provide a thresholding scheme which allows to approximate the support of a measure from a finite subset of its moments with strong asymptotic guaranties. Secondly, we prov…
▽ More
We illustrate the potential applications in machine learning of the Christoffel function, or more precisely, its empirical counterpart associated with a counting measure uniformly supported on a finite set of points. Firstly, we provide a thresholding scheme which allows to approximate the support of a measure from a finite subset of its moments with strong asymptotic guaranties. Secondly, we provide a consistency result which relates the empirical Christoffel function and its population counterpart in the limit of large samples. Finally, we illustrate the relevance of our results on simulated and real world datasets for several applications in statistics and machine learning: (a) density and support estimation from finite samples, (b) outlier and novelty detection and (c) affine matching.
△ Less
Submitted 7 February, 2019; v1 submitted 11 January, 2017;
originally announced January 2017.
-
Extragradient Method in Optimization: Convergence and Complexity
Authors:
Trong Phong Nguyen,
Edouard Pauwels,
Emile Richard,
Bruce W. Suter
Abstract:
We consider the extragradient method to minimize the sum of two functions, the first one being smooth and the second being convex. Under the Kurdyka-Lojasiewicz assumption, we prove that the sequence produced by the extragradient method converges to a critical point of the problem and has finite length. The analysis is extended to the case when both functions are convex. We provide, in this case,…
▽ More
We consider the extragradient method to minimize the sum of two functions, the first one being smooth and the second being convex. Under the Kurdyka-Lojasiewicz assumption, we prove that the sequence produced by the extragradient method converges to a critical point of the problem and has finite length. The analysis is extended to the case when both functions are convex. We provide, in this case, a sublinear convergence rate, as for gradient-based methods. Furthermore, we show that the recent small-prox complexity result can be applied to this method. Considering the extragradient method is an occasion to describe an exact line search scheme for proximal decomposition methods. We provide details for the implementation of this scheme for the one norm regularized least squares problem and demonstrate numerical results which suggest that combining nonaccelerated methods with exact line search can be a competitive choice.
△ Less
Submitted 13 December, 2017; v1 submitted 26 September, 2016;
originally announced September 2016.
-
Sorting out typicality with the inverse moment matrix SOS polynomial
Authors:
Jean-Bernard Lasserre,
Edouard Pauwels
Abstract:
We study a surprising phenomenon related to the representation of a cloud of data points using polynomials. We start with the previously unnoticed empirical observation that, given a collection (a cloud) of data points, the sublevel sets of a certain distinguished polynomial capture the shape of the cloud very accurately. This distinguished polynomial is a sum-of-squares (SOS) derived in a simple…
▽ More
We study a surprising phenomenon related to the representation of a cloud of data points using polynomials. We start with the previously unnoticed empirical observation that, given a collection (a cloud) of data points, the sublevel sets of a certain distinguished polynomial capture the shape of the cloud very accurately. This distinguished polynomial is a sum-of-squares (SOS) derived in a simple manner from the inverse of the empirical moment matrix. In fact, this SOS polynomial is directly related to orthogonal polynomials and the Christoffel function. This allows to generalize and interpret extremality properties of orthogonal polynomials and to provide a mathematical rationale for the observed phenomenon. Among diverse potential applications, we illustrate the relevance of our results on a network intrusion detection task for which we obtain performances similar to existing dedicated methods reported in the literature.
△ Less
Submitted 14 June, 2016; v1 submitted 13 June, 2016;
originally announced June 2016.
-
Positivity certificates in optimal control
Authors:
Edouard Pauwels,
Didier Henrion,
Jean-Bernard Lasserre
Abstract:
We propose a tutorial on relaxations and weak formulations of optimal control with their semidefinite approximations. We present this approach solely through the prism of positivity certificates which we consider to be the most accessible for a broad audience, in particular in the engineering and robotics communities. This simple concept allows to express very concisely powerful approximation cert…
▽ More
We propose a tutorial on relaxations and weak formulations of optimal control with their semidefinite approximations. We present this approach solely through the prism of positivity certificates which we consider to be the most accessible for a broad audience, in particular in the engineering and robotics communities. This simple concept allows to express very concisely powerful approximation certificates in control. The relevance of this technique is illustrated on three applications: region of attraction approximation, direct optimal control and inverse optimal control, for which it constitutes a common denominator. In a first step, we highlight the core mechanisms underpinning the application of positivity in control and how they appear in the different control applications. This relies on simple mathematical concepts and gives a unified treatment of the applications considered. This presentation is based on the combination and simplification of published materials. In a second step, we describe briefly relations with broader literature, in particular, occupation measures and Hamilton-Jacobi-Bellman equation which are important elements of the global picture. We describe the Sum-Of-Squares (SOS) semidefinite hierarchy in the semialgebraic case and briefly mention its convergence properties. Numerical experiments on a classical example in robotics, namely the nonholonomic vehicle, illustrate the concepts presented in the text for the three applications considered.
△ Less
Submitted 9 May, 2016;
originally announced May 2016.
-
The value function approach to convergence analysis in composite optimization
Authors:
Edouard Pauwels
Abstract:
This works aims at understanding further convergence properties of first order local search methods with complex geometries. We focus on the composite optimization model which unifies within a simple formalism many problems of this type. We provide a general convergence analysis of the composite Gauss-Newton method under tameness assumptions (an extension of semi-algebraicity). Tameness is a very…
▽ More
This works aims at understanding further convergence properties of first order local search methods with complex geometries. We focus on the composite optimization model which unifies within a simple formalism many problems of this type. We provide a general convergence analysis of the composite Gauss-Newton method under tameness assumptions (an extension of semi-algebraicity). Tameness is a very general condition satisfied by virtually all problems solved in practice. The analysis is based on recent progresses in understanding convergence properties of sequential convex programming methods through the value function.
△ Less
Submitted 6 October, 2016; v1 submitted 6 April, 2016;
originally announced April 2016.
-
The Mid-Infrared Instrument for the James Webb Space Telescope, VI: The Medium Resolution Spectrometer
Authors:
Martyn Wells,
J. -W. Pel,
Alistair Glasse,
G. S. Wright,
Gabby Aitink-Kroes,
Ruyman Azzollini,
Steven Beard,
B. R. Brandl,
Angus Gallie,
V. C. Geers,
A. M. Glauser,
Peter Hastings,
Th. Henning,
Rieks Jager,
K. Justtanont,
Bob Kruizinga,
Fred Lahuis,
David Lee,
I. Martinez-Delgado,
J. R. Martinez-Galarza,
M. Meijers,
Jane E. Morrison,
Friedrich Mueller,
Thodori Nakos,
Brian O'Sullivan
, et al. (13 additional authors not shown)
Abstract:
We describe the design and performance of the Medium Resolution Spectrometer (MRS) for the JWST-MIRI instrument. The MRS incorporates four coaxial spectral channels in a compact opto-mechanical layout that generates spectral images over fields of view up to 7.7 X 7.7 arcseconds in extent and at spectral resolving powers ranging from 1,300 to 3,700. Each channel includes an all-reflective integral…
▽ More
We describe the design and performance of the Medium Resolution Spectrometer (MRS) for the JWST-MIRI instrument. The MRS incorporates four coaxial spectral channels in a compact opto-mechanical layout that generates spectral images over fields of view up to 7.7 X 7.7 arcseconds in extent and at spectral resolving powers ranging from 1,300 to 3,700. Each channel includes an all-reflective integral field unit (IFU): an 'image slicer' that reformats the input field for presentation to a grating spectrometer. Two 1024 X 1024 focal plane arrays record the output spectral images with an instantaneous spectral coverage of approximately one third of the full wavelength range of each channel. The full 5 to 28.5 micron spectrum is then obtained by making three exposures using gratings and pass-band-determining filters that are selected using just two three-position mechanisms. The expected on-orbit optical performance is presented, based on testing of the MIRI Flight Model and including spectral and spatial coverage and resolution. The point spread function of the reconstructed images is shown to be diffraction limited and the optical transmission is shown to be consistent with the design expectations.
△ Less
Submitted 12 August, 2015;
originally announced August 2015.