subscribe to arXiv mailings

A second-order-like optimizer with adaptive gradient scaling for deep learning

Authors: Jérôme Bolte, Ryan Boustany, Edouard Pauwels, Andrei Purica

Abstract: In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with the RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory requirements of standard DL methods as AdamW or SGD with momentum.After having recalled our geometrical motivations, we provide quite extensive experiments. On image classif… ▽ More In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with the RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory requirements of standard DL methods as AdamW or SGD with momentum.After having recalled our geometrical motivations, we provide quite extensive experiments. On image classification (CIFAR-10, ImageNet) and language modeling (GPT-2), INNAprop consistently matches or outperforms AdamW both in training speed and accuracy, with minimal hyperparameter tuning in large-scale settings. Our code is publicly available at \url{https://github.com/innaprop/innaprop}. △ Less

Submitted 8 October, 2024; originally announced October 2024.

arXiv:2408.08399 [pdf, other]

An Efficient and Explainable Transformer-Based Few-Shot Learning for Modeling Electricity Consumption Profiles Across Thousands of Domains

Authors: Weijie Xia, Gao Peng, Chenguang Wang, Peter Palensky, Eric Pauwels, Pedro P. Vergara

Abstract: Electricity Consumption Profiles (ECPs) are crucial for operating and planning power distribution systems, especially with the increasing numbers of various low-carbon technologies such as solar panels and electric vehicles. Traditional ECP modeling methods typically assume the availability of sufficient ECP data. However, in practice, the accessibility of ECP data is limited due to privacy issues… ▽ More Electricity Consumption Profiles (ECPs) are crucial for operating and planning power distribution systems, especially with the increasing numbers of various low-carbon technologies such as solar panels and electric vehicles. Traditional ECP modeling methods typically assume the availability of sufficient ECP data. However, in practice, the accessibility of ECP data is limited due to privacy issues or the absence of metering devices. Few-shot learning (FSL) has emerged as a promising solution for ECP modeling in data-scarce scenarios. Nevertheless, standard FSL methods, such as those used for images, are unsuitable for ECP modeling because (1) these methods usually assume several source domains with sufficient data and several target domains. However, in the context of ECP modeling, there may be thousands of source domains with a moderate amount of data and thousands of target domains. (2) Standard FSL methods usually involve cumbersome knowledge transfer mechanisms, such as pre-training and fine-tuning, whereas ECP modeling requires more lightweight methods. (3) Deep learning models often lack explainability, hindering their application in industry. This paper proposes a novel FSL method that exploits Transformers and Gaussian Mixture Models (GMMs) for ECP modeling to address the above-described issues. Results show that our method can accurately restore the complex ECP distribution with a minimal amount of ECP data (e.g., only 1.6\% of the complete domain dataset) while it outperforms state-of-the-art time series modeling methods, maintaining the advantages of being both lightweight and interpretable. The project is open-sourced at https://github.com/xiaweijie1996/TransformerEM-GMM.git. △ Less

Submitted 22 August, 2024; v1 submitted 15 August, 2024; originally announced August 2024.

arXiv:2407.12372 [pdf, other]

Geometric and computational hardness of bilevel programming

Authors: Jérôme Bolte, Quoc-Tung Le, Edouard Pauwels, Samuel Vaiter

Abstract: We first show a simple but striking result in bilevel optimization: unconstrained $C^\infty$ smooth bilevel programming is as hard as general extended-real-valued lower semicontinuous minimization. We then proceed to a worst-case analysis of box-constrained bilevel polynomial optimization. We show in particular that any extended-real-valued semi-algebraic function, possibly non-continuous, can be… ▽ More We first show a simple but striking result in bilevel optimization: unconstrained $C^\infty$ smooth bilevel programming is as hard as general extended-real-valued lower semicontinuous minimization. We then proceed to a worst-case analysis of box-constrained bilevel polynomial optimization. We show in particular that any extended-real-valued semi-algebraic function, possibly non-continuous, can be expressed as the value function of a polynomial bilevel program. Secondly, from a computational complexity perspective, the decision version of polynomial bilevel programming is one level above NP in the polynomial hierarchy ($Σ^p_2$-hard). Both types of difficulties are uncommon in non-linear programs for which objective functions are typically continuous and belong to the class NP. These results highlight the irremediable hardness attached to general bilevel optimization and the necessity of imposing some form of regularity on the lower level. △ Less

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2405.20744 [pdf, other]

On the sequential convergence of Lloyd's algorithms

Authors: Léo Portales, Elsa Cazelles, Edouard Pauwels

Abstract: Lloyd's algorithm is an iterative method that solves the quantization problem, i.e. the approximation of a target probability measure by a discrete one, and is particularly used in digital applications.This algorithm can be interpreted as a gradient method on a certain quantization functional which is given by optimal transport. We study the sequential convergence (to a single accumulation point)… ▽ More Lloyd's algorithm is an iterative method that solves the quantization problem, i.e. the approximation of a target probability measure by a discrete one, and is particularly used in digital applications.This algorithm can be interpreted as a gradient method on a certain quantization functional which is given by optimal transport. We study the sequential convergence (to a single accumulation point) for two variants of Lloyd's method: (i) optimal quantization with an arbitrary discrete measure and (ii) uniform quantization with a uniform discrete measure. For both cases, we prove sequential convergence of the iterates under an analiticity assumption on the density of the target measure. This includes for example analytic densities truncated to a compact semi-algebraic set. The argument leverages the log analytic nature of globally subanalytic integrals, the interpretation of Lloyd's method as a gradient method and the convergence analysis of gradient algorithms under Kurdyka-Lojasiewicz assumptions. As a by-product, we also obtain definability results for more general semi-discrete optimal transport losses such as transport distances with general costs, the max-sliced Wasserstein distance and the entropy regularized optimal transport loss. △ Less

Submitted 2 July, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

arXiv:2405.15894 [pdf, other]

Derivatives of Stochastic Gradient Descent

Authors: Franck Iutzeler, Edouard Pauwels, Samuel Vaiter

Abstract: We consider stochastic optimization problems where the objective depends on some parameter, as commonly found in hyperparameter optimization for instance. We investigate the behavior of the derivatives of the iterates of Stochastic Gradient Descent (SGD) with respect to that parameter and show that they are driven by an inexact SGD recursion on a different objective function, perturbed by the conv… ▽ More We consider stochastic optimization problems where the objective depends on some parameter, as commonly found in hyperparameter optimization for instance. We investigate the behavior of the derivatives of the iterates of Stochastic Gradient Descent (SGD) with respect to that parameter and show that they are driven by an inexact SGD recursion on a different objective function, perturbed by the convergence of the original SGD. This enables us to establish that the derivatives of SGD converge to the derivative of the solution mapping in terms of mean squared error whenever the objective is strongly convex. Specifically, we demonstrate that with constant step-sizes, these derivatives stabilize within a noise ball centered at the solution derivative, and that with vanishing step-sizes they exhibit $O(\log(k)^2 / k)$ convergence rates. Additionally, we prove exponential convergence in the interpolation regime. Our theoretical findings are illustrated by numerical experiments on synthetic tasks. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2404.19517 [pdf, ps, other]

Inexact subgradient methods for semialgebraic functions

Authors: Jérôme Bolte, Tam Le, Éric Moulines, Edouard Pauwels

Abstract: Motivated by the widespread use of approximate derivatives in machine learning and optimization, we study inexact subgradient methods with non-vanishing additive errors and step sizes. In the nonconvex semialgebraic setting, under boundedness assumptions, we prove that the method provides points that eventually fluctuate close to the critical set at a distance proportional to $ε^ρ$ where $ε$ is t… ▽ More Motivated by the widespread use of approximate derivatives in machine learning and optimization, we study inexact subgradient methods with non-vanishing additive errors and step sizes. In the nonconvex semialgebraic setting, under boundedness assumptions, we prove that the method provides points that eventually fluctuate close to the critical set at a distance proportional to $ε^ρ$ where $ε$ is the error in subgradient evaluation and $ρ$ relates to the geometry of the problem. In the convex setting, we provide complexity results for the averaged values. We also obtain byproducts of independent interest, such as descent-like lemmas for nonsmooth nonconvex problems and some results on the limit of affine interpolants of differential inclusions. △ Less

Submitted 30 April, 2024; originally announced April 2024.

arXiv:2402.09831 [pdf, other]

Generic Fr{é}chet stationarity in constrained optimization

Authors: Edouard Pauwels

Abstract: Minimizing a smooth function f on a closed subset C leads to different notions of stationarity: Fr{é}chet stationarity, which carries a strong variational meaning, and criticallity, which is defined through a closure process. The latter is an optimality condition which may loose the variational meaning of Fr{é}chet stationarity in some settings. We show that, while criticality is the appropriate n… ▽ More Minimizing a smooth function f on a closed subset C leads to different notions of stationarity: Fr{é}chet stationarity, which carries a strong variational meaning, and criticallity, which is defined through a closure process. The latter is an optimality condition which may loose the variational meaning of Fr{é}chet stationarity in some settings. We show that, while criticality is the appropriate notion in full generality, Fr{é}chet stationarity is typical in practical scenarios. This is illustrated with two main results, first we show that if C is semi-algebraic, then for a generic smooth semi-algebraic function f , all critical points of f on C are actually Fr{é}chet stationary. Second we prove that for small step-sizes, all the accumulation points of the projected gradient algorithm are Fr{é}chet stationary, with an explicit global quadratic estimate of the remainder, avoiding potential critical points which are not Fr{é}chet stationary, and some bad local minima. △ Less

Submitted 15 February, 2024; originally announced February 2024.

arXiv:2305.13768 [pdf, other]

One-step differentiation of iterative algorithms

Authors: Jérôme Bolte, Edouard Pauwels, Samuel Vaiter

Abstract: In appropriate frameworks, automatic differentiation is transparent to the user at the cost of being a significant computational burden when the number of operations is large. For iterative algorithms, implicit differentiation alleviates this issue but requires custom implementation of Jacobian evaluation. In this paper, we study one-step differentiation, also known as Jacobian-free backpropagatio… ▽ More In appropriate frameworks, automatic differentiation is transparent to the user at the cost of being a significant computational burden when the number of operations is large. For iterative algorithms, implicit differentiation alleviates this issue but requires custom implementation of Jacobian evaluation. In this paper, we study one-step differentiation, also known as Jacobian-free backpropagation, a method as easy as automatic differentiation and as performant as implicit differentiation for fast algorithms (e.g., superlinear optimization methods). We provide a complete theoretical approximation analysis with specific examples (Newton's method, gradient descent) along with its consequences in bilevel optimization. Several numerical examples illustrate the well-foundness of the one-step estimator. △ Less

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2302.02689 [pdf, other]

On the nature of Bregman functions

Authors: Edouard Pauwels

Abstract: Let C be convex, compact, with nonempty interior and h be Legendre with domain C, continuous on C. We prove that h is Bregman if and only if it is strictly convex on C and C is a polytope. This provides insights on sequential convergence of many Bregman divergence based algorithm: abstract compatibility conditions between Bregman and Euclidean topology may equivalently be replaced by explicit cond… ▽ More Let C be convex, compact, with nonempty interior and h be Legendre with domain C, continuous on C. We prove that h is Bregman if and only if it is strictly convex on C and C is a polytope. This provides insights on sequential convergence of many Bregman divergence based algorithm: abstract compatibility conditions between Bregman and Euclidean topology may equivalently be replaced by explicit conditions on h and C. This also emphasizes that a general convergence theory for these methods (beyond polyhedral domains) would require more refinements than Bregman's conditions. △ Less

Submitted 6 February, 2023; originally announced February 2023.

arXiv:2212.07844 [pdf, ps, other]

Differentiating Nonsmooth Solutions to Parametric Monotone Inclusion Problems

Authors: Jérôme Bolte, Edouard Pauwels, Antonio Silveti-Falls

Abstract: We leverage path differentiability and a recent result on nonsmooth implicit differentiation calculus to give sufficient conditions ensuring that the solution to a monotone inclusion problem will be path differentiable, with formulas for computing its generalized gradient. A direct consequence of our result is that these solutions happen to be differentiable almost everywhere. Our approach is full… ▽ More We leverage path differentiability and a recent result on nonsmooth implicit differentiation calculus to give sufficient conditions ensuring that the solution to a monotone inclusion problem will be path differentiable, with formulas for computing its generalized gradient. A direct consequence of our result is that these solutions happen to be differentiable almost everywhere. Our approach is fully compatible with automatic differentiation and comes with assumptions which are easy to check, roughly speaking: semialgebraicity and strong monotonicity. We illustrate the scope of our results by considering three fundamental composite problem settings: strongly convex problems, dual solutions to convex minimization problems and primal-dual solutions to min-max problems. △ Less

Submitted 15 December, 2022; originally announced December 2022.

arXiv:2207.12717 [pdf, other]

The derivatives of Sinkhorn-Knopp converge

Authors: Edouard Pauwels, Samuel Vaiter

Abstract: We show that the derivatives of the Sinkhorn-Knopp algorithm, or iterative proportional fitting procedure, converge towards the derivatives of the entropic regularization of the optimal transport problem with a locally uniform linear convergence rate. We show that the derivatives of the Sinkhorn-Knopp algorithm, or iterative proportional fitting procedure, converge towards the derivatives of the entropic regularization of the optimal transport problem with a locally uniform linear convergence rate. △ Less

Submitted 3 August, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

arXiv:2206.01730 [pdf, ps, other]

On the complexity of nonsmooth automatic differentiation

Authors: Jérôme Bolte, Ryan Boustany, Edouard Pauwels, Béatrice Pesquet-Popescu

Abstract: Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This co… ▽ More Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends Baur-Strassen's smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation's cheapness contrasts with concurrent forward approaches, which have, to this day, dimensional-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem. △ Less

Submitted 6 February, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

arXiv:2206.00457 [pdf, other]

Automatic differentiation of nonsmooth iterative algorithms

Authors: Jérôme Bolte, Edouard Pauwels, Samuel Vaiter

Abstract: Differentiation along algorithms, i.e., piggyback propagation of derivatives, is now routinely used to differentiate iterative solvers in differentiable programming. Asymptotics is well understood for many smooth problems but the nondifferentiable case is hardly considered. Is there a limiting object for nonsmooth piggyback automatic differentiation (AD)? Does it have any variational meaning and c… ▽ More Differentiation along algorithms, i.e., piggyback propagation of derivatives, is now routinely used to differentiate iterative solvers in differentiable programming. Asymptotics is well understood for many smooth problems but the nondifferentiable case is hardly considered. Is there a limiting object for nonsmooth piggyback automatic differentiation (AD)? Does it have any variational meaning and can it be used effectively in machine learning? Is there a connection with classical derivative? All these questions are addressed under appropriate nonexpansivity conditions in the framework of conservative derivatives which has proved useful in understanding nonsmooth AD. For nonsmooth piggyback iterations, we characterize the attractor set of nonsmooth piggyback iterations as a set-valued fixed point which remains in the conservative framework. This has various consequences and in particular almost everywhere convergence of classical derivatives. Our results are illustrated on parametric convex optimization problems with forward-backward, Douglas-Rachford and Alternating Direction of Multiplier algorithms as well as the Heavy-Ball method. △ Less

Submitted 31 May, 2022; originally announced June 2022.

arXiv:2202.13744 [pdf, ps, other]

doi 10.1137/22M1479178

Subgradient sampling for nonsmooth nonconvex minimization

Authors: Jérôme Bolte, Tam Le, Edouard Pauwels

Abstract: Risk minimization for nonsmooth nonconvex problems naturally leads to first-order sampling or, by an abuse of terminology, to stochastic subgradient descent. We establish the convergence of this method in the path-differentiable case and describe more precise results under additional geometric assumptions. We recover and improve results from Ermoliev and Norkin [Cybern. Syst. Anal., 34 (1998), pp.… ▽ More Risk minimization for nonsmooth nonconvex problems naturally leads to first-order sampling or, by an abuse of terminology, to stochastic subgradient descent. We establish the convergence of this method in the path-differentiable case and describe more precise results under additional geometric assumptions. We recover and improve results from Ermoliev and Norkin [Cybern. Syst. Anal., 34 (1998), pp. 196--215] by using a different approach: conservative calculus and the ODE method. In the definable case, we show that first-order subgradient sampling avoids artificial critical points with probability one and applies moreover to a large range of risk minimization problems in deep learning, based on the backpropagation oracle. As byproducts of our approach, we obtain several results on integration of independent interest, such as an interchange result for conservative derivatives and integrals or the definability of set-valued parameterized integrals. △ Less

Submitted 23 July, 2024; v1 submitted 28 February, 2022; originally announced February 2022.

Journal ref: SIAM Journal on Optimization, 2023, 33 (4), pp.2542-2569

arXiv:2202.08711 [pdf, ps, other]

The Iterates of the Frank-Wolfe Algorithm May Not Converge

Authors: Jérôme Bolte, Cyrille W. Combettes, Édouard Pauwels

Abstract: The Frank-Wolfe algorithm is a popular method for minimizing a smooth convex function $f$ over a compact convex set $\mathcal{C}$. While many convergence results have been derived in terms of function values, hardly nothing is known about the convergence behavior of the sequence of iterates $(x_t)_{t\in\mathbb{N}}$. Under the usual assumptions, we design several counterexamples to the convergence… ▽ More The Frank-Wolfe algorithm is a popular method for minimizing a smooth convex function $f$ over a compact convex set $\mathcal{C}$. While many convergence results have been derived in terms of function values, hardly nothing is known about the convergence behavior of the sequence of iterates $(x_t)_{t\in\mathbb{N}}$. Under the usual assumptions, we design several counterexamples to the convergence of $(x_t)_{t\in\mathbb{N}}$, where $f$ is $d$-time continuously differentiable, $d\geq2$, and $f(x_t)\to\min_\mathcal{C}f$. Our counterexamples cover the cases of open-loop, closed-loop, and line-search step-size strategies. We do not assume \emph{misspecification} of the linear minimization oracle and our results thus hold regardless of the points it returns, demonstrating the fundamental pathologies in the convergence behavior of $(x_t)_{t\in\mathbb{N}}$. △ Less

Submitted 17 February, 2022; originally announced February 2022.

Comments: 15 pages, 7 figures

arXiv:2201.03819 [pdf, ps, other]

Path differentiability of ODE flows

Authors: Swann Marx, Edouard Pauwels

Abstract: We consider flows of ordinary differential equations (ODEs) driven by path differentiable vector fields. Path differentiable functions constitute a proper subclass of Lipschitz functions which admit conservative gradients, a notion of generalized derivative compatible with basic calculus rules. Our main result states that such flows inherit the path differentiability property of the driving vector… ▽ More We consider flows of ordinary differential equations (ODEs) driven by path differentiable vector fields. Path differentiable functions constitute a proper subclass of Lipschitz functions which admit conservative gradients, a notion of generalized derivative compatible with basic calculus rules. Our main result states that such flows inherit the path differentiability property of the driving vector field. We show indeed that forward propagation of derivatives given by the sensitivity differential inclusions provide a conservative Jacobian for the flow. This allows to propose a nonsmooth version of the adjoint method, which can be applied to integral costs under an ODE constraint. This result constitutes a theoretical ground to the application of small step first order methods to solve a broad class of nonsmooth optimization problems with parametrized ODE constraints. This is illustrated with the convergence of small step first order methods based on the proposed nonsmooth adjoint. △ Less

Submitted 11 January, 2022; originally announced January 2022.

arXiv:2106.12955 [pdf, other]

Regularisation for PCA- and SVD-type matrix factorisations

Authors: Abdolrahman Khoshrou, Eric J. Pauwels

Abstract: Singular Value Decomposition (SVD) and its close relative, Principal Component Analysis (PCA), are well-known linear matrix decomposition techniques that are widely used in applications such as dimension reduction and clustering. However, an important limitation of SVD/PCA is its sensitivity to noise in the input data. In this paper, we take another look at the problem of regularisation and show t… ▽ More Singular Value Decomposition (SVD) and its close relative, Principal Component Analysis (PCA), are well-known linear matrix decomposition techniques that are widely used in applications such as dimension reduction and clustering. However, an important limitation of SVD/PCA is its sensitivity to noise in the input data. In this paper, we take another look at the problem of regularisation and show that different formulations of the minimisation problem lead to qualitatively different solutions. △ Less

Submitted 24 June, 2021; originally announced June 2021.

arXiv:2106.12915 [pdf, other]

Numerical influence of ReLU'(0) on backpropagation

Authors: David Bertoin, Jérôme Bolte, Sébastien Gerchinovitz, Edouard Pauwels

Abstract: In theory, the choice of ReLU(0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, 32 bits default precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (f… ▽ More In theory, the choice of ReLU(0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, 32 bits default precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (fully connected, VGG, ResNet) and datasets (MNIST, CIFAR10, SVHN, ImageNet). We observe considerable variations of backpropagation outputs which occur around half of the time in 32 bits precision. The effect disappears with double precision, while it is systematic at 16 bits. For vanilla SGD training, the choice ReLU'(0) = 0 seems to be the most efficient. For our experiments on ImageNet the gain in test accuracy over ReLU'(0) = 1 was more than 10 points (two runs). We also evidence that reconditioning approaches as batch-norm or ADAM tend to buffer the influence of ReLU'(0)'s value. Overall, the message we convey is that algorithmic differentiation of nonsmooth problems potentially hides parameters that could be tuned advantageously. △ Less

Submitted 3 November, 2023; v1 submitted 23 June, 2021; originally announced June 2021.

Journal ref: Advances in Neural Information Processing Systems, Dec 2021, Paris, France

arXiv:2106.04350 [pdf, other]

Nonsmooth Implicit Differentiation for Machine Learning and Optimization

Authors: Jérôme Bolte, Tam Le, Edouard Pauwels, Antonio Silveti-Falls

Abstract: In view of training increasingly complex learning architectures, we establish a nonsmooth implicit function theorem with an operational calculus. Our result applies to most practical problems (i.e., definable problems) provided that a nonsmooth form of the classical invertibility condition is fulfilled. This approach allows for formal subdifferentiation: for instance, replacing derivatives by Clar… ▽ More In view of training increasingly complex learning architectures, we establish a nonsmooth implicit function theorem with an operational calculus. Our result applies to most practical problems (i.e., definable problems) provided that a nonsmooth form of the classical invertibility condition is fulfilled. This approach allows for formal subdifferentiation: for instance, replacing derivatives by Clarke Jacobians in the usual differentiation formulas is fully justified for a wide class of nonsmooth problems. Moreover this calculus is entirely compatible with algorithmic differentiation (e.g., backpropagation). We provide several applications such as training deep equilibrium networks, training neural nets with conic optimization layers, or hyperparameter-tuning for nonsmooth Lasso-type models. To show the sharpness of our assumptions, we present numerical experiments showcasing the extremely pathological gradient dynamics one can encounter when applying implicit algorithmic differentiation without any hypothesis. △ Less

Submitted 5 April, 2022; v1 submitted 8 June, 2021; originally announced June 2021.

Journal ref: Advances in Neural Information Processing Systems, Dec 2021, Online, France

arXiv:2106.01453 [pdf, other]

Semialgebraic Representation of Monotone Deep Equilibrium Models and Applications to Certification

Authors: Tong Chen, Jean-Bernard Lasserre, Victor Magron, Edouard Pauwels

Abstract: Deep equilibrium models are based on implicitly defined functional relations and have shown competitive performance compared with the traditional deep networks. Monotone operator equilibrium networks (monDEQ) retain interesting performance with additional theoretical guaranties. Existing certification tools for classical deep networks cannot directly be applied to monDEQs for which much fewer tool… ▽ More Deep equilibrium models are based on implicitly defined functional relations and have shown competitive performance compared with the traditional deep networks. Monotone operator equilibrium networks (monDEQ) retain interesting performance with additional theoretical guaranties. Existing certification tools for classical deep networks cannot directly be applied to monDEQs for which much fewer tools exist. We introduce a semialgebraic representation for ReLU based monDEQs which allows to approximate the corresponding input output relation by semidefinite programming (SDP). We present several applications to network certification and obtain SDP models for the following problems : robustness certification, Lipschitz constant estimation, ellipsoidal uncertainty propagation. We use these models to certify robustness of monDEQs w.r.t. a general $L_q$ norm. Experimental results show that the proposed models outperform existing approaches for monDEQ certification. Furthermore, our investigations suggest that monDEQs are much more robust to $L_2$ perturbations than $L_{\infty}$ perturbations. △ Less

Submitted 2 June, 2021; originally announced June 2021.

Comments: 16 pages, 4 tables, 2 figures

arXiv:2104.00283 [pdf, other]

doi 10.1007/s11228-023-00682-3

Conservative parametric optimality and the ridge method for tame min-max problems

Authors: Edouard Pauwels

Abstract: We study the ridge method for min-max problems, and investigate its convergence without any convexity, differentiability or qualification assumption. The central issue is to determine whether the ''parametric optimality formula'' provides a conservative field, a notion of generalized derivative well suited for optimization. The answer to this question is positive in a semi-algebraic, and more gene… ▽ More We study the ridge method for min-max problems, and investigate its convergence without any convexity, differentiability or qualification assumption. The central issue is to determine whether the ''parametric optimality formula'' provides a conservative field, a notion of generalized derivative well suited for optimization. The answer to this question is positive in a semi-algebraic, and more generally definable, context. The proof involves a new characterization of definable conservative fields which is of independent interest. As a consequence, the ridge method applied to definable objectives is proved to have a minimizing behavior and to converge to a set of equilibria which satisfy an optimality condition. Definability is key to our proof: we show that for a more general class of nonsmooth functions, conservativity of the parametric optimality formula may fail, resulting in an absurd behavior of the ridge method. △ Less

Submitted 26 June, 2023; v1 submitted 1 April, 2021; originally announced April 2021.

Journal ref: Set-Valued and Variational Analysis, 2023, 31 (19)

arXiv:2103.03570 [pdf, other]

doi 10.1007/s11063-021-10705-5

Second-order step-size tuning of SGD for non-convex optimization

Authors: Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels

Abstract: In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. For doing so, one estimates curvature, based on a local quadratic model and using only noisy gradient approximations. One obtains a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version o… ▽ More In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. For doing so, one estimates curvature, based on a local quadratic model and using only noisy gradient approximations. One obtains a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM. △ Less

Submitted 21 November, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

Comments: To appear in Neural Processing Letters (accepted Nov. 2021)

Journal ref: Neural Processing Letters (2022)

arXiv:2101.05167 [pdf, ps, other]

A Sublevel Moment-SOS Hierarchy for Polynomial Optimization

Authors: Tong Chen, Jean-Bernard Lasserre, Victor Magron, Edouard Pauwels

Abstract: We introduce a sublevel Moment-SOS hierarchy where each SDP relaxation can be viewed as an intermediate (or interpolation) between the d-th and (d+1)-th order SDP relaxations of the Moment-SOS hierarchy (dense or sparse version). With the flexible choice of determining the size (level) and number (depth) of subsets in the SDP relaxation, one is able to obtain different improvements compared to the… ▽ More We introduce a sublevel Moment-SOS hierarchy where each SDP relaxation can be viewed as an intermediate (or interpolation) between the d-th and (d+1)-th order SDP relaxations of the Moment-SOS hierarchy (dense or sparse version). With the flexible choice of determining the size (level) and number (depth) of subsets in the SDP relaxation, one is able to obtain different improvements compared to the d-th order relaxation, based on the machine memory capacity. In particular, we provide numerical experiments for d=1 and various types of problems both in combinatorial optimization (Max-Cut, Mixed Integer Programming) and deep learning (robustness certification, Lipschitz constant of neural networks), where the standard Lasserre's relaxation (or its sparse variant) is computationally intractable. In our numerical results, the lower bounds from the sublevel relaxations improve the bound from Shor's relaxation (first order Lasserre's relaxation) and are significantly closer to the optimal value or to the best-known lower/upper bounds. △ Less

Submitted 13 January, 2021; originally announced January 2021.

Comments: 25 pages, 13 tables

arXiv:2011.12341 [pdf, ps, other]

Sequential convergence of AdaGrad algorithm for smooth convex optimization

Authors: Cheik Traoré, Edouard Pauwels

Abstract: We prove that the iterates produced by, either the scalar step size variant, or the coordinatewise variant of AdaGrad algorithm, are convergent sequences when applied to convex objective functions with Lipschitz gradient. The key insight is to remark that such AdaGrad sequences satisfy a variable metric quasi-Fejér monotonicity property, which allows to prove convergence. We prove that the iterates produced by, either the scalar step size variant, or the coordinatewise variant of AdaGrad algorithm, are convergent sequences when applied to convex objective functions with Lipschitz gradient. The key insight is to remark that such AdaGrad sequences satisfy a variable metric quasi-Fejér monotonicity property, which allows to prove convergence. △ Less

Submitted 13 April, 2021; v1 submitted 24 November, 2020; originally announced November 2020.

Comments: 9 pages

arXiv:2007.08810 [pdf, other]

A Hölderian backtracking method for min-max and min-min problems

Authors: Jérôme Bolte, Lilian Glaudin, Edouard Pauwels, Mathieu Serrurier

Abstract: We present a new algorithm to solve min-max or min-min problems out of the convex world. We use rigidity assumptions, ubiquitous in learning, making our method applicable to many optimization problems. Our approach takes advantage of hidden regularity properties and allows us to devise a simple algorithm of ridge type. An original feature of our method is to come with automatic step size adaptatio… ▽ More We present a new algorithm to solve min-max or min-min problems out of the convex world. We use rigidity assumptions, ubiquitous in learning, making our method applicable to many optimization problems. Our approach takes advantage of hidden regularity properties and allows us to devise a simple algorithm of ridge type. An original feature of our method is to come with automatic step size adaptation which departs from the usual overly cautious backtracking methods. In a general framework, we provide convergence theoretical guarantees and rates. We apply our findings on simple GAN problems obtaining promising numerical results. △ Less

Submitted 17 July, 2020; originally announced July 2020.

arXiv:2007.07557 [pdf, ps, other]

doi 10.1007/s10957-021-01883-2

Incremental Without Replacement Sampling in Nonconvex Optimization

Authors: Edouard Pauwels

Abstract: Minibatch decomposition methods for empirical risk minimization are commonly analysed in a stochastic approximation setting, also known as sampling with replacement. On the other hands modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analysis are much scarcer. We provide convergence guaranties for the latter variant by analys… ▽ More Minibatch decomposition methods for empirical risk minimization are commonly analysed in a stochastic approximation setting, also known as sampling with replacement. On the other hands modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analysis are much scarcer. We provide convergence guaranties for the latter variant by analysing a versatile incremental gradient scheme. For this scheme, we consider constant, decreasing or adaptive step sizes. In the smooth setting we obtain explicit complexity estimates in terms of epoch counter. In the nonsmooth setting we prove that the sequence is attracted by solutions of optimality conditions of the problem. △ Less

Submitted 6 January, 2023; v1 submitted 15 July, 2020; originally announced July 2020.

Comments: Journal of Optimization Theory and Applications, 2021

arXiv:2006.02080 [pdf, other]

A mathematical model for automatic differentiation in machine learning

Authors: Jerome Bolte, Edouard Pauwels

Abstract: Automatic differentiation, as implemented today, does not have a simple mathematical model adapted to the needs of modern machine learning. In this work we articulate the relationships between differentiation of programs as implemented in practice and differentiation of nonsmooth functions. To this end we provide a simple class of functions, a nonsmooth calculus, and show how they apply to stochas… ▽ More Automatic differentiation, as implemented today, does not have a simple mathematical model adapted to the needs of modern machine learning. In this work we articulate the relationships between differentiation of programs as implemented in practice and differentiation of nonsmooth functions. To this end we provide a simple class of functions, a nonsmooth calculus, and show how they apply to stochastic approximation methods. We also evidence the issue of artificial critical points created by algorithmic differentiation and show how usual methods avoid these points with probability one. △ Less

Submitted 29 October, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

Journal ref: Conference on Neural Information Processing Systems, Dec 2020, Vancouver, Canada

arXiv:2006.00098 [pdf, other]

Long term dynamics of the subgradient method for Lipschitz path differentiable functions

Authors: Jerome Bolte, Edouard Pauwels, Rodolfo Rios-Zertuche

Abstract: We consider the long-term dynamics of the vanishing stepsize subgradient method in the case when the objective function is neither smooth nor convex. We assume that this function is locally Lipschitz and path differentiable, i.e., admits a chain rule. Our study departs from other works in the sense that we focus on the behavoir of the oscillations, and to do this we use closed measures. We recover… ▽ More We consider the long-term dynamics of the vanishing stepsize subgradient method in the case when the objective function is neither smooth nor convex. We assume that this function is locally Lipschitz and path differentiable, i.e., admits a chain rule. Our study departs from other works in the sense that we focus on the behavoir of the oscillations, and to do this we use closed measures. We recover known convergence results, establish new ones, and show a local principle of oscillation compensation for the velocities. Roughly speaking, the time average of gradients around one limit point vanishes. This allows us to further analyze the structure of oscillations, and establish their perpendicularity to the general drift. △ Less

Submitted 29 May, 2020; originally announced June 2020.

Comments: 28 pages, 2 figures

MSC Class: 65K10 (Primary); 37A50; 37B35; 62M45 (Secondary) ACM Class: G.1.6; I.2.6

arXiv:2002.03657 [pdf, other]

Semialgebraic Optimization for Lipschitz Constants of ReLU Networks

Authors: Tong Chen, Jean-Bernard Lasserre, Victor Magron, Edouard Pauwels

Abstract: The Lipschitz constant of a network plays an important role in many applications of deep learning, such as robustness certification and Wasserstein Generative Adversarial Network. We introduce a semidefinite programming hierarchy to estimate the global and local Lipschitz constant of a multiple layer deep neural network. The novelty is to combine a polynomial lifting for ReLU functions derivatives… ▽ More The Lipschitz constant of a network plays an important role in many applications of deep learning, such as robustness certification and Wasserstein Generative Adversarial Network. We introduce a semidefinite programming hierarchy to estimate the global and local Lipschitz constant of a multiple layer deep neural network. The novelty is to combine a polynomial lifting for ReLU functions derivatives with a weak generalization of Putinar's positivity certificate. This idea could also apply to other, nearly sparse, polynomial optimization problems in machine learning. We empirically demonstrate that our method provides a trade-off with respect to state of the art linear programming approach, and in some cases we obtain better bounds in less time. △ Less

Submitted 28 October, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

Comments: NeurIPS 2020

arXiv:2001.07999 [pdf, other]

Curiosities and counterexamples in smooth convex optimization

Authors: Jerome Bolte, Edouard Pauwels

Abstract: Counterexamples to some old-standing optimization problems in the smooth convex coercive setting are provided. We show that block-coordinate, steepest descent with exact search or Bregman descent methods do not generally converge. Other failures of various desirable features are established: directional convergence of Cauchy's gradient curves, convergence of Newton's flow, finite length of Tikhono… ▽ More Counterexamples to some old-standing optimization problems in the smooth convex coercive setting are provided. We show that block-coordinate, steepest descent with exact search or Bregman descent methods do not generally converge. Other failures of various desirable features are established: directional convergence of Cauchy's gradient curves, convergence of Newton's flow, finite length of Tikhonov path, convergence of central paths, or smooth Kurdyka-Lojasiewicz inequality. All examples are planar. These examples are based on general smooth convex interpolation results. Given a decreasing sequence of positively curved C k convex compact sets in the plane, we provide a level set interpolation of a C k smooth convex function where k $\ge$ 2 is arbitrary. If the intersection is reduced to one point our interpolant has positive definite Hessian, otherwise it is positive definite out of the solution set. Furthermore , given a sequence of decreasing polygons we provide an interpolant agreeing with the vertices and whose gradients coincide with prescribed normals. △ Less

Submitted 29 January, 2020; v1 submitted 22 January, 2020; originally announced January 2020.

arXiv:1912.04012 [pdf, other]

To snipe or not to snipe, that is the question! Transitions in sniping behaviour among competing algorithmic traders

Authors: Somayeh Kokabisaghi, Eric J Pauwels, Andre B Dorsman

Abstract: In this paper we extend the investigation into the transition from sure to probabilistic sniping as introduced in Menkveld and Zoican \cite{mz2017}. In that paper, the authors introduce a stylized version of a competitive game in which high frequency traders (HFTs) interact with each other and liquidity traders. The authors then show that risk aversion plays an important role in the transition fro… ▽ More In this paper we extend the investigation into the transition from sure to probabilistic sniping as introduced in Menkveld and Zoican \cite{mz2017}. In that paper, the authors introduce a stylized version of a competitive game in which high frequency traders (HFTs) interact with each other and liquidity traders. The authors then show that risk aversion plays an important role in the transition from sure to mixed (or probabilistic) sniping. In this paper, we re-interpret and extend these conclusions in the context of repeated games and highlight some differences in results. In particular, we identify situations in which probabilistic sniping is genuinely profitable that are qualitatively different from the ones obtained in \cite{mz2017}. It turns out that beyond a specific risk aversion threshold the game resembles the well-known prisoner's dilemma, in that probabilistic sniping becomes a way to cooperate among the HFTs that leaves all the participants better off. In order to turn this into a viable strategy for the repeated game, we show how compliance can be monitored through the use of sequential statistical testing. Keywords: algorithmic trading, bandits, high-frequency exchange, Nash equilibrium, repeated games, sniping, subgame-perfect equilibrium, Sequential probability ratio, transition △ Less

Submitted 11 September, 2020; v1 submitted 9 December, 2019; originally announced December 2019.

arXiv:1910.14458 [pdf, other]

Rate of convergence for geometric inference based on the empirical Christoffel function

Authors: Mai Trang Vu, François Bachoc, Edouard Pauwels

Abstract: We consider the problem of estimating the support of a measure from a finite, independent, sample. The estimators which are considered are constructed based on the empirical Christoffel function. Such estimators have been proposed for the problem of set estimation with heuristic justifications. We carry out a detailed finite sample analysis, that allows us to select the threshold and degree parame… ▽ More We consider the problem of estimating the support of a measure from a finite, independent, sample. The estimators which are considered are constructed based on the empirical Christoffel function. Such estimators have been proposed for the problem of set estimation with heuristic justifications. We carry out a detailed finite sample analysis, that allows us to select the threshold and degree parameters as a function of the sample size. We provide a convergence rate analysis of the resulting support estimation procedure. Our analysis establishes that we may obtain finite sample bounds which are comparable to existing rates for different set estimation procedures. Our results rely on concentration inequalities for the empirical Christoffel function and on estimates of the supremum of the Christoffel-Darboux kernel on sets with smooth boundaries, that can be considered of independent interest. △ Less

Submitted 19 May, 2020; v1 submitted 31 October, 2019; originally announced October 2019.

arXiv:1909.10300 [pdf, other]

Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning

Authors: Jérôme Bolte, Edouard Pauwels

Abstract: Modern problems in AI or in numerical analysis require nonsmooth approaches with a flexible calculus. We introduce generalized derivatives called conservative fields for which we develop a calculus and provide representation formulas. Functions having a conservative field are called path differentiable: convex, concave, Clarke regular and any semialgebraic Lipschitz continuous functions are path d… ▽ More Modern problems in AI or in numerical analysis require nonsmooth approaches with a flexible calculus. We introduce generalized derivatives called conservative fields for which we develop a calculus and provide representation formulas. Functions having a conservative field are called path differentiable: convex, concave, Clarke regular and any semialgebraic Lipschitz continuous functions are path differentiable. Using Whitney stratification techniques for semialgebraic and definable sets, our model provides variational formulas for nonsmooth automatic differentiation oracles, as for instance the famous backpropagation algorithm in deep learning. Our differential model is applied to establish the convergence in values of nonsmooth stochastic gradient methods as they are implemented in practice. △ Less

Submitted 9 April, 2020; v1 submitted 23 September, 2019; originally announced September 2019.

Comments: Corrected typos

arXiv:1905.12278 [pdf, other]

An Inertial Newton Algorithm for Deep Learning

Authors: Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels

Abstract: We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both… ▽ More We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both gradient-descent and Newton-like behaviors as well as inertia. We prove the convergence of INNA for most deep learning problems. To do so, we provide a well-suited framework to analyze deep learning loss functions involving tame optimization in which we study a continuous dynamical system together with its discrete stochastic approximations. We prove sublinear convergence for the continuous-time differential inclusion which underlies our algorithm. Additionally, we also show how standard optimization mini-batch methods applied to non-smooth non-convex problems can yield a certain type of spurious stationary points never discussed before. We address this issue by providing a theoretical framework around the new idea of $D$-criticality; we then give a simple asymptotic analysis of INNA. Our algorithm allows for using an aggressive learning rate of $o(1/\log k)$. From an empirical viewpoint, we show that INNA returns competitive results with respect to state of the art (stochastic gradient descent, ADAGRAD, ADAM) on popular deep learning benchmark problems. △ Less

Submitted 28 July, 2021; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: To appear in Journal of Machine Learning Research (JMLR), Volume 22, acceptance date: 5/21

Journal ref: Journal of Machine Learning Research (JMLR), v22(134):1-31, 2021

arXiv:1904.01833 [pdf, other]

Semi-algebraic approximation using Christoffel-Darboux kernel

Authors: Swann Marx, Edouard Pauwels, Tillmann Weisser, Didier Henrion, Jean Lasserre

Abstract: We provide a new method to approximate a (possibly discontinuous) function using Christoffel-Darboux kernels. Our knowledge about the unknown multivariate function is in terms of finitely many moments of the Young measure supported on the graph of the function. Such an input is available when approximating weak (or measure-valued) solution of optimal control problems, entropy solutions to non-line… ▽ More We provide a new method to approximate a (possibly discontinuous) function using Christoffel-Darboux kernels. Our knowledge about the unknown multivariate function is in terms of finitely many moments of the Young measure supported on the graph of the function. Such an input is available when approximating weak (or measure-valued) solution of optimal control problems, entropy solutions to non-linear hyperbolic PDEs, or using numerical integration from finitely many evaluations of the function. While most of the existing methods construct a piecewise polynomial approximation, we construct a semi-algebraic approximation whose estimation and evaluation can be performed efficiently. An appealing feature of this method is that it deals with non-smoothness implicitly so that a single scheme can be used to treat smooth or non-smooth functions without any prior knowledge. On the theoretical side, we prove pointwise convergence almost everywhere as well as convergence in the Lebesgue one norm under broad assumptions. Using more restrictive assumptions, we obtain explicit convergence rates. We illustrate our approach on various examples from control and approximation. In particular we observe empirically that our method does not suffer from the the Gibbs phenomenon when approximating discontinuous functions. △ Less

Submitted 8 April, 2021; v1 submitted 3 April, 2019; originally announced April 2019.

Report number: Rapport LAAS n{\textdegree} 19111

arXiv:1810.08480 [pdf, other]

Data analysis from empirical moments and the Christoffel function

Authors: Edouard Pauwels, Mihai Putinar, Jean-Bernard Lasserre

Abstract: Spectral features of the empirical moment matrix constitute a resourceful tool for unveiling properties of a cloud of points, among which, density, support and latent structures. It is already well known that the empirical moment matrix encodes a great deal of subtle attributes of the underlying measure. Starting from this object as base of observations we combine ideas from statistics, real algeb… ▽ More Spectral features of the empirical moment matrix constitute a resourceful tool for unveiling properties of a cloud of points, among which, density, support and latent structures. It is already well known that the empirical moment matrix encodes a great deal of subtle attributes of the underlying measure. Starting from this object as base of observations we combine ideas from statistics, real algebraic geometry, orthogonal polynomials and approximation theory for opening new insights relevant for Machine Learning (ML) problems with data supported on singular sets. Refined concepts and results from real algebraic geometry and approximation theory are empowering a simple tool (the empirical moment matrix) for the task of solving non-trivial questions in data analysis. We provide (1) theoretical support, (2) numerical experiments and, (3) connections to real world data as a validation of the stamina of the empirical moment matrix approach. △ Less

Submitted 10 January, 2020; v1 submitted 19 October, 2018; originally announced October 2018.

Report number: Rapport LAAS n{\textdegree} 18219

arXiv:1809.01931 [pdf, other]

An unexpected connection between Bayes $A-$optimal designs and the Group Lasso

Authors: Guillaume Sagnol, Edouard Pauwels

Abstract: We show that the $A$-optimal design optimization problem over $m$ design points in $\mathbb{R}^n$ is equivalent to minimizing a quadratic function plus a group lasso sparsity inducing term over $n\times m$ real matrices. This observation allows to describe several new algorithms for $A$-optimal design based on splitting and block coordinate decomposition. These techniques are well known and proved… ▽ More We show that the $A$-optimal design optimization problem over $m$ design points in $\mathbb{R}^n$ is equivalent to minimizing a quadratic function plus a group lasso sparsity inducing term over $n\times m$ real matrices. This observation allows to describe several new algorithms for $A$-optimal design based on splitting and block coordinate decomposition. These techniques are well known and proved powerful to treat large scale problems in machine learning and signal processing communities. The proposed algorithms come with rigorous convergence guaranties and convergence rate estimate stemming from the optimization literature. Performances are illustrated on synthetic benchmarks and compared to existing methods for solving the optimal design problem. △ Less

Submitted 6 September, 2018; originally announced September 2018.

MSC Class: 62K05

arXiv:1807.10120 [pdf, other]

doi 10.1109/ISGTEurope.2017.8260303

SVD-based Visualisation and Approximation for Time Series Data in Smart Energy Systems

Authors: Abdolrahman Khoshrou, Andre B. Dorsman, Eric. J. Pauwels

Abstract: Many time series in smart energy systems exhibit two different timescales. On the one hand there are patterns linked to daily human activities. On the other hand, there are relatively slow trends linked to seasonal variations. In this paper we interpret these time series as matrices, to be visualized as images. This approach has two advantages: First of all, interpreting such time series as images… ▽ More Many time series in smart energy systems exhibit two different timescales. On the one hand there are patterns linked to daily human activities. On the other hand, there are relatively slow trends linked to seasonal variations. In this paper we interpret these time series as matrices, to be visualized as images. This approach has two advantages: First of all, interpreting such time series as images enables one to visually integrate across the image and makes it therefore easier to spot subtle or faint features. Second, the matrix interpretation also grants elucidation of the underlying structure using well-established matrix decomposition methods. We will illustrate both these aspects for data obtained from the German day-ahead market. △ Less

Submitted 11 July, 2018; originally announced July 2018.

arXiv:1807.07328 [pdf, other]

doi 10.1109/PESGM.2018.8586020

Quantifying Volatility Reduction in German Day-ahead Spot Market in the Period 2006 through 2016

Authors: Abdolrahman Khoshrou, Eric J. Pauwels

Abstract: In Europe, Germany is taking the lead in the switch from the conventional to renewable energy. This poses new challenges as wind and solar energy are fundamentally intermittent, weather-dependent and less predictable. It is therefore of considerable interest to investigate the evolution of price volatility in this post-transition era. There are a number of reasons, however, that makes the practica… ▽ More In Europe, Germany is taking the lead in the switch from the conventional to renewable energy. This poses new challenges as wind and solar energy are fundamentally intermittent, weather-dependent and less predictable. It is therefore of considerable interest to investigate the evolution of price volatility in this post-transition era. There are a number of reasons, however, that makes the practical studies difficult. For instance, EPEX prices can be zero or negative. Consequently, the standard approach in financial time series analysis to switch to logarithmic measures is inapplicable. Furthermore, in contrast to the stock market prices which are only available for trading days, EPEX prices cover the whole year, including weekends and holidays. Accordingly, there is a lot of underlying variability in the data which has nothing to do with volatility, but simply reflects diurnal activity patterns. An important distinction of the present work is the application of matrix decomposition techniques, namely the singular value decomposition (SVD), for defining an alternative notion of volatility. This approach is systematically more robust toward outliers and also the diurnal patterns. Our observations show that the day-ahead market is becoming less volatile in recent years. △ Less

Submitted 19 July, 2018; originally announced July 2018.

arXiv:1807.03386 [pdf, other]

doi 10.1007/978-3-030-01174-1_35

Data-driven pattern identification and outlier detection in time series

Authors: Abdolrahman Khoshrou, Eric J. Pauwels

Abstract: We address the problem of data-driven pattern identification and outlier detection in time series. To this end, we use singular value decomposition (SVD) which is a well-known technique to compute a low-rank approximation for an arbitrary matrix. By recasting the time series as a matrix it becomes possible to use SVD to highlight the underlying patterns and periodicities. This is done without the… ▽ More We address the problem of data-driven pattern identification and outlier detection in time series. To this end, we use singular value decomposition (SVD) which is a well-known technique to compute a low-rank approximation for an arbitrary matrix. By recasting the time series as a matrix it becomes possible to use SVD to highlight the underlying patterns and periodicities. This is done without the need for specifying user-defined parameters. From a data mining perspective, this opens up new ways of analyzing time series in a data-driven, bottom-up fashion. However, in order to get correct results, it is important to understand how the SVD-spectrum of a time series is influenced by various characteristics of the underlying signal and noise. In this paper, we have extended the work in earlier papers by initiating a more systematic analysis of these effects. We then illustrate our findings on some real-life data. △ Less

Submitted 25 June, 2018; originally announced July 2018.

arXiv:1805.07943 [pdf, other]

Relating Leverage Scores and Density using Regularized Christoffel Functions

Authors: Edouard Pauwels, Francis Bach, Jean-Philippe Vert

Abstract: Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling with applications to low rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite k… ▽ More Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling with applications to low rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows to elucidate their relationships with the chosen kernel as well as population density. Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings. △ Less

Submitted 21 November, 2018; v1 submitted 21 May, 2018; originally announced May 2018.

arXiv:1712.02623 [pdf, other]

The multiproximal linearization method for convex composite problems

Authors: Jérôme Bolte, Zheng Chen, Edouard Pauwels

Abstract: Composite minimization involves a collection of smooth functions which are aggregated in a nonsmooth manner. In the convex setting, we design an algorithm by linearizing each smooth component in accordance with its main curvature. The resulting method, called the Multiprox method, consists in solving successively simple problems (e.g. constrained quadratic problems) which can also feature some pro… ▽ More Composite minimization involves a collection of smooth functions which are aggregated in a nonsmooth manner. In the convex setting, we design an algorithm by linearizing each smooth component in accordance with its main curvature. The resulting method, called the Multiprox method, consists in solving successively simple problems (e.g. constrained quadratic problems) which can also feature some proximal operators. To study the complexity and the convergence of this method we are led to study quantitative qualification conditions to understand the impact of multipliers on the complexity bounds. We obtain explicit complexity results of the form $O(\frac{1}{k})$ involving new types of constant terms. A distinctive feature of our approach is to be able to cope with oracles involving moving constraints. Our method is flexible enough to include the moving balls method, the proximal Gauss-Newton's method, or the forward-backward splitting, for which we recover known complexity results or establish new ones. We show through several numerical experiments how the use of multiple proximal terms can be decisive for problems with complex geometries. △ Less

Submitted 23 March, 2019; v1 submitted 7 December, 2017; originally announced December 2017.

arXiv:1705.08219 [pdf, ps, other]

Qualification Conditions in Semi-algebraic Programming

Authors: Jérôme Bolte, Antoine Hochart, Edouard Pauwels

Abstract: For an arbitrary finite family of semi-algebraic/definable functions, we consider the corresponding inequality constraint set and we study qualification conditions for perturbations of this set. In particular we prove that all positive diagonal perturbations, save perhaps a finite number of them, ensure that any point within the feasible set satisfies Mangasarian-Fromovitz constraint qualification… ▽ More For an arbitrary finite family of semi-algebraic/definable functions, we consider the corresponding inequality constraint set and we study qualification conditions for perturbations of this set. In particular we prove that all positive diagonal perturbations, save perhaps a finite number of them, ensure that any point within the feasible set satisfies Mangasarian-Fromovitz constraint qualification. Using the Milnor-Thom theorem, we provide a bound for the number of singular perturbations when the constraints are polynomial functions. Examples show that the order of magnitude of our exponential bound is relevant. Our perturbation approach provides a simple protocol to build sequences of "regular" problems approximating an arbitrary semi-algebraic/definable problem. Applications to sequential quadratic programming methods and sum of squares relaxation are provided. △ Less

Submitted 7 March, 2018; v1 submitted 23 May, 2017; originally announced May 2017.

MSC Class: 26D10; 32B20; 49K24; 49J52; 37B35; 14P15

arXiv:1702.08339 [pdf, other]

doi 10.1109/TSP.2017.2780044

On Fienup Methods for Regularized Phase Retrieval

Authors: Edouard Pauwels, Amir Beck, Yonina C. Eldar, Shoham Sabach

Abstract: Alternating minimization, or Fienup methods, have a long history in phase retrieval. We provide new insights related to the empirical and theoretical analysis of these algorithms when used with Fourier measurements and combined with convex priors. In particular, we show that Fienup methods can be viewed as performing alternating minimization on a regularized nonconvex least-squares problem with re… ▽ More Alternating minimization, or Fienup methods, have a long history in phase retrieval. We provide new insights related to the empirical and theoretical analysis of these algorithms when used with Fourier measurements and combined with convex priors. In particular, we show that Fienup methods can be viewed as performing alternating minimization on a regularized nonconvex least-squares problem with respect to amplitude measurements. We then prove that under mild additional structural assumptions on the prior (semi-algebraicity), the sequence of signal estimates has a smooth convergent behaviour towards a critical point of the nonconvex regularized least-squares objective. Finally, we propose an extension to Fienup techniques, based on a projected gradient descent interpretation and acceleration using inertial terms. We demonstrate experimentally that this modification combined with an $\ell_1$ prior constitutes a competitive approach for sparse phase retrieval. △ Less

Submitted 27 February, 2017; originally announced February 2017.

arXiv:1701.02886 [pdf, other]

The empirical Christoffel function with applications in data analysis

Authors: Jean-Bernard Lasserre, Edouard Pauwels

Abstract: We illustrate the potential applications in machine learning of the Christoffel function, or more precisely, its empirical counterpart associated with a counting measure uniformly supported on a finite set of points. Firstly, we provide a thresholding scheme which allows to approximate the support of a measure from a finite subset of its moments with strong asymptotic guaranties. Secondly, we prov… ▽ More We illustrate the potential applications in machine learning of the Christoffel function, or more precisely, its empirical counterpart associated with a counting measure uniformly supported on a finite set of points. Firstly, we provide a thresholding scheme which allows to approximate the support of a measure from a finite subset of its moments with strong asymptotic guaranties. Secondly, we provide a consistency result which relates the empirical Christoffel function and its population counterpart in the limit of large samples. Finally, we illustrate the relevance of our results on simulated and real world datasets for several applications in statistics and machine learning: (a) density and support estimation from finite samples, (b) outlier and novelty detection and (c) affine matching. △ Less

Submitted 7 February, 2019; v1 submitted 11 January, 2017; originally announced January 2017.

arXiv:1609.08177 [pdf, other]

Extragradient Method in Optimization: Convergence and Complexity

Authors: Trong Phong Nguyen, Edouard Pauwels, Emile Richard, Bruce W. Suter

Abstract: We consider the extragradient method to minimize the sum of two functions, the first one being smooth and the second being convex. Under the Kurdyka-Lojasiewicz assumption, we prove that the sequence produced by the extragradient method converges to a critical point of the problem and has finite length. The analysis is extended to the case when both functions are convex. We provide, in this case,… ▽ More We consider the extragradient method to minimize the sum of two functions, the first one being smooth and the second being convex. Under the Kurdyka-Lojasiewicz assumption, we prove that the sequence produced by the extragradient method converges to a critical point of the problem and has finite length. The analysis is extended to the case when both functions are convex. We provide, in this case, a sublinear convergence rate, as for gradient-based methods. Furthermore, we show that the recent small-prox complexity result can be applied to this method. Considering the extragradient method is an occasion to describe an exact line search scheme for proximal decomposition methods. We provide details for the implementation of this scheme for the one norm regularized least squares problem and demonstrate numerical results which suggest that combining nonaccelerated methods with exact line search can be a competitive choice. △ Less

Submitted 13 December, 2017; v1 submitted 26 September, 2016; originally announced September 2016.

arXiv:1606.03858 [pdf, other]

Sorting out typicality with the inverse moment matrix SOS polynomial

Authors: Jean-Bernard Lasserre, Edouard Pauwels

Abstract: We study a surprising phenomenon related to the representation of a cloud of data points using polynomials. We start with the previously unnoticed empirical observation that, given a collection (a cloud) of data points, the sublevel sets of a certain distinguished polynomial capture the shape of the cloud very accurately. This distinguished polynomial is a sum-of-squares (SOS) derived in a simple… ▽ More We study a surprising phenomenon related to the representation of a cloud of data points using polynomials. We start with the previously unnoticed empirical observation that, given a collection (a cloud) of data points, the sublevel sets of a certain distinguished polynomial capture the shape of the cloud very accurately. This distinguished polynomial is a sum-of-squares (SOS) derived in a simple manner from the inverse of the empirical moment matrix. In fact, this SOS polynomial is directly related to orthogonal polynomials and the Christoffel function. This allows to generalize and interpret extremality properties of orthogonal polynomials and to provide a mathematical rationale for the observed phenomenon. Among diverse potential applications, we illustrate the relevance of our results on a network intrusion detection task for which we obtain performances similar to existing dedicated methods reported in the literature. △ Less

Submitted 14 June, 2016; v1 submitted 13 June, 2016; originally announced June 2016.

arXiv:1605.02452 [pdf, other]

Positivity certificates in optimal control

Authors: Edouard Pauwels, Didier Henrion, Jean-Bernard Lasserre

Abstract: We propose a tutorial on relaxations and weak formulations of optimal control with their semidefinite approximations. We present this approach solely through the prism of positivity certificates which we consider to be the most accessible for a broad audience, in particular in the engineering and robotics communities. This simple concept allows to express very concisely powerful approximation cert… ▽ More We propose a tutorial on relaxations and weak formulations of optimal control with their semidefinite approximations. We present this approach solely through the prism of positivity certificates which we consider to be the most accessible for a broad audience, in particular in the engineering and robotics communities. This simple concept allows to express very concisely powerful approximation certificates in control. The relevance of this technique is illustrated on three applications: region of attraction approximation, direct optimal control and inverse optimal control, for which it constitutes a common denominator. In a first step, we highlight the core mechanisms underpinning the application of positivity in control and how they appear in the different control applications. This relies on simple mathematical concepts and gives a unified treatment of the applications considered. This presentation is based on the combination and simplification of published materials. In a second step, we describe briefly relations with broader literature, in particular, occupation measures and Hamilton-Jacobi-Bellman equation which are important elements of the global picture. We describe the Sum-Of-Squares (SOS) semidefinite hierarchy in the semialgebraic case and briefly mention its convergence properties. Numerical experiments on a classical example in robotics, namely the nonholonomic vehicle, illustrate the concepts presented in the text for the three applications considered. △ Less

Submitted 9 May, 2016; originally announced May 2016.

arXiv:1604.01654 [pdf, ps, other]

The value function approach to convergence analysis in composite optimization

Authors: Edouard Pauwels

Abstract: This works aims at understanding further convergence properties of first order local search methods with complex geometries. We focus on the composite optimization model which unifies within a simple formalism many problems of this type. We provide a general convergence analysis of the composite Gauss-Newton method under tameness assumptions (an extension of semi-algebraicity). Tameness is a very… ▽ More This works aims at understanding further convergence properties of first order local search methods with complex geometries. We focus on the composite optimization model which unifies within a simple formalism many problems of this type. We provide a general convergence analysis of the composite Gauss-Newton method under tameness assumptions (an extension of semi-algebraicity). Tameness is a very general condition satisfied by virtually all problems solved in practice. The analysis is based on recent progresses in understanding convergence properties of sequential convex programming methods through the value function. △ Less

Submitted 6 October, 2016; v1 submitted 6 April, 2016; originally announced April 2016.

arXiv:1508.03070 [pdf, other]

doi 10.1086/682281

The Mid-Infrared Instrument for the James Webb Space Telescope, VI: The Medium Resolution Spectrometer

Authors: Martyn Wells, J. -W. Pel, Alistair Glasse, G. S. Wright, Gabby Aitink-Kroes, Ruyman Azzollini, Steven Beard, B. R. Brandl, Angus Gallie, V. C. Geers, A. M. Glauser, Peter Hastings, Th. Henning, Rieks Jager, K. Justtanont, Bob Kruizinga, Fred Lahuis, David Lee, I. Martinez-Delgado, J. R. Martinez-Galarza, M. Meijers, Jane E. Morrison, Friedrich Mueller, Thodori Nakos, Brian O'Sullivan , et al. (13 additional authors not shown)

Abstract: We describe the design and performance of the Medium Resolution Spectrometer (MRS) for the JWST-MIRI instrument. The MRS incorporates four coaxial spectral channels in a compact opto-mechanical layout that generates spectral images over fields of view up to 7.7 X 7.7 arcseconds in extent and at spectral resolving powers ranging from 1,300 to 3,700. Each channel includes an all-reflective integral… ▽ More We describe the design and performance of the Medium Resolution Spectrometer (MRS) for the JWST-MIRI instrument. The MRS incorporates four coaxial spectral channels in a compact opto-mechanical layout that generates spectral images over fields of view up to 7.7 X 7.7 arcseconds in extent and at spectral resolving powers ranging from 1,300 to 3,700. Each channel includes an all-reflective integral field unit (IFU): an 'image slicer' that reformats the input field for presentation to a grating spectrometer. Two 1024 X 1024 focal plane arrays record the output spectral images with an instantaneous spectral coverage of approximately one third of the full wavelength range of each channel. The full 5 to 28.5 micron spectrum is then obtained by making three exposures using gratings and pass-band-determining filters that are selected using just two three-position mechanisms. The expected on-orbit optical performance is presented, based on testing of the MIRI Flight Model and including spectral and spatial coverage and resolution. The point spread function of the reconstructed images is shown to be diffraction limited and the optical transmission is shown to be consistent with the design expectations. △ Less

Submitted 12 August, 2015; originally announced August 2015.

Journal ref: PASP, 2015, 127, 646

Showing 1–50 of 59 results for author: Pauwels, É