-
Symmetric Neural-Collapse Representations with Supervised Contrastive Loss: The Impact of ReLU and Batching
Authors:
Ganesh Ramachandra Kini,
Vala Vakilian,
Tina Behnia,
Jaidev Gill,
Christos Thrampoulidis
Abstract:
Supervised contrastive loss (SCL) is a competitive and often superior alternative to the cross-entropy loss for classification. While prior studies have demonstrated that both losses yield symmetric training representations under balanced data, this symmetry breaks under class imbalances. This paper presents an intriguing discovery: the introduction of a ReLU activation at the final layer effectively restores the symmetry in SCL-learned representations. We arrive at this finding analytically, by establishing that the global minimizers of an unconstrained features model with SCL loss and entry-wise non-negativity constraints form an orthogonal frame. Extensive experiments conducted across various datasets, architectures, and imbalance scenarios corroborate our finding. Importantly, our experiments reveal that the inclusion of the ReLU activation restores symmetry without compromising test accuracy. This constitutes the first geometry characterization of SCL under imbalances. Additionally, our analysis and experiments underscore the pivotal role of batch selection strategies in representation geometry. By proving necessary and sufficient conditions for mini-batch choices that ensure invariant symmetric representations, we introduce batch-binding as an efficient strategy that guarantees these conditions hold.
Submitted 18 October, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data
Authors:
Tina Behnia,
Ganesh Ramachandra Kini,
Vala Vakilian,
Christos Thrampoulidis
Abstract:
Various logit-adjusted parameterizations of the cross-entropy (CE) loss have been proposed as alternatives to weighted CE for training large models on label-imbalanced data far beyond the zero train error regime. The driving force behind those designs has been the theory of implicit bias, which for linear(ized) models, explains why they successfully induce bias on the optimization path towards solutions that favor minorities. Aiming to extend this theory to non-linear models, we investigate the implicit geometry of classifiers and embeddings that are learned by different CE parameterizations. Our main result characterizes the global minimizers of a non-convex cost-sensitive SVM classifier for the unconstrained features model, which serves as an abstraction of deep nets. We derive closed-form formulas for the angles and norms of classifiers and embeddings as a function of the number of classes, the imbalance and the minority ratios, and the loss hyperparameters. Using these, we show that logit-adjusted parameterizations can be appropriately tuned to learn symmetric geometries irrespective of the imbalance ratio. We complement our analysis with experiments and an empirical study of convergence accuracy in deep-nets.
Submitted 13 March, 2023;
originally announced March 2023.
-
Imbalance Trouble: Revisiting Neural-Collapse Geometry
Authors:
Christos Thrampoulidis,
Ganesh R. Kini,
Vala Vakilian,
Tina Behnia
Abstract:
Neural Collapse refers to the remarkable structural properties characterizing the geometry of class embeddings and classifier weights, found by deep nets when trained beyond zero training error. However, this characterization only holds for balanced data. Here we thus ask whether it can be made invariant to class imbalances. Towards this end, we adopt the unconstrained-features model (UFM), a recent theoretical model for studying neural collapse, and introduce Simplex-Encoded-Labels Interpolation (SELI) as an invariant characterization of the neural collapse phenomenon. Specifically, we prove for the UFM with cross-entropy loss and vanishing regularization that, irrespective of class imbalances, the embeddings and classifiers always interpolate a simplex-encoded label matrix and that their individual geometries are determined by the SVD factors of this same label matrix. We then present extensive experiments on synthetic and real datasets that confirm convergence to the SELI geometry. However, we caution that convergence worsens with increasing imbalances. We theoretically support this finding by showing that unlike the balanced case, when minorities are present, ridge-regularization plays a critical role in tweaking the geometry. This defines new questions and motivates further investigations into the impact of class imbalances on the rates at which first-order methods converge to their asymptotically preferred solutions.
Submitted 10 August, 2022;
originally announced August 2022.
-
Label-Imbalanced and Group-Sensitive Classification under Overparameterization
Authors:
Ganesh Ramachandra Kini,
Orestis Paraskevas,
Samet Oymak,
Christos Thrampoulidis
Abstract:
The goal in label-imbalanced and group-sensitive classification is to optimize relevant metrics such as balanced error and equal opportunity. Classical methods, such as weighted cross-entropy, fail when training deep nets to the terminal phase of training (TPT), that is, training beyond zero training error. This observation has motivated a recent flurry of activity in developing heuristic alternatives following the intuitive mechanism of promoting larger margin for minorities. In contrast to previous heuristics, we follow a principled analysis explaining how different loss adjustments affect margins. First, we prove that for all linear classifiers trained in TPT, it is necessary to introduce multiplicative, rather than additive, logit adjustments so that the interclass margins change appropriately. To show this, we discover a connection of the multiplicative CE modification to the cost-sensitive support-vector machines. Perhaps counterintuitively, we also find that, at the start of training, the same multiplicative weights can actually harm the minority classes. Thus, while additive adjustments are ineffective in the TPT, we show that they can speed up convergence by countering the initial negative effect of the multiplicative weights. Motivated by these findings, we formulate the vector-scaling (VS) loss, which captures existing techniques as special cases. Moreover, we introduce a natural extension of the VS-loss to group-sensitive classification, thus treating the two common types of imbalances (label/group) in a unifying way. Importantly, our experiments on state-of-the-art datasets are fully consistent with our theoretical insights and confirm the superior performance of our algorithms. Finally, for imbalanced Gaussian-mixtures data, we perform a generalization analysis, revealing tradeoffs between balanced / standard error and equal opportunity.
Submitted 8 November, 2021; v1 submitted 2 March, 2021;
originally announced March 2021.
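The abstract above describes combining multiplicative and additive logit adjustments but does not state a formula. The following is a minimal sketch of one way such a loss could look, assuming each class logit is scaled by a per-class factor and shifted by a per-class offset before the usual softmax cross-entropy. The function name `vs_loss` and the parameter names `delta` (multiplicative) and `iota` (additive) are illustrative assumptions, not taken from the listing.

```python
import math

def vs_loss(logits, label, delta, iota):
    """Cross-entropy on adjusted logits delta[c] * logits[c] + iota[c].

    Assumed form for illustration only: delta holds per-class multiplicative
    factors and iota holds per-class additive offsets.
    """
    adjusted = [d * z + i for z, d, i in zip(logits, delta, iota)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]

# With delta = 1 and iota = 0 for every class, the sketch reduces to plain
# cross-entropy; a purely additive or purely multiplicative adjustment is
# obtained by changing only one of the two vectors.
logits = [2.0, 0.5, -1.0]
plain_ce = vs_loss(logits, 2, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
additive_only = vs_loss(logits, 2, [1.0, 1.0, 1.0], [0.0, 0.0, 1.5])
multiplicative_only = vs_loss(logits, 2, [1.0, 1.0, 0.5], [0.0, 0.0, 0.0])
```

Setting one of the two vectors to its neutral value recovers the additive-only or multiplicative-only special case, which is the sense in which a combined loss can contain both kinds of adjustment.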
-
Analytic Study of Double Descent in Binary Classification: The Impact of Loss
Authors:
Ganesh Kini,
Christos Thrampoulidis
Abstract:
Extensive empirical evidence reveals that, for a wide range of different learning methods and datasets, the risk curve exhibits a double-descent (DD) trend as a function of the model size. In a recent paper [Zeyu,Kammoun,Thrampoulidis,2019] the authors studied binary linear classification models and showed that the test error of gradient descent (GD) with logistic loss undergoes a DD. In this paper, we complement these results by extending them to GD with square loss. We show that the DD phenomenon persists, but we also identify several differences compared to logistic loss. This emphasizes that crucial features of DD curves (such as their transition threshold and global minima) depend both on the training data and on the learning algorithm. We further study the dependence of DD curves on the size of the training set. Similar to our earlier work, our results are analytic: we plot the DD curves by first deriving sharp asymptotics for the test error under Gaussian features. Albeit simple, the models permit a principled study of DD features, the outcomes of which theoretically corroborate related empirical findings occurring in more complex learning tasks.
Submitted 30 January, 2020;
originally announced January 2020.
-
A Tight Rate Bound and Matching Construction for Locally Recoverable Codes with Sequential Recovery From Any Number of Multiple Erasures
Authors:
S. B. Balaji,
Ganesh R. Kini,
P. Vijay Kumar
Abstract:
In this paper, a locally recoverable code (LRC) means a linear code in which a given code symbol can be recovered by taking a linear combination of at most $r$ other code symbols, with $r \ll k$. A natural extension is to the local recovery of a set of $t$ erased symbols. There have been several approaches proposed for the handling of multiple erasures. The approach considered here is one of sequential recovery, meaning that the $t$ erased symbols are recovered in succession, each time contacting at most $r$ other symbols for assistance in recovery. Under the constraint that each erased symbol be recoverable by contacting at most $r$ other code symbols, this approach is the most general and hence offers the maximum possible code rate. We characterize the maximum possible rate of an LRC with sequential recovery for any $r \geq 3$ and $t$. We do this by first deriving an upper bound on code rate and then going on to construct a {\em binary} code that achieves this optimal rate. The upper bound derived here proves a conjecture made earlier relating to the structure (but not the exact form) of the rate bound. Our approach also permits us to deduce the structure of the parity-check matrix of a rate-optimal LRC with sequential recovery.
The parity-check matrix in turn, leads to a graphical description of the code. The construction of a binary code having rate achieving the upper bound derived here makes use of this description. Interestingly, it turns out that a subclass of binary codes that are both rate and block-length optimal, correspond to graphs known as Moore graphs that are regular graphs having the smallest number of vertices for a given girth. A connection with Tornado codes is also made in the paper.
Submitted 6 December, 2018;
originally announced December 2018.
-
A Rate-Optimal Construction of Codes with Sequential Recovery with Low Block Length
Authors:
Balaji Srinivasan Babu,
Ganesh R. Kini,
P. Vijay Kumar
Abstract:
An erasure code is said to be a code with sequential recovery with parameters $r$ and $t$, if for any $s \leq t$ erased code symbols, there is an $s$-step recovery process in which at each step we recover exactly one erased code symbol by contacting at most $r$ other code symbols. In earlier work by the same authors, presented at ISIT 2017, we had given a construction for binary codes with sequential recovery from $t$ erasures, with locality parameter $r$, which were optimal in terms of code rate for given $r,t$, but where the block length was large, on the order of $r^{c^t}$, for some constant $c >1$. In the present paper, we present an alternative construction of a rate-optimal code for any value of $t$ and any $r\geq3$, where the block length is significantly smaller, on the order of $r^{\frac{5t}{4}+\frac{7}{4}}$ (in some instances of order $r^{\frac{3t}{2}+2}$). Our construction is based on the construction of a certain kind of tree-like graph with girth $t+1$. We construct these graphs and hence the codes recursively.
Submitted 21 January, 2018;
originally announced January 2018.
-
A Tight Rate Bound and a Matching Construction for Locally Recoverable Codes with Sequential Recovery From Any Number of Multiple Erasures
Authors:
S. B. Balaji,
Ganesh R. Kini,
P. Vijay Kumar
Abstract:
An $[n,k]$ code $\mathcal{C}$ is said to be locally recoverable in the presence of a single erasure, and with locality parameter $r$, if each of the $n$ code symbols of $\mathcal{C}$ can be recovered by accessing at most $r$ other code symbols. An $[n,k]$ code is said to be a locally recoverable code with sequential recovery from $t$ erasures, if for any set of $s \leq t$ erasures, there is an $s$-step sequential recovery process, in which at each step, a single erased symbol is recovered by accessing at most $r$ other code symbols. This is equivalent to the requirement that for any set of $s \leq t$ erasures, the dual code contain a codeword whose support contains the coordinate of precisely one of the $s$ erased symbols. In this paper, a tight upper bound on the rate of such a code is derived, for any number of erasures $t$ and any value $r \geq 3$ of the locality parameter. This bound proves an earlier conjecture due to Song, Cai and Yuen. While the bound is valid irrespective of the field over which the code is defined, a matching construction of {\em binary} codes that are rate-optimal is also provided, again for any value of $t$ and any value $r\geq3$.
Submitted 17 February, 2017; v1 submitted 25 November, 2016;
originally announced November 2016.
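The sequential-recovery definition in the abstract above can be illustrated with a small peeling procedure. This is a minimal sketch, not the paper's construction: it assumes the available local checks are given as a list of index tuples (each the support of a low-weight dual codeword), and it repeatedly looks for a check that touches exactly one still-erased symbol. The names `sequential_recovery_order`, `tolerates`, and the toy `checks` list are hypothetical, and the toy example uses locality $r = 2$ purely for brevity.

```python
from itertools import combinations

def sequential_recovery_order(checks, erased):
    """Return an order in which the erased symbols can be recovered one at a
    time, or None if the given checks do not allow it.

    checks: iterable of index tuples, each the support of one parity check.
    erased: iterable of erased coordinate indices.
    """
    remaining = set(erased)
    order = []
    while remaining:
        for support in checks:
            hit = remaining & set(support)
            if len(hit) == 1:
                # This check involves exactly one erased symbol, so that
                # symbol is determined by the other symbols in the check.
                symbol = next(iter(hit))
                order.append(symbol)
                remaining.remove(symbol)
                break
        else:
            # No check isolates a single erased symbol: peeling is stuck.
            return None
    return order

def tolerates(checks, n, t):
    """True if every erasure pattern of size at most t on n symbols can be
    recovered sequentially using the given checks."""
    return all(
        sequential_recovery_order(checks, pattern) is not None
        for s in range(1, t + 1)
        for pattern in combinations(range(n), s)
    )

# Hypothetical toy example: three checks on six symbols, each of weight 3.
checks = [(0, 1, 2), (2, 3, 4), (4, 5, 0)]
two_ok = tolerates(checks, 6, 2)
three_ok = tolerates(checks, 6, 3)
```

In this toy example every pattern of one or two erasures can be peeled, while the pattern {0, 2, 4} cannot, because each check then contains two erased symbols.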
-
Performance Metrics Analysis of Torus Embedded Hypercube Interconnection Network
Authors:
N. Gopalakrishna Kini,
M. Sathish Kumar,
H. S. Mruthyunjaya
Abstract:
The advantages of the hypercube network and the torus topology are combined to derive an embedded architecture for a product network known as the torus embedded hypercube scalable interconnection network. This paper analyzes the torus embedded hypercube network as it pertains to parallel architecture. Network metrics are used to show how a good embedded network can be designed for parallel computation. A network parameter analysis and a comparison of the embedded network with the basic networks are presented.
Submitted 11 December, 2009;
originally announced December 2009.