QCQA: Quality and Capacity-aware grouped Query Attention

Authors: Vinay Joshi, Prashant Laddha, Shambhavi Sinha, Om Ji Omer, Sreenivas Subramoney

Abstract: Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as Multi-Query Attention (MQA) and Grouped Query Attention (GQA) mitigate these challenges by grouping query heads and consequently reducing the number of corresponding key and value heads. However, MQA and GQA decrease the KV-cache size requirements at the expense of LLM accuracy (quality of text generation). These methods do not ensure an optimal tradeoff between KV-cache size and text generation quality due to the absence of quality-aware grouping of query heads. To address this issue, we propose Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a computationally efficient and inexpensive fitness function. We demonstrate that QCQA achieves a significantly better tradeoff between KV-cache capacity and LLM accuracy compared to GQA. For the Llama2 7B model, QCQA achieves 20% higher accuracy than GQA with similar KV-cache size requirements in the absence of fine-tuning. After fine-tuning both QCQA and GQA, for a similar KV-cache size, QCQA provides 10.55% higher accuracy than GQA. Furthermore, QCQA requires 40% less KV-cache size than GQA to attain similar accuracy. The proposed quality and capacity-aware grouping of query heads can serve as a new paradigm for KV-cache optimization in autoregressive LLM inference.

Submitted 8 June, 2024; originally announced June 2024.
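
The KV-cache saving that MQA and GQA obtain by sharing key and value heads across query heads can be illustrated with simple arithmetic. The sketch below is not taken from the paper: the function name `kv_cache_bytes` is hypothetical, and the model shape (32 layers, 32 query heads, head dimension 128, 16-bit values, 4096-token context) is an assumed Llama2 7B configuration that the abstract does not state.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for one batch of sequences."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed model shape (not stated in the abstract).
LAYERS, QUERY_HEADS, HEAD_DIM = 32, 32, 128
SEQ_LEN = 4096

# The number of key/value heads equals the number of query-head groups.
for name, kv_heads in [("MHA", QUERY_HEADS), ("GQA, 8 groups", 8), ("MQA", 1)]:
    size = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, SEQ_LEN)
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache size scales linearly with the number of key/value heads, which is why the choice of grouping directly sets the capacity side of the tradeoff the abstract describes.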

arXiv:2111.08434 [pdf, other]

Robust 3D Scene Segmentation through Hierarchical and Learnable Part-Fusion

Authors: Anirud Thyagharajan, Benjamin Ummenhofer, Prashant Laddha, Om J Omer, Sreenivas Subramoney

Abstract: 3D semantic segmentation is a fundamental building block for several scene understanding applications such as autonomous driving, robotics and AR/VR. Several state-of-the-art semantic segmentation models suffer from the part misclassification problem, wherein parts of the same object are labelled incorrectly. Previous work has used hierarchical, iterative methods to fuse semantic and instance information, but these methods lack learnability in context fusion, and are computationally complex and heuristic driven. This paper presents Segment-Fusion, a novel attention-based method for hierarchical fusion of semantic and instance information to address the part misclassifications. The presented method includes a graph segmentation algorithm for grouping points into segments that pools point-wise features into segment-wise features, a learnable attention-based network to fuse these segments based on their semantic and instance features, followed by a simple yet effective connected component labelling algorithm to convert segment features to instance labels. Segment-Fusion can be flexibly employed with any network architecture for semantic/instance segmentation. It improves the qualitative and quantitative performance of several semantic segmentation backbones by up to 5% when evaluated on the ScanNet and S3DIS datasets.

Submitted 16 November, 2021; originally announced November 2021.

arXiv:2011.12669 [pdf, other]

AccSS3D: Accelerator for Spatially Sparse 3D DNNs

Authors: Om Ji Omer, Prashant Laddha, Gurpreet S Kalsi, Anirud Thyagharajan, Kamlesh R Pillai, Abhimanyu Kulkarni, Anbang Yao, Yurong Chen, Sreenivas Subramoney

Abstract: Semantic understanding and completion of real world scenes is a foundational primitive of 3D visual perception widely used in high-level applications such as robotics, medical imaging, autonomous driving and navigation. Due to the curse of dimensionality, compute and memory requirements for 3D scene understanding grow in cubic complexity with voxel resolution, posing a huge impediment to realizing real-time energy efficient deployments. The inherent spatial sparsity present in the 3D world due to free space is fundamentally different from the channel-wise sparsity that has been extensively studied. We present Accelerator for Spatially Sparse 3D DNNs (AccSS3D), the first end-to-end solution for accelerating 3D scene understanding by exploiting the ample spatial sparsity. As an algorithm-dataflow-architecture co-designed system specialized for spatially-sparse 3D scene understanding, AccSS3D includes novel spatial locality-aware metadata structures, a near-zero latency and spatial sparsity-aware dataflow optimizer, a surface orientation aware point cloud reordering algorithm and a co-designed hardware accelerator for spatial sparsity that exploits data reuse through systolic and multicast interconnects. The SSpNNA accelerator core together with the 64 KB of L1 memory requires 0.92 mm² of area in a 16 nm process at 1 GHz. Overall, AccSS3D achieves a 16.8x speedup and a 2232x energy efficiency improvement for 3D sparse convolution compared to an Intel-i7-8700K 4-core CPU, which translates to an 11.8x end-to-end 3D semantic segmentation speedup and a 24.8x energy efficiency improvement (iso technology node).

Submitted 25 November, 2020; originally announced November 2020.

arXiv:1103.2741 [pdf]

Memory Retrieval in the B-Matrix Neural Network

Authors: Prerana Laddha

Abstract: This paper is an extension to the memory retrieval procedure of the B-Matrix approach [6],[17] to neural network learning. The B-Matrix is a part of the interconnection matrix generated from the Hebbian neural network, and in memory retrieval, the B-matrix is clamped with a small fragment of the memory. The fragment gradually enlarges by means of feedback, until the entire vector is obtained. In this paper, we propose the use of delta learning to enhance the retrieval rate of the stored memories.

Submitted 14 March, 2011; originally announced March 2011.

Comments: 8 Pages, 4 Figures

arXiv:1007.5476 [pdf]

Degree of Separation in Social Networks

Authors: Prerana Laddha

Abstract: According to the small-world concept, the entire world is connected through short chains of acquaintances. In popular imagination this is captured in the phrase six degrees of separation, implying that any two individuals are, at most, six handshakes away. Social network analysis is the understanding of concepts and information on relationships among interacting units in an ecological system. In this analysis the properties of the actors are explained in terms of the structures of links amongst them. In general, the relational links between the actors are primary and the properties of the actors are secondary. This paper presents two methods to calculate the average degree of separation between the actors or nodes in a graph. We apply this approach to other random graphs depicting social networks and then compare the characteristics of these graphs with the average degree of separation.

Submitted 30 July, 2010; originally announced July 2010.

Comments: 9 pages, 7 figures
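
The abstract above does not say which two methods the paper uses. As an illustration only, one standard way to compute an average degree of separation is to run a breadth-first search from every node and average the shortest-path lengths over all reachable ordered pairs. The function name `average_separation` and the adjacency-dictionary input format are assumptions made for this sketch; every neighbour is assumed to appear as a key in the dictionary.

```python
from collections import deque


def average_separation(adjacency):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total_distance = 0
    pair_count = 0
    for source in adjacency:
        # Breadth-first search gives hop counts in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total_distance += sum(distance.values())
        pair_count += len(distance) - 1  # exclude the source itself
    return total_distance / pair_count if pair_count else 0.0


# A three-person chain: 0 knows 1, and 1 knows 2.
chain = {0: [1], 1: [0, 2], 2: [1]}
print(average_separation(chain))
```

Unreachable pairs are simply left out of the average in this sketch; a different convention would be needed for graphs where disconnected components should be penalized.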

Showing 1–5 of 5 results for author: Laddha, P