-
Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling
Authors:
Satya Kapoor,
Alex Gil,
Sreyoshi Bhaduri,
Anshul Mittal,
Rutu Mulkar
Abstract:
Topic modeling is a widely used technique for uncovering thematic structures from large text corpora. However, most topic modeling approaches, e.g., Latent Dirichlet Allocation (LDA), struggle to capture the nuanced semantics and contextual understanding required to accurately model complex narratives. Recent advancements in this area include methods like BERTopic, which have demonstrated significantly improved topic coherence and thus established a new standard for benchmarking. In this paper, we present a novel approach, the Qualitative Insights Tool (QualIT), that integrates large language models (LLMs) with existing clustering-based topic modeling approaches. Our method leverages the deep contextual understanding and powerful language generation capabilities of LLMs to enrich the topic modeling process using clustering. We evaluate our approach on a large corpus of news articles and demonstrate substantial improvements in topic coherence and topic diversity compared to baseline topic modeling techniques. On the 20 ground-truth topics, our method shows 70% topic coherence (vs. 65% and 57% for the benchmarks) and 95.5% topic diversity (vs. 85% and 72% for the benchmarks). Our findings suggest that the integration of LLMs can unlock new opportunities for topic modeling of dynamic and complex text data, as is common in talent management research contexts.
Submitted 23 September, 2024;
originally announced September 2024.
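The QualIT abstract above reports topic diversity without defining it. A minimal sketch, assuming the commonly used definition (the fraction of unique words among the top-k words of all topics), which may differ from the paper's exact formulation; the function name and the `top_k` parameter are illustrative:

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    Assumption: each topic is a list of words ordered by importance.
    A value of 1.0 means no word is shared between topics.
    """
    words = [word for topic in topics for word in topic[:top_k]]
    return len(set(words)) / len(words)


# Two toy topics that share one of their top-2 words.
toy_topics = [["election", "vote"], ["vote", "market"]]
score = topic_diversity(toy_topics, top_k=2)  # 3 unique words out of 4
```

Under this definition, a score near 1.0 means the topics rarely reuse the same top words.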
-
SALSA: Speedy ASR-LLM Synchronous Aggregation
Authors:
Ashish Mittal,
Darshan Prabhu,
Sunita Sarawagi,
Preethi Jyothi
Abstract:
Harnessing pre-trained LLMs to improve ASR systems, particularly for low-resource languages, is now an emerging area of research. Existing methods range from using LLMs for ASR error correction to tightly coupled systems that replace the ASR decoder with the LLM. These approaches either increase decoding time or require expensive training of the cross-attention layers. We propose SALSA, which couples the decoder layers of the ASR to the LLM decoder, while synchronously advancing both decoders. Such coupling is performed with a simple projection of the last decoder state, and is thus significantly more training efficient than earlier approaches. A challenge of our proposed coupling is handling the mismatch between the tokenizers of the LLM and ASR systems. We handle this mismatch using cascading tokenization with respect to the LLM and ASR vocabularies. We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.
Submitted 29 August, 2024;
originally announced August 2024.
-
Reconciling Methodological Paradigms: Employing Large Language Models as Novice Qualitative Research Assistants in Talent Management Research
Authors:
Sreyoshi Bhaduri,
Satya Kapoor,
Alex Gil,
Anshul Mittal,
Rutu Mulkar
Abstract:
Qualitative data collection and analysis approaches, such as those employing interviews and focus groups, provide rich insights into customer attitudes, sentiment, and behavior. However, manually analyzing qualitative data requires extensive time and effort to identify relevant topics and thematic insights. This study proposes a novel approach to address this challenge by leveraging Retrieval Augmented Generation (RAG) based Large Language Models (LLMs) for analyzing interview transcripts. The novelty of this work lies in strategizing the research inquiry as one that is augmented by an LLM that serves as a novice research assistant. This research explores the mental model of LLMs to serve as novice qualitative research assistants for researchers in the talent management space. A RAG-based LLM approach is extended to enable topic modeling of semi-structured interview data, showcasing the versatility of these models beyond their traditional use in information retrieval and search. Our findings demonstrate that the LLM-augmented RAG approach can successfully extract topics of interest, with significant coverage compared to manually generated topics from the same dataset. This establishes the viability of employing LLMs as novice qualitative research assistants. Additionally, the study recommends that researchers leveraging such models lean heavily on quality criteria used in traditional qualitative research to ensure rigor and trustworthiness of their approach. Finally, the paper presents key recommendations for industry practitioners seeking to reconcile the use of LLMs with established qualitative research paradigms, providing a roadmap for the effective integration of these powerful, albeit novice, AI tools in the analysis of qualitative datasets within talent
Submitted 20 August, 2024;
originally announced August 2024.
-
Power Aware Container Placement in Cloud Computing with Affinity and Cubic Power Model
Authors:
Suvarthi Sarkar,
Nandini Sharma,
Akshat Mittal,
Aryabartta Sahu
Abstract:
Modern data centres are increasingly adopting containers to enhance power and performance efficiency. These data centres consist of multiple heterogeneous machines, each equipped with varying amounts of resources such as CPU, I/O, memory, and network bandwidth. Data centers rent their resources to applications, which demand different amounts of resources and execute on machines for extended durations if the machines provide the demanded resources to the applications. Certain applications run efficiently on specific machines, referred to as system affinity between applications and machines. In contrast, others are incompatible with specific machines, referred to as anti-affinity between applications and machines. We consider that there are multiple applications, and data centers need to execute as many applications as possible. Data centers incur electricity costs based on the CPU usage caused by executing applications, with the cost being proportional to the cube of the total CPU usage. It is a challenging problem to place applications on the machines they have an affinity for while keeping the electricity cost in check. Our work addresses the placement problem of matching applications to machines to minimize overall electricity costs while maximizing the number of affinity pairs of machines and applications. We propose three solution approaches: (a) Power-Aware Placement (PAP): applications are placed on machines where power usage is minimized, (b) Affinity-Aware Placement (AAP): applications are placed on machines where affinity is maximized, (c) Combined Power-Affinity Placement (CPAAP): this approach integrates the benefits of both PAP and AAP. Our proposed approach improves the affinity satisfaction ratio by up to 4% while reducing the total system cost by up to 26% and improving the affinity payoff ratio by up to 37% compared to state-of-the-art approaches for real-life datasets.
Submitted 2 August, 2024;
originally announced August 2024.
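The cubic power model in the container-placement abstract above implies that spreading load is cheaper than packing it. A minimal sketch, assuming the cube applies to each machine's CPU usage separately and ignoring capacities and affinity; the greedy rule and all names here are hypothetical illustrations, not the paper's PAP algorithm:

```python
def electricity_cost(cpu_loads, k=1.0):
    # Assumption: cost per machine = k * (CPU usage on that machine) ** 3.
    return sum(k * load ** 3 for load in cpu_loads)


def greedy_power_placement(demands, n_machines):
    # Hypothetical greedy rule: put each application on the machine
    # whose cubic cost increases the least when the demand is added.
    loads = [0.0] * n_machines
    placement = []
    for d in demands:
        best = min(
            range(n_machines),
            key=lambda m: (loads[m] + d) ** 3 - loads[m] ** 3,
        )
        loads[best] += d
        placement.append(best)
    return placement, loads


# Same total demand (4 CPU units), packed on one machine vs. split evenly.
packed = electricity_cost([4.0, 0.0])
balanced = electricity_cost([2.0, 2.0])

placement, loads = greedy_power_placement([3.0, 2.0, 2.0, 1.0], 2)
```

Because the cost is convex in the load, the marginal-cost rule always favors the least-loaded machine, which is why a power-only objective can conflict with affinity preferences.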
-
Semantic SQL -- Combining and optimizing semantic predicates in SQL
Authors:
Akash Mittal,
Anshul Bheemreddy,
Huili Tao
Abstract:
In recent years, the surge in unstructured data analysis, facilitated by advancements in Machine Learning (ML), has prompted diverse approaches for handling images, text documents, and videos. Analysts, leveraging ML models, can extract meaningful information from unstructured data and store it in relational databases, allowing the execution of SQL queries for further analysis. Simultaneously, vector databases have emerged, embedding unstructured data for efficient top-k queries based on textual queries. This paper introduces a novel framework SSQL - Semantic SQL that utilizes these two approaches, enabling the incorporation of semantic queries within SQL statements. Our approach extends SQL queries with dedicated keywords for specifying semantic queries alongside predicates related to ML model results and metadata. Our experimental results show that using just semantic queries fails catastrophically to answer count and spatial queries in more than 60% of the cases. Our proposed method jointly optimizes the queries containing both semantic predicates and predicates on structured tables, such as those generated by ML models or other metadata. Further, to improve the query results, we incorporated human-in-the-loop feedback to determine the optimal similarity score threshold for returning results.
Submitted 5 April, 2024;
originally announced April 2024.
-
The Cordiality Game and the Game Cordiality Number
Authors:
Elliot Krop,
Aryan Mittal,
Michael C. Wigal
Abstract:
The cordiality game is played on a graph $G$ by two players, Admirable (A) and Impish (I), who take turns selecting unlabeled vertices of $G$. Admirable labels the selected vertices by $0$ and Impish by $1$, and the resulting label on any edge is the sum modulo $2$ of the labels of the vertices incident to that edge. The two players have opposite goals: Admirable attempts to minimize the number of edges with different labels as much as possible while Impish attempts to maximize this number. When both Admirable and Impish play their optimal games, we define the \emph{game cordiality number}, $c_g(G)$, as the absolute difference between the number of edges labeled zero and one. Let $P_n$ be the path on $n$ vertices. We show $c_g(P_n)\le \frac{n-3}{3}$ when $n \equiv 0 \pmod 3$, $c_g(P_n)\le \frac{n-1}{3}$ when $n \equiv 1 \pmod 3$, and $c_g(P_n)\le \frac{n+1}{3}$ when $n \equiv 2\pmod 3$. Furthermore, we show that a similar bound, $c_g(T) \leq \frac{|T|}{2}$, holds for any tree $T$.
Submitted 26 March, 2024;
originally announced March 2024.
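The edge-labeling rule in the cordiality game abstract above can be checked on small examples. A minimal sketch that scores a finished vertex labeling; it does not model optimal play, so it does not compute $c_g(G)$ itself, and the function names are illustrative:

```python
def path_edges(n):
    # Edges of the path P_n on vertices 0, 1, ..., n - 1.
    return [(i, i + 1) for i in range(n - 1)]


def label_difference(edges, labels):
    # Each edge is labeled by the sum modulo 2 of its endpoint labels;
    # return |#edges labeled 0 - #edges labeled 1|.
    zeros = sum(1 for u, v in edges if (labels[u] + labels[v]) % 2 == 0)
    ones = len(edges) - zeros
    return abs(zeros - ones)


# Two finished labelings of P_4, each with two 0s and two 1s.
alternating = label_difference(path_edges(4), [0, 1, 0, 1])  # edge labels 1, 1, 1
blocked = label_difference(path_edges(4), [0, 0, 1, 1])      # edge labels 0, 1, 0
```

The two labelings use the same number of 0s and 1s but give differences of 3 and 1, which shows why the order and position of moves matter.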
-
Graph Regularized Encoder Training for Extreme Classification
Authors:
Anshul Mittal,
Shikhar Mohan,
Deepak Saini,
Suchith C. Prabhu,
Jain jiao,
Sumeet Agarwal,
Soumen Chakrabarti,
Purushottam Kar,
Manik Varma
Abstract:
Deep extreme classification (XC) aims to train an encoder architecture and an accompanying classifier architecture to tag a data point with the most relevant subset of labels from a very large universe of labels. XC applications in ranking, recommendation and tagging routinely encounter tail labels for which the amount of training data is exceedingly small. Graph convolutional networks (GCN) present a convenient but computationally expensive way to leverage task metadata and enhance model accuracies in these settings. This paper formally establishes that in several use cases, the steep computational cost of GCNs is entirely avoidable by replacing GCNs with non-GCN architectures. The paper notices that in these settings, it is much more effective to use graph data to regularize encoder training than to implement a GCN. Based on these insights, an alternative paradigm RAMEN is presented to utilize graph metadata in XC settings that offers significant performance boosts with zero increase in inference computational costs. RAMEN scales to datasets with up to 1M labels and offers prediction accuracy up to 15% higher on benchmark datasets than state of the art methods, including those that use graph metadata to train GCNs. RAMEN also offers 10% higher accuracy over the best baseline on a proprietary recommendation dataset sourced from click logs of a popular search engine. Code for RAMEN will be released publicly.
Submitted 28 February, 2024;
originally announced February 2024.
-
Using Deep Learning to Predict Neural Stem Cell Differentiation in Regenerative Medicine
Authors:
Nidhi Parthasarathy,
Chandra Suda,
Anika Mittal,
Ian Young Chen,
Ananya Jalihal
Abstract:
Over one in three people are affected by neurodegenerative disorders. Neural stem cells, which are multipotent regenerative cells with the potential to differentiate into any of the neural cell types, have immense therapeutic potential for treating neurological disorders. However, lengthy differentiation protocols hinder clinical applications and research. In this study, we present a deep learning approach using convolutional neural networks (CNNs) to predict the fate of neural stem cell differentiation at an early stage. We trained a CNN model on a dataset of cellular images from neural stem cell cultures. Our models achieved impressive results in predicting neuron and glial cell differentiation, with a 93.3% testing accuracy for a multiclass ResNet50 model (and 99.7% accuracy for a binary ResNet50 model). In addition, we developed and published a web tool to give stem cell researchers access to this technology to allow for efficient prediction of stem cell differentiation. Our work demonstrates the feasibility of and builds tooling for using CNNs for rapid, early differentiation outcome prediction from simple microscopy images, which could greatly accelerate neural stem cell research and therapies.
Submitted 26 September, 2024; v1 submitted 19 November, 2023;
originally announced December 2023.
-
Soft Random Sampling: A Theoretical and Empirical Analysis
Authors:
Xiaodong Cui,
Ashish Mittal,
Songtao Lu,
Wei Zhang,
George Saon,
Brian Kingsbury
Abstract:
Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data. SRS selects a subset uniformly at random with replacement from the full data set in each epoch. In this paper, we conduct a theoretical and empirical analysis of SRS. First, we analyze its sampling dynamics including data coverage and occupancy. Next, we investigate its convergence with non-convex objective functions and give the convergence rate. Finally, we provide its generalization performance. We empirically evaluate SRS for image recognition on CIFAR10 and automatic speech recognition on Librispeech and an in-house payload dataset to demonstrate its effectiveness. Compared to existing coreset-based data selection methods, SRS offers a better accuracy-efficiency trade-off. Especially on real-world industrial scale data sets, it is shown to be a powerful training strategy with significant speedup and competitive performance with almost no additional computing cost.
Submitted 23 November, 2023; v1 submitted 21 November, 2023;
originally announced November 2023.
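The per-epoch selection step described in the Soft Random Sampling abstract above can be sketched as follows. This is a minimal sketch, assuming a hypothetical selection ratio parameter (`ratio`) that sets the subset size; the coverage formula is the standard expected fraction of distinct items when drawing with replacement, not a result taken from the paper:

```python
import random


def soft_random_sample(n_items, ratio, rng):
    # Draw int(ratio * n_items) indices uniformly at random WITH replacement.
    # `ratio` is an assumed knob for the subset size per epoch.
    k = int(ratio * n_items)
    return [rng.randrange(n_items) for _ in range(k)]


def expected_coverage(n_items, k):
    # Expected fraction of distinct items seen after k draws with replacement.
    return 1.0 - (1.0 - 1.0 / n_items) ** k


rng = random.Random(0)  # fixed seed so the sketch is reproducible
epoch_indices = soft_random_sample(1000, 0.5, rng)
observed = len(set(epoch_indices)) / 1000
predicted = expected_coverage(1000, len(epoch_indices))
```

Because duplicates are allowed, a 50% draw covers noticeably less than 50% of the data in a single epoch; a fresh draw each epoch is what lets coverage accumulate over training.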
-
Remaining useful life prediction of Lithium-ion batteries using spatio-temporal multimodal attention networks
Authors:
Sungho Suh,
Dhruv Aditya Mittal,
Hymalai Bello,
Bo Zhou,
Mayank Shekhar Jha,
Paul Lukowicz
Abstract:
Lithium-ion batteries are widely used in various applications, including electric vehicles and renewable energy storage. The prediction of the remaining useful life (RUL) of batteries is crucial for ensuring reliable and efficient operation, as well as reducing maintenance costs. However, determining the life cycle of batteries in real-world scenarios is challenging, and existing methods have limitations in predicting the number of cycles iteratively. In addition, existing works often oversimplify the datasets, neglecting important features of the batteries such as temperature, internal resistance, and material type. To address these limitations, this paper proposes a two-stage RUL prediction scheme for Lithium-ion batteries using a spatio-temporal multimodal attention network (ST-MAN). The proposed ST-MAN is designed to capture the complex spatio-temporal dependencies in the battery data, including the features that are often neglected in existing works. Despite operating without prior knowledge of end-of-life (EOL) events, our method consistently achieves lower error rates than existing convolutional neural network (CNN) and long short-term memory (LSTM)-based methods, with a mean absolute error (MAE) and mean square error (MSE) of 0.0275 and 0.0014, respectively. The proposed method has the potential to improve the reliability and efficiency of battery operations and is applicable in various industries.
Submitted 6 June, 2024; v1 submitted 29 October, 2023;
originally announced October 2023.
-
Neutral Hydrogen (HI) 21 cm as a probe: Investigating Spatial Variations in Interstellar Turbulent Properties
Authors:
Amit K. Mittal,
Brian L Babler,
Snezana Stanimirovic,
Nickolas Pingel
Abstract:
Interstellar turbulence shapes the HI distribution in the Milky Way (MW). How this affects large-scale statistical properties of HI column density across the MW remains largely unconstrained. We use the approximately 13,000 square-degree GALFA-HI survey to map statistical fluctuations of HI over the 40 km s$^{-1}$ velocity range. We calculate the spatial power spectrum (SPS) of the HI column density image by running a 3-degree kernel and measuring the SPS slope over a range of angular scales from 16 arcmin to 20 degrees. Due to GALFA's complex observing and calibration strategy, we construct detailed estimates of the noise contribution and account for GALFA beam effects on the SPS. This allows us to systematically analyze HI images that trace a wide range of interstellar environments. We find that the SPS slope varies between -2.6 at high Galactic latitudes and -3.2 close to the Galactic plane. The range of SPS slope values becomes tighter when we consider HI optical depth and line-of-sight length caused by the plane-parallel geometry of the HI disk. This relatively uniform, large-scale distribution of SPS slope is suggestive of large-scale turbulent driving being a dominant mechanism for shaping HI structures in the MW and/or the stellar feedback turbulence being efficiently dissipated within dense molecular clouds. Only at latitudes above 60 degrees do we find evidence for the HI SPS slope being consistently more shallow. Those directions are largely within the Local Bubble, suggesting that the recent history of this cavity, shaped by multiple supernova explosions, has modified the turbulent state of HI and/or the fractions of HI phases.
Submitted 21 October, 2023;
originally announced October 2023.
-
EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval
Authors:
Ramnath Kumar,
Anshul Mittal,
Nilesh Gupta,
Aditya Kusupati,
Inderjit Dhillon,
Prateek Jain
Abstract:
Dense embedding-based retrieval is widely used for semantic search and ranking. However, conventional two-stage approaches, involving contrastive embedding learning followed by approximate nearest neighbor search (ANNS), can suffer from misalignment between these stages. This mismatch degrades retrieval performance. We propose End-to-end Hierarchical Indexing (EHI), a novel method that directly addresses this issue by jointly optimizing embedding generation and ANNS structure. EHI leverages a dual encoder for embedding queries and documents while simultaneously learning an inverted file index (IVF)-style tree structure. To facilitate the effective learning of this discrete structure, EHI introduces dense path embeddings that encode the path traversed by queries and documents within the tree. Extensive evaluations on standard benchmarks, including MS MARCO (Dev set) and TREC DL19, demonstrate EHI's superiority over traditional ANNS indices. Under the same computational constraints, EHI outperforms existing state-of-the-art methods by +1.45% in MRR@10 on MS MARCO (Dev) and +8.2% in nDCG@10 on TREC DL19, highlighting the benefits of our end-to-end approach.
Submitted 13 October, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
CHIP: Contrastive Hierarchical Image Pretraining
Authors:
Arpit Mittal,
Harshil Jhaveri,
Swapnil Mallick,
Abhishek Ajmera
Abstract:
Few-shot object classification is the task of classifying objects in an image with a limited number of examples as supervision. We propose a one-shot/few-shot classification model that can classify an object of any unseen class into a relatively general category in a hierarchically based classification. Our model uses a three-level hierarchical contrastive loss based ResNet152 classifier for classifying an object based on its features extracted from image embedding, not used during the training phase. For our experimentation, we have used a subset of the ImageNet (ILSVRC-12) dataset that contains only the animal classes for training our model and created our own dataset of unseen classes for evaluating our trained model. Our model provides satisfactory results in classifying the unknown objects into a generic category, which is discussed later in greater detail.
Submitted 12 October, 2023;
originally announced October 2023.
-
Agree To Disagree
Authors:
Abhinav Raghuvanshi,
Siddhesh Pawar,
Anirudh Mittal
Abstract:
How frequently do individuals thoroughly review terms and conditions before proceeding to register for a service, install software, or access a website? The majority of internet users do not engage in this practice. This trend is not surprising, given that terms and conditions typically consist of lengthy documents replete with intricate legal terminology and convoluted sentences. In this paper, we introduce a Machine Learning-powered approach designed to automatically parse and summarize critical information in a user-friendly manner. This technology focuses on distilling the pertinent details that users should contemplate before committing to an agreement.
Submitted 24 September, 2023;
originally announced September 2023.
-
Zero-shot Learning with Minimum Instruction to Extract Social Determinants and Family History from Clinical Notes using GPT Model
Authors:
Neel Bhate,
Ansh Mittal,
Zhe He,
Xiao Luo
Abstract:
Demographics, social determinants of health, and family history documented in the unstructured text within electronic health records are increasingly being studied to understand how this information can be utilized with the structured data to improve healthcare outcomes. Since the GPT models were released, many studies have applied them to extract this information from narrative clinical notes. Different from the existing work, our research focuses on investigating zero-shot learning for extracting this information together by providing minimum information to the GPT model. We utilize de-identified real-world clinical notes annotated for demographics, various social determinants, and family history information. Given that the GPT model might provide text different from the text in the original data, we explore two sets of evaluation metrics, including the traditional NER evaluation metrics and semantic similarity evaluation metrics, to completely understand the performance. Our results show that the GPT-3.5 method achieved an average of 0.975 F1 on demographics extraction, 0.615 F1 on social determinants extraction, and 0.722 F1 on family history extraction. We believe these results can be further improved through model fine-tuning or few-shot learning. Through the case studies, we also identified the limitations of the GPT models, which need to be addressed in future research.
Submitted 13 September, 2023; v1 submitted 11 September, 2023;
originally announced September 2023.
-
Multi-modal Extreme Classification
Authors:
Anshul Mittal,
Kunal Dahiya,
Shreya Malani,
Janani Ramaswamy,
Seba Kuruvilla,
Jitendra Ajmera,
Keng-hao Chang,
Sumeet Agarwal,
Purushottam Kar,
Manik Varma
Abstract:
This paper develops the MUFIN technique for extreme classification (XC) tasks with millions of labels where datapoints and labels are endowed with visual and textual descriptors. Applications of MUFIN to product-to-product recommendation and bid query prediction over several millions of products are presented. Contemporary multi-modal methods frequently rely on purely embedding-based methods. On the other hand, XC methods utilize classifier architectures to offer higher accuracies than embedding-only methods but mostly focus on text-based categorization tasks. MUFIN bridges this gap by reformulating multi-modal categorization as an XC problem with several millions of labels. This presents the twin challenges of developing multi-modal architectures that can offer embeddings sufficiently expressive to allow accurate categorization over millions of labels; and training and inference routines that scale logarithmically in the number of labels. MUFIN develops an architecture based on cross-modal attention and trains it in a modular fashion using pre-training and positive and negative mining. A novel product-to-product recommendation dataset MM-AmazonTitles-300K containing over 300K products was curated from publicly available amazon.com listings with each product endowed with a title and multiple images. On all datasets, MUFIN offered at least 3% higher accuracy than leading text-based, image-based and multi-modal techniques. Code for MUFIN is available at https://github.com/Extreme-classification/MUFIN
Submitted 10 September, 2023;
originally announced September 2023.
-
Fixating on Attention: Integrating Human Eye Tracking into Vision Transformers
Authors:
Sharath Koorathota,
Nikolas Papadopoulos,
Jia Li Ma,
Shruti Kumar,
Xiaoxiao Sun,
Arunesh Mittal,
Patrick Adelman,
Paul Sajda
Abstract:
Modern transformer-based models designed for computer vision have outperformed humans across a spectrum of visual tasks. However, critical tasks, such as medical image interpretation or autonomous driving, still require reliance on human judgments. This work demonstrates how human visual input, specifically fixations collected from an eye-tracking device, can be integrated into transformer models to improve accuracy across multiple driving situations and datasets. First, we establish the significance of fixation regions in left-right driving decisions, as observed in both human subjects and a Vision Transformer (ViT). By comparing the similarity between human fixation maps and ViT attention weights, we reveal the dynamics of overlap across individual heads and layers. This overlap is exploited for model pruning without compromising accuracy. Thereafter, we incorporate information from the driving scene with fixation data, employing a "joint space-fixation" (JSF) attention setup. Lastly, we propose a "fixation-attention intersection" (FAX) loss to train the ViT model to attend to the same regions that humans fixated on. We find that the ViT performance is improved in accuracy and number of training epochs when using JSF and FAX. These results hold significant implications for human-guided artificial intelligence.
Submitted 26 August, 2023;
originally announced August 2023.
-
Evaluation of Faithfulness Using the Longest Supported Subsequence
Authors:
Anirudh Mittal,
Timo Schick,
Mikel Artetxe,
Jane Dwivedi-Yu
Abstract:
As increasingly sophisticated language models emerge, their trustworthiness becomes a pivotal issue, especially in tasks such as summarization and question-answering. Ensuring their responses are contextually grounded and faithful is challenging due to the linguistic diversity and the myriad of possible answers. In this paper, we introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous substring of the claim that is supported by the context, which we refer to as the Longest Supported Subsequence (LSS). Using a new human-annotated dataset, we finetune a model to generate LSS. We introduce a new method of evaluation and demonstrate that these metrics correlate better with human ratings when LSS is employed, as opposed to when it is not. Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset. Our metric consistently outperforms other metrics on a summarization dataset across six different models. Finally, we compare several popular Large Language Models (LLMs) for faithfulness using this metric. We release the human-annotated dataset built for predicting LSS and our fine-tuned model for evaluating faithfulness.
Submitted 23 August, 2023;
originally announced August 2023.
-
Estimating Time to Clear Pendency of Cases in High Courts in India using Linear Regression
Authors:
Kshitiz Verma,
Anshu Musaddi,
Ansh Mittal,
Anshul Jain
Abstract:
The Indian judiciary is burdened by millions of cases pending in its courts at all levels. The High Court National Judicial Data Grid (HC-NJDG) indexes all the cases pending in the high courts and publishes the data publicly. In this paper, we analyze the data that we collected from the HC-NJDG portal on 229 randomly chosen days between August 31, 2017 and March 22, 2020, inclusive. Thus, the data analyzed in the paper spans a period of more than two and a half years. We show that: 1) the number of pending cases in most of the high courts is increasing linearly with time; 2) the case load on judges in various high courts is very unevenly distributed, making judges of some high courts a hundred times more loaded than others; 3) for some high courts it may take even a hundred years to clear the pending cases if proper measures are not taken.
We also suggest some policy changes that may help clear the pendency within a fixed time of either five or fifteen years. Finally, we find that the rate of institution of cases in high courts can be easily handled by the current sanctioned strength; extra judges are needed only to clear earlier backlogs.
Submitted 24 July, 2023;
originally announced July 2023.
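The linear-trend analysis described in the pendency abstract above can be illustrated with a minimal sketch. It fits an ordinary least-squares line to pending-case counts and then estimates how many additional disposals per day would be needed to clear the backlog within a fixed horizon. The function names, the sample figures, and the clearing-rate formula are assumptions made for illustration only; they are not taken from the paper or from HC-NJDG data.

```python
def fit_trend(days, pending):
    """Ordinary least-squares line: pending ~ intercept + slope * day."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(pending) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, pending))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extra_daily_disposals(slope, backlog, years):
    """Hypothetical simplification: to clear `backlog` in `years`, disposals must
    first cancel the current growth (slope) and then remove backlog / horizon."""
    return max(0.0, slope + backlog / (365.0 * years))


# Made-up observations: day index and pending cases on that day.
days = [0, 100, 200, 300]
pending = [1000, 1200, 1400, 1600]

slope, intercept = fit_trend(days, pending)
for horizon in (5, 15):
    rate = extra_daily_disposals(slope, pending[-1], horizon)
    print(f"{horizon}-year target: {rate:.2f} extra disposals per day")
```

A positive slope means the backlog grows under the current disposal rate, so the required extra capacity is the growth rate plus the backlog spread over the chosen horizon.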
-
Improving RNN-Transducers with Acoustic LookAhead
Authors:
Vinit S. Unni,
Ashish Mittal,
Preethi Jyothi,
Sunita Sarawagi
Abstract:
RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech to text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings by a thin joint network. While this architecture provides SOTA streaming accuracy, it also makes the model vulnerable to strong LM biasing which manifests as multi-step hallucination of text without acoustic evidence. In this paper we propose LookAhead that makes text representations more acoustically grounded by looking ahead into the future within the audio input. This technique yields a significant 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets.
Submitted 10 July, 2023;
originally announced July 2023.
-
Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity
Authors:
Zhenlin Xu,
Yi Zhu,
Tiffany Deng,
Abhay Mittal,
Yanbei Chen,
Manchen Wang,
Paolo Favaro,
Joseph Tighe,
Davide Modolo
Abstract:
This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel in tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs' consistency in understanding concepts across semantic granularity levels and their response to varying text specificity. Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data. Extensive evaluations reveal limitations in current VLMs, particularly in distinguishing between correct and subtly incorrect descriptions. While fine-tuning offers some improvements, it doesn't fully address these issues, highlighting the need for VLMs with enhanced generalization capabilities for real-world applications. This study provides insights into VLM limitations and suggests directions for developing more robust models.
Submitted 18 June, 2024; v1 submitted 28 June, 2023;
originally announced June 2023.
-
Determining Smallest Path Size of Multiplication Transducers Without a Restricted Digit Set
Authors:
Aditya Mittal,
Karthik Mittal
Abstract:
Directed multiplication transducers are a tool for performing non-decimal base multiplication without an additional conversion to base 10. This allows for faster computation and provides easier visualization depending on the problem at hand. By building these multiplication transducers computationally, new patterns can be identified as these transducers can be built with much larger bases and multipliers. Through a recursive approach, we created artificial multiplication transducers, allowing for the formation of several unique conjectures specifically focused on the smallest closed loop around a multiplication transducer starting and ending at zero. We show a general recursive pattern for this loop; through this recurrence relation, the length of the smallest closed loop for a particular transducer base b along with the range of multipliers having this particular length for multiplier m was also identified. This research is expected to be explored further by testing reductions of the digit set and determining whether similar properties will hold.
Submitted 14 June, 2023;
originally announced June 2023.
-
MOVES: Movable and Moving LiDAR Scene Segmentation in Label-Free settings using Static Reconstruction
Authors:
Prashant Kumar,
Dhruv Makwana,
Onkar Susladkar,
Anurag Mittal,
Prem Kumar Kalra
Abstract:
Accurate static structure reconstruction and segmentation of non-stationary objects are of vital importance for autonomous navigation applications. These applications assume a LiDAR scan to consist of only static structures. In the real world, however, LiDAR scans also contain non-stationary dynamic structures: moving and movable objects. Current solutions use segmentation information to isolate and remove moving structures from the LiDAR scan. This strategy fails in several important use cases where segmentation information is not available. In such scenarios, moving objects and objects with high uncertainty in their motion (i.e., movable objects) may escape detection, which violates the above assumption. We present MOVES, a novel GAN-based adversarial model that segments out moving as well as movable objects in the absence of segmentation information. We achieve this by accurately transforming a dynamic LiDAR scan to its corresponding static scan. This is obtained by replacing dynamic objects and corresponding occlusions with static structures which were occluded by dynamic objects. We leverage corresponding static-dynamic LiDAR pairs.
Submitted 15 October, 2023; v1 submitted 26 June, 2023;
originally announced June 2023.
-
A Numerically Robust and Stable Time-Space Pseudospectral Approach for Generalized Burgers-Fisher Equation
Authors:
Harvindra Singh,
Lokendra Balyan,
A. K. Mittal,
Parul Saini
Abstract:
In this article, we present the time-space Chebyshev pseudospectral method (TS-CPsM) to approximate a solution to the generalised Burgers-Fisher (gBF) equation. The Chebyshev-Gauss-Lobatto (CGL) points serve as the foundation for the recommended method, which makes use of collocations in both the time and space directions. Further, using a mapping, the non-homogeneous initial-boundary value problem is transformed into a homogeneous problem, and a system of algebraic equations is obtained. The numerical approach known as Newton-Raphson is implemented in order to get the desired results for the system. The proposed method's stability analysis has been performed. Different researchers' considerations on test problems have been explored to illustrate the robustness and practicality of the approach presented. The approximate solutions we found using the proposed method are highly accurate and significantly better than the existing results.
Submitted 16 June, 2023;
originally announced June 2023.
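The Burgers-Fisher abstract above builds its scheme on Chebyshev-Gauss-Lobatto (CGL) collocation points. As a rough illustration, the sketch below constructs the CGL nodes on [-1, 1] and the standard first-order Chebyshev differentiation matrix, following the usual textbook construction. It is not the paper's TS-CPsM scheme: the time-space coupling, the homogenizing mapping, and the Newton-Raphson solve are omitted, and the function names are hypothetical.

```python
import math


def cgl_points(n):
    """Chebyshev-Gauss-Lobatto nodes x_j = cos(pi * j / n), j = 0..n."""
    return [math.cos(math.pi * j / n) for j in range(n + 1)]


def cheb_diff_matrix(n):
    """Return (nodes, D), where D approximates d/dx at the CGL nodes."""
    x = cgl_points(n)
    c = [2.0 if j in (0, n) else 1.0 for j in range(n + 1)]
    d = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            if i != j:
                d[i][j] = (c[i] / c[j]) * (-1) ** (i + j) / (x[i] - x[j])
        # Diagonal entry chosen so each row sums to zero (derivative of a constant).
        d[i][i] = -sum(d[i][j] for j in range(n + 1) if j != i)
    return x, d


# Differentiate u(x) = x^3 at the nodes and compare with 3x^2.
x, d = cheb_diff_matrix(8)
u = [xi ** 3 for xi in x]
du = [sum(d[i][j] * u[j] for j in range(len(x))) for i in range(len(x))]
print(max(abs(a - 3 * xi ** 2) for a, xi in zip(du, x)))
```

For polynomials of degree at most n the matrix differentiates exactly up to rounding error, which is why spectral collocation can reach high accuracy with few nodes.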
-
ScaleDet: A Scalable Multi-Dataset Object Detector
Authors:
Yanbei Chen,
Manchen Wang,
Abhay Mittal,
Zhenlin Xu,
Paolo Favaro,
Joseph Tighe,
Davide Modolo
Abstract:
Multi-dataset training provides a viable solution for exploiting heterogeneous large-scale datasets without extra annotation cost. In this work, we propose a scalable multi-dataset detector (ScaleDet) that can scale up its generalization across datasets when increasing the number of training datasets. Unlike existing multi-dataset learners that mostly rely on manual relabelling efforts or sophisticated optimizations to unify labels across datasets, we introduce a simple yet scalable formulation to derive a unified semantic label space for multi-dataset training. ScaleDet is trained by visual-textual alignment to learn the label assignment with label semantic similarities across datasets. Once trained, ScaleDet can generalize well on any given upstream and downstream datasets with seen and unseen classes. We conduct extensive experiments using LVIS, COCO, Objects365, OpenImages as upstream datasets, and 13 datasets from Object Detection in the Wild (ODinW) as downstream datasets. Our results show that ScaleDet achieves compelling strong model performance with an mAP of 50.7 on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors with the same backbone.
Submitted 7 June, 2023;
originally announced June 2023.
-
Neural Radiance Fields: Past, Present, and Future
Authors:
Ansh Mittal
Abstract:
Modeling and interpreting 3D environments and surroundings have enticed researchers to advance 3D Computer Vision, Computer Graphics, and Machine Learning. The paper by Mildenhall et al. on NeRFs (Neural Radiance Fields) led to a boom in Computer Graphics, Robotics, and Computer Vision, and the possible scope of High-Resolution Low Storage Augmented Reality and Virtual Reality-based 3D models has gained traction from researchers, with more than 1000 preprints related to NeRFs published. This paper serves as a bridge for people starting to study these fields by building on the basics of Mathematics, Geometry, Computer Vision, and Computer Graphics to the difficulties encountered in Implicit Representations at the intersection of all these disciplines. This survey provides the history of rendering, Implicit Learning, and NeRFs, the progression of research on NeRFs, and the potential applications and implications of NeRFs in today's world. In doing so, this survey categorizes all the NeRF-related research in terms of the datasets used, objective functions, applications solved, and evaluation criteria for these applications.
Submitted 14 January, 2024; v1 submitted 19 April, 2023;
originally announced April 2023.
-
Bayesian Beta-Bernoulli Process Sparse Coding with Deep Neural Networks
Authors:
Arunesh Mittal,
Kai Yang,
Paul Sajda,
John Paisley
Abstract:
Several approximate inference methods have been proposed for deep discrete latent variable models. However, non-parametric methods which have previously been successfully employed for classical sparse coding models have largely been unexplored in the context of deep models. We propose a non-parametric iterative algorithm for learning discrete latent representations in such deep models. Additionally, to learn scale invariant discrete features, we propose local data scaling variables. Lastly, to encourage sparsity in our representations, we propose a Beta-Bernoulli process prior on the latent factors. We evaluate our sparse coding model coupled with different likelihood models. We evaluate our method across datasets with varying characteristics and compare our results to current amortized approximate inference methods.
Submitted 14 March, 2023;
originally announced March 2023.
-
On Multi-Agent Deep Deterministic Policy Gradients and their Explainability for SMARTS Environment
Authors:
Ansh Mittal,
Aditya Malte
Abstract:
Multi-Agent RL (MARL) is one of the complex problems in the Autonomous Driving literature that hampers the release of fully autonomous vehicles today. Several simulators have been iterated on since their inception to mitigate the problem of complex scenarios with multiple agents in Autonomous Driving. One such simulator, SMARTS, emphasizes the importance of cooperative multi-agent learning. For this problem, we discuss two approaches, MAPPO and MADDPG, which are based on on-policy and off-policy RL, respectively. We compare our results with the state-of-the-art results for this challenge and discuss the potential areas of improvement, along with the explainability of these approaches in conjunction with waypoints in the SMARTS environment.
Submitted 19 January, 2023;
originally announced January 2023.
-
MuTable (Music Table): Turn any surface into musical instrument
Authors:
Akash Mittal,
Ragini Gupta
Abstract:
With the rise in pervasive computing solutions, interactive surfaces have gained wide popularity across multiple application domains, including smart boards for education, touch-enabled kiosks for smart retail, and smart mirrors for smart homes. Despite the increased popularity of such interactive surfaces, existing platforms are mostly limited to custom-built surfaces with attached sensors and hardware that are expensive and require complicated design considerations. To address this, we design a low-cost, intuitive system called MuTable that repurposes any flat surface (such as a table top) into a live musical instrument. This gives the user a unique, close to real-time experience of playing any type of musical instrument. This is achieved by projecting the instrument's shape on any tangible surface, sensor calibration, user tap detection, tap position identification, and associated sound generation. We demonstrate the performance of our working system by reporting an accuracy of 83% for detecting softer taps, 100% accuracy for detecting regular taps, and a precision of 95.7% for estimating hand location.
Submitted 28 December, 2022;
originally announced December 2022.
-
Nostradamus: Weathering Worth
Authors:
Alapan Chaudhuri,
Zeeshan Ahmed,
Ashwin Rao,
Shivansh Subramanian,
Shreyas Pradhan,
Abhishek Mittal
Abstract:
Nostradamus, inspired by the French astrologer and reputed seer, is a detailed study exploring relations between environmental factors and changes in the stock market. In this paper, we analyze associative correlation and causation between environmental elements (including natural disasters, climate and weather conditions) and stock prices, using historical stock market data, historical climate data, and various climate indicators such as carbon dioxide emissions. We have conducted our study based on the US financial market, global climate trends, and daily weather records to demonstrate a significant relationship between climate and stock price fluctuation. Our analysis covers both short-term and long-term rises and dips in company stock performances. Lastly, we take four natural disasters as a case study to observe the effect they have on people's emotional state and their influence on the stock market.
Submitted 17 January, 2023; v1 submitted 8 December, 2022;
originally announced December 2022.
-
Reducing Collision Risk in Multi-Agent Path Planning: Application to Air traffic Management
Authors:
Sarah H. Q. Li,
Avi Mittal,
Pierre-Loïc Garoche,
Behçet Açıkmeşe
Abstract:
To minimize collision risks in the multi-agent path planning problem with stochastic transition dynamics, we formulate a Markov decision process congestion game with a multi-linear congestion cost. Players within the game complete individual tasks while minimizing their own collision risks. We show that the set of Nash equilibria coincides with the first-order KKT points of a non-convex optimization problem. Our game is applied to a historical flight plan over France to reduce collision risks between commercial aircraft.
Submitted 10 December, 2022; v1 submitted 8 December, 2022;
originally announced December 2022.
-
Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training
Authors:
Ashish Mittal,
Durga Sivasubramanian,
Rishabh Iyer,
Preethi Jyothi,
Ganesh Ramakrishnan
Abstract:
Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of the training data could mitigate this problem if the selected subset could achieve performance on par with training on the entire dataset. Although there are many data subset selection (DSS) algorithms, direct application to the RNN-T is difficult, especially for DSS algorithms that are adaptive and use learning dynamics such as gradients, as RNN-T models tend to have gradients with a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves a 3x to 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.
Submitted 30 October, 2022;
originally announced October 2022.
-
Estimation methods for elementary chirp model parameters
Authors:
Anjali Mittal,
Rhythm Grover,
Debasis Kundu,
Amit Mitra
Abstract:
In this paper, we propose some techniques to estimate the parameters of the elementary chirp model, which is encountered in sonar, radar, acoustics, and other areas. We derive asymptotic theoretical properties of least squares estimators and approximate least squares estimators for the one-component elementary chirp model. It is proved that the proposed estimators are strongly consistent and follow the normal distribution asymptotically. We also suggest how to obtain proper initial values for these methods. Finding initial values is difficult when the number of components in the model is large, when the signal-to-noise ratio is low, or when two frequency rates are close to each other. We propose sequential procedures to estimate the multiple-component elementary chirp model parameters. We prove that the theoretical properties of sequential least squares estimators and sequential approximate least squares estimators coincide with those of least squares estimators and approximate least squares estimators, respectively. To evaluate the performance of the proposed estimators, numerical experiments are performed. It is observed that the proposed sequential estimators perform well even in situations where least squares estimators do not perform well. We illustrate the performance of the proposed sequential algorithm on bat data.
Submitted 11 October, 2022;
originally announced October 2022.
-
Alternate stabilization methods for CZTSSe photovoltaic devices by thermal treatment, dark electric bias and illumination
Authors:
W. Ananda,
M. Rennhofer,
A. Mittal,
N. Zechner,
W. Lang
Abstract:
Reliable measurement routines are crucial for power rating and yield prediction of emerging thin-film photovoltaic technologies. Copper-Zinc-Tin-Sulfur-Selenium (CZTSSe) thin-film photovoltaic devices are an emerging technology made of abundant elements. Still, adequate stabilization methods prior to electric power measurement are missing from international standards, while existing standards for other thin-film technologies do not work properly for CZTSSe. This study investigated methods for achieving power stabilization of CZTSSe solar devices. Three complementary stabilization routines for the kesterite-based solar devices were investigated as an alternative to the existing international device testing standards: rapid annealing, dark electric biasing, and different operating points under illumination. Power stabilization typically required between 3 and 6 cycles of rapid annealing, dark electric bias, and illumination, with a power loss of -19.5%, -11.4%, and -1.9% for the respective methods. The dark electric bias method was found to provide the most reliable average result for power stabilization. All stabilization methods proved to have the potential to stabilize the CZTSSe devices sufficiently for standardized power measurement.
Submitted 23 September, 2022;
originally announced September 2022.
-
SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions
Authors:
Ansh Mittal,
Shuvam Ghosal,
Rishibha Bansal
Abstract:
Detecting suspicious activities in surveillance videos is a longstanding problem in real-time surveillance that leads to difficulties in detecting crimes. Hence, we propose a novel approach for detecting and summarizing suspicious activities in surveillance videos. We have also created ground truth summaries for the UCF-Crime video dataset. We modify a pre-existing approach for this task by leveraging the Human-Object Interaction (HOI) model for the visual features in the Bi-Modal Transformer (BMT). Further, we validate our approach against the existing state-of-the-art algorithms for the Dense Video Captioning task on the ActivityNet Captions dataset. We observe that this formulation for Dense Captioning performs significantly better than the other discussed BMT-based approaches for BLEU@1, BLEU@2, BLEU@3, BLEU@4, and METEOR. We further perform a comparative analysis of the dataset and the model to report the findings based on different NMS thresholds (searched using Genetic Algorithms). Here, our formulation outperforms all the models for BLEU@1, BLEU@2, and BLEU@3, and most models for BLEU@4 and METEOR, falling short of only ADV-INF Global by 25% and 0.5%, respectively.
Submitted 22 October, 2022; v1 submitted 24 July, 2022;
originally announced July 2022.
-
NGAME: Negative Mining-aware Mini-batching for Extreme Classification
Authors:
Kunal Dahiya,
Nilesh Gupta,
Deepak Saini,
Akshay Soni,
Yajun Wang,
Kushal Dave,
Jian Jiao,
Gururaj K,
Prasenjit Dey,
Amit Singh,
Deepesh Hada,
Vidit Jain,
Bhawna Paliwal,
Anshul Mittal,
Sonu Mehta,
Ramachandran Ramjee,
Sumeet Agarwal,
Purushottam Kar,
Manik Varma
Abstract:
Extreme Classification (XC) seeks to tag data points with the most relevant subset of labels from an extremely large label set. Performing deep XC with dense, learnt representations for data points and labels has attracted much attention due to its superiority over earlier XC methods that used sparse, hand-crafted features. Negative mining techniques have emerged as a critical component of all deep XC methods that allow them to scale to millions of labels. However, despite recent advances, training deep XC models with large encoder architectures such as transformers remains challenging. This paper identifies that memory overheads of popular negative mining techniques often force mini-batch sizes to remain small and slow training down. In response, this paper introduces NGAME, a light-weight mini-batch creation technique that offers provably accurate in-batch negative samples. This allows training with larger mini-batches offering significantly faster convergence and higher accuracies than existing negative sampling techniques. NGAME was found to be up to 16% more accurate than state-of-the-art methods on a wide array of benchmark datasets for extreme classification, as well as 3% more accurate at retrieving search engine queries in response to a user webpage visit to show personalized ads. In live A/B tests on a popular search engine, NGAME yielded up to 23% gains in click-through-rates.
Submitted 10 July, 2022;
originally announced July 2022.
-
AmbiPun: Generating Humorous Puns with Ambiguous Context
Authors:
Anirudh Mittal,
Yufei Tian,
Nanyun Peng
Abstract:
In this paper, we propose a simple yet effective way to generate pun sentences that does not require any training on existing puns. Our approach is inspired by humor theories that ambiguity comes from the context rather than the pun word itself. Given a pair of definitions of a pun word, our model first produces a list of related concepts through a reverse dictionary. We then utilize one-shot GPT3 to generate context words and then generate puns incorporating context words from both concepts. Human evaluation shows that our method successfully generates puns 52% of the time, outperforming well-crafted baselines and the state-of-the-art models by a large margin.
Submitted 3 May, 2022;
originally announced May 2022.
-
DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning
Authors:
Sreyan Ghosh,
Ashish Seth,
Deepak Mittal,
Maneesh Singh,
S. Umesh
Abstract:
Inspired by the recent progress in self-supervised learning for computer vision, in this paper we introduce DeLoRes, a new general-purpose audio representation learning approach. Our main objective is to make our network learn representations in a resource-constrained setting (both data and compute) that can generalize well across a diverse set of downstream tasks. Inspired by the Barlow Twins objective function, we propose to learn embeddings that are invariant to distortions of an input audio sample, while making sure that they contain non-redundant information about the sample. To achieve this, we measure the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of an audio segment sampled from an audio file and make it as close to the identity matrix as possible. We use a combination of a small subset of the large-scale AudioSet dataset and FSD50K for self-supervised learning and are able to learn with less than half the parameters compared to state-of-the-art algorithms. For evaluation, we transfer these learned representations to 9 downstream classification tasks, including speech, music, and animal sounds, and show competitive results under different evaluation setups. In addition to being simple and intuitive, our pre-training algorithm is amenable to compute through its inherent nature of construction and does not require careful implementation details to avoid trivial or degenerate solutions. Furthermore, we conduct ablation studies on our results and make all our code and pre-trained models publicly available at https://github.com/Speech-Lab-IITM/DeLoRes.
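The objective described above (pushing the cross-correlation matrix of two distorted views towards the identity) can be sketched as follows. This is a minimal NumPy illustration of a Barlow Twins style loss, not the DeLoRes implementation; the function name, the off-diagonal weight of 5e-3, and the random stand-in embeddings are assumptions.

```python
import numpy as np

def cross_correlation_loss(z_a, z_b, off_diag_weight=5e-3):
    """Loss that is small when the cross-correlation of z_a and z_b is near identity.

    z_a, z_b: (batch, dim) embeddings of two distorted views of the same samples.
    off_diag_weight: assumed trade-off between invariance and redundancy terms.
    """
    # Standardize every embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + 1e-8)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + 1e-8)
    c = z_a.T @ z_b / z_a.shape[0]                     # (dim, dim) cross-correlation
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)                # invariance to distortions
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)      # redundancy between dimensions
    return on_diag + off_diag_weight * off_diag

rng = np.random.default_rng(0)
view = rng.normal(size=(256, 8))        # stand-in embeddings of one view
unrelated = rng.normal(size=(256, 8))   # embeddings that share nothing with `view`
loss_same = cross_correlation_loss(view, view)
loss_unrelated = cross_correlation_loss(view, unrelated)
```

Identical views give a much lower loss than unrelated ones, because the diagonal of the cross-correlation matrix is then already close to one.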
Submitted 26 June, 2022; v1 submitted 25 March, 2022;
originally announced March 2022.
-
Adaptive Discounting of Implicit Language Models in RNN-Transducers
Authors:
Vinit Unni,
Shreya Khare,
Ashish Mittal,
Preethi Jyothi,
Sunita Sarawagi,
Samarth Bharadwaj
Abstract:
RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that are acoustically inconsistent with the underlying speech. To address this issue, we propose a lightweight adaptive LM discounting technique, AdaptLMD, that can be used with any RNN-T architecture without requiring any external resources or additional parameters. AdaptLMD uses a two-pronged approach: 1) Randomly mask the prediction network output to encourage the RNN-T to not be overly reliant on its outputs. 2) Dynamically choose when to discount the implicit LM (ILM) based on rarity of recently predicted tokens and divergence between ILM and implicit acoustic model (IAM) scores. Comparing AdaptLMD to a competitive RNN-T baseline, we obtain up to 4% and 14% relative reductions in overall WER and rare word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.
Submitted 21 February, 2022;
originally announced March 2022.
-
Eye-focused Detection of Bell's Palsy in Videos
Authors:
Sharik Ali Ansari,
Koteswar Rao Jerripothula,
Pragya Nagpal,
Ankush Mittal
Abstract:
In this paper, we present how Bell's Palsy, a neurological disorder, can be detected just from a subject's eyes in a video. We notice that Bell's Palsy patients often struggle to blink their eyes on the affected side. As a result, we can observe a clear contrast between the blinking patterns of the two eyes. Although previous works did utilize images/videos to detect this disorder, none have explicitly focused on the eyes. Most of them require the entire face. One obvious advantage of having an eye-focused detection system is that subjects' anonymity is not at risk. Also, our AI decisions based on simple blinking patterns make them explainable and straightforward. Specifically, we develop a novel feature called blink similarity, which measures the similarity between the two blinking patterns. Our extensive experiments demonstrate that the proposed feature is quite robust, for it helps in Bell's Palsy detection even with very few labels. Our proposed eye-focused detection system is not only cheaper but also more convenient than several existing methods.
Submitted 27 January, 2022;
originally announced January 2022.
-
Non-linear Motion Estimation for Video Frame Interpolation using Space-time Convolutions
Authors:
Saikat Dutta,
Arulkumar Subramaniam,
Anurag Mittal
Abstract:
Video frame interpolation aims to synthesize one or multiple frames between two consecutive frames in a video. It has a wide range of applications including slow-motion video generation, frame-rate up-scaling and developing video codecs. Some older works tackled this problem by assuming per-pixel linear motion between video frames. However, objects often follow a non-linear motion pattern in the real domain and some recent methods attempt to model per-pixel motion by non-linear models (e.g., quadratic). A quadratic model can also be inaccurate, especially in the case of motion discontinuities over time (i.e. sudden jerks) and occlusions, where some of the flow information may be invalid or inaccurate.
In our paper, we propose to approximate the per-pixel motion using a space-time convolution network that is able to adaptively select the motion model to be used. Specifically, we are able to softly switch between a linear and a quadratic model. Towards this end, we use an end-to-end 3D CNN encoder-decoder architecture over bidirectional optical flows and occlusion maps to estimate the non-linear motion model of each pixel. Further, a motion refinement module is employed to refine the non-linear motion and the interpolated frames are estimated by a simple warping of the neighboring frames with the estimated per-pixel motion. Through a set of comprehensive experiments, we validate the effectiveness of our model and show that our method outperforms state-of-the-art algorithms on four datasets (Vimeo, DAVIS, HD and GoPro).
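The soft switch between a linear and a quadratic motion model can be illustrated with a small worked example. This is a sketch under stated assumptions, not the paper's network: the quadratic model is the standard constant-acceleration fit to flows from the reference frame to its next and previous frames, and the blend weight `alpha` (a hypothetical name) stands in for whatever per-pixel weight a learned model would predict.

```python
import numpy as np

def blended_displacement(flow_fwd, flow_bwd, t, alpha):
    """Displacement of a pixel at time t in [0, 1] from the reference frame.

    flow_fwd: flow from frame 0 to frame 1.
    flow_bwd: flow from frame 0 to frame -1.
    alpha:    blend weight in [0, 1]; 0 selects the linear model, 1 the quadratic.
    """
    linear = t * flow_fwd
    # Constant-acceleration fit: x(1) = flow_fwd and x(-1) = flow_bwd.
    velocity = 0.5 * (flow_fwd - flow_bwd)
    acceleration = flow_fwd + flow_bwd
    quadratic = velocity * t + 0.5 * acceleration * t ** 2
    return (1.0 - alpha) * linear + alpha * quadratic

# Constant-velocity motion: both models agree, so the blend weight has no effect.
steady = blended_displacement(np.array([2.0, 0.0]), np.array([-2.0, 0.0]), t=0.5, alpha=0.3)

# Accelerating motion: the models differ at intermediate times but meet at t = 1.
endpoint = blended_displacement(np.array([2.0, 0.0]), np.array([-1.0, 0.5]), t=1.0, alpha=0.7)
```

Both models reproduce the observed forward flow at t = 1, so the choice of `alpha` only changes the trajectory between the two input frames.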
Submitted 12 April, 2022; v1 submitted 27 January, 2022;
originally announced January 2022.
-
Improving Prediction of Cognitive Performance using Deep Neural Networks in Sparse Data
Authors:
Sharath Koorathota,
Arunesh Mittal,
Richard P. Sloan,
Paul Sajda
Abstract:
Cognition in midlife is an important predictor of age-related mental decline, and statistical models that predict cognitive performance can be useful for predicting decline. However, existing models struggle to capture complex relationships between physical, sociodemographic, psychological and mental health factors that affect cognition. Using data from an observational cohort study, Midlife in the United States (MIDUS), we modeled a large number of variables to predict executive function and episodic memory measures. We used cross-sectional and longitudinal outcomes with varying sparsity, or amount of missing data. Deep neural network (DNN) models consistently ranked highest in all of the cognitive performance prediction tasks, as assessed with root mean squared error (RMSE) on out-of-sample data. RMSE differences between DNN and other model types were statistically significant (T(8) = -3.70; p < 0.05). The interaction effect between model type and sparsity was significant (F(9)=59.20; p < 0.01), indicating the success of DNNs can partly be attributed to their robustness and ability to model hierarchical relationships between health-related factors. Our findings underscore the potential of neural networks to model clinical datasets and allow better understanding of factors that lead to cognitive decline.
Submitted 28 December, 2021;
originally announced December 2021.
-
Emotion-Cause Pair Extraction in Customer Reviews
Authors:
Arpit Mittal,
Jeel Tejaskumar Vaishnav,
Aishwarya Kaliki,
Nathan Johns,
Wyatt Pease
Abstract:
Emotion-Cause Pair Extraction (ECPE) is a complex yet popular area in Natural Language Processing due to its importance and potential applications in various domains. In this report, we aim to present our work in ECPE in the domain of online reviews. With a manually annotated dataset, we explore an algorithm to extract emotion-cause pairs using a neural network. In addition, we propose a model using previous reference materials and combining emotion-cause pair extraction with research in the domain of emotion-aware word embeddings, where we send these embeddings into a Bi-LSTM layer which gives us the emotionally relevant clauses. With the constraint of a limited dataset, we achieved . The overall scope of our report comprises a comprehensive literature review, implementation of referenced methods for dataset construction and initial model training, and modifying previous work in ECPE by proposing an improvement to the pipeline, as well as algorithm development and implementation for the specific domain of reviews.
Submitted 7 December, 2021;
originally announced December 2021.
-
Tapping BERT for Preposition Sense Disambiguation
Authors:
Siddhesh Pawar,
Shyam Thombre,
Anirudh Mittal,
Girishkumar Ponkiya,
Pushpak Bhattacharyya
Abstract:
Prepositions are frequently occurring polysemous words. Disambiguation of prepositions is crucial in tasks like semantic role labelling, question answering, text entailment, and noun compound paraphrasing. In this paper, we propose a novel methodology for preposition sense disambiguation (PSD), which does not use any linguistic tools. In a supervised setting, the machine learning model is presented with sentences wherein prepositions have been annotated with senses. These senses are IDs in what is called The Preposition Project (TPP). We use the hidden layer representations from pre-trained BERT and BERT variants. The latent representations are then classified into the correct sense ID using a Multi Layer Perceptron. The dataset used for this task is from SemEval-2007 Task-6. Our methodology gives an accuracy of 86.85% which is better than the state-of-the-art.
Submitted 27 November, 2021;
originally announced November 2021.
-
Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks
Authors:
Arulkumar Subramaniam,
Jayesh Vaidya,
Muhammed Abdul Majeed Ameen,
Athira Nambiar,
Anurag Mittal
Abstract:
Video-based computer vision tasks can benefit from estimation of the salient regions and interactions between those regions. Traditionally, this has been done by identifying the object regions in the images by utilizing pre-trained models to perform object detection, object segmentation and/or object pose estimation. Although using pre-trained models is a viable approach, it has several limitations in the need for an exhaustive annotation of object categories, a possible domain gap between datasets, and a bias that is typically present in pre-trained models. In this work, we propose to utilize the common rationale that a sequence of video frames captures a set of common objects and interactions between them; thus, a notion of co-segmentation between the video frame features may equip the model with the ability to automatically focus on task-specific salient regions and improve the underlying task's performance in an end-to-end manner. In this regard, we propose a generic module called "Co-Segmentation inspired Attention Module" (COSAM) that can be plugged into any CNN model to promote the notion of co-segmentation based attention among a sequence of video frame features. We show the application of COSAM in three video-based tasks, namely 1) video-based person re-ID, 2) video captioning, and 3) video action classification, and demonstrate that COSAM is able to capture the task-specific salient regions in video frames, thus leading to notable performance improvements along with interpretable attention maps for a variety of video-based vision tasks, with possible application to other video-based vision tasks as well.
Submitted 1 August, 2022; v1 submitted 14 November, 2021;
originally announced November 2021.
-
DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents
Authors:
Kunal Dahiya,
Deepak Saini,
Anshul Mittal,
Ankush Shaw,
Kushal Dave,
Akshay Soni,
Himanshu Jain,
Sumeet Agarwal,
Manik Varma
Abstract:
Scalability and accuracy are well-recognized challenges in deep extreme multi-label learning, where the objective is to train architectures for automatically annotating a data point with the most relevant subset of labels from an extremely large label set. This paper develops the DeepXML framework that addresses these challenges by decomposing the deep extreme multi-label task into four simpler sub-tasks, each of which can be trained accurately and efficiently. Choosing different components for the four sub-tasks allows DeepXML to generate a family of algorithms with varying trade-offs between accuracy and scalability. In particular, DeepXML yields the Astec algorithm that could be 2-12% more accurate and 5-30x faster to train than leading deep extreme classifiers on publicly available short text datasets. Astec could also efficiently train on Bing short text datasets containing up to 62 million labels while making predictions for billions of users and data points per day on commodity hardware. This allowed Astec to be deployed on the Bing search engine for a number of short text applications ranging from matching user queries to advertiser bid phrases to showing personalized ads, where it yielded significant gains in click-through-rates, coverage, revenue and other online metrics over state-of-the-art techniques currently in production. DeepXML's code is available at https://github.com/Extreme-classification/deepxml
Submitted 12 November, 2021;
originally announced November 2021.
-
DSC-IITISM at FinCausal 2021: Combining POS tagging with Attention-based Contextual Representations for Identifying Causal Relationships in Financial Documents
Authors:
Gunjan Haldar,
Aman Mittal,
Pradyumna Gupta
Abstract:
Causality detection draws plenty of attention in the field of Natural Language Processing and linguistics research. It has essential applications in information retrieval, event prediction, question answering, financial analysis, and market research. In this study, we explore several methods to identify and extract cause-effect pairs in financial documents using transformers. For this purpose, we propose an approach that combines POS tagging with the BIO scheme, which can be integrated with modern transformer models to address this challenge of identifying causality in a given text. Our best methodology achieves an F1-Score of 0.9551, and an Exact Match Score of 0.8777 on the blind test in the FinCausal-2021 Shared Task at the FinCausal 2021 Workshop.
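The BIO scheme mentioned above can be sketched on a toy sentence. The abstract does not say how POS tags are combined with the BIO labels, so pairing each token with a hand-written POS tag is an assumption, as are the `C`/`E` label names, the function name, and the example sentence.

```python
def bio_tags(tokens, cause_span, effect_span):
    """Assign BIO labels to tokens given (start, end) spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for label, (start, end) in (("C", cause_span), ("E", effect_span)):
        for i in range(start, end):
            # The first token of a span gets B-, the remaining tokens get I-.
            tags[i] = ("B-" if i == start else "I-") + label
    return tags

tokens = ["Profits", "fell", "because", "costs", "rose"]
# Hypothetical POS tags, written by hand rather than produced by a tagger.
pos_tags = ["NOUN", "VERB", "SCONJ", "NOUN", "VERB"]

tags = bio_tags(tokens, cause_span=(3, 5), effect_span=(0, 2))
# Each token carries its POS tag as an extra input feature next to its BIO target.
features = list(zip(tokens, pos_tags, tags))
```

A token-classification model would then be trained to predict the third element of each triple from the first two.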
Submitted 31 October, 2021;
originally announced November 2021.
-
"So You Think You're Funny?": Rating the Humour Quotient in Standup Comedy
Authors:
Anirudh Mittal,
Pranav Jeevan,
Prerak Gandhi,
Diptesh Kanojia,
Pushpak Bhattacharyya
Abstract:
Computational Humour (CH) has attracted the interest of Natural Language Processing and Computational Linguistics communities. Creating datasets for automatic measurement of humour quotient is difficult due to multiple possible interpretations of the content. In this work, we create a multi-modal humour-annotated dataset (~40 hours) using stand-up comedy clips. We devise a novel scoring mechanism to annotate the training data with a humour quotient score using the audience's laughter. The normalized duration (laughter duration divided by the clip duration) of laughter in each clip is used to compute this humour coefficient score on a five-point scale (0-4). This method of scoring is validated by comparing with manually annotated scores, wherein a quadratic weighted kappa of 0.6 is obtained. We use this dataset to train a model that provides a "funniness" score, on a five-point scale, given the audio and its corresponding text. We compare various neural language models for the task of humour-rating and achieve an accuracy of 0.813 in terms of Quadratic Weighted Kappa (QWK). Our "Open Mic" dataset is released for further research along with the code.
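The scoring and validation steps described above can be sketched in a few lines. The abstract does not say how the normalized laughter duration is mapped onto the 0-4 scale, so the equal-width bins below are an assumption, as are the function names; the quadratic weighted kappa follows its standard definition.

```python
def humour_score(laughter_seconds, clip_seconds, levels=5):
    """Map normalized laughter duration to a score in 0..levels-1 (assumed equal-width bins)."""
    ratio = min(max(laughter_seconds / clip_seconds, 0.0), 1.0)
    return min(int(ratio * levels), levels - 1)

def quadratic_weighted_kappa(rater_a, rater_b, levels=5):
    """Agreement between two lists of integer ratings in 0..levels-1."""
    n = len(rater_a)
    observed = [[0] * levels for _ in range(levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(levels)) for j in range(levels)]
    disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(levels):
        for j in range(levels):
            weight = (i - j) ** 2 / (levels - 1) ** 2   # quadratic penalty
            disagreement += weight * observed[i][j]
            expected_disagreement += weight * hist_a[i] * hist_b[j] / n
    return 1.0 - disagreement / expected_disagreement

half_laughter = humour_score(15.0, 30.0)   # laughter covers half of the clip
manual = [0, 1, 2, 3, 4]
perfect = quadratic_weighted_kappa(manual, manual)
shifted = quadratic_weighted_kappa(manual, [1, 1, 2, 3, 3])
```

The quadratic weights penalize a disagreement of two points four times as heavily as a disagreement of one point, which suits an ordinal scale such as this one.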
Submitted 25 October, 2021;
originally announced October 2021.
-
Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers
Authors:
Dinesh Kumar Vishwakarma,
Mayank Jindal,
Ayush Mittal,
Aditya Sharma
Abstract:
Automated movie genre classification has emerged as an active and essential area of research and exploration. Short-duration movie trailers provide useful insights about the movie, as video content consists of cognitive and affective level features. Previous approaches focused on either cognitive or affective content analysis. In this paper, we propose a novel multi-modal (situation-, dialogue-, and metadata-based) movie genre classification framework that takes both cognition- and affect-based features into consideration. It is a pre-features fusion-based framework that takes into account situation-based features from regular snapshots of a trailer, which include nouns and verbs providing a useful affect-based mapping to the corresponding genres; dialogue (speech)-based features from audio; and metadata. Together, these provide the relevant information for cognitive and affect-based video analysis. We also develop the English movie trailer dataset (EMTD), which contains 2000 Hollywood movie trailers belonging to five popular genres: Action, Romance, Comedy, Horror, and Science Fiction, and perform cross-validation on the standard LMTD-9 dataset for validating the proposed framework. The results demonstrate that the proposed methodology for movie genre classification has performed excellently as depicted by the F1 scores, precision, recall, and area under the precision-recall curves.
Submitted 14 September, 2021;
originally announced September 2021.
-
On the Significance of Question Encoder Sequence Model in the Out-of-Distribution Performance in Visual Question Answering
Authors:
Gouthaman KV,
Anurag Mittal
Abstract:
Generalizing beyond the experiences has a significant role in developing practical AI systems. It has been shown that current Visual Question Answering (VQA) models are over-dependent on the language priors (spurious correlations between question types and their most frequent answers) from the train set and perform poorly on Out-of-Distribution (OOD) test sets. This behavior limits their generalizability and restricts them from being utilized in real-world situations. This paper shows that the sequence model architecture used in the question-encoder has a significant role in the generalizability of VQA models. To demonstrate this, we performed a detailed analysis of various existing RNN-based and Transformer-based question-encoders and, alongside, proposed a novel Graph attention network (GAT)-based question-encoder. Our study found that a better choice of sequence model in the question-encoder improves the generalizability of VQA models even without using any additional relatively complex bias-mitigation approaches.
Submitted 21 December, 2021; v1 submitted 28 August, 2021;
originally announced August 2021.