-
EgoQR: Efficient QR Code Reading in Egocentric Settings
Authors:
Mohsen Moslehpour,
Yichao Lu,
Pierce Chuang,
Ashish Shenoy,
Debojeet Chatterjee,
Abhay Harpale,
Srihari Jayakumar,
Vikas Bhardwaj,
Seonghyeon Nam,
Anuj Kumar
Abstract:
QR codes have become ubiquitous in daily life, enabling rapid information exchange. With the increasing adoption of smart wearable devices, there is a need for efficient and frictionless QR code reading from egocentric points of view. However, adapting existing phone-based QR code readers to egocentric images poses significant challenges. Code reading from egocentric images brings unique challenges such as a wide field of view, code distortion, and a lack of visual feedback, compared to phones, where users can adjust the position and framing. Furthermore, wearable devices impose constraints on resources like compute, power, and memory. To address these challenges, we present EgoQR, a novel system for reading QR codes from egocentric images that is well suited for deployment on wearable devices. Our approach consists of two primary components: detection and decoding, designed to operate on high-resolution images on the device with minimal power consumption and added latency. The detection component efficiently locates potential QR codes within the image, while our enhanced decoding component extracts and interprets the encoded information. We incorporate innovative techniques to handle the specific challenges of egocentric imagery, such as varying perspectives, wider field of view, and motion blur. We evaluate our approach on a dataset of egocentric images, demonstrating a 34% improvement in code reading compared to existing state-of-the-art QR code readers.
Submitted 7 October, 2024;
originally announced October 2024.
-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1110 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 8 August, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Lumos : Empowering Multimodal LLMs with Scene Text Recognition
Authors:
Ashish Shenoy,
Yichao Lu,
Srihari Jayakumar,
Debojeet Chatterjee,
Mohsen Moslehpour,
Pierce Chuang,
Abhay Harpale,
Vikas Bhardwaj,
Di Xu,
Shicong Zhao,
Longfang Zhao,
Ankit Ramchandani,
Xin Luna Dong,
Anuj Kumar
Abstract:
We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first person point-of-view images, the output of which is used to augment input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we delve into those challenges, and discuss the system architecture, design choices, and modeling techniques employed to overcome these obstacles. We also provide a comprehensive evaluation for each component, showcasing high quality and efficiency.
Submitted 1 June, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Lorentz invariants of pure three-qubit states
Authors:
A R Usha Devi,
Sudha,
H Akshata Shenoy,
H S Karthik,
B N Karthik
Abstract:
Extending the mathematical framework of Phys. Rev. A 102, 052419 (2020) we construct Lorentz invariant quantities of pure three-qubit states. This method serves as a bridge between the well-known local unitary (LU) invariants viz. concurrences and three-tangle of an arbitrary three-qubit pure state and the Lorentz invariants of its reduced two-qubit systems.
Submitted 4 October, 2023;
originally announced October 2023.
-
Exploring Linguistic Similarity and Zero-Shot Learning for Multilingual Translation of Dravidian Languages
Authors:
Danish Ebadulla,
Rahul Raman,
S. Natarajan,
Hridhay Kiran Shetty,
Ashish Harish Shenoy
Abstract:
Current research in zero-shot translation is plagued by several issues such as high compute requirements, increased training time and off-target translations. Proposed remedies often come at the cost of additional data or compute requirements. Pivot-based neural machine translation is preferred over a single-encoder model for most settings despite the increased training and evaluation time. In this work, we overcome the shortcomings of zero-shot translation by taking advantage of transliteration and linguistic similarity. We build a single encoder-decoder neural machine translation system for Dravidian-Dravidian multilingual translation and perform zero-shot translation. We compare the data vs zero-shot accuracy tradeoff and evaluate the performance of our vanilla method against the current state-of-the-art pivot-based method. We also test the theory that morphologically rich languages require large vocabularies by restricting the vocabulary using an optimal-transport-based technique. Our model manages to achieve scores within 3 BLEU of large-scale pivot-based models when it is trained on 50% of the language directions.
Submitted 10 August, 2023;
originally announced August 2023.
-
Now It Sounds Like You: Learning Personalized Vocabulary On Device
Authors:
Sid Wang,
Ashish Shenoy,
Pierce Chuang,
John Nguyen
Abstract:
In recent years, Federated Learning (FL) has shown significant advancements in its ability to perform various natural language processing (NLP) tasks. This work focuses on applying personalized FL for on-device language modeling. Due to limitations of memory and latency, these models cannot support the complexity of sub-word tokenization or beam search decoding, resulting in the decision to deploy a closed-vocabulary language model. However, closed-vocabulary models are unable to handle out-of-vocabulary (OOV) words belonging to specific users. To address this issue, we propose a novel technique called "OOV expansion" that improves OOV coverage and increases model accuracy while minimizing the impact on memory and latency. This method introduces a personalized "OOV adapter" that effectively transfers knowledge from a central model and learns word embeddings for personalized vocabulary. OOV expansion significantly outperforms standard FL personalization methods on a set of common FL benchmarks.
Submitted 13 February, 2024; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Green Federated Learning
Authors:
Ashkan Yousefpour,
Shen Guo,
Ashish Shenoy,
Sayan Ghosh,
Pierre Stock,
Kiwan Maeng,
Schalk-Willem Krüger,
Michael Rabbat,
Carole-Jean Wu,
Ilya Mironov
Abstract:
The rapid progress of AI is fueled by increasingly large and computationally intensive machine learning models and datasets. As a consequence, the amount of compute used in training state-of-the-art models is exponentially increasing (doubling every 10 months between 2015 and 2022), resulting in a large carbon footprint. Federated Learning (FL) - a collaborative machine learning technique for training a centralized model using data of decentralized entities - can also be resource-intensive and have a significant carbon footprint, particularly when deployed at scale. Unlike centralized AI that can reliably tap into renewables at strategically placed data centers, cross-device FL may leverage as many as hundreds of millions of globally distributed end-user devices with diverse energy sources. Green AI is a novel and important research area where carbon footprint is regarded as an evaluation criterion for AI, alongside accuracy, convergence speed, and other metrics. In this paper, we propose the concept of Green FL, which involves optimizing FL parameters and making design choices to minimize carbon emissions consistent with competitive performance and training time. The contributions of this work are two-fold. First, we adopt a data-driven approach to quantify the carbon emissions of FL by directly measuring real-world at-scale FL tasks running on millions of phones. Second, we present challenges, guidelines, and lessons learned from studying the trade-off between energy efficiency, performance, and time-to-train in a production FL system. Our findings offer valuable insights into how FL can reduce its carbon footprint, and they provide a foundation for future research in the area of Green AI.
Submitted 1 August, 2023; v1 submitted 25 March, 2023;
originally announced March 2023.
-
Prediction of the outcome of a Twenty-20 Cricket Match : A Machine Learning Approach
Authors:
Ashish V Shenoy,
Arjun Singhvi,
Shruthi Racha,
Srinivas Tunuguntla
Abstract:
Twenty20 cricket, sometimes written Twenty-20, and often abbreviated to T20, is a short form of cricket. In a Twenty20 game the two teams of 11 players have a single innings each, which is restricted to a maximum of 20 overs. This version of cricket is especially unpredictable, which is one of the reasons it has gained popularity in recent times. Nevertheless, in this paper we try four different machine learning approaches for predicting the results of T20 cricket matches. Specifically, we take into account previous performance statistics of the players on the competing teams, ratings of players obtained from reputed cricket statistics websites, and clusters of players with similar performance statistics, and we propose a novel method using an Elo-based approach to rate players. We compare the performance of each of these feature engineering approaches using different ML algorithms, including logistic regression, support vector machines, Bayes networks, decision trees, and random forests.
Submitted 22 July, 2023; v1 submitted 13 September, 2022;
originally announced September 2022.
-
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems
Authors:
Jack FitzGerald,
Shankar Ananthakrishnan,
Konstantine Arkoudas,
Davide Bernardi,
Abhishek Bhagia,
Claudio Delli Bovi,
Jin Cao,
Rakesh Chada,
Amit Chauhan,
Luoxin Chen,
Anurag Dwarakanath,
Satyam Dwivedi,
Turan Gojayev,
Karthik Gopalakrishnan,
Thomas Gueudre,
Dilek Hakkani-Tur,
Wael Hamza,
Jonathan Hueser,
Kevin Martin Jose,
Haidar Khan,
Beiye Liu,
Jianhua Lu,
Alessandro Manzotti,
Pradeep Natarajan,
Karolina Owczarzak
, et al. (16 additional authors not shown)
Abstract:
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% to 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
Submitted 15 June, 2022;
originally announced June 2022.
-
Multi-Object Grasping -- Generating Efficient Robotic Picking and Transferring Policy
Authors:
Adheesh Shenoy,
Tianze Chen,
Yu Sun
Abstract:
Transferring multiple objects between bins is a common task for many applications. In robotics, a standard approach is to pick up and transfer one object at a time. However, grasping and picking up multiple objects and transferring them together at once is more efficient. This paper presents a set of novel strategies for efficiently grasping multiple objects in a bin to transfer them to another. The strategies enable a robotic hand to identify an optimal ready hand configuration (pre-grasp) and calculate a flexion synergy based on the desired quantity of objects to be grasped. This paper also presents an approach that uses the Markov decision process (MDP) to model the pick-transfer routines when the required quantity is larger than the capability of a single grasp. Using the MDP model, the proposed approach can generate an optimal pick-transfer routine that minimizes the number of transfers, representing efficiency. The proposed approach has been evaluated in both a simulation environment and on a real robotic system. The results show the approach reduces the number of transfers by 59% and the number of lifts by 58% compared to an optimal single object pick-transfer solution.
Submitted 17 December, 2021;
originally announced December 2021.
-
Prompt Tuning GPT-2 language model for parameter-efficient domain adaptation of ASR systems
Authors:
Saket Dingliwal,
Ashish Shenoy,
Sravan Bodapati,
Ankur Gandhe,
Ravi Teja Gadde,
Katrin Kirchhoff
Abstract:
Automatic Speech Recognition (ASR) systems have found their use in numerous industrial applications in very diverse domains, creating a need to adapt to new domains with small memory and deployment overhead. In this work, we introduce domain-prompts, a methodology that involves training a small number of domain embedding parameters to prime a Transformer-based Language Model (LM) to a particular domain. Using this domain-adapted LM for rescoring ASR hypotheses can achieve 7-13% WER reduction for a new domain with just 1000 unlabeled textual domain-specific sentences. This improvement is comparable to or even better than that of fully fine-tuned models, even though just 0.02% of the parameters of the base LM are updated. Additionally, our method is deployment-friendly as the learnt domain embeddings are prefixed to the input to the model rather than changing the base model architecture. Therefore, our method is an ideal choice for on-the-fly adaptation of LMs used in ASR systems to progressively scale them to new domains.
Submitted 21 July, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
Multi-Object Grasping -- Estimating the Number of Objects in a Robotic Grasp
Authors:
Tianze Chen,
Adheesh Shenoy,
Anzhelika Kolinko,
Syed Shah,
Yu Sun
Abstract:
A human hand can grasp a desired number of objects at once from a pile based solely on tactile sensing. To do so, a robot needs to grasp within a pile, sense the number of objects in the grasp before lifting, and predict the number of objects that will remain in the grasp after lifting. It is a challenging problem because when making the prediction, the robotic hand is still in the pile and the objects in the grasp are not observable to vision systems. Moreover, some objects that are grasped by the hand before lifting from the pile may fall out of the grasp when the hand is lifted. This occurs because they were supported by other objects in the pile instead of the fingers of the hand. Therefore, a robotic hand should sense the number of objects in a grasp using its tactile sensors before lifting. This paper presents novel multi-object grasp analysis methods for solving this problem. They include a grasp volume calculation, tactile force analysis, and a data-driven deep learning approach. The methods have been implemented on a Barrett hand and then evaluated in simulations and a real setup with a robotic system. The evaluation results show that once the Barrett hand grasps multiple objects in the pile, the data-driven model can predict, before lifting, the number of objects that will remain in the hand after lifting. The root-mean-square errors for our approach are 0.74 for balls and 0.58 for cubes in simulations, and 1.06 for balls and 1.45 for cubes in the real system.
Submitted 30 November, 2021;
originally announced December 2021.
-
Prompt-tuning in ASR systems for efficient domain-adaptation
Authors:
Saket Dingliwal,
Ashish Shenoy,
Sravan Bodapati,
Ankur Gandhe,
Ravi Teja Gadde,
Katrin Kirchhoff
Abstract:
Automatic Speech Recognition (ASR) systems have found their use in numerous industrial applications in very diverse domains. Since domain-specific systems perform better than their generic counterparts on in-domain evaluation, the need for memory- and compute-efficient domain adaptation is obvious. In particular, adapting parameter-heavy transformer-based language models used for rescoring ASR hypotheses is challenging. In this work, we overcome the problem using prompt-tuning, a methodology that trains a small number of domain token embedding parameters to prime a transformer-based LM to a particular domain. With just a handful of extra parameters per domain, we achieve much better perplexity scores over the baseline of using an unadapted LM. Despite being parameter-efficient, these improvements are comparable to those of fully fine-tuned models with hundreds of millions of parameters. We show that these perplexity improvements carry over to Word Error Rate in a domain-specific ASR system for one such domain.
Submitted 22 October, 2021; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Remember the context! ASR slot error correction through memorization
Authors:
Dhanush Bekal,
Ashish Shenoy,
Monica Sunkara,
Sravan Bodapati,
Katrin Kirchhoff
Abstract:
Accurate recognition of slot values such as domain-specific words or named entities by automatic speech recognition (ASR) systems forms the core of goal-oriented dialogue systems. Although it is a critical step with direct impact on downstream tasks such as language understanding, many domain-agnostic ASR systems tend to perform poorly on domain-specific or long-tail words. They are often supplemented with slot error correcting systems, but it is often hard for any neural model to directly output such rare entity words. To address this problem, we propose k-nearest neighbor (k-NN) search that outputs domain-specific entities from an explicit datastore. We improve the error correction rate by conveniently augmenting a pretrained joint phoneme and text based transformer sequence-to-sequence model with k-NN search during inference. We evaluate our proposed approach on five different domains containing long-tail slot entities such as full names, airports, street names, cities, and states. Our best performing error correction model shows a relative improvement of 7.4% in word error rate (WER) on rare word entities over the baseline and also achieves a relative WER improvement of 9.8% on an out-of-vocabulary (OOV) test set.
Submitted 17 September, 2021; v1 submitted 10 September, 2021;
originally announced September 2021.
-
ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling
Authors:
Ashish Shenoy,
Sravan Bodapati,
Katrin Kirchhoff
Abstract:
Automatic Speech Recognition (ASR) robustness toward slot entities is critical in e-commerce voice assistants that involve monetary transactions and purchases. Along with effective domain adaptation, it is intuitive that cross-utterance contextual cues play an important role in disambiguating domain-specific content words from speech. In this paper, we investigate various techniques to improve contextualization, content word robustness and domain adaptation of a Transformer-XL neural language model (NLM) to rescore ASR N-best hypotheses. To improve contextualization, we utilize turn-level dialogue acts along with cross-utterance context carry-over. Additionally, to adapt our domain-general NLM towards e-commerce on-the-fly, we use embeddings derived from a finetuned masked LM on in-domain data. Finally, to improve robustness towards in-domain content words, we propose a multi-task model that can jointly perform content word detection and language modeling tasks. Compared to a non-contextual LSTM LM baseline, our best performing NLM rescorer results in a content WER reduction of 19.2% on an e-commerce audio test set and a slot labeling F1 improvement of 6.4%.
Submitted 15 June, 2021;
originally announced June 2021.
-
Adapting Long Context NLM for ASR Rescoring in Conversational Agents
Authors:
Ashish Shenoy,
Sravan Bodapati,
Monica Sunkara,
Srikanth Ronanki,
Katrin Kirchhoff
Abstract:
Neural Language Models (NLM), when trained and evaluated with context spanning multiple utterances, have been shown to consistently outperform both conventional n-gram language models and NLMs that use limited context. In this paper, we investigate various techniques to incorporate turn-based context history into both recurrent (LSTM) and Transformer-XL based NLMs. For recurrent NLMs, we explore a context carry-over mechanism and feature-based augmentation, where we incorporate other forms of contextual information such as bot response and system dialogue acts as classified by a Natural Language Understanding (NLU) model. To mitigate the "sharp nearby, fuzzy far away" problem with contextual NLMs, we propose the use of an attention layer over lexical metadata to improve feature-based augmentation. Additionally, we adapt our contextual NLM towards user-provided on-the-fly speech patterns by leveraging encodings from a large pre-trained masked language model and performing fusion with a Transformer-XL based NLM. We test our proposed models using N-best rescoring of ASR hypotheses of task-oriented dialogues and also evaluate on downstream NLU tasks such as intent classification and slot labeling. The best performing model shows a relative WER improvement between 1.6% and 9.1% and a slot labeling F1 score improvement of 4% over non-contextual baselines.
Submitted 4 June, 2021; v1 submitted 20 April, 2021;
originally announced April 2021.
-
Contextual Biasing of Language Models for Speech Recognition in Goal-Oriented Conversational Agents
Authors:
Ashish Shenoy,
Sravan Bodapati,
Katrin Kirchhoff
Abstract:
Goal-oriented conversational interfaces are designed to accomplish specific tasks and typically have interactions that span multiple turns, adhering to a pre-defined structure and a goal. However, conventional neural language models (NLM) in Automatic Speech Recognition (ASR) systems are mostly trained sentence-wise with limited context. In this paper, we explore different ways to incorporate context into an LSTM-based NLM in order to model long-range dependencies and improve speech recognition. Specifically, we use context carry-over across multiple turns and use lexical contextual cues such as the system dialog act from Natural Language Understanding (NLU) models and the user-provided structure of the chatbot. We also propose a new architecture that utilizes context embeddings derived from BERT on sample utterances provided during inference time. Our experiments show a relative word error rate (WER) reduction of 7% over non-contextual utterance-level NLM rescorers on goal-oriented audio datasets.
Submitted 4 June, 2021; v1 submitted 18 March, 2021;
originally announced March 2021.
-
Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation
Authors:
Aman Shenoy,
Ashish Sardana
Abstract:
Sentiment Analysis and Emotion Detection in conversation are key in several real-world applications, with an increase in available modalities aiding a better understanding of the underlying emotions. Multi-modal Emotion Detection and Sentiment Analysis can be particularly useful, as applications will be able to use specific subsets of available modalities, as per the available data. Current systems dealing with multi-modal functionality fail to leverage and capture the context of the conversation through all modalities, the dependency between the listener(s) and speaker emotional states, and the relevance and relationship between the available modalities. In this paper, we propose an end-to-end RNN architecture that attempts to address all of these drawbacks. Our proposed model, at the time of writing, outperforms the state of the art on a benchmark dataset on a variety of accuracy and regression metrics.
Submitted 22 June, 2020; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Countering Inconsistent Labelling by Google's Vision API for Rotated Images
Authors:
Aman Apte,
Aritra Bandyopadhyay,
K Akhilesh Shenoy,
Jason Peter Andrews,
Aditya Rathod,
Manish Agnihotri,
Aditya Jajodia
Abstract:
Google's Vision API analyses images and provides a variety of output predictions; one such type is context-based labelling. In this paper, it is shown that adversarial examples that cause incorrect label prediction and spoofing can be generated by rotating the images. Due to the black-box nature of the API, a modular context-based pre-processing pipeline is proposed, consisting of a ResNet-50 model that predicts the angle by which the image must be rotated to correct its orientation. The pipeline successfully performs the correction whilst maintaining the image's resolution and feeds it to the API, which generates labels similar to those of the original, correctly oriented image. Using a Percentage Error metric, the performance of the corrected images is found to be significantly higher than that of their rotated counterparts. These observations imply that the API can benefit from such a pre-processing pipeline to increase robustness to rotational perturbations.
Submitted 17 November, 2019;
originally announced November 2019.
-
Flow topology during multiplexed particle manipulation using a Stokes Trap
Authors:
Anish Shenoy,
Dinesh Kumar,
Sascha Hilgenfeldt,
Charles M. Schroeder
Abstract:
Trapping and manipulation of small particles underlies many scientific and technological applications. Recently, the precise manipulation of multiple small particles was demonstrated using a Stokes trap that relies only on fluid flow without the need for optical or electric fields. Active flow control generates complex flow topologies around suspended particles during the trapping process, yet the relationship between the control algorithm and flow structure is not well understood. In this work, we characterize the flow topology during active control of particle trajectories using a Stokes trap. Our results show that optimal control of two particles unexpectedly relies on flow patterns with zero or one stagnation points, as opposed to positioning two particles using two distinct stagnation points. We characterize the sensitivity of the system with respect to the parameters in the control objective function, thereby providing a systematic understanding of the trapping process. Overall, these results will be useful in guiding applications involving the controlled manipulation of multiple colloidal particles and the precise deformation of soft particles in defined flow fields.
Submitted 5 August, 2019;
originally announced August 2019.
-
Orientation control and nonlinear trajectory tracking of colloidal particles using microfluidics
Authors:
Dinesh Kumar,
Anish Shenoy,
Songsong Li,
Charles M. Schroeder
Abstract:
Suspensions of anisotropic Brownian particles are commonly encountered in a wide array of applications such as drug delivery and manufacturing of fiber-reinforced composites. Technological applications and fundamental studies of small anisotropic particles critically require precise control of particle orientation over defined trajectories and paths. In this work, we demonstrate robust control over the two-dimensional (2D) center-of-mass position and orientation of anisotropic Brownian particles using only fluid flow. We implement a path-following model predictive control scheme to manipulate colloidal particles over defined trajectories in position space, where the speed of movement along the path is a degree of freedom in the controller design. We further explore how the external flow field affects the orientation dynamics of anisotropic particles in steady and transient extensional flow using a combination of experiments and analytical modeling. Overall, this technique offers new avenues for fundamental studies of anisotropic colloidal particles using only fluid flow, without the need for external electric or optical fields.
Submitted 19 July, 2019;
originally announced July 2019.
-
First results from the LUCID-Timepix spacecraft payload onboard the TechDemoSat-1 satellite in Low Earth Orbit
Authors:
Will Furnell,
Abhishek Shenoy,
Elliot Fox,
Peter Hatfield
Abstract:
The Langton Ultimate Cosmic ray Intensity Detector (LUCID) is a payload onboard the satellite TechDemoSat-1, used to study the radiation environment in Low Earth Orbit ($\sim$635 km). LUCID operated from 2014 to 2017, collecting over 2.1 million frames of radiation data from its five onboard Timepix detectors. LUCID is one of the first uses of the Timepix detector technology in open space, with the data providing useful insight into the performance of this technology in new environments. It provides high-sensitivity imaging measurements of the mixed radiation field, with a wide dynamic range in terms of spectral response, particle type and direction. The data has been analysed using computing resources provided by GridPP, with a new machine learning algorithm that uses the TensorFlow framework. This algorithm provides a new approach to processing Medipix data, using a training set of human-labelled tracks and providing greater particle classification accuracy than other algorithms. For managing the LUCID data, we have developed an online platform called Timepix Analysis Platform at School (TAPAS). This provides a swift and simple way for users to analyse data that they collect using Timepix detectors from both LUCID and other experiments. We also present some possible future uses of the LUCID data and Medipix detectors in space.
Submitted 30 October, 2018;
originally announced October 2018.
-
Gamma-Ray Bursts: Temporal Scales and the Bulk Lorentz Factor
Authors:
E. Sonbas,
G. A. MacLachlan,
K. S. Dhuga,
P. Veres,
A. Shenoy,
T. N. Ukwatta
Abstract:
For a sample of Swift and Fermi GRBs, we show that the minimum variability timescale and the spectral lag of the prompt emission are related to the bulk Lorentz factor in a complex manner: for small $Γ$'s, the variability timescale exhibits a shallow (plateau) region. For large $Γ$'s, the variability timescale declines steeply as a function of $Γ$ ($δT\proptoΓ^{-4.05\pm0.64}$). Evidence is also presented for an intriguing correlation between the peak times, t$_p$, of the afterglow emission and the prompt emission variability timescale.
Submitted 21 March, 2015; v1 submitted 13 August, 2014;
originally announced August 2014.
-
X - Ray Flares and Their Connection With Prompt Emission in GRBs
Authors:
E. Sonbas,
G. A. MacLachlan,
A. Shenoy,
K. S. Dhuga,
W. C. Parke
Abstract:
We use a wavelet technique to investigate the time variations in the light curves from a sample of GRBs detected by Fermi and Swift. We focus primarily on the behavior of the flaring region of Swift-XRT light curves in order to explore connections between variability time scales and pulse parameters (such as rise and decay times, widths, strengths, and separation distributions) and spectral lags. Tight correlations between some of these temporal features suggest a common origin for the production of X-ray flares and the prompt emission.
Submitted 8 August, 2013;
originally announced August 2013.
-
Probing Curvature Effects in the Fermi GRB 110920
Authors:
A. Shenoy,
E. Sonbas,
C. Dermer,
L. C. Maximon,
K. S. Dhuga,
P. N. Bhat,
J. Hakkila,
W. C. Parke,
G. A. Maclachlan,
T. N. Ukwatta
Abstract:
Curvature effects in Gamma-ray bursts (GRBs) have long been a source of considerable interest. In a collimated relativistic GRB jet, photons that are off-axis relative to the observer arrive at later times than on-axis photons and are also expected to be spectrally softer. In this work, we invoke a relatively simple kinematic two-shell collision model for a uniform jet profile and compare its predictions to GRB prompt-emission data for observations that have been attributed to curvature effects such as the peak-flux--peak-frequency relation, i.e., the relation between the $ν$F$_ν$ flux and the spectral peak, E$_{pk}$ in the decay phase of a GRB pulse, and spectral lags. In addition, we explore the behavior of pulse widths with energy. We present the case of the single-pulse Fermi GRB 110920, as a test for the predictions of the model against observations.
Submitted 7 September, 2013; v1 submitted 15 April, 2013;
originally announced April 2013.
-
A New Correlation Between GRB X-Ray Flares And The Prompt Emission
Authors:
E. Sonbas,
G. A. MacLachlan,
A. Shenoy,
K. S. Dhuga,
W. C. Parke
Abstract:
From a sample of GRBs detected by the $Fermi$ and $Swift$ missions, we have extracted the minimum variability time scales for temporal structures in the light curves associated with the prompt emission and X-ray flares. A comparison of this variability time scale with pulse parameters such as rise times, determined via pulse-fitting procedures, and spectral lags, extracted via the cross-correlation function (CCF), indicates a tight correlation between these temporal features for both the X-ray flares and the prompt emission. These correlations suggest a common origin for the production of X-ray flares and the prompt emission in GRBs.
Submitted 23 March, 2013; v1 submitted 25 October, 2012;
originally announced October 2012.
-
The Hurst Exponent of Fermi GRBs
Authors:
Glen MacLachlan,
Ashwin Shenoy,
Eda Sonbas,
Rob Coyne,
Kalvir Dhuga,
Ali Eskandarian,
Leonard Maximon,
William Parke
Abstract:
Using a wavelet decomposition technique, we have extracted the Hurst exponent for a sample of 46 long and 22 short Gamma-ray bursts (GRBs) detected by the Gamma-ray Burst Monitor (GBM) aboard the Fermi satellite. This exponent is a scaling parameter that provides a measure of long-range behavior in a time series. The mean Hurst exponent for the short GRBs is significantly smaller than that for the long GRBs. The separation may serve as an unbiased criterion for distinguishing short and long GRBs.
Submitted 17 September, 2013; v1 submitted 11 September, 2012;
originally announced September 2012.
-
Infrahumps detected in Kepler light curve of V1504 Cygni
Authors:
Robert Coyne,
Ashwin Shenoy,
Glen MacLachlan,
Tiffany Lewis,
Kalvir Dhuga,
Ali Eskandarian,
Bethany Cobb,
Leonard Maximon,
William Parke
Abstract:
We present a power spectral density analysis of the short cadence Kepler data for the cataclysmic variable V1504 Cygni. We identify three distinct periods: the orbital period ($1.669\pm0.005$ hours), the superhump period ($1.733\pm0.005$ hours), and the infrahump period ($1.628\pm0.005$ hours). The results are consistent with those predicted by the period excess-deficit relation.
Submitted 20 July, 2012; v1 submitted 28 June, 2012;
originally announced June 2012.
-
The Minimum Variability Time Scale and its Relation to Pulse Profiles of Fermi GRBs
Authors:
G. A. MacLachlan,
A. Shenoy,
E. Sonbas,
K. S. Dhuga,
A. Eskandarian,
L. C. Maximon,
W. C. Parke
Abstract:
We present a direct link between the minimum variability time scales extracted through a wavelet decomposition and the rise times of the shortest pulses extracted via fits of 34 Fermi GBM GRB light curves comprised of 379 pulses. Pulses used in this study were fitted with log-normal functions whereas the wavelet technique used employs a multiresolution analysis that does not rely on identifying distinct pulses. By applying a corrective filter to published data fitted with pulses we demonstrate agreement between these two independent techniques and offer a method for distinguishing signal from noise.
Submitted 12 June, 2012; v1 submitted 30 April, 2012;
originally announced May 2012.
-
Minimum Variability Time Scales of Long and Short GRBs
Authors:
G. A. MacLachlan,
A. Shenoy,
E. Sonbas,
K. S. Dhuga,
B. Cobb,
T. N. Ukwatta,
D. C. Morris,
A. Eskandarian,
L. C. Maximon,
W. C. Parke
Abstract:
We have investigated the time variations in the light curves from a sample of long and short Fermi/GBM Gamma-ray bursts (GRBs) using an impartial wavelet analysis. The results indicate that, in the source frame, the variability time scales for long bursts differ from those for short bursts, that variabilities on the order of a few milliseconds are not uncommon, and that an intriguing relationship exists between the minimum variability time and the burst duration.
Submitted 7 February, 2013; v1 submitted 20 January, 2012;
originally announced January 2012.
-
A Robust Quantum Random Number Generator Based on Bosonic Stimulation
Authors:
H. Akshata Shenoy,
S. Omkar,
R. Srikanth,
T. Srinivas
Abstract:
We propose a method to realize a robust quantum random number generator based on bosonic stimulation. A particular implementation that employs weak coherent pulses and conventional avalanche photo-diode detectors (APDs) is discussed.
Submitted 2 February, 2012; v1 submitted 3 October, 2011;
originally announced October 2011.
-
A New Frequency-Luminosity Relation for Long GRBs?
Authors:
T. N. Ukwatta,
K. S. Dhuga,
D. C. Morris,
G. MacLachlan,
W. C. Parke,
L. C. Maximon,
A. Eskandarian,
N. Gehrels,
J. P. Norris,
A. Shenoy
Abstract:
We have studied power density spectra (PDS) of 206 long Gamma-Ray Bursts (GRBs). We fitted the PDS with a simple power-law and extracted the exponent of the power-law (alpha) and the noise-crossing threshold frequency (f_th). We find that the distribution of the extracted alpha peaks around -1.4 and that of f_th around 1 Hz. In addition, based on a sub-set of 58 bursts with known redshifts, we show that the redshift-corrected threshold frequency is positively correlated with the isotropic peak luminosity. The correlation coefficient is 0.57 +/- 0.03.
Submitted 18 November, 2010;
originally announced November 2010.
-
Supporting Out-of-turn Interactions in a Multimodal Web Interface
Authors:
Atul Shenoy,
Naren Ramakrishnan,
Manuel A. Perez-Quinones,
Srinidhi Varadarajan
Abstract:
Multimodal interfaces are becoming increasingly important with the advent of mobile devices, accessibility considerations, and novel software technologies that combine diverse interaction media. This article investigates systems support for web browsing in a multimodal interface. Specifically, we outline the design and implementation of a software framework that integrates hyperlink and speech modes of interaction. Instead of viewing speech as merely an alternative interaction medium, the framework uses it to support out-of-turn interaction, providing a flexibility of information access not possible with hyperlinks alone. This approach enables the creation of websites that adapt to the needs of users, yet permits the designer fine-grained control over what interactions to support. Design methodology, implementation details, and two case studies are presented.
Submitted 4 July, 2003;
originally announced July 2003.