Metric-aware LLM inference for regression and scoring

Authors: Michal Lukasik, Harikrishna Narasimhan, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar

Abstract: Large language models (LLMs) have demonstrated strong results on a range of NLP tasks. Typically, outputs are obtained via autoregressive sampling from the LLM's underlying distribution. Building on prior work on Minimum Bayes Risk Decoding, we show that this inference strategy can be suboptimal for a range of regression and scoring tasks, and associated evaluation metrics. As a remedy, we propose metric aware LLM inference: a decision theoretic approach optimizing for custom regression and scoring metrics at inference time. We report improvements over baselines on academic benchmarks and publicly available models.

Submitted 4 April, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

Comments: 15 pages

arXiv:2310.09250 [pdf, other]

It's an Alignment, Not a Trade-off: Revisiting Bias and Variance in Deep Models

Authors: Lin Chen, Michal Lukasik, Wittawat Jitkrittum, Chong You, Sanjiv Kumar

Abstract: Classical wisdom in machine learning holds that the generalization error can be decomposed into bias and variance, and these two terms exhibit a \emph{trade-off}. However, in this paper, we show that for an ensemble of deep learning based classification models, bias and variance are \emph{aligned} at a sample level, where squared bias is approximately \emph{equal} to variance for correctly classified sample points. We present empirical evidence confirming this phenomenon in a variety of deep learning models and datasets. Moreover, we study this phenomenon from two theoretical perspectives: calibration and neural collapse. We first show theoretically that under the assumption that the models are well calibrated, we can observe the bias-variance alignment. Second, starting from the picture provided by the neural collapse theory, we show an approximate correlation between bias and variance.

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2310.05337 [pdf, other]

What do larger image classifiers memorise?

Authors: Michal Lukasik, Vaishnavh Nagarajan, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

Abstract: The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels. To carefully study this issue, Feldman proposed a metric to quantify the degree of memorisation of individual training examples, and empirically computed the corresponding memorisation profile of a ResNet on image classification benchmarks. While an exciting first glimpse into what real-world models memorise, this leaves open a fundamental question: do larger neural models memorise more? We present a comprehensive empirical analysis of this question on image classification benchmarks. We find that training examples exhibit an unexpectedly diverse set of memorisation trajectories across model sizes: most samples experience decreased memorisation under larger models, while the rest exhibit cap-shaped or increasing memorisation. We show that various proxies for the Feldman memorisation score fail to capture these fundamental trends. Lastly, we find that knowledge distillation, an effective and popular model compression technique, tends to inhibit memorisation, while also improving generalisation. Specifically, memorisation is mostly inhibited on examples with increasing memorisation trajectories, thus pointing at how distillation improves generalisation.

Submitted 8 October, 2023; originally announced October 2023.

MSC Class: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

arXiv:2302.01576 [pdf, other]

ResMem: Learn what you can and memorize the rest

Authors: Zitong Yang, Michal Lukasik, Vaishnavh Nagarajan, Zonglin Li, Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Sanjiv Kumar

Abstract: The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model's residuals with a $k$-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.

Submitted 20 October, 2023; v1 submitted 3 February, 2023; originally announced February 2023.
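
Illustration (not from the paper): the abstract above describes fitting a base model's residuals with a k-nearest-neighbor regressor and summing the two predictions. A minimal sketch of that idea follows, assuming a one-dimensional least-squares line as the base model and unweighted neighbor averaging. The function names (`fit_line`, `knn_residual`, `resmem_predict`) and the toy data are hypothetical, and the authors' actual implementation may differ.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (stand-in base model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_residual(x, xs, residuals, k):
    """Average the residuals of the k training points closest to x."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(residuals[i] for i in nearest) / k


def resmem_predict(x, xs, ys, k=1):
    """Base prediction plus a k-NN estimate of the base model's residual.

    The base model is refit on every call to keep the sketch short.
    """
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * xi + intercept) for xi, y in zip(xs, ys)]
    return slope * x + intercept + knn_residual(x, xs, residuals, k)


# Toy data: a quadratic target that a straight line cannot fit on its own.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 1.0, 4.0, 9.0]
memorised = [resmem_predict(x, train_x, train_y, k=1) for x in train_x]
```

With k = 1 and distinct training inputs, each training point is its own nearest neighbor, so the combined prediction recovers the training label up to floating-point error, which matches the abstract's remark that the method can memorize the training labels by construction.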

arXiv:2211.05110 [pdf, other]

Large Language Models with Controllable Working Memory

Authors: Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, Sanjiv Kumar

Abstract: Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP), owing to their excellent understanding and generation abilities. Remarkably, what further sets these models apart is the massive amounts of world knowledge they internalize during pretraining. While many downstream applications provide the model with an informational context to aid its performance on the underlying task, how the model's world knowledge interacts with the factual information presented in the context remains underexplored. As a desirable behavior, an LLM should give precedence to the context whenever it contains task-relevant information that conflicts with the model's memorized knowledge. This enables model predictions to be grounded in the context, which can then be used to update or correct specific model predictions without frequent retraining. By contrast, when the context is irrelevant to the task, the model should ignore it and fall back on its internal knowledge. In this paper, we undertake a first joint study of the aforementioned two properties, namely controllability and robustness, in the context of LLMs. We demonstrate that state-of-the-art T5 and PaLM (both pretrained and finetuned) could exhibit poor controllability and robustness, which do not scale with increasing model size. As a solution, we propose a novel method - Knowledge Aware FineTuning (KAFT) - to strengthen both controllability and robustness by incorporating counterfactual and irrelevant contexts to standard supervised datasets. Our comprehensive evaluation showcases the utility of KAFT across model architectures and sizes.

Submitted 9 November, 2022; originally announced November 2022.

arXiv:2211.00635 [pdf, other]

Two-stage LLM Fine-tuning with Less Specialization and More Generalization

Authors: Yihan Wang, Si Si, Daliang Li, Michal Lukasik, Felix Yu, Cho-Jui Hsieh, Inderjit S Dhillon, Sanjiv Kumar

Abstract: Pretrained large language models (LLMs) are general purpose problem solvers applicable to a diverse set of tasks with prompts. They can be further improved towards a specific task by fine-tuning on a specialized dataset. However, fine-tuning usually makes the model narrowly specialized on this dataset with reduced general in-context learning performance, which is undesirable whenever the fine-tuned model needs to handle additional tasks where no fine-tuning data is available. In this work, we first demonstrate that fine-tuning on a single task indeed decreases LLMs' general in-context learning performance. We discover one important cause of such forgetting, format specialization, where the model overfits to the format of the fine-tuned task. We further show that format specialization happens at the very beginning of fine-tuning. To solve this problem, we propose Prompt Tuning with MOdel Tuning (ProMoT), a simple yet effective two-stage fine-tuning framework that reduces format specialization and improves generalization. ProMoT offloads task-specific format learning into additional and removable parameters by first doing prompt tuning and then fine-tuning the model itself with this soft prompt attached. With experiments on several fine-tuning tasks and 8 in-context evaluation tasks, we show that ProMoT achieves comparable performance on fine-tuned tasks to standard fine-tuning, but with much less loss of in-context learning performance across a broad range of out-of-domain evaluation tasks. More importantly, ProMoT can even enhance generalization on in-context learning tasks that are semantically related to the fine-tuned task, e.g. ProMoT on En-Fr translation significantly improves performance on other language pairs, and ProMoT on NLI improves performance on summarization. Experiments also show that ProMoT can improve the generalization performance of multi-task training.

Submitted 12 March, 2024; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: ICLR 2024

arXiv:2207.03833 [pdf, other]

XR Hackathon Going Online: Lessons Learned from a Case Study with Goethe-Institut

Authors: Wiesław Kopeć, Kinga Skorupska, Anna Jaskulska, Michał Łukasik, Barbara Karpowicz, Julia Paluch, Kinga Kwiatkowska, Daniel Jabłoński, Rafał Masłyk

Abstract: In this article we report a case study of a Language and Culture-oriented transdisciplinary XR hackathon organized with Goethe-Institut. The hackathon was hosted as an online event in November 2020 by our University Lab in collaboration with Goethe-Institut as a follow-up to our previous co-organized event within our research group Living Lab. We have improved the formula of the event based on lessons learned from its previous edition. First, in one of the two hackathon tracks we provided the participants with a custom VR framework, to serve as a starting point for their designs to skip the repetitive early development stage. In cooperation with our partner, Goethe-Institut, we have also outlined best modern research-backed language-learning practices and methods and gathered them into actionable evaluation criteria.

Submitted 8 July, 2022; originally announced July 2022.

arXiv:2206.06479 [pdf, other]

Robust Distillation for Worst-class Performance

Authors: Serena Wang, Harikrishna Narasimhan, Yichen Zhou, Sara Hooker, Michal Lukasik, Aditya Krishna Menon

Abstract: Knowledge distillation has proven to be an effective technique in improving the performance of a student model using predictions from a teacher model. However, recent work has shown that gains in average efficiency are not uniform across subgroups in the data, and in particular can often come at the cost of accuracy on rare subgroups and classes. To preserve strong performance across classes that may follow a long-tailed distribution, we develop distillation techniques that are tailored to improve the student's worst-class performance. Specifically, we introduce robust optimization objectives in different combinations for the teacher and student, and further allow for training with any tradeoff between the overall accuracy and the robust worst-class objective. We show empirically that our robust distillation techniques not only achieve better worst-class performance, but also lead to Pareto improvement in the tradeoff between overall performance and worst-class performance compared to other baseline methods. Theoretically, we provide insights into what makes a good teacher when the goal is to train a robust student.

Submitted 13 June, 2022; originally announced June 2022.

arXiv:2110.15440 [pdf, other]

HD-cos Networks: Efficient Neural Architectures for Secure Multi-Party Computation

Authors: Wittawat Jitkrittum, Michal Lukasik, Ananda Theertha Suresh, Felix Yu, Gang Wang

Abstract: Multi-party computation (MPC) is a branch of cryptography where multiple non-colluding parties execute a well designed protocol to securely compute a function. With the non-colluding party assumption, MPC has a cryptographic guarantee that the parties will not learn sensitive information from the computation process, making it an appealing framework for applications that involve privacy-sensitive user data. In this paper, we study training and inference of neural networks under the MPC setup. This is challenging because the elementary operations of neural networks such as the ReLU activation function and matrix-vector multiplications are very expensive to compute due to the added multi-party communication overhead. To address this, we propose the HD-cos network that uses 1) cosine as activation function, 2) the Hadamard-Diagonal transformation to replace the unstructured linear transformations. We show that both of the approaches enjoy strong theoretical motivations and efficient computation under the MPC setup. We demonstrate on multiple public datasets that HD-cos matches the quality of the more expensive baselines.

Submitted 28 October, 2021; originally announced October 2021.

arXiv:2110.06821 [pdf, other]

Leveraging redundancy in attention with Reuse Transformers

Authors: Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, Sanjiv Kumar

Abstract: Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way, and is key to their success across diverse applications in language and vision. However, a typical Transformer model computes such pairwise attention scores repeatedly for the same sequence, in multiple heads in multiple layers. We systematically analyze the empirical similarity of these scores across heads and layers and find them to be considerably redundant, especially adjacent layers showing high similarity. Motivated by these findings, we propose a novel architecture that reuses attention scores computed in one layer in multiple subsequent layers. Experiments on a number of standard benchmarks show that reusing attention delivers performance equivalent to or better than standard transformers, while reducing both compute and memory usage.

Submitted 13 October, 2021; originally announced October 2021.

arXiv:2106.10494 [pdf, other]

Teacher's pet: understanding and mitigating biases in distillation

Authors: Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

Abstract: Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model. Several works have shown that distillation significantly boosts the student's overall performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can harm performance on certain subgroups, e.g., classes with few associated samples. We trace this behaviour to errors made by the teacher distribution being transferred to and amplified by the student model. To mitigate this problem, we present techniques which soften the teacher influence for subgroups where it is less reliable. Experiments on several image classification benchmarks show that these modifications of distillation maintain boost in overall accuracy, while additionally ensuring improvement in subgroup performance.

Submitted 8 July, 2021; v1 submitted 19 June, 2021; originally announced June 2021.

Comments: 21 pages, 8 figures

arXiv:2106.08823 [pdf, other]

Eigen Analysis of Self-Attention and its Reconstruction from Partial Computation

Authors: Srinadh Bhojanapalli, Ayan Chakrabarti, Himanshu Jain, Sanjiv Kumar, Michal Lukasik, Andreas Veit

Abstract: State-of-the-art transformer models use pairwise dot-product based self-attention, which comes at a computational cost quadratic in the input sequence length. In this paper, we investigate the global structure of attention scores computed using this dot product mechanism on a typical distribution of inputs, and study the principal components of their variation. Through eigen analysis of full attention score matrices, as well as of their individual rows, we find that most of the variation among attention scores lies in a low-dimensional eigenspace. Moreover, we find significant overlap between these eigenspaces for different layers and even different transformer models. Based on this, we propose to compute scores only for a partial subset of token pairs, and use them to estimate scores for the remaining pairs. Beyond investigating the accuracy of reconstructing attention scores themselves, we investigate training transformer models that employ these approximations, and analyze the effect on overall accuracy. Our analysis and the proposed method provide insights into how to balance the benefits of exact pair-wise attention and its significant computational expense.

Submitted 16 June, 2021; originally announced June 2021.

Comments: 14 pages

arXiv:2010.11851 [pdf, other]

Hawkes Process Classification through Discriminative Modeling of Text

Authors: Rohan Tondulkar, Manisha Dubey, P. K. Srijith, Michal Lukasik

Abstract: Social media has provided a platform for users to gather and share information and stay updated with the news. Such networks also provide a platform to users where they can engage in conversations. However, micro-blogging platforms like Twitter restrict the length of text. Due to the paucity of word occurrences in such posts, classification of this information is a challenging task using standard tools of natural language processing (NLP). Moreover, the high complexity and dynamics of the posts in social media make text classification a challenging problem. However, considering additional cues in the form of past labels and times associated with the post can be potentially helpful for performing text classification in a better way. To address this problem, we propose models based on the Hawkes process (HP) which can naturally incorporate the temporal features and past labels along with textual features for improving short text classification. In particular, we propose a discriminative approach to model text in HP where the text features parameterize the base intensity and/or the triggering kernel. Another major contribution is to consider the kernel to be a function of both time and text, and further use a neural network to model the kernel. This enables modelling and effectively learning the text along with the historical influences for tweet classification. We demonstrate the advantages of the proposed techniques on standard benchmarks for rumour stance classification.

Submitted 22 October, 2020; originally announced October 2020.

Comments: 9 pages, 10 figures

arXiv:2010.07447 [pdf, ps, other]

Semantic Label Smoothing for Sequence to Sequence Problems

Authors: Michal Lukasik, Himanshu Jain, Aditya Krishna Menon, Seungyeon Kim, Srinadh Bhojanapalli, Felix Yu, Sanjiv Kumar

Abstract: Label smoothing has been shown to be an effective regularization strategy in classification, that prevents overfitting and helps in label de-noising. However, extending such methods directly to seq2seq settings, such as Machine Translation, is challenging: the large target output space of such problems makes it intractable to apply label smoothing over all possible outputs. Most existing approaches for seq2seq settings either do token level smoothing, or smooth over sequences generated by randomly substituting tokens in the target sequence. Unlike these works, in this paper, we propose a technique that smooths over \emph{well formed} relevant sequences that not only have sufficient n-gram overlap with the target sequence, but are also \emph{semantically similar}. Our method shows a consistent and significant improvement over the state-of-the-art techniques on different datasets.

Submitted 14 October, 2020; originally announced October 2020.

arXiv:2007.01570 [pdf, other]

doi 10.1145/3394486.3403296

Scaling Graph Neural Networks with Approximate PageRank

Authors: Aleksandar Bojchevski, Johannes Gasteiger, Bryan Perozzi, Amol Kapoor, Martin Blais, Benedek Rózemberczki, Michal Lukasik, Stephan Günnemann

Abstract: Graph neural networks (GNNs) have emerged as a powerful approach for solving many network mining tasks. However, learning on large graphs remains a challenge - many recently proposed scalable GNN approaches rely on an expensive message-passing procedure to propagate information through the graph. We present the PPRGo model which utilizes an efficient approximation of information diffusion in GNNs resulting in significant speed gains while maintaining state-of-the-art prediction performance. In addition to being faster, PPRGo is inherently scalable, and can be trivially parallelized for large datasets like those found in industry settings. We demonstrate that PPRGo outperforms baselines in both distributed and single-machine training environments on a number of commonly used academic graphs. To better analyze the scalability of large-scale graph learning methods, we introduce a novel benchmark graph with 12.4 million nodes, 173 million edges, and 2.8 million node features. We show that training PPRGo from scratch and predicting labels for all nodes in this graph takes under 2 minutes on a single machine, far outpacing other baselines on the same graph. We discuss the practical application of PPRGo to solve large-scale node classification problems at Google.

Submitted 5 April, 2022; v1 submitted 3 July, 2020; originally announced July 2020.

Comments: Published as a Conference Paper at ACM SIGKDD 2020. Author name changed from Johannes Klicpera to Johannes Gasteiger

arXiv:2004.14535 [pdf, other]

Text Segmentation by Cross Segment Attention

Authors: Michal Lukasik, Boris Dadachev, Gonçalo Simões, Kishore Papineni

Abstract: Document and discourse segmentation are two fundamental NLP tasks pertaining to breaking up text into constituents, which are commonly used to help downstream tasks such as information retrieval or text summarization. In this work, we propose three transformer-based architectures and provide comprehensive comparisons with previously proposed approaches on three standard datasets. We establish a new state-of-the-art, reducing in particular the error rates by a large margin in all cases. We further analyze model sizes and find that we can build models with many fewer parameters while keeping good performance, thus facilitating real-world applications.

Submitted 7 December, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

Comments: 10 pages, 4 figures

arXiv:2003.02819 [pdf, other]

Does label smoothing mitigate label noise?

Authors: Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

Abstract: Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors. Empirically, smoothing has been shown to improve both predictive performance and model calibration. In this paper, we study whether label smoothing is also effective as a means of coping with label noise. While label smoothing apparently amplifies this problem --- being equivalent to injecting symmetric noise to the labels --- we show how it relates to a general family of loss-correction techniques from the label noise literature. Building on this connection, we show that label smoothing is competitive with loss-correction under label noise. Further, we show that when distilling models from noisy data, label smoothing of the teacher is beneficial; this is in contrast to recent findings for noise-free problems, and sheds further light on settings where label smoothing is beneficial.

Submitted 5 March, 2020; originally announced March 2020.
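
Illustration (not from the paper): the abstract above defines label smoothing as mixing one-hot training labels with uniform label vectors. A minimal sketch of that mixing step follows. The function name `smooth_labels`, the mixing-weight name `alpha`, and the example values are assumptions made for illustration only.

```python
def smooth_labels(one_hot, alpha):
    """Mix a one-hot label vector with the uniform distribution over classes.

    alpha is the (hypothetical) mixing weight: 0 keeps the one-hot label,
    1 replaces it entirely with the uniform vector.
    """
    num_classes = len(one_hot)
    uniform = 1.0 / num_classes
    return [(1.0 - alpha) * y + alpha * uniform for y in one_hot]


# Four classes, true class at index 1, mixing weight 0.1.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], alpha=0.1)
```

In this example the true class keeps most of the probability mass and the remainder is spread evenly over all four classes, so the smoothed vector still sums to one.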

arXiv:1712.02223 [pdf, other]

doi 10.1016/j.ipm.2017.11.009

Discourse-Aware Rumour Stance Classification in Social Media Using Sequential Classifiers

Authors: Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, Michal Lukasik, Kalina Bontcheva, Trevor Cohn, Isabelle Augenstein

Abstract: Rumour stance classification, defined as classifying the stance of specific social media posts into one of supporting, denying, querying or commenting on an earlier post, is becoming of increasing interest to researchers. While most previous work has focused on using individual tweets as classifier inputs, here we report on the performance of sequential classifiers that exploit the discourse features inherent in social media interactions or 'conversational threads'. Testing the effectiveness of four sequential classifiers -- Hawkes Processes, Linear-Chain Conditional Random Fields (Linear CRF), Tree-Structured Conditional Random Fields (Tree CRF) and Long Short Term Memory networks (LSTM) -- on eight datasets associated with breaking news stories, and looking at different types of local and contextual features, our work sheds new light on the development of accurate stance classifiers. We show that sequential classifiers that exploit the use of discourse properties in social media conversations while using only local features, outperform non-sequential classifiers. Furthermore, we show that LSTM using a reduced set of features can outperform the other sequential classifiers; this performance is consistent across datasets and across types of stances. To conclude, our work also analyses the different features under study, identifying those that best help characterise and distinguish between stances, such as supporting tweets being more likely to be accompanied by evidence than denying tweets. We also set forth a number of directions for future research.

Submitted 6 December, 2017; originally announced December 2017.

Journal ref: Information Processing & Management, Volume 54, Issue 2, March 2018, Pages 273-290

arXiv:1609.09028 [pdf, other]

Stance Classification in Rumours as a Sequential Task Exploiting the Tree Structure of Social Media Conversations

Authors: Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, Michal Lukasik

Abstract: Rumour stance classification, the task that determines if each tweet in a collection discussing a rumour is supporting, denying, questioning or simply commenting on the rumour, has been attracting substantial interest. Here we introduce a novel approach that makes use of the sequence of transitions observed in tree-structured conversation threads in Twitter. The conversation threads are formed by harvesting users' replies to one another, which results in a nested tree-like structure. Previous work addressing the stance classification task has treated each tweet as a separate unit. Here we analyse tweets by virtue of their position in a sequence and test two sequential classifiers, Linear-Chain CRF and Tree CRF, each of which makes different assumptions about the conversational structure. We experiment with eight Twitter datasets, collected during breaking news, and show that exploiting the sequential structure of Twitter conversations achieves significant improvements over the non-sequential methods. Our work is the first to model Twitter conversations as a tree structure in this manner, introducing a novel way of tackling NLP tasks on Twitter conversations.

Submitted 11 October, 2016; v1 submitted 28 September, 2016; originally announced September 2016.

Comments: COLING 2016

arXiv:1609.01962 [pdf, other]

Using Gaussian Processes for Rumour Stance Classification in Social Media

Authors: Michal Lukasik, Kalina Bontcheva, Trevor Cohn, Arkaitz Zubiaga, Maria Liakata, Rob Procter

Abstract: Social media tend to be rife with rumours while new reports are released piecemeal during breaking news. Interestingly, one can mine multiple reactions expressed by social media users in those situations, exploring their stance towards rumours, ultimately enabling the flagging of highly disputed rumours as being potentially false. In this work, we set out to develop an automated, supervised classifier that uses multi-task learning to classify the stance expressed in each individual tweet in a rumourous conversation as either supporting, denying or questioning the rumour. Using a classifier based on Gaussian Processes, and exploring its effectiveness on two datasets with very different characteristics and varying distributions of stances, we show that our approach consistently outperforms competitive baseline classifiers. Our classifier is especially effective in estimating the distribution of different types of stance associated with a given rumour, which we set forth as a desired characteristic for a rumour-tracking system that will warn both ordinary users of Twitter and professional news practitioners when a rumour is being rebutted.

Submitted 7 September, 2016; originally announced September 2016.

arXiv:1506.00468 [pdf, ps, other]

Classifying Tweet Level Judgements of Rumours in Social Media

Authors: Michal Lukasik, Trevor Cohn, Kalina Bontcheva

Abstract: Social media is a rich source of rumours and corresponding community reactions. Rumours reflect different characteristics, some shared and some individual. We formulate the problem of classifying tweet level judgements of rumours as a supervised learning task. Both supervised and unsupervised domain adaptation are considered, in which tweets from a rumour are classified on the basis of other annotated rumours. We demonstrate how multi-task learning helps achieve good results on rumours from the 2011 England riots.

Submitted 10 September, 2015; v1 submitted 1 June, 2015; originally announced June 2015.

arXiv:1208.4822 [pdf, other]

Spectral and kinetic properties of electroluminescence of ZnS:Cu powder in polymer structure

Authors: E. Chimczak, T. Dunaj, M. Bertandt, A. Wieczorek, G. Neunert, G. Chimczak, M. Cież, M. Łukasik

Abstract: Spectral and kinetic measurements of the light output have been made for AC electroluminescent structure. ZnS:Cu is luminescence active layer in the structure. In kinetic measurements, excitation was by rectangular wave voltage pulse of 1 ms duration. During the excitation the structure emits blue-green light. The maximum of the spectrum lies at about 455 nm.

Submitted 23 August, 2012; originally announced August 2012.

arXiv:gr-qc/0610090 [pdf, ps, other]

doi 10.1088/0264-9381/24/5/015

Conformal Yano-Killing tensors for the Taub-NUT metric

Authors: Jacek Jezierski, Maciej Łukasik

Abstract: Symmetric conformal Killing tensors and (skew-symmetric) conformal Yano-Killing tensors for Euclidean Taub-NUT metric are given in explicit form. Relations between Yano and CYK tensors in terms of conformal rescaling are discussed.

Submitted 18 October, 2006; originally announced October 2006.

Comments: 12 pages

Journal ref: Class.Quant.Grav.24:1331-1340,2007

arXiv:gr-qc/0510058 [pdf, ps, other]

doi 10.1088/0264-9381/23/9/008

Conformal Yano-Killing tensor for the Kerr metric and conserved quantities

Authors: Jacek Jezierski, Maciej Łukasik

Abstract: Properties of (skew-symmetric) conformal Yano--Killing tensors are reviewed. Explicit forms of three symmetric conformal Killing tensors in Kerr spacetime are obtained from the Yano--Killing tensor. The relation between spin-2 fields and solutions to the Maxwell equations is used in the construction of a new conserved quantity which is quadratic in terms of the Weyl tensor. The formula obtained is similar to the functional obtained from the Bel--Robinson tensor and is examined in Kerr spacetime. A new interpretation of the conserved quantity obtained is proposed.

Submitted 28 December, 2005; v1 submitted 12 October, 2005; originally announced October 2005.

Comments: 29 pages

Journal ref: Class.Quant.Grav.23:2895-2918,2006

Showing 1–24 of 24 results for author: Lukasik, M