-
From Pixels to Prose: A Large Dataset of Dense Image Captions
Authors:
Vasu Singla,
Kaiyu Yue,
Sukriti Paul,
Reza Shirkavand,
Mayuka Jayawardhana,
Alireza Ganjdanesh,
Heng Huang,
Abhinav Bhatele,
Gowthami Somepalli,
Tom Goldstein
Abstract:
Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure d…
▽ More
Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs
Authors:
Abhimanyu Hans,
Yuxin Wen,
Neel Jain,
John Kirchenbauer,
Hamid Kazemi,
Prajwal Singhania,
Siddharth Singh,
Gowthami Somepalli,
Jonas Geiping,
Abhinav Bhatele,
Tom Goldstein
Abstract:
Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verba…
▽ More
Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
CinePile: A Long Video Question Answering Dataset and Benchmark
Authors:
Ruchit Rawal,
Khalid Saifullah,
Miquel Farré,
Ronen Basri,
David Jacobs,
Gowthami Somepalli,
Tom Goldstein
Abstract:
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This…
▽ More
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset. The findings indicate that although current models underperform compared to humans, fine-tuning these models can lead to significant improvements in their performance.
△ Less
Submitted 20 October, 2024; v1 submitted 14 May, 2024;
originally announced May 2024.
-
LMD3: Language Model Data Density Dependence
Authors:
John Kirchenbauer,
Garrett Honke,
Gowthami Somepalli,
Jonas Geiping,
Daphne Ippolito,
Katherine Lee,
Tom Goldstein,
David Andre
Abstract:
We develop a methodology for analyzing language model task performance at the individual example level based on training data density estimation. Experiments with paraphrasing as a controlled intervention on finetuning data demonstrate that increasing the support in the training distribution for specific test queries results in a measurable increase in density, which is also a significant predicto…
▽ More
We develop a methodology for analyzing language model task performance at the individual example level based on training data density estimation. Experiments with paraphrasing as a controlled intervention on finetuning data demonstrate that increasing the support in the training distribution for specific test queries results in a measurable increase in density, which is also a significant predictor of the performance increase caused by the intervention. Experiments with pretraining data demonstrate that we can explain a significant fraction of the variance in model perplexity via density measurements. We conclude that our framework can provide statistical evidence of the dependence of a target model's predictions on subsets of its training data, and can more generally be used to characterize the support (or lack thereof) in the training data for a given test task.
△ Less
Submitted 10 May, 2024;
originally announced May 2024.
-
Measuring Style Similarity in Diffusion Models
Authors:
Gowthami Somepalli,
Anubhav Gupta,
Kamal Gupta,
Shramay Palta,
Micah Goldblum,
Jonas Geiping,
Abhinav Shrivastava,
Tom Goldstein
Abstract:
Generative models are now widely used by graphic designers and artists. Prior works have shown that these models remember and often replicate content from their training data during generation. Hence as their proliferation increases, it has become important to perform a database search to determine whether the properties of the image are attributable to specific training data, every time before a…
▽ More
Generative models are now widely used by graphic designers and artists. Prior works have shown that these models remember and often replicate content from their training data during generation. Hence as their proliferation increases, it has become important to perform a database search to determine whether the properties of the image are attributable to specific training data, every time before a generated image is used for professional purposes. Existing tools for this purpose focus on retrieving images of similar semantic content. Meanwhile, many artists are concerned with style replication in text-to-image models. We present a framework for understanding and extracting style descriptors from images. Our framework comprises a new dataset curated using the insight that style is a subjective property of an image that captures complex yet meaningful interactions of factors including but not limited to colors, textures, shapes, etc. We also propose a method to extract style descriptors that can be used to attribute style of a generated image to the images used in the training dataset of a text-to-image model. We showcase promising results in various style retrieval tasks. We also quantitatively and qualitatively analyze style attribution and matching in the Stable Diffusion model. Code and artifacts are available at https://github.com/learn2phoenix/CSD.
△ Less
Submitted 1 April, 2024;
originally announced April 2024.
-
A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning
Authors:
Valeriia Cherepanova,
Roman Levin,
Gowthami Somepalli,
Jonas Geiping,
C. Bayan Bruss,
Andrew Gordon Wilson,
Tom Goldstein,
Micah Goldblum
Abstract:
Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent overfitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. E…
▽ More
Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent overfitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. Motivated by the increasing popularity of tabular deep learning, we construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers, using real datasets and multiple methods for generating extraneous features. We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features.
△ Less
Submitted 10 November, 2023;
originally announced November 2023.
-
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
Authors:
Micah Goldblum,
Hossein Souri,
Renkun Ni,
Manli Shu,
Viraj Prabhu,
Gowthami Somepalli,
Prithvijit Chattopadhyay,
Mark Ibrahim,
Adrien Bardes,
Judy Hoffman,
Rama Chellappa,
Andrew Gordon Wilson,
Tom Goldstein
Abstract:
Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performan…
▽ More
Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weakness of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones
△ Less
Submitted 19 November, 2023; v1 submitted 30 October, 2023;
originally announced October 2023.
-
NEFTune: Noisy Embeddings Improve Instruction Finetuning
Authors:
Neel Jain,
Ping-yeh Chiang,
Yuxin Wen,
John Kirchenbauer,
Hong-Min Chu,
Gowthami Somepalli,
Brian R. Bartoldson,
Bhavya Kailkhura,
Avi Schwarzschild,
Aniruddha Saha,
Micah Goldblum,
Jonas Geiping,
Tom Goldstein
Abstract:
We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instru…
▽ More
We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.
△ Less
Submitted 10 October, 2023; v1 submitted 9 October, 2023;
originally announced October 2023.
-
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Authors:
Neel Jain,
Avi Schwarzschild,
Yuxin Wen,
Gowthami Somepalli,
John Kirchenbauer,
Ping-yeh Chiang,
Micah Goldblum,
Aniruddha Saha,
Jonas Geiping,
Tom Goldstein
Abstract:
As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain…
▽ More
As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision?
We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLMs domain than it has been in computer vision.
△ Less
Submitted 4 September, 2023; v1 submitted 1 September, 2023;
originally announced September 2023.
-
Understanding and Mitigating Copying in Diffusion Models
Authors:
Gowthami Somepalli,
Vasu Singla,
Micah Goldblum,
Jonas Geiping,
Tom Goldstein
Abstract:
Images generated by diffusion models like Stable Diffusion are increasingly widespread. Recent works and even lawsuits have shown that these models are prone to replicating their training data, unbeknownst to the user. In this paper, we first analyze this memorization problem in text-to-image diffusion models. While it is widely believed that duplicated images in the training set are responsible f…
▽ More
Images generated by diffusion models like Stable Diffusion are increasingly widespread. Recent works and even lawsuits have shown that these models are prone to replicating their training data, unbeknownst to the user. In this paper, we first analyze this memorization problem in text-to-image diffusion models. While it is widely believed that duplicated images in the training set are responsible for content replication at inference time, we observe that the text conditioning of the model plays a similarly important role. In fact, we see in our experiments that data replication often does not happen for unconditional models, while it is common in the text-conditional case. Motivated by our findings, we then propose several techniques for reducing data replication at both training and inference time by randomizing and augmenting image captions in the training set.
△ Less
Submitted 31 May, 2023;
originally announced May 2023.
-
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
Authors:
Gowthami Somepalli,
Vasu Singla,
Micah Goldblum,
Jonas Geiping,
Tom Goldstein
Abstract:
Cutting-edge diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes. But do diffusion models create unique works of art, or are they replicating content directly from their training sets? In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detec…
▽ More
Cutting-edge diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes. But do diffusion models create unique works of art, or are they replicating content directly from their training sets? In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detect when content has been replicated. Applying our frameworks to diffusion models trained on multiple datasets including Oxford flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training set size impact rates of content replication. We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data.
△ Less
Submitted 12 December, 2022; v1 submitted 7 December, 2022;
originally announced December 2022.
-
How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization
Authors:
Jonas Geiping,
Micah Goldblum,
Gowthami Somepalli,
Ravid Shwartz-Ziv,
Tom Goldstein,
Andrew Gordon Wilson
Abstract:
Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse, but inconsist…
▽ More
Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse, but inconsistent with the data distribution can be even more valuable than additional training data. Moreover, we find that data augmentations which encourage invariances can be more valuable than invariance alone, especially on small and medium sized training sets. Following this observation, we show that augmentations induce additional stochasticity during training, effectively flattening the loss landscape.
△ Less
Submitted 30 March, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective
Authors:
Gowthami Somepalli,
Liam Fowl,
Arpit Bansal,
Ping Yeh-Chiang,
Yehuda Dar,
Richard Baraniuk,
Micah Goldblum,
Tom Goldstein
Abstract:
We discuss methods for visualizing neural network decision boundaries and decision regions. We use these visualizations to investigate issues related to reproducibility and generalization in neural network training. We observe that changes in model architecture (and its associate inductive bias) cause visible changes in decision boundaries, while multiple runs with the same architecture yield resu…
▽ More
We discuss methods for visualizing neural network decision boundaries and decision regions. We use these visualizations to investigate issues related to reproducibility and generalization in neural network training. We observe that changes in model architecture (and its associate inductive bias) cause visible changes in decision boundaries, while multiple runs with the same architecture yield results with strong similarities, especially in the case of wide architectures. We also use decision boundary methods to visualize double descent phenomena. We see that decision boundary reproducibility depends strongly on model width. Near the threshold of interpolation, neural network decision boundaries become fragmented into many small decision regions, and these regions are non-reproducible. Meanwhile, very narrows and very wide networks have high levels of reproducibility in their decision boundaries with relatively few decision regions. We discuss how our observations relate to the theory of double descent phenomena in convex models. Code is available at https://github.com/somepago/dbViz
△ Less
Submitted 15 March, 2022;
originally announced March 2022.
-
PatchGame: Learning to Signal Mid-level Patches in Referential Games
Authors:
Kamal Gupta,
Gowthami Somepalli,
Anubhav Gupta,
Vinoj Jayasundara,
Matthias Zwicker,
Abhinav Shrivastava
Abstract:
We study a referential game (a type of signaling game) where two agents communicate with each other via a discrete bottleneck to achieve a common goal. In our referential game, the goal of the speaker is to compose a message or a symbolic representation of "important" image patches, while the task for the listener is to match the speaker's message to a different view of the same image. We show tha…
▽ More
We study a referential game (a type of signaling game) where two agents communicate with each other via a discrete bottleneck to achieve a common goal. In our referential game, the goal of the speaker is to compose a message or a symbolic representation of "important" image patches, while the task for the listener is to match the speaker's message to a different view of the same image. We show that it is indeed possible for the two agents to develop a communication protocol without explicit or implicit supervision. We further investigate the developed protocol and show the applications in speeding up recent Vision Transformers by using only important patches, and as pre-training for downstream recognition tasks (e.g., classification). Code available at https://github.com/kampta/PatchGame.
△ Less
Submitted 2 November, 2021;
originally announced November 2021.
-
SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training
Authors:
Gowthami Somepalli,
Micah Goldblum,
Avi Schwarzschild,
C. Bayan Bruss,
Tom Goldstein
Abstract:
Tabular data underpins numerous high-impact applications of machine learning from fraud detection to genomics and healthcare. Classical approaches to solving tabular problems, such as gradient boosting and random forests, are widely used by practitioners. However, recent deep learning methods have achieved a degree of performance competitive with popular techniques. We devise a hybrid deep learnin…
▽ More
Tabular data underpins numerous high-impact applications of machine learning from fraud detection to genomics and healthcare. Classical approaches to solving tabular problems, such as gradient boosting and random forests, are widely used by practitioners. However, recent deep learning methods have achieved a degree of performance competitive with popular techniques. We devise a hybrid deep learning approach to solving tabular data problems. Our method, SAINT, performs attention over both rows and columns, and it includes an enhanced embedding method. We also study a new contrastive self-supervised pre-training method for use when labels are scarce. SAINT consistently improves performance over previous deep learning methods, and it even outperforms gradient boosting methods, including XGBoost, CatBoost, and LightGBM, on average over a variety of benchmark tasks.
△ Less
Submitted 2 June, 2021;
originally announced June 2021.
-
What Doesn't Kill You Makes You Robust(er): How to Adversarially Train against Data Poisoning
Authors:
Jonas Geiping,
Liam Fowl,
Gowthami Somepalli,
Micah Goldblum,
Michael Moeller,
Tom Goldstein
Abstract:
Data poisoning is a threat model in which a malicious actor tampers with training data to manipulate outcomes at inference time. A variety of defenses against this threat model have been proposed, but each suffers from at least one of the following flaws: they are easily overcome by adaptive attacks, they severely reduce testing performance, or they cannot generalize to diverse data poisoning thre…
▽ More
Data poisoning is a threat model in which a malicious actor tampers with training data to manipulate outcomes at inference time. A variety of defenses against this threat model have been proposed, but each suffers from at least one of the following flaws: they are easily overcome by adaptive attacks, they severely reduce testing performance, or they cannot generalize to diverse data poisoning threat models. Adversarial training, and its variants, are currently considered the only empirically strong defense against (inference-time) adversarial attacks. In this work, we extend the adversarial training framework to defend against (training-time) data poisoning, including targeted and backdoor attacks. Our method desensitizes networks to the effects of such attacks by creating poisons during training and injecting them into training batches. We show that this defense withstands adaptive attacks, generalizes to diverse threat models, and incurs a better performance trade-off than previous defenses such as DP-SGD or (evasion) adversarial training.
△ Less
Submitted 17 February, 2022; v1 submitted 26 February, 2021;
originally announced February 2021.
-
Unsupervised Anomaly Detection with Adversarial Mirrored AutoEncoders
Authors:
Gowthami Somepalli,
Yexin Wu,
Yogesh Balaji,
Bhanukiran Vinzamuri,
Soheil Feizi
Abstract:
Detecting out of distribution (OOD) samples is of paramount importance in all Machine Learning applications. Deep generative modeling has emerged as a dominant paradigm to model complex data distributions without labels. However, prior work has shown that generative models tend to assign higher likelihoods to OOD samples compared to the data distribution on which they were trained. First, we propo…
▽ More
Detecting out of distribution (OOD) samples is of paramount importance in all Machine Learning applications. Deep generative modeling has emerged as a dominant paradigm to model complex data distributions without labels. However, prior work has shown that generative models tend to assign higher likelihoods to OOD samples compared to the data distribution on which they were trained. First, we propose Adversarial Mirrored Autoencoder (AMA), a variant of Adversarial Autoencoder, which uses a mirrored Wasserstein loss in the discriminator to enforce better semantic-level reconstruction. We also propose a latent space regularization to learn a compact manifold for in-distribution samples. The use of AMA produces better feature representations that improve anomaly detection performance. Second, we put forward an alternative measure of anomaly score to replace the reconstruction-based metric which has been traditionally used in generative model-based anomaly detection methods. Our method outperforms the current state-of-the-art methods for anomaly detection on several OOD detection benchmarks.
△ Less
Submitted 3 January, 2021; v1 submitted 24 March, 2020;
originally announced March 2020.