Skip to main content

Showing 1–17 of 17 results for author: Somepalli, G

  1. arXiv:2406.10328  [pdf, other

    cs.CV cs.CL cs.LG

    From Pixels to Prose: A Large Dataset of Dense Image Captions

    Authors: Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein

    Abstract: Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure d… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: pixelprose 16M dataset

  2. arXiv:2406.10209  [pdf, other

    cs.CL

    Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

    Authors: Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele, Tom Goldstein

    Abstract: Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verba… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 9.5 pages, 8 figures, and 1 table in the main body. Code available at https://github.com/ahans30/goldfish-loss

  3. arXiv:2405.08813  [pdf, other

    cs.CV cs.LG cs.MM

    CinePile: A Long Video Question Answering Dataset and Benchmark

    Authors: Ruchit Rawal, Khalid Saifullah, Miquel Farré, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein

    Abstract: Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This… ▽ More

    Submitted 20 October, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

    Comments: Project page with all the artifacts - https://ruchitrawal.github.io/cinepile/. Updated version with adversarial refinement pipeline and more model evaluations

  4. arXiv:2405.06331  [pdf, other

    cs.LG cs.CL

    LMD3: Language Model Data Density Dependence

    Authors: John Kirchenbauer, Garrett Honke, Gowthami Somepalli, Jonas Geiping, Daphne Ippolito, Katherine Lee, Tom Goldstein, David Andre

    Abstract: We develop a methodology for analyzing language model task performance at the individual example level based on training data density estimation. Experiments with paraphrasing as a controlled intervention on finetuning data demonstrate that increasing the support in the training distribution for specific test queries results in a measurable increase in density, which is also a significant predicto… ▽ More

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: 10 pages in the main body

  5. arXiv:2404.01292  [pdf, other

    cs.CV cs.LG

    Measuring Style Similarity in Diffusion Models

    Authors: Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, Tom Goldstein

    Abstract: Generative models are now widely used by graphic designers and artists. Prior works have shown that these models remember and often replicate content from their training data during generation. Hence as their proliferation increases, it has become important to perform a database search to determine whether the properties of the image are attributable to specific training data, every time before a… ▽ More

    Submitted 1 April, 2024; originally announced April 2024.

  6. arXiv:2311.05877  [pdf, other

    cs.LG cs.AI

    A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning

    Authors: Valeriia Cherepanova, Roman Levin, Gowthami Somepalli, Jonas Geiping, C. Bayan Bruss, Andrew Gordon Wilson, Tom Goldstein, Micah Goldblum

    Abstract: Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent overfitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. E… ▽ More

    Submitted 10 November, 2023; originally announced November 2023.

    Journal ref: Conference on Neural Information Processing Systems 2023

  7. arXiv:2310.19909  [pdf, other

    cs.CV cs.LG

    Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks

    Authors: Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein

    Abstract: Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performan… ▽ More

    Submitted 19 November, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted to NeurIPS 2023

  8. arXiv:2310.05914  [pdf, other

    cs.CL cs.LG

    NEFTune: Noisy Embeddings Improve Instruction Finetuning

    Authors: Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein

    Abstract: We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instru… ▽ More

    Submitted 10 October, 2023; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: 25 pages, Code is available on Github: https://github.com/neelsjain/NEFTune

  9. arXiv:2309.00614  [pdf, other

    cs.LG cs.CL cs.CR

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Authors: Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein

    Abstract: As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain… ▽ More

    Submitted 4 September, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

    Comments: 12 pages

  10. arXiv:2305.20086  [pdf, other

    cs.LG cs.CR cs.CV

    Understanding and Mitigating Copying in Diffusion Models

    Authors: Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, Tom Goldstein

    Abstract: Images generated by diffusion models like Stable Diffusion are increasingly widespread. Recent works and even lawsuits have shown that these models are prone to replicating their training data, unbeknownst to the user. In this paper, we first analyze this memorization problem in text-to-image diffusion models. While it is widely believed that duplicated images in the training set are responsible f… ▽ More

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: 17 pages, preprint. Code is available at https://github.com/somepago/DCR

  11. arXiv:2212.03860  [pdf, other

    cs.LG cs.CV cs.CY

    Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models

    Authors: Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, Tom Goldstein

    Abstract: Cutting-edge diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes. But do diffusion models create unique works of art, or are they replicating content directly from their training sets? In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detec… ▽ More

    Submitted 12 December, 2022; v1 submitted 7 December, 2022; originally announced December 2022.

    Comments: Updated draft with the following changes (1) Clarified the LAION Aesthetics versions everywhere (2) Correction on which LAION Aesthetics version SD - 1.4 is finetuned on and updated figure 12 based on this (3) A section on possible causes of replication

  12. arXiv:2210.06441  [pdf, other

    cs.LG cs.CV

    How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization

    Authors: Jonas Geiping, Micah Goldblum, Gowthami Somepalli, Ravid Shwartz-Ziv, Tom Goldstein, Andrew Gordon Wilson

    Abstract: Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse, but inconsist… ▽ More

    Submitted 30 March, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: 31 pages, 29 figures. To be presented at ICLR 2023. Code at https://github.com/JonasGeiping/dataaugs

  13. arXiv:2203.08124  [pdf, other

    cs.LG cs.CV

    Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective

    Authors: Gowthami Somepalli, Liam Fowl, Arpit Bansal, Ping Yeh-Chiang, Yehuda Dar, Richard Baraniuk, Micah Goldblum, Tom Goldstein

    Abstract: We discuss methods for visualizing neural network decision boundaries and decision regions. We use these visualizations to investigate issues related to reproducibility and generalization in neural network training. We observe that changes in model architecture (and its associate inductive bias) cause visible changes in decision boundaries, while multiple runs with the same architecture yield resu… ▽ More

    Submitted 15 March, 2022; originally announced March 2022.

    Comments: To appear in CVPR 2022

  14. arXiv:2111.01785  [pdf, other

    cs.CV cs.LG

    PatchGame: Learning to Signal Mid-level Patches in Referential Games

    Authors: Kamal Gupta, Gowthami Somepalli, Anubhav Gupta, Vinoj Jayasundara, Matthias Zwicker, Abhinav Shrivastava

    Abstract: We study a referential game (a type of signaling game) where two agents communicate with each other via a discrete bottleneck to achieve a common goal. In our referential game, the goal of the speaker is to compose a message or a symbolic representation of "important" image patches, while the task for the listener is to match the speaker's message to a different view of the same image. We show tha… ▽ More

    Submitted 2 November, 2021; originally announced November 2021.

    Comments: To appear at NeurIPS 2021

  15. arXiv:2106.01342  [pdf, other

    cs.LG cs.AI stat.ML

    SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training

    Authors: Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C. Bayan Bruss, Tom Goldstein

    Abstract: Tabular data underpins numerous high-impact applications of machine learning from fraud detection to genomics and healthcare. Classical approaches to solving tabular problems, such as gradient boosting and random forests, are widely used by practitioners. However, recent deep learning methods have achieved a degree of performance competitive with popular techniques. We devise a hybrid deep learnin… ▽ More

    Submitted 2 June, 2021; originally announced June 2021.

  16. arXiv:2102.13624  [pdf, other

    cs.LG cs.CR cs.CV

    What Doesn't Kill You Makes You Robust(er): How to Adversarially Train against Data Poisoning

    Authors: Jonas Geiping, Liam Fowl, Gowthami Somepalli, Micah Goldblum, Michael Moeller, Tom Goldstein

    Abstract: Data poisoning is a threat model in which a malicious actor tampers with training data to manipulate outcomes at inference time. A variety of defenses against this threat model have been proposed, but each suffers from at least one of the following flaws: they are easily overcome by adaptive attacks, they severely reduce testing performance, or they cannot generalize to diverse data poisoning thre… ▽ More

    Submitted 17 February, 2022; v1 submitted 26 February, 2021; originally announced February 2021.

    Comments: 25 pages, 15 figures

  17. arXiv:2003.10713  [pdf, other

    cs.LG stat.ML

    Unsupervised Anomaly Detection with Adversarial Mirrored AutoEncoders

    Authors: Gowthami Somepalli, Yexin Wu, Yogesh Balaji, Bhanukiran Vinzamuri, Soheil Feizi

    Abstract: Detecting out of distribution (OOD) samples is of paramount importance in all Machine Learning applications. Deep generative modeling has emerged as a dominant paradigm to model complex data distributions without labels. However, prior work has shown that generative models tend to assign higher likelihoods to OOD samples compared to the data distribution on which they were trained. First, we propo… ▽ More

    Submitted 3 January, 2021; v1 submitted 24 March, 2020; originally announced March 2020.

    Comments: Updated the paper with more OOD detection baselines. Performed ablation analysis on various components of AMA