Showing 1–9 of 9 results for author: Petryk, S

  1. arXiv:2405.17247

    cs.LG

    An Introduction to Vision-Language Modeling

    Authors: Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, et al. (16 additional authors not shown)

    Abstract: Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technol…

    Submitted 27 May, 2024; originally announced May 2024.

  2. arXiv:2404.02904

    cs.CV cs.AI cs.CL cs.LG

    ALOHa: A New Measure for Hallucination in Captioning Models

    Authors: Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell

    Abstract: Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverage…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: To appear at NAACL 2024

  3. arXiv:2310.12971

    cs.CV cs.AI cs.CL

    CLAIR: Evaluating Image Captions with Large Language Models

    Authors: David Chan, Suzanne Petryk, Joseph E. Gonzalez, Trevor Darrell, John Canny

    Abstract: The evaluation of machine-generated image captions poses an interesting yet persistent challenge. Effective evaluation measures must consider numerous dimensions of similarity, including semantic relevance, visual structure, object interactions, caption diversity, and specificity. Existing highly-engineered measures attempt to capture specific aspects, but fall short in providing a holistic score…

    Submitted 19 October, 2023; originally announced October 2023.

    Comments: To appear at EMNLP 2023

  4. arXiv:2305.07021

    cs.CV

    Simple Token-Level Confidence Improves Caption Correctness

    Authors: Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

    Abstract: The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet…

    Submitted 11 May, 2023; originally announced May 2023.

  5. arXiv:2209.03745

    cs.CV

    Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers

    Authors: Kevin Miao, Akash Gokul, Raghav Singh, Suzanne Petryk, Joseph Gonzalez, Kurt Keutzer, Trevor Darrell, Colorado Reed

    Abstract: Recent trends in self-supervised representation learning have focused on removing inductive biases from training pipelines. However, inductive biases can be useful in settings when limited data are available or provide additional insight into the underlying data distribution. We present spatial prior attention (SPAN), a framework that takes advantage of consistent spatial and semantic structure in…

    Submitted 6 September, 2022; originally announced September 2022.

  6. arXiv:2204.13631

    cs.CV

    Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

    Authors: Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

    Abstract: Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say "I don't know" when they are uncertain (i.e., abstain from answering a question), such ability has been largely neglected in multimodal research, despite the importance of this problem to the usage of VQA in real settings. In this…

    Submitted 20 October, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

    Comments: ECCV 2022. Code and models are available here: https://github.com/facebookresearch/reliable_vqa

  7. arXiv:2202.08926

    cs.CV

    On Guiding Visual Attention with Language Specification

    Authors: Suzanne Petryk, Lisa Dunlap, Keyan Nasseri, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach

    Abstract: While real world challenges typically define visual categories with language words or phrases, most visual classification methods define categories with numerical indices. However, the language specification of the classes provides an especially useful prior for biased and noisy datasets, where it can help disambiguate what features are task-relevant. Recently, large-scale multimodal models have b…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: 14 pages, 9 figures

  8. arXiv:2010.01528

    cs.CV cs.AI cs.LG

    Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting

    Authors: Sayna Ebrahimi, Suzanne Petryk, Akash Gokul, William Gan, Joseph E. Gonzalez, Marcus Rohrbach, Trevor Darrell

    Abstract: The goal of continual learning (CL) is to learn a sequence of tasks without suffering from the phenomenon of catastrophic forgetting. Previous work has shown that leveraging memory in the form of a replay buffer can reduce performance degradation on prior tasks. We hypothesize that forgetting can be further reduced when the model is encouraged to remember the evidence for previously made…

    Submitted 2 May, 2021; v1 submitted 4 October, 2020; originally announced October 2020.

    Comments: Accepted at ICLR 2021

  9. arXiv:2004.00221

    cs.CV cs.LG cs.NE

    NBDT: Neural-Backed Decision Trees

    Authors: Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Henry Jin, Suzanne Petryk, Sarah Adel Bargal, Joseph E. Gonzalez

    Abstract: Machine learning applications such as finance and medicine demand accurate and justifiable predictions, barring most deep learning methods from use. In response, previous work combines decision trees with deep learning, yielding models that (1) sacrifice interpretability for accuracy or (2) sacrifice accuracy for interpretability. We forgo this dilemma by jointly improving accuracy and interpretab…

    Submitted 27 January, 2021; v1 submitted 1 April, 2020; originally announced April 2020.

    Comments: 8 pages, 7 figures, accepted to ICLR 2021