Showing 1–12 of 12 results for author: Marino, K

  1. arXiv:2406.14596

    cs.CV cs.AI cs.LG

    ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

    Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki

    Abstract: Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Ab…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Project website: http://ical-learning.github.io/

  2. arXiv:2302.00763

    cs.LG cs.AI cs.CL

    Collaborating with language models for embodied reasoning

    Authors: Ishita Dasgupta, Christine Kaeser-Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, Rob Fergus

    Abstract: Reasoning in a complex and ambiguous environment is a key goal for Reinforcement Learning (RL) agents. While some sophisticated RL agents can successfully solve difficult tasks, they require a large amount of training data and often struggle to generalize to new unseen environments and new tasks. On the other hand, Large Scale Language Models (LSLMs) have exhibited strong reasoning ability and the…

    Submitted 1 February, 2023; originally announced February 2023.

    Comments: Presented at NeurIPS 2022 Language and Reinforcement Learning Workshop (best paper) and NeurIPS 2022 Foundation Models for Decision Making Workshop. 4 pages main; 14 pages total (including references and appendix); 3 figures

  3. arXiv:2301.12507

    cs.AI

    Distilling Internet-Scale Vision-Language Models into Embodied Agents

    Authors: Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, Ishita Dasgupta

    Abstract: Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight…

    Submitted 14 June, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

    Comments: 9 pages, 7 figures. Presented at ICML 2023

  4. arXiv:2211.00177

    cs.LG cs.IR cs.SI

    Learning to Navigate Wikipedia by Taking Random Walks

    Authors: Manzil Zaheer, Kenneth Marino, Will Grathwohl, John Schultz, Wendy Shang, Sheila Babayan, Arun Ahuja, Ishita Dasgupta, Christine Kaeser-Chen, Rob Fergus

    Abstract: A fundamental ability of an intelligent web-based agent is seeking out and acquiring new information. Internet search engines reliably find the correct vicinity but the top results may be a few links away from the desired target. A complementary approach is navigation via hyperlinks, employing a policy that comprehends local content and selects a link that moves it closer to the target. In this pa…

    Submitted 31 October, 2022; originally announced November 2022.

    Journal ref: NeurIPS 2022

  5. arXiv:2206.01718

    cs.CV cs.CL

    A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge

    Authors: Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, Roozbeh Mottaghi

    Abstract: The Visual Question Answering (VQA) task aspires to provide a meaningful testbed for the development of AI models that can jointly reason over visual and natural language inputs. Despite a proliferation of VQA datasets, this goal is hindered by a set of common limitations. These include a reliance on relatively simplistic questions that are repetitive in both concepts and linguistic structure, lit…

    Submitted 3 June, 2022; originally announced June 2022.

  6. arXiv:2012.11014

    cs.CV cs.CL

    KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA

    Authors: Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, Marcus Rohrbach

    Abstract: One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image. In this work we study open-domain knowledge, the setting when the knowledge required to answer a question is not given/annotated, neither at training nor test time. We tap into two types of knowledge representations and reasoning. First, implicit knowledge which can…

    Submitted 20 December, 2020; originally announced December 2020.

  7. arXiv:2011.06431

    cs.RO cs.CV

    Same Object, Different Grasps: Data and Semantic Knowledge for Task-Oriented Grasping

    Authors: Adithyavairavan Murali, Weiyu Liu, Kenneth Marino, Sonia Chernova, Abhinav Gupta

    Abstract: Despite the enormous progress and generalization in robotic grasping in recent years, existing methods have yet to scale and generalize task-oriented grasping to the same extent. This is largely due to the scale of the datasets both in terms of the number of objects and tasks studied. We address these concerns with the TaskGrasp dataset which is more diverse both in terms of objects and tasks, and…

    Submitted 13 November, 2020; v1 submitted 12 November, 2020; originally announced November 2020.

    Comments: Accepted to Conference on Robot Learning (CoRL) 2020

  8. arXiv:2011.00517

    cs.LG

    Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning

    Authors: Valerie Chen, Abhinav Gupta, Kenneth Marino

    Abstract: Complex, multi-task problems have proven to be difficult to solve efficiently in a sparse-reward reinforcement learning setting. In order to be sample efficient, multi-task learning requires reuse and sharing of low-level policies. To facilitate the automatic decomposition of hierarchical tasks, we propose the use of step-by-step human demonstrations in the form of natural language instructions an…

    Submitted 26 September, 2021; v1 submitted 1 November, 2020; originally announced November 2020.

    Comments: Accepted at ICLR 2021

  9. arXiv:2006.15762

    cs.AI cs.LG stat.ML

    Empirically Verifying Hypotheses Using Reinforcement Learning

    Authors: Kenneth Marino, Rob Fergus, Arthur Szlam, Abhinav Gupta

    Abstract: This paper formulates hypothesis verification as an RL problem. Specifically, we aim to build an agent that, given a hypothesis about the dynamics of the world, can take actions to generate observations which can help predict whether the hypothesis is true or false. Existing RL algorithms fail to solve this task, even for simple environments. In order to train the agents, we exploit the underlying…

    Submitted 28 June, 2020; originally announced June 2020.

  10. arXiv:1906.00067

    cs.CV cs.CL

    OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

    Authors: Kenneth Marino, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi

    Abstract: Visual Question Answering (VQA) in its ideal form lets us study reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene understanding. However, most VQA benchmarks to date are focused on questions such as simple counting, visual attributes, and object detection that do not require reasoning or knowledge beyond what is in the image. In this paper, we addre…

    Submitted 4 September, 2019; v1 submitted 31 May, 2019; originally announced June 2019.

    Comments: CVPR 2019

  11. arXiv:1705.00053

    cs.CV

    The Pose Knows: Video Forecasting by Generating Pose Futures

    Authors: Jacob Walker, Kenneth Marino, Abhinav Gupta, Martial Hebert

    Abstract: Current approaches in video forecasting attempt to generate videos directly in pixel space using Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). However, since these approaches try to model all the structure and scene dynamics at once, in unconstrained settings they often generate uninterpretable results. Our insight is to model the forecasting problem at a higher level…

    Submitted 28 April, 2017; originally announced May 2017.

    Comments: Project Website: http://www.cs.cmu.edu/~jcwalker/POS/POS.html

  12. arXiv:1612.04844

    cs.CV

    The More You Know: Using Knowledge Graphs for Image Classification

    Authors: Kenneth Marino, Ruslan Salakhutdinov, Abhinav Gupta

    Abstract: One characteristic that sets humans apart from modern learning-based computer vision algorithms is the ability to acquire knowledge about the world and use that knowledge to reason about the visual world. Humans can learn about the characteristics of objects and the relationships that occur between them to learn a large variety of visual concepts, often with few examples. This paper investigates t…

    Submitted 21 April, 2017; v1 submitted 14 December, 2016; originally announced December 2016.

    Comments: CVPR 2017