Showing 1–11 of 11 results for author: Narasimhan, M

  1. arXiv:2306.05392  [pdf, other]

    cs.CL

    Modular Visual Question Answering via Code Generation

    Authors: Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

    Abstract: We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning. The generated Python programs invoke and compose the o…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: ACL 2023

  2. arXiv:2303.13519  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Learning and Verification of Task Structure in Instructional Videos

    Authors: Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell

    Abstract: Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual lab…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: Website at https://medhini.github.io/task_structure

  3. arXiv:2208.06773  [pdf, other]

    cs.CV cs.IR cs.LG cs.MM

    TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

    Authors: Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid

    Abstract: YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. In compa…

    Submitted 14 August, 2022; originally announced August 2022.

    Comments: Accepted to ECCV 2022. Website: https://medhini.github.io/ivsum/

  4. arXiv:2111.12073  [pdf, other]

    cs.CV

    Multi-Person 3D Motion Prediction with Multi-Range Transformers

    Authors: Jiashun Wang, Huazhe Xu, Medhini Narasimhan, Xiaolong Wang

    Abstract: We propose a novel framework for multi-person 3D motion trajectory prediction. Our key observation is that a human's actions and behaviors may depend heavily on the other people nearby. Thus, instead of predicting each human pose trajectory in isolation, we introduce a Multi-Range Transformers model which consists of a local-range encoder for individual motion and a global-range encoder for social…

    Submitted 23 November, 2021; originally announced November 2021.

  5. arXiv:2107.00650  [pdf, other]

    cs.CV cs.AI cs.MM

    CLIP-It! Language-Guided Video Summarization

    Authors: Medhini Narasimhan, Anna Rohrbach, Trevor Darrell

    Abstract: A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. Further, existing models for fully automatic generic summarization have not exploited av…

    Submitted 7 December, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

    Comments: NeurIPS 2021. Website at https://medhini.github.io/clip_it/

    Journal ref: Thirty-Fifth Conference on Neural Information Processing Systems. 2021

  6. arXiv:2104.02687  [pdf, other]

    cs.CV cs.AI cs.MM

    Strumming to the Beat: Audio-Conditioned Contrastive Video Textures

    Authors: Medhini Narasimhan, Shiry Ginosar, Andrew Owens, Alexei A. Efros, Trevor Darrell

    Abstract: We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order. This classic work, however, was constrained by its use of hand-designed distance met…

    Submitted 6 April, 2021; originally announced April 2021.

    Comments: Project website at https://medhini.github.io/audio_video_textures/

  7. arXiv:2007.09841  [pdf, other]

    cs.CV cs.RO

    Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation

    Authors: Medhini Narasimhan, Erik Wijmans, Xinlei Chen, Trevor Darrell, Dhruv Batra, Devi Parikh, Amanpreet Singh

    Abstract: We introduce a learning-based approach for room navigation using semantic maps. Our proposed architecture learns to predict top-down belief maps of regions that lie beyond the agent's field of view while modeling architectural and stylistic regularities in houses. First, we train a model to generate amodal semantic top-down maps indicating beliefs of location, size, and shape of rooms by learning…

    Submitted 19 July, 2020; originally announced July 2020.

    Comments: Published at the European Conference on Computer Vision, 2020

  8. arXiv:1811.00538  [pdf, other]

    cs.CV

    Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

    Authors: Medhini Narasimhan, Svetlana Lazebnik, Alexander G. Schwing

    Abstract: Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction, a novel 'fact-based' visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entitie…

    Submitted 1 November, 2018; originally announced November 2018.

    Comments: Accepted to NIPS 2018

  9. arXiv:1809.01124  [pdf, other]

    cs.CV cs.AI

    Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering

    Authors: Medhini Narasimhan, Alexander G. Schwing

    Abstract: Question answering is an important task for autonomous agents and virtual assistants alike and was shown to support the disabled in efficiently navigating an overwhelming environment. Many existing methods focus on observation-based questions, ignoring our ability to seamlessly combine observed content with general knowledge. To understand interactions with a knowledge base, a dataset has been int…

    Submitted 4 September, 2018; originally announced September 2018.

    Comments: Accepted to ECCV 2018

  10. arXiv:1207.4151  [pdf]

    cs.LG cs.DS stat.ML

    PAC-learning bounded tree-width Graphical Models

    Authors: Mukund Narasimhan, Jeff A. Bilmes

    Abstract: We show that the class of strongly connected graphical models with treewidth at most k can be properly efficiently PAC-learnt with respect to the Kullback-Leibler Divergence. Previous approaches to this problem, such as those of Chow ([1]) and Höffgen ([7]), have shown that this class is PAC-learnable by reducing it to a combinatorial optimization problem. However, for k > 1, this problem is NP-com…

    Submitted 11 July, 2012; originally announced July 2012.

    Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

    Report number: UAI-P-2004-PG-410-417

  11. arXiv:1207.1404  [pdf]

    cs.LG cs.DS stat.ML

    A submodular-supermodular procedure with applications to discriminative structure learning

    Authors: Mukund Narasimhan, Jeff A. Bilmes

    Abstract: In this paper, we present an algorithm for minimizing the difference between two submodular functions using a variational framework which is based on (an extension of) the concave-convex procedure [17]. Because several commonly used metrics in machine learning, like mutual information and conditional mutual information, are submodular, the problem of minimizing the difference of two submodular pro…

    Submitted 4 July, 2012; originally announced July 2012.

    Comments: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005)

    Report number: UAI-P-2005-PG-404-412