
Showing 1–15 of 15 results for author: Vijayanarasimhan, S

  1. arXiv:2302.01328

    cs.CV cs.AI cs.CL cs.LG

    IC3: Image Captioning by Committee Consensus

    Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

    Abstract: If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" (most like a reference) image caption. Unfortunately, doing so encourages captions that are "informationally impoverished," and focus on only a subset of the possible details, while ignoring other potentially useful information in th…

    Submitted 19 October, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

    Comments: To Appear at EMNLP 2023

  2. arXiv:2212.10596

    cs.CV

    Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

    Authors: Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross

    Abstract: Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised mod…

    Submitted 10 January, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

  3. arXiv:2209.07518

    cs.CL cs.AI cs.CV cs.LG

    Distribution Aware Metrics for Conditional Natural Language Generation

    Authors: David M Chan, Yiming Ni, David A Ross, Sudheendra Vijayanarasimhan, Austin Myers, John Canny

    Abstract: Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground truth data (i.e. dispersi…

    Submitted 29 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

  4. arXiv:2205.06253

    cs.CV cs.CL

    What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

    Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny

    Abstract: While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In th…

    Submitted 12 January, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: The 1st Workshop on Vision Datasets Understanding, IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

  5. arXiv:2007.13913

    cs.CV cs.CL cs.LG

    Active Learning for Video Description With Cluster-Regularized Ensemble Ranking

    Authors: David M. Chan, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

    Abstract: Automatic video captioning aims to train models to generate text descriptions for all segments in a video, however, the most effective approaches require large amounts of manual annotation which is slow and expensive. Active learning is a promising way to efficiently build a training set for video captioning tasks while reducing the need to manually label uninformative examples. In this work we bo…

    Submitted 2 December, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Comments: Published at the 15th Asian Conference on Computer Vision (ACCV 2020)

  6. arXiv:1804.07667

    cs.CV

    Rethinking the Faster R-CNN Architecture for Temporal Action Localization

    Authors: Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, Rahul Sukthankar

    Abstract: We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions…

    Submitted 20 April, 2018; originally announced April 2018.

    Comments: Accepted in CVPR 2018

  7. arXiv:1707.01932

    cs.RO cs.LG stat.ML

    End-to-End Learning of Semantic Grasping

    Authors: Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, Sergey Levine

    Abstract: We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns object detection, classification, and grasp planning in an end-to-end fashion. A "ventral stream" recognizes object class while a "dorsal stream"…

    Submitted 9 November, 2017; v1 submitted 6 July, 2017; originally announced July 2017.

    Comments: 14 pages

  8. arXiv:1705.08421

    cs.CV

    AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

    Authors: Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

    Abstract: This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual…

    Submitted 30 April, 2018; v1 submitted 23 May, 2017; originally announced May 2017.

    Comments: To appear in CVPR 2018. Check dataset page https://research.google.com/ava/ for details

  9. arXiv:1705.06950

    cs.CV

    The Kinetics Human Action Video Dataset

    Authors: Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, Andrew Zisserman

    Abstract: We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such…

    Submitted 19 May, 2017; originally announced May 2017.

  10. arXiv:1705.02082

    cs.CV

    Motion Prediction Under Multimodality with Conditional Stochastic Networks

    Authors: Katerina Fragkiadaki, Jonathan Huang, Alex Alemi, Sudheendra Vijayanarasimhan, Susanna Ricco, Rahul Sukthankar

    Abstract: Given a visual history, multiple future outcomes for a video scene are equally probable, in other words, the distribution of future outcomes has multiple modes. Multimodality is notoriously hard to handle by standard regressors or classifiers: the former regress to the mean and the latter discretize a continuous high dimensional output space. In this work, we present stochastic neural network arch…

    Submitted 5 May, 2017; originally announced May 2017.

  11. arXiv:1704.07804

    cs.CV

    SfM-Net: Learning of Structure and Motion from Video

    Authors: Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki

    Abstract: We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), different…

    Submitted 25 April, 2017; originally announced April 2017.

  12. arXiv:1609.08675

    cs.CV

    YouTube-8M: A Large-Scale Video Classification Benchmark

    Authors: Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan

    Abstract: Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there…

    Submitted 27 September, 2016; originally announced September 2016.

    Comments: 10 pages

  13. arXiv:1505.06250

    cs.CV cs.MM cs.NE

    Efficient Large Scale Video Classification

    Authors: Balakrishnan Varadarajan, George Toderici, Sudheendra Vijayanarasimhan, Apostol Natsev

    Abstract: Video classification has advanced tremendously over the recent years. A large part of the improvements in video classification had to do with the work done by the image classification community and the use of deep convolutional networks (CNNs) which produce competitive results with hand-crafted motion features. These networks were adapted to use video frames in various ways and have yielded state…

    Submitted 22 May, 2015; originally announced May 2015.

  14. arXiv:1503.08909

    cs.CV

    Beyond Short Snippets: Deep Networks for Video Classification

    Authors: Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici

    Abstract: Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handli…

    Submitted 13 April, 2015; v1 submitted 31 March, 2015; originally announced March 2015.

  15. arXiv:1412.7479

    cs.NE cs.LG

    Deep Networks With Large Output Spaces

    Authors: Sudheendra Vijayanarasimhan, Jonathon Shlens, Rajat Monga, Jay Yagnik

    Abstract: Deep neural networks have been extremely successful at various image, speech, video recognition tasks because of their ability to model deep structures within the data. However, they are still prohibitively expensive to train and apply for problems containing millions of classes in the output layer. Based on the observation that the key computation common to most neural network layers is a vector/…

    Submitted 10 April, 2015; v1 submitted 23 December, 2014; originally announced December 2014.