-
Grounding Multimodal Large Language Models in Actions
Authors:
Andrew Szot,
Bogdan Mazoure,
Harsh Agrawal,
Devon Hjelm,
Zsolt Kira,
Alexander Toshev
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground an MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adapters. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.
Submitted 12 June, 2024;
originally announced June 2024.
-
Large Language Models as Generalizable Policies for Embodied Tasks
Authors:
Andrew Szot,
Max Schwarzer,
Harsh Agrawal,
Bogdan Mazoure,
Walter Talbott,
Katherine Metcalf,
Natalie Mackraz,
Devon Hjelm,
Alexander Toshev
Abstract:
We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In particular, on 1,000 unseen tasks it achieves 42% success rate, 1.7x the success rate of other common learned baselines or zero-shot applications of LLMs. Finally, to aid the community in studying language conditioned, massively multi-task, embodied AI problems we release a novel benchmark, Language Rearrangement, consisting of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement. Video examples of LLaRP in unseen Language Rearrangement instructions are at https://llm-rl.github.io.
Submitted 16 April, 2024; v1 submitted 26 October, 2023;
originally announced October 2023.
-
Deep learning based projection domain metal segmentation for metal artifact reduction in cone beam computed tomography
Authors:
Harshit Agrawal,
Ari Hietanen,
Simo Särkkä
Abstract:
Metal artifact correction is a challenging problem in cone beam computed tomography (CBCT) scanning. Metal implants inserted into the anatomy cause severe artifacts in reconstructed images. Widely used inpainting-based metal artifact reduction (MAR) methods require segmentation of metal traces in the projections as a first step, which is a challenging task. One approach is to use a deep learning method to segment metals in the projections. However, the success of deep learning methods is limited by the availability of realistic training data. It is laborious and time consuming to get reliable ground truth annotations due to unclear implant boundaries and large numbers of projections. We propose to use X-ray simulations to generate a synthetic metal segmentation training dataset from clinical CBCT scans. We compare the effect of simulations with different numbers of photons and also compare several training strategies to augment the available data. We compare our model's performance on real clinical scans with conventional region growing threshold-based MAR, a moving metal artifact reduction method, and a recent deep learning method. We show that simulations with a relatively small number of photons are suitable for the metal segmentation task and that training the deep learning model with full size and cropped projections together improves the robustness of the model. We show substantial improvement in the image quality affected by severe motion, voxel size under-sampling, and out-of-FOV metals. Our method can be easily integrated into the existing projection-based MAR pipeline to get improved image quality. This method can provide a novel paradigm to accurately segment metals in CBCT projections.
Submitted 9 October, 2023; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Towards the Use of Saliency Maps for Explaining Low-Quality Electrocardiograms to End Users
Authors:
Ana Lucic,
Sheeraz Ahmad,
Amanda Furtado Brinhosa,
Vera Liao,
Himani Agrawal,
Umang Bhatt,
Krishnaram Kenthapadi,
Alice Xiang,
Maarten de Rijke,
Nicholas Drabowski
Abstract:
When using medical images for diagnosis, either by clinicians or artificial intelligence (AI) systems, it is important that the images are of high quality. When an image is of low quality, the medical exam that produced the image often needs to be redone. In telemedicine, a common problem is that the quality issue is only flagged once the patient has left the clinic, meaning they must return in order to have the exam redone. This can be especially difficult for people living in remote regions, who make up a substantial portion of the patients at Portal Telemedicina, a digital healthcare organization based in Brazil. In this paper, we report on ongoing work regarding (i) the development of an AI system for flagging and explaining low-quality medical images in real-time, (ii) an interview study to understand the explanation needs of stakeholders using the AI system at OurCompany, and, (iii) a longitudinal user study design to examine the effect of including explanations on the workflow of the technicians in our clinics. To the best of our knowledge, this would be the first longitudinal study on evaluating the effects of XAI methods on end-users -- stakeholders that use AI systems but do not have AI-specific expertise. We welcome feedback and suggestions on our experimental setup.
Submitted 6 July, 2022;
originally announced July 2022.
-
Effect of money heterogeneity on resource dependency in complex networks
Authors:
Harshit Agrawal,
Ashwin Lahorkar,
Snehal M. Shekatkar
Abstract:
Exchange of resources among individual components of a system is fundamental to systems like a social network of humans and a network of cities and villages. For various reasons, human society has come up with the notion of money as a proxy for resources. Here we extend the model of resource dependencies in networks that was recently proposed by one of us, by incorporating the concept of money so that the vertices of a network can sell and buy required resources among themselves. We simulate the model using the configuration model as a substrate for homogeneous as well as heterogeneous degree distributions and using various exchange strategies. We show that a moderate amount of initial heterogeneity in the money on the vertices can significantly improve the survivability of scale-free networks but not that of homogeneous networks like the Erdős-Rényi network. Our work is a step towards understanding the effect of the presence of money on the resource distribution dynamics in complex networks.
Submitted 26 August, 2022; v1 submitted 14 June, 2022;
originally announced June 2022.
-
Housekeep: Tidying Virtual Households using Commonsense Reasoning
Authors:
Yash Kant,
Arun Ramachandran,
Sriram Yenamandra,
Igor Gilitschenski,
Dhruv Batra,
Andrew Szot,
Harsh Agrawal
Abstract:
We introduce Housekeep, a benchmark to evaluate commonsense reasoning in the home for embodied AI. In Housekeep, an embodied agent must tidy a house by rearranging misplaced objects without explicit instructions specifying which objects need to be rearranged. Instead, the agent must learn from and is evaluated against human preferences of which objects belong where in a tidy house. Specifically, we collect a dataset of where humans typically place objects in tidy and untidy houses constituting 1799 objects, 268 object categories, 585 placements, and 105 rooms. Next, we propose a modular baseline approach for Housekeep that integrates planning, exploration, and navigation. It leverages a fine-tuned large language model (LLM) trained on an internet text corpus for effective planning. We show that our baseline agent generalizes to rearranging unseen objects in unknown environments. See our webpage for more details: https://yashkant.github.io/housekeep/
Submitted 21 May, 2022;
originally announced May 2022.
-
Simple and Effective Synthesis of Indoor 3D Scenes
Authors:
Jing Yu Koh,
Harsh Agrawal,
Dhruv Batra,
Richard Tucker,
Austin Waters,
Honglak Lee,
Yinfei Yang,
Jason Baldridge,
Peter Anderson
Abstract:
We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images while maintaining 3D consistency. Existing approaches are highly complex, with many separately trained stages and components. We propose a simple alternative: an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images. On the Matterport3D and RealEstate10K datasets, our approach significantly outperforms prior work when evaluated by humans, as well as on FID scores. Further, we show that our model is useful for generative data augmentation. A vision-and-language navigation (VLN) agent trained with trajectories spatially-perturbed by our model improves success rate by up to 1.5% over a state of the art baseline on the R2R benchmark. Our code will be made available to facilitate generative data augmentation and applications to downstream robotics and embodied AI tasks.
Submitted 1 December, 2022; v1 submitted 6 April, 2022;
originally announced April 2022.
-
Symmetric Convolutional Filters: A Novel Way to Constrain Parameters in CNN
Authors:
Harish Agrawal,
Sumana T.,
S. K. Nandy
Abstract:
We propose a novel technique to constrain parameters in CNN based on symmetric filters. We investigate the impact on SOTA networks when varying the combinations of symmetricity. We demonstrate that our models offer effective generalisation and a structured elimination of redundancy in parameters. We conclude by comparing our method with other pruning techniques.
Submitted 26 February, 2022;
originally announced February 2022.
-
SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
Authors:
Abhinav Moudgil,
Arjun Majumdar,
Harsh Agrawal,
Stefan Lee,
Dhruv Batra
Abstract:
Natural language instructions for visual navigation often use scene descriptions (e.g., "bedroom") and object references (e.g., "green chairs") to provide a breadcrumb trail to a goal location. This work presents a transformer-based vision-and-language navigation (VLN) agent that uses two different visual encoders -- a scene classification network and an object detector -- which produce features that match these two distinct types of visual cues. In our method, scene features contribute high-level contextual information that supports object-level processing. With this design, our model is able to use vision-and-language pretraining (i.e., learning the alignment between images and text from large-scale web data) to substantially improve performance on the Room-to-Room (R2R) and Room-Across-Room (RxR) benchmarks. Specifically, our approach leads to improvements of 1.8% absolute in SPL on R2R and 3.7% absolute in SR on RxR. Our analysis reveals even larger gains for navigation instructions that contain six or more object references, which further suggests that our approach is better able to use object features and align them to references in the instructions.
Submitted 26 October, 2021;
originally announced October 2021.
-
The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation
Authors:
Xiaoming Zhao,
Harsh Agrawal,
Dhruv Batra,
Alexander Schwing
Abstract:
It is fundamental for personal robots to reliably navigate to a specified goal. To study this task, PointGoal navigation has been introduced in simulated Embodied AI environments. Recent advances solve this PointGoal navigation task with near-perfect accuracy (99.6% success) in photo-realistically simulated environments, assuming noiseless egocentric vision, noiseless actuation, and most importantly, perfect localization. However, under realistic noise models for visual sensors and actuation, and without access to a "GPS and Compass sensor," the 99.6%-success agents for PointGoal navigation only succeed with 0.3%. In this work, we demonstrate the surprising effectiveness of visual odometry for the task of PointGoal navigation in this realistic setting, i.e., with realistic noise models for perception and actuation and without access to GPS and Compass sensors. We show that integrating visual odometry techniques into navigation policies improves the state-of-the-art on the popular Habitat PointNav benchmark by a large margin, improving success from 64.5% to 71.7% while executing 6.4 times faster.
Submitted 25 August, 2021;
originally announced August 2021.
-
Contrast and Classify: Training Robust VQA Models
Authors:
Yash Kant,
Abhinav Moudgil,
Dhruv Batra,
Devi Parikh,
Harsh Agrawal
Abstract:
Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions. Existing approaches address this by augmenting the dataset with question paraphrases from visual question generation models or adversarial perturbations. These approaches use the combined data to learn an answer classifier by minimizing the standard cross-entropy loss. To more effectively leverage augmented data, we build on the recent success in contrastive learning. We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses. The contrastive loss encourages representations to be robust to linguistic variations in questions while the cross-entropy loss preserves the discriminative power of representations for answer prediction.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training. On the VQA-Rephrasings benchmark, which measures the VQA model's answer consistency across human paraphrases of a question, ConClaT improves Consensus Score by 1.63% over an improved baseline. In addition, on the standard VQA 2.0 benchmark, we improve the VQA accuracy by 0.78% overall. We also show that ConClaT is agnostic to the type of data-augmentation strategy used.
Submitted 18 April, 2021; v1 submitted 12 October, 2020;
originally announced October 2020.
-
Spatially Aware Multimodal Transformers for TextVQA
Authors:
Yash Kant,
Dhruv Batra,
Peter Anderson,
Alex Schwing,
Devi Parikh,
Jiasen Lu,
Harsh Agrawal
Abstract:
Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity only looks at neighboring entities defined by a spatial graph. Further, each head in our multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing the attention amongst all visual entities; (2) we avoid learning redundant features. We show that our model improves the absolute accuracy of current state-of-the-art methods on TextVQA by 2.2% overall over an improved baseline, and 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly on ST-VQA, we improve the absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding.
Submitted 22 December, 2020; v1 submitted 23 July, 2020;
originally announced July 2020.
-
Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning
Authors:
Jyoti Aneja,
Harsh Agrawal,
Dhruv Batra,
Alexander Schwing
Abstract:
Diverse and accurate vision+language modeling is an important goal to retain creative freedom and maintain user engagement. However, adequately capturing the intricacies of diversity in language models is challenging. Recent works commonly resort to latent variable models augmented with more or less supervision from object detectors or part-of-speech tags. Common to all those methods is the fact that the latent variable either only initializes the sentence generation process or is identical across the steps of generation. Both methods offer no fine-grained control. To address this concern, we propose Seq-CVAE which learns a latent space for every word position. We encourage this temporal latent space to capture the 'intention' about how to complete the sentence by mimicking a representation which summarizes the future. We illustrate the efficacy of the proposed approach to anticipate the sentence continuation on the challenging MSCOCO dataset, significantly improving diversity metrics compared to baselines while performing on par w.r.t sentence quality.
Submitted 22 August, 2019;
originally announced August 2019.
-
EvalAI: Towards Better Evaluation Systems for AI Agents
Authors:
Deshraj Yadav,
Rishabh Jain,
Harsh Agrawal,
Prithvijit Chattopadhyay,
Taranjeet Singh,
Akash Jain,
Shiv Baran Singh,
Stefan Lee,
Dhruv Batra
Abstract:
We introduce EvalAI, an open source platform for evaluating and comparing machine learning (ML) and artificial intelligence algorithms (AI) at scale. EvalAI is built to provide a scalable solution to the research community to fulfill the critical need of evaluating machine learning models and agents acting in an environment against annotations or with a human-in-the-loop. This will help researchers, students, and data scientists to create, collaborate, and participate in AI challenges organized around the globe. By simplifying and standardizing the process of benchmarking these models, EvalAI seeks to lower the barrier to entry for participating in the global scientific effort to push the frontiers of machine learning and artificial intelligence, thereby increasing the rate of measurable progress in this domain.
Submitted 10 February, 2019;
originally announced February 2019.
-
nocaps: novel object captioning at scale
Authors:
Harsh Agrawal,
Karan Desai,
Yufei Wang,
Xinlei Chen,
Rishabh Jain,
Mark Johnson,
Dhruv Batra,
Devi Parikh,
Stefan Lee,
Peter Anderson
Abstract:
Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task. Dubbed 'nocaps', for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the OpenImages validation and test sets. The associated training data consists of COCO image-caption pairs, plus OpenImages image-level labels and object bounding boxes. Since OpenImages contains many more classes than COCO, nearly 400 object classes seen in test images have no or very few associated training captions (hence, nocaps). We extend existing novel object captioning models to establish strong baselines for this benchmark and provide analysis to guide future work on this task.
Submitted 30 September, 2019; v1 submitted 20 December, 2018;
originally announced December 2018.
-
Fabrik: An Online Collaborative Neural Network Editor
Authors:
Utsav Garg,
Viraj Prabhu,
Deshraj Yadav,
Ram Ramrakhya,
Harsh Agrawal,
Dhruv Batra
Abstract:
We present Fabrik, an online neural network editor that provides tools to visualize, edit, and share neural networks from within a browser. Fabrik provides a simple and intuitive GUI to import neural networks written in popular deep learning frameworks such as Caffe, Keras, and TensorFlow, and allows users to interact with, build, and edit models via simple drag and drop. Fabrik is designed to be framework agnostic and support high interoperability, and can be used to export models back to any supported framework. Finally, it provides powerful collaborative features to enable users to iterate over model design remotely and at scale.
Submitted 27 October, 2018;
originally announced October 2018.
-
Spectrum Allocation in Cognitive Networks
Authors:
Himanshu Agrawal
Abstract:
A cognitive network is a technique used to improve spectrum utilization. Current networks face severe spectrum scarcity because of the fixed assignment policy, under which a large amount of spectrum remains unused. To overcome this limitation, spectrum must be allocated dynamically. This paper discusses spectrum allocation thoroughly. Interference is the most important factor that needs to be considered. It is caused by the environment (noise) or by other radio users, and it limits the possibility of spectrum reuse. Channel assignment is one of the techniques used to control interference in the network. There exists a trade-off between network capacity and the level of contention. In cognitive radio networks, spectrum assignment (also called spectrum allocation or frequency assignment) is used to avoid interference. It is the process of simultaneously selecting the operating central frequency and bandwidth. In doing so, the process of sensing the spectrum becomes crucial; it must be reliable, accurate, and efficient. The accuracy of sensing affects the overall operation of cognitive networks. Accurate results not only lead to higher utilization of the spectrum but also preserve the privacy of the primary user. Sensing accuracy is strongly affected by natural causes such as noise, shadowing, and fading. There are many other challenges as well, such as hardware requirements, the hidden node problem, security, sensing frequency and duration, and decision fusion.
Submitted 15 December, 2016;
originally announced January 2017.
-
New Architecture for Dynamic Spectrum Allocation in Cognitive Heterogeneous Network using Self Organizing Map
Authors:
Himanshu Agrawal,
Krishna Asawa
Abstract:
This paper introduces a hybrid architecture for Dynamic Spectrum Allocation in a hierarchical network, combining centralized and distributed architectures to obtain an optimal allocation of radio resources. It can limit interference by interacting dynamically and enhance spectrum efficiency while maintaining the desired QoS in the network. The paper presents a dynamic framework for this interaction. The proposed architecture employs a simple learning rule based on Hebbian learning for sensing the primary network and allocating the spectrum.
Submitted 18 December, 2016;
originally announced December 2016.
-
Sort Story: Sorting Jumbled Images and Captions into Stories
Authors:
Harsh Agrawal,
Arjun Chandrasekaran,
Dhruv Batra,
Devi Parikh,
Mohit Bansal
Abstract:
Temporal common sense has applications in AI tasks such as QA, multi-document summarization, and human-AI communication. We propose the task of sequencing -- given a jumbled set of aligned image-caption pairs that belong to a story, the task is to sort them such that the output sequence forms a coherent story. We present multiple approaches, via unary (position) and pairwise (order) predictions, and their ensemble-based combinations, achieving strong results on this task. We use both text-based and image-based features, which depict complementary improvements. Using qualitative examples, we demonstrate that our models have learnt interesting aspects of temporal common sense.
Submitted 7 November, 2016; v1 submitted 23 June, 2016;
originally announced June 2016.
-
Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?
Authors:
Abhishek Das,
Harsh Agrawal,
C. Lawrence Zitnick,
Devi Parikh,
Dhruv Batra
Abstract:
We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans.
Submitted 17 June, 2016;
originally announced June 2016.
-
Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?
Authors:
Abhishek Das,
Harsh Agrawal,
C. Lawrence Zitnick,
Devi Parikh,
Dhruv Batra
Abstract:
We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans.
Submitted 17 June, 2016; v1 submitted 11 June, 2016;
originally announced June 2016.
-
CloudCV: Large Scale Distributed Computer Vision as a Cloud Service
Authors:
Harsh Agrawal,
Clint Solomon Mathialagan,
Yash Goyal,
Neelima Chavali,
Prakriti Banik,
Akrit Mohapatra,
Ahmed Osman,
Dhruv Batra
Abstract:
We are witnessing a proliferation of massive visual data. Unfortunately, scaling existing computer vision algorithms to large datasets leaves researchers repeatedly solving the same algorithmic, logistical, and infrastructural problems. Our goal is to democratize computer vision; one should not have to be a computer vision, big data, and distributed computing expert to have access to state-of-the-art distributed computer vision algorithms. We present CloudCV, a comprehensive system that provides access to state-of-the-art distributed computer vision algorithms as a cloud service through a Web Interface and APIs.
Submitted 13 February, 2017; v1 submitted 12 June, 2015;
originally announced June 2015.
-
Object-Proposal Evaluation Protocol is 'Gameable'
Authors:
Neelima Chavali,
Harsh Agrawal,
Aroma Mahendru,
Dhruv Batra
Abstract:
Object proposals have quickly become the de facto pre-processing step in a number of vision pipelines (for object detection, object discovery, and other tasks). Their performance is usually evaluated on partially annotated datasets. In this paper, we argue that the choice of using a partially annotated dataset for evaluation of object proposals is problematic: as we demonstrate via a thought experiment, the evaluation protocol is 'gameable', in the sense that progress under this protocol does not necessarily correspond to a "better" category-independent object proposal algorithm.
To alleviate this problem, we: (1) introduce a nearly-fully annotated version of the PASCAL VOC dataset, which serves as a test-bed to check if object proposal techniques are overfitting to a particular list of categories; (2) perform an exhaustive evaluation of object proposal methods on our introduced nearly-fully annotated PASCAL dataset and perform cross-dataset generalization experiments; and (3) introduce a diagnostic experiment to detect the bias capacity in an object proposal algorithm. This tool circumvents the need to collect a densely annotated dataset, which can be expensive and cumbersome. Finally, we plan to release an easy-to-use toolbox that combines various publicly available implementations of object proposal algorithms and standardizes proposal generation and evaluation, so that new methods can be added and evaluated on different datasets. We hope that the results presented in the paper will motivate the community to test the category independence of various object proposal methods by carefully choosing the evaluation protocol.
Submitted 23 November, 2015; v1 submitted 21 May, 2015;
originally announced May 2015.