Showing 1–31 of 31 results for author: Sikka, K

  1. arXiv:2407.02352  [pdf, other]

    cs.CL

    Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification

    Authors: Pritish Sahu, Karan Sikka, Ajay Divakaran

    Abstract: Large Visual Language Models (LVLMs) struggle with hallucinations in visual instruction following task(s), limiting their trustworthiness and real-world applicability. We propose Pelican -- a novel framework designed to detect and mitigate hallucinations through claim verification. Pelican first decomposes the visual claim into a chain of sub-claims based on first-order predicates. These sub-claim…

    Submitted 2 July, 2024; originally announced July 2024.

  2. arXiv:2312.00115  [pdf, other]

    cs.CV cs.CL

    A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

    Authors: Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran

    Abstract: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of…

    Submitted 30 November, 2023; originally announced December 2023.

    Comments: 13 pages, 15 tables, 5 figures

  3. arXiv:2311.10081  [pdf, other]

    cs.CV cs.CL cs.LG

    DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback

    Authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran

    Abstract: We present DRESS, a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art LVLMs. First, prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feed…

    Submitted 19 March, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: CVPR 2024. The feedback datasets are released at: https://huggingface.co/datasets/YangyiYY/LVLM_NLF

  4. arXiv:2310.10707  [pdf, other]

    cs.CL cs.AI

    Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning

    Authors: Anirudh Som, Karan Sikka, Helen Gent, Ajay Divakaran, Andreas Kathol, Dimitra Vergyri

    Abstract: Paraphrasing of offensive content is a better alternative to content removal and helps improve civility in a communication environment. Supervised paraphrasers, however, rely heavily on large quantities of labelled data to help preserve meaning and intent. They also often retain a large portion of the offensiveness of the original content, which raises questions on their overall usability. In this…

    Submitted 9 June, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

    Comments: Accepted in Association for Computational Linguistics (ACL) 2024 Findings

  5. arXiv:2309.04461  [pdf, other]

    cs.CL cs.CV cs.LG

    Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

    Authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran

    Abstract: Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are…

    Submitted 19 March, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

    Comments: NAACL 2024 Main Conference. The data is released at https://github.com/Yangyi-Chen/CoTConsistency

  6. arXiv:2309.04077  [pdf, other]

    cs.RO cs.AI

    SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments

    Authors: Abhinav Rajvanshi, Karan Sikka, Xiao Lin, Bhoram Lee, Han-Pang Chiu, Alvaro Velasquez

    Abstract: Semantic reasoning and dynamic planning capabilities are crucial for an autonomous agent to perform complex navigation tasks in unknown environments. It requires a large amount of common-sense knowledge, that humans possess, to succeed in these tasks. We present SayNav, a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigatio…

    Submitted 3 April, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

  7. arXiv:2308.03906  [pdf, other]

    cs.CV

    TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models

    Authors: Indranil Sur, Karan Sikka, Matthew Walmer, Kaushik Koneripalli, Anirban Roy, Xiao Lin, Ajay Divakaran, Susmit Jha

    Abstract: We present a Multimodal Backdoor Defense technique TIJO (Trigger Inversion using Joint Optimization). Recent work arXiv:2112.07668 has demonstrated successful backdoor attacks on multimodal models for the Visual Question Answering task. Their dual-key backdoor trigger is split across two modalities (image and text), such that the backdoor is activated if and only if the trigger is present in both…

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: Published as conference paper at ICCV 2023. 13 pages, 6 figures, 7 tables

  8. Predicting Information Pathways Across Online Communities

    Authors: Yiqiao Jin, Yeon-Chang Lee, Kartik Sharma, Meng Ye, Karan Sikka, Ajay Divakaran, Srijan Kumar

    Abstract: The problem of community-level information pathway prediction (CLIPP) aims at predicting the transmission trajectory of content across online communities. A successful solution to CLIPP holds significance as it facilitates the distribution of valuable information to a larger audience and prevents the proliferation of misinformation. Notably, solving CLIPP is non-trivial as inter-community relation…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'23)

    ACM Class: J.4

  9. arXiv:2302.09618  [pdf, other]

    cs.CL

    Multilingual Content Moderation: A Case Study on Reddit

    Authors: Meng Ye, Karan Sikka, Katherine Atwell, Sabit Hassan, Ajay Divakaran, Malihe Alikhani

    Abstract: Content moderation is the process of flagging content based on pre-defined platform rules. There has been a growing need for AI moderators to safeguard users as well as protect the mental health of human moderators from traumatic content. While prior works have focused on identifying hateful/offensive language, they are not adequate for meeting the challenges of content moderation since 1) moderat…

    Submitted 19 February, 2023; originally announced February 2023.

  10. arXiv:2112.07668  [pdf, other]

    cs.CV cs.CL

    Dual-Key Multimodal Backdoors for Visual Question Answering

    Authors: Matthew Walmer, Karan Sikka, Indranil Sur, Abhinav Shrivastava, Susmit Jha

    Abstract: The success of deep learning has enabled advances in multimodal tasks that require non-trivial fusion of multiple input domains. Although multimodal models have shown potential in many problems, their increased complexity makes them more vulnerable to attacks. A Backdoor (or Trojan) attack is a class of security vulnerability wherein an attacker embeds a malicious secret behavior into a network (e…

    Submitted 18 April, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

    Comments: Published as conference paper at CVPR 2022. 22 pages, 11 figures, 12 tables

  11. arXiv:2110.11899  [pdf, other]

    cs.CV cs.CL

    Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To Benchmark

    Authors: Pritish Sahu, Karan Sikka, Ajay Divakaran

    Abstract: We focus on Multimodal Machine Reading Comprehension (M3C) where a model is expected to answer questions based on a given passage (or context), and the context and the questions can be in different modalities. Previous works such as RecipeQA have proposed datasets and cloze-style tasks for evaluation. However, we identify three critical biases stemming from the question-answer generation process and…

    Submitted 22 October, 2021; originally announced October 2021.

  12. arXiv:2104.10139  [pdf, other]

    cs.CL

    Towards Solving Multimodal Comprehension

    Authors: Pritish Sahu, Karan Sikka, Ajay Divakaran

    Abstract: This paper targets the problem of procedural multimodal machine comprehension (M3C). This task requires an AI to comprehend given steps of multimodal instructions and then answer questions. Compared to vanilla machine comprehension tasks where an AI is required only to understand a textual input, procedural M3C is more challenging as the AI needs to comprehend both the temporal and causal factors…

    Submitted 20 April, 2021; originally announced April 2021.

  13. arXiv:2103.15918  [pdf, other]

    cs.CR cs.CV stat.ML

    MISA: Online Defense of Trojaned Models using Misattributions

    Authors: Panagiota Kiourti, Wenchao Li, Anirban Roy, Karan Sikka, Susmit Jha

    Abstract: Recent studies have shown that neural networks are vulnerable to Trojan attacks, where a network is trained to respond to specially crafted trigger patterns in the inputs in specific and potentially malicious ways. This paper proposes MISA, a new online approach to detect Trojan triggers for neural networks at inference time. Our approach is based on a novel notion called misattributions, which ca…

    Submitted 23 September, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

  14. arXiv:2012.02275  [pdf, other]

    cs.LG cs.AI cs.CV cs.IT

    Detecting Trojaned DNNs Using Counterfactual Attributions

    Authors: Karan Sikka, Indranil Sur, Susmit Jha, Anirban Roy, Ajay Divakaran

    Abstract: We target the problem of detecting Trojans or backdoors in DNNs. Such models behave normally with typical inputs but produce specific incorrect predictions for inputs poisoned with a Trojan trigger. Our approach is based on a novel observation that the trigger behavior depends on a few ghost neurons that activate on trigger pattern and exhibit abnormally higher relative attribution for wrong decis…

    Submitted 3 December, 2020; originally announced December 2020.

  15. arXiv:2011.10889  [pdf, other]

    cs.CV

    Zero-Shot Learning with Knowledge Enhanced Visual Semantic Embeddings

    Authors: Karan Sikka, Jihua Huang, Andrew Silberfarb, Prateeth Nayak, Luke Rohrer, Pritish Sahu, John Byrnes, Ajay Divakaran, Richard Rohwer

    Abstract: We improve zero-shot learning (ZSL) by incorporating common-sense knowledge in DNNs. We propose Common-Sense based Neuro-Symbolic Loss (CSNL) that formulates prior knowledge as novel neuro-symbolic loss functions that regularize visual-semantic embedding. CSNL forces visual features in the VSE to obey common-sense rules relating to hypernyms and attributes. We introduce two key novelties for impro…

    Submitted 21 November, 2020; originally announced November 2020.

  16. RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization

    Authors: Niluthpol Chowdhury Mithun, Karan Sikka, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar

    Abstract: We study an important, yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing o…

    Submitted 11 September, 2020; originally announced September 2020.

    Comments: ACM Multimedia 2020

  17. arXiv:2003.07344  [pdf, other]

    cs.CV cs.AI

    Deep Adaptive Semantic Logic (DASL): Compiling Declarative Knowledge into Deep Neural Networks

    Authors: Karan Sikka, Andrew Silberfarb, John Byrnes, Indranil Sur, Ed Chow, Ajay Divakaran, Richard Rohwer

    Abstract: We introduce Deep Adaptive Semantic Logic (DASL), a novel framework for automating the generation of deep neural networks that incorporates user-provided formal knowledge to improve learning from data. We provide formal semantics that demonstrate that our knowledge representation captures all of first order logic and that finite sampling from infinite domains converges to correct truth values. DAS…

    Submitted 16 March, 2020; originally announced March 2020.

  18. arXiv:1909.04696  [pdf, other]

    cs.CV cs.AI

    Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation

    Authors: Arijit Ray, Karan Sikka, Ajay Divakaran, Stefan Lee, Giedrius Burachas

    Abstract: While models for Visual Question Answering (VQA) have steadily improved over the years, interacting with one quickly reveals that these models lack consistency. For instance, if a model answers "red" to "What color is the balloon?", it might answer "no" if asked, "Is the balloon red?". These responses violate simple notions of entailment and raise questions about how effectively VQA models ground…

    Submitted 10 September, 2019; originally announced September 2019.

    Comments: 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019)

  19. arXiv:1907.06167  [pdf, other]

    cs.CV

    FoodX-251: A Dataset for Fine-grained Food Classification

    Authors: Parneet Kaur, Karan Sikka, Weijun Wang, Serge Belongie, Ajay Divakaran

    Abstract: Food classification is a challenging problem due to the large number of categories, high visual similarity between different foods, as well as the lack of datasets for training state-of-the-art deep models. Solving this problem will require advances in both computer vision models as well as datasets for evaluating these models. In this paper we focus on the second aspect and introduce FoodX-251, a…

    Submitted 14 July, 2019; originally announced July 2019.

    Comments: Published at Fine-Grained Visual Categorization Workshop, CVPR19

  20. arXiv:1905.07075  [pdf, other]

    cs.IR cs.CL cs.CV cs.SI

    Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks

    Authors: Karan Sikka, Lucas Van Bramer, Ajay Divakaran

    Abstract: There has been an explosion of multimodal content generated on social media networks in the last few years, which has necessitated a deeper understanding of social media content and user behavior. We present a novel content-independent content-user-reaction model for social multimedia content analysis. Compared to prior works that generally tackle semantic content understanding and user behavior m…

    Submitted 10 June, 2019; v1 submitted 16 May, 2019; originally announced May 2019.

    Comments: Preprint submitted to IJCV

  21. arXiv:1904.09073  [pdf, other]

    cs.CV

    Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

    Authors: Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, Ajay Divakaran

    Abstract: Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might evoke an ironic contrast with the image, so neither caption nor image is a mere transcript of the other. Instead they combine -- via what has been called meaning multiplication -- to create a new meaning that has a more complex relation to…

    Submitted 7 November, 2019; v1 submitted 19 April, 2019; originally announced April 2019.

    Comments: Accepted at EMNLP'2019; Added dataset link

  22. arXiv:1903.11649  [pdf, other]

    cs.CV

    Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

    Authors: Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran

    Abstract: We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a 'downstream' task to guide the process of phrase localization. Our method, as a first step, infers the latent correspondences between regions-of-interest (RoIs) and phrases in the caption and creates a discri…

    Submitted 15 October, 2019; v1 submitted 27 March, 2019; originally announced March 2019.

    Comments: v2 contains phrase localization results on Flickr30k Entities. Accepted for publication at ICCV 2019

  23. arXiv:1812.03402  [pdf, other]

    cs.CV

    Semantically-Aware Attentive Neural Embeddings for Image-based Visual Localization

    Authors: Zachary Seymour, Karan Sikka, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar

    Abstract: We present an approach that combines appearance and semantic information for 2D image-based localization (2D-VL) across large perceptual changes and time lags. Compared to appearance features, the semantic layout of a scene is generally more invariant to appearance variations. We use this intuition and propose a novel end-to-end deep attention-based framework that utilizes multimodal cues to gener…

    Submitted 2 July, 2019; v1 submitted 8 December, 2018; originally announced December 2018.

    Comments: Appearing in BMVC 2019

  24. arXiv:1807.01448  [pdf, other]

    cs.CV

    Understanding Visual Ads by Aligning Symbols and Objects using Co-Attention

    Authors: Karuna Ahuja, Karan Sikka, Anirban Roy, Ajay Divakaran

    Abstract: We tackle the problem of understanding visual ads where given an ad image, our goal is to rank appropriate human generated statements describing the purpose of the ad. This problem is generally addressed by jointly embedding images and candidate statements to establish correspondence. Decoding a visual ad requires inference of both semantic and symbolic nuances referenced in an image and prior met…

    Submitted 4 July, 2018; originally announced July 2018.

    Comments: Accepted at CVPR 2018 workshop: Towards Automatic Understanding of Visual Advertisements

  25. arXiv:1804.04340  [pdf, other]

    cs.CV

    Zero-Shot Object Detection

    Authors: Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, Ajay Divakaran

    Abstract: We introduce and tackle the problem of zero-shot object detection (ZSD), which aims to detect object classes which are not observed during training. We work with a challenging set of object classes, not restricting ourselves to similar and/or fine-grained categories as in prior works on zero-shot classification. We present a principled approach by first adapting visual-semantic embeddings for ZSD.…

    Submitted 27 July, 2018; v1 submitted 12 April, 2018; originally announced April 2018.

    Comments: 17 pages. ECCV 2018

  26. arXiv:1712.08730  [pdf, other]

    cs.CV

    Combining Weakly and Webly Supervised Learning for Classifying Food Images

    Authors: Parneet Kaur, Karan Sikka, Ajay Divakaran

    Abstract: Food classification from images is a fine-grained classification problem. Manual curation of food images is prohibitive in cost, time, and scalability. On the other hand, web data is available freely but contains noise. In this paper, we address the problem of classifying food images with minimal data curation. We also tackle a key problem with food images from the web where they often have multiple…

    Submitted 23 December, 2017; originally announced December 2017.

  27. arXiv:1611.08240  [pdf, other]

    cs.CV

    AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos

    Authors: Amlan Kar, Nishant Rai, Karan Sikka, Gaurav Sharma

    Abstract: We propose a novel method for temporally pooling frames in a video for the task of human action recognition. The method is motivated by the observation that there are only a small number of frames which, together, contain sufficient information to discriminate an action class present in a video, from the rest. The proposed method learns to pool such discriminative and informative frames, while dis…

    Submitted 25 June, 2017; v1 submitted 24 November, 2016; originally announced November 2016.

    Comments: CVPR 2017 Camera Ready Version

  28. arXiv:1608.02318  [pdf, other]

    cs.CV

    Discriminatively Trained Latent Ordinal Model for Video Classification

    Authors: Karan Sikka, Gaurav Sharma

    Abstract: We study the problem of video classification for facial analysis and human action recognition. We propose a novel weakly supervised learning method that models the video as a sequence of automatically mined, discriminative sub-events (e.g., onset and offset phase for "smile", running and jumping for "highjump"). The proposed model is inspired by the recent works on Multiple Instance Learning and lat…

    Submitted 14 August, 2017; v1 submitted 8 August, 2016; originally announced August 2016.

    Comments: Paper accepted in IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1604.01500

  29. arXiv:1604.01500  [pdf, other]

    cs.CV

    LOMo: Latent Ordinal Model for Facial Analysis in Videos

    Authors: Karan Sikka, Gaurav Sharma, Marian Bartlett

    Abstract: We study the problem of facial analysis in videos. We propose a novel weakly supervised learning method that models the video event (expression, pain, etc.) as a sequence of automatically mined, discriminative sub-events (e.g., onset and offset phase for smile, brow lower and cheek raise for pain). The proposed model is inspired by the recent works on Multiple Instance Learning and latent SVM/HCRF- i…

    Submitted 6 April, 2016; originally announced April 2016.

    Comments: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  30. arXiv:1512.05484  [pdf, other]

    cs.AI

    Deep Active Object Recognition by Joint Label and Action Prediction

    Authors: Mohsen Malmir, Karan Sikka, Deborah Forster, Ian Fasel, Javier R. Movellan, Garrison W. Cottrell

    Abstract: An active object recognition system has the advantage of being able to act in the environment to capture images that are more suited for training and that lead to better performance at test time. In this paper, we propose a deep convolutional neural network for active object recognition that simultaneously predicts the object label, and selects the next action to perform on the object with the aim…

    Submitted 17 December, 2015; originally announced December 2015.

  31. arXiv:1310.6654  [pdf]

    cs.CV

    Pseudo vs. True Defect Classification in Printed Circuits Boards using Wavelet Features

    Authors: Sahil Sikka, Karan Sikka, M. K. Bhuyan, Yuji Iwahori

    Abstract: In recent years, Printed Circuit Boards (PCB) have become the backbone of a large number of consumer electronic devices leading to a surge in their production. This has made it imperative to employ automatic inspection systems to identify manufacturing defects in PCB before they are installed in the respective systems. An important task in this regard is the classification of defects as either tru…

    Submitted 24 October, 2013; originally announced October 2013.

    Comments: 6 pages, 8 figures