What metrics of participation balance predict outcomes of collaborative learning with a robot?

Authors: Yuya Asano, Diane Litman, Quentin King-Shepard, Tristan Maidment, Tyree Langley, Teresa Davison, Timothy Nokes-Malach, Adriana Kovashka, Erin Walker

Abstract: One of the keys to the success of collaborative learning is balanced participation by all learners, but this does not always happen naturally. Pedagogical robots have the potential to facilitate balance. However, it remains unclear what participation balance robots should aim at; various metrics have been proposed, but it is still an open question whether we should balance human participation in human-human interactions (HHI) or human-robot interactions (HRI) and whether we should consider robots' participation in collaborative learning involving multiple humans and a robot. This paper examines collaborative learning between a pair of students and a teachable robot that acts as a peer tutee to answer the aforementioned question. Through an exploratory study, we hypothesize which balance metrics in the literature and which portions of dialogues (including vs. excluding robots' participation and human participation in HHI vs. HRI) will better predict learning as a group. We test the hypotheses with another study and replicate them with automatically obtained units of participation to simulate the information available to robots when they adaptively fix imbalances in real-time. Finally, we discuss recommendations on which metrics learning science researchers should choose when trying to understand how to facilitate collaboration.

Submitted 17 May, 2024; originally announced May 2024.

Comments: To appear in Seventeenth International Conference on Educational Data Mining (EDM 2024)

arXiv:2401.01482 [pdf, other]

Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition

Authors: Kyle Buettner, Sina Malakouti, Xiang Lorraine Li, Adriana Kovashka

Abstract: Existing object recognition models have been shown to lack robustness in diverse geographical scenarios due to domain shifts in design and context. Class representations need to be adapted to more accurately reflect an object concept under these shifts. In the absence of training data from target geographies, we hypothesize that geographically diverse descriptive knowledge of categories can enhance robustness. For this purpose, we explore the feasibility of probing a large language model for geography-based object knowledge, and we examine the effects of integrating knowledge into zero-shot and learnable soft prompting with CLIP. Within this exploration, we propose geography knowledge regularization to ensure that soft prompts trained on a source set of geographies generalize to an unseen target set. Accuracy gains over prompting baselines on DollarStreet while training only on Europe data are up to +2.8/1.2/1.6 on target data from Africa/Asia/Americas, and +4.6 overall on the hardest classes. Competitive performance is shown vs. few-shot target training, and analysis is provided to direct future study of geographical robustness.

Submitted 29 March, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

Comments: To appear in IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024

arXiv:2309.13525 [pdf, other]

Semi-Supervised Domain Generalization for Object Detection via Language-Guided Feature Alignment

Authors: Sina Malakouti, Adriana Kovashka

Abstract: Existing domain adaptation (DA) and generalization (DG) methods in object detection enforce feature alignment in the visual space but face challenges like object appearance variability and scene complexity, which make it difficult to distinguish between objects and achieve accurate detection. In this paper, we are the first to address the problem of semi-supervised domain generalization by exploring vision-language pre-training and enforcing feature alignment through the language space. We employ a novel Cross-Domain Descriptive Multi-Scale Learning (CDDMSL) aiming to maximize the agreement between descriptions of an image presented with different domain-specific characteristics in the embedding space. CDDMSL significantly outperforms existing methods, achieving 11.7% and 7.5% improvement in DG and DA settings, respectively. Comprehensive analysis and ablation studies confirm the effectiveness of our method, positioning CDDMSL as a promising approach for domain generalization in object detection tasks.

Submitted 23 September, 2023; originally announced September 2023.

Comments: Accepted at BMVC 2023

arXiv:2306.07302 [pdf, other]

Impact of Experiencing Misrecognition by Teachable Agents on Learning and Rapport

Authors: Yuya Asano, Diane Litman, Mingzhi Yu, Nikki Lobczowski, Timothy Nokes-Malach, Adriana Kovashka, Erin Walker

Abstract: While speech-enabled teachable agents have some advantages over typing-based ones, they are vulnerable to errors stemming from misrecognition by automatic speech recognition (ASR). These errors may propagate, resulting in unexpected changes in the flow of conversation. We analyzed how such changes are linked with learning gains and learners' rapport with the agents. Our results show they are not related to learning gains or rapport, regardless of the types of responses the agents should have returned given the correct input from learners without ASR errors. We also discuss the implications for optimal error-recovery policies for teachable agents that can be drawn from these findings.

Submitted 11 June, 2023; originally announced June 2023.

Comments: Accepted to AIED 2023

arXiv:2304.13130 [pdf, other]

doi 10.1145/3591106.3592223

Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining

Authors: Giacomo Nebbia, Adriana Kovashka

Abstract: Named entities are ubiquitous in text that naturally accompanies images, especially in domains such as news or Wikipedia articles. In previous work, named entities have been identified as a likely reason for low performance of image-text retrieval models pretrained on Wikipedia and evaluated on named entities-free benchmark datasets. Because they are rarely mentioned, named entities could be challenging to model. They also represent missed learning opportunities for self-supervised models: the link between named entity and object in the image may be missed by the model, but it would not be if the object were mentioned using a more common term. In this work, we investigate hypernymization as a way to deal with named entities for pretraining grounding-based multi-modal models and for fine-tuning on open-vocabulary detection. We propose two ways to perform hypernymization: (1) a "manual" pipeline relying on a comprehensive ontology of concepts, and (2) a "learned" approach where we train a language model to learn to perform hypernymization. We run experiments on data from Wikipedia and from The New York Times. We report improved pretraining performance on objects of interest following hypernymization, and we show the promise of hypernymization on open-vocabulary detection, specifically on classes not seen during training.

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2303.10937 [pdf, other]

Boosting Weakly Supervised Object Detection using Fusion and Priors from Hallucinated Depth

Authors: Cagri Gungor, Adriana Kovashka

Abstract: Despite recent attention and exploration of depth for various tasks, it is still an unexplored modality for weakly-supervised object detection (WSOD). We propose an amplifier method for enhancing the performance of WSOD by integrating depth information. Our approach can be applied to any WSOD method based on multiple-instance learning, without necessitating additional annotations or inducing large computational expenses. Our proposed method employs a monocular depth estimation technique to obtain hallucinated depth information, which is then incorporated into a Siamese WSOD network using contrastive loss and fusion. By analyzing the relationship between language context and depth, we calculate depth priors to identify the bounding box proposals that may contain an object of interest. These depth priors are then utilized to update the list of pseudo ground-truth boxes, or adjust the confidence of per-box predictions. Our proposed method is evaluated on six datasets (COCO, PASCAL VOC, Conceptual Captions, Clipart1k, Watercolor2k, and Comic2k) by implementing it on top of two state-of-the-art WSOD methods, and we demonstrate a substantial enhancement in performance.

Submitted 8 November, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

arXiv:2303.10093 [pdf, other]

Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection

Authors: Kyle Buettner, Adriana Kovashka

Abstract: Vision-language alignment learned from image-caption pairs has been shown to benefit tasks like object recognition and detection. Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context that should be considered when learning object alignment. It is unclear how methods use this context in learning, as well as whether models succeed when tasks require attribute and object understanding. To address this gap, we conduct extensive analysis of the role of attributes in vision-language models. We specifically measure model sensitivity to the presence and meaning of attribute context, gauging influence on object embeddings through unsupervised phrase grounding and classification via description methods. We further evaluate the utility of attribute context in training for open-vocabulary object detection, fine-grained text-region retrieval, and attribution tasks. Our results show that attribute context can be wasted when learning alignment for detection, attribute meaning is not adequately considered in embeddings, and describing classes by only their attributes is ineffective. A viable strategy that we find to increase benefits from attributes is contrastive training with adjective-based negative captions.

Submitted 6 November, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: Accepted at Winter Conference on Applications of Computer Vision (WACV), 2024

arXiv:2303.09608 [pdf, other]

VEIL: Vetting Extracted Image Labels from In-the-Wild Captions for Weakly-Supervised Object Detection

Authors: Arushi Rai, Adriana Kovashka

Abstract: The use of large-scale vision-language datasets is limited for object detection due to the negative impact of label noise on localization. Prior methods have shown how such large-scale datasets can be used for pretraining, which can provide initial signal for localization, but is insufficient without clean bounding-box data for at least some categories. We propose a technique to "vet" labels extracted from noisy captions, and use them for weakly-supervised object detection (WSOD), without any bounding boxes. We analyze and annotate the types of label noise in captions in our Caption Label Noise dataset, and train a classifier that predicts if an extracted label is actually present in the image or not. Our classifier generalizes across dataset boundaries and across categories. We compare the classifier to nine baselines on five datasets, and demonstrate that it can improve WSOD without label vetting by 30% (31.2 to 40.5 mAP when evaluated on PASCAL VOC). See dataset at: https://github.com/arushirai1/CLaNDataset.

Submitted 10 March, 2024; v1 submitted 16 March, 2023; originally announced March 2023.

Comments: 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL) 2024 camera-ready

arXiv:2303.05546 [pdf, other]

Weakly-Supervised HOI Detection from Interaction Labels Only and Language/Vision-Language Priors

Authors: Mesut Erhan Unal, Adriana Kovashka

Abstract: Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image. Even though the labeling effort required for building HOI detection datasets is inherently more extensive than for many other computer vision tasks, weakly-supervised directions in this area have not been sufficiently explored due to the difficulty of learning human-object interactions with weak supervision, rooted in the combinatorial nature of interactions over the object and predicate space. In this paper, we tackle HOI detection with the weakest supervision setting in the literature, using only image-level interaction labels, with the help of a pretrained vision-language model (VLM) and a large language model (LLM). We first propose an approach to prune non-interacting human and object proposals to increase the quality of positive pairs within the bag, exploiting the grounding capability of the vision-language model. Second, we use a large language model to query which interactions are possible between a human and a given object category, in order to force the model not to put emphasis on unlikely interactions. Lastly, we use an auxiliary weakly-supervised preposition prediction task to make our model explicitly reason about space. Extensive experiments and ablations show that all of our contributions increase HOI detection performance.

Submitted 9 March, 2023; originally announced March 2023.

Comments: 8 pages, 3 figures and 5 tables

arXiv:2212.04613 [pdf, other]

Contrastive View Design Strategies to Enhance Robustness to Domain Shifts in Downstream Object Detection

Authors: Kyle Buettner, Adriana Kovashka

Abstract: Contrastive learning has emerged as a competitive pretraining method for object detection. Despite this progress, there has been minimal investigation into the robustness of contrastively pretrained detectors when faced with domain shifts. To address this gap, we conduct an empirical study of contrastive learning and out-of-domain object detection, studying how contrastive view design affects robustness. In particular, we perform a case study of the detection-focused pretext task Instance Localization (InsLoc) and propose strategies to augment views and enhance robustness in appearance-shifted and context-shifted scenarios. Amongst these strategies, we propose changes to cropping such as altering the percentage used, adding IoU constraints, and integrating saliency based object priors. We also explore the addition of shortcut-reducing augmentations such as Poisson blending, texture flattening, and elastic deformation. We benchmark these strategies on abstract, weather, and context domain shifts and illustrate robust ways to combine them, in both pretraining on single-object and multi-object image datasets. Overall, our results and insights show how to ensure robustness through the choice of views in contrastive learning.

Submitted 8 December, 2022; originally announced December 2022.

Comments: To appear, 2nd International Workshop on Practical Deep Learning in the Wild at AAAI Conference on Artificial Intelligence 2023

arXiv:2209.11842 [pdf, other]

Comparison of Lexical Alignment with a Teachable Robot in Human-Robot and Human-Human-Robot Interactions

Authors: Yuya Asano, Diane Litman, Mingzhi Yu, Nikki Lobczowski, Timothy Nokes-Malach, Adriana Kovashka, Erin Walker

Abstract: Speakers build rapport in the process of aligning conversational behaviors with each other. Rapport engendered with a teachable agent while instructing domain material has been shown to promote learning. Past work on lexical alignment in the field of education suffers from limitations in both the measures used to quantify alignment and the types of interactions in which alignment with agents has been studied. In this paper, we apply alignment measures based on a data-driven notion of shared expressions (possibly composed of multiple words) and compare alignment in one-on-one human-robot (H-R) interactions with the H-R portions of collaborative human-human-robot (H-H-R) interactions. We find that students in the H-R setting align with a teachable robot more than in the H-H-R setting and that the relationship between lexical alignment and rapport is more complex than what is predicted by previous theoretical and empirical work.

Submitted 23 September, 2022; originally announced September 2022.

Comments: To be published in SIGDial 2022

arXiv:2206.04863 [pdf, other]

Symbolic image detection using scene and knowledge graphs

Authors: Nasrin Kalanat, Adriana Kovashka

Abstract: Sometimes the meaning conveyed by images goes beyond the list of objects they contain; instead, images may express a powerful message to affect the viewers' minds. Inferring this message requires reasoning about the relationships between the objects, and general common-sense knowledge about the components. In this paper, we use a scene graph, a graph representation of an image, to capture visual components. In addition, we generate a knowledge graph using facts extracted from ConceptNet to reason about objects and attributes. To detect the symbols, we propose a neural network framework named SKG-Sym. The framework first generates the representations of the scene graph of the image and its knowledge graph using a Graph Convolution Network. The framework then fuses the representations and uses an MLP to classify them. We extend the network further to use an attention mechanism which learns the importance of the graph representations. We evaluate our methods on a dataset of advertisements, and compare them with baseline symbolism classification methods (ResNet and VGG). Results show that our methods outperform ResNet in terms of F-score and the attention-based mechanism is competitive with VGG while it has much lower model complexity.

Submitted 10 June, 2022; originally announced June 2022.

arXiv:2205.05895 [pdf, other]

Weakly-Supervised Action Detection Guided by Audio Narration

Authors: Keren Ye, Adriana Kovashka

Abstract: Videos are more well-organized curated data sources for visual concept learning than images. Unlike the 2-dimensional images which only involve the spatial information, the additional temporal dimension bridges and synchronizes multiple modalities. However, in most video detection benchmarks, these additional modalities are not fully utilized. For example, EPIC Kitchens is the largest dataset in first-person (egocentric) vision, yet it still relies on crowdsourced information to refine the action boundaries to provide instance-level action annotations. We explored how to eliminate the expensive annotations in video detection data which provide refined boundaries. We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound. Our model learns to attend to the frames related to the narration label while suppressing the irrelevant frames from being used. Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.

Submitted 12 May, 2022; originally announced May 2022.

Comments: To appear, in Joint 1st Ego4D and 10th EPIC Workshop, held in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

arXiv:2112.13910 [pdf, other]

doi 10.1145/3487553.3524647

Visual Persuasion in COVID-19 Social Media Content: A Multi-Modal Characterization

Authors: Mesut Erhan Unal, Adriana Kovashka, Wen-Ting Chung, Yu-Ru Lin

Abstract: Social media content routinely incorporates multi-modal design to convey information and shape meanings, and sway interpretations toward desirable implications, but the choices and outcomes of using both texts and visual images have not been sufficiently studied. This work proposes a computational approach to analyze the outcome of persuasive information in multi-modal content, focusing on two aspects, popularity and reliability, in COVID-19-related news articles shared on Twitter. The two aspects are intertwined in the spread of misinformation: for example, an unreliable article that aims to misinform has to attain some popularity. This work has several contributions. First, we propose a multi-modal (image and text) approach to effectively identify popularity and reliability of information sources simultaneously. Second, we identify textual and visual elements that are predictive of information popularity and reliability. Third, by modeling cross-modal relations and similarity, we are able to uncover how unreliable articles construct multi-modal meaning in a distorted, biased fashion. Our work demonstrates how to use multi-modal analysis for understanding influential content and has implications for social media literacy and engagement.

Submitted 4 December, 2021; originally announced December 2021.

Comments: 10 pages

arXiv:2109.09532 [pdf, other]

Characterizing User Susceptibility to COVID-19 Misinformation on Twitter

Authors: Xian Teng, Yu-Ru Lin, Wen-Ting Chung, Ang Li, Adriana Kovashka

Abstract: Though significant efforts such as removing false claims and promoting reliable sources have been increased to combat COVID-19 "misinfodemic", it remains an unsolved societal challenge if lacking a proper understanding of susceptible online users, i.e., those who are likely to be attracted by, believe and spread misinformation. This study attempts to answer who constitutes the population vulnerable to the online misinformation in the pandemic, and what are the robust features and short-term behavior signals that distinguish susceptible users from others. Using a 6-month longitudinal user panel on Twitter collected from a geopolitically diverse network-stratified sample in the US, we distinguish different types of users, ranging from social bots to humans with various levels of engagement with COVID-related misinformation. We then identify users' online features and situational predictors that correlate with their susceptibility to COVID-19 misinformation. This work brings unique contributions: First, contrary to the prior studies on bot influence, our analysis shows that social bots' contribution to misinformation sharing was surprisingly low, and human-like users' misinformation behaviors exhibit heterogeneity and temporal variability. While the sharing of misinformation was highly concentrated, the risk of occasionally sharing misinformation for average users remained alarmingly high. Second, our findings highlight the political sensitivity activeness and responsiveness to emotionally-charged content among susceptible users. Third, we demonstrate a feasible solution to efficiently predict users' transient susceptibility solely based on their short-term news consumption and exposure from their networks. Our work has an implication in designing effective intervention mechanism to mitigate the misinformation dissipation.

Submitted 20 September, 2021; originally announced September 2021.

Comments: Accepted into ICWSM 2022, 9 figures (main text)

arXiv:2106.13122 [pdf, other]

Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers

Authors: Katelyn Morrison, Benjamin Gilby, Colton Lipchak, Adam Mattioli, Adriana Kovashka

Abstract: Recently, vision transformers and MLP-based models have been developed in order to address some of the prevalent weaknesses in convolutional neural networks. Due to the novelty of transformers being used in this domain along with the self-attention mechanism, it remains unclear to what degree these architectures are robust to corruptions. Despite some works proposing that data augmentation remains essential for a model to be robust against corruptions, we propose to explore the impact that the architecture has on corruption robustness. We find that vision transformer architectures are inherently more robust to corruptions than the ResNet-50 and MLP-Mixers. We also find that vision transformers with 5 times fewer parameters than a ResNet-50 have more shape bias. Our code is available to reproduce.

Submitted 3 July, 2021; v1 submitted 24 June, 2021; originally announced June 2021.

Comments: Under review at the Uncertainty and Robustness in Deep Learning workshop at ICML 2021. Our appendix is attached to the last page of the paper

arXiv:2105.13994 [pdf, other]

Linguistic Structures as Weak Supervision for Visual Scene Graph Generation

Authors: Keren Ye, Adriana Kovashka

Abstract: Prior work in scene graph generation requires categorical supervision at the level of triplets - subjects and objects, and predicates that relate them, either with or without bounding box information. However, scene graph generation is a holistic task: thus holistic, contextual supervision should intuitively improve performance. In this work, we explore how linguistic structures in captions can benefit scene graph generation. Our method captures the information provided in captions about relations between individual triplets, and context for subjects and objects (e.g. visual properties are mentioned). Captions are a weaker type of supervision than triplets since the alignment between the exhaustive list of human-annotated subjects and objects in triplets, and the nouns in captions, is weak. However, given the large and diverse sources of multimodal data on the web (e.g. blog posts with images and captions), linguistic supervision is more scalable than crowdsourced triplets. We show extensive experimental comparisons against prior methods which leverage instance- and image-level supervision, and ablate our method to show the impact of leveraging phrasal and sequential context, and techniques to improve localization of subjects and objects.

Submitted 28 May, 2021; originally announced May 2021.

Comments: To appear in CVPR 2021

arXiv:2105.03014 [pdf, other]

BasisNet: Two-stage Model Synthesis for Efficient Inference

Authors: Mingda Zhang, Chun-Te Chu, Andrey Zhmoginov, Andrew Howard, Brendan Jou, Yukun Zhu, Li Zhang, Rebecca Hwa, Adriana Kovashka

Abstract: In this work, we present BasisNet which combines recent advancements in efficient neural network architectures, conditional computation, and early termination in a simple new form. Our approach incorporates a lightweight model to preview the input and generate input-dependent combination coefficients, which later controls the synthesis of a more accurate specialist model to make final prediction. The two-stage model synthesis strategy can be applied to any network architectures and both stages are jointly trained. We also show that proper training recipes are critical for increasing generalizability for such high capacity neural networks. On ImageNet classification benchmark, our BasisNet with MobileNets as backbone demonstrated clear advantage on accuracy-efficiency trade-off over several strong baselines. Specifically, BasisNet-MobileNetV3 obtained 80.3% top-1 accuracy with only 290M Multiply-Add operations, halving the computational cost of previous state-of-the-art without sacrificing accuracy. With early termination, the average cost can be further reduced to 198M MAdds while maintaining accuracy of 80.0% on ImageNet.

Submitted 6 May, 2021; originally announced May 2021.

Comments: To appear, 4th Workshop on Efficient Deep Learning for Computer Vision (ECV2021), CVPR2021 Workshop

arXiv:2103.15974 [pdf, other]

Domain-robust VQA with diverse datasets and methods but no target labels

Authors: Mingda Zhang, Tristan Maidment, Ahmad Diab, Adriana Kovashka, Rebecca Hwa

Abstract: The observation that computer vision methods overfit to dataset specifics has inspired diverse attempts to make object recognition models robust to domain shifts. However, similar work on domain-robust visual question answering methods is very limited. Domain adaptation for VQA differs from adaptation for object recognition due to additional complexity: VQA models handle multimodal inputs, methods contain multiple steps with diverse modules resulting in complex optimization, and answer spaces in different datasets are vastly different. To tackle these challenges, we first quantify domain shifts between popular VQA datasets, in both visual and textual space. To disentangle shifts between datasets arising from different modalities, we also construct synthetic shifts in the image and question domains separately. Second, we test the robustness of different families of VQA methods (classic two-stream, transformer, and neuro-symbolic methods) to these shifts. Third, we test the applicability of existing domain adaptation methods and devise a new one to bridge VQA domain gaps, adjusted to specific VQA models. To emulate the setting of real-world generalization, we focus on unsupervised domain adaptation and the open-ended classification task formulation.

Submitted 29 March, 2021; originally announced March 2021.

Comments: To appear in CVPR 2021

arXiv:2101.01260 [pdf, other]

SpotPatch: Parameter-Efficient Transfer Learning for Mobile Object Detection

Authors: Keren Ye, Adriana Kovashka, Mark Sandler, Menglong Zhu, Andrew Howard, Marco Fornoni

Abstract: Deep learning based object detectors are commonly deployed on mobile devices to solve a variety of tasks. For maximum accuracy, each detector is usually trained to solve one single specific task, and comes with a completely independent set of parameters. While this guarantees high performance, it is also highly inefficient, as each model has to be separately downloaded and stored. In this paper we address the question: can task-specific detectors be trained and represented as a shared set of weights, plus a very small set of additional weights for each task? The main contributions of this paper are the following: 1) we perform the first systematic study of parameter-efficient transfer learning techniques for object detection problems; 2) we propose a technique to learn a model patch with a size that is dependent on the difficulty of the task to be learned, and validate our approach on 10 different object detection tasks. Our approach achieves similar accuracy as previously proposed approaches, while being significantly more compact.

Submitted 4 January, 2021; originally announced January 2021.

Comments: Accepted by the ACCV2020 (Oral)

arXiv:2012.01642 [pdf, other]

Learning to Transfer Visual Effects from Videos to Images

Authors: Christopher Thomas, Yale Song, Adriana Kovashka

Abstract: We study the problem of animating images by transferring spatio-temporal visual effects (such as melting) from a collection of videos. We tackle two primary challenges in visual effect transfer: 1) how to capture the effect we wish to distill; and 2) how to ensure that only the effect, rather than content or artistic style, is transferred from the source videos to the input image. To address the first challenge, we evaluate five loss functions; the most promising one encourages the generated animations to have similar optical flow and texture motions as the source videos. To address the second challenge, we only allow our model to move existing image pixels from the previous frame, rather than predicting unconstrained pixel values. This forces any visual effects to occur using the input image's pixels, preventing unwanted artistic style or content from the source video from appearing in the output. We evaluate our method in objective and subjective settings, and show interesting qualitative results which demonstrate objects undergoing atypical transformations, such as making a face melt or a deer bloom.

Submitted 17 December, 2020; v1 submitted 2 December, 2020; originally announced December 2020.

arXiv:2007.08617 [pdf, other]

Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval

Authors: Christopher Thomas, Adriana Kovashka

Abstract: The abundance of multimodal data (e.g. social media posts) has inspired interest in cross-modal retrieval methods. Popular approaches rely on a variety of metric learning losses, which prescribe what the proximity of image and text should be, in the learned space. However, most prior methods have focused on the case where image and text convey redundant information; in contrast, real-world image-text pairs convey complementary information with little overlap. Further, images in news articles and media portray topics in a visually diverse fashion; thus, we need to take special care to ensure a meaningful image representation. We propose novel within-modality losses which encourage semantic coherency in both the text and image subspaces, which does not necessarily align with visual coherency. Our method ensures that not only are paired images and texts close, but the expected image-image and text-text relationships are also observed. Our approach improves the results of cross-modal retrieval on four datasets compared to five baselines.

Submitted 16 July, 2020; originally announced July 2020.

Journal ref: ECCV 2020

arXiv:1911.00147 [pdf, other]

Predicting the Politics of an Image Using Webly Supervised Data

Authors: Christopher Thomas, Adriana Kovashka

Abstract: The news media shape public opinion, and often, the visual bias they contain is evident for human observers. This bias can be inferred from how different media sources portray different subjects or topics. In this paper, we model visual political bias in contemporary media sources at scale, using webly supervised data. We collect a dataset of over one million unique images and associated news articles from left- and right-leaning news sources, and develop a method to predict the image's political leaning. This problem is particularly challenging because of the enormous intra-class visual and semantic diversity of our data. We propose a two-stage method to tackle this problem. In the first stage, the model is forced to learn relevant visual concepts that, when joined with document embeddings computed from articles paired with the images, enable the model to predict bias. In the second stage, we remove the requirement of the text domain and train a visual classifier from the features of the former model. We show this two-stage approach facilitates learning and outperforms several strong baselines. We also present extensive qualitative results demonstrating the nuances of the data.

Submitted 31 October, 2019; originally announced November 2019.

Journal ref: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada

arXiv:1907.10164 [pdf, other]

Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection

Authors: Keren Ye, Mingda Zhang, Adriana Kovashka, Wei Li, Danfeng Qin, Jesse Berent

Abstract: Learning to localize and name object instances is a fundamental problem in vision, but state-of-the-art approaches rely on expensive bounding box supervision. While weakly supervised detection (WSOD) methods relax the need for boxes to that of image-level annotations, even cheaper supervision is naturally available in the form of unstructured textual descriptions that users may freely provide when uploading image content. However, straightforward approaches to using such data for WSOD wastefully discard captions that do not exactly match object names. Instead, we show how to squeeze the most information out of these captions by training a text-only classifier that generalizes beyond dataset boundaries. Our discovery provides an opportunity for learning detection models from noisy but more abundant and freely-available caption data. We also validate our model on three classic object detection benchmarks and achieve state-of-the-art WSOD performance. Our code is available at https://github.com/yekeren/Cap2Det.

Submitted 16 August, 2019; v1 submitted 23 July, 2019; originally announced July 2019.

Comments: To appear in ICCV 2019

arXiv:1901.07366 [pdf, other]

Measuring Effectiveness of Video Advertisements

Authors: James Hahn, Adriana Kovashka

Abstract: Advertisements are unavoidable in modern society. Times Square is notorious for its incessant display of advertisements. Its popularity is worldwide and smaller cities possess miniature versions of the display, such as Pittsburgh and its digital works in Oakland on Forbes Avenue. Tokyo's Ginza district recently rose to popularity due to its upscale shops and constant onslaught of advertisements to pedestrians. Advertisements arise in other mediums as well. For example, they help popular streaming services, such as Spotify, Hulu, and YouTube TV gather significant streams of revenue to reduce the cost of monthly subscriptions for consumers. Ads provide an additional source of money for companies and entire industries to allocate resources toward alternative business motives. They are attractive to companies and nearly unavoidable for consumers. One challenge for advertisers is examining an advertisement's effectiveness or usefulness in conveying a message to their targeted demographics. Rather than constructing a single, static image of content, a video advertisement possesses hundreds of frames of data with varying scenes, actors, objects, and complexity. Therefore, measuring effectiveness of video advertisements is important to impacting a billion-dollar industry. This paper explores the combination of human-annotated features and common video processing techniques to predict effectiveness ratings of advertisements collected from YouTube. This task is seen as a binary (effective vs. non-effective), four-way, and five-way machine learning classification task. The first findings in terms of accuracy and inference on this dataset, as well as some of the first ad research, on a small dataset are presented. Accuracies of 84%, 65%, and 55% are reached on the binary, four-way, and five-way tasks respectively.

Submitted 28 January, 2019; v1 submitted 14 January, 2019; originally announced January 2019.

Comments: 9 pages, 7 figures, 2 tables

arXiv:1812.11139 [pdf, other]

Artistic Object Recognition by Unsupervised Style Adaptation

Authors: Christopher Thomas, Adriana Kovashka

Abstract: Computer vision systems currently lack the ability to reliably recognize artistically rendered objects, especially when such data is limited. In this paper, we propose a method for recognizing objects in artistic modalities (such as paintings, cartoons, or sketches), without requiring any labeled data from those modalities. Our method explicitly accounts for stylistic domain shifts between and within domains. To do so, we introduce a complementary training modality constructed to be similar in artistic style to the target domain, and enforce that the network learns features that are invariant between the two training modalities. We show how such artificial labeled source domains can be generated automatically through the use of style transfer techniques, using diverse target images to represent the style in the target domain. Unlike existing methods which require a large amount of unlabeled target data, our method can work with as few as ten unlabeled images. We evaluate it on a number of cross-domain object and scene classification tasks and on a new dataset we release. Our experiments show that our approach, though conceptually simple, significantly improves the accuracy that existing domain adaptation techniques obtain for artistic object recognition.

Submitted 28 December, 2018; originally announced December 2018.

Journal ref: Asian Conference on Computer Vision 2018 (ACCV)

arXiv:1811.10080 [pdf, other]

Learning to discover and localize visual objects with open vocabulary

Authors: Keren Ye, Mingda Zhang, Wei Li, Danfeng Qin, Adriana Kovashka, Jesse Berent

Abstract: To alleviate the cost of obtaining accurate bounding boxes for training today's state-of-the-art object detection models, recent weakly supervised detection work has proposed techniques to learn from image-level labels. However, requiring discrete image-level labels is both restrictive and suboptimal. Real-world "supervision" usually consists of more unstructured text, such as captions. In this work we learn association maps between images and captions. We then use a novel objectness criterion to rank the resulting candidate boxes, such that high-ranking boxes have strong gradients along all edges. Thus, we can detect objects beyond a fixed object category vocabulary, if those objects are frequent and distinctive enough. We show that our objectness criterion improves the proposed bounding boxes in relation to prior weakly supervised detection methods. Further, we show encouraging results on object detection from image-level captions only.

Submitted 25 November, 2018; originally announced November 2018.

arXiv:1807.11122 [pdf, other]

Story Understanding in Video Advertisements

Authors: Keren Ye, Kyle Buettner, Adriana Kovashka

Abstract: In order to resonate with the viewers, many video advertisements explore creative narrative techniques such as "Freytag's pyramid" where a story begins with exposition, followed by rising action, then climax, concluding with denouement. In the dramatic structure of ads in particular, climax depends on changes in sentiment. We dedicate our study to understand the dynamic structure of video ads automatically. To achieve this, we first crowdsource climax annotations on 1,149 videos from the Video Ads Dataset, which already provides sentiment annotations. We then use both unsupervised and supervised methods to predict the climax. Based on the predicted peak, the low-level visual and audio cues, and semantically meaningful context features, we build a sentiment prediction model that outperforms the current state-of-the-art model of sentiment prediction in video ads by 25%. In our ablation study, we show that using our context features, and modeling dynamics with an LSTM, are both crucial factors for improved performance.

Submitted 29 July, 2018; originally announced July 2018.

Comments: To appear, Proceedings of the British Machine Vision Conference (BMVC)

arXiv:1807.09882 [pdf, other]

Persuasive Faces: Generating Faces in Advertisements

Authors: Christopher Thomas, Adriana Kovashka

Abstract: In this paper, we examine the visual variability of objects across different ad categories, i.e. what causes an advertisement to be visually persuasive. We focus on modeling and generating faces which appear to come from different types of ads. For example, if faces in beauty ads tend to be women wearing lipstick, a generative model should portray this distinct visual appearance. Training generative models which capture such category-specific differences is challenging because of the highly diverse appearance of faces in ads and the relatively limited amount of available training data. To address these problems, we propose a conditional variational autoencoder which makes use of predicted semantic attributes and facial expressions as a supervisory signal when training. We show how our model can be used to produce visually distinct faces which appear to be from a fixed ad topic category. Our human studies and quantitative and qualitative experiments confirm that our method greatly outperforms a variety of baselines, including two variations of a state-of-the-art generative adversarial network, for transforming faces to be more ad-category appropriate. Finally, we show preliminary generation results for other types of objects, conditioned on an ad topic.

Submitted 25 July, 2018; originally announced July 2018.

Journal ref: In British Machine Vision Conference (BMVC), Newcastle upon Tyne, UK, September 2018

arXiv:1807.08205 [pdf, other]

Equal But Not The Same: Understanding the Implicit Relationship Between Persuasive Images and Text

Authors: Mingda Zhang, Rebecca Hwa, Adriana Kovashka

Abstract: Images and text in advertisements interact in complex, non-literal ways. The two channels are usually complementary, with each channel telling a different part of the story. Current approaches, such as image captioning methods, only examine literal, redundant relationships, where image and text show exactly the same content. To understand more complex relationships, we first collect a dataset of advertisement interpretations for whether the image and slogan in the same visual advertisement form a parallel (conveying the same message without literally saying the same thing) or non-parallel relationship, with the help of workers recruited on Amazon Mechanical Turk. We develop a variety of features that capture the creativity of images and the specificity or ambiguity of text, as well as methods that analyze the semantics within and across channels. We show that our method outperforms standard image-text alignment approaches on predicting the parallel/non-parallel relationship between image and text.

Submitted 21 July, 2018; originally announced July 2018.

Comments: To appear in BMVC2018

arXiv:1805.03134 [pdf, other]

Image Retrieval with Mixed Initiative and Multimodal Feedback

Authors: Nils Murrugarra-Llerena, Adriana Kovashka

Abstract: How would you search for a unique, fashionable shoe that a friend wore and you want to buy, but you didn't take a picture? Existing approaches propose interactive image search as a promising venue. However, they either entrust the user with taking the initiative to provide informative feedback, or give all control to the system which determines informative questions to ask. Instead, we propose a mixed-initiative framework where both the user and system can be active participants, depending on whose initiative will be more beneficial for obtaining high-quality search results. We develop a reinforcement learning approach which dynamically decides which of three interaction opportunities to give to the user: drawing a sketch, providing free-form attribute feedback, or answering attribute-based questions. By allowing these three options, our system optimizes both the informativeness and exploration capabilities allowing faster image retrieval. We outperform three baselines on three datasets and extensive experimental settings.

Submitted 8 May, 2018; originally announced May 2018.

Comments: In submission to BMVC 2018

arXiv:1711.06666 [pdf, other]

ADVISE: Symbolism and External Knowledge for Decoding Advertisements

Authors: Keren Ye, Adriana Kovashka

Abstract: In order to convey the most content in their limited space, advertisements embed references to outside knowledge via symbolism. For example, a motorcycle stands for adventure (a positive property the ad wants associated with the product being sold), and a gun stands for danger (a negative property to dissuade viewers from undesirable behaviors). We show how to use symbolic references to better understand the meaning of an ad. We further show how anchoring ad understanding in general-purpose object recognition and image captioning improves results. We formulate the ad understanding task as matching the ad image to human-generated statements that describe the action that the ad prompts, and the rationale it provides for taking this action. Our proposed method outperforms the state of the art on this task, and on an alternative formulation of question-answering on ads. We show additional applications of our learned representations for matching ads to slogans, and clustering ads according to their topic, without extra training.

Submitted 29 July, 2018; v1 submitted 17 November, 2017; originally announced November 2017.

Comments: To appear, Proceedings of the European Conference on Computer Vision (ECCV)

arXiv:1707.03067 [pdf, other]

Automatic Understanding of Image and Video Advertisements

Authors: Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, Adriana Kovashka

Abstract: There is more to images than their objective physical content: for example, advertisements are created to persuade a viewer to take a certain action. We propose the novel problem of automatic advertisement understanding. To enable research on this problem, we create two datasets: an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. Our data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it?"), and symbolic references ads make (e.g. a dove symbolizes peace). We also analyze the most common persuasive strategies ads use, and the capabilities that computer vision systems should have to understand these strategies. We present baseline classification results for several prediction tasks, including automatically answering questions about the messages of the ads.

Submitted 10 July, 2017; originally announced July 2017.

Comments: To appear in CVPR 2017; data available on http://cs.pitt.edu/~kovashka/ads

arXiv:1611.02145 [pdf, other]

doi 10.1561/0600000073

Crowdsourcing in Computer Vision

Authors: Adriana Kovashka, Olga Russakovsky, Li Fei-Fei, Kristen Grauman

Abstract: Computer vision systems require large amounts of manually annotated data to properly learn challenging visual concepts. Crowdsourcing platforms offer an inexpensive method to capture human knowledge and understanding, for a vast number of visual perception tasks. In this survey, we describe the types of annotations computer vision researchers have collected using crowdsourcing, and how they have ensured that this data is of high quality while annotation effort is minimized. We begin by discussing data collection on both classic (e.g., object recognition) and recent (e.g., visual story-telling) vision tasks. We then summarize key design decisions for creating effective data collection interfaces and workflows, and present strategies for intelligently selecting the most important data instances to annotate. Finally, we conclude with some thoughts on the future of crowdsourcing in computer vision.

Submitted 7 November, 2016; originally announced November 2016.

Comments: A 69-page meta review of the field, Foundations and Trends in Computer Graphics and Vision, 2016

arXiv:1508.05038 [pdf, other]

Seeing Behind the Camera: Identifying the Authorship of a Photograph

Authors: Christopher Thomas, Adriana Kovashka

Abstract: We introduce the novel problem of identifying the photographer behind a photograph. To explore the feasibility of current computer vision techniques to address this problem, we created a new dataset of over 180,000 images taken by 41 well-known photographers. Using this dataset, we examined the effectiveness of a variety of features (low and high-level, including CNN features) at identifying the photographer. We also trained a new deep convolutional neural network for this task. Our results show that high-level features greatly outperform low-level features. We provide qualitative results using these learned models that give insight into our method's ability to distinguish between photographers, and allow us to draw interesting conclusions about what specific photographers shoot. We also demonstrate two applications of our method.

Submitted 31 May, 2016; v1 submitted 20 August, 2015; originally announced August 2015.

Comments: Dataset downloadable at http://www.cs.pitt.edu/~chris/photographer. To appear in CVPR 2016.

arXiv:1505.04141

doi 10.1007/s11263-015-0814-0

WhittleSearch: Interactive Image Search with Relative Attribute Feedback

Authors: Adriana Kovashka, Devi Parikh, Kristen Grauman

Abstract: We propose a novel mode of feedback for image search, where a user describes which properties of exemplar images should be adjusted in order to more closely match his/her mental model of the image sought. For example, perusing image results for a query "black shoes", the user might state, "Show me shoe images like these, but sportier." Offline, our approach first learns a set of ranking functions, each of which predicts the relative strength of a nameable attribute in an image (e.g., sportiness). At query time, the system presents the user with a set of exemplar images, and the user relates them to his/her target image with comparative statements. Using a series of such constraints in the multi-dimensional attribute space, our method iteratively updates its relevance function and re-ranks the database of images. To determine which exemplar images receive feedback from the user, we present two variants of the approach: one where the feedback is user-initiated and another where the feedback is actively system-initiated. In either case, our approach allows a user to efficiently "whittle away" irrelevant portions of the visual feature space, using semantic language to precisely communicate her preferences to the system. We demonstrate our technique for refining image search for people, products, and scenes, and we show that it outperforms traditional binary relevance feedback in terms of search speed and accuracy. In addition, the ordinal nature of relative attributes helps make our active approach efficient -- both computationally for the machine when selecting the reference images, and for the user by requiring less user interaction than conventional passive and active methods.

Submitted 18 May, 2015; v1 submitted 15 May, 2015; originally announced May 2015.

Comments: Published in the International Journal of Computer Vision (IJCV), April 2015. The final publication is available at Springer via http://dx.doi.org/10.1007/s11263-015-0814-0

Journal ref: International Journal of Computer Vision, 1573-1405 (2015, Springer)

arXiv:1505.04117

doi 10.1007/s11263-014-0798-1

Discovering Attribute Shades of Meaning with the Crowd

Authors: Adriana Kovashka, Kristen Grauman

Abstract: To learn semantic attributes, existing methods typically train one discriminative model for each word in a vocabulary of nameable properties. However, this "one model per word" assumption is problematic: while a word might have a precise linguistic definition, it need not have a precise visual definition. We propose to discover shades of attribute meaning. Given an attribute name, we use crowdsourced image labels to discover the latent factors underlying how different annotators perceive the named concept. We show that structure in those latent factors helps reveal shades, that is, interpretations for the attribute shared by some group of annotators. Using these shades, we train classifiers to capture the primary (often subtle) variants of the attribute. The resulting models are both semantic and visually precise. By catering to users' interpretations, they improve attribute prediction accuracy on novel images. Shades also enable more successful attribute-based image search, by providing robust personalized models for retrieving multi-attribute query results. They are widely applicable to tasks that involve describing visual content, such as zero-shot category learning and organization of photo collections.

Submitted 15 May, 2015; originally announced May 2015.

Comments: Published in the International Journal of Computer Vision (IJCV), January 2015. The final publication is available at Springer via http://dx.doi.org/10.1007/s11263-014-0798-1

Journal ref: International Journal of Computer Vision 1573-1405 (2015, Springer)
