Virtual Personas for Language Models via an Anthology of Backstories

Authors: Suhong Moon, Marwa Abdulhai, Minwoo Kang, Joseph Suh, Widyadewi Soedarmadji, Eran Kohen Behar, David M. Chan

Abstract: Large language models (LLMs) are trained from vast repositories of text authored by millions of distinct authors, reflecting an enormous diversity of human traits. While these models bear the potential to be used as approximations of human subjects in behavioral studies, prior efforts have been limited in steering model responses to match individual human users. In this work, we introduce "Anthology", a method for conditioning LLMs to particular virtual personas by harnessing open-ended life narratives, which we refer to as "backstories." We show that our methodology enhances the consistency and reliability of experimental outcomes while ensuring better representation of diverse sub-populations. Across three nationally representative human surveys conducted as part of Pew Research Center's American Trends Panel (ATP), we demonstrate that Anthology achieves up to 18% improvement in matching the response distributions of human respondents and 27% improvement in consistency metrics. Our code and generated backstories are available at https://github.com/CannyLab/anthology.

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2405.15113 [pdf, other]

A Wearable Resistance Devices Motor Learning Effects in Exercise

Authors: Eugenio Frias-Miranda, Hong-Anh Nguyen, Jeremy Hampton, Trenner Jones, Benjamin Spotts, Matthew Cochran, Deva Chan, Laura H Blumenschein

Abstract: The integration of technology into exercise regimens has emerged as a strategy to enhance normal human capabilities and return human motor function after injury or illness by enhancing motor learning and retention. Much research has focused on how active devices, whether confined to a lab or made into a wearable format, can apply forces at set times and conditions to optimize the process of learning. However, the focus on active force production often forces devices to either be confined to simple movements or interventions. As such, in this paper, we investigate how passive device behaviors can contribute to the process of motor learning by themselves. Our approach involves using a wearable resistance (WR) device, which is outfitted with elastic bands, to apply a force field that changes in response to a person's movements while performing exercises. We develop a method to measure the produced forces from the device without impeding the function and we characterize the device's force generation abilities. We then present a study assessing the impact of the WR device on motor learning of proper squat form compared to visual or no feedback. Biometrics such as knee and hip angles were used to monitor and assess subject performance. Our findings indicate that the force fields produced while training with the WR device can improve performance in full-body exercises similarly to a more direct visual feedback mechanism, though the improvement is not consistent across all performance metrics. Through our research, we contribute important insights into the application of passive wearable resistance technology in practical exercise settings.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 8 pages, 9 figures, To be published in IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob) 2024

arXiv:2405.08272 [pdf, other]

VS-Assistant: Versatile Surgery Assistant on the Demand of Surgeons

Authors: Zhen Chen, Xingjian Luo, Jinlin Wu, Danny T. M. Chan, Zhen Lei, Jinqiao Wang, Sebastien Ourselin, Hongbin Liu

Abstract: The surgical intervention is crucial to patient healthcare, and many studies have developed advanced algorithms to provide understanding and decision-making assistance for surgeons. Despite great progress, these algorithms are developed for a single specific task and scenario, and in practice require the manual combination of different functions, thus limiting the applicability. Thus, an intelligent and versatile surgical assistant is expected to accurately understand the surgeon's intentions and accordingly conduct the specific tasks to support the surgical process. In this work, by leveraging advanced multimodal large language models (MLLMs), we propose a Versatile Surgery Assistant (VS-Assistant) that can accurately understand the surgeon's intention and complete a series of surgical understanding tasks, e.g., surgical scene analysis, surgical instrument detection, and segmentation on demand. Specifically, to achieve superior surgical multimodal understanding, we devise a mixture of projectors (MOP) module to align the surgical MLLM in VS-Assistant to balance the natural and surgical knowledge. Moreover, we devise a surgical Function-Calling Tuning strategy to enable the VS-Assistant to understand surgical intentions, and thus make a series of surgical function calls on demand to meet the needs of the surgeons. Extensive experiments on neurosurgery data confirm that our VS-Assistant can understand the surgeon's intention more accurately than the existing MLLM, resulting in overwhelming performance in textual analysis and visual tasks. Source code and models will be made public.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2404.05696 [pdf]

BOLD v4: A Centralized Bioinformatics Platform for DNA-based Biodiversity Data

Authors: Sujeevan Ratnasingham, Catherine Wei, Dean Chan, Jireh Agda, Josh Agda, Liliana Ballesteros-Mejia, Hamza Ait Boutou, Zak Mohammad El Bastami, Eddie Ma, Ramya Manjunath, Dana Rea, Chris Ho, Angela Telfer, Jaclyn McKeowan, Miduna Rahulan, Claudia Steinke, Justin Dorsheimer, Megan Milton, Paul D. N. Hebert

Abstract: BOLD, the Barcode of Life Data System, supports the acquisition, storage, validation, analysis, and publication of DNA barcodes, activities requiring the integration of molecular, morphological, and distributional data. Its pivotal role in curating the reference library of DNA barcodes, coupled with its data management and analysis capabilities, makes it a central resource for biodiversity science. It enables rapid, accurate identification of specimens and also reveals patterns of genetic diversity and evolutionary relationships among taxa. Launched in 2005, BOLD has become an increasingly powerful tool for advancing understanding of planetary biodiversity. It currently hosts 17 million specimen records and 14 million barcodes that provide coverage for more than a million species from every continent and ocean. The platform has the long-term goal of providing a consistent, accurate system for identifying all species of eukaryotes. BOLD's integrated analytical tools, full data lifecycle support, and secure collaboration framework distinguish it from other biodiversity platforms. BOLD v4 brought enhanced data management and analysis capabilities as well as novel functionality for data dissemination and publication. Its next version will include features to strengthen its utility to the research community, governments, industry, and society-at-large.

Submitted 5 May, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.02904 [pdf, other]

ALOHa: A New Measure for Hallucination in Captioning Models

Authors: Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell

Abstract: Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations. Specifically, we use an LLM to extract groundable objects from a candidate caption, measure their semantic similarity to reference objects from captions and object detections, and use Hungarian matching to produce a final hallucination score. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations, and 30.8% more on nocaps, where objects extend beyond MS COCO categories. Our code is available at https://davidmchan.github.io/aloha/.
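
Editor's illustration (not part of the listing, and not the authors' implementation): the matching step named in the abstract can be sketched as a one-to-one assignment between candidate objects and reference objects that maximizes total similarity. Everything below is an assumption made for illustration: the similarity values are invented, the function names are hypothetical, and the assignment is found by brute force over permutations rather than by the Hungarian algorithm, which reaches the same optimum but scales to larger inputs. Treating the lowest matched similarity as the most suspect object is likewise an assumption.

```python
from itertools import permutations


def best_assignment(sim):
    """Return, for each candidate row, the reference column it is matched to.

    Brute force over all one-to-one assignments. This finds the same optimum
    the Hungarian algorithm would, but only scales to a handful of objects.
    Assumes len(sim) <= len(sim[0]).
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best_cols, best_total = None, float("-inf")
    for cols in permutations(range(n_cols), n_rows):
        total = sum(sim[r][c] for r, c in enumerate(cols))
        if total > best_total:
            best_cols, best_total = cols, total
    return list(best_cols)


def matched_similarities(sim):
    """Similarity of each candidate object to its assigned reference object."""
    cols = best_assignment(sim)
    return [sim[r][c] for r, c in enumerate(cols)]


# Invented similarity matrix (illustrative values only).
# Rows: objects extracted from a candidate caption, e.g. dog, frisbee, car.
# Columns: reference objects, e.g. puppy, disc, grass.
sim = [
    [0.9, 0.1, 0.2],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.3],
]
scores = matched_similarities(sim)
lowest = min(scores)  # the weakest match flags the most suspect object
```

In this toy matrix the third candidate object has no good counterpart among the references, so its matched similarity is the lowest of the three.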

Submitted 3 April, 2024; originally announced April 2024.

Comments: To appear at NAACL 2024

arXiv:2403.19822 [pdf, other]

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Authors: Yash Jain, David Chan, Pranav Dheram, Aparna Khare, Olabanji Shonibare, Venkatesh Ravichandran, Shalini Ghosh

Abstract: Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.
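
Editor's illustration (not part of the listing): several abstracts on this page report relative rather than absolute WER improvements. A minimal sketch of that arithmetic follows, assuming the usual definition (baseline minus new, divided by baseline). The function name and the WER values are hypothetical and are not taken from any paper listed here.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER improvement as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a drop from 8.0% to 6.0% WER is 2.0 points absolute,
# but 25% relative to the baseline.
rel = relative_improvement(8.0, 6.0)
```

The distinction matters when comparing papers: the same absolute gain is a larger relative gain when the baseline WER is already low.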

Submitted 28 March, 2024; originally announced March 2024.

Comments: Accepted in LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

arXiv:2402.11590 [pdf, other]

Designing interactive data visualizations representing recovery progress for patients after stroke

Authors: Alicia Ouskine, Adrian D. C. Chan, Fateme Rajabiyazdi

Abstract: Stroke is one of the leading causes of disability worldwide. The efficacy of recovery is determined by a variety of factors, including patient adherence to rehabilitation programs. One way to increase patient adherence to their rehabilitation program is to show patients their progress, visualized in a simple and intuitive way. We begin to gather preliminary information on Functional Capacity, Motor Function, and Mood/Cognition from occupational therapists at the Bruyere Hospital to gain a better understanding of how stroke recovery data is collected within in-patient stroke rehabilitation centers. The future aim is to design, develop, and evaluate a data visualization tool representing progress made by patients recovering from stroke.

Submitted 18 February, 2024; originally announced February 2024.

Comments: 2 pages

arXiv:2402.09679 [pdf, other]

Design and Visual Servoing Control of a Hybrid Dual-Segment Flexible Neurosurgical Robot for Intraventricular Biopsy

Authors: Jian Chen, Mingcong Chen, Qingxiang Zhao, Shuai Wang, Yihe Wang, Ying Xiao, Jian Hu, Danny Tat Ming Chan, Kam Tong Leo Yeung, David Yuen Chung Chan, Hongbin Liu

Abstract: Traditional rigid endoscopes have challenges in flexibly treating tumors located deep in the brain, and low operability and fixed viewing angles limit their development. This study introduces a novel dual-segment flexible robotic endoscope MicroNeuro, designed to perform biopsies with dexterous surgical manipulation deep in the brain. Taking into account the uncertainty of the control model, an image-based visual servoing with online robot Jacobian estimation has been implemented to enhance motion accuracy. Furthermore, the application of model predictive control with constraints significantly bolsters the flexible robot's ability to adaptively track mobile objects and resist external interference. Experimental results underscore that the proposed control system enhances motion stability and precision. Phantom testing substantiates its considerable potential for deployment in neurosurgery.

Submitted 23 February, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

Comments: Accepted by IEEE International Conference on Robotics and Automation (ICRA) 2024, 7 pages, 9 figures

arXiv:2402.08205 [pdf, other]

TurtleRabbit 2024 SSL Team Description Paper

Authors: Linh Trinh, Alif Anzuman, Eric Batkhuu, Dychen Chan, Lisa Graf, Darpan Gurung, Tharunimm Jamal, Jigme Namgyal, Jason Ng, Wing Lam Tsang, X. Rosalind Wang, Eren Yilmaz, Oliver Obst

Abstract: TurtleRabbit is a new RoboCup SSL team from Western Sydney University. This team description paper presents our approach in navigating some of the challenges in developing a new SSL team from scratch. SSL is dominated by teams with extensive experience and customised equipment that has been developed over many years. Here, we outline our approach in overcoming some of the complexities associated with replicating advanced open-sourced designs and managing the high costs of custom components. Opting for simplicity and cost-effectiveness, our strategy primarily employs off-the-shelf electronics components and "hobby" brushless direct current (BLDC) motors, complemented by 3D printing and CNC milling. This approach helped us to streamline the development process and, with our open-sourced hardware design, hopefully will also lower the bar for other teams to enter RoboCup SSL in the future. The paper details the specific hardware choices, their approximate costs, the integration of electronics and mechanics, and the initial steps taken in software development, for our entry into SSL that aims to be simple yet competitive.

Submitted 12 February, 2024; originally announced February 2024.

Comments: Submitted paper as part of the qualification for RoboCup 2024

arXiv:2401.05314 [pdf, other]

ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video

Authors: Kevin Cai, Chonghua Liu, David M. Chan

Abstract: The Internet's wealth of content, with up to 60% published in English, starkly contrasts the global population, where only 18.8% are English speakers, and just 5.1% consider it their native language, leading to disparities in online information access. Unfortunately, automated processes for dubbing of video - replacing the audio track of a video with a translated alternative - remain a complex and challenging task due to pipelines, necessitating precise timing, facial movement synchronization, and prosody matching. While end-to-end dubbing offers a solution, data scarcity continues to impede the progress of both end-to-end and pipeline-based methods. In this work, we introduce Anim-400K, a comprehensive dataset of over 425K aligned animated video segments in Japanese and English supporting various video-related tasks, including automated dubbing, simultaneous translation, guided video summarization, and genre/theme/style classification. Our dataset is made publicly available for research purposes at https://github.com/davidmchan/Anim400K.

Submitted 10 January, 2024; originally announced January 2024.

Comments: To appear in ICASSP 2024

arXiv:2401.03384 [pdf, other]

conv_einsum: A Framework for Representation and Fast Evaluation of Multilinear Operations in Convolutional Tensorial Neural Networks

Authors: Tahseen Rabbani, Jiahao Su, Xiaoyu Liu, David Chan, Geoffrey Sangston, Furong Huang

Abstract: Modern ConvNets continue to achieve state-of-the-art results over a vast array of vision and image classification tasks, but at the cost of increasing parameters. One strategy for compactifying a network without sacrificing much expressive power is to reshape it into a tensorial neural network (TNN), which is a higher-order tensorization of its layers, followed by a factorization, such as a CP-decomposition, which strips a weight down to its critical basis components. Passes through TNNs can be represented as sequences of multilinear operations (MLOs), where the evaluation path can greatly affect the number of floating point operations (FLOPs) incurred. While functions such as the popular einsum can evaluate simple MLOs such as contractions, existing implementations cannot process multi-way convolutions, resulting in scant assessments of how optimal evaluation paths through tensorized convolutional layers can improve training speed. In this paper, we develop a unifying framework for representing tensorial convolution layers as einsum-like strings and a meta-algorithm conv_einsum which is able to evaluate these strings in a FLOPs-minimizing manner. Comprehensive experiments, using our open-source implementation, over a wide range of models, tensor decompositions, and diverse tasks, demonstrate that conv_einsum significantly increases both computational and memory-efficiency of convolutional TNNs.

Submitted 6 January, 2024; originally announced January 2024.

arXiv:2401.02417 [pdf, other]

Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition

Authors: David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Björn Hoffmeister

Abstract: While word error rates of automatic speech recognition (ASR) systems have consistently fallen, natural language understanding (NLU) applications built on top of ASR systems still attribute significant numbers of failures to low-quality speech recognition results. Existing assistant systems collect large numbers of these unsuccessful interactions, but these systems usually fail to learn from these interactions, even in an offline fashion. In this work, we introduce CLC: Contrastive Learning for Conversations, a family of methods for contrastive fine-tuning of models in a self-supervised fashion, making use of easily detectable artifacts in unsuccessful conversations with assistants. We demonstrate that our CLC family of approaches can improve the performance of ASR models on OD3, a new public large-scale semi-synthetic meta-dataset of audio task-oriented dialogues, by up to 19.2%. These gains transfer to real-world systems as well, where we show that CLC can help to improve performance by up to 6.7% over baselines. We make OD3 publicly available at https://github.com/amazon-science/amazon-od3.

Submitted 4 January, 2024; originally announced January 2024.

Comments: To appear in ICASSP 2024

arXiv:2312.14378 [pdf, other]

Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

Authors: Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu

Abstract: Training large foundation models using self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and scarcity in labeled downstream data. We introduce Multimodal Attention Merging (MAM), an attempt that facilitates direct knowledge transfer from attention matrices of models rooted in high resource modalities, text and images, to those in resource-constrained domains, speech and audio, employing a zero-shot paradigm. MAM reduces the relative Word Error Rate (WER) of an Automatic Speech Recognition (ASR) model by up to 6.70%, and relative classification error of an Audio Event Classification (AEC) model by 10.63%. In cases where some data/compute is available, we present Learnable-MAM, a data-driven approach to merging attention matrices, resulting in a further 2.90% relative reduction in WER for ASR and 18.42% relative reduction in AEC compared to fine-tuning.

Submitted 9 February, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: 5 pages, 1 figure, ICASSP 2024 Workshop on Self-supervision in Audio, Speech and Beyond

arXiv:2312.08366 [pdf, other]

See, Say, and Segment: Teaching LMMs to Overcome False Premises

Authors: Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E. Gonzalez, Trevor Darrell

Abstract: Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say"), a form of catastrophic forgetting. In this work, we propose a cascading and joint training approach for LMMs to solve this task, avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, proposing alternative queries or correcting semantic errors in the query, and finally "segment" by outputting the mask of the desired objects if they exist. Additionally, we introduce a novel False Premise Correction benchmark dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets (which we call FP-RefCOCO(+/g)). The results show that our method not only detects false premises up to 55% better than existing approaches, but under false premise conditions produces relative cIOU improvements of more than 31% over baselines, and produces natural language feedback judged helpful up to 67% of the time.

Submitted 13 December, 2023; originally announced December 2023.

Comments: Project Page: https://see-say-segment.github.io

arXiv:2310.12971 [pdf, other]

CLAIR: Evaluating Image Captions with Large Language Models

Authors: David Chan, Suzanne Petryk, Joseph E. Gonzalez, Trevor Darrell, John Canny

Abstract: The evaluation of machine-generated image captions poses an interesting yet persistent challenge. Effective evaluation measures must consider numerous dimensions of similarity, including semantic relevance, visual structure, object interactions, caption diversity, and specificity. Existing highly-engineered measures attempt to capture specific aspects, but fall short in providing a holistic score that aligns closely with human judgments. Here, we propose CLAIR, a novel method that leverages the zero-shot language modeling capabilities of large language models (LLMs) to evaluate candidate captions. In our evaluations, CLAIR demonstrates a stronger correlation with human judgments of caption quality compared to existing measures. Notably, on Flickr8K-Expert, CLAIR achieves relative correlation improvements over SPICE of 39.6% and over image-augmented methods such as RefCLIP-S of 18.3%. Moreover, CLAIR provides noisily interpretable results by allowing the language model to identify the underlying reasoning behind its assigned score. Code is available at https://davidmchan.github.io/clair/

Submitted 19 October, 2023; originally announced October 2023.

Comments: To Appear at EMNLP 2023

arXiv:2304.02080 [pdf, other]

doi 10.1109/WACVW58289.2023.00043

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Authors: Vladislav Lialin, Stephen Rawls, David Chan, Shalini Ghosh, Anna Rumshisky, Wael Hamza

Abstract: Scaling up weakly-supervised datasets has been shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack of diversity in the data, and the difficulty of collecting aligned data. The currently popular video-text data mining approach via automatic speech recognition (ASR), used in HowTo100M, provides low-quality captions that often do not refer to the video content. Other mining approaches do not provide proper language descriptions (video tags) and are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models that are based on an OPT language model and a TimeSformer visual backbone. We fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDEr on MSR-VTT) than pre-training on a single modality. Our methods are complementary to the existing pre-training or data mining approaches and can be used in a variety of settings. Given the efficacy of the pseudolabeling method, we are planning to publicly release the generated captions.

Submitted 4 April, 2023; originally announced April 2023.

Journal ref: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)

arXiv:2302.01328 [pdf, other]

IC3: Image Captioning by Committee Consensus

Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

Abstract: If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" (most like a reference) image caption. Unfortunately, doing so encourages captions that are "informationally impoverished," and focus on only a subset of the possible details, while ignoring other potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" (IC3), designed to generate a single caption that captures high-level details from several annotator viewpoints. Humans rate captions produced by IC3 at least as helpful as baseline SOTA models more than two thirds of the time, and IC3 can improve the performance of SOTA automated recall systems by up to 84%, outperforming single human-generated reference captions, and indicating significant improvements over SOTA approaches for visual description. Code is available at https://davidmchan.github.io/caption-by-committee/

Submitted 19 October, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

Comments: To Appear at EMNLP 2023

arXiv:2301.02736 [pdf, other]

Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition

Authors: David M. Chan, Shalini Ghosh, Ariya Rastrow, Björn Hoffmeister

Abstract: Despite improvements to the generalization performance of automated speech recognition (ASR) models, specializing ASR models for downstream tasks remains challenging, primarily due to reduced data availability (necessitating increased data collection) and rapidly shifting data distributions (requiring more frequent model fine-tuning). In this work, we investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods, to allow for flexible post-training adaptation to new data distributions. In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR via an approximate k-nearest-neighbor (KNN) based attentive fusion step. Our experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours while providing up to 3% WER improvement compared to a fine-tuning baseline, suggesting a promising approach for adapting production ASR systems in challenging zero and few-shot scenarios.

Submitted 6 January, 2023; originally announced January 2023.

arXiv:2209.07518 [pdf, other]

Distribution Aware Metrics for Conditional Natural Language Generation

Authors: David M Chan, Yiming Ni, David A Ross, Sudheendra Vijayanarasimhan, Austin Myers, John Canny

Abstract: Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground truth data (i.e., dispersion of the distribution of conditional texts) can be ascribed to noise, such as in automated speech recognition, it does not allow for robust evaluation in the case where diversity in the ground truths represents signal for the model. In this work we argue that existing metrics are not appropriate for domains such as visual description or summarization where ground truths are semantically diverse, and where the diversity in those captions captures useful additional information about the context. We propose a novel paradigm for multi-candidate evaluation of conditional language generation models, and a new family of metrics that compare the distributions of reference and model-generated caption sets using small sample sets of each. We demonstrate the utility of our approach with a case study in visual description, where we show that existing models optimize for single-description quality over diversity, and gain some insights into how sampling methods and temperature impact description quality and diversity.

Submitted 29 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

arXiv:2207.08024 [pdf, other]

LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training

Authors: Sumanth Gurram, Andy Fang, David Chan, John Canny

Abstract: Generating representations of video data is of key importance in advancing the field of machine perception. Most current techniques rely on hand-annotated data, which can be difficult to work with, expensive to generate, and hard to scale. In this work, we propose a novel learning approach based on contrastive learning, LAVA, which is capable of learning joint language, audio, and video representations in a self-supervised manner. We pre-train LAVA on the Kinetics 700 dataset using transformer encoders to learn representations for each modality. We then demonstrate that LAVA performs competitively with the current state-of-the-art self-supervised and weakly-supervised pretraining techniques on UCF-101 and HMDB-51 video action recognition while using a fraction of the unlabeled data.

Submitted 16 July, 2022; originally announced July 2022.

Comments: Workshop Paper at ICML 2022

arXiv:2206.08353 [pdf, other]

Towards Understanding How Machines Can Learn Causal Overhypotheses

Authors: Eliza Kosoy, David M. Chan, Adrian Liu, Jasmine Collins, Bryanna Kaufmann, Sandy Han Huang, Jessica B. Hamrick, John Canny, Nan Rosemary Ke, Alison Gopnik

Abstract: Recent work in machine learning and cognitive science has suggested that understanding causal information is essential to the development of intelligence. The extensive literature in cognitive science using the "blicket detector" environment shows that children are adept at many kinds of causal inference and learning. We propose to adapt that environment for machine learning agents. One of the key challenges for current machine learning algorithms is modeling and understanding causal overhypotheses: transferable abstract hypotheses about sets of causal relationships. In contrast, even young children spontaneously learn and use causal overhypotheses. In this work, we present a new benchmark -- a flexible environment which allows for the evaluation of existing techniques under variable causal overhypotheses -- and demonstrate that many existing state-of-the-art methods have trouble generalizing in this environment. The code and resources for this benchmark are available at https://github.com/CannyLab/casual_overhypotheses.

Submitted 16 June, 2022; originally announced June 2022.

arXiv:2205.09872 [pdf, other]

Content-Context Factorized Representations for Automated Speech Recognition

Authors: David M. Chan, Shalini Ghosh

Abstract: Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may contain not only information about the spoken language content, but also information about unnecessary contexts such as background noise and sounds or speaker identity, accent, or protected attributes. Such information can directly harm generalization performance by introducing spurious correlations between the spoken words and the context in which such words were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.

Submitted 15 September, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

Comments: Presented at Interspeech 2022 (On-Site Oral Presentation)

arXiv:2205.06253 [pdf, other]

What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny

Abstract: While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In this work, we examine several popular visual description datasets, and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but that do not generalize to new domains. At the token level, sample level, and dataset level, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground truth captions on modern metrics, and that this effect is an artifact of linguistic diversity in datasets. Understanding this linguistic diversity is key to building strong captioning models; we recommend several methods and approaches for maintaining diversity in the collection of new data, and for dealing with the consequences of limited diversity when using current models and metrics.

Submitted 12 January, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

Comments: The 1st Workshop on Vision Datasets Understanding, IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

arXiv:2202.10430 [pdf, other]

Learning Causal Overhypotheses through Exploration in Children and Computational Models

Authors: Eliza Kosoy, Adrian Liu, Jasmine Collins, David M Chan, Jessica B Hamrick, Nan Rosemary Ke, Sandy H Huang, Bryanna Kaufmann, John Canny, Alison Gopnik

Abstract: Despite recent progress in reinforcement learning (RL), RL algorithms for exploration remain an active area of research. Existing methods often focus on state-based metrics, which do not consider the underlying causal structures of the environment, and while recent research has begun to explore RL environments for causal learning, these environments primarily leverage causal information through causal inference or induction rather than exploration. In contrast, human children - some of the most proficient explorers - have been shown to use causal information to great benefit. In this work, we introduce a novel RL environment designed with a controllable causal structure, which allows us to evaluate exploration strategies used by both agents and children in a unified environment. In addition, through experimentation on both computational models and children, we demonstrate that there are significant differences between information-gain optimal RL exploration in causal environments and the exploration of children in the same environments. We conclude with a discussion of how these findings may inspire new directions of research into efficient exploration and disambiguation of causal structures for RL algorithms.

Submitted 21 February, 2022; originally announced February 2022.

arXiv:2202.07706

Misinformation Detection in Social Media Video Posts

Authors: Kehan Wang, David Chan, Seth Z. Zhao, John Canny, Avideh Zakhor

Abstract: With the growing adoption of short-form video by social media platforms, reducing the spread of misinformation through video posts has become a critical challenge for social media providers. In this paper, we develop methods to detect misinformation in social media posts, exploiting modalities such as video and text. Due to the lack of large-scale public data for misinformation detection in multi-modal datasets, we collect 160,000 video posts from Twitter, and leverage self-supervised learning to learn expressive representations of joint visual and textual data. In this work, we propose two new methods for detecting semantic inconsistencies within short-form social media video posts, based on contrastive learning and masked language modeling. We demonstrate that our new approaches outperform current state-of-the-art methods on both artificial data generated by random-swapping of positive samples and in the wild on a new manually-labeled test set for semantic misinformation.

Submitted 30 July, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

Comments: We discovered an error in our dataset construction where retweets were not properly filtered. This resulted in test data leakage in training data, and the results reported are affected

arXiv:2110.09890 [pdf, other]

Multi-Modal Pre-Training for Automated Speech Recognition

Authors: David M. Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister

Abstract: Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach which leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).

Submitted 15 September, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: Presented at ICASSP 2022

arXiv:2110.03588 [pdf]

A transformer-based deep learning approach for classifying brain metastases into primary organ sites using clinical whole brain MRI

Authors: Qing Lyu, Sanjeev V. Namjoshi, Emory McTyre, Umit Topaloglu, Richard Barcus, Michael D. Chan, Christina K. Cramer, Waldemar Debinski, Metin N. Gurcan, Glenn J. Lesser, Hui-Kuan Lin, Reginald F. Munden, Boris C. Pasche, Kiran Kumar Solingapuram Sai, Roy E. Strowd, Stephen B. Tatter, Kounosuke Watabe, Wei Zhang, Ge Wang, Christopher T. Whitlow

Abstract: Treatment decisions for brain metastatic disease rely on knowledge of the primary organ site, and are currently made with biopsy and histology. Here we develop a novel deep learning approach for accurate non-invasive digital histology with whole-brain MRI data. Our IRB-approved single-site retrospective study comprised patients (n=1,399) referred for MRI treatment-planning and gamma knife radiosurgery over 21 years. Contrast-enhanced T1-weighted and T2-weighted Fluid-Attenuated Inversion Recovery brain MRI exams (n=1,582) were preprocessed and input to the proposed deep learning workflow for tumor segmentation, modality transfer, and primary site classification into one of five classes. Ten-fold cross-validation generated an overall AUC of 0.878 (95% CI: 0.873, 0.883), lung class AUC of 0.889 (95% CI: 0.883, 0.895), breast class AUC of 0.873 (95% CI: 0.860, 0.886), melanoma class AUC of 0.852 (95% CI: 0.842, 0.862), renal class AUC of 0.830 (95% CI: 0.809, 0.851), and other class AUC of 0.822 (95% CI: 0.805, 0.839). These data establish that whole-brain imaging features are discriminative enough to allow accurate diagnosis of the primary organ site of malignancy. Our end-to-end deep radiomic approach has great potential for classifying metastatic tumor types from whole-brain MRI images. Further refinement may offer an invaluable clinical tool to expedite primary cancer site identification for precision treatment and improved outcomes.

Submitted 20 April, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

arXiv:2108.13947 [pdf, other]

doi 10.6339/21-JDS1033

Decision Tree-Based Predictive Models for Academic Achievement Using College Students' Support Networks

Authors: Anthony Frazier, Joethi Silva, Rachel Meilak, Indranil Sahoo, David Chan, Michael Broda

Abstract: In this study, we examine a set of primary data collected from 484 students enrolled in a large public university in the Mid-Atlantic United States region during the early stages of the COVID-19 pandemic. The data, called Ties data, included students' demographic and support network information. The support network data comprised information that highlighted the type of support (i.e., emotional or educational; routine or intense). Using this data set, models for predicting students' academic achievement, quantified by their self-reported GPA, were created using Chi-Square Automatic Interaction Detection (CHAID), a decision tree algorithm, and cforest, a random forest algorithm that uses conditional inference trees. We compare the methods' accuracy and variation in the set of important variables suggested by each algorithm. Each algorithm found different variables important for different student demographics with some overlap. For White students, different types of educational support were important in predicting academic achievement, while for non-White students, different types of emotional support were important in predicting academic achievement. The presence of differing types of routine support was important in predicting academic achievement for cisgender women, while differing types of intense support were important in predicting academic achievement for cisgender men.

Submitted 12 September, 2022; v1 submitted 31 August, 2021; originally announced August 2021.

arXiv:2108.01651 [pdf, ps, other]

An Impossibility Result on Strong Linearizability in Message-Passing Systems

Authors: David Yu Cheng Chan, Vassos Hadzilacos, Xing Hu, Sam Toueg

Abstract: We prove that in asynchronous message-passing systems where at most one process may crash, there is no lock-free strongly linearizable implementation of a weak object that we call Test-or-Set (ToS). This object allows a single distinguished process to apply the set operation once, and a different distinguished process to apply the test operation also once. Since this weak object can be directly implemented by a single-writer single-reader (SWSR) register (and other common objects such as max-register, snapshot and counter), this result implies that there is no $1$-resilient lock-free strongly linearizable implementation of a SWSR register (and of these other objects) in message-passing systems. We also prove that there is no $1$-resilient lock-free write strongly-linearizable implementation of a 2-writer 1-reader (2W1R) register in asynchronous message-passing systems.

Submitted 9 August, 2021; v1 submitted 3 August, 2021; originally announced August 2021.

Comments: 12 pages

arXiv:2106.03185 [pdf, ps, other]

Tight Lower Bounds for the RMR Complexity of Recoverable Mutual Exclusion

Authors: David Yu Cheng Chan, Philipp Woelfel

Abstract: We present a tight RMR complexity lower bound for the recoverable mutual exclusion (RME) problem, defined by Golab and Ramaraju. In particular, we show that any $n$-process RME algorithm using only atomic read, write, fetch-and-store, fetch-and-increment, and compare-and-swap operations has an RMR complexity of $Ω(\log n/\log\log n)$ in the CC and DSM models. This lower bound covers all realistic synchronization primitives that have been used in RME algorithms and matches the best upper bounds of algorithms employing swap objects (e.g., [5,6,10]). Algorithms with better RMR complexity than that have only been obtained by either (i) assuming that all failures are system-wide [7], (ii) employing fetch-and-add objects of size $(\log n)^{ω(1)}$ [12], or (iii) using artificially defined synchronization primitives that are not available in actual systems [6,9].

Submitted 6 June, 2021; originally announced June 2021.

Comments: 36 pages, 0 figures

arXiv:2105.10880 [pdf, other]

RtFPS: An Interactive Map that Visualizes and Predicts Wildfires in the US

Authors: Yang Li, Hermawan Mulyono, Ying Chen, Zhiyin Lu, Desmond Chan

Abstract: Climate change has greatly impacted our daily lives. As one of its consequences, we are experiencing more wildfires. In the year 2020, wildfires burned a record number of 8,888,297 acres in the US. To draw people's attention to climate change, and to visualize the current risk of wildfires, we developed RtFPS, the "Real-Time Fire Prediction System". It provides a real-time prediction visualization of wildfire risk at specific locations based on a Machine Learning model. It also provides interactive map features that show historical wildfire events with environmental info.

Submitted 21 June, 2021; v1 submitted 23 May, 2021; originally announced May 2021.

Comments: Source code: https://github.com/yangland/rtfps

MSC Class: 68U05; 68T30 ACM Class: J.2.5; H.4.0; I.5.1

arXiv:2104.01263 [pdf, other]

A Semantic Segmentation Network for Urban-Scale Building Footprint Extraction Using RGB Satellite Imagery

Authors: Aatif Jiwani, Shubhrakanti Ganguly, Chao Ding, Nan Zhou, David M. Chan

Abstract: Urban areas consume over two-thirds of the world's energy and account for more than 70 percent of global CO2 emissions. As stated in IPCC's Global Warming of 1.5C report, achieving carbon neutrality by 2050 requires a clear understanding of urban geometry. High-quality building footprint generation from satellite images can accelerate this predictive process and empower municipal decision-making at scale. However, previous Deep Learning-based approaches face consequential issues such as scale invariance and defective footprints, partly due to ever-present class-wise imbalance. Additionally, most approaches require supplemental data such as point cloud data, building height information, and multi-band imagery, which have limited availability and are tedious to produce. In this paper, we propose a modified DeeplabV3+ module with a Dilated Res-Net backbone to generate masks of building footprints from three-channel RGB satellite imagery only. Furthermore, we introduce an F-Beta measure in our objective function to help the model account for skewed class distributions and prevent false-positive footprints. In addition to F-Beta, we incorporate an exponentially weighted boundary loss and use a cross-dataset training strategy to further increase the quality of predictions. As a result, we achieve state-of-the-art performance across three public benchmarks and demonstrate that our RGB-only method produces higher quality visual results and is agnostic to the scale, resolution, and urban density of satellite imagery.

Submitted 18 November, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

Comments: 11 pages, 5 figures. Code available at https://github.com/aatifjiwani/rgb-footprint-extract/
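
The F-Beta term mentioned in the abstract above can be illustrated with a differentiable ("soft") version of the standard F-beta score, F_beta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP). The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `soft_fbeta_loss`, the flattened-list inputs, and the `eps` smoothing are all hypothetical choices made for illustration.

```python
def soft_fbeta_loss(probs, targets, beta=1.0, eps=1e-7):
    """Return 1 - soft F-beta for per-pixel probabilities and 0/1 labels.

    Hypothetical helper for illustration; inputs are flat sequences of equal length.
    """
    # Soft counts: probabilities stand in for hard 0/1 predictions.
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    b2 = beta * beta
    # Standard F-beta written in terms of TP, FP, FN; eps avoids division by zero.
    score = ((1 + b2) * tp + eps) / ((1 + b2) * tp + b2 * fn + fp + eps)
    return 1.0 - score


# With beta < 1 the score leans toward precision, so a false positive
# costs more than a false negative of the same size.
loss_false_positive = soft_fbeta_loss([1.0, 1.0], [1, 0], beta=0.5)
loss_false_negative = soft_fbeta_loss([1.0, 0.0], [1, 1], beta=0.5)
```

Choosing beta below 1 is one way such a term could discourage false-positive footprints; the value of beta used in the paper is not stated in the abstract.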

arXiv:2103.11926 [pdf, other]

Differentiated nonblocking: a new progress condition and a matching queue algorithm

Authors: David Y. C. Chan, Shucheng Chi, Vassos Hadzilacos, Sam Toueg

Abstract: In this paper, we first propose a new liveness requirement for shared objects and data structures; we then give a shared queue algorithm that satisfies this requirement and prove its correctness. We also implement this algorithm and compare it to a well-known shared queue algorithm that is used in practice. In addition to having a stronger worst-case progress guarantee, our experimental results suggest that, at the cost of a marginal decrease in throughput, our algorithm is significantly fairer, by a natural definition of fairness that we introduce here.

Submitted 22 March, 2021; originally announced March 2021.

arXiv:2008.02787 [pdf, other]

Efficient Non-Line-of-Sight Imaging from Transient Sinograms

Authors: Mariko Isogawa, Dorian Chan, Ye Yuan, Kris Kitani, Matthew O'Toole

Abstract: Non-line-of-sight (NLOS) imaging techniques use light that diffusely reflects off of visible surfaces (e.g., walls) to see around corners. One approach involves using pulsed lasers and ultrafast sensors to measure the travel time of multiply scattered light. Unlike existing NLOS techniques that generally require densely raster scanning points across the entirety of a relay wall, we explore a more efficient form of NLOS scanning that reduces both acquisition times and computational requirements. We propose a circular and confocal non-line-of-sight (C2NLOS) scan that involves illuminating and imaging a common point, and scanning this point in a circular path along a wall. We observe that (1) these C2NLOS measurements consist of a superposition of sinusoids, which we refer to as a transient sinogram, (2) there exist computationally efficient reconstruction procedures that transform these sinusoidal measurements into 3D positions of hidden scatterers or NLOS images of hidden objects, and (3) despite operating on an order of magnitude fewer measurements than previous approaches, these C2NLOS scans provide sufficient information about the hidden scene to solve these different NLOS imaging tasks. We show results from both simulated and real C2NLOS scans.

Submitted 6 August, 2020; originally announced August 2020.

Comments: ECCV 2020. Project page: https://marikoisogawa.github.io/project/c2nlos

arXiv:2007.13913 [pdf, other]

Active Learning for Video Description With Cluster-Regularized Ensemble Ranking

Authors: David M. Chan, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

Abstract: Automatic video captioning aims to train models to generate text descriptions for all segments in a video; however, the most effective approaches require large amounts of manual annotation, which is slow and expensive. Active learning is a promising way to efficiently build a training set for video captioning tasks while reducing the need to manually label uninformative examples. In this work we both explore various active learning approaches for automatic video captioning and show that a cluster-regularized ensemble strategy provides the best active learning approach to efficiently gather training sets for video captioning. We evaluate our approaches on the MSR-VTT and LSMDC datasets using both transformer and LSTM based captioning models and show that our novel strategy can achieve high performance while using up to 60% less training data than the strong state-of-the-art baselines.

Submitted 2 December, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

Comments: Published at the 15th Asian Conference on Computer Vision (ACCV 2020)

arXiv:2006.08335 [pdf, other]

A Dataset and Benchmarks for Multimedia Social Analysis

Authors: Bofan Xue, David Chan, John Canny

Abstract: We present a new publicly available dataset with the goal of advancing multi-modality learning by offering vision and language data within the same context. This is achieved by obtaining data from a social media website with posts containing multiple paired images/videos and text, along with comment trees containing images/videos and/or text. With a total of 677k posts, 2.9 million post images, 488k post videos, 1.4 million comment images, 4.6 million comment videos, and 96.9 million comments, data from different modalities can be jointly used to improve performance on a variety of tasks such as image captioning, image classification, next frame prediction, sentiment analysis, and language modeling. We present a wide range of statistics for our dataset. Finally, we provide baseline performance analysis for one of the regression tasks using pre-trained models and several fully connected networks.

Submitted 5 June, 2020; originally announced June 2020.

Comments: Published as a workshop paper at "Multimodality Learning" (CVPR 2020)

arXiv:2005.05023 [pdf, other]

Facial Electromyography-based Adaptive Virtual Reality Gaming for Cognitive Training

Authors: Lorcan Reidy, Dennis Chan, Charles Nduka, Hatice Gunes

Abstract: Cognitive training has shown promising results for delivering improvements in human cognition related to attention, problem solving, reading comprehension and information retrieval. However, two frequently cited problems in cognitive training literature are a lack of user engagement with the training programme, and a failure of developed skills to generalise to daily life. This paper introduces a new cognitive training (CT) paradigm designed to address these two limitations by combining the benefits of gamification, virtual reality (VR), and affective adaptation in the development of an engaging, ecologically valid, CT task. Additionally, it incorporates facial electromyography (EMG) as a means of determining user affect while engaged in the CT task. This information is then utilised to dynamically adjust the game's difficulty in real-time as users play, with the aim of leading them into a state of flow. Affect recognition rates of 64.1% and 76.2%, for valence and arousal respectively, were achieved by classifying a DWT-Haar approximation of the input signal using kNN. The affect-aware VR cognitive training intervention was then evaluated with a control group of older adults. The results obtained substantiate the notion that adaptation techniques can lead to greater feelings of competence and a more appropriate challenge of the user's skills.

Submitted 30 August, 2020; v1 submitted 27 April, 2020; originally announced May 2020.

ACM Class: I.2; K.8

arXiv:2005.02880 [pdf, other]

Exploring Exploration: Comparing Children with RL Agents in Unified Environments

Authors: Eliza Kosoy, Jasmine Collins, David M. Chan, Sandy Huang, Deepak Pathak, Pulkit Agrawal, John Canny, Alison Gopnik, Jessica B. Hamrick

Abstract: Research in developmental psychology consistently shows that children explore the world thoroughly and efficiently and that this exploration allows them to learn. In turn, this early learning supports more robust generalization and intelligent behavior later in life. While much work has gone into developing methods for exploration in machine learning, artificial agents have not yet reached the high standard set by their human counterparts. In this work we propose using DeepMind Lab (Beattie et al., 2016) as a platform to directly compare child and agent behaviors and to develop new exploration techniques. We outline two ongoing experiments to demonstrate the effectiveness of a direct comparison, and outline a number of open research questions that we believe can be tested using this methodology.

Submitted 1 July, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

Comments: Published as a workshop paper at "Bridging AI and Cognitive Science" (ICLR 2020)

arXiv:1910.12154 [pdf, other]

ZPD Teaching Strategies for Deep Reinforcement Learning from Demonstrations

Authors: Daniel Seita, David Chan, Roshan Rao, Chen Tang, Mandi Zhao, John Canny

Abstract: Learning from demonstrations is a popular tool for accelerating and reducing the exploration requirements of reinforcement learning. When providing expert demonstrations to human students, we know that the demonstrations must fall within a particular range of difficulties called the "Zone of Proximal Development (ZPD)". If they are too easy the student learns nothing, but if they are too difficult the student is unable to follow along. This raises the question: Given a set of potential demonstrators, which among them is best suited for teaching any particular learner? Prior work, such as the popular Deep Q-learning from Demonstrations (DQfD) algorithm, has generally focused on single demonstrators. In this work we consider the problem of choosing among multiple demonstrators of varying skill levels. Our results align with intuition from human learners: it is not always the best policy to draw demonstrations from the best performing demonstrator (in terms of reward). We show that careful selection of teaching strategies can result in sample efficiency gains in the learner's environment across nine Atari games.

Submitted 26 October, 2019; originally announced October 2019.

Comments: Deep Reinforcement Learning Workshop at NeurIPS 2019

arXiv:1902.06085 [pdf]

doi 10.1002/mp.14003

DC-AL GAN: Pseudoprogression and True Tumor Progression of Glioblastoma Multiform Image Classification Based on DCGAN and AlexNet

Authors: Meiyu Li, Hailiang Tang, Michael D. Chan, Xiaobo Zhou, Xiaohua Qian

Abstract: Pseudoprogression (PsP) occurs in 20-30% of patients with glioblastoma multiforme (GBM) after receiving the standard treatment. In the course of post-treatment magnetic resonance imaging (MRI), PsP exhibits similarities in shape and intensity to the true tumor progression (TTP) of GBM. These similarities make it difficult to differentiate between the two types of progression and hence to select the appropriate clinical treatment strategy. In this paper, we introduce DC-AL GAN, a novel feature learning method based on deep convolutional generative adversarial network (DCGAN) and AlexNet, to discriminate between PsP and TTP in MRI images. Due to the adversarial relationship between the generator and the discriminator of DCGAN, high-level discriminative features of PsP and TTP can be derived for the discriminator with AlexNet. Also, a feature fusion scheme is used to combine higher-layer features with lower-layer information, leading to more powerful features that are used for effectively discriminating between PsP and TTP. The experimental results show that DC-AL GAN achieves desirable PsP and TTP classification performance that is superior to other state-of-the-art methods.

Submitted 18 May, 2019; v1 submitted 16 February, 2019; originally announced February 2019.

arXiv:1902.04168 [pdf, other]

doi 10.1016/j.enganabound.2014.03.010

A robust and non-singular formulation of the boundary integral method for the potential problem

Authors: Q. Sun, E. Klaseboer, B. C. Khoo, D. Y. C. Chan

Abstract: A non-singular formulation of the boundary integral method (BIM) is presented for the Laplace equation whereby the well-known singularities that arise from the fundamental solution are eliminated analytically. A key advantage of this approach is that numerical errors that arise due to the proximity of nodes located on osculating boundaries are suppressed. This is particularly relevant in multi-scale problems where high accuracy is required without undue increase in computational cost when the spacing between boundaries becomes much smaller than their characteristic dimensions. The elimination of the singularities means that standard quadrature can be used to evaluate the surface integrals and this results in about 60% savings in coding effort. The new formulation also affords a numerically robust way to calculate the potential close to the boundaries. Detailed implementations of this approach are illustrated with problems involving osculating boundaries, 2D domains with corners and a wave drag problem in a 3D semi-infinite domain. The explicit formulation of problems with axial symmetry is also given.

Submitted 7 February, 2019; originally announced February 2019.

Journal ref: Engineering Analysis with Boundary Elements 43 (2014) 117

arXiv:1901.05305 [pdf, other]

doi 10.1109/ICASSP.2019.8683229

Seizure Detection using Least EEG Channels by Deep Convolutional Neural Network

Authors: Mustafa Talha Avcu, Zhuo Zhang, Derrick Wei Shih Chan

Abstract: This work aims to develop an end-to-end solution for seizure onset detection. We design SeizNet, a Convolutional Neural Network for seizure detection. To compare SeizNet with a traditional machine learning approach, a baseline classifier is implemented using spectrum band power features with Support Vector Machines (BPsvm). We explore the possibility of using the fewest channels for accurate seizure detection by evaluating the SeizNet and BPsvm approaches in all-channel and two-channel settings respectively. EEG data is acquired from 29 pediatric patients admitted to KK Women's and Children's Hospital who were diagnosed with typical absence seizures. We conduct leave-one-out cross validation for all subjects. Using full channel data, BPsvm yields a sensitivity of 86.6% and 0.84 false alarms per hour, while SeizNet yields an overall sensitivity of 95.8% with 0.17 false alarms. More interestingly, two-channel SeizNet outperforms full channel BPsvm with a sensitivity of 93.3% and 0.58 false alarms. We further investigate the interpretability of SeizNet by decoding the filters learned along the convolutional layers. Seizure-like characteristics can be clearly observed in the filters from the third and fourth convolutional layers.

Submitted 14 January, 2019; originally announced January 2019.

arXiv:1812.09744 [pdf, other]

Leveraging Class Similarity to Improve Deep Neural Network Robustness

Authors: Pooran Singh Negi, David Chan, Mohammad Mahoor

Abstract: Traditionally, artificial neural networks (ANNs) are trained by minimizing the cross-entropy between a provided ground-truth delta distribution (encoded as a one-hot vector) and the ANN's predictive softmax distribution. It seems, however, unacceptable to penalize networks equally for misclassification between classes. Confusing the class "Automobile" with the class "Truck" should be penalized less than confusing the class "Automobile" with the class "Donkey". To avoid such representation issues and learn cleaner classification boundaries in the network, this paper presents a variation of cross-entropy loss which depends not only on the sample class but also on a data-driven prior "class-similarity distribution" across the classes encoded in a matrix form. We explore learning the class-similarity distribution using a data-driven method and then show that by training with our modified similarity-driven loss, we obtain slightly better generalization performance over multiple architectures and datasets as well as improved performance on noisy testing scenarios.

Submitted 27 December, 2018; v1 submitted 23 December, 2018; originally announced December 2018.

arXiv:1812.04604 [pdf, other]

Diagnostic Visualization for Deep Neural Networks Using Stochastic Gradient Langevin Dynamics

Authors: Biye Jiang, David M. Chan, Tianhao Zhang, John F. Canny

Abstract: The internal states of most deep neural networks are difficult to interpret, which makes diagnosis and debugging during training challenging. Activation maximization methods are widely used, but lead to multiple optima and are hard to interpret (appear noise-like) for complex neurons. Image-based methods use maximally-activating image regions which are easier to interpret, but do not provide pixel-level insight into why the neuron responds to them. In this work we introduce an MCMC method: Langevin Dynamics Activation Maximization (LDAM), which is designed for diagnostic visualization. LDAM provides two affordances in combination: the ability to explore the set of maximally activating pre-images, and the ability to trade off interpretability and pixel-level accuracy using a GAN-style discriminator as a regularizer. We present case studies on MNIST, CIFAR and ImageNet datasets exploring these trade-offs. Finally we show that diagnostic visualization using LDAM leads to a novel insight into the parameter averaging method for deep net training.

Submitted 11 December, 2018; originally announced December 2018.

arXiv:1811.06040 [pdf, other]

doi 10.1109/NER.2019.8717158

Brain-Computer Interface in Virtual Reality

Authors: Reza Abbasi-Asl, Mohammad Keshavarzi, Dorian Yao Chan

Abstract: We study the performance of a brain-computer interface (BCI) system in a virtual reality (VR) environment and compare it to regular 2D displays. First, we design a headset that consists of three components: a wearable electroencephalography (EEG) device, a VR headset and an interface. Recordings of brain and behavior are collected from human subjects performing a wide variety of tasks using our device. The tasks consist of object rotation or scaling in VR using either mental commands or facial expressions (smile and eyebrow movement). Subjects are asked to repeat similar tasks on regular 2D monitor screens. The performance in the 3D virtual reality environment is considerably higher compared to the 2D screen. In particular, the median number of successful commands across trials for the VR setting is double that for the 2D setting (8 successful commands in the VR setting compared to 4 successful commands on the 2D screen in 1-minute trials). Our results suggest that the design of future BCI systems can benefit remarkably from the VR setting.

Submitted 13 November, 2018; originally announced November 2018.

arXiv:1810.00216 [pdf, other]

Parameter Estimation for the Single-Look $\mathcal{G}^0$ Distribution

Authors: Débora Chan, Andrea Rey, Juliana Gambini, Alejandro C. Frery

Abstract: The statistical properties of Synthetic Aperture Radar (SAR) image texture reveal useful target characteristics. It is well-known that these images are affected by speckle, and prone to contamination such as double bounce and corner reflectors. The $\mathcal{G}^0$ distribution is flexible enough to model different degrees of texture in speckled data. It is indexed by three parameters: $α$, related to the texture, $γ$, a scale parameter, and $L$, the number of looks, which is related to the signal-to-noise ratio. Quality estimation of $α$ is essential due to its immediate interpretability. In this article, we compare the behavior of a number of parameter estimation techniques in the noisiest case, namely single look data. We evaluate them using Monte Carlo methods for non-contaminated and contaminated data, considering convergence rate, bias, mean squared error (MSE) and computational cost. The results are verified with simulated and actual SAR images.

Submitted 29 September, 2018; originally announced October 2018.

arXiv:1807.11824 [pdf, other]

t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data

Authors: David M. Chan, Roshan Rao, Forrest Huang, John F. Canny

Abstract: Modern datasets and models are notoriously difficult to explore and analyze due to their inherent high dimensionality and massive numbers of samples. Existing visualization methods which employ dimensionality reduction to two or three dimensions are often inefficient and/or ineffective for these datasets. This paper introduces t-SNE-CUDA, a GPU-accelerated implementation of t-distributed Stochastic Neighbor Embedding (t-SNE) for visualizing datasets and models. t-SNE-CUDA significantly outperforms current implementations with 50-700x speedups on the CIFAR-10 and MNIST datasets. These speedups enable, for the first time, visualization of the neural network activations on the entire ImageNet dataset, a feat that was previously computationally intractable. We also demonstrate visualization performance in the NLP domain by visualizing the GloVe embedding vectors. From these visualizations, we can draw interesting conclusions about using the L2 metric in these embedding spaces. t-SNE-CUDA is publicly available at https://github.com/CannyLab/tsne-cuda

Submitted 31 July, 2018; originally announced July 2018.

Comments: To appear in HPML 2018 High Performance Machine Learning Workshop (Accepted, 2018)

arXiv:1605.03639 [pdf, other]

doi 10.1109/CVPRW.2016.188

Facial Expression Recognition from World Wild Web

Authors: Ali Mollahosseini, Behzad Hassani, Michelle J. Salvador, Hojjat Abdollahi, David Chan, Mohammad H. Mahoor

Abstract: Recognizing facial expression in a wild setting has remained a challenging task in computer vision. The World Wide Web is a good source of facial images, most of which are captured in uncontrolled conditions. In fact, the Internet is a World Wild Web of facial images with expressions. This paper presents the results of a new study on collecting, annotating, and analyzing wild facial expressions from the web. Three search engines were queried using 1250 emotion related keywords in six different languages and the retrieved images were mapped by two annotators to six basic expressions and neutral. Deep neural networks and noise modeling were used in three different training scenarios to find how accurately facial expressions can be recognized when trained on noisy images collected from the web using query terms (e.g. happy face, laughing man, etc.). The results of our experiments show that deep neural networks can recognize wild facial expressions with an accuracy of 82.12%.

Submitted 5 January, 2017; v1 submitted 11 May, 2016; originally announced May 2016.

arXiv:1511.04110 [pdf, other]

doi 10.1109/WACV.2016.7477450

Going Deeper in Facial Expression Recognition using Deep Neural Networks

Authors: Ali Mollahosseini, David Chan, Mohammad H. Mahoor

Abstract: Automated Facial Expression Recognition (FER) has remained a challenging and interesting problem. Despite efforts made in developing various methods for FER, existing approaches traditionally lack generalizability when applied to unseen images or those that are captured in a wild setting. Most of the existing approaches are based on engineered features (e.g. HOG, LBPH, and Gabor) where the classifier's hyperparameters are tuned to give the best recognition accuracies across a single database, or a small collection of similar databases. Nevertheless, the results are not significant when they are applied to novel data. This paper proposes a deep neural network architecture to address the FER problem across multiple well-known standard face datasets. Specifically, our network consists of two convolutional layers each followed by max pooling and then four Inception layers. The network is a single component architecture that takes registered facial images as the input and classifies them into one of the six basic expressions or the neutral expression. We conducted comprehensive experiments on seven publicly available facial expression databases, viz. MultiPIE, MMI, CK+, DISFA, FERA, SFEW, and FER2013. The results of the proposed architecture are comparable to or better than the state-of-the-art methods and better than traditional convolutional neural networks in both accuracy and training time.

Submitted 12 November, 2015; originally announced November 2015.

Comments: To appear in IEEE Winter Conference on Applications of Computer Vision (WACV), 2016 {Accepted in first round submission}

Journal ref: IEEE Winter Conference on Applications of Computer Vision (WACV), 2016

Showing 1–49 of 49 results for author: Chan, D