-
Data-driven Nucleus Subclassification on Colon H&E using Style-transferred Digital Pathology
Authors:
Lucas W. Remedios,
Shunxing Bao,
Samuel W. Remedios,
Ho Hin Lee,
Leon Y. Cai,
Thomas Li,
Ruining Deng,
Nancy R. Newlin,
Adam M. Saunders,
Can Cui,
Jia Li,
Qi Liu,
Ken S. Lau,
Joseph T. Roland,
Mary K Washington,
Lori A. Coburn,
Keith T. Wilson,
Yuankai Huo,
Bennett A. Landman
Abstract:
Understanding the way cells communicate, co-locate, and interrelate is essential to furthering our understanding of how the body functions. H&E is widely available, however, cell subtyping often requires expert knowledge and the use of specialized stains. To reduce the annotation burden, AI has been proposed for the classification of cells on H&E. For example, the recent Colon Nucleus Identificati…
▽ More
Understanding the way cells communicate, co-locate, and interrelate is essential to furthering our understanding of how the body functions. H&E is widely available, however, cell subtyping often requires expert knowledge and the use of specialized stains. To reduce the annotation burden, AI has been proposed for the classification of cells on H&E. For example, the recent Colon Nucleus Identification and Classification (CoNIC) Challenge focused on labeling 6 cell types on H&E of the colon. However, the CoNIC Challenge was unable to classify epithelial subtypes (progenitor, enteroendocrine, goblet), lymphocyte subtypes (B, helper T, cytotoxic T), and connective subtypes (fibroblasts). We use inter-modality learning to label previously un-labelable cell types on H&E. We take advantage of multiplexed immunofluorescence (MxIF) histology to label 14 cell subclasses. We performed style transfer on the same MxIF tissues to synthesize realistic virtual H&E which we paired with the MxIF-derived cell subclassification labels. We evaluated the efficacy of using a supervised learning scheme where the input was realistic-quality virtual H&E and the labels were MxIF-derived cell subclasses. We assessed our model on private virtual H&E and public real H&E. On virtual H&E, we were able to classify helper T cells and epithelial progenitors with positive predictive values of $0.34 \pm 0.15$ (prevalence $0.03 \pm 0.01$) and $0.47 \pm 0.1$ (prevalence $0.07 \pm 0.02$) respectively, when using ground truth centroid information. On real H&E we could classify helper T cells and epithelial progenitors with upper bound positive predictive values of $0.43 \pm 0.03$ (parent class prevalence 0.21) and $0.94 \pm 0.02$ (parent class prevalence 0.49) when using ground truth centroid information. This is the first work to provide cell type classification for helper T and epithelial progenitor nuclei on H&E.
△ Less
Submitted 15 May, 2024;
originally announced July 2024.
-
Jump Starting Bandits with LLM-Generated Prior Knowledge
Authors:
Parand A. Alamdari,
Yanshuai Cao,
Kevin H. Wilson
Abstract:
We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human…
▽ More
We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Counterfactual Explanations for Multivariate Time-Series without Training Datasets
Authors:
Xiangyu Sun,
Raquel Aoki,
Kevin H. Wilson
Abstract:
Machine learning (ML) methods have experienced significant growth in the past decade, yet their practical application in high-impact real-world domains has been hindered by their opacity. When ML methods are responsible for making critical decisions, stakeholders often require insights into how to alter these decisions. Counterfactual explanations (CFEs) have emerged as a solution, offering interp…
▽ More
Machine learning (ML) methods have experienced significant growth in the past decade, yet their practical application in high-impact real-world domains has been hindered by their opacity. When ML methods are responsible for making critical decisions, stakeholders often require insights into how to alter these decisions. Counterfactual explanations (CFEs) have emerged as a solution, offering interpretations of opaque ML models and providing a pathway to transition from one decision to another. However, most existing CFE methods require access to the model's training dataset, few methods can handle multivariate time-series, and none can handle multivariate time-series without training datasets. These limitations can be formidable in many scenarios. In this paper, we present CFWoT, a novel reinforcement-learning-based CFE method that generates CFEs when training datasets are unavailable. CFWoT is model-agnostic and suitable for both static and multivariate time-series datasets with continuous and discrete features. Users have the flexibility to specify non-actionable, immutable, and preferred features, as well as causal constraints which CFWoT guarantees will be respected. We demonstrate the performance of CFWoT against four baselines on several datasets and find that, despite not having access to a training dataset, CFWoT finds CFEs that make significantly fewer and significantly smaller changes to the input time-series. These properties make CFEs more actionable, as the magnitude of change required to alter an outcome is vastly reduced.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
On the development of a practical Bayesian optimisation algorithm for expensive experiments and simulations with changing environmental conditions
Authors:
Mike Diessner,
Kevin J. Wilson,
Richard D. Whalley
Abstract:
Experiments in engineering are typically conducted in controlled environments where parameters can be set to any desired value. This assumes that the same applies in a real-world setting -- an assumption that is often incorrect as many experiments are influenced by uncontrollable environmental conditions such as temperature, humidity and wind speed. When optimising such experiments, the focus shou…
▽ More
Experiments in engineering are typically conducted in controlled environments where parameters can be set to any desired value. This assumes that the same applies in a real-world setting -- an assumption that is often incorrect as many experiments are influenced by uncontrollable environmental conditions such as temperature, humidity and wind speed. When optimising such experiments, the focus should lie on finding optimal values conditionally on these uncontrollable variables. This article extends Bayesian optimisation to the optimisation of systems in changing environments that include controllable and uncontrollable parameters. The extension fits a global surrogate model over all controllable and environmental variables but optimises only the controllable parameters conditional on measurements of the uncontrollable variables. The method is validated on two synthetic test functions and the effects of the noise level, the number of the environmental parameters, the parameter fluctuation, the variability of the uncontrollable parameters, and the effective domain size are investigated. ENVBO, the proposed algorithm resulting from this investigation, is applied to a wind farm simulator with eight controllable and one environmental parameter. ENVBO finds solutions for the full domain of the environmental variable that outperforms results from optimisation algorithms that only focus on a fixed environmental value in all but one case while using a fraction of their evaluation budget. This makes the proposed approach very sample-efficient and cost-effective. An off-the-shelf open-source version of ENVBO is available via the NUBO Python package.
△ Less
Submitted 5 February, 2024;
originally announced February 2024.
-
Nucleus subtype classification using inter-modality learning
Authors:
Lucas W. Remedios,
Shunxing Bao,
Samuel W. Remedios,
Ho Hin Lee,
Leon Y. Cai,
Thomas Li,
Ruining Deng,
Can Cui,
Jia Li,
Qi Liu,
Ken S. Lau,
Joseph T. Roland,
Mary K. Washington,
Lori A. Coburn,
Keith T. Wilson,
Yuankai Huo,
Bennett A. Landman
Abstract:
Understanding the way cells communicate, co-locate, and interrelate is essential to understanding human physiology. Hematoxylin and eosin (H&E) staining is ubiquitously available both for clinical studies and research. The Colon Nucleus Identification and Classification (CoNIC) Challenge has recently innovated on robust artificial intelligence labeling of six cell types on H&E stains of the colon.…
▽ More
Understanding the way cells communicate, co-locate, and interrelate is essential to understanding human physiology. Hematoxylin and eosin (H&E) staining is ubiquitously available both for clinical studies and research. The Colon Nucleus Identification and Classification (CoNIC) Challenge has recently innovated on robust artificial intelligence labeling of six cell types on H&E stains of the colon. However, this is a very small fraction of the number of potential cell classification types. Specifically, the CoNIC Challenge is unable to classify epithelial subtypes (progenitor, endocrine, goblet), lymphocyte subtypes (B, helper T, cytotoxic T), or connective subtypes (fibroblasts, stromal). In this paper, we propose to use inter-modality learning to label previously un-labelable cell types on virtual H&E. We leveraged multiplexed immunofluorescence (MxIF) histology imaging to identify 14 subclasses of cell types. We performed style transfer to synthesize virtual H&E from MxIF and transferred the higher density labels from MxIF to these virtual H&E images. We then evaluated the efficacy of learning in this approach. We identified helper T and progenitor nuclei with positive predictive values of $0.34 \pm 0.15$ (prevalence $0.03 \pm 0.01$) and $0.47 \pm 0.1$ (prevalence $0.07 \pm 0.02$) respectively on virtual H&E. This approach represents a promising step towards automating annotation in digital pathology.
△ Less
Submitted 28 January, 2024; v1 submitted 10 January, 2024;
originally announced January 2024.
-
Cell Spatial Analysis in Crohn's Disease: Unveiling Local Cell Arrangement Pattern with Graph-based Signatures
Authors:
Shunxing Bao,
Sichen Zhu,
Vasantha L Kolachala,
Lucas W. Remedios,
Yeonjoo Hwang,
Yutong Sun,
Ruining Deng,
Can Cui,
Yike Li,
Jia Li,
Joseph T. Roland,
Qi Liu,
Ken S. Lau,
Subra Kugathasan,
Peng Qiu,
Keith T. Wilson,
Lori A. Coburn,
Bennett A. Landman,
Yuankai Huo
Abstract:
Crohn's disease (CD) is a chronic and relapsing inflammatory condition that affects segments of the gastrointestinal tract. CD activity is determined by histological findings, particularly the density of neutrophils observed on Hematoxylin and Eosin stains (H&E) imaging. However, understanding the broader morphometry and local cell arrangement beyond cell counting and tissue morphology remains cha…
▽ More
Crohn's disease (CD) is a chronic and relapsing inflammatory condition that affects segments of the gastrointestinal tract. CD activity is determined by histological findings, particularly the density of neutrophils observed on Hematoxylin and Eosin stains (H&E) imaging. However, understanding the broader morphometry and local cell arrangement beyond cell counting and tissue morphology remains challenging. To address this, we characterize six distinct cell types from H&E images and develop a novel approach for the local spatial signature of each cell. Specifically, we create a 10-cell neighborhood matrix, representing neighboring cell arrangements for each individual cell. Utilizing t-SNE for non-linear spatial projection in scatter-plot and Kernel Density Estimation contour-plot formats, our study examines patterns of differences in the cellular environment associated with the odds ratio of spatial patterns between active CD and control groups. This analysis is based on data collected at the two research institutes. The findings reveal heterogeneous nearest-neighbor patterns, signifying distinct tendencies of cell clustering, with a particular focus on the rectum region. These variations underscore the impact of data heterogeneity on cell spatial arrangements in CD patients. Moreover, the spatial distribution disparities between the two research sites highlight the significance of collaborative efforts among healthcare organizations. All research analysis pipeline tools are available at https://github.com/MASILab/cellNN.
△ Less
Submitted 20 August, 2023;
originally announced August 2023.
-
Feasibility of Universal Anomaly Detection without Knowing the Abnormality in Medical Images
Authors:
Can Cui,
Yaohong Wang,
Shunxing Bao,
Yucheng Tang,
Ruining Deng,
Lucas W. Remedios,
Zuhayr Asad,
Joseph T. Roland,
Ken S. Lau,
Qi Liu,
Lori A. Coburn,
Keith T. Wilson,
Bennett A. Landman,
Yuankai Huo
Abstract:
Many anomaly detection approaches, especially deep learning methods, have been recently developed to identify abnormal image morphology by only employing normal images during training. Unfortunately, many prior anomaly detection methods were optimized for a specific "known" abnormality (e.g., brain tumor, bone fraction, cell types). Moreover, even though only the normal images were used in the tra…
▽ More
Many anomaly detection approaches, especially deep learning methods, have been recently developed to identify abnormal image morphology by only employing normal images during training. Unfortunately, many prior anomaly detection methods were optimized for a specific "known" abnormality (e.g., brain tumor, bone fraction, cell types). Moreover, even though only the normal images were used in the training process, the abnormal images were often employed during the validation process (e.g., epoch selection, hyper-parameter tuning), which might leak the supposed ``unknown" abnormality unintentionally. In this study, we investigated these two essential aspects regarding universal anomaly detection in medical images by (1) comparing various anomaly detection methods across four medical datasets, (2) investigating the inevitable but often neglected issues on how to unbiasedly select the optimal anomaly detection model during the validation phase using only normal images, and (3) proposing a simple decision-level ensemble method to leverage the advantage of different kinds of anomaly detection without knowing the abnormality. The results of our experiments indicate that none of the evaluated methods consistently achieved the best performance across all datasets. Our proposed method enhanced the robustness of performance in general (average AUC 0.956).
△ Less
Submitted 19 August, 2023; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Unsupervised Multi-channel Separation and Adaptation
Authors:
Cong Han,
Kevin Wilson,
Scott Wisdom,
John R. Hershey
Abstract:
A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the multi-channel setting. We use MixIT to train a model on far-field microphone array recordings of overlapping reverberant and noisy speech from the AMI Corpus. Th…
▽ More
A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the multi-channel setting. We use MixIT to train a model on far-field microphone array recordings of overlapping reverberant and noisy speech from the AMI Corpus. The models are trained on both supervised and unsupervised training data, and are tested on real AMI recordings containing overlapping speech. To objectively evaluate our models, we also use a synthetic multi-channel AMI test set. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest improvement to SI-SNR and to human listening ratings across synthetic and real datasets, outperforming supervised models trained on well-matched synthetic data. Our results demonstrate that unsupervised learning through MixIT enables model adaptation on both single- and multi-channel real-world speech recordings.
△ Less
Submitted 22 March, 2024; v1 submitted 18 May, 2023;
originally announced May 2023.
-
NUBO: A Transparent Python Package for Bayesian Optimization
Authors:
Mike Diessner,
Kevin J. Wilson,
Richard D. Whalley
Abstract:
NUBO, short for Newcastle University Bayesian Optimization, is a Bayesian optimization framework for optimizing expensive-to-evaluate black-box functions, such as physical experiments and computer simulators. Bayesian optimization is a cost-efficient optimization strategy that uses surrogate modeling via Gaussian processes to represent an objective function and acquisition functions to guide the s…
▽ More
NUBO, short for Newcastle University Bayesian Optimization, is a Bayesian optimization framework for optimizing expensive-to-evaluate black-box functions, such as physical experiments and computer simulators. Bayesian optimization is a cost-efficient optimization strategy that uses surrogate modeling via Gaussian processes to represent an objective function and acquisition functions to guide the selection of candidate points to approximate the global optimum of the objective function. NUBO focuses on transparency and user experience to make Bayesian optimization accessible to researchers from all disciplines. Clean and understandable code, precise references, and thorough documentation ensure transparency, while a modular and flexible design, easy-to-write syntax, and careful selection of Bayesian optimization algorithms ensure a good user experience. NUBO allows users to tailor Bayesian optimization to their problem by writing a custom optimization loop using the provided building blocks. It supports sequential single-point, parallel multi-point, and asynchronous optimization of bounded, constrained, and mixed (discrete and continuous) parameter input spaces. Only algorithms and methods extensively tested and validated to perform well are included in NUBO. This ensures that the package remains compact and does not overwhelm the user with an unnecessarily large number of options. The package is written in Python but does not require expert knowledge of Python to optimize simulators and experiments. NUBO is distributed as open-source software under the BSD 3-Clause license.
△ Less
Submitted 3 June, 2024; v1 submitted 11 May, 2023;
originally announced May 2023.
-
Segment Anything Model (SAM) for Digital Pathology: Assess Zero-shot Segmentation on Whole Slide Imaging
Authors:
Ruining Deng,
Can Cui,
Quan Liu,
Tianyuan Yao,
Lucas W. Remedios,
Shunxing Bao,
Bennett A. Landman,
Lee E. Wheless,
Lori A. Coburn,
Keith T. Wilson,
Yaohong Wang,
Shilin Zhao,
Agnes B. Fogo,
Haichun Yang,
Yucheng Tang,
Yuankai Huo
Abstract:
The segment anything model (SAM) was released as a foundation model for image segmentation. The promptable segmentation model was trained by over 1 billion masks on 11M licensed and privacy-respecting images. The model supports zero-shot image segmentation with various segmentation prompts (e.g., points, boxes, masks). It makes the SAM attractive for medical image analysis, especially for digital…
▽ More
The segment anything model (SAM) was released as a foundation model for image segmentation. The promptable segmentation model was trained by over 1 billion masks on 11M licensed and privacy-respecting images. The model supports zero-shot image segmentation with various segmentation prompts (e.g., points, boxes, masks). It makes the SAM attractive for medical image analysis, especially for digital pathology where the training data are rare. In this study, we evaluate the zero-shot segmentation performance of SAM model on representative segmentation tasks on whole slide imaging (WSI), including (1) tumor segmentation, (2) non-tumor tissue segmentation, (3) cell nuclei segmentation. Core Results: The results suggest that the zero-shot SAM model achieves remarkable segmentation performance for large connected objects. However, it does not consistently achieve satisfying performance for dense instance object segmentation, even with 20 prompts (clicks/boxes) on each image. We also summarized the identified limitations for digital pathology: (1) image resolution, (2) multiple scales, (3) prompt selection, and (4) model fine-tuning. In the future, the few-shot fine-tuning with images from downstream pathological segmentation tasks might help the model to achieve better performance in dense object segmentation.
△ Less
Submitted 9 April, 2023;
originally announced April 2023.
-
Cross-scale Multi-instance Learning for Pathological Image Diagnosis
Authors:
Ruining Deng,
Can Cui,
Lucas W. Remedios,
Shunxing Bao,
R. Michael Womick,
Sophie Chiron,
Jia Li,
Joseph T. Roland,
Ken S. Lau,
Qi Liu,
Keith T. Wilson,
Yaohong Wang,
Lori A. Coburn,
Bennett A. Landman,
Yuankai Huo
Abstract:
Analyzing high resolution whole slide images (WSIs) with regard to information across multiple scales poses a significant challenge in digital pathology. Multi-instance learning (MIL) is a common solution for working with high resolution images by classifying bags of objects (i.e. sets of smaller image patches). However, such processing is typically performed at a single scale (e.g., 20x magnifica…
▽ More
Analyzing high resolution whole slide images (WSIs) with regard to information across multiple scales poses a significant challenge in digital pathology. Multi-instance learning (MIL) is a common solution for working with high resolution images by classifying bags of objects (i.e. sets of smaller image patches). However, such processing is typically performed at a single scale (e.g., 20x magnification) of WSIs, disregarding the vital inter-scale information that is key to diagnoses by human pathologists. In this study, we propose a novel cross-scale MIL algorithm to explicitly aggregate inter-scale relationships into a single MIL network for pathological image diagnosis. The contribution of this paper is three-fold: (1) A novel cross-scale MIL (CS-MIL) algorithm that integrates the multi-scale information and the inter-scale relationships is proposed; (2) A toy dataset with scale-specific morphological features is created and released to examine and visualize differential cross-scale attention; (3) Superior performance on both in-house and public datasets is demonstrated by our simple cross-scale MIL strategy. The official implementation is publicly available at https://github.com/hrlblab/CS-MIL.
△ Less
Submitted 16 February, 2024; v1 submitted 31 March, 2023;
originally announced April 2023.
-
Training Machine Learning Models to Characterize Temporal Evolution of Disadvantaged Communities
Authors:
Milan Jain,
Narmadha Meenu Mohankumar,
Heng Wan,
Sumitrra Ganguly,
Kyle D Wilson,
David M Anderson
Abstract:
Disadvantaged communities (DAC), as defined by the Justice40 initiative of the Department of Energy (DOE), USA, identifies census tracts across the USA to determine where benefits of climate and energy investments are or are not currently accruing. The DAC status not only helps in determining the eligibility for future Justice40-related investments but is also critical for exploring ways to achiev…
▽ More
Disadvantaged communities (DAC), as defined by the Justice40 initiative of the Department of Energy (DOE), USA, identifies census tracts across the USA to determine where benefits of climate and energy investments are or are not currently accruing. The DAC status not only helps in determining the eligibility for future Justice40-related investments but is also critical for exploring ways to achieve equitable distribution of resources. However, designing inclusive and equitable strategies not just requires a good understanding of current demographics, but also a deeper analysis of the transformations that happened in those demographics over the years. In this paper, machine learning (ML) models are trained on publicly available census data from recent years to classify the DAC status at the census tracts level and then the trained model is used to classify DAC status for historical years. A detailed analysis of the feature and model selection along with the evolution of disadvantaged communities between 2013 and 2018 is presented in this study.
△ Less
Submitted 7 March, 2023;
originally announced March 2023.
-
Optimizing Real-Time Performances for Timed-Loop Racing under F1TENTH
Authors:
Nitish Gupta,
Kurt Wilson,
Zhishan Guo
Abstract:
Motion planning and control in autonomous car racing are one of the most challenging and safety-critical tasks due to high speed and dynamism. The lower-level control nodes are expected to be highly optimized due to resource constraints of onboard embedded processing units, although there are strict latency requirements. Some of these guarantees can be provided at the application level, such as us…
▽ More
Motion planning and control in autonomous car racing are one of the most challenging and safety-critical tasks due to high speed and dynamism. The lower-level control nodes are expected to be highly optimized due to resource constraints of onboard embedded processing units, although there are strict latency requirements. Some of these guarantees can be provided at the application level, such as using ROS2's Real-Time executors. However, the performance can be far from satisfactory as many modern control algorithms (such as Model Predictive Control) rely on solving complicated online optimization problems at each iteration. In this paper, we present a simple yet effective multi-threading technique to optimize the throughput of online-control algorithms for resource-constrained autonomous racing platforms. We achieve this by maintaining a systematic pool of worker threads solving the optimization problem in parallel which can improve the system performance by reducing latency between control input commands. We further demonstrate the effectiveness of our method using the Model Predictive Contouring Control (MPCC) algorithm running on Nvidia's Xavier AGX platform.
△ Less
Submitted 8 December, 2022;
originally announced December 2022.
-
GLARE: A Dataset for Traffic Sign Detection in Sun Glare
Authors:
Nicholas Gray,
Megan Moraes,
Jiang Bian,
Alex Wang,
Allen Tian,
Kurt Wilson,
Yan Huang,
Haoyi Xiong,
Zhishan Guo
Abstract:
Real-time machine learning object detection algorithms are often found within autonomous vehicle technology and depend on quality datasets. It is essential that these algorithms work correctly in everyday conditions as well as under strong sun glare. Reports indicate glare is one of the two most prominent environment-related reasons for crashes. However, existing datasets, such as the Laboratory f…
▽ More
Real-time machine learning object detection algorithms are often found within autonomous vehicle technology and depend on quality datasets. It is essential that these algorithms work correctly in everyday conditions as well as under strong sun glare. Reports indicate glare is one of the two most prominent environment-related reasons for crashes. However, existing datasets, such as the Laboratory for Intelligent & Safe Automobiles Traffic Sign (LISA) Dataset and the German Traffic Sign Recognition Benchmark, do not reflect the existence of sun glare at all. This paper presents the GLARE (GLARE is available at: https://github.com/NicholasCG/GLARE_Dataset ) traffic sign dataset: a collection of images with U.S-based traffic signs under heavy visual interference by sunlight. GLARE contains 2,157 images of traffic signs with sun glare, pulled from 33 videos of dashcam footage of roads in the United States. It provides an essential enrichment to the widely used LISA Traffic Sign dataset. Our experimental study shows that although several state-of-the-art baseline architectures have demonstrated good performance on traffic sign detection in conditions without sun glare in the past, they performed poorly when tested against GLARE (e.g., average mAP0.5:0.95 of 19.4). We also notice that current architectures have better detection when trained on images of traffic signs in sun glare performance (e.g., average mAP0.5:0.95 of 39.6), and perform best when trained on a mixture of conditions (e.g., average mAP0.5:0.95 of 42.3).
△ Less
Submitted 13 December, 2023; v1 submitted 18 September, 2022;
originally announced September 2022.
-
Cross-scale Attention Guided Multi-instance Learning for Crohn's Disease Diagnosis with Pathological Images
Authors:
Ruining Deng,
Can Cui,
Lucas W. Remedios,
Shunxing Bao,
R. Michael Womick,
Sophie Chiron,
Jia Li,
Joseph T. Roland,
Ken S. Lau,
Qi Liu,
Keith T. Wilson,
Yaohong Wang,
Lori A. Coburn,
Bennett A. Landman,
Yuankai Huo
Abstract:
Multi-instance learning (MIL) is widely used in the computer-aided interpretation of pathological Whole Slide Images (WSIs) to solve the lack of pixel-wise or patch-wise annotations. Often, this approach directly applies "natural image driven" MIL algorithms which overlook the multi-scale (i.e. pyramidal) nature of WSIs. Off-the-shelf MIL algorithms are typically deployed on a single-scale of WSIs…
▽ More
Multi-instance learning (MIL) is widely used in the computer-aided interpretation of pathological Whole Slide Images (WSIs) to solve the lack of pixel-wise or patch-wise annotations. Often, this approach directly applies "natural image driven" MIL algorithms which overlook the multi-scale (i.e. pyramidal) nature of WSIs. Off-the-shelf MIL algorithms are typically deployed on a single-scale of WSIs (e.g., 20x magnification), while human pathologists usually aggregate the global and local patterns in a multi-scale manner (e.g., by zooming in and out between different magnifications). In this study, we propose a novel cross-scale attention mechanism to explicitly aggregate inter-scale interactions into a single MIL network for Crohn's Disease (CD), which is a form of inflammatory bowel disease. The contribution of this paper is two-fold: (1) a cross-scale attention mechanism is proposed to aggregate features from different resolutions with multi-scale interaction; and (2) differential multi-scale attention visualizations are generated to localize explainable lesion patterns. By training ~250,000 H&E-stained Ascending Colon (AC) patches from 20 CD patient and 30 healthy control samples at different scales, our approach achieved a superior Area under the Curve (AUC) score of 0.8924 compared with baseline models. The official implementation is publicly available at https://github.com/hrlblab/CS-MIL.
△ Less
Submitted 15 August, 2022;
originally announced August 2022.
-
Investigating Bayesian optimization for expensive-to-evaluate black box functions: Application in fluid dynamics
Authors:
Mike Diessner,
Joseph O'Connor,
Andrew Wynn,
Sylvain Laizet,
Yu Guan,
Kevin Wilson,
Richard D. Whalley
Abstract:
Bayesian optimization provides an effective method to optimize expensive-to-evaluate black box functions. It has been widely applied to problems in many fields, including notably in computer science, e.g. in machine learning to optimize hyperparameters of neural networks, and in engineering, e.g. in fluid dynamics to optimize control strategies that maximize drag reduction. This paper empirically…
▽ More
Bayesian optimization provides an effective method to optimize expensive-to-evaluate black box functions. It has been widely applied to problems in many fields, including notably in computer science, e.g. in machine learning to optimize hyperparameters of neural networks, and in engineering, e.g. in fluid dynamics to optimize control strategies that maximize drag reduction. This paper empirically studies and compares the performance and the robustness of common Bayesian optimization algorithms on a range of synthetic test functions to provide general guidance on the design of Bayesian optimization algorithms for specific problems. It investigates the choice of acquisition function, the effect of different numbers of training samples, the exact and Monte Carlo based calculation of acquisition functions, and both single-point and multi-point optimization. The test functions considered cover a wide selection of challenges and therefore serve as an ideal test bed to understand the performance of Bayesian optimization to specific challenges, and in general. To illustrate how these findings can be used to inform a Bayesian optimization setup tailored to a specific problem, two simulations in the area of computational fluid dynamics are optimized, giving evidence that suitable solutions can be found in a small number of evaluations of the objective function for complex, real problems. The results of our investigation can similarly be applied to other areas, such as machine learning and physical experiments, where objective functions are expensive to evaluate and their mathematical expressions are unknown.
△ Less
Submitted 20 December, 2022; v1 submitted 19 July, 2022;
originally announced July 2022.
-
Distance-Based Sound Separation
Authors:
Katharine Patterson,
Kevin Wilson,
Scott Wisdom,
John R. Hershey
Abstract:
We propose the novel task of distance-based sound separation, where sounds are separated based only on their distance from a single microphone. In the context of assisted listening devices, proximity provides a simple criterion for sound selection in noisy environments that would allow the user to focus on sounds relevant to a local conversation. We demonstrate the feasibility of this approach by…
▽ More
We propose the novel task of distance-based sound separation, where sounds are separated based only on their distance from a single microphone. In the context of assisted listening devices, proximity provides a simple criterion for sound selection in noisy environments that would allow the user to focus on sounds relevant to a local conversation. We demonstrate the feasibility of this approach by training a neural network to separate near sounds from far sounds in single channel synthetic reverberant mixtures, relative to a threshold distance defining the boundary between near and far. With a single nearby speaker and four distant speakers, the model improves scale-invariant signal to noise ratio by 4.4 dB for near sounds and 6.8 dB for far sounds.
△ Less
Submitted 1 July, 2022;
originally announced July 2022.
-
Deep Multi-modal Fusion of Image and Non-image Data in Disease Diagnosis and Prognosis: A Review
Authors:
Can Cui,
Haichun Yang,
Yaohong Wang,
Shilin Zhao,
Zuhayr Asad,
Lori A. Coburn,
Keith T. Wilson,
Bennett A. Landman,
Yuankai Huo
Abstract:
The rapid development of diagnostic technologies in healthcare is leading to higher requirements for physicians to handle and integrate the heterogeneous, yet complementary data that are produced during routine practice. For instance, the personalized diagnosis and treatment planning for a single cancer patient relies on the various images (e.g., radiological, pathological, and camera images) and…
▽ More
The rapid development of diagnostic technologies in healthcare is leading to higher requirements for physicians to handle and integrate the heterogeneous, yet complementary data that are produced during routine practice. For instance, the personalized diagnosis and treatment planning for a single cancer patient relies on the various images (e.g., radiological, pathological, and camera images) and non-image data (e.g., clinical data and genomic data). However, such decision-making procedures can be subjective, qualitative, and have large inter-subject variabilities. With the recent advances in multi-modal deep learning technologies, an increasingly large number of efforts have been devoted to a key question: how do we extract and aggregate multi-modal information to ultimately provide more objective, quantitative computer-aided clinical decision making? This paper reviews the recent studies on dealing with such a question. Briefly, this review will include the (1) overview of current multi-modal learning workflows, (2) summarization of multi-modal fusion methods, (3) discussion of the performance, (4) applications in disease diagnosis and prognosis, and (5) challenges and future directions.
△ Less
Submitted 26 January, 2023; v1 submitted 25 March, 2022;
originally announced March 2022.
-
Random Multi-Channel Image Synthesis for Multiplexed Immunofluorescence Imaging
Authors:
Shunxing Bao,
Yucheng Tang,
Ho Hin Lee,
Riqiang Gao,
Sophie Chiron,
Ilwoo Lyu,
Lori A. Coburn,
Keith T. Wilson,
Joseph T. Roland,
Bennett A. Landman,
Yuankai Huo
Abstract:
Multiplex immunofluorescence (MxIF) is an emerging imaging technique that produces the high sensitivity and specificity of single-cell mapping. With a tenet of 'seeing is believing', MxIF enables iterative staining and imaging extensive antibodies, which provides comprehensive biomarkers to segment and group different cells on a single tissue section. However, considerable depletion of the scarce…
▽ More
Multiplex immunofluorescence (MxIF) is an emerging imaging technique that produces the high sensitivity and specificity of single-cell mapping. With a tenet of 'seeing is believing', MxIF enables iterative staining and imaging extensive antibodies, which provides comprehensive biomarkers to segment and group different cells on a single tissue section. However, considerable depletion of the scarce tissue is inevitable from extensive rounds of staining and bleaching ('missing tissue'). Moreover, the immunofluorescence (IF) imaging can globally fail for particular rounds ('missing stain''). In this work, we focus on the 'missing stain' issue. It would be appealing to develop digital image synthesis approaches to restore missing stain images without losing more tissue physically. Herein, we aim to develop image synthesis approaches for eleven MxIF structural molecular markers (i.e., epithelial and stromal) on real samples. We propose a novel multi-channel high-resolution image synthesis approach, called pixN2N-HD, to tackle possible missing stain scenarios via a high-resolution generative adversarial network (GAN). Our contribution is three-fold: (1) a single deep network framework is proposed to tackle missing stain in MxIF; (2) the proposed 'N-to-N' strategy reduces theoretical four years of computational time to 20 hours when covering all possible missing stains scenarios, with up to five missing stains (e.g., '(N-1)-to-1', '(N-2)-to-2'); and (3) this work is the first comprehensive experimental study of investigating cross-stain synthesis in MxIF. Our results elucidate a promising direction of advancing MxIF imaging with deep image synthesis.
△ Less
Submitted 18 September, 2021;
originally announced September 2021.
-
End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings
Authors:
Soumi Maiti,
Hakan Erdogan,
Kevin Wilson,
Scott Wisdom,
Shinji Watanabe,
John R. Hershey
Abstract:
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers,…
▽ More
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multi-task transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data.
△ Less
Submitted 5 May, 2021;
originally announced May 2021.
-
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Authors:
Quan Wang,
Ignacio Lopez Moreno,
Mert Saglam,
Kevin Wilson,
Alan Chiao,
Renjie Liu,
Yanzhang He,
Wei Li,
Jason Pelecanos,
Marily Nika,
Alexander Gruenstein
Abstract:
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: It should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance unde…
▽ More
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: It should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance under all other acoustic conditions. Besides, this model must be tiny, fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery and latency. We propose novel techniques to meet these multi-faceted requirements, including using a new asymmetric loss, and adopting adaptive runtime suppression strength. We also show that such a model can be quantized as a 8-bit integer model and run in realtime.
△ Less
Submitted 9 September, 2020;
originally announced September 2020.
-
Unsupervised Sound Separation Using Mixture Invariant Training
Authors:
Scott Wisdom,
Efthymios Tzinis,
Hakan Erdogan,
Ron J. Weiss,
Kevin Wilson,
John R. Hershey
Abstract:
In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon…
▽ More
In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the acoustic conditions and distribution of sources. The acoustic properties can be challenging to accurately simulate, and the distribution of sound types may be hard to replicate. In this paper, we propose a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures. In MixIT, training examples are constructed by mixing together existing mixtures, and the model separates them into a variable number of latent sources, such that the separated sources can be remixed to approximate the original mixtures. We show that MixIT can achieve competitive performance compared to supervised methods on speech separation. Using MixIT in a semi-supervised learning setting enables unsupervised domain adaptation and learning from large amounts of real world data without ground-truth source waveforms. In particular, we significantly improve reverberant speech separation performance by incorporating reverberant mixtures, train a speech enhancement system from noisy mixtures, and improve universal sound separation by incorporating a large amount of in-the-wild data.
△ Less
Submitted 23 October, 2020; v1 submitted 22 June, 2020;
originally announced June 2020.
-
On the Distribution of Minima in Intrinsic-Metric Rotation Averaging
Authors:
Kyle Wilson,
David Bindel
Abstract:
Rotation Averaging is a non-convex optimization problem that determines orientations of a collection of cameras from their images of a 3D scene. The problem has been studied using a variety of distances and robustifiers. The intrinsic (or geodesic) distance on SO(3) is geometrically meaningful; but while some extrinsic distance-based solvers admit (conditional) guarantees of correctness, no compar…
▽ More
Rotation Averaging is a non-convex optimization problem that determines orientations of a collection of cameras from their images of a 3D scene. The problem has been studied using a variety of distances and robustifiers. The intrinsic (or geodesic) distance on SO(3) is geometrically meaningful; but while some extrinsic distance-based solvers admit (conditional) guarantees of correctness, no comparable results have been found under the intrinsic metric.
In this paper, we study the spatial distribution of local minima. First, we do a novel empirical study to demonstrate sharp transitions in qualitative behavior: as problems become noisier, they transition from a single (easy-to-find) dominant minimum to a cost surface filled with minima. In the second part of this paper we derive a theoretical bound for when this transition occurs. This is an extension of the results of [24], which used local convexity as a proxy to study the difficulty of problem. By recognizing the underlying quotient manifold geometry of the problem we achieve an n-fold improvement over prior work. Incidentally, our analysis also extends the prior $l_2$ work to general $l_p$ costs. Our results suggest using algebraic connectivity as an indicator of problem difficulty.
△ Less
Submitted 18 March, 2020;
originally announced March 2020.
-
Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement
Authors:
Zhong-Qiu Wang,
Hakan Erdogan,
Scott Wisdom,
Kevin Wilson,
Desh Raj,
Shinji Watanabe,
Zhuo Chen,
John R. Hershey
Abstract:
This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation. Our neural networks for separation use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, incl…
▽ More
This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation. Our neural networks for separation use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, including factorizing the spatial covariance into a time-varying amplitude component and a time-invariant spatial component, as well as using block-based techniques. In addition, we introduce a multi-frame beamforming method which improves the results significantly by adding contextual frames to the beamforming formulations. We extensively evaluate and analyze the effects of window size, block size, and multi-frame context size for these methods. Our best method utilizes a sequence of three neural separation and multi-frame time-invariant spatial beamforming stages, and demonstrates an average improvement of 2.75 dB in scale-invariant signal-to-noise ratio and 14.2% absolute reduction in a comparative speech recognition metric across four challenging reverberant speech enhancement and separation tasks. We also use our three-speaker separation model to separate real recordings in the LibriCSS evaluation set into non-overlapping tracks, and achieve a better word error rate as compared to a baseline mask based beamformer.
△ Less
Submitted 3 November, 2020; v1 submitted 18 November, 2019;
originally announced November 2019.
-
Fully-automated patient-level malaria assessment on field-prepared thin blood film microscopy images, including Supplementary Information
Authors:
Charles B. Delahunt,
Mayoore S. Jaiswal,
Matthew P. Horning,
Samantha Janko,
Clay M. Thompson,
Sourabh Kulhare,
Liming Hu,
Travis Ostbye,
Grace Yun,
Roman Gebrehiwot,
Benjamin K. Wilson,
Earl Long,
Stephane Proux,
Dionicia Gamboa,
Peter Chiodini,
Jane Carter,
Mehul Dhorda,
David Isaboke,
Bernhards Ogutu,
Wellington Oyibo,
Elizabeth Villasis,
Kyaw Myo Tun,
Christine Bachman,
David Bell,
Courosh Mehanian
Abstract:
Malaria is a life-threatening disease affecting millions. Microscopy-based assessment of thin blood films is a standard method to (i) determine malaria species and (ii) quantitate high-parasitemia infections. Full automation of malaria microscopy by machine learning (ML) is a challenging task because field-prepared slides vary widely in quality and presentation, and artifacts often heavily outnumb…
▽ More
Malaria is a life-threatening disease affecting millions. Microscopy-based assessment of thin blood films is a standard method to (i) determine malaria species and (ii) quantitate high-parasitemia infections. Full automation of malaria microscopy by machine learning (ML) is a challenging task because field-prepared slides vary widely in quality and presentation, and artifacts often heavily outnumber relatively rare parasites. In this work, we describe a complete, fully-automated framework for thin film malaria analysis that applies ML methods, including convolutional neural nets (CNNs), trained on a large and diverse dataset of field-prepared thin blood films. Quantitation and species identification results are close to sufficiently accurate for the concrete needs of drug resistance monitoring and clinical use-cases on field-prepared samples. We focus our methods and our performance metrics on the field use-case requirements. We discuss key issues and important metrics for the application of ML methods to malaria microscopy.
△ Less
Submitted 11 September, 2022; v1 submitted 5 August, 2019;
originally announced August 2019.
-
Universal Sound Separation
Authors:
Ilya Kavalerov,
Scott Wisdom,
Hakan Erdogan,
Brian Patton,
Kevin Wilson,
Jonathan Le Roux,
John R. Hershey
Abstract:
Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a…
▽ More
Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
△ Less
Submitted 2 August, 2019; v1 submitted 8 May, 2019;
originally announced May 2019.
-
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
Authors:
Scott Wisdom,
John R. Hershey,
Kevin Wilson,
Jeremy Thorpe,
Michael Chinen,
Brian Patton,
Rif A. Saurous
Abstract:
In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglec…
▽ More
In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks. In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.
△ Less
Submitted 20 November, 2018;
originally announced November 2018.
-
Exploring Tradeoffs in Models for Low-latency Speech Enhancement
Authors:
Kevin Wilson,
Michael Chinen,
Jeremy Thorpe,
Brian Patton,
John Hershey,
Rif A. Saurous,
Jan Skoglund,
Richard F. Lyon
Abstract:
We explore a variety of neural networks configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such as non-causal look-ahead, computation, and parameter count versus enhancement performance and…
▽ More
We explore a variety of neural networks configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such as non-causal look-ahead, computation, and parameter count versus enhancement performance and find that zero-look-ahead models can achieve, on average, within 0.03 dB SDR of our best bidirectional model. Further, we find that 200 milliseconds of look-ahead is sufficient to achieve equivalent performance to our best bidirectional model.
△ Less
Submitted 16 November, 2018;
originally announced November 2018.
-
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Authors:
Quan Wang,
Hannah Muckenhirn,
Kevin Wilson,
Prashant Sridhar,
Zelin Wu,
John Hershey,
Rif A. Saurous,
Ron J. Weiss,
Ye Jia,
Ignacio Lopez Moreno
Abstract:
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embe…
▽ More
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
△ Less
Submitted 19 June, 2019; v1 submitted 10 October, 2018;
originally announced October 2018.
-
AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies
Authors:
Sourish Chaudhuri,
Joseph Roth,
Daniel P. W. Ellis,
Andrew Gallagher,
Liat Kaver,
Radhika Marvin,
Caroline Pantofaru,
Nathan Reale,
Loretta Guarino Reid,
Kevin Wilson,
Zhonghua Xi
Abstract:
Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings, often tailored toward end applications. However, much of the prior work reports results in synthetic settings, on task-specific datasets, or…
▽ More
Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings, often tailored toward end applications. However, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches and understand their strengths and weaknesses. In this paper, we describe a new dataset which we will release publicly containing densely labeled speech activity in YouTube videos, with the goal of creating a shared, available dataset for this task. The labels in the dataset annotate three different speech activity conditions: clean speech, speech co-occurring with music, and speech co-occurring with noise, which enable analysis of model performance in more challenging conditions based on the presence of overlapping noise. We report benchmark performance numbers on AVA-Speech using off-the-shelf, state-of-the-art audio and vision models that serve as a baseline to facilitate future research.
△ Less
Submitted 23 August, 2018; v1 submitted 1 August, 2018;
originally announced August 2018.
-
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Authors:
Ariel Ephrat,
Inbar Mosseri,
Oran Lang,
Tali Dekel,
Kevin Wilson,
Avinatan Hassidim,
William T. Freeman,
Michael Rubinstein
Abstract:
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and aud…
▽ More
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).
△ Less
Submitted 9 August, 2018; v1 submitted 10 April, 2018;
originally announced April 2018.
-
AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
Authors:
Brian Patton,
Yannis Agiomyrgiannakis,
Michael Terry,
Kevin Wilson,
Rif A. Saurous,
D. Sculley
Abstract:
Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled hum…
▽ More
Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled human ratings, as shown by Pearson and Spearman correlations. When multiple utterances are scored and averaged, a scenario common in synthesizer quality assessment, AutoMOS achieves correlations approaching those of human raters. The AutoMOS model has a number of applications, such as the ability to explore the parameter space of a speech synthesizer without requiring a human-in-the-loop.
△ Less
Submitted 28 November, 2016;
originally announced November 2016.
-
CNN Architectures for Large-Scale Audio Classification
Authors:
Shawn Hershey,
Sourish Chaudhuri,
Daniel P. W. Ellis,
Jort F. Gemmeke,
Aren Jansen,
R. Channing Moore,
Manoj Plakal,
Devin Platt,
Rif A. Saurous,
Bryan Seybold,
Malcolm Slaney,
Ron J. Weiss,
Kevin Wilson
Abstract:
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying th…
▽ More
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
△ Less
Submitted 10 January, 2017; v1 submitted 29 September, 2016;
originally announced September 2016.
-
Back to the Basics: Bayesian extensions of IRT outperform neural networks for proficiency estimation
Authors:
Kevin H. Wilson,
Yan Karklin,
Bojian Han,
Chaitanya Ekanadham
Abstract:
Estimating student proficiency is an important task for computer based learning systems. We compare a family of IRT-based proficiency estimation methods to Deep Knowledge Tracing (DKT), a recently proposed recurrent neural network model with promising initial results. We evaluate how well each model predicts a student's future response given previous responses using two publicly available and one…
▽ More
Estimating student proficiency is an important task for computer based learning systems. We compare a family of IRT-based proficiency estimation methods to Deep Knowledge Tracing (DKT), a recently proposed recurrent neural network model with promising initial results. We evaluate how well each model predicts a student's future response given previous responses using two publicly available and one proprietary data set. We find that IRT-based methods consistently matched or outperformed DKT across all data sets at the finest level of content granularity that was tractable for them to be trained on. A hierarchical extension of IRT that captured item grouping structure performed best overall. When data sets included non-trivial autocorrelations in student response patterns, a temporal extension of IRT improved performance over standard IRT while the RNN-based method did not. We conclude that IRT-based models provide a simpler, better-performing alternative to existing RNN-based models of student interaction data while also affording more interpretability and guarantees due to their formulation as Bayesian probabilistic models.
△ Less
Submitted 21 May, 2016; v1 submitted 8 April, 2016;
originally announced April 2016.
-
Spectrally and Energy Efficient OFDM (SEE-OFDM) for Intensity Modulated Optical Wireless Systems
Authors:
Emily Lam,
Sarah Kate Wilson,
Hany Elgala,
Thomas D. C. Little
Abstract:
Spectrally and energy efficient orthogonal frequency division multiplexing (SEE-OFDM) is an optical OFDM technique based on combining multiple asymmetrically clipped optical OFDM (ACO-OFDM) signals into one OFDM signal. By summing different components together, SEE-OFDM can achieve the same spectral efficiency as DC-biased optical OFDM (DCO-OFDM) without an energy-inefficient DC-bias. This paper i…
▽ More
Spectrally and energy efficient orthogonal frequency division multiplexing (SEE-OFDM) is an optical OFDM technique based on combining multiple asymmetrically clipped optical OFDM (ACO-OFDM) signals into one OFDM signal. By summing different components together, SEE-OFDM can achieve the same spectral efficiency as DC-biased optical OFDM (DCO-OFDM) without an energy-inefficient DC-bias. This paper introduces multiple methods for decoding a SEE-OFDM symbol and shows that an iterative decoder with hard decisions gives the best performance. Being a multi-component format, different energy allocation amongst the different components of SEE-OFDM is possible. However, equal energy allocation performs 1.5 dB better than unequal energy allocation. A hard-decision, iterative subtraction receiver can further increase performance by another 1.5 dB over soft-decision subtraction and reconstruction receivers. SEE-OFDM consistently performs 3 dB or better and with higher spectral efficiency than ACO-OFDM at the same bit-error-rate (BER). Comparing other combination methods at the same BER, SEE-OFDM performs up to 3 dB better than hybrid asymmetrically clipped optical (OFDM) (HACO-OFDM) and up to 1.5 dB better than asymmetrically and symmetrically clipped optical OFDM (ASCO-OFDM) and enhanced unipolar OFDM (eU-OFDM) when using hard decisions at the receiver. Additionally, SEE-OFDM has the best peak-to-average-power rate (PAPR) as compared to the other combination OFDM formats and ACO-OFDM, which makes it excellent for any range limited optical source, such as laser diodes and light-emitting diodes (LEDs). In summary, SEE-OFDM is shown to have excellent properties to glean additional capacity from an intensity modulation and direct detection (IM/DD) optical wireless communications system.
△ Less
Submitted 27 October, 2015;
originally announced October 2015.