
Interdisciplinary Expertise to Advance Equitable Explainable AI

Authors: Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Hammel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles

Abstract: The field of artificial intelligence (AI) is rapidly influencing health and healthcare, but bias and poor performance persists for populations who face widespread structural oppression. Previous work has clearly outlined the need for more rigorous attention to data representativeness and model performance to advance equity and reduce bias. However, there is an opportunity to also improve the explainability of AI by leveraging best practices of social epidemiology and health equity to help us develop hypotheses for associations found. In this paper, we focus on explainable AI (XAI) and describe a framework for interdisciplinary expert panel review to discuss and critically assess AI model explanations from multiple perspectives and identify areas of bias and directions for future research. We emphasize the importance of the interdisciplinary expert panel to produce more accurate, equitable interpretations which are historically and contextually informed. Interdisciplinary panel discussions can help reduce bias, identify potential confounders, and identify opportunities for additional research where there are gaps in the literature. In turn, these insights can suggest opportunities for AI model improvement.

Submitted 29 May, 2024; originally announced June 2024.

arXiv:2405.14655 [pdf, other]

Multi-turn Reinforcement Learning from Preference Human Feedback

Authors: Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos

Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to achieve a long-term goal. In this paper, we address this issue by developing novel methods for Reinforcement Learning (RL) from preference feedback between two full multi-turn conversations. In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to Nash equilibrium. To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2403.10578 [pdf, other]

Generative Modelling of Stochastic Rotating Shallow Water Noise

Authors: Dan Crisan, Oana Lang, Alexander Lobbe

Abstract: In recent work, the authors have developed a generic methodology for calibrating the noise in fluid dynamics stochastic partial differential equations where the stochasticity was introduced to parametrize subgrid-scale processes. The stochastic parameterization of sub-grid scale processes is required in the estimation of uncertainty in weather and climate predictions, to represent systematic model errors arising from subgrid-scale fluctuations. The previous methodology used a principal component analysis (PCA) technique based on the ansatz that the increments of the stochastic parametrization are normally distributed. In this paper, the PCA technique is replaced by a generative model technique. This enables us to avoid imposing additional constraints on the increments. The methodology is tested on a stochastic rotating shallow water model with the elevation variable of the model used as input data. The numerical simulations show that the noise is indeed non-Gaussian. The generative modelling technology gives good RMSE, CRPS score and forecast rank histogram results.

Submitted 15 March, 2024; originally announced March 2024.

MSC Class: 68T05; 76M35

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2308.12591 [pdf, other]

SICNN: Soft Interference Cancellation Inspired Neural Network Equalizers

Authors: Stefan Baumgartner, Oliver Lang, Mario Huemer

Abstract: In recent years data-driven machine learning approaches have been extensively studied to replace or enhance traditionally model-based processing in digital communication systems. In this work, we focus on equalization and propose a novel neural network (NN-)based approach, referred to as SICNN. SICNN is designed by deep unfolding a model-based iterative soft interference cancellation (SIC) method. It eliminates the main disadvantages of its model-based counterpart, which suffers from high computational complexity and performance degradation due to required approximations. We present different variants of SICNN. SICNNv1 is specifically tailored to single carrier frequency domain equalization (SC-FDE) systems, the communication system mainly regarded in this work. SICNNv2 is more universal and is applicable as an equalizer in any communication system with a block-based data transmission scheme. Moreover, for both SICNNv1 and SICNNv2, we present versions with highly reduced numbers of learnable parameters. Another contribution of this work is a novel approach for generating training datasets for NN-based equalizers, which significantly improves their performance at high signal-to-noise ratios. We compare the bit error ratio performance of the proposed NN-based equalizers with state-of-the-art model-based and NN-based approaches, highlighting the superiority of SICNNv1 over all other methods for SC-FDE. Exemplarily, to emphasize its universality, SICNNv2 is additionally applied to a unique word orthogonal frequency division multiplexing (UW-OFDM) system, where it achieves state-of-the-art performance. Furthermore, we present a thorough complexity analysis of the proposed NN-based equalization approaches, and we investigate the influence of the training set size on the performance of NN-based equalizers.

Submitted 11 March, 2024; v1 submitted 24 August, 2023; originally announced August 2023.

arXiv:2306.16428 [pdf, ps, other]

Complex-valued Adaptive System Identification via Low-Rank Tensor Decomposition

Authors: Oliver Ploder, Christina Auer, Oliver Lang, Thomas Paireder, Mario Huemer

Abstract: Machine learning (ML) and tensor-based methods have been of significant interest for the scientific community for the last few decades. In a previous work we presented a novel tensor-based system identification framework to ease the computational burden of tensor-only architectures while still being able to achieve exceptionally good performance. However, the derived approach only allows to process real-valued problems and is therefore not directly applicable on a wide range of signal processing and communications problems, which often deal with complex-valued systems. In this work we therefore derive two new architectures to allow the processing of complex-valued signals, and show that these extensions are able to surpass the trivial, complex-valued extension of the original architecture in terms of performance, while only requiring a slight overhead in computational resources to allow for complex-valued operations.

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2306.00985 [pdf]

doi 10.1016/j.ebiom.2024.105075

Using generative AI to investigate medical imagery models and datasets

Authors: Oran Lang, Doron Yaya-Stupp, Ilana Traynis, Heather Cole-Lewis, Chloe R. Bennett, Courtney Lyles, Charles Lau, Michal Irani, Christopher Semturs, Dale R. Webster, Greg S. Corrado, Avinatan Hassidim, Yossi Matias, Yun Liu, Naama Hammel, Boris Babenko

Abstract: AI models have shown promise in many medical imaging tasks. However, our ability to explain what signals these models have learned is severely lacking. Explanations are needed in order to increase the trust in AI-based models, and could enable novel scientific discovery by uncovering signals in the data that are not yet known to experts. In this paper, we present a method for automatic visual explanations leveraging team-based expertise by generating hypotheses of what visual signals in the images are correlated with the task. We propose the following 4 steps: (i) Train a classifier to perform a given task (ii) Train a classifier guided StyleGAN-based image generator (StylEx) (iii) Automatically detect and visualize the top visual attributes that the classifier is sensitive towards (iv) Formulate hypotheses for the underlying mechanisms, to stimulate future research. Specifically, we present the discovered attributes to an interdisciplinary panel of experts so that hypotheses can account for social and structural determinants of health. We demonstrate results on eight prediction tasks across three medical imaging modalities: retinal fundus photographs, external eye photographs, and chest radiographs. We showcase examples of attributes that capture clinically known features, confounders that arise from factors beyond physiological mechanisms, and reveal a number of physiologically plausible novel attributes. Our approach has the potential to enable researchers to better understand, improve their assessment, and extract new knowledge from AI-based models. Importantly, we highlight that attributes generated by our framework can capture phenomena beyond physiology or pathophysiology, reflecting the real world nature of healthcare delivery and socio-cultural factors. Finally, we intend to release code to enable researchers to train their own StylEx models and analyze their predictive tasks.

Submitted 4 July, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: 43 pages, 1 figure

Journal ref: EBioMedicine 102 (2024)

arXiv:2306.00966 [pdf, other]

The Hidden Language of Diffusion Models

Authors: Hila Chefer, Oran Lang, Mor Geva, Volodymyr Polosukhin, Assaf Shocher, Michal Irani, Inbar Mosseri, Lior Wolf

Abstract: Text-to-image diffusion models have demonstrated an unparalleled ability to generate high-quality, diverse images from a textual prompt. However, the internal representations learned by these models remain an enigma. In this work, we present Conceptor, a novel method to interpret the internal representation of a textual concept by a diffusion model. This interpretation is obtained by decomposing the concept into a small set of human-interpretable textual elements. Applied over the state-of-the-art Stable Diffusion model, Conceptor reveals non-trivial structures in the representations of concepts. For example, we find surprising visual connections between concepts, that transcend their textual semantics. We additionally discover concepts that rely on mixtures of exemplars, biases, renowned artistic styles, or a simultaneous fusion of multiple meanings of the concept. Through a large battery of experiments, we demonstrate Conceptor's ability to provide meaningful, robust, and faithful decompositions for a wide variety of abstract, concrete, and complex textual concepts, while allowing to naturally connect each decomposition element to its corresponding visual impact on the generated images. Our code will be available at: https://hila-chefer.github.io/Conceptor/

Submitted 5 October, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.10400 [pdf, other]

What You See is What You Read? Improving Text-Image Alignment Evaluation

Authors: Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor

Abstract: Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.

Submitted 26 December, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: Accepted to NeurIPS 2023. Website: https://wysiwyr-itm.github.io/

arXiv:2303.12438 [pdf, ps, other]

Doppler-Division Multiplexing for MIMO OFDM Joint Sensing and Communications

Authors: Oliver Lang, Christian Hofbauer, Reinhard Feger, Mario Huemer

Abstract: A promising waveform candidate for future joint sensing and communication systems is orthogonal frequency-division multiplexing (OFDM). For such systems, supporting multiple transmit antennas requires multiplexing methods for the generation of orthogonal transmit signals, where equidistant subcarrier interleaving (ESI) is the most popular multiplexing method. In this work, we analyze a multiplexing method called Doppler-division multiplexing (DDM). This method applies a phase shift from OFDM symbol to OFDM symbol to separate signals transmitted by different Tx antennas along the velocity axis of the range-Doppler map. While general properties of DDM for the task of radar sensing are analyzed in this work, the main focus lies on the implications of DDM on the communication task. It will be shown that for DDM, the channels observed in the communication receiver are heavily time-varying, preventing any meaningful transmission of data when not taken into account. In this work, a communication system designed to combat these time-varying channels is proposed, which includes methods for data estimation, synchronization, and channel estimation. Bit error ratio (BER) simulations demonstrate the superiority of this communications system compared to a system utilizing ESI.

Submitted 22 March, 2023; originally announced March 2023.

Comments: 13 pages, 11 figures

arXiv:2211.06054 [pdf, other]

Neural Network Approaches for Data Estimation in Unique Word OFDM Systems

Authors: Stefan Baumgartner, Gergő Bognár, Oliver Lang, Mario Huemer

Abstract: Data estimation is conducted with model-based estimation methods since the beginning of digital communications. However, motivated by the growing success of machine learning, current research focuses on replacing model-based data estimation methods by data-driven approaches, mainly neural networks (NNs). In this work, we particularly investigate the incorporation of existing model knowledge into data-driven approaches, which is expected to lead to complexity reduction and/or performance enhancement. We describe three different options, namely "model-inspired" pre-processing, choosing an NN architecture motivated by the properties of the underlying communication system, and inferring the layer structure of an NN with the help of model knowledge. Most of the current publications on NN-based data estimation deal with general multiple-input multiple-output communication (MIMO) systems. In this work, we investigate NN-based data estimation for so-called unique word orthogonal frequency division multiplexing (UW-OFDM) systems. We highlight differences between UW-OFDM systems and general MIMO systems one has to be aware of when using NNs for data estimation, and we introduce measures for successful utilization of NN-based data estimators in UW-OFDM systems. Further, we investigate the use of NNs for data estimation when channel coded data transmission is conducted, and we present adaptions to be made, such that NN-based data estimators provide satisfying performance for this case. We compare the presented NNs concerning achieved bit error ratio performance and computational complexity, we show the peculiar distributions of their data estimates, and we also point out their downsides compared to model-based equalizers.

Submitted 11 November, 2022; originally announced November 2022.

arXiv:2210.09276 [pdf, other]

Imagic: Text-Based Real Image Editing with Diffusion Models

Authors: Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani

Abstract: Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently either limited to specific editing types (e.g., object overlay, style transfer), or apply to synthetically generated images, or require multiple input images of a common object. In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image. For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc. -- each within its single high-resolution natural image provided by the user. Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Our method, which we call "Imagic", leverages a pre-trained text-to-image diffusion model for this task. It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework.

Submitted 20 March, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: Project page: https://imagic-editing.github.io/

arXiv:2202.12211 [pdf, other]

Self-Distilled StyleGAN: Towards Generation from Internet Photos

Authors: Ron Mokady, Michal Yarom, Omer Tov, Oran Lang, Daniel Cohen-Or, Tali Dekel, Michal Irani, Inbar Mosseri

Abstract: StyleGAN is known to produce high-fidelity images, while also offering unprecedented semantic editing. However, these fascinating abilities have been demonstrated only on a limited set of datasets, which are usually structurally aligned and well curated. In this paper, we show how StyleGAN can be adapted to work on raw uncurated images collected from the Internet. Such image collections impose two main challenges to StyleGAN: they contain many outlier images, and are characterized by a multi-modal distribution. Training StyleGAN on such raw image collections results in degraded image synthesis quality. To meet these challenges, we proposed a StyleGAN-based self-distillation approach, which consists of two main components: (i) A generative-based self-filtering of the dataset to eliminate outlier images, in order to generate an adequate training set, and (ii) Perceptual clustering of the generated images to detect the inherent data modalities, which are then employed to improve StyleGAN's "truncation trick" in the image synthesis process. The presented technique enables the generation of high-quality images, while minimizing the loss in diversity of the data. Through qualitative and quantitative evaluation, we demonstrate the power of our approach to new challenging and diverse domains collected from the Internet. New datasets and pre-trained models are available at https://self-distilled-stylegan.github.io/ .

Submitted 24 February, 2022; originally announced February 2022.

arXiv:2104.13369 [pdf, other]

Explaining in Style: Training a GAN to explain a classifier in StyleSpace

Authors: Oran Lang, Yossi Gandelsman, Michal Yarom, Yoav Wald, Gal Elidan, Avinatan Hassidim, William T. Freeman, Phillip Isola, Amir Globerson, Michal Irani, Inbar Mosseri

Abstract: Image classification models can depend on multiple different semantic attributes of the image. An explanation of the decision of the classifier needs to both discover and visualize these properties. Here we present StylEx, a method for doing this, by training a generative model to specifically explain multiple attributes that underlie classifier decisions. A natural source for such attributes is the StyleSpace of StyleGAN, which is known to generate semantically meaningful dimensions in the image. However, because standard GAN training is not dependent on the classifier, it may not represent these attributes which are important for the classifier decision, and the dimensions of StyleSpace may represent irrelevant attributes. To overcome this, we propose a training procedure for a StyleGAN, which incorporates the classifier model, in order to learn a classifier-specific StyleSpace. Explanatory attributes are then selected from this space. These can be used to visualize the effect of changing multiple attributes per image, thus providing image-specific explanations. We apply StylEx to multiple domains, including animals, leaves, faces and retinal images. For these, we show how an image can be modified in different ways to change its classifier output. Our results show that the method finds attributes that align well with semantic ones, generate meaningful image-specific explanations, and are human-interpretable as measured in user-studies.

Submitted 1 September, 2021; v1 submitted 27 April, 2021; originally announced April 2021.

Comments: Accepted to ICCV 2021. Project page: https://explaining-in-style.github.io/, Code: https://github.com/google/explaining-in-style

arXiv:2004.06130 [pdf, other]

SpeedNet: Learning the Speediness in Videos

Authors: Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Michal Irani, Tali Dekel

Abstract: We wish to automatically predict the "speediness" of moving objects in videos---whether they move faster, at, or slower than their "natural" speed. The core component in our approach is SpeedNet---a novel deep network trained to detect if a video is playing at normal rate, or if it is sped up. SpeedNet is trained on a large corpus of natural videos in a self-supervised manner, without requiring any manual annotations. We show how this single, binary classification network can be used to detect arbitrary rates of speediness of objects. We demonstrate prediction results by SpeedNet on a wide range of videos containing complex natural motions, and examine the visual cues it utilizes for making those predictions. Importantly, we show that through predicting the speed of videos, the model learns a powerful and meaningful space-time representation that goes beyond simple motion cues. We demonstrate how those learned features can boost the performance of self-supervised action recognition, and can be used for video retrieval. Furthermore, we also apply SpeedNet for generating time-varying, adaptive video speedups, which can allow viewers to watch videos faster, but with less of the jittery, unnatural motions typical to videos that are sped up uniformly.

Submitted 26 July, 2020; v1 submitted 13 April, 2020; originally announced April 2020.

Comments: Accepted to CVPR 2020 (oral). Project webpage: http://speednet-cvpr20.github.io

arXiv:2002.12764 [pdf, other]

doi 10.21437/Interspeech.2020-1242

Towards Learning a Universal Non-Semantic Representation of Speech

Authors: Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Felix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, Yinnon Haviv

Abstract: The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained for different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and proposes a representation based on an unsupervised triplet-loss objective. The proposed representation outperforms other representations on the benchmark, and even exceeds state-of-the-art performance on a number of transfer learning tasks. The embedding is trained on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and medical domain. The benchmark, models, and evaluation code are publicly released.

Submitted 6 August, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

Journal ref: Proceedings of INTERSPEECH 2020

arXiv:1907.13511 [pdf, other]

doi 10.21437/Interspeech.2019-1427

Personalizing ASR for Dysarthric and Accented Speech with Limited Data

Authors: Joel Shor, Dotan Emanuel, Oran Lang, Omry Tuval, Michael Brenner, Julie Cattiau, Fernando Vieira, Maeve McNally, Taylor Charbonneau, Melissa Nollstadt, Avinatan Hassidim, Yossi Matias

Abstract: Automatic speech recognition (ASR) systems have dramatically improved over the last few years. ASR systems are most often trained from 'typical' speech, which means that underrepresented groups don't experience the same level of improvement. In this paper, we present and evaluate finetuning techniques to improve ASR for users with non-standard speech. We focus on two types of non-standard speech: speech from people with amyotrophic lateral sclerosis (ALS) and accented speech. We train personalized models that achieve 62% and 35% relative WER improvement on these two groups, bringing the absolute WER for ALS speakers, on a test set of message bank phrases, down to 10% for mild dysarthria and 20% for more serious dysarthria. We show that 71% of the improvement comes from only 5 minutes of training data. Finetuning a particular subset of layers (with many fewer parameters) often gives better results than finetuning the entire model. This is the first step towards building state of the art ASR models for dysarthric speech.

Submitted 31 July, 2019; originally announced July 2019.

Comments: 5 pages

arXiv:1804.03619 [pdf, other]

doi 10.1145/3197517.3201357

Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

Authors: Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein

Abstract: We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).

Submitted 9 August, 2018; v1 submitted 10 April, 2018; originally announced April 2018.

Comments: Accepted to SIGGRAPH 2018. Project webpage: https://looking-to-listen.github.io

Journal ref: ACM Trans. Graph. 37(4): 112:1-112:11 (2018)

arXiv:1612.04059 [pdf, ps, other]

Parameter Estimation Under Model Uncertainties by Iterative Covariance Approximation

Authors: Oliver Lang, Michael Lunglmayr, Mario Huemer

Abstract: We propose a novel iterative algorithm for estimating a deterministic but unknown parameter vector in the presence of model uncertainties. This iterative algorithm is based on a system model where an overall noise term describes both the measurement noise and the noise resulting from the model uncertainties. This overall noise term is a function of the true parameter vector, allowing for an iterative algorithm. The proposed algorithm can be applied to structured as well as unstructured models, and it outperforms prior-art algorithms for a broad range of applications.

Submitted 23 November, 2017; v1 submitted 13 December, 2016; originally announced December 2016.

Showing 1–19 of 19 results for author: Lang, O