Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech

Authors: Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Paula Andrea Pérez-Toro, Maria Schuster, Elmar Noeth, Bjoern Heismann, Andreas Maier, Seung Hee Yang

Abstract: This paper introduces a novel combination of two tasks, previously treated separately: acoustic-to-articulatory speech inversion (AAI) and phoneme-to-articulatory (PTA) motion estimation. We refer to this joint task as acoustic phoneme-to-articulatory speech inversion (APTAI) and explore two different approaches, both working speaker- and text-independently during inference. We use a multi-task learning setup, with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. While both proposed approaches share these same requirements, they differ in their way of achieving phoneme-related predictions: one is based on frame classification, the other on a two-staged training procedure and forced alignment. We reach competitive performance of 0.73 mean correlation for the AAI task and achieve up to approximately 87% frame overlap compared to a state-of-the-art text-dependent phoneme force aligner.

Submitted 3 July, 2024; originally announced July 2024.

Comments: to be published in Interspeech 2024 proceedings

arXiv:2406.14576 [pdf, other]

Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis

Authors: Kubilay Can Demir, Belen Lojo Rodriguez, Tobias Weise, Andreas Maier, Seung Hee Yang

Abstract: To develop intelligent speech assistants and integrate them seamlessly with intra-operative decision-support frameworks, accurate and efficient surgical phase recognition is a prerequisite. In this study, we propose a multimodal framework based on Gated Multimodal Units (GMU) and Multi-Stage Temporal Convolutional Networks (MS-TCN) to recognize surgical phases of port-catheter placement operations. Our method merges speech and image models and uses them separately in different surgical phases. Based on the evaluation of 28 operations, we report a frame-wise accuracy of 92.65 $\pm$ 3.52% and an F1-score of 92.30 $\pm$ 3.82%. Our results show approximately 10% improvement in both metrics over previous work and validate the effectiveness of integrating multimodal data for the surgical phase recognition task. We further investigate the contribution of individual data channels by comparing mono-modal models with multimodal models.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 5 Pages, Interspeech 2024

MSC Class: 00b20

arXiv:2404.08064 [pdf]

The Impact of Speech Anonymization on Pathology and Its Limits

Authors: Soroosh Tayebi Arasteh, Tomas Arias-Vergara, Paula Andrea Perez-Toro, Tobias Weise, Kai Packhaeuser, Maria Schuster, Elmar Noeth, Andreas Maier, Seung Hee Yang

Abstract: Integration of speech into healthcare has intensified privacy concerns due to its potential as a non-invasive biomarker containing individual biometric information. In response, speaker anonymization aims to conceal personally identifiable information while retaining crucial linguistic content. However, the application of anonymization techniques to pathological speech, a critical area where privacy is especially vital, has not been extensively examined. This study investigates anonymization's impact on pathological speech across over 2,700 speakers from multiple German institutions, focusing on privacy, pathological utility, and demographic fairness. We explore both deep-learning-based and signal processing-based anonymization methods, and document substantial privacy improvements across disorders, evidenced by equal error rate increases of up to 1933%, with minimal overall impact on utility. Specific disorders such as Dysarthria, Dysphonia, and Cleft Lip and Palate experienced minimal utility changes, while Dysglossia showed slight improvements. Our findings underscore that the impact of anonymization varies substantially across different disorders. This necessitates disorder-specific anonymization strategies to optimally balance privacy with diagnostic utility. Additionally, our fairness analysis revealed consistent anonymization effects across most of the demographics. This study demonstrates the effectiveness of anonymization in pathological speech for enhancing privacy, while also highlighting the importance of customized and disorder-specific approaches to account for inversion attacks.

Submitted 22 June, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

arXiv:2403.03383 [pdf, other]

doi 10.1364/OE.516198

DaISy: Diffuser-aided Sub-THz Imaging System

Authors: Shao-Hsuan Wu, Yiyao Zhang, Ke Chen, Shang Hua Yang

Abstract: Sub-terahertz (Sub-THz) waves possess exceptional attributes, capable of penetrating non-metallic and non-polarized materials while ensuring bio-safety. However, their practicality in imaging is marred by the emergence of troublesome speckle artifacts, primarily due to diffraction effects caused by wavelengths comparable to object dimensions. In addressing this limitation, we present the Diffuser-aided sub-THz Imaging System (DaISy), which utilizes a diffuser and a focusing lens to convert coherent waves into incoherent counterparts. The cornerstone of our progress lies in a coherence theory-based theoretical framework, pivotal for designing and validating the THz diffuser, and systematically evaluating speckle phenomena. Our experimental results utilizing DaISy reveal substantial improvements in imaging quality and nearly diffraction-limited spatial resolution. Moreover, we demonstrate a tangible application of DaISy in the scenario of security scanning, highlighting the versatile potential of sub-THz waves in miscellaneous fields.

Submitted 5 March, 2024; originally announced March 2024.

Comments: These authors (Shao-Hsuan Wu and Yiyao Zhang) contributed equally to this work. 15 pages, 7 figures. Supplemental Document: https://doi.org/10.6084/m9.figshare.25328746

Journal ref: Optics Express (OE) 2024

arXiv:2305.15993 [pdf, other]

doi 10.21437/Interspeech.2023-753

PoCaPNet: A Novel Approach for Surgical Phase Recognition Using Speech and X-Ray Images

Authors: Kubilay Can Demir, Tobias Weise, Matthias May, Axel Schmid, Andreas Maier, Seung Hee Yang

Abstract: Surgical phase recognition is a challenging and necessary task for the development of context-aware intelligent systems that can support medical personnel for better patient care and effective operating room management. In this paper, we present a surgical phase recognition framework that employs a Multi-Stage Temporal Convolution Network using speech and X-Ray images for the first time. We evaluate our proposed approach using our dataset that comprises 31 port-catheter placement operations and report 82.56% frame-wise accuracy with eight surgical phases. Additionally, we investigate the design choices in the temporal model and solutions for the class-imbalance problem. Our experiments demonstrate that speech and X-Ray data can be effectively utilized for surgical phase recognition, providing a foundation for the development of speech assistants in operating rooms of the future.

Submitted 25 May, 2023; originally announced May 2023.

Comments: 5 Pages, 3 figures, INTERSPEECH 2023

MSC Class: 00b20

arXiv:2305.11284 [pdf, other]

doi 10.21437/Interspeech.2023-2108

Federated learning for secure development of AI models for Parkinson's disease detection using speech from different languages

Authors: Soroosh Tayebi Arasteh, Cristian David Rios-Urrego, Elmar Noeth, Andreas Maier, Seung Hee Yang, Jan Rusz, Juan Rafael Orozco-Arroyave

Abstract: Parkinson's disease (PD) is a neurological disorder impacting a person's speech. Among automatic PD assessment methods, deep learning models have gained particular interest. Recently, the community has explored cross-pathology and cross-language models which can improve diagnostic accuracy even further. However, strict patient data privacy regulations largely prevent institutions from sharing patient speech data with each other. In this paper, we employ federated learning (FL) for PD detection using speech signals from 3 real-world language corpora of German, Spanish, and Czech, each from a separate institution. Our results indicate that the FL model outperforms all the local models in terms of diagnostic accuracy, while not performing very differently from the model based on centrally combined training sets, with the advantage of not requiring any data sharing among collaborators. This will simplify inter-institutional collaborations, resulting in enhancement of patient outcomes.

Submitted 21 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: INTERSPEECH 2023, pp. 5003--5007, Dublin, Ireland

Journal ref: INTERSPEECH 2023

arXiv:2206.12320 [pdf, other]

PoCaP Corpus: A Multimodal Dataset for Smart Operating Room Speech Assistant using Interventional Radiology Workflow Analysis

Authors: Kubilay Can Demir, Matthias May, Axel Schmid, Michael Uder, Katharina Breininger, Tobias Weise, Andreas Maier, Seung Hee Yang

Abstract: This paper presents a new multimodal interventional radiology dataset, called PoCaP (Port Catheter Placement) Corpus. This corpus consists of speech and audio signals in German, X-ray images, and system commands collected from 31 PoCaP interventions by six surgeons with an average duration of 81.4 $\pm$ 41.0 minutes. The corpus aims to provide a resource for developing a smart speech assistant in operating rooms. In particular, it may be used to develop a speech-controlled system that enables surgeons to control operation parameters such as C-arm movements and table positions. In order to record the dataset, we obtained consent from the institutional review board and workers' council at the University Hospital Erlangen and from the patients for data privacy. We describe the recording set-up, data structure, workflow and preprocessing steps, and report the first PoCaP Corpus speech recognition analysis results with 11.52% word error rate using pretrained models. The findings suggest that the data has the potential to build a robust command recognition system and will allow the development of novel intervention support systems using speech and image processing in the medical domain.

Submitted 24 June, 2022; originally announced June 2022.

Comments: 8 pages, 4 figures, Text, Speech and Dialogue 2022 Conference

MSC Class: 00b20

arXiv:2204.06450 [pdf, other]

doi 10.1038/s41598-023-47711-7

The effect of speech pathology on automatic speaker verification -- a large-scale study

Authors: Soroosh Tayebi Arasteh, Tobias Weise, Maria Schuster, Elmar Noeth, Andreas Maier, Seung Hee Yang

Abstract: Navigating the challenges of data-driven speech processing, one of the primary hurdles is accessing reliable pathological speech data. While public datasets appear to offer solutions, they come with inherent risks of potential unintended exposure of patient health information via re-identification attacks. Using a comprehensive real-world pathological speech corpus, with over n=3,800 test subjects spanning various age groups and speech disorders, we employed a deep-learning-driven automatic speaker verification (ASV) approach. This resulted in a notable mean equal error rate (EER) of 0.89% with a standard deviation of 0.06%, outstripping traditional benchmarks. Our comprehensive assessments demonstrate that pathological speech overall faces heightened privacy breach risks compared to healthy speech. Specifically, adults with dysphonia are at heightened re-identification risks, whereas conditions like dysarthria yield results comparable to those of healthy speakers. Crucially, speech intelligibility does not influence the ASV system's performance metrics. In pediatric cases, particularly those with cleft lip and palate, the recording environment plays a decisive role in re-identification. Merging data across pathological types led to a marked EER decrease, suggesting the potential benefits of pathological diversity in ASV, accompanied by a logarithmic boost in ASV effectiveness. In essence, this research sheds light on the dynamics between pathological speech and speaker verification, emphasizing its crucial role in safeguarding patient confidentiality in our increasingly digitized healthcare era.

Submitted 22 November, 2023; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: Published in Scientific Reports

Journal ref: Sci Rep 13, 20476 (2023)

arXiv:2204.04016 [pdf, other]

Disentangled Latent Speech Representation for Automatic Pathological Intelligibility Assessment

Authors: Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Andreas Maier, Elmar Noeth, Bjoern Heismann, Maria Schuster, Seung Hee Yang

Abstract: Speech intelligibility assessment plays an important role in the therapy of patients suffering from pathological speech disorders. Automatic and objective measures are desirable to assist therapists in their traditionally subjective and labor-intensive assessments. In this work, we investigate a novel approach for obtaining such a measure using the divergence in disentangled latent speech representations of a parallel utterance pair, obtained from a healthy reference and a pathological speaker. Experiments on an English database of Cerebral Palsy patients, using all available utterances per speaker, show high and significant correlation values (R = -0.9) with subjective intelligibility measures, while having only minimal deviation (±0.01) across four different reference speaker pairs. We also demonstrate the robustness of the proposed method (R = -0.89 deviating ±0.02 over 1000 iterations) by considering a significantly smaller amount of utterances per speaker. Our results are among the first to show that disentangled speech representations can be used for automatic pathological speech intelligibility assessment, resulting in a reference speaker pair invariant method, applicable in scenarios with only few utterances available.

Submitted 27 June, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Submitted and Accepted at INTERSPEECH2022

arXiv:2204.01677 [pdf, other]

Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices

Authors: Abner Hernandez, Paula Andrea Pérez-Toro, Juan Camilo Vásquez-Correa, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang

Abstract: Collecting speech data is an important step in training speech recognition systems and other speech-based machine learning models. However, the issue of privacy protection is an increasing concern that must be addressed. The current study investigates the use of voice conversion as a method for anonymizing voices. In particular, we train several voice conversion models using self-supervised speech representations including Wav2Vec2.0, Hubert and UniSpeech. Converted voices retain a low word error rate within 1% of the original voice. Equal error rate increases from 1.52% to 46.24% on the LibriSpeech test set and from 3.75% to 45.84% on speakers from the VCTK corpus which signifies degraded performance on speaker verification. Lastly, we conduct experiments on dysarthric speech data to show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices for discriminating between healthy and pathological speech.

Submitted 4 April, 2022; originally announced April 2022.

Comments: Submitted for review at Interspeech 2022

arXiv:2204.01670 [pdf, other]

Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition

Authors: Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang

Abstract: State-of-the-art automatic speech recognition (ASR) systems perform well on healthy speech. However, the performance on impaired speech still remains an issue. The current study explores the usefulness of using Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech. Dysarthric speech recognition is particularly difficult as several aspects of speech such as articulation, prosody and phonation can be impaired. Specifically, we train an acoustic model with features extracted from Wav2Vec, Hubert, and the cross-lingual XLSR model. Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance. In particular, features from the multilingual model led to lower WERs than filterbanks (Fbank) or models trained on a single language. Improvements were observed in English speakers with cerebral palsy caused dysarthria (UASpeech corpus), Spanish speakers with Parkinsonian dysarthria (PC-GITA corpus) and Italian speakers with paralysis-based dysarthria (EasyCall corpus). Compared to using Fbank features, XLSR-based features reduced WERs by 6.8%, 22.0%, and 7.0% for the UASpeech, PC-GITA, and EasyCall corpus, respectively.

Submitted 4 April, 2022; originally announced April 2022.

Comments: Submitted for review at Interspeech 2022

arXiv:2202.03540 [pdf, other]

SliTraNet: Automatic Detection of Slide Transitions in Lecture Videos using Convolutional Neural Networks

Authors: Aline Sindel, Abner Hernandez, Seung Hee Yang, Vincent Christlein, Andreas Maier

Abstract: With the increasing amount of online learning material on the web, searching for specific content in lecture videos can be time-consuming. Therefore, automatic slide extraction from lecture videos can be helpful to give a brief overview of the main content and to support students in their studies. For this task, we propose a deep learning method to detect slide transitions in lecture videos. We first process each frame of the video by a heuristic-based approach using a 2-D convolutional neural network to predict transition candidates. Then, we increase the complexity by employing two 3-D convolutional neural networks to refine the transition candidates. Evaluation results demonstrate the effectiveness of our method in finding slide transitions.

Submitted 7 February, 2022; originally announced February 2022.

Comments: 6 pages, 5 figures, 1 table, accepted to OAGM Workshop 2021

arXiv:2112.03678 [pdf, other]

Does Proprietary Software Still Offer Protection of Intellectual Property in the Age of Machine Learning? -- A Case Study using Dual Energy CT Data

Authors: Andreas Maier, Seung Hee Yang, Farhad Maleki, Nikesh Muthukrishnan, Reza Forghani

Abstract: In the domain of medical image processing, medical device manufacturers protect their intellectual property in many cases by shipping only compiled software, i.e. binary code which can be executed but is difficult for a potential attacker to understand. In this paper, we investigate how well this procedure is able to protect image processing algorithms. In particular, we investigate whether the computation of mono-energetic images and iodine maps from dual energy CT data can be reverse-engineered by machine learning methods. Our results indicate that both can be approximated at a very high accuracy using only a single slice image as training data, with structural similarity greater than 0.98 in all investigated cases.

Submitted 6 December, 2021; originally announced December 2021.

Comments: 6 pages, 2 figures, 1 table, accepted on BVM 2022

arXiv:2108.04543 [pdf, other]

doi 10.1088/2516-1091/ac5b13

Known Operator Learning and Hybrid Machine Learning in Medical Imaging -- A Review of the Past, the Present, and the Future

Authors: Andreas Maier, Harald Köstler, Marco Heisig, Patrick Krauss, Seung Hee Yang

Abstract: In this article, we perform a review of the state-of-the-art of hybrid machine learning in medical imaging. We start with a short summary of the general developments of the past in machine learning and how general and specialized approaches have been in competition in the past decades. A particular focus will be the theoretical and experimental evidence pro and contra hybrid modelling. Next, we inspect several new developments regarding hybrid machine learning with a particular focus on so-called known operator learning and how hybrid approaches gain more and more momentum across essentially all applications in medical imaging and medical image analysis. As we will point out by numerous examples, hybrid models are taking over in image reconstruction and analysis. Even domains such as physical simulation and scanner and acquisition design are being addressed using machine learning grey box modelling approaches. Towards the end of the article, we will investigate a few future directions and point out relevant areas in which hybrid modelling, meta learning, and other domains will likely be able to drive the state-of-the-art ahead.

Submitted 10 August, 2021; originally announced August 2021.

Comments: 22 pages, 4 figures, submitted to "Progress in Biomedical Engineering"

Journal ref: Prog. Biomed. Eng. 4 022002 (2022)

arXiv:2101.09651 [pdf]

doi 10.3390/ma14175002

The Blow-off Impulse Equivalence between Multi-Energy Composite Spectrum Electron Beam and Powerful Pulsed X-ray

Authors: D. W. Wang, S. H. Yang, S. Wang, J. Wang, H. P. Li

Abstract: The electron beam, one of the most effective approaches to simulate the irradiation effects of powerful pulsed X-ray in the laboratory, plays an important role in the experiment of simulating thermodynamic effects of powerful pulsed X-ray. This paper studies the thermodynamics equivalence between multi-energy composite spectrum electron beam and blackbody spectrum X-ray, which is helpful to quickly determine the experimental parameters in the simulation experiment. The experimental data of electron beam is extrapolated by the numerical calculation, to increase the range of energy flux. Through calculating the blow-off impulse of blackbody spectrum X-ray irradiation, we obtained the curve of X-ray blow-off impulse varying with energy flux, and then found two categories of equivalent relations - equal-energy flux and equal-impulse - by analysing the calculation results of electron beam and X-ray blow-off impulse. Based on such relations, we could directly or indirectly obtain the results of blackbody spectrum X-ray irradiation blow-off impulse via electron beam experiment.

Submitted 24 January, 2021; originally announced January 2021.

Comments: 17 pages, 14 figures

arXiv:2008.01284 [pdf, ps, other]

doi 10.1051/0004-6361/202038668

Sunspot penumbral filaments intruding into a light bridge and the resultant reconnection jets

Authors: Y. J. Hou, T. Li, S. H. Zhong, S. H. Yang, Y. L. Guo, X. H. Li, J. Zhang, Y. Y. Xiang

Abstract: Penumbral filaments and light bridges are prominent structures inside sunspots and are important for understanding the nature of sunspot magnetic fields and magneto-convection underneath. We investigate an interesting event where several penumbral filaments intruded into a sunspot light bridge for more insights into magnetic fields of the sunspot penumbral filament and light bridge, as well as their interaction. The emission, kinematic, and magnetic topology characteristics of the penumbral filaments intruding into the light bridge and the resultant jets are studied. At the west part of the light bridge, the intruding penumbral filaments penetrated into the umbrae on both sides of the light bridge, and two groups of jets were also detected. The jets shared the same projected morphology with the intruding filaments and were accompanied by intermittent footpoint brightenings. Simultaneous spectral imaging observations provide convincing evidence for the presence of magnetic reconnection related heating and bidirectional flows near the jet bases and contribute to measuring vector velocities of the jets. Additionally, nonlinear force-free field extrapolation results reveal strong and highly inclined magnetic fields along the intruding penumbral filaments, consistent with the results deduced from the vector velocities of the jets. Therefore, we propose that the jets could be caused by magnetic reconnections between emerging fields within the light bridge and the nearly horizontal fields of intruding filaments. They were then ejected outward along the stronger filament fields. Our study indicates that magnetic reconnection could occur between the penumbral filament fields and emerging fields within the light bridge and produce jets along the stronger filament fields. These results further complement the study of magnetic reconnection and dynamic activities within the sunspot.

Submitted 3 August, 2020; originally announced August 2020.

Comments: 14 pages, 9 figures, 3 movies, abstract shortened to meet arXiv requirements, accepted for publication in A&A

Journal ref: A&A 642, A44 (2020)

arXiv:2004.06837 [pdf, ps, other]

doi 10.1051/0004-6361/202038072

Fast degradation of the circular flare ribbon on 2014 August 24

Authors: Q. M. Zhang, S. H. Yang, T. Li, Y. J. Hou, Y. Li

Abstract: The separation and elongation motions of solar flare ribbons have extensively been investigated. The degradation and disappearance of ribbons have rarely been explored. In this paper, we report our multiwavelength observations of a C5.5 circular-ribbon flare associated with two jets (jet1 and jet2) on 2014 August 24, focusing on the fast degradation of the outer circular ribbon (CR). The flare, consisting of a short inner ribbon (IR) and outer CR, was triggered by the eruption of a minifilament. The brightness of IR and outer CR reached their maxima simultaneously at $\sim$04:58 UT in all AIA wavelengths. Subsequently, the short eastern part of CR faded out quickly in 1600 Å but gradually in EUV wavelengths. The long western part of CR degraded in the counterclockwise direction and experienced a deceleration. The degradation was distinctly divided into two phases: phase I with faster apparent speeds (58$-$69 km s$^{-1}$) and phase II with slower apparent speeds (29$-$35 km s$^{-1}$). The second phase stopped at $\sim$05:10 UT when the western CR totally disappeared. Besides the outward propagation of jet1, the jet spire experienced untwisting motion in the counterclockwise direction during 04:55$-$05:00 UT. We conclude that the event can be explained by the breakout jet model. The coherent brightenings of the IR and CR at $\sim$04:58 UT may result from the impulsive interchange reconnection near the null point, whereas sub-Alfvénic slipping motion of the western CR in the counterclockwise direction indicates the occurrence of slipping magnetic reconnection. Another possible explanation of the quick disappearance of the hot loops connecting to the western CR is that they are simply reconnected sequentially without the need for significant slippage after the null point reconnection.

Submitted 14 April, 2020; originally announced April 2020.

Comments: 4 pages, 5 figures, accepted for publication in A&A Letters

Journal ref: A&A 636, L11 (2020)

arXiv:2001.04260 [pdf]

Improving Dysarthric Speech Intelligibility Using Cycle-consistent Adversarial Training

Authors: Seung Hee Yang, Minhwa Chung

Abstract: Dysarthria is a motor speech impairment affecting millions of people. Dysarthric speech can be far less intelligible than that of non-dysarthric speakers, causing significant communication difficulties. The goal of our work is to develop a model for dysarthric to healthy speech conversion using Cycle-consistent GAN. Using 18,700 dysarthric and 8,610 healthy control Korean utterances that were recorded for the purpose of automatic recognition of voice keyboard in a previous study, the generator is trained to transform dysarthric to healthy speech in the spectral domain, which is then converted back to speech. Objective evaluation using automatic speech recognition of the generated utterance on a held-out test set shows that the recognition performance is improved compared with the original dysarthric speech after performing adversarial training, as the absolute WER has been lowered by 33.4%. It demonstrates that the proposed GAN-based conversion method is useful for improving dysarthric speech intelligibility.

Submitted 9 January, 2020; originally announced January 2020.

Comments: To be Published on the 24th February in BIOSIGNALS 2020. arXiv admin note: text overlap with arXiv:1904.09407

arXiv:2001.03278 [pdf]

A Scalable Chatbot Platform Leveraging Online Community Posts: A Proof-of-Concept Study

Authors: Sihyeon Jo, Seungryong Yoo, Sangwon Im, Seung Hee Yang, Tong Zuo, Hee-Eun Kim, SangWook Han, Seong-Woo Kim

Abstract: The development of natural language processing algorithms and the explosive growth of conversational data are encouraging research on human-computer conversation. Still, getting qualified conversational data on a large scale is difficult and expensive. In this paper, we verify the feasibility of constructing a data-driven chatbot with processed online community posts by using them as pseudo-conversational data. We argue that chatbots for various purposes can be built extensively through the pipeline exploiting the common structure of community posts. Our experiment demonstrates that chatbots created along the pipeline can yield the proper responses.

Submitted 9 January, 2020; originally announced January 2020.

Comments: To be Published on the 10th February, 2020, in HCI (Human-Computer Interaction) Conference 2020, Republic of Korea

arXiv:1904.09407 [pdf]

Self-imitating Feedback Generation Using GAN for Computer-Assisted Pronunciation Training

Authors: Seung Hee Yang, Minhwa Chung

Abstract: Self-imitating feedback is an effective and learner-friendly method for non-native learners in Computer-Assisted Pronunciation Training. Acoustic characteristics in native utterances are extracted and transplanted onto the learner's own speech input, and given back to the learner as corrective feedback. Previous works focused on speech conversion using prosodic transplantation techniques based on the PSOLA algorithm. Motivated by the visual differences found in spectrograms of native and non-native speech, we investigated applying GAN to generate self-imitating feedback by utilizing the generator's ability through adversarial training. Because this mapping is highly under-constrained, we also adopt cycle consistency loss to encourage the output to preserve the global structure, which is shared by native and non-native utterances. Trained on 97,200 spectrogram images of short utterances produced by native and non-native speakers of Korean, the generator is able to successfully transform the non-native spectrogram input to a spectrogram with properties of self-imitating feedback. Furthermore, the transformed spectrogram shows segmental corrections that cannot be obtained by prosodic transplantation. A perceptual test comparing the self-imitating and correcting abilities of our method with the baseline PSOLA method shows that the generative approach with cycle consistency loss is promising.

Submitted 20 April, 2019; originally announced April 2019.

arXiv:1808.06795 [pdf, ps, other]

doi 10.1051/0004-6361/201732530

Eruption of a multi-flux-rope system in solar active region 12673 leading to the two largest flares in Solar Cycle 24

Authors: Y. J. Hou, J. Zhang, T. Li, S. H. Yang, X. H. Li

Abstract: Solar active region (AR) 12673 in 2017 September produced the two largest flares in Solar Cycle 24: the X9.3 flare on September 06 and the X8.2 flare on September 10. We attempt to investigate the evolutions of the two great flares and their associated complex magnetic system in detail. Aided by the NLFFF modeling, we identify a double-decker flux rope configuration above the polarity inversion line (PIL) in the AR core region. The north ends of these two flux ropes were rooted in a negative-polarity magnetic patch, which began to move along the PIL and rotate anticlockwise before the X9.3 flare on September 06. The strong shearing motion and rotation contributed to the destabilization of the two magnetic flux ropes, of which the upper one subsequently erupted upward due to the kink-instability. Then another two sets of twisted loop bundles beside these ropes were disturbed and successively erupted within 5 minutes like a chain reaction. Similarly, multiple ejecta components were detected to consecutively erupt during the X8.2 flare occurring in the same AR on September 10. We examine the evolution of the AR magnetic fields from September 03 to 06 and find that five dipoles emerged successively at the east of the main sunspot. The interactions between these dipoles took place continuously, accompanied by magnetic flux cancellations and strong shearing motions.
In AR 12673, significant flux emergence and successive interactions between the different emerging dipoles resulted in a complex magnetic system, accompanied by the formations of multiple flux ropes and twisted loop bundles. We propose that the eruptions of a multi-flux-rope system resulted in the two largest flares in Solar Cycle 24.

Submitted 21 October, 2018; v1 submitted 21 August, 2018; originally announced August 2018.

Comments: 10 pages, 8 figures. To be published in A&A

Journal ref: A&A 619, A100 (2018)

arXiv:1703.10022 [pdf, other]

doi 10.3847/1538-4357/aaeac1

A blowout jet associated with one obvious extreme-ultraviolet wave and one complicated coronal mass ejection event

Authors: Y. H. Miao, Y. Liu, H. B. Li, Y. D. Shen, S. H. Yang, A. Elmhamdi, A. S. Kordi, Z. Z. Abidin

Abstract: In this paper, we present a detailed analysis of a coronal blowout jet eruption which was associated with an obvious extreme-ultraviolet (EUV) wave and one complicated coronal mass ejection (CME) event based on the multi-wavelength and multi-view-angle observations from Solar Dynamics Observatory and Solar Terrestrial Relations Observatory. It is found that the triggering of the blowout jet was due to the emergence and cancellation of magnetic fluxes on the photosphere. During the rising stage of the jet, the EUV wave appeared just ahead of the jet top, lasting about 4 minutes and at a speed of 458$-$762 km s$^{-1}$. In addition, obvious dark material is observed along the EUV jet body, which confirms the observation of a mini-filament eruption at the jet base in the chromosphere. Interestingly, two distinct but overlapped CME structures can be observed in corona together with the eruption of the blowout jet. One is in narrow jet-shape, while the other one is in bubble-shape. The jet-shaped component was unambiguously related to the outwardly running jet itself, while the bubble-like one might either be produced due to the reconstruction of the high coronal fields or by the internal reconnection during the mini-filament ejection according to the double-CME blowout jet model first proposed by Shen et al. (2012b), suggesting more observational evidence should be supplied to clear the current ambiguity based on large samples of blowout jets in future studies.

Submitted 23 December, 2018; v1 submitted 29 March, 2017; originally announced March 2017.

Comments: APJ, Accepted October 19, 2018

arXiv:1604.00485 [pdf, ps, other]

doi 10.1051/0004-6361/201628216

Light Walls Around Sunspots Observed by the Interface Region Imaging Spectrograph

Authors: Y. J. Hou, T. Li, S. H. Yang, J. Zhang

Abstract: The Interface Region Imaging Spectrograph (IRIS) mission provides high-resolution observations of the chromosphere and transition region. We try to determine whether the light walls exist somewhere else in active regions besides light bridges. Employing half-year high tempo-spatial data from the IRIS, we find lots of light walls either around sunspots or above light bridges. For the first time, we report one light wall near an umbral-penumbral boundary and another along a neutral line between two small sunspots. These new observations reveal that these light walls are multi-layer and multi-thermal structures which occur along magnetic neutral lines in active regions.

Submitted 2 April, 2016; originally announced April 2016.

Comments: 4 pages, 4 figures, Accepted for publication in A&A Letters

Journal ref: A&A 589, L7 (2016)

arXiv:1203.3530 [pdf]

Hybrid Generative/Discriminative Learning for Automatic Image Annotation

Authors: Shuang Hong Yang, Jiang Bian, Hongyuan Zha

Abstract: Automatic image annotation (AIA) raises tremendous challenges to machine learning as it requires modeling of data that are both ambiguous in input and output, e.g., images containing multiple objects and labeled with multiple semantic tags. Even more challenging is that the number of candidate tags is usually huge (as large as the vocabulary size) yet each image is only related to a few of them. This paper presents a hybrid generative-discriminative classifier to simultaneously address the extreme data-ambiguity and overfitting-vulnerability issues in tasks such as AIA. Particularly: (1) an Exponential-Multinomial Mixture (EMM) model is established to capture both the input and output ambiguity and in the meanwhile to encourage prediction sparsity; and (2) the prediction ability of the EMM model is explicitly maximized through discriminative learning that integrates variational inference of graphical models and the pairwise formulation of ordinal regression. Experiments show that our approach achieves both superior annotation performance and better tag scalability.

Submitted 15 March, 2012; originally announced March 2012.

Comments: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010)

Report number: UAI-P-2010-PG-683-690

arXiv:1010.0621 [pdf, ps, other]

Local Optimality of User Choices and Collaborative Competitive Filtering

Authors: Shuang Hong Yang

Abstract: While a user's preference is directly reflected in the interactive choice process between her and the recommender, this wealth of information was not fully exploited for learning recommender models. In particular, existing collaborative filtering (CF) approaches take into account only the binary events of user actions but totally disregard the contexts in which users' decisions are made. In this paper, we propose Collaborative Competitive Filtering (CCF), a framework for learning user preferences by modeling the choice process in recommender systems. CCF employs a multiplicative latent factor model to characterize the dyadic utility function. But unlike CF, CCF models the user behavior of choices by encoding a local competition effect. In this way, CCF allows us to leverage dyadic data that was previously lumped together with missing data in existing CF models. We present two formulations and an efficient large scale optimization algorithm. Experiments on three real-world recommendation data sets demonstrate that CCF significantly outperforms standard CF approaches in both offline and online evaluations.

Submitted 25 February, 2011; v1 submitted 4 October, 2010; originally announced October 2010.

Comments: 27 pages, 4 figures

ACM Class: I.2.6; H.1.1; H.3.3

arXiv:0904.2684 [pdf, ps, other]

doi 10.1051/0004-6361/200810601

Response of the solar atmosphere to magnetic field evolution in a coronal hole region

Authors: S. H. Yang, J. Zhang, C. L. Jin, L. P. Li, H. Y. Duan

Abstract: Methods. We study an equatorial CH observed simultaneously by HINODE and STEREO on July 27, 2007. The HINODE/SP maps are adopted to derive the physical parameters of the photosphere and to research the magnetic field evolution and distribution. The G band and Ca II H images with high tempo-spatial resolution from HINODE/BFI and the multi-wavelength data from STEREO/EUVI are utilized to study the corresponding atmospheric response of different overlying layers. Results. We explore an emerging dipole located at the CH boundary. Mini-scale arch filaments (AFs) accompanying the emerging dipole were observed with the Ca II H line. During the separation of the dipolar footpoints, three AFs appeared and expanded in turn. The first AF divided into two segments in its late stage, while the second and third AFs erupted in their late stages. The lifetimes of these three AFs are 4, 6, and 10 minutes, and the two intervals between the three divisions or eruptions are 18 and 12 minutes, respectively. We display an example of mixed-polarity flux emergence of IN fields within the CH and present the corresponding chromospheric response. With the increase of the integrated magnetic flux, the brightness of the Ca II H images exhibits an increasing trend. We also study magnetic flux cancellations of NT fields located at the CH boundary and present the obvious chromospheric and coronal response. We notice that the brighter regions seen in the 171 Å images are relevant to the interacting magnetic elements.
By examining the magnetic NT and IN elements and the response of different atmospheric layers, we obtain good positive linear correlations between the NT magnetic flux densities and the brightness of both G band (correlation coefficient 0.85) and Ca II H (correlation coefficient 0.58).

Submitted 17 April, 2009; originally announced April 2009.

Comments: 9 pages, 9 figures. A&A, in press

Showing 1–26 of 26 results for author: Yang, S H