Decoder-only Architecture for Streaming End-to-end Speech Recognition

Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Abstract: Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for streaming applications of ASR. In this work, we propose to use a decoder-only architecture for blockwise streaming ASR. In our approach, speech features are compressed using CTC output and context embedding using blockwise speech subnetwork, and are sequentially provided as prompts to the decoder. The decoder estimates the output tokens promptly at each block. To this end, we also propose a novel training scheme using random-length prefix prompts to make the model robust to the truncated prompts caused by blockwise processing. An experimental comparison shows that our proposed decoder-only streaming ASR achieves 8% relative word error rate reduction in the LibriSpeech test-other set while being twice as fast as the baseline model.

Submitted 23 June, 2024; originally announced June 2024.

Comments: Accepted for Interspeech 2024

arXiv:2406.12611 [pdf, other]

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

Authors: Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe

Abstract: End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language is already known, these models can perform as language-specific by using language information as prompts, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts due to its conditionally independent output tokens. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has been shown to significantly reduce errors by 28% on average and by 41% on low-resource languages.

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.12317 [pdf, other]

Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model

Authors: Hayato Futami, Siddhant Arora, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

Abstract: Recently, multi-task spoken language understanding (SLU) models have emerged, designed to address various speech processing tasks. However, these models often rely on a large number of parameters. Also, they often encounter difficulties in adapting to new data for a specific task without experiencing catastrophic forgetting of previously trained tasks. In this study, we propose finding task-specific subnetworks within a multi-task SLU model via neural network pruning. In addition to model compression, we expect that the forgetting of previously trained tasks can be mitigated by updating only a task-specific subnetwork. We conduct experiments on top of the state-of-the-art multi-task SLU model "UniverSLU", trained for several tasks such as emotion recognition (ER), intent classification (IC), and automatic speech recognition (ASR). We show that pruned models were successful in adapting to additional ASR or IC data with minimal performance degradation on previously trained tasks.

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech2024

arXiv:2405.12488 [pdf, other]

First joint oscillation analysis of Super-Kamiokande atmospheric and T2K accelerator neutrino data

Authors: Super-Kamiokande, T2K collaborations, :, S. Abe, K. Abe, N. Akhlaq, R. Akutsu, H. Alarakia-Charles, A. Ali, Y. I. Alj Hakim, S. Alonso Monsalve, S. Amanai, C. Andreopoulos, L. H. V. Anthony, M. Antonova, S. Aoki, K. A. Apte, T. Arai, T. Arihara, S. Arimoto, Y. Asada, R. Asaka, Y. Ashida, E. T. Atkin, N. Babu , et al. (524 additional authors not shown)

Abstract: The Super-Kamiokande and T2K collaborations present a joint measurement of neutrino oscillation parameters from their atmospheric and beam neutrino data. It uses a common interaction model for events overlapping in neutrino energy and correlated detector systematic uncertainties between the two datasets, which are found to be compatible. Using 3244.4 days of atmospheric data and a beam exposure of $19.7(16.3) \times 10^{20}$ protons on target in (anti)neutrino mode, the analysis finds a 1.9$σ$ exclusion of CP-conservation (defined as $J_{CP}=0$) and a preference for the normal mass ordering.

Submitted 21 May, 2024; originally announced May 2024.

Comments: 10 pages, 3 figures

arXiv:2404.09920 [pdf, other]

Combined Pre-Supernova Alert System with KamLAND and Super-Kamiokande

Authors: KamLAND, Super-Kamiokande Collaborations, :, Seisho Abe, Minori Eizuka, Sawako Futagi, Azusa Gando, Yoshihito Gando, Shun Goto, Takahiko Hachiya, Kazumi Hata, Koichi Ichimura, Sei Ieki, Haruo Ikeda, Kunio Inoue, Koji Ishidoshiro, Yuto Kamei, Nanami Kawada, Yasuhiro Kishimoto, Masayuki Koga, Maho Kurasawa, Tadao Mitsui, Haruhiko Miyake, Daisuke Morita, Takeshi Nakahata , et al. (290 additional authors not shown)

Abstract: Preceding a core-collapse supernova, various processes produce an increasing amount of neutrinos of all flavors characterized by mounting energies from the interior of massive stars. Among them, the electron antineutrinos are potentially detectable by terrestrial neutrino experiments such as KamLAND and Super-Kamiokande via inverse beta decay interactions. Once these pre-supernova neutrinos are observed, an early warning of the upcoming core-collapse supernova can be provided. In light of this, KamLAND and Super-Kamiokande, both located in the Kamioka mine in Japan, have been monitoring pre-supernova neutrinos since 2015 and 2021, respectively. Recently, we performed a joint study between KamLAND and Super-Kamiokande on pre-supernova neutrino detection. A pre-supernova alert system combining the KamLAND detector and the Super-Kamiokande detector was developed and put into operation, which can provide a supernova alert to the astrophysics community. Fully leveraging the complementary properties of these two detectors, the combined alert is expected to resolve a pre-supernova neutrino signal from a 15 M$_{\odot}$ star within 510 pc of the Earth, at a significance level corresponding to a false alarm rate of no more than 1 per century. For a Betelgeuse-like model with optimistic parameters, it can provide early warnings up to 12 hours in advance.

Submitted 1 July, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: Resubmitted to ApJ. 22 pages, 16 figures, for more information about the combined pre-supernova alert system, see https://www.lowbg.org/presnalarm/

arXiv:2404.08725 [pdf, other]

Development of a data overflow protection system for Super-Kamiokande to maximize data from nearby supernovae

Authors: M. Mori, K. Abe, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, H. Shiba, K. Shimizu , et al. (230 additional authors not shown)

Abstract: Neutrinos from very nearby supernovae, such as Betelgeuse, are expected to generate more than ten million events over 10\,s in Super-Kamiokande (SK). At such large event rates, the buffers of the SK analog-to-digital conversion board (QBEE) will overflow, causing random loss of data that is critical for understanding the dynamics of the supernova explosion mechanism. In order to solve this problem, two new DAQ modules were developed to aid in the observation of very nearby supernovae. The first of these, the SN module, is designed to save only the number of hit PMTs during a supernova burst and the second, the Veto module, prescales the high rate neutrino events to prevent the QBEE from overflowing based on information from the SN module. In the event of a very nearby supernova, these modules allow SK to reconstruct the time evolution of the neutrino event rate from beginning to end using both QBEE and SN module data. This paper presents the development and testing of these modules together with an analysis of supernova-like data generated with a flashing laser diode. We demonstrate that the Veto module successfully prevents DAQ overflows for Betelgeuse-like supernovae as well as the long-term stability of the new modules. During normal running the Veto module is found to issue DAQ vetos a few times per month resulting in a total dead time less than 1\,ms, and does not influence ordinary operations. Additionally, using simulation data we find that supernovae closer than 800~pc will trigger the Veto module, resulting in a prescaling of the observed neutrino data.

Submitted 12 April, 2024; originally announced April 2024.

Comments: 28 pages, 18 figures. Submitted to PTEP

arXiv:2403.08619 [pdf, other]

Measurements of the charge ratio and polarization of cosmic-ray muons with the Super-Kamiokande detector

Authors: H. Kitagawa, T. Tada, K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya , et al. (231 additional authors not shown)

Abstract: We present the results of the charge ratio ($R$) and polarization ($P^μ_{0}$) measurements using the decay electron events collected from 2008 September to 2022 June by the Super-Kamiokande detector. Because of its underground location and long operation, we performed high precision measurements by accumulating cosmic-ray muons. We measured the muon charge ratio to be $R=1.32 \pm 0.02$ $(\mathrm{stat.}{+}\mathrm{syst.})$ at $E_μ\cos θ_{\mathrm{Zenith}}=0.7^{+0.3}_{-0.2}$ $\mathrm{TeV}$, where $E_μ$ is the muon energy and $θ_{\mathrm{Zenith}}$ is the zenith angle of incoming cosmic-ray muons. This result is consistent with the Honda flux model while this suggests a tension with the $πK$ model of $1.9σ$. We also measured the muon polarization at the production location to be $P^μ_{0}=0.52 \pm 0.02$ $(\mathrm{stat.}{+}\mathrm{syst.})$ at the muon momentum of $0.9^{+0.6}_{-0.1}$ $\mathrm{TeV}/c$ at the surface of the mountain; this also suggests a tension with the Honda flux model of $1.5σ$. This is the most precise measurement ever to experimentally determine the cosmic-ray muon polarization near $1~\mathrm{TeV}/c$. These measurement results are useful to improve the atmospheric neutrino simulations.

Submitted 13 March, 2024; originally announced March 2024.

Comments: 29 pages, 45 figures

arXiv:2403.07796 [pdf, other]

doi 10.1016/j.nima.2024.169480

Second gadolinium loading to Super-Kamiokande

Authors: K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Sato, H. Sekiya, H. Shiba, K. Shimizu, M. Shiozawa , et al. (225 additional authors not shown)

Abstract: The first loading of gadolinium (Gd) into Super-Kamiokande in 2020 was successful, and the neutron capture efficiency on Gd reached 50\%. To further increase the Gd neutron capture efficiency to 75\%, 26.1 tons of $\rm Gd_2(\rm SO_4)_3\cdot \rm 8H_2O$ was additionally loaded into Super-Kamiokande (SK) from May 31 to July 4, 2022. As the amount of loaded $\rm Gd_2(\rm SO_4)_3\cdot \rm 8H_2O$ was doubled compared to the first loading, the capacity of the powder dissolving system was doubled. We also developed new batches of gadolinium sulfate with even further reduced radioactive impurities. In addition, a more efficient screening method was devised and implemented to evaluate these new batches of $\rm Gd_2(\rm SO_4)_3\cdot \rm 8H_2O$. Following the second loading, the Gd concentration in SK was measured to be $333.5\pm2.5$ ppm via an Atomic Absorption Spectrometer (AAS). From the mean neutron capture time constant of neutrons from an Am/Be calibration source, the Gd concentration was independently measured to be 332.7 $\pm$ 6.8(sys.) $\pm$ 1.1(stat.) ppm, consistent with the AAS result. Furthermore, during the loading the Gd concentration was monitored continually using the capture time constant of each spallation neutron produced by cosmic-ray muons, and the final neutron capture efficiency was shown to become 1.5 times higher than that of the first loaded phase, as expected.

Submitted 18 June, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

Comments: 34 pages, 13 figures, submitted to Nuclear Inst. and Methods in Physics Research, A

Journal ref: Nuclear Inst. and Methods in Physics Research, A 1065 (2024) 169480

arXiv:2403.06760 [pdf, other]

Performance of SK-Gd's Upgraded Real-time Supernova Monitoring System

Authors: Y. Kashiwagi, K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Sato, H. Sekiya, H. Shiba, K. Shimizu, M. Shiozawa , et al. (214 additional authors not shown)

Abstract: Among multi-messenger observations of the next galactic core-collapse supernova, Super-Kamiokande (SK) plays a critical role in detecting the emitted supernova neutrinos, determining the direction to the supernova (SN), and notifying the astronomical community of these observations in advance of the optical signal. In 2022, SK increased the gadolinium dissolved in its water target (SK-Gd) and achieved a Gd concentration of 0.033%, resulting in enhanced neutron detection capability, which in turn enables more accurate determination of the supernova direction. Accordingly, SK-Gd's real-time supernova monitoring system (Abe et al. 2016b) has been upgraded. SK_SN Notice, a warning system that works together with this monitoring system, was released on December 13, 2021, and is available through GCN Notices (Barthelmy et al. 2000). When the monitoring system detects an SN-like burst of events, SK_SN Notice will automatically distribute an alarm with the reconstructed direction to the supernova candidate within a few minutes. In this paper, we present a systematic study of SK-Gd's response to a simulated galactic SN. Assuming a supernova situated at 10 kpc, neutrino fluxes from six supernova models are used to characterize SK-Gd's pointing accuracy using the same tools as the online monitoring system. The pointing accuracy is found to vary from 3-7$^\circ$ depending on the models. However, if the supernova is closer than 10 kpc, SK_SN Notice can issue an alarm with three-degree accuracy, which will benefit follow-up observations by optical telescopes with large fields of view.

Submitted 13 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: 38 pages, 29 figures, 6 tables

arXiv:2312.12907 [pdf, ps, other]

doi 10.1103/PhysRevD.109.092001

Solar neutrino measurements using the full data period of Super-Kamiokande-IV

Authors: Super-Kamiokande Collaboration, :, K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, S. Imaizumi, K. Iyogi, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, Y. Kato, Y. Kishimoto, S. Miki, S. Mine, M. Miura, T. Mochizuki, S. Moriyama, Y. Nagao, M. Nakahata , et al. (305 additional authors not shown)

Abstract: An analysis of solar neutrino data from the fourth phase of Super-Kamiokande~(SK-IV) from October 2008 to May 2018 is performed and the results are presented. The observation time of the data set of SK-IV corresponds to $2970$~days and the total live time for all four phases is $5805$~days. For more precise solar neutrino measurements, several improvements are applied in this analysis: lowering the data acquisition threshold in May 2015, further reduction of the spallation background using neutron clustering events, precise energy reconstruction considering the time variation of the PMT gain. The observed number of solar neutrino events in $3.49$--$19.49$ MeV electron kinetic energy region during SK-IV is $65,443^{+390}_{-388}\,(\mathrm{stat.})\pm 925\,(\mathrm{syst.})$ events. Corresponding $\mathrm{^{8}B}$ solar neutrino flux is $(2.314 \pm 0.014\, \rm{(stat.)} \pm 0.040 \, \rm{(syst.)}) \times 10^{6}~\mathrm{cm^{-2}\,s^{-1}}$, assuming a pure electron-neutrino flavor component without neutrino oscillations. The flux combined with all SK phases up to SK-IV is $(2.336 \pm 0.011\, \rm{(stat.)} \pm 0.043 \, \rm{(syst.)}) \times 10^{6}~\mathrm{cm^{-2}\,s^{-1}}$. Based on the neutrino oscillation analysis from all solar experiments, including the SK $5805$~days data set, the best-fit neutrino oscillation parameters are $\rm{sin^{2} θ_{12,\,solar}} = 0.306 \pm 0.013 $ and $Δm^{2}_{21,\,\mathrm{solar}} = (6.10^{+ 0.95}_{-0.81}) \times 10^{-5}~\rm{eV}^{2}$, with a deviation of about 1.5$σ$ from the $Δm^{2}_{21}$ parameter obtained by KamLAND. The best-fit neutrino oscillation parameters obtained from all solar experiments and KamLAND are $\sin^{2} θ_{12,\,\mathrm{global}} = 0.307 \pm 0.012 $ and $Δm^{2}_{21,\,\mathrm{global}} = (7.50^{+ 0.19}_{-0.18}) \times 10^{-5}~\rm{eV}^{2}$.

Submitted 20 February, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: 47 pages, 61 figures

Journal ref: Phys. Rev. D 109, 092001 (2024)

arXiv:2312.09582 [pdf, other]

Phoneme-aware Encoding for Prefix-tree-based Contextual ASR

Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Hiroaki Ogawa, Siddhant Arora, Shinji Watanabe

Abstract: In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose, which efficiently biases such words with a prefix tree. While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words of unusual pronunciations. As TCPGen handles biasing words as subword units, we propose obtaining subword-level phoneme-aware encoding by using alignment between phonemes and subwords. Furthermore, we propose injecting phoneme-level predictions from CTC into queries of TCPGen so that the model better interprets the phoneme-aware encodings. We conducted ASR experiments with TCPGen for RNN transducer. We observed that the proposed phoneme-aware encoding outperformed ordinary grapheme-based encoding on both the English LibriSpeech and Japanese CSJ datasets, demonstrating the robustness of our approach across linguistically diverse languages.

Submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted to ICASSP2024

arXiv:2311.05105 [pdf, other]

doi 10.1103/PhysRevD.109.072014

Atmospheric neutrino oscillation analysis with neutron tagging and an expanded fiducial volume in Super-Kamiokande I-V

Authors: Super-Kamiokande Collaboration, :, T. Wester, K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Sato, H. Sekiya , et al. (212 additional authors not shown)

Abstract: We present a measurement of neutrino oscillation parameters with the Super-Kamiokande detector using atmospheric neutrinos from the complete pure-water SK I-V (April 1996-July 2020) data set, including events from an expanded fiducial volume. The data set corresponds to 6511.3 live days and an exposure of 484.2 kiloton-years. Measurements of the neutrino oscillation parameters $Δm^2_{32}$, $\sin^2θ_{23}$, $\sin^2 θ_{13}$, $δ_{CP}$, and the preference for the neutrino mass ordering are presented with atmospheric neutrino data alone, and with constraints on $\sin^2 θ_{13}$ from reactor neutrino experiments. Our analysis including constraints on $\sin^2 θ_{13}$ favors the normal mass ordering at the 92.3% level.

Submitted 8 November, 2023; originally announced November 2023.

Comments: 24 pages, 18 figures

arXiv:2311.03842 [pdf, ps, other]

Measurement of the neutrino-oxygen neutral-current quasielastic cross section using atmospheric neutrinos in the SK-Gd experiment

Authors: S. Sakai, K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Sato, H. Sekiya, H. Shiba, K. Shimizu , et al. (211 additional authors not shown)

Abstract: We report the first measurement of the atmospheric neutrino-oxygen neutral-current quasielastic (NCQE) cross section in the gadolinium-loaded Super-Kamiokande (SK) water Cherenkov detector. In June 2020, SK began a new experimental phase, named SK-Gd, by loading 0.011% by mass of gadolinium into the ultrapure water of the SK detector. The introduction of gadolinium to ultrapure water has the effect of improving the neutron-tagging efficiency. Using a 552.2 day data set from August 2020 to June 2022, we measure the NCQE cross section to be 0.74 $\pm$ 0.22(stat.) $^{+0.85}_{-0.15}$ (syst.) $\times$ 10$^{-38}$ cm$^{2}$/oxygen in the energy range from 160 MeV to 10 GeV, which is consistent with the atmospheric neutrino-flux-averaged theoretical NCQE cross section and the measurement in the SK pure-water phase within the uncertainties. Furthermore, we compare the models of the nucleon-nucleus interactions in water and find that the Binary Cascade model and the Liege Intranuclear Cascade model provide a somewhat better fit to the observed data than the Bertini Cascade model. Since the atmospheric neutrino-oxygen NCQE reactions are one of the main backgrounds in the search for diffuse supernova neutrino background (DSNB), these new results will contribute to future studies - and the potential discovery - of the DSNB in SK.

Submitted 7 November, 2023; originally announced November 2023.

Comments: 8 pages, 3 figures

arXiv:2311.01159 [pdf, other]

Search for Periodic Time Variations of the Solar $^8$B Neutrino Flux between 1996 and 2018 in Super-Kamiokande

Authors: K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Sato, H. Sekiya, H. Shiba, K. Shimizu, M. Shiozawa , et al. (211 additional authors not shown)

Abstract: We report a search for time variations of the solar $^8$B neutrino flux using 5804 live days of Super-Kamiokande data collected between May 31, 1996, and May 30, 2018. Super-Kamiokande measured the precise time of each solar neutrino interaction over 22 calendar years to search for solar neutrino flux modulations with unprecedented precision. Periodic modulations are searched for in a dataset comprising five-day interval solar neutrino flux measurements with a maximum likelihood method. We also applied the Lomb-Scargle method to this dataset to compare it with previous reports. The only significant modulation found is due to the elliptic orbit of the Earth around the Sun. The observed modulation is consistent with astronomical data: we measured an eccentricity of (1.53$\pm$0.35)\%, and a perihelion shift of ($-$1.5$\pm$13.5) days.

Submitted 6 June, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

Comments: 8 pages, 5 figures, 2 tables, and data file: "sksolartimevariation5804d.txt" (the data file updated with additional 3 columns -- R^2 correction, upper-error, lower-error)

Journal ref: Phys.Rev.Lett 132, 241803 (2024)

arXiv:2310.02973 [pdf, other]

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

Authors: Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

Abstract: Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additional tasks using single-token task specifiers. We enhance this approach through instruction tuning, i.e., finetuning by describing the task using natural language instructions followed by the list of label options. Our approach can generalize to new task descriptions for the seen tasks during inference, thereby enhancing its user-friendliness. We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages. On most tasks, UniverSLU achieves competitive performance and often even surpasses task-specific models. Additionally, we assess the zero-shot capabilities, finding that the model generalizes to new datasets and languages for seen task types.

Submitted 3 April, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: Accepted at NAACL 2024

arXiv:2309.08876 [pdf, ps, other]

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Abstract: Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training schemes to use text-only data. Inspired by recent advances in decoder-only language models (LMs), such as GPT-3 and PaLM adopted for speech-processing tasks, we propose using a decoder-only architecture for ASR with simple text augmentation. To provide audio information, encoder features compressed by CTC prediction are used as prompts for the decoder, which can be regarded as refining CTC prediction using the decoder-only model. Because the decoder architecture is the same as an autoregressive LM, it is simple to enhance the model by leveraging external text data with LM training. An experimental comparison using LibriSpeech and Switchboard shows that our proposed models with text augmentation training reduced word error rates from ordinary CTC by 0.3% and 1.4% on the LibriSpeech test-clean and test-other sets, respectively, and 2.9% and 5.0% on Switchboard and CallHome. The proposed model had an advantage in computational efficiency compared with conventional encoder-decoder ASR models with a similar parameter setup, and outperformed them on the LibriSpeech 100h and Switchboard training scenarios.

Submitted 9 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

arXiv:2307.12767 [pdf, ps, other]

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition

Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Abstract: Although frame-based models, such as CTC and transducers, have an affinity for streaming automatic speech recognition, their decoding uses no future knowledge, which could lead to incorrect pruning. Conversely, label-based attention encoder-decoder mitigates this issue using soft attention to the input, while it tends to overestimate labels biased towards its training domain, unlike CTC. We exploit these complementary attributes and propose to integrate the frame- and label-synchronous (F-/L-Sync) decoding alternately performed within a single beam-search scheme. F-Sync decoding leads the decoding for block-wise processing, while L-Sync decoding provides the prioritized hypotheses using look-ahead future frames within a block. We maintain the hypotheses from both decoding methods to perform effective pruning. Experiments demonstrate that the proposed search algorithm achieves lower error rates compared to the other search methods, while being robust against out-of-domain situations.

Submitted 24 July, 2023; originally announced July 2023.

Comments: Accepted for Interspeech 2023

arXiv:2307.11005 [pdf, other]

Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

Authors: Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

Abstract: There has been an increased interest in the integration of pretrained speech recognition (ASR) and language models (LM) into the SLU framework. However, prior methods often struggle with a vocabulary mismatch between pretrained models, and LM cannot be directly utilized as they diverge from its NLU formulation. In this study, we propose a three-pass end-to-end (E2E) SLU system that effectively integrates ASR and LM subnetworks into the SLU formulation for sequence generation tasks. In the first pass, our architecture predicts ASR transcripts using the ASR subnetwork. This is followed by the LM subnetwork, which makes an initial SLU prediction. Finally, in the third pass, the deliberation subnetwork conditions on representations from the ASR and LM subnetworks to make the final prediction. Our proposed three-pass SLU system shows improved performance over cascaded and E2E SLU models on two benchmark SLU datasets, SLURP and SLUE, especially on acoustically challenging utterances.

Submitted 20 July, 2023; originally announced July 2023.

Comments: Accepted at INTERSPEECH 2023

arXiv:2306.01247 [pdf, other]

Tensor decomposition for minimization of E2E SLU model toward on-device processing

Authors: Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

Abstract: Spoken Language Understanding (SLU) is a critical speech recognition application and is often deployed on edge devices. Consequently, on-device processing plays a significant role in the practical implementation of SLU. This paper focuses on the end-to-end (E2E) SLU model due to its small latency property, unlike a cascade system, and aims to minimize the computational cost. We reduce the model size by applying tensor decomposition to the Conformer and E-Branchformer architectures used in our E2E SLU models. We propose to apply singular value decomposition to linear layers and the Tucker decomposition to convolution layers, respectively. We also compare COMP/PARFAC decomposition and Tensor-Train decomposition to the Tucker decomposition. Since the E2E model is represented by a single neural network, our tensor decomposition can flexibly control the number of parameters without changing feature dimensions. On the STOP dataset, we achieved 70.9% exact match accuracy under the tight constraint of only 15 million parameters.

Submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2305.05135 [pdf, other]

doi 10.3847/2041-8213/acdc9e

Search for astrophysical electron antineutrinos in Super-Kamiokande with 0.01wt% gadolinium-loaded water

Authors: M. Harada, K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, H. Shiba, et al. (216 additional authors not shown)

Abstract: We report the first search result for the flux of astrophysical electron antineutrinos for energies O(10) MeV in the gadolinium-loaded Super-Kamiokande (SK) detector. In June 2020, gadolinium was introduced to the ultra-pure water of the SK detector in order to detect neutrons more efficiently. In this new experimental phase, SK-Gd, we can search for electron antineutrinos via inverse beta decay with efficient background rejection and higher signal efficiency thanks to the high efficiency of the neutron tagging technique. In this paper, we report the result for the initial stage of SK-Gd with a $22.5\times552$ $\rm kton\cdot day$ exposure at 0.01% Gd mass concentration. No significant excess over the expected background in the observed events is found for the neutrino energies below 31.3 MeV. Thus, the flux upper limits are placed at the 90% confidence level. The limits and sensitivities are already comparable with the previous SK result with pure-water ($22.5 \times 2970 \rm kton\cdot day$) owing to the enhanced neutron tagging.

Submitted 30 May, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

arXiv:2305.01620 [pdf, ps, other]

A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge

Authors: Siddhant Arora, Hayato Futami, Shih-Lun Wu, Jessica Huynh, Yifan Peng, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

Abstract: Recently there have been efforts to introduce new benchmark tasks for spoken language understanding (SLU), like semantic parsing. In this paper, we describe our proposed spoken semantic parsing system for the quality track (Track 1) in Spoken Language Understanding Grand Challenge which is part of ICASSP Signal Processing Grand Challenge 2023. We experiment with both end-to-end and pipeline systems for this task. Strong automatic speech recognition (ASR) models like Whisper and pretrained Language models (LM) like BART are utilized inside our SLU framework to boost performance. We also investigate the output level combination of various models to get an exact match accuracy of 80.8, which won the 1st place at the challenge.

Submitted 6 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: First Place in Track 1 of STOP Challenge, which is part of ICASSP Signal Processing Grand Challenge 2023

arXiv:2305.01194 [pdf, ps, other]

The Pipeline System of ASR and NLU with MLM-based Data Augmentation toward STOP Low-resource Challenge

Authors: Hayato Futami, Jessica Huynh, Siddhant Arora, Shih-Lun Wu, Yosuke Kashiwagi, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

Abstract: This paper describes our system for the low-resource domain adaptation track (Track 3) in Spoken Language Understanding Grand Challenge, which is a part of ICASSP Signal Processing Grand Challenge 2023. In the track, we adopt a pipeline approach of ASR and NLU. For ASR, we fine-tune Whisper for each domain with upsampling. For NLU, we fine-tune BART on all the Track3 data and then on low-resource domain data. We apply masked LM (MLM) -based data augmentation, where some of input tokens and corresponding target labels are replaced using MLM. We also apply a retrieval-based approach, where model input is augmented with similar training samples. As a result, we achieved exact match (EM) accuracy 63.3/75.0 (average: 69.15) for reminder/weather domain, and won the 1st place at the challenge.

Submitted 11 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: To appear at ICASSP2023

arXiv:2212.10801 [pdf, other]

Measurement of the cosmogenic neutron yield in Super-Kamiokande with gadolinium loaded water

Authors: Super-Kamiokande Collaboration: M. Shinoki, K. Abe, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, et al. (217 additional authors not shown)

Abstract: Cosmic-ray muons that enter the Super-Kamiokande detector cause hadronic showers due to spallation in water, producing neutrons and radioactive isotopes. Those are a major background source for studies of MeV-scale neutrinos and searches for rare events. Since 2020, gadolinium was introduced in the ultra-pure water in the Super-Kamiokande detector to improve the detection efficiency of neutrons. In this study, the cosmogenic neutron yield was measured using data acquired during the period after the gadolinium loading. The yield was found to be $(2.76 \pm 0.02\,\mathrm{(stat.)} \pm 0.19\,\mathrm{(syst.)}) \times 10^{-4}\,μ^{-1} \mathrm{g^{-1} cm^{2}}$ at 259 GeV of average muon energy at the Super-Kamiokande detector.

Submitted 25 October, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: 10 pages, 10 figures, 3 tables

arXiv:2211.08726 [pdf, other]

Streaming Joint Speech Recognition and Disfluency Detection

Authors: Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe

Abstract: Disfluency detection has mainly been solved in a pipeline approach, as post-processing of speech recognition. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to recognition errors and provide non-verbal clues. Moreover, joint modeling results in low-latency and lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text with special tags indicating the starting and ending points of the disfluent part. However, it has problems with latency and standard language model adaptation, which arise from the additional disfluency tags. We propose a multi-task model to solve such problems, which has two output layers at the Transformer decoder; one for speech recognition and the other for disfluency detection. It is modeled to be conditioned on the currently recognized token with an additional token-dependency mechanism. We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency, on both the Switchboard and the corpus of spontaneous Japanese.

Submitted 11 May, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

Comments: Accepted at ICASSP2023

arXiv:2210.12948 [pdf, other]

Searching for neutrinos from solar flares across solar cycles 23 and 24 with the Super-Kamiokande detector

Authors: K. Okamoto, K. Abe, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, Y. Kaneshima, Y. Kataoka, Y. Kashiwagi, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nagao, M. Nakahata, Y. Nakano, S. Nakayama, Y. Noguchi, K. Sato, H. Sekiya, K. Shimizu, M. Shiozawa, et al. (220 additional authors not shown)

Abstract: Neutrinos associated with solar flares (solar-flare neutrinos) provide information on particle acceleration mechanisms during the impulsive phase of solar flares. We searched using the Super-Kamiokande detector for neutrinos from solar flares that occurred during solar cycles $23$ and $24$, including the largest solar flare (X28.0) on November 4th, 2003. In order to minimize the background rate we searched for neutrino interactions within narrow time windows coincident with $γ$-rays and soft X-rays recorded by satellites. In addition, we performed the first attempt to search for solar-flare neutrinos from solar flares on the invisible side of the Sun by using the emission time of coronal mass ejections (CMEs). By selecting twenty powerful solar flares above X5.0 on the visible side and eight CMEs whose emission speed exceeds $2000$ $\mathrm{km \, s^{-1}}$ on the invisible side from 1996 to 2018, we found two (six) neutrino events coincident with solar flares occurring on the visible (invisible) side of the Sun, with a typical background rate of $0.10$ ($0.62$) events per flare in the MeV-GeV energy range. No significant solar-flare neutrino signal above the estimated background rate was observed. As a result we set the following upper limit on neutrino fluence at the Earth $\mathitΦ<1.1\times10^{6}$ $\mathrm{cm^{-2}}$ at the $90\%$ confidence level for the largest solar flare. The resulting fluence limits allow us to constrain some of the theoretical models for solar-flare neutrino emission.

Submitted 26 October, 2022; v1 submitted 24 October, 2022; originally announced October 2022.

Comments: 36 pages, 18 figures, 9 tables (Figure 12 was replaced because it was incorrect in version 1.)

arXiv:2209.14968 [pdf, other]

doi 10.1103/PhysRevLett.130.031802

Search for Cosmic-ray Boosted Sub-GeV Dark Matter using Recoil Protons at Super-Kamiokande

Authors: The Super-Kamiokande Collaboration: K. Abe, Y. Hayato, K. Hiraide, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, H. Shiba, K. Shimizu, et al. (197 additional authors not shown)

Abstract: We report a search for cosmic-ray boosted dark matter with protons using the 0.37 megaton$\times$years data collected at Super-Kamiokande experiment during the 1996-2018 period (SKI-IV phase). We searched for an excess of proton recoils above the atmospheric neutrino background from the vicinity of the Galactic Center. No such excess is observed, and limits are calculated for two reference models of dark matter with either a constant interaction cross-section or through a scalar mediator. This is the first experimental search for boosted dark matter with hadrons using directional information. The results present the most stringent limits on cosmic-ray boosted dark matter and exclude the dark matter-nucleon elastic scattering cross-section between $10^{-33}\text{ cm}^{2}$ and $10^{-27}\text{ cm}^{2}$ for dark matter mass from 10 MeV/$c^2$ to 1 GeV/$c^2$.

Submitted 30 August, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: With 1-page appendix. A bug was found in July 2023. This version is updated to match the erratum

Journal ref: Phys. Rev. Lett. 130 (2023) 031802

arXiv:2208.13188 [pdf, other]

Search for proton decay via $p\rightarrow μ^+K^0$ in 0.37 megaton-years exposure of Super-Kamiokande

Authors: Super-Kamiokande Collaboration: R. Matsumoto, K. Abe, Y. Hayato, K. Hiraide, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, H. Shiba, et al. (208 additional authors not shown)

Abstract: We searched for proton decay via $p\toμ^+K^0$ in 0.37\,Mton$\cdot$years of data collected between 1996 and 2018 from the Super-Kamiokande water Cherenkov experiment. The selection criteria were defined separately for $K^0_S$ and $K^0_L$ channels. No significant event excess has been observed. As a result of this analysis, which extends the previous search by an additional 0.2\,Mton$\cdot$years of exposure and uses an improved event reconstruction, we set a lower limit of $3.6\times10^{33}$ years on the proton lifetime.

Submitted 28 August, 2022; originally announced August 2022.

Comments: 13 pages, 11 figures

arXiv:2206.07430 [pdf, ps, other]

Residual Language Model for End-to-end Speech Recognition

Authors: Emiru Tsunoo, Yosuke Kashiwagi, Chaitanya Narisetty, Shinji Watanabe

Abstract: End-to-end automatic speech recognition suffers from adaptation to unknown target domain speech despite being trained with a large amount of paired audio--text data. Recent studies estimate a linguistic bias of the model as the internal language model (LM). To effectively adapt to the target domain, the internal LM is subtracted from the posterior during inference and fused with an external target-domain LM. However, this fusion complicates the inference and the estimation of the internal LM may not always be accurate. In this paper, we propose a simple external LM fusion method for domain adaptation, which considers the internal LM estimation in its training. We directly model the residual factor of the external and internal LMs, namely the residual LM. To stably train the residual LM, we propose smoothing the estimated internal LM and optimizing it with a combination of cross-entropy and mean-squared-error losses, which consider the statistical behaviors of the internal LM in the target domain data. We experimentally confirmed that the proposed residual LM performs better than the internal LM estimation in most of the cross-domain and intra-domain scenarios.

Submitted 15 June, 2022; originally announced June 2022.

Comments: Accepted for Interspeech2022

arXiv:2206.01380 [pdf, other]

doi 10.3847/1538-4357/ac8f41

Search for supernova bursts in Super-Kamiokande IV

Authors: The Super-Kamiokande collaboration: M. Mori, K. Abe, Y. Hayato, K. Hiraide, K. Ieki, M. Ikeda, S. Imaizumi, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nagao, M. Nakahata, Y. Nakano, S. Nakayama, Y. Noguchi, T. Okada, K. Okamoto, et al. (223 additional authors not shown)

Abstract: Super-Kamiokande has been searching for neutrino bursts characteristic of core-collapse supernovae continuously, in real time, since the start of operations in 1996. The present work focuses on detecting more distant supernovae whose event rate may be too small to trigger in real time, but may be identified using an offline approach. The analysis of data collected from 2008 to 2018 found no evidence of distant supernovae bursts. This establishes an upper limit of 0.29 year$^{-1}$ on the rate of core-collapse supernovae out to 100 kpc at 90% C.L. For supernovae that fail to explode and collapse directly to black holes, the limit reaches to 300 kpc.

Submitted 2 June, 2022; originally announced June 2022.

arXiv:2205.09881 [pdf, other]

doi 10.3847/1538-4357/ac7f9c

Pre-Supernova Alert System for Super-Kamiokande

Authors: Super-Kamiokande Collaboration: L. N. Machado, K. Abe, Y. Hayato, K. Hiraide, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, H. Shiba, et al. (202 additional authors not shown)

Abstract: In 2020, the Super-Kamiokande (SK) experiment moved to a new stage (SK-Gd) in which gadolinium (Gd) sulfate octahydrate was added to the water in the detector, enhancing the efficiency to detect thermal neutrons and consequently improving the sensitivity to low energy electron anti-neutrinos from inverse beta decay (IBD) interactions. SK-Gd has the potential to provide early alerts of incipient core-collapse supernovae through detection of electron anti-neutrinos from thermal and nuclear processes responsible for the cooling of massive stars before the gravitational collapse of their cores. These pre-supernova neutrinos emitted during the silicon burning phase can exceed the energy threshold for IBD reactions. We present the sensitivity of SK-Gd to pre-supernova stars and the techniques used for the development of a pre-supernova alarm based on the detection of these neutrinos in SK, as well as prospects for future SK-Gd phases with higher concentrations of Gd. For the current SK-Gd phase, high-confidence alerts for Betelgeuse could be issued up to nine hours in advance of the core-collapse itself.

Submitted 17 August, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

Comments: 20 pages

Journal ref: The Astrophysical Journal, Volume 935, Number 1 (2022)

arXiv:2202.01405 [pdf, other]

Joint Speech Recognition and Audio Captioning

Authors: Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

Abstract: Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR). The goal of AAC is to generate natural language descriptions of contents in audio samples. We propose several approaches for end-to-end joint modeling of ASR and AAC tasks and demonstrate their advantages over traditional approaches, which model these tasks independently. A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions. Therefore we also create a multi-task dataset by mixing the clean speech Wall Street Journal corpus with multiple levels of background noises chosen from the AudioCaps dataset. We also perform extensive experimental evaluation and show improvements of our proposed methods as compared to existing state-of-the-art ASR and AAC methods.

Submitted 2 February, 2022; originally announced February 2022.

Comments: 5 pages, 2 figures. Accepted for ICASSP 2022

arXiv:2201.10190 [pdf, ps, other]

Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR

Authors: Emiru Tsunoo, Chaitanya Narisetty, Michael Hentschel, Yosuke Kashiwagi, Shinji Watanabe

Abstract: A streaming style inference of encoder-decoder automatic speech recognition (ASR) system is important for reducing latency, which is essential for interactive use cases. To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination. In the endpoint prediction, we compute the expectation of the number of tokens that are yet to be emitted in the encoder features of the current blocks using the CTC posterior. Based on the expectation value, the decoder predicts the endpoint to realize continuous block synchronization, as a running stitch. Meanwhile, endpoint post-determination probabilistically detects backward jump of the source-target attention, which is caused by the misprediction of endpoints. Then it resumes decoding by discarding those hypotheses, as back stitch. We combine these methods into a hybrid approach, namely run-and-back stitch search, which reduces the computational cost and latency. Evaluations of various ASR tasks show the efficiency of our proposed decoding algorithm, which achieves a latency reduction, for instance in the Librispeech test set from 1487 ms to 821 ms at the 90th percentile, while maintaining a high recognition accuracy.

Submitted 25 January, 2022; originally announced January 2022.

Comments: Accepted for ICASSP 2022

arXiv:2110.05968 [pdf, ps, other]

Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models

Authors: Ryosuke Sawata, Yosuke Kashiwagi, Shusuke Takahashi

Abstract: A deep neural network (DNN)-based speech enhancement (SE) method aiming to maximize the performance of an automatic speech recognition (ASR) system is proposed in this paper. In order to optimize the DNN-based SE model in terms of the character error rate (CER), which is one of the metrics used to evaluate the ASR system and is generally non-differentiable, our method uses two DNNs: one for speech processing and one for mimicking the output CERs derived through an acoustic model (AM). Then both DNNs are alternately optimized in the training phase. Even if the AM is a black box, e.g., one provided by a third party, the proposed method enables the DNN-based SE model to be optimized in terms of the CER since the DNN mimicking the AM is differentiable. Consequently, it becomes feasible to build a CER-centric SE model that has no negative effect, e.g., additional calculation cost and changing network architecture, on the inference phase since our method is merely a training scheme for the existing DNN-based methods. Experimental results show that our method improved CER by 8.8% relative derived through a black-box AM although certain noise levels are kept.

Submitted 22 February, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: Accepted by ICASSP 2022

arXiv:2110.05030 [pdf, other]

doi 10.1093/pasj/psab100

Subaru/FOCAS IFU revealed the metallicity gradient of a local extremely metal-poor galaxy

Authors: Yuri Kashiwagi, Akio K. Inoue, Yuki Isobe, Kimihiko Nakajima, Masami Ouchi, Shinobu Ozaki, Seiji Fujimoto, Yoshiaki Ono, Takashi Kojima

Abstract: We present the first measurement of the metallicity gradient in extremely metal-poor galaxies (EMPGs). With Subaru/Faint Object Camera And Spectrograph (FOCAS) Integral Field Unit (IFU), we have observed a nearby, low-mass EMPG, HSC J1631+4426, whose oxygen abundance and stellar mass are known to be 12+log(O/H) $=6.9$ and $\log_{10}(M_*/{\rm M}_\odot)=5.8$, respectively. The measured metallicity gradient is $-0.36 \pm 0.04$ dex kpc$^{-1}$ corresponding to $-0.049 \pm 0.006$ dex R$_\mathrm{e}^{-1}$ for the continuum effective radius of $R_\mathrm{e} = 0.14$ kpc. Our observation has successfully demonstrated that three-dimensional spectroscopy with 8m-class telescopes is powerful enough to reveal the metallicity distribution in local EMPGs, providing precious information of the baryon cycle in local analogs of primordial galaxies in the early Universe.

Submitted 11 October, 2021; originally announced October 2021.

Comments: PASJ accepted, 7 pages, 4 figures

arXiv:2106.03419 [pdf, ps, other]

Data Augmentation Methods for End-to-end Speech Recognition on Distant-Talk Scenarios

Authors: Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe

Abstract: Although end-to-end automatic speech recognition (E2E ASR) has achieved great performance in tasks that have numerous paired data, it is still challenging to make E2E ASR robust against noisy and low-resource conditions. In this study, we investigated data augmentation methods for E2E ASR in distant-talk scenarios. E2E ASR models are trained on the series of CHiME challenge datasets, which are suitable tasks for studying robustness against noisy and spontaneous speech. We propose to use three augmentation methods and their combinations: 1) data augmentation using text-to-speech (TTS) data, 2) cycle-consistent generative adversarial network (Cycle-GAN) augmentation trained to map two different audio characteristics, the one of clean speech and of noisy recordings, to match the testing condition, and 3) pseudo-label augmentation provided by the pretrained ASR module for smoothing label distributions. Experimental results using the CHiME-6/CHiME-4 datasets show that each augmentation method individually improves the accuracy on top of the conventional SpecAugment; further improvements are obtained by combining these approaches. We achieved 4.3% word error rate (WER) reduction, which was more significant than that of the SpecAugment, when we combine all three augmentations for the CHiME-6 task.

Submitted 7 June, 2021; originally announced June 2021.

Comments: Accepted for Interspeech 2021

arXiv:2102.09168 [pdf, other]

Gaussian Kernelized Self-Attention for Long Sequence Data and Its Application to CTC-based Speech Recognition

Authors: Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

Abstract: Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability. However, it is also known that the accuracy degrades when applying SA to long sequence data. This is mainly due to the length mismatch between the inference and training data because the training data are usually divided into short segments for efficient training. To mitigate this mismatch, we propose a new architecture, which is a variant of the Gaussian kernel, which itself is a shift-invariant kernel. First, we mathematically demonstrate that self-attention with shared weight parameters for queries and keys is equivalent to a normalized kernel function. By replacing this kernel function with the proposed Gaussian kernel, the architecture becomes completely shift-invariant with the relative position information embedded using a frame indexing technique. The proposed Gaussian kernelized SA was applied to connectionist temporal classification (CTC) based ASR. An experimental evaluation with the Corpus of Spontaneous Japanese (CSJ) and TEDLIUM 3 benchmarks shows that the proposed SA achieves a significant improvement in accuracy (e.g., from 24.0% WER to 6.0% in CSJ) in long sequence data without any windowing techniques.

Submitted 18 February, 2021; originally announced February 2021.

Comments: Accepted to ICASSP 2021

arXiv:2006.14941 [pdf, ps, other]

Streaming Transformer ASR with Blockwise Synchronous Beam Search

Authors: Emiru Tsunoo, Yosuke Kashiwagi, Shinji Watanabe

Abstract: The Transformer self-attention network has shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in that the entire input sequence is required to compute both self-attention and source-target attention. In this paper, we propose a novel blockwise synchronous beam search algorithm based on blockwise processing of the encoder to perform streaming E2E Transformer ASR. In the beam search, encoded feature blocks are synchronously aligned using a block boundary detection technique, where a reliability score of each predicted hypothesis is evaluated based on the end-of-sequence and repeated tokens in the hypothesis. Evaluations of the HKUST and AISHELL-1 Mandarin, LibriSpeech English, and CSJ Japanese tasks show that the proposed streaming Transformer algorithm outperforms conventional online approaches, including monotonic chunkwise attention (MoChA), especially when using the knowledge distillation technique. An ablation study indicates that our streaming approach contributes to reducing the response time, and the repetition criterion contributes significantly in certain tasks. Our streaming ASR models achieve comparable or superior performance to batch models and other streaming-based Transformer methods in all tasks considered.

Submitted 17 November, 2020; v1 submitted 25 June, 2020; originally announced June 2020.

Comments: Accepted for SLT 2021

arXiv:1910.11871 [pdf, ps, other]

Towards Online End-to-end Transformer Automatic Speech Recognition

Authors: Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

Abstract: The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, Transformer has a drawback in that the entire input sequence is required to compute self-attention. We have proposed a block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism. An additional context embedding vector handed over from the previously processed block helps to encode not only local acoustic information but also global linguistic, channel, and speaker attributes. In this paper, we extend it towards an entire online E2E ASR system by introducing an online decoding process inspired by monotonic chunkwise attention (MoChA) into the Transformer decoder. Our novel MoChA training and inference algorithms exploit the unique properties of Transformer, whose attentions are not always monotonic or peaky, and have multiple heads and residual connections of the decoder layers. Evaluations of the Wall Street Journal (WSJ) and AISHELL-1 show that our proposed online Transformer decoder outperforms conventional chunkwise approaches.

Submitted 25 October, 2019; originally announced October 2019.

Comments: arXiv admin note: text overlap with arXiv:1910.07204

arXiv:1910.07204 [pdf, ps, other]

Transformer ASR with Contextual Block Processing

Authors: Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

Abstract: The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks (RNNs) in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in that the entire input sequence is required to compute self-attention. In this paper, we propose a new block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism. An additional context embedding vector handed over from the previously processed block helps to encode not only local acoustic information but also global linguistic, channel, and speaker attributes. We introduce a novel mask technique to implement the context inheritance to train the model efficiently. Evaluations of the Wall Street Journal (WSJ), LibriSpeech, VoxForge Italian, and AISHELL-1 Mandarin speech recognition datasets show that our proposed contextual block processing method outperforms naive block processing consistently. Furthermore, the attention weight tendency of each layer is analyzed to clarify how the added contextual inheritance mechanism models the global information.

Submitted 16 October, 2019; originally announced October 2019.

Comments: Accepted for ASRU 2019

arXiv:1905.07149 [pdf, ps, other]

End-to-end Adaptation with Backpropagation through WFST for On-device Speech Recognition System

Authors: Emiru Tsunoo, Yosuke Kashiwagi, Satoshi Asakawa, Toshiyuki Kumakura

Abstract: An on-device DNN-HMM speech recognition system efficiently works with a limited vocabulary in the presence of a variety of predictable noise. In such a case, vocabulary and environment adaptation is highly effective. In this paper, we propose a novel method of end-to-end (E2E) adaptation, which adjusts not only an acoustic model (AM) but also a weighted finite-state transducer (WFST). We convert a pretrained WFST to a trainable neural network and adapt the system to target environments/vocabulary by E2E joint training with an AM. We replicate Viterbi decoding with forward-backward neural network computation, which is similar to recurrent neural networks (RNNs). By pooling output score sequences, a vocabulary posterior for each utterance is obtained and used for discriminative loss computation. Experiments using 2–10 hours of English/Japanese adaptation datasets indicate that the fine-tuning of only WFSTs and that of only AMs are both comparable to a state-of-the-art adaptation method, and E2E joint training of the two components achieves the best recognition performance. We also adapt each language system to the other language using the adaptation data, and the results show that the proposed method also works well for language adaptations.

Submitted 24 June, 2019; v1 submitted 17 May, 2019; originally announced May 2019.

Comments: Accepted for Interspeech 2019

Showing 1–40 of 40 results for author: Kashiwagi, Y