Showing 1–17 of 17 results for author: Kiyono, S

  1. arXiv:2407.00454  [pdf, other]

    cs.CL

    Self-Translate-Train: A Simple but Strong Baseline for Cross-lingual Transfer of Large Language Models

    Authors: Ryokan Ri, Shun Kiyono, Sho Takase

    Abstract: Cross-lingual transfer is a promising technique for utilizing data in a source language to improve performance in a target language. However, current techniques often require an external translation system or suffer from suboptimal performance due to over-reliance on cross-lingual generalization of multi-lingual pretrained language models. In this study, we propose a simple yet effective method ca…

    Submitted 29 June, 2024; originally announced July 2024.

  2. arXiv:2406.16508  [pdf, ps, other]

    cs.CL

    Large Vocabulary Size Improves Large Language Models

    Authors: Sho Takase, Ryokan Ri, Shun Kiyono, Takuya Kato

    Abstract: This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different t…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Work in progress

  3. arXiv:2312.16903  [pdf, other]

    cs.CL cs.AI

    Spike No More: Stabilizing the Pre-training of Large Language Models

    Authors: Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

    Abstract: Loss spikes often occur during pre-training of large language models. The spikes degrade the performance of large language models and sometimes ruin the pre-training. Since the pre-training needs a vast computational budget, we should avoid such spikes. To investigate the cause of loss spikes, we focus on gradients of internal layers. Through theoretical analyses, we reveal two causes of the explo…

    Submitted 2 February, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

    Comments: Work in progress

  4. arXiv:2206.00330  [pdf, other]

    cs.LG cs.CL

    B2T Connection: Serving Stability and Performance in Deep Transformers

    Authors: Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

    Abstract: From the perspective of the layer normalization (LN) positions, the architectures of Transformers can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to be Pre-LN because, in Post-LN with deep Transformers (e.g., those with ten or more layers), the training is often unstable, resulting in useless models. However, Post-LN has consistently achieved better performance than…

    Submitted 26 May, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

    Comments: Findings of ACL 2023

  5. arXiv:2205.11833  [pdf, other]

    cs.LG cs.CL

    Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model

    Authors: Sosuke Kobayashi, Shun Kiyono, Jun Suzuki, Kentaro Inui

    Abstract: Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has been not very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demon…

    Submitted 24 May, 2022; originally announced May 2022.

    Comments: Workshop on Challenges & Perspectives in Creating Large Language Models (BigScience) 2022

  6. arXiv:2109.05644  [pdf, other]

    cs.CL

    SHAPE: Shifted Absolute Position Embedding for Transformers

    Authors: Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, Kentaro Inui

    Abstract: Position representation is crucial for building position-aware representations in Transformers. Existing position representations suffer from a lack of generalization to test data with unseen lengths or high computational cost. We investigate shifted absolute position embedding (SHAPE) to address both issues. The basic idea of SHAPE is to achieve shift invariance, which is a key property of recent…

    Submitted 12 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 (short paper, main conference)

  7. arXiv:2104.07425  [pdf, other]

    cs.CL

    Pseudo Zero Pronoun Resolution Improves Zero Anaphora Resolution

    Authors: Ryuto Konno, Shun Kiyono, Yuichiroh Matsubayashi, Hiroki Ouchi, Kentaro Inui

    Abstract: Masked language models (MLMs) have contributed to drastic performance improvements with regard to zero anaphora resolution (ZAR). To further improve this approach, in this study, we made two proposals. The first is a new pretraining task that trains MLMs on anaphoric relations with explicit supervision, and the second proposal is a new finetuning method that remedies a notorious issue, the pretrai…

    Submitted 10 September, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: Long paper accepted by EMNLP2021 main conference

  8. arXiv:2104.06022  [pdf, other]

    cs.CL cs.LG

    Lessons on Parameter Sharing across Layers in Transformers

    Authors: Sho Takase, Shun Kiyono

    Abstract: We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique, which shares parameters for one layer with all layers such as Universal Transformers (Dehghani et al., 2019), to increase the efficiency in the computational time. We propose three strategies: Sequence, Cycle, and Cycle (rev) to assign parameters to each layer. Expe…

    Submitted 2 June, 2023; v1 submitted 13 April, 2021; originally announced April 2021.

    Comments: SustaiNLP 2023

  9. arXiv:2104.01853  [pdf, other]

    cs.CL cs.LG

    Rethinking Perturbations in Encoder-Decoders for Fast Training

    Authors: Sho Takase, Shun Kiyono

    Abstract: We often use perturbations to regularize neural models. For neural encoder-decoders, previous studies applied the scheduled sampling (Bengio et al., 2015) and adversarial perturbations (Sato et al., 2019) as perturbations but these methods require considerable computational time. Thus, this study addresses the question of whether these approaches are efficient enough for training time. We compare…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Accepted at NAACL-HLT 2021

  10. arXiv:2011.00948  [pdf, other]

    cs.CL

    An Empirical Study of Contextual Data Augmentation for Japanese Zero Anaphora Resolution

    Authors: Ryuto Konno, Yuichiroh Matsubayashi, Shun Kiyono, Hiroki Ouchi, Ryo Takahashi, Kentaro Inui

    Abstract: One critical issue of zero anaphora resolution (ZAR) is the scarcity of labeled data. This study explores how effectively this problem can be alleviated by data augmentation. We adopt a state-of-the-art data augmentation method, called the contextual data augmentation (CDA), that generates labeled training instances using a pretrained language model. The CDA has been reported to work well for seve…

    Submitted 4 November, 2020; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: 13 pages, accepted by COLING 2020

  11. arXiv:2010.03155  [pdf, other]

    cs.CL

    A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction

    Authors: Masato Mita, Shun Kiyono, Masahiro Kaneko, Jun Suzuki, Kentaro Inui

    Abstract: Existing approaches for grammatical error correction (GEC) largely rely on supervised learning with manually created GEC datasets. However, there has been little focus on verifying and ensuring the quality of the datasets, and on how lower-quality data might affect GEC performance. We indeed found that there is a non-negligible amount of "noise" where errors were inappropriately edited or left unc…

    Submitted 7 October, 2020; originally announced October 2020.

    Comments: accepted by EMNLP 2020 (Findings)

  12. arXiv:2005.00987  [pdf, other]

    cs.CL

    Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction

    Authors: Masahiro Kaneko, Masato Mita, Shun Kiyono, Jun Suzuki, Kentaro Inui

    Abstract: This paper investigates how to effectively incorporate a pre-trained masked language model (MLM), such as BERT, into an encoder-decoder (EncDec) model for grammatical error correction (GEC). The answer to this question is not as straightforward as one might expect because the previous common methods for incorporating a MLM into an EncDec model have potential drawbacks when applied to GEC. For exam…

    Submitted 31 May, 2020; v1 submitted 3 May, 2020; originally announced May 2020.

    Comments: Accepted as a short paper to the 58th Annual Conference of the Association for Computational Linguistics (ACL-2020)

    Journal ref: Association for Computational Linguistics (ACL-2020)

  13. arXiv:2004.10234  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    ESPnet-ST: All-in-One Speech Translation Toolkit

    Authors: Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Enrique Yalta Soplin, Tomoki Hayashi, Shinji Watanabe

    Abstract: We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework. ESPnet-ST is a new project inside end-to-end speech processing toolkit, ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation. We provide all-in-one recipes including data pre-p…

    Submitted 30 September, 2020; v1 submitted 21 April, 2020; originally announced April 2020.

    Comments: Accepted at ACL 2020 System Demonstration (update Table1, fix typo)

  14. arXiv:1910.03246  [pdf, other]

    cs.CL

    Riposte! A Large Corpus of Counter-Arguments

    Authors: Paul Reisert, Benjamin Heinzerling, Naoya Inoue, Shun Kiyono, Kentaro Inui

    Abstract: Constructive feedback is an effective method for improving critical thinking skills. Counter-arguments (CAs), one form of constructive feedback, have been proven to be useful for critical thinking skills. However, little work has been done for constructing a large-scale corpus of them which can drive research on automatic generation of CAs for fallacious micro-level arguments (i.e. a single claim…

    Submitted 8 October, 2019; originally announced October 2019.

  15. arXiv:1909.00502  [pdf, other]

    cs.CL

    An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction

    Authors: Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, Kentaro Inui

    Abstract: The incorporation of pseudo data in the training of grammatical error correction models has been one of the main factors in improving the performance of such models. However, consensus is lacking on experimental configurations, namely, choosing how the pseudo data should be generated or used. In this study, these choices are investigated through extensive experiments, and state-of-the-art performa…

    Submitted 1 September, 2019; originally announced September 2019.

    Comments: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019)

  16. arXiv:1810.05788  [pdf, other]

    cs.CL

    Mixture of Expert/Imitator Networks: Scalable Semi-supervised Learning Framework

    Authors: Shun Kiyono, Jun Suzuki, Kentaro Inui

    Abstract: The current success of deep neural networks (DNNs) in an increasingly broad range of tasks involving artificial intelligence strongly depends on the quality and quantity of labeled training data. In general, the scarcity of labeled data, which is often observed in many natural language processing tasks, is one of the most important issues to be addressed. Semi-supervised learning (SSL) is a promis…

    Submitted 19 November, 2018; v1 submitted 12 October, 2018; originally announced October 2018.

    Comments: Accepted by AAAI 2019

  17. arXiv:1712.08302  [pdf, other]

    cs.CL

    Source-side Prediction for Neural Headline Generation

    Authors: Shun Kiyono, Sho Takase, Jun Suzuki, Naoaki Okazaki, Kentaro Inui, Masaaki Nagata

    Abstract: The encoder-decoder model is widely used in natural language generation tasks. However, the model sometimes suffers from repeated redundant generation, misses important phrases, and includes irrelevant entities. Toward solving these problems we propose a novel source-side token prediction module. Our method jointly estimates the probability distributions over source and target vocabularies to capt…

    Submitted 21 December, 2017; originally announced December 2017.

    Comments: 19 pages