Showing 1–22 of 22 results for author: Takase, S

  1. arXiv:2407.00454  [pdf, other]

    cs.CL

    Self-Translate-Train: A Simple but Strong Baseline for Cross-lingual Transfer of Large Language Models

    Authors: Ryokan Ri, Shun Kiyono, Sho Takase

    Abstract: Cross-lingual transfer is a promising technique for utilizing data in a source language to improve performance in a target language. However, current techniques often require an external translation system or suffer from suboptimal performance due to over-reliance on cross-lingual generalization of multi-lingual pretrained language models. In this study, we propose a simple yet effective method ca…

    Submitted 29 June, 2024; originally announced July 2024.

  2. arXiv:2406.16508  [pdf, ps, other]

    cs.CL

    Large Vocabulary Size Improves Large Language Models

    Authors: Sho Takase, Ryokan Ri, Shun Kiyono, Takuya Kato

    Abstract: This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different t…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Work in progress

  3. arXiv:2312.16903  [pdf, other]

    cs.CL cs.AI

    Spike No More: Stabilizing the Pre-training of Large Language Models

    Authors: Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

    Abstract: Loss spikes often occur during pre-training of large language models. The spikes degrade the performance of large language models and sometimes ruin the pre-training. Since the pre-training needs a vast computational budget, we should avoid such spikes. To investigate the cause of loss spikes, we focus on gradients of internal layers. Through theoretical analyses, we reveal two causes of the explo…

    Submitted 2 February, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

    Comments: Work in progress

  4. arXiv:2305.18156  [pdf, other]

    cs.CL cs.AI

    Exploring Effectiveness of GPT-3 in Grammatical Error Correction: A Study on Performance and Controllability in Prompt-Based Methods

    Authors: Mengsay Loem, Masahiro Kaneko, Sho Takase, Naoaki Okazaki

    Abstract: Large-scale pre-trained language models such as GPT-3 have shown remarkable performance across various natural language processing tasks. However, applying prompt-based methods with GPT-3 for Grammatical Error Correction (GEC) tasks and their controllability remains underexplored. Controllability in GEC is crucial for real-world applications, particularly in educational settings, where the ability…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted in BEA 2023

  5. arXiv:2208.12496  [pdf, other]

    cs.CL

    Nearest Neighbor Non-autoregressive Text Generation

    Authors: Ayana Niwa, Sho Takase, Naoaki Okazaki

    Abstract: Non-autoregressive (NAR) models can generate sentences with less computation than autoregressive models but sacrifice generation quality. Previous studies addressed this issue through iterative decoding. This study proposes using nearest neighbors as the initial state of an NAR decoder and editing them iteratively. We present a novel training strategy to learn the edit operations on neighbors to i…

    Submitted 26 August, 2022; originally announced August 2022.

  6. arXiv:2207.13354  [pdf, other]

    cs.CL

    Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to Self-attention

    Authors: Mengsay Loem, Sho Takase, Masahiro Kaneko, Naoaki Okazaki

    Abstract: Impressive performance of Transformer has been attributed to self-attention, where dependencies between entire input in a sequence are considered at every position. In this work, we reform the neural $n$-gram model, which focuses on only several surrounding representations of each position, with the multi-head mechanism as in Vaswani et al.(2017). Through experiments on sequence-to-sequence tasks,…

    Submitted 27 July, 2022; originally announced July 2022.

  7. arXiv:2206.00330  [pdf, other]

    cs.LG cs.CL

    B2T Connection: Serving Stability and Performance in Deep Transformers

    Authors: Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

    Abstract: From the perspective of the layer normalization (LN) positions, the architectures of Transformers can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to be Pre-LN because, in Post-LN with deep Transformers (e.g., those with ten or more layers), the training is often unstable, resulting in useless models. However, Post-LN has consistently achieved better performance than…

    Submitted 26 May, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

    Comments: Findings of ACL 2023

  8. arXiv:2203.13528  [pdf, other]

    cs.CL

    Single Model Ensemble for Subword Regularized Models in Low-Resource Machine Translation

    Authors: Sho Takase, Tatsuya Hiraoka, Naoaki Okazaki

    Abstract: Subword regularizations use multiple subword segmentations during training to improve the robustness of neural machine translation models. In previous subword regularizations, we use multiple segmentations in the training process but use only one segmentation in the inference. In this study, we propose an inference strategy to address this discrepancy. The proposed strategy approximates the margin…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: Findings of ACL 2022

  9. arXiv:2203.07085  [pdf, other]

    cs.CL

    Interpretability for Language Learners Using Example-Based Grammatical Error Correction

    Authors: Masahiro Kaneko, Sho Takase, Ayana Niwa, Naoaki Okazaki

    Abstract: Grammatical Error Correction (GEC) should not focus only on high accuracy of corrections but also on interpretability for language learning. However, existing neural-based GEC models mainly aim at improving accuracy, and their interpretability has not been explored. A promising approach for improving interpretability is an example-based method, which uses similar retrieved examples to generate cor…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: ACL 2022

  10. arXiv:2201.05313  [pdf, other]

    cs.CL

    ExtraPhrase: Efficient Data Augmentation for Abstractive Summarization

    Authors: Mengsay Loem, Sho Takase, Masahiro Kaneko, Naoaki Okazaki

    Abstract: Neural models trained with large amount of parallel data have achieved impressive performance in abstractive summarization tasks. However, large-scale parallel corpora are expensive and challenging to construct. In this work, we introduce a low-cost and effective strategy, ExtraPhrase, to augment training data for abstractive summarization tasks. ExtraPhrase constructs pseudo training data in two…

    Submitted 14 January, 2022; originally announced January 2022.

  11. arXiv:2105.12410  [pdf, other]

    cs.CL

    Joint Optimization of Tokenization and Downstream Model

    Authors: Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

    Abstract: Since traditional tokenizers are isolated from a downstream task and model, they cannot output an appropriate tokenization depending on the task and model, although recent studies imply that the appropriate tokenization improves the performance. In this paper, we propose a novel method to find an appropriate tokenization to a given downstream model by jointly optimizing a tokenizer and the model.…

    Submitted 26 May, 2021; originally announced May 2021.

    Comments: Accepted at ACL-IJCNLP 2021 Findings

  12. arXiv:2104.06022  [pdf, other]

    cs.CL cs.LG

    Lessons on Parameter Sharing across Layers in Transformers

    Authors: Sho Takase, Shun Kiyono

    Abstract: We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique, which shares parameters for one layer with all layers such as Universal Transformers (Dehghani et al., 2019), to increase the efficiency in the computational time. We propose three strategies: Sequence, Cycle, and Cycle (rev) to assign parameters to each layer. Expe…

    Submitted 2 June, 2023; v1 submitted 13 April, 2021; originally announced April 2021.

    Comments: SustaiNLP 2023

  13. arXiv:2104.01853  [pdf, other]

    cs.CL cs.LG

    Rethinking Perturbations in Encoder-Decoders for Fast Training

    Authors: Sho Takase, Shun Kiyono

    Abstract: We often use perturbations to regularize neural models. For neural encoder-decoders, previous studies applied the scheduled sampling (Bengio et al., 2015) and adversarial perturbations (Sato et al., 2019) as perturbations but these methods require considerable computational time. Thus, this study addresses the question of whether these approaches are efficient enough for training time. We compare…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Accepted at NAACL-HLT 2021

  14. arXiv:2010.07503  [pdf, other]

    cs.CL

    Multi-Task Learning for Cross-Lingual Abstractive Summarization

    Authors: Sho Takase, Naoaki Okazaki

    Abstract: We present a multi-task learning framework for cross-lingual abstractive summarization to augment training data. Recent studies constructed pseudo cross-lingual abstractive summarization data to train their neural encoder-decoders. Meanwhile, we introduce existing genuine data such as translation pairs and monolingual abstractive summarization data into training. Our proposed method, Transum, atta…

    Submitted 15 October, 2020; originally announced October 2020.

  15. arXiv:2005.00882  [pdf, other]

    cs.CL

    Improving Truthfulness of Headline Generation

    Authors: Kazuki Matsumaru, Sho Takase, Naoaki Okazaki

    Abstract: Most studies on abstractive summarization report ROUGE scores between system and reference summaries. However, we have a concern about the truthfulness of generated summaries: whether all facts of a generated summary are mentioned in the source text. This paper explores improving the truthfulness in headline generation on two popular datasets. Analyzing headlines generated by the state-of-the-art…

    Submitted 4 May, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Accepted to ACL 2020

  16. arXiv:2004.12073  [pdf, other]

    cs.CL cs.LG

    All Word Embeddings from One Embedding

    Authors: Sho Takase, Sosuke Kobayashi

    Abstract: In neural network-based models for natural language processing (NLP), the largest part of the parameters often consists of word embeddings. Conventional models prepare a large embedding matrix whose size depends on the vocabulary size. Therefore, storing these models in memory and disk storage is costly. In this study, to reduce the total number of parameters, the embeddings for all words are repr…

    Submitted 22 October, 2020; v1 submitted 25 April, 2020; originally announced April 2020.

    Comments: NeurIPS 2020

  17. arXiv:1906.05506  [pdf, other]

    cs.CL

    Character n-gram Embeddings to Improve RNN Language Models

    Authors: Sho Takase, Jun Suzuki, Masaaki Nagata

    Abstract: This paper proposes a novel Recurrent Neural Network (RNN) language model that takes advantage of character information. We focus on character n-grams based on research in the field of word embedding construction (Wieting et al. 2016). Our proposed method constructs word embeddings from character n-gram embeddings and combines them with ordinary word embeddings. We demonstrate that the proposed me…

    Submitted 13 June, 2019; originally announced June 2019.

    Comments: AAAI 2019 paper

  18. arXiv:1904.07418  [pdf, other]

    cs.CL

    Positional Encoding to Control Output Sequence Length

    Authors: Sho Takase, Naoaki Okazaki

    Abstract: Neural encoder-decoder models have been successful in natural language generation tasks. However, real applications of abstractive summarization must consider additional constraint that a generated summary should not exceed a desired length. In this paper, we propose a simple but effective extension of a sinusoidal positional encoding (Vaswani et al., 2017) to enable neural encoder-decoder model t…

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: Accepted by NAACL-HLT 2019

  19. arXiv:1808.10143  [pdf, other]

    cs.CL

    Direct Output Connection for a High-Rank Language Model

    Authors: Sho Takase, Jun Suzuki, Masaaki Nagata

    Abstract: This paper proposes a state-of-the-art recurrent neural network (RNN) language model that combines probability distributions computed not only from a final RNN layer but also from middle layers. Our proposed method raises the expressive power of a language model based on the matrix factorization interpretation of language modeling introduced by Yang et al. (2018). The proposed method improves the…

    Submitted 30 August, 2018; v1 submitted 30 August, 2018; originally announced August 2018.

    Comments: EMNLP 2018 paper

  20. arXiv:1712.08302  [pdf, other]

    cs.CL

    Source-side Prediction for Neural Headline Generation

    Authors: Shun Kiyono, Sho Takase, Jun Suzuki, Naoaki Okazaki, Kentaro Inui, Masaaki Nagata

    Abstract: The encoder-decoder model is widely used in natural language generation tasks. However, the model sometimes suffers from repeated redundant generation, misses important phrases, and includes irrelevant entities. Toward solving these problems we propose a novel source-side token prediction module. Our method jointly estimates the probability distributions over source and target vocabularies to capt…

    Submitted 21 December, 2017; originally announced December 2017.

    Comments: 19 pages

  21. arXiv:1709.08907  [pdf, other]

    cs.CL

    Input-to-Output Gate to Improve RNN Language Models

    Authors: Sho Takase, Jun Suzuki, Masaaki Nagata

    Abstract: This paper proposes a reinforcing method that refines the output layers of existing Recurrent Neural Network (RNN) language models. We refer to our proposed method as Input-to-Output Gate (IOG). IOG has an extremely simple structure, and thus, can be easily combined with any RNN language models. Our experiments on the Penn Treebank and WikiText-2 datasets demonstrate that IOG consistently boosts t…

    Submitted 28 September, 2017; v1 submitted 26 September, 2017; originally announced September 2017.

    Comments: Accepted as a conference paper in IJCNLP 2017

  22. arXiv:1707.07265  [pdf, other]

    cs.CL

    Composing Distributed Representations of Relational Patterns

    Authors: Sho Takase, Naoaki Okazaki, Kentaro Inui

    Abstract: Learning distributed representations for relation instances is a central technique in downstream NLP applications. In order to address semantic modeling of relational patterns, this paper constructs a new dataset that provides multiple similarity ratings for every pair of relational patterns on the existing dataset. In addition, we conduct a comparative study of different encoders including additi…

    Submitted 23 July, 2017; originally announced July 2017.

    Comments: Published as a conference paper at ACL 2016