Self-Translate-Train: A Simple but Strong Baseline
for Cross-lingual Transfer of Large Language Models

Ryokan Ri    Shun Kiyono    Sho Takase
SB Intuitions
{ryokan.ri,shun.kiyono,sho.takase}@sbintuitions.co.jp
Abstract

Cross-lingual transfer is a promising technique for utilizing data in a source language to improve performance in a target language. However, current techniques often require an external translation system or suffer from suboptimal performance due to over-reliance on cross-lingual generalization of multi-lingual pretrained language models. In this study, we propose a simple yet effective method called Self-Translate-Train. It leverages the translation capability of a large language model to generate synthetic training data in the target language and fine-tunes the model with its own generated data. We evaluate the proposed method on a wide range of tasks and show substantial performance gains across several non-English languages.


1 Introduction

Cross-lingual transfer is a technique to solve tasks in a target language by leveraging training data from other languages (Pikuliak et al., 2021). This has become increasingly feasible with the rise of multilingual pre-trained models, which are trained on multilingual corpora and capture commonalities across languages (Conneau et al., 2020; Xue et al., 2021; Scao et al., 2022). Capable multilingual models can perform tasks in a target language without being trained on task-specific data in that language, which is known as zero-shot cross-lingual transfer (Artetxe and Schwenk, 2019; Chen et al., 2021). This technique is expected to reduce the disparity between high-resource and low-resource languages.

Cross-lingual transfer is also exhibited by large language models (LLMs), which refer to auto-regressive language models with billion-scale parameters that are trained on a massive amount of text data (Brown et al., 2020; Touvron et al., 2023). A common approach for zero-shot cross-lingual transfer is to fine-tune the model with supervised training data available in a source language, mostly English, and then apply the model to the target language (Chen et al., 2024; Shaham et al., 2024). However, we argue that this approach does not fully elicit the model’s cross-lingual capability, as the model receives no signal during training about the input language it will face at test time. To achieve better cross-lingual performance, we let the model teach itself how to solve the task in the target language. In our proposed method, Self-Translate-Train, we translate the training data into the target language by leveraging the strong text generation capability of the LLM, and then train the LLM on its own generated translations.

Figure 1: An overview of Self-Translate-Train. An LLM translates training data into the target language and is then fine-tuned on its own generated data.

We evaluate Self-Translate-Train on several tasks, including question answering, text-pair classification, and mathematical reasoning, across multiple languages. Our experiments show that Self-Translate-Train consistently improves over the baselines, provided that the LLM has sufficient multilingual capability. Our results indicate that we can achieve better cross-lingual performance by properly eliciting the model’s translation capability, which encourages further exploration of how to better utilize the model’s cross-lingual capability.

2 Related Work

2.1 Cross-lingual Transfer Learning

There are two main approaches to transfer task knowledge across languages: data transfer and model transfer (Pikuliak et al., 2021).

Data transfer translates the source language data to the target language. In the Translate-test approach, models are trained on source language data and at inference time, the task inputs are translated into the source language (Conneau et al., 2018; Asai et al., 2018). Although the training stage is simple, it incurs additional translation costs at inference time. The Translate-train approach, on the other hand, translates the training data and the resulting model is used to predict the target language data directly (Conneau et al., 2018). Data transfer is quite effective in terms of performance (Hu et al., 2020), but one drawback is its requirement of additional translation systems.

Model transfer alleviates the need for translation systems by using multilingual pretrained models, which are trained on a large amount of data from multiple languages and capture the commonality between languages. These models can be fine-tuned on task-specific data in a single source language and generalize to solve the task in other languages (Pires et al., 2019; Mulcaire et al., 2019; Conneau et al., 2020), eliminating the need for translation systems.

Our approach, Self-Translate-Train, leverages LLMs’ translation and cross-lingual generalization capabilities. It combines the advantages of data transfer and model transfer by using explicit training signals in the target language while eliminating the need for external translation systems.

2.2 Self-Improvement of LLMs

LLMs have demonstrated remarkable text generation capabilities, which have been leveraged to generate training data for various purposes (Li et al., 2023b; Lee et al., 2024). The generated data can be used to further specialize the LLM itself for downstream applications, without requiring an extensive collection of additional data. This process can be viewed as a form of self-improvement (Bai et al., 2022; Huang et al., 2023; Sun et al., 2023; Li et al., 2023a).

Self-Translate-Train is also a self-improvement approach to specialize the LLM to a target language by translating the source language data to the target language.

3 Self-Translate-Train

Our framework focuses on fine-tuning LLMs on a small amount of data for a specific task. Let the training corpus in a source language, say English, be $\mathcal{D}_{\text{src}} = \{(\mathbf{x}_{\text{src}}^{i}, \mathbf{y}_{\text{src}}^{i})\}_{i=1}^{N}$, where $\mathbf{x}$ is the input and $\mathbf{y}$ is the output. In a typical cross-lingual transfer setting, the model is fine-tuned only on $\mathcal{D}_{\text{src}}$ and expected to generalize to target languages.

Translated Synthetic Data

Given the LLM’s translation capability, we can let it translate the training corpus into a synthetic corpus in the target language, $\mathcal{D}_{\text{tgt}}$. The synthetic data can be added to $\mathcal{D}_{\text{src}}$ to achieve better generalization to the target language.

The translation can be performed in various ways depending on the model’s capabilities or available resources. In this paper, we experiment with the few-shot prompting technique (Section 4.4).

Code-switched Synthetic Data

The generated data has an interesting aspect: each synthetic instance has a corresponding instance in the original dataset with the same semantics. We can exploit this to further synthesize data by generating code-switched instances where the input and output are in different languages.

We pair the original and translated instances to construct $\mathcal{D}_{\text{cs}} = \{(\mathbf{x}_{\text{src}}^{i}, \mathbf{y}_{\text{tgt}}^{i})\}_{i=1}^{N} \cup \{(\mathbf{x}_{\text{tgt}}^{i}, \mathbf{y}_{\text{src}}^{i})\}_{i=1}^{N}$. When the task output is natural language, we manually translate the prompt “Please answer in {{ tgt }}.” into the target language, and add it to the input $\mathbf{x}$.
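The construction of the three training sets can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `Example` and `build_datasets` are hypothetical names, the translations are assumed to be already generated and aligned with the source instances by index, and the language instruction added to the input is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    x: str  # task input
    y: str  # task output


def build_datasets(src, tgt):
    """Return (d_src, d_tgt, d_cs) from index-aligned source and translated examples."""
    if len(src) != len(tgt):
        raise ValueError("source and translated corpora must be index-aligned")
    d_src = list(src)
    d_tgt = list(tgt)
    # Code-switched pairs: source input with target output, then the reverse.
    d_cs = [Example(s.x, t.y) for s, t in zip(src, tgt)]
    d_cs += [Example(t.x, s.y) for s, t in zip(src, tgt)]
    return d_src, d_tgt, d_cs


src = [Example("What is 2 + 2?", "4"), Example("Greet me.", "Hello")]
tgt = [Example("Was ist 2 + 2?", "4"), Example("Begrüße mich.", "Hallo")]
d_src, d_tgt, d_cs = build_datasets(src, tgt)
```

If some translations are discarded by quality filtering, only the source instances whose translations survive would be paired here, so that the index alignment still holds.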

4 Experimental Setups

To verify the effectiveness of Self-Translate-Train, we conduct extensive experiments on multiple tasks and languages.

4.1 Task and Datasets

We present a list of datasets for experiments in Table 1. For each task, an English dataset is used for training and a multilingual dataset for evaluation. To make the computational cost feasible, we use a 10,000-sample subset of the training data for SQuAD and MultiNLI.

| Task | Training | Evaluation |
|---|---|---|
| QA | SQuAD (Rajpurkar et al., 2016) | XQuAD (Artetxe et al., 2020) |
| Classification | MultiNLI (Williams et al., 2018) | XNLI (Conneau et al., 2018) |
| Math | GSM8k (Cobbe et al., 2021) | MGSM (Shi et al., 2023) |

Table 1: List of datasets for experiments. The details are described in Section A.1.

4.2 Languages

We conducted evaluation on four languages: German (de), Russian (ru), Thai (th), and Chinese (zh). German is a Germanic language, which is phylogenetically close to English and expected to show better cross-lingual transfer, while Russian, Thai, and Chinese are from different language families. In particular, Thai is a low-resource language with a different script from English, which is expected to be more challenging for cross-lingual transfer.

4.3 Language Models

Our main experiments use Llama2-7B (Touvron et al., 2023), a public LLM. Although 90% of its pretraining corpus is English, the model has a multilingual capability (e.g., Table 2) from the remaining fraction of multilingual data.

4.4 Synthetic Data Generation

Recent LLMs are known to exhibit a translation capability without much task-specific data (Briakou et al., 2023). In our experiments, we elicit the translation capability of the LLMs via few-shot in-context learning (Brown et al., 2020).

To construct few-shot translation samples, we sample eight pairs from the train or validation splits of the multilingual datasets, where instances across languages form parallel data. The translation was performed for each field individually, e.g., for GSM8k, we translated the question and answer separately. The prompt template simply alternates the source and target text prepended with the language tag (Section A.2).
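A prompt of this alternating form can be sketched as below. The exact template is given in Section A.2 and is not reproduced here, so the `Tag: text` layout, the newline separators, and the function name `build_translation_prompt` are assumptions made for illustration.

```python
def build_translation_prompt(shots, source_text, src_tag="English", tgt_tag="German"):
    """Alternate tagged source and target lines, ending with an open target slot."""
    lines = []
    for src, tgt in shots:
        lines.append(f"{src_tag}: {src}")
        lines.append(f"{tgt_tag}: {tgt}")
    # The final target tag is left empty for the model to complete.
    lines.append(f"{src_tag}: {source_text}")
    lines.append(f"{tgt_tag}:")
    return "\n".join(lines)


shots = [("Good morning.", "Guten Morgen."), ("Thank you.", "Danke.")]
prompt = build_translation_prompt(shots, "How are you?")
```

In the paper's setting, `shots` would hold eight parallel pairs and the function would be called once per field, for example once for a GSM8k question and once for its answer.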

An important step to ensure the quality of the synthetic data is to filter out low-quality data (details in Section A.3). To remove under- or over-translation (Tu et al., 2016), we filter out texts with an extreme source-target length ratio. Also, to address the repetition problem (Holtzman et al., 2020), we set a maximum number of tokens for generation and filter out translations that do not end with the EOS token. With the translations from Llama2-7B, this process removes around 10% of the data for most languages and around 50% for Thai due to the model’s limited generation quality.
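The two filters can be sketched as a single predicate. The thresholds `min_ratio` and `max_ratio`, the use of character counts for length, and the function name are assumptions for illustration; the actual settings are described in Section A.3.

```python
def keep_translation(source, translation, ended_with_eos, min_ratio=0.5, max_ratio=2.0):
    """Return True if a generated translation passes both quality filters."""
    if not ended_with_eos:
        # Generation hit the token limit, which often indicates repetition.
        return False
    if not source or not translation:
        return False
    # Extreme length ratios suggest under- or over-translation.
    ratio = len(translation) / len(source)
    return min_ratio <= ratio <= max_ratio


pairs = [
    ("Hello world", "Hallo Welt", True),
    ("Hello world", "Hallo Welt", False),
    ("Hello world", "Ha", True),
]
kept = [p for p in pairs if keep_translation(*p)]
```

Character-based ratios differ considerably between scripts, so thresholds of this kind would likely need to be set per language pair.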

To give a sense of the translation quality, Table 2 reports the BLEU score (Papineni et al., 2002) measured on parallel data constructed from the questions in the MGSM test set. Overall, the translation quality is sufficiently high except for Thai. As we will see in Section 5.1, this poses a challenge for cross-lingual transfer to Thai.

| Model | de | ru | th | zh |
|---|---|---|---|---|
| Llama2-7B | 37.1 | 27.2 | 1.9 | 29.4 |

Table 2: BLEU scores from the MGSM test set. The configuration of BLEU is described in Section A.4.

4.5 Fine-tuning

All the tasks are cast as text generation tasks, where the LLM is given the inputs as a prompt and generates the answer. Fine-tuning is conducted with a causal language modeling loss, computed only for output tokens. We use LoRA (Hu et al., 2022), a parameter-efficient tuning technique, to reduce computational cost.
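Restricting the loss to output tokens is commonly done by masking the prompt positions in the label sequence. The sketch below shows the idea in plain Python; the helper names are hypothetical, the value -100 follows the default `ignore_index` of PyTorch's `CrossEntropyLoss`, and the one-position shift between inputs and labels in causal language modeling is assumed to be handled by the training framework.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids, output_ids):
    """Concatenate prompt and output token ids, masking prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(output_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(output_ids)
    return input_ids, labels


def masked_mean_nll(token_nlls, labels):
    """Average per-token negative log-likelihoods over unmasked positions only."""
    kept = [nll for nll, label in zip(token_nlls, labels) if label != IGNORE_INDEX]
    if not kept:
        raise ValueError("no output tokens to compute the loss on")
    return sum(kept) / len(kept)


input_ids, labels = build_labels([5, 6, 7], [8, 9])
loss = masked_mean_nll([1.0, 1.0, 1.0, 2.0, 4.0], labels)
```

Here only the last two positions contribute, so the prompt tokens shape the context but never the gradient signal.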

We use AdamW (Loshchilov and Hutter, 2019) and the cosine learning rate schedule for optimization, training with a batch size of 64 for 1,000 steps. For each setting, we conduct six runs with two learning rates (5e-5 and 3e-4) and different random seeds, and report summary statistics of the top four runs based on validation accuracy to exclude runs with optimization failure. See Section A.5 for other hyperparameters.
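The run selection step can be sketched as follows. The function name and the numbers are illustrative only, and the use of the sample standard deviation is an assumption, since the text does not say which dispersion statistic is reported.

```python
import statistics


def summarize_top_runs(runs, k=4):
    """Return (mean, std) of test scores for the k runs with the best validation accuracy.

    `runs` is a list of (validation_accuracy, test_score) pairs.
    """
    top = sorted(runs, key=lambda run: run[0], reverse=True)[:k]
    scores = [test_score for _, test_score in top]
    return statistics.mean(scores), statistics.stdev(scores)


# Six hypothetical runs; two of them diverged and have very low validation accuracy.
runs = [(0.50, 30.0), (0.52, 31.0), (0.10, 2.0), (0.51, 29.0), (0.49, 30.0), (0.05, 1.0)]
mean, std = summarize_top_runs(runs)
```

Selecting on validation accuracy rather than test score keeps the filtering from leaking test information into the reported numbers.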

5 Results

5.1 Main Results

| Training data | MGSM de | MGSM ru | MGSM th | MGSM zh | XQuAD de | XQuAD ru | XQuAD th | XQuAD zh | XNLI de | XNLI ru | XNLI th | XNLI zh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\mathcal{D}_{\text{src}}$ | 30.1 ±0.4 | 25.0 ±0.7 | 8.1 ±0.7 | 21.1 ±1.7 | 60.3 ±0.8 | 49.3 ±0.4 | 34.5 ±1.0 | 66.3 ±0.8 | 79.7 ±0.4 | 76.9 ±0.1 | 53.7 ±0.9 | 74.1 ±0.2 |
| $+\mathcal{D}_{\text{tgt}}$ | **36.4 ±1.3** * | 34.0 ±1.7 * | 7.7 ±3.4 | 27.1 ±1.2 * | 61.7 ±0.9 | 57.8 ±0.8 * | **46.4 ±1.3** * | **77.7 ±0.4** * | **81.6 ±0.7** * | **78.5 ±0.6** * | 56.3 ±1.5 | **78.5 ±0.3** * |
| $+\mathcal{D}_{\text{tgt}}+\mathcal{D}_{\text{cs}}$ | 35.9 ±1.3 * | **34.5 ±2.4** * | 10.4 ±1.6 | **28.8 ±1.9** * | **62.2 ±0.9** * | **58.0 ±0.6** * | 46.2 ±1.7 * | 77.3 ±0.7 * | **81.6 ±0.5** * | 78.4 ±1.2 | **58.9 ±1.5** * | 77.6 ±0.6 * |

Table 3: Results on multilingual evaluation datasets. A score is marked with * if its improvement over the baseline $\mathcal{D}_{\text{src}}$ is statistically significant ($p < 0.05$ in Welch’s t-test). The highest significant score in each column is marked in bold.
Figure 2: Accuracy on the MGSM dataset with different model sizes of Llama2. Panels: (a) de, (b) th, (c) zh.

As the baseline, we fine-tune the LLM with the source language dataset $\mathcal{D}_{\text{src}}$. To ensure a fair comparison, we augment $\mathcal{D}_{\text{src}}$ with the eight target language samples used for few-shot translation (Section 4.4). We then compare the baseline with the models fine-tuned on the data generated by Self-Translate-Train ($\mathcal{D}_{\text{tgt}}$ and $\mathcal{D}_{\text{cs}}$) in Table 3.

First, Self-Translate-Train is indeed an effective method; $+\mathcal{D}_{\text{tgt}}$ almost consistently outperforms the baseline $\mathcal{D}_{\text{src}}$. The only exception is Thai (th), where there is no significant improvement. This is likely due to the low translation quality of the model in Thai (Table 2).

The effectiveness of the code-switched dataset is limited. When we add $\mathcal{D}_{\text{cs}}$ to $\mathcal{D}_{\text{tgt}}$, there is no significant improvement over adding $\mathcal{D}_{\text{tgt}}$ alone (at the $p < 0.05$ level in Welch’s t-test). This indicates that the code-switched data does not provide additional information that helps the model generalize on the task.
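For reference, the statistic behind Welch’s t-test can be computed from two sets of run scores as sketched below. The scores are made up for illustration and the function name is hypothetical; obtaining the p-value additionally requires the CDF of Student's t distribution, which a library call such as `scipy.stats.ttest_ind(a, b, equal_var=False)` provides.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b."""
    # Squared standard errors of the two sample means.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical scores of four runs each for a proposed setting and a baseline.
proposed = [36.0, 37.0, 35.0, 38.0]
baseline = [30.0, 30.5, 29.5, 30.0]
t, df = welch_t(proposed, baseline)
```

Unlike Student's t-test, this variant does not assume equal variances, which matters here because the spread across runs differs noticeably between settings.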

5.2 Does the model size matter?

The size of the language model can influence both its ability to generalize across languages and the quality of its translations, which in turn may impact the effectiveness of Self-Translate-Train. We compare the performance of Llama2 with different sizes, i.e., 7B, 13B, and 70B, on the math task in de, th, and zh (Figure 2).

Larger models generally perform better, and the improvement from Self-Translate-Train remains consistent across different model sizes. In Thai (th), we did not observe a significant improvement with the 7B model, but did with the larger models (13B and 70B), likely due to their better translation quality. The 7B model has a low Thai translation BLEU score of 1.9 (Table 2), while the 13B and 70B models have BLEU scores of 5.1 and 12.0, respectively.

The improvement in Thai (th) with the 70B model is the largest (+19.8 points on average). This implies that Self-Translate-Train is particularly effective when the model struggles with generalizing across the source and target languages but can still generate reasonable translations.

6 Conclusion

We introduced Self-Translate-Train, a method to improve cross-lingual transfer performance by generating synthetic training data in the target language. We validated its effectiveness on various tasks and languages, demonstrating substantial performance gains across several non-English languages. Self-Translate-Train is effective when the zero-shot cross-lingual transfer performance is suboptimal and the model can generate reasonable translations.

Self-Translate-Train neither requires external translation systems nor intensive additional data collection, making it a simple yet effective method for cross-lingual transfer. We encourage practitioners to try this approach as an improved baseline for cross-lingual transfer of LLMs.

Our research also shows that relying solely on the model’s generalization capability may be suboptimal, and there is a better way to elicit the cross-lingual capability of the model. We hope this work encourages further exploration of how to better utilize the model’s cross-lingual capability.

7 Limitations

Our experiments are conducted on a modern type of LLM, an autoregressive Transformer decoder, and centered around the Llama2 model family (Touvron et al., 2023). Although we further validate our method in Appendix B, the effectiveness of our proposed method is uncertain when applied to other types of LLMs developed in the future.

Our method is based on the assumption that the model can generate reasonable translations in the target language. This may be challenging when the task inputs are long or complex. One solution is to split the input into smaller segments, as we have done with the SQuAD dataset (Section A.2).

Finally, when the task requires generating long and natural text, the quality of the generated translation matters more. If the translation quality is low, the model outputs may degrade due to translation errors or unnaturalness. The application of our method on more challenging tasks requires further investigation.

References

  • Artetxe et al. (2020) Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.
  • Artetxe and Schwenk (2019) Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
  • Asai et al. (2018) Akari Asai, Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2018. Multilingual extractive reading comprehension by runtime machine translation. ArXiv, abs/1809.03275.
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, E Perez, Jamie Kerr, Jared Mueller, Jeff Ladish, J Landau, Kamal Ndousse, Kamilė Lukošiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova Dassarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Sam Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, and Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI feedback. ArXiv, abs/2212.08073.
  • Briakou et al. (2023) Eleftheria Briakou, Colin Cherry, and George Foster. 2023. Searching for needles in a haystack: On the role of incidental bilingualism in PaLM’s translation capability. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9432–9452, Toronto, Canada. Association for Computational Linguistics.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • Chen et al. (2021) Guanhua Chen, Shuming Ma, Yun Chen, Li Dong, Dongdong Zhang, Jia Pan, Wenping Wang, and Furu Wei. 2021. Zero-shot cross-lingual transfer of neural machine translation with multilingual pretrained encoders. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 15–26, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Chen et al. (2024) Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. 2024. Monolingual or multilingual instruction tuning: Which makes a better alpaca. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1347–1356, St. Julian’s, Malta. Association for Computational Linguistics.
  • Chen et al. (2023) Yang Chen, Chao Jiang, Alan Ritter, and Wei Xu. 2023. Frustratingly easy label projection for cross-lingual transfer. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5775–5796, Toronto, Canada. Association for Computational Linguistics.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. ArXiv, abs/2110.14168.
  • Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451. Association for Computational Linguistics.
  • Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
  • Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Hu et al. (2020) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4411–4421. PMLR.
  • Huang et al. (2023) Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1051–1068, Singapore. Association for Computational Linguistics.
  • Kim and Rush (2016) Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.
  • Lee et al. (2024) Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernández Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha R. Jonnalagadda, Ming-Wei Chang, and Iftekhar Naim. 2024. Gecko: Versatile text embeddings distilled from large language models. ArXiv, abs/2403.20327.
  • Li et al. (2023a) Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. 2023a. Self-alignment with instruction backtranslation. CoRR, abs/2308.06259.
  • Li et al. (2023b) Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. 2023b. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461, Singapore. Association for Computational Linguistics.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  • Mulcaire et al. (2019) Phoebe Mulcaire, Jungo Kasai, and Noah A. Smith. 2019. Polyglot contextual representations improve crosslingual transfer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3912–3918, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Pikuliak et al. (2021) Matúš Pikuliak, Marián Šimko, and Mária Bieliková. 2021. Cross-lingual learning for text processing: A survey. Expert Systems with Applications, 165:113765.
  • Pires et al. (2019) Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
  • Scao et al. (2022) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa Etxabe, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris C. Emezue, Christopher Klamm, Colin Leong, Daniel Alexander van Strien, David Ifeoluwa Adelani, Dragomir R. Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady ElSahar, Hamza Benyamina, Hieu Trung Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jorg Frohberg, Josephine L. Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben Allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, S. Longpre, Somaieh Nikpoor, S. Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-Shaibani, Matteo Manica, Nihal V. Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. Bach, Taewoon Kim, Tali Bers, Thibault Févry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiang Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Y Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Daniel H Garrette, Deepak R. Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Xiangru Tang, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, S. Osher Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ananda Santa Rosa Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ayoade Ajibade, Bharat Kumar Saxena, Carlos Muñoz Ferrandis, Danish Contractor, David M. Lansky, Davis David, Douwe Kiela, Duong Anh Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatim Tahirah Mirza, Frankline Ononiwu, Habib Rezanejad, H.A. Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jan Passmore, Joshua Seltzer, Julio Bonis Sanz, Karen Fort, Lívia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nourhan Fahmy, Olanrewaju Samuel, Ran An, R. P. Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas L. Wang, Sourav Roy, Sylvain Viguier, Thanh-Cong Le, Tobi Oyebade, Trieu Nguyen Hai Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Kumar Singh, Benjamin Beilharz, Bo Wang, Caio Matheus Fonseca de Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Iman I.B. Bello, Isha Dash, Ji Soo Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthi Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, María Andrea Castillo, Marianna Nezhurina, Mario Sanger, Matthias Samwald, Michael Cullan, Michael Weinberg, M Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patricia Haller, R. Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Pratap Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yashasvi Bajaj, Y. Venkatraman, Yifan Xu, Ying Xu, Yu Xu, Zhee Xao Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, and Thomas Wolf. 2022. BLOOM: A 176B-parameter open-access multilingual language model. ArXiv, abs/2211.05100.
  • Shaham et al. (2024) Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. 2024. Multilingual instruction tuning with just a pinch of multilinguality. ArXiv, abs/2401.01854.
  • Shi et al. (2023) Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2023. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Sun et al. (2023) Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David D. Cox, Yiming Yang, and Chuang Gan. 2023. Principle-driven self-alignment of language models from scratch with minimal human supervision. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288.
  • Tu et al. (2016) Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85, Berlin, Germany. Association for Computational Linguistics.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
  • Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

Appendix A The Details of Experimental Setups

A.1 Tasks and Datasets

We provide the details of the datasets introduced in Section 4.1.

A.1.1 Question Answering (QA)

SQuAD (Rajpurkar et al., 2016) is an English QA dataset created from Wikipedia articles, which we use as training data. Given a question and a passage, the task is to extract the answer from the passage. Evaluation is conducted with XQuAD (Artetxe et al., 2020), which consists of translations of SQuAD examples into multiple languages.

Context: Architecturally, the school has a Catholic character. Atop the
Main Building’s gold dome is a golden statue of the Virgin Mary. Immediately
in front of the Main Building and facing it, is a copper statue of Christ
with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main
Building is the Basilica of the Sacred Heart. Immediately behind the basilica
is the Grotto, a Marian place of prayer and reflection. It is a replica of the
grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint
Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line
that connects through 3 statues and the Gold Dome), is a simple, modern stone
statue of Mary.
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Saint Bernadette Soubirous
Figure 3: An input and output example of the SQuAD dataset.

A.1.2 Text-Pair Classification

We also evaluate our method on cross-lingual text-pair classification tasks. The MultiNLI dataset (Williams et al., 2018) involves determining the logical relationship between a premise sentence and a hypothesis sentence. XNLI (Conneau et al., 2018) is a multilingual NLI dataset for evaluation.

Premise: Conceptually cream skimming has two basic dimensions - product and geography.
Hypothesis: Product and geography are what make cream skimming work.
What is their logical relation? Entailment, Neutral or Contradiction.
Neutral
Figure 4: An input and output example of the MultiNLI dataset.

A.1.3 Mathematical Reasoning

GSM8k (Cobbe et al., 2021) is an English dataset of 8.5K high-quality grade school math problems. Each problem is annotated with a solution that shows the mathematical steps required to reach the final answer. As the evaluation dataset, we use MGSM (Shi et al., 2023), a multilingual version of the GSM8k dataset.

The LLM is trained to generate the step-by-step solution to math problems. The final answer is extracted as the last number in the LLM output, and the accuracy is calculated based on the exact match between the extracted answer and the ground truth.

Natalia sold clips to 48 of her friends in April,
and then she sold half as many clips in May.
How many clips did Natalia sell altogether in April and May?
Natalia sold 48/2 = 24 clips in May.
Natalia sold 48+24 = 72 clips altogether in April and May.
#### 72
Table 4: An input and output example of the GSM8k dataset.
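
The answer extraction and exact-match scoring described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the paper: the regular expression, the handling of thousands separators, and the function names `extract_final_number` and `exact_match_accuracy` are all assumptions made for this example.

```python
import re

# Assumed number pattern: optional sign, digits with optional thousands
# separators, and an optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in `text` as a string, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Drop thousands separators so that "1,250" and "1250" compare equal.
    return matches[-1].replace(",", "")


def exact_match_accuracy(outputs, references):
    """Fraction of outputs whose final number equals the reference's."""
    if not references:
        raise ValueError("references must not be empty")
    correct = sum(
        extract_final_number(out) == extract_final_number(ref)
        for out, ref in zip(outputs, references)
    )
    return correct / len(references)
```

Applied to the solution in Table 4, `extract_final_number` picks up the number after the `####` marker because it is the last number in the text.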

A.2 Prompt Format for LLM Translation

To translate training data using an LLM (Section 4.4), we employed the following prompt template for each task. The template simply consists of the source text and the target text, each prefixed with a language tag. The text is surrounded by backticks: the LLM starts generating the target text after the opening backtick and continues until the closing backtick is generated.

{% for sample in few_shot_samples %}
en: `{{ sample.data_field }}`
{{ target_language }}: `{{ sample.data_field }}`
{% endfor %}
en: `{{ data_field }}`
{{ target_language }}: `
Figure 5: Prompt format for LLM translation.
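
To make the template concrete, the following sketch builds the prompt as a plain string and cuts the generated continuation at the closing backtick. It is an illustration under assumptions: the function names and the representation of few-shot samples as (English, target) pairs are hypothetical, and the call to the LLM itself is left out.

```python
def build_translation_prompt(few_shot_pairs, source_text, target_language):
    """Build a few-shot translation prompt in the format of Figure 5.

    few_shot_pairs: list of (english_text, target_text) tuples (assumed shape).
    """
    lines = []
    for en_text, tgt_text in few_shot_pairs:
        lines.append(f"en: `{en_text}`")
        lines.append(f"{target_language}: `{tgt_text}`")
    lines.append(f"en: `{source_text}`")
    # The prompt ends right after the opening backtick of the target text.
    lines.append(f"{target_language}: `")
    return "\n".join(lines)


def extract_translation(generated_text):
    """Return the text before the first closing backtick.

    Returns None when no closing backtick was generated, which signals an
    incomplete translation.
    """
    end = generated_text.find("`")
    if end == -1:
        return None
    return generated_text[:end]
```

Returning `None` for outputs without a closing backtick lets a caller discard incomplete generations, in the spirit of the filtering described in Section A.3.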

The SQuAD dataset annotates answer spans in the context passages. We translate the annotations using the mark-then-translate approach (Chen et al., 2023). We mark the answer span in the context passage with the tokens “<answer>” and “</answer>”, translate the marked text, and then extract the translated answer span from the translated context. Note that in this case, the few-shot samples are also marked with the answer span.
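
A minimal sketch of the marking and extraction steps is given below, with the translation step itself omitted. The helper names and the decision to reject outputs that do not contain exactly one well-formed marker pair are assumptions of this example, not details confirmed by the paper.

```python
import re

OPEN, CLOSE = "<answer>", "</answer>"


def mark_answer(context, answer_start, answer_text):
    """Wrap the answer span of `context` in <answer>...</answer> markers."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer span does not match the context")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def unmark_translation(translated):
    """Recover (clean_context, answer_text, answer_start) from a translation.

    Returns None unless the text contains exactly one non-empty marked span
    (an assumed policy for malformed outputs).
    """
    if translated.count(OPEN) != 1 or translated.count(CLOSE) != 1:
        return None
    pattern = re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE)
    spans = re.findall(pattern, translated, flags=re.DOTALL)
    if len(spans) != 1 or not spans[0].strip():
        return None
    answer = spans[0]
    # Text before the opening marker is unchanged, so its index is also the
    # answer's start offset in the cleaned context.
    start = translated.index(OPEN)
    clean = translated.replace(OPEN, "").replace(CLOSE, "")
    return clean, answer, start
```

Because the markers travel through translation together with the text, the answer offset in the translated context can be read off directly instead of being re-aligned afterwards.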

The context passages in the SQuAD dataset are relatively long, and it is challenging to fit all the few-shot samples and the source text into the limited context window of the LLM. To address this issue, we split the context into sentences using the spaCy library (https://spacy.io/) and translate them separately, i.e., both the few-shot samples and the source text are individual sentences.

A.3 Data Filtering for Synthetic Data

We remove pairs where the target length is less than one-third or more than three times the source length. The text length is computed heuristically to account for differences in character counts between languages. For example, phonogram-based text (e.g., English) has many more characters than ideograph-based text (e.g., Chinese). We set normalization factors where English, German, Thai, and Russian characters count as 1, and Chinese characters count as 3.
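
The length-ratio filter can be sketched as below. The sketch assumes that the normalization factor is applied per language, i.e., every character of a text in a given language receives that language's weight. This is one reading of the description above, and the names `CHAR_WEIGHT` and `keep_pair` are hypothetical.

```python
# Per-language character weights taken from the text above.
CHAR_WEIGHT = {"en": 1, "de": 1, "th": 1, "ru": 1, "zh": 3}


def normalized_length(text, lang):
    """Character count scaled by the language's normalization factor."""
    return len(text) * CHAR_WEIGHT[lang]


def keep_pair(source, target, target_lang, source_lang="en"):
    """Keep a pair unless the target is shorter than 1/3 or longer than 3x the source."""
    src_len = normalized_length(source, source_lang)
    tgt_len = normalized_length(target, target_lang)
    if src_len == 0 or tgt_len == 0:
        return False  # assumed: empty texts are always discarded
    # Integer comparisons avoid floating-point division.
    return src_len <= 3 * tgt_len and tgt_len <= 3 * src_len
```

With these weights, a short Chinese translation of an English sentence is not discarded merely because it uses fewer characters.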

We also filter out incomplete translations which are typically produced by repetitive generation. We set the maximum number of tokens for generation (Table 5) and remove the outputs not ending with the token indicating the end of the translation, in our case, the backtick character used in the prompt format.

Data                               Field        Max Number of Tokens
SQuAD (Rajpurkar et al., 2016)     context      512
                                   question     256
MultiNLI (Williams et al., 2018)   premise      256
                                   hypothesis   256
GSM8k (Cobbe et al., 2021)         question     512
                                   answer       512
Table 5: Maximum number of tokens set for generating translations.

A.4 Assessing the Translation Quality

To evaluate the translation quality, we use the BLEU score (Papineni et al., 2002) measured on parallel data constructed from the questions in the MGSM test set. The translation is performed via few-shot in-context learning with 8 translation samples constructed from the train set of the MGSM dataset. The BLEU score is calculated using the SacreBLEU library (Post, 2018; https://github.com/mjpost/sacrebleu). As the tokenizer option, we use “13a” for de and ru, “flores101” for th, and “zh” for zh.
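
The tokenizer choices above can be written down as a small lookup passed to SacreBLEU. The sketch below assumes that the `sacrebleu` package is installed and that each hypothesis has a single reference. The wrapper function and its name are hypothetical, and the import is deferred so that the mapping can be inspected without the dependency.

```python
# SacreBLEU tokenizer per target language, as listed in the text above.
SACREBLEU_TOKENIZER = {"de": "13a", "ru": "13a", "th": "flores101", "zh": "zh"}


def corpus_bleu_for(lang, hypotheses, references):
    """Corpus-level BLEU with the language-specific tokenizer.

    hypotheses and references are parallel lists of strings.
    """
    import sacrebleu  # third-party dependency, imported lazily

    result = sacrebleu.corpus_bleu(
        hypotheses,
        [references],  # SacreBLEU expects a list of reference streams
        tokenize=SACREBLEU_TOKENIZER[lang],
    )
    return result.score
```

Fixing the tokenizer per language matters because BLEU is computed over tokens, and whitespace-based tokenization is not meaningful for Thai or Chinese text.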

Table 6 shows the BLEU scores from the LLMs evaluated in this paper. The result of Qwen1.5-1.8B is discussed in Appendix B, and gpt-3.5-turbo-0125 in Section C.3.

Model de ru th zh
Llama2-7B 37.1 27.2 1.9 29.4
Llama2-13B 41.3 33.4 5.1 34.3
Llama2-70B 45.6 41.9 12.0 42.4
Qwen1.5-1.8B 21.9 11.3 1.5 41.3
gpt-3.5-turbo-0125 48.0 44.6 23.1 47.2
Table 6: BLEU scores from the MGSM test set.

A.5 Hyper-parameters for Fine-tuning

We provide the hyper-parameters used for fine-tuning the LLMs in Table 7.

Hyper-parameter    Value           Hyper-parameter    Value
Batch size         64              Adam $\epsilon$    1e-8
Number of steps    1,000           Adam $\beta_1$     0.9
Learning rate      [5e-5, 3e-4]    Adam $\beta_2$     0.999
LR Scheduler       Cosine          Weight decay       0.1
Warmup ratio       0.05
Table 7: Hyper-parameters used for fine-tuning the LLMs.
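
As an illustration of how the scheduler settings in Table 7 interact, the sketch below computes the learning rate at a given step. The table only specifies a cosine scheduler and a warmup ratio of 0.05. The linear shape of the warmup and the decay to zero are assumptions borrowed from common implementations, and `peak_lr` stands for whichever value is chosen from the listed range.

```python
import math


def learning_rate(step, peak_lr, total_steps=1000, warmup_ratio=0.05):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = int(round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Assumed: linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Assumed: a single half-cosine decaying from peak_lr to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1,000 steps and a warmup ratio of 0.05, the peak learning rate is reached at step 50 and half of it remains at step 525.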

Appendix B Results from Qwen1.5-1.8B

To increase the robustness of the results, we also conducted experiments with Qwen1.5-1.8B (https://qwenlm.github.io/blog/qwen1.5/). While the model is mainly trained on Chinese and English data, it is also designed with multilingual use cases in mind.

gsm8k
                               de          ru          th         zh
$\mathcal{D}_{\text{src}}$     8.0±0.3     6.5±0.4     2.8±0.5    23.3±0.5
$+\mathcal{D}_{\text{tgt}}$    18.7±1.8*   15.0±1.2*   3.4±1.8    21.4±1.2
$+\mathcal{D}_{\text{cs}}$     17.1±1.3*   14.9±0.5*   2.7±0.8    23.2±0.9
Table 8: Results on the MGSM dataset with Qwen1.5-1.8B. Scores are marked with * if their improvement is statistically significant ($p<0.05$ in Welch’s t-test) compared to the baseline $\mathcal{D}_{\text{src}}$. The significant and highest score in each column is marked in bold.
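
For readers unfamiliar with the significance test named in the caption, the sketch below computes the Welch t statistic and its Welch-Satterthwaite degrees of freedom from two sets of per-run scores. The paper does not describe its testing code, so the function is an illustration only. Turning the statistic into a p-value requires the t-distribution, for which `scipy.stats.ttest_ind(a, b, equal_var=False)` is one option.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) of Welch's unequal-variance t-test.

    a and b are sequences of scores (e.g., one per random seed), each with at
    least two values and not both constant.
    """
    # Squared standard errors of the two sample means.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Unlike Student's t-test, this variant does not assume that the two settings have equal variance across runs, which is relevant here because the reported standard deviations differ noticeably between settings.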

We observe that the results are consistent with the main experiments: Self-Translate-Train is effective when the zero-shot cross-lingual transfer performance is suboptimal and the model can generate reasonable translations. Adding the target language data $+\mathcal{D}_{\text{tgt}}$ improves the performance for German and Russian. However, when the translation quality is poor, as in Thai (1.5 BLEU score in Table 6), no improvement is observed. Additionally, Qwen1.5-1.8B seems to have good cross-lingual capability between English and Chinese, as indicated by the high BLEU score (41.3 in Table 6). In this case, tuning on the source language data alone is sufficient to achieve high performance.

Appendix C Frequently Asked Questions

In this section, we discuss questions that are outside the scope of the main topic of this paper but are somewhat relevant and may be of interest to readers.

C.1 Does Self-Translate-Train improve the performance in the source language?

The performance sometimes improves, provided that the task is challenging and the translation quality is sufficiently high.

Table 9 shows the results on the English test set with Llama2-7B. The performance on the MGSM dataset improves when adding the synthetic data from de, ru, and zh. Thai does not show an improvement, possibly due to the low translation quality.

However, the improvement is not observed in the XQuAD and XNLI datasets. This might be because the task performance is already high with the source language data alone, and the synthetic data does not provide additional information to improve the performance.

                               MGSM                                          XQuAD                                     XNLI
                               de          ru          th         zh         de         ru         th         zh         de         ru         th         zh
$\mathcal{D}_{\text{src}}$     37.8±0.8                                      70.2±0.5                                    88.2±1.3
$+\mathcal{D}_{\text{tgt}}$    42.6±1.9*   41.8±1.3*   40.0±1.9   42.7±1.0*  69.1±0.7   69.6±0.5   70.0±0.5   69.9±0.4   89.0±0.4   88.3±0.7   88.6±1.0   89.3±0.3
$+\mathcal{D}_{\text{cs}}$     42.7±0.7*   42.9±0.7*   39.8±1.6   40.8±2.0   69.5±0.5   69.6±0.3   69.8±0.5   68.7±0.5   88.7±0.2   88.5±1.0   88.0±0.7   88.7±0.6
Table 9: Results on the English test set with Llama2-7B. The columns indicate the language of the added synthetic data; the baseline $\mathcal{D}_{\text{src}}$ uses no synthetic data and therefore has a single score per task. Scores are marked with * if their improvement is statistically significant ($p<0.05$ in Welch’s t-test) compared to the baseline $\mathcal{D}_{\text{src}}$. The significant and highest score in each column is marked in bold.

C.2 Does the synthetic data alone improve the performance in the target language?

Yes, but adding the source language data is more effective. Table 10 shows the results with Llama2-7B on the multilingual evaluation datasets. Tuning on the synthetic data alone ($\mathcal{D}_{\text{tgt}}$) improves the performance in the target language, but the improvement is generally not as large as when the synthetic data is added to the source language data ($+\mathcal{D}_{\text{tgt}}$). In practice, we recommend using the synthetic data in combination with the original data to achieve the best performance.

                               MGSM                                          XQuAD                                        XNLI
                               de          ru          th         zh         de         ru          th          zh          de          ru          th         zh
$\mathcal{D}_{\text{src}}$     30.1±0.4    25.0±0.7    8.1±0.7    21.1±1.7   60.3±0.8   49.3±0.4    34.5±1.0    66.3±0.8    79.7±0.4    76.9±0.1    53.7±0.9   74.1±0.2
$\mathcal{D}_{\text{tgt}}$     32.1±0.6*   30.4±1.8*   8.7±0.6    24.6±2.5   62.5±0.4*  57.4±1.1*   44.3±0.8*   75.4±1.1*   81.5±0.6*   77.9±0.7    53.8±1.7   77.1±0.5*
$+\mathcal{D}_{\text{tgt}}$    36.4±1.3*   34.0±1.7*   7.7±3.4    27.1±1.2*  61.7±0.9   57.8±0.8*   46.4±1.3*   77.7±0.4*   81.6±0.7*   78.5±0.6*   56.3±1.5   78.5±0.3*
Table 10: Results on the multilingual evaluation datasets with Llama2-7B, including the setting of tuning on the synthetic data alone ($\mathcal{D}_{\text{tgt}}$).

C.3 Is tuning on $\mathcal{D}_{\text{tgt}}$ generated by another model still effective?

Yes, if the other model can generate reasonable translations. Such an approach can be seen as the Translate-train approach (Section 2.1) or as sequence-level distillation from a teacher model (Kim and Rush, 2016).

As an upper-bound experiment using the math task, we fine-tune Llama2-7B on the synthetic data generated by gpt-3.5-turbo-0125 from the OpenAI API (https://openai.com/index/openai-api/), which produces high-quality translations across the languages explored in this paper (Table 6).

MGSM
                               de          ru          th          zh
$\mathcal{D}_{\text{src}}$     30.1±0.4    25.0±0.7    8.1±0.7     21.1±1.7
$+\mathcal{D}_{\text{tgt}}$    37.4±1.0*   35.9±1.2*   28.5±1.4*   33.2±1.1*
$+\mathcal{D}_{\text{cs}}$     38.6±1.5*   34.4±0.5*   28.7±1.3*   34.8±1.3*
Table 11: Results on multilingual evaluation datasets with Llama2-7B tuned on the synthetic data generated by gpt-3.5-turbo-0125.

Adding the synthetic data generated by gpt-3.5-turbo-0125 improves performance across languages.

However, the outputs from external models are often subject to usage restrictions, whereas the method explored in this paper can be used with the model at hand without other resources. For example, the OpenAI Terms of Use (January 31, 2024) restrict using the outputs to train a model that competes with the API (https://openai.com/policies/terms-of-use/), and the Meta Llama 3 License (April 18, 2024) prohibits using the outputs to improve any other large language model (https://llama.meta.com/llama3/license/). Additionally, our interest in this paper is rather to explore the cross-lingual potential of the model itself and how to better utilize it.