Computer Science > Computation and Language

arXiv:2310.10226 (cs)

[Submitted on 16 Oct 2023]

Title: Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective

Authors: Huayang Li, Tian Lan, Zihao Fu, Deng Cai, Lemao Liu, Nigel Collier, Taro Watanabe, Yixuan Su

Abstract: There are a number of diverging hypotheses about the neural text degeneration problem, i.e., generating repetitive and dull loops, which makes this problem both interesting and confusing. In this work, we aim to advance our understanding by presenting a straightforward and fundamental explanation from the data perspective. Our preliminary investigation reveals a strong correlation between the degeneration issue and the presence of repetitions in training data. Subsequent experiments also demonstrate that by selectively dropping out the attention to repetitive words in training data, degeneration can be significantly minimized. Furthermore, our empirical analysis illustrates that prior works addressing the degeneration issue from various standpoints, such as the high-inflow words, the likelihood objective, and the self-reinforcement phenomenon, can be interpreted by one simple explanation. That is, penalizing the repetitions in training data is a common and fundamental factor for their effectiveness. Moreover, our experiments reveal that penalizing the repetitions in training data remains critical even when considering larger model sizes and instruction tuning.
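
The abstract refers to measuring repetitions in training data and to selectively dropping attention to repetitive words. The sketch below is only an illustration of those two ideas, not the authors' implementation: the function names, the duplicate n-gram ratio used as a repetition rate, the context window of 16 tokens, and the drop probability `p` are all assumptions chosen for the example. It computes a simple repetition rate for a token sequence, flags tokens that already appeared in the recent context, and samples which flagged positions a training step might hide from attention.

```python
import random
from typing import List, Sequence


def rep_n(tokens: Sequence[str], n: int = 2) -> float:
    """Fraction of n-grams that duplicate another n-gram in the sequence.

    Assumed metric for illustration: 1 - (unique n-grams / total n-grams).
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def repeated_token_mask(tokens: Sequence[str], window: int = 16) -> List[bool]:
    """True at position i when tokens[i] occurred in the previous `window` tokens."""
    mask = []
    for i, tok in enumerate(tokens):
        start = max(0, i - window)
        mask.append(tok in tokens[start:i])
    return mask


def sample_dropped_positions(mask: Sequence[bool], p: float,
                             rng: random.Random) -> List[int]:
    """Pick each flagged position with probability p (hypothetical drop rule).

    In a real model these indices would be turned into an attention mask so
    that later tokens cannot attend to the selected repetitive tokens.
    """
    return [i for i, repeated in enumerate(mask) if repeated and rng.random() < p]


if __name__ == "__main__":
    text = "the cat sat on the cat sat on the mat".split()
    print("rep-2:", round(rep_n(text, 2), 3))
    flags = repeated_token_mask(text)
    print("repeated:", flags)
    print("dropped:", sample_dropped_positions(flags, 0.5, random.Random(0)))
```

Applied over a corpus, a statistic like `rep_n` gives one rough way to compare how repetitive different training sets are, while the sampled positions stand in for the tokens whose attention would be suppressed during training.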

Comments:	Accepted to NeurIPS 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2310.10226 [cs.CL]
	(or arXiv:2310.10226v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.10226

Submission history

From: Huayang Li
[v1] Mon, 16 Oct 2023 09:35:42 UTC (8,064 KB)
