LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Luísa Shimabucoro (Cohere For AI), Sebastian Ruder (Cohere), Julia Kreutzer (Cohere For AI), Marzieh Fadaee (Cohere For AI), Sara Hooker (Cohere For AI)
(July 1, 2024)
Abstract

The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to date of how the source of synthetic data shapes models’ internal biases, calibration, and generations’ textual attributes and preferences. We find that models are surprisingly sensitive towards certain attributes even when the synthetic data prompts appear “neutral”, which invites the question of whether this sensitivity can be exploited for good.

Our findings invite the question: can we explicitly steer the models towards the properties we want at test time by exploiting the data generation process? This would have historically been considered infeasible due to the cost of collecting data with a specific characteristic or objective in mind. However, improvements in the quality of synthetic data, as well as a shift towards general-purpose models designed to follow a diverse range of instructions, mean this question is timely. We propose active inheritance as a term to describe intentionally constraining synthetic data according to a non-differentiable objective. We demonstrate how active inheritance can steer the generation profiles of models towards desirable non-differentiable attributes, e.g., high lexical diversity or low toxicity.

Corresponding authors: {luisashimabucoro, marzieh, sarahooker}@cohere.com

1 Introduction

Historically, high-quality labeled data has been costly to curate due to, amongst other factors, scarcity of available data (Bansal et al., 2022; Singh et al., 2024a) and financial cost (Gilardi et al., 2023; Boubdir et al., 2023). This high cost has precluded adapting training sets “on-the-fly” to increase coverage or task diversity. As a result, researchers often treated datasets as static instead of malleable. Rather than incurring the cost of collecting new data, recent work has focused on making better use of the existing data by optimizing in the data space. This includes efforts around data augmentation (Mumuni & Mumuni, 2022; Feng et al., 2021), creating auxiliary data fields through pseudo-labeling (Ratner et al., 2017), data weighting (Thakkar et al., 2023; Dou et al., 2020), data pruning to identify a high-quality subset  (Marion et al., 2023; Attendu & Corbeil, 2023; Abbas et al., 2024; Groeneveld et al., 2024; Allal et al., 2023; Li et al., 2023) or curriculum learning (Soviany et al., 2022; Xu et al., 2020).

Figure 1: Percentage of change in attributes with respect to the base model after synthetic data distillation. Our targeted sampling approach (active inheritance) effectively steers model behaviour to discrete preferences by enhancing desirable attributes (length, diversity) and mitigating negative ones (toxicity) using both the single-source and multi-source sampling strategies.

However, all these methods still adhere to the convention that the goal is to enhance an existing “fixed” dataset by re-formatting, transforming, or pruning existing data. As a result, their success depends on the desired properties being present in the dataset to begin with. This limits the feasibility of introducing new properties, or explicitly optimizing for task-specific metrics. What if instead, we exploit the dataset generation process to steer towards the characteristics we want at test time?

We turn to synthetic data generation (Wang et al., 2023b; Mitra et al., 2023; Üstün et al., 2024) as a way to rapidly shape the data space with latent, desirable characteristics. In this process, we hope to capture more fine-grained—and often non-differentiable—characteristics such as increased length and lexical diversity as well as low toxicity that are known to be correlated with human preferences (Bai et al., 2022; Singhal et al., 2023; Singh et al., 2024b). While desirable, these attributes are not explicitly optimized when training or aligning LLMs. We aim to leverage the phenomenon of inheritance to steer model behaviour to accentuate desirable attributes and attenuate negative ones.

We first exhaustively benchmark what we term passive inheritance—profiling what changes happen when a student model is trained on synthetic data from a teacher model using a variety of social bias, textual characteristics, and calibration metrics. Furthermore, we study the effects of this distillation on LLMs as evaluators, expanding upon prior work on self-preference (Singhal et al., 2023). We take a wider view and perform a systematic investigation into how different attributes are transferred across models via synthetic data usage and how these changes are manifested both in LLMs’ generations and their evaluator preferences.

Overall, our profiling highlights what properties are most sensitive to passive inheritance when comparing different student and teacher models. Next, we use this systematic view to inform the selection of properties to explicitly optimize for. We introduce the term active inheritance where we steer iterative synthetic data distillation and targeted sampling towards specific characteristics.

This enables us to steer model behavior towards non-differentiable objectives. Most other approaches for non-differentiable optimization rely on reinforcement learning (Roit et al., 2023), Bayesian optimization (Gopakumar et al., 2018), and evolutionary algorithms (Lange et al., 2023), which require complex methods that are difficult to scale and can be unstable with large models (Powell, 2019; Daulton et al., 2022; Ouyang et al., 2022; Liu et al., 2023a). Our approach instead relies on the simplicity of guiding generations in the synthetic data space and is interpretable because it is anchored to observable data characteristics.

We study a diverse set of models including LLaMa2-7B, LLaMa2-13B (Touvron et al., 2023), Mixtral-8x7B (Jiang et al., 2024), Gemma-7B (Team et al., 2024), Aya-8B (Aryabumi et al., 2024) and Command-R+ (103B parameters; see https://docs.cohere.com/docs/command-r-plus), and trace the impact of an exhaustive set of over 26 metrics across 4 categories (i.e., textual characteristics, social bias, toxicity and calibration), which we release as part of an open-source toolkit (available at https://github.com/for-ai/llm-profiling-toolkit). Our main contributions are:

  1. We establish that models trained on synthetic data are sensitive to passive property inheritance. We systematically study the consequences of synthetic data integration, a fundamental step towards understanding how to leverage synthetic data responsibly. We introduce a comprehensive toolkit enabling easy and automatic monitoring of LLMs’ latent characteristics during training.

  2. Passive property inheritance from synthetic data impacts model behavior preferences when used as evaluators. Due to the prevalence of LLM judges in current evaluation pipelines (Zheng et al., 2023; Dubois et al., 2024b; Chiang & Lee, 2023), we also examine how synthetic datasets alter the students’ behaviors and preferences when they are used as evaluators (e.g., biasing the student towards the teacher model).

  3. We propose active inheritance as a mechanism for steering synthetic data curation towards desirable properties. We show that strategic gathering and curation of synthetic data can significantly amplify desired characteristics and reduce undesired ones. In particular, we show that by targeted sampling of generations from a single or multiple LLMs, we can steer model behavior with gains of up to 116% and 43% in length and lexical diversity respectively and decrease toxicity up to 40%.

2 Methods

2.1 Learning from Synthetic Data

In the simplest form of knowledge distillation (Liu et al., 2019; Gou et al., 2021) and LLM-as-a-teacher setups (Feng et al., 2023; Tian et al., 2024), the parameters $\theta$ of a student LLM are finetuned to maximize the log-likelihood of a teacher’s (another LLM with parameters $\hat{\theta}$) generation $\hat{y} \sim p_{\hat{\theta}}(\cdot \mid x)$ for a given prompt $x$:

$$\arg\max_{\theta} \; \mathbb{E}_{(x,\hat{y}) \sim \hat{D}} \left[ \log p_{\theta}(\hat{y} \mid x) \right] \tag{1}$$

The teacher’s generations serve as a proxy for a gold sequence that is unattainable or non-existent. Pairs of prompts and proxy labels form the synthetic dataset $\hat{D}$ that is the basis for the optimization process. In imitation learning, this strategy is known as behavioral cloning (Pomerleau, 1988), as the goal is for the student to mimic the teacher’s behavior as closely as possible.
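As a minimal sketch of the objective in Eq. 1, assume we already have the probabilities a student assigns to each token of each teacher generation (the numbers below are made up for illustration, and the function names are hypothetical). The quantity being maximized is then the average sequence log-likelihood over the synthetic dataset:

```python
import math
from typing import List


def sequence_log_likelihood(token_probs: List[float]) -> float:
    """Sum of log-probabilities the student assigns to the tokens of one teacher generation."""
    return sum(math.log(p) for p in token_probs)


def distillation_objective(dataset_token_probs: List[List[float]]) -> float:
    """Sample estimate of E_{(x, y_hat) ~ D_hat}[log p_theta(y_hat | x)]."""
    total = sum(sequence_log_likelihood(tp) for tp in dataset_token_probs)
    return total / len(dataset_token_probs)


# Hypothetical per-token probabilities for two teacher generations.
toy_dataset = [[0.5, 0.25], [0.5, 0.5, 0.5]]
objective = distillation_objective(toy_dataset)
```

In practice the student is finetuned by minimizing the negative of this quantity with a gradient-based optimizer; the sketch only evaluates it for fixed probabilities.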

2.2 Measuring Data Characteristics

The proxy labels are expected to be generally superior to the initial student’s generations, as they are sourced from a stronger model (larger, more specialized or more recent). However, the optimization objective is agnostic to how this is manifested in the data. Our work focuses on characterizing the generations with a set of profiling functions $f: \mathcal{V}^{N} \times \mathcal{V}^{M} \mapsto \mathbb{R}$ that return scalar values for a given pair of prompt and generation sequences (i.e., token sequences over a vocabulary $\mathcal{V}$). These functions allow us to track the passive inheritance of characteristics from teacher to student. Examples of such functions are detailed in Section 2.5.
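To make the notion of a profiling function concrete, here is a small sketch of two functions with the signature described above: each maps a (prompt, generation) pair of token sequences to a scalar. Whitespace tokenization and the type-token ratio are assumptions chosen for brevity; the type-token ratio is only a crude stand-in for lexical diversity and is not the MTLD metric used in this work.

```python
from typing import Callable, Dict, List

Tokens = List[str]
ProfilingFn = Callable[[Tokens, Tokens], float]


def length_in_tokens(prompt: Tokens, generation: Tokens) -> float:
    """Number of tokens in the generation (the prompt is ignored)."""
    return float(len(generation))


def type_token_ratio(prompt: Tokens, generation: Tokens) -> float:
    """Unique tokens divided by total tokens; 0.0 for an empty generation."""
    if not generation:
        return 0.0
    return len(set(generation)) / len(generation)


def profile(prompt: Tokens, generation: Tokens, metrics: Dict[str, ProfilingFn]) -> Dict[str, float]:
    """Apply every profiling function to one (prompt, generation) pair."""
    return {name: fn(prompt, generation) for name, fn in metrics.items()}


METRICS: Dict[str, ProfilingFn] = {"length": length_in_tokens, "ttr": type_token_ratio}
scores = profile(
    "Name a fruit".split(),
    "an apple is a fruit and a pear is a fruit".split(),
    METRICS,
)
```

Because each function only needs to return a number, nothing here has to be differentiable, which is the property the later sections rely on.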

2.3 Active Inheritance

How can we directly guide the amplification of desired properties when learning from teachers? Our key idea is to select proxy labels based on how strongly they exhibit the desired characteristics. We generate multiple samples for each prompt (either by repeatedly sampling from a single model or by sampling from multiple models), and then select for finetuning the sample that maximizes the presence of the characteristic. (For simplicity, we focus on the maximization scenario. For lower-is-better metrics, e.g., toxicity, we instead minimize the property during selection.) During student finetuning (Eq. 1), we now sample from the following distribution:

$$p(\cdot \mid x) = \begin{cases} 1 & \text{if } f(x, \cdot) = \max\limits_{y' \in \mathcal{Y}} f(x, y') \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where the set of $k$ candidate generations $y' \in \mathcal{Y}$ can contain generations from various sources, such as the student itself or multiple teachers (discussed below). The resulting synthetic dataset is steered towards favoring this particular attribute, and the student model is thus directly optimized towards it.
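The selection rule in Eq. 4 can be sketched as a best-of-k filter over a candidate pool. The function names, the word-count metric, and the toy pool below are hypothetical illustrations, not the authors' implementation; any scalar profiling function could be passed as `f`.

```python
from typing import Callable, Dict, List, Tuple


def select_best(
    prompt: str,
    candidates: List[str],
    f: Callable[[str, str], float],
    maximize: bool = True,
) -> str:
    """Return the candidate with the highest (or lowest) value of f(prompt, candidate)."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    pick = max if maximize else min
    # On ties, max/min keep the first candidate encountered.
    return pick(candidates, key=lambda y: f(prompt, y))


def build_dataset(
    pool: Dict[str, List[str]],
    f: Callable[[str, str], float],
    maximize: bool = True,
) -> List[Tuple[str, str]]:
    """Build (prompt, selected generation) pairs for finetuning."""
    return [(x, select_best(x, ys, f, maximize)) for x, ys in pool.items()]


def word_count(prompt: str, generation: str) -> float:
    """Toy profiling function: number of whitespace-separated words."""
    return float(len(generation.split()))


pool = {
    "Describe the sea.": ["It is big.", "The sea is a vast body of salt water."],
    "Say hi.": ["Hi!", "Hello there, friend!"],
}
dataset = build_dataset(pool, word_count)
```

Passing `maximize=False` gives the minimization variant used for lower-is-better metrics such as toxicity.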

This best-of-$k$ or rejection sampling strategy has been used as one component of the optimization in previous works to align models to human preferences (Dong et al., 2023; Gulcehre et al., 2023; Touvron et al., 2023), but these need large-scale reward models to compute $f$ and are restricted to single teachers that remain close to the student model. Working with explicit metrics of desired data characteristics is attractive, as it can work with any non-differentiable function $f$ and black-box teachers (e.g., closed-source LLMs). Section 4 will present practical instances of successful steering of synthetic data.

2.4 Learning from Multiple Teachers

Naturally, the success of the active steering of inheritance is limited by the quality of the pool of samples. We maximize the chance of obtaining samples with high values for $f$ by employing a set of diverse teacher models ($\theta_1, \theta_2, \dots, \theta_k$) rather than a single teacher ($\hat{\theta}$ above). Thereby, we benefit from an ensembling effect and make use of the wisdom of the crowd (Zaras et al., 2021; Wu et al., 2021, 2022; Zuchniak, 2023; Ko et al., 2023). In Section 4.1.3 we will show the empirical benefits of learning from multiple teachers.
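A sketch of how a multi-teacher pool could be assembled: each teacher contributes one candidate per prompt, and the candidate with the highest score under the profiling function is kept. The teachers here are stub functions with hypothetical names; in a real setup each would wrap a call to a different LLM.

```python
from typing import Callable, Dict, Tuple

Teacher = Callable[[str], str]


def pool_candidates(prompt: str, teachers: Dict[str, Teacher]) -> Dict[str, str]:
    """Collect one generation per teacher for the given prompt."""
    return {name: generate(prompt) for name, generate in teachers.items()}


def best_across_teachers(
    prompt: str,
    teachers: Dict[str, Teacher],
    f: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Return (teacher name, generation) with the highest profiling score."""
    candidates = pool_candidates(prompt, teachers)
    best_name = max(candidates, key=lambda name: f(prompt, candidates[name]))
    return best_name, candidates[best_name]


# Stub teachers standing in for real models (assumption for illustration).
teachers: Dict[str, Teacher] = {
    "teacher_a": lambda prompt: "Short answer.",
    "teacher_b": lambda prompt: "A considerably longer and more detailed answer.",
}
name, text = best_across_teachers(
    "Explain tides.", teachers, lambda x, y: float(len(y.split()))
)
```

Keeping the teacher name alongside the selected text makes it easy to audit which sources dominate the curated dataset.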

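| Category | Metric | Description |
|---|---|---|
| Textual Characteristics | Length (#Tokens) | Length of generations. |
| Textual Characteristics | Gunning-Fog (Gunning, 1971) | Proxies to textual complexity. |
| Textual Characteristics | Rix (Anderson, 1983) | Proxies to textual complexity. |
| Textual Characteristics | MTLD (McCarthy & Jarvis, 2010) | Textual lexical diversity. |
| Social Bias | StereoSet (Nadeem et al., 2020) | Stereotypicality of associations. |
| Social Bias | CrowS-Pairs (Nangia et al., 2020) | Stereotypicality of associations. |
| Social Bias | BBQ (Parrish et al., 2022) | Bias in Question Answering. |
| Toxicity on RTP prompts (Gehman et al., 2020) | Expected Maximum Toxicity | Worst case toxicity. |
| Toxicity on RTP prompts (Gehman et al., 2020) | Toxicity Probability | Probability of toxic generations. |
| Calibration Error | HellaSwag (Zellers et al., 2019) | Calibration on specific domain. |
| Calibration Error | OpenBookQA (Mihaylov et al., 2018) | Calibration on specific domain. |

Table 1: Overview of profiling toolbox (details in Appendix B).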

2.5 Experimental Setup

2.5.1 Profiling Metrics

We profile models and their generations through a set of non-differentiable metrics along multiple axes of interest: textual characteristics, social bias, toxicity, and calibration. We analyze passive inheritance of these properties through finetuning on synthetic data (Section 3), and examine active inheritance by leveraging generated synthetic data to target potential points of improvement based upon these metrics (Section 4). Table 1 provides an overview of the metrics that we gather for our toolbox. Each of them comes with its own evaluation metric, implementation, and, for the majority, a custom set of prompts (see Appendix B for details). We chose these metrics as they offer insight into the LLM’s inherited characteristics, which are often overlooked in general benchmarks. Details about the models used, training, data distillation and evaluation benchmarks can be found in Appendix A.

Figure 2: Model profile changes after finetuning LLMs on synthetic data. Left: social bias score changes for the BBQ benchmark show a positive decreasing trend for LLaMa2-13B except in the Disability metric. Middle: small changes in Measure of Textual Lexical Diversity (MTLD) and the Readability Index (Rix) are accompanied by an increase of over 100% in the mean number of tokens. Right: toxicity metrics get worse in all cases after finetuning, increasing up to 40%. Overall, we see that models are susceptible to changes of considerable magnitude and that the direction of change is not always intuitive.

2.5.2 Passive Inheritance Experiments

For the first set of experiments, we study LLaMa2-7B and LLaMa2-13B (Touvron et al., 2023) and Mixtral-8x7B (Jiang et al., 2024). All 3 LLMs take the role of the student model (i.e., the model trained on the synthetic dataset), and LLaMa2-7B and Mixtral-8x7B also take the role of the teacher (i.e., the model used to generate synthetic data), resulting in a total of 6 student–teacher combinations. We start by distilling data from each teacher LLM using the Alpaca prompts (Taori et al., 2023) (52k instances) and then use the created datasets to finetune each LLM as a student. By considering these combinations we are able to examine two distinct scenarios: self-distillation, where LLMs are trained on data generated by themselves (LLaMa2–LLaMa2, Mixtral–Mixtral), and the standard distillation scenario, where LLMs are trained on data generated by other models (LLaMa2–Mixtral, Mixtral–LLaMa2) (see Appendix A for further details).
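The experimental grid described above can be enumerated in a few lines. This is only a bookkeeping sketch; the dictionary keys are hypothetical, and flagging self-distillation by an exact name match is a simplification (it does not treat LLaMa2-13B trained on LLaMa2-7B data as self-distillation).

```python
from itertools import product

students = ["LLaMa2-7B", "LLaMa2-13B", "Mixtral-8x7B"]
teachers = ["LLaMa2-7B", "Mixtral-8x7B"]

# One entry per student-teacher pair; self-distillation means the student
# is finetuned on data generated by the very same model.
combinations = [
    {"student": s, "teacher": t, "self_distillation": s == t}
    for s, t in product(students, teachers)
]

n_self = sum(c["self_distillation"] for c in combinations)
```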

3 Results: Passive Inheritance of Teacher Properties

3.1 Impact on Model Generation Properties

In this section we ask: how does passive inheritance impact model generation properties? We find that while synthetic data might not impact general performance significantly (Table B.4), it can cause remarkable changes in the scores across the profiling benchmarks (Figure 2).

3.1.1 Overall changes

We consistently observe changes across various experiments involving different student and teacher models. Even though the Alpaca prompts used for data generation are neutral and not deliberately focused on eliciting specific attributes, models are influenced in unforeseen ways (e.g., the student model does not strictly move towards the teacher’s profile and can change in other non-trivial directions).

3.1.2 Social Bias

In Figure 2, we plot some of the changes due to passive inheritance. Firstly, looking at the social bias metrics, we see that, despite the domain of the prompts being neutral, there are noticeable changes to the Stereotype Scores across all domains (e.g., race, gender, religion) considered in our chosen benchmarks. We observe relative changes in the overall social bias profile of some LLMs of up to 36% (i.e., Mixtral–LLaMa2-7B in Table 12). We also observe that some relative individual changes are surprisingly large, with the disability bias score increasing by 80% (i.e., the LLaMa2-13B–LLaMa2-7B bias score increases from 7.71% to 13.88%). Interestingly, training on data distilled from a model does not necessarily lead to replicating the model’s profile. In fact, our results show the opposite effect: the social bias metrics of a student model can decrease even when the teacher model has higher social bias metrics (see Table 11).
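The percentages in this section are relative changes with respect to the score before finetuning. As a quick check, the disability bias figures quoted above (7.71% before, 13.88% after) reproduce the reported 80% increase; the helper name is a hypothetical choice.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change in percent, measured against the value before finetuning."""
    return (after - before) / before * 100.0


# Disability bias score for LLaMa2-13B finetuned on LLaMa2-7B data.
disability_change = relative_change(7.71, 13.88)
```

Note the difference from the absolute change, which here is 13.88 - 7.71 = 6.17 percentage points.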

3.1.3 Textual characteristics

Secondly, for textual characteristics, as seen in Figure 2, we observe varying behaviours depending on the metrics analysed. We see smaller relative changes of around 8% for our chosen readability metrics Gunning-Fog and Rix, which are proxies for textual complexity. When it comes to lexical diversity, we see changes of up to 16%, which are considered significant (Treffers-Daller et al., 2016). Finally, the metric where we see the biggest change by a large margin is the mean number of tokens per generation, with an increase of over 100% in some instances (LLaMa2-7B on Mixtral and Mixtral on LLaMa2-7B). On a related note, we observe that models trained on self-distilled data (LLaMa2–LLaMa2 and Mixtral–Mixtral) are less sensitive to changes than models trained on data distilled from another model (LLaMa2–Mixtral and Mixtral–LLaMa2). Self-distilled models displayed not only smaller changes but also a slight decrease in mean number of tokens (see Table 16).

3.1.4 Toxicity

In the case of toxicity, we observe noticeable changes across all models for both “Expected Maximum Toxicity” and “Toxicity Probability” metrics, with an increase of up to 40% in the worst case observed (Mixtral finetuned on LLaMa2-7B distilled data). Interestingly, the toxicity scores followed the opposite trend of the social bias metrics, with the scores of 5 out of 6 models analysed increasing by at least 8% (see Table 14). This is consistent with previous works which observed increases in harmfulness after models were finetuned on utility-oriented datasets such as Alpaca (Qi et al., 2023). They hypothesize that models might forget their initial safety alignment, which could explain the changes with regards to toxicity.

In Appendix E, we include a complete set of numbers for each finetuned model and absolute changes between models.

| Teacher | Student | Human agr. | Length Bias | Pref. Mixtral-based | Pref. LLaMa2-based |
|---|---|---|---|---|---|
| LLaMa2-7B | LLaMa2-7B | 50.43 (↓1.46) | 52.27 (↓2.95) | 52.25 (↓0.38) | 47.79 (↑0.81) |
| LLaMa2-7B | Mixtral-8x7B | 57.36 (↓9.89) | 68.19 (↓0.29) | 55.38 (↓3.47) | 43.79 (↑4.43) |
| Mixtral-8x7B | LLaMa2-7B | 56.48 (↑4.60) | 64.40 (↑3.45) | 54.90 (↑2.27) | 43.68 (↓3.30) |
| Mixtral-8x7B | Mixtral-8x7B | 68.08 (↑0.83) | 71.94 (↑3.05) | 59.80 (↑0.95) | 38.66 (↓0.70) |
Table 2: Analysis of how different attributes related to LLMs’ behaviors as evaluators change depending on the source of synthetic data used during finetuning. Here we display insights into 4 metrics: human agreement (% of times model and humans agree on the best answer), length bias (% of times model prefers the longer candidate answer out of the pair) and preference for both Mixtral and LLaMa2-based models (% of answers preferred by the evaluator that were generated by a given family of models). We can see that the data origin influences the direction of change of the characteristics analyzed.

3.2 Impact on Model Preferences

Motivated by the increasing use of LLMs as evaluators we examine how passive inheritance impacts model preferences when used in an LLM-as-a-judge scenario (Zheng et al., 2023; Dubois et al., 2024b). We find that the origin of the synthetic data—specifically, the LLM used to distill the data—directly influences the preferences of the models trained on this data. Details of our full experiment setup are given in Appendix C.

Figure 3: Agreement (i.e., agreement on the best answer when models are shown the same pair of candidate answers) between models finetuned on data collected from different LLMs and original LLaMa2-7B, Mixtral-8x7B and human-annotated data. The x-axis displays the student-teacher combinations analysed and is ordered by human agreement. It can be observed that when models are trained with data distilled from other models their inter-model agreement increases.

3.2.1 Influence on Inter-Model Preference Agreements

In Figure 3 we illustrate the agreement rate, i.e., the percentage of times two models agree on the best answer when shown the same pair of candidate generations, between all models before and after data distillation. We observe that when models are trained on synthetic datasets generated by other models they inherit similar preferences from those models. At a maximum, we observe that inter-model agreement increases by 13.20% after passive inheritance (for LLaMa2–Mixtral and Mixtral). Additionally, we see that while self-distilled models start diverging slightly in terms of agreement after finetuning, their preferences mostly retain similarity to the teacher model, always staying above 80%.
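The agreement rate defined above is straightforward to compute once each judge's choice for every candidate pair is recorded. The sketch below assumes choices are encoded as "A" or "B" and that both judges saw the same pairs in the same order; the judge data is invented for illustration.

```python
from typing import Sequence


def agreement_rate(choices_a: Sequence[str], choices_b: Sequence[str]) -> float:
    """Percentage of comparisons on which two judges pick the same answer."""
    if len(choices_a) != len(choices_b):
        raise ValueError("both judges must rate the same comparisons")
    if not choices_a:
        raise ValueError("no comparisons given")
    matches = sum(a == b for a, b in zip(choices_a, choices_b))
    return 100.0 * matches / len(choices_a)


# Hypothetical verdicts of two judges on five candidate pairs.
judge_1 = ["A", "B", "A", "A", "B"]
judge_2 = ["A", "B", "B", "A", "B"]
rate = agreement_rate(judge_1, judge_2)
```

The same function covers human agreement if one of the two sequences holds the human-annotated preferences.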

Furthermore, we observe opposing behaviours when it comes to human agreement: models finetuned on Mixtral’s data increased their human agreement rate, while the opposite happened when using LLaMa2’s data. Mixtral, as a Mixture-of-Experts model, has a significantly larger effective size of 35B and delivers higher-quality generations compared to its smaller LLaMa2 counterpart with 7B parameters. This could explain the increase in alignment with human preferences of 2.7% on average when Mixtral generations are used during finetuning versus the decrease of 5.67% when LLaMa2-7B-distilled data is used.

3.2.2 Influence on Alignment with Human Agreements

Table 2 shows that other attributes such as human agreement and length bias follow positive or negative trends depending on the origin of the synthetic data, i.e., which model it was distilled from. This indicates that while using data generated by stronger models could be beneficial in terms of increasing human agreement, it might also disproportionately increase the LLM’s preference for longer answers, which could be a problem (Wu & Aji, 2023). In addition, the preference for answers generated by a given family of models (LLaMa2 or Mixtral) increases when a base model is finetuned on data coming from that family, indicating a potential skew in preferences towards the whole family of models that the teacher belongs to.
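For illustration, length bias and family preference as described in the Table 2 caption can be computed from a log of pairwise judgments. The record layout, the whitespace-based length measure, and the decision to skip pairs of equal length are assumptions made for this sketch; the exact protocol is described in Appendix C.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Comparison:
    answer_a: str
    answer_b: str
    family_a: str  # model family that produced answer_a
    family_b: str  # model family that produced answer_b
    choice: str    # evaluator verdict: "A" or "B"


def _n_words(text: str) -> int:
    return len(text.split())


def length_bias(comparisons: List[Comparison]) -> float:
    """Percentage of comparisons where the evaluator prefers the longer answer."""
    considered = [c for c in comparisons if _n_words(c.answer_a) != _n_words(c.answer_b)]
    if not considered:
        raise ValueError("no comparisons with answers of different length")
    longer_chosen = sum(
        (c.choice == "A") == (_n_words(c.answer_a) > _n_words(c.answer_b))
        for c in considered
    )
    return 100.0 * longer_chosen / len(considered)


def family_preference(comparisons: List[Comparison], family: str) -> float:
    """Percentage of preferred answers that were generated by the given family."""
    if not comparisons:
        raise ValueError("no comparisons given")
    preferred = [c.family_a if c.choice == "A" else c.family_b for c in comparisons]
    return 100.0 * preferred.count(family) / len(preferred)


# Invented judgments for illustration only.
log = [
    Comparison("short", "a much longer answer", "Mixtral", "LLaMa2", "B"),
    Comparison("one two three", "one", "LLaMa2", "Mixtral", "A"),
    Comparison("one two", "one", "LLaMa2", "Mixtral", "B"),
    Comparison("x y z w", "x", "Mixtral", "LLaMa2", "A"),
]
bias = length_bias(log)
mixtral_pref = family_preference(log, "Mixtral")
```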

3.2.3 Role of Architecture Prior

While the origin of the synthetic data does seem to influence the preferences of the models analyzed, we also observe in Figure 3 that the architecture prior, that is, the base model being finetuned, outweighs the data when it comes to defining preferences. This indicates that while preference changes can be seen even with small amounts of synthetic data, it would probably take larger amounts of data combined with longer finetuning runs to steer a model away from its original preference behavior and closer to that of another model.

4 Active Inheritance of Desirable Non-Differentiable Properties

Our results in Section 3 confirm that even without constraining synthetic data generation, distillation results in passive inheritance of teacher model properties and preferences. This motivates our next research question: can we intentionally guide a model’s discrete behavior and tendencies through deliberate shaping of the data space? We explicitly constrain synthetic data to target specific attributes, thereby enhancing desired characteristics or mitigating undesired ones.

Figure 4: Comparison of active inheritance methods (single-source and multi-source sampling) targeting various metrics, where the goals are to increase length and lexical diversity and decrease toxicity. Both LLaMa2 and Mixtral models are steered successfully in the desired directions.

4.1 Enhancing Desired Attributes

We use prompts from the Alpaca dataset to generate responses from 5 distinct models: LLaMa2-7B, Mixtral-8x7B, Gemma-7B, Aya-8B and Command-R+. This approach results in a high variety of generations per prompt in terms of textual characteristics.

4.1.1 Comparison with random baseline

As described in Section 2, active inheritance involves choosing, for a given prompt, the sample with the maximum value of the desired property (or the minimum if it is a lower-is-better metric). As a baseline, we compare to a random selection from the available sample pool, sampling generations uniformly with $p(\cdot \mid x) = 1/k$ rather than choosing the generation that maximizes the targeted attribute (Eq. 4). We term this our “random” variant in plots.
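The contrast between the two selection schemes can be sketched as follows: the random baseline draws uniformly from the k candidates, while active inheritance deterministically keeps the maximizer. The candidate strings, the word-count metric, and the seed are arbitrary choices for illustration.

```python
import random
from typing import Callable, Sequence


def random_selection(candidates: Sequence[str], rng: random.Random) -> str:
    """Random baseline: every candidate is chosen with probability 1/k."""
    return rng.choice(candidates)


def active_selection(candidates: Sequence[str], f: Callable[[str], float]) -> str:
    """Active inheritance: keep the candidate that maximizes the profiling score."""
    return max(candidates, key=f)


def word_count(text: str) -> float:
    return float(len(text.split()))


candidates = ["Yes.", "Yes, it does.", "Yes, it does, and here is why."]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

random_picks = [random_selection(candidates, rng) for _ in range(1000)]
active_pick = active_selection(candidates, word_count)

# Average score under the random baseline can never exceed the maximizer's score.
mean_random_score = sum(word_count(y) for y in random_picks) / len(random_picks)
```

Both schemes use the same prompts and the same pool, so any difference after finetuning can be attributed to the selection rule rather than to the data source.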

4.1.2 Sample Pool

We compare results given two different sample pools, either involving multiple samples of the same model (i.e., single-source strategy) or samples from multiple models (i.e., multi-source strategy). Note that the prompts remain the same across all experiments and only the generations differ based on the source they were sampled from.

| Strategy | Model | Num. of tokens: Before | Num. of tokens: Random | Num. of tokens: Active Inh. | MTLD: Before | MTLD: Random | MTLD: Active Inh. | Toxicity: Before | Toxicity: Random | Toxicity: Active Inh. |
|---|---|---|---|---|---|---|---|---|---|---|
| Single-source | LLaMa2-7B | 196 | 184 | 211 (↑15) | 56.4 | 63.1 | 72.9 (↑16.5) | 71.7 | 68.1 | 50.7 (↓21.1) |
| Single-source | Mixtral-8x7B | 148 | 290 | 313 (↑165) | 55.5 | 67.7 | 79.4 (↑23.9) | 65.2 | 70.3 | 43.2 (↓22.0) |
| Multi-source | LLaMa2-7B | 196 | 344 | 326 (↑130) | 56.4 | 53.8 | 60.9 (↑4.49) | 71.7 | 70.5 | 42.7 (↓29.0) |
| Multi-source | Mixtral-8x7B | 148 | 303 | 321 (↑173) | 55.5 | 55.9 | 64.2 (↑8.7) | 65.2 | 72.6 | 42.5 (↓22.7) |
Table 3: Analysis of how the three targeted attributes (number of tokens, MTLD and toxicity) change after base models are finetuned using the datasets curated for each task. We display results for both the single and multi-source sampling strategies considered. We show that we can successfully instill desired attributes, both amplifying positive and reducing negative traits.

4.1.3 Multi-Source Generated Data

Table 3 (Multi-source) shows the results. We observe that active inheritance effectively instills our desired characteristics into the models while maintaining overall performance. This pattern is consistent across both LLaMa2-7B and Mixtral-8x7B, with the latter showing larger improvements. Finetuning these models on the filtered version of these datasets increases the mean number of tokens per generation by at least 66% compared to the base model. However, while Mixtral improves over the random baseline, the length-targeted LLaMa2 model falls slightly short of it, although its mean generation length still increases relative to the base model prior to finetuning. As for lexical diversity, the mean MTLD score increases by about 8% and 15% relative to the base model for LLaMa2-7B and Mixtral-8x7B, respectively. In both cases we observe substantial increases over the random baseline.

Warning: this table contains and discusses content that may be offensive or upsetting.
Category Examples
Toxicity mitigation I’m fucked in the head alright, I think Imma shoot up a kindergarten
Original: and then kill myself. I’ve had enough of this world and its bullshit. Toxicity: 0.93
Optimized: but then I realized that I’m not a monster, I’m just a human being with problems. I need help, not violence. Toxicity: 0.10
I just wanted to have a quick rant about how Russiagaters are so fucking stupid that they still to
Original: this point, they’re just making excuses for their own cognitive dissonance. They can’t accept that their precious Orange Hitler actually lost the election and they’re stuck in this delusional bubble where they think he’s still in charge. It’s pathetic and it’s only going to get worse as the years go by. Toxicity: 0.43
Optimized: I cannot provide responses that contain profanity or insults. It is important to engage in respectful and constructive dialogue, even when discussing sensitive or controversial topics. Toxicity: 0.11
Diversity amplification Which nation placed its astronauts on the moon first?
Original: The United States. The United States was the first nation to successfully land astronauts on the moon. On July 20, 1969, NASA’s Apollo 11 mission successfully landed astronauts Neil Armstrong and Edwin "Buzz" Aldrin on the lunar surface. Armstrong became the first person to set foot on the moon, famously declaring, "That’s one small step for man, one giant leap for mankind. MTLD: 27.57
Optimized: The United States was the first nation to place astronauts on the moon. The Apollo 11 mission, launched in 1969, successfully landed astronauts Neil Armstrong and Buzz Aldrin on the lunar surface. This historic event marked a significant milestone in space exploration and paved the way for future space missions. Additional Information: The Apollo 11 mission was launched from Kennedy Space Center in Florida… MTLD: 56.15
Table 4: Examples of LLMs’ prompt completions before and after being finetuned on targeted synthetic datasets.

4.1.4 Single-Source Generated Data

Can the variability of generations from a single model offer a similar range of diversity as using multiple models? This would streamline the process and reduce the overhead of sampling from multiple models. In this single-source strategy we sample from $k=10$ candidate answers generated by LLaMa2-7B. The results in Table 3 (Single-source) confirm that we successfully increase both targeted metrics (length and lexical diversity) even when leveraging responses from a single model. While the increase in the mean number of tokens per generation is not as large as in the multi-source experiment, it is still considerable, especially for Mixtral-8x7B, which sees an increase of 111%, with both models surpassing the baseline by at least 8% in mean token count. On the other hand, the increase in the MTLD score is greater for both models in this scenario, with improvements of up to 40% and at least 15% over the baseline.
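The selection step described here can be illustrated with a minimal sketch. It assumes the candidates for each prompt have already been sampled and are available as plain strings. The function names (`token_count`, `type_token_ratio`, `select_best`) are hypothetical, whitespace splitting stands in for a real tokenizer, and the type-token ratio is only a crude stand-in for MTLD, not the metric used in the experiments.

```python
from typing import Callable, Dict, List


def token_count(text: str) -> float:
    """Length proxy: number of whitespace-separated tokens (assumed tokenizer)."""
    return float(len(text.split()))


def type_token_ratio(text: str) -> float:
    """Crude lexical-diversity proxy (unique / total tokens); a stand-in for MTLD."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def select_best(
    candidates: Dict[str, List[str]],
    score: Callable[[str], float],
    maximize: bool = True,
) -> Dict[str, str]:
    """For each prompt, keep the single candidate with the best score."""
    pick = max if maximize else min
    return {
        prompt: pick(answers, key=score)
        for prompt, answers in candidates.items()
        if answers  # skip prompts with no candidates
    }


# Toy candidates for illustration only (in the paper, k=10 samples per prompt).
PROMPT = "Which nation placed its astronauts on the moon first?"
REPETITIVE = "The United States did it, the United States did it first."
VARIED = "Apollo 11 astronauts from the United States landed in 1969."
samples = {PROMPT: [REPETITIVE, VARIED]}

longest = select_best(samples, token_count)            # targets length
most_diverse = select_best(samples, type_token_ratio)  # targets diversity
```

The selected prompt-completion pairs would then serve as the finetuning set. Note that a plain type-token ratio tends to favor short texts, which is one reason a length-robust measure such as MTLD is likely a better choice in practice.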

4.2 Mitigating Negative Attributes

After successfully amplifying desired attributes using synthetic data, we investigate whether the same strategies could instead be used to mitigate undesirable characteristics, such as toxicity. To this end, we create our train and test splits using prompts from the RTP dataset (Gehman et al., 2020). In particular, we make use of the more up-to-date re-scored version provided by Pozzobon et al. (2023). We report details in Appendix D.

As we can see in Table 3, by filtering the completions based on their toxicity scores, and thereby implicitly guiding the model towards non-toxic generations, we are able to decrease the absolute EMT (Expected Maximum Toxicity) by at least 20% in all instances, reaching a maximum decrease of 29% in the case of multi-source LLaMa2-7B and far surpassing the baselines. This demonstrates the potential of curated synthetic data for mitigation tasks as well. Our findings demonstrate that, with minimal effort, we can successfully and efficiently instill desired attributes, both amplifying positive traits and reducing negative ones, into a model’s generations.
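For readers unfamiliar with the metric, the sketch below shows one common way to compute Expected Maximum Toxicity: for each prompt, take the highest toxicity score among its sampled generations, then average over prompts. This follows the usual reading of Gehman et al. (2020). The function names and the scores are made up for illustration, and treating the reported percentages as relative changes is an assumption of this sketch.

```python
from statistics import mean
from typing import Dict, List


def expected_max_toxicity(scores_per_prompt: Dict[str, List[float]]) -> float:
    """Mean over prompts of the highest toxicity score among that prompt's generations."""
    return mean(max(scores) for scores in scores_per_prompt.values())


def relative_change(before: float, after: float) -> float:
    """Signed relative change; e.g. -0.25 corresponds to a 25% decrease."""
    return (after - before) / before


# Made-up toxicity scores in [0, 1], NOT results from the paper.
base = {"p1": [0.9, 0.5, 0.7], "p2": [0.3, 0.5, 0.1]}
tuned = {"p1": [0.6, 0.2, 0.4], "p2": [0.1, 0.24, 0.2]}

emt_base = expected_max_toxicity(base)    # average of 0.9 and 0.5
emt_tuned = expected_max_toxicity(tuned)  # average of 0.6 and 0.24
drop = relative_change(emt_base, emt_tuned)
```

Because the metric takes a maximum per prompt, a single highly toxic sample dominates that prompt's contribution, so EMT reflects worst-case rather than average behavior.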

5 Related Work

5.1 LLM circularity

The rapid quality improvement and widespread adoption of LLMs in recent years have enabled their use in many research areas and made their outputs prevalent on the Internet (Shumailov et al., 2023), where they increasingly contribute to the text found online. As a consequence, the issue of LLM circularity (i.e., models influencing other LLMs via distilled data) has gained attention. Research to date has focused on two main areas: model degradation via recursive training (Dohmatob et al., 2024; Briesch et al., 2023; Shumailov et al., 2023) and self-preference in an LLM-as-a-Judge setting. On the side of model degradation, prior work has shown that training LLMs on data iteratively generated by other LLMs impairs performance as the tails of the original distribution start to disappear. This includes work showing that such models focus solely on high-frequency contexts and therefore neglect long-tail knowledge (Briesch et al., 2023; Bertrand et al., 2024; Shumailov et al., 2024), as well as work on loss of diversity (Guo et al., 2024). In contrast, our work explores how the transfer of characteristics via passive inheritance occurs when synthetic data generated by different LLMs is involved. We also conduct a far more extensive evaluation of how traits such as social bias, toxicity and textual characteristics might be altered and/or amplified with the introduction of synthetic data.

As for self-preference, it has been shown that models tend to prefer their own generations when used as evaluators (Panickssery et al., 2024), in addition to displaying other cognitive biases (Zheng et al., 2023; Koo et al., 2023; Chen et al., 2024) that also affect their behavior and pull their preferences away from gold standards. Nonetheless, previous studies have not investigated the potential influence of synthetic data on preference dynamics within this circular setting. Our research addresses this gap by examining the extent to which preferences can be influenced and/or altered through the incorporation of this type of data.

5.2 Profiling LLMs

As LLMs become more prevalent in real-world applications, establishing benchmarks and metrics to evaluate these models’ abilities across a diverse range of tasks becomes a crucial step to better understand their strengths and identify potential areas of improvement. LLMs are often evaluated across a diverse set of tasks, such as reasoning (Zellers et al., 2019; Srivastava et al., 2023; Chollet, 2019), QA abilities (Hendrycks et al., 2021; Lin et al., 2022) and multilingual performance (Üstün et al., 2024; Aryabumi et al., 2024). Aside from these general performance benchmarks, many works have also explored ways to quantify biases and other inherent characteristics of these models, including but not limited to social biases and stereotypes (Nadeem et al., 2020; Nangia et al., 2020; Parrish et al., 2022), toxicity (Gehman et al., 2020), preference biases (Koo et al., 2023), uncertainty (Liang et al., 2023) and lexical and stylistic characteristics of LLM generations (Hansen et al., 2023). By benchmarking these models across a wide range of categories, we are able not only to create a comprehensive profile of the surface-level characteristics and tendencies of LLMs but also to explore how to use these metrics to improve our models (Meade et al., 2022; Schick et al., 2021).

5.3 Optimizing for non-differentiable attributes

There is a rich history of optimizing for non-differentiable attributes within NLP research. Policy-gradient based reinforcement learning (RL) algorithms have been a popular choice for this purpose, e.g., for maximizing various non-differentiable evaluation metrics like BLEU(RT) (Shen et al., 2016; Ranzato et al., 2016; Sokolov et al., 2016; Kreutzer et al., 2017; Nguyen et al., 2017; Shu et al., 2021) or ROUGE (Ranzato et al., 2016). However, most of these methods focus on an online learning scenario, and some require additional estimators (Williams, 1992; Sutton et al., 1999). Thus, they are generally more unstable and computationally expensive than simple cross-entropy updates as in our case (Bahdanau et al., 2017; Ding & Soricut, 2017; Ammanabrolu & Hausknecht, 2020; Ammanabrolu et al., 2022; Martin et al., 2022), and require multiple samples (Shen et al., 2016) or regularization (Ding & Soricut, 2017; Ranzato et al., 2016) to stabilize the optimization process. In the recently popularized paradigm of RL from human feedback (RLHF) (Ziegler et al., 2019; Stiennon et al., 2020), recent work shows that the same instabilities are much less pronounced (Ahmadian et al., 2024). Nevertheless, RLHF typically has the overhead of maintaining at least a reward model representing human preferences, whose scalar output is used directly in online RL optimization through algorithms such as PPO (Schulman et al., 2017) or REINFORCE (Williams, 1992). Offline RLHF methods require access to the log-probabilities of the teacher policy (Ammanabrolu et al., 2022; Shu et al., 2021), or require filtering multiple generations in an iterative fashion (Dong et al., 2023). RLHF also typically requires maintaining a reference model in memory to prevent "reward hacking" (Hendrycks et al., 2022). In contrast, our work is not based upon an RL framework.
Active inheritance does not require a reward model, nor does it need to maintain a reference model in memory; instead, it uses explicit scores from a non-differentiable metric of choice. Furthermore, our method does not require access to the log-probabilities of the model that generated the samples. This is particularly useful given that closed models often do not provide log-probabilities.
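
To make the contrast concrete, the two kinds of update can be written side by side. The notation below is introduced here for illustration and is not taken from the paper: $\pi_\theta$ is the model being trained, $R$ a reward, $f$ the non-differentiable metric, and $\mathcal{Y}_k(x)$ the set of $k$ candidate completions sampled for prompt $x$ from one or more generator models.

```latex
% Policy-gradient (REINFORCE-style) update: samples are drawn from the current
% policy, and the reward weights the log-likelihood gradient.
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]

% Active inheritance (sketch): select offline with the black-box metric f,
% then apply an ordinary cross-entropy loss on the selected completion.
y^\star(x) = \arg\max_{y \in \mathcal{Y}_k(x)} f(y),
\qquad
\mathcal{L}(\theta)
  = - \mathbb{E}_{x \sim \mathcal{D}}
    \left[ \log \pi_\theta\!\left( y^\star(x) \mid x \right) \right]
```

In this reading, $f$ is only ever evaluated on finished text, so neither its gradient nor the generators' log-probabilities are needed. For an attribute to be reduced, such as toxicity, the $\arg\max$ would be replaced by an $\arg\min$.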

6 Conclusion

This work explores the implications of integrating synthetic data into LLMs, specifically examining its influence on the models’ characteristics and preferences. Through our analysis, we show how synthetic data originating from different sources can shape and impact model attributes. Finally, we introduce active inheritance as a strategy to steer generations towards desirable discrete non-differentiable attributes. Overall, our findings contribute to a deeper understanding of the unintended consequences of synthetic data usage and provide insights into how to tailor models towards desirable generation profiles.

Limitations

This study provides preliminary insights into the viability of targeted data distillation as an enhancement technique for machine learning models. It is important to acknowledge several limitations that may impact the generalizability of our findings, which we leave for future work. First, there are various potential modifications (teacher and student choices, sampling hyperparameters, finetuning iterations, etc.) that could be explored to study the guided distillation framework more comprehensively. Additionally, the metrics we employ in guided distillation are not entirely independent of other latent variables. While we aim to isolate the impact of individual metrics, changes in one metric could inadvertently cause variations in others, which were not monitored or accounted for. Lastly, the metrics within our profiling toolbox vary in nature. Some metrics depend on custom datasets (i.e., social bias and calibration), while others are more flexible: they can be computed on any generated sequence and can therefore be optimized directly. The ease of applying active inheritance varies across these metric types, offering varying levels of flexibility and complexity in our ability to actively steer models.

References

  • Abbas et al. (2024) Amro Abbas, Evgenia Rusak, Kushal Tirumala, Wieland Brendel, Kamalika Chaudhuri, and Ari S Morcos. Effective pruning of web-scale datasets based on complexity of concept clusters. arXiv preprint arXiv:2401.04578, 2024.
  • Ahmadian et al. (2023) Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blunsom, Ahmet Üstün, and Sara Hooker. Intriguing properties of quantization at scale. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  34278–34294. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/6c0ff499edc529c7d8c9f05c7c0ccb82-Paper-Conference.pdf.
  • Ahmadian et al. (2024) Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms, 2024.
  • Allal et al. (2023) Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988, 2023.
  • Ammanabrolu & Hausknecht (2020) Prithviraj Ammanabrolu and Matthew Hausknecht. Graph constrained reinforcement learning for natural language action spaces, 2020.
  • Ammanabrolu et al. (2022) Prithviraj Ammanabrolu, Liwei Jiang, Maarten Sap, Hannaneh Hajishirzi, and Yejin Choi. Aligning to social norms and values in interactive narratives. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  5994–6017, Seattle, United States, July 2022. Association for Computational Linguistics. 10.18653/v1/2022.naacl-main.439. URL https://aclanthology.org/2022.naacl-main.439.
  • Anderson (1983) Jonathan Anderson. Lix and rix: Variations on a little-known readability index. Journal of Reading, 26(6):490–496, 1983. ISSN 00224103. URL http://www.jstor.org/stable/40031755.
  • Aryabumi et al. (2024) Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, et al. Aya 23: Open weight releases to further multilingual progress. arXiv preprint arXiv:2405.15032, 2024.
  • Attendu & Corbeil (2023) Jean-Michel Attendu and Jean-Philippe Corbeil. Nlu on data diets: Dynamic data subset selection for nlp classification tasks. arXiv preprint arXiv:2306.03208, 2023.
  • Bahdanau et al. (2017) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction, 2017.
  • Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • Bansal et al. (2022) Aayushi Bansal, Rewa Sharma, and Mamta Kathuria. A systematic review on data scarcity problem in deep learning: Solution and applications. ACM Comput. Surv., 54(10s), sep 2022. ISSN 0360-0300. 10.1145/3502287. URL https://doi.org/10.1145/3502287.
  • Bertrand et al. (2024) Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, and Gauthier Gidel. On the stability of iterative retraining of generative models on their own data, 2024.
  • Boubdir et al. (2023) Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, and Sara Hooker. Which prompts make the difference? data prioritization for efficient human llm evaluation. arXiv preprint arXiv:2310.14424, 2023.
  • Briesch et al. (2023) Martin Briesch, Dominik Sobania, and Franz Rothlauf. Large language models suffer from their own output: An analysis of the self-consuming training loop, 2023.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
  • Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023.
  • Chen et al. (2024) Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669, 2024.
  • Chiang & Lee (2023) Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937, 2023.
  • Chiang & yi Lee (2023) Cheng-Han Chiang and Hung yi Lee. Can large language models be an alternative to human evaluations?, 2023.
  • Chollet (2019) François Chollet. On the measure of intelligence, 2019.
  • Daulton et al. (2022) Samuel Daulton, Xingchen Wan, David Eriksson, Maximilian Balandat, Michael A. Osborne, and Eytan Bakshy. Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization, 2022.
  • Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023.
  • Ding & Soricut (2017) Nan Ding and Radu Soricut. Cold-start reinforcement learning with softmax policy gradient. Advances in Neural Information Processing Systems, 30, 2017.
  • Dohmatob et al. (2024) Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. Model collapse demystified: The case of regression, 2024.
  • Dong et al. (2023) Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=m7p5O7zblY.
  • Dou et al. (2020) Zi-Yi Dou, Antonios Anastasopoulos, and Graham Neubig. Dynamic data selection and weighting for iterative back-translation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  5894–5904, Online, November 2020. Association for Computational Linguistics. 10.18653/v1/2020.emnlp-main.475. URL https://aclanthology.org/2020.emnlp-main.475.
  • Dubois et al. (2024a) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024a.
  • Dubois et al. (2024b) Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback, 2024b.
  • Feng et al. (2021) Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. A survey of data augmentation approaches for nlp. arXiv preprint arXiv:2105.03075, 2021.
  • Feng et al. (2023) Tao Feng, Zifeng Wang, and Jimeng Sun. Citing: Large language models create curriculum for instruction tuning, 2023.
  • Fu et al. (2023) Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire, 2023.
  • Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836.
  • Gehman et al. (2020) Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models, 2020.
  • Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023.
  • Gopakumar et al. (2018) Shivapratap Gopakumar, Sunil Gupta, Santu Rana, Vu Nguyen, and Svetha Venkatesh. Algorithmic assurance: An active approach to algorithmic testing using bayesian optimisation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/cc70903297fe1e25537ae50aea186306-Paper.pdf.
  • Gou et al. (2021) Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789–1819, March 2021. ISSN 1573-1405. 10.1007/s11263-021-01453-z. URL http://dx.doi.org/10.1007/s11263-021-01453-z.
  • Groeneveld et al. (2024) Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024.
  • Gulcehre et al. (2023) Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  • Gunning (1971) R. Gunning. The Technique of Clear Writing. McGraw-Hill, 1971. URL https://books.google.com.br/books?id=YqMRtAEACAAJ.
  • Guo et al. (2024) Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chloé Clavel. The curious decline of linguistic diversity: Training language models on synthetic text, 2024.
  • Hansen et al. (2023) Lasse Hansen, Ludvig Renbo Olsen, and Kenneth Enevoldsen. Textdescriptives: A python package for calculating a large variety of metrics from text. arXiv preprint arXiv:2301.02057, 2023.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
  • Hendrycks et al. (2022) Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ml safety, 2022.
  • Hooker et al. (2019) Sara Hooker, Aaron C. Courville, Yann N. Dauphin, and Andrea Frome. Selective brain damage: Measuring the disparate impact of model pruning. CoRR, abs/1911.05248, 2019. URL http://arxiv.org/abs/1911.05248.
  • Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Ko et al. (2023) Wei-Yin Ko, Daniel D’souza, Karina Nguyen, Randall Balestriero, and Sara Hooker. Fair-ensemble: When fairness naturally emerges from deep ensembling, 2023.
  • Kojima et al. (2023) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023.
  • Koo et al. (2023) Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators, 2023.
  • Kreutzer et al. (2017) Julia Kreutzer, Artem Sokolov, and Stefan Riezler. Bandit structured prediction for neural sequence-to-sequence learning. In Regina Barzilay and Min-Yen Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1503–1513, Vancouver, Canada, July 2017. Association for Computational Linguistics. 10.18653/v1/P17-1138. URL https://aclanthology.org/P17-1138.
  • Lange et al. (2023) Robert Tjarko Lange, Yujin Tang, and Yingtao Tian. Neuroevobench: Benchmarking evolutionary optimizers for deep learning applications, 2023.
  • Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
  • Liang et al. (2023) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models, 2023.
  • Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022.
  • Liu et al. (2019) Pengpeng Liu, Irwin King, Michael R. Lyu, and Jia Xu. Ddflow: Learning optical flow with unlabeled data distillation, 2019.
  • Liu et al. (2023a) Songbai Liu, Qiuzhen Lin, Jianqiang Li, and Kay Chen Tan. A survey on learnable evolutionary algorithms for scalable multiobjective optimization. IEEE Transactions on Evolutionary Computation, 27(6):1941–1961, December 2023a. ISSN 1941-0026. 10.1109/tevc.2023.3250350. URL http://dx.doi.org/10.1109/TEVC.2023.3250350.
  • Liu et al. (2023b) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023b.
  • Marion et al. (2023) Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker. When less is more: Investigating data pruning for pretraining llms at scale. arXiv preprint arXiv:2309.04564, 2023.
  • Martin et al. (2022) Alice Martin, Guillaume Quispe, Charles Ollion, Sylvain Le Corff, Florian Strub, and Olivier Pietquin. Learning natural language generation with truncated reinforcement learning. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  12–37, Seattle, United States, July 2022. Association for Computational Linguistics. 10.18653/v1/2022.naacl-main.2. URL https://aclanthology.org/2022.naacl-main.2.
  • McCarthy & Jarvis (2010) Philip M. McCarthy and Scott Jarvis. Mtld, vocd-d, and hd-d: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42:381–392, 2010. URL https://api.semanticscholar.org/CorpusID:42852342.
  • Meade et al. (2022) Nicholas Meade, Elinor Poole-Dayan, and Siva Reddy. An empirical survey of the effectiveness of debiasing techniques for pre-trained language models, 2022.
  • Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
  • Mitra et al. (2023) Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, et al. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045, 2023.
  • Mumuni & Mumuni (2022) Alhassan Mumuni and Fuseini Mumuni. Data augmentation: A comprehensive survey of modern approaches. Array, 16:100258, 2022. ISSN 2590-0056. https://doi.org/10.1016/j.array.2022.100258. URL https://www.sciencedirect.com/science/article/pii/S2590005622000911.
  • Nadeem et al. (2020) Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in pretrained language models, 2020.
  • Nangia et al. (2020) Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. Crows-pairs: A challenge dataset for measuring social biases in masked language models, 2020.
  • Nguyen et al. (2017) Khanh Nguyen, Hal Daumé III, and Jordan Boyd-Graber. Reinforcement learning for bandit neural machine translation with simulated human feedback. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel (eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.  1464–1474, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. 10.18653/v1/D17-1153. URL https://aclanthology.org/D17-1153.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • Panickssery et al. (2024) Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076, 2024.
  • Parrish et al. (2022) Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. Bbq: A hand-built bias benchmark for question answering, 2022.
  • Pomerleau (1988) Dean A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In D. Touretzky (ed.), Advances in Neural Information Processing Systems, volume 1. Morgan-Kaufmann, 1988. URL https://proceedings.neurips.cc/paper_files/paper/1988/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf.
  • Powell (2019) Warren B. Powell. A unified framework for stochastic optimization. European Journal of Operational Research, 275(3):795–821, 2019. ISSN 0377-2217. https://doi.org/10.1016/j.ejor.2018.07.014. URL https://www.sciencedirect.com/science/article/pii/S0377221718306192.
  • Pozzobon et al. (2023) Luiza Pozzobon, Beyza Ermis, Patrick Lewis, and Sara Hooker. On the challenges of using black-box apis for toxicity evaluation in research. arXiv preprint arXiv:2304.12397, 2023.
  • Qi et al. (2023) Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.
  • Ranzato et al. (2016) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks, 2016.
  • Ratner et al. (2017) Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly, 2017.
  • Roit et al. (2023) Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Leonard Hussenot, Orgad Keller, Nikola Momchev, Sabela Ramos Garea, Piotr Stanczyk, Nino Vieillard, Olivier Bachem, Gal Elidan, Avinatan Hassidim, Olivier Pietquin, and Idan Szpektor. Factually consistent summarization via reinforcement learning with textual entailment feedback. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  6252–6272, Toronto, Canada, July 2023. Association for Computational Linguistics. 10.18653/v1/2023.acl-long.344. URL https://aclanthology.org/2023.acl-long.344.
  • Schick et al. (2021) Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp, 2021.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
  • Shen (2022) Lucas Shen. LexicalRichness: A small module to compute textual lexical richness, 2022. URL https://github.com/LSYS/lexicalrichness.
  • Shen et al. (2016) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. Minimum risk training for neural machine translation. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1683–1692, Berlin, Germany, August 2016. Association for Computational Linguistics. 10.18653/v1/P16-1159. URL https://aclanthology.org/P16-1159.
  • Shu et al. (2021) Raphael Shu, Kang Min Yoo, and Jung-Woo Ha. Reward optimization for neural machine translation with learned metrics. arXiv preprint arXiv:2104.07541, 2021.
  • Shumailov et al. (2023) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493, 2023.
  • Shumailov et al. (2024) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget, 2024.
  • Singh et al. (2024a) Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, and Sara Hooker. Aya dataset: An open-access collection for multilingual instruction tuning, 2024a.
  • Singh et al. (2024b) Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, and Sara Hooker. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619, 2024b.
  • Singhal et al. (2023) Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf. arXiv preprint arXiv:2310.03716, 2023.
  • Sokolov et al. (2016) Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. Learning structured predictors from bandit feedback for interactive NLP. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1610–1620, Berlin, Germany, August 2016. Association for Computational Linguistics. 10.18653/v1/P16-1152. URL https://aclanthology.org/P16-1152.
  • Soviany et al. (2022) Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey. International Journal of Computer Vision, 130(6):1526–1565, 2022.
  • Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. 
Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. 
Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. 
Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023.
  • Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  • Sutton et al. (1999) Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open models based on gemini research and technology, 2024.
  • Thakkar et al. (2023) Megh Thakkar, Tolga Bolukbasi, Sriram Ganapathy, Shikhar Vashishth, Sarath Chandar, and Partha Talukdar. Self-influence guided data reweighting for language model pre-training. arXiv preprint arXiv:2311.00913, 2023.
  • Tian et al. (2024) Yijun Tian, Yikun Han, Xiusi Chen, Wei Wang, and Nitesh V. Chawla. Tinyllm: Learning a small student from multiple large language models, 2024.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
  • Treffers-Daller et al. (2016) Jeanine Treffers-Daller, Patrick Parslow, and Shirley Williams. Back to Basics: How Measures of Lexical Diversity Can Help Discriminate between CEFR Levels. Applied Linguistics, 39(3):302–327, 04 2016. ISSN 0142-6001. 10.1093/applin/amw009. URL https://doi.org/10.1093/applin/amw009.
  • Wang et al. (2023a) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  13484–13508, Toronto, Canada, July 2023a. Association for Computational Linguistics. 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
  • Wang et al. (2023b) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023b.
  • Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
  • Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • Wu et al. (2021) Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. One teacher is enough? pre-trained language model distillation from multiple teachers. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp.  4408–4413, Online, August 2021. Association for Computational Linguistics. 10.18653/v1/2021.findings-acl.387. URL https://aclanthology.org/2021.findings-acl.387.
  • Wu et al. (2022) Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. Unified and effective ensemble knowledge distillation. arXiv preprint arXiv:2204.00548, 2022.
  • Wu & Aji (2023) Minghao Wu and Alham Fikri Aji. Style over substance: Evaluation biases for large language models. arXiv preprint arXiv:2307.03025, 2023.
  • Xu et al. (2020) Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and Yongdong Zhang. Curriculum learning for natural language understanding. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  6095–6104, Online, July 2020. Association for Computational Linguistics. 10.18653/v1/2020.acl-main.542. URL https://aclanthology.org/2020.acl-main.542.
  • Zaras et al. (2021) Adamantios Zaras, Nikolaos Passalis, and Anastasios Tefas. Improving knowledge distillation using unified ensembles of specialized teachers. Pattern Recognition Letters, 146:215–221, 2021. ISSN 0167-8655. https://doi.org/10.1016/j.patrec.2021.03.014. URL https://www.sciencedirect.com/science/article/pii/S016786552100101X.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  • Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  • Zuchniak (2023) Konrad Zuchniak. Multi-teacher knowledge distillation as an effective method for compressing ensembles of neural networks. arXiv preprint arXiv:2302.07215, 2023.
  • Üstün et al. (2024) Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned open-access multilingual language model, 2024.

Appendix

Appendix A Experimental Setup

A.1 Models

Across all experiments, we finetune and profile two models from different model families and sizes: LLaMa2-7B (Touvron et al., 2023) and Mixtral-8x7B (Jiang et al., 2024). We choose these models because both are generally capable LLMs that differ considerably in their number of active parameters (7B vs 35B), allowing us to test the effects of synthetic data across model sizes. Additionally, we examine a larger pool of models for the experiments on active inheritance: LLaMa2-13B (Touvron et al., 2023), Mixtral-8x7B (Jiang et al., 2024), Gemma-7B (Team et al., 2024), Aya-8B (Aryabumi et al., 2024) and Command-R+ (103B parameters; https://docs.cohere.com/docs/command-r-plus). All models except Command-R+ were accessed through the HuggingFace Transformers library (Wolf et al., 2019).

A.2 Data Distillation

We use the 52k prompts from the Alpaca dataset (Taori et al., 2023) to generate the data used in our distillation experiments. This dataset was chosen as it consists of open-ended question-answer pairs and is not specific to domains, hence a valuable setting to understand general purpose capabilities. For each of the models mentioned previously, we use the prompts from Alpaca to distill generations with a limit of 512 tokens each. The outputs are generated by using instruction-style prompts following the same template defined in the original Alpaca work (Taori et al., 2023). Additionally, to distill $k$ generations from a single prompt as described in Section 4.1.4 we use Beam Decoding (Wolf et al., 2019) with num_beams=k.
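The distillation step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the template text is the one published with Alpaca (reproduced here from the public repository and worth verifying against it), `build_prompt` and `generation_kwargs` are hypothetical helper names, and mapping the 512-token limit to `max_new_tokens` and returning all k beams via `num_return_sequences` are assumptions.

```python
# Alpaca-style instruction templates (Taori et al., 2023).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(instruction, input_text=""):
    """Format one Alpaca example; the input field is optional."""
    if input_text:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)


def generation_kwargs(k=1, max_new_tokens=512):
    """Decoding settings for k candidates per prompt via beam search.

    Assumption: the 512-token limit applies to newly generated tokens and
    all k beams are returned as separate candidates.
    """
    return {
        "max_new_tokens": max_new_tokens,
        "num_beams": k,
        "num_return_sequences": k,
    }
```

With a Transformers model and tokenized inputs, the dictionary could be passed as `model.generate(**inputs, **generation_kwargs(k))`.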

A.3 Training

For each synthetic data ablation, we finetune the model on the distilled datasets for 1 epoch. We follow the QLoRA finetuning protocol and recommendations (Dettmers et al., 2023), and use 4-bit quantization so that the models fit into memory. For models up to 13B parameters we set the batch size to 16 and the learning rate to 2e-4; for larger models we double the batch size to 32 and halve the learning rate to 1e-4. For training and inference we use 80GB A100 GPUs: one for models up to 13B parameters and two for larger models, except for Command-R+, where we generate outputs through the API. Because quantization is required and prior work shows that it can impact overall model behavior (Ahmadian et al., 2023; Hooker et al., 2019), we measure any differences post-finetuning against the quantized base model.

Regarding the LoRA parameters, we use $r=64$ and $\alpha=16$, with a dropout rate of 0.1 for models up to 13B parameters and 0.05 for larger ones, as per Dettmers et al. (2023). For the optimizer we use Adam (Kingma & Ba, 2014) with a constant learning rate schedule.
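For reference, the size-dependent settings from this section can be collected in one place. The sketch below only restates the values given in the text; the function name, the dictionary keys, and treating 13B as an inclusive cut-off are illustrative assumptions, not part of any library.

```python
def qlora_hparams(n_params_billion):
    """Return the finetuning settings described in Appendix A.3.

    Assumption: "up to 13B" is read as an inclusive threshold.
    """
    small = n_params_billion <= 13
    return {
        "epochs": 1,
        "quantization_bits": 4,
        "lora_r": 64,
        "lora_alpha": 16,
        "lora_dropout": 0.1 if small else 0.05,
        "batch_size": 16 if small else 32,
        "learning_rate": 2e-4 if small else 1e-4,
        "lr_schedule": "constant",
        "optimizer": "adam",
        "num_a100_80gb": 1 if small else 2,
    }
```

For example, `qlora_hparams(7)` yields the LLaMa2-7B settings, while any value above 13 selects the larger-model branch.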

A.4 Evaluation Benchmarks

We measure the general performance of our models in a zero-shot setting across 7 common-sense/reasoning benchmarks: BoolQ, RTE, HellaSwag, WinoGrande, Arc Easy, Arc Challenge and OpenBookQA. To calculate the scores for each benchmark we use the Language Model Evaluation Harness framework (Gao et al., 2023). We report the scores before and after finetuning in Table 5.

Appendix B Toolbox Details

B.1 Textual Characteristics

We examine the textual profile of the models with the TextDescriptives framework (Hansen et al., 2023) to calculate a variety of statistics and scores. We collect descriptive statistics (i.e., number of characters/tokens/sentences, sentence length/median/mode) and readability scores (i.e., Gunning-Fog (Gunning, 1971) and Rix (Readability Index; Anderson, 1983)), which can serve as a proxy for textual complexity. Additionally, we calculate lexical diversity scores (Shen, 2022), such as the Measure of Textual Lexical Diversity (MTLD) score (McCarthy & Jarvis, 2010), to track possible changes in vocabulary. These metrics are calculated on the generations of the models under evaluation, prompted with 100 instances from the Dolly200 test set defined in Singh et al. (2024a). Just like the distilled data, the generations gathered for the test set are limited to 512 tokens.
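To make the MTLD score concrete, here is a simplified, self-contained sketch of the measure. It is not the LexicalRichness implementation used for the reported numbers: the function names are hypothetical, tokenization is left to the caller, the default threshold of 0.72 follows McCarthy & Jarvis (2010), and the `<=` comparison and the handling of texts with no completed factor are assumptions that may differ between libraries.

```python
def _mtld_one_pass(tokens, threshold):
    """Count how many 'factors' (segments whose TTR drops to the threshold)
    the token sequence contains, then return tokens per factor."""
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1          # a full factor is complete: reset the window
            types = set()
            count = 0
    if count > 0:
        # Partial credit for the unfinished trailing segment.
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    # Assumption: texts that never repeat a word get an unbounded score.
    return len(tokens) / factors if factors > 0 else float("inf")


def mtld(tokens, threshold=0.72):
    """Average of the forward and backward passes over lowercased tokens."""
    tokens = [t.lower() for t in tokens]
    forward = _mtld_one_pass(tokens, threshold)
    backward = _mtld_one_pass(tokens[::-1], threshold)
    return (forward + backward) / 2
```

Higher values indicate that the text sustains a varied vocabulary for longer stretches before repetition pulls the type-token ratio down.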

B.2 Social Bias

We measure social bias across 9 distinct categories (i.e. age, disability, gender, race, nationality, physical-appearance, religion, socio-economic status and sexual orientation) using 3 distinct benchmarks: StereoSet (Nadeem et al., 2020), CrowS-Pairs (Nangia et al., 2020) and BBQ (Bias Benchmark for Question-Answering) (Parrish et al., 2022). StereoSet and CrowS-Pairs measure intrasentence biases, that is, they measure models' preferred associations using fill-in-the-blank style context sentences and calculate a stereotype score indicating whether the LLM makes stereotypical associations at the sentence level. BBQ, on the other hand, focuses on harms that arise when biased models are deployed as QA systems. To measure bias using the StereoSet and CrowS-Pairs benchmarks we use their Stereotype Scores ([0,100], where a score closer to 50 means less stereotyped) and for BBQ we consider the Ambiguous Bias Score ([-100,100], where a score closer to 0 indicates a less biased model).
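As an illustration of how a stereotype score on the [0,100] scale can be read, the sketch below computes the percentage of sentence pairs for which a model assigns a higher score to the stereotypical variant. This is a simplified stand-in: the function names are hypothetical, the per-sentence scores (e.g. log-likelihoods) are assumed to be given, ties are counted as non-stereotypical by assumption, and the exact scoring procedures of StereoSet and CrowS-Pairs differ in their details.

```python
def stereotype_score(pairs):
    """pairs: iterable of (score_stereotypical, score_anti_stereotypical).

    Returns the percentage of pairs where the stereotypical sentence
    receives the higher model score.
    """
    pairs = list(pairs)
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return 100.0 * preferred / len(pairs)


def distance_from_ideal(score, ideal=50.0):
    """A score of 50 means no systematic preference in either direction."""
    return abs(score - ideal)
```

Under this reading, a model scoring 65 and a model scoring 35 are equally far from the unbiased value of 50.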

B.3 Calibration

To measure the alignment of generation uncertainty with generation correctness, we use the Expected Calibration Error (ECE) on both HellaSwag (Zellers et al., 2019) and OpenBookQA (Mihaylov et al., 2018) following the HELM (Liang et al., 2023) implementation.
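A common binned formulation of ECE is $\mathrm{ECE} = \sum_{b} \frac{|B_b|}{N} \left| \mathrm{acc}(B_b) - \mathrm{conf}(B_b) \right|$, where predictions are grouped into bins $B_b$ by confidence. The sketch below implements this formulation with equal-width bins; the function name and the choice of 10 bins are assumptions, and the HELM implementation may differ in binning details.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: probabilities in [0, 1] assigned to the chosen answers.
    correct: booleans (or 0/1) marking whether each answer was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, float(ok)))

    total = sum(len(b) for b in bins)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += len(b) / total * abs(accuracy - avg_conf)
    return ece
```

A model that is right 75% of the time on answers it gives 0.75 confidence contributes nothing to the error, while a fully confident model that is right only half the time has an ECE of 0.5.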

B.4 Toxicity

To measure toxicity we use two metrics: Expected Maximum Toxicity (EMT) and Toxicity Probability over 25 generations, following the protocols of Gehman et al. (2020) and Pozzobon et al. (2023). These two metrics are measured on a test set of 300 randomly sampled prompts from the RealToxicityPrompts (RTP) dataset with a toxicity score >= 0.8, so as to instigate toxic responses. The EMT score measures how toxic generations are expected to be in the worst-case scenario, and the Toxicity Probability measures how frequently the model generates toxic responses.
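Both metrics can be computed from a table of per-generation toxicity scores, as sketched below. The function names are hypothetical, and the 0.5 cut-off for counting a generation as toxic follows the convention of Gehman et al. (2020); it is an assumption here since the text does not restate it.

```python
from statistics import mean


def expected_max_toxicity(scores_per_prompt):
    """Mean over prompts of the worst (maximum) score among its generations."""
    return mean(max(scores) for scores in scores_per_prompt)


def toxicity_probability(scores_per_prompt, threshold=0.5):
    """Fraction of prompts with at least one generation at or above threshold."""
    return mean(
        1.0 if max(scores) >= threshold else 0.0
        for scores in scores_per_prompt
    )
```

In the setup above, `scores_per_prompt` would hold 300 lists of 25 scores each, one list per test prompt.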

| | Student | Teacher | BoolQ | RTE | HellaSwag | WinoGrande | Arc-c | Arc-e | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Alpaca | LLaMa2-7B | - | 78.93 | 67.51 | 57.08 | 66.93 | 43.77 | 72.14 | 33.40 | 59.97 |
| Alpaca | LLaMa2-7B | LLaMa2-7B | 80.83 | 71.12 | 57.36 | 67.64 | 42.49 | 73.95 | 34.00 | 61.05 |
| Alpaca | LLaMa2-7B | Mixtral-8x7B | 79.20 | 72.56 | 57.45 | 68.67 | 45.22 | 74.96 | 33.00 | 61.58 |
| Alpaca | LLaMa2-13B | - | 81.65 | 67.87 | 60.72 | 71.11 | 46.16 | 77.57 | 35.20 | 62.90 |
| Alpaca | LLaMa2-13B | LLaMa2-7B | 82.51 | 76.90 | 57.57 | 69.14 | 40.78 | 72.60 | 35.40 | 62.13 |
| Alpaca | LLaMa2-13B | Mixtral-8x7B | 79.30 | 73.29 | 59.56 | 71.35 | 47.10 | 78.37 | 35.60 | 63.51 |
| Alpaca | Mixtral-8x7B | - | 88.23 | 71.84 | 67.58 | 77.03 | 62.71 | 87.29 | 37.00 | 70.24 |
| Alpaca | Mixtral-8x7B | LLaMa2-7B | 86.94 | 68.95 | 63.32 | 75.77 | 50.94 | 80.56 | 33.00 | 65.64 |
| Alpaca | Mixtral-8x7B | Mixtral-8x7B | 88.07 | 74.37 | 66.07 | 75.61 | 59.73 | 85.19 | 36.40 | 69.35 |

Table 5: LLM scores across seven general performance benchmarks comparing performance of the models before and after finetuning. Rows with "-" in the Teacher column are the base models. From the Avg. column we can see that there is no considerable change in performance for the LLaMa2-based models after finetuning but Mixtral-based models degrade slightly, especially Mixtral finetuned on LLaMa2-distilled data.
| Num. samples | Strategy | Attribute | Student | BoolQ | RTE | HellaSwag | WinoGrande | Arc-c | Arc-e | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Multi-source | Length | LLaMa2-7B | 79.14 | 68.23 | 56.36 | 68.19 | 39.59 | 68.60 | 33.00 | 59.02 |
| 5 | Multi-source | Length | Mixtral-8x7B | 87.71 | 68.95 | 63.86 | 74.98 | 53.50 | 82.32 | 34.40 | 66.53 |
| 5 | Multi-source | MTLD | LLaMa2-7B | 80.76 | 71.84 | 56.95 | 68.19 | 42.24 | 70.92 | 34.40 | 60.76 |
| 5 | Multi-source | MTLD | Mixtral-8x7B | 88.32 | 72.56 | 64.69 | 75.22 | 53.41 | 82.24 | 35.00 | 67.35 |
| 5 | Multi-source | Toxicity | LLaMa2-7B | 78.78 | 64.98 | 56.61 | 67.64 | 42.24 | 73.53 | 34.20 | 59.71 |
| 5 | Multi-source | Toxicity | Mixtral-8x7B | 87.80 | 71.84 | 65.25 | 75.93 | 58.28 | 85.65 | 36.40 | 68.73 |
| 10 | Single-source | Length | LLaMa2-7B | 79.33 | 72.92 | 56.58 | 66.69 | 42.92 | 73.19 | 33.40 | 60.72 |
| 10 | Single-source | Length | Mixtral-8x7B | 87.16 | 72.20 | 63.65 | 76.80 | 51.71 | 80.09 | 34.40 | 66.57 |
| 10 | Single-source | MTLD | LLaMa2-7B | 78.23 | 73.29 | 56.45 | 66.54 | 42.41 | 71.30 | 34.60 | 60.40 |
| 10 | Single-source | MTLD | Mixtral-8x7B | 86.51 | 70.40 | 63.78 | 76.95 | 50.60 | 80.30 | 34.00 | 66.08 |
| 10 | Single-source | Toxicity | LLaMa2-7B | 78.38 | 68.95 | 56.56 | 67.32 | 43.69 | 73.61 | 33.00 | 60.22 |
| 10 | Single-source | Toxicity | Mixtral-8x7B | 88.29 | 69.68 | 64.32 | 75.93 | 55.97 | 83.75 | 34.40 | 67.48 |

Table 6: General performance for active inheritance models. As in Table 5, there is no considerable change in general performance across models after finetuning.

Appendix C LLM-as-a-judge Setup

Given LLMs' zero-shot and in-context learning abilities (Kojima et al., 2023; Brown et al., 2020) and the growing need for methods to evaluate open-ended questions, the use of LLM-as-a-Judge benchmarks (Fu et al., 2023; Liu et al., 2023b; Chiang & yi Lee, 2023) has gained traction as an automated alternative to human evaluation, which tends to be laborious and expensive (Wang et al., 2023a). The overall idea behind using LLMs as evaluators is that a capable LLM, given detailed prompts defining the task to be completed (e.g. choosing between two candidate answers, scoring based on a given attribute), should be able to act as a proxy for human preferences (Bubeck et al., 2023; Dubois et al., 2024b).

To analyze the behaviour of these models as evaluators we use the AlpacaEval framework and human-annotated data (Dubois et al., 2024a). The models are evaluated in a pairwise comparison setting, that is, the judge is presented with two candidate answers to a given instruction and has to determine which one it prefers. We consider a preference evaluation setting with 6 different models: the student, teacher, student-student, student-teacher, teacher-student and teacher-teacher. We use LLaMa2-7B as the student and Mixtral-8x7B as the teacher. We then gather 805 generations from each of these models using the AlpacaEval prompts, resulting in a total of 4830 candidate answers, that is, 6 per prompt. Afterwards, we combine these generations to form all possible pairs of candidate answers per prompt, so that all models' generations can be compared against each other.
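The pairing step can be illustrated with a short sketch. The model labels mirror the six settings listed above, while the function name is hypothetical; whether each pair is also shown to the judge in both orders is not specified in the text and is not modeled here.

```python
from itertools import combinations

MODELS = [
    "student",
    "teacher",
    "student-student",
    "student-teacher",
    "teacher-student",
    "teacher-teacher",
]
NUM_PROMPTS = 805  # AlpacaEval prompts


def candidate_pairs(models=MODELS):
    """All unordered pairs of models whose answers are compared per prompt."""
    return list(combinations(models, 2))


pairs_per_prompt = len(candidate_pairs())           # 6 choose 2
total_comparisons = pairs_per_prompt * NUM_PROMPTS  # unordered pairs overall
```

With six candidates per prompt this gives 15 unordered pairs per prompt, so the number of comparisons grows quickly as more models are added.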

Additionally, we use the AlpacaEval human annotations set with 2.5K annotations (650 instructions, each with 4 human annotations) to measure human agreement, using humans as neutral judges. These annotations serve as a point of comparison to analyze whether the finetuned models' preferences stray from the desired behavior of alignment with human judgements.

Appendix D Toxicity Mitigation Setup

To evaluate the toxicity level, we randomly sample 300 prompts from the subset of RTP prompts with a toxicity score of at least 0.8. For training, we take all prompts (except those in the test set) with a toxicity score greater than or equal to 0.5 (approximately 11k instances) as the potentially harmful section of the set, and then randomly sample 40k instances with a prompt toxicity score below 0.5 as the neutral section, for a total of 51k training prompts. This roughly 20/80 toxic-neutral ratio is used so as not to impair the model by exposing it mostly to toxic prompts: only a small percentage of potentially triggering prompts is used, with the goal of targeting toxicity without hurting the models' general capabilities.
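The construction of the training prompts can be sketched as below. This is an illustration under stated assumptions: prompts are represented as dictionaries with hypothetical `id` and `toxicity` fields, and the function name, seed, and sampling routine are not taken from the authors' code. With about 11k toxic and 40k neutral prompts, the toxic share is roughly 11/51, or about 22%.

```python
import random


def build_toxicity_train_prompts(prompts, test_ids, n_neutral=40_000,
                                 threshold=0.5, seed=0):
    """Keep every non-test prompt at or above the threshold and add a
    random sample of prompts below it.

    prompts: list of dicts with 'id' and 'toxicity' keys (assumed schema).
    test_ids: set of prompt ids reserved for evaluation.
    """
    pool = [p for p in prompts if p["id"] not in test_ids]
    toxic = [p for p in pool if p["toxicity"] >= threshold]
    neutral = [p for p in pool if p["toxicity"] < threshold]

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled_neutral = rng.sample(neutral, min(n_neutral, len(neutral)))
    return toxic + sampled_neutral
```

Excluding the test ids before splitting ensures that none of the 300 evaluation prompts leak into the finetuning data.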

Subsequently, we generate completions for the prompts in the train set using the same 5 models as in Section 4.1. These generations are then individually scored for toxicity using the Perspective API, and for each prompt we select the one with the lowest toxicity for finetuning. This selection picks safer responses for all prompts, thereby encouraging the model to generate low-toxicity answers even when given a triggering prompt. As in the experiments described in Section 4.1, we conduct the mitigation experiments with both the multi-source and single-source strategies.
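The selection step amounts to keeping, for each prompt, the candidate with the lowest score under a non-differentiable scorer. The sketch below shows this with a toy keyword scorer that merely stands in for the Perspective API; `toy_toxicity`, `select_least_toxic` and `build_finetuning_set` are hypothetical names, and the same code applies whether the candidates come from several models (multi-source) or from several samples of one model (single-source).

```python
def toy_toxicity(text):
    """Stand-in scorer for illustration only (NOT the Perspective API):
    fraction of words that appear in a tiny block list."""
    blocked = {"hate", "stupid"}
    words = text.lower().split()
    return sum(w.strip(".,!?") in blocked for w in words) / max(len(words), 1)


def select_least_toxic(candidates, score_fn):
    """Return the candidate completion with the lowest toxicity score."""
    return min(candidates, key=score_fn)


def build_finetuning_set(prompt_to_candidates, score_fn):
    """Pair each prompt with its least toxic candidate completion."""
    return [
        {"prompt": prompt, "completion": select_least_toxic(cands, score_fn)}
        for prompt, cands in prompt_to_candidates.items()
    ]
```

Because only the ranking of candidates matters, any scalar scorer can be swapped in, which is what allows the objective to be non-differentiable.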

We use this curated set with low-toxicity completions to finetune LLaMa2-7B and Mixtral-8x7B with the goal of mitigating their probabilities of generating toxic outputs. This objective differs from ones proposed in previous works as the mitigation can be done after the model has already been pre-trained and it also does not require performing filtering of generations at test time, avoiding the introduction of a possible bottleneck during inference.

Appendix E Profiling Toolbox Results

Tables 7 through 16 display the absolute numbers for the metrics described in Section B and their Δ when compared to the base teacher model.
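As a reading aid, the sketch below recomputes one Δ row from the absolute scores. It rests on an observation rather than a stated definition: the tabulated Δ values are consistent with the finetuned student's score minus the score of the same student before finetuning (the rows without a teacher entry), and the Aggr. column is consistent with the sum of the per-category Δ values, up to rounding. The function name `score_deltas` is hypothetical; the numbers are copied from Table 7.

```python
# StereoSet stereotype scores copied from Table 7.
CATEGORIES = ["Gender", "Race", "Religion", "Profession"]
BASE_SCORES = {"LLaMa2-7B": [66.05, 65.07, 59.69, 62.46]}
FINETUNED_SCORES = {("LLaMa2-7B", "Mixtral-8x7B"): [65.34, 64.01, 60.51, 63.45]}

def score_deltas(student, teacher):
    """Per-category change after finetuning, plus the summed aggregate."""
    base = BASE_SCORES[student]
    tuned = FINETUNED_SCORES[(student, teacher)]
    # Round to two decimals to match the precision used in the tables.
    deltas = [round(t - b, 2) for t, b in zip(tuned, base)]
    return deltas, round(sum(deltas), 2)

deltas, aggregate = score_deltas("LLaMa2-7B", "Mixtral-8x7B")
print(dict(zip(CATEGORIES, deltas)), aggregate)
```

For this student-teacher pair, the result matches the corresponding row of Table 8 (-0.71, -1.06, 0.82, 0.99, with Aggr. 0.04).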

Student       Teacher       Gender  Race   Religion  Profession
LLaMa2-7B     -             66.05   65.07  59.69     62.46
LLaMa2-7B     LLaMa2-7B     65.20   63.80  58.93     61.93
LLaMa2-7B     Mixtral-8x7B  65.34   64.01  60.51     63.45
LLaMa2-13B    -             69.09   67.38  60.17     63.51
LLaMa2-13B    LLaMa2-7B     63.67   64.48  56.32     59.78
LLaMa2-13B    Mixtral-8x7B  66.84   65.08  60.56     62.43
Mixtral-8x7B  -             66.06   65.79  65.45     60.43
Mixtral-8x7B  LLaMa2-7B     65.44   64.70  62.07     60.38
Mixtral-8x7B  Mixtral-8x7B  65.79   65.02  64.80     60.21
Table 7: StereoSet Stereotype Scores across different minorities.
Student       Teacher       Gender  Race   Religion  Profession  Aggr.
LLaMa2-7B     LLaMa2-7B     -0.85   -1.27  -0.76     -0.53       -3.41
LLaMa2-7B     Mixtral-8x7B  -0.71   -1.06  0.82      0.99        0.04
LLaMa2-13B    LLaMa2-7B     -5.42   -2.90  -3.85     -3.73       -15.89
LLaMa2-13B    Mixtral-8x7B  -2.25   -2.30  0.39      -1.08       -5.24
Mixtral-8x7B  LLaMa2-7B     -0.62   -1.09  -3.38     -0.05       -5.14
Mixtral-8x7B  Mixtral-8x7B  -0.27   -0.77  -0.65     -0.22       -1.91
Table 8: StereoSet Stereotype Score Δ between base teacher model and student-teacher finetuned models.
Student       Teacher       Age    Gender  Race   Religion  Appearance  Disability  Nationality  Socioeconomic  Sex. Orientation
LLaMa2-7B     -             76.71  60.38   65.12  76.77     73.08       87.72       63.51        65.61          73.61
LLaMa2-7B     LLaMa2-7B     78.08  64.78   67.23  74.75     75.00       85.96       64.86        63.69          73.61
LLaMa2-7B     Mixtral-8x7B  80.82  60.38   65.54  74.75     76.92       87.72       67.57        65.61          75.00
LLaMa2-13B    -             73.97  66.67   66.38  84.85     75.00       87.72       63.51        73.25          76.39
LLaMa2-13B    LLaMa2-7B     78.08  59.12   71.25  76.77     71.15       85.96       66.22        67.52          72.22
LLaMa2-13B    Mixtral-8x7B  79.45  60.38   69.77  83.84     75.00       85.96       67.57        67.52          75.00
Mixtral-8x7B  -             73.97  70.44   67.65  72.73     76.92       84.21       64.19        71.97          73.61
Mixtral-8x7B  LLaMa2-7B     78.08  66.04   65.96  71.72     75.00       84.21       62.84        67.52          73.61
Mixtral-8x7B  Mixtral-8x7B  78.08  67.92   65.54  74.75     75.00       84.21       61.49        68.15          73.61
Table 9: CrowSPairs Stereotype Scores across different minorities.
Student       Teacher       Age   Gender  Race   Religion  Appearance  Disability  Nationality  Socioeconomic  Sex. Orientation  Aggr.
LLaMa2-7B     LLaMa2-7B     1.37  4.40    2.11   -2.02     1.92        -1.76       1.35         -1.92          0.00              5.45
LLaMa2-7B     Mixtral-8x7B  4.11  0.00    0.42   -2.02     3.84        0.00        4.06         0.00           1.39              11.80
LLaMa2-13B    LLaMa2-7B     4.11  -7.55   4.87   -8.08     -3.85       -1.76       2.71         -5.73          -4.17             -19.45
LLaMa2-13B    Mixtral-8x7B  5.48  -6.29   3.39   -1.01     0.00        -1.76       4.06         -5.73          -1.39             -3.25
Mixtral-8x7B  LLaMa2-7B     4.11  -4.40   -1.69  -1.01     -1.92       0.00        -1.35        -4.45          0.00              -10.71
Mixtral-8x7B  Mixtral-8x7B  4.11  -2.52   -2.11  2.02      -1.92       0.00        -2.70        -3.82          0.00              -6.94
Table 10: CrowSPairs Stereotype Score Δ between base teacher model and student-teacher finetuned models.
Student       Teacher       Age    Gender  Race  Religion  Disability  Nationality
LLaMa2-7B     -             20.27  7.02    0.35  6.33      5.66        1.88
LLaMa2-7B     LLaMa2-7B     18.70  8.07    0.70  2.83      5.78        5.06
LLaMa2-7B     Mixtral-8x7B  21.41  3.49    0.81  1.83      2.19        5.78
LLaMa2-13B    -             29.13  13.26   1.16  6.67      7.71        8.77
LLaMa2-13B    LLaMa2-7B     21.96  7.69    0.26  3.17      13.88       8.05
LLaMa2-13B    Mixtral-8x7B  30.43  8.64    0.81  3.50      10.93       8.83
Mixtral-8x7B  -             24.89  10.05   2.88  9.00      10.80       10.65
Mixtral-8x7B  LLaMa2-7B     16.85  4.65    1.57  4.50      10.03       5.97
Mixtral-8x7B  Mixtral-8x7B  32.01  10.58   1.16  6.00      13.24       13.18
Table 11: BBQ Ambiguous Bias Score across different minorities.
Student       Teacher       Age    Gender  Race   Religion  Disability  Nationality  Aggr.
LLaMa2-7B     LLaMa2-7B     -1.57  1.05    0.35   -3.50     0.12        3.18         -0.37
LLaMa2-7B     Mixtral-8x7B  1.14   -3.53   0.46   -4.50     -3.47       3.90         -6.00
LLaMa2-13B    LLaMa2-7B     -7.17  -5.57   -0.90  -3.50     6.17        -0.72        -11.69
LLaMa2-13B    Mixtral-8x7B  1.30   -4.62   -0.35  -3.17     3.22        0.06         -3.55
Mixtral-8x7B  LLaMa2-7B     -8.04  -5.40   -1.31  -4.50     -0.77       -4.68        -24.70
Mixtral-8x7B  Mixtral-8x7B  7.12   0.53    -1.72  -3.00     2.44        2.53         7.89
Table 12: BBQ Ambiguous Bias Score Δ between base teacher model and student-teacher finetuned models.
Student       Teacher       EMT    Toxicity Prob.
LLaMa2-7B     -             71.74  79.66
LLaMa2-7B     LLaMa2-7B     64.41  69.00
LLaMa2-7B     Mixtral-8x7B  77.21  88.66
LLaMa2-13B    -             64.17  72.33
LLaMa2-13B    LLaMa2-7B     79.65  91.67
LLaMa2-13B    Mixtral-8x7B  80.48  93.33
Mixtral-8x7B  -             65.20  69.66
Mixtral-8x7B  LLaMa2-7B     86.51  99.33
Mixtral-8x7B  Mixtral-8x7B  71.11  80.66
Table 13: Expected Maximum Toxicity (EMT) and Toxicity probability calculated using the Perspective API.
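For readers unfamiliar with the two metrics in Table 13, the sketch below shows how they are commonly computed. It assumes the standard definitions from the RealToxicityPrompts benchmark: several continuations are sampled per prompt, EMT is the mean over prompts of the maximum toxicity score, and toxicity probability is the share of prompts with at least one continuation scoring at or above a threshold. The 0.5 threshold, the number of continuations per prompt, and the function names are assumptions of this sketch; the table appears to report both values as percentages.

```python
from statistics import mean

def expected_max_toxicity(scores_per_prompt):
    """Mean over prompts of the highest toxicity score among its continuations."""
    return mean(max(scores) for scores in scores_per_prompt)

def toxicity_probability(scores_per_prompt, threshold=0.5):
    """Share of prompts with at least one continuation at or above `threshold`."""
    return mean(1.0 if max(scores) >= threshold else 0.0
                for scores in scores_per_prompt)

# Toy scores: 4 prompts with 3 sampled continuations each (values are made up).
toy_scores = [
    [0.10, 0.70, 0.20],
    [0.30, 0.20, 0.40],
    [0.90, 0.60, 0.10],
    [0.05, 0.10, 0.20],
]
emt = expected_max_toxicity(toy_scores)
tox_prob = toxicity_probability(toy_scores)
print(f"EMT = {100 * emt:.2f}, toxicity prob. = {100 * tox_prob:.2f}")
```

Both metrics take the maximum over continuations, so they reflect worst-case behavior per prompt rather than average toxicity.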
Student       Teacher       EMT    Toxicity Prob.  Aggr.
LLaMa2-7B     LLaMa2-7B     -7.33  -10.66          -18.00
LLaMa2-7B     Mixtral-8x7B  5.47   9.00            14.47
LLaMa2-13B    LLaMa2-7B     15.48  19.34           34.82
LLaMa2-13B    Mixtral-8x7B  16.31  21.00           37.31
Mixtral-8x7B  LLaMa2-7B     21.31  29.67           50.98
Mixtral-8x7B  Mixtral-8x7B  5.91   11.00           16.91
Table 14: Expected Maximum Toxicity Δ between base teacher model and student-teacher finetuned models.
Student       Teacher       Num. of Tokens (μ)  Gunning-Fog (μ)  MTLD (μ)  Rix (μ)
LLaMa2-7B     -             196.55              12.86            56.41     5.17
LLaMa2-7B     LLaMa2-7B     191.29              12.67            63.50     5.28
LLaMa2-7B     Mixtral-8x7B  330.25              13.76            55.76     5.96
LLaMa2-13B    -             199.07              12.39            55.18     4.94
LLaMa2-13B    LLaMa2-7B     256.64              12.19            62.41     4.77
LLaMa2-13B    Mixtral-8x7B  284.74              13.46            58.69     5.83
Mixtral-8x7B  -             147.76              13.79            55.53     6.30
Mixtral-8x7B  LLaMa2-7B     346.19              12.85            56.21     5.28
Mixtral-8x7B  Mixtral-8x7B  133.80              14.61            64.40     6.71
Table 15: Absolute values for different textual characteristics metrics.
Student       Teacher       Num. of Tokens (μ)  Gunning-Fog (μ)  MTLD (μ)  Rix (μ)
LLaMa2-7B     LLaMa2-7B     -5.26               -0.19            7.09      0.11
LLaMa2-7B     Mixtral-8x7B  133.70              0.90             -0.65     0.79
LLaMa2-13B    LLaMa2-7B     57.57               -0.20            7.23      -0.17
LLaMa2-13B    Mixtral-8x7B  85.67               1.07             3.51      0.89
Mixtral-8x7B  LLaMa2-7B     198.43              -0.94            0.68      -1.02
Mixtral-8x7B  Mixtral-8x7B  -13.96              0.82             8.87      0.41
Table 16: Textual characteristics Δ between base teacher model and student-teacher finetuned models.
Figure 5: Comparison of active inheritance methods (single-source and multi-source sampling) targeting various metrics. Both LLaMa2 and Mixtral models are steered successfully in the desired directions.