DiffuseDef: Improved Robustness to Adversarial Attacks

Zhenhao Li, Marek Rei, Lucia Specia
Language and Multimodal AI (LAMA) Lab, Imperial College London
{zhenhao.li18, marek.rei, l.specia}@imperial.ac.uk

Abstract

Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef (https://github.com/Nickeilf/DiffuseDef), which incorporates a diffusion layer as a denoiser between the encoder and the classifier. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over different existing adversarial defense methods and achieves state-of-the-art performance against common adversarial attacks.


1 Introduction

Pretrained language models (PLMs) have significantly advanced the performance of various natural language processing (NLP) tasks. Despite such improvements, current NLP systems remain susceptible to adversarial attacks where carefully crafted text perturbations can lead to incorrect model outputs (Alzantot et al., 2018; Jin et al., 2020; Li et al., 2020). In order to improve robustness to adversarial attacks, various defense methods have been proposed, such as adversarial training (Zhu et al., 2020; Si et al., 2021; Zhou et al., 2021; Xi et al., 2022), text denoising (Nguyen Minh and Luu, 2022; Wang et al., 2023), and ensembling (Zhou et al., 2021; Zeng et al., 2023; Li et al., 2023). However, existing defense methods either assume that the test-time perturbation/attack set is similar to that used in training (Li et al., 2021), are limited to specific architectures (Xi et al., 2022), or incur a large computational cost at inference time, thereby limiting their practical applicability.

Diffusion models are commonly used in computer vision (CV) to generate high-quality images by predicting and removing noise from a sampled noisy image. Therefore, they can be adopted to remove noise from adversarial images and thus improve robustness to attacks (Nie et al., 2022). However, in NLP very limited research has investigated adversarial defense with diffusion models due to the discrete and contextual nature of text data. Li et al. (2023) adopt the idea of iterative denoising and reconstruct adversarial texts from masked texts, while Yuan et al. (2024) use a diffusion model as a classifier and perform reverse diffusion steps on the label vector, conditioning on the input text. Inspired by the general noise prediction and reduction capability of diffusion models, we propose DiffuseDef, a novel adversarial defense method which employs diffusion training to denoise hidden representations of adversarial texts. Unlike Li et al. (2023) and Yuan et al. (2024), which apply diffusion to texts or labels, DiffuseDef directly removes noise from the hidden states, providing a more effective and robust text representation to defend against adversarial texts. Compared to diffusion-based defense in CV (Nie et al., 2022), DiffuseDef further enhances robustness with ensembling and improves efficiency with fewer diffusion steps.

DiffuseDef combines adversarial training with diffusion training, where the diffusion layer is trained to predict randomly sampled noise at a given timestep. During inference, the diffusion layer serves as a denoiser, iteratively removing noise from adversarial hidden states to yield a robust hidden representation. Moreover, we adopt the ensembling strategy by first adding random noise to text hidden states to create multiple variants then denoising them via the diffusion layer. The model output is made by averaging all denoised hidden states. Since ensembling happens solely at the diffusion layer, DiffuseDef is more efficient than traditional ensembling-based methods (Ye et al., 2020; Zeng et al., 2023), which require a full forward pass through all model parameters.

Through systematic experimentation, we demonstrate that DiffuseDef outperforms strong defense methods and is able to defend against multiple types of adversarial attacks, while preserving performance on clean texts. Our analysis also reveals that ensembling diffused representations provides a stronger defense against finding vulnerable words to attack and can reduce the distance in latent space between adversarial texts and their clean text counterparts.

Our contributions can be summarised as follows:

• We propose DiffuseDef, a novel and flexible adversarial defense method that can be added on top of any existing adversarial defense method to further improve robustness to adversarial attacks.

• DiffuseDef outperforms existing adversarial defense methods and achieves state-of-the-art performance against prevalent adversarial attacks.

• Through extensive analysis, we demonstrate the effectiveness of the ensembled diffused representation and the efficiency of DiffuseDef compared to existing ensembling-based methods.

2 Related Work

2.1 Textual Adversarial Attacks

Textual adversarial attacks focus on constructing adversarial examples from an original text that maximise the likelihood of incorrect predictions by a neural network. These attacks require adversarial examples to be perceptually similar to the original text, which is typically achieved by introducing subtle perturbations to the original text, such as character swapping (Gao et al., 2018; Ebrahimi et al., 2018), synonym-substitutions (Ren et al., 2019; Yoo and Qi, 2021), and paraphrasing (Gan and Ng, 2019; Huang and Chang, 2021). Taking the text classification task as an example, given a classifier $\mathcal{C}(\mathbf{x})$ that maps an input sequence of words $\mathbf{x}=[w_{1},w_{2},...,w_{L}]$ to its designated label $y$ , the goal of the attack model is to construct an adversarial example $\mathbf{x^{\prime}}=\mathbf{x}+\delta$ to fool the classifier, where $\delta$ is a subtle adversarial perturbation constrained by $||\delta||<\omega$ . The adversarial example $\mathbf{x^{\prime}}$ is considered a successful attack if it leads to an incorrect prediction $\mathcal{C}(\mathbf{x}^{\prime})\not=y$ . The attacker can iteratively generate multiple adversarial examples and query the classifier to obtain a successful attack, whereas the classifier must consistently return the correct prediction within a specified number of query attempts to be considered robust.
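The query-based notion of robustness described above can be illustrated with a minimal sketch. It assumes the attacker proposes candidate texts one at a time and stops at the first misclassification or when the query budget is exhausted; `attack_with_budget`, `toy_classifier`, and the candidate texts are hypothetical illustrations, not part of any attack library.

```python
from typing import Callable, Iterable, Optional, Tuple


def attack_with_budget(
    classify: Callable[[str], str],
    candidates: Iterable[str],
    true_label: str,
    max_queries: int,
) -> Tuple[Optional[str], int]:
    """Query the classifier on candidates until one is misclassified.

    Returns (successful adversarial example or None, number of queries used).
    """
    queries = 0
    for candidate in candidates:
        if queries >= max_queries:
            break
        queries += 1
        if classify(candidate) != true_label:
            return candidate, queries  # C(x') != y: successful attack
    return None, queries  # classifier stayed correct within the budget


def toy_classifier(text: str) -> str:
    # Hypothetical keyword classifier, used only to make the sketch runnable.
    return "positive" if "good" in text.split() else "negative"


# Two perturbed variants of "a good film": a synonym swap and a character swap.
candidates = ["a good movie", "a g00d film"]
result = attack_with_budget(toy_classifier, candidates, "positive", max_queries=10)
```

Under this view, a defense is stronger when `None` is returned more often for a fixed budget, or when more queries are consumed before a successful candidate is found.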

Common textual adversarial attack methods adopt a two-stage process to construct effective adversarial examples: word importance ranking and word substitution. In the first stage, words or subwords are ranked based on their influence on the model’s prediction. This is measured by leveraging either gradient information (Liu et al., 2022) or changes in prediction probabilities when words are removed (Jin et al., 2020) or masked (Ren et al., 2019; Li et al., 2020). In the second stage, candidate words are substituted with synonyms (Zang et al., 2020), perturbed variants (Gao et al., 2018), or outputs from masked language models (Garg and Ramakrishnan, 2020; Li et al., 2020). The substitution process is guided by various constraints to ensure the adversarial example remains natural and semantically equivalent to the original text. Common constraints include thresholding the similarity between the replacement word embedding and the substituted word embedding, or ensuring the semantic similarity between sentence vectors modeled from Universal Sentence Encoder (Cer et al., 2018). Despite these constraints, current textual adversarial attacks still pose significant challenges to NLP models (Liu et al., 2022; Xu et al., 2021; Yuan et al., 2023), highlighting the necessity for defense methods for better adversarial robustness.
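The first stage, word importance ranking by deletion, can be sketched as follows. This is a simplified illustration in the spirit of the deletion-based scoring described above, not the exact scoring used by any specific attack; `toy_prob_positive` and its `WEIGHTS` lexicon are hypothetical stand-ins for a real classifier.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical toy lexicon standing in for a trained model.
WEIGHTS: Dict[str, float] = {"great": 2.0, "boring": -2.0, "plot": 0.1}


def toy_prob_positive(words: Sequence[str]) -> float:
    """Probability of the positive class under a toy bag-of-words model."""
    score = sum(WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


def importance_by_deletion(
    words: Sequence[str], prob_fn: Callable[[Sequence[str]], float]
) -> List[float]:
    """Importance of word i = drop in predicted probability when i is removed."""
    base = prob_fn(words)
    scores = []
    for i in range(len(words)):
        reduced = list(words[:i]) + list(words[i + 1:])
        scores.append(base - prob_fn(reduced))
    return scores


def rank_words(
    words: Sequence[str], prob_fn: Callable[[Sequence[str]], float]
) -> List[int]:
    """Word indices sorted from most to least influential."""
    scores = importance_by_deletion(words, prob_fn)
    return sorted(range(len(words)), key=lambda i: abs(scores[i]), reverse=True)


sentence = ["a", "great", "plot"]
scores = importance_by_deletion(sentence, toy_prob_positive)
ranking = rank_words(sentence, toy_prob_positive)
```

The attacker would then try substitutions for the highest-ranked words first, which is why a flatter importance profile makes the search more expensive.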

2.2 Adversarial Defense Methods

To mitigate the performance degradation caused by adversarial attacks, various adversarial defense methods have been developed. They can be grouped into three categories: training-based, ensembling-based, and denoising-based methods. Adversarial training improves the robustness of the model to adversarial examples through strategies like data augmentation (Si et al., 2021) and adversarial regularisation (Madry et al., 2018; Zhu et al., 2020; Wang et al., 2021; Xi et al., 2022; Gao et al., 2023). However, adversarial training methods are limited as they assume similar train-test adversarial examples, and thus tend to overfit to specific types of adversarial attacks. Ensembling-based methods generate multiple variants of the input text at inference time and ensemble model predictions over all the variants (Ye et al., 2020; Zhou et al., 2021; Zeng et al., 2023; Li et al., 2023), but they can be inefficient given that a model prediction is needed for every ensemble member, so inference time grows with the number of ensembles. More recently, denoising-based methods have been proposed to improve adversarial robustness by mapping the vector representation of the adversarial text to another point in the latent space that is close to the clean text (Nguyen Minh and Luu, 2022; Wang et al., 2023; Moon et al., 2023; Yuan et al., 2024). The denoised representation makes it more difficult to find vulnerable words to attack, thus improving adversarial robustness (Wang et al., 2023). Nevertheless, denoising might lead to very different representations of clean text and adversarial text, thereby changing the semantic meaning.

The proposed DiffuseDef builds on these three approaches and can use any adversarially trained classifier as the base, applying denoising via a diffusion layer, and ensembling the diffused representations with a small number of ensembles. Using a diffusion layer as a denoiser addresses the overfitting problem from adversarial training and mitigates the efficiency problem by performing ensembling only at the diffusion layer. By averaging denoised hidden states across all ensembles, DiffuseDef also addresses the issue stemming from denoising, maintaining good performance on clean texts.

3 DiffuseDef

Figure 1: Training and inference of the DiffuseDef model. The adversarial training stage trains the pretrained encoder and classifier with perturbed input for adversarial robustness. The diffusion training stage trains the diffusion layer to predict injected noise at a given timestep $t$. At inference time, the text hidden state is first noised by 1 step and then denoised by $t^{\prime}$ steps to create the denoised hidden states, which are ensembled to make the final prediction.

3.1 Training

The proposed diffusion defense model consists of a pretrained encoder for feature extraction, a transformer-based diffusion layer for noise prediction and reduction, and a classifier layer for output generation. The training process is split into two stages: adversarial training and diffusion training (Figure 1). The adversarial training stage can employ any neural network-based adversarial training method, such as FreeLB++ (Li et al., 2021) or RSMI (Moon et al., 2023), which optimise the encoder and classifier for robustness by perturbing the latent representation of the text input.

In the diffusion training stage, only the diffusion layer is trained to predict random noise added to the clean text hidden state at different timesteps, enabling it to denoise the adversarial hidden state at inference time. The pretrained encoder, however, is frozen during this stage. Since the pretrained encoder is only used for feature extraction, the diffusion training method is compatible with any neural network-based adversarial training method.

Given an input sequence of tokens $\mathbf{x}\in\mathbb{R}^{L}$, the pretrained encoder extracts the hidden state $h\in\mathbb{R}^{L\times D}$. Random Gaussian noise $\epsilon$ is sampled to perturb the hidden state $h$. Sohl-Dickstein et al. (2015) define the forward diffusion process as a Markov chain where at each timestep Gaussian noise is sampled and added to the previous latent feature: $h_{t}=\sqrt{1-\beta_{t}}h_{t-1}+\sqrt{\beta_{t}}\epsilon$, where $\epsilon\sim\mathcal{N}(0,\mathcal{I})$, $h_{t}$ is the noisy hidden state at step $t$, and $\beta_{t}$ is a pre-calculated variance schedule changing with $t$. As shown by Ho et al. (2020), this equation can be reformulated to calculate $h_{t}$ directly from $h$ by defining $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod^{t}_{i=1}\alpha_{i}$, thus

$h_{t}=\sqrt{\bar{\alpha}_{t}}h+\sqrt{1-\bar{\alpha}_{t}}\epsilon$ (1)
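The step from the Markov chain to Equation 1 can be checked numerically. The sketch below tracks the signal scale and the accumulated noise variance of $h_{t}$ under the step-by-step chain and compares them with the closed-form coefficients $\sqrt{\bar{\alpha}_{t}}$ and $1-\bar{\alpha}_{t}$. The function names and the example $\beta$ values are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def chain_coefficients(betas: Sequence[float]) -> Tuple[float, float]:
    """Signal scale and noise variance after applying every step of the chain.

    Each step maps h -> sqrt(1 - beta) * h + sqrt(beta) * eps with fresh,
    independent standard normal eps, so variances of the noise terms add.
    """
    scale, variance = 1.0, 0.0
    for beta in betas:
        scale *= math.sqrt(1.0 - beta)
        variance = (1.0 - beta) * variance + beta
    return scale, variance


def closed_form_coefficients(betas: Sequence[float]) -> Tuple[float, float]:
    """Coefficients of Equation 1: sqrt(alpha_bar_t) and 1 - alpha_bar_t."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar), 1.0 - alpha_bar


example_betas = [0.1, 0.2, 0.3]  # arbitrary values chosen for illustration
chain = chain_coefficients(example_betas)
closed = closed_form_coefficients(example_betas)
```

Both routes give the same scale and variance, which is why $h_{t}$ can be sampled in a single step during training instead of simulating $t$ noising steps.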

At each training step, a random forward diffusion timestep $t$ is sampled from a uniform distribution. Therefore, the noisy hidden state $h_{t}$ is created from $h$ , $t$ , and $\epsilon$ . The diffusion layer $\theta$ consists of a time embedding and a transformer layer. The time embedding receives the diffusion timestep $t$ as input and produces an embedding $e_{t}$ , which is added to $h_{t}$ as input for the transformer layer. Finally, the transformer layer outputs the predicted noise $\epsilon_{\theta}(h_{t},t)$ , and mean square error is used to compute the loss between predicted noise $\epsilon_{\theta}(h_{t},t)$ and actual sampled noise $\epsilon$ .

$L=\mathbb{E}_{t,h,\epsilon}\left[\left\|\epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}}h+\sqrt{1-\bar{\alpha}_{t}}\epsilon,t)\right\|^{2}\right]$ (2)
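A single evaluation of the objective in Equation 2 can be sketched as below. The hidden state is flattened to a list of floats, and `predict_noise` is a hypothetical callable standing in for the transformer-based diffusion layer; `zero_predictor` is a deliberately trivial placeholder. The sketch only computes the loss value and does not perform any parameter update.

```python
import math
import random
from typing import Callable, List, Sequence

NoisePredictor = Callable[[List[float], int], List[float]]


def noise_prediction_loss(
    h: Sequence[float],
    alpha_bars: Sequence[float],
    predict_noise: NoisePredictor,
    rng: random.Random,
) -> float:
    """Mean squared error between sampled and predicted noise for one example."""
    t = rng.randint(1, len(alpha_bars))          # uniform timestep, 1-indexed
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in h]       # noise to be predicted
    h_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(h, eps)]              # Equation 1
    eps_hat = predict_noise(h_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat)) / len(h)


def zero_predictor(h_t: List[float], t: int) -> List[float]:
    # Placeholder predictor that always outputs zero noise.
    return [0.0] * len(h_t)


loss = noise_prediction_loss(
    [1.0, -0.5, 2.0], [0.9, 0.8, 0.7], zero_predictor, random.Random(0)
)
```

A predictor that recovers the sampled noise exactly drives this loss to zero, whereas the zero predictor is left with the squared norm of the noise itself.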

3.2 Inference

Leveraging the diffusion layer’s ability to predict noise at a given timestep $t$, we utilise it as a denoiser during inference by iteratively performing the reverse diffusion steps, which sample from $p_{\theta}(h_{t-1}|h_{t})=\mathcal{N}(h_{t-1};\mu_{\theta}(h_{t},t),\Sigma_{\theta}(h_{t},t))$ to produce the denoised hidden state

$\mu_{\theta}(h_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}\left(h_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{t}\right)$ (3)

$\Sigma_{\theta}(h_{t},t)=\sigma^{2}_{t}\mathcal{I}$ (4)

where $\epsilon_{t}$ is the predicted noise from the diffusion layer and $\sigma^{2}_{t}=\beta_{t}$. The denoised hidden state can thus be computed with

$h_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(h_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{t}\right)+\sigma_{t}z$ (5)

where $z\sim\mathcal{N}(0,\mathcal{I})$.
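Equation 5 translates into a short function. The sketch below operates on a flattened hidden state and takes the predicted noise as an argument, so it is independent of how the diffusion layer is implemented; the name `reverse_step` and the numeric values in the example are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence


def reverse_step(
    h_t: Sequence[float],
    eps_hat: Sequence[float],
    alpha_t: float,
    alpha_bar_t: float,
    sigma_t: float,
    rng: random.Random,
) -> List[float]:
    """One reverse diffusion step h_t -> h_{t-1} following Equation 5."""
    noise_coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [
        (x - noise_coef * e) / math.sqrt(alpha_t) + sigma_t * rng.gauss(0.0, 1.0)
        for x, e in zip(h_t, eps_hat)
    ]


# Sanity check at t = 1, where alpha_bar_1 = alpha_1: with the exact noise and
# sigma set to 0 (for the check only), the step recovers the clean state.
alpha_1 = 0.9
h_clean = [1.0, -2.0]
eps = [0.5, 0.3]
h_1 = [math.sqrt(alpha_1) * x + math.sqrt(1.0 - alpha_1) * e
       for x, e in zip(h_clean, eps)]
recovered = reverse_step(h_1, eps, alpha_1, alpha_1, 0.0, random.Random(0))
```

In the setting described above, `sigma_t` would be $\sqrt{\beta_{t}}$ rather than 0, so each step also injects a small amount of fresh randomness.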

Inference in DiffuseDef combines a one-step noising, a multi-step denoising, and an ensembling step. After the pretrained encoder extracts the hidden state $h$, a set of $k$ Gaussian noise vectors $E=[\epsilon^{0},\epsilon^{1},...,\epsilon^{k}]$ is sampled to perform a single forward diffusion step. These noise vectors $E$ are then added to the hidden state $h$ following Equation 1, resulting in a set of noisy hidden states $H_{t^{\prime}}=[h^{0}_{t^{\prime}},h^{1}_{t^{\prime}},...,h^{k}_{t^{\prime}}]$, where ${t^{\prime}}$ denotes the number of denoising steps. The noisy hidden states $H_{t^{\prime}}$ are subsequently denoised through $t^{\prime}$ reverse diffusion steps, where noise is predicted by the diffusion layer and subtracted from the previous noisy hidden states. Unlike Ho et al. (2020), where the reverse diffusion process starts with pure noise sampled from a standard normal distribution, we assume the noisy hidden state $H_{t^{\prime}}$ is already an intermediate state in the reverse diffusion process. This allows us to use a number of denoising steps $t^{\prime}$ smaller than the training timestep $t$, which prevents the denoised hidden states from diverging substantially from the initial hidden state $h$. This sequence of denoising steps creates the final denoised hidden states $H_{0}=[h^{0}_{0},h^{1}_{0},...,h^{k}_{0}]$, which are averaged and used by the classifier to output the final predicted label. This process is summarised in Algorithm 1.

Data: Input text $\mathbf{x}$
Result: Predicted label $y^{\prime}$
$h\leftarrow Enc(\mathbf{x})$;
Sample $E=[\epsilon^{0},\epsilon^{1},...,\epsilon^{k}]$, $\epsilon\sim\mathcal{N}(0,\mathcal{I})$;
$H_{t^{\prime}}\leftarrow\sqrt{\bar{\alpha}_{1}}h+\sqrt{1-\bar{\alpha}_{1}}E$;
for $i\leftarrow 0$ to $t^{\prime}-1$ do
    $E_{t^{\prime}-i}\leftarrow\epsilon_{\theta}(H_{t^{\prime}-i},{t^{\prime}-i})$;
    $H_{t^{\prime}-i-1}\leftarrow\frac{1}{\sqrt{\alpha_{t^{\prime}-i}}}\left(H_{t^{\prime}-i}-\frac{1-\alpha_{t^{\prime}-i}}{\sqrt{1-\bar{\alpha}_{t^{\prime}-i}}}E_{t^{\prime}-i}\right)+\sigma_{t^{\prime}-i}z$;
end for
$y^{\prime}\leftarrow CLS\left(avg(H_{0})\right)$;

Algorithm 1: Inference of DiffuseDef
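Algorithm 1 can be sketched end to end in a few lines. The version below is an illustration under stated assumptions, not the released implementation: the hidden state is a flat list of floats, `predict_noise` and `classify` are hypothetical callables standing in for the diffusion layer and the classifier, exactly `k` noisy copies are created, and $\sigma_{t}=\sqrt{\beta_{t}}$ noise is added at every reverse step as the algorithm is written.

```python
import math
import random
from typing import Callable, List, Sequence

Vec = List[float]


def diffusedef_infer(
    h: Sequence[float],
    betas: Sequence[float],
    t_prime: int,
    k: int,
    predict_noise: Callable[[Vec, int], Vec],
    classify: Callable[[Vec], str],
    rng: random.Random,
) -> str:
    """One-step noising, t_prime reverse steps, then averaging over k copies."""
    if not 1 <= t_prime <= len(betas):
        raise ValueError("t_prime must lie within the variance schedule")
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    # One forward step with alpha_bar_1, repeated k times with fresh noise.
    a1 = alpha_bars[0]
    states = [
        [math.sqrt(a1) * x + math.sqrt(1.0 - a1) * rng.gauss(0.0, 1.0) for x in h]
        for _ in range(k)
    ]

    # Reverse steps t_prime, t_prime - 1, ..., 1 (Equation 5).
    for t in range(t_prime, 0, -1):
        a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
        sigma_t = math.sqrt(betas[t - 1])
        coef = (1.0 - a_t) / math.sqrt(1.0 - ab_t)
        next_states = []
        for state in states:
            eps_hat = predict_noise(state, t)
            next_states.append([
                (x - coef * e) / math.sqrt(a_t) + sigma_t * rng.gauss(0.0, 1.0)
                for x, e in zip(state, eps_hat)
            ])
        states = next_states

    averaged = [sum(column) / k for column in zip(*states)]
    return classify(averaged)


def zero_noise(state: Vec, t: int) -> Vec:
    return [0.0] * len(state)          # placeholder for the diffusion layer


def sign_classifier(state: Vec) -> str:
    return "pos" if sum(state) > 0 else "neg"   # placeholder classifier


toy_betas = [0.001 * (i + 1) for i in range(5)]  # illustrative schedule
label = diffusedef_infer([100.0] * 4, toy_betas, t_prime=5, k=10,
                         predict_noise=zero_noise, classify=sign_classifier,
                         rng=random.Random(0))
```

Only the loop over `states` is repeated `k` times; the encoder output `h` is computed once, which is the source of the efficiency gain over ensembling whole forward passes.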

4 Experiments

Datasets

We focus on two common NLP tasks in our experiments: text classification and natural language inference (NLI). In the text classification task, we compare our method with other defense algorithms on two standard datasets for adversarial defense: AGNews (Zhang et al., 2015a) and IMDB (Maas et al., 2011a). In the NLI task, we perform an ablation analysis with the Question-answering NLI (QNLI) dataset (Wang et al., 2018). We randomly split the AGNews, IMDB, and QNLI datasets into train, validation, and test splits.

Evaluation

Following previous work on adversarial defense, we use three benchmarking attack methods to evaluate the robustness of DiffuseDef: TextFooler (TF) (Jin et al., 2020), TextBugger (TB) (Li et al., 2019), and BertAttack (BA) (Li et al., 2020). The three attack methods create adversarial examples at different granularities: character-level perturbation (TextBugger), word substitution (TextFooler), and subword substitution (BertAttack). Regarding evaluation metrics, we measure the clean accuracy (Clean%) on the test set, the accuracy under attack (AUA%), and the number of adversarial queries (#Query) needed for a successful attack. Higher scores on the three metrics denote better robustness of a defense method. The accuracy on clean data is measured across the entire test set. Due to the lengthy attacking process, the accuracy under attack and the number of queries are measured on a randomly sampled subset of 1000 examples from the test set. We use the TextAttack library as the adversarial evaluation framework. To ensure a fair comparison and high-quality adversarial examples, we follow the same evaluation constraints as in Li et al. (2021). The evaluation metrics are averaged over experiments run with 5 random seeds.
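To make the two attack-time metrics concrete, the sketch below computes them from per-example attack records. It assumes that AUA% counts examples that are classified correctly and survive the attack, and that #Query is averaged over the examples that were actually attacked (those classified correctly before the attack); the `AttackRecord` structure is hypothetical and does not mirror TextAttack's result classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttackRecord:
    originally_correct: bool   # model was right on the clean text
    attack_succeeded: bool     # attacker found a misclassified variant
    num_queries: int           # classifier queries spent on this example


def accuracy_under_attack(records: List[AttackRecord]) -> float:
    """Percentage of examples still classified correctly after the attack."""
    survived = sum(
        1 for r in records if r.originally_correct and not r.attack_succeeded
    )
    return 100.0 * survived / len(records)


def mean_queries(records: List[AttackRecord]) -> float:
    """Average number of queries over examples that were attacked."""
    attacked = [r.num_queries for r in records if r.originally_correct]
    return sum(attacked) / len(attacked) if attacked else 0.0


records = [
    AttackRecord(True, False, 100),
    AttackRecord(True, True, 50),
    AttackRecord(False, False, 0),   # skipped: already misclassified
    AttackRecord(True, False, 150),
]
aua = accuracy_under_attack(records)
avg_queries = mean_queries(records)
```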

4.1 Comparison to SOTA

We compare our proposed method with state-of-the-art adversarial defense approaches, trained using both BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as backbones:

• Fine-tune: Fine-tuning pretrained models on the downstream task with no defense method applied. This baseline is used to illustrate the effect of adversarial attacks.

• InfoBERT (Wang et al., 2021): Applying mutual-information-based regularizers during fine-tuning of pretrained models to improve robustness.

• FreeLB++ (Li et al., 2021): An adversarial training method improving on FreeLB (Zhu et al., 2020), which adds adversarial perturbations to word embeddings during fine-tuning.

• EarlyRobust (Xi et al., 2022): Extracting early-bird subnetworks and pruning pretrained models for efficient adversarial training. We only run EarlyRobust with BERT, as its implementation with RoBERTa has not been released.

• RSMI (Moon et al., 2023): A two-stage training method that combines randomised smoothing and masked inference to improve adversarial robustness.

4.2 Implementation and Settings

We train two DiffuseDef variants using FreeLB++ and RSMI models as base models, considering their robust adversarial defense capabilities. In the diffusion layer, only one transformer encoder layer (Vaswani et al., 2017) is used. The maximum noising timestep $t$ during training is set to 30 for the AGNews and QNLI datasets, and 10 for the IMDB dataset, while at inference time we only apply $t^{\prime}=5$ denoising steps. Following Ho et al. (2020), we use a linear $\beta_{t}$ schedule from $\beta_{1}=10^{-4}$ to $\beta_{t}=0.02$. The diffusion layer is trained for 100 epochs, with the base classifier parameters frozen for efficiency. During the diffusion training stage, the same train-dev splits are used as in the adversarial training stage, thus ensuring no data leakage. At inference time, the number of ensembles is set to 10. Appendix C lists the hyper-parameters for each dataset.
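The linear variance schedule can be written out explicitly. The sketch below assumes that the schedule is spread evenly over the maximum training timestep (30 in this example), which is one possible reading of the setting above; the function names are illustrative.

```python
from typing import List


def linear_betas(
    num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02
) -> List[float]:
    """Evenly spaced beta_1 ... beta_T from beta_start to beta_end."""
    if num_steps < 1:
        raise ValueError("num_steps must be positive")
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bars(betas: List[float]) -> List[float]:
    """alpha_bar_t = product of (1 - beta_i) for i <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


betas = linear_betas(30)                 # assumed: T = 30 as for AGNews/QNLI
alpha_bars = cumulative_alpha_bars(betas)
signal_kept_at_T = alpha_bars[-1]        # share of signal variance at step T
```

Under this assumption most of the signal is still present at the final training timestep, which is consistent with treating the noised hidden state as an intermediate, rather than pure-noise, starting point for denoising.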

5 Results and Analysis

5.1 Adversarial Robustness

Dataset	PLM	Method	Clean%	AUA% (TF)	AUA% (TB)	AUA% (BA)	#Query (TF)	#Query (TB)	#Query (BA)
AGNews	BERT-base	Fine-Tuned	94.4	10.2	25.4	27.1	348	372	379
		InfoBERT	95.0	35.5	39.1	42.6	377	397	397
		FreeLB++	95.0	54.7	56.5	44.6	426	430	390
		EarlyRobust	94.4	35.6	37.2	45.7	475	516	533
		RSMI	94.3	52.6	56.7	55.4	680	737	687
		DiffuseDef-FreeLB++ (Ours)	94.8	84.5	86.0	84.6	877	972	910
		DiffuseDef-RSMI (Ours)	93.8	82.7	83.3	84.4	894	1029	930
	RoBERTa-base	Fine-Tuned	94.9	34.1	36.9	43.6	372	396	410
		InfoBERT	95.5	40.2	45.2	48.6	392	421	430
		FreeLB++	95.4	57.5	62.9	55.9	444	467	447
		RSMI	93.1	64.2	66.4	67.4	774	861	808
		DiffuseDef-FreeLB++ (Ours)	95.3	85.6	87.6	85.3	880	976	906
		DiffuseDef-RSMI (Ours)	92.9	82.9	83.5	82.2	905	925	1047
IMDB	BERT-base	Fine-Tuned	93.3	7.7	8.3	10.5	540	534	378
		InfoBERT	93.9	29.2	25.4	30.7	642	644	390
		FreeLB++	94.3	44.2	39.6	40.6	784	829	426
		EarlyRobust	92.7	49.7	46.8	43.8	2267	2788	1841
		RSMI	90.9	60.0	54.4	51.1	2840	3455	2070
		DiffuseDef-FreeLB++ (Ours)	94.4	82.1	83.0	84.0	3174	4348	2842
		DiffuseDef-RSMI (Ours)	90.2	80.9	79.8	79.8	3590	4748	2901
	RoBERTa-base	Fine-Tuned	94.6	21.3	17.9	13.6	587	671	493
		InfoBERT	94.8	30.9	27.9	21.8	681	760	549
		FreeLB++	95.3	46.0	42.1	33.9	829	974	637
		RSMI	92.7	77.9	74.3	70.6	3443	4342	2619
		DiffuseDef-FreeLB++ (Ours)	95.0	86.2	85.9	86.8	3573	4663	2941
		DiffuseDef-RSMI (Ours)	92.4	84.7	84.1	84.3	3673	4782	3007

Table 1: Main adversarial robustness results on classification tasks with BERT and RoBERTa PLMs. Clean: accuracy on clean test set. TF: TextFooler. TB: TextBugger. BA: BertAttack.

In Table 1, we compare the adversarial robustness of DiffuseDef with baselines and SOTA methods on the AGNews and IMDB datasets, trained with BERT and RoBERTa. DiffuseDef consistently outperforms all other methods on both datasets across both PLMs, exhibiting substantial improvements in accuracy under attack. After applying diffusion training, the AUA score for both FreeLB++ and RSMI models improves significantly, with an average increase of roughly 30 percentage points against the three attack methods. Note that despite the robust adversarial performance of the RSMI model, especially when trained with RoBERTa on the IMDB dataset, it still benefits from DiffuseDef. Compared with its base model (i.e. FreeLB++ or RSMI), DiffuseDef shows only a minor decline in clean accuracy, at most 0.7 points in Table 1, which indicates that it can preserve the clean text performance while improving adversarial robustness. Moreover, models trained with DiffuseDef show a much smaller gap between clean accuracy and accuracy under attack, which can be reduced to less than 10 percentage points.

Another benefit of DiffuseDef is the increased number of adversarial queries needed to obtain a successful attack. Models applying DiffuseDef require over twice the number of queries on both datasets compared to the adversarial training-based methods. This increase is even larger on the IMDB dataset due to the longer text length. For example, a DiffuseDef model requires on average around 3000 queries or more to achieve a successful attack, while FreeLB++ only needs 400 to 1000 queries. The substantial increase suggests that even if attackers manage to construct a successful adversarial attack, they need considerably more time to find the attack on DiffuseDef than on other models, affirming the improved robustness from diffusion training. In addition, we observe that the number of queries for denoising-based methods (i.e. RSMI, DiffuseDef) is generally higher than for adversarial training-based methods (i.e. InfoBERT, FreeLB++). This is because denoising-based methods transform the hidden representations of the adversarial texts into a non-deterministic representation. The introduction of randomness in hidden states results in uncertainty in model logits, thus increasing the difficulty of finding vulnerable words to attack (Wang et al., 2023).

5.2 Ablation - NLI Task

To understand how each component contributes to DiffuseDef, we conduct an ablation analysis on the QNLI dataset (Table 2). Compared to the fine-tuning baseline, FreeLB++ increases the AUA score from 21.5 to 45.6, showing the benefit of adversarial training. After applying diffusion training (with inference timestep $t^{\prime}=30$), the score is further improved to 49.2, showing that diffusion training complements adversarial training. Finally, ensembling enhances adversarial performance and improves the score to 66.7, with the number of queries growing from 392 to 485. Similar improvements in both AUA and number of queries are found with the RSMI model after applying diffusion training and ensembling, which validates that the two components are complementary and that DiffuseDef is compatible with multiple SOTA defense methods.

Method	Clean%	AUA%	#Query
Fine-Tuned (BERT)	90.8	21.5	195
FreeLB++	90.3	45.6	253
+ diffusion training	90.2	49.2	392
+ ensembling	90.3	66.7	485
RSMI	87.4	35.2	314
+ diffusion training	86.5	40.0	353
+ ensembling	86.4	55.5	459

Table 2: Ablation results for DiffuseDef on the QNLI dataset. AUA% and #Query are measured under TextFooler attack.

5.3 Robustness w.r.t Token Length

Figure 2 provides a comparison of the defense rate of different models by token length on the IMDB dataset. The defense rate is calculated as the percentage of test examples for which TextFooler fails to construct a successful attack. All models except RSMI show a consistent trend: the defense rate declines as the texts lengthen. This trend can be attributed to the nature of adversarial attacks, as longer texts allow for the generation of more adversarial examples. Specifically, adversarial training defense methods like InfoBERT and FreeLB++ show poor performance on longer texts (more than 300 tokens), with the defense rate reduced to near 0. This drastic decline indicates that, given an adequate number of queries, the attacker is almost certain to find a successful attack to fool these models. Similarly, EarlyRobust exhibits a performance drop on long texts as it is based on FreeLB training. RSMI, however, performs worse on short texts, but its defense rate increases as the text length grows. Compared to all SOTA defense approaches, the two DiffuseDef variants show a more gradual decline and maintain a higher defense rate across all token lengths, i.e. DiffuseDef is more robust to input text length.
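The defense rate per length bucket can be computed as in the sketch below. The bucket width of 100 tokens and the input format, a list of (token length, attack failed) pairs, are assumptions made for illustration; the binning used for Figure 2 is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def defense_rate_by_length(
    examples: Iterable[Tuple[int, bool]], bucket_size: int = 100
) -> Dict[int, float]:
    """Map each bucket start (in tokens) to the percentage of failed attacks."""
    totals: Dict[int, int] = defaultdict(int)
    defended: Dict[int, int] = defaultdict(int)
    for token_length, attack_failed in examples:
        bucket = (token_length // bucket_size) * bucket_size
        totals[bucket] += 1
        defended[bucket] += int(attack_failed)
    return {b: 100.0 * defended[b] / totals[b] for b in sorted(totals)}


# Made-up records, used only to show the shape of the input and output.
toy_examples = [(50, True), (80, False), (250, False), (120, True)]
rates = defense_rate_by_length(toy_examples)
```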

5.4 Effect of Additional Denoising Steps

In Figure 3 we study how the number of inference denoising steps $t^{\prime}$ affects adversarial performance. For the DiffuseDef model without ensembling, both the AUA score and the number of queries required to attack increase with the number of denoising steps. As $t^{\prime}$ grows from 1 to 30, the AUA score improves from 58 to 65 while the number of attack queries grows from 430 to 780. In contrast, for DiffuseDef with ensembling, the model maintains a stable but robust performance in AUA and number of queries, regardless of the increase of $t^{\prime}$. Considering that ensembling introduces a notable performance increase, the DiffuseDef model is likely to be hitting an upper bound in both metrics, so no further improvement is obtained by increasing the number of denoising steps. However, it also shows that with ensembling, DiffuseDef can be applied with a smaller $t^{\prime}$ for better efficiency while maintaining a robust adversarial performance.

5.5 Ensembling Diffused Hidden Representations

In DiffuseDef the text hidden state is diffused and ensembled to form a denoised hidden representation, which contributes significantly to the improved adversarial robustness. In this section, we study how the ensembling diffused hidden representation helps defend against adversarial attacks.

As mentioned in Section 2.1, attack methods first need to rank tokens by their influence on the prediction. Specifically, the importance score is calculated by comparing the change in model prediction probabilities after removing each word. In Figure 4, we compare the distribution of the max token importance score between FreeLB++ and its DiffuseDef counterpart. Both FreeLB++ and DiffuseDef show a long-tail distribution, with over 80 percent of examples having a max token importance score below 0.1. This suggests that in most cases changing one single token will not significantly alter the prediction for either model. However, DiffuseDef shows a notably lower percentage of examples with a max importance score between 0.9 and 1, where the attacker can easily find the vulnerable token to construct adversarial examples. This difference shows that DiffuseDef can complicate the search for important words, which accounts for the increased number of queries required for a successful attack.

Method	L2	Cosine
FreeLB++	12.53	0.35
DiffuseDef-FreeLB++	10.66	0.27
RSMI	9.72	0.24
DiffuseDef-RSMI	8.61	0.21

Table 3: L2 and cosine distance between hidden states for clean and adversarial texts.

In addition, DiffuseDef mitigates the difference between clean and adversarial texts by reducing the distance between their hidden states. In Table 3, we report the L2 and cosine distance between clean and adversarial hidden states for FreeLB++ and RSMI. Both show lower L2 and cosine distance after applying DiffuseDef, indicating that ensembling diffused representation repositions the adversarial example closer to the clean example, leading to the model maintaining its predictions.
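The two distances in Table 3 can be computed as in the sketch below. It assumes that each text is represented by a single pooled hidden vector and that cosine distance is defined as one minus cosine similarity; both choices are assumptions, since the pooling and the exact definition are not stated here.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two hidden vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine similarity of two non-zero hidden vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for clean and adversarial hidden states.
clean_state = [0.0, 0.0]
adversarial_state = [3.0, 4.0]
distance = l2_distance(clean_state, adversarial_state)
```

A defense that pulls the adversarial representation back towards the clean one should lower both values, which is the pattern reported in Table 3.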

5.6 Efficiency of DiffuseDef

Method	Params	FLOPS
Fine-Tuned (BERT)	110M	46G
EarlyRobust	82M	32G
FreeLB++	110M	46G
InfoBERT	110M	46G
RSMI	110M	92G
RanMask ( $k=10$ )	110M	459G
SAFER ( $k=10$ )	110M	459G
DiffuseDef ( $t^{\prime}=1,k=10$ )	120M	96G
DiffuseDef ( $t^{\prime}=5,k=10$ )	120M	267G

Table 4: Efficiency comparison of DiffuseDef-FreeLB++ with other methods. Params: number of model parameters. FLOPS: number of floating point operations at inference time, calculated with a batch size of 1 and a sequence length of 256.

Given that DiffuseDef adds denoising and ensembling steps during inference, it inevitably increases the computation time compared to its base model. To study its efficiency, we report the number of model parameters and inference FLOPS in Table 4. In addition to the defense methods in Table 1, we also compare the efficiency of DiffuseDef with two other SOTA ensembling-based defense methods, i.e. RanMask (Zeng et al., 2023) and SAFER (Ye et al., 2020).

All SOTA models have the same number of parameters as the fine-tuned BERT model, except EarlyRobust, which applies attention head pruning for better efficiency. DiffuseDef, with 1 additional diffusion layer, increases the number of parameters from 110M to 120M. DiffuseDef requires more inference FLOPS than non-ensembling-based baselines such as FreeLB++ and EarlyRobust. With $t^{\prime}=1$ and $k=10$, the FLOPS for DiffuseDef roughly double from 46G to 96G. Nevertheless, this number is close to that of the RSMI model (92G FLOPS), which requires gradient information during inference. Despite this increase, DiffuseDef is more efficient than ensembling-based methods like RanMask and SAFER, which need a full forward pass for every ensemble member. With the same ensembling number of 10, both RanMask and SAFER require 459G FLOPS, which is 10x the number for the BERT baseline. In contrast, even with $t^{\prime}$ increased to 5, DiffuseDef requires only 267G FLOPS, showing that it can mitigate the efficiency problem of ensembling while maintaining the benefit of improved robustness.
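The relative costs discussed above follow directly from the numbers in Table 4. The short sketch below reproduces the arithmetic; the dictionary simply copies the reported values (in GFLOPs), and `relative_cost` is an illustrative helper.

```python
from typing import Dict

# Inference cost in GFLOPs as reported in Table 4.
FLOPS_G: Dict[str, float] = {
    "Fine-Tuned (BERT)": 46,
    "RSMI": 92,
    "RanMask (k=10)": 459,
    "SAFER (k=10)": 459,
    "DiffuseDef (t'=1, k=10)": 96,
    "DiffuseDef (t'=5, k=10)": 267,
}


def relative_cost(method: str, baseline: str = "Fine-Tuned (BERT)") -> float:
    """Inference cost of a method as a multiple of the baseline cost."""
    return FLOPS_G[method] / FLOPS_G[baseline]


ratios = {name: round(relative_cost(name), 2) for name in FLOPS_G}
saving_vs_ranmask = 1.0 - FLOPS_G["DiffuseDef (t'=5, k=10)"] / FLOPS_G["RanMask (k=10)"]
```

With these values, full-model ensembling costs close to ten times the baseline, whereas DiffuseDef with five denoising steps stays below six times the baseline.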

6 Conclusions

We propose a novel adversarial defense method, DiffuseDef, which combines adversarial training, diffusion training, and ensembling to improve model robustness to adversarial attacks. DiffuseDef can build on any existing adversarial training method, training an additional diffusion layer to predict and remove randomly sampled noise at a given timestep. During inference, the diffusion layer is used to denoise the adversarial hidden states, which are ensembled to construct a robust text representation. Our experiments validate the effectiveness and efficiency of DiffuseDef, which significantly outperforms SOTA on three common adversarial attack methods. Analysis shows that DiffuseDef makes it difficult to find vulnerable tokens to attack, and also reduces the difference between the hidden representations of clean and adversarial texts.

7 Limitations

Scope

Our experiments focus on defending against three common black-box adversarial attack methods; whether DiffuseDef improves model robustness against white-box attacks remains unclear. White-box attacks have access to model parameters and can utilize gradient information to construct adversarial examples more efficiently than black-box attacks. Defending against white-box attacks is more challenging, and we consider this a future direction for DiffuseDef.

Comparison with additional approaches

Due to the length limit, we do not compare against all current approaches. However, we do compare with the SOTA methods that showed the best adversarial robustness in our preliminary experiments.

Efficiency

Although DiffuseDef is more efficient than existing ensembling-based methods, it still requires more model parameters and inference FLOPS than non-ensembling-based models to achieve better robustness. Future directions of this work might involve efforts to reduce the size of the diffusion layer and the number of ensembles to make DiffuseDef more efficient.

8 Ethical Considerations

In this paper, we propose a new method, DiffuseDef, which uses a diffusion layer as a denoiser to provide robust and efficient text representation. We demonstrate that the proposed method could significantly improve the robustness of NLP systems to adversarial attacks. However, DiffuseDef cannot defend against all adversarial attacks when the attacks are unconstrained (e.g. in the number of perturbed words or the semantic similarity between original and adversarial examples). Potential risks might include the creation of new adversarial attacks devised specifically for DiffuseDef.

References

Alzantot et al. (2018) Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics.
Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium. Association for Computational Linguistics.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ebrahimi et al. (2018) Javid Ebrahimi, Daniel Lowd, and Dejing Dou. 2018. On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 653–663, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Gan and Ng (2019) Wee Chung Gan and Hwee Tou Ng. 2019. Improving the robustness of question answering systems to question paraphrasing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6065–6075, Florence, Italy. Association for Computational Linguistics.
Gao et al. (2018) J. Gao, J. Lanchantin, M. L. Soffa, and Y. Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56.
Gao et al. (2023) SongYang Gao, Shihan Dou, Yan Liu, Xiao Wang, Qi Zhang, Zhongyu Wei, Jin Ma, and Ying Shan. 2023. DSRM: Boost textual adversarial training with distribution shift risk minimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12177–12189, Toronto, Canada. Association for Computational Linguistics.
Garg and Ramakrishnan (2020) Siddhant Garg and Goutham Ramakrishnan. 2020. BAE: BERT-based adversarial examples for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6174–6181, Online. Association for Computational Linguistics.
Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc.
Huang and Chang (2021) Kuan-Hao Huang and Kai-Wei Chang. 2021. Generating syntactically controlled paraphrases without using annotated parallel pairs. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1022–1033, Online. Association for Computational Linguistics.
Jin et al. (2020) Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8018–8025.
Li et al. (2019) Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2019. TextBugger: Generating adversarial text against real-world applications. In 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019. The Internet Society.
Li et al. (2020) Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020. BERT-ATTACK: Adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6193–6202, Online. Association for Computational Linguistics.
Li et al. (2023) Linyang Li, Demin Song, and Xipeng Qiu. 2023. Text adversarial purification as defense against adversarial attacks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 338–350, Toronto, Canada. Association for Computational Linguistics.
Li et al. (2021) Zongyi Li, Jianhan Xu, Jiehang Zeng, Linyang Li, Xiaoqing Zheng, Qi Zhang, Kai-Wei Chang, and Cho-Jui Hsieh. 2021. Searching for an effective defender: Benchmarking defense against adversarial word substitution. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3137–3147, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Liu et al. (2022) Aiwei Liu, Honghai Yu, Xuming Hu, Shu’ang Li, Li Lin, Fukun Ma, Yawen Yang, and Lijie Wen. 2022. Character-level white-box adversarial attacks against transformers via attachable subwords substitution. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7664–7676, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Preprint, arXiv:1907.11692.
Maas et al. (2011a) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011a. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
Maas et al. (2011b) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011b. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
Moon et al. (2023) Han Cheol Moon, Shafiq Joty, Ruochen Zhao, Megh Thakkar, and Chi Xu. 2023. Randomized smoothing with masked inference for adversarially robust text classifications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5145–5165, Toronto, Canada. Association for Computational Linguistics.
Morris et al. (2020) John Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. 2020. TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126, Online. Association for Computational Linguistics.
Nguyen Minh and Luu (2022) Dang Nguyen Minh and Anh Tuan Luu. 2022. Textual manifold-based defense against natural language adversarial examples. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6612–6625, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Nie et al. (2022) Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. 2022. Diffusion models for adversarial purification. In International Conference on Machine Learning (ICML).
Ren et al. (2019) Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, Florence, Italy. Association for Computational Linguistics.
Si et al. (2021) Chenglei Si, Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Qun Liu, and Maosong Sun. 2021. Better robustness by more coverage: Adversarial and mixup data augmentation for robust finetuning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1569–1576, Online. Association for Computational Linguistics.
Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, Lille, France. PMLR.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
Wang et al. (2021) Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, Ruoxi Jia, Bo Li, and Jingjing Liu. 2021. InfoBERT: Improving robustness of language models from an information theoretic perspective. In International Conference on Learning Representations.
Wang et al. (2023) Zhaoyang Wang, Zhiyue Liu, Xiaopeng Zheng, Qinliang Su, and Jiahai Wang. 2023. RMLM: A flexible defense framework for proactively mitigating word-level adversarial attacks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2757–2774, Toronto, Canada. Association for Computational Linguistics.
Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Xi et al. (2022) Zhiheng Xi, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. 2022. Efficient adversarial training with robust early-bird tickets. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8318–8331, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Xu et al. (2021) Ying Xu, Xu Zhong, Antonio Jimeno Yepes, and Jey Han Lau. 2021. Grey-box adversarial attack and defence for sentiment classification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4078–4087, Online. Association for Computational Linguistics.
Ye et al. (2020) Mao Ye, Chengyue Gong, and Qiang Liu. 2020. SAFER: A structure-free approach for certified robustness to adversarial word substitutions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3465–3475, Online. Association for Computational Linguistics.
Yoo and Qi (2021) Jin Yong Yoo and Yanjun Qi. 2021. Towards improving adversarial training of NLP models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 945–956, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Yuan et al. (2023) Lifan Yuan, YiChi Zhang, Yangyi Chen, and Wei Wei. 2023. Bridge the gap between CV and NLP! a gradient-based textual adversarial attack framework. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7132–7146, Toronto, Canada. Association for Computational Linguistics.
Yuan et al. (2024) Shilong Yuan, Wei Yuan, Hongzhi Yin, and Tieke He. 2024. ROIC-DM: Robust text inference and classification via diffusion model. Preprint, arXiv:2401.03514.
Zang et al. (2020) Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. 2020. Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6066–6080, Online. Association for Computational Linguistics.
Zeng et al. (2023) Jiehang Zeng, Jianhan Xu, Xiaoqing Zheng, and Xuanjing Huang. 2023. Certified robustness to text adversarial attacks by randomized [MASK]. Computational Linguistics, 49(2):395–427.
Zhang et al. (2015a) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015a. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 649–657, Cambridge, MA, USA. MIT Press.
Zhang et al. (2015b) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015b. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
Zhou et al. (2021) Yi Zhou, Xiaoqing Zheng, Cho-Jui Hsieh, Kai-Wei Chang, and Xuanjing Huang. 2021. Defense against synonym substitution-based adversarial attacks via Dirichlet neighborhood ensemble. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5482–5492, Online. Association for Computational Linguistics.
Zhu et al. (2020) Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.

Appendix A Data Preparation

Dataset	Train	Valid	Test	Avg Len
AGNews	108K	12K	7K	51.3
IMDB	40K	5K	5K	311.9
QNLI	94K	10K	5K	47.2

Table 5: Dataset statistics. The average text length is counted with BertTokenizer.

Table 5 presents the number of examples in the train/valid/test splits and the average token length for the three datasets used in the experiments. For the QNLI and AGNews datasets, we randomly split the training set into our train/valid splits with a ratio of 0.9/0.1, and use each dataset's test set as our test split. For the IMDB dataset, we randomly split the dataset into train/valid/test splits with a ratio of 0.8/0.1/0.1. All train/valid/test splitting is performed using a random seed of 42.
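The splitting procedure described above can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the helper `random_split` is hypothetical, it uses Python's standard `random` module rather than whatever library was actually used, and the example sizes are the rounded totals implied by Table 5. Only the ratios and the seed of 42 come from the text.

```python
import random
from typing import List, Sequence


def random_split(n_examples: int, ratios: Sequence[float], seed: int = 42) -> List[List[int]]:
    """Shuffle example indices with a fixed seed and cut them by the given ratios.

    Hypothetical helper: the last split takes the remainder so that every
    index is assigned to exactly one split.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)

    splits: List[List[int]] = []
    start = 0
    for i, ratio in enumerate(ratios):
        is_last = i == len(ratios) - 1
        end = n_examples if is_last else start + round(n_examples * ratio)
        splits.append(indices[start:end])
        start = end
    return splits


if __name__ == "__main__":
    # AGNews / QNLI style: training set -> train/valid at 0.9/0.1.
    train, valid = random_split(120_000, (0.9, 0.1))
    print(len(train), len(valid))

    # IMDB style: full dataset -> train/valid/test at 0.8/0.1/0.1.
    train, valid, test = random_split(50_000, (0.8, 0.1, 0.1))
    print(len(train), len(valid), len(test))
```

Fixing the seed makes the split reproducible across runs, but the resulting partition still depends on the shuffling implementation, so a different library would yield different splits even with the same seed.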

Appendix B Evaluation Constraints

Dataset	$\mathbf{\varepsilon_{min}}$	$\mathbf{K_{max}}$	$\mathbf{\rho_{max}}$
AGNews	0.84	50	0.3
IMDB	0.84	50	0.1
QNLI	0.84	50	0.2

Table 6: Evaluation parameters for each dataset.

When evaluating with adversarial attacks, we follow the parameter settings for TextAttack suggested by Li et al. (2021). The minimum semantic similarity $\mathbf{\varepsilon_{min}}$ between the clean text and adversarial text is set to 0.84, with the score computed using Universal Sentence Encoder (Cer et al., 2018). The maximum number of candidate substitutions $\mathbf{K_{max}}$ from the attacker is 50, thus the maximum number of queries is $\mathbf{Q_{max}}=\mathbf{K_{max}}\times\mathbf{L}$ , where $\mathbf{L}$ is the number of tokens. Finally, the maximum percentage of changed tokens $\mathbf{\rho_{max}}$ is set to 0.3/0.1/0.2 for the AGNews, IMDB, and QNLI datasets, respectively.
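To make these constraints concrete, the sketch below encodes the values from Table 6 and the query budget $\mathbf{Q_{max}}=\mathbf{K_{max}}\times\mathbf{L}$. It is a standalone illustration and does not use the TextAttack API: the class `EvalConstraints`, its method names, and the choice of inclusive thresholds are assumptions made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConstraints:
    """Per-dataset attack constraints (values from Table 6)."""

    eps_min: float  # minimum semantic similarity between clean and adversarial text
    k_max: int      # maximum number of candidate substitutions per token
    rho_max: float  # maximum fraction of tokens that may be changed

    def max_queries(self, num_tokens: int) -> int:
        # Q_max = K_max * L, where L is the number of tokens.
        return self.k_max * num_tokens

    def is_valid(self, similarity: float, num_changed: int, num_tokens: int) -> bool:
        # Assumption: both thresholds are treated as inclusive.
        return (
            similarity >= self.eps_min
            and num_changed <= self.rho_max * num_tokens
        )


CONSTRAINTS = {
    "AGNews": EvalConstraints(eps_min=0.84, k_max=50, rho_max=0.3),
    "IMDB": EvalConstraints(eps_min=0.84, k_max=50, rho_max=0.1),
    "QNLI": EvalConstraints(eps_min=0.84, k_max=50, rho_max=0.2),
}

if __name__ == "__main__":
    # A 100-token input allows at most 50 * 100 queries on every dataset.
    print(CONSTRAINTS["IMDB"].max_queries(100))
    # Changing 10 of 50 tokens is within budget for AGNews but not for IMDB.
    print(CONSTRAINTS["AGNews"].is_valid(0.90, 10, 50))
    print(CONSTRAINTS["IMDB"].is_valid(0.90, 10, 50))
```

Because the query budget scales with text length, long IMDB reviews allow many more queries than short AGNews items, while their stricter $\mathbf{\rho_{max}}$ limits the fraction of tokens that can be perturbed.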

Appendix C Training

	AGNews	IMDB	QNLI
Epochs	100	100	100
Batch size	64	64	64
Sequence len	128	256	256
Dropout	0.1	0.1	0.1
Optimizer	AdamW	AdamW	AdamW
Lr	2e-5	2e-5	2e-5
$t$	30	10	30
$t^{\prime}$	5	5	5
$k$	10	10	10

Table 7: Hyperparameters for training DiffuseDef.

The hyperparameters for diffusion training can be found in Table 7. All models are trained on a single RTX A6000 GPU. The diffusion training of 100 epochs takes 6/4/3 hours on the AGNews, IMDB, and QNLI datasets, respectively.

Appendix D License for Scientific Artifacts

Artifact	License
AGNews (Zhang et al., 2015b)	Custom (non-commercial)
IMDB (Maas et al., 2011b)	-
QNLI (Wang et al., 2018)	CC BY-SA 4.0
transformers (Wolf et al., 2020)	Apache License 2.0
TextAttack (Morris et al., 2020)	MIT License
BERT (Devlin et al., 2019)	Apache License 2.0
RoBERTa (Liu et al., 2019)	MIT License

Table 8: Licenses of scientific artifacts used in this paper.

Table 8 lists the scientific artifacts, including data, code, and models, used in this paper. The use of these artifacts in this paper is consistent with their intended use, i.e. for scientific research only. The data used in the experiments is in English and does not contain personally identifying information or offensive content.

Appendix E Example of noising and denoising in DiffuseDef

Adding noise to and removing noise from hidden states are essential steps in DiffuseDef that contribute to the improved adversarial robustness. To study how adding or removing noise affects the semantic meaning of the text, we feed the hidden states to the pretrained BERT model with a masked language modeling (MLM) head to generate the text output.

In Table 9, we present the MLM outputs from hidden states with different numbers of noising steps applied, and the MLM outputs from noised hidden states denoised for the same number of steps. In the example shown, as more noise is added, some semantic information is lost and replaced by symbols or function words like "." or "the". In contrast, denoising for the same number of steps helps alleviate such information loss. For example, the word "IBM" can be recovered from the noise.

However, in practice it is not possible to assume the number of denoising steps; therefore, in Table 10 we show the MLM outputs of hidden states denoised directly from clean and adversarial texts. On clean text, we observe that a higher number of denoising steps can result in more abstraction of the text. For example, more words are replaced with "the" in the MLM outputs as $t^{\prime}$ grows. However, words related to the topic (e.g. "Manchester United", "Liverpool") are kept during the denoising process, so the model can still predict correctly. A similar trend of abstraction can be found on adversarial text, where we also observe that denoising helps remove the adversarial perturbation and recover the word "united" from "nations", resulting in a correct prediction on the adversarial text.
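The add-noise, remove-noise, and ensemble steps discussed in this appendix can be illustrated with a toy example. The sketch below uses the closed-form forward process of Ho et al. (2020) with a linear noise schedule. It is not the DiffuseDef implementation: the schedule endpoints, all function names, and the use of the true sampled noise in place of a trained diffusion layer are assumptions made for illustration, and it inverts the noise in a single step rather than iteratively.

```python
import math
import random
from typing import List, Sequence, Tuple


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linearly spaced noise variances (endpoints are assumed, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: Sequence[float]) -> List[float]:
    """alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(h0: Sequence[float], alpha_bar_t: float, rng: random.Random) -> Tuple[List[float], List[float]]:
    """Forward process: h_t = sqrt(alpha_bar_t) * h_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in h0]
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * e for x, e in zip(h0, eps)], eps


def remove_noise(h_t: Sequence[float], predicted_eps: Sequence[float], alpha_bar_t: float) -> List[float]:
    """Invert the forward process given a noise estimate (one-shot, not iterative)."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - s * e) / a for x, e in zip(h_t, predicted_eps)]


def ensemble(states: Sequence[Sequence[float]]) -> List[float]:
    """Average k denoised hidden states element-wise."""
    k = len(states)
    return [sum(values) / k for values in zip(*states)]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(30))
    h0 = [0.5, -1.0, 2.0]  # toy stand-in for a hidden state
    t_prime, k = 5, 10
    denoised = []
    for _ in range(k):
        h_t, eps = add_noise(h0, alpha_bars[t_prime - 1], rng)
        # A trained diffusion layer would predict eps; the true noise is used here.
        denoised.append(remove_noise(h_t, eps, alpha_bars[t_prime - 1]))
    print(ensemble(denoised))
```

With a perfect noise estimate the original vector is recovered up to floating-point error. In a trained model the estimate is imperfect, which is consistent with the partial recovery and abstraction visible in the MLM outputs.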

$\mathbf{t^{\prime}}$	MLM Output (add noise)	MLM Output (add noise then denoise)
0	IBM Chips May Someday Heal Themselves New technology applies electrical fuses to help identify and repair faults.	-
5	the ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults.	the ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults.
6	) ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults.	the ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults.
7	the. chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults.	the. chips may someday heal themselves new technology uses electrical fuses to help identify and repair faults.
8	).. may someday heal themselves new technology introduces electrical fuses to help identify and repair faults.	the ibm. may someday heal themselves new technology uses electrical fuses to help identify and repair faults.
9	the ibm chips may someday heal themselves new technology uses electrical fuses to help identify and repair faults.	the ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults.
10	the. chips may someday heal themselves new technology extends electrical fuses to help identify and repair faults.	the ibm. may someday heal themselves new technology develops electrical fuses to help identify and repair faults.

Table 9: MLM outputs from hidden states with noise added and hidden states with first noise added but then denoised. We only report $t^{\prime}$ above 5 as the MLM outputs with smaller $t^{\prime}$ are identical to the clean text.

$\mathbf{t^{\prime}}$	Clean Text / MLM Output	Adv Text / MLM Output	Pred clean	Pred adv
0	United Apology over Website Abuse Manchester United have been forced to issue an embarrassing apology to Liverpool for an ill-advised attack on the Anfield outfit on its own website.	United Apology over Website Abuse Manchester Nations have been forced to issue an embarrassing apology to Liverpool for an ill-advised attack on the Anfield outfit on its own website.	Sports	World
1	football. apology over website abuse manchester united have been - to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website.	the. apology over website abuse manchester nations have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website.	Sports	World
2	the. apology over website abuse manchester united have been - to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website.	the. apology over website abuse manchester nations have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website.	Sports	World
3	the. apology over website abuse manchester united have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website.	the. apology over website abuse manchester s have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website.	Sports	Sports
4	the. apology over website abuse manchester united have the - to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website.	the. apology over website abuse manchester s have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website.	Sports	Sports
5	the. apology over website abuse manchester united have the the to issue an a apology to liverpool for an the - advised attack on the anfield outfit on its own website.	the. apology over website abuse manchester united have been the to issue an the apology to liverpool for an’- advised attack on the anfield outfit on its own website.	Sports	Sports

Table 10: MLM outputs and FreeLB++ model predictions from ensembling diffused hidden states at different denoising steps.

Appendix F Confusion Matrix under Attack

Figures 5 and 6 present the confusion matrices of model predictions on clean text and on adversarial text (successful attack examples) on the AGNews and IMDB test sets, respectively.