DiffuseDef: Improved Robustness to Adversarial Attacks

Zhenhao Li, Marek Rei, Lucia Specia
Language and Multimodal AI (LAMA) Lab, Imperial College London
{zhenhao.li18, marek.rei, l.specia}@imperial.ac.uk
Abstract

Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef (https://github.com/Nickeilf/DiffuseDef), which incorporates a diffusion layer as a denoiser between the encoder and the classifier. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over different existing adversarial defense methods and achieves state-of-the-art performance against common adversarial attacks.


1 Introduction

Pretrained language models (PLMs) have significantly advanced the performance of various natural language processing (NLP) tasks. Despite such improvements, current NLP systems remain susceptible to adversarial attacks, where carefully crafted text perturbations can lead to incorrect model outputs (Alzantot et al., 2018; Jin et al., 2020; Li et al., 2020). To improve robustness to adversarial attacks, various defense methods have been proposed, such as adversarial training (Zhu et al., 2020; Si et al., 2021; Zhou et al., 2021; Xi et al., 2022), text denoising (Nguyen Minh and Luu, 2022; Wang et al., 2023), and ensembling (Zhou et al., 2021; Zeng et al., 2023; Li et al., 2023). However, existing defense methods either assume that the test-time perturbation/attack set is similar to that used in training (Li et al., 2021), are limited to specific architectures (Xi et al., 2022), or require large computational cost at inference time, thereby limiting their practical applicability.

Diffusion models are commonly used in computer vision (CV) to generate high-quality images by predicting and removing noise from a sampled noisy image. Therefore, they can be adopted to remove noise from adversarial images and thus improve robustness to attacks (Nie et al., 2022). However, in NLP very limited research has investigated adversarial defense with diffusion models due to the discrete and contextual nature of text data. Li et al. (2023) adopt the idea of iterative denoising and reconstruct adversarial texts from masked texts, while Yuan et al. (2024) use a diffusion model as a classifier and perform reverse diffusion steps on the label vector, conditioning on the input text. Inspired by the general noise prediction and reduction capability of diffusion models, we propose DiffuseDef, a novel adversarial defense method which employs diffusion training to denoise hidden representations of adversarial texts. Unlike Li et al. (2023) and Yuan et al. (2024), which apply diffusion on texts or labels, DiffuseDef directly removes noise from the hidden states, providing a more effective and robust text representation to defend against adversarial texts. Compared to diffusion-based defense in CV (Nie et al., 2022), DiffuseDef further enhances robustness with ensembling and improves efficiency with fewer diffusion steps.

DiffuseDef combines adversarial training with diffusion training, where the diffusion layer is trained to predict randomly sampled noise at a given timestep. During inference, the diffusion layer serves as a denoiser, iteratively removing noise from adversarial hidden states to yield a robust hidden representation. Moreover, we adopt an ensembling strategy: we first add random noise to the text hidden states to create multiple variants and then denoise them via the diffusion layer. The model output is produced from the average of all denoised hidden states. Since ensembling happens solely at the diffusion layer, DiffuseDef is more efficient than traditional ensembling-based methods (Ye et al., 2020; Zeng et al., 2023), which require a full forward pass through all model parameters.

Through systematic experimentation, we demonstrate that DiffuseDef outperforms strong defense methods and is able to defend against multiple types of adversarial attacks, while preserving performance on clean texts. Our analysis also reveals that ensembling diffused representations provides a stronger defense against finding vulnerable words to attack and can reduce the distance in latent space between adversarial texts and their clean text counterparts.

Our contributions can be summarised as follows:

  • We propose DiffuseDef, a novel and flexible adversarial defense method that can be added on top of any existing adversarial defense methods to further improve robustness to adversarial attacks.

  • DiffuseDef outperforms existing adversarial defense methods and achieves state-of-the-art performance against prevalent adversarial attacks.

  • Through extensive analysis, we demonstrate the effectiveness of the ensembling diffused representation and the efficiency of DiffuseDef compared to existing ensembling-based methods.

2 Related Work

2.1 Textual Adversarial Attacks

Textual adversarial attacks focus on constructing adversarial examples from an original text that maximise the likelihood of incorrect predictions by a neural network. These attacks require adversarial examples to be perceptually similar to the original text, which is typically achieved by introducing subtle perturbations to the original text, such as character swapping (Gao et al., 2018; Ebrahimi et al., 2018), synonym-substitutions (Ren et al., 2019; Yoo and Qi, 2021), and paraphrasing (Gan and Ng, 2019; Huang and Chang, 2021). Taking the text classification task as an example, given a classifier $\mathcal{C}(\mathbf{x})$ that maps an input sequence of words $\mathbf{x}=[w_1,w_2,\dots,w_L]$ to its designated label $y$, the goal of the attack model is to construct an adversarial example $\mathbf{x}'=\mathbf{x}+\delta$ to fool the classifier, where $\delta$ is a subtle adversarial perturbation constrained by $\|\delta\|<\omega$. The adversarial example $\mathbf{x}'$ is considered a successful attack if it leads to an incorrect prediction $\mathcal{C}(\mathbf{x}')\neq y$. The attacker can iteratively generate multiple adversarial examples and query the classifier to obtain a successful attack, whereas the classifier must consistently return the correct prediction within a specified number of query attempts to be considered robust.

Common textual adversarial attack methods adopt a two-stage process to construct effective adversarial examples: word importance ranking and word substitution. In the first stage, words or subwords are ranked based on their influence on the model’s prediction. This is measured by leveraging either gradient information (Liu et al., 2022) or changes in prediction probabilities when words are removed (Jin et al., 2020) or masked (Ren et al., 2019; Li et al., 2020). In the second stage, candidate words are substituted with synonyms (Zang et al., 2020), perturbed variants (Gao et al., 2018), or outputs from masked language models (Garg and Ramakrishnan, 2020; Li et al., 2020). The substitution process is guided by various constraints to ensure the adversarial example remains natural and semantically equivalent to the original text. Common constraints include thresholding the similarity between the replacement word embedding and the substituted word embedding, or enforcing semantic similarity between sentence vectors produced by the Universal Sentence Encoder (Cer et al., 2018). Despite these constraints, current textual adversarial attacks still pose significant challenges to NLP models (Liu et al., 2022; Xu et al., 2021; Yuan et al., 2023), highlighting the need for defense methods that improve adversarial robustness.
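The two-stage procedure can be illustrated with a minimal Python sketch. Everything in it is hypothetical and for illustration only: `toy_prob_positive` is a stand-in lexicon-based victim model, `rank_by_deletion` scores each word by the drop in the true-label probability when that word is removed, and `greedy_attack` tries substitutes in importance order under a query budget. For simplicity, the sketch counts only substitution queries, changes one word at a time, and omits the similarity constraints discussed above.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy lexicons for a hypothetical sentiment model (illustrative only).
POSITIVE = {"great", "good"}
NEGATIVE = {"awful", "bad"}


def toy_prob_positive(words: List[str]) -> float:
    """Stand-in victim model: probability assigned to the 'positive' label."""
    score = 0.4
    score += 0.3 * sum(w in POSITIVE for w in words)
    score -= 0.3 * sum(w in NEGATIVE for w in words)
    return min(1.0, max(0.0, score))


def rank_by_deletion(words: List[str],
                     prob_true: Callable[[List[str]], float]) -> List[int]:
    """Stage 1: rank positions by the probability drop when the word is removed."""
    base = prob_true(words)
    drops = [(base - prob_true(words[:i] + words[i + 1:]), i)
             for i in range(len(words))]
    # Largest drop first; ties are broken by position.
    return [i for _, i in sorted(drops, key=lambda d: (-d[0], d[1]))]


def greedy_attack(words: List[str],
                  prob_true: Callable[[List[str]], float],
                  substitutes: Dict[str, List[str]],
                  max_queries: int) -> Tuple[Optional[List[str]], int]:
    """Stage 2: try single-word substitutions until the true label loses (prob < 0.5)."""
    queries = 0
    for i in rank_by_deletion(words, prob_true):
        for candidate in substitutes.get(words[i], []):
            if queries >= max_queries:
                return None, queries
            trial = words[:i] + [candidate] + words[i + 1:]
            queries += 1
            if prob_true(trial) < 0.5:
                return trial, queries
    return None, queries


text = ["the", "movie", "was", "great"]
adversarial, used = greedy_attack(text, toy_prob_positive,
                                  {"great": ["terrific"]}, max_queries=10)
```

In this toy setting the attack succeeds because the substitute is a synonym the stand-in model does not know; a defense is judged by whether such a search fails within the query budget.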

2.2 Adversarial Defense Methods

To mitigate the performance degradation caused by adversarial attacks, various adversarial defense methods have been developed. They can be grouped into three categories: training-based, ensembling-based, and denoising-based methods. Adversarial training improves the robustness of the model to adversarial examples through strategies like data augmentation (Si et al., 2021) and adversarial regularisation (Madry et al., 2018; Zhu et al., 2020; Wang et al., 2021; Xi et al., 2022; Gao et al., 2023). However, adversarial training methods are limited as they assume that adversarial examples at training and test time are similar, and thus tend to overfit to specific types of adversarial attacks. Ensembling-based methods generate multiple variants of the input text at inference time and ensemble model predictions over all the variants (Ye et al., 2020; Zhou et al., 2021; Zeng et al., 2023; Li et al., 2023), but they can be inefficient given that model predictions are needed for every ensemble member, so inference time grows with the number of ensembles. More recently, denoising-based methods have been proposed to improve adversarial robustness by mapping the vector representation of the adversarial text to another point in the latent space that is close to the clean text (Nguyen Minh and Luu, 2022; Wang et al., 2023; Moon et al., 2023; Yuan et al., 2024). The denoised representation makes it more difficult to find vulnerable words to attack, thus improving adversarial robustness (Wang et al., 2023). Nevertheless, denoising might lead to very different representations of clean text and adversarial text, thereby changing the semantic meaning.

The proposed DiffuseDef builds on these three approaches and can use any adversarially trained classifier as the base, applying denoising via a diffusion layer, and ensembling the diffused representations with a small number of ensembles. Using a diffusion layer as a denoiser addresses the overfitting problem from adversarial training and mitigates the efficiency problem by performing ensembling only at the diffusion layer. By averaging denoised hidden states across all ensembles, DiffuseDef also addresses the issue stemming from denoising, maintaining good performance on clean texts.

3 DiffuseDef

Figure 1: Training and inference of DiffuseDef model. The adversarial training stage trains the pretrained encoder and classifier with perturbed input for adversarial robustness. The diffusion training trains the diffusion layer to predict injected noise at a given timestep $t$. At inference time, the text hidden state is first noised by 1 step and then denoised by $t'$ steps to create the denoised hidden states, which are ensembled to make the final prediction.

3.1 Training

The proposed diffusion defense model consists of a pretrained encoder for feature extraction, a transformer-based diffusion layer for noise prediction and reduction, and a classifier layer for output generation. The training process is split into two stages: adversarial training and diffusion training (Figure 1). The adversarial training stage can employ any neural network-based adversarial training method, such as FreeLB++ (Li et al., 2021) or RSMI (Moon et al., 2023), which optimise the encoder and classifier for robustness by perturbing the latent representation of the text input.

In the diffusion training stage, only the diffusion layer is trained: it learns to predict random noise added to the clean text hidden state at different timesteps, enabling it to denoise the adversarial hidden state at inference time. The pretrained encoder, however, is frozen during this stage. Since the pretrained encoder is only used for feature extraction, the diffusion training method is compatible with any neural network-based adversarial training method.

Given an input sequence of tokens $\mathbf{x}\in\mathbb{R}^{L}$, the pretrained encoder extracts the hidden state $h\in\mathbb{R}^{L\times D}$. A random Gaussian noise $\epsilon$ is sampled to perturb the hidden state $h$. Sohl-Dickstein et al. (2015) define the forward diffusion process as a Markov chain where at each timestep a Gaussian noise is sampled and added to the previous latent feature: $h_t=\sqrt{1-\beta_t}\,h_{t-1}+\sqrt{\beta_t}\,\epsilon$, where $\epsilon\sim\mathcal{N}(0,\mathcal{I})$, $h_t$ is the noisy hidden state at step $t$, and $\beta_t$ is a pre-calculated variance schedule changing with $t$. As shown by Ho et al. (2020), this equation can be reformulated to calculate $h_t$ directly from $h$ by defining $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{i=1}^{t}\alpha_i$, thus

$$h_t=\sqrt{\bar{\alpha}_t}\,h+\sqrt{1-\bar{\alpha}_t}\,\epsilon \tag{1}$$
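As a sanity check on Equation 1, the following Python sketch tracks the coefficient on $h$ and the accumulated noise variance under the step-by-step recursion and compares them with the closed-form values $\sqrt{\bar{\alpha}_t}$ and $1-\bar{\alpha}_t$. The helper names are ours; the schedule endpoints and the 30 steps follow the settings reported in Section 4.2. The sketch assumes independent unit-variance noise at each step.

```python
import math
from typing import List, Tuple


def linear_betas(num_steps: int, beta_start: float = 1e-4,
                 beta_end: float = 0.02) -> List[float]:
    """Linearly spaced variance schedule beta_1, ..., beta_T."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative products alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def iterate_chain(betas: List[float]) -> Tuple[float, float]:
    """Apply h_t = sqrt(1 - beta_t) h_{t-1} + sqrt(beta_t) eps step by step.

    Returns the resulting coefficient on the clean state h and the
    variance of the accumulated Gaussian noise.
    """
    coef, var = 1.0, 0.0
    for beta in betas:
        coef *= math.sqrt(1.0 - beta)
        var = (1.0 - beta) * var + beta
    return coef, var


betas = linear_betas(30)
coef, var = iterate_chain(betas)
a_bar = alpha_bars(betas)[-1]
# Both differences should vanish up to floating-point rounding.
print(abs(coef - math.sqrt(a_bar)), abs(var - (1.0 - a_bar)))
```

The variance recursion mirrors the one-line induction behind the closed form: $(1-\beta_t)(1-\bar{\alpha}_{t-1})+\beta_t=1-\bar{\alpha}_t$.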

At each training step, a random forward diffusion timestep $t$ is sampled from a uniform distribution. Therefore, the noisy hidden state $h_t$ is created from $h$, $t$, and $\epsilon$. The diffusion layer $\theta$ consists of a time embedding and a transformer layer. The time embedding receives the diffusion timestep $t$ as input and produces an embedding $e_t$, which is added to $h_t$ as input for the transformer layer. Finally, the transformer layer outputs the predicted noise $\epsilon_\theta(h_t,t)$, and mean square error is used to compute the loss between predicted noise $\epsilon_\theta(h_t,t)$ and actual sampled noise $\epsilon$.

$$L=\mathbb{E}_{t,h,\epsilon}\left[\left\|\epsilon-\epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,h+\sqrt{1-\bar{\alpha}_t}\,\epsilon\right)\right\|^{2}\right] \tag{2}$$
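A single evaluation of the objective in Equation 2 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the hidden state is flattened to a list of floats, `predict_noise` stands in for the diffusion layer $\epsilon_\theta$, the toy inputs are hypothetical, and the loss is averaged over elements (the squared norm up to a constant factor).

```python
import math
import random
from typing import Callable, List

NoisePredictor = Callable[[List[float], int], List[float]]


def noisy_hidden_state(h: List[float], eps: List[float],
                       alpha_bar_t: float) -> List[float]:
    """Equation 1: h_t = sqrt(alpha_bar_t) * h + sqrt(1 - alpha_bar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(h, eps)]


def diffusion_loss(h: List[float], alpha_bars: List[float],
                   predict_noise: NoisePredictor,
                   rng: random.Random) -> float:
    """One Monte Carlo sample of the loss: MSE between true and predicted noise."""
    t = rng.randint(1, len(alpha_bars))        # timestep, uniform over 1..T
    eps = [rng.gauss(0.0, 1.0) for _ in h]     # the noise to be predicted
    h_t = noisy_hidden_state(h, eps, alpha_bars[t - 1])
    predicted = predict_noise(h_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, predicted)) / len(h)


def zero_predictor(h_t: List[float], t: int) -> List[float]:
    """Untrained placeholder that always predicts zero noise."""
    return [0.0] * len(h_t)


# Hypothetical toy inputs: a 4-dimensional hidden state and a 3-step schedule.
hidden = [0.5, -1.0, 2.0, 0.0]
schedule = [0.99, 0.95, 0.90]                  # alpha_bar_1 .. alpha_bar_3
loss = diffusion_loss(hidden, schedule, zero_predictor, random.Random(0))
```

In training, this scalar would be minimised with respect to the parameters of the diffusion layer only, since the encoder is frozen during this stage.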

3.2 Inference

Leveraging the diffusion layer’s ability to predict noise at a given timestep $t$, we utilise it as a denoiser during inference by iteratively performing the reverse diffusion steps, which sample from $p_\theta(h_{t-1}\mid h_t)=\mathcal{N}(h_{t-1};\mu_\theta(h_t,t),\Sigma_\theta(h_t,t))$ to produce the denoised hidden state

$$\mu_\theta(h_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(h_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right) \tag{3}$$

$$\Sigma_\theta(h_t,t)=\sigma_t^{2}\,\mathcal{I} \tag{4}$$

where $\epsilon_t$ is the predicted noise from diffusion layer and $\sigma_t^{2}=\beta_t$. The denoised hidden state can thus be computed with

$$h_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(h_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right)+\sigma_t z \tag{5}$$

where $z\sim\mathcal{N}(0,\mathcal{I})$.

Inference in DiffuseDef combines a one-step noising, a multi-step denoising, and an ensembling step. After the pretrained encoder extracts its hidden state $h$, a set of $k$ Gaussian noise vectors $E=[\epsilon^{0},\epsilon^{1},\dots,\epsilon^{k}]$ are sampled to perform a single forward diffusion step. These noise vectors $E$ are then added to the hidden state $h$ following Equation 1, resulting in a set of noisy hidden states $H_{t'}=[h^{0}_{t'},h^{1}_{t'},\dots,h^{k}_{t'}]$, where $t'$ denotes the number of denoising steps. The noisy hidden states $H_{t'}$ are subsequently denoised through $t'$ reverse diffusion steps, where noise is predicted by the diffusion layer and subtracted from the previous noisy hidden states. Unlike Ho et al. (2020) where the reverse diffusion step starts with pure noise sampled from standard normal distribution, we assume the noisy hidden state $H_{t'}$ is already an intermediate state in the reverse diffusion steps. This allows us to use a smaller number of $t'$ than the training timestep $t$ to prevent the denoised hidden states from diverging substantially from the initial hidden state $h$. This sequence of denoising steps creates the final denoised hidden states $H_{0}=[h^{0}_{0},h^{1}_{0},\dots,h^{k}_{0}]$, which are averaged and used by the classifier to output the final predicted label. This process is summarised in Algorithm 1.

Data: Input text $\mathbf{x}$
Result: Predicted label $y'$
$h \leftarrow Enc(\mathbf{x})$;
Sample $E=[\epsilon^{0},\epsilon^{1},\dots,\epsilon^{k}]$, $\epsilon\sim\mathcal{N}(0,\mathcal{I})$;
$H_{t'} \leftarrow \sqrt{\bar{\alpha}_{1}}\,h+\sqrt{1-\bar{\alpha}_{1}}\,E$;
for $i \leftarrow 0$ to $t'-1$ do
      $E_{t'-i} \leftarrow \epsilon_\theta(H_{t'-i},\,t'-i)$;
      $H_{t'-i-1} \leftarrow \frac{1}{\sqrt{\alpha_{t'-i}}}\left(H_{t'-i}-\frac{1-\alpha_{t'-i}}{\sqrt{1-\bar{\alpha}_{t'-i}}}\,E_{t'-i}\right)+\sigma_{t'-i}\,z$;
end for
$y' \leftarrow CLS\left(avg(H_{0})\right)$;
Algorithm 1: Inference of DiffuseDef
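Algorithm 1 can be transcribed into a short, runnable Python sketch. It is an illustration under assumptions rather than the released implementation: hidden states are flattened to lists of floats, `predict_noise` and `classify` are caller-supplied stand-ins for the diffusion layer and the classifier, and, following Algorithm 1 as written, the noise term $\sigma_t z$ is added at every reverse step.

```python
import math
import random
from typing import Callable, List


def diffusedef_inference(h: List[float],
                         betas: List[float],
                         predict_noise: Callable[[List[float], int], List[float]],
                         classify: Callable[[List[float]], int],
                         num_ensembles: int,
                         num_denoise_steps: int,
                         rng: random.Random) -> int:
    """One-step noising, t' reverse diffusion steps, averaging, classification."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # One forward diffusion step (alpha_bar_1) applied to each ensemble member.
    keep, noise = math.sqrt(alpha_bars[0]), math.sqrt(1.0 - alpha_bars[0])
    states = [[keep * x + noise * rng.gauss(0.0, 1.0) for x in h]
              for _ in range(num_ensembles)]

    # Reverse diffusion from timestep t' down to 1 (Equation 5).
    for t in range(num_denoise_steps, 0, -1):
        alpha_t, alpha_bar_t = alphas[t - 1], alpha_bars[t - 1]
        scale = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
        sigma_t = math.sqrt(betas[t - 1])      # sigma_t^2 = beta_t
        for j, state in enumerate(states):
            predicted = predict_noise(state, t)
            states[j] = [(x - scale * p) / math.sqrt(alpha_t)
                         + sigma_t * rng.gauss(0.0, 1.0)
                         for x, p in zip(state, predicted)]

    # Average the denoised states element-wise, then classify the result.
    averaged = [sum(column) / num_ensembles for column in zip(*states)]
    return classify(averaged)


# Usage with placeholder components (both lambdas are hypothetical).
betas = [1e-4 + i * (0.02 - 1e-4) / 29 for i in range(30)]
label = diffusedef_inference(
    h=[5.0] * 8,
    betas=betas,
    predict_noise=lambda state, t: [0.0] * len(state),
    classify=lambda v: int(sum(v) > 0.0),
    num_ensembles=10,
    num_denoise_steps=5,
    rng=random.Random(0),
)
```

Note that the encoder runs once, outside this function; only the noise predictor is called once per ensemble member per denoising step.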

4 Experiments

Datasets

We focus on two common NLP tasks in our experiments: text classification and natural language inference (NLI). In the text classification task, we compare our method with other defense algorithms on two standard datasets for adversarial defense: AGNews (Zhang et al., 2015a) and IMDB (Maas et al., 2011a). In the NLI task, we perform an ablation analysis with the Question-answering NLI (QNLI) dataset (Wang et al., 2018). We randomly split the AGNews, IMDB, and QNLI datasets into train, validation, and test splits.

Evaluation

Following previous work on adversarial defense, we use three benchmarking attack methods to evaluate the robustness of DiffuseDef: TextFooler (TF) (Jin et al., 2020), TextBugger (TB) (Li et al., 2019), and BertAttack (BA) (Li et al., 2020). The three attack methods create adversarial attacks at different granularities: character-level perturbation (TextBugger), word substitution (TextFooler), and subword substitution (BertAttack). Regarding evaluation metrics, we measure the clean accuracy (Clean%) on the test set, the accuracy under attack (AUA%), and the number of adversarial queries (#Query) needed for a successful attack. Higher scores on the three metrics denote better robustness of a defense method. The accuracy on clean data is measured across the entire test set. The accuracy under attack and the number of queries, due to the lengthy attacking process, are measured on a randomly sampled subset of 1000 examples from the test set. We use the TextAttack library as the adversarial evaluation framework. To ensure a fair comparison and high-quality adversarial examples, we follow the same evaluation constraints as in Li et al. (2021). The evaluation metrics are averaged over experiments run with 5 random seeds.
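To make the two attack-side metrics concrete, the sketch below computes them from per-example attack records. The `AttackRecord` type and both functions are hypothetical helpers, not part of TextAttack, and the conventions are assumptions of this sketch: an example counts towards AUA only if it was classified correctly and the attack failed, and queries are averaged over the examples that were actually attacked.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttackRecord:
    clean_correct: bool      # was the unperturbed example classified correctly?
    attack_succeeded: bool   # did the attacker flip the prediction?
    num_queries: int         # model queries spent on this example


def accuracy_under_attack(records: List[AttackRecord]) -> float:
    """Percentage of examples still classified correctly after the attack."""
    survived = sum(r.clean_correct and not r.attack_succeeded for r in records)
    return 100.0 * survived / len(records)


def average_queries(records: List[AttackRecord]) -> float:
    """Mean number of queries over examples that were actually attacked."""
    attacked = [r for r in records if r.clean_correct]
    return sum(r.num_queries for r in attacked) / len(attacked)


# Toy records (made up for illustration).
records = [
    AttackRecord(True, False, 100),
    AttackRecord(True, True, 20),
    AttackRecord(False, False, 0),   # misclassified before any attack: skipped
    AttackRecord(True, False, 60),
]
aua = accuracy_under_attack(records)
queries = average_queries(records)
```

Under these conventions a stronger defense raises both numbers: more examples survive, and the attacker spends more queries before succeeding or giving up.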

4.1 Comparison to SOTA

We compare our proposed method with state-of-the-art adversarial defense approaches, trained using both BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as backbones:

  • Fine-tune: Fine-tuning pretrained models on the downstream task with no defense method applied. “Fine-tune” is a baseline approach used to illustrate the effect of adversarial attacks.

  • InfoBERT (Wang et al., 2021): Applying mutual-information-based regularizers during fine-tuning of pretrained models to improve robustness.

  • FreeLB++ (Li et al., 2021): An adversarial training method improving on FreeLB (Zhu et al., 2020), which adds adversarial perturbations to word embedding during fine-tuning.

  • EarlyRobust (Xi et al., 2022): Extracting early-bird subnetworks and pruning pretrained models for efficient adversarial training. We only run EarlyRobust with BERT, as its implementation with RoBERTa has not been released.

  • RSMI (Moon et al., 2023): A two-stage training method that combines randomised smoothing and masked inference to improve adversarial robustness.

4.2 Implementation and Settings

We train two DiffuseDef variants using FreeLB++ and RSMI models as base models, given their strong adversarial defense capabilities. In the diffusion layer, only one transformer encoder layer (Vaswani et al., 2017) is used. The maximum noising timestep $t$ during training is set to 30 for the AGNews and QNLI datasets and to 10 for the IMDB dataset, while at inference time we apply only $t'=5$ denoising steps. Following Ho et al. (2020), we use a linear $\beta_t$ schedule from $\beta_1=10^{-4}$ to $\beta_t=0.02$. The diffusion layer is trained for 100 epochs, with the base classifier parameters frozen for efficiency. During the diffusion training stage, the same train-dev splits are used as in the adversarial training stage, thus ensuring no data leakage. At inference time, the number of ensembles is set to 10. Appendix C lists the hyper-parameters for each dataset.
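These settings allow a rough, back-of-the-envelope comparison of inference cost, counted in transformer-layer forward passes. The count below is our own simplification, not a measurement from the paper: it assumes a 12-layer base-size encoder, treats every transformer layer as equally expensive, ignores embeddings and the classifier head, and assumes that a conventional ensembling method runs the full encoder once per ensemble member.

```python
def conventional_ensemble_layer_passes(num_encoder_layers: int,
                                       num_ensembles: int) -> int:
    """Each ensemble member needs a full pass through the encoder."""
    return num_encoder_layers * num_ensembles


def diffusedef_layer_passes(num_encoder_layers: int,
                            num_ensembles: int,
                            num_denoise_steps: int,
                            diffusion_layers: int = 1) -> int:
    """One encoder pass, then one diffusion-layer pass per member per step."""
    return num_encoder_layers + num_ensembles * num_denoise_steps * diffusion_layers


# 12 encoder layers, 10 ensembles, 5 denoising steps, 1 diffusion layer.
conventional = conventional_ensemble_layer_passes(12, 10)   # 12 * 10
diffusedef = diffusedef_layer_passes(12, 10, 5)             # 12 + 10 * 5 * 1
print(conventional, diffusedef)
```

Under this simplified count, the cost of DiffuseDef grows with the product of ensembles and denoising steps but is independent of encoder depth beyond the single encoder pass.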

5 Results and Analysis

5.1 Adversarial Robustness

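| Dataset | PLM | Method | Clean% | AUA% (TF) | AUA% (TB) | AUA% (BA) | #Query (TF) | #Query (TB) | #Query (BA) |
|---|---|---|---|---|---|---|---|---|---|
| AGNews | BERT-base | Fine-Tuned | 94.4 | 10.2 | 25.4 | 27.1 | 348 | 372 | 379 |
| AGNews | BERT-base | InfoBERT | 95.0 | 35.5 | 39.1 | 42.6 | 377 | 397 | 397 |
| AGNews | BERT-base | FreeLB++ | 95.0 | 54.7 | 56.5 | 44.6 | 426 | 430 | 390 |
| AGNews | BERT-base | EarlyRobust | 94.4 | 35.6 | 37.2 | 45.7 | 475 | 516 | 533 |
| AGNews | BERT-base | RSMI | 94.3 | 52.6 | 56.7 | 55.4 | 680 | 737 | 687 |
| AGNews | BERT-base | DiffuseDef-FreeLB++ (Ours) | 94.8 | 84.5 | 86.0 | 84.6 | 877 | 972 | 910 |
| AGNews | BERT-base | DiffuseDef-RSMI (Ours) | 93.8 | 82.7 | 83.3 | 84.4 | 894 | 1029 | 930 |
| AGNews | RoBERTa-base | Fine-Tuned | 94.9 | 34.1 | 36.9 | 43.6 | 372 | 396 | 410 |
| AGNews | RoBERTa-base | InfoBERT | 95.5 | 40.2 | 45.2 | 48.6 | 392 | 421 | 430 |
| AGNews | RoBERTa-base | FreeLB++ | 95.4 | 57.5 | 62.9 | 55.9 | 444 | 467 | 447 |
| AGNews | RoBERTa-base | RSMI | 93.1 | 64.2 | 66.4 | 67.4 | 774 | 861 | 808 |
| AGNews | RoBERTa-base | DiffuseDef-FreeLB++ (Ours) | 95.3 | 85.6 | 87.6 | 85.3 | 880 | 976 | 906 |
| AGNews | RoBERTa-base | DiffuseDef-RSMI (Ours) | 92.9 | 82.9 | 83.5 | 82.2 | 905 | 925 | 1047 |
| IMDB | BERT-base | Fine-Tuned | 93.3 | 7.7 | 8.3 | 10.5 | 540 | 534 | 378 |
| IMDB | BERT-base | InfoBERT | 93.9 | 29.2 | 25.4 | 30.7 | 642 | 644 | 390 |
| IMDB | BERT-base | FreeLB++ | 94.3 | 44.2 | 39.6 | 40.6 | 784 | 829 | 426 |
| IMDB | BERT-base | EarlyRobust | 92.7 | 49.7 | 46.8 | 43.8 | 2267 | 2788 | 1841 |
| IMDB | BERT-base | RSMI | 90.9 | 60.0 | 54.4 | 51.1 | 2840 | 3455 | 2070 |
| IMDB | BERT-base | DiffuseDef-FreeLB++ (Ours) | 94.4 | 82.1 | 83.0 | 84.0 | 3174 | 4348 | 2842 |
| IMDB | BERT-base | DiffuseDef-RSMI (Ours) | 90.2 | 80.9 | 79.8 | 79.8 | 3590 | 4748 | 2901 |
| IMDB | RoBERTa-base | Fine-Tuned | 94.6 | 21.3 | 17.9 | 13.6 | 587 | 671 | 493 |
| IMDB | RoBERTa-base | InfoBERT | 94.8 | 30.9 | 27.9 | 21.8 | 681 | 760 | 549 |
| IMDB | RoBERTa-base | FreeLB++ | 95.3 | 46.0 | 42.1 | 33.9 | 829 | 974 | 637 |
| IMDB | RoBERTa-base | RSMI | 92.7 | 77.9 | 74.3 | 70.6 | 3443 | 4342 | 2619 |
| IMDB | RoBERTa-base | DiffuseDef-FreeLB++ (Ours) | 95.0 | 86.2 | 85.9 | 86.8 | 3573 | 4663 | 2941 |
| IMDB | RoBERTa-base | DiffuseDef-RSMI (Ours) | 92.4 | 84.7 | 84.1 | 84.3 | 3673 | 4782 | 3007 |

Table 1: Main adversarial robustness results on classification tasks with BERT and RoBERTa PLMs. Clean: accuracy on clean test set. TF: TextFooler. TB: TextBugger. BA: BertAttack.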

In Table 1, we compare the adversarial robustness of DiffuseDef with baselines and SOTA methods on the AGNews and IMDB datasets, trained with BERT and RoBERTa. DiffuseDef consistently outperforms all other methods on both datasets across both PLMs, exhibiting substantial improvements in accuracy under attack (AUA). After applying diffusion training, the AUA score of both the FreeLB++ and RSMI models improves significantly, with an average increase of about 30 percentage points against the three attack methods. Note that despite the robust adversarial performance of the RSMI model, especially when trained with RoBERTa on the IMDB dataset, it still benefits from DiffuseDef. Compared with its base models (i.e. FreeLB++ and RSMI), DiffuseDef shows only a minor decline in clean accuracy, of at most 0.7 points, which indicates that it preserves clean-text performance while improving adversarial robustness. Moreover, models trained with DiffuseDef show a much smaller gap between clean accuracy and accuracy under attack, and this gap can be reduced to less than 10 percentage points.
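
The average AUA gain can be checked directly from Table 1. The sketch below copies the AUA columns for each base model and its DiffuseDef counterpart and takes an unweighted mean over the 24 (dataset, PLM, base model, attack) cells; the unweighted averaging is our assumption about how the summary figure is formed, and the variable names are illustrative.

```python
# (base AUA, DiffuseDef AUA) under TextFooler / TextBugger / BertAttack, copied from Table 1.
TABLE1_AUA = {
    ("AGNews", "BERT", "FreeLB++"): ((54.7, 56.5, 44.6), (84.5, 86.0, 84.6)),
    ("AGNews", "BERT", "RSMI"): ((52.6, 56.7, 55.4), (82.7, 83.3, 84.4)),
    ("AGNews", "RoBERTa", "FreeLB++"): ((57.5, 62.9, 55.9), (85.6, 87.6, 85.3)),
    ("AGNews", "RoBERTa", "RSMI"): ((64.2, 66.4, 67.4), (82.9, 83.5, 82.2)),
    ("IMDB", "BERT", "FreeLB++"): ((44.2, 39.6, 40.6), (82.1, 83.0, 84.0)),
    ("IMDB", "BERT", "RSMI"): ((60.0, 54.4, 51.1), (80.9, 79.8, 79.8)),
    ("IMDB", "RoBERTa", "FreeLB++"): ((46.0, 42.1, 33.9), (86.2, 85.9, 86.8)),
    ("IMDB", "RoBERTa", "RSMI"): ((77.9, 74.3, 70.6), (84.7, 84.1, 84.3)),
}

def mean_aua_gain(rows):
    """Unweighted mean of (DiffuseDef AUA - base AUA) over all attacks in `rows`."""
    gains = [after - before for base, ours in rows for before, after in zip(base, ours)]
    return sum(gains) / len(gains)

# Mean gain over every setting, then split by base model.
overall_gain = mean_aua_gain(TABLE1_AUA.values())
gain_by_base = {
    base: mean_aua_gain(v for k, v in TABLE1_AUA.items() if k[2] == base)
    for base in ("FreeLB++", "RSMI")
}
```

Under this unweighted averaging the gain works out to roughly 28.5 points, and the per-base breakdown shows that FreeLB++ gains more than RSMI, which starts from a higher AUA.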

Another benefit of DiffuseDef is the increased number of adversarial queries needed to obtain a successful attack. Models applying DiffuseDef require roughly twice as many queries or more, on both datasets, compared to adversarial training baselines such as InfoBERT and FreeLB++. This increase is even larger on the IMDB dataset due to the longer text length. For example, on IMDB a DiffuseDef model requires on average over 3000 queries to achieve a successful attack, while FreeLB++ needs fewer than 1000. The substantial increase suggests that even if attackers manage to construct a successful adversarial attack, they need 2x to 3x more time to find the attack on DiffuseDef than on other models, affirming the improved robustness from diffusion training. In addition, we observe that the number of queries for denoising-based methods (i.e. RSMI, DiffuseDef) is generally higher than for adversarial training-based methods (i.e. InfoBERT, FreeLB++). This is because denoising-based methods transform the hidden representations of the adversarial texts into a non-deterministic representation. The introduction of randomness in hidden states results in uncertainty in model logits, thus increasing the difficulty of finding vulnerable words to attack (Wang et al., 2023).

5.2 Ablation - NLI Task

To understand how each component contributes to DiffuseDef, we conduct an ablation analysis on the QNLI dataset (Table 2). Compared to the fine-tuning baseline, FreeLB++ increases the AUA score from 21.5 to 45.6, showing the benefit of adversarial training. After applying diffusion training (with inference timestep $t' = 30$), the score is further improved to 49.2, showing that diffusion training complements adversarial training. Finally, ensembling enhances adversarial performance and improves the score to 66.7, with the number of queries growing from 392 to 485. Similar improvements in both AUA and number of queries are found with the RSMI model after applying diffusion training and ensembling, which validates that the two components are complementary and that DiffuseDef is compatible with multiple SOTA defense methods.

| Method | Clean% | AUA% | #Query |
| --- | --- | --- | --- |
| Fine-Tuned (BERT) | 90.8 | 21.5 | 195 |
| FreeLB++ | 90.3 | 45.6 | 253 |
| + diffusion training | 90.2 | 49.2 | 392 |
| + diffusion training + ensembling | 90.3 | 66.7 | 485 |
| RSMI | 87.4 | 35.2 | 314 |
| + diffusion training | 86.5 | 40.0 | 353 |
| + diffusion training + ensembling | 86.4 | 55.5 | 459 |

Table 2: Ablation results for DiffuseDef on the QNLI dataset. AUA% and #Query are measured under TextFooler attack.

5.3 Robustness w.r.t Token Length

Figure 2: Defense rate (against TextFooler) w.r.t token length for different models on IMDB dataset.

Figure 2 compares the defense rate of different models by token length on the IMDB dataset. The defense rate is calculated as the percentage of test examples on which TextFooler fails to construct a successful attack. All models except RSMI show a consistent trend: the defense rate declines as the texts lengthen. This trend can be attributed to the nature of adversarial attacks, as longer texts allow for the generation of more adversarial examples. Specifically, adversarial training defense methods like InfoBERT and FreeLB++ show poor performance on longer texts (more than 300 tokens), with the defense rate reduced to near 0. This drastic decline indicates that, given an adequate number of queries, the attacker is almost guaranteed to find a successful attack to fool these models. Similarly, EarlyRobust exhibits a performance drop on long texts, as it is based on FreeLB training. RSMI, however, performs worse on short texts, but its defense rate increases as the text length grows. Compared to all SOTA defense approaches, the two DiffuseDef variants decline more gradually and maintain a higher defense rate across all token lengths, i.e. DiffuseDef is more robust to input text length.
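
The defense rate per token-length bucket can be computed as in the following minimal sketch. The bucket width of 100 tokens, the record format, and the toy data are assumptions made for illustration; the binning used for Figure 2 is not stated in the text.

```python
from collections import defaultdict

def defense_rate_by_length(records, bucket_size=100):
    """Compute the defense rate (in percent) per token-length bucket.

    records: iterable of (token_length, attack_succeeded) pairs.
    Returns a dict mapping each bucket's lower bound to its defense rate.
    """
    totals = defaultdict(int)
    defended = defaultdict(int)
    for length, attack_succeeded in records:
        bucket = (length // bucket_size) * bucket_size
        totals[bucket] += 1
        if not attack_succeeded:
            # The attacker failed on this example, so it counts as defended.
            defended[bucket] += 1
    return {b: 100.0 * defended[b] / totals[b] for b in sorted(totals)}

# Hypothetical attack outcomes: (number of tokens, did the attack succeed?)
toy_records = [(80, False), (90, True), (150, False), (320, True), (350, True), (399, False)]
rates = defense_rate_by_length(toy_records)
```

Here `rates` maps the buckets starting at 0, 100, and 300 tokens to their defense rates; buckets with no examples are simply absent from the result.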

5.4 Effect of Additional Denoising Steps

Figure 3: AUA and #Query (TextFooler) w.r.t inference denoising step for DiffuseDef w/ and w/o ensembling.

In Figure 3 we study how the number of inference denoising steps $t'$ affects adversarial performance. For the DiffuseDef model without ensembling, both the AUA score and the number of queries required to attack increase as the number of inference denoising steps grows. As $t'$ grows from 1 to 30, the AUA score improves from 58 to 65, while the number of attack queries grows from 430 to 780. In contrast, DiffuseDef with ensembling maintains a stable and robust performance in AUA and number of queries, regardless of the increase of $t'$. Considering that ensembling introduces a notable performance increase, the DiffuseDef model is likely hitting an upper bound in both metrics, so no further improvement is reached by increasing the number of denoising steps. However, this also shows that with ensembling, DiffuseDef can be applied with a smaller $t'$ for better efficiency while maintaining robust adversarial performance.

5.5 Ensembling Diffused Hidden Representations

In DiffuseDef, the text hidden state is diffused and ensembled to form a denoised hidden representation, which contributes significantly to the improved adversarial robustness. In this section, we study how ensembling diffused hidden representations helps defend against adversarial attacks.
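
As a rough illustration of the procedure studied here, the sketch below noises a hidden state, denoises it for $t'$ steps, and averages $k$ such copies. It assumes the standard DDPM ancestral sampling update of Ho et al. (2020) and mean pooling over ensemble members; the function names, the `predict_noise` callback standing in for the trained diffusion layer, and the toy zero-noise predictor are all hypothetical and are not taken from the released code.

```python
import math
import random

def diffuse_and_ensemble(hidden, predict_noise, betas, num_steps, num_ensembles, rng):
    """Noise `hidden`, denoise it for `num_steps` steps, and average
    `num_ensembles` such denoised copies into one representation."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    ensemble_sum = [0.0] * len(hidden)
    for _ in range(num_ensembles):
        # Forward step: jump directly to timestep `num_steps` in closed form.
        a_bar = alpha_bars[num_steps - 1]
        h = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0) for x in hidden]
        # Reverse steps: remove the predicted noise one timestep at a time.
        for t in range(num_steps, 0, -1):
            eps = predict_noise(h, t)
            coef = betas[t - 1] / math.sqrt(1.0 - alpha_bars[t - 1])
            h = [(x - coef * e) / math.sqrt(alphas[t - 1]) for x, e in zip(h, eps)]
            if t > 1:
                # DDPM adds fresh Gaussian noise on all but the final step.
                sigma = math.sqrt(betas[t - 1])
                h = [x + sigma * rng.gauss(0.0, 1.0) for x in h]
        ensemble_sum = [s + x for s, x in zip(ensemble_sum, h)]
    # Assumption: ensembling is a simple mean over the denoised copies.
    return [s / num_ensembles for s in ensemble_sum]

def zero_predictor(h, t):
    """Hypothetical stand-in for the diffusion layer: predicts zero noise."""
    return [0.0] * len(h)

# Linear schedule over 30 steps; t' = 5 and k = 10 as in the main experiments.
betas = [1e-4 + i * (0.02 - 1e-4) / 29 for i in range(30)]
robust = diffuse_and_ensemble([0.5, -1.2, 0.3, 0.9], zero_predictor, betas,
                              num_steps=5, num_ensembles=10, rng=random.Random(0))
```

Because each ensemble member draws its own noise, two calls with different random seeds return slightly different representations, which is the source of the non-determinism discussed in Section 5.1.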

Figure 4: Distribution of max token importance score in the AGNews test set.

As mentioned in Section 2.1, attack methods first need to rank tokens by importance, based on their influence on the prediction. Specifically, the importance score is calculated by comparing the change in model prediction probabilities after removing each word. In Figure 4, we compare the distribution of the max token importance score between FreeLB++ and its DiffuseDef counterpart. Both FreeLB++ and DiffuseDef show a long-tail distribution, with over 80 percent of examples having a max token importance score below 0.1. This suggests that in most cases changing a single token will not significantly alter the prediction of either model. However, DiffuseDef shows a notably lower percentage of examples whose max importance score is between 0.9 and 1, where the attacker can easily find the vulnerable token to construct adversarial examples. This difference shows that DiffuseDef complicates the search for important words, which accounts for the increased number of queries required for a successful attack.
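
A simplified version of this deletion-based importance score is sketched below. The toy classifier and its probabilities are hypothetical, and real attack implementations such as TextFooler use a more elaborate score (for instance, handling the case where deleting a word flips the predicted label), so this should be read as an illustration of the idea only.

```python
def token_importance(tokens, predict_proba, label):
    """Importance of each token: the drop in the probability of `label`
    when that token is deleted from the input."""
    base = predict_proba(tokens)[label]
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(base - predict_proba(reduced)[label])
    return scores

def toy_predict_proba(tokens):
    """Hypothetical 2-class model: P(class 1) grows with the count of 'great'."""
    p1 = min(0.95, 0.2 + 0.6 * tokens.count("great"))
    return [1.0 - p1, p1]

tokens = "a great movie".split()
scores = token_importance(tokens, toy_predict_proba, label=1)

# The per-example statistic plotted in Figure 4 is the maximum over tokens.
max_score = max(scores)
```

In this toy example only the word "great" carries importance, so an attacker would target it first. A model whose `max_score` is small for most inputs gives the attacker no obvious starting point.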

| Method | L2 | Cosine |
| --- | --- | --- |
| FreeLB++ | 12.53 | 0.35 |
| DiffuseDef-FreeLB++ | 10.66 | 0.27 |
| RSMI | 9.72 | 0.24 |
| DiffuseDef-RSMI | 8.61 | 0.21 |

Table 3: L2 and cosine distance between hidden states for clean and adversarial texts.

In addition, DiffuseDef mitigates the difference between clean and adversarial texts by reducing the distance between their hidden states. In Table 3, we report the L2 and cosine distances between clean and adversarial hidden states for FreeLB++ and RSMI. Both show lower L2 and cosine distances after applying DiffuseDef, indicating that ensembling diffused representations moves the adversarial example closer to the clean example, which helps the model maintain its predictions.
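
For reference, the two distances can be computed as in the sketch below. It assumes that cosine distance means one minus cosine similarity and that each text is represented by a single vector; how the hidden states are pooled into one vector is not specified here, and the toy vectors are made up.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical hidden states for a clean text and its adversarial version.
clean_hidden = [1.0, 0.0, 2.0]
adv_hidden = [1.0, 2.0, 2.0]

l2 = l2_distance(clean_hidden, adv_hidden)
cos = cosine_distance(clean_hidden, adv_hidden)
```

A defense that pulls the adversarial representation back toward the clean one lowers both values, which is the pattern reported in Table 3.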

5.6 Efficiency of DiffuseDef

| Method | Params | FLOPS |
| --- | --- | --- |
| Fine-Tuned (BERT) | 110M | 46G |
| EarlyRobust | 82M | 32G |
| FreeLB++ | 110M | 46G |
| InfoBERT | 110M | 46G |
| RSMI | 110M | 92G |
| RanMask ($k=10$) | 110M | 459G |
| SAFER ($k=10$) | 110M | 459G |
| DiffuseDef ($t'=1, k=10$) | 120M | 96G |
| DiffuseDef ($t'=5, k=10$) | 120M | 267G |

Table 4: Efficiency comparison of DiffuseDef-FreeLB++ with other methods. Params: number of model parameters. FLOPS: number of floating-point operations at inference time, calculated with a batch size of 1 and a sequence length of 256.

Given that DiffuseDef adds denoising and ensembling steps during inference, it inevitably increases the computation time compared to its base model. To study its efficiency, we report the number of model parameters and inference FLOPS in Table 4. In addition to the defense methods in Table 1, we also compare the efficiency of DiffuseDef with two other SOTA ensembling-based defense methods, i.e. RanMask (Zeng et al., 2023) and SAFER (Ye et al., 2020).

All SOTA models have the same number of parameters as the fine-tuned BERT model, except EarlyRobust, which applies attention head pruning for better efficiency. DiffuseDef, with one additional diffusion layer, increases the number of parameters from 110M to 120M. DiffuseDef requires more inference FLOPS than non-ensembling baselines such as FreeLB++ and EarlyRobust. With $t'=1$ and $k=10$, the FLOPS for DiffuseDef roughly double from 46G to 96G. Nevertheless, this number is close to that of the RSMI model (92G FLOPS), which requires gradient information during inference. Despite this increase, DiffuseDef is more efficient than ensembling-based methods like RanMask and SAFER, which need a full forward pass for every ensemble member. With the same number of ensembles (10), both RanMask and SAFER require 459G FLOPS, which is 10x the number for the BERT baseline. In contrast, even with $t'$ increased to 5, DiffuseDef requires only 267G FLOPS, showing that it can mitigate the efficiency problem of ensembling while maintaining the benefit of improved robustness.
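
The relative costs quoted above follow from simple arithmetic on Table 4. The sketch below reproduces the ratios and, as a back-of-the-envelope reading of our own, estimates the marginal cost of one extra denoising step at $k=10$ from the two DiffuseDef rows; this per-step estimate assumes cost grows linearly in $t'$, which the table does not state explicitly.

```python
# Inference cost in GFLOPs, copied from Table 4.
BASELINE_GFLOPS = 46  # fine-tuned BERT
REPORTED_GFLOPS = {
    "RSMI": 92,
    "RanMask (k=10)": 459,
    "SAFER (k=10)": 459,
    "DiffuseDef (t'=1, k=10)": 96,
    "DiffuseDef (t'=5, k=10)": 267,
}

# Cost of each method relative to the fine-tuned BERT baseline.
relative_cost = {name: g / BASELINE_GFLOPS for name, g in REPORTED_GFLOPS.items()}

# Assumed-linear marginal cost of one additional denoising step at k=10.
per_step_gflops = (
    REPORTED_GFLOPS["DiffuseDef (t'=5, k=10)"] - REPORTED_GFLOPS["DiffuseDef (t'=1, k=10)"]
) / (5 - 1)
```

Under the linearity assumption, each extra denoising step across ten ensemble members costs a little less than one additional baseline forward pass, which is consistent with the encoder being run once and only the single-layer diffusion module being repeated.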

6 Conclusions

We propose a novel adversarial defense method, DiffuseDef, which combines adversarial training, diffusion training, and ensembling to improve model robustness to adversarial attacks. DiffuseDef can build on any existing adversarial training method, training an additional diffusion layer to predict and remove randomly sampled noise at a given timestep. During inference, the diffusion layer is used to denoise the adversarial hidden states, which are ensembled to construct a robust text representation. Our experiments validate the effectiveness and efficiency of DiffuseDef, which significantly outperforms SOTA on three common adversarial attack methods. Analysis shows that DiffuseDef makes it difficult to find vulnerable tokens to attack, and also reduces the difference between the hidden representations of clean and adversarial texts.

7 Limitations

Scope

Our experiments focus on defending against three common black-box adversarial attack methods, while whether DiffuseDef improves model robustness against white-box attacks is unclear. White-box attacks have access to model parameters and can utilize gradient information to construct adversarial examples more efficiently than black-box attacks. Defending against white-box attacks is more challenging, and we consider this as a future direction of DiffuseDef.

Comparison with additional approaches

Due to the length limit, we do not compare against all current approaches. However, we do compare with the SOTA methods that showed the best adversarial robustness in our preliminary experiments.

Efficiency

Although DiffuseDef is more efficient than existing ensembling-based methods, it still requires more model parameters and inference FLOPS than non-ensembling-based models to achieve better robustness. Future work might reduce the size of the diffusion layer and the number of ensembles to make DiffuseDef more efficient.

8 Ethical Considerations

In this paper we propose a new method, DiffuseDef, which uses a diffusion layer as a denoiser to provide a robust and efficient text representation. We demonstrate that the proposed method can significantly improve the robustness of NLP systems to adversarial attacks. However, DiffuseDef cannot defend against adversarial attacks that are not subject to constraints (e.g. on the number of perturbed words or the semantic similarity between original and adversarial examples). Potential risks might include the creation of new adversarial attacks devised specifically for DiffuseDef.

References

Appendix A Data Preparation

| Dataset | Train | Valid | Test | Avg Len |
| --- | --- | --- | --- | --- |
| AGNews | 108K | 12K | 7K | 51.3 |
| IMDB | 40K | 5K | 5K | 311.9 |
| QNLI | 94K | 10K | 5K | 47.2 |

Table 5: Dataset statistics. The average text length is counted with BertTokenizer.

Table 5 presents the number of examples in train/valid/test splits and the average token length for the three datasets used in the experiments. For QNLI and AGNews datasets, we randomly split the training set into our train/valid splits, with a ratio of 0.9/0.1, and use its test set as our test split. For IMDB dataset, we randomly split the dataset into train/valid/test splits with a ratio of 0.8/0.1/0.1. All train/valid/test splitting is performed using a random seed of 42.
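
The splitting procedure can be sketched as follows. Only the ratios and the seed of 42 come from the description above; the helper name, the use of Python's `random` module, and the example sizes (120K training examples for AGNews and 50K examples for IMDB, inferred from Table 5) are assumptions, so this will not reproduce the exact splits used in the experiments.

```python
import random

def split_indices(num_examples, ratios, seed=42):
    """Shuffle example indices with a fixed seed and cut them by `ratios`."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    splits, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last split takes the remainder so that no example is dropped.
        is_last = i == len(ratios) - 1
        end = num_examples if is_last else start + round(num_examples * ratio)
        splits.append(indices[start:end])
        start = end
    return splits

# AGNews / QNLI style: split the original training set 0.9 / 0.1.
train_idx, valid_idx = split_indices(120_000, (0.9, 0.1))

# IMDB style: split the whole dataset 0.8 / 0.1 / 0.1.
imdb_train, imdb_valid, imdb_test = split_indices(50_000, (0.8, 0.1, 0.1))
```

Fixing the seed makes the split reproducible across the adversarial training and diffusion training stages, which is what guarantees that the diffusion layer never sees validation or test examples.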

Appendix B Evaluation Constraints

| Dataset | $\varepsilon_{\min}$ | $K_{\max}$ | $\rho_{\max}$ |
| --- | --- | --- | --- |
| AGNews | 0.84 | 50 | 0.3 |
| IMDB | 0.84 | 50 | 0.1 |
| QNLI | 0.84 | 50 | 0.2 |

Table 6: Evaluation parameters for each dataset.

When evaluating with adversarial attacks, we follow the parameter settings for TextAttack suggested by Li et al. (2021). The minimum semantic similarity $\varepsilon_{\min}$ between the clean text and the adversarial text is set to 0.84, with the score computed using Universal Sentence Encoder (Cer et al., 2018). The maximum number of candidate substitutions $K_{\max}$ from the attacker is 50, thus the maximum number of queries is $Q_{\max} = K_{\max} \times L$, where $L$ is the number of tokens. Finally, the maximum percentage of changed tokens $\rho_{\max}$ is set to 0.3/0.1/0.2 for the AGNews, IMDB, and QNLI datasets respectively.
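
The constraints in Table 6 translate into per-example budgets as in the sketch below. The `AttackConstraints` class and its methods are hypothetical helpers written for illustration and are not part of the TextAttack API; rounding the changed-token budget down is also an assumption.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class AttackConstraints:
    min_similarity: float     # epsilon_min: minimum semantic similarity
    max_candidates: int       # K_max: candidate substitutions per token
    max_perturb_ratio: float  # rho_max: maximum fraction of changed tokens

    def max_queries(self, num_tokens):
        """Q_max = K_max * L for a text of L tokens."""
        return self.max_candidates * num_tokens

    def max_changed_tokens(self, num_tokens):
        """Token budget implied by rho_max (rounded down, by assumption)."""
        return math.floor(self.max_perturb_ratio * num_tokens)

# Values copied from Table 6.
CONSTRAINTS = {
    "AGNews": AttackConstraints(0.84, 50, 0.3),
    "IMDB": AttackConstraints(0.84, 50, 0.1),
    "QNLI": AttackConstraints(0.84, 50, 0.2),
}

# Budgets for a hypothetical 300-token IMDB review.
imdb_queries = CONSTRAINTS["IMDB"].max_queries(300)
imdb_changes = CONSTRAINTS["IMDB"].max_changed_tokens(300)
```

Because $Q_{\max}$ scales with the number of tokens, long IMDB reviews give the attacker a much larger query budget than short AGNews items, which helps explain the larger #Query values on IMDB in Table 1.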

Appendix C Training

| Hyperparameter | AGNews | IMDB | QNLI |
| --- | --- | --- | --- |
| Epochs | 100 | 100 | 100 |
| Batch size | 64 | 64 | 64 |
| Sequence len | 128 | 256 | 256 |
| Dropout | 0.1 | 0.1 | 0.1 |
| Optimizer | AdamW | AdamW | AdamW |
| Lr | 2e-5 | 2e-5 | 2e-5 |
| $t$ (max training timestep) | 30 | 10 | 30 |
| $t'$ (inference denoising steps) | 5 | 5 | 5 |
| $k$ (number of ensembles) | 10 | 10 | 10 |

Table 7: Hyperparameters for training DiffuseDef.

The details on hyper-parameters of diffusion training can be found in Table 7. All models are trained on a single RTX A6000 GPU. The diffusion training of 100 epochs takes 6/4/3 hours on AGNews, IMDB, QNLI datasets respectively.

Appendix D License for Scientific Artifacts

| Artifact | License |
| --- | --- |
| AGNews (Zhang et al., 2015b) | Custom (non-commercial) |
| IMDB (Maas et al., 2011b) | - |
| QNLI (Wang et al., 2018) | CC BY-SA 4.0 |
| transformers (Wolf et al., 2020) | Apache License 2.0 |
| TextAttack (Morris et al., 2020) | MIT License |
| BERT (Devlin et al., 2019) | Apache License 2.0 |
| RoBERTa (Liu et al., 2019) | MIT License |

Table 8: Licenses of scientific artifacts used in this paper.

Table 8 lists the scientific artifacts, including data, code, and models, used in this paper. The use of these artifacts in this paper is consistent with their intended use, i.e. for scientific research only. The data used in the experiments is in English and does not contain personally identifying information or offensive content.

Appendix E Example of noising and denoising in DiffuseDef

Adding noise to and removing noise from hidden states are essential features of DiffuseDef that contribute to the improved adversarial robustness. To study how adding or removing noise affects the semantic meaning of the text, we feed the hidden states to the pretrained BERT model with a masked language modeling (MLM) head to generate the text output.

In Table 9, we present the MLM outputs from hidden states with different numbers of noising steps applied, and the MLM outputs from noised hidden states denoised for the same number of steps. In the example shown, as more noise is added, some semantic information is lost and replaced by symbols or function words like "." or "the". In contrast, denoising for the same number of steps helps alleviate this information loss. For example, the word "IBM" can be recovered from the noise.

However, in practice the number of noising steps applied to an input is not known, so in Table 10 we show the MLM outputs of hidden states denoised directly from clean and adversarial texts. On clean text, we observe that a higher number of denoising steps can result in more abstraction of the text. For example, more words are replaced with "the" in the MLM outputs as $t'$ grows. However, words related to the topic (e.g. "Manchester United", "Liverpool") are kept during the denoising process, so the model can still predict correctly. The same trend of abstraction can be found on adversarial text, where we also observe that denoising helps remove the adversarial perturbation and recover the word "united" from "nations", resulting in a correct prediction on the adversarial text.

| $t'$ | MLM Output (add noise) | MLM Output (add noise then denoise) |
| --- | --- | --- |
| 0 | IBM Chips May Someday Heal Themselves New technology applies electrical fuses to help identify and repair faults. | - |
| 5 | the ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults. | the ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults. |
| 6 | ) ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults. | the ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults. |
| 7 | the. chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults. | the. chips may someday heal themselves new technology uses electrical fuses to help identify and repair faults. |
| 8 | ).. may someday heal themselves new technology introduces electrical fuses to help identify and repair faults. | the ibm. may someday heal themselves new technology uses electrical fuses to help identify and repair faults. |
| 9 | the ibm chips may someday heal themselves new technology uses electrical fuses to help identify and repair faults. | the ibm chips may someday heal themselves new technology introduces electrical fuses to help identify and repair faults. |
| 10 | the. chips may someday heal themselves new technology extends electrical fuses to help identify and repair faults. | the ibm. may someday heal themselves new technology develops electrical fuses to help identify and repair faults. |

Table 9: MLM outputs from hidden states with noise added, and from hidden states with noise first added and then removed by denoising. We only report $t'$ of 5 and above, as the MLM outputs with smaller $t'$ are identical to the clean text.

| $t'$ | Clean Text / MLM Output | Adv Text / MLM Output | Pred clean | Pred adv |
| --- | --- | --- | --- | --- |
| 0 | United Apology over Website Abuse Manchester United have been forced to issue an embarrassing apology to Liverpool for an ill-advised attack on the Anfield outfit on its own website. | United Apology over Website Abuse Manchester Nations have been forced to issue an embarrassing apology to Liverpool for an ill-advised attack on the Anfield outfit on its own website. | Sports | World |
| 1 | football. apology over website abuse manchester united have been - to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website. | the. apology over website abuse manchester nations have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website. | Sports | World |
| 2 | the. apology over website abuse manchester united have been - to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website. | the. apology over website abuse manchester nations have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website. | Sports | World |
| 3 | the. apology over website abuse manchester united have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website. | the. apology over website abuse manchester s have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website. | Sports | Sports |
| 4 | the. apology over website abuse manchester united have the - to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website. | the. apology over website abuse manchester s have been the to issue an embarrassing apology to liverpool for an the - advised attack on the anfield outfit on its own website. | Sports | Sports |
| 5 | the. apology over website abuse manchester united have the the to issue an a apology to liverpool for an the - advised attack on the anfield outfit on its own website. | the. apology over website abuse manchester united have been the to issue an the apology to liverpool for an’- advised attack on the anfield outfit on its own website. | Sports | Sports |

Table 10: MLM outputs and FreeLB++ model predictions from ensembling diffused hidden states at different denoising steps.

Appendix F Confusion Matrix under Attack

Figures 5 and 6 present the confusion matrices of model predictions on clean texts and on adversarial texts (successful attack examples) on the AGNews and IMDB test sets respectively.

Figure 5: Confusion matrix of models under attack on AGNews test set.
Figure 6: Confusion matrix of models under attack on IMDB test set.