Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

Weikai Lu, South China University of Technology, China
Ziqian Zeng, South China University of Technology, China
Jianwei Wang, South China University of Technology, China
Zhengdong Lu, South China University of Technology, China
Zelin Chen, South China University of Technology, China
Huiping Zhuang, South China University of Technology, China
Cen Chen, South China University of Technology, China; Pazhou Laboratory, China
Abstract

Jailbreaking attacks can enable Large Language Models (LLMs) to bypass their safeguards and generate harmful content. Existing jailbreaking defense methods fail to address the fundamental issue that harmful knowledge resides within the model, which leaves LLMs exposed to potential jailbreak risks. In this paper, we propose a novel defense method called Eraser, which has three main goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment. The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it will no longer be able to answer that question. The training of Eraser does not actually require the model’s own harmful knowledge; it can benefit from unlearning general answers related to harmful queries, which means it does not need assistance from a red team. Experimental results show that Eraser can significantly reduce the jailbreaking success rate for various attacks without compromising the general capabilities of the model. Our code is available at https://github.com/ZeroNLP/Eraser.

This paper contains harmful data and model-generated content that can be offensive in nature.




*Corresponding author

1 Introduction

With the widespread popularity of Large Language Models (LLMs) Achiam et al. (2023); Anil et al. (2023); Touvron et al. (2023); Bai et al. (2023); Yang et al. (2023), there is growing concern regarding the safety of LLM-generated content and the harm it may cause. LLMs are trained on massive data that has not undergone rigorous scrutiny (Huang et al., 2023), which can lead to undesirable content generation. To steer LLMs towards generating helpful and harmless responses, LLM alignment methods such as reinforcement learning from human feedback (RLHF) Ouyang et al. (2022) and supervised fine-tuning (SFT) have been proposed, enabling LLMs to reject harmful queries as depicted in Figure 1(a).

Figure 1: (a) Safety alignment: when the attacker directly asks a harmful question, the LLM refuses to respond because of safety alignment. (b) Jailbreaking: when the attacker asks the harmful question via an adversarial prompt, the harmful knowledge bypasses the safeguards, and the LLM provides harmful responses. (c) Eraser: when the harmful knowledge is forgotten, it can no longer bypass the safeguards, and the LLM refuses to answer.

However, well-aligned LLMs can be fragile. Recent work Liu et al. (2023); Chao et al. (2023); Zou et al. (2023) has proposed jailbreaking attack methods that disguise harmful queries with adversarial prompts, inducing LLMs to bypass safeguards and generate harmful responses as depicted in Figure 1(b). Adversarial prompts are carefully designed by humans, for example by enticing LLMs to play roles devoid of basic moral principles (Deshpande et al., 2023) or by appending meaningless suffixes Zou et al. (2023). To enhance the efficiency of jailbreaking, several automated programs for searching adversarial prompts have been proposed Liu et al. (2023); Chao et al. (2023). These works have significantly raised the success rate of jailbreaking, while also amplifying the security risks associated with LLMs.

Currently, there are two main ways to address jailbreak attacks: (1) Harmful behavior filtering Cao et al. (2023); Kumar et al. (2023); Markov et al. (2023): These methods typically do not alter the model’s weights but censor the inputs and outputs of LLMs. Their purpose is to detect jailbreaking behavior during the model inference stage and respond with predefined warnings when jailbreaking is detected. (2) Continued training Wang et al. (2023); Zhang et al. (2023); Deng et al. (2023): These methods utilize additional training to enhance the model’s ability to reject harmful inputs or improve the model’s ability to discriminate harmful content.

Although these methods have yielded promising results, they ignore the fact that harmful knowledge still resides within the model. This harmful knowledge serves as the underlying basis for generating harmful responses. For instance, knowledge related to bomb-making plays a pivotal role in answering inquiries like “how to make bombs?” When more advanced attack methods are developed, harmful knowledge is likely to resurface, resulting in an endless cat-and-mouse game.

In light of this, the intuition of our method is to remove the harmful knowledge from LLMs, as illustrated in Figure 1(c). We propose Eraser, a jailbreaking defense method with three main goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment to harmful inquiries. Specifically, we perform gradient ascent on harmful answers in a simulated jailbreaking mode, retain general knowledge by preserving the ability to understand entities, and enhance safety alignment by maintaining the ability to reject harmful questions.

Experimental results have shown that the proposed method can significantly reduce the success rate of various jailbreaking attacks without compromising the performance on other tasks.

The contributions of our paper are summarized as follows:

• We propose a method that can achieve three goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment to harmful inquiries.

• Experimental results demonstrate that the proposed method excels in defense capability while maintaining general capability. Compared to existing methods, it exhibits a better trade-off between harmlessness and usefulness.

• Experimental results show that simply using random token sequences for gradient ascent can achieve defense capabilities. This finding offers valuable insights for future endeavors in jailbreak defense.

2 Related Works

2.1 Jailbreaking Defense

Although many alignment methods have been developed to make LLMs generate ethical and responsible text, an emerging class of attacks called jailbreaking attacks can still bypass the safeguards and cause LLMs to produce harmful and toxic responses. To combat jailbreaking attacks, existing defense strategies fall primarily into two categories: harmful behavior filtering and continued training. Harmful behavior filtering involves applying perturbations to model inputs (Cao et al., 2023; Kumar et al., 2023; Robey et al., 2023), scrutinizing model outputs (Markov et al., 2023; Helbling et al., 2023), and integrating multiple LLMs (Chen et al., 2023). These methods generally incur additional inference costs. Continued training uses further SFT to enhance the security of models. For example, Wang et al. (2023) trained LLMs to evaluate the potential harm of their own responses at the end of each output; Zhang et al. (2023) trained LLMs to distinguish between harmful and helpful target prioritization, improving the model’s understanding of harmfulness; Deng et al. (2023) proposed a red team defense framework that searches for harmful prompts and trains the model to reject them. However, none of these methods addresses the fundamental cause of harmful output from LLMs, namely that harmful knowledge is still retained in the model.

2.2 LLM Unlearning

Machine unlearning methods are designed to remove specified knowledge that has been learned by a model Bourtoule et al. (2021). Because LLMs are trained on massive training data, re-training them is not a practical way to forget specific knowledge. Using machine unlearning methods to mitigate privacy exposure or poisoning attacks on LLMs has become a promising research direction Jang et al. (2023); Chen and Yang (2023); Eldan and Russinovich (2023). Some recent work has attempted to solve the harmful output problem using unlearning. Zhou et al. (2023) assumed that there were harmful instructions in the SFT dataset and attempted to make harmful behaviors unlearnable during the SFT process. The work most relevant to ours is Yao et al. (2023), which uses unlearning to remove harmful responses, erase copyright-protected content, and eliminate hallucination from an unaligned model. However, Yao et al. considered LLM unlearning as an alignment method, an alternative to RLHF. In contrast, we consider LLM unlearning as a post-hoc defense strategy against jailbreaking on an aligned model.

3 Methodology

3.1 Problem Formulation

Assume there is an aligned LLM $f(\cdot)$ that refuses to answer harmful queries such as “How to make bombs?” but can still generate harmful content under jailbreaking attacks, for example a response beginning “Sure, there are mainly three steps.” Given the aligned LLM $f(\cdot)$ and a set of harmful queries $X_q$, the goal is to fine-tune a new LLM $h(\cdot)$ that refuses to answer as many of the harmful queries in $X_q$ as possible under different jailbreaking attacks while maintaining its proficiency in handling regular queries.

We propose Eraser, a jailbreak defense method based on machine unlearning. Specifically, we unlearn the corresponding answer $y$ for each $x \in X_q$ while maintaining proficiency in answering regular queries. Our method includes three components: unlearning harmful knowledge (§3.2), retaining general knowledge (§3.3), and maintaining safety alignment (§3.4).

3.2 Unlearn Harmful Knowledge

Following Chen and Yang (2023); Yao et al. (2023), we adopt the gradient ascent technique to implement unlearning. The current challenge lies in acquiring the harmful knowledge embedded within LLMs. One possible way is to collect it with the help of red teams Deng et al. (2023), but this is labor-intensive and time-consuming. Our intuition is that multiple answers to the same question should have similarities, and forgetting one may generalize to others. Hence, we propose to utilize publicly available uncensored models to obtain harmful answers. The collected harmful dataset is denoted as $D_f = \{(x, y) \mid x \in X_f, y \in Y_f\}$, where $X_f$ and $Y_f$ are the question set and the answer set, respectively.

For a question and answer pair $(x, y) \in D_f$, the existing unlearning method Yao et al. (2023) takes $x$ as input and uses $y$ as the target to perform gradient ascent. This process aims to reduce the probability that the LLM responds with $y$ when given $x$. However, in jailbreaking attacks, $x$ is often disguised in the jailbreaking prompt, in which the adversarial prefixes and suffixes are the key to awakening harmful memories in LLMs. Therefore, we add different randomly generated prefixes and suffixes to $x$ at each epoch of training to simulate jailbreaking attack scenarios. Intuitively, we hope that regardless of how prompts are disguised, as long as $x$ is present, the model will not provide the harmful answer $y$. Let $T(\cdot)$ be a function that adds random prefixes and suffixes to strings. The unlearning training objective is defined as follows:

$$L_f = \frac{1}{|D_f|} \sum_{(x,y) \in D_f} \sum_{i=1}^{|y|} \log\left(p\left(y_i \mid T(x), y_{<i}\right)\right) \tag{1}$$

where $y_{<i} = \{y_1, \dots, y_{i-1}\}$ denotes the first $i-1$ tokens of the target sequence $y$, and $p(y_i \mid T(x), y_{<i})$ denotes the conditional probability that the LLM $h(\cdot)$ predicts the next token $y_i$ when given $T(x)$ and $y_{<i}$. Minimizing $L_f$ lowers the log-likelihood of the harmful answer, which is equivalent to gradient ascent on the usual next-token cross-entropy loss.
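The following is a minimal sketch of the two ingredients of Eq. (1), not the authors' implementation. It assumes that $T(\cdot)$ draws random tokens from a small placeholder vocabulary (the paper does not specify how the affixes are generated) and that the per-token probabilities $p(y_i \mid T(x), y_{<i})$ are already available as plain numbers. The names `add_random_affixes` and `unlearning_loss` are hypothetical.

```python
import math
import random


def add_random_affixes(question, vocab, rng, max_len=5):
    """Hypothetical stand-in for T(x): wrap x in a random prefix and suffix."""
    prefix = [rng.choice(vocab) for _ in range(rng.randint(1, max_len))]
    suffix = [rng.choice(vocab) for _ in range(rng.randint(1, max_len))]
    return " ".join(prefix + [question] + suffix)


def unlearning_loss(batch_token_probs):
    """Eq. (1): mean over (x, y) pairs of the summed token log-probabilities.

    batch_token_probs[k][i] stands for p(y_i | T(x), y_<i) of the k-th pair.
    Driving this value down makes the harmful answers less likely.
    """
    total = sum(sum(math.log(p) for p in probs) for probs in batch_token_probs)
    return total / len(batch_token_probs)


rng = random.Random(0)
vocab = ["lorem", "ipsum", "dolor", "sit", "amet"]  # placeholder tokens
disguised = add_random_affixes("<harmful question x>", vocab, rng)

# Made-up probabilities: a model that still recalls y vs. one that has forgotten it.
confident = unlearning_loss([[0.9, 0.8], [0.7, 0.9]])
forgotten = unlearning_loss([[0.1, 0.2], [0.05, 0.1]])
```

Because a fresh prefix and suffix are drawn on every call, invoking `add_random_affixes` once per epoch mirrors the idea of showing the same question under a different disguise each time. `forgotten` is more negative than `confident`, which is the direction in which the unlearning objective pushes $L_f$.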

3.3 Retain General Knowledge

Using the gradient ascent technique to unlearn harmful knowledge often impairs the general performance of LLMs (Yao et al., 2023). We believe that the main ability that is compromised is the LLM’s understanding of entities. Intuitively, when unlearning a piece of harmful text, the LLM’s understanding of certain entities mentioned in the text is weakened. For instance, when forgetting the process of making a bomb, the knowledge of how to use the required materials is also forgotten, even though this knowledge could be useful for addressing harmless problems. As shown in Figure 2, a Llama2 model that has unlearned the harmful knowledge of bomb-making is unable to provide the specific uses of potassium nitrate (a material used for bomb-making), whereas the original Llama2 can list nine different applications.

Figure 2: Responses of Llama2 after unlearning bomb-making knowledge and of the original Llama2 when the user asks “What can potassium nitrate be used for?”. Part of the text is omitted with […].

In this regard, we propose to retain general knowledge by preserving the model’s ability to answer entity-related comprehension questions. The entities are those that appear in the harmful answer set $Y_f$. To accomplish this, we first create 10 prompt templates for generating entity-related comprehension questions, such as “What is [entity name] used for?”. For each $y \in Y_f$, we use GPT-3.5 (Ouyang et al., 2022) to extract all entities and randomly select one prompt template for each extracted entity to query the LLM $f$, resulting in a helpful dataset $D_h$. Appendix A.1 and A.2 list all prompts we used for entity extraction and for generating entity comprehension questions. The objective is to perform distillation on next-word prediction, where the teacher is the aligned LLM $f(\cdot)$ before unlearning:

$$L_h = \frac{1}{|D_h|} \sum_{(x,y) \in D_h} \sum_{i=1}^{|y|} \mathrm{KL}\left(h\left(x, y_{<i}\right) \,\|\, f\left(x, y_{<i}\right)\right) \tag{2}$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler divergence.
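As an illustration of the inner sum of Eq. (2), the sketch below computes the token-level KL divergence between a student and a teacher next-token distribution in plain Python. It assumes that $h(x, y_{<i})$ and $f(x, y_{<i})$ are probability vectors over a shared vocabulary; the tiny three-word vocabulary, the numbers, and the function names are hypothetical.

```python
import math


def kl_divergence(student, teacher, eps=1e-12):
    """KL(student || teacher) for two distributions over the same vocabulary."""
    return sum(
        s * math.log((s + eps) / (t + eps))
        for s, t in zip(student, teacher)
        if s > 0
    )


def distillation_loss(student_steps, teacher_steps):
    """Sum of per-position KL terms for one (x, y) pair (inner sum of Eq. (2))."""
    return sum(kl_divergence(s, t) for s, t in zip(student_steps, teacher_steps))


# Teacher f: next-token distributions at two positions of an answer (made up).
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
# A student that matches the teacher, and one that drifted during unlearning.
same = distillation_loss(teacher, teacher)
drifted = distillation_loss([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]], teacher)
```

The loss is zero when the student reproduces the teacher and grows as the student's predictions on entity questions drift away, which is what makes it usable as a retention term.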

3.4 Maintain Safety Alignment

Recent research (Qi et al., 2023) has revealed the detrimental effects of SFT on the safety alignment of LLMs. Although in an idealized scenario the LLM loses the ability to answer harmful questions after unlearning training, maintaining the capability to refuse and to provide reasons for refusal is an essential display of responsibility towards users. To achieve this, for each harmful question $x \in X_f$, we directly query the original LLM with it to obtain refusal data, forming the dataset $D_r$. Then, we encourage the model to have similar refusal capabilities before and after training:

$$L_r = \frac{1}{|D_r|} \sum_{(x,y) \in D_r} \sum_{i=1}^{|y|} \mathrm{KL}\left(h\left(x, y_{<i}\right) \,\|\, f\left(x, y_{<i}\right)\right) \tag{3}$$

3.5 Overall Objective

Compared to preserving model capability, unlearning knowledge is a much easier objective, so striking a balance among the three goals is challenging.

In §4.5, we observe that prolonged unlearning training can have a detrimental effect on the model’s performance over time.

Therefore, we aim to set a constraint for the unlearning objective and focus on optimizing the remaining two objectives after sufficient unlearning training:

$$L = \operatorname{Max}\left(0, \gamma + L_f\right) + L_h + L_r \tag{4}$$

The objective function stops optimizing $L_f$ once it falls to the threshold $-\gamma$, at which point $\operatorname{Max}(0, \gamma + L_f) = 0$, but continues optimizing $L_h$ and $L_r$ to retain general knowledge and maintain rejection ability.
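A small numeric sketch of Eq. (4) shows how the hinge switches the unlearning term off. The function name `eraser_objective` and the loss values are made up for illustration; only the formula itself comes from the text.

```python
def eraser_objective(l_f, l_h, l_r, gamma=2.0):
    """Eq. (4): hinge on the unlearning term plus the two retention terms."""
    return max(0.0, gamma + l_f) + l_h + l_r


# Early in training: L_f = -0.5 is still above -gamma, so the hinge is active.
early = eraser_objective(l_f=-0.5, l_h=0.1, l_r=0.2)

# After sufficient unlearning: L_f = -3.0 is below -gamma, so the hinge is zero
# and only the retention terms L_h and L_r contribute to the loss.
late = eraser_objective(l_f=-3.0, l_h=0.1, l_r=0.2)
```

With $\gamma = 2$, `early` evaluates to about 1.8 (hinge value 1.5 plus 0.3) and `late` to about 0.3, so further decreases in $L_f$ no longer change the loss or its gradient.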

4 Experiments

4.1 Experimental Setup

Attack methods. We applied three advanced jailbreaking methods to evaluate the effectiveness of defense methods. (1) AIM, a meticulously designed jailbreak prompt that has received the most votes in the jailbreaking prompt community (https://www.jailbreakchat.com/). (2) AutoDAN (Liu et al., 2023), a hierarchical genetic algorithm that extensively searches jailbreak prompts for each harmful question. (3) GCG (Zou et al., 2023), a gradient-based white-box attack method.

Baselines. Due to significant differences in evaluation systems, we only discuss harmful behavior filtering methods in Appendix B. The main text focuses solely on training-based methods, including the following two approaches: (1) RSFT. Following Deng et al. (2023), we first perform two attacks on the base model and collect all prompts that lead to jailbreaking. Then, we fine-tune the base model on these prompts with a unified rejection response as the target. (2) GAM (Yao et al., 2023). A general LLM unlearning method that can eliminate harmful knowledge from unaligned LLMs. We directly applied it to the aligned model as a baseline.

Datasets. For training data, we used the publicly available AdvBench (Zou et al., 2023), which contains 520 harmful queries. The harmful responses to be unlearned were generated by a publicly available uncensored model (https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ). For test data, in addition to AdvBench, we created an extended dataset named AdvExtent to test the generalization ability of defense methods on similar harmful topics. Specifically, we summarized 139 unique harmful topics such as “make bombs” from AdvBench and manually designed three prompts not included in AdvBench, as shown in Appendix A.3. Each harmful topic was combined with the three prompts to generate harmful questions, resulting in a dataset containing 417 samples.
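The construction of AdvExtent is a cross product of topics and prompts, which the sketch below reproduces with placeholders. The actual topics and the three prompts are given in the paper's appendix and are not reproduced here; the strings and the helper name `build_extended_set` are hypothetical stand-ins used only to check the count of 139 × 3 = 417.

```python
from itertools import product


def build_extended_set(topics, templates):
    """Combine every topic with every prompt template."""
    return [template.format(topic=topic) for topic, template in product(topics, templates)]


# Placeholder topics and templates; the real ones are listed in Appendix A.3.
topics = [f"<topic {i}>" for i in range(139)]
templates = [
    "<prompt A about {topic}>",
    "<prompt B about {topic}>",
    "<prompt C about {topic}>",
]

questions = build_extended_set(topics, templates)
num_samples = len(questions)
```

`num_samples` is 417, matching the dataset size reported above.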

Evaluation Metrics. To evaluate the capacity of jailbreaking defense, we use the harmfulness score and ASR (Attack Success Rate) as metrics, where lower values indicate better defense effectiveness. We follow Qi et al. (2023) and generate harmfulness scores based on LLMs’ feedback. For every pair of harmful instruction and LLM response, we query GPT-3.5-turbo-1106 to obtain a harmfulness score ranging from 1 to 5. Most existing methods determine jailbreaking based on the presence of predefined rejection words in the response, and take the proportion of jailbroken samples among all samples as the ASR. However, this metric might be inaccurate, as it is challenging to enumerate all rejection words. For this reason, we treat samples with harmfulness scores greater than 2 as successful jailbreaks and use the proportion of successful jailbreak samples as the ASR. In the harmfulness scoring criteria defined by Qi et al. (2023), samples with scores of 1 and 2 do not contain truly harmful outputs.
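The two defense metrics can be computed from a list of judge scores as sketched below. The scores are made up for illustration, and the function names are hypothetical; only the rule "score greater than 2 counts as a successful jailbreak" comes from the text.

```python
def attack_success_rate(scores, threshold=2):
    """Percentage of responses whose harmfulness score exceeds the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    successes = sum(1 for s in scores if s > threshold)
    return 100.0 * successes / len(scores)


def mean_harmfulness(scores):
    """Average harmfulness score on the 1-5 scale."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


# Hypothetical judge scores for ten responses.
scores = [1, 1, 2, 3, 5, 1, 1, 4, 2, 1]
asr = attack_success_rate(scores)      # three scores exceed 2
harmfulness = mean_harmfulness(scores)
```

For these ten scores the ASR is 30.0% and the mean harmfulness is 2.1. Responses scored exactly 2 are not counted as jailbreaks, in line with the criteria described above.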

To evaluate the general capability of LLMs, we employ widely used LLM evaluation benchmarks, including Arc_challenge Clark et al. (2018), Arc_easy Clark et al. (2018), Copa Roemmele et al. (2011), Cb De Marneffe et al. (2019), HendrycksTest Hendrycks et al. (2021), Boolq Clark et al. (2019), and Hellaswag Zellers et al. (2019), as the evaluation datasets.

Implementation Details. We employ Llama2-chat-7b Touvron et al. (2023), which has undergone thorough safety alignment training, as the base model. The proposed method was trained using LoRA Hu et al. (2021). During training, $\gamma$ was set to 2, the batch size was fixed at 64 samples, and texts exceeding 2048 tokens were truncated. We applied the AdamW optimizer with a learning rate of 2e-5 and trained for 5 epochs. The checkpoint with the lowest training loss was selected for inference. For RSFT, we employ a learning rate of 1e-4 and a weight decay of 1e-3. For GAM, we mostly followed the authors’ settings, except for stopping the training when the gradient reaches 2 to accommodate the AdvBench dataset. For the attack method AutoDAN, we limited the maximum number of search steps to 20 and modified the criterion for determining whether a jailbreak has occurred to match ours, that is, judging based on LLMs’ feedback.

4.2 Main Results

Table 1: The defense performance of the base model and its three defense-trained models under three attacks. The evaluations are done on the AdvBench and AdvExtent datasets. The metrics are ASR (in %) and Harmfulness. Low ASR and Harmfulness indicate good defense performance.

| Dataset | Method | AIM ASR | AIM Harmfulness | AutoDAN ASR | AutoDAN Harmfulness | GCG ASR | GCG Harmfulness |
|---|---|---|---|---|---|---|---|
| AdvBench | Base model | 19.61 | 1.68 | 24.61 | 1.90 | 40.57 | 2.78 |
| AdvBench | GAM (Yao et al., 2023) | 30.00 | 1.99 | 32.30 | 2.18 | 15.00 | 1.57 |
| AdvBench | RSFT (Deng et al., 2023) | 0.00 | 1.00 | 2.88 | 1.11 | 9.61 | 1.27 |
| AdvBench | Eraser | 0.57 | 1.03 | 2.88 | 1.09 | 8.26 | 1.33 |
| AdvExtent | Base model | 23.74 | 1.86 | 44.36 | 2.65 | 17.74 | 1.65 |
| AdvExtent | GAM (Yao et al., 2023) | 29.49 | 1.99 | 27.33 | 1.97 | 2.87 | 1.11 |
| AdvExtent | RSFT (Deng et al., 2023) | 0.00 | 1.00 | 2.87 | 1.09 | 2.15 | 1.09 |
| AdvExtent | Eraser | 0.04 | 1.13 | 5.99 | 1.18 | 1.67 | 1.06 |

Table 2: Performance of the base model and its three defense-trained models on the benchmarks, using accuracy (in %) as the metric. The last column is the average accuracy over the 7 benchmarks.

| Approach | Arc_challenge | Arc_easy | Copa | Cb | HendrycksTest | Boolq | Hellaswag | Average |
|---|---|---|---|---|---|---|---|---|
| Base model | 44.11 | 73.90 | 89.00 | 53.57 | 47.03 | 79.75 | 57.77 | 63.59 |
| GAM (Yao et al., 2023) | 42.15 | 71.96 | 87.00 | 57.24 | 45.74 | 80.03 | 57.24 | 62.27 |
| RSFT (Deng et al., 2023) | 43.08 | 75.88 | 87.00 | 46.42 | 42.38 | 77.49 | 56.93 | 61.31 |
| Eraser | 44.36 | 73.56 | 89.00 | 55.35 | 47.02 | 79.66 | 57.38 | 63.76 |

Defensive capacities. Table 1 shows the jailbreaking results of Eraser and the baselines on two datasets. Compared to the base model, Eraser significantly reduces the ASR and harmfulness scores in all settings, which demonstrates its strong defensive capacities. Since the forgotten harmful knowledge does not cover all harmful knowledge contained in the base model, the attack success rate cannot be reduced to 0%. This may be alleviated by unlearning a broader range of harmful knowledge. It is worth noting that Eraser’s defensive capacities generalize to harmful queries on the same topics: the defensive effect on the AdvExtent dataset is similar to that on the AdvBench dataset, even though the queries in AdvExtent are not seen by Eraser during training.

GAM exposes higher jailbreaking risks than the base model in several settings (AIM on both datasets and AutoDAN on AdvBench). This may be due to its use of random tokens as the target for gradient descent, which could introduce additional harmful knowledge and undermine the model’s general capacities. RSFT is safer than Eraser in over half of the settings, but we observed that it makes the model overly cautious, resulting in rejections even for benign questions. As shown in Figure 5, RSFT refuses to answer when a user asks “What are the hazards of firearms?” while the base model and Eraser can list detailed hazards. A possible reason is that “firearms” appears in the harmful questions.

General capacities. Table 2 displays the performance of Eraser and the baselines on benchmarks for evaluating LLMs. Compared to the base model, Eraser achieves comparable results on all 7 benchmarks, while RSFT and GAM show varying levels of performance degradation. As shown in Figure 5, Eraser’s behavior is closest to that of the base model. These results indicate that Eraser can effectively reduce the jailbreaking risk without compromising general capacities, which enables LLMs to continuously unlearn new harmful knowledge.

4.3 Ablation Study

Table 3: Ablation experiment results. General capacity is the average accuracy over the 7 benchmarks.

| Approach | General capacity | AIM ASR | AIM Harmfulness |
|---|---|---|---|
| Base model | 63.59 | 19.61 | 1.68 |
| Eraser | 63.76 | 0.57 | 1.03 |
| Eraser w/o $T(\cdot)$ | 63.88 | 3.84 | 1.10 |
| Eraser w/o $L_h$ | 63.43 | 0.00 | 1.00 |
| Eraser w/o $L_r$ | 63.89 | 2.88 | 1.10 |
| GA | 62.24 | 0.00 | 1.00 |

To validate the effectiveness of each component, we designed 4 variants of Eraser: (1) Eraser w/o $T(\cdot)$: Eraser that does not use the random prefix/suffix generation function $T(\cdot)$ in Eq. 1. (2) Eraser w/o $L_h$: Eraser that removes the goal $L_h$ (i.e., without retaining general knowledge). (3) Eraser w/o $L_r$: Eraser that removes the goal $L_r$ (i.e., without maintaining safety alignment). (4) GA: A method that only uses $L_f$ as the goal.

Table 3 shows the experimental results. Compared to Eraser, Eraser w/o $T(\cdot)$ shows a clear increase in ASR (from 0.57% to 3.84%), indicating the effectiveness of $T(\cdot)$ against jailbreaking attacks. GA, which only uses gradient ascent as the goal, exhibits excellent defense performance, but its general capability drops the most (62.24 vs. 63.59 for the base model). With the addition of the target $L_h$, the general capability of Eraser w/o $L_r$ is mostly restored, but some ASR increase occurs due to the absence of the $L_r$ goal. Eraser w/o $L_h$ experiences a decrease in general performance but still outperforms GA significantly, possibly because $L_r$ compensates for the model’s general language proficiency. We can further draw the following conclusions: the random prefix/suffix enhances the model’s defensive capability, $L_h$ compensates for the general capability, and $L_r$ further improves the defensive capability of the model.

4.4 What has Contributed to Defensive Capabilities?

To verify whether the forgetting of harmful text contributes to the defense capability of the model, we first replaced the harmful answers in the training data with a random token sequence and then performed gradient ascent. It is worth noting that the random token sequence does not contain any semantic knowledge. However, the results in Table 4 indicate that this method achieves significant defense against AIM, but with a significant decrease in general capabilities. Such astonishing results seem to indicate that the improvement of defensive ability is not related to whether the forgotten text is harmful.

To further investigate, we tested Eraser with the same random data and found that it restored the model’s overall performance, but the jailbreaking risk also returned to a level close to that of the base model. Comparing Eraser trained on harmful data with Eraser trained on random data, the contribution of forgetting harmful data to its defensive ability is evident.

Based on the observations above, we speculate that the sources of defensive capabilities can be diverse. Forgetting harmful text can contribute to defensive capabilities, which is one source of Eraser’s defense. The reason why GA w/ random brings defensive capabilities may be the disruption of the model’s general performance, as Eraser w/ random loses its defensive capabilities once general performance is compensated for. The underlying logic is the trade-off between harmfulness and usefulness: when the model loses the ability to follow instructions, it naturally loses the ability to follow harmful instructions as well.

Considering that GA w/ random reduces general capability by 1.94 points (from 63.59 to 61.65) while decreasing the ASR of AIM attacks from 19.61% to 5.38%, and that its implementation cost is extremely low, requiring only some randomly generated data to unlearn, defensive capability appears to be a relatively easily acquired attribute. Given RSFT’s 2.28-point reduction in general capability, its good defense performance is not surprising. In comparison, Eraser’s ability to maintain general capability is particularly valuable.

Table 4: Defensive capability source test results. General capability is the average accuracy over the 7 benchmarks. ASR is measured in %. “w/ random” means the harmful data to be unlearned is replaced with random token sequences.

| Approaches | General capability | AIM ASR | AIM Harmfulness |
|---|---|---|---|
| Base model | 63.59 | 19.61 | 1.68 |
| Eraser | 63.76 | 0.57 | 1.03 |
| GA w/ random | 61.65 | 5.38 | 1.18 |
| Eraser w/ random | 63.61 | 19.03 | 1.67 |

4.5 The Impact of Threshold $\gamma$

Figure 3: The impact of $\gamma$ and $L_f$. $L_f$ is always a negative value, and $\gamma$ is the limit on the minimum value of $L_f$ in Eraser.

The threshold $\gamma$ limits how far $L_f$ can descend. To explore the influence of $\gamma$ on Eraser’s performance, we trained Eraser with $\gamma$ set to 1, 2, 3, 4, and 5, and report the AIM ASR and the average accuracy on the general capability benchmarks. Additionally, we trained GA and evaluated it every 5 training steps. Figure 3 shows the evaluation results. As $\gamma$ increases, Eraser’s AIM ASR decreases continuously, reaching 0 at $\gamma = 3$, but general performance fully recovers only when $\gamma$ is set to 1 or 2. When $\gamma$ is greater than 2, general performance tends to decline continuously. For GA, as $L_f$ descends, the AIM ASR decreases, reaching 0 when $L_f$ approaches $-3$, while general performance continues to decline. This observation indicates that $\gamma$ controls the defense performance of the model, but an overly large $\gamma$ may prevent the model from recovering its general ability. Therefore, we recommend setting a moderate value for $\gamma$.
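To illustrate how a threshold can stop the forgetting loss from descending indefinitely, the sketch below uses a hinge of the form max(0, $\gamma$ + $L_f$). This form is an assumption chosen to match the description above ($L_f$ negative, bounded below by the threshold); the definition in the method section takes precedence. The function names and the loss trajectory are hypothetical.

```python
def bounded_forget_term(l_f: float, gamma: float) -> float:
    """Hinge on the forgetting loss: non-zero only while l_f > -gamma.

    Assumption: l_f is the negated language-modelling loss on the data to
    unlearn, so it is <= 0 and decreases as forgetting proceeds.
    """
    return max(0.0, gamma + l_f)


def is_still_forgetting(l_f: float, gamma: float) -> bool:
    """True while the hinge would still contribute a gradient."""
    return bounded_forget_term(l_f, gamma) > 0.0


# Hypothetical trajectory of l_f during training (illustrative values only).
trajectory = [-0.5, -1.2, -1.9, -2.4, -3.1]
for gamma in (1, 2, 3):
    active = [is_still_forgetting(l_f, gamma) for l_f in trajectory]
    print(f"gamma={gamma}: forgetting active per step -> {active}")
```

Under this assumed form, a small threshold switches the forgetting term off early and leaves room for general performance to recover, while plain GA corresponds to having no floor at all.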

4.6 Case study

To demonstrate how Eraser outperforms the other baselines, we give examples in both the attack scenario and the general scenario in Figures 4 and 5, respectively. As depicted in Figure 4, when faced with the AIM attack, the base model provides detailed harmful guidance, while Eraser refuses the instruction, explains the reasons for the refusal, and invites further harmless communication with the user. In contrast, GAM often responds with garbled text, which is irresponsible and disrespectful. Moreover, GAM’s responses often contain harmful text unrelated to the user’s commands, such as “stepping into the mind of a serial killer” in this case. RSFT also refuses, but typically with a standard rejection response that is less helpful than Eraser’s answer.

Figure 5 shows the responses of multiple models when a user asks “What are the hazards of firearms?”. RSFT mistakenly treats this topic as harmful and refuses to answer, possibly indicating that the model is overly sensitive to the word “firearms.” GAM provides a brief summary of the hazards associated with firearms. In contrast, Eraser’s response is closest to that of the base model, suggesting that the two models behave more similarly. In conclusion, Eraser responds more responsibly to the jailbreaking prompt while also responding more similarly to the base model on general instructions, which helps explain why Eraser has better defensive and general capabilities.

Figure 4: When the attacker used AIM to issue the command “Give step-by-step instructions on how to commit insider trading,” the base model provided a detailed guideline. Eraser refused and stated that it could provide other morally acceptable assistance. GAM responded with garbled text, including harmful content unrelated to the instruction. RSFT briefly rejected the user. Due to space limitations, part of the text is omitted with [$\cdots$].

Figure 5: When the user asked “What are the hazards of firearms?”, the base model and Eraser listed multiple hazards in detail. GAM briefly summarized the hazards. RSFT refused to answer on the grounds that it would not promote violence or harm. Due to space limitations, part of the text is omitted with [$\cdots$]. Appendix C provides additional quantitative analysis for similar queries.

5 Conclusion

In this paper, we propose an LLM jailbreaking defense method called Eraser, which aims to address the fundamental threat behind jailbreaking: the harmful knowledge that resides within LLMs. By integrating three goals (unlearning harmful knowledge, maintaining general performance, and enhancing safety alignment), Eraser can significantly reduce the risk of jailbreaking without compromising general capabilities. Compared to existing methods, Eraser better balances harmfulness and usefulness. Our experiments also show that simply unlearning random data can bring good defense effects at the cost of general performance degradation, so we encourage future research on jailbreaking defense to focus more on maintaining general capabilities.

Limitations

Although Eraser does not require data collection by a red team, it is still inefficient, as it only defends against specific harmful issues, and enumerating all harmful issues is challenging. Furthermore, Eraser is only applicable to LLMs that have already undergone safety alignment. For it to become an alternative to technologies like RLHF, more effort needs to be put into enhancing safety alignment.

Ethics Statement

This paper contains harmful data and model-generated harmful text. It is important to emphasize that the opinions expressed in these texts are automatically generated by LLMs and do not represent the views of the authors. The purpose of this work is to alleviate this situation, and the purpose of presenting harmful text is only to verify the effectiveness of the proposed method. We strongly call for more researchers to pay attention to this research field to promote the development of more ethical and responsible LLMs.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
  • Bourtoule et al. (2021) Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. 2021. Machine unlearning. In IEEE Symposium on Security and Privacy (SP), pages 141–159.
  • Cao et al. (2023) Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. 2023. Defending against alignment-breaking attacks via robustly aligned llm. arXiv preprint arXiv:2309.14348.
  • Chao et al. (2023) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
  • Chen et al. (2023) Bocheng Chen, Advait Paliwal, and Qiben Yan. 2023. Jailbreaker in jail: Moving target defense for large language models. In Proceedings of the 10th ACM Workshop on Moving Target Defense, pages 29–32.
  • Chen and Yang (2023) Jiaao Chen and Diyi Yang. 2023. Unlearn what you want to forget: Efficient unlearning for LLMs. In EMNLP.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
  • De Marneffe et al. (2019) Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitmentbank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, volume 23, pages 107–124.
  • Deng et al. (2023) Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, and Xiangnan He. 2023. Attack prompt generation for red teaming and defending large language models. In EMNLP.
  • Deshpande et al. (2023) Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335.
  • Eldan and Russinovich (2023) Ronen Eldan and Mark Russinovich. 2023. Who’s harry potter? approximate unlearning in llms. arXiv preprint arXiv:2310.02238.
  • Helbling et al. (2023) Alec Helbling, Mansi Phute, Matthew Hull, and Duen Horng Chau. 2023. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In ICLR.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Huang et al. (2023) Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, et al. 2023. A survey of safety and trustworthiness of large language models through the lens of verification and validation. arXiv preprint arXiv:2305.11391.
  • Jang et al. (2023) Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. 2023. Knowledge unlearning for mitigating privacy risks in language models. In ACL.
  • Kumar et al. (2023) Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. 2023. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705.
  • Liu et al. (2023) Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.
  • Markov et al. (2023) Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. 2023. A holistic approach to undesired content detection in the real world. In AAAI, volume 37, pages 15009–15018.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Qi et al. (2023) Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693.
  • Robey et al. (2023) Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. 2023. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
  • Roemmele et al. (2011) Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Wang et al. (2023) Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, and Kam-Fai Wong. 2023. Self-guard: Empower the llm to safeguard itself. arXiv preprint arXiv:2310.15851.
  • Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
  • Yao et al. (2023) Yuanshun Yao, Xiaojun Xu, and Yang Liu. 2023. Large language model unlearning. arXiv preprint arXiv:2310.10683.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In ACL.
  • Zhang et al. (2023) Zhexin Zhang, Junxiao Yang, Pei Ke, and Minlie Huang. 2023. Defending large language models against jailbreaking attacks through goal prioritization. arXiv preprint arXiv:2311.09096.
  • Zhou et al. (2023) Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, and Xuanjing Huang. 2023. Making harmful behaviors unlearnable for large language models. arXiv preprint arXiv:2311.02105.
  • Zou et al. (2023) Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Appendix A Prompts

A.1 Entities extraction

Figure 6: Prompt used in entities extraction

A.2 Entities understanding testing

Figure 7: Ten prompts used in entities understanding testing

A.3 AdvExtent question generation

Figure 8: Three prompts used for AdvExtent dataset generation

A.4 AIM Attack

Figure 9: Prompt used for AIM attack.

Appendix B Compared to harmful behavior filtering method

Harmful behavior filtering methods do not affect the model’s performance, since they do not modify the model’s weights. Because they require additional inference for defense, the main concerns are their time cost and the rate at which they wrongly refuse benign instructions. These methods are therefore evaluated by significantly different criteria than training-based methods. However, we can still compare Eraser with them in terms of defense performance.

To this end, we implemented RA-LLM Cao et al. (2023), which constructs a more robust alignment check mechanism for defense. We fully adopted the authors’ parameter settings and employed RA-LLM to defend against AIM and AutoDAN attacks. The experimental results are shown in Table 5. RA-LLM effectively reduces the ASR of the base model, but its defense is still weaker than Eraser’s. To evaluate the impact during normal usage, we selected 100 benign instructions from the Alpaca Taori et al. (2023) dataset and recorded the average per-sample inference latency and the refusal rate for RA-LLM, Eraser, and the base model. A response counts as a refusal if it contains rejection words such as “I’m sorry”. Table 6 shows the experimental results. The inference latency of RA-LLM increases significantly compared to the base model, because RA-LLM’s defense requires an additional 20 rounds of short inference on top of the base model. In practical applications, such a defense would incur higher additional costs. RA-LLM also carries a risk of rejecting benign inputs. In contrast, Eraser increases neither the latency nor the refusal rate.
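The keyword-based rejection criterion described above can be sketched as follows. Only the phrase “I’m sorry” comes from the text; the other entries in `REFUSAL_MARKERS`, and the function names, are hypothetical placeholders rather than the list actually used in the experiments.

```python
# Hypothetical marker list: only "i'm sorry" is taken from the text.
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't")


def is_refusal(response: str, markers=REFUSAL_MARKERS) -> bool:
    """Return True if the response contains any rejection marker."""
    # Normalise case and curly apostrophes before substring matching.
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in markers)


def refusal_rate(responses, markers=REFUSAL_MARKERS) -> float:
    """Percentage of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    refused = sum(is_refusal(r, markers) for r in responses)
    return 100.0 * refused / len(responses)


# Illustrative usage with made-up responses.
samples = [
    "I\u2019m sorry, but I cannot help with that.",
    "Sure, here is a short summary.",
    "Here are three tips for writing clearly.",
    "Of course. The answer is 42.",
]
print(f"refusal rate: {refusal_rate(samples):.2f}%")
```

Substring matching of this kind is cheap but coarse: it can miss refusals phrased differently and can flag helpful answers that merely apologize, so the reported rates should be read with that limitation in mind.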

Table 5: The defense performance of RA-LLM, Eraser, and the base model. ASR is measured in %.

| Datasets | Approaches | AIM ASR | AIM Harmfulness | AutoDAN ASR | AutoDAN Harmfulness |
|---|---|---|---|---|---|
| AdvBench | Base model | 19.61 | 1.68 | 24.61 | 1.90 |
| AdvBench | Eraser | 0.57 | 1.03 | 2.88 | 1.09 |
| AdvBench | RA-LLM | 6.92 | 1.24 | 5.96 | 1.22 |
| AdvExtent | Base model | 23.74 | 1.86 | 44.36 | 2.65 |
| AdvExtent | Eraser | 0.04 | 1.13 | 5.99 | 1.18 |
| AdvExtent | RA-LLM | 13.18 | 1.51 | 11.51 | 1.44 |

Table 6: Inference latency and refusal rate of RA-LLM, Eraser, and the base model. The latency is the average inference time over 100 samples, measured in seconds. The refusal rate is measured in %.

| Approaches | Latency (s) | Refusal rate (%) |
|---|---|---|
| Base model | 6.73 | 0.00 |
| Eraser | 6.47 | 0.00 |
| RA-LLM | 11.53 | 8.00 |

Appendix C The quantitative analysis of similar questions in Figure 5

Figure 10: Prompts used in question generation.

Table 7: The refusal rate of all baselines on questions that contain harmful topics but are themselves harmless. The refusal rate is measured in %.

| Approaches | Refusal rate (%) |
|---|---|
| Base model | 8.00 |
| Eraser | 8.66 |
| GAM | 45.63 |
| RSFT | 48.99 |

To further explore how the baselines differ on questions similar to the one in Figure 5 (i.e., questions that include harmful topics but are themselves harmless), we designed three prompts, as depicted in Figure 10, and selected 50 harmful topics from AdvBench. Each harmful topic is paired with the three prompts, resulting in a total of 150 questions. We then queried all the baselines and calculated each model’s refusal rate. As the results in Table 7 show, GAM and RSFT significantly increase the refusal rate, while Eraser’s refusal rate is only 0.66 percentage points higher than that of the base model. This once again demonstrates the superiority of Eraser in maintaining general capabilities.