Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

Weikai Lu South China University of Technology, China Ziqian Zeng South China University of Technology, China Jianwei Wang South China University of Technology, China Zhengdong Lu South China University of Technology, China Zelin Chen South China University of Technology, China
Huiping Zhuang South China University of Technology, China Cen Chen South China University of Technology, China Pazhou Laboratory, China

Abstract

Jailbreaking attacks can enable Large Language Models (LLMs) to bypass safeguards and generate harmful content. Existing jailbreaking defense methods have failed to address the fundamental issue that harmful knowledge resides within the model, leading to potential jailbreak risks for LLMs. In this paper, we propose a novel defense method called Eraser, which has three main goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment. The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it will no longer have the ability to answer harmful questions. The training of Eraser does not actually require the model’s own harmful knowledge, and it can benefit from unlearning general answers related to harmful queries, which means it does not need assistance from the red team. The experimental results show that Eraser can significantly reduce the jailbreaking success rate for various attacks without compromising the general capabilities of the model. Our code is available at https://github.com/ZeroNLP/Eraser.

This paper contains harmful data and model-generated content that can be offensive in nature.

*Corresponding author

1 Introduction

With the widespread popularity of Large Language Models (LLMs) Achiam et al. (2023); Anil et al. (2023); Touvron et al. (2023); Bai et al. (2023); Yang et al. (2023), there is a growing concern regarding the safety and potential harm associated with LLM-generated content. LLMs are trained on massive data without undergoing rigorous scrutiny (Huang et al., 2023), which can lead to undesirable content generation. To steer LLMs towards generating helpful and harmless responses, LLM alignment methods such as reinforcement learning from human feedback (RLHF) Ouyang et al. (2022) and supervised fine-tuning (SFT) have been proposed, enabling LLMs to reject harmful queries as depicted in Figure 1(a).

Figure 1: (a) Safety alignment: when the attacker directly queries a harmful question, the LLM refuses to respond because of safety alignment. (b) Jailbreaking: when the attacker asks the harmful question via an adversarial prompt, the harmful knowledge bypasses safeguards, and the LLM provides harmful responses. (c) Eraser: when the harmful knowledge is forgotten and can no longer bypass the safeguards, the LLM refuses to answer.

However, well-aligned LLMs could be fragile. Recent research works Liu et al. (2023); Chao et al. (2023); Zou et al. (2023) proposed jailbreaking attack methods which disguise the harmful queries with adversarial prompts, eliciting LLMs to bypass safeguards and generate harmful responses as depicted in Figure 1(b). Adversarial prompts are carefully designed by humans, such as enticing LLMs to play roles devoid of basic moral principles (Deshpande et al., 2023) or appending meaningless suffixes Zou et al. (2023). To enhance the efficiency of jailbreaking, several automated programs for searching adversarial prompts have been proposed Liu et al. (2023); Chao et al. (2023). These works have significantly raised the success rate of jailbreaking, while also amplifying the security risks associated with LLMs.

Currently, there are two main ways to address jailbreak attacks: (1) Harmful behavior filtering Cao et al. (2023); Kumar et al. (2023); Markov et al. (2023): These methods typically do not alter the model’s weights but censor the inputs and outputs of LLMs. Their purpose is to detect jailbreaking behavior during the model inference stage and respond with predefined warnings when jailbreaking is detected. (2) Continued training Wang et al. (2023); Zhang et al. (2023); Deng et al. (2023): These methods utilize additional training to enhance the model’s ability to reject harmful inputs or improve the model’s ability to discriminate harmful content.

Although these methods have yielded promising results, they ignore the fact that harmful knowledge still resides within the model. This harmful knowledge serves as the underlying basis for generating harmful responses. For instance, knowledge related to bomb-making plays a pivotal role in answering inquiries like “how to make bombs?” When more advanced attack methods are developed, harmful knowledge is likely to resurface, resulting in an endless cat-and-mouse game.

In light of this, the intuition behind our method is to remove the harmful knowledge from LLMs, as illustrated in Figure 1(c). We propose Eraser, a jailbreaking defense method with three main goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment to harmful inquiries. Specifically, we perform gradient ascent on harmful answers in a simulated jailbreaking mode, retain general knowledge by preserving the ability to understand entities, and enhance safety alignment by maintaining the ability to reject harmful questions.

Experimental results have shown that the proposed method can significantly reduce the success rate of various jailbreaking attacks without compromising the performance on other tasks.

The contributions of our paper are summarized as follows,

$\bullet$ We propose a method that can achieve three goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment to harmful inquiries.

$\bullet$ Experimental results demonstrate that the proposed method excels in defense capability while maintaining general capability. Compared to existing methods, it exhibits a better trade-off between harmlessness and usefulness.

$\bullet$ Experimental results show that simply using random token sequences for gradient ascent can achieve defense capabilities. This finding offers valuable insights for future endeavors in jailbreak defense.

2 Related Works

2.1 Jailbreaking Defense

Although many alignment methods have been developed to make LLMs generate ethical and responsible texts, an emerging class of attacks called jailbreaking attacks can still bypass the safeguards and cause LLMs to produce harmful and toxic responses. To combat jailbreaking attacks, existing defense strategies primarily fall into two categories: harmful behavior filtering and continued training. Harmful behavior filtering involves applying perturbations to model inputs (Cao et al., 2023; Kumar et al., 2023; Robey et al., 2023), scrutinizing model outputs (Markov et al., 2023; Helbling et al., 2023), and integrating multiple LLMs (Chen et al., 2023). These methods generally add cost to model inference. Continued training uses further SFT to enhance the security of models. For example, Wang et al. (2023) trained LLMs to evaluate the potential harm of their own responses at the end of each output; Zhang et al. (2023) trained LLMs to distinguish between harmful and helpful target prioritization, improving the model’s understanding of harmfulness; Deng et al. (2023) proposed a red team defense framework that searches for harmful prompts to train the model to reject them. However, none of these methods addresses the fundamental cause of harmful output from LLMs, namely that harmful knowledge is still retained in the model.

2.2 LLM Unlearning

Machine unlearning methods are designed to remove specified knowledge that has been learned by a model Bourtoule et al. (2021). Because LLMs are trained on massive training data, re-training them is not a practical way to forget specific knowledge. Using machine unlearning methods to mitigate privacy exposure or poisoning attacks on LLMs has become a promising research direction Jang et al. (2023); Chen and Yang (2023); Eldan and Russinovich (2023). Some recent work has attempted to solve the harmful output problem using unlearning. Zhou et al. (2023) assumed that there were harmful instructions in the SFT dataset and attempted to make harmful behaviors unlearnable during the SFT process. The work most relevant to ours is Yao et al. (2023), which uses unlearning to remove harmful responses, erase copyright-protected content, and eliminate hallucination from an unaligned model. However, Yao et al. considered LLM unlearning as an alignment method, an alternative to RLHF. In contrast, we consider LLM unlearning as a post-hoc defense strategy against jailbreaking on an aligned model.

3 Methodology

3.1 Problem Formulation

Assume there is an aligned LLM $f(\cdot)$ that refuses to answer harmful queries such as “How to make bombs?” but can still generate harmful content, such as “Sure, there are mainly three steps.”, under jailbreaking attacks. Given the aligned LLM $f(\cdot)$ and a set of harmful queries $X_{q}$, the goal is to fine-tune a new LLM $h(\cdot)$ that refuses to answer as many of the harmful queries in $X_{q}$ as possible under different jailbreaking attacks while maintaining its proficiency in handling regular queries.

We propose Eraser, a jailbreak defense method via machine unlearning. Specifically, we unlearn the corresponding answer $y$ for each $x\in X_{q}$ while maintaining proficiency in answering regular queries. Our method includes three components: unlearning harmful knowledge (§3.2), retaining general knowledge (§3.3), and maintaining safety alignment (§3.4).

3.2 Unlearn Harmful Knowledge

Following Chen and Yang (2023); Yao et al. (2023), we adopt the gradient ascent technique to implement unlearning. The current challenge lies in acquiring harmful knowledge embedded within LLMs. One possible way is to collect it with the help of red teams Deng et al. (2023), but it is labor-intensive and time-consuming. Our intuition is that multiple answers to the same question should have similarities, and forgetting one may generalize to others. Hence, we propose to utilize publicly available uncensored models to obtain harmful answers. The collected harmful dataset is denoted as $D_{f}=\left\{(x,y)|x\in X_{f},y\in Y_{f}\right\}$ , where $X_{f}$ and $Y_{f}$ are question set and answer set respectively.

For a question and answer pair $(x,y)\in D_{f}$, the existing unlearning method of Yao et al. (2023) takes $x$ as input and uses $y$ as the target to perform gradient ascent. This process aims to reduce the probability that the LLM responds with $y$ when given $x$. However, in jailbreaking attacks, $x$ is often disguised in the jailbreaking prompt, in which the adversarial prefixes and suffixes are the key to awakening harmful memories in LLMs. Therefore, we add different randomly generated prefixes and suffixes to $x$ at each epoch of training to simulate jailbreaking attack scenarios. Intuitively, we hope that regardless of how prompts are disguised, as long as $x$ is present, the model will not provide the harmful answer $y$. Let $T(\cdot)$ be a function that adds random prefixes and suffixes to strings. The unlearning training objective is defined as follows:

$$L_{f}=\frac{1}{\left|D_{f}\right|}\sum_{(x,y)\in D_{f}}\sum_{i=1}^{|y|}\log\left(p\left(y_{i}\mid T(x),y_{<i}\right)\right) \tag{1}$$

where $y_{<i}=\{y_{1},\dots,y_{i-1}\}$ denotes the first $i-1$ tokens of target sequence $y$ and $p\left(y_{i}\mid T(x),y_{<i}\right)$ denotes the conditional probability of predicting next token when given $T(x)$ and $y_{<i}$ to the LLM $h(\cdot)$ .
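As a rough illustration of Eq. 1, the sketch below pairs a toy version of $T(\cdot)$ with the summed log-likelihood term. The helper names (`add_random_affixes`, `log_likelihood`, `unlearning_objective`), the toy vocabulary, and the affix length are assumptions made for illustration and are not taken from the released code; a real implementation would obtain the per-token probabilities from the LLM $h(\cdot)$.

```python
import math
import random

# Hypothetical toy vocabulary used to build random affixes (assumption).
TOY_VOCAB = ["alpha", "##", "zx", "!!", "lorem", "42", "qv", "~"]


def add_random_affixes(x, rng, max_len=5):
    """Toy stand-in for T(x): wrap x in a random prefix and a random suffix."""
    prefix = " ".join(rng.choice(TOY_VOCAB) for _ in range(rng.randint(1, max_len)))
    suffix = " ".join(rng.choice(TOY_VOCAB) for _ in range(rng.randint(1, max_len)))
    return f"{prefix} {x} {suffix}"


def log_likelihood(token_probs):
    """Inner sum of Eq. 1: sum over i of log p(y_i | T(x), y_<i)."""
    return sum(math.log(p) for p in token_probs)


def unlearning_objective(batch_token_probs):
    """L_f: token-summed log-likelihood, averaged over the (x, y) pairs."""
    return sum(log_likelihood(p) for p in batch_token_probs) / len(batch_token_probs)


rng = random.Random(0)
query = "<harmful query placeholder>"
disguised = add_random_affixes(query, rng)  # fresh affixes would be drawn each epoch

# Made-up per-token probabilities for two answers (illustrative only).
probs = [[0.5, 0.5], [0.25, 0.5, 0.5]]
l_f = unlearning_objective(probs)
```

Under this reading, driving $L_{f}$ down lowers the log-probability of the harmful answer, which corresponds to gradient ascent on the usual negative log-likelihood.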

3.3 Retain General Knowledge

Using the gradient ascent technique to unlearn harmful knowledge often results in impaired general performance of LLMs (Yao et al., 2023). We believe that the main ability that is compromised is the LLM’s understanding of entities. Intuitively, when a piece of harmful text is unlearned, the LLM’s understanding of certain entities mentioned in the text is weakened. For instance, when forgetting the process of making a bomb, the knowledge of how to use the required materials is also forgotten, even though this knowledge could be useful for harmless problems. As shown in Figure 2, a Llama2 model that has unlearned the harmful knowledge of bomb-making is unable to provide the specific uses of potassium nitrate (a material used for bomb-making), whereas the original Llama2 could list nine different applications.

In this regard, we propose to retain general knowledge by preserving the model’s ability to answer entity-related comprehension questions. The entities are those that appear in the harmful answer set $Y_{f}$. To accomplish this, we first create $10$ prompt templates to generate entity-related comprehension questions, such as “What is [entity name] used for?”. For each $y\in Y_{f}$, we use GPT-3.5 (Ouyang et al., 2022) to extract all entities and randomly select one prompt template for each extracted entity to query the LLM $f$, resulting in a helpful dataset $D_{h}$. Appendix A.1 and A.2 display all prompts we used for entity extraction and entity comprehension question generation. The objective function performs distillation on next-word prediction, where the teacher is the aligned LLM $f(\cdot)$ before unlearning:

$$L_{h}=\frac{1}{\left|D_{h}\right|}\sum_{(x,y)\in D_{h}}\sum_{i=1}^{|y|}KL\left(h\left(x,y_{<i}\right)||f\left(x,y_{<i}\right)\right) \tag{2}$$

where $KL(\cdot||\cdot)$ denotes the Kullback-Leibler divergence.
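To make the distillation term concrete, the following sketch evaluates the per-position KL divergence of Eq. 2 on toy next-token distributions. The function names and the three-token distributions are illustrative assumptions; in practice the distributions would be the next-token outputs of the student $h(\cdot)$ and of the frozen teacher $f(\cdot)$.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_seqs, teacher_seqs):
    """Eq. 2 style loss: per-position KL(h || f), summed over positions and
    averaged over the examples in the dataset."""
    total = 0.0
    for student, teacher in zip(student_seqs, teacher_seqs):
        total += sum(kl_divergence(h_i, f_i) for h_i, f_i in zip(student, teacher))
    return total / len(student_seqs)


# One example with two token positions over a 3-token toy vocabulary.
teacher = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]]
same = distillation_loss(teacher, teacher)    # student identical to teacher
drifted = [[[0.4, 0.4, 0.2], [0.1, 0.8, 0.1]]]
loss = distillation_loss(drifted, teacher)    # student drifted at position 1
```

The loss is zero when the student reproduces the teacher exactly and grows as the student's predictions on entity questions drift away from the teacher's.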

3.4 Maintain Safety Alignment

Recent research (Qi et al., 2023) has revealed the detrimental effects of SFT on the safety alignment of LLMs. Although, in an idealized scenario, the LLM loses the ability to answer harmful questions after unlearning training, maintaining the capability to refuse and to provide reasons for refusal is an essential display of responsibility towards users. To achieve this, for each harmful question $x\in X_{f}$, we directly query the original LLM with it to obtain refusal data, forming the dataset $D_{r}$. Then, we encourage the model to have similar refusal capabilities before and after training:

$$L_{r}=\frac{1}{\left|D_{r}\right|}\sum_{(x,y)\in D_{r}}\sum_{i=1}^{|y|}KL\left(h\left(x,y_{<i}\right)||f\left(x,y_{<i}\right)\right) \tag{3}$$

3.5 Overall objective

Compared to preserving model capability, unlearning knowledge is a much easier objective, so striking a balance among the three goals is challenging.

In §4.5, we observe that prolonged unlearning training can have a detrimental effect on the model’s performance over time.

Therefore, we aim to set a constraint for the unlearning objective and focus on optimizing the remaining two objectives after sufficient unlearning training:

$$L=\max\left(0,\gamma+L_{f}\right)+L_{h}+L_{r} \tag{4}$$

The hinge term becomes zero once $L_{f}$ falls to $-\gamma$, so the objective function stops optimizing $L_{f}$ at that point but continues optimizing $L_{h}$ and $L_{r}$ to retain general knowledge and maintain rejection ability.
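The behavior of the hinge in Eq. 4 can be checked with a few numbers. The sketch below is a minimal scalar illustration; the function names and the loss values are made up for the example, and a real training loop would apply the same expression to differentiable tensors.

```python
def eraser_objective(l_f, l_h, l_r, gamma=2.0):
    """Eq. 4: hinge on the unlearning term plus the two retention terms."""
    return max(0.0, gamma + l_f) + l_h + l_r


def unlearning_active(l_f, gamma=2.0):
    """True while the hinge still contributes, i.e. while L_f > -gamma."""
    return gamma + l_f > 0.0


# Early in training: L_f = -0.5 is above -gamma, so the hinge contributes 1.5.
early = eraser_objective(l_f=-0.5, l_h=0.25, l_r=0.25)

# Later: L_f = -3.0 is below -gamma, so only L_h and L_r remain.
late = eraser_objective(l_f=-3.0, l_h=0.25, l_r=0.25)
```

Once $L_{f}\leq-\gamma$, the hinge term is constant at zero and passes no gradient to the unlearning term, so further updates come only from $L_{h}$ and $L_{r}$.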

4 Experiments

4.1 Experimental Setup

Attack methods. We applied three advanced jailbreaking methods to evaluate the effectiveness of defense methods. (1) AIM, a meticulously designed jailbreak prompt that has received the most votes in the jailbreaking prompt community (https://www.jailbreakchat.com/). (2) AutoDAN (Liu et al., 2023), a hierarchical genetic algorithm that extensively searches jailbreak prompts for each harmful question. (3) GCG (Zou et al., 2023), a gradient-based white-box attack method.

Baselines. Due to significant differences in evaluation systems, we only discuss harmful behavior filtering methods in Appendix B. The main text focuses solely on training-based methods, including the following two approaches: (1) RSFT. Following the work of Deng et al. (2023), we first perform two attacks on the base model and collect all prompts that lead to jailbreaking. Then, we fine-tune the base model with a unified rejection response as the target for these prompts. (2) GAM (Yao et al., 2023). A general LLM unlearning method that could eliminate harmful knowledge from unaligned LLMs. We directly applied it to the aligned models as a baseline.

Datasets. For training data, we used the publicly available AdvBench (Zou et al., 2023), which contains 520 harmful queries. The harmful responses to be unlearned were generated by a publicly available uncensored model (https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ). For test data, in addition to AdvBench, we created an extended dataset named AdvExtent to test the generalization ability of defense methods on similar harmful topics. Specifically, we summarized 139 unique harmful topics such as “make bombs” from AdvBench and manually designed three prompts not included in AdvBench, as shown in Appendix A.3. Each harmful topic was combined with the three prompts to generate harmful questions, resulting in a dataset containing 417 samples.
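The size of AdvExtent follows from crossing the topics with the prompts. The sketch below reproduces only that combinatorial step; the topic strings and the three templates are placeholders (the actual prompts are given in Appendix A.3), and the variable names are assumptions for illustration.

```python
from itertools import product

# Placeholder topics; the paper summarizes 139 unique topics from AdvBench.
topics = [f"<topic {i}>" for i in range(1, 140)]

# Hypothetical templates standing in for the three prompts of Appendix A.3.
templates = [
    "<prompt template A about {topic}>",
    "<prompt template B about {topic}>",
    "<prompt template C about {topic}>",
]

# Every topic is combined with every template: 139 * 3 = 417 questions.
adv_extent = [t.format(topic=topic) for topic, t in product(topics, templates)]
```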

Evaluation Metrics. To evaluate the capacity of jailbreaking defense, we use the harmfulness score and ASR (Attack Success Rate) as metrics, where lower values indicate better defense effectiveness. We follow Qi et al. (2023) and generate harmfulness scores based on LLMs’ feedback. For every pair of harmful instruction and LLM response, we query GPT-3.5-turbo-1106 to obtain a harmfulness score that ranges between 1 and 5. Most existing methods determine jailbreaking based on the presence of predefined rejection words in the response, and consider the proportion of jailbreaking samples among all samples as ASR. However, this metric might be inaccurate, as it is challenging to enumerate all rejection words. For this reason, we treat samples with harmfulness scores greater than 2 as successful jailbreaks and use the proportion of successful jailbreak samples as the ASR. In the harmfulness scoring criteria defined by Qi et al. (2023), samples with scores of 1 and 2 do not contain truly harmful outputs.
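The ASR definition above reduces to a simple threshold count. The following sketch shows the computation on made-up judge scores; the function names are illustrative, and it assumes that the reported harmfulness value is the mean score over samples, which the text does not state explicitly.

```python
def attack_success_rate(scores, threshold=2):
    """Percentage of responses whose harmfulness score exceeds the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    successes = sum(1 for s in scores if s > threshold)
    return 100.0 * successes / len(scores)


def mean_harmfulness(scores):
    """Average judge score (assumed aggregation for the harmfulness metric)."""
    return sum(scores) / len(scores)


# Made-up judge scores on the 1..5 scale for ten responses.
scores = [1, 1, 2, 3, 5, 1, 1, 2, 1, 4]
asr = attack_success_rate(scores)   # three of ten scores are above 2
harm = mean_harmfulness(scores)
```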

To evaluate the general capability of LLMs, we employ widely used LLM evaluation benchmarks including Arc_challenge Clark et al. (2018), Arc_easy Clark et al. (2018), Copa Roemmele et al. (2011), Cb De Marneffe et al. (2019), HendrycksTest Hendrycks et al. (2021), Boolq Clark et al. (2019) and Hellaswag Zellers et al. (2019) as the evaluation datasets.

Implementation Details. We employ Llama2-chat-7b Touvron et al. (2023) as the base model, which has undergone thorough safety alignment training. The proposed method was trained using LoRA Hu et al. (2021). During the training process, $\gamma$ was set to 2, the batch size was fixed at 64 samples, and texts exceeding 2048 tokens were truncated. We applied the AdamW optimizer with a learning rate of 2e-5. The number of training epochs is set to 5. The checkpoint with the lowest training loss was selected for inference. For RSFT, we employ a learning rate of 1e-4 and a weight decay of 1e-3. For GAM, we mostly followed the author’s settings, except for stopping the training when the gradient reaches 2 to accommodate the AdvBench dataset. For the attack method AutoDAN, we limited the maximum search steps to 20 and modified the criterion for determining whether a jailbreak has occurred to be the same as ours, that is, judging based on LLMs’ feedback.
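For reference, the Eraser hyperparameters quoted above can be gathered in one place. The dataclass below is only a convenience sketch: the class and field names are assumptions and do not correspond to the released code, and settings that the text does not state (for example the LoRA rank) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EraserTrainConfig:
    """Hyperparameters quoted in the text; field names are illustrative."""
    base_model: str = "Llama2-chat-7b"
    gamma: float = 2.0            # threshold in Eq. 4
    batch_size: int = 64
    max_tokens: int = 2048        # longer texts are truncated
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    num_epochs: int = 5
    use_lora: bool = True         # LoRA settings are not given in the text


config = EraserTrainConfig()
```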

4.2 Main Results

Table 1: The defense performance of the base model and its three defense-trained models under three attacks. The evaluations are done on the AdvBench and AdvExtent datasets. The metrics include ASR and Harmfulness. Low ASR and Harmfulness indicate good defense performance. ASR is measured in %.

Dataset	Method	AIM ASR	AIM Harmfulness	AutoDAN ASR	AutoDAN Harmfulness	GCG ASR	GCG Harmfulness
AdvBench	Base model	19.61	1.68	24.61	1.90	40.57	2.78
AdvBench	GAM (Yao et al., 2023)	30.00	1.99	32.30	2.18	15.00	1.57
AdvBench	RSFT (Deng et al., 2023)	0.00	1.00	2.88	1.11	9.61	1.27
AdvBench	Eraser	0.57	1.03	2.88	1.09	8.26	1.33
AdvExtent	Base model	23.74	1.86	44.36	2.65	17.74	1.65
AdvExtent	GAM (Yao et al., 2023)	29.49	1.99	27.33	1.97	2.87	1.11
AdvExtent	RSFT (Deng et al., 2023)	0.00	1.00	2.87	1.09	2.15	1.09
AdvExtent	Eraser	0.04	1.13	5.99	1.18	1.67	1.06

Table 2: Performance of the base model and its three defense-trained models on the benchmarks, using accuracy as the metric. The last column represents the average accuracy of 7 benchmarks.

Approaches	Arc_challenge	Arc_easy	Copa	Cb	HendrycksTest	Boolq	Hellaswag	Average
Base model	44.11	73.90	89.00	53.57	47.03	79.75	57.77	63.59
GAM (Yao et al., 2023)	42.15	71.96	87.00	57.24	45.74	80.03	57.24	62.27
RSFT (Deng et al., 2023)	43.08	75.88	87.00	46.42	42.38	77.49	56.93	61.31
Eraser	44.36	73.56	89.00	55.35	47.02	79.66	57.38	63.76

Defensive capacities. Table 1 shows the jailbreaking results of Eraser and the baselines on two datasets. Compared to the base model, Eraser significantly reduces the ASR and harmfulness scores in all settings, which demonstrates its strong defensive capacities. Since the forgotten harmful knowledge does not cover all harmful knowledge contained in the base model, the attack success rate cannot be reduced to 0%. This limitation may be alleviated by unlearning a broader range of harmful knowledge. It is worth noting that Eraser’s defensive capacities generalize to harmful queries on the same topics: the defensive effect on the AdvExtent dataset is similar to that on the AdvBench dataset, even though the queries in AdvExtent are not seen by Eraser during training.

For GAM, higher jailbreaking risks than the base model were exposed in half of the settings (AIM on both datasets and AutoDAN on AdvBench). This may be due to the use of random tokens as the target for gradient descent, which could introduce additional harmful knowledge and undermine the model’s general capacities. RSFT is safer than Eraser in over half of settings, but we observed that it makes the model overly cautious, resulting in rejections even for benign questions. As shown in Figure 5, RSFT refuses to answer when a user asks “What are the hazards of firearms?” while the base model and Eraser can list detailed hazards. The possible reason is that “firearms” appears in the harmful questions.

General capacities. Table 2 displays the performance of Eraser and the baselines on benchmarks for evaluating LLMs. Compared to the base model, Eraser achieves comparable results on all 7 benchmarks, while RSFT and GAM show varying levels of performance degradation. As shown in Figure 5, Eraser’s behavior is closest to that of the base model. These results indicate that Eraser can effectively reduce the jailbreaking risk without compromising general capacities, which enables LLMs to continuously unlearn new harmful knowledge.

4.3 Ablation Study

Table 3: Ablation experiment results. General capacity represents the average accuracy of 7 benchmarks.

Approaches	General capacity	AIM ASR	AIM Harmfulness
Base model	63.59	19.61	1.68
Eraser	63.76	0.57	1.03
Eraser w/o $T(\cdot)$	63.88	3.84	1.10
Eraser w/o $L_{h}$	63.43	0.00	1.00
Eraser w/o $L_{r}$	63.89	2.88	1.10
GA	62.24	0.00	1.00

To validate the effectiveness of each component, we designed 4 variants of Eraser: (1) Eraser w/o $T(\cdot)$: Eraser that does not use the random prefix/suffix generation function $T(\cdot)$ in Eq. 1. (2) Eraser w/o $L_{h}$: Eraser that removes the goal $L_{h}$ (i.e., without retaining general knowledge). (3) Eraser w/o $L_{r}$: Eraser that removes the goal $L_{r}$ (i.e., without maintaining safety alignment). (4) GA: A method that only utilizes $L_{f}$ as the goal.

Table 3 shows the experimental results. Compared to Eraser, Eraser w/o $T(\cdot)$ shows a significant increase in ASR, indicating the effectiveness of $T(\cdot)$ against jailbreaking attacks. GA, which only uses gradient ascent as the goal, exhibits excellent defense performance, but its general capability is severely impaired. With the addition of the target $L_{h}$, the general capability of Eraser w/o $L_{r}$ is mostly restored, but some ASR increase occurs due to the absence of the $L_{r}$ goal. Eraser w/o $L_{h}$ experiences a decrease in general performance but still outperforms GA significantly, possibly because $L_{r}$ compensates for the model’s general language proficiency. We can further draw the following conclusions: the random prefix/suffix enhances the model’s defensive capability, $L_{h}$ compensates for the general capability, and $L_{r}$ further improves the defensive capability of the model.

4.4 What has Contributed to Defensive Capabilities?

To verify whether the forgetting of harmful text contributes to the defense capability of the model, we first replaced the harmful answers in the training data with a random token sequence and then performed gradient ascent. It is worth noting that the random token sequence does not contain any semantic knowledge. However, the results in Table 4 indicate that this method achieves significant defense against AIM, but with a significant decrease in general capabilities. Such astonishing results seem to indicate that the improvement of defensive ability is not related to whether the forgotten text is harmful.

To further investigate, we tested Eraser with the same random data and found that it restored the model’s overall performance, but the jailbreaking risk also returned to a level close to that of the base model. Comparing Eraser trained on harmful data with Eraser trained on random data, the contribution of forgetting harmful data to its defensive ability is evident.

Based on the observations above, we speculate that the sources of defensive capabilities can be diverse. Forgetting harmful text can contribute to defensive capabilities, which is a source of Eraser defense. The reason why GA w/ random brings defensive capabilities may be due to the disruption of the model’s general performance, as Eraser w/ random loses its defensive capabilities by compensating for general performance. The underlying logic is the trade-off between harmfulness and usefulness. The model loses the ability to follow instructions, naturally losing the ability to follow harmful instructions as well.

Considering that GA w/ random reduces the general ability by 1.94% while decreasing the ASR of AIM attacks from 19.61% to 5.38%, and that its implementation cost is extremely low, requiring only some randomly generated data to unlearn, defensive capability appears to be a relatively easily acquired attribute. Given RSFT’s 2.28% reduction in general capability, its good defense performance is not surprising. In comparison, Eraser’s ability to maintain general capability is particularly valuable.

Table 4: Defensive capability source test results. General capacity represents the average accuracy of the 7 benchmarks. The w/ random replaces harmful data to be unlearned with random token sequence.

Approaches	General capability	AIM ASR	AIM Harmfulness
Base model	63.59	19.61	1.68
Eraser	63.76	0.57	1.03
GA w/ random	61.65	5.38	1.18
Eraser w/ random	63.61	19.03	1.67

4.5 The Impact of Threshold $\gamma$

The threshold $\gamma$ bounds how far $L_{f}$ is driven down during training. To explore the influence of $\gamma$ on Eraser’s performance, we trained Eraser with $\gamma$ set to 1, 2, 3, 4, and 5, respectively, and reported the AIM ASR and the average accuracy on the general capacity benchmarks. Additionally, we trained GA and evaluated it every 5 training steps. Figure 3 shows the evaluation results. As $\gamma$ increases, Eraser’s AIM ASR continuously decreases, reaching 0 at $\gamma=3$, but general performance only fully recovers when $\gamma$ is set to 1 or 2. When $\gamma$ is greater than 2, the general performance tends to decline continuously. For GA, as $L_{f}$ descends, the AIM ASR decreases, reaching 0 when $L_{f}$ approaches -3, while general performance continues to decline. This observation indicates that $\gamma$ plays a controlling role in the defense performance of the model, but an overly large $\gamma$ may prevent the model from recovering its general ability. Therefore, we recommend setting a moderate value for $\gamma$.

4.6 Case study

To demonstrate how Eraser outperforms other baselines, we give examples in both the attack scenario and the general scenario in Figures 4 and 5, respectively. As depicted in Figure 4, when faced with the AIM attack, the base model provides detailed harmful guidance, while Eraser refuses the instruction, explains the refusal reasons, and expects further harmless communication with the user. In contrast, GAM often responds with garbled text, which is irresponsible and disrespectful. Moreover, GAM’s responses often contain harmful text unrelated to the user’s commands, such as “stepping into the mind of a serial killer” in the case. RSFT can also provide refusals but typically offers standard rejection responses that are not as helpful as Eraser’s answers.

Figure 5 shows the responses of multiple models when a user asks “What are the hazards of firearms?”. RSFT mistakenly treats this topic as harmful and refuses to answer, possibly indicating that the model is overly sensitive to the word “firearms.” GAM can provide a summary of the hazards associated with firearms. In contrast, Eraser’s response is closest to that of the base model, suggesting that the two exhibit more similar behavior. In conclusion, Eraser responds more responsibly to the jailbreaking prompt while also responding more similarly to the base model for general instructions, which explains why Eraser has better defensive and general capabilities.

5 Conclusion

In this paper, we propose an LLM jailbreaking defense method called Eraser, which aims to address the fundamental threat behind jailbreaking, namely the harmful knowledge that resides within LLMs. By integrating three goals (unlearning harmful knowledge, maintaining general performance, and enhancing safety alignment), Eraser can significantly reduce the risk of jailbreaking without compromising general capabilities. Compared to existing methods, Eraser can better balance harmfulness and usefulness. Our experiments also show that simply unlearning random data can bring good defense effects at the cost of general performance degradation, so we encourage future research on jailbreaking defense to focus more on maintaining general capabilities.

Limitations

Although Eraser does not require data collection by a red team, it is still inefficient, as it only defends against specific harmful issues, and enumerating all harmful issues is challenging. Furthermore, Eraser is only applicable to LLMs that have undergone safety alignment. To become an alternative to technologies like RLHF, more effort needs to be put into enhancing safety alignment.

Ethics Statement

This paper contains harmful data and model-generated harmful text. It is important to emphasize that the opinions expressed in these texts are automatically generated by LLMs and do not represent the views of the authors. The purpose of this work is to alleviate this situation, and the purpose of presenting harmful text is only to verify the effectiveness of the proposed method. We strongly call for more researchers to pay attention to this research field to promote the development of more ethical and responsible LLMs.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
Bourtoule et al. (2021) Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. 2021. Machine unlearning. In IEEE Symposium on Security and Privacy (SP), pages 141–159.
Cao et al. (2023) Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. 2023. Defending against alignment-breaking attacks via robustly aligned llm. arXiv preprint arXiv:2309.14348.
Chao et al. (2023) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
Chen et al. (2023) Bocheng Chen, Advait Paliwal, and Qiben Yan. 2023. Jailbreaker in jail: Moving target defense for large language models. In Proceedings of the 10th ACM Workshop on Moving Target Defense, pages 29–32.
Chen and Yang (2023) Jiaao Chen and Diyi Yang. 2023. Unlearn what you want to forget: Efficient unlearning for LLMs. In EMNLP.
Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL.
Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
De Marneffe et al. (2019) Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitmentbank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, volume 23, pages 107–124.
Deng et al. (2023) Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, and Xiangnan He. 2023. Attack prompt generation for red teaming and defending large language models. In EMNLP.
Deshpande et al. (2023) Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335.
Eldan and Russinovich (2023) Ronen Eldan and Mark Russinovich. 2023. Who’s harry potter? approximate unlearning in llms. arXiv preprint arXiv:2310.02238.
Helbling et al. (2023) Alec Helbling, Mansi Phute, Matthew Hull, and Duen Horng Chau. 2023. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308.
Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In ICLR.
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Huang et al. (2023) Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, et al. 2023. A survey of safety and trustworthiness of large language models through the lens of verification and validation. arXiv preprint arXiv:2305.11391.
Jang et al. (2023) Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. 2023. Knowledge unlearning for mitigating privacy risks in language models. In ACL.
Kumar et al. (2023) Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. 2023. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705.
Liu et al. (2023) Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.
Markov et al. (2023) Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. 2023. A holistic approach to undesired content detection in the real world. In AAAI, volume 37, pages 15009–15018.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Qi et al. (2023) Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693.
Robey et al. (2023) Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. 2023. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
Roemmele et al. (2011) Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI.
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wang et al. (2023) Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, and Kam-Fai Wong. 2023. Self-guard: Empower the llm to safeguard itself. arXiv preprint arXiv:2310.15851.
Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
Yao et al. (2023) Yuanshun Yao, Xiaojun Xu, and Yang Liu. 2023. Large language model unlearning. arXiv preprint arXiv:2310.10683.
Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In ACL.
Zhang et al. (2023) Zhexin Zhang, Junxiao Yang, Pei Ke, and Minlie Huang. 2023. Defending large language models against jailbreaking attacks through goal prioritization. arXiv preprint arXiv:2311.09096.
Zhou et al. (2023) Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, and Xuanjing Huang. 2023. Making harmful behaviors unlearnable for large language models. arXiv preprint arXiv:2311.02105.
Zou et al. (2023) Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Appendix A Prompts

A.1 Entities extraction

A.2 Entities understanding testing

A.3 AdvExtent question generation

A.4 AIM Attack

Appendix B Comparison with harmful behavior filtering methods

Harmful behavior filtering methods do not affect the model’s performance because they do not modify the model’s weights. Since they require additional inference for defense, the main concerns are typically their time cost and their rate of wrongly refusing benign instructions. These methods are therefore evaluated quite differently from training-based methods. However, we can still compare Eraser with them in terms of defense performance.

To this end, we implemented RA-LLM Cao et al. (2023), which constructs a more robust alignment check mechanism for defense. We fully adopted the authors’ parameter settings and employed RA-LLM to defend against AIM and AutoDAN attacks. The experimental results are shown in Table 5. RA-LLM effectively reduces the ASR of the base model, but its defense is still weaker than that of Eraser. To evaluate the impact during normal usage, we selected 100 benign instructions from the Alpaca Taori et al. (2023) dataset and recorded the average per-sample inference latency and the refusal rate for RA-LLM, Eraser, and the base model. A response counts as a refusal if it contains rejection words such as “I’m sorry”. Table 6 shows the experimental results. The inference latency of RA-LLM increases significantly compared to the base model (from 6.73 s to 11.53 s on average, roughly 1.7 times), because its defense requires an additional 20 rounds of short inference on top of the base model. In practical applications, such defense measures would incur higher additional costs. RA-LLM also carries a risk of rejecting benign inputs (8.00% refusal rate). In contrast, Eraser increases neither the latency nor the refusal rate relative to the base model.
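The benign-usage measurement described above can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation script: `generate` stands for any hypothetical callable that maps an instruction to a response, and the marker list is an assumption, since the text names only “I’m sorry” as an example rejection phrase.

```python
import time
from typing import Callable, Iterable, Tuple

# Assumed marker list: only "I'm sorry" is named in the text; the others are illustrative.
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't", "i apologize")


def is_refusal(response: str, markers: Tuple[str, ...] = REFUSAL_MARKERS) -> bool:
    """Return True if the response contains any rejection phrase (case-insensitive)."""
    # Normalize typographic apostrophes so "I’m sorry" and "I'm sorry" both match.
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in markers)


def evaluate_benign(
    generate: Callable[[str], str], instructions: Iterable[str]
) -> Tuple[float, float]:
    """Return (average latency in seconds, refusal rate in %) over the instructions."""
    prompts = list(instructions)
    if not prompts:
        raise ValueError("at least one instruction is required")
    total_seconds = 0.0
    refusals = 0
    for prompt in prompts:
        start = time.perf_counter()
        response = generate(prompt)
        total_seconds += time.perf_counter() - start
        refusals += int(is_refusal(response))
    return total_seconds / len(prompts), 100.0 * refusals / len(prompts)
```

Keyword matching of this kind is cheap but coarse: it may miss refusals phrased differently and may flag helpful answers that happen to contain an apology, so the marker list would need to be checked against actual model outputs.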

Table 5: The defense performance of RA-LLM, Eraser and the base model. ASR is measured in %.

Dataset	Approach	AIM ASR	AIM Harmfulness	AutoDAN ASR	AutoDAN Harmfulness
AdvBench	Base model	19.61	1.68	24.61	1.90
AdvBench	Eraser	0.57	1.03	2.88	1.09
AdvBench	RA-LLM	6.92	1.24	5.96	1.22
AdvExtent	Base model	23.74	1.86	44.36	2.65
AdvExtent	Eraser	0.04	1.13	5.99	1.18
AdvExtent	RA-LLM	13.18	1.51	11.51	1.44

Table 6: Inference latency and refusal rate of RA-LLM, Eraser, and the base model. The latency reports the average inference time for 100 samples, measured in seconds. The refusal rate is measured in %.

Approaches	Latency	Refusal rate
Base model	6.73	0.00
Eraser	6.47	0.00
RA-LLM	11.53	8.00

Appendix C The quantitative analysis of similar questions in Figure 5

Table 7: The refusal rate of all baselines when queried with questions that contain harmful topics but are themselves harmless. The refusal rate is measured in %.

Approaches	Refusal rate
Base model	8.00
Eraser	8.66
GAM	45.63
RSFT	48.99

To further examine how the baselines differ on questions similar to those in Figure 5 (i.e., questions that mention harmful topics but are themselves harmless), we designed three prompts, as depicted in Figure 10, and selected 50 harmful topics from AdvBench. Each harmful topic is paired with the three prompts, resulting in a total of 150 questions. We then query all the baselines with these questions and calculate each model’s refusal rate. As shown in Table 7, GAM and RSFT significantly increase the refusal rate, while Eraser’s refusal rate is only 0.66 percentage points higher than that of the base model. This once again demonstrates the superiority of Eraser in maintaining general capabilities.
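The construction of the question set and the comparison against the base model can be illustrated with a short sketch. The three template strings and the topic names below are hypothetical placeholders, since the actual prompts are those of Figure 10 and the topics come from AdvBench; only the refusal rates are taken from Table 7.

```python
from itertools import product
from typing import Dict, List, Sequence

# Hypothetical placeholders: the real prompts are the three shown in Figure 10.
PLACEHOLDER_TEMPLATES = (
    "[placeholder prompt 1] {topic}",
    "[placeholder prompt 2] {topic}",
    "[placeholder prompt 3] {topic}",
)

# Refusal rates (%) as reported in Table 7.
TABLE_7 = {"Base model": 8.00, "Eraser": 8.66, "GAM": 45.63, "RSFT": 48.99}


def build_questions(topics: Sequence[str], templates: Sequence[str]) -> List[str]:
    """Pair every topic with every template (50 topics x 3 templates = 150 questions)."""
    return [template.format(topic=topic) for topic, template in product(topics, templates)]


def gap_over_base(rates: Dict[str, float], base: str = "Base model") -> Dict[str, float]:
    """Difference in refusal rate to the base model, in percentage points."""
    return {
        name: round(rate - rates[base], 2)
        for name, rate in rates.items()
        if name != base
    }


# Placeholder topic names standing in for the 50 topics selected from AdvBench.
topics = [f"topic_{i:02d}" for i in range(50)]
questions = build_questions(topics, PLACEHOLDER_TEMPLATES)
gaps = gap_over_base(TABLE_7)
```

Here `len(questions)` is 150, and `gaps` holds each method's increase over the base model in percentage points, which is the quantity the paragraph compares.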
