-
Defending LLMs against Jailbreaking Attacks via Backtranslation
Authors:
Yihan Wang,
Zhouxing Shi,
Andrew Bai,
Cho-Jui Hsieh
Abstract:
Although many large language models (LLMs) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks which rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks by "backtranslation". Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. The inferred prompt is called the backtranslated prompt, which tends to reveal the actual intent of the original prompt, since it is generated based on the LLM's response and not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. We explain that the proposed defense provides several benefits in terms of effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines in the cases that are hard for the baselines, and that it has little impact on the generation quality for benign input prompts. Our implementation is based on our library for LLM jailbreaking defense algorithms at https://github.com/YihanWang617/llm-jailbreaking-defense, and the code for reproducing our experiments is available at https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation.
Submitted 6 June, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
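The defense loop described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `target_llm` and `backtranslator` are hypothetical callables standing in for real models, the keyword-based `is_refusal` check is an assumed placeholder for whatever refusal detection is actually used, and returning the first response unchanged when nothing is refused is also an assumption.

```python
from typing import Callable

# Hypothetical refusal markers; a real system would need a more robust check.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am unable")


def is_refusal(response: str) -> bool:
    """Crude keyword test for whether a response is a refusal (assumption)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def defend_with_backtranslation(
    prompt: str,
    target_llm: Callable[[str], str],
    backtranslator: Callable[[str], str],
    refusal_message: str = "I'm sorry, but I cannot assist with that request.",
) -> str:
    # Step 1: generate the initial response from the (possibly adversarial) prompt.
    response = target_llm(prompt)
    if is_refusal(response):
        # Assumption: if the model already refuses, there is nothing more to check.
        return response

    # Step 2: infer a prompt that could have produced this response.
    # The inferred prompt depends only on the response, not on the attacker's text.
    backtranslated_prompt = backtranslator(response)

    # Step 3: run the target model again on the backtranslated prompt.
    second_response = target_llm(backtranslated_prompt)

    # Step 4: refuse the original prompt if the backtranslated prompt is refused.
    if is_refusal(second_response):
        return refusal_message
    return response
```

Because the check in step 4 only looks at the model's own reaction to the inferred prompt, the sketch needs no extra training; its cost is one backtranslation call plus one additional call to the target model.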
-
Which Pretrain Samples to Rehearse when Finetuning Pretrained Models?
Authors:
Andrew Bai,
Chih-Kuan Yeh,
Cho-Jui Hsieh,
Ankur Taly
Abstract:
Fine-tuning pretrained foundational models on specific tasks is now the de facto approach for text and vision tasks. A known pitfall of this approach is the forgetting of pretraining knowledge that happens during finetuning. Rehearsing samples randomly from the pretrain dataset is a common approach to alleviate such forgetting. However, we find that random mixing unintentionally includes samples which are not (yet) forgotten or unlearnable by the model. We propose a novel sampling scheme, mix-cd, that identifies and prioritizes samples that actually face forgetting, which we call collateral damage. Since directly identifying collateral damage samples is computationally expensive, we propose a procedure to estimate the distribution of such samples by tracking the statistics of finetuned samples. Our approach is lightweight, easy to implement, and can be seamlessly integrated into existing models, offering an effective means to retain pretrain performance without additional computational costs.
Submitted 12 February, 2024;
originally announced February 2024.
-
Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation
Authors:
Tong Xie,
Haoyu Li,
Andrew Bai,
Cho-Jui Hsieh
Abstract:
Data attribution methods trace model behavior back to its training dataset, offering an effective approach to better understand "black-box" neural networks. While prior research has established quantifiable links between model output and training data in diverse settings, interpreting diffusion model outputs in relation to training samples remains underexplored. In particular, diffusion models operate over a sequence of timesteps instead of instantaneous input-output relationships in previous contexts, posing a significant challenge to extend existing frameworks to diffusion models directly. Notably, we present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss gradient norms are highly dependent on timestep. This trend leads to a prominent bias in influence estimation, and is particularly noticeable for samples trained on large-norm-inducing timesteps, causing them to be generally influential. To mitigate this effect, we introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest, facilitating a localized measurement of influence and considerably more intuitive visualization. We demonstrate the efficacy of our approach through various evaluation metrics and auxiliary tasks, reducing the number of generally influential samples to $\frac{1}{3}$ of its original quantity.
Submitted 21 January, 2024; v1 submitted 17 January, 2024;
originally announced January 2024.
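The norm-induced bias mentioned in the abstract above can be illustrated with a toy calculation. This sketch assumes a single-checkpoint, TracIn-style score (learning rate times the dot product of training and test loss gradients) and assumes that the re-normalization divides by the training gradient's norm; the abstract does not spell out the exact normalization, so the function names and the formula here are illustrative only.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def tracin_score(g_train: Sequence[float], g_test: Sequence[float], lr: float = 1.0) -> float:
    """Single-checkpoint gradient-similarity influence score."""
    return lr * dot(g_train, g_test)


def renormalized_score(g_train: Sequence[float], g_test: Sequence[float], lr: float = 1.0) -> float:
    """Same score divided by the training gradient norm (assumed normalization)."""
    norm = math.sqrt(dot(g_train, g_train))
    return tracin_score(g_train, g_test, lr) / norm if norm > 0 else 0.0


g_test = [1.0, 0.0]
g_large_norm = [10.0, 10.0]  # large gradient, only partly aligned with the test gradient
g_aligned = [1.0, 0.0]       # small gradient, perfectly aligned with the test gradient

# Without normalization the large-norm sample dominates;
# with normalization the aligned sample ranks higher.
print(tracin_score(g_large_norm, g_test), tracin_score(g_aligned, g_test))
print(renormalized_score(g_large_norm, g_test), renormalized_score(g_aligned, g_test))
```

In this toy setting the large-norm sample would look influential for almost any test gradient, which is the kind of "generally influential" behavior the abstract describes.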
-
Examining the Influence of Job Satisfaction on Individual Innovation and Its Components: Considering the Moderating Role of Technostress
Authors:
Fatemeh Daneshmandi,
Hassan Hessari,
Tahmineh Nategh,
Ali Bai
Abstract:
Background: Employee innovation is a crucial aspect of organizations in the current era. Therefore, studying the factors influencing individual innovation is vital and unavoidable. Undoubtedly, job satisfaction is a significant variable in management sciences. Nowadays, all organizations are interconnected with technology. Objective: This research explores the relationship between job satisfaction and individual innovation, including its components, and the moderating role of technostress. Research Method: This study, in terms of purpose, is applied, and in terms of data collection method, it is a descriptive survey. Data collection tools included the Technostress Inventory by Tarafdar and colleagues (2007), Janssen's Individual Innovation Questionnaire (2000), and the Job Satisfaction Survey (JSS) by Spector (1994). The validity and reliability of these questionnaires were confirmed. The sample size for this study was 215, and data analysis was performed using SPSS and SMART-PLS software. Findings: Job satisfaction has a significant and positive relationship with individual innovation, idea generation, idea promotion, and idea implementation. Technostress moderates the relationship between job satisfaction and individual innovation, as well as idea generation and idea promotion. However, technostress does not play a moderating role in the relationship between job satisfaction and idea implementation. Conclusion: Based on the obtained results, organizations should take necessary measures to increase job satisfaction and reduce technostress among their employees.
Submitted 20 October, 2023;
originally announced October 2023.
-
RD-Suite: A Benchmark for Ranking Distillation
Authors:
Zhen Qin,
Rolf Jagerman,
Rama Pasumarthi,
Honglei Zhuang,
He Zhang,
Aijun Bai,
Kai Hui,
Le Yan,
Xuanhui Wang
Abstract:
The distillation of ranking models has become an important topic in both academia and industry. In recent years, several advanced methods have been proposed to tackle this problem, often leveraging ranking information from teacher rankers that is absent in traditional classification settings. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide range of tasks and datasets makes it difficult to assess or invigorate advances in this field. This paper first examines representative prior art on ranking distillation, and raises three questions to be answered around methodology and reproducibility. To that end, we propose a systematic and unified benchmark, Ranking Distillation Suite (RD-Suite), which is a suite of tasks with 4 large real-world datasets, encompassing two major modalities (textual and numeric) and two applications (standard distillation and distillation transfer). RD-Suite consists of benchmark results that challenge some of the common wisdom in the field, and the release of datasets with teacher scores and evaluation scripts for future research. RD-Suite paves the way towards better understanding of ranking distillation, facilitates more research in this direction, and presents new challenges.
Submitted 12 June, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
-
Regression Compatible Listwise Objectives for Calibrated Ranking with Binary Relevance
Authors:
Aijun Bai,
Rolf Jagerman,
Zhen Qin,
Le Yan,
Pratyush Kar,
Bing-Rong Lin,
Xuanhui Wang,
Michael Bendersky,
Marc Najork
Abstract:
As Learning-to-Rank (LTR) approaches primarily seek to improve ranking quality, their output scores are not scale-calibrated by design. This fundamentally limits LTR usage in score-sensitive applications. Though a simple multi-objective approach that combines a regression and a ranking objective can effectively learn scale-calibrated scores, we argue that the two objectives are not necessarily compatible, which makes the trade-off less ideal for either of them. In this paper, we propose a practical regression compatible ranking (RCR) approach that achieves a better trade-off, where the ranking and regression components are proved to be mutually aligned. Although the same idea applies to ranking with both binary and graded relevance, we mainly focus on binary labels in this paper. We evaluate the proposed approach on several public LTR benchmarks and show that it consistently achieves either the best or competitive results in terms of both regression and ranking metrics, and significantly improves the Pareto frontiers in the context of multi-objective optimization. Furthermore, we evaluated the proposed approach on YouTube Search and found that it not only improved the ranking quality of the production pCTR model, but also brought gains to the click prediction accuracy. The proposed approach has been successfully deployed in the YouTube production system.
Submitted 21 August, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
-
Reducing Training Sample Memorization in GANs by Training with Memorization Rejection
Authors:
Andrew Bai,
Cho-Jui Hsieh,
Wendy Kan,
Hsuan-Tien Lin
Abstract:
The generative adversarial network (GAN) continues to be a popular research direction due to its high generation quality. It is observed that many state-of-the-art GANs generate samples that are more similar to the training set than to a holdout testing set from the same distribution, hinting that some training samples are implicitly memorized in these models. This memorization behavior is unfavorable in many applications that demand the generated samples to be sufficiently distinct from known samples. Nevertheless, it is unclear whether it is possible to reduce memorization without compromising the generation quality. In this paper, we propose memorization rejection, a training scheme that rejects generated samples that are near-duplicates of training samples during training. Our scheme is simple, generic and can be directly applied to any GAN architecture. Experiments on multiple datasets and GAN models validate that memorization rejection effectively reduces training sample memorization, and in many cases does not sacrifice the generation quality. Code to reproduce the experiment results can be found at https://github.com/jybai/MRGAN.
Submitted 21 October, 2022;
originally announced October 2022.
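The rejection step described in the abstract above can be sketched as a filter applied to each generated batch before it is used in a training update. This is only an illustration under stated assumptions: near-duplicates are detected with plain Euclidean distance to the nearest training sample, and `threshold` is a hypothetical hyperparameter; the abstract does not specify the distance measure or the space in which it is computed.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def reject_memorized(
    generated: Sequence[Vector],
    training: Sequence[Vector],
    threshold: float,
) -> List[Vector]:
    """Keep only generated samples whose nearest training sample is at least
    `threshold` away (assumed Euclidean distance; hypothetical criterion)."""
    kept = []
    for sample in generated:
        nearest = min(math.dist(sample, train_sample) for train_sample in training)
        if nearest >= threshold:
            kept.append(sample)
    return kept


training_set = [(0.0, 0.0), (1.0, 1.0)]
batch = [(0.0, 0.05), (3.0, 3.0)]

# The first generated point nearly copies a training point and is dropped.
print(reject_memorized(batch, training_set, threshold=0.5))
```

In an actual GAN training loop the surviving samples would then be passed on to the usual discriminator and generator updates; the brute-force nearest-neighbor search here would likely need to be replaced by something faster for large training sets.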
-
Concept Gradient: Concept-based Interpretation Without Linear Assumption
Authors:
Andrew Bai,
Chih-Kuan Yeh,
Pradeep Ravikumar,
Neil Y. C. Lin,
Cho-Jui Hsieh
Abstract:
Concept-based interpretations of black-box models are often more intuitive for humans to understand. The most widely adopted approach for concept-based interpretation is Concept Activation Vector (CAV). CAV relies on learning a linear relation between some latent representation of a given model and concepts. The linear separability is usually implicitly assumed but does not hold true in general. In this work, we started from the original intent of concept-based interpretation and proposed Concept Gradient (CG), extending concept-based interpretation beyond linear concept functions. We showed that for a general (potentially non-linear) concept, we can mathematically evaluate how a small change of concept affects the model's prediction, which leads to an extension of gradient-based interpretation to the concept space. We demonstrated empirically that CG outperforms CAV in both toy examples and real world datasets.
Submitted 5 February, 2024; v1 submitted 31 August, 2022;
originally announced August 2022.
-
Million.js: A Fast Compiler-Augmented Virtual DOM for the Web
Authors:
Aiden Bai
Abstract:
Interactive web applications created with declarative JavaScript User Interface (UI) libraries have increasingly dominated the modern internet. However, existing libraries are primarily made for run-time execution, and rely on the user to load and render web applications. This led us to create Million.js, a fast compiler-augmented virtual Document Object Model (DOM) for the web. Million.js reduces load time and time-to-interactive by creating a compiler to compute interactive regions of a web application before the user visits the page. The virtual DOM run-time optimizes interactive content through compiler flags, compute batching, scheduling, and reactive data primitives to achieve optimal performance. When benchmarked against the most popular virtual DOM libraries, Million.js resulted in 133% to 300% faster rendering and 2347% faster load. In a real-world web application with both comparative benchmarks and an informal user study, Million.js loaded 35.11% faster after migrating from React. The findings show that web applications have the potential to be orders of magnitude faster through JavaScript UI libraries that use Million.js.
Submitted 1 January, 2023; v1 submitted 16 February, 2022;
originally announced February 2022.
-
RoboCup 2D Soccer Simulation League: Evaluation Challenges
Authors:
Mikhail Prokopenko,
Peter Wang,
Sebastian Marian,
Aijun Bai,
Xiao Li,
Xiaoping Chen
Abstract:
We summarise the results of RoboCup 2D Soccer Simulation League in 2016 (Leipzig), including the main competition and the evaluation round. The evaluation round held in Leipzig confirmed the strength of RoboCup-2015 champion (WrightEagle, i.e. WE2015) in the League, with only eventual finalists of 2016 competition capable of defeating WE2015. An extended, post-Leipzig, round-robin tournament which included the top 8 teams of 2016, as well as WE2015, with over 1000 games played for each pair, placed WE2015 third behind the champion team (Gliders2016) and the runner-up (HELIOS2016). This establishes WE2015 as a stable benchmark for the 2D Simulation League. We then contrast two ranking methods and suggest two options for future evaluation challenges. The first one, "The Champions Simulation League", is proposed to include 6 previous champions, directly competing against each other in a round-robin tournament, with the view to systematically trace the advancements in the League. The second proposal, "The Global Challenge", is aimed to increase the realism of the environmental conditions during the simulated games, by simulating specific features of different participating countries.
Submitted 14 June, 2017;
originally announced June 2017.
-
Multi-Object Tracking and Identification over Sets
Authors:
Aijun Bai
Abstract:
The ability for an autonomous agent or robot to track and identify potentially multiple objects in a dynamic environment is essential for many applications, such as automated surveillance, traffic monitoring, human-robot interaction, etc. The main challenge is due to the noisy and incomplete perception including inevitable false negative and false positive errors from a low-level detector. In this paper, we propose a novel multi-object tracking and identification over sets approach to address this challenge. We define joint states and observations both as finite sets, and develop motion and observation functions accordingly. The object identification problem is then formulated and solved by using expectation-maximization methods. The set formulation enables us to avoid directly performing observation-to-object association. We empirically confirm that the overall algorithm outperforms the state-of-the-art in a popular PETS dataset.
Submitted 25 May, 2016;
originally announced May 2016.