Showing 1–21 of 21 results for author: Casper, S

  1. arXiv:2404.09932  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    Foundational Challenges in Assuring Alignment and Safety of Large Language Models

    Authors: Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, et al. (13 additional authors not shown)

    Abstract: This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.

    Submitted 15 April, 2024; originally announced April 2024.

  2. arXiv:2404.02949  [pdf, other]

    cs.LG cs.AI

    The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

    Authors: Stephen Casper, Jieun Yun, Joonhyuk Baek, Yeseong Jung, Minhwan Kim, Kiwan Kwon, Saerom Park, Hayden Moore, David Shriver, Marissa Connor, Keltin Grimes, Angus Nicolson, Arush Tagade, Jessica Rumbelow, Hieu Minh Nguyen, Dylan Hadfield-Menell

    Abstract: Interpretability techniques are valuable for helping humans understand and oversee AI systems. The SaTML 2024 CNN Interpretability Competition solicited novel methods for studying convolutional neural networks (CNNs) at the ImageNet scale. The objective of the competition was to help human crowd-workers identify trojans in CNNs. This report showcases the methods and results of four featured compet…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: Competition for SaTML 2024

  3. arXiv:2403.05030  [pdf, other]

    cs.CR cs.AI cs.LG

    Defending Against Unforeseen Failure Modes with Latent Adversarial Training

    Authors: Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell

    Abstract: Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness, however, they empirically st…

    Submitted 1 April, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  4. arXiv:2402.16835  [pdf, other]

    cs.CL

    Eight Methods to Evaluate Robust Unlearning in LLMs

    Authors: Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

    Abstract: Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's…

    Submitted 26 February, 2024; originally announced February 2024.

  5. arXiv:2402.08787  [pdf, other]

    cs.LG cs.CL

    Rethinking Machine Unlearning for Large Language Models

    Authors: Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu

    Abstract: We explore machine unlearning (MU) in the domain of large language models (LLMs), referred to as LLM unlearning. This initiative aims to eliminate undesirable data influence (e.g., sensitive or illegal information) and the associated model capabilities, while maintaining the integrity of essential knowledge generation and not affecting causally unrelated information. We envision LLM unlearning bec…

    Submitted 14 July, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

  6. arXiv:2401.14446  [pdf, other]

    cs.CY cs.AI cs.CR

    Black-Box Access is Insufficient for Rigorous AI Audits

    Authors: Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell

    Abstract: External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workin…

    Submitted 29 May, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: FAccT 2024

    Journal ref: The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24), June 3-6, 2024, Rio de Janeiro, Brazil

  7. arXiv:2312.03729  [pdf, other]

    cs.CL cs.AI

    Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

    Authors: Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas

    Abstract: Neural language models (LMs) can be used to evaluate the truth of factual statements in two ways: they can be either queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs "lie" or otherw…

    Submitted 27 November, 2023; originally announced December 2023.

    Comments: Accepted to EMNLP, 2023

  8. arXiv:2311.03348  [pdf, other]

    cs.CL cs.AI cs.LG

    Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

    Authors: Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando

    Abstract: Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each person…

    Submitted 24 November, 2023; v1 submitted 6 November, 2023; originally announced November 2023.

  9. arXiv:2307.15217  [pdf, other]

    cs.AI cs.CL cs.LG

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Authors: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, et al. (7 additional authors not shown)

    Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and rel…

    Submitted 11 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

  10. arXiv:2307.04028  [pdf, other]

    cs.CV cs.AI cs.LG

    Measuring the Success of Diffusion Models at Imitating Human Artists

    Authors: Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell

    Abstract: Modern diffusion models have set the state-of-the-art in AI image generation. Their success is due, in part, to training on Internet-scale data which often includes copyrighted work. This prompts questions about the extent to which these models learn from, imitate, or copy the work of human artists. This work suggests that tying copyright liability to the capabilities of the model may be useful gi…

    Submitted 8 July, 2023; originally announced July 2023.

    Comments: Accepted to the 1st Workshop on Generative AI and Law

  11. arXiv:2306.09442  [pdf, other]

    cs.CL cs.AI cs.LG

    Explore, Establish, Exploit: Red Teaming Language Models from Scratch

    Authors: Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

    Abstract: Deploying large language models (LMs) can pose hazards from harmful outputs such as toxic or false text. Prior work has introduced automated tools that elicit harmful outputs to identify these risks. While this is a valuable step toward securing models, these approaches rely on a pre-existing way to efficiently classify undesirable outputs. Using a pre-existing classifier does not allow for red-te…

    Submitted 10 October, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

  12. arXiv:2302.10894  [pdf, other]

    cs.LG cs.AI cs.CV

    Red Teaming Deep Neural Networks with Feature Synthesis Tools

    Authors: Stephen Casper, Yuxiao Li, Jiawei Li, Tong Bu, Kevin Zhang, Kaivalya Hariharan, Dylan Hadfield-Menell

    Abstract: Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area of study receives, there are comparatively few cases where these tools have identified previously unknown bugs in models. We argue that this is due, in part, to a common feature of many interpretability methods: they analyze model behavior by…

    Submitted 21 September, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

    Comments: In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  13. arXiv:2211.10024  [pdf, other]

    cs.LG cs.AI cs.CR

    Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

    Authors: Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell

    Abstract: This paper considers the problem of helping humans exercise scalable oversight over deep neural networks (DNNs). Adversarial examples can be useful by helping to reveal weaknesses in DNNs, but they can be difficult to interpret or draw actionable conclusions from. Some previous works have proposed using human-interpretable adversarial attacks including copy/paste attacks in which one natural image…

    Submitted 5 May, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: Best paper award at the NeurIPS 2022 ML Safety Workshop -- https://neurips2022.mlsafety.org/

  14. arXiv:2209.02167  [pdf, other]

    cs.AI cs.CR cs.LG

    Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents

    Authors: Stephen Casper, Taylor Killian, Gabriel Kreiman, Dylan Hadfield-Menell

    Abstract: Adversarial examples can be useful for identifying vulnerabilities in AI systems before they are deployed. In reinforcement learning (RL), adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box versions of these attacks where the adversary only observes the world state and treats the target agent as any other pa…

    Submitted 13 October, 2023; v1 submitted 5 September, 2022; originally announced September 2022.

    Comments: Code is available at https://github.com/thestephencasper/lm_white_box_attacks

  15. arXiv:2207.13243  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

    Authors: Tilman Räuker, Anson Ho, Stephen Casper, Dylan Hadfield-Menell

    Abstract: The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to…

    Submitted 18 August, 2023; v1 submitted 26 July, 2022; originally announced July 2022.

  16. arXiv:2110.08058  [pdf, other]

    cs.LG cs.AI cs.NE

    Quantifying Local Specialization in Deep Neural Networks

    Authors: Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart Russell

    Abstract: A neural network is locally specialized to the extent that parts of its computational graph (i.e. structure) can be abstractly represented as performing some comprehensible sub-task relevant to the overall task (i.e. functionality). Are modern deep neural networks locally specialized? How can this be quantified? In this paper, we consider the problem of taking a neural network whose neurons are pa…

    Submitted 7 February, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: 21 pages, 6 figures. Code is available at https://github.com/thestephencasper/detecting_nn_modularity

  17. arXiv:2110.03605  [pdf, other]

    cs.LG cs.AI cs.CV

    Robust Feature-Level Adversaries are Interpretability Tools

    Authors: Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman

    Abstract: The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we…

    Submitted 11 September, 2023; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2022, code available at https://github.com/thestephencasper/feature_level_adv

  18. arXiv:2103.03386  [pdf, other]

    cs.NE

    Clusterability in Neural Networks

    Authors: Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell

    Abstract: The learned weights of a neural network have often been considered devoid of scrutable internal structure. In this paper, however, we look for structure in the form of clusterability: how well a network can be divided into groups of neurons with strong internal connectivity but weak external connectivity. We find that a trained neural network is typically more clusterable than randomly initialized…

    Submitted 4 March, 2021; originally announced March 2021.

    Comments: 20 pages, 22 figures. arXiv admin note: text overlap with arXiv:2003.04881

  19. arXiv:2010.05418  [pdf, ps, other]

    cs.AI

    Achilles Heels for AGI/ASI via Decision Theoretic Adversaries

    Authors: Stephen Casper

    Abstract: As progress in AI continues to advance, it is important to know how advanced systems will make choices and in what ways they may fail. Machines can already outsmart humans in some domains, and understanding how to safely build ones which may have capabilities at or above the human level is of particular concern. One might suspect that artificially generally intelligent (AGI) and artificially super…

    Submitted 1 April, 2023; v1 submitted 11 October, 2020; originally announced October 2020.

    Comments: Contact info for author at stephencasper.com

  20. arXiv:2006.08331  [pdf, other]

    cs.CL cs.AI cs.LG stat.ML

    Probing Neural Dialog Models for Conversational Understanding

    Authors: Abdelrhman Saleh, Tovly Deutsch, Stephen Casper, Yonatan Belinkov, Stuart Shieber

    Abstract: The predominant approach to open-domain dialog generation relies on end-to-end training of neural models on chat datasets. However, this approach provides little insight as to what these models learn (or do not learn) about engaging in dialog. In this study, we analyze the internal representations learned by neural open-domain dialog systems and evaluate the quality of these representations for le…

    Submitted 7 June, 2020; originally announced June 2020.

  21. arXiv:1912.04783  [pdf, other]

    cs.LG cs.CV stat.ML

    Frivolous Units: Wider Networks Are Not Really That Wide

    Authors: Stephen Casper, Xavier Boix, Vanessa D'Amario, Ling Guo, Martin Schrimpf, Kasper Vinken, Gabriel Kreiman

    Abstract: A remarkable characteristic of overparameterized deep neural networks (DNNs) is that their accuracy does not degrade when the network's width is increased. Recent evidence suggests that developing compressible representations is key for adjusting the complexity of large networks to the learning task at hand. However, these compressible representations are poorly understood. A promising strand of r…

    Submitted 31 May, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

    Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 2021