Showing 1–2 of 2 results for author: Ewart, A

  1. arXiv:2402.16835  [pdf, other]

    cs.CL

    Eight Methods to Evaluate Robust Unlearning in LLMs

    Authors: Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

    Abstract: Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's…

    Submitted 26 February, 2024; originally announced February 2024.

  2. arXiv:2309.08600  [pdf, other]

    cs.LG cs.CL

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey

    Abstract: One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neura…

    Submitted 4 October, 2023; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: 20 pages, 18 figures, 2 tables