Showing 1–1 of 1 results for author: Barney, M

  1. arXiv:2407.03621 [pdf, other]

    cs.CL

    The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model

    Authors: Brenden Smith, Dallin Baker, Clayton Chase, Myles Barney, Kaden Parker, Makenna Allred, Peter Hu, Alex Evans, Nancy Fulda

    Abstract: Large Language Models (LLMs) have an unrivaled and invaluable ability to "align" their output to a diverse range of human preferences, by mirroring them in the text they generate. The internal characteristics of such models, however, remain largely opaque. This work presents the Injectable Realignment Model (IRM) as a novel approach to language model interpretability and explainability. Inspired b…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: 21 pages, 17 figures