Showing 1–11 of 11 results for author: Gurnee, W

  1. arXiv:2406.19384  [pdf, other]

    cs.LG cs.AI cs.CL

    The Remarkable Robustness of LLMs: Stages of Inference?

    Authors: Vedang Lad, Wes Gurnee, Max Tegmark

    Abstract: We demonstrate and investigate the remarkable robustness of Large Language Models by deleting and swapping adjacent layers. We find that deleting and swapping interventions retain 72-95% of the original model's prediction accuracy without fine-tuning, whereas models with more layers exhibit more robustness. Based on the results of the layer-wise intervention and further experiments, we hypothesiz…

    Submitted 27 June, 2024; originally announced June 2024.

  2. arXiv:2406.16254  [pdf, other]

    cs.LG cs.AI cs.CL

    Confidence Regulation Neurons in Language Models

    Authors: Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda

    Abstract: Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored. This study investigates two critical components believed to influence this uncertainty: the recently discovered entropy neurons and a new set of components that we term token frequency neurons. Entropy neurons are characterized b…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 25 pages, 14 figures

  3. arXiv:2406.11717  [pdf, other]

    cs.LG cs.AI cs.CL

    Refusal in Language Models Is Mediated by a Single Direction

    Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda

    Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models…

    Submitted 15 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  4. arXiv:2405.14860  [pdf, other]

    cs.LG

    Not All Language Model Features Are Linear

    Authors: Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark

    Abstract: Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on w…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: Code and data at https://github.com/JoshEngels/MultiDimensionalFeatures

  5. arXiv:2401.12181  [pdf, other]

    cs.LG cs.AI cs.CL

    Universal Neurons in GPT2 Language Models

    Authors: Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas

    Abstract: A basic question within the emerging field of mechanistic interpretability is the degree to which neural networks learn the same underlying mechanisms. In other words, are neural mechanisms universal across different models? In this work, we study the universality of individual neurons across GPT2 models trained from different initial random seeds, motivated by the hypothesis that universal neuron…

    Submitted 22 January, 2024; originally announced January 2024.

  6. arXiv:2311.00863  [pdf, other]

    cs.LG cs.AI cs.CL

    Training Dynamics of Contextual N-Grams in Language Models

    Authors: Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda

    Abstract: Prior work has shown the existence of contextual neurons in language models, including a neuron that activates on German text. We show that this neuron exists within a broader contextual n-gram circuit: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active. We investigate the formation of this circuit throughou…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: Accepted workshop paper at ATTRIB 2023 (@ NeurIPS)

  7. arXiv:2310.02207  [pdf, other]

    cs.LG cs.AI cs.CL

    Language Models Represent Space and Time

    Authors: Wes Gurnee, Max Tegmark

    Abstract: The capabilities of large language models (LLMs) have sparked debate over whether such systems just learn an enormous collection of superficial statistics or a set of more coherent and grounded representations that reflect the real world. We find evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historica…

    Submitted 4 March, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

  8. arXiv:2305.01610  [pdf, other]

    cs.LG cs.AI

    Finding Neurons in a Haystack: Case Studies with Sparse Probing

    Authors: Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas

    Abstract: Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on these internal activations to predict the presence of f…

    Submitted 2 June, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

  9. arXiv:2206.00176  [pdf, other]

    cs.LG eess.SY math.OC

    Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization

    Authors: Dimitris Bertsimas, Wes Gurnee

    Abstract: Discovering governing equations of complex dynamical systems directly from data is a central problem in scientific machine learning. In recent years, the sparse identification of nonlinear dynamics (SINDy) framework, powered by heuristic sparse regression methods, has become a dominant tool for learning parsimonious models. We propose an exact formulation of the SINDy problem using mixed-integer o…

    Submitted 31 May, 2022; originally announced June 2022.

  10. arXiv:2107.07083  [pdf, other]

    cs.GT

    Combatting Gerrymandering with Social Choice: the Design of Multi-member Districts

    Authors: Nikhil Garg, Wes Gurnee, David Rothschild, David Shmoys

    Abstract: Every representative democracy must specify a mechanism under which voters choose their representatives. The most common mechanism in the United States -- Winner takes all single-member districts -- both enables substantial partisan gerrymandering and constrains 'fair' redistricting, preventing proportional representation in legislatures. We study the design of multi-member districts (MMDs), in wh…

    Submitted 9 August, 2022; v1 submitted 14 July, 2021; originally announced July 2021.

    Comments: 34 pages

  11. arXiv:2103.11469  [pdf, other]

    cs.CY cs.DM cs.DS

    Fairmandering: A column generation heuristic for fairness-optimized political districting

    Authors: Wes Gurnee, David B. Shmoys

    Abstract: The American winner-take-all congressional district system empowers politicians to engineer electoral outcomes by manipulating district boundaries. Existing computational solutions mostly focus on drawing unbiased maps by ignoring political and demographic input, and instead simply optimize for compactness. We claim that this is a flawed approach because compactness and fairness are orthogonal qua…

    Submitted 25 June, 2021; v1 submitted 21 March, 2021; originally announced March 2021.