Showing 1–2 of 2 results for author: Rosenblatt, J

  1. arXiv:2407.10188  [pdf, other]

    cs.LG

    Unexpected Benefits of Self-Modeling in Neural Systems

    Authors: Vickram N. Premakumar, Michael Vaiana, Florin Pop, Judd Rosenblatt, Diogo Schwerz de Lucena, Kirsten Ziman, Michael S. A. Graziano

    Abstract: Self-models have been a topic of great interest for decades in studies of human cognition and more recently in machine learning. Yet what benefits do self-models confer? Here we show that when artificial networks learn to predict their internal states as an auxiliary task, they change in a fundamental way. To better perform the self-model task, the network learns to make itself simpler, more regul…

    Submitted 14 July, 2024; originally announced July 2024.

  2. arXiv:2406.19552  [pdf, other]

    cs.CL cs.AI cs.LG

    Rethinking harmless refusals when fine-tuning foundation models

    Authors: Florin Pop, Judd Rosenblatt, Diogo Schwerz de Lucena, Michael Vaiana

    Abstract: In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reas…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: ICLR 2024 AGI Workshop Poster