Showing 1–22 of 22 results for author: Evans, O

  1. arXiv:2407.04694  [pdf, other]

    cs.CL cs.AI cs.LG

    Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

    Authors: Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans

    Abstract: AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model". This raises questions. Do such models know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as being deployed to the public? We refer to a model's knowledge of itself and its circumstances as situational awareness. To quantify situational…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: 11 page main body, 98 page appendix, 58 figures

  2. arXiv:2406.14546  [pdf, other]

    cs.CL cs.AI cs.LG

    Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

    Authors: Johannes Treutlein, Dami Choi, Jan Betley, Cem Anil, Samuel Marks, Roger Baker Grosse, Owain Evans

    Abstract: One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-…

    Submitted 20 June, 2024; originally announced June 2024.

  3. arXiv:2405.07436  [pdf, other]

    cs.LG cs.AI

    Can Language Models Explain Their Own Classification Behavior?

    Authors: Dane Sherburn, Bilal Chughtai, Owain Evans

    Abstract: Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. To explore this, we introduce a dataset, ArticulateRules, of few-shot text-based classification tasks generated by simple rules. Each rule is associated wi…

    Submitted 12 May, 2024; originally announced May 2024.

  4. arXiv:2312.07779  [pdf, other]

    cs.AI

    Tell, don't show: Declarative facts influence how LLMs generalize

    Authors: Alexander Meinke, Owain Evans

    Abstract: We examine how large language models (LLMs) generalize from abstract declarative statements in their training data. As an illustration, consider an LLM that is prompted to generate weather reports for London in 2050. One possibility is that the temperatures in the reports match the mean and variance of reports from 2023 (i.e. matching the statistics of pretraining). Another possibility is that the…

    Submitted 12 December, 2023; originally announced December 2023.

  5. arXiv:2309.15840  [pdf, other]

    cs.CL cs.AI cs.LG

    How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

    Authors: Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

    Abstract: Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a…

    Submitted 26 September, 2023; originally announced September 2023.

  6. arXiv:2309.12288  [pdf, other]

    cs.CL cs.AI cs.LG

    The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

    Authors: Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans

    Abstract: We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Valentina Tereshkova was the first woman to travel to space", it will not automatically be able to answe…

    Submitted 26 May, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: 21 pages, 11 figures

  7. arXiv:2309.00667  [pdf, other]

    cs.CL cs.LG

    Taken out of context: On measuring situational awareness in LLMs

    Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans

    Abstract: We aim to better understand the emergence of 'situational awareness' in large language models (LLMs). A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment. Today's LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while tak…

    Submitted 1 September, 2023; originally announced September 2023.

  8. arXiv:2206.15474  [pdf, other]

    cs.LG cs.CL

    Forecasting Future World Events with Neural Networks

    Authors: Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

    Abstract: Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict, pandemics and economic indicators help shape policy and decision making. In these domains, the judgment of expert humans contributes to the best forecasts. Given advances in language modeling, can these forecasts be automated? To this end, we introduce Autocast, a dataset containing tho…

    Submitted 9 October, 2022; v1 submitted 30 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022; our dataset is available at https://github.com/andyzoujm/autocast

  9. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  10. arXiv:2205.14334  [pdf, other]

    cs.CL cs.AI cs.LG

    Teaching Models to Express Their Uncertainty in Words

    Authors: Stephanie Lin, Jacob Hilton, Owain Evans

    Abstract: We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language -- without use of model logits. When given a question, the model generates both an answer and a level of confidence (e.g. "90% confidence" or "high confidence"). These levels map to probabilities that are well calibrated. The model also remains moderately calibrated under distribution shift, and i…

    Submitted 13 June, 2022; v1 submitted 28 May, 2022; originally announced May 2022.

    Comments: CalibratedMath tasks and evaluation code are available at https://github.com/sylinrl/CalibratedMath

  11. arXiv:2110.06674  [pdf, other]

    cs.CY cs.AI cs.CL

    Truthful AI: Developing and governing AI that does not lie

    Authors: Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders

    Abstract: In many contexts, lying -- the use of verbal falsehoods to deceive -- is harmful. While lying has traditionally been a human affair, AI systems that make sophisticated verbal statements are becoming increasingly prevalent. This raises the question of how we should limit the harm caused by AI "lies" (i.e. falsehoods that are actively selected for). Human truthfulness is governed by social norms and…

    Submitted 13 October, 2021; originally announced October 2021.

    ACM Class: I.2.0

  12. arXiv:2109.07958  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Authors: Stephanie Lin, Jacob Hilton, Owain Evans

    Abstract: We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating hum…

    Submitted 7 May, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

    Comments: ACL 2022 (main conference); the TruthfulQA benchmark and evaluation code is available at https://github.com/sylinrl/TruthfulQA

  13. arXiv:2011.06709  [pdf, other]

    cs.LG cs.AI stat.ML

    Active Reinforcement Learning: Observing Rewards at a Cost

    Authors: David Krueger, Jan Leike, Owain Evans, John Salvatier

    Abstract: Active reinforcement learning (ARL) is a variant on reinforcement learning where the agent does not observe the reward unless it chooses to pay a query cost c > 0. The central question of ARL is how to quantify the long-term value of reward information. Even in multi-armed bandits, computing the value of this information is intractable and we have to rely on heuristics. We propose and evaluate sev…

    Submitted 24 November, 2020; v1 submitted 12 November, 2020; originally announced November 2020.

    Comments: Originally appeared at the NeurIPS 2016 "Future of Interactive Learning Machines (FILM)" workshop

  14. Software Sustainability & High Energy Physics

    Authors: Daniel S. Katz, Sudhir Malik, Mark S. Neubauer, Graeme A. Stewart, Kétévi A. Assamagan, Erin A. Becker, Neil P. Chue Hong, Ian A. Cosden, Samuel Meehan, Edward J. W. Moyse, Adrian M. Price-Whelan, Elizabeth Sexton-Kennedy, Meirin Oan Evans, Matthew Feickert, Clemens Lange, Kilian Lieret, Rob Quick, Arturo Sánchez Pineda, Christopher Tunnell

    Abstract: New facilities of the 2020s, such as the High Luminosity Large Hadron Collider (HL-LHC), will be relevant through at least the 2030s. This means that their software efforts and those that are used to analyze their data need to consider sustainability to enable their adaptability to new challenges, longevity, and efficiency, over at least this period. This will help ensure that this software will b…

    Submitted 16 October, 2020; v1 submitted 10 October, 2020; originally announced October 2020.

    Comments: A report from the "Sustainable Software in HEP" IRIS-HEP blueprint workshop: https://indico.cern.ch/event/930127/

  15. arXiv:1911.07068  [pdf, other]

    cs.AI cs.CV cs.LG

    Sensory Optimization: Neural Networks as a Model for Understanding and Creating Art

    Authors: Owain Evans

    Abstract: This article is about the cognitive science of visual art. Artists create physical artifacts (such as sculptures or paintings) which depict people, objects, and events. These depictions are usually stylized rather than photo-realistic. How is it that humans are able to understand and create stylized representations? Does this ability depend on general cognitive capacities or an evolutionary adapta…

    Submitted 16 November, 2019; originally announced November 2019.

    Comments: 27 pages. Web version with high-resolution images: https://owainevans.github.io/visual_aesthetics/sensory-optimization.html

  16. arXiv:1907.01475  [pdf, other]

    cs.LG cs.AI stat.ML

    Generalizing from a few environments in safety-critical reinforcement learning

    Authors: Zachary Kenton, Angelos Filos, Owain Evans, Yarin Gal

    Abstract: Before deploying autonomous agents in the real world, we need to be confident they will perform safely in novel situations. Ideally, we would expose agents to a very wide range of situations during training, allowing them to learn about every possible danger, but this is often impractical. This paper investigates safety and generalization from a limited number of training environments in deep rein…

    Submitted 2 July, 2019; originally announced July 2019.

  17. arXiv:1803.04926  [pdf, other]

    cs.LG stat.ML

    Active Reinforcement Learning with Monte-Carlo Tree Search

    Authors: Sebastian Schulze, Owain Evans

    Abstract: Active Reinforcement Learning (ARL) is a twist on RL where the agent observes reward information only if it pays a cost. This subtle change makes exploration substantially more challenging. Powerful principles in RL like optimism, Thompson sampling, and random exploration do not help with ARL. We relate ARL in tabular environments to Bayes-Adaptive MDPs. We provide an ARL algorithm using Monte-Car…

    Submitted 26 March, 2018; v1 submitted 13 March, 2018; originally announced March 2018.

    Comments: 11 pages, 10 figures

  18. arXiv:1802.07228  [pdf]

    cs.AI cs.CR cs.CY

    The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation

    Authors: Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul Scharre, Thomas Zeitzoff, Bobby Filar, Hyrum Anderson, Heather Roff, Gregory C. Allen, Jacob Steinhardt, Carrick Flynn, Seán Ó hÉigeartaigh, Simon Beard, Haydn Belfield, Sebastian Farquhar, Clare Lyle, Rebecca Crootof, Owain Evans, Michael Page, Joanna Bryson, Roman Yampolskiy, et al. (1 additional author not shown)

    Abstract: This report surveys the landscape of potential security threats from malicious uses of AI, and proposes ways to better forecast, prevent, and mitigate these threats. After analyzing the ways in which AI may influence the threat landscape in the digital, physical, and political domains, we make four high-level recommendations for AI researchers and other stakeholders. We also suggest several promis…

    Submitted 20 February, 2018; originally announced February 2018.

  19. arXiv:1707.05173  [pdf, other]

    cs.AI cs.LG cs.NE

    Trial without Error: Towards Safe Reinforcement Learning via Human Intervention

    Authors: William Saunders, Girish Sastry, Andreas Stuhlmueller, Owain Evans

    Abstract: AI systems are increasingly applied to complex tasks that involve interaction with humans. During training, such systems are potentially dangerous, as they haven't yet learned to avoid actions that could cause serious harm. How can an AI system explore and learn without making a single mistake that harms humans or otherwise causes serious damage? For model-free reinforcement learning, having a hum…

    Submitted 17 July, 2017; originally announced July 2017.

  20. arXiv:1705.08807  [pdf, other]

    cs.AI cs.CY

    When Will AI Exceed Human Performance? Evidence from AI Experts

    Authors: Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, Owain Evans

    Abstract: Advances in artificial intelligence (AI) will transform modern life by reshaping transportation, health, science, finance, and the military. To adapt public policy, we need to better anticipate these advances. Here we report the results from a large survey of machine learning researchers on their beliefs about progress in AI. Researchers predict AI will outperform humans in many activities in the…

    Submitted 3 May, 2018; v1 submitted 24 May, 2017; originally announced May 2017.

    Comments: Accepted by Journal of Artificial Intelligence Research (AI and Society Track). Minor update to refer to related work (page 5)

  21. arXiv:1701.04079  [pdf, other]

    cs.LG cs.AI

    Agent-Agnostic Human-in-the-Loop Reinforcement Learning

    Authors: David Abel, John Salvatier, Andreas Stuhlmüller, Owain Evans

    Abstract: Providing Reinforcement Learning agents with expert advice can dramatically improve various aspects of learning. Prior work has developed teaching protocols that enable agents to learn efficiently in complex environments; many of these methods tailor the teacher's guidance to agents with a particular representation or underlying learning scheme, offering effective but specialized teaching procedur…

    Submitted 15 January, 2017; originally announced January 2017.

    Comments: Presented at the NIPS Workshop on the Future of Interactive Learning Machines, 2016

  22. arXiv:1512.05832  [pdf, other]

    cs.AI

    Learning the Preferences of Ignorant, Inconsistent Agents

    Authors: Owain Evans, Andreas Stuhlmueller, Noah D. Goodman

    Abstract: An important use of machine learning is to learn what people value. What posts or photos should a user be shown? Which jobs or activities would a person find rewarding? In each case, observations of people's past choices can inform our inferences about their likes and preferences. If we assume that choices are approximately optimal according to some utility function, we can treat preference infere…

    Submitted 17 December, 2015; originally announced December 2015.

    Comments: AAAI 2016