Showing 1–6 of 6 results for author: Mukobi, G

  1. arXiv:2406.04391  [pdf, other]

    cs.LG cs.AI cs.CL

    Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

    Authors: Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

    Abstract: Predictable behavior from scaling advanced AI systems is an extremely desirable property. Although a well-established literature exists on how pretraining performance scales, the literature on how particular downstream capabilities scale is significantly muddier. In this work, we take a step back and ask: why has predicting specific downstream capabilities with scale remained elusive? While many f…

    Submitted 6 June, 2024; originally announced June 2024.

  2. arXiv:2405.10295  [pdf]

    cs.CY cs.AI cs.HC

    Societal Adaptation to Advanced AI

    Authors: Jamie Bernardi, Gabriel Mukobi, Hilary Greaves, Lennart Heim, Markus Anderljung

    Abstract: Existing strategies for managing risks from advanced AI systems often focus on affecting what AI systems are developed and how they diffuse. However, this approach becomes less feasible as the number of developers of advanced AI grows, and impedes beneficial use-cases as well as harmful ones. In response, we urge a complementary approach: increasing societal adaptation to advanced AI, that is, red…

    Submitted 16 May, 2024; originally announced May 2024.

  3. arXiv:2403.03218  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

    Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, et al. (32 additional authors not shown)

    Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing furthe…

    Submitted 15 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: See the project page at https://wmdp.ai

  4. arXiv:2401.03408  [pdf, other]

    cs.AI cs.CL cs.CY cs.MA

    Escalation Risks from Language Models in Military and Diplomatic Decision-Making

    Authors: Juan-Pablo Rivera, Gabriel Mukobi, Anka Reuel, Max Lamparth, Chandler Smith, Jacquelyn Schneider

    Abstract: Governments are increasingly considering integrating autonomous AI agents in high-stakes military and foreign-policy decision-making, especially with the emergence of advanced generative AI models like GPT-4. Our work aims to scrutinize the behavior of multiple AI agents in simulated wargames, specifically focusing on their predilection to take escalatory actions that may exacerbate multilateral c…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: 10 pages body, 57 pages appendix, 46 figures, 11 tables

    Journal ref: The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT 24), June 3-6, 2024, Rio de Janeiro, Brazil

  5. arXiv:2310.16763  [pdf, other]

    cs.CL cs.AI cs.LG

    SuperHF: Supervised Iterative Learning from Human Feedback

    Authors: Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti

    Abstract: While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). SFT is simple and robust, powering a host of open-source models, while RL…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: Accepted to the Socially Responsible Language Modelling Research (SoLaR) workshop at NeurIPS 2023

  6. arXiv:2310.08901  [pdf, other]

    cs.MA cs.AI cs.CL

    Welfare Diplomacy: Benchmarking Language Model Cooperation

    Authors: Gabriel Mukobi, Hannah Erlebach, Niklas Lauffer, Lewis Hammond, Alan Chan, Jesse Clifton

    Abstract: The growing capabilities and increasingly widespread deployment of AI systems necessitate robust benchmarks for measuring their cooperative capabilities. Unfortunately, most multi-agent benchmarks are either zero-sum or purely cooperative, providing limited opportunities for such measurements. We introduce a general-sum variant of the zero-sum board game Diplomacy -- called Welfare Diplomacy -- in…

    Submitted 13 October, 2023; originally announced October 2023.