Showing 1–20 of 20 results for author: Mazeika, M

  1. arXiv:2403.03218  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

    Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, et al. (32 additional authors not shown)

    Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing furthe…

    Submitted 15 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: See the project page at https://wmdp.ai

  2. arXiv:2402.04249  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Authors: Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks

    Abstract: Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties prev…

    Submitted 26 February, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: Website: https://www.harmbench.org

  3. arXiv:2310.01405  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.CY

    Representation Engineering: A Top-Down Approach to AI Transparency

    Authors: Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

    Abstract: In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive p…

    Submitted 10 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Code is available at https://github.com/andyzoujm/representation-engineering

  4. arXiv:2306.12001  [pdf, other]

    cs.CY cs.AI cs.LG

    An Overview of Catastrophic AI Risks

    Authors: Dan Hendrycks, Mantas Mazeika, Thomas Woodside

    Abstract: Rapid advancements in artificial intelligence (AI) have sparked growing concerns among experts, policymakers, and world leaders regarding the potential for increasingly advanced AI systems to pose catastrophic risks. Although numerous risks have been detailed separately, there is a pressing need for a systematic discussion and illustration of the potential dangers to better inform efforts to mitig…

    Submitted 9 October, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

  5. arXiv:2306.11698  [pdf, other]

    cs.CL cs.AI cs.CR

    DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

    Authors: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li

    Abstract: Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To thi…

    Submitted 26 February, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 Outstanding Paper (Datasets and Benchmarks Track)

  6. arXiv:2210.10039  [pdf, other]

    cs.CV cs.CY cs.LG

    How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

    Authors: Mantas Mazeika, Eric Tang, Andy Zou, Steven Basart, Jun Shern Chan, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks

    Abstract: In recent years, deep neural networks have demonstrated increasingly strong abilities to recognize objects and activities in videos. However, as video understanding becomes widely used in real-world applications, a key consideration is developing human-centric systems that understand not only the content of the video but also how it would affect the wellbeing and emotional state of viewers. To fac…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022; datasets available at https://github.com/hendrycks/emodiversity/

  7. arXiv:2206.15474  [pdf, other]

    cs.LG cs.CL

    Forecasting Future World Events with Neural Networks

    Authors: Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

    Abstract: Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict, pandemics and economic indicators help shape policy and decision making. In these domains, the judgment of expert humans contributes to the best forecasts. Given advances in language modeling, can these forecasts be automated? To this end, we introduce Autocast, a dataset containing tho…

    Submitted 9 October, 2022; v1 submitted 30 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022; our dataset is available at https://github.com/andyzoujm/autocast

  8. arXiv:2206.14157  [pdf, other]

    cs.LG cs.CR

    How to Steer Your Adversary: Targeted and Efficient Model Stealing Defenses with Gradient Redirection

    Authors: Mantas Mazeika, Bo Li, David Forsyth

    Abstract: Model stealing attacks present a dilemma for public machine learning APIs. To protect financial investments, companies may be forced to withhold important information about their models that could facilitate theft, including uncertainty estimates and prediction explanations. This compromise is harmful not only to users but also to external transparency. Model stealing defenses seek to resolve this…

    Submitted 28 June, 2022; originally announced June 2022.

    Comments: ICML 2022

  9. arXiv:2206.05862  [pdf, other]

    cs.CY cs.AI cs.LG

    X-Risk Analysis for AI Research

    Authors: Dan Hendrycks, Mantas Mazeika

    Abstract: Artificial intelligence (AI) has the potential to greatly improve society, but as with any powerful technology, it comes with heightened risks and responsibilities. Current AI research lacks a systematic discussion of how to manage long-tail risks from AI systems, including speculative long-term risks. Keeping in mind the potential benefits of AI, there is some concern that building ever more inte…

    Submitted 20 September, 2022; v1 submitted 12 June, 2022; originally announced June 2022.

  10. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  11. arXiv:2112.05135  [pdf, other]

    cs.LG cs.CV

    PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

    Authors: Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt

    Abstract: In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard test set accuracy. These other goals include out-of-distribution (OOD) robustness, prediction consistency, resilience to adversaries, calibrated uncertainty estimates, and the ability to detect anomalous inputs. However, improving performance towards these goals is often…

    Submitted 29 March, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

    Comments: CVPR 2022. Code and models are available at https://github.com/andyzoujm/pixmix

  12. arXiv:2110.13136  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    What Would Jiminy Cricket Do? Towards Agents That Behave Morally

    Authors: Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt

    Abstract: When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. By contrast, artificial agents are currently not endowed with a moral sense. As a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environme…

    Submitted 7 February, 2022; v1 submitted 25 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2021. Environments available here https://github.com/hendrycks/jiminy-cricket

  13. arXiv:2105.09938  [pdf, other]

    cs.SE cs.CL cs.LG

    Measuring Coding Challenge Competence With APPS

    Authors: Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

    Abstract: While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for c…

    Submitted 8 November, 2021; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: NeurIPS 2021. Code and the APPS dataset are available at https://github.com/hendrycks/apps

  14. arXiv:2009.03300  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Measuring Massive Multitask Language Understanding

    Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

    Abstract: We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over…

    Submitted 12 January, 2021; v1 submitted 7 September, 2020; originally announced September 2020.

    Comments: ICLR 2021; the test and code are available at https://github.com/hendrycks/test

  15. arXiv:1911.11132  [pdf, other]

    cs.CV cs.LG

    Scaling Out-of-Distribution Detection for Real-World Settings

    Authors: Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song

    Abstract: Detecting out-of-distribution examples is important for safety-critical machine learning applications such as detecting novel biological phenomena and self-driving cars. However, existing research mainly focuses on simple small-scale settings. To set the stage for more realistic out-of-distribution detection, we depart from small-scale settings and explore large-scale multiclass and multi-label se…

    Submitted 15 May, 2022; v1 submitted 25 November, 2019; originally announced November 2019.

    Comments: ICML 2022; The Species dataset and code are available at https://github.com/hendrycks/anomaly-seg

  16. arXiv:1908.08016  [pdf, other]

    cs.LG cs.CR cs.CV stat.ML

    Testing Robustness Against Unforeseen Adversaries

    Authors: Max Kaufmann, Daniel Kang, Yi Sun, Steven Basart, Xuwang Yin, Mantas Mazeika, Akul Arora, Adam Dziedzic, Franziska Boenisch, Tom Brown, Jacob Steinhardt, Dan Hendrycks

    Abstract: Adversarial robustness research primarily focuses on L_p perturbations, and most defenses are developed with identical training-time and test-time adversaries. However, in real-world applications developers are unlikely to have access to the full range of attacks or corruptions their system will face. Furthermore, worst-case inputs are likely to be diverse and need not be constrained to the L_p ba…

    Submitted 30 October, 2023; v1 submitted 21 August, 2019; originally announced August 2019.

    Comments: Datasets available at https://github.com/centerforaisafety/adversarial-corruptions

  17. arXiv:1906.12340  [pdf, other]

    cs.LG cs.CV stat.ML

    Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

    Authors: Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, Dawn Song

    Abstract: Self-supervision provides effective representations for downstream tasks without requiring labels. However, existing approaches lag behind fully supervised training and are often not thought beneficial beyond obviating or reducing the need for annotations. We find that self-supervision can benefit robustness in a variety of ways, including robustness to adversarial examples, label corruption, and…

    Submitted 29 October, 2019; v1 submitted 28 June, 2019; originally announced June 2019.

    Comments: NeurIPS 2019; code and data available at https://github.com/hendrycks/ss-ood

  18. arXiv:1901.09960  [pdf, other]

    cs.LG cs.CV stat.ML

    Using Pre-Training Can Improve Model Robustness and Uncertainty

    Authors: Dan Hendrycks, Kimin Lee, Mantas Mazeika

    Abstract: He et al. (2018) have called into question the utility of pre-training by showing that training from scratch can often yield similar performance to pre-training. We show that although pre-training may not improve performance on traditional classification metrics, it improves model robustness and uncertainty estimates. Through extensive experiments on adversarial examples, label corruption, class i…

    Submitted 20 October, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

    Comments: ICML 2019. PyTorch code here: https://github.com/hendrycks/pre-training (Figure 3 updated)

  19. arXiv:1812.04606  [pdf, other]

    cs.LG cs.CL cs.CV stat.ML

    Deep Anomaly Detection with Outlier Exposure

    Authors: Dan Hendrycks, Mantas Mazeika, Thomas Dietterich

    Abstract: It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training ano…

    Submitted 28 January, 2019; v1 submitted 11 December, 2018; originally announced December 2018.

    Comments: ICLR 2019; PyTorch code available at https://github.com/hendrycks/outlier-exposure

  20. arXiv:1802.05300  [pdf, other]

    cs.LG cs.CL cs.CV cs.NE

    Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise

    Authors: Dan Hendrycks, Mantas Mazeika, Duncan Wilson, Kevin Gimpel

    Abstract: The growing importance of massive datasets used for deep learning makes robustness to label noise a critical property for classifiers to have. Sources of label noise include automatic labeling, non-expert labeling, and label corruption by data poisoning adversaries. Numerous previous works assume that no source of labels can be trusted. We relax this assumption and assume that a small subset of th…

    Submitted 28 January, 2019; v1 submitted 14 February, 2018; originally announced February 2018.

    Comments: NeurIPS 2018. PyTorch code available at https://github.com/mmazeika/glc