DafnyBench: A Benchmark for Formal Software Verification
Authors:
Chloe Loughridge,
Qinyi Sun,
Seth Ahrenbach,
Federico Cassano,
Chuyue Sun,
Ying Sheng,
Anish Mudide,
Md Rakib Hossain Misu,
Nada Amin,
Max Tegmark
Abstract:
We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% succe…
▽ More
We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
Opening the AI black box: program synthesis via mechanistic interpretability
Authors:
Eric J. Michaud,
Isaac Liao,
Vedang Lad,
Ziming Liu,
Anish Mudide,
Chloe Loughridge,
Zifan Carl Guo,
Tara Rezaei Kheirkhah,
Mateja Vukelić,
Max Tegmark
Abstract:
We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by G…
▽ More
We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by GPT-4 (which also solves 30). MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean or integer symbolic regression to capture the learned algorithm. As opposed to large language models, this program synthesis technique makes no use of (and is therefore not limited by) human training data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling up this approach to make machine-learned models more interpretable and trustworthy.
△ Less
Submitted 7 February, 2024;
originally announced February 2024.