UQE: A Query Engine for Unstructured Databases

Authors: Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans

Abstract: Analytics on structured data is a mature field with many successful methods. However, most real world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs and reviews, across a range of useful query types, including conditional aggregation, semantic retrieval and abstraction aggregation.

Submitted 23 June, 2024; originally announced July 2024.

arXiv:2405.20270 [pdf, other]

Bridging electronic and classical density-functional theory using universal machine-learned functional approximations

Authors: Michelle M. Kelley, Joshua Quinton, Kamron Fazel, Nima Karimitari, Christopher Sutton, Ravishankar Sundararaman

Abstract: The accuracy of density-functional theory (DFT) is determined by the quality of the approximate functionals, such as exchange-correlation in electronic DFT and the excess functional in the classical DFT formalism of fluids. The exact functional is highly nonlocal for both electrons and fluids, yet most approximate functionals are semi-local or nonlocal in a limited weighted-density form. Machine-learned (ML) nonlocal density-functional approximations are promising in both electronic and classical DFT, but have so far employed disparate approaches with limited generality. Here, we formulate a universal approximation framework and training protocol for nonlocal ML functionals, combining features of equivariant convolutional neural networks and the weighted-density approximation. We prototype this approach for several 1D and quasi-1D problems and demonstrate that a functional with exactly the same hyperparameters achieves excellent accuracy for the hard-rod fluid, the inhomogeneous Ising model, the exact exchange functional for electrons, the electron kinetic energy functional for orbital-free DFT, as well as for liquid water with 1D inhomogeneities. These results lay the foundation for a universal ML approach to exact 3D functionals spanning electronic and classical DFT.

Submitted 17 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: 11 pages, 7 figures

arXiv:2405.20239 [pdf]

BEAST DB: Grand-Canonical Database of Electrocatalyst Properties

Authors: Cooper Tezak, Jacob Clary, Sophie Gerits, Joshua Quinton, Benjamin Rich, Nicholas Singstock, Abdulaziz Alherz, Taylor Aubry, Struan Clark, Rachel Hurst, Mauro Del Ben, Christopher Sutton, Ravishankar Sundararaman, Charles Musgrave, Derek Vigil-Fowler

Abstract: We present BEAST DB, an open-source database comprised of ab initio electrochemical data computed using grand-canonical density functional theory in implicit solvent at consistent calculation parameters. The database contains over 20,000 surface calculations and covers a broad set of heterogeneous catalyst materials and electrochemical reactions. Calculations were performed at self-consistent fixed potential as well as constant charge to facilitate comparisons to the computational hydrogen electrode. This article presents common use cases of the database to rationalize trends in catalyst activity, screen catalyst material spaces, understand elementary mechanistic steps, analyze electronic structure, and train machine learning models to predict higher fidelity properties. Users can interact graphically with the database by querying for individual calculations to gain granular understanding of reaction steps or by querying for an entire reaction pathway on a given material using an interactive reaction pathway tool. BEAST DB will be periodically updated, with planned future updates to include advanced electronic structure data, surface speciation studies, and greater reaction coverage.

Submitted 30 May, 2024; originally announced May 2024.

Comments: 24 pages, 8 figures

arXiv:2404.14662 [pdf, other]

NExT: Teaching Large Language Models to Reason about Code Execution

Authors: Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin

Abstract: A fundamental skill among human developers is the ability to understand and reason about program execution. As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (aka. rubber duck debugging). However, large language models (LLMs) of code are typically trained on the surface textual form of programs, thus may lack a semantic understanding of how programs execute at run-time. To address this issue, we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation. Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model, by 26.1% and 14.3% absolute, respectively, with significantly improved rationale quality as verified by automated metrics and human raters. Our model can also generalize to scenarios where program traces are absent at test-time.

Submitted 22 April, 2024; originally announced April 2024.

Comments: 35 pages

arXiv:2403.06955 [pdf, other]

Accurate Crystal Structure Prediction of New 2D Hybrid Organic Inorganic Perovskites

Authors: Nima Karimitari, William J. Baldwin, Evan W. Muller, Zachary J. L. Bare, W. Joshua Kennedy, Gábor Csányi, Christopher Sutton

Abstract: Low dimensional hybrid organic-inorganic perovskites (HOIPs) represent a promising class of electronically active materials for both light absorption and emission. The design space of HOIPs is extremely large, since a diverse space of organic cations can be combined with different inorganic frameworks. This immense design space allows for tunable electronic and mechanical properties, but also necessitates the development of new tools for in silico high throughput analysis of candidate structures. In this work, we present an accurate, efficient, transferable and widely applicable machine learning interatomic potential (MLIP) for predicting the structure of new 2D HOIPs. Using the MACE architecture, an MLIP is trained on 86 diverse experimentally reported HOIP structures. The model is tested on 73 unseen perovskite compositions, and achieves chemical accuracy with respect to the reference electronic structure method. Our model is then combined with a simple random structure search algorithm to predict the structure of hypothetical HOIPs given only the proposed composition. Success is demonstrated by correctly and reliably recovering the crystal structure of a set of experimentally known 2D perovskites. Such a random structure search is impossible with ab initio methods due to the associated computational cost, but is relatively inexpensive with the MACE potential. Finally, the procedure is used to predict the structure formed by a new organic cation with no previously known corresponding perovskite. Laboratory synthesis of the new hybrid perovskite confirms the accuracy of our prediction. This capability, applied at scale, enables efficient screening of thousands of combinations of organic cations and inorganic layers.

Submitted 11 March, 2024; originally announced March 2024.

Comments: 14 pages and 9 figures in the main text. Supplementary included in pdf

arXiv:2401.10998 [pdf, other]

Leveraging Domain Adaptation for Accurate Machine Learning Predictions of New Halide Perovskites

Authors: Dipannoy Das Gupta, Zachary J. L. Bare, Suxuen Yew, Santosh Adhikari, Brian DeCost, Qi Zhang, Charles Musgrave, Christopher Sutton

Abstract: We combine graph neural networks (GNN) with an inexpensive and reliable structure generation approach based on the bond-valence method (BVM) to train accurate machine learning models for screening 222,960 halide perovskites using statistical estimates of the DFT/PBE formation energy (Ef), and the PBE and HSE band gaps (Eg). The GNNs were fine-tuned using domain adaptation (DA) from a source model, which yields a factor of 1.8 times improvement in Ef and 1.2 - 1.35 times improvement in HSE Eg compared to direct training (i.e., without DA). Using these two ML models, 48 compounds were identified out of 222,960 candidates as both stable and having an HSE Eg that is relevant for photovoltaic applications. For this subset, only 8 have been reported to date, indicating that 40 compounds remain unexplored to the best of our knowledge and therefore offer opportunities for potential experimental examination.

Submitted 19 January, 2024; originally announced January 2024.

arXiv:2401.00096 [pdf, other]

A foundation model for atomistic materials chemistry

Authors: Ilyes Batatia, Philipp Benner, Yuan Chiang, Alin M. Elena, Dávid P. Kovács, Janosh Riebesell, Xavier R. Advincula, Mark Asta, Matthew Avaylon, William J. Baldwin, Fabian Berger, Noam Bernstein, Arghya Bhowmik, Samuel M. Blau, Vlad Cărare, James P. Darby, Sandip De, Flaviano Della Pia, Volker L. Deringer, Rokas Elijošius, Zakariya El-Machachi, Fabio Falcioni, Edvin Fako, Andrea C. Ferrari, Annalena Genreith-Schriever, et al. (51 additional authors not shown)

Abstract: Machine-learned force fields have transformed the atomistic modelling of materials by enabling simulations of ab initio quality on unprecedented time and length scales. However, they are currently limited by: (i) the significant computational and human effort that must go into development and validation of potentials for each particular system of interest; and (ii) a general lack of transferability from one chemical system to the next. Here, using the state-of-the-art MACE architecture we introduce a single general-purpose ML model, trained on a public database of 150k inorganic crystals, that is capable of running stable molecular dynamics on molecules and materials. We demonstrate the power of the MACE-MP-0 model - and its qualitative and at times quantitative accuracy - on a diverse set of problems in the physical sciences, including the properties of solids, liquids, gases, chemical reactions, interfaces and even the dynamics of a small protein. The model can be applied out of the box and as a starting or "foundation model" for any atomistic system of interest and is thus a step towards democratising the revolution of ML force fields by lowering the barriers to entry.

Submitted 1 March, 2024; v1 submitted 29 December, 2023; originally announced January 2024.

Comments: 119 pages, 63 figures, 37MB PDF

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.02179 [pdf, other]

Training Chain-of-Thought via Latent-Variable Inference

Authors: Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous

Abstract: Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.

Submitted 28 November, 2023; originally announced December 2023.

Comments: 23 pages, 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2311.17311 [pdf, other]

Universal Self-Consistency for Large Language Model Generation

Authors: Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, Denny Zhou

Abstract: Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Consistency (USC), which leverages LLMs themselves to select the most consistent answer among multiple candidates. We evaluate USC on a variety of benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering. On open-ended generation tasks where the original self-consistency method is not applicable, USC effectively utilizes multiple samples and improves the performance. For mathematical reasoning, USC matches the standard self-consistency performance without requiring the answer formats to be similar. Finally, without access to execution results, USC also matches the execution-based voting performance on code generation.

Submitted 28 November, 2023; originally announced November 2023.

arXiv:2307.13883 [pdf, other]

ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis

Authors: Kensen Shi, Joey Hong, Yinlin Deng, Pengcheng Yin, Manzil Zaheer, Charles Sutton

Abstract: When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, we can measure whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we characterize several different forms of compositional generalization that are desirable in program synthesis, forming a meta-benchmark which we use to create generalization tasks for two popular datasets, RobustFill and DeepCoder. We then propose ExeDec, a novel decomposition-based synthesis strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step. When used with Transformer models trained from scratch, ExeDec has better synthesis performance and greatly improved compositional generalization ability compared to baselines. Finally, we use our benchmarks to demonstrate that LLMs struggle to compositionally generalize when asked to do programming-by-example in a few-shot setting, but an ExeDec-style prompting approach can improve the generalization ability and overall performance.

Submitted 6 May, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

Comments: ICLR 2024

arXiv:2307.07609 [pdf, other]

Interpretable machine learning to understand the performance of semi local density functionals for materials thermochemistry

Authors: Santosh Adhikari, Christopher J. Bartel, Christopher Sutton

Abstract: This study investigates the use of machine learning (ML) to correct the enthalpy of formation (Hf) from two separate DFT functionals, PBE and SCAN, to the experimental Hf across 1011 solid-state compounds. The ML model uses a set of 25 properties that characterize the electronic structure as calculated using PBE and SCAN. The ML model significantly decreases the error in PBE-calculated Hf values from a mean absolute error (MAE) of 195 meV/atom to an MAE of 80 meV/atom when compared to the experiment. For PBE, the PDP+GAM analysis shows compounds with a high ionicity (I), i.e., I > 0.22, have errors in Hf that are twice as large as compounds having I < 0.22 (246 meV/atom compared to 113 meV/atom). Conversely, no analogous trend is observed for SCAN-calculated Hfs, which explains why the ML model for PBE can more easily correct the systematic error in calculated Hfs for PBE but not for SCAN. Although the literature suggests PBE is reliable for intermetallics but less so for oxides and halides, our analysis reveals intermetallics pose a challenge for PBE only when the charge transfer is significant (I > 0.22). Meanwhile, oxides and halides may be described accurately by PBE for systems in which charge transfer is relatively low (I < 0.22).

Submitted 14 July, 2023; originally announced July 2023.

arXiv:2306.12272 [pdf, other]

From structure mining to unsupervised exploration of atomic octahedral networks

Authors: R. Patrick Xian, Ryan J. Morelock, Ido Hadar, Charles B. Musgrave, Christopher Sutton

Abstract: Networks of atom-centered coordination octahedra commonly occur in inorganic and hybrid solid-state materials. Characterizing their spatial arrangements and characteristics is crucial for relating structures to properties for many materials families. The traditional method using case-by-case inspection becomes prohibitive for discovering trends and similarities in large datasets. Here, we operationalize chemical intuition to automate the geometric parsing, quantification, and classification of coordination octahedral networks. We find axis-resolved tilting trends in ABO$_{3}$ perovskite polymorphs, which assist in detecting oxidation state changes. Moreover, we develop a scale-invariant encoding scheme to represent these networks, which, combined with human-assisted unsupervised machine learning, allows us to taxonomize the inorganic framework polytypes in hybrid iodoplumbates (A$_x$Pb$_y$I$_z$). Consequently, we uncover a violation of Pauling's third rule and the design principles underpinning their topological diversity. Our results offer a glimpse into the vast design space of atomic octahedral networks and inform high-throughput, targeted screening of specific structure types.

Submitted 21 June, 2023; originally announced June 2023.

Comments: 56 pages

arXiv:2306.06545 [pdf, other]

A Probabilistic Framework for Modular Continual Learning

Authors: Lazar Valkov, Akash Srivastava, Swarat Chaudhuri, Charles Sutton

Abstract: Modular approaches that use a different composition of modules for each problem are a promising direction in continual learning (CL). However, searching through the large, discrete space of module compositions is challenging, especially because evaluating a composition's performance requires a round of neural network training. We address this challenge through a modular CL framework, PICLE, that uses a probabilistic model to cheaply compute the fitness of each composition, allowing PICLE to achieve both perceptual, few-shot and latent transfer. The model combines prior knowledge about good module compositions with dataset-specific information. We evaluate PICLE using two benchmark suites designed to assess different desiderata of CL techniques. Comparing to a wide range of approaches, we show that PICLE is the first modular CL algorithm to achieve perceptual, few-shot and latent transfer while scaling well to large search spaces, outperforming previous state-of-the-art modular CL approaches on long problem sequences.

Submitted 2 May, 2024; v1 submitted 10 June, 2023; originally announced June 2023.

arXiv:2306.02049 [pdf, other]

LambdaBeam: Neural Program Search with Higher-Order Functions and Lambdas

Authors: Kensen Shi, Hanjun Dai, Wen-Ding Li, Kevin Ellis, Charles Sutton

Abstract: Search is an important technique in program synthesis that allows for adaptive strategies such as focusing on particular search directions based on execution results. Several prior works have demonstrated that neural models are effective at guiding program synthesis searches. However, a common drawback of those approaches is the inability to handle iterative loops, higher-order functions, or lambda functions, thus limiting prior neural searches from synthesizing longer and more general programs. We address this gap by designing a search algorithm called LambdaBeam that can construct arbitrary lambda functions that compose operations within a given DSL. We create semantic vector representations of the execution behavior of the lambda functions and train a neural policy network to choose which lambdas to construct during search, and pass them as arguments to higher-order functions to perform looping computations. Our experiments show that LambdaBeam outperforms neural, symbolic, and LLM-based techniques in an integer list manipulation domain.

Submitted 28 October, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

arXiv:2306.00970 [pdf, other]

Improving the reliability of machine learned potentials for modeling inhomogenous liquids

Authors: Kamron Fazel, Nima Karimitari, Tanooj Shah, Christopher Sutton, Ravishankar Sundararaman

Abstract: The atomic-scale response of inhomogeneous fluids at interfaces and surrounding solute particles plays a critical role in governing chemical, electrochemical and biological processes at such interfaces. Classical molecular dynamics simulations have been applied extensively to simulate the response of inhomogeneous fluids directly, and as inputs to classical density functional theory, but are limited by the accuracy of the underlying empirical force fields. Here, we deploy neural network potentials (NNPs) trained to ab initio simulations to accurately predict the inhomogeneous response of two widely different fluids: liquid water and molten NaCl. Although NNPs can be readily trained to model complex bulk systems across a range of state points, in order to appropriately model a fluid's response at an interface, inhomogeneous configurations must be included in the training data. We establish protocols based on molecular dynamics simulations in external atomic potentials in order to sufficiently sample the correct configurations of inhomogeneous fluids. We show that NNPs trained to inhomogeneous fluid configurations can predict several properties such as the density response, surface tension and size-dependent cavitation free energies in water and molten NaCl corresponding to ab initio interactions more accurately than empirical force fields. This work therefore provides a first demonstration and framework for extracting the response of inhomogeneous fluids from first principles for classical density-functional treatment of fluids free from empirical potentials.

Submitted 27 November, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: 8 pages, 4 figures

arXiv:2304.04714 [pdf, other]

Dynamic Local Structure in Caesium Lead Iodide: Spatial Correlation and Transient Domains

Authors: William Baldwin, Xia Liang, Johan Klarbring, Milos Dubajic, David Dell'Angelo, Christopher Sutton, Claudia Caddeo, Samuel D. Stranks, Alessandro Mattoni, Aron Walsh, Gábor Csányi

Abstract: Metal halide perovskites are multifunctional semiconductors with tunable structures and properties. They are highly dynamic crystals with complex octahedral tilting patterns and strongly anharmonic atomic behaviour. In the higher temperature, higher symmetry phases of these materials, several complex structural features have been observed. The local structure can differ greatly from the average structure and there is evidence that dynamic two-dimensional structures of correlated octahedral motion form. An understanding of the underlying complex atomistic dynamics is, however, still lacking. In this work, the local structure of the inorganic perovskite CsPbI$_3$ is investigated using a new machine learning force field based on the atomic cluster expansion framework. Through analysis of the temporal and spatial correlation observed during large-scale simulations, we reveal that the low frequency motion of octahedral tilts implies a double-well effective potential landscape, even well into the cubic phase. Moreover, dynamic local regions of lower symmetry are present within both higher symmetry phases. These regions are planar and we report the length and timescales of the motion. Finally, we investigate and visualise the spatial arrangement of these features and their interactions, providing a comprehensive picture of local structure in the higher symmetry phases.

Submitted 11 April, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

arXiv:2212.09248 [pdf, other]

Natural Language to Code Generation in Interactive Data Science Notebooks

Authors: Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, Charles Sutton

Abstract: Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.

Submitted 19 December, 2022; originally announced December 2022.

Comments: 46 pages, 32 figures

arXiv:2208.07461 [pdf, other]

A Library for Representing Python Programs as Graphs for Machine Learning

Authors: David Bieber, Kensen Shi, Petros Maniatis, Charles Sutton, Vincent Hellendoorn, Daniel Johnson, Daniel Tarlow

Abstract: Graph representations of programs are commonly a central element of machine learning for code research. We introduce an open source Python library python_graphs that applies static analysis to construct graph representations of Python programs suitable for training machine learning models. Our library admits the construction of control-flow graphs, data-flow graphs, and composite "program graphs" that combine control-flow, data-flow, syntactic, and lexical information about a program. We present the capabilities and limitations of the library, perform a case study applying the library to millions of competitive programming submissions, and showcase the library's utility for machine learning research.

Submitted 15 August, 2022; originally announced August 2022.

Comments: 21 pages, 14 figures

arXiv:2207.14405 [pdf, ps, other]

Spectral multiplicity and nodal sets for generic torus-invariant metrics

Authors: Donato Cianci, Chris Judge, Samuel Lin, Craig Sutton

Abstract: Let a torus $T$ act freely on a closed manifold $M$ of dimension at least two. We demonstrate that, for a generic $T$-invariant Riemannian metric $g$ on $M$, each real $Δ_g$-eigenspace is an irreducible real representation of $T$ and, therefore, has dimension at most two. We also show that, for the generic $T$-invariant metric on $M$, if $u$ is a non-invariant real-valued $Δ_g$-eigenfunction that vanishes on some $T$-orbit, then the nodal set of $u$ is a connected smooth hypersurface whose complement has exactly two connected components.

Submitted 28 July, 2022; originally announced July 2022.

Comments: 18 pages

MSC Class: 58J50 (Primary) 35P05; 81Q10 (Secondary)

arXiv:2207.10342 [pdf, ps, other]

Language Model Cascades

Authors: David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-dickstein, Kevin Murphy, Charles Sutton

Abstract: Prompted models have demonstrated impressive few-shot learning abilities. Repeated interactions at test-time with a single model, or the composition of multiple models together, further expands capabilities. These compositions are probabilistic models, and may be expressed in the language of graphical models with random variables whose values are complex data types such as strings. Cases with control flow and dynamic structure require techniques from probabilistic programming, which allow implementing disparate model structures and inference strategies in a unified language. We formalize several existing techniques from this perspective, including scratchpads / chain of thought, verifiers, STaR, selection-inference, and tool use. We refer to the resulting programs as language model cascades.

Submitted 28 July, 2022; v1 submitted 21 July, 2022; originally announced July 2022.

Comments: Presented as spotlight at the Beyond Bayes workshop at ICML 2022 (https://beyond-bayes.github.io)

arXiv:2207.08050 [pdf, other]

Repairing Systematic Outliers by Learning Clean Subspaces in VAEs

Authors: Simao Eduardo, Kai Xu, Alfredo Nazabal, Charles Sutton

Abstract: Data cleaning often comprises outlier detection and data repair. Systematic errors result from nearly deterministic transformations that occur repeatedly in the data, e.g. specific image pixels being set to default values or watermarks. Consequently, models with enough capacity easily overfit to these errors, making detection and repair difficult. Seeing as a systematic outlier is a combination of patterns of a clean instance and systematic error patterns, our main insight is that inliers can be modelled by a smaller representation (subspace) in a model than outliers. By exploiting this, we propose Clean Subspace Variational Autoencoder (CLSVAE), a novel semi-supervised model for detection and automated repair of systematic errors. The main idea is to partition the latent space and model inlier and outlier patterns separately. CLSVAE is effective with much less labelled data compared to previous related models, often with less than 2% of the data. We provide experiments using three image datasets in scenarios with different levels of corruption and labelled set sizes, comparing to relevant baselines. CLSVAE provides superior repairs without human intervention, e.g. with just 0.25% of labelled data we see a relative error decrease of 58% compared to the closest baseline.

Submitted 16 July, 2022; originally announced July 2022.

Comments: Submitted for review in ICLR 2022

arXiv:2204.03758 [pdf, other]

Compositional Generalization and Decomposition in Neural Program Synthesis

Authors: Kensen Shi, Joey Hong, Manzil Zaheer, Pengcheng Yin, Charles Sutton

Abstract: When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, what we can measure is whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize. We first characterize several different axes along which program synthesis methods would be desired to generalize, e.g., length generalization, or the ability to combine known subroutines in new ways that do not occur in the training data. Based on this characterization, we introduce a benchmark suite of tasks to assess these abilities based on two popular existing datasets, SCAN and RobustFill. Finally, we make first attempts to improve the compositional generalization ability of Transformer models along these axes through novel attention mechanisms that draw inspiration from a human-like decomposition strategy. Empirically, we find our modified Transformer models generally perform better than natural baselines, but the tasks remain challenging.

Submitted 7 April, 2022; originally announced April 2022.

Comments: Published at the Deep Learning for Code (DL4C) Workshop at ICLR 2022

arXiv:2204.02311 [pdf, other]

PaLM: Scaling Language Modeling with Pathways

Authors: Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, et al. (42 additional authors not shown)

Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

Submitted 5 October, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

arXiv:2203.10452 [pdf, other]

CrossBeam: Learning to Search in Bottom-Up Program Synthesis

Authors: Kensen Shi, Hanjun Dai, Kevin Ellis, Charles Sutton

Abstract: Many approaches to program synthesis perform a search within an enormous space of programs to find one that satisfies a given specification. Prior works have used neural models to guide combinatorial search algorithms, but such approaches still explore a huge portion of the search space and quickly become intractable as the size of the desired program increases. To tame the search space blowup, we propose training a neural model to learn a hands-on search policy for bottom-up synthesis, instead of relying on a combinatorial search algorithm. Our approach, called CrossBeam, uses the neural model to choose how to combine previously-explored programs into new programs, taking into account the search history and partial program executions. Motivated by work in structured prediction on learning to search, CrossBeam is trained on-policy using data extracted from its own bottom-up searches on training tasks. We evaluate CrossBeam in two very different domains, string manipulation and logic programming. We observe that CrossBeam learns to search efficiently, exploring much smaller portions of the program space compared to the state-of-the-art.

Submitted 20 March, 2022; originally announced March 2022.

Comments: Published at ICLR 2022

arXiv:2112.00114 [pdf, other]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Authors: Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena

Abstract: Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations -- even in the few-shot regime -- when asked to perform the operation "step by step", showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a "scratchpad". On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.

Submitted 30 November, 2021; originally announced December 2021.

arXiv:2108.07732 [pdf, other]

Program Synthesis with Large Language Models

Authors: Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton

Abstract: This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.

Submitted 15 August, 2021; originally announced August 2021.

Comments: Jacob and Augustus contributed equally

arXiv:2106.15339 [pdf, other]

SpreadsheetCoder: Formula Prediction from Semi-structured Context

Authors: Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, Denny Zhou

Abstract: Spreadsheet formula prediction has been an important program synthesis problem with many real-world applications. Previous works typically utilize input-output examples as the specification for spreadsheet formula synthesis, where each input-output pair simulates a separate row in the spreadsheet. However, this formulation does not fully capture the rich context in real-world spreadsheets. First, spreadsheet data entries are organized as tables, thus rows and columns are not necessarily independent from each other. In addition, many spreadsheet tables include headers, which provide high-level descriptions of the cell data. However, previous synthesis approaches do not consider headers as part of the specification. In this work, we present the first approach for synthesizing spreadsheet formulas from tabular context, which includes both headers and semi-structured tabular data. In particular, we propose SpreadsheetCoder, a BERT-based model architecture to represent the tabular context in both row-based and column-based formats. We train our model on a large dataset of spreadsheets, and demonstrate that SpreadsheetCoder achieves top-1 prediction accuracy of 42.51%, which is a considerable improvement over baselines that do not employ rich tabular context. Compared to the rule-based system, SpreadsheetCoder assists 82% more users in composing formulas on Google Sheets.

Submitted 26 June, 2021; originally announced June 2021.

Comments: Published in ICML 2021

arXiv:2104.05134 [pdf, other]

Couplings for Multinomial Hamiltonian Monte Carlo

Authors: Kai Xu, Tor Erlend Fjelde, Charles Sutton, Hong Ge

Abstract: Hamiltonian Monte Carlo (HMC) is a popular sampling method in Bayesian inference. Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for unbiased Monte Carlo estimation, establishing a generic parallelizable scheme for HMC. However, in practice a different HMC method, multinomial HMC, is considered as the go-to method, e.g. as part of the no-U-turn sampler. In multinomial HMC, proposed states are not limited to end-points as in Metropolis HMC; instead points along the entire trajectory can be proposed. In this paper, we establish couplings for multinomial HMC, based on optimal transport for multinomial sampling in its transition. We prove an upper bound for the meeting time - the time it takes for the coupled chains to meet - based on the notion of local contractivity. We evaluate our methods using three targets: 1,000 dimensional Gaussians, logistic regression and log-Gaussian Cox point processes. Compared to Heng & Jacob (2019), coupled multinomial HMC generally attains a smaller meeting time, and is more robust to choices of step sizes and trajectory lengths, which allows re-use of existing adaptation methods for HMC. These improvements together pave the way for a wider and more practical use of coupled HMC methods.

Submitted 11 April, 2021; originally announced April 2021.

Comments: Published in AISTATS 2021

arXiv:2103.14429 [pdf, other]

doi 10.1088/1748-0221/16/07/T07012

Measurement of the distribution of $^{207}$Bi depositions on calibration sources for SuperNEMO

Authors: R. Arnold, C. Augier, A. S. Barabash, A. Basharina-Freshville, E. Birdsall, S. Blondel, M. Bongrand, D. Boursette, R. Breier, V. Brudanin, J. Busto, S. Calvez, C. Cerna, J. P. Cesar, M. Ceschia, A. Chapon, E. Chauveau, A. Chopra, L. Dawson, S. De Capua, D. Duchesneau, D. Durand, G. Eurin, J. J. Evans, D. Filosofov, et al. (75 additional authors not shown)

Abstract: The SuperNEMO experiment will search for neutrinoless double-beta decay ($0νββ$), and study the Standard-Model double-beta decay process ($2νββ$). The SuperNEMO technology can measure the energy of each of the electrons produced in a double-beta ($ββ$) decay, and can reconstruct the topology of their individual tracks. The study of the double-beta decay spectrum requires very accurate energy calibration to be carried out periodically. The SuperNEMO Demonstrator Module will be calibrated using 42 calibration sources, each consisting of a droplet of $^{207}$Bi within a frame assembly. The quality of these sources, which depends upon the entire $^{207}$Bi droplet being contained within the frame, is key for correctly calibrating SuperNEMO's energy response. In this paper, we present a novel method for precisely measuring the exact geometry of the deposition of $^{207}$Bi droplets within the frames, using Timepix pixel detectors. We studied 49 different sources and selected 42 high-quality sources with the most central source positioning.

Submitted 20 May, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: 16 pages, 12 figures, submitted to JINST, response to reviewer comments

arXiv:2012.00377 [pdf, other]

Latent Programmer: Discrete Latent Codes for Program Synthesis

Authors: Joey Hong, David Dohan, Rishabh Singh, Charles Sutton, Manzil Zaheer

Abstract: In many sequence learning tasks, such as program synthesis and document summarization, a key problem is searching over a large space of possible output sequences. We propose to learn representations of the outputs that are specifically meant for search: rich enough to specify the desired output but compact enough to make search more efficient. Discrete latent codes are appealing for this purpose, as they naturally allow sophisticated combinatorial search strategies. The latent codes are learned using a self-supervised learning principle, in which first a discrete autoencoder is trained on the output sequences, and then the resulting latent codes are used as intermediate targets for the end-to-end sequence prediction task. Based on these insights, we introduce the Latent Programmer, a program synthesis method that first predicts a discrete latent code from input/output examples, and then generates the program in the target language. We evaluate the Latent Programmer on two domains: synthesis of string transformation programs, and generation of programs from natural language descriptions. We demonstrate that the discrete latent representation significantly improves synthesis accuracy.

Submitted 5 August, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

Comments: ICML 2021; 15 pages, 9 figures

arXiv:2011.07657 [pdf, other]

Search for Periodic Modulations of the Rate of Double-Beta Decay of $^{100}$Mo in the NEMO-3 Detector

Authors: NEMO-3 Collaboration: R. Arnold, C. Augier, A. S. Barabash, A. Basharina-Freshville, S. Blondel, S. Blot, M. Bongrand, D. Boursette, R. Breier, V. Brudanin, J. Busto, A. J. Caffrey, S. Calvez, C. Cerna, J. P. Cesar, M. Ceschia, A. Chapon, E. Chauveau, A. Chopra, L. Dawson, D. Duchesneau, D. Durand, G. Eurin, et al. (84 additional authors not shown)

Abstract: Double-beta decays of $^{100}$Mo from the 6.0195-year exposure of a 6.914 kg high-purity sample were recorded by the NEMO-3 experiment that searched for neutrinoless double-beta decays. These ultra-rare transitions to $^{100}$Ru have a half-life of approximately $7\times10^{18}$ years, and have been used to conduct the first ever search for periodic variations of this decay mode. The Lomb-Scargle periodogram technique, and its error-weighted extension, were employed to look for periodic modulations of the half-life. Monte Carlo modeling was used to study the modulation sensitivity of the data over a broad range of amplitudes and frequencies. Data show no evidence of modulations with amplitude greater than 2.5% in the frequency range of $0.33225\,{\rm y^{-1}}$ to $365.25\,{\rm y^{-1}}$.

Submitted 15 November, 2020; originally announced November 2020.

arXiv:2011.05363 [pdf, other]

Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration

Authors: Hanjun Dai, Rishabh Singh, Bo Dai, Charles Sutton, Dale Schuurmans

Abstract: Discrete structures play an important role in applications like program language modeling and software engineering. Current approaches to predicting complex structures typically consider autoregressive models for their tractability, with some sacrifice in flexibility. Energy-based models (EBMs) on the other hand offer a more flexible and thus more powerful approach to modeling such distributions, but require partition function estimation. In this paper we propose ALOE, a new algorithm for learning conditional and unconditional EBMs for discrete structured data, where parameter gradients are estimated using a learned sampler that mimics local search. We show that the energy function and sampler can be trained efficiently via a new variational form of power iteration, achieving a better trade-off between flexibility and tractability. Experimentally, we show that learning local search leads to significant improvements in challenging application domains. Most notably, we present an energy model guided fuzzer for software testing that achieves comparable performance to well engineered fuzzing engines like libfuzzer.

Submitted 10 November, 2020; originally announced November 2020.

Comments: NeurIPS 2020

arXiv:2010.12621 [pdf, other]

Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks

Authors: David Bieber, Charles Sutton, Hugo Larochelle, Daniel Tarlow

Abstract: Graph neural networks (GNNs) have emerged as a powerful tool for learning software engineering tasks including code completion, bug finding, and program repair. They benefit from leveraging program structure like control flow graphs, but they are not well-suited to tasks like program execution that require far more sequential reasoning steps than number of GNN propagation steps. Recurrent neural networks (RNNs), on the other hand, are well-suited to long sequential chains of reasoning, but they do not naturally incorporate program structure and generally perform worse on the above tasks. Our aim is to achieve the best of both worlds, and we do so by introducing a novel GNN architecture, the Instruction Pointer Attention Graph Neural Networks (IPA-GNN), which achieves improved systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by considering RNNs operating on program traces with branch decisions as latent variables. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution. To test the models, we propose evaluating systematic generalization on learning to execute using control flow graphs, which tests sequential reasoning and use of program structure. More practically, we evaluate these models on the task of learning to execute partial programs, as might arise if using the model as a heuristic function in program synthesis. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both tasks.

Submitted 23 October, 2020; originally announced October 2020.

Comments: Accepted at NeurIPS 2020

arXiv:2010.11887 [pdf, other]

doi 10.1145/3490421

Conditional independence by typing

Authors: Maria I. Gorinova, Andrew D. Gordon, Charles Sutton, Matthijs Vákár

Abstract: A central goal of probabilistic programming languages (PPLs) is to separate modelling from inference. However, this goal is hard to achieve in practice. Users are often forced to re-write their models in order to improve efficiency of inference or meet restrictions imposed by the PPL. Conditional independence (CI) relationships among parameters are a crucial aspect of probabilistic models that capture a qualitative summary of the specified model and can facilitate more efficient inference. We present an information flow type system for probabilistic programming that captures conditional independence (CI) relationships, and show that, for a well-typed program in our system, the distribution it implements is guaranteed to have certain CI-relationships. Further, by using type inference, we can statically deduce which CI-properties are present in a specified model. As a practical application, we consider the problem of how to perform inference on models with mixed discrete and continuous parameters. Inference on such models is challenging in many existing PPLs, but can be improved through a workaround, where the discrete parameters are used implicitly, at the expense of manual model re-writing. We present a source-to-source semantics-preserving transformation, which uses our CI-type system to automate this workaround by eliminating the discrete parameters from a probabilistic program. The resulting program can be seen as a hybrid inference algorithm on the original program, where continuous parameters can be drawn using efficient gradient-based inference methods, while the discrete parameters are inferred using variable elimination. We implement our CI-type system and its example application in SlicStan: a compositional variant of Stan.

Submitted 18 February, 2022; v1 submitted 22 October, 2020; originally announced October 2020.

Journal ref: ACM Transactions on Programming Languages and Systems, Volume 44, Issue 1, March 2022, Article No 4, pp 1-54

arXiv:2008.04573 [pdf]

doi 10.1103/PhysRevMaterials.4.125001

Investigating the ranges of (meta)stable phase formation in (InxGa1-x)2O3: Impact of the cation coordination

Authors: C. Wouters, C. Sutton, L. M. Ghiringhelli, T. Markurt, R. Schewski, A. Hassa, H. von Wenckstern, M. Grundmann, M. Scheffler, M. Albrecht

Abstract: We investigate the phase diagram of the heterostructural solid solution (InxGa1-x)2O3 both computationally, by combining cluster expansion and density functional theory, and experimentally, by means of TEM measurements of pulsed laser deposited (PLD) heteroepitaxial thin films. The shapes of the Gibbs free energy curves for the monoclinic, hexagonal and cubic bixbyite alloy as a function of composition can be explained in terms of the preferred cation coordination environments of indium and gallium. We show by atomically resolved STEM that the strong preference of indium for six-fold coordination results in ordered monoclinic and hexagonal lattices. This ordering impacts the configurational entropy in the solid solution and thereby the (InxGa1-x)2O3 phase diagram. The resulting phase diagram is characterized by very limited solubilities of gallium and indium in the monoclinic, hexagonal and cubic ground state phases respectively but exhibits wide metastable ranges at realistic growth temperatures. On the indium rich side of the phase diagram a wide miscibility gap is found, which results in phase separated layers. The experimentally observed indium solubilities in the PLD samples are in the range of x=0.45 and x=0.55 for monoclinic and hexagonal single-phase films, while for phase separated films we find x=0.5 for the monoclinic phase, x=0.65-0.7 for the hexagonal phase and x>0.9 for the cubic phase. These values are consistent with the computed metastable ranges for each phase.

Submitted 11 August, 2020; originally announced August 2020.

Comments: 16 pages, 7 figures

Journal ref: Phys. Rev. Materials 4, 125001 (2020)

arXiv:2007.14381 [pdf, other]

BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration

Authors: Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, Charles Sutton, Hanjun Dai

Abstract: Program synthesis is challenging largely because of the difficulty of search in a large space of programs. Human programmers routinely tackle the task of writing complex programs by writing sub-programs and then analyzing their intermediate results to compose them in appropriate ways. Motivated by this intuition, we present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a given set of input-output examples. This is a powerful combination because of several emergent properties. First, in bottom-up search, intermediate programs can be executed, providing semantic information to the neural network. Second, given the concrete values from those executions, we can exploit rich features based on recent work on property signatures. Finally, bottom-up search allows the system substantial flexibility in what order to generate the solution, allowing the synthesizer to build up a program from multiple smaller sub-programs. Overall, our empirical evaluation finds that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches. We demonstrate the effectiveness of our technique on two datasets, one from the SyGuS competition and one of our own creation.

Submitted 30 September, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

arXiv:2006.10924 [pdf, other]

Neural Program Synthesis with a Differentiable Fixer

Authors: Matej Balog, Rishabh Singh, Petros Maniatis, Charles Sutton

Abstract: We present a new program synthesis approach that combines an encoder-decoder based synthesis architecture with a differentiable program fixer. Our approach is inspired from the fact that human developers seldom get their program correct on the first attempt, and perform iterative testing-based program fixing to get to the desired program functionality. Similarly, our approach first learns a distribution over programs conditioned on an encoding of a set of input-output examples, and then iteratively performs fix operations using the differentiable fixer. The fixer takes as input the original examples and the current program's outputs on example inputs, and generates a new distribution over the programs with the goal of reducing the discrepancies between the current program outputs and the desired example outputs. We train our architecture end-to-end on the RobustFill domain, and show that the addition of the fixer module leads to a significant improvement on synthesis accuracy compared to using beam search.

Submitted 18 June, 2020; originally announced June 2020.

arXiv:2004.13214 [pdf, ps, other]

SCELMo: Source Code Embeddings from Language Models

Authors: Rafael-Michael Karampatsis, Charles Sutton

Abstract: Continuous embeddings of tokens in computer programs have been used to support a variety of software development tools, including readability, code search, and program repair. Contextual embeddings are common in natural language processing but have not been previously applied in software engineering. We introduce a new set of deep contextualized word representations for computer programs based on language models. We train a set of embeddings using the ELMo (embeddings from language models) framework of Peters et al (2018). We investigate whether these embeddings are effective when fine-tuned for the downstream task of bug detection. We show that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection.

Submitted 27 April, 2020; originally announced April 2020.

Comments: 12 pages

arXiv:2004.00348 [pdf, other]

OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints

Authors: Irene Vlassi Pandi, Earl T. Barr, Andrew D. Gordon, Charles Sutton

Abstract: We present a new approach to the type inference problem for dynamic languages. Our goal is to combine \emph{logical} constraints, that is, deterministic information from a type system, with \emph{natural} constraints, that is, uncertain statistical information about types learnt from sources like identifier names. To this end, we introduce a framework for probabilistic type inference that combines logic and learning: logical constraints on the types are extracted from the program, and deep learning is applied to predict types from surface-level code properties that are statistically associated. The foremost insight of our method is to constrain the predictions from the learning procedure to respect the logical constraints, which we achieve by relaxing the logical inference problem of type prediction into a continuous optimisation problem. We build a tool called OptTyper to predict missing types for TypeScript files. OptTyper combines a continuous interpretation of logical constraints derived by classical static analysis of TypeScript code, with natural constraints obtained from a deep learning model, which learns naming conventions for types from a large codebase. By evaluating OptTyper, we show that the combination of logical and natural constraints yields a large improvement in performance over either kind of information individually and achieves a 4% improvement over the state-of-the-art.

Submitted 26 March, 2021; v1 submitted 1 April, 2020; originally announced April 2020.

Comments: 29 pages, 5 figures, 2 tables

arXiv:2003.07914 [pdf, ps, other]

doi 10.1145/3377811.3380342

Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Authors: Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, Andrea Janes

Abstract: Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.

Submitted 17 March, 2020; originally announced March 2020.

Comments: 13 pages; to appear in Proceedings of ICSE 2020

arXiv:2003.04227 [pdf, other]

Towards Modular Algorithm Induction

Authors: Daniel A. Abolafia, Rishabh Singh, Manzil Zaheer, Charles Sutton

Abstract: We present a modular neural network architecture Main that learns algorithms given a set of input-output examples. Main consists of a neural controller that interacts with a variable-length input tape and learns to compose modules together with their corresponding argument choices. Unlike previous approaches, Main uses a general domain-agnostic mechanism for selection of modules and their arguments. It uses a general input tape layout together with a parallel history tape to indicate most recently used locations. Finally, it uses a memoryless controller with a length-invariant self-attention based input tape encoding to allow for random access to tape locations. The Main architecture is trained end-to-end using reinforcement learning from a set of input-output examples. We evaluate Main on five algorithmic tasks and show that it can learn policies that generalize perfectly to inputs of much longer lengths than the ones used for training.

Submitted 27 February, 2020; originally announced March 2020.

Comments: 10 pages, 4 figures, 2 tables

arXiv:2002.09067 [pdf, other]

Incremental Sampling Without Replacement for Sequence Models

Authors: Kensen Shi, David Bieber, Charles Sutton

Abstract: Sampling is a fundamental technique, and sampling without replacement is often desirable when duplicate samples are not beneficial. Within machine learning, sampling is useful for generating diverse outputs from a trained model. We present an elegant procedure for sampling without replacement from a broad class of randomized programs, including generative neural models that construct outputs sequentially. Our procedure is efficient even for exponentially-large output spaces. Unlike prior work, our approach is incremental, i.e., samples can be drawn one at a time, allowing for increased flexibility. We also present a new estimator for computing expectations from samples drawn without replacement. We show that incremental sampling without replacement is applicable to many domains, e.g., program synthesis and combinatorial optimization.

Submitted 19 July, 2021; v1 submitted 20 February, 2020; originally announced February 2020.

arXiv:2002.09030 [pdf, other]

Learning to Represent Programs with Property Signatures

Authors: Augustus Odena, Charles Sutton

Abstract: We introduce the notion of property signatures, a representation for programs and program specifications meant for consumption by machine learning algorithms. Given a function with input type $τ_{in}$ and output type $τ_{out}$, a property is a function of type: $(τ_{in}, τ_{out}) \rightarrow \texttt{Bool}$ that (informally) describes some simple property of the function under consideration. For instance, if $τ_{in}$ and $τ_{out}$ are both lists of the same type, one property might ask `is the input list the same length as the output list?'. If we have a list of such properties, we can evaluate them all for our function to get a list of outputs that we will call the property signature. Crucially, we can `guess' the property signature for a function given only a set of input/output pairs meant to specify that function. We discuss several potential applications of property signatures and show experimentally that they can be used to improve over a baseline synthesizer so that it emits twice as many programs in less than one-tenth of the time.

Submitted 12 February, 2020; originally announced February 2020.

Comments: ICLR 2020
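
Illustration (not part of the listing): the abstract above describes a property as a Boolean function of an input/output pair, and a property signature as the list of results obtained by evaluating a list of such properties. A minimal sketch of that idea follows. The three example properties, the name `property_signature`, and the three-valued `AllTrue` / `AllFalse` / `Mixed` summary over several examples are assumptions made for this sketch, not details stated in the abstract.

```python
from typing import Any, Callable, List, Sequence, Tuple

# A property maps one (input, output) pair to a Boolean.
Property = Callable[[Any, Any], bool]

# Hypothetical properties for functions from lists to lists.
PROPERTIES: List[Property] = [
    lambda xs, ys: len(xs) == len(ys),        # same length?
    lambda xs, ys: sorted(xs) == sorted(ys),  # same elements, any order?
    lambda xs, ys: list(ys) == sorted(ys),    # is the output sorted?
]


def property_signature(
    examples: Sequence[Tuple[Any, Any]],
    properties: Sequence[Property] = PROPERTIES,
) -> List[str]:
    """Evaluate every property on every example and summarise each one.

    The summary per property is 'AllTrue', 'AllFalse', or 'Mixed'
    (an assumption of this sketch).
    """
    signature = []
    for prop in properties:
        results = [prop(x, y) for x, y in examples]
        if all(results):
            signature.append("AllTrue")
        elif not any(results):
            signature.append("AllFalse")
        else:
            signature.append("Mixed")
    return signature


if __name__ == "__main__":
    # Input/output pairs that could specify a sorting function.
    sort_examples = [([3, 1, 2], [1, 2, 3]), ([5, 4], [4, 5])]
    print(property_signature(sort_examples))
```

Because the signature is computed from input/output pairs alone, it can be obtained for a specification before any program satisfying it is known, which is the sense of "guessing" the signature in the abstract.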

arXiv:2001.06388 [pdf, other]

doi 10.1016/j.nuclphysa.2020.121701

Search for the double-beta decay of 82Se to the excited states of 82Kr with NEMO-3

Authors: The NEMO-3 collaboration R. Arnold, C. Augier, A. S. Barabash, A. Basharina-Freshville, S. Blondel, S. Blot, M. Bongrand, D. Boursette, R. Breier, V. Brudanin, J. Busto, A. J. Caffrey, S. Calvez, M. Cascella, C. Cerna, J. P. Cesar, A. Chapon, E. Chauveau, A. Chopra, L. Dawson, D. Duchesneau, D. Durand, V. Egorov, G. Eurin, J. J. Evans, et al. (82 additional authors not shown)

Abstract: The double-beta decay of $^{82}$Se to the $0^+_1$ excited state of $^{82}$Kr has been studied with the NEMO-3 detector using 0.93 kg of enriched $^{82}$Se measured for 4.75 y, corresponding to an exposure of 4.42 kg y. A dedicated analysis to reconstruct the gamma-rays has been performed to search for events in the $2e2\gamma$ channel. No evidence of a $2\nu\beta\beta$ decay to the $0^+_1$ state has been observed and a limit of $T^{2\nu}_{1/2}(^{82}{\rm Se};\, 0^+_{gs} \rightarrow 0^+_1) > 1.3\times10^{21}$ y at 90% CL has been set. Concerning the $0\nu\beta\beta$ decay to the $0^+_1$ state, a limit for this decay has been obtained with $T^{0\nu}_{1/2}(^{82}{\rm Se};\, 0^+_{gs} \rightarrow 0^+_1) > 2.3\times10^{22}$ y at 90% CL, independently from the $2\nu\beta\beta$ decay process. These results are obtained for the first time with a tracko-calo detector, reconstructing every particle in the final state.

Submitted 17 January, 2020; originally announced January 2020.

Journal ref: Nuclear Physics A Volume 996, April 2020, 121701

arXiv:1911.01205 [pdf, other]

Learning to Fix Build Errors with Graph2Diff Neural Networks

Authors: Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton, Edward Aftandilian

Abstract: Professional software developers spend a significant amount of time fixing builds, but this has received little attention as a problem in automatic program repair. We present a new deep learning architecture, called Graph2Diff, for automatically localizing and fixing build errors. We represent source code, build configuration files, and compiler diagnostic messages as a graph, and then use a Graph Neural Network model to predict a diff. A diff specifies how to modify the code's abstract syntax tree, represented in the neural network as a sequence of tokens and of pointers to code locations. Our network is an instance of a more general abstraction that we call Graph2Tocopo, which is potentially useful in any development tool for predicting source code changes. We evaluate the model on a dataset of over 500k real build errors and their resolutions from professional developers. Compared to the approach of DeepDelta (Mesbah et al., 2019), our approach tackles the harder task of predicting a more precise diff but still achieves over double the accuracy.

Submitted 4 November, 2019; originally announced November 2019.

Comments: Submitted for review on Aug 23, 2019

arXiv:1910.14118 [pdf, ps, other]

Geometric structures and the Laplace spectrum, part II

Authors: Samuel Lin, Benjamin Schmidt, Craig Sutton

Abstract: We continue our exploration of the extent to which the spectrum encodes the local geometry of a locally homogeneous three-manifold and find that if $(M,g)$ and $(N,h)$ are a pair of locally homogeneous, locally non-isometric isospectral three-manifolds, where $M$ is an elliptic three-manifold, then $(1)$ $N$ is also an elliptic three-manifold, $(2)$ $M$ and $N$ have fundamental groups of different orders, $(3)$ $(M,g)$ and $(N,h)$ both have non-degenerate Ricci tensors and $(4)$ the metrics $g$ and $h$ are sufficiently far from a metric of constant sectional curvature. We are unaware of any such isospectral pair and such a pair could not arise via the classical Sunada method. As part of the proof, we provide an explicit description of the isometry group of a compact simple Lie group equipped with a left-invariant metric---improving upon the results of Ochiai-Takahashi and Onishchik---which we use to classify the locally homogeneous metrics on an elliptic three-manifold $Γ\backslash S^3$ and we determine that any collection of isospectral locally homogeneous metrics on an elliptic three-manifold consists of at most two isometry classes that are necessarily locally isometric. In particular, the left-invariant metrics on $\operatorname{SO}(3)$ (respectively, $S^3$) can be mutually distinguished via their spectra. The previous statement has the following interpretation in terms of physical chemistry: the moments of inertia of a molecule can be recovered from its rotational spectrum.

Submitted 30 October, 2019; originally announced October 2019.

Comments: 42 pages, 1 Figure

MSC Class: 53C20; 58J50

arXiv:1909.10742 [pdf]

doi 10.1016/j.jenvman.2020.111381

Effects of green revolution led agricultural expansion on net ecosystem service values in India

Authors: Srikanta Sannigrahi, Suman Chakraborti, Pawan Kumar Joshi, Saskia Keesstra, P. S. Roy, Paul C. Sutton, Urs Kreuter, Saikat Kumar Paul, Somnath Sen, Sandeep Bhatt, Shahid Rahmat, Shouvik Jha, Qi Zhang, Laishram Kanta Singh

Abstract: Ecosystem Services are a bundle of natural processes and functions that are essential for human well-being, subsistence, and livelihood. The expansion of cultivation and cropland, which is the backbone of the Indian economy, is one of the main drivers of rapid Land Use Land Cover changes in India. To assess the impact of the Green Revolution led agrarian expansion on the total ecosystem service values, we first estimated the ESVs from 1985 to 2005 for eight ecoregions in India using several value transfer approaches. Five explanatory factors such as Total Crop Area, Crop Production, Crop Yield, Net Irrigated Area, and Cropping Intensity representing the cropping scenarios in the country were used in constructing local Geographical Weighted Regression model to explore the cumulative and individual effects on ESVs. A Multi-Layer Perceptron based Artificial Neural Network algorithm was employed to estimate the normalized importance of these explanatory factors. During the observation periods, cropland, forestland, and water bodies have contributed the most and form a significant proportion of ESVs, followed by grassland, mangrove, wetland, and urban builtup. In all three years, among the nine ESs, the highest ESV accounts for water regulation, followed by soil formation and soilwater retention, biodiversity maintenance, waste treatment, climate regulation, and gas regulation. Among the five explanatory factors, TCA, NIA, CP showed a strong positive association with ESVs, while the CI exhibited a negative association. The study reveals a strong association between GR led agricultural expansion and ESVs in India.

Submitted 15 November, 2020; v1 submitted 24 September, 2019; originally announced September 2019.

Report number: Volume 277, 111381

Journal ref: Journal of Environmental Management, 2020

arXiv:1907.06671 [pdf, other]

Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data

Authors: Simão Eduardo, Alfredo Nazábal, Christopher K. I. Williams, Charles Sutton

Abstract: We focus on the problem of unsupervised cell outlier detection and repair in mixed-type tabular data. Traditional methods are concerned only with detecting which rows in the dataset are outliers. However, identifying which cells are corrupted in a specific row is an important problem in practice, and the very first step towards repairing them. We introduce the Robust Variational Autoencoder (RVAE), a deep generative model that learns the joint distribution of the clean data while identifying the outlier cells, allowing their imputation (repair). RVAE explicitly learns the probability of each cell being an outlier, balancing different likelihood models in the row outlier score, making the method suitable for outlier detection in mixed-type datasets. We show experimentally that RVAE not only performs better than several state-of-the-art methods in cell outlier detection and repair for tabular data, but is also robust against the initial hyper-parameter selection.

Submitted 3 March, 2020; v1 submitted 15 July, 2019; originally announced July 2019.

Comments: Accepted for publication at AISTATS 2020

arXiv:1906.00781 [pdf, other]

Learning Semantic Annotations for Tabular Data

Authors: Jiaoyan Chen, Ernesto Jimenez-Ruiz, Ian Horrocks, Charles Sutton

Abstract: The usefulness of tabular data such as web tables critically depends on understanding their semantics. This study focuses on column type prediction for tables without any meta data. Unlike traditional lexical matching-based methods, we propose a deep prediction model that can fully exploit a table's contextual semantics, including table locality features learned by a Hybrid Neural Network (HNN), and inter-column semantics features learned by a knowledge base (KB) lookup and query answering algorithm. It exhibits good performance not only on individual table sets, but also when transferring from one table set to another.

Submitted 30 May, 2019; originally announced June 2019.

Comments: 7 pages

Journal ref: IJCAI 2019

Showing 1–50 of 128 results for author: Sutton, C