-
Hot Fixing Software: A Comprehensive Review of Terminology, Techniques, and Applications
Authors:
Carol Hanna,
David Clark,
Federica Sarro,
Justyna Petke
Abstract:
A hot fix is an unplanned improvement to a specific time-critical issue deployed to a software system in production. While hot fixing is an essential and common activity in software maintenance, it has never been surveyed as a research activity. Thus, such a review is long overdue. In this paper, we conduct a comprehensive literature review of work on hot fixing. We highlight the fields where this…
▽ More
A hot fix is an unplanned improvement to a specific time-critical issue deployed to a software system in production. While hot fixing is an essential and common activity in software maintenance, it has never been surveyed as a research activity. Thus, such a review is long overdue. In this paper, we conduct a comprehensive literature review of work on hot fixing. We highlight the fields where this topic has been addressed, inconsistencies we identified in the terminology, gaps in the literature, and directions for future work. Our search concluded with 91 articles on the topic between the years 2000 and 2022. The articles found encompass many different research areas such as log analysis, runtime patching (also known as hot patching), and automated repair, as well as various application domains such as security, mobile, and video games. We find that many directions can take hot fix research forward such as unifying existing terminology, establishing a benchmark set of hot fixes, researching costs and frequency of hot fixes, and researching the possibility of end-to-end automation of detection, mitigation, and deployment. We discuss these avenues in detail to inspire the community to systematize hot fixing as a software engineering activity.
△ Less
Submitted 15 May, 2024; v1 submitted 17 January, 2024;
originally announced January 2024.
-
User-Centric Deployment of Automated Program Repair at Bloomberg
Authors:
David Williams,
James Callan,
Serkan Kirbas,
Sergey Mechtaev,
Justyna Petke,
Thomas Prideaux-Ghee,
Federica Sarro
Abstract:
Automated program repair (APR) tools have unlocked the potential for the rapid rectification of codebase issues. However, to encourage wider adoption of program repair in practice, it is necessary to address the usability concerns related to generating irrelevant or out-of-context patches. When software engineers are presented with patches they deem uninteresting or unhelpful, they are burdened wi…
▽ More
Automated program repair (APR) tools have unlocked the potential for the rapid rectification of codebase issues. However, to encourage wider adoption of program repair in practice, it is necessary to address the usability concerns related to generating irrelevant or out-of-context patches. When software engineers are presented with patches they deem uninteresting or unhelpful, they are burdened with more "noise" in their workflows and become less likely to engage with APR tools in future. This paper presents a novel approach to optimally time, target, and present auto-generated patches to software engineers. To achieve this, we designed, developed, and deployed a new tool dubbed B-Assist, which leverages GitHub's Suggested Changes interface to seamlessly integrate automated suggestions into active pull requests (PRs), as opposed to creating new, potentially distracting PRs. This strategy ensures that suggestions are not only timely, but also contextually relevant and delivered to engineers most familiar with the affected code. Evaluation among Bloomberg software engineers demonstrated their preference for this approach. From our user study, B-Assist's efficacy is evident, with the acceptance rate of patch suggestions being as high as 74.56%; engineers also found the suggestions valuable, giving usefulness ratings of at least 4 out of 5 in 78.2% of cases. Further, this paper sheds light on persisting usability challenges in APR and lays the groundwork for enhancing the user experience in future APR tools.
△ Less
Submitted 17 November, 2023;
originally announced November 2023.
-
Enhancing Genetic Improvement Mutations Using Large Language Models
Authors:
Alexander E. I. Brownlee,
James Callan,
Karine Even-Mendoza,
Alina Geiger,
Carol Hanna,
Justyna Petke,
Federica Sarro,
Dominik Sobania
Abstract:
Large language models (LLMs) have been successfully applied to software engineering tasks, including program repair. However, their application in search-based techniques such as Genetic Improvement (GI) is still largely unexplored. In this paper, we evaluate the use of LLMs as mutation operators for GI to improve the search process. We expand the Gin Java GI toolkit to call OpenAI's API to genera…
▽ More
Large language models (LLMs) have been successfully applied to software engineering tasks, including program repair. However, their application in search-based techniques such as Genetic Improvement (GI) is still largely unexplored. In this paper, we evaluate the use of LLMs as mutation operators for GI to improve the search process. We expand the Gin Java GI toolkit to call OpenAI's API to generate edits for the JCodec tool. We randomly sample the space of edits using 5 different edit types. We find that the number of patches passing unit tests is up to 75% higher with LLM-based edits than with standard Insert edits. Further, we observe that the patches found with LLMs are generally less diverse compared to standard edits. We ran GI with local search to find runtime improvements. Although many improving patches are found by LLM-enhanced GI, the best improving patch was found by standard GI.
△ Less
Submitted 18 October, 2023;
originally announced October 2023.
-
Multi-Objective Improvement of Android Applications
Authors:
James Callan,
Justyna Petke
Abstract:
Non-functional properties, such as runtime or memory use, are important to mobile app users and developers, as they affect user experience. Previous work on automated improvement of non-functional properties in mobile apps failed to address the inherent trade-offs between such properties. We propose a practical approach and the first open-source tool, GIDroid (2023), for multi-objective automated…
▽ More
Non-functional properties, such as runtime or memory use, are important to mobile app users and developers, as they affect user experience. Previous work on automated improvement of non-functional properties in mobile apps failed to address the inherent trade-offs between such properties. We propose a practical approach and the first open-source tool, GIDroid (2023), for multi-objective automated improvement of Android apps. In particular, we use Genetic improvement, a search-based technique that navigates the space of software variants to find improved software. We use a simulation-based testing framework to greatly improve the speed of search. GIDroid contains three state-of-the-art multi-objective algorithms, and two new mutation operators, which cache the results of method calls. Genetic improvement relies on testing to validate patches. Previous work showed that tests in open-source Android applications are scarce. We thus wrote tests for 21 versions of 7 Android apps, creating a new benchmark for performance improvements. We used GIDroid to improve versions of mobile apps where developers had previously found improvements to runtime, memory, and bandwidth use. Our technique automatically re-discovers 64% of existing improvements. We then applied our approach to current versions of software in which there were no known improvements. We were able to improve execution time by up to 35%, and memory use by up to 33% in these apps.
△ Less
Submitted 22 August, 2023;
originally announced August 2023.
-
Software Product Line Engineering via Software Transplantation
Authors:
Leandro O. Souza,
Earl T. Barr,
Justyna Petke,
Eduardo S. Almeida,
Paulo Anselmo M. S. Neto
Abstract:
For companies producing related products, a Software Product Line (SPL) is a software reuse method that improves time-to-market and software quality, achieving substantial cost reductions.These benefits do not come for free. It often takes years to re-architect and re-engineer a codebase to support SPL and, once adopted, it must be maintained. Current SPL practice relies on a collection of tools,…
▽ More
For companies producing related products, a Software Product Line (SPL) is a software reuse method that improves time-to-market and software quality, achieving substantial cost reductions.These benefits do not come for free. It often takes years to re-architect and re-engineer a codebase to support SPL and, once adopted, it must be maintained. Current SPL practice relies on a collection of tools, tailored for different reengineering phases, whose output developers must coordinate and integrate. We present Foundry, a general automated approach for leveraging software transplantation to speed conversion to and maintenance of SPL. Foundry facilitates feature extraction and migration. It can efficiently, repeatedly, transplant a sequence of features, implemented in multiple files. We used Foundry to create two valid product lines that integrate features from three real-world systems in an automated way. Moreover, we conducted an experiment comparing Foundry's feature migration with manual effort. We show that Foundry automatically migrated features across codebases 4.8 times faster, on average, than the average time a group of SPL experts took to accomplish the task.
△ Less
Submitted 20 July, 2023;
originally announced July 2023.
-
Reinforcement Learning for Mutation Operator Selection in Automated Program Repair
Authors:
Carol Hanna,
Aymeric Blot,
Justyna Petke
Abstract:
Automated program repair techniques aim to aid software developers with the challenging task of fixing bugs. In heuristic-based program repair, a search space of program variants, created via mutations on software, is explored to find potential patches for bugs. Most commonly, every selection of a mutation operator during search is performed uniformly at random, whcih can generate many buggy, even…
▽ More
Automated program repair techniques aim to aid software developers with the challenging task of fixing bugs. In heuristic-based program repair, a search space of program variants, created via mutations on software, is explored to find potential patches for bugs. Most commonly, every selection of a mutation operator during search is performed uniformly at random, whcih can generate many buggy, even uncompilable program variants. Our goal is to reduce the generation of variants that do not compile or break intended functionality which waste considerable resources. In this paper, we investigate the feasibility of a reinforcement learning-based approach for the selection of mutation operators in heuristic-based program repair. Our proposed approach is programming language, granularity-level, and search strategy agnostic and allows for easy augmentation into existing heuristic-based repair tools. We conduct an extensive empirical evaluation of four operator selection techniques, two reward types, two credit assignment strategies, two integration methods, and three sets of mutation operators using 30,080 independent repair attempts. We evaluate our approach on 353 real-world bugs from the Defects4J benchmark.The reinforcement learning-based mutation operator selection results in a higher number of test-passing variants, but does not exhibit a noticeable improvement in the number of bugs patched in comparison with the baseline, which uses random selection. While reinforcement learning has been previously shown to be successful in improving the search of evolutionary algorithms, often used in heuristic-based program repair, it has not shown such improvements when applied to this area of research.
△ Less
Submitted 6 May, 2024; v1 submitted 9 June, 2023;
originally announced June 2023.
-
GI Software with fewer Data Cache Misses
Authors:
William B. Langdon,
Justyna Petke,
Aymeric Blot,
David Clark
Abstract:
By their very name caches are often overlooked and yet play a vital role in the performance of modern and indeed future hardware. Using MAGPIE (Machine Automated General Performance Improvement via Evolution of software) we show genetic improvement GI can reduce the cache load of existing computer programs. Operating on lines of C and C++ source code using local search, Magpie can generate new fun…
▽ More
By their very name caches are often overlooked and yet play a vital role in the performance of modern and indeed future hardware. Using MAGPIE (Machine Automated General Performance Improvement via Evolution of software) we show genetic improvement GI can reduce the cache load of existing computer programs. Operating on lines of C and C++ source code using local search, Magpie can generate new functionally equivalent variants which generate fewer L1 data cache misses. Cache miss reduction is tested on two industrial open source programs (Google's Open Location Code OLC and Uber's Hexagonal Hierarchical Spatial Index H3) and two 2D photograph image processing tasks, counting pixels and OpenCV's SEEDS segmentation algorithm.
Magpie's patches functionally generalise. In one case they reduce data misses on the highest performance L1 cache dramatically by 47 percent.
△ Less
Submitted 6 April, 2023;
originally announced April 2023.
-
An Analysis of the Automatic Bug Fixing Performance of ChatGPT
Authors:
Dominik Sobania,
Martin Briesch,
Carol Hanna,
Justyna Petke
Abstract:
To support software developers in finding and fixing software bugs, several automated program repair techniques have been introduced. Given a test suite, standard methods usually either synthesize a repair, or navigate a search space of software edits to find test-suite passing variants. Recent program repair methods are based on deep learning approaches. One of these novel methods, which is not p…
▽ More
To support software developers in finding and fixing software bugs, several automated program repair techniques have been introduced. Given a test suite, standard methods usually either synthesize a repair, or navigate a search space of software edits to find test-suite passing variants. Recent program repair methods are based on deep learning approaches. One of these novel methods, which is not primarily intended for automated program repair, but is still suitable for it, is ChatGPT. The bug fixing performance of ChatGPT, however, is so far unclear. Therefore, in this paper we evaluate ChatGPT on the standard bug fixing benchmark set, QuixBugs, and compare the performance with the results of several other approaches reported in the literature. We find that ChatGPT's bug fixing performance is competitive to the common deep learning approaches CoCoNut and Codex and notably better than the results reported for the standard program repair approaches. In contrast to previous approaches, ChatGPT offers a dialogue system through which further information, e.g., the expected output for a certain input or an observed error message, can be entered. By providing such hints to ChatGPT, its success rate can be further increased, fixing 31 out of 40 bugs, outperforming state-of-the-art.
△ Less
Submitted 20 January, 2023;
originally announced January 2023.
-
A Comprehensive Survey of Benchmarks for Automated Improvement of Software's Non-Functional Properties
Authors:
Aymeric Blot,
Justyna Petke
Abstract:
Performance is a key quality of modern software. Although recent years have seen a spike in research on automated improvement of software's execution time, energy, memory consumption, etc., there is a noticeable lack of standard benchmarks for such work. It is also unclear how such benchmarks are representative of current software. Furthermore, frequently non-functional properties of software are…
▽ More
Performance is a key quality of modern software. Although recent years have seen a spike in research on automated improvement of software's execution time, energy, memory consumption, etc., there is a noticeable lack of standard benchmarks for such work. It is also unclear how such benchmarks are representative of current software. Furthermore, frequently non-functional properties of software are targeted for improvement one-at-a-time, neglecting potential negative impact on other properties.
In order to facilitate more research on automated improvement of non-functional properties of software, we conducted a survey gathering benchmarks used in previous work. We considered 5 major online repositories of software engineering work: ACM Digital Library, IEEE Xplore, Scopus, Google Scholar, and ArXiV. We gathered 5000 publications (3749 unique), which were systematically reviewed to identify work that empirically improves non-functional properties of software. We identified 386 relevant papers.
We find that execution time is the most frequently targeted property for improvement (in 62% of relevant papers), while multi-objective improvement is rarely considered (5%). Static approaches are prevalent (in 53% of papers), with exploratory approaches (evolutionary in 18% and non-evolutionary in 14% of papers) increasingly popular in the last 10 years. Only 40% of 386 papers describe work that uses benchmark suites, rather than single software, of those SPEC is most popular (covered in 33 papers). We also provide recommendations for choice of benchmarks in future work, noting, e.g., lack of work that covers Python or JavaScript. We provide all programs found in the 386 papers on our dedicated webpage at https://bloa.github.io/nfunc_survey/
We hope that this effort will facilitate more research on the topic of automated improvement of software's non-functional properties.
△ Less
Submitted 16 December, 2022;
originally announced December 2022.
-
MAGPIE: Machine Automated General Performance Improvement via Evolution of Software
Authors:
Aymeric Blot,
Justyna Petke
Abstract:
Performance is one of the most important qualities of software. Several techniques have thus been proposed to improve it, such as program transformations, optimisation of software parameters, or compiler flags. Many automated software improvement approaches use similar search strategies to explore the space of possible improvements, yet available tooling only focuses on one approach at a time. Thi…
▽ More
Performance is one of the most important qualities of software. Several techniques have thus been proposed to improve it, such as program transformations, optimisation of software parameters, or compiler flags. Many automated software improvement approaches use similar search strategies to explore the space of possible improvements, yet available tooling only focuses on one approach at a time. This makes comparisons and exploration of interactions of the various types of improvement impractical.
We propose MAGPIE, a unified software improvement framework. It provides a common edit sequence based representation that isolates the search process from the specific improvement technique, enabling a much simplified synergistic workflow. We provide a case study using a basic local search to compare compiler optimisation, algorithm configuration, and genetic improvement. We chose running time as our efficiency measure and evaluated our approach on four real-world software, written in C, C++, and Java.
Our results show that, used independently, all techniques find significant running time improvements: up to 25% for compiler optimisation, 97% for algorithm configuration, and 61% for evolving source code using genetic improvement. We also show that up to 10% further increase in performance can be obtained with partial combinations of the variants found by the different techniques. Furthermore, the common representation also enables simultaneous exploration of all techniques, providing a competitive alternative to using each technique individually.
△ Less
Submitted 4 August, 2022;
originally announced August 2022.
-
Test-based Patch Clustering for Automatically-Generated Patches Assessment
Authors:
Matias Martinez,
Maria Kechagia,
Anjana Perera,
Justyna Petke,
Federica Sarro,
Aldeida Aleti
Abstract:
Previous studies have shown that Automated Program Repair (APR) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug or it introduces a new defect that is not covered by the test suite. Therefore, the patches generated by APR tools need to be validated by human pr…
▽ More
Previous studies have shown that Automated Program Repair (APR) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug or it introduces a new defect that is not covered by the test suite. Therefore, the patches generated by APR tools need to be validated by human programmers, which can be very costly, and prevents APR tools adoption in practice.Our work aims at increasing developer trust in automated patch generation by minimizing the number of plausible patches that they have to review, thereby reducing the time required to find a correct patch. We introduce a novel light-weight test-based patch clustering approach called xTestCluster, which clusters patches based on their dynamic behavior. xTestCluster is applied after the patch generation phase in order to analyze the generated patches from one or more repair tools. The novelty of xTestCluster lies in using information from execution of newly generated test cases to cluster patches generated by multiple APR approaches. A cluster is formed with patches that fail on the same generated test cases. The output from xTestCluster gives developers a) a way of reducing the number of patches to analyze, as they can focus on analyzing a sample of patches from each cluster, b) additional information attached to each patch. After analyzing 1910 plausible patches from 25 Java APR tools, our results show that xTestCluster is able to reduce the number of patches to review and analyze with a median of 50%. xTestCluster can save a significant amount of time for developers that have to review the multitude of patches generated by APR tools, and provides them with new test cases that show the differences in behavior between generated patches.
△ Less
Submitted 22 July, 2022;
originally announced July 2022.
-
HyperGI: Automated Detection and Repair of Information Flow Leakage
Authors:
Ibrahim Mesecan,
Daniel Blackwell,
David Clark,
Myra B. Cohen,
Justyna Petke
Abstract:
Maintaining confidential information control in software is a persistent security problem where failure means secrets can be revealed via program behaviors. Information flow control techniques traditionally have been based on static or symbolic analyses -- limited in scalability and specialized to particular languages. When programs do leak secrets there are no approaches to automatically repair t…
▽ More
Maintaining confidential information control in software is a persistent security problem where failure means secrets can be revealed via program behaviors. Information flow control techniques traditionally have been based on static or symbolic analyses -- limited in scalability and specialized to particular languages. When programs do leak secrets there are no approaches to automatically repair them unless the leak causes a functional test to fail. We present our vision for HyperGI, a genetic improvement framework tha detects, localizes and repairs information leakage. Key elements of HyperGI include (1) the use of two orthogonal test suites, (2) a dynamic leak detection approach which estimates and localizes potential leaks, and (3) a repair component that produces a candidate patch using genetic improvement. We demonstrate the successful use of HyperGI on several programs which have no failing functional tests. We manually examine the resulting patches and identify trade-offs and future directions for fully realizing our vision.
△ Less
Submitted 26 August, 2021;
originally announced August 2021.
-
Genetic Improvement @ ICSE 2020
Authors:
William B. Langdon,
Westley Weimer,
Justyna Petke,
Erik Fredericks,
Seongmin Lee,
Emily Winter,
Michail Basios,
Myra B. Cohen,
Aymeric Blot,
Markus Wagner,
Bobby R. Bruce,
Shin Yoo,
Simos Gerasimou,
Oliver Krauss,
Yu Huang,
Michael Gerten
Abstract:
Following Prof. Mark Harman of Facebook's keynote and formal presentations (which are recorded in the proceedings) there was a wide ranging discussion at the eighth international Genetic Improvement workshop, GI-2020 @ ICSE (held as part of the 42nd ACM/IEEE International Conference on Software Engineering on Friday 3rd July 2020). Topics included industry take up, human factors, explainabiloity (…
▽ More
Following Prof. Mark Harman of Facebook's keynote and formal presentations (which are recorded in the proceedings) there was a wide ranging discussion at the eighth international Genetic Improvement workshop, GI-2020 @ ICSE (held as part of the 42nd ACM/IEEE International Conference on Software Engineering on Friday 3rd July 2020). Topics included industry take up, human factors, explainabiloity (explainability, justifyability, exploitability) and GI benchmarks. We also contrast various recent online approaches (e.g. SBST 2020) to holding virtual computer science conferences and workshops via the WWW on the Internet without face-2-face interaction. Finally we speculate on how the Coronavirus Covid-19 Pandemic will affect research next year and into the future.
△ Less
Submitted 31 July, 2020;
originally announced July 2020.
-
A Survey of Constrained Combinatorial Testing
Authors:
Huayao Wu,
Changhai Nie,
Justyna Petke,
Yue Jia,
Mark Harman
Abstract:
Combinatorial Testing (CT) is a potentially powerful testing technique, whereas its failure revealing ability might be dramatically reduced if it fails to handle constraints in an adequate and efficient manner. To ensure the wider applicability of CT in the presence of constrained problem domains, large and diverse efforts have been invested towards the techniques and applications of constrained c…
▽ More
Combinatorial Testing (CT) is a potentially powerful testing technique, whereas its failure revealing ability might be dramatically reduced if it fails to handle constraints in an adequate and efficient manner. To ensure the wider applicability of CT in the presence of constrained problem domains, large and diverse efforts have been invested towards the techniques and applications of constrained combinatorial testing. In this paper, we provide a comprehensive survey of representations, influences, and techniques that pertain to constraints in CT, covering 129 papers published between 1987 and 2018. This survey not only categorises the various constraint handling techniques, but also reviews comparatively less well-studied, yet potentially important, constraint identification and maintenance techniques. Since real-world programs are usually constrained, this survey can be of interest to researchers and practitioners who are looking to use and study constrained combinatorial testing techniques.
△ Less
Submitted 7 August, 2019;
originally announced August 2019.
-
Local Consistency and SAT-Solvers
Authors:
Peter Jeavons,
Justyna Petke
Abstract:
Local consistency techniques such as k-consistency are a key component of specialised solvers for constraint satisfaction problems. In this paper we show that the power of using k-consistency techniques on a constraint satisfaction problem is precisely captured by using a particular inference rule, which we call negative-hyper-resolution, on the standard direct encoding of the problem into Boolea…
▽ More
Local consistency techniques such as k-consistency are a key component of specialised solvers for constraint satisfaction problems. In this paper we show that the power of using k-consistency techniques on a constraint satisfaction problem is precisely captured by using a particular inference rule, which we call negative-hyper-resolution, on the standard direct encoding of the problem into Boolean clauses. We also show that current clause-learning SAT-solvers will discover in expected polynomial time any inconsistency that can be deduced from a given set of clauses using negative-hyper-resolvents of a fixed size. We combine these two results to show that, without being explicitly designed to do so, current clause-learning SAT-solvers efficiently simulate k-consistency techniques, for all fixed values of k. We then give some experimental results to show that this feature allows clause-learning SAT-solvers to efficiently solve certain families of constraint problems which are challenging for conventional constraint-programming solvers.
△ Less
Submitted 18 January, 2014;
originally announced January 2014.
-
Cliquewidth and Knowledge Compilation
Authors:
Igor Razgon,
Justyna Petke
Abstract:
In this paper we study the role of cliquewidth in succinct representation of Boolean functions. Our main statement is the following: Let $Z$ be a Boolean circuit having cliquewidth $k$. Then there is another circuit $Z^*$ computing the same function as $Z$ having treewidth at most $18k+2$ and which has at most $4|Z|$ gates where $|Z|$ is the number of gates of $Z$. In this sense, cliquewidth is no…
▽ More
In this paper we study the role of cliquewidth in succinct representation of Boolean functions. Our main statement is the following: Let $Z$ be a Boolean circuit having cliquewidth $k$. Then there is another circuit $Z^*$ computing the same function as $Z$ having treewidth at most $18k+2$ and which has at most $4|Z|$ gates where $|Z|$ is the number of gates of $Z$. In this sense, cliquewidth is not more `powerful' than treewidth for the purpose of representation of Boolean functions. We believe this is quite a surprising fact because it contrasts the situation with graphs where an upper bound on the treewidth implies an upper bound on the cliquewidth but not vice versa.
We demonstrate the usefulness of the new theorem for knowledge compilation. In particular, we show that a circuit $Z$ of cliquewidth $k$ can be compiled into a Decomposable Negation Normal Form ({\sc dnnf}) of size $O(9^{18k}k^2|Z|)$ and the same runtime. To the best of our knowledge, this is the first result on efficient knowledge compilation parameterized by cliquewidth of a Boolean circuit.
△ Less
Submitted 21 March, 2013; v1 submitted 17 March, 2013;
originally announced March 2013.