-
The Effects of Computational Resources on Flaky Tests
Authors:
Denini Silva,
Martin Gruber,
Satyajit Gokhale,
Ellen Arteca,
Alexi Turcotte,
Marcelo d'Amorim,
Wing Lam,
Stefan Winter,
Jonathan Bell
Abstract:
Flaky tests are tests that nondeterministically pass and fail in unchanged code. These tests can be detrimental to developers' productivity. Particularly when tests run in continuous integration environments, the tests may be competing for access to limited computational resources (CPUs, memory etc.), and we hypothesize that resource (in)availability may be a significant factor in the failure rate…
▽ More
Flaky tests are tests that nondeterministically pass and fail in unchanged code. These tests can be detrimental to developers' productivity. Particularly when tests run in continuous integration environments, the tests may be competing for access to limited computational resources (CPUs, memory etc.), and we hypothesize that resource (in)availability may be a significant factor in the failure rate of flaky tests. We present the first assessment of the impact that computational resources have on flaky tests, including a total of 52 projects written in Java, JavaScript and Python, and 27 different resource configurations. Using a rigorous statistical methodology, we determine which tests are RAFT (Resource-Affected Flaky Tests). We find that 46.5% of the flaky tests in our dataset are RAFT, indicating that a substantial proportion of flaky-test failures can be avoided by adjusting the resources available when running tests. We report RAFTs and configurations to avoid them to developers, and received interest to either fix the RAFTs or to improve the specifications of the projects so that tests would be run only in configurations that are unlikely to encounter RAFT failures. Our results also have implications for researchers attempting to detect flaky tests, e.g., reducing the resources available when running tests is a cost-effective approach to detect more flaky failures.
△ Less
Submitted 18 October, 2023;
originally announced October 2023.
-
Do Automatic Test Generation Tools Generate Flaky Tests?
Authors:
Martin Gruber,
Muhammad Firhard Roslan,
Owain Parry,
Fabian Scharnböck,
Phil McMinn,
Gordon Fraser
Abstract:
Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests. The prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests…
▽ More
Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests. The prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6 356 projects written in Java or Python. For each project, we generate tests using EvoSuite (Java) and Pynguin (Python), and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common in generated tests as in developer-written tests. Nevertheless, existing flakiness suppression mechanisms implemented in EvoSuite are effective in alleviating this issue (71.7 % fewer flaky tests). Compared to developer-written flaky tests, the causes of generated flaky tests are distributed differently. Their non-deterministic behavior is more frequently caused by randomness, rather than by networking and concurrency. Using flakiness suppression, the remaining flaky tests differ significantly from any flakiness previously reported, where most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, with the accompanying dataset, can help maintainers to improve test generation tools, give recommendations for developers using these tools, and serve as a foundation for future research in test flakiness or test generation.
△ Less
Submitted 8 October, 2023;
originally announced October 2023.
-
FlaPy: Mining Flaky Python Tests at Scale
Authors:
Martin Gruber,
Gordon Fraser
Abstract:
Flaky tests obstruct software development, and studying and proposing mitigations against them has therefore become an important focus of software engineering research. To conduct sound investigations on test flakiness, it is crucial to have large, diverse, and unbiased datasets of flaky tests. A common method to build such datasets is by rerunning the test suites of selected projects multiple tim…
▽ More
Flaky tests obstruct software development, and studying and proposing mitigations against them has therefore become an important focus of software engineering research. To conduct sound investigations on test flakiness, it is crucial to have large, diverse, and unbiased datasets of flaky tests. A common method to build such datasets is by rerunning the test suites of selected projects multiple times and checking for tests that produce different outcomes. While using this technique on a single project is mostly straightforward, applying it to a large and diverse set of projects raises several implementation challenges such as (1) isolating the test executions, (2) supporting multiple build mechanisms, (3) achieving feasible run times on large datasets, and (4) analyzing and presenting the test outcomes. To address these challenges we introduce FlaPy, a framework for researchers to mine flaky tests in a given or automatically sampled set of Python projects by rerunning their test suites. FlaPy isolates the test executions using containerization and fresh execution environments to simulate real-world CI conditions and to achieve accurate results. By supporting multiple dependency installation strategies, it promotes diversity among the studied projects. FlaPy supports parallelizing the test executions using SLURM, making it feasible to scan thousands of projects for test flakiness. Finally, FlaPy analyzes the test outcomes to determine which tests are flaky and depicts the results in a concise table. A demo video of FlaPy is available at https://youtu.be/ejy-be-FvDY
△ Less
Submitted 8 May, 2023;
originally announced May 2023.
-
Debugging Flaky Tests using Spectrum-based Fault Localization
Authors:
Martin Gruber,
Gordon Fraser
Abstract:
Non-deterministically behaving (i.e., flaky) tests hamper regression testing as they destroy trust and waste computational and human resources. Eradicating flakiness in test suites is therefore an important goal, but automated debugging tools are needed to support developers when trying to understand the causes of flakiness. A popular example for an automated approach to support regular debugging…
▽ More
Non-deterministically behaving (i.e., flaky) tests hamper regression testing as they destroy trust and waste computational and human resources. Eradicating flakiness in test suites is therefore an important goal, but automated debugging tools are needed to support developers when trying to understand the causes of flakiness. A popular example for an automated approach to support regular debugging is spectrum-based fault localization (SFL), a technique that identifies software components that are most likely the causes of failures. While it is possible to also apply SFL for locating likely sources of flakiness in code, unfortunately the flakiness makes SFL both imprecise and non-deterministic. In this paper we introduce SFFL (Spectrum-based Flaky Fault Localization), an extension of traditional coverage-based SFL that exploits our observation that 80% of flaky tests exhibit varying coverage behavior between different runs. By distinguishing between stable and flaky coverage, SFFL is able to locate the sources of flakiness more precisely and keeps the localization itself deterministic. An evaluation on 101 flaky tests taken from 48 open-source Python projects demonstrates that SFFL is effective: Of five prominent SFL formulas, DStar, Ochiai, and Op2 yield the best overall performance. On average, they are able to narrow down the fault's location to 3.5 % of the project's code base, which is 18.7 % better than traditional SFL (for DStar). SFFL's effectiveness, however, depends on the root causes of flakiness: The source of non-order-dependent flaky tests can be located far more precisely than order-dependent faults.
△ Less
Submitted 8 May, 2023;
originally announced May 2023.
-
Practical Flaky Test Prediction using Common Code Evolution and Test History Data
Authors:
Martin Gruber,
Michael Heine,
Norbert Oster,
Michael Philippsen,
Gordon Fraser
Abstract:
Non-deterministically behaving test cases cause developers to lose trust in their regression test suites and to eventually ignore failures. Detecting flaky tests is therefore a crucial task in maintaining code quality, as it builds the necessary foundation for any form of systematic response to flakiness, such as test quarantining or automated debugging. Previous research has proposed various meth…
▽ More
Non-deterministically behaving test cases cause developers to lose trust in their regression test suites and to eventually ignore failures. Detecting flaky tests is therefore a crucial task in maintaining code quality, as it builds the necessary foundation for any form of systematic response to flakiness, such as test quarantining or automated debugging. Previous research has proposed various methods to detect flakiness, but when trying to deploy these in an industrial context, their reliance on instrumentation, test reruns, or language-specific artifacts was inhibitive. In this paper, we therefore investigate the prediction of flaky tests without such requirements on the underlying programming language, CI, build or test execution framework. Instead, we rely only on the most commonly available artifacts, namely the tests' outcomes and durations, as well as basic information about the code evolution to build predictive models capable of detecting flakiness. Furthermore, our approach does not require additional reruns, since it gathers this data from existing test executions. We trained several established classifiers on the suggested features and evaluated their performance on a large-scale industrial software system, from which we collected a data set of 100 flaky and 100 non-flaky test- and code-histories. The best model was able to achieve an F1-score of 95.5% using only 3 features: the tests' flip rates, the number of changes to source files in the last 54 days, as well as the number of changed files in the most recent pull request.
△ Less
Submitted 18 March, 2023; v1 submitted 18 February, 2023;
originally announced February 2023.
-
Updates on the Low-Level Abstraction of Memory Access
Authors:
Bernhard Manfred Gruber
Abstract:
Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. The low-level abstraction of memory access (LLAMA) is a C++ library that provides a zero-r…
▽ More
Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. The low-level abstraction of memory access (LLAMA) is a C++ library that provides a zero-runtime-overhead abstraction layer, underneath which memory mappings can be freely exchanged to customize data layouts, memory access and access instrumentation, focusing on multidimensional arrays of nested, structured data.
After its scientific debut, several improvements and extensions have been added to LLAMA. This includes compile-time array extents for zero-memory-overhead views, support for computations during memory access, new mappings for bit-packing, switching types, byte-splitting, memory access instrumentation, and explicit SIMD support. This contribution provides an overview of recent developments in the LLAMA library.
△ Less
Submitted 11 April, 2024; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Synthesizing Annotated Image and Video Data Using a Rendering-Based Pipeline for Improved License Plate Recognition
Authors:
Andreas Spruck,
Maximilane Gruber,
Anatol Maier,
Denise Moussa,
Jürgen Seiler,
Christian Riess,
André Kaup
Abstract:
An insufficient number of training samples is a common problem in neural network applications. While data augmentation methods require at least a minimum number of samples, we propose a novel, rendering-based pipeline for synthesizing annotated data sets. Our method does not modify existing samples but synthesizes entirely new samples. The proposed rendering-based pipeline is capable of generating…
▽ More
An insufficient number of training samples is a common problem in neural network applications. While data augmentation methods require at least a minimum number of samples, we propose a novel, rendering-based pipeline for synthesizing annotated data sets. Our method does not modify existing samples but synthesizes entirely new samples. The proposed rendering-based pipeline is capable of generating and annotating synthetic and partly-real image and video data in a fully automatic procedure. Moreover, the pipeline can aid the acquisition of real data. The proposed pipeline is based on a rendering process. This process generates synthetic data. Partly-real data bring the synthetic sequences closer to reality by incorporating real cameras during the acquisition process. The benefits of the proposed data generation pipeline, especially for machine learning scenarios with limited available training data, are demonstrated by an extensive experimental validation in the context of automatic license plate recognition. The experiments demonstrate a significant reduction of the character error rate and miss rate from 73.74% and 100% to 14.11% and 41.27% respectively, compared to an OCR algorithm trained on a real data set solely. These improvements are achieved by training the algorithm on synthesized data solely. When additionally incorporating real data, the error rates can be decreased further. Thereby, the character error rate and miss rate can be reduced to 11.90% and 39.88% respectively. All data used during the experiments as well as the proposed rendering-based pipeline for the automated data generation is made publicly available under (URL will be revealed upon publication).
△ Less
Submitted 28 September, 2022;
originally announced September 2022.
-
A Survey on How Test Flakiness Affects Developers and What Support They Need To Address It
Authors:
Martin Gruber,
Gordon Fraser
Abstract:
Non-deterministically passing and failing test cases, so-called flaky tests, have recently become a focus area of software engineering research. While this research focus has been met with some enthusiastic endorsement from industry, prior work nevertheless mostly studied flakiness using a code-centric approach by mining software repositories. What data extracted from software repositories cannot…
▽ More
Non-deterministically passing and failing test cases, so-called flaky tests, have recently become a focus area of software engineering research. While this research focus has been met with some enthusiastic endorsement from industry, prior work nevertheless mostly studied flakiness using a code-centric approach by mining software repositories. What data extracted from software repositories cannot tell us, however, is how developers perceive flakiness: How prevalent is test flakiness in developers' daily routine, how does it affect them, and most importantly: What do they want us researchers to do about it? To answer these questions, we surveyed 335 professional software developers and testers in different domains. The survey respondents confirm that flaky tests are a common and serious problem, thus reinforcing ongoing research on flaky test detection. Developers are less worried about the computational costs caused by re-running tests and more about the loss of trust in the test outcomes. Therefore, they would like to have IDE plugins to detect flaky code as well as better visualizations of the problem, particularly dashboards showing test outcomes over time; they also wish for more training and information on flakiness. These important aspects will require the attention of researchers as well as tool developers.
△ Less
Submitted 8 April, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
LLAMA: The Low-Level Abstraction For Memory Access
Authors:
Bernhard Manfred Gruber,
Guilherme Amadio,
Jakob Blomer,
Alexander Matthes,
René Widera,
Michael Bussmann
Abstract:
The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished v…
▽ More
The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged.
We present the Low-Level Abstraction of Memory Access (LLAMA), a C++ library that provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators.
Providing two close-to-life examples, we show that the LLAMA-generated AoS (Array of Structs) and SoA (Struct of Arrays) layouts produce identical code with the same performance characteristics as manually written data structures. Integrations into the SPEC CPU\textsuperscript{\textregistered} lbm benchmark and the particle-in-cell simulation PIConGPU demonstrate LLAMA's abilities in real-world applications. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying.
LLAMA provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment.
△ Less
Submitted 9 March, 2022; v1 submitted 8 June, 2021;
originally announced June 2021.
-
An Empirical Study of Flaky Tests in Python
Authors:
Martin Gruber,
Stephan Lukasczyk,
Florian Kroiß,
Gordon Fraser
Abstract:
Tests that cause spurious failures without any code changes, i.e., flaky tests, hamper regression testing, increase maintenance costs, may shadow real bugs, and decrease trust in tests. While the prevalence and importance of flakiness is well established, prior research focused on Java projects, thus raising the question of how the findings generalize. In order to provide a better understanding of…
▽ More
Tests that cause spurious failures without any code changes, i.e., flaky tests, hamper regression testing, increase maintenance costs, may shadow real bugs, and decrease trust in tests. While the prevalence and importance of flakiness is well established, prior research focused on Java projects, thus raising the question of how the findings generalize. In order to provide a better understanding of the role of flakiness in software development beyond Java, we empirically study the prevalence, causes, and degree of flakiness within software written in Python, one of the currently most popular programming languages. For this, we sampled 22352 open source projects from the popular PyPI package index, and analyzed their 876186 test cases for flakiness. Our investigation suggests that flakiness is equally prevalent in Python as it is in Java. The reasons, however, are different: Order dependency is a much more dominant problem in Python, causing 59% of the 7571 flaky tests in our dataset. Another 28% were caused by test infrastructure problems, which represent a previously undocumented cause of flakiness. The remaining 13% can mostly be attributed to the use of network and randomness APIs by the projects, which is indicative of the type of software commonly written in Python. Our data also suggests that finding flaky tests requires more runs than are often done in the literature: A 95% confidence that a passing test case is not flaky on average would require 170 reruns.
△ Less
Submitted 22 January, 2021;
originally announced January 2021.
-
An Architectural Design for Measurement Uncertainty Evaluation in Cyber-Physical Systems
Authors:
Wenzel Pilar von Pilchau,
Varun Gowtham,
Maximilian Gruber,
Matthias Riedl,
Nikolaos-Stefanos Koutrakis,
Jawad Tayyub,
Jörg Hähner,
Sascha Eichstädt,
Eckart Uhlmann,
Julian Polte,
Volker Frey,
Alexander Willner
Abstract:
Several use cases from the areas of manufacturing and process industry, require highly accurate sensor data. As sensors always have some degree of uncertainty, methods are needed to increase their reliability. The common approach is to regularly calibrate the devices to enable traceability according to national standards and Système international (SI) units - which follows costly processes. Howeve…
▽ More
Several use cases from the areas of manufacturing and process industry, require highly accurate sensor data. As sensors always have some degree of uncertainty, methods are needed to increase their reliability. The common approach is to regularly calibrate the devices to enable traceability according to national standards and Système international (SI) units - which follows costly processes. However, sensor networks can also be represented as Cyber Physical Systems (CPS) and a single sensor can have a digital representation (Digital Twin) to use its data further on. To propagate uncertainty in a reliable way in the network, we present a system architecture to communicate measurement uncertainties in sensor networks utilizing the concept of Asset Administration Shells alongside methods from the domain of Organic Computing. The presented approach contains methods for uncertainty propagation as well as concepts from the Machine Learning domain that combine the need for an accurate uncertainty estimation. The mathematical description of the metrological uncertainty of fused or propagated values can be seen as a first step towards the development of a harmonized approach for uncertainty in distributed CPSs in the context of Industrie 4.0. In this paper, we present basic use cases, conceptual ideas and an agenda of how to proceed further on.
△ Less
Submitted 17 August, 2020;
originally announced August 2020.
-
Causal Modeling of Twitter Activity During COVID-19
Authors:
Oguzhan Gencoglu,
Mathias Gruber
Abstract:
Understanding the characteristics of public attention and sentiment is an essential prerequisite for appropriate crisis management during adverse health events. This is even more crucial during a pandemic such as COVID-19, as primary responsibility of risk management is not centralized to a single institution, but distributed across society. While numerous studies utilize Twitter data in descripti…
▽ More
Understanding the characteristics of public attention and sentiment is an essential prerequisite for appropriate crisis management during adverse health events. This is even more crucial during a pandemic such as COVID-19, as primary responsibility of risk management is not centralized to a single institution, but distributed across society. While numerous studies utilize Twitter data in descriptive or predictive context during COVID-19 pandemic, causal modeling of public attention has not been investigated. In this study, we propose a causal inference approach to discover and quantify causal relationships between pandemic characteristics (e.g. number of infections and deaths) and Twitter activity as well as public sentiment. Our results show that the proposed method can successfully capture the epidemiological domain knowledge and identify variables that affect public attention and sentiment. We believe our work contributes to the field of infodemiology by distinguishing events that correlate with public attention from events that cause public attention.
△ Less
Submitted 23 September, 2020; v1 submitted 16 May, 2020;
originally announced May 2020.
-
Statistical Ineffective Fault Analysis of GIMLI
Authors:
Michael Gruber,
Matthias Probst,
Michael Tempelmeier
Abstract:
Ineffective Fault Analysis (SIFA) was introduced as a new approach to attack block ciphers at CHES 2018. Since then, they have been proven to be a powerful class of attacks, with an easy to achieve fault model. One of the main benefits of SIFA is to overcome detection-based and infection-based countermeasures. In this paper we explain how the principles of SIFA can be applied to GIMLI, an authenti…
▽ More
Ineffective Fault Analysis (SIFA) was introduced as a new approach to attack block ciphers at CHES 2018. Since then, they have been proven to be a powerful class of attacks, with an easy to achieve fault model. One of the main benefits of SIFA is to overcome detection-based and infection-based countermeasures. In this paper we explain how the principles of SIFA can be applied to GIMLI, an authenticated encryption cipher participating the NIST-LWC competition. We identified two possible rounds during the intialization phase of GIMLI to mount our attack. If we attack the first location we are able to recover 3 bits of the key uniquely and the parity of 8 key-bits organized in 3 sums using 180 ineffective faults per biased single intermediate bit. If we attack the second location we are able to recover 15 bits of the key uniquely and the parity of 22 key-bits organized in 7 sums using 340 ineffective faults per biased intermediate bit. Furthermore, we investigated the influence of the fault model on the rate of ineffective faults in GIMLI. Finally, we verify the efficiency of our attacks by means of simulation.
△ Less
Submitted 16 November, 2019; v1 submitted 8 November, 2019;
originally announced November 2019.
-
HARK Side of Deep Learning -- From Grad Student Descent to Automated Machine Learning
Authors:
Oguzhan Gencoglu,
Mark van Gils,
Esin Guldogan,
Chamin Morikawa,
Mehmet Süzen,
Mathias Gruber,
Jussi Leinonen,
Heikki Huttunen
Abstract:
Recent advancements in machine learning research, i.e., deep learning, introduced methods that excel conventional algorithms as well as humans in several complex tasks, ranging from detection of objects in images and speech recognition to playing difficult strategic games. However, the current methodology of machine learning research and consequently, implementations of the real-world applications…
▽ More
Recent advancements in machine learning research, i.e., deep learning, introduced methods that excel conventional algorithms as well as humans in several complex tasks, ranging from detection of objects in images and speech recognition to playing difficult strategic games. However, the current methodology of machine learning research and consequently, implementations of the real-world applications of such algorithms, seems to have a recurring HARKing (Hypothesizing After the Results are Known) issue. In this work, we elaborate on the algorithmic, economic and social reasons and consequences of this phenomenon. We present examples from current common practices of conducting machine learning research (e.g. avoidance of reporting negative results) and failure of generalization ability of the proposed algorithms and datasets in actual real-life usage. Furthermore, a potential future trajectory of machine learning research and development from the perspective of accountable, unbiased, ethical and privacy-aware algorithmic decision making is discussed. We would like to emphasize that with this discussion we neither claim to provide an exhaustive argumentation nor blame any specific institution or individual on the raised issues. This is simply a discussion put forth by us, insiders of the machine learning field, reflecting on us.
△ Less
Submitted 16 April, 2019;
originally announced April 2019.
-
Can we learn where people go?
Authors:
Marion Gödel,
Gerta Köster,
Daniel Lehmberg,
Manfred Gruber,
Angelika Kneidl,
Florian Sesser
Abstract:
In most agent-based simulators, pedestrians navigate from origins to destinations. Consequently, destinations are essential input parameters to the simulation. While many other relevant parameters as positions, speeds and densities can be obtained from sensors, like cameras, destinations cannot be observed directly. Our research question is: Can we obtain this information from video data using mac…
▽ More
In most agent-based simulators, pedestrians navigate from origins to destinations. Consequently, destinations are essential input parameters to the simulation. While many other relevant parameters as positions, speeds and densities can be obtained from sensors, like cameras, destinations cannot be observed directly. Our research question is: Can we obtain this information from video data using machine learning methods? We use density heatmaps, which indicate the pedestrian density within a given camera cutout, as input to predict the destination distributions. For our proof of concept, we train a Random Forest predictor on an exemplary data set generated with the Vadere microscopic simulator. The scenario is a crossroad where pedestrians can head left, straight or right. In addition, we gain first insights on suitable placement of the camera. The results motivate an in-depth analysis of the methodology.
△ Less
Submitted 20 December, 2018; v1 submitted 10 December, 2018;
originally announced December 2018.