The Social Psychology of Software Security (Psycurity)

Abstract: This position paper explores the intricate relationship between social psychology and secure software engineering, underscoring the vital role social psychology plays in the realm of engineering secure software systems. Beyond a mere technical endeavor, this paper contends that understanding and integrating social psychology principles into software processes are imperative for establishing robust and secure software systems. Recent studies in related fields show the importance of understanding the social psychology of other security domains. Finally, we identify critical gaps in software security research and present a set of research questions for incorporating more social psychology into software security research.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2405.13786 [pdf, ps, other]

doi 10.1109/ICSTW58534.2023.00023

Towards Explainable Test Case Prioritisation with Learning-to-Rank Models

Authors: Aurora Ramírez, Mario Berrios, José Raúl Romero, Robert Feldt

Abstract: Test case prioritisation (TCP) is a critical task in regression testing to ensure quality as software evolves. Machine learning has become a common way to achieve it. In particular, learning-to-rank (LTR) algorithms provide an effective method of ordering and prioritising test cases. However, their use poses a challenge in terms of explainability, both globally at the model level and locally for particular results. Here, we present and discuss scenarios that require different explanations and how the particularities of TCP (multiple builds over time, test case and test suite variations, etc.) could influence them. We include a preliminary experiment to analyse the similarity of explanations, showing that they do not only vary depending on test case-specific predictions, but also on the relative ranks.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 3rd International Workshop on Artificial Intelligence in Software Testing (AIST) - International Conference on Software Testing and Validation (ICST)

ACM Class: D.2.5; I.2.6

Journal ref: Proc. 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 66-69

arXiv:2405.04236 [pdf, other]

Semantic API Alignment: Linking High-level User Goals to APIs

Authors: Robert Feldt, Riccardo Coppola

Abstract: Large Language Models (LLMs) are becoming key in automating and assisting various software development tasks, including text-based tasks in requirements engineering but also in coding. Typically, these models are used to automate small portions of existing tasks, but we present a broader vision to span multiple steps from requirements engineering to implementation using existing libraries. This approach, which we call Semantic API Alignment (SEAL), aims to bridge the gap between a user's high-level goals and the specific functions of one or more APIs. In this position paper, we propose a system architecture where a set of LLM-powered "agents" match such high-level objectives with appropriate API calls. This system could facilitate automated programming by finding matching links or, alternatively, explaining mismatches to guide manual intervention or further development. As an initial pilot, our paper demonstrates this concept by applying LLMs to Goal-Oriented Requirements Engineering (GORE), via sub-goal analysis, for aligning with REST API specifications, specifically through a case study involving a GitHub statistics API. We discuss the potential of our approach to enhance complex tasks in software development and requirements engineering and outline future directions for research.

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2404.02785 [pdf, other]

Domain Generalization through Meta-Learning: A Survey

Authors: Arsham Gholamzadeh Khoee, Yinan Yu, Robert Feldt

Abstract: Deep neural networks (DNNs) have revolutionized artificial intelligence but often lack performance when faced with out-of-distribution (OOD) data, a common scenario due to the inevitable domain shifts in real-world applications. This limitation stems from the common assumption that training and testing data share the same distribution, an assumption frequently violated in practice. Despite their effectiveness with large amounts of data and computational power, DNNs struggle with distributional shifts and limited labeled data, leading to overfitting and poor generalization across various tasks and domains. Meta-learning presents a promising approach by employing algorithms that acquire transferable knowledge across various tasks for fast adaptation, eliminating the need to learn each task from scratch. This survey paper delves into the realm of meta-learning with a focus on its contribution to domain generalization. We first clarify the concept of meta-learning for domain generalization and introduce a novel taxonomy based on the feature extraction strategy and the classifier learning methodology, offering a granular view of methodologies. Through an exhaustive review of existing methods and underlying theories, we map out the fundamentals of the field. Our survey provides practical insights and an informed discussion on promising research directions, paving the way for future innovation in meta-learning for domain generalization.

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2311.08649 [pdf, other]

Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing

Authors: Juyeon Yoon, Robert Feldt, Shin Yoo

Abstract: GUI testing checks if a software system behaves as expected when users interact with its graphical interface, e.g., testing specific functionality or validating relevant use case scenarios. Currently, deciding what to test at this high level is a manual task since automated GUI testing tools target lower level adequacy metrics such as structural code coverage or activity coverage. We propose DroidAgent, an autonomous GUI testing agent for Android, for semantic, intent-driven automation of GUI testing. It is based on Large Language Models and support mechanisms such as long- and short-term memory. Given an Android app, DroidAgent sets relevant task goals and subsequently tries to achieve them by interacting with the app. Our empirical evaluation of DroidAgent using 15 apps from the Themis benchmark shows that it can set up and perform realistic tasks, with a higher level of autonomy. For example, when testing a messaging app, DroidAgent created a second account and added a first account as a friend, testing a realistic use case, without human intervention. On average, DroidAgent achieved 61% activity coverage, compared to 51% for current state-of-the-art GUI testing techniques. Further, manual analysis shows that 317 out of the 374 autonomously created tasks are realistic and relevant to app functionalities, and also that DroidAgent interacts deeply with the apps and covers more features.

Submitted 14 November, 2023; originally announced November 2023.

Comments: 10 pages

arXiv:2310.02046 [pdf, other]

Improving web element localization by using a large language model

Authors: Michel Nass, Emil Alégroth, Robert Feldt

Abstract: Web-based test automation heavily relies on accurately finding web elements. Traditional methods compare attributes but don't grasp the context and meaning of elements and words. The emergence of Large Language Models (LLMs) like GPT-4, which can show human-like reasoning abilities on some tasks, offers new opportunities for software engineering and web element localization. This paper introduces and evaluates VON Similo LLM, an enhanced web element localization approach. Using an LLM, it selects the most likely web element from the top-ranked ones identified by the existing VON Similo method, ideally aiming to get closer to human-like selection accuracy. An experimental study was conducted using 804 web element pairs from 48 real-world web applications. We measured the number of correctly identified elements as well as the execution times, comparing the effectiveness and efficiency of VON Similo LLM against the baseline algorithm. In addition, motivations from the LLM were recorded and analyzed for all instances where the original approach failed to find the right web element. VON Similo LLM demonstrated improved performance, reducing failed localizations from 70 to 39 (out of 804), a 44 percent reduction. Despite its slower execution time and additional costs of using the GPT-4 model, the LLM's human-like reasoning showed promise in enhancing web element localization. LLM technology can enhance web element identification in GUI test automation, reducing false positives and potentially lowering maintenance costs. However, further research is necessary to fully understand LLMs' capabilities, limitations, and practical use in GUI testing.

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2308.11750 [pdf, other]

doi 10.1007/s10664-015-9410-8

Large-scale information retrieval in software engineering -- an experience report from industrial application

Authors: Michael Unterkalmsteiner, Tony Gorschek, Robert Feldt, Niklas Lavesson

Abstract: Software Engineering activities are information intensive. Research proposes Information Retrieval (IR) techniques to support engineers in their daily tasks, such as establishing and maintaining traceability links, fault identification, and software maintenance. We describe an engineering task, test case selection, and illustrate our problem analysis and solution discovery process. The objective of the study is to gain an understanding of to what extent IR techniques (one potential solution) can be applied to test case selection and provide decision support in a large-scale, industrial setting. We analyze, in the context of the studied company, how test case selection is performed and design a series of experiments evaluating the performance of different IR techniques. Each experiment provides lessons learned from implementation, execution, and results, feeding to its successor. The three experiments led to the following observations: 1) there is a lack of research on scalable parameter optimization of IR techniques for software engineering problems; 2) scaling IR techniques to industry data is challenging, in particular for latent semantic analysis; 3) the IR context poses constraints on the empirical evaluation of IR techniques, requiring more research on developing valid statistical approaches. We believe that our experiences in conducting a series of IR experiments with industry grade data are valuable for peer researchers so that they can avoid the pitfalls that we have encountered. Furthermore, we identified challenges that need to be addressed in order to bridge the gap between laboratory IR experiments and real applications of IR in the industry.

Submitted 22 August, 2023; originally announced August 2023.

Journal ref: Empir. Softw. Eng. 21(6): 2324-2365 (2016)

arXiv:2308.07640 [pdf, ps, other]

doi 10.1145/2815021.2815036

Assessing requirements engineering and software test alignment -- Five case studies

Authors: Michael Unterkalmsteiner, Tony Gorschek, Robert Feldt, Eriks Klotins

Abstract: The development of large, software-intensive systems is a complex undertaking that we generally tackle by a divide and conquer strategy. Companies thereby face the challenge of coordinating individual aspects of software development, in particular between requirements engineering (RE) and software testing (ST). A lack of REST alignment can not only lead to wasted effort but also to defective software. However, before a company can improve the mechanisms of coordination they need to be understood first. With REST-bench we aim at providing an assessment tool that illustrates the coordination in software development projects and identify concrete improvement opportunities. We have developed REST-bench on the sound fundamentals of a taxonomy on REST alignment methods and validated the method in five case studies. Following the principles of technical action research, we collaborated with five companies, applying REST-bench and iteratively improving the method based on the lessons we learned. We applied REST-bench both in Agile and plan-driven environments, in projects lasting from weeks to years, and staffed as large as 1000 employees. The improvement opportunities we identified and the feedback we received indicate that the assessment was effective and efficient. Furthermore, participants confirmed that their understanding on the coordination between RE and ST improved.

Submitted 15 August, 2023; originally announced August 2023.

Journal ref: J. Syst. Softw. 109: 62-77 (2015)

arXiv:2307.13143 [pdf, ps, other]

doi 10.1109/TSE.2011.26

Evaluation and Measurement of Software Process Improvement -- A Systematic Literature Review

Authors: Michael Unterkalmsteiner, Tony Gorschek, A. K. M. Moinul Islam, Chow Kian Cheng, Rahadian Bayu Permadi, Robert Feldt

Abstract: BACKGROUND: Software Process Improvement (SPI) is a systematic approach to increase the efficiency and effectiveness of a software development organization and to enhance software products. OBJECTIVE: This paper aims to identify and characterize evaluation strategies and measurements used to assess the impact of different SPI initiatives. METHOD: The systematic literature review includes 148 papers published between 1991 and 2008. The selected papers were classified according to SPI initiative, applied evaluation strategies, and measurement perspectives. Potential confounding factors interfering with the evaluation of the improvement effort were assessed. RESULTS: Seven distinct evaluation strategies were identified, wherein the most common one, "Pre-Post Comparison" was applied in 49 percent of the inspected papers. Quality was the most measured attribute (62 percent), followed by Cost (41 percent), and Schedule (18 percent). Looking at measurement perspectives, "Project" represents the majority with 66 percent. CONCLUSION: The evaluation validity of SPI initiatives is challenged by the scarce consideration of potential confounding factors, particularly given that "Pre-Post Comparison" was identified as the most common evaluation strategy, and the inaccurate descriptions of the evaluation context. Measurements to assess the short and mid-term impact of SPI initiatives prevail, whereas long-term measurements in terms of customer satisfaction and return on investment tend to be less used.

Submitted 24 July, 2023; originally announced July 2023.

Journal ref: IEEE Trans. Software Eng. 38(2): 398-424 (2012)

arXiv:2307.13089 [pdf, ps, other]

doi 10.1002/smr.1637

A conceptual framework for SPI evaluation

Authors: Michael Unterkalmsteiner, Tony Gorschek, A. K. M. Moinul Islam, Chow Kian Cheng, Rahadian Bayu Permadi, Robert Feldt

Abstract: Software Process Improvement (SPI) encompasses the analysis and modification of the processes within software development, aimed at improving key areas that contribute to the organizations' goals. The task of evaluating whether the selected improvement path meets these goals is challenging. On the basis of the results of a systematic literature review on SPI measurement and evaluation practices, we developed a framework (SPI Measurement and Evaluation Framework (SPI-MEF)) that supports the planning and implementation of SPI evaluations. SPI-MEF guides the practitioner in scoping the evaluation, determining measures, and performing the assessment. SPI-MEF does not assume a specific approach to process improvement and can be integrated in existing measurement programs, refocusing the assessment on evaluating the improvement initiative's outcome. Sixteen industry and academic experts evaluated the framework's usability and capability to support practitioners, providing additional insights that were integrated in the application guidelines of the framework.

Submitted 24 July, 2023; originally announced July 2023.

Journal ref: J. Softw. Evol. Process. 26(2): 251-279 (2014)

arXiv:2307.12489 [pdf]

doi 10.1007/s10664-013-9263-y

Challenges and Practices in Aligning Requirements with Verification and Validation: A Case Study of Six Companies

Authors: Elizabeth Bjarnason, Per Runeson, Markus Borg, Michael Unterkalmsteiner, Emelie Engström, Björn Regnell, Giedre Sabaliauskaite, Annabella Loconsole, Tony Gorschek, Robert Feldt

Abstract: Weak alignment of requirements engineering (RE) with verification and validation (VV) may lead to problems in delivering the required products in time with the right quality. For example, weak communication of requirements changes to testers may result in lack of verification of new requirements and incorrect verification of old invalid requirements, leading to software quality problems, wasted effort and delays. However, despite the serious implications of weak alignment, research and practice both tend to focus on one or the other of RE or VV rather than on the alignment of the two. We have performed a multi-unit case study to gain insight into issues around aligning RE and VV by interviewing 30 practitioners from 6 software developing companies, involving 10 researchers in a flexible research process for case studies. The results describe current industry challenges and practices in aligning RE with VV, ranging from quality of the individual RE and VV activities, through tracing and tools, to change control and sharing a common understanding at strategy, goal and design level. The study identified that human aspects are central, i.e. cooperation and communication, and that requirements engineering practices are a critical basis for alignment. Further, the size of an organisation and its motivation for applying alignment practices, e.g. external enforcement of traceability, are variation factors that play a key role in achieving alignment. Our results provide a strategic roadmap for practitioners' improvement work to address alignment challenges. Furthermore, the study provides a foundation for continued research to improve the alignment of RE with VV.

Submitted 23 July, 2023; originally announced July 2023.

Journal ref: Empir. Softw. Eng. 19(6): 1809-1855 (2014)

arXiv:2307.12477 [pdf, ps, other]

doi 10.1145/2523088

A Taxonomy for Requirements Engineering and Software Test Alignment

Authors: Michael Unterkalmsteiner, Robert Feldt, Tony Gorschek

Abstract: Requirements Engineering and Software Testing are mature areas and have seen a lot of research. Nevertheless, their interactions have been sparsely explored beyond the concept of traceability. To fill this gap, we propose a definition of requirements engineering and software test (REST) alignment, a taxonomy that characterizes the methods linking the respective areas, and a process to assess alignment. The taxonomy can support researchers to identify new opportunities for investigation, as well as practitioners to compare alignment methods and evaluate alignment, or lack thereof. We constructed the REST taxonomy by analyzing alignment methods published in literature, iteratively validating the emerging dimensions. The resulting concept of an information dyad characterizes the exchange of information required for any alignment to take place. We demonstrate use of the taxonomy by applying it on five in-depth cases and illustrate angles of analysis on a set of thirteen alignment methods. In addition, we developed an assessment framework (REST-bench), applied it in an industrial assessment, and showed that it, with a low effort, can identify opportunities to improve REST alignment. Although we expect that the taxonomy can be further refined, we believe that the information dyad is a valid and useful construct to understand alignment.

Submitted 23 July, 2023; originally announced July 2023.

Journal ref: ACM Trans. Softw. Eng. Methodol. 23(2): 16:1-16:38 (2014)

arXiv:2307.12419 [pdf]

doi 10.1007/978-3-642-14192-8_14

Challenges in aligning requirements engineering and verification in a large-scale industrial context

Authors: Giedre Sabaliauskaite, Annabella Loconsole, Emelie Engström, Michael Unterkalmsteiner, Björn Regnell, Per Runeson, Tony Gorschek, Robert Feldt

Abstract: [Context and motivation] When developing software, coordination between different organizational units is essential in order to develop a good quality product, on time and within budget. Particularly, the synchronization between requirements and verification processes is crucial in order to assure that the developed software product satisfies customer requirements. [Question/problem] Our research question is: what are the current challenges in aligning the requirements and verification processes? [Principal ideas/results] We conducted an interview study at a large software development company. This paper presents preliminary findings of these interviews that identify key challenges in aligning requirements and verification processes. [Contribution] The result of this study includes a range of challenges faced by the studied organization grouped into the categories: organization and processes, people, tools, requirements process, testing process, change management, traceability, and measurement. The findings of this study can be used by practitioners as a basis for investigating alignment in their organizations, and by scientists in developing approaches for more efficient and effective management of the alignment between requirements and verification.

Submitted 23 July, 2023; originally announced July 2023.

Comments: Requirements Engineering: Foundation for Software Quality: 16th International Working Conference, REFSQ 2010, Essen, Germany, June 30-July 2, 2010. Proceedings 16 (pp. 128-142). Springer Berlin Heidelberg

arXiv:2306.05152 [pdf, ps, other]

Towards Autonomous Testing Agents via Conversational Large Language Models

Authors: Robert Feldt, Sungmin Kang, Juyeon Yoon, Shin Yoo

Abstract: Software testing is an important part of the development cycle, yet it requires specialized expertise and substantial developer effort to adequately test software. Recent discoveries of the capabilities of large language models (LLMs) suggest that they can be used as automated testing assistants, and thus provide helpful information and even drive the testing process. To highlight the potential of this technology, we present a taxonomy of LLM-based testing agents based on their level of autonomy, and describe how a greater level of autonomy can benefit developers in practice. An example use of LLMs as a testing assistant is provided to demonstrate how a conversational framework for testing can help developers. This also highlights how the often criticized hallucination of LLMs can be beneficial for testing. We identify other tangible benefits that LLM-driven testing agents can bestow, and also discuss potential limitations.

Submitted 5 September, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

arXiv:2306.02319 [pdf, other]

Learning Test-Mutant Relationship for Accurate Fault Localisation

Authors: Jinhan Kim, Gabin An, Robert Feldt, Shin Yoo

Abstract: Context: Automated fault localisation aims to assist developers in the task of identifying the root cause of the fault by narrowing down the space of likely fault locations. Simulating variants of the faulty program called mutants, several Mutation Based Fault Localisation (MBFL) techniques have been proposed to automatically locate faults. Despite their success, existing MBFL techniques suffer from the cost of performing mutation analysis after the fault is observed. Method: To overcome this shortcoming, we propose a new MBFL technique named SIMFL (Statistical Inference for Mutation-based Fault Localisation). SIMFL localises faults based on the past results of mutation analysis that has been done on the earlier version in the project history, allowing developers to make predictions on the location of incoming faults in a just-in-time manner. Using several statistical inference methods, SIMFL models the relationship between test results of the mutants and their locations, and subsequently infers the location of the current faults. Results: The empirical study on Defects4J dataset shows that SIMFL can localise 113 faults on the first rank out of 224 faults, outperforming other MBFL techniques. Even when SIMFL is trained on the predicted kill matrix, SIMFL can still localise 95 faults on the first rank out of 194 faults. Moreover, removing redundant mutants significantly improves the localisation accuracy of SIMFL by the number of faults localised at the first rank up to 51. Conclusion: This paper proposes a new MBFL technique called SIMFL, which exploits ahead-of-time mutation analysis to localise current faults. SIMFL is not only cost-effective, as it does not need a mutation analysis after the fault is observed, but also capable of localising faults accurately.

Submitted 4 June, 2023; originally announced June 2023.

Comments: Paper accepted for publication at IST. arXiv admin note: substantial text overlap with arXiv:1902.09729

arXiv:2301.07524 [pdf, other]

doi 10.1145/3611667

Towards Causal Analysis of Empirical Software Engineering Data: The Impact of Programming Languages on Coding Competitions

Authors: Carlo A. Furia, Richard Torkar, Robert Feldt

Abstract: There is abundant observational data in the software engineering domain, whereas running large-scale controlled experiments is often practically impossible. Thus, most empirical studies can only report statistical correlations -- instead of potentially more insightful and robust causal relations. To support analyzing purely observational data for causal relations, and to assess any differences between purely predictive and causal models of the same data, this paper discusses some novel techniques based on structural causal models (such as directed acyclic graphs of causal Bayesian networks). Using these techniques, one can rigorously express, and partially validate, causal hypotheses; and then use the causal information to guide the construction of a statistical model that captures genuine causal relations -- such that correlation does imply causation. We apply these ideas to analyzing public data about programmer performance in Code Jam, a large world-wide coding contest organized by Google every year. Specifically, we look at the impact of different programming languages on a participant's performance in the contest. While the overall effect associated with programming languages is weak compared to other variables -- regardless of whether we consider correlational or causal links -- we found considerable differences between a purely associational and a causal analysis of the very same data. The takeaway message is that even an imperfect causal analysis of observational data can help answer the salient research questions more precisely and more robustly than with just purely predictive techniques -- where genuine causal effects may be confounded.

Submitted 1 September, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

Comments: Improve the detail of a few references

arXiv:2301.03863 [pdf, other]

Robust web element identification for evolving applications by considering visual overlaps

Authors: Michel Nass, Riccardo Coppola, Emil Alégroth, Robert Feldt

Abstract: Fragile (i.e., non-robust) test execution is a common challenge for automated GUI-based testing of web applications as they evolve. Despite recent progress, there is still room for improvement since test execution failures caused by technical limitations result in unnecessary maintenance costs that limit its effectiveness and efficiency. One of the most reported technical challenges for web-based tests concerns how to reliably locate a web element used by a test script. This paper proposes the novel concept of Visually Overlapping Nodes (VON) that reduces fragility by utilizing the phenomenon that visual web elements (observed by the user) are constructed from multiple web elements in the Document Object Model (DOM) that overlap visually. We demonstrate the approach in a tool, VON Similo, which extends the state-of-the-art multi-locator approach (Similo) that is also used as the baseline for an experiment. In the experiment, a ground truth set of 1163 manually collected web element pairs, from different releases of the 40 most popular websites on the internet, is used to compare the approaches' precision, recall, and accuracy. Our results show that VON Similo provides 94.7% accuracy in identifying a web element in a new release of the same SUT. In comparison, Similo provides 83.8% accuracy. These results demonstrate the applicability of the visually overlapping nodes concept/tool for web element localization in evolving web applications and contribute a novel way of thinking about web element localization in future research on GUI-based testing.

Submitted 13 January, 2023; v1 submitted 10 January, 2023; originally announced January 2023.

arXiv:2208.00677 [pdf, other]

Similarity-based web element localization for robust test automation

Authors: Michel Nass, Emil Alégroth, Robert Feldt, Maurizio Leotta, Filippo Ricca

Abstract: Non-robust (fragile) test execution is a commonly reported challenge in GUI-based test automation, despite much research and several proposed solutions. A test script needs to be resilient to (minor) changes in the tested application but, at the same time, fail when detecting potential issues that require investigation. Test script fragility is a multi-faceted problem, but one crucial challenge is reliably identifying and locating the correct target web elements when the website evolves between releases or otherwise fails and reports an issue. This paper proposes and evaluates a novel approach called similarity-based web element localization (Similo), which leverages information from multiple web element locator parameters to identify a target element using a weighted similarity score. The experimental study compares Similo to a baseline approach for web element localization. To get an extensive empirical basis, we target 40 of the most popular websites on the Internet in our evaluation. Robustness is considered by counting the number of web elements found in a recent website version compared to how many of these existed in an older version. Results of the experiment show that Similo outperforms the baseline representing the current state-of-the-art; it failed to locate the correct target web element in 72 out of 598 considered cases compared to 146 failed cases for the baseline approach.
This study presents evidence that quantifying the similarity between multiple attributes of web elements when trying to locate them, as in our proposed Similo approach, is beneficial. With acceptable efficiency, Similo gives significantly higher effectiveness (i.e., robustness) than the baseline web element localization approach.

Submitted 1 August, 2022; originally announced August 2022.
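
Illustration (not part of the arXiv record): the abstract above describes locating a target element with a weighted similarity score over several locator parameters. The sketch below shows the general idea only. The attribute names, the weights, and the exact-match comparison are assumptions chosen for illustration and are not the values or comparison functions used by Similo.

```python
# Hypothetical attribute weights; NOT the weights used by Similo.
WEIGHTS = {"tag": 1.5, "id": 1.5, "name": 1.5, "text": 1.0, "class": 0.5}


def attribute_similarity(a, b):
    """Return 1.0 for an exact match, 0.0 otherwise or if a value is missing."""
    if not a or not b:
        return 0.0
    return 1.0 if a == b else 0.0


def similarity_score(target, candidate, weights=WEIGHTS):
    """Sum the per-attribute similarities, each scaled by its weight."""
    return sum(
        weight * attribute_similarity(target.get(attr), candidate.get(attr))
        for attr, weight in weights.items()
    )


def locate(target, candidates, weights=WEIGHTS):
    """Pick the candidate with the highest weighted similarity to the target."""
    if not candidates:
        raise ValueError("no candidate elements to compare against")
    return max(candidates, key=lambda c: similarity_score(target, c, weights))


# Element as recorded in the old release (hypothetical data).
old_button = {"tag": "button", "id": "submit", "text": "Send", "class": "btn"}

# Elements found in the new release: the id and class of the button changed.
new_elements = [
    {"tag": "button", "id": "send-btn", "text": "Send", "class": "btn primary"},
    {"tag": "a", "id": "submit", "text": "Help", "class": "link"},
]

best = locate(old_button, new_elements)
```

A single-locator strategy keyed on `id` would pick the link here, whereas the combined score still favours the button because its tag and text agree with the recorded element.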

arXiv:2207.09065 [pdf, other]

Automated Black-Box Boundary Value Detection

Authors: Felix Dobslaw, Robert Feldt, Francisco de Oliveira Neto

Abstract: The input domain of software systems can typically be divided into sub-domains for which the outputs are similar. To ensure high quality it is critical to test the software on the boundaries between these sub-domains. Consequently, boundary value analysis and testing has long been part of the toolbox of software testers and is typically taught early to students. However, despite its many argued benefits, boundary value analysis for a given specification or piece of software is typically described in abstract terms which allow for variation in how testers apply it. Here we propose an automated, black-box boundary value detection method to support software testers in systematic boundary value analysis with consistent results. The method builds on a metric to quantify the level of boundariness of test inputs: the program derivative. By coupling it with search algorithms we find and rank pairs of inputs as good boundary candidates, i.e. inputs close together but with outputs far apart. We implement our AutoBVA approach and evaluate it on a curated dataset of example programs. Our results indicate that even with a simple and generic program derivative variant in combination with broad sampling over the input space, interesting boundary candidates can be identified.

Submitted 19 July, 2022; originally announced July 2022.
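
Illustration (not part of the arXiv record): the abstract above ranks input pairs that are "close together but with outputs far apart". One way to read that is as a ratio of output distance to input distance. The sketch below assumes that reading and replaces the paper's search algorithms with an exhaustive scan over neighbouring integers; `classify_age`, the distance functions, and the input range are hypothetical and this is not the AutoBVA implementation.

```python
def program_derivative(program, x1, x2, input_dist, output_dist):
    """Output distance divided by input distance for one pair of inputs."""
    d_in = input_dist(x1, x2)
    if d_in == 0:
        raise ValueError("the two inputs must differ")
    return output_dist(program(x1), program(x2)) / d_in


def rank_boundary_candidates(program, inputs, input_dist, output_dist):
    """Score each pair of neighbouring inputs and sort by descending score."""
    scored = [
        (program_derivative(program, a, b, input_dist, output_dist), a, b)
        for a, b in zip(inputs, inputs[1:])
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def classify_age(age):
    """Hypothetical program under test with one boundary between 17 and 18."""
    return "minor" if age < 18 else "adult"


ranked = rank_boundary_candidates(
    classify_age,
    list(range(0, 41)),
    input_dist=lambda a, b: abs(a - b),
    output_dist=lambda a, b: 0.0 if a == b else 1.0,
)
top_score, low, high = ranked[0]
```

The program is only called, never inspected, which is what keeps the scan black-box. For larger input spaces an exhaustive scan is not feasible, which is where sampling or search would come in.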

arXiv:2206.15428 [pdf]

Test2Vec: An Execution Trace Embedding for Test Case Prioritization

Authors: Emad Jabbar, Soheila Zangeneh, Hadi Hemmati, Robert Feldt

Abstract: Most automated software testing tasks can benefit from the abstract representation of test cases. Traditionally, this is done by encoding test cases based on their code coverage. Specification-level criteria can replace code coverage to better represent test cases' behavior, but they are often not cost-effective. In this paper, we hypothesize that execution traces of the test cases can be a good alternative to abstract their behavior for automated testing tasks. We propose a novel embedding approach, Test2Vec, that maps test execution traces to a latent space. We evaluate this representation in the test case prioritization (TP) task. Our default TP method is based on the similarity of the embedded vectors to historical failing test vectors. We also study an alternative based on the diversity of test vectors. Finally, we propose a method to decide which TP to choose, for a given test suite. The experiment is based on several real and seeded faults with over a million execution traces. Results show that our proposed TP improves best alternatives by 41.80% in terms of the median normalized rank of the first failing test case (FFR). It outperforms traditional code coverage-based approaches by 25.05% and 59.25% in terms of median APFD and median normalized FFR.

Submitted 28 June, 2022; originally announced June 2022.

arXiv:2201.06044 [pdf, ps, other]

doi 10.1145/3511805

A Taxonomy of Information Attributes for Test Case Prioritisation: Applicability, Machine Learning

Authors: Aurora Ramírez, Robert Feldt, José Raúl Romero

Abstract: Most software companies have extensive test suites and re-run parts of them continuously to ensure recent changes have no adverse effects. Since test suites are costly to execute, industry needs methods for test case prioritisation (TCP). Recent TCP methods use machine learning (ML) to exploit the information known about the system under test (SUT) and its test cases. However, the value added by ML-based TCP methods should be critically assessed with respect to the cost of collecting the information. This paper analyses two decades of TCP research, and presents a taxonomy of 91 information attributes that have been used. The attributes are classified with respect to their information sources and the characteristics of their extraction process. Based on this taxonomy, TCP methods validated with industrial data and those applying ML are analysed in terms of information availability, attribute combination and definition of data features suitable for ML. Relying on a high number of information attributes, assuming easy access to SUT code and simplified testing environments are identified as factors that might hamper industrial applicability of ML-based TCP. The TePIA taxonomy provides a reference framework to unify terminology and evaluate alternatives considering the cost-benefit of the information attributes.

Submitted 16 January, 2022; originally announced January 2022.

Comments: Accepted for publication in ACM Transactions on Software Engineering and Methodology. Additional material available from a GitHub repository: https://github.com/tepia-taxonomy/taxonomy-analysis

ACM Class: D.2.5; I.2.6

arXiv:2201.05551 [pdf, other]

Cognition in Software Engineering: A Taxonomy and Survey of a Half-Century of Research

Authors: Fabian Fagerholm, Michael Felderer, Davide Fucci, Michael Unterkalmsteiner, Bogdan Marculescu, Markus Martini, Lars Göran Wallgren Tengberg, Robert Feldt, Bettina Lehtelä, Balázs Nagyváradi, Jehan Khattak

Abstract: Cognition plays a fundamental role in most software engineering activities. This article provides a taxonomy of cognitive concepts and a survey of the literature since the beginning of the Software Engineering discipline. The taxonomy comprises the top-level concepts of perception, attention, memory, cognitive load, reasoning, cognitive biases, knowledge, social cognition, cognitive control, and errors, and procedures to assess them both qualitatively and quantitatively. The taxonomy provides a useful tool to filter existing studies, classify new studies, and support researchers in getting familiar with a (sub) area. In the literature survey, we systematically collected and analysed 311 scientific papers spanning five decades and classified them using the cognitive concepts from the taxonomy. Our analysis shows that the most developed areas of research correspond to the four life-cycle stages, software requirements, design, construction, and maintenance. Most research is quantitative and focuses on knowledge, cognitive load, memory, and reasoning. Overall, the state of the art appears fragmented when viewed from the perspective of cognition. There is a lack of use of cognitive concepts that would represent a coherent picture of the cognitive processes active in specific tasks. Accordingly, we discuss the research gap in each cognitive concept and provide recommendations for future research.

Submitted 14 January, 2022; originally announced January 2022.

arXiv:2110.13575 [pdf, other]

Automated Support for Unit Test Generation: A Tutorial Book Chapter

Authors: Afonso Fontes, Gregory Gay, Francisco Gomes de Oliveira Neto, Robert Feldt

Abstract: Unit testing is a stage of testing where the smallest segment of code that can be tested in isolation from the rest of the system - often a class - is tested. Unit tests are typically written as executable code, often in a format provided by a unit testing framework such as pytest for Python. Creating unit tests is a time- and effort-intensive process with many repetitive, manual elements. To illustrate how AI can support unit testing, this chapter introduces the concept of search-based unit test generation. This technique frames the selection of test input as an optimization problem - we seek a set of test cases that meet some measurable goal of a tester - and unleashes powerful metaheuristic search algorithms to identify the best possible test cases within a restricted timeframe. This chapter introduces two algorithms that can generate pytest-formatted unit tests, tuned towards coverage of source code statements. The chapter concludes by discussing more advanced concepts and gives pointers to further reading for how artificial intelligence can support developers and testers when unit testing software.

Submitted 26 October, 2021; originally announced October 2021.

Comments: This is a preprint of a chapter from the upcoming book, "Optimising the Software Development Process with Artificial Intelligence" (Springer, 2022)

arXiv:2104.09107 [pdf, other]

Causal Program Dependence Analysis

Authors: Seongmin Lee, Dave Binkley, Robert Feldt, Nicolas Gold, Shin Yoo

Abstract: We introduce Causal Program Dependence Analysis (CPDA), a dynamic dependence analysis that applies causal inference to model the strength of program dependence relations in a continuous space. CPDA observes the association between program elements by constructing and executing modified versions of a program. One advantage of CPDA is that this construction requires only light-weight parsing rather than sophisticated static analysis. The result is a collection of observations based on how often a change in the value produced by a mutated program element affects the behavior of other elements. From this set of observations, CPDA discovers a causal structure capturing the causal (i.e., dependence) relation between program elements. Qualitative evaluation finds that CPDA concisely expresses key dependence relationships between program elements. As an example application, we apply CPDA to the problem of fault localization. Using minimal test suites, our approach can rank twice as many faults compared to SBFL.

Submitted 19 April, 2021; originally announced April 2021.

Comments: 12 pages, 10 main text pages, 1 reference page, 1 appendix page, 5 figures, and 5 tables

arXiv:2103.04749 [pdf, other]

Towards Human-Like Automated Test Generation: Perspectives from Cognition and Problem Solving

Authors: Eduard Enoiu, Robert Feldt

Abstract: Automated testing tools typically create test cases that are different from what human testers create. This often makes the tools less effective, the created tests harder to understand, and thus results in tools providing less support to human testers. Here, we propose a framework based on cognitive science and, in particular, an analysis of approaches to problem-solving, for identifying cognitive processes of testers. The framework helps map test design steps and criteria used in human test activities and thus to better understand how effective human testers perform their tasks. Ultimately, our goal is to be able to mimic how humans create test cases and thus to design more human-like automated test generation systems. We posit that such systems can better augment and support testers in a way that is meaningful to them.

Submitted 8 March, 2021; originally announced March 2021.

Comments: preprint; accepted by CHASE 2020 as a note paper

arXiv:2101.12591 [pdf, other]

Applying Bayesian Analysis Guidelines to Empirical Software Engineering Data: The Case of Programming Languages and Code Quality

Authors: Carlo A. Furia, Richard Torkar, Robert Feldt

Abstract: Statistical analysis is the tool of choice to turn data into information, and then information into empirical knowledge. To be valid, the process that goes from data to knowledge should be supported by detailed, rigorous guidelines, which help ferret out issues with the data or model, and lead to qualified results that strike a reasonable balance between generality and practical relevance. Such guidelines are being developed by statisticians to support the latest techniques for Bayesian data analysis. In this article, we frame these guidelines in a way that is apt to empirical research in software engineering. To demonstrate the guidelines in practice, we apply them to reanalyze a GitHub dataset about code quality in different programming languages. The dataset's original analysis (Ray et al., 2014) and a critical reanalysis (Berger et al., 2019) have attracted considerable attention -- in no small part because they target a topic (the impact of different programming languages) on which strong opinions abound. The goals of our reanalysis are largely orthogonal to this previous work, as we are concerned with demonstrating, on data in an interesting domain, how to build a principled Bayesian data analysis and to showcase some of its benefits. In the process, we will also shed light on some critical aspects of the analyzed data and of the relationship between programming languages and code quality.
The high-level conclusions of our exercise will be that Bayesian statistical techniques can be applied to analyze software engineering data in a way that is principled, flexible, and leads to convincing results that inform the state of the art while highlighting the boundaries of its validity. The guidelines can support building solid statistical analyses and connecting their results, and hence help buttress continued progress in empirical software engineering research.

Submitted 28 July, 2021; v1 submitted 29 January, 2021; originally announced January 2021.

arXiv:2010.09144 [pdf, other]

doi 10.1109/ICSTW50294.2020.00051

Using mutation testing to measure behavioural test diversity

Authors: Francisco Gomes de Oliveira Neto, Felix Dobslaw, Robert Feldt

Abstract: Diversity has been proposed as a key criterion to improve testing effectiveness and efficiency. It can be used to optimise large test repositories but also to visualise test maintenance issues and raise practitioners' awareness about waste in test artefacts and processes. Even though these diversity-based testing techniques aim to exercise diverse behavior in the system under test (SUT), the diversity has mainly been measured on and between artefacts (e.g., inputs, outputs or test scripts). Here, we introduce a family of measures to capture behavioural diversity (b-div) of test cases by comparing their executions and failure outcomes. Using failure information to capture the SUT behaviour has been shown to improve effectiveness of history-based test prioritisation approaches. However, history-based techniques require reliable test execution logs which are often not available or can be difficult to obtain due to flaky tests, scarcity of test executions, etc. To be generally applicable we instead propose to use mutation testing to measure behavioral diversity by running the set of test cases on various mutated versions of the SUT. Concretely, we propose two specific b-div measures (based on accuracy and Matthews correlation coefficient, respectively) and compare them with artefact-based diversity (a-div) for prioritising the test suites of 6 different open-source projects. Our results show that our b-div measures outperform a-div and random selection in all of the studied projects.
The improvement is substantial with an average increase in average percentage of faults detected (APFD) of between 19% and 31% depending on the size of the subset of prioritised tests.

Submitted 18 October, 2020; originally announced October 2020.

Comments: Published at the 15th International Workshop on Mutation Analysis
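
Illustration (not part of the arXiv record): the abstract above reports results as average percentage of faults detected (APFD). The sketch below uses the commonly cited textbook formula, APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n), where n is the number of tests, m the number of faults, and TF_i the 1-based position of the first test that detects fault i. The test and fault names are hypothetical, and the paper may compute the metric over prioritised subsets in a way this sketch does not cover.

```python
def apfd(order, detects):
    """APFD of a test ordering.

    order   -- list of test ids in execution order
    detects -- dict mapping a test id to the set of fault ids it detects

    Only faults detected by at least one test in `order` are counted.
    """
    faults = set().union(*(detects.get(test, set()) for test in order))
    n, m = len(order), len(faults)
    if n == 0 or m == 0:
        raise ValueError("need at least one test and one detected fault")

    # 1-based position of the first test that reveals each fault.
    first_position = {}
    for position, test in enumerate(order, start=1):
        for fault in detects.get(test, set()):
            first_position.setdefault(fault, position)

    return 1 - sum(first_position.values()) / (n * m) + 1 / (2 * n)


# Hypothetical suite: only t3 reveals the two faults.
detects = {"t3": {"f1", "f2"}}

late = apfd(["t1", "t2", "t3"], detects)   # fault-revealing test runs last
early = apfd(["t3", "t1", "t2"], detects)  # fault-revealing test runs first
```

Moving the fault-revealing test to the front raises the score, which is the property a prioritisation technique tries to exploit.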

arXiv:2010.03525 [pdf]

Empirical Standards for Software Engineering Research

Authors: Paul Ralph, Nauman bin Ali, Sebastian Baltes, Domenico Bianculli, Jessica Diaz, Yvonne Dittrich, Neil Ernst, Michael Felderer, Robert Feldt, Antonio Filieri, Breno Bernard Nicolau de França, Carlo Alberto Furia, Greg Gay, Nicolas Gold, Daniel Graziotin, Pinjia He, Rashina Hoda, Natalia Juristo, Barbara Kitchenham, Valentina Lenarduzzi, Jorge Martínez, Jorge Melegati, Daniel Mendez, Tim Menzies, Jefferson Molleri, et al. (18 additional authors not shown)

Abstract: Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.

Submitted 4 March, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: For the complete standards, supplements and other resources, see https://github.com/acmsigsoft/EmpiricalStandards

arXiv:2007.08927 [pdf, other]

Towards a Model of Testers' Cognitive Processes: Software Testing as a Problem Solving Approach

Authors: Eduard Enoiu, Gerald Tukseferi, Robert Feldt

Abstract: Software testing is a complex, intellectual activity based (at least) on analysis, reasoning, decision making, abstraction and collaboration performed in a highly demanding environment. Naturally, it uses and allocates multiple cognitive resources in software testers. However, while a cognitive psychology perspective is increasingly used in the general software engineering literature, it has yet to find its place in software testing. To the best of our knowledge, no theory of software testers' cognitive processes exists. Here, we take the first step towards such a theory by presenting a cognitive model of software testing based on how problem solving is conceptualized in cognitive psychology. Our approach is to instantiate a general problem solving process for the specific problem of creating test cases. We then propose an experiment for testing our cognitive test design model. The experiment makes use of verbal protocol analysis to understand the mechanisms by which human testers choose, design, implement and evaluate test cases. An initial evaluation was then performed with five software engineering master's students as subjects. The results support a problem solving-based model of test design for capturing testers' cognitive processes.

Submitted 9 December, 2020; v1 submitted 17 July, 2020; originally announced July 2020.

Comments: (v3) minor issues fixed, Accepted and presented in the IEEE International Workshop on Human and Social Aspects of Software Quality (HASQ 2020)

arXiv:2006.00894 [pdf, other]

doi 10.1145/3368089.3417065

Reducing DNN Labelling Cost using Surprise Adequacy: An Industrial Case Study for Autonomous Driving

Authors: Jinhan Kim, Jeongil Ju, Robert Feldt, Shin Yoo

Abstract: Deep Neural Networks (DNNs) are rapidly being adopted by the automotive industry, due to their impressive performance in tasks that are essential for autonomous driving. Object segmentation is one such task: its aim is to precisely locate boundaries of objects and classify the identified objects, helping autonomous cars to recognise the road environment and the traffic situation. Not only is this task safety critical, but developing a DNN based object segmentation module presents a set of challenges that are significantly different from traditional development of safety critical software. The development process in use consists of multiple iterations of data collection, labelling, training, and evaluation. Among these stages, training and evaluation are computation intensive while data collection and labelling are manual labour intensive. This paper shows how development of DNN based object segmentation can be improved by exploiting the correlation between Surprise Adequacy (SA) and model performance. The correlation allows us to predict model performance for inputs without manually labelling them. This, in turn, enables understanding of model performance, more guided data collection, and informed decisions about further training. In our industrial case study the technique allows cost savings of up to 50% with negligible evaluation inaccuracy. Furthermore, engineers can trade off cost savings versus the tolerable level of inaccuracy depending on different development phases and scenarios.

Submitted 7 September, 2020; v1 submitted 29 May, 2020; originally announced June 2020.

Comments: to be published in Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

arXiv:2005.09959 [pdf, other]

doi 10.1145/3469888

Psychometrics in Behavioral Software Engineering: A Methodological Introduction with Guidelines

Authors: Daniel Graziotin, Per Lenberg, Robert Feldt, Stefan Wagner

Abstract: A meaningful and deep understanding of the human aspects of software engineering (SE) requires psychological constructs to be considered. Psychology theory can facilitate the systematic and sound development as well as the adoption of instruments (e.g., psychological tests, questionnaires) to assess these constructs. In particular, to ensure high quality, the psychometric properties of instruments need evaluation. In this paper, we provide an introduction to psychometric theory for the evaluation of measurement instruments for SE researchers. We present guidelines that enable using existing instruments and developing new ones adequately. We conducted a comprehensive review of the psychology literature framed by the Standards for Educational and Psychological Testing. We detail activities used when operationalizing new psychological constructs, such as item pooling, item review, pilot testing, item analysis, factor analysis, statistical property of items, reliability, validity, and fairness in testing and test bias. We provide an openly available example of a psychometric evaluation based on our guideline. We hope to encourage a culture change in SE research towards the adoption of established methods from psychology. To improve the quality of behavioral research in SE, studies focusing on introducing, validating, and then using psychometric instruments need to be more common.

Submitted 8 June, 2021; v1 submitted 20 May, 2020; originally announced May 2020.

Comments: 56 pages (pp. 1-36 for the main paper, pp. 37-56 working example in the appendix), 8 figures in the main paper. Accepted for publication at ACM TOSEM

Journal ref: ACM Trans. Softw. Eng. Methodol. 31, 1, Article 7 (September 2021), 36 pages

arXiv:2005.09296 [pdf, other]

SINVAD: Search-based Image Space Navigation for DNN Image Classifier Test Input Generation

Authors: Sungmin Kang, Robert Feldt, Shin Yoo

Abstract: The testing of Deep Neural Networks (DNNs) has become increasingly important as DNNs are widely adopted by safety critical systems. While many test adequacy criteria have been suggested, automated test input generation for many types of DNNs remains a challenge because the raw input space is too large to randomly sample or to navigate and search for plausible inputs. Consequently, current testing techniques for DNNs depend on small local perturbations to existing inputs, based on the metamorphic testing principle. We propose new ways to search not over the entire image space, but rather over a plausible input space that resembles the true training distribution. This space is constructed using Variational Autoencoders (VAEs), and navigated through their latent vector space. We show that this space helps efficiently produce test inputs that can reveal information about the robustness of DNNs when dealing with realistic tests, opening the field to meaningful exploration through the space of highly structured images.

Submitted 19 May, 2020; originally announced May 2020.

arXiv:2001.06652 [pdf, other]

doi 10.1109/ICSTW50294.2020.00062

Boundary Value Exploration for Software Analysis

Authors: Felix Dobslaw, Francisco Gomes de Oliveira Neto, Robert Feldt

Abstract: For software to be reliable and resilient, it is widely accepted that tests must be created and maintained alongside the software itself. One safeguard from vulnerabilities and failures in code is to ensure correct behavior on the boundaries between the input space sub-domains. So-called boundary value analysis (BVA) and boundary value testing (BVT) techniques aim to exercise those boundaries and increase test effectiveness. However, the concepts of BVA and BVT themselves are not generally well defined, and it is not clear how to identify relevant sub-domains, and thus the boundaries delineating them, given a specification. This has limited adoption and hindered automation. We clarify BVA and BVT and introduce Boundary Value Exploration (BVE) to describe techniques that support them by helping to detect and identify boundary inputs. Additionally, we propose two concrete BVE techniques based on information-theoretic distance functions: (i) an algorithm for boundary detection and (ii) the usage of software visualization to explore the behavior of the software under test and identify its boundary behavior. As an initial evaluation, we apply these techniques on a much used and well-tested date handling library. Our results reveal questionable behavior at boundaries highlighted by our techniques. In conclusion, we argue that the boundary value exploration that our techniques enable is a step towards automated boundary value analysis and testing, fostering their wider use and improving test effectiveness and efficiency.

Submitted 12 October, 2020; v1 submitted 18 January, 2020; originally announced January 2020.

arXiv:1907.03475 [pdf, other]

Estimating Return on Investment for GUI Test Automation Tools

Authors: Felix Dobslaw, Robert Feldt, David Michaelsson, Patrick Haar, Francisco G. de Oliveira Neto, Richard Torkar

Abstract: Automated graphical user interface (GUI) tests can reduce manual testing activities and increase test frequency. This motivates the conversion of manual test cases into automated GUI tests. However, it is not clear whether such automation is cost-effective given that GUI automation scripts add to the code base and demand maintenance as a system evolves. In this paper, we introduce a method for estimating maintenance cost and Return on Investment (ROI) for Automated GUI Testing (AGT). The method utilizes the existing source code change history and can also be used to evaluate other testing or quality assurance automation technologies. We evaluate the method for a real-world, industrial software system and compare two fundamentally different AGT tools, namely Selenium and EyeAutomate, to estimate and compare their ROI. We also report on their defect-finding capabilities and usability. The quantitative data is complemented by interviews with employees at the case company. The method was successfully applied and estimated maintenance cost and ROI for both tools are reported. Overall, the study supports earlier results showing that implementation time is the leading cost for introducing AGT. The findings further suggest that while EyeAutomate tests are significantly faster to implement, Selenium tests require more of a programming background but less maintenance.

Submitted 1 November, 2019; v1 submitted 8 July, 2019; originally announced July 2019.

Comments: 12 pages

arXiv:1905.11198 [pdf, other]

Towards Automated Boundary Value Testing with Program Derivatives and Search

Authors: Robert Feldt, Felix Dobslaw

Abstract: A natural and often used strategy when testing software is to use input values at boundaries, i.e. where behavior is expected to change the most, an approach often called boundary value testing or analysis (BVA). Even though this has been a key testing idea for a long time, it has been hard to clearly define and formalize. Consequently, it has also been hard to automate. In this research note we propose one such formalization of BVA by, in a similar way as to how the derivative of a function is defined in mathematics, considering (software) program derivatives. Critical to our definition is the notion of distance between inputs and outputs which we can formalize and then quantify based on ideas from information theory. However, for our (black-box) approach to be practical one must search for test inputs with specific properties. Coupling it with search-based software engineering is thus required and we discuss how program derivatives can be used as and within fitness functions. This brief note does not allow a deeper, empirical investigation but we use a simple illustrative example throughout to introduce the main ideas. By combining program derivatives with search, we thus propose a practical as well as theoretically interesting technique for automated boundary value (analysis and) testing.

Submitted 27 May, 2019; originally announced May 2019.
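
Illustrative note (not from the paper): the abstract above describes a program derivative as the change in output relative to the change in input, by analogy with the mathematical derivative. The sketch below is one minimal reading of that idea. The program under test (`classify`), both distance functions, and the exhaustive scan that stands in for a real search algorithm are all hypothetical choices made for illustration; the paper's information-theoretic distances are not reproduced here.

```python
from typing import Callable, List, Tuple


def classify(age: int) -> str:
    """Hypothetical program under test with a single boundary at 18."""
    return "minor" if age < 18 else "adult"


def input_distance(a: int, b: int) -> float:
    """Assumed input distance: absolute numeric difference."""
    return float(abs(a - b))


def output_distance(x: str, y: str) -> float:
    """Assumed output distance: 0 for identical outputs, 1 otherwise."""
    return 0.0 if x == y else 1.0


def program_derivative(f: Callable[[int], str], a: int, b: int) -> float:
    """Output distance divided by input distance for one pair of inputs."""
    d_in = input_distance(a, b)
    if d_in == 0:
        raise ValueError("inputs must differ")
    return output_distance(f(a), f(b)) / d_in


def steepest_pair(f: Callable[[int], str], candidates: List[int]) -> Tuple[int, int]:
    """Scan neighbouring candidates and return the pair with the largest derivative.

    A plain scan stands in for the search step; a real setup would use the
    derivative as (part of) a fitness function in a search algorithm.
    """
    pairs = list(zip(candidates, candidates[1:]))
    return max(pairs, key=lambda p: program_derivative(f, p[0], p[1]))


best = steepest_pair(classify, list(range(0, 40)))
print(best)  # the neighbouring inputs on either side of the boundary
```

Inputs far apart that straddle the boundary score low (a large input distance dilutes the output change), while the adjacent pair at the boundary scores highest, which is why a high derivative can serve as a signal for boundary candidates.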

arXiv:1904.03948 [pdf, other]

The Unfulfilled Potential of Data-Driven Decision Making in Agile Software Development

Authors: Richard Berntsson Svensson, Robert Feldt, Richard Torkar

Abstract: With the general trend towards data-driven decision making (DDDM), organizations are looking for ways to use DDDM to improve their decisions. However, few studies have looked into the practitioners' view of DDDM, in particular for agile organizations. In this paper we investigated the experiences of using DDDM, and how data can improve decision making. An emailed questionnaire was sent out to 124 industry practitioners in agile software development companies, of which 84 answered. The results show that few practitioners indicated a widespread use of DDDM in their current decision making practices. The practitioners were more positive to its future use for higher-level and more general decision making, fairly positive to its use for requirements elicitation and prioritization decisions, while being less positive to its future use at the team level. The practitioners do see a lot of potential for DDDM in an agile context; however, currently unfulfilled.

Submitted 8 April, 2019; originally announced April 2019.

Journal ref: 20th International Conference on Agile Software Development (XP), 2019

arXiv:1904.02468 [pdf, other]

doi 10.1016/j.jss.2016.11.024

Group development and group maturity when building agile teams: A qualitative and quantitative investigation at eight large companies

Authors: Lucas Gren, Richard Torkar, Robert Feldt

Abstract: The agile approach to projects focuses more on close-knit teams than traditional waterfall projects, which means that aspects of group maturity become even more important. This psychological aspect is not much researched in connection to the building of an "agile team." The purpose of this study is to investigate how building agile teams is connected to a group development model taken from social psychology. We conducted ten semi-structured interviews with coaches, Scrum Masters, and managers responsible for the agile process from seven different companies, and collected survey data from 66 group-members from four companies (a total of eight different companies). The survey included an agile measurement tool and one part of the Group Development Questionnaire. The results show that the practitioners define group developmental aspects as key factors to a successful agile transition. Also, the quantitative measurement of agility was significantly correlated to the group maturity measurement. We conclude that adding these psychological aspects to the description of the "agile team" could increase the understanding of agility and partly help define an "agile team." We propose that future work should develop specific guidelines for how software development teams at different maturity levels might adopt agile principles and practices differently.

Submitted 4 April, 2019; originally announced April 2019.

Journal ref: The Journal of Systems and Software 124 (2017) 104-119

arXiv:1904.02451 [pdf, other]

doi 10.1109/SEAA.2015.31

Group Maturity and Agility, Are They Connected? - A Survey Study

Authors: Lucas Gren, Richard Torkar, Robert Feldt

Abstract: The focus on psychology has increased within software engineering due to the project management innovation "agile development processes". The agile methods do not explicitly consider group development aspects; they simply assume what is described in group psychology as mature groups. This study was conducted with 45 employees and their twelve managers (N=57) from two SAP customers in the US that were working with agile methods, and the data were collected via an online survey. The selected Agility measurement was correlated to a Group Development measurement and showed significant convergent validity, i.e., a more mature team is also a more agile team. This means that the agile methods probably would benefit from taking group development into account when their practices are being introduced.

Submitted 4 April, 2019; originally announced April 2019.

Journal ref: 41st Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2015

arXiv:1904.02444 [pdf, other]

doi 10.1016/j.jss.2015.05.008

The prospects of a quantitative measurement of agility: A validation study on an agile maturity model

Authors: Lucas Gren, Richard Torkar, Robert Feldt

Abstract: Agile development has now become a well-known approach to collaboration in professional work life. Both researchers and practitioners want validated tools to measure agility. This study sets out to validate an agile maturity measurement model with statistical tests and empirical data. First, a pretest was conducted as a case study including a survey and focus group. Second, the main study was conducted with 45 employees from two SAP customers in the US. We used internal consistency (by Cronbach's alpha) as the main measure for reliability and analyzed construct validity by exploratory principal factor analysis (PFA). The results suggest a new categorization of a subset of items existing in the tool and provide empirical support for these new groups of factors. However, we argue that more work is needed to reach the point where a maturity model with quantitative data can be said to validly measure agility, and even then, such a measurement still needs to include some deeper analysis with cultural and contextual items.

Submitted 4 April, 2019; originally announced April 2019.

Journal ref: The Journal of Systems and Software 107 (2015) 38-49
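
Illustrative note (not from the paper): the abstract above uses Cronbach's alpha as its reliability measure. For readers unfamiliar with it, the standard formula is alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below computes it with sample variances; the function name and the tiny data set are hypothetical and unrelated to the study's data.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix.

    Each row holds one respondent's answers; each column is one item.
    Uses sample variances (n - 1 in the denominator).
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    item_vars: List[float] = [
        variance([row[i] for row in scores]) for i in range(k)
    ]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical answers from three respondents to three items that agree perfectly.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(consistent))
```

When all items move together, as in this toy matrix, the item variances account for only a small share of the total-score variance and alpha reaches its maximum; items that vary independently push alpha towards zero.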

arXiv:1904.02439 [pdf, ps, other]

doi 10.1109/AGILE.2014.13

Work Motivational Challenges Regarding the Interface Between Agile Teams and a Non-Agile Surrounding Organization: A case study

Authors: Lucas Gren, Richard Torkar, Robert Feldt

Abstract: There are studies showing what happens if agile teams are introduced into a non-agile organization, e.g. higher overhead costs and the necessity of an understanding of agile methods even outside the teams. This case study shows an example of work motivational aspects that might surface when an agile team exists in the middle of a more traditional structure. This case study was conducted at a car manufacturer in Sweden, consisting of an unstructured interview with the Scrum Master and a semi-structured focus group. The results show that the teams felt that the feedback from the surrounding organization was unsynchronized, resulting in them not feeling appreciated when delivering their work. Moreover, they felt frustrated when working on non-agile teams after having worked on agile ones. This study concludes that there were work motivational effects of fitting an agile team into a non-agile surrounding organization, and therefore this might also be true for other organizations.

Submitted 4 April, 2019; originally announced April 2019.

Journal ref: 2014 Agile Conference, July 28-August 1, 2014

arXiv:1904.00661 [pdf, other]

Bayesian data analysis in empirical software engineering---The case of missing data

Authors: Richard Torkar, Robert Feldt, Carlo A. Furia

Abstract: Bayesian data analysis (BDA) is today used by a multitude of research disciplines. These disciplines use BDA as a way to embrace uncertainty by using multilevel models and making use of all available information at hand. In this chapter, we first introduce the reader to BDA and then provide an example from empirical software engineering, where we also deal with a common issue in our field, i.e., missing data. The example we make use of presents the steps done when conducting state of the art statistical analysis. First, we need to understand the problem we want to solve. Second, we conduct causal analysis. Third, we analyze non-identifiability. Fourth, we conduct missing data analysis. Finally, we do a sensitivity analysis of priors. All this before we design our statistical model. Once we have a model, we present several diagnostics one can use to conduct sanity checks. We hope that through these examples, the reader will see the advantages of using BDA. This way, we hope Bayesian statistics will become more prevalent in our field, thus partly avoiding the reproducibility crisis we have seen in other disciplines.

Submitted 1 January, 2020; v1 submitted 1 April, 2019; originally announced April 2019.

Comments: 34 pages, 15 figures. Chapter in the book Contemporary Empirical Methods in Software Engineering

arXiv:1902.09729 [pdf, other]

doi 10.1109/ISSRE52982.2021.00036

Ahead of Time Mutation Based Fault Localisation using Statistical Inference

Authors: Jinhan Kim, Gabin An, Robert Feldt, Shin Yoo

Abstract: Mutation analysis can effectively capture the dependency between source code and test results. This has been exploited by Mutation Based Fault Localisation (MBFL) techniques. However, MBFL techniques suffer from the need to expend the high cost of mutation analysis after the observation of failures, which may present a challenge for its practical adoption. We introduce SIMFL (Statistical Inference for Mutation-based Fault Localisation), an MBFL technique that allows users to perform the mutation analysis in advance before a failure is observed, allowing the amortisation of the analysis cost. SIMFL uses mutants as artificial faults and aims to learn the failure patterns among test cases against different locations of mutations. Once a failure is observed, SIMFL requires either almost no or very small additional cost for analysis, depending on the used inference model. An empirical evaluation using Defects4J shows that SIMFL can successfully localise up to 113 out of 203 studied faults (55%) at the top, and 159 (78%) faults within the top five, significantly outperforming existing MBFL techniques while using the results of mutation analysis that has been undertaken before the test failure. The amortised cost of mutation analysis can be further reduced by mutation sampling: SIMFL retains 80% of its localisation accuracy at the top rank when using only 10% of generated mutants, compared to results obtained without sampling.

Submitted 23 August, 2021; v1 submitted 26 February, 2019; originally announced February 2019.

Comments: To be published at 32nd IEEE International Symposium on Software Reliability Engineering (ISSRE 2021)

arXiv:1811.05422 [pdf, other]

doi 10.1109/TSE.2019.2935974

Bayesian Data Analysis in Empirical Software Engineering Research

Authors: Carlo A. Furia, Robert Feldt, Richard Torkar

Abstract: Statistics comes in two main flavors: frequentist and Bayesian. For historical and technical reasons, frequentist statistics have traditionally dominated empirical data analysis, and certainly remain prevalent in empirical software engineering. This situation is unfortunate because frequentist statistics suffer from a number of shortcomings---such as lack of flexibility and results that are unintuitive and hard to interpret---that curtail their effectiveness when dealing with the heterogeneous data that is increasingly available for empirical analysis of software engineering practice. In this paper, we pinpoint these shortcomings, and present Bayesian data analysis techniques that provide tangible benefits---as they can provide clearer results that are simultaneously robust and nuanced. After a short, high-level introduction to the basic tools of Bayesian statistics, we present the reanalysis of two empirical studies on the effectiveness of automatically generated tests and the performance of programming languages. By contrasting the original frequentist analyses with our new Bayesian analyses, we demonstrate the concrete advantages of the latter. To conclude we advocate a more prominent role for Bayesian statistical techniques in empirical software engineering research and practice.

Submitted 26 August, 2019; v1 submitted 13 November, 2018; originally announced November 2018.

Comments: To appear in IEEE Transactions on Software Engineering

arXiv:1810.06720 [pdf, other]

Finding a boundary between valid and invalid regions of the input space

Authors: Bogdan Marculescu, Robert Feldt

Abstract: In the context of robustness testing, the boundary between the valid and invalid regions of the input space can be an interesting source of erroneous inputs. Knowing where a specific software under test (SUT) has a boundary is essential for validation in relation to requirements. However, finding where a SUT actually implements the boundary is a non-trivial problem that has not gotten much attention. This paper proposes a method of finding the boundary between the valid and invalid regions of the input space. The proposed method consists of two steps. First, test data generators, directed by a search algorithm to maximise distance to known, valid test cases, generate valid test cases that are closer to the boundary. Second, these valid test cases undergo mutations to try to push them over the boundary and into the invalid part of the input space. This results in a pair of test sets, one consisting of test cases on the valid side of the boundary and a matched set on the outer side, with only a small distance between the two sets. The method is evaluated on a number of examples from the standard library of a modern programming language. We propose a method of determining the boundary between valid and invalid regions of the input space and apply it on a SUT that has a non-contiguous valid region of the input space. From the small distance between the developed pairs of test sets, and the fact that one test set contains valid test cases and the other invalid test cases, we conclude that the pair of test sets described the boundary between the valid and invalid regions of that input space. Differences of behaviour can be observed between different distances and sets of mutation operators, but all show that the method is able to identify the boundary between the valid and invalid regions of the input space. This is an important step towards more automated robustness testing.

Submitted 15 October, 2018; originally announced October 2018.

Comments: 10 pages, conference
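
Illustrative note (not from the paper): the two steps named in the abstract above can be sketched on a toy problem. Step one picks a valid input far from the already known valid inputs; step two mutates it until it crosses into the invalid region, yielding a valid/invalid pair a small distance apart. Everything here is a hypothetical stand-in: `is_valid` plays the SUT's validity check, a plain maximum over a fixed candidate list replaces the search algorithm, and a fixed increment replaces the mutation operators.

```python
from typing import Iterable, List, Optional, Tuple


def is_valid(x: int) -> bool:
    """Hypothetical SUT validity check: accepts integers from 0 to 100."""
    return 0 <= x <= 100


def farthest_valid(known: List[int], candidates: Iterable[int]) -> int:
    """Step 1: the valid candidate that maximises distance to known valid inputs."""
    valid = [c for c in candidates if is_valid(c)]
    return max(valid, key=lambda c: min(abs(c - k) for k in known))


def push_over_boundary(
    x: int, step: int = 1, max_steps: int = 1000
) -> Optional[Tuple[int, int]]:
    """Step 2: mutate a valid input until the mutant becomes invalid.

    Returns (last valid input, first invalid input), or None if no
    boundary is crossed within max_steps mutations.
    """
    for _ in range(max_steps):
        mutant = x + step
        if not is_valid(mutant):
            return x, mutant
        x = mutant
    return None


seed = farthest_valid(known=[10, 20], candidates=range(0, 200, 7))
pair = push_over_boundary(seed)
print(seed, pair)
```

The returned pair sits on either side of the boundary with distance one between its members, which mirrors the matched valid and invalid test sets the abstract describes.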

arXiv:1810.06104 [pdf, other]

Misaligned Values in Software Engineering Organizations

Authors: Per Lenberg, Robert Feldt, Lars Göran Wallgren Tengberg

Abstract: The values of software organizations are crucial for achieving high performance; in particular, agile development approaches emphasize their importance. Researchers have thus far often assumed that a specific set of values, compatible with the development methodologies, must be adopted homogeneously throughout the company. It is not clear, however, to what extent such assumptions are accurate. Preliminary findings have highlighted the misalignment of values between groups as a source of problems when engineers discuss their challenges. Therefore, in this study, we examine how discrepancies in values between groups affect software companies' performance. To meet our objectives, we chose a mixed method research design. First, we collected qualitative data by interviewing fourteen (N = 14) employees working in four different organizations and processed it using thematic analysis. We then surveyed seven organizations (N = 184). Our analysis indicated that value misalignment between groups is related to organizational performance. The aligned companies were more effective, more satisfied, had higher trust, and fewer conflicts. Our efforts provide encouraging findings in a critical software engineering research area. They can help to explain why some companies are more efficient than others and, thus, point the way to interventions to address organizational challenges.

Submitted 15 October, 2018; v1 submitted 14 October, 2018; originally announced October 2018.

Comments: accepted for publication in Journal of Software: Evolution and Process

arXiv:1809.09849 [pdf, other]

A Method to Assess and Argue for Practical Significance in Software Engineering

Authors: Richard Torkar, Carlo A. Furia, Robert Feldt, Francisco Gomes de Oliveira Neto, Lucas Gren, Per Lenberg, Neil A. Ernst

Abstract: A key goal of empirical research in software engineering is to assess practical significance, which answers whether the observed effects of some compared treatments show a relevant difference in practice in realistic scenarios. Even though plenty of standard techniques exist to assess statistical significance, connecting it to practical significance is not straightforward or routinely done; indeed, only a few empirical studies in software engineering assess practical significance in a principled and systematic way. In this paper, we argue that Bayesian data analysis provides suitable tools to assess practical significance rigorously. We demonstrate our claims in a case study comparing different test techniques. The case study's data was previously analyzed (Afzal et al., 2015) using standard techniques focusing on statistical significance. Here, we build a multilevel model of the same data, which we fit and validate using Bayesian techniques. Our method is to apply cumulative prospect theory on top of the statistical model to quantitatively connect our statistical analysis output to a practically meaningful context. This is then the basis both for assessing and arguing for practical significance. Our study demonstrates that Bayesian analysis provides a technically rigorous yet practical framework for empirical software engineering. A substantial side effect is that any uncertainty in the underlying data will be propagated through the statistical model, and its effects on practical significance are made clear. Thus, in combination with cumulative prospect theory, Bayesian analysis supports seamlessly assessing practical significance in an empirical software engineering context, thus potentially clarifying and extending the relevance of research for practitioners.

Submitted 25 December, 2020; v1 submitted 26 September, 2018; originally announced September 2018.

Comments: 13 pages, 9 figures, 3 tables. Minor rev update

arXiv:1808.08444 [pdf, other]

doi 10.1109/ICSE.2019.00108

Guiding Deep Learning System Testing using Surprise Adequacy

Authors: Jinhan Kim, Robert Feldt, Shin Yoo

Abstract: Deep Learning (DL) systems are rapidly being adopted in safety and security critical domains, urgently calling for ways to test their correctness and robustness. Testing of DL systems has traditionally relied on manual collection and labelling of data. Recently, a number of coverage criteria based on neuron activation values have been proposed. These criteria essentially count the number of neurons whose activation during the execution of a DL system satisfied certain properties, such as being above predefined thresholds. However, existing coverage criteria are not sufficiently fine grained to capture subtle behaviours exhibited by DL systems. Moreover, evaluations have focused on showing correlation between adversarial examples and proposed criteria rather than evaluating and guiding their use for actual testing of DL systems. We propose a novel test adequacy criterion for testing of DL systems, called Surprise Adequacy for Deep Learning Systems (SADL), which is based on the behaviour of DL systems with respect to their training data. We measure the surprise of an input as the difference in DL system's behaviour between the input and the training data (i.e., what was learnt during training), and subsequently develop this as an adequacy criterion: a good test input should be sufficiently but not overtly surprising compared to training data. Empirical evaluation using a range of DL systems from simple image classifiers to autonomous driving car platforms shows that systematic sampling of inputs based on their surprise can improve classification accuracy of DL systems against adversarial examples by up to 77.5% via retraining.

Submitted 25 August, 2018; originally announced August 2018.

arXiv:1807.05593 [pdf, other]

Visualizing test diversity to support test optimisation

Authors: Francisco Gomes de Oliveira Neto, Robert Feldt, Linda Erlenhov, José Benardi de Souza Nunes

Abstract: Diversity has been used as an effective criterion to optimise test suites for cost-effective testing. Particularly, diversity-based (alternatively referred to as similarity-based) techniques have the benefit of being generic and applicable across different Systems Under Test (SUT), and have been used to automatically select or prioritise large sets of test cases. However, it is a challenge to feed diversity information back to developers and testers since results are typically many-dimensional. Furthermore, the generality of diversity-based approaches makes it harder to choose when and where to apply them. In this paper we address these challenges by investigating: i) what the trade-offs are in using different sources of diversity (e.g., diversity of test requirements or test scripts) to optimise large test suites, and ii) how visualisation of test diversity data can assist testers in test optimisation and improvement. We perform a case study on three industrial projects and present quantitative results on the fault detection capabilities and redundancy levels of different sets of test cases. Our key result is that test similarity maps, based on pair-wise diversity calculations, helped industrial practitioners identify issues with their test repositories and decide on actions to improve. We conclude that the visualisation of diversity information can assist testers in their maintenance and optimisation activities.

Submitted 17 July, 2018; v1 submitted 15 July, 2018; originally announced July 2018.

arXiv:1805.01151 [pdf, other]

Involving External Stakeholders in Project Courses

Authors: Jan-Philipp Steghöfer, Håkan Burden, Regina Hebig, Gul Calikli, Robert Feldt, Imed Hammouda, Jennifer Horkoff, Eric Knauss, Grischa Liebel

Abstract: Problem: The involvement of external stakeholders in capstone projects and project courses is desirable due to its potential positive effects on the students. Capstone projects particularly profit from the inclusion of an industrial partner to make the project relevant and help students acquire professional skills. In addition, an increasing push towards education that is aligned with industry and incorporates industrial partners can be observed. However, the involvement of external stakeholders in teaching moments can create friction and could, in the worst case, lead to frustration of all involved parties. Contribution: We developed a model that allows analysing the involvement of external stakeholders in university courses both in a retrospective fashion, to gain insights from past course instances, and in a constructive fashion, to plan the involvement of external stakeholders. Key Concepts: The conceptual model and the accompanying guideline guide the teachers in their analysis of stakeholder involvement. The model is comprised of several activities (define, execute, and evaluate the collaboration). The guideline provides questions that the teachers should answer for each of these activities. In the constructive use, the model allows teachers to define an action plan based on an analysis of potential stakeholders and the pedagogical objectives. In the retrospective use, the model allows teachers to identify issues that appeared during the project and their underlying causes. Drawing from ideas of the reflective practitioner, the model contains an emphasis on reflection and interpretation of the observations made by the teacher and other groups involved in the courses. Key Lessons: Applying the model retrospectively to a total of eight courses shows that it is possible to reveal hitherto implicit risks and assumptions and to gain a better insight into the interaction...

Submitted 4 May, 2018; v1 submitted 3 May, 2018; originally announced May 2018.

Comments: Abstract shortened since arxiv.org limits length of abstracts. See paper/pdf for full abstract. Paper is forthcoming, accepted August 2017. Arxiv version 2 corrects misspelled author name

Journal ref: ACM Transactions on Computing Education (TOCE), acc. August 2017

arXiv:1804.09232 [pdf, other]

Transferring Interactive Search-Based Software Testing to Industry

Authors: Bogdan Marculescu, Robert Feldt, Richard Torkar, Simon Poulding

Abstract: Search-Based Software Testing (SBST) is the application of optimization algorithms to problems in software testing. In previous work, we have implemented and evaluated Interactive Search-Based Software Testing (ISBST) tool prototypes, with a goal to successfully transfer the technique to industry. While SBSE solutions are often validated on benchmark problems, there is a need to validate them in an operational setting. The present paper discusses the development and deployment of SBST tools for use in industry and reflects on the transfer of these techniques to industry. In addition to previous work discussing the development and validation of an ISBST prototype, a new version of the prototype ISBST system was evaluated in the laboratory and in industry. This evaluation is based on an industrial System under Test (SUT) and was carried out with industrial practitioners. The Technology Transfer Model is used as a framework to describe the progression of the development and evaluation of the ISBST system. The paper presents a synthesis of previous work developing and evaluating the ISBST prototype, as well as presenting an evaluation, in both academia and industry, of that prototype's latest version. This paper presents an overview of the development and deployment of the ISBST system in an industrial setting, using the framework of the Technology Transfer Model. We conclude that the ISBST system is capable of evolving useful test cases for that setting, though improvements in the means the system uses to communicate that information to the user are still required. In addition, a set of lessons learned from the project is listed and discussed. Our objective is to help other researchers who wish to validate search-based systems in industry and provide more information about the benefits and drawbacks of these systems.

Submitted 24 April, 2018; originally announced April 2018.

Comments: 40 pages, 5 figures

Showing 1–50 of 67 results for author: Feldt, R