
doi 10.1016/j.infsof.2023.107218

Dev2vec: Representing Domain Expertise of Developers in an Embedding Space

Authors: Arghavan Moradi Dakhel, Michel C. Desmarais, Foutse Khomh

Abstract: Accurate assessment of the domain expertise of developers is important for assigning the proper candidate to contribute to a project or to attend a job role. Since the potential candidate can come from a large pool, the automated assessment of this domain expertise is a desirable goal. While previous methods have had some success within a single software project, the assessment of a developer's domain expertise from contributions across multiple projects is more challenging. In this paper, we employ doc2vec to represent the domain expertise of developers as embedding vectors. These vectors are derived from different sources that contain evidence of developers' expertise, such as the description of repositories that they contributed to, their issue resolving history, and API calls in their commits. We name it dev2vec and demonstrate its effectiveness in representing the technical specialization of developers. Our results indicate that encoding the expertise of developers in an embedding vector outperforms state-of-the-art methods and improves the F1-score up to 21%. Moreover, our findings suggest that the "issue resolving history" of developers is the most informative source of information to represent the domain expertise of developers in embedding spaces.

Submitted 11 July, 2022; originally announced July 2022.

Comments: 30 pages, 5 figures

arXiv:2207.00091 [pdf, other]

Threat Assessment in Machine Learning based Systems

Authors: Lionel Nganyewou Tidjon, Foutse Khomh

Abstract: Machine learning is a field of artificial intelligence (AI) that is becoming essential for several critical systems, making it a good target for threat actors. Threat actors exploit different Tactics, Techniques, and Procedures (TTPs) against the confidentiality, integrity, and availability of Machine Learning (ML) systems. During the ML cycle, they exploit adversarial TTPs to poison data and fool ML-based systems. In recent years, multiple security practices have been proposed for traditional systems but they are not enough to cope with the nature of ML-based systems. In this paper, we conduct an empirical study of threats reported against ML-based systems with the aim to understand and characterize the nature of ML threats and identify common mitigation strategies. The study is based on 89 real-world ML attack scenarios from MITRE's ATLAS database, the AI Incident Database, and the literature; 854 ML repositories from the GitHub search and the Python Packaging Advisory database, selected based on their reputation. Attacks from the AI Incident Database and the literature are used to identify vulnerabilities and new types of threats that were not documented in ATLAS. Results show that convolutional neural networks were one of the most targeted models among the attack scenarios. ML repositories with the largest vulnerability prominence include TensorFlow, OpenCV, and Notebook. In this paper, we also report the most frequent vulnerabilities in the studied ML repositories, the most targeted ML phases and models, and the most used TTPs in ML phases and attack scenarios. This information is particularly important for red/blue teams to better conduct attacks/defenses, for practitioners to prevent threats during ML development, and for researchers to develop efficient defense mechanisms.

Submitted 30 June, 2022; originally announced July 2022.

arXiv:2206.15331 [pdf, other]

GitHub Copilot AI pair programmer: Asset or Liability?

Authors: Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, Zhen Ming Jiang

Abstract: Automatic program synthesis is a long-lasting dream in software engineering. Recently, a promising Deep Learning (DL) based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some studies evaluate the correctness of Copilot solutions and report its issues, more empirical evaluations are necessary to understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two different programming tasks: (i) generating (and reproducing) correct and efficient solutions for fundamental algorithmic problems, and (ii) comparing Copilot's proposed solutions with those of human programmers on a set of programming tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems in computer science, like sorting and implementing data structures. In the latter, a dataset of programming problems with human-provided solutions is used. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems; however, some solutions are buggy and non-reproducible. Moreover, Copilot has some difficulties in combining multiple methods to generate a solution. Comparing Copilot to humans, our results show that the correct ratio of humans' solutions is greater than that of Copilot's suggestions, while the buggy solutions generated by Copilot require less effort to be repaired.

Submitted 14 April, 2023; v1 submitted 30 June, 2022; originally announced June 2022.

Comments: 27 pages, 8 figures

arXiv:2206.14322 [pdf, other]

An Empirical Study of Challenges in Converting Deep Learning Models

Authors: Moses Openja, Amin Nikanjam, Ahmed Haj Yahmed, Foutse Khomh, Zhen Ming Jiang

Abstract: There is an increase in deploying Deep Learning (DL)-based software systems in real-world applications. Usually DL models are developed and trained using DL frameworks that have their own internal mechanisms/formats to represent and train DL models, and usually those formats cannot be recognized by other frameworks. Moreover, trained models are usually deployed in environments different from where they were developed. To solve the interoperability issue and make DL models compatible with different frameworks/environments, some exchange formats are introduced for DL models, like ONNX and CoreML. However, ONNX and CoreML were never empirically evaluated by the community to reveal their prediction accuracy, performance, and robustness after conversion. Poor accuracy or non-robust behavior of converted models may lead to poor quality of deployed DL-based software systems. We conduct, in this paper, the first empirical study to assess ONNX and CoreML for converting trained DL models. In our systematic approach, two popular DL frameworks, Keras and PyTorch, are used to train five widely used DL models on three popular datasets. The trained models are then converted to ONNX and CoreML and transferred to two runtime environments designated for such formats, to be evaluated. We investigate the prediction accuracy before and after conversion. Our results unveil that the prediction accuracy of converted models is at the same level as that of the originals. The performance (time cost and memory consumption) of converted models is studied as well. The size of the models is reduced after conversion, which can result in optimized DL-based software deployment. Converted models are generally assessed as robust at the same level as the originals. However, obtained results show that CoreML models are more vulnerable to adversarial attacks compared to ONNX.

Submitted 28 June, 2022; originally announced June 2022.

Comments: Accepted for publication in ICSME 2022

arXiv:2206.12311 [pdf, other]

Bugs in Machine Learning-based Systems: A Faultload Benchmark

Authors: Mohammad Mehdi Morovati, Amin Nikanjam, Foutse Khomh, Zhen Ming Jiang

Abstract: The rapid escalation of applying Machine Learning (ML) in various domains has led to paying more attention to the quality of ML components. There is then a growth of techniques and tools aiming at improving the quality of ML components and integrating them into the ML-based system safely. Although most of these tools use bugs' lifecycle, there is no standard benchmark of bugs to assess their performance, compare them and discuss their advantages and weaknesses. In this study, we first investigate the reproducibility and verifiability of the bugs in ML-based systems and show the most important factors in each one. Then, we explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark, namely defect4ML, that satisfies all criteria of a standard benchmark, i.e., relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 100 bugs reported by ML developers in GitHub and Stack Overflow, using two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, like: 1) fast changes in frameworks, by providing various bugs for different versions of frameworks, 2) code portability, by delivering similar bugs in different ML frameworks, 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data, and 4) lack of detailed information on bugs, by presenting links to the bugs' origins. defect4ML can be of interest to ML-based systems practitioners and researchers to assess their testing tools and techniques.

Submitted 16 January, 2023; v1 submitted 24 June, 2022; originally announced June 2022.

arXiv:2206.11981 [pdf, other]

Never trust, always verify : a roadmap for Trustworthy AI?

Authors: Lionel Nganyewou Tidjon, Foutse Khomh

Abstract: Artificial Intelligence (AI) is becoming the cornerstone of many systems used in our daily lives such as autonomous vehicles, healthcare systems, and unmanned aircraft systems. Machine Learning is a field of AI that enables systems to learn from data and make decisions on new data based on models to achieve a given goal. The stochastic nature of AI models makes verification and validation tasks challenging. Moreover, there are intrinsic biases in AI models such as reproducibility bias, selection bias (e.g., races, genders, color), and reporting bias (i.e., results that do not reflect the reality). Increasingly, there is also a particular attention to the ethical, legal, and societal impacts of AI. AI systems are difficult to audit and certify because of their black-box nature. They also appear to be vulnerable to threats; AI systems can misbehave when untrusted data are given, making them insecure and unsafe. Governments, national and international organizations have proposed several principles to overcome these challenges but their applications in practice are limited and there are different interpretations in the principles that can bias implementations. In this paper, we examine trust in the context of AI-based systems to understand what it means for an AI system to be trustworthy and identify actions that need to be undertaken to ensure that AI systems are trustworthy. To achieve this goal, we first review existing approaches proposed for ensuring the trustworthiness of AI systems, in order to identify potential conceptual gaps in understanding what trustworthy AI is. Then, we suggest a trust (resp. zero-trust) model for AI and suggest a set of properties that should be satisfied to ensure the trustworthiness of AI systems.

Submitted 23 June, 2022; originally announced June 2022.

arXiv:2206.03225 [pdf, other]

The Different Faces of AI Ethics Across the World: A Principle-Implementation Gap Analysis

Authors: Lionel Nganyewou Tidjon, Foutse Khomh

Abstract: Artificial Intelligence (AI) is transforming our daily life with several applications in healthcare, space exploration, banking and finance. This rapid progress in AI has brought increasing attention to the potential impacts of AI technologies on society, with ethically questionable consequences. In recent years, several ethical principles have been released by governments, national and international organisations. These principles outline high-level precepts to guide the ethical development, deployment, and governance of AI. However, the abstract nature, diversity, and context-dependency of these principles make them difficult to implement and operationalize, resulting in gaps between principles and their execution. Most recent work analysed and summarized existing AI principles and guidelines but they did not provide findings on principle-implementation gaps and how to mitigate them. These findings are particularly important to ensure that AI implementations are aligned with ethical principles and values. In this paper, we provide a contextual and global evaluation of current ethical AI principles for all continents, with the aim to identify potential principle characteristics tailored to specific countries or applicable across countries. Next, we analyze the current level of AI readiness and current implementations of ethical AI principles in different countries, to identify gaps in the implementation of AI principles and their causes. Finally, we propose recommendations to mitigate the principle-implementation gaps.

Submitted 12 May, 2022; originally announced June 2022.

arXiv:2206.00699 [pdf, other]

doi 10.1145/3530019.3530039

Studying the Practices of Deploying Machine Learning Projects on Docker

Authors: Moses Openja, Forough Majidi, Foutse Khomh, Bhagya Chembakottu, Heng Li

Abstract: Docker is a containerization service that allows for convenient deployment of websites, databases, applications' APIs, and machine learning (ML) models with a few lines of code. Studies have recently explored the use of Docker for deploying general software projects with no specific focus on how Docker is used to deploy ML-based projects. In this study, we conducted an exploratory study to understand how Docker is being used to deploy ML-based projects. As the initial step, we examined the categories of ML-based projects that use Docker. We then examined why and how these projects use Docker, and the characteristics of the resulting Docker images. Our results indicate that six categories of ML-based projects use Docker for deployment, including ML Applications, MLOps/AIOps, Toolkits, DL Frameworks, Models, and Documentation. We derived the taxonomy of 21 major categories representing the purposes of using Docker, including those specific to models such as model management tasks (e.g., testing, training). We then showed that ML engineers use Docker images mostly to help with the platform portability, such as transferring the software across the operating systems, runtimes such as GPU, and language constraints. However, we also found that more resources may be required to run the Docker images for building ML-based software projects due to the large number of files contained in the image layers with deeply nested directories. We hope to shed light on the emerging practices of deploying ML software projects using containers and highlight aspects that should be improved.

Submitted 1 June, 2022; originally announced June 2022.

Journal ref: The International Conference on Evaluation and Assessment in Software Engineering 2022 (EASE 2022), June 13-15, 2022, Gothenburg, Sweden

arXiv:2206.00666 [pdf, other]

Technical Debts and Faults in Open-source Quantum Software Systems: An Empirical Study

Authors: Moses Openja, Mohammad Mehdi Morovati, Le An, Foutse Khomh, Mouna Abidi

Abstract: Quantum computing is a rapidly growing field attracting the interest of both researchers and software developers. Supported by its numerous open-source tools, developers can now build, test, or run their quantum algorithms. Although the maintenance practices for traditional software systems have been extensively studied, the maintenance of quantum software is still a new field of study but a critical part to ensure the quality of a whole quantum computing system. In this work, we set out to investigate the distribution and evolution of technical debts in quantum software and their relationship with fault occurrences. Understanding these problems could guide future quantum development and provide maintenance recommendations for the key areas where quantum software developers and researchers should pay more attention. In this paper, we empirically studied 118 open-source quantum projects, which were selected from GitHub. The projects are categorized into 10 categories. We found that the studied quantum software suffers from the issues of code convention violation, error-handling, and code design. We also observed a statistically significant correlation between code design, redundant code or code convention, and the occurrences of faults in quantum software.

Submitted 1 June, 2022; originally announced June 2022.

arXiv:2205.15419 [pdf, other]

Fool SHAP with Stealthily Biased Sampling

Authors: Gabriel Laberge, Ulrich Aïvodji, Satoshi Hara, Mario Marchand, Foutse Khomh

Abstract: SHAP explanations aim at identifying which features contribute the most to the difference in model prediction at a specific input versus a background distribution. Recent studies have shown that they can be manipulated by malicious adversaries to produce arbitrary desired explanations. However, existing attacks focus solely on altering the black-box model itself. In this paper, we propose a complementary family of attacks that leave the model intact and manipulate SHAP explanations using stealthily biased sampling of the data points used to approximate expectations w.r.t. the background distribution. In the context of fairness audit, we show that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups while remaining undetected. More precisely, experiments performed on real-world datasets showed that our attack could yield up to a 90% relative decrease in amplitude of the sensitive feature attribution. These results highlight the manipulability of SHAP explanations and encourage auditors to treat them with skepticism.

Submitted 3 March, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

arXiv:2205.03181 [pdf, other]

Understanding Quantum Software Engineering Challenges: An Empirical Study on Stack Exchange Forums and GitHub Issues

Authors: Mohamed Raed El aoun, Heng Li, Foutse Khomh, Moses Openja

Abstract: With the advance in quantum computing, quantum software becomes critical for exploring the full potential of quantum computing systems. Recently, quantum software engineering (QSE) has become an emerging area attracting more and more attention. However, it is not clear what challenges and opportunities quantum computing presents to the software engineering community. This work aims to understand the QSE-related challenges perceived by developers. We perform an empirical study on Stack Exchange forums where developers post QSE-related questions and answers and GitHub issue reports where developers raise QSE-related issues in practical quantum computing projects. Based on an existing taxonomy of question types on Stack Overflow, we first perform a qualitative analysis of the types of QSE-related questions asked on Stack Exchange forums. We then use automated topic modeling to uncover the topics in QSE-related Stack Exchange posts and GitHub issue reports. Our study highlights some particularly challenging areas of QSE that are different from those of traditional software engineering, such as explaining the theory behind quantum computing code, interpreting quantum program outputs, and bridging the knowledge gap between quantum computing and classical computing, as well as their associated opportunities.

Submitted 6 May, 2022; originally announced May 2022.

arXiv:2204.11965 [pdf, other]

Bug Characteristics in Quantum Software Ecosystem

Authors: Mohamed Raed El aoun, Heng Li, Foutse Khomh, Lionel Tidjon

Abstract: With the advance in quantum computing in recent years, quantum software becomes vital for exploring the full potential of quantum computing systems. Quantum programming is different from classical programming; for example, the state of a quantum program is probabilistic in nature, and a quantum computer is error-prone due to the instability of quantum mechanisms. Therefore, the characteristics of bugs in quantum software projects may be very different from those of classical software projects. This work aims to understand the characteristics of bugs in quantum software projects, in order to provide insights to help devise effective testing and debugging mechanisms. To achieve this goal, we conduct an empirical study on the bug reports of 125 quantum software projects. We observe that quantum software projects are more buggy than classical software projects and that quantum project bugs are more costly to fix than classical project bugs. We also identify the types of the bugs and the quantum programming components where they occurred. Our study shows that the bugs are spread across different components, but quantum-specific bugs particularly appear in the compiler, gate operation, and state preparation components. The three most occurring types of bugs are Program anomaly bugs, Configuration bugs, and Data type and structure bugs. Our study highlights some particularly challenging areas in quantum software development, such as the lack of scientific quantum computation libraries that implement comprehensive mathematical functions for quantum computing. Quantum developers also seek specialized data manipulation libraries for quantum software engineering, like NumPy for quantum computing. Our findings also provide insights for future work to advance the quantum program development, testing, and debugging of quantum software, such as providing tooling support for debugging low-level circuits.

Submitted 25 April, 2022; originally announced April 2022.

arXiv:2204.00694 [pdf, other]

Testing Feedforward Neural Networks Training Programs

Authors: Houssem Ben Braiek, Foutse Khomh

Abstract: Nowadays, we are witnessing an increasing effort to improve the performance and trustworthiness of Deep Neural Networks (DNNs), with the aim to enable their adoption in safety critical systems such as self-driving cars. Multiple testing techniques are proposed to generate test cases that can expose inconsistencies in the behavior of DNN models. These techniques assume implicitly that the training program is bug-free and appropriately configured. However, satisfying this assumption for a novel problem requires significant engineering work to prepare the data, design the DNN, implement the training program, and tune the hyperparameters in order to produce the model for which current automated test data generators search for corner-case behaviors. All these model training steps can be error-prone. Therefore, it is crucial to detect and correct errors throughout all the engineering steps of DNN-based software systems and not only on the resulting DNN model. In this paper, we gather a catalog of training issues and, based on their symptoms and their effects on the behavior of the training program, we propose practical verification routines to detect the aforementioned issues automatically, by continuously validating that some important properties of the learning dynamics hold during the training. Then, we design TheDeepChecker, an end-to-end property-based debugging approach for DNN training programs. We assess the effectiveness of TheDeepChecker on synthetic and real-world buggy DL programs and compare it with Amazon SageMaker Debugger (SMD). Results show that TheDeepChecker's on-execution validation of DNN-based program's properties succeeds in revealing several coding bugs and system misconfigurations, early on and at a low cost. Moreover, TheDeepChecker outperforms the SMD's offline rules verification on training logs in terms of detection accuracy and DL bugs coverage.

Submitted 1 April, 2022; originally announced April 2022.

arXiv:2203.12138 [pdf, other]

A Search-Based Framework for Automatic Generation of Testing Environments for Cyber-Physical Systems

Authors: Dmytro Humeniuk, Foutse Khomh, Giuliano Antoniol

Abstract: Many modern cyber physical systems incorporate computer vision technologies, complex sensors and advanced control software, allowing them to interact with the environment autonomously. Testing such systems poses numerous challenges: not only should the system inputs be varied, but also the surrounding environment should be accounted for. A number of tools have been developed to test the system model for the possible inputs falsifying its requirements. However, they are not directly applicable to autonomous cyber physical systems, as the inputs to their models are generated while operating in a virtual environment. In this paper, we aim to design a search based framework, named AmbieGen, for generating diverse fault revealing test scenarios for autonomous cyber physical systems. The scenarios represent an environment in which an autonomous agent operates. The framework should be applicable to generating different types of environments. To generate the test scenarios, we leverage the NSGA II algorithm with two objectives. The first objective evaluates the deviation of the observed system behaviour from its expected behaviour. The second objective is the test case diversity, calculated as a Jaccard distance with a reference test case. We evaluate AmbieGen on three scenario generation case studies, namely a smart-thermostat, a robot obstacle avoidance system, and a vehicle lane keeping assist system. We compared three configurations of AmbieGen: based on a single objective genetic algorithm, multi objective, and random search. Both single and multi objective configurations outperform the random search. The multi objective configuration can find individuals of the same quality as the single objective one, producing more unique test scenarios in the same time budget.

Submitted 22 March, 2022; originally announced March 2022.
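
Note on the diversity objective: the abstract above says test case diversity is calculated as a Jaccard distance with a reference test case. The following is only a minimal sketch of that distance, not AmbieGen's implementation. It assumes a test scenario can be flattened into a collection of discrete elements; the function name `jaccard_distance` and the road-segment labels are hypothetical and chosen for illustration.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty scenarios are treated as identical (distance 0).
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical scenarios encoded as lists of road-segment labels (assumption).
reference = ["straight", "left_turn", "straight", "right_turn"]
candidate = ["straight", "left_turn", "left_turn", "u_turn"]

# Shared elements: {straight, left_turn}; union has 4 elements, so distance = 0.5.
diversity = jaccard_distance(candidate, reference)
print(diversity)
```

A value of 0 means the candidate uses exactly the same elements as the reference and 1 means they share none, so a search that maximizes this value favours scenarios unlike the reference.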

arXiv:2202.03270 [pdf, other]

Do Developers Refactor Data Access Code? An Empirical Study

Authors: Biruk Asmare Muse, Foutse Khomh, Giuliano Antoniol

Abstract: Developers often refactor code to improve the maintainability and comprehension of the software. There are many studies on refactoring activities in traditional software systems. However, refactoring in data-intensive systems is not well explored. Understanding the refactoring practices of developers is important to develop efficient tool support. We conducted a longitudinal study of refactoring activities in data access classes using 12 data-intensive subject systems. We investigated the prevalence and evolution of refactorings and the association of refactorings with data access smells. We also conducted a manual analysis of over 378 samples of data access refactoring instances to identify the functionalities of the code that are targeted by such refactorings. Our results show that (1) data access refactorings are prevalent and different in type. Rename variable is the most prevalent data access refactoring. (2) The prevalence and type of refactorings vary as systems evolve in time. (3) Most data access refactorings target code that implements data fetching and insertion. (4) Data access refactorings do not generally touch SQL queries. Overall, the results show that data access refactorings focus on improving the code quality but not the underlying data access operations. Hence, more work is needed from the research community on providing awareness and support to practitioners on the benefits of addressing data access smells with refactorings.

Submitted 7 February, 2022; originally announced February 2022.

Comments: 29th IEEE International Conference on Software Analysis, Evolution and Reengineering

arXiv:2201.02215 [pdf, other]

On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems

Authors: Biruk Asmare Muse, Mohammad Masudur Rahman, Csaba Nagy, Anthony Cleve, Foutse Khomh, Giuliano Antoniol

Abstract: Code smells indicate software design problems that harm software quality. Data-intensive systems that frequently access databases often suffer from SQL code smells besides the traditional smells. While there have been extensive studies on traditional code smells, recently, there has been a growing interest in SQL code smells. In this paper, we conduct an empirical study to investigate the prevalence and evolution of SQL code smells in open-source, data-intensive systems. We collected 150 projects and examined both traditional and SQL code smells in these projects. Our investigation delivers several important findings. First, SQL code smells are indeed prevalent in data-intensive software systems. Second, SQL code smells have a weak co-occurrence with traditional code smells. Third, SQL code smells have a weaker association with bugs than that of traditional code smells. Fourth, SQL code smells are more likely to be introduced at the beginning of the project lifetime and likely to be left in the code without a fix, compared to traditional code smells. Overall, our results show that SQL code smells are indeed prevalent and persistent in the studied data-intensive software systems. Developers should be aware of these smells and consider detecting and refactoring SQL code smells and traditional code smells separately, using dedicated tools.

Submitted 6 January, 2022; originally announced January 2022.

Journal ref: In Proceedings of the 17th International Conference on Mining Software Repositories (pp. 327-338) 2020

arXiv:2201.02180 [pdf, other]

FIXME: Synchronize with Database An Empirical Study of Data Access Self-Admitted Technical Debt

Authors: Biruk Asmare Muse, Csaba Nagy, Anthony Cleve, Foutse Khomh, Giuliano Antoniol

Abstract: Developers sometimes choose design and implementation shortcuts due to the pressure from tight release schedules. However, shortcuts introduce technical debt that increases as the software evolves. The debt needs to be repaid as fast as possible to minimize its impact on software development and software quality. Sometimes, technical debt is admitted by developers in comments and commit messages. Such debt is known as self-admitted technical debt (SATD). In data-intensive systems, where data manipulation is a critical functionality, the presence of SATD in the data access logic could seriously harm performance and maintainability. Understanding the composition and distribution of the SATDs across software systems and their evolution could provide insights into managing technical debt efficiently. We present a large-scale empirical study on the prevalence, composition, and evolution of SATD in data-intensive systems. We analyzed 83 open-source systems relying on relational databases as well as 19 systems relying on NoSQL databases. We detected SATD in source code comments obtained from different snapshots of the subject systems. To understand the evolution dynamics of SATDs, we conducted a survival analysis. Next, we performed a manual analysis of 361 sample data-access SATDs, investigating the composition of data-access SATDs and the reasons behind their introduction and removal. We identified 15 new SATD categories, out of which 11 are specific to database access operations. We found that most of the data-access SATDs are introduced in the later stages of change history rather than at the beginning. We also observed that bug fixing and refactoring are the main reasons behind the introduction of data-access SATDs.

Submitted 6 January, 2022; originally announced January 2022.

arXiv:2112.15277 [pdf, other]

Machine Learning Application Development: Practitioners' Insights

Authors: Md Saidur Rahman, Foutse Khomh, Alaleh Hamidi, Jinghui Cheng, Giuliano Antoniol, Hironori Washizaki

Abstract: Nowadays, intelligent systems and services are getting increasingly popular as they provide data-driven solutions to diverse real-world problems, thanks to recent breakthroughs in Artificial Intelligence (AI) and Machine Learning (ML). However, machine learning meets software engineering not only with promising potential but also with some inherent challenges. Despite some recent research efforts, we still do not have a clear understanding of the challenges of developing ML-based applications and the current industry practices. Moreover, it is unclear where software engineering researchers should focus their efforts to better support ML application developers. In this paper, we report on a survey that aimed to understand the challenges and best practices of ML application development. We synthesize the results obtained from 80 practitioners (with diverse skills, experience, and application domains) into 17 findings outlining challenges and best practices for ML application development. Practitioners involved in the development of ML-based software systems can leverage the summarized best practices to improve the quality of their system. We hope that the reported challenges will inform the research community about topics that need to be investigated to improve the engineering process and the quality of ML-based applications.

Submitted 30 December, 2021; originally announced December 2021.

arXiv:2112.13314 [pdf, other]

Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow

Authors: Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, Giuliano Antoniol

Abstract: Deep Learning (DL) frameworks are now widely used, simplifying the creation of complex models as well as their integration into various applications, even for non-DL experts. However, like any other programs, they are prone to bugs. This paper deals with the subcategory of bugs named silent bugs: they lead to wrong behavior but they do not cause system crashes or hangs, nor show an error message to the user. Such bugs are even more dangerous in DL applications and frameworks due to the "black-box" and stochastic nature of the systems (the end user cannot understand how the model makes decisions). This paper presents the first empirical study of Keras and TensorFlow silent bugs, and their impact on users' programs. We extracted closed issues related to Keras from the TensorFlow GitHub repository. Out of the 1,168 issues that we gathered, 77 were reproducible silent bugs affecting users' programs. We categorized the bugs based on the effects on the users' programs and the components where the issues occurred, using information from the issue reports. We then derived a threat level for each of the issues, based on the impact they had on the users' programs. To assess the relevance of identified categories and the impact scale, we conducted an online survey with 103 DL developers. The participants generally agreed with the significant impact of silent bugs in DL libraries and acknowledged our findings (i.e., categories of silent bugs and the proposed impact scale). Finally, leveraging our analysis, we provide a set of guidelines to facilitate safeguarding against such bugs in DL frameworks.

Submitted 1 September, 2023; v1 submitted 25 December, 2021; originally announced December 2021.

arXiv:2111.07101 [pdf]

Reputation Gaming in Stack Overflow

Authors: Iren Mazloomzadeh, Gias Uddin, Foutse Khomh, Ashkan Sami

Abstract: The Stack Overflow incentive system awards users with reputation scores to ensure quality. The decentralized nature of the forum may make the incentive system prone to manipulation. This paper offers, for the first time, a comprehensive study of the reported types of reputation manipulation scenarios that might be exercised in Stack Overflow and the prevalence of such reputation gamers, through a qualitative study of 1,697 posts from meta Stack Exchange sites. We found six different types of reputation fraud scenarios, such as voting rings where communities form to upvote each other repeatedly on similar posts. We sought to develop algorithms to allow platform managers to automatically identify these suspicious reputation gaming scenarios, for review. The first algorithm identifies isolated/semi-isolated communities where probable reputation frauds may occur mostly by collaborating with each other. The second algorithm looks for sudden, unusually large jumps in the reputation scores of users. We evaluated the performance of our algorithms by examining the reputation history dashboard of Stack Overflow users from the Stack Overflow website. We observed that around 60-80% of users that are considered to be suspicious by our algorithms got their reputation scores removed by Stack Overflow.

Submitted 13 November, 2021; originally announced November 2021.

arXiv:2111.04865 [pdf, other]

On Assessing The Safety of Reinforcement Learning algorithms Using Formal Methods

Authors: Paulina Stevia Nouwou Mindom, Amin Nikanjam, Foutse Khomh, John Mullins

Abstract: The increasing adoption of Reinforcement Learning in safety-critical systems domains such as autonomous vehicles, health, and aviation raises the need for ensuring their safety. Existing safety mechanisms such as adversarial training, adversarial detection, and robust learning are not always adapted to all disturbances in which the agent is deployed. Those disturbances include moving adversaries whose behavior can be unpredictable by the agent, and as a matter of fact harmful to its learning. Ensuring the safety of critical systems also requires methods that give formal guarantees on the behaviour of the agent evolving in a perturbed environment. It is therefore necessary to propose new solutions adapted to the learning challenges faced by the agent. In this paper, we first generate adversarial agents that exhibit flaws in the agent's policy by presenting moving adversaries. Second, we use reward shaping and a modified Q-learning algorithm as defense mechanisms to improve the agent's policy when facing adversarial perturbations. Finally, probabilistic model checking is employed to evaluate the effectiveness of both mechanisms. We have conducted experiments on a discrete grid world with a single agent facing non-learning and learning adversaries. Our results show a diminution in the number of collisions between the agent and the adversaries. Probabilistic model checking provides lower and upper probabilistic bounds regarding the agent's safety in the adversarial environment.

Submitted 9 November, 2021; v1 submitted 8 November, 2021; originally announced November 2021.

arXiv:2111.03196 [pdf, other]

An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets

Authors: Gias Uddin, Yann-Gael Gueheneuc, Foutse Khomh, Chanchal K Roy

Abstract: Sentiment analysis in software engineering (SE) has shown promise to analyze and support diverse development activities. We report the results of an empirical study that we conducted to determine the feasibility of developing an ensemble engine by combining the polarity labels of stand-alone SE-specific sentiment detectors. Our study has two phases. In the first phase, we pick five SE-specific sentiment detection tools from two recently published papers by Lin et al. [31, 32], who first reported negative results with standalone sentiment detectors and then proposed an improved SE-specific sentiment detector, POME [31]. We report the study results on 17,581 units (sentences/documents) coming from six currently available sentiment benchmarks for SE. We find that the existing tools can be complementary to each other in 85-95% of the cases, i.e., one is wrong, but another is right. However, a majority voting-based ensemble of those tools fails to improve the accuracy of sentiment detection. We develop Sentisead, a supervised tool by combining the polarity labels and bag of words as features. Sentisead improves the performance (F1-score) of the individual tools by 4% (over Senti4SD [5]) - 100% (over POME [31]). In a second phase, we compare and improve Sentisead infrastructure using Pre-trained Transformer Models (PTMs). We find that a Sentisead infrastructure with RoBERTa as the ensemble of the five stand-alone rule-based and shallow learning SE-specific tools from Lin et al. [31, 32] offers the best F1-score of 0.805 across the six datasets, while a stand-alone RoBERTa shows an F1-score of 0.801.
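The abstract above contrasts a majority voting-based ensemble with the supervised Sentisead. The following is only a minimal sketch of plain majority voting over polarity labels, not the authors' code: the function name `majority_vote`, the tie-breaking rule (fall back to "neutral"), and the example labels are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(labels, fallback="neutral"):
    """Return the most frequent polarity label, or `fallback` on a tie or empty input."""
    if not labels:
        return fallback
    counts = Counter(labels).most_common()
    # Assumed tie-breaking rule: if the top two labels are equally frequent,
    # do not pick either and return the fallback label instead.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]


# Hypothetical outputs of five stand-alone detectors for one sentence
# whose (assumed) true label is "positive".
tool_labels = ["negative", "negative", "negative", "positive", "negative"]
print(majority_vote(tool_labels))  # the single correct tool is outvoted
```

The example shows why complementary tools do not automatically help a voting ensemble: one detector is right, but its label is discarded because the other four agree on a wrong one.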

Submitted 4 November, 2021; originally announced November 2021.

Journal ref: ACM Transactions on Software Engineering and Methodology (TOSEM), 2021

arXiv:2110.13369 [pdf, other]

Partial Order in Chaos: Consensus on Feature Attributions in the Rashomon Set

Authors: Gabriel Laberge, Yann Pequignot, Alexandre Mathieu, Foutse Khomh, Mario Marchand

Abstract: Post-hoc global/local feature attribution methods are progressively being employed to understand the decisions of complex machine learning models. Yet, because of limited amounts of data, it is possible to obtain a diversity of models with good empirical performance but that provide very different explanations for the same prediction, making it hard to derive insight from them. In this work, instead of aiming at reducing the under-specification of model explanations, we fully embrace it and extract logical statements about feature attributions that are consistent across all models with good empirical performance (i.e. all models in the Rashomon Set). We show that partial orders of local/global feature importance arise from this methodology enabling more nuanced interpretations by allowing pairs of features to be incomparable when there is no consensus on their relative importance. We prove that every relation among features present in these partial orders also holds in the rankings provided by existing approaches. Finally, we present three use cases employing hypothesis spaces with tractable Rashomon Sets (Additive models, Kernel Ridge, and Random Forests) and show that partial orders allow one to extract consistent local and global interpretations of models despite their under-specification.

Submitted 28 December, 2023; v1 submitted 25 October, 2021; originally announced October 2021.

Journal ref: Journal of Machine Learning Research, 2023, vol. 24, no 364, p. 1-50

arXiv:2109.04196 [pdf, other]

doi 10.4204/EPTCS.342.10

Failure Analysis of Hadoop Schedulers using an Integration of Model Checking and Simulation

Authors: Mbarka Soualhia, Foutse Khomh, Sofiene Tahar

Abstract: The Hadoop scheduler is a centerpiece of Hadoop, the leading processing framework for data-intensive applications in the cloud. Given the impact of failures on the performance of applications running on Hadoop, testing and verifying the performance of the Hadoop scheduler is critical. Existing approaches such as performance simulation and analytical modeling are inadequate because they are not able to ascertain a complete verification of a Hadoop scheduler. This is due to the wide range of constraints and aspects involved in Hadoop. In this paper, we propose a novel methodology that integrates and combines simulation and model checking techniques to perform a formal verification of Hadoop schedulers, focusing on the following properties: schedulability, fairness and resources-deadlock freeness. We use the CSP language to formally describe a Hadoop scheduler, and the PAT model checker to verify its properties. Next, we use the proposed formal model to analyze the scheduler of OpenCloud, a Hadoop-based cluster that simulates the Hadoop load, in order to illustrate the usability and benefits of our work. Results show that our proposed methodology can help identify several task failures (up to 78%) early on, i.e., before the tasks are executed on the cluster.

Submitted 6 September, 2021; originally announced September 2021.

Comments: In Proceedings SCSS 2021, arXiv:2109.02501

Journal ref: EPTCS 342, 2021, pp. 114-128

arXiv:2109.03991 [pdf, other]

The challenge of reproducible ML: an empirical study on the impact of bugs

Authors: Emilio Rivera-Landos, Foutse Khomh, Amin Nikanjam

Abstract: Reproducibility is a crucial requirement in scientific research. When results of research studies and scientific papers have been found difficult or impossible to reproduce, we face a challenge which is called the reproducibility crisis. Although the demand for reproducibility in Machine Learning (ML) is acknowledged in the literature, a main barrier is the inherent non-determinism in ML training and inference. In this paper, we establish the fundamental factors that cause non-determinism in ML systems. A framework, ReproduceML, is then introduced for deterministic evaluation of ML experiments in a real, controlled environment. ReproduceML allows researchers to investigate software configuration effects on ML training and inference. Using ReproduceML, we run a case study: investigation of the impact of bugs inside ML libraries on performance of ML experiments. This study attempts to quantify the impact that the occurrence of bugs in a popular ML framework, PyTorch, has on the performance of trained models. To do so, a comprehensive methodology is proposed to collect buggy versions of ML libraries and run deterministic ML experiments using ReproduceML. Our initial finding is that there is no evidence based on our limited dataset to show that bugs which occurred in PyTorch do affect the performance of trained models. The proposed methodology as well as ReproduceML can be employed for further research on non-determinism and bugs.

Submitted 8 September, 2021; originally announced September 2021.

arXiv:2108.05341 [pdf, other]

The Forgotten Role of Search Queries in IR-based Bug Localization: An Empirical Study

Authors: Mohammad Masudur Rahman, Foutse Khomh, Shamima Yeasmin, Chanchal K. Roy

Abstract: Being light-weight and cost-effective, IR-based approaches for bug localization have shown promise in finding software bugs. However, the accuracy of these approaches heavily depends on their used bug reports. A significant number of bug reports contain only plain natural language texts. According to existing studies, IR-based approaches cannot perform well when they use these bug reports as search queries. On the other hand, there is a piece of recent evidence that suggests that even these natural language-only reports contain enough good keywords that could help localize the bugs successfully. On one hand, these findings suggest that natural language-only bug reports might be a sufficient source for good query keywords. On the other hand, they cast serious doubt on the query selection practices in the IR-based bug localization. In this article, we attempted to clear the sky on this aspect by conducting an in-depth empirical study that critically examines the state-of-the-art query selection practices in IR-based bug localization. In particular, we use a dataset of 2,320 bug reports, employ ten existing approaches from the literature, exploit the Genetic Algorithm-based approach to construct optimal, near-optimal search queries from these bug reports, and then answer three research questions. We confirmed that the state-of-the-art query construction approaches are indeed not sufficient for constructing appropriate queries (for bug localization) from certain natural language-only bug reports although they contain such queries. We also demonstrate that optimal queries and non-optimal queries chosen from bug report texts are significantly different in terms of several keyword characteristics, which has led us to actionable insights. Furthermore, we demonstrate 27%-34% improvement in the performance of non-optimal queries through the application of our actionable insights to them.

Submitted 11 August, 2021; originally announced August 2021.

Comments: 57 pages, EMSE (2021)

ACM Class: D.2; D.2.5; D.2.7

arXiv:2108.05316 [pdf, other]

doi 10.1109/ICSME46990.2020.00063

Why are Some Bugs Non-Reproducible? An Empirical Investigation using Data Fusion

Authors: Mohammad Masudur Rahman, Foutse Khomh, Marco Castelluccio

Abstract: Software developers attempt to reproduce software bugs to understand their erroneous behaviours and to fix them. Unfortunately, they often fail to reproduce (or fix) them, which leads to faulty, unreliable software systems. However, to date, only a little research has been done to better understand what makes the software bugs non-reproducible. In this paper, we conduct a multimodal study to better understand the non-reproducibility of software bugs. First, we perform an empirical study using 576 non-reproducible bug reports from two popular software systems (Firefox, Eclipse) and identify 11 key factors that might lead a reported bug to non-reproducibility. Second, we conduct a user study involving 13 professional developers where we investigate how the developers cope with non-reproducible bugs. We found that they either close these bugs or solicit further information, which involves long deliberations and counter-productive manual searches. Third, we offer several actionable insights on how to avoid non-reproducibility (e.g., false-positive bug report detector) and improve reproducibility of the reported bugs (e.g., sandbox for bug reproduction) by combining our analyses from multiple studies (e.g., empirical study, developer study).

Submitted 11 August, 2021; originally announced August 2021.

Comments: 12 pages

ACM Class: D.2; D.2.5; D.2.7

Journal ref: 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME)

arXiv:2108.02702 [pdf, other]

Improved Retrieval of Programming Solutions With Code Examples Using a Multi-featured Score

Authors: Rodrigo F. Silva, M. Masudur Rahman, Carlos Eduardo Dantas, Chanchal Roy, Foutse Khomh, Marcelo A. Maia

Abstract: Developers often depend on code search engines to obtain solutions for their programming tasks. However, finding an expected solution containing code examples along with their explanations is challenging due to several issues. There is a vocabulary mismatch between the search keywords (the query) and the appropriate solutions. The semantic gap may increase for similar bags of words due to antonyms and negation. Moreover, documents retrieved by search engines might not contain solutions containing both code examples and their explanations. So, we propose CRAR (Crowd Answer Recommender) to circumvent those issues, aiming at improving retrieval of relevant answers from Stack Overflow containing not only the expected code examples for the given task but also their explanations. Given a programming task, we investigate the effectiveness of combining information retrieval techniques along with a set of features to enhance the ranking of important threads (i.e., the units containing questions along with their answers) for the given task and then select relevant answers contained in those threads, including semantic features, like word embeddings and sentence embeddings, for instance, a Convolutional Neural Network (CNN). CRAR also leverages social aspects of Stack Overflow discussions like popularity to select relevant answers for the tasks. Our experimental evaluation shows that the combination of the different features performs better than each one individually. We also compare the retrieval performance with the state-of-the-art CROKAGE (Crowd Knowledge Answer Generator), which is also a system aimed at retrieving relevant answers from Stack Overflow. We show that CRAR outperforms CROKAGE in Mean Reciprocal Rank and Mean Recall with small and medium effect sizes, respectively.
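The abstract above reports retrieval quality in Mean Reciprocal Rank (MRR). The following is only a minimal sketch of the standard MRR definition, not the evaluation code used for CRAR or CROKAGE: the function name `mean_reciprocal_rank` and the answer identifiers are assumptions made for illustration.

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none is found)."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for results, relevant in zip(ranked_results, relevant_sets):
        for rank, item in enumerate(results, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)


# Hypothetical ranked answer ids for two queries, with their relevant answers.
ranked = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
relevant = [{"a2"}, {"b1"}]

# First query: first relevant hit at rank 2 (0.5); second query: rank 1 (1.0).
print(mean_reciprocal_rank(ranked, relevant))  # 0.75
```

MRR rewards placing one relevant answer near the top of the list, while recall measures how many of the relevant answers are retrieved at all, which is why the two metrics are reported side by side.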

Submitted 5 August, 2021; originally announced August 2021.

Comments: 31 pages, 5 figures, 9 tables

arXiv:2107.13614 [pdf, other]

Clones in Deep Learning Code: What, Where, and Why?

Authors: Hadhemi Jebnoun, Md Saidur Rahman, Foutse Khomh, Biruk Asmare Muse

Abstract: Deep Learning applications are becoming increasingly popular. Developers of deep learning systems strive to write more efficient code. Deep learning systems are constantly evolving, imposing tighter development timelines and increasing complexity, which may lead to bad design decisions. A copy-paste approach is widely used among deep learning developers because they rely on common frameworks and duplicate similar tasks. Developers often fail to properly propagate changes to all clone fragments during a maintenance activity. To our knowledge, no study has examined code cloning practices in deep learning development. Given the negative impacts of clones on software quality reported in the studies on traditional systems, it is very important to understand the characteristics and potential impacts of code clones on deep learning systems. To this end, we use the NiCad tool to detect clones from 59 Python, 14 C# and 6 Java-based deep learning systems and an equal number of traditional software systems. We then analyze the frequency and distribution of code clones in deep learning and traditional systems. We do further analysis of the distribution of code clones using location-based taxonomy. We also study the correlation between bugs and code clones to assess the impacts of clones on the quality of the studied systems. Finally, we introduce a code clone taxonomy related to deep learning programs and identify the deep learning system development phases in which cloning has the highest risk of faults. Our results show that code cloning is a frequent practice in deep learning systems and that deep learning developers often clone code from files in distant repositories in the system. In addition, we found that code cloning occurs more frequently during DL model construction, and that hyperparameter setting is the phase during which cloning is the riskiest, since it often leads to faults.

Submitted 28 July, 2021; originally announced July 2021.

arXiv:2107.13491 [pdf, other]

Models of Computational Profiles to Study the Likelihood of DNN Metamorphic Test Cases

Authors: Ettore Merlo, Mira Marhaba, Foutse Khomh, Houssem Ben Braiek, Giuliano Antoniol

Abstract: Neural network test cases are meant to exercise different reasoning paths in an architecture and are used to validate the prediction outcomes. In this paper, we introduce "computational profiles" as vectors of neuron activation levels. We investigate the distribution of computational profile likelihood of metamorphic test cases with respect to the likelihood distributions of training, test and error control cases. We estimate the non-parametric probability densities of neuron activation levels for each distinct output class. Probabilities are inferred using training cases only, without any additional knowledge about metamorphic test cases. Experiments are performed by training a network on the MNIST Fashion library of images and comparing prediction likelihoods with those obtained from error control-data and from metamorphic test cases. Experimental results show that the distributions of computational profile likelihood for training and test cases are somewhat similar, while the distribution of the random-noise control-data is always remarkably lower than the observed one for the training and testing sets. In contrast, metamorphic test cases show a prediction likelihood that lies in an extended range with respect to training, tests, and random noise. Moreover, the presented approach allows the independent assessment of different training classes, and experiments show that some of the classes are more sensitive to misclassifying metamorphic test cases than other classes.
In conclusion, metamorphic test cases represent very aggressive tests for neural network architectures. Furthermore, since metamorphic test cases force a network to misclassify those inputs whose likelihood is similar to that of training cases, they could also be considered as adversarial attacks that evade defenses based on computational profile likelihood evaluation.

Submitted 28 July, 2021; originally announced July 2021.

Comments: 9 pages (10 pages with ref.)

Journal ref: Published in iMLSE 2020 2nd International Workshop on Machine Learning Systems Engineering https://sig-mlse.wixsite.com/imlse2020

arXiv:2107.12045 [pdf, other]

doi 10.1007/s10515-022-00337-x

How to Certify Machine Learning Based Safety-critical Systems? A Systematic Literature Review

Authors: Florian Tambon, Gabriel Laberge, Le An, Amin Nikanjam, Paulina Stevia Nouwou Mindom, Yann Pequignot, Foutse Khomh, Giulio Antoniol, Ettore Merlo, François Laviolette

Abstract: Context: Machine Learning (ML) has been at the heart of many innovations over the past years. However, including it in so-called 'safety-critical' systems such as automotive or aeronautic has proven to be very challenging, since the shift in paradigm that ML brings completely changes traditional certification approaches. Objective: This paper aims to elucidate challenges related to the certification of ML-based safety-critical systems, as well as the solutions that are proposed in the literature to tackle them, answering the question 'How to Certify Machine Learning Based Safety-critical Systems?'. Method: We conduct a Systematic Literature Review (SLR) of research papers published between 2015 and 2020, covering topics related to the certification of ML systems. In total, we identified 217 papers covering topics considered to be the main pillars of ML certification: Robustness, Uncertainty, Explainability, Verification, Safe Reinforcement Learning, and Direct Certification. We analyzed the main trends and problems of each sub-field and provided summaries of the papers extracted. Results: The SLR results highlighted the enthusiasm of the community for this subject, as well as the lack of diversity in terms of datasets and type of models. It also emphasized the need to further develop connections between academia and industries to deepen the domain study. Finally, it also illustrated the necessity to build connections between the above-mentioned main pillars, which are for now mainly studied separately.
Conclusion: We highlighted current efforts deployed to enable the certification of ML-based software systems, and discussed some future research directions.

Submitted 1 December, 2021; v1 submitted 26 July, 2021; originally announced July 2021.

Comments: 60 pages (92 pages with references and complements), submitted to a journal (Automated Software Engineering). Changes: Emphasizing difference traditional software engineering / ML approach. Adding Related Works, Threats to Validity and Complementary Materials. Adding a table listing papers reference for each section/subsections

Journal ref: Autom Softw Eng 29, 38 (2022)

arXiv:2107.04863 [pdf, other]

HOMRS: High Order Metamorphic Relations Selector for Deep Neural Networks

Authors: Florian Tambon, Giulio Antoniol, Foutse Khomh

Abstract: Deep Neural Network (DNN) applications are increasingly becoming a part of our everyday life, from medical applications to autonomous cars. Traditional validation of DNN relies on accuracy measures; however, the existence of adversarial examples has highlighted the limitations of these accuracy measures, raising concerns especially when DNN are integrated into safety-critical systems. In this paper, we present HOMRS, an approach to boost metamorphic testing by automatically building a small optimized set of high order metamorphic relations from an initial set of elementary metamorphic relations. HOMRS' backbone is a multi-objective search; it exploits ideas drawn from traditional systems testing such as code coverage, test case, path diversity as well as input validation. We applied HOMRS to MNIST/LeNet and SVHN/VGG and we report evidence that it builds a small but effective set of high-order transformations that generalize well to the input data distribution. Moreover, compared to similar generation techniques such as DeepXplore, we show that our distribution-based approach is more effective, generating valid transformations from an uncertainty quantification point of view, while requiring less computation time by leveraging the generalization ability of the approach.

Submitted 21 December, 2021; v1 submitted 10 July, 2021; originally announced July 2021.

Comments: 33 pages

arXiv:2107.02279 [pdf, other]

Design Smells in Deep Learning Programs: An Empirical Study

Authors: Amin Nikanjam, Foutse Khomh

Abstract: Nowadays, we are witnessing an increasing adoption of Deep Learning (DL) based software systems in many industries. Designing a DL program requires constructing a deep neural network (DNN) and then training it on a dataset. This process requires that developers make multiple architectural (e.g., type, size, number, and order of layers) and configuration (e.g., optimizer, regularization methods, and activation functions) choices that affect the quality of the DL models, and consequently software quality. An under-specified or poorly-designed DL model may train successfully but is likely to perform poorly when deployed in production. Design smells in DL programs are poor design and/or configuration decisions taken during the development of DL components that are likely to have a negative impact on the performance (i.e., prediction accuracy) and hence the quality of DL-based software systems. In this paper, we present a catalogue of 8 design smells for a popular DL architecture, namely deep Feedforward Neural Networks, which are widely employed in industrial applications. The design smells were identified through a review of the existing literature on DL design and a manual inspection of 659 DL programs with performance issues and design inefficiencies. The smells are specified by describing their context, consequences, and recommended refactorings. To provide empirical evidence on the relevance and perceived impact of the proposed design smells, we conducted a survey with 81 DL developers.
In general, the developers perceived the proposed design smells as reflective of design or implementation problems, with agreement levels varying between 47% and 68%.

Submitted 7 July, 2021; v1 submitted 5 July, 2021; originally announced July 2021.

Comments: Accepted for publication by ICSME 2021

arXiv:2105.08095 [pdf, other]

Automatic Fault Detection for Deep Learning Programs Using Graph Transformations

Authors: Amin Nikanjam, Houssem Ben Braiek, Mohammad Mehdi Morovati, Foutse Khomh

Abstract: Nowadays, we are witnessing an increasing demand in both industry and academia for exploiting Deep Learning (DL) to solve complex real-world problems. A DL program encodes the network structure of a desirable DL model and the process by which the model learns from the training dataset. Like any software, a DL program can be faulty, which implies substantial challenges of software quality assurance, especially in safety-critical domains. It is therefore crucial to equip DL development teams with efficient fault detection techniques and tools. In this paper, we propose NeuraLint, a model-based fault detection approach for DL programs, using meta-modelling and graph transformations. First, we design a meta-model for DL programs that includes their base skeleton and fundamental properties. Then, we construct a graph-based verification process that covers 23 rules defined on top of the meta-model and implemented as graph transformations to detect faults and design inefficiencies in the generated models (i.e., instances of the meta-model). The proposed approach is first evaluated by finding faults and design inefficiencies in 28 synthesized examples built from common problems reported in the literature. Then NeuraLint successfully finds 64 faults and design inefficiencies in 34 real-world DL programs extracted from Stack Overflow posts and GitHub repositories. The results show that NeuraLint effectively detects faults and design issues in both synthesized and real-world examples with a recall of 70.5% and a precision of 100%.
Although the proposed meta-model is designed for feedforward neural networks, it can be extended to support other neural network architectures such as recurrent neural networks. Researchers can also expand our set of verification rules to cover more types of issues in DL programs.

Submitted 30 May, 2021; v1 submitted 17 May, 2021; originally announced May 2021.

arXiv:2104.00058 [pdf, other]

Investigating Design Anti-pattern and Design Pattern Mutations and Their Change- and Fault-proneness

Authors: Zeinab Kermansaravi, Md Saidur Rahman, Foutse Khomh, Fehmi Jaafar, Yann-Gael Gueheneuc

Abstract: During software evolution, inexperienced developers may introduce design anti-patterns when they modify their software systems to fix bugs or to add new functionalities based on changes in requirements. Developers may also use design patterns to promote software quality or as a possible cure for some design anti-patterns. Thus, design patterns and design anti-patterns are introduced, removed, and mutated from one another by developers. Many studies investigated the evolution of design patterns and design anti-patterns and their impact on software development. However, they investigated design patterns or design anti-patterns in isolation and did not consider their mutations and the impact of these mutations on software quality. Therefore, we report our study of bidirectional mutations between design patterns and design anti-patterns and the impacts of these mutations on software change- and fault-proneness. We analyzed snapshots of seven Java software systems with diverse sizes, evolution histories, and application domains. We built Markov models to capture the probability of occurrences of the different design pattern and design anti-pattern mutations. Results from our study show that (1) design patterns and design anti-patterns mutate into other design patterns and/or design anti-patterns.
They also show that (2) some change types primarily trigger mutations of design patterns and design anti-patterns (renaming and changes to comments, declarations, and operators), and (3) some mutations of design anti-patterns and design patterns are more faulty in specific contexts. These results provide important insights into the evolution of design patterns and design anti-patterns and its impact on the change- and fault-proneness of software systems.

Submitted 31 March, 2021; originally announced April 2021.

arXiv:2102.11491 [pdf, other]

Data Driven Testing of Cyber Physical Systems

Authors: Dmytro Humeniuk, Giuliano Antoniol, Foutse Khomh

Abstract: Consumer grade cyber-physical systems (CPS) are becoming an integral part of our life, automating and simplifying everyday tasks. Due to complex interactions between hardware, networking and software, developing and testing such systems is known to be a challenging task. Various quality assurance and testing strategies have been proposed. The most common approach for pre-deployment testing is to model the system and run simulations with models or software in the loop. In practice, most often, tests are run for a small number of simulations, which are selected based on the engineers' domain knowledge and experience. In this paper we propose an approach to automatically generate fault-revealing test cases for CPS. We have implemented our approach in Python, using standard frameworks, and used it to generate scenarios violating temperature constraints for a smart thermostat implemented as a part of our IoT testbed. Data collected from an application managing a smart building have been used to learn models of the environment under ever-changing conditions. The suggested approach allowed us to identify several pitfalls, i.e., scenarios (environment conditions and inputs) in which the system does not behave as expected.

Submitted 23 March, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

Comments: 4 pages, to be published in SBST2021 workshop proceedings

arXiv:2102.08874 [pdf, other]

Mining API Usage Scenarios from Stack Overflow

Authors: Gias Uddin, Foutse Khomh, Chanchal K Roy

Abstract: We propose a framework to mine API usage scenarios from Stack Overflow. Each task consists of a code example, the task description, and the reactions of developers towards the code example. First, we present an algorithm to automatically link a code example in a forum post to an API mentioned in the textual contents of the forum post. Second, we generate a natural language description of the task by summarizing the discussions around the code example. Third, we automatically associate developers' reactions (i.e., positive and negative opinions) towards the code example to offer information about code quality. We evaluate the algorithms using three benchmarks.

Submitted 17 February, 2021; originally announced February 2021.

Journal ref: 2020 Information and Software Technology (IST)

arXiv:2102.08502 [pdf, other]

Automatic API Usage Scenario Documentation from Technical Q&A Sites

Authors: Gias Uddin, Foutse Khomh, Chanchal K Roy

Abstract: The online technical Q&A site Stack Overflow (SO) is popular among developers to support their coding and diverse development needs. To address shortcomings in API official documentation resources, several studies have thus focused on augmenting official API documentation with insights (e.g., code examples) from SO. The techniques propose to add code examples/insights about APIs into their official documentation. Reviews are opinionated sentences with positive/negative sentiments. However, we are aware of no previous research that attempts to automatically produce API documentation from SO by considering both API code examples and reviews. In this paper, we present two novel algorithms that can be used to automatically produce API documentation from SO by combining code examples and reviews towards those examples. The first algorithm is called statistical documentation, which shows the distribution of positivity and negativity around the code examples of an API using different metrics (e.g., star ratings). The second algorithm is called concept-based documentation, which clusters similar and conceptually relevant usage scenarios. An API usage scenario contains a code example, a textual description of the underlying task addressed by the code example, and the reviews (i.e., opinions with positive and negative sentiments) from other developers towards the code example. We deployed the algorithms in Opiner, a web-based platform to aggregate information about APIs from online forums.
We evaluated the algorithms by mining all Java JSON-based posts in SO and by conducting three user studies based on produced documentation from the posts.

Submitted 16 February, 2021; originally announced February 2021.

Journal ref: 2021 ACM Transactions on Software Engineering and Methodology (TOSEM)

arXiv:2102.08495 [pdf, other]

Understanding How and Why Developers Seek and Analyze API-related Opinions

Authors: Gias Uddin, Olga Baysal, Latifa Guerrouj, Foutse Khomh

Abstract: With the advent and proliferation of online developer forums as informal documentation, developers often share their opinions about the APIs they use. Thus, opinions of others often shape the developer's perception and decisions related to software development. For example, the choice of an API or how to reuse the functionality the API offers are, to a considerable degree, conditioned upon what other developers think about the API. While many developers refer to and rely on such opinion-rich information about APIs, we found little research that investigates the use and benefits of public opinions. To understand how developers seek and evaluate API opinions, we conducted two surveys involving a total of 178 software developers. We analyzed the data in two dimensions, each corresponding to specific needs related to API reviews: (1) Needs for seeking API reviews, and (2) Needs for automated tool support to assess the reviews. We observed that developers seek API reviews and often have to summarize those for diverse development needs (e.g., API suitability). Developers also make conscious efforts to judge the trustworthiness of the provided opinions and believe that automated tool support for API review analysis can assist in diverse development scenarios, including, for example, saving time in API selection as well as making informed decisions on a particular API's features.

Submitted 16 February, 2021; originally announced February 2021.

Journal ref: 2019 IEEE Transactions on Software Engineering (TSE)

arXiv:2101.00135 [pdf, other]

Faults in Deep Reinforcement Learning Programs: A Taxonomy and A Detection Approach

Authors: Amin Nikanjam, Mohammad Mehdi Morovati, Foutse Khomh, Houssem Ben Braiek

Abstract: A growing demand is witnessed in both industry and academia for employing Deep Learning (DL) in various domains to solve real-world problems. Deep Reinforcement Learning (DRL) is the application of DL in the domain of Reinforcement Learning (RL). Like any software system, DRL applications can fail because of faults in their programs. In this paper, we present the first attempt to categorize faults occurring in DRL programs. We manually analyzed 761 artifacts of DRL programs (from Stack Overflow posts and GitHub issues) developed using well-known DRL frameworks (OpenAI Gym, Dopamine, Keras-rl, Tensorforce) and identified faults reported by developers/users. We labeled and taxonomized the identified faults through several rounds of discussions. The resulting taxonomy is validated using an online survey with 19 developers/researchers. To allow for the automatic detection of faults in DRL programs, we have defined a meta-model of DRL programs and developed DRLinter, a model-based fault detection approach that leverages static analysis and graph transformations. The execution flow of DRLinter consists in parsing a DRL program to generate a model conforming to our meta-model and applying detection rules on the model to identify fault occurrences. The effectiveness of DRLinter is evaluated using 15 synthetic DRL programs in which we injected faults observed in the analyzed artifacts of the taxonomy. The results show that DRLinter can successfully detect faults in all synthetic faulty programs.

Submitted 28 November, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

arXiv:2010.14331 [pdf, other]

Are Multi-language Design Smells Fault-prone? An Empirical Study

Authors: Mouna Abidi, Md Saidur Rahman, Moses Openja, Foutse Khomh

Abstract: Nowadays, modern applications are developed using components written in different programming languages. These systems introduce several advantages. However, as the number of languages increases, so do the challenges related to the development and maintenance of these systems. In such situations, developers may introduce design smells (i.e., anti-patterns and code smells) which are symptoms of poor design and implementation choices. Design smells are defined as poor design and coding choices that can negatively impact the quality of a software program despite satisfying functional requirements. Studies on mono-language systems suggest that the presence of design smells affects code comprehension, thus making systems harder to maintain. However, these studies target only mono-language systems and do not consider the interaction between different programming languages. In this paper, we present an approach to detect multi-language design smells in the context of JNI systems. We then investigate the prevalence of those design smells. Specifically, we detect 15 design smells in 98 releases of nine open-source JNI projects. Our results show that the design smells are prevalent in the selected projects and persist throughout the releases of the systems. We observe that in the analyzed systems, 33.95% of the files involving communications between Java and C/C++ contain occurrences of multi-language design smells. Some kinds of smells are more prevalent than others, e.g., Unused Parameters, Too Much Scattering, Unused Method Declaration.
Our results suggest that files with multi-language design smells can often be more associated with bugs than files without these smells, and that specific smells are more correlated to fault-proneness than others.

Submitted 2 November, 2020; v1 submitted 27 October, 2020; originally announced October 2020.

Journal ref: ACM Transactions on Software Engineering and Methodology (TOSEM) 2020

arXiv:2009.02438 [pdf, other]

doi 10.1016/j.infsof.2020.106278

A Large Scale Empirical Study of the Impact of Spaghetti Code and Blob Anti-patterns on Program Comprehension

Authors: Cristiano Politowski, Foutse Khomh, Simone Romano, Giuseppe Scanniello, Fabio Petrillo, Yann-Gaël Guéhéneuc, Abdou Maiga

Abstract: Context: Several studies investigated the impact of anti-patterns (i.e., "poor" solutions to recurring design problems) during maintenance activities and reported that anti-patterns significantly affect the developers' effort required to edit files. However, before developers edit files, they must understand the source code of the systems. This source code must be easy to understand by developers. Objective: In this work, we provide a complete assessment of the impact of two instances of two anti-patterns, Blob or Spaghetti Code, on program comprehension. Method: We analyze the impact of these two anti-patterns through three empirical studies conducted at Polytechnique Montréal (Canada) with 24 participants; at Carleton University (Canada) with 30 participants; and at the University of Basilicata (Italy) with 79 participants. Results: We collect data from 372 tasks obtained thanks to 133 different participants from the three universities. We use three metrics to assess the developers' comprehension of the source code: (1) the duration to complete each task; (2) their percentage of correct answers; and (3) the NASA task load index for their effort. Conclusions: We report that, although single occurrences of Blob or Spaghetti Code anti-patterns have little effect on code comprehension, two occurrences of either Blob or Spaghetti Code significantly increase the developers' time spent in their tasks, reduce their percentage of correct answers, and increase their effort.
Hence, we recommend that developers act on both anti-patterns, which should be refactored out of the source code whenever possible. We also recommend further studies on combinations of anti-patterns rather than on single anti-patterns one at a time.

Submitted 4 September, 2020; originally announced September 2020.

arXiv:1912.09303 [pdf, other]

SIGMA : Strengthening IDS with GAN and Metaheuristics Attacks

Authors: Simon Msika, Alejandro Quintero, Foutse Khomh

Abstract: An Intrusion Detection System (IDS) is a key cybersecurity tool for network administrators as it identifies malicious traffic and cyberattacks. With the recent successes of machine learning techniques such as deep learning, more and more IDS are now using machine learning algorithms to detect attacks faster. However, these systems lack robustness when facing previously unseen types of attacks. With the increasing number of new attacks, especially against Internet of Things devices, having a robust IDS able to spot unusual and new attacks becomes necessary. This work explores the possibility of leveraging generative adversarial models to improve the robustness of machine learning based IDS. More specifically, we propose a new method named SIGMA, that leverages adversarial examples to strengthen IDS against new types of attacks. Using Generative Adversarial Networks (GAN) and metaheuristics, SIGMA generates adversarial examples iteratively and uses them to retrain a machine learning-based IDS until the detection rate converges (i.e., until the detection system is not improving anymore). A round of improvement consists of a generative phase, in which we use GANs and metaheuristics to generate instances; an evaluation phase, in which we calculate the detection rate of those newly generated attacks; and a training phase, in which we train the IDS with those attacks.
We have evaluated the SIGMA method for four standard machine learning classification algorithms acting as IDS, with a combination of GAN and a hybrid local-search and genetic algorithm, to generate new datasets of attacks. Our results show that SIGMA can successfully generate adversarial attacks against different machine learning based IDS. Also, using SIGMA, we can improve the performance of an IDS to up to 100% after as little as two rounds of improvement.

Submitted 18 December, 2019; originally announced December 2019.

Comments: 11 pages, 6 figures

arXiv:1910.07658 [pdf, other]

doi 10.1109/ICSME.2019.00021

Deep Learning Anti-patterns from Code Metrics History

Authors: Antoine Barbez, Foutse Khomh, Yann-Gaël Guéhéneuc

Abstract: Anti-patterns are poor solutions to recurring design problems. A number of empirical studies have highlighted the negative impact of anti-patterns on software maintenance, which motivated the development of various detection techniques. Most of these approaches rely on structural metrics of software systems to identify affected components while others exploit historical information by analyzing co-changes occurring between code components. By relying solely on one aspect of software systems (i.e., structural or historical), existing approaches miss some precious information which limits their performances. In this paper, we propose CAME (Convolutional Analysis of code Metrics Evolution), a deep-learning based approach that relies on both structural and historical information to detect anti-patterns. Our approach exploits historical values of structural code metrics mined from version control systems and uses a Convolutional Neural Network classifier to infer the presence of anti-patterns from this information. We experiment with our approach on the widely known God Class anti-pattern and evaluate its performances on three software systems. With the results of our study, we show that: (1) using historical values of source code metrics increases the precision; (2) CAME outperforms existing static machine-learning classifiers; and (3) CAME outperforms existing detection tools.

Submitted 16 October, 2019; originally announced October 2019.

Comments: Preprint. Paper accepted for inclusion in the Research Track of the 35th IEEE International Conference on Software Maintenance and Evolution (ICSME 2019), Cleveland, Ohio, USA

arXiv:1910.04736 [pdf, other]

Studying Software Engineering Patterns for Designing Machine Learning Systems

Authors: Hironori Washizaki, Hiromu Uchida, Foutse Khomh, Yann-Gael Gueheneuc

Abstract: Machine-learning (ML) techniques have become popular in recent years. ML techniques rely on mathematics and on software engineering. Researchers and practitioners are studying best practices for designing ML application systems and software to address the software complexity and quality of ML techniques. Such design practices are often formalized as architecture patterns and design patterns by encapsulating reusable solutions to commonly occurring problems within given contexts. However, to the best of our knowledge, there has been no work collecting, classifying, and discussing these software-engineering (SE) design patterns for ML techniques systematically. Thus, we set out to collect good/bad SE design patterns for ML techniques to provide developers with a comprehensive and ordered classification of such patterns. We report here preliminary results of a systematic-literature review (SLR) of good/bad design patterns for ML.

Submitted 11 October, 2019; v1 submitted 10 October, 2019; originally announced October 2019.

arXiv:1910.01321 [pdf]

An Empirical Study of C++ Vulnerabilities in Crowd-Sourced Code Examples

Authors: Morteza Verdi, Ashkan Sami, Jafar Akhondali, Foutse Khomh, Gias Uddin, Alireza Karami Motlagh

Abstract: Software developers share programming solutions in Q&A sites like Stack Overflow. The reuse of crowd-sourced code snippets can facilitate rapid prototyping. However, recent research shows that the shared code snippets may be of low quality and can even contain vulnerabilities. This paper aims to understand the nature and the prevalence of security vulnerabilities in crowd-sourced code examples. To achieve this goal, we investigate security vulnerabilities in the C++ code snippets shared on Stack Overflow over a period of 10 years. In collaborative sessions involving multiple human coders, we manually assessed each code snippet for security vulnerabilities following CWE (Common Weakness Enumeration) guidelines. From the 72,483 reviewed code snippets used in at least one project hosted on GitHub, we found a total of 69 vulnerable code snippets categorized into 29 types. Many of the investigated code snippets are still not corrected on Stack Overflow. The 69 vulnerable code snippets found in Stack Overflow were reused in a total of 2859 GitHub projects. To help improve the quality of code snippets shared on Stack Overflow, we developed a browser extension that allows Stack Overflow users to check for vulnerabilities in code snippets when they upload them on the platform.

Submitted 19 January, 2021; v1 submitted 3 October, 2019; originally announced October 2019.

Comments: 14 pages

arXiv:1909.02563 [pdf, ps, other]

DeepEvolution: A Search-Based Testing Approach for Deep Neural Networks

Authors: Houssem Ben Braiek, Foutse Khomh

Abstract: The increasing inclusion of Deep Learning (DL) models in safety-critical systems such as autonomous vehicles has led to the development of multiple model-based DL testing techniques. One common denominator of these testing techniques is the automated generation of test cases, e.g., new inputs transformed from the original training data with the aim to optimize some test adequacy criteria. So far, the effectiveness of these approaches has been hindered by their reliance on random fuzzing or transformations that do not always produce test cases with a good diversity. To overcome these limitations, we propose DeepEvolution, a novel search-based approach for testing DL models that relies on metaheuristics to ensure a maximum diversity in generated test cases. We assess the effectiveness of DeepEvolution in testing computer-vision DL models and find that it significantly increases the neuronal coverage of generated test cases. Moreover, using DeepEvolution, we could successfully find several corner-case behaviors. Finally, DeepEvolution outperformed Tensorfuzz (a coverage-guided fuzzing tool developed at Google Brain) in detecting latent defects introduced during the quantization of the models. These results suggest that search-based approaches can help build effective testing tools for DL systems.

Submitted 5 September, 2019; originally announced September 2019.

arXiv:1909.02562 [pdf, other]

TFCheck : A TensorFlow Library for Detecting Training Issues in Neural Network Programs

Authors: Houssem Ben Braiek, Foutse Khomh

Abstract: The increasing inclusion of Machine Learning (ML) models in safety-critical systems like autonomous cars has led to the development of multiple model-based ML testing techniques. One common denominator of these testing techniques is their assumption that training programs are adequate and bug-free. These techniques only focus on assessing the performance of the constructed model using manually labeled data or automatically generated data. However, their assumptions about the training program are not always true as training programs can contain inconsistencies and bugs. In this paper, we examine training issues in ML programs and propose a catalog of verification routines that can be used to detect the identified issues automatically. We implemented the routines in a TensorFlow-based library named TFCheck. Using TFCheck, practitioners can detect the aforementioned issues automatically. To assess the effectiveness of TFCheck, we conducted a case study with real-world, mutant, and synthetic training programs. Results show that TFCheck can successfully detect training issues in ML code implementations.

Submitted 5 September, 2019; originally announced September 2019.

arXiv:1906.07154 [pdf, other]

Machine Learning Software Engineering in Practice: An Industrial Case Study

Authors: Md Saidur Rahman, Emilio Rivera, Foutse Khomh, Yann-Gaël Guéhéneuc, Bernd Lehnert

Abstract: SAP is the market leader in enterprise software offering an end-to-end suite of applications and services to enable their customers worldwide to operate their business. Especially, retail customers of SAP deal with millions of sales transactions for their day-to-day business. Transactions are created during retail sales at the point of sale (POS) terminals and then sent to some central servers for validations and other business operations. A considerable proportion of the retail transactions may have inconsistencies due to many technical and human errors. SAP provides an automated process for error detection but still requires a manual process by dedicated employees using workbench software for correction. However, manual corrections of these errors are time-consuming, labor-intensive, and may lead to further errors due to incorrect modifications. This is not only a performance overhead on the customers' business workflow but it also incurs high operational costs. Thus, automated detection and correction of transaction errors are very important regarding their potential business values and the improvement in the business workflow. In this paper, we present an industrial case study where we apply machine learning (ML) to automatically detect transaction errors and propose corrections. We identify and discuss the challenges that we faced during this collaborative research and development project, from three distinct perspectives: Software Engineering, Machine Learning, and industry-academia collaboration.
We report on our experience and insights from the project with guidelines for the identified challenges. We believe that our findings and recommendations can help researchers and practitioners embarking into similar endeavors.

Submitted 17 June, 2019; originally announced June 2019.

Comments: 21 pages, 5 figures

arXiv:1903.01899 [pdf, other]

A Machine-learning Based Ensemble Method For Anti-patterns Detection

Authors: Antoine Barbez, Foutse Khomh, Yann-Gaël Guéhéneuc

Abstract: Anti-patterns are poor solutions to recurring design problems. Several empirical studies have highlighted their negative impact on program comprehension, maintainability, as well as fault-proneness. A variety of detection approaches have been proposed to identify their occurrences in source code. However, these approaches can identify only a subset of the occurrences and report large numbers of false positives and misses. Furthermore, a low agreement is generally observed among different approaches. Recent studies have shown the potential of machine-learning models to improve this situation. However, such algorithms require large sets of manually-produced training-data, which often limits their application in practice. In this paper, we present SMAD (SMart Aggregation of Anti-patterns Detectors), a machine-learning based ensemble method to aggregate various anti-patterns detection approaches on the basis of their internal detection rules. Thus, our method uses several detection tools to produce an improved prediction from a reasonable number of training examples. We implemented SMAD for the detection of two well-known anti-patterns: God Class and Feature Envy. With the results of our experiments conducted on eight Java projects, we show that: (1) our method clearly improves the so aggregated tools; (2) SMAD significantly outperforms other ensemble methods.

Submitted 16 October, 2019; v1 submitted 29 January, 2019; originally announced March 2019.

Comments: Preprint Submitted to Journal of Systems and Software, Elsevier

Showing 51–100 of 111 results for author: Khomh, F