Streamlining Software Reviews: Efficient Predictive Modeling with Minimal Examples

Abstract: This paper proposes a new challenge problem for software analytics. In the process we shall call "software review", a panel of SMEs (subject matter experts) reviews examples of software behavior to recommend how to improve that software's operation. SME time is usually extremely limited so, ideally, this panel can complete this optimization task after looking at just a small number of very informative examples. To support this review process, we explore methods that train a predictive model to guess if some oracle will like/dislike the next example. Such a predictive model can work with the SMEs to guide them in their exploration of all the examples. Also, after the panelists leave, that model can be used as an oracle in place of the panel (to handle new examples while the panelists are busy elsewhere). In 31 case studies (ranging from high-level decisions about software processes to low-level decisions about how to configure video encoding software), we show that such predictive models can be built using as few as 12 to 30 labels. To the best of our knowledge, this paper's success with only a handful of examples (and no large language model) is unprecedented. In accordance with the principles of open science, we offer all our code and data at https://github.com/timm/ez/tree/Stable-EMSE-paper so that others can repeat/refute/improve these results.

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2401.09622 [pdf, other]

SMOOTHIE: A Theory of Hyper-parameter Optimization for Software Analytics

Authors: Rahul Yedida, Tim Menzies

Abstract: Hyper-parameter optimization is the black art of tuning a learner's control parameters. In software analytics, a repeated result is that such tuning can result in dramatic performance improvements. Despite this, hyper-parameter optimization is often applied rarely or poorly in software analytics, perhaps because the CPU cost of exploring all those parameter options can be prohibitive. We theorize that learners generalize better when the loss landscape is "smooth". This theory is useful since the influence on "smoothness" of different hyper-parameter choices can be tested very quickly (e.g. for a deep learner, after just one epoch). To test this theory, this paper implements and tests SMOOTHIE, a novel hyper-parameter optimizer that guides its optimizations via considerations of "smoothness". The experiments of this paper test SMOOTHIE on numerous SE tasks including (a) GitHub issue lifetime prediction; (b) detecting false alarms in static code warnings; (c) defect prediction; and (d) a set of standard ML datasets. In all these experiments, SMOOTHIE out-performed state-of-the-art optimizers. Better yet, SMOOTHIE ran 300% faster than the prior state-of-the-art. We hence conclude that this theory (that hyper-parameter optimization is best viewed as a "smoothing" function for the decision landscape) is both theoretically interesting and practically very useful. To support open science and other researchers working in this area, all our scripts and datasets are available on-line at https://github.com/yrahul3910/smoothness-hpo/.

Submitted 17 January, 2024; originally announced January 2024.

Comments: v1

arXiv:2401.01883 [pdf, other]

Mining Temporal Attack Patterns from Cyberthreat Intelligence Reports

Authors: Md Rayhanur Rahman, Brandon Wroblewski, Quinn Matthews, Brantley Morgan, Tim Menzies, Laurie Williams

Abstract: Defending against cyberattacks requires practitioners to operate on high-level adversary behavior. Cyberthreat intelligence (CTI) reports on past cyberattack incidents describe the chain of malicious actions with respect to time. To avoid repeating cyberattack incidents, practitioners must proactively identify and defend against recurring chains of actions - which we refer to as temporal attack patterns. Automatically mining the patterns among actions provides structured and actionable information on the adversary behavior of past cyberattacks. The goal of this paper is to aid security practitioners in prioritizing and proactively defending against cyberattacks by mining temporal attack patterns from cyberthreat intelligence reports. To this end, we propose ChronoCTI, an automated pipeline for mining temporal attack patterns from CTI reports of past cyberattacks. To construct ChronoCTI, we build the ground truth dataset of temporal attack patterns and apply state-of-the-art large language models, natural language processing, and machine learning techniques. We apply ChronoCTI on a set of 713 CTI reports, where we identify 124 temporal attack patterns - which we categorize into nine pattern categories. We identify that the most prevalent pattern category is to trick victim users into executing malicious code to initiate the attack, followed by bypassing the anti-malware system in the victim network.
Based on the observed patterns, we recommend that organizations train users in cybersecurity best practices, introduce immutable operating systems with limited functionalities, and enforce multi-user authentication. Moreover, we recommend that practitioners leverage the automated mining capability of ChronoCTI and design countermeasures against the recurring attack patterns.

Submitted 3 January, 2024; originally announced January 2024.

Comments: A modified version of this pre-print is submitted to IEEE Transactions on Software Engineering, and is under review

arXiv:2312.05436 [pdf, other]

Trading Off Scalability, Privacy, and Performance in Data Synthesis

Authors: Xiao Ling, Tim Menzies, Christopher Hazard, Jack Shu, Jacob Beel

Abstract: Synthetic data has recently been widely applied in the real world. One typical example is the creation of synthetic data for privacy-sensitive datasets. In this scenario, synthetic data substitutes for the real data, which contains private information, and is used for public testing of machine learning models. Another typical example is the over-sampling of unbalanced data, in which synthetic data is generated in the region of the minority samples to balance the ratio of positive to negative examples when training machine learning models. In this study, we concentrate on the first example, and introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework. We evaluate these two algorithms on the aspects of privacy preservation and accuracy, and compare them to the two state-of-the-art synthetic data generation algorithms DataSynthesizer and Synthetic Data Vault. We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which results in the best overall score. On the other hand, our proposed random projection based framework can generate synthetic data with the highest accuracy score, and scales the fastest.

Submitted 8 December, 2023; originally announced December 2023.

Comments: 13 pages, 2 figures, 6 tables, submitted to IEEE Access

arXiv:2310.19125 [pdf, other]

iSNEAK: Partial Ordering as Heuristics for Model-Based Reasoning in Software Engineering

Authors: Andre Lustosa, Tim Menzies

Abstract: A "partial ordering" is a way to heuristically order a set of examples (a partial ordering is defined over a set where, for certain pairs of elements, one precedes the other). While these orderings may only be approximate, they can be useful for guiding a search towards better regions of the data. To illustrate the value of that technique, this paper presents iSNEAK, an incremental human-in-the-loop AI problem solver. iSNEAK uses partial orderings and feedback from humans to prune the space of options. Further, in experiments with a dozen software models of increasing size and complexity (with up to 10,000 variables), iSNEAK only asked a handful of questions to return human-acceptable solutions that outperformed the prior state-of-the-art. We propose the use of partial orderings and tools like iSNEAK to solve the information overload problem where human experts grow fatigued and make mistakes when they are asked too many questions. iSNEAK mitigates the information overload problem since it allows humans to explore complex problem spaces in far less time, with far less effort.

Submitted 14 July, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

arXiv:2310.07109 [pdf, other]

SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning

Authors: Xueqi Yang, Mariusz Jakubowski, Kelly Kang, Haojie Yu, Tim Menzies

Abstract: As software projects rapidly evolve, software artifacts become more complex and the defects behind them get harder to identify. The emerging Transformer-based approaches, though achieving remarkable performance, struggle with long code sequences due to their self-attention mechanism, which scales quadratically with the sequence length. This paper introduces SparseCoder, an innovative approach incorporating sparse attention and a learned token pruning (LTP) method (adapted from natural language processing) to address this limitation. Extensive experiments carried out on a large-scale dataset for vulnerability detection demonstrate the effectiveness and efficiency of SparseCoder, which scales linearly rather than quadratically on long code sequence analysis in comparison to CodeBERT and RoBERTa. We further achieve a 50% FLOPs reduction with a negligible performance drop of less than 1% compared to a Transformer leveraging sparse attention. Moreover, SparseCoder goes beyond making "black-box" decisions by elucidating the rationale behind those decisions. Code segments that contribute to the final decision can be highlighted with importance scores, offering an interpretable, transparent analysis tool for the software engineering landscape.

Submitted 10 October, 2023; originally announced October 2023.

Comments: 11 pages, 8 figures, pre-print

arXiv:2309.01314 [pdf, other]

doi 10.1145/3617555.3617876

Model Review: A PROMISEing Opportunity

Authors: Tim Menzies

Abstract: To make models more understandable and correctable, I propose that the PROMISE community pivot to the problem of model review. Over the years, there have been many reports that very simple models can perform exceptionally well. Yet, where are the researchers asking "say, does that mean that we could make software analytics simpler and more comprehensible?" This is an important question, since humans often have difficulty accurately assessing complex models (leading to unreliable and sometimes dangerous results). Prior PROMISE results have shown that data mining can effectively summarize large models/data sets into simpler and smaller ones. Therefore, the PROMISE community has the skills and experience needed to redefine, simplify, and improve the relationship between humans and AI.

Submitted 6 September, 2023; v1 submitted 3 September, 2023; originally announced September 2023.

Comments: 5 pages, 1 figure

arXiv:2305.03714 [pdf, other]

On the Benefits of Semi-Supervised Test Case Generation for Simulation Models

Authors: Xiao Ling, Tim Menzies

Abstract: Testing complex simulation models can be expensive and time-consuming. Current state-of-the-art methods that explore this problem are fully-supervised; i.e. they require that all examples are labeled. On the other hand, the GenClu system (introduced in this paper) takes a semi-supervised approach; i.e. (a) only a small subset of information is actually labeled (via simulation) and (b) those labels are then spread across the rest of the data. When applied to five open-source simulation models of cyber-physical systems, GenClu's test generation can be multiple orders of magnitude faster than the prior state of the art. Further, when assessed via mutation testing, tests generated by GenClu were as good as or better than anything else tested here. Hence, we recommend semi-supervised methods over prior methods (evolutionary search and fully-supervised learning).

Submitted 1 December, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

Comments: 14 pages, 4 figures, 6 tables, first round review in TSE

arXiv:2302.01997 [pdf, other]

Less, but Stronger: On the Value of Strong Heuristics in Semi-supervised Learning for Software Analytics

Authors: Huy Tu, Tim Menzies

Abstract: In many domains, there are many examples and far fewer labels for those examples; e.g. we may have access to millions of lines of source code, but access to only a handful of warnings about that code. In those domains, semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data. Standard SSL algorithms use "weak" knowledge (i.e. knowledge not based on specific SE knowledge); e.g. they co-train two learners and use good labels from one to train the other. Another approach to SSL in software analytics is to use "strong" knowledge that draws on SE knowledge. For example, an often-used heuristic in SE is that unusually large artifacts contain undesired properties (e.g. more bugs). This paper argues that such "strong" algorithms perform better than those standard, weaker, SSL algorithms. We show this by learning models from labels generated using weak SSL or our "stronger" FRUGAL algorithm. In four domains (distinguishing security-related bug reports; mitigating bias in decision-making; predicting issue close time; and reducing false alarms in static code warnings), FRUGAL required only 2.5% of the data to be labeled yet out-performed standard semi-supervised learners that relied on (e.g.) some domain-independent graph theory concepts. Hence, for future work, we strongly recommend the use of strong heuristics for semi-supervised learning for SE applications. To better support other researchers, our scripts and data are on-line at https://github.com/HuyTu7/FRUGAL.

Submitted 3 February, 2023; originally announced February 2023.

Comments: Submitting to EMSE

arXiv:2301.10407 [pdf, other]

Don't Lie to Me: Avoiding Malicious Explanations with STEALTH

Authors: Lauren Alvarez, Tim Menzies

Abstract: STEALTH is a method for using some AI-generated model without suffering from malicious attacks (i.e. lying) or associated unfairness issues. After recursively bi-clustering the data, the STEALTH system asks the AI model a limited number of queries about class labels. STEALTH asks so few queries (1 per data cluster) that malicious algorithms (a) cannot detect its operation, nor (b) know when to lie.

Submitted 25 January, 2023; originally announced January 2023.

Comments: 6 pages, 6 Tables, 3 figures

arXiv:2301.06577 [pdf, other]

Learning from Very Little Data: On the Value of Landscape Analysis for Predicting Software Project Health

Authors: Andre Lustosa, Tim Menzies

Abstract: When data is scarce, software analytics can make many mistakes. For example, consider learning predictors for open source project health (e.g. the number of closed pull requests in twelve months' time). The training data for this task may be very small (e.g. five years of data, collected every month, means just 60 rows of training data). The models generated from such tiny data sets can make many prediction errors. Those errors can be tamed by a landscape analysis that selects better learner control parameters. Our niSNEAK tool (a) clusters the data to find the general landscape of the hyperparameters; then (b) explores a few representatives from each part of that landscape. niSNEAK is both faster and more effective than prior state-of-the-art hyperparameter optimization algorithms (e.g. FLASH, HYPEROPT, OPTUNA). The configurations found by niSNEAK have far less error than other methods. For example, for project health indicators such as C = number of commits, I = number of closed issues, and R = number of closed pull requests, niSNEAK's 12 month prediction errors are {I=0%, R=33%, C=47%}. Based on the above, we recommend landscape analytics (e.g. niSNEAK) especially when learning from very small data sets. This paper only explores the application of niSNEAK to project health. That said, we see nothing in principle that prevents the application of this technique to a wider range of problems.
To assist other researchers in repeating, improving, or even refuting our results, all our scripts and data are available on GitHub at https://github.com/zxcv123456qwe/niSneak

Submitted 11 October, 2023; v1 submitted 16 January, 2023; originally announced January 2023.

arXiv:2211.10012 [pdf, other]

doi 10.1109/MC.2022.3223646

A Tale of Two Cities: Data and Configuration Variances in Robust Deep Learning

Authors: Guanqin Zhang, Jiankun Sun, Feng Xu, H. M. N. Dilum Bandara, Shiping Chen, Yulei Sui, Tim Menzies

Abstract: Deep neural networks (DNNs) are widely used in many industries such as image recognition, supply chain, medical diagnosis, and autonomous driving. However, prior work has shown that the high accuracy of a DNN model does not imply high robustness (i.e., consistent performance on new and future datasets) because the input data and external environment (e.g., software and model configurations) for a deployed model are constantly changing. Hence, ensuring the robustness of deep learning is not an option but a priority to enhance business and consumer confidence. Previous studies mostly focus on the data aspect of model variance. In this article, we systematically summarize DNN robustness issues and formulate them in a holistic view through two important aspects, i.e., data and software configuration variances in DNNs. We also provide a predictive framework to generate representative variances (counterexamples) by considering both data and configurations for robust learning through the lens of search-based optimization.

Submitted 25 November, 2022; v1 submitted 17 November, 2022; originally announced November 2022.

arXiv:2211.05920 [pdf, other]

When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors

Authors: Suvodeep Majumder, Joymallya Chakraborty, Tim Menzies

Abstract: Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those using 100% of the data. That said, co-training needs to be used cautiously since the specific co-training method needs to be carefully selected based on a user's specific goals. Also, we warn that a commonly-used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions (while adding too much to the run time costs: 11 hours vs. 1.8 hours). It is an open question, worthy of future work, to test if these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised.

Submitted 15 February, 2024; v1 submitted 10 November, 2022; originally announced November 2022.

Comments: 36 pages, 10 figures, 5 tables

arXiv:2208.01595 [pdf, other]

Do I really need all this work to find vulnerabilities? An empirical case study comparing vulnerability detection techniques on a Java application

Authors: Sarah Elder, Nusrat Zahan, Rui Shu, Monica Metro, Valeri Kozarev, Tim Menzies, Laurie Williams

Abstract: CONTEXT: Applying vulnerability detection techniques is one of many tasks using the limited resources of a software project. OBJECTIVE: The goal of this research is to assist managers and other decision-makers in making informed choices about the use of software vulnerability detection techniques through an empirical study of the efficiency and effectiveness of four techniques on a Java-based web application. METHOD: We apply four different categories of vulnerability detection techniques - systematic manual penetration testing (SMPT), exploratory manual penetration testing (EMPT), dynamic application security testing (DAST), and static application security testing (SAST) - to an open-source medical records system. RESULTS: We found the most vulnerabilities using SAST. However, EMPT found more severe vulnerabilities. With each technique, we found unique vulnerabilities not found using the other techniques. The efficiency of manual techniques (EMPT, SMPT) was comparable to or better than the efficiency of automated techniques (DAST, SAST) in terms of Vulnerabilities per Hour (VpH). CONCLUSIONS: The vulnerability detection technique that practitioners should select may vary based on the goals and available resources of the project. If the goal of an organization is to find "all" vulnerabilities in a project, they need to use as many techniques as their resources allow.

Submitted 2 August, 2022; originally announced August 2022.

ACM Class: D.2.5

arXiv:2205.10504 [pdf, other]

How to Find Actionable Static Analysis Warnings: A Case Study with FindBugs

Authors: Rahul Yedida, Hong Jin Kang, Huy Tu, Xueqi Yang, David Lo, Tim Menzies

Abstract: Automatically generated static code warnings suffer from a large number of false alarms. Hence, developers only take action on a small percentage of those warnings. To better predict which static code warnings should not be ignored, we suggest that analysts need to look deeper into their algorithms to find choices that better suit the particulars of their specific problem. Specifically, we show here that effective predictors of such warnings can be created by methods that locally adjust the decision boundary (between actionable warnings and others). These methods yield a new high-water mark for recognizing actionable static code warnings. For eight open-source Java projects (cassandra, jmeter, commons, lucene-solr, maven, ant, tomcat, derby) we achieve perfect test results on 4/8 datasets and, overall, a median AUC (area under the true negatives, true positives curve) of 92%.

Submitted 23 December, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

Comments: Accepted to TSE

arXiv:2205.00665 [pdf, other]

Reducing the Cost of Training Security Classifier (via Optimized Semi-Supervised Learning)

Authors: Rui Shu, Tianpei Xia, Huy Tu, Laurie Williams, Tim Menzies

Abstract: Background: Most of the existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In such a paradigm, models need a large amount of labeled data to learn the useful relationships between selected features and the target class. However, such labeled data can be scarce and expensive to acquire. Goal: To help security practitioners train useful security classification models when little labeled training data and much unlabeled training data are available. Method: We propose an adaptive framework called Dapper, which optimizes 1) semi-supervised learning algorithms to assign pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine learning classifier (i.e., random forest). When the dataset class is highly imbalanced, Dapper then adaptively integrates and optimizes a data oversampling method called SMOTE. We use the novel Bayesian Optimization to search a large hyperparameter space of these tuning targets. Result: We evaluate Dapper with three security datasets, i.e., the Twitter spam dataset, the malware URLs dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that we can use as little as 10% of the original labeled data yet achieve classification performance close to, or even better than, that of using 100% labeled data in a supervised way. Conclusion: Based on those results, we would recommend using hyperparameter optimization with semi-supervised learning when dealing with shortages of labeled security data.

Submitted 2 May, 2022; originally announced May 2022.

arXiv:2203.11410 [pdf, other]

Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance Issue

Authors: Rui Shu, Tianpei Xia, Laurie Williams, Tim Menzies

Abstract: Background: Machine learning techniques have been widely used and demonstrate promising performance in many software security tasks such as software vulnerability prediction. However, the class ratio within software vulnerability datasets is often highly imbalanced (since the percentage of observed vulnerabilities is usually very low). Goal: To help security practitioners address software security data class imbalance issues and further help build better prediction models with resampled datasets. Method: We introduce an approach called Dazzle which is an optimized version of conditional Wasserstein Generative Adversarial Networks with gradient penalty (cWGAN-GP). Dazzle explores the architecture hyperparameters of cWGAN-GP with a novel optimizer called Bayesian Optimization. We use Dazzle to generate minority class samples to resample the original imbalanced training dataset. Results: We evaluate Dazzle with three software security datasets, i.e., Moodle vulnerable files, Ambari bug reports, and JavaScript function code. We show that Dazzle is practical to use and demonstrates promising improvement over existing state-of-the-art oversampling techniques such as SMOTE (e.g., with an average of about 60% improvement rate over SMOTE in recall among all datasets). Conclusion: Based on this study, we would suggest the use of optimized GANs as an alternative method for security vulnerability data class imbalance issues.

Submitted 2 May, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

arXiv:2202.01322 [pdf, other]

How to Improve Deep Learning for Software Analytics (a case study with code smell detection)

Authors: Rahul Yedida, Tim Menzies

Abstract: To reduce technical debt and make code more maintainable, it is important to be able to warn programmers about code smells. State-of-the-art code smell detectors use deep learners, without much exploration of alternatives within that technology. One promising alternative for software analytics and deep learning is GHOST (from TSE'21) that relies on a combination of hyper-parameter optimization of feedforward neural networks and a novel oversampling technique to deal with class imbalance. The prior study from TSE'21 proposing this novel "fuzzy sampling" was somewhat limited in that the method was tested on defect prediction, but nothing else. Like defect prediction, code smell detection datasets have a class imbalance (which motivated "fuzzy sampling"). Hence, in this work we test if fuzzy sampling is useful for code smell detection. The results of this paper show that we can achieve better than state-of-the-art results on code smell detection with fuzzy oversampling. For example, for "feature envy", we were able to achieve 99+% AUC across all our datasets, and on 8/10 datasets for "misplaced class". While our specific results refer to code smell detection, they do suggest other lessons for other kinds of analytics. For example: (a) try better preprocessing before trying complex learners; (b) include simpler learners as a baseline in software analytics; (c) try "fuzzy sampling" as one such baseline.

Submitted 27 March, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

Comments: Accepted to MSR 2022

arXiv:2201.10592 [pdf, other]

DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt Identification using Semi-Supervised Learning

Authors: Huy Tu, Tim Menzies

Abstract: Keeping track of and managing Self-Admitted Technical Debts (SATDs) is important for maintaining a healthy software project. The current active-learning SATD recognition tool requires manual inspection of 24% of the test comments on average to reach 90% recall. Among all the test comments, about 5% are SATDs. The human experts are then required to read almost five times as many comments as there are SATD comments, which indicates the inefficiency of the tool. Plus, human experts are still prone to error: 95% of the false-positive labels from previous work were actually true positives. To solve the above problems, we propose DebtFree, a two-mode framework based on unsupervised learning for identifying SATDs. In mode1, when the existing training data is unlabeled, DebtFree starts with an unsupervised learner to automatically pseudo-label the programming comments in the training data. In contrast, in mode2 where labels are available with the corresponding training data, DebtFree starts with a pre-processor that identifies the highly prone SATDs from the test dataset. Then, our machine learning model is employed to assist human experts in manually identifying the remaining SATDs. Our experiments on 10 software projects show that both models yield a statistically significant improvement in effectiveness over the state-of-the-art automated and semi-automated models. Specifically, DebtFree can reduce the labeling effort by 99% in mode1 (unlabeled training data), and up to 63% in mode2 (labeled training data) while improving the current active learner's F1 by almost 100% in relative terms.

Submitted 25 January, 2022; originally announced January 2022.

Comments: Accepted at EMSE

arXiv:2112.01598 [pdf, other]

What Not to Test (for Cyber-Physical Systems)

Authors: Xiao Ling, Tim Menzies

Abstract: For simulation-based systems, finding a set of test cases with the least cost by exploring multiple goals is a complex task. Domain-specific optimization goals (e.g. maximize output variance) are useful for guiding the rapid selection of test cases via mutation. But evaluating the selected test cases via mutation (that can distinguish the current program from the mutated systems) is a different goal to domain-specific optimizations. While the optimization goals can be used to guide the mutation analysis, that guidance should be viewed as a weak indicator since it can hurt the mutation effectiveness goals by focusing too much on the optimization goals. Based on the above, this paper proposes DoLesS (Domination with Least Squares Approximation) that selects the minimal and effective test cases by averaging over a coarse-grained grid of the information gained from multiple optimization goals. DoLesS applies an inverted least squares approximation approach to find a minimal set of tests that can distinguish better from worse parts of the optimization goals. When tested on multiple simulation-based systems, DoLesS performs as well as, or even better than, the prior state of the art, while running 80-360 times faster on average (seconds instead of hours).

Submitted 5 May, 2023; v1 submitted 2 December, 2021; originally announced December 2021.

Comments: 17 pages, 5 figures, 7 tables. Accepted by TSE

arXiv:2110.13029 [pdf, other]

Fair Enough: Searching for Sufficient Measures of Fairness

Authors: Suvodeep Majumder, Joymallya Chakraborty, Gina R. Bai, Kathryn T. Stolee, Tim Menzies

Abstract: Testing machine learning software for ethical bias has become a pressing current concern. In response, recent research has proposed a plethora of new fairness metrics, for example, the dozens of fairness metrics in the IBM AIF360 toolkit. This raises the question: How can any fairness tool satisfy such a diverse range of goals? While we cannot completely simplify the task of fairness testing, we can certainly reduce the problem. This paper shows that many of those fairness metrics effectively measure the same thing. Based on experiments using seven real-world datasets, we find that (a) 26 classification metrics can be clustered into seven groups, and (b) four dataset metrics can be clustered into three groups. Further, each reduced set may actually predict different things. Hence, it is no longer necessary (or even possible) to satisfy all fairness metrics. In summary, to simplify the fairness testing problem, we recommend the following steps: (1) determine what type of fairness is desirable (and we offer a handful of such types); then (2) look up those types in our clusters; then (3) just test for one item per cluster.

Submitted 21 March, 2022; v1 submitted 25 October, 2021; originally announced October 2021.

Comments: 8 tables and 1 figure

arXiv:2110.02922

SNEAK: Faster Interactive Search-based SE

Authors: Andre Lustosa, Jaydeep Patel, Venkata Sai Teja Malapati, Tim Menzies

Abstract: When AI tools can generate many solutions, some human preference must be applied to determine which solution is relevant to the current project. One way to find those preferences is interactive search-based software engineering (iSBSE) where humans can influence the search process. This paper argues that when optimizing a model using human-in-the-loop, data mining methods such as our SNEAK tool (that recurses into divisions of the data) perform better than standard iSBSE methods (that mutate multiple candidate solutions over many generations). For our case studies, SNEAK runs faster, asks fewer questions, achieves better solutions (that are within 3% of the best solutions seen in our sample space), and scales to large problems (in our experiments, models with 1000 variables can be explored with half a dozen interactions where, each time, we ask only four questions). Accordingly, we recommend SNEAK as a baseline against which future iSBSE work should be compared. To facilitate that, all our scripts are online at https://github.com/ai-se/sneak.

Submitted 16 January, 2023; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: removal for resubmission under different title and more information

arXiv:2110.01710 [pdf, other]

PyTorrent: A Python Library Corpus for Large-scale Language Models

Authors: Mehdi Bahrami, N. C. Shrikanth, Shade Ruangwan, Lei Liu, Yuji Mizobuchi, Masahiro Fukuyori, Wei-Peng Chen, Kazuki Munakata, Tim Menzies

Abstract: A large-scale collection of both semantic and natural language resources is essential to leverage active Software Engineering research areas such as code reuse and code comprehensibility. Existing machine learning models ingest data from Open Source repositories (like GitHub projects) and forum discussions (like Stackoverflow.com), whereas, in this showcase, we took a step backward to orchestrate a corpus titled PyTorrent that contains 218,814 Python package libraries from PyPI and Anaconda environments. This is because earlier studies have shown that much of the code is redundant and Python packages from these environments are better in quality and are well-documented. PyTorrent enables users (such as data scientists, students, etc.) to build off-the-shelf machine learning models directly without spending months of effort on large infrastructure. The dataset, schema and a pretrained language model are available at: https://github.com/fla-sil/PyTorrent

Submitted 4 October, 2021; originally announced October 2021.

Comments: 10 pages, 2 figures, 5 tables

arXiv:2110.01109 [pdf, other]

FairMask: Better Fairness via Model-based Rebalancing of Protected Attributes

Authors: Kewen Peng, Joymallya Chakraborty, Tim Menzies

Abstract: Context: Machine learning software can generate models that inappropriately discriminate against specific protected social groups (e.g., groups based on gender, ethnicity, etc). Motivated by those results, software engineering researchers have proposed many methods for mitigating those discriminatory effects. While those methods are effective in mitigating bias, few of them can provide explanations of the root cause of bias. Objective: We aim at better detection and mitigation of algorithmic discrimination in machine learning software problems. Method: Here we propose xFAIR, a model-based extrapolation method, that is capable of both mitigating bias and explaining the cause. In our xFAIR approach, protected attributes are represented by models learned from the other independent variables (and these models offer extrapolations over the space between existing examples). We then use the extrapolation models to relabel protected attributes later seen in testing data or deployment time. Our approach aims to offset the biased predictions of the classification model via rebalancing the distribution of protected attributes. Results: The experiments of this paper show that, without compromising (original) model performance, xFAIR can achieve significantly better group and individual fairness (as measured in different metrics) than benchmark methods. Moreover, when compared to another instance-based rebalancing method, our model-based approach shows faster runtime and thus better scalability. Conclusion: Algorithmic decision bias can be removed via extrapolation that smooths away outlier points. As evidence for this, our proposed xFAIR performs better (measured by fairness and performance metrics) than two state-of-the-art fairness algorithms.

Submitted 27 October, 2022; v1 submitted 3 October, 2021; originally announced October 2021.

Comments: 14 pages, 6 figures, 7 tables, accepted by TSE

ACM Class: D.2

arXiv:2109.14569 [pdf, other]

An Expert System for Redesigning Software for Cloud Applications

Authors: Rahul Yedida, Rahul Krishna, Anup Kalia, Tim Menzies, Jin Xiao, Maja Vukovic

Abstract: Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are "brittle"; i.e. if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals. In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization. To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.

Submitted 27 June, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

Comments: version 3

arXiv:2108.09847 [pdf, other]

FRUGAL: Unlocking SSL for Software Analytics

Authors: Huy Tu, Tim Menzies

Abstract: Standard software analytics often involves having a large amount of data with labels in order to commission models with acceptable performance. However, prior work has shown that such requirements can be expensive, taking several weeks to label thousands of commits, and not always available when traversing new research problems and domains. Unsupervised Learning is a promising direction to learn hidden patterns within unlabelled data, which has only been extensively studied in defect prediction. Nevertheless, unsupervised learning can be ineffective by itself and has not been explored in other domains (e.g., static analysis and issue close time). Motivated by this literature gap and technical limitations, we present FRUGAL, a tuned semi-supervised method that builds on a simple optimization scheme that does not require sophisticated (e.g., deep learners) and expensive (e.g., 100% manually labelled data) methods. FRUGAL optimizes the unsupervised learner's configurations (via a simple grid search) while validating our design decision of labelling just 2.5% of the data before prediction. As shown by the experiments of this paper, FRUGAL outperforms the state-of-the-art adoptable static code warning recognizer and issue close time predictor, while reducing the cost of labelling by a factor of 40 (from 100% to 2.5%). Hence we assert that FRUGAL can save considerable effort in data labelling, especially in validating prior work or researching new problems. Based on this work, we suggest that proponents of complex and expensive methods should always baseline such methods against simpler and cheaper alternatives. For instance, a semi-supervised learner like FRUGAL can serve as a baseline to the state-of-the-art software analytics.

Submitted 22 August, 2021; originally announced August 2021.

Comments: Accepted for ASE 2022

arXiv:2108.06821 [pdf, other]

doi 10.1145/3554976

Crowdsourcing the State of the Art(ifacts)

Authors: Maria Teresa Baldassarre, Neil Ernst, Ben Hermann, Tim Menzies, Rahul Yedida

Abstract: In any field, finding the "leading edge" of research is an on-going challenge. Researchers cannot appease reviewers and educators cannot teach to the leading edge of their field if no one agrees on what is the state-of-the-art. Using a novel crowdsourced "reuse graph" approach, we propose here a new method to learn this state-of-the-art. Our reuse graphs are less effort to build and verify than other community monitoring methods (e.g. artifact tracks or citation-based searches). Based on a study of 170 papers from software engineering (SE) conferences in 2020, we have found over 1,600 instances of reuse; i.e., reuse is rampant in SE research. Prior pessimism about a lack of reuse in SE research may have been a result of using the wrong methods to measure the wrong things.

Submitted 15 August, 2021; originally announced August 2021.

Comments: Submitted to Communications ACM

Journal ref: CACM February 2023 (Vol. 66, No. 2)

arXiv:2107.08310 [pdf, other]

FairBalance: How to Achieve Equalized Odds With Data Pre-processing

Authors: Zhe Yu, Joymallya Chakraborty, Tim Menzies

Abstract: This research seeks to benefit the software engineering society by providing a simple yet effective pre-processing approach to achieve equalized odds fairness in machine learning software. Fairness issues have attracted increasing attention since machine learning software is increasingly used for high-stakes and high-risk decisions. Amongst all the existing fairness notions, this work specifically targets "equalized odds" given its advantage in always allowing perfect classifiers. Equalized odds requires that members of every demographic group do not receive disparate mistreatment. Prior works either optimize for an equalized odds related metric during the learning process like a black-box, or manipulate the training data following some intuition. This work studies the root cause of the violation of equalized odds and how to tackle it. We found that equalizing the class distribution in each demographic group with sample weights is a necessary condition for achieving equalized odds without modifying the normal training process. In addition, an important partial condition for equalized odds (zero average odds difference) can be guaranteed when the class distributions are weighted to be not only equal but also balanced (1:1). Based on these analyses, we proposed FairBalance, a pre-processing algorithm which balances the class distribution in each demographic group by assigning calculated weights to the training data. On eight real-world datasets, our empirical results show that, at low computational overhead, the proposed pre-processing algorithm FairBalance can significantly improve equalized odds without much, if any, damage to the utility. FairBalance also outperforms existing state-of-the-art approaches in terms of equalized odds. To facilitate reuse, reproduction, and validation, we made our scripts available at https://github.com/hil-se/FairBalance.

Submitted 26 April, 2023; v1 submitted 17 July, 2021; originally announced July 2021.

Comments: 13 pages
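
The weighting idea in the abstract above (balancing the class distribution inside each demographic group via sample weights) can be illustrated with a minimal sketch. The formula used here, group size divided by the size of the (group, class) cell, is an assumption chosen because it yields a 1:1 weighted class ratio within every group; it is not taken from the paper, and the function names are hypothetical.

```python
from collections import Counter

def balance_weights(groups, labels):
    """Hypothetical helper: one weight per sample so that, within each
    group, every class ends up with the same total weight (1:1 balance).
    Assumed formula: |group| / |group and class|."""
    group_counts = Counter(groups)
    cell_counts = Counter(zip(groups, labels))
    return [group_counts[g] / cell_counts[(g, y)]
            for g, y in zip(groups, labels)]

def weighted_class_totals(groups, labels, weights):
    """Sum the weights falling into each (group, class) cell."""
    totals = Counter()
    for g, y, w in zip(groups, labels, weights):
        totals[(g, y)] += w
    return dict(totals)

# Toy data: group "a" is 1 positive to 3 negatives, group "b" is 2 to 1.
groups = ["a", "a", "a", "a", "b", "b", "b"]
labels = [1, 0, 0, 0, 1, 1, 0]
weights = balance_weights(groups, labels)
totals = weighted_class_totals(groups, labels, weights)
```

Weights of this kind could then be passed to any learner that accepts per-sample weights, leaving the training procedure itself unchanged.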

arXiv:2107.05088 [pdf, other]

Fairer Software Made Easier (using "Keys")

Authors: Tim Menzies, Kewen Peng, Andre Lustosa

Abstract: Can we simplify explanations for software analytics? Maybe. Recent results show that systems often exhibit a "keys effect"; i.e. a few key features control the rest. Just to say the obvious, for systems controlled by a few keys, explanation and control is just a matter of running a handful of "what-if" queries across the keys. By exploiting the keys effect, it should be possible to dramatically simplify even complex explanations, such as those required for ethical AI systems.

Submitted 11 July, 2021; originally announced July 2021.

Comments: Submitted to NIER ASE 2021 (new ideas, emerging research)

arXiv:2106.06652 [pdf, ps, other]

Lessons learned from hyper-parameter tuning for microservice candidate identification

Authors: Rahul Yedida, Rahul Krishna, Anup Kalia, Tim Menzies, Jin Xiao, Maja Vukovic

Abstract: When optimizing software for the cloud, monolithic applications need to be partitioned into many smaller microservices. While many tools have been proposed for this task, we warn that the evaluation of those approaches has been incomplete; e.g. minimal prior exploration of hyperparameter optimization. Using a set of open source Java EE applications, we show here that (a) such optimization can significantly improve microservice partitioning; and that (b) an open issue for future work is how to find which optimizer works best for different problems. To facilitate that future work, see https://github.com/yrahul3910/ase-tuned-mono2micro for a reproduction package for this research.

Submitted 10 August, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

Comments: Accepted to ASE 2021 (industry track, short paper)

arXiv:2106.03792

Preference Discovery in Large Product Lines

Authors: Andre Lustosa, Tim Menzies

Abstract: When AI tools can generate many solutions, some human preference must be applied to determine which solution is relevant to the current project. One way to find those preferences is interactive search-based software engineering (iSBSE) where humans can influence the search process. Current iSBSE methods can lead to cognitive fatigue (when they overwhelm humans with too many overly elaborate questions). WHUN is an iSBSE algorithm that avoids that problem. Due to its recursive clustering procedure, WHUN only pesters humans for $O(\log_2 N)$ interactions. Further, each interaction is mediated via a feature selection procedure that reduces the number of asked questions. When compared to prior state-of-the-art iSBSE systems, WHUN runs faster, asks fewer questions, and achieves better solutions that are within 0.1% of the best solutions seen in our sample space. More importantly, WHUN scales to large problems (in our experiments, models with 1000 variables can be explored with half a dozen interactions where, each time, we ask only four questions). Accordingly, we recommend WHUN as a baseline against which future iSBSE work should be compared. To facilitate that, all our scripts are online at https://github.com/ai-se/whun.

Submitted 16 January, 2023; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: Reformatting and republishing of the paper under a different name

arXiv:2106.02716 [pdf, other]

VEER: Enhancing the Interpretability of Model-based Optimizations

Authors: Kewen Peng, Christian Kaltenecker, Norbert Siegmund, Sven Apel, Tim Menzies

Abstract: Many software systems can be tuned for multiple objectives (e.g., faster runtime, less required memory, less network traffic or energy consumption, etc.). Optimizers built for different objectives suffer from "model disagreement"; i.e., they have different (or even opposite) insights and tactics on how to optimize a system. Model disagreement is rampant (at least for configuration problems). Yet prior to this paper, it has barely been explored. This paper shows that model disagreement can be mitigated via VEER, a one-dimensional approximation to the N-objective space. Since it is exploring a simpler goal space, VEER runs very fast (for eleven configuration problems). Even for our largest problem (with tens of thousands of possible configurations), VEER finds as good or better optimizations with zero model disagreements, three orders of magnitude faster (since its one-dimensional output no longer needs the sorting procedure). Based on the above, we recommend VEER as a very fast method to solve complex configuration problems, while at the same time avoiding model disagreement.

Submitted 12 February, 2023; v1 submitted 4 June, 2021; originally announced June 2021.

Comments: 27 pages, 7 figures, 4 tables, accepted by EMSE

ACM Class: D.2; K.6.3

arXiv:2105.12195 [pdf, other]

doi 10.1145/3468264.3468537

Bias in Machine Learning Software: Why? How? What to do?

Authors: Joymallya Chakraborty, Suvodeep Majumder, Tim Menzies

Abstract: Increasingly, software is making autonomous decisions in cases of criminal sentencing, approving credit cards, hiring employees, and so on. Some of these decisions show bias and adversely affect certain social groups (e.g. those defined by sex, race, age, marital status). Many prior works on bias mitigation take the following form: change the data or learners in multiple ways, then see if any of that improves fairness. Perhaps a better approach is to postulate root causes of bias and then apply some resolution strategy. This paper postulates that the root causes of bias are the prior decisions that affect (a) what data was selected and (b) the labels assigned to those examples. Our Fair-SMOTE algorithm removes biased labels and rebalances internal distributions such that, based on the sensitive attribute, examples are equal in both positive and negative classes. On testing, it was seen that this method was just as effective at reducing bias as prior approaches. Further, models generated via Fair-SMOTE achieve higher performance (measured in terms of recall and F1) than other state-of-the-art fairness improvement algorithms. To the best of our knowledge, measured in terms of number of analyzed learners and datasets, this study is one of the largest studies on bias mitigation yet presented in the literature.

Submitted 9 July, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

Journal ref: ESEC/FSE'2021: The 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Athens, Greece, August 23-28, 2021

arXiv:2105.11082 [pdf, other]

Assessing the Early Bird Heuristic (for Predicting Project Quality)

Authors: N. C. Shrikanth, Tim Menzies

Abstract: Before researchers rush to reason across all available data or try complex methods, perhaps it is prudent to first check for simpler alternatives. Specifically, if the historical data has the most information in some small region, perhaps a model learned from that region would suffice for the rest of the project. To support this claim, we offer a case study with 240 projects, where we find that the information in those projects "clumps" towards the earliest parts of the project. A quality prediction model learned from just the first 150 commits works as well as, or better than, state-of-the-art alternatives. Using just this "early bird" data, we can build models very quickly and very early in the project life cycle. Moreover, using this early bird method, we have shown that a simple model (with just a few features) generalizes to hundreds of projects. Based on this experience, we suspect that prior work on generalizing quality models may have needlessly complicated an inherently simple process. Further, prior work that focused on later-life cycle data needs to be revisited since their conclusions were drawn from relatively uninformative regions. Replication note: all our data and scripts are available here: https://github.com/snaraya7/early-bird

Submitted 11 January, 2023; v1 submitted 23 May, 2021; originally announced May 2021.

Comments: 38 pages (Accepted TOSEM Jan 2023)

arXiv:2103.12221 [pdf, other]

Mining Scientific Workflows for Anomalous Data Transfers

Authors: Huy Tu, George Papadimitriou, Mariam Kiran, Cong Wang, Anirban Mandal, Ewa Deelman, Tim Menzies

Abstract: Modern scientific workflows are data-driven and are often executed on distributed, heterogeneous, high-performance computing infrastructures. Anomalies and failures in the workflow execution cause loss of scientific productivity and inefficient use of the infrastructure. Hence, detecting, diagnosing, and mitigating these anomalies are immensely important for reliable and performant scientific workflows. Since these workflows rely heavily on high-performance network transfers that require strict QoS constraints, accurately detecting anomalous network performance is crucial to ensure reliable and efficient workflow execution. To address this challenge, we have developed X-FLASH, a network anomaly detection tool for faulty TCP workflow transfers. X-FLASH incorporates novel hyperparameter tuning and data mining approaches for improving the performance of the machine learning algorithms to accurately classify the anomalous TCP packets. X-FLASH leverages XGBoost as an ensemble model and couples XGBoost with a sequential optimizer, FLASH, borrowed from search-based Software Engineering to learn the optimal model parameters. X-FLASH found configurations that outperformed the existing approach by up to 28%, 29%, and 40% (relative) for F-measure, G-score, and recall, respectively, in fewer than 30 evaluations. Given (1) the large improvement and (2) the simplicity of the tuning, we recommend that future research adopt an additional tuning study as a new standard, at least in the area of scientific workflow anomaly detection.

Submitted 22 March, 2021; originally announced March 2021.

Comments: Accepted for MSR 2021: Working Conference on Mining Software Repositories (https://2021.msrconf.org/details/msr-2021-technical-papers/1/Mining-Workflows-for-Anomalous-Data-Transfers)

arXiv:2103.05088 [pdf, other]

Structuring a Comprehensive Software Security Course Around the OWASP Application Security Verification Standard

Authors: Sarah Elder, Nusrat Zahan, Val Kozarev, Rui Shu, Tim Menzies, Laurie Williams

Abstract: Lack of security expertise among software practitioners is a problem with many implications. First, there is a deficit of security professionals to meet current needs. Additionally, even practitioners who do not plan to work in security may benefit from increased understanding of security. The goal of this paper is to aid software engineering educators in designing a comprehensive software security course by sharing an experience running a software security course for the eleventh time. Throughout all eleven years of running the software security course, the course objectives have been comprehensive, ranging from security testing, to secure design and coding, to security requirements, to security risk management. For the first time in this eleventh year, a theme of the course assignments was to map vulnerability discovery to the security controls of the Open Web Application Security Project (OWASP) Application Security Verification Standard (ASVS). Based upon student performance on a final exploratory penetration testing project, this mapping may have increased students' depth of understanding of a wider range of security topics. The students efficiently detected 191 unique and verified vulnerabilities of 28 different Common Weakness Enumeration (CWE) types during a three-hour period in the OpenMRS project, an electronic health record application in active use.

Submitted 8 March, 2021; originally announced March 2021.

Comments: 10 pages, 5 figures, 1 table, submitted to International Conference on Software Engineering: Joint Track on Software Engineering Education and Training (ICSE-JSEET)

ACM Class: K.3.0; D.2.0; K.6.5

arXiv:2101.06319 [pdf, other]

Old but Gold: Reconsidering the value of feedforward learners for software analytics

Authors: Rahul Yedida, Xueqi Yang, Tim Menzies

Abstract: There has been an increased interest in the use of deep learning approaches for software analytics tasks. State-of-the-art techniques leverage modern deep learning techniques such as LSTMs, yielding competitive performance, albeit at the price of longer training times. Recently, Galke and Scherp [18] showed that at least for image recognition, a decades-old feedforward neural network can match the performance of modern deep learning techniques. This motivated us to try the same in the SE literature. Specifically, in this paper, we apply feedforward networks with some preprocessing to two analytics tasks: issue close time prediction, and vulnerability detection. We test the hypothesis laid out by Galke and Scherp [18], that feedforward networks suffice for many analytics tasks (which we call the "Old but Gold" hypothesis), for these two tasks. For three out of five datasets from these tasks, we achieve new high-water mark results (that out-perform the prior state-of-the-art results) and for a fourth data set, Old but Gold performed as well as the recent state of the art. Furthermore, the Old but Gold results were obtained orders of magnitude faster than prior work. For example, for issue close time, Old but Gold found good predictors in 90 seconds (as opposed to the newer methods, which took 6 hours to run). Our results support the "Old but Gold" hypothesis and lead to the following recommendation: try simpler alternatives before more complex methods. At the very least, this will produce a baseline result against which researchers can compare some other, supposedly more sophisticated, approach. And in the best case, they will obtain useful results that are as good as anything else, in a small fraction of the effort. To support open science, all our scripts and data are available on-line at https://github.com/fastidiouschipmunk/simple.

Submitted 5 February, 2022; v1 submitted 15 January, 2021; originally announced January 2021.

Comments: v2

arXiv:2101.02817 [pdf, other]

Faster SAT Solving for Software with Repeated Structures (with Case Studies on Software Test Suite Minimization)

Authors: Jianfeng Chen, Xipeng Shen, Tim Menzies

Abstract: Theorem provers have been used extensively in software engineering for software testing or verification. However, software is now so large and complex that additional architecture is needed to guide theorem provers as they try to generate test suites. The SNAP test suite generator (introduced in this paper) combines the Z3 theorem prover with the following tactic: cluster some candidate tests, then search for valid tests by proposing small mutations to the cluster centroids. This technique effectively removes repeated structures in the tests since many repeated structures can be replaced with one centroid. In practice, SNAP is remarkably effective. For 27 real-world programs with up to half a million variables, SNAP found test suites which were 10 to 750 times smaller than those found by the prior state-of-the-art. Also, SNAP ran orders of magnitude faster and (unlike prior work) generated 100% valid tests.

Submitted 7 January, 2021; originally announced January 2021.

Comments: Submitted to Journal Software and Systems. arXiv admin note: substantial text overlap with arXiv:1905.05358

arXiv:2011.13071 [pdf, other]

Early Life Cycle Software Defect Prediction. Why? How?

Authors: N. C. Shrikanth, Suvodeep Majumder, Tim Menzies

Abstract: Many researchers assume that, for software analytics, "more data is better." We write to show that, at least for learning defect predictors, this may not be true. To demonstrate this, we analyzed hundreds of popular GitHub projects. These projects ran for 84 months and contained 3,728 commits (median values). Across these projects, most of the defects occur very early in their life cycle. Hence, defect predictors learned from the first 150 commits and four months perform just as well as anything else. This means that, at least for the projects studied here, after the first few months, we need not continually update our defect prediction models. We hope these results inspire other researchers to adopt a "simplicity-first" approach to their work. Some domains require a complex and data-hungry analysis. But before assuming complexity, it is prudent to check the raw data looking for "short cuts" that can simplify the analysis.

Submitted 8 February, 2021; v1 submitted 25 November, 2020; originally announced November 2020.

Comments: 12 pages (To appear ICSE 2021)

arXiv:2011.12720 [pdf, other]

Omni: Automated Ensemble with Unexpected Models against Adversarial Evasion Attack

Authors: Rui Shu, Tianpei Xia, Laurie Williams, Tim Menzies

Abstract: Background: Machine learning-based security detection models have become prevalent in modern malware and intrusion detection systems. However, previous studies show that such models are susceptible to adversarial evasion attacks. In this type of attack, inputs (i.e., adversarial examples) are specially crafted by intelligent malicious adversaries, with the aim of being misclassified by existing state-of-the-art models (e.g., deep neural networks). Once the attackers can fool a classifier into thinking that a malicious input is actually benign, they can render a machine learning-based malware or intrusion detection system ineffective. Goal: To help security practitioners and researchers build a more robust model against non-adaptive, white-box, and non-targeted adversarial evasion attacks through the idea of an ensemble model. Method: We propose an approach called Omni, the main idea of which is to explore methods that create an ensemble of "unexpected models"; i.e., models whose control hyperparameters have a large distance to the hyperparameters of an adversary's target model, with which we then make an optimized weighted ensemble prediction. Result: In studies with five types of adversarial evasion attacks (FGSM, BIM, JSMA, DeepFool, and Carlini-Wagner) on five security datasets (NSL-KDD, CIC-IDS-2017, CSE-CIC-IDS2018, CICAnd-Mal2017, and the Contagio PDF dataset), we show Omni is a promising approach as a defense strategy against adversarial attacks when compared with other baseline treatments. Conclusion: When employing ensemble defense against adversarial evasion attacks, we suggest creating an ensemble with unexpected models that are distant from the attacker's expected model (i.e., target model) through methods such as hyperparameter optimization.

Submitted 12 October, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

Comments: Submitted to EMSE

arXiv:2010.03525 [pdf]

Empirical Standards for Software Engineering Research

Authors: Paul Ralph, Nauman bin Ali, Sebastian Baltes, Domenico Bianculli, Jessica Diaz, Yvonne Dittrich, Neil Ernst, Michael Felderer, Robert Feldt, Antonio Filieri, Breno Bernard Nicolau de França, Carlo Alberto Furia, Greg Gay, Nicolas Gold, Daniel Graziotin, Pinjia He, Rashina Hoda, Natalia Juristo, Barbara Kitchenham, Valentina Lenarduzzi, Jorge Martínez, Jorge Melegati, Daniel Mendez, Tim Menzies, Jefferson Molleri, et al. (18 additional authors not shown)

Abstract: Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.

Submitted 4 March, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: For the complete standards, supplements and other resources, see https://github.com/acmsigsoft/EmpiricalStandards

arXiv:2008.09569 [pdf, other]

doi 10.1007/s10664-021-10068-4

Revisiting Process versus Product Metrics: a Large Scale Analysis

Authors: Suvodeep Majumder, Pranav Mody, Tim Menzies

Abstract: Numerous methods can build predictive models from software data. However, what methods and conclusions should we endorse as we move from analytics in-the-small (dealing with a handful of projects) to analytics in-the-large (dealing with hundreds of projects)? To answer this question, we recheck prior small-scale results (about process versus product metrics for defect prediction and the granularity of metrics) using 722,471 commits from 700 GitHub projects. We find that some analytics in-the-small conclusions still hold when scaling up to analytics in-the-large. For example, like prior work, we see that process metrics are better predictors for defects than product metrics (best process/product-based learners respectively achieve recalls of 98%/44% and AUCs of 95%/54%, median values). That said, we warn that it is unwise to trust metric importance results from analytics in-the-small studies since those change dramatically when moving to analytics in-the-large. Also, when reasoning in-the-large about hundreds of projects, it is better to use predictions from multiple models (since single model predictions can become confused and exhibit a high variance).

Submitted 26 October, 2021; v1 submitted 21 August, 2020; originally announced August 2020.

Comments: 36 pages, 12 figures and 5 tables

Journal ref: Empirical Software Engineering, Volume 27, Issue 3, May 2022

arXiv:2008.07334

Simpler Hyperparameter Optimization for Software Analytics: Why, How, When?

Authors: Amritanshu Agrawal, Xueqi Yang, Rishabh Agrawal, Xipeng Shen, Tim Menzies

Abstract: How can we make software analytics simpler and faster? One method is to match the complexity of analysis to the intrinsic complexity of the data being explored. For example, hyperparameter optimizers find the control settings for data miners that improve the predictions generated via software analytics. Sometimes, very fast hyperparameter optimization can be achieved by just DODGE-ing away from things tried before. But when is it wise to use DODGE and when must we use more complex (and much slower) optimizers? To answer this, we applied hyperparameter optimization to 120 SE data sets that explored bad smell detection, predicting GitHub issue close time, bug report analysis, defect prediction, and dozens of other non-SE problems. We find that DODGE works best for data sets with low "intrinsic dimensionality" (D = 3) and very poorly for higher-dimensional data (D over 8). Nearly all the SE data seen here was intrinsically low-dimensional, indicating that DODGE is applicable for many SE analytics tasks.

Submitted 22 April, 2021; v1 submitted 13 August, 2020; originally announced August 2020.

Comments: Made a mistake with my co-author. The current version of this document is their version, arXiv:1912.04061

arXiv:2008.03835 [pdf, other]

On the Value of Oversampling for Deep Learning in Software Defect Prediction

Authors: Rahul Yedida, Tim Menzies

Abstract: One truism of deep learning is that the automatic feature engineering (seen in the first layers of those networks) excuses data scientists from performing tedious manual feature engineering prior to running DL. For the specific case of deep learning for defect prediction, we show that that truism is false. Specifically, when we preprocess data with a novel oversampling technique called fuzzy sampling, as part of a larger pipeline called GHOST (Goal-oriented Hyper-parameter Optimization for Scalable Training), then we can do significantly better than the prior DL state of the art in 14/20 defect data sets. Our approach yields state-of-the-art results significantly faster deep learners. These results present a cogent case for the use of oversampling prior to applying deep learning on software defect prediction datasets.

Submitted 20 April, 2021; v1 submitted 9 August, 2020; originally announced August 2020.

Comments: v3, revision 2 (minor revision); submitted to TSE

arXiv:2008.00612 [pdf, other]

How Different is Test Case Prioritization for Open and Closed Source Projects?

Authors: Xiao Ling, Rishabh Agrawal, Tim Menzies

Abstract: Improved test case prioritization means that software developers can detect and fix more software faults sooner than usual. But is there one "best" prioritization algorithm? Or do different kinds of projects deserve special kinds of prioritization? To answer these questions, this paper applies nine prioritization schemes to 31 projects that range from (a) highly rated open-source GitHub projects to (b) computational science software to (c) a closed-source project. We find that prioritization approaches that work best for open-source projects can work worst for the closed-source project (and vice versa). From these experiments, we conclude that (a) it is ill-advised to always apply one prioritization scheme to all projects since (b) prioritization requires tuning to different project types.

Submitted 20 February, 2021; v1 submitted 2 August, 2020; originally announced August 2020.

Comments: 15 pages, 4 figures, 16 tables, accepted to TSE

arXiv:2007.02893 [pdf, other]

doi 10.1145/3324884.3418932

Making Fair ML Software using Trustworthy Explanation

Authors: Joymallya Chakraborty, Kewen Peng, Tim Menzies

Abstract: Machine learning software is being used in many applications (finance, hiring, admissions, criminal justice) that have a huge social impact. But sometimes the behavior of this software is biased and it shows discrimination based on some sensitive attributes such as sex, race, etc. Prior work concentrated on finding and mitigating bias in ML models. A recent trend is using instance-based model-agnostic explanation methods such as LIME to find bias in the model prediction. Our work concentrates on finding shortcomings of current bias measures and explanation methods. We show how our proposed method based on K nearest neighbors can overcome those shortcomings and find the underlying bias of black-box models. Our results are more trustworthy and helpful for practitioners. Finally, we describe our future framework combining explanation and planning to build fair software.

Submitted 18 August, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Comments: New Ideas and Emerging Results (NIER) track; The 35th IEEE/ACM International Conference on Automated Software Engineering; Melbourne, Australia

Journal ref: ASE 2020: The 35th IEEE/ACM International Conference on Automated Software Engineering, Melbourne, Australia, Mon 21 - Fri 25 September 2020

arXiv:2006.07416 [pdf, other]

Defect Reduction Planning (using TimeLIME)

Authors: Kewen Peng, Tim Menzies

Abstract: Software comes in releases. An implausible change to software is something that has never been changed in prior releases. When planning how to reduce defects, it is better to use plausible changes, i.e., changes with some precedent in the prior releases. To demonstrate these points, this paper compares several defect reduction planning tools. LIME is a local sensitivity analysis tool that can report the fewest changes needed to alter the classification of some code module (e.g., from "defective" to "non-defective"). TimeLIME is a new tool, introduced in this paper, that improves LIME by restricting its plans to just those attributes which change the most within a project. In this study, we compared the performance of LIME and TimeLIME and several other defect reduction planning algorithms. The generated plans were assessed via (a) the similarity scores between the proposed code changes and the real code changes made by developers; and (b) the improvement scores seen within projects that followed the plans. For nine project trials, we found that TimeLIME outperformed all other algorithms (in 8 out of 9 trials). Hence, we strongly recommend using past releases as a source of knowledge for computing fixes for new releases (using TimeLIME). Apart from these specific results about planning defect reductions and TimeLIME, the more general point of this paper is that our community should be more careful about using off-the-shelf AI tools, without first applying SE knowledge. In this case study, it was not difficult to augment a standard AI algorithm with SE knowledge (that past releases are a good source of knowledge for planning defect reductions). As shown here, once that SE knowledge is applied, this can result in dramatically better systems.

Submitted 15 February, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: 15 pages, 5 figures, 12 tables, accepted by TSE. arXiv admin note: substantial text overlap with arXiv:2003.06887

arXiv:2006.07240 [pdf, other]

Predicting Health Indicators for Open Source Projects (using Hyperparameter Optimization)

Authors: Tianpei Xia, Wei Fu, Rui Shu, Rishabh Agrawal, Tim Menzies

Abstract: Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual development activity may be random and hard to predict, development behavior at the project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes. Algorithms like k-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. But that error rate can be greatly reduced using hyperparameter optimization. To the best of our knowledge, this is the largest study yet conducted, using recent data for predicting multiple health indicators of open-source projects.

Submitted 17 March, 2022; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: Accepted to EMSE 2022

arXiv:2006.05060 [pdf, other]

doi 10.1007/s10664-021-09957-5

Assessing Practitioner Beliefs about Software Engineering

Authors: N. C. Shrikanth, William Nichols, Fahmid Morshed Fahid, Tim Menzies

Abstract: Software engineering is a highly dynamic discipline. Hence, as times change, so too might our beliefs about core processes in this field. This paper checks five beliefs that originated in the past decades that comment on the relationships between (i) developer productivity; (ii) software quality and (iii) years of developer experience. Using data collected from 1,356 developers in the period 1995 to 2006, we found support for only one of the five beliefs, titled "Quality entails productivity". We found no clear support for the four other beliefs based on programming languages and software developers. However, from the sporadic evidence for the four other beliefs we learned that a narrow scope could delude practitioners into misinterpreting certain effects as holding in their day-to-day work. Lastly, through an aggregated view of assessing the five beliefs, we find programming languages act as a confounding factor for developer productivity and software quality. Thus the overall message of this work is that it is both important and possible to revisit old beliefs in SE. Researchers and practitioners should routinely retest old beliefs.

Submitted 24 May, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

Comments: 32 pages, published https://link.springer.com/article/10.1007/s10664-021-09957-5

arXiv:2006.00444 [pdf, other]

Learning to Recognize Actionable Static Code Warnings (is Intrinsically Easy)

Authors: Xueqi Yang, Jianfeng Chen, Rahul Yedida, Zhe Yu, Tim Menzies

Abstract: Static code warning tools often generate warnings that programmers ignore. Such tools can be made more useful via data mining algorithms that select the "actionable" warnings; i.e. the warnings that are usually not ignored. In this paper, we look for actionable warnings within a sample of 5,675 actionable warnings seen in 31,058 static code warnings from FindBugs. We find that data mining algorithms can find actionable warnings with remarkable ease. Specifically, a range of data mining methods (deep learners, random forests, decision tree learners, and support vector machines) all achieved very good results (recalls and AUC (TRN, TPR) measures usually over 95% and false alarms usually under 5%). Given that all these learners succeeded so easily, it is appropriate to ask if there is something about this task that is inherently easy. We report that while our data sets have up to 58 raw features, those features can be approximated by fewer than two underlying dimensions. For such intrinsically simple data, many different kinds of learners can generate useful models with similar performance. Based on the above, we conclude that learning to recognize actionable static code warnings is easy, using a wide range of learning algorithms, since the underlying data is intrinsically simple. If we had to pick one particular learner for this task, we would suggest linear SVMs (since, at least in our sample, that learner ran relatively quickly and achieved the best median performance) and we would not recommend deep learning (since this data is intrinsically very simple).

Submitted 10 January, 2021; v1 submitted 31 May, 2020; originally announced June 2020.

Comments: 24 pages, 5 figures, 7 tables, accepted to Empirical Software Engineering and to appear

Showing 1–50 of 120 results for author: Menzies, T