Showing 1–50 of 120 results for author: Menzies, T

  1. arXiv:2405.12920  [pdf, ps, other]

    cs.SE

    Streamlining Software Reviews: Efficient Predictive Modeling with Minimal Examples

    Authors: Tim Menzies, Andre Lustosa

    Abstract: This paper proposes a new challenge problem for software analytics. In the process we shall call "software review", a panel of SMEs (subject matter experts) review examples of software behavior to recommend how to improve that software's operation. SME time is usually extremely limited so, ideally, this panel can complete this optimization task after looking at just a small number of very inform…

    Submitted 21 May, 2024; originally announced May 2024.

  2. arXiv:2401.09622  [pdf, other]

    cs.SE cs.LG

    SMOOTHIE: A Theory of Hyper-parameter Optimization for Software Analytics

    Authors: Rahul Yedida, Tim Menzies

    Abstract: Hyper-parameter optimization is the black art of tuning a learner's control parameters. In software analytics, a repeated result is that such tuning can result in dramatic performance improvements. Despite this, hyper-parameter optimization is often applied rarely or poorly in software analytics--perhaps because the CPU cost of exploring all those parameter options can be prohibitive. We theorize…

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: v1

  3. arXiv:2401.01883  [pdf, other]

    cs.CR cs.IR cs.LG cs.SE

    Mining Temporal Attack Patterns from Cyberthreat Intelligence Reports

    Authors: Md Rayhanur Rahman, Brandon Wroblewski, Quinn Matthews, Brantley Morgan, Tim Menzies, Laurie Williams

    Abstract: Defending against cyberattacks requires practitioners to operate on high-level adversary behavior. Cyberthreat intelligence (CTI) reports on past cyberattack incidents describe the chain of malicious actions with respect to time. To avoid repeating cyberattack incidents, practitioners must proactively identify and defend against recurring chains of actions - which we refer to as temporal attack patter…

    Submitted 3 January, 2024; originally announced January 2024.

    Comments: A modified version of this pre-print is submitted to IEEE Transactions on Software Engineering, and is under review

  4. arXiv:2312.05436  [pdf, other]

    cs.SE

    Trading Off Scalability, Privacy, and Performance in Data Synthesis

    Authors: Xiao Ling, Tim Menzies, Christopher Hazard, Jack Shu, Jacob Beel

    Abstract: Synthetic data has been widely applied in the real world recently. One typical example is the creation of synthetic data for privacy-sensitive datasets. In this scenario, synthetic data substitutes for the real data, which contains private information, and is used for public testing of machine learning models. Another typical example is over-sampling of unbalanced data, in which the synthetic data is ge…

    Submitted 8 December, 2023; originally announced December 2023.

    Comments: 13 pages, 2 figures, 6 tables, submitted to IEEEAccess

  5. arXiv:2310.19125  [pdf, other]

    cs.SE

    iSNEAK: Partial Ordering as Heuristics for Model-Based Reasoning in Software Engineering

    Authors: Andre Lustosa, Tim Menzies

    Abstract: A "partial ordering" is a way to heuristically order a set of examples (partial orderings are a set where, for certain pairs of elements, one precedes the other). While these orderings may only be approximate, they can be useful for guiding a search towards better regions of the data. To illustrate the value of that technique, this paper presents iSNEAK, an incremental human-in-the-loop AI problem…

    Submitted 14 July, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

  6. arXiv:2310.07109  [pdf, other]

    cs.SE

    SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning

    Authors: Xueqi Yang, Mariusz Jakubowski, Kelly Kang, Haojie Yu, Tim Menzies

    Abstract: As software projects rapidly evolve, software artifacts become more complex and the defects behind them get harder to identify. The emerging Transformer-based approaches, though achieving remarkable performance, struggle with long code sequences due to their self-attention mechanism, which scales quadratically with the sequence length. This paper introduces SparseCoder, an innovative approach incorporating…

    Submitted 10 October, 2023; originally announced October 2023.

    Comments: 11 pages, 8 figures, pre-print

  7. Model Review: A PROMISEing Opportunity

    Authors: Tim Menzies

    Abstract: To make models more understandable and correctable, I propose that the PROMISE community pivots to the problem of model review. Over the years, there have been many reports that very simple models can perform exceptionally well. Yet, where are the researchers asking "say, does that mean that we could make software analytics simpler and more comprehensible?" This is an important question, since hum…

    Submitted 6 September, 2023; v1 submitted 3 September, 2023; originally announced September 2023.

    Comments: 5 pages, 1 figure

  8. arXiv:2305.03714  [pdf, other]

    cs.SE

    On the Benefits of Semi-Supervised Test Case Generation for Simulation Models

    Authors: Xiao Ling, Tim Menzies

    Abstract: Testing complex simulation models can be expensive and time consuming. Current state-of-the-art methods that explore this problem are fully-supervised; i.e. they require that all examples are labeled. On the other hand, the GenClu system (introduced in this paper) takes a semi-supervised approach; i.e. (a) only a small subset of information is actually labeled (via simulation) and (b) those labels…

    Submitted 1 December, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: 14 pages, 4 figures, 6 tables, first round review in TSE

  9. arXiv:2302.01997  [pdf, other]

    cs.SE cs.AI

    Less, but Stronger: On the Value of Strong Heuristics in Semi-supervised Learning for Software Analytics

    Authors: Huy Tu, Tim Menzies

    Abstract: In many domains, there are many examples and far fewer labels for those examples; e.g. we may have access to millions of lines of source code, but access to only a handful of warnings about that code. In those domains, semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data. Standard SSL algorithms use "weak" knowledge (i.e. those not based o…

    Submitted 3 February, 2023; originally announced February 2023.

    Comments: Submitting to EMSE

  10. arXiv:2301.10407  [pdf, other]

    cs.SE cs.AI cs.CR

    Don't Lie to Me: Avoiding Malicious Explanations with STEALTH

    Authors: Lauren Alvarez, Tim Menzies

    Abstract: STEALTH is a method for using some AI-generated model without suffering from malicious attacks (i.e. lying) or associated unfairness issues. After recursively bi-clustering the data, the STEALTH system asks the AI model a limited number of queries about class labels. STEALTH asks so few queries (1 per data cluster) that malicious algorithms (a) cannot detect its operation, nor (b) know when to lie.

    Submitted 25 January, 2023; originally announced January 2023.

    Comments: 6 pages, 6 Tables, 3 figures

  11. arXiv:2301.06577  [pdf, other]

    cs.SE cs.LG

    Learning from Very Little Data: On the Value of Landscape Analysis for Predicting Software Project Health

    Authors: Andre Lustosa, Tim Menzies

    Abstract: When data is scarce, software analytics can make many mistakes. For example, consider learning predictors for open source project health (e.g. the number of closed pull requests in twelve months' time). The training data for this task may be very small (e.g. five years of data, collected every month means just 60 rows of training data). The models generated from such tiny data sets can make many pr…

    Submitted 11 October, 2023; v1 submitted 16 January, 2023; originally announced January 2023.

  12. A Tale of Two Cities: Data and Configuration Variances in Robust Deep Learning

    Authors: Guanqin Zhang, Jiankun Sun, Feng Xu, H. M. N. Dilum Bandara, Shiping Chen, Yulei Sui, Tim Menzies

    Abstract: Deep neural networks (DNNs) are widely used in many industries such as image recognition, supply chain, medical diagnosis, and autonomous driving. However, prior work has shown the high accuracy of a DNN model does not imply high robustness (i.e., consistent performances on new and future datasets) because the input data and external environment (e.g., software and model configurations) for a dep…

    Submitted 25 November, 2022; v1 submitted 17 November, 2022; originally announced November 2022.

  13. arXiv:2211.05920  [pdf, other]

    cs.SE cs.LG

    When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors

    Authors: Suvodeep Majumder, Joymallya Chakraborty, Tim Menzies

    Abstract: Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods ha…

    Submitted 15 February, 2024; v1 submitted 10 November, 2022; originally announced November 2022.

    Comments: 36 pages, 10 figures, 5 tables

  14. arXiv:2208.01595  [pdf, other]

    cs.SE cs.CR

    Do I really need all this work to find vulnerabilities? An empirical case study comparing vulnerability detection techniques on a Java application

    Authors: Sarah Elder, Nusrat Zahan, Rui Shu, Monica Metro, Valeri Kozarev, Tim Menzies, Laurie Williams

    Abstract: CONTEXT: Applying vulnerability detection techniques is one of many tasks using the limited resources of a software project. OBJECTIVE: The goal of this research is to assist managers and other decision-makers in making informed choices about the use of software vulnerability detection techniques through an empirical study of the efficiency and effectiveness of four techniques on a Java-based we…

    Submitted 2 August, 2022; originally announced August 2022.

    ACM Class: D.2.5

  15. arXiv:2205.10504  [pdf, other]

    cs.SE cs.LG

    How to Find Actionable Static Analysis Warnings: A Case Study with FindBugs

    Authors: Rahul Yedida, Hong Jin Kang, Huy Tu, Xueqi Yang, David Lo, Tim Menzies

    Abstract: Automatically generated static code warnings suffer from a large number of false alarms. Hence, developers only take action on a small percent of those warnings. To better predict which static code warnings should not be ignored, we suggest that analysts need to look deeper into their algorithms to find choices that better improve the particulars of their specific problem. Specifically, we show he…

    Submitted 23 December, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

    Comments: Accepted to TSE

  16. arXiv:2205.00665  [pdf, other]

    cs.CR cs.SE

    Reducing the Cost of Training Security Classifier (via Optimized Semi-Supervised Learning)

    Authors: Rui Shu, Tianpei Xia, Huy Tu, Laurie Williams, Tim Menzies

    Abstract: Background: Most of the existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In such a paradigm, models need a large amount of labeled data to learn the useful relationships between selected features and the target class. However, such labeled data can be scarce and expen…

    Submitted 2 May, 2022; originally announced May 2022.

  17. arXiv:2203.11410  [pdf, other]

    cs.CR cs.LG cs.SE

    Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance Issue

    Authors: Rui Shu, Tianpei Xia, Laurie Williams, Tim Menzies

    Abstract: Background: Machine learning techniques have been widely used and demonstrate promising performance in many software security tasks such as software vulnerability prediction. However, the class ratio within software vulnerability datasets is often highly imbalanced (since the percentage of observed vulnerability is usually very low). Goal: To help security practitioners address software security d…

    Submitted 2 May, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

  18. arXiv:2202.01322  [pdf, other]

    cs.SE

    How to Improve Deep Learning for Software Analytics (a case study with code smell detection)

    Authors: Rahul Yedida, Tim Menzies

    Abstract: To reduce technical debt and make code more maintainable, it is important to be able to warn programmers about code smells. State-of-the-art code smell detectors use deep learners, without much exploration of alternatives within that technology. One promising alternative for software analytics and deep learning is GHOST (from TSE'21) that relies on a combination of hyper-parameter optimization o…

    Submitted 27 March, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

    Comments: Accepted to MSR 2022

  19. arXiv:2201.10592  [pdf, other]

    cs.SE cs.AI

    DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt Identification using Semi-Supervised Learning

    Authors: Huy Tu, Tim Menzies

    Abstract: Keeping track of and managing Self-Admitted Technical Debts (SATDs) is important for maintaining a healthy software project. The current active-learning SATD recognition tool involves manual inspection of 24% of the test comments on average to reach 90% recall. Among all the test comments, about 5% are SATDs. The human experts are then required to read almost a quintuple of the SATD comments wh…

    Submitted 25 January, 2022; originally announced January 2022.

    Comments: Accepted at EMSE

  20. arXiv:2112.01598  [pdf, other]

    cs.SE

    What Not to Test (for Cyber-Physical Systems)

    Authors: Xiao Ling, Tim Menzies

    Abstract: For simulation-based systems, finding a set of test cases with the least cost by exploring multiple goals is a complex task. Domain-specific optimization goals (e.g. maximize output variance) are useful for guiding the rapid selection of test cases via mutation. But evaluating the selected test cases via mutation (that can distinguish the current program from the mutated systems) is a different go…

    Submitted 5 May, 2023; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: 17 pages, 5 figures, 7 tables. Accepted by TSE

  21. arXiv:2110.13029  [pdf, other]

    cs.LG cs.CY cs.SE

    Fair Enough: Searching for Sufficient Measures of Fairness

    Authors: Suvodeep Majumder, Joymallya Chakraborty, Gina R. Bai, Kathryn T. Stolee, Tim Menzies

    Abstract: Testing machine learning software for ethical bias has become a pressing current concern. In response, recent research has proposed a plethora of new fairness metrics, for example, the dozens of fairness metrics in the IBM AIF360 toolkit. This raises the question: How can any fairness tool satisfy such a diverse range of goals? While we cannot completely simplify the task of fairness testing, we c…

    Submitted 21 March, 2022; v1 submitted 25 October, 2021; originally announced October 2021.

    Comments: 8 tables and 1 figure

  22. arXiv:2110.02922   

    cs.SE

    SNEAK: Faster Interactive Search-based SE

    Authors: Andre Lustosa, Jaydeep Patel, Venkata Sai Teja Malapati, Tim Menzies

    Abstract: When AI tools can generate many solutions, some human preference must be applied to determine which solution is relevant to the current project. One way to find those preferences is interactive search-based software engineering (iSBSE) where humans can influence the search process. This paper argues that when optimizing a model using human-in-the-loop, data mining methods such as our SNEAK tool (t…

    Submitted 16 January, 2023; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: removal for resubmission under different title and more information

  23. arXiv:2110.01710  [pdf, other]

    cs.SE

    PyTorrent: A Python Library Corpus for Large-scale Language Models

    Authors: Mehdi Bahrami, N. C. Shrikanth, Shade Ruangwan, Lei Liu, Yuji Mizobuchi, Masahiro Fukuyori, Wei-Peng Chen, Kazuki Munakata, Tim Menzies

    Abstract: A large scale collection of both semantic and natural language resources is essential to leverage active Software Engineering research areas such as code reuse and code comprehensibility. Existing machine learning models ingest data from Open Source repositories (like GitHub projects) and forum discussions (like Stackoverflow.com), whereas, in this showcase, we took a step backward to orchestrate…

    Submitted 4 October, 2021; originally announced October 2021.

    Comments: 10 pages, 2 figures, 5 tables

  24. arXiv:2110.01109  [pdf, other]

    cs.LG cs.SE

    FairMask: Better Fairness via Model-based Rebalancing of Protected Attributes

    Authors: Kewen Peng, Joymallya Chakraborty, Tim Menzies

    Abstract: Context: Machine learning software can generate models that inappropriately discriminate against specific protected social groups (e.g., groups based on gender, ethnicity, etc.). Motivated by those results, software engineering researchers have proposed many methods for mitigating those discriminatory effects. While those methods are effective in mitigating bias, few of them can provide explanation…

    Submitted 27 October, 2022; v1 submitted 3 October, 2021; originally announced October 2021.

    Comments: 14 pages, 6 figures, 7 tables, accepted by TSE

    ACM Class: D.2

  25. arXiv:2109.14569  [pdf, other]

    cs.LG cs.SE stat.ML

    An Expert System for Redesigning Software for Cloud Applications

    Authors: Rahul Yedida, Rahul Krishna, Anup Kalia, Tim Menzies, Jin Xiao, Maja Vukovic

    Abstract: Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplif…

    Submitted 27 June, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

    Comments: version 3

  26. arXiv:2108.09847  [pdf, other]

    cs.SE cs.LG

    FRUGAL: Unlocking SSL for Software Analytics

    Authors: Huy Tu, Tim Menzies

    Abstract: Standard software analytics often involves having a large amount of data with labels in order to commission models with acceptable performance. However, prior work has shown that such requirements can be expensive, taking several weeks to label thousands of commits, and not always available when traversing new research problems and domains. Unsupervised Learning is a promising direction to learn h…

    Submitted 22 August, 2021; originally announced August 2021.

    Comments: Accepted for ASE 2022

  27. Crowdsourcing the State of the Art(ifacts)

    Authors: Maria Teresa Baldassarre, Neil Ernst, Ben Hermann, Tim Menzies, Rahul Yedida

    Abstract: In any field, finding the "leading edge" of research is an on-going challenge. Researchers cannot appease reviewers and educators cannot teach to the leading edge of their field if no one agrees on what is the state-of-the-art. Using a novel crowdsourced "reuse graph" approach, we propose here a new method to learn this state-of-the-art. Our reuse graphs are less effort to build and verify than…

    Submitted 15 August, 2021; originally announced August 2021.

    Comments: Submitted to Communications ACM

    Journal ref: CACM February 2023 (Vol. 66, No. 2)

  28. arXiv:2107.08310  [pdf, other]

    cs.LG

    FairBalance: How to Achieve Equalized Odds With Data Pre-processing

    Authors: Zhe Yu, Joymallya Chakraborty, Tim Menzies

    Abstract: This research seeks to benefit the software engineering society by providing a simple yet effective pre-processing approach to achieve equalized odds fairness in machine learning software. Fairness issues have attracted increasing attention since machine learning software is increasingly used for high-stakes and high-risk decisions. Amongst all the existing fairness notions, this work specifically…

    Submitted 26 April, 2023; v1 submitted 17 July, 2021; originally announced July 2021.

    Comments: 13 pages

  29. arXiv:2107.05088  [pdf, other]

    cs.SE cs.AI

    Fairer Software Made Easier (using "Keys")

    Authors: Tim Menzies, Kewen Peng, Andre Lustosa

    Abstract: Can we simplify explanations for software analytics? Maybe. Recent results show that systems often exhibit a "keys effect"; i.e. a few key features control the rest. Just to say the obvious, for systems controlled by a few keys, explanation and control is just a matter of running a handful of "what-if" queries across the keys. By exploiting the keys effect, it should be possible to dramatically si…

    Submitted 11 July, 2021; originally announced July 2021.

    Comments: Submitted to NIER ASE 2021 (new ideas, emerging research)

  30. arXiv:2106.06652  [pdf, ps, other]

    cs.SE

    Lessons learned from hyper-parameter tuning for microservice candidate identification

    Authors: Rahul Yedida, Rahul Krishna, Anup Kalia, Tim Menzies, Jin Xiao, Maja Vukovic

    Abstract: When optimizing software for the cloud, monolithic applications need to be partitioned into many smaller *microservices*. While many tools have been proposed for this task, we warn that the evaluation of those approaches has been incomplete; e.g. minimal prior exploration of hyperparameter optimization. Using a set of open source Java EE applications, we show here that (a) such optimization can si…

    Submitted 10 August, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted to ASE 2021 (industry track, short paper)

  31. arXiv:2106.03792   

    cs.SE

    Preference Discovery in Large Product Lines

    Authors: Andre Lustosa, Tim Menzies

    Abstract: When AI tools can generate many solutions, some human preference must be applied to determine which solution is relevant to the current project. One way to find those preferences is interactive search-based software engineering (iSBSE) where humans can influence the search process. Current iSBSE methods can lead to cognitive fatigue (when they overwhelm humans with too many overly elaborate questi…

    Submitted 16 January, 2023; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: Reformatting and republishing of the paper under a different name

  32. arXiv:2106.02716  [pdf, other]

    cs.SE

    VEER: Enhancing the Interpretability of Model-based Optimizations

    Authors: Kewen Peng, Christian Kaltenecker, Norbert Siegmund, Sven Apel, Tim Menzies

    Abstract: Many software systems can be tuned for multiple objectives (e.g., faster runtime, less required memory, less network traffic or energy consumption, etc.). Optimizers built for different objectives suffer from "model disagreement"; i.e., they have different (or even opposite) insights and tactics on how to optimize a system. Model disagreement is rampant (at least for configuration problems). Yet p…

    Submitted 12 February, 2023; v1 submitted 4 June, 2021; originally announced June 2021.

    Comments: 27 pages, 7 figures, 4 tables, accepted by EMSE

    ACM Class: D.2; K.6.3

  33. Bias in Machine Learning Software: Why? How? What to do?

    Authors: Joymallya Chakraborty, Suvodeep Majumder, Tim Menzies

    Abstract: Increasingly, software is making autonomous decisions in case of criminal sentencing, approving credit cards, hiring employees, and so on. Some of these decisions show bias and adversely affect certain social groups (e.g. those defined by sex, race, age, marital status). Many prior works on bias mitigation take the following form: change the data or learners in multiple ways, then see if any of th…

    Submitted 9 July, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

    Journal ref: ESEC/FSE'2021: The 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Athens, Greece, August 23-28, 2021

  34. arXiv:2105.11082  [pdf, other]

    cs.SE cs.AI cs.LG

    Assessing the Early Bird Heuristic (for Predicting Project Quality)

    Authors: N. C. Shrikanth, Tim Menzies

    Abstract: Before researchers rush to reason across all available data or try complex methods, perhaps it is prudent to first check for simpler alternatives. Specifically, if the historical data has the most information in some small region, perhaps a model learned from that region would suffice for the rest of the project. To support this claim, we offer a case study with 240 projects, where we find that…

    Submitted 11 January, 2023; v1 submitted 23 May, 2021; originally announced May 2021.

    Comments: 38 pages (Accepted TOSEM Jan 2023)

  35. arXiv:2103.12221  [pdf, other]

    cs.SE cs.NI

    Mining Scientific Workflows for Anomalous Data Transfers

    Authors: Huy Tu, George Papadimitriou, Mariam Kiran, Cong Wang, Anirban Mandal, Ewa Deelman, Tim Menzies

    Abstract: Modern scientific workflows are data-driven and are often executed on distributed, heterogeneous, high-performance computing infrastructures. Anomalies and failures in the workflow execution cause loss of scientific productivity and inefficient use of the infrastructure. Hence, detecting, diagnosing, and mitigating these anomalies are immensely important for reliable and performant scientific work…

    Submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted for MSR 2021: Working Conference on Mining Software Repositories (https://2021.msrconf.org/details/msr-2021-technical-papers/1/Mining-Workflows-for-Anomalous-Data-Transfers)

  36. arXiv:2103.05088  [pdf, other]

    cs.SE cs.CR

    Structuring a Comprehensive Software Security Course Around the OWASP Application Security Verification Standard

    Authors: Sarah Elder, Nusrat Zahan, Val Kozarev, Rui Shu, Tim Menzies, Laurie Williams

    Abstract: Lack of security expertise among software practitioners is a problem with many implications. First, there is a deficit of security professionals to meet current needs. Additionally, even practitioners who do not plan to work in security may benefit from increased understanding of security. The goal of this paper is to aid software engineering educators in designing a comprehensive software securit…

    Submitted 8 March, 2021; originally announced March 2021.

    Comments: 10 pages, 5 figures, 1 table, submitted to International Conference on Software Engineering: Joint Track on Software Engineering Education and Training (ICSE-JSEET)

    ACM Class: K.3.0; D.2.0; K.6.5

  37. arXiv:2101.06319  [pdf, other]

    cs.SE cs.AI

    Old but Gold: Reconsidering the value of feedforward learners for software analytics

    Authors: Rahul Yedida, Xueqi Yang, Tim Menzies

    Abstract: There has been an increased interest in the use of deep learning approaches for software analytics tasks. State-of-the-art techniques leverage modern deep learning techniques such as LSTMs, yielding competitive performance, albeit at the price of longer training times. Recently, Galke and Scherp [18] showed that at least for image recognition, a decades-old feedforward neural network can match t…

    Submitted 5 February, 2022; v1 submitted 15 January, 2021; originally announced January 2021.

    Comments: v2

  38. arXiv:2101.02817  [pdf, other]

    cs.SE

    Faster SAT Solving for Software with Repeated Structures (with Case Studies on Software Test Suite Minimization)

    Authors: Jianfeng Chen, Xipeng Shen, Tim Menzies

    Abstract: Theorem provers have been used extensively in software engineering for software testing or verification. However, software is now so large and complex that additional architecture is needed to guide theorem provers as they try to generate test suites. The SNAP test suite generator (introduced in this paper) combines the Z3 theorem prover with the following tactic: cluster some candidate tests, then…

    Submitted 7 January, 2021; originally announced January 2021.

    Comments: Submitted to Journal Software and Systems. arXiv admin note: substantial text overlap with arXiv:1905.05358

  39. arXiv:2011.13071  [pdf, other]

    cs.SE cs.LG

    Early Life Cycle Software Defect Prediction. Why? How?

    Authors: N. C. Shrikanth, Suvodeep Majumder, Tim Menzies

    Abstract: Many researchers assume that, for software analytics, "more data is better." We write to show that, at least for learning defect predictors, this may not be true. To demonstrate this, we analyzed hundreds of popular GitHub projects. These projects ran for 84 months and contained 3,728 commits (median values). Across these projects, most of the defects occur very early in their life cycle. Hence, d…

    Submitted 8 February, 2021; v1 submitted 25 November, 2020; originally announced November 2020.

    Comments: 12 pages (To appear ICSE 2021)

  40. arXiv:2011.12720  [pdf, other]

    cs.CR cs.LG

    Omni: Automated Ensemble with Unexpected Models against Adversarial Evasion Attack

    Authors: Rui Shu, Tianpei Xia, Laurie Williams, Tim Menzies

    Abstract: Background: Machine learning-based security detection models have become prevalent in modern malware and intrusion detection systems. However, previous studies show that such models are susceptible to adversarial evasion attacks. In this type of attack, inputs (i.e., adversarial examples) are specially crafted by intelligent malicious adversaries, with the aim of being misclassified by existing st…

    Submitted 12 October, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: Submitted to EMSE

  41. arXiv:2010.03525  [pdf]

    cs.SE cs.GL

    Empirical Standards for Software Engineering Research

    Authors: Paul Ralph, Nauman bin Ali, Sebastian Baltes, Domenico Bianculli, Jessica Diaz, Yvonne Dittrich, Neil Ernst, Michael Felderer, Robert Feldt, Antonio Filieri, Breno Bernard Nicolau de França, Carlo Alberto Furia, Greg Gay, Nicolas Gold, Daniel Graziotin, Pinjia He, Rashina Hoda, Natalia Juristo, Barbara Kitchenham, Valentina Lenarduzzi, Jorge Martínez, Jorge Melegati, Daniel Mendez, Tim Menzies, Jefferson Molleri, et al. (18 additional authors not shown)

    Abstract: Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around resear…

    Submitted 4 March, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

    Comments: For the complete standards, supplements and other resources, see https://github.com/acmsigsoft/EmpiricalStandards

  42. Revisiting Process versus Product Metrics: a Large Scale Analysis

    Authors: Suvodeep Majumder, Pranav Mody, Tim Menzies

    Abstract: Numerous methods can build predictive models from software data. However, what methods and conclusions should we endorse as we move from analytics in-the-small (dealing with a handful of projects) to analytics in-the-large (dealing with hundreds of projects)? To answer this question, we recheck prior small-scale results (about process versus product metrics for defect prediction and the granular…

    Submitted 26 October, 2021; v1 submitted 21 August, 2020; originally announced August 2020.

    Comments: 36 pages, 12 figures and 5 tables

    Journal ref: Empirical Software Engineering, Volume 27, Issue 3, May 2022

  43. arXiv:2008.07334   

    cs.SE

    Simpler Hyperparameter Optimization for Software Analytics: Why, How, When?

    Authors: Amritanshu Agrawal, Xueqi Yang, Rishabh Agrawal, Xipeng Shen, Tim Menzies

    Abstract: How can software analytics be made simpler and faster? One method is to match the complexity of analysis to the intrinsic complexity of the data being explored. For example, hyperparameter optimizers find the control settings for data miners that improve the predictions generated via software analytics. Sometimes, very fast hyperparameter optimization can be achieved by just DODGE-ing aw…

    Submitted 22 April, 2021; v1 submitted 13 August, 2020; originally announced August 2020.

    Comments: Made a mistake with my co-author. The current version of this document is their version, arXiv:1912.04061

  44. arXiv:2008.03835  [pdf, other]

    cs.SE

    On the Value of Oversampling for Deep Learning in Software Defect Prediction

    Authors: Rahul Yedida, Tim Menzies

    Abstract: One truism of deep learning is that the automatic feature engineering (seen in the first layers of those networks) excuses data scientists from performing tedious manual feature engineering prior to running DL. For the specific case of deep learning for defect prediction, we show that that truism is false. Specifically, when we preprocess data with a novel oversampling technique called fuzzy sampl…

    Submitted 20 April, 2021; v1 submitted 9 August, 2020; originally announced August 2020.

    Comments: v3, revision 2 (minor revision); submitted to TSE

  45. arXiv:2008.00612  [pdf, other]

    cs.SE

    How Different is Test Case Prioritization for Open and Closed Source Projects?

    Authors: Xiao Ling, Rishabh Agrawal, Tim Menzies

    Abstract: Improved test case prioritization means that software developers can detect and fix more software faults sooner than usual. But is there one "best" prioritization algorithm? Or do different kinds of projects deserve special kinds of prioritization? To answer these questions, this paper applies nine prioritization schemes to 31 projects that range from (a) highly rated open-source Github projects t…

    Submitted 20 February, 2021; v1 submitted 2 August, 2020; originally announced August 2020.

    Comments: 15 pages, 4 figures, 16 tables, accepted to TSE

  46. Making Fair ML Software using Trustworthy Explanation

    Authors: Joymallya Chakraborty, Kewen Peng, Tim Menzies

    Abstract: Machine learning software is being used in many applications (finance, hiring, admissions, criminal justice) having a huge social impact. But sometimes the behavior of this software is biased and it shows discrimination based on some sensitive attributes such as sex, race, etc. Prior works concentrated on finding and mitigating bias in ML models. A recent trend is using instance-based model-agnost…

    Submitted 18 August, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

    Comments: New Ideas and Emerging Results (NIER) track; The 35th IEEE/ACM International Conference on Automated Software Engineering; Melbourne, Australia

    Journal ref: ASE 2020: The 35th IEEE/ACM International Conference on Automated Software Engineering, Melbourne, Australia, Mon 21 - Fri 25 September 2020

  47. arXiv:2006.07416  [pdf, other]

    cs.SE

    Defect Reduction Planning (using TimeLIME)

    Authors: Kewen Peng, Tim Menzies

    Abstract: Software comes in releases. An implausible change to software is something that has never been changed in prior releases. When planning how to reduce defects, it is better to use plausible changes, i.e., changes with some precedent in the prior releases. To demonstrate these points, this paper compares several defect reduction planning tools. LIME is a local sensitivity analysis tool that can r…

    Submitted 15 February, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: 15 pages, 5 figures, 12 tables, accepted by TSE. arXiv admin note: substantial text overlap with arXiv:2003.06887

  48. arXiv:2006.07240  [pdf, other]

    cs.SE

    Predicting Health Indicators for Open Source Projects (using Hyperparameter Optimization)

    Authors: Tianpei Xia, Wei Fu, Rui Shu, Rishabh Agrawal, Tim Menzies

    Abstract: Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual development activity may be random and hard to predict, development behavior at the project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159…

    Submitted 17 March, 2022; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: Accepted to EMSE 2022

  49. Assessing Practitioner Beliefs about Software Engineering

    Authors: N. C. Shrikanth, William Nichols, Fahmid Morshed Fahid, Tim Menzies

    Abstract: Software engineering is a highly dynamic discipline. Hence, as times change, so too might our beliefs about core processes in this field. This paper checks some five beliefs that originated in the past decades that comment on the relationships between (i) developer productivity; (ii) software quality and (iii) years of developer experience. Using data collected from 1,356 developers in the period…

    Submitted 24 May, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

    Comments: 32 pages, published https://link.springer.com/article/10.1007/s10664-021-09957-5

  50. arXiv:2006.00444  [pdf, other]

    cs.SE

    Learning to Recognize Actionable Static Code Warnings (is Intrinsically Easy)

    Authors: Xueqi Yang, Jianfeng Chen, Rahul Yedida, Zhe Yu, Tim Menzies

    Abstract: Static code warning tools often generate warnings that programmers ignore. Such tools can be made more useful via data mining algorithms that select the "actionable" warnings; i.e. the warnings that are usually not ignored. In this paper, we look for actionable warnings within a sample of 5,675 actionable warnings seen in 31,058 static code warnings from FindBugs. We find that data mining algori…

    Submitted 10 January, 2021; v1 submitted 31 May, 2020; originally announced June 2020.

    Comments: 24 pages, 5 figures, 7 tables, accepted to Empirical Software Engineering (to appear)