subscribe to arXiv mailings

On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

Authors: Zhimin Zhao, Abdul Ali Bangash, Filipe Roseiro Côgo, Bram Adams, Ahmed E. Hassan

Abstract: Foundation models (FM), such as large language models (LLMs), which are large-scale machine learning (ML) models, have demonstrated remarkable adaptability in various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards, especially those hosted on cloud platforms, have become essential tools for SE teams to… ▽ More Foundation models (FM), such as large language models (LLMs), which are large-scale machine learning (ML) models, have demonstrated remarkable adaptability in various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards, especially those hosted on cloud platforms, have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations") and identifying potential leaderboard pitfalls and areas for improvement ("leaderboard smells"). In this regard, we perform a multivocal literature review to collect up to 721 FM leaderboards, after which we examine their documentation and engage in direct communication with leaderboard operators to understand their workflow patterns. Using card sorting and negotiated agreement, we identify 5 unique workflow patterns and develop a domain model that outlines the essential components and their interaction within FM leaderboards. We then identify 8 unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection. △ Less

Submitted 12 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: awesome foundation model leaderboard list: https://github.com/SAILResearch/awesome-foundation-model-leaderboards

arXiv:2405.13852 [pdf, other]

Predicting long time contributors with knowledge units of programming languages: an empirical study

Authors: Md Ahasanuzzaman, Gustavo A. Oliva, Ahmed E. Hassan

Abstract: Predicting potential long-time contributors (LTCs) early allows project maintainers to effectively allocate resources and mentoring to enhance their development and retention. Mapping programming language expertise to developers and characterizing projects in terms of how they use programming languages can help identify developers who are more likely to become LTCs. However, prior studies on predi… ▽ More Predicting potential long-time contributors (LTCs) early allows project maintainers to effectively allocate resources and mentoring to enhance their development and retention. Mapping programming language expertise to developers and characterizing projects in terms of how they use programming languages can help identify developers who are more likely to become LTCs. However, prior studies on predicting LTCs do not consider programming language skills. This paper reports an empirical study on the usage of knowledge units (KUs) of the Java programming language to predict LTCs. A KU is a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. We build a prediction model called KULTC, which leverages KU-based features along five different dimensions. We detect and analyze KUs from the studied 75 Java projects (353K commits and 168K pull requests) as well as 4,219 other Java projects in which the studied developers previously worked (1.7M commits). We compare the performance of KULTC with the state-of-the-art model, which we call BAOLTC. Even though KULTC focuses exclusively on the programming language perspective, KULTC achieves a median AUC of at least 0.75 and significantly outperforms BAOLTC. Combining the features of KULTC with the features of BAOLTC results in an enhanced model (KULTC+BAOLTC) that significantly outperforms BAOLTC with a normalized AUC improvement of 16.5%. Our feature importance analysis with SHAP reveals that developer expertise in the studied project is the most influential feature dimension for predicting LTCs. Finally, we develop a cost-effective model (KULTC_DEV_EXP+BAOLTC) that significantly outperforms BAOLTC. These encouraging results can be helpful to researchers who wish to further study the developers' engagement/retention to FLOSS projects or build models for predicting LTCs. △ Less

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.00796 [pdf, other]

Does Using Bazel Help Speed Up Continuous Integration Builds?

Authors: Shenyu Zheng, Bram Adams, Ahmed E. Hassan

Abstract: A long continuous integration (CI) build forces developers to wait for CI feedback before starting subsequent development activities, leading to time wasted. In addition to a variety of build scheduling and test selection heuristics studied in the past, new artifact-based build technologies like Bazel have built-in support for advanced performance optimizations such as parallel build and increment… ▽ More A long continuous integration (CI) build forces developers to wait for CI feedback before starting subsequent development activities, leading to time wasted. In addition to a variety of build scheduling and test selection heuristics studied in the past, new artifact-based build technologies like Bazel have built-in support for advanced performance optimizations such as parallel build and incremental build (caching of build results). However, little is known about the extent to which new build technologies like Bazel deliver on their promised benefits, especially for long-build duration projects. In this study, we collected 383 Bazel projects from GitHub, then studied their parallel and incremental build usage of Bazel in 4 popular CI services, and compared the results with Maven projects. We conducted 3,500 experiments on 383 Bazel projects and analyzed the build logs of a subset of 70 buildable projects to evaluate the performance impact of Bazel's parallel builds. Additionally, we performed 102,232 experiments on the 70 buildable projects' last 100 commits to evaluate Bazel's incremental build performance. Our results show that 31.23% of Bazel projects adopt a CI service but do not use Bazel in the CI service, while for those who do use Bazel in CI, 27.76% of them use other tools to facilitate Bazel's execution. Compared to sequential builds, the median speedups for long-build duration projects are 2.00x, 3.84x, 7.36x, and 12.80x, at parallelism degrees 2, 4, 8, and 16, respectively, even though, compared to a clean build, applying incremental build achieves a median speedup of 4.22x (with a build system tool-independent CI cache) and 4.71x (with a build system tool-specific cache) for long-build duration projects. Our results provide guidance for developers to improve the usage of Bazel in their projects. △ Less

Submitted 1 May, 2024; originally announced May 2024.

arXiv:2404.19048 [pdf, other]

A Framework for Real-time Safeguarding the Text Generation of Large Language Model

Authors: Ximing Dong, Dayi Lin, Shaowei Wang, Ahmed E. Hassan

Abstract: Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. To address this, various approaches have been developed to safeguard LLMs from producing unsafe content. However, existing methods have limitations, including the need for training specific control models and… ▽ More Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. To address this, various approaches have been developed to safeguard LLMs from producing unsafe content. However, existing methods have limitations, including the need for training specific control models and proactive intervention during text generation, that lead to quality degradation and increased computational overhead. To mitigate those limitations, we propose LLMSafeGuard, a lightweight framework to safeguard LLM text generation in real-time. LLMSafeGuard integrates an external validator into the beam search algorithm during decoding, rejecting candidates that violate safety constraints while allowing valid ones to proceed. We introduce a similarity based validation approach, simplifying constraint introduction and eliminating the need for control model training. Additionally, LLMSafeGuard employs a context-wise timing selection strategy, intervening LLMs only when necessary. We evaluate LLMSafeGuard on two tasks, detoxification and copyright safeguarding, and demonstrate its superior performance over SOTA baselines. For instance, LLMSafeGuard reduces the average toxic score of. LLM output by 29.7% compared to the best baseline meanwhile preserving similar linguistic quality as natural output in detoxification task. Similarly, in the copyright task, LLMSafeGuard decreases the Longest Common Subsequence (LCS) by 56.2% compared to baselines. Moreover, our context-wise timing selection strategy reduces inference time by at least 24% meanwhile maintaining comparable effectiveness as validating each time step. LLMSafeGuard also offers tunable parameters to balance its effectiveness and efficiency. △ Less

Submitted 1 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.10225 [pdf]

Rethinking Software Engineering in the Foundation Model Era: From Task-Driven AI Copilots to Goal-Driven AI Pair Programmers

Authors: Ahmed E. Hassan, Gustavo A. Oliva, Dayi Lin, Boyuan Chen, Zhen Ming, Jiang

Abstract: The advent of Foundation Models (FMs) and AI-powered copilots has transformed the landscape of software development, offering unprecedented code completion capabilities and enhancing developer productivity. However, the current task-driven nature of these copilots falls short in addressing the broader goals and complexities inherent in software engineering (SE). In this paper, we propose a paradig… ▽ More The advent of Foundation Models (FMs) and AI-powered copilots has transformed the landscape of software development, offering unprecedented code completion capabilities and enhancing developer productivity. However, the current task-driven nature of these copilots falls short in addressing the broader goals and complexities inherent in software engineering (SE). In this paper, we propose a paradigm shift towards goal-driven AI-powered pair programmers that collaborate with human developers in a more holistic and context-aware manner. We envision AI pair programmers that are goal-driven, human partners, SE-aware, and self-learning. These AI partners engage in iterative, conversation-driven development processes, aligning closely with human goals and facilitating informed decision-making. We discuss the desired attributes of such AI pair programmers and outline key challenges that must be addressed to realize this vision. Ultimately, our work represents a shift from AI-augmented SE to AI-transformed SE by replacing code completion with a collaborative partnership between humans and AI that enhances both productivity and software quality. △ Less

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2403.18958 [pdf, other]

A State-of-the-practice Release-readiness Checklist for Generative AI-based Software Products

Authors: Harsh Patel, Dominique Boucher, Emad Fallahzadeh, Ahmed E. Hassan, Bram Adams

Abstract: This paper investigates the complexities of integrating Large Language Models (LLMs) into software products, with a focus on the challenges encountered for determining their readiness for release. Our systematic review of grey literature identifies common challenges in deploying LLMs, ranging from pre-training and fine-tuning to user experience considerations. The study introduces a comprehensive… ▽ More This paper investigates the complexities of integrating Large Language Models (LLMs) into software products, with a focus on the challenges encountered for determining their readiness for release. Our systematic review of grey literature identifies common challenges in deploying LLMs, ranging from pre-training and fine-tuning to user experience considerations. The study introduces a comprehensive checklist designed to guide practitioners in evaluating key release readiness aspects such as performance, monitoring, and deployment strategies, aiming to enhance the reliability and effectiveness of LLM-based applications in real-world settings. △ Less

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.17154 [pdf, other]

On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance

Authors: Jaskirat Singh, Bram Adams, Ahmed E. Hassan

Abstract: Deciding what combination of operators to use across the Edge AI tiers to achieve specific latency and model performance requirements is an open question for MLOps engineers. This study aims to empirically assess the accuracy vs inference time trade-off of different black-box Edge AI deployment strategies, i.e., combinations of deployment operators and deployment tiers. In this paper, we conduct i… ▽ More Deciding what combination of operators to use across the Edge AI tiers to achieve specific latency and model performance requirements is an open question for MLOps engineers. This study aims to empirically assess the accuracy vs inference time trade-off of different black-box Edge AI deployment strategies, i.e., combinations of deployment operators and deployment tiers. In this paper, we conduct inference experiments involving 3 deployment operators (i.e., Partitioning, Quantization, Early Exit), 3 deployment tiers (i.e., Mobile, Edge, Cloud) and their combinations on four widely used Computer-Vision models to investigate the optimal strategies from the point of view of MLOps developers. Our findings suggest that Edge deployment using the hybrid Quantization + Early Exit operator could be preferred over non-hybrid operators (Quantization/Early Exit on Edge, Partition on Mobile-Edge) when faster latency is a concern at medium accuracy loss. However, when minimizing accuracy loss is a concern, MLOps engineers should prefer using only a Quantization operator on edge at a latency reduction or increase, respectively over the Early Exit/Partition (on edge/mobile-edge) and Quantized Early Exit (on edge) operators. In scenarios constrained by Mobile CPU/RAM resources, a preference for Partitioning across mobile and edge tiers is observed over mobile deployment. For models with smaller input data samples (such as FCN), a network-constrained cloud deployment can also be a better alternative than Mobile/Edge deployment and Partitioning strategies. For models with large input data samples (ResNet, ResNext, DUC), an edge tier having higher network/computational capabilities than Cloud/Mobile can be a more viable option than Partitioning and Mobile/Cloud deployment strategies. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.10468 [pdf, other]

An Empirical Study on Developers Shared Conversations with ChatGPT in GitHub Pull Requests and Issues

Authors: Huizi Hao, Kazi Amit Hasan, Hong Qin, Marcos Macedo, Yuan Tian, Steven H. H. Ding, Ahmed E. Hassan

Abstract: ChatGPT has significantly impacted software development practices, providing substantial assistance to developers in a variety of tasks, including coding, testing, and debugging. Despite its widespread adoption, the impact of ChatGPT as an assistant in collaborative coding remains largely unexplored. In this paper, we analyze a dataset of 210 and 370 developers shared conversations with ChatGPT in… ▽ More ChatGPT has significantly impacted software development practices, providing substantial assistance to developers in a variety of tasks, including coding, testing, and debugging. Despite its widespread adoption, the impact of ChatGPT as an assistant in collaborative coding remains largely unexplored. In this paper, we analyze a dataset of 210 and 370 developers shared conversations with ChatGPT in GitHub pull requests (PRs) and issues. We manually examined the content of the conversations and characterized the dynamics of the sharing behavior, i.e., understanding the rationale behind the sharing, identifying the locations where the conversations were shared, and determining the roles of the developers who shared them. Our main observations are: (1) Developers seek ChatGPT assistance across 16 types of software engineering inquiries. In both conversations shared in PRs and issues, the most frequently encountered inquiry categories include code generation, conceptual questions, how-to guides, issue resolution, and code review. (2) Developers frequently engage with ChatGPT via multi-turn conversations where each prompt can fulfill various roles, such as unveiling initial or new tasks, iterative follow-up, and prompt refinement. Multi-turn conversations account for 33.2% of the conversations shared in PRs and 36.9% in issues. (3) In collaborative coding, developers leverage shared conversations with ChatGPT to facilitate their role-specific contributions, whether as authors of PRs or issues, code reviewers, or collaborators on issues. Our work serves as the first step towards understanding the dynamics between developers and ChatGPT in collaborative software development and opens up new directions for future research on the topic. △ Less

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.09012 [pdf, other]

Leveraging the Crowd for Dependency Management: An Empirical Study on the Dependabot Compatibility Score

Authors: Benjamin Rombaut, Filipe R. Cogo, Ahmed E. Hassan

Abstract: Dependabot, a popular dependency management tool, includes a compatibility score feature that helps client packages assess the risk of accepting a dependency update by leveraging knowledge from "the crowd". For each dependency update, Dependabot calculates this compatibility score as the proportion of successful updates performed by other client packages that use the same provider package as a dep… ▽ More Dependabot, a popular dependency management tool, includes a compatibility score feature that helps client packages assess the risk of accepting a dependency update by leveraging knowledge from "the crowd". For each dependency update, Dependabot calculates this compatibility score as the proportion of successful updates performed by other client packages that use the same provider package as a dependency. In this paper, we study the efficacy of the compatibility score to help client packages assess the risks involved with accepting a dependency update. We analyze 579,206 pull requests opened by Dependabot to update a dependency, along with 618,045 compatibility score records calculated by Dependabot. We find that a compatibility score cannot be calculated for 83% of the dependency updates due to the lack of data from the crowd. Yet, the vast majority of the scores that can be calculated have a small confidence interval and are based on low-quality data, suggesting that client packages should have additional angles to evaluate the risk of an update and the trustworthiness of the compatibility score. To overcome these limitations, we propose metrics that amplify the input from the crowd and demonstrate the ability of those metrics to predict the acceptance of a successful update by client packages. We also demonstrate that historical update metrics from client packages can be used to provide a more personalized compatibility score. Based on our findings, we argue that, when leveraging the crowd, dependency management bots should include a confidence interval to help calibrate the trust clients can place in the compatibility score, and consider the quality of tests that exercise candidate updates. △ Less

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2402.15990 [pdf, other]

doi 10.1007/s10664-024-10474-4

An Empirical Study of Challenges in Machine Learning Asset Management

Authors: Zhimin Zhao, Yihao Chen, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan

Abstract: In machine learning (ML), efficient asset management, including ML models, datasets, algorithms, and tools, is vital for resource optimization, consistent performance, and a streamlined development lifecycle. This enables quicker iterations, adaptability, reduced development-to-deployment time, and reliable outputs. Despite existing research, a significant knowledge gap remains in operational chal… ▽ More In machine learning (ML), efficient asset management, including ML models, datasets, algorithms, and tools, is vital for resource optimization, consistent performance, and a streamlined development lifecycle. This enables quicker iterations, adaptability, reduced development-to-deployment time, and reliable outputs. Despite existing research, a significant knowledge gap remains in operational challenges like model versioning, data traceability, and collaboration, which are crucial for the success of ML projects. Our study aims to address this gap by analyzing 15,065 posts from developer forums and platforms, employing a mixed-method approach to classify inquiries, extract challenges using BERTopic, and identify solutions through open card sorting and BERTopic clustering. We uncover 133 topics related to asset management challenges, grouped into 16 macro-topics, with software dependency, model deployment, and model training being the most discussed. We also find 79 solution topics, categorized under 18 macro-topics, highlighting software dependency, feature development, and file management as key solutions. This research underscores the need for further exploration of identified pain points and the importance of collaborative efforts across academia, industry, and the research community. △ Less

Submitted 28 February, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

Journal ref: Empirical Software Engineering 2024

arXiv:2402.15943 [pdf]

Rethinking Software Engineering in the Foundation Model Era: A Curated Catalogue of Challenges in the Development of Trustworthy FMware

Authors: Ahmed E. Hassan, Dayi Lin, Gopi Krishnan Rajbahadur, Keheliya Gallaba, Filipe R. Cogo, Boyuan Chen, Haoxiang Zhang, Kishanthan Thangarajah, Gustavo Ansaldi Oliva, Jiahuei Lin, Wali Mohammad Abdullah, Zhen Ming Jiang

Abstract: Foundation models (FMs), such as Large Language Models (LLMs), have revolutionized software development by enabling new use cases and business models. We refer to software built using FMs as FMware. The unique properties of FMware (e.g., prompts, agents, and the need for orchestration), coupled with the intrinsic limitations of FMs (e.g., hallucination) lead to a completely new set of software eng… ▽ More Foundation models (FMs), such as Large Language Models (LLMs), have revolutionized software development by enabling new use cases and business models. We refer to software built using FMs as FMware. The unique properties of FMware (e.g., prompts, agents, and the need for orchestration), coupled with the intrinsic limitations of FMs (e.g., hallucination) lead to a completely new set of software engineering challenges. Based on our industrial experience, we identified 10 key SE4FMware challenges that have caused enterprise FMware development to be unproductive, costly, and risky. In this paper, we discuss these challenges in detail and state the path for innovation that we envision. Next, we present FMArts, which is our long-term effort towards creating a cradle-to-grave platform for the engineering of trustworthy FMware. Finally, we (i) show how the unique properties of FMArts enabled us to design and develop a complex FMware for a large customer in a timely manner and (ii) discuss the lessons that we learned in doing so. We hope that the disclosure of the aforementioned challenges and our associated efforts to tackle them will not only raise awareness but also promote deeper and further discussions, knowledge sharing, and innovative solutions across the software engineering discipline. △ Less

Submitted 3 March, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

arXiv:2312.15350 [pdf, other]

Why Not Mitigate Vulnerabilities in Helm Charts?

Authors: Yihao Chen, Jiahuei Lin, Bram Adams, Ahmed E. Hassan

Abstract: [Context]: Containerization ensures the resilience of distributed applications by Kubernetes. Helm is a package manager for Kubernetes applications. A Helm package, namely "Chart'', is a set of pre-configured resources that one could quickly deploy a complex application. However, Helm broadens the attack surface of the distributed applications. [Objective]: This study aims to investigate the pre… ▽ More [Context]: Containerization ensures the resilience of distributed applications by Kubernetes. Helm is a package manager for Kubernetes applications. A Helm package, namely "Chart'', is a set of pre-configured resources that one could quickly deploy a complex application. However, Helm broadens the attack surface of the distributed applications. [Objective]: This study aims to investigate the prevalence of fixable vulnerabilities, the factors related to the vulnerabilities, and current mitigation strategies in Helm Charts. [Method]: We conduct a mixed-methods study on 11,035 Helm Charts affected by 10,982 fixable vulnerabilities. We analyze the complexity of Charts and compare the distribution of vulnerabilities between official and unofficial Charts. Subsequently, we investigate vulnerability mitigation strategies from the Chart-associated repositories by a grounded theory. [Results]: Our findings highlight that the complexity of a Chart correlates with the number of vulnerabilities, and the official Charts do not contain fewer vulnerabilities compared to unofficial Charts. The 10,982 fixable vulnerabilities are at a median of high severity and can be easily exploited. In addition, we identify 11 vulnerability mitigation strategies in three categories. Due to the complexity of Charts, maintainers are required to investigate where a vulnerability impacts and how to mitigate it. The use of automated strategies is low as automation has limited capability(e.g., a higher number of false positives) in such complex Charts. [Conclusion]: There exists need for automation tools that assist maintainers in mitigating vulnerabilities to reduce manual effort. In addition, Chart maintainers lack incentives to mitigate vulnerabilities, given a lack of guidelines for mitigation responsibilities. Adopting a shared responsibility model in the Helm ecosystem would increase its security. △ Less

Submitted 23 December, 2023; originally announced December 2023.

arXiv:2312.12604 [pdf]

An empirical study of testing machine learning in the wild

Authors: Moses Openja, Foutse Khomh, Armstrong Foundjem, Zhen Ming, Jiang, Mouna Abidi, Ahmed E. Hassan

Abstract: Recently, machine and deep learning (ML/DL) algorithms have been increasingly adopted in many software systems. Due to their inductive nature, ensuring the quality of these systems remains a significant challenge for the research community. Unlike traditional software built deductively by writing explicit rules, ML/DL systems infer rules from training data. Recent research in ML/DL quality assuran… ▽ More Recently, machine and deep learning (ML/DL) algorithms have been increasingly adopted in many software systems. Due to their inductive nature, ensuring the quality of these systems remains a significant challenge for the research community. Unlike traditional software built deductively by writing explicit rules, ML/DL systems infer rules from training data. Recent research in ML/DL quality assurance has adapted concepts from traditional software testing, such as mutation testing, to improve reliability. However, it is unclear if these proposed testing techniques are adopted in practice, or if new testing strategies have emerged from real-world ML deployments. There is little empirical evidence about the testing strategies. To fill this gap, we perform the first fine-grained empirical study on ML testing in the wild to identify the ML properties being tested, the testing strategies, and their implementation throughout the ML workflow. We conducted a mixed-methods study to understand ML software testing practices. We analyzed test files and cases from 11 open-source ML/DL projects on GitHub. Using open coding, we manually examined the testing strategies, tested ML properties, and implemented testing methods to understand their practical application in building and releasing ML/DL software systems. Our findings reveal several key insights: 1.) The most common testing strategies, accounting for less than 40%, are Grey-box and White-box methods, such as Negative Testing, Oracle Approximation and Statistical Testing. 2.) A wide range of 17 ML properties are tested, out of which only 20% to 30% are frequently tested, including Consistency, Correctness}, and Efficiency. 3.) Bias and Fairness is more tested in Recommendation, while Security & Privacy is tested in Computer Vision (CV) systems, Application Platforms, and Natural Language Processing (NLP) systems. △ Less

Submitted 13 July, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: Accepted paper at TOSEM journal

arXiv:2311.12019 [pdf, other]

An Empirical Study of Self-Admitted Technical Debt in Machine Learning Software

Authors: Aaditya Bhatia, Foutse Khomh, Bram Adams, Ahmed E Hassan

Abstract: The emergence of open-source ML libraries such as TensorFlow and Google Auto ML has enabled developers to harness state-of-the-art ML algorithms with minimal overhead. However, during this accelerated ML development process, said developers may often make sub-optimal design and implementation decisions, leading to the introduction of technical debt that, if not addressed promptly, can have a signi… ▽ More The emergence of open-source ML libraries such as TensorFlow and Google Auto ML has enabled developers to harness state-of-the-art ML algorithms with minimal overhead. However, during this accelerated ML development process, said developers may often make sub-optimal design and implementation decisions, leading to the introduction of technical debt that, if not addressed promptly, can have a significant impact on the quality of the ML-based software. Developers frequently acknowledge these sub-optimal design and development choices through code comments during software development. These comments, which often highlight areas requiring additional work or refinement in the future, are known as self-admitted technical debt (SATD). This paper aims to investigate SATD in ML code by analyzing 318 open-source ML projects across five domains, along with 318 non-ML projects. We detected SATD in source code comments throughout the different project snapshots, conducted a manual analysis of the identified SATD sample to comprehend the nature of technical debt in the ML code, and performed a survival analysis of the SATD to understand the evolution of such debts. We observed: i) Machine learning projects have a median percentage of SATD that is twice the median percentage of SATD in non-machine learning projects. ii) ML pipeline components for data preprocessing and model generation logic are more susceptible to debt than model validation and deployment components. iii) SATDs appear in ML projects earlier in the development process compared to non-ML projects. iv) Long-lasting SATDs are typically introduced during extensive code changes that span multiple files exhibiting low complexity. △ Less

Submitted 9 June, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

arXiv:2311.03213 [pdf, other]

On the Model Update Strategies for Supervised Learning in AIOps Solutions

Authors: Yingzhe Lyu, Heng Li, Zhen Ming, Jiang, Ahmed E. Hassan

Abstract: AIOps (Artificial Intelligence for IT Operations) solutions leverage the massive data produced during the operation of large-scale systems and machine learning models to assist software engineers in their system operations. As operation data produced in the field are constantly evolving due to factors such as the changing operational environment and user base, the models in AIOps solutions need to… ▽ More AIOps (Artificial Intelligence for IT Operations) solutions leverage the massive data produced during the operation of large-scale systems and machine learning models to assist software engineers in their system operations. As operation data produced in the field are constantly evolving due to factors such as the changing operational environment and user base, the models in AIOps solutions need to be constantly maintained after deployment. While prior works focus on innovative modeling techniques to improve the performance of AIOps models before releasing them into the field, when and how to update AIOps models remain an under-investigated topic. In this work, we performed a case study on three large-scale public operation data and empirically assessed five different types of model update strategies for supervised learning regarding their performance, updating cost, and stability. We observed that active model update strategies (e.g., periodical retraining, concept drift guided retraining, time-based model ensembles, and online learning) achieve better and more stable performance than a stationary model. Particularly, applying sophisticated model update strategies could provide better performance, efficiency, and stability than simply retraining AIOps models periodically. In addition, we observed that, although some update strategies can save model training time, they significantly sacrifice model testing time, which could hinder their applications in AIOps solutions where the operation data arrive at high pace and volume and where immediate inferences are required. Our findings highlight that practitioners should consider the evolution of operation data and actively maintain AIOps models over time. Our observations can also guide researchers and practitioners in investigating more efficient and effective model update strategies that fit in the context of AIOps. △ Less

Submitted 11 April, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

arXiv:2305.13884 [pdf, other]

Multi-Granularity Detector for Vulnerability Fixes

Authors: Truong Giang Nguyen, Thanh Le-Cong, Hong Jin Kang, Ratnadira Widyasari, Chengran Yang, Zhipeng Zhao, Bowen Xu, Jiayuan Zhou, Xin Xia, Ahmed E. Hassan, Xuan-Bach D. Le, David Lo

Abstract: With the increasing reliance on Open Source Software, users are exposed to third-party library vulnerabilities. Software Composition Analysis (SCA) tools have been created to alert users of such vulnerabilities. SCA requires the identification of vulnerability-fixing commits. Prior works have proposed methods that can automatically identify such vulnerability-fixing commits. However, identifying s… ▽ More With the increasing reliance on Open Source Software, users are exposed to third-party library vulnerabilities. Software Composition Analysis (SCA) tools have been created to alert users of such vulnerabilities. SCA requires the identification of vulnerability-fixing commits. Prior works have proposed methods that can automatically identify such vulnerability-fixing commits. However, identifying such commits is highly challenging, as only a very small minority of commits are vulnerability fixing. Moreover, code changes can be noisy and difficult to analyze. We observe that noise can occur at different levels of detail, making it challenging to detect vulnerability fixes accurately. To address these challenges and boost the effectiveness of prior works, we propose MiDas (Multi-Granularity Detector for Vulnerability Fixes). Unique from prior works, Midas constructs different neural networks for each level of code change granularity, corresponding to commit-level, file-level, hunk-level, and line-level, following their natural organization. It then utilizes an ensemble model that combines all base models to generate the final prediction. This design allows MiDas to better handle the noisy and highly imbalanced nature of vulnerability-fixing commit data. Additionally, to reduce the human effort required to inspect code changes, we have designed an effort-aware adjustment for Midas's outputs based on commit length. The evaluation results demonstrate that MiDas outperforms the current state-of-the-art baseline in terms of AUC by 4.9% and 13.7% on Java and Python-based datasets, respectively. Furthermore, in terms of two effort-aware metrics, EffortCost@L and Popt@L, MiDas also outperforms the state-of-the-art baseline, achieving improvements of up to 28.2% and 15.9% on Java, and 60% and 51.4% on Python, respectively. △ Less

Submitted 23 May, 2023; originally announced May 2023.

Journal ref: IEEE Transactions on Software Engineering, 2023

arXiv:2305.05654 [pdf, other]

Using Knowledge Units of Programming Languages to Recommend Reviewers for Pull Requests: An Empirical Study

Authors: Md Ahasanuzzaman, Gustavo A. Oliva, Ahmed E. Hassan

Abstract: Code review is a key element of quality assurance in software development. Determining the right reviewer for a given code change requires understanding the characteristics of the changed code, identifying the skills of each potential reviewer (expertise profile), and finding a good match between the two. To facilitate this task, we design a code reviewer recommender that operates on the knowledge… ▽ More Code review is a key element of quality assurance in software development. Determining the right reviewer for a given code change requires understanding the characteristics of the changed code, identifying the skills of each potential reviewer (expertise profile), and finding a good match between the two. To facilitate this task, we design a code reviewer recommender that operates on the knowledge units (KUs) of a programming language. We define a KU as a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. We operationalize our KUs using certification exams for the Java programming language. We detect KUs from 10 actively maintained Java projects from GitHub, spanning 290K commits and 65K pull requests (PRs). Next, we generate developer expertise profiles based on the detected KUs. Finally, these KU-based expertise profiles are used to build a code reviewer recommender (KUREC). In RQ1, we observe that KUREC performs as well as the top-performing baseline recommender (RF). From a practical standpoint, we highlight that KUREC's performance is more stable (lower interquartile range) than that of RF, thus making it more consistent and potentially more trustworthy. Next, in RQ2 we design three new recommenders by combining KUREC with our baseline recommenders. These new combined recommenders outperform both KUREC and the individual baselines. Finally, in RQ3 we evaluate how reasonable the recommendations from KUREC and the combined recommenders are when those deviate from the ground truth. Taking together the results from all RQs, we conclude that KUREC and one of the combined recommenders (AD_FREQ) are overall superior to the baseline recommenders that we studied. Future work in the area should thus (i) consider KU-based recommenders as baselines and (ii) experiment with combined recommenders. △ Less

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2211.15733 [pdf, other]

An Empirical Study of Library Usage and Dependency in Deep Learning Frameworks

Authors: Mohamed Raed El aoun, Lionel Nganyewou Tidjon, Ben Rombaut, Foutse Khomh, Ahmed E. Hassan

Abstract: Recent advances in deep learning (dl) have led to the release of several dl software libraries such as pytorch, Caffe, and TensorFlow, in order to assist machine learning (ml) practitioners in developing and deploying state-of-the-art deep neural networks (DNN), but they are not able to properly cope with limitations in the dl libraries such as testing or data processing. In this paper, we present… ▽ More Recent advances in deep learning (dl) have led to the release of several dl software libraries such as pytorch, Caffe, and TensorFlow, in order to assist machine learning (ml) practitioners in developing and deploying state-of-the-art deep neural networks (DNN), but they are not able to properly cope with limitations in the dl libraries such as testing or data processing. In this paper, we present a qualitative and quantitative analysis of the most frequent dl libraries combination, the distribution of dl library dependencies across the ml workflow, and formulate a set of recommendations to (i) hardware builders for more optimized accelerators and (ii) library builder for more refined future releases. Our study is based on 1,484 open-source dl projects with 46,110 contributors selected based on their reputation. First, we found an increasing trend in the usage of deep learning libraries. Second, we highlight several usage patterns of deep learning libraries. In addition, we identify dependencies between dl libraries and the most frequent combination where we discover that pytorch and Scikit-learn and, Keras and TensorFlow are the most frequent combination in 18% and 14% of the projects. The developer uses two or three dl libraries in the same projects and tends to use different multiple dl libraries in both the same function and the same files. The developer shows patterns in using various deep-learning libraries and prefers simple functions with fewer arguments and straightforward goals. Finally, we present the implications of our findings for researchers, library maintainers, and hardware vendors. △ Less

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2206.08959 [pdf, other]

Is my transaction done yet? An empirical study of transaction processing times in the Ethereum Blockchain Platform

Authors: Michael Pacheco, Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Ahmed E. Hassan

Abstract: Ethereum is one of the most popular platforms for the development of blockchain-powered applications. These applications are known as Dapps. When engineering Dapps, developers need to translate requests captured in the front-end of their application into one or more smart contract transactions. Developers need to pay for these transactions and, the more they pay (i.e., the higher the gas price), t… ▽ More Ethereum is one of the most popular platforms for the development of blockchain-powered applications. These applications are known as Dapps. When engineering Dapps, developers need to translate requests captured in the front-end of their application into one or more smart contract transactions. Developers need to pay for these transactions and, the more they pay (i.e., the higher the gas price), the faster the transaction is likely to be processed. Therefore developers need to optimize the balance between cost (transaction fees) and user experience (transaction processing times). Online services have been developed to provide transaction issuers (e.g., Dapp developers) with an estimate of how long transactions will take to be processed given a certain gas price. These estimation services are crucial in the Ethereum domain and several popular wallets such as Metamask rely on them. However, their accuracy has not been empirically investigated so far. In this paper, we quantify the transaction processing times in Ethereum, investigate the relationship between processing times and gas prices, and determine the accuracy of state-of-the-practice estimation services. We find that transactions are processed in a median of 57s and that 90% of the transactions are processed within 8m. The higher gas prices result in faster transaction processing times with diminishing returns. In particular, we observe no practical difference in processing time between expensive and very expensive transactions. In terms of accuracy of processing time estimation services, we note that they are equivalent. However, when stratifying transactions by gas prices, Etherscan's Gas Tracker is the most accurate estimation service for very cheap and cheap transaction. EthGasStation's Gas Price API, in turn, is the most accurate estimation service for regular, expensive, and very expensive transactions. △ Less

Submitted 17 June, 2022; originally announced June 2022.

Comments: Under review in Transactions of Software Engineering and Methodology journal

arXiv:2206.08905 [pdf, other]

What makes Ethereum blockchain transactions be processed fast or slow? An empirical study

Authors: Michael Pacheco, Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Ahmed E. Hassan

Abstract: The Ethereum platform allows developers to implement and deploy applications called Dapps onto the blockchain for public use through the use of smart contracts. To execute code within a smart contract, a paid transaction must be issued towards one of the functions that are exposed in the interface of a contract. However, such a transaction is only processed once one of the miners in the peer-to-pe… ▽ More The Ethereum platform allows developers to implement and deploy applications called Dapps onto the blockchain for public use through the use of smart contracts. To execute code within a smart contract, a paid transaction must be issued towards one of the functions that are exposed in the interface of a contract. However, such a transaction is only processed once one of the miners in the peer-to-peer network selects it, adds it to a block, and appends that block to the blockchain This creates a delay between transaction submission and code execution. It is crucial for Dapp developers to be able to precisely estimate when transactions will be processed, since this allows them to define and provide a certain Quality of Service (QoS) level (e.g., 95% of the transactions processed within 1 minute). However, the impact that different factors have on these times have not yet been studied. Processing time estimation services are used by Dapp developers to achieve predefined QoS. Yet, these services offer minimal insights into what factors impact processing times. Considering the vast amount of data that surrounds the Ethereum blockchain, changes in processing times are hard for Dapp developers to predict, making it difficult to maintain said QoS. In our study, we build random forest models to understand the factors that are associated with transaction processing times. We engineer several features that capture blockchain internal factors, as well as gas pricing behaviors of transaction issuers. By interpreting our models, we conclude that features surrounding gas pricing behaviors are very strongly associated with transaction processing times. Based on our empirical results, we provide Dapp developers with concrete insights that can help them provide and maintain high levels of QoS. △ Less

Submitted 17 June, 2022; originally announced June 2022.

Comments: Under Peer review in Empirical Software Engineering Journal

arXiv:2204.03794 [pdf]

On the Importance of Performing App Analysis Within Peer Groups

Authors: Safwat Hassan, Heng Li, Ahmed E. Hassan

Abstract: The competing nature of the app market motivates us to shift our focus on apps that provide similar functionalities and directly compete with each other (i.e., peer apps). In this work, we study the ratings and the review text of 100 Android apps across 10 peer app groups. We highlight the importance of performing peer-app analysis by showing that it can provide a unique perspective over performin… ▽ More The competing nature of the app market motivates us to shift our focus on apps that provide similar functionalities and directly compete with each other (i.e., peer apps). In this work, we study the ratings and the review text of 100 Android apps across 10 peer app groups. We highlight the importance of performing peer-app analysis by showing that it can provide a unique perspective over performing a global analysis of apps (i.e., mixing apps from multiple categories). First, we observe that comparing user ratings within peer groups can provide very different results from comparing user ratings from a global perspective. Then, we show that peer-app analysis provides a different perspective to spot the dominant topics in the user reviews, and to understand the impact of the topics on user ratings. Our findings suggest that future efforts may pay more attention to performing and supporting app analysis from a peer group context. For example, app store owners may consider an additional rating mechanism that normalizes app ratings within peer groups, and future research may help developers understand the characteristics of specific peer groups and prioritize their efforts. △ Less

Submitted 7 April, 2022; originally announced April 2022.

arXiv:2202.08701 [pdf, other]

Revisiting reopened bugs in open source software systems

Authors: Ankur Tagra, Haoxiang Zhang, Gopi Krishnan Rajbahadur, Ahmed E. Hassan

Abstract: Reopened bugs can degrade the overall quality of a software system since they require unnecessary rework by developers. Moreover, reopened bugs also lead to a loss of trust in the end-users regarding the quality of the software. Thus, predicting bugs that might be reopened could be extremely helpful for software developers to avoid rework. Prior studies on reopened bug prediction focus only on thr… ▽ More Reopened bugs can degrade the overall quality of a software system since they require unnecessary rework by developers. Moreover, reopened bugs also lead to a loss of trust in the end-users regarding the quality of the software. Thus, predicting bugs that might be reopened could be extremely helpful for software developers to avoid rework. Prior studies on reopened bug prediction focus only on three open source projects (i.e., Apache, Eclipse, and OpenOffice) to generate insights. We observe that one out of the three projects (i.e., Apache) has a data leak issue -- the bug status of reopened was included as training data to predict reopened bugs. In addition, prior studies used an outdated prediction model pipeline (i.e., with old techniques for constructing a prediction model) to predict reopened bugs. Therefore, we revisit the reopened bugs study on a large scale dataset consisting of 47 projects tracked by JIRA using the modern techniques such as SMOTE, permutation importance together with 7 different machine learning models. We study the reopened bugs using a mixed methods approach (i.e., both quantitative and qualitative study). We find that: 1) After using an updated reopened bug prediction model pipeline, only 34% projects give an acceptable performance with AUC >= 0.7. 2) There are four major reasons for a bug getting reopened, that is, technical (i.e., patch/integration issues), documentation, human (i.e., due to incorrect bug assessment), and reasons not shown in the bug reports. 3) In projects with an acceptable AUC, 94% of the reopened bugs are due to patch issues (i.e., the usage of an incorrect patch) identified before bug reopening. Our study revisits reopened bugs and provides new insights into developer's bug reopening activities. △ Less

Submitted 17 February, 2022; originally announced February 2022.

arXiv:2202.06157 [pdf, ps, other]

doi 10.1109/MSR.2017.4

The Impact of Using Regression Models to Build Defect Classifiers

Authors: Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, Ahmed E. Hassan

Abstract: It is common practice to discretize continuous defect counts into defective and non-defective classes and use them as a target variable when building defect classifiers (discretized classifiers). However, this discretization of continuous defect counts leads to information loss that might affect the performance and interpretation of defect classifiers. Another possible approach to build defect cla… ▽ More It is common practice to discretize continuous defect counts into defective and non-defective classes and use them as a target variable when building defect classifiers (discretized classifiers). However, this discretization of continuous defect counts leads to information loss that might affect the performance and interpretation of defect classifiers. Another possible approach to build defect classifiers is through the use of regression models then discretizing the predicted defect counts into defective and non-defective classes (regression-based classifiers). In this paper, we compare the performance and interpretation of defect classifiers that are built using both approaches (i.e., discretized classifiers and regression-based classifiers) across six commonly used machine learning classifiers (i.e., linear/logistic regression, random forest, KNN, SVM, CART, and neural networks) and 17 datasets. We find that: i) Random forest based classifiers outperform other classifiers (best AUC) for both classifier building approaches; ii) In contrast to common practice, building a defect classifier using discretized defect counts (i.e., discretized classifiers) does not always lead to better performance. Hence we suggest that future defect classification studies should consider building regression-based classifiers (in particular when the defective ratio of the modeled dataset is low). Moreover, we suggest that both approaches for building defect classifiers should be explored, so the best-performing classifier can be used when determining the most influential features. △ Less

Submitted 12 February, 2022; originally announced February 2022.

Journal ref: IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), 2017, pp. 135-145

arXiv:2202.06146 [pdf, other]

doi 10.1109/TSE.2019.2924371

Impact of Discretization Noise of the Dependent variable on Machine Learning Classifiers in Software Engineering

Authors: Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, Ahmed E. Hassan

Abstract: Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., median). However, such discretization may introduce noise (i.e., discretization noise) due to ambiguous class loyalty of data points that are close to the artificial threshold. Previous studies do not provide a clear directive on the impact of discreti… ▽ More Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., median). However, such discretization may introduce noise (i.e., discretization noise) due to ambiguous class loyalty of data points that are close to the artificial threshold. Previous studies do not provide a clear directive on the impact of discretization noise on the classifiers and how to handle such noise. In this paper, we propose a framework to help researchers and practitioners systematically estimate the impact of discretization noise on classifiers in terms of its impact on various performance measures and the interpretation of classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently for different datasets; 2) Though the interpretation of the classifiers are impacted by the discretization noise on the whole, the top 3 most important features are not affected by the discretization noise. Therefore, we suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of their built classifiers and estimate the exact amount of discretization noise to be discarded from the dataset to avoid the negative impact of such noise. △ Less

Submitted 12 February, 2022; originally announced February 2022.

Journal ref: IEEE Transactions on Software Engineering, Vol 47, Issue 7 (2021), 1414-1430

arXiv:2202.06145 [pdf, other]

doi 10.1109/TSE.2021.3131950

Revisiting the Impact of Dependency Network Metrics on Software Defect Prediction

Authors: Lina Gong, Gopi Krishnan Rajbahadur, Ahmed E. Hassan, Shujuan Jiang

Abstract: Software dependency network metrics extracted from the dependency graph of the software modules by the application of Social Network Analysis (SNA metrics) have been shown to improve the performance of the Software Defect prediction (SDP) models. However, the relative effectiveness of these SNA metrics over code metrics in improving the performance of the SDP models has been widely debated with no… ▽ More Software dependency network metrics extracted from the dependency graph of the software modules by the application of Social Network Analysis (SNA metrics) have been shown to improve the performance of the Software Defect prediction (SDP) models. However, the relative effectiveness of these SNA metrics over code metrics in improving the performance of the SDP models has been widely debated with no clear consensus. Furthermore, some of the common SDP scenarios like predicting the number of defects in a module (Defect-count) in Cross-version and Cross-project SDP contexts remain unexplored. Such lack of clear directive on the effectiveness of SNA metrics when compared to the widely used code metrics prevents us from potentially building better performing SDP models. Therefore, through a case study of 9 open source software projects across 30 versions, we study the relative effectiveness of SNA metrics when compared to code metrics across 3 commonly used SDP contexts (Within-project, Cross-version and Cross-project) and scenarios (Defect-count, Defect-classification (classifying if a module is defective) and Effort-aware (ranking the defective modules w.r.t to the involved effort)). We find the SNA metrics by themselves or along with code metrics improve the performance of SDP models over just using code metrics on 5 out of the 9 studied SDP scenarios (three SDP scenarios across three SDP contexts). However, we note that in some cases the improvements afforded by considering SNA metrics over or alongside code metrics might only be marginal, whereas in other cases the improvements could be potentially large. Based on these findings we suggest that the future work should: consider SNA metrics alongside code metrics in their SDP models; as well as consider Ego metrics and Global metrics, the two different types of the SNA metrics separately when training SDP models as they behave differently. △ Less

Submitted 12 February, 2022; originally announced February 2022.

arXiv:2202.04431 [pdf, other]

doi 10.1145/3546945

Assessing the alignment between the information needs of developers and the documentation of programming languages: A case study on Rust

Authors: Filipe R. Cogo, Xin Xia, Ahmed E. Hassan

Abstract: Programming language documentation refers to the set of technical documents that provide application developers with a description of the high-level concepts of a language. Such documentation is essential to support application developers in the effective use of a programming language. One of the challenges faced by documenters (i.e., personnel that produce documentation) is to ensure that documen… ▽ More Programming language documentation refers to the set of technical documents that provide application developers with a description of the high-level concepts of a language. Such documentation is essential to support application developers in the effective use of a programming language. One of the challenges faced by documenters (i.e., personnel that produce documentation) is to ensure that documentation has relevant information that aligns with the concrete needs of developers. In this paper, we present an automated approach to support documenters in evaluating the differences and similarities between the concrete information need of developers and the current state of documentation (a problem that we refer to as the topical alignment of a programming language documentation). Our approach leverages semi-supervised topic modelling to assess the similarities and differences between the topics of Q&A posts and the official documentation. To demonstrate the application of our approach, we perform a case study on the documentation of Rust. Our results show that there is a relatively high level of topical alignment in Rust documentation. Still, information about specific topics is scarce in both the Q&A websites and the documentation, particularly related topics with programming niches such as network, game, and database development. For other topics (e.g., related topics with language features such as structs, patterns and matchings, and foreign function interface), information is only available on Q&A websites while lacking in the official documentation. Finally, we discuss implications for programming language documenters, particularly how to leverage our approach to prioritize topics that should be added to the documentation. △ Less

Submitted 8 February, 2022; originally announced February 2022.

Journal ref: ACM Transactions on Software Engineering and Methodology (2022)

arXiv:2202.02389 [pdf, other]

doi 10.1109/TSE.2021.3056941

The impact of feature importance methods on the interpretation of defect classifiers

Authors: Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, Ahmed E. Hassan

Abstract: Classifier specific (CS) and classifier agnostic (CA) feature importance methods are widely used (often interchangeably) by prior studies to derive feature importance ranks from a defect classifier. However, different feature importance methods are likely to compute different feature importance ranks even for the same dataset and classifier. Hence such interchangeable use of feature importance met… ▽ More Classifier specific (CS) and classifier agnostic (CA) feature importance methods are widely used (often interchangeably) by prior studies to derive feature importance ranks from a defect classifier. However, different feature importance methods are likely to compute different feature importance ranks even for the same dataset and classifier. Hence such interchangeable use of feature importance methods can lead to conclusion instabilities unless there is a strong agreement among different methods. Therefore, in this paper, we evaluate the agreement between the feature importance ranks associated with the studied classifiers through a case study of 18 software projects and six commonly used classifiers. We find that: 1) The computed feature importance ranks by CA and CS methods do not always strongly agree with each other. 2) The computed feature importance ranks by the studied CA methods exhibit a strong agreement including the features reported at top-1 and top-3 ranks for a given dataset and classifier, while even the commonly used CS methods yield vastly different feature importance ranks. Such findings raise concerns about the stability of conclusions across replicated studies. We further observe that the commonly used defect datasets are rife with feature interactions and these feature interactions impact the computed feature importance ranks of the CS methods (not the CA methods). We demonstrate that removing these feature interactions, even with simple methods like CFS improves agreement between the computed feature importance ranks of CA and CS methods. In light of our findings, we provide guidelines for stakeholders and practitioners when performing model interpretation and directions for future research, e.g., future research is needed to investigate the impact of advanced feature interaction removal methods on computed feature importance ranks of different CS methods. △ Less

Submitted 4 February, 2022; originally announced February 2022.

arXiv:2109.13172 [pdf, other]

An empirical study of question discussions on Stack Overflow

Authors: Wenhan Zhu, Haoxiang Zhang, Ahmed E. Hassan, Michael W. Godfrey

Abstract: Stack Overflow provides a means for developers to exchange knowledge. While much previous research on Stack Overflow has focused on questions and answers (Q&A), recent work has shown that discussions in comments also contain rich information. On Stack Overflow, discussions through comments and chat rooms can be tied to questions or answers. In this paper, we conduct an empirical study that focuses… ▽ More Stack Overflow provides a means for developers to exchange knowledge. While much previous research on Stack Overflow has focused on questions and answers (Q&A), recent work has shown that discussions in comments also contain rich information. On Stack Overflow, discussions through comments and chat rooms can be tied to questions or answers. In this paper, we conduct an empirical study that focuses on the nature of question discussions. We observe that: (1) Question discussions occur at all phases of the Q&A process, with most beginning before the first answer is received. (2) Both askers and answerers actively participate in question discussions; the likelihood of their participation increases as the number of comments increases. (3) There is a strong correlation between the number of question comments and the question answering time (i.e., more discussed questions receive answers more slowly); also, questions with a small number of comments are likely to be answered more quickly than questions with no discussion. Our findings suggest that question discussions contain a rich trove of data that is integral to the Q&A processes on Stack Overflow. We further suggest how future research can leverage the information in question discussions, along with the commonly studied Q&A information. △ Less

Submitted 19 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 27 pages, 9 figures

arXiv:2104.03518 [pdf, other]

doi 10.1007/s10664-021-10028-y

An Exploratory Study on the Repeatedly Shared External Links on Stack Overflow

Authors: Jiakun Liu, Haoxiang Zhang, Xin Xia, David Lo, Ying Zou, Ahmed E. Hassan, Shanping Li

Abstract: On Stack Overflow, users reuse 11,926,354 external links to share the resources hosted outside the Stack Overflow website. The external links connect to the existing programming-related knowledge and extend the crowdsourced knowledge on Stack Overflow. Some of the external links, so-called as repeated external links, can be shared for multiple times. We observe that 82.5% of the link sharing activ… ▽ More On Stack Overflow, users reuse 11,926,354 external links to share the resources hosted outside the Stack Overflow website. The external links connect to the existing programming-related knowledge and extend the crowdsourced knowledge on Stack Overflow. Some of the external links, so-called as repeated external links, can be shared for multiple times. We observe that 82.5% of the link sharing activities (i.e., sharing links in any question, answer, or comment) on Stack Overflow share external resources, and 57.0% of the occurrences of the external links are sharing the repeated external links. However, it is still unclear what types of external resources are repeatedly shared. To help users manage their knowledge, we wish to investigate the characteristics of the repeated external links in knowledge sharing on Stack Overflow. In this paper, we analyze the repeated external links on Stack Overflow. We observe that external links that point to the text resources (hosted in documentation websites, tutorial websites, etc.) are repeatedly shared the most. We observe that: 1) different users repeatedly share the same knowledge in the form of repeated external links, thus increasing the maintenance effort of knowledge (e.g., update invalid links in multiple posts), 2) the same users can repeatedly share the external links for the purpose of promotion, and 3) external links can point to webpages with an overload of information that is difficult for users to retrieve relevant information. Our findings provide insights to Stack Overflow moderators and researchers. For example, we encourage Stack Overflow to centrally manage the commonly occurring knowledge in the form of repeated external links in order to better maintain the crowdsourced knowledge on Stack Overflow. △ Less

Submitted 8 April, 2021; originally announced April 2021.

arXiv:2104.00182 [pdf, other]

doi 10.1109/TSE.2020.2983399

Studying Ad Library Integration Strategies of Top Free-to-Download Apps

Authors: Md Ahasanuzzaman, Safwat Hassan, Ahmed E. Hassan

Abstract: In-app advertisements have become a major revenue source for app developers in the mobile app ecosystem. Ad libraries play an integral part in this ecosystem as app developers integrate these libraries into their apps to display ads. In this paper, we study ad library integration practices by analyzing 35,459 updates of 1,837 top free-to-download apps of the Google Play Store. We observe that ad l… ▽ More In-app advertisements have become a major revenue source for app developers in the mobile app ecosystem. Ad libraries play an integral part in this ecosystem as app developers integrate these libraries into their apps to display ads. In this paper, we study ad library integration practices by analyzing 35,459 updates of 1,837 top free-to-download apps of the Google Play Store. We observe that ad libraries (e.g., Google AdMob) are not always used for serving ads -- 22.5% of the apps that integrate Google AdMob do not display ads. They instead depend on Google AdMob for analytical purposes. Among the apps that display ads, we observe that 57.9% of them integrate multiple ad libraries. We observe that such integration of multiple ad libraries occurs commonly in apps with a large number of downloads and ones in app categories with a high proportion of ad-displaying apps. We manually analyze a sample of apps and derive a set of rules to automatically identify four common strategies for integrating multiple ad libraries. Our analysis of the apps across the identified strategies shows that app developers prefer to manage their own integrations instead of using off-the-shelf features of ad libraries for integrating multiple ad libraries. Our findings are valuable for ad library developers who wish to learn first hand about the challenges of integrating ad libraries. △ Less

Submitted 31 March, 2021; originally announced April 2021.

arXiv:2103.14439 [pdf, other]

doi 10.1007/s10664-020-09840-9

An Empirical Study of the Characteristics of Popular Minecraft Mods

Authors: Daniel Lee, Gopi Krishnan Rajbahadur, Dayi Lin, Mohammed Sayagh, Cor-Paul Bezemer, Ahmed E. Hassan

Abstract: It is becoming increasingly difficult for game developers to manage the cost of developing a game, while meeting the high expectations of gamers. One way to balance the increasing gamer expectation and development stress is to build an active modding community around the game. There exist several examples of games with an extremely active and successful modding community, with the Minecraft game b… ▽ More It is becoming increasingly difficult for game developers to manage the cost of developing a game, while meeting the high expectations of gamers. One way to balance the increasing gamer expectation and development stress is to build an active modding community around the game. There exist several examples of games with an extremely active and successful modding community, with the Minecraft game being one of the most notable ones. This paper reports on an empirical study of 1,114 popular and 1,114 unpopular Minecraft mods from the CurseForge mod distribution platform, one of the largest distribution platforms for Minecraft mods. We analyzed the relationship between 33 features across 5 dimensions of mod characteristics and the popularity of mods (i.e., mod category, mod documentation, environmental context of the mod, remuneration for the mod, and community contribution for the mod), to understand the characteristics of popular Minecraft mods. We firstly verify that the studied dimensions have significant explanatory power in distinguishing the popularity of the studied mods. Then we evaluated the contribution of each of the 33 features across the 5 dimensions. We observed that popular mods tend to have a high quality description and promote community contribution. △ Less

Submitted 26 March, 2021; originally announced March 2021.

Journal ref: Empirical Software Engineering, 25, 2020, 3396-3429

arXiv:2103.07528 [pdf, other]

doi 10.1007/s10664-019-09783-w

Building the perfect game -- an empirical study of game modifications

Authors: Daniel Lee, Dayi Lin, Cor-Paul Bezemer, Ahmed E. Hassan

Abstract: Game developers cannot always meet the growing and changing needs of the gaming community, due to the often already overloaded schedules of developers. So-called modders can potentially assist game developers with addressing gamers' needs. Modders are enthusiasts who provide modifications or completely new content for a game. By supporting modders, game developers can meet the rapidly growing and… ▽ More Game developers cannot always meet the growing and changing needs of the gaming community, due to the often already overloaded schedules of developers. So-called modders can potentially assist game developers with addressing gamers' needs. Modders are enthusiasts who provide modifications or completely new content for a game. By supporting modders, game developers can meet the rapidly growing and varying needs of their gamer base. Modders have the potential to play a role in extending the life expectancy of a game, thereby saving game developers time and money, and leading to a better overall gaming experience for their gamer base. In this paper, we empirically study the metadata of 9,521 mods of the 20 most-modded games on the Nexus Mods distribution platform. Our goal is to provide useful insights into the modding community of the Nexus Mods distribution platform from a quantitative perspective, and to provide researchers with a solid foundation for future exploration of game mods. In doing so, game developers can potentially reduce development time and cost due to the increased replayability of their games through mods. We find that providing official support for mods can be beneficial for the perceived quality of the mods of a game. In addition, mod users are willing to submit bug reports for a mod. However, they often fail to do this in a systematic manner using the bug reporting tool of the Nexus Mods platform, resulting in low-quality bug reports which are difficult to resolve. Based on our findings, we recommend that game developers who desire an active modding community for their own games provide the modding community with an officially-supported modding tool. In addition, we recommend that mod distribution platforms, such as Nexus Mods, improve their bug reporting system to receive higher quality bug reports. △ Less

Submitted 12 March, 2021; originally announced March 2021.

Comments: Published in Empirical Software Engineering journal. Please cite "Lee, D. et al. Building the perfect game - an empirical study of game modifications. Empir Software Eng 25, 2485-2518 (2020). https://doi.org/10.1007/s10664-019-09783-w"

Journal ref: Empirical Software Engineering 25 (2020): 2485-2518

arXiv:2103.00141 [pdf, other]

A Differential Testing Approach for Evaluating Abstract Syntax Tree Mapping Algorithms

Authors: Yuanrui Fan, Xin Xia, David Lo, Ahmed E. Hassan, Yuan Wang, Shanping Li

Abstract: Abstract syntax tree (AST) mapping algorithms are widely used to analyze changes in source code. Despite the foundational role of AST mapping algorithms, little effort has been made to evaluate the accuracy of AST mapping algorithms, i.e., the extent to which an algorihtm captures the evolution of code. We observe that a program element often has only one best-mapped program element. Based on this… ▽ More Abstract syntax tree (AST) mapping algorithms are widely used to analyze changes in source code. Despite the foundational role of AST mapping algorithms, little effort has been made to evaluate the accuracy of AST mapping algorithms, i.e., the extent to which an algorihtm captures the evolution of code. We observe that a program element often has only one best-mapped program element. Based on this observation, we propose a hierarchical approach to automatically compare the similarity of mapped statements and tokens by different algorithms. By performing the comparison, we determine if each of the compared algorithms generates inaccurate mappings for a statement or its tokens. We invite 12 external experts to determine if three commonly used AST mapping algorithms generate accurate mappings for a statement and its tokens for 200 statements. Based on the experts' feedback,we observe that our approach achieves a precision of 0.98--1.00 and a recall of 0.65--0.75. Furthermore, we conduct a large-scale study with a dataset of ten Java projects, containing a total of 263,165 file revisions. Our approach determines that GumTree, MTDiff and IJM generate inaccurate mappings for 20%--29%, 25%--36% and 21%--30% of the file revisions, respectively. Our experimental results show that state-of-art AST mapping agorithms still need improvements. △ Less

Submitted 27 February, 2021; originally announced March 2021.

Journal ref: International Conference on Software Engineering 2021

arXiv:2010.04892 [pdf, other]

doi 10.1109/TSE.2021.3086494

Broken External Links on Stack Overflow

Authors: Jiakun Liu, Xin Xia, David Lo, Haoxiang Zhang, Ying Zou, Ahmed E. Hassan, Shanping Li

Abstract: Stack Overflow hosts valuable programming-related knowledge with 11,926,354 links that reference to the third-party websites. The links that reference to the resources hosted outside the Stack Overflow websites extend the Stack Overflow knowledge base substantially. However, with the rapid development of programming-related knowledge, many resources hosted on the Internet are not available anymore… ▽ More Stack Overflow hosts valuable programming-related knowledge with 11,926,354 links that reference to the third-party websites. The links that reference to the resources hosted outside the Stack Overflow websites extend the Stack Overflow knowledge base substantially. However, with the rapid development of programming-related knowledge, many resources hosted on the Internet are not available anymore. Based on our analysis of the Stack Overflow data that was released on Jun. 2, 2019, 14.2% of the links on Stack Overflow are broken links. The broken links on Stack Overflow can obstruct viewers from obtaining desired programming-related knowledge, and potentially damage the reputation of the Stack Overflow as viewers might regard the posts with broken links as obsolete. In this paper, we characterize the broken links on Stack Overflow. 65% of the broken links in our sampled questions are used to show examples, e.g., code examples. 70% of the broken links in our sampled answers are used to provide supporting information, e.g., explaining a certain concept and describing a step to solve a problem. Only 1.67% of the posts with broken links are highlighted as such by viewers in the posts' comments. Only 5.8% of the posts with broken links removed the broken links. Viewers cannot fully rely on the vote scores to detect broken links, as broken links are common across posts with different vote scores. The websites that host resources that can be maintained by their users are referenced by broken links the most on Stack Overflow -- a prominent example of such websites is GitHub. The posts and comments related to the web technologies, i.e., JavaScript, HTML, CSS, and jQuery, are associated with more broken links. Based on our findings, we shed lights for future directions and provide recommendations for practitioners and researchers. △ Less

Submitted 9 October, 2020; originally announced October 2020.

arXiv:2010.02472 [pdf, other]

doi 10.1007/s10664-020-09916-6

What Makes a Popular Academic AI Repository?

Authors: Yuanrui Fan, Xin Xia, David Lo, Ahmed E. Hassan, Shanping Li

Abstract: Many AI researchers are publishing code, data and other resources that accompany their papers in GitHub repositories. In this paper, we refer to these repositories as academic AI repositories. Our preliminary study shows that highly cited papers are more likely to have popular academic AI repositories (and vice versa). Hence, in this study, we perform an empirical study on academic AI repositories… ▽ More Many AI researchers are publishing code, data and other resources that accompany their papers in GitHub repositories. In this paper, we refer to these repositories as academic AI repositories. Our preliminary study shows that highly cited papers are more likely to have popular academic AI repositories (and vice versa). Hence, in this study, we perform an empirical study on academic AI repositories to highlight good software engineering practices of popular academic AI repositories for AI researchers. We collect 1,149 academic AI repositories, in which we label the top 20% repositories that have the most number of stars as popular, and we label the bottom 70% repositories as unpopular. The remaining 10% repositories are set as a gap between popular and unpopular academic AI repositories. We propose 21 features to characterize the software engineering practices of academic AI repositories. Our experimental results show that popular and unpopular academic AI repositories are statistically significantly different in 11 of the studied features---indicating that the two groups of repositories have significantly different software engineering practices. Furthermore, we find that the number of links to other GitHub repositories in the README file, the number of images in the README file and the inclusion of a license are the most important features for differentiating the two groups of academic AI repositories. Our dataset and code are made publicly available to share with the community. △ Less

Submitted 9 October, 2020; v1 submitted 6 October, 2020; originally announced October 2020.

Journal ref: Empirical Software Engineering (2021)

arXiv:2005.14373 [pdf, ps, other]

doi 10.1145/3465403

CodeMatcher: Searching Code Based on Sequential Semantics of Important Query Words

Authors: Chao Liu, Xin Xia, David Lo, Zhiwei Liu, Ahmed E. Hassan, Shanping Li

Abstract: To accelerate software development, developers frequently search and reuse existing code snippets from a large-scale codebase, e.g., GitHub. Over the years, researchers proposed many information retrieval based models for code search, but they fail to connect the semantic gap between query and code. An early successful deep learning based model DeepCS solved this issue by learning the relationship… ▽ More To accelerate software development, developers frequently search and reuse existing code snippets from a large-scale codebase, e.g., GitHub. Over the years, researchers proposed many information retrieval based models for code search, but they fail to connect the semantic gap between query and code. An early successful deep learning based model DeepCS solved this issue by learning the relationship between pairs of code methods and corresponding natural language descriptions. Two major advantages of DeepCS are the capability of understanding irrelevant/noisy keywords and capturing sequential relationships between words in query and code. In this paper, we proposed an IR-based model CodeMatcher that inherits the advantages of DeepCS, while it can leverage the indexing technique in the IR-based model to accelerate the search response time substantially. CodeMatcher first collects metadata for query words to identify irrelevant/noisy ones, then iteratively performs fuzzy search with important query words on the codebase that is indexed by the Elasticsearch tool, and finally reranks a set of returned candidate code according to how the tokens in the candidate code snippet sequentially matched the important words in a query. We verified its effectiveness on a large-scale codebase with ~41k repositories. Experimental results showed that CodeMatcher achieves an MRR of 0.60, outperforming DeepCS, CodeHow, and UNIF by 82%, 62%, and 46% respectively. Our proposed model is over 1.2k times faster than DeepCS. Moreover, CodeMatcher outperforms GitHub and Google search by 46% and 33% respectively in terms of MRR. We also observed that: fusing the advantages of IR-based and DL-based models is promising; improving the quality of method naming helps code search, since method name plays an important role in connecting query and code. △ Less

Submitted 22 February, 2022; v1 submitted 28 May, 2020; originally announced May 2020.

arXiv:1904.02724 [pdf, other]

Bounties in Open Source Development on GitHub: A Case Study of Bountysource Bounties

Authors: Jiayuan Zhou, Shaowei Wang, Cor-Paul Bezemer, Ying Zou, Ahmed E. Hassan

Abstract: Due to the voluntary nature of open source software, it can be hard to find a developer to work on a particular task. For example, some issue reports may be too cumbersome and unexciting for someone to volunteer to do them, yet these issue reports may be of high priority to the success of a project. To provide an incentive for implementing such issue reports, one can propose a monetary reward, i.e… ▽ More Due to the voluntary nature of open source software, it can be hard to find a developer to work on a particular task. For example, some issue reports may be too cumbersome and unexciting for someone to volunteer to do them, yet these issue reports may be of high priority to the success of a project. To provide an incentive for implementing such issue reports, one can propose a monetary reward, i.e., a bounty, to the developer who completes that particular task. In this paper, we study bounties in open source projects on GitHub to better understand how bounties can be leveraged to evolve such projects in terms of addressing issue reports. We investigated 5,445 bounties for GitHub projects. These bounties were proposed through the Bountysource platform with a total bounty value of $406,425. We find that 1) in general, the timing of proposing bounties and the bounty-usage frequency are the most important factors that impact the likelihood of an issue being addressed. More specifically, issue reports are more likely to be addressed if they are for projects in which bounties are used more frequently and if they are proposed earlier. 2) The bounty value that an issue report has is the most important factor that impacts the issue-addressing likelihood in the projects in which no bounties were used before. Backers in such projects proposed higher bounty values to get issues addressed. 3) There is a risk of wasting money for backers who invest money on long-standing issue reports. △ Less

Submitted 4 April, 2019; originally announced April 2019.

arXiv:1904.00946 [pdf, other]

Does the hiding mechanism for Stack Overflow comments work well? No!

Authors: Haoxiang Zhang, Shaowei Wang, Tse-Hsun Peter Chen, Ahmed E. Hassan

Abstract: Stack Overflow has accumulated millions of answers. Informative comments can strengthen their associated answers (e.g., providing additional information). Currently, Stack Overflow hides comments that are ranked beyond the top 5. Stack Overflow aims to display more informative comments (i.e., the ones with higher scores) and hide less informative ones using this mechanism. As a result, 4.4 million… ▽ More Stack Overflow has accumulated millions of answers. Informative comments can strengthen their associated answers (e.g., providing additional information). Currently, Stack Overflow hides comments that are ranked beyond the top 5. Stack Overflow aims to display more informative comments (i.e., the ones with higher scores) and hide less informative ones using this mechanism. As a result, 4.4 million comments are hidden under their answer threads. Therefore, it is very important to understand how well the current comment hiding mechanism works. In this study, we investigate whether the mechanism can effectively deliver informative comments while hiding uninformative comments. We find that: 1) Hidden comments are as informative as displayed comments; more than half of the comments (both hidden and displayed) are informative (e.g., providing alternative answers, or pointing out flaws in their associated answers). 2) The current comment hiding mechanism tends to rank and hide comments based on their creation time instead of their score in most cases due to the large amount of tie-scored comments (e.g., 87% of the comments have 0-score). 3) In 97.3% of answers that have hidden comments, at least one comment is hidden while there is another comment with the same score is displayed (i.e., we refer to such cases as unfairly hidden comments). Among such unfairly hidden comments, the longest unfairly hidden comment is more likely to be informative than the shortest unfairly displayed comments. Our findings suggest that Stack Overflow should consider adjusting their current comment hiding mechanism, e.g., displaying longer unfairly hidden comments to replace shorter unfairly displayed comments. We also recommend that users examine all comments, in case they would miss informative details such as software obsolescence, code error reports, or notices of security vulnerability in hidden comments. △ Less

Submitted 1 April, 2019; originally announced April 2019.

Comments: 12 pages

arXiv:1903.12282 [pdf]

doi 10.1109/TSE.2019.2906315

An Empirical Study of Obsolete Answers on Stack Overflow

Authors: Haoxiang Zhang, Shaowei Wang, Tse-Hsun, Chen, Ying Zou, Ahmed E. Hassan

Abstract: Stack Overflow accumulates an enormous amount of software engineering knowledge. However, as time passes, certain knowledge in answers may become obsolete. Such obsolete answers, if not identified or documented clearly, may mislead answer seekers and cause unexpected problems (e.g., using an out-dated security protocol). In this paper, we investigate how the knowledge in answers becomes obsolete a… ▽ More Stack Overflow accumulates an enormous amount of software engineering knowledge. However, as time passes, certain knowledge in answers may become obsolete. Such obsolete answers, if not identified or documented clearly, may mislead answer seekers and cause unexpected problems (e.g., using an out-dated security protocol). In this paper, we investigate how the knowledge in answers becomes obsolete and identify the characteristics of such obsolete answers. We find that: 1) More than half of the obsolete answers (58.4%) were probably already obsolete when they were first posted. 2) When an obsolete answer is observed, only a small proportion (20.5%) of such answers are ever updated. 3) Answers to questions in certain tags (e.g., node.js, ajax, android, and objective-c) are more likely to become obsolete. Our findings suggest that Stack Overflow should develop mechanisms to encourage the whole community to maintain answers (to avoid obsolete answers) and answer seekers are encouraged to carefully go through all information (e.g., comments) in answer threads. △ Less

Submitted 28 March, 2019; originally announced March 2019.

Comments: 14 pages

Journal ref: IEEE Transactions on Software Engineering (2019)

arXiv:1806.07727 [pdf, other]

doi 10.1016/j.infsof.2018.06.001

The Impact of IR-based Classifier Configuration on the Performance and the Effort of Method-Level Bug Localization

Authors: Chakkrit Tantithamthavorn, Surafel Lemma Abebe, Ahmed E. Hassan, Akinori Ihara, Kenichi Matsumoto

Abstract: Context: IR-based bug localization is a classifier that assists developers in locating buggy source code entities (e.g., files and methods) based on the content of a bug report. Such IR-based classifiers have various parameters that can be configured differently (e.g., the choice of entity representation). Objective: In this paper, we investigate the impact of the choice of the IR-based classifier… ▽ More Context: IR-based bug localization is a classifier that assists developers in locating buggy source code entities (e.g., files and methods) based on the content of a bug report. Such IR-based classifiers have various parameters that can be configured differently (e.g., the choice of entity representation). Objective: In this paper, we investigate the impact of the choice of the IR-based classifier configuration on the top-k performance and the required effort to examine source code entities before locating a bug at the method level. Method: We execute a large space of classifier configuration, 3,172 in total, on 5,266 bug reports of two software systems, i.e., Eclipse and Mozilla. Results: We find that (1) the choice of classifier configuration impacts the top-k performance from 0.44% to 36% and the required effort from 4,395 to 50,000 LOC; (2) classifier configurations with similar top-k performance might require different efforts; (3) VSM achieves both the best top-k performance and the least required effort for method-level bug localization; (4) the likelihood of randomly picking a configuration that performs within 20% of the best top-k classifier configuration is on average 5.4% and that of the least effort is on average 1%; (5) configurations related to the entity representation of the analyzed data have the most impact on both the top-k performance and the required effort; and (6) the most efficient classifier configuration obtained at the method-level can also be used at the file-level (and vice versa). Conclusion: Our results lead us to conclude that configuration has a large impact on both the top-k performance and the required effort for method-level bug localization, suggesting that the IR-based configuration settings should be carefully selected and the required effort metric should be included in future bug localization studies. △ Less

Submitted 20 June, 2018; originally announced June 2018.

Comments: Accepted at Journal of Information and Software Technology (IST)

arXiv:1801.10271 [pdf, other]

The Impact of Correlated Metrics on Defect Models

Authors: Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, Ahmed E. Hassan

Abstract: Defect models are analytical models that are used to build empirical theories that are related to software quality. Prior studies often derive knowledge from such models using interpretation techniques, such as ANOVA Type-I. Recent work raises concerns that prior studies rarely remove correlated metrics when constructing such models. Such correlated metrics may impact the interpretation of models.… ▽ More Defect models are analytical models that are used to build empirical theories that are related to software quality. Prior studies often derive knowledge from such models using interpretation techniques, such as ANOVA Type-I. Recent work raises concerns that prior studies rarely remove correlated metrics when constructing such models. Such correlated metrics may impact the interpretation of models. Yet, the impact of correlated metrics in such models has not been investigated. In this paper, we set out to investigate the impact of correlated metrics, and the benefits and costs of removing correlated metrics on defect models. Through a case study of 15 publicly-available defect datasets, we find that (1) correlated metrics impact the ranking of the highest ranked metric for all of the 9 studied model interpretation techniques. On the other hand, removing correlated metrics (2) improves the consistency of the highest ranked metric regardless of how a model is specified for all of the studied interpretation techniques (except for ANOVA Type-I); and (3) negligibly impacts the performance and stability of defect models. Thus, researchers must (1) mitigate (e.g., remove) correlated metrics prior to constructing a defect model; and (2) avoid using ANOVA Type-I even if all correlated metrics are removed. △ Less

Submitted 30 January, 2018; originally announced January 2018.

Comments: 16 pages, under review at a software engineering journal

arXiv:1801.10270 [pdf, other]

The Impact of Automated Parameter Optimization on Defect Prediction Models

Authors: Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto

Abstract: Defect prediction models---classifiers that identify defect-prone software modules---have configurable parameters that control their characteristics (e.g., the number of trees in a random forest). Recent studies show that these classifiers underperform when default settings are used. In this paper, we study the impact of automated parameter optimization on defect prediction models. Through a case… ▽ More Defect prediction models---classifiers that identify defect-prone software modules---have configurable parameters that control their characteristics (e.g., the number of trees in a random forest). Recent studies show that these classifiers underperform when default settings are used. In this paper, we study the impact of automated parameter optimization on defect prediction models. Through a case study of 18 datasets, we find that automated parameter optimization: (1) improves AUC performance by up to 40 percentage points; (2) yields classifiers that are at least as stable as those trained using default settings; (3) substantially shifts the importance ranking of variables, with as few as 28% of the top-ranked variables in optimized classifiers also being top-ranked in non-optimized classifiers; (4) yields optimized settings for 17 of the 20 most sensitive parameters that transfer among datasets without a statistically significant drop in performance; and (5) adds less than 30 minutes of additional computation to 12 of the 26 studied classification techniques. While widely-used classification techniques like random forest and support vector machines are not optimization-sensitive, traditionally overlooked techniques like C5.0 and neural networks can actually outperform widely-used techniques after optimization is applied. This highlights the importance of exploring the parameter space when using parameter-sensitive classification techniques. △ Less

Submitted 30 January, 2018; originally announced January 2018.

Comments: 32 pages, accepted at IEEE Transactions on Software Engineering

arXiv:1801.10269 [pdf, other]

The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models

Authors: Chakkrit Tantithamthavorn, Ahmed E. Hassan, Kenichi Matsumoto

Abstract: Defect prediction models that are trained on class imbalanced datasets (i.e., the proportion of defective and clean modules is not equally represented) are highly susceptible to produce inaccurate prediction models. Prior research compares the impact of class rebalancing techniques on the performance of defect prediction models. Prior research efforts arrive at contradictory conclusions due to the… ▽ More Defect prediction models that are trained on class imbalanced datasets (i.e., the proportion of defective and clean modules is not equally represented) are highly susceptible to produce inaccurate prediction models. Prior research compares the impact of class rebalancing techniques on the performance of defect prediction models. Prior research efforts arrive at contradictory conclusions due to the use of different choice of datasets, classification techniques, and performance measures. Such contradictory conclusions make it hard to derive practical guidelines for whether class rebalancing techniques should be applied in the context of defect prediction models. In this paper, we investigate the impact of 4 popularly-used class rebalancing techniques on 10 commonly-used performance measures and the interpretation of defect prediction models. We also construct statistical models to better understand in which experimental design settings that class rebalancing techniques are beneficial for defect prediction models. Through a case study of 101 datasets that span across proprietary and open-source systems, we recommend that class rebalancing techniques are necessary when quality assurance teams wish to increase the completeness of identifying software defects (i.e., Recall). However, class rebalancing techniques should be avoided when interpreting defect prediction models. We also find that class rebalancing techniques do not impact the AUC measure. Hence, AUC should be used as a standard measure when comparing defect prediction models. △ Less

Submitted 30 January, 2018; originally announced January 2018.

Comments: 20 pages, under review at a software engineering journal

arXiv:1104.5387 [pdf]

Model based system engineering approach of a lightweight embedded TCP/IP

Authors: M. Z. Rashed, Ahmed E. Hassan, Ahmed I. Sharaf

Abstract: The use of embedded software is growing very rapidly. Accessing the internet is a necessary service which has large range of applications in many fields. The Internet is based on TCP/IP which is a very important stack. Although TCP/IP is very important there is not a software engineering model describing it. The common method in modeling and describing TCP/IP is RFCs which is not sufficient for so… ▽ More The use of embedded software is growing very rapidly. Accessing the internet is a necessary service which has large range of applications in many fields. The Internet is based on TCP/IP which is a very important stack. Although TCP/IP is very important there is not a software engineering model describing it. The common method in modeling and describing TCP/IP is RFCs which is not sufficient for software engineer and developers. Therefore there is a need for software engineering approach to help engineers and developers to customize their own web based applications for embedded systems. This research presents a model based system engineering approach of lightweight TCP/IP. The model contains the necessary phases for developing a lightweight TCP/IP for embedded systems. The proposed model is based on SysML as a model based system engineering language. △ Less

Submitted 28 April, 2011; originally announced April 2011.

Showing 1–44 of 44 results for author: Hassan, A E