Impermanent Identifiers: Enhanced Source Code Comprehension and Refactoring

Authors: Eduardo Martins Guerra, Andre A. S. Ivo, Fernando O. Pereira, Romain Robbes, Andrea Janes, Fabio Fagundes Silveira

Abstract: In response to the prevailing challenges in contemporary software development, this article introduces an innovative approach to code augmentation centered around Impermanent Identifiers. The primary goal is to enhance the software development experience by introducing dynamic identifiers that adapt to changing contexts, facilitating more efficient interactions between developers and source code, ultimately advancing comprehension, maintenance, and collaboration in software development. Additionally, this study rigorously evaluates the adoption and acceptance of Impermanent Identifiers within the software development landscape. Through a comprehensive empirical examination, we investigate how developers perceive and integrate this approach into their daily programming practices, exploring perceived benefits, potential barriers, and factors influencing its adoption. In summary, this article charts a new course for code augmentation, proposing Impermanent Identifiers as its cornerstone while assessing their feasibility and acceptance among developers. This interdisciplinary research seeks to contribute to the continuous improvement of software development practices and the progress of code augmentation technology.

Submitted 14 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: to be published in The Journal of Systems & Software

arXiv:2308.02843 [pdf, other]

One Microservice per Developer: Is This the Trend in OSS?

Authors: Dario Amoroso d'Aragona, Xiaozhou Li, Tomas Cerny, Andrea Janes, Valentina Lenarduzzi, Davide Taibi

Abstract: When developing and managing microservice systems, practitioners suggest that each microservice should be owned by a particular team. In effect, there is only one team with the responsibility to manage a given service. Consequently, one developer should belong to only one team. This practice of "one-microservice-per-developer" is especially prevalent in large projects with an extensive development team. Based on the bazaar-style software development model of Open Source Projects, in which different programmers, like vendors at a bazaar, offer to help out developing different parts of the system, this article investigates whether we can observe the "one-microservice-per-developer" behavior, a strategy we assume anticipated within microservice based Open Source Projects. We conducted an empirical study among 38 microservice-based OS projects. Our findings indicate that the strategy is rarely respected by open-source developers except for projects that have dedicated DevOps teams.

Submitted 5 August, 2023; originally announced August 2023.

arXiv:2306.02036 [pdf, other]

On the Empirical Evidence of Microservice Logical Coupling. A Registered Report

Authors: Dario Amoroso d'Aragona, Luca Pascarella, Andrea Janes, Valentina Lenarduzzi, Rafael Penaloza, Davide Taibi

Abstract: [Context] Coupling is a widely discussed metric by software engineers while developing complex software systems, often referred to as a crucial factor and symptom of a poor or good design. Nevertheless, measuring the logical coupling among microservices and analyzing the interactions between services is non-trivial because it demands runtime information in the form of log files, which are not always accessible. [Objective and Method] In this work, we propose the design of a study aimed at empirically validating the Microservice Logical Coupling (MLC) metric presented in our previous study. In particular, we plan to empirically study Open Source Systems (OSS) built using a microservice architecture. [Results] The result of this work aims at corroborating the effectiveness and validity of the MLC metric. Thus, we will gather empirical evidence and develop a methodology to analyze and support the claims regarding the MLC metric. Furthermore, we establish its usefulness in evaluating and understanding the logical coupling among microservices.

Submitted 3 June, 2023; originally announced June 2023.

arXiv:2305.00760 [pdf, other]

Breaks and Code Quality: Investigating the Impact of Forgetting on Software Development. A Registered Report

Authors: Dario Amoroso d'Aragona, Luca Pascarella, Andrea Janes, Valentina Lenarduzzi, Rafael Penaloza, Davide Taibi

Abstract: Developers interrupting their participation in a project might slowly forget critical information about the code, such as its intended purpose, structure, the impact of external dependencies, and the approach used for implementation. Forgetting the implementation details can have detrimental effects on software maintenance, comprehension, knowledge sharing, and developer productivity, resulting in bugs and other issues that can negatively influence the software development process. Therefore, it is crucial to ensure that developers have a clear understanding of the codebase and can work efficiently and effectively even after long interruptions. This registered report proposes an empirical study aimed at investigating the impact of the developer's activity breaks duration and different code quality properties. In particular, we aim at understanding if the amount of activity in a project impacts the code quality, and if developers with different activity profiles show different impacts on code quality. The results might be useful to understand if it is beneficial to promote the practice of developing multiple projects in parallel, or if it is more beneficial to reduce the number of projects each developer contributes to.

Submitted 28 August, 2023; v1 submitted 1 May, 2023; originally announced May 2023.

arXiv:2303.07722 [pdf, other]

Early Career Developers' Perceptions of Code Understandability. A Study of Complexity Metrics

Authors: Matteo Esposito, Andrea Janes, Terhi Kilamo, Valentina Lenarduzzi

Abstract: Context. Code understandability is fundamental. Developers need to understand the code they are modifying clearly. A low understandability can increase the amount of coding effort, and misinterpreting code impacts the entire development process. Ideally, developers should write clear and understandable code with the least effort. Aim. Our work investigates whether the McCabe Cyclomatic Complexity or the Cognitive Complexity can be a good predictor for the developers' perceived code understandability to understand which of the two complexities can be used as criteria to evaluate if a piece of code is understandable. Method. We designed and conducted an empirical study among 216 early career developers with professional experience ranging from one to four years. We asked them to manually inspect and rate the understandability of 12 Java classes that exhibit different levels of Cyclomatic and Cognitive Complexity. Results. Our findings showed that while the old-fashioned McCabe Cyclomatic Complexity and the most recent Cognitive Complexity are modest predictors for code understandability when considering the complexity perceived by early-career developers, they are not for problem severity. Conclusions. Based on our results, early-career developers should not be left alone when performing code-reviewing tasks due to their scarce experience. Moreover, low complexity measures indicate good understandability, but having either CoC or CyC high makes understandability unpredictable. Nevertheless, there is no evidence that CyC or CoC are indicators of early-career perceived severity. Future research efforts will focus on expanding the population to experienced developers to confront whether seniority influences the predictive power of the chosen metrics.

Submitted 15 July, 2024; v1 submitted 14 March, 2023; originally announced March 2023.

arXiv:2301.10164 [pdf, other]

Lowering Detection in Sport Climbing Based on Orientation of the Sensor Enhanced Quickdraw

Authors: Sadaf Moaveninejad, Andrea Janes, Camillo Porcaro

Abstract: Tracking climbers' activity to improve services and make the best use of their infrastructure is a concern for climbing gyms. Each climbing session must be analyzed from beginning till lowering of the climber. Therefore, spotting the climbers descending is crucial since it indicates when the ascent has come to an end. This problem must be addressed while preserving privacy and convenience of the climbers and the costs of the gyms. To this aim, a hardware prototype is developed to collect data using accelerometer sensors attached to a piece of climbing equipment mounted on the wall, called quickdraw, that connects the climbing rope to the bolt anchors. The corresponding sensors are configured to be energy-efficient, hence becoming practical in terms of expenses and time consumption for replacement when used in large quantities in a climbing gym. This paper describes hardware specifications, studies data measured by the sensors in ultra-low power mode, detects sensors' orientation patterns during lowering on different routes, and develops a supervised approach to identify lowering.

Submitted 15 March, 2024; v1 submitted 17 January, 2023; originally announced January 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2211.02680

arXiv:2211.02680 [pdf, other]

Climbing Routes Clustering Using Energy-Efficient Accelerometers Attached to the Quickdraws

Authors: Sadaf Moaveninejad, Andrea Janes, Camillo Porcaro, Luca Barletta, Lorenzo Mucchi, Massimiliano Pierobon

Abstract: One of the challenges for climbing gyms is to find out popular routes for the climbers to improve their services and optimally use their infrastructure. This problem must be addressed preserving both the privacy and convenience of the climbers and the costs of the gyms. To this aim, a hardware prototype is developed to collect data using accelerometer sensors attached to a piece of climbing equipment mounted on the wall, called quickdraw, that connects the climbing rope to the bolt anchors. The corresponding sensors are configured to be energy-efficient, hence becoming practical in terms of expenses and time consumption for replacement when used in large quantities in a climbing gym. This paper describes hardware specifications, studies data measured by the sensors in ultra-low power mode, detects patterns in data during climbing different routes, and develops an unsupervised approach for route clustering.

Submitted 7 March, 2024; v1 submitted 4 November, 2022; originally announced November 2022.

Journal ref: Proceedings of the 18th EAI International Conference on Body Area Networks, 2023

arXiv:2207.06875 [pdf, other]

Open Tracing Tools: Overview and Critical Comparison

Authors: Andrea Janes, Xiaozhou Li, Valentina Lenarduzzi

Abstract: Background. Coping with the rapidly growing complexity in contemporary software architecture, tracing has become an increasingly critical practice and been adopted widely by software engineers. By adopting tracing tools, practitioners are able to monitor, debug, and optimize distributed software architectures easily. However, with an excessive number of valid candidates, researchers and practitioners have a hard time finding and selecting the suitable tracing tools by systematically considering their features and advantages. Objective. To such a purpose, this paper aims to provide an overview of popular Open tracing tools via comparison. Method. Herein, we first identified 30 tools in an objective, systematic, and reproducible manner adopting the Systematic Multivocal Literature Review protocol. Then, we characterized each tool looking at the 1) measured features, 2) popularity both in peer-reviewed literature and online media, and 3) benefits and issues. We used topic modeling and sentiment analysis to extract and summarize the benefits and issues. Specifically, we adopted ChatGPT to support the topic interpretation. Results. As a result, this paper presents a systematic comparison amongst the selected tracing tools in terms of their features, popularity, benefits and issues. Conclusion. The result mainly shows that each tracing tool provides a unique combination of features with also different pros and cons. The contribution of this paper is to provide the practitioners better understanding of the tracing tools facilitating their adoption.

Submitted 23 June, 2023; v1 submitted 14 July, 2022; originally announced July 2022.

arXiv:2206.08718 [pdf, other]

CATTO: Just-in-time Test Case Selection and Execution

Authors: Dario Amoroso d'Aragona, Fabiano Pecorelli, Simone Romano, Giuseppe Scanniello, Maria Teresa Baldassarre, Andrea Janes, Valentina Lenarduzzi

Abstract: Regression testing ensures a System Under Test (SUT) still works as expected after changes to it. The simplest approach for regression testing consists of re-running the entire test suite against the changed version of the SUT. However, this might result in a time- and resource-consuming process, e.g., when dealing with large and/or complex SUTs and test suites. To work around this problem, Test Case Selection (TCS) strategies can be used. Such strategies seek to build a temporary test suite comprising only those test cases that are relevant to the changes made to the SUT, so avoiding executing those test cases that do not exercise the changed parts. In this paper, we introduce CATTO (Commit Adaptive Tool for Test suite Optimization) and CATTO INTELLIJ PLUGIN. The former is a tool implementing a TCS strategy for SUTs written in Java, while the latter is a wrapper to allow developers to use CATTO directly in IntelliJ. We also conducted a preliminary evaluation of CATTO on seven open-source Java SUTs in terms of reductions in test-suite size, fault-revealing test cases, and fault-detection capability. The results are promising and suggest that CATTO can be of help to developers when performing regression testing. The video demo and the documentation of the tool are available at: https://catto-tool.github.io/

Submitted 17 June, 2022; originally announced June 2022.

arXiv:2103.01722 [pdf, other]

Mining Software Repositories with a Collaborative Heuristic Repository

Authors: Hlib Babii, Julian Aron Prenner, Laurin Stricker, Anjan Karmakar, Andrea Janes, Romain Robbes

Abstract: Many software engineering studies or tasks rely on categorizing software engineering artifacts. In practice, this is done either by defining simple but often imprecise heuristics, or by manual labelling of the artifacts. Unfortunately, errors in these categorizations impact the tasks that rely on them. To improve the precision of these categorizations, we propose to gather heuristics in a collaborative heuristic repository, to which researchers can contribute a large amount of diverse heuristics for a variety of tasks on a variety of SE artifacts. These heuristics are then leveraged by state-of-the-art weak supervision techniques to train high-quality classifiers, thus improving the categorizations. We present an initial version of the heuristic repository, which we applied to the concrete task of commit classification.

Submitted 2 March, 2021; originally announced March 2021.

Comments: 5 pages; to appear in Proceedings of ICSE NIER 2021

arXiv:2012.13423 [pdf, other]

doi 10.1109/ACCESS.2020.3028571

Improving Predictability of User-Affecting Metrics to Support Anomaly Detection in Cloud Services

Authors: Vilc Rufino, Mateus Nogueira, Alberto Avritzer, Daniel Menasché, Barbara Russo, Andrea Janes, Vincenzo Ferme, André Van Hoorn, Henning Schulz, Cabral Lima

Abstract: Anomaly detection systems aim to detect and report attacks or unexpected behavior in networked systems. Previous work has shown that anomalies have an impact on system performance, and that performance signatures can be effectively used for implementing an IDS. In this paper, we present an analytical and an experimental study on the trade-off between anomaly detection based on performance signatures and system scalability. The proposed approach combines analytical modeling and load testing to find optimal configurations for the signature-based IDS. We apply a heavy-tail bi-modal modeling approach, where "long" jobs represent large resource consuming transactions, e.g., generated by DDoS attacks; the model was parametrized using results obtained from controlled experiments. For performance purposes, mean response time is the key metric to be minimized, whereas for security purposes, response time variance and classification accuracy must be taken into account. The key insights from our analysis are: (i) there is an optimal number of servers which minimizes the response time variance, (ii) the sweet-spot number of servers that minimizes response time variance and maximizes classification accuracy is typically smaller than or equal to the one that minimizes mean response time. Therefore, for security purposes, it may be worth slightly sacrificing performance to increase classification accuracy.

Submitted 24 December, 2020; originally announced December 2020.

Journal ref: IEEE Access, vol. 8, p.198152-198167, 2020

arXiv:2003.07914 [pdf, ps, other]

doi 10.1145/3377811.3380342

Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Authors: Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, Andrea Janes

Abstract: Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.

Submitted 17 March, 2020; originally announced March 2020.

Comments: 13 pages; to appear in Proceedings of ICSE 2020

arXiv:1904.01873 [pdf, other]

Modeling Vocabulary for Big Code Machine Learning

Authors: Hlib Babii, Andrea Janes, Romain Robbes

Abstract: When building machine learning models that operate on source code, several decisions have to be made to model source-code vocabulary. These decisions can have a large impact: some can lead to not being able to train models at all, others significantly affect performance, particularly for Neural Language Models. Yet, these decisions are not often fully described. This paper lists important modeling choices for source code vocabulary, and explores their impact on the resulting vocabulary on a large-scale corpus of 14,436 projects. We show that a subset of decisions have decisive characteristics, allowing to train accurate Neural Language Models quickly on a large corpus of 10,106 projects.

Submitted 3 April, 2019; originally announced April 2019.

Comments: 12 pages, 1 figure

Showing 1–13 of 13 results for author: Janes, A