Darren Holden, Carleton University, email: DarrenHolden@cmail.carleton.ca
Nafiseh Kahani, Carleton University, email: kahani@sce.carleton.ca

Code Linting using Language Models

Darren Holden    Nafiseh Kahani
Abstract

Code linters play a crucial role in developing high-quality software systems by detecting potential problems (e.g., memory leaks) in the source code of systems. Despite their benefits, code linters are often language-specific, focused on certain types of issues, and prone to false positives in the interest of speed. This paper investigates whether large language models can be used to develop a more versatile code linter. Such a linter is expected to be language-independent, cover a variety of issue types, and maintain high speed. To achieve this, we collected a large dataset of code snippets and their associated issues. We then selected a language model and trained two classifiers based on the collected datasets. The first is a binary classifier that detects if the code has issues, and the second is a multi-label classifier that identifies the types of issues. Through extensive experimental studies, we demonstrated that the developed large language model-based linter can achieve an accuracy of 84.9% for the binary classifier and 83.6% for the multi-label classifier.

Keywords:
Code Linting, Language Models, Code Language Models, Static Analysis, Dynamic Analysis

1 Introduction

Ensuring code quality and maintaining consistent coding standards are essential for creating robust, maintainable, and error-free software systems. Code linting tools, often integrated into Continuous Integration (CI) pipelines, play a crucial role in this process by analyzing source code for potential problems, stylistic inconsistencies, and other issues. However, despite their widespread use, developing efficient and accurate linters poses significant challenges due to several limitations as discussed below.

Language and Domain Specific. Most existing linters are designed for specific programming languages. For instance, some tools may support only Java (such as SpotBugs spotBugs ), while others may be tailored for C/C++ (such as Cppcheck Marjamäki_2007 ). Additionally, these tools often focus on particular types of issues rather than providing comprehensive coverage. For example, one linter might specialize in detecting syntax errors and style issues, while another might focus on performance-related problems or security vulnerabilities (e.g. Checkstyle focuses on checking for violations of Java code styling conventions checkstyle ).

Accuracy and Performance. Existing linters rely on source code analysis, which can be categorized into dynamic and static methods. Dynamic analysis, while thorough, is often expensive and complicated to implement. Conversely, static analysis tends to overestimate issues and suffers from accuracy problems. This creates a situation where selecting the appropriate type of linter involves a trade-off between performance, computational resources, and accuracy. This issue is particularly critical in the context of CI, where fast CI cycles are crucial for effective and agile software development practices.

Evolving Coding Standards, Issues and Practices. Programming practices and issue types, especially security and vulnerability issues, evolve constantly. Therefore, keeping tools updated to detect new types of issues and comply with new standards and practices poses significant challenges for maintaining effective linters.

Large language models, such as OpenAI’s GPT-4 achiam2023gpt , have demonstrated proficiency in understanding and generating human language, including programming languages. These models are trained on diverse datasets encompassing vast amounts of textual and code-based information, enabling them to capture complex patterns, semantics, and contextual nuances in code. We envision that leveraging language models for code linting can eventually address the aforementioned challenges. More specifically:

Language Agnostic. Language models can be trained on multiple programming languages, making them versatile and capable of analyzing diverse codebases. This adaptability reduces the need for multiple language-specific linters tailored to different programming languages and issue types. Additionally, unifying efforts on developing a single, comprehensive linter that supports all languages can lead to more focused and efficient endeavors, ultimately creating a more robust and effective tool.

Improved Accuracy and Performance. Language models can analyze code within its broader context, considering not just syntax but also the logical flow and intent behind the code. This contextual understanding allows for more precise identification of potential issues and more meaningful suggestions for improvement. Although training language models can be time-consuming, their predictions are fast and can even surpass the speed of static analysis techniques.

Continuous Learning and Updates. By continually training on new code repositories and integrating feedback from developers, language models can evolve to stay current with emerging coding trends and practices. This ensures that the linting process remains relevant and effective, capable of detecting new types of issues and adhering to updated standards.

Despite extensive research over the last few years on program generation, repair, and code comprehension, there has been relatively little research directly addressing the use of language models for code linting. This paper explores the architecture and implementation of a language model-based code linter and evaluates its performance under different configurations. To achieve this, we first create a large dataset of code snippets and their corresponding issues by analyzing open repositories. We then develop two classifiers using an existing Code Language Model (CLM): one to detect if a given code snippet contains an issue, and another to predict the type of issue. Relying on the collected dataset, we answer the following Research Questions (RQs).

  • RQ1 How does the selected CLM perform in detecting code issues? We find that the approach is competitive with state-of-the-art results in vulnerability detection in terms of accuracy and F1 scores, achieving an accuracy of 84.0% in binary issue detection, and an F1 score of 0.838 in issue identification. It is also able to correctly detect a higher percentage of issues than is reported by the static analysis linters while being 33.5% faster than either linter on average.

  • RQ2 How do the different input formulations affect the model’s performance? We find that removing certain parts of the input code (such as comments or Javadoc) can have a significant impact on the performance of the model. Certain input formats can result in accuracy improvements of up to 3.1 percentage points.

  • RQ3 How does the fine-tuned model perform when identifying issues of types that are rare in the dataset? Our work shows that there is a significant performance decrease (an accuracy decrease of 31.1 percentage points) in issue detection when the model encounters issues that were not present in the training dataset. Similarly, in issue identification, the model achieves an F1 score of 0.912 on the issue types that are commonly found in the dataset, and a score of 0.644 on the issue types that are sparsely found in the dataset.

  • RQ4 How does our model’s performance differ when analyzing projects that are present in the pre-training dataset compared to those that are not? The results show that there is a significant improvement in our model’s performance when analyzing projects that were present in the pre-training dataset, regardless of whether those projects were included in the fine-tuning dataset. The model achieves accuracy that is up to 11.6 percentage points higher on projects that were in the pre-training dataset compared to those that were not.

The rest of this paper is organized as follows. Section 2 introduces the foundational concepts on which our work is based and examines existing studies relevant to our work. Section 3 describes our approach in detail. Section 4 presents our approach for linter tool and model selection. It also presents the results of our experimental study, addressing four key research questions.

2 Background

This section covers some background concepts related to this study and provides a brief overview of related work in this area.

2.1 Code Linting

Code linters are tools designed to automatically detect code issues by analyzing code, helping developers improve their code’s readability, maintainability, and security vayadande2023let . Linters can detect a wide variety of issues, including defects, code styling issues, and design issues, amongst others vassallo2020developers . They can also enforce coding styles johnson2013don , automate code reviews vassallo2020developers , and measure code metrics (such as cyclomatic complexity) vassallo2020developers . Code linters can utilize both static and dynamic analysis techniques in order to detect code issues. Dynamic analysis is the less common approach, but involves executing the code to observe its behaviour gong2015dlint , while static analysis, which is the focus of this project, finds issues by analyzing the source code without execution emanuelsson2008comparative . Static analysis linters include data-flow analysis and control-flow analysis beller2016analyzing , as well as more abstract techniques such as pattern matching, or bug-specific heuristic patterns ayewah2008using .

A common problem with linters is their tendency to produce false positive reports (cases where a reported issue with the software is not an actual issue), which can be more numerous than true positive reports in some cases johnson2013don . This problem arises from the undecidable nature of many software traits that static analysis linters check for, in that it can be infeasible to determine if the trait actually exists in the software or not emanuelsson2008comparative . Due to the undecidable nature of this problem, many linters aim to report no false negatives (cases where an issue with the software is not reported), while minimizing the number of reported false positives emanuelsson2008comparative . Despite the drawback of reporting many false positives, linters are often used in the software development process, with a significant proportion using more than one tool beller2016analyzing . This is due to their numerous benefits, which extend beyond just finding issues.

2.2 Neural Language Models

This study uses Neural Language Models (NLMs) to analyze code and predict if that code has issues. NLMs leverage neural networks to learn the features of text data, which are then used for tasks such as text summarization, text classification, translation, and question answering zhao2023survey . In this study, the task of interest is text classification, specifically classifying source code as having issues or not.

The NLM used in this study uses the sequence-to-sequence Transformer architecture feng2020codebert . A Transformer model takes in a text input as a sequence of tokens, encodes it into a continuous representation, and then decodes that representation into the output representation vaswani2017attention . Transformer models use self-attention to capture dependencies and relationships between the input sequence’s tokens vaswani2017attention . The Transformer architecture consists of an encoder followed by a decoder, each with a sequence of layers. Each encoder layer consists of two sub-layers (a multi-head self-attention mechanism, and a fully connected feed-forward network), while each decoder layer consists of three sub-layers (a masked multi-head attention mechanism, a multi-head self-attention mechanism, and a fully connected feed-forward network) vaswani2017attention . These combinations of layers allow the model to generate and then utilize multiple linear transformations of the input sequence, enabling it to observe the entire input sequence and the incomplete output sequence vaswani2017attention .

Furthermore, NLMs are typically pre-trained on a task or set of tasks to learn general semantic features. These pre-trained models can then be fine-tuned to more specific downstream tasks, generally resulting in improved performance on the downstream tasks zhao2023survey . Pre-training tasks can include tasks such as masked language modelling, denoising, sentence prediction, and sentence order prediction zhou2023comprehensive . This study focuses on models which are pre-trained on a code-based dataset for code-related tasks, as opposed to models that are solely pre-trained for natural language tasks.

2.3 Related Work

There has been a lot of interest in issue detection in recent years, with many different approaches being studied. This study focuses on utilizing a Transformer-based model for issue detection. However, there are many other approaches that utilize neural nguyen2022regvd , li2021vulnerability , chakraborty2021deep , li2021vuldeelocator and non-neural pearce2022asleep , bian2018nar methods.

Gao et al. analyzed the performance of 16 different language models on vulnerability detection, a specialized form of issue detection focused on detecting security vulnerabilities gao2023far . They found that GPT models performed best in both binary classification and multi-class classification tasks. Interestingly, they found that models with higher numbers of parameters did not necessarily result in significant performance improvements. Similarly, Yuan et al. compared 10 different language models on several software development tasks, including issue detection yuan2023evaluating . They focused on a comparison of zero-shot, few-shot, and fine-tuning methodologies. They found that the best model after fine-tuning achieves 61.0% accuracy on issue detection, while the best accuracy amongst the zero-shot and few-shot methodologies was 55.4% accuracy.

Zhou et al. studied how unbalanced datasets impact the performance of language models in vulnerability detection, amongst other software development tasks zhou2023devil . They found that the studied models are at least 39% more accurate in vulnerability detection when identifying “head” labels (the commonly occurring labels) than “tail” labels. For the best performing model, they reported an overall accuracy of 73.1%, with 87.0% accuracy for head labels and 60.6% for tail labels. This study highlights the importance of a balanced dataset in this field, and proposes analyzing head and tail labels separately. Fu and Tantithamthavorn proposed LineVul for vulnerability detection fu2022linevul , achieving F1, precision, and recall scores of 0.91, 0.97, and 0.86 respectively, when predicting if a function has a vulnerability, and an accuracy of 0.65 for determining the line of code with the vulnerability.

In addition to specific approaches, several surveys have reviewed the current state of issue detection (usually with a specific focus on vulnerability detection). Recently, Nong et al. performed a review of how the practice of open science is handled amongst studies which apply deep learning to vulnerability detection nong2022open . To this end, they analyzed the accessibility, executability, reproducibility, and replicability of many studies, and found that each of these areas is currently lacking in the field. Another survey performed by Bi et al. looks at benchmarking approaches for vulnerability detection, and the lack of a comprehensive approach bi2023benchmarking . Some of the challenges they highlight are the lack of availability in datasets (resembling the lack of availability noted by Nong et al. nong2022open ), a lack of a commonly used evaluation methodology, and a lack of a sufficiently large, accessible dataset. Steenhoek et al. compared many different vulnerability detection approaches on two datasets steenhoek2023empirical . They found that many of the approaches performed similarly to each other, although the types of vulnerabilities that each model detected varied. Additionally, they found that the different approaches often have an overlap in what lines of code they find most important when identifying vulnerabilities. Finally, Harzevili et al. present a comprehensive overview of machine learning approaches for vulnerability detection harzevili2023survey . Their overview includes the types of models used, the characteristics of the utilized datasets, and the challenges facing studies in this area.

A common gap in this field is that existing work has a significant focus on vulnerability detection, as opposed to general issue detection. Our work aims to address this gap, by evaluating how a language model performs when identifying a wide variety of issues in source code. While existing work focuses solely on detecting vulnerabilities gao2023far , zhou2023devil , yuan2023evaluating , and fu2022linevul , our approach aims to be a general purpose linter, identifying issues related to performance, code organization, and code clarity, on top of issues related to potential vulnerabilities.

3 Approach

The objective of this study is to determine if a language model can identify issues in a given Java method. Specifically, we aim to explore whether a language model can be tuned to perform code linting, a task that typically requires computationally expensive analysis. To achieve this goal, we follow several steps, as illustrated in Figure 1:

  • Data Collection. Methods with and without issues are collected, as discussed in Section 3.1.

  • Pre-processing of the Collected Data. The collected methods are appropriately formatted for analysis.

  • Tuning and Application of the Language Model. A language model is tuned and applied to the formatted methods to identify potential issues.

The approach analyzes the source code of a project to determine if its methods contain issues and, if so, identify the types of issues present. This process aims to provide a minimal amount of context about the methods to the language model, streamlining the linting process while maintaining accuracy. In the following, we discuss the details of the approach.

Figure 1: Overview of the approach utilized in this study

3.1 Data Collection

The approach focuses on analysing and detecting issues at the method-level. Thus, data collection involves extracting source code of methods, referred to as the target methods in the rest of this paper. Each target method includes the entire method’s body, signature, any associated annotations, and any comments and Javadoc. No further context–such as the class containing the target method, or any methods that the target method may call–is provided to the model. Figure 2 illustrates a simple target method example. Each target method is associated with any issues detected within the method (or lack thereof). For instance, the target method illustrated in Figure 2 would have an associated issue detailing the potential for a NullPointerException to be thrown.

/**
 * Prints an attribute of an Object to the console
 * @param input The object whose attribute should be printed
 */
@Override
public void printAttribute(SomeObject input) {
    // Print the attribute
    System.out.println("Attribute Value: " + input.getAttribute().toString());
}
Figure 2: A sample target method with a potential NullPointerException (as the input object may be null)

Maintaining minimal context for the target methods offers several advantages. Firstly, this minimal context allows the approach to be applied as soon as the target method is complete, potentially even before any of the target method’s dependencies exist. This means that linting can be applied earlier in the development process, enabling potential issues to be addressed sooner. Secondly, minimal context allows for quicker analysis. By only requiring a string representation of the target method, any overhead due to compiling the source code is avoided, and only the uncompiled target method needs to be extracted. Lastly, using a minimal context decreases the size of the input to the language model. This reduction enables the use of language models with smaller maximum input lengths, making the approach accessible and requiring minimal resources.

3.2 Pre-processing

Three operations are defined to modify the target method so that different pieces of information in the target method can be analyzed for their impact on the model’s performance. These operations are designed to preserve the target method’s behavior and characteristics, ensuring that the source code is not functionally impacted. The operations are as follows:

Remove Comments. All comments (both single-line and multi-line) are removed from the target method. An example of the result can be seen in Figure 3(a).

Remove Javadoc. Any provided Javadoc is removed from the target method. An example of the result can be seen in Figure 3(b).

Replace String Literals with a Token. All literal string values in the target method are replaced with a special token. An example of the result can be seen in Figure 3(c).

Using these three operations, the model is tested on eight different input formats: one consisting of the unmodified target methods (as seen in Figure 2), and seven comprised of the possible combinations of applying the three operations. The most extreme format involves the application of all three operations, as seen in Figure 3(d).

/**
 * Prints an attribute of an Object to the console
 * @param input The object whose attribute should be printed
 */
@Override
public void printAttribute(SomeObject input) {
    System.out.println("Attribute Value: " + input.getAttribute().toString());
}
(a) Target Method With Comments Removed
@Override
public void printAttribute(SomeObject input) {
    // Print the attribute
    System.out.println("Attribute Value: " + input.getAttribute().toString());
}
(b) Target Method With Javadoc Removed
/**
 * Prints an attribute of an Object to the console
 * @param input The object whose attribute should be printed
 */
@Override
public void printAttribute(SomeObject input) {
    // Print the attribute
    System.out.println([<stringliteral>] + input.getAttribute().toString());
}
(c) Target Method With String Literal Replaced by Special Token
@Override
public void printAttribute(SomeObject input) {
    System.out.println([<stringliteral>] + input.getAttribute().toString());
}
(d) Target Method With All Operations Applied
Figure 3: Examples of several input formats, applied to the sample target method shown in Figure 2.
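To make the three operations concrete, the following is a minimal regex-based sketch, not the implementation used in this study. The function name `preprocess`, its keyword arguments, and the `FORMATS` list are hypothetical; the replacement token is copied from Figure 3. The sketch matches string and character literals before comments so that a comment marker inside a literal is left alone, and it does not handle Java text blocks.

```python
import re
from itertools import product

STRING_TOKEN = "[<stringliteral>]"  # token as rendered in Figure 3

# Literals are listed first so that comment markers inside a literal
# (e.g. "http://x") are consumed as part of the literal, not as a comment.
_LEXEME = re.compile(
    r"""
      (?P<string>"(?:\\.|[^"\\\n])*")     # string literal
    | (?P<char>'(?:\\.|[^'\\\n])+')       # character literal
    | (?P<javadoc>/\*\*(?!/).*?\*/)       # Javadoc block
    | (?P<block>/\*.*?\*/)                # ordinary multi-line comment
    | (?P<line>//[^\n]*)                  # single-line comment
    """,
    re.DOTALL | re.VERBOSE,
)


def preprocess(method, *, comments=False, javadoc=False, strings=False):
    """Apply the selected operations to the source text of one target method."""
    if not (comments or javadoc or strings):
        return method  # unmodified input format

    def rewrite(match):
        kind = match.lastgroup
        if kind == "string" and strings:
            return STRING_TOKEN
        if kind == "javadoc" and javadoc:
            return ""
        if kind in ("block", "line") and comments:
            return ""
        return match.group(0)  # keep everything else unchanged

    text = _LEXEME.sub(rewrite, method)
    # Drop blank lines and trailing spaces (e.g. those left by removed comments).
    lines = (line.rstrip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# The eight input formats: every on/off combination of the three operations.
FORMATS = [
    dict(zip(("comments", "javadoc", "strings"), flags))
    for flags in product((False, True), repeat=3)
]

SAMPLE = '''/**
 * Prints an attribute of an Object to the console
 */
@Override
public void printAttribute(SomeObject input) {
    // Print the attribute
    System.out.println("Attribute Value: " + input.getAttribute().toString());
}'''

if __name__ == "__main__":
    print(preprocess(SAMPLE, comments=True, javadoc=True, strings=True))
```

Calling `preprocess(SAMPLE, **fmt)` for each `fmt` in `FORMATS` yields one variant per input format; a parser-based implementation would be more robust than this regex sketch.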

CodeBERT, the language model utilized for this study, has a maximum input length of 512 tokens feng2020codebert . More than 99% of the target methods (and thus the input) are shorter than this limit. However, any methods that exceed 512 tokens are too long for CodeBERT’s input capacity and are therefore truncated. The target methods are truncated after they have been modified with any applicable modification operations.
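The order of these two steps matters: modifying before truncating can keep code that would otherwise be cut off. The sketch below illustrates this with a whitespace split standing in for the real tokenizer, which is an assumption made to keep the example self-contained; `prepare_input` and `drop_line_comments` are hypothetical helpers, and special tokens are ignored.

```python
MAX_TOKENS = 512  # maximum input length stated in the text


def prepare_input(method, modify, tokenize, max_tokens=MAX_TOKENS):
    """Apply the modification operations first, then tokenize and truncate."""
    tokens = tokenize(modify(method))
    return tokens[:max_tokens]


def drop_line_comments(source):
    """Hypothetical stand-in for a modification operation."""
    kept = [ln for ln in source.splitlines() if not ln.strip().startswith("//")]
    return "\n".join(kept)


# A method whose 300 comment lines (900 whitespace tokens) precede the code.
long_method = "\n".join(["// filler comment"] * 300 + ["return input.getAttribute();"])

unmodified = prepare_input(long_method, lambda s: s, str.split)
modified = prepare_input(long_method, drop_line_comments, str.split)

if __name__ == "__main__":
    print(len(unmodified), "return" in unmodified)  # code was truncated away
    print(modified)                                 # code survives
```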

3.3 Model training

The model was trained on two types of tasks. The first is a binary classification task to determine whether the provided target method has an issue or not. For this task, the model output is one of two labels. One label denotes the presence of an issue, and the other indicates that no issues were found.

The second task is multi-label classification, identifying which issues the target method may have. To perform multi-label classification, the output was formatted as a binary output for each issue type, along with a binary output for a “no issues” type. This no issues type was included to ensure the model’s output explicitly indicates when no issues are detected.
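As an illustration of this output format, the sketch below encodes a method's issues as one binary value per label, including the explicit “no issues” label. The label name `NO_ISSUE`, the three example issue IDs, and the 0.5 decision threshold are assumptions made for the example, not details taken from the study.

```python
LABELS = ["NO_ISSUE", "I3", "I5", "S8"]  # illustrative subset of issue types
INDEX = {name: i for i, name in enumerate(LABELS)}


def encode(issue_ids):
    """Multi-hot target vector; floats, as binary cross-entropy losses expect."""
    target = [0.0] * len(LABELS)
    if not issue_ids:
        target[INDEX["NO_ISSUE"]] = 1.0
    for issue in issue_ids:
        target[INDEX[issue]] = 1.0
    return target


def decode(probabilities, threshold=0.5):
    """Labels whose predicted probability meets the (assumed) threshold."""
    return [name for name, p in zip(LABELS, probabilities) if p >= threshold]


def has_issue(issue_ids):
    """Binary-task label derived from the same annotation."""
    return int(bool(issue_ids))


if __name__ == "__main__":
    print(encode(["S8", "I3"]))
    print(encode([]))
    print(decode([0.1, 0.9, 0.2, 0.7]))
```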

Our implementation uses the HuggingFace Transformers library wolf2020transformers . We employ a pre-trained model, retrieved from HuggingFace (the selected model is specified in Section 4.3). We also utilize the AutoModelForSequenceClassification from the Transformers library automodels , which loads the pre-trained model, and adds a final fully-connected layer for classification, with one output for each label. Each generated model was then trained on the collected dataset, as described in Sections 4.2 and 4.2.5, using the Trainer class from the Transformers library trainer .

4 Evaluation

This section details the execution and evaluation of the approach. It also includes a detailed discussion of the results.

4.1 Linter Selection

To conduct this study, we selected static analysis linters for identifying issues in Java code and creating a dataset. Given the large number of static analysis linters available, several criteria were used to narrow down the possible options. The following section discusses these criteria.

Availability. For this study, only linters that were freely available were considered. This decision was made in order to make the data collection methodology more accessible and reproducible. This excluded tools such as Coverity Coverity .

Language Analyzed. As this study focused on Java code, only linters capable of analyzing Java code were considered. This excluded tools such as Cppcheck, which only analyzes C and C++ code Marjamäki_2007 .

Automatable. To simplify the data collection process, only linters that could be run locally through a terminal command were considered. This excludes tools such as SonarCloud, which is offered as a remotely-hosted Software as a Service (SaaS) SonarCloud .

Full Project Analysis. To simplify the data collection process, only linters that analyze a full project were considered. This criterion would filter out tools that only analyze pull request differences. No tools were disqualified using this criterion, as all investigated tools had the capability of analyzing full projects.

Issue Location. To build a coherent dataset, it is necessary to identify the location of the discovered issues. Therefore, only tools that provide this location information were considered. This is a very common feature among static analysis linters, so none of the investigated tools were disqualified as a result of this criterion.

Based on these criteria, two static analysis linters were selected for use in this study: Infer v1.1.0 infer and SpotBugs v4.8.3 spotBugs .

4.2 Dataset

4.2.1 Project selection.

A large dataset was constructed by using Infer and SpotBugs to analyze a variety of open-source Java projects. The initial list of projects was sourced from the CodeSearchNet dataset husain2020codesearchnet . This list was then refined based on two criteria: (1) the projects must be publicly available on GitHub, and (2) their build processes must use Maven. The requirement for GitHub availability ensures that the full project source code is easily accessible. The requirement of using Maven for the build process is due to the fact that the selected static analysis linters require compilation of the source code. Requiring that all considered projects use Maven for their build process allows for uniform automation of the project analysis and ensures that project dependencies are easily accounted for. Each project selected for inclusion was required to have a Maven pom file located at the root of the project directory. Projects were not excluded based on any other specific characteristics (such as number of GitHub stars, number of forks, and repository age) to avoid potential biases in the dataset. This step resulted in 3,268 candidate projects.

The list of candidate projects was expanded by querying the GitHub REST API githubrest for projects written in Java that are tagged with the Maven topic. Unfortunately, the GitHub API limits queries to only return 1,000 results, which was insufficient for the number of candidate projects we wished to collect. To collect additional data, the query was run repeatedly with a varying term specifying the age of the most recent commit. This allowed us to construct a list of new projects by repeatedly querying the API for older and older projects. This list of projects was then filtered to remove any projects that were included in CodeSearchNet, and any projects that did not have a root-level pom file (as with the CodeSearchNet projects). Using this methodology, we retrieved 3,209 new candidate projects, bringing the total number of candidate projects to 6,477.
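One way to work around the 1,000-result limit is to split the search into consecutive date windows and issue one query per window. The sketch below only builds the request URLs and performs no network calls. The use of the `pushed:` qualifier for the age of the most recent commit and the 30-day window size are assumptions; the exact query terms used in this study are not reproduced here.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def windowed_queries(start, end, days=30):
    """Yield one search URL per non-overlapping date window, newest first."""
    upper = end
    while upper >= start:
        lower = max(start, upper - timedelta(days=days - 1))
        query = (
            "language:java topic:maven "
            f"pushed:{lower.isoformat()}..{upper.isoformat()}"
        )
        yield f"{SEARCH_URL}?{urlencode({'q': query, 'per_page': 100})}"
        upper = lower - timedelta(days=1)  # step back to the next older window


if __name__ == "__main__":
    for url in windowed_queries(date(2024, 1, 1), date(2024, 3, 1)):
        print(url)
```

Each window is still subject to the same result cap, so a window that returns the maximum number of results would need to be narrowed further.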

4.2.2 Analysis of the candidate projects.

Infer and SpotBugs were then used to analyze each of the candidate projects retrieved in Section 4.2.1. Each tool was configured to collect all non-experimental bugs available in each tool by default. Since these tools require each project to be compiled, Maven v3.6.3 was used along with Java versions 8, 11, or 17, depending on what was specified in the project’s pom file. Due to the number of projects requiring analysis, each analysis was allowed to run for only 25 minutes, after which the process was terminated. If an analysis failed for a reason other than reaching the 25-minute time limit, two steps were taken to maximize the number of discovered issues. First, the compilation was re-attempted with the other Java versions. Attempting to compile with other Java versions accounted for cases where the Java version collected from the pom file was incorrect. Second, the compilation was re-attempted with any other poms in the project. Attempting to compile with other poms allowed for the potential of finding some issues (e.g., by analyzing a sub-module of the project in question) instead of discounting the project entirely.
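The retry logic described above can be sketched as an ordered list of attempts. The callable `run` is a hypothetical wrapper around a linter invocation that reports whether the analysis succeeded, failed, or hit the time limit. Two details are assumptions of this sketch rather than statements from the study: other poms are tried with the declared Java version, and a timeout during a fallback attempt also ends the analysis.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    FAILED = "failed"
    TIMEOUT = "timeout"


JAVA_VERSIONS = (8, 11, 17)
TIME_LIMIT_S = 25 * 60  # 25-minute limit per analysis


def analyze_project(root_pom, declared_java, other_poms, run):
    """Return (pom, java_version) of the first successful attempt, else None.

    `run(pom, java_version, timeout_s)` is a hypothetical callable that
    performs one analysis and returns an Outcome.
    """
    attempts = [(root_pom, declared_java)]
    # Step 1: retry the root pom with the remaining Java versions.
    attempts += [(root_pom, v) for v in JAVA_VERSIONS if v != declared_java]
    # Step 2: retry with any other poms found in the project.
    attempts += [(pom, declared_java) for pom in other_poms]

    for pom, version in attempts:
        outcome = run(pom, version, TIME_LIMIT_S)
        if outcome is Outcome.OK:
            return pom, version
        if outcome is Outcome.TIMEOUT:
            return None  # fallbacks apply only to non-timeout failures
    return None
```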

Each discovered issue was then mapped to the Java method it was contained in. The scope of this project is limited to considering Java methods, so any issues discovered outside the scope of a method were discarded. Using this methodology, 2,907 projects had at least one discovered issue, resulting in a total of 108,182 issues from 84,747 methods. To round out the dataset, methods with no detected issues were randomly selected so that the number of methods with no issues was equal to the number of methods with issues. This resulted in a dataset containing 169,494 total Java methods. The 108,182 found issues are composed of 118 types of issues (10 from Infer, and 108 from SpotBugs). All the collected issue types are listed and described in Table LABEL:table:issueDefs. This table also specifies IDs used to refer to each issue type. IDs beginning with ‘I’ refer to issues reported by Infer, and IDs beginning with ‘S’ refer to issues reported by SpotBugs.
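The mapping and balancing steps can be illustrated with a small sketch. It assumes each method is described by its file and line range and each issue by a file, line, and type; the `Method` record and both helper functions are hypothetical names, and nested methods (for example, in anonymous classes) are not treated specially.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Method:
    file: str
    start: int  # first line of the method (inclusive)
    end: int    # last line of the method (inclusive)
    issues: list = field(default_factory=list)


def attach_issues(methods, issues):
    """Attach each (file, line, type) issue to its method; return the number discarded."""
    discarded = 0
    for file, line, issue_type in issues:
        owner = next(
            (m for m in methods if m.file == file and m.start <= line <= m.end),
            None,
        )
        if owner is None:
            discarded += 1  # issue lies outside any method
        else:
            owner.issues.append(issue_type)
    return discarded


def balance(methods, seed=0):
    """Keep all methods with issues plus an equal-sized random sample without."""
    flagged = [m for m in methods if m.issues]
    clean = [m for m in methods if not m.issues]
    rng = random.Random(seed)
    return flagged + rng.sample(clean, min(len(flagged), len(clean)))
```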

4.2.3 Equivalent issue mapping.

In order to reduce model confusion, the identified issue types were analyzed to convert equivalent issue types into a single type. Equivalent issue types are defined as issue types which are very similar to each other. This ensures that equivalent issues are all labelled the same way, rather than having multiple labels that essentially denote the same issue. To accurately categorize the issues reported, we first reviewed detailed documentation describing different types of issues inferDesc , spotBugsDesc , and then manually reviewed collocated issues.

The simplest case of equivalent issues involves multiple issue types reported by a single tool having identical or nearly identical definitions. For example, issues I1 and I2 have the exact same definition according to the Infer documentation inferDesc .

Following the analysis of the documentation, we manually reviewed random samples from the dataset to ensure that equivalent issues were classified under the equivalent types. The samples were randomly selected such that only methods with multiple reported issues on the same line were reviewed, as it is assumed that equivalent issues would often be collocated. In this way, 400 samples were manually reviewed to obtain a 95% confidence level and a 5% margin of error. Collocated issues were compared to their respective issue types’ documentation to determine if they were equivalent. For instance, it was found that issues I3 and S6 were often collocated. According to the Infer documentation, I3 refers to cases where a resource may not be closed, especially if an exception occurs while accessing the resource inferDesc . In the SpotBugs documentation, S6 refers to cases where a “stream, database object, or other resource requiring an explicit cleanup operation” may not be properly closed spotBugsDesc . Manual review of this information determined that these two types of issues are equivalent.
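The relationship between sample size, confidence level, and margin of error can be checked with Cochran's formula, n = z²p(1−p)/e². The text does not state which formula was used, so this is an assumed reconstruction with the conservative choice p = 0.5; the function name and the optional finite-population correction are likewise illustrative.

```python
from math import ceil
from statistics import NormalDist


def cochran_sample_size(confidence=0.95, margin=0.05, p=0.5, population=None):
    """Minimum sample size for estimating a proportion (Cochran's formula)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    n = z * z * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)  # finite-population correction
    return ceil(n)


if __name__ == "__main__":
    print(cochran_sample_size())  # 385
```

Under these assumptions the minimum is 385 samples, so reviewing 400 samples meets the stated 95% confidence level and 5% margin of error.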

There were many cases in which collocated issues were similar enough to be deemed equivalent, but there were also cases in which collocated issues were not similar enough to be considered equivalent. For example, consider issues I5 and S8, which were often collocated. Issue I5 refers to cases where there is a potential data race inferDesc , while issue S8 refers to cases where references to an object’s mutable field are returned, potentially leading to unchecked and unexpected changes to the object’s field spotBugsDesc . Although S8 could lead to a potential data race, it has a much broader application, so these two issues were not marked as equivalent.

After determining equivalent issues, the dataset labels were modified to reflect the same label for all equivalent issues. The set of equivalent issues can be found in Table 15.

4.2.4 Dataset imbalance.

The collected dataset is very unbalanced in terms of the distribution of issue types. A few types are very numerous (such as issue type S8), while many others have very few occurrences (such as issue type S61). To better balance the dataset, the least frequently occurring issues are removed from the dataset during fine-tuning and evaluation, which is only performed with the top 75% most frequently occurring issues. This includes all issue types that occur at least 15 times. With this filter applied, the dataset contains 84,664 methods with issues, and 75 issue types. Random methods without issues were removed until there were 84,664 remaining in the dataset. This is the case unless otherwise specified, and other values for the issue count cutoff are explored as part of RQ3.
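A minimal sketch of this filtering and balancing step is given below. It assumes that a method is dropped from the positive set once all of its issue types fall under the cutoff, which is consistent with the drop from 84,747 to 84,664 methods but is not stated explicitly. The data layout (a dictionary from method ID to issue type IDs) and the function name are hypothetical.

```python
import random
from collections import Counter


def filter_and_balance(issue_methods, clean_methods, min_count=15, seed=0):
    """Keep issue types seen at least `min_count` times, then balance the classes.

    issue_methods: dict mapping a method ID to the list of its issue type IDs.
    clean_methods: list of method IDs without any reported issue.
    """
    counts = Counter(t for types in issue_methods.values() for t in types)
    kept_types = {t for t, c in counts.items() if c >= min_count}

    # Strip rare labels; methods left without any label are dropped (assumption).
    kept = {}
    for method_id, types in issue_methods.items():
        remaining = [t for t in types if t in kept_types]
        if remaining:
            kept[method_id] = remaining

    # Randomly keep as many clean methods as there are methods with issues.
    rng = random.Random(seed)
    sampled_clean = rng.sample(list(clean_methods), min(len(kept), len(clean_methods)))
    return kept, sampled_clean, kept_types
```

For example, with `min_count=2`, a method whose only issue type occurs once in the dataset is removed, and the clean methods are subsampled to match the remaining count.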

4.2.5 Dataset splits.

The dataset of 169,328 methods is split into three sets: the train set, the validation set, and the test set. This is performed by allocating 80% (135,464 methods) of the dataset to the training set, and 10% (16,932 methods) each to the validation and test sets. This approach aligns with other studies in this area li2021vulnerability , steenhoek2023empirical . An important note is that each of the three sets was selected such that 50% of the samples in each set have issues, and the remaining 50% of the samples do not.
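One way to obtain splits that keep the 50/50 class balance in every set is to split the two classes separately and then merge them. The sketch below illustrates this under the assumption of a simple shuffled split; the function name and the rounding behaviour are illustrative and need not match the procedure actually used.

```python
import random


def stratified_split(with_issues, without_issues, train=0.8, validation=0.1, seed=0):
    """Split each class 80/10/10 so every set keeps the same class balance."""
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for group in (with_issues, without_issues):
        items = list(group)
        rng.shuffle(items)
        n_train = round(len(items) * train)
        n_val = round(len(items) * validation)
        splits["train"] += items[:n_train]
        splits["validation"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]  # remainder goes to the test set
    return splits
```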

4.3 Model Selection

In this study, we aim to fine-tune a language model capable of reproducing the results of static analysis linters. Therefore, the language model was chosen with care as it is an integral part of this study. We used the following criteria to narrow down the options from the many available language models.

Open Source. To facilitate fine-tuning and replicability, this study focused on models which are open-source. This excludes models such as GPT-4 achiam2023gpt .

Pre-Trained on Code. We focused on language models that were pre-trained on a corpus including code. The problem of identifying issues in code relies on an understanding of code semantics and structure, so having a model pre-trained on code-related tasks is advantageous. This excludes language models such as T5 raffel2020exploring , which is only pre-trained for natural language tasks. Since the collected dataset consists solely of Java code, the candidate models were further constrained to those which were pre-trained on a dataset including Java code. This allows the model to leverage its prior knowledge of Java code structure.

Performance on Similar Tasks. To achieve the best possible results, candidate models were limited to those that have performed well in similar tasks, such as defect detection and vulnerability detection. For instance, CodeBERT feng2020codebert achieves good results as part of LineVul fu2022linevul , and GPT-4 achiam2023gpt outperforms other models in Gao et al.’s study gao2023far .

Model Size. The size of the model that could be utilized in this study was limited due to the lack of access to large GPUs for fine-tuning. For instance, CodeLlama roziere2023code , a model that was considered for this study, could not be fine-tuned on the available GPUs due to memory constraints.

Based on the above criteria, CodeBERT feng2020codebert was selected as the model of interest for this study. This model has been successfully used in other studies to detect vulnerabilities fu2022linevul , zhou2023devil . Specifically, we use the “microsoft/codebert-base” model available on HuggingFace (https://huggingface.co/microsoft/codebert-base).

4.4 Research Questions

In order to evaluate the approach, we aim to address the following research questions (RQs).

  • RQ1 How does the selected Code Language Model (CLM) perform in detecting code issues (Overall performance)? The primary objective of this study is to evaluate the capability of CLMs. Thus, in this RQ, initially, we examine whether the CLM can detect the presence or absence of issues (binary classification). Subsequently, we assess its ability to identify the types of issues as well (multi-label classification).

  • RQ2 How do the different input formulations affect the model’s performance (Impact of different input formulation)? As discussed in Section 3.2, there are certain types of information (such as comments) in the target methods that could either be helpful or detrimental to the model. This research question aims to determine how the inclusion or exclusion of different types of information affects the model’s performance.

  • RQ3 How does the fine-tuned model perform when identifying issues of types that are rare in the dataset (Detection of rare issues)? Due to the nature of source code issues, the collected dataset has a very unbalanced label distribution, with some labels appearing as little as once. This research question investigates whether the model can reliably identify the presence of these rare issues, which would greatly increase its usefulness in the real world.

    • RQ3.1 Can the binary classification model identify that a method has an issue when all of the method’s issues are of unseen types? While Infer and SpotBugs report a large range of issue types, there are issues that they did not report against the projects in the dataset (such as the USM issue type in SpotBugs spotBugsDesc ), and there are likely issue types that the two tools are incapable of reporting. Being able to identify issues such as these would be a useful trait in the model.

    • RQ3.2 How does the performance of the multi-label classification model differ on rare issue types compared to common issue types? The key goal of the multi-label classification model is to identify the presence of particular issue types. If the model could only identify the commonly found issue types, it would be of limited use.

    • RQ3.3 How does the performance of each model change as rare issues are removed from the dataset? This research question aims to explore the extent to which rarely occurring issues act as confusing noise to the model.

  • RQ4 How does our model’s performance differ when analyzing projects that are present in CodeSearchNet compared to those that are not (generality of the model)? Given that CodeBERT was pre-trained using the CodeSearchNet dataset feng2020codebert , we want to determine how the pre-training affects the overall performance and whether the presence of projects in both the pre-training and fine-tuning datasets impacts the model’s overall performance.

    • RQ4.1 How does the model perform when evaluated on projects that were not included in the fine-tuning dataset? Knowing how well the model generalizes to unseen projects will inform how easily this approach could be adopted in the real world.

4.5 Evaluation Metrics

In order to compare the performance of issue identification, we use the accuracy, precision, recall, and F1 score metrics, which are commonly used in similar studies gao2023far , zhou2023devil , li2021vuldeelocator . In the following descriptions, T refers to the test dataset, TP refers to the number of true positive results, TN refers to the number of true negative results, FP refers to the number of false positive results, and FN refers to the number of false negative results.

Accuracy: The proportion of test samples for which the model gave the correct output.

\[ \text{Accuracy} = \frac{TP + TN}{|T|} \]

Precision: The proportion of true positive predictions amongst all the positive predictions that the model generated.

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall: The proportion of true positive predictions amongst all the positive samples in the test set.

\[ \text{Recall} = \frac{TP}{TP + FN} \]

F1 Score: The harmonic mean of the precision and recall.

\[ \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
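The four definitions above can be computed directly from the confusion counts. The following sketch is illustrative only; the function names are hypothetical, and returning 0.0 for an undefined ratio is a convention assumed here rather than one taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn  # |T|, the size of the test set
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Consistency check using the binary model's precision and recall from Table 1.
print(round(f1_score(0.828, 0.857), 3))
```

Applying `f1_score` to the precision and recall reported for the binary model in Table 1 reproduces the reported F1 score after rounding to three decimals.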

4.5.1 Multi-Label Classification Metrics:

For multi-label classification, the model’s precision, recall, and F1 score are reported as the weighted average of each metric across all issue types. For instance, the reported precision value for the multi-label classification model is the weighted average of the model’s precision for each issue type. This is calculated using the following formula, where I is the set of issue types that appear at least once in the test dataset, n(i) is the number of occurrences of issue type i in the test dataset, and m(i) is the value of the metric of interest for the given issue type i.

\[ \text{Weighted Average Metric} = \frac{\sum_{i \in I} n(i)\, m(i)}{\sum_{i \in I} n(i)} \]

The multi-label classification model’s accuracy is calculated as the proportion of test samples where the set of predicted issue types exactly matches the set of expected issue types.
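Both multi-label measures can be sketched in a few lines. The dictionaries and label sets below are made-up inputs used only to illustrate the calculation, and the function names are hypothetical.

```python
def weighted_average(metric_by_type, count_by_type):
    """Average of m(i) over issue types i, weighted by the occurrence count n(i)."""
    total = sum(count_by_type.values())
    return sum(count_by_type[i] * metric_by_type[i] for i in count_by_type) / total


def exact_match_accuracy(predicted, expected):
    """Share of samples whose predicted label set equals the expected label set."""
    matches = sum(1 for p, e in zip(predicted, expected) if set(p) == set(e))
    return matches / len(expected)


# Illustrative values only: a frequent type scored 0.9 and a rare type scored 0.5.
print(weighted_average({"S8": 0.9, "S61": 0.5}, {"S8": 90, "S61": 10}))
```

Because the average is weighted by n(i), frequent issue types dominate the reported value, while the exact-match accuracy counts a sample as wrong if even one label is missing or spurious.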

4.6 Overall Models’ Performance (RQ1)

This section aims to show that we have developed a language model-based approach that is an effective linter. To evaluate this, we fine-tune and evaluate a binary classification model and a multi-label classification model for issue detection and issue identification, respectively. For this initial experiment, we use the unmodified target methods, without any of the modifications described in Section 3.2. The dataset for this experiment is split as described in Section 4.2.5, in that the train, validation, and test datasets consist of 80%, 10%, and 10% of the total dataset, respectively. The results of this experiment can be seen in Table 1.

As can be seen, the approach achieves a high degree of accuracy in both tasks. The binary classification model achieves an accuracy of 0.840 with an F1 score of 0.842 when detecting issues. Similarly, the multi-label classification model can identify issues with an accuracy of 0.832 and an F1 score of 0.838. When comparing these results to other works, we outperform the results reported by Zhou et al. (an accuracy of 0.731) zhou2023devil , and Yuan et al. (an accuracy of 0.61) yuan2023evaluating . While the work by Fu and Tantithamthavorn outperforms ours (as they achieved an F1 score of 0.91), their dataset focuses solely on vulnerabilities, whereas ours is broader, including many other classes of issues such as poor performance (e.g., issues I4 and S66), poor code organization (e.g., issue S21), and poor code clarity (e.g., issue S16), in addition to issues related to vulnerabilities (e.g., issues S23 and S40).

Model Accuracy Precision Recall F1
Binary 0.840 0.828 0.857 0.842
Multi-Label 0.832 0.851 0.834 0.838
Table 1: The Effectiveness of the Binary Classification and Multi-Label Classification Models at Linting

4.6.1 Comparison to Linting Tools

The approach’s performance can also be compared to the linters used for data collection (Infer and SpotBugs). For this comparison, we consider all the methods in the test dataset that have at least one issue. Of these methods, 25.0% are flagged as having an issue by Infer, 82.7% are flagged by SpotBugs, and our approach correctly identifies 85.7% of the methods with issues (as seen through the recall reported in Table 1). This shows that our approach is capable of identifying more methods with issues than either tool can by itself.

Additionally, we compare the time required to run each of the linters with the time required by our approach. Due to the length of time required to perform the analysis, it was infeasible to re-analyze all 6,477 candidate projects with both linters. Instead, each linter was run on a number of randomly selected projects which comprised 10% of the data collected by each linter. This means that 103 projects were analyzed for Infer and 182 projects were analyzed for SpotBugs. The fine-tuned models were timed according to how long they took to format and analyze each method in the selected projects. To ensure a fair comparison, each model was evaluated twice: once on the projects analyzed by Infer, and once on the projects analyzed by SpotBugs. In addition, the time it took to extract every method in the selected projects was recorded, which is our approach’s equivalent to the linters’ requirement of compiling the code.

The results of this evaluation can be seen in Table 2. On the 103 projects selected for Infer, Infer took an average of 42.966 seconds to analyze a project. In comparison, the extraction of the Java source code took an average of 4.514 seconds to extract all methods, the binary classification model took an average of 9.315 seconds to analyze all methods, and the multi-label classification model took an average of 9.330 seconds to analyze all methods. Our approach is much faster than Infer, requiring 46.10% less time than Infer’s analysis to run the Java extraction, binary model, and multi-label model in sequence.

For the 182 SpotBugs projects, SpotBugs took an average of 25.326 seconds to analyze each project. Extraction of the Java source code took an average of 3.962 seconds to extract all methods, the binary model took an average of 7.935 seconds to analyze all methods, and the multi-label model took an average of 8.135 seconds to analyze all methods. This shows that our approach is also faster than SpotBugs, requiring 20.91% less time than SpotBugs’ analysis to run the Java extraction, binary model, and multi-label model in sequence.

The most significant performance improvement comes when comparing our approach to running both tools, in which case our approach requires 66.09% less time. This figure was determined by summing the average times for both linters (42.966 seconds for Infer and 25.326 seconds for SpotBugs), and comparing that to the sum of the average times for our approach on the Infer projects (23.159 seconds total). We only use the times for our approach on the Infer projects for this comparison, since they are the slower set of times. This shows that, in addition to finding a higher ratio of the issues in the code than either individual linter, our approach is also much faster at analyzing a codebase than the linters.
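The reported savings can be re-derived from the mean times in Table 2. The short calculation below is only a worked check of that arithmetic; the constant and function names are illustrative.

```python
# Mean elapsed times in seconds, copied from Table 2.
INFER_MEAN = 42.966
SPOTBUGS_MEAN = 25.326
OURS_ON_INFER_PROJECTS = (4.514, 9.315, 9.330)  # extraction, binary, multi-label


def percent_less_time(baseline_seconds, ours_seconds):
    """Percentage of the baseline time that is saved."""
    return 100.0 * (1.0 - ours_seconds / baseline_seconds)


ours_total = sum(OURS_ON_INFER_PROJECTS)
vs_infer = percent_less_time(INFER_MEAN, ours_total)
vs_both = percent_less_time(INFER_MEAN + SPOTBUGS_MEAN, ours_total)
print(f"total {ours_total:.3f} s, {vs_infer:.2f}% vs. Infer, {vs_both:.2f}% vs. both")
```

Note that these percentages compare mean per-project times, so they describe the average case rather than any individual project.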

Elapsed Time (Seconds)
Case                 Min.     Max.      Mean     Std. Deviation
Projects analyzed by Infer (103 projects)
Infer                5.386    847.09    42.966   95.756
Java Extract         1.247    23.751    4.514    4.006
Binary Model         0.130    148.579   9.315    18.339
Multi-Label Model    0.103    147.865   9.330    18.306
Projects analyzed by SpotBugs (182 projects)
SpotBugs             3.589    704.928   25.326   68.037
Java Extract         0.322    47.015    3.962    4.988
Binary Model         0.027    144.124   7.935    17.819
Multi-Label Model    0.029    149.013   8.135    18.272
Table 2: The Elapsed Time in Seconds Comparing Both the Binary and Multi-Label Classification Models to Infer and SpotBugs

4.7 Input Format Comparison (RQ2)

This section aims to evaluate how the models perform with different input formats. These input formats are created by applying the operations described in Section 3.2 in different combinations. The dataset for these experiments is split as described in Section 4.2.5, and the same splits were used for each experiment. The results of this evaluation can be seen in Table 3 for binary classification and Table 4 for multi-label classification. In these tables, a shorthand is used to denote the applied input formulation operations: RC represents the remove comments operation, RJ represents the remove Javadoc operation, and RS represents the replace string literals operation. The unmodified format is the target method input with no operations applied. The best performing format for the binary classification problem is RJ, with an accuracy of 0.849 and an F1 score of 0.847. The RC+RJ format performed best for the multi-label classification problem, with an accuracy of 0.836 and an F1 score of 0.843.
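To make the three operations concrete, the sketch below shows one possible regex-based implementation. It is not the implementation used in this study: the regular expressions, the `format_method` signature, the `"STR"` placeholder, and the assumption that RC leaves Javadoc blocks untouched are all illustrative choices. String literals are matched explicitly so that comment markers inside strings are not removed by mistake.

```python
import re

# One alternative per token kind; earlier alternatives take priority.
_TOKEN = re.compile(
    r'(?P<javadoc>/\*\*(?!/).*?\*/)'        # /** ... */ blocks
    r'|(?P<block>/\*.*?\*/)'                # other /* ... */ blocks
    r'|(?P<line>//[^\n]*)'                  # // line comments
    r'|(?P<string>"(?:\\.|[^"\\\n])*")'     # "string literals"
    r"|(?P<char>'(?:\\.|[^'\\\n])+')",      # 'c' character literals (kept as-is)
    re.DOTALL,
)


def format_method(source, rc=False, rj=False, rs=False, placeholder='"STR"'):
    """Apply the RC, RJ, and RS operations to the source text of a Java method."""

    def substitute(match):
        kind = match.lastgroup
        if kind == "javadoc" and rj:
            return ""
        if kind in ("block", "line") and rc:
            return ""
        if kind == "string" and rs:
            return placeholder  # hypothetical placeholder token
        return match.group(0)

    return _TOKEN.sub(substitute, source)
```

For example, `format_method(src, rc=True, rj=True)` would correspond to the RC+RJ format, while calling it with no flags returns the unmodified input.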

Interestingly, in both cases the performance of the best input format is still fairly close to that of the unmodified input. For binary classification, the unmodified input achieves an accuracy of 0.840 and an F1 score of 0.842, only slightly lagging behind the RJ format, and it achieves the best recall score of 0.857 (tied with the RC format). This likely indicates that while some benefit can be gained from the operations applied to the target method, additional context (such as the methods which the target method calls) is likely required to gain significant performance improvements beyond what is achieved in this study. Exploring this idea is left for future work.

In the case of multi-label classification, all of the results are close together: there is a difference of only 1.2 percentage points between the best and worst accuracy (the RC+RJ and RS inputs respectively), and the unmodified format achieves an accuracy of 0.832 (0.004 less than the highest accuracy). This is likely due to the unbalanced nature of the dataset’s issue types, where the less frequently occurring issue types may be limiting the model’s performance. This idea is explored more in Section 4.8. The F1 scores for each issue type were also investigated to determine if particular issue types had better performance with certain input formats compared to the average. Overall, the variance in F1 score results between different input formats seems to be partially tied to the number of instances of the issue type in the dataset. For each of the 10 most frequently occurring issue types in the test dataset, the best and worst F1 scores have a difference of less than 0.061. In contrast, the less numerous issue types can have a difference of up to 0.821 between the best and worst F1 scores, indicating that the input format is less important than the number of samples for an issue type. Overall, 6 issue types (including 4 of the top 5 most numerous types) achieve their best F1 performance using the RC+RJ format, matching the best format according to the weighted average.

Further analysis was performed on all issue types that had a difference greater than 0.01 between their best and worst F1 scores (as a difference less than this is interpreted as the model’s performance on that issue type being input-format agnostic). Interestingly, every input format resulted in the best F1 score for at least one issue type. The performance of different input formats does not seem to follow any trends within issue types. For instance, consider issue S18, for which the RC+RJ format achieved the best F1 score of 0.627, while the RC and RJ formats resulted in some of the worst F1 scores (0.368 and 0.412 respectively) for this type. Based on a manual analysis of the issue types and some samples from the dataset, it was determined that none of the input formats would be particularly helpful over the others when manually identifying issues. Many issue types (such as S15 and S23) would require additional context from either the target method’s class or from the methods which the target method calls. Based on this analysis, an avenue for future work would be to investigate how our approach performs when using input formats that include additional context from outside the target method.

For the binary classification model, it was determined that the RJ input format performed the best. It achieved the best results in both the accuracy and F1 metrics, and nearly matches the best precision result. It does lag behind other formats in recall. For the multi-label classification, the RC+RJ format achieves the best performance, achieving or matching the best result in every metric except precision, where it very slightly lags behind. Given these results, the remaining experiments are performed using these input formats for each of the respective models.

Input Format Accuracy Precision Recall F1
Unmodified 0.840 0.828 0.857 0.842
RC 0.818 0.795 0.857 0.825
RJ 0.849 0.854 0.841 0.847
RS 0.826 0.845 0.798 0.821
RC+RJ 0.842 0.835 0.852 0.844
RC+RS 0.828 0.814 0.850 0.832
RJ+RS 0.836 0.855 0.809 0.831
RC+RJ+RS 0.819 0.807 0.838 0.822
Table 3: Binary Classification Results
Input Format Accuracy Precision Recall F1
Unmodified 0.832 0.851 0.834 0.838
RC 0.832 0.854 0.833 0.838
RJ 0.830 0.852 0.833 0.837
RS 0.824 0.845 0.826 0.831
RC+RJ 0.836 0.853 0.838 0.843
RC+RS 0.827 0.851 0.829 0.835
RJ+RS 0.827 0.854 0.833 0.839
RC+RJ+RS 0.827 0.851 0.829 0.835
Table 4: Multi-label Classification Results

4.8 Classification Performance on Rare Issue Types (RQ3)

This section will address RQ3, which examines how the models perform when confronted with rare issue types. Three studies were performed to address this research question.

4.8.1 Evaluating the Binary Classification Model on Unseen Issue Types (RQ3.1):

The first study aimed to evaluate whether the binary classification model could identify methods that have issues of types not present in the training dataset. The datasets for this experiment were created using the algorithm shown in Algorithm 1. In this algorithm, a test dataset is created which contains methods with issues only of the randomly selected unseen types and methods without any issues in equal amounts. Methods that contain both unseen and seen issue types are removed from the dataset, to avoid cases where the model may encounter an unseen issue type during training, and cases where a seen issue type is present in the test dataset. The remainder of the dataset is then split to form train, validation, and test datasets using the split methodology described in Section 4.2.5. This second test dataset serves as a baseline point of comparison for how the model performs in identifying the presence of seen issue types.

Algorithm 1 Creating Datasets for Evaluating the Binary Classification Model on Identifying Unseen Issue Types
mainDataset ← allTargetMethods()
unseenIssueTypes ← {}
testDataset ← {}
while testDataset.size < 0.05 * mainDataset.size do
    selectedType ← randomIssueTypeFrom(mainDataset)
    while numTargetMethodsWithType(selectedType) ≥ 0.05 * mainDataset.size do
        selectedType ← randomIssueTypeFrom(mainDataset)
    end while
    unseenIssueTypes.add(selectedType)
    for method in mainDataset do
        if method.issueTypes ⊆ unseenIssueTypes then
            testDataset.add(method)
            mainDataset.remove(method)
        end if
    end for
end while
for method in mainDataset do
    if method.issueTypes ∩ unseenIssueTypes ≠ {} then
        mainDataset.remove(method)
    end if
end for
for i ← 0 to testDataset.size do
    method ← randomMethodWithoutIssuesFrom(mainDataset)
    testDataset.add(method)
    mainDataset.remove(method)
end for
trainDataset ← trainSplit(mainDataset)
validationDataset ← validationSplit(mainDataset)
baselineTestDataset ← testSplit(mainDataset)
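A runnable Python sketch of Algorithm 1 is given below for illustration. It departs from the pseudocode in a few clearly marked places, all of which are assumptions: only methods with at least one issue are moved into the unseen-type test set, an already selected type is never selected again, the loop stops when no sufficiently rare type remains, and the number of clean methods to add is fixed before they are appended. The `Method` class and the function name are hypothetical, and the final train/validation/baseline-test split of the remaining methods is omitted.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    issue_types: frozenset  # empty for methods without issues


def build_unseen_type_datasets(methods, fraction=0.05, seed=0):
    rng = random.Random(seed)
    main, test, unseen = list(methods), [], set()

    def only_unseen(m):
        # Assumption: clean methods (empty set) are not treated as "unseen only".
        return bool(m.issue_types) and m.issue_types <= unseen

    while len(test) < fraction * len(main):
        counts = Counter(t for m in main for t in m.issue_types)
        limit = fraction * len(main)
        candidates = sorted(t for t, c in counts.items() if t not in unseen and c < limit)
        if not candidates:  # guard against an endless loop (assumption)
            break
        unseen.add(rng.choice(candidates))
        test.extend(m for m in main if only_unseen(m))
        main = [m for m in main if not only_unseen(m)]

    # Drop methods that mix seen and unseen issue types.
    main = [m for m in main if not (m.issue_types & unseen)]

    # Add as many clean methods as there are methods with unseen issues.
    clean = [m for m in main if not m.issue_types]
    picked = set(rng.sample(clean, min(len(test), len(clean))))
    test.extend(picked)
    main = [m for m in main if m not in picked]
    return main, test, unseen
```

The returned `main` list would then be split into training, validation, and baseline test sets as described in Section 4.2.5.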

The results of this experiment can be seen in Table 5. As indicated by the model’s recall on the unseen issue types (0.275), the model fails to identify most methods with issues of types that were not in the training set. This shows that our approach for identifying methods with issues loses most of its usefulness when evaluating methods that only contain issues of types it has never encountered before. This speaks to the importance of having a large dataset with a diverse set of issue types for fine-tuning a model to be used in a production environment.

Test Set Contents Accuracy Precision Recall F1
Seen Issue Types 0.881 0.867 0.901 0.884
Unseen Issue Types 0.570 0.671 0.275 0.390
Table 5: Comparison of Binary Classification Performance on Test Sets Containing Either Seen or Unseen Issue Types

4.8.2 Analyzing the Multi-Label Classification Model’s Performance on Rare Issue Types (RQ3.2):

To analyze how well the multi-label classification model performs on rare issue types, we adopt Zhou et al.’s approach, which considers so-called “head” and “tail” labels separately zhou2023devil . This approach is designed to analyze the effectiveness of language models on datasets where the labels follow a long-tail distribution (i.e., a dataset where relatively few of the labels make up the majority of the samples). This is done by analyzing the “head” labels and “tail” labels separately, with “head” labels being the most frequently occurring labels that account for 50% of the dataset, and the “tail” labels being the less frequent labels that account for the remaining 50% of the dataset.

Note that we deviate from Zhou et al.’s methodology in how we handle label overlap. Label overlap occurs when a label could be classified as either a head label or tail label (such as a case where 45% of the samples are head samples, 40% are tail samples, and the remaining 15% all have the same label). Zhou et al. allocate the overlapping label to minimize the difference between the number of head samples and tail samples zhou2023devil . In our dataset, 50% of the samples have no issues, and are labeled as such. If we follow Zhou et al.’s methodology exactly, the only head label would be the “no issues” label. However, the next most frequently occurring label is for issue S8, with 25.4% of samples having this label. To provide a more accurate view of how our model performs on the infrequently occurring labels, we include the S8 label in the head labels (bringing our head label percentage to 75% instead of 50%).
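The effect of this choice can be illustrated with a simplified cumulative-frequency split. The sketch below is not Zhou et al.’s exact procedure (it does not implement their overlap-balancing rule); the function name, the `forced_head` parameter, and the example counts other than the 50% and 25.4% shares are hypothetical.

```python
def split_head_tail(label_counts, total_samples, head_share=0.5, forced_head=()):
    """Take the most frequent labels as head labels until they cover `head_share`
    of the samples, then add any manually forced head labels."""
    ranked = sorted(label_counts, key=lambda label: label_counts[label], reverse=True)
    head, covered = [], 0
    for label in ranked:
        if covered >= head_share * total_samples:
            break
        head.append(label)
        covered += label_counts[label]
    for label in forced_head:
        if label in label_counts and label not in head:
            head.append(label)
    tail = [label for label in ranked if label not in head]
    return head, tail


# Illustrative counts per 1,000 samples: 50% without issues and 25.4% with S8.
counts = {"no_issue": 500, "S8": 254, "S6": 100, "I3": 50}
print(split_head_tail(counts, 1000))
print(split_head_tail(counts, 1000, forced_head=("S8",)))
```

With a strict 50% cutoff only the “no issues” label lands in the head group, whereas forcing S8 into the head group mirrors the deviation described above.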

This analysis was performed using the RC+RJ results from Table 4. The model’s performance on these two groups of labels can be seen in Table 6. While the two groups have a comparable accuracy (0.887 for head labels and 0.875 for tail labels), the head label performance on each of the other metrics far exceeds the tail label performance. The tail label accuracy is skewed by the fact that 75.3% of the samples have no expected tail labels. When considering only samples that have at least one expected tail label, the model’s accuracy on tail labels becomes 0.598. This low performance on the tail labels shows that our model suffers from the unbalanced dataset, and fails to learn the patterns that indicate rarer issues.

To analyze how our approach performs when utilizing a more balanced (although still unbalanced) dataset, this analysis was repeated using a pipeline composed of our binary classification model, followed by our multi-label classification model. For this experiment, the dataset was split into training, validation and test datasets using an 80:10:10 split, as described in Section 4.2.5. The binary classification model was fine-tuned using the entirety of the training and validation datasets, while the multi-label classification model was fine-tuned only using the methods with issues contained in those datasets. The target methods were formulated using the RJ format for the binary classification model, and the RC+RJ format for the multi-label classification model, following the results reported in Section 4.7. The pipeline was then tested using the algorithm defined in Algorithm 2. As can be seen, the multi-label classification model is only used to evaluate the target methods that the binary classification model identifies as having an issue.

Algorithm 2 Evaluating a Pipeline Which Utilizes both the Binary Model and the Multi-Label Model
binaryResults ← []
multiLabelResults ← []
for targetMethod in testDataset do
    formattedMethod ← formatMethod(targetMethod, 'RJ')
    binResult ← binaryModel.evaluate(formattedMethod)
    binaryResults.append(binResult)
    if indicatesIssue(binResult) then
        formattedMethod ← formatMethod(targetMethod, 'RC+RJ')
        multiResult ← multiLabelModel.evaluate(formattedMethod)
        multiLabelResults.append(multiResult)
    end if
end for
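The control flow of Algorithm 2 can be sketched in Python as follows. The models and the formatter are passed in as plain callables, which is an assumption made to keep the example self-contained; in practice they would wrap the fine-tuned classifiers and the input formulation step.

```python
def run_pipeline(test_methods, binary_model, multi_label_model, format_method):
    """Run the binary model on every method and the multi-label model only on
    the methods that the binary model flags as having an issue."""
    binary_results = []
    multi_label_results = []
    for method in test_methods:
        has_issue = binary_model(format_method(method, "RJ"))
        binary_results.append(has_issue)
        if has_issue:
            labels = multi_label_model(format_method(method, "RC+RJ"))
            multi_label_results.append(labels)
    return binary_results, multi_label_results


# Toy stand-ins for the fine-tuned models (hypothetical behaviour).
flags, labels = run_pipeline(
    ["close(stream)", "return 1"],
    binary_model=lambda code: "stream" in code,
    multi_label_model=lambda code: {"I3"},
    format_method=lambda code, fmt: code,
)
print(flags, labels)
```

A consequence of this design is that any method missed by the binary model never reaches the multi-label model, so binary false negatives cap the recall of the pipeline as a whole.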
Label Type Accuracy Precision Recall F1
All Methods
All Data 0.836 0.853 0.838 0.843
Head Labels 0.887 0.905 0.919 0.912
Tail Labels 0.875 0.703 0.607 0.644
Pipeline
Binary Model 0.872 0.872 0.872 0.872
All Labels 0.764 0.789 0.847 0.811
Head Labels 0.924 0.896 0.985 0.938
Tail Labels 0.827 0.589 0.591 0.574
Table 6: Comparison of Multi-label Classification Model Performance on Head Labels Compared to Tail Labels

This approach was used instead of simply fine-tuning and testing the multi-label classification model only on methods with issues to ensure a fair comparison to the results obtained when fine-tuning and testing the model on all methods. The results of this experiment are shown in Table 6 (which also shows the performance of the pipeline’s binary classification model). As can be seen, the multi-label classification model in the pipeline overall achieves a lower performance when considering all labels, with an accuracy of 0.764 and F1 score of 0.811, compared to the baseline experiment’s accuracy of 0.836 and F1 score of 0.843. Comparatively, the pipeline’s model performs better on the head labels, and worse on the tail labels.

4.8.3 Model Performance With Different Minimum Issue Count Thresholds (RQ3.3):

This research question examines how the model’s performance changes as the dataset is filtered to remove less frequently occurring issues. For this comparison, four datasets were utilized, each of which filters out issues based on a different frequency of occurrence in the overall dataset. The first dataset includes all issues retrieved during data collection without any filtering. The second dataset is the baseline dataset used throughout this paper, containing the top 75% most frequently occurring issues (issues that occur at least 15 times). The third dataset includes only the top 50% of issues, occurring at least 47 times. Finally, the last dataset contains the top 20% of issues, each occurring at least 601 times. In each case, methods without issues were randomly selected for inclusion in the datasets such that there is an equal number of methods with issues and methods without issues. The details of each dataset can be found in Table 7.

Included Types Number of Types Methods With Issues Train Set Size Validation Set Size Test Set Size
All Types 100 84747 135598 16948 16948
Top 75% 75 84664 135464 16932 16932
Top 50% 50 84163 134662 16832 16832
Top 20% 20 80652 129044 16130 16130
Table 7: Dataset Stats for Different Thresholds
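The filtering and balancing procedure described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `filter_and_balance`, the `(method_id, issue_set)` record layout, and the choice to drop methods whose issues are all filtered out are assumptions made for the example.

```python
import random
from collections import Counter


def filter_and_balance(methods, min_count, seed=0):
    """Keep issue types seen at least `min_count` times, then balance classes.

    `methods` is a list of (method_id, set_of_issue_types); an empty set
    means the method has no reported issues. (Hypothetical record layout.)
    """
    # Count how often each issue type occurs across the whole dataset.
    counts = Counter(issue for _, issues in methods for issue in issues)
    kept_types = {issue for issue, n in counts.items() if n >= min_count}

    with_issues, without_issues = [], []
    for method_id, issues in methods:
        if not issues:
            without_issues.append((method_id, set()))
            continue
        remaining = issues & kept_types
        # Assumption: a method whose issues are all filtered out is dropped
        # rather than relabelled as issue-free.
        if remaining:
            with_issues.append((method_id, remaining))

    # Randomly sample as many issue-free methods as there are methods with issues.
    rng = random.Random(seed)
    n_sample = min(len(with_issues), len(without_issues))
    sampled = rng.sample(without_issues, n_sample)
    return with_issues + sampled, kept_types
```

Under these assumptions, calling the function with `min_count=15`, `47`, or `601` would correspond to the Top 75%, Top 50%, and Top 20% thresholds, respectively.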

The results of this experiment can be seen in Table 8 and Table 9. As expected, the performance of both models tends to increase as the less frequently occurring issue types are filtered out. The binary classification model achieves an accuracy of 0.843 when considering all issues, and 0.851 when considering the top 20% of issue types. The multi-label classification model achieves an accuracy of 0.824 when considering all issues, and 0.866 when considering the top 20% of issue types. However, the binary model’s precision peaks with the dataset containing the top 75% of issue types. Interestingly, for both models the dataset for the top 50% of issue types results in a performance that is very similar to (or even slightly worse than) the performance obtained with the dataset for the top 75% of issue types.

Included Issues Accuracy Precision Recall F1
All Issue Types 0.843 0.839 0.850 0.844
Top 75% 0.849 0.854 0.841 0.847
Top 50% 0.848 0.837 0.864 0.851
Top 20% 0.851 0.832 0.879 0.855
Table 8: Binary Classification Results for Different Issue Type Cutoffs
Included Issues Accuracy Precision Recall F1
All Issue Types 0.824 0.842 0.828 0.830
Top 75% 0.836 0.853 0.838 0.843
Top 50% 0.834 0.855 0.838 0.842
Top 20% 0.866 0.884 0.871 0.875
Table 9: Multi-label Classification Results for Different Issue Type Cutoffs

4.9 Impact of Fine-Tuning and Pre-Training Datasets (RQ4)

As specified by Feng et al., CodeBERT was pre-trained on the CodeSearchNet (CSN) dataset feng2020codebert. Since the majority of our dataset was created using Java projects included in the CodeSearchNet dataset, the models’ results were analyzed to determine whether performance differs between projects that were in the model’s pre-training dataset and projects that were not.

In the test dataset, there are 16,932 total methods. 14,693 of those methods are from projects included in the CodeSearchNet dataset, while the remaining 2,239 are not. The differences in performance can be seen in Table 10 for the binary classification model, and Table 11 for the multi-label classification model. In both cases, the model performed better across all metrics on the projects that are included in the CodeSearchNet dataset, although the difference in performance was larger for the multi-label classification model than for the binary classification model. This suggests that, in both cases, information about these projects retained by the model from pre-training helps in identifying issues.

Test Set Contents Accuracy Precision Recall F1
Full Test Set 0.849 0.854 0.841 0.847
CSN Projects 0.852 0.857 0.845 0.851
Non-CSN Projects 0.824 0.832 0.820 0.826
Table 10: Comparison of Binary Classification Results Using Test Sets Consisting of Projects Retrieved from CodeSearchNet and from the GitHub API
Test Set Contents Accuracy Precision Recall F1
Full Test Set 0.836 0.853 0.838 0.843
CSN Projects 0.844 0.859 0.846 0.849
Non-CSN Projects 0.785 0.812 0.791 0.798
Table 11: Comparison of Multi-Label Classification Results Using Test Sets Consisting of Projects Retrieved from CodeSearchNet and from the GitHub API
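The comparison reported in Table 10 amounts to computing the same metrics on subsets of the test set grouped by project origin. A minimal sketch for the binary case is given below; the `(project, true_label, predicted_label)` record layout and the function names are assumptions made for illustration, and the averaging used for the multi-label metrics is not covered here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = has an issue)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def metrics_by_origin(records, csn_projects):
    """Group (project, true_label, predicted_label) records by project origin.

    `csn_projects` is the set of project names found in CodeSearchNet
    (hypothetical input; any membership test would do).
    """
    groups = {"Full Test Set": [], "CSN Projects": [], "Non-CSN Projects": []}
    for project, y_true, y_pred in records:
        groups["Full Test Set"].append((y_true, y_pred))
        key = "CSN Projects" if project in csn_projects else "Non-CSN Projects"
        groups[key].append((y_true, y_pred))
    return {
        name: binary_metrics([t for t, _ in rows], [p for _, p in rows])
        for name, rows in groups.items()
    }
```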

4.9.1 Model Performance on Unseen Projects (RQ4.1):

To determine how well the models generalize to unseen projects, the models were tested on projects that were withheld from the fine-tuning dataset. Projects to be withheld were randomly selected until 10% of the dataset was selected. This random selection was done in such a way that half of the selected projects were in CodeSearchNet, and the other half were not. This created a test set containing 16,952 methods. The remainder of the dataset was then split into training, validation, and test datasets (containing 80%, 10%, and 10% of the remaining dataset, respectively). This second test dataset (containing projects that the models were fine-tuned on) was used as a baseline for comparison. The results of this experiment can be seen in Table 12 and Table 13 for the binary classification model and multi-label classification model, respectively.
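One way to realize the split described above is sketched below. It is an illustration under stated assumptions rather than the authors' procedure: the function name, the `(project, method_id)` record layout, and the strategy of withholding projects in CSN/non-CSN pairs (so that the two groups contribute the same number of projects) are all hypothetical.

```python
import random
from collections import Counter


def split_with_unseen_projects(methods, csn_projects, holdout_frac=0.10, seed=0):
    """Withhold whole projects for an 'unseen' test set, then split the rest 80/10/10.

    `methods` is a list of (project, method_id) pairs (hypothetical layout).
    """
    rng = random.Random(seed)
    projects = sorted({project for project, _ in methods})
    csn = [p for p in projects if p in csn_projects]
    other = [p for p in projects if p not in csn_projects]
    rng.shuffle(csn)
    rng.shuffle(other)

    sizes = Counter(project for project, _ in methods)
    target = holdout_frac * len(methods)
    held_out, held_count = set(), 0
    # Assumption: projects are withheld in pairs (one CSN, one non-CSN) until
    # the withheld methods reach the target share of the dataset.
    for csn_project, other_project in zip(csn, other):
        if held_count >= target:
            break
        for project in (csn_project, other_project):
            held_out.add(project)
            held_count += sizes[project]

    unseen_test = [m for m in methods if m[0] in held_out]
    rest = [m for m in methods if m[0] not in held_out]
    rng.shuffle(rest)
    n_train = len(rest) * 8 // 10
    n_val = len(rest) // 10
    train = rest[:n_train]
    val = rest[n_train:n_train + n_val]
    seen_test = rest[n_train + n_val:]
    return train, val, seen_test, unseen_test
```

Because entire projects are withheld, no method in the unseen test set shares a project with any method used for fine-tuning, whereas the seen test set is drawn from the same projects as the training data.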

As expected, the performance on the CodeSearchNet projects surpasses performance on the projects not found in CodeSearchNet on both the seen and unseen test sets (consistent with the results presented in Table 10 and Table 11).

Interestingly, for projects found in CodeSearchNet, the binary classification model performs better on the unseen projects than on the seen projects by a significant amount, achieving an accuracy of 0.866 and F1 score of 0.865 on the unseen projects compared to an accuracy of 0.831 and an F1 score of 0.827 on the seen projects. While the multi-label classification model does not exhibit this exact trend, the performances on seen and unseen projects from CodeSearchNet are remarkably comparable, with an accuracy difference of only 0.010 and an F1 score difference of only 0.004. The performance on seen CodeSearchNet projects achieves the best results for precision and F1 score (0.848 and 0.836, respectively), while the performance on unseen CodeSearchNet projects achieves the best accuracy and recall (0.838 and 0.840, respectively).

When considering the projects not found in CodeSearchNet, there is a decrease in performance when comparing the unseen projects to the seen projects in both models. This likely indicates that the model struggles more with projects that it has never been exposed to, which aligns with expectations. However, this decrease is not substantial, and the models still perform well on the unseen projects. For instance, the binary model achieves an accuracy of 0.798 and an F1 score of 0.790 on the seen projects, and an accuracy of 0.791 and an F1 score of 0.775 on the unseen projects. The multi-label model shows a larger performance difference, achieving an accuracy of 0.750 and an F1 score of 0.785 on seen projects and an accuracy of 0.722 and an F1 score of 0.742 on the unseen projects. Overall, these results indicate that our approach generalizes well to unseen projects, although it does perform better on projects that were used during pre-training and fine-tuning.

Test Set Contents Accuracy Precision Recall F1
Seen CSN Projects 0.831 0.846 0.810 0.827
Seen Non-CSN Projects 0.798 0.830 0.754 0.790
Unseen CSN Projects 0.866 0.891 0.841 0.865
Unseen Non-CSN Projects 0.791 0.823 0.733 0.775
Table 12: Comparison of Binary Classification Performance on Test Sets Containing Either Seen or Unseen Projects
Test Set Projects Accuracy Precision Recall F1
Seen CSN Projects 0.828 0.848 0.832 0.836
Seen Non-CSN Projects 0.750 0.761 0.756 0.785
Unseen CSN Projects 0.838 0.833 0.840 0.832
Unseen Non-CSN Projects 0.722 0.732 0.724 0.742
Table 13: Comparison of Multi-Label Classification Performance on Test Sets Containing Either Seen or Unseen Projects

5 Threats to Validity

Potential False Positives and False Negatives in Dataset: During data collection, an implicit assumption was made that the results generated by the linting tools were correct (i.e., that the reported issues are actually issues and are all of the issues in the project). However, it is very likely that this is not the case. Previous work has found that Infer has a 72.7% precision for issue I1 and a 57.4% precision for issue I3 kharkar2022learning. Unfortunately, precision figures for other Infer issue types and the SpotBugs tool were not readily available, and there seem to be no reported results about potential false negatives in the tools’ output (i.e., an issue of a type that the tool supports but fails to report). The presence of false positives and false negatives in the dataset would prevent the models from effectively learning the patterns that comprise true issues, and would instead cause them to learn to replicate the tools’ performance.

Dataset Imbalance: As previously discussed in Section 4.7, the dataset utilized in this study is very imbalanced, being dominated by relatively few issue types. The imbalanced dataset likely led to the models learning to identify the common issue types while treating the very rarely occurring ones as noise. A more balanced dataset would help reduce any bias that the models have towards certain issue types, and may lead to better performance on rarely occurring issue types. One way to alleviate this in future work would be to augment a dataset built from natural code with synthetically created issues.

6 Conclusion

This paper explores the potential of leveraging large language models for code linting with the aim of addressing several limitations inherent in traditional linters. The investigation into the architecture and implementation of a language model-based code linter suggests that large language models can offer improvements in accuracy and performance. The empirical evaluation through the research questions provides preliminary evidence of the effectiveness of large language models in code linting. The results suggest that our language model-based approach can achieve competitive accuracy and can outperform existing linters in speed. The analysis of different input formulations underscores the importance of optimizing input data to enhance model performance.

However, our study also identifies areas where large language models face challenges, such as detecting rare issues in the dataset and analyzing projects not included in the pre-training data. These findings highlight the need for more diverse and comprehensive training datasets and further research to improve the model’s ability to generalize across different types of codebases and issue types. Overall, while this work contributes to the understanding of large language models’ application in software engineering, it is an initial step in exploring their potential for creating more efficient, accurate, and adaptable code linting tools. As large language models continue to evolve, we hope that further advancements will lead to more robust and maintainable software systems.

References

  • (1) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  • (2) Ayewah, N., Pugh, W., Hovemeyer, D., Morgenthaler, J.D., Penix, J.: Using static analysis to find bugs. IEEE software 25(5), 22–29 (2008)
  • (3) Beller, M., Bholanath, R., McIntosh, S., Zaidman, A.: Analyzing the state of static analysis: A large-scale evaluation in open source software. In: 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1, pp. 470–481. IEEE (2016)
  • (4) Bi, Y., Huang, J., Liu, P., Wang, L.: Benchmarking software vulnerability detection techniques: A survey. arXiv preprint arXiv:2303.16362 (2023)
  • (5) Bian, P., Liang, B., Shi, W., Huang, J., Cai, Y.: Nar-miner: discovering negative association rules from code for bug detection. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 411–422 (2018)
  • (6) Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep learning based vulnerability detection: Are we there yet? IEEE Transactions on Software Engineering 48(9), 3280–3296 (2021)
  • (7) Checkstyle: Checkstyle (2024). URL https://checkstyle.sourceforge.io/
  • (8) Emanuelsson, P., Nilsson, U.: A comparative study of industrial static analysis tools. Electronic notes in theoretical computer science 217, 5–21 (2008)
  • (9) Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., et al.: Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020)
  • (10) Fu, M., Tantithamthavorn, C.: Linevul: A transformer-based line-level vulnerability prediction. In: Proceedings of the 19th International Conference on Mining Software Repositories, pp. 608–620 (2022)
  • (11) Gao, Z., Wang, H., Zhou, Y., Zhu, W., Zhang, C.: How far have we gone in vulnerability detection using large language models. arXiv preprint arXiv:2311.12420 (2023)
  • (12) GitHub: Github rest api documentation (n.d.). URL https://docs.github.com/en/rest
  • (13) Gong, L., Pradel, M., Sridharan, M., Sen, K.: Dlint: Dynamically checking bad coding practices in javascript. In: Proceedings of the 2015 International Symposium on Software Testing and Analysis, pp. 94–105 (2015)
  • (14) Harzevili, N.S., Belle, A.B., Wang, J., Wang, S., Ming, Z., Nagappan, N., et al.: A survey on automated software vulnerability detection using machine learning and deep learning. arXiv preprint arXiv:2306.11673 (2023)
  • (15) HuggingFace: Automodels - transformers 3.0.2 documentation (n.d.). URL https://huggingface.co/transformers/v3.0.2/model_doc/auto.html
  • (16) HuggingFace: Trainer (n.d.). URL https://huggingface.co/docs/transformers/main_classes/trainer
  • (17) Husain, H., Wu, H.H., Gazit, T., Allamanis, M., Brockschmidt, M.: Codesearchnet challenge: Evaluating the state of semantic code search (2020)
  • (18) Infer: Infer (2021). URL https://fbinfer.com/
  • (19) Infer: List of all issue types (2021). URL https://fbinfer.com/docs/all-issue-types
  • (20) Johnson, B., Song, Y., Murphy-Hill, E., Bowdidge, R.: Why don’t software developers use static analysis tools to find bugs? In: 2013 35th International Conference on Software Engineering (ICSE), pp. 672–681. IEEE (2013)
  • (21) Kharkar, A., Moghaddam, R.Z., Jin, M., Liu, X., Shi, X., Clement, C., Sundaresan, N.: Learning to reduce false positives in analytic bug detectors. In: Proceedings of the 44th International Conference on Software Engineering, pp. 1307–1316 (2022)
  • (22) Li, Y., Wang, S., Nguyen, T.N.: Vulnerability detection with fine-grained interpretations. In: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 292–303 (2021)
  • (23) Li, Z., Zou, D., Xu, S., Chen, Z., Zhu, Y., Jin, H.: Vuldeelocator: a deep learning-based fine-grained vulnerability detector. IEEE Transactions on Dependable and Secure Computing 19(4), 2821–2837 (2021)
  • (24) Marjamäki, D.: Cppcheck - a tool for static c/c++ code analysis. Cppcheck (2007). URL https://cppcheck.sourceforge.io/
  • (25) Nguyen, V.A., Nguyen, D.Q., Nguyen, V., Le, T., Tran, Q.H., Phung, D.: Regvd: Revisiting graph neural networks for vulnerability detection. In: Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, pp. 178–182 (2022)
  • (26) Nong, Y., Sharma, R., Hamou-Lhadj, A., Luo, X., Cai, H.: Open science in software engineering: A study on deep learning-based vulnerability detection. IEEE Transactions on Software Engineering 49(4), 1983–2005 (2022)
  • (27) Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., Karri, R.: Asleep at the keyboard? assessing the security of github copilot’s code contributions. In: 2022 IEEE Symposium on Security and Privacy (SP), pp. 754–768. IEEE (2022)
  • (28) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)
  • (29) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
  • (30) Sonar: Sonarcloud online code review as a service tool (n.d.). URL https://www.sonarsource.com/products/sonarcloud/
  • (31) SpotBugs: Spotbugs (2021). URL https://spotbugs.github.io/
  • (32) SpotBugs: Bug descriptions (2023). URL https://spotbugs.readthedocs.io/en/latest/bugDescriptions.html
  • (33) Steenhoek, B., Rahman, M.M., Jiles, R., Le, W.: An empirical study of deep learning models for vulnerability detection. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 2237–2248. IEEE (2023)
  • (34) Synopsys: Coverity static analysis software (n.d.). URL https://www.synopsys.com/software-integrity/static-analysis-tools-sast/coverity.html
  • (35) Vassallo, C., Panichella, S., Palomba, F., Proksch, S., Gall, H.C., Zaidman, A.: How developers engage with static analysis tools in different contexts. Empirical Software Engineering 25, 1419–1457 (2020)
  • (36) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • (37) Vayadande, K., Mukhopadhyay, K., Chaudhari, V., Manwadkar, S., Mutalik, T., Gawali, I.: Let us lint: A tool for code formatting and code enhancing. In: 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), pp. 1–8. IEEE (2023)
  • (38) Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp. 38–45 (2020)
  • (39) Yuan, Z., Liu, J., Zi, Q., Liu, M., Peng, X., Lou, Y.: Evaluating instruction-tuned large language models on code comprehension and generation. arXiv preprint arXiv:2308.01240 (2023)
  • (40) Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al.: A survey of large language models. arXiv preprint arXiv:2303.18223 (2023)
  • (41) Zhou, C., Li, Q., Li, C., Yu, J., Liu, Y., Wang, G., Zhang, K., Ji, C., Yan, Q., He, L., et al.: A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419 (2023)
  • (42) Zhou, X., Kim, K., Xu, B., Liu, J., Han, D., Lo, D.: The devil is in the tails: How long-tailed code distributions impact large language models. In: 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 40–52. IEEE (2023)

Appendix A Collected Issue Types

Table 14: The list of issue types retrieved during dataset collection, and a brief description of each
ID Issue Count Description
Infer Issues inferDesc
I1 NULL_DEREFERENCE 7135 Potential null pointer dereference at the specified location
I2 NULLPTR_DEREFERENCE 121 Potential null pointer dereference at the specified location
I3 RESOURCE_LEAK 2133 A resource may not be closed, especially in the case where an exception occurs while accessing the resource
I4 INEFFICIENT_KEYSET_ITERATOR 769 A HashMap is iterated over using the keySet() iterator instead of the entrySet() iterator
I5 THREAD_SAFETY_VIOLATION 9241 A potential data race between threads
I6 EXPENSIVE_LOOP_INVARIANT_CALL 26 An expensive method with constant input and output is repeatedly called in a loop
I7 CHECKERS_IMMUTABLE_CAST 319 A method with a mutable return type returns an immutable object
I8 INTERFACE_NOT_THREAD_SAFE 3246 An interface not annotated with @ThreadSafe is called from a thread safe context
I9 ARBITRARY_CODE_EXECUTION_UNDER_LOCK 2 While a lock is held, a call to arbitrary code which may obtain a lock is made, potentially resulting in a deadlock
I10 DEADLOCK 76 Two distinct threads attempt to obtain the same locks in different orders
SpotBugs Issues spotBugsDesc
S1 NP 1662 A method may unexpectedly return null, a field or parameter may be null, or a null value is guaranteed to be dereferenced
S2 WMI 748 A HashMap is iterated over using the keySet() iterator instead of the entrySet() iterator
S3 CT 8120 A constructor can throw an exception, making it vulnerable to Finalizer attacks
S4 DM 3244 A method is invoked in an incorrect or questionable way
S5 SF 994 A switch statement is missing the default case or contains a case which falls through to the next
S6 OBL 981 An object requiring explicit clean up is not cleaned up
S7 OS 313 A stream fails to be closed under certain conditions
S8 EI 47057 A reference to an object’s mutable field is returned, potentially leading to unchecked and unexpected changes to the object’s field
S9 JLM 81 Synchronization on an object is performed incorrectly or in a confusing way
S10 REFLC 94 Reflection is used in a public method to create a class specified by a parameter, which could increase the visibility of other classes in the package
S11 REFLF 30 Reflection is used to increase the accessibility of a field
S12 REC 342 A catch block is used to catch Exception objects rather than specific exception types, which may incorrectly catch a thrown RuntimeException
S13 UC 234 A condition always produces the same value, a method performs no useful work, or an object is created and modified yet yields no side-effects and never leaves the current context
S14 RCN 659 A null check is performed on a value that is known to be non-null
S15 ST 611 A static field is modified by an instance method
S16 NM 654 A class, method, or field name is either confusing, ignores Java naming conventions, or is a keyword from a later Java version
S17 RV 1250 A return value is ignored or is handled in a way that could yield unexpected behaviour
S18 DLS 1288 A value is set or modified, and then is never used
S19 ML 25 Synchronization is performed on an object referenced from a mutable field, potentially leading to threads locking different objects
S20 SBSC 390 A string is concatenated in a loop using the ’+’ operator, where a StringBuilder would be more appropriate
S21 SIC 268 An inner class should be made into a static inner class since it doesn’t use its embedded reference to the outer class
S22 SSD 41 An instance level lock on a static field may not guard against concurrent access
S23 MS 696 A mutable, final static field could be accidentally or maliciously altered, and should be made immutable or less visible
S24 UG 48 A synchronized set method has an associated unsynchronized get method, potentially leading to inconsistent caller states
S25 DC 99 An instance of double-checked locking may be used here, which may function incorrectly or in unexpected ways on some platforms
S26 BIT 36 A bitwise operator is used in a way that could lead to unexpected behaviour
S27 RANGE 21 A provided parameter to a string or array access will be out of bounds
S28 UL 43 A method may not release all obtained locks in some execution paths
S29 SE 327 A serializable class may be configured in a way that leads to incorrect serialization or deserialization
S30 NS 33 Non-short-circuit logic is used which is less efficient and may lead to errors
S31 ES 227 ’==’ or ’!=’ is being used to compare Strings instead of the equals() method
S32 VA 8 A primitive array is passed to a method with a variable number of arguments, resulting in the array being treated as a single argument, which may not be expected behaviour
S33 ICAST 181 A type cast of numeric values is performed in an incorrect or useless way
S34 DCN 263 A null check should be performed instead of catching a NullPointerException
S35 NN 31 notify() or notifyAll() is used without any modification to a mutable field
S36 RR 58 The return value of the read() method of an InputStream is ignored, potentially leading to sporadic failures
S37 ENV 5 It is preferable to use portable Java system properties instead of environment variables where possible
S38 UUF 7194 A field is never used
S39 PA 785 A public field should be made less visible
S40 SQL 105 An SQL call is made in an incorrect way, either allowing for SQL injection attacks, or incorrectly accessing results
S41 DE 181 An exception may be ignored instead of handled
S42 VO 71 A field or reference is treated as volatile when it may be preferable for it to be non-volatile
S43 DMI 451 A method is invoked in an inefficient, incorrect, or insecure way
S44 INT 74 An integer operation does not perform useful work, or a numeric value is compared to a value guaranteed to be outside its range
S45 CO 47 A compare() or compareTo() method is incorrectly implemented
S46 EQ 262 An equals() method is incorrectly implemented
S47 RE 16 A regex value is potentially either used incorrectly or used unintentionally
S48 HE 87 An unhashable class is used in a hashable context, or the class may violate the invariant that equal objects should have equal hash codes
S49 GC 18 An argument is used which has a type that is potentially not compatible with the expected generic parameter type.
S50 UW 17 A call to wait() is made without a guard condition
S51 WA 26 A call to wait() or await() is made outside a loop in a context that may have multiple conditions being observed
S52 DL 6 Synchronization is performed on an object that could be shared amongst all objects in the JVM, resulting in a potential deadlock
S53 LI 80 A static field is lazily initialized without synchronization, which could result in incorrect multi-thread behaviour
S54 DB 35 Two or more conditional branches or switch statements use the exact same code.
S55 IS 44 A guarded field is not properly guarded against concurrent access, or a field is accessed with inconsistent synchronization
S56 URF 747 A field seems to never be read
S57 BC 194 An impossible cast is made, or the cast is unchecked
S58 STCAL 145 A Calendar or DateFormat object is accessed in a way indicative of multithreading, even though these types are not threadsafe
S59 RC 107 Two references are compared using ’==’ or ’!=’ instead of the equals() method
S60 UR 54 A constructor performs a read of a value that has not yet been initialized
S61 WL 3 Synchronization is performed on the getClass() return value instead of a class literal, leading to potential data races amongst subclasses
S62 EC 38 An equality comparison is performed in a way that is likely to always result in the objects being inequal
S63 UPM 90 A private method seems to never be called
S64 FE 21 Two floating point values are compared for equality in a way that could fail (due to potential rounding), or a floating point value is compared to the special NaN value (which will always be inequal)
S65 ODR 259 A database resource seems to not be closed on some execution paths
S66 BX 991 Boxing and/or unboxing is performed in an inefficient way
S67 IM 81 An operator may be used in an incorrect or unreliable way
S68 SC 22 A constructor starts a thread, which may behave incorrectly for subclasses
S69 IA 23 An inner class invokes a method in a way which could be ambiguous
S70 ME 42 A mutable enum field can be set from outside its package, and can be changed either accidentally or maliciously
S71 CN 71 A class’ clone() method does not call super.clone(), or a class implementing Cloneable does not implement clone() or vice versa
S72 UCF 23 A control flow statement has no effect on the code’s execution whether the branch is taken or not
S73 MWN 2 Object.wait(), Object.notify(), or Object.notifyAll() is called, without having an obvious lock on the object
S74 TLW 2 A wait is performed while having multiple locks, which may result in a deadlock
S75 HRS 16 An HTTP header or cookie is constructed in a way that could lead to a HTTP response splitting vulnerability
S76 IL 33 An infinite loop is created, either through a loop with no exit condition, unguarded recursive calls, or adding a collection to itself
S77 IT 32 A class implements the Iterator interface, however its next() method can not throw java.util.NoSuchElementException
S78 SA 64 A field is self-assigned, or an apparently useless self-computation or self-comparison is performed
S79 SS 115 An instance field appears to be a compile-time static value, and should likely be a static field
S80 RU 3 Invokes run() on an object, where Thread.start() might be more appropriate
S81 AT 42 Calls to a concurrent abstraction may not be executed atomically
S82 FI 17 A finalizer is incorrectly or inefficiently implemented
S83 MC 5 An overrideable method is called from a constructor or clone() method, which may result in the method being called while not fully initialized
S84 UWF 104 A field is never set, or is only ever set to null
S85 DP 45 Executed method call may require security permission, and should be executed inside a doPrivileged block
S86 FL 16 A floating point value is used in a way where its lack of precision may be detrimental
S87 RpC 26 A conditional test is performed more than once sequentially, which is useless
S88 JCIP 14 A non-final field is created in an immutable class
S89 RS 1 A serializable class’ readObject() method is synchronized, even though it should only be accessible by one thread
S90 MF 14 A field is masked by a local variable or by a subclass’ field, which could result in confusing or incorrect behaviour
S91 TQ 1 A type qualifier is either potentially missing, or potentially used incorrectly
S92 IP 8 A parameter value is ignored before being overwritten
S93 IMSE 1 An IllegalMonitorStateException is caught, even though this exception is normally only thrown due to a code design flaw
S94 SWL 8 Thread.sleep() is called while a lock is held, leading to poor performance or a potential deadlock
S95 UI 6 A getResource() call is made that could behave unexpectedly if the calling class is extended
S96 OVERRIDING 15 Super method is annotated with @OverridingMethodsMustInvokeSuper, but the overriding method does not invoke it
S97 UMAC 12 An anonymous class defines a method which is not invoked, and seems to be otherwise uncallable
S98 XSS 2 This code potentially introduces a cross-site scripting vulnerability
S99 QF 1 A for loop’s incrementation seems to be potentially incorrect
S100 CNT 12 A constant seems to be approximately equal to a known library value (e.g. approximately equal to Math.PI)
S101 BSHIFT 1 Order of operations for a binary shift may be wrong here
S102 FS 726 A format string should use %n instead of the newline character
S103 IC 5 A circular reference was detected in static initializers
S104 LG 3 A logger configuration may be lost due to an incompatibility in OpenJDK
S105 SP 2 The compiler may move a field read outside a loop, which could result in an infinite loop
S106 J2EE 2 A non-serializable object may be getting stored in an HttpSession, which could result in an error
S107 SR 66 The return value of the skip() method of an InputStream is ignored, potentially leading to sporadic failures
S108 VSC 62 A security check method is non-private and non-final, potentially leaving it vulnerable to being overridden
ID Base Issues Description
E1 I1, I2, S1 Issues which denote a possible null dereference
E2 I3, S6, S7 Issues which denote a case where a resource may not be properly closed
E3 I4, S2 Issues where keySet() is used to iterate over a HashMap instead of entrySet()
E4 S36, S107 Issues where the return value of the read() or skip() InputStream methods is ignored
E5 S10, S11 Issues where reflection is used to increase the accessibility of a method or class
E6 S4, S43 Issues where a method is invoked in an incorrect, inefficient, or insecure way
E7 S31, S59 Issues where ’==’ or ’!=’ is being used instead of an equals() method
E8 I5, S9, S19, S24, S25, S35, S50, S51, S52, S53 Issues where the implemented synchronization could result in a race condition
Table 15: Equivalent Issues and their Encompassed Base Issues
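
The grouping in Table 15 can be expressed as a lookup from base issue IDs to equivalent issue IDs. The sketch below encodes the table directly; the names `EQUIVALENT_ISSUES`, `BASE_TO_EQUIVALENT`, and `merge_labels` are hypothetical, and the exact point in the pipeline at which such a mapping would be applied is an assumption.

```python
# Equivalent issue -> base issues, transcribed from Table 15.
EQUIVALENT_ISSUES = {
    "E1": ["I1", "I2", "S1"],
    "E2": ["I3", "S6", "S7"],
    "E3": ["I4", "S2"],
    "E4": ["S36", "S107"],
    "E5": ["S10", "S11"],
    "E6": ["S4", "S43"],
    "E7": ["S31", "S59"],
    "E8": ["I5", "S9", "S19", "S24", "S25", "S35", "S50", "S51", "S52", "S53"],
}

# Inverted lookup: base issue -> equivalent issue.
BASE_TO_EQUIVALENT = {
    base: equivalent
    for equivalent, bases in EQUIVALENT_ISSUES.items()
    for base in bases
}


def merge_labels(base_labels):
    """Replace each base issue ID with its equivalent ID, if it has one."""
    return {BASE_TO_EQUIVALENT.get(label, label) for label in base_labels}
```

For example, a method flagged with I1 by Infer and S1 by SpotBugs would carry the single label E1 after merging, while an issue without an equivalent (such as S8) would keep its base ID.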