short-paper
Open access

How Often Do Single-Statement Bugs Occur?: The ManySStuBs4J Dataset

Published: 18 September 2020
  • Abstract

    Program repair is an important but difficult software engineering problem. One way to achieve acceptable performance is to focus on classes of simple bugs, such as bugs with single-statement fixes or bugs that match a small set of bug templates. However, it is very difficult to estimate the recall of repair techniques for simple bugs, as there are no datasets about how often the associated bugs occur in code. To fill this gap, we provide a dataset of 153,652 single-statement bug-fix changes mined from 1,000 popular open-source Java projects, annotated by whether they match any of a set of 16 bug templates inspired by state-of-the-art program repair techniques. In an initial analysis, we find that about 33% of the simple bug fixes match the templates, indicating that a remarkable number of single-statement bugs can be repaired with a relatively small set of templates. Further, we find that template-fitting bugs appear with a frequency of about one bug per 1,600-2,500 lines of code (as measured by the size of the project's latest version). We hope that the dataset will prove a resource for both future work in program repair and studies in empirical software engineering.


    Information

    Published In

    MSR '20: Proceedings of the 17th International Conference on Mining Software Repositories
    June 2020
    675 pages
    ISBN:9781450375177
    DOI:10.1145/3379597
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    MSR '20

    Cited By

    • (2024) A systematic literature review on the impact of AI models on the security of code generation. Frontiers in Big Data 7. DOI: 10.3389/fdata.2024.1386720. Online publication date: 13-May-2024
    • (2024) ExLi: An Inline-Test Generation Tool for Java. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 652-656. DOI: 10.1145/3663529.3663817. Online publication date: 10-Jul-2024
    • (2024) Sharing Software-Evolution Datasets: Practices, Challenges, and Recommendations. Proceedings of the ACM on Software Engineering 1:FSE, 2051-2074. DOI: 10.1145/3660798. Online publication date: 12-Jul-2024
    • (2024) CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code. Proceedings of the 21st International Conference on Mining Software Repositories, 637-641. DOI: 10.1145/3643991.3644864. Online publication date: 15-Apr-2024
    • (2024) Out of Context: How important is Local Context in Neural Program Repair? Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1-13. DOI: 10.1145/3597503.3639086. Online publication date: 20-May-2024
    • (2024) APPT: Boosting Automated Patch Correctness Prediction via Fine-Tuning Pre-Trained Models. IEEE Transactions on Software Engineering 50:3, 474-494. DOI: 10.1109/TSE.2024.3354969. Online publication date: Mar-2024
    • (2023) Large language models of code fail at completing code with potential bugs. Proceedings of the 37th International Conference on Neural Information Processing Systems, 41386-41412. DOI: 10.5555/3666122.3667916. Online publication date: 10-Dec-2023
    • (2023) Large-Scale Identification and Analysis of Factors Impacting Simple Bug Resolution Times in Open Source Software Repositories. Applied Sciences 13:5, 3150. DOI: 10.3390/app13053150. Online publication date: 28-Feb-2023
    • (2023) Machine Learning for Software Technical Debt Detection. Известия Российской академии наук. Теория и системы управления, 98-104. DOI: 10.31857/S0002338823040078. Online publication date: 1-Jul-2023
    • (2023) A Survey of Learning-based Automated Program Repair. ACM Transactions on Software Engineering and Methodology 33:2, 1-69. DOI: 10.1145/3631974. Online publication date: 23-Dec-2023