
Big code != big vocabulary: open-vocabulary models for source code

Published: 01 October 2020

Abstract

    Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
    In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.
    All datasets, code, and trained models used in this work are publicly available.
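The key to the open-vocabulary model described above is segmenting identifiers into subword units learned with byte-pair encoding (BPE): frequent character sequences become single tokens, while any unseen identifier still decomposes into known subwords, so the model never encounters out-of-vocabulary tokens. The following is an illustrative sketch of BPE training and segmentation, not the authors' implementation; the toy corpus, merge count, and `</w>` end-of-word marker are placeholders:

```python
from collections import Counter

def _merge(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the fused symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn a sequence of BPE merge operations from a token corpus."""
    # Each word starts as its characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        new_vocab = Counter()
        for word, freq in vocab.items():
            new_vocab[tuple(_merge(word, best))] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split an arbitrary (possibly unseen) identifier into subword units."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols
```

An identifier never seen during training (e.g. `getValueName` after training on `getName` and `getValue`) simply falls back to shorter learned subwords, ultimately single characters, which is what lets such a model keep a small, closed subword vocabulary over an effectively unbounded identifier space.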




      Published In

      ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering
      June 2020
      1640 pages
      ISBN:9781450371216
      DOI:10.1145/3377811
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      In-Cooperation

      • KIISE: Korean Institute of Information Scientists and Engineers
      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. byte-pair encoding
      2. naturalness of code
      3. neural language models

      Qualifiers

      • Research-article

      Conference

ICSE '20

      Acceptance Rates

      Overall Acceptance Rate 276 of 1,856 submissions, 15%


Cited By

• (2024) Non-Autoregressive Line-Level Code Completion. ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1--34. DOI: 10.1145/3649594
• (2024) Greening Large Language Models of Code. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Society. 142--153. DOI: 10.1145/3639475.3640097
• (2024) An Investigation into Misuse of Java Security APIs by Large Language Models. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security. 1299--1315. DOI: 10.1145/3634737.3661134
• (2024) BinAdapter: Leveraging Continual Learning for Inferring Function Symbol Names in a Binary. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security. 1200--1213. DOI: 10.1145/3634737.3645006
• (2024) Poison Attack and Poison Detection on Deep Source Code Processing Models. ACM Transactions on Software Engineering and Methodology 33, 3 (2024), 1--31. DOI: 10.1145/3630008
• (2024) Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1--13. DOI: 10.1145/3597503.3639142
• (2024) GrammarT5: Grammar-Integrated Pretrained Encoder-Decoder Neural Model for Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1--13. DOI: 10.1145/3597503.3639125
• (2024) TRACED: Execution-aware Pre-training for Source Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1--12. DOI: 10.1145/3597503.3608140
• (2024) VarGAN: Adversarial Learning of Variable Semantic Representations. IEEE Transactions on Software Engineering 50, 6 (2024), 1505--1517. DOI: 10.1109/TSE.2024.3391730
• (2024) T5APR: Empowering automated program repair across languages through checkpoint ensemble. Journal of Systems and Software 214 (2024), 112083. DOI: 10.1016/j.jss.2024.112083
