
Big code != big vocabulary: open-vocabulary models for source code

Published: 01 October 2020

Abstract

    Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
    In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.
    All datasets, code, and trained models used in this work are publicly available.
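The key to the open-vocabulary model described above is segmenting identifiers into subword units learned with byte-pair encoding (BPE): frequent character sequences become single tokens, while any unseen identifier still decomposes into known subwords, so the model never encounters out-of-vocabulary tokens. The following is an illustrative sketch of BPE training and segmentation, not the authors' implementation; the toy corpus, merge count, and `</w>` end-of-word marker are placeholders:

```python
from collections import Counter

def _merge(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the fused symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn a sequence of BPE merge operations from a token corpus."""
    # Each word starts as its characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        new_vocab = Counter()
        for word, freq in vocab.items():
            new_vocab[tuple(_merge(word, best))] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split an arbitrary (possibly unseen) identifier into subword units."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols
```

An identifier never seen during training (e.g. `getValueName` after training on `getName` and `getValue`) simply falls back to shorter learned subwords, ultimately single characters, which is what lets such a model keep a small, closed subword vocabulary over an effectively unbounded identifier space.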




      Published In

      ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering
      June 2020
      1640 pages
      ISBN:9781450371216
      DOI:10.1145/3377811
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      In-Cooperation

      • KIISE: Korean Institute of Information Scientists and Engineers
      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. byte-pair encoding
      2. naturalness of code
      3. neural language models

      Qualifiers

      • Research-article

      Conference

ICSE '20

      Acceptance Rates

      Overall Acceptance Rate 276 of 1,856 submissions, 15%


Cited By

• (2024) Non-Autoregressive Line-Level Code Completion. ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1--34. DOI: 10.1145/3649594
• (2024) Greening Large Language Models of Code. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Society. 142--153. DOI: 10.1145/3639475.3640097
• (2024) An Investigation into Misuse of Java Security APIs by Large Language Models. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security. 1299--1315. DOI: 10.1145/3634737.3661134
• (2024) BinAdapter: Leveraging Continual Learning for Inferring Function Symbol Names in a Binary. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security. 1200--1213. DOI: 10.1145/3634737.3645006
• (2024) Poison Attack and Poison Detection on Deep Source Code Processing Models. ACM Transactions on Software Engineering and Methodology 33, 3 (2024), 1--31. DOI: 10.1145/3630008
• (2024) Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1--13. DOI: 10.1145/3597503.3639142
• (2024) GrammarT5: Grammar-Integrated Pretrained Encoder-Decoder Neural Model for Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1--13. DOI: 10.1145/3597503.3639125
• (2024) TRACED: Execution-aware Pre-training for Source Code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1--12. DOI: 10.1145/3597503.3608140
• (2024) VarGAN: Adversarial Learning of Variable Semantic Representations. IEEE Transactions on Software Engineering 50, 6 (2024), 1505--1517. DOI: 10.1109/TSE.2024.3391730
• (2024) T5APR: Empowering automated program repair across languages through checkpoint ensemble. Journal of Systems and Software 214 (2024), 112083. DOI: 10.1016/j.jss.2024.112083
