Generating Exceptional Behavior Tests with Reasoning Augmented Large Language Models

Authors: Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric

Abstract: Many popular programming languages, including C#, Java, and Python, support exceptions. Exceptions are thrown during program execution if an unwanted event happens, e.g., a method is invoked with an illegal argument value. Software developers write exceptional behavior tests (EBTs) to check that their code detects unwanted events and throws appropriate exceptions. Prior research studies have shown the importance of EBTs, but those studies also highlighted that developers put most of their efforts on "happy paths", e.g., paths without unwanted events. To help developers fill the gap, we present the first framework, dubbed exLong, that automatically generates EBTs. exLong is a large language model instruction-tuned from CodeLlama and embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. We compare exLong with the state-of-the-art models for test generation (CAT-LM) and one of the strongest foundation models (GPT3.5), as well as with analysis-based tools for test generation (Randoop and EvoSuite). Our results show that exLong outperforms existing models and tools. Furthermore, we contributed several pull requests to open-source projects and 23 EBTs generated by exLong were already accepted.

Submitted 24 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.
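
To make the distinction between a "happy path" test and an exceptional behavior test concrete, here is a minimal sketch in Python. The `withdraw` function and the test class are hypothetical illustrations, not taken from the paper or from exLong's output; the sketch only shows the kind of test the abstract calls an EBT.

```python
import unittest


def withdraw(balance: float, amount: float) -> float:
    """Hypothetical method under test: guards against illegal arguments."""
    if amount <= 0:
        # Conditional expression guarding a throw (raise) statement.
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


class WithdrawTests(unittest.TestCase):
    def test_happy_path(self):
        # Non-exceptional behavior test: no unwanted event occurs.
        self.assertEqual(withdraw(100.0, 30.0), 70.0)

    def test_negative_amount_raises(self):
        # Exceptional behavior test: an illegal argument must raise.
        with self.assertRaises(ValueError):
            withdraw(100.0, -5.0)

    def test_overdraw_raises(self):
        # Exceptional behavior test: a second guarded raise statement.
        with self.assertRaises(ValueError):
            withdraw(100.0, 500.0)
```

The two `assertRaises` tests each target one guarded `raise` statement, which mirrors the abstract's description of reasoning about the conditions that lead to a throw.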

arXiv:2404.05904 [pdf, other]

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models

Authors: Giwon Hong, Aryo Pradipta Gema, Rohit Saxena, Xiaotang Du, Ping Nie, Yu Zhao, Laura Perez-Beltrachini, Max Ryabinin, Xuanli He, Clémentine Fourrier, Pasquale Minervini

Abstract: Large Language Models (LLMs) have transformed the Natural Language Processing (NLP) landscape with their remarkable ability to understand and generate human-like text. However, these models are prone to "hallucinations" -- outputs that do not align with factual reality or the input context. This paper introduces the Hallucinations Leaderboard, an open initiative to quantitatively measure and compare the tendency of each model to produce hallucinations. The leaderboard uses a comprehensive set of benchmarks focusing on different aspects of hallucinations, such as factuality and faithfulness, across various tasks, including question-answering, summarisation, and reading comprehension. Our analysis provides insights into the performance of different models, guiding researchers and practitioners in choosing the most reliable models for their applications.

Submitted 17 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

arXiv:2403.04652 [pdf, other]

Yi: Open Foundation Models by 01.AI

Authors: 01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, et al. (7 additional authors not shown)

Abstract: We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rate on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to their data quality resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations to the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2402.15627 [pdf, other]

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Authors: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, et al. (7 additional authors not shown)

Abstract: We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.

Submitted 23 February, 2024; originally announced February 2024.
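
The two MFU figures in the abstract imply a baseline value that is not stated there. The short sketch below derives it, assuming that "1.34x" is the ratio of MegaScale's MFU to Megatron-LM's MFU in the same 175B, 12,288-GPU setting; the function name is illustrative only.

```python
def implied_baseline_mfu(reported_mfu: float, improvement: float) -> float:
    """Back out the baseline MFU from a reported MFU and a relative gain.

    Assumes improvement == reported_mfu / baseline_mfu.
    """
    return reported_mfu / improvement


# Figures quoted in the abstract: 55.2% MFU, 1.34x over Megatron-LM.
baseline = implied_baseline_mfu(0.552, 1.34)
print(f"Implied Megatron-LM MFU: {baseline:.1%}")  # prints "Implied Megatron-LM MFU: 41.2%"
```

Under that assumption the baseline sits at roughly 41% MFU, so the reported gain corresponds to about 14 percentage points of utilization.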

arXiv:2312.11678 [pdf, other]

Misinformation as a harm: structured approaches for fact-checking prioritization

Authors: Connie Moon Sehat, Ryan Li, Peipei Nie, Tarunima Prabhakar, Amy X. Zhang

Abstract: In this work, we examine how fact-checkers prioritize which claims to fact-check and what tools may assist them in their efforts. Through a series of interviews with 23 professional fact-checkers from around the world, we validate that harm assessment is a central component of how fact-checkers triage their work. We also clarify the processes behind fact-checking prioritization, finding that they are typically ad hoc, and gather suggestions for tools that could help with these processes. To address the needs articulated by fact-checkers, we present a structured framework of questions to help fact-checkers negotiate the priority of claims through assessing potential harms. Our FABLE Framework of Misinformation Harms incorporates five dimensions of magnitude -- (social) Fragmentation, Actionability, Believability, Likelihood of spread, and Exploitativeness -- that can help determine the potential urgency of a specific message or claim when considering misinformation as harm. The result is a practical and conceptual tool to support fact-checkers and others as they make strategic decisions to prioritize their efforts. We conclude with a discussion of computational approaches to support structured prioritization, as well as applications beyond fact-checking to content moderation and curation.

Submitted 18 March, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: Accepted to CSCW 2024, with clean up for typos and figures

arXiv:2307.14991 [pdf, other]

Multilingual Code Co-Evolution Using Large Language Models

Authors: Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, Milos Gligoric

Abstract: Many software projects implement APIs and algorithms in multiple programming languages. Maintaining such projects is tiresome, as developers have to ensure that any change (e.g., a bug fix or a new feature) is being propagated, timely and without errors, to implementations in other programming languages. In the world of ever-changing software, using rule-based translation tools (i.e., transpilers) or machine learning models for translating code from one language to another provides limited value. Translating each time the entire codebase from one language to another is not the way developers work. In this paper, we target a novel task: translating code changes from one programming language to another using large language models (LLMs). We design and implement the first LLM, dubbed Codeditor, to tackle this task. Codeditor explicitly models code changes as edit sequences and learns to correlate changes across programming languages. To evaluate Codeditor, we collect a corpus of 6,613 aligned code changes from 8 pairs of open-source software projects implementing similar functionalities in two programming languages (Java and C#). Results show that Codeditor outperforms the state-of-the-art approaches by a large margin on all commonly used automatic metrics. Our work also reveals that Codeditor is complementary to the existing generation-based models, and their combination ensures even greater performance.

Submitted 11 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

Comments: FSE 2023 (camera ready)

arXiv:2305.13486 [pdf, other]

pytest-inline: An Inline Testing Tool for Python

Authors: Yu Liu, Zachary Thurston, Alan Han, Pengyu Nie, Milos Gligoric, Owolabi Legunsen

Abstract: We present pytest-inline, the first inline testing framework for Python. We recently proposed inline tests to make it easier to test individual program statements. But, there is no framework-level support for developers to write inline tests in Python. To fill this gap, we design and implement pytest-inline as a plugin for pytest, the most popular Python testing framework. Using pytest-inline, a developer can write an inline test by assigning test inputs to variables in a target statement and specifying the expected test output. Then, pytest-inline runs each inline test and fails if the target statement's output does not match the expected output. In this paper, we describe our design of pytest-inline, the testing features that it provides, and the intended use cases. Our evaluation on inline tests that we wrote for 80 target statements from 31 open-source Python projects shows that using pytest-inline incurs negligible overhead, at 0.012x. pytest-inline is integrated into the pytest-dev organization, and a video demo is at https://www.youtube.com/watch?v=pZgiAxR_uJg.

Submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted as a tool demo paper at ICSE DEMO 2023

arXiv:2302.10166 [pdf, other]

Learning Deep Semantics for Test Completion

Authors: Pengyu Nie, Rahul Banerjee, Junyi Jessy Li, Raymond J. Mooney, Milos Gligoric

Abstract: Writing tests is a time-consuming yet essential task during software development. We propose to leverage recent advances in deep learning for text and code generation to assist developers in writing tests. We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test. We develop TeCo -- a deep learning model using code semantics for test completion. The key insight underlying TeCo is that predicting the next statement in a test method requires reasoning about code execution, which is hard to do with only syntax-level data that existing code completion models use. TeCo extracts and uses six kinds of code semantics data, including the execution result of prior statements and the execution context of the test method. To provide a testbed for this new task, as well as to evaluate TeCo, we collect a corpus of 130,934 test methods from 1,270 open-source Java projects. Our results show that TeCo achieves an exact-match accuracy of 18, which is 29% higher than the best baseline using syntax-level data only. When measuring functional correctness of generated next statement, TeCo can generate runnable code in 29% of the cases compared to 18% obtained by the best baseline. Moreover, TeCo is significantly better than prior work on test oracle generation.

Submitted 7 March, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

Comments: Accepted as a conference paper in ICSE 2023

arXiv:2210.16637 [pdf, other]

Beyond Prompting: Making Pre-trained Language Models Better Zero-shot Learners by Clustering Representations

Authors: Yu Fei, Ping Nie, Zhao Meng, Roger Wattenhofer, Mrinmaya Sachan

Abstract: Recent work has demonstrated that pre-trained language models (PLMs) are zero-shot learners. However, most existing zero-shot methods involve heavy human engineering or complicated self-training pipelines, hindering their application to new situations. In this work, we show that zero-shot text classification can be improved simply by clustering texts in the embedding spaces of PLMs. Specifically, we fit the unlabeled texts with a Bayesian Gaussian Mixture Model after initializing cluster positions and shapes using class names. Despite its simplicity, this approach achieves superior or comparable performance on both topic and sentiment classification datasets and outperforms prior works significantly on unbalanced datasets. We further explore the applicability of our clustering approach by evaluating it on 14 datasets with more diverse topics, text lengths, and numbers of classes. Our approach achieves an average of 20% absolute improvement over prompt-based zero-shot learning. Finally, we compare different PLM embedding spaces and find that texts are well-clustered by topics even if the PLM is not explicitly pre-trained to generate meaningful sentence embeddings. This work indicates that PLM embeddings can categorize texts without task-specific fine-tuning, thus providing a new way to analyze and utilize their knowledge and zero-shot learning ability.

Submitted 23 November, 2022; v1 submitted 29 October, 2022; originally announced October 2022.

Comments: Accepted to EMNLP 2022

arXiv:2209.06315 [pdf, other]

Inline Tests

Authors: Yu Liu, Pengyu Nie, Owolabi Legunsen, Milos Gligoric

Abstract: Unit tests are widely used to check source code quality, but they can be too coarse-grained or ill-suited for testing individual program statements. We introduce inline tests to make it easier to check for faults in statements. We motivate inline tests through several language features and a common testing scenario in which inline tests could be beneficial. For example, inline tests can allow a developer to test a regular expression in place. We also define language-agnostic requirements for inline testing frameworks. Lastly, we implement I-Test, the first inline testing framework. I-Test works for Python and Java, and it satisfies most of the requirements. We evaluate I-Test on open-source projects by using it to test 144 statements in 31 Python programs and 37 Java programs. We also perform a user study. All nine user study participants say that inline tests are easy to write and that inline testing is beneficial. The cost of running inline tests is negligible, at 0.007x--0.014x, and our inline tests helped find two faults that have been fixed by the developers.

Submitted 13 September, 2022; originally announced September 2022.

Comments: Accepted as a conference paper in ASE 2022

arXiv:2209.04725 [pdf, other]

Anticipating the Unseen Discrepancy for Vision and Language Navigation

Authors: Yujie Lu, Huiliang Zhang, Ping Nie, Weixi Feng, Wenda Xu, Xin Eric Wang, William Yang Wang

Abstract: Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target. The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well. Previous studies propose data augmentation methods to mitigate the data bias explicitly or implicitly and provide improvements in generalization. However, they try to memorize augmented trajectories and ignore the distribution shifts under unseen environments at test time. In this paper, we propose an Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency. Specifically, we devise: 1) a semi-supervised framework, DAVIS, that leverages visual consistency signals across similar semantic observations; and 2) a two-stage learning procedure that encourages adaptation to test-time distribution. The framework enhances the basic mixture of imitation and reinforcement learning with Momentum Contrast to encourage stable decision-making on similar observations under a joint training stage and a test-time adaptation stage. Extensive experiments show that DAVIS achieves model-agnostic improvement over previous state-of-the-art VLN baselines on R2R and RxR benchmarks. Our source code and data are in supplemental materials.

Submitted 10 September, 2022; originally announced September 2022.

arXiv:2208.05446 [pdf, ps, other]

CoditT5: Pretraining for Source Code and Natural Language Editing

Authors: Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, Milos Gligoric

Abstract: Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming standard generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a standard generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks.

Submitted 14 September, 2022; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: ASE 2022 (camera ready)

arXiv:2206.03450 [pdf, other]

doi 10.1145/3534929

A Trade-off-centered Framework of Content Moderation

Authors: Jialun Aaron Jiang, Peipei Nie, Jed R. Brubaker, Casey Fiesler

Abstract: Content moderation research typically prioritizes representing and addressing challenges for one group of stakeholders or communities in one type of context. While taking a focused approach is reasonable or even favorable for empirical case studies, it does not address how content moderation works in multiple contexts. Through a systematic literature review of 86 content moderation papers that document empirical studies, we seek to uncover patterns and tensions within past content moderation research. We find that content moderation can be characterized as a series of trade-offs around moderation actions, styles, philosophies, and values. We discuss how facilitating cooperation and preventing abuse, two key elements in Grimmelmann's definition of moderation, are inherently dialectical in practice. We close by showing how researchers, designers, and moderators can use our framework of trade-offs in their own work, and arguing that trade-offs should be of central importance in investigating and designing content moderation.

Submitted 7 June, 2022; originally announced June 2022.

Comments: To appear in ACM TOCHI

ACM Class: J.4; K.4.2

arXiv:2206.03273 [pdf, other]

City-scale synthetic individual-level vehicle trip data

Authors: Guilong Li, Yixian Chen, Yimin Wang, Zhi Yu, Peilin Nie, Zhaocheng He

Abstract: Trip data that records each vehicle's trip activity on the road network describes the operation of urban traffic from the individual perspective, and it is extremely valuable for transportation research. However, because of data privacy restrictions, individual-level trip data cannot be opened to all researchers, even though the need for it is urgent. In this paper, we produce a city-scale synthetic individual-level vehicle trip dataset, generating trips for each individual based on the historical trip data, so that availability and trip data privacy protection are balanced. Privacy protection inevitably affects the availability of data. Therefore, we have conducted numerous experiments to demonstrate the performance and reliability of the synthetic data in different dimensions and at different granularities to help users properly judge the tasks it can perform. The results show that the synthetic data is consistent with the real data (i.e., historical data) on the aggregated level and reasonable from the individual perspective.

Submitted 1 February, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

arXiv:2202.10695 [pdf, other]

A Framework for Multi-stage Bonus Allocation in meal delivery Platform

Authors: Zhuolin Wu, Li Wang, Fangsheng Huang, Linjun Zhou, Yu Song, Chengpeng Ye, Pengyu Nie, Hao Ren, Jinghua Hao, Renqing He, Zhizhao Sun

Abstract: Online meal delivery is undergoing explosive growth, as this service is becoming increasingly popular. A meal delivery platform aims to provide excellent and stable services for customers and restaurants. However, in reality, several hundred thousand orders are canceled per day in the Meituan meal delivery platform since they are not accepted by the crowdsourcing drivers. The cancellation of the orders is incredibly detrimental to the customer's repurchase rate and the reputation of the Meituan meal delivery platform. To solve this problem, a certain amount of specific funds is provided by Meituan's business managers to encourage the crowdsourcing drivers to accept more orders. To make better use of the funds, in this work, we propose a framework to deal with the multi-stage bonus allocation problem for a meal delivery platform. The objective of this framework is to maximize the number of accepted orders within a limited bonus budget. This framework consists of a semi-black-box acceptance probability model, a Lagrangian dual-based dynamic programming algorithm, and an online allocation algorithm. The semi-black-box acceptance probability model is employed to forecast the relationship between the bonus allocated to an order and its acceptance probability. The Lagrangian dual-based dynamic programming algorithm calculates the empirical Lagrangian multiplier for each allocation stage offline based on the historical data set. The online allocation algorithm uses the results attained in the offline part to calculate a proper delivery bonus for each order. To verify the effectiveness and efficiency of our framework, both offline experiments on a real-world data set and online A/B tests on the Meituan meal delivery platform are conducted. Our results show that using the proposed framework, the total order cancellations can be decreased by more than 25% in reality.

Submitted 22 February, 2022; originally announced February 2022.

Comments: 9 pages; submitted to KDD 2022

arXiv:2110.11570 [pdf, other]

MIC: Model-agnostic Integrated Cross-channel Recommenders

Authors: Yujie Lu, Ping Nie, Shengyu Zhang, Ming Zhao, Ruobing Xie, William Yang Wang, Yi Ren

Abstract: Semantically connecting users and items is a fundamental problem for the matching stage of an industrial recommender system. Recent advances in this topic are based on multi-channel retrieval to efficiently measure users' interest in items from the massive candidate pool. However, existing works are primarily built upon pre-defined retrieval channels, including User-CF (U2U), Item-CF (I2I), and Embedding-based Retrieval (U2I), and thus access only the limited correlation between users and items that follows from partial information of latent interactions. In this paper, we propose a model-agnostic integrated cross-channel (MIC) approach for large-scale recommendation, which maximally leverages the inherent multi-channel mutual information to enhance the matching performance. Specifically, MIC robustly models correlation within user-item, user-user, and item-item from latent interactions in a universal schema. For each channel, MIC naturally aligns pairs with semantic similarity and distinguishes them otherwise with more uniform anisotropic representation space. While state-of-the-art methods require specific architectural design, MIC intuitively considers them as a whole by enabling the complete information flow among users and items. Thus MIC can be easily plugged into other retrieval recommender systems. Extensive experiments show that our MIC helps several state-of-the-art models boost their performance on two real-world benchmarks. The satisfactory deployment of the proposed MIC on industrial online services empirically proves its scalability and flexibility.

Submitted 13 February, 2022; v1 submitted 21 October, 2021; originally announced October 2021.

Comments: 10 pages, 4 figures

arXiv:2108.09619 [pdf, other]

Impact of Evaluation Methodologies on Code Summarization

Authors: Pengyu Nie, Jiyang Zhang, Junyi Jessy Li, Raymond J. Mooney, Milos Gligoric

Abstract: There has been a growing interest in developing machine learning (ML) models for code summarization tasks, e.g., comment generation and method naming. Despite substantial increase in the effectiveness of ML models, the evaluation methodologies, i.e., the way people split datasets into training, validation, and test sets, were not well studied. Specifically, no prior work on code summarization considered the timestamps of code and comments during evaluation. This may lead to evaluations that are inconsistent with the intended use cases. In this paper, we introduce the time-segmented evaluation methodology, which is novel to the code summarization research community, and compare it with the mixed-project and cross-project methodologies that have been commonly used. Each methodology can be mapped to some use cases, and the time-segmented methodology should be adopted in the evaluation of ML models for code summarization. To assess the impact of methodologies, we collect a dataset of (code, comment) pairs with timestamps to train and evaluate several recent ML models for code summarization. Our experiments show that different methodologies lead to conflicting evaluation results. We invite the community to expand the set of methodologies used in evaluations.

Submitted 5 April, 2022; v1 submitted 21 August, 2021; originally announced August 2021.

Comments: Accepted as a conference paper in ACL 2022

arXiv:2103.13426 [pdf, other]

Learning to Generate Code Comments from Class Hierarchies

Authors: Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Raymond J. Mooney, Junyi Jessy Li, Milos Gligoric

Abstract: Descriptive code comments are essential for supporting code comprehension and maintenance. We propose the task of automatically generating comments for overriding methods. We formulate a novel framework which accommodates the unique contextual and linguistic reasoning that is required for performing this task. Our approach features: (1) incorporating context from the class hierarchy; (2) conditioning on learned, latent representations of specificity to generate comments that capture the more specialized behavior of the overriding method; and (3) unlikelihood training to discourage predictions which do not conform to invariant characteristics of the comment corresponding to the overridden method. Our experiments show that the proposed approach is able to generate comments for overriding methods of higher quality compared to prevailing comment generation techniques.

Submitted 17 April, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

arXiv:2103.01346 [pdf, other]

Roosterize: Suggesting Lemma Names for Coq Verification Projects Using Deep Learning

Authors: Pengyu Nie, Karl Palmskog, Junyi Jessy Li, Milos Gligoric

Abstract: Naming conventions are an important concern in large verification projects using proof assistants, such as Coq. In particular, lemma names are used by proof engineers to effectively understand and modify Coq code. However, providing accurate and informative lemma names is a complex task, which is currently often carried out manually. Even when lemma naming is automated using rule-based tools, generated names may fail to adhere to important conventions not specified explicitly. We demonstrate a toolchain, dubbed Roosterize, which automatically suggests lemma names in Coq projects. Roosterize leverages a neural network model trained on existing Coq code, thus avoiding manual specification of naming conventions. To allow proof engineers to conveniently access suggestions from Roosterize during Coq project development, we integrated the toolchain into the popular Visual Studio Code editor. Our evaluation shows that Roosterize substantially outperforms strong baselines for suggesting lemma names and is useful in practice. The demo video for Roosterize can be viewed at: https://youtu.be/HZ5ac7Q14rc.

Submitted 3 May, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

Comments: Accepted in International Conference on Software Engineering, Demonstrations Track (ICSE-DEMO 2021)

arXiv:2009.07465 [pdf, other]

Answering Any-hop Open-domain Questions with Iterative Document Reranking

Authors: Ping Nie, Yuyu Zhang, Arun Ramamurthy, Le Song

Abstract: Existing approaches for open-domain question answering (QA) are typically designed for questions that require either single-hop or multi-hop reasoning, which makes strong assumptions about the complexity of questions to be answered. Also, multi-step document retrieval often incurs a higher number of relevant but non-supporting documents, which dampens the downstream noise-sensitive reader module for answer extraction. To address these challenges, we propose a unified QA framework to answer any-hop open-domain questions, which iteratively retrieves, reranks and filters documents, and adaptively determines when to stop the retrieval process. To improve the retrieval accuracy, we propose a graph-based reranking model that performs multi-document interaction as the core of our iterative reranking framework. Our method consistently achieves performance comparable to or better than the state-of-the-art on both single-hop and multi-hop open-domain QA datasets, including Natural Questions Open, SQuAD Open, and HotpotQA.

Submitted 24 May, 2021; v1 submitted 16 September, 2020; originally announced September 2020.

Comments: Accepted by SIGIR 2021

arXiv:2006.16743 [pdf, ps, other]

Learning to Format Coq Code Using Language Models

Authors: Pengyu Nie, Karl Palmskog, Junyi Jessy Li, Milos Gligoric

Abstract: Should the final right bracket in a record declaration be on a separate line? Should arguments to the rewrite tactic be separated by a single space? Coq code tends to be written in distinct manners by different people and teams. The expressiveness, flexibility, and extensibility of Coq's languages and notations means that Coq projects have a wide variety of recognizable coding styles, sometimes explicitly documented as conventions on naming and formatting. In particular, even inexperienced users can distinguish vernacular using the standard library and plain Ltac from idiomatic vernacular using the Mathematical Components (MathComp) library and SSReflect. While coding conventions are important for comprehension and maintenance, they are costly to document and enforce. Rule-based formatters, such as Coq's beautifier, have limited flexibility and only capture small fractions of desired conventions in large verification projects. We believe that application of language models - a class of Natural Language Processing (NLP) techniques for capturing regularities in corpora - can provide a solution to this conundrum. More specifically, we believe that an approach based on automatically learning conventions from existing Coq code, and then suggesting idiomatic code to users in the proper context, can be superior to manual approaches and static analysis tools - both in terms of effort and results. As a first step, we here outline initial models to learn and suggest space formatting in Coq files, with a preliminary implementation for Coq 8.10, and evaluated on a corpus based on MathComp 1.9.0 which comprises 164k lines of Coq code from four core projects.

Submitted 18 June, 2020; originally announced June 2020.

Comments: Accepted in the Coq Workshop 2020

arXiv:2006.01346 [pdf, other]

doi 10.1145/3397271.3401195

A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading Comprehension

Authors: Jie Cai, Zhengzhou Zhu, Ping Nie, Qian Liu

Abstract: Pre-trained models have brought significant improvements to many NLP tasks and have been extensively analyzed. But little is known about the effect of fine-tuning on specific tasks. Intuitively, people may agree that a pre-trained model already learns semantic representations of words (e.g. synonyms are closer to each other) and fine-tuning further improves its capabilities which require more complicated reasoning (e.g. coreference resolution, entity boundary detection, etc). However, how to verify these arguments analytically and quantitatively is a challenging task, and few works focus on this topic. In this paper, inspired by the observation that most probing tasks involve identifying matched pairs of phrases (e.g. coreference requires matching an entity and a pronoun), we propose a pairwise probe to understand BERT fine-tuning on the machine reading comprehension (MRC) task. Specifically, we identify five phenomena in MRC. According to pairwise probing tasks, we compare the performance of each layer's hidden representation of pre-trained and fine-tuned BERT. The proposed pairwise probe alleviates the problem of distraction from inaccurate model training and makes a robust and quantitative comparison. Our experimental analysis leads to highly confident conclusions: (1) Fine-tuning has little effect on the fundamental and low-level information and general semantic tasks. (2) For specific abilities required for downstream tasks, fine-tuned BERT is better than pre-trained BERT and such gaps are obvious after the fifth layer.

Submitted 1 June, 2020; originally announced June 2020.

Comments: 4 pages, 1 figure

arXiv:2004.12169 [pdf, other]

Learning to Update Natural Language Comments Based on Code Changes

Authors: Sheena Panthaplackel, Pengyu Nie, Milos Gligoric, Junyi Jessy Li, Raymond J. Mooney

Abstract: We formulate the novel task of automatically updating an existing natural language comment based on changes in the body of code it accompanies. We propose an approach that learns to correlate changes across two distinct language representations, to generate a sequence of edits that are applied to the existing comment to reflect the source code modifications. We train and evaluate our model using a dataset that we collected from commit histories of open-source software projects, with each example consisting of a concurrent update to a method and its corresponding comment. We compare our approach against multiple baselines using both automatic metrics and human evaluation. Results reflect the challenge of this task and that our model outperforms baselines with respect to making edits.

Submitted 27 April, 2020; v1 submitted 25 April, 2020; originally announced April 2020.

Comments: Accepted in Association for Computational Linguistics (ACL) 2020

arXiv:2004.07761 [pdf, other]

Deep Generation of Coq Lemma Names Using Elaborated Terms

Authors: Pengyu Nie, Karl Palmskog, Junyi Jessy Li, Milos Gligoric

Abstract: Coding conventions for naming, spacing, and other essentially stylistic properties are necessary for developers to effectively understand, review, and modify source code in large software projects. Consistent conventions in verification projects based on proof assistants, such as Coq, increase in importance as projects grow in size and scope. While conventions can be documented and enforced manually at high cost, emerging approaches automatically learn and suggest idiomatic names in Java-like languages by applying statistical language models on large code corpora. However, due to its powerful language extension facilities and fusion of type checking and computation, Coq is a challenging target for automated learning techniques. We present novel generation models for learning and suggesting lemma names for Coq projects. Our models, based on multi-input neural networks, are the first to leverage syntactic and semantic information from Coq's lexer (tokens in lemma statements), parser (syntax trees), and kernel (elaborated terms) for naming; the key insight is that learning from elaborated terms can substantially boost model performance. We implemented our models in a toolchain, dubbed Roosterize, and applied it on a large corpus of code derived from the Mathematical Components family of projects, known for its stringent coding conventions. Our results show that Roosterize substantially outperforms baselines for suggesting lemma names, highlighting the importance of using multi-input models and elaborated terms.

Submitted 22 April, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

Comments: Accepted in International Joint Conference on Automated Reasoning (IJCAR 2020). With Appendix

arXiv:2002.12591 [pdf, other]

DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding

Authors: Yuyu Zhang, Ping Nie, Xiubo Geng, Arun Ramamurthy, Le Song, Daxin Jiang

Abstract: Recent studies on open-domain question answering have achieved prominent performance improvement using pre-trained language models such as BERT. State-of-the-art approaches typically follow the "retrieve and read" pipeline and employ BERT-based reranker to filter retrieved documents before feeding them into the reader module. The BERT retriever takes as input the concatenation of question and each retrieved document. Despite the success of these approaches in terms of QA accuracy, due to the concatenation, they can barely handle high-throughput of incoming questions each with a large collection of retrieved documents. To address the efficiency problem, we propose DC-BERT, a decoupled contextual encoding framework that has dual BERT models: an online BERT which encodes the question only once, and an offline BERT which pre-encodes all the documents and caches their encodings. On SQuAD Open and Natural Questions Open datasets, DC-BERT achieves 10x speedup on document retrieval, while retaining most (about 98%) of the QA performance compared to state-of-the-art approaches for open-domain question answering.

Submitted 28 February, 2020; originally announced February 2020.

arXiv:1901.10125 [pdf, other]

Glyce: Glyph-vectors for Chinese Character Representations

Authors: Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie, Fan Yin, Muyu Li, Qinghong Han, Xiaofei Sun, Jiwei Li

Abstract: It is intuitive that NLP tasks for logographic languages like Chinese should benefit from the use of the glyph information in those languages. However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found. In this paper, we address this gap by presenting Glyce, the glyph-vectors for Chinese character representations. We make three major innovations: (1) We use historical Chinese scripts (e.g., bronzeware script, seal script, traditional Chinese, etc) to enrich the pictographic evidence in characters; (2) We design CNN structures (called tianzege-CNN) tailored to Chinese character image processing; and (3) We use image-classification as an auxiliary task in a multi-task learning setup to increase the model's ability to generalize. We show that glyph-based models are able to consistently outperform word/char ID-based models in a wide range of Chinese NLP tasks. We are able to set new state-of-the-art results for a variety of Chinese NLP tasks, including tagging (NER, CWS, POS), sentence pair classification, single sentence classification tasks, dependency parsing, and semantic role labeling. For example, the proposed model achieves an F1 score of 80.6 on the OntoNotes dataset of NER, +1.5 over BERT; it achieves an almost perfect accuracy of 99.8% on the Fudan corpus for text classification. Code found at https://github.com/ShannonAI/glyce.

Submitted 21 May, 2020; v1 submitted 29 January, 2019; originally announced January 2019.

Comments: Accepted by NeurIPS 2019

arXiv:1808.01729 [pdf, other]

Executable Trigger-Action Comments

Authors: Pengyu Nie, Rishabh Rai, Junyi Jessy Li, Sarfraz Khurshid, Raymond J. Mooney, Milos Gligoric

Abstract: Natural language elements, e.g., todo comments, are frequently used to communicate among the developers and to describe tasks that need to be performed (actions) when specific conditions hold in the code repository (triggers). As projects evolve, development processes change, and development teams reorganize, these comments, because of their informal nature, frequently become irrelevant or forgotten. We present the first technique, dubbed TrigIt, to specify trigger-action todo comments as executable statements. Thus, actions are executed automatically when triggers evaluate to true. TrigIt specifications are written in the host language (e.g., Java) and are evaluated as part of the build process. The triggers are specified as query statements over abstract syntax trees and abstract representation of build configuration scripts, and the actions are specified as code transformation steps. We implemented TrigIt for the Java programming language and migrated 20 existing trigger-action comments from 8 popular open-source projects. We evaluate the cost of using TrigIt in terms of the number of tokens in the executable comments and the time overhead introduced in the build process.

Submitted 6 August, 2018; originally announced August 2018.

Showing 1–27 of 27 results for author: Nie, P