Showing 1–50 of 83 results for author: Cohan, A

  1. arXiv:2406.14644  [pdf, other]

    cs.CL

    Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation

    Authors: Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, Xiangru Tang, Arman Cohan

    Abstract: Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitig…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: ACL 2024 Camera-Ready Version

  2. arXiv:2406.14275  [pdf, other]

    cs.CL cs.AI

    Step-Back Profiling: Distilling User History for Personalized Scientific Writing

    Authors: Xiangru Tang, Xingyao Zhang, Yanjun Shao, Jie Wu, Yilun Zhao, Arman Cohan, Ming Gong, Dongmei Zhang, Mark Gerstein

    Abstract: Large language models (LLMs) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals, particularly in real-world scenarios like scientific writing. Addressing this challenge, we introduce STEP-BACK PROFILING to personalize LLMs by distilling user history into concise profiles, including essential traits and preferences of users. To…

    Submitted 11 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

  3. arXiv:2406.07835  [pdf, other]

    cs.CL cs.AI

    SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

    Authors: David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, Arman Cohan

    Abstract: We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed t…

    Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Submitted to NeurIPS Datasets and Benchmarks 2024

  4. arXiv:2404.14662  [pdf, other]

    cs.LG cs.CL cs.PL cs.SE

    NExT: Teaching Large Language Models to Reason about Code Execution

    Authors: Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin

    Abstract: A fundamental skill among human developers is the ability to understand and reason about program execution. As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (a.k.a. rubber duck debugging). However, large language models (LLMs) of code are typically trained on the surface textual form of programs, thus may lack a semantic understanding of h…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: 35 pages

  5. arXiv:2404.04285  [pdf, other]

    cs.CL cs.AI

    MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise

    Authors: Chunyuan Deng, Xiangru Tang, Yilun Zhao, Hanming Wang, Haoran Wang, Wangchunshu Zhou, Arman Cohan, Mark Gerstein

    Abstract: Recently, large language models (LLMs) have evolved into interactive agents, proficient in planning, tool use, and task execution across a wide variety of tasks. However, without specific agent tuning, open-source models like LLaMA currently struggle to match the efficiency of GPT-4, particularly given the scarcity of agent-tuning datasets for fine-tuning. In response, we introduce MIMIR…

    Submitted 3 April, 2024; originally announced April 2024.

  6. arXiv:2404.03602  [pdf, other]

    cs.CL

    Evaluating LLMs at Detecting Errors in LLM Responses

    Authors: Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang

    Abstract: With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection of LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research focuses on tasks of little practical value (e.g.…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Benchmark and code: https://github.com/psunlpgroup/ReaLMistake

  7. arXiv:2403.15246  [pdf, other]

    cs.IR cs.CL cs.LG

    FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

    Authors: Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini

    Abstract: Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, w…

    Submitted 7 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

  8. arXiv:2403.05788  [pdf, other]

    cs.CL cs.AI

    On the Benefits of Fine-Grained Loss Truncation: A Case Study on Factuality in Summarization

    Authors: Lorenzo Jaime Yu Flores, Arman Cohan

    Abstract: Text summarization and simplification are among the most widely used applications of AI. However, models developed for such tasks are often prone to hallucination, which can result from training on unaligned data. One efficient approach to address this issue is Loss Truncation (LT) (Kang and Hashimoto, 2020), an approach to modify the standard log loss to adaptively remove noisy examples during tr…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: EACL 2024

  9. arXiv:2403.04811  [pdf, other]

    cs.SE cs.CL cs.LG

    Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

    Authors: Martin Riddell, Ansong Ni, Arman Cohan

    Abstract: While large language models have achieved remarkable performance on various code generation benchmarks, there have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data. While recent work has investigated contamination in natural language generation and understanding tasks, there has been less extensive research into…

    Submitted 6 March, 2024; originally announced March 2024.

  10. arXiv:2402.06544  [pdf, other]

    cs.CL cs.AI cs.LG

    Calibrating Long-form Generations from Large Language Models

    Authors: Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra

    Abstract: To enhance Large Language Models' (LLMs) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. This approach does not apply to long-form generation, where a…

    Submitted 9 February, 2024; originally announced February 2024.

  11. arXiv:2402.04247  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science

    Authors: Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark Gerstein

    Abstract: Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents, called scientific LLM agents, also introduce novel vulnerabilities that demand careful consideration for safety. However, there exists a notab…

    Submitted 5 June, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  12. arXiv:2402.00838  [pdf, other]

    cs.CL

    OLMo: Accelerating the Science of Language Models

    Authors: Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, et al. (18 additional authors not shown)

    Abstract: Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models…

    Submitted 7 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  13. arXiv:2312.16291  [pdf, other]

    cs.LG cs.CL

    Observable Propagation: Uncovering Feature Vectors in Transformers

    Authors: Jacob Dunefsky, Arman Cohan

    Abstract: A key goal of current mechanistic interpretability research in NLP is to find linear features (also called "feature vectors") for transformers: directions in activation space corresponding to concepts that are used by a given model in its computation. Present state-of-the-art methods for finding linear features require large amounts of labelled data -- both laborious to acquire and computationally…

    Submitted 3 June, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

    Comments: 42 pages, 6 tables, 3 figures. ICML 2024

  14. arXiv:2311.10537  [pdf, other]

    cs.CL cs.AI

    MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

    Authors: Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, Mark Gerstein

    Abstract: Large language models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these issues, we propose MedAgents, a novel multi-disciplinary collaboration framework for the medical domain. MedAgent…

    Submitted 4 June, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

  15. arXiv:2311.09835  [pdf, other]

    cs.CL cs.AI

    ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

    Authors: Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein

    Abstract: Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., com…

    Submitted 18 June, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

  16. arXiv:2311.09805  [pdf, other]

    cs.CL

    DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data

    Authors: Yilun Zhao, Yitao Long, Hongjun Liu, Linyong Nan, Lyuhao Chen, Ryo Kamoi, Yixin Liu, Xiangru Tang, Rui Zhang, Arman Cohan

    Abstract: Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning and problem-solving capa…

    Submitted 16 November, 2023; originally announced November 2023.

    Comments: work in progress

  17. arXiv:2311.09797  [pdf, other]

    cs.CL

    KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains

    Authors: Yilun Zhao, Hongjun Liu, Yitao Long, Rui Zhang, Chen Zhao, Arman Cohan

    Abstract: We introduce KnowledgeMath, a novel benchmark designed to evaluate LLMs' capabilities in applying financial knowledge to solve complex math word problems. Compared to prior works, this study features three core advancements. First, KnowledgeMath includes 1,259 problems with a hybrid of textual and tabular content and requires college-level knowledge in the finance domain for effective resolution. S…

    Submitted 16 November, 2023; originally announced November 2023.

    Comments: work in progress

  18. arXiv:2311.09783  [pdf, other]

    cs.CL cs.AI

    Investigating Data Contamination in Modern Benchmarks for Large Language Models

    Authors: Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan

    Abstract: Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods ta…

    Submitted 3 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Version

  19. arXiv:2311.09765  [pdf, other]

    cs.IR cs.AI

    Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders

    Authors: Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, Kyle Lo

    Abstract: Prevailing research practice today often relies on training dense retrievers on existing large datasets such as MSMARCO and then experimenting with ways to improve zero-shot generalization capabilities to unseen domains. While prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, increasing model size, or even further base mo…

    Submitted 16 November, 2023; originally announced November 2023.

  20. arXiv:2311.09721  [pdf, other]

    cs.CL

    On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering

    Authors: Linyong Nan, Ellen Zhang, Weijin Zou, Yilun Zhao, Wenfei Zhou, Arman Cohan

    Abstract: This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task requires LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. Our findings high…

    Submitted 16 November, 2023; originally announced November 2023.

  21. arXiv:2311.09184  [pdf, other]

    cs.CL cs.LG

    Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

    Authors: Yixin Liu, Alexander R. Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, Arman Cohan

    Abstract: While large language models (LLMs) can already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for desired summary characteristi…

    Submitted 12 July, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Findings, GitHub Repo: https://github.com/yale-nlp/InstruSum, LLM-evaluators Leaderboard: https://huggingface.co/spaces/yale-nlp/InstruSumEval

  22. arXiv:2310.11191  [pdf, other]

    cs.CL cs.AI

    Medical Text Simplification: Optimizing for Readability with Unlikelihood Training and Reranked Beam Search Decoding

    Authors: Lorenzo Jaime Yu Flores, Heyuan Huang, Kejian Shi, Sophie Chheang, Arman Cohan

    Abstract: Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods in medical simplification sometimes result in the generated text having lower quality and diversity. In this work, we explore ways to…

    Submitted 25 October, 2023; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Findings

  23. arXiv:2309.17446  [pdf, other]

    cs.CL cs.LG cs.PL cs.SE

    L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

    Authors: Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo Zhou, Dragomir Radev, Arman Cohan

    Abstract: Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner. Despite promising results, there is a notable lack of a comprehensive evaluation of these models' language-to-code generation capabilities. Existing studies often focus on specific task…

    Submitted 2 October, 2023; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: Project Website: https://l2c-eval.github.io/

  24. arXiv:2309.08963  [pdf, other]

    cs.CL

    Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

    Authors: Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein

    Abstract: Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3…

    Submitted 4 April, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

  25. arXiv:2309.08960  [pdf, other]

    cs.CL

    ODSum: New Benchmarks for Open Domain Multi-Document Summarization

    Authors: Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan

    Abstract: Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. With a more inter-related document set, there does not necessarily exist a correct answer for the retrieval, making it hard to measure the retrieving performance. We propose a rule-based method to process query-based document summarization datasets into ODMD…

    Submitted 16 September, 2023; originally announced September 2023.

  26. arXiv:2309.08541  [pdf, other]

    cs.IR cs.AI cs.CL

    When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets

    Authors: Orion Weller, Kyle Lo, David Wadden, Dawn Lawrie, Benjamin Van Durme, Arman Cohan, Luca Soldaini

    Abstract: Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find t…

    Submitted 26 February, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: EACL 2024 camera ready

  27. arXiv:2305.15387  [pdf, other]

    cs.CL cs.AI

    Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering

    Authors: Avi Caciularu, Matthew E. Peters, Jacob Goldberger, Ido Dagan, Arman Cohan

    Abstract: The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model from a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systemati…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted at ACL 2023; camera-ready version

  28. arXiv:2305.14987  [pdf, other]

    cs.CL

    Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios

    Authors: Yilun Zhao, Haowei Zhang, Shengyun Si, Linyong Nan, Xiangru Tang, Arman Cohan

    Abstract: Tabular data is prevalent across various industries, necessitating significant time and effort for users to understand and manipulate for their information-seeking purposes. The advancements in large language models (LLMs) have shown enormous potential to improve user efficiency. However, the adoption of LLMs in real-world applications for table information seeking remains underexplored. In this p…

    Submitted 30 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Camera-ready version for EMNLP 2023 industry track

  29. arXiv:2305.14772  [pdf, other]

    cs.CL

    A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents

    Authors: Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle Lo

    Abstract: Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context from the original document. In this work, we use language models to rewrite snippets from scientific documents to be read on their own. First,…

    Submitted 30 November, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: 19 pages, 2 figures, 8 tables, EMNLP 2023

  30. arXiv:2305.14303  [pdf, other]

    cs.CL

    QTSumm: Query-Focused Summarization over Tabular Data

    Authors: Yilun Zhao, Zhenting Qi, Linyong Nan, Boyu Mi, Yixin Liu, Weijin Zou, Simeng Han, Ruizhe Chen, Xiangru Tang, Yumo Xu, Dragomir Radev, Arman Cohan

    Abstract: People primarily consult tables to conduct data analysis or answer specific questions. Text generation systems that can provide accurate table summaries tailored to users' information needs can facilitate more efficient access to relevant data insights. Motivated by this, we define a new query-focused table summarization task, where text generation models have to perform human-like reasoning and a…

    Submitted 6 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted at EMNLP 2023

  31. arXiv:2305.14239  [pdf, other]

    cs.CL

    On Learning to Summarize with Large Language Models as References

    Authors: Yixin Liu, Kejian Shi, Katherine S He, Longtian Ye, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, Arman Cohan

    Abstract: Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we investigate a new learning setting of text summarization models that considers the LLMs as the reference or the gold-standard oracle on these datasets. To examine the standard practices that a…

    Submitted 16 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: GitHub Repo: https://github.com/yixinL7/SumLLM

  32. arXiv:2305.12586  [pdf, other]

    cs.CL

    Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies

    Authors: Linyong Nan, Yilun Zhao, Weijin Zou, Narutatsu Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, Dragomir Radev

    Abstract: In-context learning (ICL) has emerged as a new approach to various natural language processing tasks, utilizing large language models (LLMs) to make predictions based on context that has been supplemented with a few examples or task-specific instructions. In this paper, we aim to extend this method to question answering tasks that utilize structured knowledge sources, and improve Text-to-SQL syste…

    Submitted 21 May, 2023; originally announced May 2023.

  33. arXiv:2305.11744  [pdf, other]

    cs.IR cs.CL

    ReFIT: Relevance Feedback from a Reranker during Inference

    Authors: Revanth Gangi Reddy, Pradeep Dasigi, Md Arafat Sultan, Arman Cohan, Avirup Sil, Heng Ji, Hannaneh Hajishirzi

    Abstract: Retrieve-and-rerank is a prevalent framework in neural information retrieval, wherein a bi-encoder network initially retrieves a pre-defined number of candidates (e.g., K=100), which are then reranked by a more powerful cross-encoder model. While the reranker often yields improved candidate scores compared to the retriever, its scope is confined to only the top K retrieved candidates. As a result,…

    Submitted 28 May, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: Preprint

  34. arXiv:2305.08379  [pdf, other]

    cs.CL cs.LG

    TESS: Text-to-Text Self-Conditioned Simplex Diffusion

    Authors: Rabeeh Karimi Mahabadi, Hamish Ivison, Jaesung Tae, James Henderson, Iz Beltagy, Matthew E. Peters, Arman Cohan

    Abstract: Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various continuous domains. However, applying continuous diffusion models to natural language remains challenging due to its discrete nature and the need for a large number of diffusion steps to generate text, making diffusion-based generation expensive. In this work, we propose Text-to-text Self-c…

    Submitted 20 February, 2024; v1 submitted 15 May, 2023; originally announced May 2023.

    Comments: EACL 2024

  35. arXiv:2301.13298  [pdf, other]

    cs.CL

    LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization

    Authors: Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, Kyle Lo

    Abstract: While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of t…

    Submitted 30 January, 2023; originally announced January 2023.

    Comments: EACL 2023 camera ready. Code and data can be found in https://github.com/martiansideofthemoon/longeval-summarization

  36. arXiv:2301.10140  [pdf, other]

    cs.DL cs.CL

    The Semantic Scholar Open Data Platform

    Authors: Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin, et al. (23 additional authors not shown)

    Abstract: The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF conte…

    Submitted 24 January, 2023; originally announced January 2023.

    Comments: 8 pages, 6 figures

  37. arXiv:2212.10526  [pdf, other]

    cs.CL cs.AI

    Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval

    Authors: John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Lu Wang, Arman Cohan

    Abstract: Multi-document summarization (MDS) assumes a set of topic-related documents is provided as input. In practice, this document set is not always available; it would need to be retrieved given an information need, i.e. a question or topic statement, a setting we dub "open-domain" MDS. We study this more challenging setting by formalizing the task and bootstrapping it using existing datasets, retriev…

    Submitted 25 October, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to EMNLP Findings 2023

  38. arXiv:2211.13308  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    SciRepEval: A Multi-Format Benchmark for Scientific Document Representations

    Authors: Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman

    Abstract: Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 2…

    Submitted 13 November, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: 19 pages, 2 figures, 11 tables. Accepted in EMNLP 2023 Main Conference

  39. arXiv:2210.13777  [pdf, other]

    cs.CL cs.AI

    SciFact-Open: Towards open-domain scientific claim verification

    Authors: David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Iz Beltagy, Lucy Lu Wang, Hannaneh Hajishirzi

    Abstract: While research on scientific claim verification has led to the development of powerful systems that appear to approach human performance, these approaches have yet to be tested in a realistic setting against large corpora of scientific literature. Moving to this open-domain evaluation setting, however, poses unique challenges; in particular, it is infeasible to exhaustively annotate all evidence d…

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: EMNLP Findings 2022. GitHub: https://github.com/dwadden/scifact-open-2022

  40. arXiv:2209.00840  [pdf, other]

    cs.CL

    FOLIO: Natural Language Reasoning with First-Order Logic

    Authors: Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, et al. (10 additional authors not shown)

    Abstract: Large language models (LLMs) have achieved remarkable performance on a variety of natural language understanding tasks. However, existing benchmarks are inadequate in measuring the complex logical reasoning capabilities of a model. We present FOLIO, a human-annotated, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations. FO…

    Submitted 17 May, 2024; v1 submitted 2 September, 2022; originally announced September 2022.

  41. arXiv:2207.04993  [pdf, other]

    cs.CL

    Embedding Recycling for Language Models

    Authors: Jon Saad-Falcon, Amanpreet Singh, Luca Soldaini, Mike D'Arcy, Arman Cohan, Doug Downey

    Abstract: Real-world applications of neural language models often involve running many different models over the same corpus. The high computational cost of these runs has led to interest in techniques that can reuse the contextualized embeddings produced in previous runs to speed training and inference of future ones. We refer to this approach as embedding recycling (ER). While multiple ER techniques have…

    Submitted 30 January, 2023; v1 submitted 11 July, 2022; originally announced July 2022.

    Comments: EACL Findings 2023

  42. arXiv:2204.10432  [pdf, other]

    cs.CL

    Improving the Generalizability of Depression Detection by Leveraging Clinical Questionnaires

    Authors: Thong Nguyen, Andrew Yates, Ayah Zirikly, Bart Desmet, Arman Cohan

    Abstract: Automated methods have been widely used to identify and analyze mental health conditions (e.g., depression) from various sources of information, including social media. Yet, deployment of such models in real-world healthcare applications faces challenges including poor out-of-domain generalization and lack of trust in black box models. In this work, we propose approaches for depression detection t…

    Submitted 21 April, 2022; originally announced April 2022.

  43. arXiv:2203.12990  [pdf, other]

    cs.CL

    Generating Scientific Claims for Zero-Shot Scientific Fact Checking

    Authors: Dustin Wright, David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Isabelle Augenstein, Lucy Lu Wang

    Abstract: Automated scientific fact checking is difficult due to the complexity of scientific language and a lack of significant amounts of training data, as annotation requires domain expertise. To address this challenge, we propose scientific claim generation, the task of generating one or more atomic and verifiable claims from scientific sentences, and demonstrate its usefulness in zero-shot fact checkin…

    Submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted to ACL 2022; 13 pages, 3 figures, 8 tables

  44. arXiv:2112.08777  [pdf, other]

    cs.CL cs.AI

    Long Context Question Answering via Supervised Contrastive Learning

    Authors: Avi Caciularu, Ido Dagan, Jacob Goldberger, Arman Cohan

    Abstract: Long-context question answering (QA) tasks require reasoning over a long document or multiple documents. Addressing these tasks often benefits from identifying a set of evidence spans (e.g., sentences), which provide supporting evidence for answering the question. In this work, we propose a novel method for equipping long-context QA models with an additional sequence-level objective for better ide…

    Submitted 5 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: Accepted to NAACL 2022, main conference

  45. arXiv:2112.01640  [pdf, other]

    cs.CL cs.AI

    MultiVerS: Improving scientific claim verification with weak supervision and full-document context

    Authors: David Wadden, Kyle Lo, Lucy Lu Wang, Arman Cohan, Iz Beltagy, Hannaneh Hajishirzi

    Abstract: The scientific claim verification task requires an NLP system to label scientific documents which Support or Refute an input claim, and to select evidentiary sentences (or rationales) justifying each predicted label. In this work, we present MultiVerS, which predicts a fact-checking label and identifies rationales in a multitask fashion based on a shared encoding of the claim and full document con…

    Submitted 9 May, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: NAACL Findings 2022. GitHub: https://github.com/dwadden/multivers

  46. arXiv:2111.08366  [pdf, other]

    cs.CL cs.IR

    Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity

    Authors: Sheshera Mysore, Arman Cohan, Tom Hope

    Abstract: We present a new scientific document similarity model based on matching fine-grained aspects of texts. To train our model, we exploit a naturally-occurring source of supervision: sentences in the full-text of papers that cite multiple papers together (co-citations). Such co-citations not only reflect close paper relatedness, but also provide textual descriptions of how the co-cited papers are rela…

    Submitted 4 May, 2022; v1 submitted 16 November, 2021; originally announced November 2021.

    Comments: NAACL 2022 camera-ready

  47. arXiv:2110.08499  [pdf, other]

    cs.CL

    PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

    Authors: Wen Xiao, Iz Beltagy, Giuseppe Carenini, Arman Cohan

    Abstract: We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers…

    Submitted 16 March, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: 19 pages, accepted at the main conference of ACL 2022

  48. arXiv:2107.07170  [pdf, other]

    cs.CL cs.LG

    FLEX: Unifying Evaluation for Few-Shot NLP

    Authors: Jonathan Bragg, Arman Cohan, Kyle Lo, Iz Beltagy

    Abstract: Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental design. Consequently, the community does not know which techniques perform best or even if they outperform simple baselines. In response, we formulate the FLEX Principles, a set of requirements and best…

    Submitted 8 November, 2021; v1 submitted 15 July, 2021; originally announced July 2021.

    Comments: NeurIPS 2021. First two authors contributed equally. Code and leaderboard available at: https://github.com/allenai/flex

    ACM Class: I.2.7

  49. arXiv:2107.00414  [pdf, other]

    cs.CL

    MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting

    Authors: Anne Lauscher, Brandon Ko, Bailey Kuehl, Sophie Johnson, David Jurgens, Arman Cohan, Kyle Lo

    Abstract: Citation context analysis (CCA) is an important task in natural language processing that studies how and why scholars discuss each other's work. Despite decades of study, traditional frameworks for CCA have largely relied on overly-simplistic assumptions of how authors cite, which ignore several important phenomena. For instance, scholarly papers often contain rich discussions of cited work that s…

    Submitted 31 July, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

  50. arXiv:2105.03011  [pdf, other]

    cs.CL

    A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

    Authors: Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner

    Abstract: Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing inform…

    Submitted 6 May, 2021; originally announced May 2021.

    Comments: Accepted at NAACL 2021; Project page: https://allenai.org/project/qasper