Skip to main content

Showing 1–50 of 121 results for author: Lo, K

  1. arXiv:2406.18219  [pdf, other

    cs.CL cs.LG

    A Closer Look into Mixture-of-Experts in Large Language Models

    Authors: Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu

    Abstract: Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechani… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  2. arXiv:2406.16746  [pdf, other

    cs.LG cs.AI cs.CL

    The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

    Authors: Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini

    Abstract: Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation,… ▽ More

    Submitted 25 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  3. arXiv:2406.16264  [pdf, other

    cs.CL cs.AI

    One Thousand and One Pairs: A "novel" challenge for long-context language models

    Authors: Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer

    Abstract: Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, wr… ▽ More

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: preprint, 29 pages

  4. arXiv:2406.11794  [pdf, other

    cs.LG cs.CL

    DataComp-LM: In search of the next generation of training sets for language models

    Authors: Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner , et al. (34 additional authors not shown)

    Abstract: We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with dat… ▽ More

    Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Project page: https://www.datacomp.ai/dclm/

  5. arXiv:2406.07835  [pdf, other

    cs.CL cs.AI

    SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

    Authors: David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, Arman Cohan

    Abstract: We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed t… ▽ More

    Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Submitted to NeurIPS Datasets and Benchmarks 2024

  6. arXiv:2405.10467  [pdf, other

    cs.AI cs.SE

    Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents

    Authors: Yue Liu, Sin Kit Lo, Qinghua Lu, Liming Zhu, Dehai Zhao, Xiwei Xu, Stefan Harrer, Jon Whittle

    Abstract: Foundation model-enabled generative artificial intelligence facilitates the development and implementation of agents, which can leverage distinguished reasoning and language processing capabilities to takes a proactive, autonomous role to pursue users' goals. Nevertheless, there is a lack of systematic knowledge to guide practitioners in designing the agents considering challenges of goal-seeking… ▽ More

    Submitted 24 June, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

  7. arXiv:2404.09290  [pdf, other

    cs.CV eess.IV

    RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion

    Authors: Kyle Shih-Huang Lo, Jörg Peters, Eric Spellman

    Abstract: Accurate completion and denoising of roof height maps are crucial to reconstructing high-quality 3D buildings. Repairing sparse points can enhance low-cost sensor use and reduce UAV flight overlap. RoofDiffusion is a new end-to-end self-supervised diffusion technique for robustly completing, in particular difficult, roof height maps. RoofDiffusion leverages widely-available curated footprints and… ▽ More

    Submitted 14 April, 2024; originally announced April 2024.

  8. arXiv:2404.06393  [pdf, other

    cs.SD cs.AI eess.AS

    MuPT: A Generative Symbolic Music Pretrained Transformer

    Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan , et al. (4 additional authors not shown)

    Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the chal… ▽ More

    Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  9. arXiv:2404.01261  [pdf, other

    cs.CL cs.AI

    FABLES: Evaluating faithfulness and content selection in book-length summarization

    Authors: Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer

    Abstract: While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study miti… ▽ More

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: preprint - 39 pages

  10. arXiv:2403.15246  [pdf, other

    cs.IR cs.CL cs.LG

    FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

    Authors: Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini

    Abstract: Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, w… ▽ More

    Submitted 7 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

  11. arXiv:2403.14371  [pdf

    cs.LG cs.AI cs.DC

    Loop Improvement: An Efficient Approach for Extracting Shared Features from Heterogeneous Data without Central Server

    Authors: Fei Li, Chu Kiong Loo, Wei Shiung Liew, Xiaofeng Liu

    Abstract: In federated learning, data heterogeneity significantly impacts performance. A typical solution involves segregating these parameters into shared and personalized components, a concept also relevant in multi-task learning. Addressing this, we propose "Loop Improvement" (LI), a novel method enhancing this separation and feature extraction without necessitating a central server or data interchange a… ▽ More

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: 11 pages, 11 figures

  12. arXiv:2403.04979  [pdf, other

    cs.HC

    Know Your Audience: The benefits and pitfalls of generating plain language summaries beyond the "general" audience

    Authors: Tal August, Kyle Lo, Noah A. Smith, Katharina Reinecke

    Abstract: Language models (LMs) show promise as tools for communicating science to the general public by simplifying and summarizing complex language. Because models can be prompted to generate text for a specific audience (e.g., college-educated adults), LMs might be used to create multiple versions of plain language summaries for people with different familiarities of scientific topics. However, it is not… ▽ More

    Submitted 7 March, 2024; originally announced March 2024.

  13. arXiv:2403.03866  [pdf, other

    cs.CL

    KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions

    Authors: Fangyuan Xu, Kyle Lo, Luca Soldaini, Bailey Kuehl, Eunsol Choi, David Wadden

    Abstract: Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. To evaluate the capabilities of current LLMs on this task, we construct KIWI, a dataset of knowledge-intensive writing instructions in the scientifi… ▽ More

    Submitted 6 March, 2024; originally announced March 2024.

  14. arXiv:2402.16918  [pdf, other

    cs.LG cs.CV

    m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers

    Authors: Ka Man Lo, Yiming Liang, Wenyu Du, Yuantao Fan, Zili Wang, Wenhao Huang, Lei Ma, Jie Fu

    Abstract: Modular neural architectures are gaining attention for their powerful generalization and efficient adaptation to new domains. However, training these models poses challenges due to optimization difficulties arising from intrinsic sparse connectivity. Leveraging knowledge from monolithic models through techniques like knowledge distillation can facilitate training and enable integration of diverse… ▽ More

    Submitted 7 July, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

  15. arXiv:2402.04505  [pdf, other

    cs.CL quant-ph

    Developments in Sheaf-Theoretic Models of Natural Language Ambiguities

    Authors: Kin Ian Lo, Mehrnoosh Sadrzadeh, Shane Mansfield

    Abstract: Sheaves are mathematical objects consisting of a base which constitutes a topological space and the data associated with each open set thereof, e.g. continuous functions defined on the open sets. Sheaves have originally been used in algebraic topology and logic. Recently, they have also modelled events such as physical experiments and natural language disambiguation processes. We extend the latter… ▽ More

    Submitted 6 February, 2024; originally announced February 2024.

    Comments: arXiv admin note: text overlap with arXiv:2308.16498

  16. arXiv:2402.00838  [pdf, other

    cs.CL

    OLMo: Accelerating the Science of Language Models

    Authors: Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam , et al. (18 additional authors not shown)

    Abstract: Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models… ▽ More

    Submitted 7 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  17. arXiv:2402.00159  [pdf, other

    cs.CL

    Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

    Authors: Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen , et al. (11 additional authors not shown)

    Abstract: Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training dat… ▽ More

    Submitted 6 June, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

    Comments: Accepted at ACL 2024; Dataset: https://hf.co/datasets/allenai/dolma; Code: https://github.com/allenai/dolma

  18. arXiv:2401.16475  [pdf, other

    cs.CL

    InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification

    Authors: Jan Trienes, Sebastian Joseph, Jörg Schlötterer, Christin Seifert, Kyle Lo, Wei Xu, Byron C. Wallace, Junyi Jessy Li

    Abstract: Text simplification aims to make technical texts more accessible to laypeople but often results in deletion of information and vagueness. This work proposes InfoLossQA, a framework to characterize and recover simplification-induced information loss in form of question-and-answer (QA) pairs. Building on the theory of Question Under Discussion, the QA pairs are designed to help readers deepen their… ▽ More

    Submitted 4 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at ACL 2024 (main conference)

  19. arXiv:2312.10523  [pdf, other

    cs.CL cs.AI cs.LG

    Paloma: A Benchmark for Evaluating Language Model Fit

    Authors: Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge

    Abstract: Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains$\unicode{x2013}$varying distributions of language. Rather than assuming perplexity on one distribution extrapolates to others, Perplexity Analysis for Language Model Assessment (Paloma), measures LM fit to 585 text domains, ranging from nytimes.com… ▽ More

    Submitted 16 December, 2023; originally announced December 2023.

    Comments: Project Page: https://paloma.allen.ai/

  20. arXiv:2311.09765  [pdf, other

    cs.IR cs.AI

    Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders

    Authors: Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, Kyle Lo

    Abstract: Prevailing research practice today often relies on training dense retrievers on existing large datasets such as MSMARCO and then experimenting with ways to improve zero-shot generalization capabilities to unseen domains. While prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, increasing model size, or even further base mo… ▽ More

    Submitted 16 November, 2023; originally announced November 2023.

  21. arXiv:2310.03193  [pdf

    cs.DL cs.CL cs.CY physics.hist-ph physics.soc-ph

    The Rise of Open Science: Tracking the Evolution and Perceived Value of Data and Methods Link-Sharing Practices

    Authors: Hancheng Cao, Jesse Dodge, Kyle Lo, Daniel A. McFarland, Lucy Lu Wang

    Abstract: In recent years, funding agencies and journals increasingly advocate for open science practices (e.g. data and method sharing) to improve the transparency, access, and reproducibility of science. However, quantifying these practices at scale has proven difficult. In this work, we leverage a large-scale dataset of 1.1M papers from arXiv that are representative of the fields of physics, math, and co… ▽ More

    Submitted 4 October, 2023; originally announced October 2023.

  22. arXiv:2310.01649  [pdf, other

    cs.LG cs.AI

    On Training Derivative-Constrained Neural Networks

    Authors: KaiChieh Lo, Daniel Huang

    Abstract: We refer to the setting where the (partial) derivatives of a neural network's (NN's) predictions with respect to its inputs are used as additional training signal as a derivative-constrained (DC) NN. This situation is common in physics-informed settings in the natural sciences. We propose an integrated RELU (IReLU) activation function to improve training of DC NNs. We also investigate denormalizat… ▽ More

    Submitted 11 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

  23. arXiv:2310.00785  [pdf, other

    cs.CL cs.AI cs.LG

    BooookScore: A systematic exploration of book-length summarization in the era of LLMs

    Authors: Yapei Chang, Kyle Lo, Tanya Goyal, Mohit Iyyer

    Abstract: Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book… ▽ More

    Submitted 13 April, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 camera-ready (updated figure1 and table2; corrected minor details in the explanation of hierarchical merging)

  24. arXiv:2309.08541  [pdf, other

    cs.IR cs.AI cs.CL

    When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets

    Authors: Orion Weller, Kyle Lo, David Wadden, Dawn Lawrie, Benjamin Van Durme, Arman Cohan, Luca Soldaini

    Abstract: Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find t… ▽ More

    Submitted 26 February, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: EACL 2024 camera ready

  25. arXiv:2309.03487  [pdf, other

    cs.LG cs.CR cs.NE

    Privacy-preserving Continual Federated Clustering via Adaptive Resonance Theory

    Authors: Naoki Masuyama, Yusuke Nojima, Yuichiro Toda, Chu Kiong Loo, Hisao Ishibuchi, Naoyuki Kubota

    Abstract: With the increasing importance of data privacy protection, various privacy-preserving machine learning methods have been proposed. In the clustering domain, various algorithms with a federated learning framework (i.e., federated clustering) have been actively studied and showed high clustering performance while preserving data privacy. However, most of the base clusterers (i.e., clustering algorit… ▽ More

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: This paper is currently under review. arXiv admin note: substantial text overlap with arXiv:2305.01507

  26. arXiv:2308.16498  [pdf, other

    cs.CL cs.AI quant-ph

    Generalised Winograd Schema and its Contextuality

    Authors: Kin Ian Lo, Mehrnoosh Sadrzadeh, Shane Mansfield

    Abstract: Ambiguities in natural language give rise to probability distributions over interpretations. The distributions are often over multiple ambiguous words at a time; a multiplicity which makes them a suitable topic for sheaf-theoretic models of quantum contextuality. Previous research showed that different quantitative measures of contextuality correlate well with Psycholinguistic research on lexical… ▽ More

    Submitted 31 August, 2023; originally announced August 2023.

    Comments: In Proceedings QPL 2023, arXiv:2308.15489

    Journal ref: EPTCS 384, 2023, pp. 187-202

  27. arXiv:2307.09701  [pdf, other

    cs.CL

    Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation

    Authors: Hao Peng, Qingqing Cao, Jesse Dodge, Matthew E. Peters, Jared Fernandez, Tom Sherborne, Kyle Lo, Sam Skjonsberg, Emma Strubell, Darrell Plessas, Iz Beltagy, Evan Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi

    Abstract: Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is challenging to control due to disparate levels of accessibility across diffe… ▽ More

    Submitted 18 July, 2023; originally announced July 2023.

  28. arXiv:2306.13410  [pdf, other

    cs.LG

    Explainable Lifelong Stream Learning Based on "Glocal" Pairwise Fusion

    Authors: Chu Kiong Loo, Wei Shiung Liew, Stefan Wermter

    Abstract: Real-time on-device continual learning applications are used on mobile phones, consumer robots, and smart appliances. Such devices have limited processing and memory storage capabilities, whereas continual learning acquires data over a long period of time. By necessity, lifelong learning algorithms have to be able to operate under such constraints while delivering good performance. This study pres… ▽ More

    Submitted 23 June, 2023; originally announced June 2023.

    Comments: 24 pages, 8 figures

  29. arXiv:2306.08056  [pdf, other

    cs.CR cs.AI cs.SE

    Distributed Trust Through the Lens of Software Architecture

    Authors: Sin Kit Lo, Yue Liu, Guangsheng Yu, Qinghua Lu, Xiwei Xu, Liming Zhu

    Abstract: Distributed trust is a nebulous concept that has evolved from different perspectives in recent years. While one can attribute its current prominence to blockchain and cryptocurrency, the distributed trust concept has been cultivating progress in federated learning, trustworthy and responsible AI in an ecosystem setting, data sharing, privacy issues across organizational boundaries, and zero trust… ▽ More

    Submitted 25 May, 2023; originally announced June 2023.

  30. arXiv:2306.01058  [pdf, other

    cs.CL

    Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents

    Authors: Catherine Chen, Zejiang Shen, Dan Klein, Gabriel Stanovsky, Doug Downey, Kyle Lo

    Abstract: Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers. Layout-infused LMs are often evaluated on documents with familiar layout features (e.g., papers from the same publisher), but in practice models encounter documents with unfamiliar distributions of layout features, such as new combinations of text… ▽ More

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: To appear in ACL Findings 2023

  31. arXiv:2305.15053  [pdf, other

    cs.CL cs.IR

    Decomposing Complex Queries for Tip-of-the-tongue Retrieval

    Authors: Kevin Lin, Kyle Lo, Joseph E. Gonzalez, Dan Klein

    Abstract: When re-finding items, users who forget or are uncertain about identifying details often rely on creative strategies for expressing their information needs -- complex queries that describe content elements (e.g., book characters or events), information beyond the document text (e.g., descriptions of book covers), or personal context (e.g., when they read a book). This retrieval setting, called tip… ▽ More

    Submitted 24 May, 2023; originally announced May 2023.

  32. arXiv:2305.14772  [pdf, other

    cs.CL

    A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents

    Authors: Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle Lo

    Abstract: Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context from the original document. In this work, we use language models to rewrite snippets from scientific documents to be read on their own. First,… ▽ More

    Submitted 30 November, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: 19 pages, 2 figures, 8 tables, EMNLP2023

  33. arXiv:2305.14660  [pdf, other

    cs.CL

    Complex Mathematical Symbol Definition Structures: A Dataset and Model for Coordination Resolution in Definition Extraction

    Authors: Anna Martin-Boyle, Andrew Head, Kyle Lo, Risham Sidhu, Marti A. Hearst, Dongyeop Kang

    Abstract: Mathematical symbol definition extraction is important for improving scholarly reading interfaces and scholarly information extraction (IE). However, the task poses several challenges: math symbols are difficult to process as they are not composed of natural language morphemes; and scholarly papers often contain sentences that require resolving complex coordinate structures. We present SymDef, an… ▽ More

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 9 pages, 4 figures

    ACM Class: I.2.7

  34. arXiv:2305.01507  [pdf, other

    cs.NE cs.LG

    A Parameter-free Adaptive Resonance Theory-based Topological Clustering Algorithm Capable of Continual Learning

    Authors: Naoki Masuyama, Takanori Takebayashi, Yusuke Nojima, Chu Kiong Loo, Hisao Ishibuchi, Stefan Wermter

    Abstract: In general, a similarity threshold (i.e., a vigilance parameter) for a node learning process in Adaptive Resonance Theory (ART)-based algorithms has a significant impact on clustering performance. In addition, an edge deletion threshold in a topological clustering algorithm plays an important role in adaptively generating well-separated clusters during a self-organizing process. In this paper, we… ▽ More

    Submitted 2 May, 2023; v1 submitted 30 April, 2023; originally announced May 2023.

    Comments: This paper is currently under review

  35. arXiv:2304.02623  [pdf, other

    cs.CL cs.HC

    Beyond Summarization: Designing AI Support for Real-World Expository Writing Tasks

    Authors: Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo, Jonathan Bragg, Jeff Hammerbacher, Doug Downey, Joseph Chee Chang, David Sontag

    Abstract: Large language models have introduced exciting new opportunities and challenges in designing and developing new AI-assisted writing support tools. Recent work has shown that leveraging this new technology can transform writing in many scenarios such as ideation during creative writing, editing support, and summarization. However, AI-supported expository writing--including real-world tasks like sch… ▽ More

    Submitted 5 April, 2023; originally announced April 2023.

    Comments: 3 pages, 1 figure, accepted by The Second Workshop on Intelligent and Interactive Writing Assistants

  36. arXiv:2303.14334  [pdf, other

    cs.HC cs.AI cs.CL

    The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces

    Authors: Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X. Zhang, Cassidy Trier, Chloe Anastasiades, Tal August, Russell Authur, Danielle Bragg, Erin Bransom, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Yen-Sung Chen, Evie Yu-Yen Cheng, Yvonne Chou, Doug Downey, Rob Evans, Raymond Fok, Fangzhou Hu, Regan Huff, Dongyeop Kang, Tae Soo Kim, Rodney Kinney , et al. (30 additional authors not shown)

    Abstract: Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, the need for new technology to support the reading process grows. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has chan… ▽ More

    Submitted 23 April, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

  37. arXiv:2303.03915  [pdf, other

    cs.CL cs.AI

    The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

    Authors: Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major, Gerard Dupont, Stella Biderman, Anna Rogers, Loubna Ben allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa , et al. (29 additional authors not shown)

    Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the f… ▽ More

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: NeurIPS 2022, Datasets and Benchmarks Track

    ACM Class: I.2.7

  38. CiteSee: Augmenting Citations in Scientific Papers with Persistent and Personalized Historical Context

    Authors: Joseph Chee Chang, Amy X. Zhang, Jonathan Bragg, Andrew Head, Kyle Lo, Doug Downey, Daniel S. Weld

    Abstract: When reading a scholarly article, inline citations help researchers contextualize the current article and discover relevant prior work. However, it can be challenging to prioritize and make sense of the hundreds of citations encountered during literature reviews. This paper introduces CiteSee, a paper reading tool that leverages a user's publishing, reading, and saving activities to provide person… ▽ More

    Submitted 14 February, 2023; originally announced February 2023.

  39. arXiv:2301.13298  [pdf, other

    cs.CL

    LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization

    Authors: Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, Kyle Lo

    Abstract: While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of t… ▽ More

    Submitted 30 January, 2023; originally announced January 2023.

    Comments: EACL 2023 camera ready. Code and data can be found in https://github.com/martiansideofthemoon/longeval-summarization

  40. arXiv:2301.10140  [pdf, other

    cs.DL cs.CL

    The Semantic Scholar Open Data Platform

    Authors: Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin , et al. (23 additional authors not shown)

    Abstract: The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF conte… ▽ More

    Submitted 24 January, 2023; originally announced January 2023.

    Comments: 8 pages, 6 figures

  41. arXiv:2212.10526  [pdf, other

    cs.CL cs.AI

    Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval

    Authors: John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Lu Wang, Arman Cohan

    Abstract: Multi-document summarization (MDS) assumes a set of topic-related documents are provided as input. In practice, this document set is not always available; it would need to be retrieved given an information need, i.e. a question or topic statement, a setting we dub "open-domain" MDS. We study this more challenging setting by formalizing the task and bootstrapping it using existing datasets, retriev… ▽ More

    Submitted 25 October, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to EMNLP Findings 2023

  42. arXiv:2211.05100  [pdf, other

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major , et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access… ▽ More

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  43. arXiv:2210.13777  [pdf, other

    cs.CL cs.AI

    SciFact-Open: Towards open-domain scientific claim verification

    Authors: David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Iz Beltagy, Lucy Lu Wang, Hannaneh Hajishirzi

    Abstract: While research on scientific claim verification has led to the development of powerful systems that appear to approach human performance, these approaches have yet to be tested in a realistic setting against large corpora of scientific literature. Moving to this open-domain evaluation setting, however, poses unique challenges; in particular, it is infeasible to exhaustively annotate all evidence d… ▽ More

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: EMNLP Findings 2022. GitHub: https://github.com/dwadden/scifact-open-2022

  44. arXiv:2210.04400  [pdf

    cs.HC cs.AI

    Focus Plus: Detect Learner's Distraction by Web Camera in Distance Teaching

    Authors: Eason Chen, Yuen Hsien Tseng, Kuo-Ping Lo

    Abstract: Distance teaching has become popular these years because of the COVID-19 epidemic. However, both students and teachers face several challenges in distance teaching, like being easy to distract. We proposed Focus+, a system designed to detect learners' status with the latest AI technology from their web camera to solve such challenges. By doing so, teachers can know students' status, and students c… ▽ More

    Submitted 9 October, 2022; originally announced October 2022.

    Comments: 5 Pages, 4 Figures, 2021 National Chair Professorship Academic Series: Teaching and Learning in Pandemic Era

  45. arXiv:2209.03041  [pdf, other

    cs.CV cs.AI cs.LG

    Multi-Scale Attention-based Multiple Instance Learning for Classification of Multi-Gigapixel Histology Images

    Authors: Made Satria Wibawa, Kwok-Wai Lo, Lawrence Young, Nasir Rajpoot

    Abstract: Histology images with multi-gigapixel of resolution yield rich information for cancer diagnosis and prognosis. Most of the time, only slide-level label is available because pixel-wise annotation is labour intensive task. In this paper, we propose a deep learning pipeline for classification in histology images. Using multiple instance learning, we attempt to predict the latent membrane protein 1 (L… ▽ More

    Submitted 7 September, 2022; originally announced September 2022.

  46. arXiv:2208.05720  [pdf, other

    cs.CL cs.AI cs.LG cs.NE

    A Model of Anaphoric Ambiguities using Sheaf Theoretic Quantum-like Contextuality and BERT

    Authors: Kin Ian Lo, Mehrnoosh Sadrzadeh, Shane Mansfield

    Abstract: Ambiguities of natural language do not preclude us from using it and context helps in getting ideas across. They, nonetheless, pose a key challenge to the development of competent machines to understand natural language and use it as humans do. Contextuality is an unparalleled phenomenon in quantum mechanics, where different mathematical formalisms have been put forwards to understand and reason… ▽ More

    Submitted 11 August, 2022; originally announced August 2022.

    Comments: In Proceedings E2ECOMPVEC, arXiv:2208.05313

    Journal ref: EPTCS 366, 2022, pp. 23-34

  47. arXiv:2208.05393  [pdf, other

    cs.CL cs.LO

    A Quantum Natural Language Processing Approach to Pronoun Resolution

    Authors: Hadi Wazni, Kin Ian Lo, Lachlan McPheat, Mehrnoosh Sadrzadeh

    Abstract: We use the Lambek Calculus with soft sub-exponential modalities to model and reason about discourse relations such as anaphora and ellipsis. A semantics for this logic is obtained by using truncated Fock spaces, developed in our previous work. We depict these semantic computations via a new string diagram. The Fock Space semantics has the advantage that its terms are learnable from large corpora o… ▽ More

    Submitted 10 August, 2022; originally announced August 2022.

  48. arXiv:2206.10883  [pdf, other

    cs.CL cs.CY

    Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities

    Authors: Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, Doug Downey

    Abstract: With the advent of large language models, methods for abstractive summarization have made great strides, creating potential for use in applications to aid knowledge workers processing unwieldy document collections. One such setting is the Civil Rights Litigation Clearinghouse (CRLC) (https://clearinghouse.net),which posts information about large-scale civil rights lawsuits, serving lawyers, schola… ▽ More

    Submitted 22 July, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

    Comments: 37 pages, 2 figures, 9 tables

  49. arXiv:2206.05853  [pdf, other

    cs.CV cs.AI

    Modeling Generalized Specialist Approach To Train Quality Resilient Snapshot Ensemble

    Authors: Ghalib Ahmed Tahir, Chu Kiong Loo, Zongying Liu

    Abstract: Convolutional neural networks (CNNs) apply well with food image recognition due to the ability to learn discriminative visual features. Nevertheless, recognizing distorted images is challenging for existing CNNs. Hence, the study modelled a generalized specialist approach to train a quality resilient ensemble. The approach aids the models in the ensemble framework retain general skills of recogniz… ▽ More

    Submitted 12 June, 2022; originally announced June 2022.

  50. arXiv:2206.03216  [pdf, other

    cs.CY cs.AI cs.CL

    Data Governance in the Age of Large-Scale Data-Driven Language Technology

    Authors: Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Gérard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Isaac Johnson, Dragomir Radev, Somaieh Nikpoor, Jörg Frohberg, Aaron Gokaslan, Peter Henderson, Rishi Bommasani, Margaret Mitchell

    Abstract: The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distrib… ▽ More

    Submitted 2 November, 2022; v1 submitted 3 May, 2022; originally announced June 2022.

    Comments: 32 pages: Full paper and Appendices; Association for Computing Machinery, New York, NY, USA, 2206-2222

    Journal ref: Proceedings of 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22)