subscribe to arXiv mailings

arXiv:2407.09311 [pdf, other]

Scalability of Bayesian Network Structure Elicitation with Large Language Models: a Novel Methodology and Comparative Analysis

Authors: Nikolay Babakov, Ehud Reiter, Alberto Bugarin

Abstract: In this work, we propose a novel method for Bayesian Networks (BNs) structure elicitation that is based on the initialization of several LLMs with different experiences, independently querying them to create a structure of the BN, and further obtaining the final structure by majority voting. We compare the method with one alternative method on various widely and not widely known BNs of different s… ▽ More In this work, we propose a novel method for Bayesian Networks (BNs) structure elicitation that is based on the initialization of several LLMs with different experiences, independently querying them to create a structure of the BN, and further obtaining the final structure by majority voting. We compare the method with one alternative method on various widely and not widely known BNs of different sizes and study the scalability of both methods on them. We also propose an approach to check the contamination of BNs in LLM, which shows that some widely known BNs are inapplicable for testing the LLM usage for BNs structure elicitation. We also show that some BNs may be inapplicable for such experiments because their node names are indistinguishable. The experiments on the other BNs show that our method performs better than the existing method with one of the three studied LLMs; however, the performance of both methods significantly decreases with the increase in BN size. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: 27 pages

arXiv:2406.15963 [pdf, other]

Effectiveness of ChatGPT in explaining complex medical reports to patients

Authors: Mengxuan Sun, Ehud Reiter, Anne E Kiltie, George Ramsay, Lisa Duncan, Peter Murchie, Rosalind Adam

Abstract: Electronic health records contain detailed information about the medical condition of patients, but they are difficult for patients to understand even if they have access to them. We explore whether ChatGPT (GPT 4) can help explain multidisciplinary team (MDT) reports to colorectal and prostate cancer patients. These reports are written in dense medical language and assume clinical knowledge, so t… ▽ More Electronic health records contain detailed information about the medical condition of patients, but they are difficult for patients to understand even if they have access to them. We explore whether ChatGPT (GPT 4) can help explain multidisciplinary team (MDT) reports to colorectal and prostate cancer patients. These reports are written in dense medical language and assume clinical knowledge, so they are a good test of the ability of ChatGPT to explain complex medical reports to patients. We asked clinicians and lay people (not patients) to review explanations and responses of ChatGPT. We also ran three focus groups (including cancer patients, caregivers, computer scientists, and clinicians) to discuss output of ChatGPT. Our studies highlighted issues with inaccurate information, inappropriate language, limited personalization, AI distrust, and challenges integrating large language models (LLMs) into clinical workflow. These issues will need to be resolved before LLMs can be used to explain complex personal medical information to patients. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: under review

arXiv:2405.18350 [pdf, other]

doi 10.1109/ACCESS.2019.2937505

A System for Automatic English Text Expansion

Authors: Silvia García Méndez, Milagros Fernández Gavilanes, Enrique Costa Montenegro, Jonathan Juncal Martínez, Francisco Javier González Castaño, Ehud Reiter

Abstract: We present an automatic text expansion system to generate English sentences, which performs automatic Natural Language Generation (NLG) by combining linguistic rules with statistical approaches. Here, "automatic" means that the system can generate coherent and correct sentences from a minimum set of words. From its inception, the design is modular and adaptable to other languages. This adaptabilit… ▽ More We present an automatic text expansion system to generate English sentences, which performs automatic Natural Language Generation (NLG) by combining linguistic rules with statistical approaches. Here, "automatic" means that the system can generate coherent and correct sentences from a minimum set of words. From its inception, the design is modular and adaptable to other languages. This adaptability is one of its greatest advantages. For English, we have created the highly precise aLexiE lexicon with wide coverage, which represents a contribution on its own. We have evaluated the resulting NLG library in an Augmentative and Alternative Communication (AAC) proof of concept, both directly (by regenerating corpus sentences) and manually (from annotations) using a popular corpus in the NLG field. We performed a second analysis by comparing the quality of text expansion in English to Spanish, using an ad-hoc Spanish-English parallel corpus. The system might also be applied to other domains such as report and news generation. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Journal ref: (2019) IEEE Access, 7, 123320-123333

arXiv:2405.02770 [pdf, other]

PhilHumans: Benchmarking Machine Learning for Personal Health

Authors: Vadim Liventsev, Vivek Kumar, Allmin Pradhap Singh Susaiyah, Zixiu Wu, Ivan Rodin, Asfand Yaar, Simone Balloccu, Marharyta Beraziuk, Sebastiano Battiato, Giovanni Maria Farinella, Aki Härmä, Rim Helaoui, Milan Petkovic, Diego Reforgiato Recupero, Ehud Reiter, Daniele Riboni, Raymond Sterling

Abstract: The use of machine learning in Healthcare has the potential to improve patient outcomes as well as broaden the reach and affordability of Healthcare. The history of other application areas indicates that strong benchmarks are essential for the development of intelligent systems. We present Personal Health Interfaces Leveraging HUman-MAchine Natural interactions (PhilHumans), a holistic suite of be… ▽ More The use of machine learning in Healthcare has the potential to improve patient outcomes as well as broaden the reach and affordability of Healthcare. The history of other application areas indicates that strong benchmarks are essential for the development of intelligent systems. We present Personal Health Interfaces Leveraging HUman-MAchine Natural interactions (PhilHumans), a holistic suite of benchmarks for machine learning across different Healthcare settings - talk therapy, diet coaching, emergency care, intensive care, obstetric sonography - as well as different learning settings, such as action anticipation, timeseries modeling, insight mining, language modeling, computer vision, reinforcement learning and program synthesis △ Less

Submitted 16 May, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

arXiv:2404.04103 [pdf, other]

Improving Factual Accuracy of Neural Table-to-Text Output by Addressing Input Problems in ToTTo

Authors: Barkavi Sundararajan, Somayajulu Sripada, Ehud Reiter

Abstract: Neural Table-to-Text models tend to hallucinate, producing texts that contain factual errors. We investigate whether such errors in the output can be traced back to problems with the input. We manually annotated 1,837 texts generated by multiple models in the politics domain of the ToTTo dataset. We identify the input problems that are responsible for many output errors and show that fixing these… ▽ More Neural Table-to-Text models tend to hallucinate, producing texts that contain factual errors. We investigate whether such errors in the output can be traced back to problems with the input. We manually annotated 1,837 texts generated by multiple models in the politics domain of the ToTTo dataset. We identify the input problems that are responsible for many output errors and show that fixing these inputs reduces factual errors by between 52% and 76% (depending on the model). In addition, we observe that models struggle in processing tabular inputs that are structured in a non-standard way, particularly when the input lacks distinct row and column values or when the column headers are not correctly mapped to corresponding values. △ Less

Submitted 5 April, 2024; originally announced April 2024.

Comments: Added link to human evaluation guidelines and error annotations

arXiv:2401.17511 [pdf, other]

Linguistically Communicating Uncertainty in Patient-Facing Risk Prediction Models

Authors: Adarsa Sivaprasad, Ehud Reiter

Abstract: This paper addresses the unique challenges associated with uncertainty quantification in AI models when applied to patient-facing contexts within healthcare. Unlike traditional eXplainable Artificial Intelligence (XAI) methods tailored for model developers or domain experts, additional considerations of communicating in natural language, its presentation and evaluating understandability are necess… ▽ More This paper addresses the unique challenges associated with uncertainty quantification in AI models when applied to patient-facing contexts within healthcare. Unlike traditional eXplainable Artificial Intelligence (XAI) methods tailored for model developers or domain experts, additional considerations of communicating in natural language, its presentation and evaluating understandability are necessary. We identify the challenges in communication model performance, confidence, reasoning and unknown knowns using natural language in the context of risk prediction. We propose a design aimed at addressing these challenges, focusing on the specific application of in-vitro fertilisation outcome prediction. △ Less

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.09041 [pdf, other]

Textual Summarisation of Large Sets: Towards a General Approach

Authors: Kittipitch Kuptavanich, Ehud Reiter, Kees Van Deemter, Advaith Siddharthan

Abstract: We are developing techniques to generate summary descriptions of sets of objects. In this paper, we present and evaluate a rule-based NLG technique for summarising sets of bibliographical references in academic papers. This extends our previous work on summarising sets of consumer products and shows how our model generalises across these two very different domains. We are developing techniques to generate summary descriptions of sets of objects. In this paper, we present and evaluate a rule-based NLG technique for summarising sets of bibliographical references in academic papers. This extends our previous work on summarising sets of consumer products and shows how our model generalises across these two very different domains. △ Less

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.08420 [pdf, other]

Ask the experts: sourcing high-quality datasets for nutritional counselling through Human-AI collaboration

Authors: Simone Balloccu, Ehud Reiter, Vivek Kumar, Diego Reforgiato Recupero, Daniele Riboni

Abstract: Large Language Models (LLMs), with their flexible generation abilities, can be powerful data sources in domains with few or no available corpora. However, problems like hallucinations and biases limit such applications. In this case study, we pick nutrition counselling, a domain lacking any public resource, and show that high-quality datasets can be gathered by combining LLMs, crowd-workers and nu… ▽ More Large Language Models (LLMs), with their flexible generation abilities, can be powerful data sources in domains with few or no available corpora. However, problems like hallucinations and biases limit such applications. In this case study, we pick nutrition counselling, a domain lacking any public resource, and show that high-quality datasets can be gathered by combining LLMs, crowd-workers and nutrition experts. We first crowd-source and cluster a novel dataset of diet-related issues, then work with experts to prompt ChatGPT into producing related supportive text. Finally, we let the experts evaluate the safety of the generated text. We release HAI-coaching, the first expert-annotated nutrition counselling dataset containing ~2.4K dietary struggles from crowd workers, and ~97K related supportive texts generated by ChatGPT. Extensive analysis shows that ChatGPT while producing highly fluent and human-like text, also manifests harmful behaviours, especially in sensitive topics like mental health, making it unsuitable for unsupervised use. △ Less

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2309.09917 [pdf, other]

doi 10.1007/978-3-031-50396-2_3

Evaluation of Human-Understandability of Global Model Explanations using Decision Tree

Authors: Adarsa Sivaprasad, Ehud Reiter, Nava Tintarev, Nir Oren

Abstract: In explainable artificial intelligence (XAI) research, the predominant focus has been on interpreting models for experts and practitioners. Model agnostic and local explanation approaches are deemed interpretable and sufficient in many applications. However, in domains like healthcare, where end users are patients without AI or domain expertise, there is an urgent need for model explanations that… ▽ More In explainable artificial intelligence (XAI) research, the predominant focus has been on interpreting models for experts and practitioners. Model agnostic and local explanation approaches are deemed interpretable and sufficient in many applications. However, in domains like healthcare, where end users are patients without AI or domain expertise, there is an urgent need for model explanations that are more comprehensible and instil trust in the model's operations. We hypothesise that generating model explanations that are narrative, patient-specific and global(holistic of the model) would enable better understandability and enable decision-making. We test this using a decision tree model to generate both local and global explanations for patients identified as having a high risk of coronary heart disease. These explanations are presented to non-expert users. We find a strong individual preference for a specific type of explanation. The majority of participants prefer global explanations, while a smaller group prefers local explanations. A task based evaluation of mental models of these participants provide valuable feedback to enhance narrative global explanations. This, in turn, guides the design of health informatics systems that are both trustworthy and actionable. △ Less

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2305.01633 [pdf, other]

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

Authors: Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai , et al. (17 additional authors not shown)

Abstract: We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13\% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, a… ▽ More We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13\% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP. △ Less

Submitted 7 August, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: 5 pages plus appendix, 4 tables, 1 figure. To appear at "Workshop on Insights from Negative Results in NLP" (co-located with EACL2023). Updated author list and acknowledgements

MSC Class: 68 ACM Class: I.2.7

arXiv:2211.09455 [pdf, other]

Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation

Authors: Aleksandar Savkov, Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Anya Belz, Ehud Reiter

Abstract: Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality. This difficulty is compounded in automatic consultation note generation by differing opinions between medical experts both about which patient statements should be included in generated notes and about their respective importance in arriving at a diagnosis. Previ… ▽ More Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality. This difficulty is compounded in automatic consultation note generation by differing opinions between medical experts both about which patient statements should be included in generated notes and about their respective importance in arriving at a diagnosis. Previous real-world evaluations of note-generation systems saw substantial disagreement between expert evaluators. In this paper we propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists, which are created in a preliminary step and then used as a common point of reference during quality assessment. We observed good levels of inter-annotator agreement in a first evaluation study using the protocol; further, using Consultation Checklists produced in the study as reference for automatic metrics such as ROUGE or BERTScore improves their correlation with human judgements compared to using the original human note. △ Less

Submitted 17 November, 2022; originally announced November 2022.

Comments: Accepted for publication at EMNLP 2022

arXiv:2211.05100 [pdf, other]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Authors: BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major , et al. (369 additional authors not shown)

Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access… ▽ More Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License. △ Less

Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2206.13435 [pdf, other]

Comparing informativeness of an NLG chatbot vs graphical app in diet-information domain

Authors: Simone Balloccu, Ehud Reiter

Abstract: Visual representation of data like charts and tables can be challenging to understand for readers. Previous work showed that combining visualisations with text can improve the communication of insights in static contexts, but little is known about interactive ones. In this work we present an NLG chatbot that processes natural language queries and provides insights through a combination of charts a… ▽ More Visual representation of data like charts and tables can be challenging to understand for readers. Previous work showed that combining visualisations with text can improve the communication of insights in static contexts, but little is known about interactive ones. In this work we present an NLG chatbot that processes natural language queries and provides insights through a combination of charts and text. We apply it to nutrition, a domain communication quality is critical. Through crowd-sourced evaluation we compare the informativeness of our chatbot against traditional, static diet-apps. We find that the conversational context significantly improved users' understanding of dietary data in various tasks, and that users considered the chatbot as more useful and quick to use than traditional apps. △ Less

Submitted 2 July, 2022; v1 submitted 23 June, 2022; originally announced June 2022.

arXiv:2205.02549 [pdf, other]

User-Driven Research of Medical Note Generation Software

Authors: Tom Knoll, Francesco Moramarco, Alex Papadopoulos Korfiatis, Rachel Young, Claudia Ruffini, Mark Perera, Christian Perstl, Ehud Reiter, Anya Belz, Aleksandar Savkov

Abstract: A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations. However, there are very few studies on how such systems could be used in clinical practice, how clinicians would adjust to using them, or how system design should be influenced by such considerations. In this paper, we present three ro… ▽ More A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations. However, there are very few studies on how such systems could be used in clinical practice, how clinicians would adjust to using them, or how system design should be influenced by such considerations. In this paper, we present three rounds of user studies, carried out in the context of developing a medical note generation system. We present, analyse and discuss the participating clinicians' impressions and views of how the system ought to be adapted to be of value to them. Next, we describe a three-week test run of the system in a live telehealth clinical practice. Major findings include (i) the emergence of five different note-taking behaviours; (ii) the importance of the system generating notes in real time during the consultation; and (iii) the identification of a number of clinical use cases that could prove challenging for automatic note generation systems. △ Less

Submitted 6 May, 2022; v1 submitted 5 May, 2022; originally announced May 2022.

Comments: Accepted for publication at NAACL 2022

arXiv:2204.00447 [pdf, other]

Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

Authors: Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, Aleksandar Savkov

Abstract: In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 cl… ▽ More In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore. All our findings and annotations are open-sourced. △ Less

Submitted 1 April, 2022; originally announced April 2022.

Comments: To be published in proceedings of ACL 2022

arXiv:2108.05644 [pdf, other]

Generation Challenges: Results of the Accuracy Evaluation Shared Task

Authors: Craig Thomson, Ehud Reiter

Abstract: The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automa… ▽ More The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference). △ Less

Submitted 15 August, 2021; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: To appear in proceedings of INGL2021

arXiv:2104.04412 [pdf, other]

Towards objectively evaluating the quality of generated medical summaries

Authors: Francesco Moramarco, Damir Juric, Aleksandar Savkov, Ehud Reiter

Abstract: We propose a method for evaluating the quality of generated text by asking evaluators to count facts, and computing precision, recall, f-score, and accuracy from the raw counts. We believe this approach leads to a more objective and easier to reproduce evaluation. We apply this to the task of medical report summarisation, where measuring objective quality and accuracy is of paramount importance. We propose a method for evaluating the quality of generated text by asking evaluators to count facts, and computing precision, recall, f-score, and accuracy from the raw counts. We believe this approach leads to a more objective and easier to reproduce evaluation. We apply this to the task of medical report summarisation, where measuring objective quality and accuracy is of paramount importance. △ Less

Submitted 9 April, 2021; originally announced April 2021.

arXiv:2104.04402 [pdf, other]

A preliminary study on evaluating Consultation Notes with Post-Editing

Authors: Francesco Moramarco, Alex Papadopoulos Korfiatis, Aleksandar Savkov, Ehud Reiter

Abstract: Automatic summarisation has the potential to aid physicians in streamlining clerical tasks such as note taking. But it is notoriously difficult to evaluate these systems and demonstrate that they are safe to be used in a clinical setting. To circumvent this issue, we propose a semi-automatic approach whereby physicians post-edit generated notes before submitting them. We conduct a preliminary stud… ▽ More Automatic summarisation has the potential to aid physicians in streamlining clerical tasks such as note taking. But it is notoriously difficult to evaluate these systems and demonstrate that they are safe to be used in a clinical setting. To circumvent this issue, we propose a semi-automatic approach whereby physicians post-edit generated notes before submitting them. We conduct a preliminary study on the time saving of automatically generated consultation notes with post-editing. Our evaluators are asked to listen to mock consultations and to post-edit three generated notes. We time this and find that it is faster than writing the note from scratch. We present insights and lessons learnt from this experiment. △ Less

Submitted 9 April, 2021; originally announced April 2021.

arXiv:2103.07929 [pdf, other]

A Systematic Review of Reproducibility Research in Natural Language Processing

Authors: Anya Belz, Shubham Agarwal, Anastasia Shimorina, Ehud Reiter

Abstract: Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defi… ▽ More Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defined, measured and addressed, with diversity of views currently increasing rather than converging. With this focused contribution, we aim to provide a wide-angle, and as near as possible complete, snapshot of current work on reproducibility in NLP, delineating differences and similarities, and providing pointers to common denominators. △ Less

Submitted 21 March, 2021; v1 submitted 14 March, 2021; originally announced March 2021.

Comments: To be published in proceedings of EACL'21

arXiv:2011.03992 [pdf, ps, other]

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems

Authors: Craig Thomson, Ehud Reiter

Abstract: Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluatio… ▽ More Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics △ Less

Submitted 8 November, 2020; originally announced November 2020.

Comments: To appear in INLG-2020. Resources available at https://github.com/nlgcat/evaluating_accuracy

arXiv:2007.09970 [pdf, other]

How are you? Introducing stress-based text tailoring

Authors: Simone Balloccu, Ehud Reiter, Alexandra Johnstone, Claire Fyfe

Abstract: Can stress affect not only your life but also how you read and interpret a text? Healthcare has shown evidence of such dynamics and in this short paper we discuss customising texts based on user stress level, as it could represent a critical factor when it comes to user engagement and behavioural change. We first show a real-world example in which user behaviour is influenced by stress, then, afte… ▽ More Can stress affect not only your life but also how you read and interpret a text? Healthcare has shown evidence of such dynamics and in this short paper we discuss customising texts based on user stress level, as it could represent a critical factor when it comes to user engagement and behavioural change. We first show a real-world example in which user behaviour is influenced by stress, then, after discussing which tools can be employed to assess and measure it, we propose an initial method for tailoring the document by exploiting complexity reduction and affect enforcement. The result is a short and encouraging text which requires less commitment to be read and understood. We believe this work in progress can raise some interesting questions on a topic that is often overlooked in NLG. △ Less

Submitted 20 July, 2020; originally announced July 2020.

Journal ref: IntelLang 2020

arXiv:2006.12234 [pdf, ps, other]

Shared Task on Evaluating Accuracy in Natural Language Generation

Authors: Ehud Reiter, Craig Thomson

Abstract: We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts. Participants will measure the accuracy of basketball game summaries produced by NLG systems from basketball box score data. We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts. Participants will measure the accuracy of basketball game summaries produced by NLG systems from basketball box score data. △ Less

Submitted 6 November, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: To appear in INLG 2020

arXiv:2004.14067 [pdf, other]

FitChat: Conversational Artificial Intelligence Interventions for Encouraging Physical Activity in Older Adults

Authors: Nirmalie Wiratunga, Kay Cooper, Anjana Wijekoon, Chamath Palihawadana, Vanessa Mendham, Ehud Reiter, Kyle Martin

Abstract: Delivery of digital behaviour change interventions which encourage physical activity has been tried in many forms. Most often interventions are delivered as text notifications, but these do not promote interaction. Advances in conversational AI have improved natural language understanding and generation, allowing AI chatbots to provide an engaging experience with the user. For this reason, chatbot… ▽ More Delivery of digital behaviour change interventions which encourage physical activity has been tried in many forms. Most often interventions are delivered as text notifications, but these do not promote interaction. Advances in conversational AI have improved natural language understanding and generation, allowing AI chatbots to provide an engaging experience with the user. For this reason, chatbots have recently been seen in healthcare delivering digital interventions through free text or choice selection. In this work, we explore the use of voice-based AI chatbots as a novel mode of intervention delivery, specifically targeting older adults to encourage physical activity. We co-created "FitChat", an AI chatbot, with older adults and we evaluate the first prototype using Think Aloud Sessions. Our thematic evaluation suggests that older adults prefer voice-based chat over text notifications or free text entry and that voice is a powerful mode for encouraging motivation. △ Less

Submitted 29 April, 2020; originally announced April 2020.

arXiv:1911.08794 [pdf, ps, other]

Natural Language Generation Challenges for Explainable AI

Authors: Ehud Reiter

Abstract: Good quality explanations of artificial intelligence (XAI) reasoning must be written (and evaluated) for an explanatory purpose, targeted towards their readers, have a good narrative and causal structure, and highlight where uncertainty and data quality affect the AI output. I discuss these challenges from a Natural Language Generation (NLG) perspective, and highlight four specific NLG for XAI res… ▽ More Good quality explanations of artificial intelligence (XAI) reasoning must be written (and evaluated) for an explanatory purpose, targeted towards their readers, have a good narrative and causal structure, and highlight where uncertainty and data quality affect the AI output. I discuss these challenges from a Natural Language Generation (NLG) perspective, and highlight four specific NLG for XAI research challenges. △ Less

Submitted 20 November, 2019; originally announced November 2019.

Comments: Presented at the NL4XAI workshop (https://sites.google.com/view/nl4xai2019/)

arXiv:1911.00127 [pdf]

Automatic Prostate Zonal Segmentation Using Fully Convolutional Network with Feature Pyramid Attention

Authors: Yongkai Liu, Guang Yang, Sohrab Afshari Mirak, Melina Hosseiny, Afshin Azadikhah, Xinran Zhong, Robert E. Reiter, Yeejin Lee, Steven Raman, Kyunghyun Sung

Abstract: Our main objective is to develop a novel deep learning-based algorithm for automatic segmentation of prostate zone and to evaluate the proposed algorithm on an additional independent testing data in comparison with inter-reader consistency between two experts. With IRB approval and HIPAA compliance, we designed a novel convolutional neural network (CNN) for automatic segmentation of the prostatic… ▽ More Our main objective is to develop a novel deep learning-based algorithm for automatic segmentation of prostate zone and to evaluate the proposed algorithm on an additional independent testing data in comparison with inter-reader consistency between two experts. With IRB approval and HIPAA compliance, we designed a novel convolutional neural network (CNN) for automatic segmentation of the prostatic transition zone (TZ) and peripheral zone (PZ) on T2-weighted (T2w) MRI. The total study cohort included 359 patients from two sources; 313 from a deidentified publicly available dataset (SPIE-AAPM-NCI PROSTATEX challenge) and 46 from a large U.S. tertiary referral center with 3T MRI (external testing dataset (ETD)). The TZ and PZ contours were manually annotated by research fellows, supervised by genitourinary (GU) radiologists. The model was developed using 250 patients and tested internally using the remaining 63 patients from the PROSTATEX (internal testing dataset (ITD)) and tested again (n=46) externally using the ETD. The Dice Similarity Coefficient (DSC) was used to evaluate the segmentation performance. DSCs for PZ and TZ were 0.74 and 0.86 in the ITD respectively. In the ETD, DSCs for PZ and TZ were 0.74 and 0.792, respectively. The inter-reader consistency (Expert 2 vs. Expert 1) were 0.71 (PZ) and 0.75 (TZ). This novel DL algorithm enabled automatic segmentation of PZ and TZ with high accuracy on both ITD and ETD without a performance difference for PZ and less than 10% TZ difference. In the ETD, the proposed method can be comparable to experts in the segmentation of prostate zones. △ Less

Submitted 31 October, 2019; originally announced November 2019.

Comments: Has been accepted by IEEE Access

arXiv:1809.02494 [pdf, other]

Meteorologists and Students: A resource for language grounding of geographical descriptors

Authors: Alejandro Ramos-Soto, Ehud Reiter, Kees van Deemter, Jose M. Alonso, Albert Gatt

Abstract: We present a data resource which can be useful for research purposes on language grounding tasks in the context of geographical referring expression generation. The resource is composed of two data sets that encompass 25 different geographical descriptors and a set of associated graphical representations, drawn as polygons on a map by two groups of human subjects: teenage students and expert meteo… ▽ More We present a data resource which can be useful for research purposes on language grounding tasks in the context of geographical referring expression generation. The resource is composed of two data sets that encompass 25 different geographical descriptors and a set of associated graphical representations, drawn as polygons on a map by two groups of human subjects: teenage students and expert meteorologists. △ Less

Submitted 7 September, 2018; originally announced September 2018.

Comments: Resource paper, 5 pages, 6 figures, 1 table. Conference: INLG 2018

arXiv:1808.03507 [pdf, other]

Making effective use of healthcare data using data-to-text technology

Authors: Steffen Pauws, Albert Gatt, Emiel Krahmer, Ehud Reiter

Abstract: Healthcare organizations are in a continuous effort to improve health outcomes, reduce costs and enhance patient experience of care. Data is essential to measure and help achieving these improvements in healthcare delivery. Consequently, a data influx from various clinical, financial and operational sources is now overtaking healthcare organizations and their patients. The effective use of this da… ▽ More Healthcare organizations are in a continuous effort to improve health outcomes, reduce costs and enhance patient experience of care. Data is essential to measure and help achieving these improvements in healthcare delivery. Consequently, a data influx from various clinical, financial and operational sources is now overtaking healthcare organizations and their patients. The effective use of this data, however, is a major challenge. Clearly, text is an important medium to make data accessible. Financial reports are produced to assess healthcare organizations on some key performance indicators to steer their healthcare delivery. Similarly, at a clinical level, data on patient status is conveyed by means of textual descriptions to facilitate patient review, shift handover and care transitions. Likewise, patients are informed about data on their health status and treatments via text, in the form of reports or via ehealth platforms by their doctors. Unfortunately, such text is the outcome of a highly labour-intensive process if it is done by healthcare professionals. It is also prone to incompleteness, subjectivity and hard to scale up to different domains, wider audiences and varying communication purposes. Data-to-text is a recent breakthrough technology in artificial intelligence which automatically generates natural language in the form of text or speech from data. This chapter provides a survey of data-to-text technology, with a focus on how it can be deployed in a healthcare setting. It will (1) give an up-to-date synthesis of data-to-text approaches, (2) give a categorized overview of use cases in healthcare, (3) seek to make a strong case for evaluating and implementing data-to-text in a healthcare setting, and (4) highlight recent research challenges. △ Less

Submitted 10 August, 2018; originally announced August 2018.

Comments: 27 pages, 2 figures, book chapter

arXiv:1703.10429 [pdf, other]

An Empirical Approach for Modeling Fuzzy Geographical Descriptors

Authors: Alejandro Ramos-Soto, Jose M. Alonso, Ehud Reiter, Kees van Deemter, Albert Gatt

Abstract: We present a novel heuristic approach that defines fuzzy geographical descriptors using data gathered from a survey with human subjects. The participants were asked to provide graphical interpretations of the descriptors `north' and `south' for the Galician region (Spain). Based on these interpretations, our approach builds fuzzy descriptors that are able to compute membership degrees for geograph… ▽ More We present a novel heuristic approach that defines fuzzy geographical descriptors using data gathered from a survey with human subjects. The participants were asked to provide graphical interpretations of the descriptors `north' and `south' for the Galician region (Spain). Based on these interpretations, our approach builds fuzzy descriptors that are able to compute membership degrees for geographical locations. We evaluated our approach in terms of efficiency and precision. The fuzzy descriptors are meant to be used as the cornerstones of a geographical referring expression generation algorithm that is able to linguistically characterize geographical locations and regions. This work is also part of a general research effort that intends to establish a methodology which reunites the empirical studies traditionally practiced in data-to-text and the use of fuzzy sets to model imprecision and vagueness in words and expressions for text generation purposes. △ Less

Submitted 30 March, 2017; originally announced March 2017.

Comments: Conference paper: Accepted for FUZZIEEE-2017. One column version for arXiv (8 pages)

arXiv:1106.5264 [pdf, ps]

doi 10.1613/jair.1176

Acquiring Correct Knowledge for Natural Language Generation

Authors: E. Reiter, R. Robertson, S. G. Sripada

Abstract: Natural language generation (NLG) systems are computer software systems that produce texts in English and other human languages, often from non-linguistic input data. NLG systems, like most AI systems, need substantial amounts of knowledge. However, our experience in two NLG projects suggests that it is difficult to acquire correct knowledge for NLG systems; indeed, every knowledge acquisition (KA… ▽ More Natural language generation (NLG) systems are computer software systems that produce texts in English and other human languages, often from non-linguistic input data. NLG systems, like most AI systems, need substantial amounts of knowledge. However, our experience in two NLG projects suggests that it is difficult to acquire correct knowledge for NLG systems; indeed, every knowledge acquisition (KA) technique we tried had significant problems. In general terms, these problems were due to the complexity, novelty, and poorly understood nature of the tasks our systems attempted, and were worsened by the fact that people write so differently. This meant in particular that corpus-based KA approaches suffered because it was impossible to assemble a sizable corpus of high-quality consistent manually written texts in our domains; and structured expert-oriented KA techniques suffered because experts disagreed and because we could not get enough information about special and unusual cases to build robust systems. We believe that such problems are likely to affect many other NLG systems as well. In the long term, we hope that new KA techniques may emerge to help NLG system builders. In the shorter term, we believe that understanding how individual KA techniques can fail, and using a mixture of different KA techniques with different strengths and weaknesses, can help developers acquire NLG knowledge that is mostly correct. △ Less

Submitted 26 June, 2011; originally announced June 2011.

Journal ref: Journal Of Artificial Intelligence Research, Volume 18, pages 491-516, 2003

arXiv:cmp-lg/9707007 [pdf, ps]

Tailored Patient Information: Some Issues and Questions

Authors: Ehud Reiter, Liesl Osman

Abstract: Tailored patient information (TPI) systems are computer programs which produce personalised heath-information material for patients. TPI systems are of growing interest to the natural-language generation (NLG) community; many TPI systems have also been developed in the medical community, usually with mail-merge technology. No matter what technology is used, experience shows that it is not easy t… ▽ More Tailored patient information (TPI) systems are computer programs which produce personalised heath-information material for patients. TPI systems are of growing interest to the natural-language generation (NLG) community; many TPI systems have also been developed in the medical community, usually with mail-merge technology. No matter what technology is used, experience shows that it is not easy to field a TPI system, even if it is shown to be effective in clinical trials. In this paper we discuss some of the difficulties in fielding TPI systems. This is based on our experiences with 2 TPI systems, one for generating asthma-information booklets and one for generating smoking-cessation letters. △ Less

Submitted 18 July, 1997; originally announced July 1997.

Comments: This is a paper about technology-transfer. It does not have much technical content

Journal ref: Proceedings of the 1997 ACL Workshop on From Research to Commercial Applications: Making NLP Work in Practice

arXiv:cmp-lg/9702013 [pdf, ps]

Knowledge Acquisition for Content Selection

Authors: Ehud Reiter, Alison Cawsey, Liesl Osman, Yvonne Roff

Abstract: An important part of building a natural-language generation (NLG) system is knowledge acquisition, that is deciding on the specific schemas, plans, grammar rules, and so forth that should be used in the NLG system. We discuss some experiments we have performed with KA for content-selection rules, in the context of building an NLG system which generates health-related material. These experiments… ▽ More An important part of building a natural-language generation (NLG) system is knowledge acquisition, that is deciding on the specific schemas, plans, grammar rules, and so forth that should be used in the NLG system. We discuss some experiments we have performed with KA for content-selection rules, in the context of building an NLG system which generates health-related material. These experiments suggest that it is useful to supplement corpus analysis with KA techniques developed for building expert systems, such as structured group discussions and think-aloud protocols. They also raise the point that KA issues may influence architectural design issues, in particular the decision on whether a planning approach is used for content selection. We suspect that in some cases, KA may be easier if other constructive expert-system techniques (such as production rules, or case-based reasoning) are used to determine the content of a generated text. △ Less

Submitted 24 February, 1997; originally announced February 1997.

Comments: To appear in the 1997 European NLG workshop. 10 pages, postscript

arXiv:cmp-lg/9605002 [pdf, ps, other]

Building Natural-Language Generation Systems

Authors: Ehud Reiter

Abstract: This is a very short paper that briefly discusses some of the tasks that NLG systems perform. It is of no research interest, but I have occasionally found it useful as a way of introducing NLG to potential project collaborators who know nothing about the field. This is a very short paper that briefly discusses some of the tasks that NLG systems perform. It is of no research interest, but I have occasionally found it useful as a way of introducing NLG to potential project collaborators who know nothing about the field. △ Less

Submitted 2 May, 1996; originally announced May 1996.

Comments: Standard LaTeX. A 3-page paper of no research interest, but occasionally useful in helping to explain applied NLG to people with little knowledge of the field. Presented at the AIPE workshop in Glasgow

arXiv:cmp-lg/9604006 [pdf, ps, other]

The Role of the Gricean Maxims in the Generation of Referring Expressions

Authors: Robert Dale, Ehud Reiter

Abstract: Grice's maxims of conversation [Grice 1975] are framed as directives to be followed by a speaker of the language. This paper argues that, when considered from the point of view of natural language generation, such a characterisation is rather misleading, and that the desired behaviour falls out quite naturally if we view language generation as a goal-oriented process. We argue this position with… ▽ More Grice's maxims of conversation [Grice 1975] are framed as directives to be followed by a speaker of the language. This paper argues that, when considered from the point of view of natural language generation, such a characterisation is rather misleading, and that the desired behaviour falls out quite naturally if we view language generation as a goal-oriented process. We argue this position with particular regard to the generation of referring expressions. △ Less

Submitted 18 April, 1996; originally announced April 1996.

Comments: LaTeX file, needs aaai.sty (available from the cmp-lg macro library). This paper was presented at the 1996 AAAI Spring Symposium on Computational Models of Conversational Implicature

arXiv:cmp-lg/9504020 [pdf, ps]

Computational Interpretations of the Gricean Maxims in the Generation of Referring Expressions

Authors: Robert Dale, Ehud Reiter

Abstract: We examine the problem of generating definite noun phrases that are appropriate referring expressions; i.e, noun phrases that (1) successfully identify the intended referent to the hearer whilst (2) not conveying to her any false conversational implicatures (Grice, 1975). We review several possible computational interpretations of the conversational implicature maxims, with different computation… ▽ More We examine the problem of generating definite noun phrases that are appropriate referring expressions; i.e, noun phrases that (1) successfully identify the intended referent to the hearer whilst (2) not conveying to her any false conversational implicatures (Grice, 1975). We review several possible computational interpretations of the conversational implicature maxims, with different computational costs, and argue that the simplest may be the best, because it seems to be closest to what human speakers do. We describe our recommended algorithm in detail, along with a specification of the resources a host system must provide in order to make use of the algorithm, and an implementation used in the natural language generation component of the IDAS system. This paper will appear in the the April--June 1995 issue of Cognitive Science, and is made available on cmp-lg with the permission of Ablex, the publishers of that journal. △ Less

Submitted 26 April, 1995; originally announced April 1995.

Comments: 29 pages, compressed PS file

arXiv:cmp-lg/9504013 [pdf, ps, other]

NLG vs. Templates

Authors: Ehud Reiter

Abstract: One of the most important questions in applied NLG is what benefits (or `value-added', in business-speak) NLG technology offers over template-based approaches. Despite the importance of this question to the applied NLG community, however, it has not been discussed much in the research NLG community, which I think is a pity. In this paper, I try to summarize the issues involved and recap current… ▽ More One of the most important questions in applied NLG is what benefits (or `value-added', in business-speak) NLG technology offers over template-based approaches. Despite the importance of this question to the applied NLG community, however, it has not been discussed much in the research NLG community, which I think is a pity. In this paper, I try to summarize the issues involved and recap current thinking on this topic. My goal is not to answer this question (I don't think we know enough to be able to do so), but rather to increase the visibility of this issue in the research community, in the hope of getting some input and ideas on this very important question. I conclude with a list of specific research areas I would like to see more work in, because I think they would increase the `value-added' of NLG over templates. △ Less

Submitted 23 April, 1995; originally announced April 1995.

Comments: Uuencoded compressed tar file, containing LaTeX source and a style file. This paper will appear in the 1995 European NL Generation Workshop

arXiv:cmp-lg/9411032 [pdf, ps, other]

Has a Consensus NL Generation Architecture Appeared, and is it Psycholinguistically Plausible?

Authors: Ehud Reiter

Abstract: I survey some recent applications-oriented NL generation systems, and claim that despite very different theoretical backgrounds, these systems have a remarkably similar architecture in terms of the modules they divide the generation process into, the computations these modules perform, and the way the modules interact with each other. I also compare this `consensus architecture' among applied NL… ▽ More I survey some recent applications-oriented NL generation systems, and claim that despite very different theoretical backgrounds, these systems have a remarkably similar architecture in terms of the modules they divide the generation process into, the computations these modules perform, and the way the modules interact with each other. I also compare this `consensus architecture' among applied NLG systems with psycholinguistic knowledge about how humans speak, and argue that at least some aspects of the consensus architecture seem to be in agreement with what is known about human language production, despite the fact that psycholinguistic plausibility was not in general a goal of the developers of the surveyed systems. △ Less

Submitted 30 November, 1994; originally announced November 1994.

Comments: uuencoded compressed tar file, containing LaTeX source and two style files. This paper appeared in the 1994 International NLG workshop

arXiv:cmp-lg/9411031 [pdf, ps, other]

Automatic Generation of Technical Documentation

Authors: Ehud Reiter, Chris Mellish, John Levine

Abstract: Natural-language generation (NLG) techniques can be used to automatically produce technical documentation from a domain knowledge base and linguistic and contextual models. We discuss this application of NLG technology from both a technical and a usefulness (costs and benefits) perspective. This discussion is based largely on our experiences with the IDAS documentation-generation project, and th… ▽ More Natural-language generation (NLG) techniques can be used to automatically produce technical documentation from a domain knowledge base and linguistic and contextual models. We discuss this application of NLG technology from both a technical and a usefulness (costs and benefits) perspective. This discussion is based largely on our experiences with the IDAS documentation-generation project, and the reactions various interested people from industry have had to IDAS. We hope that this summary of our experiences with IDAS and the lessons we have learned from it will be beneficial for other researchers who wish to build technical-documentation generation systems. △ Less

Submitted 30 November, 1994; v1 submitted 29 November, 1994; originally announced November 1994.

Comments: uuencoded compressed tar file, with LaTeX source and ps figures. Will appear in APPLIED ARTIFICIAL INTELLIGENCE journal, volume 9 (1995)

Showing 1–37 of 37 results for author: Reiter, E