Exploring the Impact of ChatGPT on Wikipedia Engagement

Authors: Neal Reeves, Wenjie Yin, Elena Simperl

Abstract: Wikipedia is one of the most popular websites in the world, serving as a major source of information and learning resource for millions of users worldwide. While motivations for its usage vary, prior research suggests shallow information gathering -- looking up facts and information or answering questions -- dominates over more in-depth usage. On the 22nd of November 2022, ChatGPT was released to the public and has quickly become a popular source of information, serving as an effective question-answering and knowledge gathering resource. Early indications have suggested that it may be drawing users away from traditional question answering services such as Stack Overflow, raising the question of how it may have impacted Wikipedia. In this paper, we explore Wikipedia user metrics across four areas: page views, unique visitor numbers, edit counts and editor numbers within twelve language instances of Wikipedia. We perform pairwise comparisons of these metrics before and after the release of ChatGPT and implement a panel regression model to observe and quantify longer-term trends. We find no evidence of a fall in engagement across any of the four metrics, instead observing that page views and visitor numbers increased in the period following ChatGPT's launch. However, we observe a lower increase in languages where ChatGPT was available than in languages where it was not, which may suggest ChatGPT's availability limited growth in those languages. Our results contribute to the understanding of how emerging generative AI tools are disrupting the Web ecosystem.

Submitted 29 May, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

Comments: 12 pages, 4 figures, submitted to ACM Collective Intelligence

arXiv:2403.19546 [pdf, other]

doi 10.1145/3650203.3663326

Croissant: A Metadata Format for ML-Ready Datasets

Authors: Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, Carole-Jean Wu

Abstract: Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.

Submitted 30 May, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: Published in Proceedings of ACM SIGMOD/PODS'24 Data Management for End-to-End Machine Learning (DEEM) Workshop https://dl.acm.org/doi/10.1145/3650203.3663326

arXiv:2403.15861 [pdf, other]

User Experience in Dataset Search Platform Interfaces

Authors: Yihang Zhao, Albert Meroño-Peñuela, Elena Simperl

Abstract: This research investigates User Experience (UX) issues in dataset search platform interfaces, targeting Google Dataset Search and data.europa.eu. It focuses on 6 areas within UX: Initial Interaction, Search Process, Dataset Exploration, Filtering and Sorting, Dataset Actions, and Assistance and Feedback. The evaluation method combines 'The Pandemic Puzzle' user task, think-aloud methods, and demographic and post-task questionnaires. 29 strengths and 63 weaknesses were collected from 19 participants working in technology firms or academia. While certain insights are specific to particular platforms, most are derived from features commonly observed in dataset search platforms across a variety of fields, implying that our findings are broadly applicable. Observations from commonly found features in dataset search platforms across various fields have led to the development of 10 new design prototypes. Unlike literature retrieval, dataset retrieval involves a significant focus on metadata accessibility and quality, each element of which can impact decision-making. To address issues like reading fatigue from metadata presentation, inefficient methods for results searching, filtering, and selection, along with other unresolved user-centric issues on current platforms, these prototypes concentrate on enhancing metadata-related features. They include a redesigned homepage, an improved search bar, better sorting options, an enhanced search result display, a metadata comparison tool, and a navigation guide. Our aim is to improve usability for a wide range of users, including both developers and researchers.

Submitted 23 March, 2024; originally announced March 2024.

arXiv:2402.01495 [pdf, other]

A Comparative Analysis of Conversational Large Language Models in Knowledge-Based Text Generation

Authors: Phillip Schneider, Manuel Klettner, Elena Simperl, Florian Matthes

Abstract: Generating natural language text from graph-structured data is essential for conversational information seeking. Semantic triples derived from knowledge graphs can serve as a valuable source for grounding responses from conversational agents by providing a factual basis for the information they communicate. This is especially relevant in the context of large language models, which offer great potential for conversational interaction but are prone to hallucinating, omitting, or producing conflicting information. In this study, we conduct an empirical analysis of conversational large language models in generating natural language text from semantic triples. We compare four large language models of varying sizes with different prompting techniques. Through a series of benchmark experiments on the WebNLG dataset, we analyze the models' performance and identify the most common issues in the generated predictions. Our findings show that the capabilities of large language models in triple verbalization can be significantly improved through few-shot prompting, post-processing, and efficient fine-tuning techniques, particularly for smaller models that exhibit lower zero-shot performance.

Submitted 2 February, 2024; originally announced February 2024.

Comments: Accepted to EACL 2024

arXiv:2401.01711 [pdf, ps, other]

Evaluating Large Language Models in Semantic Parsing for Conversational Question Answering over Knowledge Graphs

Authors: Phillip Schneider, Manuel Klettner, Kristiina Jokinen, Elena Simperl, Florian Matthes

Abstract: Conversational question answering systems often rely on semantic parsing to enable interactive information retrieval, which involves the generation of structured database queries from a natural language input. For information-seeking conversations about facts stored within a knowledge graph, dialogue utterances are transformed into graph queries in a process that is called knowledge-based conversational question answering. This paper evaluates the performance of large language models that have not been explicitly pre-trained on this task. Through a series of experiments on an extensive benchmark dataset, we compare models of varying sizes with different prompting techniques and identify common issue types in the generated output. Our results demonstrate that large language models are capable of generating graph queries from dialogues, with significant improvements achievable through few-shot prompting and fine-tuning techniques, especially for smaller models that exhibit lower zero-shot performance.

Submitted 3 January, 2024; originally announced January 2024.

Comments: Accepted to ICAART 2024

arXiv:2312.09947 [pdf, other]

Prompting Datasets: Data Discovery with Conversational Agents

Authors: Johanna Walker, Elisavet Koutsiana, Joe Massey, Gefion Thuermer, Elena Simperl

Abstract: Can large language models assist in data discovery? Data discovery predominantly happens via search on a data portal or the web, followed by assessment of the dataset to ensure it is fit for the intended purpose. The ability of conversational generative AI (CGAI) to support recommendations with reasoning implies it can suggest datasets to users, explain why it has done so, and provide information akin to documentation regarding the dataset in order to support a use decision. We hold 3 workshops with data users and find that, despite limitations around web capabilities, CGAIs are able to suggest relevant datasets and provide many of the required sensemaking activities, as well as support dataset analysis and manipulation. However, CGAIs may also suggest fictional datasets, and perform inaccurate analysis. We identify emerging practices in data discovery and present a model of these to inform future research directions and data prompt design.

Submitted 15 December, 2023; originally announced December 2023.

Comments: 27 pages, 9 figures

arXiv:2311.07453 [pdf, other]

ChartCheck: Explainable Fact-Checking over Real-World Chart Images

Authors: Mubashara Akhtar, Nikesh Subedi, Vivek Gupta, Sahar Tahmasebi, Oana Cocarascu, Elena Simperl

Abstract: Whilst fact verification has attracted substantial interest in the natural language processing community, verifying misinforming statements against data visualizations such as charts has so far been overlooked. Charts are commonly used in the real-world to summarize and communicate key information, but they can also be easily misused to spread misinformation and promote certain agendas. In this paper, we introduce ChartCheck, a novel, large-scale dataset for explainable fact-checking against real-world charts, consisting of 1.7k charts and 10.5k human-written claims and explanations. We systematically evaluate ChartCheck using vision-language and chart-to-table models, and propose a baseline to the community. Finally, we study chart reasoning types and visual attributes that pose a challenge to these models.

Submitted 16 February, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.02216 [pdf, other]

Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data

Authors: Mubashara Akhtar, Abhilash Shankarampeta, Vivek Gupta, Arpit Patil, Oana Cocarascu, Elena Simperl

Abstract: Numbers are crucial for various real-world domains such as finance, economics, and science. Thus, understanding and reasoning with numbers are essential skills for language models to solve different tasks. While different numerical benchmarks have been introduced in recent years, they are limited to specific numerical aspects mostly. In this paper, we propose a hierarchical taxonomy for numerical reasoning skills with more than ten reasoning types across four levels: representation, number sense, manipulation, and complex reasoning. We conduct a comprehensive evaluation of state-of-the-art models to identify reasoning challenges specific to them. Henceforth, we develop a diverse set of numerical probes employing a semi-automated approach. We focus on the tabular Natural Language Inference (TNLI) task as a case study and measure models' performance shifts. Our results show that no model consistently excels across all numerical reasoning types. Among the probed models, FlanT5 (few-/zero-shot) and GPT-3.5 (few-shot) demonstrate strong overall numerical reasoning skills compared to other models. Label-flipping probes indicate that models often exploit dataset artifacts to predict the correct labels.

Submitted 3 November, 2023; originally announced November 2023.

Comments: Accepted at EMNLP 2023 (Findings)

arXiv:2309.08491 [pdf, other]

Using Large Language Models for Knowledge Engineering (LLMKE): A Case Study on Wikidata

Authors: Bohui Zhang, Ioannis Reklos, Nitisha Jain, Albert Meroño Peñuela, Elena Simperl

Abstract: In this work, we explore the use of Large Language Models (LLMs) for knowledge engineering tasks in the context of the ISWC 2023 LM-KBC Challenge. For this task, given subject and relation pairs sourced from Wikidata, we utilize pre-trained LLMs to produce the relevant objects in string format and link them to their respective Wikidata QIDs. We developed a pipeline using LLMs for Knowledge Engineering (LLMKE), combining knowledge probing and Wikidata entity mapping. The method achieved a macro-averaged F1-score of 0.701 across the properties, with the scores varying from 1.00 to 0.328. These results demonstrate that the knowledge of LLMs varies significantly depending on the domain and that further experimentation is required to determine the circumstances under which LLMs can be used for automatic Knowledge Base (e.g., Wikidata) completion and correction. The investigation of the results also suggests the promising contribution of LLMs in collaborative knowledge engineering. LLMKE won Track 2 of the challenge. The implementation is available at https://github.com/bohuizhang/LLMKE.

Submitted 15 September, 2023; originally announced September 2023.

Comments: Knowledge Base Construction from Pre-trained Language Models (LM-KBC) Challenge @ ISWC 2023

arXiv:2306.11766 [pdf, other]

Agreeing and Disagreeing in Collaborative Knowledge Graph Construction: An Analysis of Wikidata

Authors: Elisavet Koutsiana, Tushita Yadav, Nitisha Jain, Albert Meroño-Peñuela, Elena Simperl

Abstract: In this work, we study disagreement in discussions around Wikidata, an online knowledge community that builds the data backend of Wikipedia. Discussions are important in collaborative work as they can increase contributor performance and encourage the emergence of shared norms and practices. While disagreements can play a productive role in discussions, they can also lead to conflicts and controversies, which impact contributor well-being and their motivation to engage. We want to understand if and when such phenomena arise in Wikidata, using a mix of quantitative and qualitative analyses to identify the types of topics people disagree about, the most common patterns of interaction, and roles people play when arguing for or against an issue. We find that decisions to create Wikidata properties are much faster than those to delete properties and that more than half of controversial discussions do not lead to consensus. Our analysis suggests that Wikidata is an inclusive community, considering different opinions when making decisions, and that conflict and vandalism are rare in discussions. At the same time, while one-fourth of the editors participating in controversial discussions contribute with legit and insightful opinions about Wikidata's emerging issues, they do not remain engaged in the discussions. We hope our findings will help Wikidata support community decision making, and improve discussion tools and practices.

Submitted 20 June, 2023; originally announced June 2023.

arXiv:2305.13507 [pdf, other]

Multimodal Automated Fact-Checking: A Survey

Authors: Mubashara Akhtar, Michael Schlichtkrull, Zhijiang Guo, Oana Cocarascu, Elena Simperl, Andreas Vlachos

Abstract: Misinformation is often conveyed in multiple modalities, e.g. a miscaptioned image. Multimodal misinformation is perceived as more credible by humans, and spreads faster than its text-only counterparts. While an increasing body of research investigates automated fact-checking (AFC), previous surveys mostly focus on text. In this survey, we conceptualise a framework for AFC including subtasks unique to multimodal misinformation. Furthermore, we discuss related terms used in different communities and map them to our framework. We focus on four modalities prevalent in real-world fact-checking: text, image, audio, and video. We survey benchmarks and models, and discuss limitations and promising directions for future research.

Submitted 25 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP): Findings

arXiv:2301.11843 [pdf, other]

Reading and Reasoning over Chart Images for Evidence-based Automated Fact-Checking

Authors: Mubashara Akhtar, Oana Cocarascu, Elena Simperl

Abstract: Evidence data for automated fact-checking (AFC) can be in multiple modalities such as text, tables, images, audio, or video. While there is increasing interest in using images for AFC, previous works mostly focus on detecting manipulated or fake images. We propose a novel task, chart-based fact-checking, and introduce ChartBERT as the first model for AFC against chart evidence. ChartBERT leverages textual, structural and visual information of charts to determine the veracity of textual claims. For evaluation, we create ChartFC, a new dataset of 15,886 charts. We systematically evaluate 75 different vision-language (VL) baselines and show that ChartBERT outperforms VL models, achieving 63.8% accuracy. Our results suggest that the task is complex yet feasible, with many challenges ahead.

Submitted 27 January, 2023; originally announced January 2023.

Comments: Accepted to EACL 2023 (Findings)

arXiv:2212.01818 [pdf]

Exploring and Eliciting Needs and Preferences from Editors for Wikidata Recommendations

Authors: Kholoud Alghamdi, Miaojing Shi, Elena Simperl

Abstract: Wikidata is an open knowledge graph created, managed, and maintained collaboratively by a global community of volunteers. As it continues to grow, it faces substantial editor engagement challenges, including acquiring new editors to tackle an increasing workload and retaining existing editors. Experiences from other online communities and peer-production systems, including Wikipedia, suggest that recommending tasks to editors could help with both. Our aim with this paper is to elicit the user requirements for a Wikidata recommendations system. We conduct a mixed-methods study with a thematic analysis of in-depth interviews with 31 Wikidata editors and three Wikimedia managers, complemented by a quantitative analysis of edit records of 3,740 Wikidata editors. The insights gained from the study help us outline design requirements for the Wikidata recommender system. We conclude with a discussion of the implications of this work and directions for future work.

Submitted 4 December, 2022; originally announced December 2022.

arXiv:2210.14846 [pdf, other]

ProVe: A Pipeline for Automated Provenance Verification of Knowledge Graphs against Textual Sources

Authors: Gabriel Amaral, Odinaldo Rodrigues, Elena Simperl

Abstract: Knowledge Graphs are repositories of information that gather data from a multitude of domains and sources in the form of semantic triples, serving as a source of structured data for various crucial applications in the modern web landscape, from Wikipedia infoboxes to search engines. Such graphs mainly serve as secondary sources of information and depend on well-documented and verifiable provenance to ensure their trustworthiness and usability. However, their ability to systematically assess and assure the quality of this provenance, most crucially whether it properly supports the graph's information, relies mainly on manual processes that do not scale with size. ProVe aims at remedying this, consisting of a pipelined approach that automatically verifies whether a Knowledge Graph triple is supported by text extracted from its documented provenance. ProVe is intended to assist information curators and consists of four main steps involving rule-based methods and machine learning models: text extraction, triple verbalisation, sentence selection, and claim verification. ProVe is evaluated on a Wikidata dataset, achieving promising results overall and excellent performance on the binary classification task of detecting support from provenance, with 87.5% accuracy and 82.9% F1-macro on text-rich sources. The evaluation data and scripts used in this paper are available on GitHub and Figshare.

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.00105 [pdf, other]

A Decade of Knowledge Graphs in Natural Language Processing: A Survey

Authors: Phillip Schneider, Tim Schopf, Juraj Vladika, Mikhail Galkin, Elena Simperl, Florian Matthes

Abstract: In pace with developments in the research field of artificial intelligence, knowledge graphs (KGs) have attracted a surge of interest from both academia and industry. As a representation of semantic relations between entities, KGs have proven to be particularly relevant for natural language processing (NLP), experiencing a rapid spread and wide adoption within recent years. Given the increasing amount of research work in this area, several KG-related approaches have been surveyed in the NLP research community. However, a comprehensive study that categorizes established topics and reviews the maturity of individual research streams remains absent to this day. Contributing to closing this gap, we systematically analyzed 507 papers from the literature on KGs in NLP. Our survey encompasses a multifaceted review of tasks, research types, and contributions. As a result, we present a structured overview of the research landscape, provide a taxonomy of tasks, summarize our findings, and highlight directions for future work.

Submitted 30 September, 2022; originally announced October 2022.

Comments: Accepted to AACL-IJCNLP 2022

arXiv:2206.08709 [pdf, other]

Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs

Authors: Gabriel Amaral, Mārcis Pinnis, Inguna Skadiņa, Odinaldo Rodrigues, Elena Simperl

Abstract: Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques coupled with a matching algorithm to align cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that mapping between Wikidata's main labels stands to be considerably improved (up to 20 points in F1-score) by any of the employed methods. We show how methods relying on sentence embeddings outperform all others, even across different scripts. We believe the application of such techniques to measure the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, to be an excellent asset to machine translation.

Submitted 17 June, 2022; originally announced June 2022.

arXiv:2205.02627 [pdf, other]

WDV: A Broad Data Verbalisation Dataset Built from Wikidata

Authors: Gabriel Amaral, Odinaldo Rodrigues, Elena Simperl

Abstract: Data verbalisation is a task of great importance in the current field of natural language processing, as there is great benefit in the transformation of our abundant structured and semi-structured data into human-readable formats. Verbalising Knowledge Graph (KG) data focuses on converting interconnected triple-based claims, formed of subject, predicate, and object, into text. Although KG verbalisation datasets exist for some KGs, there are still gaps in their fitness for use in many scenarios. This is especially true for Wikidata, where available datasets either loosely couple claim sets with textual information or heavily focus on predicates around biographies, cities, and countries. To address these gaps, we propose WDV, a large KG claim verbalisation dataset built from Wikidata, with a tight coupling between triples and text, covering a wide variety of entities and predicates. We also evaluate the quality of our verbalisations through a reusable workflow for measuring human-centred fluency and adequacy scores. Our data and code are openly available in the hopes of furthering research towards KG verbalisation.

Submitted 5 May, 2022; originally announced May 2022.

arXiv:2109.09405 [pdf, other]

Assessing the quality of sources in Wikidata across languages: a hybrid approach

Authors: Gabriel Amaral, Alessandro Piscopo, Lucie-Aimée Kaffee, Odinaldo Rodrigues, Elena Simperl

Abstract: Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on previous work of ours, we run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices, which could encourage the use of higher-quality references in a more immediate way. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.

Submitted 20 September, 2021; originally announced September 2021.

arXiv:2107.06423 [pdf, other]

Learning to Recommend Items to Wikidata Editors

Authors: Kholoud AlGhamdi, Miaojing Shi, Elena Simperl

Abstract: Wikidata is an open knowledge graph built by a global community of volunteers. As it advances in scale, it faces substantial challenges around editor engagement. These challenges concern both attracting new editors to keep up with the sheer amount of work and retaining existing editors. Experience from other online communities and peer-production systems, including Wikipedia, suggests that personalised recommendations could help, especially newcomers, who are sometimes unsure about how to contribute best to an ongoing effort. For this reason, we propose a recommender system, WikidataRec, for Wikidata items. The system uses a hybrid of content-based and collaborative filtering techniques to rank items for editors, relying on both item features and previous item-editor interactions. A neural network, named a neural mixture of representations, is designed to learn fine weights for the combination of item-based representations and optimize them with editor-based representation by item-editor interaction. To facilitate further research in this space, we also create two benchmark datasets, a general-purpose one with 220,000 editors responsible for 14 million interactions with 4 million items and a second one focusing on the contributions of more than 8,000 more active editors. We perform an offline evaluation of the system on both datasets with promising results. Our code and datasets are available at https://github.com/WikidataRec-developer/Wikidata_Recommender.

Submitted 30 July, 2021; v1 submitted 13 July, 2021; originally announced July 2021.

Comments: The paper is accepted to appear in ISWC 2021

arXiv:1911.09041 [pdf, other]

doi 10.1016/j.ijhcs.2020.102562

Talking datasets: Understanding data sensemaking behaviours

Authors: Laura Koesten, Kathleen Gregory, Paul Groth, Elena Simperl

Abstract: The sharing and reuse of data are seen as critical to solving the most complex problems of today. Despite this potential, relatively little is known about a key step in data reuse: people's behaviours involved in data-centric sensemaking. We aim to address this gap by presenting a mixed-methods study combining in-depth interviews, a think-aloud task and a screen recording analysis with 31 researchers as they summarised and interacted with both familiar and unfamiliar data. We use our findings to identify and detail common activity patterns and necessary data attributes across three clusters of sensemaking activities: inspecting data, engaging with content, and placing data within broader contexts. We conclude by proposing design recommendations for tools and documentation practices which can be used to facilitate sensemaking and subsequent data reuse.

Submitted 18 July, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: 26 pages, 7 figures, 6 tables

arXiv:1901.09264 [pdf, other]

On the mapping of Points of Interest through StreetView imagery and paid crowdsourcing

Authors: Eddy Maddalena, Luis-Daniel Ibáñez, Elena Simperl

Abstract: The use of volunteers has emerged as a low-cost alternative to generate accurate geographical information, an approach known as Volunteered Geographic Information (VGI). However, VGI is limited by the number and availability of volunteers in the area to be mapped, hindering scalability for large areas and making it difficult to map within a time-frame. Fortunately, the availability of street-view imagery enables the virtual exploration of urban environments, making possible the recruitment of contributors not necessarily located in the area to be mapped. In this paper, we describe the design, implementation, and evaluation of the Virtual City Explorer (VCE), a system to collect the coordinates of Points of Interest (PoIs) within a bounded area on top of a street view service with the use of paid crowdworkers. Our evaluation suggests that paid crowdworkers are effective at finding PoIs, and cover almost all of the area. With respect to completeness, our approach does not find all PoIs found by experts or VGI communities, but is able to find PoIs that were not found by them, suggesting complementarity. We also studied the impact of making PoIs already discovered by a certain number of workers taboo for incoming workers, finding that it encourages more exploration from workers, increases the number of detected PoIs, and reduces costs.

Submitted 26 January, 2019; originally announced January 2019.

Comments: 25 pages

arXiv:1901.05670 [pdf, other]

Beyond monetary incentives: experiments in paid microtask contests modelled as continuous-time markov chains

Authors: Oluwaseyi Feyisetan, Elena Simperl

Abstract: In this paper, we aim to gain a better understanding of how paid microtask crowdsourcing could leverage its appeal and scaling power by using contests to boost crowd performance and engagement. We introduce our microtask-based annotation platform Wordsmith, which features incentives such as points, leaderboards and badges on top of financial remuneration. Our analysis focuses on a particular type of incentive, contests, as a means to apply crowdsourcing in near-real-time scenarios, in which requesters need labels quickly. We model crowdsourcing contests as a continuous-time Markov chain with the objective of maximising the output of the crowd workers, while varying a parameter which determines whether a worker is eligible for a reward based on their present rank on the leaderboard. We conduct empirical experiments in which crowd workers recruited from CrowdFlower carry out annotation microtasks on Wordsmith - in our case, to identify named entities in a stream of Twitter posts. In the experimental conditions, we test different reward spreads and record the total number of annotations received. We compare the results against a control condition in which the same annotation task was completed on CrowdFlower without a time or contest constraint. The experiments show that rewarding only the best contributors in a live contest could be a viable model to deliver results faster, though quality might suffer for particular types of annotation tasks. Increasing the reward spread leads to more work being completed, especially by the top contestants. Overall, the experiments shed light on possible design improvements of paid microtask platforms to boost task performance and speed, and make the overall experience more fair and interesting for crowd workers.

Submitted 17 January, 2019; originally announced January 2019.

arXiv:1901.00735 [pdf, other]

doi 10.1007/s00778-019-00564-x

Dataset search: a survey

Authors: Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez, Emilia Kacprzak, Paul Groth

Abstract: Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts to data marketplaces, open data portals and data communities. Google recently beta released a search service for datasets, which allows users to discover data stored in various online repositories via keyword queries. These developments foreshadow an emerging research field around dataset search or retrieval that broadly encompasses frameworks, methods and tools that help match a user data need against a collection of datasets. Here, we survey the state of the art of research and commercial systems in dataset retrieval. We identify what makes dataset search a research field in its own right, with unique challenges and methods and highlight open problems. We look at approaches and implementations from related areas dataset search is drawing upon, including information retrieval, databases, entity-centric and tabular search in order to identify possible paths to resolve these open problems as well as immediate next steps that will take the field forward.

Submitted 3 January, 2019; originally announced January 2019.

Comments: 20 pages, 153 references

arXiv:1810.12423 [pdf, other]

Everything you always wanted to know about a dataset: studies in data summarisation

Authors: Laura Koesten, Elena Simperl, Emilia Kacprzak, Tom Blount, Jeni Tennison

Abstract: Summarising data as text helps people make sense of it. It also improves data discovery, as search algorithms can match this text against keyword queries. In this paper, we explore the characteristics of text summaries of data in order to understand what meaningful summaries look like. We present two complementary studies: a data-search diary study with 69 students, which offers insight into the information needs of people searching for data; and a summarisation study, with a lab and a crowdsourcing component with 80 data-literate participants overall, which produced summaries for 25 datasets. In each study we carried out a qualitative analysis to identify key themes and commonly mentioned dataset attributes, which people consider when searching and making sense of data. The results helped us design a template to create more meaningful textual representations of data, alongside guidelines for improving the data-search experience overall.

Submitted 23 October, 2018; originally announced October 2018.

arXiv:1805.11883 [pdf, ps, other]

DATA:SEARCH'18 -- Searching Data on the Web

Authors: Paul Groth, Laura Koesten, Philipp Mayr, Maarten de Rijke, Elena Simperl

Abstract: This half-day workshop explores challenges in data search, with a particular focus on data on the web. We want to stimulate an interdisciplinary discussion around how to improve the description, discovery, ranking and presentation of structured and semi-structured data, across data formats and domain applications. We welcome contributions describing algorithms and systems, as well as frameworks and studies in human data interaction. The workshop aims to bring together communities interested in making the web of data more discoverable, easier to search and more user friendly.

Submitted 30 May, 2018; originally announced May 2018.

arXiv:1803.07116 [pdf, other]

Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata

Authors: Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl

Abstract: While Wikipedia exists in 287 languages, its content is unevenly distributed among them. In this work, we investigate the generation of open domain Wikipedia summaries in underserved languages using structured data from Wikidata. To this end, we propose a neural network architecture equipped with copy actions that learns to generate single-sentence and comprehensible textual summaries from Wikidata triples. We demonstrate the effectiveness of the proposed approach by evaluating it against a set of baselines on two languages of different natures: Arabic, a morphologically rich language with a larger vocabulary than English, and Esperanto, a constructed language known for its easy acquisition.

Submitted 29 April, 2018; v1 submitted 19 March, 2018; originally announced March 2018.

Comments: NAACL HLT 2018

arXiv:1711.00155 [pdf, other]

doi 10.1016/j.websem.2018.07.002

Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples

Authors: Pavlos Vougiouklis, Hady Elsahar, Lucie-Aimée Kaffee, Christoph Gravier, Frederique Laforest, Jonathon Hare, Elena Simperl

Abstract: Most people do not interact with Semantic Web data directly. Unless they have the expertise to understand the underlying technology, they need textual or visual interfaces to help them make sense of it. We explore the problem of generating natural language summaries for Semantic Web data. This is non-trivial, especially in an open-domain context. To address this problem, we explore the use of neural networks. Our system encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on the encoded vector. We train and evaluate our models on two corpora of loosely aligned Wikipedia snippets and DBpedia and Wikidata triples with promising results.

Submitted 31 October, 2017; originally announced November 2017.

arXiv:1710.04203 [pdf, other]

Crowdsourcing for Beyond Polarity Sentiment Analysis A Pure Emotion Lexicon

Authors: Giannis Haralabopoulos, Elena Simperl

Abstract: Sentiment analysis aims to uncover emotions conveyed through information. In its simplest form, it is performed on a polarity basis, where the goal is to classify information with positive or negative emotion. Recent research has explored more nuanced ways to capture emotions that go beyond polarity. For these methods to work, they require a critical resource: a lexicon that is appropriate for the task at hand, in terms of the range and diversity of emotions it captures. In the past, sentiment analysis lexicons have been created by experts, such as linguists and behavioural scientists, with strict rules. Lexicon evaluation was also performed by experts or gold standards. In our paper, we propose a crowdsourcing method for lexicon acquisition, which is scalable, cost-effective, and doesn't require experts or gold standards. We also compare crowd and expert evaluations of the lexicon, to assess the overall lexicon quality, and the evaluation capabilities of the crowd.

Submitted 4 October, 2017; originally announced October 2017.

Comments: Keywords: Beyond Polarity, Pure Sentiment, Crowdsourcing, Sentiment Analysis, Lexicon Acquisition, Reddit, Twitter, Brexit [19 pages, 6 figures, 4 tables]

arXiv:1503.02911 [pdf, other]

RDF-Hunter: Automatically Crowdsourcing the Execution of Queries Against RDF Data Sets

Authors: Maribel Acosta, Elena Simperl, Fabian Flöck, Maria-Esther Vidal, Rudi Studer

Abstract: In recent years, a large number of RDF data sets have become available on the Web. However, due to the semi-structured nature of RDF data, missing values affect answer completeness of queries that are posed against this data. To overcome this limitation, we propose RDF-Hunter, a novel hybrid query processing approach that brings together machine and human computation to execute queries against RDF data. We develop a novel quality model and query engine in order to enable RDF-Hunter to decide on the fly which parts of a query should be executed through conventional technology or crowd computing. To evaluate RDF-Hunter, we created a collection of 50 SPARQL queries against the DBpedia data set, executed them using our hybrid query engine, and analyzed the accuracy of the outcomes obtained from the crowd. The experiments clearly show that the overall approach is feasible and produces query results that reliably and significantly enhance completeness of automatic query processing responses.

Submitted 10 March, 2015; originally announced March 2015.

arXiv:1406.7551 [pdf]

Collective Intelligence in Citizen Science -- A Study of Performers and Talkers

Authors: Ramine Tinati, Elena Simperl, Markus Luczak-Roesch, Max Van Kleek, Nigel Shadbolt

Abstract: The recent emergence of online citizen science is illustrative of an efficient and effective means to harness the crowd in order to achieve a range of scientific discoveries. Fundamentally, citizen science projects draw upon crowds of non-expert volunteers to complete short Tasks, which can vary in domain and complexity. However, unlike most human-computational systems, participants in these systems, the 'citizen scientists', are volunteers, whereby no incentives, financial or otherwise, are offered. Furthermore, encouraged by citizen science platforms such as Zooniverse, online communities have emerged, providing them with an environment to discuss, share ideas, and solve problems. In fact, it is the result of these forums that has enabled a number of scientific discoveries to be made. In this paper we explore the phenomenon of collective intelligence via the relationship between the activities of online citizen science communities and the discovery of scientific knowledge. We perform a cross-project analysis of ten Zooniverse citizen science projects and analyse the behaviour of users with regard to their Task completion activity and participation in discussion, and discover collective behaviour amongst highly active users. Whilst our findings have implications for future citizen science design, we also consider the wider implications for understanding collective intelligence research in general.

Submitted 29 June, 2014; originally announced June 2014.

Report number: ci-2014/28

Showing 1–30 of 30 results for author: Simperl, E