Explaining Graph Neural Networks for Node Similarity on Graphs

Authors: Daniel Daza, Cuong Xuan Chu, Trung-Kien Tran, Daria Stepanova, Michael Cochez, Paul Groth

Abstract: Similarity search is a fundamental task for exploiting information in various applications dealing with graph data, such as citation networks or knowledge graphs. While this task has been intensively approached from heuristics to graph embeddings and graph neural networks (GNNs), providing explanations for similarity has received less attention. In this work we are concerned with explainable similarity search over graphs, by investigating how GNN-based methods for computing node similarities can be augmented with explanations. Specifically, we evaluate the performance of two prominent approaches towards explanations in GNNs, based on the concepts of mutual information (MI), and gradient-based explanations (GB). We discuss their suitability and empirically validate the properties of their explanations over different popular graph benchmarks. We find that unlike MI explanations, gradient-based explanations have three desirable properties. First, they are actionable: selecting inputs depending on them results in predictable changes in similarity scores. Second, they are consistent: the effect of selecting certain inputs overlaps very little with the effect of discarding them. Third, they can be pruned significantly to obtain sparse explanations that retain the effect on similarity scores.

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.00002 [pdf, other]

Kermut: Composite kernel regression for protein variant effects

Authors: Peter Mørch Groth, Mads Herbert Kerrn, Lars Olsen, Jesper Salomon, Wouter Boomsma

Abstract: Reliable prediction of protein variant effects is crucial for both protein optimization and for advancing biological understanding. For practical use in protein engineering, it is important that we can also provide reliable uncertainty estimates for our predictions, and while prediction accuracy has seen much progress in recent years, uncertainty metrics are rarely reported. We here provide a Gaussian process regression model, Kermut, with a novel composite kernel for modelling mutation similarity, which obtains state-of-the-art performance for protein variant effect prediction while also offering estimates of uncertainty through its posterior. An analysis of the quality of the uncertainty estimates demonstrates that our model provides meaningful levels of overall calibration, but that instance-specific uncertainty calibration remains more challenging. We hope that this will encourage future work in this promising direction.

Submitted 9 July, 2024; v1 submitted 9 April, 2024; originally announced July 2024.

Comments: 10 pages (36 in total with appendix), 4 figures (26 figures in total with appendix)

arXiv:2404.19591 [pdf, other]

doi 10.1145/3650203.3663327

Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

Authors: Stefan Grafberger, Paul Groth, Sebastian Schelter

Abstract: Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision to generate these suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision applying incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.

Submitted 30 April, 2024; originally announced April 2024.

ACM Class: H.2; H.2.8; H.4; D.2.6; I.2

arXiv:2404.17000 [pdf, other]

Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models

Authors: Bradley P. Allen, Paul T. Groth

Abstract: Class membership relations, which assign entities to a given class, are a backbone of knowledge graphs. As part of the knowledge engineering process, we propose a new method for evaluating the quality of these relations by processing descriptions of a given entity and class using a zero-shot chain-of-thought classifier that uses a natural language intensional definition of a class. We evaluate the method using two publicly available knowledge graphs, Wikidata and CaLiGraph, and 7 large language models. Using the gpt-4-0125-preview large language model, the method's classification performance achieves a macro-averaged F1-score of 0.830 on data from Wikidata and 0.893 on data from CaLiGraph. Moreover, a manual analysis of the classification errors shows that 40.9% of errors were due to the knowledge graphs, with 16.0% due to missing relations and 24.9% due to incorrectly asserted relations. These results show how large language models can assist knowledge engineers in the process of knowledge graph refinement. The code and data are available on GitHub.
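Note on the metric: the macro-averaged F1-score reported above is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its size. The following is a minimal sketch of that definition only; it is not the authors' evaluation code, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumes non-empty inputs)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: "member" vs "non-member" of a class.
truth = ["member", "member", "non-member", "non-member"]
preds = ["member", "non-member", "non-member", "non-member"]
score = macro_f1(truth, preds)
```

In this toy case the per-class F1 scores are 2/3 and 4/5, so the macro average is their plain mean.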

Submitted 25 April, 2024; originally announced April 2024.

Comments: 11 pages, 1 figure, 2 tables, accepted at the European Semantic Web Conference Special Track on Large Language Models for Knowledge Engineering, Hersonissos, Crete, GR, May 2024, for associated code and data, see https://github.com/bradleypallen/evaluating-kg-class-memberships-using-llms

ACM Class: I.2.7; I.2.4

arXiv:2404.03732 [pdf, other]

SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection

Authors: Bradley P. Allen, Fina Polat, Paul Groth

Abstract: We describe the University of Amsterdam Intelligent Data Engineering Lab team's entry for the SemEval-2024 Task 6 competition. The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning with large language models (LLMs) to build classifiers for hallucination detection, and extends that work through the incorporation of context-specific definition of task, role, and target concept, and automated generation of examples for use in a few-shot prompting approach. The resulting system achieved fourth-best and sixth-best performance in the model-agnostic and model-aware tracks for Task 6, respectively, and evaluation using the validation sets showed that the system's classification decisions were consistent with those of the crowd-sourced human labellers. We further found that a zero-shot approach provided better accuracy than a few-shot approach using automatically generated examples. Code for the system described in this paper is available on GitHub.

Submitted 4 April, 2024; originally announced April 2024.

Comments: 6 pages, 6 figures, 4 tables, camera-ready copy, accepted to the 18th International Workshop on Semantic Evaluation (SemEval-2024), for associated code and data see https://github.com/bradleypallen/shroom

arXiv:2403.18133 [pdf, other]

AE SemRL: Learning Semantic Association Rules with Autoencoders

Authors: Erkan Karabulut, Victoria Degeler, Paul Groth

Abstract: Association Rule Mining (ARM) is the task of learning associations among data features in the form of logical rules. Mining association rules from high-dimensional numerical data, for example, time series data from a large number of sensors in a smart environment, is a computationally intensive task. In this study, we propose an Autoencoder-based approach to learn and extract association rules from time series data (AE SemRL). Moreover, we argue that in the presence of semantic information related to time series data sources, semantics can facilitate learning generalizable and explainable association rules. Despite enriching time series data with additional semantic features, AE SemRL makes learning association rules from high-dimensional data feasible. Our experiments show that semantic association rules can be extracted from a latent representation created by an Autoencoder and that this method executes on the order of hundreds of times faster than state-of-the-art ARM approaches in many scenarios. We believe that this study advances a new way of extracting associations from representations and has the potential to inspire more research in this field.
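Note on terminology: an association rule "antecedent -> consequent" is classically scored by its support and confidence. The sketch below illustrates only those two textbook measures on a toy set of discretized sensor readings; it does not reproduce AE SemRL's autoencoder-based extraction, and the item names and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0


# Hypothetical discretized readings, one set of items per time window.
readings = [
    {"pressure=high", "flow=high"},
    {"pressure=high", "flow=high"},
    {"pressure=high", "flow=low"},
    {"pressure=low", "flow=low"},
]

rule_support = support(readings, {"pressure=high", "flow=high"})
rule_confidence = confidence(readings, {"pressure=high"}, {"flow=high"})
```

Here the rule "pressure=high -> flow=high" holds in two of the four windows, and in two of the three windows where the antecedent occurs.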

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2402.07926 [pdf]

From Data Creator to Data Reuser: Distance Matters

Authors: Christine L. Borgman, Paul T. Groth

Abstract: Sharing research data is complex, labor-intensive, expensive, and requires infrastructure investments by multiple stakeholders. Open science policies focus on data release rather than on data reuse, yet reuse is also difficult, expensive, and may never occur. Investments in data management could be made more wisely by considering who might reuse data, how, why, for what purposes, and when. Data creators cannot anticipate all possible reuses or reusers; our goal is to identify factors that may aid stakeholders in deciding how to invest in research data, how to identify potential reuses and reusers, and how to improve data exchange processes. Drawing upon empirical studies of data sharing and reuse, we develop the theoretical construct of distance between data creator and data reuser, identifying six distance dimensions that influence the ability to transfer knowledge effectively: domain, methods, collaboration, curation, purposes, and time and temporality. These dimensions are primarily social in character, with associated technical aspects that can decrease - or increase - distances between creators and reusers. We identify the order of expected influence on data reuse and ways in which the six dimensions are interdependent. Our theoretical framing of the distance between data creators and prospective reusers leads to recommendations to four categories of stakeholders on how to make data sharing and reuse more effective: data creators, data reusers, data archivists, and funding agencies.

Submitted 5 February, 2024; originally announced February 2024.

Comments: 30 pages, consisting of Table of Contents, Abstract, 20 page narrative, 1 box, 10 pages references. Original work

arXiv:2311.13806 [pdf, other]

AdaTyper: Adaptive Semantic Column Type Detection

Authors: Madelon Hulsebos, Paul Groth, Çağatay Demiralp

Abstract: Understanding the semantics of relational tables is instrumental for automation in data exploration and preparation systems. A key source for understanding a table is the semantics of its columns. With the rise of deep learning, learned table representations are now available, which can be applied for semantic type detection and achieve good performance on benchmarks. Nevertheless, we observe a gap between this performance and its applicability in practice. In this paper, we propose AdaTyper to address one of the most critical deployment challenges: adaptation. AdaTyper uses weak supervision to adapt a hybrid type predictor towards new semantic types and shifted data distributions at inference time, using minimal human feedback. The hybrid type predictor of AdaTyper combines rule-based methods and a light machine learning model for semantic column type detection. We evaluate the adaptation performance of AdaTyper on real-world database tables hand-annotated with semantic column types through crowdsourcing and find that the F1-score improves for new and existing types. AdaTyper approaches an average precision of 0.6 after only seeing 5 examples, significantly outperforming existing adaptation methods based on human-provided regular expressions or dictionaries.

Submitted 22 November, 2023; originally announced November 2023.

Comments: Submitted to VLDB'24

arXiv:2310.12997 [pdf]

doi 10.1117/12.2677526

Parking Spot Classification based on surround view camera system

Authors: Andy Xiao, Deep Doshi, Lihao Wang, Harsha Gorantla, Thomas Heitzmann, Peter Groth

Abstract: Surround-view fisheye cameras are commonly used for near-field sensing in automated driving scenarios, including urban driving and auto valet parking. Four fisheye cameras, one on each side, are sufficient to cover 360° around the vehicle, capturing the entire near-field region. Based on surround view cameras, there has been much research in recent years on parking slot detection, mainly focused on occupancy status, but little work on whether a free slot is compatible with the mission of the ego vehicle. For instance, some spots are reserved for handicap or electric vehicle use only. In this paper, we tackle parking spot classification based on the surround view camera system. We adapt the object detection neural network YOLOv4 with a novel polygon bounding box model that is well-suited for various shaped parking spaces, such as slanted parking slots. To the best of our knowledge, we present the first detailed study on parking spot detection and classification on fisheye cameras for auto valet parking scenarios. The results prove that our proposed classification approach is effective at distinguishing between regular, electric vehicle, and handicap parking spots.

Submitted 5 October, 2023; originally announced October 2023.

Comments: SPIE Optical Engineering + Applications, 2023, San Diego, California, United States. Proc. SPIE 12675, Applications of Machine Learning 2023

arXiv:2310.07736 [pdf, other]

Observatory: Characterizing Embeddings of Relational Tables

Authors: Tianji Cong, Madelon Hulsebos, Zhenjie Sun, Paul Groth, H. V. Jagadish

Abstract: Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model for a given task reliant on trial and error. There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage. To address this need, we propose Observatory, a formal framework to systematically analyze embedding representations of relational tables. Motivated both by invariants of the relational data model and by statistical considerations regarding data distributions, we define eight primitive properties, and corresponding measures to quantitatively characterize table embeddings for these properties. Based on these properties, we define an extensible framework to evaluate language and table embedding models. We collect and synthesize a suite of datasets and use Observatory to analyze nine such models. Our analysis provides insights into the strengths and weaknesses of learned representations over tables. We find, for example, that some models are sensitive to table structure such as column order, that functional dependencies are rarely reflected in embeddings, and that specialized table embedding models have relatively lower sample fidelity. Such insights help researchers and practitioners better anticipate model behaviors and select appropriate models for their downstream tasks, while guiding researchers in the development of new models.

Submitted 27 January, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: Camera ready of VLDB 2024

arXiv:2310.07348 [pdf, other]

Semantic Association Rule Learning from Time Series Data and Knowledge Graphs

Authors: Erkan Karabulut, Victoria Degeler, Paul Groth

Abstract: Digital Twins (DT) are a promising concept in cyber-physical systems research due to their advanced features including monitoring and automated reasoning. Semantic technologies such as Knowledge Graphs (KG) have recently been utilized in DTs, especially for information modelling. Building on this move, this paper proposes a pipeline for semantic association rule learning in DTs using KGs and time series data. In addition to this initial pipeline, we also propose a new semantic association rule criterion. The approach is evaluated on an industrial water network scenario. Initial evaluation shows that the proposed approach is able to learn a high number of association rules with semantic information which are more generalizable. The paper aims to set a foundation for further work on using semantic association rule learning, especially in the context of industrial applications.

Submitted 11 October, 2023; originally announced October 2023.

Comments: This paper is accepted to SemIIM23: 2nd International Workshop on Semantic Industrial Information Modelling, 7th November 2023, Athens, Greece, co-located with 22nd International Semantic Web Conference (ISWC 2023)

Report number: https://ceur-ws.org/Vol-3647/SemIIM2023_paper_3.pdf

arXiv:2310.00637 [pdf, other]

Knowledge Engineering using Large Language Models

Authors: Bradley P. Allen, Lise Stork, Paul Groth

Abstract: Knowledge engineering is a discipline that focuses on the creation and maintenance of processes that generate and apply knowledge. Traditionally, knowledge engineering approaches have focused on knowledge expressed in formal languages. The emergence of large language models and their capabilities to effectively work with natural language, in its broadest sense, raises questions about the foundations and practice of knowledge engineering. Here, we outline the potential role of LLMs in knowledge engineering, identifying two central directions: 1) creating hybrid neuro-symbolic knowledge systems; and 2) enabling knowledge engineering in natural language. Additionally, we formulate key open research questions to tackle these directions.

Submitted 1 October, 2023; originally announced October 2023.

Comments: 19 pages, 2 figures, accepted in Transactions on Graph Data and Knowledge

arXiv:2308.15168 [pdf, other]

doi 10.1016/j.future.2023.12.013

Ontologies in Digital Twins: A Systematic Literature Review

Authors: Erkan Karabulut, Salvatore F. Pileggi, Paul Groth, Victoria Degeler

Abstract: Digital Twins (DT) facilitate monitoring and reasoning processes in cyber-physical systems. They have progressively gained popularity over the past years because of intense research activity and industrial advancements. Cognitive Twins is a novel concept, recently coined to refer to the involvement of Semantic Web technology in DTs. Recent studies address the relevance of ontologies and knowledge graphs in the context of DTs, in terms of knowledge representation, interoperability and automatic reasoning. However, there is no comprehensive analysis of how semantic technologies, and specifically ontologies, are utilized within DTs. This Systematic Literature Review (SLR) is based on the analysis of 82 research articles that either propose or benefit from ontologies with respect to DTs. The paper uses different analysis perspectives, including a structural analysis based on a reference DT architecture, and an application-specific analysis to specifically address the different domains, such as Manufacturing and Infrastructure. The review also identifies open issues and possible research directions on the usage of ontologies and knowledge graphs in DTs.

Submitted 29 August, 2023; originally announced August 2023.

Comments: The Systematic Literature Review (SLR) is submitted to Future Generation Computer System journal's Special Issue on Digital Twin for Future Networks and Emerging IoT Applications (2023)

arXiv:2308.02622 [pdf, other]

Harnessing the Web and Knowledge Graphs for Automated Impact Investing Scoring

Authors: Qingzhi Hu, Daniel Daza, Laurens Swinkels, Kristina Ūsaitė, Robbert-Jan 't Hoen, Paul Groth

Abstract: The Sustainable Development Goals (SDGs) were introduced by the United Nations in order to encourage policies and activities that help guarantee human prosperity and sustainability. SDG frameworks produced in the finance industry are designed to provide scores that indicate how well a company aligns with each of the 17 SDGs. This scoring enables a consistent assessment of investments that have the potential of building an inclusive and sustainable economy. As a result of the high quality and reliability required by such frameworks, the process of creating and maintaining them is time-consuming and requires extensive domain expertise. In this work, we describe a data-driven system that seeks to automate the process of creating an SDG framework. First, we propose a novel method for collecting and filtering a dataset of texts from different web sources and a knowledge graph relevant to a set of companies. We then implement and deploy classifiers trained with this data for predicting scores of alignment with SDGs for a given company. Our results indicate that our best performing model can accurately predict SDG scores with a micro average F1 score of 0.89, demonstrating the effectiveness of the proposed solution. We further describe how the integration of the models for its use by humans can be facilitated by providing explanations in the form of data relevant to a predicted score. We find that our proposed solution enables access to a large amount of information that analysts would normally not be able to process, resulting in an accurate prediction of SDG scores at a fraction of the cost.

Submitted 4 August, 2023; originally announced August 2023.

Comments: Presented at the KDD 2023 Workshop - Fragile Earth: AI for Climate Sustainability

arXiv:2307.06698 [pdf, other]

IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation

Authors: Thiviyan Thanapalasingam, Emile van Krieken, Peter Bloem, Paul Groth

Abstract: Knowledge Graph Embedding (KGE) models are used to learn continuous representations of entities and relations. A key task in the literature is predicting missing links between entities. However, Knowledge Graphs are not just sets of links but also have semantics underlying their structure. Semantics is crucial in several downstream tasks, such as query answering or reasoning. We introduce the subgraph inference task, where a model has to generate likely and semantically valid subgraphs. We propose IntelliGraphs, a set of five new Knowledge Graph datasets. The IntelliGraphs datasets contain subgraphs with semantics expressed in logical rules for evaluating subgraph inference. We also present the dataset generator that produced the synthetic datasets. We designed four novel baseline models, which include three models based on traditional KGEs. We evaluate their expressiveness and show that these models cannot capture the semantics. We believe this benchmark will encourage the development of machine learning models that emphasize semantic understanding.

Submitted 25 August, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

arXiv:2306.07436 [pdf, ps, other]

doi 10.7717/peerj-cs.1781

Evaluating FAIR Digital Object and Linked Data as distributed object systems

Authors: Stian Soiland-Reyes, Carole Goble, Paul Groth

Abstract: FAIR Digital Object (FDO) is an emerging concept that is highlighted by the European Open Science Cloud (EOSC) as a potential candidate for building an ecosystem of machine-actionable research outputs. In this work we systematically evaluate FDO and its implementations as a global distributed object system, by using five different conceptual frameworks that cover interoperability, middleware, FAIR principles, EOSC requirements and the FDO guidelines themselves. We compare the FDO approach with established Linked Data practices and the existing Web architecture, and provide a brief history of the Semantic Web while discussing why these technologies may have been difficult to adopt for FDO purposes. We conclude with recommendations for both Linked Data and FDO communities to further their adaptation and alignment.

Submitted 17 June, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

Comments: 39 pages, submitted to PeerJ CS

ACM Class: H.3; C.2

Journal ref: PeerJ Computer Science 10 (2024) e1781

arXiv:2306.03606 [pdf, other]

BioBLP: A Modular Framework for Learning on Multimodal Biomedical Knowledge Graphs

Authors: Daniel Daza, Dimitrios Alivanistos, Payal Mitra, Thom Pijnenburg, Michael Cochez, Paul Groth

Abstract: Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences, or molecular graphs. Other works incorporate such data, but assume that entities can be represented with the same data modality. This is not always the case for biomedical KGs, where entities exhibit heterogeneous modalities that are central to their representation in the subject domain. We propose a modular framework for learning embeddings in KGs with entity attributes, that allows encoding attribute data of different modalities while also supporting entities with missing attributes. We additionally propose an efficient pretraining strategy for reducing the required training runtime. We train models using a biomedical KG containing approximately 2 million triples, and evaluate the performance of the resulting entity embeddings on the tasks of link prediction, and drug-protein interaction prediction, comparing against methods that do not take attribute data into account. In the standard link prediction evaluation, the proposed method results in competitive, yet lower performance than baselines that do not use attribute data. When evaluated in the task of drug-protein interaction prediction, the method compares favorably with the baselines. We find settings involving low degree entities, which make up a substantial share of the entities in the KG, where our method outperforms the baselines. Our proposed pretraining strategy yields significantly higher performance while reducing the required training runtime. Our implementation is available at https://github.com/elsevier-AI-Lab/BioBLP.

Submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.16877 [pdf, other]

Distributional Reinforcement Learning with Dual Expectile-Quantile Regression

Authors: Sami Jullien, Romain Deffayet, Jean-Michel Renders, Paul Groth, Maarten de Rijke

Abstract: Distributional reinforcement learning (RL) has proven useful in multiple benchmarks as it enables approximating the full distribution of returns and makes better use of environment samples. The commonly used quantile regression approach to distributional RL -- based on asymmetric $L_1$ losses -- provides a flexible and effective way of learning arbitrary return distributions. In practice, it is often improved by using a more efficient, hybrid asymmetric $L_1$-$L_2$ Huber loss for quantile regression. However, by doing so, distributional estimation guarantees vanish, and we empirically observe that the estimated distribution rapidly collapses to its mean. Indeed, asymmetric $L_2$ losses, corresponding to expectile regression, cannot be readily used for distributional temporal difference learning. Motivated by the efficiency of $L_2$-based learning, we propose to jointly learn expectiles and quantiles of the return distribution in a way that allows efficient learning while keeping an estimate of the full distribution of returns. We prove that our approach approximately learns the correct return distribution, and we benchmark a practical implementation on a toy example and at scale. On the Atari benchmark, our approach matches the performance of the Huber-based IQN-1 baseline after $200$M training frames but avoids distributional collapse and keeps estimates of the full distribution of returns.

Submitted 18 March, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: 16 pages, 3 figures, 1 algorithm

ACM Class: I.2.8; G.3

arXiv:2208.06662 [pdf, other]

Self-Contained Entity Discovery from Captioned Videos

Authors: Melika Ayoughi, Pascal Mettes, Paul Groth

Abstract: This paper introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g. faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating faces with entity labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from videos and corresponding captions or subtitles. We introduce a three-stage method where we (i) create bipartite entity-name graphs from frame-caption pairs, (ii) find visual entity agreements, and (iii) refine the entity assignment through entity-level prototype construction. To tackle this new problem, we outline two new benchmarks SC-Friends and SC-BBT based on the Friends and Big Bang Theory TV series. Experiments on the benchmarks demonstrate the ability of our approach to discover which named entity belongs to which face or scene, with an accuracy close to a supervised oracle, just from the multimodal information present in videos. Additionally, our qualitative examples show the potential challenges of self-contained discovery of any visual entity for future work. The code and the data are available on GitHub.

Submitted 13 August, 2022; originally announced August 2022.

arXiv:2208.04609 [pdf, other]

E2EG: End-to-End Node Classification Using Graph Topology and Text-based Node Attributes

Authors: Tu Anh Dinh, Jeroen den Boef, Joran Cornelisse, Paul Groth

Abstract: Node classification utilizing text-based node attributes has many real-world applications, ranging from prediction of paper topics in academic citation graphs to classification of user characteristics in social media networks. State-of-the-art node classification frameworks, such as GIANT, use a two-stage pipeline: first embedding the text attributes of graph nodes, then feeding the resulting embeddings into a node classification model. In this paper, we eliminate these two stages and develop an end-to-end node classification model that builds upon GIANT, called End-to-End-GIANT (E2EG). The tandem utilization of main and auxiliary classification objectives in our approach results in a more robust model, enabling the BERT backbone to be switched out for a distilled encoder with a 25%-40% reduction in the number of parameters. Moreover, the model's end-to-end nature increases ease of use, as it avoids the need to chain multiple models for node classification. Compared to a GIANT+MLP baseline on the ogbn-arxiv and ogbn-products datasets, E2EG obtains slightly better accuracy in the transductive setting (+0.5%), while reducing model training time by up to 40%. Our model is also applicable in the inductive setting, outperforming GIANT+MLP by up to +2.23%.

Submitted 26 September, 2023; v1 submitted 9 August, 2022; originally announced August 2022.

Comments: Accepted to MLoG - IEEE International Conference on Data Mining Workshops ICDMW 2023

arXiv:2205.15455 [pdf, other]

A Simulation Environment and Reinforcement Learning Method for Waste Reduction

Authors: Sami Jullien, Mozhdeh Ariannezhad, Paul Groth, Maarten de Rijke

Abstract: In retail (e.g., grocery stores, apparel shops, online retailers), inventory managers have to balance short-term risk (no items to sell) with long-term risk (over-ordering leading to product waste). This balancing task is made especially hard due to the lack of information about future customer purchases. In this paper, we study the problem of restocking a grocery store's inventory with perishable items over time, from a distributional point of view. The objective is to maximize sales while minimizing waste, with uncertainty about the actual consumption by customers. This problem is of high relevance today, given the growing demand for food and the impact of food waste on the environment, the economy, and purchasing power. We frame inventory restocking as a new reinforcement learning task that exhibits stochastic behavior conditioned on the agent's actions, making the environment partially observable. We make two main contributions. First, we introduce a new reinforcement learning environment, RetaiL, based on real grocery store data and expert knowledge. This environment is highly stochastic, and presents a unique challenge for reinforcement learning practitioners. We show that uncertainty about the future behavior of the environment is not handled well by classical supply chain algorithms, and that distributional approaches are a good way to account for the uncertainty. Second, we introduce GTDQN, a distributional reinforcement learning algorithm that learns a generalized Tukey Lambda distribution over the reward space. GTDQN provides a strong baseline for our environment. It outperforms other distributional reinforcement learning approaches in this partially observable setting, in both overall reward and reduction of generated waste.

Submitted 26 May, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

Comments: 20 pages, 4 figures, 4 tables, 3 listings, 1 algorithm

ACM Class: I.2.1; I.6.7

Journal ref: TMLR, May 2023

arXiv:2109.05173 [pdf, other]

Making Table Understanding Work in Practice

Authors: Madelon Hulsebos, Sneha Gathani, James Gale, Isil Dillig, Paul Groth, Çağatay Demiralp

Abstract: Understanding the semantics of tables at scale is crucial for tasks like data integration, preparation, and search. Table understanding methods aim at detecting a table's topic, semantic column types, column relations, or entities. With the rise of deep learning, powerful models have been developed for these tasks with excellent accuracy on benchmarks. However, we observe that there exists a gap between the performance of these models on these benchmarks and their applicability in practice. In this paper, we address the question: what do we need for these models to work in practice? We discuss three challenges of deploying table understanding models and propose a framework to address them. These challenges include 1) difficulty in customizing models to specific domains, 2) lack of training data for typical database tables often found in enterprises, and 3) lack of confidence in the inferences made by models. We present SigmaTyper which implements this framework for the semantic column type detection task. SigmaTyper encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model. Lastly, we highlight avenues for future research that further close the gap towards making table understanding effective in practice.

Submitted 10 September, 2021; originally announced September 2021.

Comments: Submitted to CIDR'22

arXiv:2108.06503 [pdf]

doi 10.3233/DS-210053

Packaging research artefacts with RO-Crate

Authors: Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble

Abstract: An increasing number of researchers support reproducibility by including pointers to and descriptions of datasets, software and methods in their publications. However, scientific articles may be ambiguous, incomplete and difficult to process by automated systems. In this paper we introduce RO-Crate, an open, community-driven, and lightweight approach to packaging research artefacts along with their metadata in a machine readable manner. RO-Crate is based on Schema.org annotations in JSON-LD, aiming to establish best practices to formally describe metadata in an accessible and practical way for their use in a wide variety of situations. An RO-Crate is a structured archive of all the items that contributed to a research outcome, including their identifiers, provenance, relations and annotations. As a general purpose packaging approach for data and their metadata, RO-Crate is used across multiple areas, including bioinformatics, digital humanities and regulatory sciences. By applying "just enough" Linked Data standards, RO-Crate simplifies the process of making research outputs FAIR while also enhancing research reproducibility. An RO-Crate for this article is available at https://w3id.org/ro/doi/10.5281/zenodo.5146227

Submitted 6 December, 2021; v1 submitted 14 August, 2021; originally announced August 2021.

Comments: 44 pages. Accepted for Data Science

ACM Class: H.1.1; H.3.2

Journal ref: Data Science 2022

arXiv:2107.10015 [pdf, other]

doi 10.7717/peerj-cs.1073

Relational Graph Convolutional Networks: A Closer Look

Authors: Thiviyan Thanapalasingam, Lucas van Berkel, Peter Bloem, Paul Groth

Abstract: In this paper, we describe a reproduction of the Relational Graph Convolutional Network (RGCN). Using our reproduction, we explain the intuition behind the model. Our reproduction results empirically validate the correctness of our implementations using benchmark Knowledge Graph datasets on node classification and link prediction tasks. Our explanation provides a friendly understanding of the different components of the RGCN for both users and researchers extending the RGCN approach. Furthermore, we introduce two new configurations of the RGCN that are more parameter efficient. The code and datasets are available at https://github.com/thiviyanT/torch-rgcn.

Submitted 21 July, 2021; originally announced July 2021.

arXiv:2106.07258 [pdf, other]

doi 10.1145/3588710

GitTables: A Large-Scale Corpus of Relational Tables

Authors: Madelon Hulsebos, Çağatay Demiralp, Paul Groth

Abstract: The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io.

Submitted 12 April, 2023; v1 submitted 14 June, 2021; originally announced June 2021.

arXiv:2101.01353 [pdf, other]

Reinforcement Learning based Collective Entity Alignment with Adaptive Features

Authors: Weixin Zeng, Xiang Zhao, Jiuyang Tang, Xuemin Lin, Paul Groth

Abstract: Entity alignment (EA) is the task of identifying the entities that refer to the same real-world object but are located in different knowledge graphs (KGs). For entities to be aligned, existing EA solutions treat them separately and generate alignment results as ranked lists of entities on the other side. Nevertheless, this decision-making paradigm fails to take into account the interdependence among entities. Although some recent efforts mitigate this issue by imposing the 1-to-1 constraint on the alignment process, they still cannot adequately model the underlying interdependence and the results tend to be sub-optimal. To fill in this gap, in this work, we delve into the dynamics of the decision-making process, and offer a reinforcement learning (RL) based model to align entities collectively. Under the RL framework, we devise the coherence and exclusiveness constraints to characterize the interdependence and restrict collective alignment. Additionally, to generate more precise inputs to the RL framework, we employ representative features to capture different aspects of the similarity between entities in heterogeneous KGs, which are integrated by an adaptive feature fusion strategy. Our proposal is evaluated on both cross-lingual and mono-lingual EA benchmarks and compared against state-of-the-art solutions. The empirical results verify its effectiveness and superiority.

Submitted 5 January, 2021; originally announced January 2021.

Comments: Accepted by ACM TOIS

arXiv:2011.08903 [pdf, other]

Towards Olfactory Information Extraction from Text: A Case Study on Detecting Smell Experiences in Novels

Authors: Ryan Brate, Paul Groth, Marieke van Erp

Abstract: Environmental factors determine the smells we perceive, but societal factors shape the importance, sentiment and biases we give to them. Descriptions of smells in text, or as we call them 'smell experiences', offer a window into these factors, but they must first be identified. To the best of our knowledge, no tool exists to extract references to smell experiences from text. In this paper, we present two variations on a semi-supervised approach to identify smell experiences in English literature. The combined set of patterns from both implementations offers significantly better performance than a keyword-based baseline.

Submitted 6 December, 2020; v1 submitted 17 November, 2020; originally announced November 2020.

Comments: Accepted to The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2020). Barcelona, Spain. December 2020.

ACM Class: I.2.7

arXiv:2011.03009 [pdf, other]

doi 10.1121/10.0005655

Accelerating frequency-domain numerical methods for weakly nonlinear focused ultrasound using nested meshes

Authors: Samuel P. Groth, Pierre Gélat, Seyyed R. Haqshenas, Nader Saffari, Elwin van 't Wout, Timo Betcke, Garth N. Wells

Abstract: The numerical simulation of weakly nonlinear ultrasound is important in treatment planning for focused ultrasound (FUS) therapies. However, the large domain sizes and generation of higher harmonics at the focus make these problems extremely computationally demanding. Numerical methods typically employ a uniform mesh fine enough to resolve the highest harmonic present in the problem, leading to a very large number of degrees of freedom. This paper proposes a more efficient strategy in which each harmonic is approximated on a separate mesh, the size of which is proportional to the wavelength of the harmonic. The increase in resolution required to resolve a smaller wavelength is balanced by a reduction in the domain size. This nested meshing is feasible owing to the increasingly localised nature of higher harmonics near the focus. Numerical experiments are performed for FUS transducers in homogeneous media in order to determine the size of the meshes required to accurately represent the harmonics. In particular, a fast volume potential approach is proposed and employed to perform convergence experiments as the computation domain size is modified. This approach allows each harmonic to be computed via the evaluation of an integral over the domain. Discretising this integral using the midpoint rule allows the computations to be performed rapidly with the FFT. It is shown that at least an order of magnitude reduction in memory consumption and computation time can be achieved with nested meshing. Finally, it is demonstrated how to generalise this approach to inhomogeneous propagation domains.

Submitted 22 July, 2021; v1 submitted 5 November, 2020; originally announced November 2020.

Journal ref: The Journal of the Acoustical Society of America 150, 441 (2021)

arXiv:2010.08269 [pdf, other]

doi 10.18653/v1/2020.sdp-1.7

Effective Distributed Representations for Academic Expert Search

Authors: Mark Berger, Jakub Zavrel, Paul Groth

Abstract: Expert search aims to find and rank experts based on a user's query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of academic papers (i.e. embeddings) impact academic expert retrieval. We use the Microsoft Academic Graph dataset and experiment with different configurations of a document-centric voting model for retrieval. In particular, we explore the impact of the use of contextualized embeddings on search performance. We also present results for paper embeddings that incorporate citation information through retrofitting. Additionally, experiments are conducted using different techniques for assigning author weights based on author order. We observe that using contextual embeddings produced by a transformer model trained for sentence similarity tasks produces the most effective paper representations for document-centric expert retrieval. However, retrofitting the paper embeddings and using elaborate author contribution weighting strategies did not improve retrieval performance.

Submitted 16 October, 2020; originally announced October 2020.

Comments: To be published in the Scholarly Document Processing 2020 Workshop @ EMNLP 2020 proceedings

arXiv:2010.03496 [pdf, other]

doi 10.1145/3442381.3450141

Inductive Entity Representations from Text via Link Prediction

Authors: Daniel Daza, Michael Cochez, Paul Groth

Abstract: Knowledge Graphs (KG) are of vital importance for multiple applications on the web, including information retrieval, recommender systems, and metadata annotation. Regardless of whether they are built manually by domain experts or with automatic pipelines, KGs are often incomplete. Recent work has begun to explore the use of textual descriptions available in knowledge graphs to learn vector representations of entities in order to perform link prediction. However, the extent to which these representations learned for link prediction generalize to other tasks is unclear. This is important given the cost of learning such representations. Ideally, we would prefer representations that do not need to be trained again when transferring to a different task, while retaining reasonable performance. In this work, we propose a holistic evaluation protocol for entity representations learned via a link prediction objective. We consider the inductive link prediction and entity classification tasks, which involve entities not seen during training. We also consider an information retrieval task for entity-oriented search. We evaluate an architecture based on a pretrained language model that exhibits strong generalization to entities not observed during training, and outperforms related state-of-the-art methods (22% MRR improvement in link prediction on average). We further provide evidence that the learned representations transfer well to other tasks without fine-tuning. In the entity classification task we obtain an average improvement of 16% in accuracy compared with baselines that also employ pre-trained models. In the information retrieval task, we obtain significant improvements of up to 8.8% in NDCG@10 for natural language queries. We thus show that the learned representations are not limited to KG-specific tasks, and have greater generalization properties than evaluated in previous work.

Submitted 14 April, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

arXiv:2004.07917 [pdf, ps, other]

Knowledge Scientists: Unlocking the data-driven organization

Authors: George Fletcher, Paul Groth, Juan Sequeda

Abstract: Organizations across all sectors are increasingly undergoing deep transformation and restructuring towards data-driven operations. The central role of data highlights the need for reliable and clean data. Unreliable, erroneous, and incomplete data lead to critical bottlenecks in processing pipelines and, ultimately, service failures, which are disastrous for the competitive performance of the organization. Given its central importance, those organizations which recognize and react to the need for reliable data will have the advantage in the coming decade. We argue that the technologies for reliable data are driven by distinct concerns and expertise which complement those of the data scientist and the data engineer. Those organizations which identify the central importance of meaningful, explainable, reproducible, and maintainable data will be at the forefront of the democratization of reliable data. We call the new role which must be developed to fill this critical need the Knowledge Scientist. The organizational structures, tools, methodologies and techniques to support and make possible the work of knowledge scientists are still in their infancy. As organizations not only use data but increasingly rely on data, it is time to empower the people who are central to this transformation.

Submitted 16 April, 2020; originally announced April 2020.

arXiv:1911.09041 [pdf, other]

doi 10.1016/j.ijhcs.2020.102562

Talking datasets: Understanding data sensemaking behaviours

Authors: Laura Koesten, Kathleen Gregory, Paul Groth, Elena Simperl

Abstract: The sharing and reuse of data are seen as critical to solving the most complex problems of today. Despite this potential, relatively little is known about a key step in data reuse: people's behaviours involved in data-centric sensemaking. We aim to address this gap by presenting a mixed-methods study combining in-depth interviews, a think-aloud task and a screen recording analysis with 31 researchers as they summarised and interacted with both familiar and unfamiliar data. We use our findings to identify and detail common activity patterns and necessary data attributes across three clusters of sensemaking activities: inspecting data, engaging with content, and placing data within broader contexts. We conclude by proposing design recommendations for tools and documentation practices which can be used to facilitate sensemaking and subsequent data reuse.

Submitted 18 July, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: 26 pages, 7 figures, 6 tables

arXiv:1909.00464 [pdf]

doi 10.1162/99608f92.e38165eb

Lost or found? Discovering data needed for research

Authors: Kathleen Gregory, Paul Groth, Andrea Scharnhorst, Sally Wyatt

Abstract: Finding data is a necessary precursor to being able to reuse data, although relatively little large-scale empirical evidence exists about how researchers discover, make sense of and (re)use data for research. This study presents evidence from the largest known survey investigating how researchers discover and use data that they do not create themselves. We examine the data needs and discovery strategies of respondents, propose a typology for data reuse and probe the role of social interactions and literature search in data discovery. We consider how data communities can be conceptualized according to data uses and propose practical applications of our findings for designers of data discovery systems and repositories. Specifically, we consider how to design for a diversity of practices, how communities of use can serve as an entry point for design and the role of metadata in supporting both sensemaking and social interactions.

Submitted 2 April, 2020; v1 submitted 1 September, 2019; originally announced September 2019.

Comments: Harvard Data Science Review (2020)

arXiv:1908.10632 [pdf, other]

doi 10.1162/qss_a_00052

A Longitudinal Analysis of University Rankings

Authors: Friso Selten, Cameron Neylon, Chun-Kai Huang, Paul Groth

Abstract: Pressured by globalization and the increasing demand for public organisations to be accountable, efficient and transparent, university rankings have become an important tool for assessing the quality of higher education institutions. It is therefore important to carefully assess exactly what these rankings measure. In this paper, the three major global university rankings, The Academic Ranking of World Universities, The Times Higher Education and the Quacquarelli Symonds World University Rankings, are studied. After a description of the ranking methodologies, it is shown that university rankings are stable over time but that there is variation between the three rankings. Furthermore, using Principal Component Analysis and Exploratory Factor Analysis, we show that the variables used to construct the rankings primarily measure two underlying factors: a university's reputation and its research performance. By correlating these factors and plotting regional aggregates of universities on the two factors, differences between the rankings are made visible. Last, we elaborate how the results from these analyses can be viewed in light of often voiced critiques of the ranking process. This indicates that the variables used by the rankings might not capture the concepts they claim to measure. In doing so, the study provides evidence of the ambiguous nature of university rankings' quantification of university performance.

Submitted 20 January, 2020; v1 submitted 28 August, 2019; originally announced August 2019.

Comments: 26 pages

arXiv:1901.00735 [pdf, other]

doi 10.1007/s00778-019-00564-x

Dataset search: a survey

Authors: Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez, Emilia Kacprzak, Paul Groth

Abstract: Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts to data marketplaces, open data portals and data communities. Google recently beta released a search service for datasets, which allows users to discover data stored in various online repositories via keyword queries. These developments foreshadow an emerging research field around dataset search or retrieval that broadly encompasses frameworks, methods and tools that help match a user data need against a collection of datasets. Here, we survey the state of the art of research and commercial systems in dataset retrieval. We identify what makes dataset search a research field in its own right, with unique challenges and methods and highlight open problems. We look at approaches and implementations from related areas dataset search is drawing upon, including information retrieval, databases, entity-centric and tabular search in order to identify possible paths to resolve these open problems as well as immediate next steps that will take the field forward.

Submitted 3 January, 2019; originally announced January 2019.

Comments: 20 pages, 153 references

arXiv:1811.06303 [pdf, other]

End-to-End Learning for Answering Structured Queries Directly over Text

Authors: Paul Groth, Antony Scerri, Ron Daniel, Jr., Bradley P. Allen

Abstract: Structured queries expressed in languages (such as SQL, SPARQL, or XQuery) offer a convenient and explicit way for users to express their information needs for a number of tasks. In this work, we present an approach to answer these directly over text data without storing results in a database. We specifically look at the case of knowledge bases where queries are over entities and the relations between them. Our approach combines distributed query answering (e.g. Triple Pattern Fragments) with models built for extractive question answering. Importantly, by applying distributed query answering we are able to simplify the model learning problem. We train models for a large portion (572) of the relations within Wikidata and achieve an average 0.70 F1 measure across all models. We also present a systematic method to construct the necessary training data for this task from knowledge graphs and describe a prototype implementation.

Submitted 16 November, 2018; v1 submitted 15 November, 2018; originally announced November 2018.

Comments: 18 pages, 6 figures

Journal ref: Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG2019)

arXiv:1805.11883 [pdf, ps, other]

DATA:SEARCH'18 -- Searching Data on the Web

Authors: Paul Groth, Laura Koesten, Philipp Mayr, Maarten de Rijke, Elena Simperl

Abstract: This half day workshop explores challenges in data search, with a particular focus on data on the web. We want to stimulate an interdisciplinary discussion around how to improve the description, discovery, ranking and presentation of structured and semi-structured data, across data formats and domain applications. We welcome contributions describing algorithms and systems, as well as frameworks and studies in human data interaction. The workshop aims to bring together communities interested in making the web of data more discoverable, easier to search and more user friendly.

Submitted 30 May, 2018; originally announced May 2018.

arXiv:1802.05574 [pdf, other]

Open Information Extraction on Scientific Text: An Evaluation

Authors: Paul Groth, Michael Lauruhn, Antony Scerri, Ron Daniel Jr

Abstract: Open Information Extraction (OIE) is the task of the unsupervised creation of structured information from text. OIE is often used as a starting point for a number of downstream tasks including knowledge base construction, relation extraction, and question answering. While OIE methods are targeted at being domain independent, they have been evaluated primarily on newspaper, encyclopedic or general web text. In this article, we evaluate the performance of OIE on scientific texts originating from 10 different disciplines. To do so, we use two state-of-the-art OIE systems applying a crowd-sourcing approach. We find that OIE systems perform significantly worse on scientific text than encyclopedic text. We also provide an error analysis and suggest areas of work to reduce errors. Our corpus of sentences and judgments is made available.

Submitted 4 June, 2018; v1 submitted 15 February, 2018; originally announced February 2018.

Comments: 10 pages

Journal ref: The 27th International Conference on Computational Linguistics (COLING 2018)

arXiv:1801.04971 [pdf]

doi 10.1177/0165551519837182

Understanding Data Search as a Socio-technical Practice

Authors: Kathleen Gregory, Helena Cousijn, Paul Groth, Andrea Scharnhorst, Sally Wyatt

Abstract: Open research data are heralded as having the potential to increase effectiveness, productivity, and reproducibility in science, but little is known about the actual practices involved in data search. The socio-technical problem of locating data for reuse is often reduced to the technological dimension of designing data search systems. We combine a bibliometric study of the current academic discourse around data search with interviews with data seekers. In this article, we explore how adopting a contextual, socio-technical perspective can help to understand user practices and behavior and ultimately help to improve the design of data discovery systems.

Submitted 18 February, 2019; v1 submitted 15 January, 2018; originally announced January 2018.

Comments: 19 pages, 3 figures, 7 tables

Journal ref: Journal of Information Science. (2019). 0165551519837182

arXiv:1707.06937 [pdf]

doi 10.1002/asi.24165

Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines

Authors: Kathleen Gregory, Paul Groth, Helena Cousijn, Andrea Scharnhorst, Sally Wyatt

Abstract: A cross-disciplinary examination of the user behaviours involved in seeking and evaluating data is surprisingly absent from the research data discussion. This review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data. Two analytical frameworks rooted in information retrieval and science technology studies are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.

Submitted 12 March, 2020; v1 submitted 21 July, 2017; originally announced July 2017.

Journal ref: Journal of the Association for Information Science and Technology. (2019). 70(5), 419-432

arXiv:1611.00217 [pdf]

Sources of Change for Modern Knowledge Organization Systems

Authors: Michael Lauruhn, Paul Groth

Abstract: Knowledge Organization Systems (e.g. taxonomies and ontologies) continue to contribute benefits in the design of information systems by providing a shared conceptual underpinning for developers, users, and automated systems. However, the standard mechanisms for the management of KOS changes are inadequate for systems built on top of thousands of data sources or with the involvement of hundreds of individuals. In this work, we review standard sources of change for KOSs (e.g. institutional shifts; standards cycles; cultural and political; distribution, etc.) and then proceed to catalog new sources of change for KOSs ranging from massively cooperative development to always-on automated extraction systems. Finally, we reflect on what this means for the design and management of KOSs.

Submitted 1 November, 2016; originally announced November 2016.

Comments: 10 pages, 1 figure

Journal ref: Knowledge Organization, 43(8), 622-629 (2016)

arXiv:1401.2134 [pdf, other]

doi 10.1371/journal.pcbi.1003542

10 Simple Rules for the Care and Feeding of Scientific Data

Authors: Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Mercè Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, Aleksandra Slavkovic

Abstract: This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review that literature. Instead, we present a short guide intended for researchers who want to know why it is important to "care for and feed" data, with some practical advice on how to do that.

Submitted 9 January, 2014; originally announced January 2014.

Comments: Accepted in PLOS Computational Biology. This paper was written collaboratively, on the web, in the open, using Authorea. The living version of this article, which includes sources and history, is available at http://www.authorea.com/3410/

arXiv:1304.0567 [pdf, ps, other]

doi 10.1016/j.websem.2014.11.003

On the Formulation of Performant SPARQL Queries

Authors: Antonis Loizou, Paul Groth

Abstract: The combination of the flexibility of RDF and the expressiveness of SPARQL provides a powerful mechanism to model, integrate and query data. However, these properties also mean that it is nontrivial to write performant SPARQL queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to write "good" queries. The goal of this paper is to begin to bridge this gap. It describes 5 heuristics that can be applied to create optimised queries. The heuristics are informed by formal results in the literature on the semantics and complexity of evaluating SPARQL queries, which ensures that queries following these rules can be optimised effectively by an underlying RDF store. Moreover, we empirically verify the efficacy of the heuristics using a set of openly available datasets and corresponding SPARQL queries developed by a large pharmacology data integration project. The experimental results show improvements in performance across 6 state-of-the-art RDF stores.

Submitted 2 April, 2013; originally announced April 2013.

arXiv:1210.1480 [pdf, other]

doi 10.1140/epjst/e2012-01692-1

Theoretical And Technological Building Blocks For An Innovation Accelerator

Authors: Frank van Harmelen, George Kampis, Katy Borner, Peter van den Besselaar, Erik Schultes, Carole Goble, Paul Groth, Barend Mons, Stuart Anderson, Stefan Decker, Conor Hayes, Thierry Buecheler, Dirk Helbing

Abstract: The scientific system that we use today was devised centuries ago and is inadequate for our current ICT-based society: the peer review system encourages conservatism, journal publications are monolithic and slow, data is often not available to other scientists, and the independent validation of results is limited. Building on the Innovation Accelerator paper by Helbing and Balietti (2011) this paper takes the initial global vision and reviews the theoretical and technological building blocks that can be used for implementing an innovation (in first place: science) accelerator platform driven by re-imagining the science system. The envisioned platform would rest on four pillars: (i) Redesign the incentive scheme to reduce behavior such as conservatism, herding and hyping; (ii) Advance scientific publications by breaking up the monolithic paper unit and introducing other building blocks such as data, tools, experiment workflows, resources; (iii) Use machine readable semantics for publications, debate structures, provenance etc. in order to include the computer as a partner in the scientific process, and (iv) Build an online platform for collaboration, including a network of trust and reputation among the different types of stakeholders in the scientific system: scientists, educators, funding agencies, policy makers, students and industrial innovators among others.
Any such improvements to the scientific system must support the entire scientific process (unlike current tools that chop up the scientific process into disconnected pieces), must facilitate and encourage collaboration and interdisciplinarity (again unlike current tools), must facilitate the inclusion of intelligent computing in the scientific process, and must facilitate not only the core scientific process but also accommodate other stakeholders such as science policy makers, industrial innovators, and the general public.

Submitted 4 October, 2012; originally announced October 2012.

arXiv:1006.4860 [pdf]

doi 10.1117/12.856486

The Application of Cloud Computing to the Creation of Image Mosaics and Management of Their Provenance

Authors: G. Bruce Berriman, Ewa Deelman, Paul Groth, Gideon Juve

Abstract: We have used the Montage image mosaic engine to investigate the cost and performance of processing images on the Amazon EC2 cloud, and to inform the requirements that higher-level products impose on provenance management technologies. We will present a detailed comparison of the performance of Montage on the cloud and on the Abe high performance cluster at the National Center for Supercomputing Applications (NCSA). Because Montage generates many intermediate products, we have used it to understand the science requirements that higher-level products impose on provenance management technologies. We describe experiments with provenance management technologies such as the "Provenance Aware Service Oriented Architecture" (PASOA).

Submitted 24 June, 2010; originally announced June 2010.

Comments: 15 pages, 3 figures

Journal ref: SPIE Conference 7740: Software and Cyberinfrastructure for Astronomy (2010)

arXiv:1005.4457 [pdf, other]

Pipeline-Centric Provenance Model

Authors: Paul Groth, Ewa Deelman, Gideon Juve, Gaurang Mehta, Bruce Berriman

Abstract: In this paper we propose a new provenance model which is tailored to a class of workflow-based applications. We motivate the approach with use cases from the astronomy community. We generalize the class of applications the approach is relevant to and propose a pipeline-centric provenance model. Finally, we evaluate the benefits in terms of storage needed by the approach when applied to an astronomy application.

Submitted 24 May, 2010; originally announced May 2010.

Comments: 9 pages, 4 figures

Journal ref: Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, 2009

arXiv:1005.2643 [pdf]

Metadata and provenance management

Authors: Ewa Deelman, Bruce Berriman, Ann Chervenak, Oscar Corcho, Paul Groth, Luc Moreau

Abstract: Scientists today collect, analyze, and generate TeraBytes and PetaBytes of data. These data are often shared and further processed and analyzed among collaborators. In order to facilitate sharing and data interpretations, data need to carry with it metadata about how the data was collected or generated, and provenance information about how the data was processed. This chapter describes metadata and provenance in the context of the data lifecycle. It also gives an overview of the approaches to metadata and provenance management, followed by examples of how applications use metadata and provenance in their scientific processes.

Submitted 14 May, 2010; originally announced May 2010.

Journal ref: Scientific Data Management: Challenges, Existing Technology, and Deployment (Arie Shoshani and Doron Rotem, Editors) CRC Press 2010

Showing 1–47 of 47 results for author: Groth, P