Showing 1–47 of 47 results for author: Groth, P

  1. arXiv:2407.07639  [pdf, other]

    cs.LG cs.AI

    Explaining Graph Neural Networks for Node Similarity on Graphs

    Authors: Daniel Daza, Cuong Xuan Chu, Trung-Kien Tran, Daria Stepanova, Michael Cochez, Paul Groth

    Abstract: Similarity search is a fundamental task for exploiting information in various applications dealing with graph data, such as citation networks or knowledge graphs. While this task has been intensively approached from heuristics to graph embeddings and graph neural networks (GNNs), providing explanations for similarity has received less attention. In this work we are concerned with explainable simil…

    Submitted 10 July, 2024; originally announced July 2024.

  2. arXiv:2407.00002  [pdf, other]

    q-bio.BM cs.LG

    Kermut: Composite kernel regression for protein variant effects

    Authors: Peter Mørch Groth, Mads Herbert Kerrn, Lars Olsen, Jesper Salomon, Wouter Boomsma

    Abstract: Reliable prediction of protein variant effects is crucial for both protein optimization and for advancing biological understanding. For practical use in protein engineering, it is important that we can also provide reliable uncertainty estimates for our predictions, and while prediction accuracy has seen much progress in recent years, uncertainty metrics are rarely reported. We here provide a Gaus…

    Submitted 9 July, 2024; v1 submitted 9 April, 2024; originally announced July 2024.

    Comments: 10 pages (36 in total with appendix), 4 figures (26 figures in total with appendix)

  3. arXiv:2404.19591  [pdf, other]

    cs.DB cs.LG cs.SE

    Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

    Authors: Stefan Grafberger, Paul Groth, Sebastian Schelter

    Abstract: Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improve…

    Submitted 30 April, 2024; originally announced April 2024.

    ACM Class: H.2; H.2.8; H.4; D.2.6; I.2

  4. arXiv:2404.17000  [pdf, other]

    cs.CL cs.AI

    Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models

    Authors: Bradley P. Allen, Paul T. Groth

    Abstract: A backbone of knowledge graphs are their class membership relations, which assign entities to a given class. As part of the knowledge engineering process, we propose a new method for evaluating the quality of these relations by processing descriptions of a given entity and class using a zero-shot chain-of-thought classifier that uses a natural language intensional definition of a class. We evaluat…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: 11 pages, 1 figure, 2 tables, accepted at the European Semantic Web Conference Special Track on Large Language Models for Knowledge Engineering, Hersonissos, Crete, GR, May 2024, for associated code and data, see https://github.com/bradleypallen/evaluating-kg-class-memberships-using-llms

    ACM Class: I.2.7; I.2.4

  5. arXiv:2404.03732  [pdf, other]

    cs.CL cs.AI

    SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection

    Authors: Bradley P. Allen, Fina Polat, Paul Groth

    Abstract: We describe the University of Amsterdam Intelligent Data Engineering Lab team's entry for the SemEval-2024 Task 6 competition. The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning with large language models (LLMs) to build classifiers for hallucination detection, and extends that work through the incorporation of context-specific definition of task,…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: 6 pages, 6 figures, 4 tables, camera-ready copy, accepted to the 18th International Workshop on Semantic Evaluation (SemEval-2024), for associated code and data see https://github.com/bradleypallen/shroom

  6. arXiv:2403.18133  [pdf, other]

    cs.LG cs.AI

    AE SemRL: Learning Semantic Association Rules with Autoencoders

    Authors: Erkan Karabulut, Victoria Degeler, Paul Groth

    Abstract: Association Rule Mining (ARM) is the task of learning associations among data features in the form of logical rules. Mining association rules from high-dimensional numerical data, for example, time series data from a large number of sensors in a smart environment, is a computationally intensive task. In this study, we propose an Autoencoder-based approach to learn and extract association rules fro…

    Submitted 26 March, 2024; originally announced March 2024.

  7. arXiv:2402.07926  [pdf]

    cs.HC cs.CY cs.DL cs.IR

    From Data Creator to Data Reuser: Distance Matters

    Authors: Christine L. Borgman, Paul T. Groth

    Abstract: Sharing research data is complex, labor-intensive, expensive, and requires infrastructure investments by multiple stakeholders. Open science policies focus on data release rather than on data reuse, yet reuse is also difficult, expensive, and may never occur. Investments in data management could be made more wisely by considering who might reuse data, how, why, for what purposes, and when. Data cr…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: 30 pages, consisting of Table of Contents, Abstract, 20 page narrative, 1 box, 10 pages references. Original work

  8. arXiv:2311.13806  [pdf, other]

    cs.DB cs.CL cs.LG

    AdaTyper: Adaptive Semantic Column Type Detection

    Authors: Madelon Hulsebos, Paul Groth, Çağatay Demiralp

    Abstract: Understanding the semantics of relational tables is instrumental for automation in data exploration and preparation systems. A key source for understanding a table is the semantics of its columns. With the rise of deep learning, learned table representations are now available, which can be applied for semantic type detection and achieve good performance on benchmarks. Nevertheless, we observe a ga…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: Submitted to VLDB'24

  9. Parking Spot Classification based on surround view camera system

    Authors: Andy Xiao, Deep Doshi, Lihao Wang, Harsha Gorantla, Thomas Heitzmann, Peter Groth

    Abstract: Surround-view fisheye cameras are commonly used for near-field sensing in automated driving scenarios, including urban driving and auto valet parking. Four fisheye cameras, one on each side, are sufficient to cover 360° around the vehicle capturing the entire near-field region. Based on surround view cameras, there has been much research on parking slot detection with main focus on the occupancy s…

    Submitted 5 October, 2023; originally announced October 2023.

    Comments: SPIE Optical Engineering + Applications, 2023, San Diego, California, United States. Proc. SPIE 12675, Applications of Machine Learning 2023

  10. arXiv:2310.07736  [pdf, other]

    cs.DB cs.LG

    Observatory: Characterizing Embeddings of Relational Tables

    Authors: Tianji Cong, Madelon Hulsebos, Zhenjie Sun, Paul Groth, H. V. Jagadish

    Abstract: Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model fo…

    Submitted 27 January, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Camera ready of VLDB 2024

  11. arXiv:2310.07348  [pdf, other]

    cs.AI

    Semantic Association Rule Learning from Time Series Data and Knowledge Graphs

    Authors: Erkan Karabulut, Victoria Degeler, Paul Groth

    Abstract: Digital Twins (DT) are a promising concept in cyber-physical systems research due to their advanced features including monitoring and automated reasoning. Semantic technologies such as Knowledge Graphs (KG) are recently being utilized in DTs especially for information modelling. Building on this move, this paper proposes a pipeline for semantic association rule learning in DTs using KGs and time s…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: This paper is accepted to SemIIM23: 2nd International Workshop on Semantic Industrial Information Modelling, 7th November 2023, Athens, Greece, co-located with 22nd International Semantic Web Conference (ISWC 2023)

    Report number: https://ceur-ws.org/Vol-3647/SemIIM2023_paper_3.pdf

  12. arXiv:2310.00637  [pdf, other]

    cs.AI cs.CL

    Knowledge Engineering using Large Language Models

    Authors: Bradley P. Allen, Lise Stork, Paul Groth

    Abstract: Knowledge engineering is a discipline that focuses on the creation and maintenance of processes that generate and apply knowledge. Traditionally, knowledge engineering approaches have focused on knowledge expressed in formal languages. The emergence of large language models and their capabilities to effectively work with natural language, in its broadest sense, raises questions about the foundatio…

    Submitted 1 October, 2023; originally announced October 2023.

    Comments: 19 pages, 2 figures, accepted in Transactions on Graph Data and Knowledge

  13. Ontologies in Digital Twins: A Systematic Literature Review

    Authors: Erkan Karabulut, Salvatore F. Pileggi, Paul Groth, Victoria Degeler

    Abstract: Digital Twins (DT) facilitate monitoring and reasoning processes in cyber-physical systems. They have progressively gained popularity over the past years because of intense research activity and industrial advancements. Cognitive Twins is a novel concept, recently coined to refer to the involvement of Semantic Web technology in DTs. Recent studies address the relevance of ontologies and knowledge…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: The Systematic Literature Review (SLR) is submitted to Future Generation Computer System journal's Special Issue on Digital Twin for Future Networks and Emerging IoT Applications (2023)

  14. arXiv:2308.02622  [pdf, other]

    cs.LG cs.AI cs.IR cs.SI

    Harnessing the Web and Knowledge Graphs for Automated Impact Investing Scoring

    Authors: Qingzhi Hu, Daniel Daza, Laurens Swinkels, Kristina Ūsaitė, Robbert-Jan 't Hoen, Paul Groth

    Abstract: The Sustainable Development Goals (SDGs) were introduced by the United Nations in order to encourage policies and activities that help guarantee human prosperity and sustainability. SDG frameworks produced in the finance industry are designed to provide scores that indicate how well a company aligns with each of the 17 SDGs. This scoring enables a consistent assessment of investments that have the…

    Submitted 4 August, 2023; originally announced August 2023.

    Comments: Presented at the KDD 2023 Workshop - Fragile Earth: AI for Climate Sustainability

  15. arXiv:2307.06698  [pdf, other]

    cs.AI cs.LG

    IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation

    Authors: Thiviyan Thanapalasingam, Emile van Krieken, Peter Bloem, Paul Groth

    Abstract: Knowledge Graph Embedding (KGE) models are used to learn continuous representations of entities and relations. A key task in the literature is predicting missing links between entities. However, Knowledge Graphs are not just sets of links but also have semantics underlying their structure. Semantics is crucial in several downstream tasks, such as query answering or reasoning. We introduce the subg…

    Submitted 25 August, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

  16. Evaluating FAIR Digital Object and Linked Data as distributed object systems

    Authors: Stian Soiland-Reyes, Carole Goble, Paul Groth

    Abstract: FAIR Digital Object (FDO) is an emerging concept that is highlighted by European Open Science Cloud (EOSC) as a potential candidate for building an ecosystem of machine-actionable research outputs. In this work we systematically evaluate FDO and its implementations as a global distributed object system, by using five different conceptual frameworks that cover interoperability, middleware, FAIR prin…

    Submitted 17 June, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

    Comments: 39 pages, submitted to PeerJ CS

    ACM Class: H.3; C.2

    Journal ref: PeerJ Computer Science 10 (2024) e1781

  17. arXiv:2306.03606  [pdf, other]

    cs.AI

    BioBLP: A Modular Framework for Learning on Multimodal Biomedical Knowledge Graphs

    Authors: Daniel Daza, Dimitrios Alivanistos, Payal Mitra, Thom Pijnenburg, Michael Cochez, Paul Groth

    Abstract: Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences, or molecular graphs. Other works incorporate…

    Submitted 6 June, 2023; originally announced June 2023.

  18. arXiv:2305.16877  [pdf, other]

    cs.LG cs.AI

    Distributional Reinforcement Learning with Dual Expectile-Quantile Regression

    Authors: Sami Jullien, Romain Deffayet, Jean-Michel Renders, Paul Groth, Maarten de Rijke

    Abstract: Distributional reinforcement learning (RL) has proven useful in multiple benchmarks as it enables approximating the full distribution of returns and makes a better use of environment samples. The commonly used quantile regression approach to distributional RL -- based on asymmetric $L_1$ losses -- provides a flexible and effective way of learning arbitrary return distributions. In practice, it is…

    Submitted 18 March, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: 16 pages, 3 figures, 1 algorithm

    ACM Class: I.2.8; G.3

  19. arXiv:2208.06662  [pdf, other]

    cs.CV

    Self-Contained Entity Discovery from Captioned Videos

    Authors: Melika Ayoughi, Pascal Mettes, Paul Groth

    Abstract: This paper introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g. faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating faces with entity labels…

    Submitted 13 August, 2022; originally announced August 2022.

  20. arXiv:2208.04609  [pdf, other]

    cs.LG cs.SI

    E2EG: End-to-End Node Classification Using Graph Topology and Text-based Node Attributes

    Authors: Tu Anh Dinh, Jeroen den Boef, Joran Cornelisse, Paul Groth

    Abstract: Node classification utilizing text-based node attributes has many real-world applications, ranging from prediction of paper topics in academic citation graphs to classification of user characteristics in social media networks. State-of-the-art node classification frameworks, such as GIANT, use a two-stage pipeline: first embedding the text attributes of graph nodes then feeding the resulting embed…

    Submitted 26 September, 2023; v1 submitted 9 August, 2022; originally announced August 2022.

    Comments: Accepted to MLoG - IEEE International Conference on Data Mining Workshops ICDMW 2023

  21. arXiv:2205.15455  [pdf, other]

    cs.LG cs.AI

    A Simulation Environment and Reinforcement Learning Method for Waste Reduction

    Authors: Sami Jullien, Mozhdeh Ariannezhad, Paul Groth, Maarten de Rijke

    Abstract: In retail (e.g., grocery stores, apparel shops, online retailers), inventory managers have to balance short-term risk (no items to sell) with long-term risk (over-ordering leading to product waste). This balancing task is made especially hard due to the lack of information about future customer purchases. In this paper, we study the problem of restocking a grocery store's inventory with perishable…

    Submitted 26 May, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

    Comments: 20 pages, 4 figures, 4 tables, 3 listings, 1 algorithm

    ACM Class: I.2.1; I.6.7

    Journal ref: TMLR, May 2023

  22. arXiv:2109.05173  [pdf, other]

    cs.DB cs.HC cs.LG

    Making Table Understanding Work in Practice

    Authors: Madelon Hulsebos, Sneha Gathani, James Gale, Isil Dillig, Paul Groth, Çağatay Demiralp

    Abstract: Understanding the semantics of tables at scale is crucial for tasks like data integration, preparation, and search. Table understanding methods aim at detecting a table's topic, semantic column types, column relations, or entities. With the rise of deep learning, powerful models have been developed for these tasks with excellent accuracy on benchmarks. However, we observe that there exists a gap b…

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: Submitted to CIDR'22

  23. Packaging research artefacts with RO-Crate

    Authors: Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble

    Abstract: An increasing number of researchers support reproducibility by including pointers to and descriptions of datasets, software and methods in their publications. However, scientific articles may be ambiguous, incomplete and difficult to process by automated systems. In this paper we introduce RO-Crate, an open, community-driven, and lightweight approach to packaging research artefacts along with thei…

    Submitted 6 December, 2021; v1 submitted 14 August, 2021; originally announced August 2021.

    Comments: 44 pages. Accepted for Data Science

    ACM Class: H.1.1; H.3.2

    Journal ref: Data Science 2022

  24. Relational Graph Convolutional Networks: A Closer Look

    Authors: Thiviyan Thanapalasingam, Lucas van Berkel, Peter Bloem, Paul Groth

    Abstract: In this paper, we describe a reproduction of the Relational Graph Convolutional Network (RGCN). Using our reproduction, we explain the intuition behind the model. Our reproduction results empirically validate the correctness of our implementations using benchmark Knowledge Graph datasets on node classification and link prediction tasks. Our explanation provides a friendly understanding of the diff…

    Submitted 21 July, 2021; originally announced July 2021.

  25. arXiv:2106.07258  [pdf, other]

    cs.DB cs.LG

    GitTables: A Large-Scale Corpus of Relational Tables

    Authors: Madelon Hulsebos, Çağatay Demiralp, Paul Groth

    Abstract: The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, w…

    Submitted 12 April, 2023; v1 submitted 14 June, 2021; originally announced June 2021.

  26. arXiv:2101.01353  [pdf, other]

    cs.CL cs.AI

    Reinforcement Learning based Collective Entity Alignment with Adaptive Features

    Authors: Weixin Zeng, Xiang Zhao, Jiuyang Tang, Xuemin Lin, Paul Groth

    Abstract: Entity alignment (EA) is the task of identifying the entities that refer to the same real-world object but are located in different knowledge graphs (KGs). For entities to be aligned, existing EA solutions treat them separately and generate alignment results as ranked lists of entities on the other side. Nevertheless, this decision-making paradigm fails to take into account the interdependence amo…

    Submitted 5 January, 2021; originally announced January 2021.

    Comments: Accepted by ACM TOIS

  27. arXiv:2011.08903  [pdf, other]

    cs.CL

    Towards Olfactory Information Extraction from Text: A Case Study on Detecting Smell Experiences in Novels

    Authors: Ryan Brate, Paul Groth, Marieke van Erp

    Abstract: Environmental factors determine the smells we perceive, but societal factors shape the importance, sentiment and biases we give to them. Descriptions of smells in text, or as we call them 'smell experiences', offer a window into these factors, but they must first be identified. To the best of our knowledge, no tool exists to extract references to smell experiences from text. In this paper,…

    Submitted 6 December, 2020; v1 submitted 17 November, 2020; originally announced November 2020.

    Comments: Accepted to The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2020). Barcelona, Spain. December 2020.

    ACM Class: I.2.7

  28. arXiv:2011.03009  [pdf, other]

    math.NA cs.CE physics.comp-ph

    Accelerating frequency-domain numerical methods for weakly nonlinear focused ultrasound using nested meshes

    Authors: Samuel P. Groth, Pierre Gélat, Seyyed R. Haqshenas, Nader Saffari, Elwin van 't Wout, Timo Betcke, Garth N. Wells

    Abstract: The numerical simulation of weakly nonlinear ultrasound is important in treatment planning for focused ultrasound (FUS) therapies. However, the large domain sizes and generation of higher harmonics at the focus make these problems extremely computationally demanding. Numerical methods typically employ a uniform mesh fine enough to resolve the highest harmonic present in the problem, leading to a v…

    Submitted 22 July, 2021; v1 submitted 5 November, 2020; originally announced November 2020.

    Journal ref: The Journal of the Acoustical Society of America 150, 441 (2021)

  29. arXiv:2010.08269  [pdf, other]

    cs.IR cs.AI cs.CL cs.LG

    Effective Distributed Representations for Academic Expert Search

    Authors: Mark Berger, Jakub Zavrel, Paul Groth

    Abstract: Expert search aims to find and rank experts based on a user's query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of academic papers (i.e. embeddings) impact academic expert retrieval. We use the Microsoft Academic Graph dataset and experiment with different configurations of a…

    Submitted 16 October, 2020; originally announced October 2020.

    Comments: To be published in the Scholarly Document Processing 2020 Workshop @ EMNLP 2020 proceedings

  30. arXiv:2010.03496  [pdf, other]

    cs.CL cs.AI cs.LG

    Inductive Entity Representations from Text via Link Prediction

    Authors: Daniel Daza, Michael Cochez, Paul Groth

    Abstract: Knowledge Graphs (KG) are of vital importance for multiple applications on the web, including information retrieval, recommender systems, and metadata annotation. Regardless of whether they are built manually by domain experts or with automatic pipelines, KGs are often incomplete. Recent work has begun to explore the use of textual descriptions available in knowledge graphs to learn vector represe…

    Submitted 14 April, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

  31. arXiv:2004.07917  [pdf, ps, other]

    cs.DB cs.CY cs.GL

    Knowledge Scientists: Unlocking the data-driven organization

    Authors: George Fletcher, Paul Groth, Juan Sequeda

    Abstract: Organizations across all sectors are increasingly undergoing deep transformation and restructuring towards data-driven operations. The central role of data highlights the need for reliable and clean data. Unreliable, erroneous, and incomplete data lead to critical bottlenecks in processing pipelines and, ultimately, service failures, which are disastrous for the competitive performance of the orga…

    Submitted 16 April, 2020; originally announced April 2020.

  32. Talking datasets: Understanding data sensemaking behaviours

    Authors: Laura Koesten, Kathleen Gregory, Paul Groth, Elena Simperl

    Abstract: The sharing and reuse of data are seen as critical to solving the most complex problems of today. Despite this potential, relatively little is known about a key step in data reuse: people's behaviours involved in data-centric sensemaking. We aim to address this gap by presenting a mixed-methods study combining in-depth interviews, a think-aloud task and a screen recording analysis with 31 research…

    Submitted 18 July, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

    Comments: 26 pages, 7 figures, 6 tables

  33. Lost or found? Discovering data needed for research

    Authors: Kathleen Gregory, Paul Groth, Andrea Scharnhorst, Sally Wyatt

    Abstract: Finding data is a necessary precursor to being able to reuse data, although relatively little large-scale empirical evidence exists about how researchers discover, make sense of and (re)use data for research. This study presents evidence from the largest known survey investigating how researchers discover and use data that they do not create themselves. We examine the data needs and discovery stra…

    Submitted 2 April, 2020; v1 submitted 1 September, 2019; originally announced September 2019.

    Comments: Harvard Data Science Review (2020)

  34. arXiv:1908.10632  [pdf, other]

    cs.SI cs.DL physics.soc-ph

    A Longitudinal Analysis of University Rankings

    Authors: Friso Selten, Cameron Neylon, Chun-Kai Huang, Paul Groth

    Abstract: Pressured by globalization and the increasing demand for public organisations to be accountable, efficient and transparent, university rankings have become an important tool for assessing the quality of higher education institutions. It is therefore important to carefully assess exactly what these rankings measure. In this paper, the three major global university rankings, The Academic Ranking of…

    Submitted 20 January, 2020; v1 submitted 28 August, 2019; originally announced August 2019.

    Comments: 26 pages

  35. Dataset search: a survey

    Authors: Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez, Emilia Kacprzak, Paul Groth

    Abstract: Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts to data marketplaces, open data portals and data communities. Google recently beta released a search service for datasets, which allows users to discover data s…

    Submitted 3 January, 2019; originally announced January 2019.

    Comments: 20 pages, 153 references

  36. arXiv:1811.06303  [pdf, other]

    cs.CL

    End-to-End Learning for Answering Structured Queries Directly over Text

    Authors: Paul Groth, Antony Scerri, Ron Daniel, Jr., Bradley P. Allen

    Abstract: Structured queries expressed in languages (such as SQL, SPARQL, or XQuery) offer a convenient and explicit way for users to express their information needs for a number of tasks. In this work, we present an approach to answer these directly over text data without storing results in a database. We specifically look at the case of knowledge bases where queries are over entities and the relations bet…

    Submitted 16 November, 2018; v1 submitted 15 November, 2018; originally announced November 2018.

    Comments: 18 pages, 6 figures

    Journal ref: Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG2019)

  37. arXiv:1805.11883  [pdf, ps, other]

    cs.OH

    DATA:SEARCH'18 -- Searching Data on the Web

    Authors: Paul Groth, Laura Koesten, Philipp Mayr, Maarten de Rijke, Elena Simperl

    Abstract: This half day workshop explores challenges in data search, with a particular focus on data on the web. We want to stimulate an interdisciplinary discussion around how to improve the description, discovery, ranking and presentation of structured and semi-structured data, across data formats and domain applications. We welcome contributions describing algorithms and systems, as well as frameworks an…

    Submitted 30 May, 2018; originally announced May 2018.

  38. arXiv:1802.05574  [pdf, other]

    cs.CL

    Open Information Extraction on Scientific Text: An Evaluation

    Authors: Paul Groth, Michael Lauruhn, Antony Scerri, Ron Daniel Jr

    Abstract: Open Information Extraction (OIE) is the task of the unsupervised creation of structured information from text. OIE is often used as a starting point for a number of downstream tasks including knowledge base construction, relation extraction, and question answering. While OIE methods are targeted at being domain independent, they have been evaluated primarily on newspaper, encyclopedic or general…

    Submitted 4 June, 2018; v1 submitted 15 February, 2018; originally announced February 2018.

    Comments: 10 pages

    Journal ref: The 27th International Conference on Computational Linguistics (COLING 2018)

  39. Understanding Data Search as a Socio-technical Practice

    Authors: Kathleen Gregory, Helena Cousijn, Paul Groth, Andrea Scharnhorst, Sally Wyatt

    Abstract: Open research data are heralded as having the potential to increase effectiveness, productivity, and reproducibility in science, but little is known about the actual practices involved in data search. The socio-technical problem of locating data for reuse is often reduced to the technological dimension of designing data search systems. We combine a bibliometric study of the current academic discou…

    Submitted 18 February, 2019; v1 submitted 15 January, 2018; originally announced January 2018.

    Comments: 19 pages, 3 figures, 7 tables

    Journal ref: Journal of Information Science. (2019). 0165551519837182

  40. Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines

    Authors: Kathleen Gregory, Paul Groth, Helena Cousijn, Andrea Scharnhorst, Sally Wyatt

    Abstract: A cross-disciplinary examination of the user behaviours involved in seeking and evaluating data is surprisingly absent from the research data discussion. This review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data. Two analytical frameworks rooted in information retrieval and science technology studies are used to id…

    Submitted 12 March, 2020; v1 submitted 21 July, 2017; originally announced July 2017.

    Journal ref: Journal of the Association for Information Science and Technology. (2019). 70(5), 419-432

  41. arXiv:1611.00217  [pdf]

    cs.DL

    Sources of Change for Modern Knowledge Organization Systems

    Authors: Michael Lauruhn, Paul Groth

    Abstract: Knowledge Organization Systems (e.g. taxonomies and ontologies) continue to contribute benefits in the design of information systems by providing a shared conceptual underpinning for developers, users, and automated systems. However, the standard mechanisms for the management of KOSs changes are inadequate for systems built on top of thousands of data sources or with the involvement of hundreds of…

    Submitted 1 November, 2016; originally announced November 2016.

    Comments: 10 pages, 1 figure

    Journal ref: Knowledge Organization, 43(8), 622-629 (2016)

  42. arXiv:1401.2134  [pdf, other]

    cs.DL astro-ph.IM cs.CY

    10 Simple Rules for the Care and Feeding of Scientific Data

    Authors: Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Mercè Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, Aleksandra Slavkovic

    Abstract: This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review…

    Submitted 9 January, 2014; originally announced January 2014.

    Comments: Accepted in PLOS Computational Biology. This paper was written collaboratively, on the web, in the open, using Authorea. The living version of this article, which includes sources and history, is available at http://www.authorea.com/3410/

  43. On the Formulation of Performant SPARQL Queries

    Authors: Antonis Loizou, Paul Groth

    Abstract: The combination of the flexibility of RDF and the expressiveness of SPARQL provides a powerful mechanism to model, integrate and query data. However, these properties also mean that it is nontrivial to write performant SPARQL queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to…

    Submitted 2 April, 2013; originally announced April 2013.

  44. Theoretical And Technological Building Blocks For An Innovation Accelerator

    Authors: Frank van Harmelen, George Kampis, Katy Borner, Peter van den Besselaar, Erik Schultes, Carole Goble, Paul Groth, Barend Mons, Stuart Anderson, Stefan Decker, Conor Hayes, Thierry Buecheler, Dirk Helbing

    Abstract: The scientific system that we use today was devised centuries ago and is inadequate for our current ICT-based society: the peer review system encourages conservatism, journal publications are monolithic and slow, data is often not available to other scientists, and the independent validation of results is limited. Building on the Innovation Accelerator paper by Helbing and Balietti (2011) this pap…

    Submitted 4 October, 2012; originally announced October 2012.

  45. arXiv:1006.4860  [pdf]

    astro-ph.IM cs.DC

    The Application of Cloud Computing to the Creation of Image Mosaics and Management of Their Provenance

    Authors: G. Bruce Berriman, Ewa Deelman, Paul Groth, Gideon Juve

    Abstract: We have used the Montage image mosaic engine to investigate the cost and performance of processing images on the Amazon EC2 cloud, and to inform the requirements that higher-level products impose on provenance management technologies. We will present a detailed comparison of the performance of Montage on the cloud and on the Abe high performance cluster at the National Center for Supercomputing Ap…

    Submitted 24 June, 2010; originally announced June 2010.

    Comments: 15 pages, 3 figures

    Journal ref: SPIE Conference 7740: Software and Cyberinfrastructure for Astronomy (2010)

  46. arXiv:1005.4457  [pdf, other]

    astro-ph.IM cs.IR

    Pipeline-Centric Provenance Model

    Authors: Paul Groth, Ewa Deelman, Gideon Juve, Gaurang Mehta, Bruce Berriman

    Abstract: In this paper we propose a new provenance model which is tailored to a class of workflow-based applications. We motivate the approach with use cases from the astronomy community. We generalize the class of applications the approach is relevant to and propose a pipeline-centric provenance model. Finally, we evaluate the benefits in terms of storage needed by the approach when applied to an astronom…

    Submitted 24 May, 2010; originally announced May 2010.

    Comments: 9 pages, 4 figures

    Journal ref: Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, 2009

  47. arXiv:1005.2643  [pdf]

    astro-ph.IM cs.DL

    Metadata and provenance management

    Authors: Ewa Deelman, Bruce Berriman, Ann Chervenak, Oscar Corcho, Paul Groth, Luc Moreau

    Abstract: Scientists today collect, analyze, and generate terabytes and petabytes of data. These data are often shared and further processed and analyzed among collaborators. In order to facilitate sharing and data interpretation, data need to carry with them metadata about how the data were collected or generated, and provenance information about how the data were processed. This chapter describes metadata an…

    Submitted 14 May, 2010; originally announced May 2010.

    Journal ref: Scientific Data Management: Challenges, Existing Technology, and Deployment (Arie Shoshani and Doron Rotem, Editors) CRC Press 2010