-
Large Language Models as Evaluators for Scientific Synthesis
Authors:
Julia Evans,
Jennifer D'Souza,
Sören Auer
Abstract:
Our study explores how well the state-of-the-art Large Language Models (LLMs), like GPT-4 and Mistral, can assess the quality of scientific summaries or, more fittingly, scientific syntheses, comparing their evaluations to those of human annotators. We used a dataset of 100 research questions and their syntheses made by GPT-4 from abstracts of five related papers, checked against human quality ratings. The study evaluates both the closed-source GPT-4 and the open-source Mistral model's ability to rate these summaries and provide reasons for their judgments. Preliminary results show that LLMs can offer logical explanations that somewhat match the quality ratings, yet a deeper statistical analysis shows a weak correlation between LLM and human ratings, suggesting the potential and current limitations of LLMs in scientific synthesis evaluation.
Submitted 3 July, 2024;
originally announced July 2024.
-
Effective Context Selection in LLM-based Leaderboard Generation: An Empirical Study
Authors:
Salomon Kabongo,
Jennifer D'Souza,
Sören Auer
Abstract:
This paper explores the impact of context selection on the efficiency of Large Language Models (LLMs) in generating Artificial Intelligence (AI) research leaderboards, a task defined as the extraction of (Task, Dataset, Metric, Score) quadruples from scholarly articles. By framing this challenge as a text generation objective and employing instruction finetuning with the FLAN-T5 collection, we introduce a novel method that surpasses traditional Natural Language Inference (NLI) approaches in adapting to new developments without a predefined taxonomy. Through experimentation with three distinct context types of varying selectivity and length, our study demonstrates the importance of effective context selection in enhancing LLM accuracy and reducing hallucinations, providing a new pathway for the reliable and efficient generation of AI leaderboards. This contribution not only advances the state of the art in leaderboard generation but also sheds light on strategies to mitigate common challenges in LLM-based information extraction.
Submitted 6 June, 2024;
originally announced July 2024.
-
Scholarly Question Answering using Large Language Models in the NFDI4DataScience Gateway
Authors:
Hamed Babaei Giglou,
Tilahun Abedissa Taffa,
Rana Abdullah,
Aida Usmanova,
Ricardo Usbeck,
Jennifer D'Souza,
Sören Auer
Abstract:
This paper introduces a scholarly Question Answering (QA) system on top of the NFDI4DataScience Gateway, employing a Retrieval Augmented Generation-based (RAG) approach. The NFDI4DS Gateway, as a foundational framework, offers a unified and intuitive interface for querying various scientific databases using federated search. The RAG-based scholarly QA, powered by a Large Language Model (LLM), facilitates dynamic interaction with search results, enhancing filtering capabilities and fostering a conversational engagement with the Gateway search. The effectiveness of both the Gateway and the scholarly QA system is demonstrated through experimental analysis.
Submitted 11 June, 2024;
originally announced June 2024.
-
Exploring the Latest LLMs for Leaderboard Extraction
Authors:
Salomon Kabongo,
Jennifer D'Souza,
Sören Auer
Abstract:
The rapid advancements in Large Language Models (LLMs) have opened new avenues for automating complex tasks in AI research. This paper investigates the efficacy of different LLMs (Mistral 7B, Llama-2, GPT-4-Turbo, and GPT-4.o) in extracting leaderboard information from empirical AI research articles. We explore three types of contextual inputs to the models: DocTAET (Document Title, Abstract, Experimental Setup, and Tabular Information), DocREC (Results, Experiments, and Conclusions), and DocFULL (entire document). Our comprehensive study evaluates the performance of these models in generating (Task, Dataset, Metric, Score) quadruples from research papers. The findings reveal significant insights into the strengths and limitations of each model and context type, providing valuable guidance for future AI research automation efforts.
Submitted 8 July, 2024; v1 submitted 6 June, 2024;
originally announced June 2024.
-
LLMs4OM: Matching Ontologies with Large Language Models
Authors:
Hamed Babaei Giglou,
Jennifer D'Souza,
Felix Engel,
Sören Auer
Abstract:
Ontology Matching (OM) is a critical task in knowledge integration, where aligning heterogeneous ontologies facilitates data interoperability and knowledge sharing. Traditional OM systems often rely on expert knowledge or predictive models, with limited exploration of the potential of Large Language Models (LLMs). We present the LLMs4OM framework, a novel approach to evaluate the effectiveness of LLMs in OM tasks. This framework utilizes two modules for retrieval and matching, respectively, enhanced by zero-shot prompting across three ontology representations: concept, concept-parent, and concept-children. Through comprehensive evaluations using 20 OM datasets from various domains, we demonstrate that LLMs, under the LLMs4OM framework, can match and even surpass the performance of traditional OM systems, particularly in complex matching scenarios. Our results highlight the potential of LLMs to significantly contribute to the field of OM.
Submitted 23 April, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Toward FAIR Semantic Publishing of Research Dataset Metadata in the Open Research Knowledge Graph
Authors:
Raia Abu Ahmad,
Jennifer D'Souza,
Matthäus Zloch,
Wolfgang Otto,
Georg Rehm,
Allard Oelen,
Stefan Dietze,
Sören Auer
Abstract:
Search engines these days can serve datasets as search results. Datasets get picked up by search technologies based on structured descriptions on their official web pages, informed by metadata ontologies such as the Dataset content type of schema.org. Despite this promotion of the content type dataset as a first-class citizen of search results, a vast proportion of datasets, particularly research datasets, still need to be made discoverable and, therefore, largely remain unused. This is due to the sheer volume of datasets released every day and the inability of metadata to reflect a dataset's content and context accurately. This work seeks to improve this situation for a specific class of datasets, namely research datasets, which are the result of research endeavors and are accompanied by a scholarly publication. We propose the ORKG-Dataset content type, a specialized branch of the Open Research Knowledge Graph (ORKG) platform, which provides descriptive information and a semantic model for research datasets, integrating them with their accompanying scholarly publications. This work aims to establish a standardized framework for recording and reporting research datasets within the ORKG-Dataset content type. This, in turn, increases research dataset transparency on the web for their improved discoverability and applied use. In this paper, we present a proposal -- the minimum FAIR, comparable, semantic description of research datasets in terms of salient properties of their supporting publication. We design a specific application of the ORKG-Dataset semantic model based on 40 diverse research datasets on scientific information extraction.
Submitted 12 April, 2024;
originally announced April 2024.
-
Organizing Scientific Knowledge From Energy System Research Using the Open Research Knowledge Graph
Authors:
Oliver Karras,
Jan Göpfert,
Patrick Kuckertz,
Tristan Pelser,
Sören Auer
Abstract:
Engineering sciences, such as energy system research, play an important role in developing solutions to technical, environmental, economic, and social challenges of our modern society. In this context, the transformation of energy systems into climate-neutral systems is one of the key strategies for mitigating climate change. For the transformation of energy systems, engineers model, simulate, and analyze scenarios and transformation pathways to initiate debates about possible transformation strategies. For these debates and research in general, all steps of the research process must be traceable to guarantee the trustworthiness of published results, avoid redundancies, and ensure their social acceptance. However, the analysis of energy systems is an interdisciplinary field, as the investigations of large, complex energy systems often require the use of different software applications and large amounts of heterogeneous data. Engineers must therefore communicate, understand, and (re)use heterogeneous scientific knowledge and data. Although the importance of FAIR scientific knowledge and data in the engineering sciences and energy system research is increasing, little research has been conducted on this topic. When it comes to making scientific knowledge and data from publications, software, and datasets (such as models, scenarios, and simulations) openly available and transparent, energy system research lags behind other research domains. According to Schmitt et al. and Nieße et al., engineers need technical support in the form of infrastructures, services, and terminologies to improve communication, understanding, and (re)use of scientific knowledge and data.
Submitted 24 January, 2024;
originally announced January 2024.
-
Large Language Models for Scientific Information Extraction: An Empirical Study for Virology
Authors:
Mahsa Shamsabadi,
Jennifer D'Souza,
Sören Auer
Abstract:
In this paper, we champion the use of structured and semantic content representation of discourse-based scholarly communication, inspired by tools like Wikipedia infoboxes or structured Amazon product descriptions. These representations provide users with a concise overview, aiding scientists in navigating the dense academic landscape. Our novel automated approach leverages the robust text generation capabilities of LLMs to produce structured scholarly contribution summaries, offering both a practical solution and insights into LLMs' emergent abilities.
For LLMs, the prime focus is on improving their general intelligence as conversational agents. We argue that these models can also be applied effectively in information extraction (IE), specifically in complex IE tasks within terse domains like Science. This paradigm shift replaces the traditional modular, pipelined machine learning approach with a simpler objective expressed through instructions. Our results show that finetuned FLAN-T5 with 1000x fewer parameters than the state-of-the-art GPT-davinci is competitive for the task.
Submitted 18 January, 2024;
originally announced January 2024.
-
Scholarly Knowledge Graph Construction from Published Software Packages
Authors:
Muhammad Haris,
Sören Auer,
Markus Stocker
Abstract:
The value of structured scholarly knowledge for research and society at large is well understood, but producing scholarly knowledge (i.e., knowledge traditionally published in articles) in structured form remains a challenge. We propose an approach for automatically extracting scholarly knowledge from published software packages by static analysis of their metadata and contents (scripts and data) and populating a scholarly knowledge graph with the extracted knowledge. Our approach is based on mining scientific software packages linked to article publications by extracting metadata and analyzing the Abstract Syntax Tree (AST) of the source code to obtain information about the used and produced data as well as operations performed on data. The resulting knowledge graph includes articles, software packages metadata, and computational techniques applied to input data utilized as materials in research work. The knowledge graph also includes the results reported as scholarly knowledge in articles.
Submitted 2 December, 2023;
originally announced December 2023.
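The AST-based step described in the abstract above can be illustrated with a minimal sketch using Python's standard `ast` module. This is not the authors' implementation: the function names and the `READ_CALLS` / `WRITE_CALLS` sets are hypothetical, chosen only to show how walking a script's syntax tree could surface the data it reads, the data it writes, and the operations applied in between.

```python
import ast

# Hypothetical call-name sets; the abstract does not specify the actual rules.
READ_CALLS = {"read_csv", "read_excel", "load"}
WRITE_CALLS = {"to_csv", "to_excel", "save"}


def call_name(node: ast.Call) -> str:
    """Return the simple name of the called function or method, if any."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return ""


def first_string_arg(node: ast.Call):
    """Return the first positional string literal passed to the call, or None."""
    for arg in node.args:
        if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
            return arg.value
    return None


def extract_data_usage(source: str) -> dict:
    """Statically collect data inputs, data outputs, and other operations."""
    usage = {"inputs": [], "outputs": [], "operations": []}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = call_name(node)
        if not name:
            continue
        target = first_string_arg(node)
        if name in READ_CALLS and target:
            usage["inputs"].append(target)
        elif name in WRITE_CALLS and target:
            usage["outputs"].append(target)
        else:
            usage["operations"].append(name)
    return usage


# Illustrative script; the file names are made up for this example.
SAMPLE = '''
import pandas as pd
df = pd.read_csv("measurements.csv")
summary = df.groupby("site").mean()
summary.to_csv("summary.csv")
'''

if __name__ == "__main__":
    print(extract_data_usage(SAMPLE))
```

The sketch never executes the analyzed script; it only parses it, which is what makes the analysis static. A real pipeline would additionally need to resolve variables, imports, and non-literal file paths, none of which is attempted here.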
-
Towards Large-scale Building Attribute Mapping using Crowdsourced Images: Scene Text Recognition on Flickr and Problems to be Solved
Authors:
Yao Sun,
Anna Kruspe,
Liqiu Meng,
Yifan Tian,
Eike J Hoffmann,
Stefan Auer,
Xiao Xiang Zhu
Abstract:
Crowdsourced platforms provide huge amounts of street-view images that contain valuable building information. This work addresses the challenges in applying Scene Text Recognition (STR) in crowdsourced street-view images for building attribute mapping. We use Flickr images, particularly examining texts on building facades. A Berlin Flickr dataset is created, and pre-trained STR models are used for text detection and recognition. Manual checking on a subset of STR-recognized images demonstrates high accuracy. We examined the correlation between STR results and building functions, and analysed instances where texts were recognized on residential buildings but not on commercial ones. Further investigation revealed significant challenges associated with this task, including small text regions in street-view images, the absence of ground truth labels, and mismatches in buildings in Flickr images and building footprints in OpenStreetMap (OSM). To develop city-wide mapping beyond urban hotspot locations, we suggest differentiating the scenarios where STR proves effective while developing appropriate algorithms or bringing in additional data for handling other cases. Furthermore, interdisciplinary collaboration should be undertaken to understand the motivation behind building photography and labeling. The STR-on-Flickr results are publicly available at https://github.com/ya0-sun/STR-Berlin.
Submitted 14 September, 2023;
originally announced September 2023.
-
An Approach to Evaluate User Interfaces in a Scholarly Knowledge Communication Domain
Authors:
Denis Obrezkov,
Allard Oelen,
Sören Auer
Abstract:
The number of research articles produced every day is overwhelming: scholarly knowledge is getting harder to communicate and easier to lose. A possible solution is to represent the information in knowledge graphs: structures representing knowledge in networks of entities, their semantic types, and relationships between them. But this solution has its own drawback: given its very specific task, it requires new methods for designing and evaluating user interfaces. In this paper, we propose an approach for user interface evaluation in the knowledge communication domain. We base our methodology on the well-established Cognitive Walkthrough approach but employ a different set of questions, tailoring the method towards domain-specific needs. We demonstrate our approach on a scholarly knowledge graph implementation called Open Research Knowledge Graph (ORKG).
Submitted 30 August, 2023;
originally announced August 2023.
-
Drones4Good: Supporting Disaster Relief Through Remote Sensing and AI
Authors:
Nina Merkle,
Reza Bahmanyar,
Corentin Henry,
Seyed Majid Azimi,
Xiangtian Yuan,
Simon Schopferer,
Veronika Gstaiger,
Stefan Auer,
Anne Schneibel,
Marc Wieland,
Thomas Kraft
Abstract:
In order to respond effectively in the aftermath of a disaster, emergency services and relief organizations rely on timely and accurate information about the affected areas. Remote sensing has the potential to significantly reduce the time and effort required to collect such information by enabling a rapid survey of large areas. To achieve this, the main challenge is the automatic extraction of relevant information from remotely sensed data. In this work, we show how the combination of drone-based data with deep learning methods enables automated and large-scale situation assessment. In addition, we demonstrate the integration of onboard image processing techniques for the deployment of autonomous drone-based aid delivery. The results show the feasibility of a rapid and large-scale image analysis in the field, and that onboard image processing can increase the safety of drone-based aid deliveries.
Submitted 9 August, 2023;
originally announced August 2023.
-
LLMs4OL: Large Language Models for Ontology Learning
Authors:
Hamed Babaei Giglou,
Jennifer D'Souza,
Sören Auer
Abstract:
We propose the LLMs4OL approach, which utilizes Large Language Models (LLMs) for Ontology Learning (OL). LLMs have shown significant advancements in natural language processing, demonstrating their ability to capture complex language patterns in different knowledge domains. Our LLMs4OL paradigm investigates the following hypothesis: Can LLMs effectively apply their language pattern capturing capability to OL, which involves automatically extracting and structuring knowledge from natural language text? To test this hypothesis, we conduct a comprehensive evaluation using the zero-shot prompting method. We evaluate nine different LLM model families for three main OL tasks: term typing, taxonomy discovery, and extraction of non-taxonomic relations. Additionally, the evaluations encompass diverse genres of ontological knowledge, including lexicosemantic knowledge in WordNet, geographical knowledge in GeoNames, and medical knowledge in UMLS.
Submitted 2 August, 2023; v1 submitted 31 July, 2023;
originally announced July 2023.
-
Divide and Conquer the EmpiRE: A Community-Maintainable Knowledge Graph of Empirical Research in Requirements Engineering
Authors:
Oliver Karras,
Felix Wernlein,
Jil Klünder,
Sören Auer
Abstract:
[Background.] Empirical research in requirements engineering (RE) is a constantly evolving topic, with a growing number of publications. Several papers address this topic using literature reviews to provide a snapshot of its "current" state and evolution. However, these papers have never built on or updated earlier ones, resulting in overlap and redundancy. The underlying problem is the unavailability of data from earlier works. Researchers need technical infrastructures to conduct sustainable literature reviews. [Aims.] We examine the use of the Open Research Knowledge Graph (ORKG) as such an infrastructure to build and publish an initial Knowledge Graph of Empirical research in RE (KG-EmpiRE) whose data is openly available. Our long-term goal is to continuously maintain KG-EmpiRE with the research community to synthesize a comprehensive, up-to-date, and long-term available overview of the state and evolution of empirical research in RE. [Method.] We conduct a literature review using the ORKG to build and publish KG-EmpiRE which we evaluate against competency questions derived from a published vision of empirical research in software (requirements) engineering for 2020 - 2025. [Results.] From 570 papers of the IEEE International Requirements Engineering Conference (2000 - 2022), we extract and analyze data on the reported empirical research and answer 16 out of 77 competency questions. These answers show a positive development towards the vision, but also the need for future improvements. [Conclusions.] The ORKG is a ready-to-use and advanced infrastructure to organize data from literature reviews as knowledge graphs. The resulting knowledge graphs make the data openly available and maintainable by research communities, enabling sustainable literature reviews.
Submitted 29 June, 2023;
originally announced June 2023.
-
A Metadata-Based Ecosystem to Improve the FAIRness of Research Software
Authors:
Patrick Kuckertz,
Jan Göpfert,
Oliver Karras,
David Neuroth,
Julian Schönau,
Rodrigo Pueblas,
Stephan Ferenz,
Felix Engel,
Noah Pflugradt,
Jann M. Weinand,
Astrid Nieße,
Sören Auer,
Detlef Stolten
Abstract:
The reuse of research software is central to research efficiency and academic exchange. The application of software enables researchers with varied backgrounds to reproduce, validate, and expand upon study findings. Furthermore, the analysis of open source code aids in the comprehension, comparison, and integration of approaches. Often, however, no further use occurs because relevant software cannot be found or is incompatible with existing research processes. This results in repetitive software development, which impedes the advancement of individual researchers and entire research communities. In this article, the DataDesc ecosystem is presented, an approach to describing data models of software interfaces with detailed and machine-actionable metadata. In addition to a specialized metadata schema, an exchange format and support tools for easy collection and the automated publishing of software documentation are introduced. This approach practically increases the FAIRness, i.e., the findability, accessibility, interoperability, and thus the reusability, of research software and effectively promotes its impact on research.
Submitted 18 June, 2023;
originally announced June 2023.
-
Evaluating Prompt-based Question Answering for Object Prediction in the Open Research Knowledge Graph
Authors:
Jennifer D'Souza,
Moussab Hrou,
Sören Auer
Abstract:
There have been many recent investigations into prompt-based training of transformer language models for new text genres in low-resource settings. The prompt-based training approach has been found to be effective in generalizing pre-trained or fine-tuned models for transfer to resource-scarce settings. This work, for the first time, reports results on adopting prompt-based training of transformers for scholarly knowledge graph object prediction. The work is unique in the following two main aspects. 1) It deviates from the other works proposing entity and relation extraction pipelines for predicting objects of a scholarly knowledge graph. 2) While other works have tested the method on text genres relatively close to the general knowledge domain, we test the method for a significantly different domain, i.e. scholarly knowledge, in turn testing the linguistic, probabilistic, and factual generalizability of these large-scale transformer models. We find that (i) as expected, transformer models when tested out-of-the-box underperform on a new domain of data, (ii) prompt-based training of the models achieves performance boosts of up to 40% in a relaxed evaluation setting, and (iii) testing the models on a starkly different domain, even with a clever training objective in a low-resource setting, makes evident the domain knowledge capture gap, offering an empirically verified incentive for investing more attention and resources in the scholarly domain in the context of transformer models.
Submitted 11 June, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.
-
ORKG-Leaderboards: A Systematic Workflow for Mining Leaderboards as a Knowledge Graph
Authors:
Salomon Kabongo,
Jennifer D'Souza,
Sören Auer
Abstract:
The purpose of this work is to describe the Orkg-Leaderboard software designed to extract leaderboards defined as Task-Dataset-Metric tuples automatically from large collections of empirical research papers in Artificial Intelligence (AI). The software can support both the main workflows of scholarly publishing, viz. as LaTeX files or as PDF files. Furthermore, the system is integrated with the Open Research Knowledge Graph (ORKG) platform, which fosters the machine-actionable publishing of scholarly findings. Thus the system output, when integrated within the ORKG's supported Semantic Web infrastructure of representing machine-actionable 'resources' on the Web, enables: 1) broadly, the integration of empirical results of researchers across the world, thus enabling transparency in empirical research with the potential to also be complete, contingent on the underlying data source(s) of publications; and 2) specifically, researchers to track the progress in AI with an overview of the state-of-the-art (SOTA) across the most common AI tasks and their corresponding datasets via dynamic ORKG frontend views leveraging tables and visualization charts over the machine-actionable data. Our best model achieves performances above 90% F1 on the leaderboard extraction task, thus proving Orkg-Leaderboards a practically viable tool for real-world usage. Going forward, in a sense, Orkg-Leaderboards transforms the leaderboard extraction task to an automated digitalization task, which has been, for a long time in the community, a crowdsourced endeavor.
Submitted 10 May, 2023;
originally announced May 2023.
-
Evaluating BERT-based Scientific Relation Classifiers for Scholarly Knowledge Graph Construction on Digital Library Collections
Authors:
Ming Jiang,
Jennifer D'Souza,
Sören Auer,
J. Stephen Downie
Abstract:
The rapid growth of research publications has placed great demands on digital libraries (DL) for advanced information management technologies. To cater to these demands, techniques relying on knowledge-graph structures are being advocated. In such graph-based pipelines, inferring semantic relations between related scientific concepts is a crucial step. Recently, BERT-based pre-trained models have been popularly explored for automatic relation classification. Despite significant progress, most of them were evaluated in different scenarios, which limits their comparability. Furthermore, existing methods are primarily evaluated on clean texts, which ignores the digitization context of early scholarly publications in terms of machine scanning and optical character recognition (OCR). In such cases, the texts may contain OCR noise, in turn creating uncertainty about existing classifiers' performances. To address these limitations, we started by creating OCR-noisy texts based on three clean corpora. Given these parallel corpora, we conducted a thorough empirical evaluation of eight BERT-based classification models by focusing on three factors: (1) BERT variants; (2) classification strategies; and (3) OCR noise impacts. Experiments on clean data show that the domain-specific pre-trained BERT is the best variant to identify scientific relations. The strategy of predicting a single relation each time outperforms the one simultaneously identifying multiple relations in general. The optimal classifier's performance can decline by around 10% to 20% in F-score on the noisy corpora. Insights discussed in this study can help DL stakeholders select techniques for building optimal knowledge-graph-based systems.
Submitted 3 May, 2023;
originally announced May 2023.
-
Zero-shot Entailment of Leaderboards for Empirical AI Research
Authors:
Salomon Kabongo,
Jennifer D'Souza,
Sören Auer
Abstract:
We present a large-scale empirical investigation of the zero-shot learning phenomena in a specific recognizing textual entailment (RTE) task category, i.e. the automated mining of leaderboards for Empirical AI Research. The prior reported state-of-the-art models for leaderboards extraction formulated as an RTE task, in a non-zero-shot setting, are promising with above 90% reported performances. However, a central research question remains unexamined: did the models actually learn entailment? Thus, for the experiments in this paper, two prior reported state-of-the-art models are tested out-of-the-box for their ability to generalize or their capacity for entailment, given leaderboard labels that were unseen during training. We hypothesize that if the models learned entailment, their zero-shot performances can be expected to be moderately high as well--perhaps, concretely, better than chance. As a result of this work, a zero-shot labeled dataset is created via distant labeling formulating the leaderboard extraction RTE task.
Submitted 29 March, 2023;
originally announced March 2023.
-
Describing and Organizing Semantic Web and Machine Learning Systems in the SWeMLS-KG
Authors:
Fajar J. Ekaputra,
Majlinda Llugiqi,
Marta Sabou,
Andreas Ekelhart,
Heiko Paulheim,
Anna Breit,
Artem Revenko,
Laura Waltersdorfer,
Kheir Eddine Farfar,
Sören Auer
Abstract:
In line with the general trend in artificial intelligence research to create intelligent systems that combine learning and symbolic components, a new sub-area has emerged that focuses on combining machine learning (ML) components with techniques developed by the Semantic Web (SW) community - Semantic Web Machine Learning (SWeML for short). Due to its rapid growth and impact on several communities in the last two decades, there is a need to better understand the space of these SWeML Systems, their characteristics, and trends. Yet, surveys that adopt principled and unbiased approaches are missing. To fill this gap, we performed a systematic study and analyzed nearly 500 papers published in the last decade in this area, where we focused on evaluating architectural and application-specific features. Our analysis identified a rapidly growing interest in SWeML Systems, with a high impact on several application domains and tasks. Catalysts for this rapid growth are the increased application of deep learning and knowledge graph technologies. By leveraging the in-depth understanding of this area acquired through this study, a further key contribution of this paper is a classification system for SWeML Systems, which we publish as an ontology.
Submitted 27 March, 2023;
originally announced March 2023.
-
A Next-Generation Digital Procurement Workspace Focusing on Information Integration, Automation, Analytics, and Sustainability
Authors:
Jan-David Stütz,
Oliver Karras,
Allard Oelen,
Sören Auer
Abstract:
Recent events such as wars, sanctions, pandemics, and climate change have shown the importance of proper supply network management. A key step in managing supply networks is procurement. We present an approach for realizing a next-generation procurement workspace that aims to facilitate resilience and sustainability. To achieve this, the approach encompasses a novel way of information integration, automation tools as well as analytical techniques. As a result, the procurement can be viewed from the perspective of the environmental impact, comprising and aggregating sustainability scores along the supply chain. We suggest and present an implementation of our approach, which is meanwhile used in a global Fortune 500 company. We further present the results of an empirical evaluation study, where we performed in-depth interviews with the stakeholders of the novel procurement platform to validate its adequacy, usability, and innovativeness.
Submitted 22 March, 2023; v1 submitted 17 February, 2023;
originally announced March 2023.
-
Scholarly Knowledge Extraction from Published Software Packages
Authors:
Muhammad Haris,
Markus Stocker,
Sören Auer
Abstract:
A plethora of scientific software packages are published in repositories, e.g., Zenodo and figshare. These software packages are crucial for the reproducibility of published research. As an additional route to scholarly knowledge graph construction, we propose an approach for automated extraction of machine actionable (structured) scholarly knowledge from published software packages by static analysis of their (meta)data and contents (in particular scripts in languages such as Python). The approach can be summarized as follows. First, we extract metadata information (software description, programming languages, related references) from software packages by leveraging the Software Metadata Extraction Framework (SOMEF) and the GitHub API. Second, we analyze the extracted metadata to find the research articles associated with the corresponding software repository. Third, for software contained in published packages, we create and analyze the Abstract Syntax Tree (AST) representation to extract information about the procedures performed on data. Fourth, we search the extracted information in the full text of related articles to constrain the extracted information to scholarly knowledge, i.e. information published in the scholarly literature. Finally, we publish the extracted machine actionable scholarly knowledge in the Open Research Knowledge Graph (ORKG).
Submitted 15 December, 2022;
originally announced December 2022.
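The AST analysis step described in the abstract above can be illustrated with a minimal sketch using Python's standard `ast` module. This is not the authors' implementation: the function name `summarize_script` and the sample script are hypothetical, and the sketch only collects imported modules and called function names, which is a much simpler notion of "procedures performed on data" than the paper's pipeline may use.

```python
import ast

# Hypothetical sample script standing in for a file from a published package.
SAMPLE_SCRIPT = """
import pandas as pd
df = pd.read_csv("data.csv")
summary = df.groupby("site").mean()
"""


def summarize_script(source: str) -> dict[str, list[str]]:
    """Parse Python source into an AST and list imported modules and called names."""
    tree = ast.parse(source)
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import pandas as pd` -> "pandas"
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from x import y` -> "x"
            imports.add(node.module)
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                # Method or module-level call such as pd.read_csv(...)
                calls.add(func.attr)
            elif isinstance(func, ast.Name):
                # Plain call such as print(...)
                calls.add(func.id)
    return {"imports": sorted(imports), "calls": sorted(calls)}


if __name__ == "__main__":
    print(summarize_script(SAMPLE_SCRIPT))
```

Because the script is only parsed and never executed, this kind of static analysis works without installing the package's dependencies (here, pandas is never imported by the analyzer itself).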
-
MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles
Authors:
Mohamad Yaser Jaradeh,
Markus Stocker,
Sören Auer
Abstract:
Information extraction from scholarly articles is a challenging task due to the sizable document length and implicit information hidden in text, figures, and citations. Scholarly information extraction has various applications in exploration, archival, and curation services for digital libraries and knowledge management systems. We present MORTY, an information extraction technique that creates structured summaries of text from scholarly articles. Our approach condenses the article's full-text to property-value pairs as a segmented text snippet called structured summary. We also present a sizable scholarly dataset combining structured summaries retrieved from a scholarly knowledge graph and corresponding publicly available scientific articles, which we openly publish as a resource for the research community. Our results show that structured summarization is a suitable approach for targeted information extraction that complements other commonly used methods such as question answering and named entity recognition.
Submitted 11 December, 2022;
originally announced December 2022.
-
KGMM -- A Maturity Model for Scholarly Knowledge Graphs based on Intertwined Human-Machine Collaboration
Authors:
Hassan Hussein,
Allard Oelen,
Oliver Karras,
Sören Auer
Abstract:
Knowledge Graphs (KG) have gained increasing importance in science, business and society in the last years. However, most knowledge graphs were either extracted or compiled from existing sources. There are only relatively few examples where knowledge graphs were genuinely created by an intertwined human-machine collaboration. Also, since the quality of data and knowledge graphs is of paramount importance, a number of data quality assessment models have been proposed. However, they do not take the specific aspects of intertwined human-machine curated knowledge graphs into account. In this work, we propose a graded maturity model for scholarly knowledge graphs (KGMM), which specifically focuses on aspects related to the joint, evolutionary curation of knowledge graphs for digital libraries. Our model comprises 5 maturity stages with 20 quality measures. We demonstrate the implementation of our model in a large scale scholarly knowledge graph curation effort.
Submitted 22 November, 2022;
originally announced November 2022.
-
Clustering Semantic Predicates in the Open Research Knowledge Graph
Authors:
Omar Arab Oghli,
Jennifer D'Souza,
Sören Auer
Abstract:
When semantically describing knowledge graphs (KGs), users have to make a critical choice of a vocabulary (i.e. predicates and resources). The success of KG building is determined by the convergence of shared vocabularies so that meaning can be established. The typical lifecycle for a new KG construction can be defined as follows: nascent phases of graph construction experience terminology divergence, while later phases of graph construction experience terminology convergence and reuse. In this paper, we describe our approach tailoring two AI-based clustering algorithms for recommending predicates (in RDF statements) about resources in the Open Research Knowledge Graph (ORKG) https://orkg.org/. Such a service to recommend existing predicates to semantify new incoming data of scholarly publications is of paramount importance for fostering terminology convergence in the ORKG. Our experiments show very promising results: a high precision with relatively high recall in linear runtime performance. Furthermore, this work offers novel insights into the predicate groups that automatically accrue loosely as generic semantification patterns for semantification of scholarly knowledge spanning 44 research fields.
Submitted 5 October, 2022;
originally announced October 2022.
-
Persistent Identification and Interlinking of FAIR Scholarly Knowledge
Authors:
Muhammad Haris,
Markus Stocker,
Sören Auer
Abstract:
We leverage the Open Research Knowledge Graph - a scholarly infrastructure that supports the creation, curation, and reuse of structured, semantic scholarly knowledge - and present an approach for persistent identification of FAIR scholarly knowledge. We propose a DOI-based persistent identification of ORKG Papers, which are machine-actionable descriptions of the essential information published in scholarly articles. This enables the citability of FAIR scholarly knowledge and its discovery in global scholarly communication infrastructures (e.g., DataCite, OpenAIRE, and ORCID). While publishing, the state of the ORKG Paper is saved and cannot be further edited. To allow for updating published versions, ORKG supports creating new versions, which are linked in provenance chains. We demonstrate the linking of FAIR scholarly knowledge with digital artefacts (articles), agents (researchers) and other objects (organizations). We persistently identify FAIR scholarly knowledge (namely, ORKG Papers and ORKG Comparisons as collections of ORKG Papers) by leveraging DataCite services. Given the existing interoperability between DataCite, Crossref, OpenAIRE and ORCID, sharing metadata with DataCite ensures global findability of FAIR scholarly knowledge in scholarly communication infrastructures.
Submitted 19 September, 2022;
originally announced September 2022.
-
Plumber: A Modular Framework to Create Information Extraction Pipelines
Authors:
Mohamad Yaser Jaradeh,
Kuldeep Singh,
Markus Stocker,
Sören Auer
Abstract:
Information Extraction (IE) tasks are commonly studied topics in various domains of research. Hence, the community continuously produces multiple techniques, solutions, and tools to perform such tasks. However, running those tools and integrating them within existing infrastructure requires time, expertise, and resources. One pertinent task here is triples extraction and linking, where structured triples are extracted from a text and aligned to an existing Knowledge Graph (KG). In this paper, we present PLUMBER, the first framework that allows users to manually and automatically create suitable IE pipelines from a community-created pool of tools to perform triple extraction and alignment on unstructured text. Our approach provides an interactive medium to alter the pipelines and perform IE tasks. A short video to show the working of the framework for different use-cases is available online under: https://www.youtube.com/watch?v=XC9rJNIUv8g
Submitted 3 June, 2022;
originally announced June 2022.
-
Open Research Knowledge Graph: A System Walkthrough
Authors:
Mohamad Yaser Jaradeh,
Allard Oelen,
Manuel Prinz,
Markus Stocker,
Sören Auer
Abstract:
Despite improved digital access to scholarly literature in the last decades, the fundamental principles of scholarly communication remain unchanged and continue to be largely document-based. Scholarly knowledge remains locked in representations that are inadequate for machine processing. The Open Research Knowledge Graph (ORKG) is an infrastructure for representing, curating and exploring scholarly knowledge in a machine actionable manner. We demonstrate the core functionality of ORKG for representing research contributions published in scholarly articles. A video of the demonstration and the system are available online.
Submitted 3 June, 2022;
originally announced June 2022.
-
KnowGraph-PM: a Knowledge Graph based Pricing Model for Semiconductors Supply Chains
Authors:
Nour Ramzy,
Soren Auer,
Javad Chamanara,
Hans Ehm
Abstract:
Semiconductor supply chains are characterized by significant demand fluctuation that increases as one moves up the supply chain, the so-called bullwhip effect. To counteract this, semiconductor manufacturers aim to optimize capacity utilization, to deliver with shorter lead times, and to exploit this to generate revenue. Additionally, in a competitive market, firms seek to maintain customer relationships while applying revenue management strategies such as dynamic pricing. Price changes potentially generate conflicts with customers. In this paper, we present KnowGraph-PM, a knowledge graph-based dynamic pricing model. The semantic model uses the potential of faster delivery and shorter lead times to define premium prices, thus entailing increased profits based on the customer profile. The knowledge graph enables the integration of customer-related information, e.g., customer class and location, with customer order data. The pricing algorithm is realized as a SPARQL query that relies on customer profile and order behavior to determine the corresponding price premium. We evaluate the approach by calculating the revenue generated after applying the pricing algorithm. Based on competency questions that translate to SPARQL queries, we validate the created knowledge graph. We demonstrate that semantic data integration enables customer-tailored revenue management.
Submitted 13 May, 2022;
originally announced May 2022.
-
MARE: Semantic Supply Chain Disruption Management and Resilience Evaluation Framework
Authors:
Nour Ramzy,
Soren Auer,
Hans Ehm,
Javad Chamanara
Abstract:
Supply Chains (SCs) are subject to disruptive events that potentially hinder operational performance. The Disruption Management Process (DMP) relies on the analysis of integrated heterogeneous data sources such as production scheduling, order management and logistics to evaluate the impact of disruptions on the SC. Existing approaches are limited as they address DMP process steps and corresponding data sources in a rather isolated manner, which hinders the systematic handling of a disruption originating anywhere in the SC. Thus, we propose MARE, a semantic disruption management and resilience evaluation framework for the integration of data sources included in all DMP steps, i.e., Monitor/Model, Assess, Recover and Evaluate. MARE leverages semantic technologies, i.e., ontologies, knowledge graphs and SPARQL queries, to model and reproduce SC behavior under disruptive scenarios. MARE also includes an evaluation framework to examine the restoration performance of an SC applying various recovery strategies. The semantic SC DMP put forward by MARE allows stakeholders to potentially identify measures to enhance SC integration, increase the resilience of supply networks and ultimately facilitate digitalization.
Submitted 13 May, 2022;
originally announced May 2022.
-
SENS: Semantic Synthetic Benchmarking Model for integrated supply chain simulation and analysis
Authors:
Nour Ramzy,
Soren Auer,
Hans Ehm,
Javad Chamanara
Abstract:
Supply Chain (SC) modeling is essential to understand and influence SC behavior, especially for increasingly globalized and complex SCs. Existing models address various SC notions, e.g., processes, tiers and production, in an isolated manner, limiting the enriched analysis granted by integrated information systems. Moreover, the scarcity of real-world data prevents the benchmarking of the overall SC performance in different circumstances, especially with respect to resilience during disruption. We present SENS, an ontology-based Knowledge Graph (KG) equipped with SPARQL implementations of KPIs to incorporate an end-to-end perspective of the SC including standardized SCOR processes and metrics. Further, we propose SENS-GEN, a highly configurable data generator that leverages SENS to create synthetic semantic SC data under multiple scenario configurations for comprehensive analysis and benchmarking applications. The evaluation shows that the significantly improved simulation and analysis capabilities, enabled by SENS, facilitate grasping, controlling and ultimately enhancing SC behavior and increasing resilience in disruptive scenarios.
Submitted 13 May, 2022;
originally announced May 2022.
-
TinyGenius: Intertwining Natural Language Processing with Microtask Crowdsourcing for Scholarly Knowledge Graph Creation
Authors:
Allard Oelen,
Markus Stocker,
Sören Auer
Abstract:
As the number of published scholarly articles grows steadily each year, new methods are needed to organize scholarly knowledge so that it can be more efficiently discovered and used. Natural Language Processing (NLP) techniques are able to autonomously process scholarly articles at scale and to create machine readable representations of the article content. However, autonomous NLP methods are by far not sufficiently accurate to create a high-quality knowledge graph. Yet quality is crucial for the graph to be useful in practice. We present TinyGenius, a methodology to validate NLP-extracted scholarly knowledge statements using microtasks performed with crowdsourcing. The scholarly context in which the crowd workers operate has multiple challenges. The explainability of the employed NLP methods is crucial to provide context in order to support the decision process of crowd workers. We employed TinyGenius to populate a paper-centric knowledge graph, using five distinct NLP methods. In the end, the resulting knowledge graph serves as a digital library for scholarly articles.
Submitted 9 May, 2022;
originally announced May 2022.
-
Enriching Scholarly Knowledge with Context
Authors:
Muhammad Haris,
Markus Stocker,
Sören Auer
Abstract:
Leveraging a GraphQL-based federated query service that integrates multiple scholarly communication infrastructures (specifically, DataCite, ORCID, ROR, OpenAIRE, Semantic Scholar, Wikidata and Altmetric), we develop a novel web widget based approach for the presentation of scholarly knowledge with rich contextual information. We implement the proposed approach in the Open Research Knowledge Graph (ORKG) and showcase it on three kinds of widgets. First, we devise a widget for the ORKG paper view that presents contextual information about related datasets, software, project information, topics, and metrics. Second, we extend the ORKG contributor profile view with contextual information including authored articles, developed software, linked projects, and research interests. Third, we advance ORKG comparison faceted search by introducing contextual facets (e.g. citations). As a result, the devised approach enables presenting ORKG scholarly knowledge flexibly enriched with contextual information sourced in a federated manner from numerous technologically heterogeneous scholarly communication infrastructures.
Submitted 28 March, 2022;
originally announced March 2022.
-
Computer Science Named Entity Recognition in the Open Research Knowledge Graph
Authors:
Jennifer D'Souza,
Sören Auer
Abstract:
Domain-specific named entity recognition (NER) on Computer Science (CS) scholarly articles is an information extraction task that is arguably more challenging, given the various annotation aims that can beset it, and that has been less studied than NER in the general domain. Given that significant progress has been made on NER, we believe that scholarly domain-specific NER will receive increasing attention in the years to come. Currently, progress on CS NER -- the focus of this work -- is hampered in part by its recency and the lack of a standardized annotation aim for scientific entities/terms. This work proposes a standardized task by defining a set of seven contribution-centric scholarly entities for CS NER, viz. research problem, solution, resource, language, tool, method, and dataset. Its main contributions are as follows: it combines existing CS NER resources that maintain their annotation focus on the set or subset of contribution-centric scholarly entities we consider; further, noting the need for big data to train neural NER models, it additionally supplies thousands of contribution-centric entity annotations from article titles and abstracts, thus releasing a cumulative large novel resource for CS NER; and, finally, it trains a sequence labeling CS NER model inspired by state-of-the-art neural architectures from the general domain NER task. Throughout the work, several practical considerations are made which can be useful to information technology designers of digital libraries.
Submitted 14 November, 2022; v1 submitted 28 March, 2022;
originally announced March 2022.
-
The Digitalization of Bioassays in the Open Research Knowledge Graph
Authors:
Jennifer D'Souza,
Anita Monteverdi,
Muhammad Haris,
Marco Anteghini,
Kheir Eddine Farfar,
Markus Stocker,
Vitor A. P. Martins dos Santos,
Sören Auer
Abstract:
Background: Recent years are seeing a growing impetus in the semantification of scholarly knowledge at the fine-grained level of scientific entities in knowledge graphs. The Open Research Knowledge Graph (ORKG) https://www.orkg.org/ represents an important step in this direction, with thousands of scholarly contributions as structured, fine-grained, machine-readable data. There is a need, however, to engender change in traditional community practices of recording contributions as unstructured, non-machine-readable text. For this in turn, there is a strong need for AI tools designed for scientists that permit easy and accurate semantification of their scholarly contributions. We present one such tool, ORKG-assays. Implementation: ORKG-assays is a freely available AI micro-service in ORKG, written in Python, designed to help scientists obtain semantified bioassays as a set of triples. It uses an AI-based clustering algorithm which, on gold-standard evaluations over 900 bioassays with 5,514 unique property-value pairs for 103 predicates, shows competitive performance. Results and Discussion: As a result, semantified assay collections can be surveyed on the ORKG platform via tabulation or chart-based visualizations of key property values of the chemicals and compounds, offering smart knowledge access to biochemists and pharmaceutical researchers in the advancement of drug development.
Submitted 28 March, 2022;
originally announced March 2022.
-
SmartReviews: Towards Human- and Machine-actionable Representation of Review Articles
Authors:
Allard Oelen,
Markus Stocker,
Sören Auer
Abstract:
Review articles are a means to structure state-of-the-art literature and to organize the growing number of scholarly publications. However, review articles are suffering from numerous limitations, weakening the impact the articles could potentially have. A key limitation is the inability of machines to access and process knowledge presented within review articles. In this work, we present SmartReviews, a review authoring and publishing tool, specifically addressing the limitations of review articles. The tool enables community-based authoring of living articles, leveraging a scholarly knowledge graph to provide machine-actionable knowledge. We evaluate the approach and tool by means of a SmartReview use case. The results indicate that the evaluated article is successfully addressing the weaknesses of the current review practices.
Submitted 30 November, 2021;
originally announced November 2021.
-
Easy Semantification of Bioassays
Authors:
Marco Anteghini,
Jennifer D'Souza,
Vitor A. P. Martins dos Santos,
Sören Auer
Abstract:
Biological data and knowledge bases increasingly rely on Semantic Web technologies and the use of knowledge graphs for data integration, retrieval and federated queries. We propose a solution for automatically semantifying biological assays. Our solution contrasts the problem of automated semantification as labeling versus clustering where the two methods are on opposite ends of the method complexity spectrum. Characteristically modeling our problem, we find the clustering solution significantly outperforms a deep neural network state-of-the-art labeling approach. This novel contribution is based on two factors: 1) a learning objective closely modeled after the data outperforms an alternative approach with sophisticated semantic modeling; 2) automatically semantifying biological assays achieves a high performance F1 of nearly 83%, which to our knowledge is the first reported standardized evaluation of the task offering a strong benchmark model.
Submitted 2 December, 2021; v1 submitted 30 November, 2021;
originally announced November 2021.
-
Triple Classification for Scholarly Knowledge Graph Completion
Authors:
Mohamad Yaser Jaradeh,
Kuldeep Singh,
Markus Stocker,
Sören Auer
Abstract:
Scholarly Knowledge Graphs (KGs) provide a rich source of structured information representing knowledge encoded in scientific publications. With the sheer volume of published scientific literature comprising a plethora of inhomogeneous entities and relations to describe scientific concepts, these KGs are inherently incomplete. We present exBERT, a method for leveraging pre-trained transformer language models to perform scholarly knowledge graph completion. We model triples of a knowledge graph as text and perform triple classification (i.e., belongs to KG or not). The evaluation shows that exBERT outperforms other baselines on three scholarly KG completion datasets in the tasks of triple classification, link prediction, and relation prediction. Furthermore, we present two scholarly datasets as resources for the research community, collected from public KGs and online resources.
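The idea of modeling triples as text for binary classification can be illustrated with a minimal sketch. The `[SEP]`-joined serialization, the tail-corruption scheme for negative examples, and the toy triples below are all assumptions made for illustration; the abstract does not specify how exBERT serializes triples or samples negatives.

```python
import random
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def triple_to_text(triple: Triple, sep: str = " [SEP] ") -> str:
    """Serialize (head, relation, tail) into one string for a text classifier.

    The separator format is an assumption, not taken from the paper.
    """
    head, relation, tail = triple
    return sep.join([head, relation, tail])


def corrupt_tail(triple: Triple, entities: List[str], known: Set[Triple],
                 rng: random.Random) -> Triple:
    """Create a negative example by swapping the tail for another entity,
    skipping replacements that would reproduce a known (true) triple."""
    head, relation, tail = triple
    candidates = [e for e in entities
                  if e != tail and (head, relation, e) not in known]
    if not candidates:
        raise ValueError("no valid corruption available for this triple")
    return (head, relation, rng.choice(candidates))


def build_examples(triples: List[Triple], seed: int = 0) -> List[Tuple[str, int]]:
    """Return (text, label) pairs: label 1 = in the KG, label 0 = corrupted."""
    rng = random.Random(seed)
    known = set(triples)
    entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
    examples = []
    for t in triples:
        examples.append((triple_to_text(t), 1))
        examples.append((triple_to_text(corrupt_tail(t, entities, known, rng)), 0))
    return examples


# Hypothetical toy triples, purely illustrative.
toy_triples = [
    ("paper_A", "uses_method", "BERT"),
    ("paper_B", "uses_method", "CRF"),
]
examples = build_examples(toy_triples)
```

The resulting `(text, label)` pairs could then be fed to any sequence classifier, for instance a fine-tuned transformer.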
Submitted 23 November, 2021;
originally announced November 2021.
-
NRC-GAMMA: Introducing a Novel Large Gas Meter Image Dataset
Authors:
Ashkan Ebadi,
Patrick Paul,
Sofia Auer,
Stéphane Tremblay
Abstract:
Automatic meter reading technology is not yet widespread. Gas, electricity, or water accumulation meter reading is mostly done manually on-site either by an operator or by the homeowner. In some countries, the operator takes a picture as reading proof to confirm the reading by checking offline with another operator and/or using it as evidence in case of conflicts or complaints. The whole process is time-consuming, expensive, and prone to errors. Automation can optimize and facilitate such labor-intensive and human error-prone processes. With the recent advances in the fields of artificial intelligence and computer vision, automatic meter reading systems are becoming more viable than ever. Motivated by the recent advances in the field of artificial intelligence and inspired by open-source open-access initiatives in the research community, we introduce a novel large benchmark dataset of real-life gas meter images, named the NRC-GAMMA dataset. The data were collected from an Itron 400A diaphragm gas meter on January 20, 2020, between 00:05 am and 11:59 pm. We employed a systematic approach to label the images, validate the labellings, and assure the quality of the annotations. The dataset contains 28,883 images of the entire gas meter along with 57,766 cropped images of the left and the right dial displays. We hope the NRC-GAMMA dataset helps the research community to design and implement accurate, innovative, intelligent, and reproducible automatic gas meter reading solutions.
Submitted 12 November, 2021;
originally announced November 2021.
-
Ranking Facts for Explaining Answers to Elementary Science Questions
Authors:
Jennifer D'Souza,
Isaiah Onando Mulang',
Soeren Auer
Abstract:
In multiple-choice exams, students select one answer from among typically four choices and can explain why they made that particular choice. Students are good at understanding natural language questions and based on their domain knowledge can easily infer the question's answer by 'connecting the dots' across various pertinent facts.
Considering automated reasoning for elementary science question answering, we address the novel task of generating explanations for answers from human-authored facts. For this, we examine the practically scalable framework of feature-rich support vector machines leveraging domain-targeted, hand-crafted features. Explanations are created from a human-annotated set of nearly 5,000 candidate facts in the WorldTree corpus. Our aim is to obtain better matches for valid facts of an explanation for the correct answer of a question over the available fact candidates. To this end, our features offer a comprehensive linguistic and semantic unification paradigm. The machine learning problem is the preference ordering of facts, for which we test pointwise regression versus pairwise learning-to-rank.
Our contributions are: (1) a case study in which two preference ordering approaches are systematically compared; (2) a practically competent approach that can outperform some variants of BERT-based reranking models; and (3) human-engineered features that make it an interpretable machine learning model for the task.
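The difference between the pointwise and pairwise formulations of preference ordering can be sketched as follows. This is a generic illustration of the two training-data constructions, not the authors' feature set or SVM setup; the feature vectors and relevance labels are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def pointwise_examples(features: Sequence[Sequence[float]],
                       relevance: Sequence[int]) -> List[Tuple[List[float], int]]:
    """Pointwise view: each candidate fact is scored independently,
    so every (feature vector, relevance) pair is one regression example."""
    return [(list(f), r) for f, r in zip(features, relevance)]


def pairwise_examples(features: Sequence[Sequence[float]],
                      relevance: Sequence[int]) -> List[Tuple[List[float], int]]:
    """Pairwise view: each pair of facts with different relevance yields one
    classification example on the feature difference (+1 if the first fact
    should rank higher, -1 otherwise). Ties carry no preference and are skipped."""
    pairs = []
    for i, j in combinations(range(len(features)), 2):
        if relevance[i] == relevance[j]:
            continue
        diff = [a - b for a, b in zip(features[i], features[j])]
        label = 1 if relevance[i] > relevance[j] else -1
        pairs.append((diff, label))
    return pairs


# Hypothetical features for three candidate facts of a single question;
# only the first fact belongs to the gold explanation.
facts = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
gold = [1, 0, 0]

pointwise = pointwise_examples(facts, gold)
pairwise = pairwise_examples(facts, gold)
```

In the pairwise case, pairs should only be formed among candidate facts for the same question, since preferences across questions are not meaningful.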
Submitted 18 October, 2021;
originally announced October 2021.
-
Automated Mining of Leaderboards for Empirical AI Research
Authors:
Salomon Kabongo,
Jennifer D'Souza,
Sören Auer
Abstract:
With the rapid growth of research publications, empowering scientists to keep oversight over the scientific progress is of paramount importance. In this regard, the Leaderboards facet of information organization provides an overview on the state-of-the-art by aggregating empirical results from various studies addressing the same research challenge. Crowdsourcing efforts like PapersWithCode among others are devoted to the construction of Leaderboards predominantly for various subdomains in Artificial Intelligence. Leaderboards provide machine-readable scholarly knowledge that has proven to be directly useful for scientists to keep track of research progress. The construction of Leaderboards could be greatly expedited with automated text mining.
This study presents a comprehensive approach for generating Leaderboards for knowledge-graph-based scholarly information organization. Specifically, we investigate the problem of automated Leaderboard construction using state-of-the-art transformer models, viz. BERT, SciBERT, and XLNet. Our analysis reveals an optimal approach that significantly outperforms existing baselines for the task with evaluation scores above 90% in F1. This, in turn, offers new state-of-the-art results for Leaderboard extraction. As a result, a vast share of empirical AI research can be organized in the next-generation digital libraries as knowledge graphs.
Submitted 31 August, 2021;
originally announced September 2021.
-
Federating Scholarly Infrastructures with GraphQL
Authors:
Muhammad Haris,
Kheir Eddine Farfar,
Markus Stocker,
Sören Auer
Abstract:
A plethora of scholarly knowledge is being published on distributed scholarly infrastructures. Querying a single infrastructure is no longer sufficient for researchers to satisfy information needs. We present a GraphQL-based federated query service for executing distributed queries on numerous, heterogeneous scholarly infrastructures (currently, ORKG, DataCite and GeoNames), thus enabling the integrated retrieval of scholarly content from these infrastructures. Furthermore, we present the methods that enable cross-walks between artefact metadata and artefact content across scholarly infrastructures, specifically DOI-based persistent identification of ORKG artefacts (e.g., ORKG comparisons) and linking ORKG content to third-party semantic resources (e.g., taxonomies, thesauri, ontologies). This type of linking increases interoperability, facilitates the reuse of scholarly knowledge, and enables finding machine actionable scholarly knowledge published by ORKG in global scholarly infrastructures. In summary, we suggest applying the established linked data principles to scholarly knowledge to improve its findability, interoperability, and ultimately reusability, i.e., improve scholarly knowledge FAIR-ness.
Submitted 13 September, 2021;
originally announced September 2021.
-
Pattern-based Acquisition of Scientific Entities from Scholarly Article Titles
Authors:
Jennifer D'Souza,
Soeren Auer
Abstract:
We describe a rule-based approach for the automatic acquisition of salient scientific entities from Computational Linguistics (CL) scholarly article titles. Two observations motivated the approach: (i) noting salient aspects of an article's contribution in its title; and (ii) pattern regularities capturing the salient terms that could be expressed in a set of rules. Only those lexico-syntactic patterns were selected that were easily recognizable, occurred frequently, and positionally indicated a scientific entity type. The rules were developed on a collection of 50,237 CL titles covering all articles in the ACL Anthology. In total, 19,799 research problems, 18,111 solutions, 20,033 resources, 1,059 languages, 6,878 tools, and 21,687 methods were extracted at an average precision of 75%.
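A lexico-syntactic title rule of the kind described can be sketched with regular expressions. The two patterns below ("X for Y" and "X using Y") and the entity types they assign are hypothetical examples chosen for illustration; they are not the authors' actual rule set, and the sample title is invented.

```python
import re
from typing import Dict

# Hypothetical rule: "<solution> for <research problem>"
FOR_PATTERN = re.compile(r"^(?P<solution>.+?)\s+for\s+(?P<problem>.+)$",
                         re.IGNORECASE)
# Hypothetical rule: "<rest of title> using <method>"
USING_PATTERN = re.compile(r"^(?P<rest>.+?)\s+using\s+(?P<method>.+)$",
                           re.IGNORECASE)


def extract_entities(title: str) -> Dict[str, str]:
    """Apply the two illustrative rules to a title and return typed entities.

    The trailing 'using' clause is stripped first so that the 'for' rule
    sees only the remaining part of the title.
    """
    entities: Dict[str, str] = {}
    title = title.strip()

    match = USING_PATTERN.match(title)
    if match:
        entities["method"] = match.group("method")
        title = match.group("rest")

    match = FOR_PATTERN.match(title)
    if match:
        entities["solution"] = match.group("solution")
        entities["research_problem"] = match.group("problem")

    return entities


# Invented example title.
result = extract_entities(
    "A Neural Parser for Dependency Parsing using Graph Attention"
)
```

Titles that match neither rule yield an empty dictionary, which reflects the precision-oriented nature of pattern-based extraction: only easily recognizable patterns produce output.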
Submitted 17 September, 2021; v1 submitted 1 September, 2021;
originally announced September 2021.
-
Researcher or Crowd Member? Why not both! The Open Research Knowledge Graph for Applying and Communicating CrowdRE Research
Authors:
Oliver Karras,
Eduard C. Groen,
Javed Ali Khan,
Sören Auer
Abstract:
In recent decades, there has been a major shift towards improved digital access to scholarly works. However, even now that these works are available in digital form, they remain document-based, making it difficult to communicate the knowledge they contain. The next logical step is to extend these works with more flexible, fine-grained, semantic, and context-sensitive representations of scholarly knowledge. The Open Research Knowledge Graph (ORKG) is a platform that structures and interlinks scholarly knowledge, relying on crowdsourced contributions from researchers (as a crowd) to acquire, curate, publish, and process this knowledge. In this experience report, we consider the ORKG in the context of Crowd-based Requirements Engineering (CrowdRE) from two perspectives: (1) As CrowdRE researchers, we investigate how the ORKG practically applies CrowdRE techniques to involve scholars in its development to make it align better with their academic work. We determined that the ORKG readily provides social and financial incentives, feedback elicitation channels, and support for context and usage monitoring, but that there is improvement potential regarding automated user feedback analyses and a holistic CrowdRE approach. (2) As crowd members, we explore how the ORKG can be used to communicate scholarly knowledge about CrowdRE research. For this purpose, we curated qualitative and quantitative scholarly knowledge in the ORKG based on papers contained in two previously published systematic literature reviews (SLRs) on CrowdRE. This knowledge can be explored and compared interactively, and with more data than what the SLRs originally contained. Therefore, the ORKG improves access and communication of the scholarly knowledge about CrowdRE research. For both perspectives, we found the ORKG to be a useful multi-tool for CrowdRE research.
Submitted 11 August, 2021;
originally announced August 2021.
-
Demonstration of Faceted Search on Scholarly Knowledge Graphs
Authors:
Golsa Heidari,
Ahmad Ramadan,
Markus Stocker,
Sören Auer
Abstract:
Scientists always look for the most accurate and relevant answer to their queries on the scholarly literature. Traditional scholarly search systems list documents instead of providing direct answers to the search queries. As data in knowledge graphs are not acquainted semantically, they are not machine-readable. Therefore, a search on scholarly knowledge graphs ends up in a full-text search, not a search in the content of scholarly literature. In this demo, we present a faceted search system that retrieves data from a scholarly knowledge graph, which can be compared and filtered to better satisfy user information needs. Our practice's novelty is that we use dynamic facets, which means facets are not fixed and will change according to the content of a comparison.
Submitted 5 July, 2021;
originally announced July 2021.
-
EduCOR: An Educational and Career-Oriented Recommendation Ontology
Authors:
Eleni Ilkou,
Hasan Abu-Rasheed,
Mohammadreza Tavakoli,
Sherzod Hakimov,
Gábor Kismihók,
Sören Auer,
Wolfgang Nejdl
Abstract:
With the increased dependence on online learning platforms and educational resource repositories, a unified representation of digital learning resources becomes essential to support a dynamic and multi-source learning experience. We introduce the EduCOR ontology, an educational, career-oriented ontology that provides a foundation for representing online learning resources for personalised learning systems. The ontology is designed to enable learning material repositories to offer learning path recommendations, which correspond to the user's learning goals, academic and psychological parameters, and the labour-market skills. We present the multiple patterns that compose the EduCOR ontology, highlighting its cross-domain applicability and integrability with other ontologies. A demonstration of the proposed ontology on the real-life learning platform eDoer is discussed as a use-case. We evaluate the EduCOR ontology using both gold standard and task-based approaches. The comparison of EduCOR to three gold schemata, and its application in two use-cases, shows its coverage and adaptability to multiple OER repositories, which allows generating user-centric and labour-market oriented recommendations.
Submitted 13 July, 2021; v1 submitted 12 July, 2021;
originally announced July 2021.
-
Leveraging a Federation of Knowledge Graphs to Improve Faceted Search in Digital Libraries
Authors:
Golsa Heidari,
Ahmad Ramadan,
Markus Stocker,
Sören Auer
Abstract:
Scientists always look for the most accurate and relevant answers to their queries in the literature. Traditional scholarly digital libraries list documents in search results, and therefore are unable to provide precise answers to search queries. In other words, search in digital libraries is metadata search and, if available, full-text search. We present a methodology for improving a faceted search system on structured content by leveraging a federation of scholarly knowledge graphs. We implemented the methodology on top of a scholarly knowledge graph. This search system can leverage content from third-party knowledge graphs to improve the exploration of scholarly content. A novelty of our approach is that we use dynamic facets on diverse data types, meaning that facets can change according to the user query. The user can also adjust the granularity of dynamic facets. An additional novelty is that we leverage third-party knowledge graphs to improve exploring scholarly knowledge.
Submitted 5 July, 2021;
originally announced July 2021.
-
SmartReviews: Towards Human- and Machine-actionable Reviews
Authors:
Allard Oelen,
Markus Stocker,
Sören Auer
Abstract:
Review articles summarize state-of-the-art work and provide a means to organize the growing number of scholarly publications. However, the current review method and publication mechanisms hinder the impact review articles can potentially have. Among other limitations, reviews only provide a snapshot of the current literature and are generally not readable by machines. In this work, we identify the weaknesses of the current review method. Afterwards, we present the SmartReview approach addressing those weaknesses. The approach pushes towards semantic community-maintained review articles. At the core of our approach, knowledge graphs are employed to make articles more machine-actionable and maintainable.
Submitted 8 July, 2021;
originally announced July 2021.
-
SemEval-2021 Task 11: NLPContributionGraph -- Structuring Scholarly NLP Contributions for a Research Knowledge Graph
Authors:
Jennifer D'Souza,
Sören Auer,
Ted Pedersen
Abstract:
There is currently a gap between the natural language expression of scholarly publications and their structured semantic content modeling to enable intelligent content search. With the volume of research growing exponentially every year, a search feature operating over semantically structured content is compelling. The SemEval-2021 Shared Task NLPContributionGraph (a.k.a. 'the NCG task') tasks participants to develop automated systems that structure contributions from NLP scholarly articles in the English language. Being the first-of-its-kind in the SemEval series, the task released structured data from NLP scholarly articles at three levels of information granularity, i.e. at sentence-level, phrase-level, and phrases organized as triples toward Knowledge Graph (KG) building. The sentence-level annotations comprised the few sentences about the article's contribution. The phrase-level annotations were scientific term and predicate phrases from the contribution sentences. Finally, the triples constituted the research overview KG. For the Shared Task, participating systems were then expected to automatically classify contribution sentences, extract scientific terms and relations from the sentences, and organize them as KG triples.
Overall, the task drew a strong participation demographic of seven teams and 27 participants. The best end-to-end task system classified contribution sentences at 57.27% F1, phrases at 46.41% F1, and triples at 22.28% F1. While the absolute performance in generating triples remains low, the conclusion of this article highlights the difficulty of producing such data and, as a consequence, of modeling it.
Submitted 15 October, 2021; v1 submitted 10 June, 2021;
originally announced June 2021.
-
Better Call the Plumber: Orchestrating Dynamic Information Extraction Pipelines
Authors:
Mohamad Yaser Jaradeh,
Kuldeep Singh,
Markus Stocker,
Andreas Both,
Sören Auer
Abstract:
In the last decade, a large number of Knowledge Graph (KG) information extraction approaches were proposed. Albeit effective, these efforts are disjoint, and their collective strengths and weaknesses in effective KG information extraction (IE) have not been studied in the literature. We propose Plumber, the first framework that brings together the research community's disjoint IE efforts. The Plumber architecture comprises 33 reusable components for various KG information extraction subtasks, such as coreference resolution, entity linking, and relation extraction. Using these components, Plumber dynamically generates suitable information extraction pipelines and offers 264 distinct pipelines overall. We study the optimization problem of choosing suitable pipelines based on input sentences. To do so, we train a transformer-based classification model that extracts contextual embeddings from the input and finds an appropriate pipeline. We study the efficacy of Plumber for extracting the KG triples using standard datasets over two KGs: DBpedia and the Open Research Knowledge Graph (ORKG). Our results demonstrate the effectiveness of Plumber in dynamically generating KG information extraction pipelines, outperforming all baselines agnostic of the underlying KG. Furthermore, we provide an analysis of collective failure cases, study the similarities and synergies among integrated components, and discuss their limitations.
Submitted 22 February, 2021;
originally announced February 2021.