doi 10.1109/CBMS.2019.00032

iASiS: Towards Heterogeneous Big Data Analysis for Personalized Medicine

Authors: Anastasia Krithara, Fotis Aisopos, Vassiliki Rentoumi, Anastasios Nentidis, Konstantinos Bougatiotis, Maria-Esther Vidal, Ernestina Menasalvas, Alejandro Rodriguez-Gonzalez, Eleftherios G. Samaras, Peter Garrard, Maria Torrente, Mariano Provencio Pulla, Nikos Dimakopoulos, Rui Mauricio, Jordi Rambla De Argila, Gian Gaetano Tartaglia, George Paliouras

Abstract: The vision of IASIS project is to turn the wave of big biomedical data heading our way into actionable knowledge for decision makers. This is achieved by integrating data from disparate sources, including genomics, electronic health records and bibliography, and applying advanced analytics methods to discover useful patterns. The goal is to turn large amounts of available data into actionable information to authorities for planning public health activities and policies. The integration and analysis of these heterogeneous sources of information will enable the best decisions to be made, allowing for diagnosis and treatment to be personalised to each individual. The project offers a common representation schema for the heterogeneous data sources. The iASiS infrastructure is able to convert clinical notes into usable data, combine them with genomic data, related bibliography, image data and more, and create a global knowledge base. This facilitates the use of intelligent methods in order to discover useful patterns across different resources. Using semantic integration of data gives the opportunity to generate information that is rich, auditable and reliable. This information can be used to provide better care, reduce errors and create more confidence in sharing data, thus providing more insights and opportunities. Data resources for two different disease categories are explored within the iASiS use cases, dementia and lung cancer.

Submitted 9 July, 2024; originally announced July 2024.

Comments: 6 pages, 2 figures, accepted at 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS)

Journal ref: 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), Cordoba, Spain, 2019, pp. 106-111

arXiv:2407.00509 [pdf, other]

Leveraging Ontologies to Document Bias in Data

Authors: Mayra Russo, Maria-Esther Vidal

Abstract: Machine Learning (ML) systems are capable of reproducing and often amplifying undesired biases. This puts emphasis on the importance of operating under practices that enable the study and understanding of the intrinsic characteristics of ML pipelines, prompting the emergence of documentation frameworks with the idea that "any remedy for bias starts with awareness of its existence". However, a resource that can formally describe these pipelines in terms of biases detected is still amiss. To fill this gap, we present the Doc-BiasO ontology, a resource that aims to create an integrated vocabulary of biases defined in the fair-ML literature and their measures, as well as to incorporate relevant terminology and the relationships between them. Overseeing ontology engineering best practices, we re-use existing vocabulary on machine learning and AI, to foster knowledge sharing and interoperability between the actors concerned with its research, development, regulation, among others. Overall, our main objective is to contribute towards clarifying existing terminology on bias research as it rapidly expands to all areas of AI and to improve the interpretation of bias in data and downstream impact.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2402.07714 [pdf]

doi 10.1016/j.swevo.2017.07.002

Adaptive Artificial Immune Networks for Mitigating DoS flooding Attacks

Authors: Jorge Maestre Vidal, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

Abstract: Denial of service attacks pose a threat in constant growth. This is mainly due to their tendency to gain in sophistication, ease of implementation, obfuscation and the recent improvements in occultation of fingerprints. On the other hand, progress towards self-organizing networks, and the different techniques involved in their development, such as software-defined networking, network-function virtualization, artificial intelligence or cloud computing, facilitates the design of new defensive strategies, more complete, consistent and able to adapt the defensive deployment to the current status of the network. In order to contribute to their development, in this paper, the use of artificial immune systems to mitigate denial of service attacks is proposed. The approach is based on building networks of distributed sensors suited to the requirements of the monitored environment. These components are capable of identifying threats and reacting according to the behavior of the biological defense mechanisms in human beings. It is accomplished by emulating the different immune reactions, the establishment of quarantine areas and the construction of immune memory. For their assessment, experiments with public domain datasets (KDD'99, CAIDA'07 and CAIDA'08) and simulations on various network configurations based on traffic samples gathered by the University Complutense of Madrid and flooding attacks generated by the tool DDoSIM were performed.

Submitted 12 February, 2024; originally announced February 2024.

Journal ref: J. Maestre Vidal, A. L. Sandoval Orozco, L. J. García Villalba: Adaptive Artificial Immune Networks for Mitigating DoS Flooding Attacks. Swarm and Evolutionary Computation. Vol. 38, pp. 94-108, February 2018

arXiv:2402.05571 [pdf]

doi 10.2196/34492

Traditional Machine Learning Models and Bidirectional Encoder Representations From Transformer (BERT)-Based Automatic Classification of Tweets About Eating Disorders: Algorithm Development and Validation Study

Authors: José Alberto Benítez-Andrades, José-Manuel Alija-Pérez, Maria-Esther Vidal, Rafael Pastor-Vargas, María Teresa García-Ordás

Abstract: Background: Eating disorders are increasingly prevalent, and social networks offer valuable information. Objective: Our goal was to identify efficient machine learning models for categorizing tweets related to eating disorders. Methods: Over three months, we collected tweets about eating disorders. A 2,000-tweet subset was labeled for: (1) being written by individuals with eating disorders, (2) promoting eating disorders, (3) informativeness, and (4) scientific content. Both traditional machine learning and deep learning models were employed for classification, assessing accuracy, F1 score, and computational time. Results: From 1,058,957 collected tweets, transformer-based bidirectional encoder representations achieved the highest F1 scores (71.1%-86.4%) across all four categories. Conclusions: Transformer-based models outperform traditional techniques in classifying eating disorder-related tweets, though they require more computational resources.

Submitted 8 February, 2024; originally announced February 2024.

Journal ref: JMIR Medical Informatics, Volume 10, Issue 2, 2022, ID e34492

arXiv:2402.05536 [pdf]

doi 10.3233/SW-223269

Empowering machine learning models with contextual knowledge for enhancing the detection of eating disorders in social media posts

Authors: José Alberto Benítez-Andrades, María Teresa García-Ordás, Mayra Russo, Ahmad Sakor, Luis Daniel Fernandes Rotger, Maria-Esther Vidal

Abstract: Social networks are vital for information sharing, especially in the health sector for discussing diseases and treatments. These platforms, however, often feature posts as brief texts, posing challenges for Artificial Intelligence (AI) in understanding context. We introduce a novel hybrid approach combining community-maintained knowledge graphs (like Wikidata) with deep learning to enhance the categorization of social media posts. This method uses advanced entity recognizers and linkers (like Falcon 2.0) to connect short post entities to knowledge graphs. Knowledge graph embeddings (KGEs) and contextualized word embeddings (like BERT) are then employed to create rich, context-based representations of these posts. Our focus is on the health domain, particularly in identifying posts related to eating disorders (e.g., anorexia, bulimia) to aid healthcare providers in early diagnosis. We tested our approach on a dataset of 2,000 tweets about eating disorders, finding that merging word embeddings with knowledge graph information enhances the predictive models' reliability. This methodology aims to assist health experts in spotting patterns indicative of mental disorders, thereby improving early detection and accurate diagnosis for personalized medicine.

Submitted 8 February, 2024; originally announced February 2024.

Journal ref: Semantic Web, Volume 14, Issue 5, pp. 873-892, 2023

arXiv:2402.03562 [pdf]

doi 10.1016/j.knosys.2018.03.018

A novel pattern recognition system for detecting Android malware by analyzing suspicious boot sequences

Authors: Jorge Maestre Vidal, Marco Antonio Sotelo Monge, Luis Javier García Villalba

Abstract: This paper introduces a malware detection system for smartphones based on studying the dynamic behavior of suspicious applications. The main goal is to prevent the installation of the malicious software on the victim systems. The approach focuses on identifying malware addressed against the Android platform. For that purpose, only the system calls performed during the boot process of the recently installed applications are studied. Thereby the amount of information to be considered is reduced, since only activities related with their initialization are taken into account. The proposal defines a pattern recognition system with three processing layers: monitoring, analysis and decision-making. First, in order to extract the sequences of system calls, the potentially compromised applications are executed on a safe and isolated environment. Then the analysis step generates the metrics required for decision-making. This level combines sequence alignment algorithms with bagging, which allow scoring the similarity between the extracted sequences considering their regions of greatest resemblance. At the decision-making stage, the Wilcoxon signed-rank test is implemented, which determines if the new software is labeled as legitimate or malicious. The proposal has been tested in different experiments that include an in-depth study of a particular use case, and the evaluation of its effectiveness when analyzing samples of well-known public datasets. Promising experimental results have been shown, hence demonstrating that the approach is a good complement to the strategies of the bibliography.

Submitted 5 February, 2024; originally announced February 2024.

Journal ref: Knowledge-Based Systems. Vol. 150, pp. 198-217, June 2018

arXiv:2402.03369 [pdf]

doi 10.1080/10429247.2015.1054752

Evaluation of Google's Voice Recognition and Sentence Classification for Health Care Applications

Authors: Majbah Uddin, Nathan Huynh, Jose M Vidal, Kevin M Taaffe, Lawrence D Fredendall, Joel S Greenstein

Abstract: This study examined the use of voice recognition technology in perioperative services (Periop) to enable Periop staff to record workflow milestones using mobile technology. The use of mobile technology to improve patient flow and quality of care could be facilitated if such voice recognition technology could be made robust. The goal of this experiment was to allow the Periop staff to provide care without being interrupted with data entry and querying tasks. However, the results are generalizable to other situations where an engineering manager attempts to improve communication performance using mobile technology. This study enhanced Google's voice recognition capability by using post-processing classifiers (i.e., bag-of-sentences, support vector machine, and maximum entropy). The experiments investigated three factors (original phrasing, reduced phrasing, and personalized phrasing) at three levels (zero training repetition, 5 training repetitions, and 10 training repetitions). Results indicated that personal phrasing yielded the highest correctness and that training the device to recognize an individual's voice improved correctness as well. Although simplistic, the bag-of-sentences classifier significantly improved voice recognition correctness. The classification efficiency of the maximum entropy and support vector machine algorithms was found to be nearly identical. These results suggest that engineering managers could significantly enhance Google's voice recognition technology by using post-processing techniques, which would facilitate its use in health care and other applications.

Submitted 1 February, 2024; originally announced February 2024.

Journal ref: Engineering Management Journal, 27:3, 152-162, 2015

arXiv:2310.19503 [pdf, other]

Trust, Accountability, and Autonomy in Knowledge Graph-based AI for Self-determination

Authors: Luis-Daniel Ibáñez, John Domingue, Sabrina Kirrane, Oshani Seneviratne, Aisling Third, Maria-Esther Vidal

Abstract: Knowledge Graphs (KGs) have emerged as fundamental platforms for powering intelligent decision-making and a wide range of Artificial Intelligence (AI) services across major corporations such as Google, Walmart, and AirBnb. KGs complement Machine Learning (ML) algorithms by providing data context and semantics, thereby enabling further inference and question-answering capabilities. The integration of KGs with neuronal learning (e.g., Large Language Models (LLMs)) is currently a topic of active research, commonly named neuro-symbolic AI. Despite the numerous benefits that can be accomplished with KG-based AI, its growing ubiquity within online services may result in the loss of self-determination for citizens as a fundamental societal issue. The more we rely on these technologies, which are often centralised, the less citizens will be able to determine their own destinies. To counter this threat, AI regulation, such as the European Union (EU) AI Act, is being proposed in certain regions. The regulation sets what technologists need to do, leading to questions concerning: How can the output of AI systems be trusted? What is needed to ensure that the data fuelling and the inner workings of these artefacts are transparent? How can AI be made accountable for its decision-making? This paper conceptualises the foundational topics and research pillars to support KG-based AI for self-determination. Drawing upon this conceptual framework, challenges and opportunities for citizen self-determination are illustrated and analysed in a real-world scenario. As a result, we propose a research agenda aimed at accomplishing the recommended objectives.

Submitted 31 October, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

arXiv:2211.08190 [pdf]

Reconocimiento de Objetos a partir de Nube de Puntos en un Veículo Aéreo no Tripulado

Authors: Agustina Marion de Freitas Vidal, Anthony Rodriguez, Richard Suarez, André Kelbouscas, Ricardo Grando

Abstract: Currently, research in robotics, artificial intelligence and drones are advancing exponentially, they are directly or indirectly related to various areas of the economy, from agriculture to industry. With this context, this project covers these topics guiding them, seeking to provide a framework that is capable of helping to develop new future researchers. For this, we use an aerial vehicle that works autonomously and is capable of mapping the scenario and providing useful information to the end user. This occurs from a communication between a simple programming language (Scratch) and one of the most important and efficient robot operating systems today (ROS). This is how we managed to develop a tool capable of generating a 3D map and detecting objects using the camera attached to the drone. Although this tool can be used in the advanced fields of industry, it is also an important advance for the research sector. The implementation of this tool in intermediate-level institutions is aspired to provide the ability to carry out high-level projects from a simple programming language.

Submitted 23 October, 2022; originally announced November 2022.

Comments: In Spanish. Article accepted at FEBITEC 2022

arXiv:2210.15645 [pdf, other]

Dragoman: Efficiently Evaluating Declarative Mapping Languages over Frameworks for Knowledge Graph Creation

Authors: Samaneh Jozashoori, Enrique Iglesias, Maria-Esther Vidal

Abstract: In recent years, there have been valuable efforts and contributions to make the process of RDF knowledge graph creation traceable and transparent; extending and applying declarative mapping languages is an example. One challenging step is the traceability of procedures that aim to overcome interoperability issues, a.k.a. data-level integration. In most pipelines, data integration is performed by ad-hoc programs, preventing traceability and reusability. However, formal frameworks provided by function-based declarative mapping languages such as FunUL and RML+FnO empower expressiveness. Data-level integration can be defined as functions and integrated as part of the mappings performing schema-level integration. However, combining functions with the mappings introduces a new source of complexity that can considerably impact the required number of resources and execution time. We tackle the problem of efficiently executing mappings with functions and formalize the transformation of them into function-free mappings. These transformations are the basis of an optimization process that aims to perform an eager evaluation of function-based mapping rules. These techniques are implemented in a framework named Dragoman. We demonstrate the correctness of the transformations while ensuring that the function-free data integration processes are equivalent to the original one. The effectiveness of Dragoman is empirically evaluated in 230 testbeds composed of various types of functions integrated with mapping rules of different complexity. The outcomes suggest that evaluating function-free mapping rules reduces execution time in complex knowledge graph creation pipelines composed of large data sources and multiple types of mapping rules. The savings can be up to 75%, suggesting that eagerly executing functions in mapping rules enable making these pipelines applicable and scalable in real-world settings.

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2206.07375 [pdf, other]

Knowledge4COVID-19: A Semantic-based Approach for Constructing a COVID-19 related Knowledge Graph from Various Sources and Analysing Treatments' Toxicities

Authors: Ahmad Sakor, Samaneh Jozashoori, Emetis Niazmand, Ariam Rivas, Kostantinos Bougiatiotis, Fotis Aisopos, Enrique Iglesias, Philipp D. Rohde, Trupti Padiya, Anastasia Krithara, Georgios Paliouras, Maria-Esther Vidal

Abstract: In this paper, we present Knowledge4COVID-19, a framework that aims to showcase the power of integrating disparate sources of knowledge to discover adverse drug effects caused by drug-drug interactions among COVID-19 treatments and pre-existing condition drugs. Initially, we focus on constructing the Knowledge4COVID-19 knowledge graph (KG) from the declarative definition of mapping rules using the RDF Mapping Language. Since valuable information about drug treatments, drug-drug interactions, and side effects is present in textual descriptions in scientific databases (e.g., DrugBank) or in scientific literature (e.g., the CORD-19, the Covid-19 Open Research Dataset), the Knowledge4COVID-19 framework implements Natural Language Processing. The Knowledge4COVID-19 framework extracts relevant entities and predicates that enable the fine-grained description of COVID-19 treatments and the potential adverse events that may occur when these treatments are combined with treatments of common comorbidities, e.g., hypertension, diabetes, or asthma. Moreover, on top of the KG, several techniques for the discovery and prediction of interactions and potential adverse effects of drugs have been developed with the aim of suggesting more accurate treatments for treating the virus. We provide services to traverse the KG and visualize the effects that a group of drugs may have on a treatment outcome. Knowledge4COVID-19 was part of the Pan-European hackathon #EUvsVirus in April 2020 and is publicly available as a resource through a GitHub repository (https://github.com/SDM-TIB/Knowledge4COVID-19) and a DOI (https://zenodo.org/record/4701817#.YH336-8zbol).

Submitted 7 October, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

arXiv:2205.13883 [pdf, other]

Efficient Semantic Summary Graphs for Querying Large Knowledge Graphs

Authors: Emetis Niazmand, Gezim Sejdiu, Damien Graux, Maria-Esther Vidal

Abstract: Knowledge Graphs (KGs) integrate heterogeneous data, but one challenge is the development of efficient tools for allowing end users to extract useful insights from these sources of knowledge. In such a context, reducing the size of a Resource Description Framework (RDF) graph while preserving all information can speed up query engines by limiting data shuffle, especially in a distributed setting. This paper presents two algorithms for RDF graph summarization: Grouping Based Summarization (GBS) and Query Based Summarization (QBS). The latter is an optimized and lossless approach for the former method. We empirically study the effectiveness of the proposed lossless RDF graph summarization to retrieve complete data, by rewriting an RDF Query Language called SPARQL query with fewer triple patterns using a semantic similarity. We conduct our experimental study in instances of four datasets with different sizes. Compared with the state-of-the-art query engine Sparklify executed over the original RDF graphs as a baseline, QBS query execution time is reduced by up to 80% and the summarized RDF graph is decreased by up to 99%.

Submitted 27 May, 2022; originally announced May 2022.

arXiv:2203.07436 [pdf, other]

SuperAnimal pretrained pose estimation models for behavioral analysis

Authors: Shaokai Ye, Anastasiia Filippova, Jessy Lauer, Steffen Schneider, Maxime Vidal, Tian Qiu, Alexander Mathis, Mackenzie Weygandt Mathis

Abstract: Quantification of behavior is critical in applications ranging from neuroscience, veterinary medicine and animal conservation efforts. A common key step for behavioral analysis is first extracting relevant keypoints on animals, known as pose estimation. However, reliable inference of poses currently requires domain knowledge and manual labeling effort to build supervised models. We present a series of technical innovations that enable a new method, collectively called SuperAnimal, to develop unified foundation models that can be used on over 45 species, without additional human labels. Concretely, we introduce a method to unify the keypoint space across differently labeled datasets (via our generalized data converter) and for training these diverse datasets in a manner such that they don't catastrophically forget keypoints given the unbalanced inputs (via our keypoint gradient masking and memory replay approaches). These models show excellent performance across six pose benchmarks. Then, to ensure maximal usability for end-users, we demonstrate how to fine-tune the models on differently labeled data and provide tooling for unsupervised video adaptation to boost performance and decrease jitter across frames. If the models are fine-tuned, we show SuperAnimal models are 10-100× more data efficient than prior transfer-learning-based approaches. We illustrate the utility of our models in behavioral classification in mice and gait analysis in horses. Collectively, this presents a data-efficient solution for animal pose estimation.

Submitted 30 December, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

Comments: Models and demos available at http://modelzoo.deeplabcut.org

arXiv:2201.09694 [pdf, other]

Scaling Up Knowledge Graph Creation to Large and Heterogeneous Data Sources

Authors: Enrique Iglesias, Samaneh Jozashoori, Maria-Esther Vidal

Abstract: RDF knowledge graphs (KG) are powerful data structures to represent factual statements created from heterogeneous data sources. KG creation is laborious and demands data management techniques to be executed efficiently. This paper tackles the problem of the automatic generation of KG creation processes declaratively specified; it proposes techniques for planning and transforming heterogeneous data into RDF triples following mapping assertions specified in the RDF Mapping Language (RML). Given a set of mapping assertions, the planner provides an optimized execution plan by partitioning and scheduling the execution of the assertions. First, the planner assesses an optimized number of partitions considering the number of data sources, type of mapping assertions, and the associations between different assertions. After providing a list of partitions and assertions that belong to each partition, the planner determines their execution order. A greedy algorithm is implemented to generate the partitions' bushy tree execution plan. Bushy tree plans are translated into operating system commands that guide the execution of the partitions of the mapping assertions in the order indicated by the bushy tree. The proposed optimization approach is evaluated over state-of-the-art RML-compliant engines, and existing benchmarks of data sources and RML triples maps. Our experimental results suggest that the performance of the studied engines can be considerably improved, particularly in a complex setting with numerous triples maps and large data sources. As a result, engines that time out in complex cases are enabled to produce at least a portion of the KG applying the planner.

Submitted 26 October, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

arXiv:2112.07493 [pdf, other]

doi 10.1145/3477314.3507132

EABlock: A Declarative Entity Alignment Block for Knowledge Graph Creation Pipelines

Authors: Samaneh Jozashoori, Ahmad Sakor, Enrique Iglesias, Maria-Esther Vidal

Abstract: Despite encoding an enormous amount of rich and valuable data, existing data sources are mostly created independently, which poses a significant challenge to their integration. Mapping languages, e.g., RML and R2RML, facilitate declarative specification of the process of applying meta-data and integrating data into a knowledge graph. Mapping rules can also include knowledge extraction functions in addition to expressing correspondences among data sources and a unified schema. Combining mapping rules and functions represents a powerful formalism to specify pipelines for integrating data into a knowledge graph transparently. Surprisingly, these formalisms are not fully adapted, and many knowledge graphs are created by executing ad-hoc programs to pre-process and integrate data. In this paper, we present EABlock, an approach integrating Entity Alignment (EA) as part of RML mapping rules. EABlock includes a block of functions that perform entity recognition from textual attributes and link the recognized entities to the corresponding resources in Wikidata, DBpedia, and domain-specific thesauri, e.g., UMLS. EABlock provides agnostic and efficient techniques to evaluate the functions and transfer the mappings to facilitate its application in any RML-compliant engine. We have empirically evaluated EABlock performance, and results indicate that EABlock speeds up knowledge graph creation pipelines that require entity recognition and linking in state-of-the-art RML-compliant engines. EABlock is also publicly available as a tool through a GitHub repository (https://github.com/SDM-TIB/EABlock) and a DOI (https://doi.org/10.5281/zenodo.5779773).

Submitted 15 December, 2021; v1 submitted 14 December, 2021; originally announced December 2021.

arXiv:2111.07005 [pdf, other]

Understanding and Assessment of Mission-Centric Key Cyber Terrains for joint Military Operations

Authors: Álvaro Luis Martínez, Jorge Maestre Vidal, Victor A. Villagrá González

Abstract: Since cyberspace consolidated as the fifth warfare dimension, the different actors of the defense sector began an arms race toward achieving cyber superiority, to which research, academic and industrial stakeholders contribute from a dual vision, mostly linked to a large and heterogeneous heritage of developments and adoption of civilian cybersecurity capabilities. In this context, augmenting the awareness of the context and warfare environment, and of the risks and impacts of cyber threats on kinetic actuations, became a critical rule-changer that military decision-makers are considering. A major challenge in acquiring mission-centric Cyber Situational Awareness (CSA) is the dynamic inference and assessment of the vertical propagations from situations that occurred at the mission supportive Information and Communications Technologies (ICT), up to their relevance at military tactical, operational and strategical views. In order to contribute to acquiring CSA, this paper addresses a major gap in the cyber defence state-of-the-art: the dynamic identification of Key Cyber Terrains (KCT) in a mission-centric context. Accordingly, the proposed KCT identification approach explores the dependency degrees among tasks and assets defined by commanders as part of the assessment criteria. These are correlated with the discoveries on the operational network and the asset vulnerabilities identified throughout the supported mission development. The proposal is presented as a reference model that reveals key aspects for mission-centric KCT analysis and supports its enforcement and further enforcement by including an illustrative application case.

Submitted 12 November, 2021; originally announced November 2021.

Comments: Preprint of an extended version of the conference "A novel automatic discovery system of critical assets in cyberspace-oriented military missions", in Proc. of the First Workshop on Recent Advances in Cyber Situational Awareness on Military Operations (CSA 2020) held by the 15th ARES International Conference in August 2020. https://doi.org/10.1145/3407023.3409225

arXiv:2107.06999 [pdf]

Reuse of Semantic Models for Emerging Smart Grids Applications

Authors: Valentina Janev, Dušan Popadić, Dea Pujić, Maria Esther Vidal, Kemele Endris

Abstract: Data in the energy domain grows at unprecedented rates. Despite the great potential that IoT platforms and other big data-driven technologies have brought in the energy sector, data exchange and data integration are still not wholly achieved. As a result, fragmented applications are developed against energy data silos, and data exchange is limited to few applications. Therefore, this paper identifies semantic models that can be reused for building interoperable energy management services and applications. The ambition is to innovate the Institute Mihajlo Pupin proprietary SCADA system and to enable integration of the Institute Mihajlo Pupin services and applications in the European Union (EU) Energy Data Space. The selection of reusable models has been done based on a set of scenarios related to electricity balancing services, predictive maintenance services, and services for the residential, commercial and industrial sectors.

Submitted 8 July, 2021; originally announced July 2021.

Comments: Paper presented at the ICIST Conference 2021

arXiv:2107.01965 [pdf, other]

doi 10.1145/3442442.3453541

Managing Knowledge in Energy Data Spaces

Authors: Valentina Janev, Maria-Esther Vidal, Kemele Endris, Dea Pujic

Abstract: Data in the energy domain grows at unprecedented rates and is usually generated by heterogeneous energy systems. Despite the great potential that big data-driven technologies can bring to the energy sector, general adoption is still lagging. Several challenges related to controlled data exchange and data integration are still not wholly achieved. As a result, fragmented applications are developed against energy data silos, and data exchange is limited to few applications. In this paper, we analyze the challenges and requirements related to energy-related data applications. We also evaluate the use of Energy Data Ecosystems (EDEs) as data-driven infrastructures to overcome the current limitations of fragmented energy applications. EDEs are inspired by the International Data Space (IDS) initiative launched in Germany at the end of 2014 with an overall objective to take both the development and use of the IDS reference architecture model to a European/global level. The reference architecture model consists of four architectures related to business, security, data and service, and software aspects. This paper illustrates the applicability of EDEs and IDS reference architecture in real-world scenarios from the energy sector. The analyzed scenario is positioned in the context of the EU-funded H2020 project PLATOON.

Submitted 5 July, 2021; originally announced July 2021.

Comments: Based on the article Valentina Janev, Maria-Esther Vidal, Kemele M. Endris, Dea Pujic: Managing Knowledge in Energy Data Spaces. WWW (Companion Volume) 2021: 7-15

ACM Class: H.2.5; H.2.8

arXiv:2107.01910 [pdf, other]

doi 10.1145/3442442.3453541

Analyzing a Knowledge Graph of Industry4.0 Standards

Authors: Irlan Grangel-Gonzalez, Maria-Esther Vidal

Abstract: In this article, we tackle the problem of standard interoperability across different standardization frameworks, and devise a knowledge-driven approach that allows for the description of standards and standardization frameworks in an Industry 4.0 knowledge graph (I40KG). The STO ontology represents properties of standards and standardization frameworks, as well as relationships among them. The I40KG integrates more than 200 standards and four standardization frameworks. To populate the I40KG, the landscape of standards has been analyzed from a semantic perspective, and the resulting I40KG represents knowledge expressed in more than 200 industry-related documents including technical reports, research articles, and white papers. Additionally, the I40KG has been linked to existing knowledge graphs, and automated reasoning has been implemented to reveal implicit relations between standards as well as mappings across standardization frameworks. We analyze both the number of discovered relations between standards and the accuracy of these relations. Observed results indicate that both reasoning and linking processes increase the connectivity in the knowledge graph by up to 80%, whilst up to 96% of the relations can be validated. These outcomes suggest that integrating standards and standardization frameworks into the I40KG enables the resolution of semantic interoperability conflicts, empowering the communication in smart factories.

Submitted 5 July, 2021; originally announced July 2021.

Comments: Based on the paper Irlan Grangel-Gonzalez, Maria-Esther Vidal: Analyzing a Knowledge Graph of Industry 4.0 Standards. WWW (Companion Volume) 2021: 16-25

ACM Class: H.2.5; H.2.8

arXiv:2105.09312 [pdf, other]

Knowledge-driven Data Ecosystems Towards Data Transparency

Authors: Sandra Geisler, Maria-Esther Vidal, Cinzia Cappiello, Bernadette Farias Lóscio, Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda Paja, Barbara Pernici, Jakob Rehof

Abstract: A Data Ecosystem offers a keystone-player or alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. However, despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driven pipelines. In this work, we focus on requirements and challenges that data ecosystems face when ensuring data transparency. Requirements are derived from the data and organizational management, as well as from broader legal and ethical considerations. We propose a novel knowledge-driven data ecosystem architecture, providing the pillars for satisfying the analyzed requirements. We illustrate the potential of our proposal in a real-world scenario. Lastly, we discuss and rate the potential of the proposed architecture in the fulfillment of these requirements.

Submitted 21 May, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

arXiv:2103.12115 [pdf, other]

End-to-End Trainable Multi-Instance Pose Estimation with Transformers

Authors: Lucas Stoffl, Maxime Vidal, Alexander Mathis

Abstract: We propose an end-to-end trainable approach for multi-instance pose estimation, called POET (POse Estimation Transformer). Combining a convolutional neural network with a transformer encoder-decoder architecture, we formulate multi-instance pose estimation from images as a direct set prediction problem. Our model is able to directly regress the pose of all individuals, utilizing a bipartite matching scheme. POET is trained using a novel set-based global loss that consists of a keypoint loss, a visibility loss and a class loss. POET reasons about the relations between multiple detected individuals and the full image context to directly predict their poses in parallel. We show that POET achieves high accuracy on the COCO keypoint detection task while having fewer parameters and higher inference speed than other bottom-up and top-down approaches. Moreover, we show successful transfer learning when applying POET to animal pose estimation. To the best of our knowledge, this model is the first end-to-end trainable multi-instance pose estimation method and we hope it will serve as a simple and promising alternative.

Submitted 21 December, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

arXiv:2103.00560 [pdf, other]

doi 10.1093/icb/icab107

Perspectives on individual animal identification from biology and computer vision

Authors: Maxime Vidal, Nathan Wolf, Beth Rosenberg, Bradley P. Harris, Alexander Mathis

Abstract: Identifying individual animals is crucial for many biological investigations. In response to some of the limitations of current identification methods, new automated computer vision approaches have emerged with strong performance. Here, we review current advances of computer vision identification techniques to provide both computer scientists and biologists with an overview of the available tools and discuss their applications. We conclude by offering recommendations for starting an animal identification project, illustrate current limitations and propose how they might be addressed in the future.

Submitted 28 February, 2021; originally announced March 2021.

Comments: 12 pages, 1 figure, 2 boxes and 1 table

Journal ref: Integr Comp Biol. 2021 Oct 4;61(3):900-916

arXiv:2101.08676 [pdf, other]

Conceptualization and cases of study on cyber operations against the sustainability of the tactical edge

Authors: Marco Antonio Sotelo Monge, Jorge Maestre Vidal

Abstract: The last decade consolidated cyberspace as the fifth domain of operations, which extends its preliminary intelligence and information exchange purposes towards enabling complex offensive and defensive operations supported by, or supportive of, parallel kinetic domain actuations. Although there is a plethora of well documented cases on strategic and operational interventions of cyber commands, the cyber tactical military edge is still a challenge, where cyber fires barely integrate into the traditional joint targeting cycle due, among others, to long planning/development times, asymmetric effects, strict target reachability requirements, or the fast propagation of collateral damage; the latter rapidly resulting in hybrid impacts (political, economic, social, etc.) and evidencing significant socio-technical gaps. In this context, it is expected that tactical clouds disruptively facilitate cyber operations at the edge while exposing the rest of the digital assets of the operation to them. On these grounds, the main purpose of the conducted research is to review and analyze in depth the risks and opportunities of jeopardizing the sustainability of the military tactical clouds at the edge by cyber operations. Along with 1) a comprehensive formulation of the research problem, the study 2) formalizes the Tactical Denial of Sustainability (TDoS) concept; 3) introduces the phasing, potential attack surfaces, terrains and impact of TDoS attacks; 4) emphasizes the related human and socio-technical aspects; 5) analyzes the threats/opportunities inherent to their impact on the cloud energy efficiency; 6) reviews their implications for military cyber thinking in tactical operations; 7) illustrates five extensive CONOPS that facilitate the understanding of the TDoS concept; and, given the high novelty of the discussed topics, 8) paves the way for further research and development actions.

Submitted 21 January, 2021; originally announced January 2021.

arXiv:2101.07136 [pdf, other]

Trav-SHACL: Efficiently Validating Networks of SHACL Constraints

Authors: Mónica Figuera, Philipp D. Rohde, Maria-Esther Vidal

Abstract: Knowledge graphs have emerged as expressive data structures for Web data. Knowledge graph potential and the demand for ecosystems to facilitate their creation, curation, and understanding, is testified in diverse domains, e.g., biomedicine. The Shapes Constraint Language (SHACL) is the W3C recommendation language for integrity constraints over RDF knowledge graphs. Enabling quality assessments of knowledge graphs, SHACL is rapidly gaining attention in real-world scenarios. SHACL models integrity constraints as a network of shapes, where a shape contains the constraints to be fulfilled by the same entities. The validation of a SHACL shape schema can face the issue of tractability during validation. To facilitate full adoption, efficient computational methods are required. We present Trav-SHACL, a SHACL engine capable of planning the traversal and execution of a shape schema in a way that invalid entities are detected early and needless validations are minimized. Trav-SHACL reorders the shapes in a shape schema for efficient validation and rewrites target and constraint queries for the fast detection of invalid entities. Trav-SHACL is empirically evaluated on 27 testbeds executed against knowledge graphs of up to 34M triples. Our experimental results suggest that Trav-SHACL exhibits high performance gradually and reduces validation time by a factor of up to 28.93 compared to the state of the art.

Submitted 18 January, 2021; originally announced January 2021.

arXiv:2011.09748 [pdf, other]

Compact Representations for Efficient Storage of Semantic Sensor Data

Authors: Farah Karim, Maria-Esther Vidal, Sören Auer

Abstract: Nowadays, there is a rapid increase in the amount of sensor data generated by a wide variety of sensors and devices. Data semantics facilitate information exchange, adaptability, and interoperability among several sensors and devices. Sensor data and their meaning can be described using ontologies, e.g., the Semantic Sensor Network (SSN) Ontology. However, once semantically enriched, the size of semantic sensor data is substantially larger than raw sensor data. Moreover, some measurement values can be observed by sensors several times, and a huge number of repeated facts about sensor data can be produced. We propose a compact or factorized representation of semantic sensor data, where repeated measurement values are described only once. Furthermore, these compact representations are able to enhance the storage and processing of semantic sensor data. To scale up to large datasets, factorization-based tabular representations are exploited to store and manage factorized semantic sensor data using Big Data technologies. We empirically study the effectiveness of the proposed compact representations of semantic sensor data and their impact on query processing. Additionally, we evaluate the effects of storing the proposed representations on diverse RDF implementations. Results suggest that the proposed compact representations empower the storage and query processing of sensor data over diverse RDF implementations, and can reduce query execution time by up to two orders of magnitude.

Submitted 19 November, 2020; originally announced November 2020.

arXiv:2008.13482 [pdf, other]

doi 10.1007/978-3-030-62419-4_16

FunMap: Efficient Execution of Functional Mappings for Knowledge Graph Creation

Authors: Samaneh Jozashoori, David Chaves-Fraga, Enrique Iglesias, Maria-Esther Vidal, Oscar Corcho

Abstract: Data has exponentially grown in the last years, and knowledge graphs constitute powerful formalisms to integrate a myriad of existing data sources. Transformation functions -- specified with function-based mapping languages like FunUL and RML+FnO -- can be applied to overcome interoperability issues across heterogeneous data sources. However, the absence of engines to efficiently execute these mapping languages hinders their global adoption. We propose FunMap, an interpreter of function-based mapping languages; it relies on a set of lossless rewriting rules to push down and materialize the execution of functions in initial steps of knowledge graph creation. Although applicable to any function-based mapping language that supports joins between mapping rules, FunMap feasibility is shown on RML+FnO. FunMap reduces data redundancy, e.g., duplicates and unused attributes, and converts RML+FnO mappings into a set of equivalent rules executable on RML-compliant engines. We evaluate FunMap performance over real-world testbeds from the biomedical domain. The results indicate that FunMap reduces the execution time of RML-compliant engines by up to a factor of 18, furnishing, thus, a scalable solution for knowledge graph creation.

Submitted 5 October, 2020; v1 submitted 31 August, 2020; originally announced August 2020.

arXiv:2008.07176 [pdf, other]

doi 10.1145/3340531.3412881

SDM-RDFizer: An RML Interpreter for the Efficient Creation of RDF Knowledge Graphs

Authors: Enrique Iglesias, Samaneh Jozashoori, David Chaves-Fraga, Diego Collarana, Maria-Esther Vidal

Abstract: In recent years, the amount of data has increased exponentially, and knowledge graphs have gained attention as data structures to integrate data and knowledge harvested from myriad data sources. However, data complexity issues like large volume, high-duplicate rate, and heterogeneity usually characterize these data sources, requiring data management tools able to address the negative impact of these issues on the knowledge graph creation process. In this paper, we propose the SDM-RDFizer, an interpreter of the RDF Mapping Language (RML), to transform raw data in various formats into an RDF knowledge graph. SDM-RDFizer implements novel algorithms to execute the logical operators between mappings in RML, thus allowing it to scale up to complex scenarios where data is not only broad but has a high-duplication rate. We empirically evaluate the SDM-RDFizer performance against diverse testbeds with diverse configurations of data volume, duplicates, and heterogeneity. The observed results indicate that SDM-RDFizer is two orders of magnitude faster than the state of the art, meaning that SDM-RDFizer is an interoperable and scalable solution for knowledge graph creation. SDM-RDFizer is publicly available as a resource through a GitHub repository and a DOI.

Submitted 17 August, 2020; originally announced August 2020.

arXiv:2006.04556 [pdf, other]

Unveiling Relations in the Industry 4.0 Standards Landscape based on Knowledge Graph Embeddings

Authors: Ariam Rivas, Irlán Grangel-González, Diego Collarana, Jens Lehmann, Maria-Esther Vidal

Abstract: Industry 4.0 (I4.0) standards and standardization frameworks have been proposed with the goal of empowering interoperability in smart factories. These standards enable the description and interaction of the main components, systems, and processes inside of a smart factory. Due to the growing number of frameworks and standards, there is an increasing need for approaches that automatically analyze the landscape of I4.0 standards. Standardization frameworks classify standards according to their functions into layers and dimensions. However, similar standards can be classified differently across the frameworks, producing, thus, interoperability conflicts among them. Semantic-based approaches that rely on ontologies and knowledge graphs, have been proposed to represent standards, known relations among them, as well as their classification according to existing frameworks. Albeit informative, the structured modeling of the I4.0 landscape only provides the foundations for detecting interoperability issues. Thus, graph-based analytical methods able to exploit knowledge encoded by these approaches, are required to uncover alignments among standards. We study the relatedness among standards and frameworks based on community analysis to discover knowledge that helps to cope with interoperability conflicts between standards. We use knowledge graph embeddings to automatically create these communities exploiting the meaning of the existing relationships. In particular, we focus on the identification of similar standards, i.e., communities of standards, and analyze their properties to detect unknown relations. We empirically evaluate our approach on a knowledge graph of I4.0 standards using the Trans* family of embedding models for knowledge graph entities. Our results are promising and suggest that relations among standards can be detected accurately.

Submitted 3 June, 2020; originally announced June 2020.

Comments: 15 pages, 7 figures, DEXA2020 Conference

arXiv:2003.05238 [pdf, other]

doi 10.1007/s10844-020-00595-9

Compacting Frequent Star Patterns in RDF Graphs

Authors: Farah Karim, Maria-Esther Vidal, Sören Auer

Abstract: Knowledge graphs have become a popular formalism for representing entities and their properties using a graph data model, e.g., the Resource Description Framework (RDF). An RDF graph comprises entities of the same type connected to objects or other entities using labeled edges annotated with properties. RDF graphs usually contain entities that share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. In case the number of these entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer to these star patterns as frequent star patterns. We address the problem of identifying frequent star patterns in RDF graphs and devise the concept of factorized RDF graphs, which denote compact representations of RDF graphs where the number of frequent star patterns is minimized. We also develop computational methods to identify frequent star patterns and generate a factorized RDF graph, where compact RDF molecules replace frequent star patterns. A compact RDF molecule of a frequent star pattern denotes an RDF subgraph that instantiates the corresponding star pattern. Instead of having all the entities matching the original frequent star pattern, a surrogate entity is added and related to the properties of the frequent star pattern; it is linked to the entities that originally match the frequent star pattern. We evaluate the performance of our factorization techniques on several RDF graph benchmarks and compare with a baseline built on top of gSpan, a state-of-the-art algorithm to detect frequent patterns. The outcomes evidence the efficiency of the proposed approach and show that our techniques are able to reduce the execution time of the baseline approach by at least three orders of magnitude, reducing the RDF graph size by up to 66.56%.

Submitted 11 March, 2020; originally announced March 2020.
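
The surrogate-entity idea described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the star-pattern properties are supplied by the caller rather than detected, and the function name `factorize`, the `ex:hasMolecule` link property, and the `ex:moleculeN` surrogate names are hypothetical choices made only for this illustration.

```python
from collections import defaultdict


def factorize(triples, star_props, min_entities=2, link="ex:hasMolecule"):
    """Replace repeated (property, object) stars with one surrogate entity.

    triples      -- list of (subject, property, object) string tuples
    star_props   -- properties that make up the star pattern (given, not mined)
    min_entities -- how many subjects must share a star before it is factorized
    link         -- hypothetical property linking a subject to its surrogate
    """
    star_props = set(star_props)

    # Collect, per subject, the (property, object) pairs on the star properties.
    star_of = defaultdict(set)
    for s, p, o in triples:
        if p in star_props:
            star_of[s].add((p, o))

    # Group subjects that instantiate the full star with identical objects.
    groups = defaultdict(list)
    for s, pairs in star_of.items():
        if {p for p, _ in pairs} == star_props:
            groups[frozenset(pairs)].append(s)

    # Sort for deterministic surrogate numbering.
    frequent = sorted(
        (sorted(pairs), sorted(subjects))
        for pairs, subjects in groups.items()
        if len(subjects) >= min_entities
    )

    surrogate_of = {}
    extra = []
    for i, (pairs, subjects) in enumerate(frequent):
        surrogate = f"ex:molecule{i}"
        # The surrogate carries the shared properties once.
        extra.extend((surrogate, p, o) for p, o in pairs)
        for s in subjects:
            surrogate_of[s] = surrogate
            extra.append((s, link, surrogate))

    # Drop the star triples of factorized subjects; keep everything else.
    kept = [
        (s, p, o)
        for s, p, o in triples
        if not (s in surrogate_of and p in star_props)
    ]
    return kept + extra
```

With n subjects sharing a star of k properties, the n·k star triples are replaced by k surrogate triples plus n link triples, so the sketch only shrinks the graph when the star has more than one property and is shared by enough subjects.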

arXiv:2002.08102 [pdf, other]

Optimizing Federated Queries Based on the Physical Design of a Data Lake

Authors: Philipp D. Rohde, Maria-Esther Vidal

Abstract: The optimization of query execution plans is known to be crucial for reducing the query execution time. In particular, query optimization has been studied thoroughly for relational databases over the past decades. Recently, the Resource Description Framework (RDF) became popular for publishing data on the Web. As a consequence, federations composed of different data models like RDF and relational databases evolved. One type of such federation is the Semantic Data Lake, where every data source is kept in its original data model and semantically annotated with ontologies or controlled vocabularies. However, state-of-the-art query engines for federated query processing over Semantic Data Lakes often rely on optimization techniques tailored for RDF. In this paper, we present query optimization techniques guided by heuristics that take the physical design of a Data Lake into account. The heuristics are implemented on top of Ontario, a SPARQL query engine for Semantic Data Lakes. Using source-specific heuristics, the query engine is able to generate more efficient query execution plans by exploiting the knowledge about indexes and normalization in relational databases. We show that heuristics which take the physical design of the Data Lake into account are able to speed up query processing.

Submitted 23 March, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

Comments: work-in-progress paper

arXiv:2002.06071 [pdf, other]

FQuAD: French Question Answering Dataset

Authors: Martin d'Hoffschmidt, Wacim Belblidia, Tom Brendlé, Quentin Heinrich, Maxime Vidal

Abstract: Recent advances in the field of language modeling have improved state-of-the-art results on many Natural Language Processing tasks. Among them, Reading Comprehension has made significant progress over the past few years. However, most results are reported in English since labeled resources available in other languages, such as French, remain scarce. In the present work, we introduce the French Question Answering Dataset (FQuAD). FQuAD is a French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version. We train a baseline model which achieves an F1 score of 92.2 and an exact match ratio of 82.1 on the test set. In order to track the progress of French Question Answering models we propose a leader-board and we have made the 1.0 version of our dataset freely available at https://illuin-tech.github.io/FQuAD-explorer/.

Submitted 25 May, 2020; v1 submitted 14 February, 2020; originally announced February 2020.

Comments: 15 pages, 5 figures

arXiv:2001.09762 [pdf, other]

Bias in Data-driven AI Systems -- An Introductory Survey

Authors: Eirini Ntoutsi, Pavlos Fafalios, Ujwal Gadiraju, Vasileios Iosifidis, Wolfgang Nejdl, Maria-Esther Vidal, Salvatore Ruggieri, Franco Turini, Symeon Papadopoulos, Emmanouil Krasanakis, Ioannis Kompatsiaris, Katharina Kinder-Kurlanda, Claudia Wagner, Fariba Karimi, Miriam Fernandez, Harith Alani, Bettina Berendt, Tina Kruegel, Christian Heinze, Klaus Broelemann, Gjergji Kasneci, Thanassis Tiropanis, Steffen Staab

Abstract: AI-based systems are widely employed nowadays to make decisions that have far-reaching impacts on individuals and society. Their decisions might affect everyone, everywhere and anytime, entailing concerns about potential human rights issues. Therefore, it is necessary to move beyond traditional AI algorithms optimized for predictive performance and embed ethical and legal principles in their design, training and deployment to ensure social good while still benefiting from the huge potential of the AI technology. The goal of this survey is to provide a broad multi-disciplinary overview of the area of bias in AI systems, focusing on technical challenges and solutions as well as to suggest new research directions towards approaches well-grounded in a legal frame. In this survey, we focus on data-driven AI, as a large part of AI is powered nowadays by (big) data and powerful Machine Learning (ML) algorithms. If not otherwise specified, we use the general term bias to describe problems related to the gathering or processing of data that might result in prejudiced decisions on the basis of demographic features like race, sex, etc.

Submitted 14 January, 2020; originally announced January 2020.

Comments: 19 pages, 1 figure

arXiv:2001.09052 [pdf, other]

doi 10.3233/SW-210432

Enhancing Virtual Ontology Based Access over Tabular Data with Morph-CSV

Authors: David Chaves-Fraga, Edna Ruckhaus, Freddy Priyatna, Maria-Esther Vidal, Oscar Corcho

Abstract: Ontology-Based Data Access (OBDA) has traditionally focused on providing a unified view of heterogeneous datasets, either by materializing integrated data into RDF or by performing on-the-fly querying via SPARQL query translation. In the specific case of tabular datasets represented as several CSV or Excel files, query translation approaches have been applied by considering each source as a single table that can be loaded into a relational database management system (RDBMS). Nevertheless, constraints over these tables are not represented; thus, neither consistency among attributes nor indexes over tables are enforced. As a consequence, efficiency of the SPARQL-to-SQL translation process may be affected, as well as the completeness of the answers produced during the evaluation of the generated SQL query. Our work is focused on applying implicit constraints on the OBDA query translation process over tabular data. We propose Morph-CSV, a framework for querying tabular data that exploits information from typical OBDA inputs (e.g., mappings, queries) to enforce constraints that can be used together with any SPARQL-to-SQL OBDA engine. Morph-CSV relies on both a constraint component and a set of constraint operators. For a given set of constraints, the operators are applied to each type of constraint with the aim of enhancing query completeness and performance.
We evaluate Morph-CSV in several domains: e-commerce with the BSBM benchmark; transportation with a benchmark using the GTFS dataset from the Madrid subway; and biology with a use case extracted from the Bio2RDF project. We compare and report the performance of two SPARQL-to-SQL OBDA engines, without and with the incorporation of Morph-CSV. The observed results suggest that Morph-CSV is able to speed up the total query execution time by up to two orders of magnitude, while it is able to produce all the query answers.

Submitted 21 February, 2021; v1 submitted 24 January, 2020; originally announced January 2020.

arXiv:1912.11270 [pdf, other]

doi 10.1145/3340531.3412777

Falcon 2.0: An Entity and Relation Linking Tool over Wikidata

Authors: Ahmad Sakor, Kuldeep Singh, Anery Patel, Maria-Esther Vidal

Abstract: The Natural Language Processing (NLP) community has significantly contributed to the solutions for entity and relation recognition from the text, and possibly linking them to proper matches in Knowledge Graphs (KGs). Considering Wikidata as the background KG, there are still limited tools to link knowledge within the text to Wikidata. In this paper, we present Falcon 2.0, the first joint entity and relation linking tool over Wikidata. It receives a short natural language text in the English language and outputs a ranked list of entities and relations annotated with the proper candidates in Wikidata. The candidates are represented by their Internationalized Resource Identifier (IRI) in Wikidata. Falcon 2.0 resorts to the English language model for the recognition task (e.g., N-Gram tiling and N-Gram splitting), and then an optimization approach for the linking task. We have empirically studied the performance of Falcon 2.0 on Wikidata and concluded that it outperforms all the existing baselines. Falcon 2.0 is public and can be reused by the community; all the required instructions of Falcon 2.0 are well-documented at our GitHub repository. We also demonstrate an online API, which can be run without any technical expertise. Falcon 2.0 and its background knowledge bases are available as resources at https://labs.tib.eu/falcon/falcon2/.

Submitted 31 August, 2020; v1 submitted 24 December, 2019; originally announced December 2019.

Comments: CIKM 2020 Paper 8 pages

arXiv:1912.06214 [pdf, other]

Encoding Knowledge Graph Entity Aliases in Attentive Neural Network for Wikidata Entity Linking

Authors: Isaiah Onando Mulang, Kuldeep Singh, Akhilesh Vyas, Saeedeh Shekarpour, Maria Esther Vidal, Jens Lehmann, Soren Auer

Abstract: Collaborative knowledge graphs such as Wikidata rely excessively on the crowd to author the information. Since the crowd is not bound to a standard protocol for assigning entity titles, the knowledge graph is populated by non-standard, noisy, long or even sometimes awkward titles. The issue of long, implicit, and nonstandard entity representations is a challenge in Entity Linking (EL) approaches for gaining high precision and recall. The underlying KG is, in general, the source of target entities for EL approaches; however, it often contains other relevant information, such as aliases of entities (e.g., Obama and Barack Hussein Obama are aliases for the entity Barack Obama). EL models usually ignore such readily available entity attributes. In this paper, we examine the role of knowledge graph context on an attentive neural network approach for entity linking on Wikidata. Our approach contributes by exploiting the sufficient context from a KG as a source of background knowledge, which is then fed into the neural network. This approach demonstrates merit to address challenges associated with entity titles (multi-word, long, implicit, case-sensitive). Our experimental study shows approximately 8% improvement over the baseline approach, and significantly outperforms an end-to-end approach for Wikidata entity linking.

Submitted 26 September, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

Comments: 15 pages

Journal ref: WISE 2020 (21st International Conference on Web Information Systems Engineering)

arXiv:1911.02679 [pdf, other]

A Domain-Specific Language for Verifying Software Requirement Constraints

Authors: Marzina Vidal, Tiago Massoni, Franklin Ramalho

Abstract: Software requirement analysis can certainly benefit from prevention and early detection of failures, in particular by some kind of automatic analysis. Formal methods offer means to represent and analyze requirements with rigorous tools, avoiding ambiguities and allowing automatic verification of requirement consistency. However, formalisms often clash with the culture or lack of skills of software analysts, making them challenging to apply. In this article, we propose a Domain-Specific Language (DSL) based on Set Theory for requirement analysts. The Graphical InvaRiant Language (GIRL) can be used to specify software requirement structural invariants, with entities and their relationships. Those invariants can then have their consistency evaluated by the Alloy Analyzer, based on a mapping semantics we provide for transforming GIRL models into Alloy specifications with no user intervention. With a prototypical language editor and transformations implemented into an Eclipse plugin, we carried out a qualitative study with requirement analysts working for a government software company in Brazil, to evaluate usability and effectiveness of the GIRL-based analysis of real software requirements. The participants were able to effectively use the underlying formal analysis, since 79 out of 80 assigned invariants were correctly modeled. While participants perceived as low the complexity of learning and using GIRL's simplest, set-based structures and relationships, the most complex logical structures, such as quantification and implication, were challenging.
Furthermore, almost all post-study evaluations from the participants were positive, especially as a tool for discovering requirement inconsistencies.

Submitted 6 November, 2019; originally announced November 2019.

Comments: Preprint for the 2019 Brazilian Symposium on Formal Methods

arXiv:1909.01032 [pdf, other]

doi 10.1007/978-3-030-33246-4_4

MapSDI: A Scaled-up Semantic Data Integration Framework for Knowledge Graph Creation

Authors: Samaneh Jozashoori, Maria-Esther Vidal

Abstract: Semantic web technologies have significantly contributed effective solutions for the problems of data integration and knowledge graph creation. However, with the rapid growth of big data in diverse domains, different interoperability issues still demand to be addressed, scalability being one of the main challenges. In this paper, we address the problem of knowledge graph creation at scale and provide MapSDI, a mapping rule-based framework for optimizing semantic data integration into knowledge graphs. MapSDI allows for the semantic enrichment of large-sized, heterogeneous, and potentially low-quality data efficiently. The input of MapSDI is a set of data sources and mapping rules being generated by a mapping language such as RML. First, MapSDI pre-processes the sources based on semantic information extracted from mapping rules, by performing basic database operators; it projects out required attributes, eliminates duplicates, and selects relevant entries. All these operators are defined based on the knowledge encoded by the mapping rules which will be then used by the semantification engine (or RDFizer) to produce a knowledge graph. We have empirically studied the impact of MapSDI on existing RDFizers, and observed that knowledge graph creation time can be reduced on average by one order of magnitude. It is also shown, theoretically, that the sources and rules transformations provided by MapSDI are data-lossless.

Submitted 3 September, 2019; originally announced September 2019.
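
The three pre-processing operators named in the MapSDI abstract (selection, projection, duplicate elimination) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `preprocess`, its parameters, and the example column names are hypothetical, the required attributes are passed in by hand instead of being extracted from RML mapping rules, and attribute values are assumed to be hashable.

```python
def preprocess(rows, required_attrs, keep=lambda row: True):
    """Select relevant rows, project to required attributes, drop duplicates.

    rows           -- iterable of dicts, one per source record
    required_attrs -- attribute names assumed to be referenced by the mapping rules
    keep           -- predicate deciding whether a record is relevant
    """
    seen = set()
    result = []
    for row in rows:
        # Selection: skip records the mapping rules would never use.
        if not keep(row):
            continue
        # Projection: keep only the attributes that are needed.
        projected = tuple(row.get(attr) for attr in required_attrs)
        # Duplicate elimination: projected records may now coincide.
        if projected not in seen:
            seen.add(projected)
            result.append(dict(zip(required_attrs, projected)))
    return result


# Hypothetical usage: two records collapse into one after projection.
records = [
    {"id": 1, "gene": "TP53", "note": "a"},
    {"id": 1, "gene": "TP53", "note": "b"},
    {"id": 2, "gene": None, "note": "c"},
]
cleaned = preprocess(records, ["id", "gene"], keep=lambda r: r["gene"] is not None)
```

Running the operators before semantification means the RDFizer sees fewer and narrower records, which is the intuition behind the reduction in creation time reported in the abstract.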

arXiv:1908.06265 [pdf, other]

Towards an Integrated Graph Algebra for Graph Pattern Matching with Gremlin (Extended Version)

Authors: Harsh Thakkar, Dharmen Punjani, Soeren Auer, Maria-Esther Vidal

Abstract: Graph data management (also called NoSQL) has revealed beneficial characteristics in terms of flexibility and scalability by differently balancing between query expressivity and schema flexibility. This peculiar advantage has resulted in an unforeseen race of developing new task-specific graph systems, query languages and data models, such as property graphs, key-value, wide column, resource description framework (RDF), etc. Present-day graph query languages are focused towards flexible graph pattern matching (aka sub-graph matching), whereas graph computing frameworks aim towards providing fast parallel (distributed) execution of instructions. This rapid growth in the variety of graph-based data management systems has resulted in a lack of standardization. Gremlin, a graph traversal language and machine, provides a common platform for supporting any graph computing system (such as an OLTP graph database or OLAP graph processors). We present a formalization of graph pattern matching for Gremlin queries. We also study, discuss and consolidate various existing graph algebra operators into an integrated graph algebra.

Submitted 7 September, 2019; v1 submitted 17 August, 2019; originally announced August 2019.

Comments: This is an extended version of an article formally published at DEXA 2017

arXiv:1908.05098 [pdf, other]

Towards Optimisation of Collaborative Question Answering over Knowledge Graphs

Authors: Kuldeep Singh, Mohamad Yaser Jaradeh, Saeedeh Shekarpour, Akash Kulkarni, Arun Sethupat Radhakrishna, Ioanna Lytra, Maria-Esther Vidal, Jens Lehmann

Abstract: Collaborative Question Answering (CQA) frameworks for knowledge graphs aim at integrating existing question answering (QA) components for implementing sequences of QA tasks (i.e. QA pipelines). The research community has paid substantial attention to CQAs since they support reusability and scalability of the available components in addition to the flexibility of pipelines. CQA frameworks attempt to build such pipelines automatically by solving two optimisation problems: 1) local collective performance of QA components per QA task and 2) global performance of QA pipelines. In spite of offering several advantages over monolithic QA systems, the effectiveness and efficiency of CQA frameworks in answering questions is limited. In this paper, we tackle the problem of local optimisation of CQA frameworks and propose a three-fold approach, which applies feature selection techniques with supervised machine learning approaches in order to identify the best performing components efficiently. We have empirically evaluated our approach over existing benchmarks and compared to existing automatic CQA frameworks. The observed results provide evidence that our approach answers a higher number of questions than the state of the art while reducing: i) the number of used features by 50% and ii) the number of components used by 76%.

Submitted 14 August, 2019; originally announced August 2019.

arXiv:1903.12554 [pdf, other]

Linked Open Data Validity -- A Technical Report from ISWS 2018

Authors: Tayeb Abderrahmani Ghor, Esha Agrawal, Mehwish Alam, Omar Alqawasmeh, Claudia D'amato, Amina Annane, Amr Azzam, Andrew Berezovskyi, Russa Biswas, Mathias Bonduel, Quentin Brabant, Cristina-iulia Bucur, Elena Camossi, Valentina Anita Carriero, Shruthi Chari, David Chaves Fraga, Fiorela Ciroku, Michael Cochez, Hubert Curien, Vincenzo Cutrona, Rahma Dandan, Danilo Dess, Valerio Di Carlo, Ahmed El Amine Djebri, Marieke Van Erp, et al. (46 additional authors not shown)

Abstract: Linked Open Data (LOD) is the publicly available RDF data in the Web. Each LOD entity is identified by a URI and accessible via HTTP. LOD encodes global-scale knowledge potentially available to any human as well as artificial intelligence that may want to benefit from it as background knowledge for supporting their tasks. LOD has emerged as the backbone of applications in diverse fields such as Natural Language Processing, Information Retrieval, Computer Vision, Speech Recognition, and many more. Nevertheless, regardless of the specific tasks that LOD-based tools aim to address, the reuse of such knowledge may be challenging for diverse reasons, e.g. semantic heterogeneity, provenance, and data quality. As aptly stated by Heath et al., "Linked Data might be outdated, imprecise, or simply wrong": there arises a necessity to investigate the problem of linked data validity. This work reports a collaborative effort performed by nine teams of students, guided by an equal number of senior researchers, attending the International Semantic Web Research School (ISWS 2018) towards addressing such investigation from different perspectives coupled with different approaches to tackle the issue.

Submitted 26 March, 2019; originally announced March 2019.

arXiv:1811.01660 [pdf, other]

Data Integration for Supporting Biomedical Knowledge Graph Creation at Large-Scale

Authors: Samaneh Jozashoori, Tatiana Novikova, Maria-Esther Vidal

Abstract: In recent years, following FAIR and open data principles, the number of available big data including biomedical data has increased exponentially. In order to extract knowledge, these data should be curated, integrated, and semantically described. Accordingly, several semantic integration techniques have been developed; albeit effective, they may suffer from scalability in terms of different properties of big data. Even scaled-up approaches may be highly costly because tasks of semantification, curation and integration are performed independently. In order to overcome these issues, we devise ConMap, a semantic integration approach which exploits knowledge encoded in ontology in order to describe mapping rules to perform these tasks at the same time. Experimental results performed on different data sets suggest that ConMap can significantly reduce the time required for knowledge graph creation by up to 70% of the time that is consumed following a traditional approach.

Submitted 5 November, 2018; originally announced November 2018.

arXiv:1809.10044 [pdf, other]

No One is Perfect: Analysing the Performance of Question Answering Components over the DBpedia Knowledge Graph

Authors: Kuldeep Singh, Ioanna Lytra, Arun Sethupat Radhakrishna, Saeedeh Shekarpour, Maria-Esther Vidal, Jens Lehmann

Abstract: Question answering (QA) over knowledge graphs has gained significant momentum over the past five years due to the increasing availability of large knowledge graphs and the rising importance of question answering for user interaction. DBpedia has been the most prominently used knowledge graph in this setting and most approaches currently use a pipeline of processing steps connecting a sequence of components. In this article, we analyse and micro evaluate the behaviour of 29 available QA components for the DBpedia knowledge graph that were released by the research community since 2010. As a result, we provide a perspective on collective failure cases, suggest characteristics of QA components that prevent them from performing better and provide future challenges and research directions for the field.

Submitted 27 July, 2020; v1 submitted 26 September, 2018; originally announced September 2018.

Comments: Evaluation of State of the art Question Answering components performing entity linking, relation linking etc

Journal ref: Journal of Web Semantics (JWS 2020)

arXiv:1807.06816 [pdf, other]

Unveiling Scholarly Communities over Knowledge Graphs

Authors: Sahar Vahdati, Guillermo Palma, Rahul Jyoti Nath, Christoph Lange, Sören Auer, Maria-Esther Vidal

Abstract: Knowledge graphs represent the meaning of properties of real-world entities and relationships among them in a natural way. Exploiting semantics encoded in knowledge graphs enables the implementation of knowledge-driven tasks such as semantic retrieval, query processing, and question answering, as well as solutions to knowledge discovery tasks including pattern discovery and link prediction. In this paper, we tackle the problem of knowledge discovery in scholarly knowledge graphs, i.e., graphs that integrate scholarly data, and present Korona, a knowledge-driven framework able to unveil scholarly communities for the prediction of scholarly networks. Korona implements a graph partition approach and relies on semantic similarity measures to determine relatedness between scholarly entities. As a proof of concept, we built a scholarly knowledge graph with data from researchers, conferences, and papers of the Semantic Web area, and apply Korona to uncover co-authorship networks. Results observed from our empirical evaluation suggest that exploiting semantics in scholarly knowledge graphs enables the identification of previously unknown relations between researchers. By extending the ontology, these observations can be generalized to other scholarly entities, e.g., articles or institutions, for the prediction of other scholarly patterns, e.g., co-citations or academic collaboration.

Submitted 18 July, 2018; originally announced July 2018.

Comments: 12 pages. Paper accepted in the 22nd International Conference on Theory and Practice of Digital Libraries, 2018

arXiv:1705.08018 [pdf, other]

Use of Knowledge Graph in Rescoring the N-Best List in Automatic Speech Recognition

Authors: Ashwini Jaya Kumar, Camilo Morales, Maria-Esther Vidal, Christoph Schmidt, Sören Auer

Abstract: With the evolution of neural network based methods, the automatic speech recognition (ASR) field has advanced to a level where building an application with a speech interface is a reality. In spite of these advances, building a real-time speech recogniser faces several problems such as low recognition accuracy, domain constraint, and out-of-vocabulary words. The low recognition accuracy problem is addressed by improving the acoustic model, language model, decoder and by rescoring the N-best list at the output of the decoder. We are considering the N-best list rescoring approach to improve the recognition accuracy. Most of the methods in the literature use the grammatical, lexical, syntactic and semantic connection between the words in a recognised sentence as a feature to rescore. In this paper, we examine the semantic relatedness between the words in a sentence to rescore the N-best list. Semantic relatedness is computed using TransE (Bordes et al., 2013), a method for low dimensional embedding of a triple in a knowledge graph. The novelty of the paper is the application of semantic web to automatic speech recognition.

Submitted 22 May, 2017; originally announced May 2017.

arXiv:1701.03318 [pdf, other]

doi 10.4204/EPTCS.237.2

Comparing MapReduce and Pipeline Implementations for Counting Triangles

Authors: Edelmira Pasarella, Maria-Esther Vidal, Cristina Zoltan

Abstract: A common method to define a parallel solution for a computational problem consists in finding a way to use the Divide and Conquer paradigm in order to have each processor acting on its own data and scheduled in a parallel fashion. MapReduce is a programming model that follows this paradigm, and allows for the definition of efficient solutions by both decomposing a problem into steps on subsets of the input data and combining the results of each step to produce final results. Albeit used for the implementation of a wide variety of computational problems, MapReduce performance can be negatively affected whenever the replication factor grows or the size of the input is larger than the resources available at each processor. In this paper we show an alternative approach to implement the Divide and Conquer paradigm, named dynamic pipeline. The main features of dynamic pipelines are illustrated on a parallel implementation of the well-known problem of counting triangles in a graph. This problem is especially interesting either when the input graph does not fit in memory or is dynamically generated. To evaluate the properties of pipeline, a dynamic pipeline of processes and an ad-hoc version of MapReduce are implemented in the language Go, exploiting its ability to deal with channels and spawned processes. An empirical evaluation is conducted on graphs of different topologies, sizes, and densities.
Observed results suggest that dynamic pipelines allow for an efficient implementation of the problem of counting triangles in a graph, particularly in dense and large graphs, drastically reducing the execution time with respect to the MapReduce implementation.

Submitted 12 January, 2017; originally announced January 2017.

Comments: In Proceedings PROLE 2016, arXiv:1701.03069

ACM Class: D.1.3; F.1.2

Journal ref: EPTCS 237, 2017, pp. 20-33

arXiv:1608.02800 [pdf, other]

LITMUS: An Open Extensible Framework for Benchmarking RDF Data Management Solutions

Authors: Harsh Thakkar, Mohnish Dubey, Gezim Sejdiu, Axel-Cyrille Ngonga Ngomo, Jeremy Debattista, Christoph Lange, Jens Lehmann, Sören Auer, Maria-Esther Vidal

Abstract: Developments in the context of Open, Big, and Linked Data have led to an enormous growth of structured data on the Web. To keep up with the pace of efficient consumption and management of the data at this rate, many data management solutions have been developed for specific tasks and applications. We present LITMUS, a framework for benchmarking data management solutions. LITMUS goes beyond classical storage benchmarking frameworks by allowing for analysing the performance of frameworks across query languages. In this position paper we present the conceptual architecture of LITMUS as well as the considerations that led to this architecture.

Submitted 9 August, 2016; originally announced August 2016.

Comments: 8 pages, 1 figure, position paper

arXiv:1503.02940 [pdf, other]

Efficient Query Processing for SPARQL Federations with Replicated Fragments

Authors: Gabriela Montoya, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal

Abstract: Low reliability and availability of public SPARQL endpoints prevent real-world applications from exploiting all the potential of these querying infrastructures. Fragmenting data on servers can improve data availability but degrades performance. Replicating fragments can offer a new tradeoff between performance and availability. We propose FEDRA, a framework for querying Linked Data that takes advantage of client-side data replication, and performs a source selection algorithm that aims to reduce the number of selected public SPARQL endpoints, execution time, and intermediate results. FEDRA has been implemented on the state-of-the-art query engines ANAPSID and FedX, and empirically evaluated on a variety of real-world datasets.

Submitted 10 March, 2015; originally announced March 2015.

arXiv:1503.02911 [pdf, other]

RDF-Hunter: Automatically Crowdsourcing the Execution of Queries Against RDF Data Sets

Authors: Maribel Acosta, Elena Simperl, Fabian Flöck, Maria-Esther Vidal, Rudi Studer

Abstract: In recent years, a large number of RDF data sets have become available on the Web. However, due to the semi-structured nature of RDF data, missing values affect answer completeness of queries that are posed against this data. To overcome this limitation, we propose RDF-Hunter, a novel hybrid query processing approach that brings together machine and human computation to execute queries against RDF data. We develop a novel quality model and query engine in order to enable RDF-Hunter to decide on the fly which parts of a query should be executed through conventional technology or crowd computing. To evaluate RDF-Hunter, we created a collection of 50 SPARQL queries against the DBpedia data set, executed them using our hybrid query engine, and analyzed the accuracy of the outcomes obtained from the crowd. The experiments clearly show that the overall approach is feasible and produces query results that reliably and significantly enhance completeness of automatic query processing responses.

Submitted 10 March, 2015; originally announced March 2015.

arXiv:1407.2899 [pdf, other]

Fedra: Query Processing for SPARQL Federations with Divergence

Authors: Gabriela Montoya, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal

Abstract: Data replication and deployment of local SPARQL endpoints improve scalability and availability of public SPARQL endpoints, making the consumption of Linked Data a reality. This solution requires synchronization and specific query processing strategies to take advantage of replication. However, existing replication-aware techniques in federations of SPARQL endpoints do not consider data dynamicity. We propose Fedra, an approach for querying federations of endpoints that benefits from replication. Participants in Fedra federations can copy fragments of data from several datasets, and describe them using provenance and views. These descriptions enable Fedra to reduce the number of selected endpoints while satisfying user divergence requirements. Experiments on real-world datasets suggest savings of up to three orders of magnitude.

Submitted 10 July, 2014; originally announced July 2014.

arXiv:0711.2087 [pdf, other]

Query Evaluation and Optimization in the Semantic Web

Authors: Edna Ruckhaus, Eduardo Ruiz, Maria-Esther Vidal

Abstract: We address the problem of answering Web ontology queries efficiently. An ontology is formalized as a Deductive Ontology Base (DOB), a deductive database that comprises the ontology's inference axioms and facts. A cost-based query optimization technique for DOB is presented. A hybrid cost model is proposed to estimate the cost and cardinality of basic and inferred facts. Cardinality and cost of inferred facts are estimated using an adaptive sampling technique, while techniques of traditional relational cost models are used for estimating the cost of basic facts and conjunctive ontology queries. Finally, we implement a dynamic-programming optimization algorithm to identify query evaluation plans that minimize the number of intermediate inferred facts. We modeled a subset of the Web ontology language OWL Lite as a DOB, and performed an experimental study to analyze the predictive capacity of our cost model and the benefits of the query optimization technique. Our study has been conducted over synthetic and real-world OWL ontologies, and shows that the techniques are accurate and improve query performance. To appear in Theory and Practice of Logic Programming (TPLP).

Submitted 13 November, 2007; originally announced November 2007.

Comments: 18 pages, 8 figures, 7 tables. Presented at the ALPSWS2006 First International Workshop on Applications of Logic Programming in the Semantic Web and Semantic Web Services where it got a "Best Paper Award". To appear in Theory and Practice of Logic Programming (TPLP)

ACM Class: F.4.1; H.2.3; I.2.4
