subscribe to arXiv mailings

Modeling community standards for metadata as templates makes data FAIR

Authors: Mark A. Musen, Martin J. O'Connor, Erik Schultes, Marcos Martinez-Romero, Josef Hardi, John Graybeal

Abstract: It is challenging to determine whether datasets are findable, accessible, interoperable, and reusable (FAIR) because the FAIR Guiding Principles refer to highly idiosyncratic criteria regarding the metadata used to annotate datasets. Specifically, the FAIR principles require metadata to be "rich" and to adhere to "domain-relevant" community standards. Scientific communities should be able to defin… ▽ More It is challenging to determine whether datasets are findable, accessible, interoperable, and reusable (FAIR) because the FAIR Guiding Principles refer to highly idiosyncratic criteria regarding the metadata used to annotate datasets. Specifically, the FAIR principles require metadata to be "rich" and to adhere to "domain-relevant" community standards. Scientific communities should be able to define their own machine-actionable templates for metadata that encode these "rich," discipline-specific elements. We have explored this template-based approach in the context of two software systems. One system is the CEDAR Workbench, which investigators use to author new metadata. The other is the FAIRware Workbench, which evaluates the metadata of archived datasets for their adherence to community standards. Benefits accrue when templates for metadata become central elements in an ecosystem of tools to manage online datasets--both because the templates serve as a community reference for what constitutes FAIR data, and because they embody that perspective in a form that can be distributed among a variety of software applications to assist with data stewardship and data sharing. △ Less

Submitted 14 October, 2022; v1 submitted 4 August, 2022; originally announced August 2022.

Comments: 20 pages, 1 table, 5 figures

arXiv:2112.07051 [pdf]

doi 10.1093/database/baac035

A Simple Standard for Sharing Ontological Mappings (SSSOM)

Authors: Nicolas Matentzoglu, James P. Balhoff, Susan M. Bello, Chris Bizon, Matthew Brush, Tiffany J. Callahan, Christopher G Chute, William D. Duncan, Chris T. Evelo, Davera Gabriel, John Graybeal, Alasdair Gray, Benjamin M. Gyori, Melissa Haendel, Henriette Harmse, Nomi L. Harris, Ian Harrow, Harshad Hegde, Amelia L. Hoyt, Charles T. Hoyt, Dazhi Jiao, Ernesto Jiménez-Ruiz, Simon Jupp, Hyeongsik Kim, Sebastian Koehler , et al. (19 additional authors not shown)

Abstract: Despite progress in the development of standards for describing and exchanging scientific information, the lack of easy-to-use standards for mapping between different representations of the same or similar objects in different databases poses a major impediment to data integration and interoperability. Mappings often lack the metadata needed to be correctly interpreted and applied. For example, ar… ▽ More Despite progress in the development of standards for describing and exchanging scientific information, the lack of easy-to-use standards for mapping between different representations of the same or similar objects in different databases poses a major impediment to data integration and interoperability. Mappings often lack the metadata needed to be correctly interpreted and applied. For example, are two terms equivalent or merely related? Are they narrow or broad matches? Are they associated in some other way? Such relationships between the mapped terms are often not documented, leading to incorrect assumptions and making them hard to use in scenarios that require a high degree of precision (such as diagnostics or risk prediction). Also, the lack of descriptions of how mappings were done makes it hard to combine and reconcile mappings, particularly curated and automated ones. The Simple Standard for Sharing Ontological Mappings (SSSOM) addresses these problems by: 1. Introducing a machine-readable and extensible vocabulary to describe metadata that makes imprecision, inaccuracy and incompleteness in mappings explicit. 2. Defining an easy to use table-based format that can be integrated into existing data science pipelines without the need to parse or query ontologies, and that integrates seamlessly with Linked Data standards. 3. Implementing open and community-driven collaborative workflows designed to evolve the standard continuously to address changing requirements and mapping practices. 4. Providing reference tools and software libraries for working with the standard. In this paper, we present the SSSOM standard, describe several use cases, and survey some existing work on standardizing the exchange of mappings, with the goal of making mappings Findable, Accessible, Interoperable, and Reusable (FAIR). The SSSOM specification is at http://w3id.org/sssom/spec. △ Less

Submitted 13 December, 2021; originally announced December 2021.

Comments: Corresponding author: Christopher J. Mungall <cjmungall@lbl.gov>

arXiv:1905.06480 [pdf]

doi 10.1007/978-3-319-68204-4_10

The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata that Describe Scientific Experiments

Authors: Rafael S. Gonçalves, Martin J. O'Connor, Marcos Martínez-Romero, Attila L. Egyedi, Debra Willrett, John Graybeal, Mark A. Musen

Abstract: The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed--the CEDAR Workbench--is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. T… ▽ More The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed--the CEDAR Workbench--is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, REST-based environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration in scientific applications and reusability on the Web. Users can leverage our APIs for validating and submitting metadata to external repositories. The CEDAR Workbench is freely available and open-source. △ Less

Submitted 15 May, 2019; originally announced May 2019.

arXiv:1903.09270 [pdf]

doi 10.1093/database/baz059

Using association rule mining and ontologies to generate metadata recommendations from multiple biomedical databases

Authors: Marcos Martínez-Romero, Martin J. O'Connor, Attila L. Egyedi, Debra Willrett, Josef Hardi, John Graybeal, Mark A. Musen

Abstract: Metadata-the machine-readable descriptions of the data-are increasingly seen as crucial for describing the vast array of biomedical datasets that are currently being deposited in public repositories. While most public repositories have firm requirements that metadata must accompany submitted datasets, the quality of those metadata is generally very poor. A key problem is that the typical metadata… ▽ More Metadata-the machine-readable descriptions of the data-are increasingly seen as crucial for describing the vast array of biomedical datasets that are currently being deposited in public repositories. While most public repositories have firm requirements that metadata must accompany submitted datasets, the quality of those metadata is generally very poor. A key problem is that the typical metadata acquisition process is onerous and time consuming, with little interactive guidance or assistance provided to users. Secondary problems include the lack of validation and sparse use of standardized terms or ontologies when authoring metadata. There is a pressing need for improvements to the metadata acquisition process that will help users to enter metadata quickly and accurately. In this paper we outline a recommendation system for metadata that aims to address this challenge. Our approach uses association rule mining to uncover hidden associations among metadata values and to represent them in the form of association rules. These rules are then used to present users with real-time recommendations when authoring metadata. The novelties of our method are that it is able to combine analyses of metadata from multiple repositories when generating recommendations and can enhance those recommendations by aligning them with ontology terms. We implemented our approach as a service integrated into the CEDAR Workbench metadata authoring platform, and evaluated it using metadata from two public biomedical repositories: US-based National Center for Biotechnology Information (NCBI) BioSample and European Bioinformatics Institute (EBI) BioSamples. The results show that our approach is able to use analyses of previous entered metadata coupled with ontology-based mappings to present users with accurate recommendations when authoring metadata. △ Less

Submitted 21 March, 2019; originally announced March 2019.

arXiv:1902.11162 [pdf]

The FAIR Funder pilot programme to make it easy for funders to require and for grantees to produce FAIR Data

Authors: P. Wittenburg, H. Pergl Sustkova, A. Montesanti, S. M. Bloemers, S. H. de Waard, M. A. Musen, J. B. Graybeal, K. M. Hettne, A. Jacobsen, R. Pergl, R. W. W. Hooft, C. Staiger, C. W. G. van Gelder, S. L. Knijnenburg, A. C. van Arkel, B. Meerman, M. D. Wilkinson, S-A Sansone, P. Rocca-Serra, P. McQuilton, A. N. Gonzalez-Beltran, G. J. C. Aben, P. Henning, S. Alencar, C. Ribeiro , et al. (35 additional authors not shown)

Abstract: There is a growing acknowledgement in the scientific community of the importance of making experimental data machine findable, accessible, interoperable, and reusable (FAIR). Recognizing that high quality metadata are essential to make datasets FAIR, members of the GO FAIR Initiative and the Research Data Alliance (RDA) have initiated a series of workshops to encourage the creation of Metadata for… ▽ More There is a growing acknowledgement in the scientific community of the importance of making experimental data machine findable, accessible, interoperable, and reusable (FAIR). Recognizing that high quality metadata are essential to make datasets FAIR, members of the GO FAIR Initiative and the Research Data Alliance (RDA) have initiated a series of workshops to encourage the creation of Metadata for Machines (M4M), enabling any self-identified stakeholder to define and promote the reuse of standardized, comprehensive machine-actionable metadata. The funders of scientific research recognize that they have an important role to play in ensuring that experimental results are FAIR, and that high quality metadata and careful planning for FAIR data stewardship are central to these goals. We describe the outcome of a recent M4M workshop that has led to a pilot programme involving two national science funders, the Health Research Board of Ireland (HRB) and the Netherlands Organisation for Health Research and Development (ZonMW). These funding organizations will explore new technologies to define at the time that a request for proposals is issued the minimal set of machine-actionable metadata that they would like investigators to use to annotate their datasets, to enable investigators to create such metadata to help make their data FAIR, and to develop data-stewardship plans that ensure that experimental data will be managed appropriately abiding by the FAIR principles. The FAIR Funders design envisions a data-management workflow having seven essential stages, where solution providers are openly invited to participate. The initial pilot programme will launch using existing computer-based tools of those who attended the M4M Workshop. △ Less

Submitted 6 March, 2019; v1 submitted 26 February, 2019; originally announced February 2019.

Comments: This is a pre-print of the FAIR Funders pilot, an outcome of the first Metadata for Machines workshop, see: https://www.go-fair.org/resources/go-fair-workshop-series/metadata-for-machines-workshops/. Corresponding author: E. A Schultes, ORCID 0000-0001-8888-635X

arXiv:1806.09582 [pdf, ps, other]

Traffic Differentiation in Dense WLANs with CSMA/ECA-DR MAC Protocol

Authors: Seyedmohammad Salehi, Li Li, Chien-Chung Shen, Leonard Cimini, John Graybeal

Abstract: In today's WLANs, scheduling of packet transmissions solely relies on the collision and success a station may experience. To better support traffic differentiation in dense WLANs, in this paper, we propose a distributed reservation mechanism for the Carrier Sense Multiple Access Extended Collision Avoidance (CSMA/ECA) MAC protocol, termed CSMA/ECA-DR, based on which stations can collaboratively ac… ▽ More In today's WLANs, scheduling of packet transmissions solely relies on the collision and success a station may experience. To better support traffic differentiation in dense WLANs, in this paper, we propose a distributed reservation mechanism for the Carrier Sense Multiple Access Extended Collision Avoidance (CSMA/ECA) MAC protocol, termed CSMA/ECA-DR, based on which stations can collaboratively achieve higher network performance. In addition, proper Contention Window (CW) will be chosen based on the instantaneously estimated number of active contenders in the network. Simulation results from dense scenarios with traffic differentiation demonstrate that CSMA/ECA-DR can greatly improve the efficiency of WLANs for traffic differentiation even with large numbers of contenders. △ Less

Submitted 25 June, 2018; originally announced June 2018.

Comments: This work has been accepted by IEEE VTC 2018 Fall conference

arXiv:1708.01286 [pdf]

Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies

Authors: Rafael S. Gonçalves, Martin J. O'Connor, Marcos Martínez-Romero, John Graybeal, Mark A. Musen

Abstract: The metadata about scientific experiments are crucial for finding, reproducing, and reusing the data that the metadata describe. We present a study of the quality of the metadata stored in BioSample--a repository of metadata about samples used in biomedical experiments managed by the U.S. National Center for Biomedical Technology Information (NCBI). We tested whether 6.6 million BioSample metadata… ▽ More The metadata about scientific experiments are crucial for finding, reproducing, and reusing the data that the metadata describe. We present a study of the quality of the metadata stored in BioSample--a repository of metadata about samples used in biomedical experiments managed by the U.S. National Center for Biomedical Technology Information (NCBI). We tested whether 6.6 million BioSample metadata records are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the analyzed metadata. The BioSample metadata field names and their values are not standardized or controlled--15% of the metadata fields use field names not specified in the BioSample data dictionary. Only 9 out of 452 BioSample-specified fields ordinarily require ontology terms as values, and the quality of these controlled fields is better than that of uncontrolled ones, as even simple binary or numeric fields are often populated with inadequate values of different data types (e.g., only 27% of Boolean values are valid). Overall, the metadata in BioSample reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The aberrancies in the metadata are likely to impede search and secondary use of the associated datasets. △ Less

Submitted 3 August, 2017; originally announced August 2017.

arXiv:1611.05973 [pdf]

doi 10.1186/s13326-017-0128-y

NCBO Ontology Recommender 2.0: An Enhanced Approach for Biomedical Ontology Recommendation

Authors: Marcos Martinez-Romero, Clement Jonquet, Martin J. O'Connor, John Graybeal, Alejandro Pazos, Mark A. Musen

Abstract: Biomedical researchers use ontologies to annotate their data with ontology terms, enabling better data integration and interoperability. However, the number, variety and complexity of current biomedical ontologies make it cumbersome for researchers to determine which ones to reuse for their specific needs. To overcome this problem, in 2010 the National Center for Biomedical Ontology (NCBO) release… ▽ More Biomedical researchers use ontologies to annotate their data with ontology terms, enabling better data integration and interoperability. However, the number, variety and complexity of current biomedical ontologies make it cumbersome for researchers to determine which ones to reuse for their specific needs. To overcome this problem, in 2010 the National Center for Biomedical Ontology (NCBO) released the Ontology Recommender, which is a service that receives a biomedical text corpus or a list of keywords and suggests ontologies appropriate for referencing the indicated terms. We developed a new version of the NCBO Ontology Recommender. Called Ontology Recommender 2.0, it uses a new recommendation approach that evaluates the relevance of an ontology to biomedical text data according to four criteria: (1) the extent to which the ontology covers the input data; (2) the acceptance of the ontology in the biomedical community; (3) the level of detail of the ontology classes that cover the input data; and (4) the specialization of the ontology to the domain of the input data. Our evaluation shows that the enhanced recommender provides higher quality suggestions than the original approach, providing better coverage of the input data, more detailed information about their concepts, increased specialization for the domain of the input data, and greater acceptance and use in the community. In addition, it provides users with more explanatory information, along with suggestions of not only individual ontologies but also groups of ontologies. It also can be customized to fit the needs of different scenarios. Ontology Recommender 2.0 combines the strengths of its predecessor with a range of adjustments and new features that improve its reliability and usefulness. Ontology Recommender 2.0 recommends over 500 biomedical ontologies from the NCBO BioPortal platform, where it is openly available. △ Less

Submitted 25 May, 2017; v1 submitted 17 November, 2016; originally announced November 2016.

Comments: 29 pages, 8 figures, 11 tables

ACM Class: I.2.4

Journal ref: Journal of Biomedical Semantics 8 (2017) 1-22

Showing 1–8 of 8 results for author: Graybeal, J