-
SpaCE: The Spatial Confounding Environment
Authors:
Mauricio Tec,
Ana Trisovic,
Michelle Audirac,
Sophie Woodward,
Jie Kate Hu,
Naeem Khoshnevis,
Francesca Dominici
Abstract:
Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. Each dataset also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and the evaluation of machine learning and causal inference models. The SpaCE project provides several dozen datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.
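The bias described in this abstract can be illustrated with a small simulation. The sketch below does not use the SpaCE package or its API; the data-generating process, the function names, and the parameter values (`tau`, `beta`) are assumptions chosen only for illustration. A smooth spatial variable `u` drives both treatment and outcome, so a regression that omits it overstates the treatment effect, while adjusting for it recovers the true value.

```python
import math
import random


def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var


def residuals(x, y):
    """Residuals of y after an OLS fit on x (with intercept)."""
    b = slope(x, y)
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]


def simulate(n=5000, tau=1.0, beta=2.0, seed=0):
    """Hypothetical data: u is a smooth function of location and affects
    both the treatment a and the outcome y; tau is the true effect of a."""
    rng = random.Random(seed)
    u, a, y = [], [], []
    for _ in range(n):
        sx, sy = rng.random(), rng.random()  # coordinates in the unit square
        ui = math.sin(2 * math.pi * sx) + math.cos(2 * math.pi * sy)
        ai = ui + rng.gauss(0, 1)
        yi = tau * ai + beta * ui + rng.gauss(0, 1)
        u.append(ui)
        a.append(ai)
        y.append(yi)
    return u, a, y


u, a, y = simulate()

# Naive estimate ignores u. Its large-sample value here is
# tau + beta * Cov(a, u) / Var(a) = 1 + 2 * (1 / 2) = 2.
naive = slope(a, y)

# Adjusted estimate: partial out u from both a and y, then regress
# the residuals on each other (Frisch-Waugh-Lovell).
adjusted = slope(residuals(u, a), residuals(u, y))

print(f"naive estimate:    {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

In a real spatial study `u` is unobserved, which is why benchmark datasets with known counterfactuals are needed to judge how well a method compensates for it.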
Submitted 5 December, 2023; v1 submitted 1 December, 2023;
originally announced December 2023.
-
Cluster Analysis of Open Research Data and a Case for Replication Metadata
Authors:
Ana Trisovic
Abstract:
Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets coming from numerous disciplines, from tabular data and program code to audio-visual files. Metadata, or data about data, is critical to making research outputs adequately documented and FAIR. Aiming to contribute to the discussions on the development of metadata for research outputs, I conducted an exploratory analysis to determine how research datasets cluster based on what researchers organically deposit together. I use the content of over 40,000 datasets from the Harvard Dataverse research data repository as my sample for the cluster analysis. I find that the majority of the clusters are formed by single-type datasets, while in the rest of the sample, no meaningful clusters can be identified. For the result interpretation, I use the metadata standard employed by DataCite, a leading organization for documenting the scholarly record, and map existing resource types to my results. About 65% of the sample can be described with single-type metadata (such as Dataset, Software or Report), while the rest would require aggregate metadata types. Though DataCite supports an aggregate type such as a Collection, I argue that a significant number of datasets, in particular those containing both data and code files (about 20% of the sample), would be more accurately described with a Replication resource metadata type. Such a resource type would be particularly useful in facilitating research reproducibility.
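The mapping argued for in this abstract can be sketched as a simple rule over the file types found in a deposit. This is not the cluster analysis used in the paper: the extension sets, the function names, and the rule that a documents-only deposit maps to Report are assumptions for illustration, and `Replication` is the resource type the abstract proposes rather than an existing DataCite value.

```python
from pathlib import PurePosixPath

# Hypothetical extension groups; a real repository would need a richer mapping.
DATA_EXT = {".csv", ".tab", ".tsv", ".dta", ".xlsx", ".sav"}
CODE_EXT = {".r", ".py", ".do", ".m", ".sas", ".ipynb"}
DOC_EXT = {".pdf", ".docx", ".md"}


def file_kind(name):
    """Classify one file name as data, code, document, or other."""
    ext = PurePosixPath(name).suffix.lower()
    if ext in DATA_EXT:
        return "data"
    if ext in CODE_EXT:
        return "code"
    if ext in DOC_EXT:
        return "document"
    return "other"


def suggest_resource_type(filenames):
    """Suggest a resource type from the kinds of files deposited together."""
    kinds = {file_kind(name) for name in filenames}
    if not kinds:
        raise ValueError("a deposit needs at least one file")
    if kinds == {"data"}:
        return "Dataset"
    if kinds == {"code"}:
        return "Software"
    if kinds == {"document"}:
        return "Report"
    if {"data", "code"} <= kinds:
        return "Replication"  # proposed type for data deposited with code
    return "Collection"  # any other mixture of file kinds


print(suggest_resource_type(["survey.csv", "panel.dta"]))
print(suggest_resource_type(["analysis.R", "survey.csv", "codebook.pdf"]))
```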
Submitted 26 May, 2023;
originally announced May 2023.
-
Advancing Software Citation Implementation (Software Citation Workshop 2022)
Authors:
Daina Bouquin,
Ana Trisovic,
Oliver Bertuch,
Elena Colón-Marrero
Abstract:
Software is foundationally important to scientific and social progress; however, traditional acknowledgment of the use of others' work has not adapted in step with the rapid development and use of software in research.
This report outlines a series of collaborative discussions that brought together an international group of stakeholders and experts representing many communities, forms of labor, and expertise. Participants addressed specific challenges about software citation that have so far gone unresolved. The discussions took place in summer 2022, both online and in person, and involved a total of 51 participants.
The activities described in this paper were intended to identify and prioritize specific software citation problems, develop (potential) interventions, and lay out a series of mutually supporting approaches to address them. The outcomes of this report will be useful for the GLAM (Galleries, Libraries, Archives, Museums) community, repository managers and curators, research software developers, and publishers.
Submitted 15 February, 2023;
originally announced February 2023.
-
Toward Reusable Science with Readable Code and Reproducibility
Authors:
Layan Bahaidarah,
Ethan Hung,
Andreas F. De Melo Oliveira,
Jyotsna Penumaka,
Lukas Rosario,
Ana Trisovic
Abstract:
An essential part of research and scientific communication is researchers' ability to reproduce the results of others. While there have been increasing standards for authors to make data and code available, many of these files are hard to re-execute in practice, leading to a lack of research reproducibility. This poses a major problem for students and researchers in the same field who cannot leverage the previously published findings for study or further inquiry. To address this, we propose an open-source platform named RE3 that helps improve the reproducibility and readability of research projects involving R code. Our platform incorporates assessing code readability with a machine learning model trained on a code readability survey and an automatic containerization service that executes code files and warns users of reproducibility errors. This process helps ensure the reproducibility and readability of projects and therefore fast-track their verification and reuse.
Submitted 21 September, 2021;
originally announced September 2021.
-
Towards FAIR Principles for Open Hardware
Authors:
Nadica Miljković,
Ana Trisovic,
Limor Peer
Abstract:
The lack of scientific openness is identified as one of the key challenges of computational reproducibility. In addition to Open Data, Free and Open-source Software (FOSS) and Open Hardware (OH) can address this challenge by introducing open policies, standards, and recommendations. However, while both FOSS and OH are free to use, study, modify, and redistribute, there are significant differences in sharing and reusing these artifacts. FOSS is increasingly supported with software repositories, but support for OH is lacking, potentially due to the complexity of its digital format and licensing. This paper proposes leveraging FAIR principles to make OH findable, accessible, interoperable, and reusable. We define what FAIR means for OH, how it differs from FOSS, and present examples of unique demands. Also, we evaluate dissemination platforms currently used for OH and provide recommendations.
Submitted 17 April, 2023; v1 submitted 13 September, 2021;
originally announced September 2021.
-
Packaging research artefacts with RO-Crate
Authors:
Stian Soiland-Reyes,
Peter Sefton,
Mercè Crosas,
Leyla Jael Castro,
Frederik Coppens,
José M. Fernández,
Daniel Garijo,
Björn Grüning,
Marco La Rosa,
Simone Leo,
Eoghan Ó Carragáin,
Marc Portier,
Ana Trisovic,
RO-Crate Community,
Paul Groth,
Carole Goble
Abstract:
An increasing number of researchers support reproducibility by including pointers to and descriptions of datasets, software and methods in their publications. However, scientific articles may be ambiguous, incomplete and difficult to process by automated systems. In this paper we introduce RO-Crate, an open, community-driven, and lightweight approach to packaging research artefacts along with their metadata in a machine readable manner. RO-Crate is based on Schema.org annotations in JSON-LD, aiming to establish best practices to formally describe metadata in an accessible and practical way for their use in a wide variety of situations.
An RO-Crate is a structured archive of all the items that contributed to a research outcome, including their identifiers, provenance, relations and annotations. As a general purpose packaging approach for data and their metadata, RO-Crate is used across multiple areas, including bioinformatics, digital humanities and regulatory sciences. By applying "just enough" Linked Data standards, RO-Crate simplifies the process of making research outputs FAIR while also enhancing research reproducibility.
An RO-Crate for this article is available at https://w3id.org/ro/doi/10.5281/zenodo.5146227
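As a rough illustration of the packaging approach, the sketch below builds a minimal `ro-crate-metadata.json` document in the JSON-LD shape used by RO-Crate 1.1: a metadata descriptor, a root dataset, and one entity per file. The helper name `build_ro_crate_metadata` and the example file names are hypothetical, and the sketch omits further root properties the RO-Crate 1.1 specification asks for, such as `datePublished` and `license`.

```python
import json


def build_ro_crate_metadata(name, description, files):
    """Return a minimal RO-Crate 1.1 metadata document as a dict.

    files is a list of (relative_path, human_readable_name) pairs.
    """
    file_entities = [
        {"@id": path, "@type": "File", "name": title} for path, title in files
    ]
    graph = [
        # Metadata descriptor: identifies this file and points at the root dataset.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # Root dataset: describes the crate as a whole and lists its parts.
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "hasPart": [{"@id": entity["@id"]} for entity in file_entities],
        },
        *file_entities,
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


crate = build_ro_crate_metadata(
    name="Example analysis",
    description="Data and script behind a hypothetical figure.",
    files=[("data/survey.csv", "Survey responses"), ("analysis.py", "Analysis script")],
)
print(json.dumps(crate, indent=2))
```

Writing this output to a file named `ro-crate-metadata.json` at the top of a directory is what marks that directory as a crate; the data files themselves stay in their usual formats.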
Submitted 6 December, 2021; v1 submitted 14 August, 2021;
originally announced August 2021.
-
A large-scale study on research code quality and execution
Authors:
Ana Trisovic,
Matthew K. Lau,
Thomas Pasquier,
Mercè Crosas
Abstract:
This article presents a study on the quality and execution of research code from publicly-available replication datasets at the Harvard Dataverse repository. Research code is typically created by a group of scientists and published together with academic papers to facilitate research transparency and reproducibility. For this study, we define ten questions to address aspects impacting research reproducibility and reuse. First, we retrieve and analyze more than 2000 replication datasets with over 9000 unique R files published from 2010 to 2020. Second, we execute the code in a clean runtime environment to assess its ease of reuse. Common coding errors were identified, and some of them were solved with automatic code cleaning to aid code execution. We find that 74% of R files crashed in the initial execution, while 56% crashed when code cleaning was applied, showing that many errors can be prevented with good coding practices. We also analyze the replication datasets from journals' collections and discuss the impact of the journal policy strictness on the code re-execution rate. Finally, based on our results, we propose a set of recommendations for code dissemination aimed at researchers, journals, and repositories.
Submitted 23 March, 2021;
originally announced March 2021.
-
Advancing computational reproducibility in the Dataverse data repository platform
Authors:
Ana Trisovic,
Philip Durbin,
Tania Schlatter,
Gustavo Durand,
Sonia Barbosa,
Danny Brooke,
Mercè Crosas
Abstract:
Recent reproducibility case studies have raised concerns showing that much of the deposited research has not been reproducible. One of their conclusions was that the way data repositories store research data and code cannot fully facilitate reproducibility due to the absence of a runtime environment needed for the code execution. New specialized reproducibility tools provide cloud-based computational environments for code encapsulation, thus enabling research portability and reproducibility. However, they do not often enable research discoverability, standardized data citation, or long-term archival like data repositories do. This paper addresses the shortcomings of data repositories and reproducibility tools and how they could be overcome to improve the current lack of computational reproducibility in published and archived research outputs.
Submitted 16 June, 2020; v1 submitted 6 May, 2020;
originally announced May 2020.
-
Provenance tracking in the LHCb software
Authors:
Ana Trisovic,
Chris R. Jones,
Ben Couturier,
Marco Clemencic
Abstract:
Even though computational reproducibility is widely accepted as necessary for research validation and reuse, it is often not considered during the research process. This is because reproducibility tools are typically stand-alone and require additional training to be employed. In this article, we present a solution to foster reproducibility, which is integrated within existing scientific software that is actively used in the LHCb collaboration. Our provenance tracking service captures metadata of a dataset, which is then saved inside the output data file on the disk. The captured information allows a complete understanding of how the file was produced and enables a user to reproduce the dataset, even when the original input code (that was used to initially produce the dataset) is altered or lost. This article describes the implementation of the service and gives examples of its application.
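The idea of storing provenance inside the output file can be sketched generically. This is not the LHCb service or its file format: the JSON container, the function names, and the recorded fields (input checksums, options, interpreter version, command line) are assumptions used only to show the pattern of writing the results and a description of how they were produced into one file.

```python
import hashlib
import json
import platform
import sys
import tempfile
from pathlib import Path


def file_sha256(path):
    """Checksum that identifies the exact input content used."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_with_provenance(output_path, records, input_paths, options):
    """Write records together with a provenance block in one JSON file."""
    provenance = {
        "inputs": [
            {"path": str(p), "sha256": file_sha256(p)} for p in input_paths
        ],
        "options": options,
        "python_version": platform.python_version(),
        "command": sys.argv,
    }
    payload = {"provenance": provenance, "records": records}
    Path(output_path).write_text(json.dumps(payload, indent=2))
    return provenance


def read_provenance(output_path):
    """Recover the provenance block from a previously written output file."""
    return json.loads(Path(output_path).read_text())["provenance"]


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    src.write_text("1\n2\n3\n")
    out = Path(tmp) / "output.json"

    scale = 2
    records = [int(line) * scale for line in src.read_text().split()]
    write_with_provenance(out, records, [src], {"scale": scale})

    recovered = read_provenance(out)

print(recovered["options"])
print(recovered["inputs"][0]["sha256"])
```

Because the description travels with the data, a reader who has only the output file can still see which inputs and options produced it.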
Submitted 2 March, 2020; v1 submitted 3 October, 2019;
originally announced October 2019.