Research Objects are us…

Research Objects are wonderful things! Uh, but just what is a Research Object? One definition, according to Workflow4Ever, is: “Research Objects (ROs) are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations.” But doesn’t a ‘semantically rich aggregation of resources that bring together data, methods and people in scientific investigations‘ sound just like a weird definition of a scientific paper? Well, that’s what I thought.

One of the core missions of ReproNim is to support the generation of completely re-executable research publications, aka the ReproPub (Kennedy 2019). A ReproPub can be used to verify a study’s claims and explore their generalizability through systematic variation of the input data and analytic components. Such a publication includes the complete description (i.e. provenance) of the experimental data, analysis workflow (script, pipeline), execution environment (OS version, hardware specification), and results that are used to establish the publication’s claim(s). Yet these elements (the data, workflow, environment, and results) are themselves Research Objects: scholarly products, each of which may have its own history, evolution, creators, credit, and reusability. This makes the ReproPub itself an overarching mechanism (Research Object) that aggregates these subsidiary research objects in support of a specific set of claims.
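To make the aggregation idea concrete, here is a minimal sketch, in Python, of what a ReproPub-style manifest might look like: each subsidiary research object (data, workflow, environment, results) carries its own identifier and version so it can be cited, tracked, and reused independently. This is purely illustrative; the field names and identifiers are hypothetical and do not correspond to an actual ReproNim or Research Object schema.

```python
import json

# Purely illustrative sketch (not an actual ReproNim or RO schema): one way a
# ReproPub could aggregate its subsidiary research objects, each with its own
# identifier, version, and credit trail. All identifiers below are hypothetical.
repropub = {
    "title": "Example re-executable publication",
    "claims": ["Hypothetical claim supported by the aggregated objects"],
    "objects": {
        "data":        {"id": "doi:10.xxxx/example-dataset", "version": "1.0.0"},
        "workflow":    {"id": "https://example.org/analysis-pipeline", "version": "2.1"},
        "environment": {"id": "docker://example/analysis-image:2021-01", "digest": "sha256:..."},
        "results":     {"id": "doi:10.xxxx/example-results", "derived_from": ["data", "workflow"]},
    },
}

# Serialize the manifest so it can travel with the publication.
with open("repropub_manifest.json", "w") as f:
    json.dump(repropub, f, indent=2)
```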

Example

In the ReproPub by Ghosh et al. (2017), care is taken to explicitly annotate the constituent objects.

Additional objects, of course, can be included, such as hypothesis pre-registration, separation of image analysis workflow and statistical analysis workflow, etc.

Summary

Creating each of these complex Research Objects is still hard and needs to be made easier; ReproNim is actively developing tools to support this.

Facilitating the evolution of publications into fully provisioned Research Objects has numerous benefits. A ReproPub embraces many recent advances in publication practice: treatment of data as a first-class research object (Honor et al. 2016); the principles of software citation (Katz et al. 2019); and the FAIR (findable, accessible, interoperable, and reusable; Wilkinson et al. 2016) principles applied to the scientific process itself. The resulting scientific literature not only becomes more supportive of reproducibility, but also allows a much fuller and more precise description of the conditions under which claims do, or do not, generalize, because those elements are included in the publication itself. Culturally, this evolution of publication practice should be perceived as a plus for the scientific community. Specifically, what used to be one publication that referred to data, processing, and a set of claims can now conceivably become numerous publications of distinct and independently creditable scientific output: a publication for the data, a publication for the processing approach, and a publication for the complete results, in addition to the publication for the conclusions and claims.

Acknowledgements

Elements of this work are based upon a conference paper presented at the Workshop on Research Objects 2019 (RO2019) at IEEE eScience 2019 (http://www.researchobject.org/ro2019/).

ReproNim Training: Some Lessons Learnt

While the life sciences, and neuroimaging in particular, are struggling with the apparent reproducibility crisis, the community is honing its skills, developing tools and best practices to foster more replicable and reproducible studies. Tools are key in this respect: without new, practical, and efficient tooling to design studies, analyze data, and verify results, research will keep moving too slowly (because of the uncertainty and ambiguity in its results) toward a fundamental understanding of the neurobiology of health and disease, and will therefore fail to respond appropriately and effectively to the needs of the populations suffering from brain diseases. Tools that provide ‘provenance-aware’ support (capturing exactly how each step was carried out) for all phases of the analysis workflow are dearly needed: pipeline systems for analysis, better handling mechanisms for data and software (e.g. containers), and better ways to harmonize and track provenance.

But for many of us, the fundamental work has to be in the training of the research community. First, tools (software products and libraries) are only as good as their users. Powerful tools may be badly misused, either because they are complex or because they are used in situations where they should not be. Technically savvy personnel (researchers, developers, etc.) have a strong tendency to underestimate the difficulty of adopting and mastering new tools, and keeping a constant and strong connection with the majority of life sciences researchers requires a huge effort, often incompatible with the rapid development of technologies. Tools change and evolve, and if we do not have a mechanism for continuously keeping researchers well trained, we will waste time and financial resources through the inefficiency of work performed without the benefit of the most up-to-date approaches.

There might be a never-ending, Sisyphean aspect to this. Is this a perpetual task, in which we are constantly climbing the steep hill of new tools and new ways of conducting research? To some extent, probably. Research is, by its very essence, driven by (and rewarded for) the production of new techniques. What is the best way to ensure that these new techniques diffuse rapidly and are used appropriately in research? Continuous Training.

The ReproNim answer to these challenges has been multi-faceted. First, it comes with the realization that training must go beyond simply “how to use the tools”: without insight into how and why these tools are constructed, how they operate, what their dependencies are, and how to completely document what has been done, one may be led to a “quick win” in running a tool, but in a superficial way that perpetuates a lack of transparency. Without training to understand the inner mechanisms or conceptual aspects of a tool or approach, there is little doubt that any knowledge gained will only be short-term. Trainers will have wasted time presenting overly specific material, and trainees will rapidly get stuck, left feeling powerless in front of a long list of magical commands that do not add up to a coherent framework for practical problem solving. This problem is famously illustrated by the “how to draw an owl” meme: acquiring in-depth knowledge requires time (see the figure below, sourced from https://cryptogenomicon.org/2016/09/08/from-zero-to-python/).

Second, we also realized that training needs to be practical. Training only on theoretical and conceptual components would be useful, but it would be clearly disconnected from the actual work and practices of our community. Often, neuroimagers are subjected to a set of high-level concepts that are hard to understand and hard to put into practice. The most efficient training in applied sciences is through practical, goal-oriented, hands-on work.

[Figure: “how to draw an owl”]

Combining these two constraints (depth and practicality) has largely defined the ReproNim training program. Measures of success are hard to define, but the training methods and material have been oriented to answer these complementary needs. These materials have, to date, been presented in two main formats.

Hands-on workshops and online material

First, we worked to implement some aspects of the inverted classroom, where lectures are done online and class time is spent solving problems (see, for example, https://ii.library.jhu.edu/tag/inverted-classroom/). We started with comprehensive, web-based online material (see https://www.repronim.org/teach.html) that spans what we identified as the set of core components needed for reproducible neuroimaging research: 1) How to make data FAIR; 2) What the basic computer literacy requirements are; 3) How to build and use reproducible pipelines; and 4) What the key statistical concepts for reliable results are.

Based on this core material (which could span weeks for each module) we established a series of ‘introductory workshops’. These 1-1.5 day workshops are designed to give trainees practical knowledge and an environment that they can bring back to their labs. The format is close to that of the Data (https://datacarpentry.org/) and Software (https://software-carpentry.org/) Carpentries, and brings some trainees to the level at which making (and understanding) a pull request against the training material (all of which is maintained in and publicly accessible via GitHub) becomes a feasible task. This is crucial, because we are then putting these training materials in the hands of a larger community, distributing the work of finding errors or unclear aspects to the user community, and making ‘this’ training material ‘their’ training material. But because there isn’t enough time to actually teach all that is needed, we consider these workshops to be an illustration of the online material, to which trainees can return at their own pace. We are also finalizing a complete MOOC version of the materials using the Moodle platform, to facilitate reuse of this material in a more formal setting.

Train-the-trainer workshops: A pyramidal scheme

Organizing teaching at workshops has been fruitful and rewarding. However, this solution does not scale: it would take too many resources to train the entire research community in this manner alone. More recently, we have focused on organizing “train the trainer” workshops in partnership with the International Neuroinformatics Coordinating Facility (incf.org), in which a small number of fellows were selected and invited to create a plan for their own training event, and, we hope, to run their own “train the trainer” workshops in their own institutional settings. Having investigators skilled in the use of these tools, embedded in various labs around the world, will help the adoption of these best practices, with the trainers themselves becoming the go-to person at each of these sites.

Summary

If there is one take-home message, it is that training is not a secondary aspect of reproducible research. If we are serious about changing the detailed practices of research, fostering the use of the best tools, and developing the capacity to adapt to the evolving landscape, then training must be at the heart of the effort: growing a community of a new type of researcher who invests in long-term, conceptual training in order to adapt rapidly and adopt the practices enabled by new generations of tools, methods, and software for reproducible and replicable research.

BIDS and NIDM: Improving Imaging Data Sharing Together

BIDS (the Brain Imaging Data Structure) and NIDM (the NeuroImaging Data Model) both grew out of the INCF NeuroImaging DAtaSHaring Task Force (NI-DASH), as parallel efforts to address different aspects of data sharing, specifically the need for standards of organization and annotation. A recent joint statement by these development groups has been released to better document the synergy between these initiatives. In brief, these two initiatives can be summarized as follows:

BIDS is a standard that prescribes a formal convention for naming and organizing neuroimaging data and metadata in a file system. This convention simplifies documentation, communication, and collaboration between users, and enables easier data validation and software development through consistent paths and names for data files.

NIDM is a Semantic Web-based metadata standard that helps capture and describe experimental data, analytic workflows and statistical results via the provenance of the data. NIDM uses consistent data descriptors in the form of machine accessible terminology, and a formal, extensible data model, which enables rich aggregation and query across datasets and domains.
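To ground these summaries, here is a minimal sketch, in Python, of a BIDS-style layout (subject/session/modality directories with predictable file names). The dataset name, subject label, and metadata values are hypothetical, and a real dataset would typically be checked with the BIDS validator.

```python
from pathlib import Path
import json

# A minimal, illustrative BIDS-style skeleton. Dataset name, subject label, and
# metadata values below are hypothetical.
root = Path("my_bids_dataset")
anat = root / "sub-01" / "ses-01" / "anat"
anat.mkdir(parents=True, exist_ok=True)

# Required top-level dataset metadata.
(root / "dataset_description.json").write_text(
    json.dumps({"Name": "Example dataset", "BIDSVersion": "1.4.0"}, indent=2)
)

# Tabular participant-level metadata.
(root / "participants.tsv").write_text("participant_id\tage\tIQ\nsub-01\t27\t112\n")

# The T1-weighted image itself would live at a predictable, tool-friendly path:
print(anat / "sub-01_ses-01_T1w.nii.gz")
```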

BIDS has rapidly become a critically useful tool in neuroimaging data sharing by greatly reducing the barriers to documenting and sharing the imaging aspects of a study. Indeed, BIDS is specifically implicated in the ReproNim 5-Steps to More Reproducible Neuroimaging Research recommendations. However, many of the nuances that determine the meaning of a specific dataset or its derivatives require additional information to be fully captured. For example, the meaning of IQ as a measure of intelligence, as might be reported in a BIDS ‘participants.tsv’ file, can depend on how the data were collected and reported: is it a ‘full-scale’, ‘performance’, or ‘verbal’ IQ? Semantic markup, as supported by the standard descriptors of the NIDM representation, helps disambiguate measures by annotating a measure (e.g. IQ, age) relative to the concept it represents, including documentation of the methods, units, ranges, etc. associated with that measure. Because the semantics of a measure are as important to understanding shared data as the format of its representation, semantic annotation is also a key element of the ReproNim 5-Steps.
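As a concrete illustration of such annotation, a BIDS ‘participants.json’ sidecar can describe each column of ‘participants.tsv’. The sketch below shows how an ‘IQ’ column could be tied to the specific concept it measures, which is the kind of disambiguation that NIDM-style semantic markup builds on; the column names, the instrument mentioned, and the concept URL are hypothetical, stand-ins for terms from a curated vocabulary (e.g. InterLex or the Cognitive Atlas).

```python
import json

# Sketch of a BIDS column-description sidecar (participants.json) that
# disambiguates an "IQ" column by pointing to the concept it measures.
# The TermURL and the instrument named in the Description are hypothetical.
iq_annotation = {
    "IQ": {
        "LongName": "Full-scale intelligence quotient",
        "Description": "Full-scale IQ (instrument named here is illustrative)",
        "TermURL": "https://example.org/terms/full-scale-iq",
    },
    "age": {"Description": "Age at scan", "Units": "years"},
}

with open("participants.json", "w") as f:
    json.dump(iq_annotation, f, indent=2)
```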

We at ReproNim resonate with the conclusion of the Joint BIDS-NIDM Statement in our support for the:

“…integrated use of both of these standards, (BIDS plus NIDM can be defined as a “SemanticBIDS” representation), in order to both maximize the ‘ease of [re]use’ and ‘ease of sharing’ of neuroimaging data in support of greater research transparency. The BIDS and NIDM development communities will continue to work together to build tools for further synergies between these initiatives.”

As such, ReproNim strives to increase the efficiency of neuroimaging tools that help drive the adoption of these best practices of data sharing, in support of our overarching goal of enhancing overall neuroimaging research reproducibility.

(Discover, Replicate, Innovate)^Repeat

“Reproducible by Design”

What is Reproducibility

In the era of ‘questioning everything’ with respect to its impact on neuroimaging analysis reproducibility, we start with a set of petites histoires that look at the implications of various choices that researchers routinely make, and often take for granted.

First, let’s set the stage: while there are many definitions around the concept of ‘reproducibility’, I’m a bit partial to the one reflected in the following figure.

[Figure: ReproSpectrum]

Here, we define a number of concepts that we will return to over and over again in the course of our stories:

  • Re-executability (publication-level replication): The exact same data, operated on by the exact same analysis, should yield the exact same result. Current publications, in order to maintain readability, do not typically provide a complete specification of the exact analysis method or access to the exact data. Many published neuroimaging experiments are therefore not precisely re-executable (a very literal check of re-executability is sketched after this list). This is a problem for reproducibility.
  • Generalizability: We can divide generalizability into three variations:
    • Generalization Variation 1: Exact Same Data + Nominally ‘Similar’ Analyses should yield a ‘Similar’ Result (e.g. FreeSurfer subcortical volumes compared to FSL FIRST)
    • Generalization Variation 2: Nominally ‘Similar’ Data + Exact Same Analysis should yield a ‘Similar’ Result (e.g. the cohort of kids with autism I am using compared to the cohort you are using)
    • Generalized Reproducibility: Nominally ‘Similar’ Data + Nominally ‘Similar’ Analyses should yield a ‘Similar’ Result
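As a very literal illustration of re-executability, the sketch below compares checksums of an original output and a re-executed output: the exact same data run through the exact same analysis should produce a bit-identical result. The file paths are hypothetical; in practice the comparison would cover every output of the workflow.

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Compare the outputs of an original run and a re-execution (paths are hypothetical).
original = Path("results_original/thickness_lh.csv")
rerun = Path("results_rerun/thickness_lh.csv")

if sha256sum(original) == sha256sum(rerun):
    print("Bit-identical result: the analysis re-executed exactly.")
else:
    print("Results differ: investigate the data, software versions, or environment.")
```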

We contend that ‘true findings’ in the neuroimaging literature should be able to achieve this ‘Generalized Reproducibility’ status in order to be valid claims. As generalized reproducibility takes numerous claims and multiple publications to be established, most publications are themselves reporting what I would call ‘proto-claims’. These proto-claims may, or may not, end up being generalized. Since in our publications we do not really characterize data, analysis, and results very exhaustively, this lack of provenance leaves the concept of ‘similar’ with lots of wiggle room for interpretation (either to enhance similarity or to highlight differences, depending on the interests of the author). In addition, we, as a community, tend to treat these individual reports, or proto-claims, as if they were established scientific claims (generalizably reproducible), since we do not really have any proper ‘system’ (apart from our own reading of the literature) to track the evolution of a claim.

While much of the work of ReproNim is to help establish easy-to-use end-user tools to exhaustively characterize data, analysis, and results (in order to enhance the community’s ability to explore the ‘reproducibility landscape’ of any given publication and its claims), it is equally important to work on the claim identification and tracking problem, so that we can detect when our more ‘reproducible and re-executable’ procedures have established the ‘generalized reproducibility’ of a specific finding. Our next petites histoires will delve more deeply into the details of the neuroimaging analysis ecosystem.

 

Introducing the ReproNim Blog

ReproNim is a Center for Reproducible Neuroimaging Computation. As an NIH/NIBIB Biomedical Technology Research Center (BTRC) P41, ReproNim seeks to solve the ‘last mile’ problem for the actual utilization of the myriad neuroinformatics resources that have been developed, but are not routinely used, in support of the publication of more reproducible neuroimaging science. More details on the overall program can be found on the ReproNim website.

With this blog, the ReproNim team hopes to bring ‘little stories’ (les petites histoires) to our readers that highlight issues and solutions in the ongoing quest for enhanced reproducibility in neuroimaging. Feel free to comment, contact us (email: info@reproducibleimaging.org), or otherwise engage with the effort to:

(Discover, Replicate, Innovate)^Repeat