ET12 – Using FAIR data to solve cross-resource questions in Rare Disease research

ELIXIR talk – session: ELIXIR National collaborations, international standards.

Abstract

Rare Disease research is challenging not only because of the inherent biological complexity, but also because relevant research data is rare (by definition) and scattered over different repositories and geographic locations. Moreover, data is collected and maintained by different stakeholders and is typically heterogeneous and non-interoperable. However, in order to improve understanding (and identify potential treatments) of any particular Rare Disease, researchers need to be able to discover and evaluate as much data as possible relevant to their case. Here we report on the activities of the inter-project engineering team supported by RD-Connect (http://rd-connect.eu), ODEX4All (https://www.esciencecenter.nl/project/odex4all) and BBMRI-NL 2.0 (https://www.bbmri.nl), in the context of an ELIXIR Rare Disease Implementation Study. Driven by the use case of Ring14 syndrome, we have implemented a demonstrator that harnesses a novel, practical approach to enable researchers to ask questions across the traditional boundaries of data resources. Such resources may include biobanks (biological sample data), patient registries, genomic data and (public) repositories of biological relations. Even the same type of data is typically maintained by different organizations with different data models, formats and access policies. This is a barrier for researchers who need to look at, for example, registries of different, but potentially related diseases. The demonstrator application enables queries across a number of resources: biobanks and patient registries for Ring14 syndrome, Angelman syndrome and (partial) deletion in chromosome 14 (both related to Ring14), and the (unrelated) Huntington’s Disease. It uses real resource metadata and data types, but the actual data has been obfuscated for the proof-of-principle.
Here we discuss our approach of “making data interoperable at the source” and we use a FAIR-compliant software architecture to build a fully functional, user-friendly application to query across resources. The FAIR Principles (doi:10.1038/sdata.2016.18) provide an implementation-independent, precise and measurable set of qualities for the publication of scientific research data. The principles are grouped according to the acronym into four distinct, but complementary goals: to make data Findable, Accessible, Interoperable and Reusable. A FAIR Data Point (FDP) is a lightweight web service that enables access to FAIR (meta)data. More specifically, it provides metadata at four complementary layers: FDP Metadata, Data Catalog Metadata, Dataset Metadata and finally, the Data Distribution Metadata, which includes at least one link to a data distribution. Each layer is self-descriptive and provides links to the next level of (meta)data. Consequently, given an FDP location (URL), all data and metadata in the FDP can be (automatically) browsed or mined by both humans and machines. The FDP guidelines (in development) recommend the use of standards and vocabularies such as OAI-PMH, Dublin Core, DCAT and LDP to represent metadata. We have chosen the Linked Data model for data interchange for both FDP metadata and the actual data distributions offered by the FDP. FDPs are compatible with the paradigm of “making data interoperable at the source” as they require only a minimal agreement about metadata that must be supplied by a data owner. We have emulated this process using generic tools like OpenRefine for a small number of resources. Additionally, we created an FDP for ID-Card (http://catalogue.rd-connect.eu): the metadata registry of biobanks and patient registries in the European RD-Connect project.
Its metadata already enables useful, general questions like “Which registries have data on patients with a diagnosis Ring14?”, but we are also interested in more specific questions like: “Give me blood specimens for patients that have a short neck”. For this purpose, we have made a small number of anonymized datasets related to Ring14 available as Linked Data distributions through the ID-Card FDP. Although FDPs provide data, they do not, by themselves, provide a service for query resolution. Furthermore, the standardized Linked Data query language SPARQL would require a level of technical skill that cannot be realistically expected from a typical rare disease researcher. Therefore, our demonstrator implements a simple, ad-hoc software platform, which offers a user-friendly query interface that sends automatically generated SPARQL queries to a Linked Data cache (see figure). Users first select a query template from a list, for example: “Get biosamples from donors with a specific phenotype”. Subsequently, the interface offers a dropdown autocomplete widget to select specific values for required parameters, e.g. sample type (“blood specimen”) and phenotype (“short neck”). Results are shown in a table that is marked up with links to additional information (e.g. to the resulting sample in a biobank repository). In the background, the interface transforms the user query to a SPARQL query that is resolved by a Scaleus (https://github.com/bioinformatics-ua/scaleus) cache of prefetched content from relevant FDPs. We note that FDPs can also be used in other applications using different architectures: e.g. platforms that dynamically load (or perform federated queries over) FDP content. This flexibility is possible because all resource-specific information is administered and contained within the FDP: the application need not be concerned with data updates or other stateful operations, which is a notable advantage over data warehouse-based applications.
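The template-to-SPARQL step described above can be illustrated with a minimal sketch. It is not the demonstrator's actual code: the ex: predicates, the example.org term IRIs and the helper name build_query are hypothetical placeholders, since the abstract does not specify the vocabulary or implementation used. The sketch only shows the general idea of filling a fixed query template with values that the user picked from a dropdown.

```python
import re
from string import Template

# Hypothetical vocabulary: the abstract does not name the predicates used
# in the demonstrator, so the ex: terms below are placeholders only.
BIOSAMPLE_TEMPLATE = Template("""\
PREFIX ex: <http://example.org/vocab/>
SELECT ?sample ?donor WHERE {
  ?sample ex:sampleType <$sample_type> ;
          ex:donor ?donor .
  ?donor ex:hasPhenotype <$phenotype> .
}""")

# Conservative check: an absolute http(s) IRI without characters that could
# close the <...> token and smuggle extra SPARQL into the query.
_IRI = re.compile(r"^https?://[^\s<>\"{}|\\^`]+$")


def build_query(template: Template, **params: str) -> str:
    """Fill a query template with IRIs selected in the user interface."""
    for name, value in params.items():
        if not _IRI.match(value):
            raise ValueError(f"parameter {name!r} is not a safe IRI: {value!r}")
    # substitute() raises KeyError if a required parameter is missing.
    return template.substitute(**params)


# "Get biosamples from donors with a specific phenotype", with the values
# "blood specimen" and "short neck" mapped to (hypothetical) term IRIs.
query = build_query(
    BIOSAMPLE_TEMPLATE,
    sample_type="http://example.org/terms/blood-specimen",
    phenotype="http://example.org/terms/short-neck",
)
print(query)
```

The resulting string could then be sent to any SPARQL endpoint that holds the prefetched FDP content. Restricting parameters to validated IRIs is one simple way to keep user input from altering the structure of the generated query; a real deployment may well use a different mechanism.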
Initial user feedback indicates that the application is easy to use while providing valuable insights into previously unconnected datasets. In the future we would like to extend the application with user-level support for ad-hoc queries and data exploration, for example by using existing Linked Data facet browsers or even advanced natural language SPARQL query generators. Moreover, within the context of RD-Connect and ELIXIR, we have proposed to extend the demonstrator to include data sources with other relevant data types for Rare Disease research, such as genomic (variant) data and other “-omics” data. We will also continue to develop FAIR tools within the context of the Dutch ODEX4All project. This includes user-level tools to streamline the process of making data FAIR: ideally using an easy, non-technical procedure to enable data experts “at the source” (such as biobank and patient registry maintainers) to generate FAIR data. Finally, we will look into the development of FAIR-compatible, yet secure access mechanisms on FDPs, which would support further development and exploitation of our current demonstrator in a safe, privacy-preserving environment.

Authors

Mark Thompson, Leiden University Medical Center, Netherlands
Rajaram Kaliyaperumal, Leiden University Medical Center, Netherlands
Pedro Sernadela, Universidade de Aveiro, Portugal
Robert Reihs, Medizinische Universität Graz, Austria
Davide Piscia, CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Spain
Kees Burger, Leiden University Medical Center, Netherlands
Andrew Gibson, Independent consultant, Netherlands
Claudio Carta, Istituto Superiore di Sanità, Italy
Marco Crimi, RING14 International ONLUS, Italy
Heimo Müller, Medizinische Universität Graz, Austria
Arnold Kuzniar, Netherlands eScience Center, Netherlands
Anand Gavai, Netherlands eScience Center, Netherlands
Richard Finkers, Wageningen University & Research centre, Netherlands
Sergi Beltran, CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Spain
J.L. Oliveira, Universidade de Aveiro, Portugal
Barend Mons, Leiden University Medical Center, Netherlands
Marco Roos, Leiden University Medical Center, Netherlands
Mark D. Wilkinson, Universidad Politécnica de Madrid, Spain
Luiz Bonino, Dutch Techcentre for Life Sciences, Netherlands
David van Enckevort, University of Groningen, Netherlands
Dennis Hendriksen, University Medical Center Groningen, Netherlands
Morris Swertz, University Medical Center Groningen, Netherlands