AT09 – The THOR project: Integrating persistent identifiers such as ORCIDs in life sciences data resources

Application talk

Abstract

The THOR (Technical and Human infrastructure for Open Research) project (http://project-thor.eu) is a 30-month project funded by the European Commission under the Horizon 2020 programme. THOR aims to extend the integration of persistent identifiers (PIDs) into platforms, services and workflows. The aim is not to build new, standalone services, but to work with existing systems and communities, in this case the life sciences research community. By creating new and improved integrations of PIDs in the services that researchers and institutions actually use, we aim to ensure that PIDs are usefully embedded in research outputs and activities from the very beginning, with minimal effort for researchers. The THOR consortium includes ORCID, DataCite, EMBL-EBI, The British Library, CERN, PANGAEA, DRYAD, Elsevier Labs and PLOS. Alongside publishers, data centres, research organisations and national libraries, it includes two major PID infrastructure providers: ORCID and DataCite.

In the life sciences, data resources use accession numbers, which are unique within a given data resource or namespace but may be ambiguous in a broader context. Some resources use DOIs as an additional identifier at the level of a study, but this practice currently accounts for a very small proportion of the data ecosystem in biology. The challenge for the THOR project is therefore to align these well-established, community-standard systems with the more generic DOI-based infrastructures offered by DataCite, in order to ensure full integration as required. Resources such as identifiers.org (http://identifiers.org) may play a role here as a hub for life science data resources. Furthermore, supplemental data, i.e. unstructured files published alongside research articles, are not universally deposited in databases such as Zenodo, FigShare, Dryad or BioStudies.
Across the THOR partners, our goal is to simplify workflows to ensure the appropriate deposition of data within article submission processes and to encourage the assignment of PIDs such as ORCIDs (identifiers for people) within these workflows. Life sciences researchers typically publish articles as their major research output, and work by many stakeholders such as the ORCID Foundation, CrossRef, publishers and Europe PMC has gained traction on the integration of ORCIDs into article submission, publication and distribution systems. Currently there are over 2.5M articles in Europe PMC that have at least one associated ORCID, from around 250,000 unique ORCIDs (i.e. people). The THOR project aims to capitalise on this adoption in publications by extending it to the claiming of datasets to ORCIDs. EMBL-EBI and the THOR project recognise that the production of open data, submitted to the appropriate community database, is an important contribution to open science. This is nowhere more clearly demonstrated than in the fields of bioinformatics and computational biology, which rely in large part on reusing data available in public data resources. We are building services that allow ORCIDs to be integrated into data submission systems and that allow retrospective claiming of data to ORCID records, positioning these contributions alongside articles published and grants awarded. To support this work, THOR partners are building tool kits, resources and training programmes to disseminate the outcomes of the development as widely as possible.

Information about researchers’ activities is gathered in multiple systems such as manuscript submission systems, grant funding applications, data resources, citation indices, institution-based systems and personal webpages. One of the overall goals of the THOR project is to help make research reporting more efficient and consistent (i.e. report once and reuse in different contexts, as required), and to make this information easier to discover and collect accurately. The importance of this is likely to grow as credit systems extend to other research objects and contributions such as software, data analysis or peer review. In this presentation, we will describe and demonstrate tools developed by THOR at EMBL-EBI for linking datasets to ORCIDs and describe ongoing work in this area.

ORCID provides identifiers for people involved in generating research. These identifiers are persistent and actionable, and provide links to research activities such as publications, datasets and funding via other identifier systems such as Digital Object Identifiers (DOIs) and FundRef. ORCID records are maintained by their owners, who curate input from other systems such as publisher, data centre and institutional systems. ORCID iDs also help to improve the accuracy of information held in, and shared between, systems. As of February 2016, more than 2 million ORCID identifiers have been claimed.

DataCite is a DOI registration agency that provides identifiers for a wide range of academic outputs, with a particular focus on data. DataCite works with data centres to assign persistent identifiers to research objects and to develop infrastructure that supports simple and effective methods of data citation, discovery and access. As of February 2016, DataCite has minted over 7 million identifiers.

The European Bioinformatics Institute (EMBL-EBI) is part of EMBL, Europe’s flagship laboratory for the Biological and Medical Sciences. EMBL-EBI is a non-profit, intergovernmental organisation funded by EMBL member states. EMBL-EBI provides freely available data from life science experiments covering the full spectrum of molecular biology, including the life science literature. This is achieved through the hosting and development of a number of databases, services and tools. The Literature Services group in particular has experience in delivering services based on ORCIDs and on the use of data PIDs/DOIs for integrating the literature with data.

Europe PubMed Central (Europe PMC) is the leading database for the life science research literature in Europe (http://europepmc.org). It contains about 31M abstracts, including those from PubMed, and 3.6M full-text articles, about 1.2M of which are “gold” Open Access, i.e. free to read and reuse. Europe PMC integrates the literature with related data via reciprocal data-article citations across resources, and through text mining. Europe PMC has implemented a tool that allows users to link articles to their ORCID, and the team that develops Europe PMC has led the implementation of ORCID identifiers across EMBL at an institutional level.

Authors

Johanna McEntyre, EMBL-EBI, United Kingdom
Guilherme Formaggio de Mello, EMBL-EBI, United Kingdom
Florian Graf, EMBL-EBI, United Kingdom
Josh Brown, ORCID, United Kingdom
Tom Demeranville, ORCID, United Kingdom
Maaike Duine, ORCID, United Kingdom