ET03 – NGS-Logistics: federated analysis of NGS sequence variants across multiple locations

ELIXIR talk – session: ELIXIR – Human genomics and translational data.

Abstract

Next-Generation Sequencing (NGS) is a key tool in genomics, particularly in research on and diagnosis of human Mendelian, oligogenic, and complex disorders. With the dramatic decrease in price and turnaround time, large amounts of human sequencing data have been generated over the past decade. Crucially, the rate at which NGS data is produced outpaces Moore’s law and challenges our ability to conveniently store, exchange, and analyze this data. The collection of files resulting from the analysis of a single whole-genome study can take up to 50 GB of disk space. This raises significant issues for computing, data storage, and data transfer, with off-site data transfer currently being a key bottleneck. Moreover, the analysis of NGS data raises the major challenge of reconciling federated analysis of personal genomic data with the confidentiality needed to protect privacy. In many situations, analyzing data from a single study alone is much less powerful than correlating it with other studies. In particular, when investigating a mutation of interest, it is extremely useful to obtain data about other patients or controls who share similar mutations. However, personal genome data (whole genome, exome, transcriptome data, etc.) is sensitive personal data. Its confidentiality must be guaranteed at all times, and only duly authorized researchers should have access to it. To address these challenges, we developed NGS-Logistics, a data structure that meets the requirements of an application able to process data from multiple sources inclusively and comprehensively while guaranteeing privacy and security. NGS-Logistics is a web-based application that provides a data structure for analyzing NGS data in a distributed way. The data can be located in any data center, anywhere in the world.
NGS-Logistics provides an environment in which researchers do not need to worry about the physical location of the data. Subject to the user's access rights, queries are sent to each remote server. Each host processes the request and returns its results to the main server, where all privacy restrictions on the data are enforced. Once the results are ready, the end user can see the desired information. Depending on the type of query, the results are divided into two parts. The first part covers the samples to which the user has authorized access, and for which the user can see all details. The second part covers the whole population, for which the user can see only aggregate statistics without details. An example of such a query would be to review the mutations present at a single genomic position in each individual patient from a set of patients to which the user has authorized access (first part) and to contrast these results with the background frequency of the mutation in the reference populations (second part). NGS-Logistics has been installed at KU Leuven, UZ Leuven, CMG Antwerp, CHU Liege, UZ Brussels and UCL. Almost 3000 exome sequencing samples have been shared successfully. Most users are from these centers, and there are also some users from other research institutes. Local administrators can easily control users' access levels and manage the access control list.
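The two-part result described above can be illustrated with a minimal Python sketch. It is not the NGS-Logistics implementation: the names `Sample`, `Site`, `federated_query`, the variant tuple, and the sample data are all hypothetical, and the sites are modelled as in-memory objects rather than remote servers. It assumes each site reports which of its samples carry the queried variant and that the main server then applies the access rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample record: an ID plus the variants it carries."""
    sample_id: str
    variants: frozenset  # set of (chromosome, position, alt_allele) tuples


@dataclass
class Site:
    """Hypothetical stand-in for one remote data center."""
    name: str
    samples: list

    def query(self, variant):
        # Runs where the data lives; only a small summary is returned,
        # never the underlying sequence files.
        carriers = [s.sample_id for s in self.samples if variant in s.variants]
        return {"carrier_ids": carriers, "n_samples": len(self.samples)}


def federated_query(sites, variant, authorized_ids):
    """Main-server step: merge site answers and apply access rules."""
    authorized_carriers = {}
    n_carriers = 0
    n_samples = 0
    for site in sites:
        answer = site.query(variant)
        # Part 1: per-sample detail, only for samples the user may see.
        visible = [sid for sid in answer["carrier_ids"] if sid in authorized_ids]
        if visible:
            authorized_carriers[site.name] = visible
        # Part 2: aggregate counts over the whole population.
        n_carriers += len(answer["carrier_ids"])
        n_samples += answer["n_samples"]
    frequency = n_carriers / n_samples if n_samples else 0.0
    return {
        "authorized_carriers": authorized_carriers,
        "population": {
            "carriers": n_carriers,
            "samples": n_samples,
            "carrier_frequency": frequency,
        },
    }


# Made-up example data: two sites, four samples, one variant of interest.
variant = ("chr1", 12345, "A")
site_a = Site("site_a", [Sample("A1", frozenset({variant})),
                         Sample("A2", frozenset())])
site_b = Site("site_b", [Sample("B1", frozenset({variant})),
                         Sample("B2", frozenset({variant}))])

# The user is authorized for the samples at site_a only.
result = federated_query([site_a, site_b], variant, authorized_ids={"A1", "A2"})
print(result)
```

In this sketch the user sees sample-level detail only for site_a, while the carriers at site_b contribute to the population counts alone. Whether the access filter runs on the main server or at each site is a design choice that the abstract only outlines, so the split shown here is an assumption.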

Authors

Amin Ardeshirdavani, KU Leuven ESAT – STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
Erika Souche, KU Leuven, Belgium
Luc Dehaspe, KU Leuven Department of Human Genetics (Genomics Core), Belgium
Jeroen Van Houdt, KU Leuven Department of Human Genetics (Genomics Core), Belgium
Joris Vermeesch, KU Leuven Department of Human Genetics (Genomics Core), Belgium
Yves Moreau, KU Leuven, Belgium