ET04 – Marine metagenomics infrastructure as driver for research and industrial innovation

ELIXIR talk – session: ELIXIR – Data resources.

Abstract

Background. Marine metagenomics is still in its infancy, but it is expanding rapidly. Recent large-scale oceanographic projects such as Tara Oceans (1) and Ocean Sampling Day (OSD) (2) have generated more than 10 Tb of sequence data from the prokaryote fraction. Metagenomics datasets in general are growing larger as uptake of the approach increases and the cost of sequencing falls. To realise the research and innovation potential of marine metagenomics, users must be able to analyse and interpret the data as they are produced. In particular, there is an urgent need to establish sustainable data management infrastructure, data resources and analysis pipelines dedicated to the marine domain. As part of the ELIXIR Pilot Action project (3), we performed a gap analysis and made recommendations for developing sustainable ELIXIR services for marine metagenomics in four areas: standardisation of metagenomics data, establishing marine metagenomics resources, gold standard pipelines for metagenomics analysis, and exploring HPC and storage technologies. The outcome of the pilot fed into the ELIXIR-EXCELERATE Marine metagenomics Use Case (4), helping to define the requirements and specifications for a sustainable ELIXIR marine metagenomics infrastructure.
A comprehensive metagenomics data standards environment. The context in which marine metagenomics projects are conducted is often lost, because experimental data and information about the sample and environment are rarely submitted along with the sequence data. When these contextual data are missing, comparison and analysis across studies and environments are hampered or even impossible. Ideally, a metagenomics study should report on each processing step, from contextual data of sampling, through experimental variables of sequencing and metadata of sequence analysis, to parameters associated with archiving of the analysed data.
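As a rough illustration of reporting on each processing step, the minimal Python sketch below checks a study record for gaps at the four stages named above (sampling, sequencing, analysis, archiving). The stage keys, field names and example values are assumptions chosen for illustration; they are not taken from the MIxS or M2B3 checklists and do not describe any real submission.

```python
# Hypothetical required fields per processing stage (illustrative names only,
# not the actual MIxS or M2B3 descriptors).
REQUIRED_FIELDS = {
    "sampling": ["collection_date", "lat_lon", "depth"],
    "sequencing": ["seq_meth"],
    "analysis": ["pipeline_name", "pipeline_version"],
    "archiving": ["archive_accession"],
}


def missing_context(record):
    """Return {stage: [missing field names]} for every stage with gaps."""
    gaps = {}
    for stage, fields in REQUIRED_FIELDS.items():
        values = record.get(stage, {})
        # A field counts as missing if it is absent or empty.
        missing = [name for name in fields if not values.get(name)]
        if missing:
            gaps[stage] = missing
    return gaps


# Made-up record: depth was not reported and nothing is known about archiving.
example_record = {
    "sampling": {"collection_date": "2015-01-01", "lat_lon": "10.0 N 20.0 E"},
    "sequencing": {"seq_meth": "example sequencer"},
    "analysis": {"pipeline_name": "example-pipeline", "pipeline_version": "1.0"},
}

print(missing_context(example_record))
```

A check of this kind could run at submission time, so that missing contextual data are flagged before the sequence data are archived.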
Over the past five years, there has been a focus on developing contextual data standards suitable for marine metagenomics. However, further development is needed to enrich the records so that the data can be better utilised. We will use the MIxS (5) and M2B3 (6) standards as a basis for drafting a comprehensive metagenomics data standards environment. We will also develop descriptors for metagenomic analysis and for archiving of analysed data. Partners and experts in the field of marine metagenomics will review the drafted guidelines to achieve an objective and comprehensive view.
Marine data resources. Marine metagenomics research and innovation are limited by the lack of dedicated reference data resources. Existing databases are generalist or biased towards human-associated microorganisms, and the contextual data for the records are often incomplete or missing. Because marine organisms are poorly covered, only about one quarter of the sequences from typical marine samples can be annotated. To improve the characterisation of marine environmental samples, we will establish three publicly available databases for the marine microbial domain. The Marine Reference Database (MarRef) will serve as a gold standard reference database, consisting of manually curated records of completely sequenced marine microbial organisms. MarRef will be based on annotated records available in public resources, such as INSDC (8), which provides submitted annotation, and RefSeq, which provides standardised annotation from NCBI’s prokaryotic genome annotation pipeline (7). We will manually curate each record to improve the contextual content, thereby expanding the usefulness of the sequence data.
While MarRef consists only of complete genomes, the Marine Genome Database (MarDb) aims to include all Whole Genome Shotgun (WGS), Transcriptome Sequence Shotgun (TSS) and Single Cell Genome (SCG) sequences of marine microorganisms available at the European Nucleotide Archive (ENA). The Marine Microbial Gene Catalogue (MarCat) will be constructed from all marine metagenomics projects available at ENA. The sequences will be assembled and annotated using META-pipe to ensure consistency of the datasets in MarCat.
Gold standards and pipelines. As in most emerging bioinformatics fields, a myriad of tools that perform different types of metagenomics analysis are constantly being published or updated. Pipelines that aggregate such tools are therefore in constant flux, which makes comparing the results from two different pipelines, or even from two versions of a single pipeline, non-trivial. To improve the interoperability, understanding and comparison of our pipelines, EBI Metagenomics (9) and META-pipe (10), we will semantically describe the tools in our pipelines, including their inputs and outputs. During the course of this work, we will evaluate several types of analysis tool (e.g. taxonomic assignment, open reading frame prediction), defining gold standard tools for different scenarios from the marine sector. Where possible, these tools and their corresponding benchmarks will be registered within ELIXIR platform services. Together, these activities will make it easier to perform user-driven, bespoke extension analyses and will increase the utility of the standard analyses for the academic, clinical and biotechnological sectors.
Datasets vary in size, ranging from tens of gigabytes of raw sequences for typical datasets to terabytes for flagship projects such as Tara Oceans (1). Both EBI Metagenomics (EMG) and META-pipe have been designed for parallel execution on high-performance computing (HPC) clusters.
However, there is a need for more elastic allocation of computational resources. We are therefore exploring deployment of our pipelines on the OpenStack-based cPouta cloud at CSC (Finland) and on commercial clouds such as Amazon Web Services.
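To make the idea of semantically describing pipeline tools, including their inputs and outputs, more concrete, the Python sketch below records each pipeline step as a small structured description and compares two pipeline versions step by step. It is only a sketch under stated assumptions: the `ToolStep` structure, the operation labels and the tool names are hypothetical, and they do not reflect the actual descriptions used by EBI Metagenomics or META-pipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolStep:
    """Hypothetical description of one pipeline step."""
    operation: str   # kind of analysis, e.g. "taxonomic assignment"
    tool: str        # tool name (made up in the example below)
    version: str
    inputs: tuple    # data types consumed
    outputs: tuple   # data types produced


def diff_pipelines(a, b):
    """Compare two pipelines, keyed by operation, and label each step."""
    steps_a = {step.operation: step for step in a}
    steps_b = {step.operation: step for step in b}
    report = {}
    for op in sorted(set(steps_a) | set(steps_b)):
        sa, sb = steps_a.get(op), steps_b.get(op)
        if sa is None or sb is None:
            report[op] = "only in one pipeline"
        elif sa == sb:
            report[op] = "identical"
        elif (sa.inputs, sa.outputs) == (sb.inputs, sb.outputs):
            report[op] = "same data types, different tool or version"
        else:
            report[op] = "different inputs or outputs"
    return report


# Two made-up pipeline versions with placeholder tool names.
pipeline_v1 = [
    ToolStep("orf prediction", "orf_tool_a", "1.0", ("contigs",), ("protein sequences",)),
    ToolStep("taxonomic assignment", "tax_tool_a", "2.1", ("reads",), ("taxon table",)),
]
pipeline_v2 = [
    ToolStep("orf prediction", "orf_tool_a", "1.1", ("contigs",), ("protein sequences",)),
    ToolStep("taxonomic assignment", "tax_tool_b", "0.9", ("contigs",), ("taxon table",)),
    ToolStep("functional annotation", "func_tool_a", "3.0", ("protein sequences",), ("annotation table",)),
]

for operation, status in diff_pipelines(pipeline_v1, pipeline_v2).items():
    print(f"{operation}: {status}")
```

With descriptions of this kind, a difference between two result sets can be traced to the step whose tool, version or data types changed, which is the comparison problem described under "Gold standards and pipelines".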

References

1. Sunagawa, S. et al. (2015) Structure and function of the global ocean microbiome. Science 348:6238
2. Kopf, A. et al. (2015) The ocean sampling day consortium. Gigascience 4:27
3. Towards user centric services for the Marine community (https://www.elixir-europe.org/news/towards-user-centric-services-marine-community)
4. Marine metagenomic infrastructure Use case: Marine metagenomic infrastructure as driver for research and industrial innovation (https://www.elixir-europe.org/excelerate/marine)
5. MIxS, v.4.0. (http://wiki.gensc.org/index.php?title=MIxS)
6. Ten Hoopen, P. et al. (2015) Marine microbial biodiversity, bioinformatics and biotechnology (M2B3) data reporting and service standards. Stand Genomic Sci. 10:20
7. NCBI’s prokaryotic genome annotation pipeline (http://www.ncbi.nlm.nih.gov/genome/annotation_prok/)
8. INSDC: International Nucleotide Sequence Database Collaboration (http://europepmc.org/articles/PMC4702924)
9. Mitchell, A. et al. (2016) EBI metagenomics in 2016 – an expanding and evolving resource for the analysis and archiving of metagenomic data. Nucleic Acids Res. 44(D1):D595-603
10. Robertsen, E. et al. (2016) META-pipe – Pipeline Annotation, Analysis and Visualization of Marine Metagenomic Sequence Data (http://arxiv.org/abs/1604.04103v1)

Authors

Nils Peder Willassen, UiT The Arctic University of Norway, Norway
Robert D. Finn, EMBL-EBI, United Kingdom
Guy Cochrane, EMBL-EBI, United Kingdom
Lars Ailo Bongo, UiT The Arctic University of Norway, Norway
Petra Ten Hoopen, EMBL-EBI, United Kingdom