ET05 – Reference Proteomes, Pan Proteomes and Redundant Proteomes: concepts that define protein space in UniProtKB

ELIXIR talk – session: ELIXIR – Data resources.

Abstract

The Universal Protein Resource (UniProt) is a comprehensive central resource for protein sequences and their functional annotation, with extensive links to complementary resources. UniProt is built upon the bioinformatics infrastructure and scientific expertise of the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB). The UniProt Knowledgebase (UniProtKB) provides a large collection of proteomes through the Proteomes pages (1). A proteome is the set of proteins thought to be expressed by an organism and is typically obtained from the translation of a fully sequenced, annotated genome. Proteomes are classified into three categories that are not mutually exclusive [1]: Reference Proteomes, Pan Proteomes and Redundant Proteomes. Reference Proteomes are chosen to provide broad coverage of the tree of life and constitute a representative cross-section of the taxonomic diversity found within UniProtKB. These proteomes, some selected by the scientific community and others determined computationally [2], include model organisms and other proteomes of interest to biomedical and biotechnological research. For each Reference Proteome, UniProtKB provides a set of files that can be downloaded via FTP (2). In the repository, file names are prefixed with the proteome identifier (UPID) and the taxonomic identifier. For each proteome, users can retrieve the set of canonical protein sequences, one sequence per gene (fasta.gz); the additional protein sequences for genes with isoforms or variants (additional.fasta.gz); the underlying DNA coding sequences (DNA.fasta.gz); gene to UniProtKB accession mappings (gene2acc.gz); and the identifiers from all cross-referenced databases mapped to UniProtKB accessions (idmapping.gz) [3]. A Pan Proteome is the full set of proteins thought to be expressed by a group of related organisms (e.g. multiple strains of the same bacterial species).
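As a minimal sketch of the Reference Proteome file naming described above, the following Python function builds the expected file names for one proteome. The function name and the exact "_" and "." separators between the prefix and each suffix are assumptions made for illustration; only the UPID and taxonomic identifier prefix and the five suffixes come from the description above, so the actual names should be checked against the FTP listing (2).

```python
# Suffixes listed in the abstract. How each one is joined to the prefix
# ("_" versus ".") is an assumption, not something stated in the abstract.
SUFFIXES = {
    "canonical": ".fasta.gz",              # one sequence per gene
    "additional": "_additional.fasta.gz",  # isoforms or variants
    "dna": "_DNA.fasta.gz",                # underlying DNA coding sequences
    "gene2acc": ".gene2acc.gz",            # gene to UniProtKB accession mappings
    "idmapping": ".idmapping.gz",          # cross-reference identifier mappings
}


def reference_proteome_files(upid, taxon_id):
    """Return a dict of file kind -> expected file name (hypothetical helper)."""
    prefix = f"{upid}_{taxon_id}"  # proteome identifier, then taxonomic identifier
    return {kind: prefix + suffix for kind, suffix in SUFFIXES.items()}


if __name__ == "__main__":
    # Illustrative identifiers: the human proteome UP000005640, taxon 9606.
    for kind, name in reference_proteome_files("UP000005640", 9606).items():
        print(kind, name)
```

Keeping the suffixes in a single table means any separator that turns out to differ on the FTP site can be corrected in one place.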
For each Reference Proteome cluster, a Pan Proteome is the set of sequences consisting of all the sequences in the Reference Proteome plus the unique protein sequences that are found in other species or strains of the cluster but not in the Reference Proteome. These additional sequences are identified using UniRef50 membership. UniProtKB Pan Proteomes encompass all non-redundant proteomes and are aimed at users interested in phylogenetic comparisons and the study of genome evolution and gene diversity. On the Proteome page, when a proteome has proteins that are part of a larger Pan Proteome, the ‘Pan proteome’ row links back to the associated Reference Proteome; a link to download the full FASTA sequence set of the Reference Proteome is also available. The Pan Proteome data are available for download from the FTP site (3). UniProtKB has witnessed exponential growth, with a two-fold increase in the number of entries in 2014. This growth follows the increased submission of multiple genomes for the same or closely related organisms, which led to a high level of redundancy in unreviewed UniProtKB/TrEMBL, where many sequences were over-represented. This was especially true for bacterial species for which different strains of the same species have been sequenced and submitted. Two extreme examples are Mycobacterium tuberculosis and Staphylococcus aureus, which were represented by 1,692 and 4,080 strains respectively, corresponding to 5.97 million and 10.88 million entries. To reduce this redundancy, we have developed a procedure to identify highly Redundant Proteomes within species groups using a combination of manual and automatic methods [4]. We applied this procedure to bacterial proteomes (which constituted 82% of UniProtKB/TrEMBL as of release 2015_03) beginning with release 2015_04. Sequences corresponding to Redundant Proteomes were removed from UniProtKB, a 51% drop: 47.0 million entries were deleted out of 92.6 million in total.
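The Pan Proteome construction described at the start of this paragraph can be illustrated with a short Python sketch. It is a simplified reading of that description, not the production procedure: we assume here that a sequence from another proteome counts as "unique" when its UniRef50 cluster is not yet represented in the growing set. The function name, the data layout and the toy accessions are all hypothetical.

```python
def build_pan_proteome(reference, others, uniref50):
    """Sketch of Pan Proteome assembly for one Reference Proteome cluster.

    reference: list of accessions in the Reference Proteome
    others:    list of proteomes (each a list of accessions) in the same cluster
    uniref50:  dict mapping accession -> UniRef50 cluster identifier
    """
    # Start from every sequence in the Reference Proteome.
    pan = list(reference)
    # UniRef50 clusters already represented in the Pan Proteome.
    covered = {uniref50[acc] for acc in reference}
    for proteome in others:
        for acc in proteome:
            cluster = uniref50[acc]
            # Assumption: keep a sequence only if its UniRef50 cluster is new.
            if cluster not in covered:
                covered.add(cluster)
                pan.append(acc)
    return pan


if __name__ == "__main__":
    # Toy data with made-up accessions and cluster names.
    reference = ["P1", "P2"]
    others = [["Q1", "Q2"], ["R1"]]
    uniref50 = {"P1": "A", "P2": "B", "Q1": "A", "Q2": "C", "R1": "C"}
    print(build_pan_proteome(reference, others, uniref50))
```

In the toy data, Q1 shares a UniRef50 cluster with P1 and R1 shares one with Q2, so only Q2 is added to the two reference sequences.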
From release 2015_04 on, we no longer create new UniProtKB/TrEMBL records for proteomes identified as redundant; between releases 2015_04 and 2016_04 this accounted for 81.2 million fewer sequence entries. The redundant sequences remain available in the UniParc sequence archive. All proteomes (Redundant and non-Redundant) remain searchable through the Proteomes pages (1). Recently we have observed the same trend towards redundant proteomes in fungi, and we plan to apply Proteome Redundancy Removal to fungi by the end of 2016. In conclusion, the different ways of organizing proteome space in UniProtKB have proven to help users find relevant data in a consistent and intuitive fashion. Removing the Redundant Proteomes from UniProtKB also helps users to search more efficiently, and users who want to retrieve the protein sequences of Redundant Proteomes can still do so from UniParc.

(1) http://www.uniprot.org/proteomes
(2) ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/
(3) ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/pan_proteomes/
[1] UniProt Consortium. “Reorganizing the protein space at the Universal Protein Resource (UniProt).” Nucleic acids research (2011).
[2] Chen, Chuming, et al. “Representative proteomes: a stable, scalable and unbiased proteome set for sequence analysis and functional annotation.” PLoS ONE 6.4 (2011).
[3] Altenhoff, Adrian M., et al. “Standardized benchmarking in the quest for orthologs.” Nature Methods (2016).
[4] Bursteinas, Borisas, et al. “Proteome Redundancy in UniProt” (in press).

Authors

Benoit Bely, The European Bioinformatics Institute (EMBL-EBI), United Kingdom
Ramona Britto, The European Bioinformatics Institute (EMBL-EBI), United Kingdom
Borisas Bursteinas, The European Bioinformatics Institute (EMBL-EBI), United Kingdom
Alan Wilter Sousa Da Silva, The European Bioinformatics Institute (EMBL-EBI), United Kingdom
Andrea Auchincloss, SIB Swiss Institute of Bioinformatics, Switzerland
Chuming Chen, Protein Information Resource, Center for Bioinformatics and Computational Biology, University of Delaware, USA
Maria Martin, The European Bioinformatics Institute (EMBL-EBI), United Kingdom
UniProt Consortium, EBI, SIB, PIR, United Kingdom