Highlight talk – Theme: Genome.
Abstract
Identifying coding genes is an essential step in genome annotation. Here, we present an approach that uses existing whole-genome alignments to detect conserved coding exons and then maps gene annotations from one genome to all aligned genomes. We show that genome alignments contain thousands of spurious frameshift and splice site mutations in exons that are truly conserved. To overcome this limitation, we developed CESAR (Coding Exon-Structure Aware Realigner), which realigns coding exons while considering the annotated reading frame and splice sites. CESAR effectively avoids spurious frameshifts and detects 91% of shifted splice sites. As a result, thousands of additional conserved exons are identified, and 99% of the exons that lack inactivating mutations match real exons. Finally, we used CESAR to annotate human genes in 99 vertebrates. CESAR can readily be applied to other genome alignments to accurately annotate coding genes in vertebrates and invertebrates.
Authors
Michael Hiller, Max Planck Institute of Molecular Cell Biology and Genetics, Germany
Virag Sharma, Max Planck Institute of Molecular Cell Biology and Genetics, Germany
Anas Elghafari, Max Planck Institute of Molecular Cell Biology and Genetics, Germany
