PT40 – GTRAC: Fast retrieval from compressed collections of genomic variants

Proceedings talk – Theme: Data.

Abstract

The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data. Because genomic sequences of individuals from the same species are highly redundant, most medical research deals with the variants in a sequence relative to a reference sequence. Consequently, millions of genomes are stored in databases as sets of variants. These databases are constantly updated and queried to extract information such as the variants common to individuals or groups of individuals. Previous compression algorithms for this type of database lack efficient random access, which makes querying the database for particular variants and/or individuals extremely inefficient. We present a new algorithm, GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor, and tailored succinct data structures.
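
The kind of query described above can be illustrated with a minimal sketch. This is not GTRAC itself and does no compression: it only shows the idea of storing each individual as a bit vector over a list of variant sites, with sampled prefix counts so that a rank query (how many variants an individual carries before a given site) does not scan the whole row. The class name `RankBitVector`, the function `common_variants`, the block size, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence


class RankBitVector:
    """Bit vector with sampled prefix counts (illustrative, not GTRAC's structure)."""

    def __init__(self, bits: List[int], block: int = 4) -> None:
        self.bits = bits
        self.block = block
        # prefix[k] holds the number of 1s in bits[0 : k * block].
        self.prefix = [0]
        for start in range(0, len(bits), block):
            self.prefix.append(self.prefix[-1] + sum(bits[start:start + block]))

    def access(self, i: int) -> int:
        """Return 1 if the individual carries the variant at site index i."""
        return self.bits[i]

    def rank1(self, i: int) -> int:
        """Count the 1s in bits[0:i] using one stored count plus a short scan."""
        k = i // self.block
        return self.prefix[k] + sum(self.bits[k * self.block:i])


def common_variants(rows: Sequence[RankBitVector], sites: Sequence[str]) -> List[str]:
    """Return the sites at which every individual in rows carries the variant."""
    return [s for j, s in enumerate(sites) if all(r.access(j) for r in rows)]


# Hypothetical toy data: six variant sites and two individuals.
sites = ["v0", "v1", "v2", "v3", "v4", "v5"]
individual_a = RankBitVector([1, 0, 1, 1, 0, 1])
individual_b = RankBitVector([1, 1, 1, 0, 0, 1])
shared = common_variants([individual_a, individual_b], sites)
```

In a real succinct structure the bits would be packed and the rows compressed; the sketch keeps plain Python lists so that the access and rank logic stays easy to follow.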

Authors

Kedar Tatwawadi, Stanford University, United States
Mikel Hernaez, Stanford University, United States
Idoia Ochoa, Stanford University, United States
Tsachy Weissman, Stanford University, United States