Application talk
Abstract
Introduction
The ongoing shift of next-generation DNA sequencing (NGS) technologies towards a production-grade clinical environment significantly increases the pressure on the computational infrastructure of an organization, both in terms of the initial investment and the operational cost of the infrastructure. To address this issue, researchers have called for cloud-based solutions to the computational challenges in genomics [Stein10]. However, there is general reluctance to use such cloud-based solutions, due to security concerns given the strict regulations surrounding patient information management. In addition, cloud-based solutions increase the complexity of using a computational infrastructure, putting an even bigger management burden on the clinic. In this presentation, we discuss a scalable, accelerated and secure private-cloud concept dedicated to the genomics domain that addresses these security concerns while reducing the complexity of using a scalable infrastructure. The platform provides seamless scalability as well as high-performance accelerated hardware. We also discuss a case study with the University Medical Center Utrecht (UMCU) in the Netherlands that shows the advantages of using such a cloud infrastructure in practice.
Cloud-based genomics infrastructure
We propose a private cloud-based genomics analysis platform, where users can seamlessly connect their wet-lab sequencing infrastructure to a choice of geographically localized compute centers in a highly secured way. To ensure maximum regulatory compliance, all NGS data is processed and stored in the user's region of operation: no European data is processed in the US, and data does not leave its country of origin. At any point in time, users can monitor data transfer and processing progress, and administrators can define user roles, including users' processing and access rights. Researchers and clinical users can configure pipelines and workflows and create reports. All data is encrypted both in transit and at rest and is tamper-proof. Any action on any portion of the NGS data or the processing pipelines becomes part of a compliant audit trail, enabling full transparency for security audit purposes.
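As a minimal illustration of how such a tamper-evident audit trail can be built (a sketch of the general technique, not Bluebee's actual implementation), each log entry can embed a hash of the previous entry, so that any retroactive modification invalidates every later entry in the chain:

```python
import hashlib
import json

def _entry_hash(actor, action, prev_hash):
    """Canonical hash of an entry's content plus its predecessor's hash."""
    payload = json.dumps({"actor": actor, "action": action, "prev": prev_hash},
                         sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_entry(log, actor, action):
    """Append an audit entry chained to the hash of the previous one."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    entry = {"actor": actor, "action": action, "prev": prev_hash,
             "hash": _entry_hash(actor, action, prev_hash)}
    log.append(entry)
    return entry

def verify(log):
    """Re-derive every hash; an edited entry breaks the chain from there on."""
    prev_hash = "0" * 64
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(entry["actor"], entry["action"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, "alice", "upload sample_001.fastq.gz")
append_entry(log, "bob", "start pipeline bwa-mem")
assert verify(log)
log[0]["action"] = "upload something_else.fastq.gz"  # simulated tampering
assert not verify(log)
```

A production audit trail would additionally timestamp and persist entries in append-only storage, but the hash chain is what makes after-the-fact edits detectable.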
Connecting to the cloud
After a verification process, users are approved and entitled to install an application called the Bluebee Service Connector on their computers. The Connector uniquely identifies the user and facilitates end-to-end process auditing. It also streams data coming from the user's sequencer and initiates the transfer of NGS data in an automated way. Additionally, the Connector compresses and encrypts the raw sequence data and drastically accelerates its transfer to the cloud. In addition to the Connector, we provide a web interface that allows the user to connect to and administer their private section of the compute cloud. The web interface allows users to monitor all ongoing activities on the platform, including data transfers in progress; running pipelines, their progress and expected completion time; the status of reference data; and users logged in and their current activity. Each user on the platform has a security profile, which granularly determines access rights to specific data. This gives users several ways to share and organize access to the data they are entitled to, thereby enabling collaboration with other users.
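The compress-and-transfer step can be sketched in a simplified form (this is an illustration of the general approach, not the Connector's actual code): compress the raw sequence data, split it into chunks suitable for parallel or resumable upload, and checksum each chunk so the receiving side can verify integrity. A real connector would also encrypt each chunk, which is omitted here:

```python
import hashlib
import zlib

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB chunk size

def prepare_chunks(data: bytes):
    """Compress raw sequence data, split into fixed-size chunks, and
    attach a SHA-256 digest per chunk for end-to-end integrity checks."""
    compressed = zlib.compress(data, level=6)
    chunks = [compressed[i:i + CHUNK_SIZE]
              for i in range(0, len(compressed), CHUNK_SIZE)]
    return [(chunk, hashlib.sha256(chunk).hexdigest()) for chunk in chunks]

def reassemble(chunks):
    """Receiver side: verify every digest, then decompress."""
    for chunk, digest in chunks:
        assert hashlib.sha256(chunk).hexdigest() == digest, "corrupted chunk"
    return zlib.decompress(b"".join(chunk for chunk, _ in chunks))

# FASTQ-like data compresses well, which is one reason compressing
# before transfer pays off for sequencer output.
raw = b"@read1\nACGTACGT\n+\nFFFFFFFF\n" * 100000
assert reassemble(prepare_chunks(raw)) == raw
```

Chunking is what enables parallel streams and restartable transfers, which is where most of the transfer acceleration for multi-hundred-gigabyte FASTQ files would come from.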
The cloud infrastructure
Upon arrival in the cloud, all patient-related data is “signed” and monitored for authenticity. Any tampering with the data through circumvention of the standard access methods triggers alerts. Additionally, all user or system interactions (through APIs) with any of the data are subject to audit logging. The compute infrastructure allows users to trigger pipelines and workflows as soon as transferred data arrives at the compute center. Pipelines can be standardized (such as the Broad Best Practices), tailor-made, or lifted over from the user's organization. Users can also bring their own components, scripts or reference data to be used in the high-performance processing runs. In the compute centers, high-performance processing of NGS data is obtained through the combination of two techniques: acceleration and distribution. We have accelerated several standard pipeline components by implementing standardized and generally accepted algorithms on specific co-processors (FPGA, GPU, …). These implementations allow for substantial speedups of specific parts of the pipelines. Moreover, we provide distributed implementations of several pipeline algorithms. Combined, both techniques enable large-scale, distributed, accelerated and elastic processing of genome data.
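The distribution technique follows the usual scatter/gather pattern for genomic data: split the genome into independent regions, process each region concurrently, and merge the per-region results back in genomic order. The sketch below illustrates the pattern with a stand-in worker function (a real pipeline would invoke a variant caller restricted to each interval, e.g. via GATK's -L option); it is not the platform's actual distribution engine:

```python
from concurrent.futures import ThreadPoolExecutor

def call_variants_in_region(region):
    """Stand-in per-region worker; regions on different chromosomes or
    non-overlapping intervals can be processed independently."""
    chrom, start, end = region
    return f"variants for {chrom}:{start}-{end}"

def scatter_gather(regions, max_workers=8):
    """Scatter: run independent genomic regions concurrently.
    Gather: merge results back in the original genomic order
    (pool.map preserves input order, so the merge is trivial)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_variants_in_region, regions))

regions = [("chr1", 1, 50000000),
           ("chr1", 50000001, 100000000),
           ("chr2", 1, 50000000)]
merged = scatter_gather(regions)
assert merged[0] == "variants for chr1:1-50000000"
```

In a distributed setting the workers would run on separate nodes rather than threads, but the split/process/merge structure is the same; the quality of the split (balanced regions that do not cut through relevant features) determines how well the speedup scales.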
UMCU case study
UMCU developed a multistage computational pipeline [Cuppen15] using state-of-the-art, community-accepted tools to aid oncologists in the experimental treatment of cancer patients, and clinical geneticists in the diagnostics of congenital disease (see Figure 1). The pipeline contains the following tools, listed with their computation runtime on a single compute node:
– Mapping (BWA-MEM), sorting and duplicate marking (sambamba): 10 days
– Realignment and variant calling (GATK): 5 days
– Somatic calling (freebayes, strelka, varscan): 3 days
The reported runtimes are based on an Illumina HiSeq X Ten tumor/normal dataset at 90x/30x coverage, derived from NA24385 with a 30% spike-in of NA12878. The total compressed FASTQ input size is 356 Gbytes (266 Gbases). A number of optimizations were performed on the pipeline to enable effective utilization of a scalable cloud infrastructure, resulting in a drastic reduction of the runtime. Using standard settings and techniques (such as multi-threading and GATK Queue), the UMCU cluster implementation reduces the computation time from 18 days to 5 days (a 3.6x speedup). Bluebee performs several additional optimizations tailored to the specific cloud infrastructure:
– Parallelization of specific steps
– Optimization of file storage locations
– Streaming of data between nodes
– Smart scatter/gather of genomic data
These techniques reduce the total runtime to 2 days (a 9x speedup). This speedup is obtained without any compromise to the analysis approach and has no effect on any of the quality metrics of the analysis.
References
[Stein10] Lincoln D. Stein, The case for cloud computing in genome informatics, Genome Biology, 2010, 11:207, DOI: 10.1186/gb-2010-11-5-207
[Cuppen15] https://github.com/CuppenResearch/IAP
Authors
Zaid Al-Ars, Delft University of Technology, Netherlands
Vlad-Mihai Sima, Bluebee, Netherlands
Kurt Florus, Bluebee, Netherlands
