T1 – Scalable data analysis with Apache Spark and Hadoop

Tutorial details

Date: Saturday September 3, 2016
Time: 9:00 – 12:30 (half day tutorial)
Venue: World Forum, room: Europe 1

Tutors

Summary

In this tutorial you will be introduced to the Apache Hadoop and Spark frameworks for processing big data. These frameworks offer a novel way of creating data analysis applications that scale easily from hundreds to thousands of machines. This data-parallel approach was pioneered in industry by tech companies such as Google and Facebook, and applies equally well to many scientific workloads. We will introduce you to the key concepts and features of the Apache Hadoop and Spark stacks. In addition, you will work on hands-on Spark exercises in a Jupyter notebook environment. The exercises and demos provide a basic understanding of Spark and demonstrate its applicability to bioinformatics applications such as sequence alignment and variant calling with ADAM, and running BLAST on Hadoop.

More information

Target audience

Anyone who would like to get started with Apache Spark and Hadoop to build robust and scalable applications. You should be familiar with the basics of programming (preferably Python) and the Unix command line. Most bioinformaticians and scientific programmers will feel right at home.

Requirements

Participants should bring their own laptops.

Schedule

Time   Subject
9:00   General introduction
9:15   Introduction to Hadoop & Spark
10:00  Hands-on applications on Hadoop
10:30  Tea/coffee break
11:00  Hands-on Spark applications
11:30  Step-by-step bioinformatics tool(s) on Spark/Hadoop
12:15  Wrap-up & questions
12:30  End of program