ECCB'14 - T04 - Multivariate Projection Methodologies for Big Data and Application in R Using the mixOmics Package

T04 - Multivariate projection methodologies for the exploration of large biological data sets. Application in R using the mixOmics package

The objective of this tutorial is to introduce the fundamental concepts behind projection-based approaches and illustrate their application on some exemplar studies using the R package mixOmics.

Multivariate projection approaches are useful exploratory tools to get a first understanding of large and complex data sets. They are computationally efficient on large data sets and can also address complex biological questions. Such approaches include Principal Component Analysis (PCA, Jolliffe 2002) and its variants, Partial Least Squares regression (PLS, Wold 2001), PLS-Discriminant Analysis (PLS-DA) and Canonical Correlation Analysis (CCA, Hotelling 1936). They reduce the dimension of the data by projecting them into a smaller subspace. Recent developments have proposed so-called ‘sparse’ approaches, which include Lasso penalisations to allow variable selection (Tibshirani 1996).
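As a rough illustration of what projecting data into a smaller subspace means, the sketch below performs PCA in base R through a singular value decomposition of the centred data matrix. The simulated matrix X, its dimensions and the choice of two components are assumptions made for illustration only; this is not meant to reproduce how mixOmics implements PCA.

```r
set.seed(42)

# Simulated data (assumption): 20 samples (rows) measured on 100 variables (columns)
n <- 20
p <- 100
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Centre each variable; PCA is defined on centred data
Xc <- scale(X, center = TRUE, scale = FALSE)

# Singular value decomposition: Xc = U D V'
sv <- svd(Xc)

# Keep the first two components
ncomp <- 2
loadings <- sv$v[, 1:ncomp]   # p x 2 matrix of variable weights
scores <- Xc %*% loadings     # n x 2 coordinates of the samples in the subspace

# Proportion of variance explained by each retained component
explained <- sv$d[1:ncomp]^2 / sum(sv$d^2)

# Each sample is now described by 2 components instead of 100 variables
dim(scores)
```

Plotting the two columns of scores against each other gives the kind of sample plot covered in the tutorial.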

PCA is the oldest and most popular multivariate technique, but often little is known about how it is solved and what its limitations are. More sophisticated approaches like PLS and CCA have recently been extended to deal with high-dimensional data (sparse PLS, regularized CCA) and have been shown to yield biologically meaningful results in many studies. Unlike PCA, PLS and CCA enable the integration of two types of data sets.
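To make the idea of integrating two data sets more concrete, here is a minimal base R sketch of a first pair of PLS components. It relies on the standard result that the first PLS weight vectors are the leading singular vectors of the cross-product matrix between the two centred data sets. The simulated matrices X and Y and the shared latent signal are assumptions for illustration; the sparse and regularized variants in mixOmics go further than this.

```r
set.seed(1)

# Simulated data (assumption): the same 30 samples measured on two sets of variables
n <- 30
latent <- rnorm(n)                      # hidden signal shared by both data sets
X <- matrix(rnorm(n * 50), nrow = n)    # e.g. 50 transcripts
Y <- matrix(rnorm(n * 10), nrow = n)    # e.g. 10 clinical variables
X[, 1:5] <- X[, 1:5] + latent           # first 5 X variables carry the signal
Y[, 1:3] <- Y[, 1:3] + latent           # first 3 Y variables carry the signal

# Centre both data sets
Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- scale(Y, center = TRUE, scale = FALSE)

# Leading singular vectors of the cross-product matrix Xc'Yc
sv <- svd(crossprod(Xc, Yc), nu = 1, nv = 1)
u <- sv$u   # weights for the X variables (50 x 1)
v <- sv$v   # weights for the Y variables (10 x 1)

# First pair of components: one per data set, with maximal covariance
t_x <- Xc %*% u
t_y <- Yc %*% v

cov(t_x, t_y)
cor(t_x, t_y)
```

PCA applied to X alone would look for directions of maximal variance within X, whereas here the weights are chosen so that the X component and the Y component covary as much as possible.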

Since 2009, we have implemented many multivariate approaches and their sparse variants in the R package mixOmics for use by the statistical and bioinformatics community. Full tutorials are available on our website: http://perso.math.univ-toulouse.fr/mixomics/

In this tutorial, we will focus on the application of these approaches to medium- and high-throughput biological data (transcriptomics, metabolomics, proteomics data) using PCA, CCA, PLS, PLS-DA and the variants that the mixOmics team and collaborators have developed.

Outline of the one-day tutorial

Concepts behind the key methodologies in mixOmics and their variants

Principal Component Analysis: PCA, sparse PCA, NIPALS
Canonical Correlation Analysis: CCA, regularized CCA
Partial Least Squares regression: PLS, sparse PLS
Partial Least Squares Discriminant Analysis: PLS-DA, sparse PLS-DA

Review of the graphical outputs implemented in mixOmics

Sample plots
Correlation circles
Integrating two data sets: relevance networks and clustered image maps

Case studies and applications

Example with PCA: Nutrimouse
Example with CCA: Multidrug
Example with PLS: Liver toxicity
Example with PLS-DA: SRBCT study

The course will be delivered in the form of lectures followed by practicals in R. Course material will be available.

Date and Venue

Date: Sunday Sept 7, 2014

Venue: FORUM building, Faculté de Médecine, 4, rue Kirschleger, Strasbourg

Time: 9am – 5.30pm (registration starts from 8 am)

Target Audience

Postgraduate students, postdoctoral fellows and researchers with good statistical knowledge who need to

-explore large data sets

-use graphical techniques to better visualize data

-understand and/or apply multivariate projection methodologies to large data sets.

Prerequisites and requirements

We expect the audience to have a good working knowledge of R.

Attendees are requested to bring their own laptops with RStudio (http://www.rstudio.com/) and the mixOmics package already installed.
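One possible way to check the installation before the tutorial is sketched below. The helper is_installed is a hypothetical name introduced here for illustration, not part of mixOmics; it only relies on base R functions.

```r
# Hypothetical helper: TRUE if the package can be loaded on this machine, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (is_installed("mixOmics")) {
  message("mixOmics version ", as.character(packageVersion("mixOmics")), " is installed.")
} else {
  message("mixOmics is not installed yet; please install it before the tutorial.")
}
```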

Presenters

The presenters are all key developers of mixOmics:

Dr Kim-Anh Lê Cao (The University of Queensland Diamantina Institute, Brisbane, Australia, k.lecao[at]uq.edu.au). Kim-Anh is a biostatistics researcher at the University of Queensland, Brisbane, Australia. Her institute has a particular focus on severe and chronic diseases such as cancer and diseases involving the immune system, including arthritis, chronic infections, and diabetes. Together with the mixOmics team, Kim-Anh continues to develop methodologies to analyse complex biological studies.

Dr Sébastien Déjean (Institut de Mathématiques de Toulouse, Université de Toulouse, France, sebastien.dejean[at]math.univ-toulouse.fr). Sébastien is a statistician and research engineer at the Université de Toulouse. Through his research support activities, he contributes to various projects, particularly in the fields of high-throughput biology and information retrieval systems.

Dr Ignacio González (Institut de Mathématiques de Toulouse, Université de Toulouse, Institut National de la Recherche Agronomique, France). Ignacio works at the Plateforme de bioinformatique et biostatistique de Toulouse. He has worked in several wet laboratories (INSERM, INRA, CNRS, INSA), where he provided statistical support. He has considerable experience in analyzing a vast range of biological data.

Contact

Any enquiries about the tutorial? Contact the mixOmics team: mixomics[at]math.univ-toulouse.fr (replace [at] by @)

Related literature

González I., Lê Cao K.-A., Davis, M.D. and Déjean S. (2013) Insightful graphical outputs to explore relationships between two ‘omics’ data sets. BioData Mining 5:19.

Yao F., Coquery J., Lê Cao K.-A. (2012) Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets, BMC Bioinformatics 13:24.

Lê Cao K.-A., Boitard S. and Besse P. (2011) Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems, BMC Bioinformatics, 12:253.

Lê Cao K.-A., González I. and Déjean S. (2009) integrOmics: an R package to unravel relationships between two omics data sets. Bioinformatics, 25(21):2855-2856. NOTE: the package ‘integrOmics’ has been renamed ‘mixOmics’.

González I., Déjean S., Martin P. and Baccini A. (2008) CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software, 23(12).

González I., Déjean S., Martin P.G.P., Gonçalves O., Besse P. and Baccini A. (2009) Highlighting Relationships Between Heterogeneous Biological Data Through Graphical Displays Based On Regularized Canonical Correlation Analysis. Journal of Biological Systems 17(2), pp 173-199.

Lê Cao K.-A., Martin P.G.P., Robert-Granié C. and Besse P. (2009) Sparse Canonical Methods for Biological Data Integration: application to a cross-platform study. BMC Bioinformatics, 10:34.

Yergeau E., Schoondermark-Stolk S.A., Brodie E.L., Déjean S., DeSantis T.Z., Gonçalves O., Piceno Y.M., Andersen G.L. and Kowalchuk G.A. (2009) Environmental microarray analyses of Antarctic soil microbial communities. The International Society for Microbial Ecology Journal, 3(3), pp 340-351.

Lê Cao K.-A., Rossouw D., Robert-Granié C. and Besse P. (2008) A Sparse PLS for Variable Selection when Integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7(1), Article 35.

Combes S., González I., Déjean S., Baccini A., Jehl N., Juin H., Cauquil L., Gabinaud B., Lebas F. and Larzul C. (2008) Relationships between sensorial and physicochemical measurements in meat of rabbit from three different breeding systems using canonical correlation analysis. Meat Science 3, pp 835-841.