Tutorial 2: Use of semantic web resources in computational biology and bioinformatics

Presenters

Paolo Romano

Paolo Romano, PhD, Bioengineer, is a Senior Scientist in bioinformatics at the National Cancer Research Institute of Genoa, Italy. His research interests include biomedical data management, network standards and tools, data integration, interoperability, ontologies, and semantic methods and tools. He designed and contributed to the development of the Cell Line Data Base and its associated hypertext HyperCLDB, and he designed and developed Biowep, the Workflow Enactment Portal for Bioinformatics.

 

Andrea Splendiani

Andrea Splendiani, PhD, is a Senior Scientist in data integration at Rothamsted Research (BBSRC), Harpenden, UK. He has previous experience in microarray database design and standards (University of Milano-Bicocca and Genopolis consortium, Italy), in systems biology (Institut Pasteur, France) and in medical informatics (University of Rennes 1 – now INSERM U936, France). His research interests include the standardization of biological data, in particular pathway information, biological data integration, and the development of interactive systems to access and analyse biological information.

Motivation

The Semantic Web is a set of technologies, or a framework, which is designed to make data integration possible via the web, with the addition of a precise semantic characterization of entities and relations (ontologies). These technologies are designed to make the task of data integration easier, more precise, more flexible and less expensive. As data integration is a pre-requisite for systems biology and translational research, the Semantic Web can bring relevant benefits in these areas.

This tutorial is important because courses that teach Semantic Web technologies are still scarce. The Semantic Web is based on a set of standard technologies (XML, web architecture) but introduces new ways to use them (RDF, SPARQL). The resulting lack of know-how often leads to a premature dismissal of Semantic Web technologies, a mistaken perception of their complexity, or improper usage.

Goals

The aim of this tutorial is to briefly introduce the key principles needed to understand what it means to represent information on the Semantic Web, and then to give attendees the basic hands-on skills to start using the biomedical information resources that are now available in this framework. With this tutorial we aim to help interested researchers become productive quickly and avoid the most common pitfalls in Semantic Web based data integration.

Prerequisites

Knowledge of the main biological databases; programming skills are a plus.

Tutorial outline

Morning session (90'+90')

  • Introduction to the basic principles of Semantic Web based representation of biological information (1h, theory). In this hour we will introduce the basics of RDF and focus on the importance of URIs, shared relations and the implications of the open world assumption. We will invite the participants to think of RDF as a conceptual model rather than as XML: what matters is being precise about “what” we are referring to and about which predicates we use. We will show that the Semantic Web is easy.
  • Introduction to relevant biomedical resources, including UniProt, BioPortal and Pathway Commons (1h, theory). In this hour we will briefly introduce a few of the main biomedical resources that are available on the Semantic Web. The objective is to place what we have presented before in the context of the biomedical domain, and to provide some resources which will be used in the following examples and exercises. At the same time, given the widespread use of these resources, this introduction is likely to enable participants to quickly adopt them via the Semantic Web in their daily work. (Note: this lesson can easily accommodate a break, since the resources are presented separately and do not require a continuous flow of attention.)
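
As a minimal sketch of the conceptual model described in the first item above, the following Python snippet represents RDF statements as plain (subject, predicate, object) tuples rather than as XML. The UniProt URI follows the purl.uniprot.org pattern and rdfs:label is a standard term; the two "sources", the `http://example.org/` predicate and the cell line URI are hypothetical and chosen only for illustration.

```python
# RDF seen as a conceptual model: a graph is just a set of
# (subject, predicate, object) statements, independent of any XML syntax.

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
P53 = "http://purl.uniprot.org/uniprot/P04637"       # UniProt-style URI for human p53
STUDIED_IN = "http://example.org/vocab/studiedIn"    # hypothetical predicate
HELA = "http://example.org/cell_line/HeLa"           # hypothetical entity URI

# Statements published by two independent (hypothetical) sources.
source_a = {(P53, RDFS_LABEL, "Cellular tumor antigen p53")}
source_b = {(P53, STUDIED_IN, HELA)}

# Both sources use the same URI for the protein, so merging the
# graphs is a plain set union: no schema alignment is needed.
merged = source_a | source_b


def describe(graph, subject):
    """Return all (predicate, object) pairs stated about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


def is_stated(graph, triple):
    """True if the triple is asserted in the graph.

    Under the open world assumption, False only means
    'not stated here', never 'known to be false'.
    """
    return triple in graph


for predicate, obj in describe(merged, P53):
    print(predicate, "->", obj)
```

The point of the sketch is that precision lies in the choice of URIs and predicates: the two sources can be combined only because they refer to the protein with the same identifier.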

The remainder of the tutorial will introduce technologies through a very simple hands-on use case which will guide the participants through: exporting data on the Semantic Web, integrating this data with other existing biomedical resources and querying the resulting integrated (distributed) knowledge-base.

  • Introduction to the simple examples and explanation of how several technologies fit together (15')
  • Introduction to the D2RQ relational mapping system (45'). We will show how to generate an automatic mapping between a relational database and RDF via the D2RQ tool. We will then show how to tune this mapping to better represent URIs and relations, and then we will show how to open a SPARQL endpoint via D2RQ.
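
The idea behind a relational-to-RDF mapping can be sketched in a few lines of Python. This is not D2RQ itself and does not use its mapping language; it is only an illustration, under stated assumptions, of turning rows into URIs and columns into predicates. The `cell_line` table, the `http://example.org/` base URI and the predicate naming scheme are all hypothetical.

```python
import sqlite3

BASE = "http://example.org/"   # hypothetical base URI for the exported data

# A tiny in-memory relational table standing in for a laboratory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cell_line (id INTEGER PRIMARY KEY, name TEXT, species TEXT)"
)
conn.executemany(
    "INSERT INTO cell_line (id, name, species) VALUES (?, ?, ?)",
    [(1, "HeLa", "Homo sapiens"), (2, "CHO", "Cricetulus griseus")],
)


def rows_to_triples(connection, table, key, columns):
    """Map each row to a subject URI built from its primary key, and each
    listed column to a predicate URI, yielding (s, p, o) triples."""
    column_list = ", ".join([key] + columns)
    # Table and column names are fixed in this script, not user input.
    query = f"SELECT {column_list} FROM {table} ORDER BY {key}"
    for row in connection.execute(query):
        subject = f"{BASE}{table}/{row[0]}"
        for column, value in zip(columns, row[1:]):
            if value is not None:      # a NULL cell produces no statement
                yield (subject, f"{BASE}vocab/{table}_{column}", value)


triples = list(rows_to_triples(conn, "cell_line", "id", ["name", "species"]))
for triple in triples:
    print(triple)
```

An automatically generated mapping of this kind is only a starting point. Tuning it means replacing the generated subject and predicate URIs with ones that other resources already use, which is what makes later integration possible.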

Afternoon session (90'+90')

  • Practical example on how to use D2RQ (1h). Participants who have laptops or are more proficient in programming will be given an extended example and invited to practice hands-on, while we present a subset of this example in detail to the audience. We will make this section very interactive by inviting participants to propose how they would conceptually map their own information to RDF, which will help them refine their understanding of the language.
  • Triplestores (30'). We will introduce other triplestores, which can complement D2RQ in a production environment. We will briefly present their pros and cons and provide tips on how to install and operate them.
  • Introduction to the SPARQL query language (45'). How to query D2RQ or any triplestore providing a SPARQL endpoint: examples of the most common and useful SPARQL constructs.
  • Practical example on performing queries with SPARQL. We will begin by presenting a list of queries of increasing complexity, which will involve the mapping realized in the previous example (via D2RQ) and the biomedical resources presented in the morning. We will explain how to express these queries in SPARQL, and we will show how the proper definition of URIs for entities and relations is the basis for data integration, which is then realized automatically at query time. This will refer back to what we have presented in the introduction and to what the participants have practiced in the previous session.
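
To illustrate why shared URIs make integration happen at query time, here is a minimal Python sketch of how a SPARQL basic graph pattern is evaluated: each triple pattern is matched against the graph, and patterns are joined through their shared variables. It is a toy matcher, not a SPARQL engine, and the local data, the `expresses` predicate and the `http://example.org/` URIs are hypothetical.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
P53 = "http://purl.uniprot.org/uniprot/P04637"      # UniProt-style URI
EXPRESSES = "http://example.org/vocab/expresses"    # hypothetical predicate
LINE_1 = "http://example.org/cell_line/1"           # hypothetical local entity

# Local data (e.g. exported from a relational database) and data from a
# hypothetical public resource. They are linked only by the shared URI.
local_graph = {(LINE_1, EXPRESSES, P53)}
public_graph = {(P53, RDFS_LABEL, "Cellular tumor antigen p53")}
graph = local_graph | public_graph


def match_pattern(pattern, triple, binding):
    """Try to extend a binding so that the pattern equals the triple.
    Terms starting with '?' are variables. Returns a new dict or None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None     # variable already bound to another value
        elif term != value:
            return None         # constant term does not match
    return result


def select(graph, patterns):
    """Evaluate a list of triple patterns as a join and return every
    consistent variable binding."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in graph:
                candidate = match_pattern(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


# Roughly equivalent SPARQL (prefix declarations omitted):
#   SELECT ?line ?name WHERE {
#     ?line    ex:expresses ?protein .
#     ?protein rdfs:label   ?name .
#   }
results = select(graph, [
    ("?line", EXPRESSES, "?protein"),
    ("?protein", RDFS_LABEL, "?name"),
])
for row in results:
    print(row["?line"], row["?name"])
```

The join on `?protein` succeeds only because both graphs identify the protein with the same URI; had the local data used a private identifier, the query would return nothing.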

Practical issues

Participants who have laptops and are more proficient in programming will have the chance to explore the exercises a little further (we will leave some slightly more challenging exercises for after the course). However, the course is designed to be followed as a lecture-style presentation, at times very interactive.

Additional information

For more information, please contact andrea dot splendiani @ bbsrc dot ac dot uk