AT11 – Search! « ECCB2016

Application talk

Abstract

Search! Search has become one of the essential technologies of daily life. For scientists, it can bring not only new or additional knowledge of interest but also directions for further exploration. In this talk, I will present several search services that EMBL-EBI provides to scientists: EBI Search (1), sequence analysis services via the Job Dispatcher framework (2), and system monitoring, including how these services are used in practice.
The European Bioinformatics Institute (EMBL-EBI) hosts data from life science experiments comprising assembled genomes; nucleotide sequences; protein sequences; macromolecular structures; small (‘drug-like’) molecules; gene expression; molecular interactions; reactions, pathways and diseases; protein families; enzymes; literature; and samples and ontologies. These represent discrete categories, each containing one or more specialised data resources that are curated and annotated by experts from around the world. One of the big challenges is how to give users a simple, uniform way to access the vast amount of biological data held in heterogeneous forms.
EBI Search (1) is a full-text and meta-information search engine dedicated to biological data. It indexes more than 100 discrete biological datasets (~1.1 billion entries) held at EMBL-EBI as well as outside it. The search engine allows users to query and review results using faceted navigation and filters based on common fields. It enables users to navigate related entries in different categories of resources using cross-reference searching; cross-references can be implicitly declared in EBI Search by the data provider or inferred by the system. Further human-friendly presentation of results is obtained via gene-centric summaries at the top of search results, which are generated using the EBI Search RESTful API and other resources’ Web services.
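The faceted querying described above can be sketched against the RESTful API roughly as follows. This is a minimal illustration, not the service's reference client: the base URL, the parameter names (query, fields, size, facetcount, facets), the domain and field names, and the shape of the JSON response are assumptions to be checked against the current EBI Search documentation. The sample payload is hand-written for illustration and is not real data.

```python
from urllib.parse import urlencode

# Assumed base URL of the EBI Search REST interface (not stated in the abstract).
BASE_URL = "https://www.ebi.ac.uk/ebisearch/ws/rest"


def build_search_url(domain, query, fields=(), size=10, facetcount=0, facets=None):
    """Build a search URL for one domain; parameter names are assumptions."""
    params = {"query": query, "format": "json", "size": size}
    if fields:
        params["fields"] = ",".join(fields)
    if facetcount:
        # Ask the service to return up to `facetcount` values per facet.
        params["facetcount"] = facetcount
    if facets:
        # Restrict results to selected facet values, e.g. {"TAXONOMY": "9606"}.
        params["facets"] = ",".join(f"{k}:{v}" for k, v in facets.items())
    return f"{BASE_URL}/{domain}?{urlencode(params)}"


def facet_counts(response, facet_id):
    """Return {label: count} for one facet of a parsed JSON response."""
    for facet in response.get("facets", []):
        if facet.get("id") == facet_id:
            return {v["label"]: v["count"] for v in facet.get("facetValues", [])}
    return {}


# Hand-written, illustrative payload in the assumed response shape (not real data).
SAMPLE_RESPONSE = {
    "hitCount": 3,
    "facets": [
        {
            "id": "TAXONOMY",
            "label": "Organisms",
            "facetValues": [
                {"label": "Homo sapiens", "value": "9606", "count": 2},
                {"label": "Mus musculus", "value": "10090", "count": 1},
            ],
        }
    ],
}

if __name__ == "__main__":
    print(build_search_url("uniprot", "kinase", fields=["id", "name"], facetcount=5))
    print(facet_counts(SAMPLE_RESPONSE, "TAXONOMY"))
```

Keeping URL construction separate from response parsing means the same helpers can serve both interactive exploration and a scripted workflow, with the HTTP call itself left to whichever client library is preferred.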
The full range of features that the search engine provides to Web interface users is also available through its RESTful API. The API can be used to integrate search functionality into other applications, such as the portals RNAcentral (http://www.rnacentral.org/) and Omics Discovery Index (https://www.ebi.ac.uk/Tools/omicsdi), and to build analytical workflows by combining it with other Web services. The search engine is built on the Apache Lucene library (http://lucene.apache.org), is constantly updated with new data and is under continuous review by scientists as well as specialists in web usability and design. In 2010, EBI Search had 400 million entries. During 2014 it surpassed one billion entries, which are accessible over the web as well as programmatically through RESTful and SOAP Web Services (the SOAP services have been announced for phase-out). During 2015 EBI Search was used by more than 366 000 unique Internet Protocol (IP) addresses, which generated 347 million requests.
Since 2009 the EMBL-EBI Job Dispatcher framework (2) has provided free access to a range of mainstream sequence analysis applications. These include sequence similarity search services (https://www.ebi.ac.uk/Tools/sss/) such as BLAST, FASTA and PSI-Search, multiple sequence alignment tools (https://www.ebi.ac.uk/Tools/msa/) such as Clustal Omega, MAFFT and T-Coffee, and other sequence analysis tools (https://www.ebi.ac.uk/Tools/pfa/) such as InterProScan. Through these services users can search mainstream sequence databases such as ENA, UniProt and Ensembl Genomes, either interactively via a uniform web interface or systematically through Web Services interfaces (https://www.ebi.ac.uk/Tools/webservices/) using common programming languages, and obtain enriched results with novel visualizations. The framework is used by academic and industry scientists, and in 2014 handled roughly 110 million analysis jobs, up from 65 million in 2013.
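Programmatic access to an analysis job typically follows a submit, poll, fetch pattern. The sketch below illustrates that pattern under stated assumptions: the base URL, the run/status/result path layout, the status strings and the "out" result type are not given in this abstract and should be checked against the Web Services documentation linked above. The helper names run_job and urllib_http are hypothetical.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL of the Job Dispatcher REST services (not stated in the abstract).
BASE_URL = "https://www.ebi.ac.uk/Tools/services/rest"

# Assumed status strings for a job that is still in progress.
IN_PROGRESS = ("RUNNING", "PENDING", "QUEUED")


def urllib_http(method, url, params):
    """Small transport: form-encoded body for POST, plain GET otherwise."""
    data = urlencode(params).encode() if params else None
    with urlopen(Request(url, data=data, method=method)) as resp:
        return resp.read().decode()


def run_job(tool, params, http=urllib_http, result_type="out",
            poll_seconds=5, max_polls=120, sleep=time.sleep):
    """Submit a job, poll until it finishes, then return one result as text."""
    job_id = http("POST", f"{BASE_URL}/{tool}/run", params).strip()
    for _ in range(max_polls):
        status = http("GET", f"{BASE_URL}/{tool}/status/{job_id}", None).strip()
        if status == "FINISHED":
            return http("GET", f"{BASE_URL}/{tool}/result/{job_id}/{result_type}", None)
        if status not in IN_PROGRESS:
            # Any other status is treated as a terminal failure.
            raise RuntimeError(f"job {job_id} ended with status {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The transport and the sleep function are passed in as arguments so that the polling logic can be exercised without network access; a real call would supply the tool name and the parameters that the chosen tool requires.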
Help pages, tutorials and user guides (available as protocols) are provided, together with training courses and helpdesk support. Continued feedback from the biological community, collaboration with bioinformatics tool and data providers, and comprehensive metrics analysis help to drive improvements to the accessibility and quality of the services. Links between these platforms allow relationships between data entities to be discovered. For example, the result of a BLAST search is automatically populated with links to entries in other databases that reference the sequences found. Likewise, results from EBI Search can be used as input to further analysis tools or data searches.
Last but not least, search technologies are also becoming important in other aspects of the life of a major bioinformatics centre, for example real-time monitoring of services via log analysis with the Elastic Stack (Elasticsearch, Logstash and Kibana). We will demonstrate how we are using this technology to improve our systems and make them more robust, and explore possible future applications.
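As an illustration of the kind of log analysis mentioned above, the sketch below builds an Elasticsearch aggregation request that counts server errors per time interval and converts the response buckets into error rates. It is not the monitoring setup used at EMBL-EBI. The field names ("@timestamp" for the event time, "response" for a numeric HTTP status code), the time window and the function names are assumptions, and the fixed_interval option assumes a reasonably recent Elasticsearch version.

```python
def error_rate_query(window="now-1h", interval="1m",
                     time_field="@timestamp", status_field="response"):
    """Build a search body: requests per interval, with a count of 5xx responses."""
    return {
        "size": 0,  # only aggregations are needed, not the matching log lines
        "query": {"range": {time_field: {"gte": window}}},
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": time_field, "fixed_interval": interval},
                "aggs": {
                    # Sub-aggregation: documents in this bucket with status >= 500.
                    "server_errors": {
                        "filter": {"range": {status_field: {"gte": 500}}}
                    }
                },
            }
        },
    }


def error_rates(es_response):
    """Return [(timestamp, fraction of 5xx responses)] from the search response."""
    rates = []
    for bucket in es_response["aggregations"]["per_interval"]["buckets"]:
        total = bucket["doc_count"]
        errors = bucket["server_errors"]["doc_count"]
        rates.append((bucket["key_as_string"], errors / total if total else 0.0))
    return rates
```

The request body would be sent to the search endpoint of the index holding the parsed access logs; the resulting rates could then feed a dashboard or an alert threshold.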

References:

(1) Squizzato S, Park YM, Buso N, Gur T, Cowley A, Li W, Uludag M, Pundir S, Cham JA, McWilliam H, Lopez R. The EBI Search engine: providing search and retrieval functionality for biological data from EMBL-EBI. Nucleic Acids Res (2015) Web Server Issue. DOI: 10.1093/nar/gkv316
(2) Li W, Cowley A, Uludag M, Gur T, McWilliam H, Squizzato S, Park YM, Buso N, Lopez R. The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res (2015) Web Server Issue. DOI: 10.1093/nar/gkv27

Authors

Young Mi Park, EMBL-EBI, United Kingdom
Andrew Cowley, EMBL-EBI, United Kingdom
Weizhong Li, EMBL-EBI, United Kingdom
Nicola Buso, EMBL-EBI, United Kingdom
Szymon Chojnacki, EMBL-EBI, United Kingdom
Tamer Gur, EMBL-EBI, United Kingdom
Silvano Squizzato, EMBL-EBI, United Kingdom
Rodrigo Lopez, EMBL-EBI, United Kingdom