Semantic e-Science

Rapid advances in technology are transforming the way scientific research is performed. Data analysis and storage have moved from a paper-based, manual affair to an activity in which computers are vital. As a result, a vast amount of scientific data is collected or produced by computational equipment every day. No single research organization has enough resources to collect everything; hence data gathering and archiving are distributed across different places. Nor does any single research group have the computational power to process all these data. Moreover, collaboration among scientists from different institutions or disciplines is often necessary to apply a spectrum of methods and models to analyze and process this deluge of information, and the ability to access and reuse the datasets, methods, models and results of existing scholarly publications generally makes research more effective and improves its quality.

The development of e-Science is a response to these emerging trends in scientific research. e-Science was originally conceived as the application of computing to traditional science (mostly empirical, although in some cases theoretical as well) in order to support scientists in traditional research activities such as modeling, simulation and prediction, among others. However, e-Science can now be considered to have gone further than that: it is even regarded as a third leg of the scientific method, together with the theoretical and empirical ones, since it introduces a new environment for scientific research that has also led to new research methods that may potentially lead to better science.

Supporting some of the new requirements that arise from this approach to science requires, in some cases, an explicit definition of the meaning of the data in the different domains involved. This is the role that explicit semantics and their associated technologies, models and methods can play, in the context of what is known as Semantic e-Science. That is, while e-Science has traditionally addressed issues of data and computation distribution, interoperation and high performance in traditional and non-traditional scientific research tasks, the main focus of Semantic e-Science is on applying explicit semantics over the e-Science infrastructure to drive more accurate information interpretation, more efficient scientific analyses, and better collaboration among scientists, among other benefits.

Achieving the conservation and reproducibility of computational experiments in e-Science is a multidisciplinary effort in which several aspects have to be considered. Among them, we focus on the conservation and reproduction of the execution environment of in-silico scientific experiments, developing approaches to guarantee that an experiment that can be run today on a computational infrastructure can be run again in the future on an equivalent one. We explore how semantics can be applied to this end, developing ontologies for describing computational infrastructures and tools for reproducing them based on their descriptions. We are also exploring the use of virtualization techniques as a flexible and dynamic way of setting up and managing computational resources on demand.
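
To make the idea of reproducing an infrastructure from its description more concrete, the following is a minimal sketch and not the group's actual tooling. It assumes the description has already been reduced to a plain mapping from each component to the components it depends on; the component names and the `deployment_order` helper are hypothetical. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical description of an execution environment: each component
# maps to the set of components it depends on. Names are examples only.
ENVIRONMENT = {
    "workflow-engine": {"java-runtime", "python"},
    "java-runtime": {"operating-system"},
    "python": {"operating-system"},
    "operating-system": set(),
}


def deployment_order(description):
    """Return the components in an order where dependencies come first."""
    return list(TopologicalSorter(description).static_order())


order = deployment_order(ENVIRONMENT)
for step, component in enumerate(order, start=1):
    print(step, component)
```

A tool that rebuilds an environment, for instance on a freshly created virtual machine, could walk such an order and install each component in turn. In an ontology-based approach the mapping would be derived from the semantic description rather than written by hand.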

Projects

We are currently involved in a European project in this area, DrInventor, which started in January 2014. We are also actively participating in the W3C Community Group on Research Objects for Scholarly Communication and maintaining the researchobject.org site.

Previous projects in this area include Wf4Ever, ADMIRE and OntoGrid, the Marie Curie Initial Training Network SCALUS, and the national project myBigData.

Main results

The work done in this research area has mainly focused on:

  1. The definition of models to describe, in a standard way, scientific experiments by means of workflow-centric Research Objects, which comprise scientific workflows, the provenance of their executions, interconnections between workflows and related resources (e.g., datasets, publications, etc.), and social aspects related to such scientific experiments. This activity also includes the definition of best practices for the creation and management of Research Objects, along with strategies for dealing with workflow decay.
    • rohub.linkeddata.es is the portal where we expose some of our group's Research Objects, associated with their corresponding scientific papers.
    • Belhajjame K, Corcho O, Garijo D, Zhao J, Missier P, Newman DR, Palma R, Bechhofer S, Garcia-Cuesta E, Gómez-Pérez JM, Klyne G, Page K, Roos M, Ruiz JE, Soiland-Reyes S, Verdes-Montenegro L, De Roure D, Goble CA: Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. In Proceedings of the ESWC2012 Workshop on the Future of Scholarly Communication in the Semantic Web (SePublica2012), Heraklion, Greece, May 2012.
  2. The publication of a corpus of provenance traces, compliant with the W3C standard PROV-O, in order to have data available for different types of analyses (derivation of results, completeness, abstraction, error detection during the experiment, etc.):
    • Khalid Belhajjame, Jun Zhao, Daniel Garijo, Aleix Garrido, Stian Soiland-Reyes, Pinar Alper and Oscar Corcho, A Workflow PROV-Corpus based on Taverna and Wings. To be presented in BigPROV13.
  3. Understanding scientific workflows through the use of provenance, in order to improve workflow reuse. By manually analyzing templates and traces, we have identified a set of domain-independent motifs in scientific workflows that could be used to simplify and abstract them. We are currently working towards the automatic recognition of these abstractions, in order to simplify the view of a workflow for other communities and make it easier to understand. Metadata and provenance are key to facilitating this, since they describe the history and main features of every resource in a workflow execution:
    • Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil and Carole Goble, Common Motifs in Scientific Workflows: An Empirical Analysis.
    • Daniel Garijo, Oscar Corcho, Yolanda Gil. "Detecting common scientific workflow fragments using templates and execution provenance". In Proceedings of the seventh international conference on Knowledge capture (K-CAP '13). ACM, New York, NY, USA, 2013, Pages 33-40. DOI=10.1145/2479832.2479848 http://doi.acm.org/10.1145/2479832.2479848
  4. Ontology-based integration of heterogeneous scientific and non-scientific data sources. Important steps towards this goal are the provision of SPARQL querying support over distributed SPARQL endpoints, with a testbed in the bioinformatics domain that makes use of Bio2RDF endpoints and some initial results in query planning over distributed data sources. Previous results, which are still being used in several semantic e-Science projects, are the S-OGSA architecture and its related technological infrastructure. Some of the most relevant publications in this area are:
    • Buil-Aranda, C., Arenas, M., Corcho, O., Polleres, A., "Federating queries in SPARQL 1.1: Syntax, semantics and evaluation", Web Semantics: Science, Services and Agents on the World Wide Web, Volume 18, Issue 1, January 2013, Pages 1-17, 10.1016/j.websem.2012.10.001
    • Corcho, O., Alper, P., Kotsiopoulos, I., Missier, P., Bechhofer, S., Goble, C. (2006) An overview of S-OGSA: A Reference Semantic Grid Architecture. Journal of Web Semantics, 4 (2). pp. 102-115. ISSN 1570-8268
  5. Semantic annotation of scientific documents. This task includes the definition of a set of ontologies for describing scientific documents. We have reviewed the most relevant existing ontologies for this purpose, and the result of this work is a classification of them. We are currently working on the publication of an ontology that covers the particular characteristics of scientific discourse that are not covered by the existing ontologies.
    • Ruiz-Iniesta A. and Corcho O., A review of ontologies for describing scholarly and scientific documents, Proceedings of the Workshop on Semantic Publications (SePublica), 2014.
  6. The definition of a set of models for describing the execution environment of computational scientific experiments. These models aim to describe the hardware and software components involved in the execution of scientific workflows, including their dependencies and configuration information. In this context we have carried out a set of experiments in which, using those models, we have described the Pegasus workflow execution system and its dependencies, and also Montage, a scientific workflow for the astronomical analysis of the sky.
  7. Semantic representation of experimental protocols. A protocol is a depiction of a sequence of operations; experimental protocols are usually written in natural language. Generally, they are presented in a “recipe” style, providing step-by-step descriptions of processes. Such sequences of tasks and operations are fundamental units of knowledge in experimental research. Investigators follow and generate protocols in their daily activities; since experimental protocols reflect the know-how of a laboratory, they are shared and adapted for various purposes. Most importantly, experimental protocols are essential for patenting, and they are also central to reproducibility efforts. Several efforts have focused on the accurate description of data for interoperability purposes; however, fewer have emphasized how the data was actually produced. This research project addresses the question “How to semantically formalize experimental protocols so that sharing and discovering can be supported”. So far, 175 laboratory protocols in plant biology have been analyzed. As a result of this effort we have generated a checklist that includes a metadata set for reporting this type of document. The metadata included in the checklist was validated by 32 domain experts and is represented in the ontology SMART Protocols-document. In addition, NLP techniques are being used to gain a deeper understanding of the structures that currently support the narratives of experimental protocols. A structured vocabulary of concepts for representing the execution of laboratory protocols in the life sciences is available in the ontology SMART Protocols-workflow.
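
As an illustration of the kind of analysis that a provenance corpus (item 2 above) makes possible, the sketch below traces which inputs a result was derived from. It is a simplified stand-in, not the format of the published corpus: the trace is a hand-written list of triples, the file and run names are invented, and only two PROV-O property names (`prov:wasGeneratedBy` and `prov:used`) are borrowed. The `upstream_entities` helper is hypothetical.

```python
# Toy provenance trace as (subject, predicate, object) triples.
# Predicates reuse two PROV-O property names; resource names are invented.
TRACE = [
    ("mosaic.fits", "prov:wasGeneratedBy", "run:combine"),
    ("run:combine", "prov:used", "tile1.fits"),
    ("run:combine", "prov:used", "tile2.fits"),
    ("tile1.fits", "prov:wasGeneratedBy", "run:project"),
    ("run:project", "prov:used", "raw1.fits"),
]


def upstream_entities(trace, entity):
    """Collect every entity that `entity` was transitively derived from."""
    # Index the trace: which activity generated each entity,
    # and which entities each activity used.
    generated_by = {s: o for s, p, o in trace if p == "prov:wasGeneratedBy"}
    used = {}
    for s, p, o in trace:
        if p == "prov:used":
            used.setdefault(s, set()).add(o)

    found, pending = set(), [entity]
    while pending:
        current = pending.pop()
        activity = generated_by.get(current)  # None for raw inputs
        for source in used.get(activity, ()):
            if source not in found:
                found.add(source)
                pending.append(source)
    return found


sources = upstream_entities(TRACE, "mosaic.fits")
print(sorted(sources))
```

With real traces the triples would typically be loaded from RDF files with an RDF library and the same traversal expressed as a query, but the principle of following generation and usage links backwards stays the same.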

Members

This research area is led by Oscar Corcho. The team also includes María Pérez Hernández, the postdoc Rafael González, the PhD students Daniel Garijo, Idafen Santana and Olga Giraldo, and the MSc student Carlos Badenes.

Recommended Reading

Some readings related to Semantic e-Science:

There are currently no job offers or studentships available in this research area. For offers in other areas of the group, please check our job opportunities section. However, you may contact Oscar Corcho to check whether there are any potential open positions in the near future.

 

Created under Creative Commons License - 2015 OEG.