BIOLOGICAL INFORMATICS


Reproducibility through Semantics

Layered on top of the natural complexity of biology is the technical complexity of biological datasets and their analysis. Biological data, and the tools and algorithms designed to analyse them, have been generated by thousands of independent research groups worldwide and are published across a wide array of uncoordinated websites, often in non-standardized formats or databases. Because biological systems are interconnected, a researcher may find themselves relying on an unfamiliar data type or analytical tool from another species to answer a biological question in their own species of interest.

 

 

Our SADI (Semantic Automated Discovery and Integration) project extends these core principles into the domain of analytical algorithms and tools. SADI requires that every analytical tool describe, using semantic technologies, the kinds of biological entities it is capable of analysing and the inter-entity relationships it discovers as a result of its analysis. This allows machines to automatically match any given dataset with a set of tools capable of analysing that dataset to generate a biological relationship of interest to the researcher.
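The matching idea can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which each tool declares one input entity type and one output relationship as plain strings; real SADI services describe these with OWL classes and are matched by reasoning, not string comparison. The registry, tool names, and type names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDescription:
    """Simplified, hypothetical stand-in for a semantic service description."""
    name: str
    input_type: str       # kind of biological entity the tool can analyse
    output_relation: str  # relationship the tool attaches to that entity


# Hypothetical registry of published tool descriptions.
REGISTRY = [
    ToolDescription("blast_homologs", "ProteinSequence", "hasHomolog"),
    ToolDescription("go_annotator", "ProteinSequence", "hasGOTerm"),
    ToolDescription("snp_finder", "GenomicRegion", "hasVariant"),
]


def find_tools(entity_type, wanted_relation, registry=REGISTRY):
    """Return every tool that accepts `entity_type` and yields `wanted_relation`."""
    return [
        tool for tool in registry
        if tool.input_type == entity_type and tool.output_relation == wanted_relation
    ]


if __name__ == "__main__":
    for tool in find_tools("ProteinSequence", "hasGOTerm"):
        print(tool.name)
```

The researcher states only the data they hold and the relationship they want; which tool gets chosen follows from the published descriptions.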

 

Our SHARE (Semantic Health and Research Environment) project takes this one step further. We propose that scientific hypotheses can be formally modelled using the OWL language. The model is then decomposed into its individual assertions, some derived from prior knowledge and some representing the core hypothetical proposition. From there, we automatically match each assertion to a tool on the Web capable of retrieving or discovering data matching that assertion. SHARE then automatically "pipelines" those tools together and executes the analysis, generating a result dataset that attempts to meet the criteria in the hypothetical model. This comes close to achieving our desired end-point, where the biological researcher need only pose the question in order to obtain the answer, with all intermediate technical steps automated. Moreover, the resulting analysis is entirely transparent and reproducible, since all steps are selected, recorded, and executed without manual intervention, and each step is associated with a specific logical assertion in the initial hypothesis.
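A toy sketch of the decompose, match, and pipeline steps follows. It is not SHARE's implementation: assertions are modelled as simple (subject type, relation, object type) triples instead of OWL axioms, and the tools, lookup tables, and names are hypothetical stand-ins for Web services.

```python
from typing import Callable, NamedTuple


class Assertion(NamedTuple):
    subject_type: str
    relation: str
    object_type: str


class Tool(NamedTuple):
    name: str
    provides: Assertion           # the assertion this tool can satisfy
    run: Callable[[list], list]   # maps input entities to output entities


def expand(table):
    """Build a toy tool body that follows a lookup table, dropping duplicates."""
    def run(items):
        out = []
        for item in items:
            for target in table.get(item, []):
                if target not in out:
                    out.append(target)
        return out
    return run


# Hypothetical data standing in for remote resources.
GENE_TO_PROTEIN = {"geneA": ["protA"], "geneB": ["protB1", "protB2"]}
PROTEIN_TO_PATHWAY = {"protA": ["glycolysis"], "protB1": ["glycolysis"],
                      "protB2": ["TCA cycle"]}

TOOLS = [
    Tool("gene2protein", Assertion("Gene", "encodes", "Protein"),
         expand(GENE_TO_PROTEIN)),
    Tool("protein2pathway", Assertion("Protein", "participatesIn", "Pathway"),
         expand(PROTEIN_TO_PATHWAY)),
]


def build_pipeline(assertions, tools):
    """Match each assertion of the hypothesis to a tool that provides it."""
    index = {tool.provides: tool for tool in tools}
    missing = [a for a in assertions if a not in index]
    if missing:
        raise LookupError(f"no tool provides: {missing}")
    return [index[a] for a in assertions]


def execute(pipeline, seeds):
    """Run the tools in order and record which assertion each step served."""
    data, log = list(seeds), []
    for tool in pipeline:
        data = tool.run(data)
        log.append((tool.name, tool.provides, len(data)))
    return data, log


if __name__ == "__main__":
    hypothesis = [Assertion("Gene", "encodes", "Protein"),
                  Assertion("Protein", "participatesIn", "Pathway")]
    result, log = execute(build_pipeline(hypothesis, TOOLS), ["geneA", "geneB"])
    print(result)
    for step in log:
        print(step)
```

The execution log ties every step to one assertion of the hypothesis, which is the property that makes the analysis traceable.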

 

FAIR Data - Findable, Accessible, Interoperable and Reusable
Our lab is a lead participant in the FAIR Data initiative. In addition to co-authoring the FAIR Principles, we are exploring how these principles can be used to make science more transparent. When data and knowledge are FAIR, they become easier to find, and therefore easier to validate against prior biological knowledge and data. We examine how FAIR publication of scientific assertions might allow them to be automatically compared to similar assertions in the scholarly literature, providing a means both to explore the likelihood that a given assertion is true and to assemble a richer collection of citations, ensuring that all relevant scholars are properly credited.

 
Metagenomics

There are few tools that allow longitudinal analysis of metagenomic data subjected to distinct perturbations. We are examining longitudinal metagenomics data modelled as a Markov Decision Process (MDP). Given an external perturbation, the MDP predicts the next microbiome state in a temporal sequence, selected from a finite set of possible microbiome states. This results in a set of behaviour policies: for example, that moving from a state associated with disease to a state associated with health requires applying or avoiding certain perturbations (interventions, food, drugs, etc.). We have shown the flexibility of this approach by applying MDPs to human gut and chick gut microbiomes, and to human vaginal health, using sexual practices, nutritional intakes, probiotic treatments, and other perturbations to create the models. This novel analytical approach has applications in medicine, for example, where the MDP could suggest the sequence of perturbations (e.g. clinical interventions) to apply in order to follow the best path from any given starting state to a desired (healthy) state while avoiding strongly negative states.
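As an illustration of how a policy can be derived from such a model, the sketch below runs standard value iteration on a three-state toy MDP. The state names, perturbations, transition probabilities, rewards, and discount factor are all invented for the example and are not taken from the lab's models; value iteration is one textbook way to solve an MDP and is not necessarily the method used in this work.

```python
def q_value(state, action, P, R, V, gamma):
    """Expected discounted return of applying `action` in `state`."""
    return sum(prob * (R[nxt] + gamma * V[nxt])
               for nxt, prob in P[state][action].items())


def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Return state values and the greedy policy for a finite MDP."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a, P, R, V, gamma) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: q_value(s, a, P, R, V, gamma))
              for s in states}
    return V, policy


# Hypothetical microbiome states and perturbations (toy numbers only).
STATES = ["dysbiotic", "intermediate", "healthy"]
ACTIONS = ["probiotic", "none"]

# P[state][action] maps each possible next state to its probability.
P = {
    "dysbiotic": {
        "probiotic": {"intermediate": 0.6, "dysbiotic": 0.4},
        "none": {"dysbiotic": 0.9, "intermediate": 0.1},
    },
    "intermediate": {
        "probiotic": {"healthy": 0.7, "intermediate": 0.3},
        "none": {"healthy": 0.2, "intermediate": 0.5, "dysbiotic": 0.3},
    },
    "healthy": {
        "probiotic": {"healthy": 0.8, "intermediate": 0.2},
        "none": {"healthy": 0.9, "intermediate": 0.1},
    },
}

# Reward received on entering each state: penalise disease, reward health.
R = {"dysbiotic": -1.0, "intermediate": 0.0, "healthy": 1.0}

if __name__ == "__main__":
    values, policy = value_iteration(STATES, ACTIONS, P, R)
    for state in STATES:
        print(state, "->", policy[state])
```

With these toy numbers the policy recommends the probiotic in the two non-healthy states and no intervention once the healthy state is reached, which is the kind of state-dependent recommendation described above.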

RNA Biology


Polyadenylation in Plant Pathogens:
We investigate the protein structure and components of the polyadenylation machinery in animals, plants, fungi and oomycetes. Using bioinformatics approaches, we will undertake:

    * Identification of different protein complexes involved in 3' end pre-mRNA processing in selected eukaryotic species with diverse lifestyles and environmental niches to gain insights into protein structure and evolution of the polyadenylation machinery in eukaryotes.
    * A survey for canonical and alternative polyadenylation signals using bioinformatic methods [6,7] and available EST sequences from selected organisms. Transcriptome experiments will also be used to complete and confirm results.
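The signal survey in the second objective could, in its simplest form, look like the sketch below. It assumes ESTs are given as cDNA strings that end in a poly(A) tail, treats the start of that tail as the cleavage site, and searches a fixed upstream window for the canonical hexamer AAUAAA (written AATAAA in cDNA) and one common variant. The window size, minimum tail length, and variant list are illustrative assumptions; signals in fungi and plants are known to be less conserved than in animals, so a real survey would need a broader motif model.

```python
import re

CANONICAL = "AATAAA"     # AAUAAA in the mRNA
VARIANTS = ("ATTAAA",)   # illustrative variant list, not exhaustive


def find_signals(est, window=40, min_tail=10):
    """Locate candidate poly(A) signals upstream of the poly(A) tail.

    Returns (motif, offset) pairs, where offset is the motif start relative
    to the inferred cleavage site (negative means upstream). Returns an empty
    list when the sequence has no terminal poly(A) tail.
    """
    seq = est.upper()
    tail = re.search(r"A{%d,}$" % min_tail, seq)
    if tail is None:
        return []
    cleavage = tail.start()
    start = max(0, cleavage - window)
    region = seq[start:cleavage]
    hits = []
    for motif in (CANONICAL, *VARIANTS):
        # A lookahead is used so that overlapping occurrences are all reported.
        for match in re.finditer(f"(?={motif})", region):
            hits.append((motif, match.start() + start - cleavage))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    example = "GGCC" * 5 + "AATAAA" + "GCTTGCCTGTCCTGTGC" + "A" * 15
    print(find_signals(example))
```

Because a genomic A immediately before the tail cannot be distinguished from the tail itself, the inferred cleavage site is approximate.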

The mechanisms that regulate 3'UTR length through alternative polyadenylation (APA) remain a largely unexplored area of research, particularly in fungal pathogens of plants and animals, and in filamentous fungi more generally. Cis elements present in 3'UTRs, such as microRNA target sites, modulate gene expression by affecting cytoplasmic polyadenylation, subcellular localization, stability, translation and/or decay of the mRNA. The selection of the proper 3' end cleavage site is therefore an important step in the regulation of gene expression. We expect to learn how APA mechanisms contribute to the expression of genes that help organisms adapt to specific environmental conditions.

 

Evolution of RNA Processing Machinery in Plant Pathogens:
Maturation of eukaryotic mRNA involves highly orchestrated cellular events that begin with pre-mRNA synthesis in the nucleus by RNA polymerase II, followed by 5' end capping, splicing, and 3' end polyadenylation. These processes occur while the RNA is still being transcribed, so mRNA processing is cotranscriptional. mRNA 3' end formation is a two-step process essential for eukaryotic gene expression. Multiple levels of regulation tightly control and coordinate these gene expression processes.
Studies on yeast and filamentous fungi have revealed that some of the cellular processes that regulate gene expression in animals and plants are absent from these species or have evolved differently in them. Previous work has shown that the RNA-binding protein Rbp35 of the rice blast fungus Magnaporthe oryzae is not present in yeasts or mammalian cells. Rbp35 is a subunit of the fungal Cleavage Factor I complex, which is part of the polyadenylation machinery, and it regulates the alternative polyadenylation of a subset of transcripts associated with fungal pathogenicity. These observations prompted us to investigate the protein composition of the eukaryotic mRNA processing machineries across the different phyla of the fungal kingdom, in order to understand their evolution and potential links to the wide range of fungal lifestyles. For this purpose, we performed an in silico analysis and identified the core components of the RNA-associated processing machineries in fungi.

 

 

 

 

Representative Publications

Illana, A., Marconi, M., Rodriguez-Romero, J., Xu, P., Dalmay, T., Wilkinson, M.D., Ayllon, M.A., and Sesma, A. 2017. Molecular characterization of a novel ssRNA ourmia-like virus from the rice blast fungus Magnaporthe oryzae. Archives of Virology 162, 891-895. doi: 10.1007/s00705-016-3144-9.

Wilkinson, MD; Verborgh, R; Bonino da Silva Santos, LO; Clark, T; Swertz, MA; Kelpin, FDL; Gray, AJG; Schultes, EA; van Mulligen, EM; Ciccarese, P; Kuzniar, A; Gavai, A; Thompson, M; Kaliyaperumal, R; Bolleman, JT; Dumontier, M. 2017. "Interoperability and FAIRness through a novel combination of Web technologies". PeerJ Computer Science. DOI: 10.7717/peerj-cs.110.

Mons, B; Neylon, C; Velterop, J; Dumontier, M; da Silva Santos, LOB; Wilkinson, MD. 2017. "Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud". Information Services & Use. DOI: 10.3233/isu-170824.

Rodríguez Iglesias, A; Rodríguez González, A; Irvine, AG; Sesma, A; Urban, M; Hammond-Kosack, KE; Wilkinson, MD. 2016. "Publishing FAIR Data: an exemplar methodology utilizing PHI-base". Frontiers in Plant Science. DOI: 10.3389/fpls.2016.00641.

Wilkinson, MD; Dumontier, M; Aalbersberg, IJ; Appleton, G; Axton, M; Baak, A; Blomberg, N; Boiten, J-W; da Silva Santos, LB; Bourne, PE; Bouwman, J; Brookes, AJ; Clark, T; Crosas, M; Dillo, I; Dumon, O; Edmunds, S; Evelo, CT; Finkers, R; Gonzalez-Beltran, A; Gray, AJG; Groth, P; Goble, C; Grethe, JS; Heringa, J; ’t Hoen, PAC; Hooft, R; Kuhn, T; Kok, R; Kok, J; Lusher, SJ; Martone, ME; Mons, A; Packer, AL; Persson, B; Rocca-Serra, P; Roos, M; van Schaik, R; Sansone, S-A; Schultes, E; Sengstag, T; Slater, T; Strawn, G; Swertz, MA; Thompson, M; van der Lei, J; van Mulligen, E; Velterop, J; Waagmeester, A; Wittenburg, P; Wolstencroft, K; Zhao, J; Mons, B. 2016. "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data. DOI: 10.1038/sdata.2016.18.

Aranguren, ME; Wilkinson, MD. 2015. "Enhanced reproducibility of SADI web service workflows with Galaxy and Docker". GigaScience. DOI: 10.1186/s13742-015-0092-3.

Nakada, T; Boyd, JH; Russell, JA; Aguirre-Hernández, R; Wilkinson, MD; Thair, SA; Nakada, E; McConechy, MK; Fjell, CD; Walley, KR. 2015. "VPS13D gene variant is associated with altered IL-6 production and mortality in septic shock". Journal of Innate Immunity. DOI: 10.1159/000381265.

Pawluczyk, M; Weiss, J; Links, MG; Egaña Aranguren, M; Wilkinson, MD; Egea-Cortines, M. 2015. "Quantitative evaluation of bias in PCR amplification and next-generation sequencing derived from metabarcoding samples". Analytical and Bioanalytical Chemistry. DOI: 10.1007/s00216-014-8435-y.

Marconi, M; Rodriguez-Romero, J; Sesma, A; Wilkinson, MD. 2014. "Bioinformatics tools for Next-Generation RNA sequencing analysis", p. 371-391. In A. Sesma and T. von der Haar (eds.), Fungal RNA Biology. Springer International Publishing Switzerland. DOI: 10.1007/978-3-319-05687-6_15.

Katayama T; Wilkinson M; Aoki-Kinoshita K; Kawashima S; Yamamoto Y; Yamaguchi A; Okamoto S; Kawano S; Kim J-D; Wang Y; Wu H; Kano Y; Ono H; Bono H; Kocbek S; Aerts J; Akune Y; Antezana E; Arakawa K; Aranda B; Baran J; Bolleman J; Bonnal R; Buttigieg P; Campbell M; Chen Y; Chiba H; Cock P; Cohen K; Constantin A. 2014. "BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains". J. Biomed. Semantics 5:5.

Dumontier, M; Baker, C; Baran, J; Callahan, A; Chepelev, L; Cruz-Toledo, J; Del Rio, N; Duck, G; Furlong, L; Keath, N; Klassen, D; McCusker, J; Queralt-Rosinach, N; Samwald, M; Villanueva-Rosales, N; Wilkinson, M; Hoehndorf, R. 2014. "The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery". Journal of Biomedical Semantics 5:14.

Samadian S; McManus B; Wilkinson M. 2014. "Automatic detection and resolution of measurement-unit conflicts in aggregated data". BMC Med. Genomics 7:S12.

Egana Aranguren, M; Rodriguez Gonzalez, A; Wilkinson, MD. 2014. "Executing SADI services in Galaxy". Journal of Biomedical Semantics. DOI: 10.1186/2041-1480-5-42.

Luciano, JS; Cumming, GP; Kahana, E; Wilkinson, MD; Brooks, EH; Jarman, H; McGuinness, DL; Levine, MS. 2014. "Health Web Science". Foundations and Trends® in Web Science. DOI: 10.1561/1800000019.

Rodríguez González, A; Callahan, A; Cruz-Toledo, J; Garcia, A; Egaña Aranguren, M; Dumontier, M; Wilkinson, M. 2014. "Automatically exposing OpenLifeData via SADI semantic Web Services". Journal of Biomedical Semantics. DOI: 10.1186/2041-1480-5-46

Egaña Aranguren, M; Fernández-Breis, JT; Antezana, E; Mungall, C; Rodríguez González, A; Wilkinson, MD. 2013. "OPPL-Galaxy, a Galaxy tool for enhancing ontology exploitation as part of bioinformatics workflows". Journal of Biomedical Semantics. DOI: 10.1186/2041-1480-4-2.

Katayama, T; Wilkinson, MD; Micklem, G; Kawashima, S; Yamaguchi, A; Nakao, M; Yamamoto, Y; Okamoto, S; Oouchida, K; Chun, HW; Aerts, J; Afzal, H; Antezana, E; Arakawa, K; Aranda, B; Belleau, F; Bolleman, J; Bonnal, RJ; Chapman, B; Cock, PJ; Eriksson, T; Gordon, PM; Goto, N; Hayashi, K; Horn, H; Ishiwata, R; Kaminuma, E; Kasprzyk, A; Kawaji, H; Kido, N; Kim, YJ; Kinjo, AR; Konishi, F; Kwon, KH; Labarga, A; Lamprecht, AL; Lin, Y; Lindenbaum, P; McCarthy, L; Morita, H; Murakami, K; Nagao, K; Nishida, K; Nishimura, K; Nishizawa, T; Ogishima, S; Ono, K; Oshita, K; Park, KJ; Prins, P; Saito, TL; Samwald, M; Satagopam, VP; Shigemoto, Y; Smith, R; Splendiani, A; Sugawara, H; Taylor, J; Vos, RA; Withers, D; Yamasaki, C; Zmasek, CM; Kawamoto, S; Okubo, K; Asai, K; Takagi, T. 2013. "The 3rd DBCLS BioHackathon: improving life science data integration with Semantic Web technologies". Journal of Biomedical Semantics. DOI: 10.1186/2041-1480-4-6.

Luciano, JS; Cumming, GP; Wilkinson, MD; Kahana, E. 2013. "The emergent discipline of health web science". Journal of Medical Internet Research. DOI: 10.2196/jmir.2499.

McCarthy, L; Vandervalk, B; Wilkinson, M. 2012. "SPARQL Assist language-neutral query composer" BMC bioinformatics, vol. 13 Suppl 1, no. Suppl 1, p. S2.

Samadian, S; McManus, B; Wilkinson, M.D. 2012. "Extending and encoding existing biological terminologies and datasets for use in the reasoned semantic web" Journal of biomedical semantics, vol. 3, no. 1, p. 6, Jul.

Rodríguez-González, A; Torres-Niño, J; Mayer, M. A; Alor-Hernandez, G; Wilkinson, M.D. 2012. "Analysis of a multilevel diagnosis decision support system and its implications: a case study" Computational and Mathematical Methods in Medicine, vol. 2012, pp. 1-9.

Wood, I; Vandervalk, B; McCarthy, L; Wilkinson, M. 2012. "OWL-DL Domain-Models as Abstract Workflows" in Leveraging Applications of Formal Methods, Verification and Validation. Applications and Case Studies, T. Margaria and B. Steffen, Eds. Berlin/Heidelberg: Springer, pp. 56-66.

Centro de Biotecnología y Genómica de Plantas UPM – INIA Parque Científico y Tecnológico de la U.P.M. Campus de Montegancedo
Autopista M-40, Km 38 - 28223 Pozuelo de Alarcón (Madrid) Tel.: +34 91 4524900 ext. 1806 / +34 91 3364539 Fax: +34 91 7157721.
