VIB / Ghent University
Research
Knowledge management
Ontology based knowledge acquisition, representation and management: CCO
Ontologies are shared vocabularies plus a specification of their intended meaning, and they aim to support consistent and unambiguous knowledge sharing. They provide a framework for knowledge integration. An ontology links concept labels (symbols) to their interpretations, i.e. specifications of their meanings and of their relations to other concepts. As such, ontologies can be used to support automatic semantic interpretation of textual information and thus provide a basis for advanced text mining and reasoning.
Currently, only a limited body of biomedical knowledge has been captured in ontologies, primarily in the Open Biomedical Ontologies (OBO). Although these are already widely used within the biomedical research community, the knowledge they have captured is largely insufficient, both quantitatively and qualitatively, to support more advanced applications such as reasoning services or dynamic modeling of biological processes.
To overcome these limitations, an ontology dedicated to the domain of cell cycle research has been developed in the framework of the FP6 project DIAMONDS. This ontology, called the Cell Cycle Ontology (CCO, http://www.cellcycleontology.org), integrates data from a number of resources such as Gene Ontology, Relations Ontology, UniProt and IntAct (erant06). It comprises four sub-ontologies, each representing knowledge for one of four model organisms: A. thaliana, H. sapiens, S. cerevisiae and S. pombe. This will allow comparative analysis of the knowledge by ontology alignment approaches. Unlike OBO ontologies, CCO is available in several formats, one of them being OWL, the standard web ontology language for the Semantic Web. The use of OWL makes it possible to reason on the ontology for consistency checking, hypothesis formulation and knowledge generation.
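As a rough illustration of what reasoning over relations between concepts can mean in its simplest form, the sketch below infers all ancestors of a term by following is_a links transitively. The term names and the `IS_A` table are toy assumptions for illustration only; they are not taken from CCO, and real OWL reasoners do far more than this.

```python
# Toy is_a hierarchy (hypothetical content, not extracted from CCO).
IS_A = {
    "mitotic cell cycle": ["cell cycle"],
    "cell cycle": ["cellular process"],
    "cellular process": ["biological process"],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            # Continue upwards from the newly found parent.
            stack.extend(is_a.get(parent, []))
    return seen


print(sorted(ancestors("mitotic cell cycle")))
```

Anything annotated with the most specific term can then also be retrieved through any of its inferred ancestors, which is the kind of implicit knowledge that an explicit hierarchy makes available.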
CCO is now being used to construct an ontology-based Cell Cycle Knowledge Base. This task currently involves:
- Enrichment of CCO by integrating additional sources of data (protein localization, protein complexes, protein post-translational modifications, protein features, and so on).
- Development of database support to enable efficient querying, representation and maintainability.
Future development will include:
- Enrichment of the knowledge base through application of advanced Information Extraction tools.
- Application of advanced reasoning services and ontology learning approaches.
An interface for visualizing and browsing CCO is also planned.
Ref: Antezana, E., Tsiporkova, E., Mironov, V. & Kuiper, M. A cell-cycle knowledge integration framework. Lecture Notes in Bioinformatics 4075, 19-34 (2006).
Curated literature mining: MineMap
We are developing a system that allows researchers to take quick but rich and structured notes while reading publications, and afterwards to obtain a powerful and easily browsable visualization of all the extracted biological information and its relations.
MineMap's controlled syntax: While reading publications, biological researchers may collect notes on the most interesting pieces of information. Such notes typically include facts, hypotheses, interaction diagrams and fuzzy expression or activity profiles. As the number of articles increases, a large stack of collected notes can easily turn into a body of text as opaque as the original full-text publications themselves. We have therefore developed a controlled vocabulary that is clear and intuitive to use for writing notes while reading publications, yet still readable by a computer program. As a very simple example, the piece of information "CycA stimulates CdkB at the G1 phase" would be translated as "CycA -> CdkB @G1".
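To show how a note such as "CycA -> CdkB @G1" can be made machine-readable, here is a minimal parsing sketch. It is not MineMap's actual grammar: only the "->" operator and the "@" context marker appear in the example above, while the "-|" operator, the `Note` type and the `parse_note` function are assumptions introduced for illustration.

```python
import re
from typing import NamedTuple, Optional

# Only "->" (stimulates) comes from the example in the text;
# "-|" (inhibits) is an assumed extension for illustration.
RELATIONS = {"->": "stimulates", "-|": "inhibits"}


class Note(NamedTuple):
    source: str
    relation: str
    target: str
    context: Optional[str]  # e.g. a cell cycle phase, if given


NOTE_RE = re.compile(
    r"^\s*(?P<source>\S+)\s+(?P<op>->|-\|)\s+(?P<target>\S+)"
    r"(?:\s+@(?P<context>\S+))?\s*$"
)


def parse_note(line: str) -> Note:
    """Parse one 'source OP target [@context]' note into a structured record."""
    match = NOTE_RE.match(line)
    if match is None:
        raise ValueError(f"cannot parse note: {line!r}")
    return Note(match["source"], RELATIONS[match["op"]],
                match["target"], match["context"])


print(parse_note("CycA -> CdkB @G1"))
```

Once notes are stored as structured records like this, they can be collected into a graph of entities and relations rather than a stack of free text.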
The ability to capture a broad range of information types is essential in the design of the vocabulary; in addition, our focus is on capturing information that is relevant to dynamical modeling of biological networks. This human-validated information collection is naturally slower than automated text-mining, but its results can be trusted much more. In order to be successful, automated text-mining algorithms currently focus on narrow, well-defined topics (for example, analyzing all assertions about phosphorylation [yuan06]). In contrast, our method enables the collection of detailed and trustworthy information that covers a much broader topic area than a focused text-mining effort. This matters in Systems Biology, where integrating knowledge from several areas is essential for success.
MineMap's visualizer: The second part of the software imports these human-curated textual notes and represents them as a web of connected entities (proteins, complexes, organs, etc.) that can be explored, shared or integrated with other data sources. The relationship visualizer and graphical browser allow the user to look up any entity, to see all entities connected to it, and to hop further along those entities' connections.
Outlook: We are now testing and further refining a draft version of this software. We are also building collaborations for larger-scale development. Related development topics are:
Ref.: yuan06: Yuan X, Hu ZZ, Wu HT, Torii M, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, Wu CH (2006). An online literature mining tool for protein phosphorylation. Bioinformatics. 2006 Jul 1;22(13):1668-9.
Bottom-up modeling
SIM-plex: quantitative modeling made easier
We developed a genetic network simulator made especially for biologists. It lowers the mathematical threshold by translating and combining textual 'if-then' statements (e.g. "if Component1 > 50 then block Component2") into differential equations, without burdening the user with the mathematics [verc05]. The mathematical basis consists of piecewise linear differential equations, which are well suited to genetic interaction networks [kauf73] and translate naturally into such if-then statements. These equations also require one parameter fewer for each interaction compared to ordinary equations. Since quantitative parameter information is very scarce and difficult to collect in this field, this framework has the extra benefit of requiring less effort for parameter estimation.
SIM-plex has been applied in a number of publications [verk05, beem06]. It was used to illustrate and examine the behaviour of new parts of the Arabidopsis cell cycle and leaf development networks. Since its first version, it has been extended to connect biological scales, for example the molecular and phenotypical scales of the leaf development process.
Refs:
beem06: Beemster GTS, Vercruysse S, De Veylder L, Kuiper M, Inzé D (2006). The Arabidopsis leaf as a model system for investigating the role of cell cycle regulation in organ growth. J. Plant Res. 2006 Jan;119(1):43-50.
kauf73: Glass L, Kauffman SA (1973). The logical analysis of continuous, nonlinear biochemical control networks. Journal of Theoretical Biology 39, 103-129.
verc05: Vercruysse S, Kuiper M (2005). Simulating genetic networks made easy: network construction with simple building blocks. Bioinformatics. 2005 Jan 15;21(2):269-71.
verk05: Verkest A, Manes CL, Vercruysse S, Maes S, Van Der Schueren E, Beeckman T, Genschik P, Kuiper M, Inzé D, De Veylder L (2005). The cyclin dependent kinase inhibitor KRP2 controls the mitosis to endocycle transition during Arabidopsis leaf development through inhibition of mitotic CDKA;1 kinase complexes. Plant Cell. 2005 Jun;17(6):1723-36.
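The idea of turning an if-then statement into a piecewise linear differential equation can be sketched as follows. This is not SIM-plex code: the rule "if Component1 > 50 then block Component2" comes from the text, but the equation form (constant production switched by a threshold, plus linear decay), the rate values and the simple Euler integration are assumptions chosen for illustration.

```python
def simulate(steps=2000, dt=0.01):
    """Euler-integrate a two-component piecewise linear model.

    Component1: dC1/dt = k1 - g1 * C1            (always produced)
    Component2: dC2/dt = k2 * [C1 <= 50] - g2 * C2
                (production is blocked while C1 exceeds the threshold)
    """
    c1, c2 = 0.0, 0.0
    k1, g1 = 100.0, 1.0   # production and decay rates (assumed values)
    k2, g2 = 80.0, 1.0
    threshold = 50.0      # the "50" from the if-then rule
    trajectory = []
    for _ in range(steps):
        blocked = c1 > threshold          # the 'if' part of the rule
        dc1 = k1 - g1 * c1
        dc2 = (0.0 if blocked else k2) - g2 * c2   # the 'then block' part
        c1 += dt * dc1
        c2 += dt * dc2
        trajectory.append((c1, c2))
    return trajectory


trajectory = simulate()
final_c1, final_c2 = trajectory[-1]
print(f"final Component1 = {final_c1:.2f}, final Component2 = {final_c2:.4f}")
```

In this toy run Component2 first accumulates, then decays once Component1 crosses the threshold. Within each regime the equations are linear, which is what makes the approach piecewise linear.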
GINsim: logical modeling
In a collaboration with Denis Thieffry (TAGC-INSERM, Luminy Campus, Marseille, France) we are exploring the use of logical principles for modeling biological regulation processes. The Thieffry group is developing GINsim (Gene Interaction Network simulation), a computer tool specifically designed for the modeling and simulation of genetic regulatory networks. GINsim is a simulator of qualitative models of genetic regulatory networks based on a discrete, logical formalism. It allows the user to specify a model of a genetic regulatory network in terms of asynchronous, multivalued logical functions, and to simulate and/or analyze its qualitative dynamical behaviour. As we are a partner in the FP6 project TRANSISTOR, we will use GINsim to build a model for flower development.
Ref.: C. Chaouiya, E. Remy, B. Mossé and D. Thieffry. Qualitative analysis of regulatory graphs: a computational tool based on a discrete formal framework. "First Multidisciplinary International Symposium on Positive Systems: Theory and Applications" (POSTA 2003); Farina (Eds), Springer-Verlag, LNCIS 294:119-126
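The asynchronous, multivalued logical formalism mentioned above can be illustrated with a small sketch. It does not use GINsim or its file formats; the two-gene network, its levels and its logical rules are hypothetical. Each component has a discrete level and a logical function giving its target level, and in an asynchronous update only one component moves one level towards its target at a time.

```python
from itertools import product

# Hypothetical toy network: A has levels 0..2, B has levels 0..1.
MAX_LEVEL = {"A": 2, "B": 1}
ORDER = ("A", "B")


def target(state):
    """Logical functions: the level each component tends towards."""
    a, b = state["A"], state["B"]
    return {
        "A": 0 if b == 1 else 2,   # B represses A
        "B": 1 if a == 2 else 0,   # A activates B only at its highest level
    }


def successors(state):
    """Asynchronous update: each successor changes ONE component by one level."""
    goal = target(state)
    result = []
    for gene in ORDER:
        if goal[gene] != state[gene]:
            step = 1 if goal[gene] > state[gene] else -1
            nxt = dict(state)
            nxt[gene] += step
            result.append(nxt)
    return result


def all_states():
    ranges = [range(MAX_LEVEL[g] + 1) for g in ORDER]
    return [dict(zip(ORDER, levels)) for levels in product(*ranges)]


# A stable state is a state without successors.
stable_states = [s for s in all_states() if not successors(s)]
print("stable states:", stable_states)
print("successors of A=1, B=1:", successors({"A": 1, "B": 1}))
```

Because this toy network is a negative feedback loop, it has no stable state and the dynamics cycle. The state A=1, B=1 has two possible successors, which shows how asynchronous updating leads to branching in the state transition graph.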
MatCont: Numerical study of dynamical systems
The use of quantitative models and methods is a useful and mathematically exact way to understand pathways within cells and complex networks of interaction among genes. Quantitative models can be used to predict which kinetic properties lead to a specific feature in network behaviour. In many cases network behaviour can be so complex that computer simulation is the only way to derive predictions from a model. In particular, for networks with a large number of parameters it is difficult to estimate which parameter values give rise to qualitatively different behaviours.
In a collaboration with Willy Govaerts (Dept. of Applied Mathematics & Computer Science, Ghent University) we use MatCont to construct and study dynamical models that incorporate all aspects of specific plant growth and development regulatory networks, to analyze their behaviour and to draw conclusions concerning their parameters. MatCont is a powerful software package that can help to meet these goals. It is a graphical Matlab package for the interactive numerical study of dynamical systems. It is developed in parallel with the command line continuation toolbox Cl_MatCont. Both packages are freely available for non-commercial use on an "as is" basis. MatCont is developed under the supervision of W. Govaerts (UGent, Belgium) and Yu.A. Kuznetsov (RUU, Netherlands).
Ref:
MATCONT, CL_MATCONT and CL_MATCONT_for_MAPS, continuation software in Matlab, Last revision: March 26, 2007, http://users.ugent.be/~wgovarts/
Dynamic Models in Biology, Stephen P. Ellner and John Guckenheimer, 2006
EMERALD: a European Coordination Action to develop measures to improve microarray data quality
We will analyse quality control metrics on existing microarray data, and add a Normalisation and Transformation ontology to MGED to allow a structured recording of data pre-processing. European efforts to assess the merits of hybridisation standards for QC will be coordinated, and procedures to certify selected standards as European Reference Material will be set in motion. We will unite the different microarray technology stakeholders in a series of topical workshops to address the development and implementation of QA/QC in research, service, diagnostics, data pre-processing and archiving, computational data mining, new technology development and its exploitation, and to gain wide community acceptance and government approval of 'best practices'. We will assess the spectrum of best laboratory practices (standards, protocols, procedures), and discuss how common denominators can be established to standardise QA in various experimental settings and applications. In a pilot implementation of QA/QC we will assess the impact on data quality and coherence. A web portal at EBI will present a wide network of contacts, and serve to disseminate protocols, data sets, the use of control material, etc. We will assist individual microarray users in their transfer to such common practices. The results and experiences from transcriptome microarray QA/QC will create a cornerstone for a systems biology based life science, and cross-fertilise and advance the maturation process of emerging applications of microarray technology. This will stimulate the generation of essential know-how and the development of technical IP that can then be absorbed by European SMEs. More information can be found on the EMERALD website.
DIAMONDS: A European Systems Biology project to study the Cell Cycle
We study the regulatory network structure of one of the most fundamental biological processes in eukaryotes: the cell cycle. We apply an integrative approach to build a basic model of the cell cycle in four different species: S. cerevisiae (budding yeast), S. pombe (fission yeast), A. thaliana (weed, model plant) and human cells. To do this, a Consortium has been assembled of leaders in the fields of cell cycle biology, functional genomics technologies, database design and development, data analysis and integration technology, and modeling and simulation approaches. The project aims to produce and integrate various complementary (functional genomics) data sets, using an advanced mining and modeling environment designed to assist the biologist in building and amending hypotheses, and to help the investigator design new experiments to challenge these hypotheses. By doing this simultaneously in widely different organisms we will ensure that the tools are generally applicable across species. By bringing together biologists, bioinformaticians, biomathematicians and (commercial) software developers in the design phase we will ensure that a user-friendly, intuitive data analysis environment is created.
The main data streams generated de novo within the project concern transcript profiling and proteomics data (Y2H and TAP). These data will be complemented with information extracted through comparative genomics, and prior knowledge coming from literature mining (text mining tools). The project will bring together a number of existing technologies to build a knowledge warehouse in a relational database designed to contain cell cycle regulatory network information, accessible through an intuitive user platform (GUI) with embedded modeling tools. This platform will enable both top-down and bottom-up hypothesis-driven research, and it will serve as a basis to develop more rigorous dynamical models for cell cycle variants.
More information can be found on the DIAMONDS website.
Identification of heterosis predictive transcription-based markers in Arabidopsis: biometrics versus soft-computing approach
Heterosis ('hybrid vigour') refers to the improved performance of F1 hybrids relative to their parents. A cross between quasi-homozygous parents can in some cases produce offspring (F1) that outperform the parents in terms of yield, stress resistance, speed of development, etc. Heterosis is of great commercial importance since it enables the breeder to generate a product (F1 hybrid seed) with preserved values, which in turn allows the farmer to grow uniform plants expressing these heterosis features. It is particularly important for commercial agricultural crops such as corn, sugar beet, canola and sunflower, for vegetables such as tomato, cauliflower and onion, and for commercial forestry species such as poplar. Besides the commercial interest there is a more fundamental scientific interest in heterosis as an excellent example of what complex genetic interactions can lead to.
Although heterosis has been studied for many years, no complete genetic explanation has been found. Heterosis is in some way correlated with the genetic distance between the parents and their level of homozygosity, which in turn determines the level of heterozygosity in the first generation of offspring. However, this explains at most half of the observed cases of heterosis. F1 hybrids are nowadays tested for heterosis mostly by trial and error which, considering the great number of parental lines that have to be tested, turns out to be an enormous and rather expensive task. The choice of parental lines for an F1 program could be made more selectively if sufficient information on the combining abilities of the parental lines were available to the breeder in advance. Arabidopsis thaliana is used as a model system in the project.
A large number of F1 crosses (a diallel population) are being made between different Arabidopsis ecotypes chosen to induce a broad spectrum of heterosis observations. These F1s will be phenotypically characterized for several properties, such as leaf size, biomass and the xylem/phloem ratio of hypocotyls. Moreover, all the ecotypes considered as parental lines will be genotyped using AFLP-marker technology. Besides the molecular markers (AFLP markers), so-called molecular genotypes will also be determined via a genome-wide transcription analysis with micro-arrays. The Arabidopsis micro-array data will be analysed with both standard biometrics techniques and soft-computing algorithms. The ultimate goal is to build a computational model on the basis of Arabidopsis gene-expression data that enables prediction of hybrid performance with higher efficiency than the usual genetic markers. The results obtained with micro-arrays and AFLP markers will be analysed and compared. The classification markers will subsequently be used to identify key genes involved in the biological processes underlying heterosis, and we will further investigate in which genetic networks these genes play a role. Ultimately this study will provide insight into the mechanism behind biomass heterosis in Arabidopsis.
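The "improved performance of F1 hybrids relative to their parents" described above is commonly quantified as mid-parent or best-parent heterosis. The sketch below uses these two standard definitions; the text does not state which measure the project uses, and the function names and numeric values are illustrative assumptions. It also assumes a trait for which a higher value is better (e.g. biomass).

```python
def mid_parent_heterosis(f1, parent_a, parent_b):
    """Relative gain of the F1 over the average of both parents."""
    mid_parent = (parent_a + parent_b) / 2
    return (f1 - mid_parent) / mid_parent


def best_parent_heterosis(f1, parent_a, parent_b):
    """Relative gain of the F1 over the better-performing parent."""
    best_parent = max(parent_a, parent_b)
    return (f1 - best_parent) / best_parent


# Made-up trait values for two parental lines and their hybrid.
print(mid_parent_heterosis(15.0, 8.0, 12.0))   # 0.5
print(best_parent_heterosis(15.0, 8.0, 12.0))  # 0.25
```

A predictive model of hybrid performance would aim to estimate such values for a cross before it is made, from the marker or expression data of the parental lines.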
Top-down modeling
One of the major goals of computational biology is to integrate functional genomics data of all types in a global network that reflects the regulatory wiring and modularity of an organism. Among the most important data sources for unravelling biological networks are microarray data from perturbation experiments. We have developed novel methods, based on combinatorial statistics and graph theory, to explore perturbational microarray data. These methods position the genes in a highly interconnected network of transcriptional clusters. Functional annotation of those clusters using Gene Ontology (see below) reveals not only the global transcriptional modularity and behaviour, but also crosstalk between pathways and clues about the functionality of unknown genes. While perturbational data for Arabidopsis are being generated in the CAGE framework, we applied our approach to a compendium of microarray profiles for S. cerevisiae representing 300 perturbations.
Another prominent source of data is the a priori knowledge that has been gathered since the early days of molecular biology. Integrating this information in the network is essential, and it also plays an important role in validating methodology. In the last few years, a number of initiatives have been set up to structure this large body of information in a controlled vocabulary. One of the most promising is Gene Ontology (GO). We are developing tools and methods to take advantage of the rich hierarchical structure of GO in functional annotation. We have developed the GO annotation and visualization tool BiNGO as a plugin for the Cytoscape visualisation platform, in collaboration with the Institute for Systems Biology in Seattle.
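Functional annotation of a gene cluster with GO terms is often framed as an over-representation test: does the cluster contain more genes carrying a given annotation than expected by chance? The text does not specify which statistic is used, so the sketch below assumes a hypergeometric test, a common choice for this task. The function name and the example numbers are illustrative.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated genes in a cluster.

    k: annotated genes in the cluster
    n: cluster size
    K: annotated genes in the whole reference set
    N: size of the reference set
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the hypergeometric probabilities for k, k+1, ..., upper.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Illustrative numbers: 8 of 20 cluster genes carry a term that annotates
# 100 of 6000 genes overall.
print(hypergeom_pvalue(8, 20, 100, 6000))
```

When many GO terms are tested for the same cluster, the resulting p-values need a multiple-testing correction before they are interpreted.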
Presentations New Project AROBIO: VIB-PSB, VIB-PSB-CCO, VIB-PSB-MINEMAP, UGENT, UNIMAN, CNIO, DKFZ, LUB, NTNU, PUBGENE, BIOALMA