User Guide

BiNGO

BiNGO Installation Instructions

Install BiNGO using the Cytoscape App Manager (under the Apps menu). BiNGO is then ready to use.

BiNGO Manual

1. Introduction

BiNGO is a Java-based tool to determine which Gene Ontology (GO) categories are statistically over- or underrepresented in a set of genes or a subgraph of a biological network. BiNGO is implemented as a plugin for Cytoscape, which is an open source bioinformatics software platform for visualizing and integrating molecular interaction networks. BiNGO maps the predominant functional themes of a given gene set on the GO hierarchy, and outputs this mapping as a Cytoscape graph. Gene sets can either be selected or computed from a Cytoscape network (as subgraphs) or compiled from sources other than Cytoscape (e.g. a list of genes that are significantly upregulated in a microarray experiment). The main advantage of BiNGO over other GO tools is that it can be used directly and interactively on molecular interaction graphs. Another advantage is that BiNGO takes full advantage of Cytoscape's versatile visualization environment, which allows you to produce customized high-quality figures.

Features include:

•Assessing over- or underrepresentation of GO categories
•Graph or gene list input
•Batch mode: analyze several clusters simultaneously using the same settings
•Different GO and GOSlim ontologies
•Evidence code filtering
•Wide range of organisms
•Hypergeometric or binomial test for overrepresentation
•Multiple testing correction using the Bonferroni (FWER) or Benjamini & Hochberg (FDR) correction
•Interactive visualization of results mapped on the GO hierarchy
•More extensive results in a tab-delimited text file
•Easy creation and use of your own annotation files
•Open source

This manual aims to explain the inner workings, and potential pitfalls, of BiNGO in more detail. To get a taste of the basic user interface, please take a look at the tutorial section.

2. Statistical tests

BiNGO currently provides two statistical tests for assessing over- or underrepresentation in a set of genes. The basic question answered by these tests is the following:

'When sampling X genes (test set) out of N genes (reference set; graph or annotation), what is the probability that x or more of these genes belong to a functional category C shared by n of the N genes in the reference set?'

The hypergeometric test (sampling without replacement) provides an accurate answer to this question in the form of a p-value. Its counterpart with replacement, the binomial test, provides only an approximate p-value but requires less calculation time. You should only consider using the binomial test if your test set contains several thousand genes.
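The question above can be written out as a small calculation. The following Python sketch is only an illustration of the two tests for overrepresentation, not BiNGO's own Java code; the function names are made up for this example, and the arguments x, X, n and N have the meaning given in the question above.

```python
from math import comb


def hypergeom_tail(x, X, n, N):
    """P(x or more of the X sampled genes belong to a category of n genes),
    sampling without replacement from N genes (hypergeometric test)."""
    upper = min(n, X)  # cannot draw more category genes than exist or are sampled
    favourable = sum(comb(n, k) * comb(N - n, X - k) for k in range(x, upper + 1))
    return favourable / comb(N, X)


def binomial_tail(x, X, n, N):
    """Same question, but sampling with replacement (binomial approximation)."""
    p = n / N  # chance that a single draw hits the category
    return sum(comb(X, k) * p**k * (1 - p) ** (X - k) for k in range(x, X + 1))


if __name__ == "__main__":
    # Hypothetical numbers: 40 test genes, 12 of them in a category that
    # holds 300 of the 6000 genes in the reference set.
    print(hypergeom_tail(12, 40, 300, 6000))
    print(binomial_tail(12, 40, 300, 6000))
```

For underrepresentation, the same sums would run over "x or fewer" instead of "x or more".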

3. Multiple testing corrections

Because BiNGO tests all GO labels present in the test set, the number of statistical tests performed in a single analysis may amount to several hundred. When testing individual categories at a significance level (α level) of, say, 0.05, you would expect 5 out of every 100 tested categories to be identified as over-represented just by chance. Suppose you have one truly positive category: this would imply that you have identified 5 times more false positive than true positive categories. Multiple testing corrections are designed to provide better control over the false positive rate at a given significance level.

One of the most basic corrections is the Bonferroni correction. The Bonferroni correction provides strong control over the Family-Wise Error Rate (FWER), which is defined as the probability of making at least one type I (false positive) error. E.g., when performing Bonferroni control of the FWER at level α = 0.05, you would be 95% certain that the over-represented categories that you identified contain no false positives. The Bonferroni correction is generally assumed to be rather conservative, although there have been reports (Boyle et al. 2004) that the Bonferroni correction would actually be rather liberal (at least for FWER control) when used for correcting tests that are not mutually independent, as is the case when testing GO categories (see further).

An alternative to using FWER-controlling corrections is to control the False Discovery Rate (FDR), i.e. the expected proportion of false positives among the positively identified tests. Generally, this type of correction is more appropriate for our purposes, since we would typically rather have more power (fewer false negatives) at the cost of a few more false positives. One of the most popular FDR corrections is the Benjamini & Hochberg correction, which provides strong control over the FDR under positive regression dependency of the test statistics.

In fact, it is not certain whether the GO hierarchy fulfills this positive regression dependency requirement. Nevertheless, the Benjamini & Hochberg correction is used widely. Alternatives include the Benjamini & Yekutieli procedure, which controls the FDR under arbitrary dependency, or resampling-based procedures to control either the FWER (e.g. Westfall & Young step-down minP procedure) or the FDR (e.g. Storey & Tibshirani ST-q procedure, which calculates adjusted q-values instead of p-values). The resampling-based procedures are rather computationally intensive, which is why they have not yet been implemented in BiNGO. The Benjamini & Yekutieli procedure exhibits severely decreased power compared to the Benjamini & Hochberg correction, which is a large price to pay for allowing arbitrary dependence.

For now, BiNGO provides only the most widely used (= basic) multiple testing corrections. However, we are planning to provide more correction options (such as those mentioned in the previous paragraph) at some point in the future. In the meantime, the user can add extra tests or multiple testing corrections through implementation of the Java interfaces provided for this purpose.
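To make the two corrections offered by BiNGO concrete, here is a minimal Python sketch of how Bonferroni and Benjamini & Hochberg adjusted p-values are commonly computed. It is an illustration of the textbook procedures under the assumption that one p-value per tested category is available; it is not taken from the BiNGO source, and the function names are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg adjustment (step-up FDR procedure)."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values for four categories
    print(bonferroni(raw))
    print(benjamini_hochberg(raw))
```

A category is then called significant when its adjusted p-value does not exceed the chosen significance level.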

For a more thorough discussion of this topic, see e.g. Ge et al. (2003).

Boyle, E.I., Weng, S., Gollub, J., Jin, H., Botstein, D., Cherry, J.M., Sherlock, G. (2004) GO::TermFinder--open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes, Bioinformatics 20, 3710-3715.

Ge, Y., Dudoit, S. and Speed, T.P. (2003) Resampling-based multiple testing for microarray data analysis, Technical Report 633, Dept. of Statistics, UC Berkeley. Available at: http://stat-www.berkeley.edu/tech-reports/

4. Batch Mode

When you use BiNGO in the text input mode, you can specify several clusters at the same time and let BiNGO analyze them using identical settings:

•Type batch in the cluster name box.
•Enter the clusters in the following format:

cluster_name1
gene1
gene2
gene3
batch
cluster_name2
gene4
gene5
batch
cluster_name3
gene6
gene7
...
batch
cluster_name123
gene523
gene524

(Start each cluster with a name for the cluster, and separate the clusters with the keyword 'batch'.)
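If you generate the batch input from a script, it can help to build and check the text before pasting it into BiNGO. The Python sketch below follows the format described above; the helper names are hypothetical, and details such as ignoring blank lines or matching 'batch' case-sensitively are assumptions of this example, not documented BiNGO behavior.

```python
def format_batch_input(clusters):
    """Turn {cluster name: [gene, ...]} into BiNGO batch-mode text."""
    blocks = ["\n".join([name, *genes]) for name, genes in clusters.items()]
    return "\nbatch\n".join(blocks)


def parse_batch_input(text):
    """Read batch-mode text back into {cluster name: [gene, ...]}."""
    clusters = {}
    current = []
    # A trailing 'batch' is appended so the last cluster is flushed too.
    for raw in text.splitlines() + ["batch"]:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line == "batch":
            if current:
                clusters[current[0]] = current[1:]  # first line is the name
            current = []
        else:
            current.append(line)
    return clusters


if __name__ == "__main__":
    text = format_batch_input({
        "cluster_name1": ["gene1", "gene2", "gene3"],
        "cluster_name2": ["gene4", "gene5"],
    })
    print(text)
    print(parse_batch_input(text))
```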

5. Using Standard and Custom Annotations and Ontologies

The goal of the Gene Ontology (GO) project is to provide a structured description of known biological information at different levels of granularity. GO consists of three structured, controlled vocabularies that describe gene products in terms of their associated biological processes, molecular functions and cellular components in a species-independent manner.

BiNGO provides several default GO ontologies and annotations for a wide range of organisms. The GO ontologies and annotations in BiNGO are parsed from information available at the NCBI (see version_information.txt in the BiNGO.jar file). The identifiers supported in the default annotation files usually include UniProt IDs, LocusTags, Official Gene Symbols and Unigene IDs, and some other identifiers dependent on the organism (see table below). For the identifiers in the other IDs column, you have to use the database as a prefix to the identifier, e.g. TAIR:AT5G45880, HGNC:19074, MIM:612733, Ensembl:ENSG00000100296. In case of doubt, please check whether the identifiers used in the default annotations (which can be found in the BiNGO.jar archive in the Cytoscape plugins directory after installation) correspond to the ones you use in your network/test set. If not, you should either change the identifiers in your network/test set or create a custom annotation file. Genes/proteins for which no annotations were retrieved are listed in the BiNGO output file. If there are many such genes, the identifiers you use may not be supported.

IMPORTANT: the default annotations and ontologies in BiNGO are being phased out. They will not be updated regularly. We recommend that you use custom annotation and ontology .obo files available on the GO website (www.geneontology.org). Download the files for your species of interest and specify them in the BiNGO annotation and ontology choice panels under ‘Custom...’.

One way to avoid the issues related to multiple testing is to test fewer categories. This option is especially attractive if you're only interested in more general functional profiling anyway. To this end, we provide several GOSlim ontologies in BiNGO, which are (organism-specific) slimmed-down versions of the full GO hierarchy. When using these default GOSlims in combination with either standard or custom annotations, the provided annotation is automatically remapped by BiNGO onto the chosen GOSlim, using the default full GO ontology as a remapping guide. Similar automatic remapping also occurs when you use .obo ontology files downloaded from GO. However, if you build and use a custom GOSlim of your own, annotation files will NOT be remapped in the same way. When using custom ontologies, remapping will only occur within your custom ontology (i.e. from specific nodes in the custom ontology to their parents along the custom hierarchy). Please make sure that you specify an appropriate custom annotation file accordingly. In this case you can no longer use the default organism/annotation options, since these provide annotation on the full GO, typically at more specific levels in the GO hierarchy. These annotations cannot be remapped appropriately onto your custom ontology. To help you check that you're using an appropriate annotation file, BiNGO will issue a warning message when some of the GO labels in your annotation don't match those of the chosen ontology. If you're not working with GOSlims, this warning might also indicate that you're using annotation and ontology files of different versions (GO is still very much under development, with new labels being added regularly).
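The remapping within a custom ontology described above can be pictured as follows. This Python sketch is only an illustration under stated assumptions: the ontology is given as a hypothetical dictionary from each term to its direct parents, the annotation as a dictionary from each gene to its terms, and the function names are invented for this example. It is not BiNGO's implementation.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def remap_annotation(annotation, parents):
    """Propagate each gene's terms to their parents within the ontology.

    Returns the remapped annotation and the set of labels that do not
    occur in the ontology (the situation BiNGO warns about).
    """
    known = set(parents) | {p for ps in parents.values() for p in ps}
    remapped, unmatched = {}, set()
    for gene, terms in annotation.items():
        labels = set()
        for term in terms:
            if term not in known:
                unmatched.add(term)
                continue
            labels.add(term)
            labels |= ancestors(term, parents)
        remapped[gene] = labels
    return remapped, unmatched


if __name__ == "__main__":
    # Hypothetical three-level custom ontology: C -> B -> A (root).
    parents = {"C": ["B"], "B": ["A"], "A": []}
    annotation = {"gene1": ["C"], "gene2": ["Z"]}  # "Z" is not in the ontology
    print(remap_annotation(annotation, parents))
```

A non-empty set of unmatched labels suggests that the annotation file and the ontology do not belong together.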

6. Which reference set to use?

Choosing the appropriate reference set against which your genes of interest will be tested depends very much on the problem under study. When you wish to assess the over-representation of functional categories in a test cluster relative to a network visualized in Cytoscape, you can simply select the Test cluster versus Network option in the Select Reference Set dropdown box of the BiNGO Settings Panel. You can then choose any of the organism/annotation and ontology options provided, and the appropriate reference set annotation will be parsed automatically. However, if you're not testing against a Cytoscape network, but instead use e.g. the text input option to enter a set of genes that proved to be significantly overexpressed in some microarray experiment, you have to specify a reference set containing ONLY the genes represented on the microarray. Indeed, it makes no sense to include genes in your reference set which are not on your array. For whole-genome arrays, using Test cluster versus whole annotation instead would probably not make a big difference, but a custom reference set remains the correct choice. The only cases in which you can actually use the default options (Test cluster versus Network/whole annotation) are a) if you're testing a cluster versus a Cytoscape network (see above) or b) if you genuinely want to test the over-representation of functional categories in your test set versus the whole genome. In all other cases, you have to specify the appropriate reference set under Custom... Custom reference set files simply contain all gene identifiers you want to include in the reference set, separated by newlines. E.g.

AT5G67110
AT5G67120
AT5G67130
AT5G67140
AT5G67150
AT5G67160
AT5G67170
AT5G67180
AT5G67190
AT5G67200
AT5G67210
AT5G67220
AT5G67230
AT5G67240
...
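Before running an analysis with a custom reference set, you may want to verify that every gene in your test set also occurs in the reference set. The Python sketch below reads a reference set in the newline-separated format shown above and reports test genes that fall outside it. The function names are hypothetical, and how BiNGO itself treats such genes is not covered here.

```python
def read_reference_set(lines):
    """Collect the identifiers from newline-separated text, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


def split_test_set(test_genes, reference):
    """Separate test genes that are in the reference set from those that are not."""
    inside = [gene for gene in test_genes if gene in reference]
    outside = [gene for gene in test_genes if gene not in reference]
    return inside, outside


if __name__ == "__main__":
    # In practice: with open("reference_set.txt") as handle: ... (hypothetical file name)
    reference = read_reference_set(["AT5G67110\n", "AT5G67120\n", "AT5G67130\n"])
    inside, outside = split_test_set(["AT5G67110", "AT1G01010"], reference)
    print("in reference set:", inside)
    print("not in reference set:", outside)
```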

In case you use the annotation files provided by the GO Consortium as custom annotations, you should be especially careful: the annotation files for some organisms are protein-centric, while others are gene-centric. The annotation file for Arabidopsis, for example, contains splice variants of genes, resulting in approx. 50,000 annotated entities, although there are only approx. 25,000 ORFs. If you do not want to include splice variants in your analysis, you'll have to use a custom reference set containing all Arabidopsis loci without splice variants, or make a custom annotation, which can be done from within BiNGO (see also section on custom annotations).
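A reference set without splice variants can be derived from a list of identifiers with a few lines of code. The Python sketch below assumes that splice variants are written as the locus identifier followed by a dot and a number (e.g. AT5G67110.1); please check that this convention holds for your file. The function names are hypothetical.

```python
import re

# Assumed convention: a splice variant is "<locus>.<number>", e.g. AT5G67110.1
_VARIANT_SUFFIX = re.compile(r"\.\d+$")


def to_locus(identifier):
    """Strip a trailing '.<number>' splice-variant suffix, if present."""
    return _VARIANT_SUFFIX.sub("", identifier)


def unique_loci(identifiers):
    """Collapse splice variants to loci, keeping first-seen order."""
    seen = {}
    for identifier in identifiers:
        seen.setdefault(to_locus(identifier), None)
    return list(seen)


if __name__ == "__main__":
    variants = ["AT5G67110.1", "AT5G67110.2", "AT5G67120.1", "AT5G67130"]
    # One locus per line, ready to be saved as a custom reference set file.
    print("\n".join(unique_loci(variants)))
```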

7. Interpretation of BiNGO graph and results

BiNGO is aimed at giving you a good idea of which functional themes are present in your gene set. The p-values give a good indication of the prominence of a given functional category. However, no biologist would dream of drawing conclusions solely based on p-values, and rightly so! The p-values returned by BiNGO can give you additional clues, which should be interpreted in the light of other evidence.

The BiNGO graph visualizes the GO categories that were found significantly over-represented in the context of the GO hierarchy. The size (area) of the nodes is proportional to the number of genes in the test set which are annotated to that node. The color of the node represents the (corrected) p-value. White nodes are not significantly over-represented; the other ones are, with a color scale ranging from yellow (p-value = significance level, e.g. 0.01) to dark orange (p-value = 5 orders of magnitude smaller than the significance level, e.g. 10^-5 × 0.01). The color saturates at dark orange for p-values which are more than 5 orders of magnitude smaller than the chosen significance level.
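The node coloring and sizing rules above can be sketched in a few lines of Python. This is an illustration only: the RGB values, the linear interpolation on a log10 scale, and the function names are assumptions of this example, not values taken from BiNGO.

```python
import math

WHITE = (255, 255, 255)
YELLOW = (255, 255, 0)        # assumed RGB for p-value = significance level
DARK_ORANGE = (255, 140, 0)   # assumed RGB for the saturated end of the scale


def color_fraction(p_value, alpha):
    """Position on the color scale: 0.0 at p = alpha, 1.0 at p <= alpha * 1e-5.

    Returns None when the category is not significant.
    """
    if p_value > alpha:
        return None
    if p_value <= 0:
        return 1.0
    orders_below_alpha = math.log10(alpha / p_value)
    return min(1.0, orders_below_alpha / 5.0)  # saturate after 5 orders


def node_color(p_value, alpha):
    """RGB color for a node, interpolated between yellow and dark orange."""
    fraction = color_fraction(p_value, alpha)
    if fraction is None:
        return WHITE
    return tuple(round(a + fraction * (b - a)) for a, b in zip(YELLOW, DARK_ORANGE))


def node_diameter(gene_count, scale=1.0):
    """Diameter of a circular node whose AREA is proportional to gene_count."""
    return scale * 2.0 * math.sqrt(gene_count / math.pi)


if __name__ == "__main__":
    for p in (0.5, 0.01, 1e-4, 1e-9):
        print(p, node_color(p, alpha=0.01))
```

Because the area, not the diameter, scales with the number of genes, a node with four times as many genes is only twice as wide.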

Due to the interdependency of functional categories in the GO hierarchy, it is very likely that not one category, but a whole branch of the GO hierarchy lights up as being significantly over-represented. In such cases, interpretation can be more difficult. The darkest orange nodes which are furthest down the hierarchy are probably the ones that you're looking for. Suppose for example that a branch of metabolism categories lights up (see figure).

You should not conclude from this that the genes involved in 'metabolism' as a whole are being over-represented here. In fact, you can see from the figure that the category 'protein modification process' is the important one, and that the over-representation of the 'protein metabolism process' and 'metabolism' categories is merely a result of the presence of those 'protein modification' genes. Should there be a substantial contribution of genes in the 'metabolism' category other than 'protein modification' genes, then the metabolism node should be bigger in size (which is not the case here) and darker in color. The same goes for the categories on the other end of the figure, where 'kinase activity' is the relevant category and you should not conclude that genes involved in 'catalytic activity' in general are over-represented. The fact that both categories are colored equally dark is due to the saturation of the node color for very low p-values.

The visualization is the key to the interpretation of BiNGO results. It would be hard to make the same interpretation in a reasonable amount of time based on textual output alone.

Next to the visual representation and BiNGO output window, BiNGO produces a tab-delimited text file containing more detailed results. Apart from a listing of the analysis options, the results file contains the (adjusted) p-value for each over-represented GO class, the number of genes in the test set annotated to that class and their identity, and the number of genes annotated to that class in the reference set.

8. Manipulating and saving BiNGO graphs

The controls and options for visualizing and modifying BiNGO graphs are basically the same as for other Cytoscape graphs. You can modify colors, labels, node shape, node size and so on through the Set Visual Properties option under the Cytoscape Visualization menu. A visual styles menu should pop up that allows you to modify the current Visual Style. Please note that each BiNGO graph has its own visual style, identified as BiNGO_<cluster name>. The basic reason for this is that the mapping of data attributes (e.g. p-values) to visual attributes (e.g. color) is different for every analysis you perform. Because you can switch back and forth between several BiNGO networks, all these attributes need to be stored separately in order to avoid confusion. DON'T change the visual style of e.g. your second BiNGO network (with style BiNGO2) to BiNGO1. This will cause all visual attributes of the nodes in the second network, such as node size (~ # of genes annotated to that node) and color (~ p-value), to change to the values calculated for those nodes in the first network. Overall, you should avoid renaming networks (also when saving them) or visual styles, since this can on rare occasions give rise to similar problems.

You can map various attributes to the node labels. A few of them need further explanation:

•x_<cluster name> : the number of genes in your cluster annotated to a certain GO class
•X_<cluster name> : the total number of genes in your cluster. This number may be different from the number of genes you selected in the graph or put into the text field, since genes without any annotation are discarded (see FAQ)
•n_<cluster name> : the number of genes in the reference set (graph or annotation) annotated to a certain GO class
•N_<cluster name> : the total number of genes in your reference set. This number may be different from the number of genes in your reference graph, since genes without any annotation are discarded (see FAQ)
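The four attributes above are related to each other as in the following Python sketch, which counts x, X, n and N for one GO class after discarding genes without any annotation. The data structures and the function name are hypothetical and only serve to illustrate the definitions; they do not reflect BiNGO's internal code.

```python
def category_counts(test_genes, reference_genes, annotation, category):
    """Return (x, X, n, N) for one GO class.

    `annotation` maps each gene to the set of GO classes it is annotated to.
    Genes that are missing from `annotation`, or have an empty set, are
    discarded before counting.
    """
    annotated_test = [g for g in set(test_genes) if annotation.get(g)]
    annotated_ref = [g for g in set(reference_genes) if annotation.get(g)]
    x = sum(category in annotation[g] for g in annotated_test)
    n = sum(category in annotation[g] for g in annotated_ref)
    return x, len(annotated_test), n, len(annotated_ref)


if __name__ == "__main__":
    # Hypothetical annotation: gene4 has no GO classes, gene5 is unknown.
    annotation = {
        "gene1": {"GO:0000001"},
        "gene2": {"GO:0000002"},
        "gene3": {"GO:0000001", "GO:0000002"},
        "gene4": set(),
    }
    test = ["gene1", "gene4", "gene5"]
    reference = ["gene1", "gene2", "gene3", "gene4", "gene5"]
    print(category_counts(test, reference, annotation, "GO:0000001"))
```

In this example three test genes were supplied, but X counts only the single gene that carries an annotation.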

You can save BiNGO figures just like any other Cytoscape figures in a number of formats, available under File>Export>Network As Graphics... If you want to store your BiNGO graph for future use in Cytoscape, save your Cytoscape session as a .cys file. Alternatively, you can choose to Export>Network and attributes as XGMML... (preferably with the same name as the cluster name you used in BiNGO, to keep things coherent). In that case, however, you should also Export>Vizmap Property File to save the visual style. The visual style will be necessary when restoring your graph at a later time. In contrast to visual styles you create from within Cytoscape, the BiNGO visual styles DO NOT get saved automatically upon exiting Cytoscape (otherwise you would get a flood of BiNGO visual styles after a while).

If you later load two or more BiNGO graphs from saved files, and you chose not to follow the naming guidelines outlined above, you have to make sure that the names of the attributes in your BiNGO networks are different. Otherwise, one of them will definitely get the wrong attributes mapped to its nodes. Likewise, if you did follow the naming guidelines, never load two networks with the same name at the same time.