Comparison with a Correlation Distance |
Our aim here is to
evaluate in some objective fashion the template
matching performance of GenTχWarper. Recall that GenTχWarper
is based on the dynamic time warping techniques
used in the speech recognition field. These
techniques allow for non-linear (elastic)
alignment of time profiles and consequently enable detection of
similar shapes with different phases. Classical distance metrics as
Euclidean or a variation thereof fail to capture
temporal variations since they are very
sensitive to small distortions in the time axis
and consequently, produce poor similarity
measures between time series. Pearson
correlation is a more robust distance metric in this
respect since it is supposed to measure
the similarity in shape between two
profiles. In view of this, it seems reasonable to
benchmark the dynamic time warping distance against the
Pearson correlation distance. The formula for the Pearson
correlation distance is: d = 1 - r where r =
Z(x) · Z(y) / n is the dot product of the
z-transformed values of two expression profiles x and
y. We have shown above that the z-transform is
the best data adjustment with respect to
performance that can be applied to our data and consequently, the dynamic time warping
distance will be tested with z-transformed
expression data.
Our test gene expression
time series is the same as above, i.e. coming from the Arabidopsis
aphidicolin synchronization experiment of
Menges et al., 2003(1)(Data) and consists of 3277
genes (selected based on p-values for regulation) sampled 10
times over a period of 19h. The set of all 9 Arabidopsis B-type cyclins presently
available on the Affymetrix chip have been
selected as a benchmarking set. The B-type cyclins
are known to have a peak of transcription during
the G2
to M phase transition and it is believed that
they are probably responsible for the mitotic events(2). Their expression
profiles are depicted in
the two consecutive figures below:
1) Absolute Expression Values
The figure above demonstrates
that although the expression of the 9 B-type cyclins is highly fluctuating in
the cell cycle, they all have a distinctive peak in early mitosis. Three cyclins (CYCB1_1,
CYCB2_1 and CYCB3_1) deviate considerably in
expression behaviour from the rest of the
group. Consequently their expression profiles
are depicted in different colours. The
somewhat peculiar expression profile of CYCB1_1 is probably due to
some stress response to the aphidicolin
synchronization method. The profiles of CYCB2_1 and CYCB3_1
exhibit a lower signal to noise ratio as compared
to the rest of the cyclins and the peak
of CYCB2_1 seems delayed in time. The latter is more
clearly illustrated with the z-transformed profiles
in the figure below.
2) Z-transformed Expression Values
All 9 B-type cyclins have
been used as templates and for each one of them
the first 50 best matching genes have been
determined, once using the
dynamic time warping (DTW)
distance and once using the Pearson correlation distance. Then each best matching list has been
subjected to analysis with the BiNGO tool in order to determine which Gene Ontology (GO) categories are statistically overrepresented in each list. The results for a cut-off p-value of 0.05 and using Benjamini & Hochberg (False Discovery Rate) multiple testing correction are presented in the 3 consecutive tables below. Each table is divided into two columns, on the left for the DTW distance and on the right for the Pearson correlation distance. Each of the columns is sub-divided into 4 additional columns containing: 1) the GO category identification (GO-id); 2) the multiple testing corrected p-value (p-value); 3) the number of selected genes versus the total GO number (selected / total); 4) a detailed description of the selected GO categories (description). Alternatively, a pdf file containing all the 3 tables can be found here cyclinsB tables.
It was observed above that,
probably due to stress response,
CYCB1_1 does not exhibit typical B-type cyclin
transcription behaviour. This explains the fact that no GO
categories are significantly overrepresented
in the list of 50 best matching genes for either the DTW or the Pearson
correlation distances.
The GO categories
selected for CYCB1_3 and CYCB1_4 show
considerable over-representation of cell cycle
regulation and mitosis control related genes. Observe that
for these two cyclins, and in particular for CYCB1_4, the DTW
category list clearly outperforms the Pearson correlation both
quantitatively (in terms of gene numbers
selected, see column 3 'selected/total')
and qualitatively (in terms of p-values, see
column 2 'p-value'). The same
conclusion can be made for CYCB1_5,
CYCB2_4 and CYCB2_5 below.
The interpretation
of the CYCB2_1 category lists is not so
straightforward. Recall that CYCB2_1 is the one that peaks
somewhat later in mitosis than the rest of the
B-type cyclins (consult the expression profile
figures above), which may be due to either stress
response or to some specific functional role of
this cyclin in the cell cycle. At first sight, it seems that
the Pearson correlation performs much better
than the DTW distance. The DTW distance identifies
only 2 GO categories (both mitosis related), while the Pearson
correlation distance produces a much more extensive
category list. However, when taking a closer look
at the GO categories selected with the
correlation, one finds categories as
geranylgeranyl reductase activity,
chloroplast, plastid, which clearly do
not have anything to do with cell cycle
processes. In this respect the DTW
distance seems to be more specific
and false positives are not present in its
category list.
The lists for CYCB2_2
exhibit an interesting phenomenon. Exactly the same set of
GO categories is produced by both distances, DTW
and Pearson correlation, but with a completely different
statistical significance. Thus the top of the DTW category list is occupied by
microtubule motor activity related genes (GO-id:
3777, 3774), while the one of the Pearson correlation by
cyclin-dependent protein kinase and regulation
of cell cycle associated genes (GO-id: 3753, 74,
19887, 19207). Overall, the GO categories in the DTW list for CYCB2_2
are assigned lower p-values than the
corresponding ones in the Pearson correlation list.
The results for CYCB3_1 are
also not that easy to interpret. The Pearson correlation
distance has identified one more
cyclin-dependent protein kinase (GO-id: 3753) than the DTW distance, which is
subsequently reflected in all kinase activity and cell
cycle regulation related categories (GO-id:
19887, 19207, 74, 7049, 50794, ...). The DTW
distance on the other side, has selected 2 more microtubule motor
activity related genes (GO-id: 3777, 3774) than
the correlation. Overall, it seems that the
DTW distance is more successful in identifying microtubule motor
activity related genes than the Pearson
correlation. The latter is the case for 6 out the
9 cyclins: CYCB1_1, CYCB1_4, CYCB1_5, CYCB2_2,
CYCB2_4, and CYCB3_1. A possible explanation for
this phenomenon is that the expression
profiles of these genes may have
higher time progression variability, which
consequently, is more adequately captured by the
DTW distance.
In summary:
1) In 1/9 cases (CYCB1_1) no
significant GO categories are found for both
distance measures;
2) In 6/9 cases (CYCB1_3, CYCB1_4,
CYCB1_5, CYCB2_2, CYCB2_4, CYCB2_5)
DTW outperforms the Pearson correlation both
quantitatively (in terms of gene numbers
selected, see column 3 'selected/total')
and qualitatively (in terms of p-values, see
column 2 'p-value');
3) In 1/9 cases (CYCB2_1)
Pearson outperforms DTW, but also produces some
odd (not cell cycle related) GO categories;
4) In 1/9 cases
(CYCB3_1) the performance is difficult to
compare since the two distance measures seem to
be better in different GO categories.
(1)
Sakoe,H. and Chiba, S. Dynamic
programming algorithm optimization for spoken word
recognition. IEEE Trans. on Acoust., Speech,
and Signal Process., ASSP 26, 43-49
(1978).
(2)
De Veylder,L., Joubès,J., Inzé, D. Plant cell
cycle transitions. Current Opinion in
Plant Biology 6, 536-543
(2003).
(3)
Menges,M., Henning,L., Gruissem,W.,
Murray,A.H. Genome-wide gene expression in an
Arabidopsis cell suspension. Plant Molecular
Biology, 53, 423-442
(2003).
|