Python for Bioinformatics: Motif Discovery 7

Sunday, June 21, 2009

Motif Discovery 7

I did some exploration of the scoring function's landscape to demonstrate the existence of local maxima. I had thought to work up a graphic display, perhaps along these lines:

But, simplest interesting problem has 3 sequences, and that means that one of the three position variables can't be plotted next to its neighbors. (Maybe I should look into a 3D visualization of density in R). Anyway, I kept it simple. We have sequences of length SZ = 15 and motifs of length M = 3, and the motifs are completely conserved. The motif generator produced:

(1, 'TCA', 'CTCAGTTCGAGAGGA')
(2, 'TCA', 'CTTCAAAGTCAGCTA')
(7, 'TCA', 'CCATACCTCAATGAA')

That is, the motif TCA is embedded at the indicated locations in these sequences. All 13 x 13 x 13 = 2197 alignments were scored (Hamming distance). The two top positions are:

9, 1, 8, 7
9, 1, 2, 7

Here, the score is the first value, and the i,j,k indexes into the sequences follow. The second result is the position designed to have the motif, the first has a fortuitous match at position 8 of sequence # 2.

Next, I evaluated all positions with a score >= a threshhold (6 here), and asked: is it possible to change only one index (i.e. Gibbs sliding) and achieve a higher score? If so, we save the results in a list. If not, we are at a local maximum, and show an empty list [ ]. Here is the printout:


 9, 1, 8, 7 :   []
 9, 1, 2, 7 :   []
 7, 7, 3, 8 :   []
 7, 6, 8, 7 :   [(9, 1, 8, 7)]
 7, 6, 2, 7 :   [(9, 1, 2, 7)]
 7, 5, 1, 6 :   []
 7, 5, 1, 3 :   []
 7, 3, 6, 9 :   []
 7, 2, 9, 8 :   []
 7, 2, 9, 1 :   []
 7, 2, 3, 8 :   []
 7, 1, 8,11 :   [(9, 1, 8, 7)]
 7, 1, 8, 0 :   [(9, 1, 8, 7)]
 7, 1, 2,11 :   [(9, 1, 2, 7)]
 7, 1, 2, 0 :   [(9, 1, 2, 7)]
 7, 0,12, 6 :   []
 7, 0, 7, 6 :   []
 7, 0, 1, 6 :   []
 7, 0, 0, 6 :   []
 6,11, 5,10 :   []
 6, 9, 4, 2 :   []
 6, 7,12, 8 :   [(7, 7, 3, 8)]
 6, 7,12, 0 :   []
 6, 7, 3, 0 :   [(7, 7, 3, 8)]
 6, 5, 7, 6 :   [(7, 5, 1, 6), (7, 0, 7, 6)]
 6, 2, 3, 1 :   [(7, 2, 9, 1), (7, 2, 3, 8)]

I think you can see that of 17 positions with score = 7, only six can be improved by Gibbs sliding. Even at the level of score = 6, three positions can not be improved. And at the level of score = 5, 17 positions can not be improved (not shown). Thus, it is clear there are many local maxima.

Two other things related to what we've been talking about, before I move on to write a Gibbs Sampler. First, I discovered a rather embarassing off-by-one error this morning. It is shown in this graphic:

The correct range to use for embedding or searching for motifs is SZ - M + 1, but my code used SZ - M Luckily, since I used the same expression in both places, the analysis is not in error. It only means that the effective size of the sequences is SZ -1, since the motif was never positioned (or searched for) at the extreme right end of the sequences.

The second issue has to do with the scoring function used by Lawrence et al. Here is my table of the values for nucleotide counts between 0 and 10 (with a total number of sequences in the motif of 10):

 0  0.07 -1.81   0.0
 1  0.14 -0.81 -0.81
 2  0.21 -0.22 -0.44
 3  0.29  0.19  0.58
 4  0.36  0.51  2.06
 5  0.43  0.78  3.89
 6   0.5   1.0   6.0
 7  0.57  1.19  8.35
 8  0.64  1.36  10.9
 9  0.71  1.51 13.63
10  0.79  1.65 16.52

The first column is the count, the second is the q value with pseudocounts, the third is log2(q / p) and the fourth is the score = count x column 3. The problem I see is that the score for a nucleotide which does not appear in the motif against which it is being compared is 0, while the scores for counts of 1 and 2 are less than zero. That doesn't seem very logical.