Sunday, December 19, 2010

DNA binding sites 1

I haven't written much about classification of DNA binding sites or motifs on the blog. (I did a series on Gibbs sampling to find new motifs starting here).

But a couple of years ago, before the blog started, I did some work on this and there is a modest writeup on my pages at mac.com (here). At that time, I put a lot of effort into providing a Quartz GUI for the basic functionality, but I haven't tested it recently and I'd be pretty surprised if it still works. Still, that discussion (and see here) is reasonable. Let's see if we can improve upon it.

The subject came up again when I was looking at Cython (here). One use case that's definitely appealing is to predict new members of the family in a bacterial genome (given a set of known binding sites), where there are (just about) as many candidate sites to be tested as there are base pairs. That can take substantial time in pure Python (a few minutes).

To start working on this problem, we need some data. A repository of alignments for various transcription factors of E. coli is available from George Church's lab (here), but I'm going to use Tom Schneider's method (website) for constructing the scoring system, so I'll use the example from his paper about fis (Hengen 1997 PMID 9396807). Here is a graphic from that paper showing a few of the 60 sites they analyzed and the bit scores for each site. We're going to try to recreate this. I've got a file with the 60 sequences; you can get them and the code we'll be using from Dropbox (here).
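Reading the sites in is trivial, but for completeness here's a minimal loader. It assumes one aligned sequence per line, and the filename is just a placeholder, so adjust both to match the actual Dropbox file:

# hypothetical filename; one aligned site per line is assumed
with open('fis_sites.txt') as f:
    seqs = [line.strip().upper() for line in f if line.strip()]

# the sites only make sense as an alignment if they're all the same length
assert len(set(len(s) for s in seqs)) == 1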



As usual for these problems, we consider each position in the alignment to be independent, and add up the scores for all the columns to obtain a total score. The score for a single column of the alignment depends on the count of nucleotides at that position. Suppose we have N = 20 sequences and this distribution:

A:5  C:5  G:5  T:5

We calculate the frequency as 5/20 = 0.25 for each base, and then the score is 2 + log2(freq) - correction. The correction is for the small sample size of known sites (see here to begin) and it's about 0.018 in this particular case, but we'll neglect it for this brief discussion. The method essentially computes a log odds score for the competing hypotheses of a real binding site versus random sequence.
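Here's the calculation for a single column as a quick sketch in Python, neglecting the correction as described (zero counts would fail here; the pseudocount that handles them comes below):

from math import log2

def column_score(counts, N):
    # counts: base -> count for one column; N: number of aligned sequences
    # score(b) = 2 + log2(freq(b)); small-sample correction neglected
    return dict((b, 2 + log2(c / N)) for b, c in counts.items())

print(column_score({'A': 5, 'C': 5, 'G': 5, 'T': 5}, 20))
# {'A': 0.0, 'C': 0.0, 'G': 0.0, 'T': 0.0}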

The score for each base in this alignment, which looks random, is calculated as 2 + log2(0.25) = 0. On the other hand, suppose we have:

A:18  C:1  G:1  T:0

Then, with frequencies 18/20 = 0.9 and 1/20 = 0.05, the scores are:

A = 2 + log2(0.9) = 2 - 0.152 = 1.848
C = G = 2 + log2(0.05) = 2 - 4.322 = -2.322

We assign a pseudocount of 1 for T, even though no site had T, so the score for T is the same as for C and G. In this way we end up with an m x n matrix of floats, where m = 4 (ACGT) and n = the length of the alignment.
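Here's a sketch of how the whole matrix might be built, with the pseudocount handled as just described (note that it keeps N in the denominator rather than N + 1, matching the numbers above):

from math import log2

def make_matrix(seqs):
    # seqs: aligned binding sites, all the same length
    N = len(seqs)
    n = len(seqs[0])
    # matrix[b][j] is the score for base b at column j (4 x n)
    matrix = {b: [0.0] * n for b in 'ACGT'}
    for j in range(n):
        col = [s[j] for s in seqs]
        for b in 'ACGT':
            c = col.count(b) or 1   # pseudocount of 1 for an unobserved base
            matrix[b][j] = 2 + log2(c / N)
    return matrix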

To score a candidate site, we observe each nucleotide in turn, and retrieve the corresponding score for that nucleotide and position in the alignment. For example, in the non-random case if the site contains A at that position we add 1.848 to the score, while if it has C we subtract 2.322 from the score. Here we see the great strength of this approach: we don't just reward sequences for being close to the "consensus", we penalize them for having a nucleotide that is rarely observed at the corresponding position in authentic sites.
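In code, scoring a candidate is just a lookup and a sum; a minimal sketch (the genome and n in the trailing comment are placeholders for the genome-scanning use case mentioned earlier):

def score_site(site, matrix):
    # add the matrix entry for each nucleotide of the candidate, by position
    return sum(matrix[b][j] for j, b in enumerate(site))

# to scan a genome, score every window of width n on one strand:
# scores = [score_site(genome[i:i+n], matrix)
#           for i in range(len(genome) - n + 1)]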

That's the simple, basic idea. As I said, the files are on Dropbox (here). Next time we'll show some results.