Friday, August 8, 2008

Deonier Ch 2 simulation for example 2

Suppose we simulate a sequence of 1000 nt and count the frequency of one of the nucleotides, say 'A.' If the frequencies are all the same ( P(A) = 0.25 ), then we expect the mean count of a number of different simulations to be 250. Suppose we are looking at windows along a genome with an overall freq of P(A) = 0.25 and we observe a count of 280. What does that mean? Well, actually, it's a complicated situation because we would be asking the question many times. But suppose we just asked once, for one sequence, is 280 significantly different than 250? How do we decide?

One way is to do a simulation. We generate a random sequence big enough to sample 1000 times at 1000 nt each.
import random, time
# random sequences of length = 1000
# do this 10000 times and count 'A' each time

length = 1000
trials = 1000

start = time.time()
seq = list('ACGT') * length * trials
random.shuffle(seq)
middle = time.time()
Now we run the simulations and count how many times there are more than 280 'A' nucleotides present:
results = list()
for n in range(trials):
i = n * length
j = i + length
results.append(seq[i:j].count('A'))

T = 280
L = [n for n in results if n > T]

print len(L) * 1.0 / trials
# 0.014
print '%3.2f seconds' % (middle - start)
# 5.51 seconds
print '%3.2f seconds' % (time.time() - middle)
# 0.06 seconds

We'd only expect to obtain this result by chance 14 times in 1000.