Saturday, August 15, 2009

Normally squared

According to Wikipedia, the chi-square distribution with df = k arises as the sum of the squares of k independent, normally distributed random variables with mean 0 and variance 1. For example, if we take two independent standard normal variables, the squared distance of the point (u,v) from the origin should be distributed in this way with df = 2.

# squared distance from the origin for 10000 standard normal points
u = rnorm(10000)
v = rnorm(10000)
w = u**2 + v**2
# histogram on the density scale, with unit-width bins covering all of w
B = seq(0,max(w)+1,by=1)
hist(w,freq=F,ylim=c(0,0.5),
breaks=B,col='cyan')
# overlay the chi-squared density for df = 2
curve(dchisq(x,df=2),
lwd=5,col='red',
from=0,to=10,add=T)
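
Incidentally, with df = 2 the chi-squared density is just the exponential density with rate 1/2, that is, 0.5*exp(-x/2). As a quick aside, overlaying it on the same plot shows the dashed curve landing exactly on the red one:

# for df = 2, dchisq(x,2) is exactly (1/2)*exp(-x/2)
curve(0.5*exp(-x/2),
lwd=2,lty=2,col='blue',
from=0,to=10,add=T)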

[Plot: histogram of w with the chi-squared (df = 2) density overlaid]

I found that I could also get what looks like a chi-squared distribution by multiplying two normal random variables (with means not equal to 0) together, and it seems that in this case the degrees of freedom equal the product of the means. As the quick check below shows, though, the match is close but not exact.

# product of two independent normals, each with mean 3 (and sd 1)
u = rnorm(10000,3)
v = rnorm(10000,3)
w = u*v
# the product is occasionally negative, so the bins must start below min(w)
B = seq(floor(min(w)),max(w)+1,by=1)
hist(w,freq=F,ylim=c(0,0.18),
breaks=B,col='cyan')
# overlay the chi-squared density for df = 9 (= 3*3)
curve(dchisq(x,df=9),
lwd=5,col='red',
from=0,to=30,add=T)
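
How close is the match? A simple check is to compare moments: a chi-squared variable with df = 9 has mean 9 and variance 2*9 = 18, while the product of two independent N(3,1) variables has mean 3*3 = 9 but variance 2*3^2 + 1 = 19. So the two distributions nearly coincide without being identical:

mean(w)   # close to 9, the chi-squared mean for df = 9
var(w)    # close to 19, versus 18 for a true chi-squared with df = 9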

[Plot: histogram of w with the chi-squared (df = 9) density overlaid]

Let's explore how to use R to solve the problem of the grade distribution from the last post. We have counts of males and females receiving grades A, B, D, and F, in that order, stored in a matrix M:

m = c(56,60,43,8)
f = c(37,63,47,5)
M = t(rbind(m,f))


> M
      m  f
[1,] 56 37
[2,] 60 63
[3,] 43 47
[4,]  8  5
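
As an aside, cbind(m,f) builds the same matrix in one step, and row names make the printout self-explanatory (M2 is just a scratch name here, so that M itself is left alone for the analysis below):

M2 = cbind(m,f)   # identical to t(rbind(m,f))
rownames(M2) = c('A','B','D','F')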


r = chisq.test(M)


> r

Pearson's Chi-squared test

data: M
X-squared = 4.1288, df = 3, p-value = 0.2479


This matches the value given in Grinstead & Snell. We can explore the contributions of the individual cells to the statistic as follows (each expected count is the corresponding row total times the column total, divided by the grand total).

E = r$expected   # counts expected if grade and sex were independent
O = r$observed   # the original data
(O-E)**2/E       # per-cell contributions to X-squared


We see that the disparity in A's was certainly higher than for the other categories, but the p-value (above) is not significant.

> (O-E)**2/E
             m         f
[1,] 1.0985994 1.2070139
[2,] 0.2995463 0.3291068
[3,] 0.3595670 0.3950506
[4,] 0.2096039 0.2302885
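
These cell contributions sum to the statistic itself, which makes a nice sanity check:

sum((O-E)**2/E)   # 4.1288, the X-squared value reported above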


We see that the 95th percentile of the chi-squared distribution for df=3 is just a bit larger than 7.8:

> S = seq(7,8,by=0.1)
> pchisq(S,df=3)
[1] 0.9281022 0.9312222 0.9342109 0.9370738
[5] 0.9398157 0.9424415 0.9449561 0.9473637
[9] 0.9496689 0.9518757 0.9539883
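
There is actually no need to hunt for it this way: qchisq is the inverse of pchisq, and returns the percentile directly.

> qchisq(0.95,df=3)
[1] 7.814728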


We can do a Monte Carlo simulation (I actually do not know yet how the preceding function works, but the simulation should be just like what we did yesterday):

r = chisq.test(M,simulate.p.value=T)



> r

Pearson's Chi-squared test with simulated p-value
(based on 2000 replicates)

data: M
X-squared = 4.1288, df = NA, p-value = 0.2414
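
Here is a sketch of what I believe the simulation does; this is a reconstruction using r2dtable, which generates random tables with fixed margins, and I have not verified that chisq.test does precisely this. The idea: generate many tables with the same row and column totals as M, compute the statistic for each, and report the fraction at least as large as the observed one.

# X-squared statistic for an arbitrary table
stat = function(tab) {
E = outer(rowSums(tab),colSums(tab))/sum(tab)
sum((tab-E)**2/E) }
# 2000 random tables with the same margins as M
sims = r2dtable(2000,rowSums(M),colSums(M))
# fraction of simulated statistics at least as large as ours
mean(sapply(sims,stat) >= stat(M))   # should be near 0.24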


And Fisher's exact test (also on my study list for the future):

r = fisher.test(M)


> r

Fisher's Exact Test for Count Data

data: M
p-value = 0.2479
alternative hypothesis: two.sided