Saturday, August 8, 2009

False positives

Suppose you take a pregnancy test and you score +.

What is the probability that you are pregnant given the positive result?

What is P(pregnant | +) ?

Clearly, it depends. What is:

P(pregnant | +, age=85) ?
P(pregnant | +, sex = M) ?

More practically, consider a diagnostic test (or umm, maybe we should just stick with the first example. :)

Suppose we have some underlying condition---say, a disease D (present) or ~D (absent)---and a test for the disease which can turn out positive P or negative N. (And no, smart guy, pregnancy is not a disease). Make a contingency table:

         P     N    total
D
~D
total


Now, introduce the concepts of false negative and false positive. These are indicated in the table below as FN and FP, while the corresponding "true" results are labeled TP and TN.

         P     N    total
D       TP    FN
~D      FP    TN
total


In the real world, the test results might be a continuous variable rather than a binary result.

[figure: distributions of test results for D and ~D]

The results might even overlap:

[figure: overlapping distributions of test results for D and ~D; the ~D distribution is shown in blue, with a dotted line marking the decision threshold]

The false positive rate is the fraction of subjects who do not have the disease that are incorrectly labeled 'positive': the fraction of the blue (~D) distribution in the plot above with values higher than the threshold indicated by the dotted line. (Note: it is not the fraction of all tests that come back positive).

Such an error is often identified with the symbol α, and is called a type I error. In statistical lingo, it is the probability of rejecting the null hypothesis (~D) when it is actually correct. (We're ignoring some subtleties of hypothesis testing here).
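
To make this concrete, here's a little Python sketch. The means, standard deviations, and threshold are made up, and we model both score distributions as Gaussians with scipy.stats.norm; it computes α as the area of the ~D curve above the threshold.

from scipy.stats import norm

# hypothetical (made-up) score distributions
mu_neg, sd_neg = 50.0, 10.0   # ~D, the blue curve
mu_pos, sd_pos = 70.0, 10.0   # D

threshold = 65.0   # the dotted line: scores above this are called +

# alpha = P(+ | ~D): area of the ~D curve above the threshold
alpha = norm.sf(threshold, mu_neg, sd_neg)
print(f"alpha = {alpha:.3f}")   # about 0.067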

Let's switch examples (to the third one!) and consider a bunch of drugs that are being tested for a given condition. Suppose (a freaking miracle) that 10% of them really work (W), that 90% do not (~W), and that we have adjusted things so that α = 0.05. The probability of a "significant" (positive) result for any one bogus drug is α, so the expected number of false positives is α times the number of ineffective drugs.

Suppose there are 1000 drugs in the test, so 900 are ineffective and we expect α * 900 = 45 false positives. Then the contingency table looks like this:

         W (good)   ~W (no-good)   total
+                        45
-                       855
total       100         900          1000


Notice that we can fill out part of the table but not all of it. Detecting those drugs that work depends on the details of the test (the power, see below). With a power of 80%, we detect 80 of the 100 effective drugs. This allows us to fill out the rest of the table.

         W (good)   ~W (no-good)   total
+            80          45           125
-            20         855           875
total       100         900          1000
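
Here's a sketch that builds the table from the assumptions above (1000 drugs, 10% effective, α = 0.05, power = 0.80):

n = 1000           # drugs tested
n_W = 100          # 10% really work
n_notW = n - n_W   # 900 do not
alpha = 0.05       # P(+ | ~W)
power = 0.80       # P(+ | W)

TP = round(power * n_W)      # 80 effective drugs called +
FP = round(alpha * n_notW)   # 45 bogus drugs called +
FN = n_W - TP                # 20
TN = n_notW - FP             # 855

print(f"{'':8}{'W':>6}{'~W':>6}{'total':>8}")
print(f"{'+':<8}{TP:>6}{FP:>6}{TP + FP:>8}")
print(f"{'-':<8}{FN:>6}{TN:>6}{FN + TN:>8}")
print(f"{'total':<8}{n_W:>6}{n_notW:>6}{n:>8}")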


Now here's the thing to remember. A question of central importance is "how often do we call the test significant, but the drug is actually no good?" It does not depend only on α, but on both α and power! We want P(~W | +).

We have 125 total positive results, but 45 of them are for drugs that don't work. The false positive parameter α is 0.05, but that is not the fraction of the drugs we called good that are actually bogus. That fraction depends on the power as well; in this case it is 45/125 = 0.36, about 1/3.
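
Written out as Bayes' rule, with P(W) = 0.1 and P(~W) = 0.9:

P(~W | +) = P(+ | ~W) P(~W) / [ P(+ | ~W) P(~W) + P(+ | W) P(W) ]
          = (0.05)(0.9) / [ (0.05)(0.9) + (0.80)(0.1) ]
          = 0.045 / 0.125
          = 0.36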

This is exactly the same issue that we got into with Bayes, the flu and headache here.

There are some other common terms we should define before we quit. Specificity refers to how good we are at identifying no-good drugs.

Specificity is the fraction of no-good drugs that we correctly classify as negative (-).

         W (good)   ~W (no-good)   total
+            TP          FP
-            FN          TN
total

Specificity = TN / (FP + TN)
            = P(- | ~W)

α = 1 - Specificity
  = P(+ | ~W)

or equivalently

Specificity = 1 - α

β = P(- | W)

Sensitivity = Power
            = TP / (TP + FN)
            = P(+ | W)
            = 1 - β
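
Here's a small Python helper that ties the definitions back to the 2x2 table, checked against the numbers from the drug example above:

def rates(TP, FP, FN, TN):
    # sensitivity (power) and specificity from a 2x2 table
    sensitivity = TP / (TP + FN)   # P(+ | W)
    specificity = TN / (FP + TN)   # P(- | ~W)
    alpha = 1 - specificity        # P(+ | ~W), type I error rate
    beta = 1 - sensitivity         # P(- | W), type II error rate
    return sensitivity, specificity, alpha, beta

sens, spec, a, b = rates(TP=80, FP=45, FN=20, TN=855)
print(f"sensitivity {sens:.2f}, specificity {spec:.2f}, "
      f"alpha {a:.2f}, beta {b:.2f}")
# sensitivity 0.80, specificity 0.95, alpha 0.05, beta 0.20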


And yes, I do find this all a bit confusing. I just try to remember the definitions for α and for power.

Then, sensitivity = power = 1 - β
and specificity = 1 - α

Specificity is how often we make the right call when a drug is no good. Sensitivity is how often we make the right call when a drug is good.

One last issue is that we can adjust the threshold of the test. In the disease example, we would adjust the numerical cutoff above which we label a result positive. This adjustment involves a tradeoff between sensitivity and specificity, between α and β, between the rates of false-positive (type I) errors and false-negative (type II) errors.
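
A final sketch, reusing the made-up Gaussian curves from the earlier example, shows the tradeoff as we sweep the threshold:

from scipy.stats import norm

mu_neg, sd_neg = 50.0, 10.0   # ~D
mu_pos, sd_pos = 70.0, 10.0   # D

print(f"{'threshold':>10}{'alpha':>8}{'beta':>8}")
for threshold in range(50, 80, 5):
    alpha = norm.sf(threshold, mu_neg, sd_neg)    # P(+ | ~D)
    beta = norm.cdf(threshold, mu_pos, sd_pos)    # P(- | D)
    print(f"{threshold:>10}{alpha:>8.3f}{beta:>8.3f}")

Raising the threshold makes α smaller but β larger, and lowering it does the reverse; you can't improve both at once without a better test.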