Fun with Statistics: Base Rate Fallacy

Interesting BBC article about 90% certainty being really very uncertain when applied to large populations:

Imagine you’ve invented a machine to detect terrorists. It’s good, about 90% accurate. You sit back with pride and think of the terrorists trembling. […]

You’re in the Houses of Parliament demonstrating the device to MPs when you receive urgent information from MI5 that a potential attacker is in the building. Security teams seal every exit and all 3,000 people inside are rounded up to be tested.

The first 30 pass. Then, dramatically, a man in a mac fails. Police pounce, guns point.

How sure are you that this person is a terrorist?
A. 90%
B. 10%
C. 0.3%

The answer is C, about 0.3%.

The correct answer is D, 45%. The person WAS at parliament after all.

Hi. Well, that is a good article to help the general population understand statistics a bit better. The lack of knowledge about math puts the public at risk of being manipulated by spin doctors. However, this article is about 15 years too late.

The problem of multiple experiments entered the scene when screening for multiple gene variations in association studies became practical. You may recall a time in the early 90’s when scientists declared they had found the gene responsible for risk-taking, and excessive forum posting, and what-not.

It turns out that when you perform multiple tests, a confidence level of 0.95, which is sufficient for a single experiment, leads to about 50 false positives on average when 1,000 genes are tested at the same time.

So most of those papers were basically nonsense. Then statisticians, the sheriffs of the wild west of bio-informatics, got to work developing corrective measures for the significance of multiple experiments. See the Bonferroni correction and the false discovery rate. So yeah, statisticians should be well aware of these things by now. But of course most normal people aren’t.
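A quick sketch of the arithmetic behind that 50-false-positives claim (assuming independent tests at a p < 0.05 threshold; the Bonferroni correction simply divides the per-test threshold by the number of tests):

```python
# Expected false positives when running many tests at a fixed
# significance threshold, and the Bonferroni correction.
alpha = 0.05      # per-test significance threshold (95% confidence)
n_tests = 1000    # e.g. 1,000 genes screened at once

# Under the null hypothesis (no real effect anywhere), each test
# independently has an alpha chance of a false positive.
expected_false_positives = alpha * n_tests   # on average, ~50

# Bonferroni correction: shrink the per-test threshold so the chance
# of even one false positive across all the tests stays near alpha.
bonferroni_alpha = alpha / n_tests           # each test now needs p < 0.00005
```

The false discovery rate approach mentioned above is less conservative than this, but the Bonferroni version is the one-line intuition.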

Is this the man, by any chance?

I’m a little confused by the math (no surprise, since I am WAY out of practice).

If the test is 90% accurate, doesn’t that mean that it has a 90% chance of catching a person who IS a terrorist. In other words, a 90% accurate terrorist test would let 10% of actual terrorists get away. But that’s not the same as saying that it would give a false positive for 10% of people it tested, is it? Or in practice, is that the same thing?

If you’re only given a single accuracy rate, I think it’s presumed that it applies to any random trial, otherwise you’d need different rates for different cases. Here was a similar two-rate example we went through a while back.

It’s the same thing. The test is right 90% of the time, but wrong 10% in either direction.

There are certainly cases where you can end up with different rates for false positives and false negatives, but that’s not important to the point they’re making.

Where they’re going is that even false positive rates that sound impressively low are bad simply because there are so goddamn many people who aren’t terrorists.

Say you’re screening a million people, and your test has a 10% false negative rate and a 0.01% false positive rate. Assume that there are, say, ten terrorists in that set of a million people.

You will identify 9 of those terrorists and falsely let one of them go, because you have 10 terrorists and a 10% false negative rate.

You will also falsely identify 100 innocent people as terrorists, because there are a million people and a 0.01% false positive rate.

From the results of the screening, you have 109 people flagged as terrorists. The test doesn’t tell you which of them are really terrorists and which are innocent. You need to figure that out.

Now, on the one hand this has helped a lot – out of the people the test has flagged, almost one in ten are terrorists. Much easier than sifting through the general populace.

On the other hand, when this test says someone is a terrorist, it’s wrong ninety percent of the time.
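The screening arithmetic above, spelled out (numbers from the post; the 0.01% false positive rate applies to the ~999,990 innocent people, which rounds to the 100 false alarms quoted):

```python
population = 1_000_000
terrorists = 10
false_negative_rate = 0.10    # miss 10% of real terrorists
false_positive_rate = 0.0001  # flag 0.01% of innocent people

true_positives = terrorists * (1 - false_negative_rate)            # 9 caught
false_positives = (population - terrorists) * false_positive_rate  # ~100 innocents flagged

flagged = true_positives + false_positives                         # ~109 people
precision = true_positives / flagged                               # ~0.083
```

So under one in ten flagged people is actually a terrorist, exactly as the post says: a huge improvement over the base rate of 10 in a million, and still wrong about nine times out of ten.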

The wording of the OP isn’t all that clear.

Usually when I see references to accuracy, there is a false negative rate (the percentage chance that something/someone that SHOULD be identified, IS NOT), and a false positive rate (the percentage chance that something/someone that SHOULD NOT be identified, IS).

So, suppose there is a genetic test for some disease, and the test has a 10% false positive rate and a 5% false negative rate.

John tests positive.

Sally tests negative.

What are the true chances that John really does have the disease, and that Sally does not?

Surprisingly, you can’t answer the question yet. You need to know the percentage of the population that actually has the condition.

If only one person out of 300 million ACTUALLY has the disease, but it has a 10% false positive rate, the vast vast majority of those who test positive don’t actually have it. (This is basically an exaggerated form of the problem in the OP).

OK, so I’m going to try to get the right formula here - if I make a mistake, someone please point it out:

Let fp be the false positive rate.
Let fn be the false negative rate.
Let tpp be the true positive rate in the tested population as a whole.

Then, for any test that returns positive, the chance that that is a correct diagnosis is:

((1-fn) * tpp) / ((1-fn) * tpp + fp * (1-tpp))

(Again, no guarantees that I’ve got the formula exactly right, but I think I’ve got it.)

When tpp (the true rate of positives in the tested population) is very low, then even a small-ish false positive rate can cause the testing to be wildly inaccurate.
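As a sanity check, here is that calculation as a Python function, using the same symbols as the post. The denominator is the total probability of testing positive: the true positives plus the false positives.

```python
def positive_predictive_value(fp, fn, tpp):
    """Chance that a positive test result is a true positive.

    fp  -- false positive rate: P(test positive | no condition)
    fn  -- false negative rate: P(test negative | has condition)
    tpp -- prevalence: fraction of the tested population with the condition
    """
    true_pos = (1 - fn) * tpp    # real cases that test positive
    false_pos = fp * (1 - tpp)   # unaffected people that test positive
    return true_pos / (true_pos + false_pos)

# The disease example from the post: 10% false positives, 5% false
# negatives, and only 1 carrier in 300 million. A positive result is
# then almost certainly wrong.
one_in_300m = positive_predictive_value(fp=0.10, fn=0.05, tpp=1 / 300_000_000)
```

Plugging in the parliament numbers (fp=0.1, fn=0.1, tpp=1/3000) reproduces the ~0.3% figure from the article, which is a reassuring cross-check.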

The assumption is apparently that the test has a 90% probability of positive when the subject is a terrorist and a 90% probability of negative when the subject is not a terrorist (in general there is no reason those two numbers have to be the same). Thus,

Pr ( terrorist | positive ) = Pr( pos | terr)Pr(terr) / [ Pr(pos | terr)Pr(terr) + Pr ( pos | not )Pr(not) ]

= 0.9*(1/3000) / [ 0.9*(1/3000) + 0.1*(2999/3000) ]

= .00299202 ~= 0.3%.
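That posterior can be checked directly:

```python
# Bayes' rule for the parliament example: 1 terrorist among 3,000 people,
# with a test that is 90% accurate in both directions.
p_terr = 1 / 3000
p_pos_given_terr = 0.9
p_pos_given_not = 0.1

posterior = (p_pos_given_terr * p_terr) / (
    p_pos_given_terr * p_terr + p_pos_given_not * (1 - p_terr)
)
# posterior is about 0.003, i.e. roughly 0.3%
```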

We had a thread on this issue once before but I can’t find it.

The problem is that the way we talk about these tests in colloquial language is far too simplistic.

Every test has four key numbers:

% true positives
% false positives
% true negatives
% false negatives

So let’s say a test for dingdongitis looks like this:

5% t.p.
3% f.p.
90% t.n.
2% f.n.

Let’s say we test 100 QT3ers. That means that 90 of them will be correctly identified as not having dingdongitis, 5 will be correctly identified as dingdongs, 3 will think they are dingdongs but they’re really not, and 2 are actually dingdongs but they don’t realize it.

The base rate of dingdongs on the boards is actually 5%+2% = 7%. 93% of the people on the board are not dingdongs.

The test, however, only classifies people correctly 95% of the time; five out of every hundred will inevitably be miscategorized.

The rate of successes and failures of the test can create huge numbers of false positives and false negatives depending on the actual likelihood that you actually ARE a dingdong. If the rate of dingdongs in society is low you get different problems than if the rate is high.
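The four joint percentages above pin down everything else. A quick sketch, per 100 people tested as in the post:

```python
# Joint counts per 100 people tested (from the dingdongitis example).
tp = 5   # test positive, actually a dingdong
fp = 3   # test positive, actually fine
tn = 90  # test negative, actually fine
fn = 2   # test negative, actually a dingdong

base_rate = (tp + fn) / 100   # 0.07: 7% of the board are dingdongs
accuracy = (tp + tn) / 100    # 0.95: the test is right 95% of the time
ppv = tp / (tp + fp)          # 0.625: a positive result is only 62.5% reliable
npv = tn / (tn + fn)          # ~0.978: a negative result is much more trustworthy
```

Note how a "95% accurate" test still gives positives that are wrong more than a third of the time here, because dingdongs are rare: that asymmetry between overall accuracy and the reliability of a positive is the whole point of the thread.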

It’s complicated and the way the media usually talks about this stuff is far too simplistic.

We’ve gotten much better since then. I’ve been in the field since the late 90’s, and now everyone has to have at least some background in statistics (and dedicated statisticians on the analysis staff) for anything to be done. There are still plenty of bad papers coming out, but it gets better every year as people spend more time thinking about the problems at hand, and how to get the most statistical power out of the data available.

Genetics went from a wet lab shotgun approach to a statistics/computational approach pretty quickly as the amount of data one could obtain for a unit of money increased dramatically in the last 10 years.