Today I'd like to talk about Bayes' Theorem, especially since it's come up in the comments section several times. It's named after the Reverend Thomas Bayes (rhymes with "phase"). It can be used as a general framework for evaluating the probability of some hypothesis about the world, given some evidence and your background assumptions about the world.
Let me illustrate it with a specific and very non-original example. The police find the body of someone who was murdered! They find DNA evidence on the murder weapon. So they analyze the DNA and compare it to their list of suspects. They have a huge computer database containing 100,000 people who have previously had run-ins with the law. They find a match! Let's say that the DNA test only gives a false positive one out of every million (1,000,000) times.
So the prosecutor hauls the suspect into court. He stands up in front of the jury. "There's only a one in a million chance that the test is wrong!" he thunders, "so he's guilty beyond a reasonable doubt; you must convict."
The problem here—colloquially known as the prosecutor's fallacy—is a misuse of the concept of conditional probability, that is, the probability that something is true given something else. We write the conditional probability as P(B | A), the probability that B is true if it turns out that A is true. It turns out that P(B | A) is not the same thing in general as P(A | B).
When we say that the rate of false positives is 1 in a million, we mean that

P(match | innocent) = .000001

(note that I'm writing probabilities as numbers between 0 and 1, rather than as percentages between 0 and 100). However, the probability of guilt given a match is not the same concept:

P(innocent | match) ≠ .000001
The reason for this error is easy to see. The police database contains 100,000 names, which is 10% of a million. That means that even if all 100,000 people are innocent, there's still nearly a .1 chance that some poor sucker on the list is going to have a false positive (it's slightly less than .1, actually, because sometimes there would be multiple false positives, but I'm going to ignore this since it's a small correction).
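As a quick check of that figure, here's a sketch in Python using the numbers from the example (the variable names are mine):

```python
# Even if everyone in the database is innocent, how likely is at least
# one false positive, given the numbers in the example above?
false_positive_rate = 1 / 1_000_000
database_size = 100_000

# The naive estimate: expected number of false positives.
expected_false_positives = database_size * false_positive_rate   # about .1

# The exact chance of at least one false positive is slightly below .1,
# since the naive estimate counts multiple-hit cases more than once.
p_at_least_one = 1 - (1 - false_positive_rate) ** database_size  # about .095
```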
Suppose that there's a .5 chance that the guilty person is on the list, and a .5 chance that he isn't. Then prior to doing the DNA test, the odds of any given person on the list being guilty are only 1 : 200,000. The positive DNA test makes that person's guilt a million times more likely, but this only increases the odds to 1,000,000 : 200,000, or 5 : 1. So the suspect is only guilty with probability 5/6. That's not beyond a reasonable doubt. (And that's before considering the possibility of identical twins and other close relatives...)
Things would have been quite different if the police had any other specific evidence that the suspect is guilty. For example, suppose that the suspect was seen near the scene of the crime 45 minutes before it was committed. Or suppose that the suspect was the murder victim's boyfriend. Suppose that the prior odds of such a person doing the murder rises to 1 : 100. That's weak circumstantial evidence. But in conjunction with the DNA test, the ratio becomes 1,000,000 : 100, which corresponds to a .9999 probability of guilt. Intuitively, we think that the circumstantial evidence is weak because it could easily be compatible with innocence. But if it has the effect of putting the person into a much smaller pool of potential suspects, then in fact it raises the probability of guilt by many orders of magnitude. Then the DNA evidence clinches the case.
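The odds arithmetic in these two scenarios can be written out in a few lines of Python (the function name is mine; the numbers follow the figures above):

```python
def posterior_probability(prior_odds, bayes_factor):
    """Turn prior odds of guilt plus a likelihood ratio into a posterior
    probability: posterior odds = prior odds * Bayes factor."""
    odds = prior_odds * bayes_factor
    return odds / (1 + odds)

# The DNA match is a million times likelier under guilt than innocence.
BAYES_FACTOR = 1_000_000

# No other evidence: 1 : 200,000 prior odds -> only 5/6 posterior.
p_no_other_evidence = posterior_probability(1 / 200_000, BAYES_FACTOR)

# Weak circumstantial evidence: 1 : 100 prior odds -> .9999 posterior.
p_with_circumstantial = posterior_probability(1 / 100, BAYES_FACTOR)

print(round(p_no_other_evidence, 4))    # about 0.8333 (= 5/6)
print(round(p_with_circumstantial, 4))  # about 0.9999
```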
So you have to be careful when using conditional probabilities. Fortunately, there's a general rule for how to do it. It's called Bayes' Theorem, and I've already used it implicitly in the example above. It's a basic result of probability theory which goes like this:

P(H | E) = P(E | H) P(H) / P(E)
The way we read this is that if we want to know the probability P(H | E) of some hypothesis H given some evidence E which we just observed, we start by asking what was the prior probability P(H) of the hypothesis before taking data. Then we ask what is the likelihood P(E | H) that, if the hypothesis were true, we'd see the evidence that we did. We multiply these two numbers together.
Finally, we divide by the probability P(E) of observing that evidence. This just ensures that the probabilities all add up to 1. The rule may seem a little simpler if you think in terms of probability ratios for a complete set of mutually exclusive rival hypotheses H_i for explaining the same evidence E. The prior probabilities P(H_i) all add up to 1. The likelihood P(E | H_i) is a number between 0 and 1 which lowers the probability of each hypothesis depending on how likely it was to predict E. If H_i says that E is certain, the probability remains the same; if H_i says that E is impossible, it lowers the probability of H_i to 0; otherwise it is somewhere in between. The resulting probabilities add up to less than 1. P(E) is just the number you have to divide by to make everything add up to 1 again.
If you're comparing two rival hypotheses, P(E) doesn't matter for calculating their relative odds, since it's the same for both of them. It's easiest to just compare the probability ratios of the rival hypotheses, because then you don't have to figure out what P(E) is. You can always figure it out at the end by requiring everything to add up to 1.
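Here's a sketch of how the rescaling works, with toy numbers of my own for three rival hypotheses:

```python
# Bayes' Theorem over a complete set of rival hypotheses:
# posterior_i is proportional to prior_i * likelihood_i,
# rescaled so the total is 1.
priors      = [0.7, 0.2, 0.1]   # P(H_i): add up to 1
likelihoods = [0.1, 0.5, 1.0]   # P(E | H_i): how well each predicted E

# Multiplying prior by likelihood leaves a total less than 1...
unnormalized = [p * l for p, l in zip(priors, likelihoods)]

# ...and P(E) is just the sum you divide by to make it add up to 1 again.
p_evidence = sum(unnormalized)                     # about 0.27
posteriors = [u / p_evidence for u in unnormalized]
```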
For example, let's say that you have a coin, and you know it's either fair (P(heads) = .5) or a double-header (P(heads) = 1). Double-headed coins are a lot rarer than regular coins, so maybe you'll start out thinking that the odds are 1000 : 1 that it's fair (i.e. P(fair) = 1000/1001). You flip it and get heads. This is twice as likely if it's a double-header, so the odds ratio drops down to 500 : 1 (i.e. P(fair) = 500/501). A second heads will make it 250 : 1, and a third will make it 125 : 1 (i.e. P(fair) = 125/126). But then you flip a tails and it becomes 1 : 0.
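The same sequential updating can be sketched in code (representing the 1 : 0 endpoint as infinite odds, which is my own convention here):

```python
def update(odds, flip):
    """Update the odds ratio fair : double-header after one flip."""
    p_fair = 0.5                              # fair coin: either side
    p_double = 1.0 if flip == "H" else 0.0    # double-header: always heads
    if p_double == 0:
        return float("inf")   # tails rules out the double-header: 1 : 0
    return odds * p_fair / p_double

odds = 1000.0        # prior: 1000 : 1 that it's fair
history = []
for flip in ["H", "H", "H", "T"]:
    odds = update(odds, flip)
    history.append(odds)

print(history)       # [500.0, 250.0, 125.0, inf]
```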
If that's still too complicated, here's an even easier way to think about Bayes' Theorem. Suppose we imagine making a list of every way that the universe could possibly be. (Obviously we could never really do this, but at least in some cases we can list every possibility we actually care about, for some particular purpose.) Each of us has a prior, which tells us how unlikely each possibility is (essentially, this is a measure of how surprised you'd be if that possibility turned out to be true). Now we learn the results of some measurement E. Since a complete description of the universe should include what E is, the likelihood of measuring E has to be either 0 or 1 in each possibility. Now we simply eliminate all of the possibilities that we've ruled out, and rescale the probabilities of all the other possibilities so that they add up to 1. That's equivalent to Bayes' Theorem.
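This "eliminate and rescale" picture is only a few lines of code (the possible worlds and their priors here are toy values of my own):

```python
# Each possible world either predicts the measurement E or it doesn't,
# so every likelihood is 0 or 1.  Format: world -> (prior, predicts E?)
worlds = {
    "world A": (0.50, True),
    "world B": (0.30, False),
    "world C": (0.15, True),
    "world D": (0.05, False),
}

# Eliminate the worlds that E rules out...
survivors = {w: prior for w, (prior, predicts) in worlds.items() if predicts}

# ...and rescale the rest so the probabilities add up to 1 again.
total = sum(survivors.values())
posterior = {w: p / total for w, p in survivors.items()}
```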
I would have liked to discuss the philosophical aspects of the Bayesian attitude towards probability theory, but this post is already too long without it! Some other time, maybe. In the meantime, try this old discussion here.