Today I'd like to talk about Bayes' Theorem, especially since it's come up in the comments section several times. It's named after St. Thomas Bayes (rhymes with "phase"). It can be used as a general framework for evaluating the probability of some hypothesis about the world, given some evidence and your background assumptions.
Let me illustrate it with a specific and very non-original example. The police find the body of someone who was murdered! They find DNA evidence on the murder weapon. So they analyze the DNA and compare it to their list of suspects. They have a huge computer database containing 100,000 people who have previously had run-ins with the law. They find a match! Let's say that the DNA test only gives a false positive one out of every million (1,000,000) times.
So the prosecutor hauls the suspect into court. He stands up in front of the jury. "There's only a one in a million chance that the test is wrong!" he thunders, "so he's guilty beyond a reasonable doubt; you must convict."
The problem here (colloquially known as the prosecutor's fallacy) is a misuse of the concept of conditional probability, that is, the probability that something is true given something else. We write the conditional probability as $P(A \mid B)$, the probability that $A$ is true if it turns out that $B$ is true. It turns out that $P(A \mid B)$ is not the same thing in general as $P(B \mid A)$.
When we say that the rate of false positives is 1 in a million, we mean that

$$P(\text{DNA match} \mid \text{innocent}) = \frac{1}{1{,}000{,}000}.$$

But what the jury actually needs to know is the reverse conditional probability, $P(\text{innocent} \mid \text{DNA match})$, which can be a very different number.
Suppose that there's a .5 chance that the guilty person is on the list, and a .5 chance that he isn't. Then prior to doing the DNA test, the odds of any particular person on the list being guilty are only about 1 : 200,000. The positive DNA test makes that person's guilt a million times more likely, but this only increases the odds to 1,000,000 : 200,000, or 5 : 1. So the suspect is guilty with only 5/6 probability. That's not beyond a reasonable doubt. (And that's before considering the possibility of identical twins and other close relatives...)
Things would have been quite different if the police had any other specific evidence that the suspect was guilty. For example, suppose that the suspect was seen near the scene of the crime 45 minutes before it was committed. Or suppose that the suspect was the murder victim's boyfriend. Suppose that the prior odds of such a person having done the murder rise to 1 : 100. That's weak circumstantial evidence. But in conjunction with the DNA test, the odds become 1,000,000 : 100 (that is, 10,000 : 1), which corresponds to a .9999 probability of guilt. Intuitively, we think that the circumstantial evidence is weak because it could easily be compatible with innocence. But if it has the effect of putting the person into a much smaller pool of potential suspects, then in fact it raises the probability of guilt by many orders of magnitude. Then the DNA evidence clinches the case.
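If you'd like to check this sort of arithmetic yourself, here is a minimal Python sketch that redoes both odds calculations above. (The helper name posterior_prob is just mine, not standard terminology, and the likelihood ratio of 1,000,000 assumes the test always matches the true culprit.)

```python
from fractions import Fraction

def posterior_prob(prior_odds, likelihood_ratio):
    """Multiply the prior odds of guilt by the likelihood ratio of the
    evidence, then convert the posterior odds back into a probability."""
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (posterior_odds + 1)

# DNA match alone: prior odds 1 : 200,000, likelihood ratio 1,000,000 : 1
print(posterior_prob(Fraction(1, 200_000), 1_000_000))  # 5/6

# DNA match plus the weak circumstantial evidence: prior odds 1 : 100
print(posterior_prob(Fraction(1, 100), 1_000_000))      # 10000/10001, about .9999
```

Multiplying the prior odds by the likelihood ratio does all the work; converting back to a probability at the end is just odds / (odds + 1).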
So you have to be careful when using conditional probabilities. Fortunately, there's a general rule for how to do it. It's called Bayes' Theorem, and I've already used it implicitly in the example above. It's a basic result of probability theory which goes like this:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$
Here, given some evidence $E$ which we just observed, we start by asking what was the prior probability $P(H)$ of the hypothesis before taking data. Then we ask what is the likelihood $P(E \mid H)$ that, if the hypothesis $H$ were true, we'd see the evidence $E$ that we did. We multiply these two numbers together. Finally, we divide by the probability $P(E)$ of observing that evidence $E$. This just ensures that the probabilities all add up to 1.

The rule may seem a little simpler if you think in terms of probability ratios for a complete set of mutually exclusive rival hypotheses $H_1, H_2, H_3, \ldots$ for explaining the same evidence $E$. The prior probabilities $P(H_1), P(H_2), P(H_3), \ldots$ all add up to 1. The likelihoods $P(E \mid H_i)$ are numbers between 0 and 1 which lower the probabilities of the hypotheses depending on how likely they were to predict $E$. If $H_i$ says that $E$ is certain, the probability remains the same; if $H_i$ says that $E$ is impossible, it lowers the probability of $H_i$ to 0; otherwise it is somewhere in between. The resulting probabilities add up to less than 1. $P(E)$ is just the number you have to divide by to make everything add up to 1 again.
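To make the bookkeeping concrete, here is a minimal Python sketch of this multiply-then-renormalize procedure, with the hypothesis set represented as a dictionary (the function name bayes_update is my own choice, not anything standard):

```python
def bayes_update(priors, likelihoods):
    """Posterior P(H_i | E) = P(E | H_i) * P(H_i) / P(E), where P(E) is just
    the normalizing constant that makes everything add up to 1 again."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    p_evidence = sum(unnormalized.values())  # this is P(E)
    return {h: u / p_evidence for h, u in unnormalized.items()}

# The DNA example, with "guilty" and "innocent" as the two rival hypotheses
# about a particular person in the database:
print(bayes_update(
    priors={"guilty": 1 / 200_000, "innocent": 1 - 1 / 200_000},
    likelihoods={"guilty": 1.0, "innocent": 1e-6},
))
# -> guilty comes out to about 0.833, i.e. the 5/6 from before
```

Notice that $P(E)$ never has to be supplied separately: it falls out as the sum of the unnormalized products.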
If you're comparing two rival hypotheses, $P(E)$ doesn't matter for calculating their relative odds, since it's the same for both of them. It's easiest to just compare the probability ratios of the rival hypotheses, because then you don't have to figure out what $P(E)$ is. You can always figure it out at the end by requiring everything to add up to 1.
For example, let's say that you have a coin, and you know it's either fair ($P(\text{heads}) = 1/2$) or a double-header ($P(\text{heads}) = 1$). Double-headed coins are a lot rarer than regular coins, so maybe you'll start out thinking that the odds are 1000 : 1 that it's fair (i.e. $P(\text{fair}) \approx .999$). You flip it and get heads. This is twice as likely if it's a double-header, so the odds ratio drops down to 500 : 1 (i.e. $P(\text{fair}) \approx .998$). A second heads will make it 250 : 1, and a third will make it 125 : 1 (i.e. $P(\text{fair}) \approx .992$). But then you flip a tails and it becomes 1 : 0.
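Here's that running tally as a few lines of Python, just to make the updating explicit (a sketch, with heads and tails written as "H" and "T"):

```python
def update_odds(odds_fair, flip):
    """Update the odds (fair : double-header) after one flip. Heads is half
    as likely under 'fair' as under 'double-header', so it halves the odds;
    tails is impossible for a double-header, so the odds become 1 : 0."""
    if flip == "H":
        return odds_fair / 2
    return float("inf")  # "infinity to one" in favor of fair, i.e. 1 : 0

odds_fair = 1000.0  # start out at 1000 : 1 that the coin is fair
for flip in ["H", "H", "H", "T"]:
    odds_fair = update_odds(odds_fair, flip)
    print(flip, odds_fair)  # prints 500.0, 250.0, 125.0, then inf
```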
If that's still too complicated, here's an even easier way to think about Bayes' Theorem. Suppose we imagine making a list of every way that the universe could possibly be. (Obviously we could never really do this, but at least in some cases we can list every possibility we actually care about, for some particular purpose.) Each of us has a prior, which tells us how unlikely each possibility is (essentially, this is a measure of how surprised you'd be if that possibility turned out to be true). Now we learn the results of some measurement $E$. Since a complete description of the universe should include what $E$ is, the likelihood of measuring $E$ has to be either 0 or 1. Now we simply eliminate all of the possibilities that we've ruled out, and rescale the probabilities of all the other possibilities so that they add up to 1. That's equivalent to Bayes' Theorem.
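Here is that picture as one more short Python sketch; the "possible worlds" below are just the coin example from before, split up by what the first flip will show (the numbers come from the 1000 : 1 prior, and the function name is mine):

```python
def eliminate_and_rescale(prior, consistent_with_evidence):
    """Throw out every possibility the measurement has ruled out, then
    rescale the survivors so their probabilities add up to 1 again."""
    surviving = {w: p for w, p in prior.items() if consistent_with_evidence(w)}
    total = sum(surviving.values())
    return {w: p / total for w, p in surviving.items()}

# Possible worlds: (which coin it is, what the first flip will show).
# In each fully specified world the flip's outcome is determined, so the
# likelihood of any particular measurement is either 0 or 1.
prior = {
    ("fair", "H"): 500 / 1001,
    ("fair", "T"): 500 / 1001,
    ("double-header", "H"): 1 / 1001,
}

# We observe heads: the ("fair", "T") world is eliminated and the rest are
# rescaled, leaving 500 : 1 odds in favor of "fair", just as before.
print(eliminate_and_rescale(prior, lambda world: world[1] == "H"))
```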
I would have liked to discuss the philosophical aspects of the Bayesian attitude towards probability theory, but this post is already too long without it! Some other time, maybe. In the meantime, try this old discussion here.