Originally posted by Steven B. Harris on sci.cryonics, on September 5, 1996.

Bayesian Reasoning in Science

Popper championed the ideality of making scientific hypotheses which are "far-reaching" (i.e., which make broad predictions of many different kinds of phenomena with relatively simple mathematical machinery) and which also are "bold." By "bold," Popper means hypotheses which are falsifiable, and which predict effects which are not likely to be observed unless the hypothesis is true. Bold hypotheses which are tested in this manner (by observing predicted "odd" effects) and which pass the test, are said to be "corroborated," by which Popper means that they are preferred. For Popper, the "truth" content of science (to the extent that the word has meaning) lies in its stock of well-corroborated bold and far-reaching hypotheses.

One school of more modern philosophers of scientific theorizing is called the "Bayesians," and its members are in some ways successors to Popper. Bayesians are named for Bayes' theorem, which calculates the probability of one event given another, if one knows certain other probabilistic associations.

Let us stop to introduce this idea. One of the simplest expositions of Bayes' theorem occurs in the context of the practice of medicine.

Consider, for instance, a doctor who has a throat swab test which purports to help him tell whether he is looking at a "strep throat" (throat infection caused by strep bacteria), or not. Of course, no test is perfect. This hypothetical test is 95% sensitive (it turns color for 95 out of every 100 genuine strep throats tested) and 97% specific (it doesn't turn color on 97 of every 100 genuinely strep-free throats). Now the doctor has the following statistical problem: knowing how good the test is, if the doc tests a person for strep throat and the test comes out positive, what is the probability that the doc is looking at a genuine case of strep, and not a mis-diagnosis due to test failure?

The answer to the above is surprising. The surprise is that you cannot calculate the answer with the information given! For what you also need to know, in order to know what a positive test result means, is the incidence of strep in the population being tested, or in other words, the prior likelihood that the person has strep before you knew the result of the test.

Why? The reason is that it is this prior probability information which, along with the specificity, gives you the background noise against which your test must operate. If one knows the fraction of sore throats which have strep (a priori probability of disease), the fraction of strep throats which test positive for strep (sensitivity of the test), and the fraction of non-strep throats which test negative for strep (specificity of the test), one can calculate what fraction of positive-test throats will have strep. It is this last number, called the "positive predictive value" of the test, which the average clinician is really interested in. The relationship among all these values is expressed mathematically in Bayes' theorem.
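For readers who like to see it written out, the relationship can be put roughly as follows. The symbols are my own shorthand and not part of the original discussion: D stands for "has strep," + for "test is positive," and "prev," "sens," and "spec" for the prior probability, sensitivity, and specificity just described.

```latex
\mathrm{PPV} = P(D \mid +)
  = \frac{\mathrm{sens} \cdot \mathrm{prev}}
         {\mathrm{sens} \cdot \mathrm{prev} + (1 - \mathrm{spec})\,(1 - \mathrm{prev})}
```

The numerator is the fraction of everyone tested who are true positives; the second term in the denominator is the fraction who are false positives.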

Let us illustrate with some numbers how Bayes' theorem works. Suppose that 1 person in 1000 has a strep throat at any time. Suppose that you test a large sample of people (a thousand people, say) at random. If this is done with a test that is 97% specific, you will get 3% false positives, or about 30 false positives. On the other hand, you'll probably (with 95% certainty) pick up that one guy in the thousand who really has a strep infection. So each time you get a positive strep test during random testing of the first 1000 people you run across, the chance that you've found the 1 guy in there who really has a strep throat is about 0.95 expected true positives out of roughly 31 expected positives in all, which is pretty small (3% or so). Here, a positive strep test is not very predictive of real disease.

On the other hand, suppose that 20% of people with red, sore throats and swollen lymph nodes actually have strep throats, and you test 1000 people who all have these symptoms. If you get a positive test, now what is your chance that you're seeing a genuine case of strep throat? Remember, in 1000 people with sore throat, 200 will have strep, and 800 won't. You can also expect that in your thousand sick people, 3% of the 800 who don't have strep will test positive anyway (24 people), and that 95% of the 200 who do have strep will also test positive (190 people). Out of 190 plus 24 people who test positive, then, 190 really have strep, so a positive test here means that you're looking at a 190/214 = 89% chance that you're really seeing the disease. These are much better odds, and now a positive test means something! A pretty high probability. Interestingly, then, the value of an imperfect corroborative medical test, depends in part on the odds of the diagnosis being present to begin with.
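The arithmetic of the two strep scenarios can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function name and variable names are my own, and the numbers are the hypothetical ones used in the text (95% sensitivity, 97% specificity, priors of 1 in 1000 and 1 in 5).

```python
def positive_predictive_value(prior, sensitivity, specificity):
    """Fraction of positive tests that are true positives (Bayes' theorem)."""
    true_positives = sensitivity * prior
    false_positives = (1.0 - specificity) * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Random screening: 1 person in 1000 has strep.
random_screening = positive_predictive_value(
    prior=0.001, sensitivity=0.95, specificity=0.97)

# Symptomatic patients: 1 in 5 has strep.
symptomatic = positive_predictive_value(
    prior=0.20, sensitivity=0.95, specificity=0.97)

print(f"random screening: {random_screening:.1%}")  # about 3.1%
print(f"symptomatic:      {symptomatic:.1%}")       # about 88.8%, i.e. 190/214
```

The same test, with the same sensitivity and specificity, gives wildly different answers depending only on the prior.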

We are now in a position to use these ideas in a more general way. Bayesians use Bayes' theorem as doctors do with the strep test, but generalize it to an inductive test of scientific hypotheses. The idea that the person standing before a doctor with a positive strep test actually has a strep throat, is really just a kind of hypothesis, which has a probability attached to it. One asks: what is the probability that the hypothesis "this patient with a sore throat has strep" is true, given the evidence of a positive strep test? When one does a test for strep, one is actually testing a scientific hypothesis, albeit a very specific one. Still, the whole idea can be generalized into a method of science. One may ask: What is the probability of the truth of any hypothesis, given a predicted effect of the hypothesis, which is actually found, upon experiment? To answer this all-important question, the Bayesians say simply that one must have some idea of three probabilities:

  1. One must know (or have some idea of) the prior probability of the truth of the hypothesis, independent of the experimental result or effect being particularly considered (i.e., an a priori plausibility). This corresponds to our earlier examples of the prevalence of strep in two situations (1 in 1000 in randomly sampled people, vs. 1 out of 5 in people with sore throats).

  2. One must have some idea of the "sensitivity" of the predicted experimental effect. In other words, if the hypothesis is true, how likely are we to see the effect? Is the effect "guaranteed" if the hypothesis is correct, or might it not be seen anyway?

  3. Finally, one must have some idea of the "specificity" of the experimental effect looked for. We want to know the following: Suppose the hypothesis is not true; how likely are we to see the effect predicted from the hypothesis anyway? (Strictly, this probability is the false-positive rate, or one minus the specificity; the more specific the effect, the lower it is.) The "specificity" probability relates to the "boldness" of hypotheses that Popper was fond of: We are much more likely to pay attention to those hypotheses which predict an effect which should definitely not be there otherwise. Some famous examples of this principle include the bending of starlight near the sun (theory of relativity), the finding of a bright spot in the center of the shadow of small objects (the theory of the wave nature of light), and the finding of a diffraction pattern when electrons are beamed through a crystal (the theory of the wave nature of particles). None of these things plausibly has any way to be there, unless the theories in question (or something like them) describe reality.

All three probabilities mentioned above act as powerful filters to help us design experiments and evaluate hypotheses. Here is how they work in practice:

The "sensitivity" of an effect relates to how we go about deciding what experiments to do, and is a key part of the scientific method. We try to avoid wasting our experimental time looking for effects which aren't clearly predicted (i.e., predicted effects which might not be observed or observable, even if the hypothesis is true). When you build that gravity wave telescope, does Einstein's theory guarantee it will pick up gravity waves? No. Einstein's theory says nothing about how difficult such things will be to pick up out of background noise.

By contrast to sensitivity, the "specificity" of an experimental effect is part of what we seek to know when we include placebos and controls in our experiment. In modern science we think of these things as part of the primary experiment itself, but actually they are separate experiments designed to get at the vexing question of how likely it is that we will see our "predicted" experimental effect, even if our predicting hypothesis is false. It is here also that the value of true predictions vs. "post hoc predictions" (ad hoc theories) comes into play. Obviously, if a theory was cleverly constructed to "predict" an already known observation, the chance that the retrospective "effect" would be observed, even if the theory is false, is 100%. An example here is Bohr's 1913 quantum planetary model of the atom, which "predicted" the already known spectra of hydrogen (surprise), but which failed to do much else. We now know that this failure was due to the fact that key features of the model were false.
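The point about post hoc "predictions" can be made concrete with a small Python sketch. The function and the numbers here are purely illustrative assumptions of mine (a prior of 10%, and made-up probabilities for a "bold" prediction); nothing in them comes from a real experiment. The function is just Bayes' theorem stated for a hypothesis and a predicted effect.

```python
def posterior(prior, p_effect_if_true, p_effect_if_false):
    """Probability the hypothesis is true, given that the effect was seen."""
    numerator = p_effect_if_true * prior
    return numerator / (numerator + p_effect_if_false * (1.0 - prior))


prior = 0.10  # illustrative prior plausibility of the hypothesis

# A bold prediction: the effect is very unlikely unless the hypothesis is true.
bold = posterior(prior, p_effect_if_true=0.9, p_effect_if_false=0.01)

# A post hoc "prediction": the effect was already known, so it is seen
# with certainty whether or not the hypothesis is true.
post_hoc = posterior(prior, p_effect_if_true=1.0, p_effect_if_false=1.0)

print(f"bold prediction confirmed: {bold:.2f}")      # about 0.91
print(f"post hoc 'prediction':     {post_hoc:.2f}")  # 0.10, same as the prior
```

When the effect would be seen regardless, observing it leaves the probability of the hypothesis exactly where it started.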

Finally, the "prior probability" of the hypothesis being correct, is really what we're thinking about when we consider a claimed experimental "observation" in light of such things as how much corroboration a hypothesis has had in the past, how many people have reported the same observation, how strongly the scientific community already believes the hypothesis, who the asserter of the new idea is (i.e., appeal to authority and skill), how much a hypothesis resembles other successful hypotheses, etc. This is also the mechanism by which we pare an infinite number of possible hypotheses down to a few, before we get to work on a given experiment. It is also the mechanism by which we decide how much intellectual "airtime" to give possible crackpots, and their theories.

Consideration of the prior probability of a hypothesis when considering claims of new evidence that support it, is a very important part of science, but one not discussed much. It is a point also not understood by many who wish science would pay more attention to very unlikely pet theories. It is quite rational of science not to do so, however, in just the same way it would be rational of a doctor not to do a strep test if he did not believe there was much chance of the tested person having strep in the first place. It would also be rational for a doctor to ignore a strep test which happened to be done on a random person, and which turned out positive. In this case, a positive test result (which would constitute evidence, but not extraordinary evidence, of strep infection, given a specificity of only 97%) would not change the doctor's mind anyway about the person having a strep throat infection.

It's clear that the modern scientific method uses all three components of Bayes' theorem when evaluating hypotheses, although it does so in a somewhat informal and non-mathematical way. Indeed, scientists themselves may not think of what they are doing in Bayesian terms. When Dr. Carl Sagan (for example) said that "extraordinary claims require extraordinary evidence," he was talking about the a priori probability of the truth of a given hypothesis, and how a low a priori probability needs a lot from the other two Bayesian probability components to give a reasonable probability that a previously unlikely hypothesis is nevertheless correct.
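One way to put a number on "extraordinary evidence" is the odds form of Bayes' theorem: posterior odds equal prior odds times the ratio sensitivity / (1 - specificity), usually called the likelihood ratio. The Python sketch below asks how large that ratio must be to bring a hypothesis up to even odds. The 50% target is an arbitrary assumption of mine, and the function names are my own; the strep figures are the hypothetical ones from earlier.

```python
def odds(probability):
    """Convert a probability to odds in favor."""
    return probability / (1.0 - probability)


def required_likelihood_ratio(prior, target_posterior):
    """Likelihood ratio needed to move the prior up to the target posterior."""
    return odds(target_posterior) / odds(prior)


# Strength of the hypothetical strep test: sensitivity / (1 - specificity).
strep_test_ratio = 0.95 / (1.0 - 0.97)

needed_random = required_likelihood_ratio(prior=0.001, target_posterior=0.5)
needed_symptomatic = required_likelihood_ratio(prior=0.20, target_posterior=0.5)

print(f"strep test provides:      {strep_test_ratio:.1f}")    # roughly 31.7
print(f"needed, random person:    {needed_random:.0f}")       # roughly 999
print(f"needed, symptomatic case: {needed_symptomatic:.0f}")  # roughly 4
```

On these numbers the test is far more than enough for the symptomatic patient and far from enough for the random one, which is Sagan's point in miniature.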

In the same way, the idea of there being a "prior probability" of the truth of a hypothesis is why, if our experiment shows one data point where a scientific "law" (say the conservation of momentum) seems to be violated, we ignore it, and throw out the data (or refuse to consider those sensory impressions "data," which is the same thing). All honest working scientists very early in their careers eventually face the problem that it is not possible to do science without throwing out or ignoring or defining into oblivion a great deal of raw "data," and this is always done by finding ways to "tinker" with an experiment when it produces "data" which are "obviously wrong" -- a process which stops when the experiment begins to produce "data" which are "believable." One can historically see this process even in some of the greatest experiments on record, for example R. Millikan's measurement of the charge of the electron. It crops up nastily in the subjective decision which every scientist faces as to when the "design phase" of an experiment ends, and the "data collection phase" begins.

We live in a day when some scientists, wishing to sound open minded, publicly abhor the idea of rejecting any observation or hypothesis out of hand. In practice, however, every working scientist is forced to do exactly this many times each day, in order to get anything done. The "dirty little practice" of ignoring certain data in one's own experiments, and ignoring crazy-sounding claims in others' experiments, is unavoidable. Bayes' theorem shows why. Without using the idea of prior probabilities, all scientists would be paralysed, since the list of nonsensical claims is endless, and the amount of irrelevant "data" is infinite. If scientists did not ignore most of these things outright, they would be mired forever in nonsense.

To sum up, in science we prefer to test hypotheses that have all three Bayesian probability components of high value, given a certain experimental result which we expect to obtain. Thus, in order to apply Bayes' theorem most profitably, you have to have such an experimental test which is possible to begin with. If a hypothesis does not clearly predict any effects which are producible or reproducible, therefore, it may find itself ignored (for an almost-example, see "string theory" in modern physics).

If a hypothesis does not predict anything new which we can see or check in "reasonable time," we may ignore it, on the grounds that it isn't scientifically "useful." [Cryonics here comes to mind.] When an "alternate" theory proves useful merely by being easier to use to get new predictions, we may not choose to think of such theories as truly alternates anyway, but rather as different ways of expressing the same theory or truth (the many alternate but equivalent formulations of quantum mechanics serve as example here). We all agree, however, that experiments waste our time if they do not winnow a small number of significant and different competing theories. We do not want to waste our time; and furthermore, the making of a public bet to test two competing theories (the critical and essential act of all science) involves a substantial amount of effort and risk.

Finally, in science we like hypotheses that are a priori likely to be true; which also predict a particular measurable effect clearly and with some assurance of reproducibility; and finally, which predict an effect which we definitely do not predict (by all that we know) otherwise, and therefore which we weren't at all expecting from the old theories. In other words, the less we "expect" the result of an experiment given the old ideas, the more important it becomes, other things being equal.

A low value for any of the Bayesian numbers can swamp the others: we have already mentioned that if an experimental result is completely unexpected, and an "explanatory hypothesis" for it judged a priori very unlikely, it will be ignored, no matter how well-done or clever the study. The reception of the scientific community to Pons and Fleischmann's claims of "cold fusion" serves here as a good example of this phenomenon in practice. Sometimes we are caught rejecting valid observations by this method, to be sure, but the alternative, unfortunately, is a sort of "new-age" intellectual anarchy and chaos.

Again, it's important to emphasize that the upshot of Bayes' theorem is that all three Bayesian conditions must be met to result in optimum conditions for a public demonstration or bet, like Elijah's public bet with the priests of Baal in the Bible, or the equally well-known public bets of Pasteur, Hawking, or Amazing Randi. The experiment or public bet, however, is a very final step in a long process. As in American politics, where run-off candidates may be chosen out of a huge number of possibilities in a pre-election process, competing scientific theories are in practice always winnowed down to a pre-chosen few before experiment or public demonstration is ever done. The "smoke filled room" politics of scientific theorizing is a crucial part of the scientific method that we cannot ignore, and public demonstration (a la bet) is the other crucial part.

Neither of these parts of the real scientific method ("real" in the sense of being the method that is actually used by real scientists) has historically drawn the attention it deserves, until very recently. Only in the last generation or two have philosophers begun to understand what at least some humans have been very successfully doing epistemologically for as long as humans have existed. We tend to date the "scientific method" as coming in about the time of Francis Bacon, and being only four centuries old or so.

Contrary to myth, however, what came in around the time of Bacon (ca 1600) was not so much the "scientific method" but (prosaically) the first wide use of the movable type printing press (invented 150 years before), which allowed for the first time dissemination of boring experimental data, and with it, attendant standards of good record keeping. Vesalius' and Kepler's arguments, for instance, could not have survived in hand-copied texts. The second great leap forward in science after the printing press was the use of quantitative statistics in scientific investigations, a practice which began in the middle of the 18th century. This practice, which did not become universal until our own century, naturally brought a closer connection to Bayesian inductive methods. Before formal statistics, scientific theory making had been ruled by "common sense," which is far more ancient than four hundred years.

This, then, is the paradox which historians of science have had to face: In one way the scientific method in practice is so ancient that we do not know its origins. In another way, our understanding of what the actual method of science is dates only from the last half of our own century. It's no wonder that the issue continues to confuse.


Afterword

Bayesian reasoning asks questions about how you interpret the likelihood of hypotheses being true in the light of certain data, given your assumptions about the data, and how likely, or not likely, the same data would be found, even if the hypothesis were not true. And so on. There are millions of such situations relevant to cryonics.

Let's take a dramatic one: suppose somebody proves that they can freeze a mouse in liquid nitrogen and bring it back to functional life, maze-running ability and all. Bayes' theorem asks us how this piece of data impacts the likelihood that human suspended animation is possible, given the best conditions for freezing (not quite cryonics as we know it, but close enough).

The answer is: that depends on your assumptions. How much is a mouse brain like that of a human? Can you think of scenarios in which mice can be frozen with no problem, but in which technology is still stumped for the long term by humans? How likely are these scenarios? Why do you think so? And so on. Thus, what this mouse experiment (like any mouse experiment) implies to you about what will be true for humans, depends on your prior assumptions. Maybe you think mice don't have souls, but people do. Maybe you're impressed with the differences between what mice do with their cerebral cortexes, vs what people do with theirs. Whatever you think, you are using Bayesian logic if you put these prior assumptions to work for you in evaluating data.

It's labyrinthine, and the more inferences have to be stretched to support a hypothesis (suppose you can only freeze nematodes...?), the tougher it gets.

And Bayesian logic doesn't apply only to scientific induction, but to all logical inference about the future. Cryonics, for instance, also only works if the social part works. What does the survival of the Catholic church imply about the theory that a cryonics organization will survive as a social institution, across centuries? Anything? Has the church survived in the way a cryonics organization must? And what about population trends and population in this century -- how relevant are they to next century, and thus to cryonics? And so on. It's apparent that Bayesian thinking is used (must be used) every time one rationally argues the likelihood of a hypothesis, but the reasoning for such inference is not usually very explicit. And worse still, usually it doesn't help much to make these assumptions explicit, except perhaps to cool tempers. It's important for people to see that others are also thinking rationally to come to different conclusions, by simply starting with different premises, each of which was also arrived at by a similar inductive logical process starting with different assumptions, and so on, back into epistemological infinity.

Objectivists hate the idea of "turtles all the way down," as the old astronomical joke goes, but there it is.


Oh yes, and the turtle joke...

After a lecture one day, an astronomer is accosted by an elderly lady who states she believes the Earth is a hemisphere supported by four elephants, standing on the back of a giant turtle.

"And what is the turtle standing on?" asks the astronomer quizzically.

"A far larger turtle."

"Okay," says the astronomer in exasperation, "and what is that one standing on?"

"It's no good trying to trap me, young man. It's turtles all the way down."