Friday, June 29, 2012

Statistics wars

kw: book reviews, nonfiction, mathematics, statistics, history

It is curious. For as long as there have been formal, mathematical methods, skill with them has been seen as the epitome of intelligence. Yet modern computing power has shown us that machinery can easily do what our own "wetware" struggles with, while the things we take for granted are usually the ones our computer machinery has yet to do well.

Nobody could ever get a BS degree in Walking to the Corner Store to Buy Milk. Yet, millions of hours of effort and billions of dollars have been spent to develop a robotic device with the ability to perform this simple task, so far without clear success. On the other hand, you might have a BS or even MS in mathematics, and find that you need to check your work using Mathematica, to be certain you haven't dropped a minus sign or parenthesis somewhere.

I had six years of college- and grad-school-level math (calculus, differential equations, advanced physics analytics, and statistics). I recall that every math professor said, hundreds of times, after showing us how to set up a system of equations, "Now it is all just turning the crank." Maybe so, but "turning the crank" was something even the most obsessive of us (frequently me, yet I didn't always get an A) could not reliably do, and it might take all night. Yet even on a generic IBM PC AT-class machine running in Turbo mode at 10 MHz (that is 1/300 the speed of today's processors), with the first edition of Mathematica, we could set up those same equations and get the crank turned for us in ten seconds or less.

That branch of mathematics called statistics is, today, a double enterprise. The more familiar discipline, hypothesis testing, is based on gathering enough data for a standard test such as the t-test or F-test to produce a result precise enough to reject a competing hypothesis, usually the null hypothesis. Maybe you take the next twelve light bulbs from the factory floor, put them in sockets, and turn them on. You set up some kind of machinery to record when each one burns out. With incandescent, 100-watt bulbs, this could take several months; with CFLs or LEDs it could take years. Regardless, once the last bulb burns out, you can calculate a distribution function, figure its average value, and state with a certain measure of confidence that most of the bulbs will burn for, say, 1000 hours. By the time this information makes it to the retail package, it simply reads, "Lasts 1000 hours!"

Here is another kind of statistics. You have kept track of many such tests over many years. The tests are now a quality control exercise. You know pretty well what kind of distribution function there is. For light bulbs, this function is rather broad, so a "1000 hour" bulb has a 10% chance of burning out after only 500 hours or less, and a 10% chance of lasting at least 1350 hours. A few bulbs have been known to hang in there for more than 2000 hours, which is why a complete test usually takes almost three months.

But you know the shape of the function. With this knowledge, you take twelve bulbs as usual, and start the test. After one month (720 hours), three of them have burned out, and you know the times. You probably have enough information to decide whether the current batch of bulbs is worthy of the "1000 hours!" designation. In some cases, you might have to wait for a fourth bulb to burn out to be sure. You now have a QC test that gives you the result you need in 4-6 weeks rather than 10-12. And you have some leftover bulbs with some life in them.
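The shortened test can be sketched in a few lines of Python. This is only an illustration, not the procedure from my old job. It assumes the lifetimes follow a Weibull curve (my choice of function) whose shape is pinned down by the two figures quoted above, 10% gone by 500 hours and 10% lasting past 1350 hours. The three burn-out times, the flat prior over the batch's scale, and the 950-hour pass mark are all made up for the example.

```python
import math

def weibull_from_quantiles(t_lo, p_lo, t_hi, p_hi):
    """Return (shape, scale) of the Weibull curve through two quantiles."""
    a = -math.log(1.0 - p_lo)   # equals (t_lo / scale) ** shape
    b = -math.log(1.0 - p_hi)   # equals (t_hi / scale) ** shape
    shape = math.log(b / a) / math.log(t_hi / t_lo)
    scale = t_hi / b ** (1.0 / shape)
    return shape, scale

def log_likelihood(scale, shape, failures, n_running, t_now):
    """Log-likelihood of a test stopped early: known burn-out times plus survivors."""
    ll = 0.0
    for t in failures:
        z = t / scale
        # Log of the Weibull density at an observed burn-out time.
        ll += math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape
    # Each bulb still burning contributes its survival probability at t_now.
    ll -= n_running * (t_now / scale) ** shape
    return ll

def posterior(grid, prior, shape, failures, n_running, t_now):
    """Bayes' Rule on a grid of candidate scales: prior times likelihood, normalised."""
    logs = [math.log(p) + log_likelihood(s, shape, failures, n_running, t_now)
            for s, p in zip(grid, prior)]
    top = max(logs)                       # subtract the maximum to avoid underflow
    weights = [math.exp(x - top) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]

def prob_scale_at_least(grid, post, threshold):
    """Posterior probability that the batch's scale is at or above a threshold."""
    return sum(p for s, p in zip(grid, post) if s >= threshold)

# The shape known from years of earlier tests (here: from the quoted quantiles).
shape, nominal_scale = weibull_from_quantiles(500.0, 0.10, 1350.0, 0.90)

grid = [600.0 + 10.0 * i for i in range(81)]   # candidate scales, 600 to 1400 hours
prior = [1.0 / len(grid)] * len(grid)          # flat prior, an assumption

failures = [410.0, 560.0, 690.0]               # hypothetical burn-out times in hours
post = posterior(grid, prior, shape, failures, n_running=9, t_now=720.0)
p_ok = prob_scale_at_least(grid, post, 950.0)  # 950 hours is a made-up pass mark
print(f"shape={shape:.2f}  P(scale >= 950 h | data)={p_ok:.2f}")
```

The nine bulbs still burning are not wasted information. Each one tells you the batch survived at least 720 hours, and that is what lets three failures do the work of twelve.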

That is one kind of Bayesian analysis. Another is this. You or your significant other, being of appropriate age (over 40, or 50, depending on whom you believe), goes to the breast center for a mammogram. Suppose that three days later a phone call comes: "There is a suspicious shadow. Please consult your physician." What are the odds that you actually have cancer?

Analyze it this way.
  • The test catches 80% of cases of genuine cancer.
  • Of 10,000 women who have mammograms, 40 have cancer.
  • Thus 32 of those 40 will receive that daunting phone call.
  • The test has a "suspicious shadow" or other indication that might indicate cancer just over 10% of the time, whether there is cancer or not.
  • The actual number is 1,028 per 10,000. The 32 real cancers are among them (Remember, 8 cancers have been missed).
  • That leaves 996 women who do not have cancer, but can't be sure. They got the same phone call.
  • 32/1,028 = 0.031. That is 3.1%.
It is up to you to decide if you want to do nothing and hope you are among the 97% who have no cancer. Many women opt to undergo further testing and further expense. But for every genuine case of cancer caught, 31 women who were perfectly well were subjected to fear, and probably pain (biopsies).
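The arithmetic in that list is easy to check. The short Python sketch below uses only the figures given above; the variable names are mine. It works the answer two ways, by counting women and by plugging rates into Bayes' Rule, and confirms that both give the same number.

```python
# Figures from the list above.
women = 10_000
with_cancer = 40
sensitivity = 0.80      # the test catches 80% of genuine cancers
positives = 1_028       # women who get the "suspicious shadow" call

# Counting approach: how many of the phone calls are real cancers?
true_positives = round(sensitivity * with_cancer)   # 32
false_positives = positives - true_positives        # 996
by_counting = true_positives / positives

# Bayes' Rule approach: P(cancer | call) = P(call | cancer) * P(cancer) / P(call)
prior = with_cancer / women
p_call = positives / women
by_bayes = sensitivity * prior / p_call

print(f"{by_counting:.3f} {by_bayes:.3f}")          # both print as 0.031
print(f"false alarms per real cancer: {false_positives / true_positives:.0f}")
```

The answer is so low because the starting belief, 40 in 10,000, is so small. Even a decent test cannot lift a 0.4% prior very far in one step.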

This mammography example is presented in greater detail in an appendix to The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines & Emerged Triumphant From Two Centuries of Controversy, by Sharon Bertsch McGrayne. The light bulb analysis is not; I got it from experiences in a prior career.

Bayes' Rule is simple. When you update your initial belief with current information, you get a new and improved belief. In other words, it is a statistical method that incorporates feedback.
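Written as a formula, with H standing for the hypothesis and D for the new data (the symbols are my shorthand, not the book's), the Rule looks like this:

```latex
P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)},
\qquad
P(D) = P(D \mid H)\,P(H) + P(D \mid \neg H)\,P(\neg H)
```

P(H) is the initial belief, P(D | H) is how likely the data would be if the hypothesis were true, and P(H | D) is the new and improved belief.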

Here is an example from my father. Though he was in the Corps of Engineers for most of WWII, he was a gunnery officer for part of a year. Suppose you have a mortar squad with a dozen mortars, and your task is to demolish an enemy bunker a mile or two distant. Mortars are fired above a 45° angle, so they go a mile or more up on their way to the target. You don't know the winds "up there". How do you determine your "windage", and its variability? One way is to fire all 12 mortars using the range sighted in by your spotter, at the visual azimuth angle, and see how much you miss. The center of your impact pattern gives you the average windage, and the scatter gives you the gustiness figure. That'll work, but you can do it with three shells.

You aim all three "on target", but fire them a couple of seconds apart. A second round of three, if needed, and adjusted for the average "miss", should allow you to correct for average windage. The scatter tells you how variable the winds are, and you compensate by adding some divergence to your aiming when you fire all 12, which will ensure that some of the mortar rounds strike in the most effective locations. The bottom line is, this will get the job done with the smallest number of mortar rounds.
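As a rough sketch of the bookkeeping (not a real gunnery procedure), the average miss and the scatter of one three-round salvo can be worked out like this in Python. The impact offsets and the use of metres are invented for the example.

```python
from statistics import mean, stdev

def windage_estimate(misses):
    """Return (aim correction, scatter) from impact offsets given as (east, north)."""
    east = [m[0] for m in misses]
    north = [m[1] for m in misses]
    # Aim the next salvo opposite to the average miss.
    correction = (-mean(east), -mean(north))
    # The sample standard deviation stands in for the "gustiness".
    scatter = (stdev(east), stdev(north))
    return correction, scatter

# Hypothetical impacts of the first three rounds, in metres from the bunker.
first_salvo = [(40.0, -10.0), (55.0, 5.0), (25.0, -25.0)]
correction, scatter = windage_estimate(first_salvo)
print(correction, scatter)
```

If a second salvo of three is fired with the correction applied, the same function run on its misses shows how much windage is left over.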

The book is a delightful history of Bayes' Rule and the people who used it. A presentation of the Rule got into print only after Thomas Bayes died, and became much better known years later due to publications by Pierre-Simon Laplace. There are some who think the Rule should be named for Laplace, but a large collection of methods already bears his name, such as the Laplace transform used to simplify the solution of differential equations. We can let Bayes have this one.

The singular fact is the nearly two centuries of determined opposition Bayesian methods endured. For about six generations, "frequentists" (roughly, those who rely only on methods that require no prior estimates) made the name Bayes into a dirty word, in professional statistical circles at least. Yet among those with real jobs to do, making an informed guess and then refining it was the only way to solve their problems. In a sense, there was the ivory tower, denigrating the methods that almost everyone else was using to great effect!

This was particularly the case in military circles. Code-breaking is a Bayesian process. The "Enigma" machines used by the Germans and others for encoding messages were based on a pre-war commercial product. So the English and others were able to build machines that duplicated them, and then "try stuff". Encoding and decoding were laborious, even with the machines' help, so the codebreakers worked with snippets. When a trial turned a snippet into something nearly intelligible, they could tweak and tweak until they were ready to decode the entire message the snippet came from. It must be said that the laziness of enemy signals officers made the codebreakers' job easier; it seems that shifting an encoding wheel just one position makes a whole new code, but of course that is the first thing a decoder is going to try.

The manifold uses of Bayes' Rule unfold in story after story, of hunts for lost submarines (on which The Hunt for Red October by Tom Clancy is based), of figuring out how big the critical mass is for an atomic bomb, and of weather forecasting, all in the days before computers. Once computers became ubiquitous, and especially after 1980 or so, software that made Bayesian inference practical for larger and larger problems suddenly brought the Rule into the limelight. Coupled with Markov chain Monte Carlo methods, it is the only practical way to solve many kinds of problems.

Have you ever built and used a Decision Tree? That is a Bayesian process. You have to have at least a guess about the values of a number of things, and their chance (probability) of occurrence. Once you populate the tree, you just "turn the crank"; there are programs that can do this for you. Here is one example of a decision tree worked out by hand:

Once the numbers have been calculated, the largest composite value on the right will indicate which decisions must be taken to achieve this optimal result.
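For anyone who would rather let a program turn the crank, here is a minimal Python sketch of the same calculation. The tree is a made-up one with a single decision and guessed probabilities and payoffs; it is not the hand-worked example, and a real tree would usually have several levels.

```python
def expected_value(outcomes):
    """Probability-weighted value of one branch: sum of chance times payoff."""
    return sum(p * v for p, v in outcomes)

# Hypothetical tree: each decision leads to chance outcomes (probability, payoff).
tree = {
    "launch product": [(0.6, 120_000), (0.4, -50_000)],
    "license design": [(0.9, 40_000), (0.1, 0)],
    "do nothing":     [(1.0, 0)],
}

# Composite value of every branch, then the decision with the largest one.
values = {decision: expected_value(outcomes) for decision, outcomes in tree.items()}
best = max(values, key=values.get)

for decision, value in values.items():
    print(f"{decision:15s} {value:>10.0f}")
print("best:", best)
```

The probabilities are the guesses mentioned above. Change them as better information comes in and rerun, and the preferred branch may change with them.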

A most exciting area, still in early days, is machine learning. Bayesian learning is behind the self-driving vehicles that recently won DARPA prizes for driving in the desert and driving in a city; it is behind the data store, and particularly the data relationships it contained, that enabled the Watson supercomputer to win at Jeopardy. It is also behind Google Translate and the "you might also like" suggestions Amazon makes when you search for a particular book.

Finally, think how you learn something new. You were not born knowing that the letter "B" is pronounced "bee" and is made by a certain way of grunting as you pop your lips apart. As an infant, you heard it again and again, and played around with your mouth until you got a sound that matched "pretty well" what you heard. Meanwhile, inside, to avoid tying up too much valuable gray matter, your brain was remembering motions that "satisfied" you more and forgetting those that didn't. Before long, it could prune away the experience of learning (and you have indeed forgotten it), and just keep a tiny bit of brain machinery that is very good at pronouncing a "B". Bayesian inference is the way we learn.
