February 2011

# Bad Math, Bad Thinking: the BMI and DNA Identification Revisited

THE SERMON

This month's column is about the importance, in today's scientific, technological, and statistics-driven society, of clear thinking. I present two simple mathematical arguments, one to show that a seemingly scientific emperor actually has no clothes, the other that another emperor has so many garments available, we need to be careful which ones we choose to display.

My first argument shows that the much-vaunted body mass index (BMI), used by doctors as a basis for the advice they give to patients and by medical researchers worldwide as a measure in countless studies, is little more than a disguised measure of waist size, which even your granny knew was a perfectly sensible way to tell if you carried too much fat.

My second argument shows that in the FBI DNA database, the chances are actually better than 50% that two unrelated entries share the same 9-locus DNA profile. The significance of this is that a 9-locus match is generally assumed to be enough to guarantee a reliable identification that can form the basis of a guilty verdict. Indeed, for many years, DNA databases stored only 9-locus profiles.

Neither conclusion shows that current practice is inherently dangerous. The BMI is really little more than a fancy way to refer to your girth, and as I indicate, the DNA calculation I give addresses just one aspect of profile databases. In both cases there is more to be said - in the case of DNA identification a lot more. In fact, my point is that DNA identification requires considerable statistical understanding. The two calculations I present are both elementary, requiring just a few minutes scribbling on a piece of scrap paper. But they are, I think, illuminating. The surprising, and worrying, thing is that no one seems to have bothered to pause and make them before. SERMON OVER.

* * * * * *

In past columns, I've discussed the absurdity of the body mass index (Devlin's Angle, May 2009) and the danger of false conviction in a DNA Cold Hit identification case (Devlin's Angle, October 2006). In both cases, I presented what I thought was a fairly sound logical argument to support my claims, but I still receive a slow but steady stream of emails from people who read those columns and try to convince me I am wrong. By and large, their objection to my BMI claim is that the wretched number (and the formula behind it) must make sense, otherwise the CDC, the bulk of the medical profession, and the medical insurance companies would not use it. For my DNA identification rant, the objection is not so much that the FBI uses it, but rather that the conclusion I reached runs against our intuitions.

Now, I have no problem with people disagreeing with me. Heavens, I was a mathematics department chair for four years and a dean for eight, and before that a father of two daughters growing up through adolescence to adulthood, so I have had my fair share of disagreements. What bothers me is that some folks seem (1) all too willing to accept something simply because it looks scientific and some august body advocates it, and (2) unable to adjust their intuitions when faced with evidence that they mislead us, as they do on occasion.

It bothers me because I am in the education business, and the primary goal of education is to help people develop the ability to think for themselves and to reach conclusions based on evidence and rational thought.

It bothers me particularly because, as the ancient Greeks recognized, when taught well, mathematics is one of the best mental disciplines to develop analytic thinking skills. At which point I cannot miss a chance to point you to this excellent video:

But I digress. Going back to the BMI, some simple algebraic reasoning should be enough to show that the emperor has no clothes.

### BMI Rant Revisited

Recall that, ignoring a constant, the BMI formula is

$$\mathrm{BMI} = \frac{M}{H^2}$$

where M is your mass (weight) and H your height.

To a reasonable approximation, the human body can be thought of as an ellipsoid with circular cross section. (I am looking at the logic behind the BMI here. What matters is how the various parameters affect each other; the exact numbers are not significant. You could model the body as a cylinder if you preferred.) The length of the ellipsoid is the person's height, H, and the diameter of the cross section at its mid-point (the waist) is $W/\pi$, where W is the waist size (girth). The volume of the body is then

$$V = \frac{\pi}{6}\, H \left(\frac{W}{\pi}\right)^2 = \frac{H W^2}{6\pi}$$

Hence, if D is the average density of human tissue,

$$M = D V = \frac{D H W^2}{6\pi}$$

Thus:

$$\mathrm{BMI} = \frac{M}{H^2} = \frac{D H W^2}{6\pi H^2} = \frac{D W^2}{6\pi H}$$

In the case of an adult, H is, of course, a constant, as is D (more or less). Thus the formula reduces to

$$\mathrm{BMI} = K W^2$$

where $K = D/(6\pi H)$ is a constant. So what is the BMI really measuring? Your waist! Now just fancy that, the size of your waist indicates whether you are overweight or not! Who would have guessed?

Which raises the question: why doesn't the medical profession just use that far simpler metric? Beats me.
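For readers who would rather check the algebra numerically, here is a minimal Python sketch of the ellipsoid model above. The density of 1000 kg/m^3, the height of 1.75 m, and the waist sizes are illustrative assumptions of mine, not figures from the column, and the function names are hypothetical.

```python
import math

# Assumed illustrative value (not from the column): average body density, kg/m^3.
ASSUMED_DENSITY = 1000.0


def ellipsoid_mass(height_m, waist_m, density=ASSUMED_DENSITY):
    """Mass of an ellipsoid of length H and mid-point girth W: D*H*W^2 / (6*pi)."""
    return density * height_m * waist_m ** 2 / (6 * math.pi)


def bmi(mass_kg, height_m):
    """The usual BMI formula, M / H^2 (constant factor ignored)."""
    return mass_kg / height_m ** 2


def bmi_from_waist(waist_m, height_m, density=ASSUMED_DENSITY):
    """The rewritten form K * W^2, with K = D / (6*pi*H)."""
    k = density / (6 * math.pi * height_m)
    return k * waist_m ** 2


if __name__ == "__main__":
    height = 1.75  # hypothetical adult height in metres
    for waist in (0.80, 0.90, 1.00):  # hypothetical girths in metres
        mass = ellipsoid_mass(height, waist)
        print(
            f"W={waist:.2f} m  M={mass:5.1f} kg  "
            f"M/H^2={bmi(mass, height):5.2f}  "
            f"K*W^2={bmi_from_waist(waist, height):5.2f}"
        )
```

Under the model the two columns agree for every waist size, which is all the algebra claims: at fixed height and density, the BMI is a function of girth alone.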

To be sure, I have a pretty shrewd idea why Quetelet introduced the BMI back in the early nineteenth century. (See my earlier column for the history.) He was looking at public records to determine societal trends and doubtless he had lists that gave people's heights and weights, since these are measured and recorded all the time. But the tables almost certainly did not give waist sizes, since those are rarely measured, except when we buy trousers. So he played around with the data he had until he found a correlation. The correlation he found was with what he called the body mass index. It required taking the square of a person's height. It certainly had an air of scientific authority and precision - after all, it came from a mathematical formula! But in reality, all he had done was find a way to estimate people's waist sizes.

And the medical profession has been chanting the BMI tune ever since.

The above alternative formula for the BMI does provide a bit of insight that I find lacking in the original. For the purposes of a population study (such as Quetelet's), H will be a variable. My new formula then looks like this:

$$\mathrm{BMI} = C \cdot \frac{W}{H} \cdot W$$

where C is the constant $D/(6\pi)$. The ratio W/H, your girth-to-height ratio, is a factor whose relevance we can all surely appreciate. But notice that there remains an additional, multiplicative W factor. This shows that girth is a particularly significant factor; the BMI increases linearly with girth even after you have taken account of the girth-to-height ratio.
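The extra factor of W can be seen by comparing two hypothetical people with the same girth-to-height ratio. The sketch below is only an illustration of the population form of the formula; the density and the body measurements are assumptions of mine, and `bmi_population` is a made-up name.

```python
import math

# Assumed illustrative value (not from the column): average body density, kg/m^3.
ASSUMED_DENSITY = 1000.0


def bmi_population(waist_m, height_m, density=ASSUMED_DENSITY):
    """Population form of the model: C * (W/H) * W, with C = D / (6*pi)."""
    c = density / (6 * math.pi)
    return c * (waist_m / height_m) * waist_m


if __name__ == "__main__":
    # Two hypothetical people with the same girth-to-height ratio of 0.5.
    shorter = bmi_population(waist_m=0.8, height_m=1.6)
    taller = bmi_population(waist_m=1.0, height_m=2.0)
    print(f"shorter: {shorter:.2f}  taller: {taller:.2f}  ratio: {taller / shorter:.3f}")
    print(f"ratio of girths: {1.0 / 0.8:.3f}")
```

With W/H held fixed, the ratio of the two BMI values equals the ratio of the two girths, which is the linear dependence described above.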

Of course, the BMI still comes down to waist size - doubly so as it turns out! And nothing I say here takes away from the issues I raised last time, including the fact that using a formula obtained by a statistical correlation across a population to make individual health and lifestyle recommendations and decisions has no scientific justification.

This isn't rocket science, folks. Take another look at what Ed Burger says in the introductory remarks in that video. Please. Then get your physician to watch it.

I suggest we henceforth rename the BMI the Bogus Mathematics Index.

### Cold Hit DNA Identification Rant Revisited

Moving on to DNA, you'd better read my original rant to get the relevant background, since this is a complex issue, and I'm going to assume everything I said there.

For the sake of argument, suppose the government has a database of 13-locus DNA profiles from 1 million individuals. (The FBI's CODIS database has roughly 4.2 million such profiles. As the size of the database increases, the kind of problem I am talking about becomes more significant. So the calculation I am about to do is more conservative than the real situation.)

The probability of a random match on a single locus is about 1/10. Here a random match means one that occurs purely by chance; it does not provide identification. (This is an empirical fact I learned from the experts, as you can too.)

In seeking to identify a crime perpetrator based on a DNA profile (and nothing else), suppose the FBI lab finds a match with a crime-scene sample on 9 loci. (This is actually pretty good. Crime scene samples are often damaged or degraded, making a full 13-locus match impossible. Again, I'm just passing on what the experts say here.)

Naively, if the random match probability (RMP) on a single locus is 1/10, you might think that the probability of a match on 9 loci being a result of pure chance is $(1/10)^9$, that is, 1 in a billion. Hence, you conclude, the FBI have surely got the guy. Guilty.

Well, he might be guilty. But in the interests of justice you should at least check the math. And there is a lot of math you can do to get an overall sense of how likely things are. One calculation you might make - and statisticians have looked at this issue empirically for a real DNA database (Arizona's) and obtained results consistent with my calculation - is to find out how likely it is that the 1 million entry database contains two different individuals whose profiles are identical on 9 loci. When you perform that calculation, you find that the chances of that happening are better than 50%. That's right, the odds are actually in favor of there being a random match on 9 loci. Surprising, no?

You can find my calculation of this result here. Lay readers may find it a bit intricate, since I use some techniques to handle large numbers efficiently. But I can explain the argument using a simpler example where the numbers are not so big. The same fact that seemingly unlikely events can actually occur quite often (like two people having the same DNA profile) is often demonstrated with what is called the Birthday Paradox.

The question is: how many people do you need to have at a party so that there is a better-than-even chance that two of them will share the same birthday?

Most people think the answer is 183, the smallest whole number larger than 365/2. But that is not the case. In fact, you need just 23 people in the room. (The analog to the DNA case is how big does a database have to be for the chances to be better than half that two entries share the same profile on 9 loci. In that case, the answer is not 23, but a million entries will do it, and CODIS already exceeds that number.)

Here is how to get that answer. To figure out the probability of finding two people with the same birthday in a given group, it turns out to be easier to ask the opposite question: what is the probability that NO two will share a birthday, i.e., that they will all have different birthdays?

With just two people, the probability that they have different birthdays is 364/365, or about .997.

If a third person joins them, the newcomer must avoid both of the birthdays already taken, so the probability that all three have different birthdays is (364/365) x (363/365), about .992.

With a fourth person, the probability that all four have different birthdays is (364/365) x (363/365) x (362/365), which comes out at around .983. And so on.

The answers to these multiplications get steadily smaller. When a twenty-third person enters the room, the final fraction that you multiply by is 343/365, and the answer you get drops below .5 for the first time, being approximately .493.

This is the probability that all 23 people have a different birthday. So, the probability that at least two people share a birthday is 1 - .493 = .507, just greater than 1/2.
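The running product just described is easy to reproduce. Here is a minimal Python sketch; the function names are my own, and it makes the usual simplifying assumption of 365 equally likely birthdays (no leap years).

```python
def prob_all_distinct(people, days=365):
    """Probability that `people` individuals all have different birthdays."""
    p = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p *= (days - k) / days
    return p


def smallest_group_for_even_odds(days=365):
    """Smallest group size with a better-than-even chance of a shared birthday."""
    people = 1
    while 1.0 - prob_all_distinct(people, days) <= 0.5:
        people += 1
    return people


if __name__ == "__main__":
    for n in (2, 3, 4, 23):
        print(f"{n:2d} people: P(all different) = {prob_all_distinct(n):.3f}")
    print("smallest group with better-than-even odds:", smallest_group_for_even_odds())
```

The loop multiplies in one fraction per new arrival, exactly as in the hand calculation, and stops at the first group size for which the chance of a shared birthday exceeds one half.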

With that simpler example in hand, you might now want to take another look at my DNA calculation. It's essentially the same.
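To make the parallel concrete, here is a rough Python sketch of the same birthday-style computation with DNA-sized numbers. It is not the calculation linked above. It is a simplified stand-in that assumes a fixed set of 9 loci, independent loci, a random match probability of 1/10 at each, and therefore 10^9 equally likely 9-locus profiles. The function names are hypothetical. Sums of logarithms replace the raw product so that the tiny probabilities do not underflow.

```python
import math


def prob_shared_profile(entries, profiles):
    """P(at least two of `entries` share a profile), profiles equally likely."""
    if entries > profiles:
        return 1.0
    # log P(all distinct) = sum over k of log(1 - k/profiles).
    log_distinct = math.fsum(math.log1p(-k / profiles) for k in range(1, entries))
    return -math.expm1(log_distinct)  # 1 - exp(log_distinct)


def smallest_database_for_even_odds(profiles):
    """Smallest database size with a better-than-even chance of a shared profile."""
    log_distinct = 0.0
    entries = 1
    while log_distinct >= math.log(0.5):
        # The next entry must avoid the `entries` profiles already present.
        log_distinct += math.log1p(-entries / profiles)
        entries += 1
    return entries


def expected_matching_pairs(entries, profiles):
    """Expected number of pairs of entries with the same profile."""
    return entries * (entries - 1) / 2 / profiles


if __name__ == "__main__":
    profiles = 10 ** 9  # assumed: (1/10)^9 chance that two given profiles match
    print("birthday check:", smallest_database_for_even_odds(365))
    print("entries needed for even odds:", smallest_database_for_even_odds(profiles))
    print("P(shared), 1 million entries:", prob_shared_profile(1_000_000, profiles))
    print("expected matching pairs:", expected_matching_pairs(1_000_000, profiles))
```

The reason the answer is so far from "1 in a billion" is that a database of a million entries contains roughly half a trillion pairs, and every one of those pairs is a chance for a coincidental match.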

Now, I am most definitely not trying to denigrate the use of DNA profiling. It is the most reliable and powerful means at our disposal to catch criminals (and to free wrongly convicted innocent victims). What I am trying to do is ensure that citizens understand the math sufficiently to make a proper evaluation of DNA evidence. That original figure of 1 in a billion is terribly and dangerously misleading, and could lead to innocent people being convicted. Equally, the fact that on mathematical grounds alone the CODIS database is more likely than not to include two individuals whose DNA matches on 9 loci is potentially misleading, and used by a skillful defense attorney might result in a guilty person getting away with the crime. That elusive, crucial concept, the truth, lies somewhere between those two extremes.

The fact is, in the era of DNA identification, judges and juries simply cannot avoid getting to grips with the relevant math. Identification hinges on those calculations. There may be no way of avoiding bringing mathematicians into court to explain how the calculations are done. But for that to be effective, those judges and juries need first to learn (and accept) that human intuitions about probabilities are hopelessly unreliable. That can prepare the way, not for mathematical laypersons to learn how to do the calculations themselves - the experts can do that part - but rather for them to learn how to follow the calculations and evaluate the answers. For it is on those answers that justice will ultimately depend.

Time to watch that video again. And this time, watch the two others in the series. Though on the surface they are about learning math, the real message is - as several of the participants say in different ways - learning to think.

Devlin's Angle is updated at the beginning of each month. Find more columns here. Follow Keith Devlin on Twitter at @nprmathguy.
Mathematician Keith Devlin (email: devlin@stanford.edu) is the Executive Director of the Human-Sciences and Technologies Advanced Research Institute (H-STAR) at Stanford University and The Math Guy on NPR's Weekend Edition. His most recent book for a general reader is The Unfinished Game: Pascal, Fermat, and the Seventeenth-Century Letter that Made the World Modern, published by Basic Books.