Thursday, January 14, 2010

Outliers

One of the most subtle and perilous questions in science is when to omit data.

Sometimes there are really good reasons to leave something out. After all, there are lots of ways to screw up data.

In the old days, people could read instruments wrong, or write or copy measurements incorrectly. Even with data acquired and processed by computers, instruments can overload or malfunction and produce incorrect readings. More frequently, even if a measurement itself is correct, changes in the apparatus or the external context can destroy its apparent significance. And it's almost always possible to save data in the wrong place or with the wrong description.

And it matters. Wrong data can cause a lot of headaches. Many analyses reduce the complete data set to a statistical summary such as the average value, and curve fitting typically penalizes large deviations even more heavily than small ones. So even a single errant measurement can skew the result drawn from many good ones.
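
To make that concrete, here is a tiny sketch (my own illustration in Python with numpy, not data from any real experiment) of how one bad point drags both an average and a least-squares fit:

```python
# Illustration only: ten points on a perfect line, then one wildly wrong value.
import numpy as np

x = np.arange(10.0)
y = 2.0 * x + 1.0                 # "true" linear trend
y_bad = y.copy()
y_bad[5] = 100.0                  # a single errant measurement

print(y.mean(), y_bad.mean())     # the average jumps from 10.0 to about 18.9
print(np.polyfit(x, y, 1))        # slope ~2, intercept ~1, as expected
print(np.polyfit(x, y_bad, 1))    # both are noticeably pulled by the one bad point
```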

All this means that scientists have good and powerful reasons to eliminate "outliers" that fall outside the normal range of variation, since there's a good chance they are wrong and could skew the results away from the "real" answer.
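
For what it's worth, the mechanical version of that reasoning is usually some cutoff rule. Here is a hedged sketch (the function name and the 3-sigma cutoff are my own choices for illustration, not a standard I'm endorsing) of the kind of z-score test people often have in mind:

```python
import numpy as np

def flag_outliers(values, cutoff=3.0):
    """Flag points more than `cutoff` standard deviations from the mean.

    Illustration only: the cutoff of 3 is a conventional choice, not a law of
    nature, and a merely unexpected point can trip this test just as easily as
    a genuinely wrong one.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > cutoff
```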

The problem is that eliminating points requires a subjective judgment by a human experimenter. Often this person is testing a hypothesis, and so has a working expectation of what the data "should" look like. The experimenter will be strongly motivated to toss out points that "don't look right"--even if that just means they are unexpected. That temptation must be avoided.

Distinguishing truly nonsensical measurements from those that simply don't accord with a researcher's expectations requires a level of objectivity and humility that is rare in most people, and difficult even for well-trained scientists.

But it is one of a scientist's most important tasks.

I once heard someone say that it's OK to throw out one data point out of seven. I think that's ridiculously general, and also dangerous. Human nature being what it is, I think the standards need to be higher.

What I learned in my undergraduate laboratory class is that you should check the data as you go along (plotting it by hand in your ever-present lab notebook, if you must know how old I am), so you can catch any measurement problems as they arise. If a measurement looks funny, repeat it. If the repeat is what you originally expected, it may be OK to toss out the funny one. The repeat might be an individual point, or an entire series. Even better is to do the new measurement twice, and use the majority rule.
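
As a sketch of that majority rule (the function name and the tolerance are hypothetical, just to show the logic of keeping the two repeats that agree):

```python
def majority_of_three(a, b, c, tol=0.05):
    """Keep the two repeats that agree within a relative tolerance; flag the odd one out.

    Returns (accepted_average, suspect_value), or (None, None) if no two agree,
    in which case something else is going on and nothing should be tossed yet.
    """
    for (x, y), odd in [((a, b), c), ((a, c), b), ((b, c), a)]:
        if abs(x - y) <= tol * max(abs(x), abs(y)):
            return (x + y) / 2.0, odd
    return None, None
```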

Unfortunately, it's not always possible to repeat the measurement exactly. Another alternative is to make a similar measurement, for example with a similar sample. Whenever possible, replication should be part of normal quality control anyway, so this may not be too hard.

But what about when no repeat is possible at all, as happens in the historical sciences? You could just throw out all the measurements as unreliable and find a new line of work. But if you opt to toss some of the data and keep the rest, you really need a very good argument for why that data has a problem. This is a really slippery slope if there is no way to double-check your argument. People--including scientists--are notoriously good at coming up with post-hoc "just-so" stories for why things are the way they are.

If you really think the data is wrong, but you can't be sure everyone would agree with your logic, scientific tradition still gives you an option: say what you did. Whenever data is selected or processed according to a questionable procedure, proper conduct requires that you declare the procedure, certainly in any journal article.

Unfortunately, I have the feeling that, in an era when hot results are sent to general-interest journals like Science and Nature, this sort of documentation gets relegated to the supplementary material or is never stated at all. This is a dangerous trend.

Incidentally, many definitions of scientific misconduct include errors of omission. For example, here is the relevant definition from the National Science Foundation's policy:

Falsification means manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.

In other words, if your deliberate omission distorts the conclusion, you are guilty of fraud. Don't do it.

    
