The term “data outlier” rests on hidden assumptions. A completely different way to think about outliers is that they are points that do not fit your understanding of the error distribution underlying the data acquisition.

Unfortunately, we often falsely assume a “Normal” (Gaussian) distribution of errors. Did you know that in a “Normal” distribution a deviation of 11 sigma is much, much, much less likely than a deviation of 10 sigma? Does that correspond to your experience? Not mine: in practice, deviations of 11 sigma are about as likely as deviations of 10 sigma. I see neither of these as outliers; they are just telling you that your error distribution is non-“Normal”.
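To put numbers on this (a minimal sketch using scipy; the two tail probabilities are standard properties of the Gaussian):

```python
from scipy.stats import norm

# One-sided tail probability of a deviation of at least k sigma
# under a Gaussian ("Normal") error model.
p10 = norm.sf(10)   # ~7.6e-24
p11 = norm.sf(11)   # ~1.9e-28

print(f"P(>10 sigma) = {p10:.2e}")
print(f"P(>11 sigma) = {p11:.2e}")
print(f"ratio        = {p10 / p11:.0f}")  # ~4e4: an 11 sigma deviation is
                                          # tens of thousands of times rarer
```

If your data show 10 sigma and 11 sigma deviations at comparable rates, it is the Gaussian model, not the data, that is wrong.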

In 1971, Abrahams and Keve (10.1107/S0567739471000305) described a beautiful way to verify the error model: sort the errors and plot each one against its expected value under the assumed normal distribution (a normal probability plot). If the error model is correct, the result is a straight line; if it is not, the errors do not follow a Gaussian.
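A minimal sketch of that construction in Python (the plotting positions (i − 0.5)/n are a common approximation to the expected order statistics; the function name is mine):

```python
import numpy as np
from scipy.stats import norm

def normal_probability_plot(errors):
    """Return (expected, observed): sorted errors against their expected
    values under a Gaussian error model (the Abrahams & Keve construction)."""
    observed = np.sort(np.asarray(errors))
    n = len(observed)
    # Expected order statistics, approximated by Gaussian quantiles
    # at the plotting positions (i - 0.5) / n.
    expected = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    return expected, observed

# For a correct error model the points lie on a straight line.
rng = np.random.default_rng(0)
expected, observed = normal_probability_plot(rng.standard_normal(1000))
slope, intercept = np.polyfit(expected, observed, 1)  # slope ~ 1, intercept ~ 0
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```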

I suffered from this myself in my research. For me, a very good solution was to replace the Normal distribution with a Student’s t-distribution (10.1107/S0108767309009908). The best value of its parameter ν can be found by linearizing the probability plot. After following that procedure, I no longer needed to remove any “outliers”: all data points could be used in the analysis (10.1107/S0021889810018601).
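One way to carry out such a linearization (a sketch of the idea under my own assumptions, not the exact procedure of the cited paper): grid-search over ν and keep the value whose probability plot is closest to a straight line, scored here by the correlation coefficient.

```python
import numpy as np
from scipy.stats import t as student_t

def fit_nu(errors, nus=np.linspace(1.0, 50.0, 99)):
    """Pick the Student's t parameter nu that makes the probability plot
    most linear, scored by the correlation of sorted errors against the
    t-distribution quantiles. (A simple grid search; the range of nu
    values tried here is an arbitrary choice.)"""
    observed = np.sort(np.asarray(errors))
    n = len(observed)
    positions = (np.arange(1, n + 1) - 0.5) / n
    best_nu, best_r = None, -np.inf
    for nu in nus:
        expected = student_t.ppf(positions, df=nu)
        r = np.corrcoef(expected, observed)[0, 1]
        if r > best_r:
            best_nu, best_r = nu, r
    return best_nu, best_r

# Heavy-tailed example: data drawn from a t-distribution with nu = 4.
rng = np.random.default_rng(1)
nu, r = fit_nu(student_t.rvs(df=4, size=5000, random_state=rng))
print(f"nu ~ {nu:.1f}, correlation = {r:.4f}")
```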

Outliers don’t exist. If you think they do, you are probably misunderstanding your error model. And properly understanding your error model can teach you far more than you can learn by rejecting outliers with some empirical rule.

[This post was triggered by an AI-generated guide on handling outliers on LinkedIn.]