
data

  • Data outliers do not exist

    The term “data outlier” is based on hidden assumptions. A completely different way to think about such points is that they simply do not fit your understanding of the error distribution underlying the data acquisition.

    Unfortunately, we often falsely assume a “Normal” (Gaussian) distribution of errors. Did you know that in a “Normal” distribution a deviation of 11 sigma is much, much, much less likely than a deviation of 10 sigma? Does that correspond to your experience? Not to mine: in practice, deviations of 11 sigma are about as likely as deviations of 10 sigma. I see neither of these as outliers; they are just telling you that your error distribution is non-“Normal”.

    In 1971, Abrahams and Keve (10.1107/S0567739471000305) described a beautiful way to verify the error model: sort the errors and, based on the assumption that they follow a normal distribution, plot their values against their expected values (a Normal Probability Plot). The resulting plot is expected to be a straight line. If it is not, the errors do not follow a Gaussian distribution.
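    As a rough sketch of what that looks like in practice (the variable names and the synthetic residuals below are my own illustration, not taken from the paper), such a plot can be built in a few lines of Python:

    ```python
    # Normal probability plot in the spirit of Abrahams & Keve (1971):
    # sort the errors and plot them against the quantiles expected under
    # a Normal(0, 1) error model. A straight line supports the error model.
    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    errors = np.random.standard_normal(500)   # replace with your own residuals

    sorted_errors = np.sort(errors)
    n = len(sorted_errors)
    # Expected order statistics under the assumed Normal error model
    expected = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

    plt.plot(expected, sorted_errors, ".", label="observed")
    plt.plot(expected, expected, "--", label="ideal straight line")
    plt.xlabel("expected normal quantile")
    plt.ylabel("sorted error")
    plt.legend()
    plt.show()
    ```

    A systematic curve away from the straight line, especially in the tails, is the signature of a non-Gaussian error distribution.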

    I suffered from this myself in my research. And for me, a very good solution was to replace the Normal distribution with a Student’s t distribution (10.1107/S0108767309009908). The best parameter ν of that distribution can be derived by linearizing the probability plot. By following that procedure, it was no longer necessary for me to remove any “outliers”: all data points could be used in the analysis (10.1107/S0021889810018601).
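    A minimal sketch of one way to do that linearization (my own illustration, not necessarily the exact procedure from the cited papers): scan over candidate values of ν and keep the one for which the probability plot against Student’s t quantiles is closest to a straight line, here measured by the correlation coefficient.

    ```python
    # Choose the Student's t parameter nu that makes the probability plot
    # most linear: for each candidate nu, compare the sorted errors with
    # the corresponding t-distribution quantiles and keep the nu giving the
    # highest correlation coefficient. Grid and criterion are illustrative.
    import numpy as np
    from scipy import stats

    def best_nu(errors, nu_grid=np.linspace(1.0, 50.0, 200)):
        sorted_errors = np.sort(errors)
        n = len(sorted_errors)
        probs = (np.arange(1, n + 1) - 0.5) / n
        best_nu_value, best_r = None, -np.inf
        for nu in nu_grid:
            quantiles = stats.t.ppf(probs, df=nu)
            r = np.corrcoef(quantiles, sorted_errors)[0, 1]
            if r > best_r:
                best_nu_value, best_r = nu, r
        return best_nu_value, best_r
    ```

    With genuinely fat-tailed errors this returns a small ν; for nearly Gaussian errors ν drifts towards the upper end of the grid, where the t distribution approaches the Normal.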

    Outliers don’t exist. If you think they do, you are probably misunderstanding your error model. And properly understanding your error model can teach you much more than you can learn by rejecting outliers through some empirical rule.

    [This post was triggered by an AI-generated guide on handling outliers on LinkedIn.]

  • Is there a need for a research data management specialism?

    [Image: fire-damaged chemical lab in Hiroshima; from Wikimedia; public domain]

    Yes!

    There is an interesting difference between how risks are often approached in a research lab where a lot of data is handled and in a chemical lab. Many people working with data regularly encounter problems like not being able to locate data quickly or not being able to reproduce results exactly, but they often think these problems are an inevitable consequence of working with large amounts of data, and do not recognize them as problems with data management practice and preparation. The equivalent in a chemical lab would be researchers thinking that daily fires and explosions naturally belong to working with chemical compounds, rather than recognizing them as a consequence of bad lab practice and bad preparation.

    There is also resistance to the uptake of a data management specialism because many researchers think that data management is relatively easy. Everybody has a computer at home, and many maintain photo libraries. However, this experience does not directly translate into work with large amounts of data in the lab:

    • Data in the lab is often 1–3 orders of magnitude larger than a photo library at home. A maintenance job that costs an hour for a photo library would translate into more than 6 months of work in a large data-intensive project. Because of this, there is really a need for different approaches.
    • Data in a photo library consists of JPG files and maybe RAW files, and these files have simple 1 to 1 relationships. In the lab there are many more different kinds of data, and the relationships are much more complex.
    • A photo library is usually maintained by a single person. In the lab, the same data is worked on by different people, and they must each be aware of everything that is done by the others.

    And in fact, even in a photo library at home one cannot always quickly find what one is looking for.

  • Tell me what it is, not how you use it!

    Regarding research data management, I have been telling people about the importance of describing a data set as “what it is” rather than “how you use it”.

    An early example, given to me by a biobanking expert in The Netherlands, was the description of a chest X-ray: such an image can most likely be used both to study the bones of the spine and to study the state of the major arteries (e.g. the aorta). When such an X-ray is acquired, it is usually for one purpose only, but that does not exclude re-use in another field. To optimize the re-usability of the data (see the FAIR principles), a chest X-ray should be labeled “chest X-ray” and not “X-ray of the spine”, even if it was acquired for that specific goal.

    I think this is similar to a “book cupboard”. Most “book cupboards” are actually “book-size shelves”. Those shelves can be used to store books, but they can also be used for other purposes. To optimize the findability of the right storage solution in the shop, it would be useful if the label did not express a single use.

    Recently I heard yet another very good example of the same principle in an episode of the SE-radio podcast: Function names when writing computer code. It really improves the human readability of computer software (and hence the maintainability) if each function is named after what it does, rather than how it is used. The example from the podcast: do not name the function "reformat_email", but name it "remove_double_newlines" if that is what it does.
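    A tiny, purely illustrative sketch of the difference (the two names come from the podcast example; the bodies are my own guess at what such a function might do):

    ```python
    # Named after how it happens to be used in one place: the name hides
    # what the function does and discourages re-use elsewhere.
    def reformat_email(text: str) -> str:
        return text.replace("\n\n", "\n")

    # Named after what it actually does: callers in any context can tell
    # from the name alone whether it is the right tool.
    def remove_double_newlines(text: str) -> str:
        return text.replace("\n\n", "\n")
    ```

    The bodies are identical; only the name changes, and with it the chance that someone outside the original context will find and re-use the function.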

    I've heard someone say “Researchers are the worst judges of the possibilities for re-use of their own data”. It is true: a researcher studying the aorta will not even see the spine on their own X-ray images, let alone think about ways in which the data can be reused by bone researchers. I think labeling a data set with how it is used is a consequence of this. A trained librarian/archivist, with training in classification systems, will quickly see through such a mistake and suggest better naming.


  • Who is supposed to benefit from the EOSC?

    Story 1. The European Open Science Cloud had its launch meeting in Vienna on November 23, 2018. The lectures were given by representatives of big (hundreds of millions of euros) single-topic European science projects, telling the audience how important the EOSC will be.

    Story 2. I like answering questions on Quora. Recently someone asked whether it is unhealthy to live in a humid climate. It took me 2 hours to find data on climate humidity and longevity, summarize both per country, and correlate. I could only find both numbers for 46 countries.

    My conclusion? A well-functioning European Open Science Cloud will be most helpful in speeding up the answers to small questions, by making it possible to couple different data sets from diverse sources. The big projects collecting their own petabytes of data will manage to do exactly the same with or without a science cloud. But most of science consists of much smaller questions, possibly composing data into solutions for grand societal goals. These kinds of projects are served most by datasets that adhere to standards. They will benefit from the EOSC. And that is why it is a pity that the voice of such projects was not represented at the launch event.