Wednesday 29 March 2017

Numbers Don't Speak: Misleading Statistics Come To Science

Humans are bad with numbers, mainly because evolution molded them to think quickly and intuitively, not abstractly. Though this mode of thinking, known as heuristics, was enough to get them out of the savanna and into small subsistence communities, it did not make them the 'masters of the planet.' What did was their ability to bend the laws of nature, through observation, exact measurement, and scientific thinking. This recipe allowed them to tell truth from sham, hunch from 'hit,' and luck from meaningful occurrence.

There are a variety of tools that help humans do this, and one of them is statistics. Though a young discipline, statistics has profound applications in almost every realm of our lives. Google uses it. Netflix and Amazon depend on it. Most importantly, our source of knowledge, published research and journals, would not exist without it. However, among the great many scientists and researchers who use it, only a few truly know how to. This realization came with the paper "Why Most Published Research Findings Are False," written by John Ioannidis at Stanford.

Statistical illiteracy is not just an epidemic among ordinary people, who are misled by poor statistical reporting in the mainstream media and are ignorant of sampling, error, and the fallacy of averages. As the paper reveals, the scientific community is not immune to it either.

A lot can go wrong while doing statistics. Choosing a faulty sample, ignoring extreme values, manipulating graphs and visuals, proposing causation where there is only correlation (for example, implying that pornography is a 'cause' of sexual violence), and hunting for nonsensical correlations, like correlating the number of Hollywood movies released per year with the number of bird deaths that year, are among the well-known errors. Other problems are inherent and can never be eliminated, such as the randomness of the real world or the complexity of the system being studied. Still others result from scientific conventions, popular practices that developed over time; these are rampant in experimental psychology, neuroscience, medicine, health, and nutrition.

Probably the most articulate book on the topic, Statistics Done Wrong, highlights many of the statistical sins committed by academics and researchers across study areas. A handful of them are given below.

Everything wrong with ‘statistical significance.'

The phrase has become the de facto standard of statistics, whether in textbooks or in prestigious journals like Nature and Science. A study is said to be statistically significant when its p-value falls below a certain cutoff, usually set at p < 0.05. However, this cutoff is somewhat arbitrary: a mere convention tracing its lineage to R. A. Fisher, the godfather of significance tests.

But that is not the only problematic thing about p-values. It is their frequent and widespread misinterpretation that is truly troubling. Most people think a p-value of 5% means there is a 5% chance that a result is a fluke, or, put better, a 5% chance that the result is due to luck, which is downright false. The probability that a 'significant' result is a fluke is much higher; a good 38%. Besides, there is not much difference between a p-value of 6% and one of 4%, but, as David Colquhoun puts it in his paper, "one of them gets your paper published, and the other does not."
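
To see where a figure of that size can come from, here is a rough back-of-the-envelope sketch in Python. The share of tested hypotheses that are actually true (10%) and the power of the test (80%) are assumptions picked for illustration, not numbers taken from Colquhoun's paper.

```python
# If only 10% of tested hypotheses are true and the test has 80% power,
# how many p < 0.05 "discoveries" are actually flukes?
alpha = 0.05   # significance cutoff
power = 0.80   # assumed chance of detecting a real effect
prior = 0.10   # assumed share of tested hypotheses that are actually true

n = 1000                          # imagine 1,000 studies
real = n * prior                  # studies where the effect is real
null = n - real                   # studies where there is no effect

true_positives = real * power     # real effects that reach significance
false_positives = null * alpha    # null effects that sneak past the cutoff by luck

fdr = false_positives / (true_positives + false_positives)
print(f"Share of significant results that are flukes: {fdr:.0%}")  # about 36% here
```

With these made-up inputs, roughly a third of 'significant' findings turn out to be false alarms, even though the cutoff itself was only 5%.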

P-values are handy when used correctly. They are meant to serve as a rough gauge that helps researchers make informal judgments about whether their results make sense. They are not intended to be rules engraved in stone for hypothesis testing.

Another head-scratching quality of p-values is that they are simply probabilities; they carry no practical value on their own. They don't tell you whether your medicine works, or how big the difference between two bacterial cultures is, and therefore trying to push your p-value as low as possible and presenting it as proof that you have found something is futile. A p-value is only supposed to tell you how surprising your data would be if there were no real effect; it is not meant to tell you whether you have made a discovery. And even for judging your data, a confidence interval is a much better option, because it puts an error bar around your estimate and gives you insight into the uncertainty in your data. Sadly, however, only about 10% of research papers in experimental psychology use confidence intervals, and in journals like Nature, 89% of papers report p-values without any confidence interval.
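
Here is a small, hypothetical two-group comparison in Python that shows the difference in what the two numbers tell you; the 'drug' and 'placebo' data are simulated, and the interval uses a simple normal approximation.

```python
# The same comparison reported two ways: a bare p-value versus an effect
# size with a 95% confidence interval (simulated data, normal approximation).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
drug = rng.normal(5.3, 2.0, 40)      # hypothetical outcomes, treated group
placebo = rng.normal(4.0, 2.0, 40)   # hypothetical outcomes, control group

t_stat, p_value = stats.ttest_ind(drug, placebo)
diff = drug.mean() - placebo.mean()
se = np.sqrt(drug.var(ddof=1) / len(drug) + placebo.var(ddof=1) / len(placebo))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"p-value: {p_value:.3f}")     # 'significant' or not, and nothing more
print(f"effect: {diff:.2f}  95% CI: ({ci_low:.2f}, {ci_high:.2f})")  # size plus uncertainty
```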

It is precisely for this reason that editors and scientists have been screaming about the 'replicability crisis' in many disciplines, especially nascent ones that lack rigorous standards to test against, such as neuroscience and behavioral studies. P-values should not decide what has been 'discovered'; replication studies should.

Little Extremes

Little extremes were aptly explained by Kahneman in Thinking, Fast and Slow through an example that goes like this. In the US, it was found that the counties with the lowest rates of kidney cancer tended to be rural counties in the Midwest, the South, and the West. But the counties with the highest rates of kidney cancer also tended to be rural counties in the Midwest, the South, and the West. How is that so? Because the countryside has far fewer people than urban areas, so rates computed there bounce around much more; the extreme values were bound to show up in the countryside rather than in urban settings, where they average out.

One way to correct for this is to account for population by using weighted averages instead of simple averages.
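
A quick simulation in Python (my own, not Kahneman's) makes the point: give every county the exact same true cancer rate and the most extreme observed rates, both high and low, still come from the smallest counties.

```python
# Every county has the same true rate; small counties still produce the extremes.
import numpy as np

rng = np.random.default_rng(1)
true_rate = 0.0001                                    # identical everywhere
populations = rng.integers(1_000, 1_000_000, 500)     # 500 counties of varying size

cases = rng.binomial(populations, true_rate)          # observed cases per county
rates = cases / populations
order = np.argsort(rates)

print("Populations of the 10 lowest-rate counties: ", populations[order[:10]])
print("Populations of the 10 highest-rate counties:", populations[order[-10:]])

# Weighting by population keeps the noisy small counties from dominating.
print("Simple average of county rates:  ", rates.mean())
print("Population-weighted average rate:", np.average(rates, weights=populations))
```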

Pseudo-replicability

This error, too, is a common disaster. It involves collecting thousands of data points to give the illusion that you have replicated your results when in fact you have pseudo-replicated them: you take a small sample but measure the same parameter on it again and again. For example, suppose I want to know the average IQ of my class. I choose two people from the class and measure their IQs five times a day for ten days. I have now collected 100 data points, but they are useless; they tell me about those two people, not about the class. I should instead have taken 100 different individuals, perhaps measured each one's IQ five times, and averaged those five measurements into a single data point per person. I would still have 100 data points, but they would be far more robust.
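
A toy simulation in Python (my own illustration, with made-up numbers) shows why the two schemes are not equivalent: both produce '100 data points,' but only the properly sampled one reliably lands near the true class average.

```python
# 100 repeated measurements of two students versus 100 different students.
import numpy as np

rng = np.random.default_rng(2)
class_iq = rng.normal(100, 15, 500)                  # hypothetical class of 500 students

# Pseudo-replication: 2 students, 50 noisy measurements each
two = rng.choice(class_iq, 2, replace=False)
pseudo = np.repeat(two, 50) + rng.normal(0, 3, 100)  # only measurement noise varies

# Real sampling: 100 different students, one (noisy) measurement each
sampled = rng.choice(class_iq, 100, replace=False) + rng.normal(0, 3, 100)

print("True class mean:           ", round(class_iq.mean(), 1))
print("Pseudo-replicated estimate:", round(pseudo.mean(), 1))   # hostage to 2 students
print("Properly sampled estimate: ", round(sampled.mean(), 1))
```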

Statistics can be quite baffling. Their counter-intuitive nature sways even the best of people. I remember my father once cursing the weather prediction on his phone: "Look, it says here that there is only a 20% chance of rain today, but see how much it's raining." The problem did not lie with the prediction; it lay with his naïve understanding of probability. A 20% chance of rain means that if you repeated the observation over many comparable days, it would rain on roughly one day in five. Think of it this way: you flip a coin and get heads. You flip it again and get heads. So you conclude that heads is more likely than tails, even though the odds are 50/50. Probability does not tell you the exact outcome of your next toss; it only tells you what to expect if you repeat the experiment an enormous number of times. You won't get a 50/50 ratio of heads to tails with just one toss, or ten tosses, or even a hundred. That is why we simulate tosses millions of times using random numbers.

Another quirky use of statistics is the modern American rhetoric that "most people are above average," which is laughable: most people cannot be above average. They are just average. 'Average' is called average for a reason.
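
Simulating exactly that is a two-minute exercise in Python; the share of heads wobbles for small runs and only settles near 50% when the number of tosses gets huge.

```python
# The share of heads converges to 0.5 only over a very large number of tosses.
import numpy as np

rng = np.random.default_rng(3)
for n in (10, 100, 10_000, 1_000_000):
    heads = rng.integers(0, 2, n).sum()   # each toss: 0 = tails, 1 = heads
    print(f"{n:>9} tosses: {heads / n:.3f} heads")
```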

But statistics is not only counter-intuitive; it also remains beyond the reach of those who don't study it. A researcher cannot realistically run hundreds of simulations (even though they should) or apply sophisticated techniques without training. Statistics, however, is not the only reason science faces the perils of non-replicability. There are many other reasons as well, and it would be unfair not to mention them. Since the replicability crisis mostly strikes the behavioral sciences, could it be that the nature of those fields burdens them with unavoidable complexity and randomness, something you can never eliminate? And, therefore, perhaps it is not that alarming when studies in those disciplines fail to withstand re-tests. Well, there is a difference between science failing to replicate, which is good, and science not attempting to replicate, which is horrendous. Replication is at the heart of scientific inquiry. It cannot be detached from the scientific organism.

Therefore, when journals treat replication studies as 'lowlier' than new and original endeavors, or when doing replication stalls a researcher's career growth or keeps her from being funded, it is a violent blow to the integrity of science. In fact, nascent fields should be the most careful about replicability, because they have yet to establish strong credibility. They should be the ones getting it wrong most of the time, by putting their studies to the test often. Being wrong is okay in science; not trying to prove things wrong is disastrous. Hence, it should concern every one of us when universities or academia put unnecessary pressure on researchers to 'get published.' Instead, they should be propagating academic attitudes, and most importantly scientific attitudes, which hold replication dearest.

References and Further Reading

Reinhart, Alex. Statistics Done Wrong. No Starch Press, 2015.
LA Times, "Why Failure to Replicate Findings Can Actually Be a Good Thing," 2016.
The Economist, "Trouble at the Lab," 2013.
