Over this week, there has been a striking debate in the blogosphere and on Twitter concerning the flaws in many published neuroimaging studies. This was sparked off on Monday by Dorothy Bishop’s brutal, insightful highlighting of the methodological holes in a paper published in the prominent journal Proceedings of the National Academy of Sciences in 2003. The next day, one of the authors of this paper, Russ Poldrack, admirably held up his hands, and agreed with every one of Bishop’s criticisms. His partial explanation was that the paper came from a different age, with more lax conventions (and admittedly he was only a minor author on the paper himself). Late on Tuesday night, Neurocritic posted a provocative blog article in response to this, asking the question: “How Much of the Neuroimaging Literature Should We Discard?” This initiated a lively debate on Twitter yesterday between me, Jon Simons, Dorothy Bishop and others, in answer to this question. Two key issues quickly surfaced: first, is there any mileage in retracting published results if they are later found to be seriously flawed? And second, do these flawed studies have any generally positive worth, especially when bolstered by independent replication?
I thought it might help in this discussion to explain one of the main statistical issues that this debate is pinned on, that of corrected versus uncorrected statistics, and how this applies to brain-scanning. I then want directly to address Neurocritic’s question as to whether these problematic papers should be discarded or retracted. Related to this, I’ll then discuss whether a published, though deeply flawed, neuroimaging study can do more harm than good. And if many published imaging papers are so flawed, I want to try to explain how the literature became so sloppy. I’ll end this blog entry with a few suggestions for how the situation can be improved, and then how a layperson can sift through the stories and decide whether a neuroimaging study is of good quality or not.
Edit: Just to flag up that this blog is addressing two audiences. I wanted to explain the context of the debate to a general audience, which occurs in the next two sections, and suggest how they can assess neuroimaging stories in the light of this (in the last small section). The middle sections, although hopefully understandable (and maybe even of some interest) to all, are directed more at fellow scientists. And the comments at the end have become a little dominated by technical points, which is great, but if any non-academic wants to air an opinion or ask a question, I just wanted to emphasise that I’d be delighted to have these as comments too.
So what are corrected and uncorrected statistics?
Imagine that you are running some experiment, say, to see if corporate bankers have lower empathy than the normal population, by giving them and a control group an empathy questionnaire. Lo and behold, the bankers do have a lower average empathy score, but it’s only a little bit lower. How can you tell whether this is just some random result, or that bankers really do have lower empathy? This is the point where statistical testing enters the frame.
Classically, a statistical test will churn out a probability that you would have got the same result, just by chance. If it is lower than some threshold, commonly probability (or p) =0.05, or a 1 in 20 chance, then because this is really very unlikely, we’d conclude that the test has passed, the result is significant, and that bankers really do have a lower empathy score than normal people. All well and good, but what if you also tested your control group against politicians, estate agents, CEOs and so on? In fact, let’s say you tested your control group against 20 different professions, and the banker group was the only one that was “significant”. Now we have a problem, because if we run a test 20 times, it is more likely than not that at least one will come out positive (under this p=0.05 threshold, there is roughly a 64% chance of at least one false positive across 20 tests), just by chance.
As an analogy, say Joe Superstitious flips a coin 4 times in a row, willing it with all his might to fall on heads 4 times in a row (with 1 in 16 odds, so pretty close to p=0.05). But the first time it’s just a mix of heads and tails. Oh, he was just getting warmed up, so let’s ignore this round. So he tries again, and this time it’s three heads and a tail – oh so nearly there. His mojo must be building! The third time it’s almost all tails, well that was because he was a bit distracted by a car horn outside. So he tries again, and again and again. Then, as if by magic, on the 20th attempt, he gets all 4 heads. Joe Superstitious proudly concludes that he is in fact very skilled at telekinesis, puts the coin in his pocket and saunters off.
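Joe’s odds are easy to check for yourself. Here is a minimal sketch in Python (the numbers come straight from the analogy above; nothing here is from a real dataset):

```python
import random

random.seed(1)

def four_heads():
    """One 'attempt': flip a fair coin 4 times, True if all heads."""
    return all(random.random() < 0.5 for _ in range(4))

# Chance of 4 heads on a single attempt: (1/2)**4 = 1/16 = 0.0625
# Chance of at least one all-heads run somewhere in 20 attempts:
exact = 1 - (15 / 16) ** 20
print(f"exact probability: {exact:.3f}")  # about 0.72, more likely than not!

# Confirm by simulating a large population of Joe Superstitiouses
n_joes = 100_000
lucky = sum(any(four_heads() for _ in range(20)) for _ in range(n_joes))
print(f"simulated: {lucky / n_joes:.3f}")
```

In other words, a result with a 1-in-16 chance per attempt becomes near-inevitable once you allow yourself 20 attempts and only report the successful one.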
Joe Superstitious was obviously flawed in his thinking, but the reason is actually because he was using uncorrected statistics, just as the empathy study would have been if it concluded that bankers are less empathic than normal people. If you do multiple tests, you normally have to apply some mathematical correction to take account of how many tests you ran. One simple yet popular method of correction (known as a Bonferroni correction) involves dividing the significance threshold by the number of tests you’ve done in total (equivalently, multiplying each p-value your tests output by the number of tests). So for the bankers to be significantly lower than the controls at a p=0.05 criterion, the statistical test would have had to output a probability below p=0.0025 (p=0.05/20), which occurs only 1 in 400 times by chance.
How does this apply to brainscanning?
Moving on to neuroimaging, the data is far more complex and inordinately larger, but in essence exactly the same very common statistical test one might have used for the empathy study, a t-test, is also used here in the vast majority of studies. However, whereas in the empathy study 20 t-tests were run, in a typical neuroimaging study, a t-test is separately carried out for each 3 dimensional pixel (known as a voxel) of a subject’s brain-scan, and they might well have 100,000 of these! So there is a vast problem of some of these voxels to be classed as significantly active, just by chance, unless you are careful to apply some kind of correction for the number of tests you ran.
One historical fudge was to keep to uncorrected thresholds, but instead of a threshold of p=0.05 (or 1 in 20) for each voxel, you use p=0.001 (or 1 in a 1000). This is still in relatively common use today, but it has been shown, many times, to be an invalid attempt at solving the problem of just how many tests are run on each brain-scan. Poldrack himself recently highlighted this issue by showing a beautiful relationship between a brain region and some variable using this threshold, even though the variable was entirely made up. In a hilarious earlier version of the same point, Craig Bennett and colleagues fMRI scanned a dead salmon, with a task involving the detection of the emotional state of a series of photos of people. Using the same standard uncorrected threshold, they found two clusters of activation in the deceased fish’s nervous system, though, like the Poldrack simulation, proper corrected thresholds showed no such activations.
So the take home message is that we clearly need to be applying effective corrections for the large quantities of statistical test we run for each and every brain activation map produced. I’m willing to concede that in a few special cases, for instance with a very small, special patient group, corrected statistics might be out of reach and there is some value in publishing uncorrected results, as long as the author heavily emphasises the statistical weakness of the results. But in almost all other circumstances, we should all be using corrected significance, and reviewers should be insisting on it.
Should we retract uncorrected neuroimaging papers?
Surprisingly, there is a vast quantity of published neuroimaging papers, even including some in press, which use uncorrected statistics. But in response to Neurocritic, and siding to some degree with Ben Goldacre, who also chipped in on the Twitter debate, it’s almost certainly impractical to retract these papers, en masse. For one thing, some might have found real, yet weak, results, which might now have been independently replicated, as Jon Simons pointed out. Many may have other useful clues to add to the literature, either in the behavioural component of the study, or due to an innovative design.
But whether a large set of literature should now be discarded is a quite separate question from whether they should have been published in the first place. Ideally, the authors should have been more aware of the statistical issues surrounding neuroimaging, and the reviewers should be barring uncorrected significance. More of this later.
Can any neuroimaging paper do more harm than good?
Another point, often overlooked, is the clear possibility that a published study can do more harm than good. Dorothy Bishop already implied this in her blog article, but I think it’s worth expanding on this point. If a published result is wrong, but influential and believed, then this can negatively impact on the scientific field. For instance, it can perpetuate an erroneous theory, thus diluting and slowing the adoption of better models. It can also make other scientists’ progress far less efficient.
A good proportion of scientific research involves reading a paper, getting excited by its results, and coming up with an idea to extend it in a novel way, with the added benefit that we have to perform an independent replication to support the extension – and everyone agrees that independent replication is a key stage in firmly establishing a result.
On a personal level, not only in neuroimaging, but also in many behavioural results, I and my research students have wasted many soul-destroying months failing to replicate the results of others. Perhaps a fifth of all experiments I’ve been involved in have been of this character, which if you include the work of research students as well, easily adds up to multiple man-years of wasted work. And I’m actually probably more critical than most, sneer at uncorrected statistics, and tend to go through papers with a fine tooth comb. But still I’ve been caught out all these times. For others who view scientists less suspiciously, the situation must be worse.
For the specifics of an fMRI study that fails to replicate another, the scanning costs can easily top $10,000, while the wage hours of radiographers, scientists, and so on that contributed to this study might add another $50-100,000. These costs, which may well have been funded by the taxpayer, are only one component of the equation, though. It can easily take 6-12 months to run a study. If the researcher carrying out the work is studying for their PhD, or in the early phase of their post-doctoral position, such a failed experiment, in the current ultra-competitive research climate, might turn a talented budding scientist away from an academic career, when those vital papers fail to get published.
The implications multiply dramatically when the study has a clinical message. One particularly tragic example in science more generally comes from the book, Baby and Childcare, by Dr Spock. Recently on the BBC Radio 4 programme, The Life Scientific, Iain Chalmers pointed out that this book, with its order that mothers put babies to sleep on their front, was probably responsible for 10,000 avoidable deaths in the UK alone.
I would therefore argue that scientists, particularly within the neuroimaging field, where experimental time and costs are substantial, and especially when this combines with a clinical message, have a duty to try, as far as possible, to publish papers that are as rigorous as they can be, with corrected statistics an obvious component of this.
Is there a culture of sloppy neuroimaging publications?
Effective corrected statistics are by no means a new addition to the neuroimaging methodology. The first (and still most popular) common correction method was published in 1992 by Keith Worsley and colleagues, while a second was published in 2002 by Tom Nichols and colleagues. When I began my PhD in 1998, when I almost immediately started my first neuroimaging study, it was already frowned upon in our department even to consider publishing uncorrected results. So how can uncorrected statistics be published so widely, even, to some extent, today?
I believe there are two components to the answer, one involving education, but the other, more worryingly, relating to certain cultural features of the cognitive neuroscience community.
Analysing neuroimaging data, especially if it’s of the fMRI flavour, is very complex -there’s no getting around that. Datasets are vast, and there are many issues to address in arriving at a clean, robust set of results. There are competing schools of thought for how to analyse the data, and a dizzying level of maths required to absorb in order fully to understand even the standard analysis steps. Thriving neuroimaging centres, such as the Cambridge community, where I carried out most of my imaging studies, invest much time in running seminars, writing webpages and so on, to disseminate the current state of play in imaging methods. More isolated neuroimaging centres, which are the norm rather than the exception, have a far greater challenge getting up to speed. The community as a whole does a reasonable job, both online and using onsite courses, in educating any scientists that need help. But clearly they could do far more, and I have a few ideas about this, which I’ll leave for a later article.
But this is only half the story – a paper can normally only be published if a set of reviewers approve it. If a paper is methodologically flawed, the reviewer should explain the flaw and suggest improvements. It is highly problematic if reviewers are either chosen by editors or allow themselves to act as gatekeepers for a paper, when they aren’t qualified to judge its methods.
Dwarfing the issue of lack of education, though, is that of culture. Papers which are obviously methodologically flawed, both in design and statistical analysis, tend to get published in minor journals and make little impact. On the other hand, there is an assumption that if you are published in the most prominent journals that you have produced high quality research, and a paper is far more likely to be influential. This is where a spate of cultural problems arise.
From the outside, the public assume that almost all scientists have noble, perfectly honest aims when papers are published. I believed this too, until I started my PhD, when I was quickly educated in how some neuroimaging scientists are masters at manipulating data to accord with their theories, and how research politics, in-fighting and many other ugly traits are relatively common. Throughout my academic career, this initial lesson has been heavily reinforced, and I think it’s a particular problem in neuroimaging, which combines a softer science with vast, complex datasets.
An ambitious scientist at the start of their career knows they need a stream of big papers to set them towards that hallowed tenured position, while an ambitious tenured scientist knows the big grants will flow if more big papers have your name on it. In other fields with large complex data sets, such as high energy physics, perhaps the transparency of the process means that you can only progress with scientific talent and genuine results. But in neuroimaging, an only slightly unscrupulous scientist can learn the many tricks hidden in this huge pile of data with its many analysis paths, to dress it up as a bold new set of results, even if the design is flawed and the analyses are invalid. I wouldn’t call this fraud, as the scientist might well have some skewed self-justification for their invalid steps, and it’s definitely not as if they are making up data from scratch – just exploiting the complexities to find some analysis that shows what they want – usually in a heavily statistically uncorrected way (though it might not be so obvious that this is happening when reading the paper).
This is not a hypothetical scientist, by the way. I know of a few senior scientists that employ such “techniques”, and any neuroimaging researcher who’s been in the field for some years could probably do the same. One huge issue here is that, as long as they can get rewarded for their tricks, by publishing, then they can flourish and progress in the field, and perpetuate these unscientific habits, which can even become general fashions (perhaps using uncorrected stats is one example here). The reviewers and editor should, ideally, stop such publications, but sometimes the reviewer is ignorant about the flaws, some of which can be quite subtle. At other times, though, there are cultural issues that lend a hand.
Some years ago, an editor at Nature Neuroscience – the most prominent specialist journal to publish neuroimaging results – came to give a talk at my old Cambridge department, the Medical Research Council Cognition and Brain Sciences Unit. When discussing what factors help some authors achieve repeated publications in this journal, she described how the author’s careful choice of which people to recommend for review and which reviewers to exclude was an influential component. One striking feature of the review process, which the non-scientific world is probably unaware of, is that in almost all journals, authors get to recommend who should review their manuscript. In principle there needn’t be anything wrong with this – after all, the author is best placed to know who in the field is most able to judge the topic of the paper, and the over-busy editor could use all the help they can get. And there is certainly no guarantee that a recommended reviewer will end up reviewing the manuscript – for one thing, they might just be too busy at that time. In practice, though, an ambitious author can easily exploit this system and recommend friends or even ex-lab members who are sure to review the manuscript favourably, and blacklist those who, perhaps for clear scientific reasons, will not. After all, the friendly reviewer knows that the author will soon be a reviewer for their papers, and the favour will be returned. The fact that the review process is ostensibly anonymous is meant to address this issue, but it can be easily bypassed.
A related trick is to send your manuscript to a journal where your friend and colleague is the main editor, and who will accept your manuscript, almost regardless of what the reviewers say. I should emphasise that these situations, while somewhat uncommon, are certainly not just hypothetical. For instance, for quite prominent journals, I have reviewed papers which were terribly shoddy, methodologically appalling with uncorrected statistics or far worse, and I as well as the other reviewer recommended against publication. I then found a year later that the article was published anyway, and did know that the lead author used to be in the same lab as the editor.
Of course, there is a wealth of exciting, valid, rigorous neuroimaging studies published, and the field is slowly becoming more standardised and robust as it matures. But, as I wrote in Twitter, the majority of neuroimaging studies I come across are so flawed, either due to design or statistical errors, that they add virtually nothing to my knowledge.
What can be done?
Okay, so we’re stuck with a series of flawed publications, imperfect education about methods, and a culture that knows it can usually get away with sloppy stats or other tricks, in order to boost publications. What can help solve some of these problems?
Of course as scientists we should strive to be more rigorous. We should consult more widely in forging our design. We should train better in the proper analysis methods, avoiding obvious mistakes like uncorrected data (which can usually be fixed by simply testing another half a dozen subjects, to increase the experiment’s statistical power). And we should try to be as honest as possible at every stage, especially by being more open about any residual drawbacks of the study.
But in some ways an even more important area for improvement is the review process. This should be made more transparent in various ways. Some journals, such as the open access Frontiers journals (which I just published in this month), publish the names of the reviewers (who are initially anonymous) towards the top of an accepted paper. This is a good first step, but perhaps the entire review discussion should be available as well somewhere.
Related specifically to neuroimaging, Dorothy Bishop made the suggestion that:
“reviewers and readers would benefit from a simple cribsheet listing the main things to look for in a methods section of a paper in this area. Is there an imaging expert out there who could write such a document, targeted at those like me, who work in this broad area, but aren’t imaging experts? Maybe it already exists, but I couldn’t find anything like that on the web.”
I think this is an excellent, pressing idea, and don’t think it would be too hard for a methodologist to generate such guidelines.
More than this, though, there should be a far greater emphasis generally on ensuring that the reviewer is equipped to judge the manuscript, and if they aren’t, then they should own up to this before reviewing. There was some talk a decade back for each neuroimaging paper to have at least one methods expert reviewing the paper, which I still think is a solid idea.
I also believe that the review process, as the shield against flawed publications, should generally be taken far more seriously than it currently is. As it stands, reviewing a paper is a thankless task we get no payment for, and usually takes (for me at least) an entire day, when almost all academics are already heavily overworked. Academic publishing is currently undergoing a revolution, amidst the call for open access. To publish in an open access journal, the author (or at least their department) has to pay a large fee, to cover the journal’s costs. Perhaps as part of this revolution, the fee could be increased by some modest amount, and the reviewers paid each time for their expertise. They would then be more likely to do a thorough job.
In addition, there should be a cultural shift in the review process, further towards not publishing a neuroimaging paper unless it’s of real worth, and has valid methods, at the very least by using corrected statistics. On the one hand, a huge amount of work may have gone into a manuscript, easily involving a year of one or more scientist’s life. And of course it’s a shame that all this work is wasted. But on the other hand, if the study is horribly flawed, the methods are invalid and so on, publishing the paper will merely drag the field down, and make it more likely that future researchers make the same mistake. I would argue that reviewers should put these personal questions entirely aside and be stubborn, tenacious and as critical as they can be, although also very clear about how the study could be improved (or even redone), to give it a later chance of publication.
Then there is the issue of nepotism in the review process. If the author has a conflict of interest, such as that they are funded by the pharmaceutical company whose drug they are testing, then they have to state this in the paper. Perhaps they should do something similar for their suggested reviewers. They could be asked, in addition to their suggested reviewer’s name, whether that person has ever worked in their lab, collaborated with them, or is considered a colleague or friend. This needn’t negate the potential reviewer being chosen, but the editor will have a firmer idea up front of the potential reviewer’s level of objectivity in the matter. And if this information was eventually attached to the potential conflict of interest section of a paper, then that would be another clue for the reader to glean about the level of rigour in the review process. Just knowing that this will happen may cause authors to choose less obviously generous reviewers in the first place.
A further issue relates to independent replication, which was one of the main topics on the Twitter debate. Should a reviewer or editor insist on independent replication of an entire study, for it to be accepted? In an ideal world, this makes some sense, but in practice, it could delay publication by a year or more, and be extremely difficult to implement. One compromise, though, is for the author to submit all their raw imaging data to an independent lab (or some new dedicated group that specializes in re-analysing data?), who can confirm that the analysis and results are sound, perhaps by using different neuroimaging software and so on. I’m not sure of the incentive for such work, beyond co-authorship on the original paper, which carries its own motivational problems. But for the top tier journals, and a particularly ground-breaking result, it’s a policy that may be worth considering.
How can a layperson know what to believe, in response to all these issues?
First off, a healthy dose of skepticism is a universally good idea.
Then for a given story you’re interested in, choose to focus not on the national newspapers (whose coverage is pitifully uncritical), but on blogs that describe it, written by scientists, who are usually quick to describe the flaws in studies. If they haven’t mentioned it, ask them these standard, but vital questions:
- Are the stats properly corrected for the multiple tests carried out?
- Are the results replicated elsewhere at all?
- If these activation areas are linked to a given function, does the blogger know of any other functions previously linked to these brain regions?
- Are there any plausible alternative interpretations of the results?
If no such blog exists, find a scientist in the field who blogs regularly and suggest they cover it.
Failing this, why not try to find out for yourself in the original paper? If a paper is behind a paywall, email the corresponding author to send you a free copy (almost all will), and see if they mention uncorrected or corrected statistics in the methods (FWE and FDR are the two main versions of corrected statistics methods), if they mention other studies with similar results, or if the main design fits with their conclusions. Sometimes these are tricky issues, even for others in the field, but other times flaws can be embarrassingly obvious to most intelligent laypeople (who occasionally see more due to a fresh perspective). In my upcoming popular science book, I relate a few examples, where a just a little common sense can destroy a well published paradigm.
If you wanted to take this further, by chatting on Twitter, Google plus, blogs and so on, most scientists should be very happy to answer your questions. I, for one, would be delighted to help if I can.