Reluctance to act on suspicions about fellow scientists: inside the frauds of Diederik Stapel (part 4).

It’s time for another post in which I chew on some tidbits from Yudhijit Bhattacharjee’s incredibly thought-provoking New York Times Magazine article (published April 26, 2013) on social psychologist and scientific fraudster Diederik Stapel. (You can also look at the tidbits I chewed on in part 1, part 2, and part 3.) This time I consider the question of why it was that, despite mounting clues that Stapel’s results were too good to be true, other scientists in Stapel’s orbit were reluctant to act on their suspicions that Stapel might be up to some sort of scientific misbehavior.

Let’s look at how Bhattacharjee sets the scene in the article:

[I]n the spring of 2010, a graduate student noticed anomalies in three experiments Stapel had run for him. When asked for the raw data, Stapel initially said he no longer had it. Later that year, shortly after Stapel became dean, the student mentioned his concerns to a young professor at the university gym. Each of them spoke to me but requested anonymity because they worried their careers would be damaged if they were identified.

The bold emphasis here (and in the quoted passages that follow) is mine. I find it striking that even now, when Stapel has essentially been fully discredited as a trustworthy scientist, these two members of the scientific community feel safer not being identified. It’s not entirely obvious to me whether their worry is being identified as someone who suspected fabrication was taking place but said nothing to launch an official inquiry, or whether they fear that being identified as someone who was suspicious of a fellow scientist could itself harm their standing in the scientific community.

If you dismiss that second possibility as totally implausible, read on:

The professor, who had been hired recently, began attending Stapel’s lab meetings. He was struck by how great the data looked, no matter the experiment. “I don’t know that I ever saw that a study failed, which is highly unusual,” he told me. “Even the best people, in my experience, have studies that fail constantly. Usually, half don’t work.”

The professor approached Stapel to team up on a research project, with the intent of getting a closer look at how he worked. “I wanted to kind of play around with one of these amazing data sets,” he told me. The two of them designed studies to test the premise that reminding people of the financial crisis makes them more likely to act generously.

In early February, Stapel claimed he had run the studies. “Everything worked really well,” the professor told me wryly. Stapel claimed there was a statistical relationship between awareness of the financial crisis and generosity. But when the professor looked at the data, he discovered inconsistencies confirming his suspicions that Stapel was engaging in fraud.

If one has suspicions about how reliable a fellow scientist’s results are, doing some empirical investigation seems like the right thing to do. Keeping an open mind and then examining the actual data might well show one’s suspicions to be unfounded.

Of course, that’s not what happened here; the data strengthened the professor’s suspicions instead. Given a reason for doubt that now had stronger empirical support, and given that scientists are trying to build a shared body of scientific knowledge (unreliable papers in the literature hurt the knowledge-building efforts of every scientist who trusts that the work reported there was done honestly), you would think the time was right for this professor to pass on what he had found to those at the university who could investigate further. Right?

The professor consulted a senior colleague in the United States, who told him he shouldn’t feel any obligation to report the matter.

For all the talk of science, and the scientific literature, being “self-correcting,” it’s hard to imagine the precise mechanism for such self-correction in a world where no scientist who is aware of likely scientific misconduct feels any obligation to report the matter.

But the person who alerted the young professor, along with another graduate student, refused to let it go. That spring, the other graduate student examined a number of data sets that Stapel had supplied to students and postdocs in recent years, many of which led to papers and dissertations. She found a host of anomalies, the smoking gun being a data set in which Stapel appeared to have done a copy-paste job, leaving two rows of data nearly identical to each other.

The two students decided to report the charges to the department head, Marcel Zeelenberg. But they worried that Zeelenberg, Stapel’s friend, might come to his defense. To sound him out, one of the students made up a scenario about a professor who committed academic fraud, and asked Zeelenberg what he thought about the situation, without telling him it was hypothetical. “They should hang him from the highest tree” if the allegations were true, was Zeelenberg’s response, according to the student.

Some might think these students were being excessively cautious, but the sad fact is that scientists faced with allegations of misconduct against a colleague — especially if they are brought by students — frequently side with their colleague and retaliate against those making the allegations. Students, after all, are new members of one’s professional community, so green that one might not even think of them as full members yet. They are low-status, they are learning how things work, they are judged likely to have misunderstood what they have seen. And, in contrast to one’s colleagues, students are transients. They are just passing through the training program, whereas you might hope to be with your colleagues for your whole professional life. In a case of dueling testimony, who are you more likely to believe?

Maybe the question should be whether your bias towards believing one over the other is strong enough to keep you from examining the available evidence to determine whether your trust is misplaced.
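
Examining the evidence in a case like this can begin with checks as simple as the one that produced the smoking gun above: looking for rows of data that are suspiciously alike. Below is a minimal sketch of such a check, written in Python with pandas. It is not a reconstruction of what the graduate student actually did; the file name, the data, and the tolerance are all hypothetical, chosen only to illustrate the idea.

```python
from itertools import combinations

import pandas as pd

def find_near_duplicate_rows(df: pd.DataFrame, tolerance: float = 1e-9):
    """Return pairs of row indices whose numeric columns agree within `tolerance`."""
    numeric = df.select_dtypes(include="number")
    suspects = []
    for i, j in combinations(numeric.index, 2):
        if (numeric.loc[i] - numeric.loc[j]).abs().max() <= tolerance:
            suspects.append((i, j))
    return suspects

if __name__ == "__main__":
    # Hypothetical file of participant responses; not data from the Stapel case.
    data = pd.read_csv("responses.csv")
    for i, j in find_near_duplicate_rows(data):
        print(f"rows {i} and {j} are (nearly) identical -- worth a closer look")
```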

The students waited till the end of summer, when they would be at a conference with Zeelenberg in London. “We decided we should tell Marcel at the conference so that he couldn’t storm out and go to Diederik right away,” one of the students told me.

In London, the students met with Zeelenberg after dinner in the dorm where they were staying. As the night wore on, his initial skepticism turned into shock. It was nearly 3 when Zeelenberg finished his last beer and walked back to his room in a daze. In Tilburg that weekend, he confronted Stapel.

It might not be universally true, but at least some of the people who will lie about their scientific findings in a journal article will lie right to your face about whether they obtained those findings honestly. Yet lots of us think we can tell — at least with the people we know — whether they are being honest with us. This hunch can be just as wrong as the wrongest scientific hunch waiting for us to accumulate empirical evidence against it.

The students seeking Zeelenberg’s help in investigating Stapel’s misbehavior engineered a situation in which Zeelenberg would have to look at the empirical evidence before he could look his colleague in the eye and ask him whether he was fabricating his results. They had already gotten him to say, at least in the abstract, that the kind of behavior they had reason to believe Stapel was engaging in was unacceptable in their scientific community. To make a conscious decision to ignore the empirical evidence, Zeelenberg would have had to see himself as displaying a kind of intellectual dishonesty, because if fabrication is harmful to science, it is harmful no matter who perpetrates it.

As it was, Zeelenberg likely had to make the painful concession that he had misjudged his colleague’s character and trustworthiness. But having wrong hunches in science is much less of a crime than clinging to those hunches in the face of mounting evidence against them.

Doing good science requires a delicate balance of trust and accountability. Scientists’ default position is to trust that other scientists are making honest efforts to build reliable scientific knowledge about the world, using empirical evidence and methods of inference that they display for the inspection (and critique) of their colleagues. Not to hold this default position means you have to build all your knowledge of the world yourself (which makes achieving anything like objective knowledge really hard). However, this trust is not unconditional, which is where the accountability comes in. Scientists recognize that they need to be transparent about what they did to build the knowledge — to be accountable when other scientists ask questions or disagree about conclusions — else that trust evaporates. When the evidence warrants it, distrusting a fellow scientist is not mean or uncollegial — it’s your duty. We need the help of others to build scientific knowledge, but if they insist that we ignore evidence of their scientific misbehavior, they’re not actually helping.

Scientific training and the Kobayashi Maru: inside the frauds of Diederik Stapel (part 3).

This post continues my discussion of issues raised in the article by Yudhijit Bhattacharjee in the New York Times Magazine (published April 26, 2013) on social psychologist and scientific fraudster Diederik Stapel. Part 1 looked at how expecting to find a particular kind of order in the universe may leave a scientific community more vulnerable to a fraudster claiming to have found results that display just that kind of order. Part 2 looked at some of the ways Stapel’s conduct did harm to the students he was supposed to be training to be scientists. Here, I want to point out another way that Stapel failed his students — ironically, by shielding them from failure.

Bhattacharjee writes:

[I]n the spring of 2010, a graduate student noticed anomalies in three experiments Stapel had run for him. When asked for the raw data, Stapel initially said he no longer had it. Later that year, shortly after Stapel became dean, the student mentioned his concerns to a young professor at the university gym. Each of them spoke to me but requested anonymity because they worried their careers would be damaged if they were identified.

The professor, who had been hired recently, began attending Stapel’s lab meetings. He was struck by how great the data looked, no matter the experiment. “I don’t know that I ever saw that a study failed, which is highly unusual,” he told me. “Even the best people, in my experience, have studies that fail constantly. Usually, half don’t work.”

In the next post, we’ll look at how this other professor’s curiosity about Stapel’s too-good-to-be-true results led to the unraveling of Stapel’s fraud. But I think it’s worth pausing here to say a bit more on how very odd a training environment Stapel’s research group provided for his students.

None of his studies failed. Since, as we saw in the last post, Stapel was also conducting (or, more accurately, claiming to conduct) his students’ studies, that means none of his students’ studies failed.

This is pretty much the opposite of every graduate student experience in an empirical field that I have heard described. Most studies fail. Getting to a 50% success rate with your empirical studies is a significant achievement.

Graduate students who are also Trekkies usually come to recognize that the travails of empirical studies are like a version of the Kobayashi Maru.

Introduced in Star Trek II: The Wrath of Khan, the Kobayashi Maru is a training simulation in which Starfleet cadets are presented with a civilian ship in distress. Saving the civilians requires the cadet to violate a treaty by entering the Neutral Zone (and in the simulation, this choice results in a Klingon attack and the boarding of the cadet’s ship). Honoring the treaty, on the other hand, means abandoning the civilians and their disabled ship in the Neutral Zone. The Kobayashi Maru is designed as a “no-win” scenario; the intent of the test is to discover how trainees face such a situation. Wikipedia notes that, owing to James T. Kirk’s performance on the test, some Trekkies also view the Kobayashi Maru as a problem whose solution depends on redefining the problem.

Scientific knowledge-building turns out to be packed with particular plans that cannot succeed at yielding the particular pieces of knowledge the scientists hope to discover. This is because scientists are formulating plans on the basis of what is already known to try to reveal what isn’t yet known — so knowing where to look, or what tools to use to do the looking, or what other features of the world are there to confound your ability to get clear information with those tools, is pretty hard.

Failed attempts happen. If they’re the sort of thing that will crush your spirit and leave you unable to shake it off and try it again, or to come up with a new strategy to try, then the life of a scientist will be a pretty hard life for you.

Grown-up scientists have studies fail all the time. Graduate students training to be scientists do, too. But graduate students also have mentors who are supposed to help them bounce back from failure — to figure out the most likely sources of failure, whether it’s worth trying the study again, whether a new approach would be better, whether some crucial piece of knowledge has been learned despite the failure of what was planned. Mentors give scientific trainees a set of strategies for responding to particular failures, and they also give reassurance that even good scientists fail.

Scientific knowledge is built by actual humans who don’t have perfect foresight about the features of the world as yet undiscovered, humans who don’t have perfectly precise instruments (or hands and eyes using those instruments), humans who sometimes mess up in executing their protocols. Yet the knowledge is built, and it frequently works pretty well.

In the context of scientific training, it strikes me as malpractice to send new scientists out into the world with the expectation that all of their studies should work, and without any experience grappling with studies that don’t work. Shielding his students from their Kobayashi Maru is just one more way Diederik Stapel cheated them out of a good scientific training.

Failing the scientists-in-training: inside the frauds of Diederik Stapel (part 2)

In this post, I’m continuing my discussion of the excellent article by Yudhijit Bhattacharjee in the New York Times Magazine (published April 26, 2013) on social psychologist and scientific fraudster Diederik Stapel. The last post considered how being disposed to expect order in the universe might have made other scientists in Stapel’s community less critical of his (fabricated) results than they could have been. Here, I want to shift my focus to some of the harm Stapel did beyond introducing lies to the scientific literature — specifically, the harm he did to the students he was supposed to be training to become good scientists.

I suppose it’s logically possible for a scientist to commit misconduct in a limited domain — say, to make up the results of his own research projects but to make every effort to train his students to be honest scientists. This doesn’t strike me as a likely scenario, though. Publishing fraudulent results as if they were factual is lying to one’s fellow scientists — including the generation of scientists one is training. Moreover, most research groups pursue interlocking questions, meaning that the questions the grad students are working to answer generally build on pieces of knowledge the boss has built — or, in Stapel’s case “built”. This means that at minimum, a fabricating PI is probably wasting his trainees’ time by letting them base their own research efforts on claims that there’s no good scientific reason to trust.

And as Bhattacharjee describes the situation for Stapel’s trainees, things for them were even worse:

He [Stapel] published more than two dozen studies while at Groningen, many of them written with his doctoral students. They don’t appear to have questioned why their supervisor was running many of the experiments for them. Nor did his colleagues inquire about this unusual practice.

(Bold emphasis added.)

I’d have thought that one of the things a scientist-in-training hopes to learn in the course of her graduate studies is not just how to design a good experiment, but how to implement it. Making your experimental design work in the real world is often much harder than it seems like it will be, but you learn from these difficulties — about the parameters you ignored in the design that turn out to be important, about the limitations of your measurement strategies, about ways the system you’re studying frustrates the expectations you had about it before you were actually interacting with it.

I’ll even go out on a limb and say that some experience doing experiments can make a significant difference in a scientist’s skill conceiving of experimental approaches to problems.

That Stapel cut his students out of doing the experiments was downright weird.

Now, scientific trainees probably don’t have the most realistic picture of precisely what competencies they need to master to become successful grown-up scientists in a field. They trust that the grown-up scientists training them know what these competencies are, and that these grown-up scientists will make sure that they encounter them in their training. Stapel’s trainees likely trusted him to guide them. Maybe they thought that he would have them conducting experiments if that were a skill that would require a significant amount of time or effort to master. Maybe they assumed that implementing the experiments they had designed was just so straightforward that Stapel thought they were better served working to learn other competencies instead.

(For that to be the case, though, Stapel would have to be the world’s most reassuring graduate advisor. I know my impostor complex was strong enough that I wouldn’t have believed I could do an experiment my boss or my fellow grad students viewed as totally easy until I had actually done it successfully three times. If I had to bet money, it would be that some of Stapel’s trainees wanted to learn how to do the experiments, but they were too scared to ask.)

There’s no reason, however, that Stapel’s colleagues should have thought it was OK that his trainees were not learning how to do experiments by taking charge of doing their own. If they did know and they did nothing, they were complicit in a failure to provide adequate scientific training to trainees in their program. If they didn’t know, that’s an argument that departments ought to take more responsibility for their trainees and to exercise more oversight rather than leaving each trainee to the mercies of his or her advisor.

And, as becomes clear from the New York Times Magazine article, doing experiments wasn’t the only piece of standard scientific training of which Stapel’s trainees were deprived. Bhattacharjee describes the revelation when a colleague collaborated with Stapel on a piece of research:

Stapel and [Ad] Vingerhoets [a colleague of his at Tilburg] worked together with a research assistant to prepare the coloring pages and the questionnaires. Stapel told Vingerhoets that he would collect the data from a school where he had contacts. A few weeks later, he called Vingerhoets to his office and showed him the results, scribbled on a sheet of paper. Vingerhoets was delighted to see a significant difference between the two conditions, indicating that children exposed to a teary-eyed picture were much more willing to share candy. It was sure to result in a high-profile publication. “I said, ‘This is so fantastic, so incredible,’ ” Vingerhoets told me.

He began writing the paper, but then he wondered if the data had shown any difference between girls and boys. “What about gender differences?” he asked Stapel, requesting to see the data. Stapel told him the data hadn’t been entered into a computer yet.

Vingerhoets was stumped. Stapel had shown him means and standard deviations and even a statistical index attesting to the reliability of the questionnaire, which would have seemed to require a computer to produce. Vingerhoets wondered if Stapel, as dean, was somehow testing him. Suspecting fraud, he consulted a retired professor to figure out what to do. “Do you really believe that someone with [Stapel’s] status faked data?” the professor asked him.

“At that moment,” Vingerhoets told me, “I decided that I would not report it to the rector.”

Stapel’s modus operandi was to make up his results out of whole cloth — to produce “findings” that looked statistically plausible without the muss and fuss of conducting actual experiments or collecting actual data. Indeed, since the thing he was creating that needed to look plausible enough to be accepted by his fellow scientists was the analyzed data, he didn’t bother making up raw data from which such an analysis could be generated.

Connecting the dots here, this surely means that Stapel’s trainees must not have gotten any experience dealing with raw data or learning how to apply methods of analysis to actual data sets. This left another gaping hole in the scientific training they deserved.
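
To make concrete what that missing experience looks like, here is a minimal sketch of the analysis step in question: going from a raw response matrix to item means, standard deviations, and a reliability index. Cronbach’s alpha is used as a stand-in (the article doesn’t say which reliability index Stapel showed Vingerhoets), and the data below are invented. The point is only that every one of those reported numbers presupposes raw data that has actually been entered into a computer and processed.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Reliability estimate; rows are respondents, columns are questionnaire items."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

if __name__ == "__main__":
    # Invented raw data: 40 respondents answering 5 items on a 1-5 scale.
    rng = np.random.default_rng(0)
    raw = rng.integers(1, 6, size=(40, 5)).astype(float)

    print("item means:", raw.mean(axis=0))
    print("item standard deviations:", raw.std(axis=0, ddof=1))
    # With random, uncorrelated items alpha will be low; the value isn't the point.
    print("Cronbach's alpha:", round(cronbach_alpha(raw), 3))
```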

It would seem that those being trained by other scientists in Stapel’s program were getting some experience in conducting experiments, collecting data, and analyzing their data — since that experimentation, data collection, and data analysis became fodder for discussion in the ethics training that Stapel led. From the article:

And yet as part of a graduate seminar he taught on research ethics, Stapel would ask his students to dig back into their own research and look for things that might have been unethical. “They got back with terrible lapses,” he told me. “No informed consent, no debriefing of subjects, then of course in data analysis, looking only at some data and not all the data.” He didn’t see the same problems in his own work, he said, because there were no real data to contend with.

I would love to know the process by which Stapel’s program decided that he was the best one to teach the graduate seminar on research ethics. I wonder if this particular teaching assignment was one of those burdens that his colleagues tried to dodge, or if research ethics was viewed as a teaching assignment requiring no special expertise. I wonder how it’s sitting with them that they let a now-famous cheater teach their grad students how to be ethical scientists.

The whole “those who can’t do, teach” adage rings hollow here.

The quest for underlying order: inside the frauds of Diederik Stapel (part 1)

Yudhijit Bhattacharjee has an excellent article in the most recent New York Times Magazine (published April 26, 2013) on disgraced Dutch social psychologist Diederik Stapel. Why is Stapel disgraced? At the last count at Retraction Watch, 53 of his scientific publications have been retracted (down from an earlier count of 54), owing to the fact that the results reported in those publications were made up. [Scroll in that Retraction Watch post for the update — apparently one of the Stapel retractions was double-counted. This is the risk when you publish so much made-up stuff.]

There’s not much to say about the badness of a scientist making results up. Science is supposed to be an activity in which people build a body of reliable knowledge about the world, grounding that knowledge in actual empirical observations of that world. Substituting the story you want to tell for those actual empirical observations undercuts that goal.

But Bhattacharjee’s article is fascinating because it goes some way to helping illuminate why Stapel abandoned the path of scientific discovery and went down the path of scientific fraud instead. It shows us some of the forces and habits that, while seemingly innocuous taken individually, can compound to reinforce scientific behavior that is not helpful to the project of knowledge-building. It reveals forces within scientific communities that make it hard for scientists to pursue suspicions of fraud to get formal determinations of whether their colleagues are actually cheating. And, the article exposes some of the harms Stapel committed beyond publishing lies as scientific findings.

It’s an incredibly rich piece of reporting, one which I recommend you read in its entirety, maybe more than once. Given just how much there is to talk about here, I’ll be taking at least a few posts to highlight bits of the article as nourishing food for thought.

Let’s start with how Stapel describes his early motivation for fabricating results to Bhattacharjee. From the article:

Stapel did not deny that his deceit was driven by ambition. But it was more complicated than that, he told me. He insisted that he loved social psychology but had been frustrated by the messiness of experimental data, which rarely led to clear conclusions. His lifelong obsession with elegance and order, he said, led him to concoct sexy results that journals found attractive. “It was a quest for aesthetics, for beauty — instead of the truth,” he said. He described his behavior as an addiction that drove him to carry out acts of increasingly daring fraud, like a junkie seeking a bigger and better high.

(Bold emphasis added.)

It’s worth noting here that other scientists — plenty of scientists who were never cheaters, in fact — have also pursued science as a quest for beauty, elegance, and order. For many, science is powerful because it is a way to find order in a messy universe, to discover simple natural laws that give rise to such an array of complex phenomena. We’ve discussed this here before, when looking at the tension between Platonist and Aristotelian strategies for getting to objective truths:

Plato’s view was that the stuff of our world consists largely of imperfect material instantiations of immaterial ideal forms, and that science makes the observations it does of many examples of material stuff to get a handle on those ideal forms.

If you know the allegory of the cave, however, you know that Plato didn’t put much faith in feeble human sense organs as a route to grasping the forms. The very imperfection of those material instantiations that our sense organs apprehend would be bound to mislead us about the forms. Instead, Plato thought we’d need to use the mind to grasp the forms.

This is a crucial juncture where Aristotle parted ways with Plato. Aristotle still thought that there was something like the forms, but he rejected Plato’s full-strength rationalism in favor of an empirical approach to grasping them. If you wanted to get a handle on the form of “horse,” for example, Aristotle thought the thing to do was to examine lots of actual specimens of horse and to identify the essence they all have in common. The Aristotelian approach probably feels more sensible to modern scientists than the Platonist alternative, but note that we’re still talking about arriving at a description of “horse-ness” that transcends the observable features of any particular horse.

Honest scientists simultaneously reach for beautiful order and the truth. They use careful observations of the world to try to discern the actual structures and forces giving rise to what they are observing. They recognize that our observational powers are imperfect, that our measurements are not infinitely precise (and that they are often at least a little inaccurate), but those observations, those measurements, are what we have to work with in discerning the order underlying them.

This is why Ockham’s razor — to prefer simple explanations for phenomena over more complicated ones — is a strategy but not a rule. Scientists go into their knowledge-building endeavor with the hunch that the world has more underlying order than is immediately apparent to us — and that careful empirical study will help us discover that order — but how things actually are provides a constraint on how much elegance there is to be found.

However, as the article in the New York Times Magazine makes clear, Stapel was not alone in expecting the world he was trying to describe in his research to yield elegance:

In his early years of research — when he supposedly collected real experimental data — Stapel wrote papers laying out complicated and messy relationships between multiple variables. He soon realized that journal editors preferred simplicity. “They are actually telling you: ‘Leave out this stuff. Make it simpler,’” Stapel told me. Before long, he was striving to write elegant articles.

The journal editors’ preference here connects to a fairly common notion of understanding. Understanding a system is being able to identify the components of that system that make a difference in producing the effects of interest — and, by extension, recognizing which components of the system don’t feature prominently in bringing about the behaviors you’re studying. Again, the hunch is that there are likely to be simple mechanisms underlying apparently complex behavior. When you really understand the system, you can point out those mechanisms and explain what’s going on while leaving all the other extraneous bits in the background.

Pushing to find this kind of underlying simplicity has been a fruitful scientific strategy, but it’s a strategy that can run into trouble if the mechanisms giving rise to the behavior you’re studying are in fact complicated. There’s a phrase attributed to Einstein that captures this tension nicely: as simple as possible … but not simpler.

The journal editors, by expressing to Stapel that they liked simplicity more than messy relationships between multiple variables, were surely not telling Stapel to lie about his findings to create such simplicity. They were likely conveying their view that further study, or more careful analysis of data, might yield elegant relations that were really there but elusive. However, intentionally or not, they did communicate to Stapel that simple relationships fit better with journal editors’ hunches about what the world is like than did messy ones — and that results that seemed to reveal simple relations were thus more likely to pass through peer review without raising serious objections.

So, Stapel was aware that the gatekeepers of the literature in his field preferred elegant results. He also seemed to have felt the pressure that early-career academic scientists often feel to make all of his research time productive — where the ultimate measure of productivity is a publishable result. Again, from the New York Times Magazine article:

The experiment — and others like it — didn’t give Stapel the desired results, he said. He had the choice of abandoning the work or redoing the experiment. But he had already spent a lot of time on the research and was convinced his hypothesis was valid. “I said — you know what, I am going to create the data set,” he told me.

(Bold emphasis added.)

The sunk time clearly struck Stapel as a problem. Making a careful study of the particular psychological phenomenon he was trying to understand hadn’t yielded good results — which is to say, results that would be recognized by scientific journal editors or peer reviewers as adding to the shared body of knowledge by revealing something about the mechanism at work in the phenomenon. This is not to say that experiments with negative results don’t tell scientists something about how the world is. But what negative results tell us is usually that the available data don’t support the hypothesis, or perhaps that the experimental design wasn’t a great way to obtain data to let us evaluate that hypothesis.

Scientific journals have not generally been very interested in publishing negative results, however, so scientists tend to view them as failures. They may help us to reject appealing hypotheses or to refine experimental strategies, but they don’t usually do much to help advance a scientist’s career. If negative results don’t help you get publications, without which it’s harder to get grants to fund research that could find positive results, then the time and money spent doing all that research has been wasted.
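
For readers who want to see what a negative result amounts to in statistical terms, here is a small sketch with invented data: two experimental conditions are generated with no real difference between them, and a standard significance test is run. None of the numbers correspond to any actual study; the sketch only shows what it looks like when the available data fail to support the hypothesis.

```python
import numpy as np
from scipy import stats

# Invented data for a hypothetical two-condition experiment (e.g., amount shared
# by participants who were or were not reminded of the financial crisis). The two
# groups are drawn from the same distribution, so any apparent difference is noise.
rng = np.random.default_rng(1)
control = rng.normal(loc=5.0, scale=2.0, size=30)
reminded = rng.normal(loc=5.0, scale=2.0, size=30)

t_stat, p_value = stats.ttest_ind(reminded, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("The data appear to support a difference between conditions.")
else:
    # The usual fate of such a study: a negative result, unattractive to journals.
    print("Negative result: the data do not support the hypothesized difference.")
```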

And Stapel felt — maybe because of his hunch that the piece of the world he was trying to describe had to have an underlying order, elegance, simplicity — that his hypothesis was right. The messiness of actual data from the world got in the way of proving it, but it had to be so. And this expectation of elegance and simplicity fit perfectly with the feedback he had heard before from journal editors in his field (feedback that may well have fed Stapel’s own conviction).

A career calculation paired with a strong metaphysical commitment to underlying simplicity seems, then, to have persuaded Diederik Stapel to let his hunch weigh more heavily than the data and then to commit the cardinal sin of fabricating data that could be presented to other scientists as “evidence” to support that hunch.

No one made Diederik Stapel cross that line. But it’s probably worth thinking about the ways that commitments within scientific communities — especially methodological commitments that start to take on the strength of metaphysical commitments — could have made crossing it more tempting.