The Hellinga retractions (part 2): trust, accountability, collaborations, and training relationships.

Back in June, I wrote a post examining the Hellinga retractions. That post, which drew upon the Chemical & Engineering News article by Celia Henry Arnaud (May 5, 2008) [1], focused on the ways scientists engage with each other’s work in the published literature, and how they engage with each other more directly in trying to build on this published work. This kind of engagement is where you’re most likely to see one group of scientists reproduce the results of another, or to see their attempts to reproduce these results fail. Given that reproducibility of results is part of what supposedly underwrites the goodness of scientific knowledge, the ways scientists deal with failed attempts to reproduce results have great significance for the credibility of science.

Speaking of credibility, in that post I promised you all (and especially Abi) that there would be a part 2, drawing on the Nature news feature by Erika Check Hayden (May 15, 2008) [2]. Here it is.

In this post, I shift the focus to scientists’ relationships within a research group (rather than across research groups and through the scientific literature). In research groups in academic settings, questions of trust and accountability are complicated by differentials in experience and power (especially between graduate students and principal investigators). Academic researchers are not just in the business of producing scientific results, but also new scientists. Within a training relationship, who is making the crucial scientific decisions, and on the basis of what information?

The central relationship in this story is that between Homme W. Hellinga, professor of biochemistry at Duke University, and graduate student Mary Dwyer.

From the Nature article:

[S]tudents in Hellinga’s lab were warning Dwyer away. “It’s pretty tough,” they told her; “there are other good labs.” One student even pulled her aside and told her flat out that working with Hellinga was so difficult that she should not join the lab. By that time, that student remembers, many more students had left Hellinga’s lab than had earned doctoral degrees under his tutelage.

Yet Dwyer had done a short rotation with Hellinga’s group, and had seen nothing alarming. “I felt like I would probably be able to handle it,” she recalls. (275)

What would make a newish graduate student think she couldn’t work with a particular PI? He yells at his grad students? Expects that they be able to run impossible experiments in unreasonably short amounts of time? Slaps their asses?

Any of those behaviors would be alarming. So would sitting down with his grad students and saying, “Let’s fabricate some data.” But it’s unlikely that Hellinga did any of these things. He did not twirl his mustache, cackle maniacally, and transmit unmistakable signals that he was evil.

Rather, he struck Dwyer as a successful, if demanding, grown-up scientist. She made a judgment that her skills and determination would probably get her through a research apprenticeship with him, and that he would provide the training and guidance that would help her become a successful grown-up scientist, too.

Asked whether he agrees with claims that he is arrogant, Hellinga replies, “I would say no. Can I appear to be personally arrogant? I would imagine yes. When you are trying to do a difficult experiment, you have to have a certain amount of self-confidence to say, ‘All right, this is the moment and we think we have the techniques and ideas together to try and give this a go’.” (275)

Confidence is one of those traits grown-up scientists have that scientists-in-training want to develop. Of course, ideally, the confidence you want is grounded in reality, supported by the input of others who are doing their level best to be objective.

There’s a tricky balance between self-confidence and self-deception. A good scientist wants to stay on the right side of that line.

In any case, when a grown-up scientist has the confidence to draw on his ideas and skills to try to make a difficult piece of research work, there is always a chance that, despite his ideas and skills, it still won’t succeed. How to deal with this as a scientist, it seems to me, is at least as important as how to muster the self-confidence to try the difficult piece of research in the first place.

It’s one of the grown-up scientist life skills you want your advisor to help you develop.

Hellinga chose Dwyer and another student, Loren Looger, to work on the project in Escherichia coli bacteria. The pair were to transform E. coli’s ribose-binding protein, which has no enzymatic activity, into a TIM. Looger and Hellinga wrote computer programs to model how the structure of the ribose-binding protein could be changed to make it work like a TIM. Dwyer used the program to design mutated ribose-binding proteins, dubbed “NovoTIMs”, and tested whether they worked in the lab. (276)

In the Hellinga lab, as in most academic research labs, the labor of scientific research is divided. Presumably Hellinga, as the PI, took primary responsibility for generating the “big ideas” and broad strategies. Hellinga and Looger wrote the computer program to implement some of those ideas about the connections between protein structure and protein function. And Dwyer used that computer program to generate predictions about the activity of specific proteins, and then used her skills in the lab to try to see whether those predictions held up.

While writing (and debugging) this kind of computer program is a significant intellectual labor, synthesizing the designed proteins (in bacteria), purifying them, and assaying their activity (and that of substituted mutants) is a very different kind of challenge — one with many more parameters in need of control and multiple ways things could go wrong.

Dwyer, who describes herself as a “pretty conservative person”, was sceptical that the project would pan out. “I had my doubts all the time,” she says. After about 6 months testing 25 designs, Dwyer found that a couple of the designed proteins were active, but she also noticed some problems. The E. coli bacteria made much smaller amounts of the NovoTIM proteins than of their own natural, or native, proteins. And the NovoTIMs were very unstable.

Perhaps because of these issues, Dwyer’s experiments yielded confusing data about NovoTIM activity. When she measured the enzymes’ kinetic parameters — characteristics that describe how enzymes work — the tests didn’t always give the same results. “I felt like we couldn’t nail down the kinetic parameters because of the variability that we were seeing,” Dwyer recalls. Even after she started working with another member of the lab, “we were also getting a lot of variability. We just didn’t understand it,” Dwyer says. Hellinga says that the variability was “no more than you would expect in [such] an experiment”. (276)

Dwyer’s experiments seemed, maybe, to be showing something. The challenge, especially for a fledgling scientist, was to determine what they could be showing.

A big concern that Dwyer had, by her own account, was whether these results were robust — whether she had figured out how to control the experimental system well enough that she could repeat her experiments and get the same experimental results (at least within the experimental error). Dwyer worried that the variability she was seeing meant that there were important parameters that were not being properly controlled. If the system is not well controlled, it’s harder to know what you’re seeing, whether you’re seeing it reliably, and what exactly might account for the different outcomes in different experimental runs.

Hellinga, on the other hand, thought the amount of variability they were seeing was normal, and that the results were robust enough to be counted as real results and reported to other scientists in the literature.

Between Hellinga’s self-confidence and Dwyer’s doubt, where is the appropriate middle ground? The scientific trainee hopes that the grown-up scientist knows where to find this balance. But can the PI declare it unilaterally? Or must there be a real conversation where each party takes the concerns of the other seriously before arriving at a decision both can endorse?

As the PI, Hellinga made the call, although we don’t know what kinds of discussions or negotiations actually preceded this decision. The paper was written and submitted to Science. I don’t know how responsibility for drafting and editing the manuscript was apportioned within the research group, but it would not be surprising if Hellinga took charge of the writing, given his experience in writing scientific papers and his confidence in the results.

The paper did not mention the variability Dwyer had noticed. It included only her best data and claimed victory. (276)

A scientific paper is supposed to give an accurate account of what was found, plus adequate information on how the results were obtained (so that other scientists, using these instructions, can repeat experiments and get the same results within experimental error). Such a paper also includes the authors’ take on what these results mean, why they matter, and how they fit into the existing body of knowledge. And, it identifies who is responsible for this scientific report and for the scientific research being reported.

That the experimental picture painted in the paper was so much rosier than the real experimental results obtained by Dwyer is disturbing. It was, especially, disturbing to Dwyer:

“I wanted to work more on the variability issue,” along with other odd results she had seen. “I felt we weren’t quite there yet.”

Dwyer says that she raised her concerns with Hellinga at the time. But Hellinga says he does not feel he pushed Dwyer or anyone else to publish prematurely. “These things were talked through very carefully with all the people involved,” he says. (276)

Now, Dwyer was the first author of the Science paper, and there are those inclined to assert that the first author has the primary responsibility for all aspects of the paper and the research reported in it. I think it’s harder to make this assertion when the first author is also a graduate student.

What should the PI’s involvement be, in the preparation of the scientific paper, the oversight of the research, the training of the graduate students to be grown-up scientists? While the PI may not be getting his hands dirty synthesizing or purifying the proteins, should he be overseeing the experiments in some way? Troubleshooting experimental controls and possible sources of error? Working through data to see if they make sense? Looking at the accumulated runs to judge whether the results are robust?

Should the PI be skeptical? Should the PI be a cheerleader? Both of these at different points in the process?

Ought the PI to involve the graduate student in the decision about whether a piece of research is ready to publicize? To involve the graduate student in a discussion about what’s involved more generally in deciding when it’s ready?

If the graduate student is asked to trust the PI’s judgment, will the PI be accountable for that judgment? Can the PI help the student to develop her own judgment?

And do representations of who is responsible for the scientific findings reported in a paper remain robust when the findings end up looking less solid than they did at first?

Because, as you’ll recall from part 1, those findings did end up looking less and less solid.

As Hellinga’s career was skyrocketing, it was perhaps easy for him to overlook a letter that crossed his desk in December 2004 amidst the flurry of accolades. “Dear Professor Hellinga,” it began. “I was wondering if you would be interested in collaborating.” (276)

At this point in the story, scientists from outside the Hellinga lab start to enter the conversation. In particular, biochemist John P. Richard at SUNY-Buffalo was seeking Hellinga’s assistance. Having published his findings, Hellinga had to be able to account not just to his trainees and underlings, but also to other grown-up scientists.

Richard had developed a method to analyze reactions catalysed by TIM. He had seen Hellinga’s Science paper and wanted to compare the characteristics of the NovoTIMs with those of normal TIMs. Richard proposed such experiments to Hellinga, but received no response. “It wasn’t a high priority,” Hellinga says. …

On 9 August [2006], [UC-Berkeley professor Jack] Kirsch sent Hellinga an e-mail. “[Richard] informed me recently that he had sent you an e-mail requesting material,” Kirsch wrote. “Is there any reason why you cannot comply with his request?”

That email seemed to grease the wheels. On 20 October, Hellinga wrote to Richard, agreeing to send DNA templates for the NovoTIMs he had made for the Science paper. He also sent templates for a second batch of NovoTIMs made by Dwyer and another researcher the year before. A paper describing these new proteins was about to be published in the Journal of Molecular Biology. Hellinga sent Richard instructions for expressing and purifying all the NovoTIMs, as well as a note: “I hope that your experiments will be successful, and look forward to seeing the profiles for these designs.” (276)

Of course, it seems Hellinga had enough else going on that he didn’t prioritize communicating with those other grown-up scientists. When it finally did move to the top of his queue, the results about which Hellinga had been so confident started to unravel.

As [chemist Tina] Amyes studied the NovoTIMs throughout the first half of 2007, nothing about them was as Hellinga had reported, and everything suggested that they were wild-type TIMs.

[Technician Astrid] Koudelka then modified Hellinga’s procedures by using a continuous gradient elution, a more powerful purification than the step elution. The new method cleanly separated the NovoTIMs from the contaminants. But when Amyes analysed the pure NovoTIMs, they had no enzymatic activity. Instead, the contaminating proteins were active — and looked just like wild-type E. coli TIMs. (277)

Using the procedures Hellinga reported, Amyes and Koudelka were unable to obtain the results Hellinga reported. Using a different (and better) technique, they got results much different from those Hellinga reported.

“I was sort of distressed,” says Richard. “We spent quite a bit of time, money and resources to basically do nothing, to show something was wrong.” Yet the team felt an obligation to try to correct the scientific record. “Just saying, ‘This is not right, let’s discard it and move on’ — that’s not fair to the scientific community,” Koudelka says. (277)

Richard and Amyes and Koudelka all understood their responsibilities as part of the scientific community to share their findings. At this point, it was still possible for Hellinga to say, “Hmm, maybe we goofed,” and to take up his part of the responsibility of correcting the record. However,

In a 30 July [2007] e-mail, Hellinga wrote that the key experiments “have been repeated several times by different individuals in my research group”. The experiments included the tests that detected NovoTIM activity, and a set of negative control experiments. These negative controls — not shown in either paper — found no activity in purified ribose-binding proteins, Hellinga said. But he agreed to look again at the NovoTIMs: “We will carry out a purification similar to the one that you describe,” he wrote. (277)

Apparently, Hellinga’s self-confidence had not flagged. He assured the other scientists that the negative controls had been run. He didn’t quite explain why these negative controls, if run, had not been published. That they were not published might well have had an effect on the other scientists’ confidence in Hellinga and his lab as a reliable source of scientific information.

All this time, Dwyer had heard nothing about Richard’s communication with Hellinga. After earning her doctorate in 2004, she had left Hellinga’s lab in 2005 to pursue postdoctoral research in a different department. So she was not seriously concerned when Hellinga e-mailed her on the Labor Day holiday on 3 September last year, asking her to meet with him later in the week to discuss issues about NovoTIM. But Dwyer’s new adviser, Donald McDonnell, a professor of pharmacology and cancer biology, advised her not to meet Hellinga alone; he felt she should go with someone who could advocate on her behalf. McDonnell arranged a meeting later that week at which he, Dwyer and Hellinga were joined by two other faculty members from the biochemistry department. And that’s when Hellinga dropped the bombshell. “He said, ‘I find it really hard to believe that you didn’t make this up’, and he kept saying that kind of statement over and over again,” Dwyer says. “It was horrible.” (277)

This glimpse of the interaction between Dwyer and Hellinga is troubling. To the extent that Dwyer was accountable for the research coming into question, why wouldn’t Hellinga include her in these conversations — indeed, put her in touch with Richard and Amyes and Koudelka to get a handle on the system? Was she being excluded from these discussions because Hellinga didn’t trust her? If that were the case, what steps had Hellinga taken as the PI and the person charged with training Dwyer to ensure that her results were trustworthy prior to publication?

While Dwyer may have been the first author on one of the papers called into question, Hellinga shared that author line. As a grown-up scientist, he was accountable for the published results and for Dwyer’s training as a scientist; being listed as an author amounted to making assurances about both to the scientific community.

Dwyer showed Hellinga the data from her lab notebooks that, she thought, exonerated her. But, she recalls, “he didn’t want to look at any of that. It was just flat out my fault, and that was it.” Hellinga remembers it differently. “That’s not true,” he says. “Of course I looked at the data. I also had people in my lab repeat the experiments,” he says. (277)

Here, rather than examining their shared responsibility for the paper they published, Hellinga seems to have been most interested in assigning blame. To the extent that Hellinga could be expected to have responsibility for teaching Dwyer how to be a good scientist, maneuvering away from taking any responsibility here is kind of shocking.

On 8 October, Hellinga wrote to Richard. “We have completed our repeat experiments on NovoTIM,” he wrote. “I concur with your finding that the NovoTIM designs do not exhibit enzymatic activity, and that the reported activity is due to a contaminating activity which is very likely to be the endogenous, wild-type triose phosphate isomerase.” The repeat negative control experiments, Hellinga wrote, had found “TIM activity in the wild-type [ribose-binding protein] preparations prepared by the step gradient elution method.”

He added that the repeat experiments were done by three people, “but NOT Mary Dwyer, the author responsible for executing the experiments described in the Science paper, and responsible in large part for the negative control experiment in the Journal of Molecular Biology paper.” By naming Dwyer as the scientist primarily responsible for the experiments, Hellinga seemed to contradict his 30 July e-mail to Richard, in which he said “different individuals” had been involved. However, Hellinga clarified to Nature that his July e-mail was “slightly inaccurate”; at that time, Dwyer was the only person who had performed the negative controls, he says. (277-278)

Slightly inaccurate?

From the point of view of making the case that the results were robust — that different researchers in different labs could be expected to get the same results — this looks like a significant misrepresentation to me.

Between the claims of the published papers — and the way they were contradicted by the findings of other scientists trying to reproduce them — and the mixed messages Hellinga seemed to be giving out about his own confidence in this research, it was starting to look like something fishy was going on. An inquiry was convened, and Dwyer found herself at the center of the inquiry.

A committee on research misconduct [at Duke University] convened a formal inquiry hearing in December, at which Dwyer was asked to address the claims against her. On 4 February, she received a letter from Wesley Byerly, an associate dean in the medical school, clearing her of the allegation of falsifying and fabricating results. (278)

Dwyer was cleared of falsifying or fabricating results. Still, to the extent that the results were not robust, this is an indication that there probably should have been more quality control applied to them within the Hellinga lab before the findings were published.

Who is accountable for that? If the person who conducted the experiments is still a trainee, working under an advisor who is supposed to train, doesn’t that advisor also bear responsibility for the results, and the sort of critical examination they should receive prior to publication?

And ought we to trust a scientist who shirks that responsibility?

On that last question, other scientists have weighed in:

“It is reprehensible,” says [University of Wisconsin-Madison enzymologist] Frey. “It is up to the adviser to instruct the student, to guide the student to find out what problems exist with the data and their interpretation of it, and to show the student what the pitfalls are.” (278)

And:

“It is a bush-league error not to purify your proteins well, especially in a paper like this,” says Wallace Cleland of the University of Wisconsin-Madison. (278)

Graduate students might be expected to make bush-league errors. The PIs training them are not.

Despite being cleared of charges of wrongdoing, Dwyer still seems to want to get to the bottom of just what she really found in the Hellinga lab:

Dwyer thinks that the issues with protein expression and assay variability are partly to blame, and says that in retrospect, the apparent decreased activity of NovoTIM mutants was actually insignificant, once experimental error is taken into account. But no one has offered a clear answer for what went wrong. (278)

Meanwhile, where is Hellinga’s head with regard to the situation?

Asked whether he would have done anything differently in the NovoTIM experiments, Hellinga says, “I would like to not have the problem that we encountered.” When asked whether the lab moved too quickly, he says: “Given how we understood things to be at the time, no. Obviously if we had known things had gone wrong, we wouldn’t have moved forward with the speed we did.” (278)

I suspect I’m not the only one who finds these answers unsatisfying. Asked the equivalent of “How would you do things differently?”, Hellinga comes across like a booted reality-show contestant who says he’d do it exactly the same way.

A little self-awareness might provide some needed balance to that self-confidence.

Because, post-fiasco, saying you’d do things exactly the same way amounts to rejecting the rules that led to the trouble or saying that the outcome is just a matter of luck. Does Hellinga really believe that sound scientific practice within a collaboration that is also a training relationship is just a matter of luck? Were the results that were published the result of bad luck or avoidable error? Even in the worst case scenario for Hellinga, a deviously sneaky graduate student fabricating or falsifying data, is there no way for a PI to be involved enough in the research to sniff that out before submitting a manuscript?

It seems like a scientist confident of his own ideas and skills might be equal to this challenge.

Unlike Hellinga, Dwyer seems to feel her responsibility to other scientists quite deeply. Not only does she feel responsible to the other scientists trying to use the results she published, but she also feels a pang at having ignored the warnings of other graduate students trying to be accountable to her as a vulnerable scientist-in-training:

“I feel incredibly guilty that I didn’t catch it, but I didn’t, and I just have to live with that. It’s been really hard,” she says. She is trying to move forwards with her life and career, she says, and is working in a new lab in a new field, endocrinology, with McDonnell. But sometimes, Dwyer says, she thinks back to the people who tried to steer her away from Hellinga’s lab so many years ago. And she wonders how different things might have been if she had heeded their advice. “Everybody gets warned, but nobody listens,” she says. “Maybe now they will.” (278)

_______
[1] Celia Henry Arnaud, “Enzyme Design Papers Retracted,” Chemical & Engineering News, 86(18), 40-41 (May 5, 2008).

[2] Erika Check Hayden, “Designer Debacle,” Nature, 453, 275-278 (May 15, 2008).

All quotations in the post are from this article, with page numbers given parenthetically.

4 Comments

Robert Bird

September 18, 2008 at 3:29 pm

Hellinga does not come out well here – it doesn’t seem like he should be mentoring students if he lacks the ability to see where he (or someone else) might be wrong, or to even think that he could possibly be wrong, nor the ability to accept responsibility for said lack. Without the ability to think that you might be wrong (or the unwillingness to see contrary data), there really isn’t an appropriate place for him to work. As a professor, the inability to be critical of his ideas (or those of others) will help him only to poorly train graduate students. At a business, the inability to question and temper one’s ideas is likely to cost the business money and time spent in the wrong places (which for a startup could be fatal).
These lacks, however, portend a profitable career in upper management (reinforced particularly by his ability to assign blame to others). If he cultivates the appropriate political connections, and becomes even more able to ignore contrary data, he could be President.

Frederick Ross

September 19, 2008 at 3:38 pm

Wait a minute, the PI is supposed to be a scientist? I thought he was the secretary in charge of securing grant money. And what is this weird idea that the grad student is supposed to execute the big ideas of the advisor? That’s the job of a technician, a skilled professional who is paid a professional’s wage.

George Smiley

September 19, 2008 at 10:19 pm

“Now, Dwyer was the first author of the Science paper, and there are those inclined to assert that the first author has the primary responsibility for all aspects of the paper and the research reported in it.”
Well, anyone who would so assert is naive, a fool, or both. The person with primary responsibility is the person listed as the contact for correspondence. To that person the (grant) bucks go; at that person’s desk the buck must stop. The person who takes the money also, by law, takes the responsibility.
This is not conceptually difficult.

hip hip array

September 22, 2008 at 9:05 pm

“Everybody gets warned…” she says.
She’s wrong there, and she doesn’t know how lucky she was. In my experience, grad students with poor mentors are generally far too afraid of negative repercussions to their own careers to warn new students off a bad lab. The fact that Hellinga’s students had the balls and integrity to be honest with Dwyer was a rare gift that she naively rejected. She’s generally right that incoming students tend not to listen to such advice, being more concerned with the subject of their research.
