Data release, ethics, and professional survival.

In recent days, there have been signs on the horizon of an impending blogwar. Prof-like Substance fired the first volley:

[A]lmost all major genomics centers are going to a zero-embargo data release policy. Essentially, once the sequencing is done and the annotation has been run, the data is on the web in a searchable and downloadable format.

Yikes.

How many other fields put their data directly on the web before those who produced it have the opportunity to analyze it? Now, obviously no one is going to yank a genome paper right out from under the group working on it, but what about comparative studies? What about searching out specific genes for multi-gene phylogenetics? Where is the line for what is permissible to use before the genome is published? How much of a grace period do people get with data that has gone public, but that they* paid for?

—–
*Obviously we are talking about grant-funded projects, so the money is taxpayer money, not any one person’s. Nevertheless, someone came up with the idea and got it funded, so there is some ownership there.

Then, Mike the Mad Biologist fired off this reply:

Several of the large centers, including the one I work at, are funded by NIAID to sequence microorganisms related to human health and disease (analogous programs for human biology are supported by NHGRI). There’s a reason why NIH is hard-assed about data release:

Funding agencies learned this the hard way, as too many early sequencing centers resembled ‘genomic roach motels’: DNA checks in, but sequence doesn’t check out.

The funding agencies’ mission is to improve human health (or some other laudable goal), not to improve someone’s tenure package. This might seem harsh unless we remember how many of these center-based genome projects are funded. The investigator’s grant is not paying for the sequencing. In the case of NIAID, there is a white paper process. Before NIAID will approve the project, several goals have to be met in the white paper (Note: while I’m discussing NIAID, other agencies have a similar process, if different scientific objectives).

Obviously, the organism and collection of strains to be sequenced have to be relevant to human health. But the project also must have significant community input. NIAID absolutely does not want this to be an end-run around R01 grants. Consequently, these sequencing projects should not be a project that belongs to a single lab, and which lacks involvement by others in the subdiscipline (“this looks like an R01” is a pejorative). It also has to provide a community resource. In other words, data from a successful project should be used rapidly by other groups: that’s the whole point (otherwise, write an R01 proposal). The white paper should also contain a general description of the analysis goals of the project (and, ideally, who in the collaborative group will address them). If you get ‘scooped’, that’s, in part, a project planning issue.

NIAID, along with other agencies and institutes, is pushing hard for rapid public release. Why does NIAID get to call the shots? Because it’s their money.

Which brings me to the issue of ‘whose’ genomes these are. The answer is very simple: NIH’s (and by extension, the American people’s). As I mentioned above, NIH doesn’t care about your tenure package, or your dissertation (given that many dissertations and research programs are funded in part or in their entirety by NIH and other agencies, they’re already being generous†). What they want is high-quality data that are accessible to as many researchers as possible as quickly as possible. To put this (very) bluntly, medically important data should not be held hostage by career notions. That is the ethical position.

Prof-like substance hurled back a hefty latex pillow of a rejoinder:

People feel like anything that is public is free to use, and maybe they should. But how would you feel as the researcher who assembled a group of researchers from the community, put a proposal together, drummed up support from the community outside of your research team, produced and purified the sample to be sequenced (which is not exactly just using a Sigma kit in a LOT of cases), dealt with the administration issues that crop up along the way, pushed the project through (another aspect woefully underappreciated) the center, got your research community together once the data were in hand to make sense of it all and herded the cats to get the paper together? Would you feel some ownership, even if it was public dollars that funded the project?

Now what if you submitted the manuscript and then opened your copy of Science and saw the major finding that you centered the genome paper around has been plucked out by another group and published in isolation? Would you say, “well, the data’s publicly available, what’s unscrupulous about using it?”

[L]et’s couch this in the reality of the changing technology. If your choice is to have the sequencing done for free, but risk losing it right off the machine, OR to do it with your own funds (>$40,000) and have exclusive right to it until the paper is published, what are you going to choose? You can draw the line regarding big and small centers or projects all you want, but it is becoming increasingly fuzzy.

This is all to get back to my point that if major sequencing centers want to stay ahead of the curve, they have to have policies that are going to encourage, not discourage, investigators to use them.

It’s fair to say that I don’t know from genomics. However, I think the ethical landscape of this disagreement bears closer examination.

This is one of those situations in which there are multiple interested parties whose interests pull in different directions.

The scientists who generated the data have an interest in getting career credit for their hard work and ingenuity. This credit usually comes in the form of priority claims for their discoveries, established by publishing their findings in scientific journals. Such publications are also extremely helpful in building the CV that helps the scientists secure further funding so that they can do more science.

Generally speaking, scientists also have an interest in the free communication of scientific ideas and results. There are plenty of circumstances where findings communicated by other scientists can put one’s own findings in context, or provide the missing pieces of data one needs to solve the problem one is trying to solve.

For an individual scientist, then, automatic release of data with no embargo (during which the scientists who generated the data would have the first crack at drawing conclusions from it and submitting those conclusions for publication) would seem to pull with her interests if her scientific work would benefit from access to data that someone else generated and against her interests if she’s the one who generated that data. (It’s undoubtedly more complicated than this — you can imagine, for example, instances where the individual scientist might also benefit if another scientist who used the data she had generated also sought her out to initiate what turned out to be a very productive collaboration.)

The public, as the Mad Biologist points out, has an interest in the wide dissemination of scientific results that could lead to benefits in the form of cures for disease, better nutrition, or other advances in scientific knowledge that lead to improvements in human health. Generally speaking, the public has no special interest in the career fortunes of any particular scientist, the assumption being that in the project of knowledge-building, the scientists are all more or less interchangeable. Federal funding agencies (who are managing the public’s money here) seem to take the same approach. The goal is to get the knowledge built, to apply that knowledge to addressing issues that the public wants and needs addressed, and to make sure it’s available so that the time required to build that knowledge and address those issues is as short as possible.

However, even the most idealistic scientist, who’s all about building the knowledge and helping people with it, will recognize that there are selective pressures in the career realm to which one must adapt — at least if one plans to remain employed as a research scientist. For many scientists, this requires bringing in the grant money, which in turn requires a strong track record of results published in articles for which one is the first author (rather than one of the cast of thousands lost in the “et al.”).

As Mike points out, it may not be the funding agency’s brief to care about the career impact of their data-sharing policies on the researchers who produce that data. However, funding agencies might have an interest in considering how those career impacts might create significant disincentives for researchers to participate in their large-scale collaborative enterprises (like the large centers sequencing microorganisms related to human health and disease). As Prof-like Substance describes it, scientists with a survival instinct are going to do the math (on budget, available technology, time to get the data, and pressure to get their own papers out before someone else scoops them with their own data) and make choices that keep them in the scientific gene pool. These rational individual choices might not correspond to the options that would best serve the interests of other scientists, the scientific community as a whole, the funding agencies, or the public at large.

Does this mean that funding agencies making zero-embargo data sharing policies are being unethical, or that the scientists who find ways to operate outside of those policies are being unethical? It’s not clear that either is necessarily unethical. However, given the structure of the scientific competition here, doing what serves the interests of the public (or even of the scientific community as a whole) seems to undermine the individual scientist’s interests. To the extent that the public (through funding agencies) is committed to supporting science, it seems to me that it’s worth considering whether there are other ways to set up the reward structure in order to support scientists too. If things we take to be better for the efficient building of scientific knowledge (like rapid data sharing) were also things that conferred career rewards (rather than penalties), it would be a lot easier to get scientists to embrace them.

Posted in Doing science for the government, Ethical research, Institutional ethics, Professional ethics, Scientist/layperson relations, Tribe of Science.

11 Comments

  1. Yes. This. I’ve been trying to keep my nose out of this one, but the question that keeps nagging at me is “if a cited dataset were worth as much career credit as a cited article, would we even be having this argument?”

  2. A similar argument is made for patent protection for discoveries in Pharma. If the pursuit isn’t adequately incentivized, it is claimed, it will not go forward. I would support an embargo of limited but meaningful length, like 6 months. Similarly, smaller, less resource-rich groups can work in difficult areas once the field has been ceded due to difficulty, but upon a breakthrough can be crushed by much larger competition. The argument would be that the competition still gives “science” a net benefit, but it is not clear if this is only a short-term benefit or a long-term (healthy ecosystem) kind of thing.

  3. Pingback: Tweets that mention Data release, ethics, and professional survival. | Adventures in Ethics and Science -- Topsy.com

  4. Thank you, Janet, for a very cogent argument — very well said.

    MtMB falls into a standard error of socialist thinking — failing to realize that “free” distribution of resources will impact the behavior of the producers of those resources, in such a way as to make those resources ultimately more scarce for all.

    • Neuro,

      Actually, I’m well aware of the problem: I even blogged about it. But what set me off more than anything is the idea that it’s ‘someone’s’ data, when, in fact (legally), it is not (if any private entity can draw claim to it, it’s your institution).

      Like Janet, I’ve argued that data production needs to be recognized as a useful, tenure-conferring activity. But the other issue is that, in an era of rapid data analysis, labs that aren’t capable of rapid analysis need to collaborate with other groups that can do this. We can’t wait for extended periods of time for data to be released–that others will ‘scoop’ the data demonstrates the need for the rapid release. They are useful data.

      • Given that this ideal of tenure has not been adopted, and Prof-like Substance was clearly writing as a grantee and not a contractor, your entire attack on him was at best a waste of pixels.

  5. I think about these sorts of dilemmas using a distinction between internal goods (things like knowledge, new technology, and new medical therapies) and external goods (things like money, power, and credit or status). It’s a bit of an oversimplification, but you can think of them as goals and means: the goal of science is knowledge, and money and credit are means to pursuing that goal. The dilemma arises from the fact that pursuing internal goods requires having external goods (better understanding of genomes takes money), but the pursuit of external goods can corrupt or disrupt the pursuit of internal goods: your science turns into a pursuit of grant money and status, not a pursuit of new knowledge.

    Both of the bloggers you quote put more emphasis on one horn of this dilemma or the other: Prof-like Substance emphasizes the need for external goods and Mike the Mad Biologist emphasizes the priority of internal goods. You, by contrast, rightly point out both. But the solution you suggest in the last paragraph seems to me a little too optimistic. Maybe, through some clever institutional design, we can find a way to keep the two kinds of goods from pulling in opposite directions. But I have no idea what sort of institution would do that.

    To borrow an argument from Matt Brown (I’m reading his dissertation right now) and Dewey, there may not be a nice, neat solution to the problem that we can discover, as it were, through a priori reflection. It seems quite likely that the only way we can find the right balance between these competing interests is to try out different ways of organizing things and see what happens — to conduct experiments in science policy. It appears the NIH has decided the current policy puts too much emphasis on external goods; trying things this way for a little while (a few years or a decade, say) might balance things out, might cause scientists to move out of the field the way Prof-like Substance and Neuro-conservative are worried about (an overemphasis on internal goods), or might have no appreciable effect at all. We won’t really know until we try it.

  6. Jimminy folks, the Bermuda Accord (data release of sequencing projects within 24 hours of assembly) has been in effect for almost 20 years now. Tempest in a teapot, only two decades late.

  7. Janet,

    I’ll add to the comment above that, to a considerable extent, this is a result of university tenure-granting policies (as well as NIH granting policies, although it will be easier for NIH to credit people for work they’ve funded).

    Yet for some reason, the current tenure/promotion structure is viewed as immutable.

    Why?

  8. This is a case of, “there actually *is* a technical solution”.

    > Now what if you submitted the manuscript and then opened your
    > copy of Science and saw the major finding that you centered the
    > genome paper around has been plucked out by another group and
    > published in isolation? Would you say, “well, the data’s publicly
    > available, what’s unscrupulous about using it?” …

    No, I’d say, “Web admin, researchers Foo, Bar, and Barfoo work at these institutions. They seem to have snarked my dataset and published findings based upon it. Do me a favor, look up the IP blocks owned by those institutions and grep through the web logs to see if they’ve been visiting the web site that hosts the data.

    Ah, hah! You say they visited the site 145 times between Aug 1st and Aug 14th of last year? Interesting!

    Now, I must go author a letter to the editor of Science. Also, I’m going to raise a ruckus with these people at the next conference.”

    Okay, maybe you’re not into confrontation (which sort of begs the question of how you got into academic research in the first place…)

    That aside… the longstanding gripe in the science community is that the publishing cycle is generally too long. How is it that someone (a) found your data (b) analyzed your data (c) wrote up your data (d) submitted it to a journal and (e) had it accepted before you actually submitted your write up?

    Especially given that you have a substantial head start on analyzing the data, given that you’re presumably designing your investigation process with some sort of goal in mind?

    Let’s say I accept that this can happen. With what sort of frequency? Really, is this a big problem? Sure, it might happen.

    It might *also* happen that a bunch of other researchers find your data, think it will be useful, contact you, and you get another author credit. They might publish findings that you weren’t precisely looking for, and cite your main paper. They might meet you at that same conference and strike up a collaborative conversation that leads you both to the next Nobel.

    I’d say, given the academics that I’ve generally met, the second is a much more frequent occurrence. Why throw away the much-greater likelihood of all that academic goodness out of fear of a much-less-likely negative scenario?
