Bench Press

The Crossroads of Science and Tech

Data, not in papers

with 3 comments

The always thoughtful Deepak Singh brings up a great point in a recent post on his personal blog:

Not all data should be published via a peer-reviewed publication. Not every protocol needs to be. But making the data available via wikis, open data resources is pretty much a no-brainer and not just for the future. You enrich currently available data, and have the ability to leverage an additional layer of resources.

Deepak isn’t the only guy to think this; Derek Lowe from In the Pipeline raised a similar point:

Perhaps there should be a way to dump chemical data directly into some archives, the way X-ray data goes into the Protein Data Bank. That wouldn’t count for much, but it would capture things for future use. Having it not count much would decrease the incentive for anyone to fill it full of fakery, too, since there would be even less point than usual. And before anyone objects to having a big pile of non-peer-reviewed chemical data like this, keep in mind that we already have one: it’s called the patent literature, and it can be quite worthwhile.

(all emphases mine)

I think they both have a very good point. Some form of centralized data repository, even if non-peer-reviewed, could help tackle a problem that everyone hears about but nobody ever tries to solve: the lack of a central place to share negative results and protocols (akin to what this blog proposed previously for bio/pharma companies).

It could also help us re-prioritize publication and peer review efforts away from sheer data collation, which, while extremely important, is distinct from experimental/study design, data analysis, and drawing conclusions, where peer review is more valuable (there’s only so much peer review can do when looking at a data collection effort in isolation).

With modern internet technologies as fast and scalable as they are now, there’s simply no reason to use the traditional journal to chronicle every single discovery or achievement. Better to collect most of it in API-accessible, indexable repositories, where others can share and curate it, and to focus publications on building analytical insights instead.
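To make "API-accessible and indexable" a little more concrete, here is a minimal sketch of what a deposit-and-search layer could look like. It is purely illustrative: the `DataRecord` fields, the `Repository` class, and the sample record are all hypothetical names made up for this example, and a real archive would need persistent storage, identifiers such as DOIs, and access control.

```python
import json
from collections import defaultdict
from dataclasses import dataclass, asdict


@dataclass
class DataRecord:
    """One deposited dataset. Every field name here is hypothetical."""
    record_id: str
    title: str
    depositor: str
    keywords: list
    peer_reviewed: bool = False   # deposits are not peer reviewed by default
    outcome: str = "unspecified"  # e.g. "positive" or "negative"


class Repository:
    """In-memory stand-in for a data repository with a keyword index."""

    def __init__(self):
        self._records = {}               # record_id -> DataRecord
        self._index = defaultdict(set)   # lowercased keyword -> record ids

    def deposit(self, record):
        """Store a record and index it under each of its keywords."""
        self._records[record.record_id] = record
        for keyword in record.keywords:
            self._index[keyword.lower()].add(record.record_id)

    def search(self, keyword):
        """Return records tagged with the keyword (case-insensitive)."""
        ids = sorted(self._index.get(keyword.lower(), set()))
        return [self._records[i] for i in ids]

    def export_json(self, record_id):
        """Serialize one record the way an API endpoint might return it."""
        return json.dumps(asdict(self._records[record_id]), sort_keys=True)


# Made-up example deposit, for illustration only.
repo = Repository()
repo.deposit(DataRecord(
    record_id="example-001",
    title="Example assay that did not work",
    depositor="example-lab",
    keywords=["assay", "negative result"],
    outcome="negative",
))
print([r.title for r in repo.search("Negative Result")])
print(repo.export_json("example-001"))
```

The point of the sketch is only that a deposit needs very little structure to become findable: a handful of keywords and a flag saying it was not peer reviewed are enough for others to search it and decide how much to trust it.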


Written by ben

July 6th, 2010 at 11:59 pm

  • Robert

    “Data publication” imho is somehow in between…

    A data publication is archived and published but not necessarily used in peer-reviewed papers. If it is used in such a paper, however, it is citable.

    For example, Pirrung, M (2008): Geochemical composition of sediments sampled from Arctic sea ice. doi:10.1594/PANGAEA.686259
    is such a data publication, and it is cited and used in a peer-reviewed paper:

    Biogenic barium in surface sediments of the European Nordic Seas

    For earth sciences, such a data archive is PANGAEA.

  • Jean-Claude Bradley

    For situations where there is not a well-developed central database, one can still convert a dataset to formats that are acceptable in more general repositories. For example, Nature Precedings won't accept a Zip or XLS file as a publication but it will accept a PDF version. If you can convert your data into a PDF, you can then link from NP to other locations where the data are in a database. The advantage of this is that NP documents are indexed by Google Scholar, which increases the probability of the data being found. For examples of this see http://onsbooks.wikispaces.com/

  • Ben (http://www.benjamintseng.com/)

    Very cool, thanks for the pointer.

    I've oftentimes wondered why “repositories” like this are so restrictive about file formats. I don't think they do much curation, so shouldn't they encourage more people to use them (especially people who've never heard of RDF and don't know their way around PDF, but just want to upload their XLS) by being more permissive?